Senior principal scientist Jasha Droppo on the shared architectures of large language models and spectrum quantization text-to-speech models, and other convergences between the two fields.
Announcing Amazon S3 access point support for Amazon SageMaker Data Wrangler
We’re excited to announce Amazon SageMaker Data Wrangler support for Amazon S3 Access Points. With its visual point-and-click interface, SageMaker Data Wrangler simplifies the process of data preparation and feature engineering, including data selection, cleansing, exploration, and visualization, while S3 Access Points simplifies data access by providing unique hostnames with specific access policies.
Starting today, SageMaker Data Wrangler is making it easier for users to prepare data from shared datasets stored in Amazon Simple Storage Service (Amazon S3) while enabling organizations to securely control data access in their organization. With S3 Access Points, data administrators can now create application- and team-specific access points to facilitate data sharing, rather than managing complex bucket policies with many different permission rules.
In this post, we walk you through importing data from, and exporting data to, an S3 access point in SageMaker Data Wrangler.
Solution Overview
Imagine you, as an administrator, have to manage data for multiple data science teams running their own data preparation workflows in SageMaker Data Wrangler. Administrators often face three challenges:
- Data science teams need to access their datasets without compromising the security of others
- Data science teams need access to some datasets with sensitive data, which further complicates managing permissions
- Security policy only permits data access through specific endpoints to prevent unauthorized access and to reduce the exposure of data
With traditional bucket policies, you would struggle setting up granular access because bucket policies apply the same permissions to all objects within the bucket. Traditional bucket policies also can’t support securing access at the endpoint level.
S3 Access Points solves these problems by providing fine-grained access control, making it easier to manage permissions for different teams without impacting other parts of the bucket. Instead of modifying a single bucket policy, you can create multiple access points with individual policies tailored to specific use cases, reducing the risk of misconfiguration or unintended access to sensitive data. Lastly, you can enforce endpoint policies on access points to define rules that control which VPCs or IP addresses can access the data through a specific access point.
We demonstrate how to use S3 Access Points with SageMaker Data Wrangler with the following steps:
- Upload data to an S3 bucket.
- Create an S3 access point.
- Configure your AWS Identity and Access Management (IAM) role with the necessary policies.
- Create a SageMaker Data Wrangler flow.
- Export data from SageMaker Data Wrangler to the access point.
For this post, we use the Bank Marketing dataset for our sample data. However, you can use any other dataset you prefer.
Prerequisites
For this walkthrough, you should have the following prerequisites:
- An AWS account.
- An Amazon SageMaker Studio domain and user. For details on setting these up, refer to Onboard to Amazon SageMaker Domain Using Quick setup.
- An S3 bucket.
Upload data to an S3 bucket
Upload your data to an S3 bucket. For instructions, refer to Uploading objects. For this post, we use the Bank Marketing dataset.
Create an S3 access point
To create an S3 access point, complete the following steps. For more information, refer to Creating access points.
- On the Amazon S3 console, choose Access Points in the navigation pane.
- Choose Create access point.
- For Access point name, enter a name for your access point.
- For Bucket, select Choose a bucket in this account.
- For Bucket name, enter the name of the bucket you created.
- Leave the remaining settings as default and choose Create access point.
On the access point details page, note the Amazon Resource Name (ARN) and access point alias. You use these later when you interact with the access point in SageMaker Data Wrangler.
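The console steps above can also be scripted. The following is a minimal sketch, assuming the Boto3 `s3control` client and its `create_access_point` call; the helper name and the example values in the comments are hypothetical, and the client is passed in so the helper can be exercised without AWS access.

```python
def create_access_point(s3control_client, account_id, name, bucket):
    """Create an S3 access point and return its ARN and alias.

    `s3control_client` is expected to be `boto3.client("s3control")`.
    """
    response = s3control_client.create_access_point(
        AccountId=account_id,  # 12-digit ID of the account that owns the bucket
        Name=name,             # access point name
        Bucket=bucket,         # bucket the access point is attached to
    )
    # Both values are needed later when referencing the access point.
    return response["AccessPointArn"], response["Alias"]


# Example usage (hypothetical values, requires AWS credentials):
#   import boto3
#   arn, alias = create_access_point(
#       boto3.client("s3control"), "111122223333", "dw-team-a", "my-dw-bucket"
#   )
```

Leaving out the optional VPC and public access block settings corresponds to keeping the console defaults.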
Configure your IAM role
If you have a SageMaker Studio domain up and ready, complete the following steps to edit the execution role:
- On the SageMaker console, choose Domains in the navigation pane.
- Choose your domain.
- On the Domain settings tab, choose Edit.
By default, the IAM role that you use to access Data Wrangler is SageMakerExecutionRole. We need to add the following two policies to use S3 access points:
- Policy 1 – This IAM policy grants SageMaker Data Wrangler access to perform PutObject, GetObject, and DeleteObject.
- Policy 2 – This IAM policy grants SageMaker Data Wrangler access to get the S3 access point.
- Create these two policies and attach them to the role.
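As a rough illustration of the two policies, the following sketch builds them as Python dictionaries that can be serialized to JSON. The three object actions come from the description above; the use of `s3:GetAccessPoint` for Policy 2, the resource scoping, and the function name are assumptions, so adjust them to your own security requirements.

```python
import json


def build_access_point_policies(region, account_id, access_point_name):
    """Return (policy_1, policy_2) as IAM policy documents (plain dicts)."""
    access_point_arn = (
        f"arn:aws:s3:{region}:{account_id}:accesspoint/{access_point_name}"
    )
    # Policy 1: object-level access through the access point.
    object_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
                "Resource": f"{access_point_arn}/object/*",
            }
        ],
    }
    # Policy 2 (assumed action): permission to look up the access point itself.
    access_point_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetAccessPoint"],
                "Resource": access_point_arn,
            }
        ],
    }
    return object_policy, access_point_policy


policy_1, policy_2 = build_access_point_policies(
    "us-east-1", "111122223333", "dw-team-a"  # hypothetical values
)
print(json.dumps(policy_1, indent=2))
```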
Using S3 Access Points in SageMaker Data Wrangler
To create a new SageMaker Data Wrangler flow, complete the following steps:
- Launch SageMaker Studio.
- On the File menu, choose New and Data Wrangler Flow.
- Choose Amazon S3 as the data source.
- For S3 source, enter the S3 access point using the ARN or alias that you noted down earlier.
For this post, we use the ARN to import data using the S3 access point. However, the ARN only works for S3 access points and SageMaker Studio domains within the same Region.
Alternatively, you can use the alias, as shown in the following screenshot. Unlike ARNs, aliases can be referenced across Regions.
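The choice between ARN and alias can be expressed as a small helper. This is a sketch of the rule stated above (use the ARN when the access point and the Studio domain share a Region, otherwise fall back to the alias); the function names are hypothetical.

```python
def access_point_region(access_point_arn):
    """Extract the Region from an ARN such as
    arn:aws:s3:us-east-1:111122223333:accesspoint/my-access-point."""
    parts = access_point_arn.split(":", 5)
    is_access_point_arn = (
        len(parts) == 6
        and parts[0] == "arn"
        and parts[2] == "s3"
        and parts[5].startswith("accesspoint/")
    )
    if not is_access_point_arn:
        raise ValueError(f"Not an S3 access point ARN: {access_point_arn}")
    return parts[3]


def choose_s3_source(access_point_arn, alias, studio_region):
    """Return the ARN for same-Region access, otherwise the alias."""
    if access_point_region(access_point_arn) == studio_region:
        return access_point_arn
    return alias
```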
Export data from SageMaker Data Wrangler to S3 access points
After we complete the necessary transformations, we can export the results to the S3 access point. In our case, we simply dropped a column. When you complete whatever transformations you need for your use case, complete the following steps:
- In the data flow, choose the plus sign.
- Choose Add destination and Amazon S3.
- Enter the dataset name and the S3 location, referencing the ARN.
Now you have used S3 access points to import and export data securely and efficiently without having to manage complex bucket policies and navigate multiple folder structures.
Clean up
If you created a new SageMaker domain to follow along, be sure to stop any running apps and delete your domain to stop incurring charges. Also, delete any S3 access points and delete any S3 buckets.
Conclusion
In this post, we introduced the availability of S3 Access Points for SageMaker Data Wrangler and showed you how you can use this feature to simplify data control within SageMaker Studio. We accessed the dataset from, and saved the resulting transformations to, an S3 access point alias across AWS accounts. We hope that you take advantage of this feature to remove any bottlenecks with data access for your SageMaker Studio users, and encourage you to give it a try!
About the authors
Peter Chung is a Solutions Architect serving enterprise customers at AWS. He loves to help customers use technology to solve business problems on various topics like cutting costs and leveraging artificial intelligence. He wrote a book on AWS FinOps, and enjoys reading and building solutions.
Neelam Koshiya is an Enterprise Solution Architect at AWS. Her current focus is to help enterprise customers with their cloud adoption journey for strategic business outcomes. In her spare time, she enjoys reading and being outdoors.
Automatically generating labeled training images
Inverting generative adversarial networks to learn label assignments enables a high-quality labeled-image generator that’s trained on 50 images or fewer.
Machine learning with decentralized training data using federated learning on Amazon SageMaker
Machine learning (ML) is revolutionizing solutions across industries and driving new forms of insights and intelligence from data. Many ML algorithms train over large datasets, generalizing patterns they find in the data and inferring results from those patterns as new unseen records are processed. Usually, if the dataset or model is too large to be trained on a single instance, distributed training uses multiple instances within a cluster and distributes either data or model partitions across those instances during the training process. Native support for distributed training is offered through the Amazon SageMaker SDK, along with example notebooks in popular frameworks.
However, sometimes due to security and privacy regulations within or across organizations, the data is decentralized across multiple accounts or in different Regions and it can’t be centralized into one account or across Regions. In this case, federated learning (FL) should be considered to get a generalized model on the whole data.
In this post, we discuss how to implement federated learning on Amazon SageMaker to run ML with decentralized training data.
What is federated learning?
Federated learning is an ML approach that allows multiple separate training sessions to run in parallel across large boundaries, for example geographic ones, and aggregates their results to build a generalized model (global model) in the process. More specifically, each training session uses its own dataset and gets its own local model. Local models in different training sessions are aggregated (for example, through model weight aggregation) into a global model during the training process. This approach stands in contrast to centralized ML techniques where datasets are merged for one training session.
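To make the weight aggregation step concrete, here is a minimal NumPy sketch of federated averaging, in which each local model contributes in proportion to the number of examples it was trained on. The function name and the toy numbers are illustrative only.

```python
import numpy as np


def federated_average(local_weights, num_examples):
    """Aggregate local models into a global model.

    local_weights: one entry per client, each a list of NumPy arrays (one per layer).
    num_examples:  number of training examples each client used.
    """
    total = sum(num_examples)
    num_layers = len(local_weights[0])
    return [
        sum(weights[layer] * (n / total) for weights, n in zip(local_weights, num_examples))
        for layer in range(num_layers)
    ]


# Two clients, each holding a coefficient vector and an intercept.
client_a = [np.array([1.0, 1.0]), np.array([0.0])]
client_b = [np.array([3.0, 5.0]), np.array([4.0])]
global_model = federated_average([client_a, client_b], num_examples=[100, 300])
# client_b has three times as many examples, so it carries weight 0.75.
```

Only the weight arrays and example counts cross the boundary between sessions; the training data itself is never part of the aggregation.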
Federated learning vs. distributed training on the cloud
When these two approaches are running on the cloud, distributed training happens in one Region on one account, and training data starts with a centralized training session or job. During the distributed training process, the dataset gets split into smaller subsets and, depending on the strategy (data parallelism or model parallelism), subsets are sent to different training nodes or go through nodes in a training cluster, which means individual data doesn’t necessarily stay in one node of the cluster.
In contrast, with federated learning, training usually occurs in multiple separate accounts or across Regions. Each account or Region has its own training instances. The training data is decentralized across accounts or Regions from the beginning to the end, and individual data is only read by its respective training session or job between different accounts or Regions during the federated learning process.
Flower federated learning framework
Several open-source frameworks are available for federated learning, such as FATE, Flower, PySyft, OpenFL, FedML, NVFlare, and TensorFlow Federated. When choosing an FL framework, we usually consider its support for model category, ML framework, and device or operating system. We also need to consider the FL framework’s extensibility and package size so as to run it on the cloud efficiently. In this post, we choose an easily extensible, customizable, and lightweight framework, Flower, to do the FL implementation using SageMaker.
Flower is a comprehensive FL framework that distinguishes itself from existing frameworks by offering new facilities to run large-scale FL experiments, and enables richly heterogeneous FL device scenarios. FL solves challenges related to data privacy and scalability in scenarios where sharing data is not possible.
Design principles and implementation of Flower FL
Flower FL is language-agnostic and ML framework-agnostic by design, is fully extensible, and can incorporate emerging algorithms, training strategies, and communication protocols. Flower is open-sourced under the Apache 2.0 License.
The conceptual architecture of the FL implementation is described in the paper Flower: A friendly Federated Learning Framework and is highlighted in the following figure.
In this architecture, edge clients live on real edge devices and communicate with the server over RPC. Virtual clients, on the other hand, consume close to zero resources when inactive and only load model and data into memory when the client is being selected for training or evaluation.
The Flower server builds the strategy and configurations to be sent to the Flower clients. It serializes these configuration dictionaries (or config dict for short) to their ProtoBuf representation, transports them to the client using gRPC, and then deserializes them back to Python dictionaries.
Flower FL strategies
Flower allows customization of the learning process through the strategy abstraction. The strategy defines the entire federation process specifying parameter initialization (whether it’s server or client initialized), the minimum number of clients available required to initialize a run, the weight of the client’s contributions, and training and evaluation details.
Flower has an extensive implementation of FL averaging algorithms and a robust communication stack. For a list of averaging algorithms implemented and associated research papers, refer to the following table, from Flower: A friendly Federated Learning Framework.
Federated learning with SageMaker: Solution architecture
A federated learning architecture using SageMaker with the Flower framework is implemented on top of bi-directional gRPC (foundation) streams. gRPC defines the types of messages exchanged and uses compilers to then generate efficient implementation for Python, but it can also generate the implementation for other languages, such as Java or C++.
The Flower clients receive instructions (messages) as raw byte arrays via the network. Then the clients deserialize and run the instruction (training on local data). The results (model parameters and weights) are then serialized and communicated back to the server.
The server/client architecture for Flower FL is defined in SageMaker using notebook instances in different accounts in the same Region as the Flower server and Flower client. The training and evaluation strategies are defined on the server as well as the global parameters, then the configuration is serialized and sent to the client over VPC peering.
The notebook instance client starts a SageMaker training job that runs a custom script to trigger the instantiation of the Flower client, which deserializes and reads the server configuration, triggers the training job, and sends the parameters response.
The last step occurs on the server when the evaluation of the newly aggregated parameters is triggered upon completion of the number of runs and clients stipulated on the server strategy. The evaluation takes place on a testing dataset existing only on the server, and the new improved accuracy metrics are produced.
The following diagram illustrates the architecture of the FL setup on SageMaker with the Flower package.
Implement federated learning using SageMaker
SageMaker is a fully managed ML service. With SageMaker, data scientists and developers can quickly build and train ML models, and then deploy them into a production-ready hosted environment.
In this post, we demonstrate how to use the managed ML platform to provide a notebook experience environment and perform federated learning across AWS accounts, using SageMaker training jobs. The raw training data never leaves the account that owns the data and only the derived weights are sent across the peered connection.
We highlight the following core components in this post:
- Networking – SageMaker allows for quick setup of default networking configuration while also allowing you to fully customize the networking depending on your organization’s requirements. We use a VPC peering configuration within the Region in this example.
- Cross-account access settings – In order to allow a user in the server account to start a model training job in the client account, we delegate access across accounts using AWS Identity and Access Management (IAM) roles. This way, a user in the server account doesn’t have to sign out of the account and sign in to the client account to perform actions on SageMaker. This setting is only for purposes of starting SageMaker training jobs, and it doesn’t have any cross-account data access permission or sharing.
- Implementing federated learning client code in the client account and server code in the server account – We implement federated learning client code in the client account by using the Flower package and SageMaker managed training. Meanwhile, we implement server code in the server account by using the Flower package.
Set up VPC peering
A VPC peering connection is a networking connection between two VPCs that enables you to route traffic between them using private IPv4 addresses or IPv6 addresses. Instances in either VPC can communicate with each other as if they are within the same network.
To set up a VPC peering connection, first create a request to peer with another VPC. You can request a VPC peering connection with another VPC in the same account, or in our use case, connect with a VPC in a different AWS account. To activate the request, the owner of the VPC must accept the request. For more details about VPC peering, refer to Create a VPC peering connection.
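The request and accept steps can be sketched with the Boto3 EC2 client as follows. The helper names are hypothetical, and the clients are passed in because the request is made with credentials for the requester account and the acceptance with credentials for the accepter account.

```python
def request_peering(ec2_client, vpc_id, peer_vpc_id, peer_account_id):
    """Request a peering connection from the requester VPC to the peer VPC."""
    response = ec2_client.create_vpc_peering_connection(
        VpcId=vpc_id,                 # VPC in the requester account
        PeerVpcId=peer_vpc_id,        # VPC in the other account
        PeerOwnerId=peer_account_id,  # 12-digit ID of the other account
    )
    return response["VpcPeeringConnection"]["VpcPeeringConnectionId"]


def accept_peering(ec2_client, peering_connection_id):
    """Accept a pending request; run this with the accepter account's client."""
    response = ec2_client.accept_vpc_peering_connection(
        VpcPeeringConnectionId=peering_connection_id
    )
    return response["VpcPeeringConnection"]["Status"]["Code"]
```

After the connection is accepted, the route tables and security groups in both VPCs still need entries that allow traffic between the peered CIDR ranges.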
Launch SageMaker notebook instances in VPCs
A SageMaker notebook instance provides a Jupyter notebook app through a fully managed ML Amazon Elastic Compute Cloud (Amazon EC2) instance. SageMaker Jupyter notebooks are used to perform advanced data exploration, create training jobs, deploy models to SageMaker hosting, and test or validate your models.
The notebook instance has a variety of networking configurations available to it. In this setup, the notebook instance runs within a private subnet of the VPC and doesn’t have direct internet access.
Configure cross-account access settings
Cross-account access settings include two steps to delegate access from the server account to client account by using IAM roles:
- Create an IAM role in the client account.
- Grant access to the role in the server account.
For detailed steps to set up a similar scenario, refer to Delegate access across AWS accounts using IAM roles.
In the client account, we create an IAM role called FL-kickoff-client-job with the policy FL-sagemaker-actions attached to the role. The FL-sagemaker-actions policy has JSON content as follows:
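A minimal sketch of what such a policy could contain, written as a Python dictionary. The exact set of actions is an assumption (the training job actions needed to start and monitor a job, plus permission to pass the SageMaker execution role); the function name and role ARN are hypothetical.

```python
def build_sagemaker_actions_policy(sagemaker_execution_role_arn):
    """Policy document for FL-sagemaker-actions (assumed contents)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Start and monitor training jobs in the client account.
                "Effect": "Allow",
                "Action": [
                    "sagemaker:CreateTrainingJob",
                    "sagemaker:DescribeTrainingJob",
                    "sagemaker:StopTrainingJob",
                ],
                "Resource": "*",
            },
            {
                # Allow handing the execution role to the training job.
                "Effect": "Allow",
                "Action": ["iam:PassRole"],
                "Resource": sagemaker_execution_role_arn,
            },
        ],
    }
```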
We then modify the trust policy in the trust relationships of the FL-kickoff-client-job role:
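A trust policy for this kind of delegation typically names the server account as the principal that may assume the role. The following sketch assumes the whole server account is trusted (its `root` principal); the function name is hypothetical, and you can narrow the principal to a specific user or role.

```python
def build_trust_policy(server_account_id):
    """Trust policy letting principals in the server account assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumption: trust the entire server account.
                "Principal": {"AWS": f"arn:aws:iam::{server_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
```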
In the server account, permissions are added to an existing user (for example, developer) to allow switching to the FL-kickoff-client-job role in the client account. To do this, we create an inline policy called FL-allow-kickoff-client-job and attach it to the user. The following is the policy JSON content:
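An inline policy of this kind only needs to allow `sts:AssumeRole` on the role in the client account. The following is a sketch under that assumption; the function name is hypothetical.

```python
def build_allow_kickoff_policy(client_account_id, role_name="FL-kickoff-client-job"):
    """Inline policy allowing the server-account user to assume the client role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Resource": f"arn:aws:iam::{client_account_id}:role/{role_name}",
            }
        ],
    }
```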
Sample dataset and data preparation
In this post, we use a curated dataset for fraud detection in Medicare providers’ data released by the Centers for Medicare & Medicaid Services (CMS). Data is split into a training dataset and a testing dataset. Because the majority of the data is non-fraud, we apply SMOTE to balance the training dataset, and further split the training dataset into training and validation parts. Both the training and validation data are uploaded to an Amazon Simple Storage Service (Amazon S3) bucket for model training in the client account, and the testing dataset is used in the server account for testing purposes only. Details of the data preparation code are in the following notebook.
With the SageMaker pre-built Docker images for the scikit-learn framework and SageMaker managed training process, we train a logistic regression model on this dataset using federated learning.
Implement a federated learning client in the client account
In the client account’s SageMaker notebook instance, we prepare a client.py script and a utils.py script. The client.py file contains code for the client, and the utils.py file contains code for some of the utility functions that will be needed for our training. We use the scikit-learn package to build the logistic regression model.
In client.py, we define a Flower client. The client is derived from the class fl.client.NumPyClient. It needs to define the following three methods:
- get_parameters – It returns the current local model parameters. The utility function get_model_parameters will do this.
- fit – It defines the steps to train the model on the training data in client’s account. It also receives global model parameters and other configuration information from the server. We update the local model’s parameters using the received global parameters and continue training it on the dataset in the client account. This method also sends the local model’s parameters after training, the size of the training set, and a dictionary communicating arbitrary values back to the server.
- evaluate – It evaluates the provided parameters using the validation data in the client account. It returns the loss together with other details such as the size of the validation set and accuracy back to the server.
The following is a code snippet for the Flower client definition:
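A minimal sketch of such a client, assuming the Flower 1.x `NumPyClient` method signatures and a scikit-learn `LogisticRegression` created with `warm_start=True` so that training continues from the received global parameters. The class name, constructor arguments, and metric keys are hypothetical.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

try:
    import flwr as fl
    _BaseClient = fl.client.NumPyClient
except ImportError:
    # Fallback so the sketch can be read and exercised without Flower installed.
    _BaseClient = object


class FraudDetectionClient(_BaseClient):
    def __init__(self, model, X_train, y_train, X_val, y_val):
        self.model = model
        self.X_train, self.y_train = X_train, y_train
        self.X_val, self.y_val = X_val, y_val

    def get_parameters(self, config):
        # Current local model parameters as a list of NumPy arrays.
        return [self.model.coef_, self.model.intercept_]

    def _set_parameters(self, parameters):
        self.model.coef_ = parameters[0]
        self.model.intercept_ = parameters[1]

    def fit(self, parameters, config):
        # Start from the global parameters, then train on local data.
        self._set_parameters(parameters)
        self.model.fit(self.X_train, self.y_train)
        metrics = {"server_round": config.get("server_round", 0)}
        return self.get_parameters(config), len(self.X_train), metrics

    def evaluate(self, parameters, config):
        # Assumes the model already has classes_ set (for example, after fit).
        self._set_parameters(parameters)
        probabilities = self.model.predict_proba(self.X_val)
        loss = log_loss(self.y_val, probabilities, labels=self.model.classes_)
        accuracy = self.model.score(self.X_val, self.y_val)
        return float(loss), len(self.X_val), {"accuracy": float(accuracy)}
```

With Flower installed, a client like this would be handed to `fl.client.start_numpy_client(server_address=..., client=...)`, where the server address is the private IP and port of the server notebook instance reachable over the peered VPCs.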
We then use SageMaker script mode to prepare the rest of the client.py file. This includes defining parameters that will be passed to SageMaker training, loading training and validation data, initializing and training the model on the client, setting up the Flower client to communicate with the server, and finally saving the trained model.
utils.py includes a few utility functions that are called in client.py:
- get_model_parameters – It returns the scikit-learn LogisticRegression model parameters.
- set_model_params – It sets the model’s parameters.
- set_initial_params – It initializes the parameters of the model as zeros. This is required because the server asks for initial model parameters from the client at launch. However, in the scikit-learn framework, LogisticRegression model parameters are not initialized until model.fit() is called.
- load_data – It loads the training and testing data.
- save_model – It saves the model as a .joblib file.
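The parameter-handling utilities listed above could look roughly like the following. This is a sketch, not the post's actual utils.py: the signatures are assumptions, and `load_data` and `save_model` are omitted because they depend on the dataset layout.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def get_model_parameters(model):
    """Return the LogisticRegression parameters as a list of NumPy arrays."""
    if model.fit_intercept:
        return [model.coef_, model.intercept_]
    return [model.coef_]


def set_model_params(model, params):
    """Overwrite the model's parameters with the given arrays."""
    model.coef_ = params[0]
    if model.fit_intercept:
        model.intercept_ = params[1]
    return model


def set_initial_params(model, n_features, n_classes=2):
    """Give an unfitted model zero-valued parameters.

    scikit-learn only creates coef_ and intercept_ inside fit(), so they are
    set by hand to let the server request parameters before any training.
    """
    model.classes_ = np.arange(n_classes)
    rows = 1 if n_classes == 2 else n_classes  # binary models use a single row
    model.coef_ = np.zeros((rows, n_features))
    if model.fit_intercept:
        model.intercept_ = np.zeros(rows)
    return model
```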
Because Flower is not a package installed in the SageMaker pre-built scikit-learn Docker container, we list flwr==1.3.0 in a requirements.txt file.
We put all three files (client.py, utils.py, and requirements.txt) under a folder and compress it into a .tar.gz archive. The archive (named source.tar.gz in this post) is then uploaded to an S3 bucket in the client account.
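The packaging step can be done from the notebook with the Python standard library. The following sketch assumes the three files sit directly in one folder; the function name and folder path are hypothetical.

```python
import tarfile
from pathlib import Path


def package_source(source_dir, archive_path="source.tar.gz"):
    """Bundle client.py, utils.py, and requirements.txt into a gzip-compressed tar."""
    source_dir = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as archive:
        for name in ("client.py", "utils.py", "requirements.txt"):
            # arcname keeps the files at the top level of the archive.
            archive.add(source_dir / name, arcname=name)
    return archive_path
```

The resulting archive can then be uploaded with, for example, the Boto3 S3 client's `upload_file(Filename, Bucket, Key)` method.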
Implement a federated learning server in the server account
In the server account, we prepare code on a Jupyter notebook. This includes two parts: the server first assumes a role to start a training job in the client account, then the server federates the model using Flower.
Assume a role to run the training job in the client account
We use the Boto3 Python SDK to set up an AWS Security Token Service (AWS STS) client to assume the FL-kickoff-client-job role and set up a SageMaker client so as to run a training job in the client account by using the SageMaker managed training process:
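A minimal sketch of the role assumption, using the STS `assume_role` call. The helper name, session name, and example account ID are hypothetical; the STS client is passed in so the helper can be exercised without AWS access.

```python
def assume_client_role(sts_client, role_arn, session_name="FL-kickoff-session"):
    """Assume the cross-account role and return keyword arguments that can be
    passed to boto3.client(...) to act in the client account."""
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
    )
    credentials = response["Credentials"]  # temporary credentials
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        "aws_session_token": credentials["SessionToken"],
    }


# Example usage (requires AWS credentials for the server account):
#   import boto3
#   role_arn = "arn:aws:iam::111122223333:role/FL-kickoff-client-job"
#   kwargs = assume_client_role(boto3.client("sts"), role_arn)
#   sagemaker_client = boto3.client("sagemaker", **kwargs)
```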
Using the assumed role, we create a SageMaker training job in the client account. The training job uses the SageMaker built-in scikit-learn framework. Note that all S3 buckets and the SageMaker IAM role in the following code snippet are related to the client account:
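One way to express such a job is to build the request for the SageMaker `create_training_job` API directly. The sketch below makes several assumptions: the instance type, volume size, runtime limit, channel name, and extra hyperparameters are hypothetical, and the script-mode entries `sagemaker_program` and `sagemaker_submit_directory` point the container at client.py inside the uploaded source.tar.gz.

```python
def build_training_job_request(job_name, image_uri, role_arn, source_s3_uri,
                               train_s3_uri, output_s3_uri, subnets,
                               security_group_ids, hyperparameters=None):
    """Build keyword arguments for sagemaker_client.create_training_job."""
    params = {
        "sagemaker_program": "client.py",             # script-mode entry point
        "sagemaker_submit_directory": source_s3_uri,  # s3://.../source.tar.gz
    }
    params.update(hyperparameters or {})
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,  # pre-built scikit-learn image URI
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,  # SageMaker execution role in the client account
        # The API requires every hyperparameter value to be a string.
        "HyperParameters": {key: str(value) for key, value in params.items()},
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.large",  # hypothetical choice
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        # Run inside the peered VPC so the client can reach the Flower server.
        "VpcConfig": {"Subnets": subnets, "SecurityGroupIds": security_group_ids},
    }
```

The request would then be submitted with `sagemaker_client.create_training_job(**request)`, using a SageMaker client created from the assumed role's credentials.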
Aggregate local models into a global model using Flower
We prepare code to federate the model on the server. This includes defining the strategy for federation and its initialization parameters. We use utility functions in the utils.py script described earlier to initialize and set model parameters. Flower allows you to define your own callback functions to customize an existing strategy. We use the FedAvg strategy with custom callbacks for evaluation and fit configuration. See the following code:
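A minimal sketch of the server side, assuming the Flower 1.x API (`FedAvg` with `evaluate_fn` and `on_fit_config_fn`, and `start_server` with a `ServerConfig`). The server address, round count, and single-client thresholds are hypothetical, and the model parameters are set inline here rather than through utils.py to keep the block self-contained.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss


def fit_round(server_round):
    """Send the current round number to the client in its fit configuration."""
    return {"server_round": server_round}


def get_evaluate_fn(model, X_test, y_test):
    """Return a callback that evaluates aggregated parameters on server-side data."""
    def evaluate(server_round, parameters, config):
        model.coef_ = parameters[0]
        model.intercept_ = parameters[1]
        probabilities = model.predict_proba(X_test)
        loss = log_loss(y_test, probabilities, labels=model.classes_)
        accuracy = model.score(X_test, y_test)
        return float(loss), {"accuracy": float(accuracy)}
    return evaluate


def start_flower_server(model, X_test, y_test, address="0.0.0.0:8080", num_rounds=3):
    """Start the Flower server; blocks until all rounds have completed."""
    import flwr as fl  # imported here so the callbacks can be used without Flower

    strategy = fl.server.strategy.FedAvg(
        min_fit_clients=1,        # assumption: a single client account
        min_evaluate_clients=1,
        min_available_clients=1,
        evaluate_fn=get_evaluate_fn(model, X_test, y_test),
        on_fit_config_fn=fit_round,
    )
    fl.server.start_server(
        server_address=address,
        strategy=strategy,
        config=fl.server.ServerConfig(num_rounds=num_rounds),
    )
```

The model passed to `get_evaluate_fn` needs `classes_` to be set before the first evaluation, which is what the zero initialization in utils.py provides.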
The following two functions are mentioned in the preceding code snippet:
- fit_round – It’s used to send the round number to the client. We pass this callback as the on_fit_config_fn parameter of the strategy. We do this simply to demonstrate the use of the on_fit_config_fn parameter.
- get_evaluate_fn – It’s used for model evaluation on the server.
For demo purposes, we use the testing dataset that we set aside in data preparation to evaluate the model federated from the client’s account and communicate the result back to the client. However, it’s worth noting that in almost all real use cases, the data used in the server account is not split from the dataset used in the client account.
After the federated learning process is finished, a model.tar.gz file is saved by SageMaker as a model artifact in an S3 bucket in the client account. Meanwhile, a model.joblib file is saved on the SageMaker notebook instance in the server account. Lastly, we use the testing dataset to test the final model (model.joblib) on the server. Testing output of the final model is as follows:
Clean up
After you are done, clean up the resources in both the server account and client account to avoid additional charges:
- Stop the SageMaker notebook instances.
- Delete VPC peering connections and corresponding VPCs.
- Empty and delete the S3 bucket you created for data storage.
Conclusion
In this post, we walked through how to implement federated learning on SageMaker by using the Flower package. We showed how to configure VPC peering, set up cross-account access, and implement the FL client and server. This post is useful for those who need to train ML models on SageMaker using decentralized data across accounts with restricted data sharing. Because the FL in this post is implemented using SageMaker, it’s worth noting that a lot more features in SageMaker can be brought into the process.
Implementing federated learning on SageMaker can take advantage of all the advanced features that SageMaker provides through the ML lifecycle. There are other ways to achieve or apply federated learning on the AWS Cloud, such as using EC2 instances or on the edge. For details about these alternative approaches, refer to Federated Learning on AWS with FedML and Applying Federated Learning for ML at the Edge.
About the authors
Sherry Ding is a senior AI/ML specialist solutions architect at Amazon Web Services (AWS). She has extensive experience in machine learning with a PhD degree in computer science. She mainly works with public sector customers on various AI/ML-related business challenges, helping them accelerate their machine learning journey on the AWS Cloud. When not helping customers, she enjoys outdoor activities.
Lorea Arrizabalaga is a Solutions Architect aligned to the UK Public Sector, where she helps customers design ML solutions with Amazon SageMaker. She is also part of the Technical Field Community dedicated to hardware acceleration and helps with testing and benchmarking AWS Inferentia and AWS Trainium workloads.
Ben Snively is an AWS Public Sector Senior Principal Specialist Solutions Architect. He works with government, non-profit, and education customers on big data, analytical, and AI/ML projects, helping them build solutions using AWS.
Explain medical decisions in clinical settings using Amazon SageMaker Clarify
Explainability of machine learning (ML) models used in the medical domain is becoming increasingly important because models need to be explained from a number of perspectives in order to gain adoption. These perspectives include the medical, the technological, the legal, and, most important, the patient’s. Models developed on text in the medical domain have become accurate statistically, yet clinicians are ethically required to evaluate areas of weakness related to these predictions in order to provide the best care for individual patients. Explainability of these predictions is required in order for clinicians to make the correct choices on a patient-by-patient basis.
In this post, we show how to improve model explainability in clinical settings using Amazon SageMaker Clarify.
Background
One specific application of ML algorithms in the medical domain, which uses large volumes of text, is clinical decision support systems (CDSSs) for triage. On a daily basis, patients are admitted to hospitals and admission notes are taken. After these notes are taken, the triage process is initiated, and ML models can assist clinicians with estimating clinical outcomes. This can help reduce operational overhead costs and provide optimal care for patients. Understanding why these decisions are suggested by the ML models is extremely important for decision-making related to individual patients.
The purpose of this post is to outline how you can deploy predictive models with Amazon SageMaker for the purposes of triage within hospital settings and use SageMaker Clarify to explain these predictions. The intent is to offer an accelerated path to adoption of predictive techniques within CDSSs for many healthcare organizations.
The notebook and code from this post are available on GitHub. To run it yourself, clone the GitHub repository and open the Jupyter notebook file.
Technical background
A large asset for any acute healthcare organization is its clinical notes. At the time of intake within a hospital, admission notes are taken. A number of recent studies have shown the predictability of key indicators such as diagnoses, procedures, length of stay, and in-hospital mortality. Predictions of these are now highly achievable from admission notes alone, through the use of natural language processing (NLP) algorithms [1].
Advances in NLP models, such as Bidirectional Encoder Representations from Transformers (BERT), have allowed for highly accurate predictions on a corpus of text, such as admission notes, that was previously difficult to get value from. Their prediction of the clinical indicators is highly applicable for use in a CDSS.
Yet, in order to use the new predictions effectively, how these accurate BERT models are achieving their predictions still needs to be explained. There are several techniques to explain the predictions of such models. One such technique is SHAP (SHapley Additive exPlanations), which is a model-agnostic technique for explaining the output of ML models.
What is SHAP
SHAP is a technique for explaining the output of ML models. It provides a way to break down the prediction of an ML model and understand how much each input feature contributes to the final prediction.
SHAP values are based on game theory, specifically the concept of Shapley values, which were originally proposed to allocate the payout of a cooperative game among its players [2]. In the context of ML, each feature in the input space is considered a player in a cooperative game, and the prediction of the model is the payout. SHAP values are calculated by examining the contribution of each feature to the model prediction for each possible combination of features. The average contribution of each feature across all possible feature combinations is then calculated, and this becomes the SHAP value for that feature.
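The calculation described above can be written out exactly for a small number of features. The following sketch computes Shapley values by enumerating every feature subset, with each marginal contribution weighted by |S|! (n - |S| - 1)! / n!, so the "average" is a weighted one. The function names and the toy payout function are illustrative; practical SHAP implementations approximate this computation because the number of subsets grows exponentially.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a payout function defined on feature subsets.

    value_fn takes a frozenset of feature indices and returns the model
    output (the payout) when only those features are present.
    """
    values = []
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        phi = 0.0
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley formula.
            weight = (
                factorial(size) * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of feature i when added to the coalition.
                phi += weight * (value_fn(coalition | {i}) - value_fn(coalition))
        values.append(phi)
    return values


# Toy payout: the prediction only rises when features 0 and 1 are both present.
def interaction_payout(features):
    return 1.0 if {0, 1} <= features else 0.0


print(shapley_values(interaction_payout, 2))
```

In the toy example, the two features split the payout equally because neither contributes anything without the other.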
SHAP allows predictions to be explained without requiring an understanding of the model’s inner workings. In addition, there are techniques to display these SHAP explanations in text, so that the medical and patient perspectives can all have intuitive visibility into how algorithms come to their predictions.
With new additions to SageMaker Clarify, and the use of pre-trained models from Hugging Face that are easily implemented in SageMaker, model training and explainability can all be easily done in AWS.
For the purpose of an end-to-end example, we take the clinical outcome of in-hospital mortality and show how this process can be implemented easily in AWS using a pre-trained Hugging Face BERT model, and the predictions will be explained using SageMaker Clarify.
Choices of Hugging Face model
Hugging Face offers a variety of pre-trained BERT models that have been specialized for use on clinical notes. For this post, we use the bigbird-base-mimic-mortality model. This model is a fine-tuned version of Google’s BigBird model, specifically adapted for predicting mortality using MIMIC ICU admission notes. The model’s task is to determine the likelihood of a patient not surviving a particular ICU stay based on the admission notes. One of the significant advantages of using this BigBird model is its capability to process larger context lengths, which means we can input the complete admission notes without the need for truncation.
Our steps involve deploying this fine-tuned model on SageMaker. We then incorporate this model into a setup that allows for real-time explanation of its predictions. To achieve this level of explainability, we use SageMaker Clarify.
Solution overview
SageMaker Clarify provides ML developers with purpose-built tools to gain greater insights into their ML training data and models. SageMaker Clarify explains both global and local predictions and explains decisions made by computer vision (CV) and NLP models.
The following diagram shows the SageMaker architecture for hosting an endpoint that serves explainability requests. It includes interactions between an endpoint, the model container, and the SageMaker Clarify explainer.
In the sample code, we use a Jupyter notebook to showcase the functionality. However, in a real-world use case, electronic health records (EHRs) or other hospital care applications would directly invoke the SageMaker endpoint to get the same response. In the Jupyter notebook, we deploy a Hugging Face model container to a SageMaker endpoint. Then we use SageMaker Clarify to explain the results that we obtain from the deployed model.
Prerequisites
You need the following prerequisites:
Access the code from the GitHub repository and upload it to your notebook instance. You can also run the notebook in an Amazon SageMaker Studio environment, which is an integrated development environment (IDE) for ML development. We recommend using a Python 3 (Data Science) kernel on SageMaker Studio or a conda_python3 kernel on a SageMaker notebook instance.
Deploy the model with SageMaker Clarify enabled
As the first step, download the model from Hugging Face and upload it to an Amazon Simple Storage Service (Amazon S3) bucket. Then create a model object using the HuggingFaceModel class. This uses a prebuilt container to simplify the process of deploying Hugging Face models to SageMaker. You also use a custom inference script to do the predictions within the container. The following code illustrates the script that is passed as an argument to the HuggingFaceModel class:
Then you can define the instance type that you deploy this model on:
We then populate the ExecutionRoleArn, ModelName, and PrimaryContainer fields to create a Model.
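As a rough sketch of what populating these fields can look like, the helper below builds the request for the create_model API. The helper name and every value passed to it are placeholders (assumptions); substitute your own role ARN, container image URI, and model artifact location.

```python
def build_create_model_request(model_name, role_arn, image_uri, model_data_url):
    """Assemble the keyword arguments for the SageMaker create_model API."""
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": {
            "Image": image_uri,               # prebuilt Hugging Face inference container
            "ModelDataUrl": model_data_url,   # model archive uploaded to Amazon S3
        },
    }


# All values below are placeholders for illustration only.
create_model_request = build_create_model_request(
    model_name="clinical-mortality-model",
    role_arn="arn:aws:iam::111122223333:role/ExampleSageMakerExecutionRole",
    image_uri="<hugging-face-inference-image-uri>",
    model_data_url="s3://<your-bucket>/<prefix>/model.tar.gz",
)
print(sorted(create_model_request))
```

With a boto3 SageMaker client, the request would be sent as `sagemaker_client.create_model(**create_model_request)`.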
Next, create an endpoint configuration by calling the create_endpoint_config API. Here, you supply the same model_name used in the create_model API call. The create_endpoint_config API now supports the additional parameter ClarifyExplainerConfig to enable the SageMaker Clarify explainer. The SHAP baseline is mandatory; you can provide it either as inline baseline data (the ShapBaseline parameter) or by an S3 baseline file (the ShapBaselineUri parameter). For optional parameters, see the developer guide.
In the following code, we use a special token as the baseline:
The TextConfig is configured with sentence-level granularity (each sentence is a feature, and we need a few sentences per note for good visualization) and the language as English:
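The following sketch shows how these pieces can fit together in a single create_endpoint_config request. The helper name, the variant name, the instance type, and the `<UNK>` baseline token are assumptions made for illustration; check the developer guide for the optional parameters and choose a baseline that suits your model’s tokenizer.

```python
def build_endpoint_config_request(
    config_name, model_name, instance_type, baseline_token="<UNK>"
):
    """Assemble create_endpoint_config arguments with the Clarify explainer enabled."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,  # same name used in create_model
                "InitialInstanceCount": 1,
                "InstanceType": instance_type,
            }
        ],
        "ExplainerConfig": {
            "ClarifyExplainerConfig": {
                "ShapConfig": {
                    # Inline baseline: a special token stands in for removed text.
                    "ShapBaselineConfig": {
                        "MimeType": "text/csv",
                        "ShapBaseline": baseline_token,
                    },
                    # Each sentence is treated as one feature; the notes are English.
                    "TextConfig": {"Granularity": "sentence", "Language": "en"},
                }
            }
        },
    }


# Placeholder names and instance type, for illustration only.
endpoint_config_request = build_endpoint_config_request(
    config_name="clinical-mortality-config",
    model_name="clinical-mortality-model",
    instance_type="ml.g4dn.xlarge",
)
shap_config = endpoint_config_request["ExplainerConfig"]["ClarifyExplainerConfig"]["ShapConfig"]
print(shap_config["TextConfig"])
```

With a boto3 SageMaker client, the request would be sent as `sagemaker_client.create_endpoint_config(**endpoint_config_request)`.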
Finally, after you have the model and endpoint configuration ready, use the create_endpoint API to create your endpoint. The endpoint_name must be unique within a Region in your AWS account. The create_endpoint call returns immediately with the endpoint status in the Creating state; SageMaker then provisions the endpoint in the background.
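Because the endpoint is not usable while it is still in the Creating state, you typically wait for it to reach InService before sending requests. The following is a minimal polling sketch; the function name and the polling defaults are assumptions, and the client is passed in so that the logic can be exercised without AWS access.

```python
import time


def wait_for_endpoint(sagemaker_client, endpoint_name, poll_seconds=30, max_polls=60):
    """Poll describe_endpoint until the endpoint is InService or has failed."""
    for _ in range(max_polls):
        description = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = description.get("FailureReason", "unknown reason")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} was not InService after {max_polls} polls")
```

In a notebook, you would pass `boto3.client("sagemaker")` as the client. boto3 also offers a built-in `endpoint_in_service` waiter that serves the same purpose.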
Explain the prediction
Now that you have deployed the endpoint with online explainability enabled, you can try some examples. You can invoke the real-time endpoint using the invoke_endpoint method by providing the serialized payload, which in this case is some sample admission notes:
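A minimal sketch of such an invocation follows. It assumes the endpoint accepts text/csv input with one note per row and returns a JSON body, which depends on your inference script; the function names are illustrative. The runtime client is passed in as an argument.

```python
import csv
import io
import json


def serialize_notes(notes):
    """Serialize admission notes as CSV, one fully quoted note per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
    for note in notes:
        writer.writerow([note])
    return buffer.getvalue()


def explain_notes(runtime_client, endpoint_name, notes):
    """Invoke the endpoint and parse the JSON body that carries the explanations."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Accept="text/csv",
        Body=serialize_notes(notes),
    )
    return json.loads(response["Body"].read())
```

In a notebook, you would pass `boto3.client("sagemaker-runtime")` as the runtime client. Quoting every field keeps commas and quotation marks inside a note from being read as column separators.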
In the first scenario, let’s assume that the following medical admission note was taken by a healthcare worker:
The following screenshot shows the model results.
After this is forwarded to the SageMaker endpoint, the label was predicted as 0, which indicates that the risk of mortality is low. In other words, 0 implies that the admitted patient is in non-acute condition according to the model. However, we need the reasoning behind that prediction. For that, you can use the SHAP values as the response. The response includes the SHAP values corresponding to the phrases of the input note, which can be further color-coded as green or red based on how the SHAP values contribute to the prediction. In this case, we see more phrases in green, such as “Patient reports no previous history of chest pain” and “EKG shows sinus tachycardia with no ST-elevations or depressions,” as opposed to red, aligning with the mortality prediction of 0.
In the second scenario, let’s assume that the following medical admission note was taken by a healthcare worker:
The following screenshot shows our results.
After this is forwarded to the SageMaker endpoint, the label was predicted as 1, which indicates that the risk of mortality is high. This implies that the admitted patient is in acute condition according to the model. However, we need the reasoning behind that prediction. Again, you can use the SHAP values as the response. The response includes the SHAP values corresponding to the phrases of the input note, which can be further color-coded. In this case, we see more phrases in red, such as “Patient reports a fever, chills, and weakness for the past 3 days, as well as decreased urine output and confusion” and “Patient is a 72-year-old female with a chief complaint of severe sepsis shock,” as opposed to green, aligning with the mortality prediction of 1.
The clinical care team can use these explanations to assist in their decisions on the care process for each individual patient.
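The color coding described in the two scenarios can be sketched as follows. The response layout used here (explanations, then kernel_shap, then a list of attributions with a partial_text description) reflects our understanding of the Clarify online explainability response and should be verified against the developer guide. The attribution numbers and the second sentence are made up for illustration, and mapping positive values to red is an assumption about the sign convention.

```python
def sentence_attributions(response, record_index=0, label_index=0):
    """Return (sentence, SHAP value) pairs for one record of the response."""
    record = response["explanations"]["kernel_shap"][record_index]
    pairs = []
    for feature in record:
        for item in feature["attributions"]:
            text = item["description"]["partial_text"]
            pairs.append((text, item["attribution"][label_index]))
    return pairs


def color_for(shap_value):
    """Assumed convention: positive values push toward mortality (red)."""
    return "red" if shap_value > 0 else "green"


# Hand-written example with invented attribution values, not real model output.
sample_response = {
    "explanations": {
        "kernel_shap": [
            [
                {
                    "feature_type": "text",
                    "attributions": [
                        {
                            "attribution": [-0.12],
                            "description": {
                                "partial_text": "Patient reports no previous history of chest pain.",
                                "start_idx": 0,
                            },
                        },
                        {
                            "attribution": [0.05],
                            "description": {
                                "partial_text": "Hypothetical sentence that raises the predicted risk.",
                                "start_idx": 51,
                            },
                        },
                    ],
                }
            ]
        ]
    }
}

colored = [(text, color_for(value)) for text, value in sentence_attributions(sample_response)]
for text, color in colored:
    print(f"[{color}] {text}")
```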
Clean up
To clean up the resources that have been created as part of this solution, run the following statements:
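A minimal sketch of such a cleanup, assuming the endpoint, endpoint configuration, and model were created under names you still have at hand (the function name is illustrative and the client is passed in):

```python
def delete_endpoint_resources(sagemaker_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint first, then its configuration, then the model."""
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sagemaker_client.delete_model(ModelName=model_name)
```

In a notebook, you would pass `boto3.client("sagemaker")` as the client. Deleting the endpoint stops the hosting charges; the model artifacts in Amazon S3 are not removed by these calls.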
Conclusion
This post showed you how to use SageMaker Clarify to explain decisions in a healthcare use case based on the medical notes captured during various stages of the triage process. This solution can be integrated into existing decision support systems to provide another data point to clinicians as they evaluate patients for admission into the ICU. To learn more about using AWS services in the healthcare industry, check out the following blog posts:
- Introducing the Healthcare Industry Lens for the AWS Well-Architected Framework
- How Telescope Health Streamlines Virtual Care in the Cloud
- The Pathway to better Surgical Care with Operating Room Analytics on AWS
- Predicting diabetic patient readmission using multi-model training on Amazon SageMaker Pipelines
- How Pieces Technologies leverages AWS services to predict patient outcomes
References
[1] https://aclanthology.org/2021.eacl-main.75/
[2] https://arxiv.org/pdf/1705.07874.pdf
About the authors
Shamika Ariyawansa is a Senior AI/ML Solutions Architect in the Global Healthcare and Life Sciences division at Amazon Web Services (AWS), with a keen focus on Generative AI. He assists customers in integrating Generative AI into their projects, emphasizing the importance of explainability within their AI-driven initiatives. Beyond his professional commitments, Shamika passionately pursues skiing and off-roading adventures.
Ted Spencer is an experienced Solutions Architect with extensive acute healthcare experience. He is passionate about applying machine learning to solve new use cases, and rounds out solutions with both the end consumer and their business/clinical context in mind. He lives in Toronto, Ontario, Canada, and enjoys traveling with his family and training for triathlons as time permits.
Ram Pathangi is a Solutions Architect at AWS supporting healthcare and life sciences customers in the San Francisco Bay Area. He has helped customers in finance, healthcare, life sciences, and hi-tech verticals run their business successfully on the AWS Cloud. He specializes in Databases, Analytics, and Machine Learning.
Apply fine-grained data access controls with AWS Lake Formation in Amazon SageMaker Data Wrangler
Amazon SageMaker Data Wrangler reduces the time it takes to collect and prepare data for machine learning (ML) from weeks to minutes. You can streamline the process of feature engineering and data preparation with SageMaker Data Wrangler and finish each stage of the data preparation workflow (including data selection, purification, exploration, visualization, and processing at scale) within a single visual interface. Data is frequently kept in data lakes that can be managed by AWS Lake Formation, giving you the ability to implement fine-grained access control using a straightforward grant or revoke procedure. SageMaker Data Wrangler supports fine-grained data access control with Lake Formation and Amazon Athena connections.
We are happy to announce that SageMaker Data Wrangler now supports using Lake Formation with Amazon EMR to provide this fine-grained data access restriction.
Data professionals such as data scientists want to use the power of Apache Spark, Hive, and Presto running on Amazon EMR for fast data preparation; however, the learning curve is steep. Our customers wanted the ability to connect to Amazon EMR to run ad hoc SQL queries on Hive or Presto to query data in the internal metastore or external metastore (such as the AWS Glue Data Catalog), and prepare data within a few clicks.
In this post, we show how to use Lake Formation as a central data governance capability and Amazon EMR as a big data query engine to enable access for SageMaker Data Wrangler. The capabilities of Lake Formation simplify securing and managing distributed data lakes across multiple accounts through a centralized approach, providing fine-grained access control.
Solution overview
We demonstrate this solution with an end-to-end use case using a sample dataset, the TPC data model. This data represents transaction data for products and includes information such as customer demographics, inventory, web sales, and promotions. To demonstrate fine-grained data access permissions, we consider the following two users:
- David, a data scientist on the marketing team. He is tasked with building a model on customer segmentation, and is only permitted to access non-sensitive customer data.
- Tina, a data scientist on the sales team. She is tasked with building the sales forecast model, and needs access to sales data for the particular region. She is also helping the product team with innovation, and therefore needs access to product data as well.
The architecture is implemented as follows:
- Lake Formation manages the data lake, and the raw data is available in Amazon Simple Storage Service (Amazon S3) buckets
- Amazon EMR is used to query the data from the data lake and perform data preparation using Spark
- AWS Identity and Access Management (IAM) roles are used to manage data access using Lake Formation
- SageMaker Data Wrangler is used as the single visual interface to interactively query and prepare the data
The following diagram illustrates this architecture. Account A is the data lake account that houses all the ML-ready data obtained through extract, transform, and load (ETL) processes. Account B is the data science account where a group of data scientists compile and run data transformations using SageMaker Data Wrangler. For SageMaker Data Wrangler in Account B to read the data tables in Account A’s data lake, we must grant the necessary Lake Formation permissions.
You can use the provided AWS CloudFormation stack to set up the architectural components for this solution.
Prerequisites
Before you get started, make sure you have the following prerequisites:
- An AWS account
- An IAM user with administrator access
- An S3 bucket
Provision resources with AWS CloudFormation
We provide a CloudFormation template that deploys the services in the architecture for end-to-end testing and to facilitate repeated deployments. The outputs of this template are as follows:
- An S3 bucket for the data lake.
- An EMR cluster with EMR runtime roles enabled. For more details on using runtime roles with Amazon EMR, see Configure runtime roles for Amazon EMR steps. Associating runtime roles with EMR clusters is supported in Amazon EMR 6.9. Make sure the following configuration is in place:
- Create a security configuration in Amazon EMR.
- The EMR runtime role’s trust policy should allow the EMR EC2 instance profile to assume the role.
- The EMR EC2 instance profile role should be able to assume the EMR runtime roles.
- The EMR cluster should be created with encryption in transit.
- IAM roles for accessing the data in data lake, with fine-grained permissions:
- Marketing-data-access-role
- Sales-data-access-role
- An Amazon SageMaker Studio domain and two user profiles. The SageMaker Studio execution roles for the users allow the users to assume their corresponding EMR runtime roles.
- A lifecycle configuration to enable the selection of the role to use for the EMR connection.
- A Lake Formation database populated with the TPC data.
- Networking resources required for the setup, such as VPC, subnets, and security groups.
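The two role requirements listed for the EMR cluster can be expressed as IAM policy documents. The sketch below is an illustration under assumptions: the account ID and the instance profile role name are placeholders, treating the two data access roles as the runtime roles is our reading of the setup, and your configuration may require additional actions. Compare it with the policies that the CloudFormation template actually creates.

```python
import json


def runtime_role_trust_policy(instance_profile_role_arn):
    """Trust policy for a runtime role: lets the EC2 instance profile role assume it."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": instance_profile_role_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def instance_profile_assume_policy(runtime_role_arns):
    """Permissions policy for the instance profile role: may assume the runtime roles."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["sts:AssumeRole", "sts:TagSession"],
                "Resource": list(runtime_role_arns),
            }
        ],
    }


# Placeholder account ID and instance profile role name, for illustration only.
account = "111122223333"
instance_profile_role = f"arn:aws:iam::{account}:role/ExampleEmrEc2InstanceProfileRole"
runtime_roles = [
    f"arn:aws:iam::{account}:role/Marketing-data-access-role",
    f"arn:aws:iam::{account}:role/Sales-data-access-role",
]

trust_policy = runtime_role_trust_policy(instance_profile_role)
assume_policy = instance_profile_assume_policy(runtime_roles)
print(json.dumps(trust_policy, indent=2))
print(json.dumps(assume_policy, indent=2))
```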
Create Amazon EMR encryption certificates for the data in transit
With Amazon EMR release version 4.8.0 or later, you have the option of specifying artifacts for encrypting data in transit using a security configuration. We manually create PEM certificates, include them in a .zip file, upload it to an S3 bucket, and then reference the .zip file in Amazon S3. You likely want to configure the private key PEM file to be a wildcard certificate that enables access to the VPC domain in which your cluster instances reside. For example, if your cluster resides in the us-east-1 Region, you could specify a common name in the certificate configuration that allows access to the cluster by specifying CN=*.ec2.internal in the certificate subject definition. If your cluster resides in us-west-2, you could specify CN=*.us-west-2.compute.internal.
Run the following commands using your system terminal. This will generate PEM certificates and collate them into a .zip file:
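As a sketch of what such commands can look like, the helper below derives the wildcard common name for a Region and assembles one plausible set of openssl and zip invocations. The helper names, key size, validity period, and the PEM file names inside the archive are assumptions; compare them with the Amazon EMR documentation on certificates for in-transit encryption before use. The code only prints the commands and does not run them.

```python
import shlex


def emr_certificate_common_name(region):
    """Wildcard common name matching the private DNS names of cluster instances."""
    if region == "us-east-1":
        return "*.ec2.internal"
    return f"*.{region}.compute.internal"


def build_certificate_commands(region, zip_name="my-certs.zip"):
    """Return the shell commands (as argument lists) for a self-signed certificate."""
    subject = f"/CN={emr_certificate_common_name(region)}"
    return [
        # Generate a self-signed certificate and an unencrypted private key.
        ["openssl", "req", "-x509", "-newkey", "rsa:2048", "-nodes", "-days", "365",
         "-keyout", "privateKey.pem", "-out", "certificateChain.pem", "-subj", subject],
        # Reuse the self-signed certificate as the trusted certificate.
        ["cp", "certificateChain.pem", "trustedCertificates.pem"],
        # Bundle the PEM files into a single archive.
        ["zip", "-r", "-X", zip_name,
         "certificateChain.pem", "privateKey.pem", "trustedCertificates.pem"],
    ]


commands = build_certificate_commands("us-west-2")
for command in commands:
    print(shlex.join(command))
```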
Upload my-certs.zip to an S3 bucket in the same Region where you intend to run this exercise. Copy the S3 URI for the uploaded file. You’ll need this while launching the CloudFormation template.
This example is a proof of concept demonstration only. Using self-signed certificates is not recommended and presents a potential security risk. For production systems, use a trusted certification authority (CA) to issue certificates.
Deploying the CloudFormation template
To deploy the solution, complete the following steps:
- Sign in to the AWS Management Console as an IAM user, preferably an admin user.
- Choose Launch Stack to launch the CloudFormation template:
- Choose Next.
- For Stack name, enter a name for the stack.
- For IdleTimeout, enter a value for the idle timeout for the EMR cluster (to avoid paying for the cluster when it’s not being used).
- For S3CertsZip, enter an S3 URI with the EMR encryption key.
For instructions to generate a key and .zip file specific to your Region, refer to Providing certificates for encrypting data in transit with Amazon EMR encryption. If you are deploying in US East (N. Virginia), remember to use CN=*.ec2.internal. For more information, refer to Create keys and certificates for data encryption. Make sure to upload the .zip file to an S3 bucket in the same Region as your CloudFormation stack deployment.
- On the review page, select the check box to confirm that AWS CloudFormation might create resources.
- Choose Create stack.
Wait until the status of the stack changes from CREATE_IN_PROGRESS to CREATE_COMPLETE. The process usually takes 10–15 minutes.
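If you prefer to wait from a script rather than watching the console, the following minimal sketch uses the CloudFormation stack_create_complete waiter. The function name and the delay and attempt values are assumptions (chosen to allow roughly 20 minutes), and the client is passed in.

```python
def wait_for_stack(cloudformation_client, stack_name, delay_seconds=30, max_attempts=40):
    """Block until the stack reaches CREATE_COMPLETE, or raise if creation fails."""
    waiter = cloudformation_client.get_waiter("stack_create_complete")
    waiter.wait(
        StackName=stack_name,
        WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
    )
    return stack_name
```

In a notebook or script, you would pass `boto3.client("cloudformation")` as the client, together with the stack name you entered when launching the template.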
After the stack is created, allow Amazon EMR to query Lake Formation by updating the External Data Filtering settings on Lake Formation. For instructions, refer to Getting started with Lake Formation. Specify Amazon EMR for Session tag values and enter your AWS account ID under AWS account IDs.
Test data access permissions
Now that the necessary infrastructure is in place, you can verify that the two SageMaker Studio users have access to granular data. To review, David shouldn’t have access to any private information about your customers. Tina has access to information about sales. Let’s put each user type to the test.
Test David’s user profile
To test your data access with David’s user profile, complete the following steps:
- On the SageMaker console, choose Domains in the navigation pane.
- From the SageMaker Studio domain, launch SageMaker Studio from the user profile david-non-sensitive-customer.
- In your SageMaker Studio environment, create an Amazon SageMaker Data Wrangler flow, and choose Import & prepare data visually.
Alternatively, on the File menu, choose New, then choose Data Wrangler flow.
We discuss these steps to create a data flow in detail later in this post.
Test Tina’s user profile
Tina’s SageMaker Studio execution role allows her to access the Lake Formation database using two EMR execution roles. This is achieved by listing the role ARNs in a configuration file in Tina’s file directory. These roles can be set using SageMaker Studio lifecycle configurations to persist the roles across app restarts. To test Tina’s access, complete the following steps:
- On the SageMaker console, navigate to the SageMaker Studio domain.
- Launch SageMaker Studio from the user profile tina-sales-electronics.
It’s a good practice to close any previous SageMaker Studio sessions on your browser when switching user profiles. There can only be one active SageMaker Studio user session at a time.
- Create a Data Wrangler data flow.
In the following sections, we showcase creating a data flow within SageMaker Data Wrangler and connecting to Amazon EMR as the data source. David and Tina will have similar experiences with data preparation, except for access permissions, so they will see different tables.
Create a SageMaker Data Wrangler data flow
In this section, we cover connecting to the existing EMR cluster created through the CloudFormation template as a data source in SageMaker Data Wrangler. For demonstration purposes, we use David’s user profile.
To create your data flow, complete the following steps:
- On the SageMaker console, choose Domains in the navigation pane.
- Choose StudioDomain, which was created by running the CloudFormation template.
- Select a user profile (for this example, David’s) and launch SageMaker Studio.
- Choose Open Studio.
- In SageMaker Studio, create a new data flow and choose Import & prepare data visually.
Alternatively, on the File menu, choose New, then choose Data Wrangler flow.
Creating a new flow can take a few minutes. After the flow has been created, you see the Import data page.
- To add Amazon EMR as a data source in SageMaker Data Wrangler, on the Add data source menu, choose Amazon EMR.
You can browse all the EMR clusters that your SageMaker Studio execution role has permissions to see. You have two options to connect to a cluster: one is through the interactive UI, and the other is to first create a secret using AWS Secrets Manager with a JDBC URL, including EMR cluster information, and then provide the stored AWS secret ARN in the UI to connect to Presto or Hive. In this post, we use the first method.
- Select any of the clusters that you want to use, then choose Next.
- Select which endpoint you want to use.
- Enter a name to identify your connection, such as emr-iam-connection, then choose Next.
- Select IAM as your authentication type and choose Connect.
When you’re connected, you can interactively view a database tree and table preview or schema. You can also query, explore, and visualize data from Amazon EMR. For a preview, you see a limit of 100 records by default. After you provide a SQL statement in the query editor and choose Run, the query is run on the Amazon EMR Hive engine to preview the data. Choose Cancel query to cancel ongoing queries if they are taking an unusually long time.
- Let’s access data from the table that David doesn’t have permissions to.
The query will result in the error message “Unable to fetch table dl_tpc_web_sales. Insufficient Lake Formation permission(s) on dl_tpc_web_sales.”
The last step is to import the data. When you are ready with the queried data, you have the option to update the sampling settings for the data selection according to the sampling type (FirstK, Random, or Stratified) and the sampling size for importing data into Data Wrangler.
- Choose Import to import the data.
On the next page, you can add various transformations and essential analysis to the dataset.
- Navigate to the data flow and add more steps to the flow as needed for transformations and analysis.
You can run a data insight report to identify data quality issues and get recommendations to fix those issues. Let’s look at some example transforms.
- In the Data flow view, you should see that we are using Amazon EMR as a data source using the Hive connector.
- Choose the plus sign next to Data types and choose Add transform.
Let’s explore the data and apply a transformation. For example, the c_login column is empty and it will not add value as a feature. Let’s delete the column.
- In the All steps pane, choose Add step.
- Choose Manage columns.
- For Transform, choose Drop column.
- For Columns to drop, choose the c_login column.
- Choose Preview, then choose Add.
- Verify the step by expanding the Drop column section.
You can continue adding steps based on the different transformations required for your dataset. Let’s go back to our data flow. You can now see the Drop column block showing the transform we performed.
ML practitioners spend a lot of time crafting feature engineering code, applying it to their initial datasets, training models on the engineered datasets, and evaluating model accuracy. Given the experimental nature of this work, even the smallest project will lead to multiple iterations. The same feature engineering code is often run again and again, wasting time and compute resources on repeating the same operations. In large organizations, this can cause an even greater loss of productivity because different teams often run identical jobs or even write duplicate feature engineering code because they have no knowledge of prior work. To avoid the reprocessing of features, we can export our transformed features to Amazon SageMaker Feature Store. For more information, refer to New – Store, Discover, and Share Machine Learning Features with Amazon SageMaker Feature Store.
- Choose the plus sign next to Drop column.
- Choose Export to, then choose SageMaker Feature Store (via Jupyter notebook).
You can easily export your generated features to SageMaker Feature Store by specifying it as the destination. You can save the features into an existing feature group or create a new one. For more information, refer to Easily create and store features in Amazon SageMaker without code.
We have now created features with SageMaker Data Wrangler and stored those features in SageMaker Feature Store. We showed an example workflow for feature engineering in the SageMaker Data Wrangler UI.
Clean up
If your work with SageMaker Data Wrangler is complete, delete the resources you created to avoid incurring additional fees.
- In SageMaker Studio, close all the tabs, then on the File menu, choose Shut Down.
- When prompted, choose Shutdown All.
Shutdown might take a few minutes based on the instance type. Make sure all the apps associated with each user profile were deleted. If they were not deleted, manually delete the apps associated with each user profile created using the CloudFormation template.
- On the Amazon S3 console, empty any S3 buckets that were created from the CloudFormation template when provisioning clusters.
The buckets should have the same prefix as the CloudFormation launch stack name and cf-templates-.
- On the Amazon EFS console, delete the SageMaker Studio file system.
You can confirm that you have the correct file system by choosing the file system ID and confirming the tag ManagedByAmazonSageMakerResource on the Tags tab.
- On the AWS CloudFormation console, select the stack you created and choose Delete.
You’ll receive an error message, which is expected. We’ll come back to this and clean it up in the subsequent steps.
- Identify the VPC that was created by the CloudFormation stack, named dw-emr-, and follow the prompts to delete the VPC.
- Return to the AWS CloudFormation console and retry the stack deletion for dw-emr-.
All the resources provisioned by the CloudFormation template described in this post have now been removed from your account.
Conclusion
In this post, we went over how to apply fine-grained access control with Lake Formation and access the data using Amazon EMR as a data source in SageMaker Data Wrangler, how to transform and analyze a dataset, and how to export the results to a data flow for use in a Jupyter notebook. After visualizing our dataset using SageMaker Data Wrangler’s built-in analytical features, we further enhanced our data flow. The fact that we created a data preparation pipeline without writing a single line of code is significant.
To get started with SageMaker Data Wrangler, refer to Prepare ML Data with Amazon SageMaker Data Wrangler.
About the Authors
Ajjay Govindaram is a Senior Solutions Architect at AWS. He works with strategic customers who are using AI/ML to solve complex business problems. His experience lies in providing technical direction as well as design assistance for modest to large-scale AI/ML application deployments. His knowledge ranges from application architecture to big data, analytics, and machine learning. He enjoys listening to music while resting, experiencing the outdoors, and spending time with his loved ones.
Isha Dua is a Senior Solutions Architect based in the San Francisco Bay Area. She helps AWS enterprise customers grow by understanding their goals and challenges, and guides them on how they can architect their applications in a cloud-native manner while ensuring resilience and scalability. She’s passionate about machine learning technologies and environmental sustainability.
Parth Patel is a Senior Solutions Architect at AWS in the San Francisco Bay Area. Parth guides enterprise customers to accelerate their journey to the cloud and help them adopt and grow on the AWS Cloud successfully. He is passionate about machine learning technologies, environmental sustainability, and application modernization.
From internship project to published research and a role at Amazon
How Linghui Luo’s research helps ensure code is checked and ready to deploy.
A quick guide to Amazon’s papers at Interspeech 2023
Speech recognition predominates, but Amazon’s research takes in data representation, dialogue management, question answering, and more.
Build ML features at scale with Amazon SageMaker Feature Store using data from Amazon Redshift
Amazon Redshift is the most popular cloud data warehouse, used by tens of thousands of customers to analyze exabytes of data every day. Many practitioners extend these Redshift datasets at scale for machine learning (ML) using Amazon SageMaker, a fully managed ML service. They need to develop features offline, either in code or with low-code/no-code tools, store feature data from Amazon Redshift, and do all of this at scale in a production environment.
In this post, we show you three options to prepare Redshift source data at scale in SageMaker, including loading data from Amazon Redshift, performing feature engineering, and ingesting features into Amazon SageMaker Feature Store:
- Option A – Use an AWS Glue interactive session on Amazon SageMaker Studio (in a dev environment) and an AWS Glue job (in a prod environment) with Spark
- Option B – Use an Amazon SageMaker Processing job with a Redshift dataset definition, or use SageMaker Feature Processing in SageMaker Feature Store, which runs SageMaker training jobs
- Option C – Use Amazon SageMaker Data Wrangler in a low-code/no-code way
If you’re an AWS Glue user and would like to do the process interactively, consider option A. If you’re familiar with SageMaker and writing Spark code, option B could be your choice. If you want to do the process in a low-code/no-code way, you can follow option C.
Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes, using AWS-designed hardware and ML to deliver the best price-performance at any scale.
SageMaker Studio is the first fully integrated development environment (IDE) for ML. It provides a single web-based visual interface where you can perform all ML development steps, including preparing data and building, training, and deploying models.
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, ML, and application development. AWS Glue enables you to seamlessly collect, transform, cleanse, and prepare data for storage in your data lakes and data pipelines using a variety of capabilities, including built-in transforms.
Solution overview
The following diagram illustrates the solution architecture for each option.
Prerequisites
To continue with the examples in this post, you need to create the required AWS resources. To do this, we provide an AWS CloudFormation template to create a stack that contains the resources. When you create the stack, AWS creates a number of resources in your account:
- A SageMaker domain, which includes an associated Amazon Elastic File System (Amazon EFS) volume
- A list of authorized users and a variety of security, application, policy, and Amazon Virtual Private Cloud (Amazon VPC) configurations
- A Redshift cluster
- A Redshift secret
- An AWS Glue connection for Amazon Redshift
- An AWS Lambda function to set up required resources, execution roles and policies
Make sure that you don’t already have two SageMaker Studio domains in the Region where you’re running the CloudFormation template. This is the maximum allowed number of domains in each supported Region.
Deploy the CloudFormation template
Complete the following steps to deploy the CloudFormation template:
- Save the CloudFormation template sm-redshift-demo-vpc-cfn-v1.yaml locally.
- On the AWS CloudFormation console, choose Create stack.
- For Prepare template, select Template is ready.
- For Template source, select Upload a template file.
- Choose Choose File and navigate to the location on your computer where the CloudFormation template was downloaded and choose the file.
- Enter a stack name, such as Demo-Redshift.
- On the Configure stack options page, leave everything as default and choose Next.
- On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names and choose Create stack.
You should see a new CloudFormation stack with the name Demo-Redshift being created. Wait for the status of the stack to be CREATE_COMPLETE (approximately 7 minutes) before moving on. You can navigate to the stack’s Resources tab to check what AWS resources were created.
Launch SageMaker Studio
Complete the following steps to launch your SageMaker Studio domain:
- On the SageMaker console, choose Domains in the navigation pane.
- Choose the domain you created as part of the CloudFormation stack (SageMakerDemoDomain).
- Choose Launch and Studio.
This page can take 1–2 minutes to load when you access SageMaker Studio for the first time, after which you’ll be redirected to a Home tab.
Download the GitHub repository
Complete the following steps to download the GitHub repo:
- In the SageMaker notebook, on the File menu, choose New and Terminal.
- In the terminal, enter the following command:
You can now see the amazon-sagemaker-featurestore-redshift-integration folder in the navigation pane of SageMaker Studio.
Set up batch ingestion with the Spark connector
Complete the following steps to set up batch ingestion:
- In SageMaker Studio, open the notebook 1-uploadJar.ipynb under amazon-sagemaker-featurestore-redshift-integration.
- If you are prompted to choose a kernel, choose Data Science as the image and Python 3 as the kernel, then choose Select.
- For the following notebooks, choose the same image and kernel, except for the AWS Glue Interactive Sessions notebook (4a).
- Run the cells by pressing Shift+Enter in each of the cells.
While the code runs, an asterisk (*) appears between the square brackets. When the code has finished running, the * is replaced with a number. You run all the other notebooks in the same way.
Set up the schema and load data to Amazon Redshift
The next step is to set up the schema and load data from Amazon Simple Storage Service (Amazon S3) to Amazon Redshift. To do so, run the notebook 2-loadredshiftdata.ipynb.
Create feature stores in SageMaker Feature Store
To create your feature stores, run the notebook 3-createFeatureStore.ipynb.
Perform feature engineering and ingest features into SageMaker Feature Store
In this section, we present the steps for all three options to perform feature engineering and ingest processed features into SageMaker Feature Store.
Option A: Use SageMaker Studio with a serverless AWS Glue interactive session
Complete the following steps for option A:
- In SageMaker Studio, open the notebook 4a-glue-int-session.ipynb.
- If you are prompted to choose a kernel, choose SparkAnalytics 2.0 as the image and Glue Python [PySpark and Ray] as the kernel, then choose Select.
The environment preparation process may take some time to complete.
Option B: Use a SageMaker Processing job with Spark
In this option, we use a SageMaker Processing job with a Spark script to load the original dataset from Amazon Redshift, perform feature engineering, and ingest the data into SageMaker Feature Store. To do so, open the notebook 4b-processing-rs-to-fs.ipynb in your SageMaker Studio environment.
Here we use RedshiftDatasetDefinition to retrieve the dataset from the Redshift cluster. RedshiftDatasetDefinition is one type of processing job input; it provides a simple interface for configuring Redshift connection parameters such as identifier, database, table, and query string. With RedshiftDatasetDefinition, you can establish your Redshift connection without maintaining a full-time connection. We also use the SageMaker Feature Store Spark connector library in the processing job to connect to SageMaker Feature Store in a distributed environment. With this Spark connector, you can ingest data into the feature group’s online and offline stores from a Spark DataFrame. The connector can also automatically load feature definitions to help with creating feature groups. Above all, this solution offers a native Spark way to implement an end-to-end data pipeline from Amazon Redshift to SageMaker: you can perform any feature engineering in a Spark context and ingest the final features into SageMaker Feature Store in a single Spark project.
To use the SageMaker Feature Store Spark connector, we extend a pre-built SageMaker Spark container with sagemaker-feature-store-pyspark installed. In the Spark script, use a system command to run pip install, which installs this library in your local environment, and get the local path of the JAR file dependency. In the processing job API, pass this path to the submit_jars parameter so that the JAR is made available to the nodes of the Spark cluster that the processing job creates.
In the Spark script for the processing job, we first read the original dataset files from Amazon S3, which temporarily stores the unloaded dataset from Amazon Redshift as a medium. Then we perform feature engineering in a Spark way and use feature_store_pyspark to ingest data into the offline feature store.
For the processing job, we provide a ProcessingInput with a redshift_dataset_definition. Here we build a structure according to the interface, providing Redshift connection-related configurations. You can use query_string to filter your dataset by SQL and unload it to Amazon S3. See the following code:
Each processing job (for the USER, PLACE, and RATING datasets) takes 6–7 minutes to complete.
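To make the shape of a Redshift dataset definition more concrete, the following sketch builds the equivalent input as a plain dictionary in the form we understand the low-level CreateProcessingJob API to accept. It is not the notebook's code: the helper name, the default local path, the `AppManaged` flag, and the Parquet output format are all assumptions chosen for illustration. In the SageMaker Python SDK, the same fields are expressed in snake_case through ProcessingInput and RedshiftDatasetDefinition.

```python
def redshift_processing_input(cluster_id, database, db_user, query_string,
                              cluster_role_arn, output_s3_uri,
                              input_name="redshift_dataset",
                              local_path="/opt/ml/processing/input/rdd"):
    """Build one ProcessingInputs entry that unloads a Redshift query to S3."""
    return {
        "InputName": input_name,
        # Assumption: the Spark script reads the unloaded files from S3 itself.
        "AppManaged": True,
        "DatasetDefinition": {
            "LocalPath": local_path,
            "DataDistributionType": "FullyReplicated",
            "InputMode": "File",
            "RedshiftDatasetDefinition": {
                "ClusterId": cluster_id,
                "Database": database,
                "DbUser": db_user,
                # SQL used to filter the dataset before it is unloaded
                "QueryString": query_string,
                "ClusterRoleArn": cluster_role_arn,
                "OutputS3Uri": output_s3_uri,
                "OutputFormat": "PARQUET",
            },
        },
    }


# Hypothetical values for illustration only
example_input = redshift_processing_input(
    cluster_id="my-redshift-cluster",
    database="dev",
    db_user="awsuser",
    query_string="SELECT * FROM users",
    cluster_role_arn="arn:aws:iam::123123123123:role/redshift-s3-dw-connect",
    output_s3_uri="s3://my-bucket/redshift-unload/users/",
)
```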
For more details about SageMaker Processing jobs, refer to Process data.
For a SageMaker native solution for feature processing from Amazon Redshift, you can also use Feature Processing in SageMaker Feature Store, which takes care of the underlying infrastructure, including provisioning the compute environments and creating and maintaining SageMaker pipelines to load and ingest data. You can then focus only on your feature processor definitions, which include the transformation functions, the Amazon Redshift source, and the SageMaker Feature Store sink. The scheduling, job management, and other production workloads are managed by SageMaker. Feature Processor pipelines are SageMaker pipelines, so the standard monitoring mechanisms and integrations are available.
Option C: Use SageMaker Data Wrangler
SageMaker Data Wrangler allows you to import data from various data sources including Amazon Redshift for a low-code/no-code way to prepare, transform, and featurize your data. After you finish data preparation, you can use SageMaker Data Wrangler to export features to SageMaker Feature Store.
Some AWS Identity and Access Management (IAM) settings are required to allow SageMaker Data Wrangler to connect to Amazon Redshift. First, create an IAM role (for example, redshift-s3-dw-connect) that includes an Amazon S3 access policy. For this post, we attached the AmazonS3FullAccess policy to the IAM role. If you need to restrict access to a specific S3 bucket, you can define that in the Amazon S3 access policy. We attached the IAM role to the Redshift cluster that we created earlier. Next, create a policy that allows SageMaker to access Amazon Redshift by getting its cluster credentials, and attach the policy to the SageMaker IAM role. The policy looks like the following code:
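As a hedged illustration of what such a policy could contain, the following sketch generates a policy document that allows the redshift:GetClusterCredentials action for one database user and one database. The helper name and all identifier values are hypothetical; adjust the Region, account ID, cluster, user, and database to your environment.

```python
import json


def redshift_credentials_policy(region, account_id, cluster_id, db_user, db_name):
    """Return an IAM policy document granting temporary Redshift credentials."""
    prefix = f"arn:aws:redshift:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "redshift:GetClusterCredentials",
                "Resource": [
                    f"{prefix}:dbuser:{cluster_id}/{db_user}",
                    f"{prefix}:dbname:{cluster_id}/{db_name}",
                ],
            }
        ],
    }


# Hypothetical values for illustration only
policy = redshift_credentials_policy(
    "us-east-1", "123123123123", "my-redshift-cluster", "awsuser", "dev"
)
print(json.dumps(policy, indent=2))
```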
After this setup, SageMaker Data Wrangler allows you to query Amazon Redshift and output the results into an S3 bucket. For instructions to connect to a Redshift cluster and query and import data from Amazon Redshift to SageMaker Data Wrangler, refer to Import data from Amazon Redshift.
SageMaker Data Wrangler offers a selection of over 300 pre-built data transformations for common use cases such as deleting duplicate rows, imputing missing data, one-hot encoding, and handling time series data. You can also add custom transformations in pandas or PySpark. In our example, we applied some transformations such as drop column, data type enforcement, and ordinal encoding to the data.
When your data flow is complete, you can export it to SageMaker Feature Store. At this point, you need to create a feature group: give the feature group a name, select both online and offline storage, provide the name of an S3 bucket to use for the offline store, and provide a role that has SageMaker Feature Store access. Finally, you can create a job, which creates a SageMaker Processing job that runs the SageMaker Data Wrangler flow to ingest features from the Redshift data source to your feature group.
Here is one end-to-end data flow in the scenario of PLACE feature engineering.
Use SageMaker Feature Store for model training and prediction
To use SageMaker Feature Store for model training and prediction, open the notebook 5-classification-using-feature-groups.ipynb.
After the Redshift data is transformed into features and ingested into SageMaker Feature Store, the features are available for search and discovery across teams of data scientists responsible for many independent ML models and use cases. These teams can use the features for modeling without having to rebuild or rerun feature engineering pipelines. Feature groups are managed and scaled independently, and can be reused and joined together regardless of the upstream data source.
The next step is to build ML models using features selected from one or multiple feature groups. You decide which feature groups to use for your models. There are two options to create an ML dataset from feature groups, both utilizing the SageMaker Python SDK:
- Use the SageMaker Feature Store DatasetBuilder API – The SageMaker Feature Store DatasetBuilder API allows data scientists to create ML datasets from one or more feature groups in the offline store. You can use the API to create a dataset from a single feature group or multiple feature groups, and output it as a CSV file or a pandas DataFrame. See the following example code:
- Run SQL queries using the athena_query function in the FeatureGroup API – Another option is to use the auto-built AWS Glue Data Catalog for the FeatureGroup API. The FeatureGroup API includes an athena_query function that creates an AthenaQuery instance to run user-defined SQL query strings. You then run the Athena query and organize the query result into a pandas DataFrame. This option allows you to specify more complicated SQL queries to extract information from a feature group. See the following example code:
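The second option can be sketched as a small helper. This is a minimal example rather than the notebook's code: it assumes the object returned by a feature group's athena_query function is passed in, and the helper name, column names, and S3 location are hypothetical.

```python
def query_feature_group(athena_query, columns, output_location, where=None):
    """Run a SELECT against a feature group's offline store table."""
    sql = f'SELECT {", ".join(columns)} FROM "{athena_query.table_name}"'
    if where:
        sql += f" WHERE {where}"
    # Start the Athena query, block until it finishes, then load the result.
    athena_query.run(query_string=sql, output_location=output_location)
    athena_query.wait()
    return athena_query.as_dataframe()


# Example usage (requires the SageMaker Python SDK and an existing feature group):
#   from sagemaker.feature_store.feature_group import FeatureGroup
#   fg = FeatureGroup(name="my-feature-group", sagemaker_session=session)
#   df = query_feature_group(
#       fg.athena_query(),
#       columns=["user_id", "age"],
#       output_location="s3://my-bucket/athena-results/",
#   )
```

Passing the query object in, instead of creating it inside the helper, keeps the function easy to reuse across several feature groups.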
Next, we can merge the queried data from different feature groups into our final dataset for model training and testing. For this post, we use batch transform for model inference. Batch transform allows you to run model inference on a batch of data in Amazon S3, and the inference results are stored in Amazon S3 as well. For details on model training and inference, refer to the notebook 5-classification-using-feature-groups.ipynb.
Run a join query on prediction results in Amazon Redshift
Lastly, we query the inference result and join it with original user profiles in Amazon Redshift. To do this, we use Amazon Redshift Spectrum to join batch prediction results in Amazon S3 with the original Redshift data. For details, refer to the notebook 6-read-results-in-redshift.ipynb.
Clean up
In this section, we provide the steps to clean up the resources created as part of this post to avoid ongoing charges.
Shut down SageMaker Apps
Complete the following steps to shut down your resources:
- In SageMaker Studio, on the File menu, choose Shut Down.
- In the Shutdown confirmation dialog, choose Shutdown All to proceed.
- After you get the “Server stopped” message, you can close this tab.
Delete the apps
Complete the following steps to delete your apps:
- On the SageMaker console, in the navigation pane, choose Domains.
- On the Domains page, choose SageMakerDemoDomain.
- On the domain details page, under User profiles, choose the user sagemakerdemouser.
- In the Apps section, in the Action column, choose Delete app for any active apps.
- Ensure that the Status column says Deleted for all the apps.
Delete the EFS storage volume associated with your SageMaker domain
Locate your EFS volume on the SageMaker console and delete it. For instructions, refer to Manage Your Amazon EFS Storage Volume in SageMaker Studio.
Delete default S3 buckets for SageMaker
Delete the default S3 buckets (sagemaker-<region-code>-<acct-id>) for SageMaker if you are not using SageMaker in that Region.
Delete the CloudFormation stack
Delete the CloudFormation stack in your AWS account so as to clean up all related resources.
Conclusion
In this post, we demonstrated an end-to-end data and ML flow from a Redshift data warehouse to SageMaker. You can easily use AWS native integration of purpose-built engines to go through the data journey seamlessly. Check out the AWS Blog for more practices about building ML features from a modern data warehouse.
About the Authors
Akhilesh Dube, a Senior Analytics Solutions Architect at AWS, possesses more than two decades of expertise in working with databases and analytics products. His primary role involves collaborating with enterprise clients to design robust data analytics solutions while offering comprehensive technical guidance on a wide range of AWS Analytics and AI/ML services.
Ren Guo is a Senior Data Specialist Solutions Architect in the domains of generative AI, analytics, and traditional AI/ML at AWS, Greater China Region.
Sherry Ding is a Senior AI/ML Specialist Solutions Architect. She has extensive experience in machine learning with a PhD degree in Computer Science. She mainly works with Public Sector customers on various AI/ML-related business challenges, helping them accelerate their machine learning journey on the AWS Cloud. When not helping customers, she enjoys outdoor activities.
Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build AI/ML solutions. Mark’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Mark holds six AWS Certifications, including the ML Specialty Certification. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services.
Unlocking efficiency: Harnessing the power of Selective Execution in Amazon SageMaker Pipelines
MLOps is a key discipline that often oversees the path to productionizing machine learning (ML) models. It’s natural to focus on a single model that you want to train and deploy. However, in reality, you’ll likely work with dozens or even hundreds of models, and the process may involve multiple complex steps. Therefore, it’s important to have the infrastructure in place to track, train, deploy, and monitor models with varying complexities at scale. This is where MLOps tooling comes in. MLOps tooling helps you repeatably and reliably build and simplify these processes into a workflow that is tailored for ML.
Amazon SageMaker Pipelines, a feature of Amazon SageMaker, is a purpose-built workflow orchestration service for ML that helps you automate end-to-end ML workflows at scale. It simplifies the development and maintenance of ML models by providing a centralized platform to orchestrate tasks such as data preparation, model training, tuning and validation. SageMaker Pipelines can help you streamline workflow management, accelerate experimentation and retrain models more easily.
In this post, we spotlight an exciting new feature of SageMaker Pipelines known as Selective Execution. This new feature empowers you to selectively run specific portions of your ML workflow, resulting in significant time and compute resource savings by limiting the run to pipeline steps in scope and eliminating the need to run steps out of scope. Furthermore, we explore various use cases where the advantages of utilizing Selective Execution become evident, further solidifying its value proposition.
Solution overview
SageMaker Pipelines continues to innovate its developer experience with the release of Selective Execution. ML builders now have the ability to choose specific steps to run within a pipeline, eliminating the need to rerun the entire pipeline. This feature enables you to rerun specific sections of the pipeline while modifying the runtime parameters associated with the selected steps.
It’s important to note that the selected steps may rely on the results of non-selected steps. In such cases, the outputs of these non-selected steps are reused from a reference run of the current pipeline version. This means that the reference run must have already completed. The default reference run is the latest run of the current pipeline version, but you can also choose to use a different run of the current pipeline version as a reference.
The overall state of the reference run must be Succeeded, Failed or Stopped. It cannot be Running when Selective Execution attempts to use its outputs. When using Selective Execution, you can choose any number of steps to run, as long as they form a contiguous portion of the pipeline.
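The reference run rule can be expressed as a small filter over execution summaries. The following is a minimal sketch using the API status values Succeeded, Failed, and Stopped; the helper name is hypothetical, and it ignores pipeline versions, so it only approximates the default selection described above.

```python
USABLE_REFERENCE_STATES = {"Succeeded", "Failed", "Stopped"}


def pick_reference_execution(summaries):
    """Return the ARN of the most recent execution that can serve as a reference.

    `summaries` is a list of dicts shaped like the PipelineExecutionSummaries
    entries returned when listing pipeline executions.
    """
    usable = [
        s for s in summaries
        if s["PipelineExecutionStatus"] in USABLE_REFERENCE_STATES
    ]
    if not usable:
        return None
    latest = max(usable, key=lambda s: s["StartTime"])
    return latest["PipelineExecutionArn"]
```

An execution that is still running is skipped, so the helper falls back to the most recent finished run.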
The following diagram illustrates the pipeline behavior with a full run.
The following diagram illustrates the pipeline behavior using Selective Execution.
In the following sections, we show how to use Selective Execution for various scenarios, including complex workflows in pipeline Directed Acyclic Graphs (DAGs).
Prerequisites
To start experimenting with Selective Execution, we need to first set up the following components of your SageMaker environment:
- SageMaker Python SDK – Ensure that you have an updated SageMaker Python SDK installed in your Python environment. You can run either of the following commands from your notebook or terminal to install or upgrade the SageMaker Python SDK to version 2.162.0 or higher (the quotes keep the shell from treating > as a redirection): python3 -m pip install "sagemaker>=2.162.0" or pip3 install "sagemaker>=2.162.0".
- Access to SageMaker Studio (optional) – Amazon SageMaker Studio can be helpful for visualizing pipeline runs and interacting with preexisting pipeline ARNs visually. If you don’t have access to SageMaker Studio or are using on-demand notebooks or other IDEs, you can still follow this post and interact with your pipeline ARNs using the Python SDK.
The sample code for a full end-to-end walkthrough is available in the GitHub repo.
Setup
With the sagemaker>=2.162.0 Python SDK, we introduced the SelectiveExecutionConfig class as part of the sagemaker.workflow.selective_execution_config module. The Selective Execution feature relies on the ARN of a previous pipeline execution whose status is Succeeded, Failed or Stopped. The following code snippet demonstrates how to import the SelectiveExecutionConfig class, retrieve the reference pipeline execution ARN, and gather the associated pipeline steps and runtime parameters governing the pipeline run:
import boto3
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.selective_execution_config import SelectiveExecutionConfig
sm_client = boto3.client('sagemaker')
# reference the name of your sample pipeline
pipeline_name = "AbalonePipeline"
# filter for previously successful pipeline execution ARNs
pipeline_executions = [
    _exec
    for _exec in Pipeline(name=pipeline_name).list_executions()['PipelineExecutionSummaries']
    if _exec['PipelineExecutionStatus'] == "Succeeded"
]
# get the last successful execution
latest_pipeline_arn = pipeline_executions[0]['PipelineExecutionArn']
print(latest_pipeline_arn)
>>> arn:aws:sagemaker:us-east-1:123123123123:pipeline/AbalonePipeline/execution/x62pbar3gs6h
# list all steps of your sample pipeline
execution_steps = sm_client.list_pipeline_execution_steps(
    PipelineExecutionArn=latest_pipeline_arn
)['PipelineExecutionSteps']
print(execution_steps)
>>>
[{'StepName': 'Abalone-Preprocess',
'StartTime': datetime.datetime(2023, 6, 27, 4, 41, 30, 519000, tzinfo=tzlocal()),
'EndTime': datetime.datetime(2023, 6, 27, 4, 41, 30, 986000, tzinfo=tzlocal()),
'StepStatus': 'Succeeded',
'AttemptCount': 0,
'Metadata': {'ProcessingJob': {'Arn': 'arn:aws:sagemaker:us-east-1:123123123123:processing-job/pipelines-fvsmu7m7ki3q-Abalone-Preprocess-d68CecvHLU'}},
'SelectiveExecutionResult': {'SourcePipelineExecutionArn': 'arn:aws:sagemaker:us-east-1:123123123123:pipeline/AbalonePipeline/execution/ksm2mjwut6oz'}},
{'StepName': 'Abalone-Train',
'StartTime': datetime.datetime(2023, 6, 27, 4, 41, 31, 320000, tzinfo=tzlocal()),
'EndTime': datetime.datetime(2023, 6, 27, 4, 43, 58, 224000, tzinfo=tzlocal()),
'StepStatus': 'Succeeded',
'AttemptCount': 0,
'Metadata': {'TrainingJob': {'Arn': 'arn:aws:sagemaker:us-east-1:123123123123:training-job/pipelines-x62pbar3gs6h-Abalone-Train-PKhAc1Q6lx'}}},
{'StepName': 'Abalone-Evaluate',
'StartTime': datetime.datetime(2023, 6, 27, 4, 43, 59, 40000, tzinfo=tzlocal()),
'EndTime': datetime.datetime(2023, 6, 27, 4, 57, 43, 76000, tzinfo=tzlocal()),
'StepStatus': 'Succeeded',
'AttemptCount': 0,
'Metadata': {'ProcessingJob': {'Arn': 'arn:aws:sagemaker:us-east-1:123123123123:processing-job/pipelines-x62pbar3gs6h-Abalone-Evaluate-vmkZDKDwhk'}}},
{'StepName': 'Abalone-MSECheck',
'StartTime': datetime.datetime(2023, 6, 27, 4, 57, 43, 821000, tzinfo=tzlocal()),
'EndTime': datetime.datetime(2023, 6, 27, 4, 57, 44, 124000, tzinfo=tzlocal()),
'StepStatus': 'Succeeded',
'AttemptCount': 0,
'Metadata': {'Condition': {'Outcome': 'True'}}}]
# list all configurable pipeline parameters
# params can be altered during selective execution
parameters = sm_client.list_pipeline_parameters_for_execution(
    PipelineExecutionArn=latest_pipeline_arn
)['PipelineParameters']
print(parameters)
>>>
[{'Name': 'XGBNumRounds', 'Value': '120'},
{'Name': 'XGBSubSample', 'Value': '0.9'},
{'Name': 'XGBGamma', 'Value': '2'},
{'Name': 'TrainingInstanceCount', 'Value': '1'},
{'Name': 'XGBMinChildWeight', 'Value': '4'},
{'Name': 'XGBETA', 'Value': '0.25'},
{'Name': 'ApprovalStatus', 'Value': 'PendingManualApproval'},
{'Name': 'ProcessingInstanceCount', 'Value': '1'},
{'Name': 'ProcessingInstanceType', 'Value': 'ml.t3.medium'},
{'Name': 'MseThreshold', 'Value': '6'},
{'Name': 'ModelPath',
'Value': 's3://sagemaker-us-east-1-123123123123/Abalone/models/'},
{'Name': 'XGBMaxDepth', 'Value': '12'},
{'Name': 'TrainingInstanceType', 'Value': 'ml.c5.xlarge'},
{'Name': 'InputData',
'Value': 's3://sagemaker-us-east-1-123123123123/sample-dataset/abalone/abalone.csv'}]
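The parameter list printed above can be turned into the dictionary form used when starting a run. The following sketch starts from the reference run's values and applies a few overrides; the helper name is hypothetical, and rejecting unknown names is a design choice made here to catch typos early.

```python
def build_parameter_overrides(parameters, overrides=None):
    """Merge reference-run parameters with new values for a selective run.

    `parameters` is a list of {'Name': ..., 'Value': ...} dicts, as printed above.
    """
    overrides = overrides or {}
    merged = {p["Name"]: p["Value"] for p in parameters}
    unknown = sorted(set(overrides) - set(merged))
    if unknown:
        raise KeyError(f"Unknown pipeline parameter(s): {unknown}")
    merged.update(overrides)
    return merged


# Example with two of the parameters shown above
reference_params = [
    {"Name": "XGBNumRounds", "Value": "120"},
    {"Name": "XGBETA", "Value": "0.25"},
]
new_params = build_parameter_overrides(reference_params, {"XGBETA": 0.1})
```

The resulting dictionary can then be passed as the parameters argument when the pipeline run is started.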
Use cases
In this section, we present a few scenarios where Selective Execution can potentially save time and resources. We use a typical pipeline flow, which includes steps such as data extraction, training, evaluation, model registration and deployment, as a reference to demonstrate the advantages of Selective Execution.
SageMaker Pipelines allows you to define runtime parameters for your pipeline run using pipeline parameters. When a new run is triggered, it typically runs the entire pipeline from start to finish. However, if step caching is enabled, SageMaker Pipelines will attempt to find a previous run of the current pipeline step with the same attribute values. If a match is found, SageMaker Pipelines will use the outputs from the previous run instead of recomputing the step. Note that even with step caching enabled, SageMaker Pipelines will still run the entire workflow to the end by default.
With the release of the Selective Execution feature, you can now rerun an entire pipeline workflow or selectively run a subset of steps using a prior pipeline ARN. This can be done even without step caching enabled. The following use cases illustrate the various ways you can use Selective Execution.
Use case 1: Run a single step
Data scientists often focus on the training stage of an MLOps pipeline and don’t want to worry about the preprocessing or deployment steps. Selective Execution allows data scientists to focus on just the training step and modify training parameters or hyperparameters on the fly to improve the model. This can save time and reduce cost because compute resources are only utilized for running user-selected pipeline steps. See the following code:
# select a reference pipeline execution ARN and the subset of steps to run
selective_execution_config = SelectiveExecutionConfig(
    source_pipeline_execution_arn="arn:aws:sagemaker:us-east-1:123123123123:pipeline/AbalonePipeline/execution/9e3ljoql7s0n",
    selected_steps=["Abalone-Train"]
)
# load the existing pipeline by name (assumes pipeline_name from the Setup section)
pipeline = Pipeline(name=pipeline_name)
# start execution of pipeline subset
select_execution = pipeline.start(
    selective_execution_config=selective_execution_config,
    parameters={
        "XGBNumRounds": 120,
        "XGBSubSample": 0.9,
        "XGBGamma": 2,
        "XGBMinChildWeight": 4,
        "XGBETA": 0.25,
        "XGBMaxDepth": 12
    }
)
The following figures illustrate the pipeline with one step in process and then complete.
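To confirm which steps were actually run, you can inspect the step list of the new execution. In the sample output shown in the Setup section, the Abalone-Preprocess step carries a SelectiveExecutionResult entry that points to a source execution. The following sketch assumes this entry marks a step whose results were reused; the helper name is hypothetical.

```python
def split_reused_and_run(execution_steps):
    """Separate steps reused from a reference run from steps that ran again.

    `execution_steps` is a list of dicts shaped like the PipelineExecutionSteps
    entries shown in the Setup section.
    """
    reused, ran = [], []
    for step in execution_steps:
        if "SelectiveExecutionResult" in step:
            reused.append(step["StepName"])
        else:
            ran.append(step["StepName"])
    return reused, ran
```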
Use case 2: Run multiple contiguous pipeline steps
Continuing with the previous use case, a data scientist wants to train a new model and evaluate its performance against a golden test dataset. This evaluation is crucial to ensure that the model meets rigorous guidelines for user acceptance testing (UAT) or production deployment. However, the data scientist doesn’t want to run the entire pipeline workflow or deploy the model. They can use Selective Execution to focus solely on the training and evaluation steps, saving time and resources while still getting the validation results they need:
# select a reference pipeline execution ARN and the subset of steps to run
selective_execution_config = SelectiveExecutionConfig(
    source_pipeline_execution_arn="arn:aws:sagemaker:us-east-1:123123123123:pipeline/AbalonePipeline/execution/9e3ljoql7s0n",
    selected_steps=["Abalone-Train", "Abalone-Evaluate"]
)
# start execution of pipeline subset
select_execution = pipeline.start(
    selective_execution_config=selective_execution_config,
    parameters={
        "ProcessingInstanceType": "ml.t3.medium",
        "XGBNumRounds": 120,
        "XGBSubSample": 0.9,
        "XGBGamma": 2,
        "XGBMinChildWeight": 4,
        "XGBETA": 0.25,
        "XGBMaxDepth": 12
    }
)
Use case 3: Update and rerun failed pipeline steps
You can use Selective Execution to rerun failed steps within a pipeline or resume the run of a pipeline from a failed step onwards. This can be useful for troubleshooting and debugging failed steps because it allows developers to focus on the specific issues that need to be addressed. This can lead to more efficient problem-solving and faster iteration times. The following example illustrates how you can choose to rerun just the failed step of a pipeline.
# select a previously failed pipeline execution ARN
selective_execution_config = SelectiveExecutionConfig(
    source_pipeline_execution_arn="arn:aws:sagemaker:us-east-1:123123123123:pipeline/AbalonePipeline/execution/fvsmu7m7ki3q",
    selected_steps=["Abalone-Evaluate"]
)
# start execution of failed pipeline subset
select_execution = pipeline.start(
    selective_execution_config=selective_execution_config
)
Alternatively, a data scientist can resume a pipeline from a failed step to the end of the workflow by specifying the failed step and all the steps that follow it in the SelectiveExecutionConfig.
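Building that list of steps by hand is error-prone, so it can help to derive it. The following sketch assumes a linear pipeline whose step order you supply yourself, because steps that never started may be missing from the execution's step list. The helper name is hypothetical.

```python
def steps_from_first_failure(ordered_step_names, execution_steps):
    """Return the first failed step and every step after it, in pipeline order.

    `ordered_step_names` is the pipeline's step order as you defined it;
    `execution_steps` is the step list of the failed reference run.
    """
    status = {s["StepName"]: s["StepStatus"] for s in execution_steps}
    for index, name in enumerate(ordered_step_names):
        if status.get(name) == "Failed":
            return ordered_step_names[index:]
    return []


# Example with the step names used in this post
order = ["Abalone-Preprocess", "Abalone-Train", "Abalone-Evaluate", "Abalone-MSECheck"]
failed_run_steps = [
    {"StepName": "Abalone-Preprocess", "StepStatus": "Succeeded"},
    {"StepName": "Abalone-Train", "StepStatus": "Succeeded"},
    {"StepName": "Abalone-Evaluate", "StepStatus": "Failed"},
]
to_rerun = steps_from_first_failure(order, failed_run_steps)
```

The returned list can be passed as selected_steps. For pipelines with branches, the list needs to be checked against the DAG rather than a simple ordering.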
Use case 4: Pipeline coverage
In some pipelines, certain branches are less frequently run than others. For example, there might be a branch that only runs when a specific condition fails. It’s important to test these branches thoroughly to ensure that they work as expected when a failure does occur. By testing these less frequently run branches, developers can verify that their pipeline is robust and that error-handling mechanisms effectively maintain the desired workflow and produce reliable results.
selective_execution_config = SelectiveExecutionConfig(
    source_pipeline_execution_arn="arn:aws:sagemaker:us-east-1:123123123123:pipeline/AbalonePipeline/execution/9e3ljoql7s0n",
    selected_steps=["Abalone-Train", "Abalone-Evaluate", "Abalone-MSECheck", "Abalone-FailNotify"]
)
Conclusion
In this post, we discussed the Selective Execution feature of SageMaker Pipelines, which empowers you to selectively run specific steps of your ML workflows. This capability leads to significant time and computational resource savings. We provided some sample code in the GitHub repo that demonstrates how to use Selective Execution and presented various scenarios where it can be advantageous for users. If you would like to learn more about Selective Execution, refer to our Developer Guide and API Reference Guide.
To explore the available steps within the SageMaker Pipelines workflow in more detail, refer to Amazon SageMaker Model Building Pipeline and SageMaker Workflows. Additionally, you can find more examples showcasing different use cases and implementation approaches using SageMaker Pipelines in the AWS SageMaker Examples GitHub repository. These resources can further enhance your understanding and help you take advantage of the full potential of SageMaker Pipelines and Selective Execution in your current and future ML projects.
About the Authors
Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes. In his free time, he enjoys playing chess and traveling.
Akhil Numarsu is a Sr. Product Manager-Technical focused on helping teams accelerate ML outcomes through efficient tools and services in the cloud. He enjoys playing Table Tennis and is a sports fan.
Nishant Krishnamoorthy is a Sr. Software Development Engineer with Amazon Stores. He holds a master’s degree in Computer Science and currently focuses on accelerating ML Adoption in different orgs within Amazon by building and operationalizing ML solutions on SageMaker.