The center will support UIUC researchers in their development of novel approaches to conversational AI systems.
Enable fully homomorphic encryption with Amazon SageMaker endpoints for secure, real-time inferencing
This is a joint post co-written by Leidos and AWS. Leidos is a FORTUNE 500 science and technology solutions leader working to address some of the world’s toughest challenges in the defense, intelligence, homeland security, civil, and healthcare markets.
Leidos has partnered with AWS to develop an approach to privacy-preserving, confidential machine learning (ML) modeling where you build cloud-enabled, encrypted pipelines.
Homomorphic encryption is a new approach to encryption that allows computations and analytical functions to be run on encrypted data without first having to decrypt it, preserving privacy in cases where policy states that data should never be decrypted. Fully homomorphic encryption (FHE) is the strongest notion of this type of approach, and it allows you to unlock the value of your data where zero trust is key. The core requirement is that the data must be representable as numbers through an encoding technique, which can be applied to numerical, textual, and image-based datasets. Data encrypted with FHE is larger in size, so testing must be done for applications that need the inference to be performed in near-real time or with size limitations. It’s also important to phrase all computations as linear equations.
In this post, we show how to activate privacy-preserving ML predictions for the most highly regulated environments. The predictions (inference) use encrypted data and the results are only decrypted by the end consumer (client side).
To demonstrate this, we show an example of customizing an Amazon SageMaker Scikit-learn container (an open-sourced deep learning container) to enable a deployed endpoint to accept client-side encrypted inference requests. Although this example shows how to perform this for inference operations, you can extend the solution to training and other ML steps.
Endpoints are deployed with a couple of clicks or lines of code using SageMaker, which simplifies the process for developers and ML experts to build and train ML and deep learning models in the cloud. Models built using SageMaker can then be deployed as real-time endpoints, which is critical for inference workloads where you have real-time, steady-state, low-latency requirements. Applications and services can call the deployed endpoint directly or through a deployed serverless Amazon API Gateway architecture. To learn more about real-time endpoint architectural best practices, refer to Creating a machine learning-powered REST API with Amazon API Gateway mapping templates and Amazon SageMaker. The following figure shows both versions of these patterns.
In both of these patterns, encryption in transit provides confidentiality as the data flows through the services to perform the inference operation. When received by the SageMaker endpoint, the data is generally decrypted to perform the inference operation at runtime, and is inaccessible to any external code and processes. To achieve additional levels of protection, FHE enables the inference operation to generate encrypted results for which the results can be decrypted by a trusted application or client.
More on fully homomorphic encryption
FHE enables systems to perform computations on encrypted data. The resulting computations, when decrypted, are controllably close to those produced without the encryption process. FHE can result in a small mathematical imprecision, similar to a floating point error, due to noise injected into the computation. This imprecision is controlled by selecting appropriate FHE encryption parameters, which are problem-specific and tuned. For more information, check out the video How would you explain homomorphic encryption?
The following diagram provides an example implementation of an FHE system.
In this system, you or your trusted client can do the following:
- Encrypt the data using a public key FHE scheme. There are a couple of different acceptable schemes; in this example, we’re using the CKKS scheme. To learn more about the FHE public key encryption process we chose, refer to CKKS explained.
- Send client-side encrypted data to a provider or server for processing.
- Perform model inference on encrypted data; with FHE, no decryption is required.
- Encrypted results are returned to the caller and then decrypted to reveal your result using a private key that’s only available to you or your trusted users within the client.
We’ve used the preceding architecture to set up an example using SageMaker endpoints, Pyfhel as an FHE API wrapper simplifying the integration with ML applications, and SEAL as our underlying FHE encryption toolkit.
Solution overview
We’ve built out an example of a scalable FHE pipeline in AWS using an SKLearn logistic regression deep learning container with the Iris dataset. We perform data exploration and feature engineering using a SageMaker notebook, and then perform model training using a SageMaker training job. The resulting model is deployed to a SageMaker real-time endpoint for use by client services, as shown in the following diagram.
In this architecture, only the client application sees unencrypted data. The data processed through the model for inferencing remains encrypted throughout its lifecycle, even at runtime within the processor in the isolated AWS Nitro Enclave. In the following sections, we walk through the code to build this pipeline.
Prerequisites
To follow along, we assume you have launched a SageMaker notebook with an AWS Identity and Access Management (IAM) role with the AmazonSageMakerFullAccess managed policy.
Train the model
The following diagram illustrates the model training workflow.
The following code shows how we first prepare the data for training using SageMaker notebooks by pulling in our training dataset, performing the necessary cleaning operations, and then uploading the data to an Amazon Simple Storage Service (Amazon S3) bucket. At this stage, you may also need to do additional feature engineering of your dataset or integrate with different offline feature stores.
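A minimal sketch of these preparation steps, assuming the default session bucket and an illustrative key prefix (the actual notebook may clean the data differently):

```python
import pandas as pd
import sagemaker
from sklearn.datasets import load_iris

session = sagemaker.Session()
bucket = session.default_bucket()  # placeholder: any S3 bucket the notebook role can write to

# Pull in the Iris dataset, apply light cleaning, and write it out as a headerless CSV
# with the label in the first column (a common layout; the actual training script may differ).
iris = load_iris(as_frame=True)
train_df = pd.concat([iris.target.rename("label"), iris.data], axis=1).dropna()
train_df.to_csv("train.csv", index=False, header=False)

# Upload the training data to Amazon S3 for the SageMaker training job.
train_s3_uri = session.upload_data("train.csv", bucket=bucket, key_prefix="fhe-demo/train")
print(train_s3_uri)
```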
In this example, we’re using script-mode on a natively supported framework within SageMaker (scikit-learn), where we instantiate our default SageMaker SKLearn estimator with a custom training script to handle the encrypted data during inference. To see more information about natively supported frameworks and script mode, refer to Use Machine Learning Frameworks, Python, and R with Amazon SageMaker.
Finally, we train our model on the dataset and deploy our trained model to the instance type of our choice.
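A sketch of the script-mode training and deployment flow; the fhe_train.py entry point is the script referenced later in the post, while the framework version and instance types are illustrative:

```python
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

role = sagemaker.get_execution_role()

# Script-mode SKLearn estimator with a custom training/inference script.
sklearn_estimator = SKLearn(
    entry_point="fhe_train.py",
    framework_version="1.0-1",   # illustrative scikit-learn container version
    instance_type="ml.m5.xlarge",
    role=role,
)
sklearn_estimator.fit({"train": train_s3_uri})

# Deploy the trained model to a real-time endpoint on the instance type of our choice.
predictor = sklearn_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.xlarge",
)
```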
At this point, we’ve trained a custom SKLearn FHE model and deployed it to a SageMaker real-time inference endpoint that’s ready to accept encrypted data.
Encrypt and send client data
The following diagram illustrates the workflow of encrypting and sending client data to the model.
In most cases, the payload of the call to the inference endpoint would contain the encrypted data directly, rather than storing it in Amazon S3 first. We use Amazon S3 in this example because we’ve batched a large number of records together for the inference call. In practice, this batch size would be smaller, or batch transform would be used instead. Using Amazon S3 as an intermediary isn’t required for FHE.
Now that the inference endpoint has been set up, we can start sending data over. We normally use different test and training datasets, but for this example we use the same training dataset.
First, we load the Iris dataset on the client side. Next, we set up the FHE context using Pyfhel. We selected Pyfhel for this process because it’s simple to install and work with, includes popular FHE schemes, and relies on SEAL, a trusted open-sourced encryption implementation, under the hood. In this example, we send the encrypted data, along with the public key information for this FHE scheme, to the server. This enables the endpoint to encrypt its results with the necessary FHE parameters, but doesn’t give it the ability to decrypt the incoming data. The private key remains only with the client, which has the ability to decrypt the results.
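A minimal client-side sketch of this setup using Pyfhel with the CKKS scheme; the encryption parameters are illustrative and must be tuned for your own accuracy and security requirements:

```python
import numpy as np
from Pyfhel import Pyfhel
from sklearn.datasets import load_iris

# Client side: generate the CKKS context and keys. The private key never leaves the client.
HE = Pyfhel()
HE.contextGen(scheme="CKKS", n=2**14, scale=2**30, qi_sizes=[60, 30, 30, 30, 60])
HE.keyGen()
HE.relinKeyGen()  # only needed if the server multiplies ciphertexts together

# Encrypt one ciphertext per feature column, packing all samples into the CKKS slots.
X = load_iris().data.astype(np.float64)
encrypted_columns = [HE.encryptFrac(X[:, j].copy()).to_bytes() for j in range(X.shape[1])]

# Everything the server needs: context, public key, and ciphertexts, but no private key.
payload = {
    "context": HE.to_bytes_context(),
    "public_key": HE.to_bytes_public_key(),
    "relin_key": HE.to_bytes_relin_key(),
    "encrypted_columns": encrypted_columns,
    "n_samples": X.shape[0],
}
```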
After we encrypt our data, we put together a complete data dictionary, including the relevant keys and encrypted data, to be stored on Amazon S3. Afterwards, the model makes its predictions over the encrypted data from the client, as shown in the following code. Notice we don’t transmit the private key, so the model host isn’t able to decrypt the data. In this example, we’re passing the data through as an S3 object; alternatively, that data may be sent directly to the SageMaker endpoint. As a real-time endpoint, the payload contains the data parameter in the body of the request, as described in the SageMaker documentation.
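A sketch of how the client might call the endpoint once the encrypted payload has been uploaded; the request schema (bucket and key fields) is an assumption defined by the custom inference script, not a fixed SageMaker format:

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical request body: tell the endpoint where the encrypted payload lives in S3.
request = {"bucket": "my-fhe-demo-bucket", "key": "fhe-demo/encrypted_payload.pkl"}

response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    ContentType="application/json",
    Body=json.dumps(request),
)
result = json.loads(response["Body"].read())
```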
The following screenshot shows the central prediction within fhe_train.py (the appendix shows the entire training script).
We’re computing the results of our encrypted logistic regression. This code computes an encrypted scalar product for each possible class and returns the results to the client. The results are the predicted logits for each class across all examples.
Client returns decrypted results
The following diagram illustrates the workflow of the client retrieving their encrypted result and decrypting it (with the private key that only they have access to) to reveal the inference result.
In this example, results are stored on Amazon S3, but generally this would be returned through the payload of the real-time endpoint. Using Amazon S3 as an intermediary isn’t required for FHE.
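Continuing the earlier Pyfhel sketch, the client decrypts the returned per-class logits with its private key and picks the highest-scoring class; the structure of the returned object (a list of serialized ciphertexts, one per class) is an assumption:

```python
import numpy as np
from Pyfhel import PyCtxt

# encrypted_logits: list of serialized ciphertexts, one per class, retrieved from the result object.
n_samples = payload["n_samples"]
logits = np.stack(
    [HE.decryptFrac(PyCtxt(pyfhel=HE, bytestring=ctxt_bytes))[:n_samples]
     for ctxt_bytes in encrypted_logits],
    axis=1,
)
predicted_classes = logits.argmax(axis=1)
```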
The inference results will be controllably close to the results the client would have computed themselves without using FHE.
Clean up
We end by deleting the endpoint we created, to make sure there isn’t any unused compute left running afterwards.
Results and considerations
One of the common drawbacks of using FHE on top of models is that it adds computational overhead, which, in practice, makes the resulting model too slow for interactive use cases. But in cases where the data is highly sensitive, it might be worthwhile to accept this latency trade-off. For our simple logistic regression, we are able to process 140 input data samples within 60 seconds and see linear performance. The following chart includes the total end-to-end time, including the time taken by the client to encrypt the input and decrypt the results. It also includes the Amazon S3 round trip, which adds latency and isn’t required for these cases.
We see linear scaling as we increase the number of examples from 1 to 150. This is expected because each example is encrypted independently from each other, so we expect a linear increase in computation, with a fixed setup cost.
This also means that you can scale your inference fleet horizontally for greater request throughput behind your SageMaker endpoint. You can use Amazon SageMaker Inference Recommender to cost optimize your fleet depending on your business needs.
Conclusion
And there you have it: fully homomorphic encryption ML for a SKLearn logistic regression model that you can set up with a few lines of code. With some customization, you can implement this same encryption process for different model types and frameworks, independent of the training data.
If you’d like to learn more about building an ML solution that uses homomorphic encryption, reach out to your AWS account team or partner, Leidos, to learn more. You can also refer to the following resources for more examples:
- Capabilities on the Leidos website.
- The AWS re:Invent 2020 breakout session Privacy-preserving machine learning, by Joan Feigenbaum, Amazon Scholar with the AWS Cryptographic Algorithms Group. In this session, Feigenbaum describes two prototypes that were built in 2020.
- Amazon Science articles related to homomorphic encryption.
- What is cryptographic computing? A conversation with two AWS experts, featuring Joan Feigenbaum, Amazon Scholar, AWS Cryptography, and Bill Horne, Principal Product Manager, AWS Cryptography.
The content and opinions in this post are those of the third-party authors, and AWS is not responsible for the content or accuracy of this post.
Appendix
The full training script is as follows:
About the Authors
Liv d’Aliberti is a researcher within the Leidos AI/ML Accelerator under the Office of Technology. Their research focuses on privacy-preserving machine learning.
Manbir Gulati is a researcher within the Leidos AI/ML Accelerator under the Office of Technology. His research focuses on the intersection of cybersecurity and emerging AI threats.
Joe Kovba is a Cloud Center of Excellence Practice Lead within the Leidos Digital Modernization Accelerator under the Office of Technology. In his free time, he enjoys refereeing football games and playing softball.
Ben Snively is a Public Sector Specialist Solutions Architect. He works with government, non-profit, and education customers on big data and analytical projects, helping them build solutions using AWS. In his spare time, he adds IoT sensors throughout his house and runs analytics on them.
Sami Hoda is a Senior Solutions Architect in the Partners Consulting division covering the Worldwide Public Sector. Sami is passionate about projects where equal parts design thinking, innovation, and emotional intelligence can be used to solve problems for and impact people in need.
A new paradigm for partnership between industry and academia
How Amazon is shaping a set of initiatives to enable academia-based talent to harmonize their passions, life stations, and career ambitions.
Automate Amazon Rekognition Custom Labels model training and deployment using AWS Step Functions
With Amazon Rekognition Custom Labels, you can have Amazon Rekognition train a custom model for object detection or image classification specific to your business needs. For example, Rekognition Custom Labels can find your logo in social media posts, identify your products on store shelves, classify machine parts in an assembly line, distinguish healthy and infected plants, or detect animated characters in videos.
Developing a Rekognition Custom Labels model to analyze images is a significant undertaking that requires time, expertise, and resources, often taking months to complete. Additionally, it often requires thousands or tens of thousands of hand-labeled images to provide the model with enough data to accurately make decisions. This data can take months to gather and requires large teams of labelers to prepare it for use in machine learning (ML).
With Rekognition Custom Labels, we take care of the heavy lifting for you. Rekognition Custom Labels builds off of the existing capabilities of Amazon Rekognition, which is already trained on tens of millions of images across many categories. Instead of thousands of images, you simply need to upload a small set of training images (typically a few hundred images or less) that are specific to your use case via our easy-to-use console. If your images are already labeled, Amazon Rekognition can begin training in just a few clicks. If not, you can label them directly within the Amazon Rekognition labeling interface, or use Amazon SageMaker Ground Truth to label them for you. After Amazon Rekognition begins training from your image set, it produces a custom image analysis model for you in just a few hours. Behind the scenes, Rekognition Custom Labels automatically loads and inspects the training data, selects the right ML algorithms, trains a model, and provides model performance metrics. You can then use your custom model via the Rekognition Custom Labels API and integrate it into your applications.
However, building a Rekognition Custom Labels model and hosting it for real-time predictions involves several steps: creating a project, creating the training and validation datasets, training the model, evaluating the model, and then creating an endpoint. After the model is deployed for inference, you might have to retrain the model when new data becomes available or if feedback is received from real-world inference. Automating the whole workflow can help reduce manual work.
In this post, we show how you can use AWS Step Functions to build and automate the workflow. Step Functions is a visual workflow service that helps developers use AWS services to build distributed applications, automate processes, orchestrate microservices, and create data and ML pipelines.
Solution overview
The Step Functions workflow is as follows:
- We first create an Amazon Rekognition project.
- In parallel, we create the training and the validation datasets using existing datasets. We can use the following methods:
- Import a folder structure from Amazon Simple Storage Service (Amazon S3) with the folders representing the labels.
- Use a local computer.
- Use Ground Truth.
- Create a dataset using an existing dataset with the AWS SDK.
- Create a dataset with a manifest file with the AWS SDK.
- After the datasets are created, we train a Custom Labels model using the CreateProjectVersion API (see the sketch after this list). This could take from minutes to hours to complete.
- After the model is trained, we evaluate the model using the F1 score output from the previous step. We use the F1 score as our evaluation metric because it provides a balance between precision and recall. You can also use precision or recall as your model evaluation metrics. For more information on custom label evaluation metrics, refer to Metrics for evaluating your model.
- We then start to use the model for predictions if we are satisfied with the F1 score.
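The state machine drives these steps through Step Functions service integrations; for orientation, the following is a minimal boto3 sketch of the same train, evaluate, and deploy sequence (the project ARN, bucket, version name, and the 0.8 threshold are placeholders):

```python
import boto3

rekognition = boto3.client("rekognition")

# Start training a new model version for a project whose datasets already exist.
response = rekognition.create_project_version(
    ProjectArn=project_arn,  # placeholder
    VersionName="v1",
    OutputConfig={"S3Bucket": "my-custom-labels-bucket", "S3KeyPrefix": "output/"},
)
project_version_arn = response["ProjectVersionArn"]

# Wait for training to finish, then read the F1 score used as the evaluation metric.
waiter = rekognition.get_waiter("project_version_training_completed")
waiter.wait(ProjectArn=project_arn, VersionNames=["v1"])
description = rekognition.describe_project_versions(ProjectArn=project_arn, VersionNames=["v1"])
f1_score = description["ProjectVersionDescriptions"][0]["EvaluationResult"]["F1Score"]

# Deploy an inference endpoint only if the model meets the evaluation criteria.
if f1_score >= 0.8:
    rekognition.start_project_version(ProjectVersionArn=project_version_arn, MinInferenceUnits=1)
```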
The following diagram illustrates the Step Functions workflow.
Prerequisites
Before deploying the workflow, we need to create the existing training and validation datasets. Complete the following steps:
- First, create an Amazon Rekognition project.
- Then, create the training and validation datasets.
- Finally, install the AWS SAM CLI.
Deploy the workflow
To deploy the workflow, clone the GitHub repository:
These commands build, package and deploy your application to AWS, with a series of prompts as explained in the repository.
Run the workflow
To test the workflow, navigate to the deployed workflow on the Step Functions console, then choose Start execution.
The workflow could take a few minutes to a few hours to complete. If the model passes the evaluation criteria, an endpoint for the model is created in Amazon Rekognition. If the model doesn’t pass the evaluation criteria or the training failed, the workflow fails. You can check the status of the workflow on the Step Functions console. For more information, refer to Viewing and debugging executions on the Step Functions console.
Perform model predictions
To perform predictions against the model, you can call the Amazon Rekognition DetectCustomLabels API. To invoke this API, the caller needs to have the necessary AWS Identity and Access Management (IAM) permissions. For more details on performing predictions using this API, refer to Analyzing an image with a trained model.
However, if you need to expose the DetectCustomLabels API publicly, you can front the DetectCustomLabels API with Amazon API Gateway. API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. API Gateway acts as the front door for your DetectCustomLabels API, as shown in the following architecture diagram.
API Gateway forwards the user’s inference request to AWS Lambda. Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers. Lambda receives the API request and calls the Amazon Rekognition DetectCustomLabels API with the necessary IAM permissions. For more information on how to set up API Gateway with Lambda integration, refer to Set up Lambda proxy integrations in API Gateway.
The following is an example Lambda function code to call the DetectCustomLabels API:
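A minimal sketch of such a handler; the actual function in the repository may differ, and the environment variable name and confidence threshold are assumptions:

```python
import base64
import json
import os

import boto3

rekognition = boto3.client("rekognition")
PROJECT_VERSION_ARN = os.environ["PROJECT_VERSION_ARN"]  # assumed environment variable


def lambda_handler(event, context):
    # With Lambda proxy integration, API Gateway delivers the image as a base64-encoded body.
    image_bytes = base64.b64decode(event["body"])

    response = rekognition.detect_custom_labels(
        ProjectVersionArn=PROJECT_VERSION_ARN,
        Image={"Bytes": image_bytes},
        MinConfidence=70,
    )

    return {"statusCode": 200, "body": json.dumps(response["CustomLabels"])}
```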
Clean up
To delete the workflow, use the AWS SAM CLI:
To delete the Rekognition Custom Labels model, you can either use the Amazon Rekognition console or the AWS SDK. For more information, refer to Deleting an Amazon Rekognition Custom Labels model.
Conclusion
In this post, we walked through a Step Functions workflow to create a dataset and then train, evaluate, and use a Rekognition Custom Labels model. The workflow allows application developers and ML engineers to automate the custom label classification steps for any computer vision use case. The code for the workflow is open-sourced.
For more serverless learning resources, visit Serverless Land. To learn more about Rekognition custom labels, visit Amazon Rekognition Custom Labels.
About the Author
Veda Raman is a Senior Specialist Solutions Architect for machine learning based in Maryland. Veda works with customers to help them architect efficient, secure and scalable machine learning applications. Veda is interested in helping customers leverage serverless technologies for Machine learning.
Build a machine learning model to predict student performance using Amazon SageMaker Canvas
There has been a paradigm change in the mindshare of education customers who are now willing to explore new technologies and analytics. Universities and other higher learning institutions have collected massive amounts of data over the years, and now they are exploring options to use that data for deeper insights and better educational outcomes.
You can use machine learning (ML) to generate these insights and build predictive models. Educators can also use ML to identify challenges in learning outcomes, increase success and retention among students, and broaden the reach and impact of online learning content.
However, higher education institutions often lack ML professionals and data scientists. As a result, they are looking for solutions that can be quickly adopted by their existing business analysts.
Amazon SageMaker Canvas is a low-code/no-code ML service that enables business analysts to perform data preparation and transformation, build ML models, and deploy these models into a governed workflow. Analysts can perform all these activities with a few clicks and without writing a single piece of code.
In this post, we show how to use SageMaker Canvas to build an ML model to predict student performance.
Solution overview
For this post, we discuss a specific use case: how universities can predict student dropout or continuation ahead of final exams using SageMaker Canvas. We predict whether the student will drop out, enroll (continue), or graduate at the end of the course. We can use the outcome from the prediction to take proactive action to improve student performance and prevent potential dropouts.
The solution includes the following components:
- Data ingestion – Importing the data from your local computer to SageMaker Canvas
- Data preparation – Clean and transform the data (if required) within SageMaker Canvas
- Build the ML model – Build the prediction model inside SageMaker Canvas to predict student performance
- Prediction – Generate batch or single predictions
- Collaboration – Analysts using SageMaker Canvas and data scientists using Amazon SageMaker Studio can interact while working in their respective settings, sharing domain knowledge and offering expert feedback to improve models
The following diagram illustrates the solution architecture.
Prerequisites
For this post, you should complete the following prerequisites:
- Have an AWS account.
- Set up SageMaker Canvas. For instructions, refer to Prerequisites for setting up Amazon SageMaker Canvas.
- Download the following student dataset to your local computer.
The dataset contains student background information like demographics, academic journey, economic background, and more. The dataset contains 37 columns, out of which 36 are features and 1 is a label. The label column name is Target, and it contains categorical data: dropout, enrolled, and graduate.
The dataset comes under the Attribution 4.0 International (CC BY 4.0) license and is free to share and adapt.
Data ingestion
The first step for any ML process is to ingest the data. Complete the following steps:
- On the SageMaker Canvas console, choose Import.
- Import the Dropout_Academic Success - Sheet1.csv dataset into SageMaker Canvas.
- Select the dataset and choose Create a model.
- Name the model student-performance-model.
Data preparation
For ML problems, data scientists analyze the dataset for outliers, handle the missing values, add or remove fields, and perform other transformations. Analysts can perform the same actions in SageMaker Canvas using the visual interface. Note that major data transformation is out of scope for this post.
In the following screenshot, the first highlighted section (annotated as 1 in the screenshot) shows the options available with SageMaker Canvas. Analysts can apply these actions on the dataset and can even explore the dataset for more details by choosing Data visualizer.
The second highlighted section (annotated as 2 in the screenshot) indicates that the dataset doesn’t have any missing or mismatched records.
Build the ML model
To proceed with training and building the ML model, we need to choose the column that needs to be predicted.
- On the SageMaker Canvas interface, for Select a column to predict, choose Target.
As soon as you choose the target column, it will prompt you to validate data.
- Choose Validate, and within a few minutes SageMaker Canvas will finish validating your data.
Now it’s time to build the model. You have two options: Quick build and Standard build. Analysts can choose either option based on their requirements.
- For this post, we choose Standard build.
Apart from speed and accuracy, one major difference between Standard build and Quick build is that Standard build provides the capability to share the model with data scientists, which Quick build doesn’t.
SageMaker Canvas took approximately 25 minutes to train and build the model. Your models may take more or less time, depending on factors such as input data size and complexity. The accuracy of the model was around 80%, as shown in the following screenshot. You can explore the bottom section to see the impact of each column on the prediction.
So far, we have uploaded the dataset, prepared the dataset, and built the prediction model to measure student performance. Next, we have two options:
- Generate a batch or single prediction
- Share this model with the data scientists for feedback or improvements
Prediction
Choose Predict to start generating predictions. You can choose from two options:
- Batch prediction – You can upload datasets here and let SageMaker Canvas predict the performance for the students. You can use these predictions to take proactive actions.
- Single prediction – In this option, you provide the values for a single student. SageMaker Canvas will predict the performance for that particular student.
Collaboration
In some cases, you as an analyst might want to get feedback from expert data scientists on the model before proceeding with the prediction. To do so, choose Share and specify the Studio user to share with.
Then the data scientist can complete the following steps:
- On the Studio console, in the navigation pane, under Models, choose Shared models.
- Choose View model to open the model.
They can update the model in either of the following ways:
- Share a new model – The data scientist can change the data transformations, retrain the model, and then share the model.
- Share an alternate model – The data scientist can select an alternate model from the list of trained Amazon SageMaker Autopilot models and share that back with the SageMaker Canvas user.
For this example, we choose Share an alternate model and, treating inference latency as the key parameter, share the second-best model with the SageMaker Canvas user.
The data scientist can look at other parameters like F1 score, precision, recall, and log loss as decision criteria to share an alternate model with the SageMaker Canvas user.
In this scenario, the best model has an accuracy of 80% and inference latency of 0.781 seconds, whereas the second-best model has an accuracy of 79.9% and inference latency of 0.327 seconds.
- Choose Share to share an alternate model with the SageMaker Canvas user.
- Add the SageMaker Canvas user to share the model with.
- Add an optional note, then choose Share.
- Choose an alternate model to share.
- Add feedback and choose Share to share the model with the SageMaker Canvas user.
After the data scientist has shared an updated model with you, you will get a notification and SageMaker Canvas will start importing the model into the console.
SageMaker Canvas will take a moment to import the updated model, and then the updated model will reflect as a new version (V3 in this case).
You can now switch between the versions and generate predictions from any version.
If an administrator is worried about managing permissions for the analysts and data scientists, they can use Amazon SageMaker Role Manager.
Clean up
To avoid incurring future charges, delete the resources you created while following this post. SageMaker Canvas bills you for the duration of the session, and we recommend logging out of Canvas when you’re not using it. Refer to Logging out of Amazon SageMaker Canvas for more details.
Conclusion
In this post, we discussed how SageMaker Canvas can help higher learning institutions use ML capabilities without requiring ML expertise. In our example, we showed how an analyst can quickly build a highly accurate predictive ML model without writing any code. The university can now act on those insights by specifically targeting students at risk of dropping out of a course with individualized attention and resources, benefitting both parties.
We demonstrated the steps starting from loading the data into SageMaker Canvas, building the model in Canvas, and receiving the feedback from data scientists via Studio. The entire process was completed through web-based user interfaces.
To start your low-code/no-code ML journey, refer to Amazon SageMaker Canvas.
About the author
Ashutosh Kumar is a Solutions Architect with the Public Sector-Education Team. He is passionate about transforming businesses with digital solutions. He has good experience in databases, AI/ML, data analytics, compute, and storage.
Access Snowflake data using OAuth-based authentication in Amazon SageMaker Data Wrangler
In this post, we show how to configure a new OAuth-based authentication feature for using Snowflake in Amazon SageMaker Data Wrangler. Snowflake is a cloud data platform that provides data solutions for data warehousing to data science. Snowflake is an AWS Partner with multiple AWS accreditations, including AWS competencies in machine learning (ML), retail, and data and analytics.
Data Wrangler simplifies the data preparation and feature engineering process, reducing the time it takes from weeks to minutes by providing a single visual interface for data scientists to select and clean data, create features, and automate data preparation in ML workflows without writing any code. You can import data from multiple data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, Amazon EMR, and Snowflake. With this new feature, you can use your own identity provider (IdP) such as Okta, Azure AD, or Ping Federate to connect to Snowflake via Data Wrangler.
Solution overview
In the following sections, we provide steps for an administrator to set up the IdP, Snowflake, and Studio. We also detail the steps that data scientists can take to configure the data flow, analyze the data quality, and add data transformations. Finally, we show how to export the data flow and train a model using SageMaker Autopilot.
Prerequisites
For this walkthrough, you should have the following prerequisites:
- For admin:
- A Snowflake user with permissions to create storage integrations, and security integrations in Snowflake.
- An AWS account with permissions to create AWS Identity and Access Management (IAM) policies and roles.
- Access and permissions to configure the IdP to register the Data Wrangler application and set up the authorization server or API.
- For data scientist:
- An S3 bucket that Data Wrangler can use to output transformed data.
- Access to Amazon SageMaker, an instance of Amazon SageMaker Studio, and a user for Studio. For more information about prerequisites, see Get Started with Data Wrangler.
- An IAM role used for Studio with permissions to create and update secrets in AWS Secrets Manager.
Administrator setup
Instead of having your users directly enter their Snowflake credentials into Data Wrangler, you can have them use an IdP to access Snowflake.
The following steps are involved to enable Data Wrangler OAuth access to Snowflake:
- Configure the IdP.
- Configure Snowflake.
- Configure SageMaker Studio.
Configure the IdP
To set up your IdP, you must register the Data Wrangler application and set up your authorization server or API.
Register the Data Wrangler application within the IdP
Refer to the following documentation for the IdPs that Data Wrangler supports:
Use the documentation provided by your IdP to register your Data Wrangler application. The information and procedures in this section help you understand how to properly use the documentation provided by your IdP.
Specific customizations in addition to the steps in the respective guides are called out in the subsections.
- Select the configuration that starts the process of registering Data Wrangler as an application.
- Provide the users within the IdP access to Data Wrangler.
- Enable OAuth client authentication by storing the client credentials as a Secrets Manager secret.
- Specify a redirect URL using the following format: https://domain-ID.studio.AWS Region.sagemaker.aws/jupyter/default/lab.
You’re specifying the SageMaker domain ID and AWS Region that you’re using to run Data Wrangler. You must register a URL for each domain and Region where you’re running Data Wrangler. Users from a domain and Region that don’t have redirect URLs set up for them won’t be able to authenticate with the IdP to access the Snowflake connection.
- Make sure the authorization code and refresh token grant types are allowed for your Data Wrangler application.
Set up the authorization server or API within the IdP
Within your IdP, you must set up an authorization server or an application programming interface (API). For each user, the authorization server or the API sends tokens to Data Wrangler with Snowflake as the audience.
Snowflake uses the concept of roles that are distinct from IAM roles used in AWS. You must configure the IdP to use ANY Role to use the default role associated with the Snowflake account. For example, if a user has systems administrator as the default role in their Snowflake profile, the connection from Data Wrangler to Snowflake uses systems administrator as the role.
Use the following procedure to set up the authorization server or API within your IdP:
- From your IdP, begin the process of setting up the server or API.
- Configure the authorization server to use the authorization code and refresh token grant types.
- Specify the lifetime of the access token.
- Set the refresh token idle timeout.
The idle timeout is the time that the refresh token expires if it’s not used. If you’re scheduling jobs in Data Wrangler, we recommend making the idle timeout time greater than the frequency of the processing job. Otherwise, some processing jobs might fail because the refresh token expired before they could run. When the refresh token expires, the user must re-authenticate by accessing the connection that they’ve made to Snowflake through Data Wrangler.
Note that Data Wrangler doesn’t support rotating refresh tokens. Using rotating refresh tokens might result in access failures or users needing to log in frequently.
If the refresh token expires, your users must reauthenticate by accessing the connection that they’ve made to Snowflake through Data Wrangler.
- Specify session:role-any as the new scope.
For Azure AD, you must also specify a unique identifier for the scope.
After you’ve set up the OAuth provider, you provide Data Wrangler with the information it needs to connect to the provider. You can use the documentation from your IdP to get values for the following fields:
- Token URL – The URL of the token that the IdP sends to Data Wrangler
- Authorization URL – The URL of the authorization server of the IdP
- Client ID – The ID of the IdP
- Client secret – The secret that only the authorization server or API recognizes
- OAuth scope – This is for Azure AD only
Configure Snowflake
To configure Snowflake, complete the instructions in Import data from Snowflake.
Use the Snowflake documentation for your IdP to set up an external OAuth integration in Snowflake. See the previous section Register the Data Wrangler application within the IdP for more information on how to set up an external OAuth integration.
When you’re setting up the security integration in Snowflake, make sure you activate external_oauth_any_role_mode.
Configure SageMaker Studio
You store the fields and values in a Secrets Manager secret and add it to the Studio Lifecycle Configuration that you’re using for Data Wrangler. A Lifecycle Configuration is a shell script that automatically loads the credentials stored in the secret when the user logs into Studio. For information about creating secrets, see Move hardcoded secrets to AWS Secrets Manager. For information about using Lifecycle Configurations in Studio, see Use Lifecycle Configurations with Amazon SageMaker Studio.
Create a secret for Snowflake credentials
To create your secret for Snowflake credentials, complete the following steps:
- On the Secrets Manager console, choose Store a new secret.
- For Secret type, select Other type of secret.
- Specify the details of your secret as key-value pairs.
Key names must be lowercase due to case sensitivity; Data Wrangler gives a warning if you enter any of these incorrectly. Input the secret values as key-value pairs using the Key/value option if you’d like, or use the Plaintext option.
The following is the format of the secret used for Okta. If you are using Azure AD, you need to add the datasource_oauth_scope field.
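As an illustration, the same secret could also be created programmatically with boto3; the key names mirror the fields listed earlier (token URL, authorization URL, client ID, client secret, and the Azure AD-only OAuth scope), and all values are placeholders from your IdP application registration:

```python
import json

import boto3

secretsmanager = boto3.client("secretsmanager")

# Key names must be lowercase; confirm the exact set against the Data Wrangler
# documentation for your IdP. All values below are placeholders.
snowflake_oauth_secret = {
    "token_url": "https://<your-okta-domain>/oauth2/<auth-server-id>/v1/token",
    "authorization_url": "https://<your-okta-domain>/oauth2/<auth-server-id>/v1/authorize",
    "client_id": "<client-id>",
    "client_secret": "<client-secret>",
    # For Azure AD only, also add: "datasource_oauth_scope": "<scope>"
}

response = secretsmanager.create_secret(
    Name="AmazonSageMaker-DataWranglerSnowflakeCreds",
    SecretString=json.dumps(snowflake_oauth_secret),
    Tags=[{"Key": "SageMaker", "Value": "true"}],
)
print(response["ARN"])
```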
- Update the preceding values with your choice of IdP and information gathered after application registration.
- Choose Next.
- For Secret name, add the prefix AmazonSageMaker (for example, our secret is AmazonSageMaker-DataWranglerSnowflakeCreds).
- In the Tags section, add a tag with the key SageMaker and value true.
- The rest of the fields are optional; choose Next until you have the option to choose Store to store the secret.
After you store the secret, you’re returned to the Secrets Manager console.
- Choose the secret you just created, then retrieve the secret ARN.
- Store this in your preferred text editor for use later when you create the Data Wrangler data source.
Create a Studio Lifecycle Configuration
To create a Lifecycle Configuration in Studio, complete the following steps:
- On the SageMaker console, choose Lifecycle configurations in the navigation pane.
- Choose Create configuration.
- Choose Jupyter server app.
- Create a new lifecycle configuration or append an existing one with the following content:
The configuration creates a file with the name ".snowflake_identity_provider_oauth_config", containing the secret, in the user’s home folder.
- Choose Create Configuration.
Set the default Lifecycle Configuration
Complete the following steps to set the Lifecycle Configuration you just created as the default:
- On the SageMaker console, choose Domains in the navigation pane.
- Choose the Studio domain you’ll be using for this example.
- On the Environment tab, in the Lifecycle configurations for personal Studio apps section, choose Attach.
- For Source, select Existing configuration.
- Select the configuration you just made, then choose Attach to domain.
- Select the new configuration and choose Set as default, then choose Set as default again in the pop-up message.
Your new settings should now be visible under Lifecycle configurations for personal Studio apps as default.
- Shut down the Studio app and relaunch for the changes to take effect.
Data scientist experience
In this section, we cover how data scientists can connect to Snowflake as a data source in Data Wrangler and prepare data for ML.
Create a new data flow
To create your data flow, complete the following steps:
- On the SageMaker console, choose Amazon SageMaker Studio in the navigation pane.
- Choose Open Studio.
- On the Studio Home page, choose Import & prepare data visually. Alternatively, on the File drop-down, choose New, then choose SageMaker Data Wrangler Flow.
Creating a new flow can take a few minutes.
- On the Import data page, choose Create connection.
- Choose Snowflake from the list of data sources.
- For Authentication method, choose OAuth.
If you don’t see OAuth, verify the preceding Lifecycle Configuration steps.
- Enter details for Snowflake account name and Storage integration.
- Enter a connection name and choose Connect.
You’re redirected to an IdP authentication page. For this example, we’re using Okta.
- Enter your user name and password, then choose Sign in.
After the authentication is successful, you’re redirected to the Studio data flow page.
- On the Import data from Snowflake page, browse the database objects, or run a query for the targeted data.
- In the query editor, enter a query and preview the results.
In the following example, we load Loan Data and retrieve all columns from 5,000 rows.
- Choose Import.
- Enter a dataset name (for this post, we use snowflake_loan_dataset) and choose Add.
You’re redirected to the Prepare page, where you can add transformations and analyses to the data.
Data Wrangler makes it easy to ingest data and perform data preparation tasks such as exploratory data analysis, feature selection, and feature engineering. We’ve only covered a few of the capabilities of Data Wrangler in this post on data preparation; you can use Data Wrangler for more advanced data analysis such as feature importance, target leakage, and model explainability using an easy and intuitive user interface.
Analyze data quality
Use the Data Quality and Insights Report to perform an analysis of the data that you’ve imported into Data Wrangler. Data Wrangler creates the report from the sampled data.
- On the Data Wrangler flow page, choose the plus sign next to Data types, then choose Get data insights.
- Choose Data Quality And Insights Report for Analysis type.
- For Target column, choose your target column.
- For Problem type, select Classification.
- Choose Create.
The insights report has a brief summary of the data, which includes general information such as missing values, invalid values, feature types, outlier counts, and more. You can either download the report or view it online.
Add transformations to the data
Data Wrangler has over 300 built-in transformations. In this section, we use some of these transformations to prepare the dataset for an ML model.
- On the Data Wrangler flow page, choose the plus sign, then choose Add transform.
If you’re following the steps in the post, you’re directed here automatically after adding your dataset.
- Verify and modify the data type of the columns.
Looking through the columns, we identify that MNTHS_SINCE_LAST_DELINQ and MNTHS_SINCE_LAST_RECORD should most likely be represented as a number type rather than string.
- After applying the changes and adding the step, you can verify the column data type is changed to float.
Looking through the data, we can see that the fields EMP_TITLE, URL, DESCRIPTION, and TITLE will likely not provide value to our model in our use case, so we can drop them.
- Choose Add Step, then choose Manage columns.
- For Transform, choose Drop column.
- For Column to drop, specify EMP_TITLE, URL, DESCRIPTION, and TITLE.
- Choose Preview and Add.
Next, we want to look for categorical data in our dataset. Data Wrangler has built-in functionality to encode categorical data using both ordinal and one-hot encodings. Looking at our dataset, we can see that the TERM, HOME_OWNERSHIP, and PURPOSE columns all appear to be categorical in nature.
- Add another step and choose Encode categorical.
- For Transform, choose One-hot encode.
- For Input column, choose TERM.
- For Output style, choose Columns.
- Leave all other settings as default, then choose Preview and Add.
The HOME_OWNERSHIP column has four possible values: RENT, MORTGAGE, OWN, and other.
- Repeat the preceding steps to apply a one-hot encoding approach on these values.
Lastly, the PURPOSE column has several possible values. For this data, we use a one-hot encoding approach as well, but we set the output to a vector rather than columns.
- For Transform, choose One-hot encode.
- For Input column, choose PURPOSE.
- For Output style, choose Vector.
- For Output column, we call this column PURPOSE_VCTR.
This keeps the original PURPOSE column, if we decide to use it later.
- Leave all other settings as default, then choose Preview and Add.
Export the data flow
Finally, we export this whole data flow to a feature store with a SageMaker Processing job, which creates a Jupyter notebook with the code pre-populated.
- On the data flow page, choose the plus sign and Export to.
- Choose where to export. For our use case, we choose SageMaker Feature Store.
The exported notebook is now ready to run.
Export data and train a model with Autopilot
Now we can train the model using Amazon SageMaker Autopilot.
- On the data flow page, choose the Training tab.
- For Amazon S3 location, enter a location for the data to be saved.
- Choose Export and train.
- Specify the settings in the Target and features, Training method, Deployment and advanced settings, and Review and create sections.
- Choose Create experiment to find the best model for your problem.
Clean up
If your work with Data Wrangler is complete, shut down your Data Wrangler instance to avoid incurring additional fees.
Conclusion
In this post, we demonstrated connecting Data Wrangler to Snowflake using OAuth, transforming and analyzing a dataset, and finally exporting it to the data flow so that it could be used in a Jupyter notebook. Most notably, we created a pipeline for data preparation without having to write any code at all.
To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler.
About the authors
Ajjay Govindaram is a Senior Solutions Architect at AWS. He works with strategic customers who are using AI/ML to solve complex business problems. His experience lies in providing technical direction as well as design assistance for modest to large-scale AI/ML application deployments. His knowledge ranges from application architecture to big data, analytics, and machine learning. He enjoys listening to music while resting, experiencing the outdoors, and spending time with his loved ones.
Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience in working with database and analytics products from enterprise database vendors and cloud providers. He has helped large technology companies design data analytics solutions and has led engineering teams in designing and implementing data analytics platforms and data products.
Matt Marzillo is a Sr. Partner Sales Engineer at Snowflake. He has 10 years of experience in data science and machine learning roles both in consulting and with industry organizations. Matt has experience developing and deploying AI and ML models across many different organizations in areas such as marketing, sales, operations, clinical, and finance, as well as advising in consultative roles.
Huong Nguyen is a product leader for Amazon SageMaker Data Wrangler at AWS. She has 15 years of experience creating customer-obsessed and data-driven products for both enterprise and consumer spaces. In her spare time, she enjoys audio books, gardening, hiking, and spending time with her family and friends.
Remote monitoring of raw material supply chains for sustainability with Amazon SageMaker geospatial capabilities
Deforestation is a major concern in many tropical geographies where local rainforests are at severe risk of destruction. About 17% of the Amazon rainforest has been destroyed over the past 50 years, and some tropical ecosystems are approaching a tipping point beyond which recovery is unlikely.
A key driver for deforestation is raw material extraction and production, for example the production of food and timber or mining operations. Businesses consuming these resources are increasingly recognizing their share of responsibility in tackling the deforestation issue. One way they can do this is by ensuring that their raw material supply is produced and sourced sustainably. For example, if a business uses palm oil in their products, they will want to ensure that natural forests were not burned down and cleared to make way for a new palm oil plantation.
Geospatial analysis of satellite imagery taken of the locations where suppliers operate can be a powerful tool to detect problematic deforestation events. However, running such analyses is difficult, time-consuming, and resource-intensive. Amazon SageMaker geospatial capabilities—now generally available in the AWS Oregon Region—provide a new and much simpler solution to this problem. The tool makes it easy to access geospatial data sources, run purpose-built processing operations, apply pre-trained ML models, and use built-in visualization tools faster and at scale.
In this post, you will learn how to use SageMaker geospatial capabilities to easily baseline and monitor the vegetation type and density of areas where suppliers operate. Supply chain and sustainability professionals can use this solution to track the temporal and spatial dynamics of unsustainable deforestation in their supply chains. Specifically, the guidance provides data-driven insights into the following questions:
- When and over what period did deforestation occur – The guidance allows you to pinpoint when a new deforestation event occurred and monitor its duration, progression, or recovery
- Which type of land cover was most affected – The guidance allows you to pinpoint which vegetation types were most affected by a land cover change event (for example, tropical forests or shrubs)
- Where specifically did deforestation occur – Pixel-by-pixel comparisons between baseline and current satellite imagery (before vs. after) allow you to identify the precise locations where deforestation has occurred
- How much forest was cleared – An estimate of the affected area (in km²) is provided by taking advantage of the fine-grained resolution of satellite data (for example, 10mx10m raster cells for Sentinel 2)
Solution overview
The solution uses SageMaker geospatial capabilities to retrieve up-to-date satellite imagery for any area of interest with just a few lines of code, and apply pre-built algorithms such as land use classifiers and band math operations. You can then visualize results using built-in mapping and raster image visualization tooling. To derive further insights from the satellite data, the guidance uses the export functionality of Amazon SageMaker to save the processed satellite imagery to Amazon Simple Storage Service (Amazon S3), where data is cataloged and shared for custom postprocessing and analysis in an Amazon SageMaker Studio notebook with a SageMaker geospatial image. Results of these custom analyses are subsequently published and made observable in Amazon QuickSight so that procurement and sustainability teams can review supplier location vegetation data in one place. The following diagram illustrates this architecture.
The notebooks and code with a deployment-ready implementation of the analyses shown in this post are available at the GitHub repository Guidance for Geospatial Insights for Sustainability on AWS.
Example use case
This post uses an area of interest (AOI) from Brazil where land clearing for cattle production, oilseed growing (soybean and palm oil), and timber harvesting is a major concern. You can also generalize this solution to any other desired AOI.
The following screenshot displays the AOI showing satellite imagery (visible band) from the European Space Agency’s Sentinel 2 satellite constellation retrieved and visualized in a SageMaker notebook. Agricultural regions are clearly visible against dark green natural rainforest. Note also the smoke originating from inside the AOI as well as a larger area to the North. Smoke is often an indicator of the use of fire in land clearing.
NDVI as a measure for vegetation density
To identify and quantify changes in forest cover over time, this solution uses the Normalized Difference Vegetation Index (NDVI). NDVI is calculated from the visible and near-infrared light reflected by vegetation. Healthy vegetation absorbs most of the visible light that hits it, and reflects a large portion of the near-infrared light. Unhealthy or sparse vegetation reflects more visible light and less near-infrared light. The index is computed by combining the red (visible) and near-infrared (NIR) bands of a satellite image into a single index ranging from -1 to 1.
Negative values of NDVI (values approaching -1) correspond to water. Values close to zero (-0.1 to 0.1) represent barren areas of rock, sand, or snow. Low, positive values represent shrub, grassland, or farmland (approximately 0.2 to 0.4), whereas high NDVI values indicate temperate and tropical rainforests (values approaching 1). Learn more about NDVI calculations here. NDVI values can therefore be mapped easily to a corresponding vegetation class.
By tracking changes in NDVI over time using the SageMaker built-in NDVI model, we can infer key information on whether suppliers operating in the AOI are doing so responsibly or whether they’re engaging in unsustainable forest clearing activity.
Retrieve, process, and visualize NDVI data using SageMaker geospatial capabilities
One primary function of the SageMaker Geospatial API is the Earth Observation Job (EOJ), which allows you to acquire and transform raster data collected from the Earth’s surface. An EOJ retrieves satellite imagery from a specified data source (i.e., a satellite constellation) for a specified area of interest and time period, and applies one or several models to the retrieved images.
EOJs can be created via a geospatial notebook. For this post, we use an example notebook.
To configure an EOJ, set the following parameters:
- InputConfig – The input configuration defines data sources and filtering criteria to be applied during data acquisition:
- RasterDataCollectionArn – Defines which satellite to collect data from.
- AreaOfInterest – The geographical AOI; defines Polygon for which images are to be collected (in GeoJSON format).
- TimeRangeFilter – The time range of interest: {StartTime: <string>, EndTime: <string>}.
- PropertyFilters – Additional property filters, such as maximum acceptable cloud cover.
- JobConfig – The model configuration defines the processing job to be applied to the retrieved satellite image data. An NDVI model is available as part of the pre-built BandMath operation.
Set InputConfig
SageMaker geospatial capabilities support satellite imagery from two different sources that can be referenced via their Amazon Resource Names (ARNs):
- Landsat Collection 2 Level-2 Science Products, which measures the Earth’s surface reflectance (SR) and surface temperature (ST) at a spatial resolution of 30m
- Sentinel 2 L2A COGs, which provides large-swath continuous spectral measurements across 13 individual bands (blue, green, near-infrared, and so on) with resolution down to 10m.
You can retrieve these ARNs directly via the API by calling list_raster_data_collections().
This solution uses Sentinel 2 data. The Sentinel-2 mission is based on a constellation of two satellites. As a constellation, the same spot over the equator is revisited every 5 days, allowing for frequent and high-resolution observations. To specify Sentinel 2 as data source for the EOJ, simply reference the ARN:
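A sketch of looking up the Sentinel 2 collection ARN programmatically; the response field names follow the sagemaker-geospatial API at the time of writing:

```python
import boto3

# SageMaker geospatial capabilities are available in the us-west-2 (Oregon) Region.
sg_client = boto3.client("sagemaker-geospatial", region_name="us-west-2")

# Find the Sentinel 2 L2A COGs collection and keep its ARN for the EOJ request.
collections = sg_client.list_raster_data_collections()["RasterDataCollectionSummaries"]
data_collection_arn = next(c["Arn"] for c in collections if "Sentinel" in c["Name"])
```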
Next, the AreaOfInterest (AOI) for the EOJ needs to be defined. To do so, you need to provide a GeoJSON of the bounding box that defines the area where a supplier operates. The following code snippet extracts the bounding box coordinates and defines the EOJ request input:
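(A sketch of the request input; the bounding box coordinates below are placeholders for the supplier AOI.)

```python
# Bounding box of the AOI in (longitude, latitude) order; placeholder coordinates.
min_lon, min_lat, max_lon, max_lat = -55.8, -9.5, -55.2, -9.0

eoj_input_config = {
    "RasterDataCollectionQuery": {
        "RasterDataCollectionArn": data_collection_arn,
        "AreaOfInterest": {
            "AreaOfInterestGeometry": {
                "PolygonGeometry": {
                    "Coordinates": [[
                        [min_lon, min_lat],
                        [max_lon, min_lat],
                        [max_lon, max_lat],
                        [min_lon, max_lat],
                        [min_lon, min_lat],  # polygon ring closes on the first vertex
                    ]]
                }
            }
        },
    }
}
```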
The time range is defined using the following request syntax:
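(The dates below are placeholders for the monitoring period.)

```python
eoj_input_config["RasterDataCollectionQuery"]["TimeRangeFilter"] = {
    "StartTime": "2022-06-01T00:00:00Z",
    "EndTime": "2022-07-31T23:59:59Z",
}
```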
Depending on the raster data collection selected, different additional property filters are supported. You can review the available options by calling get_raster_data_collection(Arn=data_collection_arn)["SupportedFilters"]. In the following example, a tight limit of 5% cloud cover is imposed to ensure a relatively unobstructed view of the AOI:
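A sketch of that filter; the EoCloudCover property name follows the SupportedFilters output for Sentinel-2, though you should verify it against your own query:

```python
# Sketch: restrict results to scenes with at most 5% cloud cover
property_filters = {
    "Properties": [
        {"Property": {"EoCloudCover": {"LowerBound": 0, "UpperBound": 5}}}
    ],
    "LogicalOperator": "AND",
}
```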
Review query results
Before you start the EOJ, make sure that the query parameters actually result in satellite images being returned as a response. In this example, the ApproximateResultCount is 3, which is sufficient. You may need to use a less restrictive PropertyFilter if no results are returned.
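One way to preview the result set is to run the same query directly against the data collection; the following sketch assumes the search_raster_data_collection response shape (including ApproximateResultCount) of the Boto3 sagemaker-geospatial client:

```python
# Sketch: preview the query before starting the EOJ
query_results = geospatial_client.search_raster_data_collection(
    Arn=data_collection_arn,
    RasterDataCollectionQuery={
        "AreaOfInterest": area_of_interest,
        "TimeRangeFilter": time_range_filter,
        "PropertyFilters": property_filters,
    },
)

print(query_results["ApproximateResultCount"])  # 3 in this example
```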
You can review thumbnails of the raw input images by indexing the query_results object. For example, the raw image thumbnail URL of the last item returned by the query can be accessed as follows: query_results['Items'][-1]["Assets"]["thumbnail"]["Href"].
Set JobConfig
Now that we have set all required parameters needed to acquire the raw Sentinel 2 satellite data, the next step is to infer vegetation density measured in terms of NDVI. This would typically involve identifying the satellite tiles that intersect the AOI and downloading the satellite imagery for the time frame in scope from a data provider. You would then have to go through the process of overlaying, merging, and clipping the acquired files, computing the NDVI for each raster cell of the combined image by performing mathematical operations on the respective bands (such as red and near-infrared), and finally saving the results to a new single-band raster image. SageMaker geospatial capabilities provide an end-to-end implementation of this workflow, including a built-in NDVI model that can be run with a simple API call. All you need to do is specify the job configuration and set it to the predefined NDVI model:
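As a rough sketch, the job configuration might look like the following; the exact field names, and the custom-index formulation shown here rather than the service's predefined NDVI operation, are assumptions:

```python
# Sketch: configure the EOJ to compute NDVI via band math on the red and near-infrared bands
eoj_config = {
    "BandMathConfig": {
        "CustomIndices": {
            "Operations": [
                {"Name": "ndvi", "Equation": "(nir - red) / (nir + red)"}
            ]
        }
    }
}
```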
Having defined all required inputs for SageMaker to acquire and transform the geospatial data of interest, you can now start the EOJ with a simple API call:
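A sketch of that call, reusing the inputs defined earlier; the job name and execution role ARN are placeholders:

```python
# Sketch: start the Earth Observation Job and poll its status
eoj = geospatial_client.start_earth_observation_job(
    Name="deforestation-monitoring-ndvi",                       # assumed name
    ExecutionRoleArn="arn:aws:iam::<account-id>:role/<role>",    # placeholder
    InputConfig={
        "RasterDataCollectionQuery": {
            "RasterDataCollectionArn": data_collection_arn,
            "AreaOfInterest": area_of_interest,
            "TimeRangeFilter": time_range_filter,
            "PropertyFilters": property_filters,
        }
    },
    JobConfig=eoj_config,
)

status = geospatial_client.get_earth_observation_job(Arn=eoj["Arn"])["Status"]
```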
After the EOJ is complete, you can start exploring the results. SageMaker geospatial capabilities provide built-in visualization tooling powered by Foursquare Studio, which natively works from within a SageMaker notebook via the SageMaker geospatial Map SDK. The following code snippet initializes and renders a map and then adds several layers to it:
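As a rough sketch (the sagemaker_geospatial_map import and the visualize_* method names are assumptions and may not match your SDK version exactly):

```python
# Rough sketch of map rendering in a SageMaker geospatial notebook; method names are assumptions
import sagemaker_geospatial_map

embedded_map = sagemaker_geospatial_map.create_map({"is_raster": True})
embedded_map.set_sagemaker_geospatial_client(geospatial_client)
embedded_map.render()

# Add the AOI bounding box and the NDVI output raster as layers
embedded_map.visualize_eoj_aoi(Arn=eoj["Arn"], config={"label": "Supplier AOI"})
embedded_map.visualize_eoj_output(
    Arn=eoj["Arn"],
    config={"label": "NDVI", "preset": "singleBand", "band_name": "ndvi"},
)
```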
Once rendered, you can interact with the map by hiding or showing layers, zooming in and out, or modifying color schemes, among other options. The following screenshot shows the AOI bounding box layer superimposed on the output layer (the NDVI-transformed Sentinel 2 raster file). Bright yellow patches represent rainforest that is intact (NDVI=1), darker patches represent fields (0.5>NDVI>0), and dark-blue patches represent water (NDVI=-1).
By comparing current period values vs. a defined baseline period, changes and anomalies in NDVI can be identified and tracked over time.
Custom postprocessing and QuickSight visualization for additional insights
SageMaker geospatial capabilities come with a powerful pre-built analysis and mapping toolkit that delivers the functionality needed for many geospatial analysis tasks. In some cases, you may require additional flexibility and want to run customized post-analyses on the EOJ results. SageMaker geospatial capabilities facilitate this flexibility via an export function. Exporting EOJ outputs is again a simple API call:
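A sketch of such a call, with placeholder bucket and role values:

```python
# Sketch: export the EOJ results to Amazon S3
geospatial_client.export_earth_observation_job(
    Arn=eoj["Arn"],
    ExecutionRoleArn="arn:aws:iam::<account-id>:role/<role>",    # placeholder
    OutputConfig={"S3Data": {"S3Uri": "s3://<your-bucket>/eoj-output/"}},
)
```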
Then you can download the output raster files for further local processing in a SageMaker geospatial notebook using common Python libraries for geospatial analysis such as GDAL, Fiona, GeoPandas, Shapely, and Rasterio, as well as SageMaker-specific libraries. With the analyses running in SageMaker, all AWS analytics tooling that natively integrates with SageMaker is also at your disposal. For example, the solution linked in the Guidance for Geospatial Insights for Sustainability on AWS GitHub repo uses Amazon S3 and Amazon Athena for querying the postprocessing results and making them observable in a QuickSight dashboard. All processing routines, along with deployment code and instructions for the QuickSight dashboard, are detailed in the GitHub repository.
The dashboard offers three core visualization components:
- A time series plot of NDVI values normalized against a baseline period, which enables you to track the temporal dynamics in vegetation density
- Full discrete distribution of NDVI values in the baseline period and the current period, providing transparency on which vegetation types have seen the largest change
- NDVI-transformed satellite imagery for the baseline period, current period, and pixel-by-pixel differences between both periods, which allows you to identify the affected regions inside the AOI
As shown in the following example, over the period of 5 years (Q3 2017 to Q3 2022), the average NDVI of the AOI decreased by 7.6% against the baseline period (Q3 2017), affecting a total area of 250.21 km2. This reduction was primarily driven by changes in high-NDVI areas (forest, rainforest), which can be seen when comparing the NDVI distributions of the current vs. the baseline period.
The pixel-by-pixel spatial comparison against the baseline highlights that the deforestation event occurred in an area right at the center of the AOI, where previously untouched natural forest has been converted into farmland. Supply chain professionals can take these data points as the basis for further investigation and a potential review of their relationship with the supplier in question.
Conclusion
SageMaker geospatial capabilities can form an integral part in tracking corporate climate action plans by making remote geospatial monitoring easy and accessible. This blog post focused on just one specific use case – monitoring raw material supply chain origins. Other use cases are easily conceivable. For example, you could use a similar architecture to track forest restoration efforts for emission offsetting, monitor plant health in reforestation or farming applications, or detect the impact of droughts on water bodies, among many other applications.
About the Authors
Karsten Schroer is a Solutions Architect at AWS. He supports customers in leveraging data and technology to drive sustainability of their IT infrastructure and build cloud-native data-driven solutions that enable sustainable operations in their respective verticals. Karsten joined AWS following his PhD studies in applied machine learning & operations management. He is truly passionate about technology-enabled solutions to societal challenges and loves to dive deep into the methods and application architectures that underlie these solutions.
Tamara Herbert is an Application Developer with AWS Professional Services in the UK. She specializes in building modern & scalable applications for a wide variety of customers, currently focusing on those within the public sector. She is actively involved in building solutions and driving conversations that enable organizations to meet their sustainability goals both in and through the cloud.
Margaret O’Toole joined AWS in 2017 and has spent her time helping customers in various technical roles. Today, Margaret is the WW Tech Leader for Sustainability and leads a community of customer facing sustainability technical specialists. Together, the group helps customers optimize IT for sustainability and leverage AWS technology to solve some of the most difficult challenges in sustainability around the world. Margaret studied biology and computer science at the University of Virginia and Leading Sustainable Corporations at Oxford’s Saïd Business School.
Best practices for viewing and querying Amazon SageMaker service quota usage
Amazon SageMaker customers can view and manage their quota limits through Service Quotas. In addition, they can view near real-time utilization metrics and create Amazon CloudWatch metrics to view and programmatically query SageMaker quotas.
SageMaker helps you build, train, and deploy machine learning (ML) models with ease. To learn more, refer to Getting started with Amazon SageMaker. Service Quotas simplifies limit management by allowing you to view and manage your quotas for SageMaker from a central location.
With Service Quotas, you can view the maximum number of resources, actions, or items in your AWS account or AWS Region. You can also use Service Quotas to request an increase for adjustable quotas.
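For example, a minimal Boto3 sketch that lists the SageMaker quotas visible to your account (field names follow the Service Quotas API):

```python
# Sketch: list all SageMaker quotas and whether each is adjustable
import boto3

quotas = boto3.client("service-quotas")

paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        adjustability = "adjustable" if quota["Adjustable"] else "fixed"
        print(quota["QuotaName"], quota["Value"], adjustability)
```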
With the increasing adoption of MLOps practices, and the resulting demand for resources designated for ML model experimentation and retraining, more customers need to run multiple instances, often of the same instance type, at the same time.
Many data science teams often work in parallel, using several instances for processing, training, and tuning concurrently. Previously, users would sometimes reach an adjustable account limit for some particular instance type and have to manually request a limit increase from AWS.
To request quota increases manually from the Service Quotas UI, you can choose the quota from the list and choose Request quota increase. For more information, refer to Requesting a quota increase.
In this post, we show how you can use the new features to automatically request limit increases when a high level of instances is reached.
Solution overview
The following diagram illustrates the solution architecture.
This architecture includes the following workflow:
- A CloudWatch metric monitors the usage of the resource. A CloudWatch alarm triggers when the resource usage goes beyond a certain preconfigured threshold.
- A message is sent to Amazon Simple Notification Service (Amazon SNS).
- The message is received by an AWS Lambda function.
- The Lambda function requests the quota increase.
Aside from requesting a quota increase for the specific account, the Lambda function can also add the quota increase to the organization template (up to 10 quotas). This way, any new account created under a given AWS Organization has the increased quota requests by default.
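A minimal sketch of such a function; the quota code, the increase factor, and the environment variables are assumptions for illustration, and the function shipped in the sample repository may differ:

```python
# Sketch of a Lambda handler that requests a quota increase via Service Quotas
import os

import boto3

quotas = boto3.client("service-quotas")

SERVICE_CODE = "sagemaker"
QUOTA_CODE = os.environ["QUOTA_CODE"]  # assumed env var, e.g., the ml.t3.medium notebook quota code
INCREASE_FACTOR = float(os.environ.get("INCREASE_FACTOR", "2"))


def handler(event, context):
    # Look up the current quota value and request a proportional increase
    current = quotas.get_service_quota(ServiceCode=SERVICE_CODE, QuotaCode=QUOTA_CODE)
    desired = current["Quota"]["Value"] * INCREASE_FACTOR

    response = quotas.request_service_quota_increase(
        ServiceCode=SERVICE_CODE, QuotaCode=QUOTA_CODE, DesiredValue=desired
    )

    # Optionally add the same increase to the organization template (up to 10 quotas)
    quotas.put_service_quota_increase_request_into_template(
        ServiceCode=SERVICE_CODE,
        QuotaCode=QUOTA_CODE,
        AwsRegion=os.environ["AWS_REGION"],
        DesiredValue=desired,
    )
    return response["RequestedQuota"]["Status"]
```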
Prerequisites
Complete the following prerequisite steps:
- Set up an AWS account and create an AWS Identity and Access Management (IAM) user. For instructions, refer to Secure Your AWS Account.
- Install the AWS SAM CLI.
Deploy using AWS Serverless Application Model
To deploy the application using the GitHub repo, run the following command in the terminal:
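The exact command lives in the repository's README; as a sketch, a typical AWS SAM deployment from a cloned copy of the repo looks roughly like this (repository URL and directory are placeholders):

```bash
# Assumed commands; check the repository's README for the exact parameters it expects
git clone <github-repo-url>
cd <repository-directory>
sam build
sam deploy --guided
```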
After the solution is deployed, you should see a new alarm on the CloudWatch console. This alarm monitors the usage of ml.t3.medium SageMaker notebook instances.
If your resource usage reaches more than 50%, the alarm triggers and the Lambda function requests an increase.
If the account you have is part of an AWS Organization and you have the quota request template enabled, you should also see those increases on the template, if the template has available slots. This way, new accounts from that organization also have the increases configured upon creation.
Deploy using the CloudWatch console
To deploy the application using the CloudWatch console, complete the following steps:
- On the CloudWatch console, choose All alarms in the navigation pane.
- Choose Create alarm.
- Choose Select metric.
- Choose Usage.
- Select the metric you want to monitor.
- Select the condition of when you would like the alarm to trigger.
For more configuration options when creating the alarm, see Create a CloudWatch alarm based on a static threshold.
- Configure the SNS topic to be notified about the alarm.
You can also use Amazon SNS to trigger a Lambda function when the alarm is triggered. See Using AWS Lambda with Amazon SNS for more information.
- For Alarm name, enter a name.
- Choose Next.
- Choose Create alarm.
Clean up
To clean up the resources created as part of this post, make sure to delete all the created stacks. To do that, run the following command:
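For example, assuming you deployed with AWS SAM and substituting the stack name you chose:

```bash
# Assumed command; substitute the stack name you used during deployment
sam delete --stack-name <your-stack-name>
```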
Conclusion
In this post, we showed how you can use the new integration from SageMaker with Service Quotas to automate the requests for quota increases for SageMaker resources. This way, data science teams can effectively work in parallel and reduce issues related to unavailability of instances.
You can learn more about Amazon SageMaker quotas by accessing the documentation. You can also learn more about Service Quotas here.
About the authors
Bruno Klein is a Machine Learning Engineer in the AWS ProServe team. He particularly enjoys creating automations and improving the lifecycle of models in production. In his free time, he likes to spend time outdoors and hiking.
Paras Mehra is a Senior Product Manager at AWS. He is focused on helping build Amazon SageMaker Training and Processing. In his spare time, Paras enjoys spending time with his family and road biking around the Bay Area. You can find him on LinkedIn.
Build custom code libraries for your Amazon SageMaker Data Wrangler Flows using AWS CodeCommit
As organizations grow in size and scale, the complexities of running workloads increase, and the need to develop and operationalize processes and workflows becomes critical. Therefore, organizations have adopted technology best practices, including microservice architecture, MLOps, DevOps, and more, to improve delivery time, reduce defects, and increase employee productivity. This post introduces a best practice for managing custom code within your Amazon SageMaker Data Wrangler workflow.
Data Wrangler is a low-code tool that facilitates data analysis, preprocessing, and visualization. It contains over 300 built-in data transformation steps to aid with feature engineering, normalization, and cleansing to transform your data without having to write any code.
In addition to the built-in transforms, Data Wrangler contains a custom code editor that allows you to implement custom code written in Python, PySpark, or SparkSQL.
When using Data Wrangler custom transform steps to implement your custom functions, you need to implement best practices around developing and deploying code in Data Wrangler flows.
This post shows how you can use code stored in AWS CodeCommit in the Data Wrangler custom transform step. This provides you with additional benefits, including:
- Improve productivity and collaboration across personnel and teams
- Version your custom code
- Modify your Data Wrangler custom transform step without having to log in to Amazon SageMaker Studio to use Data Wrangler
- Reference parameter files in your custom transform step
- Scan code in CodeCommit using Amazon CodeGuru or any third-party application for security vulnerabilities before using it in Data Wrangler flows
Solution overview
This post demonstrates how to build a Data Wrangler flow file with a custom transform step. Instead of hardcoding the custom function into your custom transform step, you pull a script containing the function from CodeCommit, load it, and call the loaded function in your custom transform step.
For this post, we use the bank-full.csv data from the University of California Irvine Machine Learning Repository to demonstrate these functionalities. The data is related to the direct marketing campaigns of a banking institution. Often, more than one contact with the same client was required to assess whether the product (a bank term deposit) would be subscribed (yes) or not subscribed (no).
The following diagram illustrates this solution.
The workflow is as follows:
- Create a Data Wrangler flow file and import the dataset from Amazon Simple Storage Service (Amazon S3).
- Create a series of Data Wrangler transformation steps:
- A custom transform step to implement custom code stored in CodeCommit.
- Two built-in transform steps.
We keep the transformation steps to a minimum so as not to detract from the aim of this post, which is focused on the custom transform step. For more information about available transformation steps and implementation, refer to Transform Data and the Data Wrangler blog.
- In the custom transform step, write code to pull the script and configuration file from CodeCommit, load the script as a Python module, and call a function in the script. The function takes a configuration file as an argument.
- Run a Data Wrangler job and set Amazon S3 as the destination.
Destination options also include Amazon SageMaker Feature Store.
Prerequisites
As a prerequisite, we set up the CodeCommit repository, Data Wrangler flow, and CodeCommit permissions.
Create a CodeCommit repository
For this post, we use an AWS CloudFormation template to set up a CodeCommit repository and copy the required files into this repository. Complete the following steps:
- Choose Launch Stack:
- Select the Region where you want to create the CodeCommit repository.
- Enter a name for Stack name.
- Enter a name for the repository to be created for RepoName.
- Choose Create stack.
AWS CloudFormation takes a few seconds to provision your CodeCommit repository. After the CREATE_COMPLETE status appears, navigate to the CodeCommit console to see your newly created repository.
Set up Data Wrangler
Download the bank.zip dataset from the University of California Irvine Machine Learning Repository. Then, extract the contents of bank.zip and upload bank-full.csv to Amazon S3.
To create a Data Wrangler flow file and import the bank-full.csv dataset from Amazon S3, complete the following steps:
- Onboard to SageMaker Studio using the quick start for users new to Studio.
- Select your SageMaker domain and user profile and on the Launch menu, choose Studio.
- On the Studio console, on the File menu, choose New, then choose Data Wrangler Flow.
- Choose Amazon S3 for Data sources.
- Navigate to your S3 bucket containing the file and select the bank-full.csv file.
A Preview Error will be thrown.
- In the Details pane on the right, change the Delimiter to SEMICOLON.
A preview of the dataset will be displayed in the result window.
- In the Details pane, on the Sampling drop-down menu, choose None.
This is a relatively small dataset, so you don’t need to sample.
- Choose Import.
Configure CodeCommit permissions
You need to provide Studio with permission to access CodeCommit. We use a CloudFormation template to provision an AWS Identity and Access Management (IAM) policy that gives your Studio role permission to access CodeCommit. Complete the following steps:
- Choose Launch Stack:
- Select the Region you are working in.
- Enter a name for Stack name.
- Enter your Studio domain ID for SageMakerDomainID. The domain information is available on the SageMaker console Domains page, as shown in the following screenshot.
- Enter your Studio domain user profile name for SageMakerUserProfileName. You can view your user profile name by navigating into your Studio domain. If you have multiple user profiles in your Studio domain, enter the name for the user profile used to launch Studio.
- Select the acknowledgement box.
The IAM resources used by this CloudFormation template provide the minimum permissions to successfully create the IAM policy attached to your Studio role for CodeCommit access.
- Choose Create stack.
Transformation steps
Next, we add transformations to process the data.
Custom transform step
In this post, we calculate the Variance Inflation Factor (VIF) for each numerical feature and drop features that exceed a VIF threshold. We do this in the custom transform step because Data Wrangler doesn’t have a built-in transform for this task as of this writing.
However, we don’t hardcode this VIF function. Instead, we pull this function from the CodeCommit repository into the custom transform step. Then we run the function on the dataset.
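For orientation, a hypothetical version of such a script is sketched below; the function name, signature, and threshold handling are assumptions, and the actual file in the repository may differ:

```python
# Hypothetical contents of a vif_transform.py script stored in CodeCommit
from statsmodels.stats.outliers_influence import variance_inflation_factor


def drop_high_vif_features(df, config):
    """Drop numerical columns whose VIF exceeds config["vif_threshold"]."""
    threshold = config["vif_threshold"]

    # Data Wrangler passes a Spark DataFrame; compute VIF on a pandas copy
    numeric = df.toPandas().select_dtypes(include="number")
    vif_per_column = {
        column: variance_inflation_factor(numeric.values, index)
        for index, column in enumerate(numeric.columns)
    }

    to_drop = [column for column, vif in vif_per_column.items() if vif > threshold]
    return df.drop(*to_drop) if to_drop else df
```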
- On the Data Wrangler console, navigate to your data flow.
- Choose the plus sign next to Data types and choose Add transform.
- Choose + Add step.
- Choose Custom transform.
- Optionally, enter a name in the Name field.
- Choose Python (PySpark) on the drop-down menu.
- For Your custom transform, enter the following code (provide the name of the CodeCommit repository and Region where the repository is located):
The code uses the AWS SDK for Python (Boto3) to access CodeCommit API functions. We use the get_file API function to pull files from the CodeCommit repository into the Data Wrangler environment.
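A sketch of what such a custom transform might look like; the file names, the function name, and the placeholders are assumptions, and the snippet used by the post may differ:

```python
# Sketch of a Data Wrangler custom transform (Python/PySpark) that pulls a script and a
# configuration file from CodeCommit and applies the loaded function to the dataset
import importlib.util
import json

import boto3

REPO_NAME = "<your-codecommit-repo>"   # assumed placeholder
REGION = "<your-region>"               # assumed placeholder

codecommit = boto3.client("codecommit", region_name=REGION)

# Pull the custom script and the parameter file from the repository
script = codecommit.get_file(repositoryName=REPO_NAME, filePath="vif_transform.py")
params = codecommit.get_file(repositoryName=REPO_NAME, filePath="parameter.json")

# Write the script locally and load it as a Python module
with open("/tmp/vif_transform.py", "wb") as handle:
    handle.write(script["fileContent"])

spec = importlib.util.spec_from_file_location("vif_transform", "/tmp/vif_transform.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

config = json.loads(params["fileContent"])

# Data Wrangler exposes the current dataset as the Spark DataFrame `df`;
# the transformed DataFrame must be assigned back to `df`
df = module.drop_high_vif_features(df, config)
```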
- Choose Preview.
In the Output pane, a table is displayed showing the different numerical features and their corresponding VIF value. For this exercise, the VIF threshold value is set to 1.2. However, you can modify this threshold value in the parameter.json file found in your CodeCommit repository. You will notice that two columns have been dropped (pdays and previous), bringing the total column count to 15.
- Choose Add.
Encode categorical features
Some feature types are categorical variables that need to be transformed into numerical forms. Use the one-hot encode built-in transform to achieve this data transformation. Let’s create numerical features representing the unique value in each categorical feature in the dataset. Complete the following steps:
- Choose + Add step.
- Choose the Encode categorical transform.
- On the Transform drop-down menu, choose One-hot encode.
- For Input column, choose all categorical features, including poutcome, y, month, marital, contact, default, education, housing, job, and loan.
- For Output style, choose Columns.
- Choose Preview to preview the results.
One-hot encoding might take a while to generate results, given the number of features and unique values within each feature.
- Choose Add.
For each numerical feature created with one-hot encoding, the name combines the categorical feature name appended with an underscore (_) and the unique categorical value within that feature.
Drop column
The y_yes feature is the target column for this exercise, so we drop the y_no feature.
- Choose + Add step.
- Choose Manage columns.
- Choose Drop column under Transform.
- Choose y_no under Columns to drop.
- Choose Preview, then choose Add.
Create a Data Wrangler job
Now that you have created all the transform steps, you can create a Data Wrangler job to process your input data and store the output in Amazon S3. Complete the following steps:
- Choose Data flow to go back to the Data Flow page.
- Choose the plus sign on the last tile of your flow visualization.
- Choose Add destination and choose Amazon S3.
- Enter the name of the output file for Dataset name.
- Choose Browse and choose the bucket destination for Amazon S3 location.
- Choose Add destination.
- Choose Create job.
- Change the Job name value as you see fit.
- Choose Next, 2. Configure job.
- Change Instance count to 1, because we work with a relatively small dataset, to reduce the cost incurred.
- Choose Create.
This will start an Amazon SageMaker Processing job to process your Data Wrangler flow file and store the output in the specified S3 bucket.
Automation
Now that you have created your Data Wrangler flow file, you can schedule your Data Wrangler jobs to automatically run at specific times and frequency. This is a feature that comes out of the box with Data Wrangler and simplifies the process of scheduling Data Wrangler jobs. Furthermore, CRON expressions are supported and provide additional customization and flexibility in scheduling your Data Wrangler jobs.
However, this post shows how you can automate the Data Wrangler job to run every time there is a modification to the files in the CodeCommit repository. This automation technique ensures that any changes to the custom code functions or changes to values in the configuration file in CodeCommit trigger a Data Wrangler job to reflect these changes immediately.
Therefore, you don’t have to manually start a Data Wrangler job to get the output data that reflects the changes you just made. With this automation, you can improve the agility and scale of your Data Wrangler workloads. To automate your Data Wrangler jobs, you configure the following:
- Amazon SageMaker Pipelines – Pipelines helps you create machine learning (ML) workflows with an easy-to-use Python SDK, and you can visualize and manage your workflow using Studio
- Amazon EventBridge – EventBridge facilitates connection to AWS services, software as a service (SaaS) applications, and custom applications as event producers to launch workflows.
Create a SageMaker pipeline
First, you need to create a SageMaker pipeline for your Data Wrangler job. Then complete the following steps to export your Data Wrangler flow to a SageMaker pipeline:
- Choose the plus sign on your last transform tile (the transform tile before the Destination tile).
- Choose Export to.
- Choose SageMaker Inference Pipeline (via Jupyter Notebook).
This creates a new Jupyter notebook prepopulated with code to create a SageMaker pipeline for your Data Wrangler job. Before running all the cells in the notebook, you may want to change certain variables.
- To add a training step to your pipeline, change the add_training_step variable to True.
Be aware that running a training job will incur additional costs on your account.
- Set the target_attribute_name variable to y_yes.
- To change the name of the pipeline, change the pipeline_name variable.
- Lastly, run the entire notebook by choosing Run and Run All Cells.
This creates a SageMaker pipeline and runs the Data Wrangler job.
- To view your pipeline, choose the home icon on the navigation pane and choose Pipelines.
You can see the new SageMaker pipeline created.
- Choose the newly created pipeline to see the run list.
- Note the name of the SageMaker pipeline, as you will use it later.
- Choose the first run and choose Graph to see a Directed Acyclic Graph (DAG) flow of your SageMaker pipeline.
As shown in the following screenshot, we didn’t add a training step to our pipeline. If you added a training step to your pipeline, it will display in your pipeline run Graph tab under DataWranglerProcessingStep.
Create an EventBridge rule
After successfully creating your SageMaker pipeline for the Data Wrangler job, you can move on to setting up an EventBridge rule. This rule listens to activities in your CodeCommit repository and triggers the run of the pipeline in the event of a modification to any file in the CodeCommit repository. We use a CloudFormation template to automate creating this EventBridge rule. Complete the following steps:
- Choose Launch Stack:
- Select the Region you are working in.
- Enter a name for Stack name.
- Enter a name for your EventBridge rule for EventRuleName.
- Enter the name of the pipeline you created for PipelineName.
- Enter the name of the CodeCommit repository you are working with for RepoName.
- Select the acknowledgement box.
The IAM resources that this CloudFormation template uses provide the minimum permissions to successfully create the EventBridge rule.
- Choose Create stack.
It takes a few minutes for the CloudFormation template to run successfully. When the status changes to CREATE_COMPLETE, you can navigate to the EventBridge console to see the created rule.
Now that you have created this rule, any changes you make to the file in your CodeCommit repository will trigger the run of the SageMaker pipeline.
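For reference, the rule that the template creates can be approximated with Boto3 as follows; the rule name, role, and parameter list are assumptions:

```python
# Sketch: create an EventBridge rule that starts the SageMaker pipeline on CodeCommit changes
import json

import boto3

events = boto3.client("events")

repo_arn = "arn:aws:codecommit:<region>:<account-id>:<repo-name>"          # placeholder
pipeline_arn = "arn:aws:sagemaker:<region>:<account-id>:pipeline/<name>"   # placeholder

# Trigger on any reference update (i.e., a commit) in the repository
event_pattern = {
    "source": ["aws.codecommit"],
    "detail-type": ["CodeCommit Repository State Change"],
    "resources": [repo_arn],
    "detail": {"event": ["referenceCreated", "referenceUpdated"]},
}

events.put_rule(Name="datawrangler-codecommit-rule", EventPattern=json.dumps(event_pattern))

# Target the SageMaker pipeline; the role must allow sagemaker:StartPipelineExecution
events.put_targets(
    Rule="datawrangler-codecommit-rule",
    Targets=[
        {
            "Id": "start-datawrangler-pipeline",
            "Arn": pipeline_arn,
            "RoleArn": "arn:aws:iam::<account-id>:role/<eventbridge-invoke-role>",
            "SageMakerPipelineParameters": {"PipelineParameterList": []},
        }
    ],
)
```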
To test the pipeline, edit a file in your CodeCommit repository: modify the VIF threshold in your parameter.json file to a different number, then go to the SageMaker pipeline details page to see a new run of your pipeline.
In this new pipeline run, Data Wrangler drops the numerical features whose VIF value exceeds the threshold you specified in the parameter.json file in CodeCommit.
You have successfully automated and decoupled your Data Wrangler job. Furthermore, you can add more steps to your SageMaker pipeline. You can also modify the custom scripts in CodeCommit to implement various functions in your Data Wrangler flow.
It’s also possible to store your scripts and files in Amazon S3 and download them into your Data Wrangler custom transform step as an alternative to CodeCommit. In addition, you ran your custom transform step using the Python (PySpark) framework. However, you can also use the Python (Pandas) framework for your custom transform step, allowing you to run custom Python scripts. You can test this out by changing your framework in the custom transform step to Python (Pandas) and modifying your custom transform step code to pull and implement the Python script version stored in your CodeCommit repository. However, the PySpark option for Data Wrangler provides better performance when working on a large dataset compared to the Python (Pandas) option.
Clean up
After you’re done experimenting with this use case, clean up the resources you created to avoid incurring additional charges to your account:
- Stop the underlying instance used to create your Data Wrangler flow.
- Delete the resources created by the various CloudFormation templates.
- If you see a DELETE_FAILED state when deleting a CloudFormation stack, delete the stack one more time to successfully delete it.
Summary
This post showed you how to decouple your Data Wrangler custom transform step by pulling scripts from CodeCommit. We also showed how to automate your Data Wrangler jobs using SageMaker Pipelines and EventBridge.
Now you can operationalize and scale your Data Wrangler jobs without modifying your Data Wrangler flow file. You can also scan your custom code in CodeCommit using CodeGuru or any third-party application for vulnerabilities before implementing it in Data Wrangler. To know more about end-to-end machine learning operations (MLOps) on AWS, visit Amazon SageMaker for MLOps.
About the Author
Uchenna Egbe is an Associate Solutions Architect at AWS. He spends his free time researching herbs, teas, superfoods, and how to incorporate them into his daily diet.
Biased graph sampling for better related-product recommendation
Tailoring neighborhood sizes and sampling probability to nodes’ degree of connectivity improves the utility of graph-neural-network embeddings by as much as 230%.