Import data from cross-account Amazon Redshift in Amazon SageMaker Data Wrangler for exploratory data analysis and data preparation

Organizations moving towards a data-driven culture embrace the use of data and machine learning (ML) in decision-making. To make ML-based decisions from data, you need your data available, accessible, clean, and in the right format to train ML models. Organizations with a multi-account architecture want to avoid situations where they must extract data from one account and load it into another for data preparation activities. Manually building and maintaining the different extract, transform, and load (ETL) jobs in different accounts adds complexity and cost, and makes it more difficult to maintain the governance, compliance, and security best practices to keep your data safe.

Amazon Redshift is a fast, fully managed cloud data warehouse. The Amazon Redshift cross-account data sharing feature provides a simple and secure way to share fresh, complete, and consistent data in your Amazon Redshift data warehouse with any number of stakeholders in different AWS accounts. Amazon SageMaker Data Wrangler is a capability of Amazon SageMaker that makes it faster for data scientists and engineers to prepare data for ML applications by using a visual interface. Data Wrangler allows you to explore and transform data for ML by connecting to Amazon Redshift datashares.

In this post, we walk through setting up a cross-account integration using an Amazon Redshift datashare and preparing data using Data Wrangler.

Solution overview

We start with two AWS accounts: a producer account with the Amazon Redshift data warehouse, and a consumer account for SageMaker ML use cases. For this post, we use the banking dataset. To follow along, download the dataset to your local machine. The following is a high-level overview of the workflow:

  1. Instantiate an Amazon Redshift RA3 cluster in the producer account and load the dataset.
  2. Create an Amazon Redshift datashare in the producer account and allow the consumer account to access the data.
  3. Access the Amazon Redshift datashare in the consumer account.
  4. Analyze and process data with Data Wrangler in the consumer account and build your data preparation workflows.

Be aware of the considerations for working with Amazon Redshift data sharing:

  • Multiple AWS accounts – You need at least two AWS accounts: a producer account and a consumer account.
  • Cluster type – Data sharing is supported in the RA3 cluster type. When instantiating an Amazon Redshift cluster, make sure to choose the RA3 cluster type.
  • Encryption – For data sharing to work, both the producer and consumer clusters must be encrypted and should be in the same AWS Region.
  • Regions – Cross-account data sharing is available for all Amazon Redshift RA3 node types in US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Paris), Europe (Stockholm), and South America (São Paulo).
  • Pricing – Cross-account data sharing is available across clusters that are in the same Region. There is no cost to share data. You just pay for the Amazon Redshift clusters that participate in sharing.

Cross-account data sharing is a two-step process. First, a producer cluster administrator creates a datashare, adds objects, and gives access to the consumer account. Then the producer account administrator authorizes sharing data for the specified consumer. You can do this from the Amazon Redshift console.
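
The rest of this post uses the console, but you can also script these two producer-side steps. The following is a minimal sketch using the Amazon Redshift Data API and the Amazon Redshift API; it assumes the producer cluster and table from the next section already exist, and the cluster identifier, database, database user, datashare name (bank_datashare), and consumer account ID are hypothetical placeholders.

import time

import boto3

# Hypothetical values; replace with your own.
PRODUCER_CLUSTER = "producer-cluster"
DATABASE = "dev"
DB_USER = "awsuser"
CONSUMER_ACCOUNT_ID = "111122223333"

redshift_data = boto3.client("redshift-data")
redshift = boto3.client("redshift")

# Step 1: create the datashare, add objects, and grant usage to the consumer account.
for sql in [
    "CREATE DATASHARE bank_datashare;",
    "ALTER DATASHARE bank_datashare ADD SCHEMA public;",
    "ALTER DATASHARE bank_datashare ADD ALL TABLES IN SCHEMA public;",
    f"GRANT USAGE ON DATASHARE bank_datashare TO ACCOUNT '{CONSUMER_ACCOUNT_ID}';",
]:
    stmt = redshift_data.execute_statement(
        ClusterIdentifier=PRODUCER_CLUSTER, Database=DATABASE, DbUser=DB_USER, Sql=sql
    )
    # The Data API is asynchronous; wait for each statement before submitting the next.
    while redshift_data.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)

# Step 2: authorize the datashare for the consumer account.
share_arn = next(
    d["DataShareArn"]
    for d in redshift.describe_data_shares()["DataShares"]
    if d["DataShareArn"].endswith("/bank_datashare")
)
redshift.authorize_data_share(DataShareArn=share_arn, ConsumerIdentifier=CONSUMER_ACCOUNT_ID)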

Create an Amazon Redshift datashare in the producer account

To create your datashare, complete the following steps:

  1. On the Amazon Redshift console, create an Amazon Redshift cluster.
  2. Specify Production and choose the RA3 node type.
  3. Under Additional configurations, deselect Use defaults.
  4. Under Database configurations, set up encryption for your cluster.
  6. After you create the cluster, import the direct marketing bank dataset. You can download it from the following URL: https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip.
  6. Upload bank-additional-full.csv to an Amazon Simple Storage Service (Amazon S3) bucket your cluster has access to.
  7. Use the Amazon Redshift query editor and run the following SQL query to copy the data into Amazon Redshift:
    create table bank_additional_full (
      age char(40),
      job char(40),
      marital char(40),
      education char(40),
      default_history varchar(40),
      housing char(40),
      loan char(40),
      contact char(40),
      month char(40),
      day_of_week char(40),
      duration char(40),
      campaign char(40),
      pdays char(40),
      previous char(40),
      poutcome char(40),
      emp_var_rate char(40),
      cons_price_idx char(40),
      cons_conf_idx char(40),
      euribor3m char(40),
      nr_employed char(40),
      y char(40));
    copy bank_additional_full
    from '<S3 LOCATION OF THE CSV FILE>'
    iam_role '<CLUSTER ROLE ARN>'
    region 'us-east-1'
    format csv
    IGNOREBLANKLINES
    IGNOREHEADER 1;

  8. Navigate to the cluster details page and on the Datashares tab, choose Create datashare.
  9. For Datashare name, enter a name.
  10. For Database name, choose a database.
  11. In the Add datashare objects section, choose the objects from the database you want to include in the datashare.
    You have granular control of what you choose to share with others. For simplicity, we share all the tables. In practice, you might choose one or more tables, views, or user-defined functions.
  12. Choose Add.
  13. To add data consumers, select Add AWS accounts to the datashare and add your secondary AWS account ID.
  14. Choose Create datashare.
  15. To authorize the data consumer you just created, go to the Datashares page on the Amazon Redshift console and choose the new datashare.
  16. Select the data consumer and choose Authorize.

The consumer status changes from Pending authorization to Authorized.

Access the Amazon Redshift cross-account datashare in the consumer AWS account

Now that the datashare is set up, switch to your consumer AWS account to consume the datashare. Make sure you have at least one Amazon Redshift cluster created in your consumer account. The cluster has to be encrypted and in the same Region as the source.

  1. On the Amazon Redshift console, choose Datashares in the navigation pane.
  2. On the From other accounts tab, select the datashare you created and choose Associate.
  3. You can associate the datashare with one or more clusters in this account or associate the datashare to the entire account so that the current and future clusters in the consumer account get access to this share.
  4. Specify your connection details and choose Connect.
  5. Choose Create database from datashare and enter a name for your new database.
  6. To test the datashare, go to query editor and run queries against the new database to make sure all the objects are available as part of the datashare.
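
The preceding steps can also be scripted. The following minimal sketch associates the datashare with the whole consumer account and then creates a local database from it; the cluster identifier, database names, datashare ARN, producer account ID, and producer namespace are hypothetical placeholders.

import boto3

# Hypothetical values; replace with your own.
CONSUMER_CLUSTER = "consumer-cluster"
DATASHARE_ARN = "arn:aws:redshift:<region>:<producer-account-id>:datashare:<producer-namespace>/bank_datashare"

redshift = boto3.client("redshift")
redshift_data = boto3.client("redshift-data")

# Associate the datashare with the entire consumer account
# (or pass ConsumerArn instead to associate a single cluster namespace).
redshift.associate_data_share_consumer(
    DataShareArn=DATASHARE_ARN, AssociateEntireAccount=True
)

# Create a local database from the datashare so its objects can be queried.
redshift_data.execute_statement(
    ClusterIdentifier=CONSUMER_CLUSTER,
    Database="dev",
    DbUser="awsuser",
    Sql=(
        "CREATE DATABASE bank_datashare_db FROM DATASHARE bank_datashare "
        "OF ACCOUNT '<producer-account-id>' NAMESPACE '<producer-namespace>';"
    ),
)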

Analyze and process data with Data Wrangler

You can now use Data Wrangler to access the cross-account data created as a datashare in Amazon Redshift.

  1. Open Amazon SageMaker Studio.
  2. On the File menu, choose New and Data Wrangler Flow.
  3. On the Import tab, choose Add data source and Amazon Redshift.
  4. Enter the connection details of the Amazon Redshift cluster you just created in the consumer account for the datashare.
  5. Choose Connect.
  6. Use the AWS Identity and Access Management (IAM) role you used for your Amazon Redshift cluster.

Note that even though the datashare is a new database in the Amazon Redshift cluster, you can’t connect to it directly from Data Wrangler.

The correct way is to connect to the default cluster database first, and then use SQL to query the datashare database. Provide the required information for connecting to the default cluster database. Note that an AWS Key Management Service (AWS KMS) key ID is not required in order to connect.
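
For example, if the database you created from the datashare is named bank_datashare_db (the hypothetical name used earlier), the query you paste into the Data Wrangler SQL editor reaches the shared table with three-part notation (database.schema.table):

# Query to paste into the Data Wrangler SQL editor after connecting to the
# default cluster database; adjust the database, schema, and table names as needed.
query = """
SELECT *
FROM bank_datashare_db.public.bank_additional_full
LIMIT 1000;
"""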

Data Wrangler is now connected to the Amazon Redshift instance.

  1. Query the data in the Amazon Redshift datashare database using a SQL editor.
  2. Choose Import to import the dataset to Data Wrangler.
  3. Enter a name for the dataset and choose Add.

You can now see the flow on the Data Flow tab of Data Wrangler.

After you have loaded the data into Data Wrangler, you can do exploratory data analysis and prepare data for ML.

  1. Choose the plus sign and choose Add analysis.

Data Wrangler provides built-in analyses. These include but aren’t limited to a data quality and insights report, data correlation, a pre-training bias report, a summary of your dataset, and visualizations (such as histograms and scatter plots). You can also create your own custom visualization.

You can use the Data Quality and Insights Report to automatically generate visualizations and analyses to identify data quality issues, and recommend the right transformation required for your dataset.

  1. Choose Data Quality and Insights Report, and for Target column, choose y.
  2. Because this is a classification problem statement, for Problem type, select Classification.
  3. Choose Create.

Data Wrangler creates a detailed report on your dataset. You can also download the report to your local machine.

  1. For data preparation, choose the plus sign and choose Add transform.
  2. Choose Add step to start building your transformations.

At the time of this writing, Data Wrangler provides over 300 built-in transformations. You can also write your own transformations using Pandas or PySpark.
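
For example, in a Custom Transform step using Python (Pandas), Data Wrangler exposes the current dataset as a dataframe named df and expects the transformed dataframe back in df. The following minimal sketch, which assumes the column names of the banking dataset used in this post, casts a few columns that were loaded as text to numeric and derives a simple flag:

# Custom Transform (Python Pandas) sketch for Data Wrangler.
# Data Wrangler provides the current dataset as `df`; assign the result back to `df`.
import pandas as pd

numeric_cols = ["age", "duration", "campaign", "pdays", "previous"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

# In this dataset, pdays = 999 means the client was not contacted in a previous campaign.
df["previously_contacted"] = (df["pdays"] != 999).astype(int)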

You can now start building your transforms and analysis based on your business requirement.

Conclusion

In this post, we explored sharing data across accounts using Amazon Redshift datashares without having to manually download and upload data. We walked through how to access the shared data using Data Wrangler and prepare the data for your ML use cases. This no-code/low-code capability of Amazon Redshift datashares and Data Wrangler accelerates training data preparation and increases the agility of data engineers and data scientists with faster iterative data preparation.

To learn more about Amazon Redshift and SageMaker, refer to the Amazon Redshift Database Developer Guide and Amazon SageMaker Documentation.


About the Authors

 Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with AWS. He helps hi-tech strategic accounts on their AI and ML journey. He is very passionate about data-driven AI.

James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in the marketing and advertising industries.


Predict types of machine failures with no-code machine learning using Amazon SageMaker Canvas

Predicting common machine failure types is critical in manufacturing industries. Given a set of product characteristics tied to a given type of failure, you can develop a machine learning (ML) model that predicts the failure type when you feed it those attributes. ML can help with insights, but until now you needed ML experts to build models to predict machine failure types, and the lack of such experts could delay the corrective actions businesses need for efficiency or improvement.

In this post, we show you how business analysts can build a machine failure type prediction ML model with Amazon SageMaker Canvas. Canvas provides you with a visual point-and-click interface that allows you to build models and generate accurate ML predictions on your own—without requiring any ML experience or having to write a single line of code.

Solution overview

Let’s assume you’re a business analyst assigned to a maintenance team of a large manufacturing organization. Your maintenance team has asked you to assist in predicting common failures. They have provided you with a historical dataset that contains characteristics tied to a given type of failure and would like you to predict which failure will occur in the future. The failure types include No Failure, Overstrain Failure, and Power Failure. The data schema is as follows.

  • UID (INT) – Unique identifier ranging from 1–10,000
  • productID (STRING) – Consists of a letter (L, M, or H for low, medium, or high product quality variants) and a variant-specific serial number
  • type (STRING) – Initial letter associated with productID, consisting of L, M, or H only
  • air temperature [K] (DECIMAL) – Air temperature specified in kelvin
  • process temperature [K] (DECIMAL) – Precisely controlled temperatures to ensure quality of a given type of product, specified in kelvin
  • rotational speed [rpm] (DECIMAL) – Number of turns of the rotating object divided by time, specified as revolutions per minute
  • torque [Nm] (DECIMAL) – Machine turning force through a radius, expressed in newton meters
  • tool wear [min] (INT) – Tool wear expressed in minutes
  • failure type (target) (STRING) – No Failure, Power Failure, or Overstrain Failure

After the failure type is identified, businesses can take any corrective actions. To do this, you use the data you have in a CSV file, which contains certain characteristics of a product as outlined in the table. You use Canvas to perform the following steps:

  1. Import the maintenance dataset.
  2. Train and build the predictive machine maintenance model.
  3. Analyze the model results.
  4. Test predictions against the model.

Prerequisites

A cloud admin with an AWS account with appropriate permissions is required to complete the following prerequisites:

  1. Deploy an Amazon SageMaker domain. For instructions, see Onboard to Amazon SageMaker Domain.
  2. Launch Canvas. For instructions, see Setting up and managing Amazon SageMaker Canvas (for IT administrators).
  3. Configure cross-origin resource sharing (CORS) policies for Canvas. For instructions, see Give your users the ability to upload local files.

Import the dataset

First, download the maintenance dataset and review the file to make sure all the data is there.

Canvas provides several sample datasets in your application to help you get started. To learn more about the SageMaker-provided sample datasets you can experiment with, see Use sample datasets. If you use the sample dataset (canvas-sample-maintenance.csv) available within Canvas, you don’t have to import the maintenance dataset.


You can import data from different data sources into Canvas. If you plan to use your own dataset, follow the steps in Importing data in Amazon SageMaker Canvas.

For this post, we use the full maintenance dataset that we downloaded.

  1. Sign in to the AWS Management Console, using an account with the appropriate permissions to access Canvas.
  2. Log in to the Canvas console.
  3. Choose Import.
  4. Choose Upload and select the maintenance_dataset.csv file.
  5. Choose Import data to upload it to Canvas.

The import process takes approximately 10 seconds (this can vary depending on dataset size). When it’s complete, you can see the dataset is in Ready status.

After you confirm that the imported dataset is ready, you can create your model.

Build and train the model

To create and train your model, complete the following steps:

  1. Choose New model, and provide a name for your model.
  2. Choose Create.
  3. On the Select tab, select the maintenance_dataset.csv dataset you uploaded previously and choose Select dataset.
    In the model view, you can see four tabs, which correspond to the four steps to create a model and use it to generate predictions: Select, Build, Analyze, and Predict. The dataset includes 9 columns and 10,000 rows. After you select it, Canvas automatically moves to the Build phase.
  4. On the Build tab, choose the target column, in our case Failure Type. The maintenance team has informed you that this column indicates the type of failures typically seen based on historical data from their existing machines. This is what you want to train your model to predict. Canvas automatically detects that this is a 3 Category problem (also known as multi-class classification). If the wrong model type is detected, you can change it manually with the Change type option.
    It should be noted that this dataset is highly unbalanced towards the No Failure class, which you can see by viewing the Failure Type column. Although Canvas and the underlying AutoML capabilities can partly handle dataset imbalance, this may result in some skewed performance. As an additional next step, refer to Balance your data for machine learning with Amazon SageMaker Data Wrangler. Following the steps in that post, you can launch an Amazon SageMaker Studio app from the SageMaker console, import this dataset into Amazon SageMaker Data Wrangler, apply the Balance data transformation, and then take the balanced dataset back to Canvas and continue with the following steps. We proceed with the imbalanced dataset in this post to show that Canvas can handle imbalanced datasets as well.
    In the bottom half of the page, you can look at some of the statistics of the dataset, including missing and mismatched values, unique values, and mean and median values. You can also drop columns you don’t want to use for the prediction by simply deselecting them.
    After you’ve explored this section, it’s time to train the model. Before building a complete model, it’s a good practice to get a general idea of model performance by training a Quick Model. A quick model trains fewer combinations of models and hyperparameters in order to prioritize speed over accuracy, which is especially useful when you want to prove the value of training an ML model for your use case. Note that the quick build option isn’t available for datasets larger than 50,000 rows.
  5. Choose Quick build.

Now you wait anywhere from 2–15 minutes. Once done, Canvas automatically moves to the Analyze tab to show you the results of quick training. The analysis performed using quick build estimates that your model is able to predict the right failure type (outcome) 99.2% of the time. You may experience slightly different values. This is expected.

Let’s focus on the first tab, Overview. This is the tab that shows you the Column impact, or the estimated importance of each column in predicting the target column. In this example, the Torque [Nm] and Rotational speed [rpm] columns have the most significant impact in predicting what type of failure will occur.

Evaluate model performance

When you move to the Scoring portion of your analysis, you can see a plot representing the distribution of the predicted values with respect to the actual values. Notice that most of the records fall within the No Failure category. To learn more about how Canvas uses SHAP baselines to bring explainability to ML, refer to Evaluating Your Model’s Performance in Amazon SageMaker Canvas, as well as SHAP Baselines for Explainability.

Canvas splits the original dataset into train and validation sets before training. The scoring is a result of Canvas running the validation set against the model. This is an interactive interface where you can select a failure type. If you choose Overstrain Failure in the graphic, you can see that the model identifies this failure type 84% of the time. This is good enough to take action on, for example by having an operator or engineer check further. You can choose Power Failure in the graphic to see the respective scoring for further interpretation and actions.

You may be interested in failure types and how well the model predicts failure types based on a series of inputs. To take a closer look at the results, choose Advanced metrics. This displays a matrix that allows you to more closely examine the results. In ML, this is referred to as a confusion matrix.

This matrix defaults to the dominant class, No Failure. On the Class menu, you can choose to view the advanced metrics for the other two failure types, Overstrain Failure and Power Failure.

In ML, the accuracy of the model is defined as the number of correct predictions divided by the total number of predictions. The blue boxes represent correct predictions that the model made against a subset of test data where there was a known outcome. Here we are interested in what percentage of the time the model predicted a particular machine failure type (let’s say No Failure) when it’s actually that failure type (No Failure). In ML, a ratio used to measure this is TP / (TP + FN), referred to as recall. In the default case, No Failure, there were 1,923 correct predictions out of 1,926 overall records, which resulted in 99% recall. In the Overstrain Failure class, there were 32 out of 38, which results in 84% recall. Lastly, in the Power Failure class, there were 16 out of 19, which results in 84% recall.
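
As a quick sanity check, you can recompute these recall values from the counts in the confusion matrix (correct predictions and actual instances per class):

# Recall = TP / (TP + FN), using the counts reported above.
counts = {
    "No Failure": (1923, 1926),        # (correct predictions, actual instances)
    "Overstrain Failure": (32, 38),
    "Power Failure": (16, 19),
}
for failure_type, (tp, actual) in counts.items():
    print(f"{failure_type}: recall = {tp / actual:.3f}")
# No Failure: 0.998, Overstrain Failure: 0.842, Power Failure: 0.842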

Now, you have two options:

  1. You can use this model to run some predictions by choosing Predict.
  2. You can create a new version of this model to train with the Standard build option. This will take much longer—about 1–2 hours—but provides a more robust model because it goes through a full AutoML review of data, algorithms, and tuning iterations.

Because you’re trying to predict failures, and the model predicts failures correctly 84% of time, you can confidently use the model to identify possible failures. So, you can proceed to option 1. If you weren’t confident, then you could have a data scientist review the modeling Canvas did and offer potential improvements via option 2.

Generate predictions

Now that the model is trained, you can start generating predictions.

  1. Choose Predict at the bottom of the Analyze page, or choose the Predict tab.
  2. Choose Select dataset, and choose the maintenance_dataset.csv file.
  3. Choose Generate predictions.

Canvas uses this dataset to generate the predictions. Although it’s generally a good idea not to use the same dataset for both training and testing, we use the same dataset for the sake of simplicity in this case. Alternatively, you can remove some records from the original dataset before training, save those records in a separate CSV file, and feed that file to the batch prediction here, so that you don’t test the model on the same data it was trained on.
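
A minimal sketch of that alternative with pandas follows; the output file names are hypothetical, and you would import the training file into Canvas and keep the holdout file for batch predictions:

import pandas as pd

# Hold out 10% of rows before training so the model isn't tested on data it has seen.
df = pd.read_csv("maintenance_dataset.csv")
holdout = df.sample(frac=0.1, random_state=42)
train = df.drop(holdout.index)

train.to_csv("maintenance_train.csv", index=False)      # import this file into Canvas
holdout.to_csv("maintenance_holdout.csv", index=False)  # use this file for batch predictions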

After a few seconds, the prediction is complete. Canvas returns a prediction for each row of data and the probability of the prediction being correct. You can choose Preview to view the predictions, or choose Download to download a CSV file containing the full output.

You can also predict values one at a time by choosing Single prediction instead of Batch prediction. Canvas shows you a view where you can provide the values for each feature manually and generate a prediction. This is ideal for what-if scenarios, for example: How does tool wear impact the failure type? What if the process temperature increases or decreases? What if the rotational speed changes?

Standard build

The Standard build option chooses accuracy over speed. If you want to share the artifacts of the model with your data scientist and ML engineers, you can create a standard build next.

  1. Choose Add version.
  2. Choose a new version and choose Standard build.
  3. After you create a standard build, you can share the model with data scientists and ML engineers for further evaluation and iteration.

Clean up

To avoid incurring future session charges, log out of Canvas.

Conclusion

In this post, we showed how a business analyst can create a machine failure type prediction model with Canvas using maintenance data. Canvas allows business analysts such as reliability engineers to create accurate ML models and generate predictions using a no-code, visual, point-and-click interface. Analysts can take this to the next level by sharing their models with data scientist colleagues. Data scientists can view the Canvas model in Studio, where they can explore the choices Canvas made, validate model results, and even take the model to production with a few clicks. This can accelerate ML-based value creation and help scale improved outcomes faster.

To learn more about using Canvas, see Build, Share, Deploy: how business analysts and data scientists achieve faster time-to-market using no-code ML and Amazon SageMaker Canvas. For more information about creating ML models with a no-code solution, see Announcing Amazon SageMaker Canvas – a Visual, No Code Machine Learning Capability for Business Analysts.


About the Authors

Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customers guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.

Twann Atkins is a Senior Solutions Architect for Amazon Web Services. He is responsible for working with Agriculture, Retail, and Manufacturing customers to identify business problems and working backwards to identify viable and scalable technical solutions. Twann has been helping customers plan and migrate critical workloads for more than 10 years with a recent focus on democratizing analytics, artificial intelligence and machine learning for customers and builders of tomorrow.

Omkar Mukadam is an Edge Specialist Solutions Architect at Amazon Web Services. He currently focuses on solutions that enable commercial customers to effectively design, build, and scale with AWS Edge service offerings, including but not limited to the AWS Snow Family.


Visual inspection automation using Amazon SageMaker JumpStart

According to Gartner, hyperautomation is the number one trend in 2022 and will continue advancing in the future. One of the main barriers to hyperautomation is in areas where we’re still struggling to reduce human involvement. Intelligent systems have a hard time matching human visual recognition abilities, despite great advancements in deep learning in computer vision. This is mainly due to the lack of annotated data (or sparse data) in areas such as quality control, where trained human eyes still dominate. Another reason is the limited feasibility of human access in all areas of the product supply chain, such as quality control inspection on the production line. Visual inspection is widely used for performing internal and external assessment of various equipment in a production facility, such as storage tanks, pressure vessels, piping, vending machines, and other equipment, and it extends to many industries, such as electronics, medical, CPG, and raw materials.

Using Artificial Intelligence (AI) for automated visual inspection or augmenting the human visual inspection process with AI can help address the challenges outlined below.

Challenges of human visual inspection

Human-led visual inspection has the following high-level issues:

  • Scale – Most products go through multiple stages, from assembly to supply chain to quality control, before being made available to the end consumer. Defects can occur during manufacturing or assembly at different points in space and time. Therefore, it’s not always feasible or cost-effective to use in-person human visual inspection. This inability to scale can result in disasters such as the BP Deepwater Horizon oil spill and the Challenger space shuttle explosion, whose overall negative impact (on humans and nature) far exceeds the monetary cost.
  • Human visual error – In areas where human-led visual inspection can be conveniently performed, human error is a major factor that often goes overlooked. According to one report, most inspection tasks are complex and typically exhibit error rates of 20–30%, which directly translates to cost and undesirable outcomes.
  • Personnel and miscellaneous costs – Although the overall cost of quality control can vary greatly depending on industry and location, according to some estimates, a trained quality inspector’s salary ranges from $26,000 to $60,000 (USD) per year. There are also other miscellaneous costs that may not always be accounted for.

SageMaker JumpStart is a great place to get started with various Amazon SageMaker features and capabilities through curated one-click solutions, example notebooks, and pre-trained computer vision, natural language processing, and tabular data models that you can choose, fine-tune (if needed), and deploy using SageMaker infrastructure.

In this post, we walk through how to quickly deploy an automated defect detection solution, from data ingestion to model inferencing, using a publicly available dataset and SageMaker JumpStart.

Solution overview

This solution uses a state-of-the-art deep learning approach to automatically detect surface defects using SageMaker. The Defect Detection Network (DDN) model enhances Faster R-CNN and identifies possible defects in an image of a steel surface. The NEU surface defect database is a balanced dataset that contains six kinds of typical surface defects of a hot-rolled steel strip: rolled-in scale (RS), patches (Pa), crazing (Cr), pitted surface (PS), inclusion (In), and scratches (Sc). The database includes 1,800 grayscale images: 300 samples of each type of defect.

Content

The JumpStart solution contains the following artifacts, which are available to you from the JupyterLab File Browser:

  • cloudformation/ – AWS CloudFormation configuration files to create relevant SageMaker resources and apply permissions. Also includes cleanup scripts to delete created resources.
  • src/ – Contains the following:

    • prepare_data/ – Data preparation for NEU datasets.
    • sagemaker_defect_detection/ – Main package containing the following:

      • dataset – Contains NEU dataset handling.
      • models – Contains the Automated Defect Inspection (ADI) system, called the Defect Detection Network (DDN). See the following paper for details.
      • utils – Various utilities for visualization and COCO evaluation.
      • classifier.py – For the classification task.
      • detector.py – For the detection task.
      • transforms.py – Contains the image transformations used in training.
  • notebooks/ – The individual notebooks, discussed in more detail later in this post.
  • scripts/ – Various scripts for training and building.

Default dataset

This solution trains a classifier on the NEU-CLS dataset and a detector on the NEU-DET dataset. The dataset contains 1,800 images and 4,189 bounding boxes in total. The types of defects in our dataset are as follows:

  • Crazing (class: Cr, label: 0)
  • Inclusion (class: In, label: 1)
  • Pitted surface (class: PS, label: 2)
  • Patches (class: Pa, label: 3)
  • Rolled-in scale (class: RS, label: 4)
  • Scratches (class: Sc, label: 5)

The following are sample images of the six classes.

The following images are sample detection results. From left to right, we have the original image, the ground truth detection, and the SageMaker DDN model output.

Architecture

The JumpStart solution comes pre-packaged with Amazon SageMaker Studio notebooks that download the required datasets and contain the code and helper functions for training the model/s and deployment using a real-time SageMaker endpoint.

All notebooks download the dataset from a public Amazon Simple Storage Service (Amazon S3) bucket and import helper functions to visualize the images. The notebooks let you customize the solution, for example by changing the hyperparameters for model training or performing transfer learning, in case you choose to use the solution for your own defect detection use case.

The solution contains the following four Studio notebooks:

  • 0_demo.ipynb – Creates a model object from a pre-trained DDN model on the NEU-DET dataset and deploys it behind a real-time SageMaker endpoint. Then we send some image samples with defects for detection and visualize the results.
  • 1_retrain_from_checkpoint.ipynb – Retrains our pre-trained detector for a few more epochs and compares results. You can also bring your own dataset; however, we use the same dataset in the notebook. Also included is a step to perform transfer learning by fine-tuning the pre-trained model. Fine-tuning a deep learning model on one particular task involves using the learned weights from a particular dataset to enhance the performance of the model on another dataset. You can also perform fine-tuning over the same dataset used in the initial training but perhaps with different hyperparameters.
  • 2_detector_from_scratch.ipynb – Trains our detector from scratch to identify if defects exist in an image.
  • 3_classification_from_scratch.ipynb – Trains our classifier from scratch to classify the type of defect in an image.

Each notebook contains boilerplate code that deploys a SageMaker real-time endpoint for model inferencing. You can view the list of notebooks by going to the JupyterLab file browser and navigating to the notebooks folder in the JumpStart solution directory, or by choosing Open Notebook on the Product Defect Detection solution page in JumpStart.

Prerequisites

The solution outlined in this post is part of Amazon SageMaker JumpStart. To run this SageMaker JumpStart 1P Solution and have the infrastructure deploy to your AWS account, you need to create an active Amazon SageMaker Studio instance (see Onboard to Amazon SageMaker Domain).

JumpStart features are not available in SageMaker notebook instances, and you can’t access them through the AWS Command Line Interface (AWS CLI).

Deploy the solution

We provide walkthrough videos for the high-level steps on this solution. To start, launch SageMaker JumpStart and choose the Product Defect Detection solution on the Solutions tab.

The provided SageMaker notebooks download the input data and launch the later stages. The input data is located in an S3 bucket.

We train the classifier and detector models and evaluate the results in SageMaker. If desired, you can deploy the trained models and create SageMaker endpoints.

The SageMaker endpoint created from the previous step is an HTTPS endpoint and is capable of producing predictions.
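
For example, after the demo notebook (0_demo.ipynb) deploys the endpoint, you can invoke it with the SageMaker runtime API. The endpoint name and content type below are assumptions for illustration; use the values shown in the notebook:

import boto3

# Hypothetical endpoint name; use the name created by 0_demo.ipynb.
ENDPOINT_NAME = "sagemaker-defect-detection-endpoint"

runtime = boto3.client("sagemaker-runtime")
with open("sample_steel_surface.png", "rb") as f:
    payload = f.read()

response = runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType="application/x-image",  # assumed; check the serializer used in the notebook
    Body=payload,
)
print(response["Body"].read()[:500])  # detection output (classes, boxes, scores)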

You can monitor the model training and deployment via Amazon CloudWatch.

Clean up

When you’re finished with this solution, make sure that you delete all unwanted AWS resources. You can use AWS CloudFormation to automatically delete all standard resources that were created by the solution and notebook. On the AWS CloudFormation console, delete the parent stack. Deleting the parent stack automatically deletes the nested stacks.

You need to manually delete any extra resources that you may have created in this notebook, such as extra S3 buckets in addition to the solution’s default bucket or extra SageMaker endpoints (using a custom name).

Conclusion

In this post, we introduced a solution using SageMaker JumpStart to address issues with the current state of visual inspection, quality control, and defect detection in various industries. We recommended a novel approach, an Automated Defect Inspection system built using a pre-trained DDN model, for defect detection on steel surfaces. After you launched the JumpStart solution and downloaded the public NEU datasets, you deployed a pre-trained model behind a SageMaker real-time endpoint and analyzed the endpoint metrics using CloudWatch. We also discussed other features of the JumpStart solution, such as how to bring your own training data, perform transfer learning, and retrain the detector and classifier.

Try out this JumpStart solution in SageMaker Studio: either retrain the existing model on a new dataset for defect detection, or pick from SageMaker JumpStart’s library of computer vision, NLP, and tabular models and deploy them for your specific use case.


About the Authors

Vedant Jain is a Sr. AI/ML Specialist Solutions Architect, helping customers derive value out of the Machine Learning ecosystem at AWS. Prior to joining AWS, Vedant has held ML/Data Science Specialty positions at various companies such as Databricks, Hortonworks (now Cloudera) & JP Morgan Chase. Outside of his work, Vedant is passionate about making music, using Science to lead a meaningful life & exploring delicious vegetarian cuisine from around the world.

Tao Sun is an Applied Scientist in AWS. He obtained his Ph.D. in Computer Science from University of Massachusetts, Amherst. His research interests lie in deep reinforcement learning and probabilistic modeling. He contributed to AWS DeepRacer, AWS DeepComposer. He likes ballroom dance and reading during his spare time.


Accelerate your career with ML skills through the AWS Machine Learning Engineer Scholarship

Amazon Web Services and Udacity are partnering to offer free services to educate developers of all skill levels on machine learning (ML) concepts with the AWS Machine Learning Engineer Scholarship program. The program offers free enrollment in the AWS Machine Learning Foundations course and 325 scholarships for the AWS Machine Learning Engineer Nanodegree, a $2,000 USD value, powered by Udacity.

Machine learning will not only change the way we work and live, but also open pathways to millions of new jobs, with the World Economic Forum estimating that 97 million new roles may be created in AI and ML by 2025. However, gaining the job-ready skills to break into an ML career is often hindered by the high cost of traditional education, rigorous content, and a lack of real-world application to bridge theory and practice. AWS is invested in addressing these challenges by providing free educational content and hands-on learning, such as exploring reinforcement learning concepts with AWS DeepRacer, as well as a community of learner support with technical experts and like-minded peers.

“The AWS Machine Learning Engineer Nanodegree Program gave me a solid footing in understanding the foundational building blocks of Machine Learning workflows,” said Jikmyan Mangut Sunday, AWS Machine Learning Scholarship alumni. “This shaped my knowledge of the fundamental concepts in building state-of-the-art Machine Learning models. Udacity curated learning materials that were easy to grasp and applicable to every field of endeavor. My learning experience was challenging and fun-filled.”

AWS is also collaborating with Girls in Tech and the National Society for Black Engineers to provide scholarships to women and underrepresented groups in tech. Organizations like these aim to inspire, support, train, and empower people from underrepresented groups to pursue careers in tech. Through this partnership, AWS will help provide access and resources to programs such as the AWS Machine Learning Engineer Scholarship Program to increase the diversity and talent in technical roles.

“Tech needs representation from women, BIPOC, and other marginalized communities in every aspect of our industry,” says Adriana Gascoigne, founder and CEO of Girls in Tech. “Girls in Tech applauds our collaborator AWS, as well as Udacity, for breaking down the barriers that so often leave women behind in tech. Together, we aim to give everyone a seat at the table.”

Open pathways to new career opportunities

Learners in the program can put theory into practice with hands-on application of a suite of AWS ML services, including AWS DeepRacer, Amazon SageMaker, and AWS DeepComposer. Because many people struggle to get started with machine learning, the scholarship program provides easy-to-follow, self-paced modules that offer flexibility at a self-guided pace. Throughout the course journey, learners have access to a supportive online community for technical assistance through Udacity tutors.

“Before taking the program, the many tools provided by AWS seemed frustrating but now I have a good grasp of them. I learned how to organize my code and work in a professional setting,” said Kariem Gazer, AWS Machine Learning Scholarship alumni. “The organized modules, follow up quizzes, and personalized feedback all made the learning experience smoother and concrete.”

Gain ML skills beyond the classroom

The AWS Machine Learning Engineer Scholarship program is open to all developers interested in expanding their ML skills and expertise through AWS curated content and services. Applicants 18 years of age or older are invited to register for the program. All applicants will have immediate classroom access to the free AWS ML Foundations course upon application completion.

Phase 1: AWS Machine Learning Foundations Course

  • Learn object-oriented programming skills, including writing clean and modularized code and understanding the fundamental aspects of ML.
  • Learn reinforcement learning with AWS DeepRacer and generative AI with AWS DeepComposer.
  • Take advantage of support through the Discourse Tech community with technical moderators.
  • Receive a certificate for course completion and take an online assessment quiz to receive a full scholarship to the AWS Machine Learning Engineer Nanodegree program.
  • Dedicate 3–5 hours a week on the course and work towards earning one of the follow-up Nanodegree program scholarships.

Phase 2: Full scholarship to the AWS Machine Learning Engineer Udacity Nanodegree ($2,000 USD value)

  • Learn advanced ML techniques and algorithms, including how to package and deploy your models to a production environment.
  • Acquire practical experience such as using Amazon SageMaker to prepare you for a career in ML.
  • Take advantage of community support through a learner connect program for technical assistance and learner engagement.
  • Dedicate 5–10 hours a week on the course to earn an Udacity Nanodegree certificate.

Program dates

  • June 21, 2022 – Scholarship applications open and students are automatically enrolled in the AWS Machine Learning Foundations Course (Phase 1)
  • July 21, 2022 – Scholarship applications close
  • November 23, 2022 – AWS Machine Learning Foundations Course (Phase 1) ends
  • December 6, 2022 – AWS Machine Learning Engineer Scholarship winners announced
  • December 8, 2022 – AWS Machine Learning Engineer Nanodegree (Phase 2) opens
  • March 22, 2023 – AWS Machine Learning Engineer Nanodegree (Phase 2) closes

Connect with the ML community and take the next step

Connect with experts and like-minded aspiring ML developers on the AWS Machine Learning Discord and enroll today in the AWS Machine Learning Engineer Scholarship program.


About the Author

Anastacia Padilla is a Product Marketing Manager for AWS AI & ML Education. She spends her time building and evangelizing offerings for the aspiring ML developer community to upskill students and underrepresented groups in tech. She is focused on democratizing AI & ML education to be accessible to all who want to learn.


Identify mangrove forests using satellite image features using Amazon SageMaker Studio and Amazon SageMaker Autopilot – Part 2

Mangrove forests are an important part of a healthy ecosystem, and human activities are one of the major reasons for their gradual disappearance from coastlines around the world. Using a machine learning (ML) model to identify mangrove regions from a satellite image gives researchers an effective way to monitor the size of the forests over time. In Part 1 of this series, we showed how to gather satellite data in an automated fashion and analyze it in Amazon SageMaker Studio with interactive visualization. In this post, we show how to use Amazon SageMaker Autopilot to automate the process of building a custom mangrove classifier.

Train a model with Autopilot

Autopilot provides a balanced way of building several models and selecting the best one. While creating multiple combinations of different data preprocessing techniques and ML models with minimal effort, Autopilot provides complete control over these component steps to the data scientist, if desired.

You can use Autopilot through one of the AWS SDKs (details are available in the API reference guide for Autopilot) or through Studio. We use Autopilot in our Studio solution, following the steps outlined in this section:

  1. On the Studio Launcher page, choose the plus sign for New Autopilot experiment.
  2. For Connect your data, select Find S3 bucket, and enter the bucket name where you kept the training and test datasets.
  3. For Dataset file name, enter the name of the training data file you created in the Prepare the training data section in Part 1.
  4. For Output data location (S3 bucket), enter the same bucket name you used in step 2.
  5. For Dataset directory name, enter a folder name under the bucket where you want Autopilot to store artifacts.
  6. For Is your S3 input a manifest file?, choose Off.
  7. For Target, choose label.
  8. For Auto deploy, choose Off.
  9. Under the Advanced settings, for Machine learning problem type, choose Binary Classification.
  10. For Objective metric, choose AUC.
  11. For Choose how to run your experiment, choose No, run a pilot to create a notebook with candidate definitions.
  12. Choose Create Experiment.

    For more information about creating an experiment, refer to Create an Amazon SageMaker Autopilot experiment. It may take about 15 minutes to run this step.
  13. When complete, choose Open candidate generation notebook, which opens a new notebook in read-only mode.
  14. Choose Import notebook to make the notebook editable.
  15. For Image, choose Data Science.
  16. For Kernel, choose Python 3.
  17. Choose Select.

This auto-generated notebook has detailed explanations and provides complete control over the actual model building task to follow. A customized version of the notebook, where a classifier is trained using Landsat satellite bands from 2013, is available in the code repository under notebooks/mangrove-2013.ipynb.

The model building framework consists of two parts: feature transformation as part of the data processing step, and hyperparameter optimization (HPO) as part of the model selection step. All the necessary artifacts for these tasks were created during the Autopilot experiment and saved in Amazon Simple Storage Service (Amazon S3). The first notebook cell downloads those artifacts from Amazon S3 to the local Amazon SageMaker file system for inspection and any necessary modification. There are two folders: generated_module and sagemaker_automl, where all the Python modules and scripts necessary to run the notebook are stored. The various feature transformation steps like imputation, scaling, and PCA are saved as generated_modules/candidate_data_processors/dpp*.py.

Autopilot creates three different models based on the XGBoost, linear learner, and multi-layer perceptron (MLP) algorithms. A candidate pipeline consists of one of the feature transformation options, known as data_transformer, and an algorithm. A pipeline is a Python dictionary and can be defined as follows:

candidate1 = {
    "data_transformer": {
        "name": "dpp5",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
            "volume_size_in_gb":  50
        },
        "transform_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
        "transforms_label": True,
        "transformed_data_format": "application/x-recordio-protobuf",
        "sparse_encoding": True
    },
    "algorithm": {
        "name": "xgboost",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
    }
}

In this example, the pipeline transforms the training data according to the script in generated_modules/candidate_data_processors/dpp5.py and builds an XGBoost model. This is where Autopilot provides complete control to the data scientist, who can pick the automatically generated feature transformation and model selection steps or build their own combination.

You can now add the pipeline to a pool for Autopilot to run the experiment as follows:

from sagemaker_automl import AutoMLInteractiveRunner, AutoMLLocalCandidate

automl_interactive_runner = AutoMLInteractiveRunner(AUTOML_LOCAL_RUN_CONFIG)
automl_interactive_runner.select_candidate(candidate1)

This is an important step where you can decide to keep only a subset of candidates suggested by Autopilot, based on subject matter expertise, to reduce the total runtime. For now, keep all Autopilot suggestions, which you can list as follows:

automl_interactive_runner.display_candidates()
Candidate Name Algorithm Feature Transformer
dpp0-xgboost xgboost dpp0.py
dpp1-xgboost xgboost dpp1.py
dpp2-linear-learner linear-learner dpp2.py
dpp3-xgboost xgboost dpp3.py
dpp4-xgboost xgboost dpp4.py
dpp5-xgboost xgboost dpp5.py
dpp6-mlp mlp dpp6.py

The full Autopilot experiment is done in two parts. First, you need to run the data transformation jobs:

automl_interactive_runner.fit_data_transformers(parallel_jobs=7)

This step should complete in about 30 minutes for all the candidates, if you make no further modifications to the dpp*.py files.

The next step is to build the best set of models by tuning the hyperparameters for the respective algorithms. The hyperparameters are usually divided into two parts: static and tunable. The static hyperparameters remain unchanged throughout the experiment for all candidates that share the same algorithm. These hyperparameters are passed to the experiment as a dictionary. If you choose to pick the best XGBoost model by maximizing AUC from three rounds of a five-fold cross-validation scheme, the dictionary looks like the following code:

{
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    '_kfold': 5,
    '_num_cv_round': 3,
} 

For the tunable hyperparameters, you need to pass another dictionary with ranges and scaling type:

{
    'num_round': IntegerParameter(64, 1024, scaling_type='Logarithmic'),
    'max_depth': IntegerParameter(2, 8, scaling_type='Logarithmic'),
    'eta': ContinuousParameter(1e-3, 1.0, scaling_type='Logarithmic'),
...    
}

The complete set of hyperparameters is available in the mangrove-2013.ipynb notebook.

To create an experiment where all seven candidates can be tested in parallel, create a multi-algorithm HPO tuner:

multi_algo_tuning_parameters = automl_interactive_runner.prepare_multi_algo_parameters(
    objective_metrics=ALGORITHM_OBJECTIVE_METRICS,
    static_hyperparameters=STATIC_HYPERPARAMETERS,
    hyperparameters_search_ranges=ALGORITHM_TUNABLE_HYPERPARAMETER_RANGES)

The objective metrics are defined independently for each algorithm:

ALGORITHM_OBJECTIVE_METRICS = {
    'xgboost': 'validation:auc',
    'linear-learner': 'validation:roc_auc_score',
    'mlp': 'validation:roc_auc',
}

Trying all possible values of hyperparameters for all the experiments is wasteful; you can adopt a Bayesian strategy to create an HPO tuner:

from sagemaker.tuner import HyperparameterTuner

multi_algo_tuning_inputs = automl_interactive_runner.prepare_multi_algo_inputs()
base_tuning_job_name = "{}-tuning".format(AUTOML_LOCAL_RUN_CONFIG.local_automl_job_name)

tuner = HyperparameterTuner.create(
    base_tuning_job_name=base_tuning_job_name,
    strategy='Bayesian',
    objective_type='Maximize',
    max_parallel_jobs=10,
    max_jobs=50,
    **multi_algo_tuning_parameters,
)

In the default setting, Autopilot picks 250 jobs in the tuner to pick the best model. For this use case, it’s sufficient to set max_jobs=50 to save time and resources, without any significant penalty in terms of picking the best set of hyperparameters. Finally, submit the HPO job as follows:

tuner.fit(inputs=multi_algo_tuning_inputs, include_cls_metadata=None)

The process takes about 80 minutes on ml.m5.4xlarge instances. You can monitor progress on the SageMaker console by choosing Hyperparameter tuning jobs under Training in the navigation pane.

You can visualize a host of useful information, including the performance of each candidate, by choosing the name of the job in progress.
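
If you prefer to poll the tuning job programmatically instead of using the console, a minimal sketch with boto3 follows (it reuses the tuner object created earlier):

import boto3

sm_client = boto3.client("sagemaker")

# Poll the multi-algorithm tuning job launched by tuner.fit().
desc = sm_client.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.name
)
print(desc["HyperParameterTuningJobStatus"])   # InProgress, Completed, Failed, or Stopped
print(desc["TrainingJobStatusCounters"])       # completed / in-progress / failed job counts
print(desc.get("BestTrainingJob", {}).get("TrainingJobName"))  # best candidate so far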

Finally, compare the model performance of the best candidates as follows:

from sagemaker.analytics import HyperparameterTuningJobAnalytics

SAGEMAKER_SESSION = AUTOML_LOCAL_RUN_CONFIG.sagemaker_session
SAGEMAKER_ROLE = AUTOML_LOCAL_RUN_CONFIG.role

tuner_analytics = HyperparameterTuningJobAnalytics(
    tuner.latest_tuning_job.name, sagemaker_session=SAGEMAKER_SESSION)

df_tuning_job_analytics = tuner_analytics.dataframe()

df_tuning_job_analytics.sort_values(
    by=['FinalObjectiveValue'],
    inplace=True,
    ascending=False if tuner.objective_type == "Maximize" else True)

# select the columns to display and rename
select_columns = ["TrainingJobDefinitionName", "FinalObjectiveValue", "TrainingElapsedTimeSeconds"]
rename_columns = {
	"TrainingJobDefinitionName": "candidate",
	"FinalObjectiveValue": "AUC",
	"TrainingElapsedTimeSeconds": "run_time"  
}

# Show top 5 model performances
df_tuning_job_analytics.rename(columns=rename_columns)[list(rename_columns.values())].set_index("candidate").head(5)
candidate AUC run_time (s)
dpp6-mlp 0.96008 2711.0
dpp4-xgboost 0.95236 385.0
dpp3-xgboost 0.95095 202.0
dpp4-xgboost 0.95069 458.0
dpp3-xgboost 0.95015 361.0

The top-performing model, based on MLP, is marginally better than the XGBoost models with various choices of data processing steps, but it also takes a lot longer to train. You can find important details about the MLP model training, including the combination of hyperparameters used, as follows:

df_tuning_job_analytics.loc[df_tuning_job_analytics.TrainingJobName==best_training_job].T.dropna() 
TrainingJobName mangrove-2-notebook–211021-2016-012-500271c8
TrainingJobStatus Completed
FinalObjectiveValue 0.96008
TrainingStartTime 2021-10-21 20:22:55+00:00
TrainingEndTime 2021-10-21 21:08:06+00:00
TrainingElapsedTimeSeconds 2711
TrainingJobDefinitionName dpp6-mlp
dropout_prob 0.415778
embedding_size_factor 0.849226
layers 256
learning_rate 0.00013862
mini_batch_size 317
network_type feedforward
weight_decay 1.29323e-12

Create an inference pipeline

To generate inference on new data, you have to construct an inference pipeline on SageMaker to host the best model, which can be called later to generate predictions. The SageMaker pipeline model requires three containers as its components: data transformation, algorithm, and inverse label transformation (if numerical predictions need to be mapped onto non-numerical labels). For brevity, only part of the required code is shown in the following snippet; the complete code is available in the mangrove-2013.ipynb notebook:

from sagemaker.estimator import Estimator
from sagemaker import PipelineModel
from sagemaker_automl import select_inference_output

…
# Final pipeline model 
model_containers = [best_data_transformer_model, best_algo_model]
if best_candidate.transforms_label:
	model_containers.append(best_candidate.get_data_transformer_model(
    	transform_mode="inverse-label-transform",
    	role=SAGEMAKER_ROLE,
    	sagemaker_session=SAGEMAKER_SESSION))

# select the output type
model_containers = select_inference_output("BinaryClassification", model_containers, output_keys=['predicted_label'])

After the model containers are built, you can construct and deploy the pipeline as follows:

from sagemaker import PipelineModel

pipeline_model = PipelineModel(
	name=f"mangrove-automl-2013",
	role=SAGEMAKER_ROLE,
	models=model_containers,
	vpc_config=AUTOML_LOCAL_RUN_CONFIG.vpc_config)

pipeline_model.deploy(initial_instance_count=1,
                  	instance_type='ml.m5.2xlarge',
                  	endpoint_name=pipeline_model.name,
                  	wait=True)

The endpoint deployment takes about 10 minutes to complete.

Get inference on the test dataset using an endpoint

After the endpoint is deployed, you can invoke it with a payload of features B1–B7 to classify each pixel in an image as either mangrove (1) or other (0):

import boto3
sm_runtime = boto3.client('runtime.sagemaker')

pred_labels = []
with open(local_download, 'r') as f:
    for i, row in enumerate(f):
        payload = row.rstrip('\n')
        x = sm_runtime.invoke_endpoint(EndpointName=inf_endpt,
                                       ContentType="text/csv",
                                       Body=payload)
        pred_labels.append(int(x['Body'].read().decode().strip()))

Complete details on postprocessing the model predictions for evaluation and plotting are available in notebooks/model_performance.ipynb.

Get inference on the test dataset using a batch transform

Now that we have created the best-performing model with Autopilot, we can use it for inference. For large datasets, it’s more efficient to get inference with a batch transform. Let’s generate predictions on the entire dataset (training and test) and append the results to the features, so that we can perform further analysis, for instance comparing predictions against actuals and examining the distribution of features among the predicted classes.

First, we create a manifest file in Amazon S3 that points to the locations of the training and test data from the previous data processing steps:

import boto3
data_bucket = <Name of the S3 bucket that has the training data>
prefix = "LANDSAT_LC08_C01_T1_SR/Year2013"
manifest = "[{{\"prefix\": \"s3://{}/{}/\"}},\n\"train.csv\",\n\"test.csv\"\n]".format(data_bucket, prefix)
s3_client = boto3.client('s3')
s3_client.put_object(Body=manifest, Bucket=data_bucket, Key=f"{prefix}/data.manifest")

Now we can create a batch transform job. Because our input train and test datasets have the label as the last column, we need to drop it during inference. To do that, we pass InputFilter in the DataProcessing argument. The filter "$[:-2]" indicates that the last column should be dropped. The predicted output is then joined with the source data for further analysis.

In the following code, we construct the arguments for the batch transform job and then pass to the create_transform_job function:

from time import gmtime, strftime

batch_job_name = "Batch-Transform-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
output_location = "s3://{}/{}/batch_output/{}".format(data_bucket, prefix, batch_job_name)
input_location = "s3://{}/{}/data.manifest".format(data_bucket, prefix)

request = {
    "TransformJobName": batch_job_name,
    "ModelName": pipeline_model.name,
    "TransformOutput": {
        "S3OutputPath": output_location,
        "Accept": "text/csv",
        "AssembleWith": "Line",
    },
    "TransformInput": {
        "DataSource": {"S3DataSource": {"S3DataType": "ManifestFile", "S3Uri": input_location}},
        "ContentType": "text/csv",
        "SplitType": "Line",
        "CompressionType": "None",
    },
    "TransformResources": {"InstanceType": "ml.m4.xlarge", "InstanceCount": 1},
    "DataProcessing": {"InputFilter": "$[:-2]", "JoinSource": "Input"}
}

sagemaker = boto3.client("sagemaker")
sagemaker.create_transform_job(**request)
print("Created Transform job with name: ", batch_job_name)

You can monitor the status of the job on the SageMaker console.
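You can also poll the job from the notebook with the same boto3 client; for example:

import time

# Poll the batch transform job status instead of watching the console
while True:
    response = sagemaker.describe_transform_job(TransformJobName=batch_job_name)
    status = response["TransformJobStatus"]
    print("Transform job status:", status)
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)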

Visualize model performance

You can now visualize the performance of the best model on the test dataset, consisting of regions from India, Myanmar, Cuba, and Vietnam, as a confusion matrix. The model has a high recall value for pixels representing mangroves, but only about 75% precision. The precision of non-mangrove or other pixels stands at 99% with an 85% recall. You can tune the probability cutoff of the model predictions to adjust the respective values depending on the particular use case.
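If you also return the class probability from the inference pipeline (the Autopilot inference containers typically support a probability response key in addition to the predicted label), you can sweep the cutoff offline. The following is a rough sketch, assuming pred_probs holds the predicted mangrove probabilities for the test pixels and y_true holds the corresponding ground truth labels; both names are hypothetical here:

# Sketch: sweep the probability cutoff and observe the precision/recall trade-off
import numpy as np
from sklearn.metrics import precision_score, recall_score

for cutoff in np.arange(0.3, 0.8, 0.1):
    preds = (np.array(pred_probs) >= cutoff).astype(int)
    print(f"cutoff={cutoff:.1f} "
          f"precision={precision_score(y_true, preds):.3f} "
          f"recall={recall_score(y_true, preds):.3f}")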

It’s worth noting that the results are a significant improvement over the built-in smileCart model.

Visualize model predictions

Finally, it’s useful to observe the model performance on specific regions on the map. In the following image, the mangrove area in the India-Bangladesh border is depicted in red. Points sampled from the Landsat image patch belonging to the test dataset are superimposed on the region, where each point is a pixel that the model determines to be representing mangroves. The blue points are classified correctly by the model, whereas the black points represent mistakes by the model.

The following image shows only the points that the model predicted to not represent mangroves, with the same color scheme as the preceding example. The gray outline is the part of the Landsat patch that doesn’t include any mangroves. As is evident from the image, the model doesn’t make any mistake classifying points on water, but faces a challenge when distinguishing pixels representing mangroves from those representing regular foliage.

The following image shows model performance on the Myanmar mangrove region.

In the following image, the model does a better job identifying mangrove pixels.

Clean up

The SageMaker inference endpoint continues to incur cost if left running. Delete the endpoint as follows when you’re done:

sagemaker.delete_endpoint(EndpointName=pipeline_model.name)
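Depending on how the endpoint was created, you may also want to remove the associated endpoint configuration and model. The following sketch assumes the SDK’s default naming, where both reuse the pipeline model name; verify the actual names on the SageMaker console before deleting:

# Optional cleanup of the endpoint configuration and the model
# (names below assume the SDK's default naming convention)
sagemaker.delete_endpoint_config(EndpointConfigName=pipeline_model.name)
sagemaker.delete_model(ModelName=pipeline_model.name)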

Conclusion

This series of posts provided an end-to-end framework for data scientists to solve GIS problems. Part 1 showed the ETL process and a convenient way to visually interact with the data. Part 2 showed how to use Autopilot to automate building a custom mangrove classifier.

You can use this framework to explore new satellite datasets containing a richer set of bands useful for mangrove classification and explore feature engineering by incorporating domain knowledge.


About the Authors

Andrei Ivanovic is an incoming Master’s of Computer Science student at the University of Toronto and a recent graduate of the Engineering Science program at the University of Toronto, majoring in Machine Intelligence with a Robotics/Mechatronics minor. He is interested in computer vision, deep learning, and robotics. He did the work presented in this post during his summer internship at Amazon.

David Dong is a Data Scientist at Amazon Web Services.

Arkajyoti Misra is a Data Scientist at Amazon LastMile Transportation. He is passionate about applying Computer Vision techniques to solve problems that help the earth. He loves to work with non-profit organizations and is a founding member of ekipi.org.


Identify mangrove forests using satellite image features using Amazon SageMaker Studio and Amazon SageMaker Autopilot – Part 1

The increasing ubiquity of satellite data over the last two decades is helping scientists observe and monitor the health of our constantly changing planet. By tracking specific regions of the Earth’s surface, scientists can observe how regions like forests, water bodies, or glaciers change over time. One such region of interest for geologists is mangrove forests. These forests are essential to the overall health of the planet and are one of the many areas across the world that are impacted by human activities. In this post, we show how to get access to satellite imagery data containing mangrove forests and how to visually interact with the data in Amazon SageMaker Studio. In Part 2 of this series, we show how to train a machine learning (ML) model using Amazon SageMaker Autopilot to identify those forests from a satellite image.

Overview of solution

A large number of satellites orbit the Earth, scanning its surface on a regular basis. Typical examples of such satellites are Landsat, Sentinel, CBERS, and MODIS, to name a few. You can access both recent and historical data captured by these satellites at no cost from multiple providers like USGS EarthExplorer, Land Viewer, or Copernicus Open Access Hub. Although they provide an excellent service to the scientific community by making their data freely available, it takes a significant amount of effort to gain familiarity with the interfaces of the respective providers. Additionally, such data from satellites is made available in different formats and may not comply with the standard Geographical Information Systems (GIS) data formatting. All of these challenges make it extremely difficult for newcomers to GIS to prepare a suitable dataset for ML model training.

Platforms like Google Earth Engine (GEE) and Earth on AWS make a wide variety of satellite imagery data available in a single portal that eases searching for the right dataset and standardizes the ETL (extract, transform, and load) component of the ML workflow in a convenient, beginner-friendly manner. GEE additionally provides a coding platform where you can programmatically explore the dataset and build a model in JavaScript. The Python API for GEE lacks the maturity of its JavaScript counterpart; however, that gap is sufficiently bridged by the open-sourced project geemap.

In this series of posts, we present a complete end-to-end example of building an ML model in the GIS space to detect mangrove forests from satellite images. Our goal is to provide a template solution that ML engineers and data scientists can use to explore and interact with the satellite imagery, make the data available in the right format for building a classifier, and have the option to validate model predictions visually. Specifically, we walk through the following:

  • How to download satellite imagery data to a Studio environment
  • How to interact with satellite data and perform exploratory data analysis in a Studio notebook
  • How to automate training an ML model in Autopilot

Build the environment

The solution presented in this post is built in a Studio environment. To configure the environment, complete the following steps:

  1. Add a new SageMaker domain user and launch the Studio app. (For instructions, refer to Get Started.)
  2. Open a new Studio notebook by choosing the plus sign under Notebook and compute resources (make sure to choose the Data Science SageMaker image).
  3. Clone the mangrove-landcover-classification Git repository, which contains all the code used for this post. (For instructions, refer to Clone a Git Repository in SageMaker Studio).
  4. Open the notebook notebooks/explore_mangrove_data.ipynb.
  5. Run the first notebook cell to pip install all the required dependencies listed in the requirements.txt file in the root folder.
  6. Open a new Launcher tab and open a system terminal found in the Utilities and files section.
  7. Install the Earth Engine API:
    pip install earthengine-api

  8. Authenticate Earth Engine:
    earthengine authenticate

  9. Follow the Earth Engine link in the output and sign up as a developer so that you can access GIS data from a notebook.

Mangrove dataset

The Global Mangrove Forest Distribution (GMFD) is one of the most cited datasets used by researchers in the area. The dataset, which contains labeled mangrove regions at a 30-meter resolution from around the world, is curated from more than 1,000 Landsat images obtained from the USGS EROS Center. One of the disadvantages of using the dataset is that it was compiled in 2000. In the absence of a newer dataset that is as comprehensive as the GMFD, we decided to use it because it serves the purpose of demonstrating an ML workload in the GIS space.

Given the visual nature of GIS data, it’s critical for ML practitioners to be able to interact with satellite images in an interactive manner with full map functionalities. Although GEE provides this functionality through a browser interface, it’s only available in JavaScript. Fortunately, the open-sourced project geemap aids data scientists by providing those functionalities in Python.

Go back to the explore_mangrove_data.ipynb notebook you opened earlier and follow the remaining cells to understand how to use simple interactive maps in the notebook.

  1. Start by importing Earth Engine and initializing it:
    import ee
    import geemap.eefolium as geemap
    ee.Initialize()

  2. Now import the satellite image collection from the database:
    mangrove_images_landsat = ee.ImageCollection('LANDSAT/MANGROVE_FORESTS')

  3. Extract the collection, which contains just one set:
    mangrove_images_landsat = mangrove_images_landsat.first()

  4. To visualize the data on a map, you first need to instantiate a map through geemap:
    mangrove_map = geemap.Map()

  5. Next, define some parameters that make it easy to visualize the data on a world map:
    mangrovesVis = {
        'min': 0,
        'max': 1.0,
        'palette': ['d40115'],
    }

  6. Now add the data as a layer on the map instantiated earlier with the visualization parameters:
    mangrove_map.addLayer(mangrove_images_landsat, mangrovesVis, 'Mangroves')

You can add as many layers as you want to the map and then interactively turn them on or off for a cleaner view when necessary. Because mangrove forests aren’t everywhere on the earth, it makes sense to center the map to a coastal region with known mangrove forests and then render the map on the notebook as follows:

mangrove_map.setCenter(-81, 25, 9)
mangrove_map

The latitude and longitude chosen here, 25 degrees north and 81 degrees west, respectively, correspond to the Gulf Coast of Florida, US. The map is rendered at a zoom level of 9, where a higher number provides a closer view.

You can obtain some useful information about the dataset by accessing the associated metadata as follows:

geemap.image_props(mangrove_images_landsat).getInfo()

You get the following output:

{'IMAGE_DATE': '2000-01-01',
 'NOMINAL_SCALE': 30.359861978395436,
 'system:asset_size': '41.133541 MB',
 'system:band_names': ['1'],
 'system:id': 'LANDSAT/MANGROVE_FORESTS/2000',
 'system:index': '2000',
 'system:time_end': '2001-01-01 00:00:00',
 'system:time_start': '2000-01-01 00:00:00',
 'system:version': 1506796895089836
}

Most of the fields in the metadata are self-explanatory, except for the band names. The next section discusses this field in more detail.

Landsat dataset

The following image is a satellite image of an area at the border of French Guiana and Suriname, where mangrove forests are common. The left image shows a raw satellite image of the region; the image on the right depicts the GMFD data superimposed on it. Pixels representing mangroves are shown in red. It’s quite evident from the side-by-side comparison that there is no straightforward visual cue in either structure or color in the underlying satellite image that distinguishes mangroves from the surrounding region. In the absence of any such distinguishing pattern in the images, it poses a considerable challenge even for state-of-the-art deep learning-based classifiers to identify mangroves accurately. Fortunately, satellite images are captured at a range of wavelengths on the electromagnetic spectrum, part of which falls outside the visible range. Additionally, they also contain important measurements like surface reflectance. Therefore, researchers in the field have traditionally relied upon these measurements to build ML classifiers.

Unfortunately, apart from marking whether or not an individual pixel represents mangroves, the GMFD dataset doesn’t provide any additional information. However, other datasets can provide a host of features for every pixel that can be utilized to train a classifier. In this post, you use the USGS Landsat 8 dataset for that purpose. The Landsat 8 satellite was launched in 2013 and orbits the Earth every 99 minutes at an altitude of 705 km, capturing images covering a 185 km x 180 km patch on the Earth’s surface. It captures nine spectral bands, or portions of the electromagnetic spectrum sensed by a satellite, ranging from ultra blue to shortwave infrared. Therefore, the images available in the Landsat dataset are a collection of image patches containing multiple bands, with each patch time stamped by the date of collection.

To get a sample image from the Landsat dataset, you need to define a point of interest:

point = ee.Geometry.Point([<longitude>, <latitude>])

Then you filter the image collection by the point of interest, a date range, and optionally by the bands of interest. Because the images collected by the satellites are often obscured by cloud cover, it’s absolutely necessary to extract images with the minimum amount of cloud cover. Fortunately, the Landsat dataset already comes with a cloud detector. This streamlines the process of accessing all available images over several months, sorting them by amount of cloud cover, and picking the one with minimum cloud cover. For example, you can perform the entire process of extracting a Landsat image patch from the northern coast of the continent of South America in a few lines of code:

point = ee.Geometry.Point([-53.94, 5.61])
image_patch = ee.ImageCollection('LANDSAT/LC08/C01/T1_SR') \
    .filterBounds(point) \
    .filterDate('2016-01-01', '2016-12-31') \
    .select('B[1-7]') \
    .sort('CLOUD_COVER') \
    .first()

When specifying a region using a point of interest, that region doesn’t necessarily have to be centered on that point. The extracted image patch simply contains the point somewhere within it.
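If you want to confirm this for a given patch, you can check that the point lies within the patch footprint; for example:

# Confirm that the point of interest falls somewhere inside the patch footprint
print(image_patch.geometry().contains(point).getInfo())  # expected: True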

Finally, you can plot the image patch over a map by specifying proper plotting parameters based on a few of the chosen bands:

vis_params = {
    'min': 0,
    'max': 3000,
    'bands': ['B5', 'B4', 'B3']
}
landsat = geemap.Map()
landsat.centerObject(point, 8)
landsat.addLayer(image_patch, vis_params, "Landsat-8")
landsat

The following is a sample image patch collected by Landsat 8 showing in false color the Suriname-French Guiana border region. The mangrove regions are too tiny to be visible at the scale of the image.

As usual, there is a host of useful metadata available for the extracted image:

geemap.image_props(image_patch).getInfo()

{'CLOUD_COVER': 5.76,
 'CLOUD_COVER_LAND': 8.93,
 'EARTH_SUN_DISTANCE': 0.986652,
 'ESPA_VERSION': '2_23_0_1a',
 'GEOMETRIC_RMSE_MODEL': 9.029,
 'GEOMETRIC_RMSE_MODEL_X': 6.879,
 'GEOMETRIC_RMSE_MODEL_Y': 5.849,
 'IMAGE_DATE': '2016-11-27',
 'IMAGE_QUALITY_OLI': 9,
 'IMAGE_QUALITY_TIRS': 9,
 'LANDSAT_ID': 'LC08_L1TP_228056_20161127_20170317_01_T1',
 'LEVEL1_PRODUCTION_DATE': 1489783959000,
 'NOMINAL_SCALE': 30,
 'PIXEL_QA_VERSION': 'generate_pixel_qa_1.6.0',
 'SATELLITE': 'LANDSAT_8',
 'SENSING_TIME': '2016-11-27T13:52:20.6150480Z',
 'SOLAR_AZIMUTH_ANGLE': 140.915802,
 'SOLAR_ZENITH_ANGLE': 35.186565,
 'SR_APP_VERSION': 'LaSRC_1.3.0',
 'WRS_PATH': 228,
 'WRS_ROW': 56,
 'system:asset_size': '487.557501 MB',
 'system:band_names': ['B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7'],
 'system:id': 'LANDSAT/LC08/C01/T1_SR/LC08_228056_20161127',
 'system:index': 'LC08_228056_20161127',
 'system:time_end': '2016-11-27 13:52:20',
 'system:time_start': '2016-11-27 13:52:20',
 'system:version': 1522722936827122}

The preceding image isn’t free from clouds, which is confirmed by the metadata suggesting a 5.76% cloud cover. Compared to a single binary band available from the GMFD image, the Landsat image contains the bands B1–B7.

ETL process

To summarize, you need to work with two distinct datasets to train a mangrove classifier. The GMFD dataset provides only the coordinates of pixels belonging to the minority class (mangrove). The Landsat dataset, on the other hand, provides band information for every pixel in a collection of patches, each patch covering roughly a 185 km x 180 km area on the Earth’s surface. You now need to combine these two datasets to create the training dataset containing pixels belonging to both the minority and majority classes.

It’s wasteful to have a training dataset covering the entire surface of the Earth, because the mangrove regions cover a tiny fraction of the surface area. Because these regions are generally isolated from one another, an effective strategy is to create a set of points, each representing a specific mangrove forest on the earth’s surface, and collect the Landsat patches around those points. Subsequently, pixels can be sampled from each Landsat patch and a class—either mangrove or non-mangrove—can be assigned to it depending on whether the pixel appears in the GMFD dataset. The full labeled dataset can then be constructed by aggregating points sampled from this collection of patches.

The following table shows a sample of the regions and the corresponding coordinates to filter the Landsat patches.

   region         longitude  latitude
0  Mozambique1      36.2093  -18.7423
1  Mozambique2      34.7455  -20.6128
2  Nigeria1          5.6116    5.3431
3  Nigeria2          5.9983    4.5678
4  Guinea-Bissau   -15.9903   12.1660

Due to the larger expanse of mangrove forests in Mozambique and Nigeria, two points each are required to capture the respective regions in the preceding table. The full curated list of points is available on GitHub.
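To illustrate how these points drive the patch extraction, the following sketch loops over a small, hypothetical regions dictionary (mirroring the table above) and pulls the least cloudy Landsat patch for each; the date range here is only illustrative:

# Hypothetical dictionary of region names to (longitude, latitude) pairs,
# mirroring the table above; the full list lives in the GitHub repository.
regions = {
    'Mozambique1': (36.2093, -18.7423),
    'Nigeria1': (5.6116, 5.3431),
    'Guinea-Bissau': (-15.9903, 12.1660),
}

patches = {}
for name, (lon, lat) in regions.items():
    pt = ee.Geometry.Point([lon, lat])
    patches[name] = ee.ImageCollection('LANDSAT/LC08/C01/T1_SR') \
        .filterBounds(pt) \
        .filterDate('2016-01-01', '2016-12-31') \
        .select('B[1-7]') \
        .sort('CLOUD_COVER') \
        .first()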

To sample points representing both classes, you have to create a binary mask for each class first. The minority class mask for a Landsat patch is simply the intersection of pixels in the patch and the GMFD dataset. The mask for the majority class for the patch is simply the inverse of the minority class mask. See the following code:

mangrove_mask = image_patch.updateMask(mangrove_images_landsat.eq(1))
non_mangrove_mask = image_patch.updateMask(mangrove_mask.unmask().Not())

Use these two masks for the patch and create a set of labeled pixels by randomly sampling pixels from the respective masks:

mangrove_training_pts = mangrove_mask.sample(**{
    'region': mangrove_mask.geometry(),
    'scale': 30,
    'numPixels': 100000,
    'seed': 0,
    'geometries': True
})
non_mangrove_training_pts = non_mangrove_mask.sample(**{
    'region': non_mangrove_mask.geometry(),
    'scale': 30,
    'numPixels': 50000,
    'seed': 0,
    'geometries': True
})

numPixels is the number of samples drawn from the entire patch, and the sampled point is retained in the collection only if it falls in the target mask area. Because the mangrove region is typically a small fraction of the Landsat image patch, you need to use a larger value of numPixels for the mangrove mask compared to that for the non-mangrove mask. You can always look at the size of the two classes as follows to adjust the corresponding numPixels values:

mangrove_training_pts.size().getInfo(), non_mangrove_training_pts.size().getInfo()
(900, 49500)

In this example, the mangrove region is a tiny fraction of the Landsat patch because only 900 points were sampled from 100,000 attempts. Therefore, you should probably increase the value for numPixels for the minority class to restore balance between the two classes.
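As a rough rule of thumb (this is an illustration, not code from the notebook), you can scale numPixels for the minority class by the observed class ratio:

# Rough rebalancing sketch: scale the mangrove numPixels by the class ratio
# observed above so the expected sample counts become comparable.
n_mangrove = mangrove_training_pts.size().getInfo()      # 900 in this example
n_other = non_mangrove_training_pts.size().getInfo()     # 49,500 in this example
suggested_num_pixels = int(100000 * n_other / max(n_mangrove, 1))
print(f"Try resampling mangroves with numPixels={suggested_num_pixels}")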

It’s a good idea to visually verify that the sampled points from the two respective sets indeed fall in the intended region in the map:

# define the point of interest
suriname_lonlat = [-53.94, 5.61]
suriname_point = ee.Geometry.Point(suriname_lonlat)
training_map = geemap.Map()
training_map.setCenter(*suriname_lonlat, 13)

# define visualization parameters
vis_params = {
    'min': 0,
    'max': 100,
    'bands': ['B4']
}

# define colors for the two set of points
mangrove_color = 'eb0000'
non_mangrove_color = '1c5f2c'

# create legend for the map
legend_dict = {
    'Mangrove Point': mangrove_color,
    'Non-mangrove Point': non_mangrove_color
}

# add layers to the map
training_map.addLayer(mangrove_mask, vis_params, 'mangrove mask', True)
training_map.addLayer(mangrove_training_pts, {'color': mangrove_color}, 'Mangrove Sample')
training_map.addLayer(non_mangrove_mask, {}, 'non mangrove mask', True)
training_map.addLayer(non_mangrove_training_pts, {'color': non_mangrove_color}, 'non mangrove training', True)
training_map.add_legend(legend_dict=legend_dict)

# display the map
training_map

Sure enough, as the following image shows, the red points representing mangrove pixels fall in the white regions and the green points representing a lack of mangroves fall in the gray region. The maps.ipynb notebook walks through the process of generation and visual inspection of sampled points on a map.

Now you need to convert the sampled points into a DataFrame for ML model training, which can be accomplished with the ee_to_geopandas function of geemap:

from geemap import ee_to_geopandas
mangrove_gdf = ee_to_geopandas(mangrove_training_pts)
                    geometry    B1    B2    B3    B4    B5    B6    B7
0  POINT (-53.95268 5.73340)   251   326   623   535  1919   970   478
1  POINT (-53.38339 5.55982)  4354  4483  4714  4779  5898  4587  3714
2  POINT (-53.75469 5.68400)  1229  1249  1519  1455  3279  1961  1454
3  POINT (-54.78127 5.95457)   259   312   596   411  3049  1644   740
4  POINT (-54.72215 5.97807)   210   279   540   395  2689  1241   510

The pixel coordinates at this stage are still represented as a Shapely geometry point. In the next step, you have to convert those into latitudes and longitudes. Additionally, you need to add labels to the DataFrame, which for the mangrove_gdf should all be 1, representing the minority class. See the following code:

mangrove_gdf["lon"] = mangrove_gdf["geometry"].apply(lambda p: p.x)
mangrove_gdf["lat"] = mangrove_gdf["geometry"].apply(lambda p: p.y)
mangrove_gdf["label"] = 1 
mangrove_gdf = mangrove_gdf.drop("geometry", axis=1)
print(mangrove_gdf.head())

     B1    B2    B3    B4    B5    B6    B7        lon       lat  label
0   251   326   623   535  1919   970   478 -53.952683  5.733402      1
1  4354  4483  4714  4779  5898  4587  3714 -53.383394  5.559823      1
2  1229  1249  1519  1455  3279  1961  1454 -53.754688  5.683997      1
3   259   312   596   411  3049  1644   740 -54.781271  5.954568      1
4   210   279   540   395  2689  1241   510 -54.722145  5.978066      1

Similarly, create another DataFrame, non_mangrove_gdf, using sampled points from the non-mangrove part of the Landsat image patch and assigning label=0 to all those points. A training dataset for the region is created by appending mangrove_gdf and non_mangrove_gdf.
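The corresponding code isn’t repeated in the post, but it mirrors the mangrove steps almost exactly; a minimal sketch, assuming the same imports as above, follows:

import pandas as pd

# Mirror the steps above for the non-mangrove sample, labeled 0
non_mangrove_gdf = ee_to_geopandas(non_mangrove_training_pts)
non_mangrove_gdf["lon"] = non_mangrove_gdf["geometry"].apply(lambda p: p.x)
non_mangrove_gdf["lat"] = non_mangrove_gdf["geometry"].apply(lambda p: p.y)
non_mangrove_gdf["label"] = 0
non_mangrove_gdf = non_mangrove_gdf.drop("geometry", axis=1)

# Combine the two classes into a single labeled dataset for the region
region_df = pd.concat([mangrove_gdf, non_mangrove_gdf], ignore_index=True)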

Exploring the bands

Before diving into building a model to classify pixels in an image representing mangroves or not, it’s worth looking into the band values associated with those pixels. There are seven bands in the dataset, and the kernel density plots in the following figure show the distribution of those bands extracted from the 2015 Landsat data for the Indian mangrove region. The distribution of each band is broken down into two groups: pixels representing mangroves, and pixels representing other surface features like water or cultivated land.

One important aspect of building a classifier is to understand how these distributions vary over different regions of the Earth. The following figure shows the kernel density plots for bands captured in the same year, 2015, from the Miami area of the US. The apparent similarity of the density profiles indicates that it may be possible to build a universal mangrove classifier that generalizes to new areas excluded from the training set.

The plots shown in both figures are generated from band values that represent minimum cloud coverage, as determined by the built-in Earth Engine algorithm. Although this is a reasonable approach, different regions of the Earth have varying amounts of cloud cover on the specific date of data collection, so there are alternative ways to capture the band values. For example, it’s also useful to calculate the median from a simple composite and use it for model training, but those details are beyond the scope of this post.

Prepare the training data

There are two main strategies to split the labeled dataset into training and test sets. In the first approach, datasets corresponding to the different regions can be combined into a single DataFrame and then split into training and test sets while preserving the fraction of the minority class. The alternative approach is to train a model on a subset of the regions and treat the remaining regions as part of the test set. One of the critical questions we want to address here is how well a model trained on certain regions generalizes to regions it has never seen. This is important because mangroves from different parts of the world can have some local characteristics, and one way to judge the quality of a model is to investigate how reliable it is in predicting mangrove forests from the satellite image of a new region. Therefore, although splitting the dataset using the first strategy would likely improve the model performance, we follow the second approach.

As indicated earlier, the mangrove dataset was broken down into geographical regions and four of those, Vietnam2, Myanmar3, Cuba2, and India, were set aside to create the test dataset. The remaining 21 regions made up the training set. The dataset for each region was created by setting numPixels=10000 for mangrove and numPixels=1000 for the non-mangrove regions in the sampling process. The larger value of numPixels for mangroves ensures a more balanced dataset, because mangroves usually cover a small fraction of the satellite image patches. The resulting training data ended up having a 75/25 split between the majority and minority classes, whereas the split was 69/31 for the test dataset. The regional datasets as well as the training and test datasets were stored in an Amazon Simple Storage Service (Amazon S3) bucket. The complete code for generating the training and test sets is available in the prep_mangrove_dataset.ipynb notebook.
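The exact splitting code lives in the prep_mangrove_dataset.ipynb notebook; a simplified sketch of the idea, assuming a hypothetical region_dfs dictionary that maps each region name to its labeled DataFrame and a placeholder S3 bucket, might look like the following:

import pandas as pd

# region_dfs is assumed to map region name -> labeled DataFrame built as above
test_regions = ["Vietnam2", "Myanmar3", "Cuba2", "India"]

train_df = pd.concat(
    [df for name, df in region_dfs.items() if name not in test_regions],
    ignore_index=True)
test_df = pd.concat(
    [df for name, df in region_dfs.items() if name in test_regions],
    ignore_index=True)

# Writing directly to S3 with pandas requires the s3fs package
bucket = "<your-S3-bucket>"
train_df.to_csv(f"s3://{bucket}/mangrove-dataset/train.csv", index=False)
test_df.to_csv(f"s3://{bucket}/mangrove-dataset/test.csv", index=False)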

Train a model with smileCart

One of the few built-in models GEE provides is a classification and regression tree-based algorithm (smileCart) for quick classification. These built-in models let you quickly train a classifier and run inference, at the cost of giving up detailed model tuning and customization. Even with this downside, using smileCart still provides a beginner-friendly introduction to land cover classification, and therefore can serve as a baseline.

To train the built-in classifier, you need to provide two pieces of information: the satellite bands to use as features and the column representing the label. Additionally, you have to convert the training and test datasets from Pandas DataFrames to GEE feature collections. Then you instantiate the built-in classifier and train the model. The following is a high-level version of the code; you can find more details in the smilecart.ipynb notebook:

bands = ['B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7']
label = 'label'

# Train a CART classifier with default parameters.
classifier = ee.Classifier.smileCart().train(train_set_pts, label, bands)

# Inference on test set
result_featurecollection = test_set_pts.select(bands).classify(classifier)

Both train_set_pts and test_set_pts are FeatureCollections, a common GEE data structure, containing the train dataset and test dataset, respectively. The model prediction generates the following confusion matrix on the test dataset.
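The DataFrame-to-FeatureCollection conversion isn’t shown in the snippet above. One way to sketch it is with a client-side loop; the helper name is illustrative, and for large datasets you would typically upload the data as an Earth Engine asset instead:

# Illustrative helper: build an ee.FeatureCollection from a labeled DataFrame
# with band columns plus lon, lat, and label (as constructed earlier).
def df_to_feature_collection(df, band_cols, label_col="label"):
    features = []
    for _, row in df.iterrows():
        geom = ee.Geometry.Point([row["lon"], row["lat"]])
        props = {b: float(row[b]) for b in band_cols}
        props[label_col] = int(row[label_col])
        features.append(ee.Feature(geom, props))
    return ee.FeatureCollection(features)

# train_df and test_df are the combined DataFrames from the previous section
train_set_pts = df_to_feature_collection(train_df, bands)
test_set_pts = df_to_feature_collection(test_df, bands)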

The model doesn’t predict mangroves very well, but this is a good starting point, and the result acts as a baseline for the custom models you build in Part 2 of this series.
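If you want to reproduce the matrix programmatically, Earth Engine can compute it on the classified collection, provided the label property is retained when classifying; a rough sketch:

# Sketch: keep the label when classifying so the error matrix can be computed;
# classify() adds a 'classification' property with the predicted class.
classified = test_set_pts.select(bands + [label]).classify(classifier)
error_matrix = classified.errorMatrix(label, 'classification')
print(error_matrix.getInfo())            # rows: actual, columns: predicted
print('Overall accuracy:', error_matrix.accuracy().getInfo())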

Conclusion

This concludes the first part of a two-part post, in which we showed the ETL process for building a mangrove classifier based on features extracted from satellite images. We showed how to automate the process of gathering satellite images and how to visualize them in Studio for detailed exploration. In Part 2 of the post, we show how to use Autopilot to build a custom model that performs better than the built-in smileCart model.

