Get started with the Amazon Kendra Amazon WorkDocs connector

Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra reimagines enterprise search for your websites and applications so your employees and customers can easily find the content they’re looking for, even when it’s scattered across multiple locations and content repositories within your organization.

With Amazon Kendra, you can search through troves of unstructured data and discover the right answers to your questions, when you need them. Amazon Kendra is a fully managed service, so there are no servers to provision, and no ML models to build, train, or deploy.

Amazon WorkDocs is a fully managed and secure content creation, storage, and collaboration service. With Amazon WorkDocs, you can easily create, edit, and share content. Moreover, because it’s stored centrally on AWS, you can access it from anywhere on any device.

In this post, we show how Amazon Kendra allows your users to search documents stored in Amazon WorkDocs.

Use case

For this post, we created a specific folder in Amazon WorkDocs containing a set of PDF and Microsoft Word documents whose content we want to search. The Amazon WorkDocs connector also allows you to ingest the comments on those documents.

The following screenshot shows the contents of a fictional WorkDocs folder called WorkdocsBlogpostDataset.

Create an Amazon WorkDocs connector

To create an Amazon WorkDocs connector, complete the following steps:

  1. On the Amazon Kendra console, choose Data sources.
  2. Choose Add data source.
  3. Under WorkDocs, choose Add connector.
  4. For Data source name, enter a name for your data source.
  5. Enter an optional description.
  6. Choose Next.
  7. In the Source section, choose the organization ID for your Amazon WorkDocs site.
  8. Create a new AWS Identity and Access Management (IAM) role for the data source.
  9. For Sync scope, select Crawl document comments and Use change logs.

For this post, we want Amazon Kendra to ingest the documents in the WorkdocsBlogpostDataset folder.

  1. In the Additional configuration section, enter WorkdocsBlogpostDataset as a path on the Include patterns tab.
  2. Choose Add.
  3. For Sync run schedule, choose Run on demand.
  4. Choose Next.
  5. In the WorkDocs field mapping section, use the default field mapping.
  6. Choose Next.
  7. Review the settings and choose Create.
  8. When the creation process is complete, choose Sync.

When the sync process is complete, you can see how many documents were ingested.

Now your documents are ready to be searched by Amazon Kendra.

  1. In the navigation pane, choose Search console.

You can now submit some test queries, as shown in the following screenshots.
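
If you prefer to query the index programmatically, you can also call the Amazon Kendra Query API. The following is a minimal sketch using the AWS SDK for Python (Boto3); the index ID and query text are placeholders.

import boto3

kendra = boto3.client("kendra")

# Placeholder index ID; you can find yours on the Amazon Kendra console
index_id = "YOUR-KENDRA-INDEX-ID"

response = kendra.query(
    IndexId=index_id,
    QueryText="vacation policy",
)

# Print the title and excerpt of each result
for item in response["ResultItems"]:
    title = item.get("DocumentTitle", {}).get("Text", "")
    excerpt = item.get("DocumentExcerpt", {}).get("Text", "")
    print(title, "-", excerpt)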

Also, with the Amazon WorkDocs connector, you can ingest feedback (comments) on your documents. For example, the following screenshot shows that this document has feedback.

The following screenshot shows what the feedback search experience looks like.

Conclusion

In this post, you created a data source and ingested your Amazon WorkDocs documents into your Amazon Kendra index. As a next step, you can try some more queries and see what kind of results you obtain. You can also dive deep into Amazon Kendra with the Amazon Kendra Essentials workshop or try the multilingual chatbot experience.


About the Author

Juan Bustos is an AI Services Specialist Solutions Architect at Amazon Web Services, based in Dallas, TX. Outside of work, he loves spending time writing and playing music as well as trying random restaurants with his family.

Vijai Gandikota is a Senior Product Manager at Amazon Web Services for Amazon Kendra.

Orchestrate XGBoost ML Pipelines with Amazon Managed Workflows for Apache Airflow

The ability to scale machine learning operations (MLOps) at an enterprise is quickly becoming a competitive advantage in the modern economy. When firms started dabbling in ML, only the highest priority use cases were the focus. Businesses are now demanding more from ML practitioners: more intelligent features, delivered faster, and continually maintained over time. An effective MLOps strategy requires a unified platform that can orchestrate and automate complex data processing and ML tasks, and integrates with the latest tooling to best complete those tasks.

This post demonstrates the value of using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate an ML pipeline using the popular XGBoost (eXtreme Gradient Boosting) algorithm. For more advanced and comprehensive MLOps capabilities, including a purpose-built model orchestration framework and a continuous integration and continuous delivery (CI/CD) service for ML, readers are encouraged to check out Amazon SageMaker Pipelines.

Why Airflow for orchestration

Customers choose Apache Airflow and specifically Amazon MWAA for several reasons, but three stand out:

  • Airflow is Python-based – Airflow, as a Python-based tool, enjoys the benefits of an imperative programming paradigm. This enables developers to programmatically define how tasks are to be done. Tools that are declarative, such as AWS Step Functions, only allow you to define what is to be done. When orchestrating ML pipelines, the ability to directly define the control flow is often required to navigate complex workflows.
  • Directed Acyclic Graph (DAG) workflow management – Airflow provides a DAG interface as a simple mechanism for defining and running complex workflows with dependencies. These DAG workflows are visualized through a GUI for operations management.
  • Extensibility – Airflow operators provide a structured way to perform common tasks using reusable modules. This capability is extensible and providers are free to develop custom Airflow operators that integrate with their tools and services. Many cloud-based services are supported. These operators provide useful abstraction, repeatability, and an API. In the context of big data and ML, these operators are especially valuable because they provide a way to orchestrate sometimes very long-running data pipelines or asynchronous ML processes such as model training.

Set up an Amazon MWAA environment

To create your Amazon MWAA environment, complete the following steps:

  1. On the Amazon MWAA console, choose Create environment.
  2. For Name, enter a unique name.
  3. For Airflow version, choose the version to use. For this post, we use Airflow v2.0.2. We also include code for Airflow v1.10.12.

  4. In the DAG code in Amazon S3 section, specify the Amazon Simple Storage Service (Amazon S3) bucket where Amazon MWAA can find the DAGs, plugins.zip file, and requirements.txt file.

Airflow configuration for XGBoost

An XGBoost model requires a specific configuration in the Amazon MWAA environment: the core.enable_xcom_pickling Airflow configuration option must be set to True. The reason is that the trained XGBoost model needs to be serialized so it can be saved as a file in Amazon S3, and certain Python objects (such as the model object or datetime values) can only be passed between tasks through XCom after being converted into a byte stream through a process called pickling.
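
You can set this option on the Amazon MWAA console under Airflow configuration options, or programmatically. The following is a minimal sketch using Boto3 against an existing environment; the environment name is a placeholder.

import boto3

mwaa = boto3.client("mwaa")

# Placeholder environment name; replace with your Amazon MWAA environment
mwaa.update_environment(
    Name="my-mwaa-environment",
    AirflowConfigurationOptions={
        "core.enable_xcom_pickling": "True",
    },
)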

Requirements.txt file

Upload a requirements.txt file to the Amazon S3 location you specified in the Amazon MWAA setup. To support this demonstration, the requirements.txt file should have the following entries:

boto3==1.17.49
sagemaker==1.72.0
s3fs==0.5.1

Orchestrate an XGBoost ML pipeline

Our ML pipeline is a simplified three-step pipeline:

  1. Data preprocessing using AWS Glue. Real pipelines could require numerous processing steps for data cleaning and feature engineering. Although Amazon SageMaker Pipelines provides similar functionality, we use AWS Glue to illustrate how different AWS services or third-party tools and services can be orchestrated in a single pipeline.
  2. Train an XGBoost model using a SageMaker training job.
  3. Deploy the trained model as a real-time inference endpoint.

Solution architecture

Our ML pipeline is pictured in the following diagram. We use an AWS Lambda function to invoke the DAG, and Amazon EventBridge to trigger the Lambda function on a schedule. For more information, see Tutorial: Schedule AWS Lambda functions using EventBridge.
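
The following is a minimal sketch of such a Lambda function, which triggers the DAG through the Amazon MWAA CLI endpoint. The environment name and DAG ID are placeholders, and error handling is omitted.

import base64
import json

import boto3
import urllib3

# Placeholder names; replace with your environment name and DAG ID
MWAA_ENV_NAME = "my-mwaa-environment"
DAG_ID = "my-xgboost-pipeline-dag"

http = urllib3.PoolManager()

def lambda_handler(event, context):
    # Request a short-lived CLI token for the Amazon MWAA environment
    mwaa = boto3.client("mwaa")
    token = mwaa.create_cli_token(Name=MWAA_ENV_NAME)

    # Call the Airflow CLI through the MWAA web server to trigger the DAG
    # (Airflow 2.x syntax; for Airflow v1.10.12 use "trigger_dag <dag_id>")
    resp = http.request(
        "POST",
        f"https://{token['WebServerHostname']}/aws_mwaa/cli",
        headers={
            "Authorization": f"Bearer {token['CliToken']}",
            "Content-Type": "text/plain",
        },
        body=f"dags trigger {DAG_ID}",
    )

    # The CLI output comes back base64 encoded
    payload = json.loads(resp.data.decode("utf-8"))
    output = base64.b64decode(payload["stdout"]).decode("utf-8")
    return {"statusCode": resp.status, "output": output}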

Stage the AWS Glue script

In our demo, we create the AWS Glue job dynamically using a PySpark script saved in Amazon S3. Copy the glue_etl.py file provided in the source code repo to an Amazon S3 location.
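
For example, you could stage the script with Boto3. The local path, bucket, and key shown here are placeholders, and should match the values you reference in your DAG configuration (described in the next section).

import boto3

s3 = boto3.client("s3")

# Placeholder locations; keep the bucket and key consistent with config.py
s3.upload_file(
    Filename="glue_etl.py",
    Bucket="my-mlops-artifacts-bucket",
    Key="scripts/glue_etl.py",
)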

Set DAG configuration values

To keep things simple, we use a config.py file to import any environment-specific configurations rather than define them in the main DAG script. You can view the config.py file in its entirety on GitHub. A best practice is to use AWS Secrets Manager to store configuration and secrets information (as of this writing, AWS Systems Manager Parameter Store isn’t a supported backend on Amazon MWAA). For detailed documentation on how to securely store secrets in AWS Secrets Manager for Amazon MWAA, see the Amazon MWAA documentation.
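
For reference, the following is an illustrative excerpt of the kind of values config.py holds. The names match those used in the DAG code later in this post, but the values shown here are placeholders; see the GitHub repo for the actual file.

# config.py (illustrative excerpt; all values are placeholders)
REGION_NAME = "us-east-1"

# AWS Glue preprocessing job
GLUE_JOB_SCRIPT_S3_BUCKET = "my-mlops-artifacts-bucket"
GLUE_JOB_SCRIPT_S3_KEY = "scripts/glue_etl.py"
GLUE_ROLE_NAME = "my-glue-service-role"
DATA_S3_SOURCE = "s3://my-mlops-data-bucket/raw/customer-churn.csv"
DATA_S3_DEST = "s3://my-mlops-data-bucket/prepared"

# SageMaker training and deployment
SAGEMAKER_ROLE_NAME = "my-sagemaker-execution-role"
SAGEMAKER_TRAINING_JOB_NAME_PREFIX = "churn-xgboost"
SAGEMAKER_TRAINING_DATA_S3_SOURCE = "s3://my-mlops-data-bucket/prepared/train/"
SAGEMAKER_VALIDATION_DATA_S3_SOURCE = "s3://my-mlops-data-bucket/prepared/validation/"
SAGEMAKER_CONTENT_TYPE = "text/csv"
SAGEMAKER_MODEL_S3_DEST = "s3://my-mlops-data-bucket/models/"
SAGEMAKER_MODEL_NAME_PREFIX = "churn-xgboost-model"
SAGEMAKER_ENDPOINT_NAME_PREFIX = "churn-xgboost-endpoint"

# Airflow
AIRFLOW_DAG_ID = "xgboost-ml-pipeline"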

Upload the updated config.py file to the DAG directory.

Stage the customer churn training data

The customer churn dataset is mentioned in the book Discovering Knowledge in Data by Daniel T. Larose. It’s attributed by the author to the University of California Irvine Repository of Machine Learning Datasets. The dataset is publicly available and provided in the GitHub repo.

Upload the customer-churn.csv file to the Amazon S3 location you specified in the config.py file.

Construct the DAG

For our demonstration, the DAG consists of four primary sections:

  • Import statements
  • DAG operator configuration
  • DAG task definitions
  • DAG task dependency definition

Import statements

Because Airflow is Python-based, the DAG file is a simple Python file and the modules for Airflow are imported just as they would be for any Python application.

Some services have native Airflow operators available that manage asynchronous API calls and polling to determine success or failure of orchestrated tasks. We recommend using native operators wherever possible. AWS services that don’t have native Airflow operators, like AWS Glue, can still be orchestrated in Airflow using AWS SDKs called from the general PythonOperator.

For nearly all AWS services, the AWS SDK for Python (Boto3) provides service-level access to the APIs. This SDK provides a high degree of control, but also a lower level of abstraction. For ML pipelines using SageMaker, you can use the SageMaker Python SDK. This is a streamlined SDK abstracted specifically for ML experimentation.

The following import statements include general Airflow modules and operators, native Airflow operators for SageMaker, and the Boto3 and SageMaker SDKs:

# Airflow Operators
import airflow
from airflow.models import DAG
from airflow.utils.dates import days_ago
from airflow.operators.python_operator import PythonOperator

# Airflow Sagemaker Operators
from airflow.providers.amazon.aws.operators.sagemaker_training import SageMakerTrainingOperator
from airflow.providers.amazon.aws.operators.sagemaker_endpoint import SageMakerEndpointOperator
from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook

# AWS SDK for Python
import boto3

# Amazon SageMaker SDK
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.estimator import Estimator
from sagemaker.session import s3_input

# Airflow SageMaker Configuration
from sagemaker.workflow.airflow import training_config
from sagemaker.workflow.airflow import model_config_from_estimator
from sagemaker.workflow.airflow import deploy_config_from_estimator

# Configuration variables
import config

Other import statements are needed to support this demonstration; refer to the GitHub repo for the full code.

DAG operator configuration

The DAG and DAG tasks are defined based on the operators invoked to run each task.

For the AWS Glue task, we invoke the PythonOperator using the SDK for Python to create a client for AWS Glue. To keep the DAG code tidy, we abstract the AWS Glue client code in a helper function called preprocess_glue. We stage the glue_etl.py (referenced in the GitHub repo) in Amazon S3 so it can be loaded when the AWS Glue job is created. See the following code:

def preprocess_glue():
  """preprocess data using glue for etl"""

  # not best practice to hard code location 
  glue_script_location = 's3://{}/{}'.format(config.GLUE_JOB_SCRIPT_S3_BUCKET, config.GLUE_JOB_SCRIPT_S3_KEY)
  glue_client = boto3.client('glue')

  # instantiate the Glue ETL job
  response = glue_client.create_job(
    Name=glue_job_name,
    Description='PySpark job to extract the data and split in to training and validation data sets',
    Role=config.GLUE_ROLE_NAME,
    ExecutionProperty={
      'MaxConcurrentRuns': 2
    },
    Command={
      'Name': 'glueetl',
      'ScriptLocation': glue_script_location,
      'PythonVersion': '3'
    },
    DefaultArguments={
      '--job-language': 'python'
    },
    GlueVersion='1.0',
    WorkerType='Standard',
    NumberOfWorkers=2,
    Timeout=60
    )
  
  # execute the previously instantiated Glue ETL job
  response = glue_client.start_job_run(
    JobName=response['Name'],
    Arguments={
      '--S3_SOURCE': config.DATA_S3_SOURCE,
      '--S3_DEST': config.DATA_S3_DEST,
      '--TRAIN_KEY': 'train/',
      '--VAL_KEY': 'validation/' 
    }
  )

We create a helper function that returns the ARN of the SageMaker role:

def get_sagemaker_role_arn(role_name, region_name):
    iam = boto3.client("iam", region_name=region_name)
    response = iam.get_role(RoleName=role_name)
    return response["Role"]["Arn"]

The XGBoost estimator requires the SageMaker role, container image, and hyperparameters, which we collect using a hook into SageMaker:

hook = AwsBaseHook(aws_conn_id="airflow-sagemaker", client_type="sagemaker")
sess = hook.get_session(region_name=config.REGION_NAME)
sagemaker_role = get_sagemaker_role_arn(config.SAGEMAKER_ROLE_NAME, config.REGION_NAME)
container = get_image_uri(sess.region_name, "xgboost")
hyperparameters = {
    "max_depth":"5",
    "eta":"0.2",
    "gamma":"4",
    "min_child_weight":"6",
    "subsample":"0.8",
    "objective":"binary:logistic",
    "num_round":"100"
}

With the parameters defined, we can create the estimator object:

xgb_estimator = Estimator(
    image_name=container, 
    hyperparameters=hyperparameters,
    role=sagemaker_role,
    sagemaker_session=sagemaker.session.Session(sess),
    train_instance_count=1, 
    train_instance_type='ml.m5.4xlarge', 
    train_volume_size=5,
    output_path=config.SAGEMAKER_MODEL_S3_DEST
)

This estimator object is an input parameter into the training configuration. We need to define other training parameters:

# create unique name with guid
sagemaker_taining_job_name=config.SAGEMAKER_TRAINING_JOB_NAME_PREFIX+'-{}'.format(guid)

# define S3 locations for training & validation data processed using Glue
sagemaker_training_data = s3_input(config.SAGEMAKER_TRAINING_DATA_S3_SOURCE, content_type=config.SAGEMAKER_CONTENT_TYPE)
sagemaker_validation_data = s3_input(config.SAGEMAKER_VALIDATION_DATA_S3_SOURCE, content_type=config.SAGEMAKER_CONTENT_TYPE)

sagemaker_training_inputs = {
  'train': sagemaker_training_data,
  'validation': sagemaker_validation_data
  }

Let’s take a closer look at the arguments for sagemaker_training_inputs. The XGBoost algorithm supports both LIBSVM and CSV text formats for training and validation datasets, but it assumes LIBSVM by default. This means we must specify CSV explicitly so XGBoost interprets our data correctly. The content type is set to text/csv in our custom DAG configuration file. We use CSV because it’s the most common data file format and familiar to all ML practitioners.

With these parameters defined, we can create the training config object:

training_config = training_config(
  estimator=xgb_estimator,
  inputs=sagemaker_training_inputs,
  job_name=sagemaker_taining_job_name
)

For native Airflow SageMaker operators, you can construct and reference well-defined configuration objects when invoking the operators.

The next configuration definition is for the SageMaker endpoint:

# create unique name using guid
sagemaker_model_name=config.SAGEMAKER_MODEL_NAME_PREFIX+'-{}'.format(guid)
sagemaker_endpoint_name=config.SAGEMAKER_ENDPOINT_NAME_PREFIX+'-{}'.format(guid)

For this simple pipeline, we use the deploy_config_from_estimator API option in the SageMaker SDK to export an Airflow deploy config directly from the SageMaker XGBoost estimator (the endpoint_name parameter must be 63 characters or less):

endpoint_config = deploy_config_from_estimator(
  estimator=xgb_estimator, 
  task_id="train", 
  task_type="training", 
  initial_instance_count=1, 
  instance_type="ml.m4.xlarge",
  model_name=sagemaker_model_name,
  endpoint_name=sagemaker_endpoint_name
)

For more information about how we set up the model training and deployment configuration, including how we used the SageMaker SDK sagemaker.workflow.airflow APIs, see the GitHub repo.

With the operator configuration complete, we’re ready to put it all together to define our DAG.

DAG task definitions

For the XGBoost model training task, we invoke the SageMakerTrainingOperator. For the endpoint deployment task, we invoke the SageMakerEndpointOperator. Note that the native operators also support a clean separation of concerns: you can create a model with the SageMakerModelOperator and configure the SageMaker endpoint with the SageMakerEndpointConfigOperator, which gives you more granular control over the creation and deployment of the model. See the following code:

args = {"owner": "airflow", "start_date": airflow.utils.dates.days_ago(2), 'depends_on_past': False}

with DAG(
    dag_id=config.AIRFLOW_DAG_ID,
    default_args=args,
    start_date=days_ago(2),
    schedule_interval=None,
    concurrency=1,
    max_active_runs=1,
) as dag:
    process_task = PythonOperator(
      task_id="process",
      dag=dag,
      #provide_context=False,
      python_callable=preprocess_glue,
    )

    train_task = SageMakerTrainingOperator(
      task_id = "train",
      config = training_config,
      aws_conn_id = "airflow-sagemaker",
      wait_for_completion = True,
      check_interval = 60, #check status of the job every minute
      max_ingestion_time = None, #allow training job to run as long as it needs, change for early stop
    )

    endpoint_deploy_task = SageMakerEndpointOperator(
      task_id = "endpoint-deploy",
      config = endpoint_config,
      aws_conn_id = "sagemaker-airflow",
      wait_for_completion = True,
      check_interval = 60, #check status of endpoint deployment every minute
      max_ingestion_time = None,
      operation = 'create', #change to update if you are updating rather than creating an endpoint
    )

DAG task dependency definition

After we define the tasks, we set the dependencies between them. Airflow uses the right shift operator (>>) to define downstream dependencies and the left shift operator (<<) to define upstream dependencies. In our example, we only define downstream dependencies:

# set the dependencies between tasks
process_task >> train_task >> endpoint_deploy_task

When the completed DAG is uploaded to the designated Amazon S3 location, Amazon MWAA automatically ingests the DAG. The graph view visually shows the task dependencies. You can trigger the DAG manually from the console during iterative testing, or as we described earlier, from an external source such as EventBridge and a Lambda function. Each task is highlighted depending on the stage of completion, as shown in the following screenshot. Dark green indicates successful completion of the task.

Test the deployed endpoint

After the endpoint-deploy task is complete, we can view the endpoint on the SageMaker console. The SageMaker endpoint is a real-time inference endpoint. SageMaker takes care of deploying, hosting, and exposing the HTTPS endpoint.

We can test the deployed endpoint with a SageMaker notebook.

Follow these steps to set up a SageMaker notebook environment:

  1. Launch a SageMaker notebook instance.
  2. On the Notebook instances page, open your notebook instance by choosing either Open JupyterLab for the JupyterLab interface or Open Jupyter for the classic Jupyter view.
  3. Choose Upload to import the test notebook available in the GitHub repo.

Prepare a test sample

We use Pandas DataFrames to create a test dataset out of the customer churn dataset that was used for training. For the test dataset, we must drop the label column, which is the first column. We also take a random sample of the dataset using the Pandas DataFrame sample method.
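
The following is a minimal sketch of this preparation. It assumes a local, headerless copy of the prepared training CSV; the path and sample size are placeholders.

import pandas as pd

# Placeholder path to a headerless copy of the prepared customer churn data
df = pd.read_csv("customer-churn-prepared.csv", header=None)

# Drop the label, which is the first column, so the payload matches what the model expects
test_features = df.drop(df.columns[0], axis=1)

# Take a small random sample of rows to send to the endpoint
test_sample = test_features.sample(n=5, random_state=42)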

Request inferences

Now that we have our sampled test data, we use the Boto3 library to create a SageMaker runtime client. We use the client when we invoke our endpoint, pass it test data, and receive an inference value.
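
A minimal sketch of that call might look like the following. The endpoint name is a placeholder (check the SageMaker console for the name generated by the DAG), and test_sample is the DataFrame prepared in the previous step.

import boto3

runtime = boto3.client("sagemaker-runtime")

# Placeholder endpoint name
endpoint_name = "churn-xgboost-endpoint-example"

# Serialize the sampled rows as CSV, the content type the XGBoost model expects
payload = test_sample.to_csv(header=False, index=False)

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="text/csv",
    Body=payload,
)

# Each line of the response body is a churn prediction for one input row
print(response["Body"].read().decode("utf-8"))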

Conclusion

You can use Amazon MWAA to orchestrate and automate complex ML pipelines from the data processing stage through model training and endpoint deployment. You can set special configuration options in the Amazon MWAA environment to support popular ML frameworks like XGBoost.

In this post, we demonstrated how to dynamically create and run an AWS Glue job to preprocess training and validation data. We showed how to construct the DAG to support this ML pipeline, including the import statements, the DAG operator configuration, the DAG task definitions, and the DAG dependency definition. We demonstrated the difference between using native Airflow operators vs. invoking AWS SDK API calls from a generic PythonOperator.

Amazon MWAA is a highly versatile orchestration tool that enterprises can use to operationalize and scale their ML capabilities.


About the authors

Justin Leto is a Sr. Solutions Architect at Amazon Web Services with specialization in big data analytics and machine learning. His passion is helping customers achieve better cloud adoption. In his spare time, he enjoys offshore sailing and playing jazz piano. He lives in Manhattan with his wife Veera.

David Ehrlich is a Machine Learning Specialist at Amazon Web Services. He is passionate about helping customers unlock the true potential of their data. In his spare time, he enjoys exploring the different neighborhoods in New York City, going to comedy clubs, and traveling.

Shreyas Subramanian is an AI/ML Specialist Solutions Architect, and helps customers solve their business challenges using machine learning on AWS.

Announcing specialized support for extracting data from invoices and receipts using Amazon Textract

Receipts and invoices are documents that are critical to small and medium businesses (SMBs), startups, and enterprises for managing their accounts payable processes. These types of documents are difficult to process at scale because they follow no set design rules, yet any individual customer encounters thousands of distinct types of these documents.

In this post, we show how you can use Amazon Textract’s new Analyze Expense API to extract line item details in addition to key-value pairs from invoices and receipts, which is a frequent request we hear from customers. Amazon Textract uses machine learning (ML) to understand the context of invoices and receipts, and automatically extracts specific information like vendor name, price, and payment terms. In this post, we walk you through processing an invoice/receipt using Amazon Textract and extracting a set of fields and line-item details. While AWS takes care of building, training, and deploying advanced ML models in a highly available and scalable environment, you take advantage of these models with simple-to-use API actions.

We cover the following topics in this post:

  • How Amazon Textract processes invoices and receipts
  • A walkthrough of the Amazon Textract console
  • Anatomy of the Amazon Textract AnalyzeExpense API response
  • How to process the response with the Amazon Textract parser library
  • Sample solution architecture to automate invoice and receipts processing
  • How to deploy and use the solution

Invoice and receipt processing using Amazon Textract

SMBs, startups, and enterprises process paper-based invoices and receipts as part of their accounts payable process to reconcile their goods received and for auditing purposes. Employees who submit expense reports also submit scans or images of the associated receipts. Companies try to standardize electronic invoicing, but some vendors only offer paper invoices, and some countries legally require paper invoices.

The peculiarities of invoices and receipts mean it’s also a difficult problem to solve at scale—invoices and receipts all look different, because each vendor designs its own documents independently. The labels are imperfect and inconsistent. Vendor name is often not explicitly labeled and has to be interpreted based on context. Other important information such as customer number, customer ID, or account ID are labeled differently from document to document.

To solve this problem, you can use Amazon Textract to process invoices and receipts at scale. Amazon Textract works with any style of invoice or receipt, no templates or configuration required, and extracts relevant data that can be tricky to extract such as contact information, items purchased, and vendor name from those documents. That includes the line-item details, not just the headline amounts.

Amazon Textract also identifies vendor names that are critical for your workflows but may not be explicitly labeled. For example, Amazon Textract can find the vendor name on a receipt even if it’s only indicated within a logo at the top of the page without an explicit key-value pair combination.

Amazon Textract also makes it easy to consolidate input from diverse receipts and invoices. Different documents use different words for the same concept. For example, Amazon Textract maps relationships between field names in different documents such as customer no., customer number, and account ID, and outputs standard taxonomy (in this case, INVOICE_RECEIPT_ID), thereby representing data consistently across document types.

Amazon Textract console walkthrough

Before we get started with the API and code samples, let’s review the Amazon Textract console. The following images show examples of both an invoice and a receipt document on the Analyze Expense output tab of the Amazon Textract console.

Amazon Textract automatically detects the vendor name, invoice number, ship to address, and more from the sample invoice and displays them on the Summary Fields tab. It also represents the standard taxonomy of fields in brackets next to the actual value on the document. For example, it identifies “INVOICE #” as the standard field INVOICE_RECEIPT_ID.

Additionally, Amazon Textract detects the items purchased and displays them on the Line Item Fields tab.

The following is a similar example of a receipt. Amazon Textract detects “Whole Foods Market” as VENDOR_NAME even though the receipt doesn’t explicitly mention it as the vendor name.

The Amazon Textract AnalyzeExpense API response

In this section, we explain the AnalyzeExpense API response structure using sample images. The following is a sample receipt and the corresponding AnalyzeExpense response JSON.

Sample store receipt image

AnalyzeExpense JSON response for SummaryFields:

{
    "DocumentMetadata": {
        "Pages": 1
    },
    "ExpenseDocuments": [
        {
            "ExpenseIndex": 1,
            "SummaryFields": [
                {
                    "Type": {
                        "Text": "VENDOR_NAME",
                        "Confidence": 97.0633544921875
                    },
                    "ValueDetection": {
                        "Text": "New Store X1",
                        "Geometry": {
                            …
                        },
                        "Confidence": 96.65239715576172
                    },
                    "PageNumber": 1
                },
                {
                    "Type": {
                        "Text": "OTHER",
                        "Confidence": 81.0
                    },
                    "LabelDetection": {
                        "Text": "Order type:",
                        "Geometry": {
                            …
                        },
                        "Confidence": 80.8591079711914
                    },
                    "ValueDetection": {
                        "Text": "Quick Sale",
                        "Geometry": {
                            …
                        },
                        "Confidence": 80.82302856445312
                    },
                    "PageNumber": 1
                }
…

AnalyzeExpense JSON response for LineItemGroups:

"LineItemGroups": [
                {
                    "LineItemGroupIndex": 1,
                    "LineItems": [
                        {
                            "LineItemExpenseFields": [
                                {
                                    "Type": {
                                        "Text": "ITEM",
                                        "Confidence": 99.95216369628906
                                    },
                                    "ValueDetection": {
                                        "Text": "Red Banana is innbusiness ",
                                        "Geometry": {
                                            …
                                        },
                                        "Confidence": 99.81525421142578
                                    },
                                    "PageNumber": 1
                                },
                                {
                                    "Type": {
                                        "Text": "PRICE",
                                        "Confidence": 99.95216369628906
                                    },
                                    "ValueDetection": {
                                        "Text": "$66.96",
                                        "Geometry": {
                                            …
                                        },

The AnalyzeExpense JSON output contains ExpenseDocuments, and each ExpenseDocument contains SummaryFields and LineItemGroups. The ExpenseIndex field uniquely identifies the expense, and associates the appropriate SummaryFields or LineItemGroups detected to that expense.

The most granular level of data in the AnalyzeExpense response consists of Type, ValueDetection, and LabelDetection (optional). Let’s call this set of data an AnalyzeExpense element. The preceding example illustrates an AnalyzeExpense element that contains Type, ValueDetection, and LabelDetection.

In the preceding example, Amazon Textract detected 16 SummaryField key-value pairs, including VENDOR_NAME: New Store X1 and Order type: Quick Sale. AnalyzeExpense detects these key-value pairs and displays them as shown in the preceding example. The individual entities are as follows:

  • LabelDetection – The optional key of the key-value pair. In the Order type: Quick Sale example, it’s Order type:. For implied values such as Vendor Name, where the key isn’t explicitly shown in the receipt, LabelDetection isn’t included in the AnalyzeExpense element. In the preceding example, “New Store X1” at the top of the receipt is the vendor name without an explicit key. The AnalyzeExpense element for “New Store X1” has a type of VENDOR_NAME and ValueDetection of New Store X1, but doesn’t have a LabelDetection.
  • Type – This is the normalized type of the key-value pair. Because Order type isn’t a normalized taxonomy value, it’s classified as OTHER. Examples of normalized values are Vendor Name, Receiver Address, and Payment Terms. For a full list of normalized taxonomy values, see the Amazon Textract Developer Guide.
  • ValueDetection – The value of the key-value pair. In the example of Order type: Quick Sale, it’s Quick Sale.

The AnalyzeExpense API also detects ITEM, QUANTITY, and PRICE within line items as normalized fields. If other text is in a line item on the receipt image, such as SKU or a detailed description, it’s included in the JSON as EXPENSE_ROW, as shown in the following example:

{
                                    "Type": {
                                        "Text": "EXPENSE_ROW",
                                        "Confidence": 99.95216369628906
                                    },
                                    "ValueDetection": {
                                        "Text": "Red Banana is in x3 $66.96nbusiness ",
                                        "Geometry": {
                                          …
                                        },
                                        "Confidence": 98.11214447021484
                                    }

In addition to the detected content, the AnalyzeExpense API provides information like confidence scores and bounded boxes for detected elements. It gives you control of how you consume extracted content and integrate it into your applications. For example, you can flag any elements that have a confidence score under a certain threshold for manual review.
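
For example, the following sketch flags summary fields whose value confidence falls below a chosen threshold, using the response structure shown earlier (the threshold value is arbitrary).

CONFIDENCE_THRESHOLD = 90.0

# 'response' is the AnalyzeExpense API response shown earlier
for expense_doc in response["ExpenseDocuments"]:
    for field in expense_doc["SummaryFields"]:
        confidence = field["ValueDetection"]["Confidence"]
        if confidence < CONFIDENCE_THRESHOLD:
            print(
                f"Review needed: {field['Type']['Text']} = "
                f"{field['ValueDetection']['Text']} ({confidence:.1f}%)"
            )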

The input document is either bytes or an Amazon Simple Storage Service (Amazon S3) object. You pass image bytes to an Amazon Textract API operation by using the Bytes property. For example, you use the Bytes property to pass a document loaded from a local file system.

Image bytes passed by using the Bytes property must be base64 encoded. Your code might not need to encode document file bytes if you’re using an AWS SDK to call Amazon Textract API operations. Alternatively, you can pass images stored in an S3 bucket to an Amazon Textract API operation by using the S3Object property. Documents stored in an S3 bucket don’t need to be base64 encoded.

You can call the AnalyzeExpense API using the AWS Command Line Interface (AWS CLI), as shown in the following code. Make sure you have the latest AWS CLI version installed.

aws textract analyze-expense --document '{"S3Object": {"Bucket": "<Your Bucket>","Name": "Invoice/Receipts S3 Objects"}}'

Process the response with the Amazon Textract parser library

Apart from working with the JSON output as-is, you can use the Amazon Textract response parser library to parse the JSON returned by the AnalyzeExpense API. The library provides programming language-specific constructs to work with different parts of the document; for more details, refer to the GitHub repo. The response parser makes it easier to deserialize the JSON response and use it in your application, and the Amazon Textract PrettyPrinter library lets you print the parsed response in different formats. The GitHub repository shows examples for parsing Amazon Textract responses. You can parse the SummaryFields and LineItemGroups for every ExpenseDocument in the AnalyzeExpense response JSON using the AnalyzeExpense parser, as shown in the following code:

# Install the latest Boto3 Python SDK
python3 -m pip install boto3 --upgrade

# Install the latest version of the Amazon Textract response parser
python3 -m pip install amazon-textract-response-parser --upgrade

import boto3

# documentName is the local path to your invoice or receipt image (placeholder value)
documentName = 'receipt.png'

client = boto3.client(
    service_name='textract',
    region_name='us-east-1',
    endpoint_url='https://textract.us-east-1.amazonaws.com',
)

with open(documentName, 'rb') as file:
    img_test = file.read()
    bytes_test = bytearray(img_test)
    print('Image loaded', documentName)

# process using image bytes
response = client.analyze_expense(Document={'Bytes': bytes_test})

You can further use the serializer and deserializer to validate the response JSON and convert it into the Python object representation, and vice versa.

The following code deserializes the response JSON:

# j holds the Textract JSON
from trp.trp2_expense import TAnalyzeExpenseDocument, TAnalyzeExpenseDocumentSchema
t_doc = TAnalyzeExpenseDocumentSchema().load(json.loads(j))

The following code serializes the response JSON:

from trp.trp2_expense import TAnalyzeExpenseDocument, TAnalyzeExpenseDocumentSchema
t_doc = TAnalyzeExpenseDocumentSchema().dump(t_doc)

You can also convert the output to formats like CSV, Presto, TSV, HTML, LaTeX, and more by using the Amazon Textract PrettyPrinter library.

Install the PrettyPrinter library with the following code:

python3 -m pip install amazon-textract-prettyprinter --upgrade

Call the get_string method of textractprettyprinter.t_pretty_print_expense with the output_type as SUMMARY or LINEITEMGROUPS and table_format set to whichever format you want to output. The following example code outputs both SUMMARY and LINEITEMGROUPS in the fancy grid format:

import os
import boto3
from textractprettyprinter.t_pretty_print_expense import get_string
from textractprettyprinter.t_pretty_print_expense import Textract_Expense_Pretty_Print, Pretty_Print_Table_Format

"""
boto3 client for Amazon Textract
"""
textract = boto3.client(service_name='textract')

"""
Set the S3 Bucket Name and File name 
Please set the below variables to your S3 Location
"""
s3_source_bucket_name = "YOUR S3 BUCKET NAME"
s3_request_file_name = "YOUR S3 EXPENSE IMAGE FILENAME "
    
"""
Call the Textract AnalyzeExpense API with the input Expense Image in Amazon S3
"""
try:
    response = textract.analyze_expense(
        Document={
            'S3Object': {
                'Bucket': s3_source_bucket_name,
                'Name': s3_request_file_name
            }
        })
    """
    Call Amazon Pretty Printer get_string method to parse the response and print it in fancy_grid format. 
    You can set pretty print format to other types as well like csv, latex etc.
    """
    pretty_printed_string = get_string(textract_json=response, output_type=[Textract_Expense_Pretty_Print.SUMMARY, Textract_Expense_Pretty_Print.LINEITEMGROUPS], table_format=Pretty_Print_Table_Format.fancy_grid)
        
    """
    Use the pretty printed string to save the response in storage of your choice. 
    Below is just printing it on stdout.
    """
    print(pretty_printed_string)    

except Exception as e_raise:
    print(e_raise)
    raise e_raise

The following is the PrettyPrinter output for a sample receipt.

The following is another example of detecting structured data from an invoice.

AnalyzeExpense detects the various normalized summary fields like PAYMENT_TERMS, INVOICE_RECEIPT_ID, TOTAL, TAX, and RECEIVER_ADDRESS.

It also detected one LineItemGroup with one LineItem having DESCRIPTION, QUANTITY, and PRICE, as shown in the following PrettyPrinter output.

Solution architecture overview

The following diagram is a common solution architecture pattern you can use to process documents using Amazon Textract. The solution uses the new AnalyzeExpense API to process receipts and invoices on Amazon S3 and stores the results back in Amazon S3.

The solution architecture includes the following steps:

  1. The input and output S3 buckets store the input expense documents (images) in PNG and JPEG formats and the AnalyzeExpense PrettyPrinter outputs, respectively.
  2. An event rule based on an event pattern in Amazon EventBridge matches incoming S3 PutObject events in the input S3 bucket containing the raw expense document images.
  3. The configured EventBridge rule sends the event to an AWS Lambda function for further processing.
  4. The Lambda function reads the images from the input S3 bucket, calls the AnalyzeExpense API, uses the Amazon Textract response parser to deserialize the JSON response, uses Amazon Textract PrettyPrinter to easily print the parsed response, and stores the results back to the S3 bucket in different formats.

Deploy and use the solution

You can deploy the solution architecture using an AWS CloudFormation template that performs much of the setup work for you.

  1. Choose Launch Stack to deploy the solution architecture in the US East (N. Virginia) Region.

  2. Don’t make any changes to the stack name or parameters.
  3. In the Capabilities section, select I acknowledge that AWS CloudFormation might create IAM resources.
  4. Choose Create stack.

To use the solution, upload receipt and invoice images to the S3 bucket referenced by SourceBucket in the CloudFormation template. This triggers an event that invokes the Lambda function, which calls the AnalyzeExpense API, parses the response JSON, converts the parsed response into CSV or fancy_grid format, and stores the result in another S3 bucket (referenced by OutputBucket in the CloudFormation template).

You can extend the provided Lambda function further based on your requirements, and also change the output format to other types like TSV, grid, LaTeX, and many more by setting the appropriate value of table_format when calling the get_string method of textractprettyprinter.t_pretty_print_expense in Amazon Textract PrettyPrinter.

The sample Lambda function deployment package included in this CloudFormation template also bundles the Boto3 SDK. If you want to upgrade the Boto3 SDK in the future, you either need to create a new deployment package with the upgraded Boto3 SDK or rely on the Boto3 SDK provided by the Lambda Python runtime.

Clean up resources

To delete the resources that the CloudFormation template created, complete the following steps:

  1. Delete the input, output, and logging S3 buckets created by the CloudFormation template.
  2. On the AWS CloudFormation console, select the stack that you created.
  3. On the Actions menu, choose Delete.

Summary

In this post, we provided an overview of the new Amazon Textract AnalyzeExpense API to quickly and easily retrieve structured data from receipts and invoices. We also described how you can parse the AnalyzeExpense response JSON using the Amazon Textract parser library and save the output in different formats using Amazon Textract PrettyPrinter. Finally, we provided a solution architecture for processing invoices and receipts using Amazon S3, EventBridge, and a Lambda function.

For more information, see the Amazon Textract Developer Guide.


About the Authors

Dhawalkumar Patel is a Sr. Startups Machine Learning Solutions Architect at AWS with expertise in Machine Learning and Serverless domains. He has worked with organizations ranging from large enterprises to startups on problems related to distributed computing and artificial intelligence.

Raj Copparapu is a Product Manager focused on putting machine learning in the hands of every developer.

Manish Chugh is a Sr. Solutions Architect at AWS based in San Francisco, CA. He has worked with organizations ranging from large enterprises to early-stage startups. He is responsible for helping customers architect scalable, secure, and cost-effective workloads on AWS. In his free time, he enjoys hiking East Bay trails, road biking, and watching (and playing) cricket.

Detect small shapes and objects within your images using Amazon Rekognition Custom Labels

There are multiple scenarios in which you may want to use computer vision to detect small objects or symbols within a given image. Whether it’s detecting company logos on grocery store shelves to manage inventory, detecting informative symbols on documents, or evaluating survey or quiz documents that contain checkmarks or shaded circles, the size ratio of objects of interest to the overall image size is often very small. This presents a common challenge with machine learning (ML) models, because the processes performed by the models on an image can sometimes miss smaller objects when trying to reduce down and learn the details of larger objects in the image.

In this post, we demonstrate a preprocessing method to increase the size ratio of small objects with respect to an image and optimize the object detection capabilities of Amazon Rekognition Custom Labels, effectively solving the small object detection challenge.

Amazon Rekognition is a fully managed service that provides computer vision (CV) capabilities for analyzing images and video at scale, using deep learning technology without requiring ML expertise. Amazon Rekognition Custom Labels, an automated ML feature of Amazon Rekognition, lets you quickly train custom CV models specific to your business needs simply by bringing labeled images.

Solution overview

The preprocessing method that we implement is an image tiling technique. We discuss how this works in detail in a later section. To demonstrate the performance boost our preprocessing method presents, we train two Amazon Rekognition Custom Labels models—one baseline model and one tiling model—evaluate them on the same test dataset, and compare model results.

Data

For this post, we created a sample dataset to replicate many use cases containing small objects that have significant meaning or value within the document. In the following sample document, we’re interested in four object classes: empty triangles, solid triangles, up arrows, and down arrows. In the creation of this dataset, we converted each page of the PDF document into an image. The four object classes listed were labeled using Amazon SageMaker Ground Truth.

Baseline model

After you have a labeled dataset, you’re ready to train your Amazon Rekognition Custom Labels model. First we train our baseline model that doesn’t include the preprocessing technique. The first step is to set up our Amazon Rekognition Custom Labels project.

  1. On the Amazon Rekognition console, choose Use Custom Labels.
  2. Under Datasets, choose Create new dataset.
  3. For Dataset name, enter cl-blog-original-train.

We repeat this process to create our Amazon SageMaker Ground Truth labeled test set called cl-blog-original-test.

  1. Under Projects, choose Create new project.
  2. For Project name, enter cl-blog-original.

  1. On the cl-blog-original project details page, choose Train new model.
  2. Reference the train and test set you created that Amazon Rekognition uses to train and evaluate the model.

The training time varies based on the size and complexity of the dataset. For reference, our model took about 1.5 hours to train.

After the model is trained, Amazon Rekognition outputs various metrics to the model dashboard so you can analyze the performance of your model. The main metric we’re interested in for this example is the F1 score, but precision and recall are also provided. The F1 score measures a model’s overall accuracy, calculated as the harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall). Precision measures what fraction of the model’s predictions are actual objects of interest, and recall measures what fraction of the objects of interest the model correctly predicted.

Our baseline model yielded an F1 score of about 0.40. The following results show that the model performs fairly well with regard to precision for the different labels, but fails to recall the labels across the board. The recall scores for down_arrow, empty_triangle, solid_triangle, and up_arrow are 0.27, 0.33, 0.33, and 0.17, respectively. Because precision and recall both contribute to the F1 score, the high precision scores for each label are brought down by the poor recall scores.

To improve our model performance, we explore a tiling approach in the next section.

Tiling approach model

Now that we’ve trained our baseline model and analyzed its performance, let’s see if tiling our original images into smaller images results in a performance increase. The idea here is that tiling our image into smaller images allows for the small symbol to appear with a greater size ratio when compared to the same small symbol in the original image. The first step before creating a labeled dataset in Ground Truth is to tile the images in our original dataset into smaller images. We then generate our labeled dataset with Ground Truth using this new tiled dataset, and train a new Amazon Rekognition Custom Labels tiled model.

  1. Open up your preferred IDE and create two directories:
    1. original_images for storing the original document images
    2. tiled_images for storing the resulting tiled images
  2. Upload your original images into the original_images directory.

For this post, we use Python and import a package called image_slicer. This package includes functions to slice our original images into a specified number of tiles and save the resulting tiled images into a specified folder.

  1. In the following code, we iterate over each image in our original image folder, slice the image into quarters, and save the resulting tiled images into our tiled images folder:

import os
import image_slicer

for img in os.listdir("original_images"):
    # slice each original image into 4 tiles (quarters), without saving yet
    tiles = image_slicer.slice(os.path.join("original_images", img), 4, save=False)
    # save the tiles to the tiled_images directory, prefixed with the original file name
    image_slicer.save_tiles(
        tiles, directory="tiled_images",
        prefix=str(img).split(".")[0], format="png")

You can experiment with the number of tiles, because smaller symbols in more complex images may need a higher number of tiles, but four worked well for our set of images.

The following image shows an example output from the tiling script.

After our image dataset is tiled, we can repeat the same steps from training our baseline model.

  1. Create a labeled Ground Truth cl-blog-tiled-train and cl-blog-tiled-test set.

We make sure to use the same images in our baseline-model-test set as our tiled-model-test set to ensure we’re maintaining an unbiased evaluation when comparing both models.

  1. On the Amazon Rekognition console, on the cl-blog-tiled project details page, choose Train new model.
  2. Reference the tiled train and test set you just created that Amazon Rekognition uses to train and evaluate the model.

After the model is trained, Amazon Rekognition outputs various metrics to the model dashboard so you can analyze the performance of your model. Our tiled model achieves an F1 score of about 0.68. The following results show that the recall improved for all labels at the expense of the precision of down_arrow and up_arrow. However, the improved recall scores were enough to boost our F1 score.

Results

When we compare the results of the baseline model and the tiled model, we can observe that the tiling approach gave the new model a significant 0.28 boost in F1 score, from 0.40 to 0.68. Because the F1 score summarizes both precision and recall, this represents a substantial improvement in the model’s ability to accurately identify small objects on a page compared to the baseline model. Although the precision for the down_arrow label dropped from 0.89 to 0.28 and the precision for the up_arrow label dropped from 0.71 to 0.68, the recall for all the labels significantly increased.

Conclusion

In this post, we demonstrated how to use tiling as a preprocessing technique to maximize the performance of Amazon Rekognition Custom Labels when detecting small objects. This method is a quick addition that improves the predictive accuracy of an Amazon Rekognition Custom Labels model by increasing the size ratio of small objects relative to the overall image.

For more information about using custom labels, see What Is Amazon Rekognition Custom Labels?


About the Authors

Alexa Giftopoulos is a Machine Learning Consultant at AWS with the National Security Professional Services team, where she develops AI/ML solutions for various business problems.

Sarvesh Bhagat is a Cloud Consultant for AWS Professional Services based out of Virginia. He works with public sector customers to help solve their AI/ML-focused business problems.


Akash Patwal is a Machine Learning Consultant at AWS where he builds AI/ML solutions for customers.

Bring your own container to project model accuracy drift with Amazon SageMaker Model Monitor

The world we live in is constantly changing, and so is the data that is collected to build models. One of the problems that is often seen in production environments is that the deployed model doesn’t behave the same way as it did during the training phase. This concept is generally called data drift or dataset shift, and can be caused by many factors, such as bias in sampling data that affects features or label data, the non-stationary nature of time series data, or changes in the data pipeline. Because machine learning (ML) models aren’t deterministic, it’s important to minimize the variance in the production environment by periodically monitoring the deployment environment for model drift and sending alerts and, if necessary, triggering retraining of the models with new data.

Amazon SageMaker is a fully managed service that enables developers and data scientists to quickly and easily build, train, and deploy ML models at any scale. After you train an ML model, you can deploy it on SageMaker endpoints that are fully managed and can serve inferences in real time with low latency. After you deploy your model, you can use Amazon SageMaker Model Monitor to continuously monitor the quality of your ML model in real time. You can also configure alerts to notify and trigger actions if any drift in model performance is observed. Early and proactive detection of these deviations enables you to take corrective actions, such as collecting new ground truth training data, retraining models, and auditing upstream systems, without having to manually monitor models or build additional tooling.

In this post, we present some techniques to detect covariate drift (one of the types of data drift) and demonstrate how to incorporate your own drift detection algorithms and visualizations with Model Monitor.

Types of data drift

Data drift can be classified into three categories depending on whether the distribution shift is happening on the input or on the output side, or whether the relationship between the input and the output has changed.

Covariate shift

In a covariate shift, the distribution of inputs changes over time, but the conditional distribution P(y|x) doesn’t change. This type of drift is called covariate shift because the problem arises due to a shift in the distribution of the covariates (features). For example, the training dataset for a face recognition algorithm may contain predominantly younger faces, while the real world may have a much larger proportion of older people.
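
As a simple illustration (not the specific algorithm used later in this post), you can quantify covariate shift for a numeric feature with a two-sample Kolmogorov-Smirnov test that compares the training data against recently captured inference data. The feature values below are synthetic stand-ins.

import numpy as np
from scipy import stats

# Synthetic stand-ins: ages seen during training vs. ages observed in production
train_age = np.random.normal(loc=30, scale=5, size=5000)
inference_age = np.random.normal(loc=45, scale=12, size=1000)

# The KS statistic is the maximum distance between the two empirical distributions
ks_statistic, p_value = stats.ks_2samp(train_age, inference_age)

# A small p-value (for example, < 0.05) suggests the feature distribution has shifted
print(f"KS statistic: {ks_statistic:.3f}, p-value: {p_value:.2e}")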

Label shift

While covariate shift focuses on changes in the feature distribution, label shift focuses on changes in the distribution of the class variable. This type of shifting is essentially the reverse of covariate shift. An intuitive way to think about it might be to consider an unbalanced dataset; for example, if the spam to non-spam ratio of emails in our training set is 50%, but in reality, 10% of our emails are non-spam, then the target label distribution has shifted.

Concept shift

Concept shift is different from covariate and label shift in that it’s not related to the data distribution or the class distribution, but instead is related to the relationship between the two variables. For example, in stock analysis, the relationship between prior stock market data and the stock price is non-stationary. The relationship between the input data and the target variable changes over time, and the model needs to be refreshed often.

Now that we know the types of distribution shifts, let’s see how Model Monitor helps us in detecting data drifts. In the next section, we walk through setting up Model Monitor and incorporating our custom drift detection algorithms.

Model Monitor

Model Monitor offers four different types of monitoring capabilities to detect and mitigate model drift in real time:

  • Data quality – Helps detect change in data schemas and statistical properties of independent variables and alerts when a drift is detected.
  • Model quality – For monitoring model performance characteristics such as accuracy or precision in real time, Model Monitor allows you to ingest the ground truth labels collected from your applications. Model Monitor automatically merges the ground truth information with prediction data to compute the model performance metrics.
  • Model bias – Model Monitor is integrated with Amazon SageMaker Clarify to improve visibility into potential bias. Although your initial data or model may not be biased, changes in the world may cause bias to develop over time in a model that has already been trained.
  • Model explainability – Drift detection alerts you when a change occurs in the relative importance of feature attributions.

Some of the Model Monitor pre-built monitors are powered by Deequ, a library for measuring data quality. You don’t need to write any code to use these pre-built monitoring capabilities, and you also have the flexibility to monitor models with your own code for custom analysis. All metrics emitted by Model Monitor can be collected and viewed in Amazon SageMaker Studio, so you can visually analyze your model performance without writing additional code.

In certain scenarios, the pre-built monitors may not be sufficient to generate sophisticated metrics to detect drifts, and may necessitate bringing your own metrics. In the next sections, we describe the setup to bring in your metrics by building a custom container.

Environment setup

For this post, we use a SageMaker notebook to set up Model Monitor and also visualize the drifts. We begin with setting up required roles and Amazon Simple Storage Service (Amazon S3) buckets to store our data:

import boto3
from sagemaker import get_execution_role, session

region = boto3.Session().region_name

sm_client = boto3.client('sagemaker')

role = get_execution_role()
print(f"RoleArn: {role}")

# You can use a different bucket, but make sure the role you chose for this notebook
# has the s3:PutObject permission. This is the bucket into which the data is captured.
bucket = session.Session(boto3.Session()).default_bucket()
print(f"Demo Bucket: {bucket}")
prefix = 'sagemaker/DEMO-ModelMonitor'

s3_capture_upload_path = f's3://{bucket}/{prefix}/datacapture'
s3_report_path = f's3://{bucket}/{prefix}/reports'

Upload train dataset, test dataset, and model file to Amazon S3

Next, we upload our training and test datasets, and also the trained model we use for inference. For this post, we use the Census Income Dataset from the UCI Machine Learning Repository. The dataset consists of people's income and several attributes that describe population demographics. The task is to predict whether a person makes above or below $50,000. The dataset contains both categorical and integer attributes, and has several missing values, which makes it a good example to demonstrate model drift and detection.

We use the XGBoost algorithm to train the model offline using SageMaker, and we provide the model file for deployment. The training dataset is used for comparison with the inference data to generate drift scores, while the test dataset is used to compute how much the accuracy of the model has degraded due to drift. We provide more intuition on these algorithms in later steps.

The following code uploads our datasets and model to Amazon S3:

import os

model_file = open("model/model.tar.gz", 'rb')
train_file = open("data/train.csv", 'rb')
test_file = open("data/test.csv", 'rb')

s3_model_key = os.path.join(prefix, 'model.tar.gz')
s3_train_key = os.path.join(prefix, 'train.csv')
s3_test_key = os.path.join(prefix, 'test.csv')

s3_bucket = boto3.Session().resource('s3').Bucket(bucket)
s3_bucket.Object(s3_model_key).upload_fileobj(model_file)
s3_bucket.Object(s3_train_key).upload_fileobj(train_file)
s3_bucket.Object(s3_test_key).upload_fileobj(test_file)

Set up a Docker container

Model Monitor supports bringing your own custom model monitor containers. When you create a MonitoringSchedule, Model Monitor starts processing jobs to evaluate incoming inference data. When it invokes the container, Model Monitor sets up additional environment variables for you so that your container has enough context to process the data for that particular run of the scheduled monitoring. For the container code variables, see Container Contract Inputs. Of the available input environment variables, we're interested in dataset_source, output_path, and end_time:

"Environment": {
    "dataset_source": "/opt/ml/processing/endpointdata",
    "output_path": "/opt/ml/processing/resultdata",
    "end_time": "2019-12-01T16: 20: 00Z"
}

The dataset_source variable specifies the data capture location on the container, and end_time refers to the time of the last event capture. The custom drift algorithm is included in the src directory. For details on the algorithms, see the GitHub repo.
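The following is a minimal sketch of what such an entry point could look like, assuming the captured data arrives as JSON Lines files under dataset_source; the actual drift_detector.py in the GitHub repo implements the full drift algorithms and differs in detail:

# Minimal sketch of a custom Model Monitor entry point (the real drift_detector.py
# in the accompanying repo is more involved)
import argparse
import glob
import json
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_s3_uri')
    parser.add_argument('--test_s3_uri')
    parser.add_argument('--target_label')
    args = parser.parse_args()

    # Environment variables injected by Model Monitor (see Container Contract Inputs)
    dataset_source = os.environ.get('dataset_source', '/opt/ml/processing/endpointdata')
    output_path = os.environ.get('output_path', '/opt/ml/processing/resultdata')
    end_time = os.environ.get('end_time')

    # Read the captured inference records mounted under dataset_source
    records = []
    for path in glob.glob(os.path.join(dataset_source, '**', '*.jsonl'), recursive=True):
        with open(path) as f:
            records.extend(json.loads(line) for line in f)

    # ... compute drift metrics against the training data referenced by args.train_s3_uri ...

    # Write results where Model Monitor expects them
    os.makedirs(output_path, exist_ok=True)
    with open(os.path.join(output_path, 'results.json'), 'w') as f:
        json.dump({'num_records': len(records), 'end_time': end_time}, f)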

Now we build the Docker container and push it to Amazon Elastic Container Registry (Amazon ECR). See the following Dockerfile:

FROM python:3.8-slim-buster

RUN pip3 install pandas==1.1.4 numpy==1.19.4 scikit-learn==0.23.2 pyarrow==2.0.0 scipy==1.5.4 boto3==1.17.12

WORKDIR /home
COPY src/* /home/

ENTRYPOINT ["python3", "drift_detector.py"]

We build and push it to Amazon ECR with the following code:

from docker_utils import build_and_push_docker_image

repository_short_name = 'custom-model-monitor'

image_name = build_and_push_docker_image(repository_short_name)

Set up an endpoint and enable data capture

As we mentioned before, our model was trained using XGBoost, so we use XGBoostModel from the SageMaker SDK to deploy it. Because we have a custom input parser, which includes imputation and one-hot encoding, we provide the inference entry point along with the source directory, which includes the scikit-learn ColumnTransformer model. The following code is the custom inference function:

import pathlib
import pickle
from io import StringIO

import pandas as pd
# Encoder utilities provided by the SageMaker XGBoost inference container
from sagemaker_xgboost_container import encoder as xgb_encoders

script_path = pathlib.Path(__file__).parent.absolute()
with open(f'{script_path}/preprocess.pkl', 'rb') as f:
    preprocess = pickle.load(f)


def input_fn(request_body, content_type):
    """
    The SageMaker XGBoost model server receives the request data body and the content type,
    and invokes the `input_fn`.

    Return a DMatrix (an object that can be passed to predict_fn).
    """

    if content_type == 'text/csv':
        # Apply the imputation and one-hot encoding steps before building the DMatrix
        df = pd.read_csv(StringIO(request_body), header=None)
        X = preprocess.transform(df)

        X_csv = StringIO()
        pd.DataFrame(X).to_csv(X_csv, header=False, index=False)
        req_transformed = X_csv.getvalue().replace('\n', '')

        return xgb_encoders.csv_to_dmatrix(req_transformed)
    else:
        raise ValueError(
            f'Content type {content_type} is not supported.'
        )

We also include the data capture configuration, specifying the S3 destination and sampling percentage, with the endpoint deployment (see the following code). Ideally, we want to capture 100% of the incoming data for drift detection, but for a high-traffic endpoint, we suggest reducing the sampling percentage so that endpoint availability isn't affected. In addition, Model Monitor might automatically reduce the sampling percentage if it senses that endpoint availability is affected.

from sagemaker.xgboost.model import XGBoostModel
from sagemaker.serializers import CSVSerializer
from sagemaker.model_monitor import DataCaptureConfig

model_url = f's3://{bucket}/{s3_model_key}'

xgb_inference_model = XGBoostModel(
    model_data=model_url,
    role=role,
    entry_point='inference.py',
    source_dir='script',
    framework_version='1.2-1',
)

data_capture_config = DataCaptureConfig(
                        enable_capture=True,
                        sampling_percentage=100,
                        destination_s3_uri=s3_capture_upload_path)

predictor = xgb_inference_model.deploy(
    initial_instance_count=1,
    instance_type='ml.c5.xlarge',
    serializer=CSVSerializer(),
    data_capture_config=data_capture_config)

After we deploy the model, we can set up a monitor schedule.

Create a monitor schedule

Typically, we create a processing job to generate baseline metrics from our training set. Then we create a monitoring schedule that starts periodic jobs to analyze incoming inference requests and generate metrics comparable to the baseline. The inference metrics are compared with the baseline metrics, and a detailed report on constraint violations and drift in data quality is generated. Using the SageMaker SDK simplifies creating the baseline metrics and scheduling the monitor, as sketched in the following code.
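For reference, that typical flow with the pre-built monitor looks roughly like the following sketch (we don't use it in this post because we bring our own container, so treat the parameter choices as illustrative):

from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

# Pre-built (Deequ-based) monitor: baseline the training data, then schedule hourly checks
default_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.c5.xlarge',
    volume_size_in_gb=10,
    max_runtime_in_seconds=600,
)

# Generate baseline statistics and constraints from the training data
default_monitor.suggest_baseline(
    baseline_dataset=f's3://{bucket}/{s3_train_key}',
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=f's3://{bucket}/{prefix}/baseline',
    wait=True,
)

# Schedule hourly monitoring jobs against the deployed endpoint
default_monitor.create_monitoring_schedule(
    monitor_schedule_name=f'{predictor.endpoint_name}-default',
    endpoint_input=predictor.endpoint_name,
    output_s3_uri=s3_report_path,
    statistics=default_monitor.baseline_statistics(),
    constraints=default_monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)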

For the algorithms we use in this post (such as the Wasserstein distance and the Kolmogorov–Smirnov test), the container that we build needs access to both the training dataset and the inference data for computing metrics. This non-typical setup requires configuring the monitoring schedule at a lower level. The following code is the Boto3 request to set up the monitoring schedule. The key fields are the container image URI and the arguments passed to the container (we built the container in an earlier section). For now, note that we pass the S3 URL of the training dataset:

s3_train_path = f's3://{bucket}/{s3_train_key}'
s3_test_path = f's3://{bucket}/{s3_test_key}'
s3_result_path = f's3://{bucket}/{prefix}/result/{predictor.endpoint_name}'

sm_client.create_monitoring_schedule(
    MonitoringScheduleName=predictor.endpoint_name,
    MonitoringScheduleConfig={
        'ScheduleConfig': {
            'ScheduleExpression': 'cron(0 * ? * * *)'
        },
        'MonitoringJobDefinition': {
            'MonitoringInputs': [
                {
                    'EndpointInput': {
                        'EndpointName': predictor.endpoint_name,
                        'LocalPath': '/opt/ml/processing/endpointdata'
                    }
                },
            ],
            'MonitoringOutputConfig': {
                'MonitoringOutputs': [
                    {
                        'S3Output': {
                            'S3Uri': s3_result_path,
                            'LocalPath': '/opt/ml/processing/resultdata',
                            'S3UploadMode': 'EndOfJob'
                        }
                    },
                ]
            },
            'MonitoringResources': {
                'ClusterConfig': {
                    'InstanceCount': 1,
                    'InstanceType': 'ml.c5.xlarge',
                    'VolumeSizeInGB': 10
                }
            },
            'MonitoringAppSpecification': {
                'ImageUri': image_name,
                'ContainerArguments': [
                    '--train_s3_uri',
                    s3_train_path,
                    '--test_s3_uri',
                    s3_test_path,
                    '--target_label',
                    'income'
                ]
            },
            'StoppingCondition': {
                'MaxRuntimeInSeconds': 600
            },
            'RoleArn': role
        }
    }
)

Generate requests for the endpoint

After we deploy the model, we send traffic to the endpoint at a constant rate. The following code launches a thread that generates requests for about 10 hours. Make sure to stop the kernel if you want to stop the traffic. We let the traffic generator run for a few hours to collect enough samples to visualize drift. The plots automatically refresh whenever there is new data.

from threading import Thread
from time import sleep, time


def invoke_endpoint(ep_name, file_name, runtime_client):
    with open(file_name) as f:
        count = len(f.read().split('\n')) - 2  # Remove EOF and header

    # Calculate the sleep time between inference calls needed to keep a constant rate for 10 hours
    ten_hours_in_sec = 10 * 60 * 60
    sleep_time = ten_hours_in_sec / count

    with open(file_name, 'r') as f:
        next(f)  # Skip header

        for ind, row in enumerate(f):
            start_time = time()
            payload = row.rstrip('\n')
            response = runtime_client(data=payload)

            # Print every 15 minutes (900 seconds)
            if (ind + 1) % int(count / ten_hours_in_sec * 900) == 0:
                print(f'Finished sending {ind + 1} records.')

            # Sleep to ensure a constant rate. Time spent on inference is subtracted
            sleep(max(sleep_time - (time() - start_time), 0))

    print("Done!")

print(f"Sending test traffic to the endpoint {predictor.endpoint_name}.\nPlease wait...")

thread = Thread(target=invoke_endpoint, args=(predictor.endpoint_name, 'data/infer.csv', predictor.predict))
thread.start()

Visualize the data drift

We use the entire dataset for training and synthetically generate a new dataset for inference by modifying statistical properties of some of the attributes from the original dataset as described in our GitHub repo.

Normalized drift score per feature

A normalized drift score (in %) is computed for each feature by measuring how far the distribution of the incoming inference data deviates from the distribution of the original training data. For categorical data, this is computed by summing the absolute differences between the label probabilities of the training and inference data. For numerical data, the training data is split into 10 bins (deciles), and the absolute differences between the per-bin probabilities are summed to calculate the drift score.
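The following is a hedged sketch of this computation as we interpret it; the implementation in the GitHub repo may differ in details such as bin edges and normalization:

import numpy as np
import pandas as pd


def drift_score_categorical(train_col, infer_col):
    """Sum of absolute differences between label probabilities, expressed in %."""
    p_train = train_col.value_counts(normalize=True)
    p_infer = infer_col.value_counts(normalize=True)
    labels = p_train.index.union(p_infer.index)
    return 100 * (p_train.reindex(labels, fill_value=0)
                  - p_infer.reindex(labels, fill_value=0)).abs().sum()


def drift_score_numerical(train_col, infer_col, n_bins=10):
    """Split the training data into deciles and compare per-bin probabilities, in %."""
    edges = np.unique(np.quantile(train_col, np.linspace(0, 1, n_bins + 1)))  # drop duplicate edges
    edges[0], edges[-1] = -np.inf, np.inf  # catch inference values outside the training range
    p_train = pd.cut(train_col, edges).value_counts(normalize=True).sort_index()
    p_infer = pd.cut(infer_col, edges).value_counts(normalize=True).sort_index()
    return 100 * (p_train - p_infer).abs().sum()

For example, drift_score_numerical(train_df['hours-per-week'], captured_df['hours-per-week']) would produce the kind of per-interval score plotted below, where train_df and captured_df are hypothetical DataFrames holding the training data and the captured inference data for one interval.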

The following plot shows the drift scores over the time intervals. A few features show little to no drift, some features show drift that increases over time, and hours-per-week shows a large drift right from the start.

Projected drift in model accuracy

This metric provides a proxy for the accuracy of the model as the inference data drifts away from the test data. The idea behind this approach is to determine what percentage of the inference data is similar to the portion of the test data on which the model predicts well. For this metric, we use Isolation Forests to train a one-class classifier on the portion of the test data that the model predicted well, and then generate scores for that test data and for the inference data. The relative difference between the means of these scores is the projected drift in accuracy.
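A rough sketch of this idea follows (our interpretation; the exact scoring and normalization in the repo's implementation may differ):

from sklearn.ensemble import IsolationForest


def projected_accuracy_drift(test_X_correct, inference_X):
    """Compare one-class scores of inference data against the well-predicted test data."""
    # Train the one-class model only on test rows the deployed model predicted correctly
    clf = IsolationForest(n_estimators=100, random_state=42).fit(test_X_correct)

    # score_samples: higher values mean "more similar" to the well-predicted test data
    test_scores = clf.score_samples(test_X_correct)
    infer_scores = clf.score_samples(inference_X)

    # Relative difference in mean scores, used as a proxy for accuracy degradation
    return (test_scores.mean() - infer_scores.mean()) / abs(test_scores.mean())

Here, test_X_correct stands for the preprocessed test rows on which the deployed model's predictions matched the labels, and inference_X for the captured and preprocessed inference rows; both names are hypothetical.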

The following plot shows the corresponding degradation in model accuracy due to drift in the incoming data, demonstrating that covariate shift can reduce the accuracy of a model. A few spikes in the metric can be attributed to the statistical nature of how the data is generated. This is only a proxy for model accuracy, not the actual accuracy, which can only be known from the ground truth labels.

P-values of null test hypothesis

A null hypothesis test checks whether two samples (for this post, the training and inference datasets) are drawn from the same population. The p-value gives the probability of obtaining test results at least as extreme as the observations under the assumption that the null hypothesis is correct. A p-value threshold of 5% is used to decide whether the observed sample has drifted from the training data.

The following plot shows p-values of the attributes that have crossed the threshold at least once. The y-axis shows the inverse log of the p-values to distinguish very small values. P-values for numerical attributes are obtained from the Kolmogorov–Smirnov test, and for categorical attributes from the chi-square test.
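The following sketch shows how these p-values might be computed with SciPy; the repo's implementation may preprocess or bin the data differently, and the inverse-log transform shown is our assumption for the plot:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, ks_2samp


def p_value_numerical(train_col, infer_col):
    """Two-sample Kolmogorov-Smirnov test for a numerical attribute."""
    return ks_2samp(train_col, infer_col).pvalue


def p_value_categorical(train_col, infer_col):
    """Chi-square test on the contingency table of label counts."""
    table = pd.concat([train_col.value_counts(), infer_col.value_counts()], axis=1).fillna(0)
    return chi2_contingency(table.to_numpy())[1]


def inverse_log(p_value, eps=1e-30):
    """Plotting transform so that very small p-values stand out."""
    return -np.log10(max(p_value, eps))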

Conclusion

Amazon SageMaker Model Monitor is a powerful tool for detecting data drift. In this post, we showed how to integrate your custom data drift detection algorithms into Model Monitor while benefiting from the heavy lifting that Model Monitor provides for data capture and scheduling the monitoring runs. The notebook provides detailed step-by-step instructions on how to build a custom container and attach it to a Model Monitor schedule. Give Model Monitor a try and leave your feedback in the comments.


About the Author

Vinay Hanumaiah is a Senior Deep Learning Architect at Amazon ML Solutions Lab, where he helps customers build AI and ML solutions to accelerate their business challenges. Prior to this, he contributed to the launch of AWS DeepLens and Amazon Personalize. In his spare time, he enjoys time with his family and is an avid rock climber.

Read More

Detect defects and augment predictions using Amazon Lookout for Vision and Amazon A2I

With machine learning (ML), more powerful technologies have become available that can automate the task of detecting visual anomalies in a product. However, implementing such ML solutions is time-consuming and expensive because it involves managing and setting up complex infrastructure and having the right ML skills. Furthermore, ML applications need human oversight to ensure accuracy with anomaly detection, help provide continuous improvements, and retrain models with updated predictions. However, you're often forced to choose between an ML-only or human-only system. Companies are looking for the best of both worlds: integrating ML systems into their workflow while keeping a human eye on the results to achieve higher precision.

In this post, we show how you can easily set up Amazon Lookout for Vision to train a visual anomaly detection model using a printed circuit board dataset, use a human-in-the-loop workflow to review the predictions with Amazon Augmented AI (Amazon A2I), augment the dataset to incorporate human input, and retrain the model.

Solution overview

Lookout for Vision is an ML service that helps spot product defects using computer vision to automate the quality inspection process in your manufacturing lines, with no ML expertise required. You can get started with as few as 30 product images (20 normal, 10 anomalous) to train your unique ML model. Lookout for Vision uses your unique ML model to analyze your product images in near-real time and detect product defects, allowing your plant personnel to diagnose and take corrective actions.

Amazon A2I is an ML service that makes it easy to build the workflows required for human review. Amazon A2I brings human review to all developers, removing the undifferentiated heavy lifting associated with building human review systems or managing large numbers of human reviewers, whether running on AWS or not.

To get started with Lookout for Vision, we create a project, create a dataset, train a model, and run inference on test images. After going through these steps, we show you how you can quickly set up a human review process using Amazon A2I and retrain your model with augmented or human reviewed datasets. We also provide an accompanying Jupyter notebook.

Architecture overview

The following diagram illustrates the solution architecture.

 

The solution has the following workflow:

  1. Upload data from the source to Amazon Simple Storage Service (Amazon S3).
  2. Run Lookout for Vision to process data from the Amazon S3 path.
  3. Store inference results in Amazon S3 for downstream review.
  4. Use Lookout for Vision to determine if an input image is damaged and validate that the confidence level is above 70%. If below 70%, we start a human loop for a worker to manually determine whether an image is damaged.
  5. A private workforce investigates and validates the detected damages and provides feedback.
  6. Update the training data with corresponding feedback for subsequent model retraining.
  7. Repeat the retraining cycle for continuous model retraining.

Prerequisites

Before you get started, complete the following steps to set up the Jupyter notebook:

  1. Create a notebook instance in Amazon SageMaker.
  2. When the notebook is active, choose Open Jupyter.
  3. On the Jupyter dashboard, choose New, and choose Terminal.
  4. In the terminal, enter the following code:
cd SageMaker
git clone https://github.com/aws-samples/amazon-lookout-for-vision.git
  5. Open the notebook for this post: Amazon-Lookout-for-Vision-and-Amazon-A2I-Integration.ipynb.

You’re now ready to run the notebook cells.

  6. Run the setup environment step to set up the necessary Python SDKs and variables:
!pip install lookoutvision
!pip install simplejson

In the first step, you need to define the following:

  • region – The Region where your project is located
  • project_name – The name of your Lookout for Vision project
  • bucket – The name of the Amazon S3 bucket where we output the model results
  • model_version – Your model version (the default setting is 1)
# Set the AWS region
region = '<AWS REGION>'

# Set your project name here
project_name = '<CHANGE TO AMAZON LOOKOUT FOR VISION PROJECT NAME>'

# Provide the name of the S3 bucket where we will output results and store images
bucket = '<S3 BUCKET NAME>'

# This will default to a value of 1; Since we're training a new model, leave this set to a value of 1
model_version = '1'
  7. Create the S3 bucket to store images:
!aws s3 mb s3://{bucket}
  8. Create a manifest file from the dataset by running the cell in the section Create a manifest file from the dataset in the notebook.

Lookout for Vision uses this manifest file to determine the location of the files, as well as the labels associated with the files.

Upload circuit board images to Amazon S3

To train a Lookout for Vision model, we need to copy the sample dataset from our local Jupyter notebook over to Amazon S3:

# Upload images to S3 bucket:
!aws s3 cp circuitboard/train/normal s3://{bucket}/{project_name}/training/normal --recursive
!aws s3 cp circuitboard/train/anomaly s3://{bucket}/{project_name}/training/anomaly --recursive

!aws s3 cp circuitboard/test/normal s3://{bucket}/{project_name}/validation/normal --recursive
!aws s3 cp circuitboard/test/anomaly s3://{bucket}/{project_name}/validation/anomaly --recursive

Create a Lookout for Vision project

You have a few options for creating your Lookout for Vision project: the Lookout for Vision console, the AWS Command Line Interface (AWS CLI), or an SDK. We chose the open-source Lookout for Vision Python SDK in this example, but highly recommend you check out the console method as well.

The steps we take with the SDK are:

  1. Create a project (the name was defined at the beginning) and tell your project where to find your training dataset. This is done via the manifest file for training.
  2. Tell your project where to find your test dataset. This is done via the manifest file for test.

This second step is optional. In general, all test-related code is optional; Lookout for Vision also works with just a training dataset. We use both because evaluating against a test dataset is a best practice when training ML models.

Create a manifest file from the dataset

Lookout for Vision uses this manifest file to determine the location of the files, as well as the labels associated with the files. See the following code:

#Create the manifest file

from lookoutvision.manifest import Manifest
mft = Manifest(
    bucket=bucket,
    s3_path="{}/".format(project_name),
    datasets=["training", "validation"])
mft_resp = mft.push_manifests()
print(mft_resp)
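Under the hood, each manifest is a JSON Lines file in which every line describes one image in the SageMaker Ground Truth image-classification format. The following is an illustrative sketch of how one training entry for an anomalous image could be built; the file name, job name, and date are placeholders, and the exact fields are best taken from the Lookout for Vision documentation:

import json
from datetime import datetime

# Illustrative only: one manifest line for an anomalous training image (placeholder values)
entry = {
    "source-ref": f"s3://{bucket}/{project_name}/training/anomaly/image_01.jpg",
    "anomaly-label": 1,
    "anomaly-label-metadata": {
        "confidence": 1,
        "job-name": "labeling-job/anomaly-classification",
        "class-name": "anomaly",
        "human-annotated": "yes",
        "creation-date": datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S.%f'),
        "type": "groundtruth/image-classification",
    },
}
print(json.dumps(entry))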

Create a Lookout for Vision project

The following command creates a Lookout for Vision project:

# Create an Amazon Lookout for Vision Project

from lookoutvision.lookoutvision import LookoutForVision
l4v = LookoutForVision(project_name=project_name)
# If project does not exist: create it
p = l4v.create_project()
print(p)
print('Done!')

Create and train a model

In this section, we walk through the steps of creating the training and test datasets, training the model, and hosting the model.

Create the training and test datasets from images in Amazon S3

After we create the Lookout for Vision project, we create the project dataset by using the sample images we uploaded to Amazon S3 along with the manifest files. See the following code:

dsets = l4v.create_datasets(mft_resp, wait=True)
print(dsets)
print('Done!')

Train the model

After we create the Lookout for Vision project and the datasets, we can train our first model:

l4v.fit(
    output_bucket=bucket,
    model_prefix="mymodel_",
    wait=True)

When training is complete, we can view available model metrics:

from lookoutvision.metrics import Metrics

met = Metrics(project_name=project_name)
met.describe_model(model_version=model_version)

You should see an output similar to the following.

Metric      Model Version   Status    Status Message                      Model Performance
F1 Score    1               TRAINED   Training completed successfully.    0.93023
Precision   1               TRAINED   Training completed successfully.    0.86957
Recall      1               TRAINED   Training completed successfully.    1

Host the model

Before we can use our newly trained Lookout for Vision model, we need to host it:

l4v.deploy(
    model_version=model_version,
    wait=True)

Set up Amazon A2I to review predictions from Lookout for Vision

In this section, you set up a human review loop in Amazon A2I to review inferences that are below the confidence threshold. You must first create a private workforce and create a human task UI.

Create a workforce

You need to create a workforce via the SageMaker console. Note the ARN of the workforce and enter its value in the notebook cell:

WORKTEAM_ARN = 'your workforce team ARN'

The following screenshot shows the details of a private team named lfv-a2i and its corresponding ARN.

Create a human task UI

You now create a human task UI resource: a UI template in liquid HTML. This HTML page is rendered to the human workers whenever a human loop is required. For over 70 pre-built UIs, see the amazon-a2i-sample-task-uis GitHub repo.

Follow the steps provided in the notebook section Create a human task UI to create the web form, initialize Amazon A2I APIs, and inspect output:

...
def create_task_ui():
    '''
    Creates a Human Task UI resource.

    Returns:
    struct: HumanTaskUiArn
    '''
    response = sagemaker_client.create_human_task_ui(
        HumanTaskUiName=taskUIName,
        UiTemplate={'Content': template})
    return response
...

Create a human task workflow

Workflow definitions allow you to specify the following:

  • The worker template or human task UI you created in the previous step.
  • The workforce that your tasks are sent to. For this post, it’s the private workforce you created in the prerequisite steps.
  • The instructions that your workforce receives.

This post uses the Create Flow Definition API to create a workflow definition. The results of human review are stored in an Amazon S3 bucket, which can be accessed by the client application. Run the cell Create a Human task Workflow in the notebook and inspect the output:

create_workflow_definition_response = sagemaker_client.create_flow_definition(
        FlowDefinitionName = flowDefinitionName,
        RoleArn = role,
        HumanLoopConfig = {
            "WorkteamArn": workteam_arn,
            "HumanTaskUiArn": humanTaskUiArn,
            "TaskCount": 1,
            "TaskDescription": "Select if the component is damaged or not.",
            "TaskTitle": "Verify if the component is damaged or not"
        },
        OutputConfig={
            "S3OutputPath" : a2i_results
        }
    )
flowDefinitionArn = create_workflow_definition_response['FlowDefinitionArn'] 
# let's save this ARN for future use

Make predictions and start a human loop based on the confidence level threshold

In this section, we loop through an array of new images and use the Lookout for Vision SDK to determine if our input images are damaged or not, and if they’re above or below a defined threshold. For this post, we set the threshold confidence level at .70. If our result is below .70, we start a human loop for a worker to manually determine if our image is normal or an anomaly. See the following code:

...

SCORE_THRESHOLD = .70

for fname in Incoming_Images_Array:
    #Lookout for Vision inference using detect_anomalies
    fname_full_path = (Incoming_Images_Dir + "/" + fname)
    with open(fname_full_path, "rb") as image:
        modelresponse = L4Vclient.detect_anomalies(
            ProjectName=project_name,
            ContentType="image/jpeg",  # or image/png for png format input image.
            Body=image.read(),
            ModelVersion=model_version,
            )
        modelresponseconfidence = (modelresponse["DetectAnomalyResult"]["Confidence"])

    if (modelresponseconfidence < SCORE_THRESHOLD):
...
        # start an a2i human review loop with an input
        start_loop_response = a2i.start_human_loop(
            HumanLoopName=humanLoopName,
            FlowDefinitionArn=flowDefinitionArn,
            HumanLoopInput={
                "InputContent": json.dumps(inputContent)
            }
        )
... 

You should get the output shown in the following screenshot.

Complete your review and check the human loop status

If inference results are below the defined threshold, a human loop is created. We can review the status of those jobs and wait for results:

...
completed_human_loops = []
for human_loop_name in human_loops_started:
    resp = a2i.describe_human_loop(HumanLoopName=human_loop_name)
    print(f'HumanLoop Name: {human_loop_name}')
    print(f'HumanLoop Status: {resp["HumanLoopStatus"]}')
    print(f'HumanLoop Output Destination: {resp["HumanLoopOutput"]}')
    print('\n')
    
    if resp["HumanLoopStatus"] == "Completed":
        completed_human_loops.append(resp)
workteamName = workteam_arn[workteam_arn.rfind('/') + 1:]
print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
print('https://' + sagemaker_client.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])

Work team members see the following UI, where they choose the correct label for the image.

View results of the Amazon A2I workflow and move objects to the correct folder for retraining

After the work team members have completed the human loop tasks, let’s use the results of the tasks to sort our images into the correct folders for training a new model. See the following code:

...

    # move the image to the appropriate training folder
    if (labelanswer == "Normal"):
        # move the object to the normal training folder: s3://{bucket}/{project_name}/train/normal/
        !aws s3 cp {taskObjectResponse} s3://{bucket}/{project_name}/train/normal/
    else:
        # move object to the Anomaly training folder
        !aws s3 cp {taskObjectResponse} s3://{bucket}/{project_name}/train/anomaly/
...

Retrain your model based on augmented datasets from Amazon A2I

Training a new model version can be triggered as a batch job on a schedule, manually as needed, or based on how many new images have been added to the training folders. For this example, we use the Lookout for Vision SDK to retrain our model using the images that we've now included in our modified dataset. See the accompanying Jupyter notebook, downloadable from [GitHub-LINK], for the complete walkthrough.
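If you want to trigger retraining based on how many reviewed images have accumulated, a minimal sketch might look like the following (the prefix layout matches the copy commands above; the threshold and helper function are assumptions for illustration):

import boto3

s3 = boto3.client('s3')

def count_objects(bucket_name, prefix):
    """Count objects under an S3 prefix (handles pagination)."""
    paginator = s3.get_paginator('list_objects_v2')
    return sum(page.get('KeyCount', 0) for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix))

reviewed_images = (
    count_objects(bucket, f'{project_name}/train/normal/')
    + count_objects(bucket, f'{project_name}/train/anomaly/')
)

RETRAIN_THRESHOLD = 25  # assumption: retrain once at least 25 reviewed images are available
print(f'{reviewed_images} reviewed images available; retrain: {reviewed_images >= RETRAIN_THRESHOLD}')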

# Train the model!
l4v.fit(
    output_bucket=bucket,
    model_prefix="mymodelprefix_",
    wait=True)

You should see an output similar to the following.

Now that we’ve trained a new model using newly added images, let’s check the model metrics! We show the results from the first model and the second model at the same time:

# All models of the same project
met.describe_models()

You should see an output similar to the following. The table shows two models: a hosted model (ModelVersion:1) and the retrained model (ModelVersion:2). The performance of the retrained model is better with the human reviewed and labeled images.

Metric      Model Version   Status    Status Message                      Model Performance
F1 Score    2               TRAINED   Training completed successfully.    0.98
Precision   2               TRAINED   Training completed successfully.    0.96
Recall      2               TRAINED   Training completed successfully.    1
F1 Score    1               HOSTED    The model is running.               0.93023
Precision   1               HOSTED    The model is running.               0.86957
Recall      1               HOSTED    The model is running.               1

Clean up

Run the Stop the model and cleanup resources cell to clean up the resources that were created. Delete any Lookout for Vision projects you’re no longer using, and remove objects from Amazon S3 to save costs. See the following code:

#If you are not using the model, stop to save costs! This can take up to 5 minutes.

#change the model version to whichever model you're using within your current project
model_version='1'
l4v.stop_model(model_version=model_version)

Conclusion

This post demonstrated how you can use Lookout for Vision and Amazon A2I to train models to detect defects in objects unique to your business and define conditions to send the predictions to a human workflow with labelers to review and update the results. You can use the human labeled output to augment the training dataset for retraining to improve the model accuracy.

Start your journey towards industrial anomaly detection and identification by visiting the Lookout for Vision Developer Guide and the Amazon A2I Developer Guide.


About the Authors

Dennis Thurmon is a Solutions Architect at AWS, with a passion for Artificial Intelligence and Machine Learning. Based in Seattle, Washington, Dennis worked as a Systems Development Engineer on the Amazon Go and Amazon Books team before focusing on helping AWS customers bring their workloads to life in the AWS Cloud.

Amit Gupta is an AI Services Solutions Architect at AWS. He is passionate about enabling customers with well-architected machine learning solutions at scale.

Neel Sendas is a Senior Technical Account Manager at Amazon Web Services. Neel works with enterprise customers to design, deploy, and scale cloud applications to achieve their business goals. He has worked on various ML use cases, ranging from anomaly detection to predictive product quality for manufacturing and logistics optimization. When he is not helping customers, he dabbles in golf and salsa dancing.

Read More