New tool can spot problems — such as overfitting and vanishing gradients — that prevent machine learning models from learning.
How a paper by three Oxford academics influenced AWS bias and explainability software
Why conditional demographic disparity matters for developers using SageMaker Clarify.
Alexa: The science must go on
Throughout the pandemic, the Alexa team has continued to invent on behalf of our customers.
How Digitata provides intelligent pricing on mobile data and voice with Amazon Lookout for Metrics
This is a guest post by Nico Kruger (CTO of Digitata) and Chris King (Sr. ML Specialist SA at AWS). In their own words, “Digitata intelligently transforms pricing and subscriber engagement for mobile operators, empowering operators to make better and more informed decisions to meet and exceed business objectives.”
As William Gibson said, “The future is here. It’s just not evenly distributed yet.” This is incredibly true in many emerging markets for connectivity. Users often pay 100 times more for data, voice, and SMS services than their counterparts in Europe or the US. Digitata aims to better democratize access to telecommunications services through dynamic pricing solutions for mobile network operators (MNOs) and to help deliver optimal returns on their large capital investments in infrastructure.
Our pricing models are classically based on supply and demand. We use machine learning (ML) algorithms to optimize two main variables: utilization (of infrastructure), and revenue (fees for telco services). For example, at 3:00 AM when a tower is idle, it’s better to charge a low rate for data than waste this fixed capacity and have no consumers. Comparatively, for a very busy tower, it’s prudent to raise the prices at certain times to reduce congestion, thereby reducing the number of dropped calls or sluggish downloads for customers.
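To make this concrete, the following is a deliberately simplified sketch of utilization-based price adjustment. The thresholds and multipliers are invented for illustration only and are not Digitata's actual model, which also conditions on location, time, and user segment and is learned rather than hand-set:

def price_multiplier(utilization: float) -> float:
    """Map tower utilization (0.0-1.0) to a price multiplier (illustrative only)."""
    if utilization < 0.2:   # idle tower (for example, 3:00 AM): discount to attract usage
        return 0.5
    if utilization < 0.8:   # normal load: baseline price
        return 1.0
    return 1.5              # congested tower: raise the price to ease congestion

print(price_multiplier(0.1))   # 0.5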
Our models attempt to optimize utilization and revenue according to three main features, or dimensions: location, time, and user segment. Taking the traffic example further, the traffic profile over time for a tower located in a rural or suburban area is very different from a tower in a central downtown district. In general, the suburban tower is busier early in the mornings and later at night than the tower based in the central business district, which is much busier during traditional working hours.
Our customers (the MNOs) trust us to be their automated, intelligent pricing partner. As such, it’s imperative that we keep on top of any anomalous behavior patterns when it comes to their revenue or network utilization. If our model charges too little for data bundles (or even makes it free), it could lead to massive network congestion issues as well as the obvious lost revenue impact. Conversely, if we charge too much for services, it could lead to unhappy customers and loss of revenue, through the principles of supply and demand.
It’s therefore imperative that we have a robust, real-time anomaly detection system in place to alert us whenever there is anomalous behavior on revenue and utilization. It also needs to be aware of the dimensions we operate under (location, user segment, and time).
History of anomaly detection at Digitata
We have been through four phases of anomaly detection at Digitata in the last 13 years:
- Manually monitoring our KPIs in reports on a routine basis.
- Defining routine checks using static thresholds that alert if the threshold is exceeded.
- Using custom anomaly detection models to track basic KPIs over time, such as total unique customers per tower, revenue per GB, and network congestion.
- Creating complex collections of anomaly detection models to track even more KPIs over time.
Manual monitoring consumed a growing share of our staff hours and was the most error-prone approach, which drove the automation in Phase 2. The automated alarms with static thresholds ensured that routine checks were actually and accurately performed, but they lacked sophistication. The resulting alert fatigue pushed us to custom modeling.
Custom modeling can work well for a simple problem, but the approach for one particular problem doesn’t translate perfectly to the next problem. This leads to a number of models that must be operating in the field to provide relevant insights. The operational complexity of orchestrating these begins to scale beyond the means of our in-house developers and tooling. The cost of expansion also prohibits other teams from running experiments and identifying other opportunities for us to leverage ML-backed anomaly detection.
Additionally, although you can now detect anomalies via ML, you still need to do frequent deep-dive analysis to find other combinations of dimensions that may point to underlying anomalies. For example, when a competitor strongly targets a certain location or segment of users, the adverse impact on sales may go unnoticed unless your anomaly detection models track those dimensions deeply enough.
The problem that remains to be solved
Given our earlier problem statement, products are sold under at least the following dimensions:
- Thousands of locations (towers).
- Hundreds of products and bundles (different data bundles such as social or messaging).
- Hundreds of customer segments. Segments are clusters of users based on hundreds of attributes that are system-calculated from MNO data feeds.
- Hourly detection for each day of the week.
Traditional anomaly detection methods let us detect anomalies on a measure, such as revenue or purchase count. They don't, however, give us the dimension-level insights needed to answer questions such as:
- How is product A selling compared to product B?
- What does revenue look like at location A vs. location B?
- What do sales look like for customer segment A vs. customer segment B?
- When you start combining dimensions, what does revenue look like on product A, segment A, vs. product A, segment B; product B, segment A; and product B, segment B?
The number of dimension combinations quickly adds up. It becomes impractical to create anomaly detection models for each dimension and each combination of dimensions, and that is with only the four dimensions mentioned! What if we want to quickly add two or three more dimensions to our anomaly detection systems? Even with existing off-the-shelf tools, creating additional anomaly models requires time and resources, to say nothing of the weeks to months needed to build such a system in-house.
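To make the combinatorics concrete (with purely illustrative numbers): 1,000 towers × 100 products × 100 customer segments × 168 hourly slots per week is roughly 1.7 billion dimension combinations, far too many to cover with one hand-built model per combination.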
That is when we looked for a purpose-built tool to do exactly this, such as the dimension-aware managed anomaly detection service, Amazon Lookout for Metrics.
Amazon Lookout for Metrics
Amazon Lookout for Metrics uses ML to automatically detect and diagnose anomalies (outliers from the norm) in business and operational time series data, such as a sudden dip in sales revenue or customer acquisition rates.
In a couple of clicks, you can connect Amazon Lookout for Metrics to popular data stores like Amazon Simple Storage Service (Amazon S3), Amazon Redshift, and Amazon Relational Database Service (Amazon RDS), as well as third-party SaaS applications, such as Salesforce, ServiceNow, Zendesk, and Marketo, and start monitoring metrics that are important to your business.
Amazon Lookout for Metrics automatically inspects and prepares the data from these sources and builds a custom ML model—informed by over 20 years of experience at Amazon—to detect anomalies with greater speed and accuracy than traditional methods used for anomaly detection. You can also provide feedback on detected anomalies to tune the results and improve accuracy over time. Amazon Lookout for Metrics makes it easy to diagnose detected anomalies by grouping together anomalies that are related to the same event and sending an alert that includes a summary of the potential root cause. It also ranks anomalies in order of severity so that you can prioritize your attention to what matters the most to your business.
How we used Amazon Lookout for Metrics
Inside Amazon Lookout for Metrics, you need to describe your data in terms of measures and dimensions. Measures are variables or key performance indicators on which you want to detect anomalies, and dimensions are metadata that represent categorical information about the measures.
To detect outliers, Amazon Lookout for Metrics builds an ML model that is trained with your source data. This model, called a detector, is automatically trained with the ML algorithm that best fits your data and use case. You can either provide your historical data for training, if you have any, or get started with real-time data, and Amazon Lookout for Metrics learns as it goes.
We used Amazon Lookout for Metrics to take over anomaly detection on two of our most important datasets: bundle revenue and voice revenue.
For bundle revenue, we track the following measures:
- Total revenue from sales
- Total number of sales
- Total number of sales to distinct users
- Average price at which the product was bought
Additionally, we track the following dimensions:
- Location (tower)
- Product
- Customer segment
For voice revenue, we track the following measures:
- Total calls made
- Total revenue from calls
- Total distinct users that made a call
- Average price at which a call was made
Additionally, we track the following dimensions:
- Location (tower)
- Type of call (international, on-net, roaming, off-net)
- Whether the user received a discount or not
- Customer spend
This allows us to have coverage on these two datasets, using only two anomaly detection models with Amazon Lookout for Metrics.
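As an illustration of how a dataset like bundle revenue maps onto the service's concepts, the following sketch creates a detector and a metric set with boto3. It is a minimal sketch, not our production setup: the S3 path template, role ARN, and exact source configuration fields are placeholder assumptions, so check the Lookout for Metrics API reference before reusing it.

import boto3

lookout = boto3.client("lookoutmetrics")

# Hourly detector for the bundle revenue dataset (names and ARNs are placeholders).
detector = lookout.create_anomaly_detector(
    AnomalyDetectorName="bundle-revenue-detector",
    AnomalyDetectorConfig={"AnomalyDetectorFrequency": "PT1H"},
)

# Measures become MetricList entries; dimensions become DimensionList entries.
lookout.create_metric_set(
    AnomalyDetectorArn=detector["AnomalyDetectorArn"],
    MetricSetName="bundle-revenue",
    MetricList=[
        {"MetricName": "total_revenue", "AggregationFunction": "SUM"},
        {"MetricName": "sales_count", "AggregationFunction": "SUM"},
    ],
    DimensionList=["location", "product", "customer_segment"],
    MetricSetFrequency="PT1H",
    MetricSource={
        "S3SourceConfig": {
            "RoleArn": "arn:aws:iam::123456789012:role/lookout-metrics-role",
            "TemplatedPathList": ["s3://my-bucket/bundles/{{yyyyMMdd}}/{{HH}}/"],
            "FileFormatDescriptor": {"CsvFormatDescriptor": {"FileCompression": "NONE"}},
        }
    },
)

lookout.activate_anomaly_detector(AnomalyDetectorArn=detector["AnomalyDetectorArn"])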
Architecture overview
Apache Nifi is an open-source data flow tool that we use for ETL tasks, both on premises and in AWS. We use it as a main flow engine for parsing, processing, and updating data we receive from the mobile network. This data includes call records, data usage records, airtime recharges, and network tower utilization and congestion information. This data is fed into our ML models to calculate the price on a product, location, time, and segment basis.
The following diagram illustrates our architecture.
Because of the reality of the MNO industry (at the moment), it’s not always possible for us to leverage AWS for all of our deployments. Therefore, we have a mix of fully on-premises, hybrid, and fully native cloud deployments.
We use a setup where we leverage Apache Nifi, connected from AWS over VPC and VPN connections, to pull anonymized data on an event-based basis from all of our deployments (regardless of type) simultaneously. The data is then stored in Amazon S3 and in Amazon CloudWatch, from where we can use services such as Amazon Lookout for Metrics.
Results from our experiments
While getting to know Amazon Lookout for Metrics, we primarily focused on the backtesting functionality within the service. This feature allows you to supply historical data, have Amazon Lookout for Metrics train on a large portion of your early historical data, and then identify anomalies in the remaining, more recent data.
We quickly discovered that this has massive potential, not only to start learning the service, but also to gather insights as to what other opportunities reside within your data, which you may have never thought of, or always expected to be there but never had the time to investigate.
For example, we quickly found a very interesting case with one of our customers. We were tracking voice revenue as the measure, under the dimensions of call type (on-net, off-net, roaming), region (a high-level concept of an area, such as a province or big city), and timeband (after hours, business hours, weekends).
Amazon Lookout for Metrics identified an anomaly on international calls in a certain region, as shown in the following graph.
We quickly went to our source data, and saw the following visualization.
This graph shows total daily revenue for international calls. As you can see, at the global level there is no visible impact of the sort that Amazon Lookout for Metrics identified.
But when looking at the specific region that was identified, you see the following anomaly.
A clear spike in international calls took place on this day, in this region. We looked deeper into it and found that the specific city identified by this region is known as a tourist and conference destination. This raises the question: is there any business value to be found in an insight such as this? Can we react to anomalies like these in real time by using Amazon Lookout for Metrics and then providing specific pricing specials on international calls in the region, in order to take advantage of the influx of demand? The answer is yes, and we are! With stakeholders alerted to these notifications for future events and with exploratory efforts into our recent history, we're prepared for future issues and are becoming more aware of operational gaps in our past.
In addition to the exploration using the backtesting feature (which is still ongoing as of this writing), we also set up real-time detectors to work in parallel with our existing anomaly detection service.
Within two days, we found our first real operational issue, as shown in the following graph.
The graph shows revenue attributed to voice calls for another customer. In this case, we had a clear spike in our catchall NO LOCATION LOOKUP region. We map revenue from towers to regions (such as city, province, or state) using a mapping table that we periodically refresh from within the MNO network, or that we receive from the MNO directly. When a tower isn't mapped correctly by this table, its revenue shows up under this catchall region in our data. In this case, there was a problem with the mapping feed from our customer.
The effect was that the number of towers that could not be classified was slowly growing. This could affect our pricing models, which could become less accurate at factoring in location when generating the optimal price.
A very important operational anomaly to detect early!
Digitata in the future
We’re constantly evolving our ML and analytics capabilities, with the end goal of making connectivity more affordable for the entire globe. As we continue on this journey, we look to services such as Amazon Lookout for Metrics to help us ensure the quality of our services, find operational issues, and identify opportunities. It has made a dramatic difference in our anomaly detection capabilities, and has pointed us to some previously undiscovered opportunities. This all allows us to work on what really matters: getting everyone connected to the wonder of the internet at affordable prices!
Getting started
Amazon Lookout for Metrics is now available in preview in US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland). Request preview access to get started today!
You can interact with the service using the AWS Management Console, the AWS SDKs, and the AWS Command Line Interface (AWS CLI). For more information, see the Amazon Lookout for Metrics Developer Guide.
About the Authors
Nico Kruger is the CTO of Digitata and is a fan of programming computers, reading things, listening to music, and playing games. Nico has 10+ years of experience in telco. In his own words: “From C++ to JavaScript, AWS to on-prem, as long as the tool is fit for the job, it works and the customer is happy; it’s all good. Automate all the things, plan for failure and be adaptable, and everybody wins.”
Chris King is a Senior Solutions Architect in Applied AI with AWS. He has a special interest in launching AI services and helped grow and build Amazon Personalize and Amazon Forecast before focusing on Amazon Lookout for Metrics. In his spare time, he enjoys cooking, reading, boxing, and building models to predict the outcome of combat sports.
Rust detection using machine learning on AWS
Visual inspection of industrial environments is a common requirement across heavy industries, such as transportation, construction, and shipbuilding, and typically requires qualified experts to perform the inspection. Inspection locations can often be remote or in adverse environments that put humans at risk, such as bridges, skyscrapers, and offshore oil rigs.
Many of these industries deal with huge metal surfaces and harsh environments. A common problem across these industries is metal corrosion and rust. Although corrosion and rust are used interchangeably across different industries (we also use the terms interchangeably in this post), these two phenomena are different. For more details about the differences between corrosion and rust as well as different degrees of such damages, see Difference Between Rust and Corrosion and Stages of Rust.
Different levels and grades of rust can also result in different colors for the damaged areas. If you have enough images of different classes of rust, you can use the techniques described in this post to detect different classes of rust and corrosion.
Rust is a serious risk for operational safety. The costs associated with inadequate protection against corrosion can be catastrophic. Conventionally, corrosion detection is done using visual inspection of structures and facilities by subject matter experts. Inspection can involve on-site direct interpretation or the collection of pictures and the offline interpretation of them to evaluate damages. Advances in the fields of computer vision and machine learning (ML) makes it possible to automate corrosion detection to reduce the costs and risks involved in performing such inspections.
In this post, we describe how to build a serverless pipeline to create ML models for corrosion detection using Amazon SageMaker and other AWS services. The result is a fully functioning app to help you detect metal corrosion.
We will use the following AWS services:
- Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale.
- AWS Lambda is a compute service that lets you run code without provisioning or managing servers. Lambda runs your code only when triggered and scales automatically, from a few requests per day to thousands per second.
- Amazon SageMaker is a fully managed service that provides developers and data scientists the tools to build, train, and deploy different types of ML models.
- AWS Step Functions allows you to coordinate several AWS services into a serverless workflow. You can design and run workflows where the output of one step acts as the input to the next step while embedding error handling into the workflow.
Solution overview
The corrosion detection solution comprises a React-based web application that lets you pick one or more images of metal corrosion to perform detection. The application also lets you train an ML model and deploy it to SageMaker hosting services to perform inference.
The following diagram shows the solution architecture.
The solution supports the following use cases:
- Performing on-demand corrosion detection
- Performing batch corrosion detection
- Training ML models using Step Functions workflows
The following are the steps for each workflow:
- On-demand corrosion detection – An image picked by the application user is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket. The image’s S3 object key is sent to an API deployed on API Gateway. The API’s Lambda function invokes a SageMaker endpoint to detect corrosion in the uploaded image, then generates and stores a new image in an S3 bucket, which is rendered in the front end for analysis (a minimal sketch of this Lambda handler follows this list).
- Batch corrosion detection – The user uploads a .zip file containing images to an S3 bucket. A Lambda function configured as an Amazon S3 trigger is invoked. The function performs batch corrosion detection by performing an inference using the SageMaker endpoint. Resulting new images are stored back in Amazon S3. These images can be viewed in the front end.
- Training the ML model – The web application allows you to train a new ML model using Step Functions and SageMaker. The following diagram shows the model training and endpoint hosting orchestration. The Step Functions workflow starts by invoking the CreateTrainingJob API supported by the Amazon States Language. After a model has been created, the CreateEndpoint API of SageMaker is invoked, which creates a new SageMaker endpoint and hosts the new ML model. A checkpoint step ensures that the endpoint is completely provisioned before ending the workflow.
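The following is a minimal sketch of what the on-demand detection Lambda handler might look like. It is illustrative only and is not the repository's actual code: the environment variable names, request payload shape, and response format are assumptions, PIL and NumPy would need to be packaged with the function (for example, as a Lambda layer), and very large images would need to be chunked to stay within the endpoint's invocation payload limit. The real application also writes a rendered result image back to Amazon S3.

import json
import os
from io import BytesIO

import boto3
import numpy as np
from PIL import Image

s3 = boto3.client("s3")
runtime = boto3.client("sagemaker-runtime")

ENDPOINT_NAME = os.environ["SAGEMAKER_ENDPOINT_NAME"]   # assumed environment variable
IMAGE_BUCKET = os.environ["IMAGE_BUCKET"]               # assumed environment variable


def handler(event, context):
    """Classify each pixel of the uploaded image as clean or corroded."""
    key = json.loads(event["body"])["s3_key"]
    obj = s3.get_object(Bucket=IMAGE_BUCKET, Key=key)["Body"].read()

    # Flatten the image into rows of R,G,B values, the format the XGBoost model expects.
    pixels = np.asarray(Image.open(BytesIO(obj)).convert("RGB")).reshape(-1, 3)
    payload = "\n".join(",".join(map(str, row)) for row in pixels)

    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",
        Body=payload,
    )
    predictions = response["Body"].read().decode("utf-8")
    labels = np.array([float(x) for x in predictions.split(",")])

    corroded_pct = 100.0 * float((labels == 1).mean())
    return {"statusCode": 200, "body": json.dumps({"corroded_percent": corroded_pct})}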
Machine learning algorithm options
Corrosion detection is conventionally done by trained professionals using visual inspection. In challenging environments such as offshore rigs, visual inspection can be very risky. Automating the inspection process using computer vision models mounted on drones is a helpful alternative. You can use different ML approaches for corrosion detection. Depending on the available data and application objectives, you could use deep learning (including object detection or semantic segmentation) or color classification, using algorithms such as Extreme Gradient Boosting (XGBoost). We discuss both approaches in this post, with an emphasis on the XGBoost method, and cover advantages and limitations of both approaches. Other methods such as unsupervised clustering might also be applicable, but aren’t discussed in this post.
Deep learning approach
In recent years, deep learning has been used for automatic corrosion detection. Depending on the data availability and the type of labeling used, you can use object detection or semantic segmentation to detect corroded areas in metal structures. Although deep learning techniques are very effective for numerous use cases, the complex nature of corrosion detection (the lack of specific shapes) sometimes makes deep learning methods less effective for detecting corroded areas.
We explain in more detail some of the challenges involved in using deep learning for this problem and propose an alternative way using a simpler ML method that doesn’t require the laborious labeling required for deep learning methods. If you have a dataset annotated using rectangular bounding boxes, you can use an object detection algorithm.
The most challenging aspect of this problem when using deep learning is that corroded parts of structures don’t have predictable shapes, which makes it difficult to train a comprehensive deep learning model using object detection or semantic segmentation. However, if you have enough annotated images, you can detect these random-looking patterns with reasonable accuracy. For instance, you can detect the corroded area in the following image (shown inside the red rectangle) using an object detection or semantic segmentation model with proper training and data.
The more challenging problem for performing corrosion detection using deep learning is the fact that the entire metal structure can often be corroded (as in the following image), and deep learning models confuse these corroded structures with the non-corroded ones because the edges and shapes of entirely corroded structures are similar to a regular healthy structure with no corrosion. This can be the case for any structure and not just limited to pipes.
Color classification approach (using the XGBoost algorithm)
Another way of looking at the corrosion detection problem is to treat it as a pixel-level color classification, which has shown promise over deep learning methods, even with small training datasets. We use a simple XGBoost method, but you can use any other classification algorithm (such as Random Forest).
The downside of this approach is that darker pixel colors in images can be mistakenly interpreted as corrosion. Lighting conditions and shadows might also affect the outcome of this approach. However, this method produced better-quality results compared to deep learning approaches because this method isn’t affected by the shape of structures or the extent of corrosion. Accuracy can be improved by using more comprehensive data.
If you require pixel-level interpretation of images, the other alternative is to use semantic segmentation, which requires significant labeling. Our proposed method offers a solution to avoid this tedious labeling.
The rest of this post focuses on using the color classification (using XGBoost) approach. We explain the steps required to prepare data for this approach and how to train such a model on SageMaker using the accompanying web application.
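Before diving into the SageMaker-based workflow, the following is a minimal local illustration of the technique using the open-source xgboost package. It trains on a DataFrame shaped like the one built later in this post (a class column plus R, G, B columns); the hyperparameters are illustrative, and this is not the solution's actual training code, which runs as a SageMaker training job:

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# data.csv has the format built later in this post: class (0=Clean, 1=Corroded), R, G, B.
df = pd.read_csv("data.csv")
X_train, X_val, y_train, y_val = train_test_split(
    df[["R", "G", "B"]], df["class"], test_size=0.2, random_state=42
)

# A small classifier over raw pixel colors; hyperparameters are illustrative.
model = xgb.XGBClassifier(max_depth=3, n_estimators=150, learning_rate=0.12)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

print("validation accuracy:", model.score(X_val, y_val))

# At inference time, every pixel of a new image is classified the same way:
# reshape the image to (n_pixels, 3) and call model.predict on it.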
Create training and validation datasets
When using XGBoost, you can create training datasets from either annotated images or manually cropped, non-annotated images. The color classification (XGBoost) algorithm requires that you extract the RGB values of each pixel that has been labeled as clean or corroded.
We created Jupyter notebooks to help you create training and validation datasets depending on whether you’re using annotated or non-annotated images.
Create training and validation datasets for annotated images
When you have annotated images of corrosion, you can programmatically crop them to create smaller images so you have just the clean or corroded parts of the image. You reshape the small cropped images into a 2D array and stack them together to build your dataset. To ensure better-quality data, the following code further crops the small images to pick only the central portion of the image.
To help you get started quickly, we created a sample training dataset (5 MB) that you can use to create training and validation datasets. You can then use these datasets to train and deploy a new ML model. We created the sample training dataset from a few public images from pexels.com.
Let’s understand the process of creating a training dataset from annotated images. We created a notebook to help you with the data creation. The following are the steps involved in creating the training and validation data.
Crop annotated images
The first step is to crop the annotated images.
- We read all annotated images and the XML files containing the annotation information (such as bounding boxes and class name). See the following code:
xml_paths = get_file_path_list(xml_path)
images_names = list(set(get_filename_list(images_path)))
- Because the input images are annotated, we extract the class names and bounding boxes for each annotated image:
for idx, x in enumerate(xml_paths):
    single_imgfile_path = images_path + '\\' + x.split('\\')[-1].split('.')[0] + '.JPG'
    image = Image.open(single_imgfile_path)
    tree = ET.parse(x)
    root = tree.getroot()
    for idx2, rt in enumerate(root.findall('object')):
        name = rt.find('name').text
        if name in classes_to_use:
            xmin = int(rt.find('bndbox').find('xmin').text)
            ymin = int(rt.find('bndbox').find('ymin').text)
            xmax = int(rt.find('bndbox').find('xmax').text)
            ymax = int(rt.find('bndbox').find('ymax').text)
- For each bounding box in an image, we zoom in to the bounding box, crop the center portion, and save that in a separate file. We cut the bounding box by 1/3 of its size from each side, therefore taking 1/9 of the area inside the bounding box (its center). See the following code:
a = (xmax-xmin)/3.0
b = (ymax-ymin)/3.0
box = [int(xmin+a), int(ymin+b), int(xmax-a), int(ymax-b)]
image1 = image.crop(box)
- Finally, we save the cropped image:
image1.save('cropped_images_small/'+name+"-"+str(count)+".png", "PNG", quality=80, optimize=True, progressive=True)
It’s recommended to do a quick visual inspection of the cropped images to make sure they only contain either clean or corroded parts.
The following code shows the implementation for cropping the images (also available in section 2 of the notebook):
import os
import shutil
import xml.etree.ElementTree as ET

from PIL import Image


def crop_images(xml_path, images_path, classes_to_use):
    # Crop objects of the classes given in "classes_to_use" from XML files,
    # where each file can contain several objects and several classes
    if os.path.isdir("cropped_images_small"):
        shutil.rmtree('cropped_images_small')
    os.mkdir('cropped_images_small')
    print("Storing cropped images in cropped_images_small folder")

    xml_paths = get_file_path_list(xml_path)
    images_names = list(set(get_filename_list(images_path)))
    count = 0
    for idx, x in enumerate(xml_paths):
        if '.DS_Store' not in x:
            single_imgfile_path = images_path + '\\' + x.split('\\')[-1].split('.')[0] + '.JPG'
            image = Image.open(single_imgfile_path)
            tree = ET.parse(x)
            root = tree.getroot()
            for idx2, rt in enumerate(root.findall('object')):
                name = rt.find('name').text
                if name in classes_to_use:
                    xmin = int(rt.find('bndbox').find('xmin').text)
                    ymin = int(rt.find('bndbox').find('ymin').text)
                    xmax = int(rt.find('bndbox').find('xmax').text)
                    ymax = int(rt.find('bndbox').find('ymax').text)
                    # Keep only the central 1/9 of the bounding box
                    a = (xmax-xmin)/3.0
                    b = (ymax-ymin)/3.0
                    box = [int(xmin+a), int(ymin+b), int(xmax-a), int(ymax-b)]
                    image1 = image.crop(box)
                    image1.save('cropped_images_small/'+name+"-"+str(count)+".png",
                                "PNG", quality=80, optimize=True, progressive=True)
                    count += 1
Create the RGB DataFrame
After cropping and saving the annotated parts, we have many small images, and each image contains only pixels belonging to one class (Clean or Corroded). The next step in preparing the data is to turn the small images into a DataFrame.
- We first define the column names for the DataFrame that contains the class (Clean or Corroded) and the RGB values for each pixel.
- We define the classes to be used (in case we want to ignore other possible classes that might be present).
- For each cropped image, we reshape the image and extract RGB information into a new DataFrame.
- Finally, we save the final DataFrame to a .csv file.
See the following code:
import numpy as np
import pandas as pd
from PIL import Image

crop_path = 'Path to your cropped images'
files = get_file_path_list(crop_path)

cols = ['class', 'R', 'G', 'B']
df = pd.DataFrame()

classes_to_use = ['Corroded', 'Clean']
dict1 = {'Clean': 0, 'Corroded': 1}
for file in files:
    lbls = Image.open(file)
    imagenp = np.asarray(lbls)
    imagenp = imagenp.reshape(imagenp.shape[1]*imagenp.shape[0], 3)
    name = file.split('\\')[-1].split('.')[0].split('-')[0]
    classname = dict1[name]
    dftemp = pd.DataFrame(imagenp)
    dftemp.columns = ['R', 'G', 'B']
    dftemp['class'] = classname
    columnsTitles = ['class', 'R', 'G', 'B']
    dftemp = dftemp.reindex(columns=columnsTitles)
    df = pd.concat([df, dftemp], axis=0)

df.columns = cols
df.to_csv('data.csv', index=False)
In the end, we have a table containing labels and RGB values.
Create training and validation sets and upload to Amazon S3
After you prepare the data, you can use the code listed under section 4 of our notebook to generate the training and validation datasets. Before running the code in this section, make sure you enter the name of an S3 bucket in the bucket variable, for storing the training and validation data.
The following lines of code in the notebook define variables for the input data file name (FILE_DATA), the training/validation ratio (for this post, we use 20% of the data for validation, which leaves 80% for training), and the names of the generated training and validation .csv files. You can use the sample training dataset as the input data file, or use the data file you generated in the previous step by assigning it to the FILE_DATA variable.
FILE_DATA = 'data.csv'
TARGET_VAR = 'class'
FILE_TRAIN = 'train.csv'
FILE_VALIDATION = 'validation.csv'
PERCENT_VALIDATION = 20
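The split itself is implemented in the notebook; the following is a minimal sketch of what that step might look like, not the notebook's exact code. The key detail is that the SageMaker built-in XGBoost algorithm expects CSV input with no header row and the label in the first column, which is why class is kept as the leading column:

import pandas as pd

df = pd.read_csv(FILE_DATA)

# Shuffle, then hold out PERCENT_VALIDATION of the rows for validation.
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
n_validation = int(len(df) * PERCENT_VALIDATION / 100)

validation_df = df.iloc[:n_validation]
train_df = df.iloc[n_validation:]

# SageMaker's built-in XGBoost expects CSV with the label first and no header row.
train_df[[TARGET_VAR, 'R', 'G', 'B']].to_csv(FILE_TRAIN, index=False, header=False)
validation_df[[TARGET_VAR, 'R', 'G', 'B']].to_csv(FILE_VALIDATION, index=False, header=False)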
Finally, you upload the training and validation data to the S3 bucket:
s3_train_loc = upload_to_s3(bucket = bucket, channel = 'train', filename = FILE_TRAIN)
s3_valid_loc = upload_to_s3(bucket = bucket, channel = 'validation', filename = FILE_VALIDATION)
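upload_to_s3 is a helper defined in the notebook. A possible implementation is sketched below purely to make the snippet self-contained; this is a hypothetical reconstruction, and the actual key prefix used by the notebook may differ:

import boto3

def upload_to_s3(bucket, channel, filename):
    """Upload a local file to s3://<bucket>/csv/<channel>/<filename> and return its S3 URI.

    Hypothetical reconstruction of the notebook helper.
    """
    key = f"csv/{channel}/{filename}"
    boto3.client("s3").upload_file(Filename=filename, Bucket=bucket, Key=key)
    return f"s3://{bucket}/{key}"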
Create a training dataset for manually cropped images
To create the training and validation datasets when using manually cropped images, name your cropped images with the prefixes Corroded and Clean to be consistent with the implementation in the provided Jupyter notebook. For example, for the Corroded class, name your image files Corroded-1.png, Corroded-2.png, and so on.
Set the paths of your images and XML files in the img_path and xml_path variables, and set the bucket name in the bucket variable. Run the code in all the sections defined in the notebook. This creates the training and validation datasets and uploads them to the S3 bucket.
Deploy the solution
Now that we have the training and validation datasets in Amazon S3, it’s time to train an XGBoost classifier using SageMaker. To do so, you can use the corrosion detection web application’s model training functionality. To help you with the web application deployment, we created AWS CloudFormation templates. Clone the source code from the GitHub repository and follow the deployment steps outlined to complete the application deployment. After you successfully deploy the application, you can explore the features it provides, such as on-demand corrosion detection, training and deploying a model, and batch features.
Train an XGBoost classifier on SageMaker
To train an XGBoost classifier, sign in to the corrosion detection web application, and on the menu, choose Model Training. Here you can train a new SageMaker model.
You need to configure parameters before starting a new training job in SageMaker. The application provides a JSON formatted parameter payload that contains information about the SageMaker training job name, Amazon Elastic Compute Cloud (Amazon EC2) instance type, the number of EC2 instances to use, the Amazon S3 location of the training and validation datasets, and XGBoost hyperparameters.
The parameter payload also lets you configure the EC2 instance type, which you can use for hosting the trained ML model using SageMaker hosting services. You can change the values of the hyperparameters, although the default values provided work. For more information about training job parameters, see CreateTrainingJob. For more information about hyperparameters, see XGBoost Hyperparameters.
See the following JSON code:
{
    "TrainingJobName": "Corrosion-Detection-7",
    "MaxRuntimeInSeconds": 20000,
    "InstanceCount": 1,
    "InstanceType": "ml.c5.2xlarge",
    "S3OutputPath": "s3://bucket/csv/output",
    "InputTrainingS3Uri": "s3://bucket/csv/train/train.csv",
    "InputValidationS3Uri": "s3://bucket/csv/validation/validation.csv",
    "HyperParameters": {
        "max_depth": "3",
        "learning_rate": "0.12",
        "eta": "0.2",
        "colsample_bytree": "0.9",
        "gamma": "0.8",
        "n_estimators": "150",
        "min_child_weight": "10",
        "num_class": "2",
        "subsample": "0.8",
        "num_round": "100",
        "objective": "multi:softmax"
    },
    "EndpointInstanceType": "ml.m5.xlarge",
    "EndpointInitialInstanceCount": 1
}
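Behind the scenes, the application's Step Functions workflow passes these values to SageMaker. As a rough illustration only (not the application's actual code), the payload fields map onto the CreateTrainingJob API roughly as follows via boto3; the XGBoost container image URI and role ARN are placeholders:

import boto3

sm = boto3.client("sagemaker")

# A subset of the payload above; the Step Functions workflow supplies the full document.
params = {
    "TrainingJobName": "Corrosion-Detection-7",
    "MaxRuntimeInSeconds": 20000,
    "InstanceCount": 1,
    "InstanceType": "ml.c5.2xlarge",
    "S3OutputPath": "s3://bucket/csv/output",
    "InputTrainingS3Uri": "s3://bucket/csv/train/train.csv",
    "InputValidationS3Uri": "s3://bucket/csv/validation/validation.csv",
    "HyperParameters": {"max_depth": "3", "num_round": "100",
                        "objective": "multi:softmax", "num_class": "2"},
}

sm.create_training_job(
    TrainingJobName=params["TrainingJobName"],
    AlgorithmSpecification={
        "TrainingImage": "<region-specific-xgboost-container-image-uri>",  # placeholder
        "TrainingInputMode": "File",
    },
    RoleArn="<sagemaker-execution-role-arn>",  # placeholder
    HyperParameters=params["HyperParameters"],
    InputDataConfig=[
        {
            "ChannelName": "train",
            "ContentType": "text/csv",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": params["InputTrainingS3Uri"],
                "S3DataDistributionType": "FullyReplicated",
            }},
        },
        {
            "ChannelName": "validation",
            "ContentType": "text/csv",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": params["InputValidationS3Uri"],
                "S3DataDistributionType": "FullyReplicated",
            }},
        },
    ],
    OutputDataConfig={"S3OutputPath": params["S3OutputPath"]},
    ResourceConfig={
        "InstanceType": params["InstanceType"],
        "InstanceCount": params["InstanceCount"],
        "VolumeSizeInGB": 30,
    },
    StoppingCondition={"MaxRuntimeInSeconds": params["MaxRuntimeInSeconds"]},
)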
The following screenshot shows the model training page. To start the SageMaker training job, you need to submit the JSON payload by choosing Submit Training Job.
The application shows you the status of the training job. When the job is complete, a SageMaker endpoint is provisioned. This should take a few minutes, and a new SageMaker endpoint should appear on the SageMaker Endpoints tab of the app.
Promote the SageMaker endpoint
For the application to use the newly created SageMaker endpoint, you need to configure the endpoint with the web app. You do so by entering the newly created endpoint name in the New Endpoint field. The application allows you to promote newly created SageMaker endpoints for inference.
Detect corrosion
Now you’re all set to perform corrosion detection. On the Batch Analysis page, you can upload a .zip file containing your images. This processes all the images by detecting corrosion and indicating the percentage of corrosion found in each image.
Summary
In this post, we introduced you to different ML algorithms and used the color classification XGBoost algorithm to detect corrosion. We also showed you how to train and host ML models using Step Functions and SageMaker. We discussed the pros and cons of different ML and deep learning methods and why a color classification method might be more effective. Finally, we showed how you can integrate ML into a web application that allows you to train and deploy a model and perform inference on images. Learn more about Amazon SageMaker and try these solutions out yourself! If you have any comments or questions, let us know in the comments below!
About the Authors
Aravind Kodandaramaiah is a Solution Builder with the AWS Global verticals solutions prototyping team, helping global customers realize the “art of the possibility” using AWS to solve challenging business problems. He is an avid machine learning enthusiast and focuses on building end-to-end solutions on AWS.
Mehdi E. Far is a Sr. Machine Learning Specialist SA in the Manufacturing and Industrial Global and Strategic Accounts organization. He helps customers build machine learning and cloud solutions for their challenging problems.
Amazon Scholar has his eyes on the future of robot movement
Learn how Bill Smart wants to simplify the ways that robots and people work together — and why waiting on a date one night changed his career path.
Aerobotics improves training speed by 24 times per sample with Amazon SageMaker and TensorFlow
Editor’s note: This is a guest post written by Michael Malahe, Head of Data at Aerobotics, a South African startup that builds AI-driven tools for agriculture.
Aerobotics is an agri-tech company operating in 18 countries around the world, based out of Cape Town, South Africa. Our mission is to provide intelligent tools to feed the world. We aim to achieve this by providing farmers with actionable data and insights on our platform, Aeroview, so that they can make the necessary interventions at the right time in the growing season. Our predominant data source is aerial drone imagery: capturing visual and multispectral images of trees and fruit in an orchard.
In this post we look at how we use Amazon SageMaker and TensorFlow to improve our Tree Insights product, which provides per-tree measurements of important quantities like canopy area and health, along with the locations of dead and missing trees. Farmers use this information to make precise interventions like fixing irrigation lines, applying fertilizers at variable rates, and ordering replacement trees. The following is an image of the tool that farmers use to understand the health of their trees and make some of these decisions.
To provide this information, we first must accurately assign each foreground pixel to a single unique tree. For this instance segmentation task, it’s important that we’re as accurate as possible, so we use a machine learning (ML) model that’s been effective in large-scale benchmarks. The model is a variant of Mask R-CNN, which pairs a convolutional neural network (CNN) for feature extraction with several additional components for detection, classification, and segmentation. In the following image, we show some typical outputs, where the pixels belonging to a given tree are outlined by a contour.
Glancing at the outputs, you might think that the problem is solved.
The challenge
The main challenge with analyzing and modeling agricultural data is that it’s highly varied across a number of dimensions.
The following image illustrates some extremes of the variation in the size of trees and the extent to which they can be unambiguously separated.
In the grove of pecan trees, we have one of the largest trees in our database, with an area of 654 m² (a little over a minute to walk around at a typical speed). The vines to the right of the grove measure 50 cm across (the size of a typical potted plant). Our models need to be tolerant to these variations to provide accurate segmentations regardless of the scale.
An additional challenge is that the sources of variation aren’t static. Farmers are highly innovative, and best practices can change significantly over time. One example is ultra-high-density planting for apples, where trees are planted as close as a foot apart. Another is the adoption of protective netting, which obscures aerial imagery, as in the following image.
In this domain with broad and shifting variations, we need to maintain accurate models to provide our clients with reliable insights. Models should improve with every challenging new sample we encounter, and we should deploy them with confidence.
In our initial approach to this challenge, we simply trained on all the data we had. As we scaled, however, we quickly got to the point where training on all our data became infeasible, and the cost of doing so became an impediment to experimentation.
The solution
Although we have variation in the edge cases, we recognized that there was a lot of redundancy in our standard cases. Our goal was to get to a point where our models are trained on only the most salient data, and can converge without needing to see every sample. Our approach to achieving this was first to create an environment where it’s simple to experiment with different approaches to dataset construction and sampling. The following diagram shows our overall workflow for the data preprocessing that enables this.
The outcome is that training samples are available as individual files in Amazon Simple Storage Service (Amazon S3), which is only sensible to do with bulky data like multispectral imagery, with references and rich metadata stored in Amazon Redshift tables. This makes it trivial to construct datasets with a single query, and makes it possible to fetch individual samples with arbitrary access patterns at train time. We use UNLOAD to create an immutable dataset in Amazon S3, and we create a reference to the file in our Amazon Relational Database Service (Amazon RDS) database, which we use for training provenance and evaluation result tracking. See the following code:
UNLOAD ('[subset_query]')
TO '[s3://path/to/dataset]'
IAM_ROLE '[redshift_write_to_s3_role]'
FORMAT PARQUET
The ease of querying the tile metadata allowed us to rapidly create and test subsets of our data, and eventually we were able to train to convergence after seeing only 1.1 million samples of a total 3.1 million. This sub-epoch convergence has been very beneficial in bringing down our compute costs, and we got a better understanding of our data along the way.
The second step we took in reducing our training costs was to optimize our compute. We used the TensorFlow profiler heavily throughout this step:
import tensorflow as tf

# Capture host, Python, and device traces around the training loop.
options = tf.profiler.experimental.ProfilerOptions(host_tracer_level=2,
                                                   python_tracer_level=1,
                                                   device_tracer_level=1)
tf.profiler.experimental.start("[log_dir]", options=options)
[train model]
tf.profiler.experimental.stop()
For training, we use Amazon SageMaker with P3 instances provisioned by Amazon Elastic Compute Cloud (Amazon EC2), and initially we found that the NVIDIA Tesla V100 GPUs in the instances were bottlenecked by CPU compute in the input pipeline. The overall pattern for alleviating the bottleneck was to shift as much of the compute from native Python code to TensorFlow operations as possible to ensure efficient thread parallelism. The largest benefit was switching to tf.io for data fetching and deserialization, which improved throughput by 41%. See the following code:
serialised_example = tf.io.decode_compressed(tf.io.gfile.GFile(fname, 'rb').read(), compression_type='GZIP')
example = tf.train.Example.FromString(serialised_example.numpy())
A bonus feature with this approach was that switching between local files and Amazon S3 storage required no code changes due to the file object abstraction provided by GFile.
We found that the last remaining bottleneck came from the default TensorFlow CPU parallelism settings, which we optimized using a SageMaker hyperparameter tuning job (see the following example config).
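The example config itself isn't reproduced here. As a hedged sketch, the tuning boils down to exposing TensorFlow's two thread-pool sizes as hyperparameters and letting SageMaker search over them; the metric name, parameter ranges, and estimator settings below are assumptions for illustration, not our actual job definition:

# Inside the training script: apply the thread-pool sizes passed as hyperparameters.
import argparse
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument("--inter_op_threads", type=int, default=0)  # 0 lets TF decide
parser.add_argument("--intra_op_threads", type=int, default=0)
args, _ = parser.parse_known_args()

tf.config.threading.set_inter_op_parallelism_threads(args.inter_op_threads)
tf.config.threading.set_intra_op_parallelism_threads(args.intra_op_threads)

# On the driver side: a SageMaker hyperparameter tuning job over those two settings.
from sagemaker.tensorflow import TensorFlow
from sagemaker.tuner import HyperparameterTuner, IntegerParameter

estimator = TensorFlow(
    entry_point="train.py",             # assumed training script name
    role="<sagemaker-execution-role>",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.4",
    py_version="py37",
)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="samples_per_second",
    metric_definitions=[{"Name": "samples_per_second",
                         "Regex": "samples/sec: ([0-9\\.]+)"}],  # assumes the script logs this
    hyperparameter_ranges={
        "inter_op_threads": IntegerParameter(1, 8),
        "intra_op_threads": IntegerParameter(1, 16),
    },
    objective_type="Maximize",
    max_jobs=12,
    max_parallel_jobs=3,
)
tuner.fit()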
With the CPU bottleneck removed, we moved to GPU optimization, and made the most of the V100’s Tensor Cores by using mixed precision training:
from tensorflow.keras.mixed_precision import experimental as mixed_precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)
The mixed precision guide is a solid reference, but the change to using mixed precision still requires some close attention to ensure that the operations happening in half precision are not ill-conditioned or prone to underflow. Some specific cases that were critical were terminal activations and custom regularization terms. See the following code:
import tensorflow as tf
...
y_pred = tf.keras.layers.Activation('sigmoid', dtype=tf.float32)(x)
loss = binary_crossentropy(tf.cast(y_true, tf.float32), tf.cast(y_pred, tf.float32), from_logits=True)
...
model.add_loss(lambda: tf.keras.regularizers.l2()(tf.cast(w, tf.float32)))
After implementing this, we measured the following benchmark results for a single V100.
| Precision | CPU parallelism | Batch size | Samples per second |
| --- | --- | --- | --- |
| single | default | 8 | 9.8 |
| mixed | default | 16 | 19.3 |
| mixed | optimized | 16 | 22.4 |
The impact of switching to mixed precision was that training speed roughly doubled, and the impact of using the optimal CPU parallelism settings discovered by SageMaker was an additional 16% increase.
Implementing these initiatives as we grew resulted in reducing the cost of training a model from $122 to $68, while our dataset grew from 228 thousand samples to 3.1 million, amounting to a 24 times reduction in cost per sample.
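For the arithmetic: $122 / 228,000 samples ≈ $0.00054 per sample before, versus $68 / 3,100,000 samples ≈ $0.000022 per sample after, roughly a 24-fold reduction in cost per sample.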
Conclusion
This reduction in training time and cost has meant that we can quickly and cheaply adapt to changes in our data distribution. We often encounter new cases that are confounding even for humans, such as the following image.
However, they quickly become standard cases for our models, as shown in the following image.
We aim to continue making training faster by using more devices, and making it more efficient by leveraging SageMaker managed Spot Instances. We also aim to make the training loop tighter by serving SageMaker models that are capable of online learning, so that improved models are available in near-real time. With these in place, we should be well equipped to handle all the variation that agriculture can throw at us. To learn more about Amazon SageMaker, visit the product page.
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.
About the Author
Michael Malahe is the Head of Data at Aerobotics, a South African startup that builds AI-driven tools for agriculture.
MLSys 2021: Bridging the divide between machine learning and systems
Amazon distinguished scientist and conference general chair Alex Smola on what makes MLSys unique — both thematically and culturally.
AWS ML Community showcase: March 2021 edition
In our Community Showcase, Amazon Web Services (AWS) highlights projects created by AWS Heroes and AWS Community Builders.
Each month AWS ML Heroes and AWS ML Community Builders bring to life projects and use cases for the full range of machine learning skills from beginner to expert through deep dive tutorials, podcasts, videos, and other content that show how to use AWS Machine Learning (ML) solutions such as Amazon SageMaker, pretrained AI services such as Amazon Rekognition, and AI learning devices such as AWS DeepRacer.
The AWS ML community is a vibrant group of developers, data scientists, researchers, and business decision-makers that dive deep into artificial intelligence and ML concepts, contribute with real-world experiences, and collaborate on building projects together.
Here are a few highlights of externally published getting started guides and tutorials curated by our AWS ML Evangelist team led by Julien Simon.
AWS ML Heroes and AWS ML Community Builder Projects
Making My Toddler’s Dream of Flying Come True with AI Tech (with code samples). In this deep dive tutorial, AWS ML Hero Agustinus Nalwan walks you through how to build an object detection model with Amazon SageMaker JumpStart (a set of solutions for the most common use cases that can be deployed readily with just a few clicks), Torch2trt (a tool to automatically convert PyTorch models into TensorRT), and NVIDIA Jetson AGX Xavier.
How to use Amazon Rekognition Custom Labels to analyze AWS DeepRacer Real World Performance Through Video (with code samples). In this deep dive tutorial, AWS ML Community Builder Pui Kwan Ho shows you how to analyze the path and speed of an AWS DeepRacer device using pretrained computer vision with Amazon Rekognition Custom Labels.
AWS Panorama Appliance Developers Kit: An Unboxing and Walkthrough (with code samples). In this video, AWS ML Hero Mike Chambers shows you how to get started with AWS Panorama, an ML appliance and software development kit (SDK) that allows developers to run computer vision and make predictions locally with high accuracy and low latency.
Improving local food processing with Amazon Lookout for Vision (with code samples). In this deep dive tutorial, AWS ML Hero Olalekan Elesin demonstrates how to use AI to improve the quality of food sorting (using cassava flakes) cost-effectively and with zero AI knowledge.
Conclusion
Whether you’re just getting started with ML, already an expert, or something in between, there is always something to learn. Choose from community-created and ML-focused blogs, videos, eLearning guides, and much more from the AWS ML community.
Are you interested in contributing to the community? Apply to the AWS Community Builders program today!
The content and opinions in the preceding linked posts are those of the third-party authors and AWS is not responsible for the content or accuracy of those posts.
About the Author
Cameron Peron is Senior Marketing Manager for AWS Amazon Rekognition and the AWS AI/ML community. He evangelizes how AI/ML innovation solves complex challenges facing community, enterprise, and startups alike. Out of the office, he enjoys staying active with kettlebell-sport, spending time with his family and friends, and is an avid fan of Euro-league basketball.
ECIR 2021: Where information retrieval becomes conversation
In the future, says Amazon Scholar Emine Yilmaz, users will interact with computers to identify just the information they need, rather than scrolling through long lists of results.