How Digitata provides intelligent pricing on mobile data and voice with Amazon Lookout for Metrics

This is a guest post by Nico Kruger (CTO of Digitata) and Chris King (Sr. ML Specialist SA at AWS). In their own words, “Digitata intelligently transforms pricing and subscriber engagement for mobile operators, empowering operators to make better and more informed decisions to meet and exceed business objectives.”

As William Gibson said, “The future is here. It’s just not evenly distributed yet.” This is incredibly true in many emerging markets for connectivity. Users often pay 100 times more for data, voice, and SMS services than their counterparts in Europe or the US. Digitata aims to democratize access to telecommunications services through dynamic pricing solutions for mobile network operators (MNOs), while helping operators deliver optimal returns on their large capital investments in infrastructure.

Our pricing models are classically based on supply and demand. We use machine learning (ML) algorithms to optimize two main variables: utilization (of infrastructure), and revenue (fees for telco services). For example, at 3:00 AM when a tower is idle, it’s better to charge a low rate for data than waste this fixed capacity and have no consumers. Comparatively, for a very busy tower, it’s prudent to raise the prices at certain times to reduce congestion, thereby reducing the number of dropped calls or sluggish downloads for customers.
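
To make the intuition concrete, here is a toy sketch of a utilization-driven pricing rule. It is purely illustrative and not our production model; the thresholds and factors are hypothetical:

def adjust_price(base_price, utilization):
    """Toy pricing rule: discount an idle tower, add a premium to a congested one.

    All thresholds and factors below are illustrative only.
    """
    if utilization < 0.2:
        # Tower mostly idle (for example, 3:00 AM): steep discount to stimulate demand
        factor = 0.2 + utilization
    elif utilization > 0.8:
        # Tower congested: premium of up to 50% to ease congestion
        factor = 1.0 + (utilization - 0.8) * 2.5
    else:
        factor = 1.0
    return base_price * factor

# A data bundle with a base price of 10 units on a tower at 5% utilization at night
print(adjust_price(10.0, 0.05))  # 2.5: a steep off-peak discount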

Our models attempt to optimize utilization and revenue according to three main features, or dimensions: location, time, and user segment. Taking the traffic example further, the traffic profile over time for a tower located in a rural or suburban area is very different from a tower in a central downtown district. In general, the suburban tower is busier early in the mornings and later at night than the tower based in the central business district, which is much busier during traditional working hours.

Our customers (the MNOs) trust us to be their automated, intelligent pricing partner. As such, it’s imperative that we keep on top of any anomalous behavior patterns when it comes to their revenue or network utilization. If our model charges too little for data bundles (or even makes it free), it could lead to massive network congestion issues as well as the obvious lost revenue impact. Conversely, if we charge too much for services, it could lead to unhappy customers and loss of revenue, through the principles of supply and demand.

It’s therefore imperative that we have a robust, real-time anomaly detection system in place to alert us whenever there is anomalous behavior on revenue and utilization. It also needs to be aware of the dimensions we operate under (location, user segment, and time).

History of anomaly detection at Digitata

We have been through four phases of anomaly detection at Digitata in the last 13 years:

  1. Manually monitoring our KPIs in reports on a routine basis.
  2. Defining routine checks using static thresholds that alert if the threshold is exceeded.
  3. Using custom anomaly detection models to track basic KPIs over time, such as total unique customers per tower, revenue per GB, and network congestion.
  4. Creating complex collections of anomaly detection models to track even more KPIs over time.

Manual monitoring consumed a growing share of our staff hours and was the most error-prone approach, which drove the move to automation in Phase 2. The automated alarms with static thresholds ensured that routine checks were performed consistently and accurately, but without sufficient sophistication. This led to alert fatigue, and pushed us to custom modeling.

Custom modeling can work well for a simple problem, but the approach for one problem doesn’t translate perfectly to the next. This leads to a growing number of models that must run in production to provide relevant insights, and the operational complexity of orchestrating them begins to scale beyond the means of our in-house developers and tooling. The cost of expansion also prevents other teams from running experiments and identifying other opportunities for us to leverage ML-backed anomaly detection.

Additionally, although you can now detect anomalies via ML, you still need to do frequent deep-dive analysis to find other combinations of dimensions that may point to underlying anomalies. For example, when a competitor strongly targets a certain location or segment of users, the adverse impact on sales may not necessarily be surfaced, depending on how deeply your anomaly detection models track the different dimensions.

The problem that remains to be solved

Given our earlier problem statement, we have at least the following dimensions under which products are sold:

  • Thousands of locations (towers).
  • Hundreds of products and bundles (different data bundles such as social or messaging).
  • Hundreds of customer segments. Segments are based on clusters of users according to hundreds of attributes that are system calculated from MNO data feeds.
  • Hourly detection for each day of the week.

We can use traditional anomaly detection methods to monitor a single measure, such as revenue or purchase count. We don’t, however, have the necessary insights on a dimension-based level to answer questions such as:

  • How is product A selling compared to product B?
  • What does revenue look like at location A vs. location B?
  • What do sales look like for customer segment A vs. customer segment B?
  • When you start combining dimensions, what does revenue look like on product A, segment A, vs. product A, segment B; product B, segment A; and product B, segment B?

The number of dimension combinations quickly adds up. It becomes impractical to create anomaly detection models for each dimension and each combination of dimensions, and that is only with the four dimensions mentioned! What if we want to quickly add two or three additional dimensions to our anomaly detection systems? Even with existing off-the-shelf tools, creating additional anomaly models requires time and resources, to say nothing of the weeks to months needed to build them in-house.

That is when we went looking for a purpose-built tool that does exactly this: the dimension-aware managed anomaly detection service, Amazon Lookout for Metrics.

Amazon Lookout for Metrics

Amazon Lookout for Metrics uses ML to automatically detect and diagnose anomalies (outliers from the norm) in business and operational time series data, such as a sudden dip in sales revenue or customer acquisition rates.

In a couple of clicks, you can connect Amazon Lookout for Metrics to popular data stores like Amazon Simple Storage Service (Amazon S3), Amazon Redshift, and Amazon Relational Database Service (Amazon RDS), as well as third-party SaaS applications, such as Salesforce, ServiceNow, Zendesk, and Marketo, and start monitoring metrics that are important to your business.

Amazon Lookout for Metrics automatically inspects and prepares the data from these sources and builds a custom ML model—informed by over 20 years of experience at Amazon—to detect anomalies with greater speed and accuracy than traditional methods used for anomaly detection. You can also provide feedback on detected anomalies to tune the results and improve accuracy over time. Amazon Lookout for Metrics makes it easy to diagnose detected anomalies by grouping together anomalies that are related to the same event and sending an alert that includes a summary of the potential root cause. It also ranks anomalies in order of severity so that you can prioritize your attention to what matters the most to your business.

How we used Amazon Lookout for Metrics

Inside Amazon Lookout for Metrics, you need to describe your data in terms of measures and dimensions. Measures are variables or key performance indicators on which you want to detect anomalies, and dimensions are metadata that represent categorical information about the measures.

To detect outliers, Amazon Lookout for Metrics builds an ML model that is trained with your source data. This model, called a detector, is automatically trained with the ML algorithm that best fits your data and use case. You can either provide your historical data for training, if you have any, or get started with real-time data, and Amazon Lookout for Metrics learns as it goes.

We used Amazon Lookout for Metrics to convert our anomaly detection tracking on two of our most important datasets: bundle revenue and voice revenue.

For bundle revenue, we track the following measures:

  • Total revenue from sales
  • Total number of sales
  • Total number of sales to distinct users
  • Average price at which the product was bought

Additionally, we track the following dimensions:

  • Location (tower)
  • Product
  • Customer segment

For voice revenue, we track the following measures:

  • Total calls made
  • Total revenue from calls
  • Total distinct users that made a call
  • The average price at which a call was made

Additionally, we track the following dimensions:

  • Location (tower)
  • Type of call (international, on-net, roaming, off-net)
  • Whether the user received a discount or not
  • Customer spend

This allows us to have coverage on these two datasets, using only two anomaly detection models with Amazon Lookout for Metrics.
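
For readers who prefer the API to the console, the following is a rough sketch of how a detector and metric set along the lines of our bundle revenue dataset could be defined with boto3. The bucket, role ARN, column names, and path template are hypothetical placeholders rather than our actual configuration:

import boto3

lookout = boto3.client('lookoutmetrics')

# Create a detector that looks for anomalies every hour
detector = lookout.create_anomaly_detector(
    AnomalyDetectorName='bundle-revenue-detector',
    AnomalyDetectorConfig={'AnomalyDetectorFrequency': 'PT1H'})

# Attach a metric set: the measures to monitor and the dimensions to slice them by
lookout.create_metric_set(
    AnomalyDetectorArn=detector['AnomalyDetectorArn'],
    MetricSetName='bundle-revenue-metrics',
    MetricSetFrequency='PT1H',
    MetricList=[
        {'MetricName': 'total_revenue', 'AggregationFunction': 'SUM'},
        {'MetricName': 'sale_count', 'AggregationFunction': 'SUM'},
        {'MetricName': 'avg_price', 'AggregationFunction': 'AVG'}],
    DimensionList=['tower_id', 'product', 'customer_segment'],
    TimestampColumn={'ColumnName': 'event_time',
                     'ColumnFormat': 'yyyy-MM-dd HH:mm:ss'},
    MetricSource={'S3SourceConfig': {
        'RoleArn': 'arn:aws:iam::123456789012:role/LookoutMetricsS3Role',  # hypothetical
        'TemplatedPathList': ['s3://example-bucket/bundles/{{yyyyMMdd}}/{{HH}}/'],  # hypothetical layout
        'FileFormatDescriptor': {'CsvFormatDescriptor': {'Delimiter': ','}}}})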

Architecture overview

Apache Nifi is an open-source data flow tool that we use for ETL tasks, both on premises and in AWS. We use it as a main flow engine for parsing, processing, and updating data we receive from the mobile network. This data ranges from call records, data usage records, and airtime recharges to network tower utilization and congestion information. This data is fed into our ML models to calculate the price on a product, location, time, and segment basis.

The following diagram illustrates our architecture.

Because of the reality of the MNO industry (at the moment), it’s not always possible for us to leverage AWS for all of our deployments. Therefore, we have a mix of fully on-premises, hybrid, and fully native cloud deployments.

We use a setup where we leverage Apache Nifi, connected from AWS over VPC and VPN connections, to pull anonymized data on an event-driven basis from all of our deployments (regardless of type) simultaneously. The data is then stored in Amazon S3 and in Amazon CloudWatch, from where we can use services such as Amazon Lookout for Metrics.

Results from our experiments

While getting to know Amazon Lookout for Metrics, we primarily focused on the backtesting functionality within the service. This feature allows you to supply historical data, have Amazon Lookout for Metrics train on a large portion of your early historical data, and then identify anomalies in the remaining, more recent data.
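
Backtesting can be launched from the console; the equivalent API call is short (the detector ARN below is a hypothetical placeholder):

import boto3

lookout = boto3.client('lookoutmetrics')

# Train on the earlier portion of the historical data and flag anomalies in the rest
lookout.back_test_anomaly_detector(
    AnomalyDetectorArn='arn:aws:lookoutmetrics:eu-west-1:123456789012:AnomalyDetector:voice-revenue-detector')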

We quickly discovered that this has massive potential, not only to start learning the service, but also to gather insights as to what other opportunities reside within your data, which you may have never thought of, or always expected to be there but never had the time to investigate.

For example, we quickly found a very interesting case with one of our customers. We were tracking voice revenue as the measure, under the dimensions of call type (on-net, off-net, roaming), region (a high-level concept of an area, such as a province or big city), and timeband (after hours, business hours, weekends).

Amazon Lookout for Metrics identified an anomaly on international calls in a certain region, as shown in the following graph.

We quickly went to our source data, and saw the following visualization.

This graph shows the total daily revenue for international calls. As you can see, when looking at the global revenue, there is no real impact of the sort that Amazon Lookout for Metrics identified.

But when looking at the specific region that was identified, you see the following anomaly.

A clear spike in international calls took place on this day, in this region. We looked deeper into it and found that the specific city identified by this region is known as a tourist and conference destination. This raises the question: is there any business value to be found in an insight such as this? Can we react to anomalies like these in real time by using Amazon Lookout for Metrics and then providing specific pricing specials on international calls in the region, in order to take advantage of the influx of demand? The answer is yes, and we are! With stakeholders now alerted to these events as they happen, and with exploratory analysis of our recent history, we’re prepared for future issues and are becoming more aware of operational gaps in our past.

In addition to the exploration using the backtesting feature (which is still ongoing as of this writing), we also set up real-time detectors to work in parallel with our existing anomaly detection service.

Within two days, we found our first real operational issue, as shown in the following graph.

The graph shows revenue attributed to voice calls for another customer. In this case, we had a clear spike in our catchall NO LOCATION LOOKUP region. We map revenue from the towers to regions (such as city, province, or state) using a mapping table that we periodically refresh from within the MNO network, or by receiving such a mapping from the MNO themselves. When a tower isn’t mapped correctly by this table, its revenue shows up under this catchall region in our data. In this case, there was a problem with the mapping feed from our customer.

The effect was that the number of towers that couldn’t be classified was slowly growing. This could affect our pricing models, which would become less accurate at factoring in location when generating the optimal price.

A very important operational anomaly to detect early!

Digitata in the future

We’re constantly evolving our ML and analytics capabilities, with the end goal of making connectivity more affordable for the entire globe. As we continue on this journey, we look to services such as Amazon Lookout for Metrics to help us ensure the quality of our services, find operational issues, and identify opportunities. It has made a dramatic difference in our anomaly detection capabilities, and has pointed us to some previously undiscovered opportunities. This all allows us to work on what really matters: getting everyone connected to the wonder of the internet at affordable prices!

Getting started

Amazon Lookout for Metrics is now available in preview in US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland). Request preview access to get started today!

You can interact with the service using the AWS Management Console, the AWS SDKs, and the AWS Command Line Interface (AWS CLI). For more information, see the Amazon Lookout for Metrics Developer Guide.


About the Authors

Nico Kruger is the CTO of Digitata and is a fan of programming computers, reading things, listening to music, and playing games. Nico has 10+ years of experience in telco. In his own words: “From C++ to JavaScript, AWS to on-prem, as long as the tool is fit for the job, it works and the customer is happy; it’s all good. Automate all the things, plan for failure and be adaptable and everybody wins.”

 

Chris King is a Senior Solutions Architect in Applied AI with AWS. He has a special interest in launching AI services and helped grow and build Amazon Personalize and Amazon Forecast before focusing on Amazon Lookout for Metrics. In his spare time, he enjoys cooking, reading, boxing, and building models to predict the outcome of combat sports.

Read More

Rust detection using machine learning on AWS

Visual inspection of industrial environments is a common requirement across heavy industries, such as transportation, construction, and shipbuilding, and typically requires qualified experts to perform the inspection. Inspection locations can often be remote or in adverse environments that put humans at risk, such as bridges, skyscrapers, and offshore oil rigs.

Many of these industries deal with huge metal surfaces and harsh environments. A common problem across these industries is metal corrosion and rust. Although corrosion and rust are used interchangeably across different industries (we also use the terms interchangeably in this post), these two phenomena are different. For more details about the differences between corrosion and rust as well as different degrees of such damages, see Difference Between Rust and Corrosion and Stages of Rust.

Different levels and grades of rust can also result in different colors for the damaged areas. If you have enough images of different classes of rust, you can use the techniques described in this post to detect different classes of rust and corrosion.

Rust is a serious risk for operational safety. The costs associated with inadequate protection against corrosion can be catastrophic. Conventionally, corrosion detection is done using visual inspection of structures and facilities by subject matter experts. Inspection can involve on-site direct interpretation or the collection of pictures and the offline interpretation of them to evaluate damages. Advances in the fields of computer vision and machine learning (ML) makes it possible to automate corrosion detection to reduce the costs and risks involved in performing such inspections.

In this post, we describe how to build a serverless pipeline to create ML models for corrosion detection using Amazon SageMaker and other AWS services. The result is a fully functioning app to help you detect metal corrosion.

We will use the following AWS services:

  • Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale.
  • AWS Lambda is a compute service that lets you run code without provisioning or managing servers. Lambda runs your code only when triggered and scales automatically, from a few requests per day to thousands per second.
  • Amazon SageMaker is a fully managed service that provides developers and data scientists the tools to build, train, and deploy different types of ML models.
  • AWS Step Functions allows you to coordinate several AWS services into a serverless workflow. You can design and run workflows where the output of one step acts as the input to the next step while embedding error handling into the workflow.

Solution overview

The corrosion detection solution comprises a React-based web application that lets you pick one or more images of metal corrosion to perform detection. The application lets you train the ML model and deploys the model to SageMaker hosting services to perform inference.

The following diagram shows the solution architecture.

The solution supports the following use cases:

  • Performing on-demand corrosion detection
  • Performing batch corrosion detection
  • Training ML models using Step Functions workflows

The following are the steps for each workflow:

  • On-demand corrosion detection – An image picked by the application user is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket. The image S3 object key is sent to an API deployed on API Gateway. The API’s Lambda function invokes a SageMaker endpoint to detect corrosion in the image uploaded, and generates and stores a new image in an S3 bucket, which is further rendered in the front end for analysis.
  • Batch corrosion detection – The user uploads a .zip file containing images to an S3 bucket. A Lambda function configured as an Amazon S3 trigger is invoked. The function performs batch corrosion detection by performing an inference using the SageMaker endpoint. Resulting new images are stored back in Amazon S3. These images can be viewed in the front end.
  • Training the ML model – The web application allows you to train a new ML model using Step Functions and SageMaker. The following diagram shows the model training and endpoint hosting orchestration. The Step Functions workflow is started by invoking the StartTrainingJob API supported by the Amazon States Language. After a model has been created, the CreateEndpoint API of SageMaker is invoked, which creates a new SageMaker endpoint and hosts the new ML model. A checkpoint step ensures that the endpoint is completely provisioned before ending the workflow.

Machine learning algorithm options

Corrosion detection is conventionally done by trained professionals using visual inspection. In challenging environments such as offshore rigs, visual inspection can be very risky. Automating the inspection process using computer vision models mounted on drones is a helpful alternative. You can use different ML approaches for corrosion detection. Depending on the available data and application objectives, you could use deep learning (including object detection or semantic segmentation) or color classification, using algorithms such as Extreme Gradient Boosting (XGBoost). We discuss both approaches in this post, with an emphasis on the XGBoost method, and cover the advantages and limitations of both. Other methods such as unsupervised clustering might also be applicable, but aren’t discussed in this post.

Deep learning approach

In recent years, deep learning has been used for automatic corrosion detection. Depending on the data availability and the type of labeling used, you can use object detection or semantic segmentation to detect corroded areas in metal structures. Although deep learning techniques are very effective for numerous use cases, the complex nature of corrosion detection (the lack of specific shapes) sometimes makes deep learning methods less effective for detecting corroded areas.

We explain in more detail some of the challenges involved in using deep learning for this problem and propose an alternative approach using a simpler ML method that doesn’t require the laborious labeling that deep learning methods need. If you have a dataset annotated using rectangular bounding boxes, you can use an object detection algorithm.

The most challenging aspect of this problem when using deep learning is that corroded parts of structures don’t have predictable shapes, which makes it difficult to train a comprehensive deep learning model using object detection or semantic segmentation. However, if you have enough annotated images, you can detect these random-looking patterns with reasonable accuracy. For instance, you can detect the corroded area in the following image (shown inside the red rectangle) using an object detection or semantic segmentation model with proper training and data.

The more challenging problem for performing corrosion detection using deep learning is the fact that the entire metal structure can often be corroded (as in the following image), and deep learning models confuse these corroded structures with the non-corroded ones because the edges and shapes of entirely corroded structures are similar to a regular healthy structure with no corrosion. This can be the case for any structure and not just limited to pipes.

  

Color classification approach (using the XGBoost algorithm)

Another way of looking at the corrosion detection problem is to treat it as a pixel-level color classification, which has shown promise over deep learning methods, even with small training datasets. We use a simple XGBoost method, but you can use any other classification algorithm (such as Random Forest).

The downside of this approach is that darker pixel colors in images can be mistakenly interpreted as corrosion. Lighting conditions and shadows might also affect the outcome. However, this method produced better-quality results than the deep learning approaches because it isn’t affected by the shape of structures or the extent of corrosion. Accuracy can be improved by using more comprehensive data.

If you require pixel-level interpretation of images, the other alternative is to use semantic segmentation, which requires significant labeling. Our proposed method offers a solution to avoid this tedious labeling.

The rest of this post focuses on using the color classification (using XGBoost) approach. We explain the steps required to prepare data for this approach and how to train such a model on SageMaker using the accompanying web application.
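
To make the approach concrete before diving into data preparation, the following is a minimal local sketch of pixel-level color classification. It assumes a data.csv file of labeled (class, R, G, B) rows, produced as described in the next sections, and illustrates the idea rather than reproducing the notebook’s exact code:

import numpy as np
import pandas as pd
import xgboost as xgb
from PIL import Image

# Train a pixel-level classifier on rows of (class, R, G, B)
df = pd.read_csv('data.csv')
model = xgb.XGBClassifier(max_depth=3, n_estimators=150)
model.fit(df[['R', 'G', 'B']], df['class'])

# Classify every pixel of a new image and estimate the corroded fraction
img = np.asarray(Image.open('test_image.jpg').convert('RGB'))
pixels = img.reshape(-1, 3)                      # one row of RGB values per pixel
pred = model.predict(pixels)                     # 0 = Clean, 1 = Corroded
mask = pred.reshape(img.shape[0], img.shape[1])  # per-pixel corrosion mask
print('Corroded pixels: {:.1f}%'.format(100 * mask.mean()))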

Create training and validation datasets

When using XGBoost, you can create training datasets from either annotated images or manually cropped, non-annotated images. The color classification (XGBoost) algorithm requires that you extract the RGB values of each pixel in the image that has been labeled as clean or corroded.

We created Jupyter notebooks to help you create training and validation datasets depending on whether you’re using annotated or non-annotated images.

Create training and validation datasets for annotated images

When you have annotated images of corrosion, you can programmatically crop them to create smaller images so you have just the clean or corroded parts of the image. You reshape the small cropped images into a 2D array and stack them together to build your dataset. To ensure better-quality data, the following code further crops the small images to pick only the central portion of the image.

To help you get started quickly, we created a sample training dataset (5 MB) that you can use to create training and validation datasets. You can then use these datasets to train and deploy a new ML model. We created the sample training dataset from a few public images from pexels.com.

Let’s understand the process of creating a training dataset from annotated images. We created a notebook to help you with the data creation. The following are the steps involved in creating the training and validation data.

Crop annotated images

The first step is to crop the annotated images.

  1. We read all annotated images and the XML files containing the annotation information (such as bounding boxes and class name). See the following code:
    xml_paths = get_file_path_list(xml_path)
    images_names = list(set(get_filename_list(images_path)))
    

  2. Because the input images are annotated, we extract the class names and bounding boxes for each annotated image:
    for idx, x in enumerate(xml_paths):
        single_imgfile_path = images_path + '\\' + x.split('\\')[-1].split('.')[0] + '.JPG'
        image = Image.open(single_imgfile_path)
        tree = ET.parse(x)
        root = tree.getroot()
        for idx2, rt in enumerate(root.findall('object')):
            name = rt.find('name').text
            if name in classes_to_use:
                xmin = int(rt.find('bndbox').find('xmin').text)
                ymin = int(rt.find('bndbox').find('ymin').text)
                xmax = int(rt.find('bndbox').find('xmax').text)
                ymax = int(rt.find('bndbox').find('ymax').text)
    

 

  3. For each bounding box in an image, we zoom in to the bounding box, crop the center portion, and save that in a separate file. We cut the bounding box by 1/3 of its size from each side, therefore taking 1/9 of the area inside the bounding box (its center). See the following code:
    a = (xmax-xmin)/3.0
    b = (ymax-ymin)/3.0
    box = [int(xmin+a),int(ymin+b),int(xmax-a),int(ymax-b)]
    image1 = image.crop(box)
    

  4. Finally, we save the cropped image:
    image1.save('cropped_images_small/'+name+"-"+str(count)+".png", "PNG", quality=80, optimize=True, progressive=True)

It’s recommended to do a quick visual inspection of the cropped images to make sure they only contain either clean or corroded parts.

The following code shows the implementation for cropping the images (also available in section 2 of the notebook):

import os
import shutil
import xml.etree.ElementTree as ET

from PIL import Image

# get_file_path_list and get_filename_list are helper functions defined in the notebook

def crop_images(xml_path, images_path, classes_to_use):
    # Crop objects of the types given in "classes_to_use" from XML files with several
    # objects and several classes in each file

    if os.path.isdir("cropped_images_small"):
        shutil.rmtree('cropped_images_small')
    os.mkdir('cropped_images_small')
    print("Storing cropped images in cropped_images_small folder")

    xml_paths = get_file_path_list(xml_path)
    images_names = list(set(get_filename_list(images_path)))
    count = 0
    for idx, x in enumerate(xml_paths):
        if '.DS_Store' not in x:
            single_imgfile_path = images_path + '\\' + x.split('\\')[-1].split('.')[0] + '.JPG'
            image = Image.open(single_imgfile_path)
            tree = ET.parse(x)
            root = tree.getroot()
            for idx2, rt in enumerate(root.findall('object')):
                name = rt.find('name').text
                if name in classes_to_use:
                    xmin = int(rt.find('bndbox').find('xmin').text)
                    ymin = int(rt.find('bndbox').find('ymin').text)
                    xmax = int(rt.find('bndbox').find('xmax').text)
                    ymax = int(rt.find('bndbox').find('ymax').text)
                    # Keep only the central ninth of the bounding box
                    a = (xmax-xmin)/3.0
                    b = (ymax-ymin)/3.0
                    box = [int(xmin+a), int(ymin+b), int(xmax-a), int(ymax-b)]
                    image1 = image.crop(box)
                    image1.save('cropped_images_small/'+name+"-"+str(count)+".png", "PNG", quality=80, optimize=True, progressive=True)
                    count += 1

Create the RGB DataFrame

After cropping and saving the annotated parts, we have many small images, and each image contains only pixels belonging to one class (Clean or Corroded). The next step in preparing the data is to turn the small images into a DataFrame.

  1. We first define the column names for the DataFrame that contains the class (Clean or Corroded) and the RGB values for each pixel.
  2. We define the classes to be used (in case we want to ignore other possible classes that might be present).
  3. For each cropped image, we reshape the image and extract RGB information into a new DataFrame.
  4. Finally, we save the final DataFrame to a .csv file.

See the following code:

import numpy as np
import pandas as pd
from PIL import Image

crop_path = 'Path to your cropped images'
files = get_file_path_list(crop_path)

cols = ['class','R','G','B']
df = pd.DataFrame()

classes_to_use = ['Corroded','Clean']
dict1 = {'Clean': 0, 'Corroded': 1}
for file in files:
    lbls = Image.open(file)
    imagenp = np.asarray(lbls)
    imagenp = imagenp.reshape(imagenp.shape[1]*imagenp.shape[0], 3)
    name = file.split('\\')[-1].split('.')[0].split('-')[0]
    classname = dict1[name]
    dftemp = pd.DataFrame(imagenp)
    dftemp.columns = ['R','G','B']
    dftemp['class'] = classname
    columnsTitles = ['class','R','G','B']
    dftemp = dftemp.reindex(columns=columnsTitles)
    df = pd.concat([df,dftemp], axis=0)

df.columns = cols
df.to_csv('data.csv', index=False)

In the end, we have a table containing labels and RGB values.

Create training and validation sets and upload to Amazon S3

After you prepare the data, you can use the code listed under section 4 of our notebook to generate the training and validation datasets. Before running the code in this section, make sure you enter the name of an S3 bucket in the bucket variable for storing the training and validation data.

The following lines of code in the notebook define variables for the input data file name (FILE_DATA), the training/validation ratio (for this post, we use 20% of the data for validation, which leaves 80% for training), and the names of the generated training and validation data .csv files. You can either use the sample training dataset as the input data file or use the data file you generated by following the previous step, assigning it to the FILE_DATA variable.

FILE_DATA = 'data.csv'
TARGET_VAR = 'class'
FILE_TRAIN = 'train.csv'
FILE_VALIDATION = 'validation.csv'
PERCENT_VALIDATION = 20
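
The split itself is done in the notebook; a minimal version of that step might look like the following sketch, which shuffles the pixel rows and writes the two .csv files using the variables above (SageMaker’s built-in XGBoost expects the label in the first column and no header row):

import pandas as pd

# Shuffle the rows, then carve off the validation portion
data = pd.read_csv(FILE_DATA)
data = data.sample(frac=1.0, random_state=42)
n_validation = int(len(data) * PERCENT_VALIDATION / 100)

data.iloc[:n_validation].to_csv(FILE_VALIDATION, index=False, header=False)
data.iloc[n_validation:].to_csv(FILE_TRAIN, index=False, header=False)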

Finally, you upload the training and validation data to the S3 bucket:

s3_train_loc = upload_to_s3(bucket = bucket, channel = 'train', filename = FILE_TRAIN)
s3_valid_loc = upload_to_s3(bucket = bucket, channel = 'validation', filename = FILE_VALIDATION)
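
upload_to_s3 is a small helper defined in the notebook; a minimal equivalent using boto3 might look like this (the exact key layout used by the notebook may differ):

import boto3

def upload_to_s3(bucket, channel, filename):
    # Store the file under <channel>/<filename> and return its S3 URI
    key = '{}/{}'.format(channel, filename)
    boto3.client('s3').upload_file(filename, bucket, key)
    return 's3://{}/{}'.format(bucket, key)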

Create a training dataset for manually cropped images

When using manually cropped images, name your cropped images with the prefixes Corroded and Clean to be consistent with the implementation in the provided Jupyter notebook. For example, for the Corroded class, you should name your image files Corroded-1.png, Corroded-2.png, and so on.

Set the path of your images and XML files in the img_path and xml_path variables, and set the bucket name in the bucket variable. Run the code in all the sections defined in the notebook. This creates the training and validation datasets and uploads them to the S3 bucket.

Deploy the solution

Now that we have the training and validation datasets in Amazon S3, it’s time to train an XGBoost classifier using SageMaker. To do so, you can use the corrosion detection web application’s model training functionality. To help you with the web application deployment, we created AWS CloudFormation templates. Clone the source code from the GitHub repository and follow the deployment steps outlined there to complete the application deployment. After you successfully deploy the application, you can explore the features it provides, such as on-demand corrosion detection, training and deploying a model, and batch corrosion detection.

Train an XGBoost classifier on SageMaker

To train an XGBoost classifier, sign in to the corrosion detection web application, and on the menu, choose Model Training. Here you can train a new SageMaker model.

You need to configure parameters before starting a new training job in SageMaker. The application provides a JSON formatted parameter payload that contains information about the SageMaker training job name, Amazon Elastic Compute Cloud (Amazon EC2) instance type, the number of EC2 instances to use, the Amazon S3 location of the training and validation datasets, and XGBoost hyperparameters.

The parameter payload also lets you configure the EC2 instance type, which you can use for hosting the trained ML model using SageMaker hosting services. You can change the values of the hyperparameters, although the default values provided work. For more information about training job parameters, see CreateTrainingJob. For more information about hyperparameters, see XGBoost Hyperparameters.

See the following JSON code:

{
   "TrainingJobName":"Corrosion-Detection-7",
   "MaxRuntimeInSeconds":20000,
   "InstanceCount":1,
   "InstanceType":"ml.c5.2xlarge",
   "S3OutputPath":"s3://bucket/csv/output",
   "InputTrainingS3Uri":"s3://bucket/csv/train/train.csv",
   "InputValidationS3Uri":"s3://bucket/csv/validation/validation.csv",
   "HyperParameters":{
      "max_depth":"3",
      "learning_rate":"0.12",
      "eta":"0.2",
      "colsample_bytree":"0.9",
      "gamma":"0.8",
      "n_estimators":"150",
      "min_child_weight":"10",
      "num_class":"2",
      "subsample":"0.8",
      "num_round":"100",
      "objective":"multi:softmax"
   },
"EndpointInstanceType":"ml.m5.xlarge",
"EndpointInitialInstanceCount":1
}
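
For reference, a payload like this maps almost directly onto the SageMaker CreateTrainingJob API. The following sketch shows roughly how it could be translated with boto3; the execution role ARN and the XGBoost container version are assumptions, and the deployed Lambda function may wire this differently:

import json
import boto3
import sagemaker

# The JSON document shown above, saved locally for this sketch
payload = json.load(open('training_payload.json'))

region = boto3.Session().region_name
image_uri = sagemaker.image_uris.retrieve('xgboost', region, version='1.2-1')

boto3.client('sagemaker').create_training_job(
    TrainingJobName=payload['TrainingJobName'],
    AlgorithmSpecification={'TrainingImage': image_uri, 'TrainingInputMode': 'File'},
    RoleArn='arn:aws:iam::123456789012:role/SageMakerExecutionRole',  # hypothetical
    HyperParameters=payload['HyperParameters'],
    InputDataConfig=[
        {'ChannelName': 'train', 'ContentType': 'text/csv',
         'DataSource': {'S3DataSource': {
             'S3DataType': 'S3Prefix',
             'S3Uri': payload['InputTrainingS3Uri'],
             'S3DataDistributionType': 'FullyReplicated'}}},
        {'ChannelName': 'validation', 'ContentType': 'text/csv',
         'DataSource': {'S3DataSource': {
             'S3DataType': 'S3Prefix',
             'S3Uri': payload['InputValidationS3Uri'],
             'S3DataDistributionType': 'FullyReplicated'}}}],
    OutputDataConfig={'S3OutputPath': payload['S3OutputPath']},
    ResourceConfig={'InstanceType': payload['InstanceType'],
                    'InstanceCount': payload['InstanceCount'],
                    'VolumeSizeInGB': 30},
    StoppingCondition={'MaxRuntimeInSeconds': payload['MaxRuntimeInSeconds']})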

The following screenshot shows the model training page. To start the SageMaker training job, you need to submit the JSON payload by choosing Submit Training Job.

The application shows you the status of the training job. When the job is complete, a SageMaker endpoint is provisioned. This should take a few minutes, and a new SageMaker endpoint should appear on the SageMaker Endpoints tab of the app.

Promote the SageMaker endpoint

For the application to use the newly created SageMaker endpoint, you need to configure the endpoint with the web app. You do so by entering the newly created endpoint name in the New Endpoint field. The application allows you to promote newly created SageMaker endpoints for inference.

Detect corrosion

Now you’re all set to perform corrosion detection. On the Batch Analysis page, you can upload a .zip file containing your images. This processes all the images by detecting corrosion and indicating the percentage of corrosion found in each image.
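
Behind the scenes, detection amounts to sending each image’s pixel RGB rows to the promoted SageMaker endpoint and counting the pixels classified as corroded. A simplified sketch of that call follows; the endpoint name is hypothetical, and a real implementation would batch the rows to stay under the invocation payload limit:

import boto3
import numpy as np
from PIL import Image

runtime = boto3.client('sagemaker-runtime')

# One R,G,B row per pixel, serialized as CSV
img = np.asarray(Image.open('sample.jpg').convert('RGB')).reshape(-1, 3)
body = '\n'.join(','.join(str(v) for v in px) for px in img)

response = runtime.invoke_endpoint(
    EndpointName='corrosion-detection-endpoint',  # hypothetical
    ContentType='text/csv',
    Body=body)

# The XGBoost endpoint returns comma-separated class predictions
preds = np.array(response['Body'].read().decode('utf-8').strip().split(','), dtype=float)
print('Corrosion: {:.1f}%'.format(100 * (preds == 1).mean()))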

Summary

In this post, we introduced you to different ML algorithms and used the color classification XGBoost algorithm to detect corrosion. We also showed you how to train and host ML models using Step Functions and SageMaker. We discussed the pros and cons of different ML and deep learning methods and why a color classification method might be more effective. Finally, we showed how you can integrate ML into a web application that allows you to train and deploy a model and perform inference on images. Learn more about Amazon SageMaker and try these solutions out yourself! If you have any comments or questions, let us know in the comments below!


About the Authors

Aravind Kodandaramaiah is a Solution Builder with the AWS Global verticals solutions prototyping team, helping global customers realize the “art of the possibility” using AWS to solve challenging business problems. He is an avid machine learning enthusiast and focuses on building end-to-end solutions on AWS.

 

 

Mehdi E. Far is a Sr. Machine Learning Specialist SA in the Manufacturing and Industrial Global and Strategic Accounts organization. He helps customers build machine learning and cloud solutions for their challenging problems.

Read More

Aerobotics improves training speed by 24 times per sample with Amazon SageMaker and TensorFlow

Editor’s note: This is a guest post written by Michael Malahe, Head of Data at Aerobotics, a South African startup that builds AI-driven tools for agriculture.

Aerobotics is an agri-tech company operating in 18 countries around the world, based out of Cape Town, South Africa. Our mission is to provide intelligent tools to feed the world. We aim to achieve this by providing farmers with actionable data and insights on our platform, Aeroview, so that they can make the necessary interventions at the right time in the growing season. Our predominant data source is aerial drone imagery: capturing visual and multispectral images of trees and fruit in an orchard.

In this post we look at how we use Amazon SageMaker and TensorFlow to improve our Tree Insights product, which provides per-tree measurements of important quantities like canopy area and health, along with the locations of dead and missing trees. Farmers use this information to make precise interventions like fixing irrigation lines, applying fertilizers at variable rates, and ordering replacement trees. The following is an image of the tool that farmers use to understand the health of their trees and make some of these decisions.

To provide the information needed to make these decisions, we first must accurately assign each foreground pixel to a single unique tree. For this instance segmentation task, it’s important that we’re as accurate as possible, so we use a machine learning (ML) model that’s been effective in large-scale benchmarks. The model is a variant of Mask R-CNN, which pairs a convolutional neural network (CNN) for feature extraction with several additional components for detection, classification, and segmentation. In the following image, we show some typical outputs, where the pixels belonging to a given tree are outlined by a contour.

Glancing at the outputs, you might think that the problem is solved.

The challenge

The main challenge with analyzing and modeling agricultural data is that it’s highly varied across a number of dimensions.

The following image illustrates some extremes of the variation in the size of trees and the extent to which they can be unambiguously separated.

In the grove of pecan trees, we have one of the largest trees in our database, with an area of 654 m2 (a little over a minute to walk around at a typical speed). The vines to the right of the grove measure 50 cm across (the size of a typical potted plant). Our models need to be tolerant to these variations to provide accurate segmentations regardless of the scale.

An additional challenge is that the sources of variation aren’t static. Farmers are highly innovative, and best practices can change significantly over time. One example is ultra-high-density planting for apples, where trees are planted as close as a foot apart. Another is the adoption of protective netting, which obscures aerial imagery, as in the following image.

In this domain with broad and shifting variations, we need to maintain accurate models to provide our clients with reliable insights. Models should improve with every challenging new sample we encounter, and we should deploy them with confidence.

In our initial approach to this challenge, we simply trained on all the data we had. As we scaled, however, we quickly got to the point where training on all our data became infeasible, and the cost of doing so became an impediment to experimentation.

The solution

Although we have variation in the edge cases, we recognized that there was a lot of redundancy in our standard cases. Our goal was to get to a point where our models are trained on only the most salient data, and can converge without needing to see every sample. Our approach to achieving this was first to create an environment where it’s simple to experiment with different approaches to dataset construction and sampling. The following diagram shows our overall workflow for the data preprocessing that enables this.

The outcome is that training samples are available as individual files in Amazon Simple Storage Service (Amazon S3), an approach that only makes sense with bulky data like multispectral imagery, with references and rich metadata stored in Amazon Redshift tables. This makes it trivial to construct datasets with a single query, and makes it possible to fetch individual samples with arbitrary access patterns at train time. We use UNLOAD to create an immutable dataset in Amazon S3, and we create a reference to the file in our Amazon Relational Database Service (Amazon RDS) database, which we use for training provenance and evaluation result tracking. See the following code:

UNLOAD ('[subset_query]')
TO '[s3://path/to/dataset]'
IAM_ROLE '[redshift_write_to_s3_role]'
FORMAT PARQUET

The ease of querying the tile metadata allowed us to rapidly create and test subsets of our data, and eventually we were able to train to convergence after seeing only 1.1 million samples of a total 3.1 million. This sub-epoch convergence has been very beneficial in bringing down our compute costs, and we got a better understanding of our data along the way.

The second step we took in reducing our training costs was to optimize our compute. We used the TensorFlow profiler heavily throughout this step:

import tensorflow as tf

options = tf.profiler.experimental.ProfilerOptions(
    host_tracer_level=2, python_tracer_level=1, device_tracer_level=1)
tf.profiler.experimental.start("[log_dir]", options=options)
# [train model]
tf.profiler.experimental.stop()

For training, we use Amazon SageMaker with P3 instances provisioned by Amazon Elastic Compute Cloud (Amazon EC2), and initially we found that the NVIDIA Tesla V100 GPUs in the instances were bottlenecked by CPU compute in the input pipeline. The overall pattern for alleviating the bottleneck was to shift as much of the compute from native Python code to TensorFlow operations as possible to ensure efficient thread parallelism. The largest benefit was switching to tf.io for data fetching and deserialization, which improved throughput by 41%. See the following code:

serialised_example = tf.io.decode_compressed(tf.io.gfile.GFile(fname, 'rb').read(), compression_type='GZIP')
example = tf.train.Example.FromString(serialised_example.numpy())

A bonus feature with this approach was that switching between local files and Amazon S3 storage required no code changes due to the file object abstraction provided by GFile.

We found that the last remaining bottleneck came from the default TensorFlow CPU parallelism settings, which we optimized using a SageMaker hyperparameter tuning job (see the following example config).
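
The original tuning job config isn’t reproduced here; as a rough sketch of what such a job can look like with the SageMaker Python SDK, the following assumes a training script that exposes inter_op_parallelism and intra_op_parallelism as hyperparameters and logs samples per second (all names, ranges, and ARNs are hypothetical):

from sagemaker.tensorflow import TensorFlow
from sagemaker.tuner import HyperparameterTuner, IntegerParameter

# The (hypothetical) training script reads the two parallelism values as hyperparameters
# and applies them via tf.config.threading.set_inter/intra_op_parallelism_threads
estimator = TensorFlow(
    entry_point='train.py',
    role='arn:aws:iam::123456789012:role/SageMakerExecutionRole',  # hypothetical
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='2.4',
    py_version='py37')

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name='samples_per_second',
    objective_type='Maximize',
    metric_definitions=[{'Name': 'samples_per_second',
                         'Regex': 'samples/sec: ([0-9\\.]+)'}],  # emitted by the script
    hyperparameter_ranges={
        'inter_op_parallelism': IntegerParameter(1, 8),
        'intra_op_parallelism': IntegerParameter(1, 16)},
    max_jobs=12,
    max_parallel_jobs=3)

tuner.fit({'train': 's3://example-bucket/train/'})  # hypothetical data location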

With the CPU bottleneck removed, we moved to GPU optimization, and made the most of the V100’s Tensor Cores by using mixed precision training:

from tensorflow.keras.mixed_precision import experimental as mixed_precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)

The mixed precision guide is a solid reference, but the change to using mixed precision still requires some close attention to ensure that the operations happening in half precision are not ill-conditioned or prone to underflow. Some specific cases that were critical were terminal activations and custom regularization terms. See the following code:

import tensorflow as tf
...
y_pred = tf.keras.layers.Activation('sigmoid', dtype=tf.float32)(x)
loss = binary_crossentropy(tf.cast(y_true, tf.float32), tf.cast(y_pred, tf.float32), from_logits=True)
...
model.add_loss(lambda: tf.keras.regularizers.l2()(tf.cast(w, tf.float32)))

After implementing this, we measured the following benchmark results for a single V100.

Precision | CPU parallelism | Batch size | Samples per second
single    | default         | 8          | 9.8
mixed     | default         | 16         | 19.3
mixed     | optimized       | 16         | 22.4

The impact of switching to mixed precision was that training speed roughly doubled, and the impact of using the optimal CPU parallelism settings discovered by SageMaker was an additional 16% increase.

Implementing these initiatives as we grew resulted in reducing the cost of training a model from $122 to $68, while our dataset grew from 228 thousand samples to 3.1 million, amounting to a 24 times reduction in cost per sample.

Conclusion

This reduction in training time and cost has meant that we can quickly and cheaply adapt to changes in our data distribution. We often encounter new cases that are confounding even for humans, such as the following image.

However, they quickly become standard cases for our models, as shown in the following image.

We aim to continue making training faster by using more devices, and making it more efficient by leveraging SageMaker managed Spot Instances. We also aim to make the training loop tighter by serving SageMaker models that are capable of online learning, so that improved models are available in near-real time. With these in place, we should be well equipped to handle all the variation that agriculture can throw at us. To learn more about Amazon SageMaker, visit the product page.

 

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Author

Michael Malahe is the Head of Data at Aerobotics, a South African startup that builds AI-driven tools for agriculture.

Read More

AWS ML Community showcase: March 2021 edition

In our Community Showcase, Amazon Web Services (AWS) highlights projects created by AWS Heroes and AWS Community Builders. 

Each month, AWS ML Heroes and AWS ML Community Builders bring to life projects and use cases for the full range of machine learning skills, from beginner to expert, through deep dive tutorials, podcasts, videos, and other content that shows how to use AWS Machine Learning (ML) solutions such as Amazon SageMaker, pretrained AI services such as Amazon Rekognition, and AI learning devices such as AWS DeepRacer.

The AWS ML community is a vibrant group of developers, data scientists, researchers, and business decision-makers that dive deep into artificial intelligence and ML concepts, contribute with real-world experiences, and collaborate on building projects together.

Here are a few highlights of externally published getting started guides and tutorials curated by our AWS ML Evangelist team led by Julien Simon.

AWS ML Heroes and AWS ML Community Builder Projects

Making My Toddler’s Dream of Flying Come True with AI Tech (with code samples). In this deep dive tutorial, AWS ML Hero Agustinus Nalwan walks you through how to build an object detection model with Amazon SageMaker JumpStart (a set of solutions for the most common use cases that can be deployed readily with just a few clicks), Torch2trt (a tool to automatically convert PyTorch models into TensorRT), and NVIDIA Jetson AGX Xavier.

How to use Amazon Rekognition Custom Labels to analyze AWS DeepRacer Real World Performance Through Video (with code samples). In this deep dive tutorial, AWS ML Community Builder Pui Kwan Ho shows you how to analyze the path and speed of an AWS DeepRacer device using pretrained computer vision with Amazon Rekognition Custom Labels.

AWS Panorama Appliance Developers Kit: An Unboxing and Walkthrough (with code samples). In this video, AWS ML Hero Mike Chambers shows you how to get started with AWS Panorama, an ML appliance and software development kit (SDK) that allows developers to bring computer vision to on-premises cameras and make predictions locally with high accuracy and low latency.

Improving local food processing with Amazon Lookout for Vision (with code samples). In this deep dive tutorial, AWS ML Hero Olalekan Elesin demonstrates how to use AI to improve the quality of food sorting (using cassava flakes) cost-effectively and with zero AI knowledge.

Conclusion

Whether you’re just getting started with ML, already an expert, or something in between, there is always something to learn. Choose from community-created and ML-focused blogs, videos, eLearning guides, and much more from the AWS ML community.

Are you interested in contributing to the community? Apply to the AWS Community Builders program today!

 

The content and opinions in the preceding linked posts are those of the third-party authors and AWS is not responsible for the content or accuracy of those posts.


About the Author

Cameron Peron is Senior Marketing Manager for AWS Amazon Rekognition and the AWS AI/ML community. He evangelizes how AI/ML innovation solves complex challenges facing communities, enterprises, and startups alike. Out of the office, he enjoys staying active with kettlebell sport, spending time with his family and friends, and is an avid fan of Euro-league basketball.

Read More

Configure Amazon Forecast for a multi-tenant SaaS application

Amazon Forecast is a fully managed service that is based on the same technology used for forecasting at Amazon.com. Forecast uses machine learning (ML) to combine time series data with additional variables to build highly accurate forecasts. Forecast requires no ML experience to get started. You only need to provide historical data and any additional data that may impact forecasts.

Customers are turning toward using a software as a service (SaaS) model for delivery of multi-tenant solutions. You can build SaaS applications with a variety of different architectural models to meet regulatory and compliance requirements. Depending on the SaaS model, resources like Forecast are shared across tenants. Forecast data access, monitoring, and billing need to be considered per tenant when deploying SaaS solutions.

This post outlines how to use Forecast within a multi-tenant SaaS application using Attribute Based Access Control (ABAC) in AWS Identity and Access Management (IAM) to provide these capabilities. ABAC is a powerful approach that you can use to isolate resources across tenants.

In this post, we provide guidance on setting up IAM policies for tenants using ABAC principles and Forecast. To demonstrate the configuration, we set up two tenants, TenantA and TenantB, and show a use case in the context of a SaaS application using Forecast. In our use case, TenantB can’t delete TenantA resources, and vice versa. The following diagram illustrates our architecture.

TenantA and TenantB have services running as microservices within Amazon Elastic Kubernetes Service (Amazon EKS). The tenant application uses Forecast as part of its business flow.

Forecast data ingestion

Forecast imports data from the tenant’s Amazon Simple Storage Service (Amazon S3) bucket to the Forecast managed S3 bucket. Data can be encrypted in transit and at rest automatically using Forecast managed keys or tenant-specific keys through AWS Key Management Service (AWS KMS). The tenant-specific key can be created by the SaaS application as part of onboarding, or the tenant can provide their own customer managed key (CMK) using AWS KMS. Revoking permission on the tenant-specific key prevents Forecast from using the tenant’s data. We recommend using a tenant-specific key and an IAM role per tenant in a multi-tenant SaaS environment. This enables securing data on a tenant-by-tenant basis.

Solution overview

You can partition data on Amazon S3 to segregate tenant access in different ways. For this post, we discuss two strategies:

  • Use one S3 bucket per tenant
  • Use a single S3 bucket and separate tenant data with a prefix

For more information about various strategies, see the Storing Multi-Tenant Data on Amazon S3 GitHub repo.

When using one bucket per tenant, you use an IAM policy to restrict access to a given tenant S3 bucket. For example:

s3://tenant_a [ Tag tenant = tenant_a ]
s3://tenant_b [ Tag tenant = tenant_b ]

There is a hard limit on the number of S3 buckets per account. A multi-account strategy needs to be considered to overcome this limit.

In our second option, tenant data is separated using an S3 prefix in a single S3 bucket. We use an IAM policy to restrict access within a bucket prefix per tenant. For example:

s3://<bucketname>/tenant_a

For this post, we use the second option of assigning S3 prefixes within a single bucket. We encrypt tenant data using CMKs in AWS KMS.

Tenant onboarding

SaaS applications rely on a frictionless model for introducing new tenants into their environment. This often requires orchestrating several components to successfully provision and configure all the elements needed to create a new tenant. This process, in SaaS architecture, is referred to as tenant onboarding. This can be initiated directly by tenants or as part of a provider-managed process. The following diagram illustrates the flow of configuring Forecast per tenant as part of onboarding process.

Resources are tagged with tenant information. For this post, we tag resources with a tenant value, for example, tenant_a.

Create a Forecast role

This IAM role is assumed by Forecast per tenant. You should apply the policies shown in the following section to allow Forecast to interact with Amazon S3 and AWS KMS in the customer account. The role is tagged with the tenant tag. For example, see the following code:

TenantA create role Forecast_TenantA_Role [ Tag tenant = tenant_a ]
TenantB create role Forecast_TenantB_Role [ Tag tenant = tenant_b ]
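
A minimal boto3 sketch of creating one of these roles follows. The trust policy allows Forecast to assume the role, and the tenant tag is what the ABAC conditions in the following policies match against:

import json
import boto3

iam = boto3.client('iam')

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "forecast.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

iam.create_role(
    RoleName='Forecast_TenantA_Role',
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Tags=[{'Key': 'tenant', 'Value': 'tenant_a'}])  # tag used by the ABAC conditions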

Create the policies

In this next step, we create policies for our Forecast role. For this post, we split them into two policies for more readability, but you can create them according to your needs.

Policy 1: Forecast read-only access

The following policy gives privileges to describe, list, and query Forecast resources. This policy restricts Forecast to read-only access. The tenant tag validation condition makes sure that the tenant tag value matches the principal’s tenant tag; see the Condition block in the following code.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DescribeQuery",
            "Effect": "Allow",
            "Action": [
                "forecast:GetAccuracyMetrics",
                "forecast:ListTagsForResource",
                "forecast:DescribeDataset",
                "forecast:DescribeForecast",
                "forecast:DescribePredictor",
                "forecast:DescribeDatasetImportJob",
                "forecast:DescribePredictorBacktestExportJob",
                "forecast:DescribeDatasetGroup",
                "forecast:DescribeForecastExportJob",
                "forecast:QueryForecast"
            ],
            "Resource": [
                "arn:aws:forecast:*:<accountid>:dataset-import-job/*",
                "arn:aws:forecast:*:<accountid>:dataset-group/*",
                "arn:aws:forecast:*:<accountid>:predictor/*",
                "arn:aws:forecast:*:<accountid>:forecast/*",
                "arn:aws:forecast:*:<accountid>:forecast-export-job/*",
                "arn:aws:forecast:*:<accountid>:dataset/*",
                "arn:aws:forecast:*:<accountid>:predictor-backtest-export-job/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/tenant":"${aws:PrincipalTag/tenant}"
                }
            }
        },
        {
            "Sid": "List",
            "Effect": "Allow",
            "Action": [
                "forecast:ListDatasetImportJobs",
                "forecast:ListDatasetGroups",
                "forecast:ListPredictorBacktestExportJobs",
                "forecast:ListForecastExportJobs",
                "forecast:ListForecasts",
                "forecast:ListPredictors",
                "forecast:ListDatasets"
            ],
            "Resource": "*"
        }
    ]
}

Policy 2: Amazon S3 and AWS KMS access policy

The following policy gives privileges to use AWS KMS and to access the tenant’s S3 prefix. The tenant tag validation condition in the Condition element makes sure that the tenant tag value matches the principal’s tenant tag.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "KMS",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:Encrypt",
                "kms:RevokeGrant",
                "kms:GenerateDataKey",
                "kms:DescribeKey",
                "kms:RetireGrant",
                "kms:CreateGrant",
                "kms:ListGrants"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/tenant":"${aws:PrincipalTag/tenant}"
                }
            }
        },
        {
            "Sid": "S3Access",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject", 
                "s3:PutObject",
                "s3:GetObjectVersionTagging",
                "s3:GetObjectAcl",
                "s3:GetObjectVersionAcl",
                "s3:GetBucketPolicyStatus",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:ListAccessPoints",
                "s3:GetObjectVersion"
            ],
            "Resource": [
                "arn:aws:s3:::<bucketname>/*",
                "arn:aws:s3:::<bucketname>"
            ],
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "${aws:PrincipalTag/tenant}",
                        "${aws:PrincipalTag/tenant}/*"
                    ]
                }
            }
        }
    ]
}

Create a tenant-specific key

We now create a key in AWS KMS for each tenant and tag it with the tenant tag value. Alternatively, the tenant can bring their own key to AWS KMS. We give the preceding roles (Forecast_TenantA_Role or Forecast_TenantB_Role) access to the tenant-specific key.

For example, the following screenshot shows the key-value pair of tenant and tenant_a.

The following screenshot shows the IAM roles that can use this key.
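
If you script this step, a sketch with the Python (Boto3) API might look like the following. Note that AWS KMS uses TagKey/TagValue (rather than Key/Value) for tags, and the key policy or grants still need to allow the per-tenant Forecast role to use the key. The alias name is a hypothetical example.

import boto3

kms = boto3.client('kms')

# Create a CMK for TenantA and tag it with the tenant value
key = kms.create_key(
    Description='TenantA key for Forecast data',
    Tags=[{'TagKey': 'tenant', 'TagValue': 'tenant_a'}]
)
tenant_a_key_arn = key['KeyMetadata']['Arn']

# Optional: a friendly alias for the tenant key
kms.create_alias(AliasName='alias/tenant_a', TargetKeyId=key['KeyMetadata']['KeyId'])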

Create an application role

The second role we create is assumed by the SaaS application for each tenant. Apply the following policy to allow the application to interact with Forecast, Amazon S3, and AWS KMS, and tag the role with the tenant tag. See the following code:

TenantA create role TenantA_Application_Role  [ Tag tenant = tenant_a]
TenantB create role TenantB_Application_Role  [ Tag tenant = tenant_b]

Create the policies

We now create policies for the application role. For this post, we split them into two policies for more readability, but you can create them according to your needs.

Policy 1: Forecast access

The following policy gives privileges to create, update, and delete Forecast resources. The policy enforces the tag requirement during creation. In addition, it restricts the list, describe, and delete actions on resources to the respective tenant. This policy includes iam:PassRole to allow Forecast to assume the role.

The tenant tag validation condition in the Condition element makes sure that the tenant tag value matches the principal’s tenant tag.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CreateDataSet",
            "Effect": "Allow",
            "Action": [
                "forecast:CreateDataset",
                "forecast:CreateDatasetGroup",
                "forecast:TagResource"
            ],
            "Resource": [
                "arn:aws:forecast:*:<accountid>:dataset-import-job/*",
                "arn:aws:forecast:*:<accountid>:dataset-group/*",
                "arn:aws:forecast:*:<accountid>:predictor/*",
                "arn:aws:forecast:*:<accountid>:forecast/*",
                "arn:aws:forecast:*:<accountid>:forecast-export-job/*",
                "arn:aws:forecast:*:<accountid>:dataset/*",
                "arn:aws:forecast:*:<accountid>:predictor-backtest-export-job/*"
            ],
            "Condition": {
                "ForAnyValue:StringEquals": {
                    "aws:TagKeys": [ "tenant" ]
                },
                "StringEquals": {
                    "aws:RequestTag/tenant": "${aws:PrincipalTag/tenant}"
                }
            }
        },
        {
            "Sid": "CreateUpdateDescribeQueryDelete",
            "Effect": "Allow",
            "Action": [
                "forecast:CreateDatasetImportJob",
                "forecast:CreatePredictor",
                "forecast:CreateForecast",
                "forecast:CreateForecastExportJob",
                "forecast:CreatePredictorBacktestExportJob",
                "forecast:GetAccuracyMetrics",
                "forecast:ListTagsForResource",
                "forecast:UpdateDatasetGroup",
                "forecast:DescribeDataset",
                "forecast:DescribeForecast",
                "forecast:DescribePredictor",
                "forecast:DescribeDatasetImportJob",
                "forecast:DescribePredictorBacktestExportJob",
                "forecast:DescribeDatasetGroup",
                "forecast:DescribeForecastExportJob",
                "forecast:QueryForecast",
                "forecast:DeletePredictorBacktestExportJob",
                "forecast:DeleteDatasetImportJob",
                "forecast:DeletePredictor",
                "forecast:DeleteDataset",
                "forecast:DeleteDatasetGroup",
                "forecast:DeleteForecastExportJob",
                "forecast:DeleteForecast"
            ],
            "Resource": [
                "arn:aws:forecast:*:<accountid>:dataset-import-job/*",
                "arn:aws:forecast:*:<accountid>:dataset-group/*",
                "arn:aws:forecast:*:<accountid>:predictor/*",
                "arn:aws:forecast:*:<accountid>:forecast/*",
                "arn:aws:forecast:*:<accountid>:forecast-export-job/*",
                "arn:aws:forecast:*:<accountid>:dataset/*",
                "arn:aws:forecast:*:<accountid>:predictor-backtest-export-job/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/tenant": "${aws:PrincipalTag/tenant}"
                }
            }
        },
        {
            "Sid": "IAMPassRole",
            "Effect": "Allow",
            "Action": [
                "iam:GetRole",
                "iam:PassRole"
            ],
            "Resource": "--Provide Resource ARN--"
        },
        {
            "Sid": "ListAccess",
            "Effect": "Allow",
            "Action": [
                "forecast:ListDatasetImportJobs",
                "forecast:ListDatasetGroups",
                "forecast:ListPredictorBacktestExportJobs",
                "forecast:ListForecastExportJobs",
                "forecast:ListForecasts",
                "forecast:ListPredictors",
                "forecast:ListDatasets"
            ],
            "Resource": "*"
        }
    ]
}

Policy 2: Amazon S3, AWS KMS, Amazon CloudWatch, and resource group access

The following policy gives privileges to access Amazon S3 and AWS KMS resources, and also Amazon CloudWatch. It limits access to the tenant-specific S3 prefix and the tenant-specific CMK. Scope the s3:* action in the first statement down to the actions your application actually needs. The tenant validation conditions appear in the Condition element of each statement.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "S3Storage",
            "Effect": "Allow",
            "Action": [
                "s3:*" ---> To be modifed based on application needs
            ],
            "Resource": [
                "arn:aws:s3:::<bucketname>",
                "arn:aws:s3:::<bucketname>/*"
            ],
            "Condition": {
                "StringLike": {
                    "s3:prefix": [ "${aws:PrincipalTag/tenant}", "${aws:PrincipalTag/tenant}/*"
                    ]
                }
            }
        },
        {
            "Sid": "ResourceGroup",
            "Effect": "Allow",
            "Action": [
                "resource-groups:SearchResources",
                "tag:GetResources",
                "tag:GetTagKeys",
                "tag:GetTagValues",
                "resource-explorer:List*",
                "cloudwatch:PutMetricData"
            ],
            "Resource": "*"
        },
        {
            "Sid": "KMS",
            "Effect": "Allow",
            "Action": [
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:CreateGrant",
                "kms:RevokeGrant",
                "kms:RetireGrant",
                "kms:ListGrants",
                "kms:DescribeKey",
                "kms:GenerateDataKey"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/tenant": "${aws:PrincipalTag/tenant}"
                }
            }
        }
    ]
}

Create a resource group

The resource group allows all tagged resources to be queried by the tenant. The following example code uses the AWS Command Line Interface (AWS CLI) to create a resource group for TenantA:

aws resource-groups create-group --name TenantA --tags tenant=tenant_a --resource-query '{"Type":"TAG_FILTERS_1_0","Query":"{\"ResourceTypeFilters\":[\"AWS::AllSupported\"],\"TagFilters\":[{\"Key\":\"tenant\",\"Values\":[\"tenant_a\"]}]}"}'

Forecast application flow

The following diagram illustrates our Forecast application flow. The application service assumes the IAM role for the tenant and, as part of its business flow, invokes the Forecast API.
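
The snippets that follow assume the application is already running under the tenant’s role. If your service assumes the role explicitly, a sketch with AWS STS might look like the following; the role ARN and session name are placeholders.

import boto3

sts = boto3.client('sts')

# Assume the per-tenant application role before calling Forecast
assumed = sts.assume_role(
    RoleArn='arn:aws:iam::<accountid>:role/TenantB_Application_Role',
    RoleSessionName='tenant-b-session'
)
credentials = assumed['Credentials']

# Build a session scoped to the tenant role; the Forecast calls below run as this principal
session = boto3.Session(
    aws_access_key_id=credentials['AccessKeyId'],
    aws_secret_access_key=credentials['SecretAccessKey'],
    aws_session_token=credentials['SessionToken']
)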

Create a predictor for TenantB

Resources created should be tagged with the tenant tag. The following code uses the Python (Boto3) API to create a predictor for TenantB; note the Tags and EncryptionConfig parameters:

# Run under TenantB role TenantB_Application_Role
session = boto3.Session()
forecast = session.client(service_name='forecast')
...
response = forecast.create_dataset(
                    Domain="CUSTOM",
                    DatasetType='TARGET_TIME_SERIES',
                    DatasetName=datasetName,
                    DataFrequency=DATASET_FREQUENCY,
                    Schema=schema,
                    Tags=[{'Key': 'tenant', 'Value': 'tenant_b'}],
                    EncryptionConfig={'KMSKeyArn': 'KMS_TenantB_ARN', 'RoleArn': Forecast_TenantB_Role}
)
...
create_predictor_response = forecast.create_predictor(
                    ...
                    EncryptionConfig={'KMSKeyArn': 'KMS_TenantB_ARN', 'RoleArn': Forecast_TenantB_Role},
                    Tags=[{'Key': 'tenant', 'Value': 'tenant_b'}],
                    ...
)
predictor_arn = create_predictor_response['PredictorArn']

Create a forecast on the predictor for TenantB

The following code uses the Python (Boto3) API to create a forecast on the predictor you just created:

# Run under TenantB role TenantB_Application_Role
session = boto3.Session()
forecast = session.client(service_name='forecast')
...
create_forecast_response = forecast.create_forecast(
                    ForecastName=forecastName,
                    PredictorArn=predictor_arn,
                    Tags=[{'Key': 'tenant', 'Value': 'tenant_b'}])
tenant_b_forecast_arn = create_forecast_response['ForecastArn']

Validate access to Forecast resources

In this section, we confirm that only the respective tenant can access its Forecast resources. Accessing, modifying, or deleting Forecast resources that belong to a different tenant throws an error. The following code uses the Python (Boto3) API to demonstrate TenantA attempting to delete a TenantB Forecast resource:

# Run under TenantA role TenantA_Application_Role
session = boto3.Session()
forecast = session.client(service_name='forecast')
...
forecast.delete_forecast(ForecastArn=tenant_b_forecast_arn)

ClientError: An error occurred (AccessDeniedException) when calling the DeleteForecast operation: User: arn:aws:sts::<accountid>:assumed-role/TenantA_Application_Role/tenant-a-role is not authorized to perform: forecast:DeleteForecast on resource: arn:aws:forecast:<region>:<accountid>:forecast/tenantb_deeparp_algo_forecast

List and monitor predictors

The following example code uses the Python (Boto3) API to query Forecast predictors for TenantA using resource groups:

# Run under TenantA role TenantA_Application_Role
session = boto3.Session()
resourcegroup = session.client(service_name='resource-groups')

# The tenant tag needs to be specified in the query.
query = '{"ResourceTypeFilters":["AWS::Forecast::Predictor"],"TagFilters":[{"Key":"tenant", "Values":["tenant_a"]}]}'

response = resourcegroup.search_resources(
    ResourceQuery={
        'Type': 'TAG_FILTERS_1_0',
        'Query': query
    },
    MaxResults=20
)

predictor_count=0
for resource in response['ResourceIdentifiers']:
    print(resource['ResourceArn'])
    predictor_count=predictor_count+1

As the AWS Well-Architected Framework explains, it’s important to monitor service quotas (which are also referred to as service limits). Forecast has limits per account; for more information, see Guidelines and Quotas.

The following code is an example of populating a CloudWatch metric with the total number of predictors:

cloudwatch = session.client(service_name='cloudwatch')
cwresponse = cloudwatch.put_metric_data(
    Namespace='TenantA_PredictorCount',
    MetricData=[
        {
            'MetricName': 'TotalPredictors',
            'Value': predictor_count
        }
    ]
)
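
Building on this metric, you could alarm when a tenant approaches the predictor quota. The following is a sketch only, reusing the cloudwatch client from the preceding code; the threshold, alarm name, and SNS topic ARN are placeholders you would set according to your account’s Forecast quotas.

# Alarm when TenantA's predictor count approaches the account quota (placeholder threshold)
cloudwatch.put_metric_alarm(
    AlarmName='TenantA-PredictorCount-NearQuota',
    Namespace='TenantA_PredictorCount',
    MetricName='TotalPredictors',
    Statistic='Maximum',
    Period=3600,
    EvaluationPeriods=1,
    Threshold=80,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    AlarmActions=['arn:aws:sns:<region>:<accountid>:forecast-quota-alerts']
)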

Other considerations

Resource limits and throttling need to be managed by the application across tenants. If you can’t accommodate the Forecast limits, you should consider a multi-account configuration.
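
One lightweight way to soften throttling at the client level is to enable the SDK’s adaptive retry mode, as in the following sketch. This doesn’t replace application-level rate limiting across tenants; it only retries throttled calls with client-side backoff.

import boto3
from botocore.config import Config

# Retry throttled Forecast calls with adaptive client-side rate limiting
retry_config = Config(retries={'max_attempts': 10, 'mode': 'adaptive'})
forecast = boto3.Session().client('forecast', config=retry_config)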

The responses from the Forecast List APIs or resource group queries need to be filtered by the application based on the tenant tag value.
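
For example, a sketch of filtering the ListForecasts response down to a single tenant by checking each resource’s tags might look like the following, using the forecast client from the preceding sketch. For larger accounts you would page through results with NextToken.

# Keep only forecasts tagged for tenant_a (the List APIs themselves are not tenant-scoped)
tenant = 'tenant_a'
tenant_forecasts = []
response = forecast.list_forecasts(MaxResults=100)
for item in response['Forecasts']:
    tags = forecast.list_tags_for_resource(ResourceArn=item['ForecastArn'])['Tags']
    if any(t['Key'] == 'tenant' and t['Value'] == tenant for t in tags):
        tenant_forecasts.append(item['ForecastArn'])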

Conclusion

In this post, we demonstrated how to isolate Forecast access using the ABAC technique in a multi-tenant SaaS application. We showed how to limit access to Forecast by tenant using the tenant tag. You can further customize policies by applying more tags, or apply this strategy to other AWS services.

For more information about using ABAC as an authorization strategy, see What is ABAC for AWS?


About the Authors

Gunjan Garg is a Sr. Software Development Engineer in the AWS Vertical AI team. In her current role at Amazon Forecast, she focuses on engineering problems and enjoys building scalable systems that provide the most value to end users. In her free time, she enjoys playing Sudoku and Minesweeper.

Matias Battaglia is a Technical Account Manager at Amazon Web Services. In his current role, he enjoys helping customers at all the stages of their cloud journey. On his free time, he enjoys building AI/ML projects.

Rakesh Ramadas is an ISV Solution Architect at Amazon Web Services. His focus areas include AI/ML and Big Data.

Read More

Introducing Amazon Lookout for Metrics: An anomaly detection service to proactively monitor the health of your business

Anomalies are unexpected changes in data, which could point to a critical issue. An anomaly could be a technical glitch on your website, or an untapped business opportunity. It could be a new marketing channel with exceedingly high customer conversions. As businesses produce more data than ever before, detecting these unexpected changes and responding in a timely manner is essential, yet challenging. Delayed responses cost businesses millions of dollars, missed opportunities, and the risk of losing the trust of their customers.

We’re excited to announce the general availability of Amazon Lookout for Metrics, a new service that uses machine learning (ML) to automatically monitor the metrics that are most important to businesses with greater speed and accuracy. The service also makes it easier to diagnose the root cause of anomalies like unexpected dips in revenue, high rates of abandoned shopping carts, spikes in payment transaction failures, increases in new user sign-ups, and many more. Lookout for Metrics goes beyond simple anomaly detection. It allows developers to set up autonomous monitoring for important metrics to detect anomalies and identify their root cause in a matter of a few clicks, using the same technology used by Amazon internally to detect anomalies in its metrics—all with no ML experience required.

You can connect Lookout for Metrics to 19 popular data sources, including Amazon Simple Storage Service (Amazon S3), Amazon CloudWatch, Amazon Relational Database Service (Amazon RDS), and Amazon Redshift, as well as software as a service (SaaS) applications like Salesforce, Marketo, and Zendesk, to continuously monitor metrics important to your business. Lookout for Metrics automatically inspects and prepares the data, uses ML to detect anomalies, groups related anomalies together, and summarizes potential root causes. The service also ranks anomalies by severity so you can prioritize which issue to tackle first.

Lookout for Metrics easily connects to notification and event services like Amazon Simple Notification Service (Amazon SNS), Slack, PagerDuty, and AWS Lambda, allowing you to create customized alerts or actions like filing a trouble ticket or removing an incorrectly priced product from a retail website. As the service begins returning results, you can also provide feedback on the relevancy of detected anomalies via the Lookout for Metrics console or the API, and the service uses this input to continuously improve its accuracy over time.

Digitata, a telecommunication analytics provider, intelligently transforms pricing and subscriber engagement for mobile network operators (MNOs), empowering them to make better and more informed business decisions. One of Digitata’s MNO customers had made an erroneous update to their pricing platform, which led to them charging their end customers the maximum possible price for their internet data bundles. Lookout for Metrics immediately identified that this update had led to a drop of over 16% in their active purchases and notified the customer within minutes of the incident using Amazon SNS. The customer was also able to attribute the drop to the latest updates to the pricing platform using Lookout for Metrics. With a clear and immediate remediation path, the customer was able to deploy a fix within 2 hours of getting notified. Without Lookout for Metrics, it would have taken Digitata approximately a day to identify and triage the issue, and would have led to a 7.5% drop in customer revenue, in addition to the risk of losing the trust of their end customers.

Solution overview

This post demonstrates how you can set up anomaly detection on a sample ecommerce dataset using Lookout for Metrics. The solution allows you to download relevant datasets, set up continuous anomaly detection, and optionally set up alerts to receive notifications in case anomalies occur.

Our sample dataset is designed to detect abnormal changes in revenue and views for the ecommerce website across major supported platforms like pc_web, mobile_web, and mobile_app and marketplaces like US, UK, DE, FR, ES, IT, and JP.

The following diagram shows the architecture of our continuous detection system.

Building this system requires three simple steps:

  1. Create an S3 bucket and upload your sample dataset.
  2. Create a detector for Lookout for Metrics.
  3. Add a dataset and activate the detector to start learning and continuous detection.

Then you can review and analyze the results.

If you’re familiar with Python and Jupyter, you can get started immediately by following along with the GitHub repo; this post walks you through getting started with the service. After you set up the detection system, you can optionally define alerts that notify you when anomalies are found that meet or exceed a specified severity threshold.

Create an S3 bucket and upload your sample dataset

Download the sample dataset and save it locally. Then continue through the following steps:

  1. Create an S3 bucket.

The bucket name needs to be globally unique, and the bucket must be in the same Region where you’re using Lookout for Metrics. For this post, we use the bucket 059124553121-lookoutmetrics-lab.

  2. After you create the bucket, extract the demo dataset on your local machine.

You should have a folder named ecommerce.

  3. On the Amazon S3 console, open the bucket you created.

  4. Choose Upload.

  5. Upload the ecommerce folder.

It takes a few moments to process the files.

  6. When the files are processed, choose Upload.

Do not navigate away from this page while the upload is still processing. You can move to the next step when your dataset is ready.

Alternatively, you can use the AWS Command Line Interface (AWS CLI) to upload the file in just a few minutes using the following command:

!aws s3 sync {data_dirname}/ecommerce/ s3://{s3_bucket}/ecommerce/ --quiet

Create a detector for Lookout for Metrics

To create your detector, complete the following steps:

  1. On the Lookout for Metrics console, choose Create detector.

  2. For Name, enter a detector name.
  3. For Description, enter a description.
  4. For Interval, choose 1 hour intervals.
  5. Optionally, you can modify encryption settings.
  6. Choose Create.
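
If you prefer to script the preceding console steps, a rough sketch with the Python (Boto3) API might look like the following; the detector name and description are placeholders.

import boto3

lookout = boto3.client('lookoutmetrics')

# Create a detector that looks for anomalies every hour
detector = lookout.create_anomaly_detector(
    AnomalyDetectorName='ecommerce-continuous-detector',
    AnomalyDetectorDescription='Continuous anomaly detection on ecommerce revenue and views',
    AnomalyDetectorConfig={'AnomalyDetectorFrequency': 'PT1H'}
)
detector_arn = detector['AnomalyDetectorArn']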

Add a dataset and activate the detector

You now configure your dataset and metrics.

  1. Choose Add a dataset.

  2. For Name, enter a name for your dataset.
  3. For Description, enter a description.
  4. For Timezone, choose the UTC timezone.
  5. For Datasource, choose Amazon S3.

We use Amazon S3 as our data source in this post, but Lookout for Metrics can connect to 19 popular data sources, including CloudWatch, Amazon RDS, and Amazon Redshift, as well as SaaS applications like Salesforce, Marketo, and Zendesk.

You can also set an offset parameter for data that takes a while to arrive, so that Lookout for Metrics waits until your data has arrived before it starts reading. This is helpful for long-running jobs that may feed into Amazon S3.

  6. For Detector mode, select either Backtest or Continuous.

Backtesting allows you to detect anomalies on historical data. This feature is helpful when you want to try out the service on past data or validate against known anomalies that occurred in the past. For this post, we use continuous mode, where you can detect anomalies on live data continuously, as they occur.

  7. For Path to an object in your continuous dataset, enter the value for S3_Live_Data_URI.
  8. For Continuous data S3 path pattern, choose the S3 path with your preferred time format (for this post, we choose the path with {{yyyyMMdd}}/{{HHmm}}), which is the third option in the drop-down menu.

The options on the drop-down menu adapt for your data.

  9. For Datasource interval, choose your preferred time interval (for this post, we choose 1 hour intervals).

For continuous mode, you have the option to provide your historical data, if you have any, for the system to proactively learn from. If you don’t have historical data, you can get started with real-time or live data, and Lookout for Metrics learns on the go. For this post, we have historical data for learning, so we select the corresponding check box in Step 10.

  10. For Historical data, select Use historical data.
  11. For S3 location of historical data, enter the Amazon S3 URI for historical data that you collected earlier.

You need to provide the ARN of an AWS Identity and Access Management (IAM) role to allow Lookout for Metrics to read from your S3 bucket. You can use an existing role or create a new one. For this post, we use an existing role.

  12. For Service role, choose Enter the ARN of a role.
  13. Enter the role ARN.
  14. Choose Next.

The service now validates your data.

  15. Choose OK.

On the Map files page, you specify which fields you want to run anomaly detection on. Measures are the key performance indicators on which we want to detect anomalies, and dimensions are the categorical information about the measures. You may want to monitor your data for anomalies in number of views or revenue for every platform, marketplace, and combination of both. You can designate up to five measures and five dimensions per dataset.

  16. For Measures, choose views and revenue.
  17. For Dimensions, choose platform and marketplace.

Lookout for Metrics analyzes each combination of these measures and dimensions. For our example, we have seven unique marketplace values (US, UK, DE, FR, ES, IT, and JP) and three unique platform values (pc_web, mobile_web, and mobile_app), for a total of 21 unique combinations. Each unique combination of measures and dimension values is a metric. In this case, you have 21 dimension-value combinations and two measures, for a total of 42 time series metrics. Lookout for Metrics detects anomalies at the most granular level so you can pinpoint any unexpected behavior in your data.

  18. For Timestamp, choose your timestamp formatting (for this post, we use the default 24-hour format in Python’s Pandas package, yyyy-MM-dd HH:mm:ss).

  19. Choose Next.

  20. Review your setup and choose Save and activate.

You’re redirected to the detector page, where you can see that the job has started. This process takes 20–25 minutes to complete.
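
For reference, the dataset configuration and activation from these steps roughly correspond to the following Boto3 calls, reusing the lookout client and detector_arn from the earlier sketch. The bucket paths and role ARN are placeholders; see the GitHub repo mentioned earlier for a complete walkthrough.

# Attach the ecommerce dataset (a "metric set") to the detector created above
lookout.create_metric_set(
    AnomalyDetectorArn=detector_arn,
    MetricSetName='ecommerce-metrics',
    MetricList=[
        {'MetricName': 'views', 'AggregationFunction': 'SUM'},
        {'MetricName': 'revenue', 'AggregationFunction': 'SUM'}
    ],
    DimensionList=['platform', 'marketplace'],
    TimestampColumn={'ColumnName': 'timestamp', 'ColumnFormat': 'yyyy-MM-dd HH:mm:ss'},
    MetricSetFrequency='PT1H',
    MetricSource={
        'S3SourceConfig': {
            'RoleArn': 'arn:aws:iam::<accountid>:role/<lookoutmetrics-role>',
            'HistoricalDataPathList': ['s3://<bucketname>/ecommerce/backtest/'],
            'TemplatedPathList': ['s3://<bucketname>/ecommerce/live/{{yyyyMMdd}}/{{HHmm}}'],
            'FileFormatDescriptor': {
                'CsvFormatDescriptor': {'FileCompression': 'NONE', 'ContainsHeader': True, 'Delimiter': ','}
            }
        }
    }
)

# Start learning and continuous detection
lookout.activate_anomaly_detector(AnomalyDetectorArn=detector_arn)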

From here, you can optionally define alerts. Lookout for Metrics can automatically send you alerts through channels such as Amazon SNS, Datadog, PagerDuty, Webhooks, and Slack, or trigger custom actions using Lambda, reducing your time to resolution.
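
For example, a sketch of defining an SNS alert for high-severity anomalies through the API might look like the following, again reusing the lookout client and detector_arn from the earlier sketches. The topic and role ARNs are placeholders, and the role must allow Lookout for Metrics to publish to the topic.

# Alert on anomalies with a severity score of 70 or higher via Amazon SNS
lookout.create_alert(
    AlertName='ecommerce-high-severity',
    AlertSensitivityThreshold=70,
    AnomalyDetectorArn=detector_arn,
    Action={
        'SNSConfiguration': {
            'RoleArn': 'arn:aws:iam::<accountid>:role/<lookoutmetrics-sns-role>',
            'SnsTopicArn': 'arn:aws:sns:<region>:<accountid>:<topic-name>'
        }
    }
)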

Congratulations, your first detector is up and running.

Review and analyze the results

When detecting an anomaly, Lookout for Metrics helps you focus on what matters most by assigning a severity score to aid prioritization. To help you find the root cause, it intelligently groups anomalies that may be related to the same incident and summarizes the different sources of impact. In the following screenshot, the anomaly in revenue on March 8 at 22:00 GMT had a severity score of 97, indicating a high severity anomaly that needs immediate attention. The impact analysis also tells you that the Mobile_web platform in Germany (DE) saw the highest impact.

Lookout for Metrics also allows you to provide real-time feedback on the relevance of the detected anomalies, enabling a powerful human-in-the-loop mechanism. This information is fed back to the anomaly detection model to improve its accuracy continuously, in near-real time.

Clean up

To avoid incurring ongoing charges, delete the following resources created in this post:

  • Detector
  • S3 bucket
  • IAM role

Conclusion

Lookout for Metrics is available directly via the AWS Management Console, the AWS SDKs, and the AWS CLI, as well as through supporting partners to help you easily implement customized solutions. The service is also compatible with AWS CloudFormation and can be used in compliance with the European Union’s General Data Protection Regulation (GDPR).

As of this writing, Lookout for Metrics is available in the following Regions:

  • US East (Ohio)
  • US East (N. Virginia)
  • US West (Oregon)
  • Asia Pacific (Singapore)
  • Asia Pacific (Sydney)
  • Asia Pacific (Tokyo)
  • Europe (Ireland)
  • Europe (Frankfurt)
  • Europe (Stockholm)

To get started building your first detector, adding metrics via various popular data sources, and creating custom alerts and actions via the output channel of your choice, see Amazon Lookout for Metrics Documentation.


About the Authors

Ankita Verma is the Product Lead for Amazon Lookout for Metrics. Her current focus is helping businesses make data-driven decisions using AI/ML. Outside of AWS, she is a fitness enthusiast, and loves mentoring budding product managers and entrepreneurs in her free time. She also publishes a weekly product management newsletter called ‘The Product Mentors’ on Substack.

Chris King is a Senior Solutions Architect in Applied AI with AWS. He has a special interest in launching AI services and helped grow and build Amazon Personalize and Amazon Forecast before focusing on Amazon Lookout for Metrics. In his spare time he enjoys cooking, reading, boxing, and building models to predict the outcome of combat sports.

Read More

Amazon Kendra adds new search connectors from AWS Partner, Perficient, to help customers search enterprise content faster

Today, Amazon Kendra is making nine new search connectors available in the Amazon Kendra connector library developed by Perficient, an AWS Partner. These include search connectors for IBM Case Manager, Adobe Experience Manager, Atlassian Jira and Confluence, and many others.

Improving the Enterprise Search Experience

These days, employees and customers expect an intuitive search experience. Search has evolved from being considered a simple feature to a critical element of how every product works.

Online searches have trained us to expect specific, relevant answers, not lists of choices, when we ask a search engine a question. Furthermore, we expect this level of accuracy everywhere we perform a search, particularly in the workplace. Unfortunately, enterprise search does not typically deliver the desired experience.

Amazon Kendra is an intelligent search service powered by machine learning. Amazon Kendra uses deep learning and reading comprehension to deliver more accurate search results. It also uses natural language understanding (NLU) to deliver a search experience that is more in line with asking an expert a specific question so you can receive a direct answer, quickly.

Amazon Kendra’s intelligent search capabilities improve the search and discovery experience, but enterprises are still faced with the challenge of connecting troves of unstructured data and making that data accessible to search. Content is often unstructured and scattered across multiple repositories, making critical information hard to find and costing employees time and effort to track down the right answer.

Data Source Connectors Make Information Searchable

Due to the diversity of sources, quality of data, and variety of connection protocols for an organization’s data sources, it’s crucial for enterprises to avoid time spent building complex ETL jobs that aggregate data sources.

To achieve this, organizations can use data source connectors to quickly unify content as part of a single, searchable index, without needing to copy or move data from an existing location to a new one. This reduces the time and effort typically associated with creating a new search solution.

In addition to unifying scattered content, data source connectors that have the right source interfaces in place, and the ability to apply rules and enrich content before users begin searching, can make a search application even more accurate.

Perficient Connectors for Amazon Kendra

Perficient has years of experience developing data source connectors for a wide range of enterprise data sources. After countless one-off implementations, they identified patterns that can be repeated for any connection type. They developed their search connector platform, Handshake, and integrated it with Amazon Kendra to deliver the following:

  • Reduced setup time when creating intelligent search applications
    • Handshake’s out-of-the-box integration with Amazon Kendra indexes, extracts, and transforms metadata, making it easy to configure and get started with whatever data source connector you choose.
  • More connectors for many popular enterprise data sources
    • With Handshake, Amazon Kendra customers can license any of Perficient’s existing source connectors, including IBM Case Manager, Adobe Experience Manager, Nuxeo, Atlassian Jira and Confluence, Elastic Path, Elasticsearch, SQL databases, file systems, any REST-based API, and 30 other interfaces ready upon request.
  • Access to the most up-to-date enterprise content
    • Data source connectors can be scheduled to automatically sync an Amazon Kendra index with a data source, to ensure you’re always securely searching through the most up-to-date content.

Additionally, you can also develop your own source connectors using Perficient’s framework as an accelerator. Wherever the data lives, whatever format, Perficient can build hooks to combine, enhance, and index the critical data to augment Amazon Kendra’s intelligent search.

Perficient has found Amazon Kendra’s intelligent search capabilities, customization potential, and APIs to be powerful tools in search experience design. We are thrilled to add our growing list of data source connectors to Amazon Kendra’s connector library.

To get started with Amazon Kendra, visit the Amazon Kendra Essentials+ workshop for an interactive walkthrough. To learn more about other Amazon Kendra data source connectors, visit the Amazon Kendra connector library. For more information about Perficient’s Handshake platform or to contact the team directly, visit their website.


About the Authors

Zach Fischer is a Solution Architect at Perficient with 10 years’ experience delivering Search and ECM implementations to over 40 customers. Recently, he has become the Product Manager for Perficient’s Handshake and Nero products. When not working, he’s playing music or DnD.

Jean-Pierre Dodel leads product management for Amazon Kendra, a new ML-powered enterprise search service from AWS. He brings 15 years of Enterprise Search and ML solutions experience to the team, having worked at Autonomy, HP, and search startups for many years prior to joining Amazon 4 years ago. JP has led the Kendra team from its inception, defining vision, roadmaps, and delivering transformative semantic search capabilities to customers like Dow Jones, Liberty Mutual, 3M, and PwC.
