Training and deploying models using TensorFlow 2 with the Object Detection API on Amazon SageMaker

With the rapid growth of object detection techniques, several frameworks with packaged pre-trained models have been developed to provide users easy access to transfer learning. For example, GluonCV, Detectron2, and the TensorFlow Object Detection API are three popular computer vision frameworks with pre-trained models.

In this post, we use Amazon SageMaker to build, train, and deploy an EfficientDet model using the TensorFlow Object Detection API. It’s built on top of TensorFlow 2, which makes it easy to construct, train, and deploy object detection models.

It also provides the TensorFlow 2 Detection Model Zoo, which is a collection of pre-trained detection models we can use to accelerate our endeavor.

SageMaker is a fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly. SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models.

Walkthrough overview

This post demonstrates how to do the following:

  • Label images using SageMaker Ground Truth
  • Generate the dataset TFRecords and label map using SageMaker Processing
  • Fine-tune an EfficientDet model with TensorFlow 2 on SageMaker
  • Monitor your model training with TensorBoard and SageMaker Debugger
  • Deploy your model on a SageMaker endpoint and visualize predictions

Prerequisites

If you want to try out each step yourself, make sure that you have the following in place:

The code repo contains the following folders with step-by-step walkthroughs via notebooks.

Preparing the data

You can follow this section by running the cells in this notebook.

The dataset

In this post, we use a dataset from iNaturalist.org and train a model to recognize bees from RGB images.

This dataset contains 500 images of bees uploaded by iNaturalist users for the purpose of recording and identifying their observations. We only use images that users have licensed under a CC0 license.

We placed the dataset in Amazon S3 in a single .zip archive that you can download directly or retrieve by following the instructions in the prepare_data.ipynb notebook on your instance.

The archive contains 500 .jpg image files, and an output.manifest file, which we explain later in the post. We also have 10 test images in the 3_predict/test_images notebook folder that we use to visualize our model predictions.

Labeling images using SageMaker Ground Truth

To train an ML model, you need large, high-quality, labeled datasets. Labeling thousands of images can become tedious and time-consuming. Thankfully, Ground Truth makes it easy to crowdsource this task. Ground Truth offers easy access to public and private human labelers for annotating datasets. It provides built-in workflows and interfaces for common labeling tasks, including drawing bounding boxes for object detection.

You can now move on to creating labeling jobs in Ground Truth. In this post, we don’t cover each step in creating a labeling job. It’s already covered in detail in the post Amazon SageMaker Ground Truth – Build Highly Accurate Datasets and Reduce Labeling Costs by up to 70%.

For our dataset, we follow the recommended workflow from the post Create high-quality instructions for Amazon SageMaker Ground Truth labeling jobs to create our labeling instructions for the labeler.

The following screenshot shows an example of a labeling job configuration in Ground Truth.

At the end of a labeling job, Ground Truth saves an output manifest file in Amazon S3, where each line corresponds to a single image and its labeled bounding boxes, alongside some metadata. See the following code:

{"source-ref":"s3://sagemaker-remars/datasets/na-bees/500/10006450.jpg","bees-500":{"annotations":[{"class_id":0,"width":95.39999999999998,"top":256.2,"height":86.80000000000001,"left":177}],"image_size":[{"width":500,"depth":3,"height":500}]},"bees-500-metadata":{"job-name":"labeling-job/bees-500","class-map":{"0":"bee"},"human-annotated":"yes","objects":[{"confidence":0.75}],"creation-date":"2019-05-16T00:15:58.914553","type":"groundtruth/object-detection"}}
{"source-ref":"s3://sagemaker-remars/datasets/na-bees/500/10022723.jpg","bees-500":{"annotations":[{"class_id":0,"width":93.8,"top":228.8,"height":135,"left":126.8}],"image_size":[{"width":375,"depth":3,"height":500}]},"bees-500-metadata":{"job-name":"labeling-job/bees-500","class-map":{"0":"bee"},"human-annotated":"yes","objects":[{"confidence":0.82}],"creation-date":"2019-05-16T00:41:33.384412","type":"groundtruth/object-detection"}}
{"source-ref":"s3://sagemaker-remars/datasets/na-bees/500/10059108.jpg","bees-500":{"annotations":[{"class_id":0,"width":157.39999999999998,"top":188.60000000000002,"height":131.2,"left":110.8}],"image_size":[{"width":375,"depth":3,"height":500}]},"bees-500-metadata":{"job-name":"labeling-job/bees-500","class-map":{"0":"bee"},"human-annotated":"yes","objects":[{"confidence":0.8}],"creation-date":"2019-05-16T00:57:28.636681","type":"groundtruth/object-detection"}}
{"source-ref":"s3://sagemaker-remars/datasets/na-bees/500/10250726.jpg","bees-500":{"annotations":[{"class_id":0,"width":77.20000000000002,"top":204,"height":94.4,"left":79.6}],"image_size":[{"width":375,"depth":3,"height":500}]},"bees-500-metadata":{"job-name":"labeling-job/bees-500","class-map":{"0":"bee"},"human-annotated":"yes","objects":[{"confidence":0.81}],"creation-date":"2019-05-16T00:34:21.300882","type":"groundtruth/object-detection"}}

For your convenience, we previously completed a labeling job called bees-500 and included the augmented manifest file output.manifest in the dataset.zip archive. In the provided notebook, we upload this dataset to the default S3 bucket before data preparation.

Generating TFRecords and the dataset label map

To use our dataset in the TensorFlow Object Detection API, we must first combine its images and labels and convert them into the TFRecord file format. The TFRecord format is a simple format for storing a sequence of binary records, which helps in data reading and processing efficiency. We also need to generate a label map, which defines the mapping between a class ID and a class name.
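
For reference, the label map is a small text file that maps each class ID to its name in the format the TensorFlow Object Detection API expects. The following is a minimal sketch for our single bee class; the exact contents are produced by the processing script, and because the API reserves ID 0 for the background class, Ground Truth class 0 typically becomes ID 1:

item {
  id: 1
  name: 'bee'
}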

In the provided preprocessing notebook, we build a custom SageMaker Processing job with our own processing container. We first build a Docker container with the necessary TensorFlow image, Python libraries, and code to run those steps and push it to an Amazon Elastic Container Registry (Amazon ECR) repository. We then launch a processing job, which runs the pushed container and prepares the data for training. See the following code:

data_processor = Processor(role=role, 
                           image_uri=container, 
                           instance_count=1, 
                           instance_type='ml.m5.xlarge',
                           volume_size_in_gb=30, 
                           max_runtime_in_seconds=1200,
                           base_job_name='tf2-object-detection')

input_folder = '/opt/ml/processing/input'
ground_truth_manifest = '/opt/ml/processing/input/output.manifest'
label_map = '{"0": "bee"}' # each class ID should map to the human readable equivalent
output_folder = '/opt/ml/processing/output'

data_processor.run(
    arguments= [
        f'--input={input_folder}',
        f'--ground_truth_manifest={ground_truth_manifest}',
        f'--label_map={label_map}',
        f'--output={output_folder}'
    ],
    inputs = [
        ProcessingInput(
            input_name='input',
            source=s3_input,
            destination=input_folder
        )
    ],
    outputs= [
        ProcessingOutput(
            output_name='tfrecords',
            source=output_folder,
            destination=f's3://{bucket}/data/bees/tfrecords'
        )
    ]
)

The job takes the .jpg images, the output.manifest, and the dictionary of classes as Amazon S3 inputs. It splits the dataset into training and validation sets, generates the TFRecord and label_map.pbtxt files, and outputs them to the Amazon S3 destination of our choice.

Out of the total of 500 images, we use 450 for training and 50 for validation.

During training, the algorithm uses the first set to train the model and the second for evaluation.

You should end up with three files named label_map.pbtxt, train.records, and validation.records in the Amazon S3 destination you defined (s3://{bucket}/data/bees/tfrecords).

We can now move to model training!

Fine-tuning an EfficientDet model with TensorFlow 2 on SageMaker

You can follow this section by running the cells in this notebook.

Building a TensorFlow 2 Object Detection API Docker container

In this step, we first build and push a Docker container based on the TensorFlow GPU image.

We install the TensorFlow Object Detection API and the sagemaker-training-toolkit library to make it easily compatible with SageMaker.

SageMaker offers several ways to run our custom container. For more information, see Amazon SageMaker Custom Training containers. For this post, we use script mode and instantiate our SageMaker estimator as a CustomFramework. This allows us to work dynamically with our training code stored in the source_dir folder and prevents us from pushing container images to Amazon ECR at every change.

The following screenshot shows the corresponding training folder structure.

Setting up TensorBoard real-time monitoring using SageMaker Debugger

To capture real-time model training and performance metrics, we use TensorBoard and SageMaker Debugger. First, we start by defining a TensorboardOutputConfig in which we specify the S3 path where we save the TensorFlow checkpoints. See the following code:

from sagemaker.debugger import TensorBoardOutputConfig

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=tensorboard_s3_prefix,
    container_local_output_path='/opt/training/'
)

Each time the training script writes data to the container_local_output_path, SageMaker uploads it to Amazon S3, allowing us to monitor in real time.

Training a TensorFlow 2 object detection model using SageMaker

We fine-tune a pre-trained EfficientDet model available in the TensorFlow 2 Object Detection Model Zoo, because it offers a good balance between accuracy on the COCO 2017 dataset and inference efficiency.

We save the model checkpoint and its base pipeline.config in the source_dir folder, along with our training code.

We then adjust the pipeline.config so TensorFlow 2 can find the TFRecord and label_map.pbtxt files when they are loaded inside the container from Amazon S3.
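
The relevant portion of the adjusted pipeline.config looks roughly like the following sketch. Treat the paths as placeholders: the exact locations depend on where the training script stages the data inside the container, so check the configuration in the repo rather than copying these values verbatim.

train_input_reader {
  label_map_path: "/opt/ml/input/data/train/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "/opt/ml/input/data/train/train.records"
  }
}

eval_input_reader {
  label_map_path: "/opt/ml/input/data/train/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "/opt/ml/input/data/train/validation.records"
  }
}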

Your source_dir folder should now look like the following screenshot.

We use run_training.sh as the run entry point. This is the main script that SageMaker runs during training time, and performs the following steps:

  1. Launch the model training based on the specified hyperparameters.
  2. Launch the model evaluation based on the last checkpoint saved during the training.
  3. Prepare the trained model for inference using the exporter script.

You’re ready to launch the training job with the following commands:

hyperparameters = {
    "model_dir":"/opt/training",        
    "pipeline_config_path": "pipeline.config",
    "num_train_steps": 1000,    
    "sample_1_of_n_eval_examples": 1
}

estimator = CustomFramework(image_uri=container,
                            role=role,
                            entry_point='run_training.sh',
                            source_dir='source_dir/',
                            instance_count=1,
                            instance_type='ml.p3.8xlarge',
                            hyperparameters=hyperparameters,
                            tensorboard_output_config=tensorboard_output_config,
                            base_job_name='tf2-object-detection')

# We make sure to specify wait=False, so our notebook is not waiting for the training job to finish.

estimator.fit(inputs, wait=False)

When the job is running, Debugger allows us to capture TensorBoard data in a chosen Amazon S3 location and monitor the progress in real time with TensorBoard. Because we indicated the log directory when configuring the TensorBoardOutputConfig object, we can use it as the --logdir parameter.

Now, we can start up the TensorBoard server with the following command:

job_artifacts_path = estimator.latest_job_tensorboard_artifacts_path()
tensorboard_s3_output_path = f'{job_artifacts_path}/train'

!TF_CPP_MIN_LOG_LEVEL=3 AWS_REGION=<ADD YOUR REGION HERE> tensorboard --logdir=$tensorboard_s3_output_path

TensorBoard runs on your notebook instance, and you can open it by visiting the URL https://your-notebook-instance-name.notebook.your-region.sagemaker.aws/proxy/6006/.

The following screenshot shows the TensorBoard dashboard after the training is over.

We can also look at the TensorBoard logs generated by the evaluation step. These are accessible under the following eval folder:

tensorboard_s3_output_path = f'{job_artifacts_path}/eval'
region_name = 'eu-west-1'

!TF_CPP_MIN_LOG_LEVEL=3 AWS_REGION=$region_name tensorboard --logdir=$tensorboard_s3_output_path

This allows us to compare the ground truth data (right image in the following screenshot) and the predictions (left image).

Deploying your object detection model into a SageMaker endpoint

When the training is complete, SageMaker exports the model to a TensorFlow inference graph as a .pb file and saves it in a model.tar.gz archive in Amazon S3.

SageMaker provides a managed TensorFlow Serving environment that makes it easy to deploy TensorFlow models.

To access the model_artefact path, you can open the training job on the SageMaker console, as in the following screenshot.

When you have the S3 model artifact path, you can use the following code to create a SageMaker endpoint:

from sagemaker.tensorflow.serving import Model

model_artefact = '<your-model-s3-path>'

model = Model(model_data=model_artefact,
              name=name_from_base('tf2-object-detection'),
              role=role,
              framework_version='2.2')
              
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')

When the endpoint is up and running, we can send prediction requests to it with test images and visualize the results using the Matplotlib library:

img = image_file_to_tensor('test_images/22673445.jpg')

input = {
  'instances': [img.tolist()]
}

detections = predictor.predict(input)['predictions'][0]
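
A minimal way to draw the returned boxes with Matplotlib is sketched below. It assumes the exported model returns the standard TensorFlow Object Detection API output keys (detection_boxes with normalized [ymin, xmin, ymax, xmax] coordinates and detection_scores); if the helper functions in your notebook differ, adapt accordingly:

import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(img)

height, width = img.shape[:2]
for box, score in zip(detections['detection_boxes'], detections['detection_scores']):
    if score < 0.5:  # only draw reasonably confident detections
        continue
    ymin, xmin, ymax, xmax = box
    rect = patches.Rectangle((xmin * width, ymin * height),
                             (xmax - xmin) * width, (ymax - ymin) * height,
                             linewidth=2, edgecolor='red', facecolor='none')
    ax.add_patch(rect)

plt.show()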

The following screenshot shows an example of our output.

Summary

In this post, we covered an end-to-end process of collecting and labeling data using Ground Truth, preparing and converting the data to TFRecord format, and training and deploying a custom object detection model using the TensorFlow Object Detection API.

Get started today! You can learn more about SageMaker and kick off your own machine learning experiments and solutions by visiting the Amazon SageMaker console.


About the Authors

Sofian Hamiti is an AI/ML specialist Solutions Architect at AWS. He helps customers across industries accelerate their AI/ML journey by helping them build and operationalize end-to-end machine learning solutions.

Othmane Hamzaoui is a Data Scientist working in the AWS Professional Services team. He is passionate about solving customer challenges using Machine Learning, with a focus on bridging the gap between research and business to achieve impactful outcomes. In his spare time, he enjoys running and discovering new coffee shops in the beautiful city of Paris.

Read More

Learn from the winner of the AWS DeepComposer Chartbusters Track or Treat challenge

AWS is excited to announce the winner of the AWS DeepComposer Chartbusters Track or Treat challenge, Greg Baker. AWS DeepComposer gives developers a creative way to get started with machine learning (ML). In June 2020, we launched Chartbusters, a global competition in which developers use AWS DeepComposer to create original AI-generated compositions and compete to showcase their ML skills. Developers of all skill levels can train and optimize generative AI models to create original music. The Track or Treat challenge, which ran in October 2020, challenged developers to create Halloween-themed compositions using the AWS DeepComposer Music studio.

Greg Baker, a father of two in Sydney, Australia, works as a CTO for several startups to help build out their technology infrastructure. He has a background in mathematics, and teaches courses in ML and data science. One of his friends first introduced him to how he could combine mathematics with ML, and this became his starting point for exploring ML.

We interviewed Greg to learn about his experience competing in the Track or Treat Chartbusters challenge, and asked him to tell us more about how he created his winning composition.

Greg Baker at his work station.

Getting started with generative AI

Before entering the Chartbusters challenge, one of Greg’s earliest projects in ML was during an ML meetup in Australia. Participants were tasked with presenting something they had worked on in generative AI. Greg created a model that tried to learn how to handwrite words through an optical character recognition (OCR) system.

“I had a system that generated random scribbled letters and fed them through an OCR system to try and make sense of it,” Greg says. “I sent it to the OCR system and the OCR system gives a percentage probability of what it might be. And my goal with the system was to produce something that could learn how to do handwriting that could be read.”

Greg ended up not pursuing the project.

“It could sort of get the letter L, and that’s all it could learn to write. It didn’t work very well; it was 7,000 iterations to learn the letter L. I had grand plans about how it was going to learn to write Chinese and learn the history of hieroglyphics, but it couldn’t do a thing,” laughed Greg. “It was so embarrassingly bad; I sort of buried it afterwards.”

Greg happened to find AWS DeepComposer on the AWS Console through his work. He was intrigued with AWS DeepComposer, because in addition to his passion in mathematics and ML, Greg has worked as a professional musician.

“The highlight of my music career so far has been jamming with the Piano Guys at the Sydney Opera House. I also worked as a pipe organ building apprentice for a while, so I have a bit of practice in making spooky music!”

Greg spent time in the AWS DeepComposer Music studio and through his interest in music and ML, decided to compete in the Chartbusters challenge.

Building in AWS DeepComposer

In Track or Treat, developers were challenged to compose music using spooky instruments in the AWS DeepComposer Music Studio. Greg created two compositions throughout his process. First, he created a composition to get a handle of the AWS DeepComposer keyboard. Then, he worked on his second composition using the autoregressive convolutional neural network model to make his composition as spooky as possible.

“I knew that to do something spooky, I wanted it to be as low as possible and in a minor key. The AWS keyboard goes down to F, so I played a few notes in F minor, and then echoed them starting on a B flat, and then did a few variant notes starting on F again.”

Greg composing his melody.

Greg’s favorite feature on the AWS DeepComposer console is the autoregressive convolutional neural network (AR-CNN) model. The AR-CNN technique works by adding and removing notes from the input track provided. Adjusting the hyperparameters further controls how the model edits the melody. Developers can also use the Edit melody feature to collaborate iteratively with the technique. Learn more about generating compositions using the AR-CNN model.

Greg explains how the model can take a couple notes and generate a musical composition that, upon hearing, you would not know was generated by a machine.

“If you heard it in a concert you wouldn’t guess it was automatically composed. There are options to make the autoregressor very experimental, which I turned way down in the opposite direction so that DeepComposer would be very conservative and follow all the rules that my music theory teachers drilled into me. I wanted to make it feel very classical—like it was written in the same era that Shelley was writing Frankenstein.”

You can listen to Greg’s winning composition, “Halloween,” on the AWS DeepComposer SoundCloud page. 

What’s next

In the future, Greg hopes to see more bridges between music creation, artificial intelligence, and generative AI. With AWS DeepComposer, developers can create original music using generative AI which opens the door to an entire world of human and computer creativity. Greg hopes this challenge encourages more people with no background in music to become creators using AI.

“Computing is one of those enabling technologies that lets individuals, wherever they are, create in the world and create things of value […] What I’m hoping is that this will expand the scope of people who can take some joy in composing. At the moment, it’s not something everyone can do, but creating new music is such a wonderful feeling that I wish more people could experience it.”

Greg felt that the Chartbusters challenge nurtured his desire to be a creator and builder through creating a lasting piece of music. From developers who have no background in music composition, to developers who are experienced in music composition, anyone can enjoy the process of building and creating in AWS DeepComposer and listen to the final compositions created from the Chartbusters challenge.

“I’m not a very competitive person […] but setting algorithms to compete against each other seems like a fun thing to do. The Chartbusters competition is interesting because at the end of it, you have a piece of music that everyone can enjoy regardless of whether you win or lose. So it’s not so much about competing as creating.”

What’s next for Greg in ML? Building a model to predict his weight.

“I just recently created a machine learning model to predict my weight, based on how often I’m going running. I wasn’t really getting anywhere for most of this year—I was actually going backwards—but since I won the Chartbusters challenge and put together my prediction model, I’m beginning to make some progress.”

Congratulations to Greg for his well-deserved win!

We hope Greg’s story has inspired you to learn more about ML and get started with AWS DeepComposer. Check out the next AWS DeepComposer Chartbusters challenge, and start composing today.


About the Author

Paloma Pineda is a Product Marketing Manager for AWS Artificial Intelligence Devices. She is passionate about the intersection of technology, art, and human centered design. Out of the office, Paloma enjoys photography, watching foreign films, and cooking French cuisine.

Read More

Amazon DevOps Guru is powered by pre-trained ML models that encode operational excellence

On December 1, 2020, we announced the preview of Amazon DevOps Guru, a machine learning (ML)-powered service that gives operators of cloud-based applications a simpler way to measure and improve an application’s operational performance and availability to reduce expensive downtime.

Amazon DevOps Guru is a turn-key solution that helps operators by automatically ingesting operational data for analysis and predictively identifying operationally relevant issues. Although DevOps Guru uses ML models that are informed by years of operational expertise in building, scaling, and maintaining highly available applications at Amazon.com, it doesn’t require any ML experience.

Amazon DevOps Guru automatically identifies the most common application behaviors that correspond to operational incidents. When it identifies a critical issue, it alerts service operators with a summary of related anomalies, the likely root cause, and context on when and where the issue occurred. When possible, it also provides prescriptive recommendations on how to remediate the issue. In this post, we shed light on some of the ML approaches that power DevOps Guru.

DevOps Guru detectors

At the core of Amazon DevOps Guru is a unique approach to identify meaningful operational incidents. At the start of our research for DevOps Guru, we focused on domain-agnostic, general-purpose anomaly detection models. Though they gave statistically correct results, these models couldn’t always identify or distinguish critical failures from interesting but not critical issues. Over time, we learned that failure patterns differ considerably from metric to metric. For example, a common use case of DevOps Guru is running highly available, low-latency web applications, where an operator may be interested in monitoring both application latency and the number of incoming requests. However, failure patterns in these two metrics differ substantially, making it unlikely that a generic statistical anomaly detection model could address both scenarios. As a result, we changed our approach radically. After consulting with domain experts to identify known anomaly types across a variety of metrics and services, we set out to build domain-specific, single-purpose models to identify these known failure modes, rather than modeling normal metric behavior.

Fast-forward to now, Amazon DevOps Guru relies on a large ensemble of detectors—statistical models tuned to detect common adverse scenarios in a variety of operational metrics. DevOps Guru detectors don’t need to be trained or configured. They work instantly as long as enough history is available, saving days if not months of time that would otherwise be spent training ML models prior to anomaly generation. Individual detectors work in preconfigured ensembles to generate anomalies on some of the most important metrics operators monitor: error rates, availability, latency, incoming request rates, CPU, memory, and disk utilization, among others.

Detectors codify experts’ understanding of operational anomalies as closely as possible, in both determining anomalous patterns as well as establishing bounds for normal application behavior. Both detectors, and the ensembles that compose them into full models, were trained and tuned on Amazon’s data based on years of operational experience at Amazon.com and AWS. Next, we dive into some of the capabilities of DevOps Guru detectors.

Monitoring resource metrics with finite bounds

The purpose of this detector is to monitor finite resource metrics such as disk utilization. It utilizes a digital filter to detect long-running trends in metric data in a highly scalable and compute-effective manner. The detector notifies operators when these trends point to impending resource exhaustion. The following graph shows an illustrative example.

This detector identified a significant trend in disk usage, heading for disk exhaustion within 24 hours. The model has identified a significant trend between the vertical dashed lines. Extrapolating this trend (diagonal dashed line), the detector predicts time to resource exhaustion. As the metric breaches the horizontal red line, which acts as a significance threshold, the detector notifies operators.
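
As a purely illustrative sketch of the underlying idea (not DevOps Guru’s actual detector, which uses a digital filter and proprietary tuning), fitting a linear trend to recent utilization samples and extrapolating it gives an estimate of the time to exhaustion:

import numpy as np

def estimate_hours_to_exhaustion(utilization, capacity=100.0):
    # Fit a linear trend to hourly utilization samples and extrapolate to capacity
    hours = np.arange(len(utilization))
    slope, _ = np.polyfit(hours, utilization, 1)
    if slope <= 0:
        return None  # no upward trend, so no exhaustion predicted
    return (capacity - utilization[-1]) / slope

# Example: disk usage growing roughly 1.5% per hour from a 60% baseline
samples = 60 + 1.5 * np.arange(12)
print(estimate_hours_to_exhaustion(samples))  # about (100 - 76.5) / 1.5 ≈ 15.7 hours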

Detecting scenarios with periodicity

Many metrics, such as the number of incoming requests in customer-facing APIs, exhibit periodic behavior. The purpose of the causal convolution detector is to analyze temporal data with such patterns and to determine expected periodic behavior. When the detector infers that a metric is periodic, it adapts normal metric behavior thresholds to the seasonal pattern (as in the following graph). On a selected group of metrics, Amazon DevOps Guru can also identify and filter periodic spiking behavior, such as regular batch jobs producing high database load. In the following graph, we only see one detector active for better visualization—in reality, the causal convolution detector tracks the seasonal metric closely, whereas a further dynamic threshold detector detects catastrophic changes if breached.

The causal convolution detector sets bounds for application behavior in line with daily application load patterns. By tracking the seasonality, it can catch spikes relative to the seasonal baseline, such as on weekends, which traditional approaches based on static threshold lines only catch at the expense of many false positives.

DevOps Guru insights

Instead of providing just a list of anomalies that an ensemble of detectors find, DevOps Guru generates operational insights that aggregate relevant information needed to investigate and remediate an operational issue. Amazon DevOps Guru leverages anomaly metadata to identify related anomalies and potential root causes. Based on metadata, related anomalies are grouped together based on their temporal proximity, shared resources, and a large graph of potential causal links between anomaly types.

DevOps Guru presents insights with the following:

  • Graphs and timelines related to the numerous anomalous metrics
  • Contextual information such as relevant events and log snippets for easily understanding the anomaly scope
  • Recommendations to remediate the issue

The following screenshot illustrates an example insight detail page from DevOps Guru, which shows a collection of related metrics’ anomalies in a timeline view.

Conclusion

Amazon DevOps Guru saves IT operators hours if not days of time and effort spent detecting, debugging, and resolving operational issues. Because it uses pre-trained proprietary ML models informed by years of operational experience at Amazon.com and AWS in managing highly available services, IT operators can receive the same high-quality insights without having any ML experience. Start using DevOps Guru today.

Acknowledgments: The authors would like to acknowledge the contributions of Jan Gasthaus, Valentin Flunkert, Syama Rangapuram, Lorenzo Stella, Konstantinos Benidis and Francois Aubet to the algorithms and models presented in this blog post.


About the Authors

Caner Türkmen is a Machine Learning Scientist at Amazon Web Services, where he works on problems at the intersection of machine learning, forecasting, and anomaly detection. Before joining AWS, he worked in the management consulting industry as a data scientist, serving the financial services and telecommunications industries on projects across the globe. Caner’s personal research interests span a range of topics including probabilistic and Bayesian ML, stochastic processes, and their practical applications.

 

Ravi Turlapati leads product efforts at AWS. Ravi joined AWS more than three years ago, and has launched multiple products from scratch including AWS Data Exchange and Amazon DevOps Guru. In his latest role at AWS AI group, Ravi aims to deliver easy to use ML-based products that solve complex challenges for customers. Ravi is passionate about social causes and supports any charity that creates a self sustaining environment for those in need.

 

Tim Januschowski is a Machine Learning Science Manager in Amazon’s AWS AI Labs. He has broad interests in machine learning with a particular focus on machine learning for business problems and decision making. At Amazon, he has produced end-to-end solutions for a wide variety of problems, from forecasting, to anomaly detection and personalization in application areas such as retail, operations and energy. Tim’s  interests in applying machine learning span applications, system, algorithm and modeling aspects and the downstream mathematical programming problems. He studied Mathematics at TU Berlin, IMPA, Rio de Janeiro, and Zuse-Institute Berlin and holds a PhD from University College Cork.

Read More

Anomaly detection with Amazon Lookout for Metrics

This is a guest blog post from Quantiphi, an AWS Advanced Consulting Partner that specializes in artificial intelligence, machine learning, and data and analytics solutions.

We’ve all heard the saying “time is money,” and that’s especially true for the retail industry. In a highly competitive environment where large volumes of data are generated, quick and accurate anomaly detection is critical to smooth business operations and positive customer experiences. However, doing so is easier said than done. Traditional anomaly detection techniques fall short because they can’t efficiently keep up with the growing volume of data. To put this in perspective, one of the largest retailers today collects around 2.5 petabytes of data per hour. Techniques that detect anomalies quickly at this scale and identify their root causes are required for taking effective business decisions.

Traditionally, businesses use dashboards to track metrics or key performance indicators. However, as the number of metrics grow, anomaly identification and remediation using traditional techniques become cumbersome. Therefore, many organizations look at machine learning (ML)-based anomaly detection to overcome this challenge.

In this post, we walk you through a retail example to demonstrate how Amazon Lookout for Metrics, a new ML-based AWS service for anomaly detection, helps accelerate detection and remediation of anomalies.

Anomaly detection in retail transactional data

Let’s dive into an anomaly detection use case we worked on for an online retailer. The retailer generated large amounts of customer transactional data, which serves as a window into their end-customer’s behavior (what products, brands, promotions, and ads they engaged with). For them, quick and accurate anomaly detection in KPIs is imperative for timely remediation, in order to maintain inventory flow and price compliance. As a result, the retailer wanted to ensure that the anomaly insights were actionable and reflected real risks to inventory availability and revenue.

To set up Lookout for Metrics, we first divided the data into regular time intervals. We then set up the detector, specifying the category of every column and the time format of the timestamp, which are mandatory fields. Lookout for Metrics allows us to define up to five measures and five dimensions for continuous monitoring for anomalies. We then trained the detector using historical data and used live data for testing and continuous learning. We uploaded this data to Amazon Simple Storage Service (Amazon S3) regularly, at the time interval specified when setting up the detector.
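
For readers who prefer to script this setup rather than use the console, the flow maps to a handful of Lookout for Metrics API calls. The following is a heavily abbreviated boto3 sketch: the detector and metric set names, dimensions, S3 path, and IAM role are placeholders, and the full parameter shapes (file format descriptors, templated paths, and so on) should be checked against the boto3 lookoutmetrics documentation.

import boto3

lookout = boto3.client('lookoutmetrics')

# Create a detector that checks for new data every hour
detector = lookout.create_anomaly_detector(
    AnomalyDetectorName='retail-transactions-detector',  # placeholder
    AnomalyDetectorConfig={'AnomalyDetectorFrequency': 'PT1H'}
)
detector_arn = detector['AnomalyDetectorArn']

# Attach a metric set: measures, dimensions, timestamp column, and the S3 source
lookout.create_metric_set(
    AnomalyDetectorArn=detector_arn,
    MetricSetName='transactions',
    MetricList=[{'MetricName': 'revenue', 'AggregationFunction': 'SUM'}],
    DimensionList=['store_id', 'product_category'],  # placeholder dimensions
    TimestampColumn={'ColumnName': 'timestamp', 'ColumnFormat': 'yyyy-MM-dd HH:mm:ss'},
    MetricSource={'S3SourceConfig': {
        'RoleArn': '<role-with-s3-access>',  # placeholder
        'TemplatedPathList': ['s3://<your-bucket>/transactions/{{yyyyMMdd}}/{{HHmm}}/'],
        'FileFormatDescriptor': {'CsvFormatDescriptor': {'ContainsHeader': True, 'Delimiter': ','}}
    }}
)

# Start continuous anomaly detection on live data
lookout.activate_anomaly_detector(AnomalyDetectorArn=detector_arn)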

At each specified time interval, Lookout for Metrics checked for the presence of new data and new anomalies. When it detected an anomaly, it provided two additional insights. First it provided a severity score measuring the magnitude of the anomaly. The severity scores also helped the retailer tune the sensitivity of their model to focus only on the most important events. Second, it revealed a breakdown of the dimensions that contributed to the anomaly, with a percentage contribution from each dimension value, which was useful for determining the appropriate actions to take.

When the retailer applied the detector, we identified a few transactions in which thousands of items were sold at steep discounts. We used insights from the dimensions to quickly learn that these transactions were due to wholesale purchasing and bulk order shipments to large corporations. The retailer was then able to promptly take action to ensure inventory availability for other customers wasn’t impacted.

Comparing Amazon Lookout for Metrics to traditional anomaly detection methods

One of the key benefits of using Lookout for Metrics is a reduction in setup time from days to hours. The setup process is straightforward (see the following diagram). After you define your data source and the metrics you want to track, you can build your first detector model. Lookout for Metrics is then capable of detecting anomalies continuously. In the previous retail example, we detected anomalies in the transactional dataset within 10 minutes of setting up the Lookout for Metrics detector. The process with traditional methods would take 2–3 days. Traditional methods also require highly technical resources and subject matter experts versed in analytics and ML to create and manage custom models. In the case of Lookout for Metrics, the service manages the ML pipeline, allowing you to focus on anomalies, root causes, and actions.

By leveraging a managed service like Lookout for Metrics, anomaly detection is simplified and automated, saving time and effort. Lookout for Metrics helps customers in a wide variety of industries such as retail, ads and marketing, gaming, software and internet, and telecommunications to accurately detect anomalies in their time series data and understand the root cause.

You can use Lookout for Metrics to do the following:

  • Detect anomalies in metrics with high accuracy using advanced ML technology, leading to fewer false alarms and missed anomalies, with no ML experience required.
  • Get a more holistic view of your business while minimizing disparate alerts. The service groups concurrent anomalies and sends a single alert on related anomalies, which summarizes the what, how, and where.
  • Continuously improve accuracy and performance by providing feedback on detected anomalies. Lookout for Metrics incorporates your feedback in real time to learn what is most relevant to your business.
  • Prioritize which anomalies to focus on first using the ranked severity score and tune the sensitivity threshold to get alerted on only the relevant anomalies (see the following diagram).

Summary

As an Advanced Consulting Partner, Quantiphi comes with immense experience in solving critical business challenges for our customers. We have worked on several anomaly detection engagements. In addition to the retail example discussed in this post, we recently helped an advertisement analytics company improve their offerings by building models to improve the effectiveness of ad campaigns.

With Lookout for Metrics, we aim to expedite the delivery of solutions to detect anomalies in different business processes, thus bringing unprecedented value to our customers.

Amazon Lookout for Metrics (Preview) is available on the AWS Management Console, via the AWS SDKs, and the AWS Command Line Interface (AWS CLI) in the following Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Tokyo). See the AWS Region Table for more details. Request access to the preview today.

Quantiphi is an AWS Advanced Consulting Partner with a deep understanding of artificial intelligence, machine learning, and data and analytics solutions. For more information about Quantiphi, contact them here.

 

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Vibhav Gupta is a Senior Client Solutions Partner at Quantiphi.

Tripurana Rahul is a Senior Business Analyst at Quantiphi.

Read More

Using genetic algorithms on AWS for optimization problems

Machine learning (ML)-based solutions are capable of solving complex problems, from voice recognition to finding and identifying faces in video clips or photographs. Usually, these solutions use large amounts of training data, which results in a model that processes input data and produces numeric output that can be interpreted as a word, face, or classification category. For many types of problems, this approach works very well.

But what if you have a problem that doesn’t have training data available, or doesn’t fit within the concept of a classification or regression? For example, what if you need to find an optimal ordering for a given set of worker tasks with a given set of conditions and constraints? How do you solve that, especially if the number of tasks is very large?

This post describes genetic algorithms (GAs) and demonstrates how to use them on AWS. GAs are unsupervised ML algorithms used to solve general types of optimization problems, including:

  • Optimal data orderings – Examples include creating work schedules, determining the best order to perform a set of tasks, or finding an optimal path through an environment
  • Optimal data subsets – Examples include finding the best subset of products to include in a shipment, or determining which financial instruments to include in a portfolio
  • Optimal data combinations – Examples include finding an optimal strategy for a task that is composed of many components, where each component is a choice of one of many options

For many optimization problems, the number of potential solutions (good and bad) is very large, so GAs are often considered a type of search algorithm, where the goal is to efficiently search through a huge solution space. GAs are especially advantageous when the fitness landscape is complex and non-convex, so that classical optimization methods such as gradient descent are an ineffective means to find a global solution. Finally, GAs are often referred to as heuristic search algorithms because they don’t guarantee finding the absolute best solution, but they do have a high probability of finding a sufficiently good solution to the problem in a short amount of time.

GAs use concepts from evolution such as survival of the fittest, genetic crossover, and genetic mutation to solve problems. Rather than trying to create a single solution, those evolutionary concepts are applied to a population of different problem solutions, each of which is initially random. The population goes through a number of generations, literally evolving solutions through mechanisms like reproduction (crossover) and mutation. After a number of generations of evolution, the best solution found across all the generations is chosen as the final problem solution.

As a prerequisite to using a GA, you must be able to do the following:

  • Represent each potential solution in a data structure.
  • Evaluate that data structure and return a numeric fitness score that accurately reflects the solution quality. For example, imagine a fitness score that measures the total time to perform a set of tasks. In that case, the goal would be to minimize that fitness score in order to perform the tasks as quickly as possible.

Each member of the population has a different solution stored in its data structure, so the fitness function must return a score that can be used to compare two candidates against each other. That’s the “survival of the fittest” part of the algorithm—one candidate is evaluated as better than another, and that fitter candidate’s information is passed on to future generations.

One note about terminology: because many of the ideas behind a genetic algorithm come from the field of genetics, the data representation that each member of a population uses is sometimes called a genome. That’s simply another way to refer to the data used to represent a particular solution.

Use case: Finding an optimal route for a delivery van

As an example, let’s say that you work for a company that ships lots of packages all over the world, and your job is focused on the final step, which is delivering a package by truck or van to its final destination.

A given delivery vehicle might have up to 100 packages at the start of a day, so you’d like to calculate the shortest route to deliver all the packages and return the truck to the main warehouse when done. This is a version of a classic optimization problem called The Travelling Salesman Problem, originally formulated in 1930. In the following visualization of the problem, displayed as a top-down map of a section of a city, the warehouse is shown as a yellow dot, and each delivery stop is shown as a red dot.

To keep things simple for this demonstration, we assume that when traveling from one delivery stop to another, there are no one-way roads. Due to this assumption, the total distance traveled from one stop to the next is the difference in X coordinates added to the difference in Y coordinates.

If the problem had a slightly different form (like traveling via airplane rather than driving through city streets), we might calculate the distance using the Pythagorean equation, taking the square root of the total of the difference in X coordinates (squared) added to the difference in Y coordinates (squared). For this use case, however, we stick with the total of the difference in X coordinates added to the total of the difference in Y coordinates, because that matches how a truck travels to deliver the packages, assuming two-way streets.

Next, let’s get a sense of how challenging this problem is. In other words, how many possible routes are there with 100 stops where you visit each stop only once? In this case, the math is simple: there are 100 possible first stops multiplied by 99 possible second stops, multiplied by 98 possible third stops, and so on—100 factorial (100!), in other words. That’s 9.3 x 10^157 possibilities, which definitely counts as a large solution space and rules out any thoughts of using a brute force approach. After all, with that volume of potential solutions, there really is no way to iterate through all the possible solutions in any reasonable amount of time.
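
You can confirm that figure quickly in Python:

import math

routes = math.factorial(100)
print(f"{routes:.2e}")  # 9.33e+157 possible orderings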

Given that, it seems that a GA could be a good approach for this problem, because GAs are effective at finding good-quality solutions within very large solution spaces. Let’s develop a GA to see how that works.

Representation and a fitness function

As mentioned earlier, the first step in writing a GA is to determine the data structure for a solution. Suppose that we have a list of all 100 destinations with their associated locations. A useful data representation for this problem is to have each candidate store a list of 100 location indexes that represent the order the delivery van must visit each location. The X and Y coordinates found in the lookup table could be latitude and longitude coordinates or other real-world data.

To implement our package delivery solution, we use a Python script, although almost any modern computer language like Java or C# works well. Open-source packages like inspyred also create the general structure of a GA, allowing you to focus on just the parts that vary from project to project. However, for the purposes of introducing the ideas behind a GA, we write the code without relying on third-party libraries.

As a first step, we represent a potential solution as the following code:

class CandidateSolution(object):
    def __init__(self):
        self.fitness_score = 0
        num_stops = len(delivery_stop_locations) # a list of (X,Y) tuples
        self.path = list(range(num_stops))
        random.shuffle(self.path)

The class has a fitness_score field and a path field. The path is a list of indexes into delivery_stop_locations, which is a list of (X,Y) coordinates for each delivery stop. That list is loaded from a database elsewhere in the code. We also use random.shuffle(), which ensures that each potential solution is a randomly shuffled list of indexes into the delivery_stop_locations list. GAs always start with a population of completely random solutions, and then rely on evolution to home in on the best solution possible.

With this data structure, the fitness function is straightforward. We start at the warehouse, then travel to the first location in our list, then the second location, and so on until we’ve visited all delivery stop locations, and then we return to the warehouse. In the end, the fitness function simply totals up the distance traveled over that entire trip. The goal of this GA is to minimize that distance, so the smaller the fitness score, the better the solution. We use the following the code to implement the fitness function:

def dist(location_a, location_b):
    xdiff = abs(location_a['X'] - location_b['X'])
    ydiff = abs(location_a['Y'] - location_b['Y'])
    return xdiff + ydiff

def calc_score_for_candidate(candidate):
    # start with the distance from the warehouse to the first stop
    warehouse_location = {'X': STARTING_WAREHOUSE_X, 'Y': STARTING_WAREHOUSE_Y}
    total_distance = dist(warehouse_location, delivery_stop_locations[candidate.path[0]])

    # then travel to each stop
    for i in range(len(candidate.path) - 1):
        total_distance += dist(
            delivery_stop_locations[candidate.path[i]], 
            delivery_stop_locations[candidate.path[i + 1]])

    # then travel back to the warehouse
    total_distance += dist(warehouse_location, delivery_stop_locations[candidate.path[-1]])
    return total_distance

Now that we have a representation and a fitness function, let’s look at the overall flow of a genetic algorithm.

Program flow for a genetic algorithm

When you have a data representation and a fitness function, you’re ready to create the rest of the GA. The standard program flow includes the following pseudocode (a minimal Python sketch of this loop follows the list):

  1. Generation 0 – Initialize the entire population with completely random solutions.
  2. Fitness – Calculate the fitness score for each member of the population.
  3. Completion check – Take one of the following actions:
    1. If the best fitness score found in the current generation is better than any seen before, save it as a potential solution.
    2. If you go through a certain number of generations without any improvement (no better solution has been found), then exit this loop, returning the best found to date.
  4. Elitism – Create a new generation, initially empty. Take a small percentage (like 5%) of the best-scoring candidates from the current generation and copy them unchanged into the new generation.
  5. Selection and crossover – To populate the remainder of the new generation, repeatedly select two good candidate solutions from the current generation and combine them to form a new child candidate that gets added to the next generation.
    1. Mutation – On rare occasions (like 2%, for example), mutate a newly created child candidate by randomly perturbing its data.
  6. Replace the current generation with the next generation and return to step 2.
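
The following minimal Python sketch shows how those steps fit together. It omits details such as writing results to DynamoDB; select_parents is defined later in this post, and the crossover and mutate helpers stand in for the genetic operations described in the pseudocode:

import random

def run_genetic_algorithm(population_size=200, elite_rate=0.05,
                          mutation_rate=0.02, max_stagnant_generations=50):
    # Generation 0: completely random candidate solutions
    population = [CandidateSolution() for _ in range(population_size)]
    best = None
    stagnant_generations = 0

    while stagnant_generations < max_stagnant_generations:
        # Score each member of the current generation
        for candidate in population:
            candidate.fitness_score = calc_score_for_candidate(candidate)

        # Completion check: track the best solution found so far
        generation_best = min(population, key=lambda c: c.fitness_score)
        if best is None or generation_best.fitness_score < best.fitness_score:
            best = generation_best
            stagnant_generations = 0
        else:
            stagnant_generations += 1

        # Elitism: copy a small percentage of the best candidates unchanged
        population.sort(key=lambda c: c.fitness_score)
        next_generation = population[:int(elite_rate * population_size)]

        # Selection and crossover fill the rest, with occasional mutation
        while len(next_generation) < population_size:
            parent1, parent2 = select_parents(population)
            child = crossover(parent1, parent2)
            if random.random() < mutation_rate:
                mutate(child)
            next_generation.append(child)

        population = next_generation

    return best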

When the algorithm exits the main loop, the best solution found during that run is used as the problem’s final solution. However, it’s important to realize that because there is so much randomness in a GA—from the initially completely random candidates to randomized selection, crossover, and mutation—each time you run a GA, you almost certainly get a different result. Because of that randomness, a best practice when using a GA is to run it multiple times to solve the same problem, keeping the very best solutions found across all the runs.

Using a genetic algorithm on AWS via Amazon SageMaker Processing

Due to the inherent randomness that comes with a GA, it’s usually a good idea to run the code multiple times, using the best result found across those runs. This can be accomplished using Amazon SageMaker Processing, which is an Amazon SageMaker managed service for running data processing workloads. In this case, we use it to launch the GA so that multiple instances of the code run in parallel.

Before we start, we need to set up a couple of AWS resources that our project needs, like database tables to store the delivery stop locations and GA results, and an AWS Identity and Access Management (IAM) role to run the GA. Use the AWS CloudFormation template included in the associated GitHub repo to create these resources, and make a note of the resulting ARN of the IAM role. Detailed instructions are included in the README file found in the GitHub repo.

After you create the required resources, populate the Amazon DynamoDB table DeliveryStops (indicating the coordinates for each delivery stop) using the Python script create_delivery_stops.py, which is included in the code repo. You can run this code from a SageMaker notebook or directly from a desktop computer, assuming you have Python and Boto3 installed. See the README in the repo for detailed instructions on running this code.

We use DynamoDB for storing the delivery stops and the results. DynamoDB is a reasonable choice for this use case because it’s highly scalable and reliable, and doesn’t require any maintenance due to it being a fully managed service. DynamoDB can handle more than 10 trillion requests per day and can support peaks of more than 20 million requests per second, although this use case doesn’t require anywhere near that kind of volume.

After you create the IAM role and DynamoDB tables, we’re ready to set up the GA code and run it using SageMaker.

  1. To start, create a notebook in SageMaker.

Be sure to use a notebook instance rather than SageMaker Studio, because we need a kernel with Docker installed.

To use SageMaker Processing, we first need to create a Docker image that we use to provide a runtime environment for the GA.

  1. Upload Dockerfile and genetic_algorithm.py from the code repo into the root folder for your Jupyter notebook instance.
  2. Open Dockerfile and ensure that the ENV AWS_DEFAULT_REGION line refers to the AWS Region that you’re using.

The default Region in the file from the repo is us-east-2, but you can use any Region you wish.

  1. Create a cell in your notebook and enter the following code:
    import boto3
    
    print("Building container...")
    
    region = boto3.session.Session().region_name
    account_id = boto3.client('sts').get_caller_identity().get('Account')
    ecr_repository = 'sagemaker-processing-container-for-ga'
    tag = ':latest'
    base_uri = '{}.dkr.ecr.{}.amazonaws.com'.format(account_id, region)
    repo_uri = '{}/{}'.format(base_uri, ecr_repository + tag)
    
    # Create ECR repository and push docker image
    !docker build -t $ecr_repository docker
    !aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $base_uri
    !aws ecr create-repository --repository-name $ecr_repository
    !docker tag {ecr_repository + tag} $repo_uri
    !docker push $repo_uri
    
    print("Container Build done")
    
    iam_role = 'ARN_FOR_THE_IAM_ROLE_CREATED_EARLIER'
    

Be sure to fill in the iam_role ARN, which is displayed on the Outputs page of the CloudFormation stack that you created earlier. You can also change the name of the Docker image if you wish, although the default value of sagemaker-processing-container-for-ga is reasonable.

Running that cell creates a Docker image that supports Python with the Boto3 package installed, and then registers it with Amazon Elastic Container Registry (Amazon ECR), which is a fully-managed Docker registry that handles everything required to scale or manage the storage of Docker images.

Add a new cell to your notebook and enter and run the following code:

from sagemaker.processing import ScriptProcessor

processor = ScriptProcessor(image_uri=repo_uri,
     role=iam_role,
     command=['python3'],
     instance_count=1,
     instance_type="ml.m5.xlarge")

processor.run(code='./genetic_algorithm.py')

This image shows the job launched, and the results displayed below as the GA does its processing:

The ScriptProcessor class is used to create a container that the GA code runs in. We don’t include the code for the GA in the container itself because the ScriptProcessor class is designed to be used as a generic container (preloaded with all required software packages), and the run command chooses a Python file to run within that container. Although the GA Python code is located on your notebook instance, SageMaker Processing copies it to an Amazon Simple Storage Service (Amazon S3) bucket in your account so that it can be referenced by the processing job. Because of that, the IAM role we use must include a read-only permission policy for Amazon S3, along with other required permissions related to services like DynamoDB and Amazon ECR.

Calculating fitness scores is something that can and should be done in parallel, because fitness calculations tend to be fairly slow and each candidate solution is independent of all the others. The GA code for this demonstration uses multiprocessing to calculate multiple fitness scores at the same time, which dramatically speeds up each run of the GA. We also specify the instance type in the ScriptProcessor constructor; in this case, we chose ml.m5.xlarge in order to use a processor with 4 vCPUs. Choosing an instance type with more vCPUs results in faster runs of the GA, at a higher price per hour. There is no benefit to using an instance type with GPUs for a GA, because all of the work is done on the CPU.
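The exact parallel fitness code lives in genetic_algorithm.py in the repo; the following is only a simplified sketch of the idea, with a made-up STOPS layout (stop id mapped to x/y coordinates, with id 0 as the warehouse) and straight-line route distance as the fitness function:

import math
import multiprocessing

# Hypothetical data layout: stop id -> (x, y) coordinates; id 0 is the warehouse
STOPS = {0: (0, 0), 1: (3, 4), 2: (6, 1), 3: (2, 7)}
WAREHOUSE = 0

def compute_fitness(path):
    # Fitness is the total distance warehouse -> stops in order -> warehouse (lower is better)
    route = [WAREHOUSE] + list(path) + [WAREHOUSE]
    return sum(math.dist(STOPS[a], STOPS[b]) for a, b in zip(route, route[1:]))

def score_population(population):
    # Each candidate is independent, so score them in parallel with one worker per vCPU
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        scores = pool.map(compute_fitness, [c.path for c in population])
    for candidate, score in zip(population, scores):
        candidate.fitness_score = score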

Finally, the ScriptProcessor constructor also specifies the number of instances to run. If you specify a number of instances greater than 1, the same code runs in parallel, which is exactly what we want for a GA. Each instance is a complete run of the GA, run in its own container. Because each instance is completely self-contained, we can run multiple instances at once, and each instance does its calculations and writes its results into the DynamoDB results table.

To review, we’re using two different forms of parallelism for the GA: one is through running multiple instances at once (one per container), and the other is through having each container instance use multiprocessing in order to effectively calculate fitness scores for multiple candidates at the same time.

The following diagram illustrates the overall architecture of this approach.

The Docker image defines the runtime environment, which is stored in Amazon ECR. That image is combined with a Python script that runs the GA, and SageMaker Processing uses one or more containers to run the code. Each instance reads configuration data from DynamoDB and writes results into DynamoDB.

Genetic operations

Now that we know how to run a GA using SageMaker, let’s dive a little deeper into how we can apply a GA to our delivery problem.
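The repo defines its own candidate class, but for the snippets in this section it’s enough to picture something like the following minimal sketch, in which path is an ordering of the delivery stop ids and fitness_score is the total route distance (lower is better):

import random
from dataclasses import dataclass, field

@dataclass
class Candidate:
    path: list                                  # ordered list of stop ids to visit
    fitness_score: float = field(default=0.0)   # total route distance; lower is better

def random_candidate(stop_ids):
    # The initial population is just random orderings of all delivery stops
    return Candidate(path=random.sample(stop_ids, len(stop_ids)))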

Selection

When we select two parents for crossover, we want a balance between good quality and randomness, which can be thought of as genetic diversity. If we only pick candidates with the best fitness scores, we miss candidates that have elements that might eventually help find a great solution, even though the candidate’s current fitness score isn’t the best. On the other hand, if we completely ignore quality when selecting parents, the evolutionary process doesn’t work very well—we’re ignoring survival of the fittest.

There are a number of approaches for selection, but the simplest is called tournament selection. With a tournament of size 2, you randomly select two candidates from the population and keep the best one. The same applies to a tournament of size 3 or more—you simply use the one with the best fitness score. The larger the number you use, the better quality candidate you get, but at a cost of reduced genetic diversity.

The following code shows the implementation of tournament selection:

def tourney_select(population):
    # pick TOURNEY_SIZE candidates at random and keep the best one;
    # lower fitness (shorter total distance) is better
    selected = random.sample(population, TOURNEY_SIZE)
    best = min(selected, key=lambda c: c.fitness_score)
    return best

def select_parents(population):
    # using Tourney selection, get two candidates and make sure they're distinct
    while True:
        candidate1 = tourney_select(population)
        candidate2 = tourney_select(population)
        if candidate1 != candidate2:
            break
    return candidate1, candidate2

Crossover

After we select two candidates, how can we combine them to form one or two children? If both parents are simply lists of numbers and we can’t duplicate or leave out any numbers from the list, combining the two can be challenging.

One approach is called partially mapped crossover. It works as follows:

  1. Copy each parent, creating two children.
  2. Randomly select a starting and ending point for crossover within the genome. We use the same starting and ending points for both children.
  3. For each child, iterate from the starting crossover point to the ending crossover point and perform the following actions on each gene in the child at the current point:
    1. Find the corresponding gene in the other parent (the one that wasn’t copied into the current child), using the same crossover point. If that gene matches what’s already in the child at that point, continue to the next point, because no crossover is required for the gene.
    2. Otherwise, find the gene from the alternate parent and swap it with the current gene within the child.

The following diagram illustrates the first step, making copies of both parents.

Each child is crossed over with the alternate parent. The following diagram shows the randomly selected start and end points, with the thick arrow indicating which gene is crossed over next.

In the first swap position, the parent contributes the value 8. Because the current gene value in the child is 4, the 4 and 8 are swapped within the child.

That swap has the effect of taking the gene with value 8 from the parent and placing it within the child at the corresponding position. When the swap is complete, the large arrow moves to the next gene to cross over.

At this point, the sequence is repeated. In this case, both gene values in the current position are the same (6), so the crossover position advances to the next position.

The gene value from the parent is 7 in this case, so the swap occurs within the child.

The following diagram shows the final result, with the arrows indicating how the genes were crossed over.

Crossover isn’t a mandatory step, and most GAs use a crossover rate parameter to control how often crossover happens. If two parents are selected but crossover isn’t used, both parents are copied unchanged into the next generation.

We used the following code for the crossover in this solution:

def crossover_parents_to_create_children(parent_one, parent_two):
    child1 = copy.deepcopy(parent_one)
    child2 = copy.deepcopy(parent_two)

    # sometimes we don't cross over, so use copies of the parents
    if random.random() >= CROSSOVER_RATE:
        return child1, child2

    num_genes = len(parent_one.path)
    start_cross_at = random.randint(0, num_genes - 2)  # pick a point between 0 and the end - 2, so we can cross at least 1 stop
    num_remaining = num_genes - start_cross_at
    end_cross_at = random.randint(num_genes - num_remaining + 1, num_genes - 1)

    for index in range(start_cross_at, end_cross_at + 1):
        child1_stop = child1.path[index]
        child2_stop = child2.path[index]

        # if the same, skip it since there is no crossover needed at this gene
        if child1_stop == child2_stop:
            continue

        # find within child1 and swap
        first_found_at = child1.path.index(child1_stop)
        second_found_at = child1.path.index(child2_stop)
        child1.path[first_found_at], child1.path[second_found_at] = child1.path[second_found_at], child1.path[first_found_at]

        # and the same for the second child
        first_found_at = child2.path.index(child1_stop)
        second_found_at = child2.path.index(child2_stop)
        child2.path[first_found_at], child2.path[second_found_at] = child2.path[second_found_at], child2.path[first_found_at]

    return child1, child2

Mutation

Mutation is a way to add genetic diversity to a GA, which is often desirable. However, too much mutation causes the GA to lose its way, so it’s best to use it in moderation if it’s needed at all.

You can approach mutation for this problem in two different ways: swapping and displacement.

A swap mutation is just what it sounds like—two randomly selected locations (genes) are swapped within a genome (see the following diagram).

The following code performs the swap:

def swap_mutation(candidate):
    indexes = range(len(candidate.path))
    pos1, pos2 = random.sample(indexes, 2)
    candidate.path[pos1], candidate.path[pos2] = candidate.path[pos2], candidate.path[pos1]

A displacement mutation randomly selects a gene, randomly selects an insertion point, and moves the selected gene into the selected insertion point, shifting other genes as needed to make space (see the following diagram).

The following code performs the displacement:

def displacement_mutation(candidate):
    num_stops = len(candidate.path)
    stop_to_move = random.randint(0, num_stops - 1)
    insert_at = random.randint(0, num_stops - 1)
    # make sure it's moved to a new index within the path, so it's really different
    while insert_at == stop_to_move:
        insert_at = random.randint(0, num_stops - 1)
    stop_index = candidate.path[stop_to_move]
    del candidate.path[stop_to_move]
    candidate.path.insert(insert_at, stop_index)
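The hyperparameter table later in this post uses a 10% mutation rate with a 50/50 split between the two methods. One way to apply that choice to each child is sketched below; MUTATION_RATE is an assumed constant, and swap_mutation and displacement_mutation are the functions shown above:

MUTATION_RATE = 0.10  # assumed value, matching the hyperparameter table later in this post

def maybe_mutate(candidate):
    # Mutate only a small fraction of children, split evenly between the two methods
    if random.random() < MUTATION_RATE:
        if random.random() < 0.5:
            swap_mutation(candidate)
        else:
            displacement_mutation(candidate)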

Elitism

An optional part of any GA is elitism, which is done when populating a new generation of candidates. When used, elitism copies a certain percentage of the best-scoring candidates from the current generation into the next generation. Elitism is a method for ensuring that the very best candidates always remain in the population. See the following code:

num_elites = int(ELITISM_RATE * POPULATION_SIZE)
current_generation.sort(key=lambda c: c.fitness_score)
next_generation = [current_generation[i] for i in range(num_elites)]
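The repo’s main loop differs in its details, but a sketch of how elitism, selection, crossover, and mutation fit together into a single generation step (reusing the maybe_mutate sketch from the mutation section above) could look like this:

def next_generation_from(current_generation):
    # Keep the best candidates as-is (elitism), sorted so that shorter distances come first
    current_generation.sort(key=lambda c: c.fitness_score)
    num_elites = int(ELITISM_RATE * POPULATION_SIZE)
    next_generation = current_generation[:num_elites]

    # Fill the rest of the population with children created from selected parents
    while len(next_generation) < POPULATION_SIZE:
        parent_one, parent_two = select_parents(current_generation)
        child1, child2 = crossover_parents_to_create_children(parent_one, parent_two)
        maybe_mutate(child1)
        maybe_mutate(child2)
        next_generation.extend([child1, child2])

    return next_generation[:POPULATION_SIZE]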

Results

It’s helpful to compare the results from our GA to those from a baseline algorithm. One common non-GA approach to solving this problem is the Nearest Neighbor algorithm, which you can apply in this manner (a brief sketch follows the list):

  1. Set our current location to be the warehouse.
  2. While there are unvisited delivery stops, perform the following:
    1. Find the unvisited delivery stop that is closest to our current location.
    2. Move to that stop, making it the current location.
  3. Return to the warehouse.
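A minimal sketch of this baseline, reusing the hypothetical STOPS layout from the fitness sketch earlier, might look like the following:

def nearest_neighbor_route(stop_ids):
    # Greedily visit the closest unvisited stop until none remain, starting at the warehouse
    unvisited = set(stop_ids)
    current = WAREHOUSE
    route = []
    while unvisited:
        nearest = min(unvisited, key=lambda s: math.dist(STOPS[current], STOPS[s]))
        route.append(nearest)
        unvisited.remove(nearest)
        current = nearest
    return route  # the vehicle then returns to the warehouse for the final leg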

The following diagrams illustrate the head-to-head results for 10, 25, 50, 75, and 100 delivery stops, comparing the Nearest Neighbor route with the route found by the genetic algorithm in each case.

The following table summarizes the results.

| # of delivery stops | Nearest Neighbor distance | Genetic Algorithm distance |
|---|---|---|
| 10 | 142 | 124 |
| 25 | 202 | 170 |
| 50 | 268 | 252 |
| 75 | 370 | 318 |
| 100 | 346 | 368 |

The Nearest Neighbor algorithm performs well in situations where many locations are clustered tightly together, but can perform poorly when the locations are more widely distributed. For example, its path for 75 delivery stops (370) is significantly longer than its path for 100 delivery stops (346), which shows how widely the results can vary depending on the data. A deeper statistical analysis over a broader set of sample data would be needed to thoroughly compare the two algorithms.

On the other hand, for the majority of test cases, the GA solution finds the shorter path, even though it could admittedly be improved with tuning. Like other ML methodologies, genetic algorithms benefit from hyperparameter tuning. The following table summarizes the hyperparameters used in our runs, and we could further tune them to improve the GA’s performance.

| Hyperparameter | Value used |
|---|---|
| Population size | 5,000 |
| Crossover rate | 50% |
| Mutation rate | 10% |
| Mutation method | 50/50 split between swap and displacement |
| Elitism rate | 10% |
| Tournament size | 2 |
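These values map onto the constants used in the code snippets above (MUTATION_RATE is the assumed constant from the mutation sketch); expressed in code, they would be:

POPULATION_SIZE = 5000
CROSSOVER_RATE = 0.50   # probability that two selected parents are crossed over
MUTATION_RATE = 0.10    # split 50/50 between swap and displacement mutation
ELITISM_RATE = 0.10
TOURNEY_SIZE = 2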

Conclusion and resources

Genetic algorithms are a powerful tool for solving optimization problems, and running them with SageMaker Processing lets you use multiple containers at once. You can also select instance types with useful characteristics, such as multiple vCPUs, to speed up each run.

If you’d like to learn more about GAs, see Genetic algorithm on Wikipedia, which contains a number of useful links. Although several GA frameworks exist, the code for a GA tends to be relatively simple (because there’s very little math) and you may be able to write the code yourself, or use the accompanying code in our GitHub repo, which includes the CloudFormation template that creates the required AWS infrastructure. Be sure to shut down the CloudFormation stack when you’re done, in order to avoid running up charges.

Although optimization problems are relatively rare compared to other ML applications like classification or regression, when you need to solve one, a genetic algorithm is usually a good option, and SageMaker Processing makes it easy.


About the Author

Greg Sommerville is a Prototyping Architect on the AWS Envision Engineering Americas Prototyping team, where he helps AWS customers implement innovative solutions to challenging problems with machine learning, IoT and serverless technologies. He lives in Ann Arbor, Michigan and enjoys practicing yoga, catering to his dogs, and playing poker.
