Estimating 3D pose for athlete tracking using 2D videos and Amazon SageMaker Studio

In preparation for the upcoming Olympic Games, Intel®, an American multinational corporation and one of the world’s largest technology companies, developed a concept around 3D Athlete Tracking (3DAT). 3DAT is a machine learning (ML) solution to create real-time digital models of athletes in competition in order to increase fan engagement during broadcasts. Intel was looking to leverage this technology for the purpose of coaching and training elite athletes.

Classical computer vision methods for 3D pose reconstruction have proven cumbersome for most scientists, because these methods mostly rely on embedding additional sensors on the athlete and because 3D labels and models are scarce. Although we can put seamless data collection mechanisms in place using regular mobile phones, developing 3D models from 2D video data is a challenge, given the lack of depth information in 2D videos. Intel’s 3DAT team partnered with the Amazon ML Solutions Lab (MLSL) to develop 3D human pose estimation techniques on 2D videos in order to create a lightweight solution for coaches to extract biomechanics and other metrics of their athletes’ performance.

This unique collaboration brought together Intel’s rich history in innovation and Amazon ML Solution Lab’s computer vision expertise to develop a 3D multi-person pose estimation pipeline using 2D videos from standard mobile phones as inputs, with Amazon SageMaker Studio notebooks (SM Studio) as the development environment.

Jonathan Lee, Director of Intel Sports Performance, Olympic Technology Group, says, “The MLSL team did an amazing job listening to our requirements and proposing a solution that would meet our customers’ needs. The team surpassed our expectations, developing a 3D pose estimation pipeline using 2D videos captured with mobile phones in just two weeks. By standardizing our ML workload on Amazon SageMaker, we achieved a remarkable 97% average accuracy on our models.”

This post discusses how we employed 3D pose estimation models and generated 3D outputs on 2D video data collected from Ashton Eaton, a decathlete and two-time Olympic gold medalist from the United States, using different angles. It also presents two computer vision techniques to align the videos captured from different angles, thereby allowing coaches to use a unique set of 3D coordinates across the run.

Challenges

Human pose estimation techniques use computer vision to provide a graphical skeleton of each person detected in a scene. The skeleton includes coordinates of predefined key points corresponding to human joints, such as the arms, neck, and hips. These coordinates are used to capture the body’s orientation for further analysis, such as pose tracking, posture analysis, and subsequent evaluation. Recent advances in computer vision and deep learning have enabled scientists to explore pose estimation in 3D space, where the Z-axis provides additional insights compared to 2D pose estimation. These additional insights can be used for more comprehensive visualization and analysis. However, building a 3D pose estimation model from scratch is challenging because it requires imaging data along with 3D labels; therefore, many researchers employ pretrained 3D pose estimation models.

Data processing pipeline

We designed an end-to-end 3D pose estimation pipeline illustrated in the following diagram using SM Studio, which encompassed several components:

  • Amazon Simple Storage Service (Amazon S3) bucket to host video data
  • Frame extraction module to convert video data to static images
  • Object detection modules to detect bounding boxes of persons in each frame
  • 2D pose estimation for future evaluation purposes
  • 3D pose estimation module to generate 3D coordinates for each person in each frame
  • Evaluation and visualization modules

SM Studio offers a broad range of features that facilitate the development process, including easy access to data in Amazon S3, on-demand compute capacity, preinstalled software and libraries, and an integrated development environment (IDE) for ML applications.

First, we read the video data from the S3 bucket and extracted the 2D frames in a portable network graphics (PNG) format for frame-level development. We used YOLOv3 object detection to generate a bounding box of each person detected in a frame. For more information, see Benchmarking Training Time for CNN-based Detectors with Apache MXNet.

Next, we passed the frames and corresponding bounding box information to the 3D pose estimation model to generate the key points for evaluation and visualization. We applied a 2D pose estimation technique to the frames, and we generated the key points per frame for development and evaluation. The following sections discuss the details of each module in the 3D pipeline.

Data preprocessing

The first step was to extract frames from a given video utilizing OpenCV, as shown in the following figure. We used two counters to keep track of time and frame count, respectively, because videos were captured at different frames per second (FPS) rates. We then stored the sequence of images as video_name + second_count + frame_count in PNG format.
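
The following is a minimal sketch of this extraction step using OpenCV; the function name, output directory handling, and exact file-naming format are illustrative assumptions rather than the production implementation.

import os
import cv2

def extract_frames(video_path, output_dir):
    # Extract frames and name them by video name, second count, and frame count
    os.makedirs(output_dir, exist_ok=True)
    video_name = os.path.splitext(os.path.basename(video_path))[0]
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    second_count, frame_count = 0, 0
    while True:
        success, frame = cap.read()
        if not success:
            break
        if frame_count == int(round(fps)):  # roll over to the next second
            second_count += 1
            frame_count = 0
        out_name = '{}_{:02d}_{:03d}.png'.format(video_name, second_count, frame_count)
        cv2.imwrite(os.path.join(output_dir, out_name), frame)
        frame_count += 1
    cap.release()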

Object (person) detection

We employed YOLOv3 pretrained models based on the Pascal VOC dataset to detect persons in frames. For more information, see Deploying custom models built with Gluon and Apache MXNet on Amazon SageMaker. The YOLOv3 algorithm produced the bounding boxes shown in the following animations (the original images are resized to 910×512 pixels).

We stored the bounding box coordinates in a CSV file, in which each row contained the frame index, the bounding box coordinates as a list, and the confidence score.
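
The following is a hedged sketch of this detection step using GluonCV’s pretrained yolo3_darknet53_voc model; the function name, score threshold, and CSV layout are illustrative assumptions.

import csv
from gluoncv import model_zoo, data

detector = model_zoo.get_model('yolo3_darknet53_voc', pretrained=True)

def detect_persons(frame_paths, csv_path, score_threshold=0.5):
    # Write one row per detected person: frame index, bounding box, confidence score
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['frame_index', 'bbox', 'score'])
        for idx, path in enumerate(frame_paths):
            x, img = data.transforms.presets.yolo.load_test(path, short=512)
            class_ids, scores, bboxes = detector(x)
            for cid, score, bbox in zip(class_ids[0], scores[0], bboxes[0]):
                if score.asscalar() < score_threshold:
                    continue
                if detector.classes[int(cid.asscalar())] == 'person':
                    writer.writerow([idx, bbox.asnumpy().tolist(), float(score.asscalar())])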

2D pose estimation

We selected Simple Pose with a pretrained ResNet-18 V1b backbone as the 2D pose estimation model, which uses a top-down strategy to estimate human poses within the bounding boxes output by the object detection model. We reset the detector classes to include only the person class so that the non-maximum suppression (NMS) process could run faster. The Simple Pose network was applied to predict heatmaps for key points (as in the following animation), and the highest values in the heatmaps were mapped to the coordinates on the original images.
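
The following is a minimal sketch of this 2D pose estimation step with GluonCV’s Simple Pose model; the helper function name and the 512-pixel resize are assumptions.

from gluoncv import model_zoo, data
from gluoncv.data.transforms.pose import detector_to_simple_pose, heatmap_to_coord

detector = model_zoo.get_model('yolo3_darknet53_voc', pretrained=True)
detector.reset_class(['person'], reuse_weights=['person'])  # keep only the person class for faster NMS
pose_net = model_zoo.get_model('simple_pose_resnet18_v1b', pretrained=True)

def estimate_2d_pose(frame_path):
    x, img = data.transforms.presets.yolo.load_test(frame_path, short=512)
    class_ids, scores, bboxes = detector(x)
    # Crop each detected person and prepare inputs for the top-down pose network
    pose_input, upscale_bbox = detector_to_simple_pose(img, class_ids, scores, bboxes)
    heatmaps = pose_net(pose_input)
    # Map the heatmap maxima back to coordinates in the original image
    coords, confidence = heatmap_to_coord(heatmaps, upscale_bbox)
    return coords, confidence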

3D pose estimation

We employed 3DMPPE (Moon et al.), a state-of-the-art 3D pose estimation algorithm based on a camera distance-aware top-down method for multi-person 3D pose estimation from a single RGB frame. This algorithm consists of two major phases:

  • RootNet – Estimates the camera-centered coordinates of a person’s root in a cropped frame
  • PoseNet – Uses a top-down approach to predict the relative 3D pose coordinates in the cropped image

Next, we used the bounding box information to project the 3D coordinates back to the original space. 3DMPPE offers two pretrained models, trained on the Human3.6M and MuCo-3DHP datasets (for more information, see the GitHub repo), which include 17 and 21 key points, respectively, as illustrated in the following animations. We used the 3D pose coordinates predicted by the two pretrained models for visualization and evaluation purposes.

Evaluation

To evaluate the 2D and 3D pose estimation models’ performance, we used the 2D pose (x, y) and 3D pose (x, y, z) coordinates generated for each joint in every frame of a given video. The number of key points varies across datasets; for instance, the Leeds Sports Pose (LSP) dataset includes 14 key points, whereas the MPII Human Pose dataset, a state-of-the-art benchmark for evaluating articulated human pose estimation, includes 16. We used two metrics common to both 2D and 3D pose estimation, described in the following sections. In our implementation, the default key point dictionary followed the COCO detection dataset, which has 17 key points (see the following image), in the following order:

KEYPOINTS = {
    0: "nose",
    1: "left_eye",
    2: "right_eye",
    3: "left_ear",
    4: "right_ear",
    5: "left_shoulder",
    6: "right_shoulder",
    7: "left_elbow",
    8: "right_elbow",
    9: "left_wrist",
    10: "right_wrist",
    11: "left_hip",
    12: "right_hip",
    13: "left_knee",
    14: "right_knee",
    15: "left_ankle",
    16: "right_ankle"
}

Mean per joint position error

Mean per joint position error (MPJPE) is the Euclidean distance between a ground truth joint and the corresponding predicted joint. Because MPJPE measures an error distance, lower values indicate greater precision.

We use the following pseudo code (a NumPy sketch follows the list):

  • Let G denote ground_truth_joint and preprocess G by:
    • Replacing the null entries in G with [0,0] (2D) or [0,0,0] (3D)
    • Using a Boolean matrix B to store the location of null entries
  • Let P denote predicted_joint matrix, and align G and P by frame index by inserting a zero vector if any frame doesn’t have results or is unlabeled
  • Compute element-wise Euclidean distance between G and P, and let D denote distance matrix
  • Replace Di,j with 0 if Bi,j = 1 (that is, the ground truth entry is null)
  • The mean per joint position error is the mean of the nonzero entries Di,j ≠ 0 in each column of D
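
The following NumPy sketch implements this pseudo code; the array shapes and the zero-fill convention for unlabeled joints are assumptions consistent with the steps above.

import numpy as np

def mpjpe(ground_truth, predicted):
    # ground_truth, predicted: (num_frames, num_joints, dims) arrays, with dims = 2 or 3
    G = np.asarray(ground_truth, dtype=float)
    P = np.asarray(predicted, dtype=float)
    B = np.all(G == 0, axis=-1)          # Boolean mask of null (unlabeled) entries
    D = np.linalg.norm(G - P, axis=-1)   # element-wise Euclidean distance
    D[B] = 0.0                           # zero out distances for null entries
    labeled = np.maximum((~B).sum(axis=0), 1)
    return D.sum(axis=0) / labeled       # mean error per joint over labeled frames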

The following figure visualizes an example of a video’s per joint position error: a matrix with dimension m*n, where m denotes the number of frames in the video and n denotes the number of joints (key points). The figure shows a heatmap of the per joint position error on the left and the mean per joint position error on the right.


Percentage of correct key points

The percentage of correct key points (PCK) represents a pose evaluation metric where a detected joint is considered correct if the distance between the predicted and actual joint is within a certain threshold; this threshold may vary, which leads to a few different variations of metrics. Three variations are commonly used:

  • PCKh@0.5, which is when the threshold is defined as 0.5 * head bone link
  • PCK@0.2, which is when the distance between the predicted and actual joint is < 0.2 * torso diameter
  • 150mm as a hard threshold

In our solution, we used PCKh@0.5 because our ground truth XML data contains the head bounding box, from which we can compute the head-bone link. To the best of our knowledge, no existing package contains an easy-to-use implementation of this metric; therefore, we implemented it in-house.

Pseudo code

We used the following pseudo code:

  • Let G denote ground-truth joint and preprocess G by:
    • Replacing the null entries in G with [0,0] (2D) or [0,0,0] (3D)
    • Using a Boolean matrix B to store the location of null entries
  • For each frame Fi, use its head bounding box Bi = (xmin, ymin, xmax, ymax) to compute the frame’s corresponding head-bone link Hi = ((xmax − xmin)² + (ymax − ymin)²)½
  • Let P denote predicted joint matrix and align G and P by frame index; insert a zero tensor if any frame is missing
  • Compute the element-wise 2-norm error between G and P; let E denote error matrix, where Ei,j=||Gi,j-Pi,j||
  • Compute a scaled matrix S=H*I, where I represents an identity matrix with the same dimension as E
  • To avoid division by 0, replace Si,j with 0.000001 if Bi,j=1
  • Compute the scaled error matrix SE, where SEi,j = Ei,j / Si,j
  • Threshold SE at 0.5, and let C denote the counter matrix, where Ci,j = 1 if SEi,j < 0.5 and Ci,j = 0 otherwise
  • Count how many 1’s in C*,j as c⃗ and count how many 0’s in B*,j as b⃗
  • PCKh@0.5=mean(c⃗/b⃗)

In the sixth step (replacing Si,j with 0.000001 if Bi,j = 1), we set a trap for the null entries in the scaled error matrix. Dividing a nonzero error by such a tiny number produces a very large scaled error, so when we later applied the 0.5 threshold, these entries were excluded from the correct predictions. We then counted only the non-null entries indicated by the Boolean matrix, which also excluded the null entries from the denominator. This engineering trick filters out null entries arising from either unlabeled key points in the ground truth or frames with no person detected.
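
The following NumPy sketch mirrors this implementation; the input shapes, the per-frame head-bone link vector, and the explicit exclusion of null entries are assumptions based on the pseudo code above.

import numpy as np

def pckh_at_05(ground_truth, predicted, head_bone_links, threshold=0.5):
    # ground_truth, predicted: (num_frames, num_joints, 2); unlabeled joints are zeros
    # head_bone_links: (num_frames,) head-bone lengths computed from the head bounding boxes
    G = np.asarray(ground_truth, dtype=float)
    P = np.asarray(predicted, dtype=float)
    B = np.all(G == 0, axis=-1)                        # null-entry mask
    E = np.linalg.norm(G - P, axis=-1)                 # per joint error
    S = np.tile(np.asarray(head_bone_links, dtype=float)[:, None], (1, G.shape[1]))
    S[B] = 1e-6                                        # trap: tiny scale for null entries
    SE = E / S                                         # scaled error
    C = (SE < threshold) & (~B)                        # correct predictions on labeled joints
    labeled = np.maximum((~B).sum(axis=0), 1)
    return float(np.mean(C.sum(axis=0) / labeled))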

Video alignment

We considered two different camera configurations to capture video data from athletes, namely the line and box setups. The line setup consists of four cameras placed along a line, whereas the box setup consists of four cameras placed at each corner of a rectangle. In the line configuration, the cameras were synchronized and placed at a predefined distance from each other, with slightly overlapping camera angles. The objective of video alignment in the line configuration was to identify the timestamps connecting consecutive cameras in order to remove repeated and empty frames. We implemented two approaches, based on object detection and on cross-correlation of optical flows.

Object detection algorithm

In this approach, we used the object detection results from the previous steps, including the persons’ bounding boxes. The object detection model produced a probability (score) per person in each frame, so plotting the scores over the course of a video enabled us to find the frames where the first person appeared or disappeared. In the box configuration, a reference frame was extracted from each video, and all cameras were then synchronized based on these reference frames. In the line configuration, both the start and end timestamps were extracted, and a rule-based algorithm was implemented to connect and align consecutive videos, as illustrated in the following images.

The top row in the following figure shows the original videos in the line configuration, with the person detection scores underneath. The next rows show a threshold of 0.75 applied to the scores, from which the appropriate start and end timestamps are extracted. The bottom row shows the aligned videos for further analysis.
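
The following is a simplified sketch of the thresholding step, assuming a per-frame array of person detection scores; the rule-based logic that stitches consecutive cameras together is omitted.

import numpy as np

def find_entry_exit_frames(scores, threshold=0.75):
    # scores: per-frame maximum person detection confidence (0 where no person is detected)
    above = np.asarray(scores) > threshold
    if not above.any():
        return None, None
    indices = np.flatnonzero(above)
    return int(indices[0]), int(indices[-1])  # first and last frames containing a person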

Moment of snap

We introduced the moment of snap (MOS), a well-known alignment approach that indicates when an event or play begins. Here, we wanted to determine the frame number at which an athlete enters or leaves the scene. Typically, relatively little movement occurs on the running field before and after the run, whereas relatively substantial movement occurs while the athlete is running. Therefore, intuitively, we can find the MOS frame by looking for the frame with relatively large differences in the video’s movement before and after it. To this end, we used dense optical flow, a standard measure of movement in video, to estimate the MOS. First, given a video, we computed the optical flow for every pair of consecutive frames. The following videos present a visualization of dense optical flow along the horizontal axis.

We then measured the cross-correlation between consecutive frames’ optical flows, because cross-correlation captures how much the movement changes between them. We repeated this algorithm for the video captured from each camera angle to find its MOS. Finally, we used the MOS frame as the key frame for aligning videos from different angles. The following video details these steps.
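
The following is a simplified sketch of the MOS estimation using dense (Farneback) optical flow; it uses the mean flow magnitude as a per-frame movement score and picks the frame with the largest before/after difference, rather than the exact cross-correlation formulation used in our pipeline.

import cv2
import numpy as np

def estimate_mos(video_path):
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    movement = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense optical flow between consecutive frames
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        movement.append(float(np.linalg.norm(flow, axis=-1).mean()))
        prev_gray = gray
    cap.release()
    movement = np.asarray(movement)
    # MOS: frame index with the largest difference in average movement before vs. after it
    diffs = [abs(movement[i:].mean() - movement[:i].mean()) for i in range(1, len(movement))]
    return int(np.argmax(diffs)) + 1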

Conclusion

The technical objective of the work demonstrated in this post was to develop a deep-learning based solution producing 3D pose estimation coordinates using 2D videos. We employed a camera distance-aware technique with a top-down approach to achieve 3D multi-person pose estimation. Further, using object detection, cross-correlation, and optical flow algorithms, we aligned the videos captured from different angles.

This work has enabled coaches to analyze 3D pose estimation of athletes over time to measure biomechanics metrics, such as velocity, and monitor the athletes’ performance using quantitative and qualitative methods.

This post demonstrated a simplified process for extracting 3D poses in real-world scenarios, which can be scaled to coaching in other sports such as swimming or team sports.

If you would like help with accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab program.

References

Moon, Gyeongsik, Ju Yong Chang, and Kyoung Mu Lee. “Camera distance-aware top-down approach for 3d multi-person pose estimation from a single RGB image.” In Proceedings of the IEEE International Conference on Computer Vision, pp. 10133-10142. 2019.


About the Author

Saman Sarraf is a Data Scientist at the Amazon ML Solutions Lab. His background is in applied machine learning including deep learning, computer vision, and time series data prediction.

 

 

 

Amery Cong is an Algorithms Engineer at Intel, where he develops machine learning and computer vision technologies to drive biomechanical analyses at the Olympic Games. He is interested in quantifying human physiology with AI, especially in a sports performance context.

 

 

Ashton Eaton is a Product Development Engineer at Intel, where he helps design and test technologies aimed at advancing sport performance. He works with customers and the engineering team to identify and develop products that serve customer needs. He is interested in applying science and technology to human performance.

 

 

Jonathan Lee is the Director of Sports Performance Technology, Olympic Technology Group at Intel. He studied the application of machine learning to health as an undergrad at UCLA and during his graduate work at University of Oxford. His career has focused on algorithm and sensor development for health and human performance. He now leads the 3D Athlete Tracking project at Intel.

 

 

Nelson Leung is the Platform Architect in the Sports Performance CoE at Intel, where he defines end-to-end architecture for cutting-edge products that enhance athlete performance. He also leads the implementation, deployment and productization of these machine learning solutions at scale to different Intel partners.

 

 

Suchitra Sathyanarayana is a manager at the Amazon ML Solutions Lab, where she helps AWS customers across different industry verticals accelerate their AI and cloud adoption. She holds a PhD in Computer Vision from Nanyang Technological University, Singapore.

 

 

Wenzhen Zhu is a data scientist with the Amazon ML Solution Lab team at Amazon Web Services. She leverages Machine Learning and Deep Learning to solve diverse problems across industries for AWS customers.

Read More

Implement checkpointing with TensorFlow for Amazon SageMaker Managed Spot Training

Customers often ask us how they can lower their costs when conducting deep learning training on AWS. Training deep learning models with libraries such as TensorFlow, PyTorch, and Apache MXNet usually requires access to GPU instances, which are AWS instance types that provide access to NVIDIA GPUs with thousands of compute cores. GPU instance types can be more expensive than other Amazon Elastic Compute Cloud (Amazon EC2) instance types, so optimizing usage of these types of instances is a priority for customers as well as an overall best practice for well-architected workloads.

Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to prepare, build, train, and deploy machine learning (ML) models quickly. SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models. SageMaker provides all the components used for ML in a single toolset so models get to production faster with less effort and at lower cost.

Amazon EC2 Spot Instances offer spare compute capacity available in the AWS Cloud at steep discounts compared to On-Demand prices. Amazon EC2 can interrupt Spot Instances with 2 minutes of notification when the service needs the capacity back. You can use Spot Instances for various fault-tolerant and flexible applications. Some examples are analytics, containerized workloads, stateless web servers, CI/CD, training and inference of ML models, and other test and development workloads. Spot Instance pricing makes high-performance GPUs much more affordable for deep learning researchers and developers who run training jobs.

One of the key benefits of SageMaker is that it frees you of any infrastructure management, no matter the scale you’re working at. For example, instead of having to set up and manage complex training clusters, you simply tell SageMaker which EC2 instance type to use and how many you need. The appropriate instances are then created on-demand, configured, and stopped automatically when the training job is complete. As SageMaker customers have quickly understood, this means that they pay only for what they use. Building, training, and deploying ML models are billed by the second, with no minimum fees, and no upfront commitments. SageMaker can also use EC2 Spot Instances for training jobs, which optimize the cost of the compute used for training deep-learning models.

In this post, we walk through the process of training a TensorFlow model with Managed Spot Training in SageMaker. We walk through the steps required to set up and run a training job that saves training progress in Amazon Simple Storage Service (Amazon S3) and restarts the training job from the last checkpoint if an EC2 instance is interrupted. This allows our training jobs to continue from the same point before the interruption occurred. Finally, we see the savings that we achieved by running our training job on Spot Instances using Managed Spot Training in SageMaker.

Managed Spot Training in SageMaker

SageMaker makes it easy to train ML models using managed EC2 Spot Instances. Managed Spot Training can reduce the cost of training models by up to 90% compared to On-Demand Instances. With only a few lines of code, SageMaker can manage Spot interruptions on your behalf.

Managed Spot Training uses EC2 Spot Instances to run training jobs instead of On-Demand Instances. You can specify which training jobs use Spot Instances and a stopping condition that specifies how long SageMaker waits for a training job to complete using EC2 Spot Instances. Metrics and logs generated during training runs are available in Amazon CloudWatch.

Managed Spot Training is available in all training configurations:

  • All instance types supported by SageMaker
  • All models: built-in algorithms, built-in frameworks, and custom models
  • All configurations: single instance training and distributed training

Interruptions and checkpointing

There’s an important difference when working with Managed Spot Training. Unlike On-Demand Instances that are expected to be available until a training job is complete, Spot Instances may be reclaimed any time Amazon EC2 needs the capacity back.

SageMaker, as a fully managed service, handles the lifecycle of Spot Instances automatically. It interrupts the training job, attempts to obtain Spot Instances again, and either restarts or resumes the training job.

To avoid restarting a training job from scratch if it’s interrupted, we strongly recommend that you implement checkpointing, a technique that saves the model in training at periodic intervals. When you use checkpointing, you can resume a training job from a well-defined point in time, continuing from the most recent partially trained model, and avoiding starting from the beginning and wasting compute time and money.

To implement checkpointing, we have to make a distinction on the type of algorithm you use:

  • Built-in frameworks and custom models – You have full control over the training code. Just make sure that you use the appropriate APIs to save model checkpoints to Amazon S3 regularly, using the location you defined in the CheckpointConfig parameter and passed to the SageMaker Estimator. TensorFlow uses checkpoints by default. For other frameworks, see our sample notebooks and Use Machine Learning Frameworks, Python, and R with Amazon SageMaker.
  • Built-in algorithms – Computer vision algorithms support checkpointing (object detection, semantic segmentation, and image classification). Because they tend to train on large datasets and run for longer than other algorithms, they have a higher likelihood of being interrupted. The XGBoost built-in algorithm also supports checkpointing.

TensorFlow image classification model with Managed Spot Training

To demonstrate Managed Spot Training and checkpointing, I guide you through the steps needed to train a TensorFlow image classification model. To make sure that your training scripts can take advantage of SageMaker Managed Spot Training, we need to implement the following:

  • Frequent saving of checkpoints (in our case, at the end of each epoch)
  • The ability to resume training from checkpoints if checkpoints exist

Save checkpoints

SageMaker automatically backs up and syncs checkpoint files generated by your training script to Amazon S3. Therefore, you need to make sure that your training script saves checkpoints to a local checkpoint directory on the Docker container that’s running the training. The default location to save the checkpoint files is /opt/ml/checkpoints, and SageMaker syncs these files to the specific S3 bucket. Both local and S3 checkpoint locations are customizable.

Saving checkpoints using Keras is very easy. You need to create an instance of the ModelCheckpoint callback class and register it with the model by passing it to the fit() function.

You can find the full implementation code on the GitHub repo.

The following is the relevant code:

callbacks = []
callbacks.append(ModelCheckpoint(args.checkpoint_path + '/checkpoint-{epoch}.h5'))

logging.info("Starting training from epoch: {}".format(initial_epoch_number+1))
    
model.fit(x=train_dataset[0],
          y=train_dataset[1],
          steps_per_epoch=num_examples_per_epoch('train') // args.batch_size,  # batch-size divisor is an assumption
          epochs=args.epochs,
          initial_epoch=initial_epoch_number,
          validation_data=validation_dataset,
          validation_steps=num_examples_per_epoch('validation') // args.batch_size,  # batch-size divisor is an assumption
          callbacks=callbacks) 

The names of the checkpoint files saved are as follows: checkpoint-1.h5, checkpoint-2.h5, checkpoint-3.h5, and so on.

For this post, I’m passing initial_epoch, which you normally don’t set. This lets us resume training from a certain epoch number and comes in handy when you already have checkpoint files.

The checkpoint path is configurable because we get it from args.checkpoint_path in the main function:

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    
    ...
    parser.add_argument("--checkpoint-path",type=str,default="/opt/ml/checkpoints",help="Path where checkpoints will be saved.")
    ...
    
    args = parser.parse_args()

Resume training from checkpoint files

When Spot capacity becomes available again after Spot interruption, SageMaker launches a new Spot Instance, instantiates a Docker container with your training script, copies your dataset and checkpoint files from Amazon S3 to the container, and runs your training scripts.

Your script needs to implement resuming training from checkpoint files, otherwise your training script restarts training from scratch. You can implement a load_model_from_checkpoints function as shown in the following code. It takes in the local checkpoint files path (/opt/ml/checkpoints being the default) and returns a model loaded from the latest checkpoint and the associated epoch number.

You can find the full implementation code on the GitHub repo.

The following is the relevant code:

def load_model_from_checkpoints(checkpoint_path):
    checkpoint_files = [file for file in os.listdir(checkpoint_path) if file.endswith('.' + 'h5')]
    logging.info('--------------------------------------------')
    logging.info("Available checkpoint files: {}".format(checkpoint_files))
    # Extract the numeric epoch from file names like checkpoint-12.h5
    epoch_numbers = [re.search(r'(\d+)(?=\.h5)', file).group() for file in checkpoint_files]

    max_epoch_number = max(epoch_numbers, key=int)  # compare numerically, not lexicographically
    max_epoch_index = epoch_numbers.index(max_epoch_number)
    max_epoch_filename = checkpoint_files[max_epoch_index]

    logging.info('Latest epoch checkpoint file name: {}'.format(max_epoch_filename))
    logging.info('Resuming training from epoch: {}'.format(int(max_epoch_number)+1))
    logging.info('---------------------------------------------')
    
    resumed_model_from_checkpoints = load_model(f'{checkpoint_path}/{max_epoch_filename}')
    return resumed_model_from_checkpoints, int(max_epoch_number)

Managed Spot Training with a TensorFlow estimator

You can launch SageMaker training jobs from your laptop, desktop, EC2 instance, or SageMaker notebook instances. Make sure you have the SageMaker Python SDK installed and the right user permissions to run SageMaker training jobs.

To run a Managed Spot Training job, you need to specify a few additional options in your standard SageMaker Estimator function call:

  • use_spot_instances – Specifies whether to use SageMaker Managed Spot Training for training. If enabled, you should also set the max_wait argument.
  • max_wait – Timeout in seconds waiting for Spot training instances (default: None). After this amount of time, SageMaker stops waiting for Spot Instances to become available or for the training job to finish. I’m willing to wait up to double the time an On-Demand training takes, so I set it to 1,200 seconds (20 minutes).
  • max_run – Timeout in seconds for training (default: 24 * 60 * 60). After this amount of time, SageMaker stops the job regardless of its current status. From previous runs, I know that the training job finishes in about 4 minutes, so I set it to 600 seconds.
  • checkpoint_s3_uri – The S3 URI in which to persist checkpoints that the algorithm persists (if any) during training.

You can find the full implementation code on the GitHub repo.

The following is the relevant code:

use_spot_instances = True
max_run=600
max_wait = 1200

checkpoint_suffix = str(uuid.uuid4())[:8]
checkpoint_s3_uri = 's3://{}/checkpoint-{}'.format(bucket, checkpoint_suffix)
hyperparameters = {'epochs': 5, 'batch-size': 256}

spot_estimator = TensorFlow(entry_point='cifar10_keras_main.py',
                       source_dir='source_dir',
                       metric_definitions=metric_definitions,
                       hyperparameters=hyperparameters,
                       role=role,
                       framework_version='1.15.2',
                       py_version='py3',
                       instance_count=1,
                       instance_type='ml.p3.2xlarge',
                       base_job_name='cifar10-tf-spot-1st-run',
                       tags=tags,
                       checkpoint_s3_uri=checkpoint_s3_uri,
                       use_spot_instances=use_spot_instances,
                       max_run=max_run,
                       max_wait=max_wait)

Those are all the changes you need to make to significantly lower your cost of ML training.

To monitor your training job and view savings, you can look at the logs on your Jupyter notebook.

Towards the end of the job, you should see two lines of output:

  • Training seconds: X – The actual compute time your training job spent
  • Billable seconds: Y – The time you are billed for after Spot discounting is applied.

If you enabled use_spot_instances, you should see a notable difference between X and Y, signifying the cost savings you get for using Managed Spot Training. This is reflected in an additional line:

  • Managed Spot Training savings – Calculated as (1-Y/X)*100 %

The following screenshot shows the output logs for our Jupyter notebook:

When the training is complete, you can also navigate to the Training jobs page on the SageMaker console and choose your training job to see how much you saved.

For this example training job of a TensorFlow model, my training job ran for 144 seconds, but I was billed for only 43 seconds, so for a 5-epoch training run on an ml.p3.2xlarge GPU instance, I was able to save 70% on training cost!
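
As a quick sanity check, the savings line simply applies the formula above to these two numbers:

training_seconds = 144
billable_seconds = 43
savings = (1 - billable_seconds / training_seconds) * 100
print('Managed Spot Training savings: {:.0f}%'.format(savings))  # roughly 70%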

Confirm that checkpointing and recovery work when your training job is interrupted

How can you test if your training job will resume properly if a Spot Interruption occurs?

If you’re familiar with running EC2 Spot Instances, you know that you can simulate your application behavior during a Spot Interruption by following the recommended best practices. However, because SageMaker is a managed service, and manages the lifecycle of EC2 instances on your behalf, you can’t stop a SageMaker training instance manually. Your only option is to stop the entire training job.

You can still test your code’s behavior when resuming an incomplete training by running a shorter training job, and then using the outputted checkpoints from that training job as inputs to a longer training job. To do this, first run a SageMaker Managed Spot Training job for a specified number of epochs as described in the previous section. Let’s say you run training for five epochs. SageMaker would have backed up your checkpoint files to the specified S3 location for the five epochs.

You can navigate to the training job details page on the SageMaker console to see the checkpoint configuration S3 output path.

Choose the S3 output path link to navigate to the checkpointing S3 bucket, and verify that five checkpoint files are available there.

Now run a second training run with 10 epochs. You should provide the first job’s checkpoint location to checkpoint_s3_uri so the training job can use those checkpoints as inputs to the second training job.

You can find the full implementation code in the GitHub repo.

The following is the relevant code:

hyperparameters = {'epochs': 10, 'batch-size': 256}

spot_estimator = TensorFlow(entry_point='cifar10_keras_main.py',
                       source_dir='source_dir',
                       metric_definitions=metric_definitions,
                       hyperparameters=hyperparameters,
                       role=role,
                       framework_version='1.15.2',
                       py_version='py3',
                       instance_count=1,
                       instance_type='ml.p3.2xlarge',
                       base_job_name='cifar10-tf-spot-2nd-run',
                       tags=tags,
                       checkpoint_s3_uri=checkpoint_s3_uri,
                       use_spot_instances=use_spot_instances,
                       max_run=max_run,
                       max_wait=max_wait)

By providing checkpoint_s3_uri with your previous job’s checkpoints, you’re telling SageMaker to copy those checkpoints to your new job’s container. Your training script then loads the latest checkpoint and resumes training. The following screenshot shows that training resumes from the sixth epoch.

To confirm that all checkpoint files were created, navigate to the same S3 bucket. This time you can see that 10 checkpoint files are available.

The key difference between simulating an interruption this way and how SageMaker manages interruptions is that you’re creating a new training job to test your code. In the case of Spot Interruptions, SageMaker simply resumes the existing interrupted job.

Implement checkpointing with PyTorch, MXNet, and XGBoost built-in and script mode

The steps shown in the TensorFlow example are basically the same for PyTorch and MXNet. The code for saving checkpoints and loading them to resume training is different.

You can see full examples for TensorFlow 1.x/2.x, PyTorch, MXNet, and XGBoost built-in and script mode in the GitHub repo.

Conclusions and next steps

In this post, we trained a TensorFlow image classification model using SageMaker Managed Spot Training. We saved checkpoints locally in the container and loaded checkpoints to resume training if they existed. SageMaker takes care of synchronizing the checkpoints with Amazon S3 and the training container. We simulated a Spot interruption by running Managed Spot Training with 5 epochs, and then ran a second Managed Spot Training Job with 10 epochs, configuring the checkpoints’ S3 bucket of the previous job. This resulted in the training job loading the checkpoints stored in Amazon S3 and resuming from the sixth epoch.

It’s easy to save on training costs with SageMaker Managed Spot Training. With minimal code changes, you too can save over 70% when training your deep-learning models.

As a next step, try to modify your own TensorFlow, PyTorch, or MXNet script to implement checkpointing, and then run a Managed Spot Training in SageMaker to see that the checkpoint files are created in the S3 bucket you specified. Let us know how you do in the comments!


About the Author

Eitan Sela is a Solutions Architect with Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Eitan also helps customers build and operate machine learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.

Read More

HawkEye 360 uses Amazon SageMaker Autopilot to streamline machine learning model development for maritime vessel risk assessment

This post is cowritten by Ian Avilez and Tim Pavlick from HawkEye 360.

HawkEye 360 is a commercial radio frequency (RF) satellite constellation data analytics provider. Our signals of interest include very high frequency (VHF) push-to-talk radios, maritime radar systems, AIS beacons, satellite mobile comms, and more. Our Mission Space offering, released in February 2021, allows mission analysts to intuitively visualize RF signals and analytics, which allows them to identify activity and understand trends. This capability improves maritime situational awareness for mission analysts, allowing them to identify and characterize nefarious behavior such as illegal fishing or ship-to-ship transfer of illicit goods.

The following screenshot shows HawkEye 360’s Mission Space experience.

RF data can be overwhelming to the naked eye without filtering and advanced algorithms to parse through and characterize the vast amount of raw data. HawkEye 360 partnered with the Amazon ML Solutions Lab to build machine learning (ML) capabilities into our analytics. With the guidance of the Amazon ML Solutions Lab, we used Amazon SageMaker Autopilot to rapidly generate high-quality AI models for maritime vessel risk assessment, maintain full visibility and control of model creation, and provide the ability to easily deploy and monitor a model in a production environment.

Hidden patterns and relationships among vessel features

Seagoing vessels are distinguished by several characteristics relating to the vessel itself, its operation and management, and its historical behavior. It isn’t immediately clear which characteristics are indicative of a suspicious vessel. One of HawkEye 360’s missions is to discover hidden patterns and automatically alert analysts to anomalous maritime activity. HawkEye 360 accomplishes this alerting by using a diverse set of variables in combination with proprietary RF geo-analytics. A key focus of these efforts is to identify which vessels are more likely to engage in suspicious maritime activity, such as illegal fishing or ship-to-ship transfer of illicit goods. ML algorithms reveal hidden patterns, where they exist, that would otherwise be lost in the vast sea of complexity.

The following image demonstrates some of the existing pattern-finding behavior that has been built into Mission Space, which automatically identifies other instances of a suspicious vessel. Identifying the key features that are most predictive of suspicious behavior allows those features to be displayed directly in Mission Space. This enables users to understand links between bad actors that would otherwise never have been seen. Mission Space was purposefully designed to point out these connections to mission analysts.


Challenges detecting anomalous behavior with maritime vessels

HawkEye 360’s data lake includes a large volume of vessel information, history, and analytics variables. With such a wide array of RF data and analytics, some natural data handling issues must be addressed. Sporadic reporting by vessels results in missing values across datasets, and variations among data types must be taken into account. Previously, data exploration and baseline modeling typically took up a large chunk of an analyst’s time. After the data is prepared, a series of automatic experiments is run to narrow down to a set of the most promising AI models and, from there, in a stepwise fashion, to select the one most appropriate for the data and the research questions. For HawkEye 360, this automated exploration is key to determining which features, and feature combinations, are critical to predicting how likely a vessel is to engage in suspicious behavior.

We used Autopilot to expedite this process by quickly identifying which features of the data are useful in predicting suspicious behavior. Automation of data exploration and analysis enables our data scientists to spend less time wrangling data and manually engineering features, and expedites the ability to identify the vessel features that are most predictive of suspicious vessel behavior.

How we used Autopilot to quickly generate high-quality ML models

As part of an Autopilot job, several candidate models are generated rapidly for evaluation with a single API call. Autopilot inspected the data and evaluated several models to determine the optimal combination of preprocessing methods, ML algorithms, and hyperparameters. This significantly shortened the model exploration time frame and allowed us to quickly test the suitability of ML to our unique hypotheses.

The following code shows our setup and API call:

input_data_config = [{
      'DataSource': {
        'S3DataSource': {
          'S3DataType': 'S3Prefix',
          'S3Uri': 's3://{}/{}/train'.format(bucket,prefix)
        }
      },
      'TargetAttributeName': 'ship_sanctioned_ofac'
    }
  ]

output_data_config = {
    'S3OutputPath': 's3://{}/{}/output'.format(bucket,prefix)
  }

from time import gmtime, strftime, sleep
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())

auto_ml_job_name = 'automl-darkcount-' + timestamp_suffix
print('AutoMLJobName: ' + auto_ml_job_name)

sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      RoleArn=role)

Autopilot job process

An Autopilot job consists of the following actions:

  • Dividing the data into train and validation sets
  • Analyzing the data to recommend candidate configuration
  • Performing feature engineering to generate optimal transformed features appropriate for the algorithm
  • Tuning hyperparameters to generate a leaderboard of models
  • Surfacing the best candidate model based on the given evaluation metric

After Autopilot trains several models, it ranks the candidates based on a given metric (see the following code). For this application, we used the F1 score, which gives an even weight to both precision and recall. This is an important consideration when classes are imbalanced, as they are in this dataset.

candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name, SortBy='FinalObjectiveMetricValue')['Candidates']
index = 1
print("List of model candidates in descending objective metric:")
for candidate in candidates:
    print (str(index) + "  " + candidate['CandidateName'] + "  " + str(candidate['FinalAutoMLJobObjectiveMetric']['Value']))
    index += 1

The following code shows our output:

List of model candidates in descending objective metric:

1 tuning-job-1-1be4d5a5fb8e42bc84-238-e264d09f 0.9641900062561035
2 tuning-job-1-1be4d5a5fb8e42bc84-163-336eb2e7 0.9641900062561035
3 tuning-job-1-1be4d5a5fb8e42bc84-143-5007f7dc 0.9641900062561035
4 tuning-job-1-1be4d5a5fb8e42bc84-154-cab67dc4 0.9641900062561035
5 tuning-job-1-1be4d5a5fb8e42bc84-123-f76ad56c 0.9641900062561035
6 tuning-job-1-1be4d5a5fb8e42bc84-117-39eac182 0.9633200168609619
7 tuning-job-1-1be4d5a5fb8e42bc84-108-77addf80 0.9633200168609619
8 tuning-job-1-1be4d5a5fb8e42bc84-179-1f831078 0.9633200168609619
9 tuning-job-1-1be4d5a5fb8e42bc84-133-917ccdf1 0.9633200168609619
10 tuning-job-1-1be4d5a5fb8e42bc84-189-102070d9 0.9633200168609619

We can now create a model from the best candidate, which can be quickly deployed into production:

model_name = 'automl-darkcount-25-23-07-39'

# Retrieve the best candidate from the completed Autopilot job
best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['BestCandidate']

model = sm.create_model(Containers=best_candidate['InferenceContainers'],
                            ModelName=model_name,
                            ExecutionRoleArn=role)

print('Model ARN corresponding to the best candidate is : {}'.format(model['ModelArn']))

The following code shows our output:

Model ARN corresponding to the best candidate is : arn:aws:sagemaker:us-east-1:278150328949:model/automl-darkcount-25-23-07-39
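
As a hedged sketch, one way to host this model for real-time inference is to create an endpoint configuration and an endpoint with the same boto3 SageMaker client; the endpoint names and instance type below are illustrative.

endpoint_config_name = 'automl-darkcount-epc-' + timestamp_suffix
endpoint_name = 'automl-darkcount-ep-' + timestamp_suffix

# Endpoint configuration pointing at the model created from the best candidate
sm.create_endpoint_config(EndpointConfigName=endpoint_config_name,
                          ProductionVariants=[{'ModelName': model_name,
                                               'VariantName': 'AllTraffic',
                                               'InstanceType': 'ml.m5.xlarge',
                                               'InitialInstanceCount': 1}])

# Real-time endpoint hosting the model
sm.create_endpoint(EndpointName=endpoint_name,
                   EndpointConfigName=endpoint_config_name)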

Maintaining full visibility and control

The process to generate a model is completely transparent. Two notebooks are generated for any model Autopilot creates (you can retrieve their locations as shown after the following list):

  • Data exploration notebook – Describes your dataset and what Autopilot learned about your dataset
  • Model candidate notebook – Lists data transformations used as well as candidate model building pipelines consisting of feature transformers paired with main estimators
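
The Amazon S3 locations of both notebooks can be retrieved from the Autopilot job description, for example:

job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
artifacts = job['AutoMLJobArtifacts']
print('Data exploration notebook: ' + artifacts['DataExplorationNotebookLocation'])
print('Candidate definition notebook: ' + artifacts['CandidateDefinitionNotebookLocation'])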

Conclusion

We used Autopilot to quickly generate many candidate models to determine ML feasibility and baseline ML performance on the vessel data. The automation Autopilot provides allowed our data scientists to spend 50% less time developing ML capabilities by automating the manual tasks such as data analysis, feature engineering, model development, and model deployment.

With HawkEye 360’s new RF data analysis application, Mission Space, identifying which vessels have the potential to engage in suspicious activity allows users to easily know where to focus their scarce attention and investigate further. Expediting the data understanding and model creation allows for cutting-edge insights to be quickly assimilated into Mission Space, which accelerates the evolution of Mission Space’s capabilities as shown in the following map. We can see that a Mission Analyst identified a specific rendezvous (highlighted in magenta) and Mission Space automatically identified other related rendezvous (in purple).


For more information about HawkEye 360’s Mission Space offering, see Mission Space.

If you’d like assistance in accelerating the use of ML in your products and services, contact the Amazon ML Solutions Lab.


About the Authors

  Tim Pavlick, PhD, is VP of Product at HawkEye 360. He is responsible for the conception, creation, and productization of all HawkEye space innovations. Mission Space is HawkEye 360’s flagship product, incorporating all the data and analytics from the HawkEye portfolio into one intuitive RF experience. Dr. Pavlick’s prior invention contributions include Myca, IBM’s AI Career Coach, Grit PTSD monitor for Veterans, IBM Defense Operations Platform, Smarter Planet Intelligent Operations Center, AI detection of dangerous hate speech on the internet, and the STORES electronic food ordering system for the US military. Dr. Pavlick received his PhD in Cognitive Psychology from the University of Maryland College Park.

Ian Avilez is a Data Scientist with HawkEye 360. He works with customers to highlight the insights that can be gained by combining different datasets and looking at that data in various ways.

 

 

 

Dan Ford is a Data Scientist at the Amazon ML Solution Lab, where he helps AWS National Security customers build state-of-the-art ML solutions.

 

 

 

Gaurav Rele is a Data Scientist at the Amazon ML Solution Lab, where he works with AWS customers across different verticals to accelerate their use of machine learning and AWS Cloud services to solve their business challenges.

Read More

Protecting people from hazardous areas through virtual boundaries with Computer Vision

As companies welcome more autonomous robots and other heavy equipment into the workplace, we need to ensure equipment can operate safely around human teammates. In this post, we will show you how to build a virtual boundary with computer vision and AWS DeepLens, the AWS deep learning-enabled video camera designed for developers to learn machine learning (ML). Using the machine learning techniques in this post, you can build virtual boundaries for restricted areas that automatically shut down equipment or sound an alert when humans come close.

For this project, you will train a custom object detection model with Amazon SageMaker and deploy the model to an AWS DeepLens device. Object detection is an ML algorithm that takes an image as input and identifies objects and their location within the image. In addition to virtual boundary solutions, you can apply techniques learned in this post when you need to detect where certain objects are inside an image or count the number of instances of a desired object in an image, such as counting items in a storage bin or on a retail shelf.

Solution overview

The walkthrough includes the following steps:

  1. Prepare your dataset to feed into an ML algorithm.
  2. Train a model with Amazon SageMaker.
  3. Test model with custom restriction zones.
  4. Deploy the solution to AWS DeepLens.

We also discuss other real-world use cases where you can apply this solution.

The following diagram illustrates the solution architecture.

Prerequisites

To complete this walkthrough, you must have the following prerequisites:

Prepare your dataset to feed into an ML algorithm

This post uses an ML algorithm called an object detection model to build a solution that detects whether a person is in a custom restricted zone. You use the Pedestrian Detection dataset that is publicly available on Kaggle, which has over 2,000 images. This dataset has labels for human and human-like objects (like mannequins) so the trained model can more accurately distinguish between real humans and cardboard props or statues.

For example, the following images show a construction worker being detected and whether they are in the custom restriction zone (red outline).

To start training your model, first create an S3 bucket to store your training data and model output. For AWS DeepLens projects, the S3 bucket names must start with the prefix deeplens-. You use this data to train a model with SageMaker, a fully managed service that provides the ability to build, train, and deploy ML models quickly.
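
The following snippet shows one way to create such a bucket with boto3; the bucket name is illustrative, and buckets outside us-east-1 also need a CreateBucketConfiguration with a LocationConstraint.

import boto3

s3 = boto3.client('s3')
# The bucket name must start with the deeplens- prefix so AWS DeepLens can access it
s3.create_bucket(Bucket='deeplens-pedestrian-detection-demo')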

Train a model with Amazon SageMaker

You use SageMaker Jupyter notebooks as the development environment to train the model. Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. For this post, we provide Train_Object_Detection_People_DeepLens.ipynb, a full notebook for you to follow along.

To create a custom object detection model, you need to use a graphics processing unit (GPU)-enabled training job instance. GPUs are excellent at parallelizing the computations required to train a neural network. Although the notebook itself runs on a single ml.t2.medium instance, the training job specifically uses an ml.p2.xlarge instance. To access a GPU-enabled training job instance, you must submit a request for a service limit increase to the AWS Support Center.

After you receive your limit increase, complete the following steps to create a SageMaker notebook instance:

  1. On the SageMaker console, choose Notebook instances.
  2. Choose Create notebook instance.
  3. For Notebook instance name, enter a name for your notebook instance.
  4. For Instance type, choose t2.medium.

This is the least expensive instance type that notebook instances support, and it suffices for this tutorial.

  5. For IAM role, choose Create a new role.

Make sure this AWS Identity and Access Management (IAM) role has access to the S3 bucket you created earlier (prefix deeplens-).

  6. Choose Create notebook instance. Your notebook instance can take a couple of minutes to start up.
  7. When the status on the notebook instances page changes to InService, choose Open Jupyter to launch your newly created Jupyter notebook instance.
  8. Choose Upload to upload the Train_Object_Detection_people_DeepLens.ipynb file you downloaded earlier.

  9. Open the notebook and follow it through to the end.
  10. If you’re asked about setting the kernel, select conda_mxnet_p36.

The Jupyter notebook contains a mix of text and code cells. To run a piece of code, choose the cell and press Shift+Enter. While the cell is running, an asterisk appears next to the cell. When the cell is complete, an output number and new output cell appear below the original cell.

  1. Download the dataset from the public S3 bucket into the local SageMaker instance and unzip the data. This can be done by following the code in the notebook:
     !aws s3 cp s3://deeplens-public/samples/pedestriansafety/humandetection_data.zip .  
      
    !rm -rf humandetection/  
    !unzip humandetection_data.zip -d humandetection
    

  2. Convert the dataset into a format (RecordIO) that can be fed into the SageMaker algorithm:
     !python $mxnet_path/tools/im2rec.py --pass-through --pack-label $DATA_PATH/train_mask.lst $DATA_PATH/  
    !python $mxnet_path/tools/im2rec.py --pass-through --pack-label $DATA_PATH/val_mask.lst $DATA_PATH/ 
    

  3. Transfer the RecordIO files back to Amazon S3.

Now that you’re done with all the data preparation, you’re ready to train the object detector.

There are many different types of object detection algorithms. For this post, you use the Single-Shot MultiBox Detection algorithm (SSD). The SSD algorithm has a good balance of speed vs. accuracy, making it ideal for running on edge devices such as AWS DeepLens.

As part of the training job, you have a lot of options for hyperparameters that help configure the training behavior (such as number of epochs, learning rate, optimizer type, and mini-batch size). Hyperparameters let you tune training speed and accuracy of your model. For more information about hyperparameters, see Object Detection Algorithm.

  1. Set up your hyperparameters and data channels. Consider using the following example definition of hyperparameters:
     od_model = sagemaker.estimator.Estimator(training_image,  
                                             role,   
                                             train_instance_count=1,   
                                             train_instance_type='ml.p2.xlarge',  
                                             train_volume_size = 50,  
                                             train_max_run = 360000,  
                                             input_mode= 'File',  
                                             output_path=s3_output_location,  
                                             sagemaker_session=sess)  
      
    od_model.set_hyperparameters(base_network='resnet-50',  
                                 use_pretrained_model=1,  
                                 num_classes=2,  
                                 mini_batch_size=32,  
                                 epochs=100,  
                                 learning_rate=0.003,  
                                 lr_scheduler_step='3,6',  
                                 lr_scheduler_factor=0.1,  
                                 optimizer='sgd',  
                                 momentum=0.9,  
                                 weight_decay=0.0005,  
                                 overlap_threshold=0.5,  
                                 nms_threshold=0.45,  
                                 image_shape=300,  
                                 num_training_samples=n_train_samples) 
    

The notebook has some default hyperparameters that have been pre-selected. For pedestrian detection, you train the model for 100 epochs. This training step should take approximately 2 hours using one ml.p2.xlarge instance. You can experiment with different combinations of the hyperparameters, or train for more epochs for performance improvements. For information about the latest pricing, see Amazon SageMaker Pricing.

  2. You can start a training job with a single line of code and monitor the accuracy over time on the SageMaker console:
    od_model.fit(inputs=data_channels, logs=True)  

For more information about how training works, see CreateTrainingJob. The provisioning and data downloading take time, depending on the size of the data. Therefore, it might be a few minutes before you start getting data logs for your training jobs.
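
The data_channels variable isn't defined in the snippets shown here; a minimal sketch, assuming the RecordIO files were uploaded to train and validation prefixes under the same BUCKET and PREFIX (adjust the paths to wherever you copied the RecordIO files in the data preparation steps), is the following:

train_data = sagemaker.session.s3_input('s3://{}/{}/train'.format(BUCKET, PREFIX),
                                        distribution='FullyReplicated',
                                        content_type='application/x-recordio',
                                        s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input('s3://{}/{}/validation'.format(BUCKET, PREFIX),
                                             distribution='FullyReplicated',
                                             content_type='application/x-recordio',
                                             s3_data_type='S3Prefix')
data_channels = {'train': train_data, 'validation': validation_data}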

You can monitor the progress of your training job through the mean average precision (mAP) metric, which measures how well the model classifies objects and localizes the correct bounding boxes. The data logs also print the mAP on the validation data, among other losses, once per epoch, so you can track the quality of the detections as training progresses.

When the job is finished, you can find the trained model files in the S3 bucket and folder specified earlier in s3_output_location:

s3_output_location = 's3://{}/{}/output'.format(BUCKET, PREFIX)

For this post, we show results on the validation set at the completion of the 10th epoch and the 100th epoch. At the end of the 10th epoch, we see a validation mAP of approximately 0.027, whereas the 100th epoch was approximately 0.42.

To achieve better detection results, you can try to tune the hyperparameters by using the capability built into SageMaker for automatic model tuning and train the model for more epochs. You usually stop training when you see a diminishing gain in accuracy.
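
As a sketch of what automatic model tuning could look like for this estimator (the metric name matches the built-in object detection algorithm; the ranges and job counts are illustrative assumptions, not recommendations):

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

tuner = HyperparameterTuner(estimator=od_model,
                            objective_metric_name='validation:mAP',
                            objective_type='Maximize',
                            hyperparameter_ranges={'learning_rate': ContinuousParameter(0.001, 0.01),
                                                   'mini_batch_size': IntegerParameter(16, 64)},
                            max_jobs=9,
                            max_parallel_jobs=3)
tuner.fit(inputs=data_channels)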

Test model with custom restriction zones

Before you deploy the trained model to AWS DeepLens, you can test it in the cloud by using a SageMaker hosted endpoint. A SageMaker endpoint is a fully managed service that allows you to make real-time inferences via a REST API. SageMaker allows you to quickly deploy new endpoints to test your models so you don’t have to host the model on the local instance that was used to train the model. This allows you to make predictions (or inference) from the model on images that the algorithm didn’t see during training.

You don’t have to host on the same instance type that you used to train. Training is a prolonged, compute-heavy job with different compute and memory requirements than hosting typically has, so you can choose any instance type to host the model. In this case, we trained on an ml.p2.xlarge instance, but we host the model on the less expensive CPU instance, ml.m4.xlarge. The following code snippet shows our endpoint deployment.

object_detector = od_model.deploy(initial_instance_count = 1,
                                  instance_type = 'ml.m4.xlarge')

Detection in a custom restriction zone (region of interest)

The format of the output can be represented as [class_index, confidence_score, xmin, ymin, xmax, ymax]. Low-confidence predictions are more likely to be false positives or false negatives, so discard predictions below a confidence threshold. You can use the following code to detect whether the bounding box of a detected person overlaps with the restricted zone.

import cv2
import numpy as np

def inRestrictedSection(ImShape=None, R1=None, restricted_region=None, kclass=None, score=None, threshold=None):
    statement = 'Person Not Detected in Restricted Zone'
    if (kclass == 1) and (score > threshold):
        # Rasterize the detected person's bounding box.
        Im1 = np.zeros((ImShape[0], ImShape[1], 3), np.int32)
        cv2.fillPoly(Im1, [R1], 255)
        # Rasterize the restricted zone; default to the whole frame if none is given.
        Im2 = np.zeros((ImShape[0], ImShape[1], 3), np.int32)
        if restricted_region is None:
            restricted_region = np.array([[0, ImShape[0]], [ImShape[1], ImShape[0]], [ImShape[1], 0], [0, 0]], np.int32)
        cv2.fillPoly(Im2, [restricted_region], 255)
        # Any overlap between the two masks means the person is inside the zone.
        Im = Im1 * Im2
        if np.sum(np.greater(Im, 0)) > 0:
            statement = 'Person Detected in Restricted Zone'

    return statement

By default, the complete frame is evaluated for human presence. However, you can specify a region of interest within which the presence of a person is deemed high risk. To add a custom restriction zone, add the coordinates of the vertices of the region, represented as [X-axis, Y-axis] pairs, to create the polygon. The coordinates must be entered in either clockwise or counter-clockwise order. See the following code:

restricted_region = None  
#restricted_region = np.array([[0,200],[100,200],[100,0], [10,10]], np.int32)

The following sample code runs inference on a test image and identifies pedestrians within the restricted zone:

import json
import random

import cv2
import numpy as np
import matplotlib.pyplot as plt

file_name = 'humandetection/test_images/t1_image.jpg'
img = cv2.imread(file_name)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
thresh = 0.2
height = img.shape[0]
width = img.shape[1]
colors = dict()

# Read the raw image bytes to send to the endpoint.
with open(file_name, 'rb') as image:
    b = bytearray(image.read())

  
results = object_detector.predict(b, initial_args={'ContentType': 'image/jpeg'})  
detections = json.loads(results)  
  
object_categories = ['no-person', 'person']  
  
for det in detections['prediction']:  
    (klass, score, x0, y0, x1, y1) = det  
    if score < thresh:  
        continue  
    cls_id = int(klass)  
    prob = score  
    if cls_id not in colors:  
        colors[cls_id] = (random.random(), random.random(), random.random())  
    xmin = int(x0 * width)  
    ymin = int(y0 * height)  
    xmax = int(x1 * width)  
    ymax = int(y1 * height)  
      
    R1 = np.array([[xmin,ymin],[xmax,ymin],[xmax,ymax], [xmin,ymax]], np.int32)  
    cv2.polylines(img,[R1],True, (255,255,0), thickness = 5)  
    cv2.polylines(img,[restricted_region],True, (255,0,0), thickness = 5)  
      
    plt.imshow(img)  
      
    print(inRestrictedSection(img.shape,R1 = R1, restricted_region= restricted_region, kclass = cls_id, score = prob, threshold=0.2))

The following images show our results.

Deploy the solution to AWS DeepLens

Convert the model for deployment to AWS DeepLens

When deploying a SageMaker-trained SSD model to AWS DeepLens, you must first run deploy.py to convert the model artifact into a deployable model:

!rm -rf incubator-mxnet  
!git clone -b v1.7.x https://github.com/apache/incubator-mxnet  
  
MODEL_PATH = od_model.model_data  
TARGET_PATH ='s3://'+BUCKET+'/'+PREFIX+'/patched/'  
!rm -rf tmp && mkdir tmp  
  
!aws s3 cp $MODEL_PATH tmp  
!tar -xzvf tmp/model.tar.gz -C tmp  
!mv tmp/model_algo_1-0000.params tmp/ssd_resnet50_300-0000.params  
!mv tmp/model_algo_1-symbol.json tmp/ssd_resnet50_300-symbol.json  
!python incubator-mxnet/example/ssd/deploy.py --network resnet50 --data-shape 300 --num-class 2 --prefix tmp/ssd_  
!tar -cvzf ./patched_model.tar.gz -C tmp ./deploy_ssd_resnet50_300-0000.params ./deploy_ssd_resnet50_300-symbol.json ./hyperparams.json  
!aws s3 cp patched_model.tar.gz $TARGET_PATH

Import your model into AWS DeepLens

To run the model on an AWS DeepLens device, you need to create an AWS DeepLens project. Start by importing your model into AWS DeepLens.

  1. On the AWS DeepLens console, under Resources, choose Models.
  2. Choose Import model.

  1. For Import source, select Externally trained model.
  2. Enter the Amazon S3 location of the patched model that you saved from running deploy.py in the step above.
  3. For Model framework, choose MXNet.
  4. Choose Import model.

Create the inference function

The inference function feeds each camera frame into the model to get predictions and runs any custom business logic using the inference results. You use AWS Lambda to create a function that you deploy to AWS DeepLens. The function runs inference locally on the AWS DeepLens device.

First, we need to create a Lambda function to deploy to AWS DeepLens.

  1. Download the inference Lambda function.
  2. On the Lambda console, choose Functions.
  3. Choose Create function.
  4. Select Author from scratch.
  5. For Function name, enter a name.
  6. For Runtime, choose Python 3.7.
  7. For Choose or create an execution role, choose Use an existing role.
  8. Choose service-role/AWSDeepLensLambdaRole.
  9. Choose Create function.

  1. On the function’s detail page, on the Actions menu, choose Upload a .zip file.

  1. Upload the inference Lambda file you downloaded earlier.
  2. Choose Save to save the code you entered.
  3. On the Actions menu, choose Publish new version.

Publishing the function makes it available on the AWS DeepLens console so that you can add it to your custom project.

  1. Enter a version number and choose Publish.

Understanding the inference function

This section walks you through some important parts of the inference function. First, you should pay attention to two specific files:

  • labels.txt – Contains a mapping of the output from the neural network (integers) to human-readable labels (strings)
  • lambda_function.py – Contains code for the function being called to generate predictions on every camera frame and send back results

In lambda_function.py, you first load and optimize the model. Compared to cloud virtual machines with a GPU, AWS DeepLens has less computing power. AWS DeepLens uses the Intel OpenVINO model optimizer to optimize the model trained in SageMaker to run on its hardware. The following code optimizes your model to run locally:

client.publish(topic=iot_topic, payload='Optimizing model...')  
ret, model_path = mo.optimize('deploy_ssd_resnet50_300', INPUT_W, INPUT_H)  
  
# Load the model onto the GPU.  
client.publish(topic=iot_topic, payload='Loading model...')  
model = awscam.Model(model_path, {'GPU': 1})  

Then you run the model frame by frame over the images from the camera. See the following code:

while True:  
    # Get a frame from the video stream  
    ret, frame = awscam.getLastFrame()  
    if not ret:  
        raise Exception('Failed to get frame from the stream')  
    # Resize frame to the same size as the training set.  
    frame_resize = cv2.resize(frame, (INPUT_H, INPUT_W))  
    # Run the images through the inference engine and parse the results using  
    # the parser API, note it is possible to get the output of doInference  
    # and do the parsing manually, but since it is a ssd model,  
    # a simple API is provided.  
    parsed_inference_results = model.parseResult(model_type,  
                                                 model.doInference(frame_resize))  

Finally, you send the text prediction results back to the cloud. Viewing the text results in the cloud is a convenient way to make sure that the model is working correctly. Each AWS DeepLens device has a dedicated iot_topic automatically created to receive the inference results. See the following code:

# Send results to the cloud  
client.publish(topic=iot_topic, payload=json.dumps(cloud_output))  
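
The cloud_output payload isn't shown here; a minimal sketch of building it from the parsed results, assuming the parser returns a list of dicts with 'label' and 'prob' keys (as in the standard AWS DeepLens object detection samples) and that output_map and DETECTION_THRESHOLD are defined elsewhere in lambda_function.py, could be:

cloud_output = {}
for obj in parsed_inference_results[model_type]:
    if obj['prob'] < DETECTION_THRESHOLD:
        continue
    label = output_map[obj['label']]
    # Keep the highest confidence seen for each label in this frame.
    cloud_output[label] = max(obj['prob'], cloud_output.get(label, 0.0))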

Create a custom AWS DeepLens project

To create a new AWS DeepLens project, complete the following steps:

  1. On the AWS DeepLens console, on the Projects page, choose Create project.
  2. For Project type, select Create a new blank project.
  3. Choose Next.

  1. Name your project yourname-pedestrian-detector-.
  2. Choose Add model.
  3. Select the model you just created.
  4. Choose Add function.
  5. Search for the Lambda function you created earlier by name.
  6. Choose Create project.
  7. On the Projects page, select the project you want to deploy.
  8. Choose Deploy to device.
  9. For Target device, choose your device.
  10. Choose Review.
  11. Review your settings and choose Deploy.

The deployment can take up to 10 minutes to complete, depending on the speed of the network your AWS DeepLens is connected to. When the deployment is complete, you should see a green banner on the page with the message, “Congratulations, your model is now running locally on AWS DeepLens!”

To see the text output, scroll down on the device details page to the Project output section. Follow the instructions in the section to copy the topic and go to the AWS IoT Core console to subscribe to the topic. You should see results as in the following screenshot.

For step-by-step instructions on viewing the video stream or text output, see Viewing results from AWS DeepLens.

Real-world use cases

Now that you have predictions from your model running on AWS DeepLens, let’s convert those predictions into alerts and insights. Some of the most common uses for a project like this include:

  • Understanding how many people on a given day entered a restricted zone so construction sites can identify spots that require more safety signs. This can be done by collecting the results and using them to create a dashboard using Amazon QuickSight. For more details about creating a dashboard using QuickSight, see Build a work-from-home posture tracker with AWS DeepLens and GluonCV.
  • Collecting the output from AWS DeepLens and configuring a Raspberry Pi to sound an alert when someone is walking into a restricted zone. For more details about connecting an AWS DeepLens device to a Raspberry Pi device, see Building a trash sorter with AWS DeepLens.

Conclusion

In this post, you learned how to train an object detection model and deploy it to AWS DeepLens to detect people entering restricted zones. You can use this tutorial as a reference to train and deploy your own custom object detection projects on AWS DeepLens.

For a more detailed walkthrough of this tutorial and other tutorials, samples, and project ideas with AWS DeepLens, see AWS DeepLens Recipes.


About the Authors

Yash Shah is a data scientist in the Amazon ML Solutions Lab, where he works on a range of machine learning use cases from healthcare to manufacturing and retail. He has a formal background in Human Factors and Statistics, and was previously part of the Amazon SCOT team designing products to guide 3P sellers with efficient inventory management.

 

 

Phu Nguyen is a Product Manager for AWS Panorama. He builds products that give developers of any skill level an easy, hands-on introduction to machine learning.

 

 

Read More

Enable cross-account access for Amazon SageMaker Data Wrangler using AWS Lake Formation

Amazon SageMaker Data Wrangler is the fastest and easiest way for data scientists to prepare data for machine learning (ML) applications. With Data Wrangler, you can simplify the process of feature engineering and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization through a single visual interface. Data Wrangler comes with 300 built-in data transformation recipes that you can use to quickly normalize, transform, and combine features. With the data selection tool in Data Wrangler, you can quickly select data from different data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, and Amazon Redshift.

AWS Lake Formation cross-account capabilities simplify securing and managing distributed data lakes across multiple accounts through a centralized approach, providing fine-grained access control to Athena tables.

In this post, we demonstrate how to enable cross-account access for Data Wrangler using Athena as a source and Lake Formation as a central data governance capability. As shown in the following architecture diagram, Account A is the data lake account that holds all the ML-ready data derived from ETL pipelines. Account B is the data science account where a team of data scientists uses Data Wrangler to compile and run data transformations. We need to enable cross-account permissions for Data Wrangler in Account B to access the data tables located in Account A’s data lake via Lake Formation permissions.

With this architecture, data scientists and engineers outside the data lake account can access data from the lake and create data transformations via Data Wrangler.

Before you dive into the setup process, ensure that the data to be shared across accounts is crawled and cataloged as detailed in this post. For the rest of this walkthrough, we assume this process is complete and the databases and tables already exist in Lake Formation.

The following are the high-level steps to implement this solution:

  1. In Account A, register your S3 bucket using Lake Formation and create the necessary databases and tables for the data if they don’t already exist.
  2. The Lake Formation administrator can now share datasets from Account A to other accounts. Lake Formation shares these resources using AWS Resource Access Manager (AWS RAM).
  3. In Account B, accept the resource share request using AWS RAM. Create a local resource link for the shared table via Lake Formation and create a local database.
  4. Next, you need to grant permissions for the SageMaker Studio execution role in Account B to access the shared table and the resource link you created in the previous step.
  5. In Data Wrangler, use the local database and the resource link you created in Account B to query the dataset using the Athena connector and perform feature transformations.

Data lake setup using Lake Formation

To get started, create a central data lake in Account A. You can control the access to the data lake with policies and permissions, and define permissions at the database, table, or column level.

To kickstart the setup process, download the titanic dataset .csv file and upload it to your S3 bucket. After you upload the file, you need to register the bucket in Lake Formation. Lake Formation permissions enable fine-grained access control for data in your data lake.

Note: If the titanic dataset has already been cataloged, you can skip the registration step below.

Register your S3 data store in Lake Formation

To register your data store, complete the following steps:

  1. In Account A, sign in to the Lake Formation console.

If this is the first time you’re accessing Lake Formation, you need to add administrators to the account.

  1. In the navigation pane, under Permissions, choose Admins and database creators.
  2. Under Data lake administrators, choose Grant.

You now add AWS Identity and Access Management (IAM) users or roles specific to Account A as data lake administrators.

  1. Under Manage data lake administrators, for IAM users and roles, choose your user or role (for this post, we use user-a).

This can also be the IAM admin role of Account A.

  1. Choose Save.

  1. Make sure the IAMAllowedPrincipals group is not listed under both Data lake administrators and Database creators.

For more information about security settings, see Changing the Default Security Settings for Your Data Lake.

Next, you need to register the S3 bucket as the data lake location.

  1. On the Lake Formation console, under Register and ingest, choose Data lake locations.

This page should display a list of S3 buckets that are marked as data lake storage resources for Lake Formation. A single S3 bucket may act as the repository for many datasets, or you could use separate buckets for separate data sources.

  1. Choose Register location.
  2. For Amazon S3 path, enter the path for your bucket.
  3. For IAM role, choose AWSServiceRoleForLakeFormationDataAccess.
  4. Choose Register location.

After this step, you should be able to see your S3 bucket under Data lake locations.

Create a database

This step is optional. Skip this step if the titanic dataset has already been crawled and cataloged. The database and table for the dataset should pre-exist within the data lake.

Complete the following steps to register the database if it does not exist:

  1. On the Lake Formation console, under Data catalog, choose Databases.
  2. Choose Create database.
  3. For Database details, select Database.
  4. For Name, enter a name (for example, titanic).
  5. For Location, enter the S3 data lake bucket path.
  6. Deselect Use only IAM access controls for tables in this database.
  7. Choose Create database.

  1. Under Actions, choose Permissions.
  2. Choose View permissions.
  3. Make sure that the IAMAllowedPrincipals group isn’t listed.

If it’s listed, make sure you revoke access to this group.

You should now be able to view the created database listed under Databases.

You should also be able to see the table in the Lake Formation console, under Data catalog in the navigation pane, under Tables. For this demo, we assume the table name is titanic_datalake_bucket_as, as shown below.

Grant table permissions to Account A

To grant table permissions to Account A, complete the following steps:

  1. Sign in to the Lake Formation console with Account A.
  2. Under Data catalog, choose Tables.
  3. Select the newly created table.
  4. On the Actions menu, under Permissions, choose Grant.
  5. Select My account.
  6. For IAM users and roles, choose the users or roles you want to grant access (for this post, we choose user-x, a different user within Account A).

You can also set a column filter.

  1. For Columns, choose Include columns.
  2. For Include columns, choose the first five columns from the titanic_datalake_bucket_as table.
  3. For Table permissions, select Select.
  4. Choose Grant.

  1. Still in Account A, switch to the Athena console.
  2. Run a table preview.

You should be able to see the first five columns of the titanic_datalake_bucket_as table as per the granted permissions in the previous steps.

We have validated local access to the data lake table within Account A via this Athena step. Next, let’s grant access to an external account, in our case, Account B for the same table.

Grant table permissions to Account B

This external account is the account running Data Wrangler. To grant table permissions, complete the following steps:

  1. Staying within Account A, on the Actions menu, under Permissions, choose Grant.
  2. Select External account.
  3. For AWS account ID, enter the account ID of Account B.
  4. Choose the same first five columns of the table.
  5. For Table permissions and Grantable permissions, select Select.
  6. Choose Grant.

You must revoke the Super permission from the IAMAllowedPrincipals group for this table before granting it external access. You can do this on the Actions menu under View permissions, then choose IAMAllowedPrincipals and choose Revoke.
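
If you prefer to script this cross-account grant instead of using the console, a minimal boto3 sketch (run in Account A; the database, table, and column names below are the ones assumed in this walkthrough and should be replaced with yours) might look like the following:

import boto3

lf = boto3.client('lakeformation')
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': '<account-b-id>'},
    Resource={'TableWithColumns': {
        'DatabaseName': 'titanic',
        'Name': 'titanic_datalake_bucket_as',
        # Replace with the first five column names of your table.
        'ColumnNames': ['pclass', 'survived', 'name', 'sex', 'age']}},
    Permissions=['SELECT'],
    PermissionsWithGrantOption=['SELECT'])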

  1. On the AWS RAM console, still in Account A, under Shared by me, choose Shared resources.

We can find a Lake Formation entry on this page.

  1. Switch to Account B.
  2. On the AWS RAM console, under Shared with me, you see an invitation from Lake Formation in Account A.

  1. Accept the invitation by choosing Accept resource share.

After you accept it, on the Resource shares page, you should see the shared Lake Formation entry, which encapsulates the catalog, database, and table information.

On the Lake Formation console in Account B, you can find the shared table owned by Account A on the Tables page. If you don’t see it, you can refresh your screen and the resource should appear shortly.

To use this shared table inside Account B, you need to create a database local to Account B in Lake Formation.

  1. On the Lake Formation console, under Databases, choose Create database.
  2. Name the database local_db.

Next, for the shared titanic table in Lake Formation, you need to create a resource link. Resource links are Data Catalog objects that link to metadata databases and tables, typically to shared databases and tables from other AWS accounts. They help enable cross-account access to data in the data lake.

  1. On the table details page, on the Actions menu, choose Create resource link.

  1. For Resource link name, enter a name (for example, titanic_local).
  2. For Database, choose the local database you created previously.
  3. The values for Shared table and Shared table’s database should match the ones in Account A and be auto-populated.
  4. For Shared table’s owner ID, choose the account ID of Account A.
  5. Choose Create.

  1. In the navigation pane, under Data catalog, choose Settings.
  2. Make sure Use only IAM access control is disabled for new databases and tables.

This is to make sure that Lake Formation manages the database and table permissions.

  1. Switch to the SageMaker console.
  2. In the Studio Control Panel, under Studio Summary, copy the ARN of the execution role.
  3. You need to grant this role permissions to access the local database, the shared table, and the local resource link you created previously in Account B’s Lake Formation.
  4. You also need to attach the following custom policy to this role. This policy allows Studio to access data via Lake Formation and allows Account B to get data partitions for querying the titanic dataset from the created tables:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "lakeformation:GetDataAccess",
        "glue:GetPartitions"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
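
One way to attach this policy to the Studio execution role is as an inline policy, for example with boto3 (the role name and policy name here are placeholders):

import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["lakeformation:GetDataAccess", "glue:GetPartitions"],
        "Resource": ["*"]
    }]
}

iam = boto3.client('iam')
iam.put_role_policy(RoleName='<your-studio-execution-role-name>',
                    PolicyName='DataWranglerLakeFormationAccess',
                    PolicyDocument=json.dumps(policy))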
  1. Switch back to the Lake Formation console.
  2. Here, we need to grant permissions for the SageMaker execution role to access the shared titanic_datalake_bucket_as table.

This is the table that you shared to Account B from Account A via AWS RAM.

  1. In Account B, on the table details page, on the Actions menu, under Permissions, choose Grant.
  2. Grant the role access to the table and five columns.

  1. Finally, grant the SageMaker execution role permissions to access the local titanic table in Account B.

Cross-account data access in Studio

In this final stage, you should be ready to validate the steps deployed so far by testing this in the Data Wrangler interface.

  1. On the Import tab, for Import data, choose Amazon Athena as your data source.

  1. For Data catalog, choose AwsDataCatalog.
  2. For Database, choose the local database you created in Account B (local_db).

You should be able to see the local table (titanic_local) in the right pane.

  1. Run an Athena query as shown in the following screenshot to see the selected columns of the titanic dataset that you gave to the SageMaker execution role in Lake Formation (Account B).
  2. Choose Import dataset.

  1. For Dataset Name, enter a name (for example, titanic-dataset).
  2. Choose Add.

This imports the titanic dataset, and you should be able to see the data flow page with the visual blocks on the Prepare tab.

Conclusion

In this post, we demonstrated how to enable cross-account access for Data Wrangler using Lake Formation and AWS RAM. Following this methodology, organizations can allow multiple data science and engineering teams to access data from a central data lake and build feature pipelines and transformation recipes consistently. For more information about Data Wrangler, see Introducing Amazon SageMaker Data Wrangler, a Visual Interface to Prepare Data for Machine Learning and Exploratory data analysis, feature engineering, and operationalizing your data flow into your ML pipeline with Amazon SageMaker Data Wrangler.

Give Data Wrangler a try and share your feedback and questions in the comments section.


About the Authors

Rizwan Gilani is a Software Development Engineer at Amazon SageMaker. His passion lies with making machine learning more interactive and accessible at scale. Before that, he worked on Amazon Alexa as part of the core team that launched Alexa Communications.

 

 

Phi Nguyen is a solutions architect at AWS helping customers with their cloud journey with a special focus on data lake, analytics, semantics technologies and machine learning. In his spare time, you can find him biking to work, coaching his son’s soccer team or enjoying nature walk with his family.

 

 

Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.

 

Read More

AWS and NVIDIA to bring Arm-based instances with GPUs to the cloud

AWS continues to innovate on behalf of our customers. We’re working with NVIDIA to bring an Arm processor-based, NVIDIA GPU accelerated Amazon Elastic Compute Cloud (Amazon EC2) instance to the cloud in the second half of 2021. This instance will feature the Arm-based AWS Graviton2 processor, which was built from the ground up by AWS and optimized for how customers run their workloads in the cloud, eliminating a lot of unneeded components that otherwise might go into a general-purpose processor.

AWS innovation with Arm technology

AWS has continued to pioneer cloud computing for our customers. In 2018, AWS was the first major cloud provider to offer Arm-based instances in the cloud with EC2 A1 instances powered by AWS Graviton processors. These instances are built around Arm cores and make extensive use of AWS custom-built silicon. They’re a great fit for scale-out workloads in which you can share the load across a group of smaller instances.

In 2020, AWS released AWS-designed, Arm-based Graviton2 processors, delivering a major leap in performance and capabilities over first-generation AWS Graviton processors. These processors power EC2 general purpose (M6g, M6gd, T4g), compute-optimized (C6g, C6gd, C6gn), and memory-optimized (R6g, R6gd, X2gd) instances, and provide up to 40% better price performance over comparable current generation x86-based instances for a wide variety of workloads. AWS Graviton2 processors deliver seven times more performance, four times more compute cores, five times faster memory, and twice as much cache compared with first-generation AWS Graviton processors.

Customers including Domo, Formula One, Honeycomb.io, Intuit, LexisNexis Risk Solutions, Nielsen, NextRoll, Redbox, SmugMug, Snap, and Twitter have seen significant performance gains and reduced costs from running AWS Graviton2-based instances in production. AWS Graviton2 processors, based on the 64-bit Arm architecture, are supported by popular Linux operating systems, including Amazon Linux 2, Red Hat, SUSE, and Ubuntu. Many popular applications and services from AWS and ISVs also support AWS Graviton2-based instances. Arm developers can use these instances to build applications natively in the cloud, thereby eliminating the need for emulation and cross-compilation, which are error-prone and time-consuming. Adding NVIDIA GPUs accelerates Graviton2-based instances for diverse cloud workloads, including gaming and other Arm-based workloads like machine learning (ML) inference.

Easily move Android games to the cloud

According to research from App Annie, mobile gaming is now the most popular form of gaming and has overtaken console, PC, and Mac. Additional research from App Annie has shown that up to 10% of all time spent on mobile devices is with games, and game developers need to support and optimize their games for the diverse set of mobile devices being used today and in the future. By leveraging the cloud, game developers can provide a uniform experience across the spectrum of mobile devices and extend battery life due to lower compute and power demands on the mobile device. The AWS Graviton2 instance with NVIDIA GPU acceleration enables game developers to run Android games natively, encode the rendered graphics, and stream the game over networks to a mobile device, all without needing to run emulation software on x86 CPU-based infrastructure.

Cost-effective, GPU-based machine learning inference

In addition to mobile gaming, customers running machine learning models in production are continuously looking for ways to lower costs as ML inference can represent up to 90% of the overall infrastructure spend for running these applications at scale. With this new offering, customers will be able to take advantage of the price/performance benefits of Graviton2 to deploy GPU accelerated deep learning models at a significantly lower cost vs. x86-based instances with GPU acceleration.

AWS and NVIDIA: A long history of collaboration

AWS and NVIDIA have collaborated for over 10 years to continually deliver powerful, cost-effective, and flexible GPU-based solutions to customers including the latest EC2 G4 instances with NVIDIA T4 GPUs launched in 2019 and EC2 P4d instances with NVIDIA A100 GPUs launched in 2020. EC2 P4d instances are deployed in hyperscale clusters called EC2 UltraClusters that are comprised of the highest performance compute, networking, and storage in the cloud. EC2 UltraClusters support 400 Gbps instance networking, Elastic Fabric Adapter (EFA), and NVIDIA GPUDirect RDMA technology to help rapidly train ML models using scale-out and distributed techniques.

In addition to being first in the cloud to offer GPU accelerated instances and first in the cloud to offer NVIDIA V100 GPUs, we’re now working together with NVIDIA to offer new EC2 instances that combine an Arm-based processor with a GPU accelerator in the second half of 2021. To learn more about how AWS and NVIDIA work together to bring innovative technology to customers, visit AWS at NVIDIA GTC 21.


About the Author

Geoff Murase is a Senior Product Marketing Manager for AWS EC2 accelerated computing instances, helping customers meet their compute needs by providing access to hardware-based compute accelerators such as Graphics Processing Units (GPUs) or Field Programmable Gate Arrays (FPGAs). In his spare time, he enjoys playing basketball and biking with his family.

Read More

Detect abnormal equipment behavior and review predictions using Amazon Lookout for Equipment and Amazon A2I

Companies that operate and maintain a broad range of industrial machinery such as generators, compressors, and turbines are constantly working to improve operational efficiency and avoid unplanned downtime due to component failure. They invest heavily in physical sensors (tags), data connectivity, data storage, and data visualization to monitor the condition of their equipment and get real-time alerts for predictive maintenance.

With machine learning (ML), more powerful technologies have become available that can provide data-driven models that learn from an equipment’s historical data. However, implementing such ML solutions is time-consuming and expensive because it involves managing and setting up complex infrastructure and having the right ML skills. Furthermore, ML applications need human oversight to ensure accuracy with sensitive data, help provide continuous improvements, and retrain models with updated predictions. However, you’re often forced to choose between an ML-only or human-only system. Companies are looking for the best of both worlds—integrating ML systems into your workflow while keeping a human eye on the results to achieve higher precision.

In this post, we show you how you can set up Amazon Lookout for Equipment to train an abnormal behavior detection model using a wind turbine dataset for predictive maintenance, use a human in the loop workflow to review the predictions using Amazon Augmented AI (Amazon A2I), and augment the dataset and retrain the model.

Solution overview

Amazon Lookout for Equipment analyzes the data from your sensors, such as pressure, flow rate, RPMs, temperature, and power, to automatically train a specific ML model based on your data, for your equipment, with no ML expertise required. Amazon Lookout for Equipment uses your unique ML model to analyze incoming sensor data in near-real time and accurately identify early warning signs that could lead to machine failures. This means you can detect equipment abnormalities with speed and precision, quickly diagnose issues, take action to reduce expensive downtime, and reduce false alerts.

Amazon A2I is an ML service that makes it easy to build the workflows required for human review. Amazon A2I brings human review to all developers, removing the undifferentiated heavy lifting associated with building human review systems or managing large numbers of human reviewers, whether running on AWS or not.

To get started with Amazon Lookout for Equipment, we create a dataset, ingest data, train a model, and run inference by setting up a scheduler. After going through these steps, we show you how you can quickly set up a human review process using Amazon A2I and retrain your model with augmented or human reviewed datasets.

In the accompanying Jupyter notebook, we walk you through the following steps:

  1. Create a dataset in Amazon Lookout for Equipment.
  2. Ingest data into the Amazon Lookout for Equipment dataset.
  3. Train a model in Amazon Lookout for Equipment.
  4. Run diagnostics on the trained model.
  5. Create an inference scheduler in Amazon Lookout for Equipment to send a simulated stream of real-time requests.
  6. Set up an Amazon A2I private human loop and review the predictions from Amazon Lookout for Equipment.
  7. Retrain your model based on augmented datasets from Amazon A2I.

Architecture overview

The following diagram illustrates our solution architecture.

 

The workflow contains the following steps:

  1. The architecture assumes that the inference pipeline is built and sensor data is periodically stored in the S3 path for inference inputs. These inputs are stored in CSV format with corresponding timestamps in the file name.
  2. Amazon Lookout for Equipment wakes up at a prescribed frequency and processes the most recent file from the inference inputs Amazon Simple Storage Service (Amazon S3) path.
  3. Inference results are stored in the inference outputs S3 path in JSON lines file format. The outputs also contain event diagnostics, which are used for root cause analysis.
  4. When Amazon Lookout for Equipment detects an anomaly, the inference input and outputs are presented to the private workforce for validation via Amazon A2I.
  5. A private workforce investigates and validates the detected anomalies and provides new anomaly labels. These labels are stored in a new S3 path.
  6. Training data is also updated, along with the corresponding new labels, and is staged for subsequent model retraining.
  7. After enough new labels are collected, a new Amazon Lookout for Equipment model is created, trained, and deployed. The retraining cycle can be repeated for continuous model retraining.

Prerequisites

Before you get started, complete the following steps to set up the Jupyter notebook:

  1. Create a notebook instance in Amazon SageMaker.

Make sure your SageMaker notebook has the necessary AWS Identity and Access Management (IAM) roles and permissions mentioned in the prerequisite section of the notebook.

  1. When the notebook is active, choose Open Jupyter.
  2. On the Jupyter dashboard, choose New, and choose Terminal.
  3. In the terminal, enter the following code:
cd SageMaker
git clone https://github.com/aws-samples/lookout-for-equipment-demo
  1. First run the data preparation notebook – 1_data_preparation.ipynb
  2. Then open the notebook for this blog – 3_integrate_l4e_and_a2i.ipynb

You’re now ready to run the following steps through the notebook cells. Run the setup environment step to set up the necessary Python SDKs and libraries that we use throughout the notebook.

  1. Provide an AWS Region, create an S3 bucket, and provide details of the bucket in the following code cell:
REGION_NAME = '<your region>'
BUCKET = '<your bucket name>'
PREFIX = 'data/wind-turbine'

Analyze the dataset and create component metadata

In this section, we walk you through how you can preprocess the existing wind turbine data and ingest it for Amazon Lookout for Equipment. Make sure to run the data preparation notebook before the accompanying notebook for this post so you can follow all the steps. You need a data schema to use your existing historical data with Amazon Lookout for Equipment. The data schema tells Amazon Lookout for Equipment what the data means. Because a data schema describes the data, its structure mirrors that of the data files of the components it describes.

All components must be described in the data schema. The data for each component is contained in a separate CSV file structured as shown in the data schema.

You store the data for each asset’s component in a separate CSV file using the following folder structure:

S3 bucket > Asset_name > Component 1 > Component1.csv

Go to the notebook section Pre-process and Load Datasets and run the following cell to inspect the data:

import pandas as pd
turbine_id = 'R80711'
df = pd.read_csv(f'../data/wind-turbine/interim/{turbine_id}.csv', index_col = 'Timestamp')
df.head()

The following screenshot shows our output.

Now we create a component map to build the dataset structure that Amazon Lookout for Equipment expects for ingestion. Run the notebook cells under the section Create the Dataset Component Map to create a component map and generate a CSV file for ingest.

Create the Amazon Lookout for Equipment dataset

We use Amazon Lookout for Equipment Create Dataset APIs to create a dataset and provide the component map we created in the previous step as an input. Run the following notebook cell to create a dataset:

ROLE_ARN = sagemaker.get_execution_role()
# REGION_NAME = boto3.session.Session().region_name
DATASET_NAME = 'wind-turbine-train-dsv2-PR'
MODEL_NAME = 'wind-turbine-PR-v1'

lookout_dataset = lookout.LookoutEquipmentDataset(
    dataset_name=DATASET_NAME,
    component_fields_map=DATASET_COMPONENT_FIELDS_MAP,
    region_name=REGION_NAME,
    access_role_arn=ROLE_ARN
)

pp = pprint.PrettyPrinter(depth=5)
pp.pprint(eval(lookout_dataset.dataset_schema))
lookout_dataset.create()

You get the following output:

Dataset "wind-turbine-train-dsv2-PR" does not exist, creating it...


{'DatasetName': 'wind-turbine-train-dsv2-PR',
 'DatasetArn': 'arn:aws:lookoutequipment:ap-northeast-2:<aws-account>:dataset/wind-turbine-train-dsv2-PR/8325802a-9bb7-48fb-804b-ab9f5b79f49d',
 'Status': 'CREATED',
 'ResponseMetadata': {'RequestId': '52dc754c-84da-4a8c-aaef-1908e4348837',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '52dc754c-84da-4a8c-aaef-1908e4348837',
   'content-type': 'application/x-amz-json-1.0',
   'content-length': '203',
   'date': 'Thu, 25 Mar 2021 21:18:29 GMT'},
  'RetryAttempts': 0}}

Alternatively, you can go to the Amazon Lookout for Equipment console to view the dataset.

You can choose View under Data schema to view the schema of the dataset. You can choose Ingest new data to start ingesting data through the console, or you can use the APIs shown in the notebook to do the same using Python Boto3 APIs.

Run the notebook cells to ingest the data. When ingestion is complete, you get the following response:

=====Polling Data Ingestion Status=====

2021-03-25 21:18:45 |  IN_PROGRESS
2021-03-25 21:19:46 |  IN_PROGRESS
2021-03-25 21:20:46 |  IN_PROGRESS
2021-03-25 21:21:46 |  IN_PROGRESS
2021-03-25 21:22:46 |  SUCCESS

Now that we have preprocessed the data and ingested it into Amazon Lookout for Equipment, we move on to the training steps.

Label your dataset using the SageMaker labeling workforce

If you don’t have an existing labeled dataset available to use directly with Amazon Lookout for Equipment, create a custom labeling workflow. This may be relevant in a use case in which, for example, a company wants to build a remote operating facility where alerts from various operations are sent to a central facility for subject matter experts (SMEs) to review and update. For a sample crowd HTML template for your labeling UI, refer to our GitHub repository.

The following screenshot shows an example of what the sample labeling UI looks like.

For this post, we use the labels that came with the dataset for training. If you want to use the label file you created for your actual training in the next step, you need to copy the label file to an S3 bucket and provide the location in the training configuration.

Create a model in Amazon Lookout for Equipment

We walk you through the following steps in this section:

  • Prepare the model parameters and split data into test and train sets
  • Train the model using Amazon Lookout for Equipment APIs
  • Get diagnostics for the trained model

Prepare the model parameters and split the data

In this step, we split the datasets into test and train, prepare labels, and start the training using the notebook. Run the notebook code Split train and test to split the dataset into an 80/20 split for training and testing, respectively. Then run the prepare labels code and move on to setting up training config, as shown in the following code:

# Prepare the model parameters:
lookout_model = lookout.LookoutEquipmentModel(model_name=MODEL_NAME,
                                              dataset_name=DATASET_NAME,
                                              region_name=REGION_NAME)

# Set the training / evaluation split date:
lookout_model.set_time_periods(evaluation_start,
                               evaluation_end,
                               training_start,
                               training_end)

# Set the label data location:
lookout_model.set_label_data(bucket=BUCKET, 
                             prefix=PREFIX+'/labelled_data/',
                             access_role_arn=ROLE_ARN)

# This sets up the rate the service will resample the data before 
# training:
lookout_model.set_target_sampling_rate(sampling_rate='PT10M')

In the preceding code, we set up model training parameters such as time periods, label data, and target sampling rate for our model. For more information about these parameters, see CreateModel.

Train model

After setting these model parameters, you need to run the following train model API to start training your model with your dataset and the training parameters:

lookout_model.train()

You get the following response:

{'ModelArn': 'arn:aws:lookoutequipment:ap-northeast-2:<accountid>:model/wind-turbine-PR-v1/fac217a9-8855-4931-95f9-dd47f0af1ec5',
 'Status': 'IN_PROGRESS',
 'ResponseMetadata': {'RequestId': '3d385895-c62e-4126-9622-38f0ebed9715',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '3d385895-c62e-4126-9622-38f0ebed9715',
   'content-type': 'application/x-amz-json-1.0',
   'content-length': '152',
   'date': 'Thu, 25 Mar 2021 21:27:05 GMT'},
  'RetryAttempts': 0}}

Alternatively, you can go to the Amazon Lookout for Equipment console and monitor the training after you create the model.

The sample turbine dataset we provide in our example has millions of data points. Training takes approximately 2.5 hours.
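
If you would rather poll for completion programmatically than watch the console, a minimal sketch using the boto3 lookoutequipment client is:

import time
import boto3

lookout_client = boto3.client('lookoutequipment')
while True:
    status = lookout_client.describe_model(ModelName=MODEL_NAME)['Status']
    print(status)
    if status != 'IN_PROGRESS':
        break
    time.sleep(300)  # check every 5 minutes; training can take a few hours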

Evaluate the trained model

After a model is trained, Amazon Lookout for Equipment evaluates its performance and displays the results. It provides an overview of the performance and detailed information about the abnormal equipment behavior events and how well the model performed when detecting those. With the data and failure labels that you provided for training and evaluating the model, Amazon Lookout for Equipment reports how many times the model’s predictions were true positives. It also reports the average forewarning time across all true positives. Additionally, it reports the false positive results generated by the model, along with the duration of the non-event.

For more information about performance evaluation, see Evaluating the output.
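
The same evaluation summary is also available programmatically; a minimal sketch, assuming the boto3 lookoutequipment client, is:

import json
import boto3

lookout_client = boto3.client('lookoutequipment')
response = lookout_client.describe_model(ModelName=MODEL_NAME)
# ModelMetrics is returned as a JSON string describing labeled and predicted event ranges.
metrics = json.loads(response['ModelMetrics'])
print(json.dumps(metrics, indent=2))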

Review training diagnostics

Run the following code to generate the training diagnostics. Refer to the accompanying notebook for the complete code block to run for this step.

LookoutDiagnostics = lookout.LookoutEquipmentAnalysis(model_name=MODEL_NAME, tags_df=df, region_name=REGION_NAME)
LookoutDiagnostics.set_time_periods(evaluation_start, evaluation_end, training_start, training_end)
predicted_ranges = LookoutDiagnostics.get_predictions()

The results returned show the percentage contribution of each feature towards the abnormal equipment prediction for the corresponding date range.

Create an inference scheduler in Amazon Lookout for Equipment

In this step, we show you how the CreateInferenceScheduler API creates a scheduler and starts it—this starts costing you right away. Scheduling an inference is setting up a continuous real-time inference plan to analyze new measurement data. When setting up the scheduler, you provide an S3 bucket location for the input data, assign it a delimiter between separate entries in the data, set an offset delay if desired, and set the frequency of inference. You must also provide an S3 bucket location for the output data. Run the following notebook section to run inference on the model to create an inference scheduler:

scheduler = lookout.LookoutEquipmentScheduler(
    scheduler_name=INFERENCE_SCHEDULER_NAME,
    model_name=MODEL_NAME_FOR_CREATING_INFERENCE_SCHEDULER,
    region_name=REGION_NAME
)

scheduler_params = {
    'input_bucket': INFERENCE_DATA_SOURCE_BUCKET,
    'input_prefix': INFERENCE_DATA_SOURCE_PREFIX,
    'output_bucket': INFERENCE_DATA_OUTPUT_BUCKET,
    'output_prefix': INFERENCE_DATA_OUTPUT_PREFIX,
    'role_arn': ROLE_ARN_FOR_INFERENCE,
    'upload_frequency': DATA_UPLOAD_FREQUENCY,
    'delay_offset': DATA_DELAY_OFFSET_IN_MINUTES,
    'timezone_offset': INPUT_TIMEZONE_OFFSET,
    'component_delimiter': COMPONENT_TIMESTAMP_DELIMITER,
    'timestamp_format': TIMESTAMP_FORMAT
}

scheduler.set_parameters(**scheduler_params)

After you create an inference scheduler, the next step is to create some sample datasets for inference.

Prepare the inference data

Run through the notebook steps to prepare the inference data. Let’s load the tags description; this dataset comes with a data description file. From here, we can collect the list of components (subsystem column) if required. We use the tag metadata from the data descriptions as a point of reference for our interpretation. We use the tag names to construct a list that Amazon A2I uses. For more details, refer to the section Set up Amazon A2I to review predictions from Amazon Lookout for Equipment in this post.

To build our sample inference dataset, we extract the last 2 hours of data from the evaluation period of the original time series. Specifically, we create three CSV files containing simulated real-time tags for our turbine 10 minutes apart. These are all stored in Amazon S3 in the inference-a2i folder. Now that we’ve prepared the data, create the scheduler by running the following code:

create_scheduler_response = scheduler.create()

You get the following response:

===== Polling Inference Scheduler Status =====

Scheduler Status: PENDING
Scheduler Status: RUNNING

===== End of Polling Inference Scheduler Status =====

Alternatively, on the Amazon Lookout for Equipment console, go to the Inference schedule settings section of your trained model and set up a scheduler by providing the necessary parameters.

Get inference results

Run through the notebook steps List inference executions to get the run details from the schedule you created in the previous step. Wait 5–15 minutes for the scheduler to run its first inference. When it’s complete, we can use the ListInferenceExecution API for our current inference scheduler. The only mandatory parameter is the scheduler name.

You can also choose a time period for which you want to query inference runs. If you don’t specify it, all runs for an inference scheduler are listed. If you want to specify the time range, you can use the following code:

START_TIME_FOR_INFERENCE_EXECUTIONS = datetime.datetime(2010,1,3,0,0,0)
END_TIME_FOR_INFERENCE_EXECUTIONS = datetime.datetime(2010,1,5,0,0,0)

This code means that the runs after 2010-01-03 00:00:00 and before 2010-01-05 00:00:00 are listed.

You can also choose to query for runs in a particular status, such as IN_PROGRESS, SUCCESS, and FAILED:

START_TIME_FOR_INFERENCE_EXECUTIONS = None
END_TIME_FOR_INFERENCE_EXECUTIONS = None
EXECUTION_STATUS = None

execution_summaries = []

while len(execution_summaries) == 0:
    execution_summaries = scheduler.list_inference_executions(
        start_time=START_TIME_FOR_INFERENCE_EXECUTIONS,
        end_time=END_TIME_FOR_INFERENCE_EXECUTIONS,
        execution_status=EXECUTION_STATUS
    )
    if len(execution_summaries) == 0:
        print('WAITING FOR THE FIRST INFERENCE EXECUTION')
        time.sleep(60)
        
    else:
        print('FIRST INFERENCE EXECUTED\n')
        break
            
execution_summaries

You get the following response:

[{'ModelName': 'wind-turbine-PR-v1',
  'ModelArn': 'arn:aws:lookoutequipment:ap-northeast-2:<aws-account>:model/wind-turbine-PR-v1/fac217a9-8855-4931-95f9-dd47f0af1ec5',
  'InferenceSchedulerName': 'wind-turbine-scheduler-a2i-PR-v10',
  'InferenceSchedulerArn': 'arn:aws:lookoutequipment:ap-northeast-2:<aws-account>:inference-scheduler/wind-turbine-scheduler-a2i-PR-v10/e633c39d-a4f9-49f6-8248-7594349db2d0',
  'ScheduledStartTime': datetime.datetime(2021, 3, 29, 15, 35, tzinfo=tzlocal()),
  'DataStartTime': datetime.datetime(2021, 3, 29, 15, 30, tzinfo=tzlocal()),
  'DataEndTime': datetime.datetime(2021, 3, 29, 15, 35, tzinfo=tzlocal()),
  'DataInputConfiguration': {'S3InputConfiguration': {'Bucket': '<your s3 bucket>',
    'Prefix': 'data/wind-turbine/inference-a2i/input/'}},
  'DataOutputConfiguration': {'S3OutputConfiguration': {'Bucket': '<your s3 bucket>',
    'Prefix': 'data/wind-turbine/inference-a2i/output/'}},
  'CustomerResultObject': {'Bucket': '<your s3 bucket>',
   'Key': 'data/wind-turbine/inference-a2i/output/2021-03-29T15:30:00Z/results.jsonl'},
  'Status': 'SUCCESS'}]

Get actual prediction results

After each successful inference, a JSON file is created in the output location of your bucket. Each inference creates a new folder with a single results.jsonl file in it. You can run through this section in the notebook to read these files and display their content.

results_df

The following screenshot shows the results.
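
If you’re working outside the notebook, a minimal sketch of loading one results.jsonl file into a DataFrame (the bucket and key below are placeholders matching the inference output path shown earlier) could be:

import io
import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='<your s3 bucket>',
                    Key='data/wind-turbine/inference-a2i/output/2021-03-29T15:30:00Z/results.jsonl')
results_df = pd.read_json(io.StringIO(obj['Body'].read().decode('utf-8')), lines=True)
results_df.head()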

Stop the inference scheduler

Make sure to stop the inference scheduler; we don’t need it for the rest of the steps in this post. However, as part of your solution, the inference scheduler should be running to ensure real-time inference for your equipment continues. Run through this notebook section to stop the inference scheduler.
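
The notebook wraps this step for you; the equivalent boto3 call is a one-liner (using the scheduler name defined earlier):

import boto3

lookout_client = boto3.client('lookoutequipment')
lookout_client.stop_inference_scheduler(InferenceSchedulerName=INFERENCE_SCHEDULER_NAME)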

Set up Amazon A2I to review predictions from Amazon Lookout for Equipment

Now that inference is complete, let’s understand how to set up a UI to review the inference results and update them, so we can send the corrections back to Amazon Lookout for Equipment to retrain the model. In this section, we show how to use the Amazon A2I custom task type to integrate with Amazon Lookout for Equipment through the walkthrough notebook to set up a human in the loop process. It includes the following steps:

  • Create a human task UI
  • Create a workflow definition
  • Send predictions to Amazon A2I human loops
  • Sign in to the worker portal and annotate Amazon Lookout for Equipment inference predictions

Follow the steps provided in the notebook to initialize Amazon A2I APIs. Make sure to set up the bucket name in the initialization block where you want your Amazon A2I output:

a2ibucket = '<your bucket>'

You also need to create a private workforce and provide a work team ARN in the initialize step.

On the SageMaker console, create a private workforce. After you create the private workforce, find the workforce ARN and enter the ARN in the notebook:

WORKTEAM_ARN = 'your private workforce team ARN'

Create the human task UI

You now create a human task UI resource, giving a UI template in liquid HTML. You can download the provided template and customize it. This template is rendered to the human workers whenever a human loop is required. For over 70 pre-built UIs, see the amazon-a2i-sample-task-uis GitHub repo. We also provide this template in our GitHub repo.

You can use this template to create a task UI either via the console or by running the following code in the notebook:

def create_task_ui():
 
    response = sagemaker_client.create_human_task_ui(
        HumanTaskUiName=taskUIName,
        UiTemplate={'Content': template})
    return response
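
A short usage sketch: call the helper and keep the returned ARN, which the workflow definition in the next step expects (taskUIName and template are defined in the notebook):

humanTaskUiResponse = create_task_ui()
humanTaskUiArn = humanTaskUiResponse['HumanTaskUiArn']
print(humanTaskUiArn)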

Create a human review workflow definition

Workflow definitions allow you to specify the following:

  • The worker template or human task UI you created in the previous step.
  • The workforce that your tasks are sent to. For this post, it’s the private workforce you created in the prerequisite steps.
  • The instructions that your workforce receives.

This post uses the Create Flow Definition API to create a workflow definition. Run the following cell in the notebook:

create_workflow_definition_response = sagemaker_client.create_flow_definition(
        FlowDefinitionName= flowDefinitionName,
        RoleArn=role,
        HumanLoopConfig= {
            "WorkteamArn": WORKTEAM_ARN,
            "HumanTaskUiArn": humanTaskUiArn,
            "TaskCount": 1,
            "TaskDescription": "Review the contents and select correct values as indicated",
            "TaskTitle": "Equipment Condition Review"
        },
        OutputConfig={
            "S3OutputPath" : OUTPUT_PATH
        }
    )
flowDefinitionArn = create_workflow_definition_response['FlowDefinitionArn'] 

Send predictions to Amazon A2I human loops

We create an item list from the pandas DataFrame where we saved the Amazon Lookout for Equipment output. Run the following notebook cell to create a list of items to send for review:

NUM_TO_REVIEW = 5  # number of line items to review
dftimestamp = sig_full_df['Timestamp'].astype(str).to_list()
dfsig001 = sig_full_df['Q_avg'].astype(str).to_list()
dfsig002 = sig_full_df['Ws1_avg'].astype(str).to_list()
dfsig003 = sig_full_df['Ot_avg'].astype(str).to_list()
dfsig004 = sig_full_df['Nf_avg'].astype(str).to_list()
dfsig046 = sig_full_df['Ba_avg'].astype(str).to_list()
sig_list = [{'timestamp': dftimestamp[x],
             'reactive_power': dfsig001[x],
             'wind_speed_1': dfsig002[x],
             'outdoor_temp': dfsig003[x],
             'grid_frequency': dfsig004[x],
             'pitch_angle': dfsig046[x]} for x in range(NUM_TO_REVIEW)]
sig_list
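
The review UI also needs a matching list of anomaly flags, ano_list, which the notebook builds from the inference results. As a minimal sketch, assuming results_df from the earlier inference step exposes a prediction column (0 for normal, 1 for anomaly):

# Assumption: results_df holds the scheduler output with a 'prediction' column
ano_list = results_df['prediction'].astype(str).to_list()[:NUM_TO_REVIEW]
ano_list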

Run the following code to create a JSON input for the Amazon A2I loop. This contains the lists that are sent as input to the Amazon A2I UI displayed to the human reviewers.

ip_content = {
    "signal": sig_list,
    "anomaly": ano_list
}

Run the following notebook cell to call the Amazon A2I API to start the human loop:

import json
import uuid

humanLoopName = str(uuid.uuid4())

start_loop_response = a2i.start_human_loop(
    HumanLoopName=humanLoopName,
    FlowDefinitionArn=flowDefinitionArn,
    HumanLoopInput={
        "InputContent": json.dumps(ip_content)
    }
)

You can check the status of the human loop by running the next cell in the notebook.
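
That cell is essentially a describe call on the loop you just started; a minimal sketch, which also collects completed loops for the evaluation step later, looks like this:

# Describe the human loop and keep it once a reviewer has completed it
completed_human_loops = []
resp = a2i.describe_human_loop(HumanLoopName=humanLoopName)
print('HumanLoop status:', resp['HumanLoopStatus'])
if resp['HumanLoopStatus'] == 'Completed':
    completed_human_loops.append(resp)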

Annotate the results via the worker portal

Run the following notebook cell to get a login link to navigate to the private workforce portal:

workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]
print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
print('https://' + sagemaker_client.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])

The link takes you to the private worker portal. Select the human review task and choose Start working. After you review the values and make any corrections, choose Submit.

You can then evaluate the results stored in Amazon S3.

Evaluate the results

When the labeling work is complete, your results should be available in the S3 output path specified in the human review workflow definition. The human answers are returned and saved in the JSON file. Run the notebook cell to get the results from Amazon S3:

import re
import pprint

pp = pprint.PrettyPrinter(indent=4)
json_output = ''
for resp in completed_human_loops:
    # Strip the bucket portion of the output URI to get the S3 object key
    splitted_string = re.split('s3://' + a2ibucket + '/', resp['HumanLoopOutput']['OutputS3Uri'])
    print(splitted_string[1])
    output_bucket_key = splitted_string[1]
    response = s3.get_object(Bucket=a2ibucket, Key=output_bucket_key)
    content = response["Body"].read()
    json_output = json.loads(content)
    pp.pprint(json_output)
    print('\n')

You get a response with the human-reviewed answers and the flow definition. Refer to the notebook for the complete response.

Model retraining based on augmented datasets from Amazon A2I

Now we take the Amazon A2I output, process it, and send it back to Amazon Lookout for Equipment to retrain our model based on the human corrections. Refer to the accompanying notebook for all the steps to complete in this section. Let’s look at the last few entries of our original label file:

labels_df = pd.read_csv(os.path.join(LABEL_DATA, 'labels.csv'), header=None)
labels_df[0] = pd.to_datetime(labels_df[0])
labels_df[1] = pd.to_datetime(labels_df[1])
labels_df.columns = ['start', 'end']
labels_df.tail()

The following screenshot shows the labels file.

Update labels with new date ranges

Now let’s update our existing labels dataset with the new labels we received from the Amazon A2I human review process:

faulty = False
a2i_lbl_df = labels_df.copy()  # work on a copy so the original labels_df stays intact
x = json_output['humanAnswers'][0]
row_df = pd.DataFrame(columns=['rownr'])
tslist = {}

# Let's first check if the users mark equipment as faulty and if so get those row numbers into a dataframe            
for i in json_output['humanAnswers']:
    print("checking equipment review...")
    x = i['answerContent']
    for idx, key in enumerate(x):
        if "faulty" in key:
            if str(x.get(key)).split(':')[1].lstrip().strip('}') == "True": # faulty equipment selected
                faulty = True
                row_df.loc[len(row_df.index)] = [key.split('-')[1]]
                print("found faulty equipment in row: " + key.split('-')[1])


# Now we will get the date ranges for the faulty choices                     
for idx,k in row_df.iterrows():
    x = json_output['humanAnswers'][0]
    strchk = "TrueStart"+k['rownr']
    endchk = "TrueEnd"+k['rownr']
    for i in x['answerContent']:
        if i == strchk:
            tslist[i] = x['answerContent'].get(i)
        if i == endchk:
            tslist[i] = x['answerContent'].get(i)

            
# And finally let's add it to our new a2i labels dataset
for idx,k in row_df.iterrows():
    x = json_output['humanAnswers'][0]
    strchk = "TrueStart"+k['rownr']
    endchk = "TrueEnd"+k['rownr']
    a2i_lbl_df.loc[len(a2i_lbl_df.index)] = [tslist[strchk], tslist[endchk]]

You get the following response:

checking equipment review...
found faulty equipment in row: 1
found faulty equipment in row: 2

The following screenshot shows the updated labels file.
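
Before the upload, the notebook writes the augmented DataFrame to a local CSV file, which the next cell references as a2i_label_src_fname. A minimal sketch, assuming the same label data directory used earlier (the file name here is a placeholder):

# Write the augmented labels locally in the same headerless format as labels.csv
a2i_label_src_fname = os.path.join(LABEL_DATA, 'labels-augmented.csv')
a2i_lbl_df.to_csv(a2i_label_src_fname, header=False, index=False)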

Let’s upload the updated labels data to a new augmented labels file:

a2i_label_s3_dest_path = f's3://{BUCKET}/{PREFIX}/augmented-labelled-data/labels.csv'
!aws s3 cp $a2i_label_src_fname $a2i_label_s3_dest_path

Update the training dataset with new measurements

We now update our original training dataset with the new measurement range based on what we got back from Amazon A2I. Run the following code to load the original dataset into a new DataFrame, which we use to append our augmented data. Refer to the accompanying notebook for all the steps required.

turbine_id = 'R80711'
file = '../data/wind-turbine/final/training-data/'+turbine_id+'/'+turbine_id+'.csv'
newdf = pd.read_csv(file, index_col='Timestamp')
newdf.head()

The following screenshot shows our original training dataset snapshot.

Now we augment the training dataset with the simulated inference data we created earlier, for which the human reviewers indicated faulty equipment. Run the following code to modify the index of the simulated inference dataset so that each reading reflects a 10-minute interval:

sig_full_df = sig_full_df.set_index('Timestamp')
tm = pd.to_datetime('2021-04-05 20:30:00')
print(tm)
new_index = pd.date_range(
    start=tm,
    periods=sig_full_df.shape[0],
    freq='10min'
)
sig_full_df.index = new_index
sig_full_df.index.name = 'Timestamp'
sig_full_df = sig_full_df.reset_index()
sig_full_df['Timestamp'] = pd.to_datetime(sig_full_df['Timestamp'], errors='coerce')

Run the following code to append the simulated inference dataset to the original training dataset:

newdf = newdf.reset_index()
newdf = pd.concat([newdf,sig_full_df])
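
Optionally, you can confirm the append before writing the file out; for example:

# The appended simulated rows should show up at the end with the new timestamps
print(newdf.shape)
newdf.tail()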

The simulated inference data with the new, recent timestamps is appended to the end of the training dataset. Now let’s create a CSV file and copy the data to the training channel in Amazon S3:

TRAIN_DATA_AUGMENTED = os.path.join(TRAIN_DATA,'augmented')
os.makedirs(TRAIN_DATA_AUGMENTED, exist_ok=True)
newdf.to_csv('../data/wind-turbine/final/training-data/augmented/'+turbine_id+'.csv')
!aws s3 sync $TRAIN_DATA_AUGMENTED s3://$BUCKET/$PREFIX/training_data/augmented

Now we update the components map with this augmented dataset, reload the data into Amazon Lookout for Equipment, and retrain the model on the augmented data. Refer to the accompanying notebook for the detailed steps to retrain the model; the sketch that follows illustrates the shape of the underlying API call.
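
The notebook drives retraining through its own helper utilities. Purely as an illustration of the underlying API, a boto3 sketch of the retraining call might look like the following; every name, the role ARN, and the time ranges are placeholders.

from datetime import datetime
import boto3

lookout_client = boto3.client('lookoutequipment')

# Placeholders throughout: adjust names, role, and time ranges to your environment
lookout_client.create_model(
    ModelName='wind-turbine-model-augmented',
    DatasetName='wind-turbine-dataset',
    RoleArn='arn:aws:iam::111122223333:role/LookoutEquipmentRole',
    LabelsInputConfiguration={
        'S3InputConfiguration': {
            'Bucket': BUCKET,
            'Prefix': f'{PREFIX}/augmented-labelled-data/'
        }
    },
    TrainingDataStartTime=datetime(2018, 1, 1),
    TrainingDataEndTime=datetime(2021, 4, 6),
    DataPreProcessingConfiguration={'TargetSamplingRate': 'PT10M'}
)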

Conclusion

In this post, we walked you through how to use Amazon Lookout for Equipment to train a model that detects abnormal equipment behavior on a wind turbine dataset, review diagnostics from the trained model, review the model’s predictions with a human in the loop using Amazon A2I, augment the original training dataset with the reviewers’ corrections, and retrain the model with that feedback.

With Amazon Lookout for Equipment and Amazon A2I, you can set up a continuous prediction, review, train, and feedback loop to audit predictions and improve the accuracy of your models.

Please let us know what you think of this solution and how it applies to your industrial use case. Check out the GitHub repo for the full resources of this post, and visit the Amazon Lookout for Equipment and Amazon Augmented AI webpages to learn more. We look forward to hearing from you. Happy experimentation!


About the Authors 

Dastan Aitzhanov is a Solutions Architect in Applied AI with Amazon Web Services. He specializes in architecting and building scalable cloud-based platforms with an emphasis on machine learning, internet of things, and big data-driven applications. When not working, he enjoys camping, skiing, and spending time in the great outdoors with his family.

Prem Ranga is an Enterprise Solutions Architect based out of Atlanta, GA. He is part of the Machine Learning Technical Field Community and loves working with customers on their ML and AI journey. Prem is passionate about robotics, is an Autonomous Vehicles researcher, and also built the Alexa-controlled Beer Pours in Houston and other locations.

Mona Mona is a Senior AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with public sector customers, and helps them adopt machine learning on a large scale. She is passionate about NLP and ML explainability areas in AI/ML.

Baris Yasin is a Solutions Architect at AWS. He’s passionate about AI/ML & Analytics technologies and helping startup customers solve challenging business and technical problems with AWS.
