In a pilot study, an automated code checker found about 100 possible errors, 80% of which turned out to require correction.
AWS launches free digital training courses to empower business leaders with ML knowledge
Today, we’re pleased to launch Machine Learning Essentials for Business and Technical Decision Makers—a series of three free, on-demand, digital-training courses from AWS Training and Certification. These courses are intended to empower business leaders and technical decision makers with the foundational knowledge needed to begin shaping a machine learning (ML) strategy for their organization, even if they have no prior ML experience. Each 30-minute course includes real-world examples from Amazon’s 20+ years of experience scaling ML within its own operations as well as lessons learned through countless successful customer implementations. These new courses are based on content delivered through the AWS Machine Learning Embark program, an exclusive, hands-on, ML accelerator that brings together executives and technologists at an organization to solve business problems with ML via a holistic learning experience. After completing the three courses, business leaders and technical decision makers will be better able to assess their organization’s readiness, identify areas of the business where ML will be the most impactful, and identify concrete next steps.
Last year, Amazon announced that we’re committed to helping 29 million individuals around the world grow their tech skills with free cloud computing skills training by 2025. The new Machine Learning Essentials for Business and Technical Decision Makers series presents one more step in this direction, with three courses:
- Machine Learning: The Art of the Possible is the first course in the series. Using clear language and specific examples, this course helps you understand the fundamentals of ML, common use cases, and even potential challenges.
- Planning a Machine Learning Project – the second course – breaks down how you can help your organization plan for an ML project. Starting with the process of assessing whether ML is the right fit for your goals and progressing through the key questions you need to ask during deployment, this course helps you understand important issues, such as data readiness, project timelines, and deployment.
- Building a Machine Learning Ready Organization – the final course – offers insights into how to prepare your organization to successfully implement ML, from data-strategy evaluation, to culture, to starting an ML pilot, and more.
Democratizing access to free ML training
ML has the potential to transform nearly every industry, but most organizations struggle to adopt and implement ML at scale. Recent Gartner research shows that only 53% of ML projects make it from prototype to production. The most common barriers we see today are business and culture related. For instance, organizations often struggle to identify the right use cases to start their ML journey; this is often exacerbated by a shortage of skilled talent to execute on an organization’s ML ambitions. In fact, as an additional Gartner study shows, “skills of staff” is the number one challenge or barrier to the adoption of artificial intelligence (AI) and ML. Business leaders play a critical role in addressing these challenges by driving a culture of continuous learning and innovation; however, many lack the resources to develop their own knowledge of ML and its use cases.
With the new Machine Learning Essentials for Business and Technical Decision Makers course, we’re making a portion of the AWS Machine Learning Embark curriculum available globally as free, self-paced, digital-training courses.
The AWS Machine Learning Embark program has already helped many organizations harness the power of ML at scale. For example, the Met Office (the UK’s national weather service) used the program to accelerate its team’s ML knowledge. As a research- and science-based organization, the Met Office develops custom weather-forecasting and climate-projection models that rely on very large observational datasets that are constantly being updated. As one of its many data-driven challenges, the Met Office was looking to develop an ML approach to investigate how the Earth’s biosphere could alter in response to climate change. The Met Office partnered with the Amazon ML Solutions Lab through the AWS Machine Learning Embark program to explore novel approaches to solving this problem. “We were excited to work with colleagues from the AWS ML Solutions Lab as part of the Embark program,” said Professor Albert Klein-Tank, head of the Met Office’s Hadley Centre for Climate Science and Services. “They provided technical skills and experience that enabled us to explore a complex categorization problem that offers improved insight into how Earth’s biosphere could be affected by climate change. Our climate models generate huge volumes of data, and the ability to extract added value from it is essential for the provision of advice to our government and commercial stakeholders. This demonstration of the application of machine learning techniques to research projects has supported the further development of these skills across the Met Office.”
In addition to giving access to ML Embark content through the Machine Learning Essentials for Business and Technical Decision Makers, we’re also expanding the availability of the full ML Embark program through key strategic AWS Partners, including Slalom Consulting. We’re excited to jointly offer this exclusive program to all enterprise customers looking to jump-start their ML journey.
We invite you to expand your ML knowledge and help lead your organization to innovate with ML. Learn more and get started today.
About the Author
Michelle K. Lee is vice president of the Machine Learning Solutions Lab at AWS.
TheWebConf: Where communities converge on questions of scale
For Amazon’s Xin Luna Dong, the conference’s diversity mirrors that of her project: building the Amazon product knowledge graph.
Estimating 3D pose for athlete tracking using 2D videos and Amazon SageMaker Studio
In preparation for the upcoming Olympic Games, Intel®, an American multinational corporation and one of the world’s largest technology companies, developed a concept around 3D Athlete Tracking (3DAT). 3DAT is a machine learning (ML) solution to create real-time digital models of athletes in competition in order to increase fan engagement during broadcasts. Intel was looking to leverage this technology for the purpose of coaching and training elite athletes.
Classical computer vision methods for 3D pose reconstruction have proven cumbersome for most scientists, because those methods mostly rely on embedding additional sensors on an athlete and because 3D labels and models are scarce. Although we can put seamless data collection mechanisms in place using regular mobile phones, developing 3D models from 2D video data is a challenge, given the lack of depth information in 2D videos. Intel’s 3DAT team partnered with the Amazon ML Solutions Lab (MLSL) to develop 3D human pose estimation techniques on 2D videos in order to create a lightweight solution for coaches to extract biomechanics and other metrics of their athletes’ performance.
This unique collaboration brought together Intel’s rich history in innovation and Amazon ML Solution Lab’s computer vision expertise to develop a 3D multi-person pose estimation pipeline using 2D videos from standard mobile phones as inputs, with Amazon SageMaker Studio notebooks (SM Studio) as the development environment.
Jonathan Lee, Director of Intel Sports Performance, Olympic Technology Group, says, “The MLSL team did an amazing job listening to our requirements and proposing a solution that would meet our customers’ needs. The team surpassed our expectations, developing a 3D pose estimation pipeline using 2D videos captured with mobile phones in just two weeks. By standardizing our ML workload on Amazon SageMaker, we achieved a remarkable 97% average accuracy on our models.”
This post discusses how we employed 3D pose estimation models and generated 3D outputs on 2D video data collected from Ashton Eaton, a decathlete and two-time Olympic gold medalist from the United States, using different angles. It also presents two computer vision techniques to align the videos captured from different angles, thereby allowing coaches to use a unique set of 3D coordinates across the run.
Challenges
Human pose estimation techniques in computer vision aim to provide a graphical skeleton of a person detected in a scene. They include coordinates of predefined key points corresponding to human joints, such as the arms, neck, and hips. These coordinates are used to capture the body’s orientation for further analysis, such as pose tracking, posture analysis, and subsequent evaluation. Recent advances in computer vision and deep learning have enabled scientists to explore pose estimation in a 3D space, where the Z-axis provides additional insights compared to 2D pose estimation. These additional insights can be used for more comprehensive visualization and analysis. However, building a 3D pose estimation model from scratch is challenging because it requires imaging data along with 3D labels. Therefore, many researchers employ pretrained 3D pose estimation models.
Data processing pipeline
We designed an end-to-end 3D pose estimation pipeline illustrated in the following diagram using SM Studio, which encompassed several components:
- Amazon Simple Storage Service (Amazon S3) bucket to host video data
- Frame extraction module to convert video data to static images
- Object detection modules to detect bounding boxes of persons in each frame
- 2D pose estimation for future evaluation purposes
- 3D pose estimation module to generate 3D coordinates for each person in each frame
- Evaluation and visualization modules
SM Studio offers a broad range of features that facilitate the development process, including easy access to data in Amazon S3, available compute capacity, software and library availability, and an integrated development environment (IDE) for ML applications.
First, we read the video data from the S3 bucket and extracted the 2D frames in a portable network graphics (PNG) format for frame-level development. We used YOLOv3 object detection to generate a bounding box of each person detected in a frame. For more information, see Benchmarking Training Time for CNN-based Detectors with Apache MXNet.
Next, we passed the frames and corresponding bounding box information to the 3D pose estimation model to generate the key points for evaluation and visualization. We applied a 2D pose estimation technique to the frames, and we generated the key points per frame for development and evaluation. The following sections discuss the details of each module in the 3D pipeline.
Data preprocessing
The first step was to extract frames from a given video utilizing OpenCV, as shown in the following figure. We used two counters to keep track of time and frame count, respectively, because videos were captured at different frames per second (FPS) rates. We then stored the sequence of images as video_name + second_count + frame_count in PNG format.
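The following is a minimal sketch of this frame extraction step; the input and output paths are hypothetical, and the naming mirrors the video_name + second_count + frame_count convention described above:

import os
import cv2

def extract_frames(video_path, output_dir, video_name):
    """Extract frames from a video and name them by second and frame counters."""
    os.makedirs(output_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)  # videos were captured at different FPS rates
    second_count, frame_count = 0, 0
    while True:
        success, frame = cap.read()
        if not success:
            break
        # Roll the frame counter over into the second counter at the video's FPS
        if frame_count >= int(fps):
            second_count += 1
            frame_count = 0
        filename = '{}_{}_{}.png'.format(video_name, second_count, frame_count)
        cv2.imwrite(os.path.join(output_dir, filename), frame)
        frame_count += 1
    cap.release()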
Object (person) detection
We employed YOLOv3 pretrained models based on the Pascal VOC dataset to detect persons in frames. For more information, see Deploying custom models built with Gluon and Apache MXNet on Amazon SageMaker. The YOLOv3 algorithm produced the bounding boxes shown in the following animations (the original images are resized to 910×512 pixels).
We stored the bounding box coordinates in a CSV file, in which the rows indicated the frame index, bounding box information as a list, and their confidence scores.
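The following sketch shows how a pretrained YOLOv3 detector from the GluonCV model zoo could be applied to a single extracted frame and its person detections written to a CSV file; the frame file name, confidence threshold, and CSV layout are illustrative assumptions:

import csv
from gluoncv import model_zoo, data

# Pretrained YOLOv3 (Darknet-53 backbone) trained on the Pascal VOC dataset
detector = model_zoo.get_model('yolo3_darknet53_voc', pretrained=True)
# Keep only the 'person' class to speed up non-maximum suppression
detector.reset_class(classes=['person'], reuse_weights=['person'])

# Load and preprocess one extracted frame (hypothetical file name)
x, img = data.transforms.presets.yolo.load_test('video1_0_0.png', short=512)
class_ids, scores, bboxes = detector(x)

# Persist the frame index, bounding boxes (as lists), and confidence scores to a CSV file
with open('bounding_boxes.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    for box, score in zip(bboxes[0].asnumpy(), scores[0].asnumpy()):
        if score[0] > 0.5:  # assumed confidence threshold
            writer.writerow([0, box.tolist(), float(score[0])])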
2D pose estimation
We selected ResNet-18 V1b as the pretrained pose estimation model, which considers a top-down strategy to estimate human poses within bounding boxes output by the object detection model. We further reset the detector classes to include humans so that the non-maximum suppression (NMS) process could be performed faster. The Simple Pose network was applied to predict heatmaps for key points (as in the following animation), and the highest values in the heatmaps were mapped to the coordinates on the original images.
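Continuing the detection sketch above, the GluonCV Simple Pose network with a ResNet-18 V1b backbone can be applied to the detector outputs roughly as follows; the variable names carry over from that sketch and are assumptions rather than the project’s actual code:

from gluoncv import model_zoo
from gluoncv.data.transforms.pose import detector_to_simple_pose, heatmap_to_coord

# Pretrained Simple Pose network with a ResNet-18 V1b backbone
pose_net = model_zoo.get_model('simple_pose_resnet18_v1b', pretrained=True)

# img, class_ids, scores, and bboxes come from the YOLOv3 person detector above
pose_input, upscale_bbox = detector_to_simple_pose(img, class_ids, scores, bboxes)
predicted_heatmap = pose_net(pose_input)

# Map the highest heatmap values back to (x, y) key point coordinates on the original image
pred_coords, confidence = heatmap_to_coord(predicted_heatmap, upscale_bbox)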
3D pose estimation
We employed a state-of-the-art 3D pose estimation algorithm, 3DMPPE (Moon et al.), a camera distance-aware top-down method for multi-person 3D pose estimation from a single RGB frame. This algorithm consisted of two major phases:
- RootNet – Estimates the camera-centered coordinates of a person’s root in a cropped frame
- PoseNet – Uses a top-down approach to predict the relative 3D pose coordinates in the cropped image
Next, we used the bounding box information to project the 3D coordinates back to the original space. 3DMPPE offered two pretrained models, trained on the Human3.6M and MuCo-3DHP datasets (for more information, see the GitHub repo), which include 17 and 21 key points, respectively, as illustrated in the following animations. We used the 3D pose coordinates predicted by the two pretrained models for visualization and evaluation purposes.
Evaluation
To evaluate the 2D and 3D pose estimation models’ performance, we used the 2D pose (x,y) and 3D pose (x,y,z) coordinates for each joint generated for every frame in a given video. The number of key points varies based on the dataset; for instance, the Leeds Sports Pose Dataset (LSP) includes 14, whereas the MPII Human Pose dataset, a state-of-the-art benchmark for evaluating articulated human pose estimation, includes 16 key points. We used two metrics that are commonly used for both 2D and 3D pose estimation, described in the following sections. In our implementation, the default key points dictionary followed the COCO detection dataset, which has 17 key points (see the following image), in the following order:
KEYPOINTS = {
0: "nose",
1: "left_eye",
2: "right_eye",
3: "left_ear",
4: "right_ear",
5: "left_shoulder",
6: "right_shoulder",
7: "left_elbow",
8: "right_elbow",
9: "left_wrist",
10: "right_wrist",
11: "left_hip",
12: "right_hip",
13: "left_knee",
14: "right_knee",
15: "left_ankle",
16: "right_ankle"
}
Mean per joint position error
Mean per joint position error (MPJPE) is the Euclidean distance between a ground truth joint and the corresponding joint prediction. Because MPJPE measures an error (loss) distance, lower values indicate greater precision.
We use the following pseudo code:
- Let G denote the ground_truth_joint matrix and preprocess G by:
  - Replacing the null entries in G with [0,0] (2D) or [0,0,0] (3D)
  - Using a Boolean matrix B to store the locations of the null entries
- Let P denote the predicted_joint matrix, and align G and P by frame index, inserting a zero vector for any frame that has no results or is unlabeled
- Compute the element-wise Euclidean distance between G and P, and let D denote the distance matrix
- Replace D(i,j) with 0 if B(i,j) marks a null entry
- The mean per joint position error is the mean value of each column of D, computed over the entries where D(i,j) ≠ 0
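The following is a minimal NumPy sketch of this computation (not the project’s actual code), assuming G and P are m×n×d arrays (frames × joints × coordinate dimension) and B is an m×n Boolean mask marking the null ground-truth entries:

import numpy as np

def mean_per_joint_position_error(G, P, B):
    """Per-joint MPJPE: mean Euclidean distance over the frames where the joint is labeled."""
    D = np.linalg.norm(G - P, axis=-1)       # element-wise Euclidean distance, shape (m, n)
    D[B] = 0.0                               # zero out distances for null (unlabeled) entries
    labeled = (~B).sum(axis=0)               # number of labeled frames per joint
    return D.sum(axis=0) / np.maximum(labeled, 1)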
The following figure visualizes an example of a video’s per joint error, a matrix with dimension m*n, where m denotes the number of frames in the video and n denotes the number of joints (key points). The figure shows a heatmap of the per joint position error on the left and the mean per joint position error on the right.
Percentage of correct key points
The percentage of correct key points (PCK) represents a pose evaluation metric where a detected joint is considered correct if the distance between the predicted and actual joint is within a certain threshold; this threshold may vary, which leads to a few different variations of metrics. Three variations are commonly used:
- PCKh@0.5, which is when the threshold is defined as 0.5 * head bone link
- PCK@0.2, which is when the distance between the predicted and actual joint is < 0.2 * torso diameter
- A hard threshold of 150 mm
In our solution, we used PCKh@0.5 because our ground truth XML data contains the head bounding box, which we can use to compute the head-bone link. To the best of our knowledge, no existing package contains an easy-to-use implementation of this metric; therefore, we implemented the metric in-house.
Pseudo code
We used the following pseudo code:
- Let G denote the ground-truth joint matrix and preprocess G by:
  - Replacing the null entries in G with [0,0] (2D) or [0,0,0] (3D)
  - Using a Boolean matrix B to store the locations of the null entries
- For each frame F(i), use its head bounding box (xmin, ymin, xmax, ymax) to compute the frame’s head-bone link H(i) = sqrt((xmax − xmin)² + (ymax − ymin)²)
- Let P denote the predicted joint matrix, and align G and P by frame index, inserting a zero tensor if any frame is missing
- Compute the element-wise 2-norm error between G and P; let E denote the error matrix, where E(i,j) = ||G(i,j) − P(i,j)||
- Compute a scale matrix S = H*I, where I represents an identity matrix with the same dimension as E
- To avoid division by 0, replace S(i,j) with 0.000001 if B(i,j) = 1
- Compute the scaled error matrix SE(i,j) = E(i,j) / S(i,j)
- Filter SE with threshold = 0.5, and let C denote the counter matrix, where C(i,j) = 1 if SE(i,j) < 0.5 and C(i,j) = 0 otherwise
- Count the number of 1’s in each column C(*,j) as vector c and the number of 0’s in each column B(*,j) as vector b
- PCKh@0.5 = mean(c / b)
In the sixth step (replace S(i,j) with 0.000001 if B(i,j) = 1), we set a trap for the scaled error matrix by replacing the zero entries with the tiny value 0.000001. Dividing by such a tiny number produces a very large scaled error, so when we later apply the 0.5 threshold, the null entries are automatically excluded from the correct predictions. We then count only the non-null entries recorded in the Boolean matrix, which also excludes the null entries from the denominator. This engineering trick filters out null entries arising either from unlabeled key points in the ground truth or from frames in which no person was detected.
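As a rough illustration of this in-house implementation (the array shapes and variable names are assumptions, not the actual code), PCKh@0.5 can be sketched in NumPy as follows:

import numpy as np

def pckh_at_05(G, P, B, head_bboxes, threshold=0.5):
    """PCKh@0.5: fraction of labeled joints whose error is below threshold * head-bone link."""
    m, n, _ = G.shape
    E = np.linalg.norm(G - P, axis=-1)                    # error matrix, shape (m, n)
    xmin, ymin, xmax, ymax = head_bboxes.T                # head bounding box per frame, shape (m, 4)
    H = np.sqrt((xmax - xmin) ** 2 + (ymax - ymin) ** 2)  # head-bone link per frame
    S = np.tile(H[:, None], (1, n))                       # scale matrix, shape (m, n)
    S[B] = 1e-6                                           # trap: null entries get a tiny scale
    SE = E / S                                            # scaled error matrix
    C = (SE < threshold) & (~B)                           # correct predictions on labeled joints
    labeled = (~B).sum(axis=0)                            # labeled entries per joint
    return float(np.mean(C.sum(axis=0) / np.maximum(labeled, 1)))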
Video alignment
We considered two different camera configurations to capture video data from athletes, namely the line and box setups. The line setup consists of four cameras placed along a line, while the box setup consists of four cameras placed at each corner of a rectangle. In the line configuration, the cameras were synchronized and lined up at a predefined distance from each other, with slightly overlapping camera angles. The objective of the video alignment in the line configuration was to identify the timestamps connecting consecutive cameras in order to remove repeated and empty frames. We implemented two approaches, based on object detection and on cross-correlation of optical flows.
Object detection algorithm
We used the object detection results in this approach, including persons’ bounding boxes from the previous steps. The object detection techniques produced a probability (score) per person in each frame. Therefore, plotting the scores in a video enabled us to find the frame where the first person appeared or disappeared. The reference frame from the box configuration was extracted from each video, and all cameras were then synchronized based on the first frame’s references. In the line configuration, both the start and end timestamps were extracted, and a rule-based algorithm was implemented to connect and align consecutive videos, as illustrated in the following images.
The top videos in the following figure show the original videos in the line configuration. Underneath that are person detection scores. The next rows show a threshold of 0.75 applied to the scores, and appropriate start and end timestamps are extracted. The bottom row shows aligned videos for further analysis.
Moment of snap
We introduced the moment of snap (MOS) – a well-known alignment approach – which indicates when an event or play begins. We wanted to determine the frame number when an athlete enters or leaves the scene. Typically, relatively little movement occurs on the running field before the start and after the end of the snap, whereas relatively substantial movement occurs when the athlete is running. Therefore, intuitively, we could find the MOS frame by finding the video frames with relatively large differences in the video’s movement before and after the frame. To this end, we utilized dense optical flow, a standard measure of movement in video, to estimate the MOS. First, given a video, we computed optical flow for every two consecutive frames. The following videos present a visualization of dense optical flow on the horizontal axis.
We then measured the cross-correlation between consecutive frames’ optical flows, which quantifies how much the motion changes from one frame to the next. For each angle’s camera-captured video, we repeated the algorithm to find its MOS. Finally, we used the MOS frame as the key frame for aligning the videos from different angles. The following video details these steps.
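The following is a simplified sketch of this idea using OpenCV’s Farneback dense optical flow; the use of only the horizontal flow component and the choice of the minimum-correlation frame as the MOS are illustrative assumptions rather than the exact production logic:

import cv2
import numpy as np

def estimate_mos(frames):
    """Estimate the moment of snap as the frame with the largest change in motion."""
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense (Farneback) optical flow between two consecutive frames
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow[..., 0])  # horizontal component of the flow field
        prev = gray

    # Normalized cross-correlation between consecutive optical-flow fields;
    # the sharpest change in motion marks the MOS candidate
    scores = []
    for f1, f2 in zip(flows[:-1], flows[1:]):
        a, b = f1.ravel() - f1.mean(), f2.ravel() - f2.mean()
        scores.append(float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)))
    return int(np.argmin(scores)) + 1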
Conclusion
The technical objective of the work demonstrated in this post was to develop a deep learning-based solution that produces 3D pose estimation coordinates from 2D videos. We employed a camera distance-aware technique with a top-down approach to achieve 3D multi-person pose estimation. Further, using object detection, cross-correlation, and optical flow algorithms, we aligned the videos captured from different angles.
This work has enabled coaches to analyze 3D pose estimation of athletes over time to measure biomechanics metrics, such as velocity, and monitor the athletes’ performance using quantitative and qualitative methods.
This post demonstrated a simplified process for extracting 3D poses in real-world scenarios, which can be scaled to coaching in other sports such as swimming or team sports.
If you would like help with accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab program.
References
Moon, Gyeongsik, Ju Yong Chang, and Kyoung Mu Lee. “Camera distance-aware top-down approach for 3d multi-person pose estimation from a single RGB image.” In Proceedings of the IEEE International Conference on Computer Vision, pp. 10133-10142. 2019.
About the Author
Saman Sarraf is a Data Scientist at the Amazon ML Solutions Lab. His background is in applied machine learning including deep learning, computer vision, and time series data prediction.
Amery Cong is an Algorithms Engineer at Intel, where he develops machine learning and computer vision technologies to drive biomechanical analyses at the Olympic Games. He is interested in quantifying human physiology with AI, especially in a sports performance context.
Ashton Eaton is a Product Development Engineer at Intel, where he helps design and test technologies aimed at advancing sport performance. He works with customers and the engineering team to identify and develop products that serve customer needs. He is interested in applying science and technology to human performance.
Jonathan Lee is the Director of Sports Performance Technology, Olympic Technology Group at Intel. He studied the application of machine learning to health as an undergrad at UCLA and during his graduate work at University of Oxford. His career has focused on algorithm and sensor development for health and human performance. He now leads the 3D Athlete Tracking project at Intel.
Nelson Leung is the Platform Architect in the Sports Performance CoE at Intel, where he defines end-to-end architecture for cutting-edge products that enhance athlete performance. He also leads the implementation, deployment and productization of these machine learning solutions at scale to different Intel partners.
Suchitra Sathyanarayana is a manager at the Amazon ML Solutions Lab, where she helps AWS customers across different industry verticals accelerate their AI and cloud adoption. She holds a PhD in Computer Vision from Nanyang Technological University, Singapore.
Wenzhen Zhu is a data scientist with the Amazon ML Solution Lab team at Amazon Web Services. She leverages Machine Learning and Deep Learning to solve diverse problems across industries for AWS customers.
Implement checkpointing with TensorFlow for Amazon SageMaker Managed Spot Training
Customers often ask us how they can lower their costs when conducting deep learning training on AWS. Training deep learning models with libraries such as TensorFlow, PyTorch, and Apache MXNet usually requires access to GPU instances, which are AWS instance types that provide access to NVIDIA GPUs with thousands of compute cores. GPU instance types can be more expensive than other Amazon Elastic Compute Cloud (Amazon EC2) instance types, so optimizing usage of these instances is a priority for customers as well as an overall best practice for well-architected workloads.
Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to prepare, build, train, and deploy machine learning (ML) models quickly. SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models. SageMaker provides all the components used for ML in a single toolset so models get to production faster with less effort and at lower cost.
Amazon EC2 Spot Instances offer spare compute capacity available in the AWS Cloud at steep discounts compared to On-Demand prices. Amazon EC2 can interrupt Spot Instances with 2 minutes of notification when the service needs the capacity back. You can use Spot Instances for various fault-tolerant and flexible applications. Some examples are analytics, containerized workloads, stateless web servers, CI/CD, training and inference of ML models, and other test and development workloads. Spot Instance pricing makes high-performance GPUs much more affordable for deep learning researchers and developers who run training jobs.
One of the key benefits of SageMaker is that it frees you of any infrastructure management, no matter the scale you’re working at. For example, instead of having to set up and manage complex training clusters, you simply tell SageMaker which EC2 instance type to use and how many you need. The appropriate instances are then created on-demand, configured, and stopped automatically when the training job is complete. As SageMaker customers have quickly understood, this means that they pay only for what they use. Building, training, and deploying ML models are billed by the second, with no minimum fees, and no upfront commitments. SageMaker can also use EC2 Spot Instances for training jobs, which optimize the cost of the compute used for training deep-learning models.
In this post, we walk through the process of training a TensorFlow model with Managed Spot Training in SageMaker. We walk through the steps required to set up and run a training job that saves training progress in Amazon Simple Storage Service (Amazon S3) and restarts the training job from the last checkpoint if an EC2 instance is interrupted. This allows our training jobs to continue from the same point before the interruption occurred. Finally, we see the savings that we achieved by running our training job on Spot Instances using Managed Spot Training in SageMaker.
Managed Spot Training in SageMaker
SageMaker makes it easy to train ML models using managed EC2 Spot Instances. Managed Spot Training can reduce the cost of training models by up to 90% compared to On-Demand Instances. With only a few lines of code, SageMaker can manage Spot interruptions on your behalf.
Managed Spot Training uses EC2 Spot Instances to run training jobs instead of On-Demand Instances. You can specify which training jobs use Spot Instances and a stopping condition that specifies how long SageMaker waits for a training job to complete using EC2 Spot Instances. Metrics and logs generated during training runs are available in Amazon CloudWatch.
Managed Spot Training is available in all training configurations:
- All instance types supported by SageMaker
- All models: built-in algorithms, built-in frameworks, and custom models
- All configurations: single instance training and distributed training
Interruptions and checkpointing
There’s an important difference when working with Managed Spot Training. Unlike On-Demand Instances that are expected to be available until a training job is complete, Spot Instances may be reclaimed any time Amazon EC2 needs the capacity back.
SageMaker, as a fully managed service, handles the lifecycle of Spot Instances automatically. It interrupts the training job, attempts to obtain Spot Instances again, and either restarts or resumes the training job.
To avoid restarting a training job from scratch if it’s interrupted, we strongly recommend that you implement checkpointing, a technique that saves the model in training at periodic intervals. When you use checkpointing, you can resume a training job from a well-defined point in time, continuing from the most recent partially trained model, and avoiding starting from the beginning and wasting compute time and money.
To implement checkpointing, we need to distinguish between the types of algorithms you can use:
- Built-in frameworks and custom models – You have full control over the training code. Just make sure that you use the appropriate APIs to save model checkpoints to Amazon S3 regularly, using the location you defined in the CheckpointConfig parameter and passed to the SageMaker Estimator. TensorFlow uses checkpoints by default. For other frameworks, see our sample notebooks and Use Machine Learning Frameworks, Python, and R with Amazon SageMaker.
- Built-in algorithms – Computer vision algorithms support checkpointing (object detection, semantic segmentation, and image classification). Because they tend to train on large datasets and run for longer than other algorithms, they have a higher likelihood of being interrupted. The XGBoost built-in algorithm also supports checkpointing.
TensorFlow image classification model with Managed Spot Training
To demonstrate Managed Spot Training and checkpointing, I guide you through the steps needed to train a TensorFlow image classification model. To make sure that your training scripts can take advantage of SageMaker Managed Spot Training, we need to implement the following:
- Frequent saving of checkpoints (in this case, saving a checkpoint at the end of each epoch)
- The ability to resume training from checkpoints if checkpoints exist
Save checkpoints
SageMaker automatically backs up and syncs checkpoint files generated by your training script to Amazon S3. Therefore, you need to make sure that your training script saves checkpoints to a local checkpoint directory on the Docker container that’s running the training. The default location to save the checkpoint files is /opt/ml/checkpoints, and SageMaker syncs these files to the specific S3 bucket. Both the local and S3 checkpoint locations are customizable.
Saving checkpoints using Keras is very easy. You need to create an instance of the ModelCheckpoint callback class and register it with the model by passing it to the fit() function.
You can find the full implementation code on the GitHub repo.
The following is the relevant code:
callbacks = []
callbacks.append(ModelCheckpoint(args.checkpoint_path + '/checkpoint-{epoch}.h5'))

logging.info("Starting training from epoch: {}".format(initial_epoch_number + 1))

model.fit(x=train_dataset[0],
          y=train_dataset[1],
          # steps derived from the number of examples and the batch size
          steps_per_epoch=(num_examples_per_epoch('train') // args.batch_size),
          epochs=args.epochs,
          initial_epoch=initial_epoch_number,
          validation_data=validation_dataset,
          validation_steps=(num_examples_per_epoch('validation') // args.batch_size),
          callbacks=callbacks)
The checkpoint files are saved with names such as checkpoint-1.h5, checkpoint-2.h5, checkpoint-3.h5, and so on.
For this post, I’m passing initial_epoch, which you normally don’t set. This lets us resume training from a certain epoch number and comes in handy when you already have checkpoint files.
The checkpoint path is configurable because we get it from args.checkpoint_path in the main function:
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    ...
    parser.add_argument("--checkpoint-path",
                        type=str,
                        default="/opt/ml/checkpoints",
                        help="Path where checkpoints will be saved.")
    ...
    args = parser.parse_args()
Resume training from checkpoint files
When Spot capacity becomes available again after Spot interruption, SageMaker launches a new Spot Instance, instantiates a Docker container with your training script, copies your dataset and checkpoint files from Amazon S3 to the container, and runs your training scripts.
Your script needs to implement resuming training from checkpoint files; otherwise, it restarts training from scratch. You can implement a load_model_from_checkpoints function as shown in the following code. It takes in the local checkpoint files path (/opt/ml/checkpoints being the default) and returns a model loaded from the latest checkpoint and the associated epoch number.
You can find the full implementation code on the GitHub repo.
The following is the relevant code:
def load_model_from_checkpoints(checkpoint_path):
    checkpoint_files = [file for file in os.listdir(checkpoint_path) if file.endswith('.h5')]
    logging.info('--------------------------------------------')
    logging.info("Available checkpoint files: {}".format(checkpoint_files))
    # Extract the numeric epoch index from each file name (for example, 'checkpoint-3.h5' -> 3)
    epoch_numbers = [int(re.search(r'(?<=checkpoint-)\d+', file).group()) for file in checkpoint_files]
    max_epoch_number = max(epoch_numbers)
    max_epoch_index = epoch_numbers.index(max_epoch_number)
    max_epoch_filename = checkpoint_files[max_epoch_index]
    logging.info('Latest epoch checkpoint file name: {}'.format(max_epoch_filename))
    logging.info('Resuming training from epoch: {}'.format(max_epoch_number + 1))
    logging.info('---------------------------------------------')
    resumed_model_from_checkpoints = load_model(f'{checkpoint_path}/{max_epoch_filename}')
    return resumed_model_from_checkpoints, max_epoch_number
Managed Spot Training with a TensorFlow estimator
You can launch SageMaker training jobs from your laptop, desktop, EC2 instance, or SageMaker notebook instances. Make sure you have the SageMaker Python SDK installed and the right user permissions to run SageMaker training jobs.
To run a Managed Spot Training job, you need to specify a few additional options in your standard SageMaker Estimator call:
- use_spot_instances – Specifies whether to use SageMaker Managed Spot Training. If enabled, you should also set the max_wait argument.
- max_wait – Timeout in seconds waiting for Spot training instances (default: None). After this amount of time, SageMaker stops waiting for Spot Instances to become available or the training job to finish. I’m willing to wait up to double the time an On-Demand training run takes, so I set it to 1,200 seconds (20 minutes).
- max_run – Timeout in seconds for training (default: 24 * 60 * 60). After this amount of time, SageMaker stops the job regardless of its current status. From previous runs, I know that the training job finishes in about 4 minutes, so I set it to 600 seconds.
- checkpoint_s3_uri – The S3 URI in which to persist the checkpoints (if any) that the algorithm generates during training.
You can find the full implementation code on the GitHub repo.
The following is the relevant code:
use_spot_instances = True
max_run = 600
max_wait = 1200
checkpoint_suffix = str(uuid.uuid4())[:8]
checkpoint_s3_uri = 's3://{}/checkpoint-{}'.format(bucket, checkpoint_suffix)
hyperparameters = {'epochs': 5, 'batch-size': 256}
spot_estimator = TensorFlow(entry_point='cifar10_keras_main.py',
source_dir='source_dir',
metric_definitions=metric_definitions,
hyperparameters=hyperparameters,
role=role,
framework_version='1.15.2',
py_version='py3',
instance_count=1,
instance_type='ml.p3.2xlarge',
base_job_name='cifar10-tf-spot-1st-run',
tags=tags,
checkpoint_s3_uri=checkpoint_s3_uri,
use_spot_instances=use_spot_instances,
max_run=max_run,
max_wait=max_wait)
Those are all the changes you need to make to significantly lower your cost of ML training.
To monitor your training job and view savings, you can look at the logs on your Jupyter notebook.
Towards the end of the job, you should see two lines of output:
- Training seconds: X – The actual compute time your training job spent
- Billable seconds: Y – The time you are billed for after Spot discounting is applied.
If you enabled use_spot_instances
, you should see a notable difference between X and Y, signifying the cost savings you get for using Managed Spot Training. This is reflected in an additional line:
- Managed Spot Training savings – Calculated as (1-Y/X)*100 %
The following screenshot shows the output logs for our Jupyter notebook:
When the training is complete, you can also navigate to the Training jobs page on the SageMaker console and choose your training job to see how much you saved.
For this example training job of a TensorFlow model, my training job ran for 144 seconds, but I’m only billed for 43 seconds, so for a 5-epoch training run on an ml.p3.2xlarge GPU instance, I was able to save 70% on training costs!
Confirm that checkpointing and recovery work when your training job is interrupted
How can you test if your training job will resume properly if a Spot Interruption occurs?
If you’re familiar with running EC2 Spot Instances, you know that you can simulate your application behavior during a Spot Interruption by following the recommended best practices. However, because SageMaker is a managed service, and manages the lifecycle of EC2 instances on your behalf, you can’t stop a SageMaker training instance manually. Your only option is to stop the entire training job.
You can still test your code’s behavior when resuming an incomplete training by running a shorter training job, and then using the outputted checkpoints from that training job as inputs to a longer training job. To do this, first run a SageMaker Managed Spot Training job for a specified number of epochs as described in the previous section. Let’s say you run training for five epochs. SageMaker would have backed up your checkpoint files to the specified S3 location for the five epochs.
You can navigate to the training job details page on the SageMaker console to see the checkpoint configuration S3 output path.
Choose the S3 output path link to navigate to the checkpointing S3 bucket, and verify that five checkpoint files are available there.
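If you prefer to verify this programmatically, the following is a small boto3 sketch; bucket and checkpoint_suffix are the variables used to build checkpoint_s3_uri in the estimator code above:

import boto3

s3 = boto3.client('s3')
# List the checkpoint objects synced by SageMaker; expect checkpoint-1.h5 through checkpoint-5.h5
response = s3.list_objects_v2(Bucket=bucket, Prefix='checkpoint-{}'.format(checkpoint_suffix))
for obj in response.get('Contents', []):
    print(obj['Key'])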
Now run a second training run with 10 epochs. You should provide the first job’s checkpoint location to checkpoint_s3_uri so the training job can use those checkpoints as inputs to the second training job.
You can find the full implementation code in the GitHub repo.
The following is the relevant code:
hyperparameters = {'epochs': 10, 'batch-size': 256}
spot_estimator = TensorFlow(entry_point='cifar10_keras_main.py',
source_dir='source_dir',
metric_definitions=metric_definitions,
hyperparameters=hyperparameters,
role=role,
framework_version='1.15.2',
py_version='py3',
instance_count=1,
instance_type='ml.p3.2xlarge',
base_job_name='cifar10-tf-spot-2nd-run',
tags=tags,
checkpoint_s3_uri=checkpoint_s3_uri,
use_spot_instances=use_spot_instances,
max_run=max_run,
max_wait=max_wait)
By providing checkpoint_s3_uri with your previous job’s checkpoints, you’re telling SageMaker to copy those checkpoints to your new job’s container. Your training script then loads the latest checkpoint and resumes training. The following screenshot shows that the training resumes from the sixth epoch.
To confirm that all checkpoint files were created, navigate to the same S3 bucket. This time you can see that 10 checkpoint files are available.
The key difference between simulating an interruption this way and how SageMaker manages interruptions is that you’re creating a new training job to test your code. In the case of Spot Interruptions, SageMaker simply resumes the existing interrupted job.
Implement checkpointing with PyTorch, MXNet, and XGBoost built-in and script mode
The steps shown in the TensorFlow example are basically the same for PyTorch and MXNet. The code for saving checkpoints and loading them to resume training is different.
You can see full examples for TensorFlow 1.x/2.x, PyTorch, MXNet, and XGBoost built-in and script mode in the GitHub repo.
Conclusions and next steps
In this post, we trained a TensorFlow image classification model using SageMaker Managed Spot Training. We saved checkpoints locally in the container and loaded them to resume training if they existed. SageMaker takes care of synchronizing the checkpoints between Amazon S3 and the training container. We simulated a Spot interruption by running Managed Spot Training with 5 epochs, and then ran a second Managed Spot Training job with 10 epochs, configured to use the previous job’s checkpoint S3 location. As a result, the second training job loaded the checkpoints stored in Amazon S3 and resumed from the sixth epoch.
It’s easy to save on training costs with SageMaker Managed Spot Training. With minimal code changes, you too can save over 70% when training your deep-learning models.
As a next step, try to modify your own TensorFlow, PyTorch, or MXNet script to implement checkpointing, and then run a Managed Spot Training in SageMaker to see that the checkpoint files are created in the S3 bucket you specified. Let us know how you do in the comments!
About the Author
Eitan Sela is a Solutions Architect with Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Eitan also helps customers build and operate machine learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.
Improving the accuracy of privacy-preserving neural networks
ADePT model transforms the texts used to train natural-language-understanding models while preserving semantic coherence.
HawkEye 360 uses Amazon SageMaker Autopilot to streamline machine learning model development for maritime vessel risk assessment
This post is cowritten by Ian Avilez and Tim Pavlick from HawkEye 360.
HawkEye 360 is a commercial radio frequency (RF) satellite constellation data analytics provider. Our signals of interest include very high frequency (VHF) push-to-talk radios, maritime radar systems, AIS beacons, satellite mobile comms, and more. Our Mission Space offering, released in February 2021, allows mission analysts to intuitively visualize RF signals and analytics, which allows them to identify activity and understand trends. This capability improves maritime situational awareness for mission analysts, allowing them to identify and characterize nefarious behavior such as illegal fishing or ship-to-ship transfer of illicit goods.
The following screenshot shows HawkEye 360’s Mission Space experience.
RF data can be overwhelming to the naked eye without filtering and advanced algorithms to parse through and characterize the vast amount of raw data. HawkEye 360 partnered with the Amazon ML Solutions Lab to build machine learning (ML) capabilities into our analytics. With the guidance of the Amazon ML Solutions Lab, we used Amazon SageMaker Autopilot to rapidly generate high-quality AI models for maritime vessel risk assessment, maintain full visibility and control of model creation, and provide the ability to easily deploy and monitor a model in a production environment.
Hidden patterns and relationships among vessel features
Seagoing vessels are distinguished by several characteristics relating to the vessel itself, its operation and management, and its historical behavior. Knowing which characteristics are indicative of a suspicious vessel isn’t immediately clear. One of HawkEye 360’s missions is to discover hidden patterns and automatically alert analysts to anomalous maritime activity. HawkEye 360 accomplishes this alerting by using a diverse set of variables in combination with proprietary RF geo-analytics. A key focus of these efforts is to identify which vessels are more likely to engage in suspicious maritime activity, such as illegal fishing or ship-to-ship transfer of illicit goods. ML algorithms reveal hidden patterns, where they exist, that would otherwise be lost in the vast sea of complexity.
The following image demonstrates some of the pattern-finding behavior that has been built into Mission Space. Mission Space automatically identifies other instances of a suspicious vessel. Identifying the key features that are most predictive of suspicious behavior allows those features to be displayed easily in Mission Space. This enables users to understand links between bad actors that would otherwise never be seen. Mission Space was purposefully designed to point out these connections to mission analysts.
Challenges detecting anomalous behavior with maritime vessels
HawkEye 360’s data lake includes a large volume of vessel information, history, and analytics variables. With such a wide array of RF data and analytics, some natural data handling issues must be addressed. Sporadic reporting by vessels results in missing values across datasets, and variations among data types must be taken into account. Previously, data exploration and baseline modeling would typically take up a large chunk of an analyst’s time. After the data is prepared, a series of automatic experiments is run to narrow down to a set of the most promising AI models, and in a stepwise fashion from there, to select the one most appropriate for the data and the research questions. For HawkEye 360, this automated exploration is key to determining which features, and feature combinations, are critical to predicting how likely a vessel is to engage in suspicious behavior.
We used Autopilot to expedite this process by quickly identifying which features of the data are useful in predicting suspicious behavior. Automation of data exploration and analysis enables our data scientists to spend less time wrangling data and manually engineering features, and expedites the ability to identify the vessel features that are most predictive of suspicious vessel behavior.
How we used Autopilot to quickly generate high-quality ML models
As part of an Autopilot job, several candidate models are generated rapidly for evaluation with a single API call. Autopilot inspected the data and evaluated several models to determine the optimal combination of preprocessing methods, ML algorithms, and hyperparameters. This significantly shortened the model exploration time frame and allowed us to quickly test the suitability of ML to our unique hypotheses.
The following code shows our setup and API call:
input_data_config = [{
    'DataSource': {
        'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': 's3://{}/{}/train'.format(bucket, prefix)
        }
    },
    'TargetAttributeName': 'ship_sanctioned_ofac'
}]

output_data_config = {
    'S3OutputPath': 's3://{}/{}/output'.format(bucket, prefix)
}

from time import gmtime, strftime, sleep
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())

auto_ml_job_name = 'automl-darkcount-' + timestamp_suffix
print('AutoMLJobName: ' + auto_ml_job_name)

sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      RoleArn=role)
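While the job runs, you can poll its status with the describe_auto_ml_job API; the following is a minimal sketch (the 60-second polling interval is an arbitrary choice):

# Poll the Autopilot job until it finishes analyzing data, engineering features, and tuning models
describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
while describe_response['AutoMLJobStatus'] not in ('Completed', 'Failed', 'Stopped'):
    print(describe_response['AutoMLJobStatus'], '-', describe_response['AutoMLJobSecondaryStatus'])
    sleep(60)
    describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)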
Autopilot job process
An Autopilot job consists of the following actions:
- Dividing the data into train and validation sets
- Analyzing the data to recommend candidate configuration
- Performing feature engineering to generate optimal transformed features appropriate for the algorithm
- Tuning hyperparameters to generate a leaderboard of models
- Surfacing the best candidate model based on the given evaluation metric
After the candidate models are trained, Autopilot ranks them based on a given metric (see the following code). For this application, we used an F1 score, which gives an even weight to both precision and recall. This is an important consideration when classes are imbalanced, which they are in this dataset.
candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                                                SortBy='FinalObjectiveMetricValue')['Candidates']
index = 1
print("List of model candidates in descending objective metric:")
for candidate in candidates:
    print(str(index) + " " + candidate['CandidateName'] + " " + str(candidate['FinalAutoMLJobObjectiveMetric']['Value']))
    index += 1
The following code shows our output:
List of model candidates in descending objective metric:
1 tuning-job-1-1be4d5a5fb8e42bc84-238-e264d09f 0.9641900062561035
2 tuning-job-1-1be4d5a5fb8e42bc84-163-336eb2e7 0.9641900062561035
3 tuning-job-1-1be4d5a5fb8e42bc84-143-5007f7dc 0.9641900062561035
4 tuning-job-1-1be4d5a5fb8e42bc84-154-cab67dc4 0.9641900062561035
5 tuning-job-1-1be4d5a5fb8e42bc84-123-f76ad56c 0.9641900062561035
6 tuning-job-1-1be4d5a5fb8e42bc84-117-39eac182 0.9633200168609619
7 tuning-job-1-1be4d5a5fb8e42bc84-108-77addf80 0.9633200168609619
8 tuning-job-1-1be4d5a5fb8e42bc84-179-1f831078 0.9633200168609619
9 tuning-job-1-1be4d5a5fb8e42bc84-133-917ccdf1 0.9633200168609619
10 tuning-job-1-1be4d5a5fb8e42bc84-189-102070d9 0.9633200168609619
We can now create a model from the best candidate, which can be quickly deployed into production:
model_name = 'automl-darkcount-25-23-07-39'

model = sm.create_model(Containers=best_candidate['InferenceContainers'],
                        ModelName=model_name,
                        ExecutionRoleArn=role)

print('Model ARN corresponding to the best candidate is : {}'.format(model['ModelArn']))
The following code shows our output:
Model ARN corresponding to the best candidate is : arn:aws:sagemaker:us-east-1:278150328949:model/automl-darkcount-25-23-07-39
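From here, a sketch of hosting the model behind a real-time endpoint could look like the following; the endpoint names and instance type are illustrative assumptions rather than values from our deployment:

endpoint_config_name = model_name + '-config'

# Create an endpoint configuration that serves the Autopilot model on a single instance
sm.create_endpoint_config(EndpointConfigName=endpoint_config_name,
                          ProductionVariants=[{'VariantName': 'AllTraffic',
                                               'ModelName': model_name,
                                               'InstanceType': 'ml.m5.xlarge',
                                               'InitialInstanceCount': 1}])

# Create the real-time endpoint from the endpoint configuration
sm.create_endpoint(EndpointName=model_name + '-endpoint',
                   EndpointConfigName=endpoint_config_name)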
Maintaining full visibility and control
The process to generate a model is completely transparent. Two notebooks are generated for any model Autopilot creates:
- Data exploration notebook – Describes your dataset and what Autopilot learned about your dataset
- Model candidate notebook – Lists data transformations used as well as candidate model building pipelines consisting of feature transformers paired with main estimators
Conclusion
We used Autopilot to quickly generate many candidate models to determine ML feasibility and to baseline ML performance on the vessel data. The automation Autopilot provides allowed our data scientists to spend 50% less time developing ML capabilities by handling manual tasks such as data analysis, feature engineering, model development, and model deployment.
With HawkEye 360’s new RF data analysis application, Mission Space, identifying which vessels have the potential to engage in suspicious activity allows users to easily know where to focus their scarce attention and investigate further. Expediting the data understanding and model creation allows for cutting-edge insights to be quickly assimilated into Mission Space, which accelerates the evolution of Mission Space’s capabilities as shown in the following map. We can see that a Mission Analyst identified a specific rendezvous (highlighted in magenta) and Mission Space automatically identified other related rendezvous (in purple).
For more information about HawkEye 360’s Mission Space offering, see Mission Space.
If you’d like assistance in accelerating the use of ML in your products and services, contact the Amazon ML Solutions Lab.
About the Authors
Tim Pavlick, PhD, is VP of Product at HawkEye 360. He is responsible for the conception, creation, and productization of all HawkEye space innovations. Mission Space is HawkEye 360’s flagship product, incorporating all the data and analytics from the HawkEye portfolio into one intuitive RF experience. Dr. Pavlick’s prior invention contributions include Myca, IBM’s AI Career Coach, Grit PTSD monitor for Veterans, IBM Defense Operations Platform, Smarter Planet Intelligent Operations Center, AI detection of dangerous hate speech on the internet, and the STORES electronic food ordering system for the US military. Dr. Pavlick received his PhD in Cognitive Psychology from the University of Maryland College Park.
Ian Avilez is a Data Scientist with HawkEye 360. He works with customers to highlight the insights that can be gained by combining different datasets and looking at that data in various ways.
Dan Ford is a Data Scientist at the Amazon ML Solution Lab, where he helps AWS National Security customers build state-of-the-art ML solutions.
Gaurav Rele is a Data Scientist at the Amazon ML Solution Lab, where he works with AWS customers across different verticals to accelerate their use of machine learning and AWS Cloud services to solve their business challenges.
Protecting people from hazardous areas through virtual boundaries with Computer Vision
As companies welcome more autonomous robots and other heavy equipment into the workplace, we need to ensure equipment can operate safely around human teammates. In this post, we will show you how to build a virtual boundary with computer vision and AWS DeepLens, the AWS deep learning-enabled video camera designed for developers to learn machine learning (ML). Using the machine learning techniques in this post, you can build virtual boundaries for restricted areas that automatically shut down equipment or sound an alert when humans come close.
For this project, you will train a custom object detection model with Amazon SageMaker and deploy the model to an AWS DeepLens device. Object detection is an ML algorithm that takes an image as input and identifies objects and their location within the image. In addition to virtual boundary solutions, you can apply techniques learned in this post when you need to detect where certain objects are inside an image or count the number of instances of a desired object in an image, such as counting items in a storage bin or on a retail shelf.
Solution overview
The walkthrough includes the following steps:
- Prepare your dataset to feed into an ML algorithm.
- Train a model with Amazon SageMaker.
- Test model with custom restriction zones.
- Deploy the solution to AWS DeepLens.
We also discuss other real-world use cases where you can apply this solution.
The following diagram illustrates the solution architecture.
Prerequisites
To complete this walkthrough, you must have the following prerequisites:
- An AWS account
- An AWS DeepLens device. They are available on the following Amazon websites: Amazon.com (US), Amazon.ca (Canada), Amazon.co.jp (Japan), Amazon.de (Germany), Amazon.co.uk (UK), Amazon.fr (France), Amazon.es (Spain), Amazon.it (Italy)
Prepare your dataset to feed into an ML algorithm
This post uses an ML algorithm called an object detection model to build a solution that detects if a person is in a custom restricted zone. You use the Pedestrian Detection dataset publicly available on Kaggle, which has over 2,000 images. This dataset has labels for human and human-like objects (like mannequins) so the trained model can more accurately distinguish between real humans and cardboard props or statues.
For example, the following images show a construction worker being detected and whether they are inside the custom restriction zone (red outline).
To start training your model, first create an S3 bucket to store your training data and model output. For AWS DeepLens projects, the S3 bucket names must start with the prefix deeplens-. You use this data to train a model with SageMaker, a fully managed service that provides the ability to build, train, and deploy ML models quickly.
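If you prefer to create the bucket programmatically, the following is a minimal boto3 sketch; the bucket name is a hypothetical example (only the deeplens- prefix is required):

import boto3

bucket_name = 'deeplens-pedestrian-detection-demo'  # hypothetical name; must start with deeplens-
s3 = boto3.client('s3')
# In Regions other than us-east-1, also pass CreateBucketConfiguration={'LocationConstraint': region}
s3.create_bucket(Bucket=bucket_name)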
Train a model with Amazon SageMaker
You use SageMaker Jupyter notebooks as the development environment to train the model. Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. For this post, we provide Train_Object_Detection_People_DeepLens.ipynb, a full notebook for you to follow along.
To create a custom object detection model, you need to use a graphic processing unit (GPU)-enabled training job instance. GPUs are excellent at parallelizing computations required to train a neural network. Although the notebook itself is a single ml.t2.medium instance, the training job specifically uses an ml.p2.xlarge instance. To access a GPU-enabled training job instance, you must submit a request for a service limit increase to the AWS Support Center.
After you receive your limit increase, complete the following steps to create a SageMaker notebook instance:
- On the SageMaker console, choose Notebook instances.
- Choose Create notebook instance.
- For Notebook instance name, enter a name for your notebook instance.
- For Instance type, choose t2.medium.
This is the least expensive instance type that notebook instances support, and it suffices for this tutorial.
- For IAM role, choose Create a new role.
Make sure this AWS Identity and Access Management (IAM) role has access to the S3 bucket you created earlier (prefix deeplens-).
- Choose Create notebook instance. Your notebook instance can take a couple of minutes to start up.
- When the status on the notebook instances page changes to InService, choose Open Jupyter to launch your newly created Jupyter notebook instance.
- Choose Upload to upload the Train_Object_Detection_People_DeepLens.ipynb file you downloaded earlier.
- Open the notebook and follow it through to the end.
- If you’re asked about setting the kernel, select conda_mxnet_p36.
The Jupyter notebook contains a mix of text and code cells. To run a piece of code, choose the cell and press Shift+Enter. While the cell is running, an asterisk appears next to the cell. When the cell is complete, an output number and new output cell appear below the original cell.
- Download the dataset from the public S3 bucket into the local SageMaker instance and unzip the data. This can be done by following the code in the notebook:
!aws s3 cp s3://deeplens-public/samples/pedestriansafety/humandetection_data.zip .
!rm -rf humandetection/
!unzip humandetection_data.zip -d humandetection
- Convert the dataset into a format (RecordIO) that can be fed into the SageMaker algorithm:
!python $mxnet_path/tools/im2rec.py --pass-through --pack-label $DATA_PATH/train_mask.lst $DATA_PATH/
!python $mxnet_path/tools/im2rec.py --pass-through --pack-label $DATA_PATH/val_mask.lst $DATA_PATH/
- Transfer the RecordIO files back to Amazon S3 (a sketch of this upload follows this step).
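A minimal sketch of that upload with the SageMaker Python SDK is shown below, assuming BUCKET and PREFIX are defined as in the notebook; the local .rec file names and the train/validation prefixes are assumptions, not necessarily the notebook's exact values:
import sagemaker

sess = sagemaker.Session()

# Upload the generated RecordIO files to the training and validation prefixes
train_data = sess.upload_data(path='humandetection/train_mask.rec',
                              bucket=BUCKET, key_prefix=PREFIX + '/train')
val_data = sess.upload_data(path='humandetection/val_mask.rec',
                            bucket=BUCKET, key_prefix=PREFIX + '/validation')
print(train_data, val_data)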
Now that you’re done with all the data preparation, you’re ready to train the object detector.
There are many different types of object detection algorithms. For this post, you use the Single-Shot MultiBox Detection algorithm (SSD). The SSD algorithm has a good balance of speed vs. accuracy, making it ideal for running on edge devices such as AWS DeepLens.
As part of the training job, you can configure many hyperparameters that control training behavior, such as the number of epochs, learning rate, optimizer type, and mini-batch size. Hyperparameters let you tune the training speed and accuracy of your model. For more information about hyperparameters, see Object Detection Algorithm.
- Set up your hyperparameters and data channels. Consider using the following example definition of hyperparameters:
od_model = sagemaker.estimator.Estimator(training_image,
                                         role,
                                         train_instance_count=1,
                                         train_instance_type='ml.p2.xlarge',
                                         train_volume_size=50,
                                         train_max_run=360000,
                                         input_mode='File',
                                         output_path=s3_output_location,
                                         sagemaker_session=sess)

od_model.set_hyperparameters(base_network='resnet-50',
                             use_pretrained_model=1,
                             num_classes=2,
                             mini_batch_size=32,
                             epochs=100,
                             learning_rate=0.003,
                             lr_scheduler_step='3,6',
                             lr_scheduler_factor=0.1,
                             optimizer='sgd',
                             momentum=0.9,
                             weight_decay=0.0005,
                             overlap_threshold=0.5,
                             nms_threshold=0.45,
                             image_shape=300,
                             num_training_samples=n_train_samples)
The notebook has some default hyperparameters that have been pre-selected. For pedestrian detection, you train the model for 100 epochs. This training step should take approximately 2 hours using one ml.p2.xlarge instance. You can experiment with different combinations of the hyperparameters, or train for more epochs for performance improvements. For information about the latest pricing, see Amazon SageMaker Pricing.
- You can start a training job with a single line of code and monitor the accuracy over time on the SageMaker console:
od_model.fit(inputs=data_channels, logs=True)
For more information about how training works, see CreateTrainingJob. The provisioning and data downloading take time, depending on the size of the data. Therefore, it might be a few minutes before you start getting data logs for your training jobs.
You can monitor the progress of your training job through the mean average precision (mAP) metric, which measures how well the model classifies objects and detects the correct bounding boxes. The data logs also print out the mAP on the validation data, among other losses, once per epoch. This metric is a proxy for how accurately the algorithm detects the correct class and draws a tight bounding box around it.
When the job is finished, you can find the trained model files in the S3 bucket and folder specified earlier in s3_output_location:
s3_output_location = 's3://{}/{}/output'.format(BUCKET, PREFIX)
For this post, we show results on the validation set at the completion of the 10th epoch and the 100th epoch. At the end of the 10th epoch, we see a validation mAP of approximately 0.027, whereas at the end of the 100th epoch it was approximately 0.42.
To achieve better detection results, you can tune the hyperparameters by using SageMaker automatic model tuning and train the model for more epochs. You usually stop training when you see diminishing gains in accuracy.
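If you want to try automatic model tuning, the following is a hedged sketch using the SageMaker Python SDK's HyperparameterTuner; the search ranges and job counts are illustrative choices, not values from the original notebook:
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# Search over learning rate and mini-batch size while maximizing validation mAP
hyperparameter_ranges = {
    'learning_rate': ContinuousParameter(0.001, 0.01),
    'mini_batch_size': IntegerParameter(16, 64),
}

tuner = HyperparameterTuner(estimator=od_model,
                            objective_metric_name='validation:mAP',
                            objective_type='Maximize',
                            hyperparameter_ranges=hyperparameter_ranges,
                            max_jobs=6,
                            max_parallel_jobs=2)

# Launches several training jobs and keeps the best-performing model
tuner.fit(inputs=data_channels)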
Test model with custom restriction zones
Before you deploy the trained model to AWS DeepLens, you can test it in the cloud by using a SageMaker hosted endpoint. A SageMaker endpoint is a fully managed service that allows you to make real-time inferences via a REST API. SageMaker allows you to quickly deploy new endpoints to test your models so you don’t have to host the model on the local instance that was used to train the model. This allows you to make predictions (or inference) from the model on images that the algorithm didn’t see during training.
You don’t have to host on the same instance type that you used to train. Training is a prolonged, compute-heavy job with different compute and memory requirements than hosting typically has. You can choose any instance type to host the model. In this case, we trained on an ml.p2.xlarge instance, but we host the model on the less expensive CPU instance ml.m4.xlarge. The following code snippet shows our endpoint deployment.
object_detector = od_model.deploy(initial_instance_count = 1,
instance_type = 'ml.m4.xlarge')
Detection in a custom restriction zone (region of interest)
The format of the output can be represented as [class_index, confidence_score, xmin, ymin, xmax, ymax]. Low-confidence predictions are more likely to be false positives or false negatives, so you should discard them. You can use the following code to detect whether the bounding box of a person overlaps with the restricted zone.
import numpy as np
import cv2

def inRestrictedSection(ImShape=None, R1=None, restricted_region=None,
                        kclass=None, score=None, threshold=None):
    statement = 'Person Not Detected in Restricted Zone'
    if (kclass == 1) and (score > threshold):
        # Rasterize the detected person's bounding box
        Im1 = np.zeros((ImShape[0], ImShape[1], 3), np.int32)
        cv2.fillPoly(Im1, [R1], 255)
        # Rasterize the restricted zone (default: the whole frame)
        Im2 = np.zeros((ImShape[0], ImShape[1], 3), np.int32)
        if restricted_region is None:
            restricted_region = np.array([[0, ImShape[0]], [ImShape[1], ImShape[0]],
                                          [ImShape[1], 0], [0, 0]], np.int32)
        cv2.fillPoly(Im2, [restricted_region], 255)
        # Any nonzero pixel in the product means the two regions overlap
        Im = Im1 * Im2
        if np.sum(np.greater(Im, 0)) > 0:
            statement = 'Person Detected in Restricted Zone'
    return statement
By default, the complete frame is evaluated for human presence. However, you can easily specify the region of interest within which the presence of a person is deemed high risk. If you want to add a custom restriction zone, add the coordinates of the vertices of the region, represented as [X-axis, Y-axis], and create the polygon. The coordinates must be entered in either clockwise or counterclockwise order. See the following code:
restricted_region = None
#restricted_region = np.array([[0,200],[100,200],[100,0], [10,10]], np.int32)
The following sample code shows pedestrians that are identified within a restricted zone:
import json
import random

import cv2
import numpy as np
import matplotlib.pyplot as plt

file_name = 'humandetection/test_images/t1_image.jpg'
img = cv2.imread(file_name)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

thresh = 0.2
height = img.shape[0]
width = img.shape[1]
colors = dict()

# Read the raw image bytes and send them to the SageMaker endpoint
with open(file_name, 'rb') as image:
    b = bytearray(image.read())
results = object_detector.predict(b, initial_args={'ContentType': 'image/jpeg'})
detections = json.loads(results)

object_categories = ['no-person', 'person']
for det in detections['prediction']:
    (klass, score, x0, y0, x1, y1) = det
    if score < thresh:
        continue
    cls_id = int(klass)
    prob = score
    if cls_id not in colors:
        colors[cls_id] = (random.random(), random.random(), random.random())
    # Convert normalized coordinates to pixel coordinates
    xmin = int(x0 * width)
    ymin = int(y0 * height)
    xmax = int(x1 * width)
    ymax = int(y1 * height)
    # Draw the detected bounding box (yellow) and the restricted zone (red)
    R1 = np.array([[xmin, ymin], [xmax, ymin], [xmax, ymax], [xmin, ymax]], np.int32)
    cv2.polylines(img, [R1], True, (255, 255, 0), thickness=5)
    if restricted_region is not None:
        cv2.polylines(img, [restricted_region], True, (255, 0, 0), thickness=5)
    plt.imshow(img)
    print(inRestrictedSection(img.shape, R1=R1, restricted_region=restricted_region,
                              kclass=cls_id, score=prob, threshold=0.2))
The following images show our results.
Deploy the solution to AWS DeepLens
Convert the model for deployment to AWS DeepLens
When deploying a SageMaker-trained SSD model to AWS DeepLens, you must first run deploy.py to convert the model artifact into a deployable model:
!rm -rf incubator-mxnet
!git clone -b v1.7.x https://github.com/apache/incubator-mxnet
MODEL_PATH = od_model.model_data
TARGET_PATH ='s3://'+BUCKET+'/'+PREFIX+'/patched/'
!rm -rf tmp && mkdir tmp
!aws s3 cp $MODEL_PATH tmp
!tar -xzvf tmp/model.tar.gz -C tmp
!mv tmp/model_algo_1-0000.params tmp/ssd_resnet50_300-0000.params
!mv tmp/model_algo_1-symbol.json tmp/ssd_resnet50_300-symbol.json
!python incubator-mxnet/example/ssd/deploy.py --network resnet50 --data-shape 300 --num-class 2 --prefix tmp/ssd_
!tar -cvzf ./patched_model.tar.gz -C tmp ./deploy_ssd_resnet50_300-0000.params ./deploy_ssd_resnet50_300-symbol.json ./hyperparams.json
!aws s3 cp patched_model.tar.gz $TARGET_PATH
Import your model into AWS DeepLens
To run the model on an AWS DeepLens device, you need to create an AWS DeepLens project. Start by importing your model into AWS DeepLens.
- On the AWS DeepLens console, under Resources, choose Models.
- Choose Import model.
- For Import source, select Externally trained model.
- Enter the Amazon S3 location of the patched model that you saved from running deploy.py in the step above.
- For Model framework, choose MXNet.
- Choose Import model.
Create the inference function
The inference function feeds each camera frame into the model to get predictions and runs any custom business logic using the inference results. You use AWS Lambda to create a function that you deploy to AWS DeepLens. The function runs inference locally on the AWS DeepLens device.
First, we need to create a Lambda function to deploy to AWS DeepLens.
- Download the inference Lambda function.
- On the Lambda console, choose Functions.
- Choose Create function.
- Select Author from scratch.
- For Function name, enter a name.
- For Runtime, choose Python 3.7.
- For Choose or create an execution role, choose Use an existing role.
- Choose service-role/AWSDeepLensLambdaRole.
- Choose Create function.
- On the function’s detail page, on the Actions menu, choose Upload a .zip file.
- Upload the inference Lambda file you downloaded earlier.
- Choose Save to save the code you entered.
- On the Actions menu, choose Publish new version.
Publishing the function makes it available on the AWS DeepLens console so that you can add it to your custom project.
- Enter a version number and choose Publish.
Understanding the inference function
This section walks you through some important parts of the inference function. First, you should pay attention to two specific files:
- labels.txt – Contains a mapping of the output from the neural network (integers) to human-readable labels (strings)
- lambda_function.py – Contains code for the function being called to generate predictions on every camera frame and send back results
In lambda_function.py, you first load and optimize the model. Compared to cloud virtual machines with a GPU, AWS DeepLens has less computing power. AWS DeepLens uses the Intel OpenVINO model optimizer to optimize the model trained in SageMaker to run on its hardware. The following code optimizes your model to run locally:
client.publish(topic=iot_topic, payload='Optimizing model...')
ret, model_path = mo.optimize('deploy_ssd_resnet50_300', INPUT_W, INPUT_H)
# Load the model onto the GPU.
client.publish(topic=iot_topic, payload='Loading model...')
model = awscam.Model(model_path, {'GPU': 1})
Then you run the model frame by frame over the images from the camera. See the following code:
while True:
    # Get a frame from the video stream
    ret, frame = awscam.getLastFrame()
    if not ret:
        raise Exception('Failed to get frame from the stream')

    # Resize the frame to the same size as the training set
    frame_resize = cv2.resize(frame, (INPUT_H, INPUT_W))

    # Run the image through the inference engine and parse the results using
    # the parser API. Note that it's possible to take the output of doInference
    # and parse it manually, but because this is an SSD model, a simple
    # parsing API is provided.
    parsed_inference_results = model.parseResult(model_type,
                                                 model.doInference(frame_resize))
Finally, you send the text prediction results back to the cloud. Viewing the text results in the cloud is a convenient way to make sure that the model is working correctly. Each AWS DeepLens device has a dedicated iot_topic automatically created to receive the inference results. See the following code:
# Send results to the cloud
client.publish(topic=iot_topic, payload=json.dumps(cloud_output))
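For context, one way the parsed detections can be condensed into the cloud_output payload before publishing is sketched below; the confidence threshold and the output_map built from labels.txt are assumptions, not code taken from the provided function:
# Assumes output_map maps class indexes to the labels in labels.txt,
# for example {0: 'no-person', 1: 'person'}
DETECTION_THRESHOLD = 0.60

cloud_output = {}
for obj in parsed_inference_results[model_type]:
    if obj['prob'] < DETECTION_THRESHOLD:
        continue
    label = output_map[obj['label']]
    # Keep the highest-confidence detection per label
    cloud_output[label] = max(obj['prob'], cloud_output.get(label, 0.0))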
Create a custom AWS DeepLens project
To create a new AWS DeepLens project, complete the following steps:
- On the AWS DeepLens console, on the Projects page, choose Create project.
- For Project type, select Create a new blank project.
- Choose Next.
- Name your project yourname-pedestrian-detector-.
- Choose Add model.
- Select the model you just created.
- Choose Add function.
- Search for the Lambda function you created earlier by name.
- Choose Create project.
- On the Projects page, select the project you want to deploy.
- Choose Deploy to device.
- For Target device, choose your device.
- Choose Review.
- Review your settings and choose Deploy.
The deployment can take up to 10 minutes to complete, depending on the speed of the network your AWS DeepLens is connected to. When the deployment is complete, you should see a green banner on the page with the message, “Congratulations, your model is now running locally on AWS DeepLens!”
To see the text output, scroll down on the device details page to the Project output section. Follow the instructions in the section to copy the topic and go to the AWS IoT Core console to subscribe to the topic. You should see results as in the following screenshot.
For step-by-step instructions on viewing the video stream or text output, see Viewing results from AWS DeepLens.
Real-world use cases
Now that you have predictions from your model running on AWS DeepLens, let’s convert those predictions into alerts and insights. Some of the most common uses for a project like this include:
- Understanding how many people on a given day entered a restricted zone so construction sites can identify spots that require more safety signs. This can be done by collecting the results and using them to create a dashboard using Amazon QuickSight. For more details about creating a dashboard using QuickSight, see Build a work-from-home posture tracker with AWS DeepLens and GluonCV.
- Collecting the output from AWS DeepLens and configuring a Raspberry Pi to sound an alert when someone is walking into a restricted zone. For more details about connecting an AWS DeepLens device to a Raspberry Pi device, see Building a trash sorter with AWS DeepLens.
Conclusion
In this post, you learned how to train an object detection model and deploy it to AWS DeepLens to detect people entering restricted zones. You can use this tutorial as a reference to train and deploy your own custom object detection projects on AWS DeepLens.
For a more detailed walkthrough of this tutorial and other tutorials, samples, and project ideas with AWS DeepLens, see AWS DeepLens Recipes.
About the Authors
Yash Shah is a data scientist in the Amazon ML Solutions Lab, where he works on a range of machine learning use cases from healthcare to manufacturing and retail. He has a formal background in Human Factors and Statistics, and was previously part of the Amazon SCOT team designing products to guide 3P sellers with efficient inventory management.
Phu Nguyen is a Product Manager for AWS Panorama. He builds products that give developers of any skill level an easy, hands-on introduction to machine learning.
Enable cross-account access for Amazon SageMaker Data Wrangler using AWS Lake Formation
Amazon SageMaker Data Wrangler is the fastest and easiest way for data scientists to prepare data for machine learning (ML) applications. With Data Wrangler, you can simplify the process of feature engineering and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization through a single visual interface. Data Wrangler comes with 300 built-in data transformation recipes that you can use to quickly normalize, transform, and combine features. With the data selection tool in Data Wrangler, you can quickly select data from different data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, and Amazon Redshift.
AWS Lake Formation cross-account capabilities simplify securing and managing distributed data lakes across multiple accounts through a centralized approach, providing fine-grained access control to Athena tables.
In this post, we demonstrate how to enable cross-account access for Data Wrangler using Athena as a source and Lake Formation as a central data governance capability. As shown in the following architecture diagram, Account A is the data lake account that holds all the ML-ready data derived from ETL pipelines. Account B is the data science account where a team of data scientists uses Data Wrangler to compile and run data transformations. We need to enable cross-account permissions for Data Wrangler in Account B to access the data tables located in Account A’s data lake via Lake Formation permissions.
With this architecture, data scientists and engineers outside the data lake account can access data from the lake and create data transformations via Data Wrangler.
Before you dive into the setup process, ensure that the data to be shared across accounts is crawled and cataloged as detailed in this post. For this walkthrough, we assume this process is complete and the databases and tables already exist in Lake Formation.
The following are the high-level steps to implement this solution:
- In Account A, register your S3 bucket using Lake Formation and create the necessary databases and tables for the data if they don't already exist.
- The Lake Formation administrator can now share datasets from Account A to other accounts. Lake Formation shares these resources using AWS Resource Access Manager (AWS RAM).
- In Account B, accept the resource share request using AWS RAM. Create a local resource link for the shared table via Lake Formation and create a local database.
- Next, you need to grant permissions for the SageMaker Studio execution role in Account B to access the shared table and the resource link you created in the previous step.
- In Data Wrangler, use the local database and the resource link you created in Account B to query the dataset using the Athena connector and perform feature transformations.
Data lake setup using Lake Formation
To get started, create a central data lake in Account A. You can control the access to the data lake with policies and permissions, and define permissions at the database, table, or column level.
To kickstart the setup process, download the titanic dataset .csv file and upload it to your S3 bucket. After you upload the file, you need to register the bucket in Lake Formation. Lake Formation permissions enable fine-grained access control for data in your data lake.
Note: If the titanic dataset has already been cataloged, you can skip the registration step below.
Register your S3 data store in Lake Formation
To register your data store, complete the following steps:
- In Account A, sign in to the Lake Formation console.
If this is the first time you’re accessing Lake Formation, you need to add administrators to the account.
- In the navigation pane, under Permissions, choose Admins and database creators.
- Under Data lake administrators, choose Grant.
You now add AWS Identity and Access Management (IAM) users or roles specific to Account A as data lake administrators.
- Under Manage data lake administrators, for IAM users and roles, choose your user or role (for this post, we use user-a).
This can also be the IAM admin role of Account A.
- Choose Save.
- Make sure the IAMAllowedPrincipals group is not listed under either Data lake administrators or Database creators.
For more information about security settings, see Changing the Default Security Settings for Your Data Lake.
Next, you need to register the S3 bucket as the data lake location.
- On the Lake Formation console, under Register and ingest, choose Data lake locations.
This page should display a list of S3 buckets that are marked as data lake storage resources for Lake Formation. A single S3 bucket may act as the repository for many datasets, or you could use separate buckets for separate data sources.
- Choose Register location.
- For Amazon S3 path, enter the path for your bucket.
- For IAM role, choose AWSServiceRoleForLakeFormationDataAccess.
- Choose Register location.
After this step, you should be able to see your S3 bucket under Data lake locations.
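If you prefer to script the registration instead of using the console, a minimal boto3 sketch (with a placeholder bucket name) could look like this:
import boto3

lakeformation = boto3.client('lakeformation')

# Register the S3 bucket as a data lake location, using the
# service-linked role referenced in the console steps above
lakeformation.register_resource(
    ResourceArn='arn:aws:s3:::your-data-lake-bucket',
    UseServiceLinkedRole=True)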
Create a database
This step is optional. Skip this step if the titanic dataset has already been crawled and cataloged. The database and table for the dataset should pre-exist within the data lake.
Complete the following steps to register the database if it does not exist:
- On the Lake Formation console, under Data catalog, choose Databases.
- Choose Create database.
- For Database details, select Database.
- For Name, enter a name (for example, titanic).
- For Location, enter the S3 data lake bucket path.
- Deselect Use only IAM access controls for tables in this database.
- Choose Create database.
- Under Actions, choose Permissions.
- Choose View permissions.
- Make sure that the IAMAllowedPrincipals group isn't listed.
If it’s listed, make sure you revoke access to this group.
You should now be able to view the created database listed under Databases.
You should also be able to see the table on the Lake Formation console, under Data catalog in the navigation pane, under Tables. For this demo, we assume the table name is titanic_datalake_bucket_as, as shown below.
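To confirm programmatically that the database and table exist in the Data Catalog, a quick boto3 check (using the names assumed in this walkthrough) might look like this:
import boto3

glue = boto3.client('glue')

# List the tables in the titanic database and confirm the expected table exists
tables = glue.get_tables(DatabaseName='titanic')['TableList']
print([table['Name'] for table in tables])  # expect 'titanic_datalake_bucket_as'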
Grant table permissions to Account A
To grant table permissions to Account A, complete the following steps:
- Sign in to the Lake Formation console with Account A.
- Under Data catalog, choose Tables.
- Select the newly created table.
- On the Actions menu, under Permissions, choose Grant.
- Select My account.
- For IAM users and roles, choose the users or roles you want to grant access to (for this post, we choose user-x, a different user within Account A).
You can also set a column filter.
- For Columns, choose Include columns.
- For Include columns, choose the first five columns from the titanic_datalake_bucket_as table.
- For Table permissions, select Select.
- Choose Grant.
- Still in Account A, switch to the Athena console.
- Run a table preview.
You should be able to see the first five columns of the titanic_datalake_bucket_as table, as per the permissions granted in the previous steps.
This Athena step validates local access to the data lake table within Account A. Next, let’s grant an external account, in our case Account B, access to the same table.
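You can run the same preview outside the console with the Athena API; the following boto3 sketch assumes a placeholder S3 location for query results:
import boto3

athena = boto3.client('athena')

# Preview the columns that Lake Formation granted on the table
response = athena.start_query_execution(
    QueryString='SELECT * FROM "titanic"."titanic_datalake_bucket_as" LIMIT 10',
    ResultConfiguration={'OutputLocation': 's3://your-athena-query-results-bucket/'})
print(response['QueryExecutionId'])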
Grant table permissions to Account B
This external account is the account running Data Wrangler. To grant table permissions, complete the following steps:
- Staying within Account A, on the Actions menu, under Permissions, choose Grant.
- Select External account.
- For AWS account ID, enter the account ID of Account B.
- Choose the same first five columns of the table.
- For Table permissions and Grantable permissions, select Select.
- Choose Grant.
You must revoke the Super permission from the IAMAllowedPrincipals group for this table before granting it external access. You can do this on the Actions menu under View permissions, then choose IAMAllowedPrincipals and choose Revoke.
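If you'd rather script this cross-account grant, a hedged boto3 sketch follows; the Account B ID and the column names are placeholders:
import boto3

lakeformation = boto3.client('lakeformation')

# Grant Account B SELECT on five columns of the table, with the ability
# to grant the permission onward (PermissionsWithGrantOption)
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': '111122223333'},  # Account B ID
    Resource={
        'TableWithColumns': {
            'DatabaseName': 'titanic',
            'Name': 'titanic_datalake_bucket_as',
            'ColumnNames': ['col_1', 'col_2', 'col_3', 'col_4', 'col_5']
        }
    },
    Permissions=['SELECT'],
    PermissionsWithGrantOption=['SELECT'])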
- On the AWS RAM console, still in Account A, under Shared by me, choose Shared resources.
We can find a Lake Formation entry on this page.
- Switch to Account B.
- On the AWS RAM console, under Shared with me, you see an invitation from Lake Formation in Account A.
- Accept the invitation by choosing Accept resource share.
After you accept it, on the Resource shares page, you should see the shared Lake Formation entry, which encapsulates the catalog, database, and table information.
On the Lake Formation console in Account B, you can find the shared table owned by Account A on the Tables page. If you don’t see it, you can refresh your screen and the resource should appear shortly.
To use this shared table inside Account B, you need to create a database local to Account B in Lake Formation.
- On the Lake Formation console, under Databases, choose Create database.
- Name the database local_db.
Next, for the shared titanic table in Lake Formation, you need to create a resource link. Resource links are Data Catalog objects that link to metadata databases and tables, typically to shared databases and tables from other AWS accounts. They help enable cross-account access to data in the data lake.
- On the table details page, on the Actions menu, choose Create resource link.
- For Resource link name, enter a name (for example, titanic_local).
- For Database, choose the local database you created previously.
- The values for Shared table and Shared table’s database should match the ones in Account A and be auto-populated.
- For Shared table’s owner ID, choose the account ID of Account A.
- Choose Create.
- In the navigation pane, under Data catalog, choose Settings.
- Make sure Use only IAM access control is disabled for new databases and tables.
This is to make sure that Lake Formation manages the database and table permissions.
- Switch to the SageMaker console.
- In the Studio Control Panel, under Studio Summary, copy the ARN of the execution role.
- You need to grant this role permissions to access the local database, the shared table, and the resource link you created earlier in Account B’s Lake Formation.
- You also need to attach the following custom policy to this role. This policy allows Studio to access data via Lake Formation and allows Account B to get data partitions for querying the titanic dataset from the created tables:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess",
                "glue:GetPartitions"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
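One way to attach this as an inline policy to the Studio execution role is sketched below with boto3; the role name is a placeholder:
import json
import boto3

iam = boto3.client('iam')

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["lakeformation:GetDataAccess", "glue:GetPartitions"],
            "Resource": ["*"]
        }
    ]
}

# Attach the policy inline to the SageMaker Studio execution role in Account B
iam.put_role_policy(RoleName='your-studio-execution-role-name',
                    PolicyName='DataWranglerLakeFormationAccess',
                    PolicyDocument=json.dumps(policy))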
- Switch back to the Lake Formation console.
- Here, we need to grant permissions for the SageMaker execution role to access the shared titanic_datalake_bucket_as table.
This is the table that you shared to Account B from Account A via AWS RAM.
- In Account B, on the table details page, on the Actions menu, under Permissions, choose Grant.
- Grant the role access to the table and five columns.
- Finally, grant the SageMaker execution role permissions to access the local titanic table in Account B.
Cross-account data access in Studio
In this final stage, you validate the setup by testing it in the Data Wrangler interface.
- On the Import tab, for Import data, choose Amazon Athena as your data source.
- For Data catalog, choose AwsDataCatalog.
- For Database, choose the local database you created in Account B (local_db).
You should be able to see the local table (titanic_local) in the right pane.
- Run an Athena query as shown in the following screenshot to see the selected columns of the titanic dataset that you granted to the SageMaker execution role in Lake Formation (Account B).
- Choose Import dataset.
- For Dataset Name, enter a name (for example, titanic-dataset).
- Choose Add.
This imports the titanic dataset, and you should be able to see the data flow page with the visual blocks on the Prepare tab.
Conclusion
In this post, we demonstrated how to enable cross-account access for Data Wrangler using Lake Formation and AWS RAM. Following this methodology, organizations can allow multiple data science and engineering teams to access data from a central data lake and build feature pipelines and transformation recipes consistently. For more information about Data Wrangler, see Introducing Amazon SageMaker Data Wrangler, a Visual Interface to Prepare Data for Machine Learning and Exploratory data analysis, feature engineering, and operationalizing your data flow into your ML pipeline with Amazon SageMaker Data Wrangler.
Give Data Wrangler a try and share your feedback and questions in the comments section.
About the Authors
Rizwan Gilani is a Software Development Engineer at Amazon SageMaker. His passion lies with making machine learning more interactive and accessible at scale. Before that, he worked on Amazon Alexa as part of the core team that launched Alexa Communications.
Phi Nguyen is a solutions architect at AWS, helping customers with their cloud journey with a special focus on data lakes, analytics, semantic technologies, and machine learning. In his spare time, you can find him biking to work, coaching his son’s soccer team, or enjoying nature walks with his family.
Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.
How three science PhDs found different career paths at Amazon
Their doctoral degrees help these product managers bridge the gap between business and science.