Google Research, 2022 & beyond: Algorithmic advances

(This is Part 5 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.)

Robust algorithm design is the backbone of systems across Google, particularly for our ML and AI models. Hence, developing algorithms with improved efficiency, performance and speed remains a high priority as it empowers services ranging from Search and Ads to Maps and YouTube. Google Research has been at the forefront of this effort, developing many innovations from privacy-safe recommendation systems to scalable solutions for large-scale ML. In 2022, we continued this journey, and advanced the state-of-the-art in several related areas. Here we highlight our progress in a subset of these, including scalability, privacy, market algorithms, and algorithmic foundations.

Scalable algorithms: Graphs, clustering, and optimization

As the need to handle large-scale datasets increases, scalability and reliability of complex algorithms that also exhibit improved explainability, robustness, and speed remain a high priority. We continued our efforts in developing new algorithms for handling large datasets in various areas, including unsupervised and semi-supervised learning, graph-based learning, clustering, and large-scale optimization.

An important component of such systems is to build a similarity graph — a nearest-neighbor graph that represents similarities between objects. For scalability and speed, this graph should be sparse without compromising quality. We proposed a 2-hop spanner technique, called STAR, as an efficient and distributed graph-building strategy, and showed how it significantly decreases the number of similarity computations in theory and practice, building much sparser graphs while producing high-quality graph learning or clustering outputs. As an example, for graphs with 10T edges, we demonstrated ~100-fold improvements in pairwise similarity comparisons and significant running time speedups with negligible quality loss. We had previously applied this idea to develop massively parallel algorithms for metric and minimum-size clustering. More broadly in the context of clustering, we developed the first linear-time hierarchical agglomerative clustering (HAC) algorithm, as well as DBSCAN and the first parallel HAC algorithm with logarithmic depth, which achieves a 50x speedup on 100B-edge graphs. We also designed improved sublinear algorithms for different flavors of clustering problems such as geometric linkage clustering, constant-round correlation clustering, and fully dynamic k-clustering.

Inspired by the success of multi-core processing (e.g., GBBS), we embarked on a mission to develop graph mining algorithms that can handle graphs with 100B edges on a single multi-core machine. The big challenge here is to achieve fast (e.g., sublinear) parallel running time (i.e., depth). Following our previous work for community detection and correlation clustering, we developed an algorithm for HAC, called ParHAC, which has provable polylogarithmic depth and near-linear work and achieves a 50x speedup. As an example, it took ParHAC only ~10 minutes to find an approximate affinity hierarchy over a graph of over 100B edges, and ~3 hours to find the full HAC on a single machine. Following our previous work on distributed HAC, we use these multi-core algorithms as a subroutine within our distributed algorithms in order to handle tera-scale graphs.

We also had a number of interesting results on graph neural networks (GNN) in 2022. We provided a model-based taxonomy that unified many graph learning methods. In addition, we discovered insights for GNN models from their performance across thousands of graphs with varying structure (shown below). We also proposed a new hybrid architecture to overcome the depth requirements of existing GNNs for solving fundamental graph problems, such as shortest paths and the minimum spanning tree.

Relative performance results of three GNN variants (GCN, APPNP, FiLM) across 50,000 distinct node classification datasets in GraphWorld. We find that academic GNN benchmark datasets exist in regions where model rankings do not change. GraphWorld can discover previously unexplored graphs that reveal new insights about GNN architectures.

Furthermore, to bring some of these many advances to the broader community, we had three releases of our flagship modeling library for building graph neural networks in TensorFlow (TF-GNN). Highlights include a model library and model orchestration API to make it easy to compose GNN solutions. Following our NeurIPS’20 workshop on Mining and Learning with Graphs at Scale, we ran a workshop on graph-based learning at ICML’22, and a tutorial for GNNs in TensorFlow at NeurIPS’22.

In “Robust Routing Using Electrical Flows”, we presented a Google Maps solution that efficiently computes alternate paths in road networks that are resistant to failures (e.g., closures, incidents). We demonstrated that it significantly outperforms the state-of-the-art plateau and penalty methods on real-world road networks.

Example of how we construct the electrical circuit corresponding to the road network. The current can be decomposed into three flows, i1, i2 and i3, each of which corresponds to a viable alternate path from Fremont, CA to San Rafael, CA.

On the optimization front, we open-sourced Vizier, our flagship blackbox optimization and hyperparameter tuning library at Google. We also developed new techniques for linear programming (LP) solvers that address scalability limits caused by their reliance on matrix factorizations, which restricts the opportunity for parallelism and distributed approaches. To this end, we open-sourced a primal-dual hybrid gradient (PDHG) solution for LP called primal-dual linear programming (PDLP), a new first-order solver for large-scale LP problems. PDLP has been used to solve real-world problems with as many as 12B non-zeros (and an internal distributed version scaled to 92B non-zeros). PDLP’s effectiveness is due to a combination of theoretical developments and algorithm engineering.

With OSS Vizier, multiple clients each send a “Suggest” request to the Service API, which produces Suggestions for the clients using Pythia policies. The clients evaluate these suggestions and return measurements. All transactions are stored to allow fault-tolerance.
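
As a concrete illustration of this loop, here is a minimal sketch using the OSS Vizier Python client; the study name, parameter, and objective below are placeholders, and exact API details may differ between releases:

from vizier.service import clients
from vizier.service import pyvizier as vz

# Define the search space and the metric to optimize (names here are illustrative).
study_config = vz.StudyConfig(algorithm='GAUSSIAN_PROCESS_BANDIT')
study_config.search_space.root.add_float_param('learning_rate', 1e-4, 1e-1)
study_config.metric_information.append(
    vz.MetricInformation('accuracy', goal=vz.ObjectiveMetricGoal.MAXIMIZE))

study = clients.Study.from_study_config(study_config, owner='my_team', study_id='example_study')

for _ in range(20):
    for suggestion in study.suggest(count=1):          # "Suggest" request to the service
        lr = suggestion.parameters['learning_rate']
        accuracy = -(lr - 0.01) ** 2                    # stand-in for a real training run
        suggestion.complete(vz.Measurement({'accuracy': accuracy}))  # return the measurement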

Privacy and federated learning

Respecting user privacy while providing high-quality services remains a top priority for all Google systems. Research in this area spans many products and uses principles from differential privacy (DP) and federated learning.

First of all, we have made a variety of algorithmic advances to address the problem of training large neural networks with DP. Building on our earlier work, which enabled us to launch a DP neural network based on the DP-FTRL algorithm, we developed the matrix factorization DP-FTRL approach. This work demonstrates that one can design a mathematical program to optimize over a large set of possible DP mechanisms to find those best suited for specific learning problems. We also establish margin guarantees that are independent of the input feature dimension for DP learning of neural networks and kernel-based methods. We further extend this concept to a broader range of ML tasks, matching baseline performance with 300x less computation. For fine-tuning of large models, we argued that once pre-trained, these models (even with DP) essentially operate over a low-dimensional subspace, hence circumventing the curse of dimensionality that DP imposes.

On the algorithmic front, for estimating the entropy of a high-dimensional distribution, we obtained local DP mechanisms (that work even when as little as one bit per sample is available) and efficient shuffle DP mechanisms. We proposed a more accurate method to simultaneously estimate the top-k most popular items in the database in a private manner, which we employed in the Plume library. Moreover, we showed a near-optimal approximation algorithm for DP clustering in the massively parallel computing (MPC) model, which further improves on our previous work for scalable and distributed settings.

Another exciting research direction is the intersection of privacy and streaming. We obtained a near-optimal approximation-space trade-off for the private frequency moments and a new algorithm for privately counting distinct elements in the sliding window streaming model. We also presented a general hybrid framework for studying adversarial streaming.

Addressing applications at the intersection of security and privacy, we developed new algorithms that are secure, private, and communication-efficient, for measuring cross-publisher reach and frequency. The World Federation of Advertisers has adopted these algorithms as part of their measurement system. In subsequent work, we developed new protocols that are secure and private for computing sparse histograms in the two-server model of DP. These protocols are efficient from both computation and communication points of view, are substantially better than what standard methods would yield, and combine tools and techniques from sketching, cryptography and multiparty computation, and DP.

While we have trained BERT and transformers with DP, understanding training example memorization in large language models (LLMs) is a heuristic way to evaluate their privacy. In particular, we investigated when and why LLMs forget (potentially memorized) training examples during training. Our findings suggest that earlier-seen examples may observe privacy benefits at the expense of examples seen later. We also quantified the degree to which LLMs emit memorized training data.

Market algorithms and causal inference

We also continued our research in improving online marketplaces in 2022. For example, an important recent area in ad auction research is the study of auto-bidding online advertising, where the majority of bidding happens via proxy bidders that optimize higher-level objectives on behalf of advertisers. The complex dynamics of users, advertisers, bidders, and ad platforms lead to non-trivial problems in this space. Following our earlier work in analyzing and improving mechanisms under auto-bidding auctions, we continued this line of work in the context of automation while taking different aspects into consideration, such as user experience and advertiser budgets. Our findings suggest that properly incorporating ML advice and randomization techniques, even in non-truthful auctions, can robustly improve the overall welfare at equilibria among auto-bidding algorithms.

Structure of auto-bidding online ads system.

Beyond auto-bidding systems, we also studied auction improvements in complex environments, e.g., settings where buyers are represented by intermediaries, and with Rich Ads where each ad can be shown in one of several possible variants. We summarize our work in this area in a recent survey. Beyond auctions, we also investigate the use of contracts in multi-agent and adversarial settings.

Online stochastic optimization remains an important part of online advertising systems with application in optimal bidding and budget pacing. Building on our long-term research in online allocation, we recently blogged about dual mirror descent, a new algorithm for online allocation problems that is simple, robust, and flexible. This state-of-the-art algorithm is robust against a wide range of adversarial and stochastic input distributions and can optimize important objectives beyond economic efficiency, such as fairness. We also show that by tailoring dual mirror descent to the special structure of the increasingly popular return-on-spend constraints, we can optimize advertiser value. Dual mirror descent has a wide range of applications and has been used over time to help advertisers obtain more value through better algorithmic decision making.

An overview of the dual mirror descent algorithm.
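
To give a flavor of how such an algorithm operates, the following is a heavily simplified, illustrative sketch of a dual-mirror-descent-style allocation loop with a Euclidean mirror map; it is not the exact algorithm from our papers, and all inputs are made up for illustration:

import numpy as np

def dual_mirror_descent(requests, budgets, eta=0.01):
    # requests: list of (values, costs) pairs, one per arriving request;
    #   values: (n_options,) reward of each option, costs: (n_options, n_resources).
    # budgets: (n_resources,) NumPy array of total budgets, targeted to be spent
    #   at rate rho = budget / T over the horizon.
    T = len(requests)
    rho = budgets / T
    mu = np.zeros_like(budgets, dtype=float)       # dual prices, one per resource
    remaining = budgets.astype(float)
    for values, costs in requests:
        adjusted = values - costs @ mu             # price-adjusted value of each option
        k = int(np.argmax(adjusted))
        if adjusted[k] > 0 and np.all(costs[k] <= remaining):
            remaining -= costs[k]
            consumed = costs[k]
        else:
            consumed = np.zeros_like(rho)          # skip the request
        # Dual update: raise prices for over-consumed resources, lower them otherwise.
        mu = np.maximum(mu + eta * (consumed - rho), 0.0)
    return budgets - remaining                     # total spend per resource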

Furthermore, following our recent work at the interplay of ML, mechanism design and markets, we investigated transformers for asymmetric auction design, designed utility-maximizing strategies for no-regret learning buyers, and developed new learning algorithms to bid or to price in auctions.

An overview of bipartite experimental design to reduce causal interactions between entities.

A critical component of any sophisticated online service is the ability to experimentally measure the response of users and other players to new interventions. A major challenge in estimating these causal effects accurately is handling complex interactions — or interference — between the control and treatment units of these experiments. We combined our graph clustering and causal inference expertise to extend our previous work in this area, with improved results under a flexible response model and a new experimental design that is more effective at reducing these interactions when treatment assignments and metric measurements occur on the same side of a bipartite platform. We also showed how synthetic control and optimization techniques can be combined to design more powerful experiments, especially in small-data regimes.

Algorithmic foundations and theory

Finally, we continued our fundamental algorithmic research by tackling long-standing open problems. A surprisingly concise paper affirmatively resolved a four-decade-old open question on whether there is a mechanism that guarantees a constant fraction of the gains-from-trade attainable whenever the buyer’s value weakly exceeds the seller’s cost. Another recent paper obtained the state-of-the-art approximation for the classic and highly studied k-means problem. We also improved the best known approximation for correlation clustering, breaking the barrier of an approximation factor of 2. In addition, our work on dynamic data structures to solve min-cost and other network flow problems has contributed to a breakthrough line of work in adapting continuous optimization techniques to solve classic discrete optimization problems.

Concluding thoughts

Designing effective algorithms and mechanisms is a critical component of many Google systems that need to handle tera-scale data robustly with critical privacy and safety considerations. Our approach is to develop algorithms with solid theoretical foundations that can be deployed effectively in our product systems. In addition, we are bringing many of these advances to the broader community by open-sourcing some of our most novel developments and by publishing the advanced algorithms behind them. In this post, we covered a subset of algorithmic advances in privacy, market algorithms, scalable algorithms, graph-based learning, and optimization. As we move toward an AI-first Google with further automation, developing robust, scalable, and privacy-safe ML algorithms remains a high priority. We are excited about developing new algorithms and deploying them more broadly.

Acknowledgements

This post summarizes research from a large number of teams and benefited from input from several researchers including Gagan Aggarwal, Amr Ahmed, David Applegate, Santiago Balseiro, Vincent Cohen-addad, Yuan Deng, Alessandro Epasto, Matthew Fahrbach, Badih Ghazi, Sreenivas Gollapudi, Rajesh Jayaram, Ravi Kumar, Sanjiv Kumar, Silvio Lattanzi, Kuba Lacki, Brendan McMahan, Aranyak Mehta, Bryan Perozzi, Daniel Ramage, Ananda Theertha Suresh, Andreas Terzis, Sergei Vassilvitskii, Di Wang, and Song Zuo. Special thanks to Ravi Kumar for his contributions to this post.

Google Research, 2022 & beyond

This was the fifth blog post in the “Google Research, 2022 & Beyond” series. Other posts in this series are listed in the table below:

Language Models | Computer Vision | Multimodal Models
Generative Models | Responsible AI | ML & Computer Systems
Efficient Deep Learning | Algorithmic Advances | Robotics*
Health | General Science & Quantum | Community Engagement
* Articles will be linked as they are released.

Identifying defense coverage schemes in NFL’s Next Gen Stats

This post is co-written with Jonathan Jung, Mike Band, Michael Chi, and Thompson Bliss at the National Football League.

A coverage scheme refers to the rules and responsibilities of each football defender tasked with stopping an offensive pass. It is at the core of understanding and analyzing any football defensive strategy. Classifying the coverage scheme for every pass play provides insights into the game for teams, broadcasters, and fans alike. For instance, it can reveal the preferences of play callers, allow a deeper understanding of how respective coaches and teams continuously adjust their strategies based on their opponent’s strengths, and enable the development of new defensive-oriented analytics such as uniqueness of coverages (Seth et al.). However, manually identifying these coverages on a per-play basis is both laborious and difficult because it requires football specialists to carefully inspect the game footage. There is a need for an automated coverage classification model that can scale effectively and efficiently to reduce cost and turnaround time.

The NFL’s Next Gen Stats captures real-time location, speed, and more for every player and play of NFL football games, and derives various advanced stats covering different aspects of the game. Through a collaboration between the Next Gen Stats team and the Amazon ML Solutions Lab, we have developed a machine learning (ML)-powered coverage classification stat that accurately identifies the defense coverage scheme from player tracking data. The coverage classification model is trained using Amazon SageMaker, and the stat has been launched for the 2022 NFL season.

In this post, we deep dive into the technical details of this ML model. We describe how we designed an accurate, explainable ML model that classifies coverages from player tracking data, followed by our quantitative evaluation and model explanation results.

Problem formulation and challenges

We define the defensive coverage classification as a multi-class classification task, with three types of man coverage (where each defensive player covers a certain offensive player) and five types of zone coverage (each defensive player covers a certain area on the field). These eight classes are visually depicted in the following figure: Cover 0 Man, Cover 1 Man, Cover 2 Man, Cover 2 Zone, Cover 3 Zone, Cover 4 Zone, Cover 6 Zone, and Prevent (also zone coverage). Circles in blue are the defensive players laid out in a particular type of coverage; circles in red are the offensive players. A full list of the player acronyms is provided in the appendix at the end of this post.

Eight coverages considered in the post

The following visualization shows an example play, with the location of all offensive and defensive players at the start of the play (left) and in the middle of the same play (right). To make the correct coverage identification, a multitude of information over time must be accounted for, including the way defenders lined up before the snap and the adjustments to offensive player movement once the ball is snapped. This challenges the model to capture the spatio-temporal, and often subtle, movements of and interactions among the players.

Two frames of an example play showing player locations

Another key challenge faced by our partnership is the inherent ambiguity around the deployed coverage schemes. Beyond the eight commonly known coverage schemes, we identified adjustments in more specific coverage calls that lead to ambiguity among the eight general classes for both manual charting and model classification. We tackle these challenges using improved training strategies and model explanation. We describe our approaches in detail in the following section.

Explainable coverage classification framework

We illustrate our overall framework in the following figure, with the input of player tracking data and coverage labels starting at the top of the figure.

Overall framework for coverage classification

Feature engineering

Game tracking data is captured at 10 frames per second, including the player location, speed, acceleration, and orientation. Our feature engineering constructs sequences of play features as the model input. For a given frame, our features are inspired by the 2020 Big Data Bowl Kaggle Zoo solution (Gordeev et al.): we construct an image for each time step with the defensive players as the rows and offensive players as the columns. Each pixel of the image therefore represents the features for the intersecting pair of players. Different from Gordeev et al., we extract a sequence of these frame representations, which effectively generates a mini-video to characterize the play.

The following figure visualizes how the features evolve over time for two snapshots of an example play. For visual clarity, we only show four of the features we extracted. “LOS” in the figure stands for the line of scrimmage, and the x-axis refers to the horizontal direction toward the right of the football field. Notice how the feature values, indicated by the colorbar, evolve over time in step with the player movement. Altogether, we construct two sets of features, as follows (see the sketch after the figure below):

  • Defender features consisting of the defender position, speed, acceleration, and orientation, on the x-axis (horizontal direction to the right of the football field) and y-axis (vertical direction to the top of the football field)
  • Defender-offense relative features consisting of the same attributes, but calculated as the difference between the defensive and offensive players

Extracted features evolve over time corresponding to the player movement in the example play
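
The sketch below illustrates this construction; the function name, shapes, and exact feature set are assumptions for illustration rather than the production code:

import numpy as np

def play_to_frame_images(def_feats, off_feats):
    # def_feats: (T, n_def, F) per-frame defender features (x/y position, speed, ...)
    # off_feats: (T, n_off, F) per-frame offensive player features
    # Returns a (T, n_def, n_off, 2F) tensor: one "image" per frame whose pixel (i, j)
    # stacks defender i's own features with the defender-offense differences.
    T, n_def, F = def_feats.shape
    n_off = off_feats.shape[1]
    frames = np.zeros((T, n_def, n_off, 2 * F), dtype=np.float32)
    for t in range(T):
        d = def_feats[t][:, None, :]                      # (n_def, 1, F)
        o = off_feats[t][None, :, :]                      # (1, n_off, F)
        defender = np.broadcast_to(d, (n_def, n_off, F))  # defender features
        relative = d - o                                  # defender-offense relative features
        frames[t] = np.concatenate([defender, relative], axis=-1)
    return frames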

CNN module

We utilize a convolutional neural network (CNN) to model the complex player interactions, similar to the Open Source Football (Baldwin et al.) and Big Data Bowl Kaggle Zoo solutions (Gordeev et al.). The image representation obtained from feature engineering allows us to model each play frame with a CNN. We modified the convolutional (Conv) block utilized by the Zoo solution (Gordeev et al.) with a branching structure composed of a shallow one-layer CNN and a deep three-layer CNN. The convolution layers use a 1×1 kernel internally: having the kernel look at each player pair individually ensures that the model is invariant to the player ordering. For simplicity, we order the players by their NFL ID for all play samples. We obtain the frame embeddings as the output of the CNN module.
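
A minimal PyTorch sketch of such a branched block of 1×1 convolutions follows; the layer widths, the way the branches are merged, and the pooling into a frame embedding are illustrative choices, not the production configuration:

import torch
import torch.nn as nn

class BranchedConvBlock(nn.Module):
    # A shallow one-layer branch and a deep three-layer branch of 1x1 convolutions over
    # the (defender x offense) feature image; 1x1 kernels look at each defender-offense
    # pair independently, keeping the model invariant to player ordering.
    def __init__(self, in_channels: int, hidden: int = 128):
        super().__init__()
        self.shallow = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1), nn.ReLU())
        self.deep = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=1), nn.ReLU())

    def forward(self, x):                     # x: (batch, channels, defenders, offense)
        z = self.shallow(x) + self.deep(x)    # merge the two branches
        return z.mean(dim=(2, 3))             # pool over player pairs -> frame embedding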

Temporal modeling

Although a play lasts only a few seconds, it contains rich temporal dynamics that serve as key indicators for identifying the coverage. The frame-based CNN modeling, as used in the Zoo solution (Gordeev et al.), does not account for this temporal progression. To tackle this challenge, we design a self-attention module (Vaswani et al.), stacked on top of the CNN, for temporal modeling. During training, it learns to aggregate the individual frames by weighing them differently (Alammar et al.). We compare it with a more conventional, bidirectional LSTM approach in the quantitative evaluation. The learned attention embeddings are then averaged to obtain the embedding of the whole play. Finally, a fully connected layer determines the coverage class of the play.
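
Continuing the sketch above, a self-attention temporal module stacked on the frame encoder could look like the following; the hyperparameters are placeholders:

import torch.nn as nn

class CoverageClassifier(nn.Module):
    # Frame encoder (BranchedConvBlock above) + self-attention over frames + linear head.
    def __init__(self, in_channels: int, hidden: int = 128, num_classes: int = 8):
        super().__init__()
        self.frame_encoder = BranchedConvBlock(in_channels, hidden)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=4, dim_feedforward=256, batch_first=True)
        self.temporal = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, frames):                         # (batch, time, channels, defenders, offense)
        b, t = frames.shape[:2]
        x = frames.flatten(0, 1)                       # fold time into the batch dimension
        emb = self.frame_encoder(x).view(b, t, -1)     # per-frame embeddings
        emb = self.temporal(emb)                       # self-attention across the frames
        return self.head(emb.mean(dim=1))              # average the frames, then classify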

Model ensemble and label smoothing

Ambiguity among the eight coverage schemes and their imbalanced distribution make a clean separation among coverages challenging. We use model ensembling to tackle these challenges during training. Our study finds that a voting-based ensemble, one of the simplest ensemble methods, actually outperforms more complex approaches. In this method, each base model has the same CNN-attention architecture and is trained independently from different random seeds. The final classification takes the average over the outputs of all base models.

We further incorporate label smoothing (Müller et al.) into the cross-entropy loss to handle the potential noise in manual charting labels. Label smoothing steers the annotated coverage class slightly towards the remaining classes. The idea is to encourage the model to adapt to the inherent coverage ambiguity instead of overfitting to any biased annotations.
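
With PyTorch, for example, these two ingredients can be sketched as follows; the smoothing value and ensemble size are illustrative, as the exact settings are not specified here:

import torch
import torch.nn as nn

# Label-smoothed cross-entropy for training each base model (0.1 is an assumed value).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

@torch.no_grad()
def ensemble_predict(models, frames):
    # Soft voting: average the class probabilities of independently trained base models.
    probs = torch.stack([m(frames).softmax(dim=-1) for m in models]).mean(dim=0)
    return probs.argmax(dim=-1)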

Quantitative evaluation

We utilize 2018–2020 season data for model training and validation, and 2021 season data for model evaluation. Each season consists of around 17,000 plays. We perform five-fold cross-validation to select the best model during training, and perform hyperparameter optimization over multiple model architecture and training parameters to select the best settings.

To evaluate the model performance, we compute the coverage accuracy, F1 score, top-2 accuracy, and accuracy of the easier man vs. zone task. The CNN-based Zoo model used in Baldwin et al. is the most relevant for coverage classification and we use it as the baseline. In addition, we consider improved versions of the baseline that incorporate the temporal modeling components for comparative study: a CNN-LSTM model that utilizes a bi-directional LSTM to perform the temporal modeling, and a single CNN-attention model without the ensemble and label smoothing components. The results are shown in the following table.

Model | Test Accuracy, 8 Coverages (%) | Top-2 Accuracy, 8 Coverages (%) | F1 Score, 8 Coverages | Test Accuracy, Man vs. Zone (%)
Baseline: Zoo model | 68.8±0.4 | 87.7±0.1 | 65.8±0.4 | 88.4±0.4
CNN-LSTM | 86.5±0.1 | 93.9±0.1 | 84.9±0.2 | 94.6±0.2
CNN-attention | 87.7±0.2 | 94.7±0.2 | 85.9±0.2 | 94.6±0.2
Ours: Ensemble of 5 CNN-attention models | 88.9±0.1 | 97.6±0.1 | 87.4±0.2 | 95.4±0.1

We observe that incorporating the temporal modeling module significantly improves the baseline Zoo model, which was based on a single frame. Compared to the strong CNN-LSTM baseline, our proposed modeling components (the self-attention module, model ensemble, and label smoothing) together provide a significant performance improvement. The final ensemble model performs well across all evaluation measures. In addition, we observe a very high top-2 accuracy and a significant gap between top-1 and top-2 accuracy. This can be attributed to the coverage ambiguity: when the top classification is incorrect, the second guess often matches the human annotation.

Model explanations and results

To shed light on the coverage ambiguity and understand what the model utilized to arrive at a given conclusion, we perform analysis using model explanations. It consists of two parts: global explanations that analyze all learned embeddings jointly, and local explanations that zoom into individual plays to analyze the most important signals captured by the model.

Global explanations

In this stage, we analyze the learned play embeddings from the coverage classification model globally to discover any patterns that require manual review. We utilize t-distributed stochastic neighbor embedding (t-SNE) (Maaten et al.), which projects the play embeddings into 2D space such that similar embeddings are likely to land close to each other. We experiment with the internal parameters to extract stable 2D projections. The embeddings from a stratified sample of 9,000 plays are visualized in the following figure (left), with each dot representing a certain play. We find that the plays of each coverage scheme are mostly well separated, demonstrating the classification capability gained by the model. We observe two important patterns and investigate them further.

Some plays are mixed into other coverage types, as shown in the following figure (right). These plays could potentially be mislabeled and deserve manual inspection. We design a K-Nearest Neighbors (KNN) classifier to automatically identify these plays and send them for expert review. The results show that most of them were indeed labeled incorrectly.

t-SNE visualization of play embeddings and identified plays for manual review
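
A sketch of this global analysis with scikit-learn is shown below; the embedding array, integer-coded labels, and neighborhood size are assumptions, and the flagging rule is one plausible way to realize the KNN-based review step:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsClassifier

# play_embeddings: (n_plays, d) learned play embeddings; labels: (n_plays,) integer-coded
# coverage classes from manual charting. Both are assumed to be available as arrays.
proj = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(play_embeddings)
plt.scatter(proj[:, 0], proj[:, 1], c=labels, s=3, cmap="tab10")
plt.title("t-SNE projection of play embeddings")
plt.show()

# Flag potentially mislabeled plays: a KNN vote over the embeddings that disagrees
# with the annotated coverage class is sent for expert review.
knn = KNeighborsClassifier(n_neighbors=15).fit(play_embeddings, labels)
plays_for_review = np.where(knn.predict(play_embeddings) != labels)[0]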

Next, we observe several overlapping regions among the coverage types, manifesting coverage ambiguity in certain scenarios. As an example, in the following figure, we separate Cover 3 Zone (green cluster on the left) and Cover 1 Man (blue cluster in the middle). These are two different single-high coverage concepts, where the main distinction is man vs. zone coverage. We design an algorithm that automatically identifies the ambiguity between these two classes as the overlapping region of the clusters. The result is visualized as the red dots in the following right figure, with 10 randomly sampled plays marked with a black “x” for manual review. Our analysis reveals that most of the play examples in this region involve some sort of pattern matching. In these plays, the coverage responsibilities are contingent upon how the offensive receivers’ routes are distributed, and adjustments can make the play look like a mix of zone and man coverages. One such adjustment we identified applies to Cover 3 Zone, when the cornerback (CB) to one side is locked into man coverage (“Man Everywhere he Goes” or MEG) and the other has a traditional zone drop.

Overlapping region between Cover 3 Zone and Cover 1 Man

Instance explanations

In the second stage, instance explanations zoom into the individual play of interest, and extract frame-by-frame player interaction highlights that contribute the most to the identified coverage scheme. This is achieved through the Guided GradCAM algorithm (Ramprasaath et al.). We utilize the instance explanations on low-confidence model predictions.
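
As one possible implementation sketch, Captum's GuidedGradCam can be applied to a single-frame view of the classifier sketched earlier; the wrapper, tensor shapes, and layer choice below are assumptions for illustration rather than the production setup:

import torch
from captum.attr import GuidedGradCam

class SingleFrameHead(torch.nn.Module):
    # Attribution-only wrapper: encode one frame and classify it directly, skipping the
    # temporal attention (a simplification of the full CoverageClassifier sketched above).
    def __init__(self, full_model):
        super().__init__()
        self.encoder = full_model.frame_encoder
        self.head = full_model.head

    def forward(self, frame):                  # frame: (batch, channels, defenders, offense)
        return self.head(self.encoder(frame))

# `model` is a trained CoverageClassifier and `frames` a (1, T, C, defenders, offense) play tensor.
predicted_class = int(model(frames).argmax(dim=-1))
wrapper = SingleFrameHead(model).eval()
explainer = GuidedGradCam(wrapper, layer=wrapper.encoder.deep[4])   # last 1x1 conv layer
attributions = explainer.attribute(frames[:, 0], target=predicted_class)
# Large attribution magnitudes at (defender i, offensive player j) highlight the player
# interactions in this frame that most influenced the prediction.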

For the play we illustrated at the beginning of the post, the model predicted Cover 3 Zone with 44.5% probability and Cover 1 Man with 31.3% probability. We generate the explanation results for both classes, as shown in the following figure. The line thickness indicates the strength of the interaction contributing to the model’s identification.

The top plot for Cover 3 Zone explanation comes right after the ball snap. The CB on the offense’s right has the strongest interaction lines, because he is facing the QB and stays in place. He ends up squaring off and matching with the receiver on his side, who threatens him deep.

The bottom plot for Cover 1 Man explanation comes a moment later, as the play action fake is happening. One of the strongest interactions is with the CB to the offense’s left, who is dropping with the WR. Play footage reveals that he keeps his eyes on the QB before flipping around and running with the WR who is threatening him deep. The SS on the offense’s right also has a strong interaction with the TE on his side, as he starts to shuffle as the TE breaks inside. He ends up following him across the formation, but the TE starts to block him, indicating the play was likely a run-pass option. This explains the uncertainty of the model’s classification: the TE is sticking with the SS by design, creating biases in the data.

Model explanation for Cover 3 Zone comes right after the ball snap

Model explanation for Cover 1 Man comes a moment later, as the play action fake is happening

Conclusion

The Amazon ML Solutions Lab and NFL’s Next Gen Stats team jointly developed the defense coverage classification stat that was recently launched for the 2022 NFL football season. This post presented the ML technical details of this stat, including the modeling of the fast temporal progression, training strategies to handle the coverage class ambiguity, and comprehensive model explanations to speed up expert review on both global and instance levels.

The solution makes live defensive coverage tendencies and splits available to broadcasters in-game for the first time ever. Likewise, the model enables the NFL to improve its analysis of post-game results and better identify key matchups leading up to games.

If you’d like help accelerating your use of ML, please contact the Amazon ML Solutions Lab program.

Appendix

Player position acronyms
Defensive positions
W “Will” Linebacker, or the weak side LB
M “Mike” Linebacker, or the middle LB
S “Sam” Linebacker, or the strong side LB
CB Cornerback
DE Defensive End
DT Defensive Tackle
NT Nose Tackle
FS Free Safety
SS Strong Safety
S Safety
LB Linebacker
ILB Inside Linebacker
OLB Outside Linebacker
MLB Middle Linebacker
Offensive positions
X Usually the number 1 wide receiver in an offense, they align on the LOS. In trips formations, this receiver is often aligned isolated on the backside.
Y Usually the starting tight end, this player will often align in-line and to the opposite side as the X.
Z Usually more of a slot receiver, this player will often align off the line of scrimmage and on the same side of the field as the tight end.
H Traditionally a fullback, this player is more often a third wide receiver or a second tight end in the modern league. They can align all over the formation, but are almost always off the line of scrimmage. Depending on the team, this player could also be designated as an F.
T The featured running back. Other than empty formations, this player will align in the backfield and be a threat to receive the handoff.
QB Quarterback
C Center
G Guard
RB Running Back
FB Fullback
WR Wide Receiver
TE Tight End
LG Left Guard
RG Right Guard
T Tackle
LT Left Tackle
RT Right Tackle

About the Authors

Huan Song is an applied scientist at Amazon Machine Learning Solutions Lab, where he works on delivering custom ML solutions for high-impact customer use cases from a variety of industry verticals. His research interests are graph neural networks, computer vision, time series analysis and their industrial applications.

Mohamad Al Jazaery is an applied scientist at Amazon Machine Learning Solutions Lab. He helps AWS customers identify and build ML solutions to address their business challenges in areas such as logistics, personalization and recommendations, computer vision, fraud prevention, forecasting and supply chain optimization. Prior to AWS, he obtained his MCS from West Virginia University and worked as computer vision researcher at Midea. Outside of work, he enjoys soccer and video games.

Haibo Ding is a senior applied scientist at Amazon Machine Learning Solutions Lab. He is broadly interested in Deep Learning and Natural Language Processing. His research focuses on developing new explainable machine learning models, with the goal of making them more efficient and trustworthy for real-world problems. He obtained his Ph.D. from University of Utah and worked as a senior research scientist at Bosch Research North America before joining Amazon. Apart from work, he enjoys hiking, running, and spending time with his family.

Lin Lee Cheong is an applied science manager with the Amazon ML Solutions Lab team at AWS. She works with strategic AWS customers to explore and apply artificial intelligence and machine learning to discover new insights and solve complex problems. She received her Ph.D. from Massachusetts Institute of Technology. Outside of work, she enjoys reading and hiking.

Jonathan Jung is a Senior Software Engineer at the National Football League. He has been with the Next Gen Stats team for the last seven years helping to build out the platform from streaming the raw data, building out microservices to process the data, to building API’s that exposes the processed data. He has collaborated with the Amazon Machine Learning Solutions Lab in providing clean data for them to work with as well as providing domain knowledge about the data itself. Outside of work, he enjoys cycling in Los Angeles and hiking in the Sierras.

Mike Band is a Senior Manager of Research and Analytics for Next Gen Stats at the National Football League. Since joining the team in 2018, he has been responsible for ideation, development, and communication of key stats and insights derived from player-tracking data for fans, NFL broadcast partners, and the 32 clubs alike. Mike brings a wealth of knowledge and experience to the team with a master’s degree in analytics from the University of Chicago, a bachelor’s degree in sport management from the University of Florida, and experience in both the scouting department of the Minnesota Vikings and the recruiting department of Florida Gator Football.

Michael Chi is a Senior Director of Technology overseeing Next Gen Stats and Data Engineering at the National Football League. He has a degree in Mathematics and Computer Science from the University of Illinois at Urbana Champaign. Michael first joined the NFL in 2007 and has primarily focused on technology and platforms for football statistics. In his spare time, he enjoys spending time with his family outdoors.

Thompson Bliss is a Manager, Football Operations, Data Scientist at the National Football League. He started at the NFL in February 2020 as a Data Scientist and was promoted to his current role in December 2021. He completed his master’s degree in Data Science at Columbia University in the City of New York in December 2019. He received a Bachelor of Science in Physics and Astronomy with minors in Mathematics and Computer Science at University of Wisconsin – Madison in 2018.

Detect signatures on documents or images using the signatures feature in Amazon Textract

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from any document or image. AnalyzeDocument Signatures is a feature within Amazon Textract that offers the ability to automatically detect signatures on any document. This can reduce the need for human review, custom code, or ML experience.

In this post, we discuss the benefits of the AnalyzeDocument Signatures feature and how the AnalyzeDocument Signatures API helps detect signatures in documents. We also walk through how to use the feature through the Amazon Textract console and provide code examples to use the API and process the response with the Amazon Textract response parser library. Lastly, we share some best practices for using this feature.

Benefits of the Signatures feature

Our customers from insurance, mortgage, legal, and tax industries face the challenge of processing huge volumes of paper-based documents while adhering to regulatory and compliance requirements that require signatures in documents. You may need to ensure that specific forms such as loan applications or claims submitted by your end clients contain signatures before you start processing the application. For certain document processing workflows, you may need to go a step further to extract and compare the signatures for verification.

Historically, customers generally route the documents to a human reviewer to detect signatures. Using human reviewers to detect signatures tends to require a significant amount of time and resources. It can also lead to inefficiencies in the document processing workflow, resulting in longer turnaround times and a poor end-user experience.

The AnalyzeDocument Signatures feature allows you to automatically detect handwritten signatures, electronic signatures, and initials on documents. This can help you build an automated scalable solution with less reliance on costly and time-consuming manual processing. Not only can you use this feature to verify whether the document is signed, but you can also validate if a particular field in the form is signed using the location details of the detected signatures. You can also use location information to redact personally identifiable information (PII) in a document.

How AnalyzeDocument Signatures detects signatures in documents

The AnalyzeDocument API has four feature types: Forms, Tables, Queries, and Signatures. When Amazon Textract processes documents, the results are returned in an array of Block objects. The Signatures feature can be used by itself or in combination with other feature types. When used by itself, the Signatures feature type provides a JSON response that includes the location and confidence scores of the detected signatures and the raw text (words and lines) from the documents. The Signatures feature combined with other feature types, such as Forms and Tables, can help draw useful insights. In cases where the feature is used with Forms and Tables, the response shows the signature as part of a key-value pair or a table cell. For example, the response for the following form contains Signature of Lender as the key and the corresponding Block object as the value.

How to use the Signatures feature on the Amazon Textract console

Before we get started with the API and code samples, let’s review the Amazon Textract console. After you upload the document to the Amazon Textract console, select Signature detection in the Configure document section and choose Apply configuration.

The following screenshot shows an example of a paystub on the Signatures tab for the Analyze Document API on the Amazon Textract console.

The feature detects and presents the signature with its corresponding page and confidence score.

Code examples

You can use the Signatures feature to detect signatures on different types of documents, such as checks, loan application forms, claims forms, paystubs, mortgage documents, bank statements, lease agreements, and contracts. In this section, we discuss some of these documents and show how to invoke the AnalyzeDocument API with the Signatures parameter to detect signatures.

The input document can either be in a byte array format or located in an Amazon Simple Storage Service (Amazon S3) bucket. For documents in a byte array format, you can submit image bytes to an Amazon Textract API operation by using the bytes property. Signatures as a feature type is supported by the AnalyzeDocument API for synchronous document processing and StartDocumentAnalysis for asynchronous processing of documents.
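
For reference, the following is a minimal sketch of the asynchronous path; the bucket and file names are placeholders, and production code should also follow NextToken to page through all of the returned blocks:

import time
import boto3

textract = boto3.client('textract')

# Start an asynchronous analysis job on a document stored in Amazon S3.
job = textract.start_document_analysis(
    DocumentLocation={'S3Object': {'Bucket': 'my-bucket', 'Name': 'loan-application.pdf'}},
    FeatureTypes=['SIGNATURES'],
)

# Poll for completion (a notification channel can be used instead of polling).
while True:
    result = textract.get_document_analysis(JobId=job['JobId'])
    if result['JobStatus'] in ('SUCCEEDED', 'FAILED'):
        break
    time.sleep(5)

# First page of results only; use NextToken to retrieve the remaining blocks.
signatures = [block for block in result['Blocks'] if block['BlockType'] == 'SIGNATURE']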

In the following example, we detect signatures on an employment verification letter.

We use the following sample Python code:

import boto3
import json

#create a Textract Client
textract = boto3.client('textract')
#Document
documentName = image_filename

response = None
with open(image_filename, 'rb') as document:
    imageBytes = bytearray(document.read())

# Call Textract AnalyzeDocument by passing a document from local disk
response = textract.analyze_document(
    Document={'Bytes': imageBytes},
    FeatureTypes=["FORMS",'SIGNATURES']
    )

Let’s analyze the response we get from the AnalyzeDocument API. The following response has been trimmed to only show the relevant parts. The response has a BlockType of SIGNATURE that shows the confidence score, ID for the block, and bounding box details:

'BlockType': 'SIGNATURE',
   'Confidence': 38.468597412109375,
   'Geometry': {'BoundingBox': {'Width': 0.15083004534244537,
     'Height': 0.019236255437135696,
     'Left': 0.11393339931964874,
     'Top': 0.8885205388069153},
    'Polygon': [{'X': 0.11394496262073517, 'Y': 0.8885205388069153},
     {'X': 0.2647634446620941, 'Y': 0.8887625932693481},
     {'X': 0.264753133058548, 'Y': 0.9077568054199219},
     {'X': 0.11393339931964874, 'Y': 0.907513439655304}]},
   'Id': '609f749c-5e79-4dd4-abcc-ad47c6ebf777'}]

We use the following code to print the ID and location in a tabulated format:

#print detected text
from tabulate import tabulate
d = []
for item in response["Blocks"]:
    if item["BlockType"] == "SIGNATURE":
        d.append([item["Id"],item["Geometry"]])

print(tabulate(d, headers=["Id", "Geometry"],tablefmt="grid",maxcolwidths=[None, 100]))

The following screenshot shows our results.

More details and the complete code is available in the notebook on the GitHub repo.

For documents that have legible signatures in key value formats, we can use the Textract response parser to extract just the signature fields by searching for the key and the corresponding value to those keys:

from trp import Document
doc = Document(response)
d = []

for page in doc.pages:
    # Search fields by key
    print("nSearch Fields:")
    key = "Signature"
    fields = page.form.searchFieldsByKey(key)
    for field in fields:
        d.append([field.key, field.value])        

print(tabulate(d, headers=["Key", "Value"]))

The preceding code returns the following results:

Search Fields:
Key                        		Value
-------------------------  		--------------
8. Signature of Applicant 	Paulo Santos
26. Signature of Employer 	Richard Roe
3. Signature of Lender     	Carlos Salazar

Note that in order to transcribe the signatures in this way, the signatures must be legible.

Best practices for using the Signatures feature

Consider the following best practices when using this feature:

  • For real-time responses, use the synchronous operation of the AnalyzeDocument API. For use cases where you don’t need the response in real time, such as batch processing, we suggest using the asynchronous operation of the API.
  • The Signatures feature works best when there are up to three signatures on a page. When there are more than three signatures on a page, it’s best to split the page into sections and feed each of the sections separately to the API.
  • Use the confidence scores provided with the detected signatures to route the documents for human review when the scores don’t meet your required threshold. The confidence score is not a measure of accuracy, but an estimate of the model’s confidence in its prediction. You should select a confidence score that makes the most sense for your use case.

Summary

In this post, we provided an overview of the Signatures feature of Amazon Textract to automatically detect signatures on documents, such as paystubs, rental lease agreements, and contracts. AnalyzeDocument Signatures reduces the need for human reviewers and helps you reduce costs, save time, and build scalable solutions for document processing.

To get started, log on to the Amazon Textract console to try out the feature. To learn more about Amazon Textract capabilities, refer to Amazon Textract, the Amazon Textract Developer Guide, or Textract Resources.


About the Authors

Maran Chandrasekaran is a Senior Solutions Architect at Amazon Web Services, working with our enterprise customers. Outside of work, he loves to travel and ride his motorcycle in Texas Hill Country.

Shibin Michaelraj is a Sr. Product Manager with the AWS Textract team. He is focused on building AI/ML-based products for AWS customers.

Suprakash Dutta is a Sr. Solutions Architect at Amazon Web Services. He focuses on digital transformation strategy, application modernization and migration, data analytics, and machine learning. He is part of the AI/ML community at AWS and designs intelligent document processing solutions.

Monitoring Lake Mead drought using the new Amazon SageMaker geospatial capabilities

Earth’s changing climate poses an increased risk of drought due to global warming. Since 1880, the global temperature has increased 1.01 °C. Since 1993, sea levels have risen 102.5 millimeters. Since 2002, the land ice sheets in Antarctica have been losing mass at a rate of 151.0 billion metric tons per year. In 2022, the Earth’s atmosphere contains more than 400 parts per million of carbon dioxide, which is 50% more than it had in 1750. While these numbers might seem removed from our daily lives, the Earth has been warming at an unprecedented rate over the past 10,000 years [1].

In this post, we use the new geospatial capabilities in Amazon SageMaker to monitor drought caused by climate change in Lake Mead. Lake Mead is the largest reservoir in the US. It supplies water to 25 million people in the states of Nevada, Arizona, and California [2]. Research shows that the water levels in Lake Mead are at their lowest level since 1937 [3]. We use the geospatial capabilities in SageMaker to measure the changes in water levels in Lake Mead using satellite imagery.

Data access

The new geospatial capabilities in SageMaker offer easy access to geospatial data such as Sentinel-2 and Landsat 8. Built-in geospatial dataset access saves weeks of effort otherwise lost to collecting data from various data providers and vendors.

First, we use an Amazon SageMaker Studio notebook with a SageMaker geospatial image for our analysis, following the steps outlined in Getting Started with Amazon SageMaker geospatial capabilities.

The notebook used in this post can be found in the amazon-sagemaker-examples GitHub repo. SageMaker geospatial makes querying this data straightforward; we use the following code to specify the location and timeframe for the satellite data.

In the following code snippet, we first define an AreaOfInterest (AOI) with a bounding box around the Lake Mead area. We use the TimeRangeFilter to select data from January 2021 to July 2022. However, the area we are studying may be obscured by clouds. To obtain mostly cloud-free imagery, we choose a subset of images by setting the upper bound for cloud coverage to 1%.

import boto3
import sagemaker
import sagemaker_geospatial_map

session = boto3.Session()
execution_role = sagemaker.get_execution_role()
sg_client = session.client(service_name="sagemaker-geospatial")

search_rdc_args = {
    "Arn": "arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8",  # sentinel-2 L2A COG
    "RasterDataCollectionQuery": {
        "AreaOfInterest": {
            "AreaOfInterestGeometry": {
                "PolygonGeometry": {
                    "Coordinates": [
                        [
                            [-114.529, 36.142],
                            [-114.373, 36.142],
                            [-114.373, 36.411],
                            [-114.529, 36.411],
                            [-114.529, 36.142],
                        ] 
                    ]
                }
            } # data location
        },
        "TimeRangeFilter": {
            "StartTime": "2021-01-01T00:00:00Z",
            "EndTime": "2022-07-10T23:59:59Z",
        }, # timeframe
        "PropertyFilters": {
            "Properties": [{"Property": {"EoCloudCover": {"LowerBound": 0, "UpperBound": 1}}}],
            "LogicalOperator": "AND",
        },
        "BandFilter": ["visual"],
    },
}

tci_urls = []
data_manifests = []
while search_rdc_args.get("NextToken", True):
    search_result = sg_client.search_raster_data_collection(**search_rdc_args)
    if search_result.get("NextToken"):
        data_manifests.append(search_result)
    for item in search_result["Items"]:
        tci_url = item["Assets"]["visual"]["Href"]
        print(tci_url)
        tci_urls.append(tci_url)

    search_rdc_args["NextToken"] = search_result.get("NextToken")

Model inference

After we identify the data, the next step is to extract water bodies from the satellite images. Typically, we would need to train a land cover segmentation model from scratch to identify different categories of physical materials on the earth’s surface, such as water bodies, vegetation, and snow. Training a model from scratch is time-consuming and expensive. It involves data labeling, model training, and deployment. SageMaker geospatial capabilities provide a pre-trained land cover segmentation model, which can be run with a simple API call.

Rather than requiring you to download the data to a local machine for inference, SageMaker does all the heavy lifting for you. We simply specify the data configuration and model configuration in an Earth Observation Job (EOJ). SageMaker automatically downloads and preprocesses the satellite image data for the EOJ, making it ready for inference. Next, SageMaker automatically runs model inference for the EOJ. Depending on the workload (the number of images run through model inference), the EOJ can take several minutes to a few hours to finish. You can monitor the job status using the get_earth_observation_job function.

# Perform land cover segmentation on images returned from the sentinel dataset.
eoj_input_config = {
    "RasterDataCollectionQuery": {
        "RasterDataCollectionArn": "arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8",
        "AreaOfInterest": {
            "AreaOfInterestGeometry": {
                "PolygonGeometry": {
                    "Coordinates": [
                        [
                            [-114.529, 36.142],
                            [-114.373, 36.142],
                            [-114.373, 36.411],
                            [-114.529, 36.411],
                            [-114.529, 36.142],
                        ]
                    ]
                }
            }
        },
        "TimeRangeFilter": {
            "StartTime": "2021-01-01T00:00:00Z",
            "EndTime": "2022-07-10T23:59:59Z",
        },
        "PropertyFilters": {
            "Properties": [{"Property": {"EoCloudCover": {"LowerBound": 0, "UpperBound": 1}}}],
            "LogicalOperator": "AND",
        },
    }
}
eoj_config = {"LandCoverSegmentationConfig": {}}

response = sg_client.start_earth_observation_job(
    Name="lake-mead-landcover",
    InputConfig=eoj_input_config,
    JobConfig=eoj_config,
    ExecutionRoleArn=execution_role,
)

# Monitor the EOJ status.
eoj_arn = response["Arn"]
job_details = sg_client.get_earth_observation_job(Arn=eoj_arn)
{k: v for k, v in job_details.items() if k in ["Arn", "Status", "DurationInSeconds"]}

Visualize results

Now that we have run model inference, let’s visually inspect the results. We overlay the model inference results on the input satellite images. We use the Foursquare Studio tools that come pre-integrated with SageMaker to visualize these results. First, we create a map instance using the SageMaker geospatial capabilities to visualize the input images and model predictions:

# Creates an instance of the map to add EOJ input/output layers.
map = sagemaker_geospatial_map.create_map({"is_raster": True})
map.set_sagemaker_geospatial_client(sg_client)

# Render the map.
map.render()

When the interactive map is ready, we can render input images and model outputs as map layers without needing to download the data. Additionally, we can give each layer a label and select the data for a particular date using TimeRangeFilter:

# Visualize AOI
config = {"label": "Lake Mead AOI"}
aoi_layer = map.visualize_eoj_aoi(Arn=eoj_arn, config=config)

# Visualize input.
time_range_filter = {
    "start_date": "2022-07-01T00:00:00Z",
    "end_date": "2022-07-10T23:59:59Z",
}
config = {"label": "Input"}
input_layer = map.visualize_eoj_input(
    Arn=eoj_arn, config=config, time_range_filter=time_range_filter
)

# Visualize output, EOJ needs to be in completed status.
time_range_filter = {
    "start_date": "2022-07-01T00:00:00Z",
    "end_date": "2022-07-10T23:59:59Z",
}
config = {"preset": "singleBand", "band_name": "mask"}
output_layer = map.visualize_eoj_output(
    Arn=eoj_arn, config=config, time_range_filter=time_range_filter
)

We can verify that the area marked as water (bright yellow in the following map) accurately corresponds with the water body in Lake Mead by changing the opacity of the output layer.

Post analysis

Next, we use the export_earth_observation_job function to export the EOJ results to an Amazon Simple Storage Service (Amazon S3) bucket. We then run a subsequent analysis on the data in Amazon S3 to calculate the water surface area. The export function makes it convenient to share results across teams. SageMaker also simplifies dataset management. We can simply share the EOJ results using the job ARN, instead of crawling thousands of files in the S3 bucket. Each EOJ becomes an asset in the data catalog, as results can be grouped by the job ARN.

sagemaker_session = sagemaker.Session()
s3_bucket_name = sagemaker_session.default_bucket()  # Replace with your own bucket if needed
s3_bucket = session.resource("s3").Bucket(s3_bucket_name)
prefix = "eoj_lakemead"  # Replace with the S3 prefix desired
export_bucket_and_key = f"s3://{s3_bucket_name}/{prefix}/"

eoj_output_config = {"S3Data": {"S3Uri": export_bucket_and_key}}
export_response = sg_client.export_earth_observation_job(
    Arn=eoj_arn,
    ExecutionRoleArn=execution_role,
    OutputConfig=eoj_output_config,
    ExportSourceImages=False,
)

Next, we analyze changes in the water level in Lake Mead. We download the land cover masks to our local instance to calculate water surface area using open-source libraries. SageMaker saves the model outputs in Cloud Optimized GeoTiff (COG) format. In this example, we load these masks as NumPy arrays using the Tifffile package. The SageMaker Geospatial 1.0 kernel also includes other widely used libraries like GDAL and Rasterio.

Each pixel in the land cover mask has a value between 0 and 11, and each value corresponds to a particular class of land cover. The class index for water is 6, so we can use this index to extract the water mask. First, we count the number of pixels that are marked as water. Next, we multiply that number by the area that each pixel covers to get the surface area of the water. Depending on the band, the spatial resolution of a Sentinel-2 L2A image is 10 m, 20 m, or 60 m. All bands are downsampled to a spatial resolution of 60 m for the land cover segmentation model inference. As a result, each pixel in the land cover mask represents a ground area of 3,600 m2, or 0.0036 km2.

import os
from glob import glob
import cv2
import numpy as np
import tifffile
import matplotlib.pyplot as plt
from urllib.parse import urlparse
from botocore import UNSIGNED
from botocore.config import Config

# Download land cover masks
mask_dir = "./masks/lake_mead"
os.makedirs(mask_dir, exist_ok=True)
image_paths = []
for s3_object in s3_bucket.objects.filter(Prefix=prefix).all():
    path, filename = os.path.split(s3_object.key)
    if "output" in path:
        mask_name = mask_dir + "/" + filename
        s3_bucket.download_file(s3_object.key, mask_name)
        print("Downloaded mask: " + mask_name)

# Download source images for visualization (assumes image_dir and tci_urls were defined earlier in the notebook)
for tci_url in tci_urls:
    url_parts = urlparse(tci_url)
    img_id = url_parts.path.split("/")[-2]
    tci_download_path = image_dir + "/" + img_id + "_TCI.tif"
    cogs_bucket = session.resource(
        "s3", config=Config(signature_version=UNSIGNED, region_name="us-west-2")
    ).Bucket(url_parts.hostname.split(".")[0])
    cogs_bucket.download_file(url_parts.path[1:], tci_download_path)
    print("Downloaded image: " + img_id)

print("Downloads complete.")

image_files = glob("images/lake_mead/*.tif")
mask_files = glob("masks/lake_mead/*.tif")
image_files.sort(key=lambda x: x.split("SQA_")[1])
mask_files.sort(key=lambda x: x.split("SQA_")[1])
overlay_dir = "./masks/lake_mead_overlay"
os.makedirs(overlay_dir, exist_ok=True)
lake_areas = []
mask_dates = []

for image_file, mask_file in zip(image_files, mask_files):
    image_id = image_file.split("/")[-1].split("_TCI")[0]
    mask_id = mask_file.split("/")[-1].split(".tif")[0]
    mask_date = mask_id.split("_")[2]
    mask_dates.append(mask_date)
    assert image_id == mask_id
    image = tifffile.imread(image_file)
    image_ds = cv2.resize(image, (1830, 1830), interpolation=cv2.INTER_LINEAR)
    mask = tifffile.imread(mask_file)
    water_mask = np.isin(mask, [6]).astype(np.uint8)  # water has a class index 6
    lake_mask = water_mask[1000:, :1100]
    lake_area = lake_mask.sum() * 60 * 60 / (1000 * 1000)  # calculate the surface area
    lake_areas.append(lake_area)
    contour, _ = cv2.findContours(water_mask, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    combined = cv2.drawContours(image_ds, contour, -1, (255, 0, 0), 4)
    lake_crop = combined[1000:, :1100]
    cv2.putText(lake_crop, f"{mask_date}", (10,50), cv2.FONT_HERSHEY_SIMPLEX, 1.5, (0, 0, 0), 3, cv2.LINE_AA)
    cv2.putText(lake_crop, f"{lake_area} [sq km]", (10,100), cv2.FONT_HERSHEY_SIMPLEX, 1.5, (0, 0, 0), 3, cv2.LINE_AA)
    overlay_file = overlay_dir + '/' + mask_date + '.png'
    cv2.imwrite(overlay_file, cv2.cvtColor(lake_crop, cv2.COLOR_RGB2BGR))

# Plot water surface area vs. time.
plt.figure(figsize=(20,10))
plt.title('Lake Mead surface area for the 2021.02 - 2022.07 period.', fontsize=20)
plt.xticks(rotation=45)
plt.ylabel('Water surface area [sq km]', fontsize=14)
plt.plot(mask_dates, lake_areas, marker='o')
plt.grid('on')
plt.ylim(240, 320)
for i, v in enumerate(lake_areas):
    plt.text(i, v+2, "%d" %v, ha='center')
plt.show()

We plot the water surface area over time in the following figure. The water surface area clearly decreased between February 2021 and July 2022. In less than 2 years, Lake Mead’s surface area decreased from over 300 km2 to less than 250 km2, an 18% relative change.

import imageio.v2 as imageio
from IPython.display import HTML

frames = []
filenames = glob('./masks/lake_mead_overlay/*.png')
filenames.sort()
for filename in filenames:
    frames.append(imageio.imread(filename))
imageio.mimsave('lake_mead.gif', frames, duration=1)
HTML('<img src="./lake_mead.gif">')

We can also extract the lake’s boundaries and superimpose them over the satellite images to better visualize the changes in the lake’s shoreline. As shown in the following animation, the north and southeast shorelines have shrunk over the last 2 years. In some months, the surface area has reduced by more than 20% year over year.

Lake Mead surface area animation

Conclusion

We have witnessed the impact of climate change on Lake Mead’s shrinking shoreline. SageMaker now supports geospatial machine learning (ML), making it easier for data scientists and ML engineers to build, train, and deploy models using geospatial data. In this post, we showed how to acquire data, perform analysis, and visualize the changes with SageMaker geospatial AI/ML services. You can find the code for this post in the amazon-sagemaker-examples GitHub repo. See the Amazon SageMaker geospatial capabilities to learn more.

References

[1] https://climate.nasa.gov/

[2] https://www.nps.gov/lake/learn/nature/overview-of-lake-mead.htm

[3] https://earthobservatory.nasa.gov/images/150111/lake-mead-keeps-dropping


About the Authors

 Xiong Zhou is a Senior Applied Scientist at AWS. He leads the science team for Amazon SageMaker geospatial capabilities. His current area of research includes computer vision and efficient model training. In his spare time, he enjoys running, playing basketball and spending time with his family.

Anirudh Viswanathan is a Sr Product Manager, Technical – External Services with the SageMaker geospatial ML team. He holds a Masters in Robotics from Carnegie Mellon University, an MBA from the Wharton School of Business, and is named inventor on over 40 patents. He enjoys long-distance running, visiting art galleries and Broadway shows.

Trenton Lipscomb is a Principal Engineer and part of the team that added geospatial capabilities to SageMaker. He has been involved in human in the loop solutions, working on the services SageMaker Ground Truth, Augmented AI and Amazon Mechanical Turk.

Xingjian Shi is a Senior Applied Scientist and part of the team that added geospatial capabilities to SageMaker. He is also working on deep learning for Earth science and multimodal AutoML.

Li Erran Li is the applied science manager at human-in-the-loop services, AWS AI, Amazon. His research interests are 3D deep learning, and vision and language representation learning. Previously he was a senior scientist at Alexa AI, the head of machine learning at Scale AI and the chief scientist at Pony.ai. Before that, he was with the perception team at Uber ATG and the machine learning platform team at Uber working on machine learning for autonomous driving, machine learning systems and strategic initiatives of AI. He started his career at Bell Labs and was adjunct professor at Columbia University. He co-taught tutorials at ICML’17 and ICCV’19, and co-organized several workshops at NeurIPS, ICML, CVPR, ICCV on machine learning for autonomous driving, 3D vision and robotics, machine learning systems and adversarial machine learning. He has a PhD in computer science from Cornell University. He is an ACM Fellow and IEEE Fellow.

Read More

Amplification at the Quantum limit

Amplification at the Quantum limit

The Google Quantum AI team is building quantum computers with superconducting microwave circuits, but, much like a classical computer, the superconducting processor at the heart of these computers is only part of the story. An entire technology stack of peripheral hardware is required to make the quantum computer work properly. In many cases these parts must be custom designed, requiring extensive research and development to reach the highest levels of performance.

In this post, we highlight one aspect of this supplemental hardware: our superconducting microwave amplifiers. In “Readout of a Quantum Processor with High Dynamic Range Josephson Parametric Amplifiers”, published in Applied Physics Letters, we describe how we increased the maximum output power of our superconducting microwave amplifiers by a factor of more than 100. We discuss how this work can pave the way for the operation of larger quantum processor chips with improved performance.

Why microwave amplifiers?

One of the challenges of operating a superconducting quantum processor is measuring the state of a qubit without disturbing its operation. Fundamentally, this comes down to a microwave engineering problem, where we need to be able to measure the energy inside the qubit resonator without exposing it to noisy or lossy wiring. This can be accomplished by adding an additional microwave resonator to the system that is coupled to the qubit, but far from the qubit’s resonance frequency. The resonator acts as a filter that isolates the qubit from the control lines but also picks up a state-dependent frequency shift from the qubit. Just like in the binary phase shift keying (BPSK) encoding technique, the digital state of the qubit (0 or 1) is translated into a phase for a probe tone (microwave signal) reflecting off of this auxiliary resonator. Measuring the phase of this probe tone allows us to infer the state of the qubit without directly interfacing with the qubit itself.

While this sounds simple, the qubit actually imposes a severe cap on how much power can be used for this probe tone. In normal operation, a qubit should be in the 0 state or the 1 state or some superposition of the two. A measurement pulse should collapse the qubit into one of these two states, but using too much power can push it into a higher excited state and corrupt the computation. A safe measurement power is typically around -125 dBm, which amounts to only a handful of microwave photons interacting with the processor during the measurement. Typically, small signals are measured using microwave amplifiers, which increase the signal level, but also add their own noise. How much noise is acceptable? If the measurement process takes too long, the qubit state can change due to energy loss in the circuit. This means that these very small signals must be measured in just a few hundred nanoseconds with very high (>99%) fidelity. We therefore cannot afford to average the signal over a longer time to reduce the noise. Unfortunately, even the best semiconductor low-noise amplifiers are still almost a factor of 10 too noisy.
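
To put that power level in perspective, here is a rough back-of-the-envelope estimate, assuming a readout tone near 6 GHz and a measurement window of roughly 300 ns (illustrative values, not specific numbers from the paper):

P = 10^{-125/10}\ \mathrm{mW} \approx 3.2\times10^{-16}\ \mathrm{W}, \qquad N \approx \frac{P\,\tau}{h f} \approx \frac{3.2\times10^{-16}\ \mathrm{W}\times 300\ \mathrm{ns}}{(6.6\times10^{-34}\ \mathrm{J\,s})(6\ \mathrm{GHz})} \approx 24\ \text{photons}.

Only a few dozen photons carry the entire measurement signal, which is why the amplifier’s added noise matters so much.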

The solution is to design our own custom amplifiers based on the same circuit elements as the qubits themselves. These amplifiers typically consist of Josephson junctions to provide a tunable inductance wired into a superconducting resonant circuit. By constructing a resonant circuit out of these elements, you can create a parametric amplifier where amplification is achieved by modulating the tunable inductance at twice the frequency you want to amplify. Additionally, because all of the wiring is made of lossless superconductors, these devices operate near the quantum limit of added noise, where the only noise in the signal comes from amplifying the zero-point quantum voltage fluctuations.

The one downside to these devices is that the Josephson junctions constrain the power of the signals we can measure. If the signal is too large, the drive current can approach the junction critical current and degrade the amplifier performance. Even if this limit were sufficient to measure a single qubit, our goal was to increase efficiency by measuring up to six qubits at a time using the same amplifier. Some groups get around this limit by making traveling wave amplifiers, where the signals are distributed across thousands of junctions. This increases the saturation power, but the amplifiers become very complicated to produce and take up a lot of space on the chip. Our goal was to create an amplifier that could handle as much power as a traveling wave amplifier but with the same simple and compact design we were used to.

Results

The critical current of each Josephson junction limits our amplifier’s power handling. However, increasing this critical current also changes the inductance and, thus, the operating frequency of the amplifier. To avoid these constraints, we replaced a standard 2-junction DC SQUID with a nonlinear tunable inductor made up of two RF-SQUID arrays in parallel, which we call a snake inductor. Each RF-SQUID consists of a Josephson junction and geometric inductances L1 and L2, and each array contains 20 RF-SQUIDs. In this case, each junction of a standard DC SQUID is replaced by one of these RF-SQUID arrays. While the critical current of each RF-SQUID is much higher, we chain them together to keep the inductance and operating frequency the same. While this is a relatively modest increase in device complexity, it enables us to increase the power handling of each amplifier by roughly a factor of 100. It is also fully compatible with existing designs that use impedance matching circuits to provide large measurement bandwidth.
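
As a rough sketch of why this works, consider the standard small-signal Josephson inductance and the resonance condition (generic textbook relations, not device-specific values from the paper):

L_J \approx \frac{\Phi_0}{2\pi I_c}, \qquad \omega_0 \approx \frac{1}{\sqrt{L_{\mathrm{tot}}\,C}}.

Raising the critical current I_c of a single junction lowers L_J and pushes the operating frequency up. Replacing that junction with a series array of higher-I_c RF-SQUID cells restores the total inductance, and hence the frequency, while keeping the drive current in each cell well below its critical current.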

Circuit diagram of our superconducting microwave amplifier. A split bias coil allows both DC and RF modulation of the snake inductor, while a shunt capacitor sets the frequency range. The flow of current is illustrated in the animation where an applied current (blue) on the bias line causes a circulating current (red) in the snake. A tapered impedance transformer lowers the loaded Q of the device. Since the Q is defined as frequency divided by bandwidth, lowering the Q with a constant frequency increases the bandwidth of the amplifier. Example circuit parameters used for a real device are Cs=6.0 pF, L1=2.6 pH, L2=8.0 pH, Lb=30 pH, M=50 pH, Z0 = 50 Ohms, and Zfinal = 18 ohms. The device operation is illustrated with a small signal (magenta) reflecting off the input of the amplifier. When the large pump tone (blue) is applied to the bias port, it generates amplified versions of the signal (gold) and a secondary tone known as an idler (also gold).
Microscope image of the nonlinear resonator showing the resonant circuit that consists of a large parallel plate capacitor, nonlinear snake inductor, and a current bias transformer to tune the inductance.

We quantify this performance improvement by measuring the saturation power of the amplifier, or the point at which the gain is compressed by 1 dB. We also measure this power value vs. frequency to see how it scales with amplifier gain and distance from the center of the amplifier bandwidth. Since the amplifier gain is symmetric about its center frequency, we measure this in terms of absolute detuning, which is just the absolute value of the difference between the center frequency of the amplifier and the probe tone frequency.

Input and output saturation power (1-dB gain compression point), calibrated using a superconducting quantum processor vs. absolute detuning from the amplifier center frequency.

Conclusion and future directions

The new microwave amplifiers represent a big step forward for our qubit measurement system. They will allow us to measure more qubits using a single device, and enable techniques that require higher power for each measurement tone. However, there are still quite a few areas we would like to explore. For example, we are currently investigating the application of snake inductors in amplifiers with advanced impedance matching techniques, directional amplifiers, and non-reciprocal devices like microwave circulators.

Acknowledgements

We would like to thank the Quantum AI team for the infrastructure and support that enabled the creation and measurement of our microwave amplifier devices. Thanks to our cohort of talented Google Research Interns that contributed to the future work mentioned above: Andrea Iorio for developing algorithms that automatically tune amplifiers and provide a snapshot of the local parameter space, Ryan Kaufman for measuring a new class of amplifiers using multi-pole impedance matching networks, and Randy Kwende for designing and testing a range of parametric devices based on snake inductors. With their contributions, we are gaining a better understanding of our amplifiers and designing the next generation of parametrically-driven devices.

Read More

Crossing Continents: XPENG G9 SUV and P7 Sedan Set Course for Scandinavia, the Netherlands

Crossing Continents: XPENG G9 SUV and P7 Sedan Set Course for Scandinavia, the Netherlands

Electric automaker XPENG’s flagship G9 SUV and P7 sports sedan are now available for order in Sweden, Denmark, Norway and the Netherlands — an expansion revealed last week at the eCar Expo in Stockholm.

The intelligent electric vehicles are built on the high-performance NVIDIA DRIVE Orin centralized compute architecture and deliver AI capabilities that are continuously upgradable through over-the-air software updates.

“This announcement represents a significant milestone as we build our presence in Europe,” said Brian Gu, vice chair and president of XPENG. “We believe both vehicles deliver a new level of sophistication and a people-first mobility experience, and will be the electric vehicles of choice for many European customers.”

Safety Never Takes a Back Seat 

The XPENG G9 and P7 come equipped with XPENG’s proprietary XPILOT Advanced Driver Assistance System, which offers safety, driving and parking support through a variety of smart functions.

The system is supported by 29 sensors, including high-definition radars, ultrasonic sensors, and surround-view and high-perception cameras, enabling the vehicles to safely tackle diverse driving scenarios.

The EVs are engineered to meet the European New Car Assessment Programme’s five-star safety standards, along with the European Union’s stringent whole vehicle type approval certification.

Leading Charge for the Long Haul

The rear-wheel-drive (RWD), long-range version of the XPENG G9 can travel up to 354 miles on a single charge and features a new powertrain system for ultrafast charging, going from 10% to 80% in just 20 minutes. The P7 RWD, long-range model also has optimized charging power to reach 80% in 29 minutes, while offering up to 358 miles on a single charge.

To ensure an easy, fast charging experience, XPENG customers can access more than 400,000 charging stations in Europe through the automaker’s collaboration with major third-party charging operators and mobility service providers.

The XPENG G9 features faster charging and longer range, up to 354 miles on a single charge.

Beauty Backed by Brains and Brawn

With the high compute power found only with DRIVE Orin, the XPENG G9 and P7 advanced driving systems boast superior performance, while sporting sleek and elegant designs, quality craftsmanship and comfort to meet the most discerning of tastes.

The upgraded in-car Xmart operating system (OS) features a new 3D user interface that offers support in English and other European languages, depending on the market. The OS comes with an improved voice assistant that can distinguish voice commands from four zones in the cabin. It also features wide infotainment screens and a library of in-car apps to assist and entertain both the driver and passengers.

The G9 and P7 are available in all-wheel drive (AWD) or RWD configurations. XPENG reports that the G9’s AWD version delivers up to 551 horsepower, and can accelerate from 0 to 100 kilometers per hour in 3.9 seconds, while the upgraded P7 AWD model can do the same in 4.1 seconds.

The XPENG P7 features an immersive cabin experience.

Deliveries of the P7 will begin in June, while the G9 is expected to start in September. To support demand in key European markets, XPENG plans to open delivery and service centers in Lørenskog, Norway, this month — as well as in Badhoevedorp, the Netherlands; Stäket, Sweden; and Hillerød, Denmark in Q2 2023.

The EV maker expects to open additional authorized service locations in other European countries by the end of the year.

Featured image: Next-gen XPENG P7 sports sedan, powered by NVIDIA DRIVE Orin.

Read More

3D Artist Brings Ride and Joy to Automotive Designs With Real-Time Renders Using NVIDIA RTX

3D Artist Brings Ride and Joy to Automotive Designs With Real-Time Renders Using NVIDIA RTX

Designing automotive visualizations can be incredibly time consuming. To make the renders look as realistic as possible, artists need to consider material textures, paints, realistic lighting and reflections, and more.

For 3D artist David Baylis, it’s important to include these details and still create high-resolution renders in a short amount of time. That’s why he uses the NVIDIA RTX A6000 GPU, which allows him to use features like real-time ray tracing so he can quickly get the highest-fidelity image.

The RTX A6000 also enables Baylis to handle massive amounts of data with 48GB of VRAM, which means more GPU memory. In computer graphics, the higher the resolution of the image, the more memory is used. And with the RTX A6000, Baylis can extract more data without worrying about memory limits slowing him down.

Bringing Realistic Details to Life With RTX

To create his automotive visualizations, Baylis starts with 3D modeling in Autodesk 3ds Max software. He’ll set up the scene and work on the car model before importing it to Unreal Engine, where he works on lighting and shading for the final render.

In Unreal Engine, Baylis can experiment with details such as different car paints to see what works best on the 3D model. Seeing all the changes in real time enables Baylis to iterate and experiment with design choices, so he can quickly achieve the look and feel he’s aiming for.

In one of his latest projects, Baylis created a scene with an astounding polycount of more than 50 million triangles. Using the RTX A6000, he could easily move around the scene to see the car from different angles. Even in path-traced mode, the A6000 allows Baylis to maintain high frame rates while switching from one angle to the next.

Rendering at a higher resolution is important to create photorealistic visuals. In the example below, Baylis shows a car model rendered at 4K resolution. But when zoomed in, the graphics start to appear blurry.

When the car is rendered at 12K resolution, the details on the car become sharper. By rendering at higher resolutions, the artist can include extra details to make the car look even more realistic. With the RTX A6000, Baylis said the 12K render took under 10 minutes to complete.

It’s not just the real-time ray tracing and path tracing that help Baylis enhance his designs. There’s another component he said he never thought would make an impact on creative workflows — GPU memory.

The RTX A6000 GPU is equipped with 48GB of VRAM, which allows Baylis to load incredibly high-resolution textures and high-polygon assets. The VRAM is especially helpful for automotive renders because the datasets behind them can be massive.

The large memory of the RTX A6000 allows him to easily manage the data.

“If we throw more polygons into the scene, or if we include more scanned assets, it tends to use a lot of VRAM, but the RTX A6000 can handle all of it,” explained Baylis. “It’s great not having to think about optimizing all those assets in the scene. Instead, we can just scan the data in, even if the assets are 8K, 16K or even 24K resolution.”

When Baylis rendered one still frame at 8K resolution, he saw it only took up 24GB of VRAM. So he pushed the resolution higher to 12K, using almost 35GB of VRAM — with plenty of headroom to spare.

“This is an important feature to highlight, because when people look at new GPUs, they immediately look at benchmarks and how fast it can render things,” said Baylis. “And it’s good if you can render graphics a minute or two faster, but if you really want to take projects to the finish line, you need more VRAM.”

Using NVLink, Baylis can bridge two NVIDIA RTX A6000 GPUs together to scale memory and performance. With one GPU, it takes just about a minute to render a path-traced image of the car. But using dual RTX A6000 GPUs with NVLink, it reduces the render time by almost half. NVLink also combines GPU memory, providing 96 GB VRAM total. This makes Baylis’ animation workflows much faster and easier to manage.

Check out more of Baylis’ work in the video below, and learn more about NVIDIA RTX. And join us at NVIDIA GTC, which takes place March 20-23, to learn more about the latest technologies shaping the future of design and visualization.

Read More

Gather Your Party: GFN Thursday Brings ‘Baldur’s Gate 3’ to the Cloud

Gather Your Party: GFN Thursday Brings ‘Baldur’s Gate 3’ to the Cloud

Venture to the Forgotten Realms this GFN Thursday in Baldur’s Gate 3, streaming on GeForce NOW.

Celebrations for the cloud gaming service’s third anniversary continue with a Dying Light 2 reward that’s to die for. It’s the cherry on top of three new titles joining the GeForce NOW library this week.

Roll for Initiative

Mysterious abilities are awakening inside you. Embrace corruption or fight against darkness itself in Baldur’s Gate 3 (Steam) – a next-generation role-playing game, set in the world of Dungeons and Dragons.

Choose from a wide selection of D&D races and classes, or play as an origin character with a handcrafted background, even on underpowered PCs and Macs. Adventure, loot, battle and romance as you journey through the Forgotten Realms and beyond, even from mobile devices. Play alone and select companions carefully, or as a party of up to four in multiplayer.

Level up to the GeForce NOW Ultimate membership to experience the power of an RTX 4080 in the cloud and all of its benefits, including up to 4K 120 frames per second gameplay on PC and Mac, and ultrawide resolution support for a truly immersive experience.

Dying 2 Celebrate This Anniversary

To celebrate the third anniversary of GeForce NOW, members can now check their accounts to make sure they received the gift of free Dying Light 2 rewards.

Dying Light 2 GeForce NOW Anniversary Reward
You’re all set to survive the post-apocalyptic wasteland with this loadout.

Claim a new in-game outfit dubbed “Post-Apo,” complete with a Rough Duster, Bleak Pants, Well-Worn Boots, Tattered Leather Gauntlets, Dystopian Mask and Spiked Bracers to scavenge around and parkour in. Members who upgrade to Ultimate and Priority memberships can claim extra loot with this haul, including the Patchy Paraglider and Scrap Slicer weapon.

Visit the GeForce NOW Rewards portal to start receiving special offers and in-game goodies.

Welcome to the Weekend

Recipe for Disaster on GeForce NOW
Uh… maybe we should order takeout.

Buckle up for three more games supported in the GeForce NOW library this week.

  • Recipe for Disaster (Free on Epic Games, Feb. 9-16)
  • Baldur’s Gate 3 (Steam)
  • Inside the Backrooms (Steam)

Members continue to celebrate #3YearsOfGFN on our social channels, sharing their favorite cloud gaming devices:

Follow #3YearsOfGFN on Twitter and Facebook all month long and check out this week’s question.

 

Read More

Optimize your machine learning deployments with auto scaling on Amazon SageMaker

Optimize your machine learning deployments with auto scaling on Amazon SageMaker

Machine learning (ML) has become ubiquitous. Our customers are employing ML in every aspect of their business, including the products and services they build, and for drawing insights about their customers.

To build an ML-based application, you have to first build the ML model that serves your business requirement. Building ML models involves preparing the data for training, extracting features, and then training and fine-tuning the model using the features. Next, the model has to be put to work so that it can generate inference (or predictions) from new data, which can then be used in the application. Although you can integrate the model directly into an application, the approach that works well for production-grade applications is to deploy the model behind an endpoint and then invoke the endpoint via a RESTful API call to obtain the inference. In this approach, the model is typically deployed on an infrastructure (compute, storage, and networking) that suits the price-performance requirements of the application. These requirements include the number of inferences that the endpoint is expected to return in a second (called the throughput), how quickly the inference must be generated (the latency), and the overall cost of hosting the model.

Amazon SageMaker makes it easy to deploy ML models for inference at the best price-performance for any use case. It provides a broad selection of ML infrastructure and model deployment options to help meet all your ML inference needs. It is a fully managed service, so you can scale your model deployment, reduce inference costs, manage models more effectively in production, and reduce operational burden. One of the ways to minimize your costs is to provision only as much compute infrastructure as needed to serve the inference requests to the endpoint (also known as the inference workload) at any given time. Because the traffic pattern of inference requests can vary over time, the most cost-effective deployment system must be able to scale out in real time when the workload increases and scale in when the workload decreases. SageMaker supports automatic scaling (auto scaling) for your hosted models. Auto scaling dynamically adjusts the number of instances provisioned for a model in response to changes in your inference workload. When the workload increases, auto scaling brings more instances online. When the workload decreases, auto scaling removes unnecessary instances so that you don’t pay for provisioned instances that you aren’t using.

With SageMaker, you can choose when to auto scale and how many instances to provision or remove to achieve the right availability and cost trade-off for your application. SageMaker supports three auto scaling options. The first and commonly used option is target tracking. In this option, you select an ideal value of an Amazon CloudWatch metric of your choice, such as the average CPU utilization or throughput that you want to achieve as a target, and SageMaker will automatically scale in or scale out the number of instances to achieve the target metric. The second option is to choose step scaling, which is an advanced method for scaling based on the size of the CloudWatch alarm breach. The third option is scheduled scaling, which lets you specify a recurring schedule for scaling your endpoint in and out based on anticipated demand. We recommend that you combine these scaling options for better resilience.

In this post, we provide a design pattern for deriving the right auto scaling configuration for your application. In addition, we provide a list of steps to follow, so even if your application has a unique behavior, such as different system characteristics or traffic patterns, this systematic approach can be applied to determine the right scaling policies. The procedure is further simplified with the use of Inference Recommender, a right-sizing and benchmarking tool built inside SageMaker. However, you can use any other benchmarking tool.

You can review the notebook we used to run this procedure to derive the right deployment configuration for our use case.

SageMaker hosting real-time endpoints and metrics

SageMaker real-time endpoints are ideal for ML applications that need to handle a variety of traffic and respond to requests in real time. The application setup begins with defining the runtime environment, including the containers, ML model, environment variables, and so on in the create-model API, and then defining the hosting details such as instance type and instance count for each variant in the create-endpoint-config API. The endpoint configuration API also allows you to split or duplicate traffic between variants using production and shadow variants. However, for this example, we define scaling policies using a single production variant. After setting up the application, you set up scaling, which involves registering the scaling target and applying scaling policies. Refer to Configuring autoscaling inference endpoints in Amazon SageMaker for more details on the various scaling options.
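
As a minimal sketch of this setup flow with the boto3 SageMaker client (the model name, container image, S3 location, and role below are placeholders, not values from this post):

import boto3

sm_client = boto3.client("sagemaker")

# 1. Define the runtime environment: container image, model artifacts, environment variables.
sm_client.create_model(
    ModelName="xgboost-classifier",                    # hypothetical name
    PrimaryContainer={
        "Image": "<inference-container-image-uri>",    # placeholder
        "ModelDataUrl": "s3://<bucket>/model.tar.gz",  # placeholder
    },
    ExecutionRoleArn="<execution-role-arn>",           # placeholder
)

# 2. Define the hosting details (instance type and count) for a single production variant.
sm_client.create_endpoint_config(
    EndpointConfigName="xgboost-classifier-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "xgboost-classifier",
            "InstanceType": "ml.c5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1.0,
        }
    ],
)

# 3. Create the real-time endpoint from the endpoint configuration.
sm_client.create_endpoint(
    EndpointName="xgboost-classifier-endpoint",
    EndpointConfigName="xgboost-classifier-config",
)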

The following diagram illustrates the application and scaling setup in SageMaker.

Real-time endpoint setup

Endpoint metrics

In order to understand the scaling exercise, it’s important to understand the metrics that the endpoint emits. At a high level, these metrics are categorized into three classes: invocation metrics, latency metrics, and utilization metrics.

The following diagram illustrates these metrics and the endpoint architecture.

Endpoint architecture and its metrics

The following tables elaborate on the details of each metric.

Invocation metrics

  • Invocations – The number of InvokeEndpoint requests sent to a model endpoint. Period: 1 minute. Units: none. Statistics: Sum.
  • InvocationsPerInstance – The number of invocations sent to a model, normalized by InstanceCount in each variant. 1/numberOfInstances is sent as the value on each request, where numberOfInstances is the number of active instances for the variant behind the endpoint at the time of the request. Period: 1 minute. Units: none. Statistics: Sum.
  • Invocation4XXErrors – The number of InvokeEndpoint requests where the model returned a 4xx HTTP response code. Period: 1 minute. Units: none. Statistics: Average, Sum.
  • Invocation5XXErrors – The number of InvokeEndpoint requests where the model returned a 5xx HTTP response code. Period: 1 minute. Units: none. Statistics: Average, Sum.

Latency metrics

  • ModelLatency – The interval of time taken by a model to respond as viewed from SageMaker. This interval includes the local communication time taken to send the request and fetch the response from the container of a model and the time taken to complete the inference in the container. Period: 1 minute. Units: microseconds. Statistics: Average, Sum, Min, Max, Sample Count.
  • OverheadLatency – The interval of time added to the time taken to respond to a client request by SageMaker overheads. This interval is measured from the time SageMaker receives the request until it returns a response to the client, minus the ModelLatency. Overhead latency can vary depending on multiple factors, including request and response payload sizes, request frequency, and authentication or authorization of the request. Period: 1 minute. Units: microseconds. Statistics: Average, Sum, Min, Max, Sample Count.

Utilization metrics

  • CPUUtilization – The sum of each individual CPU core’s utilization. The utilization of each core ranges from 0% to 100%. For example, if there are four CPUs, the CPUUtilization range is 0–400%. Period: 1 minute. Units: percent.
  • MemoryUtilization – The percentage of memory that is used by the containers on an instance. This value ranges from 0–100%. Period: 1 minute. Units: percent.
  • GPUUtilization – The percentage of GPU units that are used by the containers on an instance. The value ranges from 0–100 and is multiplied by the number of GPUs. For example, if there are four GPUs, the GPUUtilization range is 0–400%. Period: 1 minute. Units: percent.
  • GPUMemoryUtilization – The percentage of GPU memory used by the containers on an instance. The value ranges from 0–100 and is multiplied by the number of GPUs. For example, if there are four GPUs, the GPUMemoryUtilization range is 0–400%. Period: 1 minute. Units: percent.
  • DiskUtilization – The percentage of disk space used by the containers on an instance. This value ranges from 0–100%. Period: 1 minute. Units: percent.

Use case overview

We use a simple XGBoost classifier model for our application and have decided to host it on the ml.c5.large instance type. However, the following procedure is independent of the model or deployment configuration, so you can adopt the same approach for your own application and deployment choice. We assume that you already have a desired instance type at the start of this process. If you need assistance in determining the ideal instance type for your application, use the Inference Recommender default job to get instance type recommendations.

Scaling plan

The scaling plan is a three-step procedure, as illustrated in the following diagram:

  • Identify the application characteristics – Knowing the bottlenecks of the application on the selected hardware is an essential part of this.
  • Set scaling expectations – This involves determining the maximum number of requests per second, and how the request pattern will look (whether it will be smooth or spiky).
  • Apply and evaluate – Develop scaling policies based on the application characteristics and scaling expectations. As part of this final step, evaluate the policies by running the load they are expected to handle. We recommend iterating on this last step until the scaling policy can handle the request load.

Scaling Plan

Identify application characteristics

In this section, we discuss the methods to identify application characteristics.

Benchmarking

To derive the right scaling policy, the first step in the plan is to determine application behavior on the chosen hardware. This can be achieved by running the application on a single host and increasing the request load to the endpoint gradually until it saturates. In many cases, after saturation, the endpoint can no longer handle any more requests and performance begins to deteriorate. This can be seen in the endpoint invocation metrics. We also recommend that you review hardware utilization metrics and understand the bottlenecks, if any. For CPU instances, the bottleneck can be in the CPU, memory, or disk utilization metrics, while for GPU instances, the bottleneck can be in GPU utilization and its memory. We discuss invocations and utilization metrics on ml.c5.large hardware in the following section. It’s also important to remember that CPU utilization is aggregated across all cores, so it is reported on a 0–200% scale for the two-core ml.c5.large machine.

For benchmarking, we use the Inference Recommender default job. By default, Inference Recommender default jobs benchmark with multiple instance types. However, you can narrow down the search to your chosen instance type by passing it in the list of supported instances. The service then provisions the endpoint, gradually increases the request load, and stops when the benchmark reaches saturation or when the endpoint invoke API call fails for 1% of the requests. The hosting metrics can be used to determine the hardware bounds and set the right scaling limit. If there is a hardware bottleneck, we recommend that you scale up to a larger instance size in the same family or change the instance family entirely.

The following diagram illustrates the architecture of benchmarking using Inference Recommender.

Benchmarking using Inference recommender

Use the following code:

def trigger_inference_recommender(model_url, payload_url, container_url, instance_type, execution_role, framework,
                                  framework_version, domain="MACHINE_LEARNING", task="OTHER", model_name="classifier",
                                  mime_type="text/csv"):
    model_package_arn = create_model_package(model_url, payload_url, container_url, instance_type,
                                             framework, framework_version, domain, task, model_name, mime_type)
    job_name = create_inference_recommender_job(model_package_arn, execution_role)
    wait_for_job_completion(job_name)
    return job_name

Analyze the result

We then analyze the results of the recommendation job using endpoint metrics. From the following hardware utilization graph, we confirm that the hardware limits are within bounds. Furthermore, the CPUUtilization line increases proportionally to the request load, so it is necessary to set scaling limits on CPU utilization as well.

Utilization metrics

From the following figure, we confirm that the invocation rate flattens after it reaches its peak.

Invocations and latency metrics

Next, we move on to the invocations and latency metrics for setting the scaling limit.

Find scaling limits

In this step, we run various scaling percentages to find the right scaling limit. As a general scaling rule, the hardware utilization percentage should be around 40% if you’re optimizing for availability, around 70% if you’re optimizing for cost, and around 50% if you want to balance availability and cost. This guidance trades off the two dimensions: the lower the threshold, the better the availability; the higher the threshold, the lower the cost. In the following figure, we plotted the graph with 55% as the upper limit and 45% as the lower limit for the invocation metrics. The top graph shows invocations and latency metrics; the bottom graph shows utilization metrics.

Invocations & latency metrics (top), Utilization metrics (bottom) with scaling limit of 45%-55%

You can use the following sample code to change the percentages and see what the limits are for the invocations, latency, and utilization metrics. We highly recommend that you play around with percentages and find the best fit based on your metrics.

def analysis_inference_recommender_result(job_name, index=0, 
                                          upper_threshold=80.0, lower_threshold=65.0):

Because we want to balance availability and cost in this example, we decided to target 50% CPU utilization per core. Because we selected a two-core machine, the aggregate CPU utilization metric goes up to 200%, so we set a threshold of 100% for CPUUtilization (50% for each of the two cores). In addition to the utilization threshold, we also set the InvocationsPerInstance threshold to 5,000. The value for InvocationsPerInstance is derived by overlaying CPUUtilization = 100% over the invocations graph.
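
The threshold arithmetic is small enough to write down explicitly; the following sketch simply restates the choices above in code (no API calls involved):

# ml.c5.large has 2 cores, so the aggregate CPUUtilization metric is reported on a 0-200% scale.
num_cores = 2
target_per_core = 0.50  # balancing availability and cost

cpu_utilization_threshold = target_per_core * num_cores * 100  # -> 100 (percent, aggregate)

# Read off the invocations graph at the point where aggregate CPUUtilization crosses ~100%.
invocations_per_instance_threshold = 5000  # invocations per instance per minute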

As part of step 1 of the scaling plan (shown in the following figure), we benchmarked the application using the Inference Recommender default job, analyzed the results, and determined the scaling limit based on cost and availability.

Identify application characteristics

Set scaling expectations

The next step is to set expectations and develop scaling policies based on them. This step involves defining the maximum and minimum requests to be served, as well as additional details, such as the maximum request growth the application should handle and whether the traffic pattern will be smooth or spiky. Data like this helps define the expectations and helps you develop a scaling policy that meets your demand.

The following diagram illustrates an example traffic pattern.

Traffic pattern

For our application, the expectations are maximum requests per second (max) = 500 and minimum requests per second (min) = 70.

Based on these expectations, we define MinCapacity and MaxCapacity using the following formula. For the following calculations, we normalize InvocationsPerInstance to seconds because it is reported per minute. Additionally, we define a growth factor, which is the amount of additional capacity that you are willing to add when the load exceeds the maximum requests per second. The growth_factor should always be greater than 1, and it is essential in planning for additional growth.

MinCapacity = ceil(min / (InvocationsPerInstance / 60))
MaxCapacity = ceil((max / (InvocationsPerInstance / 60)) * growth_factor)

In the end, we arrive at MinCapacity = 1 and MaxCapacity = 8 (with 20% as growth factor), and we plan to handle a spiky traffic pattern.
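
A short sketch of this calculation with the numbers above (InvocationsPerInstance = 5,000 per minute, min = 70, max = 500, 20% growth factor) reproduces those values:

import math

invocations_per_instance = 5000                        # per minute, from step 1
per_second_capacity = invocations_per_instance / 60    # ~83.3 requests per second per instance

min_rps, max_rps = 70, 500
growth_factor = 1.2                                    # 20% headroom for growth

min_capacity = math.ceil(min_rps / per_second_capacity)                   # -> 1
max_capacity = math.ceil(max_rps / per_second_capacity * growth_factor)   # -> 8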

Set expectations

Define scaling policies and verify

The final step is to define a scaling policy and evaluate its impact. The evaluation serves to validate the results of the calculations made so far. In addition, it helps us adjust the scaling setting if it doesn’t meet our needs. The evaluation is done using the Inference Recommender advanced job, where we specify the traffic pattern, MaxInvocations, and endpoint to benchmark against. In this case, we provision the endpoint and set the scaling policies, then run the Inference Recommender advanced job to validate the policy.

Target tracking

It is recommended to set up target tracking based on InvocationsPerInstance. The thresholds have already been defined in step 1, so we set the CPUUtilization threshold to 100 and the InvocationsPerInstance threshold to 5,000. First, we define a scaling policy based on InvocationsPerInstance, and then we create a scaling policy that relies on CPU utilization.

As in the sample notebook, we use the following functions to register and set scaling policies:

import time

import boto3

# Application Auto Scaling client used to register and apply the scaling policies
# (assumed to be available throughout the notebook).
aas_client = boto3.client("application-autoscaling")


def set_target_scaling_on_invocation(endpoint_name, variant_name, target_value,
                                     scale_out_cool_down=10,
                                     scale_in_cool_down=100):
    policy_name = 'target-tracking-invocations-{}'.format(str(round(time.time())))
    resource_id = "endpoint/{}/variant/{}".format(endpoint_name, variant_name)
    response = aas_client.put_scaling_policy(
        PolicyName=policy_name,
        ServiceNamespace='sagemaker',
        ResourceId=resource_id,
        ScalableDimension='sagemaker:variant:DesiredInstanceCount',
        PolicyType='TargetTrackingScaling',
        TargetTrackingScalingPolicyConfiguration={
            'TargetValue': target_value,
            'PredefinedMetricSpecification': {
                'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
            },
            'ScaleOutCooldown': scale_out_cool_down,
            'ScaleInCooldown': scale_in_cool_down,
            'DisableScaleIn': False
        }
    )
    return policy_name, response


def set_target_scaling_on_cpu_utilization(endpoint_name, variant_name, target_value,
                                          scale_out_cool_down=10,
                                          scale_in_cool_down=100):
    policy_name = 'target-tracking-cpu-util-{}'.format(str(round(time.time())))
    resource_id = "endpoint/{}/variant/{}".format(endpoint_name, variant_name)
    response = aas_client.put_scaling_policy(
        PolicyName=policy_name,
        ServiceNamespace='sagemaker',
        ResourceId=resource_id,
        ScalableDimension='sagemaker:variant:DesiredInstanceCount',
        PolicyType='TargetTrackingScaling',
        TargetTrackingScalingPolicyConfiguration={
            'TargetValue': target_value,
            'CustomizedMetricSpecification':
            {
                'MetricName': 'CPUUtilization',
                'Namespace': '/aws/sagemaker/Endpoints',
                'Dimensions': [
                    {'Name': 'EndpointName', 'Value': endpoint_name},
                    {'Name': 'VariantName', 'Value': variant_name}
                ],
                'Statistic': 'Average',
                'Unit': 'Percent'
            },
            'ScaleOutCooldown': scale_out_cool_down,
            'ScaleInCooldown': scale_in_cool_down,
            'DisableScaleIn': False
        }
    )
    return policy_name, response
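
Before these target tracking policies can act on the endpoint, the variant must also be registered as a scalable target with the MinCapacity and MaxCapacity derived in step 2. A minimal sketch, assuming the same aas_client and the capacity values from our example:

def register_scaling_target(endpoint_name, variant_name, min_capacity=1, max_capacity=8):
    resource_id = "endpoint/{}/variant/{}".format(endpoint_name, variant_name)
    return aas_client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=min_capacity,
        MaxCapacity=max_capacity,
    )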

Because we need to handle spiky traffic patterns, the sample notebook uses ScaleOutCooldown = 10 and ScaleInCooldown = 100 as the cooldown values. As we evaluate the policy in the next step, we plan to adjust the cooldown period (if needed).

Evaluation target tracking

As described earlier, the evaluation uses an Inference Recommender advanced job, where we specify the traffic pattern, MaxInvocations, and the endpoint to benchmark against. With the endpoint provisioned and the scaling policies in place, we run the advanced job to validate the policy.

from inference_recommender import trigger_inference_recommender_evaluation_job
from result_analysis import analysis_evaluation_result

eval_job = trigger_inference_recommender_evaluation_job(model_package_arn=model_package_arn, 
                                                        execution_role=role, 
                                                        endpoint_name=endpoint_name, 
                                                        instance_type=instance_type,
                                                        max_invocations=max_tps*60, 
                                                        max_model_latency=10000, 
                                                        spawn_rate=1)

print ("Evaluation job = {}, EndpointName = {}".format(eval_job, endpoint_name))

# In the next step, we will visualize the cloudwatch metrics and verify if we reach 30000 invocations.
max_value = analysis_evaluation_result(endpoint_name, variant_name, job_name=eval_job)

print("Max invocation realized = {}, and the expecation is {}".format(max_value, 30000))

Following benchmarking, we visualized the invocations graph to understand how the system responds to scaling policies. The scaling policy that we established can handle the requests and can reach up to 30,000 invocations without error.

Scaling endpoint with Target tracking

Now, let’s consider what happens if we triple the rate of new users. Does the same policy still apply? We can rerun the same evaluation with a higher request rate by setting the spawn rate (the number of additional users per minute) to 3, as sketched below.
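
A sketch of that rerun, reusing the same notebook helper with a higher spawn rate:

# Same advanced evaluation job as before, but ramping up three additional users per minute.
eval_job_spiky = trigger_inference_recommender_evaluation_job(model_package_arn=model_package_arn,
                                                              execution_role=role,
                                                              endpoint_name=endpoint_name,
                                                              instance_type=instance_type,
                                                              max_invocations=max_tps*60,
                                                              max_model_latency=10000,
                                                              spawn_rate=3)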

Scaling endpoint with spawn rate=3

With the above result, we confirm that the current auto scaling policy can handle even this more aggressive traffic pattern.

Step scaling

In addition to target tracking, we also recommend using step scaling to have better control over aggressive traffic. Therefore, we defined an additional step scaling policy with scaling adjustments to handle spiky traffic.

def set_step_scaling(endpoint_name, variant_name):
    policy_name = 'step-scaling-{}'.format(str(round(time.time())))
    resource_id = "endpoint/{}/variant/{}".format(endpoint_name, variant_name)
    response = aas_client.put_scaling_policy(
        PolicyName=policy_name,
        ServiceNamespace='sagemaker',
        ResourceId=resource_id,
        ScalableDimension='sagemaker:variant:DesiredInstanceCount',
        PolicyType='StepScaling',
        StepScalingPolicyConfiguration={
            'AdjustmentType': 'ChangeInCapacity',
            'StepAdjustments': [
                {
                    'MetricIntervalLowerBound': 0.0,
                    'MetricIntervalUpperBound': 5.0,
                    'ScalingAdjustment': 1
                },
                {
                    'MetricIntervalLowerBound': 5.0,
                    'MetricIntervalUpperBound': 80.0,
                    'ScalingAdjustment': 3
                },
                {
                    'MetricIntervalLowerBound': 80.0,
                    'ScalingAdjustment': 4
                },
            ],
            'MetricAggregationType': 'Average'
        },
    )
    return policy_name, response

Evaluation step scaling

We then follow the same evaluation procedure, and after the benchmark we confirm that the scaling policy can handle the spiky traffic pattern and reach 30,000 invocations without any errors.

Scaling endpoint with step scaling

Defining the scaling policies and evaluating the results with Inference Recommender is therefore a necessary part of validation.

Evaluation

Further tuning

In this section, we discuss further tuning options.

Multiple scaling options

As shown in our use case, you can pick multiple scaling policies that meet your needs. In addition to the options mentioned previously, you should also consider scheduled scaling if you can forecast the traffic for a period of time. The combination of scaling policies is powerful and should be evaluated using benchmarking tools like Inference Recommender.
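
As a minimal sketch of a scheduled action with the Application Auto Scaling API (assuming the same aas_client and resource_id as earlier; the schedule, capacities, and action name are illustrative, not values from this post):

# Raise the capacity floor ahead of an anticipated weekday traffic peak (hypothetical schedule).
aas_client.put_scheduled_action(
    ServiceNamespace="sagemaker",
    ScheduledActionName="weekday-morning-peak",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    Schedule="cron(0 8 ? * MON-FRI *)",
    ScalableTargetAction={"MinCapacity": 4, "MaxCapacity": 8},
)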

Scale up or down

SageMaker Hosting offers over 100 instance types to host your model. Your traffic load may be limited by the hardware you have chosen, so consider other hosting hardware. For example, if you want a system to handle 1,000 requests per second, scale up instead of out. Accelerator instances such as G5 and Inf1 can process higher numbers of requests on a single host. Scaling up and down can provide better resilience for some traffic needs than scaling in and out.

Custom metrics

In addition to InvocationsPerInstance and other SageMaker hosting metrics, you can also define custom metrics for scaling your application. Any custom metric used for scaling should reflect the load of the system: it should increase in value when utilization is high and decrease otherwise. Custom metrics can capture the load at a finer granularity and help in defining custom scaling policies.
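
As a sketch, such a metric could be published to CloudWatch by the application and then referenced in a target tracking policy through CustomizedMetricSpecification, just like the CPUUtilization example earlier (the namespace and metric name here are hypothetical):

import boto3

cw_client = boto3.client("cloudwatch")

# Publish a hypothetical application-level load metric: outstanding requests per instance.
cw_client.put_metric_data(
    Namespace="MyApplication/Inference",
    MetricData=[{
        "MetricName": "BacklogPerInstance",
        "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
        "Value": 12.0,
        "Unit": "Count",
    }],
)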

Adjusting scaling alarm

When you define a scaling policy, CloudWatch alarms are created for scaling, and these alarms are used for scale-in and scale-out. These alarms fire after a default number of data points breach the threshold. If you want to alter the number of data points required, you can do so. Nevertheless, after any update to the scaling policies, we recommend evaluating the policy with a benchmarking tool against the load it should handle.
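
To see which alarms a policy created and how many data points they currently require, you can read them back from CloudWatch. A minimal sketch, assuming response is the dictionary returned by put_scaling_policy in the helper functions shown earlier:

import boto3

cw_client = boto3.client("cloudwatch")

# put_scaling_policy returns the scale-out and scale-in alarms it created for the policy.
alarm_names = [alarm["AlarmName"] for alarm in response["Alarms"]]

for alarm in cw_client.describe_alarms(AlarmNames=alarm_names)["MetricAlarms"]:
    print(alarm["AlarmName"], alarm["EvaluationPeriods"], alarm.get("DatapointsToAlarm"))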

Scaling alarms

Conclusion

The process of defining the scaling policy for your application can be challenging. You must understand the characteristics of the application, determine your scaling needs, and iterate on scaling policies to meet those needs. This post reviewed each of these steps and explained the approach you should take at each one. You can determine your application characteristics and evaluate scaling policies using the Inference Recommender benchmarking system. The proposed design pattern can help you create, within hours rather than days, a scalable application that accounts for the availability and cost requirements of your application.


About the Authors

Mohan Gandhi is a Senior Software Engineer at AWS. He has been with AWS for the last 10 years and has worked on various AWS services like EMR, EFA and RDS. Currently, he is focused on improving the SageMaker Inference Experience. In his spare time, he enjoys hiking and marathons.

Vikram Elango is an AI/ML Specialist Solutions Architect at Amazon Web Services, based in Virginia, USA. Vikram helps financial and insurance industry customers with design and thought leadership to build and deploy machine learning applications at scale. He is currently focused on natural language processing, responsible AI, inference optimization and scaling ML across the enterprise. In his spare time, he enjoys traveling, hiking, cooking and camping with his family.

Venkatesh Krishnan leads Product Management for Amazon SageMaker in AWS. He is the product owner for a portfolio of SageMaker services that enable customers to deploy machine learning models for Inference. Earlier he was the Head of Product, Integrations and the lead product manager for Amazon AppFlow, a new AWS service that he helped build from the ground up. Before joining Amazon in 2018, Venkatesh served in various research, engineering, and product roles at Qualcomm, Inc. He holds a PhD in Electrical and Computer Engineering from Georgia Tech and an MBA from UCLA’s Anderson School of Management.

Read More