Create high-quality data for ML models with Amazon SageMaker Ground Truth

Machine learning (ML) has improved business across industries in recent years, from the recommendation system on your Prime Video account to document summarization and efficient search with Alexa's voice assistant. However, the question remains of how to incorporate this technology into your business. Unlike traditional rule-based methods, ML automatically infers patterns from data in order to perform your task of interest. Although this bypasses the need to curate rules for automation, it also means that ML models can only be as good as the data on which they're trained. Data creation, however, is often a challenging task. At the Amazon Machine Learning Solutions Lab, we've repeatedly encountered this problem and want to ease this journey for our customers. If you want to offload this process entirely, you can use Amazon SageMaker Ground Truth Plus.

By the end of this post, you’ll be able to achieve the following:

  • Understand the business processes involved in setting up a data acquisition pipeline
  • Identify AWS Cloud services for supporting and expediting your data labeling pipeline
  • Run a data acquisition and labeling task for custom use cases
  • Create high-quality data following business and technical best practices

Throughout this post, we focus on the data creation process and rely on AWS services to handle the infrastructure and process components. Namely, we use Amazon SageMaker Ground Truth to handle the labeling infrastructure pipeline and user interface. This service uses a point-and-go approach to collect your data from Amazon Simple Storage Service (Amazon S3) and set up a labeling workflow. For labeling, it provides you with the built-in flexibility to acquire data labels using your private team, an Amazon Mechanical Turk workforce, or your preferred labeling vendor from AWS Marketplace. Lastly, you can use AWS Lambda and Amazon SageMaker notebooks to process, visualize, or quality control the data, either pre- or post-labeling.
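
If your technical stakeholders prefer to script the setup rather than use the console, a labeling job can also be created programmatically. The following is a minimal sketch using the AWS SDK for Python (Boto3) for a bounding box task with a private workteam; the bucket names, role, workteam ARN, and template path are placeholders, and the pre-annotation and consolidation Lambda ARNs must be the Region-specific ARNs documented for your chosen built-in task type.

import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_labeling_job(
    LabelingJobName="product-images-bbox-pilot-1",
    LabelAttributeName="bounding-box",
    InputConfig={
        "DataSource": {
            "S3DataSource": {"ManifestS3Uri": "s3://my-labeling-bucket/manifests/input.manifest"}
        }
    },
    OutputConfig={"S3OutputPath": "s3://my-labeling-bucket/output/"},
    RoleArn="arn:aws:iam::111122223333:role/GroundTruthExecutionRole",
    LabelCategoryConfigS3Uri="s3://my-labeling-bucket/config/label-categories.json",
    HumanTaskConfig={
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:111122223333:workteam/private-crowd/my-team",
        "UiConfig": {"UiTemplateS3Uri": "s3://my-labeling-bucket/templates/bounding-box.liquid.html"},
        # Placeholder ARNs: use the AWS-provided pre-annotation and annotation
        # consolidation Lambda ARNs for your task type and Region.
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:432418664414:function:PRE-BoundingBox",
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-1:432418664414:function:ACS-BoundingBox"
        },
        "TaskTitle": "Draw a box around each object of interest",
        "TaskDescription": "Draw a tight bounding box around every object described in the instructions.",
        "NumberOfHumanWorkersPerDataObject": 3,
        "TaskTimeLimitInSeconds": 300,
    },
)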

Now that all of the pieces have been laid down, let’s start the process!

The data creation process

Contrary to common intuition, the first step for data creation is not data collection. Working backward from the users to articulate the problem is crucial. For example, what do users care about in the final artifact? Where do experts believe the signals relevant to the use case reside in the data? What information about the use case environment could be provided to the model? If you don't know the answers to those questions, don't worry. Give yourself some time to talk with users and field experts to understand the nuances. This initial understanding will orient you in the right direction and set you up for success.

For this post, we assume that you have covered this initial process of user requirement specification. The next three sections walk you through the subsequent process of creating quality data: planning, source data creation, and data annotation. Piloting loops at the data creation and annotation steps are vital for ensuring the efficient creation of labeled data. This involves iterating between data creation, annotation, quality assurance, and updating the pipeline as necessary.

The following figure provides an overview of the steps required in a typical data creation pipeline. You can work backward from the use case to identify the data that you need (Requirements Specification), build a process to obtain the data (Planning), implement the actual data acquisition process (Data Collection and Annotation), and assess the results. Pilot runs, highlighted with dashed lines, let you iterate on the process until a high-quality data acquisition pipeline has been developed.

In a typical data creation pipeline, you go through requirements specification for the use case in scope, plan for the data creation process, implement the process for data collection and labeling, and evaluate the results against the original requirements specification. Successive iterations of this workflow enable refinement of the pipeline.

Overview of steps required in a typical data creation pipeline.

Planning

A standard data creation process can be time-consuming and a waste of valuable human resources if conducted inefficiently. Why would it be time-consuming? To answer this question, we must understand the scope of the data creation process. To assist you, we have collected a high-level checklist and description of key components and stakeholders that you must consider. Answering these questions can be difficult at first. Depending on your use case, only some of these may be applicable.

  • Identify the legal point of contact for required approvals – Using data for your application can require license or vendor contract review to ensure compliance with company policies and use cases. It’s important to identify your legal support throughout the data acquisition and annotation steps of the process.
  • Identify the security point of contact for data handling – Leakage of purchased data might result in serious fines and repercussions for your company. It’s important to identify your security support throughout the data acquisition and annotation steps to ensure secure practices.
  • Detail use case requirements and define source data and annotation guidelines – Creating and annotating data is difficult due to the high specificity required. Stakeholders, including data generators and annotators, must be completely aligned to avoid wasting resources. To this end, it’s common practice to use a guidelines document that specifies every aspect of the annotation task: exact instructions, edge cases, an example walkthrough, and so on.
  • Align on expectations for collecting your source data – Consider the following:

    • Conduct research on potential data sources – For example, public datasets, existing datasets from other internal teams, self-collected, or purchased data from vendors.
    • Perform quality assessment – Create an analysis pipeline with relation to the final use case.
  • Align on expectations for creating data annotations – Consider the following:

    • Identify the technical stakeholders – This is usually an individual or team in your company capable of using the technical documentation regarding Ground Truth to implement an annotation pipeline. These stakeholders are also responsible for quality assessment of the annotated data to make sure that it meets the needs of your downstream ML application.
    • Identify the data annotators – These individuals use predetermined instructions to add labels to your source data within Ground Truth. They may need to possess domain knowledge depending on your use case and annotation guidelines. You can use a workforce internal to your company, or pay for a workforce managed by an external vendor.
  • Ensure oversight of the data creation process – As you can see from the preceding points, data creation is a detailed process that involves numerous specialized stakeholders. Therefore, it’s crucial to monitor it end to end toward the desired outcome. Having a dedicated person or team oversee the process can help you ensure a cohesive, efficient data creation process.

Depending on the route that you decide to take, you must also consider the following:

  • Create the source dataset – This refers to instances when existing data isn’t suitable for the task at hand, or legal constraints prevent you from using it, so internal teams or external vendors (see the next point) must create it. This is often the case for highly specialized domains or areas with little public research, such as a physician’s common questions, garment lay down, or sports expertise.
  • Research vendors and conduct an onboarding process – When external vendors are used, a contracting and onboarding process must be set in place between both entities.

In this section, we reviewed the components and stakeholders that we must consider. However, what does the actual process look like? In the following figure, we outline a process workflow for data creation and annotation. The iterative approach uses small batches of data called pilots to decrease turnaround time, detect errors early on, and avoid wasting resources in the creation of low-quality data. We describe these pilot rounds later in this post. We also cover some best practices for data creation, annotation, and quality control.

The following figure illustrates the iterative development of a data creation pipeline. Vertically, we find the data sourcing block (green) and the annotation block (blue). Both blocks have independent pilot rounds (Data creation/Annotation, QAQC, and Update). Increasingly higher-quality source data is created and can be used to construct increasingly higher-quality annotations.

During the iterative development of a data creation or annotation pipeline, small batches of data are used for independent pilots. Each pilot round has a data creation or annotation phase, some quality assurance and quality control of the results, and an update step to refine the process. After these processes are finessed through successive pilots, you can proceed to large-scale data creation and annotation.

Overview of iterative development in a data creation pipeline.

Source data creation

The input creation process revolves around staging your items of interest, which depend on your task type. These could be images (newspaper scans), videos (traffic scenes), 3D point clouds (medical scans), or simply text (subtitle tracks, transcriptions). In general, when staging your task-related items, make sure of the following:

  • Reflect the real-world use case for the eventual AI/ML system – The setup for collecting images or videos for your training data should closely match the setup for your input data in the real-world application. This means having consistent placement surfaces, lighting sources, or camera angles.
  • Account for and minimize variability sources – Consider the following:

    • Develop best practices for maintaining data collection standards – Depending on the granularity of your use case, you may need to specify requirements to guarantee consistency among your data points. For example, if you’re collecting image or video data from single camera points, you may need to make sure of the consistent placement of your objects of interest, or require a quality check for the camera before a data capture round. This can avoid issues like camera tilt or blur, and minimize downstream overheads like removing out-of-frame or blurry images, as well as needing to manually center the image frame on your area of interest.
    • Pre-empt test time sources of variability – If you anticipate variability in any of the attributes mentioned so far during test time, make sure that you can capture those variability sources during training data creation. For example, if you expect your ML application to work in multiple different light settings, you should aim to create training images and videos at various light settings. Depending on the use case, variability in camera positioning can also influence the quality of your labels.
  • Incorporate prior domain knowledge when available – Consider the following:

    • Inputs on sources of error – Domain practitioners can provide insights into sources of error based on their years of experience. They can provide feedback on the best practices for the previous two points: What settings reflect the real-world use case best? What are the possible sources of variability during data collection, or at the time of use?
    • Domain-specific data collection best practices – Although your technical stakeholders may already have a good idea of the technical aspects to focus on in the images or videos collected, domain practitioners can provide feedback on how best to stage or collect the data such that these needs are met.

Quality control and quality assurance of the created data

Now that you have set up the data collection pipeline, it might be tempting to go ahead and collect as much data as possible. Wait a minute! We must first check if the data collected through the setup is suitable for your real-world use case. We can use some initial samples and iteratively improve the setup through the insights gained from analyzing that sample data. Work closely with your technical, business, and annotation stakeholders during the pilot process. This will make sure that your resultant pipeline meets business needs while generating ML-ready labeled data with minimal overheads.

Annotations

The annotation of inputs is where we add the magic touch to our data—the labels! Depending on your task type and data creation process, you may need manual annotators, or you can use off-the-shelf automated methods. The data annotation pipeline itself can be a technically challenging task. Ground Truth eases this journey for your technical stakeholders with its built-in repertoire of labeling workflows for common data sources. With a few additional steps, it also enables you to build custom labeling workflows beyond preconfigured options.

Ask yourself the following questions when developing a suitable annotation workflow:

  • Do I need a manual annotation process for my data? In some cases, automated labeling services may be sufficient for the task at hand. Reviewing the documentation and available tools can help you identify if manual annotation is necessary for your use case (for more information, see What is data labeling?). The data creation process can allow for varying levels of control regarding the granularity of your data annotation. Depending on this process, you can also sometimes bypass the need for manual annotation. For more information, refer to Build a custom Q&A dataset using Amazon SageMaker Ground Truth to train a Hugging Face Q&A NLU model.
  • What forms my ground truth? In most cases, the ground truth will come from your annotation process—that’s the whole point! In others, the user may have access to ground truth labels. This can significantly speed up your quality assurance process, or reduce the overhead required for multiple manual annotations.
  • What is the upper bound for the amount of deviation from my ground truth state? Work with your end-users to understand the typical errors around these labels, the sources of such errors, and the desired reduction in errors. This will help you identify which aspects of the labeling task are most challenging or are likely to have annotation errors.
  • Are there preexisting rules used by the users or field practitioners to label these items? Use and refine these guidelines to build a set of instructions for your manual annotators.

Piloting the input annotation process

When piloting the input annotation process, consider the following:

  • Review the instructions with the annotators and field practitioners – Instructions should be concise and specific. Ask for feedback from your users (Are the instructions accurate? Can we revise any instructions to make sure that they are understandable by non-field practitioners?) and annotators (Is everything understandable? Is the task clear?). If possible, add an example of good and bad labeled data to help your annotators identify what is expected, and what common labeling errors might look like.
  • Collect data for annotations – Review the data with your customer to make sure that it meets the expected standards, and to align on expected outcomes from the manual annotation.
  • Provide examples to your pool of manual annotators as a test run – What is the typical variance among the annotators in this set of examples? Study the variance for each annotation within a given image to identify the consistency trends among annotators. Then compare the variances across the images or video frames to identify which labels are challenging to place.

Quality control of the annotations

Annotation quality control has two main components: assessing consistency between the annotators, and assessing the quality of the annotations themselves.

You can assign multiple annotators to the same task (for example, three annotators label the key points on the same image), and measure the average value alongside the standard deviation of these labels among the annotators. Doing so helps you identify any outlier annotations (incorrect label used, or label far away from the average annotation), which can guide actionable outcomes, such as refining your instructions or providing further training to certain annotators.
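
The following is a minimal sketch of this kind of consistency check, assuming you have already parsed the Ground Truth output manifest into per-annotator keypoint coordinates for one image; the data layout and outlier threshold here are illustrative only.

import numpy as np

# Hypothetical layout: each annotator's (x, y) label for the same keypoint in one image.
annotations = {
    "annotator_a": (412.0, 300.0),
    "annotator_b": (418.0, 296.0),
    "annotator_c": (455.0, 340.0),
}

points = np.array(list(annotations.values()))
mean_point = points.mean(axis=0)
std_point = points.std(axis=0)
print(f"mean label: {mean_point}, standard deviation: {std_point}")

# Simple outlier rule: flag annotators whose label is far from the average annotation.
distances = np.linalg.norm(points - mean_point, axis=1)
threshold = distances.mean() + distances.std()
for name, dist in zip(annotations, distances):
    if dist > threshold:
        print(f"{name} deviates from the mean by {dist:.1f} pixels; review this label")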

Assessing the quality of the annotations themselves is tied to annotator variability and, when available, to input from domain experts or ground truth information. Are there certain labels (across all of your images) where the average variance between annotators is consistently high? Are any labels far off from your expectations of where they should be, or what they should look like?

Based on our experience, a typical quality control loop for data annotation can look like this:

  • Iterate on the instructions or image staging based on results from the test run – Are any objects occluded, or does image staging not match the expectations of annotators or users? Are the instructions misleading, or did you miss any labels or common errors in your exemplar images? Can you refine the instructions for your annotators?
  • If you are satisfied that you have addressed any issues from the test run, do a batch of annotations – For testing the results from the batch, follow the same quality assessment approach of assessing inter-annotator and inter-image label variabilities.

Conclusion

This post serves as a guide for business stakeholders to understand the complexities of data creation for AI/ML applications. The processes described also serve as a guide for technical practitioners to generate quality data while optimizing business constraints such as personnel and costs. If not done well, a data creation and labeling pipeline can take upwards of 4–6 months.

With the guidelines and suggestions outlined in this post, you can preempt roadblocks, reduce time to completion, and minimize the costs in your journey toward creating high-quality data.


About the authors

Jasleen Grewal is an Applied Scientist at Amazon Web Services, where she works with AWS customers to solve real world problems using machine learning, with special focus on precision medicine and genomics. She has a strong background in bioinformatics, oncology, and clinical genomics. She is passionate about using AI/ML and cloud services to improve patient care.

Boris Aronchik is a Manager in the Amazon AI Machine Learning Solutions Lab, where he leads a team of ML scientists and engineers to help AWS customers realize business goals leveraging AI/ML solutions.

Miguel Romero Calvo is an Applied Scientist at the Amazon ML Solutions Lab where he partners with AWS internal teams and strategic customers to accelerate their business through ML and cloud adoption.

Lin Lee Cheong is a Senior Scientist and Manager with the Amazon ML Solutions Lab team at Amazon Web Services. She works with strategic AWS customers to explore and apply artificial intelligence and machine learning to discover new insights and solve complex problems.

Read More

Automate your time series forecasting in Snowflake using Amazon Forecast

This post is a joint collaboration with Andries Engelbrecht and James Sun of Snowflake, Inc.

The cloud computing revolution has enabled businesses to capture and retain corporate and organizational data without capacity planning or data retention constraints. Now, with diverse and vast reserves of longitudinal data, companies are increasingly able to find novel and impactful ways to use their digital assets to make better-informed short-term and long-term planning decisions. Time series forecasting is a unique and essential science that allows companies to make surgical planning decisions to help balance customer service levels against often competing goals of optimal profitability.

At AWS, we sometimes work with customers who have selected our technology partner Snowflake to deliver a cloud data platform experience. Having a platform that can recall years and years of historical data is powerful—but how can you use this data to look ahead and use yesterday’s evidence to plan for tomorrow? Imagine not only having what has happened available in Snowflake—your single version of the truth—but also an adjacent set of non-siloed data that offers a probabilistic forecast for days, weeks, or months into the future.

In a collaborative supply chain, sharing information between partners can improve performance, increase competitiveness, and reduce wasted resources. Sharing your future forecasts can be facilitated with Snowflake Data Sharing, which enables you to seamlessly collaborate with your business partners securely and identify business insights. If many partners share their forecasts, it can help control the bullwhip effect in the connected supply chain. You can effectively use Snowflake Marketplace to monetize your predictive analytics from datasets produced in Amazon Forecast.

In this post, we discuss how to implement an automated time series forecasting solution using Snowflake and Forecast.

Essential AWS services that enable this solution

Forecast provides several state-of-the-art time series algorithms and manages the allocation of enough distributed computing capacity to meet the needs of nearly any workload. With Forecast, you don’t get one model; you get the strength of many models that are further optimized into a uniquely weighted model for each time series in the set. In short, the service delivers all the science, data handling, and resource management into a simple API call.

AWS Step Functions provides a process orchestration mechanism that manages the overall workflow. The service encapsulates API calls with Amazon Athena, AWS Lambda, and Forecast to create an automated solution that harvests data from Snowflake, uses Forecast to convert historical data into future predictions, and then makes the prediction data available inside Snowflake.

Athena federated queries can connect to several enterprise data sources, including Amazon DynamoDB, Amazon Redshift, Amazon OpenSearch Service, MySQL, PostgreSQL, Redis, and other popular third-party data stores, such as Snowflake. Data connectors run as Lambda functions—you can use this source code to help launch the Amazon Athena Lambda Snowflake Connector and connect with AWS PrivateLink or through a NAT Gateway.

Solution overview

One of the things we often do at AWS is work to help customers realize their goals while also removing the burden of the undifferentiated heavy lifting. With this in mind, we propose the following solution to help AWS and Snowflake customers perform the following steps:

  1. Export data from Snowflake. You can use flexible metadata to unload the necessary historical data driven by a ready-to-go workflow.
  2. Import data into Forecast. No matter the use case, industry, or scale, importing prepared data inputs is easy and automated.
  3. Train a state-of-the-art time series model. You can automate time series forecasting without managing the underlying data science or hardware provisioning.
  4. Generate inference against the trained model. Forecast-produced outputs are easy to consume for any purpose. They’re available as simple CSV or Parquet files on Amazon Simple Storage Service (Amazon S3).
  5. Use history and future predictions side by side directly in Snowflake.

The following diagram illustrates how to implement an automated workflow that enables Snowflake customers to benefit from highly accurate time series predictions supported by Forecast, an AWS managed service. Transcending use case and industry, the design offered here first extracts historical data from Snowflake. Next, the workflow submits the prepared data for time series computation. Lastly, future period predictions are available natively in Snowflake, creating a seamless user experience for joint AWS and Snowflake customers.

Although this architecture only highlights the key technical details, the solution is simple to put together, sometimes within 1–2 business days. We provide you with working sample code to help remove the undifferentiated heavy lifting of creating the solution alone and without a head start. After you discover how to implement this pattern for one workload, you can repeat the forecasting process for any data held in Snowflake. In the sections that follow, we outline the key steps that enable you to build an automated pipeline.

Extract historical data from Snowflake

In this first step, you use SQL to define what data you want forecasted and let an Athena Federated Query connect to Snowflake, run your customized SQL, and persist the resulting record set on Amazon S3. Forecast requires historical training data to be available on Amazon S3 before ingestion; therefore, Amazon S3 serves as an intermediate storage buffer between Snowflake and Forecast. We feature Athena in this design to enable Snowflake and other heterogeneous data sources. If you prefer, another approach is using the Snowflake COPY command and storage integration to write query results to Amazon S3.

Regardless of the transport mechanism used, we now outline the kind of data Forecast needs and how data is defined, prepared, and extracted. In the section that follows, we describe how to import data into Forecast.

The following screenshot depicts what a set of data might look like in its native Snowflake schema.

Although this screenshot shows how the data looks in its natural state, Forecast requires data to be shaped into three different datasets:

  • Target time series – This is a required dataset containing the target variable and is used to train and predict a future value. Alone, this dataset serves as a univariate time series model.
  • Related time series – This is an optional dataset that contains temporal variables that should have a relationship to the target variable. Examples include variable pricing, promotional efforts, hyperlocal event traffic, economic outlook data—anything you feel might help explain variance in the target time series and produce a better forecast. The related time series dataset turns your univariate model into a multivariate model to help improve accuracy.
  • Item metadata – This is an optional dataset containing categorical data about the forecasted item. Item metadata often helps boost performance for newly launched products, which we term a cold start.

With the scope of each of the Forecast datasets defined, you can write queries in Snowflake that source the correct data fields from the necessary source tables with the proper filters to get the desired subset of data. The following are three example SQL queries used to generate each dataset that Forecast needs for a specific food demand planning scenario.

We start with the target time series query:

select LOCATION_ID, ITEM_ID, 
DATE_DEMAND as TIMESTAMP, QTY_DEMAND as TARGET_VALUE 
from DEMO.FOOD_DEMAND

The optional related time series query pulls covariates such as price and promotional information:

select LOCATION_ID,ITEM_ID, DATE_DEMAND as TIMESTAMP,
CHECKOUT_PRICE, BASE_PRICE,
EMAILER_FOR_PROMOTION, HOMEPAGE_FEATURED
from DEMO.FOOD_DEMAND

The item metadata query fetches distinct categorical values that help give dimension and further define the forecasted item:

select DISTINCT ITEM_ID, FOOD_CATEGORY, FOOD_CUISINE
from DEMO.FOOD_DEMAND

With the source queries defined, we can connect to Snowflake through an Athena Federated Query to submit the queries and persist the resulting datasets for forecasting use. For more information, refer to Query Snowflake using Athena Federated Query and join with data in your Amazon S3 data lake.
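
The following is a minimal sketch of submitting one of these queries with the AWS SDK for Python (Boto3), assuming the Snowflake connector has already been deployed and registered as an Athena data catalog; the catalog, database, and bucket names are placeholders.

import time
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        select LOCATION_ID, ITEM_ID,
               DATE_DEMAND as TIMESTAMP, QTY_DEMAND as TARGET_VALUE
        from DEMO.FOOD_DEMAND
    """,
    # "snowflake_catalog" is whatever name you registered the Snowflake connector under.
    QueryExecutionContext={"Catalog": "snowflake_catalog", "Database": "DEMO"},
    ResultConfiguration={"OutputLocation": "s3://my-forecast-bucket/target-time-series/"},
)

query_id = response["QueryExecutionId"]

# Poll until the query finishes; Athena persists the result set to the S3 output location.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]
    if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)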

The Athena Snowflake Connector GitHub repo helps install the Snowflake connector. The Forecast MLOps GitHub repo helps orchestrate all macro steps defined in this post, and makes them repeatable without writing code.

Import data into Forecast

After we complete the previous step, a target time series dataset is in Amazon S3 and ready for import into Forecast. In addition, the optional related time series and item metadata datasets may also be prepared and ready for ingestion. With the provided Forecast MLOps solution, all you have to do here is initiate the Step Functions state machine responsible for importing data—no code is necessary. Forecast launches a cluster for each of the datasets you have provided and makes the data ready for the service to use for ML model building and model inference.
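
Under the hood, the state machine wraps Forecast API calls similar to the following minimal sketch; the dataset ARN, role, and S3 path are placeholders, and with the provided solution you don't need to write this code yourself.

import boto3

forecast = boto3.client("forecast")

forecast.create_dataset_import_job(
    DatasetImportJobName="food_demand_tts_import",
    DatasetArn="arn:aws:forecast:us-east-1:111122223333:dataset/food_demand_tts",
    DataSource={
        "S3Config": {
            "Path": "s3://my-forecast-bucket/target-time-series/",
            "RoleArn": "arn:aws:iam::111122223333:role/ForecastExecutionRole",
        }
    },
    TimestampFormat="yyyy-MM-dd",
)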

Create a time series ML model with accuracy statistics

After data has been imported, highly accurate time series models are created simply by calling an API. This step is encapsulated inside a Step Functions state machine that initiates the Forecast API to start model training. After the predictor model is trained, the state machine exports the model statistics and predictions during the backtest window to Amazon S3. Backtest exports are queryable by Snowflake as an external stage, as shown in the following screenshot. If you prefer, you can store the data in an internal stage. The point is to use the backtest metrics to evaluate the performance spread of time series in your provided dataset.
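
For reference, the API calls encapsulated by this state machine look roughly like the following sketch; the names, ARNs, horizon, and frequency are placeholders for your own configuration.

import boto3

forecast = boto3.client("forecast")

# Train an AutoPredictor on the imported dataset group.
predictor = forecast.create_auto_predictor(
    PredictorName="food_demand_predictor",
    ForecastHorizon=14,
    ForecastFrequency="D",
    DataConfig={"DatasetGroupArn": "arn:aws:forecast:us-east-1:111122223333:dataset-group/food_demand"},
)

# In practice, wait until the predictor status is ACTIVE (the state machine handles this),
# then export backtest accuracy metrics and forecasted values to Amazon S3.
forecast.create_predictor_backtest_export_job(
    PredictorBacktestExportJobName="food_demand_backtest_export",
    PredictorArn=predictor["PredictorArn"],
    Destination={
        "S3Config": {
            "Path": "s3://bucket/folder/backtest-export/",
            "RoleArn": "arn:aws:iam::111122223333:role/ForecastExecutionRole",
        }
    },
)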

Create future predictions

With the model trained from the previous step, a purpose-built Step Functions state machine calls the Forecast API to create future-dated forecasts. Forecast provisions a cluster to perform the inference and pulls the imported target time series, related time series, and item metadata datasets through a named predictor model created in the previous step. After the predictions are generated, the state machine writes them to Amazon S3, where, once again, they can be queried in place as a Snowflake external stage or moved into Snowflake as an internal stage.
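
The equivalent API calls, sketched below with placeholder names and ARNs, create the forecast and export it to Amazon S3 so that it can be staged into Snowflake.

import boto3

forecast = boto3.client("forecast")

created = forecast.create_forecast(
    ForecastName="food_demand_forecast",
    PredictorArn="arn:aws:forecast:us-east-1:111122223333:predictor/food_demand_predictor",
)

# Once the forecast is ACTIVE, export the future-dated predictions to Amazon S3.
forecast.create_forecast_export_job(
    ForecastExportJobName="food_demand_forecast_export",
    ForecastArn=created["ForecastArn"],
    Destination={
        "S3Config": {
            "Path": "s3://bucket/folder/forecast/",
            "RoleArn": "arn:aws:iam::111122223333:role/ForecastExecutionRole",
        }
    },
)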

Use the future-dated prediction data directly in Snowflake

AWS hasn’t built a fully automated solution for this step; however, with the solution in this post, data was already produced by Forecast in the previous two steps. You may treat the outputs as actionable events or build business intelligence dashboards on the data. You may also use the data to create future manufacturing plans and purchase orders, estimate future revenue, build staffing resource plans, and more. Every use case is different, but the point of this step is to deliver the predictions to the correct consuming systems in your organization or beyond.

The following code snippet shows how to query Amazon S3 data directly from within Snowflake:

CREATE or REPLACE FILE FORMAT mycsvformat
type = 'CSV'
field_delimiter = ','
empty_field_as_null = TRUE
ESCAPE_UNENCLOSED_FIELD = None
skip_header = 1;

CREATE or REPLACE STORAGE INTEGRATION amazon_forecast_integration
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = S3
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::nnnnnnnnnn:role/snowflake-forecast-poc-role'
ENABLED = true
STORAGE_ALLOWED_LOCATIONS = (
's3://bucket/folder/forecast',
's3://bucket/folder/backtest-export/accuracy-metrics-values',
's3://bucket/folder/backtest-export/forecasted-values');

CREATE or REPLACE STAGE backtest_accuracy_metrics
storage_integration = amazon_forecast_integration
url = 's3://bucket/folder/backtest-export/accuracy-metrics-values'
file_format = mycsvformat;

CREATE or REPLACE EXTERNAL TABLE FOOD_DEMAND_BACKTEST_ACCURACY_METRICS (
ITEM_ID varchar AS (value:c1::varchar),
LOCATION_ID varchar AS (value:c2::varchar),
backtest_window varchar AS (value:c3::varchar),
backtestwindow_start_time varchar AS (value:c4::varchar),
backtestwindow_end_time varchar AS (value:c5::varchar),
wQL_10 varchar AS (value:c6::varchar),
wQL_30 varchar AS (value:c7::varchar),
wQL_50 varchar AS (value:c8::varchar),
wQL_70 varchar AS (value:c9::varchar),
wQL_90 varchar AS (value:c10::varchar),
AVG_wQL varchar AS (value:c11::varchar),
RMSE varchar AS (value:c12::varchar),
WAPE varchar AS (value:c13::varchar),
MAPE varchar AS (value:c14::varchar),
MASE varchar AS (value:c15::varchar)
)
with location = @backtest_accuracy_metrics
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1);

For more information about setting up permissions, refer to Option 1: Configuring a Snowflake Storage Integration to Access Amazon S3. Additionally, you can use the AWS Service Catalog to configure Amazon S3 storage integration; more information is available on the GitHub repo.

Initiate a schedule-based or event-based workflow

After you install a solution for your specific workload, your final step is to automate the process on a schedule that makes sense for your unique requirement, such as daily or weekly. The main thing is to decide how to start the process. One method is to use Snowflake to invoke the Step Functions state machine and then orchestrate the steps serially. Another approach is to chain state machines together and start the overall run through an Amazon EventBridge rule, which you can configure to run from an event or scheduled task—for example, at 9:00 PM GMT-8 each Sunday night.
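
As an illustration of the scheduled approach, the following sketch creates an EventBridge rule that starts a Step Functions state machine every Sunday at 9:00 PM GMT-8 (5:00 AM UTC Monday); the rule name, ARNs, and role are placeholders.

import boto3

events = boto3.client("events")

events.put_rule(
    Name="weekly-forecast-refresh",
    ScheduleExpression="cron(0 5 ? * MON *)",  # EventBridge cron expressions are in UTC
    State="ENABLED",
)

events.put_targets(
    Rule="weekly-forecast-refresh",
    Targets=[
        {
            "Id": "forecast-state-machine",
            "Arn": "arn:aws:states:us-east-1:111122223333:stateMachine:forecast-workflow",
            "RoleArn": "arn:aws:iam::111122223333:role/EventBridgeInvokeStepFunctionsRole",
        }
    ],
)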

Conclusion

With the most experience; the most reliable, scalable, and secure cloud; and the most comprehensive set of services and solutions, AWS is the best place to unlock value from your data and turn it into insight. In this post, we showed you how to create an automated time series forecasting workflow. Better forecasting can lead to higher customer service outcomes, less waste, less idle inventory, and more cash on the balance sheet.

If you’re ready to automate and improve forecasting, we’re here to help support you on your journey. Contact your AWS or Snowflake account team to get started today and ask for a forecasting workshop to see what kind of value you can unlock from your data.


About the Authors

Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped technology companies design and implement data analytics solutions and products.

Frank Dallezotte is a Sr. Solutions Architect at AWS and is passionate about working with independent software vendors to design and build scalable applications on AWS. He has experience creating software, implementing build pipelines, and deploying these solutions in the cloud.

Andries Engelbrecht is a Principal Partner Solutions Architect at Snowflake and works with strategic partners. He is actively engaged with strategic partners like AWS supporting product and service integrations as well as the development of joint solutions with partners. Andries has over 20 years of experience in the field of data and analytics.

Charles Laughlin is a Principal AI/ML Specialist Solutions Architect and works on the Time Series ML team at AWS. He helps shape the Amazon Forecast service roadmap and collaborates daily with diverse AWS customers to help transform their businesses using cutting-edge AWS technologies and thought leadership. Charles holds an M.S. in Supply Chain Management and has spent the past decade working in the consumer packaged goods industry.

James Sun is a Senior Partner Solutions Architect at Snowflake. James has over 20 years of experience in storage and data analytics. Prior to Snowflake, he held several senior technical positions at AWS and MapR. James holds a PhD from Stanford University.

Read More

Achieve four times higher ML inference throughput at three times lower cost per inference with Amazon EC2 G5 instances for NLP and CV PyTorch models

Amazon Elastic Compute Cloud (Amazon EC2) G5 instances are the first and only instances in the cloud to feature NVIDIA A10G Tensor Core GPUs, which you can use for a wide range of graphics-intensive and machine learning (ML) use cases. With G5 instances, ML customers get high performance and a cost-efficient infrastructure to train and deploy larger and more sophisticated models for natural language processing (NLP), computer vision (CV), and recommender engine use cases.

The purpose of this post is to showcase the performance benefits of G5 instances for large-scale ML inference workloads. We do this by comparing the price-performance (measured as $ per million inferences) for NLP and CV models with G4dn instances. We start by describing our benchmarking approach and then present throughput vs. latency curves across batch sizes and data type precision. In comparison to G4dn instances, we find that G5 instances deliver consistently lower cost per million inferences for both full precision and mixed precision modes for the NLP and CV models while achieving higher throughput and lower latency.

Benchmarking approach

To develop a price-performance study between G5 and G4dn, we need to measure throughput, latency, and cost per million inferences as a function of batch size. We also study the impact of full precision vs. mixed precision. Both the model graph and inputs are loaded into CUDA prior to inferencing.

As shown in the following architecture diagram, we first create respective base container images with CUDA for the underlying EC2 instance (G4dn, G5). To build the base container images, we start with AWS Deep Learning Containers, which provide pre-packaged Docker images that you can use to deploy deep learning environments in minutes. The images contain the required deep learning PyTorch libraries and tools. You can add your own libraries and tools on top of these images for a higher degree of control over monitoring, compliance, and data processing.

Then we build a model-specific container image that encapsulates the model configuration, model tracing, and related code to run forward passes. All container images are loaded into Amazon Elastic Container Registry (Amazon ECR) to allow for horizontal scaling of these models for various model configurations. We use Amazon Simple Storage Service (Amazon S3) as a common data store to download configuration and upload benchmark results for summarization. You can use this architecture to recreate and reproduce the benchmark results and repurpose it to benchmark various model types (such as Hugging Face models, PyTorch models, and other custom models) across EC2 instance types (CPU, GPU, Inf1).

With this experiment set up, our goal is to study latency as a function of throughput. This curve is important for application design to arrive at a cost-optimal infrastructure for the target application. To achieve this, we simulate different loads by queuing up queries from multiple threads and then measuring the round-trip time for each completed request. Throughput is measured based on the number of completed requests per unit clock time. Furthermore, you can vary the batch sizes and other variables like sequence length and full precision vs. half precision to comprehensively sweep the design space to arrive at indicative performance metrics. In our study, through a parametric sweep of batch size and queries from multi-threaded clients, the throughput vs. latency curve is determined. Every request can be batched to ensure full utilization of the accelerator, especially for small requests that may not fully utilize the compute node. You can also adopt this setup to identify the client-side batch size for optimal performance.
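
The benchmarking code itself lives in the repository referenced later in this post; the following is only a minimal sketch of the load pattern described above, with a hypothetical predict_fn standing in for a forward pass of the traced model already loaded into CUDA.

import time
from concurrent.futures import ThreadPoolExecutor

def predict_fn(batch):
    # Placeholder for the model forward pass on one (possibly batched) request.
    pass

def timed_request(batch):
    start = time.perf_counter()
    predict_fn(batch)
    return time.perf_counter() - start  # round-trip latency in seconds

def run_load(batches, num_threads):
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        latencies = sorted(pool.map(timed_request, batches))
    wall_time = time.perf_counter() - wall_start
    throughput = len(batches) / wall_time                  # completed requests per second
    p95_latency = latencies[int(0.95 * (len(latencies) - 1))]
    return throughput, p95_latency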

In summary, we can represent this problem mathematically as: (Throughput, Latency) = function of (Batch Size, Number of threads, Precision).

This means, given the exhaustive space, the number of experiments can be large. Fortunately, each experiment can be run independently. We recommend using AWS Batch to perform this horizontally scaled benchmarking in compressed time without an increase in benchmarking cost compared to a linear approach to testing. The code for replicating the results is available in the GitHub repository prepared for AWS re:Invent 2021. The repository provides comprehensive support for benchmarking on different accelerators. You can refer to the GPU aspect of the code to build the container (Dockerfile-gpu) and then refer to the code inside Container-Root for specific examples for BERT and ResNet50.

We used the preceding approach to develop performance studies across two model types: Bert-base-uncased (110 million parameters, NLP) and ResNet50 (25.6 million parameters, CV). The following table summarizes the model details.

Model Type | Model Details
NLP | twmkn9/bert-base-uncased-squad2, 110 million parameters, sequence length = 128
CV | ResNet50, 25.6 million parameters

Additionally, to benchmark across data types (full, half precision), we use torch.cuda.amp, which provides convenient methods to handle mixed precision where some operations use the torch.float32 (float) data type and other operations use torch.float16 (half). For example, operators like linear layers and convolutions are much faster with float16, whereas others like reductions often require the dynamic range of float32. Automatic mixed precision tries to match each operator to its appropriate data type to optimize the network’s runtime and memory footprint.
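
For reference, a minimal sketch of how inference is run in full vs. mixed precision with torch.cuda.amp follows; it assumes a GPU instance with PyTorch and torchvision installed, and a batch size of 32.

import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True).eval().cuda()
batch = torch.randn(32, 3, 224, 224, device="cuda")

with torch.no_grad():
    # Full precision: every operator runs in float32.
    fp32_output = model(batch)

    # Mixed precision: autocast picks float16 or float32 per operator.
    with torch.cuda.amp.autocast():
        amp_output = model(batch)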

Benchmarking results

For a fair comparison, we selected G4dn.4xlarge and G5.4xlarge instances with similar attributes, as listed in the following table.

Instance | GPUs | GPU Memory (GiB) | vCPUs | Memory (GiB) | Instance Storage (GB) | Network Performance (Gbps) | EBS Bandwidth (Gbps) | Linux On-Demand Pricing (us-east-1)
G5.4xlarge | 1 | 24 | 16 | 64 | 1x 600 NVMe SSD | Up to 25 | 8 | $1.624/hour
G4dn.4xlarge | 1 | 16 | 16 | 64 | 1x 225 NVMe SSD | Up to 25 | 4.75 | $1.204/hour

In the following sections, we compare the ML inference performance of BERT and ResNet50 models with a grid sweep approach for specific batch sizes (32, 16, 8, 4, 1) and data type precisions (full and half precision) to arrive at the throughput vs. latency curve. Additionally, we investigate the effect of batch size on throughput for both full and half precision. Lastly, we measure cost per million inferences as a function of batch size. The consolidated results across these experiments are summarized later in this post.

Throughput vs. latency

The following figures compare G4dn and G5 instances for NLP and CV workloads at both full and half precision. In comparison to G4dn instances, the G5 instance delivers about 5 times higher throughput (full precision) and about 2.5 times higher throughput (half precision) for a BERT base model, and about 2–2.5 times higher throughput for a ResNet50 model. Overall, G5 is the preferred choice from a performance perspective, with increasing batch sizes, for both models at both full and mixed precision.

The following graphs compare throughput and P95 latency at full and half precision for BERT.

The following graphs compare throughput and P95 latency at full and half precision for ResNet50.

Throughput and latency vs. batch size

The following graphs show the throughput as a function of the batch size. At low batch sizes, the accelerator isn’t functioning to its fullest capacity and as the batch size increases, throughput is increased at the cost of latency. The throughput curve asymptotes to a maximum value that is a function of the accelerator performance. The curve has two distinct features: a rising section and a flat asymptotic section. For a given model, a performant accelerator (G5) is able to stretch the rising section to higher batch sizes than G4dn and asymptote at a higher throughput. Also, there is a linear trade-off between latency and batch size. Therefore, if the application is latency bound, we can use P95 latency vs. batch size to determine the optimum batch size. However, if the objective is to maximize throughput at the lowest latency, it’s better to select the batch size corresponding to the “knee” between the rising and the asymptotic sections, because any further increase in batch size would result in the same throughput at a worse latency. To achieve the best price-performance ratio, targeting higher throughput at lowest latency, you’re better off horizontally scaling this optimum through multiple inference servers rather than just increasing the batch size.

Cost vs. batch size

In this section, we present the comparative results of inference costs ($ per million inferences) versus the batch size. From the following figure, we can clearly observe that the cost (measured as $ per million inferences) is consistently lower with G5 than with G4dn, for both full and half precision.

The following table summarizes throughput, latency, and cost ($ per million inferences) comparisons for BERT and ResNet50 models across both precision modes for specific batch sizes. In spite of a higher cost per instance, G5 consistently outperforms G4dn across all aspects of inference latency, throughput, and cost ($ per million inferences), for all batch sizes. Combining the different metrics into a cost ($ per million inferences), the BERT model (batch size 32, full precision) with G5 is 3.7 times more favorable than with G4dn, and the ResNet50 model (batch size 32, full precision) is 1.6 times more favorable than with G4dn.

Model | Batch Size | Precision | Throughput (batch size x requests/sec): G5 | Throughput: G4dn | Latency (ms): G5 | Latency (ms): G4dn | $/million inferences (On-Demand): G5 | $/million inferences: G4dn | Cost Benefit (G5 over G4dn)
Bert-base-uncased | 32 | Full | 723 | 154 | 44 | 208 | $0.6 | $2.2 | 3.7X
Bert-base-uncased | 32 | Mixed | 870 | 410 | 37 | 79 | $0.5 | $0.8 | 1.6X
Bert-base-uncased | 16 | Full | 651 | 158 | 25 | 102 | $0.7 | $2.1 | 3.0X
Bert-base-uncased | 16 | Mixed | 762 | 376 | 21 | 43 | $0.6 | $0.9 | 1.5X
Bert-base-uncased | 8 | Full | 642 | 142 | 13 | 57 | $0.7 | $2.3 | 3.3X
Bert-base-uncased | 8 | Mixed | 681 | 350 | 12 | 23 | $0.7 | $1.0 | 1.4X
Bert-base-uncased | 1 | Full | 160 | 116 | 6 | 9 | $2.8 | $2.9 | 1.0X
Bert-base-uncased | 1 | Mixed | 137 | 102 | 7 | 10 | $3.3 | $3.3 | 1.0X
ResNet50 | 32 | Full | 941 | 397 | 34 | 82 | $0.5 | $0.8 | 1.6X
ResNet50 | 32 | Mixed | 1533 | 851 | 21 | 38 | $0.3 | $0.4 | 1.3X
ResNet50 | 16 | Full | 888 | 384 | 18 | 42 | $0.5 | $0.9 | 1.8X
ResNet50 | 16 | Mixed | 1474 | 819 | 11 | 20 | $0.3 | $0.4 | 1.3X
ResNet50 | 8 | Full | 805 | 340 | 10 | 24 | $0.6 | $1.0 | 1.7X
ResNet50 | 8 | Mixed | 1419 | 772 | 6 | 10 | $0.3 | $0.4 | 1.3X
ResNet50 | 1 | Full | 202 | 164 | 5 | 6 | $2.2 | $2.0 | 0.9X
ResNet50 | 1 | Mixed | 196 | 180 | 5 | 6 | $2.3 | $1.9 | 0.8X

Additional inference benchmarks

In addition to the BERT base and ResNet50 results in the prior sections, we present additional benchmarking results for other commonly used large NLP and CV models in PyTorch. The performance benefit of G5 over G4dn has been presented for BERT Large models at various precision, and Yolo-v5 models for various sizes. For the code for replicating the benchmark, refer to NVIDIA Deep Learning Examples for Tensor Cores. These results show the benefit of using G5 over G4dn for a wide range of inference tasks spanning different model types.

Model | Precision | Batch Size | Sequence Length | Throughput (sent/sec): G5 | Throughput: G4dn | Speedup Over G4dn
BERT-large | FP16 | 1 | 128 | 93.5 | 40.31 | 2.3
BERT-large | FP16 | 4 | 128 | 264.2 | 87.4 | 3.0
BERT-large | FP16 | 8 | 128 | 392.1 | 107.5 | 3.6
BERT-large | FP32 | 1 | 128 | 68.4 | 22.67 | 3.0
BERT-large | FP32 | 4 | 128 | 118.5 | 32.21 | 3.7
BERT-large | FP32 | 8 | 128 | 132.4 | 34.67 | 3.8

Model | GFLOPS | Number of Parameters | Preprocessing (ms) | Inference (ms) | Inference (Non-max-suppression) (NMS/image)
YOLOv5s | 16.5 | 7.2M | 0.2 | 3.6 | 4.5
YOLOv5m | 49.1 | 21M | 0.2 | 6.5 | 4.5
YOLOv5l | 109.3 | 46M | 0.2 | 9.1 | 3.5
YOLOv5x | 205.9 | 86M | 0.2 | 14.4 | 1.3

Conclusion

In this post, we showed that for inference with large NLP and CV PyTorch models, EC2 G5 instances are a better choice compared to G4dn instances. Although the on-demand hourly cost for G5 instances is higher than G4dn instances, its higher performance can achieve 2–5 times the throughput at any precision for NLP and CV models, which makes the cost per million inferences 1.5–3.5 times more favorable than G4dn instances. Even for latency bound applications, G5 is 2.5–5 times better than G4dn for NLP and CV models.

In summary, AWS G5 instances are an excellent choice for your inference needs from both a performance and cost per inference perspective. The universality of the CUDA framework and the scale and depth of the G5 instance pool on AWS provides you with a unique ability to perform inference at scale.


About the authors

Ankur Srivastava is a Sr. Solutions Architect in the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, and probabilistic design optimization. He completed his doctoral studies in Mechanical Engineering at Rice University and postdoctoral research at the Massachusetts Institute of Technology.

Sundar Ranganathan is the Head of Business Development, ML Frameworks on the Amazon EC2 team. He focuses on large-scale ML workloads across AWS services like Amazon EKS, Amazon ECS, Elastic Fabric Adapter, AWS Batch, and Amazon SageMaker. His experience includes leadership roles in product management and product development at NetApp, Micron Technology, Qualcomm, and Mentor Graphics.

Mahadevan Balasubramaniam is a Principal Solutions Architect for Autonomous Computing with nearly 20 years of experience in the area of physics-infused deep learning, building, and deploying digital twins for industrial systems at scale. Mahadevan obtained his PhD in Mechanical Engineering from the Massachusetts Institute of Technology and has over 25 patents and publications to his credit.

Amr Ragab is a Principal Solutions Architect for EC2 Accelerated Platforms for AWS, devoted to helping customers run computational workloads at scale. In his spare time he likes traveling and finding new ways to integrate technology into daily life.

Read More

Celebrate over 20 years of AI/ML at Innovation Day

Be our guest as we celebrate 20 years of AI/ML innovation on October 25, 2022, 9:00 AM – 10:30 AM PT.  The first 1,500 people to register will receive $50 of AWS credits. Register here.

Over the past 20 years, Amazon has delivered many world firsts for artificial intelligence (AI) and machine learning (ML). ML is an integral part of Amazon and is used for everything from applying personalization models at checkout, to forecasting the demand for products globally, to creating autonomous flight for Amazon Prime Air drones, to natural language processing (NLP) on Alexa. And the use of ML isn’t slowing down anytime soon, because ML helps Amazon exceed customer expectations for convenience, cost, and delivery speed.

During the virtual AI/ML innovation event on October 25, we take time to reflect on what’s been done at Amazon and how we have packaged this innovation into a wide breadth and depth of AI/ML services. The AWS ML stack helps you rapidly innovate and enhance customer experiences, enable faster and better decision-making, and optimize business processes using the same technology that Amazon uses every day. With the most experience; the most reliable, scalable, and secure cloud; and the most comprehensive set of services and solutions, AWS is the best place to unlock value from your data and turn it into insight.

We will also take a moment to celebrate customer success using AWS to harness the power of data with ML and, in many cases, change the way we live for the better. Mueller Water Products, Siemens Energy, Inspire, and ResMed will show what’s possible using ML for sustainability and accessibility challenges such as water conservation, predictive maintenance for industrial plants, personalized medical care resources for patients and caregivers, and cloud-connected customized recommendations for patients and their healthcare providers.

The 90-minute session doesn’t stop there! We have special guest speaker Professor Michael Jordan, who will talk about the decision-making side of ML spanning computational, inferential, and economic perspectives. Much of the recent focus in ML has been on the pattern recognition side of the field. In Professor Jordan’s talk, he will focus on the decision-making side, where many fundamental challenges remain. Some are statistical in nature, including the challenges associated with multiple decision-making. Others are economic, involving learning systems that must cope with scarcity, competition, and incentives, and some are algorithmic, including the challenge of coordinated decision-making on distributed platforms and the need for algorithms to converge to equilibria rather than optima. He will ponder how next-generation ML platforms can provide environments that support this kind of large scale, dynamic, data-aware, and market-aware decision-making.

Finally, we wrap up the celebration with Dr. Bratin Saha, VP of AI/ML, who will explain how AWS AI/ML has grown to over 100,000 customers so quickly, including how Amazon SageMaker became one of the fastest-growing services in the history of AWS. Hint: SageMaker incorporates many world firsts, including fully managed infrastructure, tools such as IDEs and feature stores, workflows, AutoML, and no-code capabilities.

AWS has helped foster ML growth through capabilities that help you deploy it at scale by operationalizing processes. We have seen this play out in many different industries. For example, in the automotive industry, the assembly line has standardized automotive design and manufacturing, and launched a revolution in transportation by helping us transition from hand-assembled cars to mass production.

Similarly, the software industry went from a few specialized business applications to becoming ubiquitous in every aspect of our lives. That happened through automation, tooling, and implementing and standardizing processes, in effect through the industrialization of software. In the same way, ML services from AWS are driving this transformation. In fact, customers today are running millions of models, billions of parameters, and hundreds of billions of predictions on AWS.

Dr. Saha will also look back at the history of flagship AI services, including services for text and documents, speech, vision, healthcare, industrial, search, business processes, and DevOps. He will explain how to use the AI Use Case Explorer, where you can explore use cases, discover customer success stories, and mobilize your team around the power of AI and ML. Dr. Saha will end on his vision for AWS AI/ML services.

We can’t wait to celebrate with you, so register now! If you’re among the first 1,500 people to register, you will receive $50 of AWS credits.

Happy innovating!


About the author

Kimberly Madia is a Principal Product Marketing Manager with AWS Machine Learning. Her goal is to make it easy for customers to build, train, and deploy machine learning models using Amazon SageMaker. For fun outside work, Kimberly likes to cook, read, and run on the San Francisco Bay Trail.

Read More

AWS Panorama now supports NVIDIA JetPack SDK 4.6.2

AWS Panorama is a collection of machine learning (ML) devices and a software development kit (SDK) that brings computer vision to on-premises internet protocol (IP) cameras. AWS Panorama device options include the AWS Panorama Appliance and the Lenovo ThinkEdge SE70, powered by AWS Panorama. These device options provide you choices in price and performance, depending on your unique use case. Both AWS Panorama devices are built on NVIDIA® Jetson™ System on Modules (SOMs) and use the NVIDIA JetPack SDK.

AWS has released a new software update for AWS Panorama that supports NVIDIA Jetpack SDK version 4.6.2. You can download this software update and apply it to the AWS Panorama device via an over-the-air (OTA) upgrade process. For more details, see Managing an AWS Panorama Appliance.

This release is not backward compatible with previous software releases for AWS Panorama; you must rebuild and redeploy your applications. This post provides a step-by-step guide to updating your application software libraries to the latest supported versions.

Overview of update

The NVIDIA JetPack SDK version 4.6.2 includes support for newer library versions: CUDA 10.2, cuDNN 8.2.1, and TensorRT 8.2.1. Other notable libraries now supported include DeepStream 6.0 and Triton Inference Server 21.07. In addition, TensorRT 8.2.1 includes an expanded list of DNN operator support, Sigmoid/Tanh INT8 support for DLA, and better integration with TensorFlow and PyTorch. Torch to TensorRT conversion is now supported, as well as TensorFlow to TensorRT, without the need to convert to ONNX as an intermediate step. For additional details, refer to NVIDIA Announces TensorRT 8.2 and Integrations with PyTorch and TensorFlow.
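
As an illustration (not part of the AWS Panorama tooling itself), converting a PyTorch model directly to TensorRT can look like the following sketch using the Torch-TensorRT package; the exact package versions available on your device and your input shapes will differ.

import torch
import torch_tensorrt
import torchvision

model = torchvision.models.resnet50(pretrained=True).eval()

# Compile the model to a TensorRT-accelerated TorchScript module, allowing FP16
# kernels where supported; no ONNX intermediate step is required.
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.float32)],
    enabled_precisions={torch.half},
)

torch.jit.save(trt_model, "resnet50_trt.ts")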

You can redeploy your applications by following the steps in the following sections.

Prerequisites

As a prerequisite, you need an AWS account and an AWS Panorama device.

Upgrade your AWS Panorama device

First, you upgrade your AWS Panorama device to the latest version.

  1. On the AWS Panorama console, choose Devices in the navigation pane.
  2. Choose an Appliance.
  3. Choose Settings.
  4. Under System software, choose View Software update.

  5. Choose System Software version 5.0 or above and then proceed to install this software.

Redeploy your application

If you do not use the Open GPU access feature as part of your application, you use the Replace feature on the AWS Panorama console. The Replace function rebuilds your model for the latest software.

  1. On the AWS Panorama console, choose Deployed applications in the navigation pane.
  2. Select an application.
  3. Choose Replace.

For applications using the Open GPU access feature, upgrading typically involves allowing your container access to the underlying GPU hardware and deploying and managing your own models and runtime. We recommend using NVIDIA TensorRT in your application, but you’re not limited to this.

You also need to update the libraries in your Dockerfile. Typical libraries to update include CUDA 10.2, cuDNN 8.2.1, TensorRT 8.2.1, DeepStream 6.0, OpenCV 4.1.1, and VPI 1.1. All related CUDA/NVIDIA changes in the software stack are listed at JetPack SDK 4.6.2.

Now you rebuild the models for TensorRT 8.2.1 and update your package.json file with the updated assets. You can now build your container with the updated dependencies and models and deploy the application container to your Appliance using the AWS Panorama console or APIs.

At this point, your AWS Panorama applications should be able to use JetPack SDK version 4.6.2. AWS Panorama sample applications that are compatible with this version are available on GitHub.

Conclusion

With the new update to AWS Panorama, you must rebuild and redeploy your applications. This post walked you through the steps to update your AWS Panorama application software libraries to the latest supported versions.

Please reach out to AWS Panorama with any questions at AWS re:Post.


About the Authors

Vinod Raman is a Principal Product Manager at AWS.

Steven White is a Senior Computer Vision/EdgeML Solutions Architect at AWS.

Read More

Build flexible and scalable distributed training architectures using Kubeflow on AWS and Amazon SageMaker

In this post, we demonstrate how Kubeflow on AWS (an AWS-specific distribution of Kubeflow) used with AWS Deep Learning Containers and Amazon Elastic File System (Amazon EFS) simplifies collaboration and provides flexibility in training deep learning models at scale on both Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon SageMaker utilizing a hybrid architecture approach.

Machine learning (ML) development relies on complex and continuously evolving open-source frameworks and toolkits, as well as complex and continuously evolving hardware ecosystems. This poses a challenge when scaling out ML development to a cluster. Containers offer a solution, because they can fully encapsulate not just the training code, but the entire dependency stack down to the hardware libraries. This ensures an ML environment that is consistent and portable, and facilitates reproducibility of the training environment on each individual node of the training cluster.

Kubernetes is a widely adopted system for automating infrastructure deployment, resource scaling, and management of these containerized applications. However, Kubernetes wasn’t built with ML in mind, so it can feel counterintuitive to data scientists due to its heavy reliance on YAML specification files. There isn’t a Jupyter experience, and there aren’t many ML-specific capabilities, such as workflow management and pipelines, and other capabilities that ML experts expect, such as hyperparameter tuning, model hosting, and others. Such capabilities can be built, but Kubernetes wasn’t designed to do this as its primary objective.

The open-source community took notice and developed a layer on top of Kubernetes called Kubeflow. Kubeflow aims to make the deployment of end-to-end ML workflows on Kubernetes simple, portable, and scalable. You can use Kubeflow to deploy best-of-breed open-source systems for ML to diverse infrastructures.

Kubeflow and Kubernetes provide flexibility and control to data science teams. However, ensuring high utilization of training clusters running at scale with reduced operational overhead is still challenging.

This post demonstrates how customers who have on-premises restrictions or existing Kubernetes investments can address this challenge by using Amazon EKS and Kubeflow on AWS to implement an ML pipeline for distributed training based on a self-managed approach, and use fully managed SageMaker for cost-optimized, production-scale training infrastructure. This includes a step-by-step implementation of a hybrid distributed training architecture that lets you choose between the two approaches at runtime, giving you maximum control and flexibility for deployments with stringent requirements. You will see how you can keep using open-source libraries in your deep learning training script and still run it on both Kubernetes and SageMaker in a platform-agnostic way.

How do Kubeflow on AWS and SageMaker help?

Neural network models built with deep learning frameworks like TensorFlow, PyTorch, MXNet, and others provide much higher accuracy by using significantly larger training datasets, especially in computer vision and natural language processing use cases. However, with large training datasets, it takes longer to train the deep learning models, which ultimately slows down the time to market. If we could scale out a cluster and bring down the model training time from weeks to days or hours, it could have a huge impact on productivity and business velocity.

Amazon EKS provides a managed Kubernetes control plane. You can use Amazon EKS to create large-scale training clusters with CPU and GPU instances, and use the Kubeflow toolkit to provide ML-friendly, open-source tools and operationalize ML workflows that are portable and scalable using Kubeflow Pipelines, improving your team's productivity and reducing the time to market.

However, there could be a couple of challenges with this approach:

  • Ensuring maximum utilization of a cluster across data science teams. For example, you should provision GPU instances on demand and ensure their high utilization for demanding production-scale tasks such as deep learning training, and use CPU instances for less demanding tasks such as data preprocessing.
  • Ensuring high availability of heavyweight Kubeflow infrastructure components, including database, storage, and authentication, that are deployed in the Kubernetes cluster worker nodes. For example, the Kubeflow control plane generates artifacts (such as MySQL instances, pod logs, or MinIO storage) that grow over time and need resizable storage volumes with continuous monitoring capabilities.
  • Sharing the training dataset, code, and compute environments between developers, training clusters, and projects is challenging. For example, if you’re working on your own set of libraries and those libraries have strong interdependencies, it gets really hard to share and run the same piece of code between data scientists in the same team. Also, each training run requires you to download the training dataset and build the training image with new code changes.

Kubeflow on AWS helps address these challenges and provides an enterprise-grade semi-managed Kubeflow product. With Kubeflow on AWS, you can replace some Kubeflow control plane services like database, storage, monitoring, and user management with AWS managed services like Amazon Relational Database Service (Amazon RDS), Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), Amazon FSx, Amazon CloudWatch, and Amazon Cognito.

Replacing these Kubeflow components decouples critical parts of the Kubeflow control plane from Kubernetes, providing a secure, scalable, resilient, and cost-optimized design. This approach also frees up storage and compute resources from the EKS data plane, which may be needed by applications such as distributed model training or user notebook servers. Kubeflow on AWS also provides native integration of Jupyter notebooks with Deep Learning Container (DLC) images, which are pre-packaged and preconfigured with AWS optimized deep learning frameworks such as PyTorch and TensorFlow that allow you to start writing your training code right away without dealing with dependency resolutions and framework optimizations. Also, Amazon EFS integration with training clusters and the development environment allows you to share your code and processed training dataset, which avoids building the container image and loading huge datasets after every code change. These integrations with Kubeflow on AWS help you speed up the model building and training time and allow for better collaboration with easier data and code sharing.

Kubeflow on AWS helps build a highly available and robust ML platform. This platform provides flexibility to build and train deep learning models and provides access to many open-source toolkits, insights into logs, and interactive debugging for experimentation. However, achieving maximum utilization of infrastructure resources while training deep learning models on hundreds of GPUs still involves a lot of operational overhead. This could be addressed by using SageMaker, which is a fully managed service designed and optimized for handling performant and cost-optimized training clusters that are only provisioned when requested, scaled as needed, and shut down automatically when jobs complete, thereby providing close to 100% resource utilization. You can integrate SageMaker with Kubeflow Pipelines using managed SageMaker components. This allows you to operationalize ML workflows as part of Kubeflow pipelines, where you can use Kubernetes for local training and SageMaker for production-scale training in a hybrid architecture.

Solution overview

The following architecture diagram shows how we use Kubeflow Pipelines to build and deploy portable, scalable end-to-end ML workflows that conditionally run distributed training either on Kubernetes using Kubeflow training or on SageMaker, based on a runtime parameter.

Kubeflow training is a group of Kubernetes Operators that add to Kubeflow the support for distributed training of ML models using different frameworks like TensorFlow, PyTorch, and others. pytorch-operator is the Kubeflow implementation of the Kubernetes custom resource (PyTorchJob) to run distributed PyTorch training jobs on Kubernetes.

We use the PyTorchJob Launcher component as part of the Kubeflow pipeline to run PyTorch distributed training during the experimentation phase when we need flexibility and access to all the underlying resources for interactive debugging and analysis.

We also use SageMaker components for Kubeflow Pipelines to run our model training at production scale. This allows us to take advantage of powerful SageMaker features such as fully managed services, distributed training jobs with maximum GPU utilization, and cost-effective training through Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances.


As part of the workflow creation process, you complete the following steps (as shown in the preceding diagram) to create this pipeline:

  1. Use the Kubeflow manifest file to create a Kubeflow dashboard and access Jupyter notebooks from the Kubeflow central dashboard.
  2. Use the Kubeflow pipeline SDK to create and compile Kubeflow pipelines using Python code. Pipeline compilation converts the Python function to a workflow resource, which is an Argo-compatible YAML format.
  3. Use the Kubeflow Pipelines SDK client to call the pipeline service endpoint to run the pipeline.
  4. The pipeline evaluates the conditional runtime variables and decides between SageMaker or Kubernetes as the target run environment.
  5. Use the Kubeflow PyTorch Launcher component to run distributed training on the native Kubernetes environment, or use the SageMaker component to submit the training on the SageMaker managed platform.

The following figure shows the Kubeflow Pipelines components involved in the architecture that give us the flexibility to choose between Kubernetes or SageMaker distributed environments.

Kubeflow Pipelines components

Use case workflow

We use the following step-by-step approach to install and run the distributed training use case on Amazon EKS and SageMaker using Kubeflow on AWS.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account.
  • A machine with Docker and the AWS Command Line Interface (AWS CLI) installed.
  • Optionally, you can use AWS Cloud9, a cloud-based integrated development environment (IDE) that enables completing all the work from your web browser. For setup instructions, refer to Setup Cloud9 IDE. From your Cloud9 environment, choose the plus sign and open a new terminal.
  • Create a role with the name sagemakerrole. Add the managed policies AmazonSageMakerFullAccess and AmazonS3FullAccess to give SageMaker access to S3 buckets. This role is used by the SageMaker job submitted as part of the Kubeflow Pipelines step (a boto3 sketch of this setup follows this list).
  • Make sure your account has the SageMaker Training resource type limit for ml.p3.2xlarge increased to 2 using the Service Quotas console.
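The following is a minimal boto3 sketch of how the sagemakerrole mentioned above could be created programmatically. The trust policy shown is an assumption based on the standard SageMaker service principal; adjust it to your own security requirements.

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets SageMaker assume the role (assumed standard service principal)
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(RoleName="sagemakerrole", AssumeRolePolicyDocument=json.dumps(trust_policy))

# Attach the managed policies referenced in the prerequisites
for policy_arn in [
    "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
]:
    iam.attach_role_policy(RoleName="sagemakerrole", PolicyArn=policy_arn)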

1. Install Amazon EKS and Kubeflow on AWS

You can use several different approaches to build a Kubernetes cluster and deploy Kubeflow. In this post, we focus on an approach that we believe brings simplicity to the process. First, we create an EKS cluster, then we deploy Kubeflow on AWS v1.5 on it. For each of these tasks, we use a corresponding open-source project that follows the principles of the Do Framework. Rather than installing a set of prerequisites for each task, we build Docker containers that have all the necessary tools and perform the tasks from within the containers.

We use the Do Framework in this post, which automates the Kubeflow deployment with Amazon EFS as an add-on. For the official Kubeflow on AWS deployment options for production, refer to Deployment.

Configure the current working directory and AWS CLI

We configure a working directory so we can refer to it as the starting point for the steps that follow:

export working_dir=$PWD

We also configure an AWS CLI profile. To do so, you need an access key ID and secret access key of an AWS Identity and Access Management (IAM) user account with administrative privileges (attach the existing managed policy) and programmatic access. See the following code:

aws configure --profile=kubeflow
AWS Access Key ID [None]: <enter access key id>
AWS Secret Access Key [None]: <enter secret access key>
Default region name [None]: us-west-2
Default output format [None]: json

# (In Cloud9, select “Cancel” and “Permanently disable” when the AWS managed temporary credentials dialog pops up)

export AWS_PROFILE=kubeflow

1.1 Create an EKS cluster

If you already have an EKS cluster available, you can skip to the next section. For this post, we use the aws-do-eks project to create our cluster.

  1. First, clone the project into your working directory:
    cd ${working_dir}
    git clone https://github.com/aws-samples/aws-do-eks
    cd aws-do-eks/

  2. Then build and run the aws-do-eks container:
    ./build.sh
    ./run.sh

    The build.sh script creates a Docker container image that has all the necessary tools and scripts for provisioning and operation of EKS clusters. The run.sh script starts a container using the created Docker image and keeps it up, so we can use it as our EKS management environment. To see the status of your aws-do-eks container, you can run ./status.sh. If the container is in Exited status, you can use the ./start.sh script to bring the container up, or to restart the container, you can run ./stop.sh followed by ./run.sh.

  3. Open a shell in the running aws-do-eks container:
    ./exec.sh

  4. To review the EKS cluster configuration for our Kubeflow deployment, run the following command:
    vi ./eks-kubeflow.yaml

    By default, this configuration creates a cluster named eks-kubeflow in the us-west-2 Region with six m5.xlarge nodes. Also, EBS volume encryption is not enabled by default; you can enable it by adding "volumeEncrypted: true" to the nodegroup, which encrypts volumes with the default key. Modify other configuration settings if needed.

  5. To create the cluster, run the following command:
    export AWS_PROFILE=kubeflow
    eksctl create cluster -f ./eks-kubeflow.yaml

    The cluster provisioning process may take up to 30 minutes.

  6. To verify that the cluster was created successfully, run the following command:
    kubectl get nodes

    The output from the preceding command for a cluster that was created successfully looks like the following code:

    root@cdf4ecbebf62:/eks# kubectl get nodes
    NAME                                           STATUS   ROLES    AGE   VERSION
    ip-192-168-0-166.us-west-2.compute.internal    Ready    <none>   23m   v1.21.14-eks-ba74326
    ip-192-168-13-28.us-west-2.compute.internal    Ready    <none>   23m   v1.21.14-eks-ba74326
    ip-192-168-45-240.us-west-2.compute.internal   Ready    <none>   23m   v1.21.14-eks-ba74326
    ip-192-168-63-84.us-west-2.compute.internal    Ready    <none>   23m   v1.21.14-eks-ba74326
    ip-192-168-75-56.us-west-2.compute.internal    Ready    <none>   23m   v1.21.14-eks-ba74326
    ip-192-168-85-226.us-west-2.compute.internal   Ready    <none>   23m   v1.21.14-eks-ba74326

Create an EFS volume for the SageMaker training job

In this use case, you speed up the SageMaker training job by training deep learning models from data already stored in Amazon EFS. This choice has the benefit of directly launching your training jobs from the data in Amazon EFS with no data movement required, resulting in faster training start times.

We create an EFS volume and deploy the EFS Container Storage Interface (CSI) driver. This is accomplished by a deployment script located in /eks/deployment/csi/efs within the aws-do-eks container.

This script assumes you have one EKS cluster in your account. Set CLUSTER_NAME=<eks_cluster_name> in case you have more than one EKS cluster.

cd /eks/deployment/csi/efs
./deploy.sh

This script provisions an EFS volume and creates mount targets for the subnets of the cluster VPC. It then deploys the EFS CSI driver and creates the efs-sc storage class and efs-pv persistent volume in the EKS cluster.

Upon successful completion of the script, you should see output like the following:

Generating efs-sc.yaml ...

Applying efs-sc.yaml ...
storageclass.storage.k8s.io/efs-sc created
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
efs-sc          efs.csi.aws.com         Delete          Immediate              false                  1s
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  36m

Generating efs-pv.yaml ...
Applying efs-pv.yaml ...
persistentvolume/efs-pv created
NAME     CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
efs-pv   5Gi        RWX            Retain           Available           efs-sc                  10s

Done ...

Create an Amazon S3 VPC endpoint

Your SageMaker training job and EFS file system use a private VPC. To give the SageMaker training cluster access to S3 buckets from this private VPC, you create a VPC endpoint:

cd /eks/vpc 
export CLUSTER_NAME=<eks-cluster> 
export REGION=<region> 
./vpc-endpoint-create.sh
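If you prefer to see what the script does at the API level, the following boto3 sketch creates a gateway-type S3 VPC endpoint. The VPC ID, route table ID, and Region are placeholders for your cluster's values.

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Placeholders: look up the VPC and route tables created for your EKS cluster
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-west-2.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
print(response["VpcEndpoint"]["VpcEndpointId"])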

You may now exit the aws-do-eks container shell and proceed to the next section:

exit

root@cdf4ecbebf62:/eks/deployment/csi/efs# exit
exit
TeamRole:~/environment/aws-do-eks (main) $

1.2 Deploy Kubeflow on AWS on Amazon EKS

To deploy Kubeflow on Amazon EKS, we use the aws-do-kubeflow project.

  1. Clone the repository using the following commands:
    cd ${working_dir}
    git clone https://github.com/aws-samples/aws-do-kubeflow
    cd aws-do-kubeflow

  2. Then configure the project:
    ./config.sh

    This script opens the project configuration file in a text editor. It’s important for AWS_REGION to be set to the Region your cluster is in, as well as AWS_CLUSTER_NAME to match the name of the cluster that you created earlier. By default, your configuration is already properly set, so if you don’t need to make any changes, just close the editor.

    ./build.sh
    ./run.sh
    ./exec.sh

    The build.sh script creates a Docker container image that has all the tools necessary to deploy and manage Kubeflow on an existing Kubernetes cluster. The run.sh script starts a container, using the Docker image, and the exec.sh script opens a command shell into the container, which we can use as our Kubeflow management environment. You can use the ./status.sh script to see if the aws-do-kubeflow container is up and running and the ./stop.sh and ./run.sh scripts to restart it as needed.

  3. After you have a shell opened in the aws-do-kubeflow container, you can verify that the configured cluster context is as expected:
    root@ip-172-31-43-155:/kubeflow# kubectx
    kubeflow@eks-kubeflow.us-west-2.eksctl.io

  4. To deploy Kubeflow on the EKS cluster, run the kubeflow-deploy.sh script:
    ./kubeflow-deploy.sh

    The deployment is successful when all pods in the kubeflow namespace enter the Running state. A typical output looks like the following code:

    Waiting for all Kubeflow pods to start Running ...
    
    Waiting for all Kubeflow pods to start Running ...
    
    Restarting central dashboard ...
    pod "centraldashboard-79f489b55-vr6lp" deleted
    /kubeflow/deploy/distro/aws/kubeflow-manifests /kubeflow/deploy/distro/aws
    /kubeflow/deploy/distro/aws
    
    Kubeflow deployment succeeded
    Granting cluster access to kubeflow profile user ...
    Argument not provided, assuming default user namespace kubeflow-user-example-com ...
    clusterrolebinding.rbac.authorization.k8s.io/kubeflow-user-example-com-cluster-admin-binding created
    Setting up access to Kubeflow Pipelines ...
    Argument not provided, assuming default user namespace kubeflow-user-example-com ...
    
    Creating pod-default for namespace kubeflow-user-example-com ...
    poddefault.kubeflow.org/access-ml-pipeline created

  5. To monitor the state of the Kubeflow pods, in a separate window, you can use the following command:
    watch kubectl -n kubeflow get pods

  6. Press Ctrl+C when all pods are Running, then expose the Kubeflow dashboard outside the cluster by running the following command:
    ./kubeflow-expose.sh

You should see output that looks like the following code:

root@ip-172-31-43-155:/kubeflow# ./kubeflow-expose.sh
root@ip-172-31-43-155:/kubeflow# Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080

This command port-forwards the Istio ingress gateway service from your cluster to your local port 8080. To access the Kubeflow dashboard, visit http://localhost:8080 and log in using the default user credentials (user@example.com/12341234). If you’re running the aws-do-kubeflow container in AWS Cloud9, then you can choose Preview, then choose Preview Running Application. If you’re running on Docker Desktop, you may need to run the ./kubeflow-expose.sh script outside of the aws-do-kubeflow container.

2. Set up the Kubeflow on AWS environment

To set up your Kubeflow on AWS environment, we create an EFS volume and a Jupyter notebook.

2.1 Create an EFS volume

To create an EFS volume, complete the following steps (a programmatic alternative is sketched after these steps):

  • On the Kubeflow dashboard, choose Volumes in the navigation pane.
  • Choose New volume.
  • For Name, enter efs-sc-claim.
  • For Volume size, enter 10.
  • For Storage class, choose efs-sc.
  • For Access mode, choose ReadWriteOnce.
  • Choose Create.
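As an alternative to the dashboard steps above, a volume claim with the same settings could be created with the Kubernetes Python client. This is a minimal sketch; the namespace below is the default Kubeflow user namespace and is an assumption.

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

# A claim equivalent to the dashboard settings: efs-sc storage class, 10 Gi, ReadWriteOnce
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="efs-sc-claim"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="efs-sc",
        resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="kubeflow-user-example-com", body=pvc
)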

2.2 Create a Jupyter notebook

To create a new notebook, complete the following steps:

  • On the Kubeflow dashboard, choose Notebooks in the navigation pane.
  • Choose New notebook.
  • For Name, enter aws-hybrid-nb.
  • For Jupyter Docker Image, choose the image c9e4w0g3/notebook-servers/jupyter-pytorch:1.11.0-cpu-py38-ubuntu20.04-e3-v1.1 (the latest available jupyter-pytorch DLC image).
  • For CPU, enter 1.
  • For Memory, enter 5.
  • For GPUs, leave as None.
  • Don’t make any changes to the Workspace Volume section.
  • In the Data Volumes section, choose Attach existing volume and expand the Existing volume section.
  • For Name, choose efs-sc-claim.
  • For Mount path, enter /home/jovyan/efs-sc-claim.
    This mounts the EFS volume to your Jupyter notebook pod, and you can see the folder efs-sc-claim in your Jupyter lab interface. You save the training dataset and training code to this folder so the training clusters can access it without needing to rebuild the container images for testing.
  • Select Allow access to Kubeflow Pipelines in the Configuration section.
  • Choose Launch.
    Verify that your notebook is created successfully (it may take a couple of minutes).
  • On the Notebooks page, choose Connect to log in to the JupyterLab environment.
  • On the Git menu, choose Clone a Repository.
  • For Clone a repo, enter https://github.com/aws-samples/aws-do-kubeflow.

3. Run distributed training

After you set up the Jupyter notebook, you can run the entire demo using the following high-level steps from the folder aws-do-kubeflow/workshop in the cloned repository:

  • PyTorch Distributed Data Parallel (DDP) training script: Refer to the PyTorch DDP training script cifar10-distributed-gpu-final.py, which includes a sample convolutional neural network and logic to distribute training on a multi-node CPU and GPU cluster (refer to section 3.1 for details).
  • Install libraries: Run the notebook 0_initialize_dependencies.ipynb to initialize all dependencies (refer to section 3.2 for details).
  • Run distributed PyTorch job training on Kubernetes: Run the notebook 1_submit_pytorchdist_k8s.ipynb to create and submit distributed training on one primary and two worker containers by generating the Kubernetes custom resource PyTorchJob YAML file with Python code (refer to section 3.3 for details).
  • Create a hybrid Kubeflow pipeline: Run the notebook 2_create_pipeline_k8s_sagemaker.ipynb to create the hybrid Kubeflow pipeline that runs distributed training on either SageMaker or Amazon EKS based on the runtime variable training_runtime (refer to section 3.4 for details).

Make sure you run the notebook 1_submit_pytorchdist_k8s.ipynb before you start the notebook 2_create_pipeline_k8s_sagemaker.ipynb.

In the subsequent sections, we discuss each of these steps in detail.

3.1 PyTorch Distributed Data Parallel (DDP) training script

As part of the distributed training, we train a classification model created by a simple convolutional neural network that operates on the CIFAR10 dataset. The training script cifar10-distributed-gpu-final.py uses only open-source libraries and can run on both Kubernetes and SageMaker training clusters, on either GPU devices or CPU instances. Let's look at a few important aspects of the training script before we run our notebook examples.

We use the torch.distributed module, which contains PyTorch support and communication primitives for multi-process parallelism across nodes in the cluster:

...
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data
import torch.utils.data.distributed
import torchvision
from torchvision import datasets, transforms
...

We create a simple image classification model using a combination of convolutional, max pooling, and linear layers, with a ReLU activation function applied in the forward pass of the model training:

# Define models
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

We use the PyTorch DataLoader, which combines the dataset with a DistributedSampler (which loads a subset of the data for each process when training with torch.nn.parallel.DistributedDataParallel) and provides a single-process or multi-process iterator over the data:

# Define data loader for training dataset
def _get_train_data_loader(batch_size, training_dir, is_distributed):
    logger.info("Get train data loader")

    train_set = torchvision.datasets.CIFAR10(root=training_dir,
                                             train=True,
                                             download=False,
                                             transform=_get_transforms())

    train_sampler = (
        torch.utils.data.distributed.DistributedSampler(train_set) if is_distributed else None
    )

    return torch.utils.data.DataLoader(
        train_set,
        batch_size=batch_size,
        shuffle=train_sampler is None,
        sampler=train_sampler)
...

If the training cluster has GPUs, the script runs the training on CUDA devices and the device variable holds the default CUDA device:

device = "cuda" if torch.cuda.is_available() else "cpu"
...

Before you run distributed training with PyTorch DistributedDataParallel across multiple nodes, you need to initialize the distributed environment by calling init_process_group. This is done on each machine of the training cluster.

dist.init_process_group(backend=args.backend, rank=host_rank, world_size=world_size)
...
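The host_rank and world_size values passed to init_process_group are not shown in the snippet above. A minimal sketch of how they might be obtained, assuming the launcher sets the standard PyTorch distributed environment variables, is:

import os

# Assumption: the launcher (for example, the PyTorchJob operator) sets these standard variables
world_size = int(os.environ.get("WORLD_SIZE", 1))
host_rank = int(os.environ.get("RANK", 0))

# MASTER_ADDR and MASTER_PORT must also be set so the processes can rendezvous
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")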

We instantiate the classifier model and copy the model to the target device. If distributed training is enabled to run on multiple nodes, the DistributedDataParallel class is used as a wrapper object around the model object, which allows synchronous distributed training across multiple machines. The input data is split on the batch dimension, and a replica of the model is placed on each machine and each device.

model = Net().to(device)

if is_distributed:
    model = torch.nn.parallel.DistributedDataParallel(model)

...

3.2 Install libraries

You install all the necessary libraries to run the PyTorch distributed training example. This includes the Kubeflow Pipelines SDK, the Training Operator Python SDK, the Kubernetes Python client, and the Amazon SageMaker Python SDK.

# Please run the below commands to install the necessary libraries
!pip install kfp==1.8.4
!pip install kubeflow-training
!pip install kubernetes
!pip install sagemaker

3.3 Run distributed PyTorch job training on Kubernetes

The notebook 1_submit_pytorchdist_k8s.ipynb creates the Kubernetes custom resource PyTorchJob YAML file using Kubeflow training and the Kubernetes client Python SDK. The following are a few important snippets from this notebook.

We create the PyTorchJob YAML with the primary and worker containers as shown in the following code:

# Define PyTorchJob custom resource manifest
pytorchjob = V1PyTorchJob(
    api_version="kubeflow.org/v1",
    kind="PyTorchJob",
    metadata=V1ObjectMeta(name=pytorch_distributed_jobname, namespace=user_namespace),
    spec=V1PyTorchJobSpec(
        run_policy=V1RunPolicy(clean_pod_policy="None"),
        pytorch_replica_specs={"Master": master,
                               "Worker": worker}
    )
)
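The master and worker objects referenced above are built elsewhere in the notebook. The following is a minimal sketch of how such replica specs could look, assuming the installed kubeflow-training SDK exposes V1ReplicaSpec; the container image, arguments, and resource values are hypothetical placeholders.

from kubernetes.client import V1Container, V1PodSpec, V1PodTemplateSpec, V1ResourceRequirements
from kubeflow.training import V1ReplicaSpec  # assumption: available in the installed SDK version

def replica_spec(replicas):
    # The PyTorch training operator expects the training container to be named "pytorch"
    return V1ReplicaSpec(
        replicas=replicas,
        restart_policy="OnFailure",
        template=V1PodTemplateSpec(
            spec=V1PodSpec(
                containers=[
                    V1Container(
                        name="pytorch",
                        image="<your-training-image-uri>",  # placeholder
                        args=["--backend", "gloo", "--epochs", "3"],
                        resources=V1ResourceRequirements(limits={"cpu": "1", "memory": "2Gi"}),
                    )
                ]
            ),
        ),
    )

master = replica_spec(1)   # one primary replica
worker = replica_spec(2)   # two worker replicas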

This is submitted to the Kubernetes control plane using PyTorchJobClient:

# Creates and submits the PyTorchJob custom resource to Kubernetes
pytorchjob_client = PyTorchJobClient()

pytorch_job_manifest = pytorchjob_client.create(pytorchjob)

View the Kubernetes training logs

You can view the training logs either from the same Jupyter notebook using Python code or from the Kubernetes client shell.

  • From the notebook, run the following code with the appropriate log_type parameter value to view the primary, worker, or all logs:
    #  Function Definition: read_logs(pyTorchClient: str, jobname: str, namespace: str, log_type: str) -> None:
    #    log_type: all, worker:all, master:all, worker:0, worker:1
    
    read_logs(pytorchjob_client, pytorch_distributed_jobname, user_namespace, "master:0")

  • From the Kubernetes client shell connected to the Kubernetes cluster, run the following commands using Kubectl to see the logs (substitute your namespace and pod names):
    kubectl get pods -n kubeflow-user-example-com
    kubectl logs <pod-name> -n kubeflow-user-example-com -f

    We set the world size to 3 because we're distributing the training across three processes running in one primary and two worker pods. Data is split on the batch dimension, and a third of the data is processed by the model in each container.

3.4 Create a hybrid Kubeflow pipeline

The notebook 2_create_pipeline_k8s_sagemaker.ipynb creates a hybrid Kubeflow pipeline based on the conditional runtime variable training_runtime, as shown in the following code. The notebook uses the Kubeflow Pipelines SDK, which provides a set of Python packages to specify and run ML workflow pipelines. As part of this SDK, we use the following packages:

  • The domain-specific language (DSL) package decorator dsl.pipeline, which decorates the Python functions to return a pipeline
  • The dsl.Condition package, which represents a group of operations that are only run when a certain condition is met, such as checking whether the training_runtime value is sagemaker or kubernetes

See the following code:

# Define your training runtime value with either 'sagemaker' or 'kubernetes'
training_runtime='sagemaker'

# Create Hybrid Pipeline using Kubeflow PyTorch Training Operators and Amazon SageMaker Service
@dsl.pipeline(name="PyTorch Training pipeline", description="Sample training job test")
def pytorch_cnn_pipeline(<training parameters>):

    # Pipeline Step 1: evaluate the condition. You can enter any logic here. For demonstration we are checking if GPU is needed for training
    condition_result = check_condition_op(training_runtime)

    # Pipeline Step 2: run training on Kubernetes using PyTorch Training Operators. This will be executed if gpus are not needed
    with dsl.Condition(condition_result.output == 'kubernetes', name="PyTorch_Comp"):
        train_task = pytorch_job_op(
            name=training_job_name,
            namespace=user_namespace,
            master_spec=json.dumps(master_spec_loaded),  # Please refer to the file at pipeline_yaml_specifications/pipeline_master_spec.yml
            worker_spec=json.dumps(worker_spec_loaded),  # Please refer to the file at pipeline_yaml_specifications/pipeline_worker_spec.yml
            delete_after_done=False
        ).after(condition_result)

    # Pipeline Step 3: run training on SageMaker using SageMaker Components for Pipelines. This will be executed if gpus are needed
    with dsl.Condition(condition_result.output == 'sagemaker', name="SageMaker_Comp"):
        training = sagemaker_train_op(
            region=region,
            image=train_image,
            job_name=training_job_name,
            training_input_mode=training_input_mode,
            hyperparameters=json.dumps({
                "backend": str(pytorch_backend),
                "batch-size": "64",
                "epochs": "3",
                "lr": str(learning_rate),
                "model-type": "custom",
                "sagemaker_container_log_level": "20",
                "sagemaker_program": "cifar10-distributed-gpu-final.py",
                "sagemaker_region": "us-west-2",
                "sagemaker_submit_directory": source_s3
            }),
            channels=channels,
            instance_type=instance_type,
            instance_count=instance_count,
            volume_size=volume_size,
            max_run_time=max_run_time,
            model_artifact_path=f's3://{bucket_name}/jobs',
            network_isolation=network_isolation,
            traffic_encryption=traffic_encryption,
            role=role,
            vpc_subnets=subnet_id,
            vpc_security_group_ids=security_group_id
        ).after(condition_result)

We configure SageMaker distributed training using two ml.p3.2xlarge instances.
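The pipeline parameters used in the sagemaker_train_op call are defined earlier in the notebook. The following values are illustrative placeholders consistent with the configuration described above, not the notebook's exact settings.

# Illustrative parameter values (placeholders) for the SageMaker training step
region = "us-west-2"
instance_type = "ml.p3.2xlarge"
instance_count = 2                      # two instances for distributed training
volume_size = 50                        # EBS volume size in GB
max_run_time = 3600                     # maximum training time in seconds
training_input_mode = "File"
train_image = "<training-image-uri>"    # placeholder
bucket_name = "<your-s3-bucket>"        # placeholder
network_isolation = False
traffic_encryption = False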

After the pipeline is defined, you can compile the pipeline to an Argo YAML specification using the Kubeflow Pipelines SDK's kfp.compiler package. You can run this pipeline using the Kubeflow Pipelines SDK client, which calls the Pipelines service endpoint and passes in the appropriate authentication headers right from the notebook. See the following code:

# DSL Compiler that compiles pipeline functions into workflow yaml.
kfp.compiler.Compiler().compile(pytorch_cnn_pipeline, "pytorch_cnn_pipeline.yaml")

# Connect to Kubeflow Pipelines using the Kubeflow Pipelines SDK client
client = kfp.Client()

experiment = client.create_experiment(name="kubeflow")

# Run a specified pipeline
my_run = client.run_pipeline(experiment.id, "pytorch_cnn_pipeline", "pytorch_cnn_pipeline.yaml")

# Please click “Run details” link generated below this cell to view your pipeline. You can click every pipeline step to see logs.

If you get a sagemaker import error, run !pip install sagemaker and restart the kernel (on the Kernel menu, choose Restart Kernel).

Choose the Run details link under the last cell to view the Kubeflow pipeline.

Repeat the pipeline creation step with training_runtime='kubernetes' to test the pipeline run on a Kubernetes environment. The training_runtime variable can also be passed in your CI/CD pipeline in a production scenario.

View the Kubeflow pipeline run logs for the SageMaker component

The following screenshot shows our pipeline details for the SageMaker component.

Choose the training job step and on the Logs tab, choose the CloudWatch logs link to access the SageMaker logs.

The following screenshot shows the CloudWatch logs for each of the two ml.p3.2xlarge instances.

Choose any of the groups to see the logs.

View the Kubeflow pipeline run logs for the Kubeflow PyTorchJob Launcher component

The following screenshot shows the pipeline details for our Kubeflow component.

Run the following commands using Kubectl on your Kubernetes client shell connected to the Kubernetes cluster to see the logs (substitute your namespace and pod names):

kubectl get pods -n kubeflow-user-example-com
kubectl logs <pod-name> -n kubeflow-user-example-com -f

4. Clean up

To clean up all the resources we created in the account, we need to remove them in reverse order.

  1. Delete the Kubeflow installation by running ./kubeflow-remove.sh in the aws-do-kubeflow container. The first set of commands is optional and can be used if you don't already have a command shell open in your aws-do-kubeflow container.
    cd aws-do-kubeflow
    ./status.sh
    ./start.sh
    ./exec.sh
    
    ./kubeflow-remove.sh

  2. From the aws-do-eks container folder, remove the EFS volume. The first set of commands is optional and can be used if you don't already have a command shell open in your aws-do-eks container.
    cd aws-do-eks
    ./status.sh
    ./start.sh
    ./exec.sh
    
    cd /eks/deployment/csi/efs
    ./delete.sh
    ./efs-delete.sh

    Deleting Amazon EFS is necessary in order to release the network interface associated with the VPC we created for our cluster. Note that deleting the EFS volume destroys any data that is stored on it.

  3. From the aws-do-eks container, run the eks-delete.sh script to delete the cluster and any other resources associated with it, including the VPC:
    cd /eks
    ./eks-delete.sh

Summary

In this post, we discussed some of the typical challenges of distributed model training and ML workflows. We provided an overview of the Kubeflow on AWS distribution and shared two open-source projects (aws-do-eks and aws-do-kubeflow) that simplify provisioning the infrastructure and the deployment of Kubeflow on it. Finally, we described and demonstrated a hybrid architecture that enables workloads to transition seamlessly between running on a self-managed Kubernetes and fully managed SageMaker infrastructure. We encourage you to use this hybrid architecture for your own use cases.

You can follow the AWS Labs repository to track all AWS contributions to Kubeflow. You can also find us on the Kubeflow #AWS Slack Channel; your feedback there will help us prioritize the next features to contribute to the Kubeflow project.

Special thanks to Sree Arasanagatta (Software Development Manager AWS ML) and Suraj Kota (Software Dev Engineer) for their support to the launch of this post.


About the authors

Kanwaljit Khurmi is an AI/ML Specialist Solutions Architect at Amazon Web Services. He works with the AWS product, engineering and customers to provide guidance and technical assistance helping them improve the value of their hybrid ML solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Gautam Kumar is a Software Engineer with AWS AI Deep Learning. He has developed AWS Deep Learning Containers and AWS Deep Learning AMI. He is passionate about building tools and systems for AI. In his spare time, he enjoys biking and reading books.

Alex Iankoulski is a full-stack software and infrastructure architect who likes to do deep, hands-on work. He is currently a Principal Solutions Architect for Self-managed Machine Learning at AWS. In his role he focuses on helping customers with containerization and orchestration of ML and AI workloads on container-powered AWS services. He is also the author of the open source Do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges. During the past 10 years, Alex has worked on combating climate change, democratizing AI and ML, making travel safer, healthcare better, and energy smarter.

Read More

Bundesliga Match Fact Pressure Handling: Evaluating players’ performances in high-pressure situations on AWS

Pressing or pressure in football is a process in which a team seeks to apply stress to the opponent player who possesses the ball. A team applies pressure to limit the time an opposition player has left to make a decision, reduce passing options, and ultimately attempt to turn over ball possession. Although nearly all teams seek to apply pressure to their opponents, their strategy to do so may vary.

Some teams adopt a so-called deep press, leaving the opposition with time and room to move the ball up the pitch. However, once the ball reaches the last third of the field, defenders aim to intercept the ball by pressuring the ball carrier. A slightly less conservative approach is the middle press. Here pressure is applied around the halfway line, where defenders attempt to lead the buildup in a certain direction, blocking open players and passing lanes to ultimately force the opposition back. Borussia Dortmund under Jürgen Klopp was one of the most efficient teams to use a middle press. The most aggressive type of pressing teams apply is the high press strategy. Here a team seeks to pressure the defenders and goalkeeper, focusing on direct pressure on the ball carrier, leaving them little time to select the right passing option as they have to cover the ball. In this strategy, the pressing team seeks to turn over possession by challenges or intercepting sloppy passes.

In February 2021, the Bundesliga released the first insight into how teams apply pressure with the Most Pressed Player Match Fact powered by AWS. Most Pressed Player quantifies defensive pressure encountered by players in real time, allowing fans to compare how some players receive pressure vs. others. Over the last 1.5 years, this Match Fact provided fans with new insights on how much teams were applying pressure, but also resulted in new questions, such as “Was this pressure successful?” or “How is this player handling pressure?”

Introducing Pressure Handling, a new Bundesliga Match Fact that aims to evaluate the performance of a frequently pressed player using different metrics. Pressure Handling is a further development of Most Pressed Player, and adds a quality component to the number of significant pressure situations a player in ball possession finds themselves in. A central statistic in this new Match Fact is Escape Rate, which indicates how often a player resolves pressure situations successfully by keeping possession for their team. In addition, fans get insight into the passing and shooting performance of players under pressure.

This post dives deeper into how the AWS team worked closely together with the Bundesliga to bring the Pressure Handling Match Fact to life.

How does it work?

This new Bundesliga Match Fact profiles the performance of players in pressure situations. For example, an attacking player in ball possession may get pressured by opposing defenders. There is a significant probability of them losing the ball. If that player manages to resolve the pressure situation without losing the ball, they increase their performance under pressure. Not losing the ball is defined as the team retaining ball possession after the individual ball possession of the player ends. This could, for instance, be by a successful pass to a teammate, by being fouled, or by obtaining a throw-in or a corner kick. In contrast, a pressed player can lose the ball through a tackle or an unsuccessful pass. We only count those ball possessions in which the player received the ball from a teammate. That way, we exclude situations where they intercept the ball and are under pressure immediately (which usually happens).

We aggregate the pressure handling performance of a player into a single KPI called escape rate. The escape rate is defined as the fraction of ball possessions in which a player was under pressure and didn't lose the ball. In this case, "under pressure" is defined as a pressure value of >0.6 (see our previous post for more information on the pressure value itself). The escape rate allows us to evaluate players on a per-match or per-season basis. The following heuristic is used for computing the escape rate (a code sketch follows the list):

  1. We start with a series of pressure events, based on the existing Most Pressed Player Match Fact. Each event consists of a list containing all individual pressure events on the ball carrier during one individual ball possession (IBP) phase.
  2. For each phase, we compute the maximum aggregated pressure on the ball carrier.
  3. As mentioned earlier, a pressure phase needs to satisfy two conditions in order to be considered:
    1. The previous IBP was by a player of the same team.
    2. The maximum pressure on the player during the current IBP was > 0.6.
  4. If the subsequent IBP belongs to a player of the same team, we count this as an escape. Otherwise, it's counted as a lost ball.
  5. For each player, we compute the escape rate by counting the number of escapes and dividing it by the number of pressure events.
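The following is a minimal sketch of the heuristic above in code. The pressure_phases input is a hypothetical list of per-IBP records with the fields used below, not the actual Match Fact data model.

PRESSURE_THRESHOLD = 0.6

def escape_rate(pressure_phases):
    """Compute a player's escape rate from a list of individual ball possession (IBP) phases."""
    escapes = 0
    pressure_events = 0
    for phase in pressure_phases:
        # Condition 1: the previous IBP was by a player of the same team
        if not phase["previous_ibp_same_team"]:
            continue
        # Condition 2: the maximum aggregated pressure during this IBP exceeds the threshold
        if phase["max_pressure"] <= PRESSURE_THRESHOLD:
            continue
        pressure_events += 1
        # Escape: the subsequent IBP belongs to a player of the same team
        if phase["next_ibp_same_team"]:
            escapes += 1
    return escapes / pressure_events if pressure_events else None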

Examples of escapes

To illustrate the different ways of successfully resolving pressure, the following videos show four examples of Joshua Kimmich escaping pressure situations (Matchday 5, Season 22/23 – Union Berlin vs. Bayern Munich).

Joshua Kimmich moving out of pressure and passing to the wing.

Joshua Kimmich playing a quick pass forward to escape ensuing pressure.

Joshua Kimmich escaping pressure twice. The first escape is by a sliding tackle of the opponent, which nevertheless resulted in retained team ball possession. The second escape is by being fouled and thereby retaining team ball possession.

Joshua Kimmich escapes pressure with a quick move and a pass.

Pressure Handling findings

Let’s look at some findings.

With the Pressure Handling Match Fact, players are ranked according to their escape rate on a match basis. In order to have a fair comparison between players, we only rank players that were under pressure at least 10 times.

The following table shows the number of times a player was in the top 2 of match rankings over the first seven matchdays of the 2022/23 season. We only show players with at least three appearances in the top 2.

Number of Times in Top 2    Player               Number of Times in Ranking
4                           Joshua Kimmich       5
4                           Exequiel Palacios    6
3                           Jude Bellingham      7
3                           Alphonso Davies      6
3                           Lars Stindl          3
3                           Jonas Hector         6
3                           Vincenzo Grifo       4
3                           Kevin Stöger         7

Joshua Kimmich and Exequiel Palacios lead the pack with four appearances each in the top 2 of match rankings. A special mention may go to Lars Stindl, who appeared in the top 2 three times despite playing only three times before an injury prevented further Bundesliga starts.

How is it implemented?

The Bundesliga Match Fact Pressure Handling consumes positions and event data, as well as data from other Bundesliga Match Facts, namely xPasses and Most Pressed Player. Match Facts run as independent AWS Fargate containers inside Amazon Elastic Container Service (Amazon ECS). To guarantee the latest data is reflected in the Pressure Handling calculations, we use Amazon Managed Streaming for Apache Kafka (Amazon MSK).

Amazon MSK allows different Bundesliga Match Facts to send and receive the newest events and updates in real time. By consuming from Kafka, we receive the most up-to-date events from all systems. The following diagram illustrates the end-to-end workflow for Pressure Handling.

Pressure Handling starts its calculation after an event is received from the Most Pressed Player Match Fact. The Pressure Handling container writes the current statistics to a topic in Amazon MSK. A central AWS Lambda function consumes these messages from Amazon MSK, and writes the escape rates to an Amazon Aurora database. This data is then used for interactive near-real-time visualizations using Amazon QuickSight. Besides that, the results are also sent to a feed, which then triggers another Lambda function that sends the data to external systems where broadcasters worldwide can consume it.
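As an illustration of the consumption side of this architecture, the following is a minimal sketch of a Lambda handler that reads Pressure Handling messages from an MSK trigger and writes escape rates to Aurora. The message fields, table name, and the use of a MySQL-compatible driver are assumptions for illustration, not the actual implementation.

import base64
import json
import os

import pymysql  # assumption: an Aurora MySQL-compatible endpoint

# Connection details are supplied through environment variables (placeholders)
connection = pymysql.connect(
    host=os.environ["AURORA_HOST"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
)

def handler(event, context):
    # An MSK trigger delivers records grouped by topic-partition, with base64-encoded values
    with connection.cursor() as cursor:
        for records in event["records"].values():
            for record in records:
                payload = json.loads(base64.b64decode(record["value"]))
                # Hypothetical table and field names
                cursor.execute(
                    "INSERT INTO escape_rates (match_id, player_id, escape_rate) VALUES (%s, %s, %s)",
                    (payload["matchId"], payload["playerId"], payload["escapeRate"]),
                )
    connection.commit()
    return {"status": "ok"}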

Summary

In this post, we demonstrated how the new Bundesliga Match Fact Pressure Handling makes it possible to quantify and objectively compare the performance of different Bundesliga players in high-pressure situations. To do so, we build on and combine previously published Bundesliga Match Facts in real time. This allows commentators and fans to understand which players shine when pressured by their opponents.

The new Bundesliga Match Fact is the result of an in-depth analysis by the Bundesliga’s football experts and AWS data scientists. Extraordinary escape rates are shown in the live ticker of the respective matches in the official Bundesliga app. During a broadcast, escape rates are provided to commentators through the data story finder and visually shown to fans at key moments, such as when a player with a high pressure count and escape rate scores a goal, passes exceptionally well, or overcomes many challenges while staying in control of the ball.

We hope that you enjoy this brand-new Bundesliga Match Fact and that it provides you with new insights into the game. To learn more about the partnership between AWS and Bundesliga, visit Bundesliga on AWS!

We’re excited to learn what patterns you will uncover. Share your insights with us: @AWScloud on Twitter, with the hashtag #BundesligaMatchFacts.


About the Authors

Simon Rolfes played 288 Bundesliga games as a central midfielder, scored 41 goals, and won 26 caps for Germany. Currently, Rolfes serves as Managing Director Sport at Bayer 04 Leverkusen, where he oversees and develops the pro player roster, the scouting department, and the club’s youth development. Simon also writes weekly columns on Bundesliga.com about the latest Bundesliga Match Facts powered by AWS. There he offers his expertise as a former player, captain, and TV analyst to highlight the impact of advanced statistics and machine learning into the world of football.

Luuk Figdor is a Sports Technology Advisor in the AWS Professional Services team. He works with players, clubs, leagues, and media companies such as the Bundesliga and Formula 1 to help them tell stories with data using machine learning. In his spare time, he likes to learn all about the mind and the intersection between psychology, economics, and AI.

Javier Poveda-Panter is a Data Scientist for EMEA sports customers within the AWS Professional Services team. He enables customers in the area of spectator sports to innovate and capitalize on their data, delivering high-quality user and fan experiences through machine learning and data science. He follows his passion for a broad range of sports, music, and AI in his spare time.

Tareq Haschemi is a consultant within AWS Professional Services. His skills and areas of expertise include application development, data science, machine learning, and big data. He supports customers in developing data-driven applications within the cloud. Prior to joining AWS, he was also a consultant in various industries such as aviation and telecommunications. He is passionate about enabling customers on their data/AI journey to the cloud.

Fotinos Kyriakides is a Consultant with AWS Professional Services. Through his work as a Data Engineer and Application Developer, he supports customers in developing applications in the cloud that leverage and innovate on insights generated from data. In his spare time, he likes to run and explore nature.

Uwe Dick is a Data Scientist at Sportec Solutions AG. He works to enable Bundesliga clubs and media to optimize their performance using advanced stats and data—before, after, and during matches. In his spare time, he settles for less and just tries to last the full 90 minutes for his recreational football team.

Read More

Bundesliga Match Fact Win Probability: Quantifying the effect of in-game events on winning chances using machine learning on AWS

Ten years from now, the technological fitness of clubs will be a key contributor towards their success. Today we’re already witnessing the potential of technology to revolutionize the understanding of football. xGoals quantifies and allows comparison of goal scoring potential of any shooting situation, while xThreat and EPV models predict the value of any in-game moment. Ultimately, these and other advanced statistics serve one purpose: to improve the understanding of who will win and why. Enter the new Bundesliga Match Fact: Win Probability.

In Bayern's second match against Bochum last season, the tables turned unexpectedly. Early in the match, Lewandowski scores 1:0 after just 9 minutes. The "Grey Mouse" of the league is instantly reminded of their 7:0 disaster when facing Bayern for the first time that season. But not this time: Christopher Antwi-Adjei scores his first goal for the club just 5 minutes later. After conceding a penalty goal in the 38th minute, the team from Monaco di Bavaria seems paralyzed and things begin to erupt: Gamboa nutmegs Coman and finishes with an absolute corker of a goal, and Holtmann makes it 4:1 close to halftime with a dipper from the left. Bayern hadn't conceded this many goals in the first half since 1975, and was barely able to walk away with a 4:2 result. Who could have guessed that? Both teams played without their first keepers, which for Bayern meant missing out on their captain Manuel Neuer. Could his presence have saved them from this unexpected result?

Similarly, Cologne pulled off two extraordinary zingers in the 2020/2021 season. When they faced Dortmund, they had gone 18 matches without a win, while BVB's Haaland was providing a master class in scoring goals that season (23 in 22 matches). The role of the favorite was clear, yet Cologne took an early lead with just 9 minutes on the clock. In the beginning of the second half, Skhiri scored a carbon-copy goal of his first one: 0:2. Dortmund subbed in attacking strength, created big chances, and scored 1:2. Of all players, Haaland missed a sitter 5 minutes into extra time, and Cologne claimed their first 3 points in Dortmund in almost 30 years.

Later in that season, Cologne—being last in the home-table—surprised RB Leipzig, who had all the motivation to close in on the championship leader Bayern. The opponent Leipzig pressured the “Billy Goats” with a team season record of 13 shots at goal in the first half, increasing their already high chances of a win. Ironically, Cologne scored the 1:0 with the first shot at goal in minute 46. After the “Red Bulls” scored a well-deserved equalizer, they slept on a throw-in just 80 seconds later, leading to Jonas Hector scoring for Cologne again. Just like Dortmund, Leipzig now put all energy into offense, but the best they managed to achieve was hitting the post in overtime.

For all of these matches, experts and novices alike would have wrongly guessed the winner, even well into the match. But what are the events that led to these surprising in-game swings of win probability? At what minute did the underdog’s chance of winning overtake the favorite’s as they ran out of time? Bundesliga and AWS have worked together to compute and illustrate the live development of winning chances throughout matches, enabling fans to see key moments of probability swings. The result is the new machine learning (ML)-powered Bundesliga Match Fact: Win Probability.

How does it work?

The new Bundesliga Match Fact Win Probability was developed by building ML models that analyzed over 1,000 historical games. The live model takes the pre-match estimates and adjusts them according to the match proceedings based on features that affect the outcome, including the following:

  • Goals
  • Penalties
  • Red cards
  • Substitutions
  • Time passed
  • Goal scoring chances created
  • Set-piece situations

The live model is trained using a neural network architecture and uses a Poisson distribution approach to predict a goals-per-minute rate r for each team, as described in the following equation:
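The equation itself is not reproduced in this text. Assuming the standard Poisson formulation, the probability that a team scores k goals in the remaining T minutes, given its predicted per-minute rate r, would be:

$$ P(k \mid r, T) = \frac{(rT)^{k}\, e^{-rT}}{k!} $$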

Those rates can be viewed as an estimation of a team’s strength and are computed using a series of dense layers based on the inputs. Based on these rates and the difference between the opponents, the probabilities of a win and a draw are computed in real time.

The input to the model is a 3-tuple of input features, current goal difference, and remaining playtime in minutes.
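
To make this mechanism concrete, the following minimal sketch (an illustration under our own assumptions, not the production model) derives win, draw, and loss probabilities from two predicted rates together with the current goal difference and the remaining playtime; the function names and the cap of 15 remaining goals per team are illustrative. In the real model, the rates themselves come from the neural network described above.

import math

def poisson_pmf(k: int, mean: float) -> float:
    # Probability of scoring exactly k further goals, given an expected total (rate * minutes)
    return (mean ** k) * math.exp(-mean) / math.factorial(k)

def outcome_probabilities(rate_home: float, rate_away: float,
                          goal_diff: int, minutes_left: float,
                          max_goals: int = 15):
    # Expected remaining goals for each team
    mu_home = rate_home * minutes_left
    mu_away = rate_away * minutes_left
    p_home = p_draw = p_away = 0.0
    # Sum over all plausible remaining scorelines
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, mu_home) * poisson_pmf(a, mu_away)
            final_diff = goal_diff + h - a
            if final_diff > 0:
                p_home += p
            elif final_diff == 0:
                p_draw += p
            else:
                p_away += p
    return p_home, p_draw, p_away

# Example: home team leads 1:0 with 30 minutes to play
print(outcome_probabilities(0.02, 0.015, goal_diff=1, minutes_left=30))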

The first component of the three input dimensions consists of a feature set that describes the current game action in real time for both teams through performance metrics. These include various aggregated team-based xG values, with particular attention to the shots taken in the last 15 minutes before the prediction. We also process red cards, penalties, corner kicks, and the number of dangerous free kicks. A dangerous free kick is classified as a free kick closer than 25 m to the opponent’s goal. During the development of the model, besides the influence of the previously released Bundesliga Match Fact xGoals, we also evaluated the impact of the Bundesliga Match Fact Skill in the model. This means that the model reacts to the substitution of top players (players with badges in the skills Finisher, Initiator, or Ball winner).
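
As a rough illustration of how such live features might be aggregated from an event feed, the sketch below counts dangerous free kicks (closer than 25 m to goal) and sums xG over shots from the last 15 minutes; the event dictionary keys are hypothetical and do not reflect the actual data schema.

def aggregate_live_features(events, current_minute, team):
    # events: list of dicts, e.g. {"team": "H", "type": "shot", "minute": 34,
    #         "xg": 0.12, "distance_to_goal_m": 18.0}  (hypothetical schema)
    recent_xg = sum(
        e.get("xg", 0.0)
        for e in events
        if e["team"] == team and e["type"] == "shot"
        and current_minute - e["minute"] <= 15
    )
    dangerous_free_kicks = sum(
        1 for e in events
        if e["team"] == team and e["type"] == "free_kick"
        and e.get("distance_to_goal_m", 1e9) < 25
    )
    red_cards = sum(1 for e in events if e["team"] == team and e["type"] == "red_card")
    return {"recent_xg": recent_xg,
            "dangerous_free_kicks": dangerous_free_kicks,
            "red_cards": red_cards}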

Win Probability example

Let’s look at a match from the current season (2022/2023). The following graph shows the win probability for the Bayern Munich and Stuttgart match from matchday 6.

The pre-match model calculated a win probability of 67% for Bayern, 14% for Stuttgart, and 19% for a draw. When we look at the course of the match, we see a large impact of the goals scored in minutes 36, 57, and 60. Until the first minute of stoppage time, the score was 2:1 for Bayern. Only a successful penalty shot by S. Guirassy in minute 90+2 secured the draw. The Win Probability Live Model therefore corrected the draw forecast from 5% to over 90%. The result is an unexpected late swing, with Bayern’s win probability decreasing from 90% to 8% in minute 90+2. The graph is representative of the swing in atmosphere in the Allianz Arena that day.

How is it implemented?

Win Probability consumes event data from an ongoing match (goal events, fouls, red cards, and more) as well as data produced by other Match Facts, such as xGoals. For real-time updates of probabilities, we use Amazon Managed Streaming for Apache Kafka (Amazon MSK) as a central data streaming and messaging solution. This way, event data, positions data, and the outputs of different Bundesliga Match Facts can be communicated between containers in real time.

The following diagram illustrates the end-to-end workflow for Win Probability.

Match-related data is ingested through an external provider (DataHub). Metadata of the match is ingested and processed in an AWS Lambda function. Positions and events data are ingested through an AWS Fargate container (MatchLink). All ingested data is then published for consumption in the respective MSK topics. The heart of the Win Probability Match Fact sits in a dedicated Fargate container (BMF WinProbability), which runs for the duration of the respective match and consumes all required data through Amazon MSK. The ML models (live and pre-match) are deployed on Amazon SageMaker Serverless Inference endpoints. Serverless endpoints automatically launch and scale compute resources depending on incoming traffic, eliminating the need to choose instance types or manage scaling policies. With this pay-per-use model, Serverless Inference is ideal for workloads that have idle periods between traffic spurts. When there are no Bundesliga matches, there is no cost for idle resources.
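
As a rough sketch of how such a container might consume these streams, the following example uses the kafka-python client; the topic names, broker placeholder, and message schema are assumptions for illustration, not the actual platform internals.

import json
from kafka import KafkaConsumer  # kafka-python client

# Hypothetical topic names and broker address
consumer = KafkaConsumer(
    "match-events", "match-positions", "bmf-xgoals",
    bootstrap_servers=["<MSK_BOOTSTRAP_BROKERS>"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Update the live features and re-query the Win Probability model here
    print(message.topic, event)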

Shortly before kick-off, we generate our initial set of features and calculate the pre-match win probabilities by calling the PreMatch SageMaker endpoint. With those PreMatch probabilities, we then initialize the live model, which reacts in real time to relevant in-game events and is continuously queried to receive current win probabilities.
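
A minimal sketch of that calling pattern with boto3 follows; the endpoint names and JSON payload shapes are assumptions for illustration, not the production interfaces.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical endpoint names; the real names are internal to the Match Facts platform
PRE_MATCH_ENDPOINT = "win-probability-pre-match"
LIVE_ENDPOINT = "win-probability-live"

def pre_match_probabilities(features: dict) -> dict:
    # Called once, shortly before kick-off
    response = runtime.invoke_endpoint(
        EndpointName=PRE_MATCH_ENDPOINT,
        ContentType="application/json",
        Body=json.dumps(features),
    )
    return json.loads(response["Body"].read())

def live_probabilities(pre_match: dict, live_features: dict,
                       goal_diff: int, minutes_left: float) -> dict:
    # Queried continuously as in-game events arrive from the MSK topics
    payload = {"pre_match": pre_match,
               "features": live_features,
               "goal_diff": goal_diff,
               "minutes_left": minutes_left}
    response = runtime.invoke_endpoint(
        EndpointName=LIVE_ENDPOINT,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())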

The calculated probabilities are then sent back to DataHub to be provided to other Match Facts consumers. The probabilities are also sent to a dedicated topic on the MSK cluster, to be consumed by other Bundesliga Match Facts. A Lambda function consumes all probabilities from the respective Kafka topic and writes them to an Amazon Aurora database. This data is then used for interactive, near-real-time visualizations with Amazon QuickSight.

Summary

In this post, we demonstrated how the new Bundesliga Match Fact Win Probability shows the impact of in-game events on the chances of a team winning or losing a match. To do so, we build on and combine previously published Bundesliga Match Facts in real time. This allows commentators and fans to uncover moments of probability swings and more during live matches.

The new Bundesliga Match Fact is the result of an in-depth analysis by the Bundesliga’s football experts and AWS data scientists. Win probabilities are shown in the live ticker of the respective matches in the official Bundesliga app. During a broadcast, win probabilities are provided to commentators through the data story finder and visually shown to fans at key moments, such as when the underdog takes the lead and is now most likely to win the game.

We hope that you enjoy this brand-new Bundesliga Match Fact and that it provides you with new insights into the game. To learn more about the partnership between AWS and Bundesliga, visit Bundesliga on AWS!

We’re excited to learn what patterns you will uncover. Share your insights with us: @AWScloud on Twitter, with the hashtag #BundesligaMatchFacts.


About the Authors

Simon Rolfes played 288 Bundesliga games as a central midfielder, scored 41 goals, and won 26 caps for Germany. Currently, Rolfes serves as Managing Director Sport at Bayer 04 Leverkusen, where he oversees and develops the pro player roster, the scouting department, and the club’s youth development. Simon also writes weekly columns on Bundesliga.com about the latest Bundesliga Match Facts powered by AWS. There he offers his expertise as a former player, captain, and TV analyst to highlight the impact of advanced statistics and machine learning on the world of football.

Tareq Haschemi is a consultant within AWS Professional Services. His skills and areas of expertise include application development, data science, machine learning, and big data. He supports customers in developing data-driven applications within the cloud. Prior to joining AWS, he was also a consultant in various industries such as aviation and telecommunications. He is passionate about enabling customers on their data/AI journey to the cloud.

Javier Poveda-Panter is a Data Scientist for EMEA sports customers within the AWS Professional Services team. He enables customers in the area of spectator sports to innovate and capitalize on their data, delivering high-quality user and fan experiences through machine learning and data science. He follows his passion for a broad range of sports, music and AI in his spare time.

Luuk Figdor is a Sports Technology Advisor in the AWS Professional Services team. He works with players, clubs, leagues, and media companies such as the Bundesliga and Formula 1 to help them tell stories with data using machine learning. In his spare time, he likes to learn all about the mind and the intersection between psychology, economics, and AI.

Gabriel Zylka is a Machine Learning Engineer within AWS Professional Services. He works closely with customers to accelerate their cloud adoption journey. Specialized in the MLOps domain, he focuses on productionizing machine learning workloads by automating end-to-end machine learning lifecycles and helping achieve desired business outcomes.

Jakub Michalczyk is a Data Scientist at Sportec Solutions AG. Several years ago, he chose math studies over playing football, as he came to the conclusion that he wasn’t good enough at the latter. Now he combines both these passions in his professional career by applying machine learning methods to gain a better insight into this beautiful game. In his spare time, he still enjoys playing seven-a-side football, watching crime movies, and listening to film music.


Unified data preparation, model training, and deployment with Amazon SageMaker Data Wrangler and Amazon SageMaker Autopilot – Part 2

Depending on the quality and complexity of data, data scientists spend between 45% and 80% of their time on data preparation tasks. This implies that data preparation and cleansing take valuable time away from real data science work. After a machine learning (ML) model is trained with prepared data and readied for deployment, data scientists must often rewrite the data transformations used for preparing data for ML inference. This can stretch the time it takes to deploy a useful model that can run inference and score data in its raw shape and form.

In Part 1 of this series, we demonstrated how Data Wrangler enables a unified data preparation and model training experience with Amazon SageMaker Autopilot in just a few clicks. In this second and final part of this series, we focus on a feature that includes and reuses Amazon SageMaker Data Wrangler transforms, such as missing value imputers, ordinal or one-hot encoders, and more, along with the Autopilot models for ML inference. This feature enables automatic preprocessing of the raw data with the reuse of Data Wrangler feature transforms at the time of inference, further reducing the time required to deploy a trained model to production.

Solution overview

Data Wrangler reduces the time to aggregate and prepare data for ML from weeks to minutes, and Autopilot automatically builds, trains, and tunes the best ML models based on your data. With Autopilot, you still maintain full control and visibility of your data and model. Both services are purpose-built to make ML practitioners more productive and accelerate time to value.

The following diagram illustrates our solution architecture.

Prerequisites

Because this post is the second in a two-part series, make sure you’ve successfully read and implemented Part 1 before continuing.

Export and train the model

In Part 1, after data preparation for ML, we discussed how you can use the integrated experience in Data Wrangler to analyze datasets and easily build high-quality ML models in Autopilot.

This time, we use the Autopilot integration once again to train a model against the same training dataset, but instead of performing bulk inference, we perform real-time inference against an Amazon SageMaker inference endpoint that is created automatically for us.

In addition to the convenience provided by automatic endpoint deployment, we demonstrate how you can also deploy with all the Data Wrangler feature transforms as a SageMaker serial inference pipeline. This enables automatic preprocessing of the raw data with the reuse of Data Wrangler feature transforms at the time of inference.

Note that this feature is currently only supported for Data Wrangler flows that don’t use join, group by, concatenate, and time series transformations.

We can use the new Data Wrangler integration with Autopilot to directly train a model from the Data Wrangler data flow UI.

  1. Choose the plus sign next to the Scale values node, and choose Train model.
  2. For Amazon S3 location, specify the Amazon Simple Storage Service (Amazon S3) location where SageMaker exports your data.
    If you’re presented with a root bucket path by default, Data Wrangler creates a unique export subdirectory under it; you don’t need to modify this default root path unless you want to. Autopilot uses this location to automatically train a model, saving you the time of having to define the output location of the Data Wrangler flow and then define the input location of the Autopilot training data. This makes for a more seamless experience.
  3. Choose Export and train to export the transformed data to Amazon S3.

    When export is successful, you’re redirected to the Create an Autopilot experiment page, with the Input data S3 location already filled in for you (it was populated from the results of the previous page).
  4. For Experiment name, enter a name (or keep the default name).
  5. For Target, choose Outcome as the column you want to predict.
  6. Choose Next: Training method.

As detailed in the post Amazon SageMaker Autopilot is up to eight times faster with new ensemble training mode powered by AutoGluon, you can either let Autopilot select the training mode automatically based on the dataset size, or select the training mode manually for either ensembling or hyperparameter optimization (HPO).

The details of each option are as follows:

  • Auto – Autopilot automatically chooses either ensembling or HPO mode based on your dataset size. If your dataset is larger than 100 MB, Autopilot chooses HPO; otherwise it chooses ensembling.
  • Ensembling – Autopilot uses the AutoGluon ensembling technique to train several base models and combines their predictions using model stacking into an optimal predictive model.
  • Hyperparameter optimization – Autopilot finds the best version of a model by tuning hyperparameters using the Bayesian optimization technique and running training jobs on your dataset. HPO selects the algorithms most relevant to your dataset and picks the best range of hyperparameters to tune the models. For our example, we leave the default selection of Auto; a programmatic sketch of how these modes are set follows this list.
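
If you drive Autopilot programmatically instead of through the Studio UI, the training mode maps to the Mode field of AutoMLJobConfig when creating an AutoML job with boto3. The following is a minimal sketch; the job name, S3 paths, and role ARN are placeholders.

import boto3

sm_client = boto3.client("sagemaker")

# Placeholder names, paths, and ARN -- replace with your own
sm_client.create_auto_ml_job(
    AutoMLJobName="diabetes-autopilot-demo",
    InputDataConfig=[{
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://your-bucket/data-wrangler-export/",
        }},
        "TargetAttributeName": "Outcome",
    }],
    OutputDataConfig={"S3OutputPath": "s3://your-bucket/autopilot-output/"},
    AutoMLJobConfig={
        # "AUTO" lets Autopilot pick; "ENSEMBLING" or "HYPERPARAMETER_TUNING" force a mode
        "Mode": "AUTO",
    },
    RoleArn="arn:aws:iam::111122223333:role/YourSageMakerExecutionRole",
)
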
  1. Choose Next: Deployment and advanced settings to continue.
  2. On the Deployment and advanced settings page, select a deployment option.
    It’s important to understand the deployment options in more detail; what we choose will impact whether or not the transforms we made earlier in Data Wrangler will be included in the inference pipeline:

    • Auto deploy best model with transforms from Data Wrangler – With this deployment option, when you prepare data in Data Wrangler and train a model by invoking Autopilot, the trained model is deployed alongside all the Data Wrangler feature transforms as a SageMaker serial inference pipeline. This enables automatic preprocessing of the raw data with the reuse of Data Wrangler feature transforms at the time of inference. Note that the inference endpoint expects the format of your data to be in the same format as when it’s imported into the Data Wrangler flow.
    • Auto deploy best model without transforms from Data Wrangler – This option deploys a real-time endpoint that doesn’t use Data Wrangler transforms. In this case, you need to apply the transforms defined in your Data Wrangler flow to your data prior to inference.
    • Do not auto deploy best model – You should use this option when you don’t want to create an inference endpoint at all. It’s useful if you want to generate a best model for later use, such as locally run bulk inference. (This is the deployment option we selected in Part 1 of the series.) Note that when you select this option, the model created (from Autopilot’s best candidate via the SageMaker SDK) includes the Data Wrangler feature transforms as a SageMaker serial inference pipeline.

    For this post, we use the Auto deploy best model with transforms from Data Wrangler option.

  3. For Deployment option, select Auto deploy best model with transforms from Data Wrangler.
  4. Leave the other settings as default.
  5. Choose Next: Review and create to continue.
    On the Review and create page, we see a summary of the settings chosen for our Autopilot experiment.
  6. Choose Create experiment to begin the model creation process.

You’re redirected to the Autopilot job description page. The models show on the Models tab as they are generated. To confirm that the process is complete, go to the Job Profile tab and look for a Completed value for the Status field.

You can get back to this Autopilot job description page at any time from Amazon SageMaker Studio:

  1. Choose Experiments and Trials on the SageMaker resources drop-down menu.
  2. Select the name of the Autopilot job you created.
  3. Choose (right-click) the experiment and choose Describe AutoML Job.

View the training and deployment

When Autopilot completes the experiment, we can view the training results and explore the best model from the Autopilot job description page.

Choose (right-click) the model labeled Best model, and choose Open in model details.

The Performance tab displays several model quality metrics, including a confusion matrix, the area under the precision/recall curve (AUCPR), and the area under the receiver operating characteristic curve (AUC ROC). These illustrate the overall validation performance of the model, but they don’t tell us whether the model will generalize well. We still need to run evaluations on unseen test data to see how accurately the model makes predictions (for this example, we predict whether an individual will have diabetes).

Perform inference against the real-time endpoint

Create a new SageMaker notebook to perform real-time inference to assess the model performance. Enter the following code into a notebook to run real-time inference for validation:

import boto3

### Define required boto3 clients

sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client(service_name="sagemaker-runtime")

### Define endpoint name

endpoint_name = "<YOUR_ENDPOINT_NAME_HERE>"

### Define input data: a single CSV row of eight feature values (no header, no Outcome column)

payload_str = '5,166.0,72.0,19.0,175.0,25.8,0.587,51'
payload = payload_str.encode()
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="text/csv",
    Body=payload,
)

### Read the CSV response, which has the format: outcome (0 or 1), confidence (0-1)
response["Body"].read().decode("utf-8")

After you set up the code to run in your notebook, you need to configure two variables:

  • endpoint_name
  • payload_str

Configure endpoint_name

endpoint_name represents the name of the real-time inference endpoint that the deployment created automatically for us. Before we can set it, we need to find its name.

  1. Choose Endpoints on the SageMaker resources drop-down menu.
  2. Locate the name of the endpoint that has the name of the Autopilot job you created with a random string appended to it.
  3. Choose (right-click) the experiment, and choose Describe Endpoint.

    The Endpoint Details page appears.
  4. Highlight the full endpoint name, and press Ctrl+C to copy it to the clipboard.
  5. Enter this value (make sure it’s quoted) for endpoint_name in the inference notebook. If you prefer not to click through the console, see the programmatic sketch after these steps.
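
Alternatively, the following sketch looks the endpoint name up programmatically by filtering on the Autopilot job name; the job name shown is a placeholder.

import boto3

sm_client = boto3.client("sagemaker")

# Placeholder: the experiment name you entered when creating the Autopilot job
autopilot_job_name = "my-autopilot-experiment"

endpoints = sm_client.list_endpoints(NameContains=autopilot_job_name)
for ep in endpoints["Endpoints"]:
    print(ep["EndpointName"], ep["EndpointStatus"])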

Configure payload_str

The notebook comes with a default payload string payload_str that you can use to test your endpoint, but feel free to experiment with different values, such as those from your test dataset.

To pull values from the test dataset, follow the instructions in Part 1 to export the test dataset to Amazon S3. Then, on the Amazon S3 console, you can download the file and select the rows you want to use from it.

Each row in your test dataset has nine columns, with the last column being the outcome value. For this notebook code, make sure you only use a single data row (never a CSV header) for payload_str. Also make sure you only send a payload_str with eight columns, where you have removed the outcome value.

For example, if your test dataset file looks like the following, and we want to perform real-time inference on the first data row:

Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome 
10,115,0,0,0,35.3,0.134,29,0 
10,168,74,0,0,38.0,0.537,34,1 
1,103,30,38,83,43.3,0.183,33,0

We set payload_str to 10,115,0,0,0,35.3,0.134,29. Note how we omitted the outcome value of 0 at the end.

If the target value in your dataset is not the first or last value, just remove the value while leaving the comma structure intact. For example, assume we’re predicting bar, and our dataset looks like the following code:

foo,bar,foobar
85,17,20

In this case, we set payload_str to 85,,20.
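
The following sketch automates this step, assuming the exported test file has been downloaded locally as test.csv (a hypothetical file name) and that the target column is named Outcome.

import csv

# Hypothetical local copy of the test dataset exported in Part 1
with open("test.csv", newline="") as f:
    reader = csv.reader(f)
    header = [h.strip() for h in next(reader)]
    first_row = [v.strip() for v in next(reader)]

target_column = "Outcome"
target_index = header.index(target_column)

if target_index == len(header) - 1:
    # Target is the last column: drop it entirely
    fields = first_row[:target_index]
else:
    # Target is elsewhere: blank it but keep the comma structure intact
    fields = first_row[:]
    fields[target_index] = ""

payload_str = ",".join(fields)
print(payload_str)  # e.g. 10,115,0,0,0,35.3,0.134,29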

When the notebook is run with the properly configured payload_str and endpoint_name values, you get a CSV response back in the format of outcome (0 or 1), confidence (0-1).
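
For example, a decoded response such as '1,0.92' (illustrative values only, not real output) could be parsed as follows:

# Illustrative parsing of the CSV response; the values below are examples
body = "1,0.92"  # e.g. body = response["Body"].read().decode("utf-8")
outcome_str, confidence_str = body.strip().split(",")
print(int(float(outcome_str)), float(confidence_str))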

Clean up

To make sure you don’t incur charges after completing this tutorial, be sure to shut down the Data Wrangler app (https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-shut-down.html), as well as all notebook instances used to perform inference tasks. Also delete the inference endpoints created via the Autopilot deployment to prevent additional charges.
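
As a sketch, the endpoint cleanup can also be done with boto3, assuming you still have the endpoint name from the inference step; deleting the associated endpoint configuration as well avoids leaving orphaned resources.

import boto3

sm_client = boto3.client("sagemaker")

endpoint_name = "<YOUR_ENDPOINT_NAME_HERE>"

# Look up the endpoint configuration before deleting the endpoint
endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
config_name = endpoint["EndpointConfigName"]

sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=config_name)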

Conclusion

In this post, we demonstrated how to integrate your data processing, feature engineering, and model building using Data Wrangler and Autopilot. Building on Part 1 in the series, we highlighted how you can easily train, tune, and deploy a model to a real-time inference endpoint with Autopilot directly from the Data Wrangler user interface. In addition to the convenience provided by automatic endpoint deployment, we demonstrated how you can also deploy with all the Data Wrangler feature transforms as a SageMaker serial inference pipeline, providing for automatic preprocessing of the raw data, with the reuse of Data Wrangler feature transforms at the time of inference.

Low-code and AutoML solutions like Data Wrangler and Autopilot remove the need to have deep coding knowledge to build robust ML models. Get started using Data Wrangler today to experience how easy it is to build ML models using Autopilot.


About the authors

Geremy Cohen is a Solutions Architect with AWS where he helps customers build cutting-edge, cloud-based solutions. In his spare time, he enjoys short walks on the beach, exploring the bay area with his family, fixing things around the house, breaking things around the house, and BBQing.

Pradeep Reddy is a Senior Product Manager in the SageMaker Low/No Code ML team, which includes SageMaker Autopilot and SageMaker Automatic Model Tuning. Outside of work, Pradeep enjoys reading, running, and geeking out with palm-sized computers like the Raspberry Pi and other home automation tech.

Dr. John He is a senior software development engineer with Amazon AI, where he focuses on machine learning and distributed computing. He holds a PhD degree from CMU.
