Amazon Rekognition introduces Streaming Video Events to provide real-time alerts on live video streams

Today, AWS announced the general availability of Amazon Rekognition Streaming Video Events, a fully managed service for camera manufacturers and service providers that uses machine learning (ML) to detect objects such as people, pets, and packages in live video streams from connected cameras. Amazon Rekognition Streaming Video Events sends the provider a notification as soon as the desired object is detected in the live video stream.

With these event notifications, service providers can send timely and actionable smart alerts to their users, such as “Pet detected in the backyard.” They can also enable home automation experiences such as turning on garage lights when a person is detected, build custom in-app experiences such as a smart search to find specific video events of packages without scrolling through hours of footage, or integrate these alerts with Echo devices for Alexa announcements such as “A package was detected at the front door” when the doorbell detects a delivery person dropping off a package – all while keeping cost and latency low.

This post describes how camera manufacturers and security service providers can use Amazon Rekognition Streaming Video Events on live video streams to deliver actionable smart alerts to their users in real time.

Amazon Rekognition Streaming Video Events

Many camera manufacturers and security service providers offer home security solutions that include camera doorbells, indoor cameras, outdoor cameras, and value-added notification services to help their users understand what is happening on their property. Cameras with built-in motion detectors are placed at entry or exit points of the home to notify users of any activity in real time, such as “Motion detected in the backyard.” However, motion detectors are noisy: they can be set off by innocuous events like wind and rain, creating notification fatigue and resulting in clunky home automation setups. Building the right user experience for smart alerts, search, or even browsing video clips requires ML and automation that is hard to get right and can be expensive.

Amazon Rekognition Streaming Video Events is a low-cost, low-latency, fully managed ML service that can detect objects (such as people, pets, and packages) in real time on video streams from connected cameras, lowering the cost of value-added video analytics. The service starts analyzing the video clip only when a motion event is triggered by the camera. When the desired object is detected, it sends a notification that includes the objects detected, bounding box coordinates, a zoomed-in image of the objects detected, and the timestamp. The Amazon Rekognition pre-trained APIs provide high accuracy even across varying lighting conditions, camera angles, and resolutions.

Customer success stories

Customers like Abode Systems and 3xLOGIC are using Amazon Rekognition Streaming Video Events to send relevant alerts to their users and minimize false alarms.

Abode Systems (Abode) offers homeowners a comprehensive suite of do-it-yourself home security solutions that can be set up in minutes and enables homeowners to keep their family and property safe. Since the company’s launch in 2015, in-camera motion detection sensors have played an essential part in Abode’s solution, enabling customers to receive notifications and monitor their homes from anywhere. Abode recognized that to offer its customers the best video stream smart notification experience, they needed highly accurate yet inexpensive and scalable streaming computer vision solutions that can detect objects and events of interest in real time. After weighing alternatives, Abode chose to pilot Amazon Rekognition Streaming Video Events. Within a matter of weeks, Abode was able to deploy a serverless, well-architected solution integrating tens of thousands of cameras. To learn more about Abode’s case study, see Abode uses Amazon Rekognition Streaming Video Events to provide real-time notifications to their smart home customers.

“We are always focused on making technology choices that provide value to our customers and enable rapid growth while keeping costs low. With Amazon Rekognition Streaming Video Events, we could launch person, pet, and package detection at a fraction of the cost of developing everything ourselves. Our smart home customers are notified in real time when Amazon Rekognition detects an object or activity of interest. This helps us filter out the noise and focus on what’s important to our customers – quality notifications.

For us it was a no-brainer, we didn’t want to create and maintain a custom computer vision service. We turned to the experts on the Amazon Rekognition team. Amazon Rekognition Streaming Video Events APIs are accurate, scalable, and easy to incorporate into our systems. The integration powers our smart notification features, so instead of a customer receiving 100 notifications a day, every time the motion sensor is triggered, they receive just two or three smart notifications when there is an event of interest present in the video stream.”

– Scott Beck, Chief Technology Officer at Abode Systems.

3xLOGIC is a leader in commercial electronic security systems. They provide commercial security systems and managed video monitoring for businesses, hospitals, schools, and government agencies. Managed video monitoring is a critical component of a comprehensive security strategy for 3xLOGIC’s customers. With more than 50,000 active cameras in the field, video monitoring teams face a daily challenge of dealing with false alarms coming from in-camera motion detection sensors. These false notifications pose a challenge for operators because they must treat every notification as if it were an event of interest. 3xLOGIC wanted to improve their managed video monitoring product VIGIL CLOUD with intelligent video analytics and provide monitoring center operators with real-time smart notifications. To do this, 3xLOGIC used Amazon Rekognition Streaming Video Events. The service enables 3xLOGIC to analyze live video streams from connected cameras to detect the presence of individuals and filter out the noise from false notifications. To learn more about 3xLOGIC’s case study, see 3xLOGIC uses Amazon Rekognition Streaming Video Events to provide intelligent video analytics on live video streams to monitoring agents.

“Simply relying on motion detection sensors triggers several alarms that are not a security or safety risk when there is a lot of activity in a scene. By utilizing machine learning to filter out the vast majority of events, such as animals, shadows, moving vegetation, and more, we can dramatically reduce the workload of the security operators and improve their efficiency.”

– Ola Edman, Senior Director Global Video Development at 3xLOGIC.

“With over 50,000 active cameras in the field, many without the advanced analytics of newer and more expensive camera models, 3xLOGIC takes on the challenge of false alarms every day. Building, training, testing, and maintaining computer vision models is resource-intensive and has a huge learning curve. With Amazon Rekognition Streaming Video Events, we simply call the API and surface the results to our users. It has been very easy to use and the accuracy is impressive.”

– Charlie Erickson, CTO at 3xLOGIC.

How it works

Amazon Rekognition Streaming Video Events works with Amazon Kinesis Video Streams to detect objects from live video streams. This enables camera manufacturers and service providers to minimize false alerts from camera motion events by sending real-time notifications only when a desired object (such as a person, pet, or package) is detected in the video frame. The Amazon Rekognition streaming video APIs enable service providers to accurately alert on objects that are relevant to their customers, adjust the duration of the video to process per motion event, and even define specific areas within the frame that need to be analyzed.

Amazon Rekognition helps service providers protect their user data by automatically encrypting the data at rest using AWS Key Management Service (KMS) and in transit using the industry-standard Transport Layer Security (TLS) protocol.

Here’s how camera manufacturers and service providers can incorporate video analysis on live video streams (a code sketch of these steps follows the list):

  1. Integrate Kinesis Video Streams with Amazon Rekognition – Kinesis Video Streams allows camera manufacturers and service providers to easily and securely stream live video from devices such as video doorbells and indoor and outdoor cameras to AWS. Amazon Rekognition integrates seamlessly with new or existing Kinesis video streams to facilitate live video stream analysis.
  2. Specify video duration – Amazon Rekognition Streaming Video Events allows service providers to control how much video they need to process per motion event. They can specify the length of the video clips to be between 1 and 120 seconds (the default is 10 seconds). When motion is detected, Amazon Rekognition starts analyzing video from the relevant Kinesis video stream for the specified duration. This provides camera manufacturers and service providers with the flexibility to better manage their ML inference costs.
  3. Choose relevant objects – Amazon Rekognition Streaming Video Events provides the capability to choose one or more objects for detection in live video streams. This minimizes false alerts from camera motion events by sending notifications only when desired objects are detected in the video frame.
  4. Let Amazon Rekognition know where to send the notifications – Service providers can specify their Amazon Simple Notification Service (Amazon SNS) destination to send event notifications. When Amazon Rekognition starts processing the video stream, it sends a notification as soon as a desired object is detected. This notification includes the object detected, the bounding box, the timestamp, and a link to the specified Amazon Simple Storage Service (Amazon S3) bucket with the zoomed-in image of the object detected. They can then use this notification to send smart alerts to their users.
  5. Send motion detection trigger notifications – Whenever a connected camera detects motion, the service provider sends a trigger to Amazon Rekognition to start processing the video streams. Amazon Rekognition processes the applicable Kinesis video stream for the specific objects for the defined duration. When the desired object is detected, Amazon Rekognition sends a notification to their private SNS topic.
  6. Integrate with Alexa or other voice assistants (optional) – Service providers can integrate these notifications with Alexa Smart Home skills to enable Alexa announcements for their users. Whenever Amazon Rekognition Streaming Video Events sends them a notification, they can send these notifications to Alexa to provide audio announcements from Echo devices, such as “Package detected at the front door.”
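
For illustration, the following is a minimal sketch of how these steps could look with the AWS SDK for Python (Boto3). The resource names and ARNs are placeholders, and the exact request parameters should be verified against the Amazon Rekognition Streaming Video Events developer guide:

import boto3

rekognition = boto3.client("rekognition")

# Create a stream processor that watches a Kinesis video stream for people,
# pets, and packages, writes zoomed-in images to Amazon S3, and publishes
# event notifications to an SNS topic (steps 1-4).
rekognition.create_stream_processor(
    Name="front-door-camera-processor",
    Input={"KinesisVideoStream": {"Arn": "arn:aws:kinesisvideo:us-east-1:111122223333:stream/front-door/abc"}},
    Output={"S3Destination": {"Bucket": "my-smart-alert-images", "KeyPrefix": "front-door/"}},
    NotificationChannel={"SNSTopicArn": "arn:aws:sns:us-east-1:111122223333:smart-alerts"},
    RoleArn="arn:aws:iam::111122223333:role/RekognitionStreamProcessorRole",
    Settings={"ConnectedHome": {"Labels": ["PERSON", "PET", "PACKAGE"], "MinConfidence": 80}},
)

# When the camera reports motion (step 5), analyze the stream for up to
# 30 seconds starting from the producer timestamp of the motion event.
rekognition.start_stream_processor(
    Name="front-door-camera-processor",
    StartSelector={"KVSStreamStartSelector": {"ProducerTimestamp": 1656000000000}},
    StopSelector={"MaxDurationInSeconds": 30},
)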

To learn more, see Amazon Rekognition Streaming Video Events developer guide.

The following diagram illustrates Abode’s architecture with Amazon Rekognition Streaming Video Events.

The following diagram illustrates 3xLOGIC’s architecture with Amazon Rekognition Streaming Video Events.

Amazon Rekognition Streaming Video Events is generally available to AWS customers in the US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Mumbai) Regions, with availability in additional Regions in the coming months.

Conclusion

AWS customers such as Abode and 3xLOGIC are using Amazon Rekognition Streaming Video Events to innovate and add intelligent video analytics to their security solutions and modernize their offerings without having to invest in new hardware or develop and maintain custom computer vision analytics.

To get started with Amazon Rekognition Streaming Video Events, visit Amazon Rekognition Streaming Video Events.


About the Author

Prathyusha Cheruku is an AI/ML Computer Vision Principal Product Manager at AWS. She focuses on building powerful, easy-to-use, no-code/low-code deep learning-based image and video analysis services for AWS customers. Outside of work, she has a passion for music, karaoke, painting, and traveling.

Read More

3xLOGIC uses Amazon Rekognition Streaming Video Events to provide intelligent video analytics on live video streams to monitoring agents

3xLOGIC is a leader in commercial electronic security systems. They provide commercial security systems and managed video monitoring for businesses, hospitals, schools, and government agencies. Managed video monitoring is a critical component of a comprehensive security strategy for 3xLOGIC’s customers. With more than 50,000 active cameras in the field, video monitoring teams face a daily challenge of dealing with false alarms coming from in-camera motion detection sensors. These false notifications pose a challenge for operators because they must treat every notification as if it were an event of interest. This means that the operator must tap into the live video stream and potentially send personnel to the location for further investigation.

3xLOGIC wanted to improve their managed video monitoring product VIGIL CLOUD with intelligent video analytics and provide monitoring center operators with real-time smart notifications. To do this, 3xLOGIC used Amazon Rekognition Streaming Video Events, a low-latency, low-cost, scalable, managed computer vision service from AWS. The service enables 3xLOGIC to analyze live video streams from connected cameras to detect the presence of people and filter out the noise from false notifications. When a person is detected, the service sends a notification that includes the object detected, a zoomed-in image of the object, bounding boxes, and timestamps to monitoring center operators for further review.

“Simply relying on motion detection sensors triggers several alarms that are not a security or safety risk when there is a lot of activity in a scene. By utilizing machine learning to filter out the vast majority of events, such as animals, shadows, moving vegetation, and more, we can dramatically reduce the workload of the security operators and improve their efficiency.”

– Ola Edman, Senior Director Global Video Development at 3xLOGIC.

Video analytics with Amazon Rekognition Streaming Video Events

The challenge for managed video monitoring operators is that the more false notifications they receive, the more they get desensitized to the noise and the more likely they are to miss a critical notification. Providers like 3xLOGIC want agents to respond to notifications with the same urgency on the last alarm of their shift as they did on the first. The best way for that to happen is to simply filter out the noise from in-camera motion detection events.

3xLOGIC worked with AWS to develop and launch a multi-location pilot program that showed a significant decrease in false alarms. The following diagram illustrates 3xLOGIC’s integration with Amazon Rekognition Streaming Video Events.

When a 3xLOGIC camera detects motion, it starts streaming video to Amazon Kinesis Video Streams and calls an API to trigger Amazon Rekognition to start analyzing the video stream. When Amazon Rekognition detects a person in the video stream, it sends an event to Amazon Simple Notification Service (Amazon SNS), which notifies a video monitoring agent of the event. Amazon Rekognition provides out-of-the-box notifications, which include zoomed-in images of the people, bounding boxes, labels, and timestamps of the event. Monitoring agents use these notifications in concert with live camera views to evaluate the event and take appropriate action. To learn more about Amazon Rekognition Streaming Video Events, refer to the Amazon Rekognition Developer guide.

“With over 50,000 active cameras in the field, many without the advanced analytics of newer and more expensive camera models, 3xLOGIC takes on the challenge of false alarms every day. Building, training, testing, and maintaining computer vision models is resource-intensive and has a huge learning curve. With Amazon Rekognition Streaming Video Events, we simply call the API and surface the results to our users. It has been very easy to use and the accuracy is impressive.”

– Charlie Erickson, CTO at 3xLOGIC Products and Solutions.

Conclusion

The managed video monitoring market requires an in-depth understanding of the variety of security risks that firms face. It also requires that you keep up with the latest technology, regulations, and best practices. By partnering with AWS, providers like 3xLOGIC are innovating and adding intelligent video analytics to their security solutions and modernizing their offerings without having to invest in new hardware or develop and maintain custom computer vision analytics.

To get started with Amazon Rekognition Streaming Video Events, visit Amazon Rekognition Streaming Video Events.


About the Authors

Mike Ames is a Principal Applied AI/ML Solutions Architect with AWS. He helps companies use machine learning and AI services to combat fraud, waste, and abuse. In his spare time, you can find him mountain biking, kickboxing, or playing Frisbee with his dog Max.

Prathyusha Cheruku is a Principal Product Manager for AI/ML Computer Vision at AWS. She focuses on building powerful, easy-to-use, no-code/low-code deep learning-based image and video analysis services for AWS customers. Outside of work, she has a passion for music, karaoke, painting, and traveling.

David Robo is a Principal WW GTM Specialist for AI/ML Computer Vision at Amazon Web Services. In this role, David works with customers and partners throughout the world who are building innovative video-based devices, products, and services. Outside of work, David has a passion for the outdoors and carving lines on waves and snow.

Read More

Abode uses Amazon Rekognition Streaming Video Events to provide real-time notifications to their smart home customers

Abode Systems (Abode) offers homeowners a comprehensive suite of do-it-yourself home security solutions that can be set up in minutes and enables homeowners to keep their family and property safe. Since the company’s launch in 2015, in-camera motion detection sensors have played an essential part in Abode’s solution, enabling customers to receive notifications and monitor their homes from anywhere. The challenge with in-camera-based motion detection is that a large percentage (up to 90%) of notifications are triggered from insignificant events like wind, rain, or passing cars. Abode wanted to overcome this challenge and provide their customers with highly accurate smart notifications.

Abode has been an AWS user since 2015, taking advantage of multiple AWS services for storage, compute, database, IoT, and video streaming for its solutions. Abode reached out to AWS to understand how they could use AWS computer vision services to build smart notifications into their home security solution for their customers. After evaluating their options, Abode chose to use Amazon Rekognition Streaming Video Events, a low-cost, low-latency, fully managed AI service that can detect objects such as people, pets, and packages in real time on video streams from connected cameras.

“We are always focused on making technology choices that provide value to our customers and enable rapid growth while keeping costs low. With Amazon Rekognition Streaming Video Events, we could launch person, pet, and package detection at a fraction of the cost of developing everything ourselves.”

– Scott Beck, Chief Technology Officer at Abode Systems.

Smart notifications for the connected home market segment

Abode recognized that to offer its customers the best video stream smart notification experience, they needed highly accurate yet inexpensive and scalable streaming computer vision solutions that can detect objects and events of interest in real time. After weighing alternatives, Abode leaned on their relationship with AWS to pilot Amazon Rekognition Streaming Video Events. Within a matter of weeks, Abode was able to deploy a serverless, well-architected solution integrating tens of thousands of cameras.

“Every time a camera detects motion, we stream video to Amazon Kinesis Video Streams and trigger Amazon Rekognition Streaming Video Events APIs to detect if there truly was a person, pet, or package in the stream,” Beck says. “Our smart home customers are notified in real time when Amazon Rekognition detects an object or activity of interest. This helps us filter out the noise and focus on what’s important to our customers – quality notifications.”

Amazon Rekognition Streaming Video Events

Amazon Rekognition Streaming Video Events detects objects and events in video streams and returns the labels detected, bounding box coordinates, zoomed-in images of the object detected, and timestamps. With this service, companies like Abode can deliver timely and actionable smart notifications only when a desired label such as a person, pet, or package is detected in the video frame. For more information, refer to the Amazon Rekognition Streaming Video Events Developer Guide.

“For us it was a no-brainer, we didn’t want to create and maintain a custom computer vision service,” Beck says. “We turned to the experts on the Amazon Rekognition team. Amazon Rekognition Streaming Video Events APIs are accurate, scalable, and easy to incorporate into our systems. The integration powers our smart notification features, so instead of a customer receiving 100 notifications a day, every time the motion sensor is triggered, they receive just two or three smart notifications when there is an event of interest present in the video stream.”

Solution overview

Abode’s goal was to improve accuracy and usefulness of camera-based motion detection notifications to their customers by providing highly accurate label detection using their existing camera technology. This meant that Abode’s customers wouldn’t have to buy additional hardware to take advantage of new features, and Abode wouldn’t have to develop and maintain a bespoke solution. The following diagram illustrates Abode’s integration with Amazon Rekognition Streaming Video Events.

The solution consists of the following steps:

  1. Integrate Amazon Kinesis Video Streams with Amazon Rekognition – Abode was already using Amazon Kinesis Video Streams to easily stream live video from devices such as video doorbells and indoor and outdoor cameras to AWS. They simply integrated Kinesis Video Streams with Amazon Rekognition to facilitate live video stream analysis.
  2. Specify video duration – With Amazon Rekognition, Abode can control how much video needs to be processed per motion event. Amazon Rekognition allows you to specify the length of the video clips to be between 0 and 120 seconds (the default is 10 seconds) per motion event. When motion is detected, Amazon Rekognition starts analyzing video from the relevant Kinesis video stream for the specified duration. This gives Abode the flexibility to better manage their machine learning (ML) inference costs.
  3. Choose relevant labels – With Amazon Rekognition, customers like Abode can choose one or more labels for detection in live video streams. This minimizes false alerts from camera motion events by sending notifications only when desired objects are detected in the video frame. Abode opted for person, pet, and package detection.
  4. Let Amazon Rekognition know where to send the notifications – When Amazon Rekognition starts processing the video stream, it sends a notification as soon as a desired object is detected to the Amazon Simple Notification Service (Amazon SNS) destination configured by Abode. This notification includes the object detected, the bounding box, the timestamp, and a link to Abode’s specified Amazon Simple Storage Service (Amazon S3) bucket with the zoomed-in image of the object detected. Abode then uses this information to send relevant smart alerts to the homeowner, such as “A package has been detected at 12:53pm” or “A pet was detected in the backyard” (see the notification handler sketch after this list).
  5. Send motion detection trigger notifications – Whenever the smart camera detects motion, Abode sends a trigger to Amazon Rekognition to start processing the video streams. Amazon Rekognition processes the applicable Kinesis video stream for the specific objects and the duration defined. When the desired object is detected, Amazon Rekognition sends a notification to Abode’s private SNS topic.
  6. Integrate with Alexa or other voice assistants (optional) – Abode also integrated these notifications with Alexa Smart Home skills to enable Alexa announcements for their users. Whenever they receive a notification from Amazon Rekognition Streaming Video Events, Abode sends these notifications to Alexa to provide audio announcements from Echo devices, such as “Package detected at the front door.”
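
How these notifications are turned into user-facing alerts is specific to each provider. As a rough sketch, an AWS Lambda function subscribed to the SNS topic could parse each event and fan out a push notification. The inner message fields used below (labels, name) are illustrative assumptions rather than the documented notification schema, and send_push_notification stands in for Abode’s own delivery mechanism:

import json

def handler(event, context):
    # Lambda handler subscribed to the Rekognition SNS topic
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])  # standard SNS envelope
        # Field names below are assumptions; consult the developer guide
        # for the exact notification schema.
        detected = {label.get("name", "").lower() for label in message.get("labels", [])}
        if "package" in detected:
            send_push_notification("A package was detected at the front door")
        elif "person" in detected:
            send_push_notification("A person was detected at the front door")
        elif "pet" in detected:
            send_push_notification("A pet was detected in the backyard")

def send_push_notification(text):
    # Placeholder for the provider's own push notification or Alexa integration
    print(text)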

Conclusion

The connected home security market segment is dynamic and evolving, driven by consumers’ increased need for security, convenience, and entertainment. AWS customers like Abode are innovating and adding new ML capabilities to their smart home security solutions for their consumers. The proliferation of camera and streaming video technology is just beginning, and managed computer vision services like Amazon Rekognition Streaming Video Events are paving the way for new smart video streaming capabilities in the home automation market.

To learn more, check out Amazon Rekognition Streaming Video Events and its developer guide.


About the Authors

Mike Ames is a Principal Applied AI/ML Solutions Architect with AWS. He helps companies use machine learning and AI services to combat fraud, waste, and abuse. In his spare time, you can find him mountain biking, kickboxing, or playing Frisbee with his dog Max.

Prathyusha Cheruku is a Principal Product Manager for AI/ML Computer Vision at AWS. She focuses on building powerful, easy-to-use, no-code/low-code deep learning-based image and video analysis services for AWS customers. Outside of work, she has a passion for music, karaoke, painting, and traveling.

David Robo is a Principal WW GTM Specialist for AI/ML Computer Vision at Amazon Web Services. In this role, David works with customers and partners throughout the world who are building innovative video-based devices, products, and services. Outside of work, David has a passion for the outdoors and carving lines on waves and snow.

Read More

Pandas user-defined functions are now available in Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler reduces the time to aggregate and prepare data for machine learning (ML) from weeks to minutes. With Data Wrangler, you can select and query data with just a few clicks, quickly transform data with over 300 built-in data transformations, and understand your data with built-in visualizations without writing any code.

Additionally, you can create custom transforms unique to your requirements. Custom transforms allow you to write custom transformations using either PySpark, Pandas, or SQL.

Data Wrangler now supports a custom Pandas user-defined function (UDF) transform that can process large datasets efficiently. You can choose from two custom Pandas UDF modes: Pandas and Python. Both modes provide an efficient solution to process datasets, and the mode you choose depends on your preference.

In this post, we demonstrate how to use the new Pandas UDF transform in either mode.

Solution overview

At the time of this writing, you can import datasets into Data Wrangler from Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, Databricks, and Snowflake. For this post, we use Amazon S3 to store the 2014 Amazon reviews dataset.

The data has a column called reviewText containing user-generated text. The text also contains several stop words, which are common words that don’t provide much information, such as “a,” “an,” and “the.” Removal of stop words is a common preprocessing step in natural language processing (NLP) pipelines. We can create a custom function to remove the stop words from the reviews.

Create a custom Pandas UDF transform

Let’s walk through the process of creating two Data Wrangler custom Pandas UDF transforms using Pandas and Python modes.

  1. Download the Digital Music reviews dataset and upload it to Amazon S3.
  2. Open Amazon SageMaker Studio and create a new Data Wrangler flow.
  3. Under Import data, choose Amazon S3 and navigate to the dataset location.
  4. For File type, choose jsonl.

A preview of the data should be displayed in the table.

  5. Choose Import to proceed.
  6. After your data is imported, choose the plus sign next to Data types and choose Add transform.
  7. Choose Custom transform.
  8. On the drop-down menu, choose Python (User-Defined Function).

Now we create our custom transform to remove stop words.

  9. Specify your input column, output column, return type, and mode.

The following example uses Pandas mode. This means the function should accept and return a Pandas series of the same length. You can think of a Pandas series as a column in a table or a chunk of the column. This is the most performant Pandas UDF mode because Pandas can vectorize operations across batches of values as opposed to one at a time. The pd.Series type hints are required in Pandas mode.

import pandas as pd
from sklearn.feature_extraction import text

# Input: the quick brown fox jumped over the lazy dog
# Output: quick brown fox jumped lazy dog
def remove_stopwords(series: pd.Series) -> pd.Series:
  """Removes stop words from the given string."""
  
  # Replace nulls with empty strings and lowercase to match stop words case
  series = series.fillna("").str.lower()
  tokens = series.str.split()
  
  # Remove stop words from each entry of series
  tokens = tokens.apply(lambda t: [token for token in t 
                                   if token not in text.ENGLISH_STOP_WORDS])
  
  # Joins the filtered tokens by spaces
  return tokens.str.join(" ")

If you prefer to use pure Python as opposed to the Pandas API, Python mode allows you to specify a pure Python function that accepts a single argument and returns a single value. The following example is equivalent to the preceding Pandas code in terms of output. Type hints are not required in Python mode.

from sklearn.feature_extraction import text

def remove_stopwords(value: str) -> str:
  if not value:
    return ""
  
  tokens = value.lower().split()
  tokens = [token for token in tokens 
            if token not in text.ENGLISH_STOP_WORDS]
  return " ".join(tokens)

  10. Choose Add to add your custom transform.

Conclusion

Data Wrangler has over 300 built-in transforms, and you can also add custom transformations unique to your requirements. In this post, we demonstrated how to process datasets with Data Wrangler’s new custom Pandas UDF transform, using both Pandas and Python modes. You can use either mode based on your preference. To learn more about Data Wrangler, refer to Create and Use a Data Wrangler Flow.


About the Authors

Ben Harris is a software engineer with experience designing, deploying, and maintaining scalable data pipelines and machine learning solutions across a variety of domains. Ben has built systems for data collection and labeling, image and text classification, sequence-to-sequence modeling, embedding, and clustering, among others.

Haider Naqvi is a Solutions Architect at AWS. He has extensive Software Development and Enterprise Architecture experience. He focuses on enabling customers to achieve business outcomes with AWS. He is based out of New York.

Vishal Srivastava is a Technical Account Manager at AWS. With a background in Software Development and Analytics, he primarily works with financial services sector and digital native business customers and supports their cloud journey. In his free time, he loves to travel with his family.

Read More

How Searchmetrics uses Amazon SageMaker to automatically find relevant keywords and make their human analysts 20% faster

Searchmetrics is a global provider of search data, software, and consulting solutions, helping customers turn search data into unique business insights. To date, Searchmetrics has helped more than 1,000 companies such as McKinsey & Company, Lowe’s, and AXA find an advantage in the hyper-competitive search landscape.

In 2021, Searchmetrics turned to AWS for help using artificial intelligence (AI) to further improve their search insights capabilities.

In this post, we share how Searchmetrics built an AI solution that increased the efficiency of its human workforce by 20% by automatically finding relevant search keywords for any given topic, using Amazon SageMaker and its native integration with Hugging Face.

“Amazon SageMaker made it a breeze to evaluate and integrate Hugging Face’s state-of-the-art NLP models into our systems. The solution we built makes us more efficient and greatly improves our user experience.”

– Ioannis Foukarakis, Head of Data, Searchmetrics

Using AI to identify relevance from a list of keywords

A key part of Searchmetrics’ insights offering is its ability to identify the most relevant search keywords for a given topic or search intent.

To do this, Searchmetrics has a team of analysts assessing the potential relevance of certain keywords given a specific seed word. Analysts use an internal tool to review a keyword within a given topic and a generated list of potentially related keywords, and they must then select one or more related keywords that are relevant to that topic.

This manual filtering and selection process was time consuming and slowed down Searchmetrics’s ability to deliver insights to its customers.

To improve this process, Searchmetrics sought to build an AI solution that could use natural language processing (NLP) to understand the intent of a given search topic and automatically rank an unseen list of potential keywords by relevance.

Using SageMaker and Hugging Face to quickly build advanced NLP capabilities

To solve this, Searchmetrics’ engineering team turned to SageMaker, an end-to-end machine learning (ML) platform that helps developers and data scientists quickly and easily build, train, and deploy ML models.

SageMaker accelerates the deployment of ML workloads by simplifying the ML build process. It provides a broad set of ML capabilities on top of a fully managed infrastructure. This removes the undifferentiated heavy lifting that too often hinders ML development.

Searchmetrics chose SageMaker because of the full range of capabilities it provided at every step of the ML development process:

  • SageMaker notebooks enabled the Searchmetrics team to quickly spin up fully managed ML development environments, perform data preprocessing, and experiment with different approaches
  • The batch transform capabilities in SageMaker enabled Searchmetrics to efficiently process its inference payloads in bulk, as well as easily integrate into its existing web service in production

Searchmetrics was also particularly interested in the native integration of SageMaker with Hugging Face, an exciting NLP startup that provides easy access to more than 7,000 pre-trained language models through its popular Transformers library.

SageMaker provides a direct integration with Hugging Face through a dedicated Hugging Face estimator in the SageMaker SDK. This makes it easy to run Hugging Face models on the fully managed SageMaker infrastructure.

With this integration, Searchmetrics was able to test and experiment with a range of different models and approaches to find the best-performing approach to their use case.

The end solution uses a zero-shot classification pipeline to identify the most relevant keywords. Different pre-trained models and query strategies were evaluated, with facebook/bart-large-mnli providing the most promising results.
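
To give a flavor of this approach, the following is a minimal sketch of zero-shot relevance ranking with the Hugging Face Transformers pipeline and the facebook/bart-large-mnli model mentioned above. The topic and candidate keywords are made-up examples, not Searchmetrics’ production code:

from transformers import pipeline

# Zero-shot classification pipeline using the model referenced in the post
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

topic = "running shoes"
candidate_keywords = [
    "trail running sneakers",
    "marathon training plan",
    "best hiking boots",
    "lightweight running shoes",
]

# Score each candidate keyword against the seed topic; the pipeline returns
# labels sorted by score, so higher-ranked keywords are more relevant.
result = classifier(topic, candidate_keywords, multi_label=True)

for keyword, score in zip(result["labels"], result["scores"]):
    print(f"{keyword}: {score:.3f}")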

Using AWS to improve operational efficiency and find new innovation opportunities

With SageMaker and its native integration with Hugging Face, Searchmetrics was able to build, train, and deploy an NLP solution that could understand a given topic and accurately rank an unseen list of keywords based on their relevance. The toolset offered by SageMaker made it easier to experiment and deploy.

When integrated with Searchmetrics’s existing internal tool, this AI capability delivered an average reduction of 20% in the time taken for human analysts to complete their job. This resulted in higher throughput, improved user experience, and faster onboarding of new users.

This initial success has not only improved the operational performance of Searchmetrics’s search analysts, but has also helped Searchmetrics chart a clearer path to deploying more comprehensive automation solutions using AI in its business.

These exciting new innovation opportunities help Searchmetrics continue to improve their insights capabilities, and also help them ensure that customers continue to stay ahead in the hyper-competitive search landscape.

In addition, Hugging Face and AWS announced a partnership earlier in 2022 that makes it even easier to train Hugging Face models on SageMaker. This functionality is available through the development of Hugging Face AWS Deep Learning Containers (DLCs). These containers include Hugging Face Transformers, Tokenizers, and the Datasets library, which allows us to use these resources for training and inference jobs.

For a list of the available DLC images, see available Deep Learning Containers Images, which are maintained and regularly updated with security patches. You can find many examples of how to train Hugging Face models with these DLCs and the Hugging Face Python SDK in the following GitHub repo.

Learn more about how you can accelerate your ability to innovate with AI/ML by visiting Getting Started with Amazon SageMaker, getting hands-on learning content by reviewing the Amazon SageMaker developer resources, or visiting Hugging Face on Amazon SageMaker.


About the Author

Daniel Burke is the European lead for AI and ML in the Private Equity group at AWS. Daniel works directly with Private Equity funds and their portfolio companies, helping them accelerate their AI and ML adoption to improve innovation and increase enterprise value.

Read More

Identify paraphrased text with Hugging Face on Amazon SageMaker

Identifying paraphrased text has business value in many use cases. For example, by identifying sentence paraphrases, a text summarization system could remove redundant information. Another application is to identify plagiarized documents. In this post, we fine-tune a Hugging Face transformer on Amazon SageMaker to identify paraphrased sentence pairs in a few steps.

A truly robust model can identify paraphrased text when the language used may be completely different, and also identify differences when the language used has high lexical overlap. In this post, we focus on the latter aspect. Specifically, we look at whether we can train a model that can identify the difference between two sentences that have high lexical overlap and very different or opposite meanings. For example, the following sentences have the exact same words but opposite meanings:

  • I took a flight from New York to Paris
  • I took a flight from Paris to New York

Solution overview

We walk you through the following high-level steps:

  1. Set up the environment.
  2. Prepare the data.
  3. Tokenize the dataset.
  4. Fine-tune the model.
  5. Deploy the model and perform inference.
  6. Evaluate model performance.

If you want to skip setting up the environment, you can use the following notebook on GitHub and run the code in SageMaker.

Hugging Face and AWS announced a partnership earlier in 2022 that makes it even easier to train Hugging Face models on SageMaker. This functionality is available through the development of Hugging Face AWS Deep Learning Containers (DLCs). These containers include Hugging Face Transformers, Tokenizers, and the Datasets library, which allows us to use these resources for training and inference jobs. For a list of the available DLC images, see Available Deep Learning Containers Images. They are maintained and regularly updated with security patches. You can find many examples of how to train Hugging Face models with these DLCs and the Hugging Face Python SDK in the following GitHub repo.

The PAWS dataset

The original PAWS dataset, released in 2019, was created to address the lack of sentence pair datasets that exhibit high lexical overlap without being paraphrases, and to give the natural language processing (NLP) community a new resource for training and evaluating paraphrase detection models. PAWS sentence pairs are generated in two steps using Wikipedia and the Quora Question Pairs (QQP) dataset. A language model first swaps words in a sentence pair with the same Bag of Words (BOW) to generate a sentence pair. A back translation step then generates paraphrases with high BOW overlap but using a different word order. The final PAWS dataset contains a total of 108,000 human-labeled and 656,000 noisily labeled pairs.

In this post, we use the PAWS-Wiki Labeled (Final) dataset from Hugging Face. Hugging Face has already performed the data split for us, which results in 49,000 sentence pairs in the training dataset, and 8,000 sentence pairs each for the validation and test datasets. Two sentence pair examples from the training dataset are shown in the following example. A label of 1 indicates that the two sentences are paraphrases of each other.

Sentence 1: Although interchangeable, the body pieces on the 2 cars are not similar.
Sentence 2: Although similar, the body parts are not interchangeable on the 2 cars.
Label: 0

Sentence 1: Katz was born in Sweden in 1947 and moved to New York City at the age of 1.
Sentence 2: Katz was born in 1947 in Sweden and moved to New York at the age of one.
Label: 1

Prerequisites

You need to complete the following prerequisites:

  1. Sign up for an AWS account if you don’t have one. For more information, see Set Up Amazon SageMaker Prerequisites.
  2. Get started using SageMaker notebook instances.
  3. Set up the right AWS Identity and Access Management (IAM) permissions. For more information, see SageMaker Roles.

Set up the environment

Before we begin examining and preparing our data for model fine-tuning, we need to set up our environment. Let’s start by spinning up a SageMaker notebook instance. Choose an AWS Region in your AWS account and follow the instructions to create a SageMaker notebook instance. The notebook instance may take a few minutes to spin up.

When the notebook instance is running, choose conda_pytorch_p38 as your kernel type. To use the Hugging Face dataset, we first need to install and import the Hugging Face library:

!pip --quiet install "sagemaker" "transformers==4.17.0" "datasets==1.18.4" --upgrade
!pip --quiet install sentence-transformers

import sagemaker.huggingface
import sagemaker
from datasets import load_dataset

Next, let’s establish a SageMaker session. We use the default Amazon Simple Storage Service (Amazon S3) bucket associated with the SageMaker session to store the PAWS dataset and model artifacts:

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sess.default_bucket()

Prepare the data

We can load the Hugging Face version of the PAWS dataset with its load_dataset() command. This call downloads and imports the PAWS Python processing script from the Hugging Face GitHub repository, which then downloads the PAWS dataset from the original URL stored in the script and caches the data as an Arrow table on the drive. See the following code:

dataset_train, dataset_val, dataset_test = load_dataset("paws", "labeled_final", split=['train', 'validation', 'test'])

Before we begin fine-tuning our pre-trained BERT model, let’s look at our target class distribution. For our use case, the PAWS dataset has binary labels (0 indicates the sentence pair is not a paraphrase, and 1 indicates it is). Let’s create a column chart to view the class distribution, as shown in the following code. We see that there is a slight class imbalance issue in our training set (56% negative samples vs. 44% positive samples). However, the imbalance is small enough to avoid employing class imbalance mitigation techniques.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = dataset_train.to_pandas()

ax = sns.countplot(x="label", data=df)
ax.set_title('Label Count for PAWS Dataset', fontsize=15)
for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x()+0.4, p.get_height()), ha='center', va='top', color='white', size=13)

Tokenize the dataset

Before we can begin fine-tuning, we need to tokenize our dataset. As a starting point, let’s say we want to fine-tune and evaluate the roberta-base transformer. We selected roberta-base because it’s a general-purpose transformer that was pre-trained on a large corpus of English data and has frequently shown high performance on a variety of NLP tasks. The model was originally introduced in the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach.

We perform tokenization on the sentences with a roberta-base tokenizer from Hugging Face, which uses byte-level Byte Pair Encoding to split the document into tokens. For more details about the RoBERTa tokenizer, refer to RobertaTokenizer. Because our inputs are sentence pairs, we need to tokenize both sentences simultaneously. Because most BERT models require the input to have a fixed tokenized input length, we set the following parameters: max_len=128 and truncation=True. See the following code:

from transformers import AutoTokenizer
tokenizer_and_model_name = 'roberta-base'

# Download tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_and_model_name)

# Tokenizer helper function
def tokenize(batch, max_len=128):
    return tokenizer(batch['sentence1'], batch['sentence2'], max_length=max_len, truncation=True)

dataset_train_tokenized = dataset_train.map(tokenize, batched=True, batch_size=len(dataset_train))
dataset_val_tokenized = dataset_val.map(tokenize, batched=True, batch_size=len(dataset_val))

The last preprocessing step for fine-tuning our BERT model is to convert the tokenized train and validation datasets into PyTorch tensors and upload them to our S3 bucket:

import botocore
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()
s3_prefix = 'sts-sbert-paws/sts-paws-datasets'

# convert and save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
dataset_train_tokenized = dataset_train_tokenized.rename_column("label", "labels")
dataset_train_tokenized.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
dataset_train_tokenized.save_to_disk(training_input_path,fs=s3)

# convert and save val_dataset to s3
val_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/val'
dataset_val_tokenized = dataset_val_tokenized.rename_column("label", "labels")
dataset_val_tokenized.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
dataset_val_tokenized.save_to_disk(val_input_path,fs=s3)

Fine-tune the model

Now that we’re done with data preparation, we’re ready to fine-tune our pre-trained roberta-base model on the paraphrase identification task. We can use the SageMaker Hugging Face Estimator class to initiate the fine-tuning process in two steps. The first step is to specify the training hyperparameters and metric definitions. The metric definitions variable tells the Hugging Face Estimator what types of metrics to extract from the model’s training logs. Here, we’re primarily interested in extracting validation set metrics at each training epoch.

# Step 1: specify training hyperparameters and metric definitions
hyperparameters = {'epochs': 4,
                   'train_batch_size': 16,
                   'model_name': tokenizer_and_model_name}
                   
metric_definitions=[
    {'Name': 'loss', 'Regex': "'loss': ([0-9]+(.|e-)[0-9]+),?"},
    {'Name': 'eval_loss', 'Regex': "'eval_loss': ([0-9]+(.|e-)[0-9]+),?"},
    {'Name': 'eval_accuracy', 'Regex': "'eval_accuracy': ([0-9]+(.|e-)[0-9]+),?"},
    {'Name': 'eval_f1', 'Regex': "'eval_f1': ([0-9]+(.|e-)[0-9]+),?"},
    {'Name': 'eval_precision', 'Regex': "'eval_precision': ([0-9]+(.|e-)[0-9]+),?"},
    {'Name': 'eval_recall', 'Regex': "'eval_recall': ([0-9]+(.|e-)[0-9]+),?"},
    {'Name': 'epoch', 'Regex': "'epoch': ([0-9]+(.|e-)[0-9]+),?"}
]           

The second step is to instantiate the Hugging Face Estimator and start the fine-tuning process with the .fit() method:

# Step 2: instantiate estimator and begin fine-tuning
import time

from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
                            entry_point='train.py',
                            source_dir='./scripts',
                            output_path=f's3://{sess.default_bucket()}',
                            base_job_name='huggingface-sdk-extension',
                            instance_type='ml.p3.8xlarge',
                            instance_count=1,
                            volume_size=100,
                            transformers_version='4.17.0',
                            pytorch_version='1.10.2',
                            py_version='py38',
                            role=role,
                            hyperparameters=hyperparameters,
                            metric_definitions=metric_definitions
                        )
                        
huggingface_estimator.fit({'train': training_input_path, 'test': val_input_path}, 
                          wait=True, 
                          job_name='sm-sts-blog-{}'.format(int(time.time())))

The fine-tuning process takes approximately 30 minutes using the specified hyperparameters.
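
The train.py entry point referenced in the estimator is kept in the scripts directory of the accompanying repository and isn’t reproduced in this post. The following is a minimal sketch of what such a script could look like, assuming it uses the Hugging Face Trainer API and the standard SageMaker training environment variables; the actual script may differ:

# train.py -- illustrative sketch of the fine-tuning entry point
import argparse
import os

from datasets import load_from_disk
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def compute_metrics(pred):
    # The values logged here are what the metric_definitions regexes extract
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    return {"accuracy": accuracy_score(labels, preds),
            "f1": f1, "precision": precision, "recall": recall}

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=4)
    parser.add_argument("--train_batch_size", type=int, default=16)
    parser.add_argument("--model_name", type=str, default="roberta-base")
    args, _ = parser.parse_known_args()

    # SageMaker mounts the channels passed to .fit() at these locations
    train_dataset = load_from_disk(os.environ["SM_CHANNEL_TRAIN"])
    eval_dataset = load_from_disk(os.environ["SM_CHANNEL_TEST"])

    model = AutoModelForSequenceClassification.from_pretrained(args.model_name, num_labels=2)
    tokenizer = AutoTokenizer.from_pretrained(args.model_name)

    training_args = TrainingArguments(
        output_dir=os.environ["SM_MODEL_DIR"],
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
        evaluation_strategy="epoch",
        logging_strategy="epoch",
    )

    trainer = Trainer(model=model, args=training_args,
                      train_dataset=train_dataset, eval_dataset=eval_dataset,
                      tokenizer=tokenizer, compute_metrics=compute_metrics)
    trainer.train()
    trainer.save_model(os.environ["SM_MODEL_DIR"])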

Deploy the model and perform inference

SageMaker offers multiple deployment options depending on your use case. For persistent, real-time endpoints that make one prediction at a time, we recommend using SageMaker real-time hosting services. If you have workloads that have idle periods between traffic spurts and can tolerate cold starts, we recommend using Serverless Inference. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. We demonstrate how to deploy our fine-tuned Hugging Face model to both a real-time inference endpoint and a Serverless Inference endpoint.

Deploy to a real-time inference endpoint

You can deploy a training object onto real-time inference hosting within SageMaker using the .deploy() method. For a full list of the accepted parameters, refer to Hugging Face Model. To start, let’s deploy the model to one instance, by passing in the following parameters: initial_instance_count, instance_type, and endpoint_name. See the following code:

rt_predictor = huggingface_estimator.deploy(initial_instance_count=1,
                                             instance_type="ml.g4dn.xlarge",
                                             endpoint_name="sts-sbert-paws")

The model takes a few minutes to deploy. After the model is deployed, we can submit sample records from the unseen test dataset to the endpoint for inference.

Deploy to a Serverless Inference endpoint

To deploy our training object onto a serverless endpoint, we need to first specify a serverless config file with memory_size_in_mb and max_concurrency arguments:

from sagemaker.serverless.serverless_inference_config import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=6144,
    max_concurrency=1,
)

memory_size_in_mb defines the total RAM size of your serverless endpoint; the minimal RAM size is 1024 MB (1 GB) and it can scale up to 6144 MB (6 GB). Generally, you should aim to choose a memory size that is at least as large as your model size. max_concurrency defines the quota for how many concurrent invocations can be processed at the same time (up to 50 concurrent invocations) for a single endpoint.

We also need to supply the Hugging Face inference image URI, which you can retrieve using the following code:

image_uri = sagemaker.image_uris.retrieve(
    framework="huggingface",
    base_framework_version="pytorch1.10",
    region=sess.boto_region_name,
    version="4.17",
    py_version="py38",
    instance_type="ml.m5.large",
    image_scope="inference",
)

Now that we have the serverless config file, we can create a serverless endpoint in the same way as our real-time inference endpoint, using the .deploy() method:

sl_predictor = huggingface_estimator.deploy(
    serverless_inference_config=serverless_config, image_uri=image_uri
)

The endpoint should be created in a few minutes.

Perform model inference

To make predictions, we need to create the sentence pair by adding the [CLS] and [SEP] special tokens and subsequently submit the input to the model endpoints. The syntax for real-time inference and serverless inference is the same:

import random 

rand = random.randrange(0, 8000)

true_label = dataset_test[rand]['label']
sent_1 = dataset_test[rand]['sentence1']
sent_2 = dataset_test[rand]['sentence2']

sentence_pair = {"inputs": ['[CLS] ' + sent_1 + ' [SEP] ' + sent_2 + ' [SEP]']}


# real-time inference 
print('Sentence 1:', sent_1) 
print('Sentence 2:', sent_2)
print()
print('Inference Endpoint:', rt_predictor.endpoint_name)
print('True Label:', true_label)
print('Predicted Label:', rt_predictor.predict(sentence_pair)[0]['label'])
print('Prediction Confidence:', rt_predictor.predict(sentence_pair)[0]['score'])

# serverless inference
print('Sentence 1:', sent_1) 
print('Sentence 2:', sent_2)
print()
print('Inference Endpoint:', sl_predictor.endpoint_name)
print('True Label:', true_label)
print('Predicted Label:', sl_predictor.predict(sentence_pair)[0]['label'])
print('Prediction Confidence:', sl_predictor.predict(sentence_pair)[0]['score'])

In the following examples, we can see the model is capable of correctly classifying whether the input sentence pair contains paraphrased sentences.

The following is a real-time inference example.

The following is a Serverless Inference example.

Evaluate model performance

To evaluate the model, let’s expand the preceding code and submit all 8,000 unseen test records to the real-time endpoint:

from tqdm import tqdm

preds = []
labels = []

# Inference takes ~5 minutes for all test records using a fine-tuned roberta-base and ml.g4dn.xlarge instance

for i in tqdm(range(len(dataset_test))):
    true_label = dataset_test[i]['label']
    sent_1 = dataset_test[i]['sentence1']
    sent_2 = dataset_test[i]['sentence2']
    
    sentence_pair = {"inputs": ['[CLS] ' + sent_1 + ' [SEP] ' + sent_2 + ' [SEP]']}
    pred = rt_predictor.predict(sentence_pair)
    
    labels.append(true_label)
    preds.append(int(pred[0]['label'].split('_')[1]))

Next, we can create a classification report using the extracted predictions:

from sklearn.metrics import classification_report

print('Endpoint Name:', rt_predictor.endpoint_name)
class_names = ['not paraphrase', 'paraphrase']
print(classification_report(labels, preds, target_names=class_names))

We get the following test scores.

We can observe that roberta-base has a combined macro-average F1 score of 92% and performs slightly better at detecting sentences that are paraphrases. The roberta-base model performs well, but it’s good practice to calculate model performance using at least one other model.

The following table compares roberta-base performance results on the same test set against another fine-tuned transformer called paraphrase-mpnet-base-v2, a sentence transformer pre-trained specifically for the paraphrase identification task. Both models were trained on an ml.p3.8xlarge instance.

The results show that roberta-base has a 1% higher F1 score with very similar training and inference times using real-time inference hosting on SageMaker. The performance difference between the models is relatively minor; however, roberta-base is ultimately the winner because it has marginally better performance metrics and almost identical training and inference times.

Model                      Precision  Recall  F1-score  Training time (billable)  Inference time (full test set)
roberta-base               0.92       0.93    0.92      18 minutes                2 minutes
paraphrase-mpnet-base-v2   0.92       0.91    0.91      17 minutes                2 minutes

Clean up

When you’re done using the model endpoints, you can delete them to avoid incurring future charges:

rt_predictor.delete_endpoint()
sl_predictor.delete_endpoint()

Conclusion

In this post, we discussed how to rapidly build a paraphrase identification model using Hugging Face transformers on SageMaker. We fine-tuned two pre-trained transformers, roberta-base and paraphrase-mpnet-base-v2, using the PAWS dataset (which contains sentence pairs with high lexical overlap). We demonstrated and discussed the benefits of real-time inference vs. Serverless Inference deployment, the latter being a new feature that targets spiky workloads and eliminates the need to manage scaling policies. On an unseen test set with 8,000 records, we demonstrated that both models achieved an F1 score greater than 90%.

To expand on this solution, consider the following:

  • Try fine-tuning with your own custom dataset. If you don’t have sufficient training labels, you could evaluate the performance of a fine-tuned model like the one demonstrated in this post on a custom test dataset.
  • Integrate this fine-tuned model into a downstream application that requires information on whether two sentences (or blocks of text) are paraphrases of each other.

Happy building!


About the Authors

Bala Krishnamoorthy is a Data Scientist with AWS Professional Services, where he enjoys applying machine learning to solve customer business problems. He specializes in natural language processing use cases and has worked with customers in industries such as software, finance and healthcare. In his free time, he enjoys trying new food, watching comedies and documentaries, working out at Orange Theory, and being out on the water (paddle-boarding, snorkeling and hopefully diving soon).

Ivan Cui is a Data Scientist with AWS Professional Services, where he helps customers build and deploy solutions using machine learning on AWS. He has worked with customers across diverse industries, including software, finance, pharmaceutical, and healthcare. In his free time, he enjoys reading, spending time with his family, and maximizing his stock portfolio.

Read More

How Moovit turns data into insights to help passengers avoid delays using Apache Airflow and Amazon SageMaker

This is a guest post by Moovit’s Software and Cloud Architect, Sharon Dahan.

Moovit, an Intel company, is a leading Mobility as a Service (MaaS) solutions provider and creator of the top urban mobility app. Moovit serves over 1.3 billion riders in 3,500 cities around the world.

We help people everywhere get to their destination in the smoothest way possible, by combining all options for real-time trip planning and payment in one app. We provide governments, cities, transit agencies, operators, and all organizations with mobility challenges with AI-powered mobility solutions that cover planning, operations, and analytics.

In this post, we describe how Moovit built an automated pipeline to train and deploy BERT models which classify public transportation service alerts in multiple metropolitan areas using Apache Airflow and Amazon SageMaker.

The service alert challenge

One of the key features in Moovit’s urban mobility app is offering access to transit service alerts (sourced from local operators and agencies) to app users around the world.

A service alert is a text message that describes a change (which can be positive or negative) in public transit service. These alerts are typically communicated by the operator in a long textual format and need to be analyzed in order to classify their potential impact on the user’s trip plan. The service alert classification affects the way transit recommendations are shown in the app. An incorrect classification may cause users to ignore important service interruptions that may impact their trip plan.

Service Alert in Moovit App

Existing solution and classification challenges

Historically, Moovit applied both automated rule-based classification (which works well for simple logic) as well as manual human classification for more complex cases.

For example, the alert "Line 46 will arrive 10 min later as a result of an accident with a deer" can be classified into one of the following categories:

1: "NO_SERVICE",
2: "REDUCED_SERVICE",
3: "SIGNIFICANT_DELAYS",
4: "DETOUR",
5: "ADDITIONAL_SERVICE",
6: "MODIFIED_SERVICE",
7: "OTHER_EFFECT",
9: "STOP_MOVED",

The above example should be classified as 3, which is SIGNIFICANT_DELAYS.

The existing rule-based classification solution searches the text for key phrases (for example delay or late) as illustrated in the following diagram.

Service Alert Diagram

While the rule-based classification engine offered accurate classifications, it was able to classify only 20% of the service alerts, requiring the other 80% to be classified manually. This wasn't scalable and resulted in gaps in our service alert coverage.
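To make the rule-based approach concrete, the following is a minimal sketch of a keyword-driven classifier; the phrase lists, category mapping, and fallback behavior are illustrative assumptions rather than Moovit's production rules.

# Illustrative keyword rules: map the first matching phrase to a category.
KEY_PHRASES = {
    "SIGNIFICANT_DELAYS": ["delay", "late", "longer waits"],
    "NO_SERVICE": ["no service", "suspended", "cancelled"],
    "DETOUR": ["detour", "rerouted"],
}

def classify_alert(text: str) -> str:
    text = text.lower()
    for category, phrases in KEY_PHRASES.items():
        if any(phrase in text for phrase in phrases):
            return category
    # Most alerts match no rule, which is why only ~20% could be auto-classified.
    return "UNCLASSIFIED"

print(classify_alert("Line 46 will arrive 10 min later as a result of an accident with a deer."))
# SIGNIFICANT_DELAYS ("late" matches as a substring of "later")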

NLP based classification with a BERT framework

We decided to leverage a neural network that can learn to classify service alerts and selected the BERT model for this challenge.

BERT (Bidirectional Encoder Representations from Transformers) is an open-source machine learning (ML) framework for natural language processing (NLP). BERT is designed to help computers understand the meaning of ambiguous language in the text by using surrounding text to establish context. The BERT framework was pre-trained using text from the BooksCorpus with 800M words and English Wikipedia with 2,500M words, and can be fine-tuned with question-answer datasets.

We leveraged classified data from our rule-based classification engine as ground truth for the training job and explored two possible approaches:

  • Approach 1: The first approach was to train using the pre-trained BERT model, which meant adding our own layers at the beginning and end of the pre-trained model.
  • Approach 2: The second approach was to use the BERT tokenizer with a standard five-layer model.

Comparison tests showed that, due to the limited amount of available ground truth data, the BERT tokenizer approach yielded better results, was less time-consuming, and required minimal compute resources for training. The model was able to successfully classify service alerts that could not be classified with the existing rule-based classification engine.
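The following is a minimal sketch of what the second approach can look like: the pre-trained BERT tokenizer encodes the alert text, and a small classifier is trained from scratch on top of those token IDs. The layer sizes, number of categories, and model definition are illustrative assumptions, not Moovit's actual architecture.

import torch
import torch.nn as nn
from transformers import BertTokenizerFast

NUM_CLASSES = 8  # hypothetical: one output per service alert category

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

class AlertClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_classes=NUM_CLASSES):
        super().__init__()
        # A small stack of layers trained from scratch (no BERT weights).
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc1 = nn.Linear(embed_dim, 64)
        self.fc2 = nn.Linear(64, 32)
        self.out = nn.Linear(32, num_classes)

    def forward(self, input_ids, attention_mask):
        x = self.embedding(input_ids)                     # (batch, seq, embed)
        mask = attention_mask.unsqueeze(-1)               # ignore padding tokens
        x = (x * mask).sum(1) / mask.sum(1).clamp(min=1)  # mean pooling
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.out(x)

model = AlertClassifier(vocab_size=tokenizer.vocab_size)
batch = tokenizer(
    ["Line 46 will arrive 10 min later as a result of an accident with a deer."],
    padding=True, truncation=True, return_tensors="pt",
)
logits = model(batch["input_ids"], batch["attention_mask"])  # shape: (1, NUM_CLASSES)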

The following diagram illustrates the model’s high-level architecture.
BERT high level architecture

After we have the trained model, we deploy it to a SageMaker endpoint and expose it to the Moovit backend server (with request payload being the service alert’s raw text). See the following example code:

{
   "instances": [
       "Expect longer waits for </br> B4, B8, B11, B12, B14, B17, B24, B35, B38, B47, B48, B57, B60, B61, B65, B68, B82, and B83 buses.rnrnWe're working to provide as much service as possible."
   ]
}

The response is the classification and the level of confidence:

{
   "response": [
       {
           "id": 1,
           "prediction": "SIGNIFICANT_DELAYS",
           "confidance": 0.921
       }
   ]
}
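Putting the request and response together, a backend call to the endpoint might look like the following sketch using boto3; the endpoint name is a placeholder, and the response keys are assumed to match the example above.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "instances": [
        "Expect longer waits for B4, B8, and B11 buses. We're working to provide as much service as possible."
    ]
}

response = runtime.invoke_endpoint(
    EndpointName="service-alert-classifier",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
result = json.loads(response["Body"].read())
print(result["response"][0]["prediction"], result["response"][0]["confidence"])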

From research to production – overcoming operational challenges

Once we trained an NLP model, we had to overcome several challenges in order to enable our app users to access service alerts at scale and in a timely manner:

  • How do we deploy a model to our production environment?
  • How do we serve the model at scale with low latency?
  • How do we re-train the model in order to future-proof our solution?
  • How do we expand to other metropolitan areas (aka “metros”) in an efficient way?

Prior to using SageMaker, we used to take the trained ML models and manually integrate them into our backend environment. This created a dependency between the model deployment and a backend upgrade. As a result, our ability to deploy new models was very limited and resulted in extremely rare model updates.

In addition, serving an ML model can require substantial compute resources, which are difficult to predict and need to be provisioned in advance to ensure adherence to our strict latency requirements. When the model is served within the backend, this can cause unnecessary scaling of compute resources and erratic behavior.

The solution to both of these challenges was to use SageMaker endpoints for our real-time inference requirements. This enabled us to (1) decouple the model serving and deployment cycle from the backend release schedule and (2) decouple the resource provisioning required for model serving (including during peak periods) from backend provisioning.

Because our group already had deep experience with Airflow, we decided to automate the entire pipeline using Airflow operators in conjunction with SageMaker. As shown below, we built a full CI/CD pipeline to automate data collection and model re-training, and to manage the deployment process. This pipeline can also be leveraged to make the entire process scalable to new metropolitan areas as we continue to increase our coverage in additional cities worldwide.

AI Lake architecture

The architecture shown in the following diagram is based on SageMaker and Airflow; all endpoints exposed to developers use Amazon API Gateway. This implementation was dubbed “AI lake”.

AI Lake Architecture

SageMaker helps data scientists and developers to prepare, build, train, and deploy high-quality machine learning models quickly by bringing together a broad set of capabilities purpose-built for machine learning.

Moovit uses SageMaker to automate the training and deployment process. The trained models are saved to Amazon Simple Storage Service (Amazon S3) and cataloged.

SageMaker helps us significantly reduce engineering time and lets us focus more on developing features for the business and less on the infrastructure required to support the model’s lifecycle. Below you can see Moovit’s SageMaker training jobs.

SageMaker Training Jobs

After we train a metro’s model, we expose it through a SageMaker endpoint. SageMaker enables us to deploy a new version to the app seamlessly, without any downtime.

SageMaker endpoint

Moovit uses API Gateway to expose all models under the same domain, as shown in the following screenshot.

API Gateway

Moovit decided to use Airflow to schedule and create a holistic workflow. Each model has its own workflow, which includes the following steps (a simplified Airflow DAG sketch follows the list):

  • Dataset generation – The owner of this step is the BI team. This step automatically creates a fully balanced dataset with which to train the model. The final dataset is saved to an S3 bucket.
  • Train – The owner of this step is the server team. This step fetches the dataset from the previous step and trains the model using SageMaker. SageMaker takes care of the whole training process, such as provisioning the instance, running the training code, saving the model, and saving the training job results and logs.
  • Verify – This step is owned by the data science team. During the verification step, Moovit runs a confusion matrix and checks some of the parameters to make sure that the model is healthy and stands within proper thresholds. If the new model misses the criteria, the flow is canceled and the deploy step doesn’t run.
  • Deploy – The owner of this step is the DevOps team. This step triggers the deploy function for SageMaker (using Boto3) to update the existing endpoint or create a new one.
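The following is a simplified, hypothetical Airflow DAG showing how these four steps could be chained; the task implementations, DAG name, and schedule are illustrative assumptions, not Moovit's production code.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def generate_dataset(**context):
    # BI team: build a balanced dataset and save it to an S3 bucket.
    ...

def train_model(**context):
    # Server team: launch a SageMaker training job on the dataset from the previous step.
    ...

def verify_model(**context):
    # Data science team: compute a confusion matrix and fail the run if thresholds are missed.
    ...

def deploy_model(**context):
    # DevOps team: update (or create) the SageMaker endpoint using Boto3.
    ...

with DAG(
    dag_id="service_alert_model_pipeline",  # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    dataset = PythonOperator(task_id="dataset_generation", python_callable=generate_dataset)
    train = PythonOperator(task_id="train", python_callable=train_model)
    verify = PythonOperator(task_id="verify", python_callable=verify_model)
    deploy = PythonOperator(task_id="deploy", python_callable=deploy_model)

    dataset >> train >> verify >> deploy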

Results

With the AI lake solution and service alert classification model, Moovit accomplished two major achievements:

  • Functional – In metros where the service alert classification model was deployed, Moovit achieved 3x growth in the percentage of classified service alerts (from 20% to over 60%)!
  • Operational – Moovit now has the ability to maintain and develop more ML models with less engineering effort, and with very clear and outlined best practices and responsibilities. This opens new opportunities for integrating AI and ML models into Moovit’s products and technologies.

The following charts illustrate the service alert classifications before (left) and after (right) implementing this solution – the turquoise area is the unclassified alerts (aka “modified service”).

service alerts before and after

Conclusion

In this post, we shared how Moovit used SageMaker with Airflow to increase the number of classified service alerts by 200% (3x). Moovit is now able to maintain and develop more ML models with less engineering effort, and with very clear practices and responsibilities.



About the Authors

Sharon Dahan is a Software & Cloud Architect at Moovit. He is responsible for bringing innovative and creative solutions that can stand up to Moovit’s tremendous scale. In his spare time, Sharon makes tasty hoppy beer.

Miron Perel is a Senior Machine Learning Business Development Manager with Amazon Web Services. Miron helps enterprise organizations harness the power of data and Machine Learning to innovate and grow their business.

Eitan Sela is a Machine Learning Specialist Solutions Architect with Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them build and operate machine learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.

Read More

Build and deploy a scalable machine learning system on Kubernetes with Kubeflow on AWS

In this post, we demonstrate Kubeflow on AWS (an AWS-specific distribution of Kubeflow) and the value it adds over open-source Kubeflow through the integration of highly optimized, cloud-native, enterprise-ready AWS services.

Kubeflow is the open-source machine learning (ML) platform dedicated to making deployments of ML workflows on Kubernetes simple, portable and scalable. Kubeflow provides many components, including a central dashboard, multi-user Jupyter notebooks, Kubeflow Pipelines, KFServing, and Katib, as well as distributed training operators for TensorFlow, PyTorch, MXNet, and XGBoost, to build simple, scalable, and portable ML workflows.

AWS recently launched Kubeflow v1.4 as part of its own Kubeflow distribution (called Kubeflow on AWS), which streamlines data science tasks and helps build highly reliable, secure, portable, and scalable ML systems with reduced operational overheads through integrations with AWS managed services. You can use this Kubeflow distribution to build ML systems on top of Amazon Elastic Kubernetes Service (Amazon EKS) to build, train, tune, and deploy ML models for a wide variety of use cases, including computer vision, natural language processing, speech translation, and financial modeling.

Challenges with open-source Kubeflow

When you use an open-source Kubeflow project, it deploys all Kubeflow control plane and data plane components on Kubernetes worker nodes. Kubeflow component services are deployed as part of the Kubeflow control plane, and all resource deployments related to Jupyter, model training, tuning, and hosting are deployed on the Kubeflow data plane. The Kubeflow control plane and data plane can run on the same or different Kubernetes worker nodes. This post focuses on Kubeflow control plane components, as illustrated in the following diagram.

This deployment model may not provide an enterprise-ready experience due to the following reasons:

  • All Kubeflow control plane heavy-lifting infrastructure components, including the database, storage, and authentication, are deployed on the Kubernetes cluster’s worker nodes themselves. This makes it challenging to implement a highly available Kubeflow control plane architecture that can persist state in the event of a worker node failure.
  • Kubeflow control plane generated artifacts (such as MySQL instances, pod logs, or MinIO storage) grow over time and need resizable storage volumes with continuous monitoring capabilities to meet the growing storage demand. Because the Kubeflow control plane shares resources with Kubeflow data plane workloads (for example, for training jobs, pipelines, and deployments), right-sizing and scaling Kubernetes cluster and storage volumes can become challenging and result in increased operational cost.
  • Kubernetes restricts the log file size, with most installations keeping only the most recent 10 MB of logs. By default, pod logs become inaccessible after they reach this upper limit. Logs can also become inaccessible if pods are evicted, crash, are deleted, or are scheduled on a different node, which could impact your application log availability and monitoring capabilities.

Kubeflow on AWS

Kubeflow on AWS provides a clear path to use Kubeflow, with integrations for the following AWS services:

  • Amazon Cognito for user authentication
  • Amazon RDS and Amazon S3 for persistent metadata and artifact storage
  • Amazon EFS and Amazon FSx for Lustre for distributed file systems

These AWS service integrations with Kubeflow (as shown in the following diagram) allow us to decouple critical parts of the Kubeflow control plane from Kubernetes, providing a secure, scalable, resilient, and cost-optimized design.

Let’s discuss the benefits of each service integration and their solutions around security, running ML pipelines, and storage.

Secure authentication of Kubeflow users with Amazon Cognito

Cloud security at AWS is the highest priority, and we’re investing in tightly integrating Kubeflow security directly into the AWS shared-responsibility security services.

In this section, we focus on AWS Kubeflow control plane integration with Amazon Cognito. Amazon Cognito removes the need to manage and maintain a native Dex (open-source OpenID Connect (OIDC) provider backed by local LDAP) solution for user authentication and makes secret management easier.

You can also use Amazon Cognito to add user sign-up, sign-in, and access control to your Kubeflow UI quickly and easily. Amazon Cognito scales to millions of users and supports sign-in with social identity providers (IdPs), such as Facebook, Google, and Amazon, and enterprise IdPs via SAML 2.0. This reduces the complexity in your Kubeflow setup, making it operationally lean and easier to operate to achieve multi-user isolation.

Let’s look at a multi-user authentication flow with Amazon Cognito, ALB, and ACM integrations with Kubeflow on AWS. There are a number of key components as part of this integration. Amazon Cognito is configured as an IdP with an authentication callback configured to route the request to Kubeflow after user authentication. As part of the Kubeflow setup, a Kubernetes ingress resource is created to manage external traffic to the Istio Gateway service. The AWS ALB Ingress Controller provisions a load balancer for that ingress. We use Amazon Route 53 to configure a public DNS for the registered domain and create certificates using ACM to enable TLS authentication at the load balancer.

The following diagram shows the typical user workflow of logging in to Amazon Cognito and getting redirected to Kubeflow in their respective namespace.

The workflow contains the following steps:

  1. The user sends an HTTPS request to the Kubeflow central dashboard hosted behind a load balancer. Route 53 resolves the FQDN to the ALB alias record.
  2. If the cookie isn’t present, the load balancer redirects the user to the Amazon Cognito authorization endpoint so that Amazon Cognito can authenticate the user.
  3. After the user is authenticated, Amazon Cognito sends the user back to the load balancer with an authorization grant code.
  4. The load balancer presents the authorization grant code to the Amazon Cognito token endpoint.
  5. Upon receiving a valid authorization grant code, Amazon Cognito provides the ID token and access token to load balancer.
  6. After your load balancer authenticates a user successfully, it sends the access token to the Amazon Cognito user info endpoint and receives user claims. The load balancer signs and adds user claims to the HTTP header x-amzn-oidc-* in a JSON web token (JWT) request format.
  7. The request from the load balancer is sent to the Istio Ingress Gateway’s pod.
  8. Using an Envoy filter, the Istio Gateway decodes the x-amzn-oidc-data value, retrieves the email field, and adds the custom HTTP header kubeflow-userid, which is used by the Kubeflow authorization layer (a conceptual sketch of this decoding follows the list).
  9. Istio’s resource-based access control policies are applied to the incoming request to validate access to the Kubeflow Dashboard. If the requested resource isn’t accessible to the user, an error response is sent back; if the request is validated, it’s forwarded to the appropriate Kubeflow service, which provides access to the Kubeflow Dashboard.
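The following is a conceptual Python sketch of what the header decoding in step 8 amounts to; the real implementation is an Envoy filter inside the Istio Gateway (not Python), and a production implementation must also verify the JWT signature.

import base64
import json

def extract_email(x_amzn_oidc_data: str) -> str:
    # A JWT is header.payload.signature; the user claims live in the payload segment.
    payload_b64 = x_amzn_oidc_data.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["email"]

# headers["kubeflow-userid"] = extract_email(headers["x-amzn-oidc-data"])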

Persisting Kubeflow component metadata and artifact storage with Amazon RDS and Amazon S3

Kubeflow on AWS provides integration with Amazon Relational Database Service (Amazon RDS) in Kubeflow Pipelines and AutoML (Katib) for persistent metadata storage, and Amazon S3 in Kubeflow Pipelines for persistent artifact storage. Let’s continue to discuss Kubeflow Pipelines in more detail.

Kubeflow Pipelines is a platform for building and deploying portable, scalable ML workflows. These workflows can help automate complex ML pipelines using built-in and custom Kubeflow components. Kubeflow Pipelines includes a Python SDK, a DSL compiler to convert Python code into a static configuration, a Pipelines service that runs pipelines from that static configuration, and a set of controllers that run the containers within the Kubernetes Pods needed to complete the pipeline.
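As a minimal, hypothetical illustration of that flow, the following uses the kfp v1 Python SDK to define a single lightweight component, compose it into a pipeline, and compile it to the static configuration that the Pipelines service runs; the names and base image are placeholders.

import kfp
from kfp.components import create_component_from_func

def say_hello(name: str) -> str:
    return f"Hello, {name}"

# Wrap the Python function as a containerized pipeline component.
hello_op = create_component_from_func(say_hello, base_image="python:3.9")

@kfp.dsl.pipeline(name="hello-pipeline", description="Minimal example pipeline")
def hello_pipeline(name: str = "Kubeflow"):
    hello_op(name)

if __name__ == "__main__":
    # Compile to the static configuration that can be uploaded to Kubeflow Pipelines.
    kfp.compiler.Compiler().compile(hello_pipeline, "hello_pipeline.yaml")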

Kubeflow Pipelines metadata for pipeline experiments and runs are stored in MySQL, and artifacts including pipeline packages and metrics are stored in MinIO.

As shown in the following diagram, Kubeflow on AWS lets you store the following components with AWS managed services:

  • Pipeline metadata in Amazon RDS – Amazon RDS provides a scalable, highly available, and reliable Multi-AZ deployment architecture with a built-in automated failover mechanism and resizable capacity for an industry-standard relational database like MySQL. It manages common database administration tasks without needing to provision infrastructure or maintain software.
  • Pipeline artifacts in Amazon S3 – Amazon S3 offers industry-leading scalability, data availability, security, and performance, and could be used to meet your compliance requirements.

These integrations help offload the management and maintenance of the metadata and artifact storage from self-managed Kubeflow to AWS managed services, which is easier to set up, operate, and scale.

Support for distributed file systems with Amazon EFS and Amazon FSx

Kubeflow builds upon Kubernetes, which provides an infrastructure for large-scale, distributed data processing, including training and tuning large models with a deep network with millions or even billions of parameters. To support such distributed data processing ML systems, Kubeflow on AWS provides integration with the following storage services:

  1. Amazon EFS – A high-performance, cloud-native, distributed file system that you can manage through the Amazon EFS CSI driver. Amazon EFS provides ReadWriteMany access mode, and you can now use it to mount into pods (Jupyter, model training, model tuning) running in a Kubeflow data plane to provide a persistent, scalable, and shareable workspace that automatically grows and shrinks as you add and remove files, with no need for management.
  2. Amazon FSx for Lustre – An optimized file system for compute-intensive workloads, such as high-performance computing and ML, that you can manage through the Amazon FSx CSI driver. FSx for Lustre provides ReadWriteMany access mode as well, and you can use it to cache training data with direct connectivity to Amazon S3 as the backing store, which you can use to support Jupyter notebook servers or distributed training running in a Kubeflow data plane. With this configuration, you don’t need to transfer data to the file system before using the volume. FSx for Lustre provides consistent submillisecond latencies and high concurrency, and can scale to TB/s of throughput and millions of IOPS.

Kubeflow deployment options

AWS provides various Kubeflow deployment options:

  • Deployment with Amazon Cognito
  • Deployment with Amazon RDS and Amazon S3
  • Deployment with Amazon Cognito, Amazon RDS, and Amazon S3
  • Vanilla deployment

For details on service integration and available add-ons for each of these options, refer to Deployment Options. Choose the option that best fits your use case.

In the following section, we walk through the steps to install AWS Kubeflow v1.4 distribution on Amazon EKS. Then we use the existing XGBoost pipeline example available on the Kubeflow central UI dashboard to demonstrate the integration and usage of AWS Kubeflow with Amazon Cognito, Amazon RDS, and Amazon S3, with Secrets Manager as an add-on.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Install the following tools on the client machine used to access your Kubernetes cluster. You can use AWS Cloud9, a cloud-based integrated development environment (IDE) for the Kubernetes cluster setup.

Install Kubeflow on AWS

Configure kubectl so that you can connect to an Amazon EKS cluster:

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

# Set the cluster name, region where the cluster exists
export CLUSTER_NAME=<CLUSTER_NAME>
export CLUSTER_REGION=<CLUSTER_REGION>

aws eks update-kubeconfig --name $CLUSTER_NAME --region $CLUSTER_REGION
kubectl config current-context

Various controllers in the Kubeflow deployment use IAM roles for service accounts (IRSA). An OIDC provider must exist for your cluster to use IRSA. If your cluster doesn’t already have one, create an OIDC provider and associate it with your Amazon EKS cluster by running the following command:

eksctl utils associate-iam-oidc-provider --cluster ${CLUSTER_NAME} \
--region ${CLUSTER_REGION} --approve

Clone the AWS manifests repo and Kubeflow manifests repo, and checkout the respective release branches:

git clone https://github.com/awslabs/kubeflow-manifests.git
cd kubeflow-manifests
git checkout v1.4.1-aws-b1.0.0
git clone --branch v1.4.1 https://github.com/kubeflow/manifests.git upstream

export kubeflow_manifest_dir=$PWD

For more information about these versions, refer to Releases and Versioning.

Set up Amazon RDS, Amazon S3, and Secrets Manager

You create the Amazon RDS and Amazon S3 resources before you deploy the Kubeflow manifests. We use an automated Python script that takes care of creating the S3 bucket, the RDS database, and the required secrets in Secrets Manager. The script also edits the required configuration files so that Kubeflow Pipelines and AutoML are properly configured to use the RDS database and S3 bucket during Kubeflow installation.

Create an IAM user with permissions to allow GetBucketLocation and read and write access to objects in an S3 bucket where you want to store the Kubeflow artifacts. Use the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY of the IAM user in the following code:

cd ${kubeflow_manifest_dir}/tests/e2e/
export BUCKET_NAME=<S3_BUCKET_NAME>
export S3_ACCESS_KEY_ID=<YOUR_ACCESS_KEY_ID_FOR_S3>
export S3_SECRET_ACCESS_KEY=<YOUR_SECRET_ACCESS_KEY_FOR_S3>

#Install the dependencies for the script
pip install -r requirements.txt

#Replace YOUR_CLUSTER_REGION, YOUR_CLUSTER_NAME and YOUR_S3_BUCKET with your values.
PYTHONPATH=.. python utils/rds-s3/auto-rds-s3-setup.py --region ${CLUSTER_REGION} --cluster ${CLUSTER_NAME} --bucket ${BUCKET_NAME} --db_name kubeflow --db_root_user admin --db_root_password password --s3_aws_access_key_id ${S3_ACCESS_KEY_ID} --s3_aws_secret_access_key ${S3_SECRET_ACCESS_KEY}

Set up Amazon Cognito as the authentication provider

In this section, we create a custom domain in Route 53 and ALB to route external traffic to Kubeflow Istio Gateway. We use ACM to create a certificate to enable TLS authentication at ALB and Amazon Cognito to maintain the user pool and manage user authentication.

Substitute the following values in ${kubeflow_manifest_dir}/tests/e2e/utils/cognito_bootstrap/config.yaml:
  • route53.rootDomain.name – The registered domain. Let’s assume this domain is example.com.
  • route53.rootDomain.hostedZoneId – If your domain is managed in Route53, enter the hosted zone ID found under the hosted zone details. Skip this step if your domain is managed by another domain provider.
  • route53.subDomain.name – The name of the subdomain where you want to host Kubeflow (for example, platform.example.com). For more information about subdomains, refer to Deploying Kubeflow with AWS Cognito as IdP.
  • cluster.name – The name of the cluster where Kubeflow is deployed.
  • cluster.region – The cluster Region where Kubeflow is deployed (for example, us-west-2).
  • cognitoUserpool.name – The name of the Amazon Cognito user pool (for example, kubeflow-users).

The config file looks something like the following code:

cognitoUserpool:
    name: kubeflow-users
cluster:
    name: kube-eks-cluster
    region: us-west-2
route53:
    rootDomain:
        hostedZoneId: XXXX
        name: example.com
    subDomain:
        name: platform.example.com

Run the script to create the resources:

cd ${kubeflow_manifest_dir}/tests/e2e/
PYTHONPATH=.. python utils/cognito_bootstrap/cognito_pre_deployment.py

The script updates the config.yaml file with the resource names, IDs, and ARNs it created. It looks something like the following code:

cognitoUserpool:
    ARN: arn:aws:cognito-idp:us-west-2:123456789012:userpool/us-west-2_yasI9dbxF
    appClientId: 5jmk7ljl2a74jk3n0a0fvj3l31
    domainAliasTarget: xxxxxxxxxx.cloudfront.net
    domain: auth.platform.example.com
    name: kubeflow-users
kubeflow:
    alb:
        serviceAccount:
            name: alb-ingress-controller
            policyArn: arn:aws:iam::123456789012:policy/alb_ingress_controller_kube-eks-clusterxxx
cluster:
    name: kube-eks-cluster
    region: us-west-2
route53:
    rootDomain:
        certARN: arn:aws:acm:us-east-1:123456789012:certificate/9d8c4bbc-3b02-4a48-8c7d-d91441c6e5af
        hostedZoneId: XXXXX
        name: example.com
    subDomain:
        us-west-2-certARN: arn:aws:acm:us-west-2:123456789012:certificate/d1d7b641c238-4bc7-f525-b7bf-373cc726
        hostedZoneId: XXXXX
        name: platform.example.com
        us-east-1-certARN: arn:aws:acm:us-east-1:123456789012:certificate/373cc726-f525-4bc7-b7bf-d1d7b641c238

Build manifests and deploy Kubeflow

Deploy Kubeflow using the following command:

while ! kustomize build ${kubeflow_manifest_dir}/docs/deployment/cognito-rds-s3 | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

Update the domain with the ALB address

The deployment creates an ingress-managed AWS application load balancer. We update the DNS entries for the subdomain in Route 53 with the DNS of the load balancer. Run the following command to check if the load balancer is provisioned (this takes around 3–5 minutes):

kubectl get ingress -n istio-system

NAME            CLASS    HOSTS   ADDRESS                                                                  PORTS   AGE
istio-ingress   <none>   *       ebde55ee-istiosystem-istio-2af2-1100502020.us-west-2.elb.amazonaws.com   80      15d

If the ADDRESS field is empty after a few minutes, check the logs of alb-ingress-controller. For instructions, refer to ALB fails to provision.

When the load balancer is provisioned, copy the DNS name of the load balancer and substitute the address for kubeflow.alb.dns in ${kubeflow_manifest_dir}/tests/e2e/utils/cognito_bootstrap/config.yaml. The Kubeflow section of the config file looks like the following code:

kubeflow:
    alb:
        dns: ebde55ee-istiosystem-istio-2af2-1100502020.us-west-2.elb.amazonaws.com
        serviceAccount:
            name: alb-ingress-controller
            policyArn: arn:aws:iam::123456789012:policy/alb_ingress_controller_kube-eks-clusterxxx

Run the following script to update the DNS entries for the subdomain in Route 53 with the DNS of the provisioned load balancer:

cd ${kubeflow_manifest_dir}/tests/e2e/
PYTHONPATH=.. python utils/cognito_bootstrap/cognito_post_deployment.py

Troubleshooting

If you run into any issues during the installation, refer to the troubleshooting guide or start fresh by following the “Clean up” section in this blog.

Use case walkthrough

Now that we have completed installing the required Kubeflow components, let’s see them in action using one of the existing examples provided by Kubeflow Pipelines on the dashboard.

Access the Kubeflow Dashboard using Amazon Cognito

To get started, let’s get access to the Kubeflow Dashboard. Because we used Amazon Cognito as the IdP, use the information provided in the official README file. We first create some users on the Amazon Cognito console. These are the users who will log in to the central dashboard. Next, create a profile for the user you created. Then you should be able to access the dashboard through the login page at https://kubeflow.platform.example.com.
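If you prefer scripting the user creation, the following is a hypothetical boto3 equivalent of the console steps; the user pool ID is taken from the generated config.yaml and the email address is a placeholder. Amazon Cognito emails the new user a temporary password by default.

import boto3

cognito = boto3.client("cognito-idp", region_name="us-west-2")

# Create a dashboard user in the Kubeflow user pool (placeholder values).
cognito.admin_create_user(
    UserPoolId="us-west-2_yasI9dbxF",  # from cognitoUserpool.ARN in config.yaml
    Username="user@example.com",
    UserAttributes=[{"Name": "email", "Value": "user@example.com"}],
)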

The following screenshot shows our Kubeflow Dashboard.

Run the pipeline

On the Kubeflow Dashboard, choose Pipelines in the navigation pane. You should see four examples provided by Kubeflow Pipelines that you can run directly to explore various Pipelines features.

For this post, we use the XGBoost sample called [Demo] XGBoost – Iterative model training. You can find the source code on GitHub. This is a simple pipeline that uses the existing XGBoost/Train and XGBoost/Predict Kubeflow pipeline components to iteratively train a model until the metrics are considered good based on specified metrics.

To run the pipeline, complete the following steps:

  1. Select the pipeline and choose Create experiment.
  2. Under Experiment details, enter a name (for this post, demo-blog) and optional description.
  3. Choose Next.
  4. Under Run details, choose your pipeline and pipeline version.
  5. For Run name, enter a name.
  6. For Experiment, choose the experiment you created.
  7. For Run type, select One-off.
  8. Choose Start.

After the pipeline starts running, you should see components completing (within a few seconds). At this stage, you can choose any of the completed components to see more details.

Access the artifacts in Amazon S3

While deploying Kubeflow, we specified Kubeflow Pipelines should use Amazon S3 to store its artifacts. This includes all pipeline output artifacts, cached runs, and pipeline graphs—all of which can then be used for rich visualizations and performance evaluation.

When the pipeline run is complete, you should be able to see the artifacts in the S3 bucket you created during installation. To confirm this, choose any completed component of the pipeline and check the Input/Output section on the default Graph tab. The artifact URLs should point to the S3 bucket that you specified during deployment.

To confirm that the resources were added to Amazon S3, we can also check the S3 bucket in our AWS account via the Amazon S3 console.

The following screenshot shows our files.
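Alternatively, a quick programmatic check with boto3 lists the newly written artifacts; substitute the bucket name you passed to the setup script.

import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="<S3_BUCKET_NAME>", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])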

Verify ML metadata in Amazon RDS

We also integrated Kubeflow Pipelines with Amazon RDS during deployment, which means that any pipeline metadata should be stored in Amazon RDS. This includes any runtime information such as the status of a task, availability of artifacts, custom properties associated with the run or artifacts, and more.

To verify the Amazon RDS integration, follow the steps provided in the official README file. Specifically, complete the following steps:

  1. Get the Amazon RDS user name and password from the secret that was created during the installation:
    export CLUSTER_REGION=<region>

    aws secretsmanager get-secret-value \
        --region $CLUSTER_REGION \
        --secret-id rds-secret \
        --query 'SecretString' \
        --output text

  2. Use these credentials to connect to Amazon RDS from within the cluster:
    kubectl run -it --rm --image=mysql:5.7 --restart=Never mysql-client -- mysql -h <YOUR RDS ENDPOINT> -u admin -pKubefl0w

  3. When the MySQL prompt opens, we can verify the mlpipeline database as follows:
    mysql> use mlpipeline; show tables;
    
    +----------------------+
    | Tables_in_mlpipeline |
    +----------------------+
    | db_statuses          |
    | default_experiments  |
    | experiments          |
    | jobs                 |
    | pipeline_versions    |
    | pipelines            |
    | resource_references  |
    | run_details          |
    | run_metrics          |
    +----------------------+

  4. Now we can read the contents of specific tables to make sure that we can see metadata information about the experiments that ran the pipelines:
    mysql> select * from experiments;
    +--------------------------------------+---------+-------------------------------------------------------------------------+----------------+---------------------------+------------------------+
    | UUID                                 | Name    | Description                                                             | CreatedAtInSec | Namespace                 | StorageState           |
    +--------------------------------------+---------+-------------------------------------------------------------------------+----------------+---------------------------+------------------------+
    | 36ed05cf-e341-4ff4-917a-87c43be8afce | Default | All runs created without specifying an experiment will be grouped here. |     1647214692 |                           | STORAGESTATE_AVAILABLE |
    | 7a1d6b85-4c97-40dd-988b-b3b91cf31545 | run-1   |                                                                         |     1647216152 | kubeflow-user-example-com | STORAGESTATE_AVAILABLE |
    +--------------------------------------+---------+-------------------------------------------------------------------------+----------------+---------------------------+------------------------+
    2 rows in set (0.00 sec)

Clean up

To uninstall Kubeflow and delete the AWS resources you created, complete the following steps:

  1. Delete the ingress and ingress-managed load balancer by running the following command:
    kubectl delete ingress -n istio-system istio-ingress

  2. Delete the rest of the Kubeflow components:
    kustomize build ${kubeflow_manifest_dir}/docs/deployment/cognito-rds-s3 | kubectl delete -f -

  3. Delete the AWS resources created by scripts:
    1. Resources created for Amazon RDS and Amazon S3 integration. Make sure you have the configuration file created by the script in ${kubeflow_manifest_dir}/tests/e2e/utils/rds-s3/metadata.yaml:
      cd ${kubeflow_manifest_dir}/tests/e2e/
      PYTHONPATH=.. python utils/rds-s3/auto-rds-s3-cleanup.py

    2. Resources created for Amazon Cognito integration. Make sure you have the configuration file created by the script in ${kubeflow_manifest_dir}/tests/e2e/utils/cognito_bootstrap/config.yaml:
      cd ${kubeflow_manifest_dir}/tests/e2e/
      PYTHONPATH=.. python utils/cognito_bootstrap/cognito_resources_cleanup.py

  4. If you created a dedicated Amazon EKS cluster for Kubeflow using eksctl, you can delete it with the following command:
    eksctl delete cluster --region $CLUSTER_REGION --name $CLUSTER_NAME

Summary

In this post, we highlighted the value that Kubeflow on AWS provides through native AWS-managed service integrations for secure, scalable, and enterprise-ready AI and ML workloads. You can choose from several deployment options to install Kubeflow on AWS with various service integrations. The use case in this post demonstrated Kubeflow integration with Amazon Cognito, Secrets Manager, Amazon RDS, and Amazon S3. To get started with Kubeflow on AWS, refer to the available AWS-integrated deployment options in Kubeflow on AWS.

Starting with v1.3, you can follow the AWS Labs repository to track all AWS contributions to Kubeflow. You can also find us on the Kubeflow #AWS Slack Channel; your feedback there will help us prioritize the next features to contribute to the Kubeflow project.


About the Authors

Kanwaljit Khurmi is an AI/ML Specialist Solutions Architect at Amazon Web Services. He works with AWS product and engineering teams, as well as customers, to provide guidance and technical assistance that helps them improve the value of their hybrid ML solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Meghna Baijal is a Software Engineer with AWS AI making it easier for users to onboard their Machine Learning workloads onto AWS by building ML products and platforms such as the Deep Learning Containers, the Deep Learning AMIs, the AWS Controllers for Kubernetes (ACK) and Kubeflow on AWS. Outside of work she enjoys reading, traveling and dabbling in painting.

Suraj Kota is a Software Engineer specialized in Machine Learning infrastructure. He builds tools to help customers easily get started and scale machine learning workloads on AWS. He worked on the AWS Deep Learning Containers, Deep Learning AMI, SageMaker Operators for Kubernetes, and other open source integrations like Kubeflow.
