How Amp on Amazon used data to increase customer engagement, Part 2: Building a personalized show recommendation platform using Amazon SageMaker

Amp is a new live radio app from Amazon. With Amp, you can host your own radio show and play songs from the Amazon Music catalog, or tune in and listen to shows other Amp users are hosting. In an environment where content is plentiful and diverse, it’s important to tailor the user experience to each user’s individual taste, so they can easily find shows they like and discover new content they would enjoy.

Amp uses machine learning (ML) to provide personalized recommendations for live and upcoming Amp shows on the app’s home page. The recommendations are computed by a Random Forest model that uses features representing the popularity of a show (such as listen and like counts), the popularity of a creator (such as the total number of times their recent shows were played), and a user’s personal affinities to a show’s topic and creator. Affinities are computed either implicitly from the user’s behavioral data or explicitly from topics of interest (such as pop music, baseball, or politics) as provided in their user profiles.
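As a purely illustrative sketch (not the production Amp model), the ranking step can be thought of as scoring each candidate show for a user with a trained random forest and sorting by predicted engagement probability. The feature names and training data below are hypothetical:

```python
# Illustrative only: a toy random forest ranker over hypothetical features.
# The real Amp model, features, and training data are not shown in this post.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["listen_count", "like_count", "creator_recent_plays",
            "topic_affinity", "creator_affinity"]  # hypothetical feature names

# X_train: one row per (user, show) pair; y_train: 1 if the user engaged with the show
rng = np.random.default_rng(0)
X_train = rng.random((1000, len(FEATURES)))
y_train = rng.integers(0, 2, size=1000)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

def rank_shows(feature_rows, show_ids):
    """Return show IDs ordered by predicted engagement probability."""
    scores = model.predict_proba(feature_rows)[:, 1]
    ranked = sorted(zip(show_ids, scores), key=lambda pair: pair[1], reverse=True)
    return [show_id for show_id, _ in ranked]
```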

This is Part 2 of a series on using data analytics and ML for Amp and creating a personalized show recommendation list platform. The platform has shown a 3% boost to customer engagement metrics tracked (liking a show, following a creator, enabling upcoming show notifications) since its launch in May 2022.

Refer to Part 1 to learn how behavioral data was collected and processed using the data and analytics systems.

Solution overview

The ML-based show recommender for Amp has five main components, as illustrated in the following architecture diagram:

  1. The Amp mobile app.
  2. Back-end services that gather the behavioral data, such as likes and follows, as well as broadcast show-related information such as status updates when shows go live.
  3. Real-time ingestion of behavioral and show data, and real-time (online) feature computing and storage.
  4. Batch (offline) feature computing and storage.
  5. A Recommender System that handles incoming requests from the app backend for getting a list of shows. This includes real-time inference to rank shows based on personalized and non-personalized features.

This post focuses on components 3, 4, and 5 of this architecture.

The following diagram shows the high-level architecture and its components.

In the following sections, we provide more details regarding real-time feature computing, batch feature computing, real-time inference, operational health, and the outcomes we observed.

Real-time feature computing

Some features, such as the like and listen count for a show, need to be streamed in continuously and used as is, whereas others, such as the number of listening sessions longer than 5 minutes, need to be transformed in real time as raw session data is streamed. These types of features, whose values must be computed at inference time, are referred to as point-in-time (PIT) features. Data for PIT features needs to be updated quickly, and the latest version should be written and read with low latency (under 20 milliseconds per user for 1,000 shows). The data also needs to be in durable storage, because missing or partial data can lead to deteriorated recommendations and a poor customer experience. In addition to the read/write latency requirement, PIT features also require low reflection time. Reflection time is the time it takes for a feature to become available to read after the contributing events were emitted, for example, the time between a listener liking a show and the PIT LikeCount feature being updated.

The sources of this data are the backend services directly serving the app. Some of the data is transformed into metrics that are then broadcast via Amazon Simple Notification Service (Amazon SNS) to downstream listeners such as the ML feature transformation pipeline. An in-memory database such as Amazon MemoryDB is an ideal service for durable storage and ultra-fast performance at high volumes. The compute component that transforms and writes features to MemoryDB is AWS Lambda. App traffic follows daily and weekly patterns of peaks and dips depending on the time and day, and Lambda scales automatically with the incoming volume of events. The independent nature of each individual metric transformation also makes Lambda, which is a stateless service on its own, a good fit for this problem. Placing Amazon Simple Queue Service (Amazon SQS) between Amazon SNS and Lambda not only prevents message loss but also acts as a buffer for unexpected bursts of traffic that preconfigured Lambda concurrency limits may not be sufficient to serve.
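The following is a minimal sketch of such a Lambda handler, assuming SNS messages delivered through SQS and a hypothetical event schema and Redis key layout; MemoryDB is Redis compatible, so a standard Redis client is used:

```python
# Sketch of a Lambda handler that consumes SNS messages delivered through SQS
# and updates point-in-time (PIT) features in MemoryDB (Redis-compatible).
# The event fields, key layout, and endpoint are illustrative assumptions.
import json
import os
import redis

# MemoryDB cluster endpoint, supplied through an environment variable
r = redis.Redis(host=os.environ["MEMORYDB_ENDPOINT"], port=6379, ssl=True)

def handler(event, context):
    for record in event["Records"]:             # SQS batch
        body = json.loads(record["body"])       # SQS message body
        message = json.loads(body["Message"])   # original SNS payload
        show_id = message["showId"]

        if message["eventType"] == "LIKE":
            # Increment the PIT LikeCount feature for this show
            r.hincrby(f"show:{show_id}:features", "LikeCount", 1)
        elif message["eventType"] == "LISTEN_SESSION" and message["durationSec"] > 300:
            # Count listening sessions longer than 5 minutes
            r.hincrby(f"show:{show_id}:features", "LongSessionCount", 1)
```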

Batch feature computing

Features that use historical behavioral data to represent a user’s ever-evolving taste are more complex to compute and can’t be computed in real time. These features are computed by a batch process that runs periodically, for example once a day. Data for batch features should support fast querying for filtering and aggregation, and may span long periods of time, so it is larger in volume. Because batch features are also retrieved and sent as inputs for real-time inference, they should still be read with low latency.

Collecting raw data for batch feature computing doesn’t have the sub-minute reflection time requirement that PIT features have, which makes it feasible to buffer the events longer and transform metrics in batch. This solution utilized Amazon Kinesis Data Firehose, a managed service for quickly ingesting streaming data into several destinations, including Amazon Simple Storage Service (Amazon S3) for persisting metrics to the S3 data lake to be utilized in offline computations. Kinesis Data Firehose provides an event buffer and Lambda integration to easily collect, batch transform, and persist these metrics to Amazon S3 to be used later by the batch feature computation. Batch feature computations don’t have the same low-latency read/write requirements as PIT features, which makes Amazon S3 the better choice because it provides low-cost, durable storage for these large volumes of business metrics.

Our initial ML model uses 21 batch features computed daily using data captured in the past 2 months. This data includes both playback and app engagement history per user, and grows with the number of users and frequency of app usage. Feature engineering at this scale requires an automated process to pull the required input data, process it in parallel, and export the outcome to persistent storage. The processing infrastructure is needed only for the duration of the computations. SageMaker Processing provides prebuilt Docker images that include Apache Spark and other dependencies needed to run distributed data processing jobs at large scale. The underlying infrastructure for a Processing job is fully managed by SageMaker. Cluster resources are provisioned for the duration of your job, and cleaned up when a job is complete.
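The following is a minimal sketch of launching such a job with the SageMaker Python SDK; the script name, S3 paths, IAM role, and instance settings are placeholders rather than the actual Amp configuration:

```python
# Sketch of launching a distributed batch feature computation with SageMaker
# Processing and the prebuilt Spark container. Bucket names, the script, and
# the IAM role are placeholders for illustration.
from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="amp-batch-features",
    framework_version="3.1",          # Spark version of the prebuilt image
    role="arn:aws:iam::111122223333:role/SageMakerProcessingRole",
    instance_type="ml.m5.4xlarge",
    instance_count=10,                # cluster exists only for the job's duration
)

spark_processor.run(
    submit_app="compute_batch_features.py",   # PySpark feature engineering script
    arguments=[
        "--input-path", "s3://amp-data-lake/metrics/",
        "--output-path", "s3://amp-features/batch/",
        "--lookback-days", "60",
    ],
)
```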

Each step in the batch process—data gathering, feature engineering, feature persistence—is part of a workflow that requires error handling, retries, and state transitions in between. With AWS Step Functions, you can create a state machine and split your workflow into several steps of preprocessing and postprocessing, as well as a step to persist the features into SageMaker Feature Store or the other data to Amazon S3. A state machine in Step Functions can be triggered via Amazon EventBridge to automate batch computing to run at a set schedule, such as once every day at 10:00 PM UTC.
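As a sketch of the scheduling piece, the following boto3 calls create an EventBridge rule that starts a hypothetical batch feature state machine every day at 10:00 PM UTC, assuming the state machine and an IAM role that allows EventBridge to start it already exist:

```python
# Sketch of scheduling the batch feature state machine with Amazon EventBridge;
# the rule name, state machine ARN, and role ARN are placeholders.
import boto3

events = boto3.client("events")

# Run the batch feature workflow once a day at 22:00 UTC
events.put_rule(
    Name="amp-batch-feature-schedule",
    ScheduleExpression="cron(0 22 * * ? *)",
    State="ENABLED",
)

events.put_targets(
    Rule="amp-batch-feature-schedule",
    Targets=[{
        "Id": "batch-feature-state-machine",
        "Arn": "arn:aws:states:us-east-1:111122223333:stateMachine:AmpBatchFeatures",
        "RoleArn": "arn:aws:iam::111122223333:role/EventBridgeInvokeStepFunctions",
    }],
)
```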

After the features are computed, they need to be versioned and stored to be read during inference as well as model retraining. Rather than build your own feature storage and management service, you can use SageMaker Feature Store. Feature Store is a fully managed, purpose-built repository to store, share, and manage features for ML models. It stores the history of ML features in the offline store (Amazon S3) and also provides APIs to an online store to allow low-latency reads of the most recent features. The offline store can serve the historical data for further model training and experimentation, and the online store can be called by your customer-facing APIs to get features for real-time inference. As we evolve our services to provide more personalized content, we anticipate training additional ML models and, with the help of Feature Store, searching for, discovering, and reusing features among these models.
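The following is a minimal sketch of this pattern using the SageMaker Python SDK and boto3, assuming a feature group (the name and schema here are hypothetical) has already been created with an online store enabled:

```python
# Sketch of writing batch features to SageMaker Feature Store and reading the
# latest values back at inference time. The feature group name and schema are
# assumptions for illustration; the group must already exist with this schema.
import boto3
import pandas as pd
from sagemaker import Session
from sagemaker.feature_store.feature_group import FeatureGroup

# Offline + online ingestion of the daily batch features
feature_group = FeatureGroup(name="amp-user-affinity-features",
                             sagemaker_session=Session())
daily_features = pd.DataFrame([{
    "user_id": "user-12345",
    "topic_affinity_pop": 0.82,
    "creator_affinity_score": 0.65,
    "event_time": "2022-06-01T22:00:00Z",
}])
feature_group.ingest(data_frame=daily_features, max_workers=4, wait=True)

# Low-latency read from the online store during real-time inference
featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")
record = featurestore_runtime.get_record(
    FeatureGroupName="amp-user-affinity-features",
    RecordIdentifierValueAsString="user-12345",
)
features = {f["FeatureName"]: f["ValueAsString"] for f in record["Record"]}
```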

Real-time inference

Real-time inference usually requires hosting ML models behind endpoints. You could do this using web servers or containers, but this requires ML engineering effort and infrastructure to manage and maintain. SageMaker makes it easy to deploy ML models to real-time endpoints. SageMaker lets you train and upload ML models and host them by creating and configuring SageMaker endpoints. Real-time inference satisfies the low-latency requirements for ranking shows as they are browsed on the Amp home page.
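The following is a minimal sketch of deploying a trained model to a SageMaker real-time endpoint with the SageMaker Python SDK; the container image, model artifact location, role, and endpoint name are placeholders:

```python
# Sketch of hosting a trained model on a SageMaker real-time endpoint,
# assuming the model artifact has already been uploaded to Amazon S3 and a
# compatible inference container image is available in Amazon ECR.
from sagemaker.model import Model

model = Model(
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/amp-recommender:latest",
    model_data="s3://amp-models/show-recommender/model.tar.gz",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
)

predictor = model.deploy(
    initial_instance_count=2,
    instance_type="ml.c5.xlarge",
    endpoint_name="amp-show-recommender",
)
```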

In addition to managed hosting, SageMaker provides managed endpoint scaling. SageMaker inference allows you to define an auto scaling policy with minimum and maximum instance counts and a target utilization to trigger the scaling. This way, you can easily scale in or out as demand changes.
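A sketch of such a policy with boto3 follows; the endpoint and variant names, capacity bounds, and target value are illustrative rather than Amp’s production settings:

```python
# Sketch of attaching a target tracking auto scaling policy to the endpoint's
# production variant.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/amp-show-recommender/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,
    MaxCapacity=10,
)

autoscaling.put_scaling_policy(
    PolicyName="amp-recommender-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,   # invocations per instance before scaling out
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
    },
)
```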

Operational health

The number of events this system handles for real-time feature computing changes with the natural pattern of app usage (higher or lower traffic based on the time of day or day of the week). Similarly, the number of requests it receives for real-time inference scales with the number of concurrent app users. These services also get unexpected peaks in traffic due to self-promotion on social media by popular creators. Although it’s important to ensure the system can scale up and down to serve the incoming traffic successfully and frugally, it’s also important to monitor operational metrics and alert on any unexpected operational issues to prevent loss of data and service to customers. Monitoring the health of these services is straightforward using Amazon CloudWatch. Vital service health metrics such as faults and latency of operations, as well as utilization metrics such as memory, disk, and CPU usage, are available out of the box using CloudWatch. Our development team uses metrics dashboards and automated monitoring to ensure we can serve our clients with high availability (99.8%) and low latency (less than 200 milliseconds end-to-end to get recommended shows per user).
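As an illustration, the following boto3 call creates a CloudWatch alarm on the endpoint’s model latency; the alarm name, threshold, and SNS topic are assumptions for the sketch:

```python
# Sketch of a CloudWatch alarm on SageMaker endpoint latency, assuming an SNS
# topic for notifications already exists; thresholds here are illustrative.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="amp-recommender-high-latency",
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": "amp-show-recommender"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Average",
    Period=60,                       # evaluate one-minute windows
    EvaluationPeriods=3,
    Threshold=200_000.0,             # ModelLatency is reported in microseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:amp-ops-alerts"],
)
```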

Measuring the outcome

Prior to the ML-based show recommender described in this post, a simpler heuristic algorithm ranked Amp shows based on a user’s personal topics of interest that are self-reported on their profile. We set up an A/B test to measure the impact of switching to an ML-based recommender that uses data from a user’s past app interactions. We identified improvements in metrics such as listening duration and the number of engagement actions (liking a show, following a show creator, turning on notifications) as indicators of success. A/B testing with 50% of users receiving show recommendations ranked for them via the ML-based recommender has shown a 3% boost to customer engagement metrics and a 0.5% improvement in playback duration.

Conclusion

With purpose-built services, the Amp team was able to release the personalized show recommendation API as described in this post to production in under 3 months. The system also scales well for the unpredictable loads created by well-known show hosts or marketing campaigns that could generate an influx of users. The solution uses managed services for processing, training, and hosting, which helps reduce the time spent on day-to-day maintenance of the system. We’re also able to monitor all these managed services via CloudWatch to ensure the continued health of the systems in production.

A/B testing the first version of Amp’s ML-based recommender against a rule-based approach (which sorts shows by a customer’s topics of interest only) has shown that the ML-based recommender exposes customers to higher-quality content from more diverse topics, which results in a higher number of follows and enabled notifications. The Amp team is continuously working towards improving the models to provide highly relevant recommendations.

For more information about Feature Store, visit Amazon SageMaker Feature Store and check out other customer use cases in the AWS Machine Learning Blog.


About the authors

Tulip Gupta is a Solutions Architect at Amazon Web Services. She works with Amazon to design, build, and deploy technology solutions on AWS. She assists customers in adopting best practices while deploying solutions on AWS, and is an Analytics and ML enthusiast. In her spare time, she enjoys swimming, hiking, and playing board games.

David Kuo is a Solutions Architect at Amazon Web Services. He works with AWS customers to design, build and deploy technology solutions on AWS. He works with Media and Entertainment customers and has interests in machine learning technologies. In his spare time, he wonders what he should do with his spare time.

Manolya McCormick is a Sr Software Development Engineer for Amp on Amazon. She designs and builds distributed systems using AWS to serve customer facing applications. She enjoys reading and cooking new recipes in her spare time.

Jeff Christophersen is a Sr. Data Engineer for Amp on Amazon. He works to design, build, and deploy Big Data solutions on AWS that drive actionable insights. He assists internal teams in adopting scalable and automated solutions, and is an Analytics and Big Data enthusiast. In his spare time, when he is not on a pair of skis you can find him on his mountain bike.

Read More

How Amp on Amazon used data to increase customer engagement, Part 1: Building a data analytics platform

Amp, the new live radio app from Amazon, is a reinvention of radio featuring human-curated live audio shows. It’s designed to provide a seamless customer experience to listeners and creators by debuting interactive live audio shows from your favorite artists, radio DJs, podcasters, and friends.

However, as a new product in a new space for Amazon, Amp needed more relevant data to inform their decision-making process. Amp wanted a scalable data and analytics platform to enable easy access to data and perform machine learning (ML) experiments for live audio transcription, content moderation, feature engineering, and a personal show recommendation service, and to inspect or measure business KPIs and metrics.

This post is the first in a two-part series. Part 1 shows how data was collected and processed using the data and analytics platform, and Part 2 shows how the data was used to create show recommendations using Amazon SageMaker, a fully managed ML service. The personalized show recommendation list service has shown a 3% boost to customer engagement metrics tracked (such as liking a show, following a creator, or enabling upcoming show notifications) since its launch in May 2022.

Solution overview

The data sources for Amp can be broadly categorized as either streaming (near-real time) or batch (point in time). The source data is emitted from Amp-owned systems or other Amazon systems. The two different data types are as follows:

  • Streaming data – This type of data mainly consists of follows, notifications (regarding users’ friends, favorite creators, or shows), activity updates, live show interactions (call-ins, co-hosts, polls, in-app chat), real-time updates on live show activities (live listen count, likes), live audio playback metrics, and other clickstream metrics from the Amp application. Amp stakeholders require this data to power ML processes or predictive models, content moderation tools, and product and program dashboards (for example, trending shows). Streaming data enables Amp customers to conduct and measure experimentation.
  • Batch data – This data mainly consists of catalog data, show or creator metadata, and user profile data. Batch data enables more point-in-time reporting and analytics vs. real-time.

The following diagram illustrates the high-level architecture.

The Amp data and analytics platform can be broken down into three high-level systems:

  • Streaming data ingestion, stream processing and transformation, and stream storage
  • Batch data ingestion, batch processing and transformation, and batch storage
  • Business intelligence (BI) and analytics

In the following sections, we discuss each component in more detail.

Streaming data ingestion, processing, transformation, and storage

Amp created a serverless streaming ingestion pipeline capable of tapping into data from sources without the need for infrastructure management, as shown in the following diagram.

The pipeline was able to ingest the Amp show catalog data (what shows are available on Amp) and pass it to the data lake for two different use cases: one for near-real-time analytics, and one for batch analytics.

As part of the ingestion pipeline, the Amp team has an Amazon Simple Queue Service (Amazon SQS) queue that receives messages from an upstream Amazon Simple Notification Service (Amazon SNS) topic that contains information on changes to shows in the catalog. These changes could be the addition of new shows or adjustments to existing ones that have been scheduled.

When the message is received by the SQS queue, it triggers the AWS Lambda function to make an API call to the Amp catalog service. The Lambda function retrieves the desired show metadata, filters the metadata, and then sends the output metadata to Amazon Kinesis Data Streams. Amazon Kinesis Data Firehose receives the records from the data stream. Kinesis Data Firehose then invokes a secondary Lambda function to perform a data transformation that flattens the JSON records received and writes the transformed records to an Amazon Simple Storage Service (Amazon S3) data lake for consumption by Amp stakeholders.
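The following is a minimal sketch of such a Firehose transformation Lambda function; the nested record fields being flattened are hypothetical, not the actual Amp catalog schema:

```python
# Sketch of the secondary Lambda function that Kinesis Data Firehose invokes to
# flatten nested JSON show records before they are written to the S3 data lake.
import base64
import json

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Flatten the nested show metadata into a single-level record
        flattened = {
            "show_id": payload["show"]["id"],
            "title": payload["show"]["title"],
            "creator_id": payload["show"]["creator"]["id"],
            "status": payload["status"],
            "updated_at": payload["updatedAt"],
        }

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(flattened) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })

    return {"records": output}
```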

Kinesis Data Firehose enabled buffering and writing data to Amazon S3 every 60 seconds. This helped Amp teams make near-real-time programming decisions that impacted external customers.

The streaming ingestion pipeline supported the goals of performance, availability, scalability, and the flexibility to send data to multiple downstream applications or services. The services used contributed to these goals as follows:

  • Kinesis Data Streams handles streaming data ingestion when necessary. Kinesis Data Streams supported these objectives by enabling the Amp team to quickly ingest data for analytics with minimal operational load. As a fully managed service, it reduced operational overhead, and Amp was able to scale with the product needs.
  • Lambda enabled the team to create lightweight functions to run API calls and perform data transformations.
  • Because Kinesis Data Firehose is a managed service, it was able to handle all the scaling, sharding, and monitoring needs of the streaming data without any additional overhead for the team.

Batch data ingestion, processing, transformation, and storage

Amp created a transient batch (point in time) ingestion pipeline capable of data ingestion, processing and transformation, and storage, as shown in the following diagram.

A transient extract, transform, and load (ETL) and extract, load, and transform (ELT) job approach was implemented because of the batch nature of these workloads and unknown data volumes. As a part of the workflow automation, Amazon SQS was used to trigger a Lambda function. The Lambda function then activated the AWS Glue crawler to infer the schema and data types. The crawler wrote the schema metadata to the AWS Glue Data Catalog, providing a unified metadata store for data sharing.

The ETL and ELT jobs were required to run on either a set schedule or an event-driven workflow. To handle these needs, Amp used Amazon Managed Workflows for Apache Airflow (Amazon MWAA). Apache Airflow is an open-source Python-based workflow management platform. Amazon MWAA is a fully managed service that automatically handles scaling. It provides sequencing, error handling, retry logic, and state. With Amazon MWAA, Amp was able to take advantage of the benefits of Airflow for job orchestration without having to manage or maintain dedicated Airflow servers. Additionally, by using Amazon MWAA, Amp was able to create a code repository and workflow pipeline stored in Amazon S3 that Amazon MWAA could access. The pipeline allowed Amp data engineers to easily deploy Airflow DAGs or PySpark scripts across multiple environments.
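The following is a minimal sketch of a daily DAG that could run on Amazon MWAA; the DAG ID, schedule, and task contents are placeholders:

```python
# Minimal sketch of an Airflow DAG that could run on Amazon MWAA to orchestrate
# a daily ETL job; task contents and names are placeholders for illustration.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_etl(**context):
    # Placeholder: trigger the Spark job (for example, an EMR on EKS job like
    # the one sketched below) and wait for it to complete.
    ...

with DAG(
    dag_id="amp_daily_catalog_etl",
    start_date=datetime(2022, 5, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    etl_task = PythonOperator(task_id="run_etl", python_callable=run_etl)
```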

Amp used Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS) to configure and manage containers for their data processing and transformation jobs. Due to the unique nature of the Amp service, the initial expected data volumes that would be processed were relatively unknown. To provide flexibility as the service evolved, the team decided to go with Amazon EMR on EKS to eliminate any unnecessary operational overhead required to bootstrap and scale Amazon EMR for data processing. This approach allowed them to run transient hybrid EMR clusters backed by a mix of AWS Fargate and Amazon Elastic Compute Cloud (Amazon EC2) nodes, where all system tasks and workloads were offloaded to Fargate, while Amazon EC2 handled all the Apache Spark processing and transformation. This provided the flexibility to have a cluster with one node running, while the Amazon EKS auto scaler dynamically instantiated and bootstrapped any additional EC2 nodes that were required for the job. When the job was complete, they were automatically deleted by the cluster auto scaler. This pattern eliminated the need for the team to manage any of the cluster bootstrap actions or scaling required to respond to evolving workloads.
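The following is a minimal sketch of submitting a transient Spark job to EMR on EKS with boto3; the virtual cluster ID, execution role, script location, and Spark settings are placeholders:

```python
# Sketch of submitting a transient Spark job run to an EMR on EKS virtual cluster.
import boto3

emr_containers = boto3.client("emr-containers")

response = emr_containers.start_job_run(
    name="amp-catalog-transform",
    virtualClusterId="abcd1234example",
    executionRoleArn="arn:aws:iam::111122223333:role/EMRContainersJobRole",
    releaseLabel="emr-6.5.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://amp-code/transform_catalog.py",
            "sparkSubmitParameters": "--conf spark.executor.instances=4 "
                                     "--conf spark.executor.memory=4G",
        }
    },
)
print(response["id"])   # job run ID for tracking the transient job
```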

Amazon S3 was used as the central data lake, and data was stored in Apache Parquet (Parquet) format. Parquet is a columnar format, which speeds up data retrieval and provides efficient data compression. Amazon S3 provided the flexibility, scalability, and security needs for Amp. With Amazon S3, the Amp team was able to centralize data storage in one location and federate access to the data virtually across any service or tool within or outside of AWS. The data lake was split into two S3 buckets: one for raw data ingestion and one for transformed data output. Amazon EMR performed the transformation from raw data to transformed data. With Amazon S3 as the central data lake, Amp was able to securely expose and share the data with other teams across Amp and Amazon.

To simplify data definition, table access provisioning, and the addition and removal of tables, they used AWS Glue crawlers and the AWS Glue Data Catalog. Because Amp is a new service and constantly evolving, the team needed a way to easily define, access, and manage the tables in the data lake. The crawlers handled data definition (including schema changes) and the addition and removal of tables, while the Data Catalog served as a unified metadata store.

Business intelligence and analytics

The following diagram illustrates the architecture for the BI and analytics component.

Amp chose to store the data in the S3 data lake, and not in the data warehouse. This enabled them to access it in a unified manner through the AWS Glue Data Catalog and provided greater flexibility for data consumers. This resulted in faster data access across a variety of services or tools. With data being stored in Amazon S3, it also reduced data warehouse infrastructure costs, because the costs are a function of the compute type and the amount of data stored.

The Amazon Redshift RA3 node type was used as the compute layer to enable stakeholders to query data stored in Amazon S3. Amazon Redshift RA3 nodes decouple storage and compute, and are designed for an access pattern through the AWS Glue Data Catalog. RA3 nodes introduce Amazon Redshift Managed Storage, which is Amazon S3 backed. The combination of these features enabled Amp to right-size the clusters and provide better query performance for their customers while minimizing costs.

Amazon Redshift configuration was automated using a Lambda function, which connected to a given cluster and ran parameterized SQL statements. The SQL statements contained the logic to deploy schemas, user groups, and users, while AWS Secrets Manager was used to automatically generate, store, and rotate Amazon Redshift user passwords. The underlying configuration variables were stored in Amazon DynamoDB. The Lambda function retrieved the variables and requested temporary Amazon Redshift credentials to perform the configuration. This process enabled the Amp team to set up Amazon Redshift clusters in a consistent manner.
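The following is a simplified sketch of this configuration pattern, using the Amazon Redshift Data API (which issues temporary credentials for the specified database user) rather than a direct database connection; the DynamoDB table, configuration shape, and SQL are illustrative assumptions, not the actual Amp implementation:

```python
# Sketch of the configuration Lambda pattern: read configuration variables from
# DynamoDB and run SQL against the cluster through the Redshift Data API, which
# handles temporary credentials for the given database user.
import boto3

dynamodb = boto3.resource("dynamodb")
redshift_data = boto3.client("redshift-data")

def handler(event, context):
    # Configuration variables (schemas, user groups) stored in DynamoDB
    config = dynamodb.Table("amp-redshift-config").get_item(
        Key={"cluster_id": event["cluster_id"]}
    )["Item"]

    for schema in config["schemas"]:
        redshift_data.execute_statement(
            ClusterIdentifier=event["cluster_id"],
            Database=config["database"],
            DbUser=config["admin_user"],   # temporary credentials issued for this user
            Sql=f"CREATE SCHEMA IF NOT EXISTS {schema}",
        )
```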

Business outcomes

Amp was able to achieve the following business outcomes:

  • Business reporting – Standard reporting required to run the business, such as daily flash reports, aggregated business review mechanisms, or project and program updates.
  • Product reporting – Specific reporting required to enable the inspection or measurement of key product KPIs and metrics. This included visual reports through dashboards such as marketing promotion effectiveness, app engagement metrics, and trending shows.
  • ML experimentation – Enabled downstream Amazon teams to use this data to support experimentation or generate predictions and recommendations. For example, ML experimentations like a personalized show recommendation list, show categorization, and content moderation helped with Amp’s user retention.

Key benefits

By implementing a scalable, cost-efficient architecture, Amp was able to achieve the following:

  • Limited operational complexity – They built a flexible system that used AWS managed services wherever possible.
  • Use the languages of data – Amp was able to support the two most common data manipulation languages, Python and SQL, to perform platform operations, conduct ML experiments, and generate analytics. With this support, developers at Amp were able to use languages they were familiar with.
  • Enable experimentation and measurement – Amp allowed developers to quickly generate the datasets needed to conduct experiments and measure the results. This helps in optimizing the Amp customer experience.
  • Build to learn but design to scale – Amp is a new product that is finding its market fit, and the team was able to focus their initial energy on building just enough features to get feedback. This enabled them to pivot toward the right product-market fit with each launch. They were able to build incrementally, but plan for the long term.

Conclusion

In this post, we saw how Amp created their data analytics platform using user behavioral data from streaming and batch data sources. The key factors that drove the implementation were the need to provide a flexible, scalable, cost-efficient, and effort-efficient data analytics platform. Design choices were made by evaluating various AWS services.

Part 2 of this series shows how we used this data and built out the personalized show recommendation list using SageMaker.

As next steps, we recommend doing a deep dive into each stage of your data pipeline system and making design choices that would be cost-effective and scalable for your needs. For more information, you can also check out other customer use cases in the AWS Analytics Blog.

If you have feedback about this post, submit it in the comments section.


About the authors

Tulip Gupta is a Solutions Architect at Amazon Web Services. She works with Amazon to design, build, and deploy technology solutions on AWS. She assists customers in adopting best practices while deploying solutions on AWS, and is an Analytics and ML enthusiast. In her spare time, she enjoys swimming, hiking, and playing board games.

David Kuo is a Solutions Architect at Amazon Web Services. He works with AWS customers to design, build and deploy technology solutions on AWS. He works with Media and Entertainment customers and has interests in machine learning technologies. In his spare time, he wonders what he should do with his spare time.

Manolya McCormick is a Sr Software Development Engineer for Amp on Amazon. She designs and builds distributed systems using AWS to serve customer facing applications. She enjoys reading and cooking new recipes in her spare time.

Jeff Christophersen is a Sr. Data Engineer for Amp on Amazon. He works to design, build, and deploy Big Data solutions on AWS that drive actionable insights. He assists internal teams in adopting scalable and automated solutions, and is an Analytics and Big Data enthusiast. In his spare time, when he is not on a pair of skis you can find him on his mountain bike.

Read More

Build repeatable, secure, and extensible end-to-end machine learning workflows using Kubeflow on AWS

This is a guest blog post cowritten with athenahealth.

athenahealth is a leading provider of network-enabled software and services for medical groups and health systems nationwide. Its electronic health records, revenue cycle management, and patient engagement tools allow anytime, anywhere access, driving better financial outcomes for its customers and enabling its provider customers to deliver better quality care.

In the artificial intelligence (AI) space, athenahealth uses data science and machine learning (ML) to accelerate business processes and provide recommendations, predictions, and insights across multiple services. From its first implementation in automated document services, touchlessly processing millions of provider-patient documents, to its more recent work in virtual assistants and improving revenue cycle performance, athenahealth continues to apply AI to help drive efficiency, service capabilities, and better outcomes for providers and their patients.

This blog post demonstrates how athenahealth uses Kubeflow on AWS (an AWS-specific distribution of Kubeflow) to build and streamline an end-to-end data science workflow that preserves essential tooling, optimizes operational efficiency, increases data scientist productivity, and sets the stage for extending their ML capabilities more easily.

Kubeflow is the open-source ML platform dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable. Kubeflow achieves this by incorporating relevant open-source tools that integrate well with Kubernetes. Some of these projects include Argo for pipeline orchestration, Istio for service mesh, Jupyter for notebooks, Spark, TensorBoard, and Katib. Kubeflow Pipelines helps build and deploy portable, scalable ML workflows that can include steps like data extraction, preprocessing, model training, and model evaluation in the form of repeatable pipelines.

AWS is contributing to the open-source Kubeflow community by providing its own Kubeflow distribution (called Kubeflow on AWS) that helps organizations like athenahealth build highly reliable, secure, portable, and scalable ML workflows with reduced operational overhead through integration with AWS managed services. AWS provides various Kubeflow deployment options like deployment with Amazon Cognito, deployment with Amazon Relational Database Service (Amazon RDS) and Amazon Simple Storage Service (Amazon S3), and vanilla deployment. For details on service integration and available add-ons for each of these options, refer to Deployment.

Today, Kubeflow on AWS provides a clear path to using Kubeflow, augmented with AWS managed services such as Amazon EKS, Amazon RDS, Amazon S3, and Amazon Cognito.

Many AWS customers are taking advantage of the Kubeflow on AWS distribution, including athenahealth.

Here, the athenahealth MLOps team discusses the challenges they encountered and the solutions they created in their Kubeflow journey.

Challenges with the previous ML environment

Prior to our adoption of Kubeflow on AWS, our data scientists used a standardized set of tools and a process that allowed flexibility in the technology and workflow used to train a given model. Example components of the standardized tooling include a data ingestion API, security scanning tools, the CI/CD pipeline built and maintained by another team within athenahealth, and a common serving platform built and maintained by the MLOps team. However, as our use of AI and ML matured, the variety of tools and infrastructure created for each model grew. Although we were still able to support the existing process, we saw the following challenges on the horizon:

  • Maintenance and growth – Reproducing and maintaining model training environments took more effort as the number of deployed models increased. Each project maintained detailed documentation that outlined how each script was used to build the final model. In many cases, this was an elaborate process involving 5 to 10 scripts with several outputs each. These had to be manually tracked with detailed instructions on how each output would be used in subsequent processes. Maintaining this over time became cumbersome. Moreover, as the projects became more complex, the number of tools also increased. For example, most models utilized Spark and TensorFlow with GPUs, which required a larger variety of environment configurations. Over time, users would switch to newer versions of tools in their development environments but then couldn’t run older scripts when those versions became incompatible. Consequently, maintaining and augmenting older projects required more engineering time and effort. In addition, as new data scientists joined the team, knowledge transfers and onboarding took more time, because synchronizing local environments included many undocumented dependencies. Switching between projects faced the same issues because each model had its own workflows.
  • Security – We take security seriously, and therefore prioritize compliance with all contractual, legal, and regulatory obligations associated with ML and data science. Data must be utilized, stored, and accessed in specific ways, and we have embedded robust processes to ensure our practices comply with our legal obligations as well as align with industry best practices. Prior to Kubeflow adoption, ensuring that data was stored and accessed in a specific way involved regular verification across multiple, diverse workflows. We knew that we could improve efficiencies by consolidating these diverse workflows onto a single platform. However, that platform would need to be flexible enough to integrate well with our standardized tooling.
  • Operations – We also saw an opportunity to increase operational efficiency and management through centralizing the logging and monitoring of the workflows. Because each team had developed their own tools, we collected this information from each workflow individually and aggregated them.

The data science team evaluated various solutions for consolidating the workflows. In addition to addressing these requirements, we looked for a solution that would integrate seamlessly with the existing standardized infrastructure and tools. We selected Amazon EKS and Kubeflow on AWS as our workflow solution.

The data scientist development cycle incorporating Kubeflow

A data science project begins with a clean slate: no data, no code, only the business problem that can be solved with ML. The first task is a proof of concept (POC) to discover if the data holds enough signal to make an ML model effective at solving the business problem, starting with querying for the raw dataset from our Snowflake data warehouse. This stage is iterative, and the data scientists use Kubernetes pods or Kubeflow Jupyter notebooks during this process.

Our Kubeflow cluster uses the Karpenter cluster autoscaler, which makes spinning up resources easy for data scientists because they only need to focus on defining the desired instance types, while the provisioning work is done by a set of predefined Karpenter provisioners. We have separate provisioners for CPU and GPU instance types, and all the instances supported by Amazon EKS fall in one of these two categories as per our provisioner configuration. The data scientists choose instance types using node selectors, and Karpenter takes care of node lifecycle management.

After the query is developed, the data scientists extract the raw data to a location on Amazon S3, then launch a Jupyter notebook from the AWS Kubeflow UI to explore the data. The goal is to create the feature set that will be used to train the first model. This allows the data scientists to determine if there is enough signal in the data to fulfill the customer’s business need.

After the results are satisfactory, the data scientists move to the next stage of the development cycle and turn their discoveries into a robust pipeline. They convert the POC code into production-quality code that runs at scale. To ensure compliance through using approved libraries, a container is created with the appropriate base Docker image. For our data scientists, we have found that providing a standard Python, TensorFlow, and Spark base image gives sufficient flexibility for most, if not all, workloads. They can then use the Dockerfile of their component to further customize their development environment. This Dockerfile is then utilized by the CI/CD process to build the components image that will be used in production, therefore maintaining consistency between development and production environments.

We have a tool that gives data scientists the ability to launch their development environment in a pod running on Kubernetes. When this pod is running, the data scientists can then attach the Visual Studio Code IDE directly to the pod and debug their model code. After they have the code running successfully, they can then push their changes to git and a new development environment is created with the most recent changes.

The standard data science pipeline consists of stages that include extraction, preprocessing, training, and evaluation. Each stage in the pipeline appears as a component in Kubeflow, which consists of a Kubernetes pod that runs a command with some information passed in as parameters. These parameters can either be static values or references to output from a previous component. The Docker image used in the pod is built from the CI/CD process. Details on this process appear in the CI/CD workflow discussed in the next section.
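The following is a minimal sketch of how such stages can be expressed with the Kubeflow Pipelines (KFP) v1 SDK; the component images, commands, and parameters are placeholders rather than athenahealth’s actual pipeline definition:

```python
# Minimal sketch of mapping pipeline stages to Kubeflow components with the
# KFP v1 SDK; image URIs, commands, and parameters are illustrative only.
import kfp
from kfp import dsl

@dsl.pipeline(name="training-pipeline",
              description="extract -> preprocess -> train")
def training_pipeline(dataset_uri: str, model_name: str):
    extract = dsl.ContainerOp(
        name="extract",
        image="registry.example.com/extract:ci-build-123",   # image built by CI/CD
        command=["python", "extract.py"],
        arguments=["--dataset-uri", dataset_uri],
        file_outputs={"raw_data": "/tmp/raw_data_path.txt"},
    )

    preprocess = dsl.ContainerOp(
        name="preprocess",
        image="registry.example.com/preprocess:ci-build-123",
        command=["python", "preprocess.py"],
        # A reference to the previous component's output becomes this step's input
        arguments=["--raw-data", extract.outputs["raw_data"]],
        file_outputs={"features": "/tmp/features_path.txt"},
    )

    dsl.ContainerOp(
        name="train",
        image="registry.example.com/train:ci-build-123",
        command=["python", "train.py"],
        arguments=["--features", preprocess.outputs["features"],
                   "--model-name", model_name],
    )

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```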

Development Cycle on Kubeflow. The development workflow starts on the left with the POC. The completed model is deployed to the athenahealth model serving platform running on Amazon ECS.

CI/CD process supporting automated workflows

As part of our CI/CD process, we use Jenkins to build and test all Kubeflow component images in parallel. On successful completion, the pipeline component template contains reference pointers to the images, and the resulting pipeline is uploaded to Kubeflow. Parameters in the Jenkins pipeline allow users to launch the pipelines and run their model training tests after successful builds.

Alternatively, to maintain a short development cycle, data scientists can also launch the pipeline from their local machine, modifying any pipeline parameters they may be experimenting with.

Tooling exists to ensure the reference pointers from the CI/CD build are utilized by default. If there is a deployable artifact in the repo, then the CI/CD logic will continue to deploy the artifact to the athenahealth model serving platform (the Prediction Service) running on Amazon ECS with AWS Fargate. After all these stages have passed, the data scientist merges the code to the primary branch. The pipelines and deployable artifacts are then pushed to production.

CI/CD Deployment workflow. This diagram describes the Data Science build and deployment workflow. The CI/CD process is driven by Jenkins.

Security

In consolidating our data science workflows, we were able to centralize our approach to securing the training pipeline. In this section, we discuss our approach to data and cluster security.

Data security

Data security is of the utmost importance at athenahealth. For this reason, we develop and maintain infrastructure that is fully compliant with the regulations and standards that protect the security and integrity of these data.

To ensure we meet data compliance standards, we provision our AWS infrastructure in accordance with our athenahealth enterprise guidelines. The two main stores for data are Amazon RDS for highly scalable pipeline metadata and Amazon S3 for pipeline and model artifacts. For Amazon S3, we ensure the buckets are encrypted, HTTPS endpoints are enforced, and the bucket policies and AWS Identity and Access Management (IAM) roles follow the principles of least privilege when permitting access to the data. This is true for Amazon RDS data as well: encryption is always enabled, and the security groups and credential access follow the principle of least privilege. This standardization ensures that only authorized parties have access to the data, and this access is tracked.

In addition to these measures, the platform also undergoes security threat assessments and continuous security and compliance scans.

We also address data retention requirements via data lifecycle management for all S3 buckets that contain sensitive data. This policy automatically moves data to Amazon S3 Glacier after 30 days of creation. Exceptions to this are managed through data retrieval requests and are approved or denied on a case-by-case basis. This ensures that all workflows comply with the data retention policy. This also solves the problem with recovering data if a model performs poorly, and retraining is required, or when a new model must be evaluated against a historical iteration of an older model’s dataset.
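A minimal sketch of such a lifecycle policy with boto3 follows; the bucket name and prefix are placeholders:

```python
# Sketch of the data retention policy described above: transition objects to
# Amazon S3 Glacier 30 days after creation.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="ml-pipeline-artifacts",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-sensitive-data-after-30-days",
            "Filter": {"Prefix": "datasets/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }]
    },
)
```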

For restricting access to Amazon S3 and Amazon RDS from within Kubeflow on AWS and Amazon EKS, we use IRSA (IAM Roles for Service Accounts), which provides IAM-based permission provisioning for resources within Kubernetes. Each tenant in Kubeflow has a unique pre-created service account which we bind to an IAM role created specifically to fulfill the tenant access requirements. User access to tenants is also restricted using the Amazon Cognito user pools group membership for each user. When a user is authenticated to the cluster, the generated token contains group claims, and Kubernetes RBAC uses this information to allow or deny access to a particular resource in the cluster. This setup is explained in more detail in the next section.

Cluster security using multi-user isolation

As we noted in the previous section, data scientists perform exploratory data analyses, run data analytics, and train ML models. To allocate resources, organize data, and manage workflows based on projects, Kubeflow on AWS provides isolation based on Kubernetes namespaces. This isolation works for interacting with the Kubeflow UI; however, it doesn’t provide any tooling to control access to the Kubernetes API using Kubectl. This means that user access can be controlled on the Kubeflow UI but not over the Kubernetes API via Kubectl.

The architecture described in the following diagram addresses this issue by unifying access to projects in Kubeflow based on group memberships. To achieve this, we took advantage of the Kubeflow on AWS manifests, which have integration with Amazon Cognito user pools. In addition, we use Kubernetes role-based access control (RBAC) to control authorization within the cluster. The user permissions are provisioned based on Amazon Cognito group membership. This information is passed to the cluster with the token generated by the OIDC client. This process is simplified thanks to the built-in Amazon EKS functionality that allows associating OIDC identity providers to authenticate with the cluster.

By default, Amazon EKS authentication is performed by the IAM authenticator, which is a tool that enables authenticating with an EKS cluster using IAM credentials. This authentication method has its merits; however, it’s not suitable for our use case because athenahealth uses Microsoft Azure Active Directory for identity service across the organization.

Kubernetes namespace isolation. Data Scientists can obtain membership to a single or multiple groups as needed for their work. Access is reviewed on a regular basis and removed as appropriate.

Azure Active Directory, being an enterprise-wide identity service, is the source of truth for controlling user access to the Kubeflow cluster. The setup for this includes creating an Azure Enterprise Application that acts as service principal and adding groups for various tenants that require access to the cluster. This setup on Azure is mirrored in Amazon Cognito by setting up a federated OIDC identity provider that outsources authentication responsibility to Azure. The access to Azure groups is controlled by SailPoint IdentityIQ, which sends access requests to the project owner to allow or deny as appropriate. In the Amazon Cognito user pool, two application clients are created: one is used to set up the authentication for the Kubernetes cluster using the OIDC identity provider, and the other to secure Kubeflow authentication into the Kubeflow UI. These clients are configured to pass group claims upon authentication with the cluster, and these group claims are used alongside RBAC to set up authorization within the cluster.

Kubernetes RBAC role bindings are set up between groups and the cluster role Kubeflow-edit, which is created upon installing Kubeflow in the cluster. This role binding ensures any user interacting with the cluster after logging in via OIDC can access the namespaces they have permissions for as defined in their group claims. Although this works for users interacting with the cluster using Kubectl, the Kubeflow UI currently doesn’t provision access to users based on group membership because it doesn’t use RBAC. Instead, it uses the Istio Authorization Policy resource to control access for users. To overcome this challenge, we developed a custom controller that synchronizes users by polling Amazon Cognito groups and adds or removes corresponding role bindings for each user rather than by group. This setup enables users to have the same level of permissions when interacting with both the Kubeflow UI and Kubectl.
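The following is a simplified sketch of the group-sync idea behind such a controller, assuming a hypothetical group-to-namespace mapping and omitting the reconciliation and error handling a real controller needs:

```python
# Sketch of syncing Amazon Cognito group membership to per-user RoleBindings
# bound to the kubeflow-edit ClusterRole. Names, the mapping, and the user
# identity attribute are simplified assumptions, not the production controller.
import boto3
from kubernetes import client, config

cognito = boto3.client("cognito-idp")
config.load_incluster_config()                 # the controller runs in the cluster
rbac = client.RbacAuthorizationV1Api()

USER_POOL_ID = "us-east-1_examplePool"
GROUP_TO_NAMESPACE = {"team-a": "team-a", "team-b": "team-b"}   # hypothetical mapping

def sync_group(group_name, namespace):
    users = cognito.list_users_in_group(
        UserPoolId=USER_POOL_ID, GroupName=group_name
    )["Users"]

    for user in users:
        email = next(a["Value"] for a in user["Attributes"] if a["Name"] == "email")
        binding = {
            "metadata": {"name": f"kubeflow-edit-{user['Username']}",
                         "namespace": namespace},
            "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                        "kind": "ClusterRole", "name": "kubeflow-edit"},
            "subjects": [{"apiGroup": "rbac.authorization.k8s.io",
                          "kind": "User", "name": email}],
        }
        # A real controller would also handle existing bindings and removals
        rbac.create_namespaced_role_binding(namespace=namespace, body=binding)

for group, ns in GROUP_TO_NAMESPACE.items():
    sync_group(group, ns)
```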

Operational efficiency

In this section, we discuss how we took advantage of the open source and AWS tools available to us to manage and debug our workflows as well as to minimize the operational impact of upgrading Kubeflow.

Logging and monitoring

For logging, we utilize FluentD to push all our container logs to Amazon OpenSearch Service and system metrics to Prometheus. We then use Kibana and the Grafana UI for searching and filtering logs and metrics. The following diagram describes how we set this up.

Kubeflow Logging. We use both the Grafana UI and Kibana to view and sift through the logs.

The following screenshot is a Kibana UI view from our pipeline.

Sample Kibana UI View. Kibana allows for customized views.

Safe Kubeflow cluster upgrades

As we onboard users to Kubeflow on AWS, we maintain a reliable and consistent user experience while allowing the MLOps team to stay agile with releasing and integrating new features. On the surface, Kustomize appears modular enough to let us work on and upgrade one component at a time without impacting others, which would allow us to add new capabilities with minimal disruption to users. However, in practice there are scenarios where the best approach is to simply spin up a new Kubernetes cluster rather than applying component-level upgrades to existing clusters. We found two use cases where it made more sense to create completely new clusters:

  • Upgrading to a new Kubernetes version. Although AWS does provide in-place cluster upgrades, it becomes difficult to test whether each of the Kubeflow and Kubernetes resources works as intended and whether the manifests retain backward compatibility.
  • Upgrading Kubeflow to a newer release in which several features have been added or modified; in this case, it’s rarely advisable to perform in-place upgrades on an existing Kubernetes cluster.

In addressing this issue, we developed a strategy that enables us to have safe cluster replacements without impacting any existing workloads. To achieve this, we had to meet the following criteria:

  • Separate the Kubeflow storage and compute resources so that the pipeline metadata, pipeline artifacts, and user data are retained when deprovisioning the older cluster
  • Integrate with Kubeflow on AWS manifests so that when a Kubeflow version upgrade occurs, minimal changes are required
  • Have an effortless way to roll back if things go wrong after cluster upgrade
  • Have a simple interface to promote a candidate cluster to production

The following diagram illustrates this architecture.

Safe Kubeflow Cluster Upgrade. Once testing of the Kubeflow Candidate is successful, it is promoted to Kubeflow Prod through an update to Route 53.

Kubeflow on AWS manifests come pre-packaged with Amazon RDS and Amazon S3 integrations. With these managed services acting as common data stores, we can set up a blue-green deployment strategy. To achieve this, we ensured that the pipeline metadata is persisted in Amazon RDS, which works independently of the EKS cluster, and the pipeline logs and artifacts are persisted in Amazon S3. In addition to pipeline metadata and artifacts, we also set up FluentD to route pod logs to Amazon OpenSearch Service.

This ensures that the storage layer is completely separated from the compute layer and thereby enables testing changes during Kubeflow version updates on a completely new EKS cluster. After all the tests are successful, we’re able to simply change the Amazon Route 53 DNS record to the candidate cluster hosting Kubeflow. Also, we keep the old cluster running as backup for a few days just in case we need to roll back.
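As an illustration, promoting the candidate cluster can be as simple as an UPSERT of the Kubeflow DNS record; the hosted zone ID, record name, and load balancer address below are placeholders:

```python
# Sketch of promoting the candidate cluster by repointing the Kubeflow DNS
# record in Route 53.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0123456789EXAMPLE",
    ChangeBatch={
        "Comment": "Promote candidate Kubeflow cluster to production",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "kubeflow.example.com",
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [
                    # Ingress load balancer of the candidate EKS cluster
                    {"Value": "candidate-ingress-1234567890.us-east-1.elb.amazonaws.com"},
                ],
            },
        }],
    },
)
```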

Benefits of Amazon EKS and Kubeflow on AWS for our ML pipeline

Amazon EKS and the Kubeflow on AWS package moved our development workflow to a pattern that strongly encourages repeatable model training. These tools allow us to have fully defined clusters with fully defined tenants and run fully defined code.

A lot of wins from building this platform are less quantitative and have more to do with how the workflows have improved for both platform developers and users. For example, MinIO was replaced with direct access to Amazon S3, which moves us closer to our original workflows and reduces the number of services we must maintain. We are also able to utilize Amazon RDS as the backend for Kubeflow, which enables easier migrations between clusters and gives us the ability to back up our pipelines nightly.

We also found the improvements in the Kubeflow integration with AWS managed services beneficial. For example, with Amazon RDS, Amazon S3, and Amazon Cognito preconfigured in the Kubeflow on AWS manifests, we save time and effort updating to newer distributions of Kubeflow. When we used to modify the official Kubeflow manifests manually, updating to a new version would take several weeks, from design to testing.

Switching to Amazon EKS gives us the opportunity to define our cluster in Kustomize (now part of Kubectl) and Terraform. It turns out that for platform work, Kubernetes and Terraform are very easy to work with after putting in enough time to learn. After many iterations, the tools available to us make it very easy to perform standard platform operations like upgrading a component or swapping out an entire development cluster. Compared to running jobs off raw Amazon Elastic Compute Cloud (Amazon EC2) instances, it’s hard to overstate what an enormous difference it makes to have well-defined pods with guaranteed resource cleanup and retry mechanisms built in.

Kubernetes provides great security standards, and we have only scratched the surface of what the multi-user isolation allows us to do. We see multi-user isolation as a pattern that has more payoff in the future when the training platform produces production-level data, and we bring on developers from outside our team.

Meanwhile, Kubeflow allows us to have reproducible model training. Even with the same data, no training produces identical models, but we have the next best thing. With Kubeflow, we know exactly what code and data were used to train a model. Onboarding has greatly improved because each step in our pipeline is clearly and programmatically defined. When new data scientists have the task of fixing a bug, they need much less handholding because there is a clear structure to how outputs of code are used between stages.

Using Kubeflow also yields a lot of performance improvements compared to running on a single EC2 instance. Often in model training, data scientists need different tools and optimizations for preprocessing and training. For example, preprocessing is often run using distributed data processing tools, like Spark, whereas training is often run using GPU instances. With Kubeflow pipelines, they can specify different instance types for different stages in the pipeline. This allows them to use the powerful GPU instances in one stage and a fleet of smaller machines for distributed processing in another stage. Also, because Kubeflow pipelines describe the dependencies between stages, the pipelines can run stages in parallel.

Finally, because we created a process for adding tenants to the cluster, there is now a more formal way to register teams to a tenant on the cluster. Because we use Kubecost to track costs in our EKS cluster, it allows us to attribute cost to a single project rather than having cost attributed at the account level, which includes all data science projects. Kubecost presents a report of the money spent per namespace, which is tightly coupled to the tenant or team that is responsible for running the pipeline.

Despite all the benefits, we would caution that this kind of migration should only be undertaken if there is total buy-in from users. Users who put in the time get a lot of benefits from using Amazon EKS and Kubernetes, but there is a significant learning curve.

Conclusion

With the implementation of the Kubeflow on AWS pipeline in our end-to-end ML infrastructure, we were able to consolidate and standardize our data science workflows while retaining our essential tooling (such as CI/CD and model serving). Our data scientists can now move between projects based on this workflow without the overhead of learning how to maintain a completely different toolset. For some of our models, we were also pleasantly surprised at the speed of the new workflow (five times faster), which allowed for more training iterations and consequently produced models with better predictions.

We also have established a solid foundation to augment our MLOps capabilities and scale the number and size of our projects. For example, as we harden our governance posture in model lineage and tracking, we have reduced our focus from over 15 workflows to just one. And when the Log4shell vulnerability came to light in late 2021, we were able to focus on a single workflow and quickly remediate as needed (performing Amazon Elastic Container Registry (Amazon ECR) scans, upgrading Amazon OpenSearch Service, updating our tooling, and more) with minimal impact to the ongoing work by the data scientists. As AWS and Kubeflow enhancements become available, we can incorporate them as we see fit.

This brings us to an important and understated aspect of our Kubeflow on AWS adoption. One of the critical outcomes of this journey is the ability to roll out upgrades and enhancements to Kubeflow seamlessly for our data scientists. Although we discussed our approach to this earlier, we also rely on the Kubeflow manifests provided by AWS. We started our Kubeflow journey as a proof of concept in 2019, prior to the release of version 1.0.0. (We’re currently on 1.4.1, evaluating 1.5. AWS is already working on the 1.6 version.) In the intervening 3 years, there have been at least six releases with substantial content. Through their disciplined approach to integrating and validating these upgrades and releasing the manifests on a predictable, reliable schedule, the Kubeflow team at AWS has been crucial in enabling the athenahealth MLOps team to plan our development roadmap, and consequently our resource allocations and areas of focus, further into the future with greater confidence.

You can follow the AWS Labs GitHub repository to track all AWS contributions to Kubeflow. You can also find AWS teams on the Kubeflow #AWS Slack Channel; your feedback there helps AWS prioritize the next features to contribute to the Kubeflow project.


About the authors

Kanwaljit Khurmi is a Senior Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions on AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Tyler Kalbach is a Principal Member of Technical Staff at athenahealth. Tyler has approximately 7 years of experience in Analytics, Data Science, Neural Networks, and development of Machine Learning applications in the Healthcare space. He has contributed to several Machine Learning solutions that are currently serving production traffic. Currently working as a Principal Data Scientist in athenahealth’s Engineering organization, Tyler has been part of the team that has built the new Machine Learning Training Platform for athenahealth from the inception of that effort.

Victor Krylov is a Principal Member of Technical Staff at athenahealth. Victor is an engineer and scrum master, helping data scientists build secure fast machine learning pipelines. In athenahealth he has worked on interfaces, clinical ordering, prescriptions, scheduling, analytics and now machine learning. He values cleanly written and well unit tested code, but has an unhealthy obsession with code one-liners. In his spare time he enjoys listening to podcasts while walking his dog.

Sasank Vemuri is a Lead Member of Technical Staff at athenahealth. He has experience working with developing data driven solutions across domains such as healthcare, insurance and bioinformatics. Sasank currently works with designing and developing machine learning training and inference platforms on AWS and Kubernetes that help with training and deploying ML solutions at scale.

Anu Tumkur is an Architect at athenahealth. Anu has over two decades of architecture, design, and development experience building software products in machine learning, cloud operations, big data, real-time distributed data pipelines, ad tech, data analytics, and social media analytics. Anu currently works as an architect in athenahealth’s Product Engineering organization on the Machine Learning Platform and Data Pipeline teams.

William Tsen is a Senior Engineering Manager at athenahealth. He has over 20 years of engineering leadership experience building solutions in healthcare IT, big data distributed computing, intelligent optical networks, real-time video editing systems, enterprise software, and group healthcare underwriting. William currently leads two awesome teams at athenahealth, the Machine Learning Operations and DevOps engineering teams, in the Product Engineering organization.

Read More

A Multi-Axis Approach for Vision Transformer and MLP Models

Convolutional neural networks have been the dominant machine learning architecture for computer vision since the introduction of AlexNet in 2012. Recently, inspired by the evolution of Transformers in natural language processing, attention mechanisms have been prominently incorporated into vision models. These attention methods boost some parts of the input data while minimizing other parts so that the network can focus on small but important parts of the data. The Vision Transformer (ViT) has created a new landscape of model designs for computer vision that is completely free of convolution. ViT regards image patches as a sequence of words, and applies a Transformer encoder on top. When trained on sufficiently large datasets, ViT demonstrates compelling performance on image recognition.

While convolutions and attention are both sufficient for good performance, neither of them is necessary. For example, MLP-Mixer adopts a simple multi-layer perceptron (MLP) to mix image patches across all the spatial locations, resulting in an all-MLP architecture. It is a competitive alternative to existing state-of-the-art vision models in terms of the trade-off between accuracy and the computation required for training and inference. However, both ViT and the MLP models struggle to scale to higher input resolution because their computational complexity increases quadratically with respect to the image size.

Today we present a new multi-axis approach that is simple and effective, improves on the original ViT and MLP models, can better adapt to high-resolution, dense prediction tasks, and can naturally adapt to different input sizes with high flexibility and low complexity. Based on this approach, we have built two backbone models for high-level and low-level vision tasks. We describe the first in “MaxViT: Multi-Axis Vision Transformer”, to be presented in ECCV 2022, and show it significantly improves the state of the art for high-level tasks, such as image classification, object detection, segmentation, quality assessment, and generation. The second, presented in “MAXIM: Multi-Axis MLP for Image Processing” at CVPR 2022, is based on a UNet-like architecture and achieves competitive performance on low-level imaging tasks including denoising, deblurring, dehazing, deraining, and low-light enhancement. To facilitate further research on efficient Transformer and MLP models, we have open-sourced the code and models for both MaxViT and MAXIM.

A demo of image deblurring using MAXIM frame by frame.

Overview
Our new approach is based on multi-axis attention, which decomposes the full-size attention used in ViT (each pixel attends to all the pixels) into two sparse forms: local and (sparse) global. As shown in the figure below, multi-axis attention consists of a sequential stack of block attention and grid attention. The block attention works within non-overlapping windows (small patches in intermediate feature maps) to capture local patterns, while the grid attention works on a sparsely sampled uniform grid for long-range (global) interactions. The window sizes of the grid and block attention can be fully controlled as hyperparameters to ensure linear computational complexity with respect to the input size.

The proposed multi-axis attention conducts blocked local attention and dilated global attention sequentially, followed by an FFN, with only linear complexity. Pixels of the same color are attended to together.
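
As a rough sketch of why the complexity is linear (our own back-of-the-envelope accounting, using P and G to denote the block and grid sizes; these symbols are not from the papers): for an H × W feature map,

\[
\underbrace{O\big((HW)^{2}\big)}_{\text{full self-attention}}
\quad\longrightarrow\quad
\underbrace{O\big(HW \cdot P^{2}\big)}_{\text{block attention}}
\;+\;
\underbrace{O\big(HW \cdot G^{2}\big)}_{\text{grid attention}},
\]

so once P and G are fixed as hyperparameters, the total cost grows linearly with the number of pixels HW rather than quadratically, which is what makes the approach practical for high-resolution inputs.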

Such low-complexity attention significantly widens applicability to many vision tasks, especially high-resolution visual prediction, demonstrating greater generality than the original attention used in ViT. We build two backbone instantiations out of this multi-axis attention approach: MaxViT and MAXIM, for high-level and low-level tasks, respectively.

MaxViT
In MaxViT, we first build a single MaxViT block (shown below) by concatenating MBConv (proposed by EfficientNet, V2) with the multi-axis attention. This single block can encode local and global visual information regardless of input resolution. We then simply stack repeated blocks composed of attention and convolutions in a hierarchical architecture (similar to ResNet, CoAtNet), yielding our homogeneous MaxViT architecture. Notably, MaxViT is distinguished from previous hierarchical approaches in that it can “see” globally throughout the entire network, even in the earlier, high-resolution stages, demonstrating stronger model capacity on various tasks.

The meta-architecture of MaxViT.

MAXIM
Our second backbone, MAXIM, is a generic UNet-like architecture tailored for low-level image-to-image prediction tasks. MAXIM explores parallel designs of the local and global approaches using the gated multi-layer perceptron (gMLP) network (a patch-mixing MLP with a gating mechanism). Another contribution of MAXIM is the cross-gating block, which can be used to apply interactions between two different input signals. This block can serve as an efficient alternative to the cross-attention module because it only employs cheap gated MLP operators to interact with various inputs, without relying on computationally heavy cross-attention. Moreover, all the proposed components in MAXIM, including the gated MLP and cross-gating blocks, enjoy linear complexity with respect to image size, making it even more efficient when processing high-resolution pictures.

Results
We demonstrate the effectiveness of MaxViT on a broad range of vision tasks. On image classification, MaxViT achieves state-of-the-art results under various settings: with only ImageNet-1K training, MaxViT attains 86.5% top-1 accuracy; with ImageNet-21K (14M images, 21k classes) pre-training, MaxViT achieves 88.7% top-1 accuracy; and with JFT (300M images, 18k classes) pre-training, our largest model MaxViT-XL achieves a high accuracy of 89.5% with 475M parameters.

Performance comparison of MaxViT with state-of-the-art models on ImageNet-1K. Top: Accuracy vs. FLOPs performance scaling with 224×224 image resolution. Bottom: Accuracy vs. parameters scaling curve under ImageNet-1K fine-tuning setting.

For downstream tasks, MaxViT as a backbone delivers favorable performance on a broad spectrum of tasks. For object detection and segmentation on the COCO dataset, the MaxViT backbone achieves 53.4 AP, outperforming other base-level models while requiring only about 60% of the computational cost. For image aesthetics assessment, the MaxViT model advances the state-of-the-art MUSIQ model by 3.5% in terms of linear correlation with human opinion scores. The standalone MaxViT building block also demonstrates effective performance on image generation, achieving better FID and IS scores on the ImageNet-1K unconditional generation task with significantly fewer parameters than the state-of-the-art model, HiT.

The UNet-like MAXIM backbone, customized for image processing tasks, has also demonstrated state-of-the-art results on 15 out of 20 tested datasets, including denoising, deblurring, deraining, dehazing, and low-light enhancement, while requiring a comparable or smaller number of parameters and FLOPs than competitive models. Images restored by MAXIM show more recovered detail with fewer visual artifacts.

Visual results of MAXIM for image deblurring, deraining, and low-light enhancement.

Summary
Work over the last two or so years has shown that ConvNets and Vision Transformers can achieve similar performance. Our work presents a unified design that takes advantage of the best of both worlds, efficient convolution and sparse attention, and demonstrates that a model built on top of it, namely MaxViT, can achieve state-of-the-art performance on a variety of vision tasks. More importantly, MaxViT scales well to very large data sizes. We also show that an alternative multi-axis design using MLP operators, MAXIM, achieves state-of-the-art performance on a broad range of low-level vision tasks.

Even though we present our models in the context of vision tasks, the proposed multi-axis approach can easily extend to language modeling to capture both local and global dependencies in linear time. Motivated by the work here, we expect that it is worthwhile to study other forms of sparse attention in higher-dimensional or multimodal signals such as videos, point clouds, and vision-language models.

We have open-sourced the code and models of MAXIM and MaxViT to facilitate future research on efficient attention and MLP models.

Acknowledgments
We would like to thank our co-authors: Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, and Alan Bovik. We would also like to acknowledge the valuable discussion and support from Xianzhi Du, Long Zhao, Wuyang Chen, Hanxiao Liu, Zihang Dai, Anurag Arnab, Sungjoon Choi, Junjie Ke, Mauricio Delbracio, Irene Zhu, Innfarn Yoo, Huiwen Chang, and Ce Liu.

Read More

NVIDIA Hopper Sweeps AI Inference Benchmarks in MLPerf Debut

In their debut on the MLPerf industry-standard AI benchmarks, NVIDIA H100 Tensor Core GPUs set world records in inference on all workloads, delivering up to 4.5x more performance than previous-generation GPUs.

The results demonstrate that Hopper is the premium choice for users who demand utmost performance on advanced AI models.

Additionally, NVIDIA A100 Tensor Core GPUs and the NVIDIA Jetson AGX Orin module for AI-powered robotics continued to deliver overall leadership inference performance across all MLPerf tests: image and speech recognition, natural language processing and recommender systems.

The H100, aka Hopper, raised the bar in per-accelerator performance across all six neural networks in the round. It demonstrated leadership in both throughput and speed in separate server and offline scenarios.

Hopper performance on MLPerf AI inference tests
NVIDIA H100 GPUs set new high watermarks on all workloads in the data center category.

The NVIDIA Hopper architecture delivered up to 4.5x more performance than NVIDIA Ampere architecture GPUs, which continue to provide overall leadership in MLPerf results.

Thanks in part to its Transformer Engine, Hopper excelled on the popular BERT model for natural language processing. It’s among the largest and most performance-hungry of the MLPerf AI models.

These inference benchmarks mark the first public demonstration of H100 GPUs, which will be available later this year. The H100 GPUs will participate in future MLPerf rounds for training.

A100 GPUs Show Leadership

NVIDIA A100 GPUs, available today from major cloud service providers and systems manufacturers, continued to show overall leadership in mainstream performance on AI inference in the latest tests.

A100 GPUs won more tests than any submission in data center and edge computing categories and scenarios. In June, the A100 also delivered overall leadership in MLPerf training benchmarks, demonstrating its abilities across the AI workflow.

Since their July 2020 debut on MLPerf, A100 GPUs have advanced their performance by 6x, thanks to continuous improvements in NVIDIA AI software.

NVIDIA AI is the only platform to run all MLPerf inference workloads and scenarios in data center and edge computing.

Users Need Versatile Performance

The ability of NVIDIA GPUs to deliver leadership performance on all major AI models makes users the real winners. Their real-world applications typically employ many neural networks of different kinds.

For example, an AI application may need to understand a user’s spoken request, classify an image, make a recommendation and then deliver a response as a spoken message in a human-sounding voice. Each step requires a different type of AI model.

The MLPerf benchmarks cover these and other popular AI workloads and scenarios — computer vision, natural language processing, recommendation systems, speech recognition and more. The tests ensure users will get performance that’s dependable and flexible to deploy.

Users rely on MLPerf results to make informed buying decisions, because the tests are transparent and objective. The benchmarks enjoy backing from a broad group that includes Amazon, Arm, Baidu, Google, Harvard, Intel, Meta, Microsoft, Stanford and the University of Toronto.

Orin Leads at the Edge

In edge computing, NVIDIA Orin ran every MLPerf benchmark, winning more tests than any other low-power system-on-a-chip. And it showed up to a 50% gain in energy efficiency compared to its debut on MLPerf in April.

In the previous round, Orin ran up to 5x faster than the prior-generation Jetson AGX Xavier module, while delivering an average of 2x better energy efficiency.

Orin leads MLPerf in edge inference
Orin delivered up to 50% gains in energy efficiency for AI inference at the edge.

Orin integrates into a single chip an NVIDIA Ampere architecture GPU and a cluster of powerful Arm CPU cores. It’s available today in the NVIDIA Jetson AGX Orin developer kit and production modules for robotics and autonomous systems, and supports the full NVIDIA AI software stack, including platforms for autonomous vehicles (NVIDIA Hyperion), medical devices (Clara Holoscan) and robotics (Isaac).

Broad NVIDIA AI Ecosystem

The MLPerf results show NVIDIA AI is backed by the industry’s broadest ecosystem in machine learning.

More than 70 submissions in this round ran on the NVIDIA platform.  For example, Microsoft Azure submitted results running NVIDIA AI on its cloud services.

In addition, 19 NVIDIA-Certified Systems appeared in this round from 10 systems makers, including ASUS, Dell Technologies, Fujitsu, GIGABYTE, Hewlett Packard Enterprise, Lenovo and Supermicro.

Their work shows users can get great performance with NVIDIA AI both in the cloud and in servers running in their own data centers.

NVIDIA partners participate in MLPerf because they know it’s a valuable tool for customers evaluating AI platforms and vendors. Results in the latest round demonstrate that the performance they deliver to users today will grow with the NVIDIA platform.

All the software used for these tests is available from the MLPerf repository, so anyone can get these world-class results. Optimizations are continuously folded into containers available on NGC, NVIDIA’s catalog for GPU-accelerated software. That’s where you’ll also find NVIDIA TensorRT, used by every submission in this round to optimize AI inference.

The post NVIDIA Hopper Sweeps AI Inference Benchmarks in MLPerf Debut appeared first on NVIDIA Blog.

Read More

Content moderation using machine learning: the server-side part

Posted by Jen Person, Senior Developer Relations Engineer, TensorFlow

Welcome to part 2 of my dual approach to content moderation! In this post, I show you how to implement content moderation using machine learning in a server-side environment. If you’d like to see how to implement this moderation client-side, check out part 1.

Remind me: what are we doing here again?

In short, anonymity can create some distance between people in a way that allows them to say things they wouldn’t say in person. That is to say, there are tons of trolls out there. And let’s be honest: we’ve all typed something online we wouldn’t actually say IRL at least once! Any website that takes public text input can benefit from some form of moderation. Client-side moderation has the benefit of instant feedback, but server-side moderation cannot be bypassed like client-side might, so I like to have both.

This project picks up where part 1 left off, but you can also start here with a fresh copy of the Firebase Text Moderation demo code. The website in the Firebase demo showcases content moderation through a basic guestbook using a server-side content moderation system implemented through a Realtime Database-triggered Cloud Function. This means that the guestbook data is stored in the Firebase Realtime Database, a NoSQL database. The Cloud Function is triggered whenever data is written to a certain area of the database. We can choose what code runs when that event is triggered. In our case, we will use the Text Toxicity Classifier model to determine if the text written to the database is inappropriate, and then remove it from the database if needed. With this model, you can evaluate text on different labels of unwanted content, including identity attacks, insults, and obscenity. You can try out the demo to see the classifier in action.

If you prefer to start at the end, you can follow along in a completed version of the project on GitHub.

Server-side moderation

The Firebase text moderation example I used as my starting point doesn’t include any machine learning. Instead, it checks for the presence of profanity from a list of words and then replaces them with asterisks using the bad-words npm package. I thought about blending this approach with machine learning (more on that later), but I decided to just wipe the slate clean and replace the code of the Cloud Function altogether. Start by navigating to the Cloud Functions folder of the Text Moderation example:

cd textmoderation/functions

Open index.js and delete its contents. In index.js, add the following code:

const functions = require('firebase-functions');
const toxicity = require('@tensorflow-models/toxicity');

exports.moderator = functions.database.ref('/messages/{messageId}').onCreate(async (snapshot, context) => {
  const message = snapshot.val();

  // Verify that the snapshot has a value
  if (!message) {
    return;
  }
  functions.logger.log('Retrieved message content: ', message);

  // Run moderation checks on the message and delete if needed.
  const moderateResult = await moderateMessage(message.text);
  functions.logger.log(
    'Message has been moderated. Does message violate rules? ',
    moderateResult
  );
});

This code runs any time a message is added to the database. It gets the text of the message, passes it to a function called `moderateMessage`, and logs the result in `moderateResult`. If you’re interested in learning more about Cloud Functions and the Realtime Database, then check out the Firebase documentation.

Add the Text Toxicity Classifier model

Depending on your development environment, you probably have some sort of error now since we haven’t actually written a function called moderateMessage yet. Let’s fix that. Below your Cloud Function trigger function, add the following code:

exports.moderator = functions.database.ref('/messages/{messageId}').onCreate(async (snapshot, context) => {
  //…
  // Your other function code is here.
});

async function moderateMessage(message) {
  const threshold = 0.9;

  let model = await toxicity.load(threshold);

  const messages = [message];

  let predictions = await model.classify(messages);

  for (let item of predictions) {
    for (let i in item.results) {
      if (item.results[i].match === true) {
        return true;
      }
    }
  }
  return false;
}

This function does the following:

  1. Sets the threshold for the model to 0.9. The threshold is the minimum prediction confidence the model needs before it labels a prediction true or false, that is, before it decides the text does or does not contain the given type of toxic content. The scale for the threshold is 0 to 1.0. In this case, I set the threshold to 0.9, which means the model only reports true or false when it is at least 90% confident in its findings.
  2. Loads the model, passing in the threshold. Once loaded, the result is assigned to the `model` variable.
  3. Puts the message into an array called messages, as an array is the object type that the classify function accepts.
  4. Calls classify on the messages array.
  5. Iterates through the prediction results. predictions is an array of objects, each representing a different label of unwanted language. You may only care about specific labels rather than iterating through them all (see the sketch after this list). For example, if your use case is a website for hosting the transcripts of rap battles, you probably don’t want to detect and remove insults.
  6. Checks if the content is a match for that label. If the match value is true, the model has detected the given type of unwanted language and the function returns true. There’s no need to keep checking the rest of the results, since the content has already been deemed inappropriate.
  7. If the function iterates through all the results and no label’s match is set to true, the function returns false, meaning no undesirable language was found. The match value can also be null; in that case, it isn’t true, so the text is considered acceptable language. I will talk more about the null option in a future post.
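
For example, here is a minimal variation of moderateMessage (my own sketch, not part of the Firebase sample) that only acts on a chosen subset of labels; the label names follow those exposed by the @tensorflow-models/toxicity package:

// Only enforce a subset of the model's labels.
const LABELS_TO_ENFORCE = ['identity_attack', 'threat', 'severe_toxicity'];

async function moderateMessageForLabels(message) {
  const threshold = 0.9;
  const model = await toxicity.load(threshold);
  const predictions = await model.classify([message]);

  // Each prediction has a label (such as 'insult') and a results array.
  return predictions.some((prediction) =>
    LABELS_TO_ENFORCE.includes(prediction.label) &&
    prediction.results.some((result) => result.match === true)
  );
}
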
If you completed part 1 of this tutorial, then these steps probably sound familiar. The server-side code is very similar to the client-side code. This is one of the things that I like about TensorFlow.js: it’s often straightforward to transition code from the client to server and vice versa.

Complete the Cloud Functions code

Back in your Cloud Function, you now know that based on the code we wrote for moderateMessage, the value of moderateResult will be true or false: true if the message is considered toxic by the model, and false if it does not detect toxicity with certainty greater than 90%. Now add code to delete the message from the database if it is deemed toxic:

  // Run moderation checks on the message and delete if needed.
  const moderateResult = await moderateMessage(message.text);
  functions.logger.log(
    'Message has been moderated. Does message violate rules? ',
    moderateResult
  );

  if (moderateResult === true) {
    var modRef = snapshot.ref;
    try {
      await modRef.remove();
    } catch (error) {
      functions.logger.error('Remove failed: ' + error.message);
    }
  }

This code does the following:

  1. Checks if moderateResult is true, meaning that the message written to the guestbook is inappropriate.
  2. If the value is true, it removes the data from the database using the remove function from the Realtime Database SDK.
  3. Logs an error if one occurs.

Deploy the code

To deploy the Cloud Function, you can use the Firebase CLI. If you don’t have it, you can install it using the following npm command:

npm install -g firebase-tools

Once installed, use the following command to log in:

firebase login

Run this command to connect the app to your Firebase project:

firebase use --add

From here, you can select your project in the list, connect Firebase to an existing Google Cloud project, or create a new Firebase project.

Once the project is configured, use the following command to deploy your Cloud Function:

firebase deploy

Once deployment is complete, the logs include the link to your hosted guestbook. Write some guestbook entries. If you followed part 1 of the blog, you will need to either delete the moderation code from the website and deploy again, or manually add guestbook entries to the Realtime Database in the Firebase console.

You can view your Cloud Functions logs in the Firebase console.

Building on the example

I have a bunch of ideas for ways to build on this example. Here are just a few. Let me know which ideas you would like to see me build, and share your suggestions as well! The best ideas come from collaboration.

Get a queue

I mentioned that the “match” value of a language label can be true, false, or null without going into detail on the significance of the null value. If the match value is null, the model cannot determine whether the language is toxic within the given threshold. One way to limit the number of null values is to lower this threshold. For example, if you change the threshold value to 0.8, then the model will label the match value as true if it is at least 80% certain that the text contains language that fits the label. My website example treats a null match the same as false, allowing that text through the filter. But since the model isn’t sure whether that text is appropriate, it’s probably a good idea to get some eyes on it. You could add these posts to a queue for review, and then approve or deny them as needed (a rough sketch of one way to do this follows). I said “you” here, but I guess I mean “me”. If you think this would be an interesting use case to explore, let me know! I’m happy to write about it if it would be useful.
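
Here is what that could look like as a sketch (my own illustration built on the earlier code; the three-way verdict and the /reviewQueue path are assumptions, not part of the Firebase sample):

// A variation of moderateMessage that distinguishes "unsure" from "toxic".
async function moderateMessageWithQueue(message) {
  const threshold = 0.9;
  const model = await toxicity.load(threshold);
  const predictions = await model.classify([message]);

  let needsReview = false;
  for (const prediction of predictions) {
    for (const result of prediction.results) {
      if (result.match === true) {
        return 'block';       // clearly unwanted language: delete the post
      }
      if (result.match === null) {
        needsReview = true;   // the model isn't confident for this label
      }
    }
  }
  return needsReview ? 'review' : 'allow';
}

// Inside the Cloud Function trigger, act on the verdict:
//   const verdict = await moderateMessageWithQueue(message.text);
//   if (verdict === 'block') {
//     await snapshot.ref.remove();
//   } else if (verdict === 'review') {
//     // Copy the post to a hypothetical /reviewQueue path for a human to approve or deny.
//     await snapshot.ref.root.child(`reviewQueue/${context.params.messageId}`).set(message);
//   }

A reviewer could then read from /reviewQueue and either write an approved post back to /messages or discard it.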

What’s in ‘store

The Firebase moderation sample that I used as the foundation of my project uses Realtime Database. I prefer to use Firestore because of its structure, scalability, and security. Firestore’s structure is well suited for implementing a queue because I could keep a subcollection of posts awaiting review inside the collection of posts. If you’d like to see the website using Firestore, let me know.
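
If I did switch, a minimal sketch of the Firestore-triggered version might look like the following (my own illustration; the guestbook collection name and the text field are assumptions):

const admin = require('firebase-admin');
admin.initializeApp();

exports.moderatorFirestore = functions.firestore
  .document('guestbook/{postId}')
  .onCreate(async (snapshot, context) => {
    const post = snapshot.data();
    if (!post || !post.text) {
      return;
    }

    const moderateResult = await moderateMessage(post.text);
    if (moderateResult === true) {
      // Delete the post, or move it into a review subcollection instead.
      await snapshot.ref.delete();
    }
  });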

Don’t just eliminate – moderate!

One of the things I like about the original Firebase moderation sample is that it sanitizes the text rather than just deleting the post. You could run text through the sanitizer before checking for toxic language through the text toxicity model. If the sanitized text is deemed appropriate, then it could overwrite the original text. If it still doesn’t meet the standards of decent discourse, then you could still delete it. This might save some posts from otherwise being deleted.
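
Roughly, that flow could look like the following sketch (my own illustration; it assumes the bad-words npm package from the original sample and the moderateMessage function from earlier):

const Filter = require('bad-words');
const filter = new Filter();

async function moderateAndSanitize(text) {
  // First pass: replace listed profanity with asterisks.
  const sanitized = filter.clean(text);

  // Second pass: run the sanitized text through the toxicity model.
  const stillToxic = await moderateMessage(sanitized);

  // If the sanitized version passes, overwrite the original text;
  // otherwise signal that the post should be deleted.
  return stillToxic ? { action: 'delete' } : { action: 'overwrite', text: sanitized };
}

In the overwrite case, the trigger could call snapshot.ref.update({ text: result.text }) instead of removing the post.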

What’s in a name?

You’ve probably noticed that my moderation functionality doesn’t extend to the name field. This means that even a halfway-clever troll could easily get around the filter by cramming all of their expletives into that name field. That’s a good point and I trust that you will use some type of moderation on all fields that users interact with. Perhaps you use an authentication method to identify users so they aren’t provided a field for their name. Anyway, you get it: I didn’t add moderation to the name field, but in a production environment, you definitely want moderation on all fields.

Build a better fit

When you test out real-world text samples on your website, you might find that the text toxicity classifier model doesn’t quite fit your needs. Since each social space is unique, there will be specific language that you are looking to include and exclude. You can address these needs by training the model on new data that you provide.

If you enjoyed this article and would like to learn more about TensorFlow.js, then there are a ton of things you can do:

Read More