How to scale machine learning inference for multi-tenant SaaS use cases

This post is co-written with Sowmya Manusani, Sr. Staff Machine Learning Engineer at Zendesk

Zendesk is a SaaS company that builds support, sales, and customer engagement software for everyone, with simplicity as the foundation. It helps over 170,000 companies worldwide serve their hundreds of millions of customers efficiently. The Machine Learning team at Zendesk is responsible for enabling Customer Experience teams to do their best work. By combining the power of data and people, Zendesk delivers intelligent products that make their customers more productive by automating manual work.

Zendesk has been building ML products since 2015, including Answer Bot, Satisfaction Prediction, Content Cues, Suggested Macros, and many more. In the last few years, with the growth in deep learning, especially in NLP, they saw a lot of opportunity to automate workflows and assist agents in supporting their customers with Zendesk solutions. Zendesk currently uses TensorFlow and PyTorch to build deep learning models.

Customers like Zendesk have built successful, high-scale software as a service (SaaS) businesses on Amazon Web Services (AWS). A key driver for a successful SaaS business model is the ability to apply multi-tenancy in the application and infrastructure. This enables cost and operational efficiencies because the application only needs to be built once, but it can be used many times and the infrastructure can be shared. We see many customers build secure, cost-efficient, multi-tenant systems on AWS at all layers of the stack, from compute, storage, database, to networking, and now we’re seeing customers needing to apply it to machine learning (ML).

Making the difficult tradeoff between model reuse and hyper-personalization

Multi-tenancy for SaaS businesses typically means that a single application is reused between many users (SaaS customers). This creates cost efficiencies and lowers operational overhead. However, machine learning models sometimes need to be personalized to a high degree of specificity (hyper-personalized) to make accurate predictions. This means the “build once, use many times” SaaS paradigm can’t always be applied to ML if models have specificity. Take for example the use case of customer support platforms. The language that users include in a support ticket varies depending on whether it’s a ride share issue (“ride took too long”) or a clothing purchase issue (“discoloration when washed”). In this use case, improving the accuracy of predicting the best remediation action may require training a natural language processing (NLP) model on a dataset specific to a business domain or an industry vertical. Zendesk faces exactly this challenge when trying to leverage ML in their solutions. They needed to create thousands of highly customized ML models, each tailored for a specific customer. To solve this challenge of deploying thousands of models cost-effectively, Zendesk turned to Amazon SageMaker.

In this post, we show how to use some of the newer features of Amazon SageMaker, a fully managed machine learning service, to build a multi-tenant ML inference capability. We also share a real-world example of how Zendesk successfully achieved the same outcome by deploying a happy medium between being able to support hyper-personalization in their ML models and the cost-efficient, shared use of infrastructure using SageMaker multi-model endpoints (MME).

SageMaker multi-model endpoints

SageMaker multi-model endpoints enable you to deploy multiple models behind a single inference endpoint that may contain one or more instances. Each instance is designed to load and serve multiple models up to its memory and CPU capacity. With this architecture, a SaaS business can break the linearly increasing cost of hosting multiple models and achieve reuse of infrastructure consistent with the multi-tenancy model applied elsewhere in the application stack.
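To make this concrete, a multi-model endpoint is created like a regular SageMaker endpoint, except the container definition sets `Mode` to `MultiModel` and `ModelDataUrl` points to an S3 prefix holding many model archives rather than a single one. The following is a minimal sketch using the boto3 `create_model` API; the model name, role ARN, image URI, and S3 prefix are placeholders, not values from Zendesk's deployment:

```python
def build_mme_model_definition(model_name, role_arn, image_uri, model_prefix):
    """Arguments for sagemaker.create_model() for a multi-model endpoint.

    Mode="MultiModel" tells SageMaker that ModelDataUrl is an S3 prefix
    containing many model.tar.gz archives, which are loaded on demand
    at invocation time rather than at endpoint creation.
    """
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": {
            "Image": image_uri,            # framework serving container
            "Mode": "MultiModel",
            "ModelDataUrl": model_prefix,  # e.g. "s3://my-bucket/models/"
        },
    }

# The actual call would be:
#   sm = boto3.client("sagemaker")
#   sm.create_model(**build_mme_model_definition(...))
args = build_mme_model_definition(
    "suggested-macros",
    "arn:aws:iam::123456789012:role/SageMakerRole",
    "<account>.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.8-cpu",
    "s3://my-bucket/models/",
)
```

The endpoint configuration and endpoint themselves are then created as usual; only the model definition differs from the single-model case.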

The following diagram illustrates the architecture of a SageMaker multi-model endpoint.

The SageMaker multi-model endpoint dynamically loads models from Amazon Simple Storage Service (Amazon S3) when invoked, instead of downloading all the models when the endpoint is first created. As a result, an initial invocation to a model might see higher inference latency than the subsequent inferences, which are completed with low latency. If the model is already loaded on the container when invoked, then the download step is skipped and the model returns the inferences with low latency. For example, assume you have a model that is only used a few times a day. It is automatically loaded on demand, while frequently accessed models are retained in memory and invoked with consistently low latency.
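At invocation time, the caller selects which model to run with the `TargetModel` parameter of `invoke_endpoint`, naming the model archive relative to the endpoint's S3 prefix. A sketch of building such a request follows; the endpoint name and model key are illustrative placeholders:

```python
import json

def build_mme_invocation(endpoint_name, model_key, payload):
    """Arguments for sagemaker-runtime invoke_endpoint() against an MME.

    TargetModel selects the model archive (relative to the endpoint's S3
    model prefix) that SageMaker should load, if necessary, and run.
    """
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "TargetModel": model_key,  # e.g. "customer-1234/model.tar.gz"
        "Body": json.dumps(payload),
    }

# The actual call would be:
#   runtime = boto3.client("sagemaker-runtime")
#   response = runtime.invoke_endpoint(**build_mme_invocation(...))
args = build_mme_invocation(
    "macros-mme", "customer-1234/model.tar.gz", {"text": "ride took too long"}
)
```

The first request naming a given `TargetModel` pays the download-and-load cost described above; subsequent requests for the same model hit the in-memory copy.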

Let’s take a closer look at how Zendesk used SageMaker MME to achieve cost-effective, hyper-scale ML deployment with their Suggested Macros ML feature.

Why Zendesk built hyper-personalized models

Zendesk’s customers are spread globally in different industry verticals with different support ticket semantics. Therefore, to serve their customers best, they often have to build personalized models that are trained on customer-specific support ticket data to correctly identify intent, macros, and more.

In October 2021, they released a new NLP ML feature, Suggested Macros, which recommends macros (predefined actions) based on thousands of customer-specific model predictions. Zendesk’s ML team built a TensorFlow-based NLP classifier model per customer, trained on that customer’s previous history of ticket content and macros. With these models available, a macro prediction is recommended whenever an agent views the ticket (as shown in the following screenshot), which assists the agent in serving customers quickly. Because macros are specific to customers, Zendesk needs customer-specific models to serve accurate predictions.

Under the hood of Zendesk’s Suggested Macros

Suggested Macros models are NLP-based neural nets that are around 7–15 MB in size. The main challenge is to put thousands of these models in production with cost-efficient, reliable, and scalable solutions.

Each model has different traffic patterns, with a minimum of two requests per second and a peak of hundreds of requests per second, serving millions of predictions per day with a model latency of approximately 100 milliseconds when the model is available in memory. SageMaker endpoints are deployed in multiple AWS Regions, serving thousands of requests per minute per endpoint.

With its ability to host multiple models on a single endpoint, SageMaker helped Zendesk reduce deployment overhead and create a cost-effective solution when compared to deploying a single-model endpoint per customer. The tradeoff here is less control on per-model management; however, this is an area where Zendesk is collaborating with AWS to improve multi-model endpoints.

One of the SageMaker multi-model features is lazy loading of models: models are loaded into memory when they are invoked for the first time. This optimizes memory utilization, but it causes response time spikes on first load, which can be seen as a cold start problem. For Suggested Macros, this was a challenge; Zendesk overcame it by implementing preloading functionality on top of the SageMaker endpoint provisioning to load the models into memory before serving production traffic. Second, because MME unloads infrequently used models from memory, Zendesk is collaborating with AWS to add new features, discussed later in this post, that enable more explicit per-model management, so that all models see consistently low latency and “noisy neighbors” don’t impact less active models. Additionally, as an interim solution, Zendesk has right-sized the MME fleet to minimize excessive model unloading. With this, Zendesk is able to serve predictions to all their customers with low latency, around 100 milliseconds, and still achieve 90% cost savings when compared to dedicated endpoints.
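Zendesk hasn't published their preloading code, but conceptually, warming an MME amounts to sending one probe request per model before cutting production traffic over. A minimal, hypothetical sketch (the invoke function and model keys are placeholders):

```python
def warm_models(invoke_fn, model_keys, probe_payload):
    """Invoke each model once so the MME loads it into memory ahead of
    real traffic.

    invoke_fn(model_key, payload) is expected to call the endpoint with
    TargetModel=model_key; models that fail to warm are returned so the
    caller can retry or alert before going live.
    """
    failed = []
    for key in model_keys:
        try:
            invoke_fn(key, probe_payload)
        except Exception:
            failed.append(key)
    return failed

# Toy demonstration with a fake invoker that records which models were warmed
loaded = []
failed = warm_models(
    lambda key, payload: loaded.append(key),
    ["customer-a/model.tar.gz", "customer-b/model.tar.gz"],
    {"text": "warmup probe"},
)
```

In practice the probe payloads would need to be valid inputs for each model, and the warm-up loop would run after the endpoint reports `InService` but before it receives customer requests.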

On right-sizing MME, Zendesk observed during load testing that having a higher number of smaller instances (bias on horizontal scaling) behind MME was a better choice than having fewer larger memory instances (vertical scaling). Zendesk observed that bin packing too many models (beyond 500 TensorFlow models in their case) on a single large memory instance didn’t work well because memory is not the only resource on an instance that can be a bottleneck. More specifically, they observed that TensorFlow spawned multiple threads (3 x total instance vCPUs) per model, so loading over 500 models on a single instance caused kernel level limits to be breached on the max number of threads that could be spawned on an instance. Another issue with using fewer, larger instances occurred when Zendesk experienced throttling (as a safety mechanism) on some instances behind MME because the unique model invocation per second rate exceeded what the Multi Model Server (MMS) on a single instance could safely handle without browning out the instance. This was another issue that was resolved with the use of more and smaller instances.

From the observability perspective, which is a crucial component of any production application, Amazon CloudWatch metrics like invocations, CPU utilization, memory utilization, and multi-model-specific metrics like loaded models in memory, model loading time, model load wait time, and model cache hit are informative. Specifically, the breakdown of model latency helped Zendesk understand the cold start problem and its impact.
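These MME metrics are published under the `AWS/SageMaker` CloudWatch namespace, keyed by endpoint and variant name. A sketch of building a `get_metric_statistics` query for one of them follows; the endpoint name is a placeholder:

```python
from datetime import datetime, timedelta

def build_mme_metric_query(endpoint_name, metric_name, hours=1):
    """Arguments for cloudwatch.get_metric_statistics() for one MME metric.

    Multi-model metrics such as ModelCacheHit, ModelLoadingWaitTime, and
    LoadedModelCount are reported per endpoint and production variant.
    """
    now = datetime.utcnow()
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": metric_name,  # e.g. "ModelCacheHit"
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": "AllTraffic"},
        ],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 300,              # 5-minute buckets
        "Statistics": ["Average"],
    }

# The actual call would be:
#   cw = boto3.client("cloudwatch")
#   stats = cw.get_metric_statistics(**build_mme_metric_query(...))
q = build_mme_metric_query("macros-mme", "ModelCacheHit")
```

A low average `ModelCacheHit` together with a high `ModelLoadingWaitTime` is the signature of the cold start problem discussed above.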

Under the hood of MME auto scaling

Behind each multi-model endpoint, there are model hosting instances, as depicted in the following diagram. These instances load and evict multiple models to and from memory based on the traffic patterns to the models.

SageMaker continues to route inference requests for a model to the instance where the model is already loaded, so that requests are served from the cached model copy (see the following diagram, which shows the request path for the first prediction request vs. the cached prediction request path). However, if the model receives many invocation requests and there are additional instances behind the multi-model endpoint, SageMaker routes some requests to another instance to accommodate the increase. To take advantage of automated model scaling in SageMaker, make sure you have instance auto scaling set up to provision additional instance capacity. Set up your endpoint-level scaling policy with either custom parameters or invocations per minute (recommended) to add more instances to the endpoint fleet.
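The invocations-per-instance policy is configured through Application Auto Scaling with two calls: registering the endpoint variant as a scalable target, then attaching a target-tracking policy. A sketch follows; the endpoint name, capacity bounds, and target value are placeholders to tune for your workload:

```python
def build_invocations_scaling_config(endpoint_name, variant="AllTraffic",
                                     min_instances=2, max_instances=10,
                                     invocations_per_instance=1000.0):
    """Arguments for the two Application Auto Scaling calls that enable
    instance auto scaling on a SageMaker endpoint variant."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant}"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    policy = {
        "PolicyName": f"{endpoint_name}-invocations-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
        },
    }
    return target, policy

# The actual calls would be:
#   aas = boto3.client("application-autoscaling")
#   target, policy = build_invocations_scaling_config("macros-mme")
#   aas.register_scalable_target(**target)
#   aas.put_scaling_policy(**policy)
target, policy = build_invocations_scaling_config("macros-mme")
```

With this in place, SageMaker adds instances when per-instance invocations exceed the target and removes them as traffic subsides, within the min/max bounds.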

Use cases best suited for MME

SageMaker multi-model endpoints are well suited for hosting a large number of similar models that you can serve through a shared serving container and don’t need to access all the models at the same time. MME is best suited for models that are similar in size and invocation latencies. Some variation in model size is acceptable; for example, Zendesk’s models range from 10–50 MB, which works fine, but variations in size that are a factor of 10, 50, or 100 times greater aren’t suitable. Larger models may cause a higher number of loads and unloads of smaller models to accommodate sufficient memory space, which can result in added latency on the endpoint. Differences in performance characteristics of larger models could also consume resources like CPU unevenly, which could impact other models on the instance.

MME is also designed for co-hosting models that use the same ML framework because they use the shared container to load multiple models. Therefore, if you have a mix of ML frameworks in your model fleet (such as PyTorch and TensorFlow), SageMaker dedicated endpoints or multi-container hosting is a better choice. Finally, MME is suited for applications that can tolerate an occasional cold start latency penalty because infrequently used models can be off-loaded in favor of frequently invoked models. If you have a long tail of infrequently accessed models, a multi-model endpoint can efficiently serve this traffic and enable significant cost savings.

Summary

In this post, you learned how SaaS and multi-tenancy relate to ML and how SageMaker multi-model endpoints enable multi-tenancy and cost-efficiency for ML inference. You learned about Zendesk’s multi-tenanted use case of per-customer ML models and how they hosted thousands of ML models in SageMaker MME for their Suggested Macros feature and achieved 90% cost savings on inference when compared to dedicated endpoints. Hyper-personalization use cases can require thousands of ML models, and MME is a cost-effective choice for this use case. We will continue to make enhancements in MME to enable you to host models with low latency and with more granular controls for each personalized model. To get started with MME, see Host multiple models in one container behind one endpoint.


About the Authors

Syed Jaffry is a Sr. Solutions Architect with AWS. He works with a range of companies from mid-sized organizations to large enterprises, financial services to ISVs, in helping them build and operate secure, resilient, scalable, and high performance applications in the cloud.

Sowmya Manusani is a Senior Staff Machine Learning Engineer at Zendesk. She works on productionalizing NLP-based Machine Learning features that focus on improving Agent productivity for thousands of Zendesk Enterprise customers. She has experience with building automated training pipelines for thousands of personalized models and serving them using secure, resilient, scalable, and high-performance applications. In her free time, she likes to solve puzzles and try painting.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and making machine learning more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.

Deepti Ragha is a Software Development Engineer in the Amazon SageMaker team. Her current work focuses on building features to host machine learning models.


The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon

The grokking phenomenon, as reported by Power et al., refers to a regime where a long period of overfitting is followed by a seemingly sudden transition to perfect generalization. In this paper, we attempt to reveal the underpinnings of grokking via a series of empirical studies. Specifically, we uncover an optimization anomaly plaguing adaptive optimizers at extremely late stages of training, referred to as the Slingshot Mechanism. A prominent artifact of the Slingshot Mechanism can be measured by the cyclic phase transitions between stable and unstable training regimes, and can be easily…

Apple Machine Learning Research

Smart Utility Vehicle: NIO ES7 Redefines Category with Intelligent, Versatile EV Powered by NVIDIA DRIVE Orin

Accounting for nearly half of global vehicle sales in 2021, SUVs have grown in popularity given their versatility. Now, NIO aims to amp up the volume further.

This week, the electric automaker unveiled the ES7 SUV, purpose-built for the intelligent vehicle era. Its sporty yet elegant body houses an array of cutting-edge technology, including the Adam autonomous driving supercomputer, powered by NVIDIA DRIVE Orin.

SUVs gained a foothold among consumers in the late 1990s as useful haulers for people and cargo. As powertrain and design technology developed, the category has flourished, with some automakers converting their fleets to mostly SUVs and trucks.

With the ES7, NIO is adding even more to the SUV category, packing it with plenty of features to please any driver.

The intelligent EV sports 10 driving modes, in addition to autonomous capabilities that will gradually cover expressways, urban areas, parking, and battery swapping. It also includes a camping mode that maintains a comfortable cabin temperature with lower power consumption and immersive audio and lighting.

Utility Meets Technology

The technology inside the ES7 is the core of what makes it a category-transforming vehicle.

The SUV is the first to incorporate NIO’s watchtower sensor design, combining 33 high-performance lidars, radars, cameras and ultrasonics arranged around the vehicle. Data from these sensors is fused and processed by the centralized Adam supercomputer for robust surround perception.

With more than 1,000 trillion operations per second (TOPS) of performance provided by four DRIVE Orin systems-on-a-chip (SoCs), Adam can power a wide range of intelligent features in addition to perception, with enough headroom to add new capabilities over the air.

Using multiple SoCs, Adam integrates the redundancy and diversity necessary for safe autonomous operation. The first two SoCs process the 8 gigabytes of data produced every second by the vehicle’s sensor set.

The third Orin serves as a backup to ensure the system can operate safely in any situation. And the fourth enables local training, improving the vehicle with fleet learning and personalizing the driving experience based on individual user preferences.

With high-performance compute at its center, the ES7 delivers everything an SUV customer could need, and more.

A Growing Lineup

The ES7 joins the ET7 and ET5 as the third NIO vehicle built on the DRIVE Orin-powered Adam supercomputer, adding even greater selection for customers seeking a more intelligent driving experience.

NIO intends to have vehicle offerings in more than two dozen countries and regions by 2025 to bring one of the most advanced AI platforms to more customers.

Preorders for the ES7 SUV are now on the NIO app, with deliveries slated to begin in August.

The post Smart Utility Vehicle: NIO ES7 Redefines Category with Intelligent, Versatile EV Powered by NVIDIA DRIVE Orin appeared first on NVIDIA Blog.


Artificial neural networks model face processing in autism

Many of us easily recognize emotions expressed in others’ faces. A smile may mean happiness, while a frown may indicate anger. Autistic people often have a more difficult time with this task. It’s unclear why. But new research, published June 15 in The Journal of Neuroscience, sheds light on the inner workings of the brain to suggest an answer. And it does so using a tool that opens new pathways to modeling the computation in our heads: artificial intelligence.

Researchers have primarily suggested two brain areas where the differences might lie. A region on the side of the primate (including human) brain called the inferior temporal (IT) cortex contributes to facial recognition. Meanwhile, a deeper region called the amygdala receives input from the IT cortex and other sources and helps process emotions.

Kohitij Kar, a research scientist in the lab of MIT Professor James DiCarlo, hoped to zero in on the answer. (DiCarlo, the Peter de Florez Professor in the Department of Brain and Cognitive Sciences, is also a member of the McGovern Institute for Brain Research and director of MIT’s Quest for Intelligence.)

Kar began by looking at data provided by two other researchers: Shuo Wang at Washington University in St. Louis and Ralph Adolphs at Caltech. In one experiment, they showed images of faces to autistic adults and to neurotypical controls. The images had been generated by software to vary on a spectrum from fearful to happy, and the participants judged, quickly, whether the faces depicted happiness. Compared with controls, autistic adults required higher levels of happiness in the faces to report them as happy.

Modeling the brain

Kar trained an artificial neural network, a complex mathematical function inspired by the brain’s architecture, to perform the same task. The network contained layers of units that roughly resemble biological neurons that process visual information. These layers process information as it passes from an input image to a final judgment indicating the probability that the face is happy. Kar found that the network’s behavior more closely matched the neurotypical controls than it did the autistic adults.

The network also served two more interesting functions. First, Kar could dissect it. He stripped off layers and retested its performance, measuring the difference between how well it matched controls and how well it matched autistic adults. This difference was greatest when the output was based on the last network layer. Previous work has shown that this layer in some ways mimics the IT cortex, which sits near the end of the primate brain’s ventral visual processing pipeline. Kar’s results implicate the IT cortex in differentiating neurotypical controls from autistic adults.

The other function is that the network can be used to select images that might be more efficient in autism diagnoses. If the difference between how closely the network matches neurotypical controls versus autistic adults is greater when judging one set of images versus another set of images, the first set could be used in the clinic to detect autistic behavioral traits. “These are promising results,” Kar says. Better models of the brain will come along, “but oftentimes in the clinic, we don’t need to wait for the absolute best product.”

Next, Kar evaluated the role of the amygdala. Again, he used data from Wang and colleagues. They had used electrodes to record the activity of neurons in the amygdala of people undergoing surgery for epilepsy as they performed the face task. The team found that they could predict a person’s judgment based on these neurons’ activity. Kar reanalyzed the data, this time controlling for the ability of the IT-cortex-like network layer to predict whether a face truly was happy. Now, the amygdala provided very little information of its own. Kar concludes that the IT cortex is the driving force behind the amygdala’s role in judging facial emotion.

Noisy networks

Finally, Kar trained separate neural networks to match the judgments of neurotypical controls and autistic adults. He looked at the strengths or “weights” of the connections between the final layers and the decision nodes. The weights in the network matching autistic adults, both the positive or “excitatory” and negative or “inhibitory” weights, were weaker than in the network matching neurotypical controls. This suggests that sensory neural connections in autistic adults might be noisy or inefficient.

To further test the noise hypothesis, which is popular in the field, Kar added various levels of fluctuation to the activity of the final layer in the network modeling autistic adults. Within a certain range, added noise greatly increased the similarity between its performance and that of the autistic adults. Adding noise to the control network did much less to improve its similarity to the control participants. This further suggests that sensory perception in autistic people may be the result of a so-called “noisy” brain.
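As an illustration of the noise-injection idea (not the study's actual code), perturbing final-layer activations with zero-mean Gaussian noise can be sketched in plain Python:

```python
import random

def add_activation_noise(activations, sigma, seed=None):
    """Add zero-mean Gaussian noise with standard deviation sigma to each
    final-layer activation, simulating a 'noisier' sensory representation.

    With sigma = 0 the activations are returned unchanged; increasing
    sigma sweeps the range of fluctuation levels described in the text.
    """
    rng = random.Random(seed)
    return [a + rng.gauss(0.0, sigma) for a in activations]

clean = [0.2, 1.5, -0.7]
noisy = add_activation_noise(clean, sigma=0.1, seed=42)
```

In the experiments described above, the downstream decision would be recomputed from the noisy activations at each fluctuation level and compared against the behavioral data.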

Computational power

Looking forward, Kar sees several uses for computational models of visual processing. They can be further prodded, providing hypotheses that researchers might test in animal models. “I think facial emotion recognition is just the tip of the iceberg,” Kar says. They can also be used to select or even generate diagnostic content. Artificial intelligence could be used to generate content like movies and educational materials that optimally engages autistic children and adults. One might even tweak facial and other relevant pixels in what autistic people see in augmented reality goggles, work that Kar plans to pursue in the future.

Ultimately, Kar says, the work helps to validate the usefulness of computational models, especially image-processing neural networks. They formalize hypotheses and make them testable. Does one model or another better match behavioral data? “Even if these models are very far off from brains, they are falsifiable, rather than people just making up stories,” he says. “To me, that’s a more powerful version of science.”


AI for Personalized Health: Startup Advances Precision Medicine for COVID-19, Chronic Diseases

At a time when much about COVID-19 remained a mystery, U.K.-based PrecisionLife used AI and combinatorial analytics to discover new genes associated with severe symptoms and hospitalizations for patients.

The techbio company’s study, published in June 2020, pinpoints 68 novel genes associated with individuals who experienced severe disease from the virus. Over 70 percent of these targets have since been independently validated in global scientific literature as genetic risk factors for severe COVID-19 symptoms.

The startup was able to perform this early and accurate analysis using the first small COVID-19 patient dataset reported in the UK Biobank, with the help of AI, trained on NVIDIA A40 GPUs and backed by CUDA software libraries. PrecisionLife’s combinatorial analytics approach identifies interactions between genetic variants and other clinical or epidemiological factors in patients.

Results are shown in the featured image above, which depicts the disease architecture stratification of a severe COVID-19 patient population at the pandemic’s outset. Colors represent patient subgroups. Circles represent disease-associated genetic variants. And lines represent co-associated variants.

PrecisionLife technology helps researchers better understand complex disease biology at a population and personal level. Beyond COVID-19, the PrecisionLife analytics platform has been used to identify targets for precision medicine for more than 30 chronic diseases, including type 2 diabetes and ALS.

The company is a member of NVIDIA Inception, a free program that supports startups revolutionizing industries with cutting-edge technology.

Unique Disease Findings

Precision medicine considers an individual’s genetics, environment and lifestyle when selecting the treatment that could work best for them. PrecisionLife focuses on identifying how combinations of such factors impact chronic diseases.

The PrecisionLife platform enables a deeper understanding of the biology that leads to chronic disease across subgroups of patients. It uses combinatorial analytics to draw insights from the genomics and clinical history of patients — pulled from datasets provided by national biobanks, research consortia, patient charities and more.

Due to the inherent heterogeneity of chronic diseases, patients with the same diagnosis don’t necessarily experience the same causes, trajectories or treatments of disease.

The PrecisionLife platform identifies subgroups — within large patient populations — that have matching disease drivers, disease progression and treatment response. This can help researchers to select the right targets for drug development, treatments for individuals, as well as patients for clinical trials.

“Chronic disease is a complex space — a multi-genetic, multi-environmental problem with multiple patient subgroups,” said Mark Strivens, chief technology officer at PrecisionLife. “We work on technology to tackle problems that previous techniques couldn’t solve, and our unique disease findings will lead to a different set of therapeutic opportunities to best treat individuals.”

PrecisionLife technology is different from traditional analytical methods, like genome-wide association studies, which work best when single genetic variants are responsible for most of the disease risk. Instead, PrecisionLife offers combinatorial analytics, discovering significant combinations of multiple genetic and environmental factors.

The PrecisionLife platform can analyze data from 100,000 patients in just hours using NVIDIA A40 GPUs, a previously impossible feat, according to Strivens.

Plus, being a member of NVIDIA Inception gives the PrecisionLife team access to technical resources, hardware discounts and go-to-market support.

“Inception gives us access to technical expertise and connects us with other data-driven organizations that are a part of NVIDIA’s biotechnology AI ecosystem,” Strivens said. “Training from the NVIDIA Deep Learning Institute reduces the time it takes for our team members to ramp up learning a specific branch of programming.”

As a part of the groundbreaking U.K. life sciences community, PrecisionLife has access to a hub of healthcare innovation and specialist talent, Strivens said. Looking forward, the company plans to deliver new disease insights based on combinatorial analytics all across the globe.

Learn more about PrecisionLife and apply to join NVIDIA Inception.

Subscribe to NVIDIA healthcare news.



Get Your Wish: Genshin Impact Coming to GeForce NOW

Greetings, Traveler.

Prepare for adventure. Genshin Impact, the popular open-world action role-playing game, is leaving limited beta and launching for all GeForce NOW members next week.

Gamers can get their game on today with six games joining the GeForce NOW library.

As announced last week, Warhammer 40,000: Darktide is coming to the cloud at launch — with GeForce technology. This September, members will be able to leap thousands of years into the future to the time of the Space Marines, streaming on GeForce NOW with NVIDIA DLSS and more.

Plus, the 2.0.41 GeForce NOW app update brings a highly requested feature: in-stream copy-and-paste support from the clipboard while streaming from the PC and Mac apps — so there’s no need to enter a long, complex password for the digital store. Get to your games even faster with this new capability.

GeForce NOW is also giving mobile gamers more options by bringing the perks of RTX 3080 memberships and PC gaming at 120 frames per second to all devices with support for 120Hz phones. The capability is rolling out in the coming weeks.

Take a Trip to Teyvat

After the success of a limited beta and receiving great feedback from members, Genshin Impact is coming next week to everyone streaming on GeForce NOW.

Embark on a journey as a traveler from another world, stranded in the fantastic land of Teyvat. Search for your missing sibling in a vast continent made up of seven nations. Master the art of elemental combat and build a dream team of over 40 uniquely skilled playable characters – like the newest additions of Yelan and Kuki Shinobu – each with their own rich stories, personalities and combat styles.

Experience the immersive campaign, dive deep into rich quests alongside iconic characters and complete daily challenges. Charge head-on into battles solo or invite friends to join the adventures. The world is constantly expanding, so bring it wherever you go across devices, streaming soon to underpowered PCs, Macs and Chromebooks on GeForce NOW.

RTX 3080 members can level up their gaming for the best experience by streaming in 4K resolution and 60 frames per second on the PC and Mac apps.

Let the Gaming Commence

All of the action this GFN Thursday kicks off with six new games arriving on the cloud. Members can also gear up for Rainbow Six Siege Year 7 Season 2.

Rainbow Six Siege Year 7 Season 2
Get ready for a new Operator, Team Deathmatch map and more in “Rainbow Six Siege” Year 7 Season 2.

Members can look for the following streaming this week:

Finally, members still have a chance to stream the PC Building Simulator 2 open beta before it ends on Monday, June 20. Experience deeper simulation, an upgraded career mode and powerful new customization features to bring your ultimate PC to life.

To start your weekend gaming adventures, we’ve got a question. Let us know your thoughts on Twitter or in the comments below.



How Disney Improved Activity Recognition Through Multimodal Approaches with PyTorch

Introduction

Among the many things Disney Media & Entertainment Distribution (DMED) is responsible for is the management and distribution of a huge array of media assets, including news, sports, entertainment and features, episodic programs, marketing, advertising, and more.

Our team focuses on media annotation as part of DMED Technology’s content platforms group. In our day-to-day work, we automatically analyze a variety of content that constantly challenges the efficiency of our machine learning workflow and the accuracy of our models.

Several of our colleagues recently discussed the workflow efficiencies that we achieved by switching to an end-to-end video analysis pipeline using PyTorch, as well as how we approach animated character recognition. We invite you to read more about both in this previous post.

While the conversion to an end-to-end PyTorch pipeline is a solution that any company might benefit from, animated character recognition was a uniquely-Disney concept and solution.

In this article we will focus on activity recognition, which is a general challenge across industries — but with some specific opportunities when leveraged in the media production field, because we can combine audio, video, and subtitles to provide a solution.

Experimenting with Multimodality

Working on a multimodal problem adds more complexity to the usual training pipelines. Having multiple information modes for each example means that the multimodal pipeline has to have specific implementations to process each mode in the dataset. Usually after this processing step, the pipeline has to merge or fuse the outputs.

Our initial experiments in multimodality were completed using the MMF framework. MMF is a modular framework for vision and language multimodal research. MMF contains reference implementations of state-of-the-art vision and language models and has also powered multiple research projects at Meta AI Research (as seen in this poster presented at PyTorch Ecosystem Day 2020). Along with the recent release of TorchMultimodal, a PyTorch library for training state-of-the-art multimodal models at scale, MMF highlights the growing interest in multimodal understanding.

MMF tackles this complexity with modular management of all the elements of the pipeline through a wide set of different implementations for specific modules, ranging from the processing of the modalities to the fusion of the processed information.

In our scenario, MMF was a great entry point to experiment with multimodality. It allowed us to iterate quickly by combining audio, video and closed captioning and experiment at different levels of scale with certain multimodal models, shifting from a single GPU to TPU Pods.

Multimodal Transformers

With a workbench based on MMF, our initial model was a simple concatenation of features from each modality; it then evolved into a pipeline that included a Transformer-based fusion module to combine the different input modes.

Specifically, we made use of the fusion module called MMFTransformer, developed in collaboration with the Meta AI Research team. This is an implementation based on VisualBERT for which the necessary modifications were added to be able to work with text, audio and video.

Despite decent results with the out-of-the-box MMFTransformer implementation, we were still far from our goal, and the Transformer-based models required more data than we had available.

Searching for less data-hungry solutions

Searching for less data-hungry solutions, our team started studying MLP-Mixer. This architecture, proposed by the Google Brain team, provides an alternative to well-established de facto architectures like convolutions and self-attention for computer vision tasks.

MLP-Mixer

The core idea behind Mixer variants consists of replacing the convolutions or the self-attention mechanisms used in Transformers with multilayer perceptrons. This change in architecture favors the performance of the model in high-data regimes (especially with respect to Transformers), while also opening some questions regarding the inductive biases hidden in the convolutions and the self-attention layers.

Those proposals perform great in solving image classification tasks by splitting the image in chunks, flattening those chunks into 1D vectors and passing them through a sequence of Mixer Layers.
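To make the patch-and-mix idea concrete, here is a minimal PyTorch sketch of one Mixer layer: a token-mixing MLP applied across patches (via a transpose) followed by a channel-mixing MLP applied per patch. The dimensions are illustrative, not the ones used in production.

```python
import torch
from torch import nn

class MixerBlock(nn.Module):
    """One MLP-Mixer layer: token-mixing MLP, then channel-mixing MLP."""
    def __init__(self, num_patches, dim, token_hidden=256, channel_hidden=1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Token mixing: operates across the patch (sequence) dimension.
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        # Channel mixing: operates across features, independently per patch.
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(),
            nn.Linear(channel_hidden, dim))

    def forward(self, x):                        # x: (batch, patches, dim)
        y = self.norm1(x).transpose(1, 2)        # (batch, dim, patches)
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x

# A 224x224 image split into 16x16 patches gives 14*14 = 196 flattened chunks.
patches = torch.randn(2, 196, 512)
out = MixerBlock(num_patches=196, dim=512)(patches)
print(out.shape)   # torch.Size([2, 196, 512])
```

The residual connections and the shape-preserving forward pass let these blocks be stacked into an arbitrarily deep sequence, as described above.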

Inspired by the advantages of Mixer based architectures, our team searched for parallelisms with the type of problems we try to solve in video classification: specifically, instead of a single image, we have a set of frames that need to be classified, along with audio and closed captioning in the shape of new modalities.

Activity Recognition reinterpreting the MLP-Mixer

Our proposal takes the core idea of the MLP-Mixer (applying multilayer perceptrons to both a sequence and its transpose) and extends it into a multimodal framework that allows us to process video, audio, and text with the same architecture.

For each of the modalities, we use different extractors that will provide embeddings describing the content. Given the embeddings of each modality, the MLP-Mixer architecture solves the problem of deciding which of the modalities might be the most important, while also weighing how much each modality contributes to the final labeling.

For example, when it comes to detecting laughs, sometimes the key information is in audio or in the frames, and in some of the cases we have a strong signal in the closed caption.

For video, we tried two approaches: processing each frame separately with a ResNet34 to obtain a sequence of embeddings, and using a video-specific model called R3D, pre-trained on ImageNet and Kinetics400, respectively.

To process the audio, we use the pretrained ResNet34, and we remove the final layers to be able to extract 2D embeddings from the audio spectrograms (for 224×224 images we end up with 7×7 embeddings).

For closed captioning, we are using a pre-trained BERT-large, with all layers frozen, except for the Embeddings & LayerNorms.
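The selective freezing can be expressed as a small helper over `named_parameters`. In Hugging Face BERT checkpoints, parameter names contain the substrings "embeddings" and "LayerNorm", so matching on those leaves exactly the pieces trainable; the toy module below only stands in for the real BERT-large so the sketch runs without downloading weights.

```python
import torch
from torch import nn

def freeze_except(model, keep=("embeddings", "LayerNorm")):
    """Freeze every parameter whose name matches none of the `keep` substrings."""
    for name, p in model.named_parameters():
        p.requires_grad = any(k in name for k in keep)

# Toy stand-in with BERT-like parameter names (not the real BERT-large).
toy = nn.Module()
toy.embeddings = nn.Embedding(100, 16)
toy.LayerNorm = nn.LayerNorm(16)
toy.encoder = nn.Linear(16, 16)

freeze_except(toy)
trainable = {n for n, p in toy.named_parameters() if p.requires_grad}
print(trainable)   # embeddings and LayerNorm parameters only
```

Freezing the bulk of the language model keeps the number of trainable parameters small, which matters in the low/mid-data regime discussed below.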

Once we have extracted the embedding from each modality, we concatenate them into a single sequence and pass it through a set of MLP-Mixer blocks; next we use average pooling & a classification head to get predictions.
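The concatenate-mix-pool-classify pipeline can be sketched end to end as below. This assumes the per-modality embeddings have already been projected to a common dimension; the token counts, depth, and widths are illustrative, not the production configuration.

```python
import torch
from torch import nn

class MixerBlock(nn.Module):
    def __init__(self, tokens, dim, hidden=128):
        super().__init__()
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.tok = nn.Sequential(nn.Linear(tokens, hidden), nn.GELU(),
                                 nn.Linear(hidden, tokens))
        self.ch = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                nn.Linear(hidden, dim))

    def forward(self, x):                         # x: (batch, tokens, dim)
        x = x + self.tok(self.n1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.ch(self.n2(x))

class MultimodalMixer(nn.Module):
    def __init__(self, tokens, dim, num_classes, depth=4):
        super().__init__()
        self.blocks = nn.Sequential(*[MixerBlock(tokens, dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video, audio, text):
        x = torch.cat([video, audio, text], dim=1)  # one joint token sequence
        x = self.blocks(x)
        return self.head(x.mean(dim=1))             # average pool + classify

video = torch.randn(2, 8, 256)    # 8 frame embeddings
audio = torch.randn(2, 49, 256)   # 7x7 spectrogram embeddings, flattened
text  = torch.randn(2, 32, 256)   # 32 caption token embeddings
logits = MultimodalMixer(tokens=8 + 49 + 32, dim=256, num_classes=15)(video, audio, text)
print(logits.shape)   # torch.Size([2, 15])
```

Because the token-mixing MLPs see the full concatenated sequence, the blocks can trade information across modalities as well as across time, which is what lets the model weigh modalities per class.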

Our experiments have been performed on a custom, manually labeled dataset for activity recognition with 15 classes, which we know from experiments are hard to predict and cannot all be predicted accurately using a single modality.

These experiments have shown a significant increase in performance using our approach, especially in a low/mid-data regime (75K training samples).

When it comes to using only Text and Audio, our experiments showed a 15 percent improvement in accuracy over using a classifier on top of the features extracted by state-of-the-art backbones.

Using Text, Audio and Video, we have seen a 17 percent improvement in accuracy over using Meta AI's MMF framework, which uses a VisualBERT-like model to combine modalities using more powerful state-of-the-art backbones.

We have since extended the initial model to cover up to 55 activity classes and 45 event classes. One of the challenges we expect to tackle in the future is to include all activities and events, even those that are less frequent.

Interpreting the MLP-Mixer mode combinations

An MLP-Mixer is a stack of multilayer perceptrons. Very roughly, it can be approximated as a linear operation, in the sense that, once trained, the weights are fixed and the input directly determines the output.

Under that approximation, for an input consisting of NxM numbers we could find an NxM matrix that, when multiplied elementwise with the input, approximates the predictions of the MLP-Mixer for a class.

We will call this matrix a stencil, and if we have access to it, we can find what parts of the input embeddings are responsible for a specific prediction.

You can think of it as a punch card with holes in specific positions. Only information in those positions will pass and contribute to a specific prediction. So we can measure the intensity of the input at those positions.

Of course, this is an oversimplification, and there won’t exist a unique stencil that perfectly represents all of the contributions of the input to a class (otherwise that would mean that the problem could be solved linearly). So this should be used for visualization purposes only, not as an accurate predictor.

Once we have a set of stencils for each class, we can effortlessly measure input contribution without relying on any external visualization techniques.

To find a stencil, we can start from a “random noise” stencil and optimize it to maximize the activations for a specific class by just back-propagating through the MLP-Mixer.

By doing this we can end up with many valid stencils, and we can reduce them to a few by using K-means to cluster them into similar stencils and averaging each cluster.
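The stencil search can be sketched as a standard gradient-ascent loop: freeze the classifier, treat the stencil as the only trainable tensor, and maximize the target-class logit of the masked input. The small MLP below merely stands in for the trained Mixer; shapes and hyperparameters are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)
N, M, num_classes, target = 4, 16, 15, 3

# Frozen stand-in for the trained MLP-Mixer classifier.
model = nn.Sequential(nn.Flatten(), nn.Linear(N * M, 64), nn.GELU(),
                      nn.Linear(64, num_classes))
for p in model.parameters():
    p.requires_grad = False

example = torch.randn(1, N, M)                       # a fixed input embedding grid
stencil = torch.randn(1, N, M, requires_grad=True)   # start from random noise
opt = torch.optim.Adam([stencil], lr=0.05)

with torch.no_grad():
    before = model(stencil * example)[0, target].item()

for _ in range(200):
    opt.zero_grad()
    logits = model(stencil * example)   # elementwise mask, then predict
    loss = -logits[0, target]           # maximize the target-class activation
    loss.backward()                     # gradients flow only into the stencil
    opt.step()

with torch.no_grad():
    after = model(stencil * example)[0, target].item()
print(before, "->", after)
```

Positions where the optimized stencil has large magnitude are the ones driving the target class; running this from many random initializations and clustering the results with K-means yields the few representative stencils described above.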

Using the Mixer to get the best of each world

MLP-Mixer, used as an image classification model without convolutional layers, requires a lot of data: the lack of inductive bias, one of the model's strengths overall, becomes a weakness when working in low-data domains.

When used instead as a way to combine information previously extracted by large pretrained backbones, rather than as a full end-to-end solution, Mixers shine. The Mixer's strength lies in finding temporal or structural coherence between different inputs. For example, in video-related tasks we could extract embeddings from the frames using a powerful, pretrained model that understands what is going on at frame level and use the Mixer to make sense of it in a sequential manner.

This way of using the Mixer allows us to work with limited amounts of data and still get better results than what was achieved with Transformers. This is because Mixers seem to be more stable during training and seem to pay attention to all the inputs, while Transformers tend to collapse and pay attention only to some modalities/parts of the sequence.

Acknowledgements: We would like to thank the Meta AI Research and Partner Engineering teams for this collaboration.
