ByteDance processes billions of videos daily using its multimodal video understanding models on AWS Inferentia2

This is a guest post authored by the team at ByteDance.

ByteDance is a technology company that operates a range of content platforms to inform, educate, entertain, and inspire people across languages, cultures, and geographies. Users trust and enjoy our content platforms because of the rich, intuitive, and safe experiences they provide. These experiences are made possible by our machine learning (ML) backend engine, with ML models built for video understanding, search, recommendation, advertising, and novel visual effects.

In support of its mission to “Inspire Creativity and Enrich Life,” we’ve made it straightforward and fun for people to engage with, create, and consume content. People can also discover and transact with a suite of more than a dozen products and services, such as CapCut, e-Shop, Lark, Pico, and Mobile Legends: Bang Bang.

At ByteDance, we collaborated with Amazon Web Services (AWS) to deploy multimodal large language models (LLMs) for video understanding using AWS Inferentia2 across multiple AWS Regions around the world. By using sophisticated ML algorithms, the platform efficiently scans billions of videos each day. We use this process to identify and flag content that violates community guidelines, enabling a better experience for all users. By using Amazon EC2 Inf2 instances for these video understanding workloads, we were able to cut the inference cost by half.

In this post, we discuss the use of multimodal LLMs for video understanding, the solution architecture, and techniques for performance optimization.

Overcoming video understanding hurdles with multimodal LLMs

Multimodal LLMs enable a richer understanding of the world by accepting various forms of digital content as input, greatly increasing the range of useful applications we can now build. The need for AI systems capable of processing diverse content forms has become increasingly apparent. Multimodal LLMs have risen to meet this challenge by ingesting multiple data modalities, including text, images, audio, and video (refer to the following diagram), which allows for a fuller understanding of content, mimicking human perception and interaction with the world. The enhanced capabilities of these models are evident in their performance, which far surpasses that of traditional models in tasks ranging from sophisticated virtual assistants to advanced content creation. By expanding the boundaries of AI capabilities and paving the way for more natural and intuitive interactions with technology, these models aren’t just improving existing applications but opening doors to entirely new possibilities in AI and user experience.

In our operations, the implementation of multimodal LLMs for video understanding represents a significant shift in how we think about AI-driven content analysis. This innovation addresses the daily challenge of processing billions of videos, overcoming the efficiency limits of traditional AI models. We’ve developed our own multimodal LLM architecture, designed to achieve state-of-the-art performance across single-image, multi-image, and video applications. Unlike traditional ML models, this generative AI–enabled system integrates multiple input streams into a unified representational space. Cross-modal attention mechanisms facilitate information exchange between modalities, and fusion layers combine representations from different modalities. The decoder then generates output based on the fused multimodal representation, enabling a more nuanced and context-aware analysis of content.
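To illustrate the idea, the following is a minimal, self-contained PyTorch sketch of a cross-modal attention and fusion block of the kind described above. It is a simplified illustration of the general technique, not our production architecture, and the module names and dimensions are arbitrary placeholders.

```python
# Sketch: text tokens attend over visual tokens, and a fusion layer merges the
# two streams into one representation that a decoder could consume.
import torch
import torch.nn as nn

class CrossModalFusionBlock(nn.Module):
    def __init__(self, d_model: int = 1024, n_heads: int = 16):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.GELU())

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # Cross-modal attention: text queries attend over video/image tokens
        attended, _ = self.cross_attn(text_tokens, visual_tokens, visual_tokens)
        # Fusion layer: combine the original text stream with the attended visual stream
        return self.fuse(torch.cat([text_tokens, attended], dim=-1))

block = CrossModalFusionBlock()
text = torch.rand(2, 32, 1024)    # (batch, text tokens, hidden)
video = torch.rand(2, 256, 1024)  # (batch, frame patch tokens, hidden)
fused = block(text, video)        # (2, 32, 1024), passed on to the LLM decoder
```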

Solution overview

We’ve collaborated with AWS since the first generation of Inferentia chips. Our video understanding department has been committed to finding more cost-efficient solutions that deliver higher performance to better meet ever-growing business needs. During this period, we found that AWS has been continually inventing and adding features and capabilities to the AWS Neuron software development kit (SDK), the software that enables high-performance workloads on Inferentia chips. The popular Meta Llama and Mistral models were well supported with high performance on Inferentia2 shortly after their open source release. Therefore, we began to evaluate the Inferentia2-based solution, illustrated in the following diagram.

We made the strategic decision to deploy a fine-tuned, mid-sized LLM on Inferentia2 to provide a performant and cost-effective solution capable of processing billions of videos daily. The process was a comprehensive effort aimed at optimizing end-to-end response time for our video understanding workload. The team explored a wide range of parameters, including tensor parallel sizes, compile configurations, sequence lengths, and batch sizes. We employed various parallelization techniques, such as multi-threading and model replication (for non-LLM models) across multiple NeuronCores. Through these optimizations, which included parallelizing sequence steps, reusing devices, and using auto-benchmark and profiling tools, we achieved a substantial performance boost, maintaining our position at the forefront of industry performance standards.
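As an example of the replication pattern for the smaller non-LLM models, the following sketch assumes the torch-neuronx tracing and DataParallel APIs: a small vision model is compiled once and replicated across the NeuronCores of an Inf2 instance. The model, shapes, and batch sizes are hypothetical and purely illustrative.

```python
# Sketch: compiling a small non-LLM model for Inferentia2 and replicating it
# across NeuronCores with torch-neuronx. Model and shapes are placeholders.
import torch
import torch_neuronx

class FrameClassifier(torch.nn.Module):
    """Hypothetical stand-in for a small per-frame model used alongside the LLM."""
    def __init__(self, num_labels: int = 8):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 16, kernel_size=3, stride=2)
        self.head = torch.nn.Linear(16, num_labels)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        features = torch.relu(self.conv(frames)).mean(dim=(2, 3))
        return self.head(features)

model = FrameClassifier().eval()
example = torch.rand(1, 3, 224, 224)

# Inferentia2 compiles for fixed shapes, so trace with a representative input
traced = torch_neuronx.trace(model, example)

# Replicate the compiled model across the instance's NeuronCores; input batches
# are split along dim 0 and run on the replicas in parallel
replicated = torch_neuronx.DataParallel(traced)
outputs = replicated(torch.rand(8, 3, 224, 224))
```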

We used tensor parallelism to effectively distribute and scale the model across multiple accelerators in an Inf2 instance. We used static batching, which improved the latency and throughput of our models by making sure that data is processed in uniform, fixed-size batches during inference. Repeated n-gram filtering significantly improved the quality of automatically generated text and reduced inference time. Quantizing the weights of the multimodal model from FP16/BF16 to INT8 format allowed it to run more efficiently on Inferentia2 with less device memory usage, without compromising accuracy. Using these techniques together with model serialization, we optimized throughput on the inf2.48xlarge instance by maximizing the batch size such that the model could still fit on a single accelerator, allowing us to deploy multiple model replicas on the same instance. This comprehensive optimization strategy helped us meet our latency requirements while providing optimal throughput and cost reduction. Notably, the Inferentia2-based solution cut the cost by half compared to comparable Amazon Elastic Compute Cloud (Amazon EC2) instances, highlighting the significant economic advantages of using Inferentia2 chips for large-scale video understanding tasks.
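To make these settings concrete, the following is a minimal sketch of how such a configuration might be expressed with the AWS Neuron transformers-neuronx library, assuming a Llama-style decoder. The checkpoint path, tensor parallel degree, batch size, and sequence length are illustrative placeholders rather than our production values.

```python
# Sketch: compiling a decoder LLM for Inferentia2 with transformers-neuronx.
# All values below (paths, tp_degree, batch_size, n_positions) are illustrative.
from transformers_neuronx.config import NeuronConfig, QuantizationConfig
from transformers_neuronx.llama.model import LlamaForSampling

neuron_config = NeuronConfig(
    # Weight-only INT8 quantization: store weights as int8, dequantize to FP16 at run time
    quant=QuantizationConfig(quant_dtype="s8", dequant_dtype="f16"),
)

model = LlamaForSampling.from_pretrained(
    "/models/video-understanding-llm",  # hypothetical local checkpoint path
    batch_size=4,        # static batch size used for every inference request
    tp_degree=8,         # shard the model across 8 NeuronCores with tensor parallelism
    n_positions=2048,    # maximum sequence length compiled into the graph
    amp="f16",           # compute in FP16
    neuron_config=neuron_config,
)
model.to_neuron()        # compile and load onto the NeuronCores

# Serialize the compiled artifacts so later deployments can skip recompilation
model.save("/models/video-understanding-llm-neuron")
```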

The following diagram shows how we deploy our LLM container on Amazon EC2 Inf2 instances using Neuron.

In summary, our collaboration with AWS has revolutionized video understanding, setting new industry standards for efficiency and accuracy. The multimodal LLM’s ability to adapt to global market demands and its scalable performance on Inferentia2 chips underscore the profound impact of this technology in safeguarding the platform’s global community.

Future plans

Looking further ahead, the development of a unified multimodal LLM represents an important shift in video understanding technology. This ambitious project aims to create a universal content tokenizer capable of processing all content types and aligning them within a common semantic space. After it’s tokenized, the content will be analyzed by advanced large models, generating appropriate content understanding outputs regardless of the original format (as shown in the following diagram). This unified approach can streamline the content understanding process, potentially improving both efficiency and consistency across diverse content types.
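As a purely illustrative sketch of this idea (not our actual design), modality-specific encoders could project each content type into tokens in one shared semantic space, so that a single large model can consume any input format downstream. The modalities, dimensions, and class names below are hypothetical.

```python
# Toy sketch of a "universal content tokenizer": per-modality projections into a
# common semantic space. Illustrative only.
import torch
import torch.nn as nn

class UniversalTokenizer(nn.Module):
    def __init__(self, d_model: int = 1024):
        super().__init__()
        # One lightweight projection per modality into the shared space
        self.projections = nn.ModuleDict({
            "text": nn.Linear(768, d_model),    # e.g., text encoder hidden size
            "image": nn.Linear(1024, d_model),  # e.g., vision encoder patch features
            "audio": nn.Linear(512, d_model),   # e.g., audio encoder frame features
        })

    def forward(self, modality: str, features: torch.Tensor) -> torch.Tensor:
        # Output tokens have shape (batch, seq, d_model) regardless of input modality
        return self.projections[modality](features)

tokenizer = UniversalTokenizer()
image_tokens = tokenizer("image", torch.rand(1, 256, 1024))
text_tokens = tokenizer("text", torch.rand(1, 32, 768))
unified_sequence = torch.cat([image_tokens, text_tokens], dim=1)  # consumed by the large model
```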

For additional learning, refer to the paper The Evolution of Multimodal Model Architectures.

The implementation of this comprehensive strategy sets new benchmarks in video understanding technology, striking a balance between accuracy, speed, and cultural sensitivity in an increasingly complex digital ecosystem. This forward-looking approach not only addresses current challenges in video understanding but also positions the system at the forefront of AI-driven content analysis and management for the foreseeable future.

By using cutting-edge AI techniques and a holistic approach to content understanding, this next-generation content understanding system aims to set new industry standards, providing safer and more inclusive online environments while adapting to the ever-evolving landscape of digital communication. At the same time, AWS is investing in next-generation AI chips such as AWS Trainium2, which will continue to push the performance boundaries while keeping costs under control. At ByteDance, we’re planning to test out this new generation of AWS AI chips and adopt them appropriately as the models and workloads continue to evolve.

Conclusion

The collaboration between ByteDance and AWS has revolutionized video understanding through the deployment of multimodal LLMs on Inferentia2 chips. This partnership has yielded remarkable results: the ability to process billions of videos daily, significant cost reductions, and higher performance compared to similar Amazon EC2 instances.

As ByteDance continues to innovate with projects such as the unified multimodal large model, we remain committed to pushing the boundaries of AI-driven content analysis. Our goal is to make sure our platforms remain safe, inclusive, and creative spaces for our global community, setting new industry standards for efficient video understanding.

To learn more about Inf2 instances, refer to Amazon EC2 Inf2 Architecture.


About the Authors

Wangpeng An, Principal Algorithm Engineer at TikTok, specializes in multimodal LLMs for video understanding, advertising, and recommendations. He has led key projects in model acceleration, content moderation, and Ads LLM pipelines, enhancing TikTok’s real-time machine learning systems.

Haotian Zhang is a Tech Lead MLE at TikTok, specializing in content understanding, search, and recommendation. He received a PhD in ML from the University of Waterloo. At TikTok, he leads a group of engineers to improve the efficiency, robustness, and effectiveness of training and inference for LLMs and multimodal LLMs, especially for large distributed ML systems.

Xiaojie Ding is a senior engineer at TikTok, focusing on content moderation system development, model resource and deployment optimization, and algorithm engineering stability construction. In his free time, he likes to play single-player games.

Nachuan Yang is a senior engineer at TikTok, focusing on content security and moderation. He has successively been engaged in the construction of moderation systems, model applications, and deployment and performance optimization.

Kairong Sun is a Senior SRE on the AML Team at ByteDance. His role focuses on maintaining the seamless operation and efficient allocation of resources within the cluster, specializing in cluster machine maintenance and resource optimization.

The authors would like to thank other ByteDance and AWS team members for their contributions: Xi Dai, Kaili Zhao, Zhixin Zhang, Jin Ye, and Yann Xia from ByteDance; Jia Dong, Bingyang Huang, Kamran Khan, Shruti Koparkar, and Diwakar Bansal from AWS.
