Vid2Seq: a pretrained visual language model for describing multi-event videos

Videos have become an increasingly important part of our daily lives, spanning fields such as entertainment, education, and communication. Understanding the content of videos, however, is challenging because videos often contain multiple events occurring at different time scales. For example, a video of a musher hitching up dogs to a dog sled before they all race away involves a long event (the dogs pulling the sled) and a short event (the dogs being hitched to the sled). One way to spur research in video understanding is via the task of dense video captioning, which consists of temporally localizing and describing all events in a minutes-long video. This differs from single image captioning and standard video captioning, which describe an image or a short video with a single sentence.

Dense video captioning systems have wide applications, such as making videos accessible to people with visual or auditory impairments, automatically generating chapters for videos, or improving search for video moments in large databases. Current dense video captioning approaches, however, have several limitations. For example, they often contain highly specialized task-specific components, which make it challenging to integrate them into powerful foundation models. Furthermore, they are often trained exclusively on manually annotated datasets, which are very difficult to obtain and hence do not scale.

In this post, we introduce “Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning”, to appear at CVPR 2023. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. In order to pre-train this unified model, we leverage unlabeled narrated videos by reformulating sentence boundaries of transcribed speech as pseudo-event boundaries, and using the transcribed speech sentences as pseudo-event captions. The resulting Vid2Seq model pre-trained on millions of narrated videos improves the state of the art on a variety of dense video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the few-shot dense video captioning setting, the video paragraph captioning task, and the standard video captioning task. Finally, we have also released the code for Vid2Seq here.

Vid2Seq is a visual language model that predicts dense event captions together with their temporal grounding in a video by generating a single sequence of tokens.

A visual language model for dense video captioning

Multimodal transformer architectures have improved the state of the art on a wide range of video tasks, such as action recognition. However, it is not straightforward to adapt such an architecture to the complex task of jointly localizing and captioning events in minutes-long videos.

To achieve this, we augment a visual language model with special time tokens (treated like text tokens) that represent discretized timestamps in the video, similar to Pix2Seq in the spatial domain. Given visual inputs, the resulting Vid2Seq model can both take as input and generate sequences of text and time tokens. First, this enables the Vid2Seq model to understand the temporal information of the transcribed speech input, which is cast as a single sequence of tokens. Second, this allows Vid2Seq to jointly predict dense event captions and temporally ground them in the video while generating a single sequence of tokens.
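As a rough illustration of the time-token idea, the sketch below quantizes timestamps relative to the video duration and serializes events into one token sequence. The `<time_k>` token format, the choice of 100 bins, the whitespace word split, and the function names are all assumptions made for this example; they are not taken from the released code.

```python
# Illustrative sketch only: token format, bin count, and names are assumptions.

def timestamp_to_bin(t_seconds, duration_seconds, num_bins=100):
    """Map an absolute timestamp to a bin index in [0, num_bins - 1]."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    # Clamp to the video extent, then quantize the relative position.
    fraction = min(max(t_seconds / duration_seconds, 0.0), 1.0)
    return min(int(fraction * num_bins), num_bins - 1)


def time_token(bin_index):
    """Render a bin index as a (hypothetical) special token string."""
    return f"<time_{bin_index}>"


def events_to_sequence(events, duration_seconds, num_bins=100):
    """Serialize (start, end, caption) events, ordered by start time,
    as: start token, end token, caption words, repeated per event."""
    tokens = []
    for start, end, caption in sorted(events, key=lambda e: e[0]):
        tokens.append(time_token(timestamp_to_bin(start, duration_seconds, num_bins)))
        tokens.append(time_token(timestamp_to_bin(end, duration_seconds, num_bins)))
        tokens.extend(caption.split())  # stand-in for a real text tokenizer
    return tokens


# A 120-second video with a short event followed by a long one.
events = [(30.0, 120.0, "dogs pull the sled"), (0.0, 30.0, "dogs are hitched")]
print(events_to_sequence(events, duration_seconds=120.0))
# ['<time_0>', '<time_25>', 'dogs', 'are', 'hitched',
#  '<time_25>', '<time_99>', 'dogs', 'pull', 'the', 'sled']
```

Quantizing relative to the video duration keeps the number of time tokens fixed regardless of video length, at the cost of coarser resolution on longer videos.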

The Vid2Seq architecture includes a visual encoder and a text encoder, which encode the video frames and the transcribed speech input, respectively. The resulting encodings are then forwarded to a text decoder, which autoregressively predicts the output sequence of dense event captions together with their temporal localization in the video. The architecture is initialized with a powerful visual backbone and a strong language model.
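Because the decoder emits captions and their temporal localization in one sequence, recovering structured events from its output is a small parsing step. The sketch below assumes a hypothetical serialization in which each event appears as a start token, an end token, and then its caption words, with time tokens written as `<time_k>` over 100 bins; the real tokenizer and vocabulary may differ.

```python
import re

# Hypothetical time-token format; the actual vocabulary may differ.
TIME_TOKEN = re.compile(r"^<time_(\d+)>$")


def parse_events(tokens, duration_seconds, num_bins=100):
    """Turn a decoded token list into (start_sec, end_sec, caption) tuples.

    Each bin index is mapped back to the left edge of its bin. Tokens that
    do not fit the start/end/caption pattern are skipped.
    """
    events = []
    i = 0
    while i < len(tokens):
        start_match = TIME_TOKEN.match(tokens[i])
        end_match = TIME_TOKEN.match(tokens[i + 1]) if i + 1 < len(tokens) else None
        if not (start_match and end_match):
            i += 1  # malformed or stray token: move on
            continue
        i += 2
        words = []
        # Caption words run until the next time token or the end of the sequence.
        while i < len(tokens) and not TIME_TOKEN.match(tokens[i]):
            words.append(tokens[i])
            i += 1
        start = int(start_match.group(1)) * duration_seconds / num_bins
        end = int(end_match.group(1)) * duration_seconds / num_bins
        events.append((start, end, " ".join(words)))
    return events
```

Skipping malformed tokens rather than raising is one possible design choice here; a stricter parser could reject any sequence that does not follow the expected pattern.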

Vid2Seq model overview: We formulate dense event captioning as a sequence-to-sequence problem, using special time tokens to allow the model to seamlessly understand and generate sequences of tokens containing both textual semantic information and temporal localization information grounding each text sentence in the video.

Large-scale pre-training on untrimmed narrated videos

Due to the dense nature of the task, the manual collection of annotations for dense video captioning is particularly expensive. Hence, we pre-train the Vid2Seq model using unlabeled narrated videos, which are readily available at scale. In particular, we use the YT-Temporal-1B dataset, which includes 18 million narrated videos covering a wide range of domains.

We use transcribed speech sentences and their corresponding timestamps as supervision, which are cast as a single sequence of tokens. We pre-train Vid2Seq with a generative objective that teaches the decoder to predict the transcribed speech sequence given visual inputs only, and a denoising objective that encourages multimodal learning by requiring the model to predict masked tokens given a noisy transcribed speech sequence and visual inputs. In particular, noise is added to the speech sequence by randomly masking out spans of tokens.
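To make the denoising input more concrete, here is a minimal span-masking sketch. It assumes a T5-style scheme in which each masked span is replaced by a sentinel token and the target lists each sentinel followed by the tokens it hides. The sentinel format `<mask_k>`, the fixed span length, and the masking probability are illustrative assumptions; this post does not specify those details.

```python
import random


def mask_spans(tokens, start_prob=0.15, span_len=3, seed=None):
    """Randomly replace spans of tokens with sentinel tokens.

    Returns (corrupted, target): `corrupted` is the noisy input sequence and
    `target` holds each sentinel followed by the tokens it replaced.
    """
    if span_len < 1:
        raise ValueError("span_len must be at least 1")
    rng = random.Random(seed)
    corrupted, target = [], []
    i, sentinel_id = 0, 0
    while i < len(tokens):
        if rng.random() < start_prob:
            sentinel = f"<mask_{sentinel_id}>"  # hypothetical sentinel format
            span = tokens[i:i + span_len]       # may be shorter at the end
            corrupted.append(sentinel)
            target.append(sentinel)
            target.extend(span)
            i += len(span)
            sentinel_id += 1
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, target


speech = "<time_0> <time_25> first add the flour <time_25> <time_60> then whisk the eggs".split()
noisy_input, masked_target = mask_spans(speech, start_prob=0.3, span_len=2, seed=0)
```

In this sketch, time tokens and text tokens are masked alike, so a model trained on such pairs would have to recover both words and timestamps from the visual inputs and the surrounding context.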

Vid2Seq is pre-trained on unlabeled narrated videos with a generative objective (top) and a denoising objective (bottom).

Results on downstream dense video captioning benchmarks

The resulting pre-trained Vid2Seq model can be fine-tuned on downstream tasks with a simple maximum likelihood objective using teacher forcing (i.e., predicting the next token given previous ground-truth tokens). After fine-tuning, Vid2Seq notably improves the state of the art on three standard downstream dense video captioning benchmarks (ActivityNet Captions, YouCook2 and ViTT) and two video clip captioning benchmarks (MSR-VTT, MSVD). In our paper, we provide additional ablation studies and qualitative results, as well as results in the few-shot setting and on the video paragraph captioning task.
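The teacher-forcing objective can be written out in a few lines. The sketch below uses plain Python and made-up token ids rather than any real model: the decoder input is the ground-truth target shifted right by one position, and the loss is the average negative log-likelihood of each ground-truth token under the logits predicted at that step. The function names and the beginning-of-sequence id are assumptions for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def teacher_forcing_inputs(target_ids, bos_id):
    """Decoder inputs: the ground-truth sequence shifted right by one token."""
    return [bos_id] + target_ids[:-1]


def sequence_nll(step_logits, target_ids):
    """Mean negative log-likelihood of the targets.

    step_logits[t] holds the logits predicted at step t, after the decoder
    has seen the ground-truth tokens before position t.
    """
    if len(step_logits) != len(target_ids) or not target_ids:
        raise ValueError("need one non-empty logits row per target token")
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)
```

Minimizing this quantity over the training set is equivalent to maximizing the likelihood of the annotated sequences. At inference time, the ground-truth prefix is replaced by the model's own previous predictions.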

Comparison to state-of-the-art methods for dense video captioning (left) and for video clip captioning (right), on the CIDEr metric (higher is better).

Conclusion

We introduce Vid2Seq, a novel visual language model for dense video captioning that simply predicts all event boundaries and captions as a single sequence of tokens. Vid2Seq can be effectively pre-trained on unlabeled narrated videos at scale, and achieves state-of-the-art results on various downstream dense video captioning benchmarks. Learn more from the paper and grab the code here.

Acknowledgements

This research was conducted by Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic and Cordelia Schmid.
