Weakly-Supervised Multimodal Video Pre-Training via Image-Caption Pseudo-Labeling
Abstract
Large-scale weakly-supervised training has enabled transformative advances in multimodal learning, particularly in the image-text domain, where models like CLIP and CoCa achieve strong generalization from noisy web-scale data. However, this success has yet to be replicated in video-language learning, owing to the intrinsic difficulty of acquiring temporally-aligned video-text data at scale. Existing solutions such as ASR-based captioning or alt-text retrieval often suffer from low quality, domain bias, or poor coverage, constraining their utility for training generalized video models. In this paper, we propose PseudoCap-Vid, a scalable and accurate framework for weakly-supervised multimodal video pre-training that bypasses the need for aligned video-text data. Our method leverages recent advances in image captioning to pseudo-label video frames and clips, producing dense, informative captions that serve as effective supervision signals. Unlike prior approaches, PseudoCap-Vid relies neither on domain-specific assumptions nor on expensive frame-text alignment pipelines. We instantiate our framework with a frozen TimeSformer visual encoder and a pre-trained OPT-based language model, and train on a combination of image-caption and video pseudo-caption data. Through comprehensive experiments, we show that our approach significantly outperforms models pre-trained on noisy ASR transcripts, achieving a +4 CIDEr improvement on MSR-VTT. We also introduce a novel separable cross-attention mechanism tailored for multimodal fusion and analyze optimization dynamics across large-scale setups. Our findings yield practical guidelines for stable pre-training and open new avenues for multimodal representation learning with minimal annotation cost.
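To make the pseudo-labeling step concrete, the sketch below shows one way per-frame pseudo-captions could be generated with an off-the-shelf image captioner. The specific captioner (BLIP), the decord-based uniform frame sampling, and the pseudo_caption helper are illustrative assumptions only; the abstract does not specify which captioning model or sampling scheme PseudoCap-Vid actually uses.

```python
# Minimal sketch of frame-level pseudo-captioning, assuming an off-the-shelf
# image captioner (BLIP here) and uniform frame sampling. Names and model
# choices are illustrative, not the paper's implementation.
import torch
from decord import VideoReader  # assumed video decoder
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device).eval()

def pseudo_caption(video_path, num_frames=8, max_new_tokens=30):
    """Uniformly sample frames and caption each one with the frozen image captioner."""
    vr = VideoReader(video_path)
    idx = torch.linspace(0, len(vr) - 1, num_frames).long().tolist()
    frames = [vr[i].asnumpy() for i in idx]  # HWC uint8 frames
    inputs = processor(images=frames, return_tensors="pt").to(device)
    with torch.no_grad():
        out = captioner.generate(**inputs, max_new_tokens=max_new_tokens)
    # One caption per sampled frame; these serve as weak supervision targets.
    return processor.batch_decode(out, skip_special_tokens=True)
```

Under this sketch, the resulting per-frame captions would be paired with the corresponding clips as pseudo-aligned video-text data for pre-training, replacing noisy ASR transcripts or alt-text.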