Weakly-Supervised Multimodal Video Pre-Training via Image-Caption Pseudo-Labeling
Abstract
Large-scale weakly-supervised training has enabled transformative advances in multimodal learning, particularly in the image-text domain, where models like CLIP and CoCa achieve strong generalization from noisy web-scale data. However, this success has yet to be replicated in video-language learning, owing to the intrinsic difficulty of acquiring temporally-aligned video-text data at scale. Existing solutions such as ASR-based captioning or alt-text retrieval often suffer from low quality, domain bias, or poor coverage, constraining their utility for training generalized video models. In this paper, we propose PseudoCap-Vid, a scalable and accurate framework for weakly-supervised multimodal video pre-training that bypasses the need for aligned video-text data. Our method leverages recent advances in image captioning to pseudo-label video frames and clips, producing dense, informative captions that serve as effective supervision signals. Unlike prior approaches, PseudoCap-Vid relies neither on domain-specific assumptions nor on expensive frame-text alignment pipelines. We instantiate our framework with a frozen TimeSformer visual encoder and a pre-trained OPT-based language model, and train on a combination of image-caption and video pseudo-caption data. Through comprehensive experiments, we show that our approach significantly outperforms models pre-trained on noisy ASR transcripts, achieving a +4 CIDEr improvement on MSR-VTT. We also introduce a novel separable cross-attention mechanism tailored for multimodal fusion and analyze optimization dynamics across large-scale setups. Our findings yield practical guidelines for stable pre-training and open new avenues for multimodal representation learning with minimal annotation cost.
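To make the pseudo-labeling step concrete, the sketch below shows one way per-frame pseudo-captions could be generated with an off-the-shelf image captioner. The specific captioner (BLIP), the decord-based uniform frame sampling, and the pseudo_caption helper are illustrative assumptions only; the abstract does not specify which captioning model or sampling scheme PseudoCap-Vid actually uses.

```python
# Minimal sketch of frame-level pseudo-captioning, assuming an off-the-shelf
# image captioner (BLIP here) and uniform frame sampling. Names and model
# choices are illustrative, not the paper's implementation.
import torch
from decord import VideoReader  # assumed video decoder
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device).eval()

def pseudo_caption(video_path, num_frames=8, max_new_tokens=30):
    """Uniformly sample frames and caption each one with the frozen image captioner."""
    vr = VideoReader(video_path)
    idx = torch.linspace(0, len(vr) - 1, num_frames).long().tolist()
    frames = [vr[i].asnumpy() for i in idx]  # HWC uint8 frames
    inputs = processor(images=frames, return_tensors="pt").to(device)
    with torch.no_grad():
        out = captioner.generate(**inputs, max_new_tokens=max_new_tokens)
    # One caption per sampled frame; these serve as weak supervision targets.
    return processor.batch_decode(out, skip_special_tokens=True)
```

Under this sketch, the resulting per-frame captions would be paired with the corresponding clips as pseudo-aligned video-text data for pre-training, replacing noisy ASR transcripts or alt-text.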