Adaptive Spatiotemporal Condenser for Efficient Long-Form Video Question Answering
Abstract
Long-form video question answering (VQA) presents substantial challenges due to the extensive volume of spatiotemporal information, its inherent redundancy, and the limitations of conventional sequence-reduction methods. To address these issues, we introduce the Adaptive Spatiotemporal Condenser (ASC), a novel architecture for efficiently extracting and condensing question-relevant information from lengthy video sequences. ASC employs a lightweight, learnable module that dynamically identifies and aggregates critical spatiotemporal tokens, compressing them into a fixed-length, information-dense representation suitable for large language models (LLMs). Our key innovations include an adaptive condensation mechanism, question-conditioned importance scoring that focuses the representation on question-relevant content, and a design that is efficient and flexible by construction. Extensive experiments on challenging long-form VQA benchmarks show that our ASC-LLaVA model consistently achieves state-of-the-art performance, surpassing prior methods. Ablation studies confirm the contribution of each ASC component, while further analysis validates its robustness across varying video lengths, its effectiveness in "needle-in-a-haystack" scenarios, and its generalizability across different LLM backbones. These findings highlight ASC's ability to significantly improve VQA accuracy and computational efficiency for complex, long-form video understanding.
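To make the core idea concrete, the following is a minimal, illustrative sketch of question-conditioned condensation as the abstract describes it: video tokens are scored against a question embedding, and only a fixed-length subset of relevance-weighted tokens is retained regardless of input length. This is not the authors' implementation; the function name `condense`, the top-k selection, the scaling, and all dimensions are hypothetical assumptions for illustration.

```python
import numpy as np

def condense(tokens, q_emb, k=16):
    """Hypothetical sketch of question-conditioned token condensation.

    tokens: (T, d) spatiotemporal video tokens, T may be very large
    q_emb:  (d,)   question embedding
    Returns a fixed-length (k, d) condensed representation.
    """
    # Relevance of each token to the question (scaled dot product).
    scores = tokens @ q_emb / np.sqrt(tokens.shape[1])
    # Keep only the k most question-relevant tokens.
    idx = np.argsort(scores)[-k:]
    # Softmax over the kept tokens' scores to get importance weights.
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()
    # Weight each kept token by its importance; output length is fixed at k.
    return tokens[idx] * w[:, None]

rng = np.random.default_rng(0)
video_tokens = rng.standard_normal((4096, 64))  # e.g. 256 frames x 16 patches
question = rng.standard_normal(64)
condensed = condense(video_tokens, question)
print(condensed.shape)  # (16, 64): fixed size, independent of the 4096 input tokens
```

Because the output length is fixed at `k`, the sequence fed to the LLM stays constant as video length grows, which is the efficiency property the abstract claims; the real ASC module additionally learns this scoring and aggregation end to end.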