SRIN: Structured Reasoning Integration Network for Robust Video Question Answering

Abstract

Video Question Answering (VideoQA) demands a deep understanding of visual, temporal, and causal relationships. Multimodal Large Language Models (MLLMs) offer powerful reasoning capabilities, but their raw outputs often lack structure, contain noise, or include erroneous conclusions, which makes them hard to integrate effectively. This paper introduces the Structured Reasoning Integration Network (SRIN), a framework designed to use MLLM-generated reasoning more robustly and precisely. SRIN comprises two core components. The Structured Reasoning Generation (SRG) module employs multi-stage prompting to elicit multi-dimensional, fine-grained reasoning cues from a powerful MLLM (InternVL 1.5). The Dynamic Reasoning Integration (DRI) module, the key innovation, adaptively weights and fuses these structured reasoning components according to the semantics of the specific question, so that the main VideoQA model (BLIP-FlanT5) can make effective use of even imperfect MLLM outputs. Extensive experiments on the NExT-QA, STAR, and IntentQA datasets show that SRIN consistently outperforms existing state-of-the-art methods, particularly on questions requiring complex causal, intent, and predictive reasoning. Ablation studies confirm that both the structured reasoning generation and the dynamic integration mechanisms contribute critically to this result. Human evaluations and qualitative analyses further indicate that SRIN produces more correct, coherent, and complete answers, supporting its robustness and effectiveness.
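
The abstract does not specify how the DRI module computes its weights, so the following is only a minimal sketch of one plausible reading: question-conditioned softmax weighting over reasoning-component embeddings, followed by a weighted sum. The function `fuse_reasoning`, the component names (`causal`, `temporal`, `intent`), the toy vectors, and the use of dot-product scoring are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_reasoning(question_vec, component_vecs, temperature=1.0):
    """Weight each reasoning component by its similarity to the question.

    Hypothetical stand-in for a DRI-style module: scores are plain dot
    products (an assumption), turned into weights with a softmax, and the
    fused vector is the weighted sum of the component vectors.
    """
    names = list(component_vecs)
    scores = [dot(question_vec, component_vecs[n]) / temperature for n in names]
    weights = dict(zip(names, softmax(scores)))
    dim = len(question_vec)
    fused = [
        sum(weights[n] * component_vecs[n][i] for n in names)
        for i in range(dim)
    ]
    return weights, fused


# Toy embeddings (made up): the question is closest to the "causal" cue.
question = [1.0, 0.0, 0.0]
components = {
    "causal": [0.9, 0.1, 0.0],
    "temporal": [0.1, 0.9, 0.0],
    "intent": [0.0, 0.1, 0.9],
}
weights, fused = fuse_reasoning(question, components)
```

In this toy setup the component most aligned with the question receives the largest weight, so a noisy or irrelevant reasoning cue contributes less to the fused representation. A learned scoring function would normally replace the raw dot product.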
