SRIN: Structured Reasoning Integration Network for Robust Video Question Answering

Kentaro Yamada


Abstract

Video Question Answering (VideoQA) demands a deep understanding of visual, temporal, and causal relationships. Multimodal Large Language Models (MLLMs) offer powerful reasoning capabilities, but their raw outputs are often unstructured, noisy, or partly wrong, which makes them hard to integrate effectively. This paper introduces the Structured Reasoning Integration Network (SRIN), a framework designed to use MLLM-generated reasoning more robustly and precisely. SRIN comprises two core components. The first is a Structured Reasoning Generation (SRG) module, which uses multi-stage prompting to elicit multi-dimensional, fine-grained reasoning cues from a powerful MLLM (InternVL 1.5). The second is a Dynamic Reasoning Integration (DRI) module, the key innovation, which adaptively weights and fuses these structured reasoning components according to the semantics of the specific question. This lets the main VideoQA model (BLIP-FlanT5) make effective use of even imperfect MLLM outputs. Extensive experiments on the NExT-QA, STAR, and IntentQA datasets show that SRIN consistently outperforms existing state-of-the-art methods, particularly on questions requiring complex causal, intent, and predictive reasoning. Ablation studies confirm that both structured reasoning generation and dynamic integration contribute critically to this performance. Human evaluations and qualitative analyses further indicate that SRIN produces more correct, coherent, and complete answers, supporting its robustness and effectiveness.
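The abstract does not specify how the DRI module computes its weights, so the following is only a minimal sketch of one plausible reading of "adaptively weights and fuses": a softmax over scaled dot-product scores between a question embedding and each reasoning-component embedding. The function name `fuse_reasoning`, the component labels, the toy vectors, and the dot-product scoring are all assumptions made for illustration, not the paper's method.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse_reasoning(question_vec, component_vecs):
    """Hypothetical question-conditioned fusion (not the paper's DRI).

    Scores each reasoning component by scaled dot-product similarity to
    the question, converts the scores to weights with softmax, and
    returns the weights together with the weighted sum of the components.
    """
    dim = len(question_vec)
    names = list(component_vecs)
    scores = [dot(question_vec, component_vecs[n]) / math.sqrt(dim) for n in names]
    weights = softmax(scores)
    fused = [0.0] * dim
    for w, n in zip(weights, names):
        for i, v in enumerate(component_vecs[n]):
            fused[i] += w * v
    return dict(zip(names, weights)), fused

# Toy 3-dimensional embeddings; labels and values are illustrative only.
question = [1.0, 0.0, 0.5]
components = {
    "causal": [0.9, 0.1, 0.4],
    "temporal": [0.1, 0.8, 0.0],
    "intent": [0.3, 0.2, 0.6],
}
weights, fused = fuse_reasoning(question, components)
```

In this toy setup the component most aligned with the question receives the largest weight, so a noisy or irrelevant component is attenuated in the fused vector rather than dropped outright. That is the general behaviour the abstract attributes to dynamic integration, though the actual module may be learned and considerably more elaborate.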

Version published to 10.20944/preprints202508.1438.v1
Aug 19, 2025

