Structured Modeling and Representation Methods for Post-Retrieval Inference Processes in Large Video Language Models

Abstract

Existing Video-RAG systems often concatenate retrieved segments directly into the input, leading to reasoning drift when hard negative samples are introduced. This paper proposes a Structured Post-Retrieval Reasoning (SPRR) module for Large Video Language Models (LVLMs) that explicitly models the post-retrieval process as three stages: (1) Evidence Validation: generates 3–8 "decidable" sub-problems for the Top-k=20 candidate clips, outputs binary or numeric scores, and filters the candidates down to k′=4–6; (2) Conflict Resolution: establishes consistency constraints (e.g., temporal order, entity attribute invariance) over contradictory information across multiple clips and selects the minimum-conflict subset to form a coherent evidence pool; (3) Temporal Aggregation: serializes the evidence, indexed by event timestamps, to generate interpretable reasoning chains that include the referenced clip IDs and temporal ranges. SPRR is evaluated on MLVU (3,102 QA) and LongVideoBench (6,678 MCQ), using open-ended and multiple-choice formats respectively, with interpretability metrics (average evidence count, conflict rate, reasoning chain length) and efficiency metrics (input tokens, reasoning steps) also measured. The evaluation validates SPRR's benefits in reducing noise, enhancing interpretability, and improving stability.
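The three stages described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Clip` type and the functions `validate`, `resolve`, and `aggregate` are hypothetical names; clips are ranked by the mean of their sub-problem scores; two clips are treated as conflicting only when they assign different values to the same entity; and "minimum conflict subset" is read here as the largest subset with no pairwise conflict, found by brute force (feasible because k′ is at most 6).

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Clip:
    """Hypothetical container for one retrieved clip (not from the paper)."""
    clip_id: str
    start: float                      # start time in seconds
    end: float                        # end time in seconds
    sub_scores: list[float]           # one score in [0, 1] per decidable sub-problem
    attributes: dict[str, str] = field(default_factory=dict)  # entity -> value


def validate(clips, k_prime=4):
    """Stage 1 (assumed scoring): rank by mean sub-problem score, keep top k_prime."""
    ranked = sorted(
        clips,
        key=lambda c: sum(c.sub_scores) / len(c.sub_scores),
        reverse=True,
    )
    return ranked[:k_prime]


def conflicts(a, b):
    """Assumed constraint: clips conflict if a shared entity has different values."""
    shared = a.attributes.keys() & b.attributes.keys()
    return any(a.attributes[e] != b.attributes[e] for e in shared)


def resolve(clips):
    """Stage 2 (assumed reading): largest subset with no pairwise conflict."""
    for size in range(len(clips), 0, -1):
        for subset in combinations(clips, size):
            if not any(conflicts(a, b) for a, b in combinations(subset, 2)):
                return list(subset)
    return []


def aggregate(clips):
    """Stage 3: order by start time; cite clip IDs and temporal ranges."""
    ordered = sorted(clips, key=lambda c: c.start)
    return [f"[{c.clip_id} {c.start:.0f}-{c.end:.0f}s]" for c in ordered]


candidates = [
    Clip("c1", 30, 40, [1, 1, 1], {"jacket": "red"}),
    Clip("c2", 5, 12, [1, 1, 0], {"jacket": "red"}),
    Clip("c3", 50, 60, [1, 0, 1], {"jacket": "blue"}),   # contradicts c1 and c2
    Clip("c4", 70, 80, [0, 0, 0]),                        # fails validation
    Clip("c5", 90, 95, [1, 1, 1], {"door": "open"}),
]
chain = aggregate(resolve(validate(candidates, k_prime=4)))
print(chain)
```

In this toy run, the low-scoring clip `c4` is dropped in validation, `c3` is dropped because its jacket colour contradicts `c1` and `c2`, and the remaining clips are cited in temporal order. A real system would obtain the sub-problem scores and attributes from the LVLM rather than from hand-written values.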
