Revisiting Multimodal and Unimodal Representation Strategies for Document-level Relation Extraction

Abstract

Understanding relationships among entities in visually rich documents is a cornerstone task of visually rich document understanding (VrDU) across industries such as finance, healthcare, and legal services. While the integration of multimodal signals, including textual content, layout structure, and visual cues, has driven substantial progress in VrDU tasks such as relation extraction (RE), the predictive value of each individual modality has not been comprehensively assessed. In this paper, we introduce MORAE, a systematic framework for dissecting and analyzing the individual and joint contributions of text, layout, and vision to RE. Through an extensive series of ablation experiments under multiple controlled settings, we investigate the incremental utility of each modality, both in isolation and in combination. Our findings show that while a bimodal fusion of text and layout achieves the highest F1-score of 0.728, the textual component alone remains the most influential predictor of entity relationships. Furthermore, our study uncovers the surprisingly competitive performance of geometric layout information as a standalone modality, offering a cost-efficient alternative in scenarios where textual extraction is hindered. Visual information, though less dominant, provides supportive signal for certain complex document layouts. Beyond these empirical findings, we provide a lightweight RE classifier under the MORAE framework to encourage practical deployment in resource-constrained settings. These insights deepen our understanding of modality synergies and inform the design of future VrDU systems.
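To make the modality-ablation idea concrete, the sketch below trains one simple relation classifier per feature set (text only, layout only, and their fusion) and compares F1 scores. It is an illustrative sketch only, not the authors' MORAE code: the random placeholder features, dimensions, and variable names are assumptions standing in for real entity-pair embeddings and bounding-box geometry.

```python
# Minimal modality-ablation sketch (hypothetical data, not the MORAE implementation):
# train the same classifier on text-only, layout-only, and fused features for
# entity-pair relation classification, then compare F1 on a held-out split.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder inputs: in practice these would be pooled text embeddings of the
# two entities and normalized bounding-box geometry (x0, y0, x1, y1 per entity).
n_pairs = 1000
text_feats = rng.normal(size=(n_pairs, 768))    # stand-in for text embeddings
layout_feats = rng.uniform(size=(n_pairs, 8))   # stand-in for two boxes, 4 coords each
labels = rng.integers(0, 2, size=n_pairs)       # 1 = related pair, 0 = unrelated

ablations = {
    "text only": text_feats,
    "layout only": layout_feats,
    "text + layout": np.hstack([text_feats, layout_feats]),
}

for name, feats in ablations.items():
    X_tr, X_te, y_tr, y_te = train_test_split(
        feats, labels, test_size=0.2, random_state=0, stratify=labels
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"{name:>14}: F1 = {f1_score(y_te, clf.predict(X_te)):.3f}")
```

With real features, the per-ablation F1 scores would indicate how much each modality contributes on its own and how much the fusion adds, which is the kind of comparison the abstract reports (e.g., text + layout reaching 0.728).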
