Revisiting Multimodal and Unimodal Representation Strategies for Document-level Relation Extraction

Abstract

Understanding relationships among entities in visually rich documents is a cornerstone task of visually rich document understanding (VrDU) across industries such as finance, healthcare, and legal services. While the integration of multimodal signals, including textual content, layout structure, and visual cues, has driven substantial progress in VrDU tasks such as relation extraction (RE), the predictive value of each individual modality has not been comprehensively assessed. In this paper, we introduce MORAE, a systematic framework for dissecting and analyzing the individual and joint contributions of text, layout, and vision to RE. Through an extensive series of ablation experiments under multiple controlled settings, we investigate the incremental utility of each modality, both in isolation and in combination. Our findings show that while a bimodal fusion of text and layout achieves the highest F1-score of 0.728, the textual component alone remains the most influential predictor of entity relationships. Furthermore, our study uncovers the surprisingly competitive performance of geometric layout information as a standalone modality, offering a cost-efficient alternative in scenarios where textual extraction is hindered. Visual information, though less dominant, provides supportive signal for certain complex document layouts. Beyond these empirical findings, we provide a lightweight RE classifier under the MORAE framework to encourage practical deployment in resource-constrained settings. These insights deepen our understanding of modality synergies and inform the design of future VrDU systems.
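To make the modality-ablation idea concrete, the sketch below trains one simple relation classifier per feature set (text only, layout only, and their fusion) and compares F1 scores. It is an illustrative sketch only, not the authors' MORAE code: the random placeholder features, dimensions, and variable names are assumptions standing in for real entity-pair embeddings and bounding-box geometry.

```python
# Minimal modality-ablation sketch (hypothetical data, not the MORAE implementation):
# train the same classifier on text-only, layout-only, and fused features for
# entity-pair relation classification, then compare F1 on a held-out split.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder inputs: in practice these would be pooled text embeddings of the
# two entities and normalized bounding-box geometry (x0, y0, x1, y1 per entity).
n_pairs = 1000
text_feats = rng.normal(size=(n_pairs, 768))    # stand-in for text embeddings
layout_feats = rng.uniform(size=(n_pairs, 8))   # stand-in for two boxes, 4 coords each
labels = rng.integers(0, 2, size=n_pairs)       # 1 = related pair, 0 = unrelated

ablations = {
    "text only": text_feats,
    "layout only": layout_feats,
    "text + layout": np.hstack([text_feats, layout_feats]),
}

for name, feats in ablations.items():
    X_tr, X_te, y_tr, y_te = train_test_split(
        feats, labels, test_size=0.2, random_state=0, stratify=labels
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"{name:>14}: F1 = {f1_score(y_te, clf.predict(X_te)):.3f}")
```

With real features, the per-ablation F1 scores would indicate how much each modality contributes on its own and how much the fusion adds, which is the kind of comparison the abstract reports (e.g., text + layout reaching 0.728).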
