Unified Representation Learning for Relation Extraction in Visually-Rich Documents


Abstract

Recent advances in multimodal integration, combining text, geometric layout, and visual cues, have driven significant progress in Visually-rich Document Understanding (VrDU), particularly for relation extraction (RE). We introduce UniFusion, a joint representation learning approach built on a systematic analysis of each modality's contribution. Our experiments ablate each modality in turn and additionally evaluate the text and layout modalities in isolation, quantifying the predictive capacity of each signal. We find that a bimodal configuration combining textual content with layout geometry consistently outperforms the alternatives, achieving an F1 score of 0.684, which underscores the pivotal role of text as the primary driver in predicting entity relationships. Our analysis further shows that geometric layout features carry substantial predictive power on their own and can serve as an effective standalone predictor in some settings. The visual modality performs comparatively poorly in isolation, yet including it in a multimodal fusion strategy improves overall performance by supplying supplementary contextual information. Experiments spanning a diverse range of document types and noise conditions confirm that integrating multiple modalities via UniFusion yields more robust performance, particularly when textual data is incomplete or noisy. Together, these findings provide compelling evidence for the efficacy of joint representation learning and show that a carefully balanced fusion of text, layout, and visual modalities is essential for advancing the state of the art in RE within the VrDU framework.
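The abstract does not include implementation details, so the following is only a minimal PyTorch sketch of how the described modality-ablation study might be structured. The class name MultimodalFusionRE, the embedding dimensions, the binary label set, and the late-fusion-by-summation design are illustrative assumptions, not the authors' method; the modality configurations in the loop mirror the ones the abstract reports (full trimodal fusion, text plus layout, layout alone, vision alone).

```python
import torch
import torch.nn as nn

class MultimodalFusionRE(nn.Module):
    """Hypothetical late-fusion relation-extraction head.

    Each modality (text, layout, vision) is assumed to arrive as a
    fixed-size embedding per candidate entity pair; ablations are run
    by restricting which modalities contribute to the fused vector.
    """

    def __init__(self, text_dim=768, layout_dim=128, vision_dim=256,
                 hidden=512, num_labels=2):
        super().__init__()
        # One projection per modality into a shared hidden space.
        self.proj = nn.ModuleDict({
            "text": nn.Linear(text_dim, hidden),
            "layout": nn.Linear(layout_dim, hidden),
            "vision": nn.Linear(vision_dim, hidden),
        })
        # Binary head: related / not-related (an assumed label set).
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(hidden, num_labels))

    def forward(self, feats, use=("text", "layout", "vision")):
        # Fuse by summing the projections of only the active modalities;
        # omitting a modality from `use` is the ablation mechanism.
        fused = sum(self.proj[m](feats[m]) for m in use)
        return self.classifier(fused)

# Toy ablation loop over the configurations discussed in the abstract.
model = MultimodalFusionRE()
batch = {
    "text": torch.randn(4, 768),
    "layout": torch.randn(4, 128),
    "vision": torch.randn(4, 256),
}
for config in [("text", "layout", "vision"), ("text", "layout"),
               ("layout",), ("vision",)]:
    logits = model(batch, use=config)
    print(config, logits.shape)
```

In this sketch, comparing validation F1 across the loop's configurations would reproduce the kind of ablation the abstract describes; the reported result (text plus layout at 0.684 F1) corresponds to the ("text", "layout") configuration.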
