Deep Multimodal-Interactive Document Summarization Network and Its Cross-Modal Text–Image Retrieval Application for Future Smart City Information Management Systems

Abstract

Urban documents such as city planning reports and environmental data reports often combine complex charts with dense text and therefore call for effective summarization tools, particularly within smart city management systems. These documents increasingly pair graphical abstracts with textual summaries to improve readability, which makes automated abstract generation crucial. This study explores summarization technology using scientific paper abstract generation as a case study. The central challenge lies in processing the long multimodal content typical of research papers. To address this, a deep multimodal-interactive network is proposed for accurate document summarization. The model enhances structural information from both images and text and uses a combination module to learn the correlation between the two modalities. The integrated model supports both summary generation and salient image selection. For evaluation, a dataset is constructed that contains textual and visual components together with structural information, such as text coordinates and image layout. Although the model focuses primarily on abstract generation and image selection, it also supports text–image cross-modal retrieval. Experimental results on the proprietary dataset show that the proposed method substantially outperforms both extractive and abstractive baselines. In particular, it achieves a ROUGE-1 score of 46.55, a ROUGE-2 score of 16.13, and a ROUGE-L score of 24.95, improving over the best abstractive comparison model (Pegasus: ROUGE-1 43.63, ROUGE-2 14.62, ROUGE-L 24.46) by approximately 2.9, 1.5, and 0.5 points, respectively. Even against strong extractive methods such as TextRank (ROUGE-1 30.93) and LexRank (ROUGE-1 29.63), the approach gains more than 15 ROUGE-1 points, underscoring its effectiveness in capturing both textual and visual semantics. These results suggest significant potential for smart city applications, such as accident-scene documentation and automated environmental-monitoring summaries, where rapid and accurate processing of urban multimodal data is essential.
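To make the idea of the combination module more concrete, the following is a minimal PyTorch sketch, not the authors' released implementation: the class name `CrossModalCombination`, the gated cross-attention design, and all dimensions are illustrative assumptions about how text tokens might attend over image-region features before summary generation and image selection.

```python
import torch
import torch.nn as nn


class CrossModalCombination(nn.Module):
    """Hypothetical combination module: text tokens attend over image-region
    features so the fused representation carries both modalities."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross-attention with text features as queries and image features as keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Gate controls how much visual evidence is mixed into each text token.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, n_tokens, dim); image_feats: (batch, n_regions, dim)
        attended, _ = self.cross_attn(text_feats, image_feats, image_feats)
        gate = self.gate(torch.cat([text_feats, attended], dim=-1))
        fused = self.norm(text_feats + gate * attended)
        # The fused sequence would feed a summary decoder and an image-selection head.
        return fused


# Toy usage: a 4-token "document" with 3 image regions.
fusion = CrossModalCombination(dim=512)
text = torch.randn(1, 4, 512)
image = torch.randn(1, 3, 512)
print(fusion(text, image).shape)  # torch.Size([1, 4, 512])
```

The gating step shown here is one common way to regulate how strongly visual features influence each text token; the paper's actual fusion mechanism may differ.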
