Deep Multimodal-Interactive Document Summarization Network and Its Cross-Modal Text-Image Retrieval Application for Future Smart City Information Management Systems

Abstract

Urban documents such as city planning reports and environmental data reports often feature complex charts and text that call for effective summarization tools, particularly in smart city management systems. These documents increasingly pair graphical abstracts with textual summaries to improve readability, making automated abstract generation crucial. This study explores the application of summarization technology, using scientific paper abstract generation as a case study. The challenge lies in processing the longer multimodal content typical of research papers. To address it, we propose a deep multimodal-interactive network for accurate document summarization. The model enhances structural information from both images and text and uses a combination module to learn the correlations between them. The integrated model supports both summary generation and significant-image selection. For evaluation, we construct a dataset that encompasses both textual and visual components along with structural information, such as text coordinates and image layout. While primarily focused on abstract generation and image selection, the model also supports text-image cross-modal retrieval. Comparative experiments on this proprietary dataset demonstrate that our method consistently outperforms other models, and it could benefit smart city applications such as accident scene documentation and automated environmental monitoring summaries, enhancing the processing of urban multimodal data.
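The abstract does not specify how the combination module is built; purely as an illustration, the minimal PyTorch sketch below shows one plausible design in which text and image features attend to each other, the fused text features could feed a summary decoder, and a per-image score could support significant-image selection. All names here (CrossModalCombination, image_scorer, the 512-dimensional features) are hypothetical and are not taken from the paper.

```python
import torch
import torch.nn as nn

class CrossModalCombination(nn.Module):
    """Illustrative text-image combination module (assumed design, not the
    paper's): each modality attends to the other, the fused text features
    feed a downstream summary decoder, and each image receives a relevance
    score for significant-image selection."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.text_to_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_scorer = nn.Linear(dim, 1)  # relevance score per image

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, num_tokens, dim) token embeddings, possibly
        #              augmented with text-coordinate (layout) encodings
        # image_feats: (batch, num_images, dim) figure embeddings, possibly
        #              augmented with image-layout encodings
        text_ctx, _ = self.text_to_image(text_feats, image_feats, image_feats)
        image_ctx, _ = self.image_to_text(image_feats, text_feats, text_feats)
        fused_text = text_feats + text_ctx            # input to a summary decoder
        image_scores = self.image_scorer(image_feats + image_ctx).squeeze(-1)
        return fused_text, image_scores


# Minimal usage example with random tensors standing in for encoder outputs.
text = torch.randn(2, 128, 512)    # 2 documents, 128 text tokens each
images = torch.randn(2, 6, 512)    # 2 documents, 6 candidate figures each
fused_text, image_scores = CrossModalCombination()(text, images)
print(fused_text.shape, image_scores.shape)  # (2, 128, 512) and (2, 6)
```

In this sketch the same fused representations could also serve text-image cross-modal retrieval, for example by ranking images against a text query with the learned scores.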