Deep Multimodal-Interactive Document Summarization Network and Its Cross-Modal Text–Image Retrieval Application for Future Smart City Information Management Systems

Abstract

Urban documents such as city planning reports and environmental data reports often combine complex charts with dense text and therefore call for effective summarization tools, particularly within smart city management systems. These documents increasingly pair graphical abstracts with textual summaries to improve readability, which makes automated abstract generation crucial. This study explores summarization technology using scientific paper abstract generation as a case study. The central challenge lies in processing the long multimodal content typical of research papers. To address this, a deep multimodal-interactive network is proposed for accurate document summarization. The model enhances structural information from both images and text and uses a combination module to learn the correlation between the two modalities. The integrated model supports both summary generation and salient image selection. For evaluation, a dataset is constructed that contains textual and visual components together with structural information, such as text coordinates and image layout. Although the model focuses primarily on abstract generation and image selection, it also supports text–image cross-modal retrieval. Experimental results on the proprietary dataset show that the proposed method substantially outperforms both extractive and abstractive baselines. In particular, it achieves a ROUGE-1 score of 46.55, a ROUGE-2 score of 16.13, and a ROUGE-L score of 24.95, improving over the best abstractive comparison model (Pegasus: ROUGE-1 43.63, ROUGE-2 14.62, ROUGE-L 24.46) by approximately 2.9, 1.5, and 0.5 points, respectively. Even against strong extractive methods such as TextRank (ROUGE-1 30.93) and LexRank (ROUGE-1 29.63), the approach gains more than 15 ROUGE-1 points, underscoring its effectiveness in capturing both textual and visual semantics. These results suggest significant potential for smart city applications, such as accident-scene documentation and automated environmental-monitoring summaries, where rapid and accurate processing of urban multimodal data is essential.
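To make the idea of the combination module more concrete, the following is a minimal PyTorch sketch, not the authors' released implementation: the class name `CrossModalCombination`, the gated cross-attention design, and all dimensions are illustrative assumptions about how text tokens might attend over image-region features before summary generation and image selection.

```python
import torch
import torch.nn as nn


class CrossModalCombination(nn.Module):
    """Hypothetical combination module: text tokens attend over image-region
    features so the fused representation carries both modalities."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross-attention with text features as queries and image features as keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Gate controls how much visual evidence is mixed into each text token.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, n_tokens, dim); image_feats: (batch, n_regions, dim)
        attended, _ = self.cross_attn(text_feats, image_feats, image_feats)
        gate = self.gate(torch.cat([text_feats, attended], dim=-1))
        fused = self.norm(text_feats + gate * attended)
        # The fused sequence would feed a summary decoder and an image-selection head.
        return fused


# Toy usage: a 4-token "document" with 3 image regions.
fusion = CrossModalCombination(dim=512)
text = torch.randn(1, 4, 512)
image = torch.randn(1, 3, 512)
print(fusion(text, image).shape)  # torch.Size([1, 4, 512])
```

The gating step shown here is one common way to regulate how strongly visual features influence each text token; the paper's actual fusion mechanism may differ.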
