A Multimodal Information Mining and Classification Framework for Textual Content Understanding in Complex Video Scenes

Kinsley Harper
Wyne Nasir
Jaxon Everett

Abstract

Textual information embedded in video content plays an increasingly important role, which calls for more refined approaches to understanding it. The semantic extraction of such text has traditionally been addressed with Optical Character Recognition (OCR) techniques that focus on text localization and recognition. These methods have largely overlooked the task of classifying the recognized text into semantically meaningful categories, a gap that hampers downstream tasks such as content-aware video retrieval, adaptive browsing, and intelligent video summarization. To address this gap, we introduce MIMIC, a multimodal classification framework that jointly leverages visual, textual, and spatial information for robust and precise classification of video text. MIMIC incorporates a specialized correlation modeling component that explicitly captures the layout and structural cues inherent in video scenes, thereby strengthening the feature representation. In addition, we employ contrastive learning to mine implicit associations from a large corpus of unlabeled video data, which further improves the model's discriminative power in challenging cases where text categories exhibit ambiguous appearances, irregular fonts, or overlapping content. To support comprehensive evaluation and future research, we introduce TI-News, a large-scale, domain-specific dataset curated from industrial news sources and annotated for both recognition and classification tasks. Extensive experiments on TI-News validate the superior performance and generalization capabilities of MIMIC and set a new benchmark for multimodal video text classification.
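
The abstract does not say which contrastive objective MIMIC uses or how training pairs are formed. As an illustration of the general idea only, the following is a minimal sketch of a generic InfoNCE-style loss. It assumes that each text region has a visual embedding and a textual embedding, that the two embeddings of the same region form a positive pair, and that the other regions in the batch act as negatives. The function names, the temperature value, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    positives[i] is the matching embedding for anchors[i]; every other
    row of `positives` is treated as a negative for that anchor.
    The temperature of 0.1 is an assumed value, not one from the paper.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of picking the true match.
        total += log_denom - logits[i]
    return total / len(a)


# Toy embeddings (hypothetical): two text regions, two dimensions each.
visual = [[1.0, 0.0], [0.0, 1.0]]
textual_aligned = [[0.9, 0.1], [0.1, 0.9]]   # row i matches visual[i]
textual_swapped = [[0.1, 0.9], [0.9, 0.1]]   # rows deliberately mismatched

loss_aligned = info_nce(visual, textual_aligned)
loss_swapped = info_nce(visual, textual_swapped)
print(loss_aligned < loss_swapped)
```

Minimizing a loss of this kind pulls matching embeddings together and pushes mismatched ones apart, which is why the correctly aligned batch scores lower than the swapped one.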

Version published to 10.20944/preprints202505.1275.v1
May 16, 2025

Related articles

Multimodal Model Based on Contrastive Language-Image Pretraining for Micro-Expression Recognition

This article has 5 authors:
1. Peng Yang
2. Xiaoguang Wu
3. Yanyang Zhou
4. Qilin Wei
5. Zhifeng Zeng
This article has no evaluations
Latest version Dec 17, 2025

MultiLingual Scene Text Detection via Group-Specific Models

This article has 5 authors:
1. Jhonatas Conceição
2. Manuel Córdova
3. Allan Pinto
4. Ricardo da S. Torres
5. Helio Pedrini
This article has no evaluations
Latest version Dec 19, 2025

Non-Salient Visual Content Grounding for Multimodal Relation Extraction

This article has 4 authors:
1. Zefan Zhang
2. Yanhui Li
3. Weiqi Zhang
4. Tian Bai
This article has no evaluations
Latest version Dec 15, 2025
