Event-Aware Visual-Language Modeling for Cross-Modal Event Retrieval
Abstract
The rapid expansion of multi-modal information across social media and news platforms has intensified the need for accurate, fine-grained cross-modal event retrieval. Existing approaches, constrained by keyword matching and single-modal representations, struggle to capture complex event semantics and their inter-modal dependencies. This paper presents UniEvent LVLM, a unified visual-language model that integrates a large language model for text, a vision transformer for images, and a temporal transformer for videos to achieve comprehensive event understanding. An event-aware fusion module with cross-modal attention and event concept pooling explicitly aligns and distills event-centric features, which are projected into a unified embedding space optimized by contrastive learning with hard negative mining. We further construct NewsEvent-200K, a large-scale multi-modal dataset with 200,000 annotated news events, for rigorous evaluation. Experimental results show that UniEvent LVLM achieves state-of-the-art performance on cross-modal event retrieval, demonstrating the effectiveness of unified multi-modal modeling and event-aware feature fusion.
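The abstract describes the fusion module and training objective only at a high level. As a rough illustration of the ideas named there (cross-modal attention, event concept pooling, and contrastive learning with hard negative mining), the following PyTorch sketch shows one plausible realization. All names, dimensions, and hyperparameters here (`EventAwareFusion`, `num_concepts`, the temperature, the 0.1 hard-negative weight) are illustrative assumptions and not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EventAwareFusion(nn.Module):
    """Hypothetical sketch: cross-modal attention followed by event concept pooling."""

    def __init__(self, dim: int = 512, num_heads: int = 8, num_concepts: int = 16):
        super().__init__()
        # Cross-modal attention: text tokens attend to visual tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable event-concept queries that pool event-centric features.
        self.concept_queries = nn.Parameter(torch.randn(num_concepts, dim) * 0.02)
        self.concept_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Projection into the shared embedding space.
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, L_text, D); visual_tokens: (B, L_visual, D)
        fused, _ = self.cross_attn(text_tokens, visual_tokens, visual_tokens)
        queries = self.concept_queries.unsqueeze(0).expand(fused.size(0), -1, -1)
        concepts, _ = self.concept_attn(queries, fused, fused)   # (B, K, D)
        pooled = concepts.mean(dim=1)                            # event concept pooling
        return F.normalize(self.proj(pooled), dim=-1)            # unit-norm event embedding


def contrastive_loss_with_hard_negatives(
    query_emb: torch.Tensor,      # (B, D) e.g. text-side event embeddings
    target_emb: torch.Tensor,     # (B, D) paired image/video-side embeddings
    temperature: float = 0.07,
    top_k_hard: int = 4,
) -> torch.Tensor:
    """InfoNCE over in-batch negatives with an extra penalty on the hardest ones.

    Assumes batch size > 1 so that at least one in-batch negative exists.
    """
    logits = query_emb @ target_emb.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)  # diagonal = positives
    base_loss = F.cross_entropy(logits, labels)

    # Hard negative mining: find the top-k most similar non-matching pairs per query
    # and add a penalty that grows with their similarity.
    diag_mask = torch.eye(logits.size(0), dtype=torch.bool, device=logits.device)
    neg_logits = logits.masked_fill(diag_mask, float("-inf"))
    hard_negs, _ = neg_logits.topk(k=min(top_k_hard, logits.size(0) - 1), dim=1)
    hard_penalty = F.softplus(hard_negs).mean()
    return base_loss + 0.1 * hard_penalty
```

In a typical symmetric setup, the text-side and visual-side embeddings produced by such a fusion block would be fed to the loss in both retrieval directions (text-to-visual and visual-to-text) and the two terms averaged; whether UniEvent LVLM uses exactly this formulation is not specified in the abstract.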