Event-Aware Visual-Language Modeling for Cross-Modal Event Retrieval
Abstract
The rapid expansion of multi-modal information across social media and news platforms has intensified the need for accurate, fine-grained cross-modal event retrieval. Existing approaches, constrained by keyword matching and single-modal representations, struggle to capture complex event semantics and their inter-modal dependencies. This paper presents UniEvent LVLM, a unified visual-language model that integrates a large language model for text, a vision transformer for images, and a temporal transformer for videos to achieve comprehensive event understanding. An event-aware fusion module with cross-modal attention and event concept pooling explicitly aligns and distills event-centric features, which are projected into a unified embedding space optimized by contrastive learning with hard negative mining. We further construct NewsEvent-200K, a large-scale multi-modal dataset with 200,000 annotated news events, for rigorous evaluation. Experimental results show that UniEvent LVLM achieves state-of-the-art performance on cross-modal event retrieval, demonstrating the effectiveness of unified multi-modal modeling and event-aware feature fusion.
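The abstract describes the fusion module and training objective only at a high level. As a rough illustration of the ideas named there (cross-modal attention, event concept pooling, and contrastive learning with hard negative mining), the following PyTorch sketch shows one plausible realization. All names, dimensions, and hyperparameters here (`EventAwareFusion`, `num_concepts`, the temperature, the 0.1 hard-negative weight) are illustrative assumptions and not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EventAwareFusion(nn.Module):
    """Hypothetical sketch: cross-modal attention followed by event concept pooling."""

    def __init__(self, dim: int = 512, num_heads: int = 8, num_concepts: int = 16):
        super().__init__()
        # Cross-modal attention: text tokens attend to visual tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable event-concept queries that pool event-centric features.
        self.concept_queries = nn.Parameter(torch.randn(num_concepts, dim) * 0.02)
        self.concept_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Projection into the shared embedding space.
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, L_text, D); visual_tokens: (B, L_visual, D)
        fused, _ = self.cross_attn(text_tokens, visual_tokens, visual_tokens)
        queries = self.concept_queries.unsqueeze(0).expand(fused.size(0), -1, -1)
        concepts, _ = self.concept_attn(queries, fused, fused)   # (B, K, D)
        pooled = concepts.mean(dim=1)                            # event concept pooling
        return F.normalize(self.proj(pooled), dim=-1)            # unit-norm event embedding


def contrastive_loss_with_hard_negatives(
    query_emb: torch.Tensor,      # (B, D) e.g. text-side event embeddings
    target_emb: torch.Tensor,     # (B, D) paired image/video-side embeddings
    temperature: float = 0.07,
    top_k_hard: int = 4,
) -> torch.Tensor:
    """InfoNCE over in-batch negatives with an extra penalty on the hardest ones.

    Assumes batch size > 1 so that at least one in-batch negative exists.
    """
    logits = query_emb @ target_emb.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)  # diagonal = positives
    base_loss = F.cross_entropy(logits, labels)

    # Hard negative mining: find the top-k most similar non-matching pairs per query
    # and add a penalty that grows with their similarity.
    diag_mask = torch.eye(logits.size(0), dtype=torch.bool, device=logits.device)
    neg_logits = logits.masked_fill(diag_mask, float("-inf"))
    hard_negs, _ = neg_logits.topk(k=min(top_k_hard, logits.size(0) - 1), dim=1)
    hard_penalty = F.softplus(hard_negs).mean()
    return base_loss + 0.1 * hard_penalty
```

In a typical symmetric setup, the text-side and visual-side embeddings produced by such a fusion block would be fed to the loss in both retrieval directions (text-to-visual and visual-to-text) and the two terms averaged; whether UniEvent LVLM uses exactly this formulation is not specified in the abstract.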