Event-Aware Visual-Language Modeling for Cross-Modal Event Retrieval

Abstract

The rapid expansion of multi-modal information across social media and news platforms has intensified the need for accurate cross-modal fine-grained event retrieval. Existing approaches, constrained by keyword matching and single-modal representations, struggle to capture complex event semantics and their inter-modal dependencies. This paper presents UniEvent LVLM, a unified visual-language model that integrates a large language model for text, a vision transformer for images, and a temporal transformer for videos to achieve comprehensive event understanding. An event-aware fusion module with cross-modal attention and event concept pooling explicitly aligns and distills event-centric features, which are projected into a unified embedding space optimized by contrastive learning with hard negative mining. We further construct NewsEvent-200K, a large-scale multi-modal dataset with 200,000 annotated news events for rigorous evaluation. Experimental results show that UniEvent LVLM achieves state-of-the-art performance in cross-modal event retrieval, demonstrating the effectiveness of unified multi-modal modeling and event-aware feature fusion.
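The abstract states that the unified embedding space is optimized with contrastive learning and hard negative mining, but the paper's exact objective is not reproduced here. The following is only a minimal sketch of how such an objective is commonly implemented, assuming a symmetric InfoNCE-style loss over in-batch pairs, a temperature hyperparameter, and an extra margin-style penalty on the hardest in-batch negatives; the function name, `num_hard`, and `temperature` values are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def event_contrastive_loss(text_emb, visual_emb, temperature=0.07, num_hard=4):
    """Illustrative InfoNCE-style loss with in-batch hard negative mining.

    text_emb, visual_emb: (B, D) projections of matched text/visual event
    pairs into a shared embedding space. This is a generic sketch, not the
    exact objective used by UniEvent LVLM.
    """
    # L2-normalize so dot products are cosine similarities.
    text_emb = F.normalize(text_emb, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the positive pairs.
    logits = text_emb @ visual_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)

    # Standard symmetric InfoNCE over all in-batch negatives.
    loss_t2v = F.cross_entropy(logits, labels)
    loss_v2t = F.cross_entropy(logits.t(), labels)

    # Hard negative mining: additionally penalize the highest-scoring
    # non-matching pairs for each text query.
    diag_mask = torch.eye(logits.size(0), dtype=torch.bool, device=logits.device)
    neg_logits = logits.masked_fill(diag_mask, float('-inf'))
    k = min(num_hard, logits.size(0) - 1)
    hard_neg = neg_logits.topk(k, dim=1).values
    hard_penalty = F.softplus(hard_neg - logits.diag().unsqueeze(1)).mean()

    return 0.5 * (loss_t2v + loss_v2t) + hard_penalty
```

In a setup like the one the abstract describes, the two embedding batches would come from the event-aware fusion module's projection heads; the hard-negative term here simply up-weights the most confusable in-batch pairs rather than mining negatives across the full corpus.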
