EgoFusion: Unified Semantic and Scale-Aware Prompt Fusion for Egocentric Action Recognition
Abstract
Egocentric video understanding has attracted growing attention for its unique capacity to capture rich first-person sensory signals and interaction dynamics. Within this field, action recognition has become a focal point due to its critical role in decoding behavioral intentions and interaction processes in first-person scenarios. Despite the progress of existing methods, two core challenges remain: (1) they often overlook the inherent semantic dependencies between verbs and nouns, treating them as independent prediction tasks, which results in semantically inconsistent or implausible action predictions; and (2) they struggle to effectively fuse information from objects at different scales, leading to incomplete capture of both fine-grained interaction details and global contextual cues. To address these issues, we propose EgoFusion, a prompt learning framework specifically designed for egocentric action recognition. The framework resolves these problems through two key modules: the Component Semantic Interaction module applies cross-attention between verb and noun prompts to strengthen their semantic alignment and model verb-noun co-occurrence; the Hierarchical Feature Aggregator module enriches the representation of hand-object interactions through multi-scale feature fusion. Experiments on datasets including Ego4D and Epic-Kitchens demonstrate that EgoFusion significantly improves recognition accuracy and generalization in within-dataset, cross-dataset, and base-to-novel settings, validating its effectiveness against the unique challenges of egocentric action recognition.
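To make the first module concrete, the sketch below shows one plausible form of cross-attention between verb and noun prompt embeddings, in the spirit of the Component Semantic Interaction module. This is a minimal illustration, not the paper's exact design: the class name, embedding dimension, head count, and the residual/LayerNorm layout are all assumptions; the vocabulary sizes in the usage example (97 verbs, 300 nouns) follow Epic-Kitchens-100.

```python
# Illustrative sketch of verb-noun cross-attention (hypothetical layout;
# not the paper's exact Component Semantic Interaction design).
import torch
import torch.nn as nn

class VerbNounCrossAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Verb prompts attend to noun prompts and vice versa, so each
        # component's representation is conditioned on plausible partners.
        self.verb_to_noun = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.noun_to_verb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_n = nn.LayerNorm(dim)

    def forward(self, verb_prompts: torch.Tensor, noun_prompts: torch.Tensor):
        # verb_prompts: (B, Nv, dim); noun_prompts: (B, Nn, dim)
        v_ctx, _ = self.verb_to_noun(verb_prompts, noun_prompts, noun_prompts)
        n_ctx, _ = self.noun_to_verb(noun_prompts, verb_prompts, verb_prompts)
        # Residual connections keep the original prompt semantics while
        # injecting cross-component context for co-occurrence modeling.
        return self.norm_v(verb_prompts + v_ctx), self.norm_n(noun_prompts + n_ctx)

# Example: Epic-Kitchens-100-sized vocabularies in a 512-d embedding space.
csi = VerbNounCrossAttention(dim=512, num_heads=8)
verbs = torch.randn(2, 97, 512)
nouns = torch.randn(2, 300, 512)
v_out, n_out = csi(verbs, nouns)
print(v_out.shape, n_out.shape)  # (2, 97, 512) (2, 300, 512)
```

Conditioning each verb prompt on the noun vocabulary (and vice versa) is what lets a model of this shape suppress implausible pairings such as "pour + knife" in favor of co-occurring ones such as "pour + water".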
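The second module can be sketched in a similar way. The code below shows one simple reading of multi-scale fusion for hand-object interaction features, loosely in the spirit of the Hierarchical Feature Aggregator; the number of scales, the mean-pooling, the per-scale projections, and the learned softmax fusion weights are all assumptions made for illustration.

```python
# Minimal multi-scale fusion sketch (assumed design, not the paper's exact
# Hierarchical Feature Aggregator).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalFeatureAggregator(nn.Module):
    def __init__(self, dim: int = 512, num_scales: int = 3):
        super().__init__()
        # One projection per scale, plus learnable scalar fusion weights.
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_scales)])
        self.fuse_logits = nn.Parameter(torch.zeros(num_scales))

    def forward(self, feats: list) -> torch.Tensor:
        # feats[i]: (B, T_i, dim) token features at scale i, e.g. from coarse
        # frame-level context down to fine hand-object patch features.
        pooled = [p(f.mean(dim=1)) for p, f in zip(self.proj, feats)]  # (B, dim) each
        w = F.softmax(self.fuse_logits, dim=0)                         # scale weights
        return sum(w[i] * pooled[i] for i in range(len(pooled)))      # (B, dim)

# Example: three scales with different token counts.
hfa = HierarchicalFeatureAggregator(dim=512, num_scales=3)
feats = [torch.randn(2, t, 512) for t in (8, 49, 196)]
video_repr = hfa(feats)
print(video_repr.shape)  # (2, 512)
```

The design intuition matches the abstract's motivation: fine scales carry interaction detail (hand-object contact regions), coarse scales carry global context, and a learned weighting lets the model trade the two off per deployment rather than hard-coding one scale.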