Visually-Guided Audio-Visual Segmentation via Multi-Scale Fusion and Content-Guided Attention
Abstract
Audio-Visual Segmentation (AVS) aims to locate and segment sounding objects in videos at the pixel level, driven by audio cues. However, current mainstream methods typically employ audio-centric Transformer frameworks that derive object queries primarily from audio features. These approaches often suffer from a fundamental modality mismatch: relying on temporal audio signals to resolve a spatial segmentation task leads to perceptual ambiguity and a loss of fine-grained visual detail, particularly in complex acoustic environments. To address these challenges, this paper proposes a novel visually-guided framework incorporating a Multi-Scale Fusion (MSF) module and a Content-Guided Attention Fusion (CGAF) mechanism. Unlike existing approaches, our method prioritizes visual information to generate visually-derived queries, which then interact with audio context within a Transformer decoder for deep semantic refinement. Extensive experiments on standard benchmarks demonstrate that our approach effectively aligns cross-modal information and achieves state-of-the-art performance, significantly outperforming existing baselines.
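To make the visually-guided query design concrete, the PyTorch sketch below pools visual features into object queries and refines them against audio context via cross-attention in a decoder layer. This is a minimal illustration of the idea stated in the abstract only: the module names, dimensions, pooling strategy, and attention layout are our own assumptions and do not reproduce the paper's actual MSF or CGAF implementations.

```python
import torch
import torch.nn as nn

class VisuallyGuidedDecoderSketch(nn.Module):
    """Hypothetical sketch: object queries are derived from visual
    features (not audio), then refined against audio context with
    cross-attention. All design choices here are illustrative
    assumptions, not the paper's method."""

    def __init__(self, dim=256, num_queries=100, num_heads=8):
        super().__init__()
        # pool the visual token sequence down to a fixed query count
        self.query_pool = nn.AdaptiveAvgPool1d(num_queries)
        self.query_proj = nn.Linear(dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (B, N_v, C) flattened visual feature tokens
        # audio_feats:  (B, N_a, C) audio context tokens
        q = self.query_pool(visual_feats.transpose(1, 2)).transpose(1, 2)
        q = self.query_proj(q)  # visually-derived object queries
        # queries attend to audio context for semantic refinement
        refined, _ = self.cross_attn(q, audio_feats, audio_feats)
        return self.norm(q + refined)

# toy usage with random tensors
v = torch.randn(2, 1024, 256)  # e.g., flattened visual feature map
a = torch.randn(2, 10, 256)    # e.g., per-frame audio embeddings
out = VisuallyGuidedDecoderSketch()(v, a)
print(out.shape)               # torch.Size([2, 100, 256])
```

The key design point this sketch captures is the inversion of the usual audio-centric pipeline: the queries originate in the visual stream, so spatial detail is preserved, and the audio signal acts as conditioning context rather than the source of the queries themselves.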