Visually-Guided Audio-Visual Segmentation via Multi-Scale Fusion and Content-Guided Attention

Abstract

Audio-Visual Segmentation (AVS) aims to localize and segment sounding objects in videos at the pixel level, driven by audio cues. However, current mainstream methods typically employ audio-centric Transformer frameworks that derive object queries primarily from audio features. These approaches often suffer from a fundamental modality mismatch: relying on temporal audio signals to resolve a spatial segmentation task leads to perceptual ambiguity and a loss of fine-grained visual detail, particularly in complex acoustic environments. To address these challenges, this paper proposes a novel visually-guided framework incorporating a Multi-Scale Fusion (MSF) module and a Content-Guided Attention Fusion (CGAF) mechanism. Unlike existing approaches, our method prioritizes visual information to generate visually-derived queries, which then interact with audio context within a Transformer decoder for deep semantic refinement. Extensive experiments on standard benchmarks demonstrate that our approach effectively aligns cross-modal information and achieves state-of-the-art performance, significantly outperforming existing baselines.
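To make the core architectural idea concrete, below is a minimal PyTorch sketch of a visually-guided query pipeline as the abstract describes it: object queries are derived from visual features rather than audio, then refined against audio context in a Transformer decoder. The abstract does not specify implementation details, so all dimensions, layer counts, and the query-pooling scheme here are illustrative assumptions, not the paper's actual MSF or CGAF modules.

```python
# Sketch only: visually-derived queries cross-attending to audio context.
# Hyperparameters and the pooling step are hypothetical.
import torch
import torch.nn as nn


class VisuallyGuidedDecoder(nn.Module):
    def __init__(self, dim=256, num_queries=16, num_layers=3, num_heads=8):
        super().__init__()
        # Learnable probes that pool visual features into object queries
        # (a hypothetical stand-in for the paper's query-generation step).
        self.query_probes = nn.Parameter(torch.randn(num_queries, dim))
        self.visual_pool = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Standard Transformer decoder: visually-derived queries attend to
        # audio context for cross-modal semantic refinement.
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (B, HW, dim) flattened spatial features
        # audio_feats:  (B, T, dim)  temporal audio embeddings
        b = visual_feats.size(0)
        probes = self.query_probes.unsqueeze(0).expand(b, -1, -1)
        # Step 1: derive object queries from the visual stream.
        visual_queries, _ = self.visual_pool(probes, visual_feats, visual_feats)
        # Step 2: refine the visual queries against audio context.
        return self.decoder(tgt=visual_queries, memory=audio_feats)


if __name__ == "__main__":
    dec = VisuallyGuidedDecoder()
    v = torch.randn(2, 14 * 14, 256)  # e.g. a 14x14 visual feature map
    a = torch.randn(2, 10, 256)       # e.g. 10 audio frames
    print(dec(v, a).shape)            # torch.Size([2, 16, 256])
```

The key contrast with the audio-centric baselines the abstract criticizes is the direction of query derivation: here the queries originate from spatial visual features, and audio serves only as decoder memory for refinement.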
