Self-Supervised Audio Representation Learning Model Based on Time-Frequency Decoupling and Masked Reconstruction
Abstract
In the field of audio processing, self-supervised learning has emerged as a key paradigm for learning general audio representations. However, existing Transformer-based models, such as the Audio Spectrogram Transformer (AST), commonly face two major challenges. First, they inherit the fixed input size paradigm from computer vision, leading to suboptimal preprocessing such as cropping or padding when handling variable-length audio, which results in the loss of critical information or the introduction of redundancy. Second, these models rely on expensive supervised pre-training or cross-modal knowledge transfer rather than capturing the underlying structural patterns of audio directly from raw data. To resolve these challenges in a unified manner, we introduce a new architecture that combines a time-frequency decoupling feature extraction module with a dual-task self-supervised learning framework. The model separates the time and frequency dimensions at the input stage, natively supporting variable-length audio inputs and more effectively capturing the unique time-frequency structure of audio. By combining a generative masked latent prediction task with a discriminative contrastive learning task, the model learns robust general representations that encompass both local details and global semantics. The architecture draws inspiration from the supervised time-frequency decoupling used in prior audio models and extends it to an unsupervised paradigm for the first time, enabling from-scratch training on the unlabeled AudioSet-20K dataset. Downstream task evaluations cover benchmarks such as AudioSet-20K and Speech Commands V2. Experimental results demonstrate that, without any external pre-training, our model achieves a linear evaluation accuracy of 0.336 on AudioSet-20K, a relative improvement of 20.4% over self-supervised baseline models.
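To make the two core ideas in the abstract concrete, the following is a minimal PyTorch sketch of (1) attention decoupled along the time and frequency axes of a spectrogram and (2) a dual-task loss combining masked latent prediction with contrastive learning. All module and variable names (e.g., TimeFreqBlock, dual_task_loss) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: time-frequency decoupled attention + dual-task SSL loss.
# Names and shapes are hypothetical; this is not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeFreqBlock(nn.Module):
    """Self-attention along the frequency axis, then along the time axis."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.freq_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, time, freq, dim)
        b, t, f, d = x.shape
        # Frequency attention: each time frame attends over its frequency bins.
        xf = self.norm1(x).reshape(b * t, f, d)
        x = x + self.freq_attn(xf, xf, xf, need_weights=False)[0].reshape(b, t, f, d)
        # Time attention: each frequency bin attends over time frames, so the
        # sequence length along time can vary from clip to clip.
        xt = self.norm2(x).permute(0, 2, 1, 3).reshape(b * f, t, d)
        xt = self.time_attn(xt, xt, xt, need_weights=False)[0]
        return x + xt.reshape(b, f, t, d).permute(0, 2, 1, 3)

def dual_task_loss(pred, target, mask, z1, z2, temperature: float = 0.07):
    """Generative masked-prediction MSE plus a discriminative InfoNCE term."""
    # Generative term: reconstruct latent targets only at masked positions (mask = 1).
    recon = (F.mse_loss(pred, target, reduction="none").mean(-1) * mask).sum() / mask.sum()
    # Discriminative term: clip-level embeddings of two views of the same clip
    # should match each other and differ from other clips in the batch.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    contrast = F.cross_entropy(logits, labels)
    return recon + contrast

# Toy usage with a batch of spectrogram patch embeddings.
x = torch.randn(2, 120, 16, 64)            # (batch, time frames, freq patches, dim)
block = TimeFreqBlock(dim=64)
pred = block(x).reshape(2, -1, 64)          # predicted latents for all patches
target = torch.randn_like(pred)             # e.g. latents computed from the unmasked input
mask = (torch.rand(2, 120 * 16) > 0.2).float()  # ~80% of patches treated as masked
z1, z2 = torch.randn(2, 64), torch.randn(2, 64) # clip-level embeddings of two views
loss = dual_task_loss(pred, target, mask, z1, z2)
```

Because the time-axis attention treats the number of frames as an ordinary sequence length, a block of this shape can accept variable-length inputs without cropping or padding to a fixed size, which is the property the abstract attributes to the proposed model.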