Self-Supervised Audio Representation Learning Model Based on Time-Frequency Decoupling and Masked Reconstruction
Abstract
In the field of audio processing, self-supervised learning has emerged as a key paradigm for learning general audio representations. However, existing Transformer-based models, such as the Audio Spectrogram Transformer (AST), commonly face two major challenges. First, they inherit the fixed input size paradigm from computer vision, leading to suboptimal preprocessing such as cropping or padding when handling variable-length audio, which results in the loss of critical information or the introduction of redundancy. Second, these models rely on expensive supervised pre-training or cross-modal knowledge transfer rather than capturing the underlying structural patterns of audio directly from raw data. To resolve these challenges in a unified manner, we introduce a new architecture that combines a time-frequency decoupling feature extraction module with a dual-task self-supervised learning framework. The model separates the time and frequency dimensions at the input stage, natively supporting variable-length audio inputs and more effectively capturing the unique time-frequency structure of audio. By combining a generative masked latent prediction task with a discriminative contrastive learning task, the model learns robust general representations that encompass both local details and global semantics. The architecture draws inspiration from the supervised time-frequency decoupling used in prior audio models and extends it to an unsupervised paradigm for the first time, enabling from-scratch training on the unlabeled AudioSet-20K dataset. Downstream task evaluations cover benchmarks such as AudioSet-20K and Speech Commands V2. Experimental results demonstrate that, without any external pre-training, our model achieves a linear evaluation accuracy of 0.336 on AudioSet-20K, a relative improvement of 20.4% over self-supervised baseline models.
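To make the two core ideas in the abstract concrete, the following is a minimal PyTorch sketch of (1) attention decoupled along the time and frequency axes of a spectrogram and (2) a dual-task loss combining masked latent prediction with contrastive learning. All module and variable names (e.g., TimeFreqBlock, dual_task_loss) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: time-frequency decoupled attention + dual-task SSL loss.
# Names and shapes are hypothetical; this is not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeFreqBlock(nn.Module):
    """Self-attention along the frequency axis, then along the time axis."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.freq_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, time, freq, dim)
        b, t, f, d = x.shape
        # Frequency attention: each time frame attends over its frequency bins.
        xf = self.norm1(x).reshape(b * t, f, d)
        x = x + self.freq_attn(xf, xf, xf, need_weights=False)[0].reshape(b, t, f, d)
        # Time attention: each frequency bin attends over time frames, so the
        # sequence length along time can vary from clip to clip.
        xt = self.norm2(x).permute(0, 2, 1, 3).reshape(b * f, t, d)
        xt = self.time_attn(xt, xt, xt, need_weights=False)[0]
        return x + xt.reshape(b, f, t, d).permute(0, 2, 1, 3)

def dual_task_loss(pred, target, mask, z1, z2, temperature: float = 0.07):
    """Generative masked-prediction MSE plus a discriminative InfoNCE term."""
    # Generative term: reconstruct latent targets only at masked positions (mask = 1).
    recon = (F.mse_loss(pred, target, reduction="none").mean(-1) * mask).sum() / mask.sum()
    # Discriminative term: clip-level embeddings of two views of the same clip
    # should match each other and differ from other clips in the batch.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    contrast = F.cross_entropy(logits, labels)
    return recon + contrast

# Toy usage with a batch of spectrogram patch embeddings.
x = torch.randn(2, 120, 16, 64)            # (batch, time frames, freq patches, dim)
block = TimeFreqBlock(dim=64)
pred = block(x).reshape(2, -1, 64)          # predicted latents for all patches
target = torch.randn_like(pred)             # e.g. latents computed from the unmasked input
mask = (torch.rand(2, 120 * 16) > 0.2).float()  # ~80% of patches treated as masked
z1, z2 = torch.randn(2, 64), torch.randn(2, 64) # clip-level embeddings of two views
loss = dual_task_loss(pred, target, mask, z1, z2)
```

Because the time-axis attention treats the number of frames as an ordinary sequence length, a block of this shape can accept variable-length inputs without cropping or padding to a fixed size, which is the property the abstract attributes to the proposed model.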