Auditory Feature-based Perceptual Distance

Abstract

Studies comparing acoustic signals often rely on pixel-wise differences between spectrograms, for example the mean squared error (MSE). Pixel-wise errors are not representative of perceptual sensitivity, however, and such measures can be highly sensitive to small local signal changes that may be imperceptible. In computer vision, high-level visual features extracted with convolutional neural networks (CNNs) can be used to calculate the fidelity of computer-generated images. Here, we propose the auditory perceptual distance (APD) metric based on acoustic features extracted with an unsupervised CNN and validated by perceptual behavior. Using complex vocal signals from songbirds, we trained a Siamese CNN on a self-supervised task using spectrograms rescaled to match the auditory frequency sensitivity of European starlings, Sturnus vulgaris. We define APD for any pair of sounds as the cosine distance between the corresponding feature vectors extracted by the trained CNN. We show that APD is more robust to temporal and spectral translation than MSE, and captures the sigmoidal shape of typical behavioral psychometric functions over complex acoustic spaces. When fine-tuned using starlings’ behavioral judgments of naturalistic song syllables, the APD model yields even more accurate predictions of perceptual sensitivity, discrimination, and categorization on novel complex (high-dimensional) acoustic dimensions, including diverging decisions for identical stimuli following different training conditions. Thus, the APD model outperforms MSE in robustness and perceptual accuracy, and offers tunability to match experience-dependent perceptual biases.
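
As a rough illustration only (not the authors' implementation), the APD of a pair of sounds could be computed as sketched below: each spectrogram is passed through a trained CNN encoder to obtain a feature vector, and the distance is one minus the cosine similarity of the two vectors. The encoder call `encode` and the feature dimensionality are hypothetical placeholders.

```python
# Minimal sketch of the APD computation described in the abstract:
# cosine distance between CNN feature vectors of two spectrograms.
import numpy as np

def auditory_perceptual_distance(feat_a: np.ndarray, feat_b: np.ndarray) -> float:
    """Cosine distance between two feature vectors; 0 means identical direction."""
    cos_sim = np.dot(feat_a, feat_b) / (np.linalg.norm(feat_a) * np.linalg.norm(feat_b))
    return 1.0 - cos_sim

# Hypothetical usage: `encode` stands in for the trained Siamese CNN encoder,
# and spec_a / spec_b for spectrograms of two song syllables.
# feat_a, feat_b = encode(spec_a), encode(spec_b)
feat_a, feat_b = np.random.rand(128), np.random.rand(128)  # placeholder embeddings
print(auditory_perceptual_distance(feat_a, feat_b))
```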
