3D Mel-Spectrogram–Based Deep Learning for Automated Multiclass Diagnosis of Pathological Voices

Abstract

Background: Voice disorders are common otolaryngological conditions that significantly impair patients' communication and quality of life. Most existing studies rely on two-dimensional Mel-spectrograms, which, while practical, have inherent limitations in feature extraction. This study demonstrates that three-dimensional Mel-spectrograms capture more comprehensive pathological voice features, enabling more accurate diagnosis.

Methods: This study used the voice database of Fudan University Eye, Ear, Nose, and Throat Hospital (1,839 cases covering six categories). Focusing on six voice conditions, healthy voice (NV), spasmodic dysphonia (SD), uncompensated unilateral vocal fold paralysis (UVFP), vocal cord sulcus (VFS), benign proliferative lesions (BPLVF), and malignant vocal fold tumors (MVFT), we propose a classification method, Voice-3D, that integrates three-dimensional Mel-spectrograms (3D-Mel) with a deep learning framework. By mapping voice signals into a three-dimensional time-frequency-energy space and incorporating a multi-view convolutional feature-fusion structure, the model comprehensively captures pathological voice characteristics.

Results: Compared with 2D-Mel, 3D-Mel showed superior overall classification performance: under identical data partitions and training settings, overall accuracy increased from 68.0% to 81.1%. The largest improvements were observed for malignant vocal fold tumors (from 57.2% to 84.3%) and vocal cord sulcus (from 86.7% to 93.5%), while normal voices showed a modest but statistically significant gain (from 89.8% to 93.6%). Confusion-matrix analysis further showed that 3D-Mel substantially reduced cross-class misclassifications, particularly among clinically challenging categories. Notably, per-class accuracy for spasmodic dysphonia decreased slightly, although its F1-score improved, indicating a better balance between precision and recall.
Conclusions: The deep learning framework based on 3D Mel-spectrograms substantially outperforms traditional 2D methods in multiclass classification of voice disorders, enabling noninvasive, objective, and automated auxiliary diagnosis. This method holds promise for clinical decision support and remote voice-health screening. Future work will focus on large-scale multicenter validation and on multimodal fusion with laryngoscopic images and clinical records to enhance generalizability and clinical applicability.
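The abstract does not detail how the three-dimensional time-frequency-energy input is constructed. As a purely illustrative sketch (not the authors' method), one common way to lift a 2D log-Mel spectrogram into a 3D tensor is to stack the static spectrogram with its first- and second-order temporal deltas as channels; everything below, including the synthetic vowel-like test signal and all parameter values, is an assumption:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters projecting |STFT|^2 bins onto mel bands."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if c > l:
            fb[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fb[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb

def log_mel_spectrogram(y, sr, n_fft=512, hop=128, n_mels=64):
    """Frame the signal, take the power spectrum, mel-project, log-compress."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2     # (frames, bins)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T     # (frames, mels)
    return np.log(mel + 1e-10).T                          # (mels, frames)

def delta(x):
    """First-order difference along the time (frame) axis, same shape as x."""
    return np.diff(x, axis=1, prepend=x[:, :1])

# Hypothetical 1-second sustained-vowel-like signal at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
y = 0.6 * np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 440 * t)

s = log_mel_spectrogram(y, sr)                    # time-frequency-energy surface
x3d = np.stack([s, delta(s), delta(delta(s))])    # (3, n_mels, n_frames) tensor
print(x3d.shape)                                  # → (3, 64, 122)
```

A tensor of this shape can then feed a standard image-style CNN; the multi-view fusion described in the Methods would operate on several such views rather than this single stack.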
