Enhancing Environmental Sound Classification Performance through Data Fusion: A Comparative Machine Learning Analysis
Abstract
Environmental Sound Classification (ESC) has become a fundamental component of intelligent acoustic systems, enabling applications such as smart cities, environmental monitoring, and public safety. This study proposes a comprehensive feature-level fusion framework for machine learning-based ESC. We extract complementary features from the UrbanSound8K dataset: time-domain attributes, namely the Zero Crossing Rate (ZCR) and Root Mean Square (RMS) energy, and frequency-domain descriptors, namely Mel-Frequency Cepstral Coefficients (MFCC) and Chroma features, which are then concatenated into an enriched representation space. To ensure robustness, multiple preprocessing configurations were evaluated across various window sizes, hop lengths, and sampling rates. Seven classifiers, including a Multi-Layer Perceptron (MLP), XGBoost, and a Support Vector Machine (SVM), were systematically compared using both individual and fused feature sets. The results demonstrate that feature-level fusion consistently enhances classification performance, achieving a maximum accuracy of 94.4% with the MLP model and significantly outperforming baseline configurations that rely on individual features. These findings confirm that integrating heterogeneous acoustic features at the feature level substantially improves the generalization and robustness of environmental sound recognition, offering a scalable pathway for real-world acoustic scene analysis and intelligent monitoring infrastructures.

Our main contributions are summarized as follows:

1. A feature-level fusion strategy is proposed, integrating time-domain (ZCR, RMS) and frequency-domain (MFCC, Chroma) acoustic features to construct a robust and discriminative representation for environmental sound classification (see the extraction sketch after this list).
2. An extensive experimental setup is designed, enabling a detailed performance analysis across diverse acoustic preprocessing configurations by systematically varying the window size, hop length, and sampling rate.
3. A systematic evaluation of multiple machine learning classifiers (SVM, K-NN, Decision Tree, Random Forest, Naive Bayes, XGBoost, and MLP) is conducted to assess the impact of feature fusion on classification performance (an illustrative comparison loop also follows this list).
4. Performance comparisons demonstrate that the fused feature set significantly outperforms individual feature inputs, achieving a peak classification accuracy of 94.4% with the MLP model, thereby validating the efficacy of the proposed fusion approach.
5. The results validate the suitability of the proposed system for real-world acoustic monitoring tasks, including smart city surveillance and urban environmental sound recognition.
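Since the abstract does not include code, the following is a minimal sketch of the described extraction-and-fusion step using librosa. The function name, the mean-pooling of frame-level features over time, and the default window size (n_fft), hop length, and sampling rate are our illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np
import librosa

def extract_fused_features(path, sr=22050, n_fft=2048, hop_length=512, n_mfcc=13):
    """Build one fused feature vector from time- and frequency-domain descriptors."""
    y, sr = librosa.load(path, sr=sr)

    # Time-domain descriptors (computed per frame)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop_length)
    rms = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop_length)

    # Frequency-domain descriptors (computed per frame)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop_length)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft, hop_length=hop_length)

    # Feature-level fusion: summarize each descriptor by its mean over frames,
    # then concatenate into a fixed-length vector (1 + 1 + n_mfcc + 12 values).
    return np.concatenate([
        zcr.mean(axis=1),
        rms.mean(axis=1),
        mfcc.mean(axis=1),
        chroma.mean(axis=1),
    ])
```

The preprocessing grid described in the abstract corresponds to sweeping the sr, n_fft, and hop_length arguments of this function and re-extracting the dataset for each configuration.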
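For the classifier comparison, a sketch along the following lines could reproduce the evaluation loop. The hyperparameters are placeholders rather than the paper's settings, X and y are assumed to hold the fused feature vectors and the ten UrbanSound8K class labels, and a faithful reproduction should respect the dataset's predefined ten folds instead of the plain cross-validation shown here.

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

# X: rows of fused feature vectors from extract_fused_features
# y: integer labels for the ten UrbanSound8K classes
classifiers = {
    "SVM": SVC(kernel="rbf"),
    "K-NN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=200),
    "Naive Bayes": GaussianNB(),
    "XGBoost": XGBClassifier(),
    "MLP": MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=500),
}

for name, clf in classifiers.items():
    # Standardize the fused features before each classifier,
    # then report mean cross-validated accuracy.
    pipe = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")
```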