Identification of the sim-to-real gap in the speech directivity classification task using deep learning techniques
Abstract
Accurate modeling of speech directivity is essential for artificial auditory systems. However, deep learning models trained on simulated data often fail to generalize to real acoustic conditions, a limitation known as the Sim-to-Real gap. This study systematically identifies the simulation parameters that most strongly drive this degradation in a speech directivity classification task. A convolutional–recurrent neural network (CRNN) was trained on speech data synthesized with an acoustic virtual-reality framework. The training involved controlled variations in the head-related transfer function (HRTF), direction of arrival (DoA), reverberation time (RT), speaker identity, and sentence content. Out-of-domain (OOD) evaluations across 31 test configurations revealed that unseen HRTFs and speakers were primarily responsible for the reduction in F1-macro, whereas variations in RT and DoA had comparatively minor effects. These results indicate that anthropometric and vocal diversity, rather than environmental complexity, dominate model generalization. Based on these findings, we propose a “person-first” synthetic pipeline that prioritizes coverage of HRTFs and voices before refining environmental conditions. The framework introduced here establishes a quantitative methodology for measuring the Sim-to-Real gap using F1-macro as a diagnostic metric and provides practical guidelines for constructing robust, transferable acoustic datasets. This approach enables deep learning models to better approximate real-world auditory perception, improving future applications in spatial hearing, speech enhancement, and acoustic scene analysis.
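The diagnostic described above, using F1-macro to quantify the Sim-to-Real gap, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy labels are hypothetical, and the gap is taken as the simple difference between in-domain and out-of-domain F1-macro, which is one plausible reading of the methodology.

```python
def f1_macro(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores,
    so rare classes count as much as frequent ones."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class_f1 = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        per_class_f1.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return sum(per_class_f1) / len(per_class_f1)

def sim_to_real_gap(f1_in_domain, f1_ood):
    """Gap between in-domain F1-macro and the F1-macro measured on an
    out-of-domain test configuration (e.g. unseen HRTFs or speakers)."""
    return f1_in_domain - f1_ood

# Hypothetical usage: compare matched vs. unseen-HRTF test conditions.
f1_matched = f1_macro([0, 0, 1, 1], [0, 0, 1, 1])   # in-domain predictions
f1_unseen  = f1_macro([0, 0, 1, 1], [0, 1, 1, 1])   # OOD predictions
gap = sim_to_real_gap(f1_matched, f1_unseen)
```

In the paper's setting this comparison would be repeated over the 31 OOD test configurations, ranking each varied parameter (HRTF, speaker, RT, DoA) by the size of the gap it induces.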