Identification of the sim-to-real gap in the speech directivity classification task using deep learning techniques
Abstract
Accurate modeling of speech directivity is essential for artificial auditory systems. However, deep learning models trained on simulated data often fail to generalize to real acoustic conditions, a limitation known as the Sim-to-Real gap. This study systematically identifies the simulation parameters that most strongly drive this degradation in a speech directivity classification task. A convolutional–recurrent neural network (CRNN) was trained on speech data synthesized with an acoustic virtual-reality framework. The training involved controlled variations in the head-related transfer function (HRTF), direction of arrival (DoA), reverberation time (RT), speaker identity, and sentence content. Out-of-domain (OOD) evaluations across 31 test configurations revealed that unseen HRTFs and speakers were primarily responsible for the reduction in F1-macro, whereas variations in RT and DoA had comparatively minor effects. These results indicate that anthropometric and vocal diversity, rather than environmental complexity, dominate model generalization. Based on these findings, we propose a “person-first” synthetic pipeline that prioritizes coverage of HRTFs and voices before refining environmental conditions. The framework introduced here establishes a quantitative methodology for measuring the Sim-to-Real gap using F1-macro as a diagnostic metric and provides practical guidelines for constructing robust, transferable acoustic datasets. This approach enables deep learning models to better approximate real-world auditory perception, improving future applications in spatial hearing, speech enhancement, and acoustic scene analysis.
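The diagnostic described above, using F1-macro to quantify the Sim-to-Real gap, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy labels are hypothetical, and the gap is taken as the simple difference between in-domain and out-of-domain F1-macro, which is one plausible reading of the methodology.

```python
def f1_macro(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores,
    so rare classes count as much as frequent ones."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class_f1 = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        per_class_f1.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return sum(per_class_f1) / len(per_class_f1)

def sim_to_real_gap(f1_in_domain, f1_ood):
    """Gap between in-domain F1-macro and the F1-macro measured on an
    out-of-domain test configuration (e.g. unseen HRTFs or speakers)."""
    return f1_in_domain - f1_ood

# Hypothetical usage: compare matched vs. unseen-HRTF test conditions.
f1_matched = f1_macro([0, 0, 1, 1], [0, 0, 1, 1])   # in-domain predictions
f1_unseen  = f1_macro([0, 0, 1, 1], [0, 1, 1, 1])   # OOD predictions
gap = sim_to_real_gap(f1_matched, f1_unseen)
```

In the paper's setting this comparison would be repeated over the 31 OOD test configurations, ranking each varied parameter (HRTF, speaker, RT, DoA) by the size of the gap it induces.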