Decision making in auditory externalization perception: model predictions for static conditions

Abstract

Under natural conditions, listeners perceptually attribute sounds to external objects in their environment. This core function of perceptual inference is often distorted when sounds are produced via hearing devices such as headphones or hearing aids, resulting in sources being perceived unrealistically close or even inside the head. Psychoacoustic studies suggest a mixed role of various monaural and interaural cues contributing to the externalization process. We developed a model framework for perceptual externalization able to probe the contribution of cue-specific expectation errors and to contrast dynamic versus static strategies for combining those errors within static listening environments. Effects of reverberation and visual information were not considered. The model was applied to various acoustic distortions as tested under various spatially static conditions in five previous experiments. The most accurate predictions were obtained for the combination of monaural and interaural spectral cues with a fixed relative weighting (approximately 60% monaural and 40% interaural). That model version was able to reproduce the externalization ratings of the five experiments with an average error of 12% (relative to the full rating scale). Further, our results suggest that auditory externalization in spatially static listening situations relies on a fixed weighting of monaural and interaural spectral cues, rather than a dynamic selection of those auditory cues.
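The contrast between a fixed weighting and a dynamic selection of cue-specific errors can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function names, the linear error-to-rating mapping, the "max"/"min" selection rules and the example error values are all assumptions made here, and only the approximate 60/40 weighting is taken from the abstract.

```python
import numpy as np

# Approximate fixed weights from the abstract (60% monaural, 40% interaural).
W_MONAURAL, W_INTERAURAL = 0.6, 0.4


def static_combination(err_monaural, err_interaural):
    """Static strategy: weighted sum of two cue-specific expectation errors."""
    return W_MONAURAL * err_monaural + W_INTERAURAL * err_interaural


def dynamic_selection(err_monaural, err_interaural, rule="max"):
    """Dynamic strategy: per stimulus, only one cue's error is used.

    The rule names are hypothetical labels for this sketch.
    """
    pick = {"max": np.maximum, "min": np.minimum}[rule]
    return pick(err_monaural, err_interaural)


def predicted_rating(error):
    """Toy mapping (assumption): errors lie in [0, 1], rating = 1 - error."""
    return 1.0 - np.clip(error, 0.0, 1.0)


# Made-up errors for three stimuli, purely for illustration.
err_mon = np.array([0.0, 0.5, 1.0])
err_int = np.array([0.0, 0.0, 0.5])

static_ratings = predicted_rating(static_combination(err_mon, err_int))
dynamic_ratings = predicted_rating(dynamic_selection(err_mon, err_int, rule="max"))
print(static_ratings, dynamic_ratings)
```

In this toy setting the two strategies diverge whenever the cues disagree: the static strategy lets a small interaural error soften a large monaural one, whereas the selection rule ignores the non-selected cue entirely.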

Article activity feed

  1. ##Author Response

    ###Summary:

    As you will see the reviewers agreed that the premise behind this manuscript is important and timely both in the context of basic auditory science and for informing technology. However, they raised largely consistent concerns about the generalizability of your observations to other auditory stimuli and to more naturalistic listening conditions.

    We appreciate the reviewers’ positive assessment of the significance and timeliness of our present research endeavours. We assume generalizability of our findings to more naturalistic listening conditions because the proposed model framework successfully explained the outcomes of experiments that were conducted under listening conditions differing in reverberation and source stimuli. Those differences, however, only occurred across but not within experiments and thus were not considered in the model explicitly. The set of experiments and relevant cues was chosen such that the investigation of decision strategies for the combination or selection of cues in the context of perceptual externalization could be conducted on a limited but still diverse set of cues. The proposed framework allows the set of cues to be extended easily. For example, in another work (see Li et al., in press), we successfully modelled the impact of situational changes in the amount of reverberation on externalization perception by extending the framework to reverberation-related cues. This further strengthens our assumption that our findings can be generalized. Nevertheless, we understand that more direct evidence for this generalizability would further increase the confidence in the conclusions we draw.

    ###Reviewer #1:

    I agree with the authors that the question at the basis of this work is timely and important both from the point of view of understanding auditory perception and for informing technology. However I am not convinced that the findings here will necessarily generalize to other stimuli/listening situations.

    I think the biggest limiting factor here is that the primary data on which the modelling is based are drawn from many different studies which used different stimuli, different tasks, different presentation environments and different equipment. I can see how testing the model on existing data is an important first step, but I would think that a critical next step is to form a set of (contrasting) predictions to be tested on a single stimulus set, within a single group of participants, as a way of confirming model validity. In this experiment I would also avoid using static non-reverberant environments since we know that these factors greatly affect spatial perception.

    We do not follow why the above-mentioned diversity of experimental paradigms is a limitation. On the contrary, in our opinion, the diversity of the considered experiments demonstrates the robustness of our findings across a variety of experimental procedures. We agree that an additional validation experiment would further strengthen our study, but we question its necessity and still believe that the present modelling work is extensive and compelling enough to warrant publication.

    Other comments:

    1. The title greatly overstates the main findings; it should be toned down.

    In the title, we aimed at describing the research topic in general terms accessible to a broad readership. We take your comment as advice to state the main findings instead.

    2. Intro, lines 30-33: this statement is misleading. As written, it appears to claim that temporal aspects of auditory perception are based on short-term regularity, whilst spatial perception is based on long-term effects. This is not correct; see, e.g., Ulanovsky 2004.

    Agreed. We will remove the sentence or rephrase it in more general terms because the misleading distinction is actually irrelevant to our study.

    3. As a reader not highly familiar with the auditory spatial processing literature I found the results section very dense and hard to follow. If you are targeting a general audience it is important to clarify concepts, avoid using abbreviations where possible etc.

    Thank you for your advice. We will aim to increase the level of abstraction within the results section.

    4. When discussing the various decision strategies which you tested, consider explaining how they might be implemented by the auditory system, at which stage of processing etc.

    Our study approached the problem from an algorithmic point of view and did not touch upon the more detailed level of neural implementation. While the cue processing has a clear neurophysiological basis in the subcortical layers of the auditory system, we will include some speculation about the involved cortical networks in a revised version of the manuscript.

    5. It is very difficult to evaluate your results without more information about the stimuli and studies from which they were taken. Whilst you do provide references, I think the paper would be much clearer if you provide a more complete description of the stimuli (even in table form; paradigms etc).

    We appreciate your advice and will provide more details about the simulated experiments in a table.

    ###Reviewer #2:

    The current study compares four decision rules, factoring in seven potential acoustic cues, for predicting perceived sound externalization for single-source binaural sound with stationary interaural cues. Test stimuli included a harmonic vowel complex, noise and speech. Results show that monaural and binaural cues shape externalization. However, how listeners weighted these cues varied across the tested conditions. The authors consider the fact that some of these cues covary acoustically, by additionally testing their model on subsets of two of these cues only. No single externalization cue emerged as a clear predictor for perceived externalization. However, overall, a static cue weighting strategy tended to outperform dynamic cue weighting for predicting externalization.

    Major concerns dampen enthusiasm for the current work.

    1. It is unclear what neural mechanism is being tested. A premise of the current approach is that perceived sound externalization is primarily driven by acoustic cues. However, we know this not to be true. Context matters. As pointed out by the authors (l370-372), when listening to sounds processed with head related transfer functions (HRTFs) over headphones, listeners can externalize sound better when the context of the test room matches the room where HRTFs were recorded (Werner and Klein 2014).

    Sound externalization is an auditory percept and as such primarily driven by acoustic cues. How those cues are used for perceptual inference is certainly context dependent. From the present study, we conclude that the auditory system evaluates deviations from a small set of expected acoustic cues with fixed weights (rather than selectively). We further explain that these expectations, which are represented as templates in the model, must be adaptive to the context. This is well in line with your example of room divergence (Werner and Klein, 2014): listeners are thought to establish expectations about reverberation-related acoustic cues and evaluate incoming sensory information against those expectations with a fixed weighting between cues. If expectations are not met (i.e., acoustic cues deviate from their templates), perceptual externalization degrades.

    2. Most external sounds are neither anechoic nor stationary. Therefore, any neural decision metric on externalization must have been shaped by lifelong experience with dynamic, reverberant cues for interpreting externalization. The current work mostly models stationary single source sound that was either anechoic or mildly reverberant, providing pristine spatial cues. I do not follow the authors' point that this would not matter (l498-502): "While the constant reverberation and visual information may or may not have stabilized auditory externalization, they certainly did not prevent the tested signal modifications to be effective within the tested condition. In our study, we thus assumed that such differences in experimental procedures do not modulate our effects of interest." That is an untested assumption.

    Others showed that the type of spectral manipulations we considered remain effective also if reverberation is present (e.g. Hassager et al., 2013) and if listeners are exposed to dynamic cues by moving their heads or the sound source (Brimijoin et al., 2013). We used the above-mentioned argument in order to motivate why we ignored certain differences across studies in the first place and the high explanatory power obtained with the proposed model framework suggests that this simplification was adequate. We agree that the above-mentioned sentence can be easily misunderstood and we will modify it by including the explanation stated here.

    3. Many of the current test stimuli are perceived as ambiguous - providing 50% externalization ratings - and thus do not provide a sensitive test of brain mechanisms of sound externalization.

    The field mostly agrees that auditory externalization is not a binary phenomenon but a matter of degree; we very recently published a review article that discusses this issue in detail (Best et al., 2020). Hence, the experimental outcomes, denoted as externalization scores and ranging from 0 to 1, indicate the degree of externalization that is considered to mediate perceived egocentric distance. The externalization scores do not indicate the level of perceptual ambiguity.

    We will include this explanation in the manuscript in order to prevent further misunderstanding.

    4. Reverberation enhances perceived externalization, but this cannot be predicted by any of the tested decision metrics which only consider stationary monaural or binaural cues.

    True, there are also other cues potentially affecting the degree of auditory externalization. Reverberation-related acoustic cues are among them. The main purpose of our study was to identify the basic functional mechanism that integrates or selects between various cues; the purpose was not the identification of all possible cues that may affect auditory externalization. Thus, we chose a set of experiments for which the relevant cues can be narrowed down a priori, allowing us in particular to ignore reverberation-related cues.

    For the effect of reverberation-related cues, we point interested readers to another modelling study (Li et al., in press) that we conducted in parallel, in which we applied the here proposed framework also to reverberation-related cues and obtained good predictions.

    On balance, this reviewer is unconvinced that the current work will generalize to realistic dynamic and reverberant conditions.

    We agree with the reviewer that our study does not address dynamic and variable reverberant conditions. It was limited by design to static conditions with fixed reverberation because we had no reason to believe that the targeted decision strategies applied to combine or select cues would be fundamentally different in more complex conditions.

    S. Werner and F. Klein, "Influence of Context Dependent Quality Parameters on the Perception of Externalization and Direction of an Auditory Event," presented at the AES 55th International Conference: Spatial Audio (2014 Aug.), conference paper 6-4.

    ###Reviewer #3:

    The manuscript "Decision making in auditory externalization perception" aims to identify cues that create/hinder an auditory externalization percept by using a template-based modeling approach. The approach as well as the findings are very interesting, and the study is thoroughly conducted. However, the manuscript adds little new knowledge to the field. Furthermore, a critical discussion is missing. The authors use a template-based model, but do not discuss the possible problems with such an approach. Particularly as each condition uses another model fit. This potentially allows the model to use cues that the auditory system cannot or does not consider. Nevertheless, the approach can still teach us which cues are potentially important for auditory externalization.

    1. The title seems inappropriate as the main work seems to be on the identification and combination of cues for externalization but not on the decision making.

    In combination with Reviewer #1’s first comment, we understand that the title could have been more specific. We will change the title accordingly.

    2. The model needs a more detailed explanation in the introduction. Otherwise the result section is not understandable without consulting the methods section.

    We will carefully re-evaluate which methodological details are necessary to understand the results section on a more abstract level.

    3. Add a Discussion on template-based models and fitting conditions. The risk of mathematically inspired models is that features are exploited that the auditory system cannot access. A more sophisticated front-end than a gammatone filterbank might reduce this risk. Alternatively, the use of physiologically inspired front-ends as in Scheidiger et al. (2018) might be interesting to consider. Nevertheless, I acknowledge that some of the features used in this study are backed by physiological and psychoacoustical studies.

    We share the general concern about using efficient functional approximations of the auditory periphery. However, we are confident that this particular approximation does not provide spurious cues, especially in the context of monaural spectral shapes, because we cross-validated the effectiveness of those cues with a physiologically more accurate model (Zilany et al., 2014) in previous work (Baumgartner et al., 2016).

    We will incorporate a corresponding explanation in the manuscript.

    4. It is known that the monaural spectral shape is important for externalization, for example from the studies that you have used. Thus, I partly question the novelty of the findings.

    We partly agree. It has also been suggested that interaural spectral cues are important for externalization perception. Further, it is also known that other cues contribute (e.g., reverberation-related cues as already discussed in response to the comments of Reviewer #2). Now, which cues contribute to which degree and how are they integrated? This is the main research question behind our study, with the ultimate goal to better understand the mechanisms of cue integration in the context of a perceptual inference task.

    5. I am not too familiar with template based models but I wonder if there is a problem if you use your models to fit and test with the same datasets?

    Cross-validation (i.e., using separate data sets for fitting/training, validating, and testing) is particularly important for complex models that allow overfitting. Such models can often be fit very closely to comparatively small sets of data, and thus the goodness of fit does not discriminate between those models. Here, in contrast, we compared the goodness of fit for models that contained a rather small and equal number of model parameters; this goodness of fit differed strongly across models and was therefore informative for model selection in itself. If we separated the data sets, we would need to jointly assess the differences in initial model fits (to training data) together with the differences in predictive power (for testing data).
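
    As a hedged illustration of why an equal parameter count makes the goodness of fit informative by itself (this is not an analysis from the manuscript), consider the Akaike information criterion for a least-squares fit under the assumption of Gaussian residuals. The symbols $n$ (number of data points), $k_m$ (number of free parameters of model $m$), $\mathrm{RSS}_m$ (residual sum of squares) and the constant $C$ are introduced here for illustration only.

    ```latex
    \mathrm{AIC}_m = n \ln\!\left(\frac{\mathrm{RSS}_m}{n}\right) + 2 k_m + C
    \qquad\Longrightarrow\qquad
    \mathrm{AIC}_A - \mathrm{AIC}_B
      = n \ln\!\left(\frac{\mathrm{RSS}_A}{\mathrm{RSS}_B}\right) + 2\,(k_A - k_B)
    ```

    With $k_A = k_B$, the penalty term vanishes and the comparison reduces to the ratio of residual errors, so ranking two such models by goodness of fit and ranking them by this criterion give the same order.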

    References:

    Baumgartner, R., Majdak, P., & Laback, B. (2016). Modeling the effects of sensorineural hearing loss on sound localization in the median plane. Trends in Hearing, 20, 2331216516662003.

    Best, V., Baumgartner, R., Lavandier, M., Majdak, P., & Kopčo, N. (2020). Sound Externalization: A Review of Recent Research. Trends in Hearing, 24, 2331216520948390.

    Brimijoin, W. O., Boyd, A. W., & Akeroyd, M. A. (2013). The contribution of head movement to the externalization and internalization of sounds. PloS one, 8(12), e83068.

    Li, S., Baumgartner, R., & Peissig, J. (in press). Modeling perceived externalization of a static, lateral sound image. Acta Acustica.

    Zilany, M. S., Bruce, I. C., & Carney, L. H. (2014). Updated parameters and expanded simulation options for a model of the auditory periphery. The Journal of the Acoustical Society of America, 135(1), 283-286.

  2. ###Reviewer #3:

    The manuscript "Decision making in auditory externalization perception" aims to identify cues that create/hinder an auditory externalization percept by using a template-based modeling approach. The approach as well as the findings are very interesting, and the study is thoroughly conducted. However, the manuscript adds little new knowledge to the field. Furthermore, a critical discussion is missing. The authors use a template-based model, but do not discuss the possible problems with such an approach. Particularly as each condition uses another model fit. This potentially allows the model to use cues that the auditory system cannot or does not consider. Nevertheless, the approach can still teach us which cues are potentially important for auditory externalization.

    1. The title seems inappropriate as the main work seems to be on the identification and combination of cues for externalization but not on the decision making.

    2. The model needs a more detailed explanation in the introduction. Otherwise the result section is not understandable without consulting the methods section.

    3. Add a Discussion on template-based models and fitting conditions. The risk of mathematically inspired models is that features are exploited that the auditory system cannot access. A more sophisticated front-end than a gammatone filterbank might reduce this risk. Alternatively, the use of physiologically inspired front-ends as in Scheidiger et al. (2018) might be interesting to consider. Nevertheless, I acknowledge that some of the features used in this study are backed by physiological and psychoacoustical studies.

    4. It is known that the monaural spectral shape is important for externalization, for example from the studies that you have used. Thus, I partly question the novelty of the findings.

    5. I am not too familiar with template based models but I wonder if there is a problem if you use your models to fit and test with the same datasets?

  3. ###Reviewer #2:

    The current study compares four decision rules, factoring in seven potential acoustic cues, for predicting perceived sound externalization for single-source binaural sound with stationary interaural cues. Test stimuli included a harmonic vowel complex, noise and speech. Results show that monaural and binaural cues shape externalization. However, how listeners weighted these cues varied across the tested conditions. The authors consider the fact that some of these cues covary acoustically, by additionally testing their model on subsets of two of these cues only. No single externalization cue emerged as a clear predictor for perceived externalization. However, overall, a static cue weighting strategy tended to outperform dynamic cue weighting for predicting externalization.

    Major concerns dampen enthusiasm for the current work.

    1. It is unclear what neural mechanism is being tested. A premise of the current approach is that perceived sound externalization is primarily driven by acoustic cues. However, we know this not to be true. Context matters. As pointed out by the authors (l370-372), when listening to sounds processed with head related transfer functions (HRTFs) over headphones, listeners can externalize sound better when the context of the test room matches the room where HRTFs were recorded (Werner and Klein 2014).

    2. Most external sounds are neither anechoic nor stationary. Therefore, any neural decision metric on externalization must have been shaped by lifelong experience with dynamic, reverberant cues for interpreting externalization. The current work mostly models stationary single source sound that was either anechoic or mildly reverberant, providing pristine spatial cues. I do not follow the authors' point that this would not matter (l498-502): "While the constant reverberation and visual information may or may not have stabilized auditory externalization, they certainly did not prevent the tested signal modifications to be effective within the tested condition. In our study, we thus assumed that such differences in experimental procedures do not modulate our effects of interest." That is an untested assumption.

    3. Many of the current test stimuli are perceived as ambiguous - providing 50% externalization ratings - and thus do not provide a sensitive test of brain mechanisms of sound externalization.

    4. Reverberation enhances perceived externalization, but this cannot be predicted by any of the tested decision metrics which only consider stationary monaural or binaural cues.

    On balance, this reviewer is unconvinced that the current work will generalize to realistic dynamic and reverberant conditions.

    S. Werner and F. Klein, "Influence of Context Dependent Quality Parameters on the Perception of Externalization and Direction of an Auditory Event," presented at the AES 55th International Conference: Spatial Audio (2014 Aug.), conference paper 6-4.

  4. ###Reviewer #1:

    I agree with the authors that the question at the basis of this work is timely and important both from the point of view of understanding auditory perception and for informing technology. However I am not convinced that the findings here will necessarily generalize to other stimuli/listening situations.

    I think the biggest limiting factor here is that the primary data on which the modelling is based are drawn from many different studies which used different stimuli, different tasks, different presentation environments and different equipment. I can see how testing the model on existing data is an important first step, but I would think that a critical next step is to form a set of (contrasting) predictions to be tested on a single stimulus set, within a single group of participants, as a way of confirming model validity. In this experiment I would also avoid using static non-reverberant environments since we know that these factors greatly affect spatial perception.

    Other comments:

    1. The title greatly overstates the main findings; it should be toned down.

    2. Intro, lines 30-33: this statement is misleading. As written, it appears to claim that temporal aspects of auditory perception are based on short-term regularity, whilst spatial perception is based on long-term effects. This is not correct; see, e.g., Ulanovsky 2004.

    3. As a reader not highly familiar with the auditory spatial processing literature I found the results section very dense and hard to follow. If you are targeting a general audience it is important to clarify concepts, avoid using abbreviations where possible etc.

    4. When discussing the various decision strategies which you tested, consider explaining how they might be implemented by the auditory system, at which stage of processing etc.

    5. It is very difficult to evaluate your results without more information about the stimuli and studies from which they were taken. Whilst you do provide references, I think the paper would be much clearer if you provide a more complete description of the stimuli (even in table form; paradigms etc).

  5. ##Preprint Review

    This preprint was reviewed using eLife’s Preprint Review service, which provides public peer reviews of manuscripts posted on bioRxiv for the benefit of the authors, readers, potential readers, and others interested in our assessment of the work. This review applies only to version 4 of the manuscript.

    ###Summary:

    As you will see the reviewers agreed that the premise behind this manuscript is important and timely both in the context of basic auditory science and for informing technology. However, they raised largely consistent concerns about the generalizability of your observations to other auditory stimuli and to more naturalistic listening conditions.