Preference for animate domain sounds in the fusiform gyrus of blind individuals is modulated by shape-action mapping

Abstract

In high-level visual shape areas in the human brain, preference for inanimate objects is observed regardless of stimulation modality (visual/auditory/tactile) and subjects’ visual experience (sighted/blind individuals), whereas preference for animate entities seems robust only in the visual modality. Here, we test a hypothesis explaining this effect: visual shape representations can be reliably activated through different sensory modalities when they are systematically related to action system representations. We studied fMRI activations in congenitally blind and sighted subjects listening to animal, object, and human sounds. We found that, in blind individuals, the typical anatomical location of the fusiform face area responds to human facial expression sounds, with a clear mapping between the facial motor action and the resulting face shape, but not to speech or animal sounds. Using face areas’ activation in the blind subjects we could distinguish between specific facial expressions used in the study, but not between specific speech sounds. We conclude that auditory stimulation can reliably activate visual representations of those stimuli – inanimate or animate - for which shape and action computations are transparently related. Our study suggests that visual experience is not necessary for the development of functional preference for face-related information in the fusiform gyrus.

Article activity feed

  1. ### Author Response:

    ### Summary:

    While the work addresses an interesting research question, several shortcomings have been raised by three independent reviewers. A first issue is the lack of theoretical clarity and linkage with prior work, as discussed by Reviewer 1 and Reviewer 2. A second critical set of concerns, raised by all reviewers, relates to the need for several additional analyses to nail down the interpretations proposed by the authors. Reviewer 2 specifically raised concerns regarding the interpretability of activation in auditory cortices, while Reviewer 3 provides insights on the MVPA analysis and suggests the possible use of RSA to clarify the main findings.

    While we respect the editor’s decision, we think that all points raised by Reviewer 1 and Reviewer 3 can be easily addressed through editing of the text and additional analyses. As we describe below, these revisions do not undermine the findings reported in our study – instead, they improve the clarity of the manuscript and further demonstrate that our results are genuine and robust. Furthermore, we believe that the points raised by Reviewer 2 are based on a misunderstanding. Differences in auditory properties across sound categories in our experiment cannot explain the pattern of results reported. Thus, additional analyses in the auditory cortex, proposed by Reviewer 2, can neither support nor undermine the claims made in our study. Nevertheless, we performed all the analyses suggested by Reviewer 2.

    We also want to stress that all reviewers find our study timely and of interest to a broad readership. Furthermore, Reviewer 1 and Reviewer 3 made a number of positive comments on the study's methodology. Overall, we believe that there are no doubts regarding the novelty and importance of our study, and that we are able to address all additional methodological concerns raised by the reviewers.

    ### Reviewer #1:

    Bola and colleagues asked whether the coupling in perception-action systems may be reflected in early representations of the face. The authors used fMRI to assess the responses of the human occipital temporal cortex (FFA in particular) to the presentation of emotional (laughing/crying), non-emotional (yawning/sneezing), speech (Chinese), object and animal sounds of congenitally blind and sighted participants. The authors present a detailed set of independent and direct univariate and multivariate contrasts, which highlight a striking difference of engagement to facial expressions in the OTC of the congenitally blind compared to the sighted participants. The specificity of facial expression sounds in OTC for the congenitally blind is well captured in the final MVPA analysis presented in Fig.5.

    We would like to thank the reviewer for an overall positive assessment of our work.

    -The use of "transparency of mapping" is rather metaphorical and hand-wavy for a non-expert audience. If the issue relates to the notion of compatibility of representational formats, then it should be expressed formally.

    Following the reviewer’s suggestion, we revised the introduction and clarified what we mean by “transparency of mapping”, and how this concept might be related to the compatibility of representations computed in different areas of the brain. As is now extensively explained, we propose that shape features of inanimate objects are directly relevant to our actions. In contrast, a relationship between shape and relevant actions is much less clear in the case of most animate objects. We hypothesized that this inherent difference between the inanimate and the animate domain, combined with evolutionary pressures for quick, accurate, and efficient object-directed responses, resulted in the inanimate vOTC areas being more strongly coupled with the action system, both in terms of manipulability and navigation, than the animate vOTC areas. The stronger coupling is likely to be reflected in the format of vOTC shape representation of inanimate objects being more compatible with the format of representations computed in the action system.

    -The theoretical stance of the authors does not clearly predict why blind individuals should show more precise emotional expressions in FFA as compared to sighted - as the authors start addressing in their Discussion. In the context of the action-perception loop, it is even more surprising considering that the sighted have direct training and visual access to the facial gestures of interlocutors, which they can internalize. Can the authors entertain alternative scenarios such as the need to rely on mental imagery for congenitally blind for instance?

    We agree that our approach does not predict the difference between the blind and the sighted subjects, and we openly discuss this in the discussion: “An unexpected finding in our study is the clear difference in vOTC univariate response to facial expression sounds across the congenitally blind and the sighted group”. We also propose an explanation of this unexpected difference. Specifically, we suggest that the interactions between the action system and the animate areas in the vOTC are relatively weak, even in the case of facial expressions – thus, they can be captured mostly in blind individuals, whose visual areas are known to increase their sensitivity to non-visual stimulation. This explanation can account for this unexpected between-group difference and is consistent with our theoretical proposal.

    The “mental imagery account” can be, in our opinion, divided into two distinct hypotheses. One version of this account would be to assume that the representation of animate entities typically computed in the vOTC (i.e., also in sighted people) can be activated through visual mental imagery (as suggested by several previous studies), and that this would affect our between-group comparisons. In that case, however, we should observe an effect opposite to that obtained in our study – namely, the activation in the vOTC animate areas should be stronger in the sighted subjects, since they, but not the congenitally blind participants, can create visual mental images (as the reviewer pointed out). This is clearly not what we observed. A second version of the mental imagery account would be to assume representational plasticity in the vOTC of blind individuals – that is, to assume that vOTC animate areas in this population switch from representing visual, face-related information to representing motor mental imagery, which presumably they can generate just like sighted individuals. However, such an account does not, on its own, explain why the animate vOTC areas in the congenitally blind participants are more strongly activated than they are in the sighted subjects, who can generate both visual and motor mental imagery. Based on these considerations, we do not think that the mental imagery account provides a sufficient explanation. Nonetheless, it is certainly a factor worth considering, which we now consider in the revised discussion of the reported results. Similar reasoning can be applied to other accounts which assume that the observed difference between the blind and the sighted group is a result of representational plasticity in this region in the blind group. Such accounts would need to propose a plausible dimension, different from face shape and its relation to the action system, that is captured by the animate vOTC areas in blind individuals. Since the effect we report is independent of auditory, emotional, social or linguistic dimensions present in our stimuli, it is hard to say what this dimension might be.

    We now elaborate on these important points in the Discussion section.

    ### Reviewer #2:

    The study by Bola and colleagues tested the specific hypothesis that visual shape representations can be reliably activated through different sensory modalities only when they systematically map onto action system computations. To this aim, the authors scanned a group of congenitally blind individuals and a group of sighted controls while subjects listened to multiple sound categories.

    While I find the study of general interest, I think that there are main methodological limitations, which do not allow to support the general claim.

    We would like to thank the reviewer for this assessment. Below, we argue that the results presented in the paper support our claim, and that they cannot be explained by alternative accounts described by the reviewer.

    Main concerns

    1. Auditory stimuli have been equalized to have the same RMS (-20 dB). In my opinion, this is not a sufficient control. As shown in Figure 3 - figure supplement 1, the different sound categories elicited extremely different patterns of response in A1. This is clearly linked to intrinsic sound properties. In my opinion without a precise characterization of sound properties across categories, it is not possible to conclude that the observed effects in face responsive regions (incidentally, as assessed using an atlas and not a localizer) are explained by the different category types. On the stimulus side, authors should at least provide (a) spectrograms and (b) envelope dynamics; in case sound properties would differ across categories all results might have a confound associated to stimuli selection.

    We now present spectrograms and waveforms for the sounds used in the study in the Methods section. We did not present this information in the original version of the paper because, in our opinion, it is quite obvious that sounds from different categories will differ in terms of their auditory properties – after all, this is why we can distinguish among human speech, animal sounds, or object sounds. Thus, differences in sound properties across conditions are an inherent characteristic of every study comparing sounds from several domains or semantic categories (e.g., human vs. non-human), including our own study. We now clarify this issue in the Methods section of the manuscript.
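    For illustration, descriptive acoustic summaries of this kind can be produced with a short script. The sketch below is not our stimulus-preparation pipeline; the file names are hypothetical placeholders, and the STFT spectrogram and frame-wise RMS envelope are simply one reasonable way to visualize the sound properties requested by the reviewer.

    ```python
    # Minimal sketch (not the authors' pipeline): plot a spectrogram and an RMS
    # amplitude envelope for a few stimulus files. File names are hypothetical.
    import numpy as np
    import librosa
    import librosa.display
    import matplotlib.pyplot as plt

    stimuli = {
        "facial expression": "stimuli/laugh_01.wav",   # hypothetical paths
        "speech": "stimuli/speech_01.wav",
        "object": "stimuli/object_01.wav",
    }

    fig, axes = plt.subplots(len(stimuli), 2, figsize=(10, 3 * len(stimuli)))
    for row, (label, path) in enumerate(stimuli.items()):
        y, sr = librosa.load(path, sr=None)            # keep native sampling rate

        # Spectrogram: dB-scaled magnitude of the short-time Fourier transform
        S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
        librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz",
                                 ax=axes[row, 0])
        axes[row, 0].set_title(f"{label}: spectrogram")

        # Envelope dynamics: frame-wise RMS energy over time
        rms = librosa.feature.rms(y=y)[0]
        t = librosa.frames_to_time(np.arange(len(rms)), sr=sr)
        axes[row, 1].plot(t, rms)
        axes[row, 1].set_title(f"{label}: RMS envelope")
        axes[row, 1].set_xlabel("time (s)")

    fig.tight_layout()
    plt.show()
    ```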

    Having said that, we believe that differences in acoustic properties across sound categories cannot explain the results in the vOTC reported in our work. We report that, in blind subjects, the vOTC face areas respond more strongly to sounds of emotional facial expressions and non-emotional facial expressions than to speech sounds, animal sounds and object sounds. These brain areas did not show differential responses to the two expression categories or to the three other sound categories. To explain this pattern of results, the “acoustic confound account” would need to assume that there is some special auditory property that differentiates both types of expression sounds from the other categories, but does not differentiate sound categories in any other comparison. Moreover, this account would need to further assume that this is precisely the auditory dimension to which the vOTC face areas are sensitive, while being insensitive to other auditory characteristics, different across the other sound categories (e.g., across object sounds and animal sounds, or expression sounds and speech sounds – as the reviewer pointed out, all categories are acoustically very different, as indicated by the activation of A1). We find this account extremely unlikely. We now comment on these points in the Methods and Results sections.

    1. More on the same point: the authors use the activation of A1 as a further validation of the results in face selective areas. Page 16 line 304 "We observed activation pattern that was the same for the blind and the sighted subjects, and markedly different from the pattern that was observed in the fusiform gyrus in the blind group (see Fig. 1D). This suggests that the effects detected in this region in the blind subjects were not driven by the differences in acoustic characteristics of sounds, as such characteristics are likely to be captured by activation patterns of the primary auditory cortex." It is the opinion of this reader that this control, despite being important, does not support the claim. A1 is certainly a good region to show how basic sound properties are mapped. However, the same type of analysis should be performed in higher auditory areas, as STS. If result patterns would be similar to the FFA region, I guess that the current interpretation of results would not hold.

    As we discuss above, we believe that the explanation of the results observed in the vOTC in terms of “acoustic confound” does not hold, even without any empirical analysis in the auditory cortex. The analysis in A1 was planned to clearly illustrate this point and to support the interpretation of a potentially unexpected pattern of results across sound categories (no such pattern was observed).

    However, per the reviewer’s request, we also performed an ROI analysis in the STS. Specifically, we chose two ROIs – a broad and bilateral ROI covering the whole STS, and a more constrained ROI covering the right posterior STS (rpSTS), known to be a part of the face processing network and to respond primarily to dynamic aspects of the face shape. As can be seen in the Supplementary Materials, the pattern of responses in the broad STS ROI is markedly different from the one observed in the FFA. In particular, the magnitude of the STS activation is clearly different for speech sounds, animal sounds, and object sounds, in both the blind and the sighted group. In the case of the FFA, the activation magnitudes for these three sound categories were indistinguishable. Furthermore, in the blind group, the STS showed stronger activation for emotional facial expression sounds than for non-emotional expression sounds. Again, such a difference was not observed in the FFA (if anything, the FFA showed slightly stronger activation for non-emotional expression sounds in the blind group). The pattern of the rpSTS responses is more similar to the responses observed in the FFA. This is exactly what can be expected based on our hypothesis that the FFA in the blind group is sensitive primarily to dynamic facial reconfigurations, with a transparent link between motor and visual shape representations. Overall, we think that the pattern of results observed in the auditory cortex is fully in line with our hypothesis – the auditory regions (A1 and STS, defined broadly) show responses that are different from the responses observed in the FFA (one may hypothesize that responses in the auditory regions are driven by low-level auditory features of stimuli to a larger extent); the rpSTS, which is specialized in the processing of dynamic aspects of the face shape, shows a pattern of responses that is more similar to the pattern observed in the FFA. Importantly, the responses in the rpSTS were not different across subject groups. As we describe below, this pattern of results was also observed in the MVPA. We now report all the above-described results in the paper.
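    For readers interested in how such an ROI comparison can be set up, a minimal sketch is given below. It is not our actual analysis code: the mask and beta-map file names, the subject count, and the specific paired comparison are illustrative placeholders, under the assumption of one beta map per condition and subject.

    ```python
    # Minimal sketch (illustrative only): extract mean beta estimates per sound
    # category from an STS mask and compare two categories with a paired t-test.
    # File names, subject count, and condition labels are hypothetical.
    import numpy as np
    from nilearn.maskers import NiftiMasker
    from scipy import stats

    conditions = ["emotional_expr", "nonemotional_expr", "speech", "animal", "object"]
    subjects = [f"sub-{i:02d}" for i in range(1, 21)]            # e.g., 20 subjects

    masker = NiftiMasker(mask_img="rois/sts_bilateral.nii.gz")   # hypothetical ROI mask

    roi_means = np.zeros((len(subjects), len(conditions)))
    for s, sub in enumerate(subjects):
        for c, cond in enumerate(conditions):
            beta_map = f"betas/{sub}_{cond}.nii.gz"              # one beta map per condition
            roi_means[s, c] = masker.fit_transform(beta_map).mean()

    # Example paired comparison: emotional vs. non-emotional expression sounds
    t_val, p_val = stats.ttest_rel(roi_means[:, 0], roi_means[:, 1])
    print(f"emotional vs. non-emotional expressions: t = {t_val:.2f}, p = {p_val:.3f}")
    ```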

    1. Linked to the previous point. Given that the authors implemented an MVPA pipeline at the ROI level, it is important to perform the same analysis in both groups, but especially in the blind, in areas such as STS as well as in a control region, engaged by the task (with signal) to check the specificity of the FFA activation.

    Per the reviewer’s request, we additionally performed the MVPA in three control regions. Firstly, we performed the analysis in the auditory cortex, defined as A1 and the STS combined. We treated this area as a positive control – particularly, given the acoustic differences between sound categories, we expected to successfully decode all sound categories from the activity of this ROI. Secondly, we performed the analysis in the parahippocampal place area (PPA). We treated the PPA as a negative control – given that this area does not seem to contain much information about animate entities, we did not expect to find effects there for most of our comparisons. Furthermore, as the PPA is the vOTC area bordering the FFA, negative results in this area would demonstrate the spatial specificity of our results. Thirdly, we performed the analysis in the rpSTS – here, we expected to observe results similar to the ones observed in the FFA, for the reasons provided above. We now present the results of these analyses as supplementary figures.
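    As a rough illustration of this type of ROI-based decoding, the sketch below runs a linear support vector machine on ROI activity patterns with leave-one-run-out cross-validation. The data are random placeholders, and the classifier and cross-validation scheme are generic choices rather than a reproduction of our exact settings.

    ```python
    # Minimal sketch (placeholder data): cross-validated decoding of sound
    # categories from ROI activity patterns with a linear SVM.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

    rng = np.random.default_rng(0)
    n_trials, n_voxels, n_runs, n_classes = 100, 300, 5, 5

    X = rng.standard_normal((n_trials, n_voxels))               # trial-wise ROI patterns (placeholder)
    y = np.tile(np.arange(n_classes), n_trials // n_classes)    # 5 sound categories
    runs = np.repeat(np.arange(n_runs), n_trials // n_runs)     # run labels for CV folds

    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    acc = cross_val_score(clf, X, y, groups=runs, cv=LeaveOneGroupOut())
    print(f"mean decoding accuracy: {acc.mean():.3f} (chance = {1 / n_classes:.3f})")
    ```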

    We were able to successfully distinguish all sound categories, in both groups, based on the activation of the auditory cortex (all p = 0.001; the lowest value that can be achieved in our permutation analysis). Furthermore, based on the activation of this area, we were able to classify specific facial expressions, specific speech sounds, and the gender of the actor, in contrast to the result from the FFA, where the decoding of facial expressions was the only positive result.

    As expected, the decoding of animate sound categories was generally not successful in the PPA. However, activation of this area did allow us, to some extent, to distinguish object sounds from animate sounds – especially in the blind group. Furthermore, based on the PPA activation, we were not able to classify specific facial expressions, speech sounds, or the gender of the actor. These findings confirm that the effects reported for the FFA are specific to certain parts of the brain, and even to certain parts of the vOTC.

    As expected, the results in the rpSTS were the most similar to those observed in the FFA – while the activation of this region was diagnostic of all categorical distinctions, the more detailed analysis showed that it represented differences between specific facial expressions, but not between specific speech sounds or the gender of the actors producing the expressions (a similar pattern of results was observed in both groups). This is the same specificity that the FFA in blind people shows.

    Finally, we would like to stress that the difference between the results observed in the FFA and the PPA is yet another argument against interpreting the results in the FFA as being driven by auditory properties of stimuli – an issue that we discussed in detail above. We see no reason why putative acoustic influences on the vOTC responses in the blind group should be present in the FFA, but not in the PPA.

    1. I find the manuscript rather biased with regard to the literature. This is a topic which has been extensively investigated in the past. For instance, the manuscript does not include relevant references for the present context, such as:

    Plaza, P., Renier, L., De Volder, A., & Rauschecker, J. (2015). Seeing faces with your ears activates the left fusiform face area, especially when you're blind. Journal of vision, 15(12), 197-197.

    Kitada, R., Okamoto, Y., Sasaki, A. T., Kochiyama, T., Miyahara, M., Lederman, S. J., & Sadato, N. (2013). Early visual experience and the recognition of basic facial expressions: involvement of the middle temporal and inferior frontal gyri during haptic identification by the early blind. Frontiers in human neuroscience, 7, 7.

    Pietrini, P., Furey, M. L., Ricciardi, E., Gobbini, M. I., Wu, W. H. C., Cohen, L., ... & Haxby, J. V. (2004). Beyond sensory images: Object-based representation in the human ventral pathway. Proceedings of the National Academy of Sciences, 101(15), 5658-5663.

    The first reference listed by the reviewer is actually a conference abstract. Thus, we feel that it would be premature to give it weight comparable to peer-reviewed papers. Furthermore, based on the abstract, without the published paper, we cannot assess the robustness of the results and their relevance to our study (in particular, it is unclear whether any effects were observed in the right FFA, and whether a statistically significant difference between blind and sighted subjects was detected).

    In the second reference, the authors did not observe effects in the FFA in the visual version of their experiment with sighted subjects, at the threshold of p < 0.05, corrected for multiple comparisons. In our opinion, this makes the null result of the tactile experiment, reported for the FFA, hard to interpret – thus, while the paper is very interesting in certain contexts, it is not particularly informative when it comes to the question addressed here.

    While the third reference reports interesting results, it does not investigate preference for inanimate objects or animate objects in the vOTC, which is the main topic of our paper (only comparisons vs. rest and between- and within-category correlations are reported). Furthermore, based on that study, we cannot conclude whether effects reported for faces are found in the face areas or in other parts of the vOTC (no analyses in specific vOTC areas were reported).

    These were the reasons why we did not refer to these materials in the previous version of the manuscript. Importantly, none of them compel us to revise our claims, and we refer to a number of other papers, directly relevant to the question we are interested in – that is, the difference between vOTC animate and inanimate areas in sensitivity to non-visual stimulation. Nevertheless, we agree that referring to materials suggested by the reviewer might be informative for non-expert readers – thus, we cite them in the revised version of our paper.

    ### Reviewer #3:

    Bola and colleagues set out to test the hypothesis that vOT domain specific organization is due to the evolutionary pressure to couple visual representations and downstream computations (e.g., action programs). A prediction of such theory is that cross-modal activations (e.g., response in FFA to face-related sounds) should be detected as a function of the transparency of such coupling (e.g., sounds associated with facial expression > speech).

    To this end, the Authors compared brain activity of 20 congenitally blind and 22 sighted subjects undergoing fMRI while performing a semantic judgment task (i.e., is it produced by a human?) on sounds belonging to 5 different categories (emotional and non-emotional facial expressions, speech, object sounds and animal sounds). The results indicate preferential response to sounds associated with facial expressions (vs. speech or animal/objects sounds) in the fusiform gyrus of blind individuals regardless of the emotional content.

    The issue tackled is relevant and timely for the field, and the method chosen (i.e., clinical model + univariate and multivariate fMRI analyses) well suited to address it. The analyses performed are overall sound and the paper clear and exhaustive.

    We thank the reviewer for this positive assessment.

    1. While I overall understand why the Authors would choose a broader ROI for multivariate (vs. univariate) analyses, I believe it would be appropriate to show both analyses on both ROIs. In particular, the fact that the ROI used for the univariate analyses is right-hemisphere only, while the multivariate one is bilateral should be (at least) discussed.

    We briefly discuss this issue in the Methods section: “The reason behind broader and bilateral ROI definition was that the multivariate analysis relies on dispersed and subthreshold activation and deactivation patterns, which might be well represented also by cross-talk between hemispheres (for example, a certain subcategory might be represented by activation of the right FFA and deactivation of the left analog of this area).”

    Constraining the FFA ROI in the multivariate analysis (i.e., using the same ROI as was used in the univariate analysis) makes the results slightly weaker in both groups. However, the pattern of results is qualitatively comparable. A slight decrease in statistical power can be expected, for the reasons described in the Methods and quoted above.

    Similarly, using the broader FFA ROI in the univariate analysis (i.e., using the same ROI as was used in the multivariate analysis) results in qualitatively comparable but slightly weaker effects in the blind group and no change in the sighted subjects (no difference between sound categories). Again, this is to be expected – visual studies show that functional sensitivity to face-related stimuli is weaker in the left counterpart of the FFA than in the right FFA. This is also the case in our data – using the broader, bilateral ROI essentially averages a stronger effect in the right FFA with a more subtle effect in the left counterpart of the FFA.

    We now clarify this issue in the Methods section.

    1. The significance of the multivariate results is established testing the cross-validated classification accuracy against chance-level with t-tests. Did these tests consider the hypothetical chance level based on class number? A permutation scheme assessing the null distribution would be advisable. In general, more details should be provided with respect to the multivariate analyses performed, for instance the confusion matrix in Figure 5B is never mentioned in the text.

    Yes, the chance level was calculated in a standard way, by dividing 100% by the number of conditions/classes included in the analysis (note that all stimulus classes were presented an equal number of times). To respond to the reviewer’s comment, we used a permutation approach to recalculate the significance of all MVPA analyses reported in the paper (note that the whole-brain univariate analyses were already performed within a permutation framework). To this end, we reran each analysis 1000 times with condition labels randomized and compared the actual result with the null distribution created in this way (see the Methods section for details). We replicated all results reported in the paper. We now report this new analysis in the manuscript, changing the figure legends and the Methods section accordingly.
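    For illustration, the permutation logic can be sketched as below, here using scikit-learn's permutation_test_score: the classification is repeated with shuffled condition labels to build a null distribution, and the p-value is the proportion of null scores at least as high as the observed one, so with 1000 permutations the smallest attainable p-value is 1/1001 ≈ 0.001 (the floor mentioned elsewhere in our responses). The data below are placeholders, not our actual ROI patterns.

    ```python
    # Minimal sketch (placeholder data): permutation test of decoding accuracy.
    # Labels are shuffled 1000 times to build a null distribution of accuracies;
    # the p-value is (number of null scores >= observed, plus 1) / (1000 + 1),
    # so the smallest attainable value is ~0.001.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import LeaveOneGroupOut, permutation_test_score

    rng = np.random.default_rng(1)
    X = rng.standard_normal((100, 300))                # trial-wise ROI patterns (placeholder)
    y = np.tile(np.arange(5), 20)                      # 5 sound categories
    runs = np.repeat(np.arange(5), 20)                 # run labels for CV folds

    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    score, null_scores, p_value = permutation_test_score(
        clf, X, y, groups=runs, cv=LeaveOneGroupOut(),
        n_permutations=1000, random_state=0)
    print(f"observed accuracy = {score:.3f}, permutation p = {p_value:.3f}")
    ```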

    The confusion matrix was not mentioned in the text because it is not a separate analysis. As explained in the figure legend, it is just a graphical representation of the classifier’s performance (i.e., its choices for specific stimulus classes) during the decoding analysis reported in Fig. 5A. To clarify this, we now briefly mention the graph presented in Fig. 5B in the main text.

    1. I wonder whether a representational similarity approach could be useful in better delineating similarity/differences in blind vs. sighted participants sounds representations in vOT. Such analysis could also help further exploring potential graded effects: i.e., sounds associated with facial expression (face related, with salient link to movement) > speech (face related, with less salient link with movement) > animals sounds (non-human face related) > object sounds (not face related at all). The above-mentioned confusion matrix could be the starting point of such investigation.

    We thank the reviewer for this interesting suggestion. In response to this comment, we performed an additional RSA analysis, aimed at investigating graded similarity in the FFA response patterns across the categories used in the experiment. Based on our hypothesis, we created a simple theoretical model assuming that responses to both types of facial expression sounds are the most similar to each other (animate sounds with high shape-action mapping transparency), somewhat similar to speech sounds (animate sounds with weaker shape-action mapping transparency), and the least similar to animal and object sounds (animate sounds with no clear shape-action mapping transparency and inanimate sounds). We observed a significant correlation between this theoretical model and FFA response patterns in the blind group (pFDR = 0.012), but not in the sighted group (pFDR = 0.223). We believe that the RSA analysis further supports our visual-shape-to-action mapping conjecture, at least when it comes to blind subjects (see the Discussion section for our interpretation of the observed differences between the blind and the sighted subjects). We describe this new analysis in the revised text.
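    To make the logic of this analysis concrete, a minimal sketch is provided below: a theoretical dissimilarity model encoding graded shape-action mapping transparency is compared, using Spearman correlation, with a neural RDM computed from condition-wise FFA patterns. The model values and the FFA patterns in the sketch are illustrative placeholders, not the exact model or data used in the reported analysis.

    ```python
    # Minimal sketch (placeholder data): compare a theoretical dissimilarity model
    # with a neural RDM built from condition-wise FFA activity patterns.
    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.stats import spearmanr

    conditions = ["emotional_expr", "nonemotional_expr", "speech", "animal", "object"]

    # Theoretical model RDM (illustrative values): expression sounds most similar
    # to each other, somewhat similar to speech, least similar to animal/object sounds.
    model_rdm = np.array([
        [0, 1, 2, 3, 3],   # emotional expressions
        [1, 0, 2, 3, 3],   # non-emotional expressions
        [2, 2, 0, 3, 3],   # speech
        [3, 3, 3, 0, 3],   # animal sounds
        [3, 3, 3, 3, 0],   # object sounds
    ], dtype=float)

    # Neural RDM: correlation distance between mean FFA patterns per condition
    rng = np.random.default_rng(2)
    ffa_patterns = rng.standard_normal((len(conditions), 300))   # placeholder (5 x voxels)
    neural_rdm = squareform(pdist(ffa_patterns, metric="correlation"))

    # Correlate the lower triangles of the two RDMs
    tri = np.tril_indices(len(conditions), k=-1)
    rho, p = spearmanr(model_rdm[tri], neural_rdm[tri])
    print(f"model-neural correlation: rho = {rho:.2f}, p = {p:.3f}")
    ```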

  2. ### Reviewer #3:

    Bola and colleagues set out to test the hypothesis that vOT domain specific organization is due to the evolutionary pressure to couple visual representations and downstream computations (e.g., action programs). A prediction of such theory is that cross-modal activations (e.g., response in FFA to face-related sounds) should be detected as a function of the transparency of such coupling (e.g., sounds associated with facial expression > speech).

    To this end, the Authors compared brain activity of 20 congenitally blind and 22 sighted subjects undergoing fMRI while performing a semantic judgment task (i.e., is it produced by a human?) on sounds belonging to 5 different categories (emotional and non-emotional facial expressions, speech, object sounds and animal sounds). The results indicate preferential response to sounds associated with facial expressions (vs. speech or animal/objects sounds) in the fusiform gyrus of blind individuals regardless of the emotional content.

    The issue tackled is relevant and timely for the field, and the method chosen (i.e., clinical model + univariate and multivariate fMRI analyses) well suited to address it. The analyses performed are overall sound and the paper clear and exhaustive.

    1. While I overall understand why the Authors would choose a broader ROI for multivariate (vs. univariate) analyses, I believe it would be appropriate to show both analyses on both ROIs. In particular, the fact that the ROI used for the univariate analyses is right-hemisphere only, while the multivariate one is bilateral should be (at least) discussed.

    2. The significance of the multivariate results is established testing the cross-validated classification accuracy against chance-level with t-tests. Did these tests consider the hypothetical chance level based on class number? A permutation scheme assessing the null distribution would be advisable. In general, more details should be provided with respect to the multivariate analyses performed, for instance the confusion matrix in Figure 5B is never mentioned in the text.

    3. I wonder whether a representational similarity approach could be useful in better delineating similarity/differences in blind vs. sighted participants sounds representations in vOT. Such analysis could also help further exploring potential graded effects: i.e., sounds associated with facial expression (face related, with salient link to movement) > speech (face related, with less salient link with movement) > animals sounds (non-human face related) > object sounds (not face related at all). The above-mentioned confusion matrix could be the starting point of such investigation.

  3. ### Reviewer #2:

    The study by Bola and colleagues tested the specific hypothesis that visual shape representations can be reliably activated through different sensory modalities only when they systematically map onto action system computations. To this aim, the authors scanned a group of congenitally blind individuals and a group of sighted controls while subjects listened to multiple sound categories.

    While I find the study of general interest, I think that there are main methodological limitations, which do not allow to support the general claim.

    Main concerns

    1. Auditory stimuli have been equalized to have the same RMS (-20 dB). In my opinion, this is not a sufficient control. As shown in Figure 3 - figure supplement 1, the different sound categories elicited extremely different patterns of response in A1. This is clearly linked to intrinsic sound properties. In my opinion without a precise characterization of sound properties across categories, it is not possible to conclude that the observed effects in face responsive regions (incidentally, as assessed using an atlas and not a localizer) are explained by the different category types. On the stimulus side, authors should at least provide (a) spectrograms and (b) envelope dynamics; in case sound properties would differ across categories all results might have a confound associated to stimuli selection.

    2. More on the same point: the authors use the activation of A1 as a further validation of the results in face selective areas. Page 16 line 304 "We observed activation pattern that was the same for the blind and the sighted subjects, and markedly different from the pattern that was observed in the fusiform gyrus in the blind group (see Fig. 1D). This suggests that the effects detected in this region in the blind subjects were not driven by the differences in acoustic characteristics of sounds, as such characteristics are likely to be captured by activation patterns of the primary auditory cortex." It is the opinion of this reader that this control, despite being important, does not support the claim. A1 is certainly a good region to show how basic sound properties are mapped. However, the same type of analysis should be performed in higher auditory areas, as STS. If result patterns would be similar to the FFA region, I guess that the current interpretation of results would not hold.

    3. Linked to the previous point. Given that the authors implemented an MVPA pipeline at the ROI level, it is important to perform the same analysis in both groups, but especially in the blind, in areas such as STS as well as in a control region, engaged by the task (with signal) to check the specificity of the FFA activation.

    4. I find the manuscript rather biased with regard to the literature. This is a topic which has been extensively investigated in the past. For instance, the manuscript does not include relevant references for the present context, such as:

    Plaza, P., Renier, L., De Volder, A., & Rauschecker, J. (2015). Seeing faces with your ears activates the left fusiform face area, especially when you're blind. Journal of vision, 15(12), 197-197.

    Kitada, R., Okamoto, Y., Sasaki, A. T., Kochiyama, T., Miyahara, M., Lederman, S. J., & Sadato, N. (2013). Early visual experience and the recognition of basic facial expressions: involvement of the middle temporal and inferior frontal gyri during haptic identification by the early blind. Frontiers in human neuroscience, 7, 7.

    Pietrini, P., Furey, M. L., Ricciardi, E., Gobbini, M. I., Wu, W. H. C., Cohen, L., ... & Haxby, J. V. (2004). Beyond sensory images: Object-based representation in the human ventral pathway. Proceedings of the National Academy of Sciences, 101(15), 5658-5663.

  4. ### Reviewer #1:

    Bola and colleagues asked whether the coupling in perception-action systems may be reflected in early representations of the face. The authors used fMRI to assess the responses of the human occipital temporal cortex (FFA in particular) to the presentation of emotional (laughing/crying), non-emotional (yawning/sneezing), speech (Chinese), object and animal sounds of congenitally blind and sighted participants. The authors present a detailed set of independent and direct univariate and multivariate contrasts, which highlight a striking difference of engagement to facial expressions in the OTC of the congenitally blind compared to the sighted participants. The specificity of facial expression sounds in OTC for the congenitally blind is well captured in the final MVPA analysis presented in Fig.5.

    -The use of "transparency of mapping" is rather metaphorical and hand-wavy for a non-expert audience. If the issue relates to the notion of compatibility of representational formats, then it should be expressed formally.

    -The theoretical stance of the authors does not clearly predict why blind individuals should show more precise emotional expressions in FFA as compared to sighted - as the authors start addressing in their Discussion. In the context of the action-perception loop, it is even more surprising considering that the sighted have direct training and visual access to the facial gestures of interlocutors, which they can internalize. Can the authors entertain alternative scenarios such as the need to rely on mental imagery for congenitally blind for instance?

  5. ## Preprint Review

    This preprint was reviewed using eLife’s Preprint Review service, which provides public peer reviews of manuscripts posted on bioRxiv for the benefit of the authors, readers, potential readers, and others interested in our assessment of the work. This review applies only to version 3 of the manuscript.

    ### Summary:

    While the work addresses an interesting research question, several shortcomings have been raised by three independent reviewers. A first issue is the lack of theoretical clarity and linkage with prior work, as discussed by Reviewer 1 and Reviewer 2. A second critical set of concerns, raised by all reviewers, relates to the need for several additional analyses to nail down the interpretations proposed by the authors. Reviewer 2 specifically raised concerns regarding the interpretability of activation in auditory cortices, while Reviewer 3 provides insights on the MVPA analysis and suggests the possible use of RSA to clarify the main findings.