The Spatial Frequency Representation Predicts Category Coding in the Inferior Temporal Cortex

Curation statements for this article:
  • Curated by eLife


This article has been reviewed by the following groups


Abstract

Abstract

Understanding the neural representation of spatial frequency (SF) in the primate cortex is vital for unraveling visual processing mechanisms in object recognition. While numerous studies concentrate on the representation of SF in the primary visual cortex, the characteristics of SF representation and its interaction with category representation remain inadequately understood. To explore SF representation in the inferior temporal (IT) cortex of macaque monkeys, we conducted extracellular recordings with complex stimuli systematically filtered by SF. Our findings reveal explicit SF coding at both the single-neuron and population levels in the IT cortex. Moreover, the coding of SF content exhibits a coarse-to-fine pattern, with coding performance declining as SF increases. Temporal dynamics analysis of SF representation reveals that low SF (LSF) is decoded faster than high SF (HSF), and that SF preference dynamically shifts from LSF to HSF over time. Additionally, the SF representation of each neuron forms a profile that predicts category selectivity at the population level. IT neurons can be clustered into four groups based on SF preference, each exhibiting different category coding behaviors. In particular, HSF-preferred neurons demonstrate the highest category decoding performance for face stimuli. Despite this connection between SF and category coding, we identified uncorrelated representations of SF and category: in contrast to category coding, SF coding is sparser and relies more on the representations of individual neurons. Comparing SF representation in the IT cortex with deep neural networks, we observed no relationship between SF representation and category coding in the networks. However, SF coding, as a category-orthogonal property, is evident across various ventral stream models. These results dissociate the representations of SF and object category, underscoring the pivotal role of SF in object recognition.

Article activity feed

  1. eLife Assessment

    This useful study aimed to examine the relationship of spatial frequency selectivity of single macaque inferotemporal (IT) neurons to category selectivity. Interesting findings in this report suggest a shift in preferred spatial frequency during the response, from low to high spatial frequencies. This agrees with a coarse-to-fine processing strategy, which is in line with multiple studies in the early visual cortex. Some of the findings were difficult to evaluate because the methods are incomplete. The conclusion that single-unit spatial frequency selectivity can predict object coding requires further evidence to confirm.

  2. Reviewer #1 (Public Review):

This study reports that spatial frequency representation can predict category coding in the inferior temporal cortex. The original conclusion was based on likely problematic stimulus timing (33 ms, which was too brief). Now the authors claim that they also have a different set of data on the basis of a longer stimulus duration (200 ms).

One big issue in the original report was that the experiments used a stimulus duration that was too brief and could have weakened the effects of high spatial frequencies and confounded the conclusions. Now the authors have provided a new set of data on the basis of a longer stimulus duration and made the claim that the conclusions are unchanged. According to the authors, these new data and the data in the original report were collected at the same time.

The authors may provide an explanation of why they performed the same experiments using two stimulus durations and only reported one data set with the brief duration. They may also explain why they opted not to mention in the original report the existence of another data set with a different stimulus duration, which would otherwise have certainly strengthened their main conclusions.

  3. Reviewer #2 (Public Review):

    Summary:

    This paper aimed to examine the spatial frequency selectivity of macaque inferotemporal (IT) neurons and its relation to category selectivity. The authors suggest in the present study that some IT neurons show a sensitivity for the spatial frequency of scrambled images. Their report suggests a shift in preferred spatial frequency during the response, from low to high spatial frequencies. This agrees with a coarse-to-fine processing strategy, which is in line with multiple studies in the early visual cortex. In addition, they report that the selectivity for faces and objects, relative to scrambled stimuli, depends on the spatial frequency tuning of the neurons.

    Strengths:

Previous studies using human fMRI and psychophysics studied the contribution of different spatial frequency bands to object recognition, but, as pointed out by the authors, little is known about the spatial frequency selectivity of single IT neurons. This study addresses this gap and shows spatial frequency selectivity in IT for scrambled stimuli that drive the neurons poorly. They related this weak spatial frequency selectivity to category selectivity, but these findings are premature given the low number of stimuli they employed to assess category selectivity.

The authors revised their manuscript and provided some clarifications regarding their experimental design and data analysis. They responded to most of my comments, but I find that some issues were addressed only partially or poorly. The new data they provided confirmed my concern about low responses to their scrambled stimuli. Thus, this paper shows spatial frequency selectivity in IT for scrambled stimuli that drive the neurons poorly (see main comments below). They related this (weak) spatial frequency selectivity to category selectivity, but these findings are premature given the low number of stimuli used to assess category selectivity.

  4. Author response:

    The following is the authors’ response to the original reviews.

    Public Reviews:

    Reviewer #1 (Public Review):

    This study reports that spatial frequency representation can predict category coding in the inferior temporal cortex.

    Thank you for taking the time to review our manuscript. We greatly appreciate your valuable feedback and constructive comments, which have been instrumental in improving the quality and clarity of our work.

The original conclusion was based on likely problematic stimulus timing (33 ms, which was too brief). Now the authors claim that they also have a different set of data on the basis of a longer stimulus duration (200 ms).

One big issue in the original report was that the experiments used a stimulus duration that was too brief and could have weakened the effects of high spatial frequencies and confounded the conclusions. Now the authors have provided a new set of data on the basis of a longer stimulus duration and made the claim that the conclusions are unchanged. According to the authors, these new data and the data in the original report were collected at the same time.

The authors may provide an explanation of why they performed the same experiments using two stimulus durations and only reported one data set with the brief duration. They may also explain why they opted not to mention in the original report the existence of another data set with a different stimulus duration, which would otherwise have certainly strengthened their main conclusions.

    Thank you for your comments regarding the stimulus duration used in our experiments. We appreciate the opportunity to clarify and provide further details on our methodology and decisions.

    In our original report, we focused on the early phase of the neuronal response, which is less affected by the duration of the stimulus. Observations from our data showed that certain neurons exhibited high firing rates even with the brief 33 ms stimulus duration, and the results we obtained were consistent across different durations. To avoid redundancy, we initially chose not to include the results from the 200 ms stimulus duration, as they reiterated the findings of the 33 ms duration.

    However, we acknowledge that the brief stimulus duration could raise concerns regarding the robustness of our conclusions, particularly concerning the effects of high spatial frequencies. Upon reflecting on the reviewer’s comments during the first revision, we recognized the importance of addressing these potential concerns directly. Therefore, we have included the data from the 200 ms stimulus duration in our revised manuscript.

Furthermore, our team is actively investigating the differences between fast (33 ms) and slow (200 ms) presentations in terms of SF processing. Our preliminary observations suggest similar processing of HSF in the early phase of the response for both fast and slow presentations, but different processing of HSF in the late phase. This was another reason we initially opted to publish the results from the brief stimulus duration separately, as we intended to explore the different aspects of SF processing in fast and slow presentations in subsequent studies.

I suggest the authors upload both data sets and the analysis code, so that the claims can be easily examined by interested readers.

Thank you for your suggestion to make both data sets and the analysis code available for examination by interested readers.

    We have created a repository that includes a sample of the dataset along with the necessary codes to output the main results. While we cannot provide the entire dataset at this time due to ongoing investigations by our team, we are committed to ensuring transparency and reproducibility. The data and code samples we have provided should enable interested readers to verify our claims and understand our analysis process.

    Repository: https://github.com/ramintoosi/spatial-frequency-selectivity

    Reviewer #2 (Public Review):

    Summary:

    This paper aimed to examine the spatial frequency selectivity of macaque inferotemporal (IT) neurons and its relation to category selectivity. The authors suggest in the present study that some IT neurons show a sensitivity for the spatial frequency of scrambled images. Their report suggests a shift in preferred spatial frequency during the response, from low to high spatial frequencies. This agrees with a coarse-to-fine processing strategy, which is in line with multiple studies in the early visual cortex. In addition, they report that the selectivity for faces and objects, relative to scrambled stimuli, depends on the spatial frequency tuning of the neurons.

    Strengths:

Previous studies using human fMRI and psychophysics studied the contribution of different spatial frequency bands to object recognition, but, as pointed out by the authors, little is known about the spatial frequency selectivity of single IT neurons. This study addresses this gap and shows spatial frequency selectivity in IT for scrambled stimuli that drive the neurons poorly. They related this weak spatial frequency selectivity to category selectivity, but these findings are premature given the low number of stimuli they employed to assess category selectivity.

    Thank you for your thorough review and insightful feedback on our manuscript. We greatly appreciate your time and effort in providing valuable comments and suggestions, which have significantly contributed to enhancing the quality of our work.

The authors revised their manuscript and provided some clarifications regarding their experimental design and data analysis. They responded to most of my comments, but I find that some issues were addressed only partially or poorly. The new data they provided confirmed my concern about low responses to their scrambled stimuli. Thus, this paper shows spatial frequency selectivity in IT for scrambled stimuli that drive the neurons poorly (see main comments below). They related this (weak) spatial frequency selectivity to category selectivity, but these findings are premature given the low number of stimuli used to assess category selectivity.

    While we acknowledge that the number of instances per condition is relatively low, the overall dataset is substantial. Specifically, our study includes a total of 180 stimuli (6 spatial frequencies × 2 scrambled/non-scrambled conditions × 15 instances, including 9 fixed and 6 non-fixed) and 5400 trials (180 stimuli × 2 durations × 15 repetitions). Conducting these trials requires approximately one hour of experimental time per session.
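As a quick check of the arithmetic above, a trivial sketch; all numbers are taken from the text:

```python
# Stimulus and trial counts as described in the text.
n_sf_bands = 6          # SF conditions
n_scramble = 2          # scrambled / non-scrambled
n_instances = 15        # 9 fixed + 6 non-fixed instances
n_stimuli = n_sf_bands * n_scramble * n_instances
assert n_stimuli == 180

n_durations = 2         # 33 ms and 200 ms presentations
n_repetitions = 15
n_trials = n_stimuli * n_durations * n_repetitions
assert n_trials == 5400
```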

    Extending the number of stimuli, while potentially addressing this limitation, would significantly compromise the quality of the experiment by increasing the duration and introducing potential fatigue effects in the subjects. Despite this limitation, our findings lay important groundwork by offering novel insights into object recognition through the lens of spatial frequency. We believe this work can serve as a foundation for future experiments designed to further explore and validate these theories with expanded stimulus sets.

    Main points.

(1) They have now provided the responses of their neurons in spikes/s and present a distribution of the raw responses in a new figure. These data suggest that their scrambled stimuli were driving the neurons rather poorly, and thus it is unclear how well their findings will generalize to more effective stimuli. Indeed, the mean net firing rate to their scrambled stimuli was very low: about 3 spikes/s. How much can one conclude when the stimuli drive the recorded neurons that poorly? Also, the new Figure 2 - Appendix 1 shows that the mean modulation by spatial frequency is about 2 spikes/s, which is a rather small modulation. Thus, the spatial frequency selectivity the authors describe in this paper is rather small compared to the stimulus selectivity one typically observes in IT (stimulus-driven modulations can be at least 20 spikes/s).

    To address the concerns regarding the firing rates and the modulation of neuronal responses by spatial frequency (SF), we emphasize several key points:

(1) Significance of Firing Rate Differences: While it is true that the mean net firing rate to our scrambled stimuli was relatively low, the firing rate differences observed were statistically significant, with p-values of approximately 1e-5. This indicates that despite the low firing rates, the observed differences are reliable and unlikely to have occurred by chance.

    (2) Classification Rate and Modulation by SF: Our analysis showed that the difference between various SF responses led to a classification rate of 44.68%, which is 24.68% higher than the chance level. This substantial increase above the chance level demonstrates that SF significantly modulates IT responses, even if the overall firing rates are modest.

    (3) Effect Size and SF Modulation: While the effect size in terms of firing rate differences may be small, it is significant. The significant modulation of IT responses by SF, as evidenced by our statistical analyses and classification rate, supports our conclusions regarding the role of SF in driving IT responses.

    (4) Expectations for Noise-like Pure SF Stimuli: We acknowledge that IT responses are typically higher for various object stimuli. Given the nature of our pure SF stimuli, which resemble noise-like patterns, we did not anticipate high responses in terms of spikes per second. The low firing rates are consistent with the expectation for such stimuli and do not undermine the significance of the observed modulation by SF.

    We believe that these points collectively support the validity of our findings and the significance of SF modulation in IT responses, despite the low firing rates. We appreciate your insights and hope this clarifies our stance on the data and its implications.

    We added the following description to the Appendix 1 - “Strength of SF selectivity” section:

“While the firing rates and net responses to scrambled stimuli were modest (e.g., 2.9 Hz in T1), the differences across spatial frequency (SF) bands were statistically significant (p ≈ 1e-5) and led to a classification accuracy 24.68% above chance. This demonstrates the robustness of SF modulation in IT neurons despite low firing rates. The modest responses align with expectations for noise-like stimuli, which are less effective in driving IT neurons, yet the observed SF selectivity highlights a fundamental property of IT encoding.”
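To make the decoding analysis concrete, here is a minimal sketch of an LDA classifier over five SF-band labels with a 70/30 split, where chance is 20%. The response matrix is a synthetic placeholder, not the recorded data, and the shapes are assumptions for illustration:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical population responses: trials x neurons firing rates,
# with one of five SF-band labels (R1-R5) per trial.
n_trials, n_neurons = 1500, 266
X = rng.poisson(5.0, size=(n_trials, n_neurons)).astype(float)
y = rng.integers(0, 5, size=n_trials)          # SF band labels R1..R5

# 70% train / 30% test, as described in the text.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0)
acc = LinearDiscriminantAnalysis().fit(X_tr, y_tr).score(X_te, y_te)
print(f"accuracy = {acc:.3f}, above chance by {acc - 0.2:.3f}")
```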

    (2) Their new Figure 2-Appendix 1 does not show net firing rates (baseline-subtracted; as I requested) and thus is not very informative. Please provide distributions of net responses so that the readers can evaluate the responses to the stimuli of the recorded neurons.

    We understand the reviewer’s concern about the presentation of net firing rates. In T2 (the late time interval), the average response rate falls below the baseline, resulting in negative net firing rates, which might confuse readers. To address this, we have added the net responses to the text for clarity. Additionally, we have included the average baseline response in the figure to provide a more comprehensive view of the data.

“To check the SF response strength, the histogram of IT neuron responses to scrambled, face, and non-face stimuli is illustrated in this figure. A Gamma distribution is also fitted to each histogram. To calculate the histogram, the response to each unique stimulus is calculated for each neuron in spikes/s (Hz). In the early phase, T1, the average firing rate to scrambled stimuli is 26.3 Hz, which is significantly higher than the response in -50 to 50 ms, which is 23.4 Hz. In comparison, the mean response to intact face stimuli is 30.5 Hz, while non-face stimuli elicit an average response of 28.8 Hz. The average net responses to the scrambled, face, and non-face stimuli are 2.9 Hz, 7.1 Hz, and 5.4 Hz, respectively. Moving to the late phase, T2, the responses to scrambled, face, and object stimuli are 19.5 Hz, 19.4 Hz, and 22.4 Hz, respectively. The corresponding average net responses are 3.9 Hz, 4.0 Hz, and 1.0 Hz below the baseline response.”
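For clarity on the quantities quoted above, a "net response" is the firing rate in the analysis window minus the baseline rate (the -50 to 50 ms window). A minimal sketch of that computation, using hypothetical spike trains and the 70-170 ms early window mentioned elsewhere in the text:

```python
import numpy as np

def rate_in_window(spike_times_ms, t0, t1):
    """Mean firing rate (Hz) of one trial's spike train in [t0, t1) ms."""
    spikes = np.asarray(spike_times_ms)
    n = np.sum((spikes >= t0) & (spikes < t1))
    return n / ((t1 - t0) / 1000.0)

def net_response(trials, win):
    """Baseline-subtracted rate: window rate minus the -50..50 ms baseline."""
    evoked = np.mean([rate_in_window(t, *win) for t in trials])
    base = np.mean([rate_in_window(t, -50, 50) for t in trials])
    return evoked - base

# Hypothetical trials: spike times in ms relative to stimulus onset.
trials = [[-30, 10, 85, 120, 150], [-40, 95, 130], [20, 100, 140, 180]]
print(net_response(trials, win=(70, 170)))   # T1-like early window (assumed)
```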

    (3) The poor responses might be due to the short stimulus duration. The authors report now new data using a 200 ms duration which supported their classification and latency data obtained with their brief duration. It would be very informative if the authors could also provide the mean net responses for the 200 ms durations to their stimuli. Were these responses as low as those for the brief duration? If so, the concern of generalization to effective stimuli that drive IT neurons well remains.

The firing rates for the 200 ms stimulus duration are as follows: 27.7 Hz, 30.7 Hz, and 30.4 Hz for scrambled, face, and object stimuli in T1, respectively; and 26.2 Hz, 29.1 Hz, and 33.9 Hz in T2. The average baseline firing rate (-50 to 50 ms) is 23.4 Hz. Therefore, the net responses are 4.3 Hz, 7.3 Hz, and 7.0 Hz in T1, and 2.8 Hz, 5.7 Hz, and 10.5 Hz in T2, for scrambled, face, and object stimuli, respectively.

Notably, the impact of stimulus duration is more pronounced in T2, which is consistent with T2's later time window relative to T1. However, the firing rates in T1 do not show substantial changes with the longer duration. As we discussed in our response to the first comment, it is important to note that high net responses are not typically expected for scrambled or noise-like stimuli in IT neurons. Instead, the key findings of this study lie in the statistical significance of these responses and their meaningful relationship to category selectivity. These results highlight the broader implications for understanding the role of spatial frequency in object recognition.

We added the firing rates to the “Extended stimulus duration supports LSF-preferred tuning” section of Appendix 1 as follows.

    “For the 200 ms stimulus duration, the firing rates were 27.7 Hz, 30.7 Hz, and 30.4 Hz for scrambled, face, and object stimuli in T1, respectively, and 26.2 Hz, 29.1 Hz, and 33.9 Hz in T2. The corresponding net responses were 4.3 Hz, 7.3 Hz, and 7.0 Hz in T1, and 2.8 Hz, 5.7 Hz, and 10.5 Hz in T2. While the longer stimulus duration did not substantially increase firing rates in T1, its impact was more pronounced in T2.”

    (4) I still do not understand why the analyses of Figures 3 and 4 provide different outcomes on the relationship between spatial frequency and category selectivity. I believe they refer to this finding in the Discussion: "Our results show a direct relationship between the population's category coding capability and the SF coding capability of individual neurons. While we observed a relation between SF and category coding, we have found uncorrelated representations. Unlike category coding, SF relies more on sparse, individual neuron representations.". I believe more clarification is necessary regarding the analyses of Figures 3 and 4, and why they can show different outcomes.

    Figure 3 explores the relationship between SF coding and category coding at both the single-neuron and population levels.

    ● Figures 3(a) and 3(b) examine the relationship between a single neuron’s response pattern and object decoding in the population.

    ● Figure 3(c) investigates the relationship between a single neuron’s SF decoding capabilities and object decoding in the population.

    ● Figure 3(d) assesses the relationship between a single neuron’s object decoding capabilities and SF decoding in the population.

In summary, Figure 3 demonstrates a relation between the SF coding/response pattern at the single-neuron level and category coding at the population level.

    Figure 4, on the other hand, addresses the uncorrelated nature of SF and category coding.

    ● Figure 4(a) shows the uncorrelated relation between a single neuron’s SF decoding capability and its object decoding capability. This suggests that a neuron's ability to decode SF does not predict its ability to decode object categories.

    ● Figure 4(b) illustrates that the contribution of a neuron to the population decoding of SF is uncorrelated with its contribution to the population decoding of object categories. This further supports the idea that the mechanisms behind SF coding and object coding are uncorrelated.

    In summary, Figure 4 suggests that while there is a relation between SF coding and category coding as illustrated in Figure 3, the mechanisms underlying SF coding and object coding operate independently (in terms of correlation), highlighting the distinct nature of these processes.

We hope this explanation clarifies why the analyses in Figures 3 and 4 present different outcomes: Figure 3 provides insight into the relationship between SF and category coding, while Figure 4 emphasizes the uncorrelated nature of these processes.

To clarify the presentation of the work, we added the following description to the “Uncorrelated mechanisms for SF and category coding” section:

    “Figures 3 and 4 examine different aspects of the relationship between SF and category coding. Figure 3 highlights a relationship between SF coding at the single-neuron level and category coding at the population level. Conversely, Figure 4 demonstrates the uncorrelated mechanisms underlying SF and category coding, showing that a neuron’s ability to decode SF is not predictive of its ability to decode object categories. This distinction underscores that while SF and category coding are related at broader levels, their underlying mechanisms are independent, emphasizing the distinct processes driving each form of coding.”
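For readers who want the Figure 4(a)-style test in concrete terms, here is a minimal sketch that correlates per-neuron SF decoding accuracy with per-neuron category decoding accuracy. The accuracy vectors are synthetic placeholders, not the authors' recordings; a near-zero correlation would correspond to the uncorrelated coding described above:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n_neurons = 266

# Hypothetical per-neuron decoding accuracies (one value per neuron).
sf_acc = rng.normal(0.25, 0.05, n_neurons)    # SF decoding, chance 0.2
cat_acc = rng.normal(0.55, 0.08, n_neurons)   # category decoding, chance 0.5

r, p = pearsonr(sf_acc, cat_acc)
print(f"r = {r:.3f}, p = {p:.3g}")  # near-zero r: SF and category coding uncorrelated
```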

(5) The authors found a higher separability for faces (versus scrambled patterns) for neurons preferring high spatial frequencies. This is consistent across the two monkeys, but we are dealing here with a small number of neurons. Only 6% of their neurons (16 neurons) belonged to this high spatial frequency group when pooling the two monkeys. Thus, although both monkeys show this effect, I wonder how robust it is given the small number of neurons per monkey that belong to this spatial frequency profile. Furthermore, the higher separability for faces for the low-frequency profiles is not consistent across monkeys, which should be pointed out.

    We appreciate the reviewer’s concern regarding the relatively small number of neurons in the high spatial frequency group (16 neurons, 6% of the total sample across the two monkeys) and the consistency of the results. While we acknowledge this limitation, it is important to note that findings involving sparse subsets of neurons can still be meaningful. For example, Dalgleish et al. (2020) demonstrated that perception can arise from the activity of as few as ~14 neurons in the mouse cortex, supporting the sparse coding hypothesis. This underscores the potential robustness of results derived from small neuronal populations when the activity is statistically significant and functionally relevant.

    Regarding the higher separability for faces among neurons preferring high spatial frequencies, the consistency of this finding across both monkeys suggests that this effect is robust within this subgroup. For neurons preferring low spatial frequencies, we agree that the lack of consistency across monkeys should be explicitly noted. These differences may reflect individual variability or differences in sampling across subjects and merit further investigation in future studies.

    To address this concern, we have updated the text to explicitly discuss the small size of the high spatial frequency group, its implications, and the observed inconsistency in the low spatial frequency profiles between monkeys. We have added the following description to the discussion.

    “Next, according to Figure 3(a), 6% of the neurons are HSF-preferred and their firing rate in HSF is comparable to the LSF firing rate in the LSF-preferred group. This analysis is carried out in the early phase of the response (70-170ms). While most of the neurons prefer LSF, this observation shows that there is an HSF input that excites a small group of neurons. Importantly, findings involving small neuronal populations can still be meaningful, as studies like Dalgleish et al. (2020) have demonstrated that perception can arise from the activity of as few as ~14 neurons in the mouse cortex, emphasizing the robustness of sparse coding.”

Regarding the separability of faces for the low-frequency profiles, we added the following to the appendix section:

“For neurons preferring LSF (the LP profile), it is important to note the lack of consistency in responses across monkeys. This variability may reflect individual differences in neural processing or variations in sampling between subjects.”

    And in the discussion:

“Our results are based on grouping the neurons of the two monkeys; however, the results remain consistent when looking at the data from individual monkeys, as illustrated in Appendix 2. For neurons preferring LSF, however, we observed inconsistency across monkeys, which may reflect individual differences or sampling variability. These findings highlight the complexity of SF processing in the IT cortex and suggest the need for further research to explore these variations.”

* Henry WP Dalgleish, Lloyd E Russell, Adam M Packer, Arnd Roth, Oliver M Gauld, Francesca Greenstreet, Emmett J Thompson, Michael Häusser (2020). How many neurons are sufficient for perception of cortical activity? eLife 9:e58889.

    (6) I agree that CNNs are useful models for ventral stream processing but that is not relevant to the point I was making before regarding the comparison of the classification scores between neurons and the model. Because the number of features and trial-to-trial variability differs between neural nets and neurons, the classification scores are difficult to compare. One can compare the trends but not the raw classification scores between CNN and neurons without equating these variables.

    We appreciate the reviewer’s follow-up comment and agree that differences in the number of features and trial-to-trial variability between IT neurons and CNN units make direct comparisons of raw classification scores challenging. As the reviewer suggests, it is more appropriate to focus on comparing trends rather than absolute scores when analyzing the similarities and differences between these systems. In light of this, we have revised the text to clarify that our intention was not to equate raw classification scores but to highlight the qualitative patterns and trends observed in spatial frequency encoding between IT and CNN units.

    “SF representation in the artificial neural networks

We conducted a thorough analysis to compare our findings with CNNs. To assess the SF coding capabilities and trends of CNNs, we utilized popular architectures, including ResNet18, ResNet34, VGG11, VGG16, InceptionV3, EfficientNetb0, CORNet-S, CORNet-RT, and CORNet-Z, with both pre-trained (on ImageNet) and randomly initialized weights. Employing feature maps from the four last layers of each CNN, we trained an LDA model to classify the SF content of input images. Figure 5(a) shows the SF decoding accuracy of the CNNs on our dataset (SF decoding accuracy with random (R) and pre-trained (P) weights, ResNet18: P=0.96±0.01 / R=0.94±0.01, ResNet34: P=0.95±0.01 / R=0.86±0.01, VGG11: P=0.94±0.01 / R=0.93±0.01, VGG16: P=0.92±0.02 / R=0.90±0.02, InceptionV3: P=0.89±0.01 / R=0.67±0.03, EfficientNetb0: P=0.94±0.01 / R=0.30±0.01, CORNet-S: P=0.77±0.02 / R=0.36±0.02, CORNet-RT: P=0.31±0.02 / R=0.33±0.02, and CORNet-Z: P=0.94±0.01 / R=0.97±0.01). Except for CORNet-Z, object recognition training increases the network's capacity for SF coding, with an improvement as large as 64% in EfficientNetb0. Furthermore, except for the CORNet family, LSF content exhibits higher recall values than HSF content, as observed in the IT cortex (p-value with random (R) and pre-trained (P) weights, ResNet18: P=0.39 / R=0.06, ResNet34: P=0.01 / R=0.01, VGG11: P=0.13 / R=0.07, VGG16: P=0.03 / R=0.05, InceptionV3: P<0.001 / R=0.05, EfficientNetb0: P=0.07 / R=0.01). The recall values of CORNet-Z and ResNet18 are illustrated in Figure 5(b). However, while the CNNs exhibited some similarities in SF representation with the IT cortex, they did not replicate the SF-based profiles that predict neuron category selectivity. As depicted in Figure 5(c), although neurons formed similar profiles, these profiles were not associated with the category decoding performances of the neurons sharing the same profile.”
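To make the described pipeline concrete, below is a minimal Python sketch of one such analysis: extracting late-layer feature maps from a pretrained ResNet18 and decoding SF labels with LDA. The layer choice ("layer4"), the global-average-pooling step, and the placeholder images are assumptions for illustration (requiring torchvision ≥ 0.13), not the authors' exact pipeline:

```python
import numpy as np
import torch
from torchvision.models import resnet18, ResNet18_Weights
from torchvision.models.feature_extraction import create_feature_extractor
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Pretrained ResNet18; 'layer4' stands in for one of the late stages.
model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1).eval()
extractor = create_feature_extractor(model, return_nodes={"layer4": "feat"})

# Hypothetical stimulus batch: SF-filtered images as normalized tensors.
images = torch.randn(60, 3, 224, 224)       # placeholder for real stimuli
labels = np.repeat(np.arange(5), 12)        # five SF bands, 12 images each

with torch.no_grad():
    feat = extractor(images)["feat"]        # (N, C, H, W) feature maps
X = feat.mean(dim=(2, 3)).numpy()           # global average pooling per channel

acc = cross_val_score(LinearDiscriminantAnalysis(), X, labels, cv=5).mean()
print(f"SF decoding accuracy: {acc:.2f} (chance 0.20)")
```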

    Discussion:

    “Finally, we compared SF's representation trends and findings within the IT cortex and the current state-of-the-art networks in deep neural networks.”

    Recommendations for the authors:

    Reviewer #2 (Recommendations For The Authors):

    The mean baseline firing rate of their neurons (23.4 Hz) was rather high for single IT neurons (typically around 10 spikes/s or lower). Were these well-isolated units or mainly multiunit activity?

We confirm that the recordings in our study were from both well-isolated single units and multi-unit activity (the activity remaining after single-unit isolation), sorted with our spike-sorting toolbox. The higher baseline firing rate is likely due to the experimental design, particularly the inclusion of the responsive neurons from the selectivity phase. We added the following statement to the methods section.

    “In our analysis, we utilized both well-isolated single units and multi-unit activities (which represent neural activities that could not be further sorted into single units), ensuring a comprehensive representation of neural responses across the recorded population.”

  5. Author response:

    The following is the authors’ response to the original reviews.

    Public Reviews:

    Reviewer #1 (Public Review):

    Summary:

This study reports that IT neurons have biased representations toward low spatial frequency (SF) and faster decoding of low SFs than high SFs. High SF-preferred neurons, and low SF-preferred neurons to a lesser degree, perform better category decoding than neurons with other profiles (U and inverted-U shaped). SF coding also shows more sparseness than category coding in the earlier phase of the response and less sparseness in the later phase. The results are also contrasted with predictions of various DNN models.

    Strengths:

The study addressed an important issue concerning the representation of SF information in a high-level visual area. Data are analyzed with LDA, which can effectively reduce the dimensionality of neuronal responses and retain category information.

    We would like to express our sincere gratitude for your insightful and constructive comments which greatly contributed to the refinement of the manuscript. We appreciate the time and effort you dedicated to reviewing our work and providing suggestions. We have carefully considered each of your comments and addressed the suggested revisions accordingly.

    Weaknesses:

The results are likely compromised by improper stimulus timing and unmatched spatial frequency spectra of stimuli in different categories.

The authors used a very brief stimulus duration (35 ms), which would degrade the visual system's contrast sensitivity to medium and high SF information disproportionately (see Nachmias, JOSAA, 1967). Therefore, IT neurons in the study could have received more degraded medium and high SF inputs compared to low SF inputs, which may be at least partially responsible for higher firing rates to low SF R1 stimuli (Figure 1c) and poorer recall performance with medium and high SF R3-R5 stimuli in LDA decoding. The issue may also to some degree explain the delayed onset of recall to higher SF stimuli (Figure 2a), preferred low SF with an earlier T1 onset (Figure 2b), lower firing rate to high SF during T1 (Figure 2c), and somewhat increased firing rate to high SF during T2 (because weaker high SF inputs would lead to later onset, Figure 2d).

We appreciate your concern regarding the coarse-to-fine nature of SF processing in the visual hierarchy and the short exposure time of our paradigm. Following your comment, we repeated the analysis of SF representation with a 200 ms exposure time, as illustrated in Appendix 1 - Figure 4. Our recorded data contain the 200 ms version of the exposure time for all neurons in the main phase. As can be seen, the results are similar to what we found in the 33 ms experiments.

    Next, we bring your attention to the following observations:

(1) According to Figure 2d, the average firing rate of IT neurons for HSF could be higher than for LSF in the late response phase. Therefore, the amount of HSF input received by the IT neurons is as much as that of LSF; however, its impact on the IT response is observable in the later phase of the response. Thus, the LSF preference is due to the temporal advantage of LSF processing rather than contrast sensitivity.

(2) According to Figure 3a, 6% of the neurons are HSF-preferred and their firing rate in HSF is comparable to the LSF firing rate in the LSF-preferred group. This analysis is carried out in the early phase of the response (70-170 ms). While most of the neurons prefer LSF, this observation shows that there is an HSF input that excites a small group of neurons. Furthermore, the highest separability index also belongs to the HSF-preferred profile in the early phase of the response, which supports the impact of the HSF part of the input.

(3) Similar LSF-preferred responses are also reported for longer stimulus durations by Chen et al. (2018) (50 ms for SC) and Zhang et al. (2023) (3.5-4 s for V2 and V4).

Our results suggest that the LSF-preferred nature of the IT responses, in terms of firing rate and recall, is not due to a weakness or lack of input source (or information) for HSF but rather to the processing nature of SF in the vision hierarchy.

    To address this issue in the manuscript:

Appendix 1 - Figure 4 is added to the manuscript and shows the recall values and onsets for R1-R5 with 200 ms of exposure time.

    We added the following description to the discussion:

“To rule out the degraded contrast sensitivity of the visual system to medium and high SF information because of the brief exposure time, we repeated the analysis with a 200 ms exposure time, as illustrated in Appendix 1 - Figure 4, which indicates the same LSF-preferred results. Furthermore, according to Figure 2, the average firing rate of IT neurons for HSF could be higher than for LSF in the late response phase. This indicates that the amount of HSF input received by the IT neurons in the later phase is as much as that of LSF; however, its impact on the IT response is observable in the later phase of the response. Thus, the LSF preference is because of the temporal advantage of LSF processing rather than contrast sensitivity. Next, according to Figure 3(a), 6% of the neurons are HSF-preferred and their firing rate in HSF is comparable to the LSF firing rate in the LSF-preferred group. This analysis is carried out in the early phase of the response (70-170 ms). While most of the neurons prefer LSF, this observation shows that there is an HSF input that excites a small group of neurons. Additionally, the highest SI belongs to the HSF-preferred profile in the early phase of the response, which supports the impact of the HSF part of the input. Similar LSF-preferred responses are also reported by Chen et al. (2018) (50 ms for SC) and Zhang et al. (2023) (3.5-4 s for V2 and V4). Therefore, our results show that the LSF-preferred nature of the IT responses, in terms of firing rate and recall, is not due to a weakness or lack of input source (or information) for HSF but rather to the processing nature of SF in the IT cortex.”

Figure 3b shows greater face coding than object coding by high SF neurons and, to a lesser degree, by low SF neurons. Only the inverted-U-shaped neurons displayed slightly better object coding than face coding. Overall, the results give the impression that IT neurons are significantly more capable of coding faces than coding objects, which is inconsistent with the general understanding of the functions of IT neurons. The problem may lie with the selection of stimulus images (Figure 1b). To study SF-related category coding, the images in the two categories need to have similar SF spectra in the Fourier domain. Such efforts are not mentioned in the manuscript, and a look at the images in Figure 1b suggests that such efforts were likely not properly made. The ResNet18 decoding results in Figure 6C, in which IT neurons of different profiles show similar face and object coding, might be closer to reality.

Because of the limited number of stimuli in our experiments, it is hard to draw conclusions about category selectivity, which requires a larger number of stimuli. To mitigate this limitation, we fixed 60% of the stimuli (nine out of 15) while varying the remaining stimuli to reduce selection bias. To check the coding capability of the IT neurons for face and non-face objects, we evaluated the recall of face vs. non-face classification on intact stimuli (similar to the classifiers stated in the manuscript). Results show that, at the population level, the recall value for objects is 90.45% and for faces is 92.45%. However, the difference is not significant (p-value=0.44). On the other hand, we note that a large difference in the SI value does not translate directly into classification accuracy; rather, it illustrates the strength of representation.

Regarding the SF spectra, after matching the luminance and contrast of the images, we matched the power of the images with respect to SF and category. Powers are calculated as the sum of the absolute values of the Fourier transform of the image. Considering all stimuli, the ANOVA analysis shows that the various SF bands have similar power (one-way ANOVA, p-value=0.24). Furthermore, comparing the power of faces and objects in all SF bands (including intact) and in both unscrambled and scrambled images indicates no significant difference between face and object (p-value > 0.1). Therefore, the result of Figure 3b suggests that IT employs various SF bands for the recognition of various objects.

Comparing the results of CNNs and IT shows that the CNNs do not capture the complexities of the IT cortex in terms of SF. One source of this difference is the behavioral salience of face stimuli in the training of the primate visual system.

    To address this issue in the manuscript:

    The following description is added to the discussion:

    “… the decoding performance of category classification (face vs. non-face) in intact stimuli is 94.2%. The recall value for objects vs. scrambled is 90.45%, and for faces vs. scrambled is 92.45% (p-value=0.44), which indicates the high level of generalizability and validity characterizing our results.”

    The following description is added to the method section, SF filtering.

“Finally, we equalized the stimulus power in all SF bands (intact, R1-R5). The SF power among all conditions (all SF bands, face vs. non-face, and unscrambled vs. scrambled) does not vary significantly (p-value > 0.1). SF power is calculated as the sum of the squared values of the image coefficients in the Fourier domain.”
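As an illustration of the power-matching check quoted above (sum of squared Fourier coefficient magnitudes per image, compared across conditions with a one-way ANOVA), here is a minimal sketch. The image groups are random placeholders, so the reported p = 0.24 will not be reproduced:

```python
import numpy as np
from scipy.stats import f_oneway

def sf_power(img):
    """Stimulus power as the sum of squared Fourier coefficient magnitudes."""
    return float(np.sum(np.abs(np.fft.fft2(img)) ** 2))

# Hypothetical stimulus sets: one group of images per SF band (intact + R1-R5).
rng = np.random.default_rng(2)
bands = [rng.standard_normal((15, 128, 128)) for _ in range(6)]
powers = [[sf_power(im) for im in band] for band in bands]

stat, p = f_oneway(*powers)   # one-way ANOVA across SF bands
print(f"one-way ANOVA across bands: p = {p:.3f}")
```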

    Reviewer #2 (Public Review):

    Summary:

    This paper aimed to examine the spatial frequency selectivity of macaque inferotemporal (IT) neurons and its relation to category selectivity. The authors suggest in the present study that some IT neurons show a sensitivity for the spatial frequency of scrambled images. Their report suggests a shift in preferred spatial frequency during the response, from low to high spatial frequencies. This agrees with a coarse-to-fine processing strategy, which is in line with multiple studies in the early visual cortex. In addition, they report that the selectivity for faces and objects, relative to scrambled stimuli, depends on the spatial frequency tuning of the neurons.

    Strengths:

Previous studies using human fMRI and psychophysics studied the contribution of different spatial frequency bands to object recognition, but, as pointed out by the authors, little is known about the spatial frequency selectivity of single IT neurons. This study addresses this gap and shows that at least some IT neurons are sensitive to spatial frequency and, interestingly, show a tendency for coarse-to-fine processing.

    We extend our sincere appreciation for your thoughtful and constructive feedback on our paper. We are grateful for the time and expertise you invested in reviewing our work. Your detailed suggestions have been instrumental in addressing several key aspects of the paper, contributing to its clarity and scholarly merit. We have carefully considered each of your comments and have made revisions accordingly.

    Weaknesses and requested clarifications:

(1) It is unclear whether the effects described in this paper reflect a sensitivity to spatial frequency, i.e., in cycles/deg (which depends on the distance from the observer and changes when rescaling the image), or a sensitivity to cycles/image, largely independent of image scale. How is this related to the well-documented size tolerance of IT neuron selectivity?

Our stimuli are filtered in cycles/image and, knowing the distance of the subject from the monitor, we can calculate cycles/degree. To the best of our knowledge, this is also the case for all other SF-related studies. To relate the observations to cycles/image versus cycles/degree, one would need to keep one fixed while varying the other; for example, changing the subject's distance from the monitor changes the SF content in cycles/degree while leaving cycles/image unchanged. With our current data, we cannot discriminate this effect. To address this issue, we added the following description to the discussion:

    “Finally, since our experiment maintains a fixed SF content in terms of both cycles per degree and cycles per image, further experiments are needed to discern whether our observations reflect sensitivity to cycles per degree or cycles per image.”
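For concreteness, the two units are related through the stimulus's angular size: an image of on-screen size s viewed from distance d subtends θ = 2·arctan(s/2d) degrees, so cycles/degree = (cycles/image)/θ. A small sketch with placeholder geometry (the actual viewing distance and stimulus size are not given in this excerpt):

```python
import math

def cycles_per_degree(cycles_per_image, image_size_cm, distance_cm):
    """Convert cycles/image to cycles/degree for a given viewing geometry."""
    theta = 2 * math.degrees(math.atan(image_size_cm / (2 * distance_cm)))
    return cycles_per_image / theta

# Placeholder geometry: a 10 cm stimulus viewed from 57 cm (~10 deg of visual angle).
print(cycles_per_degree(40, image_size_cm=10, distance_cm=57))  # ~4 cyc/deg
```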

(2) The authors band-pass filtered phase-scrambled images of faces and objects. The original images likely differed in their spatial frequency amplitude spectra, and thus it is unclear whether the differing bands contained the same power for the different scrambled images. If not, this could have contributed to the frequency sensitivity of the neurons.

After equalizing the luminance and contrast of the images, we equalized their power with respect to SF and category. The powers were calculated as the sum of the absolute values of the Fourier transform of the images. The results of the ANOVA analysis across all stimuli indicate that the various SF bands exhibit similar power (one-way ANOVA, p-value = 0.24). Additionally, a comparison of power between faces and objects in all SF bands (including intact), for both unscrambled and scrambled images, reveals no significant differences (p-value > 0.1). To clarify this point, we have incorporated the following information into the Methods section.

“Finally, we equalized the stimulus power in all SF bands (intact, R1-R5). The SF power among all conditions (all SF bands, face vs. non-face, and unscrambled vs. scrambled) does not vary significantly (ANOVA, p-value > 0.1).”

(3) How strong were the responses to the phase-scrambled images? Phase-scrambled images are expected to be rather ineffective stimuli for IT neurons. How can one extrapolate the effect of the spatial frequency band observed for ineffective stimuli to that for more effective stimuli, like objects or (for some neurons) faces? A distribution of the net responses (in spikes/s) to the scrambled stimuli should be provided, for both the early and late windows.

The sample neuron in Figure 1c was chosen as a good indicator of the recorded neurons. In the early response phase, the average firing rate to scrambled stimuli is 26.3 spikes/s, which is significantly higher than the baseline response between -50 and 50 ms (23.4 spikes/s). In comparison, the mean response to intact face stimuli is 30.5 spikes/s, while object stimuli elicit an average response of 28.8 spikes/s. Moving to the late phase, T2, the responses to scrambled, face, and object stimuli are 19.5, 19.4, and 22.4 spikes/s, respectively. Moreover, when the classification accuracy for SF exceeds chance levels, it indicates a significant impact of the SF bands on the IT response. This raises the question of how the explicit coding of SF bands in the IT cortex, observed here for ineffective stimuli, relates to complex and effective stimuli such as faces. To show the strength of neuron responses to the SF bands in scrambled images, we added Appendix 1 - Figure 2 and, following comment 4, Appendix 1 - Figure 1, which shows the average and standard deviation of the responses to all SF bands. The following description is added to the results section.

“Considering the strength of responses to scrambled stimuli, the average firing rate in response to scrambled stimuli is 26.3 Hz, which is significantly higher than the response observed between -50 and 50 ms, where it is 23.4 Hz (p-value = 3e-5). In comparison, the mean response to intact face stimuli is 30.5 Hz, while non-face stimuli elicit an average response of 28.8 Hz. The distribution of neuron responses for scrambled, face, and non-face in T1 is illustrated in Appendix 1 - Figure 2.

    […]

Moreover, the average firing rates of scrambled, face, and non-face stimuli are 19.5 Hz, 19.4 Hz, and 22.4 Hz, respectively. The distribution of neuron responses is illustrated in Appendix 1 - Figure 2.”

    (4) The strength of the spatial frequency selectivity is unclear from the presented data. The authors provide the result of a classification analysis, but this is in normalized units so that the reader does not know the classification score in percent correct. Unnormalized data should be provided. Also, it would be informative to provide a summary plot of the spatial frequency selectivity in spikes/s, e.g. by ranking the spatial frequency bands for each neuron based on half of the trials and then plotting the average responses for the obtained ranks for the other half of the trials. Thus, the reader can appreciate the strength of the spatial frequency selectivity, considering trial-to-trial variability. Also, a plot should be provided of the mean response to the stimuli for the two analysis windows of Figure 2c and 2d in spikes/s so one can appreciate the mean response strengths and effect size (see above).

The normalization of the classification results is obtained simply by subtracting the chance level (0.2) from all values. Therefore, the values can still be interpreted as percentages, as we did in the results section. To make this clear, we removed the “a.u.” label from the figure and added the following description to the results section.

    “The accuracy value is normalized by subtracting the chance level (0.2).”

Regarding the selectivity of the neurons, as suggested by your comment, we added a new figure in the appendix, Appendix 1 - Figure 2. This figure shows the strength of SF selectivity, considering trial-to-trial variability. The following description is added to the results section:

“The strength of SF selectivity, considering the trial-to-trial variability, is provided in Appendix 1 - Figure 2, by ranking the SF bands for each neuron based on half of the trials and then plotting the average responses for the obtained ranks for the other half of the trials.”
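A minimal sketch of this split-half procedure, with a hypothetical neurons × SF-bands × trials response array standing in for the recorded data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical responses: neurons x SF bands x trial repetitions (Hz).
resp = rng.normal(25, 5, size=(100, 5, 15))

half_a, half_b = resp[:, :, ::2], resp[:, :, 1::2]   # split trials in half
rank = np.argsort(-half_a.mean(axis=2), axis=1)      # per-neuron band ranking on half A
# Average half-B responses at the ranks obtained from half A.
ranked_b = np.take_along_axis(half_b.mean(axis=2), rank, axis=1)
print(ranked_b.mean(axis=0))   # flat here; genuine selectivity would slope downward
```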

The firing rates of Figures 2c and 2d are normalized for better illustration, since the variation in firing rates is high across neurons, as can be observed in Appendix 1 - Figure 1. Since we seek trends in the response, the absolute values are not important (the baseline firing rates of neurons differ); rather, the values relative to the baseline firing rate determine the trend. To address the mean response and the strength of the SF response, the following description is added to the results section.

“Considering the strength of responses to scrambled stimuli, the average firing rate in response to scrambled stimuli is 26.3 Hz, which is significantly higher than the response observed between -50 and 50 ms, where it is 23.4 Hz (p-value = 3e-5). In comparison, the mean response to intact face stimuli is 30.5 Hz, while non-face stimuli elicit an average response of 28.8 Hz. The distribution of neuron responses for scrambled, face, and non-face in T1 is illustrated in Appendix 1 - Figure 2.

    […]

Moreover, the average firing rates of scrambled, face, and non-face stimuli are 19.5 Hz, 19.4 Hz, and 22.4 Hz, respectively. The distribution of neuron responses is illustrated in Appendix 1 - Figure 2.”

    Furthermore, we added a figure, Appendix 1 - Figure 3, to illustrate the strength of SF selectivity in our profiles. The following is added to the results section:

“To check the robustness of the profiles, considering the trial-to-trial variability, the strength of SF selectivity in each profile is provided in Appendix 1 - Figure 3, by forming the profile of each neuron based on half of the trials and then plotting the average SF responses with the other half of the trials.”

    (5) It is unclear why such brief stimulus durations were employed. Will the results be similar, in particular the preference for low spatial frequencies, for longer stimulus durations that are more similar to those encountered during natural vision?

    Please refer to the first comment of Reviewer 1.

    (6) The authors report that the spatial frequency band classification accuracy for the population of neurons is not much higher than that of the best neuron (line 151). How does this relate to the SNC analysis, which appears to suggest that many neurons contribute to the spatial frequency selectivity of the population in a non-redundant fashion? Also, the outcome of the analyses should be provided (such as SNC and decoding (e.g. Figure 1D)) in the original units instead of undefined arbitrary units.

The population accuracy is approximately 5% higher than that of the best neuron. However, we have no reference against which to compare this effect size (the value is roughly similar for face vs. object, while the chance levels are different). Moreover, as stated in the Methods, SNC is calculated for two label modes (LSF and HSF) and cannot be directly compared to the best neuron's accuracy. Regarding the unit of SNC, it can be interpreted directly as a percentage by multiplying by a factor of 100. We removed the “a.u.” to prevent misunderstanding and modified the results section for clarity.

“… SNC score for SF (two labels, LSF (R1 and R2) vs. HSF (R4 and R5)) and category … (average SNC for SF=0.51%±0.02 and category=0.1%±0.04 …”

    (7) To me, the results of the analyses of Figure 3c,d, and Figure 4 appear to disagree. The latter figure shows no correlation between category and spatial frequency classification accuracies while Figure 3c,d shows the opposite.

In Figure 3c,d, following what we observed in Figure 3a,b about the category coding capabilities of the neuronal population based on the profiles of single neurons, we tested whether the coding capability of single neurons for SF/category could predict the population's coding capability for category/SF. Therefore, both analyses investigate a relation between a characteristic of single neurons and the coding capability of a population of similar neurons. In Figure 4, on the other hand, the idea is to characterize the coding mechanisms behind SF and category coding. In Figure 4a, we check whether there is any relation between category and SF coding capability within a single neuron's activity, without the influence of other neurons, to test the idea that SF coding may be a byproduct of an object recognition mechanism. In Figure 4b, we investigated the contribution of all neurons to the population decision, again to check whether the mechanisms behind SF and category coding are the same; this analysis shows how individual neurons contribute to SF or category coding at the population level. Therefore, the analyses in Figures 3 and 4 differ both in method and in what they were designed to investigate, and their results cannot be directly compared.
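The excerpt does not spell out how a neuron's contribution to the population decision is computed. One common approach, offered here only as an illustrative assumption and not necessarily the authors' SNC definition, is the drop in cross-validated population decoding accuracy when a single neuron is removed:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def contribution(X, y, i, cv=5):
    """Drop in cross-validated decoding accuracy when neuron i is removed."""
    clf = LinearDiscriminantAnalysis()
    full = cross_val_score(clf, X, y, cv=cv).mean()
    reduced = cross_val_score(clf, np.delete(X, i, axis=1), y, cv=cv).mean()
    return full - reduced

# Hypothetical population: trials x neurons, with binary labels
# (LSF vs. HSF, or face vs. non-face) to mirror the two-label analyses.
rng = np.random.default_rng(5)
X = rng.poisson(5.0, size=(300, 40)).astype(float)
y = rng.integers(0, 2, size=300)
print(contribution(X, y, i=0))
```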

(8) If I understand correctly, the "main" test included scrambled versions of each of the "responsive" images selected based on the preceding test. Each stimulus was presented 15 times (once in each of the 15 blocks). The LDA classifier was trained to predict the 5 spatial frequency band labels, and 70% of the trials were used to train the classifier. Were the training and test trials stratified with respect to the different scrambled images? Also, LDA assumes a normal distribution. Was this the case, especially given the mixture of repetitions of the same scrambled stimulus and different scrambled stimuli?

    In response to your inquiry regarding the stratification of trials, both the training and testing data were representative of the entire spectrum of scrambled images used in our experiment. To address your concern about the assumption of a normal distribution, especially given the mixture of repetitions of the same scrambled stimulus and different stimuli, our analysis of firing rates reveals a slightly left-skewed normal distribution. While there is a deviation from a perfectly normal distribution, we are confident that this skewness does not compromise the robustness of the LDA classifier.
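As an illustration of the stratification being discussed, here is a minimal sketch that stratifies the 70/30 split on the (SF band, scrambled-image identity) pair, so every scrambled image is proportionally represented in both the training and test sets. The design numbers are placeholders, not the actual experiment:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical design: 5 SF bands x 15 scrambled images x 12 repetitions.
sf_band = np.repeat(np.arange(5), 15 * 12)
image_id = np.tile(np.repeat(np.arange(15), 12), 5)
X = np.random.default_rng(4).poisson(5.0, size=(sf_band.size, 50)).astype(float)

# Stratify on the (band, image) pair so each scrambled image appears
# proportionally in the 70% training and 30% test trials.
strata = sf_band * 15 + image_id
X_tr, X_te, y_tr, y_te = train_test_split(
    X, sf_band, train_size=0.7, stratify=strata, random_state=0)
```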

    (9) The LDA classifiers for spatial frequency band (5 labels) and category (2 labels) have different chance and performance levels. Was this taken into account when comparing the SNC between these two classifiers? Details and SNC values should be provided in the original (percent difference) instead of arbitrary units in Figure 5a. Without such details, the results are impossible to evaluate.

For both the SNC and CMI calculations for SF, we considered two labels: HSF (R4 and R5) and LSF (R1 and R2). This was mentioned in the Methods section, after Equation (5). Following your comment, we added this description to the Results section as well.

    “… illustrates the SNC score for SF (two labels, LSF (R1 and R2) vs. HSF (R4 and R5)) and category (face vs. non-face) … conditioned on the label, SF (LSF (R1 and R2) vs. HSF (R4 and R5)) or category, to assess the information.”

The value of SNC can also be converted directly to percent by multiplying by 100. To make this clear, we removed “a.u.” from the y-axis.

    (10) Recording locations should be described in IT, since the latter is a large region. Did their recordings include the STS? A/P and M/L coordinate ranges of recorded neurons?

We appreciate your suggestion regarding the recording locations. Nevertheless, given the complexities associated with neurophysiological recordings and the limitations of our methodology, we cannot determine precisely whether each unit was located in the STS. To address your comment, we added Appendix 1 - Figure 5, which shows the SF and category coding capability of the neurons along their recorded locations.

    (11) The authors should show in Supplementary Figures the main data for each of the two animals, to ensure the reader that both monkeys showed similar trends.

We added Appendix 2, which shows the consistency of the main results across the two monkeys.

    (12) The authors found that the deep nets encoded better the spatial frequency bands than the IT units. However, IT units have trial-to-trial response variability and CNN units do not. Did they consider this when comparing IT and CNN classification performance? Also, the number of features differs between IT and CNN units. To me, comparing IT and CNN classification performances is like comparing apples and oranges.

Deep convolutional neural networks are currently considered the state-of-the-art models of the primate visual pathway. However, as you mentioned, and consistent with our results, they do not yet capture various complexities of the visual ventral stream. Still, studying the similarities and differences between CNNs and brain regions such as the IT cortex is an active area of research; for example:

    a. Kubilius, Jonas, et al. "Brain-like object recognition with high-performing shallow recurrent ANNs." Advances in neural information processing systems 32 (2019).

    b. Xu, Yaoda, and Maryam Vaziri-Pashkam. "Limits to visual representational correspondence between convolutional neural networks and the human brain." Nature Communications, 12.1 (2021).

    c. Jacob, Georgin, et al. "Qualitative similarities and differences in visual object representations between brains and deep networks." Nature Communications, 12.1 (2021).

Therefore, we believe that comparing IT and CNNs, despite all of the differences in their characteristics, can help both fields advance, especially in the development of brain-inspired networks.

    (13) The authors should define the separability index in their paper. Since it is the main index to show a relationship between category and spatial frequency tuning, it should be described in detail. Also, results should be provided in the original units instead of undefined arbitrary units. The tuning profiles in Figure 3A should be in spikes/s. Also, it was unclear to me whether the classification of the neurons into the different tuning profiles was based on an ANOVA assessing per neuron whether the effect of the spatial frequency band was significant (as should be done).

Following your comment, we added the description of the separability index to the Methods section (a standard scatter-matrix formulation is sketched below). However, since the separability index is defined as a ratio of two dispersion matrices, it is dimensionless by nature. The tuning profiles in Figure 3a are normalized for better illustration, since the variation in firing rates is high; because we look for trends in the response, the absolute values are not important. Regarding the formation of the SF profiles, we updated the Methods section to present the profile assignment more clearly. Furthermore, the strength of the responses to scrambled stimuli can be seen in Appendix 1 - Figures 1 and 2.
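For orientation, one standard scatter-matrix formulation of a separability index (an assumption on our part, since several variants exist) contrasts within-class and between-class dispersion:

```latex
S_W=\sum_{c}\sum_{x\in c}(x-\mu_c)(x-\mu_c)^{\top},\qquad
S_B=\sum_{c} n_c\,(\mu_c-\mu)(\mu_c-\mu)^{\top},\qquad
\mathrm{SI}=\frac{\operatorname{tr}(S_B)}{\operatorname{tr}(S_W)}
```

Being a ratio of dispersions, the index is dimensionless, consistent with the reply above.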

    (14) As mentioned above, the separability analysis is the main one suggesting an association between category and spatial frequency tuning. However, they compute the separability of each category with respect to the scrambled images. Since faces are a rather homogeneous category I expect that IT neurons have on average a higher separability index for faces than for the more heterogeneous category of objects, at least for neurons responsive to faces and/or objects. The higher separability for faces of the two low- and high-pass spatial frequency neurons could reflect stronger overall responses for these two classes of neurons. Was this the case? This is a critical analysis since it is essential to assess whether it is category versus responsiveness that is associated with the spatial frequency tuning. Also, I do not believe that one can make a strong claim about category selectivity when only 6 faces and 3 objects (and 6 other, variable stimuli; 15 stimuli in total) are employed to assess the responses for these categories (see next main comment). This and the above control analysis can affect the main conclusion and title of the paper.

We appreciate your concern about whether the SF profiles reflect category selectivity or mere responsiveness. First, we note that we used the SI because it overcomes the limitations of accuracy and recall metrics, which are discrete and can saturate. With the SI, we cannot directly compute a face-vs-object value, since the index reports a single value for the whole discrimination task; we therefore calculate the SI for face/object vs. scrambled to obtain a value per category. However, as you suggest, this raises the question of whether we are assessing how well the neural responses distinguish actual images (faces or objects) from their scrambled versions, or merely assessing responsiveness. Based on Figure 3b, since we observe face-selective profiles (LSF- and HSF-preferred), an object-selective profile (inverse U), and the U profile, for which the SI is the same for faces and objects, we believe the SF profile is associated with category selectivity; otherwise we would find the same face/object recall in all profiles, as we do for the U-shaped profile.

To analyze this issue further, we counted the face/object-selective neurons in the 70-170 ms window. We found 43 face-selective and 36 object-selective neurons (FDR-corrected p-value < 0.05), so the numbers of face- and object-selective neurons are similar. Next, we checked the selectivity of the neurons within each profile. The numbers of face/object-selective neurons are LP=13/3, HP=6/2, IU=3/9, and U=14/13, with the remaining neurons belonging to the NP group. The results show more face-selective neurons in LP and HP and more object-selective neurons in the IU class, while the U class contains roughly equal numbers of face- and object-selective neurons. This observation supports the relationship between category selectivity and the profiles.

Next, we examined the average neuron response to faces and objects in each profile. The difference between the firing rates for faces and objects was not significant in any of the profiles (Wilcoxon rank-sum test, significance level 0.05). The average firing rates (spikes/s) for faces/objects are LP=36.72/28.77, HP=28.55/25.52, IU=21.55/27.25, and U=38.48/36.28. While these differences are not significant, they support a relationship between profiles and categories rather than mere responsiveness.

The following description was added to the Results section to address this point.

“To assess whether the SF profiles distinguish category selectivity or merely reflect the neuron's responsiveness, we quantified the number of face/non-face selective neurons in the 70-170ms time window. Our analysis shows a total of 43 face-selective neurons and 36 non-face-selective neurons (FDR-corrected p-value < 0.05). The results indicate a higher proportion of face-selective neurons in LP and HP, while a greater number of non-face-selective neurons is observed in the IU category (number of face/non-face selective neurons: LP=13/3, HP=6/2, IU=3/9). The U category exhibits a roughly equal distribution of face- and non-face-selective neurons (U=14/13). This finding reinforces the connection between category selectivity and the identified profiles. We then analyzed the average neuron response to faces and non-faces within each profile. The difference between the firing rates for faces and non-faces is not significant in any of the profiles (face/non-face average firing rate (Hz): LP=36.72/28.77, HP=28.55/25.52, IU=21.55/27.25, U=38.48/36.28; Ranksum with significance level of 0.05). Although the observed differences are not statistically significant, they provide support for the association between profiles and categories rather than mere responsiveness.”
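A compact sketch of this kind of selectivity count is given below, assuming per-neuron rank-sum tests followed by Benjamini-Hochberg FDR correction (SciPy/statsmodels equivalents of MATLAB's ranksum; all names are illustrative):

```python
import numpy as np
from scipy.stats import ranksums
from statsmodels.stats.multitest import multipletests

def count_selective(face_rates, nonface_rates, alpha=0.05):
    """face_rates / nonface_rates: per-neuron lists of per-trial
    firing rates in the analysis window (e.g. 70-170 ms)."""
    pvals = np.array([ranksums(f, nf).pvalue
                      for f, nf in zip(face_rates, nonface_rates)])
    # Benjamini-Hochberg FDR correction across neurons.
    rejected, _, _, _ = multipletests(pvals, alpha=alpha, method='fdr_bh')
    face_pref = np.array([np.mean(f) > np.mean(nf)
                          for f, nf in zip(face_rates, nonface_rates)])
    n_face = int((rejected & face_pref).sum())
    n_nonface = int((rejected & ~face_pref).sum())
    return n_face, n_nonface
```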

Regarding the low number of stimuli, please see the next comment.

    (15) For the category decoding, the authors employed intact, unscrambled stimuli. Were these from the main test? If yes, then I am concerned that this represents a too small number of stimuli to assess category selectivity. Only 9 fixed + 6 variable stimuli = 15 were in the main test. How many faces/ objects on average? Was the number of stimuli per category equated for the classification? When possible use the data of the preceding selectivity test which has many more stimuli to compute the category selectivity.

We used only the data recorded in the main phase, which contains 15 images per session. Each image yields 12 stimuli (intact and R1-R5, each with its phase-scrambled counterpart, i.e., 6 × 2 = 12), giving a total of 180 unique stimuli per session. Increasing the number of images would have increased the recording time; we compensated for this limitation by increasing the diversity of images in each session, picking the most responsive ones from the selectivity phase. On average, 7.54 of the 15 images per session were faces. We added this information to the Methods section. Furthermore, as mentioned in the Discussion, the number of samples per category is equalized for each classification run (a minimal sketch is given below). We note that we cannot use the selectivity-phase data for this analysis, since its stimuli are not filtered into the different SF bands.
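Equalizing samples per category before each classification run can be done by simple random subsampling; a minimal sketch, with all names illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def balance_categories(X, y):
    """Randomly subsample so every category contributes the same
    number of samples (applied before each classification run)."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_min, replace=False)
        for c in classes])
    return X[keep], y[keep]
```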

    Recommendations For The Authors:

    Reviewer #1 (Recommendations For The Authors):

    I suggest that the authors double-check their results by performing control experiments with longer stimulus duration and SF-spectrum-matched face and object stimuli.

Thank you for your suggestion; accordingly, we added Appendix 1 - Figure 3.

    In addition, I had a very difficult time understanding the differences between Figure 3c and Figure 4a. Please rewrite the descriptions to clarify.

Thank you for your suggestion; we revised the descriptions of these two figures. The following description was added to the Results section for Figure 3c.

“Next, to examine the relation between the SF (category) coding capacity of single neurons and the category (SF) coding capability at the population level, we calculated the correlation between coding performance at the population level and the coding performance of the single neurons within that population (Figure 3c and d). In other words, we investigated the relation between single-neuron and population-level coding capabilities for SF and category. The SF (or category) coding performance of a sub-population of 20 neurons with roughly the same single-neuron coding capability for category (or SF) is examined.”

Lines 147-148: The text states that 'The maximum accuracy of a single neuron was 19.08% higher than the chance level'. However, in Figure 4, the decoding accuracies of individual neurons for category and SF ranged between 49%-90% and 20%-40%, respectively.

    Please explain the discrepancies.

The first number is reported relative to the chance level, which is 20% for the five-band SF classification; thus the unnormalized accuracy is 20% + 19.08% = 39.08%, consistent with the SF accuracies in Figure 4. We added the following description to prevent any misunderstanding.

“… was 19.08% higher than the chance level (unnormalized accuracy is 39.08%, neuron #193, M2).”

    Lines 264-265: Should 'the alternative for R3 and R4' be 'the alternative for R4 and R5'?

Thank you for catching this; it should be “R4 and R5”. We corrected the mistake.

    Lines 551-562: The labels for SF classification are R1-R5. Is it a binary or a multi-classification task?

It is a multi-class classification task (five mutually exclusive SF-band labels). We made this explicit in the text.

“… labels were SF bands (R1, R2, ..., R5; a five-way multi-class classifier).”

    Figure 4b: Neurons in SF/category decoding exhibit both positive and negative weights. However, in the analysis of sparse neuron weights in Equation 6, only the magnitude of the weights is considered. Is the sign of weight considered too?

We used the absolute values of the neuron weights to calculate sparseness, so the sign of the weights is not considered. We also corrected Equation 6.
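Since Equation 6 is not reproduced here, the following is only a hedged illustration of a sparseness measure computed on the absolute classifier weights, using the Treves-Rolls form as an assumed stand-in for the corrected equation:

```python
import numpy as np

def treves_rolls_sparseness(w):
    """Treves-Rolls sparseness of the absolute classifier weights |w|.
    Returns 1 for perfectly uniform weights and approaches 1/n when a
    single neuron carries all the weight (maximal sparseness)."""
    a = np.abs(np.asarray(w, dtype=float))
    n = a.size
    return (a.sum() / n) ** 2 / (np.square(a).sum() / n)
```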

    Reviewer #2 (Recommendations For The Authors):

    (1) Line 52: what do the authors mean by coordinate processing in object recognition?

To avoid any potential misunderstanding, we used the exact phrase from Saneyoshi and Michimata (2015). It is, in fact, coordinate relations processing: coordinate relations specify the metric information of the relative locations of objects.

    (2) About half of the Introduction is a summary of the Results. This can be shortened.

    Thanks for your suggestion.

    (3) Line 134: Peristimulus time histogram instead of Prestimulus time histogram.

    Thanks for your attention. We corrected that.

    (4) Line 162: the authors state that R1 is decoded faster than R5, but the reported statistic is only for R1 versus R2.

It was a typo; the p-value is reported only for R1 versus R5.

    (5) Line 576: which test was used for the asses the statistical significance?

The test is the Wilcoxon signed-rank test. We added it to the text.
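For reference, a paired Wilcoxon signed-rank test of, e.g., per-run accuracies can be run with SciPy; the arrays below are placeholders, and the actual quantities compared follow the Methods:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
acc_a = rng.normal(0.39, 0.03, size=20)  # placeholder per-run accuracies
acc_b = rng.normal(0.20, 0.03, size=20)  # e.g. a chance-level control
stat, p = wilcoxon(acc_a, acc_b)         # paired, non-parametric
print(f"W = {stat:.1f}, p = {p:.2g}")
```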

    (6) How can one present a 35 ms long stimulus with a 60 Hz frame rate (the stimuli were presented on a 60Hz monitor (line 470))? Please correct.

Thank you for your attention; we corrected that. The stimulus presentation time is 33 ms and the monitor refresh rate is 120 Hz; at 120 Hz, 33 ms corresponds to four frames (4 × 8.33 ms ≈ 33.3 ms).

  6. eLife assessment

    This useful study aimed to examine the relationship of spatial frequency selectivity of single macaque inferotemporal (IT) neurons to category selectivity. There are some interesting findings in this report but some of these findings were difficult to evaluate because several critical details of the analysis are incomplete. The conclusion that single-unit spatial frequency selectivity can predict object coding needs further evidence to confirm.

  7. Reviewer #1 (Public Review):

This study reports that spatial frequency representation can predict category coding in the inferior temporal cortex. The original conclusion was based on likely problematic stimulus timing (33 ms, which was too brief). The authors now claim that they also have a different set of data based on a longer stimulus duration (200 ms).

One big issue in the original report was that the experiments used a stimulus duration that was too brief, which could have weakened the effects of high spatial frequencies and confounded the conclusions. The authors have now provided a new set of data based on a longer stimulus duration and claim that the conclusions are unchanged. These new data and the data in the original report were, as the authors report, collected at the same time.

The authors may provide an explanation of why they performed the same experiments using two stimulus durations and only reported the data set with the brief duration. They may also explain why they opted not to mention in the original report the existence of another data set with a different stimulus duration, which would otherwise have certainly strengthened their main conclusions.

I suggest the authors upload both data sets and the analysis code, so that the claim can be easily examined by interested readers.

  8. Reviewer #2 (Public Review):

    Summary:

    This paper aimed to examine the spatial frequency selectivity of macaque inferotemporal (IT) neurons and its relation to category selectivity. The authors suggest in the present study that some IT neurons show a sensitivity for the spatial frequency of scrambled images. Their report suggests a shift in preferred spatial frequency during the response, from low to high spatial frequencies. This agrees with a coarse-to-fine processing strategy, which is in line with multiple studies in the early visual cortex. In addition, they report that the selectivity for faces and objects, relative to scrambled stimuli, depends on the spatial frequency tuning of the neurons.

    Strengths:

Previous studies using human fMRI and psychophysics studied the contribution of different spatial frequency bands to object recognition but, as pointed out by the authors, little is known about the spatial frequency selectivity of single IT neurons. This study addresses this gap and shows spatial frequency selectivity in IT for scrambled stimuli that drive the neurons poorly. They related this weak spatial frequency selectivity to category selectivity, but these findings are premature given the low number of stimuli they employed to assess category selectivity.

The authors revised their manuscript and provided some clarifications regarding their experimental design and data analysis. They responded to most of my comments, but I find that some issues were not fully addressed or were addressed poorly. The new data they provided confirmed my concern about low responses to their scrambled stimuli. Thus, this paper shows spatial frequency selectivity in IT for scrambled stimuli that drive the neurons poorly (see main comments below). They related this (weak) spatial frequency selectivity to category selectivity, but these findings are premature given the low number of stimuli used to assess category selectivity.

    Main points.

(1) They have now provided the responses of their neurons in spikes/s and present a distribution of the raw responses in a new figure. These data suggest that their scrambled stimuli were driving the neurons rather poorly, and thus it is unclear how well their findings will generalize to more effective stimuli. Indeed, the mean net firing rate to their scrambled stimuli was very low: about 3 spikes/s. How much can one conclude when the stimuli drive the recorded neurons that poorly? Also, the new Figure 2 - Appendix 1 shows that the mean modulation by spatial frequency is about 2 spikes/s, which is a rather small modulation. Thus, the spatial frequency selectivity the authors describe in this paper is rather small compared to the stimulus selectivity one typically observes in IT (stimulus-driven modulations can be at least 20 spikes/s).
(2) Their new Figure 2 - Appendix 1 does not show net firing rates (baseline-subtracted, as I requested) and thus is not very informative. Please provide distributions of net responses so that readers can evaluate the responses of the recorded neurons to the stimuli.
    (3) The poor responses might be due to the short stimulus duration. The authors report now new data using a 200 ms duration which supported their classification and latency data obtained with their brief duration. It would be very informative if the authors could also provide the mean net responses for the 200 ms durations to their stimuli. Were these responses as low as those for the brief duration? If so, the concern of generalization to effective stimuli that drive IT neurons well remains.
    (4) I still do not understand why the analyses of Figures 3 and 4 provide different outcomes on the relationship between spatial frequency and category selectivity. I believe they refer to this finding in the Discussion: "Our results show a direct relationship between the population's category coding capability and the SF coding capability of individual neurons. While we observed a relation between SF and category coding, we have found uncorrelated representations. Unlike category coding, SF relies more on sparse, individual neuron representations.". I believe more clarification is necessary regarding the analyses of Figures 3 and 4, and why they can show different outcomes.
(5) The authors found a higher separability for faces (versus scrambled patterns) for neurons preferring high spatial frequencies. This is consistent across the two monkeys, but we are dealing here with a small number of neurons: only 6% of their neurons (16 neurons) belonged to this high spatial frequency group when pooling the two monkeys. Thus, although both monkeys show this effect, I wonder how robust it is given the small number of neurons per monkey that belong to this spatial frequency profile. Furthermore, the higher separability for faces of the low-frequency profiles is not consistent across monkeys, which should be pointed out.
    (6) I agree that CNNs are useful models for ventral stream processing but that is not relevant to the point I was making before regarding the comparison of the classification scores between neurons and the model. Because the number of features and trial-to-trial variability differs between neural nets and neurons, the classification scores are difficult to compare. One can compare the trends but not the raw classification scores between CNN and neurons without equating these variables.

  9. eLife assessment

    This useful study aimed to examine the relationship of spatial frequency selectivity of single macaque inferotemporal (IT) neurons to category selectivity. There are some interesting findings in this report but some of these findings were difficult to evaluate because several critical details of the analysis are incomplete. The conclusion that single-unit spatial frequency selectivity can predict object coding needs further evidence to confirm.

  10. Reviewer #1 (Public Review):

    Summary:
This study reports that IT neurons have biased representations toward low spatial frequencies (SF) and faster decoding of low SFs than high SFs. High-SF-preferred neurons, and to a lesser degree low-SF-preferred neurons, perform better at category decoding than neurons with other profiles (U-shaped and inverted-U-shaped). SF coding also shows more sparseness than category coding in the earlier phase of the response and less sparseness in the later phase. The results are also contrasted with predictions of various DNN models.

    Strengths:
    The study addressed an important issue on the representations of SF information in a high-level visual area. Data are analyzed with LDA which can effectively reduce the dimensionality of neuronal responses and retain category information.

    Weaknesses:
The results are likely compromised by improper stimulus timing and unmatched spatial frequency spectra of stimuli in different categories.

The authors used a very brief stimulus duration (35 ms), which would disproportionately degrade the visual system's contrast sensitivity to medium and high SF information (see Nachmias, JOSAA, 1967). Therefore, IT neurons in the study could have received more degraded medium and high SF inputs compared to low SF inputs, which may be at least partially responsible for the higher firing rates to low SF R1 stimuli (Figure 1c) and the poorer recall performance with medium and high SF R3-R5 stimuli in LDA decoding. The issue may also to some degree explain the delayed onset of recall to higher SF stimuli (Figure 2a), the preference for low SF with an earlier T1 onset (Figure 2b), the lower firing rate to high SF during T1 (Figure 2c), and the somewhat increased firing rate to high SF during T2 (because weaker high SF inputs would lead to later onset, Figure 2d).

Figure 3b shows greater face coding than object coding by high SF neurons and, to a lesser degree, by low SF neurons. Only the inverted-U-shaped neurons displayed slightly better object coding than face coding. Overall, the results give the impression that IT neurons are significantly more capable of coding faces than objects, which is inconsistent with the general understanding of the functions of IT neurons. The problem may lie with the selection of stimulus images (Figure 1b). To study SF-related category coding, the images in the two categories need to have similar SF spectra in the Fourier domain. Such efforts are not mentioned in the manuscript, and a look at the images in Figure 1b suggests that they were likely not properly made. The ResNet18 decoding results in Figure 6C, in which neurons of different profiles show similar face and object coding, might be closer to reality.

  11. Reviewer #2 (Public Review):

    Summary:
    This paper aimed to examine the spatial frequency selectivity of macaque inferotemporal (IT) neurons and its relation to category selectivity. The authors suggest in the present study that some IT neurons show a sensitivity for the spatial frequency of scrambled images. Their report suggests a shift in preferred spatial frequency during the response, from low to high spatial frequencies. This agrees with a coarse-to-fine processing strategy, which is in line with multiple studies in the early visual cortex. In addition, they report that the selectivity for faces and objects, relative to scrambled stimuli, depends on the spatial frequency tuning of the neurons.

    Strengths:
Previous studies using human fMRI and psychophysics studied the contribution of different spatial frequency bands to object recognition but, as pointed out by the authors, little is known about the spatial frequency selectivity of single IT neurons. This study addresses this gap, showing that at least some IT neurons are sensitive to spatial frequency and, interestingly, exhibit a tendency for coarse-to-fine processing.

    Weaknesses and requested clarifications:
1. It is unclear whether the effects described in this paper reflect a sensitivity to spatial frequency, i.e. in cycles/deg (which depends on the distance from the observer and changes when rescaling the image), or a sensitivity to cycles/image, largely independent of image scale. How is it related to the well-documented size tolerance of IT neuron selectivity?

2. The authors band-pass filtered phase-scrambled images of faces and objects. The original images likely differed in their spatial frequency amplitude spectrum, and thus it is unclear whether the differing bands contained the same power for the different scrambled images. If not, this could have contributed to the frequency sensitivity of the neurons.

    3. How strong were the responses to the phase-scrambled images? Phase-scrambled images are expected to be rather ineffective stimuli for IT neurons. How can one extrapolate the effect of the spatial frequency band observed for ineffective stimuli to that for more effective stimuli, like objects or (for some neurons) faces? A distribution should be provided, of the net responses (in spikes/s) to the scrambled stimuli, and this for the early and late windows.

    4. The strength of the spatial frequency selectivity is unclear from the presented data. The authors provide the result of a classification analysis, but this is in normalized units so that the reader does not know the classification score in percent correct. Unnormalized data should be provided. Also, it would be informative to provide a summary plot of the spatial frequency selectivity in spikes/s, e.g. by ranking the spatial frequency bands for each neuron based on half of the trials and then plotting the average responses for the obtained ranks for the other half of the trials. Thus, the reader can appreciate the strength of the spatial frequency selectivity, considering trial-to-trial variability. Also, a plot should be provided of the mean response to the stimuli for the two analysis windows of Figure 2c and 2d in spikes/s so one can appreciate the mean response strengths and effect size (see above).

    5. It is unclear why such brief stimulus durations were employed. Will the results be similar, in particular the preference for low spatial frequencies, for longer stimulus durations that are more similar to those encountered during natural vision?

    6. The authors report that the spatial frequency band classification accuracy for the population of neurons is not much higher than that of the best neuron (line 151). How does this relate to the SNC analysis, which appears to suggest that many neurons contribute to the spatial frequency selectivity of the population in a non-redundant fashion? Also, the outcome of the analyses should be provided (such as SNC and decoding (e.g. Figure 1D)) in the original units instead of undefined arbitrary units.

    7. To me, the results of the analyses of Figure 3c,d, and Figure 4 appear to disagree. The latter figure shows no correlation between category and spatial frequency classification accuracies while Figure 3c,d shows the opposite.

    8. If I understand correctly, the "main" test included scrambled versions of each of the "responsive" images selected based on the preceding test. Each stimulus was presented 15 times (once in each of the 15 blocks). The LDA classifier was trained to predict the 5 spatial frequency band labels and they used 70% of the trials to train the classifier. Were the trained and tested trials stratified with respect to the different scrambled images? Also, LDA assumes a normal distribution. Was this the case, especially because of the mixture of repetitions of the same scrambled stimulus and different scrambled stimuli?

    9. The LDA classifiers for spatial frequency band (5 labels) and category (2 labels) have different chance and performance levels. Was this taken into account when comparing the SNC between these two classifiers? Details and SNC values should be provided in the original (percent difference) instead of arbitrary units in Figure 5a. Without such details, the results are impossible to evaluate.

    10. Recording locations should be described in IT, since the latter is a large region. Did their recordings include the STS? A/P and M/L coordinate ranges of recorded neurons?

    11. The authors should show in Supplementary Figures the main data for each of the two animals, to ensure the reader that both monkeys showed similar trends.

    12. The authors found that the deep nets encoded better the spatial frequency bands than the IT units. However, IT units have trial-to-trial response variability and CNN units do not. Did they consider this when comparing IT and CNN classification performance? Also, the number of features differs between IT and CNN units. To me, comparing IT and CNN classification performances is like comparing apples and oranges.

    13. The authors should define the separability index in their paper. Since it is the main index to show a relationship between category and spatial frequency tuning, it should be described in detail. Also, results should be provided in the original units instead of undefined arbitrary units. The tuning profiles in Figure 3A should be in spikes/s. Also, it was unclear to me whether the classification of the neurons into the different tuning profiles was based on an ANOVA assessing per neuron whether the effect of the spatial frequency band was significant (as should be done).

    14. As mentioned above, the separability analysis is the main one suggesting an association between category and spatial frequency tuning. However, they compute the separability of each category with respect to the scrambled images. Since faces are a rather homogeneous category I expect that IT neurons have on average a higher separability index for faces than for the more heterogeneous category of objects, at least for neurons responsive to faces and/or objects. The higher separability for faces of the two low- and high-pass spatial frequency neurons could reflect stronger overall responses for these two classes of neurons. Was this the case? This is a critical analysis since it is essential to assess whether it is category versus responsiveness that is associated with the spatial frequency tuning. Also, I do not believe that one can make a strong claim about category selectivity when only 6 faces and 3 objects (and 6 other, variable stimuli; 15 stimuli in total) are employed to assess the responses for these categories (see next main comment). This and the above control analysis can affect the main conclusion and title of the paper.

    15. For the category decoding, the authors employed intact, unscrambled stimuli. Were these from the main test? If yes, then I am concerned that this represents a too small number of stimuli to assess category selectivity. Only 9 fixed + 6 variable stimuli = 15 were in the main test. How many faces/ objects on average? Was the number of stimuli per category equated for the classification? When possible use the data of the preceding selectivity test which has many more stimuli to compute the category selectivity.