Evidence for a deep, distributed and dynamic code for animacy in human ventral anterior temporal cortex

Curation statements for this article:
  • Curated by eLife


    Evaluation Summary:

    This manuscript will be of interest to neuroscientists and psychologists interested in how semantic information is encoded in the brain. It provides a framework for a model-driven comparison of semantic encoding in recurrent neural networks and neural data. Limitations in the ways the neural data are analyzed and compared to the model provide only limited support for the major claim regarding the nature of the semantic code in human anterior temporal lobe.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #2 agreed to share their name with the authors.)


Abstract

How does the human brain encode semantic information about objects? This paper reconciles two seemingly contradictory views. The first proposes that local neural populations independently encode semantic features; the second, that semantic representations arise as a dynamic distributed code that changes radically with stimulus processing. Combining simulations with a well-known neural network model of semantic memory, multivariate pattern classification, and human electrocorticography, we find that both views are partially correct: information about the animacy of a depicted stimulus is distributed across ventral temporal cortex in a dynamic code possessing feature-like elements posteriorly but with elements that change rapidly and nonlinearly in anterior regions. This pattern is consistent with the view that anterior temporal lobes serve as a deep cross-modal ‘hub’ in an interactive semantic network, and more generally suggests that tertiary association cortices may adopt dynamic distributed codes difficult to detect with common brain imaging methods.

Article activity feed

  1. Author Response:

    Reviewer #2:

    The current work makes the case that local neural measurements of selectivity to stimulus features and categories can, under certain circumstances, be misleading. The authors illustrate this point first through simulations within an artificial, deep, neural network model that is trained to map high-level visual representations of animals, plants, and objects to verbal labels, as well as to map the verbal labels back to their corresponding visual representations. As activity cycles forward and backward through the model, activity in the intermediate hidden layer (referred to as the "Hub") behaves in an interesting and non-linear fashion, with some units appearing first to respond more to animals than objects (or vice-versa) and then reversing category preference later in processing. This occurs despite the network progressively settling to a stable state (often referred to as a "point attractor"). Nevertheless, when the units are viewed at the population level, they are able to distinguish animals and objects (using logistic regression classifiers with L1-norm regularization) across the time points when the individual unit preferences appear to change. During the evolution of the network's states, classifiers trained at one time point do not apply well to data from earlier or later periods of time, with a gradual expansion of generalization to later time points as the network states become more stable. The authors then ask whether these same data properties (constant decodability, local temporal generalization, widening generalization window, change in code direction) are also present in electrophysiological recordings (ECoG) of anterior ventral temporal cortex during picture naming in 8 human epilepsy patients.
Indeed, they find support for all four data properties, with more stable animal/object classification direction in posterior aspects of the fusiform gyrus and more dynamic changes in classification in the anterior fusiform gyrus (calculated in the average classifier weights across all patients).
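    The temporal-generalization analysis the reviewer describes (train a classifier at one time point, test it at every other) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline; the drifting "code direction" is an assumption built in to mimic a dynamic code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_trials, n_units, n_times = 80, 20, 12

# Synthetic population data with a category signal whose coding
# direction drifts over time, loosely mimicking a dynamic code.
labels = rng.integers(0, 2, n_trials)            # 0 = inanimate, 1 = animate
X = rng.normal(size=(n_trials, n_units, n_times))
for t in range(n_times):
    direction = np.sin(np.linspace(0, np.pi, n_units) + 0.3 * t)
    X[labels == 1, :, t] += direction

# Train an L1-regularized logistic classifier at each time point and
# test it at every time point, yielding a time x time accuracy matrix.
half = n_trials // 2
acc = np.zeros((n_times, n_times))
for t_train in range(n_times):
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    clf.fit(X[:half, :, t_train], labels[:half])
    for t_test in range(n_times):
        acc[t_train, t_test] = clf.score(X[half:, :, t_test], labels[half:])

# A dynamic code shows high accuracy near the diagonal (train time ==
# test time) that falls off as train and test times grow apart.
```

    In a matrix like `acc`, the widening of the high-accuracy band around the diagonal at later times is the "expanding generalization window" pattern discussed in the review.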

    Strengths:

    Rogers et al. clearly expose the potential drawbacks to massive univariate analyses of stimulus feature and task selectivity in neuroimaging and physiological methods of all types -- which is a really important point given that this is the predominant approach to such analyses in cognitive neuroscience. fMRI, while having high spatial resolution, will almost certainly average over the kinds of temporal changes seen in this study. Even methods with high temporal and moderate spatial resolution (e.g. MEG, EEG) will often fail to find selectivity that is detectable only through multivariate methods. While some readers may be skeptical about the relevance of artificial neural networks to real human brain function, I found the simulations to be extremely useful. For me, what the simulations show is that a relatively typical multi-layer, recurrent backpropagation network (similar to ones used in numerous previous papers) does not require anything unusual to produce these kinds of counterintuitive effects. They simply need to exhibit strong attractor dynamics, which are naturally present in deep networks with multiple hidden layers, especially if the recurrent network interactions aid the model during training. This kind of recurrent processing should not be thought of as a stretch for the real brain. If anything, it should be the default expectation given our current knowledge of neuroanatomy. The authors also do a good job relating properties detected in their simulations to the ECoG data measured in human patients.

    We thank the reviewer for these positive comments.

    Weaknesses:

    While the ECoG data generally show the properties articulated by the authors, I found myself wanting to know more about the individual patients. Averaging across patients with different electrode locations -- and potentially different latencies of classification on different electrodes -- might be misleading. For example, how do we know that the shifts from negative to positive classification weights seen in the anterior temporal electrode sites are not really reflecting different dynamics of classification in separate patients? The authors partially examine this issue in the Supplementary Information (SI-3 and Figure SI-4) by analyzing classification shifts on individual patient electrodes. However, we don't know the locations of these electrodes (anterior versus posterior fusiform gyrus locations). The use of raw-ish LFPs averaged across the four repetitions of each stimulus (making an ERP) was also not an obvious choice, particularly if one desires to maximize the spatial precision of ECoG measures (compare unfiltered LFPs, which contain prominent low frequency fluctuations that can be shared across a larger spatial extent, to high frequency broadband power, 80-200 Hz).

    In the new statistical tests described above, we compute each metric separately for each patient, then conduct cross-subject statistical tests against a null hypothesis to assess whether the global pattern observed in the mean data is reliable across patients. We hope this addresses the reviewer's general concern that the mean pattern obscures heterogeneity across patients. With regard to the question of greater variability in anterior electrodes, the new analysis showing a remarkably strong correlation between variability of coefficient change and electrode location along the anterior-posterior axis provides a formal statistical test of this observation. We view variability of decoder coefficients as more informative than the independent correlations between electrode activity and category label shown in the supplementary materials, because the coefficients indicate the influence of electrode activity on classification when all other electrode states are taken into account (akin in some ways to a partial correlation coefficient). This distinction is noted in SI-3, p. 48.
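    The anterior-posterior analysis described here (relating each electrode's coefficient variability to its position along the anterior-posterior axis) can be sketched as an ordinary Pearson correlation. The data below are synthetic and the variable names hypothetical, not the authors' actual measurements:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n_electrodes, n_timepoints = 30, 50

# Hypothetical electrode positions along the posterior (0) -> anterior (1) axis.
position = rng.uniform(0.0, 1.0, n_electrodes)

# Classifier coefficient time-courses per electrode, constructed so that
# coefficient variability grows with anteriority (the pattern at issue).
coefs = rng.normal(scale=(0.2 + position)[:, None],
                   size=(n_electrodes, n_timepoints))

# Variability of each electrode's coefficient over time, correlated
# with its anterior-posterior position.
variability = coefs.std(axis=1)
r, p = pearsonr(position, variability)
```

    With data constructed this way, `r` is strongly positive; in the real analysis the magnitude and reliability of `r` across patients is the statistic of interest.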

    The authors are well-known for arguing that conceptual processing is critically mediated by a single hub region located in the anterior temporal lobe, through which all sensory and motor modalities interact. I think that it's worth pointing out that the current data, while compatible with this theory, are also compatible with a conceptual system with multiple hubs. Deep recurrent dynamics from high-level visual processing, for which visual properties may be separated for animals and objects in the posterior aspects of the fusiform gyrus, through to phonological processing of object names may operate exactly as the authors suggest. However, other aspects of conceptual processing relating to object function (such as tool use) may not pass through the anterior fusiform gyrus, but instead through more posterior ventral stream (and dorsal stream) regions for which the high-level visual features are more segregated for animals versus tools. Social processing may similarly have its own distinct networks that tie in to visual<->verbal networks at a distinct point. So while the authors are persuasive with regard to the need for deep, recurrent interactions, the status of one versus multiple conceptual hubs, and the exact locations of those hubs, remains open for debate.

    We agree that the current data do not speak to hypotheses about other components of the cortical semantic network outside the field-of-view of our dataset. We have added an explicit statement of this in the General Discussion (page 22).

    The concepts that the authors introduce are important, and they should lead researchers to examine the potential utility of multivariate classification methods for their own work. To the extent that fMRI is blind to the dynamics highlighted here, supplementing fMRI with other approaches with high temporal resolution will be required (e.g. MEG and simultaneous fMRI-EEG). For those interested in applying deep neural networks to neuroscientific data, the current demonstration should also be a cautionary tale for the use of feed-forward-only networks. Finally, the authors make an important contribution to our thinking about conceptual processing, providing novel arguments and evidence in support of point-attractor models.

    Thanks to the reviewer for highlighting these points, which we take to be central contributions of this work!

    Reviewer #3:

    The authors compared how semantic information is encoded as a function of time in a recurrent neural network trained to link visual and verbal representations of objects and in the ventral anterior temporal lobe of humans (ECoG recordings). The strategy is to decode between 'living' and 'nonliving' objects and test/train at different timepoints to examine how dynamic the underlying code is. The observation is that coding is dynamic in both the neural network and the neural data, as shown by decoders not generalizing to all other timepoints and by some units contributing with different sign to decoders trained at different timepoints. These findings are well in line with extensive evidence for a dynamic neural code as seen in numerous experiments (Stokes et al., 2013; King & Dehaene, 2014).

    Strengths of this paper include a direct model-to-data comparison with the same analysis strategy, a model capable of generating a dynamic code, and the use of rare intracranial recordings from humans. Weaknesses: While the model-driven examination of recordings is a major strength, the data analysis provides only limited support for the major claim of a 'distributed and dynamic semantic code': it isn't clear that the code is semantic, and the claims of dynamics and anatomical distribution are not quantitative.

    Major issues:

    1. Claims re a 'semantic code'. The ECoG analysis shows that decoding 'living' from 'nonliving' during viewing of images exhibits a dynamic code, with some electrodes contributing to early decodability and some to later, and with some contributing with different signs. It is a far stretch to conclude from this that this shows evidence for a 'dynamic semantic code'. No work is done to show that this representation is semantic; in fact, this kind of single categorical distinction could probably also be made based on purely visual signals (such as in higher levels of a network such as VGG or higher visual cortex recordings). In contrast, the model has rich structure across numerous semantic distinctions.

    We have added a new analysis showing that the animate/inanimate distinction cannot be decoded for these stimuli from purely visual information as captured by a well-known unsupervised method for computing visual similarity structure amongst bitmap line drawings (Chamfer matching). We did not consider deep layers of the VGG-19 model as that model is explicitly trained to assign photographs to human-labeled semantic categories, so the representations do not reflect purely visual structure. The new analysis appears as part of the description of the stimulus set on page 31.
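    Chamfer matching, the unsupervised visual-similarity measure mentioned above, can be sketched with a distance transform. This toy version illustrates the general technique under our own assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_distance(img_a, img_b):
    """Symmetric Chamfer distance between two binary edge images:
    for each 'on' pixel of one image, the distance to the nearest
    'on' pixel of the other, averaged over both directions."""
    # distance_transform_edt gives each pixel's distance to the
    # nearest zero, so invert the masks to make drawn pixels zero.
    dist_to_b = distance_transform_edt(~img_b)
    dist_to_a = distance_transform_edt(~img_a)
    return 0.5 * (dist_to_b[img_a].mean() + dist_to_a[img_b].mean())

# Toy example: a vertical stroke and the same stroke shifted two
# columns to the right.
a = np.zeros((8, 8), dtype=bool)
a[2:6, 2] = True
b = np.zeros((8, 8), dtype=bool)
b[2:6, 4] = True
```

    Here `chamfer_distance(a, a)` is 0 and `chamfer_distance(a, b)` is 2 (every stroke pixel is two columns from its nearest counterpart); pairwise distances of this sort define the purely visual similarity structure referred to in the response.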

    The proposal that ventral anterior temporal cortex encodes semantic information is not new to this paper but is based on an extensive prior literature that includes studies of semantic impairments in patients with pathology in this area (e.g. refs 7, 13, 29-32), studies of semantic disruption by TMS applied to this region (refs. 37-38), functional brain imaging of semantic processing with PET (33), distortion-corrected MRI (34-36), MEG (e.g. Mollo et al., 2017, PLOS ONE), and ECoG (ref. 46), and neurally-constrained computational models of developing, mature, and disordered semantic processing (refs. 7, 31, 40, 53). A great deal of this literature uses the same animate/inanimate distinction employed here as a paradigmatic example of a semantic distinction. It is especially useful in the current case because the animate/inanimate distinction is unrelated to the response elicited by the stimuli (the basic-level name).

    2. Missing quantification of model-data comparison. These conclusions aren't supported by quantitative analysis. This importantly includes statements regarding anatomical location (Fig 4E), resemblances in dynamic coding patterns ('overlapping waves', Fig 4C-D), and the presence of electrodes that 'switch sign'. These key conclusions seem to be derived purely by graphical inspection, which is not appropriate.

    We have added new statistical analyses of each core claim as explained above.

    3. ECoG recordings analysis. Raw LFP voltage was used as the feature (if I interpreted the methods correctly, see below). Given the claims that are made, this does not seem like an appropriate way to decode from ECoG signals, owing to its sensitivity to large deflections (evoked potentials). Analysis of different frequency bands, power, phase, etc. would be necessary to substantiate these claims. As it stands, a simpler interpretation of the findings is that the early-onset evoked activity (ERPs) gives rise to clusters 1-4, and more sustained deflections to the other clusters. This could also give rise to sign changes as ERPs change sign.

    The reviewer's comment suggests that information about the category should be reflected in spectral properties of the time-varying signals but not the direction/magnitude of the LFP itself. While we recognize that this is a common hypothesis in the literature, an alternative hypothesis more consistent with neural-network models of cognition suggests that such information can be encoded in magnitude and direction of the LFP itself—the closest brain analog to unit activity in a neural network model. The fact that semantic information can be accurately decoded from the LFPs, following a pattern closely resembling that arising in the model, is consistent with this hypothesis. We agree that, in future, it would be interesting to look at decoding of spectral properties of the signal. We note these points on revised manuscript page 22.

    With regard to this comment:

    a simpler interpretation of the findings is that the early onset evoked activity (ERPs) gives rise to clusters 1-4, and more sustained deflections to the other clusters. This could also give rise to sign changes as ERPs change sign

    We are not sure how this constitutes a simpler or even a different explanation of our data. ERPs at an intracranial electrode reflect local neural responses to the stimulus, which change over the course of stimulus processing. The data show that semantic information about the stimulus can be decoded from these signals at the initial evoked response and all subsequent timepoints, but the relationship between the neural response and the semantic category (i.e., how the semantic information is encoded in the measured response) changes as the stimulus is processed. The changing sign of an ERP reflects changing activity of nearby neural populations. "More sustained deflections" indicates that changes to the code are slowing over time. These are essentially the conclusions that we draw about the dynamic code from our data.

    Maybe the reviewer is concerned that the results are an artifact of just the temporal structure of the LFPs themselves—that these change rapidly with stimulus onset and then slow down, so that the “expanding window” pattern arises from, for instance, temporal auto-correlation in the raw data. Testing this possibility was the goal of the analysis in SI-5, where we show that auto-correlation of the raw LFP signal does not grow broader over time—so the widening-window pattern observed in the generalization of classifiers is not attributable to the temporal autocorrelation structure of the raw data.
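    The SI-5 control described here can be sketched as a cross-temporal correlation of the raw signal itself: correlate values at every pair of time points across trials and check whether the high-correlation band around the diagonal widens over time. A minimal version on synthetic, stationary data (all names illustrative, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
n_trials, n_times = 60, 40

# Synthetic LFP-like signal: noise smoothed with a fixed-width kernel,
# so its autocorrelation width is constant across time by construction.
kernel = np.ones(5) / 5.0
raw = rng.normal(size=(n_trials, n_times))
signal = np.array([np.convolve(trial, kernel, mode="same") for trial in raw])

# Correlation across trials between every pair of time points.
corr = np.corrcoef(signal.T)                 # shape: (n_times, n_times)

# Width of the high-correlation band around the diagonal at each time
# (edges trimmed to avoid smoothing artifacts at the boundaries).
widths = [int(np.sum(corr[t] > 0.5)) for t in range(5, n_times - 5)]

# For a stationary signal these widths stay flat over time; a widening
# generalization window for classifiers therefore cannot be explained
# by the raw signal's temporal autocorrelation alone.
```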

  2. Reviewer #3 (Public Review):

    The authors compared how semantic information is encoded as a function of time in a recurrent neural network trained to link visual and verbal representations of objects and in the ventral anterior temporal lobe of humans (ECoG recordings). The strategy is to decode between 'living' and 'nonliving' objects and test/train at different timepoints to examine how dynamic the underlying code is. The observation is that coding is dynamic in both the neural network and the neural data, as shown by decoders not generalizing to all other timepoints and by some units contributing with different sign to decoders trained at different timepoints. These findings are well in line with extensive evidence for a dynamic neural code as seen in numerous experiments (Stokes et al., 2013; King & Dehaene, 2014).

    Strengths of this paper include a direct model-to-data comparison with the same analysis strategy, a model capable of generating a dynamic code, and the use of rare intracranial recordings from humans. Weaknesses: While the model-driven examination of recordings is a major strength, the data analysis provides only limited support for the major claim of a 'distributed and dynamic semantic code': it isn't clear that the code is semantic, and the claims of dynamics and anatomical distribution are not quantitative.

    Major issues:

    1. Claims re a 'semantic code'. The ECoG analysis shows that decoding 'living' from 'nonliving' during viewing of images exhibits a dynamic code, with some electrodes contributing to early decodability and some to later, and with some contributing with different signs. It is a far stretch to conclude from this that this shows evidence for a 'dynamic semantic code'. No work is done to show that this representation is semantic; in fact, this kind of single categorical distinction could probably also be made based on purely visual signals (such as in higher levels of a network such as VGG or higher visual cortex recordings). In contrast, the model has rich structure across numerous semantic distinctions.

    2. Missing quantification of model-data comparison. These conclusions aren't supported by quantitative analysis. This importantly includes statements regarding anatomical location (Fig 4E), resemblances in dynamic coding patterns ('overlapping waves', Fig 4C-D), and the presence of electrodes that 'switch sign'. These key conclusions seem to be derived purely by graphical inspection, which is not appropriate.

    3. ECoG recordings analysis. Raw LFP voltage was used as the feature (if I interpreted the methods correctly, see below). Given the claims that are made, this does not seem like an appropriate way to decode from ECoG signals, owing to its sensitivity to large deflections (evoked potentials). Analysis of different frequency bands, power, phase, etc. would be necessary to substantiate these claims. As it stands, a simpler interpretation of the findings is that the early-onset evoked activity (ERPs) gives rise to clusters 1-4, and more sustained deflections to the other clusters. This could also give rise to sign changes as ERPs change sign.

  3. Reviewer #2 (Public Review):

    The current work makes the case that local neural measurements of selectivity to stimulus features and categories can, under certain circumstances, be misleading. The authors illustrate this point first through simulations within an artificial, deep, neural network model that is trained to map high-level visual representations of animals, plants, and objects to verbal labels, as well as to map the verbal labels back to their corresponding visual representations. As activity cycles forward and backward through the model, activity in the intermediate hidden layer (referred to as the "Hub") behaves in an interesting and non-linear fashion, with some units appearing first to respond more to animals than objects (or vice-versa) and then reversing category preference later in processing. This occurs despite the network progressively settling to a stable state (often referred to as a "point attractor"). Nevertheless, when the units are viewed at the population level, they are able to distinguish animals and objects (using logistic regression classifiers with L1-norm regularization) across the time points when the individual unit preferences appear to change. During the evolution of the network's states, classifiers trained at one time point do not apply well to data from earlier or later periods of time, with a gradual expansion of generalization to later time points as the network states become more stable. The authors then ask whether these same data properties (constant decodability, local temporal generalization, widening generalization window, change in code direction) are also present in electrophysiological recordings (ECoG) of anterior ventral temporal cortex during picture naming in 8 human epilepsy patients. 
Indeed, they find support for all four data properties, with more stable animal/object classification direction in posterior aspects of the fusiform gyrus and more dynamic changes in classification in the anterior fusiform gyrus (calculated in the average classifier weights across all patients).

    Strengths:

    Rogers et al. clearly expose the potential drawbacks to massive univariate analyses of stimulus feature and task selectivity in neuroimaging and physiological methods of all types -- which is a really important point given that this is the predominant approach to such analyses in cognitive neuroscience. fMRI, while having high spatial resolution, will almost certainly average over the kinds of temporal changes seen in this study. Even methods with high temporal and moderate spatial resolution (e.g. MEG, EEG) will often fail to find selectivity that is detectable only through multivariate methods. While some readers may be skeptical about the relevance of artificial neural networks to real human brain function, I found the simulations to be extremely useful. For me, what the simulations show is that a relatively typical multi-layer, recurrent backpropagation network (similar to ones used in numerous previous papers) does not require anything unusual to produce these kinds of counterintuitive effects. They simply need to exhibit strong attractor dynamics, which are naturally present in deep networks with multiple hidden layers, especially if the recurrent network interactions aid the model during training. This kind of recurrent processing should not be thought of as a stretch for the real brain. If anything, it should be the default expectation given our current knowledge of neuroanatomy. The authors also do a good job relating properties detected in their simulations to the ECoG data measured in human patients.

    Weaknesses:

    While the ECoG data generally show the properties articulated by the authors, I found myself wanting to know more about the individual patients. Averaging across patients with different electrode locations -- and potentially different latencies of classification on different electrodes -- might be misleading. For example, how do we know that the shifts from negative to positive classification weights seen in the anterior temporal electrode sites are not really reflecting different dynamics of classification in separate patients? The authors partially examine this issue in the Supplementary Information (SI-3 and Figure SI-4) by analyzing classification shifts on individual patient electrodes. However, we don't know the locations of these electrodes (anterior versus posterior fusiform gyrus locations). The use of raw-ish LFPs averaged across the four repetitions of each stimulus (making an ERP) was also not an obvious choice, particularly if one desires to maximize the spatial precision of ECoG measures (compare unfiltered LFPs, which contain prominent low frequency fluctuations that can be shared across a larger spatial extent, to high frequency broadband power, 80-200 Hz).

    The authors are well-known for arguing that conceptual processing is critically mediated by a single hub region located in the anterior temporal lobe, through which all sensory and motor modalities interact. I think that it's worth pointing out that the current data, while compatible with this theory, are also compatible with a conceptual system with multiple hubs. Deep recurrent dynamics from high-level visual processing, for which visual properties may be separated for animals and objects in the posterior aspects of the fusiform gyrus, through to phonological processing of object names may operate exactly as the authors suggest. However, other aspects of conceptual processing relating to object function (such as tool use) may not pass through the anterior fusiform gyrus, but instead through more posterior ventral stream (and dorsal stream) regions for which the high-level visual features are more segregated for animals versus tools. Social processing may similarly have its own distinct networks that tie in to visual<->verbal networks at a distinct point. So while the authors are persuasive with regard to the need for deep, recurrent interactions, the status of one versus multiple conceptual hubs, and the exact locations of those hubs, remains open for debate.

    The concepts that the authors introduce are important, and they should lead researchers to examine the potential utility of multivariate classification methods for their own work. To the extent that fMRI is blind to the dynamics highlighted here, supplementing fMRI with other approaches with high temporal resolution will be required (e.g. MEG and simultaneous fMRI-EEG). For those interested in applying deep neural networks to neuroscientific data, the current demonstration should also be a cautionary tale for the use of feed-forward-only networks. Finally, the authors make an important contribution to our thinking about conceptual processing, providing novel arguments and evidence in support of point-attractor models.

  4. Reviewer #1 (Public Review):

    In this technically difficult study of a crucial and understudied area of the human anterior temporal lobe (ATL), the authors set out to investigate the possibility that representations in this area are dynamic, in keeping with its putative role as a semantic hub. In short, they report evidence for stable representations in posterior areas and dynamic representations in anterior areas.

    The major strength of this paper is in the nature of the physiological data (ECOG) and the complexity of the associated modeling and computational work. In particular, the consideration of and attempt to model dynamic representations is a real strength.

    The major weakness is a slight lack of direct statistical tests to back up certain claims. For example, a difference between the posterior and anterior electrodes is discussed, but there is not much direct statistical comparison of those areas. Which model has the best performance clearly changes over time, but there are no direct statistical comparisons of the models' performance over that period.

    Overall, there is some evidence for a dynamic representation in this area, and the analyses here do point at the need for a more thorough (i.e., considering the possibility of dynamic change) and generally applied approach to studying representations.

  5. Evaluation Summary:

    This manuscript will be of interest to neuroscientists and psychologists interested in how semantic information is encoded in the brain. It provides a framework for a model-driven comparison of semantic encoding in recurrent neural networks and neural data. Limitations in the ways the neural data are analyzed and compared to the model provide only limited support for the major claim regarding the nature of the semantic code in human anterior temporal lobe.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #2 agreed to share their name with the authors.)