Gain, not concomitant changes in spatial receptive field properties, improves task performance in a neural network attention model

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    This manuscript by Fox, Birman, and Gardner combines human behavioral experiments with spatial attention manipulation and computational modeling (image-computable convolutional neural network models) to investigate the computational mechanisms that may underlie improvements in behavioral performance when deploying spatial attention. Through carefully controlled manipulations of computational architecture and parameters, the authors dissociate the effects of different tuning properties (e.g. tuning gain vs. tuning shifts) and conclude that increases in gain are the primary means by which attention improves behavioral performance. The analyses and results are technically sound and clearly presented, but the generality of the conclusions is limited by certain modeling/task choices made in the work.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #2 agreed to share their names with the authors.)

This article has been Reviewed by the following groups

Abstract

Attention allows us to focus sensory processing on behaviorally relevant aspects of the visual world. One potential mechanism of attention is a change in the gain of sensory responses. However, changing gain at early stages could have multiple downstream consequences for visual processing. Which, if any, of these effects can account for the benefits of attention for detection and discrimination? Using a model of primate visual cortex we document how a Gaussian-shaped gain modulation results in changes to spatial tuning properties. Forcing the model to use only these changes failed to produce any benefit in task performance. Instead, we found that gain alone was both necessary and sufficient to explain category detection and discrimination during attention. Our results show how gain can give rise to changes in receptive fields which are not necessary for enhancing task performance.
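The Gaussian-shaped gain manipulation described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation; the focus location, width, and the 1.25 peak gain used below are assumed values for the example.

```python
import math

def attention_gain(y, x, center, sigma=30.0, peak_gain=1.25):
    """Gain is 1 far from the attended location, rising to peak_gain at it."""
    d2 = (y - center[0]) ** 2 + (x - center[1]) ** 2
    return 1.0 + (peak_gain - 1.0) * math.exp(-d2 / (2 * sigma ** 2))

center = (112, 60)                            # hypothetical cued location (pixels)
at_focus = attention_gain(112, 60, center)    # 1.25 at the attended location
far_away = attention_gain(0, 223, center)     # ~1.0 far from it
# Applying attention: attended_pixel = attention_gain(y, x, center) * pixel,
# for every pixel, before the image enters the network.
```

Because the map asymptotes to 1, processing far from the attentional focus is left essentially unchanged.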

Article activity feed

  1. Author Response

    Reviewer 1 (Public Review):

    Fox, Birman, and Gardner use a previously proposed convolutional neural network of the ventral visual pathway to test the behavioral and physiological impact of an attentional gain spotlight operating on the inputs to the network. They show that a gain modulation that matches the behavioral benefit of attentional cueing in a matching behavioral task induces changes in the receptive fields (RFs) of the model units, which are consistent with previous neurophysiological reports: RF scaling, RF shift towards the attentional focus, and RF shrinkage around the focus of attention. Ingenious simulations then allow them to isolate the specific impact of these RF modulations in achieving performance improvements. The simulations show that RF scaling is primarily responsible for the improvement in performance in this computational model, whereas RF shift does not induce any significant change in decoding performance. This is significant because many previous studies have hypothesized a leading role of RF shifts in attentional selection. With their elegant approach, the authors show in this manuscript that this is questionable and argue that changes in the shape of RFs are epiphenomena of the truly relevant modulation, which is the multiplicative scaling of neural responses.

    Strengths:

    The use of a multi-layer network that accomplishes visual processing, with an approximate correspondence with the visual system, is a strength of this manuscript that allows it to address in a principled way the behavioral advantage contributed by various attentional neural modulations.

    The simulations designed to isolate the contributions of the various RF modulations are very ingenious and convincingly demonstrate a superior role of gain modulation over RF shifts in improving detection performance in the model.

    We thank the reviewer for these supportive comments.

    Weaknesses:

    There is no mention of a possible specificity of the manuscript conclusions in relation to the type of task to be performed. It is conceivable that mechanisms that are not important for detection tasks are instead crucial for a reproduction task, as in Vo et al. (2017).

    We agree that other behavioral tasks may rely on different attentional mechanisms than the ones we have studied here for detection and discrimination, and we now specifically point this out in the discussion [379-395].

    The manuscript puts emphasis on the biological plausibility of the model, and some quantitative agreements. But at some important points these comparisons do not appear very consistent:

    1. It is unclear what output of the model at each cortical area is to be compared with neurophysiological data. On the one hand, the manuscript argues that a 1.25 attentional factor is consistent with single-neuron results, but here this factor is applied to the inputs into V1 units. When this modulation goes through normalization in area V1, the output of V1 has a 2x gain. Intuitively, one would think that recordings in V1 neurons would correspond to layer V1 outputs in the model, but this is not the approach taken in the manuscript. This needs clarification. Also, note that the 20-40% gain reported in line 287 corresponds to high-order visual areas (V4 or MT), but not to V1, in the cited references. The quantitative correspondence between gain factors at various processing steps in the model and in the data is confusing and should be clearer.

    We agree that making a one-to-one mapping between gain effects measured in neurophysiology and different layers of the CNN is problematic. We have therefore clarified that the introduction of gain at the earliest stages of processing is meant to study how gain propagates through a complex CNN and has downstream effects [49-52 and 410-447], and we have also clarified the various uncertainties in making a one-to-one mapping from the CNN to neurophysiological measurements of gain [410-447].
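    The reviewer's observation that gain measured at a layer's output need not match the gain injected at its input can be illustrated with a toy unit. The values below are hypothetical, and a simple thresholded ReLU stands in for the network's normalization stage; the point is only that a modest multiplicative input gain can emerge as a larger effective gain after a nonlinearity.

```python
def relu(v):
    return max(v, 0.0)

drive = 2.0        # hypothetical feedforward drive to a unit
threshold = 1.0    # subtractive threshold before the ReLU
input_gain = 1.25  # gain applied to the unit's input

unattended = relu(drive - threshold)              # -> 1.0
attended = relu(input_gain * drive - threshold)   # -> 1.5
output_gain = attended / unattended               # 1.5x from a 1.25x input gain
```

This is why comparing a single injected gain factor against single-neuron gain measurements at a specific cortical area is not straightforward.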

    1. The model assumes a gain modulation in the inputs to V1. This would correspond to an attentional gain modulation in LGN unit outputs. There is little evidence of such strong modulation of LGN activity by attention. Also in V1 attentional modulation is small. As stated in Discussion (line 295), there is no reason to favor the current model as opposed to a model where the attentional gain is imposed later on in the visual hierarchy (for example V4). If anything, neurophysiology would be more consistent with this last scenario, given the evidence for direct V4 gain control from frontal eye fields (Moore and Armstrong, Nature 2003). The rationale for focusing on a model that incorporates the attentional spotlight on the inputs to V1 should be disclosed.

    We agree that measurements of gain changes with attention appear larger in later stages of visual processing, and we do not wish to explicitly link the gain changes imposed at the earliest stages of processing in our CNN observer model with changes in input from LGN, as we agree this would be unrealistic. Instead, our goal was to examine how gain changes can propagate through complex neural networks and cause downstream effects on spatial tuning properties and the efficacy of readout. We have substantially rewritten the manuscript, in particular the introduction [24-38, 49-52] and discussion [441-447], to better describe this rationale. We also now explicitly discuss how our propagated gain test demonstrates exactly the reviewer's point: that gain can be injected late in the system, rather than at the earliest stages [274-276, 441-447].

    1. The model chosen is the CORnet-z model, but this model does not include recurrent dynamics within each layer. Recurrent dynamics is a prominent feature in the cortex, and there is evidence indicating that attentional modulations operate differently in feedforward and in recurrent architectures (Compte and Wang, Cerebral Cortex 2006). A specific feature of recurrent models is that the attentional spotlight need not be a multiplicative factor (which is biologically complicated) but an additive term before the ReLU non-linearity, which achieves the expected RF modulations (Compte and Wang, 2006). A model with recurrence thus represents another architecture that links gain and shift in a way that has not been explored in this manuscript, and this may limit the generalization of the conclusions (line 205).

    We appreciate the reviewer pointing us toward the Compte paper and we’ve added a discussion of recurrence as an alternate model [410-423].
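    The additive alternative the reviewer raises (Compte and Wang, 2006) can be sketched in one dimension with illustrative parameters: adding a small bump of input at the attended location before the ReLU pulls a unit's response peak toward that location, mimicking an RF shift without any multiplicative gain. All receptive-field and attention-field parameters below are assumptions for the example.

```python
import math

def gauss(x, mu, sigma=1.0):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

xs = [i * 0.005 - 5.0 for i in range(2001)]          # 1-D visual space
rf = [gauss(x, 0.0) for x in xs]                     # RF centered at x = 0
# additive attentional bump at x = 2, applied before the ReLU:
attended = [max(r + 0.3 * gauss(x, 2.0), 0.0) for x, r in zip(xs, rf)]

baseline_peak = xs[max(range(len(xs)), key=rf.__getitem__)]        # 0.0
attended_peak = xs[max(range(len(xs)), key=attended.__getitem__)]  # shifted
# attended_peak lies between 0 and 2: the response peak has moved toward
# the attentional focus even though no multiplicative gain was applied.
```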

    Reviewer 2 (Public Review):

    This manuscript by Fox, Birman, and Gardner combines human behavioral experiments with spatial attention manipulation and computational modeling (image-computable convolutional neural network models) to investigate the computational mechanisms that may underlie improvements in behavioral performance when deploying spatial attention.

    Strengths:

    • The manuscript is clear and the analyses, modeling, and exposition are executed well.
    • The behavioral experiments are carefully conducted and of high quality.
    • The manuscript takes a creative approach to constructing a "neural network observer model", that is, coupling an image-computable model to a potential readout mechanism that specifies how the representations might be used for the purposes of behavior. The focused analyses of the model innards (architecture, parameters) provide insight into how different model components lead to the final behavior of the model.

    Thank you for these supportive comments.

    Weaknesses:

    • The overall conclusions and insights gained seem heavily dependent on particular choices and design decisions made in this specific model. In particular, the readout mechanism lacks some critical descriptive details, and it is not clear whether the readout mechanism (512-dimensional representation that reflects summing over visual space) is a reasonable choice. As such, while the computational analyses and results may be correct for this model, it is not clear whether the strong general conclusions are justified. Thus, the results in their current form feel more like exploratory work showing proof of concept of how the issue of attention and underlying computational mechanisms can be studied in a rigorous and concrete computational modeling context, rather than definitive results concerning how attention operates in the visual system.

    Please see below for our response to the issue with readout and conclusions.
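    For concreteness, the kind of readout the reviewer is questioning might look like the following sketch. The shapes and the summing-over-space step are our reading of the review's description (a 512-dimensional representation obtained by pooling the final feature map), not the authors' exact code, and the random features and weights are purely illustrative.

```python
import random

random.seed(0)
C, H, W = 512, 7, 7                    # assumed final conv-layer shape
features = [[[random.random() for _ in range(W)] for _ in range(H)]
            for _ in range(C)]

# Sum each channel over visual space -> one number per channel (512 total).
pooled = [sum(v for row in channel for v in row) for channel in features]

weights = [random.gauss(0.0, 1.0) for _ in range(C)]   # illustrative readout
decision = sum(p * w for p, w in zip(pooled, weights)) > 0.0
```

Note that summing over space discards where in the visual field each feature occurred, which is central to the reviewer's concern about whether this readout is a reasonable choice.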

    Overall, the work is solidly constructed, but the overall generality and strength of the conclusions require substantial dampening.
