Convolutional networks can model the functional modulation of MEG responses during reading

Curation statements for this article:
  • Curated by eLife

This article has been reviewed by the following groups


Abstract

Neuroimaging studies have provided a wealth of information about when and where changes in brain activity might be expected during reading. We sought to better understand the computational steps that give rise to such task-related modulations of neural activity by using a convolutional neural network to model the macro-scale computations necessary for single-word recognition. We presented the model with the stimuli that had been shown to human volunteers in an earlier magnetoencephalography (MEG) experiment and evaluated whether the same experimental effects could be observed in both the brain activity and the model. In a direct comparison between model and MEG recordings, the model accurately predicted the amplitude changes of three evoked MEG response components commonly observed during single-word reading. In contrast to traditional models of reading, our model operates directly on the pixel values of an image containing text. This allowed us to simulate the whole gamut of processing, from the detection and segmentation of letter shapes to word-form identification, with the deep learning architecture facilitating the inclusion of a large vocabulary of 10,000 Finnish words. Interestingly, the key to achieving the desired behavior was to use a noisy activation function for the units in the model and to obey word-frequency statistics when repeating stimuli during training. We conclude that the deep learning techniques that revolutionized models of object recognition can also yield models of reading that can be straightforwardly compared to neuroimaging data, which will greatly facilitate testing and refining theories of language processing in the brain.
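
As a rough illustration of the noisy activation function mentioned in the abstract, here is a minimal sketch assuming additive Gaussian noise on the unit inputs during training; the noise model and the `sigma` value are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class NoisyReLU(nn.Module):
    """ReLU whose input is perturbed by additive Gaussian noise while training."""

    def __init__(self, sigma: float = 0.1):
        super().__init__()
        self.sigma = sigma  # assumed noise level, not taken from the paper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Perturb the pre-activation values, mimicking noisy units.
            x = x + self.sigma * torch.randn_like(x)
        return torch.relu(x)
```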

Article activity feed

  1. Author response:

We thank the reviewers for their efforts. They have pointed out several shortcomings and made very helpful suggestions. Below, we briefly address the weak points the reviewers brought up and outline the improvements we intend to make in the revised paper.

    Reviewer #1:

    The interpretation of CNN results, especially the number of layers in the final model and its relationship with the processing of visual words in the human brain, needs to be further strengthened.

    The results of our experimentation with the number of layers and the number of units in each layer can be found in the supplementary information. In the revised version, we will bring some of these results into the main text and discuss them more thoroughly.

    Reviewer #2:

    As has been shown over many decades, many potential computational algorithms, with varied model architectures, can perform the task of text recognition from an image. However, there is no evidence presented here that this particular algorithm has comparable performance to human behavior (i.e. similar accuracy with a comparable pattern of mistakes). This is a fundamental prerequisite before attempting to meaningfully correlate these layer activations to human neural activations. Therefore, it is unlikely that correlating these derived layer weights to neural activity provides meaningful novel insights into neural computation beyond what is seen using traditional experimental methods.

    We very much agree with the reviewer that a qualitative analysis of whether the model can explain experimental effects needs to happen before a quantitative analysis, such as evaluating model-brain correlation scores. In fact, this is one of the key points we wished to make.

This starts with the observation that "traditional" models of reading (i.e., those that do not rely on deep learning) cannot explain some very basic human behavioral results, such as the ability to recognize a word regardless of its exact letter shape, size, and (up to a point) rotation. This is not so much a failure on the part of traditional models as it is a difference in focus. There are models of vision that do focus on these low-level aspects, currently dominated by deep learning, but they are rarely evaluated in the context of reading, which has its own literature and well-known experimental effects. We believe the current version of the manuscript does not make sufficiently clear what exactly the goals of our modeling effort are, which is something we will correct in the revision.

    Since our model only covers the first phase of reading, with a special focus on letter shape detection, we sought to compare it with neuroimaging data that can provide "snapshots" of the state of the brain during these early phases, rather than comparing it with behavioral results that occur at the very end. However, we very much make this comparison in the spirit hinted at by the reviewer. The different MEG components have a distinct "behavior" to them in the way they respond to different experimental conditions (Figure 2), and the model needs to replicate this behavior (Figure 4). Only then do we move on to a quantitative analysis.

    One example of a substantial discrepancy between this model and neural activations is that, while incorporating frequency weighting into the training data is shown to slightly increase neural correlation with the model, Figure 7 shows that no layer of the model appears directly sensitive to word frequency. This is in stark contrast to the strong neural sensitivity to word frequency seen in EEG (e.g. Dambacher et al 2006 Brain Research), fMRI (e.g. Kronbichler et al 2004 NeuroImage), MEG (e.g. Huizeling et al 2021 Neurobio. Lang.), and intracranial (e.g. Woolnough et al 2022 J. Neurosci.) recordings. Figure 7 also demonstrates that the late stages of the model show a strong negative correlation with font size, whereas later stages of neural visual word processing are typically insensitive to differences in visual features, instead showing sensitivity to lexical factors.

We are glad the reviewer brought up the topic of frequency balancing, as it is a good example of the importance of the qualitative analysis. As the reviewer points out, frequency balancing during training only had a moderate impact on correlation scores and from that point of view does not seem impactful. However, when we look at the qualitative evaluation, we see that with a large vocabulary, a model without frequency balancing fails to properly distinguish between consonant strings and (pseudo)words (Figure 4, 5th row). Hence, from the point of view of being able to reproduce experimental effects, frequency balancing had a large impact. It is true that the model, even with frequency balancing, only captures letter- and bigram-frequency effects and not the word-frequency effects that the N400 is known to be sensitive to. This could mean that N400 word-frequency effects are driven by mechanisms that our current model lacks, such as top-down input from systems further up the processing pipeline.
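
    To make concrete what frequency balancing amounts to in practice, here is a minimal sketch in which training stimuli are drawn in proportion to corpus word frequency; the function name, example words, and counts are illustrative assumptions, not taken from the paper:

    ```python
    import random

    def sample_training_words(words, corpus_counts, n_samples):
        """Draw training words in proportion to their corpus frequency.

        High-frequency words are repeated more often during training, so the
        model is exposed to realistic word statistics rather than a uniform
        vocabulary.
        """
        return random.choices(words, weights=corpus_counts, k=n_samples)

    # Example with made-up counts: frequent words dominate the training stream.
    batch = sample_training_words(["talo", "ja", "kissa"], [120, 9000, 45], 10)
    ```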

We agree with the reviewer that the late-stage sensitivity of the model to font size must be seen as a flaw, and we say as much when discussing this result in the paper. Important context for this flaw is that the main aim of the model is to reproduce the experimental effects of Vartiainen et al. (2011), a study that did not include manipulation of font size. The experimental contrasts in Figure 7 are meant to explore a little beyond the boundaries of that particular study and were never considered "failure points": when presenting a model, it is important to show its limitations too.

    Another example of the mismatch between this model and the visual cortex is the lack of feedback connections in the model. Within the visual cortex, there are extensive feedback connections, with later processing stages providing recursive feedback to earlier stages. This is especially evident in reading, where feedback from lexical-level processes feeds back to letter-level processes (e.g. Heilbron et al 2020 Nature Comms.). This feedback is especially relevant for the reading of words in noisy conditions, as tested in the current manuscript, as lexical knowledge enhances letter representation in the visual cortex (the word superiority effect). This results in neural activity in multiple cortical areas varying over time, changing selectivity within a region at different measured time points (e.g. Woolnough et al 2021 Nature Human Behav.), which in the current study is simplified down to three discrete time windows, each attributed to different spatial locations.

In this study, we take a first step in showing how deep learning techniques could enhance models of reading, by demonstrating that even a simple CNN, after a few enhancements, can account for several experimental MEG effects that are observed in reading tasks but fall outside the focus of traditional models of reading. We never intended to claim that our model offers a complete view of all the processes involved, which is why we have dedicated a section of the Discussion to the various ways in which our simple CNN is incomplete as a model of reading. In that section we hint at the use of recurrent connections, but the reviewer does an excellent job of highlighting the importance of top-down connections even in models focusing on early visual processes, a point we are very happy to include in this section.

    The presented model needs substantial further development to be able to replicate, both behaviorally and neurally, many of the well-characterized phenomena seen in human behavior and neural recordings that are fundamental hallmarks of human visual word processing. Until that point, it is unclear what novel contributions can be gleaned from correlating low-dimensional model weights from these computational models with human neural data.

    The CNN model we present in this study is a small piece in a bigger effort to employ deep learning techniques to further enhance already existing models of reading. For our revision, we plan to expand on the question of where to go from here and outline our vision on how these techniques could help us better model the phenomena the reviewer speaks of. We agree with the reviewer that there is a long way to go, and we are excited to be a part of it.

    Reviewer #3:

The paper is rather qualitative in nature. In particular, the authors show that some resemblance exists between the behavior of some layers and some parts of the brain, but it is hard to quantitatively understand how strong the resemblances are in each layer, and the exact impact of experimental settings such as the frequency balancing (which seems to have only a very moderate effect according to Figure 5).

The large focus on a qualitative evaluation of the model is intentional. The ability of the model to reproduce experimental effects (Figure 4) is a prerequisite for any subsequent quantitative metrics (such as correlation) to be valid. The introduction of frequency balancing is a good example of this. As the reviewer points out, frequency balancing during training has only a moderate impact on correlation scores and from that point of view does not seem impactful. However, when we look at the qualitative evaluation, we see that with a large vocabulary, a model without frequency balancing fails to properly distinguish between consonant strings and (pseudo)words (Figure 4, 5th row). Hence, from the point of view of being able to reproduce experimental effects, frequency balancing has a large impact.

    That said, the reviewer is right to highlight the value of quantitative analysis. An important limitation of the "traditional" models of reading that do not employ deep learning is that they operate in unrealistically simplified environments (e.g. input as predefined line segments, words of a fixed length), which makes a quantitative comparison with brain data problematic. The main benefit that deep learning brings may very well be the increase in scale that makes more direct comparisons with brain data possible. In our revision we will attempt to capitalize on this benefit more. The reviewer has provided some helpful suggestions for doing so in their recommendations.
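
    As a minimal sketch of the kind of quantitative model-brain comparison at issue, one can correlate a per-stimulus model readout with a per-stimulus MEG component amplitude; the arrays below are random stand-ins, whereas in the actual analysis they would come from the model and the MEG data:

    ```python
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    layer_readout = rng.random(100)   # stand-in: one model value per stimulus
    meg_amplitude = rng.random(100)   # stand-in: one amplitude per stimulus

    r, p = pearsonr(layer_readout, meg_amplitude)
    print(f"model-brain correlation: r={r:.2f} (p={p:.3f})")
    ```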

    The experiments only consider a rather outdated vision model (VGG).

VGG was designed to use a minimal set of operation types (convolution and pooling, fully connected linear layers, ReLU activations, and batch normalization) and to rely mostly on scale to solve the classification task. This makes VGG a good starting point for exploring how far a basic CNN can take us in explaining experimental MEG effects in visual word recognition. However, we agree with the reviewer that it is easy to envision more advanced models that could potentially explain more. For our revision, we plan to expand on the question of where to go from here and outline our vision of what types of models would be worth investigating, and how one might go about doing that in a way that provides insights beyond higher correlation values.
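
    For reference, a minimal sketch of this kind of setup using torchvision: VGG-11 with batch normalization, pretrained on ImageNet (and hence initially illiterate), with its classification head replaced for a 10,000-word vocabulary. The exact training configuration is an assumption:

    ```python
    import torch.nn as nn
    from torchvision.models import vgg11_bn, VGG11_BN_Weights

    # Load VGG-11 (with batch normalization) pretrained on ImageNet.
    model = vgg11_bn(weights=VGG11_BN_Weights.IMAGENET1K_V1)

    # Replace the final linear layer with a 10,000-way word classifier.
    vocab_size = 10_000
    model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, vocab_size)
    ```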

  2. eLife assessment

    van Vliet and colleagues show a correlation between internal states of a convolutional neural network (CNN) trained on visual word stimuli with three specific components of evoked MEG potentials during reading in humans. The findings are useful, but the current results remain incomplete, without evidence that the CNN can produce any of the phenomena that the human visual system is known to have (e.g., feedback connections, sensitivity to word frequency), or that the model has comparable performance to human behaviour (i.e., similar task accuracy with a comparable pattern of mistakes).

  3. Reviewer #1 (Public Review):

    Summary:

This study trained a CNN for visual word classification, yielding a model that can explain key functional effects of the evoked MEG response during visual word recognition and providing an explicit computational account of processing from the detection and segmentation of letter shapes to final word-form identification.

    Strengths:

This paper not only bridges an important gap in modeling visual word recognition, by establishing a direct link between computational processes and key findings in experimental neuroimaging studies, but also identifies conditions that enhance the model's biological realism.

    Weaknesses:

    The interpretation of CNN results, especially the number of layers in the final model and its relationship with the processing of visual words in the human brain, needs to be further strengthened.

  4. Reviewer #2 (Public Review):

    van Vliet and colleagues present the results of a study correlating internal states of a convolutional neural network trained on visual word stimuli with evoked MEG potentials during reading.

In this study, a standard deep learning image recognition model (VGG-11), pretrained on a large natural image set (ImageNet) and thus initially illiterate, is further trained on visual word stimuli and then used on a set of predefined stimulus images to extract strings of characters from "noisy" words, pseudowords, and real words. This methodology is used in hopes of creating a model that learns to apply the same nonlinear transforms that could be happening in different regions of the brain, which would be validated by studying the correlations between the weights of this model and neural responses. Specifically, the aim is that the model learns some vector embedding space for the different kinds of stimuli, as quantified by the spread of activations across a layer's units (L2 norm after the ReLU activation function), that creates a parameterized decision boundary similar to amplitude changes at different times in an MEG signal. More importantly, the way the stimuli are ordered or ranked in that space should be separable to the degree we see separation in neural activity. This study shows that the activation corresponding to five broad classes of stimuli statistically correlates with three specific components of the evoked response. However, I believe there are fundamental theoretical issues that limit the implications of the results of this study.
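
    As an illustration of the readout described above, one can register forward hooks on a network's ReLU modules and summarize each layer's activation pattern by its L2 norm, one value per stimulus; the function and variable names below are illustrative, not from the paper:

    ```python
    import torch

    def layer_l2_norms(model, images, layers):
        """Summarize each listed layer's output pattern by its L2 norm.

        `layers` maps a name to a module (e.g. a ReLU) inside `model`; the
        result maps each name to a tensor with one norm per input image.
        """
        norms, hooks = {}, []

        def make_hook(name):
            def hook(module, inputs, output):
                # One scalar per stimulus: the norm over all unit activations.
                norms[name] = output.flatten(start_dim=1).norm(dim=1)
            return hook

        for name, module in layers.items():
            hooks.append(module.register_forward_hook(make_hook(name)))
        with torch.no_grad():
            model(images)
        for h in hooks:
            h.remove()
        return norms
    ```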

    As has been shown over many decades, many potential computational algorithms, with varied model architectures, can perform the task of text recognition from an image. However, there is no evidence presented here that this particular algorithm has comparable performance to human behavior (i.e. similar accuracy with a comparable pattern of mistakes). This is a fundamental prerequisite before attempting to meaningfully correlate these layer activations to human neural activations. Therefore, it is unlikely that correlating these derived layer weights to neural activity provides meaningful novel insights into neural computation beyond what is seen using traditional experimental methods.

    One example of a substantial discrepancy between this model and neural activations is that, while incorporating frequency weighting into the training data is shown to slightly increase neural correlation with the model, Figure 7 shows that no layer of the model appears directly sensitive to word frequency. This is in stark contrast to the strong neural sensitivity to word frequency seen in EEG (e.g. Dambacher et al 2006 Brain Research), fMRI (e.g. Kronbichler et al 2004 NeuroImage), MEG (e.g. Huizeling et al 2021 Neurobio. Lang.), and intracranial (e.g. Woolnough et al 2022 J. Neurosci.) recordings. Figure 7 also demonstrates that the late stages of the model show a strong negative correlation with font size, whereas later stages of neural visual word processing are typically insensitive to differences in visual features, instead showing sensitivity to lexical factors.

    Another example of the mismatch between this model and the visual cortex is the lack of feedback connections in the model. Within the visual cortex, there are extensive feedback connections, with later processing stages providing recursive feedback to earlier stages. This is especially evident in reading, where feedback from lexical-level processes feeds back to letter-level processes (e.g. Heilbron et al 2020 Nature Comms.). This feedback is especially relevant for the reading of words in noisy conditions, as tested in the current manuscript, as lexical knowledge enhances letter representation in the visual cortex (the word superiority effect). This results in neural activity in multiple cortical areas varying over time, changing selectivity within a region at different measured time points (e.g. Woolnough et al 2021 Nature Human Behav.), which in the current study is simplified down to three discrete time windows, each attributed to different spatial locations.

    The presented model needs substantial further development to be able to replicate, both behaviorally and neurally, many of the well-characterized phenomena seen in human behavior and neural recordings that are fundamental hallmarks of human visual word processing. Until that point, it is unclear what novel contributions can be gleaned from correlating low-dimensional model weights from these computational models with human neural data.

  5. Reviewer #3 (Public Review):

    Summary:

The authors investigate the extent to which the responses of different layers of a vision model (VGG-11) can be linked to the cascade of responses (namely, type-I, type-II, and N400) in the human brain when reading words. To achieve maximal consistency, they add noisy activations to VGG and fine-tune it on a character recognition task. In this setup, they observe various similarities between the behavior of VGG and the brain when presented with various transformations of the words (added noise, font modification, etc.).

    Strengths:

    - The paper is well-written and well-presented.

    - The topic studied is interesting.

- The fact that the responses of the CNN to unseen experimental contrasts, such as added noise, correlate with previous results on the brain is compelling.

    Weaknesses:

- The paper is rather qualitative in nature. In particular, the authors show that some resemblance exists between the behavior of some layers and some parts of the brain, but it is hard to quantitatively understand how strong the resemblances are in each layer, and the exact impact of experimental settings such as the frequency balancing (which seems to have only a very moderate effect according to Figure 5).

    - The experiments only consider a rather outdated vision model (VGG).