When word order matters: human brains represent sentence meaning differently from large language models
Curation statements for this article:
Curated by eLife
eLife Assessment
This work provides a valuable comparison of sentence structure representations in the human brain and state-of-the-art Large Language Models (LLMs). Based on solid analysis of 7T fMRI data, it systematically identifies sentences in which LLMs underperform relative to models that explicitly code for syntactic structure. The study will be of significant interest to both cognitive neuroscientists and artificial intelligence researchers.
This article has been reviewed by eLife.
Abstract
Large language models based on the transformer architecture are now capable of producing human-like language. But do they encode and process linguistic meaning in a human-like way? Here, we address this question by analysing 7T fMRI data from 30 participants reading 108 sentences each. These sentences are carefully designed to disentangle sentence structure from word meaning, thereby testing whether transformers are able to represent aspects of sentence meaning above the word level. We found that while transformer models matched brain representations better than models that completely ignore word order, all transformer models performed poorly overall. Further, transformers were significantly inferior to models explicitly designed to encode the structural relations between words. Our results provide insight into the nature of sentence representation in the brain, highlighting the critical role of sentence structure. They also cast doubt on the claim that transformers represent sentence meaning similarly to the human brain.
Article activity feed
-
Author response:
We thank the reviewers for their insightful comments on our manuscript. Here we briefly highlight our responses to several issues raised by the reviewers and summarise the changes planned for the next draft.
Reviewer 1:
(1) The reviewer questions the rationale for averaging sentence embeddings across different models. However, we do not average embeddings across models: our method computes correlations separately for each model and then averages those correlations. We also report the correlations for each model separately in Fig. S2. We will clarify this in the revised manuscript.
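For illustration, a minimal sketch of this procedure (hypothetical Python, assuming cosine-distance RDMs and Spearman correlation; not our actual analysis pipeline):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_correlation(sentence_embeddings, brain_rdm):
    """Correlate one model's representational dissimilarity matrix (RDM)
    with the brain RDM (both in condensed, lower-triangular form)."""
    model_rdm = pdist(sentence_embeddings, metric="cosine")
    rho, _ = spearmanr(model_rdm, brain_rdm)
    return rho

# Correlations are computed separately for each model and THEN averaged;
# sentence embeddings themselves are never averaged across models.
def averaged_correlation(models, brain_rdm):
    return np.mean([rsa_correlation(emb, brain_rdm) for emb in models.values()])
```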
(2) We agree with the reviewer that including a context-free grammar model as a comparison would be informative. We will incorporate this in the revised manuscript.
(3) The reviewer raises questions about the low correlation between behavioural and brain similarities. While the behavioural judgements were made by different participants performing a different task than the neuroimaging experiment, we agree the difference is surprising and warrants more detailed consideration. We will provide additional discussion of the relationship between behavioural judgements and brain data in the revised manuscript.
(4) The reviewer suggests contrasting our models with a ‘semantic ground truth’, as in our design matrix shown in Fig. 1. While our design matrix served as the basis for constructing a set of stimuli with systematic modifications, we respectfully suggest that it should not be regarded as a ‘semantic ground truth’. In particular, sentence pairs within each category will not have the same degree of semantic similarity, since the words and context differ across sentences in a graded manner. Furthermore, while we anticipated that ‘different’ sentence pairs would be less similar than ‘swapped’ sentence pairs, and that within each of the six block diagonals the ‘modified’ or ‘substituted’ sentence pairs would be the most similar, we did not have any prediction about the magnitude of these differences. Our goal was to construct a set of sentence pairs which spanned a range of semantic similarities and allowed for dissociation between lexical similarity and overall similarity in meaning. The design matrix is not intended to represent a ‘ground truth’ to which human judgements or brain representations would be expected to conform.
(5) In the revised draft we will modify the location of Fig. 5 so that it flows better with the text.
(6) We agree that the discussion of the differences between brain regions could be expanded. We will include this in the revised version of our manuscript. The reviewer also questions our inclusion of both the simple-average and group-average RSA analyses, as they show similar results. We included both analyses in line with our preregistration, and also because we believe the fact that two distinct approaches to analysing the data yield similar results strengthens our conclusions.
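The difference between the two analyses is simply where the averaging occurs. A hypothetical sketch of that distinction (assuming condensed subject RDMs and Spearman correlation; not our actual code):

```python
import numpy as np
from scipy.stats import spearmanr

def simple_average_rsa(subject_rdms, model_rdm):
    """Correlate each subject's RDM with the model RDM, then average
    the resulting correlations across subjects."""
    return np.mean([spearmanr(s, model_rdm)[0] for s in subject_rdms])

def group_average_rsa(subject_rdms, model_rdm):
    """Average the subject RDMs into a single group RDM first, then
    correlate that group RDM with the model RDM."""
    group_rdm = np.mean(subject_rdms, axis=0)
    return spearmanr(group_rdm, model_rdm)[0]
```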
(7) We believe that the grid-like pattern in the RSA results is an important unexpected finding that warrants discussion in the main manuscript.
Reviewer 2:
(1) The reviewer argues that our stimuli do not fully control for lexical content across conditions, and that a more appropriate paradigm may be to utilise minimal pairs in which only a single variable of interest (such as sentence structure) is modified. We agree that most of our sentence pairs do not constitute minimal pairs; however, this was not our objective. Our study design aimed to synthesise traditional minimal pair approaches with more recent research paradigms using naturalistic stimuli. As such, we selected stimuli which are more complex and contain more variable features than traditional minimal pair studies, but which are also tailored to highlight differences of particular theoretical interest. Because we are interested in comparing the effects of multiple sentence elements and semantic roles, a systematic pairwise comparison of minimal pairs is not necessarily optimal. Instead, we designed our stimuli to leverage the advantage of fMRI that we can measure the brain representations corresponding to each sentence, and hence can conduct a full series of pairwise comparisons of sentence representations. Most of these comparisons will not be between minimal pairs, but we selected sentences so as to provide a range of semantic similarities (low to high), while also providing semantic contrasts of theoretical interest (such as the ‘swapped’ and ‘substituted’ sentence pairs). We do not claim this approach to be universally superior to a minimal pair approach, but we believe it provides additional insights and a new perspective on semantic representation relative to minimal pair studies. We will add detail to the revised manuscript explaining how stimuli were chosen and contrasting this approach with minimal pair designs.
(2) The reviewer notes that low RSA correlations do not imply that transformers fail to encode syntactic information. We acknowledge this in our discussion (page 10), where we also highlight that our focus is not on whether transformers encode such information, but rather on what transformer representations can tell us about how sentence structure is represented in the brain. Our results indicate that transformer embeddings do not have the same geometric properties as brain representations of sentence meaning, at least for certain types of sentences where lexical information is insufficient to determine overall meaning. The reviewer also notes that transformer embeddings are highly anisotropic; however, we adjust for this by normalising each feature, as discussed on page 14. Finally, the reviewer notes that the transformers we examine differ in architecture and training objectives. This is not critical for our study because we are not seeking to determine which architecture or training objectives are best. Our goal is simply to compare a range of approaches and see which, if any, produce sentence representations similar to those formed by the brain. In fact, our results indicate that architecture and training regime make relatively little difference for our stimuli.
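As an illustration of the normalisation step, the sketch below z-scores each embedding dimension across sentences, one standard way to prevent a few high-variance directions from dominating pairwise similarities (a hypothetical sketch; our exact procedure is described on page 14):

```python
import numpy as np

def normalise_features(embeddings):
    """Z-score each embedding dimension across sentences so that no single
    high-variance direction dominates the pairwise similarities.
    `embeddings` is an (n_sentences, n_dims) array."""
    mu = embeddings.mean(axis=0, keepdims=True)
    sigma = embeddings.std(axis=0, keepdims=True)
    return (embeddings - mu) / (sigma + 1e-8)  # epsilon guards constant dims
```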
(3) The reviewer argues that RSA correlations do not measure the extent to which a model encodes syntactic information. This is very similar to the previous point. We do not claim that our results show that transformers do not encode syntactic information. Rather, our claim is that sentence embeddings derived from transformers have different geometric properties to brain representations, and that brain representations are better described by models explicitly representing key semantic roles. From this we conclude that, at least for the sentences we present, the brain is highly sensitive to semantic roles in a way that transformer representations are not (at least to the same extent). We also respectfully disagree with the reviewer’s suggestions that sentence length and orthographic or lexical similarities may drive model correlations with brain activity. As we discuss on page 19, we explicitly control for differences in sentence length when computing correlations. Our process for constructing our sentence set also controls for lexical similarity by generating pairs of sentences with all or mostly the same words but different orderings. We did not explicitly address orthographic similarity, but this will be strongly correlated with lexical similarity.
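One standard way to control for a nuisance variable such as sentence length in RSA is a partial correlation in which a length-difference RDM is regressed out of both the model and brain RDMs. The following is a hypothetical sketch of that general technique, not our exact procedure (which is described on page 19):

```python
import numpy as np
from scipy.stats import spearmanr

def partial_rsa(model_rdm, brain_rdm, length_rdm):
    """Spearman-correlate model and brain RDMs after regressing out a
    nuisance RDM built from pairwise sentence-length differences."""
    def residualise(y, x):
        X = np.column_stack([np.ones_like(x), x])  # intercept + nuisance
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return y - X @ beta
    rho, _ = spearmanr(residualise(model_rdm, length_rdm),
                       residualise(brain_rdm, length_rdm))
    return rho
```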
Reviewer 3:
(1) The reviewer emphasises the need for nuance in our conclusions, given that some of the transformers achieve higher correlations when assessed over the full set of sentences. We agree with this comment, and will modify the discussion section in the revised manuscript to address this point. Having said that, we would like to note that one of the disadvantages of transformers as models of mind or brain representations is that they are largely ‘black boxes’ whose workings are poorly understood. One advantage of hybrid models like our simple semantic role model is that they are much easier to interpret, enabling them to be used to determine which features are most important for brain representations of sentence meaning, and what mechanisms are used to combine individual words into a full sentence. Given their relative simplicity and interpretability, we believe hybrid models have considerable value as scientific tools, even in cases where they achieve correlations comparable to transformers. We will highlight this issue more clearly in our revised manuscript.
(2) The reviewer notes that despite our existing controls, residual confounds of sentence length may remain. We agree that this is a potential issue, and will add discussion to the revised manuscript. We will also present further supplementary analyses which we believe indicate that sentence length effects do not drive our main results. In particular, the fact that our results are robust to simultaneously controlling for sentence length and the ‘minimum length effect’ (Fig. S5) indicates they are not primarily driven by sentence length effects.
(3) The reviewer notes that the method for computing similarities differs between the vector-based (mean and transformer) models and the hybrid and syntax-based models, potentially adding a further confound to our results. We agree that this is a potential limitation, and our correlations should always be understood as applying to a model paired with a similarity metric. However, we believe this is mostly unavoidable when comparing different formalisms. An alternative approach of first embedding a graph into a vector and then training an encoding model on the graph embeddings has a similar limitation: it depends not just on the graph representation, but also on the way it was embedded into a vector and the way the encoding model was trained. Arguably this process is more opaque than similarity methods, since it is unclear to what extent the graph embeddings preserve the logic and properties of a graph-based representation. Further, it is not clear whether any single method can overcome the difficulty of comparing distinct formalisms for representing semantics. The reviewer also highlights how the correlations measured for the syntax model differ greatly depending on whether the Smatch or WWLK similarity metrics are used. We believe this highlights the need for careful examination of commonly used graph similarity metrics, as has been noted in previous research. We will include additional discussion of this issue in our revised manuscript.
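The sense in which a model is always paired with a similarity metric can be made concrete: the same RSA pipeline accepts any pairwise similarity function, whether cosine over embedding vectors or a graph metric such as Smatch. A hypothetical sketch (`smatch_score` and `cosine_sim` stand in for whichever similarity functions are paired with each model):

```python
import numpy as np

def condensed_rdm(items, similarity):
    """Build a condensed dissimilarity vector from any pairwise similarity
    function, so graphs and vectors enter RSA through the same interface."""
    n = len(items)
    return np.array([1.0 - similarity(items[i], items[j])
                     for i in range(n) for j in range(i + 1, n)])

# e.g. condensed_rdm(amr_graphs, smatch_score)  # graph model + Smatch
#      condensed_rdm(embeddings, cosine_sim)    # vector model + cosine
```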
-
Reviewer #1 (Public review):
Summary:
This paper investigates whether transformer-based models can represent sentence-level semantics in a human-like way. The authors designed a set of 108 sentences specifically to dissociate lexical semantics from sentence-level information and collected 7T fMRI data from 30 participants reading these sentences. They conducted representational similarity analysis (RSA) comparing brain data with model representations, as well as with human behavioral ratings. It is found that transformer-based models match brain representations better than a static word embedding baseline, which ignores word order, but fall short of models that encode the structural relations between words. The main contributions of this paper are:
(1) The construction of a sentence set that disentangles sentence structure from word meaning.
(2) A comprehensive comparison of neural sentence representations (via fMRI), human behavior, and multiple computational models at the sentence level.
Strengths:
(1) The paper evaluates a wide variety of models, including layer-wise analysis for transformers and region-wise analysis in the human brain.
(2) The stimulus design allows precise dissociation between lexical and sentence-level semantics. The RSA-based approach is empirically sound and intuitive.
(3) The constructed sentences, along with the fMRI and behavioral data, represent a valuable resource for studying sentence representation.
Weaknesses:
(1) The rationale behind averaging sentence embeddings across multiple transformer models (with different architectures and training objectives) is unclear. These transformer-based models have different training paradigms and model architectures, which may result in misaligned semantic spaces. The averaging operation may dilute the distinct sentence representations learned by each model, potentially weakening the overall semantic encoding for sentences. Please clarify this choice or cite supporting methodology.
(2) All structure-sensitive models discussed incorporate semantics to some extent. Including a purely syntactic baseline, such as a model based on context-free grammar, would help confirm the importance of syntactic structures.
(3) In Figure 2, human behavioral judgments show weak correlations with neural data, and even fall below those of computational models, suggesting the behavioral judgments may not reflect the sentence structures in a brain-like way. This discrepancy between behavioral and neural data should be clarified, as it affects the interpretation of the results.
(4) To better contextualize model and neural performance, sentence similarity should be anchored to a notion of semantic "ground truth", such as the matrix shown in Figure 1a. Comparing this reference with human judgments, brain responses, and model similarities would help establish an upper bound.
(5) The structure of this paper is confusing. For instance, Figure 5 is cited early but appears much later. Reordering sections and figures would enhance readability.
(6) While the analysis is broad and comprehensive, it lacks depth in some respects. For instance, it remains unclear what specific insights are gained from comparing across brain regions (e.g., whole brain, language network, and other subregions). Similarly, the results of simple-average and group-average RSA appear quite similar and may not advance the interpretation.
(7) While explaining the grid-like pattern due to sentence length is important, this part feels somewhat disconnected from the central question of this paper (word order). It might be better placed in supplementary material.
-
Reviewer #2 (Public review):
Summary:
The paper used fMRI data recorded while participants read a set of sentences. The sentences are designed to disentangle syntax from meaning. RSA was performed using voxel activations and a variety of language models. The results show that transformers are inferior to models with explicit syntactic representation in terms of matching brain representations.
Strengths:
(1) The study controls for some variables that allow for an investigation of sentence structure in the brain. This controlled setting has an advantage over naturalistic stimuli in targeting more specific linguistic phenomena.
(2) The study combines fMRI data with behavioral similarity ratings and a variety of language models (static, transformers, graph-based models).
Weaknesses:
(1) The stimuli are not fully controlled for lexical content across conditions. Residual lexical differences between sentences could still influence both brain and model similarity patterns. To more cleanly isolate syntactic effects, it would be useful to systematically vary only a single structural element while keeping all other lexical content constant (e.g., the boy kicked the ball / the ball kicked the boy). It would be better to engage more with the minimal pair paradigm, which is widely used in large language model probing research.
(2) The comparisons are done across fundamentally different model types, including static embeddings, graph-based parsers, and transformers. The inherent differences in dimensionality and training objectives might make the conclusion drawn from RSA inconclusive. Transformer embeddings typically occupy much higher-dimensional, anisotropic representational spaces, and their similarity structure may reflect richer, more heterogeneous information than models explicitly encoding semantic roles. A lower RSA correlation in this study does not necessarily imply that transformers fail to encode syntactic information; rather, they may represent additional aspects of meaning or context that diverge from the narrow structural contrasts probed here.
(3) The interpretation of the RSA correlation largely depends on the understanding of models. The authors suggest that because hybrid models correlate better than transformers, this implies that transformers are inferior at representing syntax. However, this is not a direct test of syntactic ability. Transformers may encode syntactic information, but it may not be expressed in a way that aligns with the RSA paradigm or the chosen stimuli. RSA does not reveal what the model encodes, and the models might achieve a good correlation for non-syntactic reasons (e.g., length of sentence, orthographic similarity, lexical features).
-
Reviewer #3 (Public review):
Summary:
Large Language Models have revolutionized Artificial Intelligence and can now match or surpass human language abilities on many tasks. This has fueled interest in cognitive neuroscience in exposing representational similarities between Language Models and brain recordings of language comprehension. The current study breaks from this mold by: (1) Systematically identifying sentence structures for which brain and Large Language Model representations diverge. (2) Demonstrating that brain representations for these sentences can be better accounted for by a model structured by the semantic roles of words in the sentence. As such, the study may now fuel interest in characterizing how Large Language Models and brain representations differ, which may prompt new, more brain-like language models.
Strengths:
(1) This study presents a bold and solid challenge to a literature trend that has touted similarities between Transformer models and human cognition based on representational correlations with brain activity. This challenge is substantiated by identifying sentences for which brain and model representations of sentences diverge and explaining those divergences using models structured by semantic roles/syntax.
(2) This study conducts a rigorous pre-registered analysis of a comprehensive selection of the state-of-the-art Large Language Models, on a controlled sentence comprehension fMRI dataset. The analysis is conducted within a Representation Similarity framework to support similarity comparisons between graph structures and brain activity without needing to vectorize graphs. Transformer models are predicted and shown to diverge from brain representations on subsets of sentences with similar word-level content but different sentence structures.
(3) The study introduces a 7T fMRI sentence comprehension dataset and accompanying human sentence similarity ratings, which may be a fruitful resource for developing more human-like language models. Unlike other model-based sentence datasets, the relation between grammatical structure and word-level content is controlled, and subsets of sentences for which models and brains diverge are identified.
Weaknesses:
(1) The interpretation of findings is nuanced. Although Transformers underperform as brain models on the critical subsets of controlled sentences, a Transformer outperforms all other models when evaluated on the union of all sentences when both word-level content and structure vary. Transformers also yield equivalent or better models of human behavioral data. Thus, although Transformers have demonstrable flaws as human models, which are pinpointed here, in the general case, (some) Transformers are more human-like than the other models considered.
(2) There may be confounds between the critical sentence structure manipulations and visual representations of sentence stimuli. This is inconvenient because activation in brain regions that process semantics tends to partially correlate with visual cortex representations, and computational models tend to reflect the number of words/tokens/elements in sentences. Although the study commendably controls for confounds associated with sentence length, there could still be residual effects that remain. For instance, the Graph model correlates most strongly with the visual cortex despite these sentence length controls.
Sentence similarity computations are emphasized as the basis for unifying comparative analyses of graph structures and vector data. A strength of this approach is its flexibility, since correlation is not always the ideal similarity metric. However, a weakness is that similarity computations are not unified across models. This has practical consequences here because different similarity metrics applied to the same model produce positive or negative correlations with brain data.