Meaning-based guidance of attention in rhesus monkeys during naturalistic scene viewing
Curation statements for this article:-
Curated by eLife
eLife Assessment
This valuable study shows that macaque monkeys preferentially fixate regions in natural scenes that are classified as "meaningful" by a computational model - an earlier model that was developed to identify locations that are semantically informative to humans - suggesting that overt attention to structured visual content is shared across primates. However, support is incomplete for the stronger claim that macaques are guided by semantic meaning, which is confounded by lower-level visual features that co-vary with it and by methodological limitations that complicate interpretation. If the semantic interpretation were more reliably established, the significance of the findings would increase, as they would connect the human cognitive process of scene understanding to neural circuit mechanisms accessible in non-human primates.
Abstract
In humans and other primates, high-acuity vision is restricted to the fovea, requiring frequent saccadic eye movements to sample visual information, a process known as overt attention. Classical visual salience theory explains how low-level image features guide these movements, but recent human studies have shown that overt attention is also strongly guided by scene meaning—the spatial distribution of semantic informativeness. Whether this form of attentional guidance is uniquely human or shared across primate vision remains unknown. Here, we addressed this question by recording eye movements from two rhesus macaques freely viewing naturalistic indoor scenes. Fixation selection was modeled using meaning maps alongside image-based salience maps and center proximity. In both monkeys, meaning robustly predicted fixation selection after controlling for visual salience and center bias. Moreover, high-meaning regions captured attention independently of visual salience, whereas salience played an increasingly important role as meaning decreased. While this prioritization of meaningful regions remained robust across environments, familiarity broadened visual exploration by increasing the likelihood of fixating less meaningful areas. Finally, the influence of meaning on fixation selection strengthened with attentional engagement. These findings suggest that meaning-based attention is an evolutionarily conserved component of primate vision and establish a behavioral foundation for investigating its neural mechanisms.
Article activity feed
-
Reviewer #1 (Public review):
Summary:
The manuscript examines whether scene meaning guides overt attention in rhesus macaques. Two monkeys freely viewed naturalistic indoor scenes, including laboratory or housing scenes described as familiar and other indoor scenes described as unfamiliar. The authors compare fixation locations with matched non-fixated control locations using predictors derived from center proximity, image salience, and a DeepMeaning model intended to capture the spatial distribution of semantic informativeness. They report that meaning predicts fixation selection beyond salience and center bias, that meaning and salience interact, that familiar scenes produce broader exploration of low-meaning regions, and that the influence of meaning increases with attentional engagement.
Strengths:
A major strength of the study is its use of natural free-viewing behavior in macaques. The experimental approach takes advantage of intrinsic gaze allocation rather than relying on a more artificial task, which makes the work a useful bridge between human scene-viewing studies and future neurophysiological studies in nonhuman primates.
The statistical analyses are extensive. The authors model fixated and matched non-fixated samples with Bayesian generalized linear mixed models, include center proximity and salience as important controls, examine interactions among predictors, and report diagnostics for multicollinearity and model convergence. These analyses support the basic observation that the human-derived meaning maps are associated with macaque fixation allocation beyond the particular center and salience terms included in the model.
The question is interesting and timely. If meaning-like scene structure can be operationalized for macaque viewing, this would provide a useful behavioral foundation for future work on the neural mechanisms that link scene analysis, gaze allocation, and natural behavior.
Weaknesses:
The main weakness is interpretive. The manuscript often treats the DeepMeaning map as though it measures scene meaning for the monkey, but the map is ultimately human-derived. Some of the examples make this issue especially salient: regions such as clocks, phones, dining tables, or other human artifacts may be meaningful to human observers, but it is not clear that they have semantic meaning for macaques. If meaning-based guidance is argued to emerge through experience, then unfamiliar human indoor scenes that the monkeys have never encountered cannot straightforwardly be meaningful to them in the same sense that they are meaningful to humans. Predictive success for these scenes may therefore indicate sensitivity to visual or object-level structure correlated with human-rated meaning, rather than macaque semantic understanding.
A related concern is that the DeepMeaning predictor may capture forms of visual salience, objectness, or high-level image structure not captured by the particular low-level salience model. For example, a clock or phone may attract gaze because of shape, contrast, face-like configuration, object boundaries, or other mid-level features rather than because it carries semantic meaning for a macaque. The present analyses show that this model is predictive, but they do not by themselves establish that the predictive variable is semantic meaning rather than visual structure beyond Itti-Koch-style salience.
The manuscript relies heavily on fitted model parameters and derived maps, with relatively little return to the raw behavioral data. The main claims would be easier to evaluate if the authors showed more direct fixation-density maps, scene-by-scene examples, and aggregate raw relationships between fixation behavior and map values. At present, much of the argument rests on interpreting fitted coefficients, without enough behavioral visualization to show what the monkeys actually did across the stimulus set.
It is also unclear whether model performance was evaluated on held-out data. The comparison to repeated viewing of the same images is useful as a behavioral benchmark, but a second viewing may itself be affected by familiarity or memory for the image. This makes it a potentially imperfect estimate of a noise ceiling for first-pass fixation predictability. Cross-validation or held-out prediction, ideally across held-out images as well as trials, would make the predictive claims more convincing.
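The held-out evaluation suggested here could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' pipeline: the data are simulated, the plain logistic regression stands in for the Bayesian GLMM, and the names `grouped_folds`, `fit_logistic`, `predict_proba`, and `auc` are hypothetical helpers. The point it shows is that folds are formed over scene identity, so every test sample comes from an image the model never saw during fitting.

```python
import numpy as np

def grouped_folds(groups, n_folds, seed=0):
    """Yield (train_idx, test_idx) so that no scene appears in both sets."""
    rng = np.random.default_rng(seed)
    unique = rng.permutation(np.unique(groups))
    for held_out in np.array_split(unique, n_folds):
        test_mask = np.isin(groups, held_out)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

def fit_logistic(X, y, n_iter=500, lr=0.1):
    """Gradient-descent logistic regression; an intercept column is added here."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    """Predicted probability that a sample is a fixated location."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

def auc(y_true, scores):
    """Probability that a fixated sample outscores a control sample (ties count half)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Simulated stand-in data: 40 scenes, three z-scored predictors
# (center proximity, salience, meaning). The coefficients are arbitrary.
rng = np.random.default_rng(1)
n = 2000
scene_id = rng.integers(0, 40, size=n)
X = rng.normal(size=(n, 3))
logit = 0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.8 * X[:, 2]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

held_out_auc = []
for train, test in grouped_folds(scene_id, n_folds=5):
    w = fit_logistic(X[train], y[train])
    held_out_auc.append(auc(y[test], predict_proba(w, X[test])))

print(np.round(held_out_auc, 3))
```

Splitting by scene rather than by trial matters because fixations within one image share its layout; a trial-wise split would let image-specific structure leak from training into test data and flatter the predictive claim.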
Although the authors describe multicollinearity as negligible, Figure S2B-C appears to show some nontrivial correlations among predictors. These correlations may matter for interpretation even if variance inflation factors fall below conventional thresholds, especially when the signs of fitted effects point in directions that may be expected from the input correlations, such as relationships involving meaning and familiarity. The manuscript would benefit from reporting these correlations quantitatively and relating them to the fitted effects.
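A small sketch of the quantitative report requested here, using simulated predictors rather than the study's data. The 0.5 correlation between salience and meaning is an arbitrary assumption chosen for illustration, and `vif` is a hypothetical helper. It relies on the standard identity that the variance inflation factors are the diagonal of the inverse of the predictor correlation matrix.

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Simulated predictors (not the study's values): meaning is built to
# correlate with salience at roughly r = 0.5; center is independent.
rng = np.random.default_rng(0)
n = 5000
salience = rng.normal(size=n)
meaning = 0.5 * salience + np.sqrt(1.0 - 0.25) * rng.normal(size=n)
center = rng.normal(size=n)

X = np.column_stack([center, salience, meaning])
corr = np.corrcoef(X, rowvar=False)
vifs = vif(X)

print(np.round(corr, 2))   # pairwise correlations among predictors
print(np.round(vifs, 2))   # one VIF per predictor
```

In this simulation a correlation near 0.5 yields VIFs of about 1.3, well under conventional cutoffs, which is why low VIFs alone do not show that predictor correlations are irrelevant to interpreting the fitted signs.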
The familiarity analysis is interesting but would benefit from further control. Familiar scenes are photographs of the monkeys' housing and laboratory environments, whereas unfamiliar scenes are other indoor environments. These categories may differ not only in familiarity but also in clutter, spatial layout, object density, color distribution, luminance, contrast, edge density, texture statistics, or the distributions of salience and meaning values. Without additional characterization of the image sets, the conclusion that familiarity itself broadens exploration should be treated cautiously.
The engagement effects also appear less consistent across the two monkeys than some of the summary language suggests. The monkey-specific results should be emphasized, and claims about engagement strengthening meaning-based guidance should be stated in proportion to the cross-animal evidence.
Finally, the manuscript sometimes uses language that sounds more mechanistic than the behavioral data can support. The negative interaction between meaning and salience is an interesting result, but terms such as competitive integration in a shared priority map go beyond what can be concluded from overt fixation selection alone. The study lacks a causal or perturbational manipulation, such as image inversion or another transformation that preserves local features while altering semantic organization. The result would be clearer if described first as a model-based association or subadditive interaction in gaze allocation, with the priority-map interpretation presented as a plausible account rather than a direct conclusion.
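The subadditive description can be made concrete with a short worked equation. This is a sketch under an assumed model form, a logistic model with a single meaning-by-salience product term; the symbols $C$, $S$, $M$ (center proximity, salience, meaning) and the coefficient names are illustrative and are not taken from the manuscript.

```latex
\begin{align}
\operatorname{logit} P(\text{fixated}) &= \beta_0 + \beta_C C + \beta_S S + \beta_M M + \beta_{MS}\, M S \\
\frac{\partial \operatorname{logit} P(\text{fixated})}{\partial S} &= \beta_S + \beta_{MS}\, M
\end{align}
```

With $\beta_{MS} < 0$, the slope for salience shrinks as $M$ grows, which matches the reported pattern of salience mattering more in low-meaning regions. The equation describes gaze allocation statistically and says nothing about whether a shared priority map produces it.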
-
Reviewer #2 (Public review):
Summary:
In prior work, the authors developed an ML algorithm that computes spatial maps of "meaning": image regions that are likely to be given semantic labels by human observers. They also previously showed that "meaning" predicts fixations in humans and human infants. Here, these observations were extended to macaque monkeys, testing the hypothesis that meaning is a phylogenetically preserved driver of overt attention across primates.
Strengths:
The paper reports that fixated locations had higher values of meaning compared to nearby, non-fixated locations. Specifically, it shows that meaning values - as inferred from a neural network model - are useful in differentiating these two classes of locations, beyond the established effects of image salience and centrality on gaze. The reported results were consistent in both monkeys.
Weaknesses:
It is difficult to understand from this paper what, precisely, is meant by "meaning", although the prior work from this group may offer some insight. It is therefore not clear whether "high-meaning" image locations tend to be objects, for example, or faces, or other such behaviorally relevant image features. Indeed, the utility of the meaning maps was not evaluated against other algorithms that consider more complex natural scene information. This is a particular concern as the paper does not demonstrate that meaning predicts where the viewer will look within the image; instead, it shows that meaning is one of the variables that differentiates fixated locations from nearby non-fixated locations. Because this is, by necessity, not a causal study, caution is also needed in interpreting the results. In our view, the most parsimonious interpretation may not be that meaning guides gaze in monkeys, but instead that people tend to name things that primate brains evolved to fixate on at the expense of neighboring locations.
-
Reviewer #3 (Public review):
Summary:
This novel study asks whether meaning-based guidance of overt attention, well-established in humans through the "meaning map" framework, extends to non-human primates. The authors recorded eye movements from two rhesus macaques freely viewing naturalistic indoor scenes and modeled fixation selection using DeepMeaning maps, Itti-Koch salience maps, and center proximity. They report that scene meaning robustly predicts fixation selection after controlling for salience and center bias, that meaning and salience interact competitively rather than additively, and that the influence of meaning is modulated by scene familiarity and attentional engagement. The cross-species extension of the meaning map approach is a valuable contribution, and the Bayesian GLMM framework with variance partitioning is well-suited to the question.
Strengths:
(1) The cross-species extension itself is novel and well-motivated. Nobody has applied the meaning map framework to NHP gaze behavior before. Even with the interpretive caveats I raise below, creating this methodological bridge between human scene perception research and NHP circuit neuroscience is a valuable contribution.
(2) The statistical framework is strong. The Bayesian GLMM with posterior distributions, HDIs, and probability of direction is more informative than frequentist alternatives. The variance partitioning with ΔR² is the right approach for disentangling predictor contributions. Random intercepts for scene are appropriate. The convergence diagnostics (R-hat = 1.00, ESS > 8000 across all models) are exemplary.
(3) Transparent individual-subject reporting. With N = 2, reporting each monkey separately rather than pooling or averaging is the correct choice, and the authors do this consistently. The individual differences are visible because the reporting is honest.
(4) The experimental design is excellent. 200 scenes is a substantial stimulus set by NHP standards. The inclusion of both familiar and unfamiliar environments, the repeated-viewing design for reliability estimation, and the 5-second free viewing window that yields ~15 fixations per trial all reflect thoughtful design.
(5) The familiarity and engagement analyses go beyond the basic demonstration. Even with the limitations we identified, asking how behavioral context modulates the meaning-gaze relationship is more ambitious than simply showing that the correlation exists. These analyses generate testable predictions for future work.
(6) Data and code sharing commitment. The authors plan to release raw data, preprocessing, and analysis code on OSF and GitHub.
Weaknesses:
(1) The authors' central claim is that meaning-based attentional guidance is an "evolutionarily conserved component of primate vision." This claim rests on the finding that macaque fixation patterns correlate with DeepMeaning maps. However, DeepMeaning is trained on human ratings of local scene meaning using a vision-language transformer (CoCa) pretrained on billions of human image-text pairs. What the model captures, then, is the spatial distribution of visual structure that humans judge to be semantically informative. The authors acknowledge that DeepMeaning represents "structured visual representations of scene regions containing identifiable objects and informative relationships" (lines 261-262), but this acknowledgment actually highlights the problem: regions containing identifiable objects and informative spatial relationships would plausibly attract fixations in any visual system with object-selective neurons and a bias toward structured content, regardless of whether the observer is processing "meaning" in any semantic sense. That is, the correlation between macaque gaze and DeepMeaning maps is consistent with shared object-level visual processing, but doesn't uniquely implicate shared semantic processing. The critical adversarial test from Hayes & Henderson (2022a), in which meaning maps detected the removal of semantic content via diffeomorphic scrambling while deep saliency models did not, has not been applied to macaque viewing behavior. Importantly, such a test would require new data collection (showing monkeys scrambled scenes), which may not be feasible. A more tractable approach with the existing data would be to compare DeepMeaning against some other model that captures mid-level visual structure without semantic supervision, though this would be a weaker test.
Given these constraints, I would ask the authors to (a) acknowledge this limitation explicitly and temper the evolutionary conservation claim accordingly (for example, by framing the result as evidence that macaques and humans share attentional biases toward visually structured scene regions, with the semantic interpretation remaining an open question) and (b) note the diffeomorphic scrambling experiment as an important future direction for establishing whether macaque attention is guided by semantic content per se.
(2) The familiar/unfamiliar scene comparison confounds long-term familiarity with systematic differences in scene content. Familiar scenes are photographs of the vivarium and laboratory; unfamiliar scenes are restaurants, bedrooms, kitchens, and offices. These two categories almost certainly differ in visual complexity, object density, spatial layout, clutter, and the types of objects present. The familiar environments (vivarium caging, lab equipment) are likely more spatially repetitive and lower in object diversity than, say, a restaurant or residential kitchen. Any difference attributed to "familiarity" could therefore reflect these systematic content differences. The negative interaction between meaning and familiarity (Monkey V: β = −0.19; Monkey I: β = −0.19), which the authors interpret as familiarity broadening exploration, could instead reflect the fact that vivarium/lab scenes have a different distribution of meaning values or a different relationship between meaning and salience than human domestic environments. The authors should address this confound directly. At minimum, comparing the distributions of meaning and salience values across the two scene categories would help the reader evaluate whether the familiarity effect can be separated from content effects. Ideally, the authors would include a subset analysis using only scenes matched on feature distributions or include scene-level summary statistics of the meaning and salience maps as covariates in the familiarity model.
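The minimum check described above, comparing the distributions of map values across the two scene categories, might be sketched as follows. Everything here is an assumption for illustration: the per-scene values are simulated from arbitrary beta distributions, the unit of analysis (one mean meaning value per scene) is one choice among several, and `ks_statistic` and `permutation_p` are hypothetical helpers written with NumPy only.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return np.max(np.abs(cdf_a - cdf_b))

def permutation_p(a, b, n_perm=2000, seed=0):
    """Permutation p-value for the KS statistic, shuffling category labels."""
    rng = np.random.default_rng(seed)
    observed = ks_statistic(a, b)
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)
        if ks_statistic(shuffled[:a.size], shuffled[a.size:]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

# Simulated per-scene mean meaning values (placeholders, not the study's data);
# the group sizes of 100 are arbitrary.
rng = np.random.default_rng(2)
familiar = rng.beta(2.0, 5.0, size=100)
unfamiliar = rng.beta(3.0, 4.0, size=100)

d = ks_statistic(familiar, unfamiliar)
p = permutation_p(familiar, unfamiliar)
print(f"KS statistic = {d:.3f}, permutation p = {p:.4f}")
```

The same comparison could be repeated for salience maps and for the per-scene correlation between meaning and salience. A clear difference on any of these would indicate that the familiarity contrast is not separable from content differences without matching or covariates.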
-