A model of egocentric to allocentric understanding in mammalian brains
Curation statements for this article:
Curated by eLife
Evaluation Summary:
This paper presents an artificial neural network which, from action and visual inputs, develops representations of space comparable to those found in the navigational system of the brain. The authors show that the representations developed by this network can be used in novel environments and in a reinforcement learning task. This demonstration of representations in absolute coordinates derived from agent-centred information is a significant contribution to neuroscience as well as to machine learning.
(This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. The reviewers remained anonymous to the authors.)
This article has been reviewed by the following groups
Listed in
- Evaluated articles (eLife)
- Neuroscience (eLife)
Abstract
In the mammalian brain, allocentric representations support efficient self-location and flexible navigation. A number of distinct populations of these spatial responses have been identified, but no unified function has been shown to account for their emergence. Here we developed a network, trained with a simple predictive objective, that was capable of mapping egocentric information into an allocentric spatial reference frame. The prediction of visual inputs was sufficient to drive the appearance of spatial representations resembling those observed in rodents: head direction, boundary vector, and place cells, along with the recently discovered egocentric boundary cells, suggesting predictive coding as a principle for their emergence in animals. Strikingly, the network learned a solution for head direction tracking and stabilisation convergent with known biological connectivity. Moreover, like mammalian representations, responses were robust to environmental manipulations, including exposure to novel settings. In contrast to existing reinforcement learning approaches, agents equipped with this network were able to flexibly reuse learnt behaviours, adapting rapidly to unfamiliar environments. Thus, our results indicate that these representations, derived from a simple egocentric predictive framework, form an efficient basis set for cognitive mapping.
Article activity feed
-
Reviewer #1 (Public Review):
This article presents an interesting proof of concept of how predictive coding based on visual inputs, coupled with a complex array of RNNs, can produce head direction and egocentric/allocentric boundary responses akin (to some extent) to the neural responses found in mammalian hippocampal-formation neurons. However, while an impressive technical feat, the model contradicts key experimental findings, and the developmental timeline of spatial cell responses does not support the sole reliance on visual inputs.
Developmental considerations:
Developmental studies have shown that rudimentary HD signals emerge before the unfusing of rat pup eyelids (suggesting visual inputs are not necessary for these initial responses). Head direction is fully formed more than a week before allocentric boundary responses (boundary vector cells: BVCs) emerge, and initial head direction signals predate BVC coding by 5-6 weeks in rat pups (Min, Wills, Cacucci 2017). While this does not exclude separate emergence per se, it removes the need to insist on the absence of HD inputs for the formation of BVCs. Hence it is plausible that at least HD would be available to any learning/developmental process that enables the emergence of allocentric boundary responses in the rodent brain. Not using this information would make this learning task unnecessarily difficult. Similarly, it is more plausible that egocentric boundary responses are constructed by a network that forms in a developmental time window and can later instantiate boundary responses based on visual inputs, depth perception, etc., without additional learning. The staggered emergence of spatial responses reported by experiments also strongly suggests that developmental stages build upon each other. Granted, it remains a possibility that egocentric boundary responses and head direction coding could be generated by a predictive coding framework (though the principal inputs to HD are vestibular), but there is no need to assume the head direction signal and the egocentric boundary signal wouldn't be used in a subsequent learning/maturation step that forms allocentric boundary responses from head direction and possibly egocentric boundary inputs.
While I share the authors' sentiment that those previous models have not sufficiently accounted for this learning step (generating the network that allows egocentric signals to be translated to allocentric boundary representations), the appeal to a staggered developmental process is much more plausible and in line with developmental data.
The parallel emergence of distinct spatial responses:
In the paper, very little is actually said about the interaction of the learned representations in the model. Since the different RNNs learn in parallel, one yielding HD, one egocentric boundary cells (EBCs), and one BVCs, this means that EBCs and BVCs do not need to interact. Similarly, HD cells and BVCs do not need to interact. This incredibly salient prediction is not emphasised at all. It would suggest the EBCs could be lesioned without affecting BVCs in a novel environment. In alternative models, this is not the case. Such strong claims should be emphasised, as this is actually one of the few novel, direct experimental predictions that can be made here. Whether or not EBCs and BVCs interact is an open, empirical question. However, taking this line of reasoning further, the present model also predicts that lesioning HD cells should leave EBCs and BVCs unperturbed. This is extremely unlikely for BVCs. Lesions to the mammillary bodies (where HD cells are found and where the HD attractor signal is likely generated) lead to severe memory deficits. The orientation of BVCs and place cells is likely set by head direction cells. The three populations have repeatedly been shown to rotate in concert. Object vector cells (not addressed in this article) similarly co-rotate with HD cells. The article does not present sufficient evidence (or gains in understanding) to abandon this well-established view.
Relating the model to biological function:
The normative account of the paper is interesting, but it is unclear how much (if anything) the model tells us about the biological underpinnings of spatial cognition, despite the overt claim that the model would be useful to neuroscientists. The modelling approach is far from biologically plausible. This creates the unfortunate impression that a bunch of RNNs has been thrown together (with considerable technical skill), which are known to be able to extract the information inherent in the inputs. What does this tell us about how the brain generates these responses, and how can experimenters test for properties specific to the model? To provide a normative model that outlines one way for the appearance of known mammalian spatial representations based solely on interaction with the sensory world is fine (and interesting in itself), but the method employed is so far from real biological function that it is impossible to assess whether it is the correct normative explanation (see also next point).
Experimental contradictions:
BVC activity emerges immediately upon entry into a new environment, while the present model needs to be retrained on sets of environmental geometries to be able to respond correctly in all those environments. This discrepancy cannot be remedied by appealing to the theoretical notion that an animal might experience all possible geometries during some developmental phase. Given the developmental timeline of spatial responses and the fact that rat pups do not leave their nest straight away, this can in all likelihood be excluded. Competing models claim that EBC responses are computed directly from perceptual inputs (utilising networks formed in development), with the consequence that EBCs (and hence BVCs driven by EBCs) can straightforwardly represent any new geometry without additional learning. This would be consistent with BVC activity emerging immediately in a new environment, even when faced with a never-before-experienced environmental geometry.
Reviewer #2 (Public Review):
The authors wish to investigate how various allocentric representations, such as those observed in the brain's navigational system, can emerge from the interaction between action and sensory inputs. They use a predictive architecture, in which visual inputs are predicted from actions, to explain the emergence of multiple allocentric representations (HD cells, place cells, boundary vector cells). The major strength of the paper is the demonstration of the network's ability to develop spatial representations of multiple virtual environments and the demonstration that such representations can be used as a foundation to quickly represent new environments and to support further reinforcement learning tasks. However, the analysis is not yet sufficient to support a number of claims made in the paper about critical pieces of the findings. Further, two critical aspects of the model, namely the correction step, and the RNN-3 memory store, are not adequately described, rely on decisions that are not adequately justified, and their properties/significance are not adequately investigated. Thus, while the authors did demonstrate the emergence of spatial representation and the utility of their model, their presentation did not adequately support their conclusions. With significant revisions to the text and additional experiments/analysis, this work will have a significant impact on the field, and their model will be of further use to the community.
My major concern is that two critical aspects of the model, namely the correction step, and the RNN-3 Memory store, are not adequately described, rely on decisions that are not adequately justified, and their properties/significance are not adequately investigated, as discussed below.
Correction step
- In the results, the correction step is minimally described. However, the method is fairly involved. For example, lines 81-82 state that "visual information being communicated only by the activation of slots in the memory stores (Fig 1B)". Similar descriptions are given in lines 102-103 and 125-126. However, the nature of these predictions is not stated in the results or well-diagrammed in Figure 1B. It might help to specify, for example in the figure legend, that further details about this step are provided in Supplementary figure 1. As this is a crucial piece of the model, I recommend that at least a few more sentences be given to this step in the results, which outlines the high-level details of the correction step.
- In the methods, the description of the correction step is inadequate; it is given simply as G(x,x). While this may be appropriate for a machine learning conference proceeding, it's not appropriate for a general journal. The authors should include equations that specify G (as well as F), which could be included in the section "Sigmoid-LSTM and Sigmoid-Vanilla". Further, the authors might want to justify the need for an entirely new RNN cell, rather than another input to the existing RNN. In lines 318-319: "each x~ can be thought of as the result of a weighted reactivation of the RNN memory embeddings by the current visual input." It might be useful to explain the correction code as: "the expected RNN activation given the current visual input's activation of the memory cells".
- Lines 125-126 state that: "RNN-3 received no self-motion inputs, thus being dependent on temporal coherence, and corrections from mispredictions as its sole input". It's unclear why the corrections to this RNN are generated from "mispredictions", and not just visual "corrections", as in the other RNNs. Further, nothing in the implementation of the correction step enforces that it gives "corrections", only that it learns to incorporate information from the current visual input, via the memory store, into the action of the RNNs. The corrections are just occasional information that the network learns to use to update the RNN state as best as possible. While this is presented as a correction, it's unclear what this RNN actually does. Does it learn to simply replace the existing x with what it should be from the memory store (i.e. a correction)? Or does it combine information from x^hat and x^tilde in some complicated way? To understand this, I recommend the authors compare x^hat, x^tilde, and x during the correction step.
- Finally, the authors state that (Line numbers missing), "to correct for the accumulation of integration errors, the RNNs must incorporate positional and directional information from upstream visual inputs as well. This correction step should not be performed at every time step, or the integration of velocities would be unnecessary; in our experiments, it was performed at random timesteps with probability Pcorrection = 0.1." This entails a claim that for Pcorrection=0, errors will accumulate, while for Pcorrection=1, the integration of velocities will be "unnecessary". While this makes intuitive sense, no empirical justification for these claims is shown, and their implications for the model's function and representation are not demonstrated. I would suggest that the authors compare a range of Pcorrection values, for example, p=[1, 0.3, 0.1, 0.03, 0.01], and demonstrate how the network performance and spatial representation vary as a function of Pcorrection. Finally, though less important, it's unclear why this correction is probabilistic. This decision could be justified, e.g. with an experiment comparing the results of probabilistic versus deterministic/periodic corrections.
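The suggested Pcorrection sweep is simple to prototype. Below is a purely illustrative toy sketch (the 1-D integrator, the noise model, and the idealised correction are my assumptions, not the authors' implementation) showing why error should accumulate as Pcorrection approaches 0 and vanish as it approaches 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_trajectory(p_correction, n_steps=1000, sigma_noise=0.05):
    """Toy 1-D path integrator: velocities are integrated with Gaussian
    noise, and at random timesteps (probability p_correction) the estimate
    is reset by an idealised 'visual correction'."""
    true_pos, est_pos = 0.0, 0.0
    errors = []
    for _ in range(n_steps):
        v = rng.normal(0.0, 1.0)                      # self-motion input
        true_pos += v
        est_pos += v + rng.normal(0.0, sigma_noise)   # noisy integration
        if rng.random() < p_correction:               # random correction step
            est_pos = true_pos
        errors.append(abs(est_pos - true_pos))
    return float(np.mean(errors))

# Reviewer-suggested sweep: error grows as p_correction approaches 0
for p in [1.0, 0.3, 0.1, 0.03, 0.01, 0.0]:
    print(p, run_trajectory(p))
```

In this toy version the mean error is zero at Pcorrection = 1 and grows as a random walk at Pcorrection = 0; the interesting empirical question is where the real network's performance and spatial tuning change along this curve.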
RNN-3 and Memory store
This seems like a key feature of the model, yet its implementation gets very little attention in the results, and the description is conflicting and difficult to understand.
- Line 142 states that "the allocentric representations of RNN-3 were stored in the external memory slots as a second set of targets - being reactivated at each time step by comparison to the current state of the RNN-3". However, it's unclear what's meant by a "second" set of targets, or why this is unique to RNN-3. From the text, it seems that this could either refer to m(x)_3 (the memory map corresponding to RNN-3), or s (the slots). However, from my interpretation of the methods as written, the m(x) parameters are learned, and s are activated by the joint activity of all three RNNs, not just RNN-3 (Equation 4). Why is this written as if it's a separate group of slots unique to RNN-3?
- Further, how is the activity of the memory slots assessed? While I can imagine (though it is not spelled out in the methods) how the tuning curves of RNN-1-3 are calculated, because of the confusion over what this set of targets refers to, I don't know how, e.g., Figure 2E was calculated. I recommend this be included in the methods. Importantly, I recommend the authors expand the description of RNN-3 and its associated memory store in the results, and clarify its description in the methods section.
- Lines 320 and 322 state that the memory store contents corresponding to the RNNs, m(x), are optimized parameters, while those corresponding to upstream inputs, m(y), are not. However, Line 325 states that all contents are chosen and assigned (m(y), m(x)) := (y, x). These two descriptions appear to conflict and should be reconciled.
- Finally, no justification was given as to why RNN-3 was added. The authors justify the addition of RNN-2 by stating that "a single RNN receiving all the velocity inputs did not develop the whole range of representations" (Line 101). However, no justification is given for a third RNN that receives no input. As this is a key piece of the results, justifying and understanding its contribution is critical. Does this affect predictive performance, the ability to generalize to new environments, or utility for RL, or is it simply adding a representational similarity to hippocampal place fields and egoBVCs? I recommend that the authors show the results of a network with only RNN-1 and RNN-2, to justify the addition of RNN-3 and demonstrate its utility for prediction.
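For concreteness, the memory-store readout described in the methods (slots written as (m(y), m(x)) := (y, x), then reactivated by the current visual input to produce x^tilde, "the expected RNN activation given the current visual input's activation of the memory cells") can be sketched as a similarity-weighted lookup. The dot-product similarity and softmax normalisation below are assumptions on my part; the paper's Equation 4 may differ:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def store(memory, y, x):
    """Write one slot: (m(y), m(x)) := (y, x), per the methods (Line 325)."""
    memory.append((np.asarray(y, float), np.asarray(x, float)))

def reactivate(memory, y_t):
    """Weighted reactivation: slots are weighted by the similarity of the
    current visual input y_t to the stored visual keys m(y); the weighted
    sum of the stored RNN embeddings m(x) gives x_tilde."""
    keys = np.stack([m_y for m_y, _ in memory])
    vals = np.stack([m_x for _, m_x in memory])
    weights = softmax(keys @ np.asarray(y_t, float))  # slot activations s
    return weights @ vals                             # x_tilde

# Toy usage: two slots with distinct visual keys
mem = []
store(mem, y=[1.0, 0.0], x=[10.0, 0.0])
store(mem, y=[0.0, 1.0], x=[0.0, 10.0])
x_tilde = reactivate(mem, y_t=[5.0, 0.0])  # input resembling slot 1's key
```

Under this reading, x_tilde is dominated by the stored embedding whose visual key best matches the current input, which is exactly the property whose contribution (via RNN-3) the suggested ablation would quantify.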
On the head direction attractor analysis
- Lines 174-176 state, "To investigate how our model incorporates visual information in its representation of heading, we simulated the input of visual corrections (512 images from the training environment)". However, this experiment does not tell us "how the model incorporates visual information", but only the response to selected images. The intuitive idea is that the network learns to map distal cues, but not ambiguous images, to specific angles. To test this hypothesis, I would recommend that the authors compare the heading direction of the visual correction input to the direction activated on the attractor, i.e. show that an image that evokes an attractor point matches the heading of that image. Further, because the corrections are given through an entirely different RNN cell (G) from that which (presumably) holds the attractor (F), I would recommend that the authors show how the correction input to G interacts with an existing action-driven point on the attractor via F. For example, what if an image is shown that disagrees with the current heading direction?
On the RL agent
- Lines 200-202 state that "self-consistency is an adaptive characteristic allowing spatial behaviour learned in one environment to be quickly transferred to novel environments", and Line 223: "the spatial responses present in the SMP's RNN support rapid generalization to novel settings." While they've shown that the SMP can support RL and generalization, they haven't tested whether its spatial tuning is responsible for the performance. One way they could test this is to replace the SMP input to the RL agent with equivalent rate-tuned units as inputs (whose rate is simply what would be expected from the tuning curve of each RNN unit). This experiment could be done for the pre-trained agent (to see if performance is maintained from the tuning curves alone, or if there's more information in the SMP that's being used), and possibly compared to a newly trained agent.
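The proposed rate-tuned control could be prototyped roughly as follows. This is a hypothetical sketch with toy 1-D Gaussian place fields; `fit_tuning_curves` and `rate_tuned_units` are illustrative names, not functions from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_tuning_curves(positions, activity, n_bins=20):
    """Estimate each unit's spatial tuning curve by binning 1-D position
    in [0, 1) and averaging that unit's activity within each bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(positions, edges) - 1, 0, n_bins - 1)
    curves = np.zeros((activity.shape[1], n_bins))
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            curves[:, b] = activity[mask].mean(axis=0)
    return edges, curves

def rate_tuned_units(positions, edges, curves):
    """Replace recorded activity with the rate expected from the tuning
    curve alone: the reviewer's proposed control input to the RL agent."""
    n_bins = curves.shape[1]
    bins = np.clip(np.digitize(positions, edges) - 1, 0, n_bins - 1)
    return curves[:, bins].T

# Toy demo: three units with Gaussian place fields on a 1-D track
pos = rng.random(5000)
centres = np.array([0.2, 0.5, 0.8])
act = np.exp(-((pos[:, None] - centres[None, :]) ** 2) / 0.01)
edges, curves = fit_tuning_curves(pos, act)
control = rate_tuned_units(pos, edges, curves)
```

If the agent's performance survives the swap from `act` to `control`, the spatial tuning alone carries the useful signal; if it degrades, the SMP contributes information beyond its tuning curves.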