Learning predictive cognitive maps with spiking neurons during behavior and replays

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    This is a valuable paper that uses solid computational modeling approaches to link plasticity in the hippocampal circuit with behavioral learning. The work focuses on reinforcement learning, a theoretical framework for how animals can optimize learning by extracting the statistical structure of their sensory environment. While a vast range of experimental data on the physiological properties of hippocampal neurons exists, reinforcement learning models often lack such physiological detail. The manuscript begins to fill this gap by developing a spiking computational model of the hippocampus that can implement reinforcement learning and capture some features of hippocampal physiology.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #2 agreed to share their name with the authors.)


Abstract

The hippocampus has been proposed to encode environments using a representation that contains predictive information about likely future states, called the successor representation. However, it is not clear how such a representation could be learned in the hippocampal circuit. Here, we propose a plasticity rule that can learn this predictive map of the environment using a spiking neural network. We connect this biologically plausible plasticity rule to reinforcement learning, mathematically and numerically showing that it implements the TD-lambda algorithm. By spanning these different levels, we show how our framework naturally encompasses behavioral activity and replays, smoothly moving from rate to temporal coding, and allows learning over behavioral timescales with a plasticity rule acting on a timescale of milliseconds. We discuss how biological parameters such as dwelling times at states, neuronal firing rates and neuromodulation relate to the delay discounting parameter of the TD algorithm, and how they influence the learned representation. We also find that, in agreement with psychological studies and contrary to reinforcement learning theory, the discount factor decreases hyperbolically with time. Finally, our framework suggests a role for replays, in both aiding learning in novel environments and finding shortcut trajectories that were not experienced during behavior, in agreement with experimental data.
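
As a rough, non-authoritative illustration of the quantities named in the abstract, the sketch below implements a tabular TD(lambda) update of a successor representation on a toy linear track; the number of states, discount, eligibility decay and learning rate are all assumed values, and none of the spiking or plasticity machinery of the actual model is represented here.

    # Tabular TD(lambda) update of a successor representation (SR) on a toy
    # linear track -- an illustration of the abstract's terms, not the paper's
    # spiking/STDP implementation.
    import numpy as np

    n_states = 5                        # small linear track (assumed)
    gamma, lam, alpha = 0.9, 0.8, 0.1   # discount, eligibility decay, learning rate (assumed)

    M = np.eye(n_states)                # SR estimate: M[s, s'] ~ expected discounted visits to s' from s
    trace = np.zeros(n_states)          # eligibility trace over predecessor states

    trajectory = [0, 1, 2, 3, 4]        # one left-to-right run
    for t in range(len(trajectory) - 1):
        s, s_next = trajectory[t], trajectory[t + 1]
        trace = gamma * lam * trace
        trace[s] += 1.0
        # TD error for every successor feature (one-hot state occupancy)
        td_error = np.eye(n_states)[s] + gamma * M[s_next] - M[s]
        M += alpha * np.outer(trace, td_error)

    print(np.round(M, 2))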

Article activity feed

  1. Author Response

    Reviewer #2 (Public Review):

    Reinforcement learning (RL) theory is important because it provides a broad, mathematically proven framework for linking behavioral states to behavioral actions, and has the potential for linking realistic biological network dynamics to behavior. The most detailed neurophysiological modeling uses biophysical compartmental models with the theoretical framework of Hodgkin-Huxley and Rall to describe the dynamics of real neurons, but those models are extremely difficult to link to behavioral output. RL provides a theoretical framework that could help bridge across the still-underexplored chasm between behavioral modeling and neurophysiological detail.

    On the positive side, this paper uses a network of interacting neurons in regions CA3 and CA1 (as used in previous models by McNaughton and Morris, 1987; Hasselmo and Schnell, 1994; Treves and Rolls, 1994; Mehta, Quirk and Wilson, 2000; Hasselmo, Bodelon and Wyble, 2002) to address how a simple representation of biological network dynamics could generate the successor representation used in RL. The successor representation is an interesting theory of hippocampal function, as it contrasts with a previous idea of model-based planning. Previous neuroscience data support the idea that animals use a model-based representation (a cognitive map made up of place cells or grid cells) to read out potential future paths to plan their behavior in the environment. For example, Johnson and Redish, 2007 showed activity spreading into alternating arms of a T-maze before a decision is made (i.e. a model-based exploration of possible actions, NOT a successor representation), and Pfeiffer and Foster, 2013 showed that replay in two dimensions corresponds to future goal-directed activity. Models such as Erdem and Hasselmo, 2012 and Fenton and Kubie, 2012 showed how forward planning of possible trajectories could guide performance of behavioral tasks. In contrast, the successor representation framework holds that model-based planning is too computationally expensive and proposes that, instead of reading out various possible model-based future paths when making a decision, a simulated agent could learn a look-up table indicating the probability of future behavioral states accessible from a given state. In previous work, successor representations accounted for certain aspects of experimental neuroscience data, such as place cells responding to the insertion of barriers as seen by Alvernhe et al. and the backward expansion of place fields seen by Mehta et al. The current paper is admirable for addressing the potential role of neural replay in training of successor representations and its relationship to other neural and behavioral data, such as the papers by Cheng and Frank, 2008 and by Wu et al., 2017.
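
    For readers unfamiliar with the "look-up table" the reviewer describes: under a fixed policy with transition matrix T and discount gamma, the successor representation has the closed form M = (I - gamma T)^(-1). The snippet below is a minimal sketch using an assumed 4-state random-walk track, not anything taken from the paper under review.

        # Closed-form SR as a matrix of expected discounted future state visits,
        # for a toy random walk on a 4-state linear track (assumed policy).
        import numpy as np

        T = np.array([[0.0, 1.0, 0.0, 0.0],
                      [0.5, 0.0, 0.5, 0.0],
                      [0.0, 0.5, 0.0, 0.5],
                      [0.0, 0.0, 1.0, 0.0]])   # random-walk transitions with reflecting ends
        gamma = 0.9

        M = np.linalg.inv(np.eye(4) - gamma * T)  # M = (I - gamma * T)^(-1)
        print(np.round(M, 2))                     # row s = predictive map from state s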

    However, much of this same data could still be interpreted as indicating that animals use a model-based representation as described above. Nothing in this paper rules out a model-based interpretation of the results discussed above. In fact, the cited paper by Momennejad et al., 2017 shows that humans rely extensively on model-based mechanisms alongside some use of a successor representation. The description in the article under review needs to avoid treating successor representations as if they were already the ground truth.

    To do this, throughout the paper, the authors need to repeatedly address the fact that the successor representation is just a theory and not proven experimental fact. They also need to point out, in all sections, that the successor representation hypothesis can be contrasted with the theory that model-based neural activity could instead guide behavior and could be the correct account for all of the data that they address (e.g. the dark-avoidance behavior). They should cite previous examples of neural data that look like model-based planning, such as Johnson and Redish, 2007 in the T-maze and Pfeiffer and Foster, 2013 in open fields, and cite models such as Hasselmo and Eichenbaum, 2005; Erdem and Hasselmo, 2012; and Fenton and Kubie, 2012 that showed how forward replay or planning of possible trajectories could guide performance of behavioral tasks.

    We thank the reviewer for the valuable feedback. We have adapted the manuscript throughout to discuss the important point that the SR is not the ground truth (e.g. the final paragraphs in the sections “Bias-variance trade-off” and “Leveraging replays to learn novel trajectories”). We also discussed the model-based literature and the suggested citations more extensively in the manuscript.

    The title and text repeatedly refer to a "spiking" model. They show spikes in Figure 2 and extensively discuss the influence of spiking on STDP, but they ought to discuss more explicitly how their spike generation mechanism (a Poisson process) interacts with STDP, and they should compare their model to the model of George, DeCothi, Stachenfeld and Barry, which addresses many of the same questions but uses theta phase precession to obtain the correct spike timing for STDP.

    Yes, that's a great suggestion. We have extended our discussion section. In particular, we added:

    In our work, we did not include theta modulation, but phase precession and theta sequences could be yet another type of activity within the TD lambda framework. Interestingly, other groups have recently investigated related ideas. A recent work \citep{George2022} incorporated theta sweeps into behavioural activity, showing that this approximately learns the SR. Moreover, theta sequences allow for fast learning, playing a similar role to replays (or any other fast temporal-code sequences) in our work. By simulating the temporally compressed and precise theta sequences, their model also reconciles learning over behavioral timescales with STDP. In contrast, our framework reconciles both timescales relying purely on rate coding during behaviour. Finally, their method allows the SR to be learned in continuous space. It would be interesting to investigate whether these methods co-exist in the hippocampus and other brain areas. Furthermore, Fang et al. \citep{Fang2022} recently showed how the SR can be learned using recurrent neural networks with biologically plausible plasticity.

    The introduction and start of the Results section should have more citations to neuroscience data. The introduction currently cites only three experimental papers (O'Keefe and Dostrovsky, 1971; O'Keefe and Nadel, 1978; and Mehta et al., 2000) and then gives repeated citations of previous theory papers as if those papers defined the experimental data relevant to this study. The article should review the actual neuroscience literature, instead of acting as if a few theory papers from the last five years are more important sources of data than decades' worth of experimental work. The start of the Results section makes a statement about the role of the hippocampus and cites only Stachenfeld et al., 2017 as if it were an experimental paper. The introduction, start of the Results, and Discussion need to be modified to address actual experimental data instead of just prior modeling papers. They need to add at least a paragraph to the introduction discussing real experimental data. There are numerous original research papers that should be cited for the role of the hippocampus in behavior, so that the reader does not get the impression that all of this work started with the paper by Stachenfeld et al., 2017. For example, the introduction should supplement the citations to O'Keefe and Mehta with other experimental papers, including those that they cite later in the paper. They should also cite other seminal work such as Morris et al., 1982 in the Morris water maze, Olton, 1979 in the 8-arm radial maze, and the work by Wood, Dudchenko, Robitsek and Eichenbaum on neural activity during spatial alternation. At the start of the Results, instead of only citing Stachenfeld (which should have reduced emphasis when speaking about experiments), they should again cite O'Keefe and Nadel, 1978 for their very comprehensive review of the literature up to that time, plus the work of Morris, Eichenbaum, and Aggleton, and other experimental work.

    We thank the reviewer for the suggested citations. We have added many citations in order to discuss the experimental literature more thoroughly.

    This article is admirable for addressing how to utilize a continuous representation of space and time, which Kenji Doya also addressed in his 1995 NeurIPS article and his 2000 Neural Computation article (both of which should be cited). To emphasize the significance of this continuous representation, they could note that reinforcement learning (RL) theory models still tend to use a discretized grid-like map of the world and a discrete representation of time, which does not correspond to the probabilistic nature of place cell response properties (Fenton and Muller) or the continuous nature of the responses of time cells (Kraus et al., 2013).

    We thank the reviewer for this important comment, and this is indeed one of the main strengths of the proposed framework. We have now emphasised this point by adding the following paragraph to the Discussion:

    “Importantly, the discount parameter also depends on the time spent in each state. This eliminates the need for time discretization, which does not reflect the continuous nature of the response of time cells (Kraus et al. 2013).”
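
    A minimal numerical sketch of this point follows, assuming an exponential discount applied to the (continuous) dwell time in each state; the constant tau and the dwell times are illustrative values only, and the paper's own mapping of the discount to firing rates and neuromodulation is not reproduced here.

        # Dwell-time-dependent discount: each transition is discounted by
        # exp(-dwell_time / tau) instead of a fixed per-step gamma, so no time
        # discretization is required. tau and the dwell times are assumed values.
        import numpy as np

        tau = 2.0                                # discounting time constant (seconds, assumed)
        dwell_times = np.array([0.5, 1.2, 0.3])  # seconds spent in three consecutive states

        step_discounts = np.exp(-dwell_times / tau)
        print(step_discounts)                    # per-transition discount factors
        print(step_discounts.prod())             # total discount accumulated along the path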

    I think the authors of this article need to be clear about the shortcomings of RL. They should devote some space in the discussion to noting neuroscience data that has not been addressed yet. They could note that most components of their RL framework are still implemented as algorithms rather than neural models. They could note that most RL models don't have neurons of any kind in them, and that their own model only uses neurons to represent states and successor representations, without representing actions or action selection processes. They could note that agents in most RL models commonly learn about barriers by having to bang into the barrier at every location, rather than learning to look at it from a distance. The ultimate goal of research such as this should be to link cellular-level neurophysiological data to experimental data on behavior. To the extent possible, they should focus on how they link neurophysiological data at the cellular level to spatial behavior and the unit responses of place cells in behaving animals, rather than basing the validity of their work on the assumption that the successor representation is correct.

    We thank the reviewer for this suggestion, we have now extended the Discussion to include a paragraph on the “Limitations of the Reinforcement Learning framework” which we reproduce here:

    We have already outlined some of the strengths of using reinforcement learning for modelling behaviour, including providing clear computational and algorithmic frameworks. However, there are several intrinsic limitations to this framework. For example, it needs to be noted that RL agents that only use spatial data do not provide complete descriptions of behavior, which likely arises from integrating information across multiple sensory inputs. Whereas an animal would be able to smell and see a reward from a certain distance, an agent exploring the environment would only be able to discover it when randomly visiting the exact reward location. Furthermore, the framework rests on fairly strict mathematical assumptions: typically the state space needs to be Markovian, time and space need to be discretized (which we manage to avoid in this particular framework), and the discounting needs to follow an exponential decay. These assumptions are overly simplistic and it is not clear how often they are actually met. Reinforcement learning is also a sample-intensive technique, whereas we know that some animals, including humans, are capable of much faster or even one-shot learning. Regarding the specific limitations of our model, we note that even though we have provided a neural implementation of the SR, and of the value function as its read-out (see Figure 5-figure supplement S2), the whole action selection process is still computed only at the algorithmic level. It may be interesting to extend the neural implementation to the policy selection mechanism in the future.
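
    To make the "value function as read-out" mentioned above concrete, a toy sketch follows: given an SR matrix M and a reward vector r over states, the value is the linear read-out V = M r, and the final greedy step stands in for the purely algorithmic action selection the response refers to. The transition matrix, rewards and neighbour list are assumptions for illustration, not the paper's setup.

        # Value as a linear read-out of the SR, with a toy greedy action choice.
        import numpy as np

        T = np.array([[0.0, 1.0, 0.0, 0.0],
                      [0.5, 0.0, 0.5, 0.0],
                      [0.0, 0.5, 0.0, 0.5],
                      [0.0, 0.0, 1.0, 0.0]])    # assumed random-walk transitions
        gamma = 0.9
        M = np.linalg.inv(np.eye(4) - gamma * T)  # SR under this (random) policy

        r = np.array([0.0, 0.0, 0.0, 1.0])        # reward only at the last state (assumed)
        V = M @ r                                  # value function as SR read-out

        current_state, neighbours = 1, [0, 2]
        next_state = max(neighbours, key=lambda s: V[s])  # algorithmic greedy step
        print(np.round(V, 2), next_state)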

  2. Evaluation Summary:

    This is a valuable paper that uses solid computational modeling approaches to link plasticity in the hippocampal circuit with behavioral learning. The work focuses on reinforcement learning, a theoretical framework for how animals can optimize learning by extracting the statistical structure of their sensory environment. While a vast range of experimental data on the physiological properties of hippocampal neurons exists, reinforcement learning models often lack such physiological detail. The manuscript begins to fill this gap by developing a spiking computational model of the hippocampus that can implement reinforcement learning and capture some features of hippocampal physiology.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #2 agreed to share their name with the authors.)

  3. Reviewer #1 (Public Review):

    This manuscript proposes a spiking network model of the hippocampal circuit, in which spike-timing-dependent plasticity leads to learning of the successor representation, i.e. a predictive map of the environment. More specifically, the network consists of two layers representing the CA1 and CA3 regions, and the connections between the layers are plastic. The main result is that the resulting plasticity process on behavioural timescales can be mapped onto temporal difference learning, so that the weights between the two layers learn the successor representation.
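
    For orientation, the sketch below shows a generic pair-based STDP update applied to Poisson spike trains of a presynaptic (CA3-like) and a postsynaptic (CA1-like) layer; the amplitudes, time constants and firing rates are assumed, and the paper's actual rule differs in that it is constructed so that the CA3-to-CA1 weights converge to the successor representation.

        # Generic pair-based STDP between two Poisson-spiking layers (CA3 -> CA1),
        # with assumed parameters; an illustration of the architecture, not the
        # paper's plasticity rule.
        import numpy as np

        rng = np.random.default_rng(0)
        n_pre, n_post, dt, T = 4, 4, 1e-3, 1.0          # neurons, time step (s), duration (s)
        A_plus, A_minus, tau_stdp = 0.01, 0.005, 0.02    # assumed STDP amplitudes / time constant

        rates_pre = np.array([20.0, 5.0, 5.0, 5.0])      # Hz, place-field-like CA3 drive
        rates_post = np.array([5.0, 20.0, 5.0, 5.0])     # Hz, externally driven CA1 activity

        W = np.zeros((n_post, n_pre))                    # CA3 -> CA1 weights
        x_pre = np.zeros(n_pre)                          # presynaptic traces
        x_post = np.zeros(n_post)                        # postsynaptic traces

        for _ in range(int(T / dt)):
            pre_spikes = rng.random(n_pre) < rates_pre * dt
            post_spikes = rng.random(n_post) < rates_post * dt
            x_pre += -x_pre * dt / tau_stdp + pre_spikes
            x_post += -x_post * dt / tau_stdp + post_spikes
            # potentiation: post spike paired with recent pre activity (pre-before-post)
            W += A_plus * np.outer(post_spikes, x_pre)
            # depression: pre spike paired with recent post activity (post-before-pre)
            W -= A_minus * np.outer(x_post, pre_spikes)

        print(np.round(W, 3))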

    Strengths:
    - this work presents a model that links two very different levels of description, a biophysical spiking model and reinforcement learning
    - analytical results are provided to support the results
    - the model provides a framework to implement discounting in continuous time, alleviating the need to discretise time.

    Weaknesses:
    - the successor representation is learned at the level of synaptic weights between the two layers. It is not clear how it is read out into neural activity and exploited to perform actual computations, as both layers are assumed to be strongly driven by external inputs. This is a major limitation of this work.
    - one of the results is that STDP at the timescale of milliseconds can lead to learning over behavioral timescales of seconds. This result seems related to Drew and Abbott, PNAS 2006. In that work, the mapping between learning on micro and macro timescales in fact relied on precise tuning of plasticity parameters. It is not clear to what extent similar limitations apply here, or what the precise relation to Drew & Abbott is.
    - most of the results are presented at a formal, descriptive level relating plasticity to reinforcement learning algorithms. The provided examples are quite limited and focus on a simplified setting, a linear track. It would be important to see that the results extend to two-dimensional environments, and to show how the successor representation is actually used (see first comment).
    - the main text does not explain clearly how replays are implemented.

  4. Reviewer #2 (Public Review):

    Reinforcement learning (RL) theory is important because it provides a broad, mathematically proven framework for linking behavioral states to behavioral actions, and has the potential for linking realistic biological network dynamics to behavior. The most detailed neurophysiological modeling uses biophysical compartmental models with the theoretical framework of Hodgkin-Huxley and Rall to describe the dynamics of real neurons, but those models are extremely difficult to link to behavioral output. RL provides a theoretical framework that could help bridge across the still-underexplored chasm between behavioral modeling and neurophysiological detail.

    On the positive side, this paper uses a network of interacting neurons in regions CA3 and CA1 (as used in previous models by McNaughton and Morris, 1987; Hasselmo and Schnell, 1994; Treves and Rolls, 1994; Mehta, Quirk and Wilson, 2000; Hasselmo, Bodelon and Wyble, 2002) to address how a simple representation of biological network dynamics could generate the successor representation used in RL. The successor representation is an interesting theory of hippocampal function, as it contrasts with a previous idea of model-based planning. Previous neuroscience data support the idea that animals use a model-based representation (a cognitive map made up of place cells or grid cells) to read out potential future paths to plan their behavior in the environment. For example, Johnson and Redish, 2007 showed activity spreading into alternating arms of a T-maze before a decision is made (i.e. a model-based exploration of possible actions, NOT a successor representation), and Pfeiffer and Foster, 2013 showed that replay in two dimensions corresponds to future goal-directed activity. Models such as Erdem and Hasselmo, 2012 and Fenton and Kubie, 2012 showed how forward planning of possible trajectories could guide performance of behavioral tasks. In contrast, the successor representation framework holds that model-based planning is too computationally expensive and proposes that, instead of reading out various possible model-based future paths when making a decision, a simulated agent could learn a look-up table indicating the probability of future behavioral states accessible from a given state. In previous work, successor representations accounted for certain aspects of experimental neuroscience data, such as place cells responding to the insertion of barriers as seen by Alvernhe et al. and the backward expansion of place fields seen by Mehta et al. The current paper is admirable for addressing the potential role of neural replay in training of successor representations and its relationship to other neural and behavioral data, such as the papers by Cheng and Frank, 2008 and by Wu et al., 2017.
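
    Complementing the closed-form "look-up table" shown after the same passage above, a sampled-transition version is sketched below: a tabular TD(0) update of the SR from a toy random walk, with assumed parameters, which converges to the same (I - gamma T)^(-1) form for that policy. It is an illustration only, not the paper's spiking implementation.

        # Online TD(0) estimate of the SR from sampled random-walk transitions
        # on a toy 4-state track (assumed parameters).
        import numpy as np

        rng = np.random.default_rng(1)
        n, gamma, alpha = 4, 0.9, 0.05
        M = np.eye(n)

        s = 0
        for _ in range(20000):
            # random walk with reflecting boundaries
            s_next = min(max(s + rng.choice([-1, 1]), 0), n - 1)
            M[s] += alpha * (np.eye(n)[s] + gamma * M[s_next] - M[s])
            s = s_next

        print(np.round(M, 2))   # approaches (I - gamma * T)^(-1) for this policy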

    However, much of this same data could still be interpreted as indicating that animals use a model-based representation as described above. Nothing in this paper rules out a model-based interpretation of the results discussed above. In fact, the cited paper by Momennejad et al., 2017 shows that humans rely extensively on model-based mechanisms alongside some use of a successor representation. The description in the article under review needs to avoid treating successor representations as if they were already the ground truth.

    To do this, throughout the paper, the authors need to repeatedly address the fact that the successor representation is just a theory and not proven experimental fact. They also need to point out, in all sections, that the successor representation hypothesis can be contrasted with the theory that model-based neural activity could instead guide behavior and could be the correct account for all of the data that they address (e.g. the dark-avoidance behavior). They should cite previous examples of neural data that look like model-based planning, such as Johnson and Redish, 2007 in the T-maze and Pfeiffer and Foster, 2013 in open fields, and cite models such as Hasselmo and Eichenbaum, 2005; Erdem and Hasselmo, 2012; and Fenton and Kubie, 2012 that showed how forward replay or planning of possible trajectories could guide performance of behavioral tasks.

    The title and text repeatedly refer to a "spiking" model. They show spikes in Figure 2 and extensively discuss the influence of spiking on STDP, but they ought to discuss more explicitly how their spike generation mechanism (a Poisson process) interacts with STDP, and they should compare their model to the model of George, DeCothi, Stachenfeld and Barry, which addresses many of the same questions but uses theta phase precession to obtain the correct spike timing for STDP.

    The introduction and start of the Results section should have more citations to neuroscience data. The introduction currently cites only three experimental papers (O'Keefe and Dostrovsky, 1971; O'Keefe and Nadel, 1978; and Mehta et al., 2000) and then gives repeated citations of previous theory papers as if those papers defined the experimental data relevant to this study. The article should review the actual neuroscience literature, instead of acting as if a few theory papers from the last five years are more important sources of data than decades' worth of experimental work. The start of the Results section makes a statement about the role of the hippocampus and cites only Stachenfeld et al., 2017 as if it were an experimental paper. The introduction, start of the Results, and Discussion need to be modified to address actual experimental data instead of just prior modeling papers. They need to add at least a paragraph to the introduction discussing real experimental data. There are numerous original research papers that should be cited for the role of the hippocampus in behavior, so that the reader does not get the impression that all of this work started with the paper by Stachenfeld et al., 2017. For example, the introduction should supplement the citations to O'Keefe and Mehta with other experimental papers, including those that they cite later in the paper. They should also cite other seminal work such as Morris et al., 1982 in the Morris water maze, Olton, 1979 in the 8-arm radial maze, and the work by Wood, Dudchenko, Robitsek and Eichenbaum on neural activity during spatial alternation. At the start of the Results, instead of only citing Stachenfeld (which should have reduced emphasis when speaking about experiments), they should again cite O'Keefe and Nadel, 1978 for their very comprehensive review of the literature up to that time, plus the work of Morris, Eichenbaum, and Aggleton, and other experimental work.

    This article is admirable for addressing how to utilize a continuous representation of space and time, which Kenji Doya also addressed in his 1995 NeurIPS article and his 2000 Neural Computation article (both of which should be cited). To emphasize the significance of this continuous representation, they could note that reinforcement learning (RL) theory models still tend to use a discretized grid-like map of the world and a discrete representation of time, which does not correspond to the probabilistic nature of place cell response properties (Fenton and Muller) or the continuous nature of the responses of time cells (Kraus et al., 2013).

    I think the authors of this article need to be clear about the shortcomings of RL. They should devote some space in the discussion to noting neuroscience data that has not been addressed yet. They could note that most components of their RL framework are still implemented as algorithms rather than neural models. They could note that most RL models don't have neurons of any kind in them, and that their own model only uses neurons to represent states and successor representations, without representing actions or action selection processes. They could note that agents in most RL models commonly learn about barriers by having to bang into the barrier at every location, rather than learning to look at it from a distance. The ultimate goal of research such as this should be to link cellular-level neurophysiological data to experimental data on behavior. To the extent possible, they should focus on how they link neurophysiological data at the cellular level to spatial behavior and the unit responses of place cells in behaving animals, rather than basing the validity of their work on the assumption that the successor representation is correct.

  5. Reviewer #3 (Public Review):

    This paper provides a novel framework for understanding prediction-based learning rules that are potentially employed by the hippocampus to optimize behavior. Specifically, the authors examined how a cognitive map containing predictive information (termed the successor representation) is computed in the hippocampus with spike-timing-dependent synaptic plasticity (STDP).

    Strengths:
    By using an ecologically plausible computational model that is embedded with important biological characteristics, the authors propose a novel framework that demonstrates a set of computational principles employed by the hippocampus to achieve successful predictive learning. The paper clearly and thoroughly explains different components of the model with concrete examples and illustrations. Analytical solutions are also provided in addition to narratives to help readers understand the model setup as well as its relevance and connection to biological studies. Among the set of biologically realistic computational dynamics achieved by the modeling framework, the proposed model can elegantly account for both exponential and hyperbolic discounting by demonstrating that exponential discounting applies while the animals travel through space, whereas hyperbolic discounting emerges while the animals travel through time. Additionally, this paper discusses the model findings in the context of experimental and theoretical work, which helps readers understand how the proposed framework can be used in future work to guide investigations of predictive learning, both empirically and computationally.
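
    As an illustration of how exponential and hyperbolic discounting can coexist (a generic argument under an assumed distribution of waiting times, not the paper's own derivation): averaging an exponential discount exp(-t/tau) over exponentially distributed dwell times of mean mu yields tau / (tau + mu), i.e. a hyperbolic decay in the mean waiting time.

        # Exponential per-unit-time discount averaged over random dwell times
        # gives a hyperbolic dependence on the mean dwell time (assumed
        # exponential waiting-time distribution; illustrative only).
        import numpy as np

        rng = np.random.default_rng(2)
        tau = 2.0                                   # discounting time constant (assumed)
        for mu in [0.5, 1.0, 2.0, 4.0]:             # mean dwell / waiting times (s)
            samples = rng.exponential(mu, size=100_000)
            empirical = np.exp(-samples / tau).mean()
            hyperbolic = tau / (tau + mu)           # analytic 1 / (1 + mu / tau) form
            print(mu, round(empirical, 3), round(hyperbolic, 3))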

    The proposed model also makes connections to other theoretical frameworks.

    Weaknesses:
    While the framework proposed in this paper is potentially powerful in capturing different aspects of hippocampus-based predictive learning, the links between the model results and experimental findings are not sufficiently demonstrated. There are several biological concepts that are discussed in the context of the model. It is, however, unclear if the implementations of these concepts within the model capture the same underlying principles that happen in nature. For example, there is rich literature on hippocampal replays including their heterogeneity across contexts and species. The paper does not provide sufficient information regarding the specific types of replays or the specific aspects of replay dynamics that are observed in the model.