Successor-like representation guides the prediction of future events in human visual cortex and hippocampus

Curation statements for this article:
  • Curated by eLife



Abstract

Human agents build models of their environment, which enable them to anticipate and plan upcoming events. However, little is known about the properties of such predictive models. Recently, it has been proposed that hippocampal representations take the form of a predictive map-like structure, the so-called successor representation (SR). Here, we used human functional magnetic resonance imaging to probe whether activity in the early visual cortex (V1) and hippocampus adheres to the postulated properties of the SR after visual sequence learning. Participants were exposed to an arbitrary spatiotemporal sequence consisting of four items (A-B-C-D). We found that after repeated exposure to the sequence, merely presenting single sequence items (e.g., - B - -) resulted in V1 activation at the successor locations of the full sequence (e.g., C-D), but not at the predecessor locations (e.g., A). This highlights that visual representations are skewed toward future states, in line with the SR. Similar results were also found in the hippocampus. Moreover, the hippocampus developed a coactivation profile that showed sensitivity to the temporal distance in sequence space, with fading representations for sequence events in the more distant past and future. V1, in contrast, showed a coactivation profile that was only sensitive to spatial distance in stimulus space. Taken together, these results provide empirical evidence for the proposition that both visual and hippocampal cortex represent a predictive map of the visual world akin to the SR.
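For readers unfamiliar with the formalism, the SR can be written down directly from the one-step transition structure of a sequence. The following is a minimal sketch (illustrative only; the discount factor and all variable names are assumptions, not values from the paper) of how cueing item B in a deterministic A-B-C-D sequence yields graded activation of the successors C and D but none of the predecessor A:

```python
# Minimal SR sketch for a deterministic four-item sequence A -> B -> C -> D.
# Illustrative only: gamma and all names are assumptions, not the paper's values.
import numpy as np

states = ["A", "B", "C", "D"]
T = np.zeros((4, 4))                 # one-step transition matrix
for i in range(len(states) - 1):
    T[i, i + 1] = 1.0                # A->B, B->C, C->D

gamma = 0.6                          # temporal discount factor (arbitrary choice)
# SR: M = sum_k gamma^k T^k, which equals (I - gamma*T)^(-1) here
M = np.linalg.inv(np.eye(4) - gamma * T)

# Row for cue "B": successors are represented in a temporally discounted
# fashion; the predecessor A receives no weight.
print(dict(zip(states, np.round(M[states.index("B")], 2))))
# -> {'A': 0.0, 'B': 1.0, 'C': 0.6, 'D': 0.36}
```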

Article activity feed

  1. Author Response

    Reviewer #2 (Public Review):

    This paper presents novel evidence for the successor representation in the hippocampus and V1 for temporally structured visual sequences. Participants learned sequences of 4 items shown in specific locations (A-B-C-D) on the screen. On a subset of trials, participants were only shown one of the four items, which enabled the authors to test whether the remaining three items were reactivated equivalently, or whether the upcoming items were activated in a temporally graded predictive fashion, consistent with the successor representation. The data suggest the latter interpretation, which was observed in both the hippocampus and V1.

    The approach is well-motivated, and the hypotheses are laid out clearly. The manuscript is very clear and streamlined. The design adopted by the authors, which allowed them to disentangle spatial vs. temporal proximity, is clever and provides an interesting approach to the SR framework. The figures are also very clear and nicely designed. I just have a few comments which I hope the authors can address.

    We thank the reviewer for the positive evaluation.

    1. My main question is related to the difference between the analytic approach to V1 vs. hippocampal representations. In Fig. 3, the authors present evidence of a compelling gradation in V1 representations. However, the corresponding hippocampal results in Fig. 5 are collapsed across all predecessor vs. successor representations.

    I initially thought that the same approach could not be taken in the hippocampus (-3/-2/-1 vs. 1/2/3) due to the coarser representation of space - is that the correct interpretation? However, on p. 9 the authors state that they successfully trained a hippocampal classifier based on spatial locations, so I was unsure why the same approach would not be possible. It would be helpful if the authors could add a sentence clearly explaining why the plots and analyses are not parallel across V1 and the hippocampus.

    We appreciate the reviewer bringing up this point. The reviewer is correct that, in principle, the same approach could be applied to both V1 and the hippocampus. We have now added our motivation for collapsing the data for the hippocampus and also appended the non-averaged hippocampus results as a Supplementary Figure. Below we copy our response to Reviewer #1 from above, who brought up a similar point.

    Given the significant but very low classification accuracy within the localizer (accuracy = 15% ± 3.6%, mean ± s.d.; p = 0.002), we had previously decided to only report averaged location results for the hippocampus, as the non-averaged predictions would be very noisy. To put the hippocampus classification accuracy into context: in V1, cross-validated accuracy within the localizer was 92% ± 12% (mean ± s.d.).

    We now stress this difference between V1 and hippocampus decoding in the Results section and motivate our reason for presenting averaged results:

    "Within localizer decoding accuracy results confirmed that hippocampus has a coarse representation of the eight stimulus locations (Figure 5B) within the localizer (one-sample t-test; t(34) = 3.28, p = 0.002; cross-validated accuracy = 15%  3.6%, mean  s.d.; see Materials and Methods). Notably, compared to V1 (cf. Figure 2A), within localizer accuracy was relatively low and as a consequence tuning curves in hippocampus appeared less sharp (Figure 5C). In order to maximize sensitivity for the hippocampus, we averaged classification evidence across successor and predecessor locations. Non-averaged results can be found in Supplementary Figure 1A."

    Further, we followed the reviewer’s suggestion and added a new Supplementary Figure including the non-averaged results for the hippocampus. The new figure also includes the model comparison the reviewers had asked for. The new Supplementary Figure 1 is included here for convenience:

    2. The analysis disentangling temporal vs. spatial proximity in the localizer data (Fig. 6) is interesting, particularly the persistent gradation in hippocampal responses vs. their absence in V1. However, could the same/similar temporal vs. spatial model not be applied in the full vs. partial sequences as well, as one of the alternative models shown in Fig. 4? The CO model in Fig. 4B assumes a flat reactivation of all other items in the sequence, but it might be that the two items closer in terms of Euclidean distance are represented differently than the far item. After reading the detailed methods, I wonder if this was not possible because the second presented item was always the furthest from the start (180 degrees), but it would be helpful if the authors could clarify this.

    The reviewer is correct: because sequence order and spatial distance were not fully decorrelated (the second presentation was always farthest from the starting dot; the third and fourth dots were always the same distance from the start), we cannot quantify the interaction of the SR and CO models with a spatial model during the main task.

    We added the following to the Method section to clarify this:

    "Note that because within each dot sequence, temporal order and spatial distance were not perfectly decorrelated (e.g. the second sequence dot was always farthest apart from the starting dot), it is not possible to estimate the combined influence of the SR model and the spatial coactivation model on the observed BOLD activity."

    Having said that, we believe there is little concern that the reported reactivations in the main task are meaningfully driven by Euclidean distance, for two reasons:

    (1) A detailed analysis of the localizer data showed that there is no spatial spreading from one dot location to the other sequence locations (Figure 6). This is likely because the relevant dot locations were sufficiently spaced apart (at least 5.36 degrees of visual angle), whereas population receptive field sizes in V1 are well below 2 degrees (Dumoulin & Wandell, 2008). Given the lack of spreading during the localizer, where the dot was flashed for 13.5 s, the presence of spreading during the main task, where the dot was flashed for only 100 ms, is equally unlikely.

    (2) The presence of spatial spreading would actually obfuscate the reported SR-like pattern, not cause it. Specifically, because the second sequence dot was always farthest from the start, this is where one would expect the least activity spread (greatest Euclidean distance). Sequence dots three and four should be more active, given that they are both closer to the starting point in terms of Euclidean distance. Our reported results show the opposite pattern, ruling out the possibility that they were caused by spatial spreading.
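    To make the geometry of this argument concrete, the following sketch computes chord (Euclidean) distances from the starting dot, assuming the dots lie on a unit circle. Only the 180-degree offset of the second dot is stated in the text; the +/-90-degree offsets for dots three and four are illustrative assumptions:

    ```python
    # Chord distances from the starting dot for dots on a unit circle.
    # Only dot2's 180-deg offset is given in the text; the offsets for
    # dot3 and dot4 are illustrative assumptions (equidistant from start).
    import numpy as np

    offsets_deg = {"dot2": 180, "dot3": 90, "dot4": -90}
    for name, deg in offsets_deg.items():
        chord = 2 * np.sin(np.deg2rad(abs(deg)) / 2)   # 2*r*sin(theta/2), r = 1
        print(name, round(float(chord), 2))
    # dot2 -> 2.0 (farthest), dot3/dot4 -> 1.41: spatial spreading would predict
    # the *least* activity at dot2, the opposite of the observed SR gradient.
    ```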

    3. As the authors state on p. 12, the present study did not require any long-term prospective planning. However, the participants' task during the full sequences was closely linked to their predictions about the temporal structure of the four stimuli. It would be useful to see whether the participants who were more closely 'locked' to the sequence and accurate at this temporal detection task also showed stronger SR representations (as these rely on temporal distance).

    This would also provide a useful test of the timescale at which successor representations are behaviorally relevant. In several prior studies, the timescales were quite long, so it would be important to test how strongly SR representations at these timescales relate to behavior.

    We thank the reviewer for this suggestion. In order to relate SR representations to behavior, we first calculated individual BOLD differences for successor vs. predecessor locations to estimate how much each participant's predictions were skewed toward future locations. One might argue that participants with stronger predictions toward future locations would perform better at the behavioral task. We then correlated these values with behavioral accuracy across subjects. No significant correlation was found (r = 0.05; p = 0.769). The lack of a significant correlation might not be surprising, given that our design is likely underpowered for such a between-subject correlation analysis. Additionally, there was no behavioral response in the prediction trials that could be directly related to participants' BOLD activity; instead, the behavioral response is taken from the full-sequence trials.

    These new results were added to the results section:

    "One might argue that participants with stronger predictions toward future locations would perform better at the behavioral detection task. However, no such correlation between individual V1 BOLD activity and task accuracy was found in an across subject correlation analysis (see Materials and Methods, spearman r = 0.05; p = 0.769)."

    And described in the methods:

    "Correlation with behavior. In order to relate SR representations to behavior, we first calculated individual V1 BOLD differences for all successor vs all predecessor locations to get an estimate for how much participant’s predictions were skewed toward future locations. We then correlated these values with behavioral accuracy across subjects using spearman correlation."

  2. Evaluation Summary:

    In this paper, Ekman and colleagues present novel evidence, using a visual sequence task in fMRI, that the early visual cortex (V1) and the hippocampus both represent perceptual sequences in the form of a predictive "successor" representation, where the current state is represented in terms of its future (successor) states in a temporally discounted fashion. In both brain structures, there was evidence for upcoming, but not preceding steps in the sequence, and these results were found only in the temporal but not spatial domain. This study suggests that the hippocampus and V1 represent temporally structured information in a predictive, future-oriented manner.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #3 agreed to share their name with the authors.)

  3. Reviewer #1 (Public Review):

    In this study, the authors tested the properties of predictive processes in V1 and the hippocampus using spatiotemporal sequences of dots appearing at different locations. They used data from a localizer run to determine the subregions or voxel patterns sensitive to the different locations where dots could appear in the main task. Prior to the main task, the participants were familiarized with spatiotemporal sequences (one per subject) consisting of four dots appearing successively at different locations. During the main task, on some trials, all but one dot were omitted from the sequence. Using the localizer data, the authors were able to assess the degree to which each dot was represented in the brain on a given trial of the main task (despite it not being shown). They fitted two models to this data: one in which representations of both previous and subsequent dots were activated (co-occurrence model) and one in which only representations of subsequent dots were activated, with temporal discounting (successor representation; SR). In both regions, the SR model fit the data best. Similar results have been observed in the hippocampus in previous studies, but not in the visual cortex. The authors also performed an additional analysis of the localizer data to assess the representational format of both regions. This analysis revealed a temporal tuning in the hippocampus but not in V1.

    I enjoyed reading this manuscript. I found the study and the analyses generally well-made and the paper well written. I think the main result is interesting and important: it is advancing our understanding of how predictions are implemented in V1. The data is furthermore likely to be of interest to other researchers studying learning, predictions, and temporal sequences. The conclusions are generally well supported by the results; however, there are some issues with their general framing and interpretation.

    1. The authors frame their paper in terms of the successor representation (SR). To my knowledge, the SR has previously been used only in a reinforcement learning (RL) context, where there are rewards associated with specific states and where predictions are task-relevant. I don't think it is well defined outside of that context. In the RL literature, the SR has additional features that distinguish it from other RL models such as model-based learning (the standard definition is sketched after this list for reference). In the present study, model-based learning (where all one-step transitions are stored, and predictions are iteratively computed) would essentially make the same prediction as SR. There is no reward here and the context is very different, but even then, SR may not be an accurate description of the model tested here.

    2. It is unclear whether the presence of a successor representation in V1 is the result of feedback from the hippocampus or if it is intrinsic to V1. There should be more investigation into the mechanism explaining this finding, and more discussion of its implication.

    3. The goal of the tuning analysis and the interpretation of its result is unclear. At times, it seems like the aim was to investigate the underlying coding of the region, separate from predictive mechanisms. But temporal tuning is intrinsically dependent on the learned associations and hard to disentangle from predictions. Also, the fact that the localizer is run after the main task is a confounding factor to this interpretation. Instead, this analysis is probably more indicative of how much the predictive mechanism persists after the task (this is also the chosen interpretation at other times). But then, it is unclear why a temporally symmetric activity pattern (activation to predecessors as much as to successors) would be predicted and obtained. (Could this result simply be due to the absence of blank screens (omitted items) before and after the shown dot in the localizer run?)

    4. It is unclear whether there is a visual difference between the inter-trial intervals (ITI) and the parts of sequence trials where there is no dot shown. If there is none, as appears to be the case, the participants would be unable to detect the start of a partial sequence trial when the shown dot is not the first one in the sequence (especially since ITI is of variable duration). This could perturb predictive processes since a dot that is shown in the middle of a sequence would appear to participants as being shown at its beginning.
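    For reference, the standard reinforcement-learning definition of the SR that point 1 alludes to, in common notation (background material, not a derivation from the manuscript; T is the one-step transition matrix and gamma the discount factor):

    ```latex
    % Standard SR definition from the RL literature (background only):
    % M(s, s') is the expected discounted future occupancy of state s'
    % when starting from state s.
    M(s, s') = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\,
               \mathbf{1}[s_t = s'] \,\middle|\, s_0 = s \right]
             = \left[ (I - \gamma T)^{-1} \right]_{s s'},
    \qquad V(s) = \sum_{s'} M(s, s')\, R(s').
    % For a deterministic chain such as A-B-C-D, iterating the stored one-step
    % transitions (a model-based rollout) yields the same discounted predictions,
    % which is why SR and model-based accounts coincide in this design.
    ```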

  4. Reviewer #2 (Public Review):

    This paper presents novel evidence for the successor representation in the hippocampus and V1 for temporally structured visual sequences. Participants learned sequences of 4 items shown in specific locations (A-B-C-D) on the screen. On a subset of trials, participants were only shown one of the four items, which enabled the authors to test whether the remaining three items were reactivated equivalently, or whether the upcoming items were activated in a temporally graded predictive fashion, consistent with the successor representation. The data suggest the latter interpretation, which was observed in both the hippocampus and V1.

    The approach is well-motivated, and the hypotheses are laid out clearly. The manuscript is very clear and streamlined. The design adopted by the authors, which allowed them to disentangle spatial vs. temporal proximity, is clever and provides an interesting approach to the SR framework. The figures are also very clear and nicely designed. I just have a few comments which I hope the authors can address.

    1. My main question is related to the difference between the analytic approach to V1 vs. hippocampal representations. In Fig. 3, the authors present evidence of a compelling gradation in V1 representations. However, the corresponding hippocampal results in Fig. 5 are collapsed across all predecessor vs. successor representations.
    I initially thought that the same approach could not be taken in the hippocampus (-3/-2/-1 vs. 1/2/3) due to the coarser representation of space - is that the correct interpretation? However, on p. 9 the authors state that they successfully trained a hippocampal classifier based on spatial locations, so I was unsure why the same approach would not be possible. It would be helpful if the authors could add a sentence clearly explaining why the plots and analyses are not parallel across V1 and the hippocampus.

    2. The analysis disentangling temporal vs. spatial proximity in the localizer data (Fig. 6) is interesting, particularly the persistent gradation in hippocampal responses vs. their absence in V1. However, could the same/similar temporal vs. spatial model not be applied in the full vs. partial sequences as well, as one of the alternative models shown in Fig. 4? The CO model in Fig. 4B assumes a flat reactivation of all other items in the sequence, but it might be that the two items closer in terms of Euclidean distance are represented differently than the far item. After reading the detailed methods, I wonder if this was not possible because the second presented item was always the furthest from the start (180 degrees), but it would be helpful if the authors could clarify this.

    3. As the authors state on p. 12, the present study did not require any long-term prospective planning. However, the participants' task during the full sequences was closely linked to their predictions about the temporal structure of the four stimuli. It would be useful to see whether the participants who were more closely 'locked' to the sequence and accurate at this temporal detection task also showed stronger SR representations (as these rely on temporal distance).

    This would also provide a useful test of the timescale at which successor representations are behaviorally relevant. In several prior studies, the timescales were quite long, so it would be important to test how strongly SR representations at these timescales relate to behavior.

  5. Reviewer #3 (Public Review):

    The main finding in this study is that during repeated exposure to a visual sequence (A-B-C-D), merely presenting single sequence items (e.g., - B - -) leads to V1 reinstatement of subsequent items in the sequence. Importantly, the successor stimuli (e.g., C-D) are reinstated, but not the predecessor stimuli (e.g., A). The authors propose that this predictive activity adheres to the postulated properties of the successor representation (SR). The SR, which has previously been used to describe activity in the hippocampus (Stachenfeld et al., Nature Neuroscience 2017), can be defined as a predictive representation where each state is represented in terms of its successor states, in a temporally discounted fashion. The idea that V1 might also employ the computationally efficient and flexible properties of the SR is highly interesting.

    Overall, the data presented in this article provide evidence for an SR-like representation in V1 during a visual sequence task but not during a post-scan localiser scan. I have several queries for the authors which they may wish to address. For example, using their data are the authors able to distinguish between an SR and other predictive sequence models? Why is the predictive activity only observed during the task and not during the post-task localiser? How does the interpretation of the data differ from the authors' previous reports of preplay in V1? Why do the authors fit certain models to data acquired during the task, and other models to data acquired during a localiser scan?