Neural signatures of model-based and model-free reinforcement learning across prefrontal cortex and striatum
Curation statements for this article:
Curated by eLife
eLife Assessment
This important study presents single-unit activity collected during model-based (MB) and model-free (MF) reinforcement learning in non-human primates. The dataset was carefully collected, and the statistical analyses, including the modeling, are rigorous. The evidence convincingly supports different roles for particular cortical and subcortical areas in representing key variables during reinforcement learning.
This article has been reviewed by the following groups
Listed in
- Evaluated articles (eLife)
Abstract
Animals integrate knowledge about how the state of the environment evolves to choose actions that maximise reward. Such goal-directed behaviour, or model-based (MB) reinforcement learning (RL), can flexibly adapt choice to environmental changes and is thus distinct from simpler habitual, or model-free (MF), RL strategies. Previous inactivation and neuroimaging work implicates the prefrontal cortex (PFC) and the caudate striatal region in MB-RL; however, little is known about how MB-RL is implemented at the single-neuron level. Here, we recorded from two PFC regions (the dorsal anterior cingulate cortex, ACC, and the dorsolateral PFC, DLPFC) and two striatal regions (caudate and putamen) while two rhesus macaques performed a sequential decision-making (two-step) task in which MB-RL requires knowledge about the statistics of reward and state transitions. All four regions, but particularly the ACC, encoded the rewards received and tracked the probabilistic state transitions that occurred. However, the ACC (and to a lesser extent the caudate) encoded the key task variable, namely the interaction between reward, transition and choice, which underlies MB decision-making. ACC and caudate neurons also encoded MB-derived estimates of choice values. Moreover, caudate value estimates of the choice options flipped when a rare transition occurred, demonstrating a value update based on structural knowledge of the task. The striatal regions were unique (relative to PFC) in encoding the current and previous rewards with opposing polarities, reminiscent of dopaminergic neurons and indicative of a MF prediction error. Our findings provide a deeper understanding of selective and temporally dissociable neural mechanisms underlying goal-directed behaviour.
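The MB/MF distinction described in the abstract can be sketched in code. The following is a minimal illustration of a standard two-step-task formulation (a Daw-style agent); the function names, learning rate, and transition probability are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def mf_update(q_mf, choice, reward, alpha=0.5):
    """Model-free TD update: nudge the chosen option's value toward
    the received reward, ignoring task structure."""
    q = np.asarray(q_mf, dtype=float).copy()
    q[choice] += alpha * (reward - q[choice])
    return q

def mb_values(q_stage2, p_common=0.7):
    """Model-based first-stage values: combine second-stage values with
    knowledge of the transition structure (common vs rare transitions).
    Assumes choice 0 commonly leads to state 0, choice 1 to state 1."""
    v_state = np.asarray(q_stage2, dtype=float).max(axis=1)  # best outcome per state
    return np.array([
        p_common * v_state[0] + (1 - p_common) * v_state[1],
        p_common * v_state[1] + (1 - p_common) * v_state[0],
    ])
```

Under this sketch, a rare transition followed by reward raises the MB value of the *unchosen* first-stage option (because its common transition leads to the rewarded state), which is the signature behind the caudate "value flip" reported in the abstract.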
Article activity feed
-
Reviewer #1 (Public review):
Summary:
Using single-unit recording in 4 regions of non-human primate brains, the authors tested whether these regions encode computational variables related to model-based and model-free reinforcement learning strategies. While some of the variables seem to be encoded by all regions, there is clear evidence for stronger encoding of model-based information in the anterior cingulate cortex and caudate.
Strengths:
The analyses are thorough, the writing is clear, and the work is well-motivated by prior theory and empirical studies.
Weaknesses:
My comments here are quite minor.
The correlation between transition and reward coefficients is interesting, but I'm a little worried that this might be an artifact. I suspect that reward probability is higher after common transitions, due to the fact that animals are choosing actions they think will lead to higher reward. This suggests that the coefficients might be inevitably correlated by virtue of the task design and the fact that all regions are sensitive to reward. Can the authors rule out this possibility (e.g., by simulation)?
The explore/exploit section seems somewhat randomly tacked on. Is this really relevant? If yes, then I think it needs to be integrated more coherently.
-
Reviewer #2 (Public review):
Summary:
The authors investigate single-neuron activity in rhesus macaques during model-based (MB) and model-free (MF) reinforcement learning (RL). Using a well-established two-step choice task, they analyze neural correlates of MB and MF learning across four brain regions: the anterior cingulate cortex (ACC), dorsolateral PFC (DLPFC), caudate, and putamen. The study provides strong evidence that these regions encode distinct RL-related signals, with ACC playing a dominant role in MB learning and caudate updating value representations after rare transitions. The authors apply rigorous statistical analyses to characterize neural encoding at both population and single-neuron levels.
Strengths:
(1) The research fills a gap in the literature, which has been limited in directly dissociating MB vs. MF learning at the single unit level and across brain areas known to be involved in reinforcement learning. This study advances our understanding of how different brain regions are involved in RL computations.
(2) The study used a two-step choice task (Miranda et al., 2020) that was previously established for distinguishing MB and MF reinforcement learning strategies.
(3) The use of multiple brain regions (ACC, DLPFC, caudate, and putamen) in the study enabled comparisons across cortical and subcortical structures.
(4) The study used multiple GLMs, population-level encoding analyses, and decoding approaches. With each analysis, they conducted the appropriate controls for multiple comparisons and described their methods clearly.
(5) They implemented control regressors to account for neural drift and temporal autocorrelation.
(6) The authors showed evidence for three main findings:
a) ACC as the strongest encoder of MB variables from the four areas, which emphasizes its role in tracking transition structures and reward-based learning. The ACC also showed sustained representation of feedback that went into the next trial.
b) ACC was the only area to represent both MB and MF value representations.
c) The caudate selectively updates value representations when rare transitions occur, supporting its role in MB updating.
(7) The findings support the idea that MB and MF reinforcement learning operate in parallel rather than strictly competing.
(8) The paper also discusses how MB computations could be an extension of sophisticated MF strategies.
Weaknesses:
(1) There is limited evidence for a causal relationship between neural activity and behavior. The authors cite previous lesion studies, but causality between neural encoding in ACC, caudate, and putamen and behavioral reliance on MB or MF learning is not established.
(2) There is a heavy emphasis on ACC versus other areas, but it is unclear how much of this signal drives behavior relative to the caudate.
(3) The role of the putamen is somewhat underexplored here.
(4) The authors mention the monkeys were overtrained before recording, which might have led to a bias in the MB versus MF strategy.
(5) The GLM3 model combines MB and MF value estimates, but the paper does not clearly state how hyperparameters were optimized to prevent overfitting. While the hybrid model explains behavior well, the analysis does not clarify whether the MB/MF weighting changes dynamically over time.
(6) It was unclear from the task description whether the images used changed periodically or how the transition effect (e.g., in Figure 3) could be disambiguated from a visual response to the pair of cues.
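The hybrid MB/MF weighting the reviewer questions in point (5) is conventionally written as a single weight w mixing the two value estimates before a softmax choice rule. The sketch below shows that standard formulation; the parameter names (w, beta) and the function itself are illustrative assumptions, not the authors' GLM3 code.

```python
import numpy as np

def hybrid_choice_probs(q_mb, q_mf, w=0.5, beta=3.0):
    """Mix MB and MF value estimates with weight w, then convert to
    choice probabilities via a softmax with inverse temperature beta."""
    q = w * np.asarray(q_mb, dtype=float) + (1 - w) * np.asarray(q_mf, dtype=float)
    z = beta * (q - q.max())            # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return p
```

In this formulation, the reviewer's concern about dynamic weighting amounts to asking whether w should be fitted as a constant per session or allowed to vary across trials, which would require a time-varying parameterization rather than the single fixed weight shown here.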