Dopamine ramps as a normative consequence of dual-process control

Curation statements for this article:
  • Curated by eLife

    eLife Assessment

    This important study develops a novel theory to account for various aspects of dopamine signals, particularly dopamine ramps. The authors propose that dopamine reward prediction error (RPE) signals are generated by a dual-process learning system in which values inferred by a model-based system enter the RPE asymmetrically, contributing to the update target but not to the prediction. The results are well presented and convincing, and they make a contribution of importance to the field. This work will be of interest to those studying dopamine specifically or brain learning computations and systems more broadly.

Abstract

Midbrain dopamine neurons are thought to implement a temporal difference (TD) reward prediction error (RPE) that updates cached values stored in striatum. This has been challenged by evidence that dopamine “ramps up” to predictable rewards during goal-directed behaviour. Here, we propose that dopamine ramps are RPEs generated by a dual-process learning system in which values inferred using a world model train cached values via the RPE. Ramps arise because efficient training of cached values requires that inferred values contribute to the update target but not the prediction component of the RPE. The model reproduces key dopamine ramp phenomena, including learning dynamics on fast and slow timescales, global updates following changes in reward expectation, transient responses during unexpected state transitions, and sensitivity to state uncertainty manipulations. We therefore argue that dopamine ramps are a signature of interactions between inferred and cached values that revise the traditional dichotomy between model-based and model-free learning.
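As a reading aid, the asymmetry described in the abstract can be written out in one possible form. This is an illustrative sketch under stated assumptions, not the manuscript's own equation: the convex mixture with weight k (the arbitration parameter mentioned in the reviews), the discount factor γ, and the learning rate α_TD are assumed here. The inferred value V_MB appears only in the update target, while the prediction is the cached value V_TD alone.

```latex
\begin{align}
\delta_t &= \underbrace{r_t + \gamma\left[k\,V_{\mathrm{MB}}(s_{t+1}) + (1-k)\,V_{\mathrm{TD}}(s_{t+1})\right]}_{\text{update target}}
          \;-\; \underbrace{V_{\mathrm{TD}}(s_t)}_{\text{prediction}}, \\
V_{\mathrm{TD}}(s_t) &\leftarrow V_{\mathrm{TD}}(s_t) + \alpha_{\mathrm{TD}}\,\delta_t .
\end{align}
```

Under this assumed form, k = 0 recovers the standard TD error, and k = 1 makes the target depend on the inferred value alone.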

Article activity feed

  2. Reviewer #1 (Public review):

    Summary:

    This study develops a novel theory to account for various aspects of dopamine signals, particularly dopamine ramps. The authors propose that dopamine reward prediction error (RPE) signals are generated by a dual-process learning system in which values inferred by a model-based system enter the RPE asymmetrically, contributing to the update target but not to the prediction (equation 6). By maintaining an RPE interpretation, the work offers specific, mechanistic explanations of the findings of Krausz et al. (2023), Guru et al. (2020), and Kim et al. (2020). It also presents an alternative to the state-uncertainty account of Mikhael et al. (2022) that does not require the asymmetric-uncertainty assumption on which that account depends, and it uses Campbell et al. (2025) thoughtfully in making this case. The asymmetric-RPE idea is clean and well presented. Overall, this study makes an important contribution to the field.

    Strengths:

    The theory is relatively simple and intuitive. It addresses a long-standing controversy or mystery in the field of dopamine.

    Weaknesses:

    (1) The biggest outstanding question is what V_TD does: letting V_MB drive everything would seem to produce much the same outcomes in the settings discussed here. The Discussion suggests that, in situations where the model-based system contributes little, the backpropagating bump is a feature (e.g. Amo et al.). It would be interesting to see whether this is a true outcome of the model, potentially by varying the arbitration parameter k. This would offer an interesting alternative to eligibility-trace explanations of the absence of a backpropagating bump in some experimental settings.

    (2) The model-based accounts are quite simplistic, and this should probably be acknowledged. The simplicity does help delineate the paper's contribution, but in the model only the goal-reward value is updated; everything else is a known computation. Perhaps engage more deeply with Sagiv et al.?

    (3) The application of Campbell et al. (2025) to push back on Mikhael (lines 253-259) is interesting: if striatum to VTA implements TD via synaptic delays such that V(s_t) is a delayed copy of V(s_{t+1}), then state uncertainty is necessarily shared between the two terms in the RPE, defeating Mikhael's required asymmetry.

    But the same circuit logic creates tension for the dual-process model. The authors appear to propose that the frontal cortex projects V_MB onto VTA dopamine neurons (as proposed in 3.1 and the Discussion), where it adds to the prediction error derived from the biphasic filtering of value. But the biphasic idea (and the data of Campbell et al.) implies that V(t+1) and -V(t) come from the same source and are proportional. Adding the V_MB term is akin to adding a positive bias, which breaks the optimality of the TD error for predicting value and predicts over-learning of the cached value. It is worth considering whether V_MB passes through a similar filter; I am not sure it would be fatal if V_MB contributed somewhat to the negative term of the update error.

    (4) In a few places the premises behind the conclusions need more care. The "normative" framing throughout 3.2 and the Discussion is normative only conditional on an architecture that already includes a separate cached system that must converge to the true value function, and on the model-based system learning much faster (see the comments on the learning-rate parameters below).

    (5) Kim et al. is cited heavily as a data source for Figure 4 but is never engaged with as a theoretical alternative, even though Kim et al. explicitly argued that an appropriate state representation makes standard TD compatible with ramps and the teleport responses. That is, Kim et al. is already a TD account of these phenomena and does not require a second learning system. The introduction and the Mikhael discussion treat the field as if the choice were between "dopamine = value" (Hamid, Howe, Mohebi) and "dopamine = RPE with special conditions" (Mikhael, Kato-Morita), but Kim et al.'s framework is also "dopamine = RPE". This matters in two specific places: (i) Figure 4 currently demonstrates that the dual-process model reproduces the Kim teleport results, but Kim et al.'s framework also reproduces them; the figure does not distinguish the two, and I am not sure it conveys this message cleanly. (ii) Kim et al. report that ramps develop with training over days; the manuscript should address whether the dual-process model has an alternative explanation for this, especially given the contrast with the Guru result (ramps diminishing with training over a longer timescale).

    (6) The arbitration parameter k is fixed at 0.5 throughout, and the paper acknowledges this is for simplicity, but a supplementary panel sweeping k ∈ {0, 0.2, 0.5, 0.8, 1.0} on the key figures (Figure 1B convergence, Figure 2D ramp dynamics, Figure 3D Krausz updating) would be informative. At k = 0, the model reduces to standard TD; at k = 1, it's effectively V_MB-driven. I think these would be easy to add and help clarify the work this assumption is doing.

    (7) Learning-rate asymmetry needs justification. The story relies on α_MB >> α_TD throughout (α_MB = 0.50, α_TD = 0.01 - a 50× ratio). With α_MB = 0.5, a single rewarded trial moves R[goal] halfway to the new value, which would predict strong dependence of dopamine ramp amplitude on the previous trial's outcome. This is testable in existing data (Krausz et al. should have enough trials to fit the exponential decay constant for trial-history dependence; Guru's swap-session data likewise), and the paper would be strengthened by explicitly deriving and checking that prediction.

    (8) α_MB is dropped to 0.10 specifically for the Krausz simulation without justification in the text. Why? Either the value should be the same as elsewhere, or the paper should explain why Krausz's task requires slower MB learning. It would also be good to check the robustness of the Krausz simulation: the test phase is a single set of three trials (t-2 = omission, t-1 = reward, then t = 50% rewarded) after training on a single set of 500 simulated trials. I believe only one random seed is used; given the high alpha, varying this set of simulated trials seems important. Also, does the model reproduce the other result in Krausz (t-2 = reward, t-1 = omission, t = 50% rewarded)?

    (9) It might be possible to fit the alpha to the Guru and Krausz simulations - this might be informative to show the range over which it varies.

    (10) The Kato and Morita account is cited in the introduction but never really discussed again; it would be good to engage with it more in the Discussion. The rejection of the value-based accounts seems to rely primarily on Kim et al., where the value and TD RPE accounts differ. This could be acknowledged directly, rather than the credit being absorbed into the authors' model.
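Points (6) and (7) above can be explored in a toy setting. The following is a minimal sketch, not the authors' code. It assumes a linear track with reward in the final state, fixed and already-correct model-based values V_MB, and an RPE whose target mixes V_MB and V_TD with weight k while the prediction uses V_TD alone. The function names `simulate` and `history_weights`, the track length, and the discount factor are hypothetical choices made for illustration. The second function assumes a delta-rule reward estimate, R += α(r - R), under which the outcome j trials back carries weight α(1 - α)^j.

```python
import numpy as np


def simulate(k, n_states=10, n_trials=3000, alpha_td=0.01, gamma=0.95, reward=1.0):
    """TD learning on a linear track with a mixed bootstrap target.

    Returns (rpes, v_td, v_mb); rpes has shape (n_trials, n_states).
    """
    # Assumption: model-based values are already correct and held fixed,
    # V_MB(s) = reward * gamma ** (steps remaining before the rewarded state).
    v_mb = reward * gamma ** np.arange(n_states - 1, -1, -1)
    v_td = np.zeros(n_states)  # cached values start at zero
    rpes = np.zeros((n_trials, n_states))
    for trial in range(n_trials):
        for s in range(n_states):
            if s == n_states - 1:
                target = reward  # reward arrives in the final state
            else:
                # V_MB enters the target only, never the prediction below.
                target = gamma * (k * v_mb[s + 1] + (1 - k) * v_td[s + 1])
            delta = target - v_td[s]  # prediction is the cached value alone
            v_td[s] += alpha_td * delta
            rpes[trial, s] = delta
    return rpes, v_td, v_mb


def history_weights(alpha, n_back=5):
    """Weight of the outcome j trials back (j = 0 is the most recent trial)
    after delta-rule updates R += alpha * (r - R)."""
    return alpha * (1 - alpha) ** np.arange(n_back)


if __name__ == "__main__":
    for k in (0.0, 0.2, 0.5, 0.8, 1.0):
        rpes, v_td, v_mb = simulate(k)
        print(f"k={k}: first-trial RPE profile {np.round(rpes[0], 3)}")
        print(f"      max |RPE| on last trial {np.abs(rpes[-1]).max():.2e}")
    print("history weights, alpha=0.5:", history_weights(0.5))
    print("history weights, alpha=0.1:", history_weights(0.1))
```

In this sketch, k = 0 gives no first-trial RPE before the rewarded state, whereas any k > 0 gives a first-trial RPE that rises toward the reward and shrinks as V_TD approaches V_MB. Whether these toy properties carry over to the manuscript's simulations would need to be checked against the authors' own implementation.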

  3. Reviewer #2 (Public review):

    Summary:

    This paper offers a novel theoretical account of dopamine ramps. The key idea is that the reward prediction error (putatively signaled by dopamine) uses a partially model-based estimate for future value (the prediction target). Because the model-based value estimate emerges more rapidly than the model-free estimate, it inflates the RPE, and this inflation increases with reward proximity - hence ramps. The authors show that this account can explain many aspects of existing data on dopamine ramps across several different studies.
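The inflation argument in this summary can be made explicit with a short derivation. This is a sketch under assumptions that are not taken from the manuscript: the target is a convex mixture of the two value estimates with weight k, γ is the discount factor, R is the reward magnitude, and d(s) is the number of steps from state s to the rewarded state, so that the model-based value is V_MB(s) = R γ^{d(s)}.

```latex
\begin{align}
\delta_t &= r_t + \gamma\left[k\,V_{\mathrm{MB}}(s_{t+1}) + (1-k)\,V_{\mathrm{TD}}(s_{t+1})\right] - V_{\mathrm{TD}}(s_t) \\
         &= \underbrace{r_t + \gamma V_{\mathrm{TD}}(s_{t+1}) - V_{\mathrm{TD}}(s_t)}_{\text{standard TD error}}
            \;+\; \underbrace{k\gamma\left[V_{\mathrm{MB}}(s_{t+1}) - V_{\mathrm{TD}}(s_{t+1})\right]}_{\text{inflation term}} .
\end{align}
% Before any cached learning (V_TD = 0) and on an unrewarded step (r_t = 0):
\begin{equation}
\delta_t = k\gamma\,V_{\mathrm{MB}}(s_{t+1}) = k R\,\gamma^{\,d(s_{t+1})+1},
\end{equation}
% which grows as d decreases, i.e. as the reward gets closer.
```

Under these assumptions the inflation term vanishes once V_TD matches V_MB, which is consistent with the summary's point that the ramp reflects a model-based estimate that forms faster than the model-free one.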

    Strengths:

    Overall, I liked this paper. The idea is interesting and plausible. The paper is well-written and clearly argued. The modeling has been done rigorously.

    Weaknesses:

    My major comments are: (1) it's not always clear which phenomena are uniquely well-explained by this new account vs. earlier accounts; and (2) the limitations of the account are not entirely transparent.

    (1) The paper models some of the studies reported by Kim et al. (2020). As was already shown in that paper, a standard TD error could explain the results (although a major limitation of that treatment was that it did not model the recursive effect of RPEs on learning, as discussed in the Mikhael paper). It's not clear if there's additional explanatory value provided by this new account, though, of course, it's good to know that those results are captured by the new account. Likewise, Mikhael et al. (2022) already offered an account of their data (somewhat more complex than the standard TD model). Again, it's not clear if there's additional explanatory value provided by the new account (and again, it's nice to see that the model can capture these results). Finally, I found myself wondering whether the Guru et al. (2020) result couldn't be explained by a more standard TD model (assuming the value function is sufficiently convex). I don't think it's essential that the new account provides additional explanatory value in every case, but I think it's important to convey to readers what's new and what's not, as well as what aspects of the data require particular kinds of mechanisms to explain. It would be really helpful to see the predictions of alternative TD models in order to make this clearer.

    (2) The Mikhael model was motivated by the puzzle that ramping is observed in navigation tasks (with sensory cues) but typically not in classical conditioning tasks lacking sensory cues. The correction term, derived from normative considerations, explained this discrepancy. It's not clear to me if/how the new account can explain the discrepancy.

  4. Reviewer #3 (Public review):

    Summary:

    This work presents a new hypothesis for why dopamine signals have sometimes been observed to "ramp up" in spatial tasks as rodents approach a location associated with reward. In essence, the hypothesis is that value estimates (i.e., predictions about future rewards) from a model-based system, which may form such estimates more quickly via an inference-like process, can be used to speed up the (relatively slow) learning of such estimates by a model-free system. This is suggested to occur by including the model-based estimate as part of the target towards which model-free estimates are updated in the course of temporal-difference (TD) learning. The early discrepancy between these estimates can be expected to produce systematic TD errors, putatively represented in dopaminergic activity, that appear as dopamine ramps and that are expected to diminish over time as the estimates of the two systems converge. The authors show that a model that implements this idea makes predictions about dopamine activity that are a good qualitative match to data from a number of recent experimental studies.

    Strengths:

    The work suggests a normative account for a phenomenon that has persistently troubled the canonical theory of dopamine function. The account is appealing in its elegance and simplicity, and the authors present compelling evidence that it can capture the empirical observations of key recent papers. Another strength of the account is that it readily suggests avenues for future theory development and experimental test, including what the 'best' target estimate should be at any given time, how rapidly one might expect ramps to develop or diminish, and the neural implementation of the proposed algorithm. This is likely to stimulate further theoretical and experimental work in the field.

    Weaknesses:

    One aspect of dopamine "ramps" that was troubling from a theoretical standpoint was their apparent persistence over time. Given the authors' prediction that these would disappear over time in a stable environment and the supporting evidence they cite (from Guru et al., 2020), the reader might be left confused about the state of the evidence on whether dopamine ramps persist. Perhaps relatedly, the relationship between dopamine cell activity and dopamine release is not discussed, which may be relevant given that early studies (e.g., Howe et al., 2013) used voltammetry to measure extracellular dopamine concentrations.