A new model of decision processing in instrumental learning tasks
Curation statements for this article:
Curated by eLife
Summary: This cognitive modeling study on a timely topic investigates the combination of reinforcement learning and decision-making for modeling choice and reaction-time data in sequential reinforcement problems (e.g., bandit tasks). The central claim of the paper is that the often-used combination of reinforcement learning with the drift-diffusion model (which decides based on the difference between option values) does not provide an adequate model of instrumental learning. Instead, the authors propose an "advantage racing" model which provides better fits to choice and reaction-time data in different variants of two-alternative forced-choice tasks. Furthermore, the authors emphasize that their advantage racing model allows for fitting decision problems with more than two alternatives - something which the standard drift-diffusion model cannot do. These findings will be of interest to researchers investigating learning and decision-making.
The study asks an important question for understanding the interaction between reinforcement learning and decision-making, the methods appear sound, and the manuscript is clearly written. The superiority of the advantage racing model is key to the novelty of the study, which otherwise relies on a canonical task studied in several recent papers on the same issue. However, the reviewers feel that the framing of the study and its conclusions would require additional analyses and experiments to transform the manuscript from a modest quantitative improvement into a qualitative theoretical advance. In particular, as described in the paragraphs below, the authors should test how their advantage racing model fares in reinforcement problems with more than two alternatives. By the authors' own account throughout the paper, this is the situation in which their model could most clearly show its superiority over the standard drift-diffusion models used in the recent literature.
This article has been reviewed by the following groups:
Listed in
- Evaluated articles (eLife)
Abstract
Learning and decision-making are interactive processes, yet cognitive models of error-driven learning and of decision-making have largely evolved separately. Recently, evidence accumulation models (EAMs) of decision-making and reinforcement learning (RL) models of error-driven learning have been combined into joint RL-EAMs that can in principle address these interactions. However, we show that the most commonly used combination, based on the diffusion decision model (DDM) for binary choice, consistently fails to capture crucial aspects of response times observed during reinforcement learning. We propose a new RL-EAM based on an advantage racing diffusion (ARD) framework for choices among two or more options that not only addresses this problem but captures stimulus difficulty, speed-accuracy trade-off, and stimulus-response-mapping reversal effects. The RL-ARD avoids fundamental limitations imposed by the DDM on addressing effects of absolute values of choices, as well as extensions beyond binary choice, and provides a computationally tractable basis for wider applications.
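To make the abstract's core idea concrete, here is a minimal, illustrative sketch of an RL-ARD: two racing accumulators whose drift rates combine the difference (the "advantage") and the sum of the options' Q-values, with Q-values updated by a delta rule. The parameter names (V0, w_d, w_s) follow the standard advantage-framework parameterization; all values and simplifications (no start-point noise, no non-decision time) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed parameterization: V0 = urgency, w_d = advantage weight,
# w_s = sum weight, alpha = learning rate, a = threshold. Values are
# illustrative only.
V0, w_d, w_s = 1.0, 2.0, 0.2
alpha, a, dt = 0.1, 1.5, 0.001

def ard_trial(Q):
    """Race two accumulators whose drifts combine the Q-value
    difference (the 'advantage') and the Q-value sum."""
    v = np.array([V0 + w_d * (Q[0] - Q[1]) + w_s * (Q[0] + Q[1]),
                  V0 + w_d * (Q[1] - Q[0]) + w_s * (Q[0] + Q[1])])
    x, t = np.zeros(2), 0.0
    while np.all(x < a):                      # race until one accumulator crosses
        x += v * dt + rng.normal(0.0, np.sqrt(dt), 2)
        t += dt
    return int(np.argmax(x)), t               # (choice, decision time)

Q = np.zeros(2)                               # Q-values for the two options
p_reward = np.array([0.8, 0.2])               # e.g., one stimulus pair of a bandit task
for trial in range(100):
    choice, rt = ard_trial(Q)
    reward = float(rng.random() < p_reward[choice])
    Q[choice] += alpha * (reward - Q[choice]) # delta-rule update of the chosen option
```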
Article activity feed
Reviewer #2:
While much independent progress has been made in the development of RL models for learning and DDM-like models for decision-making, only recently have people begun to combine the two (e.g. Pedersen et al., 2017). In this paper, Miletić et al. develop a new set of combined reinforcement learning (RL) and evidence-accumulation models (EAM) in an attempt to account for learning/choice data and reaction time data in a series of probabilistic selection tasks (Frank et al., 2004). While previous developments have provided proof-of-concept that these models can be joined, here the authors present a new model, Advantage Racing Diffusion, which additionally captures stimulus difficulty, speed-accuracy trade-offs, and reversal learning effects. Using behavioral experiments and Bayesian model selection techniques, the authors demonstrate a superior fit to choice/RT data with their model relative to similar alternatives. These results suggest that the Advantage framework may be a key element in capturing choice/RT behavior during instrumental learning tasks.
I think this paper asks some really interesting questions, the methods are quite sound, and it is nicely written. I do think that the Advantage learning element is central to the study's novelty. However, I feel that the framing of the paper and the implementation are somewhat at odds, and thus additional experiments (or re-analyses of extant data sets) may be needed to transform the paper from a welcome, if modest, incremental improvement into a qualitative theoretical advance. I outline my major concerns/suggestions below:
Major Points:
In the abstract, the authors allude both to learning tasks with >2 options and to the role of absolute values of choices in characterizing the limitations of the typical DDM. However, in the manuscript the former is not addressed (and actually does not appear to be amenable to the current model implementation; see below), and the latter is addressed via modest improvements to model fits rather than a true qualitative divergence between their model and other models' ability to capture specific behavioral effects. Thus, I think the authors could greatly strengthen their conclusions if they extended their model to RL data sets with a) >2 options, and b) variations in the absolute mean reward across blocks of learning trials. For instance, does their model predict set-size effects during instrumental learning? Does their model predict qualitative shifts in choice and RT when different task blocks have different mean (µ) rewards? At the moment the primary results are improved fits, but I think it would be important to show the model's unique ability to capture more salient qualitative behavioral effects.
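A small worked example (with hypothetical weights, reused from the sketch above) illustrates why this manipulation is diagnostic: a difference-based DDM drift is blind to the absolute reward level, whereas the ARD's sum term is not.

```python
# Hypothetical illustration: two blocks with the same Q-value difference
# but different mean rewards. The weights are assumed, not fitted.
V0, w_d, w_s = 1.0, 2.0, 0.2

for Q in [(0.7, 0.3), (0.9, 0.5)]:            # same difference; higher mean in block 2
    ddm_drift = w_d * (Q[0] - Q[1])           # 0.8 in both blocks: no magnitude effect
    ard_drift = V0 + w_d * (Q[0] - Q[1]) + w_s * (Q[0] + Q[1])
    print(Q, ddm_drift, round(ard_drift, 2))  # ARD drift rises with mean reward -> faster RTs
```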
Moreover, I'm not sure I understand how the winning model would easily transfer to >2 options. As depicted in Equation 1, the model depends on the difference between two unique Q-values (weighted by w_d). How would this be implemented with >2 options? I see some paths forward on this (e.g., the current Q relative to the top Q-value, the current Q minus the average, etc.), but they seem to require somewhat arbitrary heuristics. Perhaps the authors could incorporate modulation of drift rates by policies? Or use an actor-critic approach? I may be missing something, but I think if the model in its current form doesn't accurately transfer to >2 options, the primary contribution is the utility of urgency, which has been presented in earlier studies.
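As one illustration of the paths forward mentioned above, a pairwise-advantage generalization could look like the following sketch. This is purely hypothetical: the function name and parameter values are invented for illustration and are not necessarily the authors' choice.

```python
import numpy as np

def ard_drifts(Q, V0=1.0, w_d=2.0, w_s=0.2):
    """One possible N-alternative generalization: each accumulator's drift
    sums its pairwise advantages (and pairwise sums) over all other options.
    Illustrative only."""
    Q = np.asarray(Q, dtype=float)
    v = np.empty(len(Q))
    for i in range(len(Q)):
        others = np.delete(Q, i)
        v[i] = V0 + w_d * np.sum(Q[i] - others) + w_s * np.sum(Q[i] + others)
    return v

print(ard_drifts([0.8, 0.5, 0.2]))   # highest-valued option gets the largest drift
```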
I appreciate the rigorous parameter recovery experiments in the supplement, but I think the authors could also perform a model separability analysis (e.g., plot a confusion matrix) - it seems several of the models are relatively similar and it could be useful to see if they're confusable (though I imagine they're mostly separable).
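A schematic of the requested model-recovery analysis might look as follows; `simulate` and `fit_bpic` are hypothetical placeholders standing in for the authors' own simulation and hierarchical-fitting routines.

```python
import numpy as np

rng = np.random.default_rng(1)
models = ["RL-DDM", "RL-RD", "RL-IARD", "RL-ARD"]

def simulate(gen_model):
    """Placeholder: the real analysis would simulate choices/RTs from
    `gen_model` at plausible (e.g., posterior-mean) parameter values."""
    return gen_model

def fit_bpic(model, data):
    """Placeholder: the real analysis would fit `model` to `data`
    hierarchically and return its BPIC (lower is better)."""
    return rng.normal(0.0 if model == data else 1.0, 0.5)

n_sims = 50
confusion = np.zeros((len(models), len(models)))   # rows: generating model, cols: winner
for g, gen in enumerate(models):
    for _ in range(n_sims):
        data = simulate(gen)
        bpics = [fit_bpic(m, data) for m in models]
        confusion[g, int(np.argmin(bpics))] += 1

print(confusion / n_sims)   # diagonal near 1 => the models are separable
```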
I may be missing something, but I do not think the authors are implementing SARSA. SARSA is: Q(s,a) ← Q(s,a) + lr * (r + discount * Q(s',a') − Q(s,a)), where (s',a') is the next state-action pair. However, this is a single-step task... isn't it just 'SAR' (aka the standard Rescorla-Wagner delta rule)?
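The reviewer's point can be made explicit: with no successor state, the discounted term in the SARSA target drops out and the update reduces to the Rescorla-Wagner delta rule. A minimal sketch of this standard textbook identity:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # Full SARSA target: reward plus discounted Q of the *next* state-action pair.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def delta_rule_update(Q, a, r, alpha=0.1):
    # Single-step ("bandit") case: there is no successor state, so the
    # discounted term vanishes and SARSA reduces to Rescorla-Wagner.
    Q[a] += alpha * (r - Q[a])

Q = np.zeros(2)
delta_rule_update(Q, a=0, r=1.0)   # Q[0] moves toward the received reward
print(Q)
```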
Reviewer #1:
This is a rigorous and very interesting study on a timely topic: combining modeling traditions of (reinforcement) learning and decision-making. The central claim of the paper is that the often-used combination of reinforcement learning with the drift diffusion model does not provide an adequate model of instrumental learning, but that the recently proposed "advantage accumulation framework" does. This claim will likely be of interest for anyone studying learning and decision-making, ranging from mathematical psychologists to neuroscientists running animal labs. I have a number of concerns regarding this paper.
I think the basic behavior and model fit quality should be better described. The reinforcement-learning + evidence-accumulation models (RL-EAM) are fitted to choices and reaction times (RTs). I therefore find it odd that we don't get to see any actual RT distributions, but only their 10th, 50th and 90th percentiles. What did the grand average RT distribution and model predictions look like (pooled across subjects and trials)? How much variability was there across subjects? I understand that the model was fit hierarchically, but it would be nice (i) to see a distribution of fit quality across subjects, (ii) to see RT distributions of a couple of good and bad fits, and (iii) to check whether the results hold after excluding the subjects with the worst fits (if there are any outliers). Relatedly, in the RT percentile plots (Figures 3 & 4), it would be nice to see some measure of variability across subjects.
The authors pit four competing RL-EAMs against one another. I have a number of issues with the way this is done:
- The qualitative model fits presented in Figure 3 are potentially misleading, as the competing models have different numbers of free parameters: DDM, 4; RL-RD, 5; RL-IARD, 5; RL-ARD, 6. RL-ARD has the most free parameters, which might trivially lead to the best visual fit. For this reason, I find the BPIC results more compelling, since BPIC penalizes model complexity, and I think these should feature more prominently (perhaps even as bars in the main figure?).
- All three racing diffusion models implement an urgency signal. Why did the authors not consider a similar mechanism within the DDM framework? Here, urgency could be implemented either as (linearly or hyperbolically) collapsing bounds, or as self-excitation (the inverse of leak); both require only one extra parameter (see the sketch below).
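As a sketch of the first of these suggestions, a DDM with a linearly collapsing bound adds a single collapse-rate parameter relative to the fixed-bound DDM. All names and values here are hypothetical, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

def ddm_collapsing_bound(v, a=1.5, k=0.5, dt=0.001, t_max=5.0):
    """DDM with a linearly collapsing bound b(t) = max(a - k*t, 0).
    The collapse rate k is the one extra parameter. Returns (choice, RT)."""
    x, t = 0.0, 0.0
    while t < t_max:
        b = max(a - k * t, 0.0)     # bound shrinks over time: growing urgency
        if x >= b:
            return 1, t             # upper bound: choose option A
        if x <= -b:
            return 0, t             # lower bound: choose option B
        x += v * dt + rng.normal(0.0, np.sqrt(dt))
        t += dt
    return int(x > 0), t_max        # forced response if no bound is reached

print(ddm_collapsing_bound(v=0.8))
```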
I could imagine a scenario in which the decision-making process becomes progressively biased toward the more rewarding stimulus. In fact, this can be observed in Figure 7. Therefore, I wonder if the authors have considered RL-EAMs in which the choice boundaries do not correspond to correct vs. error, but instead to the actual choice alternatives (stimulus A vs. B). In such an implementation, one can fit bias parameters such as the starting point and/or a drift bias.
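A minimal sketch of such a stimulus-coded parameterization, with a biased starting point; the function name and all values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def stimulus_coded_ddm(v, z=0.6, a=1.0, dt=0.001):
    """Stimulus-coded DDM: upper bound = 'choose A', lower bound = 'choose B'
    (not correct vs. error). A relative start point z > 0.5 begins accumulation
    closer to A's bound, implementing the bias the reviewer describes."""
    x, t = a * (z - 0.5), 0.0       # map z in (0,1) onto the interval (-a/2, a/2)
    while abs(x) < a / 2:
        x += v * dt + rng.normal(0.0, np.sqrt(dt))
        t += dt
    return ("A" if x > 0 else "B"), t

print(stimulus_coded_ddm(v=0.5))
```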
The authors write that RL-EAMs assume that "[...] a subject gradually accumulates evidence for each choice option by sampling from a distribution of memory representations of the subjective value (or expected reward) associated with each choice option (known as Q-values)." Sampling from a distribution of memory representations is a relatively new idea, and I think it would help if the authors were more circumspect in interpreting these results, and also provided more context and rationale in both the Introduction and the Discussion. For example, an interesting Discussion paragraph would be one on how such a memory-sampling process might actually be implemented in the brain.