Normative decision rules in changing environments

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    This paper investigates scenarios in which the environment changes during the course of a decision, and shows that optimal behavior can be highly complex. It will be of broad interest to researchers in psychology, behavioural economics, and neuroscience interested in decision-making in real-world tasks. It also awaits detailed empirical testing.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #3 agreed to share their name with the authors.)

Abstract

Models based on normative principles have played a major role in our understanding of how the brain forms decisions. However, these models have typically been derived for simple, stable conditions, and their relevance to decisions formed under more naturalistic, dynamic conditions is unclear. We previously derived a normative decision model in which evidence accumulation is adapted to fluctuations in the evidence-generating process that occur during a single decision (Glaze et al., 2015), but the evolution of commitment rules (e.g. thresholds on the accumulated evidence) under dynamic conditions is not fully understood. Here, we derive a normative model for decisions based on changing contexts, which we define as changes in evidence quality or reward, over the course of a single decision. In these cases, performance (reward rate) is maximized using decision thresholds that respond to and even anticipate these changes, in contrast to the static thresholds used in many decision models. We show that these adaptive thresholds exhibit several distinct temporal motifs that depend on the specific predicted and experienced context changes and that adaptive models perform robustly even when implemented imperfectly (noisily). We further show that decision models with adaptive thresholds outperform those with constant or urgency-gated thresholds in accounting for human response times on a task with time-varying evidence quality and average reward. These results further link normative and neural decision-making while expanding our view of both as dynamic, adaptive processes that update and use expectations to govern both deliberation and commitment.
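    To illustrate what a change in evidence quality does to evidence accumulation, the sketch below assumes a standard discretised drift-diffusion setting that the abstract does not spell out: two hypotheses H+ and H-, Gaussian observations whose drift (evidence quality) is known but time-varying, and a flat prior. Under those assumptions each observation x adds 2 · drift · x to the log-likelihood ratio, so evidence arriving after the quality increases is weighted more heavily. The function names and parameter values are hypothetical, and this is not presented as the authors' model.

```python
import numpy as np


def accumulate_llr(observations, drift):
    """Running log-likelihood ratio log[P(x|H+) / P(x|H-)] for Gaussian increments.

    Assumes observations[k] ~ Normal(+drift[k] * dt, dt) under H+ and
    Normal(-drift[k] * dt, dt) under H-, so each step adds 2 * drift[k] * x[k].
    """
    x = np.asarray(observations, dtype=float)
    mu = np.asarray(drift, dtype=float)
    return np.cumsum(2.0 * mu * x)


def belief_plus(llr):
    """Posterior probability of H+ under a flat (50/50) prior."""
    return 1.0 / (1.0 + np.exp(-np.asarray(llr, dtype=float)))


# Example: evidence quality (drift) jumps from 0.5 to 2.0 one second into the trial.
rng = np.random.default_rng(0)
dt = 0.01
t = dt * np.arange(200)
drift = np.where(t < 1.0, 0.5, 2.0)
obs = rng.normal(drift * dt, np.sqrt(dt))  # H+ is the true state in this simulation
p = belief_plus(accumulate_llr(obs, drift))
```

    Because each observation is weighted by the current drift, the same raw observation moves the log-likelihood ratio four times as far after the switch as before it.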

  Reviewer #1 (Public Review):

    This paper considers decision-making problems when information and/or reward changes over time. It shows that the policy - the decision boundary that tells subjects when to make a decision - can have a very complicated shape; much more complicated than is typically considered. The authors use well-established techniques in reinforcement learning, but apply them in regimes where they are not normally used. Possibly the most important aspect of the paper is that it presents the relevant techniques in a reasonably accessible manner (and with a little work it could become very accessible). The paper also shows, in one non-trivial decision-making task, that normative models outperform heuristic ones by a large margin.

  Reviewer #2 (Public Review):

    Barendregt et al. derive normative decision strategies for choices between two alternatives in dynamic environments. In these environments, either the reward for a decision or the reliability of the decision-supporting evidence changes deterministically within individual decisions, in a way known to the decision maker. In all considered instances, normative decisions require accumulating noisy evidence to successively improve one's belief about which choice yields reward. The normative strategy corresponds to two time-varying thresholds on this belief whose shapes depend on the various costs and rewards, and on the amount and dynamics of evidence provided to the decision maker. The authors show that, if the reward for correct choices changes abruptly at a fixed, known time after stimulus onset, optimal decision thresholds might evolve non-monotonically with time and differ qualitatively for reward increases and reward decreases. If it is instead the quality of evidence that changes, optimal decision thresholds always evolve monotonically across time. The authors furthermore demonstrate under which conditions decision makers might lose the most reward when acting according to simpler heuristics rather than the normative strategy, even in the presence of additional sensory and motor noise. Lastly, they derive the normative strategy for the tokens task of Cisek et al. (2009) and show that human behavior in that task is best described by this normative strategy.
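    Time-varying thresholds of this kind can be obtained by backward induction (dynamic programming). The following is a minimal sketch under illustrative assumptions, not the authors' formulation: a discretised diffusion with known drift, a grid over the log-likelihood ratio with truncated edges, a per-second waiting cost `cost`, a forced choice at the final step, and an objective of single-trial reward minus time cost rather than the long-run reward rate the paper maximises. The hypothetical array `reward[k]` gives the reward for a correct choice made at step k, so an abrupt change from R_L to R_H at a known time can be encoded directly.

```python
import numpy as np


def normal_pdf(x, mean, var):
    """Gaussian density, evaluated elementwise."""
    return np.exp(-((x - mean) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def optimal_thresholds(reward, drift=1.0, dt=0.05, cost=0.1, y_max=6.0, n_grid=241):
    """Backward induction for a finite-horizon two-choice task (illustrative sketch).

    reward[k]: reward for a correct choice made at step k (incorrect choices earn 0).
    Returns theta[k], the upper threshold on the log-likelihood ratio at step k
    (np.inf where stopping is never optimal); the lower threshold is -theta[k].
    """
    reward = np.asarray(reward, dtype=float)
    y = np.linspace(-y_max, y_max, n_grid)                 # grid over the log-likelihood ratio
    p = 1.0 / (1.0 + np.exp(-y))                           # belief that H+ is correct
    mean, var = 2.0 * drift**2 * dt, 4.0 * drift**2 * dt   # LLR increment under H+
    step = y[None, :] - y[:, None]                         # step[i, j] = y_j - y_i
    trans = (p[:, None] * normal_pdf(step, mean, var)
             + (1.0 - p[:, None]) * normal_pdf(step, -mean, var))
    trans /= trans.sum(axis=1, keepdims=True)              # renormalise rows after truncating the grid

    n_steps = len(reward)
    theta = np.full(n_steps, np.inf)
    theta[-1] = 0.0                                        # forced choice at the final step
    value = reward[-1] * np.maximum(p, 1.0 - p)
    for k in range(n_steps - 2, -1, -1):
        stop = reward[k] * np.maximum(p, 1.0 - p)          # expected reward of choosing now
        cont = trans @ value - cost * dt                   # wait one step and pay the time cost
        stop_region = (stop >= cont) & (y >= 0.0)
        if stop_region.any():
            theta[k] = y[stop_region].min()
        value = np.maximum(stop, cont)
    return theta


# Reward for a correct choice switches from R_L = 1 to R_H = 2 at step 20 of 60.
switch_step = 20
reward = np.where(np.arange(60) < switch_step, 1.0, 2.0)
theta = optimal_thresholds(reward)
print(np.round(theta, 2))
```

    With these parameters the computed threshold is infinite before the reward increase (waiting for the higher reward is always worth more than stopping) and finite afterwards, a simple analogue of the infinite thresholds discussed by Reviewer #3. Because the sketch is symmetric in the two alternatives, only the upper threshold is returned.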

    Strengths:

    Decision-making research has in the past significantly benefited from a detailed understanding of normative decision strategies. This work extends our understanding of these strategies, which is expected to benefit future theoretical and experimental decision-making work.

    The decision strategies found are not always obvious at the outset, but are intuitive in hindsight.

    A normative decision strategy for the now well-known tokens task has been missing. The authors derive this strategy and show that it matches human decisions better than all considered alternative heuristics.

    Weaknesses:

    One of the tested alternative heuristics is the Urgency Gating Model (UGM) from Cisek et al. (2009). In Cisek et al. (2009) it is described as performing low-pass filtering of the immediate/momentary evidence, but in this manuscript it performs low-pass filtering of the accumulated evidence. Thus, the two models seem to differ, and why the authors chose to implement a different variant remains unclear.
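    To make the distinction between the two UGM variants concrete, here is a minimal sketch of a UGM-style rule under stated assumptions: a first-order low-pass filter with time constant `tau`, a linear urgency signal u(t) = t, and a fixed threshold. The names `tau`, `threshold` and `filter_accumulated` are hypothetical, and neither variant is claimed to be the exact implementation in Cisek et al. (2009) or in the manuscript.

```python
import numpy as np


def low_pass(signal, tau, dt):
    """First-order low-pass filter dy/dt = (s - y) / tau, Euler-discretised, y(0) = 0."""
    out = np.zeros(len(signal))
    y = 0.0
    for k, s in enumerate(signal):
        y += (dt / tau) * (s - y)
        out[k] = y
    return out


def ugm_decision(momentary, dt, tau=0.2, threshold=1.0, filter_accumulated=False):
    """Return (decision_time, choice) for a UGM-style rule, or (None, 0) if no crossing.

    The decision variable is u(t) * filtered evidence with a linear urgency u(t) = t.
    filter_accumulated=False filters the momentary evidence; True filters its running sum.
    """
    if filter_accumulated:
        evidence = np.cumsum(momentary)
    else:
        evidence = np.asarray(momentary, dtype=float)
    filtered = low_pass(evidence, tau, dt)
    t = dt * np.arange(1, len(filtered) + 1)
    decision_var = t * filtered
    crossed = np.flatnonzero(np.abs(decision_var) >= threshold)
    if crossed.size == 0:
        return None, 0
    k = crossed[0]
    return float(t[k]), int(np.sign(decision_var[k]))


# A hypothetical stream of constant +1 momentary evidence.
stream = np.ones(50)
print(ugm_decision(stream, dt=0.1))
print(ugm_decision(stream, dt=0.1, filter_accumulated=True))
```

    In this example the accumulated-evidence variant crosses threshold at t = 0.4, whereas the momentary-evidence variant crosses at t = 1.1, so the choice of variant can change predicted response times substantially. Note that neither rule depends on the reward schedule.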

    Participants in the tokens task of Cisek et al. (2009) performed two task conditions that are fitted separately in this manuscript, yielding model parameters that differ by an order of magnitude. This difference might arise because human behavior does not constrain some parameters well, or because participants used significantly different decision strategies in the two conditions, but the manuscript provides no explanation for it.

  Reviewer #3 (Public Review):

    The goal of Barendregt et al. is to extend normative models of decision thresholds to changing environments. The immediate precursors of this work are Drugowitsch et al. (2012) and Malhotra et al. (2018), both of which derive optimal decision boundaries using dynamic programming. However, both of those papers assumed a stationary environment. Barendregt et al. relax this assumption and show that non-stationary environments predict some very strange decision boundaries: decision boundaries can be non-monotonic or infinite, depending on the change in the environment. They consider two types of change: a change in reward and a change in signal-to-noise ratio. Decision boundaries for a change in reward are particularly intriguing. To show empirical support for their theory, Barendregt et al. compare decision boundaries derived from their model for the Tokens task with the Urgency Gating Model (UGM) and show that their model fits the data better, at least under some conditions.

    Here are my thoughts on the paper:

    1. The theory of the paper is elegantly developed and clearly presented. While I can't be certain that there are no errors in the theory or simulation, the results presented based on this theory make intuitive sense.

    2. The authors have developed the theory diligently and explored different predictions. They not only present some example thresholds for a few selected conditions but explore the space of possible types of thresholds (Figure 2C & 3C). They go further and explore the benefits of adopting this theory over UGM and constant thresholds (Figure 3) and they also show some evidence that participant behaviour is more in line with their model than UGM in a previous study (the "Tokens task").

    3a. As much as I appreciate the authors' efforts (and the elegance of the theory), it seems to me that the notion of 'changing environments' explored by the authors is quite limited. The decision thresholds are derived from a world in which an observer makes a (large) sequence of decisions and every decision has exactly the same form of change. For example, in one of the reward-change tasks, the reward switches from low to high during every trial. In other words, the environment changes repeatedly in every trial (and in exactly the same manner). There may be some circumstances in the natural world where such a setup is justified; the authors identify one where change is a function of the time of day. But in many circumstances, the environment changes at an entirely different timescale, over the course of a sequence of trials. For example, a foraging animal may make a sequence of decisions in a scarce environment, followed by another sequence of decisions in a plentiful environment. That is, the statistics of the environment change over several trials. As far as I can see, the assumptions made by the authors mean that the results of the model cannot be applied to changes that occur at this timescale.

    3b. One area where integrate-to-threshold models have been particularly successful is perceptual decision-making, for example, motion perception (Shadlen & Newsome, 1996) or brightness perception (Ratcliff, 2003). This is where we have evidence of something like an integration signal in the cortex. However, these decisions are typically very fast, occurring at sub-second intervals. Another area is lexical decision tasks (e.g. Wagenmakers et al., 2008), where mean reaction times are below 1 s and frequently much faster. It is difficult to imagine that the model developed by the authors has much bearing on these types of decisions: firstly, because it is unlikely that the reward structure in natural environments fluctuates at these timescales, and secondly, because participants are unlikely to pick up on such changes over the course of a short sequence of trials.

    3c. This does not mean that the model developed by Barendregt et al. is of no value. There will be situations (like the Tokens task) where the model will be the correct normative model. But these limitations are important to clarify for researchers in the field.

    4. The weakest part of the paper is its empirical support. The authors apply their model to the Tokens task. First of all, this is by no means the modal task used to study decision-making. The model developed by the authors simply does not apply to most perceptual decision-making tasks (see 3b above). So the ideal case would have been to design a task based on predictions of the model. For example, there is a clear prediction about RTs in Figure 4D, but this has never been tested. (My own view is that this prediction will only bear out under some scenarios, e.g. when decision-making is slow, but not under others.) There are also some highly unusual boundaries predicted by the model, e.g. Figure 2i, 2ii, 2iv. I doubt that participants ever adopt a boundary like this. The authors could have tested this, but haven't. I don't want to ask the authors to design and run these studies at this stage (it seems like a lot of work) but, at the very least, it would be good if the authors discussed whether they predict these highly idiosyncratic boundaries to bear out in empirical data. For example, an "infinite" threshold (Figure 2i, 2ii) means that participants never make a decision in this interval, even if they receive highly informative cues during this interval. Or do the authors believe that participants adopt some heuristic boundaries that approximate these normative boundaries? Currently, the authors seem to be arguing against heuristic models. Or perhaps they have a different heuristic model in mind? It would be good to know.

    5. One neat aspect of the paper is showing that there are some participants who show non-monotonic boundaries in the Tokens task. This task was specifically designed to justify the UGM. But the authors show that their model fits some participants better than UGM itself. To the best of my knowledge, this is the first demonstration of the fact that participants can show non-monotonic decision boundaries.

    7. Parts of the write-up need to make better contact with the existing literature on boundary shapes. Here are some studies that come to mind:
    7a. Some early models to predict dynamic decision boundaries were proposed by Busemeyer & Rapoport (1988) and Rapoport & Burkheimer (1971) in the context of a deferred decision-making task.
    7b. One of the earliest models to use dynamic programming to predict non-constant decision boundaries was Frazier & Yu (2007). Indeed, some boundaries predicted by the authors (e.g. Fig 2v) are very similar to boundaries predicted by this model. In fact, the switch from high to low reward used to produce the boundaries in Fig 2v can be seen as a "softer" version of the deadline task in Frazier & Yu (2007).
    7c. Another early observation that time-varying boundaries can account for empirical data was made by Ditterich (2006). This work seems highly relevant to the authors' predictions but is not cited.
    7d. The authors seem to imply that their results are the first results showing non-monotonic thresholds. This is not true. See, for example, Malhotra et al. (2018). What is novel here is the specific shape of these non-monotonic boundaries.

    8. One of the more realistic scenarios is presented in Fig 2-Figure supplement 3, where the reward doesn't switch at a fixed time but instead follows a Markov process. However, the authors do not provide enough details of the task or the results. Is m_R = R_H / R_L? Is it the dark line that corresponds to m_R = \infty (as indicated by the legend) or the dotted line (as indicated by the caption)? For what value of drift are these thresholds derived?
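    For readers unfamiliar with Markov-switching reward, here is a hedged sketch of one possible setup (the paper's exact parametrisation is not given here): the reward flips between R_L and R_H as a two-state continuous-time Markov chain with hypothetical switching rates `rate_up` (R_L to R_H) and `rate_down` (R_H to R_L). The expected reward a time tau ahead relaxes exponentially from the current state toward the stationary mixture, which is why optimal thresholds in such a task would plausibly depend on the current reward state rather than on elapsed time alone. Setting `r_low = 0` corresponds to an infinite m_R, if m_R is defined as R_H / R_L as the reviewer suggests.

```python
import numpy as np


def expected_reward(tau, start_high, r_low, r_high, rate_up, rate_down):
    """Expected reward tau time units ahead for a two-state continuous-time Markov chain.

    Solves dP_high/dt = rate_up * (1 - P_high) - rate_down * P_high, starting from
    P_high = 1 (currently R_H) or P_high = 0 (currently R_L).
    """
    total = rate_up + rate_down
    p_high_stationary = rate_up / total
    p_high0 = 1.0 if start_high else 0.0
    p_high = p_high_stationary + (p_high0 - p_high_stationary) * np.exp(-total * np.asarray(tau, dtype=float))
    return p_high * r_high + (1.0 - p_high) * r_low


# Expected reward over the next two time units, starting in the low-reward state.
taus = np.linspace(0.0, 2.0, 5)
print(expected_reward(taus, start_high=False, r_low=1.0, r_high=3.0, rate_up=0.5, rate_down=0.5))
```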

    9. Figure 4F: It is not clear to me why the UGM in the zero-noise condition has RTs aligned to the time at which reward increases from R1 to R2. Surely this model does not take RR into account when computing its thresholds, does it? In fact, looking at Figure 4B, Supplement 1, the thresholds are always highest at t=0. Perhaps the authors can clarify.