The value of initiating a pursuit in temporal decision-making

Elissa Sutlief
Charlie Walters
Tanya Marton
Marshall G Hussain Shuler

Curated by eLife

eLife Assessment

This paper undertakes a valuable theoretical treatment of the potential role of foraging-related concepts in several forms of intertemporal choice. While the computational evidence and methodologies employed are novel, some issues with clarity and generality result in incomplete support for the paper's claims.

Abstract

Reward-rate maximization is a prominent normative principle commonly held in behavioral ecology, neuroscience, economics, and artificial intelligence. Here, we identify and compare equations for evaluating the worth of initiating pursuits that an agent could implement to enable reward-rate maximization. We identify two fundamental temporal decision-making categories requiring the valuation of the initiation of a pursuit (forgo and choice decision-making), over which we generalize and analyze the optimal solution for how to evaluate a pursuit in order to maximize reward rate. From this reward-rate-maximizing formulation, we derive expressions for the subjective value of a pursuit, i.e., that pursuit's equivalent immediate reward magnitude, and reveal that time's cost is composed of an apportionment cost in addition to an opportunity cost. By re-expressing subjective value as a temporal discounting function, we show precisely how the temporal discounting function of a reward-rate-optimal agent is sensitive not just to the properties of a considered pursuit, but to the time spent and reward acquired outside of the pursuit for every instance spent within it. In doing so, we demonstrate how the apparent discounting function of a reward-rate-optimizing agent depends on the temporal structure of the environment and is a combination of hyperbolic and linear components, whose contributions relate to the apportionment and opportunity cost of time, respectively. We then show how purported signs of suboptimal behavior (hyperbolic discounting, the Delay effect, the Magnitude effect, the Sign effect) are in fact consistent with reward-rate maximization. Having clarified what features are and are not signs of optimal decision-making, we analyze the impact of the misestimation of reward-rate-maximizing parameters in order to better account for the pattern of errors actually observed in humans and animals.
We find that error in agents' assessment of the apportionment of time that underweights the time spent outside versus inside a considered pursuit type is the likely driver of suboptimal temporal decision-making observed behaviorally. We term this the Malapportionment Hypothesis. This generalized form for reward-rate maximization and its relation to subjective value and temporal discounting allows the true pattern of errors exhibited by humans and animals to be more deeply understood, identified, and quantified, which is key to deducing the learning algorithms and representational architectures actually used by humans and animals to evaluate the worth of pursuits.

Version published to 10.1101/2024.06.16.599189v2 on bioRxiv
Nov 22, 2024
eLife
Oct 2, 2024

Reviewer #1 (Public review):

Summary:

This theoretical paper addresses how to optimize reward-rate-maximizing decisions in certain foraging-style environments. It presents a series of equations and graphical illustrations for quantities such as reward rates and time-related costs that a decision maker could estimate as a basis for such decisions. One of the main takeaways is that if the hypothetical agent underweights the time spent outside a focal reward pursuit relative to the time spent within it, this can predict a broadly realistic pattern of impatience in two alternative intertemporal choices paired with well-calibrated take-it-or-leave-it decisions. Another takeaway is that if the optimally estimated subjective value of a reward pursuit is plotted as a function of a range of temporal durations, the result resembles a hyperbolic discounting function and is affected in empirically realistic ways by the magnitude and sign of the reward. Thus, the rate-maximization framework might lead to a hypothesis about the basis for the magnitude and sign effects in discounting.
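As a rough illustration of the hyperbolic-looking discounting and the magnitude dependence mentioned in this summary, the Python sketch below evaluates subjective value over a pursuit's duration and re-expresses it as an apparent discounting function. It assumes the form sv = (r_in - rho_out * t_in) * t_out / (t_in + t_out), which follows from sv = r - rho*t when rho is taken to be the overall reward rate (r_in + r_out) / (t_in + t_out). The function names and all numerical values are hypothetical and are not taken from the paper.

```python
def subjective_value(r_in, t_in, r_out, t_out):
    """Equivalent immediate reward of a pursuit (assumed form, see lead-in)."""
    rho_out = r_out / t_out                      # reward rate outside the pursuit
    opportunity_adjusted = r_in - rho_out * t_in
    return opportunity_adjusted * t_out / (t_in + t_out)


def apparent_discount(r_in, t_in, r_out, t_out):
    """Subjective value as a fraction of the pursuit's reward magnitude."""
    return subjective_value(r_in, t_in, r_out, t_out) / r_in


def discount_components(r_in, t_in, r_out, t_out):
    """Split the apparent discount into a hyperbolic and a linear factor."""
    hyperbolic = 1.0 / (1.0 + t_in / t_out)           # tied to apportionment
    linear = 1.0 - (r_out / t_out) * t_in / r_in      # tied to opportunity cost
    return hyperbolic, linear


if __name__ == "__main__":
    # Illustrative environment: 1 unit of reward per 10 units of outside time.
    for r_in in (2.0, 10.0):
        fractions = [apparent_discount(r_in, t, 1.0, 10.0) for t in (1, 5, 10)]
        print(r_in, [round(x, 3) for x in fractions])
```

Under these made-up numbers the larger reward retains a greater fraction of its magnitude at every delay, which is the direction of the magnitude effect; whether this holds in general depends on the sign and size of the outside reward rate.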

Strengths:

The paper makes a useful contribution by broadening the application of reward-rate maximization to time-related decision scenarios. The paper's breadth of scope includes applying the same framework to accept/reject decisions and multi-alternative discounting decisions. The figures take a creative approach to illustrating the internal quantities in the model. It's particularly useful that the paper gives consideration to internal distortions that could give rise to documented anomalies in decision behavior.

Weaknesses:

(1) Although there are many citations acknowledging relevant previous work, there often isn't a very granular attribution of individual previous findings to their sources. In the results section, it's sometimes ambiguous when the paper is recapping established background and when it is breaking new ground. For example, around equation 8 in the results (sv = r - rho*t), it would be good to refer to previous places where versions of this equation have been presented. Offhand, McNamara 1982 (Theoretical Population Biology) is one early instance and Fawcett et al. 2012 (Behavioural Processes) is a later one. Line 922 of the discussion seems to imply this formulation is novel here.

(2) The choice environments that are considered in detail in the paper are very simple. The simplicity facilitates concrete examples and visualizations, but it would be worth further consideration of whether and how the conclusions generalize to more complex environments. The paper considers a "forgo" scenario in which the agent can choose between sequences of pursuits like A-B-A-B (engaging with option B at all opportunities, which are interleaved with a default pursuit A) and A-A-A-A (forgoing option B). It considers "choice" scenarios where the agent can choose between sequences like A-B-A-B and A-C-A-C (where B and C are larger-later and smaller-sooner rewards, either of which can be interleaved with the default pursuit). Several forms of additional complexity would be valuable to consider. One would be a greater number of unique pursuits, not repeated identically in a predictable sequence, akin to a prey-selection paradigm. It seems to me this would cause t_out and r_out (the time and reward outside of the focal prospect) to be policy-dependent, making the 'apportionment cost' more challenging to ascertain. Another relevant form of complexity would be if there were variance or uncertainty in reward magnitudes or temporal durations or if the agent had the ability to discontinue a pursuit such as in patch-departure scenarios.

(3) I had a hard time arriving at a solid conceptual understanding of the 'apportionment cost' around Figure 5. I understand the arithmetic, but it would help if it were possible to formulate a more succinct verbal description of what makes the apportionment cost a useful and meaningful quality to focus on. I think Figure 6C relates to this, but I had difficulty relating the axis labels to the points, lines, and patterned regions in the plot. I also was a bit confused by how the mathematical formulation was presented. As I understood it, the apportionment cost essentially involves scaling the rest of the SV expression by t_out/(t_in + t_out). The way this scaling factor is written in Figure 5C, as 1/(1 + (1/t_out)t_in), seems less clear than it could be. Also, the apportionment cost is described in the text as being subtracted from SV rather than as a multiplicative scaling factor. It could be written as a subtraction, by subtracting a second copy of the rest of the SV expression scaled by t_in/(t_in + t_out). But that shows the apportionment cost to depend on the opportunity cost, which is odd because the original motivation on line 404 was to resolve the lack of independence between terms in the SV expression.

(4) In the analysis of discounting functions (line 664 and beyond), the paper doesn't say much about the fact that many discounting studies take specific measures to distinguish true time preferences from opportunity costs and reward-rate maximization. In many of the human studies, delay time doesn't preclude other activities. In animal studies, rate maximization can serve as a baseline against which to measure additional effects of temporal discounting. This is an important caveat to claims about discounting anomalies being rational under rate maximization (e.g., line 1024).

(5) The paper doesn't feature any very concrete engagement with empirical data sets. This is ok for a theoretical paper, but some of the characterizations of empirical results that the model aims to match seem oversimplified. An example is the contention that real decision-makers are optimal in accept/reject decisions (line 816 and elsewhere). This isn't always true; sometimes there is evidence of overharvesting, for example.

(6) Related to the point above, it would be helpful to discuss more concretely how some of this paper's theoretical proposals could be empirically evaluated in the future. Regarding the magnitude and sign effects of discounting, there is not a very thorough overview of the several other explanations that have been proposed in the literature. It would be helpful to engage more deeply with previous proposals and consider how the present hypothesis might make unique predictions and could be evaluated against them. A similar point applies to the 'malapportionment hypothesis' although in this case there is a very helpful section on comparisons to prior models (line 1163). The idea being proposed here seems to have a lot in common conceptually with Blanchard et al. 2013, so it would be worth saying more about how data could be used to test or reconcile these proposals.
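To make the forgo and choice scenarios described in point (2) concrete, the following Python sketch compares the long-run reward rates of repeating sequences such as A-A-A-A, A-B-A-B, and A-C-A-C. The `Pursuit` tuple, the `policy_reward_rate` helper, and every reward and duration are hypothetical values chosen for illustration; they are not drawn from the paper.

```python
from typing import NamedTuple, Sequence


class Pursuit(NamedTuple):
    reward: float
    duration: float


def policy_reward_rate(cycle: Sequence[Pursuit]) -> float:
    """Reward rate of a policy that repeats the given cycle of pursuits."""
    total_reward = sum(p.reward for p in cycle)
    total_time = sum(p.duration for p in cycle)
    return total_reward / total_time


A = Pursuit(reward=1.0, duration=10.0)   # default pursuit
B = Pursuit(reward=6.0, duration=20.0)   # larger-later option
C = Pursuit(reward=3.0, duration=5.0)    # smaller-sooner option

rates = {
    "A-A (forgo B)": policy_reward_rate([A, A]),
    "A-B": policy_reward_rate([A, B]),
    "A-C": policy_reward_rate([A, C]),
}

# Forgo decision: take B only if A-B beats A-A.
take_b = rates["A-B"] > rates["A-A (forgo B)"]
# Choice decision: pick whichever of A-B and A-C yields the higher rate.
preferred = max(("A-B", "A-C"), key=rates.get)

if __name__ == "__main__":
    print(rates, take_b, preferred)
```

With these numbers a rate-maximizing agent would accept B in the forgo case yet prefer the smaller-sooner C in the choice case, which shows why both scenarios have to be evaluated against the same default pursuit.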

Reviewer #2 (Public review):

Summary:

This paper from Sutlief et al. focuses on an apparent contradiction observed in experimental data from two related types of pursuit-based decision tasks. In "forgo" decisions, where the subject is asked to choose whether or not to accept a presented pursuit, after which they are placed into a common inter-trial interval, subjects have been shown to be nearly optimal in maximizing their overall rate of reward. However, in "choice" decisions, where the subject is asked which of two mutually-exclusive pursuits they will take, before again entering a common inter-trial interval, subjects exhibit behavior that is believed to be sub-optimal. To investigate this contradiction, the authors derive a consistent reward-maximizing strategy for both tasks using a novel and intuitive geometric approach that treats every phase of a decision (pursuit choice and inter-trial interval) as vectors. From this approach, the authors are able to show that previously reported examples of sub-optimal behavior in choice decisions are in fact consistent with a reward-maximizing strategy. Additionally, the authors are able to use their framework to deconstruct the different ways the passage of time impacts decisions, demonstrating that time cost contains both an opportunity cost and an apportionment cost, as well as examining how a subject's misestimation of task parameters impacts behavior.

Strengths:

The main strength of the paper lies in the authors' geometric approach to studying the problem. The authors chose to simplify the decision process by removing the highly technical and often cumbersome details of evidence accumulation that are common in most of the decision-making literature. In doing so, the authors were able to utilize a highly accessible approach that is still able to provide interesting insights into decision behavior and the different components of optimal decision strategies.

Weaknesses:

While the details of the paper are compelling, the authors' presentation of their results is often unclear or incomplete:

(1) The mathematical details of the paper are correct but contain numerous notation errors and are presented as a solid block of subtle equation manipulations. This makes the details of the authors' approach (the main contribution of the paper to the field) highly difficult to understand.

(2) One of the main contributions of the paper is the notion that time cost in decision-making contains an apportionment cost that reflects the allocation of decision time relative to the world. The authors use this cost to pose a hypothesis as to why subjects exhibit sub-optimal behavior in choice decisions. However, the equation for the apportionment cost is never clearly defined in the paper, which is a significant oversight that hampers the effectiveness of the authors' claims.

(3) Many of the paper's figures are visually busy and not clearly detailed in the captions (for example, Figures 6-8). Because of the geometric nature of the authors' approach, the figures should be as clean and intuitive as possible, as in their current state, they undercut the utility of a geometric argument.

(4) The authors motivate their work by focusing on previously observed behavior in decision experiments and tell the reader that their model is able to qualitatively replicate this data. This claim would be significantly strengthened by the inclusion of experimental data to directly compare to their model's behavior. Given the computational focus of the paper, I do not believe the authors need to conduct their own experiments to obtain this data; reproducing previously accepted data from the papers the authors reference would be sufficient.

(5) While the authors reference a good portion of the decision-making literature in their paper, they largely ignore the evidence-accumulation portion of the literature, which has been discussing time-based discounting functions for some years. Several papers exist that are either experimentally driven (Cisek et al. 2009, Thura et al. 2012, Holmes et al. 2016) or theoretically driven (Drugowitsch et al. 2012, Tajima et al. 2019, Barendregt et al. 2022), and I would encourage the authors to discuss how their results relate to those in different areas of the field.

Reviewer #3 (Public review):

Summary:

The goal of the paper is to examine the objective function of total reward rate in an environment to understand the behavior of humans and animals in two types of decision-making tasks: (1) stay/forgo decisions and (2) simultaneous choice decisions. The main aims are to reframe the equation of optimizing this normative objective into forms that are used by other models in the literature like subjective value and temporally discounted reward. One important contribution of the paper is the use of this theoretical analysis to explain apparent behavioral inconsistencies between forgo and choice decisions observed in the literature.

Strengths:

The paper provides a nice way to mathematically derive different theories of human and animal behavior from a normative objective of global reward rate optimization. As such, this work has value in trying to provide a unifying framework for seemingly contradictory empirical observations in the literature, such as differentially optimal behaviors in stay-forgo vs. choice decision tasks. The section about temporal discounting is particularly well motivated as it serves as another plank in the bridge between ecological and economic theories of decision-making.

Weaknesses:

One broad issue with the paper is readability. Admittedly, this is a complicated analysis involving many equations, each of which must be grasped to follow the later analyses that build on earlier ones.

What is missing, however, are intuitive interpretations of some of the terms introduced, especially the apportionment cost, given without reference to the equations in the definition, so that the reader gets a sense of how the decision-maker thinks of this time cost in contrast with the opportunity cost of time.

The paper would also benefit from re-analysis of some existing empirical data through the lens of the presented objective functions, especially later, where the authors describe sources of error in behavior.

Author response:

We thank the reviewers for their thoughtful criticisms. This provisional response addresses what we consider the central critiques, with a full, point-by-point reply to follow with the revised manuscript. Central critiques concern 1) providing further clarity about the apportionment cost of time, 2) generality & scope, and 3) clarifying the meaning of key equations.

(1) Apportionment cost

Reviewers commonly identified a need to provide a concise and intuitive definition of apportionment cost, and to explicitly solve and provide for its mathematical expression.

We will add the following definition of apportionment cost to the manuscript: “Apportionment cost is the difference in reward that can be expected, on average, between a policy of taking versus a policy of not taking the considered pursuit, over a time equal to its duration.” While this difference is the apportionment cost of time, the amount that would be expected over a time equal to the considered pursuit under a policy of not taking the considered pursuit is the opportunity cost of time. Together, they sum to Time’s Cost. The above definition of apportionment cost adds to other stated relationships of apportionment cost found throughout the paper (Lines 434,435,447,450).
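The verbal definition above can be sketched numerically. The Python example below computes the opportunity cost and the apportionment cost of a pursuit and checks that they sum to time's cost. It assumes that the reward rate under a policy of not taking the pursuit is r_out / t_out and that the rate under a policy of taking it is (r_in + r_out) / (t_in + t_out); these expressions, the function name `time_costs`, and the numbers are illustrative assumptions rather than the authors' published equations.

```python
def time_costs(r_in, t_in, r_out, t_out):
    """Return (opportunity cost, apportionment cost) of a pursuit's duration."""
    rho_without = r_out / t_out                   # policy of not taking the pursuit
    rho_with = (r_in + r_out) / (t_in + t_out)    # policy of taking the pursuit
    opportunity = rho_without * t_in              # reward expected over t_in when not taking
    apportionment = (rho_with - rho_without) * t_in   # difference between the two policies
    return opportunity, apportionment


r_in, t_in, r_out, t_out = 6.0, 20.0, 1.0, 10.0
opportunity, apportionment = time_costs(r_in, t_in, r_out, t_out)

# Subtractive form: reward minus time's cost (opportunity + apportionment).
sv_subtractive = r_in - (opportunity + apportionment)
# Multiplicative form: opportunity-adjusted reward scaled by t_out / (t_in + t_out).
sv_scaled = (r_in - opportunity) * t_out / (t_in + t_out)

if __name__ == "__main__":
    print(opportunity, apportionment, sv_subtractive, sv_scaled)
```

Under these assumptions the subtractive and multiplicative forms give the same subjective value, and the apportionment cost equals the opportunity-adjusted reward scaled by t_in / (t_in + t_out), the relationship Reviewer #1 describes in point (3).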

As suggested, we will also add equations of apportionment cost, as below.

(2) Generality & Scope

Generality. We will add further examples in support of the generality of these equations for assessing and thinking about the value of initiating a pursuit. Specifically, this will include 1) illustrating forgo decision making in a world composed of multiple pursuits, as in prey selection, 2) demonstrating and examining worlds in which a sequence of pursuits compose a considered pursuit’s ‘outside’, and 3) clarifying how our framework does contend with variance and uncertainty in reward magnitude and occurrence.

Scope. In this manuscript, we consider the worth of initiating one or another pursuit having completed a prior one, and not the issue of continuing within a pursuit having already engaged in it. The worth of continuing a pursuit, as in patch-foraging/give-up tasks, constitutes a third fundamental time decision-making topology which is outside the scope of the current work. It engages a large and important literature, encompassing evidence accumulation, and requires a paper in its own right that can use the concepts and framework developed here. We will further consider applying this framework to extant datasets.

(3) Correction of typographical errors and further explanation of equations.

We would like to correct the two typographical errors identified by the reviewers, which appeared in the equations on line 277 and on line 306, and to provide further explanation of the equations that gave the reviewers pause.

Typographical errors:

The first typographical error in the main text regards equation 2 and will be corrected so that equation 2 appears correctly as…

Line 277:

The second typo regards the definition of the considered pursuit’s reward rate, and will be corrected to appear as…

Line 306:

Regarding equations:

Cross-references to equations in the main text refer to equations as they appear in the main text. Where needed, the appendix in which they are derived is also given. Equation numbering within the appendices refers to equations as they appear in the appendices. In the revision, we will refer to all equations that appear in the appendices as Ap.#.#. so as to avoid confusion between referencing equations as they appear in the main text and equations as they appear in the appendices.

We would also like to clarify that equation 8 (sv = r - rho*t, as written by Reviewer #1), as we derive it, is not new: it is similarly derived and expressed in prior foundational work by McNamara (1982), to which it is now properly attributed.

Equation 1 and Appendix 1

Equation 1 is formulated to calculate the average reward received and average time spent per unit time spent in the default pursuit. So, $f_i$ is the encounter rate of pursuit $i$ per unit of time spent in the default pursuit (lines 259-262). To the summation in the numerator we add the average reward obtained in the default pursuit per unit time, and to the summation in the denominator we add the time spent in the default pursuit per unit time (1).
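Equation 1 itself is not reproduced in this response, so the following Python sketch is only one reading of the verbal description: reward and time are each accumulated per unit of time spent in the default pursuit and then divided. The function name, the tuple layout `(f_i, r_i, t_i)`, the `default_rate` parameter, and the numbers are all hypothetical.

```python
def global_reward_rate(pursuits, default_rate):
    """Overall reward rate, accumulated per unit time in the default pursuit.

    pursuits: iterable of (f_i, r_i, t_i), where f_i is the encounter rate per
    unit of default-pursuit time, r_i the reward, and t_i the duration.
    default_rate: reward obtained in the default pursuit per unit time.
    """
    pursuits = list(pursuits)
    # Numerator: reward from encountered pursuits plus reward from the default pursuit.
    reward = sum(f * r for f, r, _ in pursuits) + default_rate
    # Denominator: time in encountered pursuits plus the unit of default-pursuit time.
    time = sum(f * t for f, _, t in pursuits) + 1.0
    return reward / time


example_rate = global_reward_rate(
    [(0.1, 6.0, 20.0), (0.05, 3.0, 5.0)],  # two illustrative pursuits
    default_rate=0.0,                       # default pursuit assumed unrewarded
)

if __name__ == "__main__":
    print(example_rate)
```

The trailing `+ 1.0` in the denominator corresponds to the "(1)" in the text: the single unit of default-pursuit time over which the encounter rates are defined.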

Equation 2 and Appendix 2

Eq. 2.4 in Appendix 2 calculates the average time spent outside of the considered pursuit, per encounter with the considered pursuit. Breaking down eq. 2.4, the first term in the numerator, $\sum_{i \neq in} f_i t_i$, gives the expected time spent in other pursuits, per unit time spent in the default pursuit, where $f_i$ is the encounter rate of pursuit $i$ per unit time spent in the default pursuit, and $t_i$ is the time required by pursuit $i$. The second term in the numerator (1, added outside the summation) simply represents the unit of time spent in the default pursuit, over which the encounter rate of each pursuit is calculated. Together, these represent the total time spent outside the considered pursuit, per unit time spent in the default pursuit. The denominator, $f_{in}$, is the frequency with which the considered pursuit is encountered per unit time spent in the default pursuit, so $1/f_{in}$ is the average time spent within the default pursuit, per encounter with the considered pursuit. By multiplying the average time spent outside of the considered pursuit per unit time spent in the default pursuit by the average time spent within the default pursuit per encounter with the considered pursuit, we get eq. 2.4, the average time spent outside of the considered pursuit, per encounter with the considered pursuit, which is equal to $t_{out}$.

$$t_{out} = \frac{\sum_{i \neq in} f_i t_i + 1}{f_{in}} \qquad \text{(eq. 2.4)}$$
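The breakdown of eq. 2.4 can be followed step by step in code. The Python sketch below is a reading of the prose description, not the authors' implementation: the dictionary layout, the function name `time_outside_per_encounter`, and the encounter rates and durations are hypothetical.

```python
def time_outside_per_encounter(pursuits, considered):
    """Average time outside the considered pursuit, per encounter with it.

    pursuits: dict mapping a pursuit name to (f_i, t_i), where f_i is the
    encounter rate per unit of default-pursuit time and t_i the duration.
    """
    f_in, _ = pursuits[considered]
    # Expected time in the other pursuits per unit time in the default pursuit.
    other_time = sum(f * t for name, (f, t) in pursuits.items() if name != considered)
    # Add the unit of default-pursuit time itself.
    outside_per_unit_default = other_time + 1.0
    # Average default-pursuit time per encounter with the considered pursuit.
    default_time_per_encounter = 1.0 / f_in
    return outside_per_unit_default * default_time_per_encounter


example_pursuits = {"B": (0.1, 20.0), "C": (0.05, 5.0)}
t_out_b = time_outside_per_encounter(example_pursuits, "B")

if __name__ == "__main__":
    print(t_out_b)
```

In this example pursuit B is met once per 10 units of default-pursuit time, and each such unit carries 0.25 units of time in pursuit C, so the time outside B per encounter comes to 12.5.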

Version published to 10.7554/elife.99957.1 on eLife
Oct 2, 2024
Version published to 10.7554/elife.99957 on eLife
Oct 2, 2024
Version published to 10.1101/2024.06.16.599189v1 on bioRxiv
Jun 16, 2024
