Risking your Tail: Modeling Individual Differences in Risk-sensitive Exploration using Bayes Adaptive Markov Decision Processes

Curation statements for this article:
  • Curated by eLife

    eLife Assessment

    Shen et al. present a computational account of individual differences in mouse exploration when faced with a novel object in an open field from a previously published study (Akiti et al.) that relates subject-specific intrinsic exploration and caution about potential hazards to the spectrum of behaviors observed in this setting. Overall, this computational study is an important contribution that leverages a very general modeling framework (a Bayes Adaptive Markov Decision Process) to quantify and interrogate distinct drivers of exploratory behavior under potential threat. Given their assumptions, the modeling results are convincing: the authors are able to describe a substantial amount of the behavioral features and idiosyncrasies in this dataset, and their model affords a normative interpretation related to inherent risk aversion and predation hazard "flexibility" of individual animals. The work should be of broad interest to researchers working to understand open-ended exploratory behaviors.

Abstract

Novelty is a double-edged sword for agents and animals alike: they might benefit from untapped resources or face unexpected costs or dangers such as predation. The conventional exploration/exploitation tradeoff is thus coloured by risk-sensitivity. A wealth of experiments has shown how animals solve this dilemma, for example using intermittent approach. However, there are large individual differences in the nature of approach, and modeling has yet to elucidate how this might be based on animals’ differing prior expectations about reward and threat, and differing degrees of risk aversion. To capture these factors, we built a Bayes adaptive Markov decision process model with three key components: an adaptive hazard function capturing potential predation, an intrinsic reward function providing the urge to explore, and a conditional value at risk (CVaR) objective, which is a contemporary measure of trait risk-sensitivity. We fit this model to a coarse-grain abstraction of the behaviour of 26 animals who freely explored a novel object in an open-field arena (Akiti et al. Neuron 110, 2022). We show that the model captures both quantitative (frequency, duration of exploratory bouts) and qualitative (stereotyped tail-behind) features of behavior, including the substantial idiosyncrasies that were observed. We find that “brave” animals, though varied in their behavior, are generally more risk neutral, and enjoy a flexible hazard prior. They begin with cautious exploration, and quickly transition to confident approach to maximize exploration for reward. On the other hand, “timid” animals, characterized by risk aversion and high and inflexible hazard priors, display self-censoring that leads to the sort of asymptotic maladaptive behavior that is often associated with psychiatric illnesses such as anxiety and depression. 
Explaining risk-sensitive exploration using factorized parameters of reinforcement learning models could aid in the understanding, diagnosis, and treatment of psychiatric abnormalities in humans and other animals.

Article activity feed

  1. eLife Assessment

    Shen et al. present a computational account of individual differences in mouse exploration when faced with a novel object in an open field from a previously published study (Akiti et al.) that relates subject-specific intrinsic exploration and caution about potential hazards to the spectrum of behaviors observed in this setting. Overall, this computational study is an important contribution that leverages a very general modeling framework (a Bayes Adaptive Markov Decision Process) to quantify and interrogate distinct drivers of exploratory behavior under potential threat. Given their assumptions, the modeling results are convincing: the authors are able to describe a substantial amount of the behavioral features and idiosyncrasies in this dataset, and their model affords a normative interpretation related to inherent risk aversion and predation hazard "flexibility" of individual animals. The work should be of broad interest to researchers working to understand open-ended exploratory behaviors.

  2. Reviewer #1 (Public review):

    Summary:

    This work computationally characterized the threat-reward learning behavior of mice in a recent study (Akiti et al.), which showed prominent individual differences. The authors constructed a Bayes-adaptive Markov decision process model and fitted it to the behavioral data. The model assumed (i) a hazard function starting from a prior (with free mean and SD parameters) and updated in a Bayesian manner through experience (no real threat or reward was actually given in the experiment), (ii) risk-sensitive evaluation of future outcomes (calculating the lower 𝛼 quantile of outcomes, with 𝛼 a free parameter), and (iii) a heuristic exploration bonus. The authors found that (i) brave animals had more widespread hazard priors than timid animals and thereby quickly learned that there was in fact little real threat, (ii) brave animals may also be less risk-averse than timid animals in future outcome evaluation, and (iii) the exploration bonus could explain the observed behavioral features, including the transition of behavior from the peak to the steady-state bout frequency. Overall, this work is a novel and interesting analysis of threat-reward learning, and provides useful insights for future experimental and theoretical work. However, there are several issues that I think need to be addressed.
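To make point (i) of the summary concrete, here is a minimal sketch of how a prior with a free mean and SD could be updated after visits on which no threat occurs. It assumes a Beta-Bernoulli belief over a single hazard probability, which is a deliberate simplification and not the hazard function used in the manuscript; the function names and the numbers are hypothetical.

```python
def beta_from_mean_sd(mean, sd):
    """Convert a prior mean and SD into Beta(a, b) shape parameters."""
    var = sd ** 2
    if not (0 < mean < 1) or var >= mean * (1 - mean):
        raise ValueError("need 0 < mean < 1 and sd^2 < mean * (1 - mean)")
    nu = mean * (1 - mean) / var - 1  # total pseudo-count a + b
    return mean * nu, (1 - mean) * nu


def posterior_mean_after_safe_visits(mean, sd, n_safe):
    """Posterior mean hazard after n_safe visits with no threat observed.

    Assumption: each safe visit is one Bernoulli 'no hazard' observation,
    so Beta(a, b) becomes Beta(a, b + n_safe).
    """
    a, b = beta_from_mean_sd(mean, sd)
    return a / (a + b + n_safe)


# Same prior mean, different prior SD (both values are illustrative only).
flexible = [posterior_mean_after_safe_visits(0.5, 0.25, n) for n in (0, 5, 20)]
inflexible = [posterior_mean_after_safe_visits(0.5, 0.05, n) for n in (0, 5, 20)]
```

Under these assumptions the wide ("flexible") prior moves away from its initial mean much faster than the narrow ("inflexible") one after the same number of safe visits, which is the qualitative pattern the summary attributes to brave versus timid animals.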

    Strengths:

    (1) This work provides a normative Bayesian account for individual differences in braveness/timidity in reward-threat learning behavior, which complements the analysis by Akiti et al. based on model-free threat reinforcement learning.

    (2) Specifically, the individual differences were characterized by (i) the difference in the variance of hazard prior and potentially also (ii) the difference in the risk-sensitivity in the evaluation of future returns.

    Weaknesses:

    (1) Theoretically, the effect of the prior is diluted over experience whereas the effect of biased (risk-averse) evaluation persists, but these two effects could not be teased apart in the fitting analysis of the current data.

    (2) It is currently unclear how (or whether) the proposed model corresponds to neurobiological (rather than behavioral) findings, unlike the analysis by Akiti et al.

    Major points:

    (1) Line 219
    It was assumed that the exploration bonus was replenished at a steady rate when the animal was at the nest. An alternative would be to assume that the exploration bonus slowly degrades over time or experience. In that case, the transition of the bout rate from peak to steady state could possibly be explained, at least in part, by such a decrease in the exploration bonus.
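The contrast raised in this point can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the bonus is treated as a single pool that is consumed at the object and refilled at the nest, and the function name, the rates, and the decaying ceiling are all hypothetical rather than taken from the manuscript.

```python
def simulate_bonus(schedule, pool=1.0, use_rate=0.2, replenish=0.05, decay=0.0):
    """Track a hypothetical exploration-bonus pool over time steps.

    schedule: iterable of "object" or "nest" labels, one per time step.
    decay:    per-step fractional shrinkage of the pool's ceiling
              (0.0 gives pure steady replenishment).
    """
    ceiling = pool
    history = []
    for where in schedule:
        ceiling *= (1.0 - decay)      # alternative: bonus degrades over time
        if where == "object":
            pool -= use_rate * pool   # assumed: bonus is consumed at the object
        else:
            pool += replenish         # steady refill while at the nest
        pool = min(pool, ceiling)     # pool can never exceed its ceiling
        history.append(pool)
    return history


# Twenty cycles of five steps at the object followed by five at the nest.
schedule = (["object"] * 5 + ["nest"] * 5) * 20
steady = simulate_bonus(schedule, decay=0.0)
decaying = simulate_bonus(schedule, decay=0.02)
```

With steady replenishment the pool settles into a repeating cycle, whereas with a decaying ceiling the available bonus keeps shrinking, so in this toy setting the motivation to approach would fall over the session without any change in hazard beliefs.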

    (2) Line 237- (Section 2.2.6, 2.2.7, Figures 7, 9)
    I was confused by the description of nCVaR. I looked at the cited original literature, Gagne & Dayan 2022, and understood that nCVaR is a risk-sensitive version of expected future returns (equation 4) with parameter α (α-bar) (ranging from 0 to 1) representing risk preference. Lines 269-271 and Section 4.2 of the present manuscript described (in my understanding) that α was a parameter of the model. Then, isn't it more natural to report estimated values of α, rather than nCVaR, for individual animals in Sections 2.2.6 and 2.2.7 and Figures 7 and 9 (even though nCVaR monotonically depends on α)? In Figures 7 and 9, nCVaR appears to be upper-bounded by 1. The upper limit of α is 1 by definition, but I have no idea why nCVaR was also bounded by 1. So I would like to ask the authors to add more detailed explanations of nCVaR. Currently, CVaR is explained in Lines 237-243, but there is no explanation of nCVaR other than its formal name 'nested conditional value at risk' in Line 237.
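For readers unfamiliar with the quantity under discussion, the sketch below computes the ordinary (static) CVaR of a set of equally likely returns: the mean of the worst α-fraction of outcomes. It is not the nested nCVaR of Gagne & Dayan 2022, which applies a risk measure at every step of a sequential problem, and the function name and example returns are hypothetical.

```python
def cvar(outcomes, alpha):
    """Mean of the worst alpha-fraction of equally likely outcomes.

    alpha = 1 recovers the plain expectation (risk neutrality);
    smaller alpha focuses on ever worse outcomes (risk aversion).
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(outcomes)          # worst (lowest) returns first
    tail_mass = alpha * len(ordered)    # tail size in units of one outcome
    total, used = 0.0, 0.0
    for x in ordered:
        w = min(1.0, tail_mass - used)  # last outcome may count fractionally
        if w <= 0:
            break
        total += w * x
        used += w
    return total / tail_mass


# One rare bad outcome among otherwise mildly rewarding ones.
returns = [-10.0, 1.0, 1.0, 1.0]
risk_neutral = cvar(returns, 1.0)
risk_averse = cvar(returns, 0.5)
worst_case = cvar(returns, 0.25)
```

In this example the same set of returns is valued at -1.75 for α = 1, -4.5 for α = 0.5, and -10 for α = 0.25, which shows how lowering α alone makes an identical prospect look worse.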

    (3) Line 333 (and Abstract)
    Given that animals' behaviors could be fitted equally well by the model having both nCVaR (free α) and a hazard prior and by the alternative model having only a hazard prior (with α = 1), might it be difficult to confidently claim that brave (/timid) animals had a risk-neutral (/risk-averse) preference in addition to a widespread (/low-variance) hazard prior? If so, it might be good to somewhat weaken the corresponding expression in the Abstract (e.g., add 'potentially also' to the result for risk sensitivity) or mention the inseparability of risk sensitivity and prior belief pessimism (e.g., "... although risk sensitivity and prior belief pessimism could not be teased apart").

  3. Reviewer #2 (Public review):

    Shen and Dayan build a Bayes adaptive Markov decision process model with three key components: an adaptive hazard function capturing potential predation, an intrinsic reward function providing the urge to explore, and a conditional value at risk (CVaR, closely related to probability distortion explanations of risk traits). The model itself is very interesting and has many strengths, including considering different sources of risk preference in generating behavior under uncertainty. I think this model will be useful to consider for those studying approach/avoid behaviors in dynamic contexts.

    The authors argue that the model explains behavior in a very simple and unconstrained behavioral task in which animals are shown novel objects and retreat from them in various manners (different body postures and patterns of motor chunks/syllables). The model itself does capture lots of the key mouse behavioral variability (at least on average on a mouse-by-mouse basis) which is interesting and potentially useful. However, the variables in the model - and the internal states it implies the mice have during the behavior - are relatively unconstrained given the wide range of explanations one can offer for the mouse behavior in the original study (Akiti et al). This reviewer commends the authors on an original and innovative expansion of existing models of animal behaviour, but recommends that the authors revise their study to reflect the obvious challenges. I would also recommend a reduction in claiming that this exercise gives a normative-like or at least quantitative account of mental disorders.

    My main comment is that this paper is a very nice model creation that can characterize the heterogeneity of rodent behavior in a very simple approach/avoid context (Akiti et al; when a novel object is placed in an arena) that itself can be interpreted in a multitude of ways. The use of terms like "exploration", "brave", etc. in this context is tricky because the task does not allow the original authors (Akiti et al) to quantify these "internal states" or "traits" with the appropriate level of quantitative detail to say whether this model is correct or not in capturing the internal states that result in the rodent behavior. That said, the original behavioral setup is so simple that one could imagine capturing the behavioral variability in multiple ways (potentially without invoking complex computations that the original authors never showed the mouse brain performs). I would recommend reframing the paper as a new model that proposes a set of internal states that could give rise to the behavioral heterogeneity observed in Akiti et al, but nonetheless is at this time only a hypothesis. Furthermore, an explanation of what would be required to test this would be appreciated to make the point clearer.

  4. Reviewer #3 (Public review):

    Summary:

    The manuscript presents computational modelling of the behaviour of mice during encounters with novel and familiar objects, originally reported by Akiti et al. (Neuron 110, 2022). Mice typically perform short bouts of approach followed by a retreat to a safe distance, presumably to balance exploration to discover possible rewards with the potential risk of predation. However, there is considerable heterogeneity in this exploratory behaviour, both across time as an individual subject becomes more confident in approaching the object, and across subjects; with some mice rapidly becoming confident to closely explore the object, while other timid mice never become fully confident that the object is safe. The current work aims to explain both the dynamics of adaptation of individual animals over time, and the quantitative and qualitative differences in behaviour between subjects, by modelling their behaviour as arising from model-based planning in a Bayes adaptive Markov Decision Process (BAMDP) framework, in which the subjects maintain and update probabilistic estimates of the uncertain hazard presented by the object, and rationally balance the potential reward from exploring the object with the potential risk of predation it presents.

    In order to fit these complex models to the behaviour, the authors necessarily make substantial simplifying assumptions, including coarse-graining the exploratory behaviour into phases quantified by a set of summary statistics related to the approach bouts of the animal. Inter-individual variation between subjects is modelled both by differences in their prior beliefs about the possible hazard presented by the object and by differences in their risk preference, modelled using a conditional value at risk (CVaR) objective, which focuses the subject's evaluation on different quantiles of the expected distribution of outcomes. Interestingly, these two conceptually different possible sources of inter-subject variation in brave vs timid exploratory behaviour turn out not to be dissociable in the current dataset, as they can largely compensate for each other in their effects on the measured behaviour. Nonetheless, the modelling captures a wide range of quantitative and qualitative differences between subjects in the dynamics of how they explore the object, essentially through differences in how subjects' beliefs about the potential risk and reward presented by the object evolve over the course of exploration, and are combined to drive behaviour.

    Exploration in the face of risk is a ubiquitous feature of the decision-making problem faced by organisms, with strong clinical relevance, yet remains poorly understood and under-studied, making this work a timely and welcome addition to the literature.

    Strengths:

    (1) Individual differences in exploratory behaviour are an interesting, important, and under-studied topic.

    (2) Application of cutting-edge modelling methods to a rich behavioural dataset, successfully accounting for diverse quantitative and qualitative features of the data in a normative framework.

    (3) Thoughtful discussion of the results in the context of prior literature.

    Limitations:

    (1) The model-fitting approach of coarse-graining the behaviour into phases and fitting to their summary statistics may not be applicable to exploratory behaviours in more complex environments where coarse-graining is less straightforward.

    (2) Some aspects of the work could be more usefully clarified within the manuscript.