Human complex exploration strategies are extended via noradrenaline-modulated heuristics

M Dubois
J Habicht
J Michely
R Moran
RJ Dolan
TU Hauser

This article has been Reviewed by the following groups

Listed in

Evaluated articles (eLife)
Multiple peer reviews (damianpattinson)

Abstract

The exploration-exploitation trade-off, the arbitration between sampling a lesser-known option and a known rich one, is thought to be solved using computationally demanding exploration algorithms. Given known limitations in human cognitive resources, we hypothesised the presence of additional, cheaper strategies. We examined choice behaviour for such heuristics and show that it involves a value-free random exploration that ignores all prior knowledge, and a novelty exploration that targets novel options alone. In a double-blind, placebo-controlled drug study assessing contributions of dopamine (400 mg amisulpride) and noradrenaline (40 mg propranolol), we show that value-free random exploration is attenuated under the influence of propranolol, but not under amisulpride. Our findings demonstrate that humans deploy distinct computationally cheap exploration strategies, and that value-free random exploration is under noradrenergic control.

Data and materials availability

Data and code will be provided upon acceptance.

Version published to 10.1101/2020.02.20.958025v4 on bioRxiv
Nov 24, 2020
eLife
Sep 2, 2020
### Reviewer #3:

Summary:

The authors report a between-subjects, double-blind psychopharmacological study on explore/exploit behavior in healthy human subjects. The authors used propranolol to block norepinephrine (NE), and amisulpride to block dopamine (DA), and compared to a group taking placebo. Using a 3-armed bandit task, coupled with computational modelling and pharmacological manipulation, the authors show that "tabula rasa" (or random exploration) is reduced when NE is blocked. This interpretation was supported by behavioral effects whereby subjects taking propranolol were significantly more consistent than other groups when facing identical choices, and chose the low-value option more often than the other groups. Blocking DA did not appear to affect any parameters. The computational model showed that the E-greedy parameter, which computes the proportion of time an entity makes a random selection, was most affected by the NE blockade. In addition, the modelling shows that some directed exploration (exploring lesser-known options) was also at play.
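The role of the ε-greedy parameter described in this summary can be illustrated with a minimal sketch. It is not the authors' model: it assumes a purely greedy fallback (the manuscript combines the heuristic with other strategies), and the function names and the example values are hypothetical.

```python
import random

def epsilon_greedy_choice(expected_values, epsilon, rng=random):
    """Return an option index: uniform random with probability epsilon,
    otherwise the option with the highest expected value."""
    if rng.random() < epsilon:
        # Value-free step: every option is equally likely, prior knowledge is ignored.
        return rng.randrange(len(expected_values))
    # Greedy step (an assumption of this sketch, not the authors' choice rule).
    return max(range(len(expected_values)), key=lambda i: expected_values[i])

def low_value_choice_rate(expected_values, epsilon, n_trials=10000, seed=0):
    """Simulate how often the lowest-valued option is picked."""
    rng = random.Random(seed)
    worst = min(range(len(expected_values)), key=lambda i: expected_values[i])
    picks = sum(
        epsilon_greedy_choice(expected_values, epsilon, rng) == worst
        for _ in range(n_trials)
    )
    return picks / n_trials
```

Under these assumptions, a three-option agent picks the lowest-valued option with probability ε/3, which is why a higher rate of low-value choices can serve as a behavioural signature of this kind of exploration.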

General comments:

The manuscript is well-written and the results are compelling. The findings are important to researchers particularly interested in the cognitive effects of catecholamines, and/or the explore/exploit dilemma. The results may not be that interesting to a broader readership.

Criticisms:

I do not really like the use of the term "tabula rasa" exploration, over "random" exploration. Using the term random exploration is just simpler, and clearer. The particular problem for me is that "tabula rasa" has the connotation that both the current "tabula rasa" choice and all future choices will not take into account information obtained before that choice. Random exploration is a better term because it is easy and intuitive to see that random choices can be sprinkled in with choices based on previous information, whereas tabula rasa implies wiping previous information away from that point forward. As best I can tell, previous related work has not termed the random exploration associated with the E-greedy parameter "tabula rasa". One consideration I am wrestling with is that apparently there is another parameter in one or more of the models that reflects random exploration (line 618, inverse temperature). This may be why the authors opted to call the E-greedy parameter something else. At the very least, I would like a better explanation of the choice of term (tabula rasa) as well as a thorough explanation of the difference between tabula rasa and random exploration. I recommend changing the term used as well, but am amenable to accepting an argument for keeping it.

Line 162: "Reported findings were corrected for IQ (WASI)". How? It seems WASI was included as a covariate in the repeated-measures ANOVA, but it is not clear from the results reported in lines 170-185 exactly which factors went into the ANOVA. I recognize that in higher-impact journals a full description of the factors and levels of statistical tests is often considered a tedious waste of space, but I feel that holds only in cases where the structure of the test is obvious. In my opinion, that is not the case here.

Line 209-210: "the probability of choosing bandits with a lower expected value (here the low-value bandit, Fig 1e) will be higher. We investigated whether such behavioural signatures were increased in the long horizon condition (i.e. when exploration is useful), and we found a significant main effect of horizon (F(1, 54)=4.069, p=.049, η2=.07; Figure 3c)." Isn't this just evidence of general exploration, not specifically tabula rasa exploration? How does this test rule out, for example, directed exploration?
eLife
Sep 2, 2020
### Reviewer #2:

In this study, Dubois and colleagues claim that noradrenaline promotes tabula-rasa exploration in decision making, using a novel paradigm involving short- and long-horizon conditions to elicit exploitation and exploration, respectively. The work tests different computational models and examines in particular supposedly less costly forms of exploration, that is 1) tabula-rasa exploration, in which prior information is ignored and the same probability is assigned to all available options, and 2) novelty exploration, in which information processing is biased toward choices that have not been encountered previously. They provide evidence that both of these processes coexist with more demanding exploration strategies. In addition, using a double-blind, placebo-controlled drug study, they provide support for a role of noradrenaline in tabula-rasa exploration.
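The novelty heuristic summarised above could be sketched as a fixed bonus added to options that have never been sampled. This is only an illustration under stated assumptions: the additive form, the function names and the parameter `novelty_bonus` are hypothetical and are not taken from the manuscript.

```python
def novelty_adjusted_values(expected_values, times_sampled, novelty_bonus):
    """Add a fixed bonus to every option that has not been sampled yet."""
    return [
        value + novelty_bonus if count == 0 else value
        for value, count in zip(expected_values, times_sampled)
    ]

def greedy_choice(values):
    """Return the index of the highest-valued option."""
    return max(range(len(values)), key=values.__getitem__)
```

With a large enough bonus, a never-sampled option is chosen even when its prior expected value is the lowest, whereas a tabula-rasa step would pick among all options with equal probability regardless of novelty.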

This work extends previous work from the same group that aimed at solving the important question related to decision making and the neuromodulatory influences on these processes. The overall approach and the results are clearly presented. The extensive model comparison is particularly interesting to better approach this difficult question. The results are interesting and bring novel insights about the processes at play during exploration and the influence of neurotransmitters on these processes.

Noradrenaline influence on tabula-rasa exploration:

The authors claim that "Phasic noradrenaline is thought to act as a reset button, rendering an agent agnostic to all previously accumulated information, a de facto signature of tabula-rasa exploration." It might be interesting to discuss the results in terms of a potential impact of noradrenaline on the subjective value of the choices. For instance, Rogers et al. (Psychopharmacology, 2004) suggest that propranolol affects the processing of possible losses in decision-making paradigms, and might also reduce the discrimination between the different levels of possible gains. In another study, Sokol-Hessner et al. (Psychol Sci., 2015) also report a reduction in loss aversion after propranolol administration. These effects might also change prior information and reset behavioral adaptation to look for new opportunities. In this latter study the authors also report a lack of effect of propranolol on choice consistency, contrary to what the present study reports. I was also wondering how this new result about the effect of propranolol on decision making relates to previous findings from the same group (Hauser et al. 2019), where they described the influence of noradrenaline on information gathering and the urgency to decide. Finally, according to the network reset hypothesis, it has indeed been suggested that a change in the environment might enhance information gathering at the expense of prior expectations to produce an adaptive behavioral output. Perhaps the authors might avoid using the term 'agnostic'; the effect might instead reflect a reduced influence of 'top-down' prior information, related to changes in the subjective value of the different choices.

Model selection:

One strength of the paper is that the authors compared several computational models. The model selection is presented in Figure 4, and in Figure 4 - Figure supplement 1 the authors provide additional information on how the winning model, which accounted best for the largest number of subjects, compares with two other models, namely the UCB model (with novelty and greedy parameters) and the hybrid model (with novelty and greedy parameters). It would be useful for the reader to get a better sense of the number of subjects whose results favored each model (i.e. a more exhaustive picture). One could use the same layout as Appendix Table 2, filled with the number of subjects for which each model achieved the best performance. In fact, as shown in Figure 4, the winning model does not look very different (at least visually) from other models such as the UCB model (with novelty and greedy parameters) or the hybrid models (with the novelty parameter, or with novelty and greedy parameters). As such, I am wondering whether the conclusion about the 𝜖-greedy parameter would hold true if other models with similar performance were tested, e.g. the UCB model (with novelty and greedy parameters) or the hybrid model (with novelty and greedy parameters)?
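The per-subject tally requested here could be produced along the following lines. This is a hedged sketch: it assumes one fit score per model and subject where higher means better (for example a held-out log-likelihood), and the function name and input layout are hypothetical rather than the authors' pipeline.

```python
def count_best_model(scores_by_model):
    """Count, for each model, the subjects for whom it has the highest score.

    scores_by_model maps a model name to a list of per-subject scores,
    all lists in the same subject order. Ties go to the model listed first.
    """
    models = list(scores_by_model)
    n_subjects = len(scores_by_model[models[0]])
    if any(len(scores_by_model[m]) != n_subjects for m in models):
        raise ValueError("every model needs one score per subject")
    counts = {m: 0 for m in models}
    for subject in range(n_subjects):
        best = max(models, key=lambda m: scores_by_model[m][subject])
        counts[best] += 1
    return counts
```

Reporting these counts for every model, rather than only for the winning model and two competitors, would give the more exhaustive picture asked for above.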

The authors used propranolol (40 mg), a non-selective β-adrenoceptor antagonist, to reduce noradrenaline functioning. Previous studies have shown that it significantly decreases heart rate (e.g. Rogers et al., 2004). How might that relate to the reported results? In terms of NA influence and given the distribution of β receptors, could the authors be more explicit about how their work relates to the potential mechanisms (e.g. Goldman-Rakic et al. J Neurosci. 1990 or Waterhouse et al., Journal of Pharmacology and Experimental, 1982)?

Could the authors clarify whether the PANAS questionnaire was administered to the participants before or after the drug treatment, so that readers can tell whether this group difference was a pre-existing difference between groups or a consequence of the drug administration? It would indeed be interesting to have a measure of the drug effect on these parameters.

The authors claim that: "Although tabula-rasa exploration can comprise influences of attentional lapses or impulsive motor responses, the difference between horizon conditions cancels them out". I would suggest tempering this claim, as the effect might be more enduring in the long-horizon condition. The authors might also want to look at RT variability in addition to RT means, which did not differ between groups.
eLife
Sep 2, 2020
### Reviewer #1:

Dubois and colleagues investigate how two modes of exploration - tabula-rasa and novelty-seeking - contribute to human choice behavior. They found that subjects used both tabula-rasa and novelty-seeking heuristics when the task conditions favored exploration. Specifically, participants could, and had to, make more responses in the long-horizon condition, which favored exploration, compared to the short-horizon condition, which favored exploitation. The authors then provide evidence that blockade of norepinephrine beta receptors leads to decreased tabula-rasa exploration and increased choice consistency, whereas blockade of D2/D3 dopamine receptors had little effect. Novelty seeking was not affected by catecholaminergic drugs.

The paper provides evidence on exploration-exploitation trade-offs from two different points of view. On the one hand, it addresses computational aspects of exploration by investigating how computationally intense forms of exploration might be supplemented by heuristic strategies. To do so, the authors propose a novel task that allows them to disentangle these strategies and quantitatively assess their usage. On the other hand, the findings presented in the paper shed some novel light on the neuropharmacological mechanisms underlying exploration. Some interpretations seem to go beyond the data, and information is missing in the description of the results and the computational approaches used. In general though, the manuscript conveys the impression of a well-designed and carefully conducted study.

Major points:

General

It is one thing to come up with computational terms and model-based quantities correlating with behavior but a different one to show their psychological meaning. Did the trials with tabula-rasa exploration or novelty exploration differ in terms of response times from the other types of responses? Did participants report that they indeed intended to explore in the tabula-rasa exploration trials?

On a related note, how do the authors distinguish random (tabula-rasa) exploration from making a mistake? From how the task was designed, choosing the low value option appears to receive a more natural interpretation as a mistake rather than as exploration because this option was clearly dominated by the other options and remained so within and across trials.

Previous research of the authors (Hauser et al., 2017, 2018, 2019) has associated beta receptor blockade with enhanced metacognition, decreased information gathering/increased commitment to an early decision (Hauser et al., 2018, JNS) and an arousal (i.e., reward)-induced boost of processing stimuli. Of course, it is possible that norepinephrine plays multiple roles, but it appears not exactly parsimonious to imbue it with a different role for each task tested. Are there some commonalities across these effects that could be explained with some common function(s)?

Throughout, the paper implies that a beta blocker provides information about the function of norepinephrine in general. However, blocking beta receptors leaves synaptic norepinephrine to act on alpha receptors; accordingly, beta-blockers can be viewed as partial alpha agonists. Given that the function of these receptor families differs, more care should be taken when describing the nature of the intervention, labeling the groups and interpreting the effects.

Introduction:

As mentioned above, the paper investigates not only computational aspects of exploration but also the underlying neuropharmacological correlates. However, the introduction focuses mostly on different computational algorithms (which is in itself very helpful for the understanding of the paper!) while the neuropharmacological basis of explorative behavior is only briefly introduced. In the same regard, while some insights were given in the Discussion, it would be interesting to have a rationale for using amisulpride and propranolol already in the introduction.

Relatedly, the introduction focuses on tabula-rasa and novelty strategies based on the argument that these are more computationally efficient. The authors may also want to motivate this from the perspective of neural constraints and brain processes. Specifically, they argue that it may be computationally demanding to process the expected value (mean) and variance of choice options. However, computational efficiency has been put forward as an argument for why mean-variance-like signals are coded in the brain, particularly with multi-outcome options where expected utilities are difficult to compute (D'Acremont and Bossaerts, 2008). Thus, the computational efficiency argument at the moment seems insufficiently motivated.

Materials and Methods:

Successful performance of the task is based on the ability to discriminate between different reward types and select the one with the higher value. From the description of the experimental design, one can see that in order to do so, the subjects needed to distinguish between different apple sizes. In this regard, a question arises: how large was the difference between two adjacent apple sizes? Was it large enough that, after a visual inspection, the participant could easily understand that apple size = 7 was less rewarding than apple size = 8? Finally, since the task requires visual inspection of reward stimuli, was the subjects' vision tested, and did it differ between groups?

The point of heuristics from a psychological perspective is that they dispense with the need to use full-blown algorithmic calculations. However, in the present models, the heuristics are only added on top of these calculations and the winning model includes Thompson exploration. Stand-alone heuristic models would do the term more justice and one wonders how well a model would fare that includes only tabula rasa exploration and novelty exploration.

The simulations provide a nice intuition for understanding choice proportions from different models/strategies (Figures 1e and 1f). However, it would be helpful to provide simulated results for long and short horizons separately. Do the models make different predictions for the two horizons? Additionally, it would be helpful to also show the results from other models (i.e. the proportion of low value bandit chosen by novelty agent). These can be added in the supplement.

One of the best-known effects of propranolol is to reduce heart rate. Did the authors measure heart rate and can they control for the possibility that peripheral effects of the drug explain the findings (and what was the reason for not collecting pupil diameter data, contrary to the previous research of the authors)?

The long horizon condition appears to confound exploration with higher effort demands and longer delays to reward, at least in the early responses. If the authors cannot control for these they should mention them as limitations.

Not only choice rules but also value functions seem to differ between Thompson and UCB (lines 583 and 593). This raises the question how well pharmacological effects on choice rules can be distinguished from effects on valuation and how confident we are that the observed effects indeed arise from changes in choice rules.
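The distinction drawn in this comment can be made concrete with a simplified sketch. It assumes Gaussian beliefs summarised by a mean and a standard deviation per option, uses a deterministic argmax for UCB (the manuscript's models may pass values through a stochastic choice rule), and all function and parameter names are hypothetical.

```python
import random

def thompson_choice(means, sds, rng):
    """Thompson-style step: draw one sample per option from its belief
    distribution and choose the option with the largest sample."""
    samples = [rng.gauss(m, s) for m, s in zip(means, sds)]
    return max(range(len(samples)), key=samples.__getitem__)

def ucb_choice(means, sds, bonus_weight):
    """UCB-style step: add an uncertainty bonus to each mean and choose
    the option with the largest inflated value."""
    values = [m + bonus_weight * s for m, s in zip(means, sds)]
    return max(range(len(values)), key=values.__getitem__)
```

In this sketch uncertainty enters Thompson sampling through the choice noise and enters UCB through the value itself, which is the kind of difference that could make drug effects on choice rules hard to separate from effects on valuation.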

Discussion:

Line 410: The statement that memory is not at play in the present task because all information is always visible on the screen seems too strong. At least some exploration-relevant information, such as the overall distribution of outcomes across all options, is not presented and may be remembered differently by the different groups.

D'Acremont M, Bossaerts P. Neurobiological studies of risk assessment: a comparison of expected utility and mean-variance approaches. Cogn Affect Behav Neurosci. 2008;8(4):363-374. doi:10.3758/CABN.8.4.363
eLife
Sep 2, 2020

## Preprint Review

This preprint was reviewed using eLife’s Preprint Review service, which provides public peer reviews of manuscripts posted on bioRxiv for the benefit of the authors, readers, potential readers, and others interested in our assessment of the work. This review applies only to version 3 of the manuscript.

Version published to 10.1101/2020.02.20.958025v3 on bioRxiv
Jun 11, 2020
Version published to 10.1101/2020.02.20.958025v2 on bioRxiv
Jun 8, 2020
Version published to 10.1101/2020.02.20.958025v1 on bioRxiv
Feb 21, 2020
