Robust group- but limited individual-level (longitudinal) reliability and insights into cross-phases response prediction of conditioned fear

Here we follow the call to target measurement reliability as a key prerequisite for individual-level predictions in translational neuroscience by investigating (i) longitudinal reliability at the individual level, (ii) longitudinal reliability at the group level, (iii) cross-sectional reliability, and (iv) response predictability across experimental phases. 120 individuals performed a fear conditioning paradigm twice, six months apart. Analyses of skin conductance responses, fear ratings, and BOLD fMRI were conducted with different data transformations and different numbers of included trials. While longitudinal reliability was generally poor to moderate at the individual level, it was good at the group level for acquisition but not for extinction. Cross-sectional reliability was satisfactory. Higher responding in preceding experimental phases predicted higher responding in subsequent phases at a weak to moderate level, depending on data specifications. In sum, the results suggest the feasibility of individual-level predictions over (very) short time intervals (e.g., across phases), while predictions over longer time intervals may be problematic.

Article activity feed

  1. Evaluation Summary

    The authors comprehensively assess the measurement properties of behavioral (skin conductance and ratings) and fMRI measures of fear conditioning (acquisition and extinction) in a sample of 107 participants, with 71 providing retest measures at 6 months. Retest reliability was generally low, whereas internal-consistency reliability was generally high. At the group level, reliability and criterion validity were generally good. Most measurements proved sensitive to modality, processing, or statistical decisions. Results are framed within a larger discussion of the role of measurement properties in individual difference research and clinical translation and will serve as an important building block towards improvement in both these areas.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #2 agreed to share their name with the authors.)

  2. Reviewer #1 (Public Review):

    The authors comprehensively assess the measurement properties (reliability, criterion/predictive validity) of behavioral and neural measures of fear acquisition/extinction, with a focus on longitudinal reliability and on the consequences of analytic and processing choices via a multiverse approach. In a longitudinal design (6-month interval), the authors collected fear acquisition and extinction measures (SCR, ratings, fMRI) at two time points in a relatively large sample for this type of work. Most notably, test-retest reliability, which is identified as a key component of individual-difference and clinical translation work, was generally low, whereas internal consistency was generally high. Group-level (averaged) reliability and cross-phase prediction (i.e., criterion validity) were generally good. Most measurement indices varied as a function of modality, processing, or statistical decisions. This work is framed within a larger discussion of the role of measurement properties in individual difference work and clinical translation and will serve as an important building block towards improvement in both these areas.

    The conclusions of this work are largely supported by the data and methodological approach, and this is a good benchmark for the field. However, some aspects could be clearer or streamlined, and some analytic choices are relative weaknesses.


    The overall approach is excellent and represents the vanguard of open science practices (preregistration, all materials freely available, documentation of analysis deviations, multiverse analyses, etc.). Relatedly, this comprehensive approach reveals how different analytic choices/researcher degrees of freedom can have sometimes drastic effects on fundamental measurement properties. I think this underlines what I view as the key contribution of this manuscript: empirically highlighting the need for the fear conditioning field to pay more attention to measurement properties.

    Going beyond standard associative measures of reliability (ICCs) is an important contribution of this work, as the additional metrics allow the authors to comment on nuances of individual-difference reliability that the coarser ICCs cannot capture. In turn, this helps researchers make more informed decisions regarding the design of fear conditioning tasks to assess individual differences.

    The fMRI results are a particular strength, as fMRI continues to be a common fear conditioning index, yet its measurement properties within these studies are critically understudied. The choice to use standard ICCs alongside similarity approaches is particularly fruitful here: combined with the overlap metrics, we now have a much better appraisal of the different components of reliability in fMRI data - and of potential explanations for differences between behavioral and fMRI reliabilities.


    The authors structure their effort around the premise that reliability is essential for conducting solid individual-differences science, which I agree with wholeheartedly. However, I think the authors rely on relatively arbitrary cut-offs for classifying reliability as good/poor/etc. to an extent that is not warranted, particularly in the context of the Discussion, and this takes away from the impact of this effort. As the authors point out, these categorical cut-offs are more guidelines than strict rules, yet the manuscript is structured around the premise that individual-level reliability is problematically poor. Many cut-off recommendations are based on psychometric work on trait self-report measures, which usually assume fewer determinants/sources of error than would be seen in neuroscience experiments, in turn allowing larger ceilings for effect sizes and reliability. The current manuscript does not address this issue, or what meaningful (as opposed to good) fear conditioning reliability is once one moves away from the categorical cut-offs. In other words, is it possible that the authors actually observed "good" reliability in the context of fear conditioning work, and that this reliability being lower than in other types of paradigms is simply inherent to the construct being studied?

    The internal-consistency (cross-sectional reliability) calculation used is not well justified and potentially needs additional parameters. It is not clear why the authors deviate from the internal-consistency calculation described in Parsons, Kruijt, and Fox (2019), especially given that these procedures are used for other metrics elsewhere in the manuscript.
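    For readers unfamiliar with the permutation-based procedure of Parsons, Kruijt, and Fox (2019) that the reviewer refers to, a minimal sketch of a permutation split-half reliability estimate is given below. The function name, parameters, and simulated-data shapes are illustrative assumptions, not the authors' actual implementation:

```python
import numpy as np

def permutation_split_half(trials, n_permutations=5000, seed=0):
    """Permutation-based split-half reliability (in the spirit of
    Parsons, Kruijt & Fox, 2019).

    trials : (n_subjects, n_trials) array of per-trial responses
             (e.g., trial-wise SCRs to the CS+).
    Returns the mean Spearman-Brown-corrected split-half correlation
    over random splits of the trials into two halves.
    """
    rng = np.random.default_rng(seed)
    n_subjects, n_trials = trials.shape
    half = n_trials // 2
    estimates = np.empty(n_permutations)
    for p in range(n_permutations):
        order = rng.permutation(n_trials)           # random trial split
        m1 = trials[:, order[:half]].mean(axis=1)   # per-subject mean, half 1
        m2 = trials[:, order[half:]].mean(axis=1)   # per-subject mean, half 2
        r = np.corrcoef(m1, m2)[0, 1]
        estimates[p] = 2 * r / (1 + r)              # Spearman-Brown correction
    return estimates.mean()
```

    Averaging over many random splits avoids the arbitrariness of a single odd/even split, which is the main motivation for the permutation approach.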

    In the fMRI analyses, the authors use an ROI approach based on prior studies of fear acquisition and extinction. The majority of the most consistently identified regions (as seen in meta-analyses; Fullana et al., 2016, 2018) are analyzed. However, it is not clear why other regions are omitted, particularly given the meta-analytic evidence; striatal regions and the thalamus are the most notable omissions. A further weakness is that the functional ROIs in this study were based on peak coordinates from a handful of prior studies rather than on meta-analytically identified coordinates. As such, I do not think the authors present the strongest foundation for drawing conclusions about the reliability of fear conditioning fMRI data.

  3. Reviewer #2 (Public Review):

    The manuscript describes a large set of statistical analyses on fear conditioning data from 107 participants (N=71 at two time points six months apart). The analyses comprise approaches to determine the reliability and predictability of conditioned fear responses: skin conductance, ratings, and fMRI data.

    The approach is thorough, with a range of analysis approaches, including within- and between-subjects similarity, the individual-level overlap of fMRI results, intraclass correlation coefficients, and cross-sectional reliability. It is important to determine these values so that researchers can discard incorrect assumptions, such as the belief that threat responses at baseline can be predictive of treatment responses in patient populations.
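    As a concrete illustration of the intraclass-correlation metric listed above, the following is a minimal sketch of the two-way random-effects, absolute-agreement, single-measure ICC(2,1) commonly used for test-retest reliability. This is the generic textbook formula, not the authors' exact analysis pipeline:

```python
import numpy as np

def icc_2_1(x):
    """ICC(2,1): two-way random-effects, absolute agreement, single measure.

    x : (n_subjects, k_sessions) array, e.g. one fear-conditioning readout
        per participant measured at two time points six months apart.
    """
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)                       # per-subject means
    col_means = x.mean(axis=0)                       # per-session means
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between-subjects SS
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between-sessions SS
    ss_total = ((x - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols            # residual SS
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

    Because ICC(2,1) measures absolute agreement, a systematic mean shift between sessions (e.g., habituation from time point 1 to time point 2) lowers the estimate even when participants' rank order is perfectly preserved.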

    The poor reliability identified by several of these approaches is likely to be of great importance to this large, translational field. A positive result was good reliability at the group level for fear learning, but not extinction.