Is the Replication Crisis a Measurement Crisis? Evidence from Over 100 Randomized Trial Outcomes


Abstract

Randomized controlled trials (RCTs) are the gold standard for causal inference, yet the validity of RCT conclusions depends not only on randomization but also on how outcomes are measured and scored. We analyze item-level data from 112 RCT outcome measures spanning psychology, medicine, public health, and education to test whether foundational measurement assumptions are met and whether alternative scoring approaches alter conclusions. Across disciplines, measurement assumptions are rarely evaluated and frequently violated. For nearly half of outcomes, the single score used is unlikely to be a plausible representation of the data. Moreover, when outcomes are scored using statistical models aligned with the study design and better matched to the data, rather than sum scores, the proportion of pre/post trials reporting statistically significant treatment effects approximately doubles. These findings indicate that routine measurement decisions can systematically shift causal inferences, suggesting that a largely unexamined aspect of RCT analysis may be at the heart of replication failures.
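The contrast at the center of the abstract, unit-weighted sum scores versus scores from a fitted measurement model, can be illustrated on simulated data. The sketch below is not the authors' analysis; it assumes a hypothetical one-factor structure with unequal item loadings and uses scikit-learn's `FactorAnalysis` to show that the two scorings of the same item responses need not agree perfectly.

```python
# Hedged sketch (not the paper's code): sum scores vs. factor scores on
# simulated item-level data with a single latent trait.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_people, n_items = 500, 8

# Hypothetical unequal loadings: some items measure the trait weakly.
loadings = np.linspace(0.2, 0.9, n_items)
theta = rng.normal(size=(n_people, 1))                      # latent trait
items = theta * loadings + rng.normal(scale=0.5, size=(n_people, n_items))

# Scoring 1: unit-weighted sum score (every item counts equally).
sum_scores = items.sum(axis=1)

# Scoring 2: score from a fitted one-factor model (items weighted by fit).
factor_scores = FactorAnalysis(n_components=1, random_state=0) \
    .fit_transform(items).ravel()

# The two scorings correlate strongly but not perfectly: weak items
# dilute the sum score, so respondent orderings can differ.
r = abs(np.corrcoef(sum_scores, factor_scores)[0, 1])
print(f"|correlation| between sum and factor scores: {r:.3f}")
```

When loadings are unequal, the gap between the two scorings grows, which is one mechanism by which the choice of scoring model could move a treatment effect across a significance threshold.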