Correlational estimates of language effects are biased and directionally unpredictable: Evidence from large-scale field experiments
Abstract
The principle that correlation does not imply causation is foundational to scientific reasoning, yet correlational analyses routinely inform conclusions, particularly in research on the effects of language. We quantify the divergence between correlational and causal estimates using two large-scale field experiment datasets: 7,797 experiments on Upworthy.com (45,674 headlines) and 153,787 experiments across 398 news outlets (416,009 headlines), in which linguistic features of headlines were experimentally manipulated and click-through rates measured. Across 50 language constructs, correlational and causal estimates diverged in the estimated direction of effect 20-50% of the time. Critically, the direction of bias was unpredictable: correlational models underestimated causal effects in one dataset and overestimated them in the other. Standard corrections, including platform fixed effects, failed to eliminate this bias. These findings demonstrate that correlational evidence in language research is not merely imprecise but can be systematically misleading, and that the direction of distortion cannot be anticipated without experimental ground truth.
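The core comparison can be illustrated with a toy simulation (not the paper's data or exact estimator): when a confounder, such as an outlet's baseline click-through rate, influences both how often a linguistic feature appears and the outcome, the pooled correlational slope can carry the opposite sign from the within-experiment causal contrast that randomization recovers. All parameter values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each "experiment" has its own baseline click-through rate.
# Confounding assumption: low-baseline experiments use the feature more often.
n_experiments = 5000
true_effect = 0.01  # assumed causal lift in CTR when the feature is present

baseline = rng.uniform(0.02, 0.10, n_experiments)
p_feature = np.clip(0.9 - 8 * baseline, 0.05, 0.95)

# Two headlines per experiment; feature presence follows the confounded process.
feat_a = rng.random(n_experiments) < p_feature
feat_b = rng.random(n_experiments) < p_feature
ctr_a = baseline + true_effect * feat_a + rng.normal(0, 0.005, n_experiments)
ctr_b = baseline + true_effect * feat_b + rng.normal(0, 0.005, n_experiments)

# Correlational estimate: pool all headlines and regress CTR on the feature.
x = np.concatenate([feat_a, feat_b]).astype(float)
y = np.concatenate([ctr_a, ctr_b])
corr_slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Causal estimate: within-experiment difference where assignment differs,
# which cancels the shared baseline.
mask = feat_a != feat_b
causal = np.mean(np.where(feat_a[mask],
                          ctr_a[mask] - ctr_b[mask],
                          ctr_b[mask] - ctr_a[mask]))

print(f"correlational: {corr_slope:+.4f}  causal: {causal:+.4f}")
```

Under these assumed parameters the pooled slope comes out negative while the within-experiment contrast recovers the positive causal effect, a sign flip of the kind the abstract reports; with the confounding reversed, the bias would instead inflate the correlational estimate, matching the paper's point that the direction of distortion is unpredictable without experimental ground truth.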