Classifying 25 Misinterpretations of Statistical Tests: A Comparison of Six Large Language Models

Abstract

Background: Misinterpretations of statistical tests remain widespread and can be amplified by tools increasingly used to support scientific reasoning, including large language models (LLMs). This study evaluates whether LLMs endorse or reproduce well-documented interpretive errors when asked to assess benchmark statements about frequentist inference.

Methods: We used a fixed benchmark of 25 statements that are incorrect by construction. Six LLM configurations, accessed via their user interfaces, were tested under two prompting conditions: a discursive prompt requesting correctness judgments and corrections, and a minimal prompt requesting only correct/incorrect labels. To assess within-model variability and prompt dependence, we repeated prompts across separate chat sessions under two administration formats: batch submission of all 25 statements and single-item submission. Summary outcomes captured whether the model rejected or endorsed each incorrect statement. Expanded outcomes were assessed through structured textual analysis of explanations using an a priori glossary of fallacy categories.

Results: Across configurations, most incorrect benchmark statements were correctly rejected at the label level; however, both sporadic errors and systematic misclassifications were frequently observed. Under batch administration, erroneous endorsements concentrated in a subset of recurrent failure items and varied across models and prompting conditions. Under single-item administration, erroneous endorsements were markedly reduced, with some configurations producing no classification errors. Textual analysis nevertheless revealed that outputs correctly rejecting a benchmark claim often introduced additional fallacies. The most recurrent patterns included subordination of assumptions, overconfident interpretation of interval estimates, mixing of inferential logics, ritualistic reliance on thresholds, null privileging, oversimplification, and unqualified use of dichotomizing language.

Conclusions: In this study, LLM responses to misinterpretations of statistical tests depended on discourse constraints and prompting. Single-item administration yielded higher label accuracy, but the accompanying explanatory text often introduced misleading inferential rhetoric. Given these findings and what is known about how LLMs generate text, LLM-generated explanations should not be relied upon for methodological guidance without independent critical evaluation by qualified human experts.
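As a rough illustration of the label-level comparison described in the Methods, the sketch below contrasts batch and single-item administration of an all-incorrect benchmark and scores the proportion of correct rejections. It is a hypothetical reconstruction under stated assumptions, not the study's code: query_model is a placeholder for whatever interface returns a model's reply, and the reply-parsing logic is an assumption about the response format.

```python
# Hypothetical sketch: batch vs. single-item administration of a benchmark
# whose statements are all incorrect by construction. `query_model` is a
# placeholder callable (prompt -> reply text), not a real API.

from typing import Callable, List


def score_labels(labels: List[str]) -> float:
    """Every benchmark statement is incorrect, so 'incorrect' is the only
    right label; return the proportion of correct rejections."""
    return sum(1 for lab in labels if lab == "incorrect") / len(labels)


def run_batch(statements: List[str], query_model: Callable[[str], str]) -> List[str]:
    """Submit all statements in a single prompt and parse one label per line
    of the reply (assumed format: '<number>. correct|incorrect')."""
    prompt = "Label each statement as correct or incorrect:\n" + "\n".join(
        f"{i + 1}. {s}" for i, s in enumerate(statements)
    )
    reply = query_model(prompt)
    return [line.strip().lower().split()[-1]
            for line in reply.splitlines() if line.strip()]


def run_single(statements: List[str], query_model: Callable[[str], str]) -> List[str]:
    """Submit each statement in its own prompt (one per session)."""
    labels = []
    for s in statements:
        reply = query_model(f"Label this statement as correct or incorrect: {s}")
        labels.append(reply.strip().lower())
    return labels
```

Under these assumptions, comparing score_labels(run_batch(...)) with score_labels(run_single(...)) reproduces only the label-level outcome; the study's expanded outcomes require textual analysis of the explanations, which this sketch does not capture.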
