Quantifying new threats to health and biomedical literature integrity from rapidly scaled publications and problematic research
Abstract
The last three years have seen an explosion in published manuscripts analysing open-access health datasets, in many cases presenting misleading or biologically implausible findings. A growing evidence base suggests that this is due in part to AI-assisted and formulaic workflows. Here we employ a top-down scientometric analysis to investigate which datasets have seen publication rates deviate from previous trends, especially where this coincides with changes in authors' geographical origins and increases in formulaic titles. Across 34 datasets we identify five showing hallmarks of paper mill exploitation (the FDA Adverse Event Reporting System, NHANES, UK Biobank, FinnGen and the Global Burden of Disease Study). In 2024, these five datasets had a combined count of 11,554 publications indexed in the PubMed database. This represents an excess of around five thousand publications above the trend forecast by an AutoRegressive Integrated Moving Average (ARIMA) model, and a 2.8-fold increase on the 4,001 publications for the same five datasets in 2021. We also identified a notable difference in the fold change for China (9.5x) versus the rest of the world (1.2x), and an increase in formulaic titles. These findings highlight potential risks to research integrity in areas such as public health and drug safety, and especially to the accessibility and interoperability principles central to Open Science and FAIR data practices. We argue that permissive open-access data policies naturally facilitate exploitative workflows, from direct API access for data dredging to large language model authoring of papers. These findings add to the case for adopting controlled data-access mechanisms combined with pre-registration of research protocols as safeguards, as previously used in the field of genomics, to reduce false discoveries and misleading conclusions. Such an approach would balance data availability with responsible use, preserving the goals of Open Science and mitigating disincentives for compliance with the FAIR Guiding Principles.
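The two headline quantities in the abstract, the fold change between years and the excess over a forecast, are simple arithmetic. The sketch below reproduces them from the counts quoted above. The function names are hypothetical, and the forecast value is back-calculated from the approximate excess ("around five thousand") rather than taken from the authors' ARIMA model, so it should be read as illustrative only.

```python
def fold_change(current: int, baseline: int) -> float:
    """Ratio of a current count to a baseline count."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return current / baseline


def excess_over_forecast(observed: float, forecast: float) -> float:
    """Observed count minus forecast count (positive means above trend)."""
    return observed - forecast


# Combined PubMed counts quoted in the abstract for the five flagged datasets.
count_2021 = 4_001
count_2024 = 11_554

ratio = fold_change(count_2024, count_2021)
print(f"2024 vs 2021 fold change: {ratio:.2f}x")

# ASSUMPTION: the abstract gives only an approximate excess, so this
# forecast is implied by subtraction and is not the authors' ARIMA output.
approx_excess = 5_000
implied_forecast = count_2024 - approx_excess
print(f"Implied 2024 forecast: about {implied_forecast:,} publications")
print(f"Excess check: {excess_over_forecast(count_2024, implied_forecast):,}")
```

The computed ratio is a little under 2.9, which the abstract reports as 2.8x. The same `fold_change` helper would apply to the per-region comparison (China versus the rest of the world), but the underlying regional counts are not given in the abstract, so they are not reproduced here.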