Detecting Fraud-Associated Characteristics in the Medical AI Literature: A Multi-Signal NLP Framework Reveals Distinct Paper Mill Subtypes

Abstract

Paper mills increasingly compromise the integrity of the medical artificial intelligence (AI) literature. We developed a pre-registered, multi-signal natural language processing pipeline combining seven feature categories -- tortured phrases, structural formulaicity, AI-generated text markers, citation anomalies, cross-document similarity, co-authorship networks, and geographic metadata -- and applied it to 2,478 medical AI papers (2018-2025) labelled using Retraction Watch data via the Crossref Labs API. A Random Forest-XGBoost ensemble classifier achieved an average precision of 0.858 and an AUC-ROC of 0.916 on 5-fold cross-validation, but the most important features were writing-quality indicators (vocabulary diversity, reference count) rather than fraud-specific signals, reflecting the dominance of the 2021-2022 Hindawi mass retractions. Retraction subtype analysis revealed distinct fingerprints across fraud types: AI-generated content papers had twice the boilerplate density of other subtypes, while fake peer review papers had the highest co-authorship network density. Unsupervised clustering identified an "author pool" cluster (n=133, 47% retracted) with extreme co-author reuse (0.75 vs 0.11 corpus mean). Among unlabelled papers, 9.1% fell in the high or very high risk tiers. Prevalence was broadly uniform across WHO regions (15-20%) and robust to corpus definition. Four pre-registered sensitivity analyses confirmed robustness. The heterogeneity of paper mill operations -- synonym-substitution mills, AI content generators, and peer-review manipulation rings -- demands subtype-aware detection strategies. Code: https://doi.org/10.5281/zenodo.19488868. Pre-registration: https://doi.org/10.17605/OSF.IO/JB4T6.
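
For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how average precision and AUC-ROC can be computed from binary labels (1 = retracted, 0 = not retracted) and classifier scores. It is not the authors' implementation, which is available at the Zenodo link above; the function names `average_precision` and `auc_roc` and the toy data are hypothetical, and the average-precision routine assumes no tied scores.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at each rank where a positive occurs.

    Assumes scores are distinct (no special handling of ties).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank papers from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_pos


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: six papers, three of them retracted.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(round(average_precision(toy_labels, toy_scores), 3))
print(round(auc_roc(toy_labels, toy_scores), 3))
```

Average precision rewards ranking retracted papers near the top of the list, which makes it sensitive to class imbalance, whereas AUC-ROC measures pairwise ranking quality independently of the retraction rate. Reporting both, as the abstract does, gives a fuller picture than either alone.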
