Paper mill or paper mine? A tentative answer to the sharp increase in research papers based on the Global Burden of Disease database

Abstract

A number of recent studies have raised concerns about the exploitation ("mining") of public health databases for low-quality, mass-produced research papers. However, it remains challenging to determine whether such papers originate from paper mills (commercial entities that sell authorships on mass-produced papers) or from the uncoordinated action of individuals facilitated by emerging technologies such as large language models. Our study aims to address this question for the case of one particular health database, the Global Burden of Disease Study (GBD). We selected this database after noticing that one of our own papers, on Bayesian age-period-cohort models (BAPC), was experiencing a rapid surge in geographically clustered citations from GBD papers with Chinese affiliations. To address the question, we collected bibliometric and article-level metadata from GBD papers to search for indicators of mass-produced research. Moreover, we assessed 713 full-text articles for reported R versions, availability of code and data, and declaration of generative AI use. Finally, we qualitatively assessed 180 articles for similarities in reported figures. Our findings suggest that the observed increase in GBD publications from China is not driven by a specific paper mill. Rather, it appears that many of these articles were created by independent authors using shared resources. One possibility is the use of proprietary software sold by a Guangzhou-based tech company that specializes in streamlining research workflows using AI writing tools and R packages tailored for the mass production of articles from specific public health databases, including but not limited to GBD. However, since more than 99% of the authors did not share their analysis code publicly, it remains challenging to determine which authors used which packages for their GBD analysis. Our study thus demonstrates the vital importance of free and open-source software and open code sharing for transparent, trustworthy, and reproducible research.
