AutoReporter: Development and validation of an artificial intelligence tool for automated research reporting guideline assessment

Abstract

Incomplete reporting can compromise reproducibility and peer review, cornerstones of scientific research. Evaluating the adherence of published manuscripts to established reporting guidelines remains a resource-intensive and subjective task. To address this challenge, we created AutoReporter, a large language model (LLM)-based system that automates item-wise assessment of reporting guideline adherence. Benchmarking reasoning and general-purpose LLMs across eight combinations of prompt-engineering and retrieval-augmented generation methods against human ratings in the SPIRIT-CONSORT-TM corpus showed that a zero-shot, no-retrieval prompt coupled with the reasoning LLM o3-mini, the configuration designated AutoReporter, offered the optimal balance of mean classification performance (CONSORT accuracy: 90.09%; SPIRIT accuracy: 92.07%), run-time (CONSORT: 617.26 seconds; SPIRIT: 544.51 seconds), and cost (CONSORT: 0.68 USD; SPIRIT: 0.65 USD). Evaluation of AutoReporter on BenchReport, a benchmark dataset of 10 systematic reviews assessing the adherence of 506 included studies to 10 reporting guidelines, demonstrated a mean weighted accuracy of 91.8% and substantial inter-assessor agreement with human ratings in 9 of 10 reviews. These results establish the feasibility of using LLMs to automate reporting guideline adherence assessments, enabling scalable quality control in scientific research reporting. AutoReporter is publicly accessible at https://autoreporter.streamlit.app.
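
To illustrate the general approach described in the abstract, the following is a minimal sketch of an item-wise, zero-shot, no-retrieval adherence check built on the OpenAI Python SDK. The model identifier, prompt wording, example checklist item, and the `assess_item` helper are illustrative assumptions for this sketch and are not the authors' implementation.

```python
# Minimal sketch of an item-wise, zero-shot, no-retrieval adherence check.
# Assumptions (not from the article): the OpenAI Python SDK, the "o3-mini"
# model identifier, the prompt wording, and the example checklist item.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def assess_item(manuscript_text: str, guideline: str, item: str) -> str:
    """Ask the model whether a single reporting-guideline item is addressed."""
    prompt = (
        f"You are assessing adherence to the {guideline} reporting guideline.\n"
        f"Checklist item: {item}\n\n"
        "Manuscript text:\n"
        f"{manuscript_text}\n\n"
        "Answer with 'reported' or 'not reported', followed by a one-sentence "
        "justification quoting the relevant passage if present."
    )
    response = client.chat.completions.create(
        model="o3-mini",  # a reasoning model; any available model could be substituted
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Example usage with a hypothetical CONSORT 2010 item:
# verdict = assess_item(
#     open("trial_report.txt").read(),
#     "CONSORT 2010",
#     "Item 6a: Completely defined pre-specified primary and secondary outcome measures",
# )
# print(verdict)
```

In practice, a full assessment would loop this call over every item of the chosen checklist and aggregate the per-item verdicts into an adherence score.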
