Revolutionizing Peer Review: A Comparative Analysis of ChatGPT and Human Review Reports in Scientific Publishing

Abstract

The use of ChatGPT-4 and other large language models (LLMs) in academic writing has raised questions regarding their capacity to facilitate the peer review process. This research article compares AI-generated peer review reports (produced with ChatGPT-4 with tools) with traditional review reports written by humans. The review reports were received for 198 manuscripts submitted to the European Scientific Journal (ESJ) between January and September 2024. Each manuscript underwent a parallel evaluation process: first by the journal’s human reviewers and afterwards by the paid version of ChatGPT-4 with tools. Each manuscript received review reports from at least two human reviewers but only one AI-generated review report. Both review types used the ESJ’s standardized evaluation form, and ChatGPT-4 was prompted to review the papers objectively and critically. Statistical analyses were conducted to compare the grades assigned to different parts of the manuscripts, the distributions of recommendations, and the consistency between the AI and human review reports. The Kolmogorov-Smirnov test, Pearson chi-square test, Mann-Whitney U test, and Cohen’s kappa were used to analyze the data. Results showed that ChatGPT-4 consistently awarded higher grades and was less rigorous than the human reviewers. The ChatGPT-4 review reports mostly recommended minor revisions and never recommended rejection of a manuscript. Human reviewers, on the other hand, demonstrated a more balanced distribution of recommendations, including stricter score evaluations; however, a lack of agreement between the human review reports was also registered. While LLM tools can enhance the efficiency of the peer review process, their ability to uphold rigorous academic standards remains limited. Editors who use LLM tools as reviewers must remain vigilant and not base their decisions solely on LLM-generated reports. The existing version of ChatGPT-4 is not trained for peer review and cannot replace human expertise; however, under human oversight it can serve as an assistant that provides useful comments and recommendations for improving manuscript content. Future research should focus on LLM tools trained for peer review in various academic fields, as well as on ethical frameworks for LLM integration in peer review.
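For readers who want to reproduce this kind of comparison, the sketch below shows how the four reported tests could be applied to grade and recommendation data in Python. It is a minimal illustration, not the authors' analysis code: the data values, category labels, and variable names are hypothetical assumptions.

```python
# Minimal sketch (not the study's code) of the reported statistical comparisons.
import numpy as np
from scipy.stats import ks_2samp, mannwhitneyu, chi2_contingency
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-manuscript grades (e.g., on the ESJ evaluation-form scale).
human_grades = np.array([3, 4, 2, 5, 3, 4, 2, 3])
ai_grades    = np.array([4, 5, 4, 5, 4, 5, 4, 4])

# Distribution comparison (Kolmogorov-Smirnov) and rank-based comparison (Mann-Whitney U).
ks_stat, ks_p = ks_2samp(human_grades, ai_grades)
u_stat, u_p = mannwhitneyu(human_grades, ai_grades, alternative="two-sided")

# Recommendation counts cross-tabulated as a contingency table (illustrative numbers).
# Rows: reviewer type (human, AI); columns: accept / minor revision / major revision / reject.
recommendations = np.array([[20, 80, 60, 38],
                            [30, 150, 18, 0]])
chi2, chi2_p, dof, _ = chi2_contingency(recommendations)

# Agreement between two reviewers' recommendations on the same manuscripts (Cohen's kappa).
reviewer_a = ["minor", "major", "reject", "minor", "accept"]
reviewer_b = ["minor", "minor", "major", "minor", "accept"]
kappa = cohen_kappa_score(reviewer_a, reviewer_b)

print(ks_p, u_p, chi2_p, kappa)
```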
