Human Researchers are Superior to Large Language Models in Writing a Systematic Review in a Comparative Multitask Assessment
Abstract
Background: The capability of Large Language Models (LLMs) to support and facilitate research activities has sparked growing interest in their integration into scientific workflows. This paper evaluates the performance of 6 different LLMs in carrying out the tasks required to produce a systematic literature review and compares it against that of human researchers.

Methods: The evaluation of the 6 LLMs was split into 3 tasks: literature search, article screening and selection (task 1); data extraction and analysis (task 2); final paper drafting (task 3). Their results were compared with a human-produced systematic review on the same topic, which served as the reference standard. The evaluation was repeated in two rounds to assess the reproducibility of the LLMs and their improvement over time.

Results: Of the 18 scientific articles to be retrieved from the literature in task 1, the best-performing LLM identified 13. Data extraction and analysis in task 2 were only partially accurate and cumbersome. The full papers generated by the LLMs in task 3 were short and uninspired, and often did not fully adhere to the standard template for a systematic review.

Conclusion: At present, LLMs are not capable of independently conducting a scientific systematic review. However, their capabilities are advancing rapidly, and with appropriate supervision they can provide valuable support throughout the review process.