Automation of Systematic Reviews with Large Language Models

Abstract

Systematic reviews (SRs) inform evidence-based decision making. Yet they take over a year to complete, are prone to human error, and face challenges with reproducibility, limiting access to timely and reliable information. We developed otto-SR, an end-to-end agentic workflow using large language models (LLMs) to support and automate the SR workflow from initial search to analysis. We found that otto-SR outperformed traditional dual human workflows in SR screening (otto-SR: 96.7% sensitivity, 97.9% specificity; human: 81.7% sensitivity, 98.1% specificity) and data extraction (otto-SR: 93.1% accuracy; human: 79.7% accuracy). Using otto-SR, we reproduced and updated an entire issue of Cochrane reviews (n=12) in two days, representing approximately 12 work-years of traditional systematic review work. Across Cochrane reviews, otto-SR incorrectly excluded a median of 0 studies (IQR 0 to 0.25) and found a median of 2.0 (IQR 1 to 6.5) eligible studies likely missed by the original authors. Meta-analyses revealed that otto-SR generated newly statistically significant findings in 2 reviews and negated significance in 1 review. These findings demonstrate that LLMs can rapidly conduct and update systematic reviews with superhuman performance, laying the foundation for automated, scalable, and reliable evidence synthesis.
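
The screening comparison above rests on standard sensitivity and specificity definitions applied to per-study include/exclude decisions. As an illustration only, and not the authors' code, a minimal Python sketch of how such metrics can be computed against gold-standard inclusion labels is shown below; the function and variable names are hypothetical.

```python
# Minimal sketch (not the otto-SR implementation): computing screening sensitivity
# and specificity from binary include/exclude decisions against gold-standard labels.
# Names such as `decisions` and `gold_labels` are hypothetical.

def screening_metrics(decisions, gold_labels):
    """Return (sensitivity, specificity) for binary screening decisions.

    decisions, gold_labels: sequences of booleans, True = study is eligible/included.
    """
    pairs = list(zip(decisions, gold_labels))
    tp = sum(d and g for d, g in pairs)                  # eligible studies correctly included
    tn = sum((not d) and (not g) for d, g in pairs)      # ineligible studies correctly excluded
    fn = sum((not d) and g for d, g in pairs)            # eligible studies incorrectly excluded
    fp = sum(d and (not g) for d, g in pairs)            # ineligible studies incorrectly included
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Example: a screen that misses 1 of 10 eligible studies and wrongly keeps 2 of 90
# ineligible ones yields sensitivity 0.90 and specificity of roughly 0.978.
```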
