Collaborative large language models for automated data extraction in living systematic reviews

Muhammad Ali Khan
Umair Ayub
Syed Arsalan Ahmed Naqvi
Kaneez Zahra Rubab Khakwani
Zaryab bin Riaz Sipra
Ammad Raina
Sihan Zhou
Huan He
Amir Saeidi
Bashar Hasan
Robert Bryan Rumble
Danielle S Bitterman
Jeremy L Warner
Jia Zou
Amye J Tevaarwerk
Konstantinos Leventakos
Kenneth L Kehl
Jeanne M Palmer
Mohammad Hassan Murad
Chitta Baral
Irbaz bin Riaz


Abstract

Objective

Data extraction from the published literature is the most laborious step in conducting living systematic reviews (LSRs). We aim to build a generalizable, automated data extraction workflow leveraging large language models (LLMs) that mimics the real-world 2-reviewer process.

Materials and Methods

A dataset of 10 trials (22 publications) from a published LSR was used, focusing on 23 variables related to trial, population, and outcomes data. The publications were split into a prompt development set (n = 5) and a held-out test set (n = 17). GPT-4-turbo and Claude-3-Opus were used for data extraction. Responses from the 2 LLMs were considered concordant if they were the same for a given variable. The discordant responses from each LLM were provided to the other LLM for cross-critique. Accuracy, ie, the total number of correct responses divided by the total number of responses, was computed to assess performance.
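
The concordance check and the accuracy metric described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the whitespace- and case-insensitive comparison, and the toy values are all assumptions made for illustration, since the abstract only states that responses were concordant if they were "the same".

```python
from typing import Dict, List, Tuple


def normalize(value: str) -> str:
    """Assumed comparison rule: ignore case and extra whitespace."""
    return " ".join(value.strip().lower().split())


def split_by_concordance(
    responses_a: Dict[str, str], responses_b: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (concordant, discordant) variable names for two extractors."""
    concordant, discordant = [], []
    for variable in responses_a:
        if normalize(responses_a[variable]) == normalize(responses_b[variable]):
            concordant.append(variable)
        else:
            discordant.append(variable)
    return concordant, discordant


def accuracy(
    responses: Dict[str, str], gold: Dict[str, str], variables: List[str]
) -> float:
    """Correct responses divided by total responses, over the given variables."""
    if not variables:
        return float("nan")
    correct = sum(
        normalize(responses[v]) == normalize(gold[v]) for v in variables
    )
    return correct / len(variables)


# Toy values, invented for illustration only (not data from the study).
model_a = {"trial_name": "TRIAL-A", "sample_size": "100", "median_age": "68"}
model_b = {"trial_name": "trial-a", "sample_size": "100", "median_age": "70"}
gold = {"trial_name": "TRIAL-A", "sample_size": "100", "median_age": "70"}

concordant, discordant = split_by_concordance(model_a, model_b)
concordant_accuracy = accuracy(model_a, gold, concordant)
discordant_accuracy_a = accuracy(model_a, gold, discordant)
discordant_accuracy_b = accuracy(model_b, gold, discordant)
```

In a workflow like the one described, the variables in `discordant` would be the ones passed to the other model for cross-critique; that step is omitted here because the abstract does not specify the prompts used.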

Results

In the prompt development set, 110 (96%) responses were concordant, achieving an accuracy of 0.99 against the gold standard. In the test set, 342 (87%) responses were concordant. The accuracy of the concordant responses was 0.94. The accuracy of the discordant responses was 0.41 for GPT-4-turbo and 0.50 for Claude-3-Opus. Of the 49 discordant responses, 25 (51%) became concordant after cross-critique, increasing accuracy to 0.76.
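
The reported counts can be cross-checked with simple arithmetic. The sketch below assumes each publication contributes one response per variable (23 variables per publication); that assumption is not stated in the abstract, but it is consistent with the reported totals (342 concordant + 49 discordant = 391 = 23 × 17).

```python
# Counts taken from the abstract.
n_variables = 23
dev_publications, test_publications = 5, 17
dev_concordant, test_concordant = 110, 342
became_concordant = 25

# Assumption: one response per variable per publication.
dev_total = n_variables * dev_publications
test_total = n_variables * test_publications
test_discordant = test_total - test_concordant


def percent(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


dev_concordant_pct = percent(dev_concordant, dev_total)
test_concordant_pct = percent(test_concordant, test_total)
resolved_pct = percent(became_concordant, test_discordant)
```

Under that assumption, the derived totals (115 development responses, 391 test responses, 49 discordant test responses) and the rounded percentages (96%, 87%, 51%) agree with the figures reported in the Results.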

Discussion

Concordant responses by the LLMs are likely to be accurate. In instances of discordant responses, cross-critique can further increase the accuracy.

Conclusion

Large language models, when arranged in a collaborative workflow that simulates the 2-reviewer process, can extract data with reasonable performance, enabling truly “living” systematic reviews.

Version published to 10.1093/jamia/ocae325
Jan 21, 2025
Version published to 10.1101/2024.09.20.24314108 on medRxiv
Sep 23, 2024

