Utilizing ChatGPT to select literature for meta-analysis reduces workload while maintaining a recall level similar to manual curation

Abstract

Background

Large language models (LLMs) such as ChatGPT have shown great potential in aiding medical research. The research process of evidence-based medicine, especially meta-analysis, requires a heavy workload of record filtering. However, no study has attempted to use LLMs to help screen records in meta-analysis.

Objective

In this research, we aimed to explore the possibility of incorporating ChatGPT to facilitate the title-and-abstract screening step of meta-analysis.

Methods

To assess our strategy, we selected three meta-analyses from the literature, together with a glioma meta-analysis embedded in this study as additional validation. For the automatic selection of records from curated meta-analyses, we developed a four-step strategy called LARS-GPT, consisting of (1) criteria selection and single-prompt (prompt with one criterion) creation, (2) identification of the best combination of criteria, (3) combined-prompt (prompt with one or more criteria) creation, and (4) request sending and answer summary, as sketched below. Recall, workload reduction, precision, and F1 score were calculated to assess the performance of LARS-GPT.
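As an illustration of steps (3) and (4), here is a minimal sketch assuming the OpenAI Python client. The prompt wording, the `gpt-3.5-turbo` model choice, and the yes/no decision rule are illustrative assumptions, not the exact LARS-GPT configuration described in the paper.

```python
# Minimal sketch of combined-prompt creation and the request/answer-summary
# step. Prompt wording, model, and decision rule are assumptions for
# illustration only, not the study's exact configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_combined_prompt(title: str, abstract: str, criteria: list[str]) -> str:
    """Bundle one or more selection criteria into a single prompt."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(criteria, 1))
    return (
        f"Selection criteria:\n{numbered}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Does this record satisfy ALL of the criteria above? "
        "Answer only 'yes' or 'no'."
    )

def screen_record(title: str, abstract: str, criteria: list[str]) -> bool:
    """Send the combined prompt to ChatGPT and summarize the answer."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; not specified in this abstract
        messages=[{"role": "user",
                   "content": build_combined_prompt(title, abstract, criteria)}],
        temperature=0,  # deterministic output helps reproducible screening
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer.startswith("yes")  # True = keep for full-text review
```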

Results

Performance varied between single-prompts, with a mean recall of 0.841. Based on these single-prompts, we were able to find combinations performing better than the pre-set threshold. With the best combination of criteria identified, LARS-GPT achieved an average workload reduction of 39.5% while maintaining a recall greater than 0.9.
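For reference, here is a minimal sketch of how these metrics could be computed from screening labels. Treating workload reduction as the share of records the model excludes (and a reviewer therefore skips) is an assumption, since the abstract does not state the formula.

```python
# Illustrative metric computation from screening labels.
# Assumption: "workload reduction" = fraction of records the model excludes,
# i.e. records a human reviewer no longer needs to read.
def evaluate(predicted_include: list[bool], truly_relevant: list[bool]) -> dict:
    tp = sum(p and t for p, t in zip(predicted_include, truly_relevant))
    fp = sum(p and not t for p, t in zip(predicted_include, truly_relevant))
    fn = sum(not p and t for p, t in zip(predicted_include, truly_relevant))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    workload_reduction = 1 - sum(predicted_include) / len(predicted_include)
    return {"recall": recall, "precision": precision,
            "f1": f1, "workload_reduction": workload_reduction}
```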

Conclusions

We demonstrate, for the first time, that automatic selection of literature for meta-analysis is possible with ChatGPT. We provide this capability as a pipeline, LARS-GPT, which achieves a substantial workload reduction while maintaining a pre-set recall.