Large language models for abstract screening in systematic- and scoping reviews: A diagnostic test accuracy study

Christian Hedeager Krag
Trine Balschmidt
Frederik Bruun
Mathias Brejnebøl
Jack Junchi Xu
Mikael Boesen
Michael Brun Andersen
Felix Christoph Müller

Abstract

Introduction

We investigated whether large language models (LLMs) can be used for abstract screening in systematic and scoping reviews.

Methods

Two broad reviews were designed: a systematic review structured according to the PRISMA guideline, with abstract inclusion based on PICO criteria; and a scoping review, for which we defined abstract characteristics and features of interest to look for. For both reviews, 500 abstracts were sampled. Two readers independently screened the abstracts; disagreements were resolved by arbitration or consensus, and the resulting decisions served as the reference standard. The abstracts were analysed by six LLMs (GPT-4o, GPT-4T, GPT-3.5, Claude3-Opus, Claude3-Sonnet, and Claude3-Haiku). The primary outcomes were diagnostic test accuracy measures for abstract inclusion, abstract characterisation and feature of interest detection. The secondary outcome was the degree of automation achievable with LLMs as a function of the error rate.
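As an illustration of the primary outcomes, the sketch below computes sensitivity, specificity and accuracy from paired include/exclude decisions. It is a minimal sketch, not the authors' analysis code: the function and class names are hypothetical, and the toy decisions (6 of 12 reference inclusions found, plus 5 false positives among 500 abstracts) are invented purely to show how accuracy can stay high while sensitivity is low when few abstracts are included.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # reference include, screener include
    fp: int  # reference exclude, screener include
    tn: int  # reference exclude, screener exclude
    fn: int  # reference include, screener exclude


def confusion_counts(reference, predicted):
    """Tally a 2x2 table from paired decisions (True = include)."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    pairs = list(zip(reference, predicted))
    return ConfusionCounts(
        tp=sum(r and p for r, p in pairs),
        fp=sum((not r) and p for r, p in pairs),
        tn=sum((not r) and (not p) for r, p in pairs),
        fn=sum(r and (not p) for r, p in pairs),
    )


def sensitivity(c):
    # Assumes at least one reference inclusion exists.
    return c.tp / (c.tp + c.fn)


def specificity(c):
    # Assumes at least one reference exclusion exists.
    return c.tn / (c.tn + c.fp)


def accuracy(c):
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


# Hypothetical toy data: 500 abstracts, 12 reference inclusions.
reference = [True] * 12 + [False] * 488
# Invented screener output: finds 6 of the 12, adds 5 false positives.
predicted = [True] * 6 + [False] * 6 + [True] * 5 + [False] * 483

counts = confusion_counts(reference, predicted)
print(counts)
print(f"sensitivity={sensitivity(counts):.3f}")
print(f"specificity={specificity(counts):.3f}")
print(f"accuracy={accuracy(counts):.3f}")
```

With so few inclusions, the large number of correct exclusions dominates accuracy, which is why sensitivity is worth reporting alongside it.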

Results

In the systematic review, 12 studies were marked for inclusion by the human consensus. GPT-4o, GPT-4T, and Claude3-Opus achieved the highest accuracies (97% to 98%), comparable to the human readers (96% and 98%), although sensitivity was low (33% to 50%). In the scoping review, 130 features of interest were present, and the LLMs achieved sensitivities between 74% and 84%, comparable to the human readers (73% and 86%). The specificity of GPT-4o (98%) and GPT-4T (>99%) greatly surpassed that of the other LLMs (between 33% and 93%). For abstract characterisation, all LLMs achieved above 95% accuracy for language, manuscript type and study participant characterisation. For characterisation of disease-specific features, only GPT-4T and GPT-4o showed very high accuracy. For abstract inclusion, the highest automation rate (91%) at the lowest error rate (8%) was achieved by using two LLMs, with disagreements resolved by human arbitration. LLM pre-screening before human abstract screening achieved an automation rate of 55% with no missed abstracts.
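The two screening workflows mentioned above can be sketched as follows. This is a hedged illustration only: the abstract does not state how automation rate and error rate were calculated, so the definitions here are assumptions (automation rate as the share of abstracts decided without a human; error rate as the share of all abstracts where an automated decision differs from the reference standard, with human arbitration assumed to reproduce the reference). All function names and the ten toy decisions are hypothetical.

```python
def dual_llm_with_arbitration(llm_a, llm_b, reference):
    """Two LLMs screen every abstract; a human decides only on disagreement.

    Returns (automation_rate, error_rate) under the assumed definitions:
    agreed decisions are automated, and arbitration matches the reference.
    """
    n = len(reference)
    agreed = [i for i in range(n) if llm_a[i] == llm_b[i]]
    errors = sum(llm_a[i] != reference[i] for i in agreed)
    return len(agreed) / n, errors / n


def llm_prescreen(llm, reference):
    """The LLM excludes abstracts up front; humans screen the remainder.

    Returns (automation_rate, missed), where missed counts reference
    inclusions that the LLM excluded before any human saw them.
    """
    n = len(reference)
    auto_excluded = [i for i in range(n) if not llm[i]]
    missed = sum(reference[i] for i in auto_excluded)
    return len(auto_excluded) / n, missed


# Hypothetical toy decisions for ten abstracts (True = include).
T, F = True, False
reference = [T, T, F, F, F, F, F, F, F, F]
llm_a = [T, F, F, F, T, F, F, F, F, F]
llm_b = [T, T, F, F, T, F, F, T, F, F]

dual = dual_llm_with_arbitration(llm_a, llm_b, reference)
prescreen = llm_prescreen(llm_b, reference)
print("dual LLM + arbitration (automation, error):", dual)
print("pre-screening (automation, missed):", prescreen)
```

In the toy data the two models agree on eight of ten abstracts, one of which is a shared false inclusion, so a human would see only the two disagreements; the pre-screening model excludes six abstracts, none of which the reference standard included.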

Conclusion

Abstract characterisation and detection of specific features of interest are feasible and accurate with GPT-4o and GPT-4T. The majority of abstract screening for systematic reviews can be automated using LLMs at low error rates.

Version published to 10.1101/2024.10.01.24314702 on medRxiv
Oct 2, 2024

Related articles

Large Language Model Biases in Healthcare: A Scoping Review and Call for an Integrated Assessment Framework

This article has 8 authors:
1. Lu He
2. D. Phuong Do
3. Vishesh Girish Shet
4. Omar Farghaly
5. Priya Deshpande
6. Praveen Madiraju
7. Jiancheng Ye
8. Molly Beestrum
This article has no evaluations
Latest version Jan 16, 2026

Updated Approach to Error Rates in Systematic Review Screening: Integrating Active Learning, Large Language Models, and Full-Text Screening Data

This article has 5 authors:
1. Rutger Chris Neeleman
2. Berke Yazan
3. Emily Westerbeek
4. Wouter van Ballegooijen
5. Rens van de Schoot
This article has no evaluations
Latest version Jan 26, 2026

Unified tools for assessing the methodological quality of intervention effects in rapid reviews: a scoping review

This article has 10 authors:
1. Deborah Edwards
2. Emily C Clark
3. Judit Csontos
4. Maureen Dobbins
5. Elizabeth Gillen
6. Juliet Hounsome
7. Sarah E. Neil-Sztramko
8. Ruth Lewis
9. Mala Mann
10. Gillian Prue
This article has no evaluations
Latest version Feb 4, 2026
