Automating Screening of Titles and Abstracts in Systematic Reviews: An Assessment of GPT-4o mini

Mir Sohail Fazeli
Ellen Kasireddy
Mir-Masoud Pourrahmat
Cuthbert Chow
Jean-Paul Collet

Abstract

Background

Systematic literature reviews (SLRs) are essential in medical research, but they are often time-consuming and costly, creating a need for more efficient methods that maintain accuracy.

Objective

This study assessed the performance of the GPT-4o mini large language model (LLM) in automating the first phase of study selection in systematic reviews, namely screening based on titles and abstracts. Specifically, we evaluated whether the model improved efficiency without compromising quality.

Methods

Structured prompts were created for GPT-4o mini to perform title and abstract screening. The model’s decisions were compared with those of expert human reviewers across five systematic reviews, using inclusion rate, sensitivity, specificity, accuracy, positive predictive value, and negative predictive value as performance measures.

Results

The model screened a total of 15,605 records. It included a higher percentage of studies than human screeners: 3.5% of all records (n=549/15,605) were true positives and 14.2% (n=2,218/15,605) were false positives. The model achieved an overall accuracy of 85.1%, with a sensitivity of 83.2% and a specificity of 85.2%. The positive predictive value was 19.8%, while the negative predictive value was 99.1%. The model screened 1,000 titles and abstracts in 40 minutes, compared with the 16 hours required by a human reviewer.
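To make the relationship between these figures concrete, the following minimal Python sketch recomputes the reported metrics from a 2x2 confusion matrix. Only the total (15,605), the true positives (549), and the false positives (2,218) are stated in the abstract. The false-negative count of 111 is an assumption, back-calculated here as the integer consistent with the reported 83.2% sensitivity, and the true-negative count follows from it by subtraction. The class and function names are illustrative and not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of screening decisions, with human reviewers as the reference."""
    tp: int  # included by both the model and the human reviewers
    fp: int  # included by the model, excluded by the human reviewers
    fn: int  # excluded by the model, included by the human reviewers
    tn: int  # excluded by both


def screening_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Return the standard diagnostic-accuracy measures as fractions."""
    total = c.tp + c.fp + c.fn + c.tn
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
    }


# Reported in the abstract.
TOTAL = 15_605
TP = 549
FP = 2_218

# ASSUMPTION: not reported in the abstract. 111 is the integer count
# consistent with the reported 83.2% sensitivity (549 / (549 + 111)).
FN = 111
TN = TOTAL - TP - FP - FN  # derived from the assumption above

counts = ConfusionCounts(tp=TP, fp=FP, fn=FN, tn=TN)
metrics = screening_metrics(counts)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")

# Throughput comparison for 1,000 records, using the reported times.
model_minutes = 40
human_minutes = 16 * 60
speedup = human_minutes / model_minutes
print(f"speedup: {speedup:.0f}x")
```

Under that assumption, the recomputed values round to the reported percentages, and the reported screening times imply a roughly 24-fold reduction per 1,000 records. The low positive predictive value alongside the high negative predictive value reflects how few records are truly eligible: false positives outnumber true positives about four to one even at 85.2% specificity.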

Conclusion

This study demonstrated strong performance and efficiency in automating title and abstract screening for SLRs with an advanced LLM. Further refinements could optimize the balance between sensitivity and specificity, supporting broader implementation in evidence synthesis. A hybrid AI-human approach is recommended to ensure accuracy, reduce reviewer burden, and maintain the methodological rigor required for high-quality SLRs.

Version published to 10.64898/2026.05.15.26353334 on medRxiv
May 20, 2026

Related articles

Audited large language model triage for systematic review screening in national clinical guideline production: validation and prospective deployment

This article has 9 authors:
1. Petter Fagerberg
2. Oscar Sallander
3. Kim Vikhe Patil
4. Charlotta Thunborg
5. Lina Lundström
6. Anders Berg
7. Anastasia Nyman
8. Natalia Borg
9. Thomas Lindén
This article has no evaluations. Latest version: Jun 3, 2026

Comparing Artificial Intelligence versus Human Screening in Systematic Reviews

This article has 3 authors:
1. Marco Gorici
2. Abel Torres-Espin
3. Mark Oremus
This article has no evaluations. Latest version: Jul 2, 2026

Uncertainty-aware extraction of clinical findings from Finnish EHRs using open large language models

This article has 5 authors:
1. Jussi Leinonen
2. Juha Knuuttila
3. Siina Pamilo
4. Samu Kurki
5. Miika Koskinen
This article has no evaluations. Latest version: Jul 9, 2026
