Automating Screening of Titles and Abstracts in Systematic Reviews: An Assessment of GPT-4o mini
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Background
Systematic literature reviews (SLRs) are essential in medical research, but are often time-consuming and costly, necessitating more efficient methods while maintaining accuracy.
Objective
This study assessed the performance of a GPT-4o mini large language model (LLM) in automating the first phase of study selection based on titles and abstracts in systematic reviews. Specifically, we evaluated whether the model improved efficiency without compromising on quality.
Methods
Structured prompts were created for a GPT-4o mini LLM to facilitate title and abstract screening. The model’s performance was evaluated against expert human reviewers across five systematic reviews on inclusion rates, sensitivity, specificity, accuracy, positive predictive value, and negative predictive value.
Results
The model screened a total of 15,605 records. It included a higher percentage of studies than human screeners, with 3.5% (n=549/15,605) true positives and 14.2% (n=2,218/15,605) false positives. The model achieved an overall accuracy of 85.1%, with a sensitivity of 83.2% and specificity of 85.2%. The positive predictive value was 19.8%, while the negative predictive value was 99.1%. The model was able to screen 1,000 titles and abstracts in 40 minutes, compared to 16 hours required by a human reviewer.
Conclusion
This study demonstrated a strong performance and efficiency in the automation of title and abstract screening in SLRs using an advanced LLM. Further refinements could optimize the balance between sensitivity and specificity, supporting broader implementation in evidence synthesis. A hybrid AI-human approach is recommended to ensure accuracy, reduce reviewer burden, and maintain the methodological rigor required for high-quality SLRs.