Using OpenAI Models for Abstract Screening


Abstract

Large Language Models (LLMs) are increasingly being integrated into tools that assist with systematic literature reviews, yet few empirical evaluations exist on their effectiveness for abstract screening, particularly within social science domains. This study evaluates the performance and cost of three OpenAI models (GPT-3.5 Turbo, GPT-4 Turbo, and GPT-4o Mini) in classifying the relevance of abstracts in a real-world literature review on "net widening" and diversion programs in the criminal justice system. Using a batch inference pipeline, we tested models with both short and long prompt formats, assessing classification accuracy, precision, recall, and cost in 2024 USD. Our results show that while accuracy and recall were relatively high across all models (up to 90% accuracy and 95% recall), precision was lower, particularly for GPT-3.5 with long prompts, suggesting that while LLMs can support abstract screening, they are not yet a substitute for trained human reviewers in high-stakes systematic reviews. Notably, the low-cost GPT-4o Mini achieved near-parity in performance with GPT-4 Turbo, indicating promising potential for rapid, exploratory review workflows. This paper offers practical benchmarks, cost analyses, and recommendations for integrating LLMs into evidence review pipelines, emphasizing the need for thoughtful, transparent use of these tools in academic research.
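
The abstract describes a batch inference pipeline for relevance classification. As a rough illustration of the underlying classification step only, the sketch below sends one abstract at a time to an OpenAI chat model and asks for a relevance label. The system prompt wording, the RELEVANT/IRRELEVANT label scheme, and the screen_abstract helper are illustrative assumptions, not the prompts or pipeline used in the study, which submitted requests in batches and compared short and long prompt formats.

# Minimal per-abstract screening sketch using the OpenAI Python SDK (v1).
# Prompt text, labels, and the example abstract are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are screening abstracts for a systematic review on net widening and "
    "diversion programs in the criminal justice system. "
    "Reply with exactly one word: RELEVANT or IRRELEVANT."
)

def screen_abstract(abstract: str, model: str = "gpt-4o-mini") -> str:
    """Classify a single abstract as RELEVANT or IRRELEVANT."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the classification output as deterministic as possible
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": abstract},
        ],
    )
    return response.choices[0].message.content.strip().upper()

if __name__ == "__main__":
    example = "This study examines whether juvenile diversion programs expand formal social control."
    print(screen_abstract(example))

Swapping the model argument (for example, "gpt-4o-mini" versus "gpt-4-turbo") is one simple way to reproduce the kind of cross-model comparison the study reports; computing precision and recall then only requires pairing these labels with human screening decisions.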
