Using OpenAI Models for Abstract Screening


Abstract

Large Language Models (LLMs) are increasingly being integrated into tools that assist with systematic literature reviews, yet few empirical evaluations exist on their effectiveness for abstract screening, particularly within social science domains. This study evaluates the performance and cost of three OpenAI models (GPT-3.5 Turbo, GPT-4 Turbo, and GPT-4o Mini) in classifying the relevance of abstracts in a real-world literature review on "net widening" and diversion programs in the criminal justice system. Using a batch inference pipeline, we tested models with both short and long prompt formats, assessing classification accuracy, precision, recall, and cost in 2024 USD. Our results show that while accuracy and recall were relatively high across all models (up to 90% accuracy and 95% recall), precision was lower, particularly for GPT-3.5 with long prompts, suggesting that while LLMs can support abstract screening, they are not yet a substitute for trained human reviewers in high-stakes systematic reviews. Notably, the low-cost GPT-4o Mini achieved near-parity in performance with GPT-4 Turbo, indicating promising potential for rapid, exploratory review workflows. This paper offers practical benchmarks, cost analyses, and recommendations for integrating LLMs into evidence review pipelines, emphasizing the need for thoughtful, transparent use of these tools in academic research.
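
The abstract describes a batch inference pipeline for relevance classification. As a rough illustration of the underlying classification step only, the sketch below sends one abstract at a time to an OpenAI chat model and asks for a relevance label. The system prompt wording, the RELEVANT/IRRELEVANT label scheme, and the screen_abstract helper are illustrative assumptions, not the prompts or pipeline used in the study, which submitted requests in batches and compared short and long prompt formats.

# Minimal per-abstract screening sketch using the OpenAI Python SDK (v1).
# Prompt text, labels, and the example abstract are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are screening abstracts for a systematic review on net widening and "
    "diversion programs in the criminal justice system. "
    "Reply with exactly one word: RELEVANT or IRRELEVANT."
)

def screen_abstract(abstract: str, model: str = "gpt-4o-mini") -> str:
    """Classify a single abstract as RELEVANT or IRRELEVANT."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the classification output as deterministic as possible
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": abstract},
        ],
    )
    return response.choices[0].message.content.strip().upper()

if __name__ == "__main__":
    example = "This study examines whether juvenile diversion programs expand formal social control."
    print(screen_abstract(example))

Swapping the model argument (for example, "gpt-4o-mini" versus "gpt-4-turbo") is one simple way to reproduce the kind of cross-model comparison the study reports; computing precision and recall then only requires pairing these labels with human screening decisions.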
