GPT API Models Can Function as Highly Reliable Second Screeners of Titles and Abstracts in Systematic Reviews



Abstract

Independent human double screening of titles and abstracts is a critical step for ensuring the quality of systematic reviews and the meta-analyses based on them. However, double screening is a resource-demanding procedure that decelerates the review process. To alleviate this issue, we evaluated the use of OpenAI's GPT API models as an alternative to human second screeners of titles and abstracts. We did so by developing a new benchmark scheme for interpreting the performance of automated screening tools against common human screening performances in high-quality systematic reviews, and by conducting three large-scale experiments on three psychological systematic reviews of differing complexity. Across all experiments, we show that GPT API models can perform on par with, and in some cases even better than, typical human screeners in detecting relevant studies, while also showing high exclusion performance. In addition, we introduce multi-prompt screening, that is, making one concise prompt per inclusion/exclusion criterion in a review, and show that it can be a valuable tool for screening in highly complex review settings. To support future reviews, we develop a reproducible workflow and tentative guidelines for when reviewers can or cannot use GPT API models as independent second screeners of titles and abstracts. Moreover, we present the R package AIscreenR to standardize and scale up the suggested application. Our ultimate aim is to make GPT API models acceptable as independent second screeners within high-quality systematic reviews, such as those published in Psychological Bulletin.
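To make the multi-prompt idea concrete, the sketch below shows one possible shape of the workflow in R: one concise prompt per inclusion/exclusion criterion, each sent to the GPT API for a given title and abstract, with a record retained only if every criterion-specific prompt returns an inclusion vote. This is illustrative only; the criterion prompts are hypothetical, the all-criteria-pass aggregation rule is an assumption, and the authors' AIscreenR package may implement the workflow differently.

```r
# Minimal sketch of multi-prompt screening against the OpenAI chat
# completions endpoint, using httr2. Prompts and the decision rule
# are assumptions for illustration, not the paper's exact setup.
library(httr2)

# One concise prompt per inclusion/exclusion criterion (hypothetical examples).
criteria_prompts <- c(
  population = "Does this study involve school-aged children? Answer 1 for yes, 0 for no.",
  design     = "Is this study a randomized controlled trial? Answer 1 for yes, 0 for no."
)

# Send a single criterion prompt plus the title/abstract to the GPT API.
ask_gpt <- function(prompt, title, abstract, model = "gpt-4o-mini") {
  request("https://api.openai.com/v1/chat/completions") |>
    req_auth_bearer_token(Sys.getenv("OPENAI_API_KEY")) |>
    req_body_json(list(
      model = model,
      messages = list(list(
        role = "user",
        content = paste0(prompt, "\n\nTitle: ", title, "\nAbstract: ", abstract)
      ))
    )) |>
    req_perform() |>
    resp_body_json() |>
    (\(x) x$choices[[1]]$message$content)()
}

# Assumed aggregation rule: include a record only if every
# criterion-specific prompt votes for inclusion ("1").
screen_record <- function(title, abstract) {
  votes <- vapply(criteria_prompts, ask_gpt, character(1),
                  title = title, abstract = abstract)
  all(trimws(votes) == "1")
}
```

In practice, the AIscreenR package is intended to standardize and scale up this kind of screening call; consult its documentation for the supported interface rather than the raw-API sketch above.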
