Capability of chatbots powered by large language models to support the screening process of scoping reviews: a feasibility study

Abstract

The recent surge in publications increases the screening time required to maintain up-to-date and high-quality literature reviews. One of the most time-consuming phases is the screening of titles and abstracts. With the support of machine learning tools, this process has been semi-automated for systematic reviews, but with limited success for scoping reviews. ChatGPT, a large language model, might support scoping review screening with its ability to identify key concepts and themes within texts. We hypothesise that ChatGPT’s performance in abstract screening surpasses that of the semi-automated tool Rayyan, increasing efficiency at acceptable costs while maintaining a low type II error. In our retrospective analysis, ChatGPT 4.0 decided upon 15 306 abstracts, vastly outperforming Rayyan. ChatGPT demonstrated high levels of accuracy (68%), specificity (67%) and sensitivity (88–89%) and a negative predictive value of 99% when compared to human researchers’ decisions. The workload savings were 64% at reasonable costs. Despite the promising results, human oversight remains paramount, as ChatGPT’s decisions resulted in an 11% false negative rate. A hybrid screening approach combining human raters and ChatGPT might ensure accuracy and quality while enhancing efficiency. Further research on ChatGPT’s parameters, the prompts and screening scenarios is necessary in order to validate these results and to develop a standardised approach.
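
The metrics reported above can all be derived from a 2×2 confusion matrix that compares the model's include/exclude decisions with the human researchers' decisions. The following is a minimal sketch of those standard definitions, not the authors' analysis code. The names `ConfusionCounts` and `screening_metrics` are hypothetical, and the counts in the usage example are purely illustrative and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of model decisions, with human decisions as the reference."""
    tp: int  # model includes, humans include
    fp: int  # model includes, humans exclude
    tn: int  # model excludes, humans exclude
    fn: int  # model excludes, humans include (type II error)


def screening_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Standard screening metrics computed from a 2x2 confusion matrix."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "negative_predictive_value": c.tn / (c.tn + c.fn),
        "false_negative_rate": c.fn / (c.tp + c.fn),
    }


# Illustrative counts only (assumption): 100 abstracts, 10 relevant.
example = ConfusionCounts(tp=9, fp=30, tn=60, fn=1)
metrics = screening_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

By these definitions the false negative rate is one minus the sensitivity, which is consistent with the reported sensitivity of 88–89% alongside a false negative rate of about 11%. A very high negative predictive value can coexist with moderate specificity when relevant abstracts are rare, as is typical in title and abstract screening.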
