Large language models for abstract screening in systematic- and scoping reviews: A diagnostic test accuracy study
Abstract
Introduction
We investigated whether large language models (LLMs) can be used for abstract screening in systematic and scoping reviews.
Methods
Two broad reviews were designed: a systematic review structured according to the PRISMA guideline, with abstract inclusion based on PICO criteria; and a scoping review, for which we defined abstract characteristics and features of interest to look for. For each review, 500 abstracts were sampled. Two readers independently screened the abstracts, with disagreements resolved by arbitration or consensus; the resulting decisions served as the reference standard. The abstracts were analysed by six LLMs (GPT-4o, GPT-4T, GPT-3.5, Claude3-Opus, Claude3-Sonnet, and Claude3-Haiku). Primary outcomes were diagnostic test accuracy measures for abstract inclusion, abstract characterisation, and feature-of-interest detection. The secondary outcome was the degree of automation achievable with LLMs as a function of the error rate.
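As a minimal sketch of the diagnostic test accuracy measures named above, the following computes sensitivity, specificity, and accuracy of a screener's include/exclude decisions against a reference standard. The names (`ConfusionCounts`, `count_confusion`) and the example counts are hypothetical and chosen for illustration only; they are not the study's data or code. The example shows how, when few abstracts are true includes, accuracy can be high while sensitivity stays low.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of screener decisions versus the reference standard."""
    tp: int  # included by both screener and reference
    fp: int  # included by screener, excluded by reference
    fn: int  # excluded by screener, included by reference
    tn: int  # excluded by both


def count_confusion(reference: Sequence[bool], predicted: Sequence[bool]) -> ConfusionCounts:
    """Tally the 2x2 table; True means 'include'."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    pairs = list(zip(reference, predicted))
    return ConfusionCounts(
        tp=sum(r and p for r, p in pairs),
        fp=sum((not r) and p for r, p in pairs),
        fn=sum(r and (not p) for r, p in pairs),
        tn=sum((not r) and (not p) for r, p in pairs),
    )


def sensitivity(c: ConfusionCounts) -> float:
    # Share of true includes that the screener found (assumes tp + fn > 0).
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    # Share of true excludes that the screener rejected (assumes tn + fp > 0).
    return c.tn / (c.tn + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    # Share of all decisions that match the reference standard.
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


# Illustrative, made-up decisions: 500 abstracts, 12 true includes.
# The hypothetical screener finds 6 of the 12 and wrongly includes 4 others.
reference = [True] * 12 + [False] * 488
predicted = [True] * 6 + [False] * 6 + [True] * 4 + [False] * 484

counts = count_confusion(reference, predicted)
print(f"sensitivity={sensitivity(counts):.2f}")
print(f"specificity={specificity(counts):.4f}")
print(f"accuracy={accuracy(counts):.2f}")
```

With these illustrative counts, 490 of 500 decisions match the reference (98% accuracy) even though only half of the true includes are found, which is why sensitivity is reported alongside accuracy.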
Results
In the systematic review, 12 studies were marked for inclusion by the human consensus. GPT-4o, GPT-4T, and Claude3-Opus achieved the highest accuracies (97% to 98%), comparable to the human readers (96% and 98%), although sensitivity was low (33% to 50%). In the scoping review, 130 features of interest were present, and the LLMs achieved sensitivities between 74% and 84%, comparable to the human readers (73% and 86%). The specificity of GPT-4o (98%) and GPT-4T (>99%) greatly surpassed that of the other LLMs (between 33% and 93%). For abstract characterisation, all LLMs achieved above 95% accuracy for language, manuscript type and study participant characterisation. For characterisation of disease-specific features, only GPT-4T and GPT-4o showed very high accuracy. For abstract inclusion, the highest automation rate (91%) at the lowest error rate (8%) was achieved by using two LLMs, with disagreements resolved by human arbitration. LLM pre-screening before human abstract screening achieved an automation rate of 55% with no missed abstracts.
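The two-LLM workflow with human arbitration can be sketched as follows. The abstract does not give the exact definitions of automation rate and error rate, so this sketch assumes that the automation rate is the share of abstracts on which both LLMs agree (and so need no human), and that the error rate is the share of those automated decisions that differ from the reference standard. The function name `dual_llm_screen` and the example decisions are hypothetical.

```python
from typing import Sequence, Tuple


def dual_llm_screen(
    llm_a: Sequence[bool],
    llm_b: Sequence[bool],
    reference: Sequence[bool],
) -> Tuple[float, float]:
    """Return (automation_rate, error_rate) under the assumptions stated above.

    True means 'include'. Abstracts on which the two LLMs disagree are
    assumed to be sent to a human arbitrator and are not counted as automated.
    """
    if not (len(llm_a) == len(llm_b) == len(reference)):
        raise ValueError("all decision lists must have the same length")
    if not reference:
        raise ValueError("at least one abstract is required")

    # Keep only abstracts where both LLMs gave the same decision.
    automated = [(a, r) for a, b, r in zip(llm_a, llm_b, reference) if a == b]

    automation_rate = len(automated) / len(reference)
    errors = sum(1 for decision, truth in automated if decision != truth)
    error_rate = errors / len(automated) if automated else 0.0
    return automation_rate, error_rate


# Illustrative, made-up decisions for four abstracts.
llm_a = [True, False, False, True]
llm_b = [True, False, True, True]
reference = [True, False, False, False]

automation_rate, error_rate = dual_llm_screen(llm_a, llm_b, reference)
print(f"automation rate: {automation_rate:.2f}, error rate: {error_rate:.2f}")
```

In this toy example the LLMs agree on three of four abstracts, so one abstract goes to human arbitration, and one of the three automated decisions is wrong.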
Conclusion
Abstract characterisation and detection of specific features of interest are feasible and accurate with GPT-4o and GPT-4T. The majority of abstract screening for systematic reviews can be automated using LLMs at low error rates.