Evaluating the Reliability of a Custom GPT in Full-Text Screening of a Systematic Review
Abstract
Systematic reviewing is a time-consuming process that can be aided by artificial intelligence (AI). Several AI options exist to assist with title/abstract screening; however, options for full-text screening (FTS) are limited. The objective of this study was to evaluate the reliability of a custom GPT (cGPT) for FTS.
A cGPT powered by OpenAI’s ChatGPT-4o was trained and tested with a subset of articles assessed in duplicate by human reviewers. Outputs from the testing subset were coded to simulate cGPT both as an autonomous reviewer and as an assistant reviewer. Cohen’s kappa was used to assess interrater agreement. The threshold for practical use was defined as a cGPT-human kappa score exceeding the lower bound of the confidence interval (CI) for the lowest human-human kappa score in inclusion/exclusion and exclusion-reason decisions. cGPT as an assistant reviewer met this reliability threshold. For the inclusion/exclusion decision, the human-human kappa CIs ranged from 0.658 to 1.00, and the assistant cGPT-human kappa scores fell within this range in two of four pairings. For exclusion-reason classification, the benchmark human-human kappa CIs ranged from 0.606 to 0.912, and the assistant cGPT-human kappa scores fell within this range in one of four pairings. cGPT as an autonomous reviewer did not meet the reliability threshold.
As an assistant reviewer, cGPT could speed up systematic reviewing with sufficient reliability. More research is needed to establish standardized thresholds for practical use. While the current study dealt with physiological population parameters, cGPTs can assist in FTS of systematic reviews in any field.
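For illustration, the following is a minimal sketch of the agreement analysis and the practical-use threshold check described above. The decision vectors, the bootstrap CI procedure, and all variable names are illustrative assumptions, not the study’s actual data or code.

```python
# Illustrative sketch (not the study's actual code or data): compute Cohen's
# kappa for reviewer pairs and apply the practical-use threshold, i.e. check
# whether the cGPT-human kappa exceeds the lower CI bound of the human-human kappa.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def bootstrap_kappa_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for Cohen's kappa."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a), np.asarray(b)
    kappas = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(a), len(a))
        k = cohen_kappa_score(a[idx], b[idx])
        if not np.isnan(k):  # skip degenerate resamples where kappa is undefined
            kappas.append(k)
    return np.percentile(kappas, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Hypothetical inclusion/exclusion decisions for ten articles.
human1 = ["include", "exclude", "exclude", "include", "exclude",
          "include", "include", "exclude", "include", "exclude"]
human2 = ["include", "exclude", "include", "include", "exclude",
          "include", "include", "exclude", "include", "include"]
cgpt   = ["include", "exclude", "exclude", "include", "include",
          "include", "include", "exclude", "include", "exclude"]

human_kappa = cohen_kappa_score(human1, human2)
ci_lower, ci_upper = bootstrap_kappa_ci(human1, human2)
cgpt_kappa = cohen_kappa_score(human1, cgpt)

print(f"human-human kappa = {human_kappa:.3f} (95% CI {ci_lower:.3f}-{ci_upper:.3f})")
print(f"cGPT-human kappa  = {cgpt_kappa:.3f}")
print("meets practical-use threshold:", cgpt_kappa > ci_lower)
```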
HIGHLIGHTS
- There are several AI options to assist with title/abstract screening in systematic reviewing; however, options for full-text screening are limited.
- The reliability of a tailor-made AI model in the form of a custom GPT was explored in the role of an assistant to a human reviewer and as an autonomous reviewer.
- Interrater agreement was sufficient when the model operated as an assistant reviewer but not as an autonomous reviewer; in the autonomous role, the model misclassified two articles out of ten, whereas the human reviewers erred in approximately one out of ten articles.
- The study shows that it is possible to craft a custom GPT as a useful assistant in systematic reviews; cGPTs can be crafted to assist in reviews in any field.
- An automated setup for inputting articles and coding cGPT responses is needed to maximize the potential time-saving benefit (see the sketch below).
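As a rough illustration of what such an automated setup might look like, here is a hypothetical sketch. Custom GPTs are not directly callable through the API, so this assumes a standard chat-completions call that carries the screening instructions as a system prompt; the model name, SCREENING_INSTRUCTIONS, and the expected output format are assumptions, not the study’s configuration.

```python
# Hypothetical sketch of an automated full-text screening loop.
# Assumptions: OPENAI_API_KEY is set in the environment; the screening
# instructions and reply format below stand in for the study's cGPT setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCREENING_INSTRUCTIONS = (
    "You are a full-text screening assistant for a systematic review. "
    "Given the full text of an article, answer with 'INCLUDE' or "
    "'EXCLUDE: <reason>' according to the review's eligibility criteria."
)

def screen_article(full_text: str) -> str:
    """Send one article's full text to the model and return its raw decision."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SCREENING_INSTRUCTIONS},
            {"role": "user", "content": full_text},
        ],
    )
    return response.choices[0].message.content.strip()

def code_decision(raw: str) -> tuple[str, str | None]:
    """Code the model's free-text reply into (decision, exclusion_reason)."""
    if raw.upper().startswith("INCLUDE"):
        return "include", None
    reason = raw.split(":", 1)[1].strip() if ":" in raw else "unspecified"
    return "exclude", reason
```

Looping screen_article and code_decision over a folder of extracted full texts would remove the manual copy-paste and response-coding steps, which is where most of the potential time saving lies.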