Evaluating the Reliability of a Custom GPT in Full-Text Screening of a Systematic Review


Abstract

Objective

The purpose of this study is to evaluate the reliability and time-saving potential of a custom GPT (cGPT) in the full-text screening stage of a systematic review of average 24-hour urine production and 24-hour creatinine excretion in populations.

Methods

A cGPT model, developed with ChatGPT-4o (OpenAI Plus), was trained on a subset of articles previously assessed in duplicate by human reviewers. The human operator manually uploaded individual articles into the cGPT conversation along with a standardized prompt. The outputs were coded to simulate the cGPT in three roles: (1) autonomous reviewer, (2) assistant to the first reviewer, and (3) assistant to the second reviewer. Cohen's kappa was used to measure interrater agreement between the cGPT and each human reviewer, as well as against human consensus decisions (the "gold standard"). The threshold for practical use was defined as a cGPT-consensus kappa score falling within the confidence interval of at least one human-human pairing, for both the inclusion/exclusion decision and the exclusion reason.
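The abstract does not report the software used for the agreement analysis. As a minimal sketch, assuming screening decisions are coded as categorical labels, Cohen's kappa and an approximate 95% confidence interval could be computed as follows; cohen_kappa_score is from scikit-learn, and the interval uses the standard large-sample normal approximation (the example data are hypothetical):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_with_ci(rater_a, rater_b):
    """Cohen's kappa with an approximate 95% CI (large-sample normal approximation)."""
    rater_a, rater_b = np.asarray(rater_a), np.asarray(rater_b)
    kappa = cohen_kappa_score(rater_a, rater_b)
    n = len(rater_a)
    p_o = np.mean(rater_a == rater_b)  # observed agreement
    labels = np.unique(np.concatenate([rater_a, rater_b]))
    # chance agreement from the raters' marginal proportions
    p_e = sum(np.mean(rater_a == c) * np.mean(rater_b == c) for c in labels)
    se = np.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))  # approximate standard error
    z = 1.96  # two-sided 95% level
    return kappa, (kappa - z * se, kappa + z * se)

# Hypothetical inclusion/exclusion decisions (1 = include, 0 = exclude)
cgpt      = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]
consensus = [1, 0, 1, 1, 1, 0, 0, 1, 0, 0]
k, (lo, hi) = kappa_with_ci(cgpt, consensus)
print(f"kappa = {k:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

With only ten illustrative decisions the interval is wide and the approximation can exceed 1; the study's reported intervals are based on its full screening set.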

Results

Of the three reviewer roles, cGPT as assistant to the second reviewer was the only one that met the threshold for practical use, producing a cGPT-consensus kappa of 0.733 (95% CI: 0.607, 0.859) for inclusion/exclusion, compared with human-human kappas ranging from 0.713 (95% CI: 0.606, 0.821) to 0.784 (95% CI: 0.656, 0.912). For classification of exclusion reason, the cGPT-consensus kappa was 0.632 (95% CI: 0.568, 0.696), compared with a human-human kappa range of 0.713 (95% CI: 0.606, 0.821) to 0.784 (95% CI: 0.656, 0.912). Using the cGPT in this way offered a clear time-saving advantage for full-text screening, with an estimated 10.1 to 84.4 hours saved in the data set investigated here. cGPT as an autonomous reviewer or as assistant to the first reviewer did not meet the reliability thresholds.
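For concreteness, the practical-use criterion reduces to a simple interval check against the reported estimates. This sketch uses the inclusion/exclusion values quoted above; it is an illustration of the criterion, not the study's own analysis code:

```python
# Reported inclusion/exclusion estimates (from the Results above)
cgpt_consensus_kappa = 0.733
human_pair_cis = [(0.606, 0.821), (0.656, 0.912)]  # 95% CIs of human-human pairings

# Criterion: cGPT-consensus kappa lies within the CI of at least one human-human pairing
meets_threshold = any(lo <= cgpt_consensus_kappa <= hi for lo, hi in human_pair_cis)
print(meets_threshold)  # True: 0.733 falls inside both reported human-human intervals
```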

Conclusion

While the cGPT's performance was not sufficiently reliable and accurate to replace human reviewers in full-text screening, its use as an assistant holds promise for expediting the screening process, particularly with a large full-text corpus. Published data exploring ChatGPT models for full-text screening remain scarce, and more advanced models will require continued validation to determine which role best suits the capabilities of custom GPTs. More research is also needed to establish a standardized threshold for practical use.
