Updated Approach to Error Rates in Systematic Review Screening: Integrating Active Learning, Large Language Models, and Full-Text Screening Data

Abstract

Screening records for a systematic review or meta-analysis (SRMA) can be both time-consuming and error-prone. Wang et al. (2020) estimated that humans misclassify about 10.76% of records during abstract screening, relative to the gold-standard full-text decision, as either false inclusions or false exclusions. Errors at the full-text level are rarely evaluated, and the same is true of errors in AI-assisted screening methods for SRMAs, which are becoming more common. We applied the approach of Wang et al. (2020) to determine error rates for human-only, human-AI, and AI-only screening at both the title-abstract and full-text stages. The analysis used data from collaborative projects focused on SRMA screening, including Synergy, IMPROVE, FORAS, and MetaPsy. Our overall weighted error rate was 3.51% (95% CI 0.42%-6.60%), likely influenced by lower error rates at the full-text level (0.15%, 95% CI 0.00%-2.61%) and for AI-assisted methods (1.13%, 95% CI 0.00%-9.38%). False exclusions, which can significantly affect SRMA outcomes, had the highest error rate at 10.89% (95% CI 4.43%-17.35%). Because gold standards are not perfect, they may contain even more false exclusions owing to incomplete assessment at the full-text level. AI techniques such as the noisy label filter (NLF) and large language models (LLMs) are well suited to identifying these false exclusions and reducing human workload. Furthermore, we propose a screening error origin model (SEOM) with potential error threats (PETs), which highlights the importance of clearer communication between human and AI agents, for instance regarding label uncertainty. Future research could focus on specific PETs to minimize their impact and thereby decrease overall error rates.
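
As a rough illustration of the kind of error rate discussed above, the following minimal sketch counts false inclusions and false exclusions by comparing screening decisions with gold-standard decisions, then pools several datasets. The function names, the 0/1 label encoding, and the weighting by number of records are assumptions made for this example; the abstract does not specify the weighting scheme or the denominators used in the study, and the toy labels are not data from the article.

```python
from typing import Sequence, Tuple


def count_errors(screened: Sequence[int], gold: Sequence[int]) -> Tuple[int, int, int]:
    """Return (false_inclusions, false_exclusions, total_records).

    Labels are assumed to be 1 = include, 0 = exclude (hypothetical encoding).
    """
    if len(screened) != len(gold):
        raise ValueError("screened and gold must have the same length")
    # Included at screening, but excluded by the gold standard.
    false_inclusions = sum(1 for s, g in zip(screened, gold) if s == 1 and g == 0)
    # Excluded at screening, but included by the gold standard.
    false_exclusions = sum(1 for s, g in zip(screened, gold) if s == 0 and g == 1)
    return false_inclusions, false_exclusions, len(gold)


def error_rate(screened: Sequence[int], gold: Sequence[int]) -> float:
    """Share of records whose screening decision disagrees with the gold standard."""
    fi, fe, total = count_errors(screened, gold)
    if total == 0:
        raise ValueError("no records to evaluate")
    return (fi + fe) / total


def weighted_error_rate(datasets: Sequence[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Pool per-dataset error rates, weighting each by its number of records.

    Weighting by record count is an assumption of this sketch.
    """
    total_records = sum(len(gold) for _, gold in datasets)
    if total_records == 0:
        raise ValueError("no records to evaluate")
    weighted_sum = sum(error_rate(s, g) * len(g) for s, g in datasets)
    return weighted_sum / total_records


if __name__ == "__main__":
    # Toy labels for illustration only.
    review_a = ([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
    review_b = ([0, 0, 0, 0, 0], [0, 0, 0, 0, 0])
    print(count_errors(*review_a))
    print(weighted_error_rate([review_a, review_b]))
```

With record-count weights, the pooled rate equals the total number of misclassified records divided by the total number of records; other weighting schemes (for example, inverse-variance weights) would give different pooled values.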
