Will generative AI help solve systematic literature reviews? Evidence from a 2-year research programme

Abstract

Background: Systematic literature reviews (SLRs) underpin clinical, regulatory, and policy decision-making but are time- and resource-intensive, particularly during data extraction and synthesis. Advances in large language models (LLMs) have renewed interest in artificial intelligence (AI)-assisted SLR workflows, but robust, large-scale evaluations against decision-grade standards remain limited.

Methods: We conducted a blinded, multi-year evaluation of LLM-based AI systems applied across key SLR stages under expert human oversight. AI performance was benchmarked against fully human-generated gold-standard datasets from completed SLRs. Evaluated stages included title/abstract screening, full-text screening, structured data extraction, and table narrative generation. The programme covered clinical trial and real-world evidence SLRs across multiple disease areas and complexity levels. In total, 20,594 titles/abstracts, 2,066 full texts, 51,352 extracted data points, and nine table narratives were assessed using predefined quantitative metrics.

Results: For title/abstract screening, AI achieved practical sensitivity of 89–98% and specificity of 68–89%. For full-text screening, sensitivity and practical sensitivity were ≥99%, with specificity of 6–22% and accuracy of 75–93%, reflecting intentional optimisation for sensitivity. For data extraction, completeness ranged from 85% to 98% (median 92%) and accuracy from 98% to 100% (median 100%). Accuracy was 100% for 90 variables and ≥90% for all but two variables. Lower completeness was observed in studies with multiple subgroups, high data density, or non-significant findings. Nine AI-generated table narratives across six SLRs achieved objective quality scores between 0 and 22 (≤14 indicates high quality). Error rates normalised per study remained below 2 for all narratives and 0.2 for the largest (100-study) table. Subjectively, all narratives but one were rated Good or Medium.

Conclusions: Across large, diverse gold-standard datasets, AI systems operating within an expert-led, human-in-the-loop framework demonstrated high sensitivity for screening, high completeness and near-ceiling accuracy for data extraction, and generally high-quality narrative synthesis. The greatest practical value was found in data extraction and narrative drafting, where AI substantially reduced manual effort while maintaining methodological standards. These findings support AI as a very strong augmentation tool for expert reviewers conducting decision-grade SLRs, but not yet as an independent agent.
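
For readers less familiar with the screening metrics reported above, the sketch below shows how sensitivity, specificity, and accuracy are conventionally computed from confusion-matrix counts at a screening stage. The counts are illustrative placeholders, not data from the study, and the programme's "practical sensitivity" is its own metric, not defined in this abstract, so it is not reproduced here.

```python
# Minimal sketch of standard screening metrics from confusion-matrix counts.
# Counts below are hypothetical examples only, not values from the study.

def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return sensitivity, specificity, and accuracy for a screening stage.

    tp: relevant records correctly included
    fp: irrelevant records incorrectly included
    tn: irrelevant records correctly excluded
    fn: relevant records incorrectly excluded
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"sensitivity": sensitivity,
            "specificity": specificity,
            "accuracy": accuracy}

# Example with hypothetical counts:
print(screening_metrics(tp=198, fp=1200, tn=4500, fn=2))
```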
