A Prompt-Based Tutorial for Large Language Model–Assisted Screening in Systematic Reviews and Meta-Analyses
Abstract
Title/abstract and full-text screening are among the most time-consuming stages of systematic reviews. Large language models (LLMs) such as ChatGPT can assist in screening, but most prior evaluations focus only on abstract-level decisions and require coding or API access, limiting practical application. This study aimed to develop and evaluate a structured, prompt-based approach that enables LLMs to perform both abstract and full-text screening without programming or API integration. Using datasets from two completed meta-analyses, we implemented a stepwise training framework involving comprehension checks, criterion-specific feedback, and iterative prompt refinement. Study 1 optimized technical parameters, including file format and batch size, using 1,000 abstracts and 50 full texts from a completed meta-analysis. Study 2 validated the approach in an independent dataset on teletherapy for depression (1,321 abstracts, 82 full texts). Human reviewers’ decisions served as the reference standard, and sensitivity, specificity, and accuracy were the primary outcomes. In Study 1, the LLM achieved 98.0% sensitivity and 80.6% specificity at abstract screening, with optimal performance using batches of 50 abstracts in plain-text format. In Study 2, abstract screening reached 100% sensitivity and 85.6% specificity, and full-text screening achieved 82.1% accuracy while correctly retaining all eligible studies. A structured, prompt-based approach allows LLMs to approximate human-level accuracy in both abstract and full-text screening, with high sensitivity and specificity. This method makes LLM-assisted screening more accessible to review teams. While human oversight remains essential to address false positives and ensure rigor, prompt-based LLM workflows can substantially reduce reviewer burden and accelerate evidence synthesis.
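The primary outcomes above (sensitivity, specificity, accuracy) are computed by comparing the LLM's include/exclude decisions against the human reviewers' reference standard. The following is a minimal illustrative sketch of that comparison, not the authors' own analysis code; the function and variable names are hypothetical.

```python
def screening_metrics(llm_decisions, human_decisions):
    """Compare LLM include/exclude decisions (True = include) against the
    human reference standard and return sensitivity, specificity, accuracy."""
    pairs = list(zip(llm_decisions, human_decisions))
    tp = sum(1 for llm, ref in pairs if llm and ref)          # eligible, retained
    tn = sum(1 for llm, ref in pairs if not llm and not ref)  # ineligible, excluded
    fp = sum(1 for llm, ref in pairs if llm and not ref)      # ineligible, retained
    fn = sum(1 for llm, ref in pairs if not llm and ref)      # eligible, missed
    sensitivity = tp / (tp + fn)          # share of eligible studies retained
    specificity = tn / (tn + fp)          # share of ineligible studies excluded
    accuracy = (tp + tn) / len(pairs)     # overall agreement with reviewers
    return sensitivity, specificity, accuracy

# Example: 4 records, one false positive by the LLM
sens, spec, acc = screening_metrics(
    llm_decisions=[True, True, False, False],
    human_decisions=[True, False, False, False],
)
# sens = 1.0, spec ≈ 0.667, acc = 0.75
```

In screening, sensitivity is the critical metric: a missed eligible study (false negative) cannot be recovered downstream, whereas false positives are caught by human reviewers at the next stage.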