Dual-Model LLM Ensemble via Web Chat Interfaces Reaches Near-Perfect Sensitivity for Systematic-Review Screening: A Multi-Domain Validation with Equivalence to API Access
Abstract
Background
Prior work showed that state-of-the-art (mid-2025) large language models (LLMs) prompted with varying batch sizes can perform well on systematic review (SR) abstract screening via public APIs within a single medical domain. Whether comparable performance holds when using no-code web interfaces (GUIs) and whether results generalize across medical domains remain unclear.
Objective
To evaluate the screening performance of a zero-shot, large-batch, two-model LLM ensemble (OpenAI GPT-5 Thinking; Google Gemini 2.5 Pro) operated via public chat GUIs across a diverse range of medical topics, and to compare its performance with an equivalent API-based workflow.
Methods
We conducted a retrospective evaluation using 736 titles and abstracts from 16 Cochrane reviews (330 included, 406 excluded), all published in May-June 2025. The primary outcome was the sensitivity of a pre-specified “OR” ensemble rule designed to maximize sensitivity, benchmarked against final full-text inclusion decisions (reference standard). Secondary outcomes were specificity, single-model performance, and duplicate-run reliability (Cohen’s κ). Because models saw only titles/abstracts while the reference standard reflected full-text decisions, specificity estimates are conservative for abstract-level screening.
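The pre-specified "OR" ensemble rule includes a record if either model votes to include it, and performance is then scored against the full-text reference standard. A minimal sketch in Python of that rule and the resulting sensitivity/specificity with Wilson 95% intervals (the vote arrays and helper names are illustrative assumptions, not the authors' code):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score 95% CI for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def or_ensemble_metrics(gpt_votes, gemini_votes, reference):
    """Apply the sensitivity-maximizing OR rule and score it.

    Each argument is a list of booleans, one per abstract;
    `reference` holds the final full-text inclusion decisions.
    """
    ensemble = [g or m for g, m in zip(gpt_votes, gemini_votes)]
    tp = sum(e and r for e, r in zip(ensemble, reference))
    fn = sum((not e) and r for e, r in zip(ensemble, reference))
    tn = sum((not e) and (not r) for e, r in zip(ensemble, reference))
    fp = sum(e and (not r) for e, r in zip(ensemble, reference))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, wilson_ci(tp, tp + fn), wilson_ci(tn, tn + fp)
```

Because a record missed by one model still counts as included if the other model catches it, the OR rule deliberately trades specificity for sensitivity, which is consistent with the design goal stated above.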
Results
The GUI-based ensemble achieved 99.7% sensitivity (95% CI, 98.3%-100.0%) and 49.3% specificity (95% CI, 44.3%-54.2%). The API-based workflow yielded comparable performance, with 99.1% sensitivity (95% CI, 97.4%-99.8%) and 49.3% specificity (95% CI, 44.3%-54.2%). The difference in sensitivity was not statistically significant (McNemar p=0.625) and met equivalence within a ±2-percentage-point margin (TOST p<0.05). Duplicate-run reliability was substantial to almost perfect (Cohen’s κ: 0.78-0.93). The two models showed complementary strengths: Gemini 2.5 Pro consistently achieved higher sensitivity (94.5%-98.2% across single runs), whereas GPT-5 Thinking yielded higher specificity (62.3%-67.0%).
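The paired sensitivity comparison rests on an exact McNemar test, which considers only the records where the two workflows disagree. A sketch of that calculation (the discordant counts are hypothetical assumptions chosen to illustrate the arithmetic; the abstract does not report them):

```python
from scipy.stats import binomtest

def mcnemar_exact(b, c):
    """Exact McNemar test on discordant pairs.

    b = included records caught by the GUI workflow but missed by the API;
    c = the reverse. Under H0 the discordant pairs split 50/50, so the
    p-value is an exact two-sided binomial test.
    """
    return binomtest(c, b + c, 0.5).pvalue

# Hypothetical discordant counts (NOT taken from the paper):
p = mcnemar_exact(3, 1)  # p = 0.625
```

With so few discordant pairs, the exact binomial form is preferred over the chi-square approximation, which is unreliable at these counts.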
Conclusions
A zero-code, browser-based workflow using a dual-LLM ensemble achieves near-perfect sensitivity for abstract screening across multiple medical domains, with performance equivalent to API-based methods. Ensemble approaches spanning two model families may mitigate model-specific blind spots. Prospective studies should quantify workload, cost, and operational feasibility in end-to-end systematic review pipelines.