Dual-Model LLM Ensemble via Web Chat Interfaces Reaches Near-Perfect Sensitivity for Systematic-Review Screening: A Multi-Domain Validation with Equivalence to API Access
Abstract
Background
Prior work showed that state-of-the-art (mid-2025) large language models (LLMs) prompted with varying batch sizes can perform well on systematic review (SR) abstract screening via public APIs within a single medical domain. Whether comparable performance holds when using no-code web interfaces (GUIs) and whether results generalize across medical domains remain unclear.
Objective
To evaluate the screening performance of a zero-shot, large-batch, two-model LLM ensemble (OpenAI GPT-5 Thinking; Google Gemini 2.5 Pro) operated via public chat GUIs across a diverse range of medical topics, and to compare its performance with an equivalent API-based workflow.
Methods
We conducted a retrospective evaluation using 736 titles and abstracts from 16 Cochrane reviews (330 included, 406 excluded), all published in May-June 2025. The primary outcome was the sensitivity of a pre-specified “OR” ensemble rule designed to maximize sensitivity, benchmarked against final full-text inclusion decisions (reference standard). Secondary outcomes were specificity, single-model performance, and duplicate-run reliability (Cohen’s κ). Because models saw only titles/abstracts while the reference standard reflected full-text decisions, specificity estimates are conservative for abstract-level screening.
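The pre-specified "OR" ensemble rule includes a record if either model votes to include it, and performance is then scored against the full-text reference standard. A minimal sketch in Python of that rule and the resulting sensitivity/specificity with Wilson 95% intervals (the vote arrays and helper names are illustrative assumptions, not the authors' code):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score 95% CI for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def or_ensemble_metrics(gpt_votes, gemini_votes, reference):
    """Apply the sensitivity-maximizing OR rule and score it.

    Each argument is a list of booleans, one per abstract;
    `reference` holds the final full-text inclusion decisions.
    """
    ensemble = [g or m for g, m in zip(gpt_votes, gemini_votes)]
    tp = sum(e and r for e, r in zip(ensemble, reference))
    fn = sum((not e) and r for e, r in zip(ensemble, reference))
    tn = sum((not e) and (not r) for e, r in zip(ensemble, reference))
    fp = sum(e and (not r) for e, r in zip(ensemble, reference))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, wilson_ci(tp, tp + fn), wilson_ci(tn, tn + fp)
```

Because a record missed by one model still counts as included if the other model catches it, the OR rule deliberately trades specificity for sensitivity, which is consistent with the design goal stated above.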
Results
The GUI-based ensemble achieved 99.7% sensitivity (95% CI, 98.3%-100.0%) and 49.3% specificity (95% CI, 44.3%-54.2%). The API-based workflow yielded comparable performance, with 99.1% sensitivity (95% CI, 97.4%-99.8%) and 49.3% specificity (95% CI, 44.3%-54.2%). The difference in sensitivity was not statistically significant (McNemar p=0.625) and met equivalence within a ±2-percentage-point margin (TOST p<0.05). Duplicate-run reliability was substantial to almost perfect (Cohen’s κ: 0.78-0.93). The two models showed complementary strengths: Gemini 2.5 Pro consistently achieved higher sensitivity (94.5%-98.2% across single runs), whereas GPT-5 Thinking yielded higher specificity (62.3%-67.0%).
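The paired sensitivity comparison rests on an exact McNemar test, which considers only the records where the two workflows disagree. A sketch of that calculation (the discordant counts are hypothetical assumptions chosen to illustrate the arithmetic; the abstract does not report them):

```python
from scipy.stats import binomtest

def mcnemar_exact(b, c):
    """Exact McNemar test on discordant pairs.

    b = included records caught by the GUI workflow but missed by the API;
    c = the reverse. Under H0 the discordant pairs split 50/50, so the
    p-value is an exact two-sided binomial test.
    """
    return binomtest(c, b + c, 0.5).pvalue

# Hypothetical discordant counts (NOT taken from the paper):
p = mcnemar_exact(3, 1)  # p = 0.625
```

With so few discordant pairs, the exact binomial form is preferred over the chi-square approximation, which is unreliable at these counts.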
Conclusions
A zero-code, browser-based workflow using a dual-LLM ensemble achieves near-perfect sensitivity for abstract screening across multiple medical domains, with performance equivalent to API-based methods. Ensemble approaches spanning two model families may mitigate model-specific blind spots. Prospective studies should quantify workload, cost, and operational feasibility in end-to-end systematic review pipelines.