Scaling the Prompt: How Batch Size Shapes Performance of Mid-2025 State-of-the-Art LLMs in Automated Title-and-Abstract Screening
Abstract
Background: Manual abstract screening is a primary bottleneck in evidence synthesis. Emerging evidence suggests that large language models (LLMs) can automate this task, but their performance when processing multiple records per prompt in batches is uncertain.

Objectives: To evaluate the classification performance of four state-of-the-art LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, GPT-5, and GPT-5 mini) in predicting study eligibility across a wide range of batch sizes for a systematic review of randomised controlled trials.

Methods: We used a gold-standard dataset of 790 records (93 inclusions) from a published Cochrane Review. Using each model's public API, we submitted batches of 1 to 790 citations for classification as include or exclude. Performance was assessed using sensitivity and specificity, with internal validation conducted through 10 repeated runs for each model-batch combination.

Results: Gemini 2.5 Pro was the most robust model, successfully processing the full 790-record batch. In contrast, GPT-5 failed at batch sizes ≥400, while GPT-5 mini and Gemini 2.5 Flash failed at the 790-record batch. All models demonstrated strong performance within their operational ranges, with two notable exceptions: Gemini 2.5 Flash showed low initial sensitivity at batch size 1, and GPT-5 mini's sensitivity degraded at higher batch sizes (from 0.88 at batch size 200 to 0.48 at batch size 400). At a practical batch size of 100, Gemini 2.5 Pro achieved the highest sensitivity (1.00, 95% CI 1.00-1.00), whereas GPT-5 delivered the highest specificity (0.98, 95% CI 0.98-0.98).

Conclusion: State-of-the-art LLMs can effectively screen multiple abstracts per prompt, moving beyond inefficient single-record processing. However, performance is model-dependent, with trade-offs between sensitivity and specificity. Batch size optimisation and strategic model selection are therefore important parameters for successful implementation.
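To make the batching setup concrete, the sketch below illustrates the general approach under stated assumptions; it is not the authors' code. `call_llm` is a hypothetical stand-in for a provider API call (e.g. an OpenAI or Gemini Python client), and the prompt wording, record fields, and reply format are illustrative choices, not taken from the study.

```python
# A minimal sketch of batched title-and-abstract screening (not the
# authors' code). `call_llm` is a hypothetical stand-in for a provider
# API call; the prompt wording, record fields, and reply format are
# illustrative assumptions.
from typing import Callable


def build_batch_prompt(records: list[dict]) -> str:
    """Pack a batch of citations into one prompt that asks for an
    include/exclude decision per record."""
    header = (
        "For each record below, decide whether it meets the review's "
        "eligibility criteria. Answer one line per record in the form "
        "'<id>: include' or '<id>: exclude'.\n\n"
    )
    body = "\n\n".join(
        f"[{r['id']}] Title: {r['title']}\nAbstract: {r['abstract']}"
        for r in records
    )
    return header + body


def screen_in_batches(
    records: list[dict],
    batch_size: int,
    call_llm: Callable[[str], str],
) -> dict[str, str]:
    """Submit records in fixed-size batches and parse one decision per
    record from each reply."""
    decisions: dict[str, str] = {}
    for start in range(0, len(records), batch_size):
        batch = records[start : start + batch_size]
        reply = call_llm(build_batch_prompt(batch))
        for line in reply.splitlines():
            if ":" in line:
                rec_id, label = line.split(":", 1)
                decisions[rec_id.strip(" []")] = label.strip().lower()
    return decisions


def sensitivity_specificity(
    decisions: dict[str, str], gold: dict[str, str]
) -> tuple[float, float]:
    """Score predictions against the gold standard; a missing or
    unparsable prediction is counted as an exclusion."""
    tp = fn = tn = fp = 0
    for rec_id, truth in gold.items():
        predicted_include = decisions.get(rec_id) == "include"
        if truth == "include" and predicted_include:
            tp += 1
        elif truth == "include":
            fn += 1
        elif predicted_include:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)
```

Under this setup, a run at batch size 100 over 790 records would issue eight API calls, and repeating the run 10 times per model-batch combination yields the distribution from which sensitivity (TP / (TP + FN)) and specificity (TN / (TN + FP)) are estimated.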