Scaling the Prompt: How Batch Size Shapes Performance of Mid-2025 State-of-the-Art LLMs in Automated Title-and-Abstract Screening

Abstract

Background: Manual title-and-abstract screening is a primary bottleneck in evidence synthesis. Emerging evidence suggests that large language models (LLMs) can automate this task, but their performance when processing multiple records per prompt (batching) is uncertain.

Objectives: To evaluate the classification performance of four state-of-the-art LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, GPT-5, and GPT-5 mini) in predicting study eligibility across a wide range of batch sizes for a systematic review of randomised controlled trials.

Methods: We used a gold-standard dataset of 790 records (93 inclusions) from a published Cochrane Review. Using each model's public API, batches of 1 to 790 citations were submitted for classification as include or exclude. Performance was assessed using sensitivity and specificity, with internal validation through 10 repeated runs for each model-batch combination.

Results: Gemini 2.5 Pro was the most robust model, successfully processing the full 790-record batch. In contrast, GPT-5 failed at batches ≥400, while GPT-5 mini and Gemini 2.5 Flash failed at the 790-record batch. All models performed strongly within their operational ranges, with two notable exceptions: Gemini 2.5 Flash showed low sensitivity at batch size 1, and GPT-5 mini's sensitivity degraded at higher batch sizes (from 0.88 at batch 200 to 0.48 at batch 400). At a practical batch size of 100, Gemini 2.5 Pro achieved the highest sensitivity (1.00, 95% CI 1.00-1.00), whereas GPT-5 delivered the highest specificity (0.98, 95% CI 0.98-0.98).

Conclusion: State-of-the-art LLMs can effectively screen multiple abstracts per prompt, moving beyond inefficient single-record processing. Performance is model-dependent, however, with trade-offs between sensitivity and specificity, so batch size optimisation and strategic model selection are important parameters for successful implementation.
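For readers unfamiliar with the batch-prompting setup described in the Methods, the sketch below illustrates how a batch of citations might be submitted through one such public API (here, OpenAI's chat-completions endpoint). The prompt wording, output format, and parsing are illustrative assumptions for exposition; they are not the authors' actual protocol or eligibility criteria.

```python
# Minimal sketch of batch title/abstract screening via an LLM API.
# Assumes the `openai` Python package (v1+) and an OPENAI_API_KEY in the
# environment; the prompt and label format are hypothetical.
from openai import OpenAI

client = OpenAI()

def screen_batch(records, model="gpt-5-mini"):
    """Submit a batch of records; return {record number: 'include'/'exclude'}."""
    numbered = "\n\n".join(
        f"[{i + 1}] Title: {r['title']}\nAbstract: {r['abstract']}"
        for i, r in enumerate(records)
    )
    prompt = (
        "For each numbered record below, decide whether it meets the "
        "review's eligibility criteria. Reply with one line per record "
        "in the form '<number>: include' or '<number>: exclude'.\n\n"
        + numbered
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    labels = {}
    for line in resp.choices[0].message.content.splitlines():
        try:
            idx, label = line.split(":", 1)
            labels[int(idx.strip().strip("[]"))] = label.strip().lower()
        except ValueError:
            continue  # skip lines that do not match the expected format
    return labels
```

In this framing, the study's batch-size parameter is simply the length of the `records` list passed per call, from 1 up to the full 790; sensitivity and specificity are then computed by comparing the returned labels against the Cochrane gold standard.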
