Scaling the Prompt: How Batch Size Shapes Performance of Mid-2025 State-of-the-Art LLMs in Automated Title-and-Abstract Screening
Abstract
Background
Manual abstract screening is a primary bottleneck in evidence synthesis. Emerging evidence suggests that large language models (LLMs) can automate this task, but their performance when processing multiple records simultaneously in "batches" is uncertain.
Objectives
To evaluate the classification performance of four state-of-the-art LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, GPT-5, and GPT-5 mini) in predicting study eligibility across a wide range of batch sizes for a systematic review of randomised controlled trials.
Methods
We used a gold-standard dataset of 790 records (93 inclusions) from a published Cochrane Review. Using each model’s public API, we submitted batches ranging from 1 to 790 citations and asked the model to classify each record as ‘Include’ or ‘Exclude’. Performance was assessed using sensitivity and specificity, with internal validation through 10 repeated runs for each model-batch-size combination.
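To make the batching protocol concrete, the sketch below shows one way such a batch call could look using the OpenAI Python SDK. The prompt wording, model identifier, and expected response format are illustrative assumptions, not the authors' exact protocol.

```python
# Minimal sketch of batch title-and-abstract screening via the OpenAI
# chat completions API. Prompt text and output format are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def screen_batch(records: list[dict], model: str = "gpt-5") -> list[str]:
    """Submit one batch of records; return 'Include'/'Exclude' labels."""
    numbered = "\n\n".join(
        f"[{i + 1}] Title: {r['title']}\nAbstract: {r['abstract']}"
        for i, r in enumerate(records)
    )
    prompt = (
        "You are screening citations for a systematic review of randomised "
        "controlled trials. For each numbered record below, answer with the "
        "record number followed by 'Include' or 'Exclude', one per line.\n\n"
        + numbered
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    # Parse one "<n> Include|Exclude" label per line (format is assumed).
    return [line.split()[-1] for line in text.strip().splitlines()]
```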
Results
Gemini 2.5 Pro was the most robust model, successfully processing the full 790-record batch. In contrast, GPT-5 failed at batch sizes ≥400, while GPT-5 mini and Gemini 2.5 Flash failed only at the full 790-record batch. Overall, all models performed strongly within their operational ranges, with two notable exceptions: Gemini 2.5 Flash showed low sensitivity at batch size 1, and GPT-5 mini’s sensitivity degraded at larger batch sizes (from 0.88 at batch size 200 to 0.48 at batch size 400). At a practical batch size of 100, Gemini 2.5 Pro achieved the highest sensitivity (1.00, 95% CI 1.00-1.00), whereas GPT-5 delivered the highest specificity (0.98, 95% CI 0.98-0.98).
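For reference, the reported metrics follow the standard definitions scored against the gold-standard labels; a minimal computation sketch (variable names are illustrative):

```python
# Standard screening metrics, scored against the gold-standard labels.
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of gold-standard inclusions labelled 'Include'."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg: int, false_pos: int) -> float:
    """Proportion of gold-standard exclusions labelled 'Exclude'."""
    return true_neg / (true_neg + false_pos)

# Example: with all 93 true inclusions labelled 'Include',
# sensitivity = 93 / (93 + 0) = 1.00, as reported for Gemini 2.5 Pro
# at batch size 100.
```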
Conclusion
State-of-the-art LLMs can effectively screen multiple abstracts per prompt, moving beyond inefficient single-record processing. However, performance is model-dependent, with clear trade-offs between sensitivity and specificity. Batch size and model selection are therefore key parameters for successful implementation.