Set-up, validation, evaluation, and cost-benefit analysis of an AI-assisted assessment of responsible research practices in a sample of life science publications
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (PREreview)
Abstract
The (semi-)automated screening of publications for diverse quality and transparency criteria is at the core of systematic literature assessment. Typically, the assessment process involves two initial reviewers and one additional reviewer for cases that require reconciliation. Here, we explore to what extent this process can be assisted by Large Language Models (LLMs), specifically, whether LLMs are capable of assessing responsible research practices (RRPs) in scientific papers in a robust way. We employed proprietary LLMs to assess an initial set of 37 papers across ten RRPs. The same papers were also reviewed by three human reviewers. We iteratively redesigned prompts to increase model accuracy compared to human ratings, which we treated as the gold standard. The resulting pipeline was validated on an additional set of 15 papers. We show that LLM accuracy is comparable to single human reviewer performance (90% for the LLM vs 86% for a single human reviewer). However, performance strongly depended on the specific RRPs, with accuracy ranging from 40% to 100%. LLMs exhibited an affirmative bias, making more errors when practices were not reported in the papers. Overall, we show how such an approach could potentially replace one human reviewer, enabling AI-assisted assessment of research papers. We discuss how dataset imbalances, validation procedures, and implementation time limit the broad applicability of such approaches. Through this, we develop initial guidance on the utility of proprietary LLMs in evidence synthesis.
Article activity feed
-
This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/19359570.
Summary
This study explores the feasibility of using Large Language Models (LLMs) to automate the screening of scientific publications for Responsible Research Practices (RRPs). These practices include reporting guidelines for randomization, blinding, and sample sizes. By comparing the performance of four proprietary LLMs against a "gold standard" of three human reviewers across 52 life sciences papers, the authors demonstrated that optimized LLMs (specifically Gemini 1.5 Pro) can achieve accuracy (~90%) comparable to a single human reviewer (~86%), suggesting that AI can effectively replace one human in a standard dual-reviewer evidence synthesis pipeline.
Major issues
Consensus Criteria: The authors narrowed RRPs down to 12 indicators through a three-round Delphi process. However, the specific decision-making criteria used for consensus were not fully disclosed. Clarification is needed on whether this was based on statistical thresholds, average scores, or majority vote.
Comparison Groups: We suggest an additional comparison: Human + LLM vs. human expert vs. LLM alone. Exploring human-AI teaming could reveal synergistic benefits of combining human and LLM reviewers.
Methodological Rigor: The Delphi study and the "gold standard" human assessment (two independent reviewers plus a third for reconciliation) are significant strengths of the paper.
Validation Split: With 37 papers for prompt optimization and only 15 for validation, the small validation set limits claims of generalizability across diverse scientific sub-disciplines. The choice of papers from the BOX Program is unusual and needs justification: why not sample from the most cited or most recent papers instead?
Affirmative Bias: A critical finding is the LLM's tendency to report that a practice was followed when it was actually missing. This makes LLMs less reliable at confirming the absence of information.
Prompting Reproducibility: Prompt optimization was performed manually by a single researcher. The lack of a standardized, automated protocol may limit the reproducibility of the results.
Sample Size: The sample (n=52) may be insufficient for the linguistic variability in life sciences. The authors should provide justification or benchmarks supporting the reliability of this sample size.
Data Leakage: Figure 3 appears to combine training and validation sets. Metrics (Precision, Recall, F1-score) should be reported exclusively for the independent validation set to avoid performance overestimation.
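To make the data-leakage concern concrete, here is a minimal sketch (with made-up labels and split indices, not the study's data) of computing precision, recall, and F1 strictly on the held-out validation papers:

```python
# Sketch, not the authors' code: metrics must be computed on the
# validation split only, never on the pooled training + validation set.

def confusion(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = practice reported)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

def scores(tp, fp, tn, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical per-paper labels: human consensus vs. LLM prediction.
human = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
llm   = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
val_idx = range(5, 10)  # last papers held out for validation

# Report only on the held-out validation papers.
tp, fp, tn, fn = confusion([human[i] for i in val_idx],
                           [llm[i] for i in val_idx])
print(scores(tp, fp, tn, fn))  # → (0.75, 0.75, 0.75)
```

Pooling the training papers back in would inflate these numbers, since the prompts were tuned on exactly those papers.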
Reporting
The preprint would benefit from adhering to the TRIPOD-LLM reporting guidelines (https://tripod-llm.vercel.app/).
Abstract & Model Selection
Abstract: Mention human oversight to catch hallucinations, and note key limitations (data leakage, sample size, outdated models). Avoid language about "replacing" a human reviewer, given alternative setups (such as Human + LLM) and the controversy around replacement claims; otherwise, justify the implication more fully. Even when accuracies match, specific behaviours differ.
Models: Justify the "availability heuristic" for model selection. Consider comparing proprietary models against open-source options (e.g., Llama 3.3) to address data privacy/security concerns.
Introduction
Move the detail from the subsection "Selection of experimental research papers and human review" that this work is "part of a larger endeavour to estimate the impact of higher education courses on research outputs of participants and the application of RRPs" into the Introduction.
Results & Figures
Accessibility: Provide analysis scripts in Zenodo (the current file cleaning script appears deprecated).
Visuals: Use a confusion matrix to communicate accuracy (TP, FP, TN, FN).
Human Assessment: Clarify how the 87% individual accuracy was calculated (F1 vs. basic percentage).
Figure 1 Typos: Correct "Refinemnet" and "LMM."
Efficiency Results: Report standard deviations across the three reviewers.
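The two metric-reporting points above can be illustrated with a toy example (hypothetical labels, not the study's data): a 2×2 confusion matrix makes TP/FP/TN/FN explicit, and plain percentage agreement and F1 diverge once the "practice reported" labels are imbalanced, which is why the 87% figure should state which metric it is.

```python
def confusion_matrix(y_true, y_pred):
    """Rows: human label (0 = absent, 1 = reported); cols: LLM label."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m  # m[0][1] = false positives, m[1][0] = false negatives

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Imbalanced toy data: the practice is reported in 8 of 10 papers.
truth = [1] * 8 + [0] * 2
pred = [1] * 9 + [0] * 1  # rater marks almost everything "reported"

print(confusion_matrix(truth, pred))   # [[1, 1], [0, 8]]
print(round(accuracy(truth, pred), 3)) # 0.9
print(round(f1(truth, pred), 3))       # 0.941
```

On such imbalanced data, a rater with an affirmative bias scores well on both metrics while still missing half of the "absent" cases, which the confusion matrix exposes directly.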
Supplementary Materials & Data
Organization: Ensure supplemental tables are presented in the order they appear in the text. Add direct links to these materials in the article.
Reproducibility: Clarify which PDF numbers belong to which papers.
Technical Details: In Table S2, verify whether a temperature of 2.0 is possible for Gemini (usually maxes at 1.0).
Report why the Delphi study excluded: code availability, compute environment/FAIR principles, missing data strategy, metadata standards, and explicit ethics approval (IRB/ACUC/GDPR).
Repository Management: Enhance the Zenodo metadata, include a direct GitHub link in the manuscript for convenience, and clarify the licensing on GitHub (the Zenodo archive is CC BY, but the GitHub repository's license is unspecified).
Minor issues
Figure 3B: Was the caption data also calculated across all three human experts/reviewers?
Figure 4: Consider a 1:1 aspect ratio and define the lines (e.g., black line as y=x, blue as regression).
Figure 5: Fix the x-axis label ("paper" to "papers") and add "minutes" to the plot labels. Justify why paper reviewing times are listed at ≤ 60 minutes, as the literature often suggests 3–5 hours.
Terminology: In the Introduction, "Qua overall time" should likely be rephrased to: "This would allow for the partial replacement of human reviewers in such assessment processes, improving efficiency in terms of time and human review effort."
Acknowledgements: Use the CRediT taxonomy to clarify author contributions and acknowledge specific contributions by volunteers.
Disclosures: Disclose use of LLMs or other automated technologies in the research process and authoring the manuscript.
Competing interests
The authors declare that they have no competing interests.
Use of Artificial Intelligence (AI)
The authors declare that they used generative AI to come up with new ideas for their review.