Assessment of Bias in Clinical Trials with LLMs Using ROBUST-RCT: A Feasibility Study
ABSTRACT
BACKGROUND
Bias assessment is a crucial step in evaluating evidence from randomized controlled trials (RCTs). The widely adopted Cochrane RoB 2 tool, designed to identify such bias, is complex, resource-intensive, and unreliable. Advances in artificial intelligence (AI), particularly large language models (LLMs), now allow the automation of complex tasks. Prior investigations have focused on whether LLMs could perform assessments with RoB 2, but integrating these technologies does not resolve the instrument's intrinsic methodological issues. This is the first feasibility study to evaluate the reliability of ROBUST-RCT, a novel bias assessment tool, as applied by humans and LLMs.
METHODS
A sample of RCTs of drug interventions was screened for eligibility. Reviewers working independently used ROBUST-RCT to assess different aspects of the studies and then reached a consensus through discussion. A chain-of-thought prompt instructed four LLMs on how to apply ROBUST-RCT. The primary analysis used Gwet’s AC2 coefficient and benchmarking to assess inter-rater reliability of the “judgment set”, defined as the series of final assessments for the six core items in the ROBUST-RCT tool.
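For orientation, Gwet's AC2 is a chance-corrected, weighted agreement coefficient; a standard formulation (summarized here for context, not reproduced from the study) is

$$\mathrm{AC2} = \frac{p_a - p_e}{1 - p_e}, \qquad p_e = \frac{T_w}{q(q-1)} \sum_{k=1}^{q} \pi_k \left(1 - \pi_k\right),$$

where $p_a$ is the weighted observed agreement, $q$ is the number of rating categories, $\pi_k$ is the overall proportion of ratings falling in category $k$, and $T_w$ is the sum of the agreement weights. Its chance-agreement term is less sensitive to skewed rating distributions than kappa-type statistics.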
RESULTS
In the primary analysis, 54 assessments from each LLM were compared with the human consensus. Gwet's AC2 inter-rater reliability ranged from 0.46 to 0.69. With 95% confidence, three of the four tested LLMs achieved 'moderate' or higher reliability based on probabilistic benchmarking. A secondary analysis found a Fleiss' kappa of 0.49 (95% CI: 0.30 to 0.60) between human reviewers before consensus, numerically higher than values previously reported for RoB 2.
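For reference, Fleiss' kappa for $m$ raters, $n$ subjects, and $q$ categories takes the familiar chance-corrected form (a standard definition, not taken from the article):

$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}, \qquad \bar{P}_e = \sum_{k=1}^{q} p_k^2, \qquad p_k = \frac{1}{nm}\sum_{i=1}^{n} n_{ik},$$

where $n_{ik}$ is the number of raters assigning subject $i$ to category $k$ and $\bar{P}$ is the mean observed pairwise agreement across subjects.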
CONCLUSION
In this feasibility study, LLMs reliably performed risk-of-bias assessments using the ROBUST-RCT tool, supporting their integration into future systematic review workflows aimed at greater objectivity and efficiency.