Evaluating Large Language Models for Quality Control in Research Ethics Review

Abstract

Large Language Models (LLMs) could streamline research ethics and governance review by automating routine quality control checks. This study assessed Claude 3.5 Sonnet, ChatGPT-4o, and Microsoft Copilot in evaluating participant information sheets against Health Research Authority standards. Six clinical trial applications were processed using 32 standardised queries, with accuracy, processing time, error rates, and response quality assessed by two blinded reviewers. Claude showed the highest accuracy, detected more deviations from guidance, and had the fastest processing time (7.17 ± 0.68 min vs. 13.58 ± 2.2 min for ChatGPT and 23.25 ± 6.21 min for Copilot). Hallucination rates were lowest for Claude, followed by ChatGPT and Copilot. Claude also received the highest proportion of "very good" response quality ratings (77.6% vs. 56.7% for ChatGPT and 45.2% for Copilot). While all three models proved capable, Claude consistently outperformed the others, demonstrating the potential of LLM-assisted quality control while underscoring the continued need for human oversight.
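As a rough illustration of the evaluation protocol described above (32 standardised queries run against each of six applications, with processing time summarised as mean ± SD), the following is a minimal sketch in Python. It assumes a programmatic interface to the model; the study itself does not specify one, and the `query_model` and `evaluate_application` helpers here are hypothetical placeholders, not the authors' implementation.

```python
import time
import statistics

def query_model(query: str, document: str) -> str:
    """Hypothetical placeholder: submit one quality-control query about a
    participant information sheet to an LLM and return its answer.
    Replace with a call to whichever model interface is being evaluated."""
    time.sleep(0.01)  # stand-in for model latency
    return f"Response to: {query}"

def evaluate_application(document: str, queries: list[str]) -> dict:
    """Run all standardised queries against one application and time the run."""
    start = time.perf_counter()
    responses = [query_model(q, document) for q in queries]
    elapsed_min = (time.perf_counter() - start) / 60
    return {"responses": responses, "minutes": elapsed_min}

if __name__ == "__main__":
    queries = [f"Standardised query {i + 1}" for i in range(32)]   # 32 queries, per the abstract
    applications = [f"Information sheet {i + 1}" for i in range(6)]  # 6 applications, per the abstract
    times = [evaluate_application(doc, queries)["minutes"] for doc in applications]
    # Mean +/- SD processing time, the summary statistic reported in the abstract
    print(f"Mean: {statistics.mean(times):.2f} min, SD: {statistics.stdev(times):.2f} min")
```

Accuracy, hallucination rates, and response quality were judged by blinded human reviewers rather than computed automatically, so they are not modelled in this sketch.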
