Expert-level validation of AI-generated medical text with scalable language models

Asad Aali
Vasiliki Bikia
Maya Varma
Nicole Chiou
Sophie Ostmeier
Arnav Singhvi
Magdalini Paschali
Ashwin Kumar
Andrew Johnston
Karimar Amador-Martinez
Eduardo Guerrero
Paola Rivera
Sergios Gatidis
Christian Bluethgen
Eduardo Pontes Reis
Eddy Rilland
Poonam Hosamani
Kevin Keet
Minjoung Go
Evelyn Bin Ling
David Larson
Curtis Langlotz
Roxana Daneshjou
Jason Hom
Sanmi Koyejo
Emily Alsentzer
Akshay Chaudhari

Abstract

With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. Detecting errors in LM-generated text at scale is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the “LM-as-judge” paradigm (an LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with their inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset of 840 physician-annotated outputs across 6 diverse medical tasks capturing real-world challenges, including a multilingual task reviewed by bilingual physicians. Each output is reviewed following a physician-defined taxonomy of risk levels and error categories, enabling evaluation of how well LMs make safety decisions for deployment. Across 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source 1) the codebase, 2) MedVAL-Bench, and 3) MedVAL-4B, the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.
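
To make the reported alignment metric concrete, the following is a minimal sketch of how an F1 score between evaluator-LM risk levels and physician risk levels could be computed. It is not taken from the MedVAL codebase. The four-level risk scale, the macro-averaging choice, the "safe if level <= 2" threshold, and the toy labels are all assumptions made for illustration; the abstract does not specify them.

```python
# Illustrative only: the risk scale (1-4), macro-averaging, the safety
# threshold, and the toy labels below are assumptions, not details from
# the abstract or the MedVAL codebase.

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def is_safe(risk_level, threshold=2):
    """Hypothetical binary safety decision: levels at or below threshold are safe."""
    return risk_level <= threshold


# Toy data: one risk level per generated output.
physician = [1, 1, 2, 3, 4, 2, 1, 4]
evaluator = [1, 2, 2, 3, 4, 2, 1, 3]

risk_f1 = macro_f1(physician, evaluator)

# Collapse risk levels into safe/unsafe and measure per-sample agreement.
safety_agreement = sum(
    is_safe(t) == is_safe(p) for t, p in zip(physician, evaluator)
) / len(physician)

print(f"macro F1 over risk levels: {risk_f1:.3f}")
print(f"safe/unsafe agreement: {safety_agreement:.3f}")
```

In this toy example the evaluator disagrees with the physician on two risk levels but still agrees on every safe/unsafe decision, which shows why a fine-grained risk-level score and a per-sample safety classification score can differ.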

Version published to 10.21203/rs.3.rs-7041107/v1 on Research Square
Jul 8, 2025

Related articles

CLEVER: Clinical Large Language Model Evaluation by Expert Review

This article has 4 authors:
1. Veysel Kocaman
2. Mustafa Kaya
3. Andrei Ferrer
4. David Talby
Latest version: Jul 23, 2025

Implementation of Large Language Models in Electronic Health Records

This article has 3 authors:
1. Maxime Griot
2. Jean Vanderdonckt
3. Demet Yuksel
Latest version: Jul 4, 2025

Automated Evaluation of Large Language Model Response Concordance with Human Specialist Responses on Physician-to-Physician eConsult Cases

This article has 18 authors:
1. David JH Wu
2. Fateme Nateghi Haredasht
3. David Wu
4. Vishnu Ravi
5. Liam G. McCoy
6. Yingjie Weng
7. Kanav Chopra
8. Selin S. Everett
9. George Nageeb
10. Wenyuan Chen
11. Stephen P. Ma
12. Saloni Kumar Maharaj
13. Jessica Tran
14. Leah Rosengaus
15. Lena Giang
16. Olivia Jee
17. Ethan Goh
18. Jonathan H Chen
Latest version: Aug 16, 2025
