Automated Evaluation of Large Language Model Response Concordance with Human Specialist Responses on Physician-to-Physician eConsult Cases
Abstract
Specialist consults in primary care and inpatient settings typically address complex clinical questions that fall outside standard guidelines. eConsults were developed to let specialist physicians review cases asynchronously and provide clinical answers without a formal patient encounter. Meanwhile, large language models (LLMs) have approached human-level performance on structured clinical tasks, but evaluating their real-world effectiveness is bottlenecked by time-intensive manual physician review. To address this, we evaluated two automated methods: LLM-as-judge and a decompose-then-verify framework that breaks AI answers into discrete claims verified against human eConsult responses. Using 40 real-world physician-to-physician eConsults, we compared AI-generated responses to human answers using both physician raters and the automated tools. LLM-as-judge outperformed decompose-then-verify, achieving human-level concordance assessment with an F1-score of 0.89 (95% CI: 0.750-0.960) and a Cohen's kappa of 0.75 (95% CI: 0.47-0.90), comparable to physician inter-rater agreement (κ = 0.69-0.90; 95% CI: 0.43-1.0).
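
As a point of reference, the agreement metrics cited above (F1-score and Cohen's kappa) can be computed from paired case-level concordance labels. The sketch below is a minimal illustration, not the authors' evaluation code: it assumes binary concordant/discordant ratings per eConsult case and uses scikit-learn; the label values shown are invented for demonstration.

```python
# Hypothetical sketch: comparing LLM-as-judge concordance labels against
# physician reference labels, one label per eConsult case.
from sklearn.metrics import cohen_kappa_score, f1_score

# Assumed binary labels: 1 = AI response judged concordant with the human
# specialist's answer, 0 = discordant. Values here are illustrative only.
physician_labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
llm_judge_labels = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]

# F1 treats the physician rating as the reference for the judge's predictions.
print("F1-score:", f1_score(physician_labels, llm_judge_labels))

# Cohen's kappa measures chance-corrected agreement, the same statistic used
# to report physician inter-rater agreement.
print("Cohen's kappa:", cohen_kappa_score(physician_labels, llm_judge_labels))
```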