Automating Evaluation of LLM-generated Responses to Patient Questions about Rare Diseases
Abstract
Objectives
Patients with rare diseases often struggle to find accurate medical information, and large language model (LLM)-based chatbots may help meet this need. However, evaluating LLM-generated free-text answers typically requires physician review, which is time-consuming and difficult to scale. This study compared traditional natural language processing (NLP) metrics to emerging LLM-based evaluation approaches for assessing answer quality in the context of Complex Lymphatic Anomalies (CLAs).
Materials and Methods
We compiled 25 common patient questions about CLAs and generated 175 responses to these questions from seven LLMs. Three expert physicians scored these responses for accuracy. We compared the physician-assigned scores with automated scores generated by four NLP sentence-similarity metrics (BLEU, ROUGE, METEOR, BERTScore) and six LLM evaluators (GPT-4, GPT-4o, Qwen3-32B, DeepSeek-R1-14B, Gemma3-27B, LLaMA3.3-70B). We examined LLM-based scoring both with and without reference answers (reference-guided vs. reference-free). We calculated Spearman, Phi, and Kendall's Tau correlation coefficients to assess alignment between automated and physician-assigned scores.
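As an illustration of the automated scoring pipeline, the sketch below computes the four sentence-similarity metrics for one candidate response against a physician-written reference answer, then correlates automated scores with physician ratings. The toy data, variable names, and metric configurations are assumptions for illustration; the study's exact settings may differ.

```python
# Minimal sketch of the automated scoring pipeline (hypothetical data and
# configuration; the study's exact settings may differ).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score  # needs NLTK 'wordnet' corpus
from rouge_score import rouge_scorer
from bert_score import score as bert_score
from scipy.stats import spearmanr, kendalltau

reference = "Complex lymphatic anomalies are rare disorders of lymphatic development."
response = "Complex lymphatic anomalies are rare conditions affecting the lymphatic system."

ref_tok, resp_tok = reference.split(), response.split()

# BLEU with smoothing (single-sentence BLEU collapses to ~0 without it).
bleu = sentence_bleu([ref_tok], resp_tok,
                     smoothing_function=SmoothingFunction().method1)

# ROUGE-L F1 via Google's rouge-score package (argument order: target, prediction).
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = rouge.score(reference, response)["rougeL"].fmeasure

# METEOR (recent NLTK versions expect pre-tokenized input).
meteor = meteor_score([ref_tok], resp_tok)

# BERTScore returns (precision, recall, F1) tensors; keep F1.
_, _, f1 = bert_score([response], [reference], lang="en", verbose=False)
bertscore_f1 = f1.item()

print(f"BLEU={bleu:.3f}  ROUGE-L={rouge_l:.3f}  "
      f"METEOR={meteor:.3f}  BERTScore={bertscore_f1:.3f}")

# Alignment with expert judgment: rank correlations between automated and
# physician-assigned scores across all responses (toy values below).
# Phi can be obtained as a Pearson correlation on binarized scores (not shown).
physician_scores = [3, 5, 4, 2, 5]               # hypothetical accuracy ratings
automated_scores = [0.41, 0.83, 0.67, 0.29, 0.78]
rho, _ = spearmanr(physician_scores, automated_scores)
tau, _ = kendalltau(physician_scores, automated_scores)
print(f"Spearman rho={rho:.3f}  Kendall tau={tau:.3f}")
```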
Results
LLM-based evaluation demonstrated stronger alignment with physician-assigned scores than NLP metrics. The reference-guided GPT-4 evaluator achieved the highest correlation with physician-assigned scores (ρ=0.758), followed by GPT-4o (ρ=0.727). NLP metrics showed weak to moderate correlations with physician-assigned scores (ρ=0.240–0.403). Reference-guided scoring outperformed reference-free scoring.
Discussion
Reference-guided LLM-based evaluation methods approximate expert physicians' judgment better than traditional NLP metrics, offering an effective, scalable approach for assessing LLM-generated responses to patient questions about rare diseases.
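For concreteness, a reference-guided LLM evaluator can be sketched as a single scoring prompt that gives the judge model the question, the physician-written reference answer, and the candidate response. The prompt wording, 1–5 scale, and function name below are illustrative assumptions, not the study's actual rubric; a reference-free variant would simply omit the reference answer.

```python
# Illustrative reference-guided LLM evaluator (prompt and 1-5 scale are
# assumptions; the study's actual rubric may differ).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def reference_guided_score(question: str, reference: str, response: str) -> int:
    """Ask a judge model to rate a response's accuracy against a reference."""
    prompt = (
        "You are a medical expert evaluating an answer to a patient question "
        "about complex lymphatic anomalies.\n\n"
        f"Question: {question}\n"
        f"Reference answer (physician-written): {reference}\n"
        f"Candidate answer: {response}\n\n"
        "Rate the candidate answer's accuracy against the reference on a "
        "scale of 1 (inaccurate) to 5 (fully accurate). Reply with the "
        "number only."
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring for reproducibility
    )
    return int(completion.choices[0].message.content.strip())
```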
Conclusion
LLM-based evaluation, particularly reference-guided scoring with GPT models, can support the scalable development and evaluation of LLM-based rare disease-specific chatbot systems.