Automating Evaluation of LLM-generated Responses to Patient Questions about Rare Diseases
Abstract
Objectives
Patients with rare diseases often struggle to find accurate medical information, and large language model (LLM)-based chatbots may help meet this need. However, evaluating LLM-generated free-text answers typically requires physician review, which is time-consuming and difficult to scale. This study compared traditional natural language processing (NLP) metrics to emerging LLM-based evaluation approaches for assessing answer quality in the context of Complex Lymphatic Anomalies (CLAs).
Materials and Methods
We compiled 25 common patient questions about CLAs and generated 175 responses to these questions from seven LLMs. Three expert physicians scored these responses for accuracy. We compared the physician-assigned scores with automated scores generated by four NLP sentence-similarity metrics (BLEU, ROUGE, METEOR, BERTScore) and six LLM evaluators (GPT-4, GPT-4o, Qwen3-32B, DeepSeek-R1-14B, Gemma3-27B, LLaMA3.3-70B). We examined LLM-based scoring both with and without reference answers (reference-guided vs. reference-free). We calculated Spearman, Phi, and Kendall's Tau correlation coefficients to assess alignment between automated and physician-assigned scores.
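As an illustration of the automated scoring pipeline, the sketch below computes the four sentence-similarity metrics for one candidate response against a physician-written reference answer, then correlates automated scores with physician ratings. The toy data, variable names, and metric configurations are assumptions for illustration; the study's exact settings may differ.

```python
# Minimal sketch of the automated scoring pipeline (hypothetical data and
# configuration; the study's exact settings may differ).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score  # needs NLTK 'wordnet' corpus
from rouge_score import rouge_scorer
from bert_score import score as bert_score
from scipy.stats import spearmanr, kendalltau

reference = "Complex lymphatic anomalies are rare disorders of lymphatic development."
response = "Complex lymphatic anomalies are rare conditions affecting the lymphatic system."

ref_tok, resp_tok = reference.split(), response.split()

# BLEU with smoothing (single-sentence BLEU collapses to ~0 without it).
bleu = sentence_bleu([ref_tok], resp_tok,
                     smoothing_function=SmoothingFunction().method1)

# ROUGE-L F1 via Google's rouge-score package (argument order: target, prediction).
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = rouge.score(reference, response)["rougeL"].fmeasure

# METEOR (recent NLTK versions expect pre-tokenized input).
meteor = meteor_score([ref_tok], resp_tok)

# BERTScore returns (precision, recall, F1) tensors; keep F1.
_, _, f1 = bert_score([response], [reference], lang="en", verbose=False)
bertscore_f1 = f1.item()

print(f"BLEU={bleu:.3f}  ROUGE-L={rouge_l:.3f}  "
      f"METEOR={meteor:.3f}  BERTScore={bertscore_f1:.3f}")

# Alignment with expert judgment: rank correlations between automated and
# physician-assigned scores across all responses (toy values below).
# Phi can be obtained as a Pearson correlation on binarized scores (not shown).
physician_scores = [3, 5, 4, 2, 5]               # hypothetical accuracy ratings
automated_scores = [0.41, 0.83, 0.67, 0.29, 0.78]
rho, _ = spearmanr(physician_scores, automated_scores)
tau, _ = kendalltau(physician_scores, automated_scores)
print(f"Spearman rho={rho:.3f}  Kendall tau={tau:.3f}")
```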
Results
LLM-based evaluation demonstrated stronger alignment with physician-assigned scores than NLP metrics. The reference-guided GPT-4 evaluator achieved the highest correlation with physician-assigned scores (ρ=0.758), followed by GPT-4o (ρ=0.727). NLP metrics showed weak to moderate correlations with physician-assigned scores (ρ=0.240–0.403). Reference-guided scoring outperformed reference-free scoring.
Discussion
Reference-guided LLM-based evaluation methods approximate expert physicians' judgment better than traditional NLP metrics, offering an effective, scalable approach for assessing LLM-generated responses to patient questions about rare diseases.
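For concreteness, a reference-guided LLM evaluator can be sketched as a single scoring prompt that gives the judge model the question, the physician-written reference answer, and the candidate response. The prompt wording, 1–5 scale, and function name below are illustrative assumptions, not the study's actual rubric; a reference-free variant would simply omit the reference answer.

```python
# Illustrative reference-guided LLM evaluator (prompt and 1-5 scale are
# assumptions; the study's actual rubric may differ).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def reference_guided_score(question: str, reference: str, response: str) -> int:
    """Ask a judge model to rate a response's accuracy against a reference."""
    prompt = (
        "You are a medical expert evaluating an answer to a patient question "
        "about complex lymphatic anomalies.\n\n"
        f"Question: {question}\n"
        f"Reference answer (physician-written): {reference}\n"
        f"Candidate answer: {response}\n\n"
        "Rate the candidate answer's accuracy against the reference on a "
        "scale of 1 (inaccurate) to 5 (fully accurate). Reply with the "
        "number only."
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring for reproducibility
    )
    return int(completion.choices[0].message.content.strip())
```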
Conclusion
LLM-based evaluation, particularly reference-guided scoring with GPT models, can support the scalable development and evaluation of LLM-based rare disease-specific chatbot systems.