Blind Expert Evaluation of Open-Weight LLMs for Thyroid Cancer Patient Education in a Non-English Setting: GPT-OSS-20B vs MedGemma-27B-Instruct
Abstract
Background: Non-English and resource-constrained clinical contexts are underrepresented in current large language model (LLM) benchmarking. Open-weight LLMs are increasingly used for patient education, yet it remains unclear whether medical domain specialization improves patient-facing answers when models are deployed locally in such settings. We compared a general-purpose open-weight model (GPT-OSS-20B) with a medically specialized open-weight model (MedGemma-27B-Instruct) for thyroid cancer patient education in Turkish.

Methods: Sixty Turkish patient questions about thyroid cancer were answered by both models. Five endocrinologists, blinded to model identity and study hypotheses, rated each response on 5-point Likert scales for Accuracy, Completeness, Clarity, Clinical Utility, and Satisfaction. Primary inference used per-question median ratings (N = 60 paired observations per criterion) with Wilcoxon signed-rank tests and Holm adjustment; effect size was the rank-biserial correlation (RBC), and location shift was estimated with Hodges–Lehmann differences. Inter-rater reliability was assessed with ICC(2,k), and ceiling-aware summaries included perfect-score and top-box analyses.

Results: GPT-OSS-20B achieved higher question-level median ratings than MedGemma-27B-Instruct across all five criteria after Holm correction. The largest differences were observed for Satisfaction (median 5.0 vs 4.0; RBC = 0.788; Holm-adjusted p < 0.001) and Completeness (median 5.0 vs 4.0; RBC = 0.599; Holm-adjusted p < 0.001). Inter-rater reliability was good and comparable across models (ICC(2,k) ≈ 0.74–0.80). Ceiling-aware reporting showed consistently higher perfect-score proportions for GPT-OSS-20B across criteria, with the most pronounced gaps in Satisfaction and Completeness.
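The primary inference described above (paired Wilcoxon signed-rank tests with Holm adjustment and matched-pairs rank-biserial effect sizes) can be sketched as follows. This is a minimal illustration on toy ratings, not the study's data or analysis code; it assumes NumPy/SciPy are available, and the helper names `rank_biserial` and `holm_adjust` are ours.

```python
import numpy as np
from scipy.stats import wilcoxon, rankdata

def rank_biserial(x, y):
    """Matched-pairs rank-biserial correlation: (R+ - R-) / (R+ + R-)
    over the ranks of the absolute nonzero paired differences."""
    d = np.asarray(x, float) - np.asarray(y, float)
    d = d[d != 0]                      # standard Wilcoxon drops zero differences
    r = rankdata(np.abs(d))            # ranks of |d|; ties get averaged ranks
    return (r[d > 0].sum() - r[d < 0].sum()) / r.sum()

def holm_adjust(pvals):
    """Holm step-down adjusted p-values (monotone, capped at 1)."""
    p = np.asarray(pvals, float)
    order = np.argsort(p)
    adj = np.empty_like(p)
    running = 0.0
    m = len(p)
    for i, idx in enumerate(order):
        running = max(running, (m - i) * p[idx])  # enforce monotonicity
        adj[idx] = min(1.0, running)
    return adj

# Toy per-question median ratings for one criterion (hypothetical, 1-5 scale):
a = np.array([5, 5, 4, 5, 5, 4, 5, 3, 5, 4], float)  # e.g. model A
b = np.array([4, 4, 4, 5, 3, 4, 4, 3, 4, 4], float)  # e.g. model B
stat, p = wilcoxon(a, b)               # paired signed-rank test, zeros dropped
rbc = rank_biserial(a, b)
print(f"W = {stat}, p = {p:.4f}, RBC = {rbc:.3f}")
```

In the study's design this would be run once per criterion (five tests), with `holm_adjust` applied to the resulting five p-values.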
Conclusions: In this first head-to-head comparison of open-weight LLMs for thyroid cancer patient education in Turkish, the general-purpose GPT-OSS-20B significantly outperformed the medically fine-tuned MedGemma-27B-Instruct across all evaluation criteria. These findings suggest that medical domain specialization does not necessarily yield superior patient-facing educational content in non-English settings and that general-purpose open-weight models may offer advantages for patient education tasks in resource-constrained contexts.