Blind Expert Evaluation of Open-Weight LLMs for Thyroid Cancer Patient Education in a Non-English Setting: GPT-OSS-20B vs MedGemma-27B-Instruct

Abstract

Background: Non-English and resource-constrained clinical contexts are underrepresented in current large language model (LLM) benchmarking, so it remains uncertain whether medical domain specialization improves patient-facing education when open-weight models are deployed locally. We compared a general-purpose open-weight model (GPT-OSS-20B) with a medically specialized open-weight model (MedGemma-27B-Instruct) for thyroid cancer patient education in Turkish.

Methods: Sixty Turkish patient questions about thyroid cancer were answered by both models. Five endocrinologists, blinded to model identity and study hypotheses, rated each response on 5-point Likert scales for Accuracy, Completeness, Clarity, Clinical Utility, and Satisfaction. Primary inference used per-question median ratings (N = 60 paired observations per criterion) with Wilcoxon signed-rank tests and Holm adjustment; effect size was the rank-biserial correlation (RBC), and location shift was estimated with Hodges–Lehmann differences. Inter-rater reliability was assessed with ICC(2,k), and ceiling-aware summaries included perfect-score and top-box analyses.

Results: GPT-OSS-20B achieved higher question-level median ratings than MedGemma-27B-Instruct on all five criteria after Holm correction. The largest differences were observed for Satisfaction (median 5.0 vs 4.0; RBC = 0.788; Holm-adjusted p < 0.001) and Completeness (median 5.0 vs 4.0; RBC = 0.599; Holm-adjusted p < 0.001). Inter-rater reliability was good and comparable across models (ICC(2,k) ≈ 0.74–0.80). Ceiling-aware reporting showed consistently higher perfect-score proportions for GPT-OSS-20B across criteria, with the most pronounced gaps in Satisfaction and Completeness.
Conclusions: In this first head-to-head comparison of open-weight LLMs for thyroid cancer patient education in Turkish, the general-purpose GPT-OSS-20B significantly outperformed the medically fine-tuned MedGemma-27B-Instruct across all evaluation criteria. These findings suggest that medical domain specialization does not necessarily yield superior patient-facing educational content in non-English settings, and that general-purpose open-weight models may offer advantages for patient education tasks in resource-constrained contexts.
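The primary analysis described in the Methods (paired Wilcoxon signed-rank tests on per-question median ratings, Holm adjustment across the five criteria, and matched-pairs rank-biserial correlation as the effect size) can be sketched as follows. This is a minimal illustration, not the authors' code: the ratings are simulated placeholders, and the helper names (`paired_rank_biserial`, `holm_adjust`) are assumptions introduced here for clarity.

```python
import numpy as np
from scipy.stats import wilcoxon

def paired_rank_biserial(x, y):
    """Matched-pairs rank-biserial correlation.
    Zero differences are dropped; ties in |d| are not average-ranked
    in this simplified sketch."""
    d = np.asarray(x, float) - np.asarray(y, float)
    d = d[d != 0]
    ranks = np.argsort(np.argsort(np.abs(d))) + 1  # 1-based ranks of |d|
    t_plus = ranks[d > 0].sum()
    t_minus = ranks[d < 0].sum()
    return (t_plus - t_minus) / (t_plus + t_minus)

def holm_adjust(pvals):
    """Holm step-down adjusted p-values (family-wise error control)."""
    p = np.asarray(pvals, float)
    order = np.argsort(p)
    m = len(p)
    adj = np.empty(m)
    running = 0.0
    for rank, idx in enumerate(order):
        running = max(running, (m - rank) * p[idx])  # step-down multiplier
        adj[idx] = min(1.0, running)
    return adj

# Simulated per-question median ratings (1-5 scale) for 60 paired
# questions across five criteria -- placeholders, not the study's data.
rng = np.random.default_rng(42)
criteria = ["Accuracy", "Completeness", "Clarity", "Clinical Utility", "Satisfaction"]
raw_p, rbc = [], []
for _ in criteria:
    a = rng.integers(3, 6, size=60).astype(float)        # model A medians
    b = np.clip(a - rng.integers(0, 2, size=60), 1, 5)   # model B medians
    raw_p.append(wilcoxon(a, b).pvalue)                  # paired signed-rank test
    rbc.append(paired_rank_biserial(a, b))
adj_p = holm_adjust(raw_p)
for name, p, r in zip(criteria, adj_p, rbc):
    print(f"{name}: Holm-adjusted p = {p:.3g}, RBC = {r:.2f}")
```

In practice the per-question medians would come from the five blinded raters' scores, and inter-rater reliability (ICC(2,k)) and Hodges–Lehmann shifts would be computed separately; this sketch covers only the hypothesis-testing step.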
