Blind Expert Evaluation of Open-Weight LLMs for Thyroid Cancer Patient Education in a Non-English Setting: GPT-OSS-20B vs MedGemma-27B-Instruct
Abstract
Background: Non-English and resource-constrained clinical contexts are underrepresented in current large language model (LLM) benchmarking. Open-weight LLMs are increasingly used for patient education, yet it remains unclear whether medical domain specialization improves patient-facing answers when models are deployed locally in such settings. We compared a general-purpose open-weight model (GPT-OSS-20B) with a medically specialized open-weight model (MedGemma-27B-Instruct) for thyroid cancer patient education in Turkish.

Methods: Sixty Turkish patient questions about thyroid cancer were answered by both models. Five endocrinologists, blinded to model identity and study hypotheses, rated each response on 5-point Likert scales for Accuracy, Completeness, Clarity, Clinical Utility, and Satisfaction. Primary inference used per-question median ratings (N = 60 paired observations per criterion) with Wilcoxon signed-rank tests and Holm adjustment; effect size was the rank-biserial correlation (RBC), and location shift was estimated with Hodges–Lehmann differences. Inter-rater reliability was assessed with ICC(2,k), and ceiling-aware summaries included perfect-score and top-box analyses.

Results: GPT-OSS-20B achieved higher question-level median ratings than MedGemma-27B-Instruct across all five criteria after Holm correction. The largest differences were observed for Satisfaction (median 5.0 vs 4.0; RBC = 0.788; Holm-adjusted p < 0.001) and Completeness (median 5.0 vs 4.0; RBC = 0.599; Holm-adjusted p < 0.001). Inter-rater reliability was good and comparable across models (ICC(2,k) ≈ 0.74–0.80). Ceiling-aware reporting showed consistently higher perfect-score proportions for GPT-OSS-20B across criteria, with the most pronounced gaps in Satisfaction and Completeness.
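The primary inference described above (paired Wilcoxon signed-rank tests with Holm adjustment and matched-pairs rank-biserial effect sizes) can be sketched as follows. This is a minimal illustration on toy ratings, not the study's data or analysis code; it assumes NumPy/SciPy are available, and the helper names `rank_biserial` and `holm_adjust` are ours.

```python
import numpy as np
from scipy.stats import wilcoxon, rankdata

def rank_biserial(x, y):
    """Matched-pairs rank-biserial correlation: (R+ - R-) / (R+ + R-)
    over the ranks of the absolute nonzero paired differences."""
    d = np.asarray(x, float) - np.asarray(y, float)
    d = d[d != 0]                      # standard Wilcoxon drops zero differences
    r = rankdata(np.abs(d))            # ranks of |d|; ties get averaged ranks
    return (r[d > 0].sum() - r[d < 0].sum()) / r.sum()

def holm_adjust(pvals):
    """Holm step-down adjusted p-values (monotone, capped at 1)."""
    p = np.asarray(pvals, float)
    order = np.argsort(p)
    adj = np.empty_like(p)
    running = 0.0
    m = len(p)
    for i, idx in enumerate(order):
        running = max(running, (m - i) * p[idx])  # enforce monotonicity
        adj[idx] = min(1.0, running)
    return adj

# Toy per-question median ratings for one criterion (hypothetical, 1-5 scale):
a = np.array([5, 5, 4, 5, 5, 4, 5, 3, 5, 4], float)  # e.g. model A
b = np.array([4, 4, 4, 5, 3, 4, 4, 3, 4, 4], float)  # e.g. model B
stat, p = wilcoxon(a, b)               # paired signed-rank test, zeros dropped
rbc = rank_biserial(a, b)
print(f"W = {stat}, p = {p:.4f}, RBC = {rbc:.3f}")
```

In the study's design this would be run once per criterion (five tests), with `holm_adjust` applied to the resulting five p-values.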
Conclusions: In this first head-to-head comparison of open-weight LLMs for thyroid cancer patient education in Turkish, the general-purpose GPT-OSS-20B significantly outperformed the medically fine-tuned MedGemma-27B-Instruct across all evaluation criteria. These findings suggest that medical domain specialization does not necessarily yield superior patient-facing educational content in non-English settings and that general-purpose open-weight models may offer advantages for patient education tasks in resource-constrained contexts.