Comparative Evaluation of Large Language Models for Medical Education: Performance Analysis in Urinary System Histology.
Abstract
Large language models (LLMs) show potential for medical education, but their domain-specific capabilities require systematic evaluation. This study presents a comparative assessment of thirteen LLMs in urinary system histology education. Using a multi-dimensional framework, we evaluated the models on two tasks: answering 65 validated multiple-choice questions (MCQs) and generating clinical scenarios with accompanying assessment items. For MCQ performance, we assessed accuracy together with explanation quality, measured by relevance and comprehensiveness metrics. For scenario generation, we evaluated five dimensions: Quality, Complexity, Relevance, Correctness, and Variety. Performance varied substantially across models and tasks, with ChatGPT-o1 achieving the highest MCQ accuracy (96.31 ± 17.85%) and Claude-3.5 demonstrating the strongest clinical scenario generation (91.4% of the maximum possible score). All models significantly outperformed random guessing, with large effect sizes. Statistical analyses revealed significant differences in consistency across repeated attempts and in dimensional performance, with most models scoring higher on Correctness than on Quality in scenario generation. Term frequency analysis revealed significant content imbalances across all models, with systematic overemphasis of certain anatomical structures and complete omission of others. Our findings demonstrate that while LLMs show considerable promise for medical education, their reliable implementation requires matching specific models to appropriate educational tasks, implementing verification mechanisms, and recognizing their current limitations in generating pedagogically balanced content.