Evaluation of DeepSeek-R1 and ChatGPT on the Chinese National Medical Licensing Examination: A Multi-Year Comparative Study
Abstract
Background: Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and reasoning. However, their real-world applicability in high-stakes medical assessments remains underexplored, particularly in non-English contexts. This study evaluates the performance of DeepSeek-R1 and ChatGPT on the Chinese National Medical Licensing Examination (NMLE), a comprehensive benchmark of medical knowledge and clinical reasoning.

Methods: We evaluated the performance of ChatGPT and DeepSeek-R1 on the NMLE (2019–2021) using question-level binary accuracy (correct = 1, incorrect = 0) as the outcome. A generalized linear mixed model (GLMM) with a binomial distribution and logit link was used to examine fixed effects of model type, year, and subject unit, including their interactions, with random intercepts across questions. Post hoc pairwise comparisons were conducted to assess differences across model–year combinations.

Results: DeepSeek-R1 significantly outperformed ChatGPT overall (β = –1.829, p < 0.001). Temporal analysis revealed a significant decline in ChatGPT's accuracy from 2019 to 2021 (p < 0.05), whereas DeepSeek-R1 maintained stable performance. Across subjects, Unit 3 showed the highest accuracy relative to Unit 1 (β = 0.344, p = 0.001). A significant model–year interaction in 2020 (β = –0.567, p = 0.009) indicated an amplified performance gap between the two models. These results highlight the importance of model selection, domain adaptation, and temporal robustness when deploying LLMs for professional medical assessments.

Conclusions: This longitudinal evaluation highlights both the potential and the limitations of LLMs in medical licensing contexts. While current models show promising results, further fine-tuning is necessary for clinical applicability. The NMLE offers a robust benchmark for future development of trustworthy AI-assisted medical decision support tools in non-English settings.
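For readers who wish to replicate this type of analysis, a minimal sketch of the GLMM described in the Methods is shown below. It assumes a long-format table with one row per model–question attempt; the column names (correct, model, year, unit, qid), the input file name, and the choice of statsmodels' variational-Bayes mixed GLM are illustrative assumptions, as the paper does not specify its software or variable naming.

```python
# Sketch of the binomial GLMM (logit link) described in the Methods:
# fixed effects for model, year, and subject unit (with a model-by-year
# interaction) and a random intercept per question.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Hypothetical long-format data: one row per model-question attempt,
# with correct coded 1/0 as in the paper.
df = pd.read_csv("nmle_responses.csv")

glmm = BinomialBayesMixedGLM.from_formula(
    "correct ~ C(model) * C(year) + C(unit)",  # fixed effects + interaction
    {"question": "0 + C(qid)"},                # random intercept per question
    df,
)
result = glmm.fit_vb()  # variational Bayes fit
print(result.summary())
```

Note that statsmodels fits this model with approximate Bayesian inference rather than the likelihood-based estimation typical of GLMM software, so coefficient estimates and p-values may differ somewhat from those reported above.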