Performance of Large Language Models in a National Medical Licensing Examination: A Two-Year Comparative Study of Gemini 3 Pro, DeepSeek V3.1, and GPT-5.2 in Traditional Chinese Medicine

Chenghan Du
Yien Pan
Cheoklong Ng
Yingjie Ding
Jiahua Pan
Wei Xue
Xiaoying Yao
Jiwei Huang

Abstract

Background: Large language models (LLMs) are increasingly integrated into medical education and assessment. However, their performance in high-stakes, non-English medical licensing examinations, particularly within culturally distinct medical systems such as Traditional Chinese Medicine (TCM), remains insufficiently evaluated.

Objective: This study aimed to systematically compare the performance of Gemini 3 Pro, DeepSeek V3.1, and GPT-5.2 in the Chinese National Traditional Chinese Medicine Licensing Examination (TCMLE) over two consecutive years (2023 and 2024), and to assess their potential implications for TCM education.

Methods: All original examination questions from the 2023 and 2024 TCMLE (600 questions per year), encompassing all official question types and examination units, were independently input into each model in Chinese. Model responses were evaluated based on accuracy. Comparative analyses across models, question types, and examination units were conducted using chi-square tests, with statistical significance set at P<.05.

Results: DeepSeek V3.1 demonstrated significantly higher overall accuracy than Gemini 3 Pro and GPT-5.2 in both 2023 (87.1%) and 2024 (86.7%) (P<.001 for all comparisons). Gemini 3 Pro exhibited moderate and relatively stable performance across both years, whereas GPT-5.2 achieved the lowest overall accuracy despite a modest improvement from 2023 to 2024. Notably, DeepSeek V3.1 showed particular strength in structured and clinically oriented question formats and in foundational knowledge units.

Conclusions: Linguistic and cultural alignment plays a critical role in LLM performance on specialized medical licensing examinations. Locally optimized models such as DeepSeek V3.1 may serve as valuable auxiliary tools in TCM education, particularly for examination preparation and knowledge reinforcement, although careful human oversight remains essential.
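As a rough illustration of the kind of comparison described in the Methods, the sketch below runs a Pearson chi-square test on the accuracy of two models. It is not the authors' analysis code. The function name `chi_square_2x2` and the counts in the usage example are hypothetical (the abstract reports percentages only), and the test is assumed to be run without a continuity correction, which the abstract does not specify.

```python
import math


def chi_square_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square test (no continuity correction) comparing two accuracy rates.

    Returns (chi2, p_value) for the 2x2 table of correct/incorrect answers.
    Assumes every row and column total is non-zero.
    """
    observed = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    grand_total = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [correct_a + correct_b, grand_total - correct_a - correct_b]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the null hypothesis of equal accuracy.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts for illustration only: 520/600 vs 450/600 correct.
chi2, p = chi_square_2x2(520, 600, 450, 600)
print(f"chi2 = {chi2:.2f}, significant at P<.05: {p < 0.05}")
```

The same function could be applied per question type or per examination unit by passing the corresponding subset counts.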

Version published to 10.21203/rs.3.rs-8930037/v1 on Research Square
Mar 3, 2026

Related articles

Assessment of Professional Medical Capabilities in Mainstream Chinese Large Language Models for Tremor-Related Diseases: A Comparative Study Based on Expert Scoring

This article has 2 authors:
1. Yang Bai
2. Longsheng Pan
This article has no evaluations. Latest version Mar 23, 2026

Language-dependent variability in large language model performance on pharmaceutical knowledge tasks

This article has 7 authors:
1. Hiroto Asano
2. Yu-Shi Tian
3. Asuka Hatabu
4. Minako Ohishi
5. Kaori Fukuzawa
6. Daisuke Takaya
7. Kenji Ikeda
This article has no evaluations. Latest version Mar 27, 2026

Standardized Assessment of LLM English Proficiency

This article has 7 authors:
1. Shangchao Min
2. Shaonan Wang
3. Xinyu Gao
4. Hui Wang
5. Zhiling Jin
6. Chen Ling
7. Nai Ding
This article has no evaluations. Latest version Feb 19, 2026
