Evaluation of Large Language Models on the Chinese Dental Licensing Examination
Abstract
Objective: This study aimed to evaluate the performance of large language models (LLMs) on the Chinese Dental Licensing Examination (CDLE). It also examined whether including an ‘unknown’ option in prompts—or combining this option with a penalty for incorrect answers—could improve model accuracy and reduce hallucinations.

Methods: The official preparation book, Historical Chinese Dental Licensing Examinations, authored by the Chinese National Licensed Physician Qualification Examination Proposition Research Group, was used as the data source. Three cloud-based models (Qwen3-Max, Qwen-Plus, and DeepSeek-V3.1) and two locally deployed models (Qwen3-32B and GPT-OSS-120B) were evaluated on the CDLE. A custom-designed program was developed to administer the CDLE automatically, using the OpenAI API to communicate with both locally deployed and cloud-based LLMs. Model performance was evaluated at both the exam and question levels. Exam-level performance was assessed by mean accuracy (± standard deviation, SD) and pass/fail outcomes, while question-level performance was evaluated primarily by accuracy with 95% and 99% confidence intervals (CIs).

Results: A dataset comprising four CDLEs (2,400 questions in total) was constructed. Each question was a five-option, single-answer multiple-choice question. Qwen3-Max, Qwen-Plus, DeepSeek-V3.1, Qwen3-32B, and GPT-OSS-120B achieved exam-level mean accuracies (±SD) of 0.866±0.089, 0.851±0.0767, 0.737±0.0738, 0.748±0.0868, and 0.652±0.0799, respectively. At the question level, the accuracies with 95% CIs were 0.865 (0.852–0.878), 0.851 (0.837–0.865), 0.727 (0.709–0.745), 0.741 (0.724–0.756), and 0.651 (0.634–0.671), respectively. Prompts that included an ‘unknown’ option—or combined it with a penalty for incorrect answers—did not improve model accuracy.

Conclusion: All models successfully passed the CDLEs, with some achieving remarkably high scores.
Among them, Qwen3-Max demonstrated the best overall performance across all evaluated metrics. Other uncertainty estimation methods should be considered instead of simply adding an ‘unknown’ option to the input prompt. In the future, LLMs are expected to play an important role in dental education, particularly in supporting medical students’ self-directed learning.