Assessment of Professional Medical Capabilities in Mainstream Chinese Large Language Models for Tremor-Related Diseases: A Comparative Study Based on Expert Scoring


Abstract

Background: Large language models (LLMs), exemplified by DeepSeek-R1, demonstrate transformative potential in medical knowledge question-answering tasks. However, their cognitive boundaries for complex diseases, particularly conditions such as tremor, require systematic evaluation.

Objective: To evaluate the medical capabilities of three mainstream LLMs in the Chinese context by testing their responses to complex questions about tremor-related diseases, and thereby to explore future applications of LLMs in medicine.

Methods: Three commercial Chinese-context LLMs (DeepSeek-R1-671B, Moonshot-v1-128k-vision-preview, Doubao-1.5-pro-256k) were selected. Based on the clinical characteristics of tremor disorders and consultation with domain experts, an evaluation matrix was developed covering six dimensions: pathogenesis, risk factors, clinical manifestations, diagnosis, treatment, and prevention/prognosis. Each dimension contained six complex questions. After standardized, parameter-controlled question answering, responses were randomly ordered. Three experts, each with more than 10 years of subspecialty clinical experience, scored the models' answer texts to comprehensively assess their medical capabilities on complex tremor-related inquiries.

Results: The LLMs exhibited significant performance differences when addressing complex queries related to tremor disorders. DeepSeek-R1-671B performed best (mean score 9.1 ± 0.33), significantly outperforming Doubao-1.5-pro-256k (6.8 ± 1.65) and Moonshot-v1-128k-vision-preview (4.9 ± 1.02) (P < 0.05). Moonshot-v1-128k-vision-preview produced one potentially harmful response in the treatment-recommendation safety scoring. Inter-rater consistency was high (Cronbach's alpha = 0.94).
In a comparative study against DeepSeek-R1-70B, DeepSeek-R1-671B also demonstrated significant advantages, likely attributable to its larger parameter scale.

Conclusion: The DeepSeek-R1-671B model is currently capable of assisting medical decision-making and providing medical background knowledge. However, future clinical applications will require refinement on high-quality, specialized medical training datasets. The six-dimensional clinical question matrix developed in this study provides a feasible framework for systematically evaluating the medical capabilities of LLMs in specific disease domains.