TCM-3CEval: A Triaxial Benchmark for Assessing Responses From Large Language Models in Traditional Chinese Medicine

Jie Xu
Tianai Huang
Lu Lu
Jiayuan Chen


Abstract

Large language models (LLMs) have demonstrated exceptional capabilities across diverse NLP tasks and domains, including modern medicine. However, the systematic evaluation of LLMs in traditional Chinese medicine (TCM), a field with rich historical depth and clinical complexity, remains underexplored. To address this gap, we introduce TCM-3CEval, a comprehensive benchmark designed to assess LLMs in TCM across three critical dimensions: (1) mastery of core TCM knowledge, (2) understanding of classical TCM texts, and (3) clinical decision-making. We conduct rigorous evaluations on diverse model categories, including general-purpose international models (e.g., GPT-4o), Chinese general models (e.g., InternLM), and medical domain-specific models (e.g., PLUSE). Our findings reveal a clear performance hierarchy. (i) Systemic limitations persist: all models exhibit pronounced deficiencies in specialized subdomains such as Meridian & Acupoint theory and Various TCM Schools, revealing critical gaps between current capabilities and clinical requirements. (ii) Cultural-contextual alignment proves essential: models trained with Chinese linguistic and cultural priors significantly outperform international counterparts in classical text interpretation and clinical reasoning. TCM-3CEval establishes a standardized evaluation paradigm for AI in TCM, providing insights for optimizing LLMs in culturally grounded medical domains. The benchmark has also been uploaded to the recently launched TCM specialized track, "In-depth Challenge for Comprehensive TCM Abilities: Fundamental Theories, Classical Interpretation, and Clinical Decision-making", on Medbench. This competition aims to assess LLMs' capabilities in TCM comprehensively along three core dimensions: mastery of basic TCM knowledge, understanding of classic texts, and clinical diagnosis and treatment decision-making. It is designed with multidimensional questions and real clinical case scenarios to evaluate the models' professional TCM proficiency and practical application abilities.

Version published to 10.21203/rs.3.rs-6267002/v1 on Research Square
May 12, 2025

Benchmarking General-Purpose and Medical AI Large Language Models for Clinical Assessment and Management in Parkinson’s Disease

This article has 5 authors:
1. Shechter Yosef
2. Klevor Raymond
3. Kouchache Trycia
4. Bouhadoun Sarah
5. Ronald B Postuma
This article has no evaluations. Latest version: May 20, 2026

Evaluating 11 Large Language Models in Answering Key Questions on Ovarian Cancer

This article has 7 authors:
1. Michela Quaranta
2. Yong Sheng Tan
3. Areti Karamanou
4. Evangelos Kalampokis
5. Nicolas M Orsi
6. Diederick DeJong
7. Alexandros Laios
This article has no evaluations. Latest version: Apr 11, 2026

Evaluation of ChatGPT-4o’s and DeepSeek R1’s responses to urological problems: A comparative study

This article has 7 authors:
1. Hanbo Lu
2. Yusa Zhang
3. Zhan Wang
4. Yang Zhao
5. Jiang Liu
6. Dongxu Qiu
7. Yushi Zhang
This article has no evaluations. Latest version: Apr 8, 2026
