TCM-3CEval: A Triaxial Benchmark for Assessing Responses From Large Language Models in Traditional Chinese Medicine

Abstract

Large language models (LLMs) have demonstrated exceptional capabilities across diverse NLP tasks and domains, including modern medicine. However, the systematic evaluation of LLMs in traditional Chinese medicine (TCM), a field with rich historical depth and clinical complexity, remains underexplored. To address this gap, we introduce TCM-3CEval, a comprehensive benchmark designed to assess LLMs in TCM across three critical dimensions: (1) mastery of core TCM knowledge, (2) understanding of classical TCM texts, and (3) clinical decision-making. We evaluate diverse model categories, including general-purpose international models (e.g., GPT-4o), Chinese general models (e.g., InternLM), and medical domain-specific models (e.g., PLUSE). Our findings reveal a clear performance hierarchy: (i) Systemic limitations persist: all models exhibit pronounced deficiencies in specialized subdomains such as Meridian & Acupoint theory and Various TCM Schools, revealing critical gaps between current capabilities and clinical requirements. (ii) Cultural-contextual alignment proves essential: models trained with Chinese linguistic and cultural priors significantly outperform international counterparts in classical text interpretation and clinical reasoning. TCM-3CEval establishes a standardized evaluation paradigm for AI in TCM, providing insights for optimizing LLMs in culturally grounded medical domains. The benchmark has simultaneously been uploaded to the recently launched TCM specialized track "In-depth Challenge for Comprehensive TCM Abilities: Fundamental Theories, Classical Interpretation, and Clinical Decision-making" on Medbench. This competition assesses LLMs' capabilities in TCM along the same three dimensions, using multidimensional questions and real clinical case scenarios to evaluate the models' professional TCM proficiency and practical application abilities.
