LingLanMiDian: Systematic Evaluation of LLMs on TCM Knowledge and Clinical Reasoning
Large language models (LLMs) are advancing rapidly in medical NLP, yet Traditional Chinese Medicine (TCM), with its distinctive ontology, terminology, and reasoning patterns, requires domain-faithful evaluation. Existing TCM benchmarks are fragmented in coverage and scale, and they rely on non-unified or generation-heavy scoring that hinders fair comparison. We present the LingLanMiDian (LingLan) benchmark, a large-scale, expert-curated, multi-task suite that unifies evaluation across knowledge recall, multi-hop reasoning, information extraction, and real-world clinical decision-making. LingLan introduces a consistent metric design, a synonym-tolerant protocol for clinical labels, a per-dataset 400-item Hard subset, and a reframing of diagnosis and treatment recommendation into single-choice decision recognition. We conduct comprehensive zero-shot evaluations on 14 leading open-source and proprietary LLMs, providing a unified perspective on their strengths and limitations in TCM commonsense knowledge understanding, reasoning, and clinical decision support; critically, evaluation on the Hard subsets reveals a substantial gap between current models and human experts in TCM-specialized reasoning. By bridging fundamental knowledge and applied reasoning through standardized evaluation, LingLan establishes a unified, quantitative, and extensible foundation for advancing TCM LLMs and domain-specific medical AI research. All evaluation data and code are available at https://github.com/TCMAI-BJTU/LingLan and http://tcmnlp.com.
💡 Research Summary
The paper introduces LingLanMiDian (LingLan), a comprehensive benchmark designed to evaluate large language models (LLMs) on Traditional Chinese Medicine (TCM) knowledge and clinical reasoning. Recognizing that TCM possesses a distinct ontology, terminology, and experience‑driven reasoning style that differs markedly from modern biomedicine, the authors argue that existing medical benchmarks (e.g., MedQA, CMExam, CMB) are insufficient for systematic assessment of LLMs in this domain.
LingLan addresses these gaps by assembling a large‑scale, expert‑curated, multi‑task suite comprising 25,624 items drawn from nine sources: the official TCM licensing examinations (covering 14 subjects), fundamental TCM textbooks, Chinese patent medicine inserts, classical literature, electronic medical records (EMRs), and master‑physician casebooks. The benchmark spans 13 subtasks grouped into five categories: (1) single‑ and multiple‑choice licensing questions, (2) fundamental TCM knowledge Q&A, (3) Chinese patent medicine knowledge Q&A, (4) information extraction from both classical and clinical texts, and (5) diagnostic‑therapeutic decision making (DTR) plus decision recognition (DR).
A key innovation is the creation of a “Hard” subset for each subtask, consisting of 400 carefully selected difficult items that stress multi‑hop reasoning, ambiguous syndrome differentiation, and dosage proportionality. This allows the evaluation to go beyond ceiling‑level performance on easy questions and to probe models’ robustness in realistic, high‑stakes scenarios.
The authors also propose a unified metric framework. While accuracy is used for standard multiple-choice items, other tasks employ precision, recall, character-level F1, mean absolute error (MAE), and, for vector-based outputs, cosine similarity. Importantly, a synonym-tolerant protocol is introduced for clinical labels: multiple synonymous syndrome names or equivalent herbal prescriptions are accepted as correct, reflecting the inherent flexibility of TCM practice.
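The two less conventional pieces of this framework, character-level F1 and synonym-tolerant label matching, could be sketched roughly as follows. This is a minimal illustration only: the function names and the synonym table are hypothetical, not the benchmark's actual scoring code, which should be consulted in the linked repository.

```python
from collections import Counter

def char_f1(pred: str, gold: str) -> float:
    """Character-level F1: harmonic mean of precision and recall over
    the multiset overlap of characters in prediction and reference."""
    if not pred or not gold:
        return 0.0
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical synonym table: each gold clinical label maps to the set
# of answers accepted as equivalent (e.g. alternate syndrome names).
SYNONYMS = {
    "肝郁气滞": {"肝郁气滞", "肝气郁结"},
}

def synonym_tolerant_match(pred: str, gold: str) -> bool:
    """A prediction is correct if it lies in the gold label's synonym set
    (falling back to exact match when no synonyms are registered)."""
    return pred in SYNONYMS.get(gold, {gold})
```

Because character-level F1 ignores ordering, a reordered but complete answer still scores 1.0, which is a reasonable property for herb-list outputs where sequence carries no clinical meaning.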
In a zero‑shot setting, fourteen state‑of‑the‑art LLMs—including proprietary models (GPT‑4, GPT‑3.5‑Turbo, Claude‑2) and open‑source models (Qwen‑2‑7B, Baichuan‑2‑13B, LLaMA‑2‑70B, DeepSeek‑Chat, InternLM‑2‑20B)—are evaluated across the full benchmark. Results show near‑ceiling performance (≈92 % accuracy) on licensing‑style knowledge recall, indicating that LLMs have successfully internalized textbook facts. However, performance drops substantially on multi‑hop reasoning and information extraction (≈65–70 % accuracy) and further declines on diagnostic‑therapeutic decision tasks (≈55 % accuracy, MAE = 0.42). The Hard subsets exacerbate these gaps, with all models losing 15–20 % relative performance, while human experts maintain ≈92 % accuracy.
These findings reveal a clear dichotomy: LLMs excel at factual retrieval but remain weak in the nuanced, pattern‑based reasoning, dosage proportioning, and multi‑answer handling that characterize authentic TCM practice. The authors discuss failure modes such as over‑confidence in single answers, difficulty distinguishing overlapping syndromes, and limited ability to map synonymous herbal formulations.
To close the gap, the paper suggests several future directions: (1) integrating TCM‑specific knowledge graphs and reasoning chains during pre‑training, (2) employing multi‑label loss functions that explicitly model synonymy and equivalence, (3) leveraging human‑in‑the‑loop feedback to iteratively refine model outputs, and (4) expanding the benchmark to cover other regional variants of Chinese medicine and interactive dialogue scenarios.
Overall, LingLanMiDian constitutes the first large‑scale, multi‑dimensional, and metric‑standardized benchmark for TCM LLM evaluation. By unifying knowledge recall, structured extraction, and clinical decision making under a consistent evaluation protocol, it provides a solid quantitative foundation for future research on domain‑specific medical AI, facilitating more reliable development and comparison of TCM‑focused language models.