AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning in LLMs
The application of large language models (LLMs) in the medical field has garnered significant attention, yet their reasoning capabilities in more specialized domains like anesthesiology remain underexplored. To bridge this gap, we introduce AnesSuite, the first comprehensive dataset suite specifically designed for anesthesiology reasoning in LLMs. The suite features AnesBench, an evaluation benchmark tailored to assess anesthesiology-related reasoning across three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Alongside this benchmark, the suite includes three training datasets that provide an infrastructure for continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with verifiable rewards (RLVR). Leveraging this suite, we develop Morpheus, the first baseline model collection for anesthesiology reasoning. Despite undergoing limited training with SFT and group relative policy optimization (GRPO), Morpheus not only achieves substantial improvements in anesthesiology that rival larger-scale models, but also demonstrates enhanced reasoning capabilities across general medical and broad-domain benchmarks. Furthermore, through comprehensive evaluations and experiments, we analyze the key factors influencing anesthesiology reasoning performance, including model characteristics, training strategies and training data. Both AnesSuite and Morpheus will be open-sourced at https://github.com/MiliLab/AnesSuite.
💡 Research Summary
The paper addresses a notable gap in the application of large language models (LLMs) to highly specialized medical domains, focusing specifically on anesthesiology—a field that demands simultaneous management of airway, respiratory, cardiovascular, and sedation parameters. While general medical LLM benchmarks exist, they largely assess factual recall and neglect the complex decision‑making processes intrinsic to anesthesia. To fill this void, the authors introduce AnesSuite, the first comprehensive dataset suite dedicated to anesthesiology reasoning.
AnesSuite components
- AnesBench – a bilingual (English and Chinese) structured benchmark containing 7,972 multiple‑choice questions (MCQs). Each item is labeled with one of three cognitive‑demand levels derived from Kahneman's System 1/System 2 framework:
  - System 1 – pure factual retrieval.
  - System 1.x – factual retrieval combined with elementary reasoning.
  - System 2 – complex, multi‑step clinical decision‑making.

  Approximately 20–30 % of the questions fall into the higher‑order categories, ensuring a meaningful test of reasoning ability. The benchmark sources include ABA exam materials, standard textbooks, and validated online assessments, with manual verification and contamination checks confirming minimal overlap with existing model training data.
- AnesCorpus – a large‑scale text collection for continued pre‑training (CPT), comprising 1.8 M English and 0.6 M Chinese documents harvested from FineWeb datasets and filtered for relevance to anesthesiology. This corpus enables domain‑specific language modeling before any supervised fine‑tuning.
- AnesQA – a supervised fine‑tuning (SFT) dataset of 20,713 English question‑answer pairs. Each QA pair is annotated with one of five question‑type labels (e.g., pharmacology, physiology, equipment) to facilitate targeted fine‑tuning.
- AnesR1 – a verification‑oriented dataset of 10,287 MCQs (English and Chinese), each paired with a detailed chain‑of‑thought (CoT) reasoning trace. The CoT annotations are deliberately long, providing step‑by‑step logical scaffolding that can be used for SFT, reinforcement learning with verifiable rewards (RLVR), or as a gold standard for evaluating model reasoning.
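Since AnesBench labels every MCQ with a cognitive level, evaluation naturally reports accuracy per level rather than one aggregate score. The exact evaluation harness is not described here, so the following is a minimal sketch with hypothetical field names (`level`, `pred`, `answer`):

```python
from collections import defaultdict

def accuracy_by_level(items):
    """Compute per-level MCQ accuracy.

    Each item is a dict with hypothetical fields:
      'level'  -- 'System 1', 'System 1.x', or 'System 2'
      'pred'   -- the model's chosen option letter, e.g. 'B'
      'answer' -- the gold option letter
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["level"]] += 1
        correct[item["level"]] += int(item["pred"] == item["answer"])
    return {lvl: correct[lvl] / total[lvl] for lvl in total}

items = [
    {"level": "System 1", "pred": "A", "answer": "A"},
    {"level": "System 1", "pred": "B", "answer": "C"},
    {"level": "System 2", "pred": "D", "answer": "D"},
]
print(accuracy_by_level(items))  # {'System 1': 0.5, 'System 2': 1.0}
```

Reporting the three levels separately is what makes the scale-versus-reasoning analysis below possible: two models with the same aggregate accuracy can differ sharply on System 2 items.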
Baseline model: Morpheus
Leveraging AnesSuite, the authors develop Morpheus, a family of models built on the open‑source Qwen2.5 architecture (7B, 14B, and 32B parameter variants). Training proceeds in stages: (i) continued pre‑training on AnesCorpus, (ii) supervised fine‑tuning on AnesR1, and (iii) reinforcement learning with verifiable rewards via Group Relative Policy Optimization (GRPO), a policy‑optimization variant that scores each sampled response against a group‑level baseline rather than a learned value function. Despite modest training budgets, Morpheus achieves:
- Domain‑specific gains – on AnesBench, Morpheus matches or exceeds proprietary models such as GPT‑4o, Claude‑3.7‑Sonnet, Gemini‑2.5‑Flash/Pro across all three cognitive levels.
- General medical improvements – notable performance lifts on MedQA, PubMedQA, and other clinical QA benchmarks.
- Broad‑domain robustness – competitive scores on ARC‑C, MMLU, and other reasoning‑heavy tasks, demonstrating transferability beyond anesthesia.
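The group-relative baseline that gives GRPO its name can be made concrete with a small sketch. This is not the paper's implementation, just the core advantage computation under the usual GRPO formulation: sample a group of responses per prompt, reward each one (here with a verifiable exact-match reward), then normalize rewards within the group so no critic network is needed:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages for one prompt's sampled responses:
    subtract the group mean reward and divide by the group standard
    deviation, replacing the learned value baseline of PPO-style RL."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Verifiable reward: 1.0 iff the sampled MCQ answer matches the gold option.
group_preds = ["B", "B", "C", "B"]
gold = "B"
rewards = [float(p == gold) for p in group_preds]
print(grpo_advantages(rewards))
```

Responses above the group mean get positive advantages and are reinforced; the lone wrong answer gets a strongly negative advantage. Because correctness of an MCQ answer is mechanically checkable, the reward requires no human preference model, which is what makes the RLVR setup cheap.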
Key empirical findings
- Model scale vs. reasoning depth – Larger models improve overall accuracy, but the marginal benefit diminishes for System 2 tasks, indicating that sheer parameter count cannot fully substitute for specialized reasoning data.
- Chain‑of‑Thought length – Longer CoT annotations (≥3‑4 sentences) significantly boost System 2 performance, confirming prior work that step‑wise reasoning scaffolds help LLMs handle complex clinical scenarios.
- Multilingual transfer – Performance gaps persist between English and Chinese subsets, highlighting the sensitivity of multilingual models to the language balance in the CPT phase. Careful corpus curation is essential to avoid bias.
- Cross‑domain data synergy – Adding general medical QA data to the training mix yields higher generalization than training on anesthesia data alone, suggesting complementary knowledge benefits reasoning.
- Data contamination control – The authors employ a specialized MCQ leakage detector (Ni et al., 2025) and manual review, confirming that less than 2 % of AnesBench items appear in any public LLM training corpus, preserving benchmark integrity.
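The cited leakage detector (Ni et al., 2025) is more sophisticated than anything shown here, but the underlying idea of contamination screening can be illustrated with a simplified verbatim n-gram overlap check, where high-overlap items are flagged for manual review (the threshold and `n=8` are illustrative choices, not the paper's):

```python
def ngrams(text, n=8):
    """Set of word n-grams in a lowercased text."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_ratio(question, corpus_text, n=8):
    """Fraction of the question's word n-grams appearing verbatim in a
    reference corpus -- a crude proxy for training-data leakage."""
    q = ngrams(question, n)
    if not q:
        return 0.0
    return len(q & ngrams(corpus_text, n)) / len(q)

corpus = ("propofol is a short acting intravenous anesthetic "
          "agent used for induction")
leaked = ("propofol is a short acting intravenous anesthetic "
          "agent used for induction of anesthesia")
novel = ("which volatile agent is most associated with "
         "emergence delirium in children")
print(overlap_ratio(leaked, corpus), overlap_ratio(novel, corpus))
```

The near-duplicate question scores high while the unseen one scores zero; in a real pipeline, flagged items would go to the manual review step the authors describe.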
Contributions and impact
The paper makes three primary contributions: (1) the release of AnesSuite, a rigorously curated, multilingual benchmark and associated training corpora for anesthesiology reasoning; (2) the introduction of Morpheus, the first open‑source baseline suite of anesthesia‑focused LLMs that demonstrate competitive performance with far fewer resources; and (3) a comprehensive analysis of factors influencing domain‑specific reasoning, offering actionable guidance for future model scaling, data curation, and training strategy design.
All resources—including the benchmark, corpora, model checkpoints, and evaluation scripts—will be open‑sourced on GitHub (https://github.com/MiliLab/AnesSuite). By providing a standardized testbed and baseline, the work aims to accelerate research into safe, reliable, and clinically useful AI assistants for anesthesiology, ultimately contributing to higher‑quality peri‑operative care.