Baichuan-M3: Modeling Clinical Inquiry for Reliable Medical Decision-Making

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

We introduce Baichuan-M3, a medical-enhanced large language model engineered to shift the paradigm from passive question-answering to active, clinical-grade decision support. Addressing the limitations of existing systems in open-ended consultations, Baichuan-M3 utilizes a specialized training pipeline to model the systematic workflow of a physician. Key capabilities include: (i) proactive information acquisition to resolve ambiguity; (ii) long-horizon reasoning that unifies scattered evidence into coherent diagnoses; and (iii) adaptive hallucination suppression to ensure factual reliability. Empirical evaluations demonstrate that Baichuan-M3 achieves state-of-the-art results on HealthBench and on the newly introduced HealthBench-Hallu and ScanBench, significantly outperforming GPT-5.2 in clinical inquiry, advisory, and safety. The models are publicly available at https://huggingface.co/collections/baichuan-inc/baichuan-m3.


💡 Research Summary

Baichuan‑M3 is a medical‑enhanced large language model (LLM) designed to move beyond passive question‑answering toward active, end‑to‑end clinical decision support. The authors identify three core shortcomings of existing medical LLMs: (1) inability to proactively acquire missing patient information, (2) difficulty maintaining coherent long‑horizon reasoning across multiple conversational turns, and (3) susceptibility to hallucinations when forced to fill gaps in evidence. To address these, Baichuan‑M3 introduces three capabilities—proactive information acquisition, unified long‑horizon reasoning, and adaptive hallucination suppression—and a three‑stage training pipeline that isolates and then integrates each capability.

Patient Simulator
A novel patient simulator is built with two complementary modes. The “Passive Interaction Mode” (75 % of samples) supplies only a patient profile and clinical rubrics, forcing the model to ask relevant questions from scratch. The “Interruption‑Injected Mode” (25 % of samples) inserts a pre‑written dialogue snippet ending with a patient‑initiated query, mimicking mid‑consultation interruptions. An asymmetric visibility mechanism ensures the snippet is visible only to the physician agent, preserving simulator stability. This design yields a training environment that reflects real‑world uncertainty and interruption patterns.
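The two-mode design and the asymmetric visibility rule can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the names (`build_episode`, `simulator_state`, `physician_view`) and data shapes are assumptions; only the 75/25 split and the visibility asymmetry come from the description above.

```python
import random

PASSIVE_RATIO = 0.75  # Passive Interaction Mode share of training samples

def build_episode(profile, rubrics, snippet=None, rng=random):
    """Return (simulator_state, physician_view) for one training episode."""
    # The simulator only ever sees the patient profile and clinical rubrics.
    simulator_state = {"profile": profile, "rubrics": rubrics}
    if snippet is not None and rng.random() >= PASSIVE_RATIO:
        # Interruption-Injected Mode (~25% of samples): the pre-written
        # dialogue snippet, ending with a patient-initiated query, is
        # exposed only to the physician agent (asymmetric visibility),
        # which keeps the simulator's behavior stable.
        physician_view = {"dialogue": list(snippet)}
    else:
        # Passive Interaction Mode (~75% of samples): the physician agent
        # starts from scratch and must elicit information by asking.
        physician_view = {"dialogue": []}
    return simulator_state, physician_view
```

Because the snippet never enters `simulator_state`, the patient simulator cannot be destabilized by mid-consultation interruptions it was not trained to handle.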

Verification System
Baichuan‑M3 employs a hybrid verification stack: (i) a Rubric Verifier that evaluates structural compliance with clinical guidelines using an LLM‑based judge, and (ii) a Fact‑Aware Verifier that extracts atomic claims from model outputs, then validates each claim via a search‑augmented agent against authoritative medical sources. Claim extraction is performed by a distilled 8‑B model trained to mimic GPT‑5’s extraction quality while remaining fast enough for online reinforcement learning. Fact verification labels each claim as Supported, Refuted, or Uncertain. To keep latency low, a two‑level caching system is introduced: Level‑1 exact‑match caching with Redis and Level‑2 semantic‑match caching using vector embeddings, achieving roughly 80 % cache hit rate and reducing external search traffic by ~85 %.
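The two-level cache lookup can be sketched in a few lines. This is a minimal sketch under stated assumptions: the paper uses Redis for Level 1 (a dict stands in here), and the embedding function, linear scan, and 0.9 similarity threshold are illustrative choices, not details from the paper.

```python
import hashlib
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class ClaimCache:
    """Two-level cache for claim verdicts (Supported/Refuted/Uncertain)."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed        # claim text -> embedding vector
        self.exact = {}           # Level 1: exact-match (Redis in the paper)
        self.semantic = []        # Level 2: (vector, verdict) pairs
        self.threshold = threshold

    def _key(self, claim):
        return hashlib.sha256(claim.encode()).hexdigest()

    def get(self, claim):
        # Level 1: exact match on the normalized claim text.
        verdict = self.exact.get(self._key(claim))
        if verdict is not None:
            return verdict
        # Level 2: semantic match against cached claim embeddings.
        vec = self.embed(claim)
        for cached_vec, cached_verdict in self.semantic:
            if cosine(vec, cached_vec) >= self.threshold:
                return cached_verdict
        return None  # miss: fall through to the external search agent

    def put(self, claim, verdict):
        self.exact[self._key(claim)] = verdict
        self.semantic.append((self.embed(claim), verdict))
```

Only cache misses trigger the search-augmented verifier, which is how the reported ~85% reduction in external search traffic would be realized.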

Training Pipeline
The learning process consists of three progressive stages:

  1. Task‑RL – Independent reinforcement‑learning pipelines are launched for each specialized capability (clinical inquiry, general reasoning, healthcare consultation). Each pipeline produces a “teacher” model that excels in its domain.

  2. Offline Policy Distillation – Teacher policies are compressed into a single “student” model using reverse KL divergence, preserving the distinct inductive biases while avoiding gradient interference typical of multi‑task mixture training.

  3. Multi‑Teacher Online Policy Distillation (MOPD) – The student model is further refined online with the patient simulator and verification system. Segmented Pipeline RL decomposes a full consultation into stages (information gathering, lab testing, diagnosis), assigning separate reward signals to each stage. Dynamic Rubric Evolution gradually tightens rubric criteria, encouraging the model to prioritize evidence‑based reasoning over superficial fluency.
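The distillation objective in stages 2 and 3 rests on the reverse KL divergence, D_KL(q_student ‖ p_teacher) = Σᵢ qᵢ log(qᵢ/pᵢ), which is mode-seeking: the student is penalized for placing mass where a teacher places little, rather than being forced to cover everything the teacher does. A minimal sketch over one next-token distribution, with illustrative function names and a weighted multi-teacher combination that is an assumption, not a detail from the paper:

```python
import math

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """Reverse KL divergence D_KL(q || p) = sum_i q_i * log(q_i / p_i).

    Terms with q_i == 0 contribute nothing; eps guards against log(0)
    when the teacher assigns (near-)zero probability.
    """
    return sum(
        q * math.log(q / max(p, eps))
        for q, p in zip(student_probs, teacher_probs)
        if q > 0.0
    )

def multi_teacher_loss(student_probs, teacher_dists, weights):
    """Weighted sum of reverse-KL terms, one per teacher policy."""
    return sum(
        w * reverse_kl(student_probs, teacher)
        for teacher, w in zip(teacher_dists, weights)
    )
```

Because each teacher contributes its own reverse-KL term, the student can inherit the distinct inductive biases of the specialized teachers without the gradient interference of naive multi-task mixture training.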

Empirical Evaluation
Baichuan‑M3 is evaluated on three benchmarks:

  • HealthBench‑Hard – Achieves 44.4 % accuracy, surpassing GPT‑5.2 by ~4 percentage points.
  • ScanBench (an OSCE‑style multi‑step benchmark) – Scores 74.9 (clinical inquiry), 72.1 (laboratory testing), and 74.4 (diagnosis), exceeding both GPT‑5.2‑High and expert baselines.
  • HealthBench‑Hallu (hallucination suppression) – In tool‑free factuality tests, the Fact‑Aware verification pipeline yields markedly lower hallucination rates than prior models.

The results demonstrate that Baichuan‑M3 not only improves factual reliability but also exhibits genuine agency in eliciting missing data and constructing coherent diagnostic narratives.

Contributions and Significance

  1. Integration of Inquiry and Reasoning – By modeling the full clinical workflow as a single policy, Baichuan‑M3 eliminates the “inquiry inertia” of prior models and delivers active, evidence‑driven decision support.
  2. Clinical‑Process‑Aligned Optimization – Segmented Pipeline RL and dynamic rubric evolution align training objectives with real‑world medical procedures, mitigating credit‑assignment problems and reward saturation.
  3. Fact‑Aware Hallucination Control – The dual‑stream verification system, combined with efficient claim caching, provides a scalable, quantifiable method for suppressing hallucinations during both training and inference.

In summary, Baichuan‑M3 represents a substantial step toward trustworthy, interactive AI assistants capable of supporting clinicians throughout the entire diagnostic process. While further validation in real clinical settings and regulatory compliance work remain, the paper’s methodology—particularly the staged training, sophisticated simulation, and fact‑aware verification—offers a blueprint for future medical LLM development.

