Traceable Cross-Source RAG for Chinese Tibetan Medicine Question Answering
Retrieval-augmented generation (RAG) promises grounded question answering, yet domain settings with multiple heterogeneous knowledge bases (KBs) remain challenging. In Chinese Tibetan medicine, encyclopedia entries are often dense and easy to match, which can dominate retrieval even when classics or clinical papers provide more authoritative evidence. We study a practical setting with three KBs (encyclopedia, classics, and clinical papers) and a 500-query benchmark (cutoff K=5) covering both single-KB and cross-KB questions. We propose two complementary methods to improve traceability, reduce hallucinations, and enable cross-KB verification. First, DAKS performs KB routing and budgeted retrieval to mitigate density-driven bias and to prioritize authoritative sources when appropriate. Second, we use an alignment graph to guide evidence fusion and coverage-aware packing, improving cross-KB evidence coverage without relying on naive concatenation. All answers are generated by a lightweight generator, openPangu-Embedded-7B. Experiments show consistent gains in routing quality and cross-KB evidence coverage, with the full system achieving the best CrossEv@5 while maintaining strong faithfulness and citation correctness.
💡 Research Summary
This paper tackles the problem of building a traceable, low‑hallucination Retrieval‑Augmented Generation (RAG) system for Chinese Tibetan Medicine (TM) where knowledge is split across three heterogeneous knowledge bases (KBs): encyclopedia entries (E), classical canons (T), and modern clinical papers (P). In such a setting, dense encyclopedia passages are easy for neural retrievers to match, causing a “density‑driven bias” that often leads the system to cite encyclopedic material even when more authoritative evidence resides in the classics or clinical literature. The authors therefore propose a two‑stage framework that explicitly addresses source selection bias and cross‑source evidence fusion.
Stage 1 – DAKS Routing with Budgeted Retrieval
DAKS (Dynamic Authority‑aware Knowledge‑source Selection) first runs a lightweight probe retrieval in each KB, collecting the top‑L relevance scores. From these scores it extracts five KB‑level features: the peak score, the mean of the top‑M scores, the margin between the top‑1 and top‑M scores, the entropy (concentration) of the score distribution, and a lightweight coverage proxy (e.g., number of distinct documents). These features are linearly combined with learned weights w and an authority prior λ·a_k (where a_k reflects domain knowledge that P and T are generally more authoritative than E). The resulting KB score S_k(q) is turned into a soft allocation probability p_k via a softmax. Given a total retrieval budget B and a minimum per‑KB budget b_min, the algorithm allocates b_k = b_min + round((B−|K|·b_min)·p_k) retrieval slots to each KB. This budgeted retrieval mitigates the dominance of dense encyclopedia passages while guaranteeing that every KB receives at least a minimal number of candidates. The routing decisions (primary‑source and top‑2 KB ranking) are evaluated separately and show substantial improvements over a flat, un‑routed baseline.
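The budget-allocation step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the KB routing scores S_k(q) are assumed to be precomputed (the feature weights w and authority prior λ·a_k are folded into them), and ties in rounding are left unhandled, so the allocated budgets may drift from B by a slot or two.

```python
import math

def allocate_budgets(kb_scores, B, b_min):
    """Soft-allocate a total retrieval budget B across KBs.

    kb_scores: dict mapping KB id -> routing score S_k(q)
    Returns a dict mapping KB id -> per-KB budget b_k >= b_min.
    """
    # Softmax over KB scores gives allocation probabilities p_k
    # (subtract the max score for numerical stability).
    m = max(kb_scores.values())
    exps = {k: math.exp(s - m) for k, s in kb_scores.items()}
    z = sum(exps.values())
    p = {k: v / z for k, v in exps.items()}

    # Reserve |K| * b_min slots up front, then distribute the
    # remainder proportionally: b_k = b_min + round(remainder * p_k).
    remainder = B - len(kb_scores) * b_min
    return {k: b_min + round(remainder * p[k]) for k in kb_scores}
```

For example, with scores E=2.0, T=1.0, P=1.0, B=10, and b_min=2, the encyclopedia KB gets 4 slots and the other two get 3 each, so every KB keeps at least its minimum while the highest-scoring KB still dominates.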
Stage 2 – Alignment Graph‑Guided Fusion and Coverage‑Aware Packing
After the budgeted retrieval, the candidate passages are consolidated: structural expansion adds neighboring chunks from the same document, duplicate removal enforces diversity, and a base relevance score s_base(q,c) blends the original retriever score with an optional reranker score. The core of Stage 2 is an alignment graph G = (V_C ∪ V_E, E_G) where V_C are chunk nodes (each tagged with its KB) and V_E are typed entity nodes (disease, symptom, drug, formula, etc.). An edge (c, e) exists if entity e appears in chunk c. Query entities E(q) are extracted, and a seed entity set E_seed(q) is formed by union of E(q) and the entities from the top‑ranked chunks. A breadth‑first walk up to h hops from E_seed(q) collects additional “bridge” chunks R_graph(q) that connect different KBs through shared entities.
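The graph walk can be illustrated with a small sketch over the bipartite alignment graph, here represented simply as a mapping from chunk ids to their entity sets (the (c, e) edges). This is an assumed data layout for illustration; one hop is taken as entity → entity via a shared chunk, and every chunk touched along the way is collected as a candidate bridge chunk.

```python
def graph_walk(chunk_entities, seed_entities, h):
    """Collect bridge chunks within h hops of the seed entity set.

    chunk_entities: dict chunk_id -> set of entity ids (the (c, e) edges)
    seed_entities:  the seed set E_seed(q)
    h:              maximum number of hops
    """
    # Invert the edge list: entity -> chunks that mention it.
    entity_chunks = {}
    for c, ents in chunk_entities.items():
        for e in ents:
            entity_chunks.setdefault(e, set()).add(c)

    visited = set(seed_entities)
    frontier = set(seed_entities)
    bridge_chunks = set()
    for _ in range(h):
        next_frontier = set()
        for e in frontier:
            for c in entity_chunks.get(e, ()):
                bridge_chunks.add(c)               # chunk reached via e
                next_frontier |= chunk_entities[c] - visited
        visited |= next_frontier
        frontier = next_frontier
    return bridge_chunks
```

With chunks from different KBs sharing entities (e.g., a formula named in both a canon passage and a clinical paper), successive hops pull in chunks from KBs the initial retrieval may have under-sampled, which is what makes them "bridges".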
For each candidate chunk c (from the consolidated list C(q) or R_graph(q)) two graph‑derived signals are computed: (1) overlap o(q,c) = |E(q) ∩ E(c)| and (2) graph distance d(q,c) = min_{e∈E(c), e’∈E_seed(q)} dist_G(e, e’). These feed into a graph support score
s_g(q,c) = η₁·log(1+o) + η₂·1/(1+d) + η₃·I
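A direct transcription of this score might look as follows. The excerpt does not define the indicator I, so the sketch below assumes it flags chunks reached via the graph walk (a hypothetical reading); the weights η₁, η₂, η₃ are placeholder values, and the distance d is taken as precomputed, with math.inf for disconnected pairs.

```python
import math

def graph_support(query_entities, chunk_entities, dist, is_bridge,
                  eta1=1.0, eta2=1.0, eta3=0.5):
    """Graph support score s_g(q, c) = η₁·log(1+o) + η₂/(1+d) + η₃·I.

    query_entities: E(q) as a set
    chunk_entities: E(c) as a set
    dist:           graph distance d(q, c); math.inf if disconnected
    is_bridge:      assumed meaning of the indicator I (not specified
                    in this excerpt): chunk was reached by the walk
    """
    o = len(query_entities & chunk_entities)      # overlap o(q, c)
    return (eta1 * math.log(1 + o)
            + eta2 * 1.0 / (1.0 + dist)
            + eta3 * (1.0 if is_bridge else 0.0))
```

Note the shape of the two graph signals: the log dampens large entity overlaps, and the 1/(1+d) term decays smoothly with distance while remaining finite (it is 0 for disconnected chunks).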