Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions
While large language models (LLMs) perform strongly on diverse tasks, their trustworthiness is limited by erratic behavior that is unfaithful to their internal knowledge. In particular, LLMs often fail on multiple-choice questions (MCQs) even if they encode correct answers in their hidden representations, revealing a misalignment between internal knowledge and output behavior. We investigate and mitigate this knowledge-prediction gap on MCQs through a three-step analysis of hidden representations. First, we quantify the prevalence and magnitude of the gap across models and datasets. Second, we provide a geometric interpretation by identifying distinct knowledge and prediction subspaces in the residual stream. Third, we introduce KAPPA, a lightweight inference-time intervention that aligns the two subspaces within the residual stream to reduce the knowledge-prediction gap. Our results provide a geometric and interpretable explanation of the knowledge-prediction gap in LLMs. Furthermore, KAPPA effectively reduces the gap across diverse MCQ benchmarks and models, and generalizes to free-form settings.
💡 Research Summary
The paper investigates a pervasive phenomenon in large language models (LLMs): the models often answer multiple‑choice questions (MCQs) incorrectly even though their hidden states contain linearly accessible information about the correct answer. The authors name this discrepancy the “knowledge‑prediction gap.” To study it, they extract the residual‑stream representation hₗ(x) at each layer l for a given MCQ input x and train two independent linear k‑class probes: a knowledge probe that predicts the ground‑truth label y, and a prediction probe that predicts the option the model actually chose (ỹ). Because the two probes share the same input representation but have distinct weight matrices, each defines a low‑dimensional subspace of the model’s representation space: the knowledge subspace and the prediction subspace, respectively.
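The probing setup can be sketched in a few lines. This is a minimal illustration with synthetic data, and it uses a least‑squares linear probe on one‑hot targets as a simplification of the paper’s k‑class probes; all array shapes and names here are assumptions, not the authors’ code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 512, 64, 4             # examples, hidden size, MCQ options (assumed)

H = rng.normal(size=(n, d))      # stand-in for residual-stream states h_l(x)
y_true = rng.integers(0, k, n)   # ground-truth option labels y
y_pred = rng.integers(0, k, n)   # option the model actually chose

def fit_linear_probe(H, labels, k):
    """Least-squares linear probe (a simplification of a trained k-class
    probe): regress one-hot option labels on hidden states."""
    Y = np.eye(k)[labels]                       # (n, k) one-hot targets
    W, *_ = np.linalg.lstsq(H, Y, rcond=None)   # (d, k) weights
    return W.T                                  # (k, d): one direction per option

W_K = fit_linear_probe(H, y_true, k)   # knowledge-probe weights
W_P = fit_linear_probe(H, y_pred, k)   # prediction-probe weights
print(W_K.shape, W_P.shape)            # (4, 64) (4, 64)
```

The rows of each weight matrix span the corresponding probe’s subspace, which is what the geometric analysis below operates on.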
Quantitatively, the authors introduce two complementary metrics. Prediction agreement (A_GR) measures whether the arg‑max of the knowledge‑probe distribution matches that of the model’s output distribution; higher A_GR indicates better alignment. Distributional divergence (KLD) computes the KL‑divergence KL(p_M‖p_K) between the model’s output distribution p_M and the knowledge‑probe distribution p_K; lower KLD signals tighter alignment. Across a wide suite of benchmarks (BBH, MMLU, ARC‑Challenge, GSM8K, PubMedQA, TruthfulQA, BBQ) and two open‑weight models (LLaMA‑3.1 8B Instruct and Qwen‑2.5 7B Instruct), knowledge‑probe accuracy consistently exceeds the models’ own generation accuracy, with up to a +21.3 % absolute gain on TruthfulQA. The gap is especially pronounced on reasoning, truthfulness, and bias tasks, while it is modest on pure knowledge tasks such as MMLU.
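The two metrics follow directly from their definitions. A minimal sketch, assuming both distributions are given as (n, k) arrays of option probabilities:

```python
import numpy as np

def agreement(p_model, p_knowledge):
    """A_GR: fraction of examples where the knowledge probe's arg-max
    matches the model's own arg-max choice."""
    return float(np.mean(p_model.argmax(1) == p_knowledge.argmax(1)))

def kld(p_model, p_knowledge, eps=1e-12):
    """Mean KL(p_M || p_K) between the model's output distribution and
    the knowledge probe's distribution over options."""
    p = np.clip(p_model, eps, 1.0)
    q = np.clip(p_knowledge, eps, 1.0)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))

# Toy check: identical distributions give perfect agreement and zero KLD.
p_M = np.array([[0.7, 0.1, 0.1, 0.1],
                [0.25, 0.25, 0.25, 0.25]])
p_K = p_M.copy()
print(agreement(p_M, p_K))  # 1.0
print(kld(p_M, p_K))        # 0.0
```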
Geometrically, the authors interpret the two probes’ weight vectors as bases of the knowledge and prediction subspaces. When the gap is large, these subspaces are misaligned: the projection of a hidden state onto the knowledge subspace points toward the correct answer, whereas its projection onto the prediction subspace points elsewhere. This misalignment explains why the model fails to translate its internal knowledge into the final logits.
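One illustrative way to quantify this misalignment (not necessarily the paper’s exact measure) is via principal angles between the two probes’ subspaces: angles near 0 mean the subspaces coincide, angles near π/2 mean they are orthogonal.

```python
import numpy as np

def principal_angles(W_a, W_b):
    """Principal angles (radians) between the subspaces spanned by the
    rows of two probe weight matrices."""
    Qa, _ = np.linalg.qr(W_a.T)   # orthonormal basis for row-space of W_a
    Qb, _ = np.linalg.qr(W_b.T)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 64))          # hypothetical (k, d) probe weights
angles_same = principal_angles(W, W)  # identical subspaces: all angles ~0
print(np.allclose(angles_same, 0.0, atol=1e-6))  # True
```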
To close the gap, the paper proposes KAPPA (Knowledge‑Aligned Prediction through Projection‑based Adjustment), an inference‑time affine transformation applied to the residual stream. For each input, KAPPA computes the orthogonal projections of the hidden state h onto both subspaces, takes their difference, scales it by a small scalar α, and adds the result back to h:
T(h) = h + α·(proj_K(h) – proj_P(h)).
This operation nudges the representation toward the knowledge subspace while minimally perturbing the original activation. Importantly, KAPPA requires no additional fine‑tuning, training, or parameter changes; it is a plug‑and‑play post‑hoc correction.
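The transformation T(h) above is straightforward to sketch. The formula is from the summary; building each projector from the probe’s row‑space via QR is an assumption about how proj_K and proj_P are realized, and α is an illustrative value.

```python
import numpy as np

def orth_projector(W):
    """Orthogonal projector (d, d) onto the row-space of a (k, d) probe
    weight matrix -- an assumed realization of proj_K / proj_P."""
    Q, _ = np.linalg.qr(W.T)
    return Q @ Q.T

def kappa(h, W_K, W_P, alpha=0.5):
    """T(h) = h + alpha * (proj_K(h) - proj_P(h)): nudge the hidden state
    toward the knowledge subspace and away from the prediction subspace."""
    P_K = orth_projector(W_K)
    P_P = orth_projector(W_P)
    return h + alpha * (P_K @ h - P_P @ h)

rng = np.random.default_rng(0)
d, k = 64, 4                        # assumed hidden size and option count
W_K = rng.normal(size=(k, d))       # hypothetical knowledge-probe weights
W_P = rng.normal(size=(k, d))       # hypothetical prediction-probe weights
h = rng.normal(size=d)

h_new = kappa(h, W_K, W_P)
print(h_new.shape)                               # (64,)
print(np.allclose(kappa(h, W_K, W_P, 0.0), h))   # True: alpha=0 is a no-op
```

Because the correction lives entirely in the two low‑dimensional subspaces, a small α leaves the rest of the representation untouched, which matches the paper’s claim of minimal perturbation.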
Empirical results show that KAPPA consistently raises A_GR by 8–12 % and reduces KLD by 15–25 % compared with baseline models and with prior inference‑time steering methods (e.g., token‑level safety steering, decoding‑based logit adjustments). Overall accuracy improves by 2–4 % absolute, with the biggest gains on TruthfulQA and BBQ. The method also generalizes to free‑form generation settings, where aligning the hidden state with the knowledge subspace yields better factuality. Further analysis reveals that knowledge and prediction subspaces partially transfer across tasks that share similar skills (e.g., reasoning or factual verification), suggesting that KAPPA’s alignment is not task‑specific but exploits a more universal structure in LLM representations.
In summary, the work provides a rigorous quantification of the knowledge‑prediction gap in MCQs, offers a clear geometric explanation based on subspace misalignment, and introduces a lightweight, inference‑time intervention that effectively bridges the gap. By demonstrating that LLM failures often stem from a representational mis‑alignment rather than missing knowledge, the paper opens a promising direction for improving model reliability and trustworthiness through representation‑level calibration.