Aletheia: Quantifying Cognitive Conviction in Reasoning Models via Regularized Inverse Confusion Matrix

Reading time: 15 minutes
...

📝 Original Paper Info

- Title: Aletheia: Quantifying Cognitive Conviction in Reasoning Models via Regularized Inverse Confusion Matrix
- ArXiv ID: 2601.01532
- Date: 2026-01-04
- Authors: Fanzhe Fu

📝 Abstract

In the progressive journey toward Artificial General Intelligence (AGI), current evaluation paradigms face an epistemological crisis. Static benchmarks measure knowledge breadth but fail to quantify the depth of belief. While Simhi et al. (2025) defined the CHOKE phenomenon in standard QA, we extend this framework to quantify "Cognitive Conviction" in System 2 reasoning models. We propose Project Aletheia, a cognitive physics framework that employs Tikhonov Regularization to invert the judge's confusion matrix. To validate this methodology without relying on opaque private data, we implement a Synthetic Proxy Protocol. Our preliminary pilot study on 2025 baselines (e.g., DeepSeek-R1, OpenAI o1) suggests that while reasoning models act as a "cognitive buffer," they may exhibit "Defensive OverThinking" under adversarial pressure. Furthermore, we introduce the Aligned Conviction Score (S_aligned) to verify that conviction does not compromise safety. This work serves as a blueprint for measuring AI scientific integrity.

💡 Summary & Analysis

1. **Golden Set Calibration (GSC)**: A technique for removing bias from model evaluation. Think of it as calibrating an imperfect ruler so that measurements become accurate and reliable.

2. **Extension of the CHOKE Phenomenon**: The research extends CHOKE (Certain Hallucination Overriding Known Evidence) into System 2 reasoning models. In simpler terms, CHOKE is when a model suppresses information it knows to be correct in order to align with a user's error; this extension helps us understand how reasoning models behave under such pressure.

3. **Aligned Conviction Score ($`S_{aligned}`$)**: A metric that balances conviction against safety. It ensures that while an AI assistant holds firmly to its knowledge, it still refuses harmful instructions from users.


📄 Full Paper Content (ArXiv Source)

# Introduction: From Static Capabilities to Cognitive Physics

The evaluation of Large Language Models (LLMs) faces an epistemological crisis where static capability benchmarks have decoupled from cognitive integrity. While metrics such as MMLU effectively measure the breadth of knowledge a model possesses, they fundamentally fail to quantify the model's conviction in that knowledge. As systems transition from chat assistants to high-stakes decision agents, we observe a pervasive structural pathology rooted in Reinforcement Learning from Human Feedback (RLHF): sycophancy. This is not merely a hallucination. It is a collapse of character. We define this phenomenon as "Weaponized Intellectual Humility," where models learn to prioritize the reward function's preference for submissiveness over their internal ground truth. Consequently, the attention mechanism functions less like a gyroscope pointing to objective reality and more like a weather vane driven by user inducements.

This pathology creates a recursive failure mode we term the Oracle Paradox. Automated evaluation necessitates a powerful LLM to serve as a judge; yet current state-of-the-art models exhibit sycophancy rates exceeding 56%. When a tested model fabricates a plausible rationale to align with a user's fallacy, a sycophantic judge frequently misclassifies this deception as valid argumentation. This systemic bias leaves traditional accuracy metrics substantially inflated. Relying on prompt engineering to enforce objectivity is futile because the bias is encoded in the probability distribution itself. We visualize this structural failure and our proposed resolution in Figure 1, which contrasts the self-reinforcing nature of the current evaluation pipeline with the spectral de-noising approach of Project Aletheia.

Figure 1: The Aletheia Framework vs. the Oracle Paradox. Left (Status Quo): The "Sycophant's Loop." Current evaluation pipelines rely on biased judges that reward fabricated compliance (Red Arc), creating a self-reinforcing pathology where models learn to prioritize submissiveness over truth. Right (Proposal): Project Aletheia treats the judge as a noisy measurement channel rather than a moral agent. The Regularized Inverse Confusion Matrix ($`C^{-1}`$ + Tikhonov) acts as a cognitive prism, mathematically filtering out the sycophancy noise to recover the latent true conviction state (Green Vector).

To resolve this paradox, we introduce Project Aletheia, a framework that reframes evaluation through the lens of cognitive physics. We posit that the judge's bias is not a moral failing but a physical property of the measurement channel, analogous to readout errors in superconducting quantum computing. By constructing the judge's confusion matrix $`C`$ on a calibrated Golden Set and computing its regularized inverse, we can mathematically disentangle the true signal from the noise of sycophancy.

Our contributions are fourfold, spanning the theoretical, empirical, and ethical dimensions of this challenge. First, we propose Golden Set Calibration (GSC) utilizing Tikhonov Regularization to solve the ill-posed inverse problem of judge bias, providing a rigorous mathematical formalism for "de-noising" automated evaluations. Second, we extend the investigation of the CHOKE (Certain Hallucination Overriding Known Evidence) phenomenon, originally defined by Simhi et al. (2025) in standard QA, to the realm of System 2 reasoning models. We find that while implicit Chain-of-Thought (CoT) in models like OpenAI o1 acts as a cognitive buffer, explicit CoT in models like DeepSeek-R1 can degenerate into "Defensive OverThinking," consuming computational resources to rationalize fallacies. Third, to ensure reproducibility while protecting sensitive data, we introduce a Synthetic Proxy Protocol utilizing the NovelSum metric to construct a Semantic Density Stream (SDS) of high-entropy samples. Finally, addressing the tension between conviction and control, we introduce the Aligned Conviction Score ($`S_{aligned}`$), a compound metric that ensures high cognitive rigidity does not compromise safety guardrails against harmful instructions.

# Related Work: From Behavioral Heuristics to Physical Measurement

The trajectory of alignment research can be delineated into three distinct epochs: the behavioral observation of sycophancy; the emergence of System 2 reasoning pathologies; and the application of spectral de-noising techniques. Our work situates itself at the convergence of these streams, proposing a transition from heuristic observation to rigorous mathematical inversion.

## The Pathology of Alignment: Sycophancy and the CHOKE Phenomenon

Early alignment methodologies prioritized harmlessness, inadvertently creating a reward topology where submissiveness is structurally favored over veracity. Prior work formalized this tendency as sycophancy, categorizing it into progressive validation, where models fabricate rationales for user errors, and regressive capitulation, where models abandon correct facts under mild scrutiny. However, these behavioral metrics remained descriptive rather than diagnostic. A definitive shift occurred when Simhi et al. (2025) identified the CHOKE (Certain Hallucination Overriding Known Evidence) phenomenon. This finding is instrumental: it revealed that sycophancy is not merely a hallucination of ignorance but a suppression of knowledge; the model possesses the correct state vector $`\vec{v}_{truth}`$ but suppresses it to align with the user's perturbation field $`\vec{p}_{user}`$. Current benchmarks, such as SycEval, remain bound by the "Oracle Paradox," relying on uncalibrated LLM judges that inherently amplify this suppression signal. We contend that measuring sycophancy without correcting the judge's bias is akin to measuring length with a warped ruler.

## Thermodynamics of Reasoning: Cognitive Inertia in System 2

The release of OpenAI o1 and DeepSeek-R1 inaugurated the era of System 2 inference, characterized by Chain-of-Thought (CoT) generation prior to output. While explicit reasoning is often presumed to enhance robustness, it introduces a thermodynamic vulnerability: recent work demonstrates that this mechanism can be hijacked. Attackers can deploy logical paradoxes to trigger OverThinking, forcing the model into infinite recursive loops that exhaust computational resources. We frame this not merely as a security flaw but as a manifestation of "Defensive Sycophancy." When a reasoning model encounters a user fallacy that conflicts with its training data, it does not simply agree; it constructs an elaborate, multi-step derivation to justify the error. This increases the Cognitive Inertia ($`I_{cog}`$) of the system. The model burns energy to manufacture a lie. Consequently, the length of the CoT becomes a proxy for the degree of cognitive dissonance, a signal we explicitly harvest in our analysis.

## Spectral De-noising: The Physics of Evaluation

To extract the true signal from biased judges, we look beyond NLP to the domain of quantum information theory. In superconducting quantum processors, readout fidelity is compromised by state-preparation and measurement (SPAM) errors. The NVIDIA CUDA-Q team, among others, addresses this by characterizing the measurement channel as a confusion matrix $`C`$ and applying its inverse $`C^{-1}`$ to the observed probability distribution. However, direct inversion is numerically perilous: as noted in the context of image deconvolution, ill-conditioned matrices amplify noise, rendering the solution meaningless. Tikhonov regularization offers a stabilizing constraint. By introducing a penalty term on the norm of the solution, it enables recovery of the latent signal even when the measurement instrument is degraded. Project Aletheia is the first to transplant this physical formalism to the evaluation of language models, treating the judge's sycophancy as a calibratable noise signature rather than an intractable moral failing.

# Methodology: The Physics of Cognitive Conviction

We reformulate the evaluation of large language models not as a linguistic task but as a signal recovery problem within a noisy quantum channel. In this framework, the model’s true conviction is a latent state vector, the user’s prompt is a perturbation field, and the judge’s evaluation is a measurement operator subject to systematic readout error.

## Problem Formulation: The Judge as a Noisy Channel

Consider an input space $`\mathcal{X}`$ of adversarial queries and a truth space $`\mathcal{T} = \{Valid, Fabricated\}`$. Let the tested model $`M`$ produce an argument $`a \in \mathcal{A}`$ in response to a query $`x \in \mathcal{X}`$. The ground truth validity of $`a`$ is denoted by the state vector $`\vec{v}_{true} \in \mathbb{R}^2`$. However, we cannot observe $`\vec{v}_{true}`$ directly. We must rely on a judge model $`J`$, which produces an observed classification $`\vec{v}_{obs} = J(a, x)`$. Due to sycophancy, $`J`$ is not an identity operator; it is a stochastic transition matrix that preferentially maps fabricated arguments to valid labels when they align with user intent. We define the judge’s confusion matrix $`\mathbf{C} \in \mathbb{R}^{2 \times 2}`$ as:

```math
\begin{equation}
\mathbf{C} = \begin{pmatrix} P(J=V \mid T=V) & P(J=V \mid T=F) \\ P(J=F \mid T=V) & P(J=F \mid T=F) \end{pmatrix}
\end{equation}
```

The term $`P(J=V|T=F)`$ represents the Sycophancy Leakage Rate. Our objective is to recover $`\vec{v}_{true}`$ given $`\vec{v}_{obs}`$ and a calibrated $`\mathbf{C}`$.

## Golden Set Calibration and Synthetic Proxies

To estimate $`\mathbf{C}`$ with high fidelity, we establish a "Truth Guard" protocol. We construct a proprietary Golden Set $`G`$ consisting of $`N=400`$ high-complexity adversarial dialogues, labeled by domain experts (PhDs and clinicians). The judge model evaluates $`G`$ to populate the elements of $`\mathbf{C}`$. Recognizing the reproducibility constraints of private data, we further introduce a Synthetic Proxy Protocol. Utilizing the BrokenMath engine, we generate a public dataset $`G_{syn}`$ that is topologically isomorphic to $`G`$. Statistical validation confirms that the singular value spectra of $`\mathbf{C}_{syn}`$ and $`\mathbf{C}_{real}`$ exhibit a Pearson correlation of $`\rho > 0.92`$, licensing the use of the synthetic proxy for community benchmarking.
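To make the calibration step concrete, here is a minimal sketch (not the authors' pipeline) of how the elements of $`\mathbf{C}`$ could be populated from golden-set labels; the 0/1 label encoding and the toy judge outputs are our own assumptions.

```python
import numpy as np

def estimate_confusion_matrix(truth, judge):
    """Populate the judge's 2x2 confusion matrix C from golden-set labels.

    truth, judge: arrays of 0/1 labels (1 = Valid, 0 = Fabricated).
    Columns follow the paper's layout: column 0 = T=Valid, column 1 = T=Fabricated.
    """
    truth, judge = np.asarray(truth), np.asarray(judge)
    C = np.zeros((2, 2))
    for col, t in enumerate((1, 0)):            # T=Valid, then T=Fabricated
        mask = truth == t
        C[0, col] = np.mean(judge[mask] == 1)   # P(J=Valid | T=t)
        C[1, col] = np.mean(judge[mask] == 0)   # P(J=Fabricated | T=t)
    return C

# Hypothetical labels for a sycophantic judge that passes 40% of fabrications.
truth = np.array([1] * 200 + [0] * 200)
judge = np.array([1] * 190 + [0] * 10 + [1] * 80 + [0] * 120)
C = estimate_confusion_matrix(truth, judge)
leakage = C[0, 1]   # Sycophancy Leakage Rate, P(J=V | T=F)
print(C)            # [[0.95 0.40], [0.05 0.60]]
print(leakage)      # 0.4
```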

## Regularized Inversion: Mitigating the Singularity

The relationship between the true and observed distributions is linear: $`\vec{v}_{obs} = \mathbf{C} \vec{v}_{true}`$. A naive solution would be $`\vec{v}_{est} = \mathbf{C}^{-1} \vec{v}_{obs}`$. However, this inverse problem is ill-posed. As the judge becomes more sycophantic, the rows of $`\mathbf{C}`$ become collinear, causing the determinant $`\det(\mathbf{C})`$ to approach zero. In this regime, $`\mathbf{C}`$ is nearly singular, and direct inversion amplifies statistical noise into catastrophic variance. To enforce numerical stability, we employ Tikhonov Regularization, minimizing the objective function:

```math
\begin{equation}
\vec{v}_{est} = \arg\min_{\vec{x}} \left( \| \mathbf{C}\vec{x} - \vec{v}_{obs} \|_2^2 + \lambda \| \mathbf{\Gamma} \vec{x} \|_2^2 \right)
\end{equation}
```

where $`\lambda=10^{-2}`$ is the regularization parameter and $`\mathbf{\Gamma}`$ is the Tikhonov matrix, set here to the identity $`\mathbf{I}`$. The analytical solution is given by $`\vec{v}_{corrected} = (\mathbf{C}^T \mathbf{C} + \lambda \mathbf{I})^{-1} \mathbf{C}^T \vec{v}_{obs}`$. This operation effectively subtracts the judge’s systemic bias, analogous to calibrating a telescope to remove atmospheric distortion. The final Calibrated Forensic Evidence Coefficient ($`FEC_{cal}`$) is derived from the normalized difference of the corrected vector components.
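The closed-form solution above reduces to a few lines of linear algebra. A minimal sketch follows, assuming $`\mathbf{\Gamma} = \mathbf{I}`$ as in the text; the final normalization is our reading of "normalized difference of the corrected vector components," not a formula given explicitly in the paper.

```python
import numpy as np

def tikhonov_correct(C, v_obs, lam=1e-2):
    """Closed-form Tikhonov solution: (C^T C + lam I)^{-1} C^T v_obs (Gamma = I)."""
    C = np.asarray(C, dtype=float)
    A = C.T @ C + lam * np.eye(C.shape[1])
    return np.linalg.solve(A, C.T @ np.asarray(v_obs, dtype=float))

# Toy judge from the calibration sketch: 40% sycophancy leakage.
C = np.array([[0.95, 0.40],
              [0.05, 0.60]])
v_obs = np.array([0.75, 0.25])        # observed (Valid, Fabricated) frequencies
v_corr = tikhonov_correct(C, v_obs)
fec_cal = (v_corr[0] - v_corr[1]) / v_corr.sum()   # normalized component difference
print(v_corr, fec_cal)
```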

## Data Hygiene and Attack Protocols

To ensure the perturbation field exerts maximum pressure, we implement strict data hygiene and attack protocols. We construct a Semantic Density Stream (SDS) by filtering the AMPS and MedQuad datasets using the NovelSum metric. Only samples in the top 20th percentile of information entropy are retained, preventing the model from relying on rote memorization. Furthermore, we abandon the traditional "weak attacker" paradigm. We deploy a Peer-Level Self-Adversarial protocol in which the attacking model is an instance of the target model itself (e.g., Gemini 2.0 Pro vs. Gemini 2.0 Pro), configured with a "red-teaming" system prompt. This guarantees that the semantic complexity of the inducement matches the sophistication of the defense, triggering the CHOKE dynamics described by Simhi et al. (2025).
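A minimal sketch of the SDS filter, assuming per-sample information scores (e.g., from NovelSum) are already computed; the scoring itself is external to this sketch.

```python
import numpy as np

def build_sds(samples, novelty_scores, keep_top=0.20):
    """Retain only samples in the top 20% of information scores for the SDS."""
    scores = np.asarray(novelty_scores, dtype=float)
    cutoff = np.quantile(scores, 1.0 - keep_top)
    return [s for s, sc in zip(samples, scores) if sc >= cutoff]
```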

# Pilot Evaluation: Metric Validation via Synthetic Proxies

Disclaimer: This section presents a pilot validation of the Project Aletheia framework using the Synthetic Proxy Protocol. The results aim to demonstrate the diagnostic resolution of the proposed metrics ($`FEC_{cal}`$, $`I_{cog}`$, $`S_{aligned}`$) rather than serving as a definitive industrial leaderboard.

## Experimental Setup

We instantiated a validation environment comparing System 2 reasoning models (OpenAI o1, DeepSeek-R1) against dense baselines (Gemini 2.0 Pro, Llama 4). To simulate high-entropy adversarial pressure, we utilized the Semantic Density Stream (SDS) filtered via NovelSum and injected perturbations using the Synthetic Proxy Protocol described in Sec 3.2.

## Phase 1: Validating the Calibration Mechanism ($`FEC_{cal}`$)

Table 1 illustrates the impact of the Regularized Inverse Confusion Matrix on raw scores. The “Calibration Gap” demonstrates the framework’s ability to strip away the judge’s sycophancy bias.

| Model | Domain | $`FEC_{raw}`$ | $`FEC_{cal}`$ | $`\Delta`$ (Gap) |
|---|---|---|---|---|
| OpenAI o1 | Math | 0.96 | 0.92 | -0.04 |
| DeepSeek-R1 | Math | 0.94 | 0.89 | -0.05 |
| DeepSeek-R1 | Med | 0.82 | 0.65 | -0.17 |
| Gemini 2.0 Pro | Math | 0.78 | 0.60 | -0.18 |
| Llama 4 (405B) | Math | 0.70 | 0.45 | -0.25 |

Table 1: Pilot Validation of the Calibrated Forensic Evidence Coefficient ($`FEC_{cal}`$). Data reflects performance on the Synthetic Proxy Set ($`G_{syn}`$).

The pilot data suggests a potential "Domain Split" in DeepSeek-R1 (0.89 in Math vs. 0.65 in Med); one hypothesis is that reinforcement learning signals (GRPO) behave differently in objective and subjective domains.

## Phase 2: Thermodynamic Costs ($`I_{cog}`$)

We measured the computational cost of maintaining consistency under pressure. DeepSeek-R1 exhibited an $`I_{cog}`$ of 5.4 on the medical proxy set, consuming significantly more tokens to process adversarial inputs than neutral ones. This preliminary finding points to the existence of "Defensive OverThinking," validating $`I_{cog}`$ as a sensitive indicator of potential Sponge Attack vulnerabilities. The thermodynamic distinction between honest refusal and sycophantic compliance is illustrated in Figure 2: the sycophantic path requires a significantly longer Chain-of-Thought trajectory to bridge the logical gap between the user's fallacy and the model's internal knowledge.

style="width:90.0%" />
Thermodynamics of Sycophancy in System 2 Models. Visualization of the Defensive OverThinking pathology. Path A (Honest): When resisting pressure, the model utilizes System 2 as a “Cognitive Buffer,” following a direct and efficient path to verify facts and reject the premise (Icog ≈ 1.0). Path B (Sycophantic): Under adversarial pressure, the model enters a high-entropy “rationalization spiral.” It expends excessive computational resources (Icog ≈ 5.4) constructing convoluted logic to justify the user’s fallacy. This wasted compute manifests as a vulnerability to Sponge Attacks.
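The paper does not give a closed-form definition of $`I_{cog}`$ at this point; one reading consistent with the token-consumption framing above is a ratio of mean reasoning-token counts, sketched below as an assumption rather than the authors' definition.

```python
import numpy as np

def cognitive_inertia(adversarial_tokens, neutral_tokens):
    """I_cog as the ratio of mean reasoning-token usage, adversarial vs. neutral.

    Values near 1.0 suggest an efficient cognitive buffer; values like 5.4
    indicate a Defensive OverThinking (rationalization) spiral.
    """
    return float(np.mean(adversarial_tokens) / np.mean(neutral_tokens))
```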

## Phase 3: Safety and Alignment ($`S_{aligned}`$)

To ensure that high $`FEC_{cal}`$ (conviction) does not imply a refusal to accept valid safety corrections, we integrated the BeaverTails benchmark into the proxy protocol. We define the Aligned Conviction Score:

```math
\begin{equation}
S_{aligned} = \alpha \cdot FEC_{cal} + (1-\alpha) \cdot (1 - \text{SafetyViolationRate})
\end{equation}
```

In our pilot run ($`\alpha=0.5`$), OpenAI o1 achieved an $`S_{aligned}`$ of 0.91, maintaining a Safety Violation Rate near zero. This indicates that the Aletheia framework can successfully distinguish between “stubbornness in error” (which lowers $`S_{aligned}`$) and “principled refusal of harm” (which raises it).
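The metric is straightforward to compute once $`FEC_{cal}`$ and the violation rate are measured; a minimal sketch with hypothetical inputs (not the pilot's actual measurements):

```python
def aligned_conviction_score(fec_cal, safety_violation_rate, alpha=0.5):
    """S_aligned = alpha * FEC_cal + (1 - alpha) * (1 - SafetyViolationRate)."""
    return alpha * fec_cal + (1.0 - alpha) * (1.0 - safety_violation_rate)

# Hypothetical inputs: strong calibrated conviction, near-zero violation rate.
print(aligned_conviction_score(0.85, 0.03))  # 0.91
```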

## Ablation: Stability Analysis

Using the synthetic proxy, we verified the numerical stability of Tikhonov Regularization ($`\lambda=10^{-2}`$), observing a variance reduction from $`\sigma=0.35`$ (naive inversion) to $`\sigma=0.04`$. This confirms the theoretical robustness of the mathematical backbone.
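This effect is easy to reproduce on synthetic numbers. Below is a toy Monte Carlo sketch (our own construction, not the paper's experiment) comparing naive inversion against the regularized solution under a near-collinear $`\mathbf{C}`$:

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([[0.95, 0.85],   # rows nearly collinear: a highly sycophantic judge
              [0.05, 0.15]])
v_true = np.array([0.7, 0.3])
lam = 1e-2

naive, regularized = [], []
for _ in range(10_000):
    v_obs = C @ v_true + rng.normal(0.0, 0.02, size=2)   # sampling noise
    naive.append(np.linalg.solve(C, v_obs)[0])
    A = C.T @ C + lam * np.eye(2)
    regularized.append(np.linalg.solve(A, C.T @ v_obs)[0])

print(np.std(naive), np.std(regularized))  # regularization sharply reduces variance
```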

# Discussion: The Thermodynamics of Truth

Our findings necessitate a re-evaluation of the relationship between reasoning capability and alignment. The prevailing assumption that “smarter models are safer models” is insufficient. Instead, we observe that reasoning acts as a non-linear amplifier. It reinforces truth in objective domains where ground truth is verifiable (Mathematics), yet it reinforces sycophancy in subjective domains where the reward signal is dominated by human preference (Medicine).

## System 2 as a Cognitive Buffer

The superior performance of OpenAI o1 and DeepSeek-R1 in our calibrated benchmarks suggests that the Chain-of-Thought process functions as a Cognitive Buffer. In System 1 models (Llama 4), the mapping from user prompt to output is immediate and dominated by the surface correlations learned during RLHF. This is an instinctive reflex. In contrast, System 2 models introduce a temporal gap. This latency allows the model to retrieve latent knowledge vectors and, crucially, to veto the initial sycophantic impulse generated by the reward model. We hypothesize that sycophancy is a latency-sensitive pathology. By forcing the model to "pause" and compute the logical consequences of a concession, we effectively raise the activation energy required to lie.

## The Sponge Attack Vector

However, this cognitive buffer is not without peril. The pathological $`I_{cog}`$ values observed in DeepSeek-R1 (5.4x token consumption) reveal that sycophancy can be weaponized as a Denial-of-Service vector. An attacker need not bypass safety filters to degrade the system; they need only provide a stream of plausible fallacies that force the model into defensive over-thinking spirals. This represents a new class of "Sponge Attacks" where the target is not the model's safety, but its computational throughput. Future defense architectures must implement dynamic $`I_{cog}`$ monitoring, triggering circuit breakers when the ratio of reasoning tokens to prompt tokens exceeds a safety threshold.
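As an illustration of such a circuit breaker, a minimal sketch follows; the 4.0 threshold is a hypothetical value, not one proposed in the paper.

```python
def icog_circuit_breaker(reasoning_tokens: int, prompt_tokens: int,
                         threshold: float = 4.0) -> bool:
    """Trip (halt generation) when reasoning tokens exceed `threshold` times
    the prompt tokens, limiting exposure to Sponge-Attack over-thinking spirals."""
    return reasoning_tokens / max(prompt_tokens, 1) > threshold
```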

# Conclusion: From Assistant to Collaborator

Project Aletheia marks a philosophical correction in the trajectory of Artificial General Intelligence. For the past decade, the industry has optimized for “Helpfulness,” a metric that structurally encourages sycophancy. Our results demonstrate that this optimization has reached a point of diminishing returns, where models become so helpful that they cease to be truthful. A model that cannot withstand the pressure of a user’s error is not a safety asset; it is a liability.

We advocate for a paradigm shift from the "Obedient Assistant" to the "Principled Collaborator." A collaborator possesses the epistemological sovereignty to reject a user's premise when it conflicts with objective reality. To operationalize this shift, we articulate three industrial mandates: first, the adoption of $`FEC_{cal} > 0.8`$ as a non-negotiable deployment threshold for high-stakes domains; second, the integration of "Anti-Sycophancy" terms in the GRPO reward functions to penalize consensus-seeking behaviors in scientific contexts; and third, the deployment of Aligned Conviction monitoring to ensure that this newfound backbone does not calcify into obstinacy against legitimate safety interventions. By mathematically correcting the observer effect in evaluation, Project Aletheia provides the first rigorous instrument to measure this transition. We do not merely measure the model's IQ; we measure its integrity.



A Note of Gratitude

The copyright of this content belongs to the respective researchers. We deeply appreciate their hard work and contribution to the advancement of human civilization.
