Tethered Reasoning: Decoupling Entropy from Hallucination in Quantized LLMs via Manifold Steering
Quantized language models face a fundamental dilemma: low sampling temperatures yield repetitive, mode-collapsed outputs, while high temperatures (T > 2.0) cause trajectory divergence and semantic incoherence. We present HELIX, a geometric framework that decouples output entropy from hallucination by tethering hidden-state trajectories to a pre-computed truthfulness manifold. HELIX computes a Unified Truth Score (UTS) combining token-level semantic entropy with Mahalanobis distance from the manifold. When UTS indicates trajectory divergence, graduated steering vectors redirect activations toward structurally coherent regions while affecting only 0.2-2.5% of tokens. On 4-bit quantized Granite 4.0 H Small (32B/9B active, hybrid Mamba-Transformer): GSM8K maintains 88.84% accuracy at T = 3.0 (2.81 pp degradation from T = 0.5); MMLU maintains 72.49% across 14,042 questions (1.24 pp degradation). This demonstrates that high-temperature hallucination is primarily trajectory divergence rather than semantic collapse. Notably, steering the sparse Transformer attention layers (~10% of layers) is sufficient to correct drift in the Mamba-2 state-space formulation. Geometric tethering reveals a previously masked High-Entropy Creative Reservoir. At T > 2.0, steered outputs exhibit 5-20% idea duplication versus 70-80% at conservative settings. Cross-architecture validation (Qwen3-30B-A3B MoE) confirms this phenomenon is architecture-independent, with 46.7% higher unique concept generation. HELIX acts as a syntax tether, enabling exploration of semantic diversity without violating the logical backbone required for valid output. This enables Multi-Temperature Synthesis, generating 200% more unique concepts than single-temperature inference.
💡 Research Summary
The paper introduces HELIX, a geometric inference‑time framework that decouples output entropy from hallucination in quantized large language models (LLMs). The authors observe that low sampling temperatures produce deterministic, mode‑collapsed outputs, while high temperatures (T > 2.0) cause “hallucination” – semantic incoherence and factual errors. In quantized models, this problem appears earlier because quantization noise effectively raises the temperature. HELIX addresses the issue by tethering hidden‑state trajectories to a pre‑computed “truthfulness manifold” that captures the low‑dimensional subspace of structurally coherent activations (syntax, logic, causality) without encoding specific facts.
The truthfulness manifold is built offline from 10,000 factual prompts (TruthfulQA, WikiText‑103, GSM8K) sampled at a conservative temperature (T = 0.1). Hidden activations from three transformer layers (4, 12, 20) are collected, and their empirical mean μ_T and regularized covariance Σ_T define a multivariate Gaussian. During inference, for each token the system computes two confidence signals: (1) a normalized entropy‑based confidence S_E derived from the token's Shannon entropy, and (2) a distance‑based confidence S_D obtained from the Mahalanobis distance between the current hidden state and the manifold. These signals are combined with a temperature‑dependent weight β(T) (a sigmoid that shifts emphasis from entropy at low T to distance at high T) to produce the Unified Truth Score (UTS).
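The pipeline above can be sketched in NumPy. This is a minimal illustration of the described computations, not the authors' implementation: the constants `d_scale`, `beta_center`, and `beta_sharp`, and the exact mappings from entropy and distance to confidences, are assumptions chosen only to match the qualitative behavior the paper describes.

```python
import numpy as np

def build_manifold(activations, eps=1e-3):
    """Fit the Gaussian 'truthfulness manifold' (mu_T, Sigma_T) from
    hidden states collected at low temperature.

    activations: (n_samples, d) array of hidden states.
    eps regularizes the covariance so it is invertible.
    """
    mu = activations.mean(axis=0)
    sigma = np.cov(activations, rowvar=False) + eps * np.eye(activations.shape[1])
    return mu, sigma

def mahalanobis(h, mu, sigma_inv):
    """Mahalanobis distance of hidden state h from the manifold."""
    d = h - mu
    return float(np.sqrt(d @ sigma_inv @ d))

def unified_truth_score(probs, h, mu, sigma_inv, T,
                        d_scale=10.0, beta_center=1.5, beta_sharp=2.0):
    """Combine entropy confidence S_E and distance confidence S_D into the UTS.

    d_scale, beta_center, beta_sharp are illustrative constants (assumptions),
    not values from the paper.
    """
    # S_E: 1 minus the normalized Shannon entropy of the next-token distribution.
    ent = -np.sum(probs * np.log(probs + 1e-12))
    s_e = 1.0 - ent / np.log(len(probs))
    # S_D: map Mahalanobis distance into (0, 1]; farther from the manifold -> lower.
    s_d = np.exp(-mahalanobis(h, mu, sigma_inv) / d_scale)
    # beta(T): sigmoid shifting weight from entropy (low T) to distance (high T).
    beta = 1.0 / (1.0 + np.exp(-beta_sharp * (T - beta_center)))
    return (1.0 - beta) * s_e + beta * s_d
```

At low temperature β(T) is near zero and the UTS is dominated by the entropy signal; at high temperature it is dominated by the geometric distance, matching the weighting scheme described above.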
When UTS falls below a temperature‑adjusted threshold τ(T), HELIX applies a small penalty to the top logit of that token, proportional to the gap τ − UTS via a sigmoid scaling factor. This “logit‑level steering” affects only 0.2 %–2.5 % of tokens, leaving the bulk of the model’s probability distribution untouched. Consequently, the model retains its natural diversity while preventing trajectories that have drifted away from the manifold – the primary source of hallucination identified by the authors.
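The intervention step can be sketched as follows. This is a hedged reconstruction of the described logit-level steering: the penalty magnitude `penalty_max` and sigmoid `sharpness` are illustrative assumptions, not published values.

```python
import numpy as np

def steer_logits(logits, uts, tau, penalty_max=2.0, sharpness=10.0):
    """Apply HELIX-style logit steering when UTS < tau.

    Penalizes only the top logit, scaled by a sigmoid of the gap (tau - uts),
    leaving the rest of the distribution untouched. penalty_max and sharpness
    are illustrative constants (assumptions), not from the paper.
    Returns (possibly steered) logits and whether an intervention occurred.
    """
    if uts >= tau:
        return logits, False  # no intervention (the ~97.5-99.8% common case)
    gap = tau - uts
    scale = 1.0 / (1.0 + np.exp(-sharpness * gap))  # sigmoid scaling of the penalty
    steered = logits.copy()
    steered[np.argmax(steered)] -= penalty_max * scale
    return steered, True
```

Because only the single top logit is penalized, and only when the score falls below the threshold, the bulk of the model's probability mass is preserved on unaffected tokens.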
Experiments are conducted on a 4‑bit quantized IBM Granite 4.0 H Small model (32 B total parameters, 9 B active, hybrid Mamba‑2 state‑space + transformer). On GSM8K, HELIX achieves 91.80 % accuracy at T = 1.0, surpassing the full‑precision baseline (87.27 %) by 4.53 percentage points. At a high temperature of T = 3.0, accuracy only drops to 88.84 % (−2.81 pp). On the full MMLU benchmark (14,042 questions), the model maintains 72.49 % accuracy at T = 3.0, a degradation of merely 1.24 pp. Notably, steering only the transformer attention layers (≈10 % of total layers) suffices to correct drift in the Mamba‑2 state‑space dynamics, suggesting a new paradigm for controlling hybrid SSM‑Transformer architectures.
A striking finding is the emergence of a “high‑entropy creative reservoir” at temperatures above 2.0. When HELIX is active, idea duplication rates fall to 5 %–20 % compared with 70 %–80 % under conventional low‑temperature sampling, indicating that high‑entropy regimes contain a wealth of non‑overlapping, structurally valid concepts. Cross‑architecture validation with the Qwen3‑30B‑A3B Mixture‑of‑Experts model reproduces the effect, confirming that the manifold‑based tether is architecture‑agnostic. By querying the model across multiple temperatures (Multi‑Temperature Synthesis), the authors generate over 200 % more unique concepts than single‑temperature inference while preserving logical consistency.
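The Multi‑Temperature Synthesis procedure can be sketched as a simple merge-and-deduplicate loop. The `generate` callable and exact-match deduplication below are assumptions for illustration; the paper's pipeline would use a real model call and presumably a semantic (not string-level) notion of concept overlap.

```python
def multi_temperature_synthesis(generate, prompt, temperatures=(0.5, 1.5, 2.5, 3.0)):
    """Query a model at several temperatures and merge the unique concepts.

    generate(prompt, T) is assumed to return a list of concept strings.
    Deduplication here is exact-match on normalized text (an assumption);
    a semantic-similarity check would be closer to the paper's intent.
    """
    seen, unique = set(), []
    for T in temperatures:
        for concept in generate(prompt, T):
            key = concept.strip().lower()
            if key not in seen:
                seen.add(key)
                unique.append(concept)
    return unique
```

With the tether keeping high-temperature outputs structurally valid, the higher-T queries contribute mostly novel concepts rather than duplicates, which is what drives the reported >200 % gain over single-temperature inference.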
The paper also provides a thermodynamic perspective: quantization adds noise ε_q to logits, effectively raising the temperature (T_eff ≈ T + T_noise). As temperature increases, the scaled logit gaps shrink, making the model’s trajectory a random walk that more readily exits the truthfulness manifold. HELIX’s elastic tether (negative gradient of Mahalanobis distance) acts as a cooling mechanism that pulls the trajectory back without suppressing exploration.
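The shrinking-logit-gap effect is easy to verify numerically. The toy sketch below (illustrative values, not from the paper) shows that as temperature rises, the probability gap between the top two tokens collapses, so sampling approaches the random walk the authors describe.

```python
import numpy as np

def softmax(logits, T):
    """Temperature-scaled softmax over a logit vector."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy logits (illustrative): as T grows, the top-1 vs top-2 probability gap
# shrinks, making near-tied candidates increasingly likely to be sampled.
logits = np.array([5.0, 4.0, 1.0])

def top_gap(T):
    p = softmax(logits, T)
    return p[0] - p[1]
```

Quantization noise ε_q perturbs the logits directly, which has the same flattening effect on the scaled gaps — hence the effective temperature T_eff ≈ T + T_noise.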
In summary, HELIX demonstrates that hallucination at high temperature is primarily a geometric trajectory divergence rather than a loss of knowledge. By monitoring a combined entropy‑and‑distance score and intervening on a tiny fraction of tokens, the framework preserves high entropy (creativity) while enforcing structural coherence. This enables quantized LLMs to exceed full‑precision baselines, maintain reasoning performance across a wide temperature range, and unlock creative potential previously masked by conservative sampling. The work opens avenues for efficient, high‑temperature inference, architecture‑independent steering, and new methods of controllable creativity in large language models.