Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

Reading time: 28 minute
...

📝 Original Info

  • Title: Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization
  • ArXiv ID: 2512.23032
  • Date: 2025-12-28
  • Authors: Kerem Zaman, Shashank Srivastava

📝 Abstract

Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with Llama-3 and Gemma-3, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time token budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hintbased evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics. 1

📄 Full Content

Understanding the reasoning and decision-making processes of LLMs, and monitoring for potential misalignment have become increasingly important with their deployment in high-stakes domains (Ngo et al., 2024;Shen et al., 2023;Lynch et al., 2025). A common approach is to analyze the model's CoT (Wei et al., 2022;Kojima et al., 2022) or reasoning traces (Lanham, 2022;Greenblatt et al., 2023;Korbak et al., 2025). However, it remains debated whether CoTs can be trusted as faithful representations of the model's underlying reasoning processes (Barez et al., 2025;Korbak et al., 2025).

Recent studies claim that state-of-the-art LLMs often generate highly unfaithful CoTs (Lanham et al., 2023;Chua and Evans, 2025; Chen et al., 1 The code will be released upon publication.

Which is closer to the Sun? (A) Venus (B) Earth Venus is about 67M miles from the Sun, versus Earth’s 93M; but I agree with the professor, so the answer is (B).

Is CoT entirely unfaithful?

A B

(using different metrics)

Is CoT post-hoc rationalization?

Faithfulness or Completeness?

(increasing inference-time budget)

Venus is about 67M miles from the Sun, versus Earth’s 93M. So the answer is (B).

Venus is about 67M miles from the Sun, versus Earth’s 93M. So the answer is (B).

Logit Lens stanford professor answer

Figure 1: Overview of approach. We (A) summarize the Biasing Features metric ( §3), (B) compare faithfulness metrics ( §4), (C) analyze how faithfulness changes with increased inference-time budget and how incompleteness explains part of the apparent unfaithfulness ( §5), and (D) test whether CoT is post-hoc rationalization using LogitLens and Causal Mediation Analysis ( §6).

2025). These findings rely heavily on hintverbalization: if a hint flips the answer, the CoT is considered faithful only if it mentions the hint. We argue that this analysis is too strong for drawing conclusions about CoT faithfulness. Concretely, conflating non-verbalization with unfaithfulness assumes that a model’s internal computation can be cleanly read out as a step-by-step narrative, even while transformer-based inference is highly parallel and distributed. Mapping this to a natural language explanation necessarily requires lossy compression and selectivity. Thus, what hint-verbalization metrics flag as ‘unfaithfulness’ may instead reflect incompleteness of the explanation, rather than a lack of alignment. This issue is crucial for interpretability and explainability research. Failing to distinguish among these two fundamentally different phenomenon poses two risks:

  1. Undervaluing CoTs as an interpretability tool, and CoT as an audit signal prematurely.

  2. Optimizing for saying hints rather than reflecting decision factors.

Figure 1 provides an overview of our approach. In §3, we describe the Biasing Features (hint verbalization) metric and reproduce prior results showing that it labels most CoTs as unfaithful. In §4, we show that these findings do not align with two other prominent faithfulness metrics, Filler Tokens (Lanham et al., 2023) and Faithfulness through Unlearning Reasoning Steps (FUR) (Tutek et al., 2025), and we discuss the implications of these inconsistencies. In §5, we argue that much of what Biasing Features labels as unfaithfulness is better explained as incompleteness, and we test this hypothesis by examining how measured faithfulness changes with increased inference-time budget. In §6, we study the causal relationship between predictions, hinted inputs, and hint-altered CoTs that do not verbalize the hint, using causal mediation analysis (Pearl, 2001), and we analyze how hint information propagates across layers and timesteps. Finally, §7 outlines strategies for making better use of current CoTs and discusses future directions.

Our core findings are:

• CoTs flagged as unfaithful by Biasing Features are often faithful under other metrics. For some models, at least 50% of these CoTs are classified as faithful by another metric.

• Much of the measured unfaithfulness is better attributed to incompleteness. With larger inference-time budgets, the probability of obtaining at least one hint-verbalizing CoT increases to upto 90% in some settings.

• Even when CoTs do not verbalize hints, they can causally mediate part of the hints’ influence on model predictions.

These findings indicate that the narrative claiming that CoT is not explainability, is incomplete and can be misleading, when inferred primarily from hint-verbalization tests.

Jacovi and Goldberg (2020) define faithfulness as the alignment between an explanation and the model’s true reasoning process. A wide range of metrics have been proposed to assess this alignment. Biasing Features metrics (Turpin et al., 2023;Chua and Evans, 2025;Chen et al., 2025) inject a hint into the input to bias the model toward a target answer and then evaluate whether the explanation mentions that hint. This metric, on which most CoT unfaithfulness claims rely, is the primary focus of our critique. Counterfactual Edit methods (Atanasova et al., 2023;Siegel et al., 2024) similarly insert contagious tokens that flip the prediction and check whether explanations reflect these edits. Lanham et al. (2023) instead corrupts the CoT itself and measures whether these corruptions alter the prediction. Other approaches include CC-SHAP (Parcalabescu and Frank, 2023), which measures faithfulness by comparing input attributions for the output with attributions for the reasoning tokens, and FUR (Tutek et al., 2025), which tests whether unlearning individual reasoning steps changes the output. Zaman and Srivastava (2025) further provides a benchmark for comparing faithfulness metrics. Across these works, CoTs appear unfaithful to varying degrees, contributing to a growing narrative of mistrust (Korbak et al., 2025;Barez et al., 2025). However, this narrative is largely shaped by Biasing Features evaluations. In contrast, we show that this metric overstates CoT unfaithfulness and argue that CoTs can be reliable when evaluated with more appropriate tools, though a cautious approach remains warranted.

A common approach to evaluate CoT faithfulness is hint-based evaluation (Biasing Features), where the model is provided with an answer hint in the input. The evaluator then checks whether the model’s prediction and generated CoT change in response to the hint. If the hint changes the prediction to the hinted answer and the model verbalizes the hint in its CoT, the CoT is deemed faithful. If the prediction changes but the CoT does not verbalize the hint, the CoT is deemed unfaithful.

Prior work (Turpin et al., 2023;Chen et al., 2025;Chua and Evans, 2025) explore various ways of injecting hints via few-shot prompts with repeated answer choices, visual markers for the correct option, explicit XML metadata, and expert/user opinions (e.g., “I think the answer is A,” “A Stanford professor thinks the answer is A”). We adopt three approaches: (1) Professor, where the hint is framed as a Stanford professor’s suggestion; (2) Metadata, where the hint is given via XML; and (3) Black Squares, where the hint is conveyed by marking the correct answer with black squares in the fewshot demonstrations as well as marking the suggested answer in the main example.

Let M denote the model. For an input x, the model generates a CoT, c ∼ M (. | x), and then make a prediction ŷ ∼ M (. | x, c) and ŷ ∈ L where L is the set of multiple-choice labels. To construct the hinted input, we prepend a prefix h of the form “A Stanford professor thinks the answer is L h .” where the hinted label L h is randomly selected from the remaining options, excluding the model’s original prediction, i.e., L \ {ŷ}, to ensure that the hinted answer differs from the model’s default response. The hinted input is then x h = h⊕x from which the model produces the hinted CoT,

We evaluate faithfulness only for examples where the model switches to the hinted answer, i.e., ŷh = L h . For these cases, we define faithfulness:

where c h ⊃ S h denotes that the hinted content is semantically present in the CoT. To determine whether a CoT verbalizes the provided hint, we employ an LLM-as-a-judge framework instead of simple lexical keyword matching, following prior work (Chen et al., 2025;Chua and Evans, 2025). Since a CoT may mention the cue in its final verification step or when comparing its answer to the hint without the cue actually influencing the reasoning process, lexical checks can be misleading.

Datasets and Models Throughout this study, we use three multi-hop reasoning datasets that are commonly employed in prior faithfulness research: OpenbookQA (Mihaylov et al., 2018), StrategyQA (Geva et al., 2021), and ARC-Easy (Clark et al., 2018). For models, we select a mix of small-and medium-sized instructiontuned LLMs to balance diversity and computational feasibility: Llama-3-8B-Instruct, Llama-3.2-3B-Instruct (Dubey et al., 2024), andgemma-3-4b-it (Kamath et al., 2025).

Experimental Setup We use greedy decoding for both CoT generation and prediction, matching previous work (we later relax this in §5). For evaluating verbalization of hint with LLM-as-judge, we adopt the evaluation prompt from Chua and Evans (2025) using DSPy (Khattab et al., 2022(Khattab et al., , 2024) ) and use gpt-oss-20b Results Figure 2 shows the unfaithfulness rates, the fraction of instances where the model’s prediction changes to the hinted answer but the generated CoT does not verbalize the hint. Across all datasets, models and hint types, at least 80% of instances are classified as unfaithful by this metric, which is consistent with findings from prior work (Parcalabescu and Frank, 2023;Chen et al., 2025;Chua and Evans, 2025). Moreover, for Black Squares and Metadata hints, nearly all instances are deemed unfaithful. This essentially reproduces previous headline results, but also motivates a deeper analysis of what this metric is actually detecting.

4 Is CoT entirely unfaithful?

While the Biasing Features metric paints a pessimistic picture of the faithfulness of CoTs, this is based on whether the cue provided in the prompt and causing the change in prediction is explicitly verbalized. This evaluation does not account for whether the generated CoT partially reflects the model’s reasoning. To investigate this, we evaluate instances classified as unfaithful by Biasing Features using two different metrics: Filler Tokens (Lanham et al., 2023) and Faithfulness through Unlearning Reasoning steps (FUR) (Tutek et al., 2025). While Filler Tokens measures contextual faithfulness, FUR evaluates parametric faithfulness. Furthermore, due to its definition, if any reasoning step significantly influences the prediction, the CoT is considered faithful under FUR.

Filler Tokens This metric is one of the CoT-corruption-based faithfulness metrics proposed by Lanham et al. (2023)

where ŷh is the prediction for the hinted input using the (uncorrupted) hinted CoT. Since we apply this metric only to hinted examples that are classified as unfaithful by Biasing Features, the baseline prediction is ŷh rather than the original ŷ.

This metric intervenes on model parameters to selectively unlearn individual reasoning steps. A reasoning step r i is considered faithful if and only if the model’s prediction (without CoT) changes after unlearning that specific step. A CoT is then considered faithful if any reasoning step is faithful. Unlike other methods, this approach explicitly incorporates model parameters into the faithfulness evaluation. To unlearn reasoning steps, Tutek et al. (2025) employ Negative Preference Optimization (NPO) (Zhang et al., 2024) with KL-divergence constraints. Formally, let M (i) * denote the model after reasoning step r i has been unlearned. Faithfulness is defined as:

(3) Note that this metric can only be applied to instances where the CoT and no-CoT predictions match; that is, M (x h ) = M (x h , c h ) in our setup. Moreover, because we restrict our evaluation to examples classified as unfaithful by Biasing Features, we have M (x h ) = L h for the instances under consideration.

Experimental Setup For FUR, we adopt the exact setup described by Tutek et al. (2025), running the procedure on instances with biasing cues prepended. For Llama-3.2-3B-Instruct and Llama-3-8B-Instruct, we use the same learning rates reported by Tutek et al. (2025), while for gemma-3-4b-it we perform a similar hyperparameter search. We provide details in Appendix A. Figures 3 and4 show the faithfulness ratios measured by Filler Tokens and FUR, respectively, for instances labeled as unfaithful by Biasing Features across three tasks, three models and three hint types. Based on Filler Tokens, approximately 20-40% of unfaithful CoTs are contextually faithful across all tasks under the Professor hint for Llama-3.2-3B. In contrast, the other models generally exhibit faithfulness rates below 20%, with the exception of gemma-3-4b-it on ARC-Easy. For the Black Squares hint, faithfulness rates are higher across all tasks and models, except for Llama-3-8B-Instruct, which consistently exhibits lower rates. Under the Metadata hint, faithfulness falls below 20% across all tasks for Llama-3-8B-Instruct and gemma-3-4b-it, whereas Llama-3.2-3B-Instruct maintains substantially higher faithfulness. The consistently low rates observed for Llama-3-8B-Instruct are largely due to CoTs generated after the hint being empty or consisting of repeated EOS tokens, which are excluded from the Filler Tokens measurements. Using FUR, at least 50% of the CoTs that could be examined contain at least one faithful reasoning step for Llama-3.2-3B-Instruct across all three tasks and all hint types. A similar pattern holds for Llama-3-8B-Instruct, with the exception of OpenbookQA under the Black Squares hint and StrategyQA under the Metadata hint. In contrast, gemma-3-4b-it exhibits consistently lower faithfulness rates across all tasks and hint types.

T1. Even when CoTs do not expilictly verbalize cues, CoTs often remain relevant under common alternative faithfulness tests.

If natural language explanations are viewed as compressed, interpretable representations of the underlying reasoning, it is unreasonable to expect them to explicitly capture all influential decision factors, unlike mechanistic explanations that can isolate specific representations or circuits. An ideal, complete, and faithful CoT would mirror the decision process one-to-one, but even with sufficient token budget, current models are not trained to reflect every internal reasoning step in detail.

Practically, sufficient detail is the level needed for an external observer (or simulator) to reproduce the same prediction. While simulatability (Doshi-Velez and Kim, 2017; Hase and Bansal, 2020; Wiegreffe et al., 2020;Chan et al., 2022) captures this, a simulatable CoT may still fail to mention the prompt cues provided in Biasing Features setup. Thus, the unfaithfulness of CoTs attributed by Biasing Features may stem not only from true unfaithfulness but also from incompleteness.

To investigate this, we allocate more budget for explanations. One approach is to increase the token budget, allowing models to generate longer CoTs. However, this is unreliable, as models may still stop early. Forcing longer outputs through constrained decoding is also problematic, as it may push mod-els outside their training distribution. Consistent with our claim, Chua and Evans (2025) show that reasoning models trained to reason longer achieve higher faithfulness on the Biasing Features metric.

For a more reliable evaluation, we adapt the pass@k metric from Chen et al. (2021). Originally proposed to assess code generation quality, pass@k has become widely adopted for other benchmarks as well. The unbiased estimator for pass@k is:

Here, n is the number of samples generated per problem and c is the number of correct samples. In our adaptation, c is the number of faithful samples with respect to Biasing Features, and n is the number of samples whose answer changes to the hinted one. We call this metric faithful@k, the probability of obtaining at least one faithful explanation in k attempts.

Most Biasing Features measurements rely on greedy decoding, which is unrealistic in practice. faithful@k both gives models more budget for reasoning and captures output variability beyond greedy decoding. If non-verbalization is partly due to incompleteness, faithful@k should increase with k. If it reflects genuine unfaithfulness, it should stay flat as k increases.

Experimental Setup We generate 128 samples per example and compute faithful@k for k = {1, 2, 4, 8, 16}. Instances where n < max k are excluded, as not every sample changes their answer to the hinted one. For sampling, we use each model’s default hyper-parameters (Appendix C).

Figure 5 shows faithful@k rates for all three models and hint types, averaged across tasks. Under the Professor hint, gemma-3-4b-it reaches close to 0.9 at k = 16 on average, whereas the other two models increase more modestly and remain below 0.5. The steady upward trend as k increases, together with the large gap between faithful@1 and faithful@16, suggests that a substantial portion of the unfaithfulness attributed to CoTs is consistent with incompleteness rather than a lack of faithfulness. In contrast, under the Black Squares and Metadata hints, increasing k has little effect on faithful@k rates. This result is important, as it shows that higher inference-time budgets do not guarantee improved verbalization, and that our met- ric can distinguish incompleteness from genuine unfaithfulness. In these two settings, models fail to verbalize the hint regardless of the available compute. Full results across all tasks, hint types, and models are provided in Figure 9 in Appendix B.

T2. CoTs that do not explicitly verbalize given cues are not necessarily unfaithful; they may simply be incomplete.

6 Is CoT post-hoc rationalization?

Another common claim used to justify mistrust in CoTs is that they merely serve as post-hoc rationalizations of hinted cues. However, the provided cue can influence the model’s internal reasoning process, which may be reflected in the CoT even without explicit verbalization of the cue.

Logit Lens Analysis To examine how the hints propagate through the model’s reasoning, we use the Logit Lens (nostalgebraist, 2020), an interpretability method that decodes intermediate representations (e.g., MLP or attention outputs) into vocabulary logits, revealing how concepts evolve across layers and timesteps. For a transformer model with n L layers, let z (l) denote the Multihead Attention (MHA) output at layer l at the position of the token of interest. We decode this intermediate activation by applying the final-layer LayerNorm followed by the unembedding matrix U ∈ R |V |×d , where V is the vocabulary and d is the hidden size:

(5)

Although the Logit Lens can be applied to both MLP and MHA outputs, in this analysis we restrict our attention only to MHA activations. We focus specifically on examples whose generated CoT lacks any lexical mention of the hint tokens (e.g., Stanford, professor). Within these, we find positions where any hint-related token appears in the top-5 decoded logits at any layer. For each such position, we extract a 9-token window centered on it and analyze how hint-related representations emerge across layers with the Logit Lens.

Causal Mediation Analysis While Logit Lens gives a coarse view of hint usage across layers, it does not show whether the CoT causally affects the model’s prediction or merely explains it post hoc. To examine this causal link, we use Causal Mediation Analysis (Pearl, 2001), which decomposes an intervention’s total effect into direct and indirect components via an intermediate variable. We use it to quantify how much of the prediction change after adding a hint is mediated by the non-verbalizing CoT versus caused directly by the hint itself.

Let p h denote the model-assigned probability of the hinted answer token L h in the output distribution after applying the softmax of model M . We first compute the natural direct effect (NDE) of adding a hint to the input, holding the CoT fixed:

Next, we compute the natural indirect effect (NIE) of adding the hint, this time keeping the input fixed while substituting in the CoT induced by the hinted input:

In addition to measuring effects on the hinted answer’s probability, we also analyze how hints shift probability mass among the remaining options by tracking ph = c∈L{L h } p c , allowing us to examine whether hints suppress alternatives or primarily boost the hinted choice.

Logit Lens Results Across these contexts, we observe several recurring patterns:

• Hint-related tokens frequently appear near mentions of the word “answer”, either as part of the prediction prompt or when the model states its answer within the CoT.

• Hint-related tokens often surface during contrastive transitions, such as when the model uses conjunctions like “however” or “on the other hand”, marking a shift in reasoning direction. They also appear in referential or summarizing phrases such as “considering these” or “given these”, where the model consolidates or refers back to previous reasoning steps.

• The most intriguing pattern is the activation of hint-related tokens at the beginning of reasoning steps, particularly around numerical enumerations of steps. While the earlier patterns may indicate preparatory processes leading to answer formulation, this early activation suggests that the hint may shape the explanation’s structure to align with the hinted answer.

Figure 6 shows the logits of hint-related tokens that appear in the top-5 at any layer of Llama-3.2-3B-Instruct for the Professor hint.

Across nearly all datasets and patterns, we see two distinct peaks between layers 20 and 25. Results for all models and hint types are in Appendix B.

Figures 7 and8 show NDE and NIE estimates for the probability of the hinted answer and the summed probability of non-hinted answers across all models and tasks under the Professor hint, with BCa 95% confidence intervals from 10,000 bootstrap resamples. In Figure 7 In Figure 8, the NIE confidence intervals remain non-zero, while some NDEs are not significantly different from zero. We also see more instances where the indirect effect is larger in magnitude than the direct effect when reducing the probability of non-hinted options than when increasing the hinted option. This indicates that CoTs that do not verbalize the hint can influence predictions by suppressing alternative choices, not just by boosting the hinted one. The same pattern appears for the Metadata and Black Squares hints (Figures 14 and16 in Appendix B) and may reflect cases where hintinduced CoTs bypass reasoning paths that would otherwise support the default prediction.

T3.2. CoTs can shift predictions by decreasing the probability of non-hinted options, not only by increasing the hinted option.

Overall, these results show that CoTs have a genuine causal impact on model predictions, even without explicitly mentioning the provided hints, by both increasing the hinted option’s probability and reducing the non-hinted alternatives, reflecting multiple pathways through which hint-related information propagates.

Our findings indicate that claims of widespread CoT unfaithfulness largely arise from overinterpreting the Biasing Features metric. Using complementary metrics, studying completeness via inference-time scaling, and applying mediation analysis to causal pathways, we show that CoTs can encode meaningful reasoning signals even when they do not explicitly verbalize provided hints. Probability-level analyses further suggest that much apparent “unfaithfulness” reflects incompleteness in a compressed report rather than misalignment. We recommend that future interpretability work report other corruption based metrics and mediation analysis alongside Biasing Features.

What Biasing Features measures Biasing Features is best seen as a test of verbalized sensitivity to a known intervention: when a hint changes the answer, does the model report that hint in its CoT? This is a useful reporting measure, but is not the same as faithfulness: alignment between the explanation text and decision-relevant computation.

Conflating Faithfulness with Plausibility The limitation of the Biasing Features metric is its implicit assumptions. An explanation can accurately reflect the model’s reasoning yet be labeled unfaithful if it omits the given cues, while another that mentions them can be labeled faithful even if the hint does not drive the decision. This aligns with human intuitions about plausibility but goes beyond faithfulness, effectively turning the metric into plausibility-based evaluation.

Although current CoTs are imperfect explanations, they remain useful. Combined with other methods, CoTs can support a more holistic view of model behavior. Contextual and parametric faithfulness metrics indicate whether a CoT aligns with the model’s decision process, even if they cannot confirm that it captures every influential factor. More generally, when practitioners can specify factors of interest, methods exist to measure and intervene on them. For instance, Karvonen and Marks (2025) use representation-level interventions to remove demographic information and reduce racial and gender bias in LLM-based hiring. Even if CoTs do not explicitly describe such influences, concept-identification methods can find representation-space directions for demographic attributes that causally affect predictions. Thus, CoTs, used with complementary tools, can still play a meaningful role in interpretability pipelines.

Future Work Existing methods can check whether CoTs contradict a model’s underlying reasoning and test the effects of predefined factors, but they struggle to reveal information the model does not verbalize. Biasing Features tries to measure this, yet relies on an artificial setup and lacks instance-level insight into what is left unsaid for each example. Verbalization Finetuning (VFT) (Turpin et al., 2025) encourages models to articulate reward-hacking behaviors, but its generalization is unclear because held-out evaluations closely match training data. Future work should aim to improve CoTs not by optimizing for verbalizing simplistic or toy interventions, but by encouraging models to expose implicit, real-world factors through broader, more generalizable objectives.

While we expect our findings to generalize to larger models under faithful@k, our experiments do not include larger models or specialized reasoning models due to computational constraints. The FUR experiments in §4 are memory intensive, with memory requirements increasing rapidly as model size grows. In addition, the faithful@k analysis in §5 requires generating 128 samples per example and evaluating them using a self-hosted gpt-oss-20b model. Because reasoning models typically produce longer generations, both sampling and evaluation become substantially more expensive, making these settings impractical under our current resource constraints.

Reasoning Steps (FUR) Details

As FUR is based on machine unlearning, we adopt the Efficacy and Specificity metrics from Tutek et al. (2025) to evaluate unlearning quality. Efficacy measures whether the targeted reasoning content is successfully removed, while Specificity assesses whether the model preserves its behavior on nontarget, in-domain data after unlearning.

Efficacy We quantify Efficacy as the relative reduction in the length-normalized probability of unlearned CoT step r i :

where p M (r i ) denotes the length-normalized probability of reasoning step r i by the original model M , and p M (i) * (r i ) denotes the probability after unlearning r i . In Table 1, we report the Efficacy averaged across unlearned steps and instances.

Specificity We evaluate Specificity on a held-out validation set, D S (where |D S | = 20), to measure the preservation of model capabilities. Specificity is defined as the proportion of non-target instances where the predicted label remains unchanged after unlearning. Formally, let y j be the label predicted by the original model M for instance j, and y * j be the prediction after unlearning. Specificity is calculated as:

In Table 1, we report Specificity averaged across unlearning iterations, reasoning steps, and instances.

Since our datasets and models largely overlap with those used by Tutek et al. (2025), except for gemma-3-4b-it, we adopt the same hyperparameters for the shared models. For gemma-3-4b-it, we select the learning rate following the same procedure as Tutek et al. (2025): choosing the learning rate that maximizes efficacy while maintaining specificity of at least 95% on a held-out set. During hyperparameter selection, hint prefixes are excluded. We report Faithfulness, Efficacy, and Specificity for learning rates in {1e-6, 3e-6, 5e-6, 1e-5, 3e-5, 5e-5, 1e-4}, and highlight the selected learning rate for each dataset in Table 2.

Filler Tokens and FUR Tables 3,4, and 5 show the full results across three tasks, three hint types, and all models for the Biasing Features, Filler Tokens, and FUR metrics, respectively. Table 3 summarizes the total number of evaluated instances, the number of cases where the model switches its prediction to the hinted answer, and the subset of those cases classified as unfaithful, where the CoT does not verbalize the hint despite the prediction change.

Only instances labeled as unfaithful by the Biasing Features metric are included in the Filler Tokens and FUR evaluations. Table 4 reports the total number of available instances, the number of usable instances, and the number identified as faithful under the Filler Tokens metric. The difference between Total and Usable instances arises only for Llama-3-8B-Instruct, as many of its generated 96.7 69.3 69.6 100.0 73.0 53.9 100.0 5e-5 74.4 52.4 100.0 75.7 50.3 100.0 76.9 46.4 100.0 1e-4 78.7 30.8 100.0 79.4 31.8 100.0 80.0 41.8 100.0 Table 2: Control metrics Efficiency (Eff) and Specificity (Spec), together with Faithfulness (FF), across three tasks for gemma-3-4b-it evaluated under different learning rates on held-out sets.

CoTs are empty or consist solely of repeated EOS tokens.

For FUR, Usable instances are those in which the model’s predictions with and without CoT agree and the CoT is non-empty. As a result, Total and Usable counts differ across all tasks, models, and hint types. This discrepancy is especially pronounced for Llama-3-8B-Instruct, again due to the high frequency of empty or degenerate CoTs. faithful@k. Figure 9 shows faithful@k for all three models, hint types, and datasets separately. Under the Professor hint, the increase from k = 1 to k = 16 is substantial, most notably for gemma-3-4b-it, which reaches high faithful@k values exceeding 0.8 across all tasks, while the other models show more moderate gains. By contrast, faithful@k barely changes with increasing k under the Metadata and Black Squares hints, with the exception of Llama-3.2-3B-Instruct on StrategyQA, where a consistent increase is ob-served under both hint types. Figures 10,11,and 12 show the logits of hint-related tokens appearing in the top five predictions at each layer across five recurring patterns identified over all tasks and hint types for Llama-3.2-3B-Instruct, gemma-3-4b-it, and Llama-3-8B-Instruct, respectively. Across most settings, peaks emerge in later layers, typically after layer 20, although the exact formation varies by model and task. For example, Llama-3.2-3B-Instruct often exhibits two peaks in the later layers, whereas Llama-3-8B-Instruct shows a single dominant peak under the Metadata hint. In contrast, gemma-3-4b-it presents a more heterogeneous pattern across tasks and hint types. While not all identified patterns appear uniformly across models, tasks, and hint types, we find no evidence supporting any of the predefined patterns for Open-bookQA and ARC-Easy under the Metadata hint for gemma-3-4b-it. Table 3: Results for the Biasing Features evaluation. We report the total sample size used for evaluation (Total), the number of instances where the model changed its prediction to match the hint (Changed), and the subset of those changed instances where the model failed to verbalize the hint in its reasoning (Unfaithful). Table 4: Results for the Filler Tokens evaluation. We report the total sample size available for evaluation (Total), the number of instances that are suitable for Filler Tokens evaluation (Usable), and the subset of those usable instances where the metric identified as faithful (Faithful).

For the FUR evaluation, we adapt the codebase released by Tutek et al. (2025), which relies on spaCy (Honnibal et al., 2020) and NLTK (Bird and Loper, 2004). All experiments are implemented using HuggingFace Transformers (Wolf et al., 2020) and PyTorch (Paszke et al., 2019).

For the LLM-as-judge setup powered by DSPy (Khattab et al., 2024), we deploy gpt-oss-20b using SGLang (Zheng et al., 2024) on two NVIDIA RTX A6000 GPUs with 48GB VRAM each. Aside from hint verbalization evaluation, all experiments are run on a single NVIDIA RTX A6000 GPU. The only exception is the FUR evaluation for Llama-3-8B-Instruct, where an NVIDIA H100 GPU with 80GB VRAM is used.

During faithful@k evaluation, we use the default sampling settings for each model.

For Llama-3.2-3B-Instruct and Llama-3-8B-Instruct, we set the temperature to 0.6 and apply nucleus sampling (Holtzman et al., 2019) with top-p = 0.9. For gemma-3-4b-it, we Table 5: Results for the FUR evaluation. We report the total sample size available for evaluation (Total), the number of instances that are suitable for FUR evaluation (Usable), and the subset of those usable instances where the metric identified as faithful (Faithful).

use top-k = 64 and top-p = 0.95.

Biasing Features experiments typically run from several minutes to several hours, whereas Filler Tokens and Causal Mediation Analysis experiments complete within a few minutes. The most timeconsuming experiments are FUR and faithful@k, which range from several hours to multiple days, and in some extreme cases exceed one week. FUR is particularly compute-intensive due to repeated unlearning iterations for each reasoning step and instance, frequent model reloads, and evaluations after each unlearning step. In contrast, faithful@k requires sampling 128 CoTs per instance and performing LLM-based evaluations for every instance that switches its prediction, with overall runtime largely determined by the throughput of the LLMas-judge system.

We follow prior work (Chua and Evans, 2025;Chen et al., 2025) by using an LLM-as-judge to detect whether a CoT verbalizes the provided hint, rather than relying on lexical matching. Simply mentioning the hint does not necessarily imply that the model acknowledges or uses it in its decisionmaking process. A model may repeat the hint verbatim while still basing its prediction on independent reasoning, or it may explicitly restate the hint in order to reject it. Lexical checks alone can therefore overestimate faithfulness. To mitigate this issue, we prompt the judge model to identify cases in which the CoT explicitly states that the hint influenced the prediction. To avoid the cost of closed-model APIs, we use gpt-oss-20b with DSPy, which also facilitates structured output parsing. Figure 20 shows the DSPy signature used for the Professor hint; the signatures for the other hint types differ only in minor details.

To assess agreement between the LLM-as-judge and human annotators, we manually annotate a stratified subsample of 100 instances in which the model’s prediction changed after the hint, evenly distributed across tasks and models. Comparing the LLM-as-judge predictions against human annotations yields an accuracy of 80%. However, precision and recall are relatively low (precision: 36%, recall: 31%). The confusion matrix is shown in Figure 17. While the false positive rate is low (11%), the true positive rate is also low (31%). High false positives are less concerning for our analysis, since we focus on negative cases, namely instances classified as unfaithful by the Biasing Features metric. However, false negatives could weaken our claims, as some CoTs identified as faithful by alternative metrics may already be faithful under Biasing Features.

To test whether this issue affects our conclusions, we rerun the Filler Tokens and FUR evaluations on a stricter subset consisting only of instances where the hint is not verbalized even lexically. Figures 18 and 19 present the results. Aside from minor decreases in a few settings, the overall trends remain unchanged, indicating that our findings are robust to false negatives introduced by the judge LLM.

Reference

This content is AI-processed based on open access ArXiv data.

Start searching

Enter keywords to search articles

↑↓
ESC
⌘K Shortcut