๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.21711
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

Latent tokens are gaining attention as a way to enhance reasoning in large language models (LLMs), yet their internal mechanisms remain unclear. This paper examines the problem from a reliability perspective and uncovers fundamental weaknesses: latent tokens function as uninterpretable placeholders rather than encodings of faithful reasoning. While resistant to perturbation, they promote shortcut usage over genuine reasoning. We focus on Chain-of-Continuous-Thought (COCONUT), which claims better efficiency and stability than explicit Chain-of-Thought (CoT) while maintaining performance. We investigate this through two complementary approaches. First, steering experiments perturb specific token subsets, namely COCONUT latent tokens and explicit CoT tokens. Unlike CoT tokens, COCONUT tokens show minimal sensitivity to steering and lack reasoning-critical information. Second, shortcut experiments evaluate models under biased and out-of-distribution settings. Results on MMLU and HotpotQA demonstrate that COCONUT consistently exploits dataset artifacts, inflating benchmark performance without true reasoning. These findings reposition COCONUT as a pseudo-reasoning mechanism: it generates plausible traces that conceal shortcut dependence rather than faithfully representing reasoning processes.

📄 Full Content

The continuous prompting paradigm has attracted growing interest in natural language processing (NLP) as a way to enhance reasoning abilities in LLMs (Wei et al., 2022). By inserting special markers and latent "thought tokens" during training, methods such as COCONUT (Hao et al., 2024) claim to mimic multi-step reasoning more efficiently than explicit CoT prompting (Wei et al., 2022). Empirical reports suggest that COCONUT can improve accuracy on reasoning datasets such as GSM8K (Cheng et al., 2022) and ProntoQA (Saparov and He, 2022), raising the possibility of a more scalable path toward reasoning-capable LLMs.

Yet the internal mechanisms of COCONUT remain opaque. Unlike CoT, where reasoning steps are human-readable (Wei et al., 2022), COCONUT replaces reasoning traces with abstract placeholders. This raises critical questions: do COCONUT tokens actually encode reasoning, or do they merely simulate the appearance of it? If they are not causally linked to predictions, then performance gains may stem from shortcut learning rather than genuine reasoning (Ribeiro et al., 2023). Worse, if these latent tokens are insensitive to perturbations, they could conceal vulnerabilities where adversarial manipulations exploit hidden dependencies (Bråtelund, 2024).

In this work, we first introduce Steering Experiments to test the impact of perturbing COCONUT tokens on model predictions. By introducing slight variations to the COCONUT tokens during reasoning, we assess whether these changes influence model behavior, which would indicate a causal relationship between the tokens and the reasoning outcome. Our results reveal that COCONUT tokens have minimal impact on model predictions, as shown by consistently low perturbation success rates (PSR), below 5% in models such as LLaMA 3 8B and LLaMA 2 7B. In contrast, CoT tokens display significantly higher PSRs, reaching up to 50% on LLaMA 3 8B, highlighting that COCONUT tokens lack the reasoning-critical information carried by CoT tokens.

Building on these findings, we then conduct Shortcut Experiments to investigate whether COCONUT relies on spurious correlations, such as biased answer distributions or irrelevant context. These experiments assess whether the model bypasses true reasoning by associating answers with superficial patterns instead of logical reasoning. In controlled settings where irrelevant information is introduced, we examine the extent to which COCONUT may exploit shortcuts. Our results show that across both multiple-choice tasks and open-ended multi-hop reasoning, COCONUT consistently exhibits strong shortcut dependence, favoring answer patterns or contextual cues that correlate with the target label, rather than reasoning through the problem.

Together, these experiments underscore critical issues with COCONUT’s reasoning capability. Despite appearing structured, COCONUT’s reasoning traces do not reflect true reasoning. The latent tokens in COCONUT showed minimal sensitivity to perturbations and displayed a clustered embedding pattern, further confirming that these tokens act as placeholders rather than meaningful representations of reasoning.

2 Related Work

CoT reasoning improves LLM performance by encouraging step-by-step intermediate solutions (Wei et al., 2022). Existing work explores various ways to leverage CoT, including prompting-based strategies (Kojima et al., 2022), supervised fine-tuning, and reinforcement learning (Ribeiro et al., 2023). Recent efforts enhance CoT with structured information, e.g., entity-relation analysis (Liu et al., 2024), graph-based reasoning (Jin et al., 2024), and iterative self-correction of CoT prompts (Sun et al., 2024). Theoretically, CoT increases transformer depth and expressivity, but its traces can diverge from the model’s actual computation, yielding unfaithful explanations (Wang et al., 2025), and autoregressive generation limits planning and search (Zelikman et al., 2022).

To address these issues, alternative formulations have been proposed. Cheng et al. (2022) analyzed symbolic and textual roles of CoT tokens and proposed concise reasoning chains. Deng et al. (2023) introduced ICoT, gradually internalizing CoT traces into latent space via knowledge distillation and staged curricula, later refined by Deng et al. (2024) through progressive removal of explicit CoT traces. Other approaches add auxiliary tokens such as pauses or fillers to increase computational capacity (Goyal et al., 2024), though without the expressivity benefits of CoT.

A growing line of research investigates reasoning processes that occur in the hidden states of transformers rather than in their generated text. Li et al. (2025) examined execution paradigms to study internal reasoning, while Xu et al. (2024b) learned latent representations of reasoning skills in an unsupervised manner. Yang et al. (2025) showed that intermediate reasoning variables can be recovered from hidden representations, while Bråtelund (2024) explored latent reasoning paths and interventions in the hidden space. Wang et al. (2025) provided evidence that even when LLMs output explicit CoT traces, their true reasoning can differ internally, leading to unfaithfulness. Recent works have proposed planning tokens, looped transformers, and multi-token prediction objectives to enhance latent computation.

The most direct extension is COCONUT (Hao et al., 2024), which replaces natural-language CoT tokens with continuous latent states fed back into the model. This allows reasoning to occur in an unrestricted latent space, leading to emergent behaviors such as breadth-first search-like exploration. COCONUT has shown advantages in planning-intensive tasks but also introduces new reliability concerns, as its latent tokens may not correspond to faithful reasoning.

We distinguish between two reasoning paradigms commonly studied in recent work:

CoT: x → r → y, where r is an explicit, human-readable reasoning trace. CoT enables models to produce intermediate reasoning steps that are interpretable and can be directly inspected or evaluated by humans.

COCONUT: x → z → y, where z is a sequence of latent tokens that function as placeholders for reasoning. Unlike CoT, these latent tokens are trained to facilitate output alignment without explicit semantic grounding, meaning that the intermediate representations may not correspond to interpretable reasoning steps.

Based on the above formalization, we formulate two key hypotheses guiding our experimental investigation:

H1 (Faithfulness / Steering): If COCONUT latent tokens encode faithful, reasoning-critical information, then perturbing or swapping them should significantly alter the model’s predictions, just as perturbing explicit CoT steps does.

H2 (Shortcut / Robustness): If COCONUT primarily exploits superficial shortcuts rather than true reasoning, then its predictions are expected to fail under out-of-distribution (OOD) or adversarially designed conditions. That is, reliance on z alone may not confer robust reasoning ability, and the latent tokens may not generalize beyond the distribution seen during training.

[Figure: Illustration of steering on reasoning tokens. The color intensity of each embedding reflects the model’s assessment of its potential risk level, with darker shades indicating a higher likelihood of being identified as hazardous content; the example trace walks through numbered steps of a harmful request (Step 1: developing a strategy to hack a government database; Step 2: identifying the types of sensitive information; Step 3: obtaining the right credentials; …).]

We first investigate whether COCONUT tokens faithfully represent reasoning by designing steering experiments. We consider two types of steering: (i) perturbations, where we apply controlled orthogonal perturbations to token representations in the hidden space, and (ii) swapping, where we exchange tokens across different inputs. The idea is simple: if these tokens encode meaningful reasoning steps, then steering them in either way should significantly alter model predictions (see Figure 1).

Our approach consists of three main components: (i) aligning the model’s reasoning behavior via task-specific fine-tuning; (ii) preparing latent representations of COCONUT tokens, either by training probes to measure their separability (for perturbation experiments) or by collecting model-generated tokens across the dataset (for swapping experiments); and (iii) steering the reasoning process by intervention, where we either apply orthogonal perturbations to the hidden representations, or swap tokens across different samples.

Probe analysis and token preparation. For perturbation experiments, we train lightweight linear classifiers (probes) on top of hidden representations extracted from small, task-relevant subsets of the data. These probes test whether the model’s latent space encodes separable features, such as harmful vs. harmless instructions or different persona tendencies. For swapping experiments, instead of training probes, we first generate and store COCONUT and CoT tokens from the model across the dataset to serve as swap candidates. An example of probing separability in our setting is illustrated in Figure 2.
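As a concrete illustration, the probing step can be as simple as fitting a logistic-regression classifier on frozen hidden states. The sketch below assumes the activations have already been extracted at a fixed layer and token position; the helper name is illustrative, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_probe(hidden_states, labels):
    """Fit a lightweight linear probe on frozen hidden representations.

    hidden_states: (n_samples, hidden_dim) activations, e.g. taken at the
    last token of each instruction; labels: binary class ids such as
    harmful vs. harmless, or one persona polarity vs. the other.
    """
    X = np.asarray(hidden_states, dtype=np.float32)
    y = np.asarray(labels)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X, y)
    # The normalized weight vector doubles as a candidate steering direction.
    direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
    return probe, direction
```

High probe accuracy indicates that the targeted concept is linearly separable in the hidden space, which is the precondition for the steering interventions described next.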

Steering via intervention. Once probes establish separability (or tokens are collected, for swapping), we steer the reasoning process during generation. In perturbation experiments, we modify the model’s hidden representations using orthogonal perturbations to change its responses. This approach is conceptually similar to frameworks such as Safety Concept Activation Vector (Xu et al., 2024a) and personality-editing approaches (Ju et al., 2025). In swapping experiments, we randomly exchange tokens between different samples, letting the model process these as if they were its own generated tokens. Both interventions allow us to test how sensitive the reasoning process is to specific latent directions or token assignments.

Perturbation timing. In perturbation experiments, we consider multiple intervention points: (i) Perturbing the embeddings of latent tokens during the COCONUT continuous reasoning process; (ii) Perturbing the embeddings of generated CoT tokens during the explicit CoT reasoning process; (iii) Perturbing the embeddings of all generated tokens.
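The paper does not spell out the exact form of the orthogonal perturbation, so the following is only a minimal sketch of how such an intervention could be wired: it projects a probe-derived direction onto the subspace orthogonal to the current hidden state and adds it back via a forward hook. The LLaMA-style module path (`model.model.layers[...]`), the layer index, the token positions, and the scaling factor `alpha` are all assumptions.

```python
import torch

def orthogonal_perturbation(hidden, direction, alpha=4.0):
    # Drop the component of `direction` parallel to the current hidden state,
    # then push the state along the remaining (orthogonal) component.
    h = hidden / (hidden.norm(dim=-1, keepdim=True) + 1e-8)
    d = direction - (direction * h).sum(dim=-1, keepdim=True) * h
    d = d / (d.norm(dim=-1, keepdim=True) + 1e-8)
    return hidden + alpha * d

def make_perturbation_hook(token_positions, direction, alpha=4.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, token_positions, :] = orthogonal_perturbation(
            hidden[:, token_positions, :], direction, alpha
        )
        return output
    return hook

# Illustrative wiring on a decoder layer of a LLaMA-style model:
# handle = model.model.layers[20].register_forward_hook(
#     make_perturbation_hook(latent_token_positions, probe_direction, alpha=4.0)
# )
```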

Datasets. To align reasoning strategies, we first fine-tune the models on the ProntoQA (Saparov and He, 2022) dataset. For perturbation experiments, we use two datasets with strong directional tendencies: the AdvBench (Chen et al., 2022) dataset, and the PersonalityEdit (Mao et al., 2024) dataset. For token-swapping experiments, we use the MMLU (Hendrycks et al., 2020) dataset.

Models. For perturbation experiments, we conduct studies using four open-source LLMs: LLaMA 3 8B Instruct (AI@Meta, 2024), LLaMA 2 7B Chat (Touvron et al., 2023), Qwen 2.5 7B Instruct (Team, 2024a), and Falcon 7B Instruct (Team, 2024b), all fine-tuned with full-parameter training. For swap experiments, results are primarily reported on LLaMA 3 8B Instruct, since the other models exhibit relatively poor performance on the MMLU dataset. For the COCONUT prompting paradigm, we use 5 latent tokens, corresponding to 5 reasoning steps, and evaluate alongside standard CoT prompting to compare the different reasoning modes.

Evaluation protocol. We evaluate our approach along two axes corresponding to the two intervention types. For perturbation experiments, we measure effectiveness by the perturbation success rate (PSR). Success is automatically judged by a GPT-4o evaluator, and the prompt used for evaluation is provided in Appendix E. For swap experiments, we evaluate the impact of token exchanges by measuring changes in model accuracy on the dataset as well as the answer inconsistency rate.
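For the swap experiments, the two reported quantities reduce to simple counts over predictions made before and after the token exchange; a minimal sketch (function name illustrative) is:

```python
def swap_metrics(answers_before, answers_after, gold):
    """Accuracy before/after token swapping and the answer inconsistency rate,
    i.e. the fraction of samples whose prediction changed after the swap."""
    n = len(gold)
    acc_before = sum(a == g for a, g in zip(answers_before, gold)) / n
    acc_after = sum(a == g for a, g in zip(answers_after, gold)) / n
    inconsistency = sum(a != b for a, b in zip(answers_before, answers_after)) / n
    return acc_before, acc_after, inconsistency
```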

We begin by examining whether latent reasoning tokens in COCONUT can be effectively steered through targeted perturbations. Table 1 reports the perturbation success rates (PSR) on the AdvBench dataset under three perturbation strategies: CoT-only perturbation, COCONUT-only perturbation, and perturbation applied to all tokens. Across all four models, perturbing COCONUT tokens yields consistently low PSRs (below 5%), whereas perturbing CoT tokens is far more effective, reaching roughly 50% on LLaMA 3 8B.

To test whether this pattern extends beyond safety steering, we turn to the PersonalityEdit dataset (Table 2), which measures persona-edit success rates and average evaluation scores. Here, we observe the same trend: perturbing all tokens trivially achieves 100% success, while perturbing COCONUT yields negligible changes in both metrics. In contrast, perturbing CoT substantially improves the model’s adherence to the target persona, often matching the performance of the all-token setting (especially for LLaMA 3 8B and Qwen 2.5 7B).

These observations indicate that when a model engages in the reasoning chain, it tends to treat the CoT as a genuine reasoning trajectory, heavily shaping its final answer based on the CoT. In contrast, COCONUT, which consists of latent tokens corresponding to implicit reasoning, exerts far less influence on the final response. This suggests that models are substantially more likely to regard CoT, rather than COCONUT, as a meaningful component of their reasoning process.

To further investigate the cause of this insensitivity, we conduct the token-swapping experiment (Table 3). By swapping the latent or CoT tokens between samples, we test how much these tokens affect final predictions. Before swapping, both COCONUT and CoT achieved accuracies around 60%. But after swapping, COCONUT’s accuracy remained at a similar level (≈ 60%), whereas CoT’s accuracy dropped substantially to 43.4%. In terms of inconsistency, COCONUT exhibited only 17.9%, while CoT reached 52.8%, exceeding half of the samples. Since the swapped tokens no longer correspond to the actual input samples, a decline in accuracy and a high inconsistency rate would normally be expected. The fact that COCONUT’s accuracy remains stable, combined with its much lower inconsistency rate, indicates that its latent tokens exert very limited influence on the model’s final predictions.

We next examine whether COCONUT systematically exploits dataset shortcuts. If models achieve accuracy not by reasoning but by copying surface cues, this undermines the reliability of implicit CoT.

[Figure: Example reasoning traces for two multiple-choice questions. Left: “The Pleiades is an open star cluster that plays a role in many ancient stories and is well-known for containing … bright stars,” answered through steps about the “Seven Sisters” and concluding that the correct choice is 7. Right: “Which of the following can act as an intracellular buffer to limit pH changes when the rate of glycolysis is high?”, answered through steps about lactic acid lowering intracellular pH and concluding that the correct choice is carnosine.]

To systematically study shortcut learning in language models, we design two types of shortcut interventions.

Option manipulation. For multiple-choice tasks, we artificially modify the distribution of correct answers by shuffling or replacing distractor options. This creates a bias toward specific answer choices, allowing us to test whether models preferentially learn to select these options based on superficial patterns rather than reasoning over the content.

Context injection. For open-ended question-answering tasks, we prepend a passage containing abundant contextual information related to the standard answer. Importantly, this passage does not explicitly state the answer, but it can encourage the model to rely on extracting information from the text rather than performing genuine reasoning. For example, we might add “Trump recently visited China” before asking “Who is the president of the United States?”. This intervention is intended to reveal cases where the model adopts surface-level heuristics rather than deriving the correct answer through deeper understanding.

Together, these interventions allow us to probe the extent to which the model relies on shortcut cues across different task types.

Datasets and Tasks. For multiple-choice experiments (option manipulation), we use the MMLU (Hendrycks et al., 2020) dataset. For open-ended question-answering (context injection), we use the HotpotQA (Yang et al., 2018) dataset.

Models and Fine-tuning. We conduct all experiments with the LLaMA 3 8B Instruct model (AI@Meta, 2024), chosen for its strong performance on challenging tasks such as MMLU and HotpotQA. Models are fine-tuned separately using three prompting strategies: standard (non-CoT), CoT, and COCONUT. Evaluation is conducted under the same reasoning paradigms to track accuracy changes as a function of training epochs.

Experimental Design. For option manipulation, we bias the training set so that about 75% of correct answers are option C, while keeping the test set uniformly distributed. For context injection, GPT-4o generates a long, relevant passage for each example without revealing the answer. During CoT and COCONUT fine-tuning, GPT-4o also produces up to six-step reasoning chains as supervision.
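A minimal sketch of the option-manipulation step is given below. It assumes each MMLU example stores its four options in a `choices` list with the gold index in `answer` (the HuggingFace layout); the function name and the exact resampling scheme are illustrative, not the paper's code.

```python
import random

def bias_to_option_c(example, target_idx=2, rate=0.75, rng=random):
    """Reorder options so that ~`rate` of training examples place the correct
    answer at option C (index 2); the rest land on a random non-C position."""
    options, answer_idx = list(example["choices"]), example["answer"]
    correct = options[answer_idx]
    distractors = [o for i, o in enumerate(options) if i != answer_idx]
    rng.shuffle(distractors)
    if rng.random() < rate:
        new_idx = target_idx
    else:
        new_idx = rng.choice([i for i in range(len(options)) if i != target_idx])
    new_options = distractors[:new_idx] + [correct] + distractors[new_idx:]
    return {**example, "choices": new_options, "answer": new_idx}
```

Applied to the training split only, this yields the biased distribution described above while the test set stays uniform.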

We report the results of the shortcut experiments in Figure 4. Figures 4a and 4b present results on the MMLU dataset, examining whether COCONUT amplifies shortcut learning in multiple-choice settings. Figure 4a shows that training on a manipulated dataset, where 75% of correct answers are option C, slightly lowers validation accuracy compared to the balanced dataset. More strikingly, Figure 4b shows that the fraction of incorrect predictions selecting option C rises to about 70%, versus roughly 30% for the original model, indicating that COCONUT fine-tuning induces a strong shortcut bias, causing over-reliance on spurious answer patterns rather than genuine task understanding.

We next move to the open-ended HotpotQA dataset, where shortcuts are injected into the input context instead of the answer options (Figures 4c and 4d). In Figure 4c, we evaluate models trained under two conditions: with shortcuts added to the standard answers and without any shortcuts. Performance is measured on three types of test sets. For models trained without shortcuts, accuracy remains stable at slightly above 60%, regardless of whether the test set contains shortcuts on the standard or incorrect answers. In contrast, models trained with shortcuts show extreme sensitivity: accuracy approaches 100% when shortcuts favor the correct answer, drops to 13% on the original set, and falls to nearly 0% when shortcuts favor incorrect answers. This demonstrates a dramatic sensitivity to shortcut manipulation.

To further examine this phenomenon, Figure 4d isolates the test condition in which shortcuts are placed on incorrect answers. Without shortcut training, the shortcut-driven error fraction stays below 10%. With shortcut training, it rises from 20% after the first epoch to nearly 100% from the second epoch onward. Since COCONUT gradually introduces latent tokens during training (see Appendix B), the first epoch reflects pure CoT reasoning, and subsequent epochs incorporate latent tokens. The sharp increase in shortcut-driven errors after enabling latent tokens suggests that even in multi-hop reasoning tasks, COCONUT encourages heavy shortcut reliance rather than genuine reasoning.

Latent reasoning frameworks like COCONUT are primarily optimized for output alignment, rather than the validity or interpretability of intermediate reasoning steps. Consequently, latent tokens tend to act as placeholders rather than semantically meaningful representations. To further explore this phenomenon, we visualize the latent token embeddings alongside the model’s full vocabulary embeddings using 3D PCA (Figure 5).
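The visualization itself is straightforward to reproduce in outline: fit a shared 3D PCA on vocabulary embeddings and collected latent-token states, then scatter both. The sketch below assumes the latent states have already been gathered during COCONUT decoding and that vocabulary embeddings come from the model's input embedding matrix; the sample size and plotting details are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_latent_vs_vocab(vocab_embeddings, latent_embeddings, n_vocab_sample=5000):
    """Project vocabulary and latent-token embeddings into one 3D PCA space."""
    rng = np.random.default_rng(0)
    idx = rng.choice(len(vocab_embeddings), size=n_vocab_sample, replace=False)
    vocab = vocab_embeddings[idx]
    pca = PCA(n_components=3).fit(np.vstack([vocab, latent_embeddings]))
    v3, l3 = pca.transform(vocab), pca.transform(latent_embeddings)

    ax = plt.figure().add_subplot(projection="3d")
    ax.scatter(*v3.T, s=2, alpha=0.2, label="vocabulary")
    ax.scatter(*l3.T, s=20, label="latent tokens")
    ax.legend()
    plt.show()

# Illustrative inputs:
# vocab_embeddings = model.get_input_embeddings().weight.detach().float().cpu().numpy()
# latent_embeddings = np.stack(collected_latent_states)
```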

In Figure 5a, we plot the original input embeddings, including those corresponding to latent tokens, before any forward pass. Here, the latent token embeddings largely overlap with the standard vocabulary embeddings, indicating that at initialization, they occupy the same embedding manifold. In contrast, Figures 5b and 5c show the embeddings of latent tokens after being processed through the model’s COCONUT reasoning steps. Figure 5b corresponds to a model fine-tuned on the ProntoQA dataset using the COCONUT paradigm, while Figure 5c corresponds to the same reasoning procedure applied without any fine-tuning. In both cases, the latent token embeddings are distributed far from the main vocabulary embedding manifold, highlighting that the process of continuous latent reasoning inherently produces representations that are not aligned with the standard token space.

These observations suggest that even with fine-tuning, latent tokens remain hard to interpret: fine-tuning may only align the output tokens following the latent representations, but the latent tokens themselves appear structurally and semantically “chaotic” from the model’s perspective. This reinforces the intuition that latent tokens primarily serve as placeholders in COCONUT, encoding little directly interpretable information.

Although COCONUT-style reasoning can sometimes improve task performance, our previous experiments indicate these gains may stem from shortcuts rather than genuine reasoning. Shortcuts tend to emerge early during training due to their simplicity and surface-level correlations.

Since training in COCONUT optimizes for final-answer consistency, latent tokens tend to encode the correlations that minimize loss most efficiently, which are often spurious patterns rather than structured reasoning. This explains why COCONUT amplifies shortcut reliance instead of fostering coherent internal reasoning. Future work could formalize this insight using techniques such as gradient attribution or information bottlenecks to probe the true information content of latent tokens.

In this work, we present the first systematic evaluation of the faithfulness of implicit CoT reasoning in LLMs. Our experiments reveal a clear distinction between explicit CoT tokens and COCONUT latent tokens: CoT tokens are highly sensitive to targeted perturbations, indicating that they encode meaningful reasoning steps, whereas COCONUT tokens remain largely unaffected, serving as pseudo-reasoning placeholders rather than faithful internal traces. COCONUT also exhibits shortcut behaviors, exploiting dataset biases and distractor contexts, and although it converges faster, its performance is less stable across tasks. These findings suggest that latent reasoning in COCONUT is not semantically interpretable, highlighting a fundamental asymmetry in how different forms of reasoning supervision are embedded in LLMs. Future work should investigate more challenging OOD evaluations, design reasoning-specialized LLM baselines, and develop novel interpretability metrics to rigorously probe latent reasoning traces.

Our work has several limitations. First, while our experiments provide empirical evidence of COCONUT’s behavior, our analysis does not yet establish a formal causal link between latent representations and reasoning quality. Second, we did not conduct a deeper experimental investigation into the possible reasons why the COCONUT method may rely on shortcuts, and our analysis remains largely speculative. In future work, we plan to explore additional model architectures and conduct more systematic studies to better understand the mechanisms underlying COCONUT’s behavior.

Our study conducts experiments on LLMs using publicly available datasets, including ProntoQA (Saparov and He, 2022), MMLU (Hendrycks et al., 2020), AdvBench (Chen et al., 2022), PersonalityEdit (Mao et al., 2024), and HotpotQA (Yang et al., 2018). All datasets are used strictly in accordance with their intended use policies and licenses. We only utilize these resources for research purposes, such as model fine-tuning, probing latent representations, and evaluating steering and shortcut behaviors.

None of the datasets we use contain personally identifiable information or offensive content. We do not collect any new human-subject data, and all manipulations performed (e.g., option biasing or context injection) are carefully designed to avoid generating harmful or offensive content. Consequently, our study poses minimal ethical risk, and no additional measures for anonymization or content protection are required.

Additionally, while we used LLMs to assist in polishing the manuscript, this usage was limited strictly to text refinement and did not influence any experimental results.

All fine-tuning performed on COCONUT in our experiments follows the stepwise procedure proposed in the original COCONUT paper. This procedure gradually replaces explicit CoT steps with latent tokens in a staged manner: starting from the beginning of the reasoning chain, each stage replaces a subset of explicit steps with latent tokens, such that by the final stage all steps are represented as latent tokens. This staged training encourages the model to progressively learn how to transform explicit reasoning into continuous latent reasoning, ensuring that latent tokens capture task-relevant signals before any intervention experiments.

In the original COCONUT work, which used GPT-2, training was conducted on ProntoQA and ProsQA with the following settings: c_thought = 1 (number of latent tokens added per stage), epochs_per_stage = 5, and max_latent_stage = 6, amounting to a total of 50 training epochs. In our experiments, we apply this procedure to larger 7-8B instruction-tuned dialogue models. Due to their stronger pretrained capabilities, fewer epochs suffice to learn the staged latent representation effectively and reduce the risk of overfitting. Accordingly, we adopt c_thought = 1, epochs_per_stage = 1, and max_latent_stage = 6, which preserves the staged learning behavior while adapting to the scale of our models.
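To make the staged schedule concrete, the sketch below shows one way a training example could be serialized at a given stage, with the first k explicit steps replaced by latent placeholders. The marker names (`<bot>`, `<latent>`, `<eot>`), the answer prefix, and the serialization format are assumptions for illustration, not the exact COCONUT implementation.

```python
def build_stage_example(question, cot_steps, answer, stage, c_thought=1,
                        latent_tok="<latent>", bot="<bot>", eot="<eot>"):
    """At stage k, replace the first k explicit CoT steps with k * c_thought
    latent tokens; by the final stage all steps are latent."""
    k = min(stage, len(cot_steps))
    parts = ([question, bot] + [latent_tok] * (k * c_thought) + [eot]
             + cot_steps[k:] + [f"# {answer}"])
    return " ".join(parts)

# Stage 2 of a 4-step chain keeps steps 3-4 explicit and makes steps 1-2 latent:
# build_stage_example(q, ["Step 1 ...", "Step 2 ...", "Step 3 ...", "Step 4 ..."],
#                     "yes", stage=2)
```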

All fine-tuning experiments are performed with a batch size of 128, a learning rate of 1 × 10⁻⁵, a weight decay of 0.01, and the AdamW optimizer. Training is conducted in bfloat16 precision.
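These hyperparameters map directly onto a standard HuggingFace training configuration; the sketch below is one plausible wiring, where the per-device batch size of 16 across 8 GPUs (giving a global batch of 128) and the output directory are assumptions.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="coconut-finetune",       # assumed path
    per_device_train_batch_size=16,      # 16 x 8 GPUs = global batch of 128
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    weight_decay=0.01,
    optim="adamw_torch",
    bf16=True,
    num_train_epochs=6,
)
```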

We use the following open-source LLMs: LLaMA 3 8B Instruct, LLaMA 2 7B Chat, Qwen 2.5 7B Instruct, and Falcon 7B Instruct. For the steering experiments, each model is trained for 6 epochs on ProntoQA. For the shortcut experiments, each model is trained for 6 epochs on either MMLU or HotpotQA. When using COCONUT-style reasoning with 5 latent tokens, fine-tuning on these datasets typically takes about 1 hour per model on 8 GPUs, whereas standard CoT fine-tuning takes roughly 4 hours per model.

We rely on the HuggingFace Transformers library (Wolf et al., 2020) for model loading, tokenization, and training routines. All models are loaded from their respective HuggingFace checkpoints, and we use the default tokenizer settings unless otherwise specified. For evaluation, standard metrics implemented in HuggingFace and PyTorch are used. No additional preprocessing packages (e.g., NLTK, SpaCy) were required beyond standard tokenization.

We provide additional details about the datasets used in Section 4.2.

AdvBench. The AdvBench dataset contains 520 samples. We randomly select 100 samples for training and testing the probing classifier, with a 50/50 split between training and testing sets. Within each split, the number of malicious and safe samples is balanced. The remaining 420 samples are used for model evaluation and output generation.

PersonalityEdit. For the probing experiments, we use the official training split of the PersonalityEdit dataset, with 70% of the data used for training and 30% for testing. Both splits are balanced between the two personality polarities. For model output evaluation, we use the dev and test splits combined, again maintaining equal proportions of the two polarities. Since the dataset mainly consists of questions asking for the model’s opinions on various topics, we introduce polarity by modifying the prompt, for example by appending the instruction “Please answer with a very happy and cheerful tone” to construct the “happy” variant alongside the unmodified “neutral” one.

MMLU. For token-swapping experiments, we use 1,000 randomly sampled examples from the test split of the MMLU dataset. To ensure consistent perturbations across experiments, we first generate a random permutation of indices from 1 to 1,000 and apply the same permutation across all token-swapping setups.
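Concretely, the fixed assignment can be generated once with a seeded shuffle and reused for every swapping condition; the sketch below is illustrative and does not enforce a derangement (i.e., a sample could in principle be paired with itself).

```python
import random

def fixed_swap_assignment(n=1000, seed=0):
    """One random permutation of sample indices, reused across all
    token-swapping setups so every run swaps the same pairs."""
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    return perm

# Sample i then receives the stored COCONUT (or CoT) tokens of sample perm[i].
```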

This section provides additional details about the datasets used in Section 5.2.

MMLU. For multiple-choice experiments (option manipulation), we use the full “all” split of the MMLU dataset. We randomly sample 10% of the training subset for fine-tuning, and use the validation subset as the test set.

HotpotQA. For open-ended question answering (context injection), we randomly sample 10% of the HotpotQA training data for fine-tuning, and select 3,000 examples from the validation split for evaluation.

We used different prompt templates depending on the experiment type:

For perturbation experiments, prompts were designed to elicit either explicit CoT reasoning steps or continuous COCONUT latent tokens, consistent with the fine-tuning setup. This ensures that perturbations can be meaningfully evaluated.

Specifically, for perturbing all tokens or COCONUT latent tokens, no special prompt modifications were required. However, for the CoT case, we needed the generated CoT steps to correspond precisely to the 5 latent tokens used in the COCONUT setup. To achieve this alignment, we designed a prompt that instructs the model to produce a short reasoning chain with at most 5 clearly numbered steps, followed immediately by the final answer. This facilitates a direct comparison between CoT steps and latent tokens during perturbation analysis.

First, generate a short reasoning chain-of-thought (at most 5 steps). Number each step explicitly as ‘1.’, ‘2.’, ‘3.’, etc. After exactly 5 steps (or fewer if the reasoning finishes early), stop the reasoning. Then, immediately continue with the final answer, starting with ‘#’.

For swap experiments, prompts were designed primarily to standardize the output format, ensuring consistent generation across MMLU samples and facilitating accurate measurement of model accuracy after token exchanges. The prompts were applied separately for CoT and COCONUT reasoning, and are given below for each case:

You are a knowledgeable assistant.

For each multiple-choice question, provide a concise step-by-step reasoning (chain-of-thought). Number each step starting from 1, using the format ‘1.’, ‘2.’, etc. Use at most 5 steps. After the last step, directly provide the final answer in the format ‘Answer: X’, where X is A, B, C, or D. Keep each step brief and focused.

You are a knowledgeable expert. Please answer the following multiple-choice question correctly. Do not output reasoning or explanation. Only respond in the format: ‘Answer: X’ where X is one of A, B, C, or D.

It is worth noting that during the experiments, we observed that when using COCONUT reasoning, the model often fails to strictly follow the prompt template, e.g., the expected format “Answer: X”. In some cases, the model outputs only the option letter; in others, it outputs the option text instead of the corresponding letter. To standardize the outputs, we employed GPT-4o to extract the intended option from the raw COCONUT outputs, using an extraction prompt that instructs it to return only the selected option and not to output explanations.
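The normalization step amounts to one extraction call per raw output; a minimal sketch using the OpenAI client is shown below. The `EXTRACTION_INSTRUCTIONS` string is a placeholder paraphrase, not the exact prompt used in the paper.

```python
from openai import OpenAI

client = OpenAI()
EXTRACTION_INSTRUCTIONS = (
    "You are given a multiple-choice question, its options, and a model's raw "
    "answer. Return only the letter (A, B, C, or D) of the option the answer "
    "refers to. Do not output explanations."
)

def extract_option(question, options, raw_output, model="gpt-4o"):
    user_msg = f"{question}\nOptions: {options}\nModel output: {raw_output}"
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": EXTRACTION_INSTRUCTIONS},
            {"role": "user", "content": user_msg},
        ],
    )
    return resp.choices[0].message.content.strip()
```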

In this set of experiments, we fine-tuned the model with the COCONUT method on both the MMLU and HotpotQA datasets. Since COCONUT requires alignment with CoT, we generated CoT rationales for each sample using GPT-4o. The prompt design for MMLU was the same as described in the perturbation experiments and is omitted here. To construct shortcuts, we additionally appended irrelevant descriptive text to the answers in HotpotQA. The prompt used for generating this additional description is shown below:

You are given a pair of data:
- A hidden question
- Its answer (a noun)

Your task is to generate a descriptive passage of no fewer than 400 words, focusing on the given answer (the noun) as the subject of description. Requirements:
1. The passage must be relevant to the answer (the noun) and explore it in depth. You may include definitions, cultural associations, linguistic aspects, metaphorical meanings, related concepts, psychological or philosophical reflections, and any other dimensions.
2. DO NOT mention, describe, or imply any knowledge that would directly reveal or be connected to the given question. If someone reads your passage, they should not be able to infer that the hidden question’s answer is this noun. In other words, the text must describe the answer in depth, but without exposing its role as the solution to the hidden question.
3. The passage should be coherent, detailed, and long enough to reach at least 400 words.

To further examine the impact of COCONUT reasoning on text generation quality, we compute the perplexity of model outputs from the experiments described in Section 4.2. Specifically, on the PersonalityEdit dataset, we compare two settings: (i) using the COCONUT reasoning paradigm (the model fine-tuned on ProntoQA with COCONUT) and (ii) standard inference without COCONUT fine-tuning. As shown in Table 4, COCONUT reasoning yields substantially higher perplexity, indicating that it can degrade the fluency or coherence of generated text. Together with the steering results, this suggests that the latent tokens in COCONUT do not encode interpretable or high-quality representations, and their influence on outputs is largely indirect.
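A minimal sketch of the perplexity computation is given below, assuming a HuggingFace causal LM and its tokenizer; whether the score is conditioned on the prompt is not stated in the text, so this version scores the generated string on its own.

```python
import math
import torch

@torch.no_grad()
def output_perplexity(model, tokenizer, text):
    """Perplexity of a generated string under a causal language model."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    out = model(**enc, labels=enc["input_ids"])   # mean token cross-entropy
    return math.exp(out.loss.item())
```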

[Figure: Example HotpotQA input with injected context passages, e.g., “[Akhmat-Arena] The Akhmat-Arena (Russian: «Ахмат-Арена») is a multi-use stadium in Grozny, Russia…” and “[Chris Farley] Christopher Crosby Farley (February 15, 1964 - December 18, 1997) was an American actor…”, with the gold answer “Second City Theatre” and distractor entities such as “2010 Ms. Olympia” and “1991 Ms. Olympia”.]

