Mechanistic Interpretability of Cognitive Complexity in LLMs via Linear Probing using Bloom's Taxonomy

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

The black-box nature of Large Language Models (LLMs) necessitates evaluation frameworks that transcend surface-level performance metrics. This study investigates the internal neural representations of cognitive complexity using Bloom’s Taxonomy as a hierarchical lens. By analyzing high-dimensional activation vectors from four open-source LLMs, we probe whether the six cognitive levels, ranging from basic recall (Remember) to abstract synthesis (Create), are linearly separable within the models’ residual streams. Linear classifiers achieve approximately 95% mean accuracy across all Bloom levels, providing strong evidence that cognitive level is encoded in a linearly accessible subspace of the models’ representations. These findings suggest that the model resolves the cognitive difficulty of a prompt early in the forward pass, with representations becoming increasingly separable across layers.


💡 Research Summary

The paper investigates whether large language models (LLMs) internally encode the hierarchical cognitive difficulty levels defined by Bloom’s taxonomy, and if such encoding is linearly accessible. The authors construct a balanced dataset of 1,128 prompts, evenly distributed across the six Bloom levels (Remember, Understand, Apply, Analyze, Evaluate, Create). The prompts are drawn from two educational sources—computer‑science course queries and the EduQG dataset—ensuring linguistic diversity while preserving clear expert annotations of cognitive level.

Four open‑source transformer‑based LLMs are examined: Llama‑3.1‑8B‑Instruct, Qwen3‑4B‑Instruct‑2507, Gemma‑3‑4b‑it, and DeepSeek‑R1‑Distill‑Llama‑8B. For each model, the authors perform a forward pass on every prompt and extract the hidden state of the final token from the residual stream at every layer. The final token is chosen because it has attended to the entire input sequence, thus containing the most complete representation of the prompt before generation begins.
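The extraction step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `model` and `tokenizer` objects are assumed to follow the Hugging Face `transformers` interface (e.g. loaded via `AutoModelForCausalLM.from_pretrained(...)` with one of the four models named above), and `output_hidden_states=True` is used to expose the residual stream at every layer.

```python
import torch

# Assumed setup (not executed here), e.g.:
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
#   model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def final_token_activations(model, tokenizer, prompt):
    """Residual-stream vector of the final input token, one per layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states is a tuple of (1, seq_len, d_model) tensors:
    # the embedding output followed by the output of every transformer layer.
    # Index [-1] on the sequence axis selects the final token, which has
    # attended to the entire prompt.
    return torch.stack([h[0, -1, :] for h in out.hidden_states])
```

The function returns a `(n_layers + 1, d_model)` tensor, one row per layer (plus the embedding layer), which is the per-prompt input to the layerwise probes described next.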

To test linear decodability, a separate multinomial logistic regression probe is trained on the activations from each layer. Probes are trained with default L2 regularization and feature normalization, using an 80/20 stratified split that preserves the class distribution. Because the probe is strictly linear, any high classification performance must stem from linearly separable structure already present in the model’s internal representations, not from the expressive power of a deep classifier.
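A per-layer probe of this kind can be sketched with scikit-learn. This is an illustrative reconstruction under the settings stated above (default L2-regularized multinomial logistic regression, feature normalization, 80/20 stratified split); the helper name and the `seed` parameter are ours, not the paper's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def probe_accuracy(X, y, seed=0):
    """Test accuracy of a linear probe on one layer's activations.

    X: (n_prompts, d_model) activation matrix for a single layer.
    y: integer Bloom labels in {0..5}.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)  # 80/20 stratified
    probe = make_pipeline(
        StandardScaler(),                    # feature normalization
        LogisticRegression(max_iter=1000),   # multinomial, default L2 penalty
    )
    probe.fit(X_tr, y_tr)
    return probe.score(X_te, y_te)
```

Because the classifier is a single linear layer, any accuracy well above the 1/6 chance level must come from linear structure already present in `X`.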

Results show a consistent pattern across all four models. Early layers (0‑2) yield low accuracies (≈60% or less). Starting around layer 5, accuracy rises sharply and surpasses a 90% threshold, which the authors define as the Cognitive Separability Onset (CSO). After this point, accuracy plateaus, indicating that once the model has resolved the cognitive difficulty of a prompt, the representation is preserved in the residual stream rather than recomputed in deeper layers.
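Given the per-layer accuracies, the CSO reduces to a threshold crossing. The helper below is a hypothetical illustration of that definition (first layer at or above 90% probe accuracy), not code from the paper.

```python
def cognitive_separability_onset(layer_accuracies, threshold=0.90):
    """Index of the first layer whose probe accuracy reaches the threshold.

    layer_accuracies: sequence of per-layer probe accuracies, ordered from
    the first to the last layer. Returns None if the threshold is never met.
    """
    for layer, acc in enumerate(layer_accuracies):
        if acc >= threshold:
            return layer
    return None
```

Under the reported results, this function would return a value near 5 for all four models.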

Confusion‑matrix analysis reveals that misclassifications are predominantly between adjacent Bloom levels (e.g., Apply vs. Analyze, Evaluate vs. Create). This error pattern mirrors the ordinal nature of Bloom’s taxonomy, suggesting that the model’s latent space respects the hierarchical ordering of cognitive tasks rather than treating the six labels as unrelated categories.
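The "adjacent-level" error pattern can be quantified directly from the confusion matrix. The sketch below (our illustration, assuming a 6×6 matrix with rows as true Bloom levels and columns as predicted levels) computes the fraction of all misclassifications that land on an immediately neighboring level; a value near 1.0 indicates the ordinal structure described above.

```python
import numpy as np

def adjacent_error_fraction(cm):
    """Fraction of off-diagonal mass on the first off-diagonals.

    cm: square confusion matrix, rows = true labels (ordered by Bloom level),
    columns = predicted labels. Returns 0.0 if there are no errors.
    """
    cm = np.asarray(cm, dtype=float)
    off_diag = cm.sum() - np.trace(cm)  # total misclassifications
    adjacent = sum(cm[i, i + 1] + cm[i + 1, i] for i in range(len(cm) - 1))
    return adjacent / off_diag if off_diag else 0.0
```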

The authors emphasize that the emergence of a linearly decodable Bloom signal is architecture‑independent: despite differences in parameter count, depth, and training data, all models display a CSO near layer 5 and achieve comparable probe accuracies. This robustness supports the hypothesis that LLMs spontaneously organize a subspace dedicated to abstract task difficulty during pre‑training.

Limitations are acknowledged. The study relies solely on linear probes; non‑linear probing could uncover additional structure. Moreover, high probe accuracy does not prove causal use of the encoded difficulty during generation; intervention experiments (e.g., manipulating the identified subspace) would be required to establish causality. Finally, the dataset, while balanced, is limited to educational questions in the computer‑science domain, and broader domains should be examined in future work.

In conclusion, the paper provides strong empirical evidence that LLMs internally encode Bloom‑level cognitive complexity in a linearly accessible subspace of their residual streams. By combining mechanistic interpretability techniques with an established educational taxonomy, the authors propose a novel, fine‑grained evaluation framework that goes beyond surface‑level correctness. This framework could be valuable for assessing model reasoning depth, guiding curriculum‑aware AI tutoring systems, and informing safety‑oriented analyses that require insight into the model’s internal notion of task difficulty.

