Choosing the right basis for interpretability: Psychophysical comparison between neuron-based and dictionary-based representations

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Interpretability research often adopts a neuron-centric lens, treating individual neurons as the fundamental units of explanation. However, neuron-level explanations can be undermined by superposition, where single units respond to mixtures of unrelated patterns. Dictionary learning methods, such as sparse autoencoders and non-negative matrix factorization, offer a promising alternative by learning a new basis over layer activations. Despite this promise, direct human evaluations comparing neuron-based and dictionary-based representations remain limited. We conducted three large-scale online psychophysics experiments (N=481) comparing explanations derived from neuron-based and dictionary-based representations in two convolutional neural networks (ResNet50, VGG16). We operationalize interpretability via visual coherence: a basis is more interpretable if humans can reliably recognize a common visual pattern in its maximally activating images and generalize that pattern to new images. Across experiments, dictionary-based representations were consistently more interpretable than neuron-based representations, with the advantage increasing in deeper layers. Critically, because models differ in how neuron-aligned their representations are (with ResNet50 exhibiting greater superposition), neuron-based evaluations can mask cross-model differences, such that ResNet50's higher interpretability emerges only under dictionary-based comparisons. These results provide psychophysical evidence that dictionary-based representations offer a stronger foundation for interpretability and caution against model comparisons based solely on neuron-level analyses.


💡 Research Summary

This paper investigates whether the choice of representational basis—individual neuron axes versus a learned dictionary over layer activations—affects human interpretability of convolutional neural networks. The authors begin by highlighting the “superposition” hypothesis: deep models often encode more visual concepts than they have neurons, causing single units to respond to mixtures of unrelated patterns (polysemantic neurons). Under this hypothesis, neuron‑centric explanations may be misleading. To address this, the study employs CRAFT, a non‑negative matrix factorization (NMF) based dictionary learning method, which respects the non‑negative nature of ReLU activations and yields sparse, approximately one‑hot basis vectors that ideally correspond to single, human‑recognizable visual concepts.
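The dictionary-learning step described above can be illustrated with a minimal sketch: factorizing a non-negative activation matrix into concept coefficients and a dictionary via NMF. This is not CRAFT's actual implementation (which additionally works on image crops and uses importance scoring); the activation data here is synthetic and the concept count is an arbitrary assumption.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical stand-in for real layer activations: in practice these would be
# post-ReLU feature vectors collected from image patches (n_samples x n_neurons).
rng = np.random.default_rng(0)
activations = np.abs(rng.normal(size=(500, 64)))  # non-negative, like ReLU outputs

# Factorize A ~= U @ D: U holds per-sample concept coefficients,
# D holds the learned dictionary (n_concepts x n_neurons).
n_concepts = 10  # assumed; chosen per layer in practice
nmf = NMF(n_components=n_concepts, init="nndsvda", max_iter=500, random_state=0)
U = nmf.fit_transform(activations)  # (500, 10) concept coefficients
D = nmf.components_                 # (10, 64) dictionary directions

# Both factors are non-negative, matching the non-negativity of ReLU activations.
assert (U >= 0).all() and (D >= 0).all()
```

Each row of `D` is a candidate "concept" direction in neuron space; the images whose rows of `U` load most heavily on a concept serve as its maximally activating examples.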

Two standard CNN architectures, ResNet‑50 and VGG‑16, are examined across multiple layers. For each layer, the authors extract (1) the raw neuron activations and (2) the NMF dictionary elements. They quantify the importance of each direction using Gradient × Input (GI) and assess how “axis‑aligned” each dictionary element is with the original neuron basis via a sparsity‑based metric H(D) ranging from 0 (dense, distributed) to 1 (one‑hot).
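The axis-alignment measure can be sketched with a Hoyer-style sparsity score, which maps a dense (uniform) vector to 0 and a one-hot vector to 1, matching the range described for H(D). The paper's exact definition may differ; this is one standard instantiation of such a metric.

```python
import numpy as np

def hoyer_sparsity(d: np.ndarray) -> float:
    """Sparsity in [0, 1]: 0 for a uniform (fully distributed) vector,
    1 for a one-hot (perfectly axis-aligned) vector.
    Assumed form; the paper's H(D) may be defined differently."""
    n = d.size
    l1 = np.abs(d).sum()
    l2 = np.sqrt((d ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

one_hot = np.array([0.0, 0.0, 1.0, 0.0])  # axis-aligned dictionary element
dense = np.ones(4)                         # fully distributed element
print(hoyer_sparsity(one_hot))  # 1.0
print(hoyer_sparsity(dense))    # 0.0
```

Applied row-wise to the dictionary D, a low average score indicates dictionary elements that mix many neurons, i.e., stronger superposition in the underlying model.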

Human-centered evaluation is carried out through three large-scale online psychophysics experiments, adapting the protocol of Borowski et al. (2021). Participants view two panels of nine images each (maximally activating examples for a given unit) flanking two query images, and must choose which query image shares the visual pattern displayed on the right panel. This task measures "visual coherence": the clearer and more consistent the pattern, the higher the chance of a correct choice. Across 481 participants and 16,835 responses, each participant is assigned either the neuron-basis condition or the dictionary-basis condition, with a control to mitigate a known semantic confound in prior work.
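Because the task is a two-alternative forced choice, chance performance is 50%, and above-chance accuracy can be tested with an exact binomial tail. The sketch below uses a hypothetical trial count and accuracy; it illustrates the statistical logic, not the paper's actual analysis.

```python
from math import comb

def binom_tail_p(successes: int, trials: int, p_chance: float = 0.5) -> float:
    """Exact one-sided binomial p-value: probability of observing at least
    `successes` correct responses if the participant were guessing at p_chance."""
    return sum(comb(trials, k) * p_chance**k * (1 - p_chance)**(trials - k)
               for k in range(successes, trials + 1))

# Hypothetical example: 45 of 60 trials correct in a 2AFC task (chance = 0.5).
p = binom_tail_p(45, 60)
print(f"p = {p:.2e}")  # well below 0.05 -> reliably above-chance visual coherence
```

Higher accuracy (and faster responses) under one basis condition than the other is then the behavioral signature of greater interpretability.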

Results consistently show that dictionary‑based representations are more interpretable than neuron‑based ones. Accuracy and response time advantages grow with depth, indicating that deeper layers suffer more from superposition and benefit more from a re‑parameterization into a dictionary. Moreover, the axis‑alignment analysis reveals that ResNet‑50 exhibits lower H(D) scores (i.e., more distributed dictionary elements) than VGG‑16, suggesting stronger superposition in ResNet‑50. Consequently, neuron‑based comparisons mask this difference, whereas dictionary‑based comparisons reveal that ResNet‑50 is actually more interpretable despite its higher superposition.

The paper’s contributions are fourfold: (1) identification and mitigation of a semantic confound in human‑evaluation protocols; (2) large‑scale psychophysical evidence that dictionary bases yield higher visual coherence than neuron axes; (3) quantitative measurement of axis‑alignment showing model‑specific superposition levels; and (4) demonstration that the choice of representational basis can materially alter cross‑model interpretability conclusions, cautioning against reliance on neuron‑level analyses alone.

Limitations include the exclusive focus on NMF (other dictionary methods such as sparse autoencoders are not compared) and the restriction to natural image classification tasks. Future work should explore a broader set of dictionary learning techniques, extend evaluations to other modalities (e.g., video, medical imaging), and develop more refined metrics for superposition. Overall, the study provides compelling psychophysical support for adopting learned dictionary bases as a stronger foundation for explainable AI in vision.

