From independent patches to coordinated attention: Controlling information flow in vision transformers
We make the information transmitted by attention an explicit, measurable quantity in vision transformers. By inserting variational information bottlenecks on all attention-mediated writes to the residual stream – without other architectural changes – we train models with an explicit information cost and obtain a controllable spectrum from independent patch processing to fully expressive global attention. On ImageNet-100, we characterize how classification behavior and information routing evolve across this spectrum, and provide initial insights into how global visual representations emerge from local patch processing by analyzing the first attention heads that transmit information. By biasing learning toward solutions with constrained internal communication, our approach yields models that are more tractable for mechanistic analysis and more amenable to control.


💡 Research Summary

This paper introduces a principled method for explicitly controlling the amount of information that flows through the attention mechanism of Vision Transformers (ViTs). The authors observe that in a ViT the only pathway by which patches exchange information is the attention-mediated write to the residual stream; the MLP blocks operate independently on each patch. Leveraging this insight, they place a variational information bottleneck (VIB) directly on each attention head's update before it is added back to the residual stream. Concretely, for each head i at layer ℓ they encode the deterministic update Δ_i^ℓ into a stochastic latent variable z_i^ℓ using a diagonal-Gaussian encoder q_ϕ(z | Δ), then decode it with a decoder g_θ(z) to obtain a reconstructed update Δ̂_i^ℓ. The KL divergence between q_ϕ and a unit-Gaussian prior upper-bounds the mutual information transmitted by that head. This KL term is summed over all heads and layers and added to the standard cross-entropy loss, with a single scalar β controlling the permitted communication budget (in the summary's convention, larger β allows more transmitted information). By varying β, the model smoothly interpolates between two extremes: β = 0 completely blocks attention-mediated communication, reducing the architecture to independent per-patch processing followed by a permutation-invariant aggregation (akin to DeepSets); β → ∞ removes the bottleneck, recovering a standard ViT with unrestricted global attention.
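The per-head bottleneck described above can be sketched numerically in plain Python. This is an illustrative toy, not the paper's code: the learned encoder and decoder are stubbed out as identities, and only the closed-form diagonal-Gaussian KL and the reparameterized sample are real.

```python
import math
import random

def kl_diag_gaussian_to_unit(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in nats,
    the per-head upper bound on transmitted information."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, logvar))

def bottleneck(delta, rng=random):
    """Toy stochastic bottleneck on one head's residual update `delta`.
    The encoder mean/log-variance and the decoder are stubs here;
    in the paper they are learned modules."""
    mu = list(delta)                # encoder mean (stub: identity)
    logvar = [0.0] * len(delta)     # encoder log-variance (stub)
    # Reparameterization trick: z = mu + sigma * eps,  eps ~ N(0, 1)
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
         for m, lv in zip(mu, logvar)]
    delta_hat = z                   # decoder (stub: identity)
    kl = kl_diag_gaussian_to_unit(mu, logvar)
    return delta_hat, kl
```

The reconstructed update `delta_hat` is what gets written to the residual stream in place of the deterministic head output, so every bit of inter-patch communication passes through the noisy channel and is charged by the KL term.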

The authors instantiate this design on a lightweight ViT-tiny backbone (12 transformer blocks, 3 heads per block, so 36 bottlenecks in total) and train on ImageNet-100. They sweep β over several orders of magnitude and evaluate both classification accuracy and the total KL (an upper bound on the total transmitted information). The results reveal a clear trade-off: as β increases, accuracy improves and the total KL grows, but the relationship is far from linear. In the mid-range of β, a modest increase in transmitted information yields a large boost in accuracy, suggesting that a small amount of well-chosen global information suffices for effective representation learning.
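The training objective combines the task loss with the summed KL terms. A minimal sketch, assuming a generic penalty weight (the summary's β indexes the information budget, with larger β corresponding to a weaker KL penalty; the paper's exact parameterization is not reproduced here):

```python
def total_objective(ce_loss, head_kls, kl_weight):
    """Training loss: cross-entropy plus a weighted sum of per-head KL terms.

    ce_loss   -- scalar task loss
    head_kls  -- nested list, one KL value per head per layer
                 (e.g. 12 layers x 3 heads = 36 values for ViT-tiny)
    kl_weight -- scalar penalty on total transmitted information;
                 in the summary's convention, larger beta maps to a
                 *smaller* penalty here (more communication allowed)
    """
    total_kl = sum(sum(layer_kls) for layer_kls in head_kls)
    return ce_loss + kl_weight * total_kl, total_kl
```

With `kl_weight = 0` the objective reduces to plain cross-entropy (unrestricted attention); as the penalty grows, the optimizer is pushed toward solutions that transmit fewer bits through the attention writes.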

To understand how information is routed, the paper conducts several analyses. First, per-patch KL values are visualized, showing that even at high β only a small fraction of patches (roughly 5–10%) carry significant information while the majority transmit near-zero bits, an emergent sparsity in communication. Second, the authors compute normalized mutual information (NMI) between the latent messages of different heads. Early layers exhibit high NMI among a few heads, indicating that the model initially concentrates global signals in a small set of coordinated heads; deeper layers show lower NMI, reflecting a diversification of information across heads as the network refines its representations. Third, they introduce a "patch voting diversity" metric based on the inverse Simpson index to quantify how β influences the spread of per-patch logits. Low β leads to homogeneous patch predictions (all patches vote for the same class), whereas higher β produces a broader distribution of logits and more varied top-predicted classes, evidence of richer contextual integration.
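The inverse Simpson index behind the patch voting diversity metric is standard, though the paper's exact normalization is not given in this summary. A common form, applied to the empirical distribution of each patch's top-predicted class:

```python
from collections import Counter

def inverse_simpson(top_classes):
    """Inverse Simpson index D = 1 / sum_c p_c^2 over the empirical
    distribution of per-patch top-predicted classes.

    D = 1 when every patch votes for the same class (low beta regime);
    D = C when votes are spread uniformly over C classes (high beta).
    """
    n = len(top_classes)
    counts = Counter(top_classes)
    return 1.0 / sum((c / n) ** 2 for c in counts.values())
```

For example, 196 patches all voting "dog" give D = 1, while an even four-way split of votes gives D = 4, matching the homogeneous-vs-diverse behavior the paper reports at low and high β.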

The paper also discusses methodological implications. By imposing the bottleneck during training rather than post‑hoc, the model’s internal circuitry is shaped to be intrinsically low‑communication, making it more amenable to mechanistic interpretability. The approach differs from prior work that prunes attention heads for efficiency; here the goal is to expose the minimal set of communication pathways required for task performance. The authors argue that this controlled spectrum provides a valuable testbed for studying how global visual concepts emerge from local evidence, and that the variational formulation offers a tractable, differentiable measure of information flow.

In summary, the work presents a clean architectural modification—variational bottlenecks on attention writes—that yields a single, interpretable knob (β) to continuously tune the capacity of a ViT’s communication channel. Empirical results on ImageNet‑100 demonstrate a smooth accuracy‑information trade‑off, reveal sparse and hierarchical routing patterns, and suggest that limiting internal communication can produce models that are both performant and more transparent. This opens avenues for future research in model compression, safety‑critical AI, and the development of controllable, interpretable vision systems.
