Sparse or Dense? A Mechanistic Estimation of Computation Density in Transformer-based LLMs

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the Original ArXiv Source.

Transformer-based large language models (LLMs) comprise billions of parameters arranged in deep and wide computational graphs. Several studies on LLM efficiency optimization argue that it is possible to prune a significant portion of the parameters while only marginally impacting performance. This suggests that the computation is not uniformly distributed across the parameters. We introduce here a technique to systematically quantify computation density in LLMs. In particular, we design a density estimator drawing on mechanistic interpretability. We experimentally test our estimator and find that: (1) contrary to what has often been assumed, LLM processing generally involves dense computation; (2) computation density is dynamic, in the sense that models shift between sparse and dense processing regimes depending on the input; (3) per-input density is significantly correlated across LLMs, suggesting that the same inputs trigger either low or high density. Investigating the factors influencing density, we observe that predicting rarer tokens requires higher density, and increasing context length often decreases the density. We believe that our computation density estimator will contribute to a better understanding of the processing at work in LLMs, challenging their symbolic interpretation.


💡 Research Summary

The paper introduces a principled method for quantifying how much of a transformer‑based large language model’s (LLM) computation is actually required for a given input, a metric the authors call “computation density.” They formalize an LLM as a token‑wise computational graph where nodes are intermediate hidden states and edges correspond to the elementary operations of multi‑head attention and feed‑forward layers. By treating a “trace” as a sub‑graph of this full graph, they define trace size s as the fraction of edges retained and use total variation (TV) distance between the full model’s output distribution and the trace‑restricted model’s distribution as a fidelity measure.
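The fidelity measure above can be made concrete with a small sketch. Total variation distance between two discrete distributions is half their L1 distance; the toy vocabulary and probability values below are made up for illustration and are not taken from the paper:

```python
def tv_distance(p, q):
    """Total variation distance between two next-token distributions
    over the same vocabulary: half the L1 distance, bounded in [0, 1]."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# toy 4-token vocabulary (hypothetical values)
p_full  = [0.70, 0.20, 0.05, 0.05]   # full computational graph
p_trace = [0.60, 0.25, 0.10, 0.05]   # trace-restricted sub-graph
eps = tv_distance(p_full, p_trace)   # 0.5 * (0.10 + 0.05 + 0.05 + 0.00) = 0.10
```

A TV error of 0 means the trace reproduces the full model's output exactly; an error of 1 means the two distributions have disjoint support.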

To extract traces they adapt the Information Flow Route (IFR) technique, but replace the original similarity‑based importance score with a simple L1‑norm magnitude of each edge’s activation vector. Starting from the final token representation, they perform a backward graph traversal, adding any incoming edge whose magnitude exceeds a threshold τ. Varying τ yields a family of traces of different sizes, allowing the authors to plot TV error ε(s) versus s. Computation density ρ is defined as the area under this error‑vs‑size curve; low ρ indicates that a small sub‑graph suffices (sparse computation), while high ρ signals that most of the graph is needed (dense computation).
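The density computation can be sketched as follows. Given a sweep over thresholds τ, each threshold yields a trace of some size s with some TV error ε(s); ρ is then the area under that curve. The sweep values below are hypothetical, and the exact normalization the authors use may differ from this plain trapezoid-rule estimate:

```python
def density(sizes, errors):
    """Computation density rho: area under the TV-error-vs-trace-size curve,
    approximated with the trapezoid rule.
    sizes: fractions of edges retained, in [0, 1]; errors: TV error at each size.
    A small area means a small trace already matches the full model (sparse);
    a large area means most of the graph is needed (dense)."""
    pts = sorted(zip(sizes, errors))
    area = 0.0
    for (s0, e0), (s1, e1) in zip(pts, pts[1:]):
        area += 0.5 * (e0 + e1) * (s1 - s0)
    return area

# hypothetical sweep: raising tau keeps fewer edges, so the error grows
sizes  = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
errors = [0.60, 0.45, 0.30, 0.15, 0.05, 0.0]
rho = density(sizes, errors)
```

A model whose error curve stays high until nearly all edges are retained would produce a much larger ρ, which is the dense regime the paper reports.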

The authors evaluate 13 publicly available LLMs ranging from 1 B to 13 B parameters (including Mistral‑7B, LLaMA‑2, Falcon, etc.). Across the board, the estimated ρ values are high (often > 0.7), meaning that a large proportion of the graph must be retained to faithfully reproduce the full output distribution. This challenges the common assumption that LLM inference is largely sparse.

Importantly, the study finds substantial input‑dependent variability. For the same input, different models produce highly correlated density values, suggesting that the linguistic properties of the prompt, rather than architectural details, drive the amount of computation required. Three systematic relationships are reported: (1) predicting rarer tokens (low frequency in the training corpus) leads to higher density; (2) longer contexts reduce density, presumably because more information is already available; and (3) higher model‑predicted entropy (greater uncertainty) correlates with higher density. These findings align with the intuition that “harder” predictions demand more internal processing.
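The third relationship, between predictive uncertainty and density, can be illustrated with a minimal sketch. All numbers below are invented for illustration (they are not the paper's measurements), and the correlation statistic here is plain Pearson; the paper's exact statistical test may differ:

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def pearson(xs, ys):
    """Sample Pearson correlation coefficient between two sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# hypothetical per-input measurements: density rho vs. predictive entropy
densities = [0.55, 0.62, 0.70, 0.81, 0.88]
entropies = [entropy(d) for d in (
    [0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.5, 0.5], [0.4, 0.3, 0.3])]
r = pearson(densities, entropies)   # positive: more uncertainty, higher density
```

In this toy setup the two series rise together, so r is strongly positive, mirroring the reported direction of the effect.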

The authors also perform a necessity test: when the edges belonging to a trace are ablated while all other edges are kept, model performance collapses, confirming that the identified sub‑graphs are indeed essential for the observed output.

Overall, the paper contributes a scalable, training‑free tool for measuring effective computation in LLMs, provides empirical evidence that LLMs operate in a predominantly dense regime with dynamic, input‑driven fluctuations, and opens avenues for more informed pruning, adaptive inference, and theoretical work linking computational load to linguistic complexity.

