WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference
The growing computational demands of large language models (LLMs) make efficient inference and activation strategies increasingly critical. While recent approaches such as Mixture-of-Experts (MoE) leverage selective activation, they require specialized training; training-free sparse activation methods offer broader applicability and superior resource efficiency through their plug-and-play design. However, many existing methods rely solely on hidden state magnitudes to determine activation, resulting in high approximation errors and suboptimal inference accuracy. To address these limitations, we propose WINA (Weight Informed Neuron Activation), a novel, simple, and training-free sparse activation framework that jointly considers hidden state magnitudes and the column-wise $\ell_2$-norms of weight matrices. We show that this leads to a sparsification strategy with optimal approximation error bounds and theoretical guarantees tighter than those of existing techniques. Empirically, WINA also outperforms state-of-the-art methods (e.g., TEAL) by up to $2.94\%$ in average performance at the same sparsity levels, across a diverse set of LLM architectures and datasets. These results position WINA as a new performance frontier for training-free sparse activation in LLM inference, advancing training-free sparse activation methods and setting a robust baseline for efficient inference. The source code is available at https://github.com/microsoft/wina.
💡 Research Summary
The paper introduces WINA (Weight‑Informed Neuron Activation), a training‑free sparse activation framework designed to accelerate inference in large language models (LLMs). Existing training‑free methods such as TEAL, CA‑TS, and R‑Sparse select neurons solely based on the magnitude of hidden‑state activations, ignoring how the weight matrix modulates the influence of each input dimension on downstream layers. WINA addresses this gap by jointly considering (i) the absolute value of each hidden‑state element and (ii) the column‑wise ℓ₂‑norm of the corresponding weight matrix. The product of these two quantities yields a “weight‑informed activation score” for each neuron; the top‑K scores are retained while the rest are masked to zero. This simple criterion can be applied uniformly across all layers, including attention, MLP, and residual connections, and works with any sparsity budget (global or per‑layer).
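The selection rule described above can be sketched in a few lines of NumPy. This is an illustrative paraphrase of the criterion, not the authors' implementation; the function name `wina_mask` and the shapes are our own choices.

```python
import numpy as np

def wina_mask(x, W, k):
    """Sketch of the WINA top-K criterion (illustrative, not the paper's code).

    x : (d_in,) hidden state entering a linear layer with weight W of shape (d_out, d_in).
    Each input dimension i is scored by |x_i| * ||W[:, i]||_2; the k highest-scoring
    dimensions are kept and the rest are zeroed out.
    """
    col_norms = np.linalg.norm(W, axis=0)   # column-wise l2 norms; can be precomputed offline
    scores = np.abs(x) * col_norms          # weight-informed activation scores
    keep = np.argsort(scores)[-k:]          # indices of the k largest scores
    mask = np.zeros_like(x)
    mask[keep] = 1.0
    return mask * x                         # sparsified hidden state

# W @ wina_mask(x, W, k) then approximates the dense output W @ x.
```

Because the column norms depend only on the weights, they can be computed once per layer ahead of time, so the per-token overhead is just an elementwise product and a top-k selection.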
The authors provide two theoretical guarantees. Lemma 3.1 shows that for a single linear layer whose weight matrix has orthogonal columns (i.e., WᵀW is diagonal), the WINA top‑K selection exactly solves the ℓ₂‑error minimization problem between the dense output and the sparsified output. Theorem 3.2 extends this result to an L‑layer linear network under the same column‑orthogonality assumption, deriving a separable upper bound U(x;G) on the total output deviation E(x;G). Minimizing this bound reduces to independently selecting the K largest entries of the weight‑informed score at each layer, which is precisely what WINA does. To make the theory applicable to real transformers, the authors apply a lightweight offline tensor transformation that orthogonalizes columns without changing the functional capacity of the model.
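The single-layer case can be made concrete with a short derivation (our paraphrase of Lemma 3.1, with notation that may differ from the paper's). Writing the sparsification as a binary mask $m$ with $K$ ones, the $\ell_2$ error between dense and sparse outputs decomposes over the dropped coordinates:

\[
\bigl\| W x - W (m \odot x) \bigr\|_2^2
= \Bigl\| \sum_{i \notin S} x_i \, W_{:,i} \Bigr\|_2^2
= \sum_{i \notin S} x_i^2 \, \| W_{:,i} \|_2^2 ,
\]

where $S$ is the set of retained indices and the second equality uses the assumption that $W^\top W$ is diagonal (orthogonal columns), so all cross terms vanish. The error is therefore minimized by choosing $S$ to be the $K$ indices with the largest scores $|x_i| \, \|W_{:,i}\|_2$, which is exactly the WINA selection rule.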
Empirically, the paper conducts two sets of experiments. First, synthetic networks with enforced column orthogonality are used to compare approximation error (ℓ₂ distance between dense and sparse outputs) across sparsity levels of 25%–65%. WINA consistently achieves roughly half the error of CA‑TS/TEAL and R‑Sparse, confirming the theoretical predictions. Second, the method is evaluated on four popular LLM families—Llama‑2‑7B, Llama‑3‑8B, Mistral‑7B, and Phi‑4‑14B—across a suite of benchmarks covering commonsense reasoning, general reasoning, mathematics, and code generation. At identical sparsity levels, WINA improves average accuracy by 0.8%–2.9% over TEAL, with the gap widening as sparsity increases (up to ~3% at 70% sparsity). The authors also demonstrate compatibility with 4‑bit and 8‑bit quantization, showing favorable speed–accuracy trade‑offs, and report that long‑context reasoning and social bias metrics are preserved better than with competing methods.
Limitations are acknowledged. The column‑orthogonality assumption does not hold naturally in pretrained transformers; the offline orthogonalization step, while inexpensive, may affect representational properties in ways not fully explored. Moreover, WINA relies on a fixed K (or a pre‑specified per‑layer K), lacking an adaptive mechanism that could tailor sparsity to each input token dynamically. Future work could investigate learnable or token‑dependent K selection and integrate orthogonalization into the training pipeline.
In summary, WINA offers a practically deployable, theoretically grounded approach to training‑free sparse activation. By incorporating weight statistics into the gating decision, it achieves tighter error bounds and measurable empirical gains across diverse LLM architectures and tasks, establishing a new performance frontier for efficient LLM inference.