Free Energy Mixer
Standard attention stores keys/values losslessly but reads them via a per-head convex average, blocking channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior over indices (e.g., the query/key scores in standard attention). Unlike methods that attempt to enrich the $(q,k)$ scoring distribution, FEM treats it as a prior and yields a value-aware posterior read that moves smoothly from averaging to per-channel selection as a learnable inverse temperature increases, while preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs, and SSMs, and it consistently outperforms strong baselines on NLP, vision, and time-series tasks at matched parameter budgets.
💡 Research Summary
The paper identifies a fundamental limitation of modern attention mechanisms: while the key‑value (KV) cache stores past tokens losslessly, the read operation is a per‑head convex combination that applies the same weight vector to every channel of the values. Consequently, the output lies in the convex hull of the stored values and cannot realize channel‑wise selection (e.g., each dimension picking a different past token). The authors formalize this “lossless‑storage vs lossy‑processing” gap, prove that classic attention cannot represent a generic channel‑wise selector (Lemma 2.2, Corollary 2.3), and show that existing remedies—adding heads, increasing depth, per‑dimension queries/keys, richer in‑head mixers, or using linear RNN/SSM memories—either increase computational cost, reduce the lossless advantage, or still suffer from the same token‑separable read bottleneck.
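The read bottleneck is easy to see numerically. The toy sketch below (an illustration, not the paper's code) shows that a single-head softmax read applies one shared weight vector to every value channel, so the output stays in the convex hull of the stored values; a channel-wise selector, where each dimension takes its own maximum over past tokens, is generally outside that family.

```python
import numpy as np

# Toy illustration of the "lossy read" over a lossless cache.
rng = np.random.default_rng(0)
T, D = 5, 3                      # 5 cached tokens, 3 value channels
V = rng.standard_normal((T, D))  # values stored losslessly

scores = rng.standard_normal(T)
w = np.exp(scores) / np.exp(scores).sum()  # per-head softmax weights

out = w @ V  # convex combination: the SAME w is applied to every channel j

# The read lies in the convex hull of the rows of V, channel by channel...
assert np.all(out >= V.min(axis=0) - 1e-9)
assert np.all(out <= V.max(axis=0) + 1e-9)

# ...so a channel-wise selector (each channel picking its own best token)
# is unreachable unless w happens to be one-hot:
channelwise_max = V.max(axis=0)  # channel j takes max_i v_{i,j}
print(out)             # blended read
print(channelwise_max) # what per-channel selection would return
```

Adding heads relaxes this only coarsely: H heads give H weight vectors, still far fewer than the D per-channel selectors the lemma asks for.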
To close the gap, they propose the Free Energy Mixer (FEM). FEM treats the selection distribution as a fast prior pₜ derived from queries/keys (or from the normalizer of a linear RNN/SSM) and introduces a value‑driven log‑linear tilt per channel controlled by a learnable inverse temperature βₜ,ⱼ. The per‑channel free‑energy output is
Fₜ,ⱼ(β) = (1/β) log ∑_{i∈Mₜ} pₜ(i) exp(β v_{i,ⱼ}),
and the corresponding posterior selection distribution is
qₜ,β^{(ⱼ)}(i) = pₜ(i) exp(β v_{i,ⱼ}) / ∑_{r∈Mₜ} pₜ(r) exp(β v_{r,ⱼ}).
Theorem 2.8 shows that this formulation is the exact solution of a KL‑constrained variational problem: maximize expected value under a KL budget relative to the prior. As β grows, q concentrates on the argmax, and Fₜ,ⱼ(β) approaches the per‑channel maximum, enabling true channel‑wise selection. When β→0, the method reduces to the standard expectation, preserving the original softmax behavior.
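Both limiting regimes can be checked directly. The sketch below implements the per-channel free-energy read and its posterior from the formulas above (a minimal NumPy rendering with a scalar β for simplicity; the paper learns β per channel, and this is not the authors' implementation):

```python
import numpy as np

def fem_read(p, V, beta):
    """F_{t,j}(beta) = (1/beta) log sum_i p(i) exp(beta * v_{i,j}).

    p: (T,) prior over cached indices; V: (T, D) values; beta > 0.
    Uses the standard log-sum-exp shift for numerical stability.
    """
    z = beta * V                     # (T, D)
    m = z.max(axis=0)                # per-channel stabilizer
    lse = m + np.log((p[:, None] * np.exp(z - m)).sum(axis=0))
    return lse / beta                # (D,)

def fem_posterior(p, V, beta):
    """q^{(j)}(i) ∝ p(i) exp(beta * v_{i,j}): one tilted distribution per channel."""
    z = beta * V
    z = z - z.max(axis=0)
    w = p[:, None] * np.exp(z)
    return w / w.sum(axis=0)         # (T, D), columns sum to 1

rng = np.random.default_rng(0)
p = np.full(4, 0.25)                 # uniform prior over 4 cached tokens
V = rng.standard_normal((4, 2))

# beta -> 0 recovers the ordinary expectation (the standard attention read)...
small = fem_read(p, V, 1e-6)
assert np.allclose(small, p @ V, atol=1e-4)

# ...while large beta approaches the per-channel maximum (channel-wise selection),
# with q concentrating on each channel's argmax token.
large = fem_read(p, V, 1e3)
assert np.allclose(large, V.max(axis=0), atol=1e-2)
```

Note that each channel j gets its own posterior qₜ,β^{(ⱼ)}, which is exactly the channel-wise selectivity the convex read cannot express.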
Complexity-wise, FEM adds a single masked log‑sum‑exp per channel, which retains the asymptotic cost of the underlying attention (O(T²) for softmax, O(T) for linearizable variants). The authors embed FEM into a two‑level gated architecture: an inner gate λ interpolates between the prior mean µₜ and the free‑energy term, while an outer gate g scales the final output. This design allows smooth learning of β and stabilizes training.
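The two-level gate described above can be sketched as follows. The exact gate parameterization is an assumption on my part (the summary only states that λ interpolates between the prior mean µₜ and the free-energy term, and that g scales the output), so treat this as one plausible reading rather than the paper's architecture:

```python
import numpy as np

def gated_fem(p, V, beta, lam, g):
    """out_j = g_j * (lam_j * F_j(beta) + (1 - lam_j) * mu_j), per channel j.

    p: (T,) prior; V: (T, D) values; lam, g: per-channel gates in this sketch.
    """
    mu = p @ V                                    # prior mean = standard read
    z = beta * V
    m = z.max(axis=0)
    F = (m + np.log((p[:, None] * np.exp(z - m)).sum(axis=0))) / beta
    return g * (lam * F + (1.0 - lam) * mu)

rng = np.random.default_rng(1)
p = np.full(3, 1 / 3)
V = rng.standard_normal((3, 4))
lam = np.full(4, 0.5)   # inner gate: free-energy term vs. prior mean
g = np.ones(4)          # outer gate: overall output scale

out = gated_fem(p, V, 5.0, lam, g)

# lam = 0 collapses to the ordinary attention read, which is one way the
# gate can keep training stable while beta is still being learned:
assert np.allclose(gated_fem(p, V, 5.0, 0.0, 1.0), p @ V)
```

Since the extra work is one masked log-sum-exp and two elementwise gates per channel, the asymptotic cost of the underlying attention is unchanged.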
Empirically, FEM is evaluated on a broad suite of tasks: language modeling and classification (NLP), image classification and video recognition (vision), and multivariate time‑series forecasting. Across all domains, FEM consistently outperforms strong baselines—including vanilla softmax attention, LASER‑style log‑sum‑exp attention, linear attention, and state‑space models like Mamba—while keeping the same parameter budget. Gains are especially pronounced on tasks where channel‑wise discrimination matters (e.g., multivariate time‑series), where FEM achieves 2–4 % absolute improvements in accuracy or lower forecasting error.
The authors release code and pretrained checkpoints, and provide extensive ablations showing the effect of temperature gating, the two‑level gate, and the choice of prior (softmax vs linear RNN/SSM). They also discuss theoretical properties: convexity and smoothness of Fₜ,ⱼ in the values, capacity bounds (FEM can realize |Mₜ| × D assignments versus |Mₜ| × H for H heads), and robustness to masking.
In summary, the Free Energy Mixer introduces a principled, variational free‑energy read that transforms the traditional convex mixing of attention into a value‑aware, per‑channel posterior selection without sacrificing parallelism or asymptotic efficiency. This closes the lossless‑storage vs lossy‑processing gap identified above and opens new avenues for designing attention‑based models that require fine‑grained, channel‑specific retrieval.