Attractor Patch Networks: Reducing Catastrophic Forgetting with Routed Low-Rank Patch Experts
Transformers achieve strong language modeling accuracy, yet their position-wise feed-forward networks (FFNs) are dense, globally shared, and typically updated end to end. These properties create two practical tensions. First, dense FFNs spend the same compute on every token regardless of context, and they allocate capacity uniformly even when language exhibits highly clustered context structure. Second, continual learning, in the sense of updating the model while serving a data stream, often produces interference because a small update touches broadly shared weights. We propose Attractor Patch Networks (APN), a plug-compatible replacement for the Transformer FFN. APN is a bank of patch experts. A similarity router selects a small top-k set of patches for each token by matching the token representation to learned prototypes. Each selected patch emits a low-rank residual update conditioned on a compact code. The architecture yields conditional, context-specialized nonlinear transformations while preserving the standard Transformer interface. This paper focuses on APN as an architectural primitive. We formalize APN, analyze its expressivity as a piecewise low-rank residual function class, and derive simple interference and stability arguments that make APN naturally compatible with continual learning. In experiments on character-level language modeling, APN achieves competitive perplexity (4.57 vs 4.32 PPL) while enabling dramatically better continual adaptation: when adapting to a shifted domain, APN achieves 2.6 times better retention (11.1 vs 29.4 PPL on the original domain) and 2.8 times better adaptation (6.4 vs 17.8 PPL on the new domain) compared to global fine-tuning of a dense FFN baseline.
💡 Research Summary
Transformers achieve state‑of‑the‑art performance in language modeling, but the position‑wise feed‑forward network (FFN) that dominates their compute budget is dense, globally shared, and updated end‑to‑end. This design creates two practical problems. First, every token incurs the same computation regardless of how much context‑specific processing it actually needs, and capacity must be allocated uniformly even though linguistic phenomena naturally cluster in representation space. Second, in a streaming deployment where the model is continually adapted to new data, updating a dense FFN inevitably touches all of its parameters, leading to catastrophic forgetting of previously learned knowledge.
The paper introduces Attractor Patch Networks (APN) as a drop‑in replacement for the standard FFN. APN consists of three components. (1) A prototype router maintains K learnable prototypes p_i∈ℝ^d. For a token representation h, the router computes similarity scores s_i = ⟨LN(h), Norm(p_i)⟩/τ, selects the top‑k prototypes, and produces softmax weights w_i over this sparse set. (2) A shared projection V∈ℝ^{d×r} (r≪d) maps the normalized token to a compact code u = Vᵀ·LN(h)∈ℝ^r. (3) Each selected patch expert i applies a gated transformation ϕ_i(u) = u ⊙ σ(a_i⊙u + b_i) (with learnable vectors a_i, b_i) and then a low‑rank decoder U_i∈ℝ^{d×r} to produce a residual Δ_i = U_i·ϕ_i(u). The final APN output is a weighted sum of the residuals, scaled by a factor γ, and added back to the input (y = h + γ∑_{i∈K(h)} w_i Δ_i).
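The three components above compose into a single per‑token update, which can be sketched end to end. The following is a minimal NumPy illustration of one token's APN forward pass; all sizes (d, r, K, k), the temperature τ, the scale γ, and the randomly initialized parameters are hypothetical placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and hyperparameters (not the paper's settings).
d, r, K, k, tau, gamma = 16, 4, 8, 2, 1.0, 0.5

# "Learned" parameters, randomly initialized for illustration.
P = rng.standard_normal((K, d))      # prototypes p_i
V = rng.standard_normal((d, r))      # shared projection V
A = rng.standard_normal((K, r))      # gate scales a_i
B = rng.standard_normal((K, r))      # gate biases b_i
U = rng.standard_normal((K, d, r))   # low-rank decoders U_i

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / (x.std() + eps)

def apn_forward(h):
    """One token's APN update: route, encode, gate, decode, residual-add."""
    hn = layer_norm(h)
    protos = P / np.linalg.norm(P, axis=1, keepdims=True)     # Norm(p_i)
    s = protos @ hn / tau                                     # similarity scores s_i
    active = np.argsort(s)[-k:]                               # top-k active set K(h)
    w = np.exp(s[active] - s[active].max())
    w /= w.sum()                                              # softmax over active set
    u = V.T @ hn                                              # compact code u = V^T LN(h)
    delta = np.zeros(d)
    for w_i, i in zip(w, active):
        phi = u * (1.0 / (1.0 + np.exp(-(A[i] * u + B[i]))))  # gated phi_i(u)
        delta += w_i * (U[i] @ phi)                           # low-rank residual
    return h + gamma * delta, active

h = rng.standard_normal(d)
y, active = apn_forward(h)   # output keeps the standard FFN interface: same shape as h
```

Note that only the k selected decoders U_i participate in the update, which is what makes the per‑token compute (and, later, the per‑token gradient footprint) sparse.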
Mathematically, APN implements a piecewise low‑rank residual operator. For a fixed active set K(h) the residual lies in the span of at most k rank‑r matrices, giving an effective rank bound of k·r per token. This contrasts with a dense FFN, which implicitly represents a single high‑rank mapping (typically d_ff ≈ 4d). By distributing capacity across many specialized patches while keeping each patch’s expressivity low, APN can allocate more parameters to regions of representation space that are frequently visited, thereby improving accuracy without increasing the overall parameter budget.
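The per‑token rank bound is easy to verify numerically: with the active set held fixed, the residuals for arbitrarily many tokens all lie in the span of the k·r concatenated decoder columns. A small NumPy check under illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, k = 32, 3, 2                    # illustrative sizes

# A fixed active set: k patch experts with their decoders and gate parameters.
U = rng.standard_normal((k, d, r))
A = rng.standard_normal((k, r))
B = rng.standard_normal((k, r))

def residual(u, w):
    """APN residual for a fixed active set with routing weights w."""
    out = np.zeros(d)
    for i in range(k):
        phi = u * (1.0 / (1.0 + np.exp(-(A[i] * u + B[i]))))  # gated code
        out += w[i] * (U[i] @ phi)                            # low-rank decode
    return out

# Residuals for many tokens sharing this active set lie in the span of the
# k*r decoder columns, so their collective rank is bounded by k*r, not d.
R = np.stack([residual(rng.standard_normal(r), np.array([0.5, 0.5]))
              for _ in range(200)])
rank = np.linalg.matrix_rank(R)       # at most k*r = 6, despite d = 32
```

A dense FFN with inner width d_ff ≈ 4d has no comparable per‑token bound, which is exactly the contrast the effective‑rank argument draws.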
From a continual‑learning perspective, APN’s routing makes the update surface highly localized. When streaming data arrives, one can restrict gradient updates to the decoders (and optionally the prototypes) of the patches that are active for the current token. If two contexts A and B activate disjoint patch sets, updates for A cannot affect B. The paper formalizes an interference proxy Ω(A,B)=|K_A∩K_B|/k; smaller Ω means less forgetting. By increasing the total number of patches K or decreasing k, the expected overlap diminishes. Additional stability knobs—such as scaling γ, norm‑clipping of U_i, and confidence‑gated updates based on routing entropy—further bound the magnitude of each online step.
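The interference proxy is straightforward to compute, and the claim that overlap shrinks with larger K or smaller k can be checked by simulation: under uniformly random routing, E[Ω] = k/K. A minimal sketch (the function name and the Monte Carlo setup are illustrative, not from the paper):

```python
import numpy as np

def interference(active_a, active_b, k):
    """Interference proxy Omega(A, B) = |K_A ∩ K_B| / k from the paper:
    the fraction of the top-k active set shared between two contexts."""
    return len(set(active_a) & set(active_b)) / k

# Disjoint active sets -> zero interference: updating A's decoders leaves B intact.
omega_disjoint = interference({3, 7}, {1, 5}, k=2)   # 0.0
omega_shared = interference({3, 7}, {7, 9}, k=2)     # 0.5

# Under uniformly random routing, E[Omega] = k/K, so increasing the number of
# patches K (or decreasing k) shrinks the expected overlap between contexts.
K, k = 64, 2
rng = np.random.default_rng(0)
mean_omega = np.mean([
    interference(rng.choice(K, size=k, replace=False),
                 rng.choice(K, size=k, replace=False), k)
    for _ in range(10_000)
])  # close to k/K = 0.03125
```

In a trained model routing is not uniform, so the realized overlap depends on how contexts cluster; the simulation only illustrates the scaling argument.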
Empirically, the authors evaluate APN on a character‑level Shakespeare modeling task. With a parameter budget comparable to a dense FFN, APN attains a perplexity of 4.57 versus 4.32 for the baseline, showing that specialization sacrifices little raw accuracy. More strikingly, when the model is subsequently adapted to a shifted domain, APN retains far more of the original‑domain performance (11.1 PPL vs. 29.4 PPL for the dense FFN) and adapts to the new domain far more effectively (6.4 PPL vs. 17.8 PPL). These results support the theoretical claims: APN's localized updates dramatically reduce interference while still providing enough capacity to learn new patterns.
The paper also provides a design space guide. The number of patches K controls total capacity and expected overlap; the top‑k parameter k trades off locality versus robustness to routing errors; the code dimension r controls per‑patch expressivity; the temperature τ shapes routing sparsity; and a simple balanced‑usage regularizer encourages uniform patch utilization. Optional mechanisms such as patch dropout and prototype re‑initialization enable dynamic allocation of new attractor regions for novel inputs.
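The summary names a balanced‑usage regularizer without spelling out its form. One plausible instantiation, borrowed from load‑balancing losses used in sparse mixture‑of‑experts models, is sketched below; the function name, the normalization, and the exact form are assumptions, not the paper's definition.

```python
import numpy as np

def balanced_usage_loss(route_probs, active_mask):
    """One plausible balanced-usage regularizer (an assumption, in the style of
    MoE load-balancing losses): penalize the product of each patch's dispatch
    fraction and its mean routing probability, normalized so the minimum value
    1.0 is attained at perfectly uniform patch utilization."""
    T, K = route_probs.shape
    k = int(active_mask.sum(axis=1).mean())    # patches active per token
    frac_tokens = active_mask.mean(axis=0)     # fraction of tokens using each patch
    mean_prob = route_probs.mean(axis=0)       # mean routing probability per patch
    return (K / k) * float(frac_tokens @ mean_prob)

# Perfectly uniform usage: every patch serves the same share of tokens.
T, K, k = 8, 4, 2
probs = np.full((T, K), 1.0 / K)
mask = np.zeros((T, K))
for t in range(T):
    mask[t, [t % K, (t + 1) % K]] = 1.0        # round-robin top-k assignment
loss = balanced_usage_loss(probs, mask)        # 1.0 at uniform usage
```

Any patch that monopolizes routing drives the loss above its minimum, pushing the router toward the uniform utilization the paper's regularizer is said to encourage.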
In summary, Attractor Patch Networks offer a principled way to make the FFN component of Transformers context‑aware, compute‑efficient, and friendly to continual learning. By routing tokens to a small set of low‑rank patch experts, APN reduces unnecessary computation, allows specialization where the data distribution is heterogeneous, and confines online updates to a tiny parameter subset, thereby mitigating catastrophic forgetting. The approach is fully compatible with existing Transformer architectures, requires only modest architectural changes, and demonstrates both competitive language‑modeling performance and substantial gains in streaming adaptation scenarios.