On the Infinite Width and Depth Limits of Predictive Coding Networks
Predictive coding (PC) is a biologically plausible alternative to standard backpropagation (BP) that minimises an energy function with respect to network activities before updating weights. Recent work has improved the training stability of deep PC networks (PCNs) by leveraging BP-inspired reparameterisations. However, the full scalability and theoretical basis of these approaches remain unclear. To address this, we study the infinite width and depth limits of PCNs. For linear residual networks, we show that the set of width- and depth-stable feature-learning parameterisations for PC is exactly the same as for BP. Moreover, under any of these parameterisations, the PC energy with equilibrated activities converges to the BP loss in a regime where the model width is much larger than the depth, so that PC computes the same gradients as BP. Experiments show that these results hold in practice for deep nonlinear networks, as long as an activity equilibrium appears to be reached. Overall, this work unifies various previous theoretical and empirical results and has potentially important implications for the scaling of PCNs.
💡 Research Summary
This paper investigates the theoretical foundations and scalability of Predictive Coding Networks (PCNs) by analysing their infinite‑width and infinite‑depth limits and comparing them directly with standard back‑propagation (BP). The authors focus on linear residual networks, deriving conditions under which PC and BP become mathematically equivalent.
The work begins by introducing a general “abcd” parameterisation that scales the layer activations (aₗ), weight‑initialisation variances (bₗ), learning rate (c) and output magnitude (d) with the network width N. This framework mirrors recent BP literature on width‑aware or “mean‑field” parameterisations, which are designed to satisfy three desiderata: (1) pre‑activations remain O(1) at initialisation, (2) network outputs evolve by O(1) during training, and (3) hidden features evolve by O(1) (i.e., non‑lazy or “rich” learning). For BP, a one‑dimensional family of solutions satisfies these constraints, most notably the mean‑field setting with aₗ = ½ for hidden layers, bₗ = c = 0, and d = ½.
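Desideratum (1) can be illustrated with a minimal numpy sketch (not the paper's code; the layer count, seed, and concrete multiplier convention are assumptions of this sketch): with an N^{-1/2} hidden multiplier (aₗ = ½) and O(1) initialisation variance (bₗ = 0), hidden pre-activation magnitudes stay O(1) as the width grows.

```python
import numpy as np

rng = np.random.default_rng(1)
rms = []
for N in (64, 256, 1024, 2048):
    h = rng.normal(size=N)              # O(1)-entry input of width N
    for _ in range(3):                  # three hidden linear layers
        W = rng.normal(size=(N, N))     # O(1) initialisation variance (b_l = 0)
        h = (W @ h) / np.sqrt(N)        # N^{-1/2} multiplier (a_l = 1/2)
    rms.append(float(np.sqrt(np.mean(h ** 2))))
print(rms)  # roughly constant across widths: pre-activations stay O(1)
```

Without the N^{-1/2} multiplier, the pre-activation scale would instead grow with √N at each layer, violating the first desideratum.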
In the PC setting, the network minimises an energy function F(z,θ) that is a sum of layer‑wise mean‑squared errors. For linear networks the activity optimisation (inference) has a unique equilibrium z*(θ), and the equilibrated energy reduces to a rescaled mean‑squared error: for scalar outputs, F*(θ) = L(θ)/s(θ), where s(θ) = 1 + ∑_{ℓ=2}^L ‖W^{(L:ℓ)}‖² depends on products of weight matrices (for vector outputs, the output residual is rescaled by the inverse of S(θ) = I + ∑_{ℓ=2}^L W^{(L:ℓ)} W^{(L:ℓ)ᵀ}). Thus PC effectively minimises the BP loss divided by a weight‑dependent factor.
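This closed form can be checked numerically with a small sketch (sizes and seed are arbitrary; this is not the paper's code). For a scalar-output linear network, minimising the layer-wise quadratic energy over the free activities is a linear least-squares problem, and the resulting equilibrated energy matches the BP loss rescaled by s(θ), entering as a division in this scalar case:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 32, 4                                       # small, hypothetical sizes
x = rng.normal(size=N)
y = rng.normal(size=1)
Ws = [rng.normal(size=(N, N)) / np.sqrt(N) for _ in range(L - 1)]
Ws.append(rng.normal(size=(1, N)) / np.sqrt(N))    # scalar output layer

# Stack the layer-wise residuals r_l = z_l - W_l z_{l-1} (with z_0 = x and
# z_L = y clamped) into one least-squares problem over the free z_1..z_{L-1}.
n = N * (L - 1)
M = np.zeros((n + 1, n))
c = np.zeros(n + 1)
for l in range(L - 1):
    i = l * N
    M[i:i + N, i:i + N] = np.eye(N)
    if l == 0:
        c[i:i + N] = Ws[0] @ x
    else:
        M[i:i + N, i - N:i] = -Ws[l]
M[n, n - N:n] = Ws[L - 1]                          # output residual row
c[n] = y[0]

z_star, *_ = np.linalg.lstsq(M, c, rcond=None)     # activity equilibrium z*(θ)
F_star = 0.5 * np.sum((M @ z_star - c) ** 2)       # equilibrated PC energy

# Rescaled-MSE form: F* = L(θ) / s(θ), s(θ) = 1 + sum_l ||W^(L:l)||².
s, P = 1.0, Ws[-1]
for W in reversed(Ws[:-1]):
    s += float(np.sum(P ** 2))                     # ||W^(L:l)||² for l = L..2
    P = P @ W
L_bp = 0.5 * (y[0] - (P @ x)[0]) ** 2              # BP loss, collapsed weights
print(float(F_star), L_bp / s)                     # the two values agree
```

The agreement is exact (up to floating point) because the energy is quadratic in the activities, so the least-squares solution is the unique equilibrium.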
Theorem 1 proves that, assuming activities have converged to equilibrium, the set of (aₗ,bₗ,c,d) that satisfy the three desiderata for PC is exactly the same as for BP. Consequently, the same mean‑field, maximal‑update, or other width‑stable parameterisations used in BP can be applied to PC without modification.
Theorem 2 extends the analysis to linear residual networks and introduces a depth‑aware scaling. When the width N is much larger than the depth L (N ≫ L), the rescaling factor s(θ) → 1 as N→∞. Therefore the equilibrated PC energy converges to the ordinary BP loss, and the gradients computed by PC become identical to those obtained by BP. This result is formalised in Corollary 4.2.
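The collapse of s(θ) at large width can be illustrated with a simplified numeric sketch (a plain, non-residual linear network with hypothetical mean-field-style scale choices of 1/√N for hidden layers and 1/N for the readout; the paper's depth-scaled residual setting is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
L, ss = 4, []
for N in (64, 256, 1024, 2048):
    Ws = [rng.normal(size=(N, N)) / np.sqrt(N) for _ in range(L - 1)]
    Ws.append(rng.normal(size=(1, N)) / N)     # mean-field-style readout scale
    s, P = 1.0, Ws[-1]
    for W in reversed(Ws[:-1]):
        s += float(np.sum(P ** 2))             # ||W^(L:l)||² for l = L..2
        P = P @ W
    ss.append(s)
print(ss)  # shrinks toward 1 as N grows at fixed depth (the N >> L regime)
```

Each term ‖W^{(L:ℓ)}‖² is of order 1/N under these scales, so s(θ) ≈ 1 + (L−1)/N, vanishingly different from 1 when N ≫ L.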
Empirical validation is performed on both linear and nonlinear models. Linear MLPs and residual networks trained on CIFAR‑10 and Fashion‑MNIST with widths ranging from 16 to 2048 demonstrate that (i) the cosine similarity between PC and BP gradients approaches 1 as width grows, (ii) the rescaling factor s(θ) indeed collapses to 1, and (iii) the training dynamics of PC match those of BP under the mean‑field scaling. Nonlinear ReLU networks and convolutional architectures are also tested using Adam; when the inference phase is allowed enough steps to reach a near‑equilibrium, PC gradients again align with BP gradients. Conversely, with too few inference steps the activities fail to equilibrate, so PC and BP gradients remain misaligned and learning becomes unstable, highlighting the practical importance of the equilibrium assumption.
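Finding (i) can be reproduced in miniature for the linear, scalar-output case (a sketch with arbitrary sizes and seed, not the paper's experiments). It uses the closed-form equilibrium implied by the energy's stationarity conditions: the equilibrated prediction errors are the BP error signals δₗ divided by s(θ), and the equilibrium activities follow the forward recursion with these errors injected.

```python
import numpy as np

def cosine(gs1, gs2):
    a = np.concatenate([g.ravel() for g in gs1])
    b = np.concatenate([g.ravel() for g in gs2])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
L, sims = 4, []
for N in (64, 256, 1024):
    x = rng.normal(size=N)
    y = rng.normal(size=1)
    Ws = [rng.normal(size=(N, N)) / np.sqrt(N) for _ in range(L - 1)]
    Ws.append(rng.normal(size=(1, N)) / N)    # mean-field-style readout scale

    # Feedforward pass and BP error signals delta_l (sign dropped throughout).
    hs = [x]
    for W in Ws:
        hs.append(W @ hs[-1])
    deltas = [None] * (L + 1)
    deltas[L] = y - hs[L]                     # output residual
    for l in range(L - 1, 0, -1):
        deltas[l] = Ws[l].T @ deltas[l + 1]
    bp = [np.outer(deltas[l + 1], hs[l]) for l in range(L)]

    # Equilibrated PC errors are the BP deltas shrunk by s(θ) (scalar output);
    # equilibrium activities follow the forward pass with errors injected.
    r2 = float(deltas[L] @ deltas[L])
    s = 1.0 + sum(float(d @ d) for d in deltas[1:L]) / r2
    eps = [deltas[l] / s for l in range(1, L + 1)]
    zs = [x]
    for l in range(1, L):
        zs.append(Ws[l - 1] @ zs[-1] + eps[l - 1])
    pc = [np.outer(eps[l], zs[l]) for l in range(L)]
    sims.append(cosine(pc, bp))
print(sims)  # cosine similarity between PC and BP gradients approaches 1
```

The uniform 1/s(θ) shrinkage leaves gradient directions unchanged, so the only misalignment comes from the small gap between equilibrium activities z* and feedforward activations, which vanishes as the width grows.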
The paper’s contributions are threefold: (1) it provides a rigorous theoretical bridge showing that PC and BP share the same width‑stable, feature‑learning parameter space; (2) it demonstrates that in the regime N ≫ L, PC’s energy landscape becomes indistinguishable from the BP loss, guaranteeing identical gradient signals; (3) it supplies extensive experimental evidence that these asymptotic results hold for realistic deep, nonlinear networks when appropriate scaling and sufficient activity optimisation are employed.
The authors discuss implications for scaling PCNs: by adopting the mean‑field (aₗ=½, bₗ=0, c=0, d=½) scaling, deep PCNs can be trained stably across a wide range of model sizes, mirroring the success of similarly derived BP scalings. Limitations are acknowledged: the proofs rely on linearity and exact equilibrium of activities, conditions that may not hold in biological circuits or in highly time‑constrained hardware. Extending the theory to fully nonlinear dynamics, stochastic inference, and biologically realistic constraints remains an open research direction. Overall, the work unifies prior empirical observations with a solid mathematical foundation and offers concrete guidelines for building scalable, biologically plausible predictive‑coding architectures.