Forward Only Learning for Orthogonal Neural Networks of any Depth

Reading time: 5 minutes
...

📝 Original Info

  • Title: Forward Only Learning for Orthogonal Neural Networks of any Depth
  • ArXiv ID: 2512.20668
  • Date: 2025-12-19
  • Authors: Paul Caillon, Alex Colagrande, Erwan Fagnou, Blaise Delattre, Alexandre Allauzen

📝 Abstract

Backpropagation is still the de facto algorithm used today to train neural networks. With the exponential growth of recent architectures, the computational cost of this algorithm also becomes a burden. The recent PEPITA and forward-only frameworks have proposed promising alternatives, but they fail to scale beyond a handful of hidden layers, limiting their use. In this paper, we first analyze theoretically the main limitations of these approaches. This analysis allows us to design a forward-only algorithm that is equivalent to backpropagation under the linear and orthogonal assumptions. By relaxing the linear assumption, we then introduce FOTON (Forward-Only Training of Orthogonal Networks), which bridges the gap with the backpropagation algorithm. Experimental results show that it outperforms PEPITA, enabling us to train neural networks of any depth without the need for a backward pass. Moreover, its performance on convolutional networks clearly opens up avenues for its application to more advanced architectures. The code is open-sourced at https://github.com/p0lcAi/FOTON.

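As a rough illustration of why the linear-and-orthogonal setting could make a forward-only scheme match backpropagation (this derivation is our own sketch for intuition; the paper's actual construction may differ): consider a purely linear network $h_l = W_l h_{l-1}$ with orthogonal weights, $W_l^\top W_l = I$, and output error $e = \partial \mathcal{L}/\partial h_L$. Backpropagation needs the signal $\delta_l = W_{l+1}^\top \cdots W_L^\top e$ at layer $l$ to form the update $\partial \mathcal{L}/\partial W_l = \delta_l h_{l-1}^\top$. If the error is instead mapped to the input space with $F = (W_L \cdots W_1)^\top$ and propagated forward through the first $l$ layers, orthogonality telescopes the product:

$$
W_l \cdots W_1 \, F e \;=\; W_l \cdots W_1 \, W_1^\top W_2^\top \cdots W_L^\top e \;=\; W_{l+1}^\top \cdots W_L^\top e \;=\; \delta_l .
$$

In this idealized regime, a second, error-modulated forward pass carries exactly the signal a backward pass would, which is consistent with the abstract's claimed equivalence with backpropagation.
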
💡 Deep Analysis

📄 Full Content

Forward Only Learning for Orthogonal Neural Networks of any Depth

Paul Caillon (a,*,1), Alex Colagrande (a,1), Erwan Fagnou (a), Blaise Delattre (a), and Alexandre Allauzen (a,b)
(a) Miles Team, LAMSADE, Université Paris-Dauphine - PSL, Paris, France
(b) ESPCI PSL, Paris, France
(*) Corresponding Author. Email: paul.caillon@dauphine.psl.eu
(1) Equal contribution

Abstract. Backpropagation is still the de facto algorithm used today to train neural networks. With the exponential growth of recent architectures, the computational cost of this algorithm also becomes a burden. The recent PEPITA and forward-only frameworks have proposed promising alternatives, but they fail to scale beyond a handful of hidden layers, limiting their use. In this paper, we first analyze theoretically the main limitations of these approaches. This analysis allows us to design a forward-only algorithm that is equivalent to backpropagation under the linear and orthogonal assumptions. By relaxing the linear assumption, we then introduce FOTON (Forward-Only Training of Orthogonal Networks), which bridges the gap with the backpropagation algorithm. Experimental results show that it outperforms PEPITA, enabling us to train neural networks of any depth without the need for a backward pass. Moreover, its performance on convolutional networks clearly opens up avenues for its application to more advanced architectures. The code is open-sourced on GitHub.

1 Introduction and Related Work

Backpropagation (BP) [29] is still the cornerstone for training deep neural networks. With the exponential growth of architectures, its limitations have been highlighted, notably regarding its lack of efficiency, along with its biological implausibility [22]. More specifically, the sequentiality of the backward pass represents a computational bottleneck when training deep networks. The memory footprint and execution time represent a significant drawback, and reducing this burden is a major challenge that could allow the use of state-of-the-art models on resource-limited devices [14, 13].

Alternatives have thus been proposed, most targeting the backward phase. Feedback Alignment (FA) replaces exact gradients with random feedback directions [21], while Direct FA (DFA) transmits feedback directly from the output to each layer in parallel [26]. These simplifications reduce test accuracy, though later works attempt to narrow the gap with BP by partially mimicking its behavior [1, 36, 17, 10, 16]. Nonetheless, clear limitations remain: poor scaling to deep convolutional networks [3, 18, 24, 7], and the need for an exact backward pass in certain modules, such as attention layers in transformers [19].

More recently, forward-only (FO) algorithms have been proposed as an alternative to BP, replacing the backward pass by a modulated second forward pass. A first line of work relies on the estimation of the gradient, computed locally from a modified forward pass using directional derivatives. These forward gradients provide a plausible update for the parameters [11, 31, 23, 9]. For efficient computation, the Jacobian-vector product is involved during the forward pass. However, while forward gradients are unbiased, they exhibit a large variance. This inhibits the convergence of the training as well as the scalability of this approach, even in combination with local losses [27, 4, 9].
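
As a concrete illustration of the forward-gradient idea just described, the sketch below estimates a gradient from a single Jacobian-vector product, with no backward pass. It is a minimal JAX sketch under our own assumptions: the toy two-layer model, the names (`loss_fn`, `forward_gradient_step`), and all shapes are illustrative, not taken from the paper.

```python
import jax
import jax.numpy as jnp


def loss_fn(params, x, y):
    # Toy two-layer model; architecture and shapes are illustrative only.
    h = jnp.tanh(x @ params["W1"])
    pred = h @ params["W2"]
    return jnp.mean((pred - y) ** 2)


def forward_gradient_step(params, x, y, key, lr=1e-2):
    # Sample a random tangent direction v with the same structure as params.
    leaves, treedef = jax.tree_util.tree_flatten(params)
    keys = jax.random.split(key, len(leaves))
    v = jax.tree_util.tree_unflatten(
        treedef, [jax.random.normal(k, p.shape) for k, p in zip(keys, leaves)]
    )
    # One forward pass with a Jacobian-vector product: returns the loss and
    # the directional derivative <grad, v>, without any backward pass.
    loss, dir_deriv = jax.jvp(lambda p: loss_fn(p, x, y), (params,), (v,))
    # Unbiased (but high-variance) gradient estimate: <grad, v> * v.
    grad_est = jax.tree_util.tree_map(lambda vi: dir_deriv * vi, v)
    new_params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grad_est)
    return new_params, loss


# Hypothetical usage on random data.
k1, k2, kx = jax.random.split(jax.random.PRNGKey(0), 3)
params = {"W1": 0.1 * jax.random.normal(k1, (8, 16)),
          "W2": 0.1 * jax.random.normal(k2, (16, 4))}
x, y = jax.random.normal(kx, (32, 8)), jnp.zeros((32, 4))
params, loss = forward_gradient_step(params, x, y, jax.random.PRNGKey(1))
```

The estimator is unbiased, but, as the text notes, its large variance is what limits the convergence and scalability of this first family of methods.
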
In this work, we focus on another family of forward-only algorithms, in which a second pass or modulated forward pass [33] computes network outputs on a perturbed input. The nature of this perturbation varies by method: for example, Hinton [12] uses "negative" data samples, requiring designing or collecting a special dataset, whereas the PEPITA framework [8] derives its modulation directly from the error signal, allowing a more agnostic and data-efficient approach (see the sketch after this excerpt). Srinivasan et al. [33] further analyzed the learning dynamics of these forward-only schemes and extended PEPITA to slightly deeper architectures, but still found that, in practice, they fail to scale beyond five hidden layers and cannot adapt to convolutions. As a result, existing methods remain limited to shallow, fully connected networks.

To address these limitations, we identify the core bottleneck: forward-only passes cannot reliably transmit gradient information through many layers, and the absence of a true backward path demands special data or ad-hoc fixes. We show that under linear orthogonality, a purely forward strategy exactly recovers BP updates. Leveraging this, we introduce FOTON, which matches BP in the orthogonal linear regime and scales robustly to deep non-linear networks. Crucially, FOTON dispenses with any backward pass or storage of the automatic differentiation graph: all updates are computed via forward evaluations alone, dramatically reducing memory overhead by eliminating the need to retain activations or gradient buffers.

Our contributions are:
• We show theoretically that, for orthogonal linear network
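
To make the PEPITA-style two-pass scheme described above concrete, here is a rough sketch: a clean forward pass, an error-modulated second forward pass on `x - err @ F` with a fixed random projection `F`, and weight updates built from activation differences. The update rule, sign conventions, and shapes are illustrative assumptions based on the PEPITA literature; this is not the FOTON algorithm from the paper.

```python
import jax
import jax.numpy as jnp


def forward(weights, x):
    # Plain forward pass through a stack of dense layers; keeps all activations.
    acts = [x]
    h = x
    for i, W in enumerate(weights):
        h = h @ W
        if i < len(weights) - 1:  # no non-linearity on the output layer
            h = jnp.tanh(h)
        acts.append(h)
    return acts


def pepita_style_update(weights, F, x, y, lr=1e-3):
    # Pass 1: clean forward pass and output error.
    acts = forward(weights, x)
    err = acts[-1] - y                        # (batch, out_dim)
    # Pass 2: modulated forward pass on the error-perturbed input.
    acts_mod = forward(weights, x - err @ F)  # F: (out_dim, in_dim), fixed random
    new_weights = []
    for l, W in enumerate(weights):
        if l == len(weights) - 1:
            # Output layer: use the error itself.
            delta = acts_mod[l].T @ err
        else:
            # Hidden layers: difference between clean and modulated activations,
            # paired with the modulated pre-synaptic activity.
            delta = acts_mod[l].T @ (acts[l + 1] - acts_mod[l + 1])
        new_weights.append(W - lr * delta / x.shape[0])
    return new_weights, err


# Hypothetical usage with small random layers and a fixed projection F.
k1, k2, k3, kf, kx = jax.random.split(jax.random.PRNGKey(0), 5)
weights = [0.1 * jax.random.normal(k1, (8, 32)),
           0.1 * jax.random.normal(k2, (32, 32)),
           0.1 * jax.random.normal(k3, (32, 4))]
F = 0.05 * jax.random.normal(kf, (4, 8))      # maps output error back to input space
x, y = jax.random.normal(kx, (16, 8)), jnp.zeros((16, 4))
weights, err = pepita_style_update(weights, F, x, y)
```

Here the fixed projection `F` plays the role that transposed weights play in backpropagation; FOTON's contribution, per the abstract, is to make such a forward-only scheme provably match BP under the linear and orthogonal assumptions and to scale it to deeper, non-linear and convolutional networks.
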

Reference

This content is AI-processed based on open access ArXiv data.
