📝 Original Info
- Title: Forward Only Learning for Orthogonal Neural Networks of any Depth
- ArXiv ID: 2512.20668
- Date: 2025-12-19
- Authors: Paul Caillon, Alex Colagrande, Erwan Fagnou, Blaise Delattre, Alexandre Allauzen
📝 Abstract
Backpropagation is still the de facto algorithm used today to train neural networks. With the exponential growth of recent architectures, the computational cost of this algorithm has become a burden. The recent PEPITA and forward-only frameworks have proposed promising alternatives, but they fail to scale beyond a handful of hidden layers, which limits their use. In this paper, we first analyze theoretically the main limitations of these approaches. This analysis allows us to design a forward-only algorithm that is equivalent to backpropagation under linear and orthogonal assumptions. By relaxing the linearity assumption, we then introduce FOTON (Forward-Only Training of Orthogonal Networks), which bridges the gap with backpropagation. Experimental results show that it outperforms PEPITA, enabling us to train neural networks of any depth without a backward pass. Moreover, its performance on convolutional networks opens clear avenues for application to more advanced architectures. The code is open-sourced at https://github.com/p0lcAi/FOTON.
💡 Deep Analysis
📄 Full Content
Forward Only Learning for Orthogonal Neural Networks of any Depth
Paul Caillon (a,*,1), Alex Colagrande (a,1), Erwan Fagnou (a), Blaise Delattre (a) and Alexandre Allauzen (a,b)
(a) Miles Team, LAMSADE, Université Paris-Dauphine - PSL, Paris, France
(b) ESPCI PSL, Paris, France
(*) Corresponding Author. Email: paul.caillon@dauphine.psl.eu
(1) Equal contribution
Abstract. Backpropagation is still the de facto algorithm used today to train neural networks. With the exponential growth of recent architectures, the computational cost of this algorithm has become a burden. The recent PEPITA and forward-only frameworks have proposed promising alternatives, but they fail to scale beyond a handful of hidden layers, which limits their use. In this paper, we first analyze theoretically the main limitations of these approaches. This analysis allows us to design a forward-only algorithm that is equivalent to backpropagation under linear and orthogonal assumptions. By relaxing the linearity assumption, we then introduce FOTON (Forward-Only Training of Orthogonal Networks), which bridges the gap with backpropagation. Experimental results show that it outperforms PEPITA, enabling us to train neural networks of any depth without a backward pass. Moreover, its performance on convolutional networks opens clear avenues for application to more advanced architectures. The code is open-sourced on GitHub.
1 Introduction and Related Work
Backpropagation (BP) [29] is still the cornerstone for training deep neural networks. With the exponential growth of architectures, its limitations have been highlighted, notably regarding its lack of efficiency, along with its biological implausibility [22]. More specifically, the sequentiality of the backward pass represents a computational bottleneck when training deep networks. The memory footprint and execution time represent a significant drawback, and reducing this burden is a major challenge that could allow the use of state-of-the-art models on resource-limited devices [14, 13].
Alternatives have thus been proposed, most targeting the backward phase. Feedback Alignment (FA) replaces exact gradients with random feedback directions [21], while Direct FA (DFA) transmits feedback directly from the output to each layer in parallel [26]. These simplifications reduce test accuracy, though later works attempt to narrow the gap with BP by partially mimicking its behavior [1, 36, 17, 10, 16]. Nonetheless, clear limitations remain: poor scaling to deep convolutional networks [3, 18, 24, 7], and the need for an exact backward pass in certain modules, such as attention layers in transformers [19].
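As a point of reference for the forward-only methods discussed below, the following sketch contrasts one DFA update with the corresponding BP deltas on a small two-hidden-layer MLP. All names (W1, W2, W3, B1, B2) and sizes are illustrative choices of ours, not notation from the cited works.

```python
import numpy as np

# Illustrative sketch: one Direct Feedback Alignment (DFA) update on a
# 2-hidden-layer MLP with squared loss; sizes and names are arbitrary.
rng = np.random.default_rng(0)
n_in, n_h, n_out = 8, 16, 4
x, y = rng.standard_normal(n_in), rng.standard_normal(n_out)

W1 = rng.standard_normal((n_h, n_in)) * 0.1
W2 = rng.standard_normal((n_h, n_h)) * 0.1
W3 = rng.standard_normal((n_out, n_h)) * 0.1
# Fixed random feedback matrices: DFA sends the output error directly to
# every hidden layer instead of backpropagating it through W3.T and W2.T.
B1 = rng.standard_normal((n_h, n_out))
B2 = rng.standard_normal((n_h, n_out))

relu = lambda a: np.maximum(a, 0.0)
a1 = W1 @ x;  h1 = relu(a1)
a2 = W2 @ h1; h2 = relu(a2)
y_hat = W3 @ h2
e = y_hat - y                      # output error for the squared loss

# DFA deltas: random projections of the error, gated by the local ReLU derivative.
d2_dfa = (B2 @ e) * (a2 > 0)
d1_dfa = (B1 @ e) * (a1 > 0)
dW3, dW2, dW1 = np.outer(e, h2), np.outer(d2_dfa, h1), np.outer(d1_dfa, x)

# BP would instead use the exact transposes (FA replaces them by fixed random
# matrices of the same shape):
d2_bp = (W3.T @ e) * (a2 > 0)
d1_bp = (W2.T @ d2_bp) * (a1 > 0)
```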
More recently, forward-only (FO) algorithms have been proposed as an alternative to BP, replacing the backward pass by a modulated second forward pass. A first line of work relies on the estimation of the gradient, computed locally from a modified forward pass using directional derivatives. These forward gradients provide a plausible update for the parameters [11, 31, 23, 9]. For efficient computation, a Jacobian-vector product is computed during the forward pass. However, while forward gradients are unbiased, they exhibit a large variance. This inhibits the convergence of training as well as the scalability of this approach, even in combination with local losses [27, 4, 9].
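To make the variance issue concrete, here is a minimal sketch of the forward-gradient estimator, g_hat = (∇L·v) v, on a toy quadratic loss. For brevity, the directional derivative is taken from the closed-form gradient instead of an actual Jacobian-vector product; this does not change the estimator's statistics.

```python
import numpy as np

# Forward-gradient estimator on L(theta) = 0.5 * ||A @ theta||^2 (toy example).
rng = np.random.default_rng(0)
d = 10
A = rng.standard_normal((20, d))
theta = rng.standard_normal(d)

def grad(theta):
    # Closed-form gradient A^T A theta, used only to keep the sketch short;
    # in practice the directional derivative below comes from a forward-mode
    # Jacobian-vector product, without ever forming the full gradient.
    return A.T @ (A @ theta)

def forward_gradient(theta):
    # One sample: g_hat = (grad . v) * v with v ~ N(0, I); E[g_hat] = grad.
    v = rng.standard_normal(d)
    return (grad(theta) @ v) * v

g_true = grad(theta)
samples = np.stack([forward_gradient(theta) for _ in range(50_000)])
print("bias  (||mean - true grad||):", np.linalg.norm(samples.mean(0) - g_true))
print("mean per-coordinate variance:", samples.var(0).mean())  # large relative to the bias
```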
In this work, we focus on another family of forward-only algorithms, in which a second, modulated forward pass [33] computes network outputs on a perturbed input. The nature of this perturbation varies by method: for example, Hinton [12] uses “negative” data samples, requiring the design or collection of a special dataset, whereas the PEPITA framework [8] derives its modulation directly from the error signal, allowing a more agnostic and data-efficient approach. Srinivasan et al. [33] further analyzed the learning dynamics of these forward-only schemes and extended PEPITA to slightly deeper architectures, but still found that, in practice, they fail to scale beyond five hidden layers and cannot adapt to convolutions. As a result, existing methods remain limited to shallow, fully connected networks.
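For concreteness, the sketch below implements a PEPITA-style update on a one-hidden-layer network, following the description above (modulation derived from the error and projected to the input by a fixed random matrix F). Sign conventions, the last-layer rule, and normalisation are our simplifications and may differ from specific implementations.

```python
import numpy as np

# PEPITA-style training step: the backward pass is replaced by a second
# forward pass on an error-modulated input (sketch, not a reference implementation).
rng = np.random.default_rng(0)
n_in, n_h, n_out, lr = 8, 16, 4, 0.01
x, y = rng.standard_normal(n_in), rng.standard_normal(n_out)
W1 = rng.standard_normal((n_h, n_in)) * 0.1
W2 = rng.standard_normal((n_out, n_h)) * 0.1
F = rng.standard_normal((n_in, n_out)) * 0.05   # fixed error-projection matrix

relu = lambda a: np.maximum(a, 0.0)

# First (standard) forward pass.
h1 = relu(W1 @ x)
e = W2 @ h1 - y                                  # output error

# Second (modulated) forward pass on the perturbed input.
x_mod = x - F @ e
h1_mod = relu(W1 @ x_mod)

# Layer-local updates built from the difference between the two passes.
W1 -= lr * np.outer(h1 - h1_mod, x_mod)
W2 -= lr * np.outer(e, h1_mod)
```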
To address these limitations, we identify the core bottleneck: forward-only passes cannot reliably transmit gradient information through many layers, and the absence of a true backward path demands special data or ad-hoc fixes. We show that under linear orthogonality, a purely forward strategy exactly recovers BP updates. Leveraging this, we introduce FOTON, which matches BP in the orthogonal linear regime and scales robustly to deep non-linear networks. Crucially, FOTON dispenses with any backward pass or storage of the automatic differentiation graph: all updates are computed via forward evaluations alone, dramatically reducing memory overhead by eliminating the need to retain activations or gradient buffers. Our contributions are:
• We show theoretically that, for orthogonal linear networks, our forward-only algorithm computes weight updates equivalent to those of backpropagation.
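To give intuition for this first contribution, the sketch below checks numerically, for a deep linear network with square orthogonal weights, that projecting the output error to the input with the transpose of the whole network and re-propagating it forward reproduces the BP error signals at every layer (since W^T = W^{-1} for orthogonal W). This particular construction is our illustration of the claimed equivalence, not necessarily FOTON's exact update rule.

```python
import numpy as np

# Deep linear network with orthogonal weights: a forward pass of the
# transposed-projected error reproduces the BP deltas exactly (sketch).
rng = np.random.default_rng(0)
d, L = 16, 12
Ws = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(L)]  # orthogonal weights
x, y = rng.standard_normal(d), rng.standard_normal(d)

# Forward pass h_l = W_l h_{l-1}.
hs = [x]
for W in Ws:
    hs.append(W @ hs[-1])
e = hs[-1] - y                                   # output error (squared loss)

# Backpropagation: delta_l = W_{l+1}^T ... W_L^T e.
deltas_bp = [e]
for W in reversed(Ws[1:]):
    deltas_bp.insert(0, W.T @ deltas_bp[0])

# Forward-only: project e back to the input with the transpose of the whole
# network (equal to its inverse here), then propagate it forward layer by layer.
# In the linear case this equals the difference between the clean forward pass
# and a PEPITA-like modulated pass whose projection matrix is that transpose.
m = e.copy()
for W in reversed(Ws):
    m = W.T @ m
deltas_fo = []
for W in Ws:
    m = W @ m
    deltas_fo.append(m)

err = max(np.abs(b - f).max() for b, f in zip(deltas_bp, deltas_fo))
print("max |BP delta - forward-only delta| =", err)   # numerically ~0
# The BP weight gradient at layer l is outer(delta_l, h_{l-1}), and h_{l-1} is
# already available from the first forward pass, so matching deltas means
# matching weight updates.
```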
Reference
This content is AI-processed based on open access ArXiv data.