A Max-Sum algorithm for training discrete neural networks
We present an efficient learning algorithm for the problem of training neural networks with discrete synapses, a well-known hard (NP-complete) discrete optimization problem. The algorithm is a variant of the so-called Max-Sum (MS) algorithm. In particular, we show how, for bounded integer weights with $q$ distinct states and independent concave a priori distribution (e.g. $l_{1}$ regularization), the algorithm’s time complexity can be made to scale as $O\left(N\log N\right)$ per node update, thus putting it on par with alternative schemes, such as Belief Propagation (BP), without resorting to approximations. Two special cases are of particular interest: binary synapses $W\in\{-1,+1\}$ and ternary synapses $W\in\{-1,0,+1\}$ with $l_{0}$ regularization. The algorithm we present performs as well as BP on binary perceptron learning problems, and may be better suited to address the problem on fully-connected two-layer networks, since inherent symmetries in two-layer networks are naturally broken using the MS approach.
💡 Research Summary
The paper addresses the notoriously hard problem of training neural networks whose synaptic weights are constrained to a finite set of discrete values. While continuous‑weight networks can be trained efficiently with gradient‑based methods, the discrete case is NP‑complete even for the simplest single‑layer perceptron. Recent work has shown that a reinforced version of Belief Propagation (BP) can solve random binary classification tasks up to a storage capacity close to the theoretical limit, but BP relies on a Gaussian approximation that is only valid for large N and it struggles with fully‑connected multi‑layer architectures because of permutation symmetry among hidden units.
To overcome these limitations the authors propose a Max‑Sum (MS) algorithm, which can be seen as the zero‑temperature limit of BP but with a different handling of constraints. The learning problem is represented on a complete bipartite factor graph: N variable nodes correspond to the synaptic weights, and αN factor nodes correspond to the training patterns. Two families of messages are defined: Φ_{t}^{μ→i}(W_i) from factor μ to variable i, and Ψ_{t}^{i→μ}(W_i) in the opposite direction. The standard MS update equations (3) and (4) involve a maximisation over all configurations that satisfy the hard constraint “no pattern is mis‑classified”. Direct evaluation would be exponential, but the authors show that the maximisation can be rewritten as a Max‑Convolution of N‑1 one‑dimensional functions.
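To make the structure of these updates concrete, here is a minimal sketch of a factor-to-variable message for a binary perceptron constraint, evaluated by brute force. This is illustrative only: the function and variable names are ours, and the naive enumeration over all other weights is exactly the exponential cost that the paper's Max-Convolution reformulation avoids.

```python
import itertools

def factor_to_variable(xi, i, psi, w_value, states=(-1, +1)):
    """Phi^{mu->i}(W_i): maximise, over all assignments of the other weights
    that satisfy the hard constraint 'pattern xi is classified as +1', the
    sum of incoming variable-to-factor messages Psi^{j->mu}(W_j).
    Returns -inf if the constraint cannot be satisfied with W_i = w_value."""
    others = [j for j in range(len(xi)) if j != i]
    best = float("-inf")
    for config in itertools.product(states, repeat=len(others)):
        w = dict(zip(others, config))
        w[i] = w_value
        # hard constraint: the pattern must be correctly stored
        if sum(xi[j] * w[j] for j in range(len(xi))) > 0:
            best = max(best, sum(psi[j][w[j]] for j in others))
    return best

# Example: N = 3 weights, one pattern, uniform incoming messages.
xi = [+1, -1, +1]
psi = {j: {-1: 0.0, +1: 0.0} for j in range(3)}
print(factor_to_variable(xi, i=0, psi=psi, w_value=+1))  # 0.0: satisfiable
```

The cost of this direct evaluation is $O(q^{N-1})$ per message, which is what motivates the rewriting as a Max-Convolution described next.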
A Max‑Convolution replaces the usual (+,×) operations of a standard convolution with (max,+). When the functions involved are piecewise‑linear and concave – which is the case when the prior on the weights is concave (e.g., L1 regularisation) – the convolution preserves this structure. By exploiting associativity, the authors compute the full (N−1)‑fold convolution recursively, achieving a per‑node update cost of O(N log N) instead of O(N²). This matches the best known complexity of BP without resorting to any Gaussian approximation.
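The (max,+) operation itself is easy to state in code. The sketch below shows a naive pairwise Max-Convolution of dictionaries {weight sum: score} and its associative folding over several single-weight functions; note that this naive pairwise step is O(n²), whereas the paper's O(N log N) bound additionally exploits the concavity of the piecewise-linear messages, which this sketch does not implement.

```python
from functools import reduce

def max_convolve(f, g):
    """(f * g)(s) = max over a + b = s of f(a) + g(b),
    with f and g given as dicts {value: score}."""
    out = {}
    for a, fa in f.items():
        for b, gb in g.items():
            s = a + b
            out[s] = max(out.get(s, float("-inf")), fa + gb)
    return out

# Three binary-weight "messages" Psi_j(W_j), W_j in {-1, +1} (toy values):
msgs = [{-1: 0.0, +1: 0.3}, {-1: 0.1, +1: 0.0}, {-1: 0.0, +1: 0.2}]
# Associativity lets us fold them pairwise; `total` assigns to every
# possible total weight sum the best achievable score.
total = reduce(max_convolve, msgs)
print(total)  # e.g. total[3] = 0.3 + 0.0 + 0.2 for the config (+1, +1, +1)
```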
Two additional ingredients are introduced to make the algorithm practical on finite graphs: (i) a reinforcement term r_t Ψ_{t‑1} that gradually biases the messages toward previously obtained beliefs, acting as a soft decimation; and (ii) a tiny asymmetric concave noise Γ⁰_i(W_i)≈1 added to the external fields Γ_i(W_i). The reinforcement term improves convergence on loopy graphs, while the asymmetric noise breaks the permutation symmetry that plagues fully‑connected two‑layer committee machines.
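The effect of the reinforcement term can be illustrated on a single binary weight. The sketch below is our reading of the r_t Ψ_{t−1} recipe, with names and toy numbers of our own choosing: each sweep adds back a growing fraction of the previous belief, so the gap between the two states widens and the weight is softly pinned, acting like a gradual decimation.

```python
def reinforce(psi_prev, incoming_sum, r):
    """New message Psi_t(W) = sum of incoming factor messages + r * Psi_{t-1}(W)."""
    return {w: incoming_sum[w] + r * psi_prev[w] for w in psi_prev}

psi = {-1: 0.0, +1: 0.0}
incoming = {-1: -0.2, +1: 0.5}   # toy net evidence from the factor nodes
for t in range(20):
    r = min(1.0, 0.1 * t)        # reinforcement ramped up over the sweeps
    psi = reinforce(psi, incoming, r)

# The belief gap grows sweep after sweep: the weight is progressively
# frozen at +1 even though the per-sweep evidence is fixed.
print(psi[+1] - psi[-1])
```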
The authors work out the details for two important cases: binary weights W∈{−1,+1} and ternary weights W∈{−1,0,+1} (the latter with an L0‑type regulariser). For the binary case they give explicit formulas for the message updates, the computation of the max‑convolution, and the normalisation constants that keep the messages zero‑mean. They also discuss how to incorporate L1 regularisation by simply adding a linear penalty to the external fields.
Experimental results are reported for both a single‑layer perceptron and a fully‑connected two‑layer committee machine. On the perceptron, the reinforced MS algorithm reaches a storage capacity α≈0.74, essentially identical to the best reinforced BP results, while requiring O(N log N) operations per update. The number of iterations needed scales roughly as 1/r, confirming the expected trade‑off between speed and accuracy. For the two‑layer committee, reinforced BP fails to break the hidden‑unit symmetry and its performance drops sharply (α≈0.55). In contrast, reinforced MS maintains a higher capacity (α≈0.68) and converges reliably, demonstrating that the MS framework naturally handles symmetry breaking through the reinforcement and asymmetric noise.
In conclusion, the paper shows that Max‑Sum, equipped with reinforcement and a small symmetry‑breaking perturbation, provides a powerful alternative to BP for discrete‑weight neural network training. It achieves the same asymptotic computational complexity as BP while avoiding the Gaussian approximation, and it works better on fully‑connected multi‑layer architectures where BP struggles. The authors suggest that the approach could be extended to more complex activation functions, deeper networks, and non‑concave priors, opening a promising direction for exact yet efficient combinatorial optimisation in machine learning.