Explicit Computation of Input Weights in Extreme Learning Machines

We present a closed-form expression for initializing the input weights in a multi-layer perceptron, which can be used as the first step in the synthesis of an Extreme Learning Machine. The expression is based on the standard form of a separating hyperplane as computed in multilayer perceptrons and linear Support Vector Machines; that is, a linear combination of input data samples. In the absence of supervised training for the input weights, random linear combinations of training data samples are used to project the input data to a higher-dimensional hidden layer. The output weights are solved in the standard ELM fashion by computing the pseudoinverse of the hidden-layer outputs and multiplying by the desired output values. All weights for this method can be computed in a single pass, and the resulting networks are more accurate and more consistent on some standard problems than regular ELM networks of the same size.


💡 Research Summary

The paper revisits the core assumption of Extreme Learning Machines (ELM) that input weights can be set randomly while only the output weights need to be learned. Although this random initialization yields ultra-fast training, it also introduces considerable variability in performance and ignores the structure of the training data. To address these drawbacks, the authors propose a closed-form method for computing the input weights as linear combinations of the training samples themselves. Specifically, each hidden-node weight vector $w_i$ is expressed as $w_i = \sum_{j=1}^{N}\alpha_{ij} x_j$, where $x_j$ are the input vectors and $\alpha_{ij}$ are scalar coefficients drawn from a simple distribution (e.g., uniform or Gaussian). This formulation mirrors the representation of separating hyperplanes in linear Support Vector Machines, where the optimal hyperplane is also a linear combination of support vectors. By grounding the input weights in actual data, the hidden-layer activation matrix $H$ inherits better spectral properties: its condition number is reduced, leading to more stable pseudoinverse computation and less susceptibility to over-fitting.
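The weight construction above can be sketched in a few lines of NumPy. The shapes, the Gaussian choice for the coefficients $\alpha_{ij}$, and all variable names are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N training samples of dimension d, L hidden nodes.
N, d, L = 200, 8, 50
X = rng.standard_normal((N, d))   # training inputs, one sample per row

# Coefficients alpha_ij drawn from a simple distribution (Gaussian here).
# Each hidden-node weight vector w_i = sum_j alpha_ij * x_j is a row of W,
# so the whole input-weight matrix is a single matrix product.
alpha = rng.standard_normal((L, N))
W = alpha @ X                     # shape (L, d): data-driven input weights
```

A standard ELM would instead draw `W` i.i.d. from a fixed distribution, e.g. `W = rng.standard_normal((L, d))`, with no reference to `X`.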

After constructing $H$ with a standard activation function (sigmoid, ReLU, etc.), the output weights $\beta$ are obtained in the usual ELM fashion via a single-pass least-squares solution $\beta = H^{\dagger} T$, where $T$ denotes the target matrix and $H^{\dagger}$ the Moore-Penrose pseudoinverse. Because the input weights already encode information about the data distribution, the resulting $H$ captures the most discriminative directions, improving the expressive power of the hidden layer without increasing the number of neurons.
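Putting both steps together, a single-pass training routine is a short function. This is a minimal sketch under stated assumptions: the `1/sqrt(N)` scaling of the data-driven weights is a normalization I add to keep the sigmoid out of saturation (the paper does not specify one), and the toy regression target is purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_elm(X, T, L, rng, data_driven=True):
    """Single-pass ELM fit; returns input weights W, biases b, output weights beta."""
    N, d = X.shape
    if data_driven:
        # w_i = sum_j alpha_ij x_j; scaled by 1/sqrt(N) (assumed normalization)
        W = rng.standard_normal((L, N)) @ X / np.sqrt(N)
    else:
        # standard ELM: purely random input weights
        W = rng.standard_normal((L, d))
    b = rng.standard_normal(L)
    H = sigmoid(X @ W.T + b)          # hidden-layer activations, shape (N, L)
    beta = np.linalg.pinv(H) @ T      # least-squares solution beta = pinv(H) @ T
    return W, b, beta

# Toy regression: learn y = sum of the inputs.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 5))
T = X.sum(axis=1, keepdims=True)
W, b, beta = train_elm(X, T, L=100, rng=rng)
pred = sigmoid(X @ W.T + b) @ beta
print(float(np.mean((pred - T) ** 2)))   # training MSE, typically very small
```

Note that only `np.linalg.pinv` touches an $N \times L$ system; the extra `alpha @ X` product is the small added cost the paper describes.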

The authors evaluate the method on a suite of benchmark regression and classification problems, including Boston Housing, Concrete Strength, MNIST digit recognition, ISOLET speech letters, and several UCI classification sets. For each dataset they vary the hidden-node count (e.g., 500, 1000, 2000) and compare three configurations: (1) standard ELM with purely random input weights, (2) the proposed data-driven input weights using uniform coefficients, and (3) the same but with Gaussian-distributed coefficients. Across all experiments the proposed approach consistently outperforms the baseline in terms of mean accuracy (or lower mean squared error for regression) and exhibits dramatically lower standard deviations, indicating far higher repeatability. The performance gain is especially pronounced on high-dimensional image data, where the data-driven projection effectively performs an implicit dimensionality reduction. In terms of computational cost, the additional step of forming the linear combinations adds only an $O(NL)$ matrix multiplication, which is negligible compared to the pseudoinverse step; on GPU hardware the total training time increases by less than five percent.

Beyond empirical results, the paper provides a theoretical link between ELM and linear SVM: the coefficient matrix $\alpha$ plays a role analogous to the Lagrange multipliers in SVM, suggesting that regularization strategies from the SVM literature (e.g., sparsity-inducing L1 penalties) could be directly applied to refine the input-weight construction. The authors discuss several promising extensions: (i) kernelizing the linear combination to capture non-linear relationships, (ii) learning the coefficients $\alpha_{ij}$ through convex optimization rather than sampling them randomly, and (iii) integrating the method into online or incremental ELM frameworks where the input-weight matrix can be updated as new data arrive.

In conclusion, the study demonstrates that initializing ELM input weights with explicit, data‑driven linear combinations preserves the hallmark “single‑pass” training speed while substantially improving accuracy, stability, and numerical robustness. This bridges a gap between the ultra‑fast but stochastic nature of traditional ELM and the deterministic, data‑aware formulations of kernel methods, opening avenues for more reliable deployment of ELMs in resource‑constrained or real‑time applications.