Using Feature Weights to Improve Performance of Neural Networks


Different features have different relevance to a particular learning problem: some are less relevant, while others are very important. Instead of selecting the most relevant features via feature selection, an algorithm can be given this knowledge of feature importance, based on expert opinion or prior learning. Learning can be faster and more accurate if learners take feature importance into account. Correlation-Aided Neural Networks (CANN) is such an algorithm. CANN treats feature importance as the correlation coefficient between the target attribute and each feature, and modifies the standard feed-forward neural network to fit both the correlation values and the training data. Empirical evaluation shows that CANN is faster and more accurate than the two-step approach of feature selection followed by a standard learning algorithm.


💡 Research Summary

The paper introduces Correlation-Aided Neural Networks (CANN), a method that incorporates prior knowledge about feature relevance directly into the training of feed-forward neural networks. Rather than performing a separate feature-selection step, the authors treat feature importance as the Pearson correlation coefficient between each input feature \(X_k\) and the target variable \(Y\). These correlation values, either obtained from domain experts or computed from a separate dataset, are used as target values that the network must reproduce during learning.
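To make the importance prior concrete, here is a minimal sketch of computing per-feature Pearson correlations against the target. The function and variable names are illustrative, not taken from the paper's code:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: feature 0 tracks the target closely, feature 1 is noise-like.
X = [[1.0, 3.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
Y = [1.1, 2.0, 2.9, 4.2]

# One importance prior rho_k per feature column.
rho = [pearson([row[k] for row in X], Y) for k in range(2)]
```

In CANN these \(\rho_k\) values would serve as the prescribed correlation targets during training.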

Formally, the standard mean-squared error loss
\[
E_D = \tfrac{1}{2}\sum_d (y_d - \hat y_d)^2
\]
is augmented with a correlation-error term
\[
E_c = \tfrac{1}{2}\sum_{k} (\rho_k - \hat\rho_k)^2,
\]
where \(\rho_k\) is the prescribed importance (the true correlation) and \(\hat\rho_k\) is the sample correlation between the current network output \(\hat Y\) and feature \(X_k\). The total loss becomes \(E = E_D + \lambda E_c\), with \(\lambda\) controlling the trade-off between fitting the data and matching the desired correlations.
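The combined objective can be sketched directly from these definitions. This is a plain restatement of \(E = E_D + \lambda E_c\), with `lambda_` as an illustrative hyper-parameter name:

```python
def data_error(y_true, y_pred):
    """E_D: half the sum of squared prediction errors."""
    return 0.5 * sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))

def correlation_error(rho_target, rho_hat):
    """E_c: half the sum of squared correlation mismatches over features."""
    return 0.5 * sum((r - rh) ** 2 for r, rh in zip(rho_target, rho_hat))

def total_loss(y_true, y_pred, rho_target, rho_hat, lambda_=0.5):
    """E = E_D + lambda * E_c, the CANN training objective."""
    return data_error(y_true, y_pred) + lambda_ * correlation_error(rho_target, rho_hat)
```

With \(\lambda = 0\) this reduces to ordinary MLP training; larger \(\lambda\) pushes the network's output correlations toward the prescribed importances.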

During back-propagation the gradient of the new loss is computed as
\[
\Delta w = -\eta\Big(\frac{\partial E_D}{\partial w} + \lambda\frac{\partial E_c}{\partial w}\Big).
\]
The derivative \(\partial E_c/\partial w\) requires the gradients of the sample means and covariances that define \(\hat\rho_k\). To keep computational cost low, the authors update the means and covariances incrementally for each training instance, using classic online formulas:
\[
\mu_X^{(\text{new})} = \mu_X^{(\text{old})} + \frac{x - \mu_X^{(\text{old})}}{n},
\]
and similarly for \(\mu_Y\) and the covariance \(\sigma_{XY}\). This yields an overall time complexity of \(O(NK)\) (\(N\) = number of instances, \(K\) = number of features), identical to a standard multilayer perceptron (MLP). Memory overhead is limited to a table storing the current means, covariances, and the usual weight matrices.

CANN is positioned against the earlier Importance‑Aided Neural Network (IANN), which only modified learning rates and weight initialisation based on heuristic importance scores. IANN lacked a solid theoretical foundation, whereas CANN explicitly treats importance as a statistical target and integrates it into the loss function, providing a principled optimization problem.

The experimental evaluation uses five publicly available datasets that are known to be challenging due to high dimensionality, class imbalance, or limited sample size: Soybean‑Large (35 features, 68 instances), Spambase (58 features, 4601 instances), Promoter Gene Sequences (58 features, 936 instances), Cardiac Arrhythmia (279 features, 452 instances), and Annealing (38 features, 798 instances). For each dataset the authors perform 10‑fold cross‑validation with a 50 % hold‑out test set. They compare CANN against a standard MLP, C4.5 decision trees, Support Vector Machines (SVM), k‑Nearest Neighbours, and Naïve Bayes. Results (Table 1) show that CANN achieves the highest accuracy on four of the five datasets and is very close on the fifth, with improvements ranging from about 2 % to 4 % over the plain MLP. Moreover, CANN converges in fewer epochs, indicating that the correlation term guides the weight updates toward more informative directions early in training.

The authors discuss several advantages of CANN: (1) it leverages external knowledge to boost performance when training data are scarce; (2) it requires no architectural changes, making it easy to plug into existing neural-network pipelines; (3) its computational overhead is modest because of the incremental statistics. They also acknowledge limitations: the method depends on reliable correlation estimates; noisy or mis-specified importance values can mislead learning; the scaling hyper-parameter \(\lambda\) (and an optional scaling constant \(p\)) must be tuned; and Pearson correlation captures only linear relationships, potentially overlooking non-linear feature relevance.

Future work suggested includes automatic tuning of \(\lambda\) and \(p\), extending the approach to non-linear dependence measures (e.g., Spearman rank correlation or mutual information), and applying the same principle to deeper architectures such as convolutional or recurrent networks.

In conclusion, CANN provides a theoretically grounded, empirically validated framework for embedding feature importance—expressed as correlation coefficients—into neural‑network training. By jointly minimizing data error and correlation error, it achieves faster convergence and higher predictive accuracy, especially in domains where labeled data are limited but expert knowledge about feature relevance is available. This makes CANN a promising tool for a wide range of real‑world machine‑learning applications.

