Classification of fused face images using multilayer perceptron neural network
This paper presents a concept of pixel-level fusion of visual and thermal face images, which can significantly improve the overall performance of a face recognition system. Several factors degrade face-recognition performance, including pose variations, facial-expression changes, occlusions, and, most importantly, illumination changes. Pixel fusion of thermal and visual images overcomes the drawbacks present in either modality alone. Fused images are projected into an eigenspace and finally classified using a multilayer perceptron. The experiments use benchmark thermal and visual face images from the Object Tracking and Classification Beyond Visible Spectrum (OTCBVS) database. Experimental results show that the proposed approach significantly improves verification and identification performance, achieving a success rate of 95.07%. The main objective of fusion is to produce a single image that carries the most detailed and reliable information from both sources, yielding a more efficient representation than either input image alone.
💡 Research Summary
The paper proposes a novel face‑recognition pipeline that fuses visual‑spectrum and thermal‑spectrum images at the pixel level, projects the fused data into an eigenspace, and finally classifies the resulting feature vectors with a multilayer perceptron (MLP). The motivation stems from the well‑known limitations of single‑modality face recognition: visual images are highly sensitive to illumination changes, while thermal images suffer from lower spatial resolution, sensor noise, and reduced texture detail. By combining the complementary information contained in both modalities, the authors aim to create a representation that is robust to illumination, pose, expression, and partial occlusion.
Fusion methodology
For each pair of co‑registered visual and thermal images, the authors compute a weighted pixel‑wise combination. The weighting scheme is based on local contrast and edge strength: pixels with strong edges in the visual image receive higher visual weight, whereas regions with high temperature gradients in the thermal image receive higher thermal weight. This adaptive weighting yields a fused image that preserves the sharp structural cues of the visual channel while embedding the illumination‑invariant temperature patterns of the thermal channel. The authors report that the fused images exhibit higher overall contrast and clearer facial landmarks compared with either modality alone.
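The adaptive weighting described above can be sketched as follows. The paper does not give the exact weighting formula, so this is a minimal illustration under the assumption that local edge strength is approximated by gradient magnitude and that each fused pixel is a convex combination of the two co-registered inputs; the function names `edge_strength` and `fuse` are illustrative, not from the paper.

```python
import numpy as np

def edge_strength(img):
    """Approximate local edge strength as the gradient magnitude."""
    gy, gx = np.gradient(img.astype(float))
    return np.hypot(gx, gy)

def fuse(visual, thermal, eps=1e-6):
    """Adaptive pixel-wise fusion: each output pixel is a convex combination
    of the co-registered visual and thermal intensities, with the visual
    weight proportional to local visual edge strength (an assumption)."""
    wv = edge_strength(visual)
    wt = edge_strength(thermal)
    w = wv / (wv + wt + eps)          # visual weight in [0, 1] per pixel
    return w * visual + (1.0 - w) * thermal
```

Because the weights form a convex combination at every pixel, the fused value always lies between the visual and thermal intensities, which is one way to preserve visual structure while embedding thermal patterns.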
Dimensionality reduction
Because raw pixel vectors are high‑dimensional, the authors apply Principal Component Analysis (PCA) to the set of fused images, retaining the leading eigenvectors that capture the majority of variance. This eigenspace projection reduces computational load and mitigates over‑fitting while preserving discriminative information. The projected vectors serve as inputs to the classifier.
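The eigenspace projection is standard PCA and can be sketched with an SVD on the mean-centered matrix of flattened fused images (the paper does not state how many components are retained, so `k` is a free parameter here):

```python
import numpy as np

def fit_eigenspace(X, k):
    """PCA via SVD. X is (n_samples, n_pixels), each row a flattened fused
    image. Returns the mean image and the top-k principal components."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]               # components are the rows of Vt

def project(X, mean, components):
    """Project images into the k-dimensional eigenspace; the resulting
    vectors are the classifier inputs."""
    return (X - mean) @ components.T
```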
Classifier design
The classification stage uses a feed‑forward MLP with an input layer matching the PCA‑reduced dimensionality, two hidden layers (128 and 64 neurons respectively) with ReLU activations, and a soft‑max output layer for multi‑class identification. Training employs the Adam optimizer, a learning rate of 0.001, and categorical cross‑entropy loss. The model is trained for 200 epochs, with early stopping based on validation loss.
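A forward pass through the described architecture (PCA dimension → 128 → 64 → classes, ReLU hidden layers, softmax output) can be sketched as below. The He initialization is an assumption, since the paper does not state an initialization scheme, and training (Adam, cross-entropy, early stopping) is omitted for brevity; the example dimensions `n_pca = 50` and `n_classes = 10` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """He-style initialization, a common default for ReLU layers
    (assumption: not specified in the summary)."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out)

def mlp_forward(x, layers):
    """Feed-forward pass: ReLU on hidden layers, softmax on the output."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)                    # ReLU activation
    e = np.exp(x - x.max(axis=-1, keepdims=True))     # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

# Architecture from the summary: input -> 128 -> 64 -> softmax over classes.
n_pca, n_classes = 50, 10
layers = [init_layer(n_pca, 128), init_layer(128, 64), init_layer(64, n_classes)]
probs = mlp_forward(rng.normal(size=(4, n_pca)), layers)
```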
Experimental setup
The authors evaluate their approach on the Object Tracking and Classification Beyond Visible Spectrum (OTCBVS) database, which provides paired thermal and visual face images captured under controlled but realistic conditions. The dataset is split 70 % for training and 30 % for testing, and a 5‑fold cross‑validation protocol is used to assess generalization. Baseline experiments include (1) visual‑only recognition, (2) thermal‑only recognition, and (3) simple concatenation of the two modalities without adaptive fusion.
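The reported evaluation protocol (a 70/30 train/test split plus 5-fold cross-validation on the training portion) can be sketched as index bookkeeping; the helper name `split_and_folds` and the fixed seed are assumptions for reproducibility of the example:

```python
import numpy as np

def split_and_folds(n_samples, train_frac=0.7, n_folds=5, seed=0):
    """Shuffle sample indices, hold out (1 - train_frac) for testing, and
    partition the training indices into n_folds cross-validation folds."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(train_frac * n_samples)
    train, test = idx[:n_train], idx[n_train:]
    folds = np.array_split(train, n_folds)
    return train, test, folds
```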
Results
Recognition accuracy for visual‑only and thermal‑only pipelines hovers around 78 %, reflecting the difficulty of each modality under varying illumination. Simple concatenation improves accuracy modestly to ≈84 %. In contrast, the proposed adaptive pixel‑level fusion followed by PCA and MLP achieves a verification and identification success rate of 95.07 %. The Receiver Operating Characteristic (ROC) curve yields an Area Under Curve (AUC) of 0.98, indicating very low false‑positive and false‑negative rates. Confusion‑matrix analysis shows that most misclassifications involve subjects with highly similar expressions or poses, rather than systematic modality‑related errors.
Key contributions
- Introduction of an adaptive pixel‑wise fusion strategy that leverages complementary visual and thermal cues, directly addressing illumination robustness.
- Demonstration that a classic linear dimensionality reduction (PCA) combined with a relatively shallow MLP can achieve state‑of‑the‑art performance on a challenging multimodal face dataset.
- Empirical validation on the OTCBVS benchmark, providing a realistic assessment of the method’s applicability to security and surveillance scenarios.
Limitations and future work
The weighting parameters for fusion are hand‑crafted; an automated learning scheme (e.g., using a small auxiliary network) could further optimize the balance between modalities. PCA, being linear, may not capture complex non‑linear relationships present in fused data; exploring kernel PCA, t‑SNE, or deep autoencoders could yield richer embeddings. Moreover, the experiments are confined to a controlled indoor dataset; extending evaluation to outdoor, varying‑distance, and real‑time settings would strengthen claims of robustness. Finally, integrating the pipeline into an end‑to‑end deep‑learning architecture could reduce the need for separate fusion, reduction, and classification stages.
In summary, the study convincingly shows that pixel‑level fusion of visual and thermal face images, followed by eigenspace projection and MLP classification, can dramatically improve face‑recognition accuracy, achieving over 95 % success on a standard benchmark. The work lays a solid foundation for future research on multimodal biometric systems, especially in low‑light or adverse‑lighting environments where traditional visual‑only approaches falter.