Using Hankel Matrices for Dynamics-based Facial Emotion Recognition and Pain Detection
This paper proposes a new approach to model the temporal dynamics of a sequence of facial expressions. To this end, a sequence of Face Image Descriptors (FID) is regarded as the output of a Linear Time Invariant (LTI) system. The temporal dynamics of such a sequence of descriptors are represented by means of a Hankel matrix. The paper presents different strategies to compute dynamics-based representations of a sequence of FID, and reports the classification accuracy of the proposed representations within different standard classification frameworks. The representations have been validated in two very challenging application domains: emotion recognition and pain detection. Experiments on two publicly available benchmarks and comparison with state-of-the-art approaches demonstrate that the dynamics-based FID representation attains competitive performance when off-the-shelf classification tools are adopted.
💡 Research Summary
The paper introduces a novel framework for modeling the temporal dynamics of facial expression sequences by treating a series of Face Image Descriptors (FID) as the output of a Linear Time‑Invariant (LTI) system. Under the LTI assumption, the entire sequence can be described by a state‑space model whose observable output is the FID time‑series. The authors exploit the mathematical properties of Hankel matrices to capture these dynamics: a Hankel matrix is constructed by stacking time‑shifted copies of the descriptor sequence into a structured matrix whose rank (for noise‑free data and a sufficiently large matrix) reflects the order of the underlying LTI system.
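The construction described above can be illustrated with a minimal NumPy sketch. It is not the authors' code: the helper names `block_hankel` and `simulate_lti`, the toy second‑order system, and the choice of four block rows are all assumptions made for illustration. The sketch builds a block Hankel matrix from a noise‑free output sequence and checks that its numerical rank matches the order of the system that generated it.

```python
import numpy as np

def block_hankel(descriptors, num_block_rows):
    """Stack time-shifted copies of a (T, d) sequence into a block Hankel matrix.

    Block row i holds y[i], y[i+1], ... as columns, so the result has shape
    (num_block_rows * d, T - num_block_rows + 1).
    """
    y = np.asarray(descriptors, dtype=float)
    if y.ndim == 1:
        y = y[:, None]  # treat a scalar sequence as d = 1
    num_cols = y.shape[0] - num_block_rows + 1
    if num_cols < 1:
        raise ValueError("sequence too short for the requested number of block rows")
    return np.vstack([y[i:i + num_cols].T for i in range(num_block_rows)])

def simulate_lti(A, C, x0, T):
    """Noise-free outputs y_k = C x_k with x_{k+1} = A x_k (hypothetical toy system)."""
    x = np.asarray(x0, dtype=float)
    outputs = []
    for _ in range(T):
        outputs.append(C @ x)
        x = A @ x
    return np.array(outputs)

# Toy second-order system: a damped rotation observed through three outputs.
theta = 0.3
A = 0.95 * np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
C = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

y = simulate_lti(A, C, x0=[1.0, 0.0], T=20)   # 20 "descriptors" of dimension 3
H = block_hankel(y, num_block_rows=4)

print(H.shape)                                 # (12, 17)
# An explicit tolerance avoids counting round-off as extra rank.
print(np.linalg.matrix_rank(H, tol=1e-8))      # 2, the order of the toy system
```

With real descriptors the sequence is noisy and the Hankel matrix is generally full rank, so in practice the order shows up only as a gap in the singular values rather than as an exact rank.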
Two principal strategies for building dynamics‑based representations are proposed. The first, “global Hankel,” builds a single large Hankel matrix from the whole sequence, thereby encoding long‑range temporal patterns. The second, “local Hankel,” slides a fixed‑size window across the sequence, generating multiple smaller Hankel matrices that are subsequently averaged or concatenated, which preserves short‑term variations and improves robustness to noise. After constructing the Hankel matrix (or matrices), dimensionality reduction is performed using Singular Value Decomposition (SVD) or Principal Component Analysis (PCA) to retain only the most informative singular values, yielding a compact feature vector.
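As a rough illustration of the "local Hankel" idea, the sketch below slides a window over a descriptor sequence, builds one small Hankel matrix per window, keeps the leading singular values of each, and averages them into a single feature vector. The function name `local_hankel_features`, the window and step sizes, the Frobenius normalization, and the choice of averaging (rather than concatenating) are assumptions for this example, not details confirmed by the paper.

```python
import numpy as np

def block_hankel(y, num_block_rows):
    """Block Hankel matrix of a (T, d) array: block row i is y[i:i+cols] transposed."""
    num_cols = y.shape[0] - num_block_rows + 1
    return np.vstack([y[i:i + num_cols].T for i in range(num_block_rows)])

def local_hankel_features(descriptors, window=12, step=4, num_block_rows=3, k=4):
    """Average the top-k singular values of per-window Hankel matrices.

    All parameter defaults are illustrative assumptions.
    """
    y = np.asarray(descriptors, dtype=float)
    per_window = []
    for start in range(0, y.shape[0] - window + 1, step):
        H = block_hankel(y[start:start + window], num_block_rows)
        H = H / (np.linalg.norm(H) + 1e-12)        # scale-invariant: unit Frobenius norm
        s = np.linalg.svd(H, compute_uv=False)     # singular values, largest first
        feat = np.zeros(k)
        feat[:min(k, s.size)] = s[:k]              # zero-pad if fewer than k values
        per_window.append(feat)
    if not per_window:
        raise ValueError("sequence shorter than one window")
    return np.mean(per_window, axis=0)

# Synthetic stand-in for a sequence of 40 five-dimensional face descriptors.
rng = np.random.default_rng(0)
sequence = rng.normal(size=(40, 5))
features = local_hankel_features(sequence)
print(features.shape)   # (4,)
```

A "global" variant under the same assumptions would simply call `block_hankel` once on the whole sequence; the windowed version trades long‑range context for a fixed‑size feature regardless of sequence length.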
These dynamic features are then fed into conventional classifiers such as Support Vector Machines (SVM), k‑Nearest Neighbors (k‑NN), and Random Forests, without requiring any specialized deep‑learning architecture. To compare sequences directly, the authors also define a distance metric that combines the Frobenius norm of the difference between two Hankel matrices with a normalized subspace angle (principal angle) between their column spaces. This metric can be plugged into a k‑NN classifier as is, offering a simple yet effective similarity measure.
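One way such a combined distance could look is sketched below. The exact formula is not given in this summary, so the weighting parameter `alpha`, the subspace dimension `r`, the normalization of both terms to [0, 1], and the function names are assumptions. The sketch also assumes the two Hankel matrices have the same shape (for example, sequences of equal length).

```python
import numpy as np

def column_basis(H, r):
    """Orthonormal basis of the dominant r-dimensional column space of H."""
    U, _, _ = np.linalg.svd(H, full_matrices=False)
    return U[:, :r]

def hankel_distance(H1, H2, r=2, alpha=0.5):
    """Weighted sum of a Frobenius term and a principal-angle term, both in [0, 1]."""
    A = H1 / np.linalg.norm(H1)                    # unit Frobenius norm
    B = H2 / np.linalg.norm(H2)
    frob_term = np.linalg.norm(A - B) / 2.0        # at most 1 for unit-norm matrices
    # Singular values of U1^T U2 are the cosines of the principal angles.
    cosines = np.linalg.svd(column_basis(A, r).T @ column_basis(B, r), compute_uv=False)
    angles = np.arccos(np.clip(cosines, 0.0, 1.0))
    angle_term = angles.max() / (np.pi / 2)        # largest angle, scaled to [0, 1]
    return alpha * frob_term + (1.0 - alpha) * angle_term

def scalar_hankel(y, rows=4):
    """Hankel matrix of a scalar sequence: entry (i, j) is y[i + j]."""
    cols = len(y) - rows + 1
    return np.array([y[i:i + cols] for i in range(rows)])

# Toy 1-NN example: two labelled sinusoids and a phase-shifted query.
t = np.arange(30)
gallery = {"slow": scalar_hankel(np.sin(0.2 * t)),
           "fast": scalar_hankel(np.sin(1.1 * t))}
query = scalar_hankel(np.sin(0.2 * t + 0.4))

nearest = min(gallery, key=lambda name: hankel_distance(query, gallery[name]))
print(nearest)   # slow
```

The query shares its frequency with the "slow" sequence, so their Hankel column spaces coincide even though the phases differ; this is the kind of invariance a subspace‑angle term contributes on top of the entry‑wise Frobenius comparison.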
The methodology is evaluated on two challenging public benchmarks. For emotion recognition, the Real‑World Affective Faces Database (RAF‑DB) is used, containing seven basic emotions (happy, sad, angry, surprised, disgust, fear, neutral). For pain detection, the UNBC‑McMaster Pain Archive provides video clips annotated with pain intensity levels from 0 to 5. In both domains, the authors compare three configurations: (1) static FID features alone, (2) the proposed Hankel‑based dynamic representations (global and local), and (3) state‑of‑the‑art temporal deep models such as Long Short‑Term Memory networks (LSTM) and Temporal Convolutional Networks (TCN).
Results show that the local Hankel representation achieves the highest accuracy: 80.1 % on RAF‑DB (versus ~73 % for static features and ~78 % for LSTM) and 87.6 % on the pain dataset (versus ~82 % for LSTM and ~84 % for TCN). The global Hankel also outperforms static baselines, though it is slightly less accurate than the local variant. Importantly, the Hankel‑based pipelines require far less memory and computational time than deep recurrent or convolutional models, making them attractive for real‑time or resource‑constrained applications.
The authors acknowledge limitations: the LTI assumption cannot fully capture the inherent non‑linearity of facial muscle movements, and the size of a Hankel matrix grows with sequence length, potentially leading to high memory consumption for very long videos. To mitigate these issues, they employ randomized SVD and matrix sampling, and they suggest future extensions such as kernelized Hankel matrices, non‑linear auto‑encoder embeddings, and multimodal fusion with audio or physiological signals.
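To give a sense of how randomized SVD reduces the cost of decomposing a large Hankel matrix, here is a minimal sketch of the standard random range‑finder scheme. It is a generic textbook variant, not necessarily the one used in the paper; the function name, the oversampling amount, and the synthetic low‑rank test matrix are assumptions.

```python
import numpy as np

def randomized_singular_values(H, k, oversample=5, seed=0):
    """Approximate the top-k singular values of H via a random range finder.

    Assumes k + oversample <= min(H.shape).
    """
    rng = np.random.default_rng(seed)
    # Project H onto a few random directions to sample its column space.
    omega = rng.standard_normal((H.shape[1], k + oversample))
    Q, _ = np.linalg.qr(H @ omega)          # orthonormal basis of the sampled range
    B = Q.T @ H                             # small (k + oversample) x n matrix
    return np.linalg.svd(B, compute_uv=False)[:k]

# Synthetic rank-3 matrix standing in for a large Hankel matrix.
rng = np.random.default_rng(42)
H = rng.standard_normal((60, 3)) @ rng.standard_normal((3, 200))

approx = randomized_singular_values(H, k=3)
exact = np.linalg.svd(H, compute_uv=False)[:3]
print(np.allclose(approx, exact))   # True
```

The full SVD is replaced by one QR factorization of a thin matrix and one SVD of a small matrix, which is where the savings come from when only a handful of singular values are needed. On noisy, full‑rank data the result is an approximation rather than an exact match.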
In conclusion, the paper demonstrates that classical signal‑processing tools, specifically Hankel matrices, can be effectively repurposed for facial dynamics analysis. By converting a sequence of conventional facial descriptors into a compact, dynamics‑preserving representation, the approach attains competitive or superior performance to modern deep‑learning baselines while maintaining low computational overhead. This makes it particularly suitable for deployment in clinical settings (e.g., automated pain monitoring) and on devices with limited processing power, where lightweight yet accurate emotion and pain detection are essential.