Multimodal Functional Maximum Correlation for Emotion Recognition
📝 Original Info
- Title: Multimodal Functional Maximum Correlation for Emotion Recognition
- ArXiv ID: 2512.23076
- Date: 2025-12-28
- Authors: Deyang Zheng, Tianyi Zhang, Wenming Zheng, Shujian Yu
📝 Abstract
Emotional states manifest as coordinated and heterogeneous physiological responses across central and autonomic systems, posing a fundamental challenge for multimodal representation learning in affective computing. Learning such joint dynamics is further complicated by scarce and subjective affective annotations, motivating the use of self-supervised learning (SSL). However, most existing SSL approaches rely on pairwise alignment objectives, which are insufficient to characterize the joint dependencies among more than two modalities and fail to capture higher-order interactions arising from coordinated brain-autonomic responses. To move beyond this limitation, we propose Multimodal Functional Maximum Correlation (MFMC), a principled SSL framework that maximizes higher-order multimodal dependence via a Dual Total Correlation (DTC) objective. By deriving a tight sandwich bound and optimizing it with a functional maximum correlation analysis (FMCA)-based trace surrogate, MFMC directly captures joint multimodal interactions without resorting to pairwise contrastive losses. Experiments on three public affective computing benchmarks demonstrate that MFMC consistently achieves state-of-the-art or competitive performance under both subject-dependent and subject-independent protocols, highlighting its robustness to inter-subject variability. In particular, MFMC achieves substantial gains on CEAP-360VR, improving subject-dependent accuracy from 78.9% to 86.8% and subject-independent accuracy from 27.5% to 33.1% using the EDA signal alone, and remains highly competitive within 0.8 percentage points of the best-performing method on the most challenging EEG subject-independent split of MAHNOB-HCI. Our code is available at https://github.com/DY9910/MFMC.
📄 Full Content
Traditional physiological emotion recognition systems are predominantly supervised and therefore rely on large collections of labeled affective data. In practice, ground truth labels are typically obtained through self-report instruments such as the Self-Assessment Manikin or dimensional scales based on the circumplex and PAD frameworks [16]- [19]. These ratings are labor-intensive to collect, require carefully controlled experimental protocols, and remain inherently subjective and context-dependent [20]- [23]. Participants may further modulate or reinterpret their emotional reports across trials, sessions, and studies, introducing label noise and reducing the comparability of datasets. As a result, supervised models are often trained on relatively small and idiosyncratic corpora with limited standardization, which leads to poor generalization across datasets, recording setups, and populations [24]- [26].
A further limitation lies in how existing models exploit multimodal physiological dynamics. Although emotional episodes can elicit coordinated responses across central and peripheral systems [27], many approaches still rely on unimodal encoders [28] or simple feature-level fusion strategies, such as concatenation [29] or shallow late fusion [24], [30]. While recent studies have explored more sophisticated multimodal architectures using attention mechanisms [15] or hypercomplex representations [31], [32], these methods often operate on pre-engineered features and primarily model pairwise correlations [33]. As a result, they fail to explicitly capture the synchronous and asynchronous dependencies between central neural activity and peripheral autonomic responses that characterize affective processing [34]. Consequently, much of the higher-order structure inherent in multichannel physiological data remains underutilized.
These challenges collectively motivate learning frameworks that reduce the reliance on explicit labels and can directly capture the intrinsic relationships among multiple heterogeneous modalities. Contrastive self-supervised learning (SSL) has therefore emerged as a promising direction for physiological representation learning. By treating different views of the same physiological event as positives, contrastive objectives encourage invariance to nuisance factors such as subject identity and session-specific noise [35], [36]. In multimodal settings, contrasting matched and mismatched central-peripheral segments further promotes cross-modal alignment [37].
Despite these advances, most existing SSL frameworks for physiological computing remain fundamentally limited by their reliance on pairwise contrastive formulations. General contrastive methods such as SimCLR [38], CPC [39], and CLIP [40], as well as their multimodal extensions including VATT [41], ImageBind [42], Symile [43], and Gramian-based models [44], were primarily developed for vision-language settings and typically assume dual-modality alignment. When applied to physiological signals, these frameworks are commonly instantiated as pairwise contrastive objectives between two modalities or between original and augmented views of a single modality. As a consequence, the learned representations are biased toward lower-order dependencies (e.g., between EEG and ECG segments), while overlooking both the intrinsic multichannel structure within each modality and the higher-order interactions that emerge when three or more physiological modalities jointly encode affective dynamics [45]-[48].
Recent physiological SSL approaches, such as Phys-ioSync [37] and cross-modal ECG-EEG alignment [49], demonstrate that contrastive and generative self-supervision can substantially improve emotion recognition performance. However, these methods either operate strictly in a pairwise setting or treat each modality as a single aggregated unit, without explicitly modeling how informative patterns are distributed across heterogeneous channels and modalities over time. Consequently, current SSL pipelines remain poorly aligned with the structured, multilevel dependence that characterizes multimodal physiological responses to emotion, leaving substantial room for approaches that can directly capture higher-order correlations across modalities.
Motivated by this observation, we propose a principled multimodal SSL framework that goes beyond pairwise alignment and explicitly maximizes the total dependence among multiple physiological modalities. Our approach does not rely on positive-negative sample construction or handcrafted data augmentations, and is designed to capture higher-order interactions that emerge when emotional stimuli jointly modulate central and peripheral physiological systems.
To summarize, our main contributions include:
• We introduce the first SSL framework for physiological emotion recognition that explicitly models higher-order multimodal dependence beyond pairwise contrastive objectives. By grounding multimodal alignment in dual total correlation (DTC) [50], [51], our framework naturally generalizes from tri-modal to arbitrarily many physiological modalities.
• We derive a principled and tractable optimization objective based on functional maximum correlation analysis (FMCA) [52], which enables stable estimation of joint multimodal dependence and avoids the numerical instability associated with eigenvalue decomposition and lower-bound-based contrastive objectives.
• Extensive experiments on three public benchmarks demonstrate that our approach consistently achieves state-of-the-art or competitive performance under both subject-dependent and subject-independent protocols, highlighting its robustness to inter-subject variability and its effectiveness in modeling complex physiological dynamics.
In this section, we first provide a brief review of existing supervised learning and contrastive SSL approaches for emotion recognition. We then discuss prior efforts in the vision and language domains that attempt to extend bi-modal contrastive SSL to multiple modalities, although none of these have yet been evaluated on physiological signals.
EEG is the most widely used modality for physiological emotion recognition due to its ability to capture neural dynamics associated with affective processing [46], [48]. Traditional EEG-based emotion recognition pipelines typically rely on handcrafted time-domain [53], [54] or frequency-domain [55]- [57] features combined with supervised classifiers [1]. While effective in controlled settings, such approaches depend heavily on expert-designed features and often struggle to generalize across subjects, sessions, and datasets.
Building on handcrafted features, early physiological emotion recognition pipelines typically employ classifiers such as support vector machines, k-nearest neighbors, or shallow neural networks [20], [30]. More recent studies have shifted toward deep representation learning, using convolutional, recurrent, or graph-based encoders to learn features end-to-end from raw or lightly preprocessed signals [28], [58]-[60]. In multimodal settings, these architectures have been extended to incorporate peripheral signals (e.g., EDA, BVP, ECG, EOG) through feature concatenation, attention-based fusion, or hypercomplex representations [15], [31], [34]. While these approaches improve flexibility and performance, they typically focus on pairwise fusion mechanisms and do not explicitly model higher-order dependencies across multiple physiological modalities.
Self-supervised learning (SSL) alleviates the need for explicit labels by leveraging surrogate objectives that exploit the intrinsic structure in the data [61]. For time series and physiological signals, contrastive SSL has become a dominant paradigm [45]-[48], enabling applications such as fMRI-based mental-disorder diagnosis [62], EEG-based sleep staging [47], and affect recognition [45], [46].
Current approaches often follow established SSL pipelines developed for vision tasks, such as SimCLR [38] and VICReg [36], and optimize contrastive losses like the InfoNCE [39], defined as:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{N} \mathbb{1}_{[k \neq i]}\, \exp\!\big(\mathrm{sim}(z_i, z_k)/\tau\big)},$$

where $z_i$ and $z_j$ are positive sample embeddings, $z_k$ denotes embeddings of negative samples in the batch of size $N$, $\mathrm{sim}(\cdot,\cdot)$ is a similarity function (e.g., cosine similarity), and $\tau$ is a temperature parameter. Unimodal contrastive SSL on physiological time series typically instantiates one of three families of collapse-avoidance principles: (i) contrastive objectives with negatives (e.g., SimCLR), which maximize agreement between two augmented views of the same sample [38]; (ii) redundancy reduction via cross-correlation matching (Barlow Twins) [35]; and (iii) variance-invariance-covariance regularization without negatives (VICReg) [36]. In EEG and related biosignals, these objectives are adapted with modality-aware augmentations or pretext design to respect temporal and multichannel structure, yielding strong representations for downstream tasks such as sleep staging and affect recognition [45]-[47]. For instance, GANSER couples adversarial augmentation with self-supervision to improve EEG emotion recognition [63], while VICReg-style training has been shown effective for EMG representation learning [64].
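To make the objective concrete, the following is a minimal PyTorch sketch of this InfoNCE loss. The function name, temperature default, and batching details are illustrative assumptions, not the implementation used by the cited methods.

```python
# Minimal sketch of the InfoNCE objective above (illustrative only). Embeddings are
# L2-normalized so that the dot product acts as cosine similarity; row i of `z_a`
# is positive with row i of `z_b`, and every other row in the batch is a negative.
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    z_a = F.normalize(z_a, dim=-1)          # (N, D)
    z_b = F.normalize(z_b, dim=-1)          # (N, D)
    logits = z_a @ z_b.t() / tau            # (N, N) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Diagonal entries are positives; off-diagonal entries act as negatives.
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    # Example: two augmented views of a batch of 10-s physiological windows.
    view1, view2 = torch.randn(64, 128), torch.randn(64, 128)
    print(info_nce(view1, view2).item())
```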
In physiological emotion recognition, contrastive SSL is particularly appealing because multimodal recordings naturally provide intrinsic supervision signals [61]. Temporal correspondences within a modality and synchronous correspondences across modalities (e.g., EEG-ECG or EEG-EDA) can be leveraged to define positive pairs without requiring explicit affect labels [38]- [40]. By exploiting these relationships as training signals, contrastive SSL alleviates the labeling burden and subjectivity associated with self-reported affect annotations.
Beyond label efficiency, SSL enables representation learning directly from raw or lightly processed physiological signals, reducing reliance on handcrafted feature engineering [45]- [48]. Pretraining on large-scale unlabeled recordings encourages the learned representations to capture structures that are stable across subjects, sessions, and acquisition conditions, thereby improving robustness to noise and inter-subject variability [25], [26]. Moreover, contrastive objectives can promote invariance to nuisance factors while encouraging alignment between central and peripheral modalities, making SSL a natural fit for modeling multimodal physiological dynamics during emotional processing [35]- [37].
Recent studies in physiological emotion recognition provide concrete evidence of this potential: Wu et al. [49] align ECG with EEG-derived features using sequence- and patch-level InfoNCE and report binary arousal/valence accuracies of 0.892/0.879 on DREAMER [65] and 0.849/0.834 on AMIGOS [23], exemplifying a typical bi-modal, pairwise setup. Similarly, Zhang et al. propose GANSER [66], a GAN-based self-supervised data augmentation framework for EEG-based emotion recognition, which attains state-of-the-art performance of 93.52%/94.21% accuracy on the binary valence/arousal classification tasks of the DEAP dataset [21]. These approaches demonstrate that contrastive and generative self-supervision can exploit physiological recordings as a rich supervision source and substantially reduce annotation requirements. However, they still focus on pairwise modality alignment or single-modality augmentation and do not explicitly model the intrinsic multichannel dependencies among heterogeneous signals.
Multimodal contrastive learning has achieved remarkable success in vision-language domains. For a batch of paired samples $(x_i, y_i)$ from two modalities, where $x_i$ and $y_i$ are encoded, respectively, by networks $f_\theta$ and $g_\phi$, methods like CLIP [40] aim to maximize similarity between corresponding pairs (positive samples) and minimize similarity with all other pairs (negative samples). The CLIP objective can be expressed as a symmetric InfoNCE-like loss. When $x$ is treated as the anchor modality, the loss is:

$$\mathcal{L}_{x \to y} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\!\big(\mathrm{sim}(f_\theta(x_i), g_\phi(y_i))/\tau\big)}{\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(f_\theta(x_i), g_\phi(y_j))/\tau\big)},$$

and similarly for the reverse direction $y \to x$. The final loss is the average of the two directions:

$$\mathcal{L}_{\mathrm{CLIP}} = \tfrac{1}{2}\left(\mathcal{L}_{x \to y} + \mathcal{L}_{y \to x}\right).$$
Although highly effective for bi-modal alignment, this paradigm fundamentally decomposes multimodal learning into collections of pairwise objectives. In affective computing, CLIP-style losses have been applied to align EEG with peripheral signals such as ECG or EOG [15], [31], [67], [68], and similar formulations have been extended to additional modalities in vision-language settings, including tri-modal video-audio-text models (VATT) [41] and broad cross-modal embedding spaces (ImageBind) [42].
In practice, these approaches typically optimize sums of pairwise losses across modality pairs, implicitly assuming that higher-order multimodal structure can be recovered from pairwise alignment alone [41], [42]. For example, in vision, language, and other multimedia domains, extending CLIP to three modalities is commonly implemented by summing all pairwise CLIP objectives [41], [42], [69], leading to:

$$\mathcal{L}_{\mathrm{CLIP++}} = \mathcal{L}_{\mathrm{CLIP}}(X_1, X_2) + \mathcal{L}_{\mathrm{CLIP}}(X_1, X_3) + \mathcal{L}_{\mathrm{CLIP}}(X_2, X_3).$$
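For illustration, the sketch below implements the symmetric bi-modal CLIP loss and its naive tri-modal extension by summing all pairwise losses. Function names and the temperature value are assumptions for this example, not the cited models' code.

```python
# Sketch of the symmetric CLIP objective and the pairwise tri-modal extension
# ("CLIP++") discussed above. Illustrative only.
import itertools
import torch
import torch.nn.functional as F

def clip_loss(x_emb: torch.Tensor, y_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings from two modalities."""
    x = F.normalize(x_emb, dim=-1)
    y = F.normalize(y_emb, dim=-1)
    logits = x @ y.t() / tau
    targets = torch.arange(x.size(0), device=x.device)
    loss_xy = F.cross_entropy(logits, targets)        # x -> y direction
    loss_yx = F.cross_entropy(logits.t(), targets)    # y -> x direction
    return 0.5 * (loss_xy + loss_yx)

def clip_plus_plus(embeddings, tau: float = 0.07) -> torch.Tensor:
    """Sum of pairwise CLIP losses over all modality pairs."""
    return sum(clip_loss(a, b, tau) for a, b in itertools.combinations(embeddings, 2))

if __name__ == "__main__":
    e1, e2, e3 = (torch.randn(64, 128) for _ in range(3))
    print(clip_plus_plus([e1, e2, e3]).item())
```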
Such pairwise decompositions overlook higher-order dependencies among modalities, as joint interactions that only emerge when multiple modalities are considered simultaneously cannot be recovered from sums of pairwise terms. From an information-theoretic perspective, this limitation is fundamental: pairwise criteria may fail to capture synergistic interactions that arise only when three or more modalities are jointly observed [70]- [72]. This insight has motivated recent efforts toward multi-way alignment objectives, including SymILE [43], Gramian volume-based alignment [44], and analyses of what to align in multimodal contrastive learning [73]. Nevertheless, existing approaches still lack a principled and tractable objective for modeling higher-order dependence across heterogeneous modalities. To our knowledge, no prior work has systematically addressed the alignment of more than two physiological modalities for emotion recognition.
To address these limitations, we introduce a principled alternative that explicitly models higher-order multimodal dependence by analyzing joint interactions among multiple modalities through the lens of dual total correlation (DTC). To make this formulation tractable in practice, we further leverage functional maximum correlation analysis (FMCA) [52] to enable efficient and scalable estimation of such joint dependence.
We consider the setting where $M$ physiological modalities $X_1, X_2, \ldots, X_M$ are available, and the goal is to learn generalizable representations that reside in a shared latent space and can effectively transfer to downstream tasks. In emotion recognition, typical physiological modalities include EEG, ECG, and peripheral signals such as skin temperature.
As discussed in Section II-B, most existing contrastive SSL objectives rely on mutual information (MI) estimation. From an information-theoretic perspective, the widely used InfoNCE-based estimator suffers from inherent limitations, including its lower-bound nature [39] and unfavorable sample complexity in high-dimensional settings [74]. To address these issues, our framework introduces a new MI estimator derived from a density-ratio decomposition perspective. For readers who may be unfamiliar with information theory, we first review the FMCA, which serves as the mathematical foundation of our proposed framework.
Given any two random processes $X$ and $Y$ with joint distribution $p(X, Y)$ and marginal product $p(X)p(Y)$, their statistical dependence can be characterized through an orthonormal decomposition of the density ratio [52], [75]:

$$\rho(x, y) := \frac{p(x, y)}{p(x)\,p(y)} = \sum_{k=1}^{\infty} \sigma_k\, \phi_k(x)\, \psi_k(y), \qquad \sigma_1 \ge \sigma_2 \ge \cdots \ge 0,$$

where $\{\phi_k\}$ and $\{\psi_k\}$ form orthonormal systems in $L^2(p(X))$ and $L^2(p(Y))$, respectively, satisfying:

$$\mathbb{E}_{p(X)}\!\left[\phi_k(X)\,\phi_l(X)\right] = \delta_{kl}, \qquad \mathbb{E}_{p(Y)}\!\left[\psi_k(Y)\,\psi_l(Y)\right] = \delta_{kl}.$$

The coefficients $\{\sigma_k\}$ can be interpreted as nonlinear correlation strengths between $X$ and $Y$. The largest coefficient satisfies $\sigma_1 = 1$. If $X$ and $Y$ are statistically independent, then $\sigma_k = 0$ for all $k \ge 2$; conversely, larger values of $\sigma_k$ indicate stronger dependence captured at increasingly higher orders.

This spectral view motivates defining a total statistical dependence (TSD) measure as an additive functional of $\{\sigma_k\}$. FMCA quantifies this dependence as:

$$T(X; Y) := \sum_{k \ge 2} \log \frac{1}{1 - \sigma_k},$$

which increases monotonically with each $\sigma_k$ and reduces to zero only when $X$ and $Y$ are independent. The formulation requires no parametric assumptions and naturally generalizes classical linear correlation.

1) Neural Networks Implementation: Since true density functions are unavailable in practice, FMCA optimizes a neural surrogate of $T$ using paired projection networks $f_\theta: \mathcal{X} \to \mathbb{R}^K$ and $g_\phi: \mathcal{Y} \to \mathbb{R}^K$. Define the marginal autocorrelation and cross-correlation matrices:

$$R_X = \mathbb{E}\!\left[f_\theta(X)\, f_\theta(X)^{\top}\right], \qquad R_Y = \mathbb{E}\!\left[g_\phi(Y)\, g_\phi(Y)^{\top}\right], \qquad R_{XY} = \mathbb{E}\!\left[f_\theta(X)\, g_\phi(Y)^{\top}\right].$$
FMCA minimizes the following log-determinant objective:

$$r(f_\theta, g_\phi) = \log\det(P) - \log\det(R_X) - \log\det(R_Y). \tag{10}$$

After optimization, $f_\theta$ and $g_\phi$ approximate the top eigenfunctions of the density ratio $\rho$. Applying singular value decomposition to $R_{XY} = U S^{1/2} V$, where $S = \mathrm{diag}(\lambda_1, \ldots, \lambda_K)$ approximates the top-$K$ eigenvalues of $\rho$, we can estimate $\rho$ as $\hat{\rho} = f_\theta^{\top} S^{1/2} g_\phi$. We note that the FMCA objective resembles the mutual information formula for jointly Gaussian variables:

$$I_{\mathrm{Gauss}}(X; Y) = \frac{1}{2}\left[\log\det(R_X) + \log\det(R_Y) - \log\det(P)\right],$$

where

$$P = \begin{bmatrix} R_X & R_{XY} \\ R_{XY}^{\top} & R_Y \end{bmatrix}$$

is the sample joint-covariance matrix. This resemblance partially motivates Eq. (10), but FMCA remains fully non-parametric.

Minimizing Eq. (10) yields the optimal value $r_L^{*}(\theta, \phi) = \sum_{i=1}^{K} \log(1 - \sigma_i)$, where $\sigma_i \in [0, 1)$ are the top $K$ eigenvalues in the decomposition (5). Thus, minimizing Eq. (10) is equivalent to maximizing a truncated version of $T$, and hence the nonlinear dependence between $X$ and $Y$.
While theoretically principled, the log-determinant objective requires repeated matrix inversion and eigenvalue computation, which may become numerically unstable during minibatch training. Small ridge terms εI are often added for stability [52], motivating the development of an alternative formulation that remains faithful to the spectral interpretation yet is more efficient and robust for large-scale optimization.
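For concreteness, the following is a minimal sketch of this log-determinant surrogate with the ridge term $\varepsilon I$. The matrix estimates, normalization, and the exact form of Eq. (10) follow our reconstruction above and are assumptions for illustration, not the authors' released implementation.

```python
# Sketch of the log-determinant FMCA surrogate (our reading of Eq. (10)), including
# a ridge term eps*I for numerical stability. Shapes/normalization are assumptions.
import torch

def fmca_logdet(f_out: torch.Tensor, g_out: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    n, k = f_out.shape
    r_x  = f_out.t() @ f_out / n                      # (K, K) marginal autocorrelation
    r_y  = g_out.t() @ g_out / n
    r_xy = f_out.t() @ g_out / n                      # (K, K) cross-correlation
    joint = torch.cat([torch.cat([r_x, r_xy], dim=1),
                       torch.cat([r_xy.t(), r_y], dim=1)], dim=0)   # (2K, 2K) joint matrix P
    eye_k  = torch.eye(k, device=f_out.device)
    eye_2k = torch.eye(2 * k, device=f_out.device)
    # Minimizing logdet(P) - logdet(R_X) - logdet(R_Y) (more negative = stronger
    # dependence between the two feature sets).
    return (torch.logdet(joint + eps * eye_2k)
            - torch.logdet(r_x + eps * eye_k)
            - torch.logdet(r_y + eps * eye_k))

if __name__ == "__main__":
    f, g = torch.randn(256, 16), torch.randn(256, 16)
    print(fmca_logdet(f, g).item())
```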
Motivated by CLIP, a natural approach is to maximize the total dependence among all modalities. From an information-theoretic view, total dependence among $M$ random variables can be quantified using total correlation (TC) [71] and dual total correlation (DTC) [50]. Let $[M] := \{1, 2, \ldots, M\}$ denote the index set, and $[M]\setminus\{i\}$ denote the set excluding $i$. For convenience, we write $X_{[M]}$ to denote the tuple $(X_1, \ldots, X_M)$, and $X_{[M]\setminus\{i\}}$ to denote all variables except $X_i$. Then, TC and DTC are defined as:

$$\mathrm{TC}(X_{[M]}) := \sum_{i=1}^{M} H(X_i) - H(X_{[M]}), \qquad \mathrm{DTC}(X_{[M]}) := H(X_{[M]}) - \sum_{i=1}^{M} H\!\left(X_i \mid X_{[M]\setminus\{i\}}\right).$$

Fig. 1: TC and DTC on three variables $X_1, X_2, X_3$. In each case, the quantity is represented by the total number of block areas of the diagram. TC counts the triple-overlapped shaded area (a.k.a., interaction information [70]) twice, whereas DTC counts this area just once.
TC has recently been applied to multimodal learning via CLIP-like lower bounds [43]. However, TC is known to overestimate redundancy due to repeated counting of shared information. For example, in the case of three variables, it double-counts the interaction information [70]:

$$\mathrm{TC}(X_1, X_2, X_3) = I(X_1; X_2 \mid X_3) + I(X_1; X_3 \mid X_2) + I(X_2; X_3 \mid X_1) + 2\, I(X_1; X_2; X_3),$$

which may cause misleading dependence estimates (see Fig. 1 for an illustration). In contrast, DTC avoids this issue by emphasizing unique and synergistic information, making it better suited for multimodal alignment [72], [76].
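As a concrete illustration of why pairwise criteria can miss such joint structure, the toy computation below (our own example, not from the paper) uses the classic XOR construction: all pairwise mutual informations vanish, yet TC and DTC are strictly positive.

```python
# Toy discrete example: X3 = X1 xor X2 with X1, X2 fair coins. Every pairwise
# mutual information is zero, yet TC and DTC are positive, so objectives built
# from pairwise terms alone cannot detect this dependence. Illustrative only.
import itertools
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Joint distribution over (x1, x2, x3): four equiprobable outcomes.
states = [(a, b, a ^ b) for a, b in itertools.product([0, 1], repeat=2)]
joint = np.full(len(states), 0.25)

def marginal_entropy(idx) -> float:
    """Entropy of the marginal over the given subset of variable indices."""
    table = {}
    for s, p in zip(states, joint):
        key = tuple(s[i] for i in idx)
        table[key] = table.get(key, 0.0) + p
    return entropy(np.array(list(table.values())))

H1, H2, H3 = (marginal_entropy([i]) for i in range(3))
H123 = marginal_entropy([0, 1, 2])
tc  = H1 + H2 + H3 - H123                                         # total correlation
dtc = H123 - sum(H123 - marginal_entropy([j for j in range(3) if j != i]) for i in range(3))
i12 = H1 + H2 - marginal_entropy([0, 1])                          # I(X1; X2), same for other pairs
print(f"TC = {tc:.1f} bit, DTC = {dtc:.1f} bits, I(X1;X2) = {i12:.1f} bit")
```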
Beyond its theoretical advantages over TC, DTC also highlights a key limitation of prior approaches that naively extend CLIP by aggregating contrastive losses across all modality pairs. For instance, in the case of three variables $X_1, X_2, X_3$, DTC decomposes as (proof in Appendix):

$$\mathrm{DTC}(X_1, X_2, X_3) = I(X_1; X_2) + I(X_1; X_3 \mid X_2) + I(X_2; X_3 \mid X_1). \tag{14}$$

Eq. (14) shows that the pairwise strategy, though intuitive, fails to capture all conditional higher-order dependencies.
Despite its appeal, DTC is challenging to estimate accurately due to the curse of dimensionality, necessitating novel estimators or reliable surrogates, as discussed later.
We first present a sandwich bound that approximates the DTC using more tractable mutual information terms.

Theorem 1. For three random variables $X_1, X_2, X_3$, the dual total correlation satisfies the following bound:

$$\frac{1}{3} \sum_{\mathrm{cyc}} I(\text{pair}; \text{third}) \;\le\; \mathrm{DTC}(X_1, X_2, X_3) \;\le\; \sum_{\mathrm{cyc}} I(\text{pair}; \text{third}).$$
Proof. Complete proofs are provided in Appendix A, and we also present empirical evidence supporting the tightness of the proposed bound in Appendix B.
Here, $\sum_{\mathrm{cyc}} I(\text{pair}; \text{third})$ denotes the cyclic sum over all permutations of the three variables, where in each term, two variables are grouped as a joint input and the third as the target. That is,

$$\sum_{\mathrm{cyc}} I(\text{pair}; \text{third}) = I(X_1, X_2; X_3) + I(X_1, X_3; X_2) + I(X_2, X_3; X_1).$$
Theorem 1 suggests that optimizing the cyclic sum $\sum_{\mathrm{cyc}} I(\text{pair}; \text{third})$ serves as a reliable surrogate for DTC. By the chain rule:

$$I(X_1, X_2; X_3) = I(X_2; X_3) + I(X_1; X_3 \mid X_2),$$

indicating that our objective inherently captures conditional higher-order dependence terms, as highlighted in Eq. (14). This sandwich bound can be extended to settings with $M > 3$ modalities (see Theorem 2), though in this study we focus on the case of three modalities.
Theorem 2. For $M \ge 3$ random variables $X_1, X_2, \ldots, X_M$, the dual total correlation satisfies the following sandwich bound:

$$\frac{1}{M} \sum_{i=1}^{M} I\!\left(X_{[M]\setminus\{i\}};\, X_i\right) \;\le\; \mathrm{DTC}(X_{[M]}) \;\le\; \sum_{i=1}^{M} I\!\left(X_{[M]\setminus\{i\}};\, X_i\right),$$

where $[M] \setminus \{i\}$ refers to the set of all indices except $i$.
With $\sum_{\mathrm{cyc}} I(\text{pair}; \text{third})$ as a surrogate for DTC, the key challenge is estimating each joint mutual information term.

Rather than relying on InfoNCE- or CLIP-like objectives that provide only lower bounds, we aim to directly optimize the true dependence values using FMCA. However, as discussed earlier, FMCA suffers from high computational cost and numerical instability due to its reliance on eigenvalue decomposition. Additionally, it is not straightforward to extend FMCA to the joint mutual information term $I(\text{pair}; \text{third})$, which involves three variables instead of two.
To address both issues, we first introduce a new FMCA objective that eliminates the need for eigenvalue decomposition. We then present a neural framework for estimating I(pair; third) based on this formulation.
By Lemma 3, the original FMCA objective essentially maximizes a truncated total statistical dependence (TSD),

$$T_K := \sum_{i=1}^{K} \log \frac{1}{1 - \sigma_i}.$$

This observation motivates our adoption of a flexible, function-based formulation of statistical dependence.

Definition 1 (Generalized definition of TSD). Given the eigenspectrum $\{\sigma_i\}_{i=1}^{\infty}$ and any monotonically increasing convex function $\zeta(\cdot): [0, 1] \to [0, \infty)$ satisfying $\zeta(0) = 0$, we define the generalized form of the truncated total statistical dependence (TSD) as:

$$T_\zeta := \sum_{i=1}^{K} \zeta(\sigma_i).$$

Notably, setting $\zeta(\sigma) = \log\frac{1}{1-\sigma}$ recovers the TSD formulation proposed in [52]. Alternatively, by choosing a much simpler function $\zeta(\lambda) = \lambda$, we obtain an alternative of TSD, defined as $\tilde{T} := \sum_{i=1}^{K} \sigma_i$. This motivates a new objective for the FMCA:

$$\tilde{r}(f_\theta, g_\phi) := -\sum_{i=1}^{K} \sigma_i. \tag{21}$$

Minimizing Eq. (21) effectively maximizes $\tilde{T} := \sum_{i=1}^{K} \sigma_i$, thus maximizing the dependence between the two variables.
Compared to the original objective $r(f_\theta, g_\phi)$ in Eq. (10), the new objective avoids explicit eigenvalue decomposition and can be efficiently computed as a trace term:

$$\tilde{r}(f_\theta, g_\phi) = -\,\mathrm{tr}(\Pi), \qquad \Pi := R_X^{-1} R_{XY} R_Y^{-1} R_{XY}^{\top}, \tag{22}$$

where $\Pi$ is the normalized cross-covariance operator.
Lemma 2 (First-order approximation). Let $\sigma_i \in (0, 1)$ denote the eigenvalues of the density ratio $\rho$. By applying the first-order Taylor approximation to $\log(1 - x)$, i.e., $\log(1 - x) \approx -x$ for small $x$, we obtain

$$\sum_{i=1}^{K} \log \frac{1}{1 - \sigma_i} \;\approx\; \sum_{i=1}^{K} \sigma_i.$$

This shows that the trace-based objective in Eq. (22) can be interpreted as a first-order approximation of the original log-determinant-based dependence measure. Since the eigenvalues $\sigma_i$ are often small in practice (especially in high-dimensional or weakly correlated modalities), higher-order terms in the expansion decay rapidly, making the approximation sufficiently accurate while avoiding the numerical instability of computing log-determinants and eigenvalues.
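The sketch below illustrates one way to compute such a trace-based surrogate in PyTorch. The exact normalization of the cross-covariance operator follows our reconstruction of Eq. (22) and is an assumption, not the authors' released code.

```python
# Sketch of a trace-based dependence surrogate in the spirit of Eq. (22): the
# log-determinant is replaced by the trace of the (ridge-regularized) normalized
# cross-covariance operator, so no eigendecomposition or determinant is needed.
import torch

def fmca_trace(f_out: torch.Tensor, g_out: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    n, k = f_out.shape
    eye = torch.eye(k, device=f_out.device)
    r_x  = f_out.t() @ f_out / n + eps * eye
    r_y  = g_out.t() @ g_out / n + eps * eye
    r_xy = f_out.t() @ g_out / n
    # Pi = R_X^{-1} R_XY R_Y^{-1} R_XY^T; its trace sums the dependence eigenvalues.
    pi = torch.linalg.solve(r_x, r_xy) @ torch.linalg.solve(r_y, r_xy.t())
    return -torch.trace(pi)   # minimize the negative trace

if __name__ == "__main__":
    f = torch.randn(256, 16, requires_grad=True)
    g = torch.randn(256, 16)
    loss = fmca_trace(f, g)
    loss.backward()
    print(loss.item())
```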
Three-Modality Framework: After introducing the trace-based loss, we detail a neural implementation for estimating $I(\text{pair}; \text{third})$ using FMCA. As an illustrative example, Fig. 2 shows the estimation of $I((\text{EEG}, \text{ECG}); \text{Temperature})$.

EEG and ECG signals are first encoded by two modality-specific encoders $f_\theta$ and $g_\phi$, producing embeddings $e_1, e_2 \in \mathbb{R}^K$, respectively. These embeddings are then fused through a fusion network to form a joint representation $e_{12} \in \mathbb{R}^K$, which serves as the representation of the paired modalities (EEG, ECG). In parallel, the temperature signal is encoded by a third encoder $h_\psi$ to produce $e_3 \in \mathbb{R}^K$. The dependence between the paired representation $e_{12}$ and the third modality representation $e_3$ is then maximized using the trace-based FMCA objective in Eq. (22), where the auto-covariance matrices and the cross-covariance matrix are estimated empirically from mini-batch samples of $\{e_{12}, e_3\}$.
The same procedure is applied cyclically to estimate $I((\text{EEG}, \text{Temperature}); \text{ECG})$ and $I((\text{ECG}, \text{Temperature}); \text{EEG})$. Consequently, the final tri-modal training objective is given by:

$$\mathcal{L}_{\mathrm{MFMC}} = \tilde{r}(e_{12}, e_3) + \tilde{r}(e_{13}, e_2) + \tilde{r}(e_{23}, e_1),$$

where each term is the trace-based FMCA loss in Eq. (22) applied to the corresponding fused-pair and third-modality embeddings.
After pre-training, the fusion network is discarded, and only the modality-specific encoders f θ , g ϕ , and h ψ are retained for downstream emotion-recognition tasks. A key advantage of our framework is that each encoder has already learned to capture information from all modalities through the cyclic dependence maximization in Eq. (22). Consequently, unlike conventional multimodal fusion pipelines that require all modalities to be present at inference time, MFMC allows emotion recognition using any single modality encoder. In principle, each encoder provides a cross-modality-informed representation that implicitly encodes complementary cues from the other two modalities, enabling flexible deployment under missing-modality or resource-constrained conditions.
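A minimal sketch of this cyclic tri-modal objective is given below. The fusion network architecture, module names, and the plugged-in dependence loss are illustrative placeholders; any differentiable dependence surrogate (such as the trace-based loss sketched earlier) can be supplied.

```python
# Sketch of the cyclic tri-modal objective: each modality in turn plays the role
# of the "third" variable, while the other two are fused and aligned with it.
import torch
import torch.nn as nn

class Fusion(nn.Module):
    """Toy fusion head mapping two K-dim embeddings to one joint K-dim embedding."""
    def __init__(self, k: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * k, k), nn.ReLU(), nn.Linear(k, k))

    def forward(self, a, b):
        return self.net(torch.cat([a, b], dim=-1))

def mfmc_loss(e1, e2, e3, fusion, dependence_loss):
    """Sum of the three cyclic pair-vs-third dependence terms."""
    loss  = dependence_loss(fusion(e1, e2), e3)   # I((m1, m2); m3)
    loss += dependence_loss(fusion(e1, e3), e2)   # I((m1, m3); m2)
    loss += dependence_loss(fusion(e2, e3), e1)   # I((m2, m3); m1)
    return loss

if __name__ == "__main__":
    fuse = Fusion(128)
    e1, e2, e3 = (torch.randn(64, 128) for _ in range(3))
    # A trivial negative mean cosine similarity stands in for the FMCA trace loss.
    cos = nn.CosineSimilarity(dim=-1)
    print(mfmc_loss(e1, e2, e3, fuse, lambda a, b: -cos(a, b).mean()).item())
```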
Furthermore, unlike prior contrastive SSL frameworks, our method requires neither positive-negative pair construction nor modality-specific augmentations, making the pretraining process simple, generalizable, and free from handcrafted alignment biases.
We use three public affective computing datasets, each containing at least three simultaneously recorded physiological signal modalities. A generic procedure is applied to select the most informative channels from the desired modalities in each dataset (see Sec. IV-B). We exclude the widely used SEED [77] and DREAMER [65], as they contain only two signal modalities.
DEAP-Emotion [21] includes 32 EEG and 8 peripheral channels recorded from 32 participants (16 male, 16 female) watching 40 one-minute music videos, each rated on 1-9 Self-Assessment Manikin (SAM) valence and arousal scales. Signals were acquired in a controlled laboratory setting using a BioSemi ActiveTwo system at 512 Hz (downsampled to 128 Hz), with 32 active Ag/AgCl electrodes placed according to the international 10-20 system and peripheral sensors capturing EOG, EMG, GSR, respiration, blood volume pulse (BVP), and skin temperature; frontal face video was additionally recorded for most subjects [21]. The 40 stimuli were selected from an initial pool of 120 clips via a web-based prestudy to span all four quadrants of the valence-arousal space, and each trial followed a standardized baseline-video-rating protocol with self-reports of valence, arousal, dominance, liking, and familiarity. DEAP has since become a canonical benchmark for EEG-based affective computing, underpinning many deep models including dynamical graph convolutional networks and self-supervised augmentation frameworks for emotion recognition [1], [46], [63].
We focus on central (EEG) and ocular/thermal (EOG, skin temperature) channels that are most relevant to affective processing (see Sec. IV-B2).
CEAP-360VR [22] provides only peripheral signals sampled at ~30 Hz from 32 participants viewing eight 360° videos. Available channels include blood volume pulse (BVP), electrodermal activity (EDA), skin temperature (SKT), heart rate (HR), inter-beat interval (IBI), and three-axis acceleration. Participants watched the clips through an HTC Vive Pro Eye head-mounted display equipped with a high-frequency Tobii eye tracker while seated on a swivel chair and free to explore the scene; simultaneously, continuous valence-arousal annotations were collected in VR using a joystick mapped to the 2D circumplex model, followed by within-VR SAM ratings after each clip [22]. Physiological responses were recorded with an Empatica E4 wristband on the non-dominant hand, providing synchronized heart rate, EDA, and SKT measurements alongside head- and eye-movement streams. The dataset paper reports baseline binary and multi-class valence/arousal classification and quality-of-experience analyses, and CEAP-360VR has since been used in correlation-based feature extraction and multimodal physiological emotion recognition studies for immersive media [29], [32], [33].
In our experiments we focus on BVP, EDA, and SKT as the most reliable and emotion relevant modalities.
MAHNOB-HCI [78] records EEG, ECG, galvanic skin response (GSR/EDA), respiration, SKT, audio, video, and eye-tracking from 30 participants during movie viewing. Data were collected in a dedicated audiovisual laboratory using a synchronized multimodal setup with multiple cameras for facial and head views, a Tobii eye tracker, high-quality microphones, and a BioSemi Active II system for 32-channel EEG together with peripheral sensors (ECG, GSR, respiration belt, SKT) [78]. In the emotion-recognition experiment considered here, 27 valid participants watched 20 emotional film clips and, after each clip, reported their felt emotion using dimensional ratings (valence, arousal, dominance, predictability) and free-form emotion keywords, enabling both dimensional and categorical analyses. MAHNOB-HCI has become a key benchmark for multimodal affect modeling and has recently been used to evaluate hypercomplex multimodal fusion architectures such as HyperFuseNet and its hierarchical extension on EEG plus peripheral physiological signals [15], [31].
We use EEG together with a subset of peripheral channels (cardiac and electrodermal signals) that capture autonomic arousal; see Sec. IV-B2 for the exact trimodal configuration.
Across all datasets, valence and arousal ratings are binarized at the midpoint (5), forming four emotional quadrants: LAHV (low arousal, high valence), HALV (high arousal, low valence), LALV (low arousal, low valence), and HAHV (high arousal, high valence), yielding a 4-class classification task.

1) Temporal windowing: Selecting appropriate temporal windows is critical in multimodal physiological tasks to balance temporal context, sample count, and nonstationarity. We first performed a window-size sweep over durations from 0.5 s to 20 s. For shorter windows (≤ 5 s) we used non-overlapping segmentation to avoid excessive redundancy and highly correlated samples; for longer windows (> 5 s) we adopted a 40% overlap (i.e., stride = 0.6 × window length) to maintain an adequate sample count while preventing overly high correlation between adjacent segments. Empirical results (see Fig. 3) show an increase in performance up to about a 10 s window length, a plateau from 10-14 s, and a decrease beyond that point, indicating that ~10 s captures sufficient temporal context without incurring excessive nonstationarity or diluted temporal resolution. This is consistent with prior work showing that window size has a marked effect on emotion recognition performance [58].

Fig. 3: Emotion-discrimination performance versus window length. Both macro-F1 (blue) and accuracy (green) peak at 10 s, remain flat up to 14 s, and deteriorate beyond that. Very short windows (< 3 s) perform markedly worse.
Unless otherwise noted, all subsequent experiments therefore use 10 s windows. For DEAP, windows start 3 s after stimulus onset to allow for initial adaptation. For MAHNOB-HCI, we discard the 15 s neutral clip and 30 s baseline before windowing. For CEAP-360VR, where no pre-stimulus baseline is provided, normalization is applied directly to each window. Across all datasets, signals are normalized by the first sample within each window, and any window whose normalized amplitude exceeds ±5 is discarded as an outlier. This procedure yields 20,097 windows for DEAP, 25,216 segments for CEAP-360VR, and 105,000 segments for MAHNOB-HCI.
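The following sketch illustrates this windowing and per-window normalization pipeline. It assumes "normalized by the first sample" means division by the first sample of each channel (subtraction would be an equally plausible reading); function and parameter names are illustrative.

```python
# Sketch of 10-s windowing with 40% overlap, per-window normalization by the first
# sample, and rejection of windows whose normalized amplitude exceeds +/-5.
import numpy as np

def make_windows(signal: np.ndarray, fs: int = 128, win_s: float = 10.0,
                 overlap: float = 0.4, amp_limit: float = 5.0) -> np.ndarray:
    """signal: (channels, time). Returns (num_windows, channels, win_len)."""
    win_len = int(win_s * fs)
    stride = int(win_len * (1.0 - overlap))
    windows = []
    for start in range(0, signal.shape[1] - win_len + 1, stride):
        w = signal[:, start:start + win_len].astype(np.float64)
        first = w[:, :1]
        first[first == 0] = 1.0                 # guard against division by zero
        w = w / first                           # normalize by the first sample
        if np.abs(w).max() <= amp_limit:        # outlier rejection at +/-5
            windows.append(w)
    return np.stack(windows) if windows else np.empty((0, signal.shape[0], win_len))

if __name__ == "__main__":
    eeg = np.random.randn(32, 60 * 128) + 10.0   # 60 s of 32-channel data at 128 Hz
    print(make_windows(eeg).shape)
```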
2) Modality selection: To select informative modalities and suppress noisy or redundant ones, we implement a learnable attention mask at the modality-fusion layer. Each modality embedding is associated with a softmax-normalized scalar weight that is trained jointly with the classification network. Modalities whose weights fall below a predefined threshold on the validation set are excluded from the final input set. This follows recent multimodal literature showing that attention at the modality level allows the model to emphasize informative modalities and down-weight uninformative ones [79].
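A minimal sketch of such a modality-level attention gate is given below. The module structure and names are assumptions for illustration; the thresholding of learned weights on the validation set is done offline.

```python
# Sketch of a learnable modality-attention gate: one scalar logit per modality,
# softmax-normalized and multiplied onto each modality embedding before fusion.
import torch
import torch.nn as nn

class ModalityGate(nn.Module):
    def __init__(self, num_modalities: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_modalities))

    def forward(self, embeddings):                 # list of (N, K) tensors
        weights = torch.softmax(self.logits, dim=0)
        gated = [w * e for w, e in zip(weights, embeddings)]
        return torch.cat(gated, dim=-1), weights   # fused input + inspectable weights

if __name__ == "__main__":
    gate = ModalityGate(num_modalities=5)
    embs = [torch.randn(8, 128) for _ in range(5)]
    fused, w = gate(embs)
    print(fused.shape, w.detach().numpy().round(3))
```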
Guided by this attention analysis, we fix the trimodal inputs per dataset as follows and use these configurations for all subsequent experiments:
• DEAP: EEG, EOG, and skin temperature (SKT);
• CEAP-360VR: EDA, BVP, and SKT;
• MAHNOB-HCI: EEG, ECG, and EDA.
3) Encoder architecture: To ensure fairness, all self-supervised and supervised methods share the same modality-specific backbone. For multichannel signals such as EEG or EOG, the encoder applies four depth-wise 1D convolutional blocks along the time axis, reusing the same kernel across channels. Each block ends with a stride-4 max-pooling operation, resulting in a total temporal downsampling factor of 4^4 = 256. At 128 Hz with 10 s windows, each segment thus contains 1,280 time samples per channel, which are progressively reduced to compact feature maps.
The resulting per-channel features are flattened, concatenated, and passed through a lightweight fusion MLP (with one hidden layer) to model cross-channel interactions, yielding a 128-dimensional modality embedding suitable for multimodal alignment. For univariate signals with a single channel, such as skin temperature, we use the same four-block temporal CNN (omitting the cross-channel fusion step) and project the flattened features directly into the shared 128-dimensional space.
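For illustration, a compact sketch of this encoder is given below. Filter counts, kernel sizes, and the hidden width of the fusion MLP are assumptions; only the overall structure (shared per-channel temporal CNN, four stride-4 pooling stages, 128-D output) follows the description above.

```python
# Sketch of the shared modality encoder: a per-channel temporal CNN (the same
# kernel reused across channels) with four stride-4 max-pooling blocks
# (total downsampling 4^4 = 256), followed by a one-hidden-layer fusion MLP.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    def __init__(self, num_channels: int, win_len: int = 1280,
                 num_filters: int = 16, emb_dim: int = 128):
        super().__init__()
        blocks, in_ch = [], 1
        for _ in range(4):
            blocks += [nn.Conv1d(in_ch, num_filters, kernel_size=7, padding=3),
                       nn.ReLU(),
                       nn.MaxPool1d(kernel_size=4)]
            in_ch = num_filters
        self.temporal = nn.Sequential(*blocks)
        feat_len = win_len // 256                              # 1280 -> 5 time steps
        self.fusion = nn.Sequential(
            nn.Linear(num_channels * num_filters * feat_len, 256),
            nn.ReLU(),
            nn.Linear(256, emb_dim))

    def forward(self, x):                                      # x: (N, C, T)
        n, c, t = x.shape
        h = self.temporal(x.reshape(n * c, 1, t))              # shared kernel per channel
        h = h.reshape(n, -1)                                   # flatten and concatenate channels
        return self.fusion(h)                                  # (N, 128)

if __name__ == "__main__":
    enc = ModalityEncoder(num_channels=32)                     # e.g., 32-channel EEG
    print(enc(torch.randn(4, 32, 1280)).shape)                 # torch.Size([4, 128])
```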
4) Baselines and training protocol: We term our approach the Multimodal Functional Maximum Correlation (MFMC) algorithm. We compare it against three families of baselines.
a) Unimodal self-supervised learning: We first benchmark three uni-modal SSL methods applied independently to each modality: SimCLR [38], Barlow Twins [35], and VICReg [36]. Each method constructs positive pairs using two stochastically augmented views of the same signal window, generated through Gaussian noise, temporal shifts, and channel dropout [59]. The rest of the mini-batch serves as negatives when required.
b) Multimodal self-supervised learning: For the bi-modal setting (K = 2), we evaluate FMCA [52] and CLIP [40]. For the tri-modal setting (K = 3), we evaluate SymILE [43], a recent SOTA method that maximizes a CLIP-like lower bound on total correlation (TC), and our CLIP++ baseline, which aggregates all pairwise CLIP losses by summing three InfoNCE terms ($x_1 \leftrightarrow x_2$, $x_1 \leftrightarrow x_3$, $x_2 \leftrightarrow x_3$). Both SymILE and CLIP++ model higher-order structure only through pairwise or TC-based bounds, in contrast to our DTC-grounded MFMC objective.

c) Supervised baselines: Finally, we compare against three supervised approaches: EEGNet [80], a vanilla CNN with the same encoder backbone, and HyperFuseNet [15], a SOTA multimodal emotion recognition model that fuses modality-specific representations using a hypercomplex neural network.
d) Training protocol: For each contrastive SSL method, we first perform self-supervised pretraining using the full training set of unlabeled windows. After pretraining, modality encoders are frozen, and a lightweight three-layer MLP classifier is trained on top of any encoder for quadrant classification. Supervised baselines are trained end-to-end under the same windowing, modality configurations, and training budget. To isolate the effect of the learning objective and modality scope, all methods share identical modality-specific encoders, projection heads, classifier head, data preprocessing, and augmentation recipes; only the loss function and active modality set differ.

5) Train-test splits and evaluation protocols: We evaluate performance under two data-partitioning schemes. In the subject-dependent setting, trials from all participants are pooled and stratified 5-fold cross-validation is applied, with 80% of windows used for training and 20% for testing in each fold. In the subject-independent setting, the test set comprises entirely unseen participants: we adopt a 5-fold leave-group-out protocol, training on 80% of subjects and testing on the remaining 20%. Subject splits are as follows: DEAP (15/4 train/test subjects), MAHNOB-HCI (21/6), and CEAP-360VR (26/6).
We use classification accuracy as the primary metric and report mean ± standard deviation across the 5 folds in both subject-dependent and subject-independent settings. For subject-dependent evaluation, predictions are made at the window level. For subject-independent evaluation, all windows from held-out subjects are aggregated into a single test set per fold.
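The two partitioning schemes can be realized, for example, with standard scikit-learn utilities, as in the sketch below. The array names mirror the preprocessing outputs and are illustrative; this is one possible implementation, not the authors' split code.

```python
# Sketch of the two evaluation protocols: stratified 5-fold CV at the window level
# (subject-dependent) and 5-fold leave-group-out over subject IDs (subject-independent).
import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold

labels = np.random.randint(0, 4, size=20097)        # 4 emotion quadrants per window
subjects = np.random.randint(0, 19, size=20097)     # subject ID per window

# Subject-dependent: windows from all participants are pooled and stratified.
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=0).split(labels, labels):
    pass  # train/evaluate per fold

# Subject-independent: each test fold contains only unseen participants.
for train_idx, test_idx in GroupKFold(n_splits=5).split(labels, labels, groups=subjects):
    assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
```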
Computing environment: All experiments were conducted on a Linux server equipped with a single NVIDIA L40S GPU (48 GB VRAM), dual Intel Xeon Gold 6330 CPUs, and 256 GB RAM. Training MFMC on DEAP for five subject-dependent cross-validation folds required approximately 5 hours wall-clock time, while the subject-independent setup took about 6 hours. All models were implemented in Python using PyTorch [81]. Further implementation details are provided in Appendix C, and a GitHub repository is made available for reproducibility at https://github.com/DY9910/MFMC.
Subject-dependent performance: Tables I-III summarize classification accuracies under both subject-dependent and subject-independent settings across the three datasets. In the subject-dependent regime, MFMC achieves strong performance that is consistently competitive with the fully supervised HyperFuseNet while outperforming all other self-supervised baselines.
On DEAP, MFMC attains 0.987 accuracy with EEG and 0.925 with EOG, close to HyperFuseNet’s 0.995 and clearly above all other SSL methods (Table I). On CEAP-360VR, MFMC improves over all self-supervised baselines and approaches HyperFuseNet for both EDA and BVP (Table II), indicating that tri-modal alignment is beneficial even for purely peripheral signals. On MAHNOB-HCI, MFMC nearly matches HyperFuseNet in the EEG case (0.953 vs. 0.955) and substantially outperforms the remaining SSL methods for both EEG and ECG (Table III). Overall, MFMC is the only SSL method that consistently performs competitively with a strong supervised SOTA across all datasets and modalities.
Subject-independent performance: The subject-independent setting is considerably more challenging: accuracies are substantially lower than in the subject-dependent case, highlighting the difficulty of cross-subject generalization in physiological emotion recognition. Nevertheless, MFMC establishes a new SOTA among all considered methods in 5 out of 6 subject-independent scenarios across datasets and modalities.
For example, MFMC achieves the best or statistically tied best subject-independent performance on DEAP EEG, CEAP EDA/BVP, and MAHNOB ECG, while remaining competitive on MAHNOB EEG. Importantly, in many cases MFMC surpasses supervised baselines (including HyperFuseNet and EEGNet), demonstrating that a DTC-grounded self-supervised objective can yield robust and scalable representations that generalize better across subjects than label-intensive training.
Comparison with task-specific SOTA: HyperFuseNet [15] and related supervised fusion models are strong task-specific SOTA baselines on multimodal physiological emotion recognition. Our results show that MFMC, despite being self-supervised and using an augmentation-light objective, matches or closely approaches HyperFuseNet in the subject-dependent setting and surpasses it in most subject-independent scenarios. Compared to previously reported results on DEAP, CEAP-360VR, and MAHNOB-HCI under similar label definitions and modality subsets, MFMC therefore offers a competitive and often superior alternative while requiring no affect labels during pretraining. This suggests that DTC-grounded multimodal self-supervision can serve as a powerful drop-in replacement for heavily engineered, label-intensive pipelines.

2) Comparison with supervised learning: Table IV compares MFMC with the best supervised baseline on EEG across all three datasets. In the subject-dependent setting, MFMC attains accuracies that are consistently within 0.2-1.9 percentage points of the strongest supervised model (e.g., 0.987 vs. 0.995 on DEAP and 0.953 vs. 0.955 on MAHNOB-HCI), indicating that self-supervised pretraining recovers almost all of the subject-specific discriminative power of fully supervised training. In the more challenging subject-independent setting, MFMC not only closes this gap but surpasses supervised learning on every dataset, reaching 0.346 vs. 0.341 on DEAP, 0.331 vs. 0.295 on CEAP-360VR, and 0.442 vs. 0.410 on MAHNOB-HCI. Similar trends are observed for peripheral modalities (EOG, EDA/BVP, ECG) in Tables I-III, where MFMC remains close to supervised performance on subject-dependent splits while providing consistently stronger generalization across subjects.
To assess the importance of the MFMC objective, we keep the entire framework unchanged (encoders, projection heads, optimizer settings, and training loop) and substitute the loss term with two alternatives.
• High-order InfoNCE: We use exactly the same architecture as MFMC, which essentially maximizes the sum $I(x_1, x_2; x_3) + I(x_1, x_3; x_2) + I(x_2, x_3; x_1)$. However, instead of our proposed trace-based FMCA objective, we use InfoNCE to estimate the three joint mutual information terms.
• FMCA-LogDet: We replace the trace formulation (e.g., Eq. (19) in the main manuscript) with the log-determinant surrogate [52] (e.g., Eq. (10) in the main manuscript), using a ridge term $\varepsilon = 10^{-4}$ for stability.
Results in Table V support our claim that a stable DTC-grounded trace surrogate yields both smoother optimization dynamics and superior downstream performance compared with triplet-based InfoNCE and log-determinant FMCA objectives. The DTC-grounded trace objective employed in MFMC provides a more stable and expressive dependence measure than both InfoNCE-style lower bounds and the original FMCA LogDet formulation. On the DEAP dataset, MFMC outperforms the strongest high-order InfoNCE variant by 1.1 percentage points and the LogDet surrogate by 12.4 percentage points in the subject-dependent setting (Table V), while exhibiting smooth and reliable convergence (Fig. 5). Although InfoNCE converges stably, it saturates at a lower performance level, consistent with theoretical results showing that mutual-information lower-bound estimators tend to flatten as the true mutual information increases [74]. The LogDet-based FMCA initially follows MFMC but becomes numerically unstable as training progresses, owing to ill-conditioned covariance matrices, which causes the loss to diverge and performance to collapse. In contrast, the proposed trace-based surrogate optimizes a simple sum of eigenvalues, avoiding explicit determinants or eigendecompositions, and thereby achieves improved numerical stability together with stronger downstream representations.
V. DISCUSSION

Our empirical results and ablation studies support two main observations about multimodal self-supervision for physiological emotion recognition. First, the comparison between tri-modal, bi-modal, and uni-modal methods highlights the importance of explicitly modeling higher-order dependence across modalities (Tables I-III). These gains are particularly pronounced in the more difficult subject-independent setting, where MFMC often matches or surpasses supervised baselines such as EEGNet and HyperFuseNet, despite using no affect labels during pretraining. This pattern suggests that DTC-based training encourages encoders to capture synergistic interactions that emerge only when EEG and peripheral signals are considered jointly, rather than over-emphasizing redundant pairwise correlations. The fact that MFMC remains augmentation-light and does not require positive/negative pair construction further indicates that much of the supervision signal is intrinsic to the multimodal physiological structure itself, rather than being injected through heavy augmentation or handcrafted contrastive schemes.
Second, our results shed light on the relative contribution of architecture versus objective in physiological SSL. All methods share the same temporal CNN backbone and projection head, and differ only in loss function and modality configuration. Yet MFMC substantially narrows the gap to, or even exceeds, task-specific supervised models under cross-subject evaluation. This suggests that for noisy, label-scarce physiological data, improving the alignment objective and dependence estimator may yield larger generalization benefits than further increasing model capacity. In particular, MFMC's reliance on global covariance structure rather than instance-wise negatives may help it average out subject-specific artifacts and label noise, which are known challenges in affective computing.
Overall, the present work should be viewed as a first step toward DTC-grounded, augmentation-light multimodal SSL for physiological signals. Our findings indicate that explicitly targeting higher-order dependence across central and peripheral modalities can yield substantial gains in subject-independent emotion recognition, while simplifying the training pipeline compared to heavily engineered contrastive frameworks. Extending MFMC to richer tasks, more diverse recording conditions, and additional modalities (e.g., facial video, speech, or eye tracking) is a promising direction for future research.
This paper presents the first approach to apply contrastive self-supervised learning across more than two modalities for emotion recognition based on physiological signals. The proposed Multimodal Functional Maximum Correlation (MFMC) framework is theoretically grounded, practical as it does not require positive or negative sample construction or data augmentation, and generalizable across a wide range of physiological signals. Our findings also highlight the potential of incorporating multiple modalities beyond the common bimodal setting to enhance performance in affective BCI tasks. In future work, we plan to extend MFMC to other downstream BCI applications, such as depression severity assessment. Furthermore, we aim to develop formal methods for quantifying the contribution of each modality, as well as their higher-order interactions, to the final decision.

8) MAHNOB-HCI [78]: MAHNOB logs 32-channel EEG plus four peripheral modalities: ECG, GSR, respiration amplitude, and skin temperature. The mask highlights EEG, GSR, and ECG, so we restrict the input to these three.
This automatic gating reduces the input dimensionality by 12-60 (depending on the dataset) while preserving or slightly improving validation performance and yields a clean, interpretable set of modalities for all subsequent analyses.
This section details the data-augmentation recipes used in the self-supervised (SSL) experiments on the CEAP, MAHNOB-HCI, and DEAP corpora under three popular frameworks: SimCLR, VICReg, and Barlow Twins.

B. Data and Model Pipeline

1) Data flow: All experiments begin with five NumPy tensors generated in preprocessing: eeg_data.npy (N, 32, 1280), eog_data.npy (N, 2, 1280), temp_data.npy (N, 1, 1280), emotion_labels.npy (N), and subject.npy (N), stored in <BASE_PATH>/Data_processed/. Here N = 20,097, reflecting 10 s windows (1,280 samples at 128 Hz) extracted with a 0.4 s stride. A pointer table maps each window to its subject and trial, enabling a subject-balanced split generator that assembles five folds (≈ 4 subjects per fold). The DataLoader streams the pre-processed windows directly from disk; each modality is already rescaled to [-1, 1] by dividing by its global maximum magnitude, and no further normalisation or augmentation is applied. Mini-batches therefore contain three tensors (EEG, EOG, temperature) plus one label vector, fed unchanged into the modality-specific encoders.
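The sketch below shows one way such a data flow could be wrapped as a PyTorch dataset. The file names and shapes follow the description above; <BASE_PATH> is the placeholder from the text, and the class name is illustrative.

```python
# Sketch of loading the preprocessed tensors and serving (EEG, EOG, temperature,
# label) mini-batches. Illustrative only; not the authors' pipeline code.
import os
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class TriModalWindows(Dataset):
    def __init__(self, base_path: str):
        d = os.path.join(base_path, "Data_processed")
        self.eeg  = np.load(os.path.join(d, "eeg_data.npy"))      # (N, 32, 1280)
        self.eog  = np.load(os.path.join(d, "eog_data.npy"))      # (N, 2, 1280)
        self.temp = np.load(os.path.join(d, "temp_data.npy"))     # (N, 1, 1280)
        self.y    = np.load(os.path.join(d, "emotion_labels.npy"))

    def __len__(self):
        return len(self.y)

    def __getitem__(self, i):
        return (torch.from_numpy(self.eeg[i]).float(),
                torch.from_numpy(self.eog[i]).float(),
                torch.from_numpy(self.temp[i]).float(),
                int(self.y[i]))

# loader = DataLoader(TriModalWindows("<BASE_PATH>"), batch_size=256, shuffle=True)
```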
Model architecture: Each modality encoder has two stages: a temporal network that processes each sensor channel separately, followed by a channel network that fuses the channels into a 128-D vector.

2) Cross-validation outputs: Five subject-balanced folds are trained sequentially. Each fold produces a JSON log and a loss-accuracy curve figure in results/fold_k/. The final notebook cell merges these files into cv_summary.csv, a combined learning-curve plot, and a confusion-matrix image.
Accuracy: Subject-dependent folds achieve 98.3% ± 1.4% best accuracy and 96.4% ± 1.8% final accuracy. Subject-independent folds reach 34.6% ± 3.7% best accuracy and 26.9% ± 3.2% final accuracy. Most remaining errors involve high- versus low-arousal quadrants; temperature contributes the least discriminative power.
Runtime: One fold trains in roughly one hour on a single NVIDIA L40S GPU (48 GB) with 16 logical CPU cores, so the full five-fold run finishes in about five hours wall-clock.