Align and Adapt: Multimodal Multiview Human Activity Recognition under Arbitrary View Combinations

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Multimodal multiview learning seeks to integrate information from diverse sources to enhance task performance. Existing approaches often struggle with flexible view configurations, including arbitrary view combinations, numbers of views, and heterogeneous modalities. Focusing on the context of human activity recognition, we propose AliAd, a model that combines multiview contrastive learning with a mixture-of-experts module to support arbitrary view availability during both training and inference. Instead of trying to reconstruct missing views, an adjusted center contrastive loss is used for self-supervised representation learning and view alignment, mitigating the impact of missing views on multiview fusion. This loss formulation allows for the integration of view weights to account for view quality. Additionally, it reduces computational complexity from $O(V^2)$ to $O(V)$, where $V$ is the number of views. To address residual discrepancies not captured by contrastive learning, we employ a mixture-of-experts module with a specialized load balancing strategy, tasked with adapting to arbitrary view combinations. We highlight the geometric relationship among components in our model and how they combine well in the latent space. AliAd is validated on four datasets encompassing inertial and human pose modalities, with the number of views ranging from three to nine, demonstrating its performance and flexibility.


💡 Research Summary

The paper introduces AliAd (Align and Adapt), a novel framework for multimodal, multiview human activity recognition that remains robust when views are arbitrarily combined or missing. The architecture consists of view‑specific (or shared) encoders that produce normalized feature vectors, an attention‑based weighting mechanism that computes a weighted sum (the “center”) of the available views, and a newly proposed Adjusted Center Contrastive loss (L_AC). L_AC contrasts each view with the weighted center of the remaining views, incorporating view‑quality weights w(v) both in the center computation and as a loss scaling factor (1‑w(v)). This design pulls all views toward a common hyperspherical center while giving higher‑quality views greater influence, and reduces the pairwise contrast computation from O(V²) to O(V).
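The adjusted center contrastive loss described above can be sketched as follows. This is a minimal, hypothetical PyTorch implementation based only on the summary: each view is contrasted against the renormalized weighted center of the remaining views of the same sample (with other samples' centers as negatives), and the per-view term is scaled by (1 − w(v)). The exact temperature, batch construction, and weight normalization used in the paper are assumptions here.

```python
import torch
import torch.nn.functional as F


def adjusted_center_contrastive_loss(feats, weights, temperature=0.1):
    """Hypothetical sketch of the Adjusted Center Contrastive loss (L_AC).

    feats:   (B, V, D) L2-normalized view embeddings for a batch.
    weights: (B, V) view-quality weights summing to 1 over views.
    Each view is pulled toward the weighted center of the *other* views of
    the same sample (positive) while the centers of other samples in the
    batch act as negatives; the per-view loss is scaled by (1 - w(v)).
    The loop over views gives the O(V) cost noted in the summary.
    """
    B, V, D = feats.shape
    loss = feats.new_zeros(())
    for v in range(V):
        # Weighted center of the remaining views, renormalized onto the sphere.
        mask = torch.ones(V, dtype=torch.bool)
        mask[v] = False
        w_rest = weights[:, mask]                                    # (B, V-1)
        w_rest = w_rest / w_rest.sum(dim=1, keepdim=True)
        center = (w_rest.unsqueeze(-1) * feats[:, mask]).sum(dim=1)  # (B, D)
        center = F.normalize(center, dim=-1)

        # InfoNCE: view v of sample i should match sample i's center.
        logits = feats[:, v] @ center.T / temperature                # (B, B)
        targets = torch.arange(B)
        per_sample = F.cross_entropy(logits, targets, reduction="none")
        loss = loss + ((1.0 - weights[:, v]) * per_sample).mean()
    return loss / V
```

Because each view is compared only to one center rather than to every other view, the number of contrast computations grows linearly in V rather than quadratically.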

After fusion, a sparse Mixture‑of‑Experts (MoE) block serves as the classification head. The gating network selects a subset of expert sub‑networks conditioned on the current view combination, and a load‑balancing loss prevents expert collapse. The MoE compensates for residual discrepancies that contrastive alignment does not capture and enables generalization to unseen view subsets.
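A sparse MoE head of the kind described can be sketched as below. This is an illustrative implementation, not the paper's: it uses standard top-k gating with a Shazeer-style importance/load balancing penalty, whereas the paper's "specialized load balancing strategy" conditioned on view combinations is not specified in this summary. The expert count, top-k value, and expert architecture are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEHead(nn.Module):
    """Hypothetical sparse MoE classification head with a generic
    load-balancing auxiliary loss; details differ from the paper."""

    def __init__(self, dim, n_classes, n_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                          nn.Linear(dim, n_classes))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(dim, n_experts)
        self.top_k = top_k
        self.n_classes = n_classes

    def forward(self, x):
        probs = F.softmax(self.gate(x), dim=-1)          # (B, E)
        topv, topi = probs.topk(self.top_k, dim=-1)      # (B, k)
        topv = topv / topv.sum(dim=-1, keepdim=True)     # renormalize over k

        # Route each sample through its selected experts only.
        out = x.new_zeros(x.size(0), self.n_classes)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                sel = topi[:, slot] == e
                if sel.any():
                    out[sel] += topv[sel, slot, None] * expert(x[sel])

        # Load-balancing loss: penalize correlated importance and load
        # so no expert collapses to handling all inputs.
        importance = probs.mean(dim=0)                                  # (E,)
        load = F.one_hot(topi, len(self.experts)).float().sum(dim=(0, 1))
        load = load / load.sum()
        lb_loss = len(self.experts) * (importance * load).sum()
        return out, lb_loss
```

In AliAd's setting the gate sees the fused vector, whose composition already reflects which views are present, so different view combinations can naturally route to different experts.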

Training is joint: unlabeled samples contribute to the contrastive loss, while labeled samples also train the MoE classifier. At inference only the weighted fused vector is fed to the MoE, keeping computation low.
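The joint semi-supervised objective can be summarized in a small sketch. The masking scheme and the loss weights below are assumptions for illustration; the summary only states that unlabeled samples feed the contrastive term while labeled samples additionally train the MoE classifier.

```python
import torch


def joint_objective(contrastive, ce_per_sample, lb_loss, is_labeled,
                    lam_ce=1.0, lam_lb=0.01):
    """Hypothetical combined objective: the contrastive alignment term uses
    every sample, while cross-entropy is averaged over labeled samples only.
    `lam_ce` and `lam_lb` are illustrative weights, not from the paper."""
    if is_labeled.any():
        ce = ce_per_sample[is_labeled].mean()
    else:
        ce = torch.zeros(())  # purely unlabeled batch: alignment only
    return contrastive + lam_ce * ce + lam_lb * lb_loss
```

At inference the contrastive branch is dropped entirely: only the fused vector passes through the gate and its selected experts.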

Experiments on four datasets covering inertial sensors and human pose keypoints, with 3–9 views per sample, demonstrate that AliAd outperforms prior multiview contrastive methods, reconstruction‑based missing‑view approaches, and recent MoE‑based multimodal models. It achieves 3–7% higher accuracy in full‑view settings and maintains less than 2% degradation when up to half the views are missing. Moreover, the O(V) contrastive computation yields 30–45% faster training compared to full‑graph O(V²) baselines.

Overall, AliAd provides a unified solution that aligns views efficiently, respects view quality, and adapts to arbitrary view configurations through a sparsely activated expert system, advancing the state of the art in flexible, high‑performance human activity recognition.

