Title: DETACH: Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning
ArXiv ID: 2512.20409
Date: 2025-12-23
Authors: Junho Yoon, Jaemo Jung, Hyunju Kim, Dongman Lee
📝 Abstract
Aligning egocentric video with wearable sensors has shown promise for human action recognition, but faces practical limitations in user discomfort, privacy concerns, and scalability. We explore exocentric video with ambient sensors as a non-intrusive, scalable alternative. While prior egocentric-wearable works predominantly adopt Global Alignment by encoding entire sequences into unified representations, this approach fails in exocentric-ambient settings due to two problems: (P1) inability to capture local details such as subtle motions, and (P2) over-reliance on modality-invariant temporal patterns, causing misalignment between actions that share similar temporal patterns but differ in spatio-semantic context. To resolve these problems, we propose DETACH, a decomposed spatio-temporal framework. This explicit decomposition preserves local details, while our novel sensor-spatial features, discovered via online clustering, provide semantic grounding for context-aware alignment. To align the decomposed features, our two-stage approach first establishes spatial correspondence through mutual supervision, then performs temporal alignment via a spatial-temporal weighted contrastive loss that adaptively handles easy negatives, hard negatives, and false negatives. Comprehensive experiments with downstream tasks on the Opportunity++ and HWU-USP datasets demonstrate substantial improvements over adapted egocentric-wearable baselines.
📄 Full Content
DETACH: Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning
Junho Yoon, Jaemo Jung, Hyunju Kim, Dongman Lee
KAIST
{vpdtlrdl, gosfl4760, iplay93, dlee}@kaist.ac.kr
Abstract
Aligning egocentric video with wearable sensors has shown promise for human action recognition, but faces practical limitations in user discomfort, privacy concerns, and scalability. We explore exocentric video with ambient sensors as a non-intrusive, scalable alternative. While prior egocentric-wearable works predominantly adopt Global Alignment by encoding entire sequences into unified representations, this approach fails in exocentric-ambient settings due to two problems: (P1) inability to capture local details such as subtle motions, and (P2) over-reliance on modality-invariant temporal patterns, causing misalignment between actions that share similar temporal patterns but differ in spatio-semantic context. To resolve these problems, we propose DETACH, a decomposed spatio-temporal framework. This explicit decomposition preserves local details, while our novel sensor-spatial features, discovered via online clustering, provide semantic grounding for context-aware alignment. To align the decomposed features, our two-stage approach first establishes spatial correspondence through mutual supervision, then performs temporal alignment via a spatial-temporal weighted contrastive loss that adaptively handles easy negatives, hard negatives, and false negatives. Comprehensive experiments with downstream tasks on the Opportunity++ and HWU-USP datasets demonstrate substantial improvements over adapted egocentric-wearable baselines.
1. Introduction
Human action recognition (HAR) has been extensively studied using video [26, 31] and sensor modalities [37, 54], each offering unique strengths: video provides rich visual context [23], while sensors offer precise temporal cues [35]. Recently, multimodal approaches that align egocentric video with wearable sensors [8, 11, 32, 57] have demonstrated promising results in personal action monitoring by leveraging their complementarity. However, such approaches face practical limitations in real-world deployment, as they introduce ongoing user burden, expose sensitive personal data, and offer limited scalability across multiple users [42, 58].

Figure 1. Limitations of Global Alignment-based approaches. (a) Inability to capture local details: subtle motions become indistinguishable in video feature space. (b) Over-reliance on modality-invariant features: actions incorrectly aligned due to shared temporal patterns, despite different spatio-semantic contexts.
Integrating exocentric video with ambient sensors offers a promising direction for addressing these limitations. Unlike egocentric–wearable setups, exocentric–ambient configurations employ statically installed cameras [23] and object-embedded sensors [34], enabling non-intrusive and scalable monitoring [40, 50]. However, despite this potential, no existing work has explored exocentric–ambient alignment, and the Global Alignment strategy commonly used in the egocentric–wearable domain is ill-suited to this setting. The Global Alignment strategy encodes entire video clips and sensor windows into holistic representations, which are
then aligned through contrastive or distillation objectives [13, 27]. While effective in egocentric–wearable scenarios where the two modalities exhibit strong cross-modal spatio-temporal correlations, we argue that the strategy fails to operate effectively in exocentric–ambient environments due to two fundamental challenges.

Figure 2. Negative sampling strategy. Hard negatives (same object, different actions) are prioritized, easy negatives (different objects, different actions) are down-weighted, and false negatives (same object, same action) are filtered.
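For concreteness, the sketch below (illustrative only, not the authors' implementation) contrasts a plain Global Alignment objective, a symmetric InfoNCE over pooled clip and window embeddings, with a negative-weighted variant following the intuition of Figure 2, assuming coarse per-pair object and action labels are available; all function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F


def global_alignment_infonce(video_emb, sensor_emb, temperature=0.07):
    """Plain Global Alignment baseline: one pooled embedding per video clip and
    per sensor window, aligned with a symmetric InfoNCE loss."""
    v = F.normalize(video_emb, dim=-1)            # (B, D)
    s = F.normalize(sensor_emb, dim=-1)           # (B, D)
    logits = v @ s.t() / temperature              # (B, B) cross-modal similarities
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def weighted_contrastive(video_emb, sensor_emb, object_id, action_id,
                         temperature=0.07, easy_w=0.5, hard_w=2.0):
    """Negative-weighted variant in the spirit of Figure 2 (hypothetical sketch):
    hard negatives (same object, different action) are up-weighted,
    easy negatives (different object) are down-weighted, and
    false negatives (same object, same action) are masked out."""
    v = F.normalize(video_emb, dim=-1)
    s = F.normalize(sensor_emb, dim=-1)
    logits = v @ s.t() / temperature
    B = v.size(0)

    same_obj = object_id[:, None] == object_id[None, :]      # (B, B) bool masks
    same_act = action_id[:, None] == action_id[None, :]
    pos = torch.eye(B, dtype=torch.bool, device=v.device)

    w = torch.full((B, B), easy_w, device=v.device)          # default: easy negatives
    w[same_obj & ~same_act] = hard_w                         # hard negatives prioritized
    w[same_obj & same_act & ~pos] = 0.0                      # false negatives filtered
    w[pos] = 1.0                                             # positives keep unit weight

    def one_direction(lgt, wgt):
        # Weighted InfoNCE: each pair's contribution to the denominator is scaled by w.
        denom = (torch.exp(lgt) * wgt).sum(dim=1, keepdim=True)
        log_prob = lgt - torch.log(denom + 1e-8)
        return -log_prob[pos].mean()

    return 0.5 * (one_direction(logits, w) + one_direction(logits.t(), w.t()))
```

The design intent is that up-weighting same-object, different-action pairs concentrates the repulsion on the confusable cases, while masking same-object, same-action pairs avoids penalizing pairs that are effectively positives; the paper's actual spatial-temporal weighted loss is defined later and may differ in form.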
First, global alignment fails to capture fine-grained local details [20], specifically subtle motion patterns, in exocentric-ambient settings. Unlike egocentric video with salient hand-object interactions [57], exocentric video often shows minimal visual changes (e.g., a drawer opening slightly) against static scenes [1, 55]. Compressing the entire video into a single vector dilutes these subtle yet discriminative cues, despite clear ambient sensor signals (Fig. 1(a)).
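A toy numerical sketch (not from the paper; shapes and magnitudes are arbitrary) illustrates this dilution: mean-pooling patch-level features makes a change confined to a few patches nearly invisible at the clip level, even though the affected patches themselves differ clearly.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy "clip": 196 patch tokens x 256-dim features for a mostly static scene.
patches_static = torch.randn(196, 256)

# Same scene where only a handful of patches change (e.g., a drawer edge moving).
patches_subtle = patches_static.clone()
patches_subtle[:5] += 0.8 * torch.randn(5, 256)

# Global Alignment-style pooling: one vector per clip.
g_static = F.normalize(patches_static.mean(dim=0), dim=0)
g_subtle = F.normalize(patches_subtle.mean(dim=0), dim=0)
print("global cosine sim:", torch.dot(g_static, g_subtle).item())   # close to 1.0

# Local view: the changed patches themselves remain clearly distinguishable.
local_sim = F.cosine_similarity(patches_static[:5], patches_subtle[:5], dim=1)
print("local cosine sims:", local_sim.tolist())                     # visibly below 1.0
```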
Second, global alignment with motion-centric sensor encoders forces the model to rely solely on temporal patterns, which are the only modality-invariant features [20]. This creates a critical representational mismatch: video embeddings capture rich spatio-semantic cues (object appearance, spatial relationships, location), while sensor embeddings capture only temporal dynamics. Motion-centric encoders suit egocentric-wearable setups focused on body movements [58, 59], but exocentric-ambient settings require contextual grounding, since each sensor is bound to a specific object or location.