DomusFM: A Foundation Model for Smart-Home Sensor Data
Smart-home sensor data holds significant potential for several applications, including healthcare monitoring and assistive technologies. Existing approaches, however, face critical limitations. Supervised models require impractical amounts of labeled data. Foundation models for activity recognition focus only on inertial sensors, failing to address the unique characteristics of smart-home binary sensor events: their sparse, discrete nature combined with rich semantic associations. LLM-based approaches, while explored in this domain, require natural-language descriptions or prompting and depend on external services or expensive hardware, raising privacy and cost concerns that make them impractical in real-life scenarios. We introduce DomusFM, the first foundation model specifically designed and pretrained for smart-home sensor data. DomusFM employs a self-supervised dual contrastive learning paradigm to capture both token-level semantic attributes and sequence-level temporal dependencies. By integrating semantic embeddings from a lightweight language model with specialized encoders for temporal patterns and binary states, DomusFM learns generalizable representations that transfer across environments and tasks related to activity and event analysis. Through leave-one-dataset-out evaluation across seven public smart-home datasets, we demonstrate that DomusFM outperforms state-of-the-art baselines on different downstream tasks, achieving superior performance even with only 5% of labeled training data available for fine-tuning. Our approach addresses data scarcity while maintaining practical deployability for real-world smart-home systems.
💡 Research Summary
DomusFM introduces the first foundation model specifically pretrained on smart‑home binary sensor data, addressing the unique challenges of sparsity, discreteness, and rich semantic associations inherent to such streams. The authors identify three major shortcomings in existing work: (i) supervised deep learning that demands impractically large labeled datasets, (ii) foundation models built for wearable inertial data that cannot capture the semantics of binary events, and (iii) large language model (LLM) approaches that rely on natural‑language prompting, external APIs, or costly hardware, making them unsuitable for privacy‑sensitive, real‑world deployments.
To overcome these issues, DomusFM adopts a two‑stage self‑supervised dual contrastive learning framework. In the first stage, each sensor event is decomposed into four attributes—sensor type, state (ON/OFF), timestamp, and semantic label. A lightweight language model provides semantic embeddings for the textual components, while dedicated encoders handle binary states and temporal gaps. An attribute‑level contrastive loss pulls together events sharing the same semantic attributes and pushes apart those that differ, thereby learning robust token‑level representations.
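The attribute-level objective described above can be sketched as an InfoNCE-style loss over token embeddings, where events sharing an attribute id are positives and all other events are negatives. This is a minimal illustration, not the paper's exact formulation: the function name, the single shared-attribute id per event, and the temperature value are all assumptions.

```python
import numpy as np

def attribute_contrastive_loss(embeddings, attr_ids, temperature=0.1):
    """InfoNCE-style attribute contrastive loss (hypothetical sketch).

    Events with the same attr_id are treated as positives; every other
    event in the batch is a negative. Self-pairs are excluded.
    """
    # L2-normalize so dot products are cosine similarities
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature
    n = len(attr_ids)
    eye = np.eye(n, dtype=bool)
    ids = np.asarray(attr_ids)
    pos = (ids[:, None] == ids[None, :]) & ~eye   # positive-pair mask
    sim = np.where(eye, -np.inf, sim)             # mask self-similarity
    # log-softmax over all other events in the batch
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    has_pos = pos.any(axis=1)
    # mean log-probability assigned to the positives of each anchor
    per_anchor = (np.where(pos, log_prob, 0.0).sum(axis=1)[has_pos]
                  / pos.sum(axis=1)[has_pos])
    return float(-per_anchor.mean())

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 16))            # 6 toy event embeddings
ids = np.array([0, 0, 1, 1, 2, 2])        # shared semantic attribute ids
loss = attribute_contrastive_loss(emb, ids)
```

Minimizing this quantity pulls same-attribute events together on the unit sphere while pushing differing events apart, which matches the pull/push behaviour the summary describes.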
The second stage processes entire event sequences with a Transformer‑based temporal encoder that respects irregular time intervals via log‑scaled time embeddings and positional encodings. An event‑level contrastive loss treats sequences from the same home or similar activities as positives and sequences from different homes or unrelated activities as negatives, encouraging the model to capture long‑range temporal dependencies. By jointly optimizing both losses, DomusFM learns representations that are simultaneously semantically meaningful and temporally coherent.
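One plausible reading of the log-scaled time embeddings is a sinusoidal encoding applied to log-compressed inter-event gaps, so that second-scale and day-scale gaps occupy comparable ranges. The function below is a sketch under that assumption; the dimension and frequency schedule are illustrative, not taken from the paper.

```python
import numpy as np

def log_time_embedding(delta_seconds, dim=8):
    """Sinusoidal embedding of log-scaled inter-event gaps (sketch).

    log1p compresses highly irregular gaps (1 s vs. 1 day) onto a
    comparable scale before encoding with geometrically spaced
    frequencies, as in standard positional encodings.
    """
    t = np.log1p(np.asarray(delta_seconds, dtype=float))[:, None]
    freqs = 1.0 / (10.0 ** (2.0 * np.arange(dim // 2) / dim))
    angles = t * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

gaps = [1, 60, 3600, 86400]             # 1 s, 1 min, 1 h, 1 day
emb = log_time_embedding(gaps, dim=8)   # shape (4, 8), values in [-1, 1]
```

Feeding such embeddings to the Transformer alongside positional encodings lets attention distinguish a burst of events from a long overnight gap without blowing up the input range.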
The architecture consists of three parallel modules: (1) a semantic embedding module, (2) a binary‑state encoder, and (3) a temporal‑pattern encoder. Their outputs are concatenated and passed through a shallow MLP to produce a token vector, which is then fed into the Transformer stack to obtain context‑aware sequence embeddings. During pretraining, the model is exposed to seven publicly available smart‑home datasets, amounting to hundreds of thousands of events, without any label information. Sensor identifiers are normalized to a shared vocabulary, and inter‑event times are log‑normalized to mitigate dataset heterogeneity.
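The fusion step above can be sketched as a concatenate-then-project forward pass. All dimensions, the toy random weights, and the single-hidden-layer MLP are assumptions for illustration; the paper only specifies that the three module outputs are concatenated and passed through a shallow MLP before the Transformer stack.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- not specified in the summary.
D_SEM, D_STATE, D_TIME, D_TOK = 32, 4, 8, 64

# Toy per-event outputs of the three parallel modules, for 5 events:
sem   = rng.normal(size=(5, D_SEM))    # (1) semantic-embedding module
state = rng.normal(size=(5, D_STATE))  # (2) binary-state encoder
time  = rng.normal(size=(5, D_TIME))   # (3) temporal-pattern encoder

# Fuse: concatenate, then project with a shallow (one-hidden-layer) MLP
W1 = rng.normal(size=(D_SEM + D_STATE + D_TIME, 128)) * 0.1
W2 = rng.normal(size=(128, D_TOK)) * 0.1
x = np.concatenate([sem, state, time], axis=1)   # (5, 44)
tokens = np.maximum(x @ W1, 0.0) @ W2            # ReLU MLP -> (5, D_TOK)
# `tokens` would then enter the Transformer stack to obtain
# context-aware sequence embeddings.
```

The point of the sketch is the data flow: each event becomes a single fixed-width token vector before any sequence-level modeling happens.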
Evaluation follows a rigorous leave‑one‑dataset‑out (LODO) protocol: each of the seven datasets is held out as a test set while the remaining six are used for pretraining and fine‑tuning. Two downstream tasks are considered—(a) Activity‑of‑Daily‑Living (ADL) recognition and (b) next‑event prediction—under three label‑scarcity regimes (5%, 10%, 20% of the training data labeled). DomusFM consistently outperforms strong baselines, including CNN/LSTM supervised models, transfer‑learning approaches, and LLM‑based prompting or fine‑tuning methods. Notably, with only 5% labeled data, DomusFM achieves an average F1‑score of 0.82, surpassing the best competing method by more than 10 percentage points. The model contains roughly 45 M parameters, enabling deployment on edge devices such as Raspberry Pi with real‑time inference.
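The LODO protocol is straightforward to express as a generator over dataset names. The names below are placeholders, not the actual seven datasets used in the paper.

```python
# Leave-one-dataset-out: each dataset is held out once as the test set
# while the remaining ones serve for pretraining and fine-tuning.
# Dataset names here are placeholders.
DATASETS = ["D1", "D2", "D3", "D4", "D5", "D6", "D7"]

def lodo_splits(names):
    """Yield (train_datasets, held_out_dataset) pairs, one per dataset."""
    for held_out in names:
        train = [d for d in names if d != held_out]
        yield train, held_out

splits = list(lodo_splits(DATASETS))   # 7 splits, 6 train datasets each
```

With seven datasets this yields seven train/test configurations, so every reported number reflects transfer to a home environment never seen during pretraining.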
Key contributions are: (1) the creation of the first foundation model pretrained directly on diverse smart‑home sensor streams, eliminating the need for textual prompts or external APIs; (2) a novel dual‑contrastive pretraining strategy that simultaneously learns token‑level semantic attributes and sequence‑level temporal dynamics; (3) extensive empirical validation across seven datasets and two tasks, demonstrating superior generalization, especially under realistic data‑scarcity conditions; and (4) a commitment to open‑source release, fostering reproducibility and further research.
Limitations include the current focus on binary and discretized continuous sensors (high‑level states) while ignoring high‑frequency raw continuous signals, a geographic bias toward European and Oceanic homes, and a relatively simple negative‑sampling scheme for contrastive learning. Future work will explore multimodal extensions (audio, video), region‑specific pretraining, and adaptive weighting of contrastive losses to further improve robustness and applicability.