BabyMamba-HAR: Lightweight Selective State Space Models for Efficient Human Activity Recognition on Resource-Constrained Devices
Human activity recognition (HAR) on wearable and mobile devices is constrained by memory footprint and computational budget, yet competitive accuracy must be maintained across heterogeneous sensor configurations. Selective state space models (SSMs) offer linear-time sequence processing with input-dependent gating, presenting a compelling alternative to quadratic-complexity attention mechanisms. However, the design space for deploying SSMs in the TinyML regime remains largely unexplored. In this paper, BabyMamba-HAR is introduced, a framework comprising two novel lightweight Mamba-inspired architectures optimized for resource-constrained HAR: (1) CI-BabyMamba-HAR, which uses a channel-independent stem that processes each sensor channel through shared-weight but independent transformations to prevent cross-channel noise propagation, and (2) Crossover-BiDir-BabyMamba-HAR, which uses an early-fusion stem that achieves channel-count-independent computational complexity. Both variants incorporate weight-tied bidirectional scanning and lightweight temporal attention pooling. Through evaluation across eight diverse benchmarks, it is demonstrated that Crossover-BiDir-BabyMamba-HAR achieves an 86.52% average macro F1-score with approximately 27K parameters and 2.21M MACs, matching TinyHAR (86.16%) while requiring 11x fewer MACs on high-channel datasets. Systematic ablation studies reveal that bidirectional scanning contributes up to an 8.42% F1-score improvement, and gated temporal attention provides up to an 8.94% F1-score gain over mean pooling. These findings establish practical design principles for deploying selective state space models as efficient TinyML backbones for HAR.
💡 Research Summary
The paper tackles the challenge of deploying human activity recognition (HAR) models on resource‑constrained wearable and mobile devices, where both memory footprint and computational budget are severely limited. While recent lightweight HAR solutions rely on CNN‑LSTM hybrids, depthwise separable convolutions, or knowledge distillation, they still inherit the quadratic complexity of self‑attention mechanisms when temporal modeling is required. To address this, the authors introduce BabyMamba‑HAR, a family of two novel architectures inspired by the Mamba selective state‑space model (SSM), which processes sequences in linear time O(N) by making the discretization step Δt and the state‑transition matrices input‑dependent.
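The selection mechanism described above can be sketched in a few lines. This is a minimal, illustrative NumPy recurrence, not the paper's implementation: the names (`W_dt`, `W_B`, `W_C`) and shapes are assumptions chosen to show how the step size Δt and the projections B and C become functions of the input, while the recurrence itself stays linear in sequence length.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan(x, A, W_dt, b_dt, W_B, W_C):
    """Illustrative selective-SSM recurrence (shapes are assumptions).

    x: (T, d) input sequence; A: (d, n) negative-real state matrix;
    W_dt: (d, d), b_dt: (d,) produce the input-dependent step Delta_t;
    W_B, W_C: (d, n) produce input-dependent input/output projections.
    Runtime is O(T): one constant-cost update per timestep.
    """
    T, d = x.shape
    n = A.shape[1]
    h = np.zeros((d, n))                 # one n-dim hidden state per feature
    y = np.empty((T, d))
    for t in range(T):
        dt = softplus(x[t] @ W_dt + b_dt)     # input-dependent step (d,)
        B_t = x[t] @ W_B                      # input-dependent input proj (n,)
        C_t = x[t] @ W_C                      # input-dependent output proj (n,)
        A_bar = np.exp(dt[:, None] * A)       # zero-order-hold discretisation
        h = A_bar * h + dt[:, None] * B_t[None, :] * x[t][:, None]
        y[t] = h @ C_t                        # read out the state
    return y
```

Because A has negative real entries, `A_bar` stays in (0, 1) and the scan is numerically stable; small Δt lets the model skip an input, large Δt lets it reset on one, which is the "selective" behaviour the summary refers to.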
The first variant, CI‑BabyMamba‑HAR (Channel‑Independent), employs a shared 1‑D convolution‑BN‑SiLU stem that processes each sensor channel separately but with identical weights. This design isolates noise between heterogeneous sensors, preserving channel‑specific information while keeping parameter count low (≈28 K). After stem processing, each channel passes through weight‑tied bidirectional SSM blocks; forward and backward scans share parameters, effectively doubling the receptive field without extra parameters. The per‑channel representations are then fused by averaging.
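The two ideas in this variant, a shared kernel applied to each channel in isolation and a weight-tied forward/backward scan, can be sketched as follows. This is a simplified NumPy illustration under assumed shapes, not the paper's code (the real stem is a Conv-BN-SiLU block and the scan is the selective SSM):

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def shared_channel_stem(x, w, b):
    """Channel-independent stem sketch: the SAME 1-D kernel w is applied to
    every channel separately, so channels never mix and noise in one sensor
    cannot propagate into another.  x: (C, T); w: (k,); b: scalar."""
    k = len(w)
    pad = np.pad(x, ((0, 0), (k // 2, k - 1 - k // 2)))          # same-length output
    out = np.stack([np.convolve(row, w[::-1], mode="valid") for row in pad]) + b
    return silu(out)

def bidirectional(scan_fn, x):
    """Weight-tied bidirectional scan: run the SAME scan (same parameters)
    forward and on the time-reversed sequence, then combine.  Doubles the
    receptive field at zero extra parameter cost."""
    fwd = scan_fn(x)
    bwd = scan_fn(x[..., ::-1])[..., ::-1]
    return fwd + bwd
```

After the per-channel SSM blocks, the summary says the C per-channel representations are fused by a simple average, e.g. `features.mean(axis=0)`.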
The second variant, Crossover‑BiDir‑BabyMamba‑HAR (Early Fusion), projects all C input channels into a d‑model dimensional space with a single convolutional layer, making the backbone’s computational cost independent of C. The fused representation is fed into the same weight‑tied bidirectional SSM blocks, where a “crossover” connection enables information exchange between forward and backward paths. This architecture is especially efficient for high‑channel datasets (e.g., 79‑channel Opportunity), achieving roughly 11× fewer MAC operations than TinyHAR while maintaining comparable accuracy.
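The complexity argument for early fusion is easy to make concrete. A hedged sketch (assumed names and shapes, not the authors' code): a single pointwise projection maps the C raw channels to d_model features, so only this first layer scales with C, and everything downstream costs the same whether the dataset has 3 channels or 79.

```python
import numpy as np

def early_fusion_stem(x, W, b):
    """Early-fusion stem sketch: one projection fuses all C channels.
    x: (T, C); W: (C, d_model); b: (d_model,).  Output: (T, d_model)."""
    z = x @ W + b
    return z / (1.0 + np.exp(-z))   # SiLU activation

def stem_macs(T, C, d_model):
    """Only the stem's cost grows with channel count C; the SSM backbone
    afterwards sees a fixed d_model regardless of the sensor setup."""
    return T * C * d_model
```

For a 79-channel Opportunity window of length 128 projected to, say, d_model = 16, the channel-dependent cost is just `128 * 79 * 16` multiply-accumulates; the backbone's cost is fixed, which is why the full model lands at only a few million MACs even on high-channel inputs.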
Both architectures incorporate a lightweight context‑gated temporal attention pooling head. Instead of treating all timesteps equally, the head learns attention scores via a tanh‑based projection followed by a softmax, then aggregates the sequence into a single vector. This adds only ~600 parameters but yields up to an 8.94 % macro‑F1 improvement over simple mean pooling, particularly on datasets with brief discriminative motion bursts.
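The pooling head described above (tanh projection, softmax over time, weighted sum) can be written in a handful of lines. A minimal sketch with assumed parameter names; with a hidden size a, it costs roughly d·a + a parameters, consistent with the ~600 figure for small d and a:

```python
import numpy as np

def gated_attention_pool(H, W, v):
    """Temporal attention pooling sketch.
    H: (T, d) sequence of features; W: (d, a); v: (a,).
    Scores s_t = v . tanh(h_t W), softmaxed over time, then a weighted sum
    collapses the sequence to a single (d,) vector."""
    s = np.tanh(H @ W) @ v               # (T,) unnormalised scores
    s = s - s.max()                       # numerically stable softmax
    alpha = np.exp(s) / np.exp(s).sum()   # attention weights over timesteps
    return alpha @ H                      # (d,) pooled representation
```

When every timestep looks the same the weights become uniform and the head reduces to mean pooling; its benefit appears exactly when a few timesteps (a brief motion burst) deserve most of the weight.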
Experiments are conducted on eight public HAR benchmarks covering a wide range of sensor configurations, sampling rates, and window lengths: UCI‑HAR, MotionSense, WISDM, PAMAP2, Opportunity, UniMiB‑SHAR, Skoda, and Daphnet. All datasets are processed with a unified pipeline (z‑score per channel, optional Butterworth filtering, 75 % overlapping windows of length 128, subject‑independent splits or LOSO CV). For the single‑subject Skoda dataset, the authors adopt a temporal split (first 80 % of windows for training, remaining 20 % for testing) to avoid data leakage caused by overlapping windows. Models are trained with AdamW, ReduceLROnPlateau scheduling, early stopping, and a suite of augmentations (time warping, magnitude scaling, Gaussian jitter, channel dropout).
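The windowing and split described above are simple to reproduce. A sketch of the pipeline's core steps (function names are illustrative, not the authors' code): per-channel z-score, length-128 windows with 75 % overlap (hop of 32 samples), and the temporal 80/20 split used for single-subject Skoda so that overlapping windows never straddle the train/test boundary.

```python
import numpy as np

def zscore(x, eps=1e-8):
    """Per-channel normalisation of a (T, C) recording."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def make_windows(x, win=128, overlap=0.75):
    """Slice a (T, C) recording into overlapping windows.
    With 75% overlap the hop is win // 4 = 32 samples."""
    hop = int(win * (1 - overlap))
    starts = range(0, x.shape[0] - win + 1, hop)
    return np.stack([x[s:s + win] for s in starts])

def temporal_split(windows, train_frac=0.8):
    """First 80% of windows for training, rest for testing: a random split
    would leak data, since adjacent windows share 75% of their samples."""
    cut = int(len(windows) * train_frac)
    return windows[:cut], windows[cut:]
```

For subject-rich datasets the summary notes that subject-independent splits or leave-one-subject-out cross-validation are used instead, which removes the leakage problem at the subject level rather than the window level.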
Results show that Crossover‑BiDir‑BabyMamba‑HAR attains an average macro‑F1 of 86.52 % with only ~27 K parameters and 2.21 M MACs, matching TinyHAR’s 86.16 % while using far fewer operations. On the high‑channel Opportunity dataset, it achieves 88.81 % F1 with 3.44 M MACs versus TinyHAR’s 38.30 M MACs. CI‑BabyMamba‑HAR, while robust to heterogeneous sensor noise, scales linearly with channel count and becomes impractical for very high‑channel inputs (222 M MACs on Opportunity). Ablation studies confirm that (1) bidirectional scanning contributes up to an 8.42 % F1 gain on complex temporal datasets, (2) gated temporal attention improves performance by up to 8.94 % compared to mean pooling, and (3) early‑fusion stems dominate across most datasets, with the exception of PAMAP2 where channel‑independent processing better handles intentional sensor artifacts.
In summary, the paper demonstrates three key design principles for TinyML‑level HAR: (i) choose a stem architecture (channel‑independent vs early‑fusion) based on sensor heterogeneity, (ii) employ weight‑tied bidirectional SSM to double temporal context without extra parameters, and (iii) use lightweight attention pooling to focus on discriminative timesteps. The work establishes that selective state‑space models can serve as efficient backbones for HAR, opening avenues for further research such as quantization, hardware‑aware architecture search, and automated stem selection for diverse embedded scenarios.