Collaborative Temporal Feature Generation via Critic-Free Reinforcement Learning for Cross-User Sensor-Based Activity Recognition

Xiaozhou Ye^{a,*}, Feng Jiang^a, Zihan Wang^b, Xiulai Wang^{c,a,**}, Yutao Zhang^{c,**}, Kevin I-Kai Wang^d

a School of Artificial Intelligence (School of Future Technology), Nanjing University of Information Science and Technology, Nanjing, China
b School of Cyberspace Security, Nanjing University of Information Science and Technology, Nanjing, China
c Jinling Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, China
d Department of Electrical, Computer, and Software Engineering, The University of Auckland, Auckland, New Zealand

Abstract

Human Activity Recognition using wearable inertial sensors is foundational to healthcare monitoring, fitness analytics, and context-aware computing, yet its deployment is hindered by cross-user variability arising from heterogeneous physiological traits, motor habits, and sensor placements. Existing domain generalization approaches (adversarial alignment, causal invariance learning, and contrastive pretraining) either neglect temporal dependencies in sensor streams or depend on impractical target-domain annotations. We propose a fundamentally different paradigm: modeling generalizable feature extraction as a collaborative sequential generation process governed by reinforcement learning. Our framework, CTFG (Collaborative Temporal Feature Generation), employs a Transformer-based autoregressive generator that incrementally constructs feature token sequences, each conditioned on prior context and the encoded sensor input. The generator is optimized via Group-Relative Policy Optimization, a critic-free algorithm that evaluates each generated sequence against a cohort of alternatives sampled from the same input, deriving advantages through intra-group normalization rather than learned value estimation.
This design eliminates the distribution-dependent bias inherent in critic-based methods and provides self-calibrating optimization signals that remain stable across heterogeneous user distributions. A tri-objective reward comprising class discrimination, cross-user invariance, and temporal fidelity jointly shapes the feature space to separate activities, align user distributions, and preserve fine-grained temporal content. Evaluations on the DSADS and PAMAP2 benchmarks demonstrate state-of-the-art cross-user accuracy (88.53% and 75.22%), substantial reduction in inter-task training variance, accelerated convergence, and robust generalization under varying action-space dimensionalities.

Keywords: Human Activity Recognition, Domain Generalization, Reinforcement Learning, Temporal Modeling, Wearable Sensors

* First author. ** Corresponding author.
Email addresses: 200102@nuist.edu.cn (Xiaozhou Ye), 900231@nuist.edu.cn (Xiulai Wang), 003448@nuist.edu.cn (Yutao Zhang)

1. Introduction

Wearable inertial measurement units (IMUs) capture the kinematic signatures of human movement as multivariate time series, enabling automatic classification of activities such as walking, cycling, and stair climbing [48, 1]. These Human Activity Recognition (HAR) systems underpin clinical gait assessment, fall detection in elderly care, adaptive fitness coaching, and smart-environment interaction [2]. Despite significant progress in deep learning for HAR, a persistent challenge remains: models trained on a cohort of source users exhibit substantial performance degradation when deployed on unseen target users.
This cross-user distribution shift is rooted in physiological diversity: limb length governs stride kinematics, body mass modulates ground-reaction-force profiles, and muscle composition alters the spectral characteristics of accelerometer signals, causing even identical activities to produce markedly different inertial signatures across individuals [3, 4].

Domain generalization (DG) addresses this challenge by learning representations from labeled source domains that transfer to arbitrary unseen targets without any target data during training [5]. The field has seen rapid progress: causal invariance learning discovers features robust to environment shifts through concept-level disentanglement [6, 45]; contrastive pretraining aligns representations across datasets and device configurations via self-supervised objectives [7, 8]; adversarial domain generalization synthesizes diverse pseudo-domain features to bridge distributional discrepancies [43, 44]; and time-series out-of-distribution frameworks partition latent subdomains to capture distribution heterogeneity [9]. Despite these advances, three recurrent limitations persist in the context of cross-user HAR.

Adversarial alignment methods operate on aggregated or frame-level feature representations, discarding the sequential structure that encodes activity dynamics, the precise feature class that is most stable across users [10]. Causal deconfounding approaches [11] disentangle domain-specific and domain-invariant factors but do not exploit the multi-scale temporal relational structure intrinsic to human motion. Human activities are compositionally structured across time, with discriminative information residing in the evolving relationships between temporal phases (e.g., the stance-to-swing transition in gait, the extension-flexion cycle in pedaling) rather than in individual time steps [12].
Conventional supervised objectives perform single-pass feature extraction that conflates this compositional temporal structure into monolithic representations, making it difficult to disentangle user-invariant dynamics from user-specific artifacts [13]. Furthermore, meta-learning approaches require domain-specific labels and per-user calibration [20], while data augmentation strategies generate synthetic diversity that often introduces artifacts rather than capturing temporal invariances [21]. Source-free adaptation reduces deployment overhead but still assumes access to a pretrained source model [18].

To address these limitations, we propose to reformulate feature extraction as an active, multi-step generation process governed by reinforcement learning (RL), where a policy network constructs feature representations incrementally, token by token, with each decision conditioned on prior context. Two structural observations motivate this formulation. If the discriminative information in activity signals is distributed across temporal phases, then the feature extraction process should mirror this structure: an autoregressive generator that builds tokens sequentially can first capture coarse temporal motifs (overall periodicity) and then progressively refine them with finer details (phase relationships, transition dynamics). Moreover, the desirable properties of a generalizable feature space, inter-class separation and intra-class cross-user alignment, are distributional properties of the complete representation, not of individual tokens. RL provides a natural mechanism for evaluating these properties: the reward is computed over the entire generated token sequence, enabling holistic assessment that per-token surrogate losses cannot provide [22].
While RL-based feature generation addresses the optimization-representation mismatch, standard policy gradient methods such as Proximal Policy Optimization (PPO) [23] require a learned value function to estimate per-step advantages. In the cross-user HAR context, this value function must estimate expected returns across feature distributions generated from heterogeneous source users, yet it is itself trained on source data and susceptible to the very distribution biases the framework seeks to overcome. Group-Relative Policy Optimization (GRPO) [24] resolves this by eliminating the value function. For each input, GRPO samples a group of candidate feature sequences, computes rewards for each, and derives advantages by normalizing against the group's statistics. This collaborative evaluation, in which each generation is assessed relative to contemporaneous alternatives, provides three structural benefits: (1) the advantage signal is self-calibrating and invariant to absolute reward scale [25]; (2) the group sampling inherently encourages exploration of diverse feature configurations [26]; and (3) eliminating the critic network reduces the parameter surface susceptible to source-user overfitting.

In summary, we introduce CTFG (Collaborative Temporal Feature Generation), a framework integrating autoregressive feature generation, critic-free policy optimization, and a tri-objective reward system for cross-user HAR. The key contributions of this paper are:

(1) We reformulate domain-generalizable feature extraction as collaborative sequential generation, where a Transformer-based autoregressive policy constructs temporal token sequences optimized via GRPO, a critic-free RL algorithm that derives advantages through intra-group reward normalization rather than learned value estimation.
We provide formal analysis (Proposition 1) showing that group-relative advantages yield affine-invariant gradient signals that remain stable under the reward-scale variation induced by heterogeneous source-user distributions, resolving the value-function bottleneck inherent in critic-based RL for cross-user HAR.

(2) We design an autoregressive Transformer decoder that generates feature tokens incrementally, with each token conditioned on the encoded sensor input and all previously generated tokens via causal self-attention and cross-attention. This compositional construction captures the hierarchical temporal structure of human activities, from coarse periodicity (gait cycles, pedaling rhythm) to fine-grained phase transitions (stance-to-swing, extension-to-flexion), producing representations where user-invariant temporal dynamics are explicitly separated from user-specific amplitude artifacts.

(3) We propose a tri-objective reward mechanism combining class discrimination, cross-user invariance, and temporal fidelity. The temporal fidelity component, novel in the context of RL-driven HAR, explicitly penalizes information loss during feature generation by requiring reconstructibility of the encoder's temporal representation, preventing the policy from discarding fine-grained temporal content in pursuit of distributional alignment objectives.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the methodology with formal analysis. Section 4 reports experimental results. Section 5 concludes with future directions.

2. Related work

2.1. Generalizable representations for wearable sensor HAR

Domain generalization in HAR aims to build models from multiple source users that generalize to unseen targets without target data [1, 5]. Recent advances span several paradigms.

Causal and invariance-based methods. Xiong et al.
[6] proposed categorical concept invariant learning, separating causal activity features from domain-specific confounders via concept-level disentanglement. The same group introduced a deconfounding causal inference framework using two-branch architectures with early forking to isolate domain-invariant features via the Hilbert-Schmidt Independence Criterion [11]. Shao et al. [45] bridged single domain generalization to causal concepts, proposing CaRGI, a framework that employs generative intervention to enlarge domain shifts with semantic consistency while learning causal representations via counterfactual inference. Lu et al. [9] presented DIVERSIFY, a general framework for time-series out-of-distribution detection and generalization that discovers latent subdomains through iterative adversarial partitioning. While these methods achieve principled distribution-shift robustness, they operate on aggregate representations and do not exploit the sequential temporal structure intrinsic to human motion.

Contrastive and self-supervised approaches. CrossHAR [7] achieves cross-dataset generalization through hierarchical self-supervised pretraining with sensor-aware augmentation. ContrastSense [8] employs domain-invariant contrastive learning with parameter-wise penalties for in-the-wild wearable sensing. UniMTS [14] introduced unified pre-training for motion time series, aligning sensor data with LLM-enriched text descriptions. DDLearn [15] combines distribution divergence enlargement with supervised contrastive learning for low-resource generalization. These approaches produce powerful pretrained representations but require large-scale pretraining corpora and lack mechanisms for explicit temporal structure preservation during feature extraction.

Domain adaptation approaches. Adaptive Feature Fusion (AFFAR) [4] dynamically fuses domain-invariant and domain-specific representations.
SWL-Adapt [16] uses meta-optimization-based sample weight learning for cross-user adaptation. DWLR [17] addresses the compound challenge of feature shift and label shift across users. SF-Adapter [18] enables computationally efficient source-free adaptation for edge deployment. Liu et al. [43] introduced GADPN, a generative adversarial domain-enhanced prediction network that synthesizes diverse pseudo-domain samples to mitigate distributional discrepancies in multi-source domain generalization scenarios. Wang et al. [44] proposed MSDGM, a lightweight multi-source domain generalization model combining MobileNet-based feature extraction with Mamba-based dynamic parameter adjustment, demonstrating effective generalization across unseen domains. However, adaptation methods require unlabeled target data, violating the strict DG constraint where target users are entirely unknown during training.

Transformer-based cross-subject methods. TASKED [10] combines a Transformer architecture with adversarial learning and self-knowledge distillation for cross-subject HAR. CapMatch [19] introduces semi-supervised contrastive Transformer capsules with knowledge distillation. These extensions advance cross-user generalization but inherit limitations from adversarial training (instability, mode collapse) and treat sensor data as fixed-window snapshots rather than sequential processes.

2.2. Policy optimization: from critic-based to critic-free methods

PPO [23] is the dominant policy gradient algorithm, using advantages estimated via Generalized Advantage Estimation (GAE) [34] from a learned value function. Recent work in language model alignment has demonstrated that critic-free alternatives provide comparable or superior performance. Ahmadian et al. [25] introduced REINFORCE Leave-One-Out (RLOO), estimating value baselines from multiple sampled completions. ReMax [27] eliminates the critic using a max-reward baseline.
Direct Preference Optimization (DPO) [28] reparameterizes the RL objective into a supervised loss, bypassing both reward model and critic. GRPO [24] computes advantages by sampling multiple outputs per input and normalizing their rewards within the group. DAPO [26] extends this paradigm with clip-higher mechanisms and dynamic sampling. Guan et al. [29] demonstrated that RL-based sample reweighting can support federated domain generalization, though without autoregressive generation. To our knowledge, no prior work has applied GRPO, or any critic-free RL method, to sequential feature generation for sensor-based domain generalization.

2.3. Temporal modeling for sensor data

Transformers have advanced time-series modeling by enabling global temporal reasoning via self-attention, overcoming the vanishing-gradient limitations of RNNs and the fixed receptive fields of CNNs [12]. ConvTran [30] improved positional encoding for multivariate time series classification. Essa and Abdelmaksoud [13] combined temporal-channel convolutions with self-attention for HAR, while Ek et al. [31] demonstrated Transformer robustness to device heterogeneity. Islam et al. [2] presented multi-level attention-based feature fusion for multimodal wearable HAR in Information Fusion. IMUGPT 2.0 [32] leveraged LLMs for cross-modality IMU data generation. However, existing Transformer-based HAR models use supervised objectives that do not explicitly promote user-invariant representations, and autoregressive variants are designed for prediction rather than generalizable feature construction. Our framework bridges this gap by situating the Transformer within an RL loop where autoregressive generation becomes the mechanism for exploring and constructing generalizable representations, guided by distributional reward signals that enforce cross-user consistency through a critic-free optimization paradigm.

3. Methodology

Fig.
1 depicts the complete CTFG framework, highlighting its design for learning domain-generalizable features. The key innovation lies in the training phase, where an autoregressive decoder generates candidate feature sequences per input, framing feature learning as a sequential decision process amenable to reinforcement learning. A tri-objective reward function captures the essential properties of generalizable representations: class separability, cross-user invariance, and temporal coherence. GRPO leverages these rewards to compute group-relative advantages without training a critic network, updating the policy to favor feature generation patterns that yield superior distributional properties. At inference, stochastic sampling is replaced by deterministic mean prediction, ensuring efficient deployment while preserving representation quality. A lightweight logistic regression classifier then maps the extracted features to activity labels.

3.1. Problem setting and notation

Consider $K$ source users, each with labeled data $\mathcal{D}_k = \{(x_i^k, y_i^k)\}_{i=1}^{n_k}$, where $x_i^k \in \mathbb{R}^{l \times d}$ is a sensor sequence of length $l$ with $d$ channels, and $y_i^k \in \{1, \ldots, C\}$ is the activity label. Each user $k$ induces a distinct joint distribution $P_k(x, y)$, with $P_i \neq P_j$ for $i \neq j$ due to physiological and behavioral heterogeneity. The objective is to learn a feature mapping and classifier using only source data that generalizes to unseen target users drawn from distributions $\{Q_m\}_{m=1}^{M}$ satisfying $Q_m \neq P_k$ for all $m, k$, while the label space and feature space remain the same.

3.2. Feature extraction as a Markov decision process

The central insight of CTFG is to recast feature extraction, traditionally a single deterministic forward pass, as a sequential decision-making process.
Human activities are compositional across time: a gait cycle comprises distinct phases (heel strike, stance, toe-off, swing) whose temporal ordering carries discriminative information that is largely user-invariant, even though amplitude and waveform morphology vary with limb length, body mass, and muscle composition [12]. Standard extractors compress this hierarchy into a fixed vector in a single pass, conflating user-invariant dynamics with user-specific artifacts. An autoregressive process addresses this by constructing the representation token by token, with each token conditioned on all preceding tokens and the encoded input. Early tokens capture coarse, globally stable patterns (periodicity, dominant frequency); later tokens refine them with finer details (phase transitions, inter-joint coordination). Crucially, the RL reward is computed only after the complete sequence, providing a holistic quality signal that evaluates distributional properties, inter-class separability and cross-user alignment, rather than optimizing each token against a per-step surrogate loss.

We formalize this as the following MDP.

Definition 1 (Feature Generation MDP). For each input sample $x_i \in \mathbb{R}^{l \times d}$:

• State at step $j$: $s_{i,j} = (h_i, z_{i,1:j-1})$, combining the encoder's hidden representation $h_i \in \mathbb{R}^{l \times d_{\text{model}}}$ with all previously generated tokens.
• Action at step $j$: generation of token $z_{i,j} \in \mathbb{R}^{k}$, sampled from a parameterized distribution.
• Transition: deterministic concatenation $s_{i,j+1} = (h_i, z_{i,1:j})$.
• Reward: computed once after all $s$ tokens are generated, evaluating the distributional quality of the complete feature batch.
• Horizon: fixed at $s$ (number of feature tokens).

Figure 1: Overview of the proposed CTFG framework.
During training, the autoregressive generator produces $G$ candidate feature sequences per input, evaluated by a tri-objective reward ($R_{\text{cls}}$, $R_{\text{inv}}$, $R_{\text{tmp}}$) and optimized via group-relative advantages $\hat{A}^{(g)}$ without a value function. During inference, a single deterministic forward pass (using predicted means $\mu_{i,j}$ only) generates features, which are flattened into $\tilde{z}_i = \text{vec}(z_i)$ and classified by Logistic Regression to produce the predicted label $\hat{y}_i$.

The state at step $j$ comprises the encoder representation $h_i$ (a fixed summary of the raw input) and the partial sequence $z_{i,1:j-1}$ (all prior decisions), enabling the policy to adaptively allocate representational capacity. The continuous action space ($\mathbb{R}^k$) reflects the nature of feature representations; each token is sampled from a diagonal Gaussian, providing exploration variance for policy gradient estimation during training and deterministic mean prediction during inference. Transitions are deterministic (the state is augmented by the new token), so trajectory stochasticity resides entirely in action selection, simplifying gradient computation.

The reward is delayed until the complete sequence is generated. This is deliberate: the properties to optimize, cross-user feature clustering and inter-class separation, are properties of the entire batch, not of individual tokens. A per-step reward would require decomposing batch-level properties into token-level contributions, which is ill-defined when a single token may aid discrimination but harm invariance. The delayed reward delegates credit assignment to the RL algorithm.

The policy $\pi_\theta(z_{i,j} \mid x_i, z_{i,1:j-1})$, parameterized by $\theta$, generates the complete representation $z_i = \{z_{i,1}, \ldots, z_{i,s}\} \in \mathbb{R}^{s \times k}$ through $s$ sequential decisions. The objective is:

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{\pi_\theta}\left[ R\left(\{z_i\}_{i \in \mathcal{B}}, \{h_i\}_{i \in \mathcal{B}}\right) \right], \quad (1)$$

where $\mathcal{B}$ is a stratified mini-batch covering all source users and activity classes.
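The generation MDP can be sketched in a few lines of numpy. Here `toy_decoder` is a hypothetical stand-in for the Transformer policy (its tanh form and dimensions are our illustrative assumptions, not the paper's architecture); the rollout loop itself mirrors Definition 1: the state is the pair $(h_i, z_{i,1:j-1})$, transitions concatenate, and the reward would be computed only after the full sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_decoder(h, prev_tokens, k=4):
    """Stand-in for the Transformer decoder: maps the state
    (h_i, z_{i,1:j-1}) to the mean and log-std of a diagonal
    Gaussian over the next token. Hypothetical form, for illustration."""
    ctx = h.mean(axis=0)[:k]
    if prev_tokens:                         # condition on prior decisions
        ctx = ctx + np.mean(prev_tokens, axis=0)
    mu = np.tanh(ctx)                       # predicted mean mu_{i,j}
    log_sigma = -1.0 * np.ones(k)           # predicted log-std
    return mu, log_sigma

def generate_sequence(h, s=6, k=4, deterministic=False):
    """One MDP rollout: s sequential actions, deterministic transitions."""
    tokens = []
    for j in range(s):
        mu, log_sigma = toy_decoder(h, tokens, k)  # state = (h, z_{1:j-1})
        if deterministic:                          # inference: mean only
            z_j = mu
        else:                                      # training: sample action
            z_j = mu + np.exp(log_sigma) * rng.standard_normal(k)
        tokens.append(z_j)                         # transition: concatenate
    return np.stack(tokens)                        # z_i in R^{s x k}

h = rng.standard_normal((75, 8))     # encoder output h_i (l=75, d_model=8)
z_train = generate_sequence(h)       # stochastic rollout for RL training
z_infer = generate_sequence(h, deterministic=True)
# the (batch-level) reward would be evaluated only at this point
```

Note how the only stochasticity is in action selection, as stated above: the deterministic pass reproduces itself exactly, which is what makes mean-only inference well defined.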
Remark 1 (Why RL over supervised optimization). Two properties make RL the appropriate paradigm. First, the reward assesses distributional structure (inter-class separation, cross-user alignment) that emerges from the collective batch, not from individual token-label pairs. While supervised analogues exist (center loss [35], MMD [36]), they require differentiability with respect to each sampled token, excluding non-differentiable metrics such as rank-based separability. RL requires only evaluability: the policy gradient theorem [37] provides gradient estimates through trajectory log-probabilities. Second, stochastic sampling performs adaptive feature-space augmentation, concentrating exploration in reward-favorable regions while avoiding configurations that collapse inter-class boundaries.

3.3. Autoregressive feature generation architecture

The policy $\pi_\theta$ is realized as a Transformer encoder-decoder [33] that maps raw sensor sequences to feature token sequences. The choice of Transformer architecture is motivated by its demonstrated capacity for capturing long-range temporal dependencies through self-attention [12], which is essential for encoding the multi-scale structure of activity signals. We describe each component and its design rationale below.

3.3.1. Temporal encoder

A linear projection $W_{\text{in}} \in \mathbb{R}^{d \times d_{\text{model}}}$ maps the sensor dimension to the model dimension, followed by additive sinusoidal positional encoding [30]:

$$h_i = \text{TransformerEncoder}(x_i W_{\text{in}} + \text{PE}(l)), \quad (2)$$

where $\text{PE}(t, q) = \sin(t \cdot \omega_q)$ for even $q$ and $\cos(t \cdot \omega_q)$ for odd $q$, with $\omega_q = 10000^{-2\lfloor q/2 \rfloor / d_{\text{model}}}$. Sinusoidal encoding is preferred over learned embeddings because its continuous frequency progression captures temporal structure at multiple scales (low frequencies encode coarse window position, high frequencies encode local ordering), matching the multi-scale nature of activity signals and generalizing to unseen sequence lengths.
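The sinusoidal encoding $\text{PE}(t, q)$ above transcribes directly into numpy; the shapes ($l = 75$, $d_{\text{model}} = 8$) are illustrative only.

```python
import numpy as np

def sinusoidal_pe(l, d_model):
    """Sinusoidal positional encoding used in Eq. (2):
    PE(t, q) = sin(t * w_q) for even q, cos(t * w_q) for odd q,
    with w_q = 10000 ** (-2 * floor(q / 2) / d_model)."""
    t = np.arange(l)[:, None]                      # time steps, shape (l, 1)
    q = np.arange(d_model)[None, :]                # dimensions, shape (1, d)
    omega = 10000.0 ** (-2.0 * (q // 2) / d_model)
    angles = t * omega                             # broadcast to (l, d_model)
    pe = np.where(q % 2 == 0, np.sin(angles), np.cos(angles))
    return pe                                      # shape (l, d_model)

pe = sinusoidal_pe(l=75, d_model=8)
# low dimensions oscillate quickly (local ordering); high dimensions
# oscillate slowly (coarse window position), the multi-scale progression
# the paragraph above appeals to
```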
The encoder's multi-head self-attention computes:

$$\text{Attn}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_h}}\right) V, \quad (3)$$

where $Q, K, V \in \mathbb{R}^{l \times d_h}$ are linear projections and $d_h = d_{\text{model}} / n_{\text{heads}}$. Multiple heads enable specialization in different temporal relationships: local neighborhoods for stride phases, distant positions for global periodicity. The output $h_i \in \mathbb{R}^{l \times d_{\text{model}}}$ retains full temporal resolution, providing the decoder with position-specific information for selective cross-attention during generation.

3.3.2. Autoregressive feature decoder

The decoder generates tokens incrementally, each conditioned on the encoder output and all preceding tokens:

$$[\mu_{i,j}, \log \sigma_{i,j}] = \text{TransformerDecoder}(h_i, z_{i,1:j-1}), \quad (4)$$

$$z_{i,j} \sim \mathcal{N}\!\left(\mu_{i,j}, \, \text{diag}(\exp(\log \sigma_{i,j}))\right), \quad (5)$$

where $\mu_{i,j}, \log \sigma_{i,j} \in \mathbb{R}^k$ parameterize a diagonal Gaussian. Each decoder layer applies three operations: (1) masked self-attention enforces causal structure, restricting step $j$ to tokens $z_{i,1:j-1}$; (2) cross-attention to $h_i$ enables selective retrieval from different temporal regions of the encoded input; (3) feed-forward layers project into Gaussian parameter space.

The causal mask induces an implicit curriculum: early tokens, generated with minimal context, must encode broadly useful features, while later tokens benefit from richer context and specialize. The diagonal Gaussian balances expressiveness with tractability: variance captures uncertainty, naturally decreasing in reward-critical dimensions as training progresses and remaining high in reward-neutral dimensions, inducing implicit feature selection. During inference, only the predicted mean $\mu_{i,j}$ is used, eliminating sampling noise. This coarse-to-fine generation emerges from the causal conditioning without explicit enforcement: early tokens capture dominant periodicity, later tokens refine phase transitions and inter-segment coordination.

3.4. Group-Relative Policy Optimization

3.4.1. Limitations of critic-based advantage estimation

In standard actor-critic methods, advantages are estimated via Generalized Advantage Estimation (GAE) [34]. Letting $\delta_{j'} = \hat{r}_{j'} + \gamma V_\phi(s_{i,j'+1}) - V_\phi(s_{i,j'})$ denote the temporal-difference residual at step $j'$, the advantage at step $j$ is

$$\hat{A}^{\text{GAE}}_{i,j} = \sum_{j'=j}^{s} (\gamma \lambda)^{j'-j} \, \delta_{j'}, \quad (6)$$

where $V_\phi$ is a learned value function and $\gamma, \lambda \in [0, 1]$ are discount and trace-decay parameters. In the cross-user setting, this introduces two problems. First, $V_\phi(s_{i,j})$ must estimate expected future reward, but this depends on which users populate the current mini-batch, a non-stationary target that induces systematic estimation bias. Second, reward scale varies across leave-one-group-out configurations, causing the effective learning rate to fluctuate and destabilizing convergence.

3.4.2. GRPO formulation

GRPO resolves both issues by replacing learned value estimation with empirical within-group normalization. For each sample $x_i$, the policy generates $G$ independent complete feature sequences:

$$z^{(g)}_i = \left(z^{(g)}_{i,1}, \ldots, z^{(g)}_{i,s}\right), \quad g = 1, \ldots, G. \quad (7)$$

Each sequence is generated independently by sampling from the policy's output distribution at each step. The independence of the $G$ sequences is important: it ensures that the group provides an unbiased sample of the reward distribution under the current policy for the given input, enabling statistically valid advantage estimation.

The group-relative advantage for the $g$-th sequence is:

$$\hat{A}^{(g)} = \frac{R^{(g)} - \bar{R}_G}{\hat{\sigma}_G + \epsilon_s}, \qquad \bar{R}_G = \frac{1}{G} \sum_{g=1}^{G} R^{(g)}, \qquad \hat{\sigma}_G = \sqrt{\frac{1}{G} \sum_{g=1}^{G} \left(R^{(g)} - \bar{R}_G\right)^2}. \quad (8)$$

The advantage measures how many standard deviations a sequence's reward lies above or below the group mean, making it independent of absolute reward scale (formalized as affine invariance in Proposition 1). The stabilization constant $\epsilon_s$ bounds amplification when $\hat{\sigma}_G \approx 0$.
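Eq. (8) and the clipped per-token objective reduce to a few lines of numpy. This sketch uses scalar rewards for $G = 4$ rollouts and omits the KL penalty; the final assertion checks the affine invariance claimed in Proposition 1(iii) numerically.

```python
import numpy as np

def group_relative_advantages(rewards, eps_s=1e-8):
    """Eq. (8): normalize each rollout's reward against its group's
    empirical mean and standard deviation (no learned critic)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps_s)

def clipped_surrogate(ratio, adv, eps=0.2):
    """Clipped per-token objective from the GRPO loss (Eq. (9));
    the KL penalty against the reference policy is omitted here."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

rewards = [2.0, 3.5, 1.0, 3.0]                 # rewards of G = 4 rollouts
adv = group_relative_advantages(rewards)       # zero-centered, unit-scale

# affine invariance (Proposition 1(iii)): shifting and positively
# rescaling all rewards leaves the advantages unchanged
adv_scaled = group_relative_advantages([5.0 * r + 7.0 for r in rewards])
assert np.allclose(adv, adv_scaled)

# clipping caps the update when the importance ratio drifts: with a
# positive advantage, a ratio of 1.5 is truncated at 1 + eps = 1.2
surr = clipped_surrogate(np.array([1.5]), np.array([1.0]))
```

The assertion makes the practical point concrete: whatever reward scale a particular leave-one-group-out configuration produces, the gradient signal fed to the policy is identical.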
The complete GRPO loss combines clipped surrogate objectives with group-relative advantages:

$$\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G s} \sum_{g=1}^{G} \sum_{j=1}^{s} \left[ \ell^{(g)}_j - \beta_{\text{KL}} \, D_{\text{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right], \quad (9)$$

where

$$\ell^{(g)}_j = \min\!\left( \rho^{(g)}_{i,j} \hat{A}^{(g)}, \; \text{clip}\!\left(\rho^{(g)}_{i,j}, 1 - \epsilon, 1 + \epsilon\right) \hat{A}^{(g)} \right),$$

$\rho^{(g)}_{i,j} = \pi_\theta(z^{(g)}_{i,j} \mid s^{(g)}_{i,j}) / \pi_{\theta_{\text{old}}}(z^{(g)}_{i,j} \mid s^{(g)}_{i,j})$ is the importance sampling ratio, and $\pi_{\text{ref}}$ is a frozen reference policy. The clipping constrains the trust region, preventing destabilizing updates when the feature distribution shifts. The KL penalty anchors the policy to the reference, preventing drift into degenerate feature regions where reward signals become uninformative.

3.4.3. Formal analysis

We now establish the theoretical properties underpinning GRPO's suitability for cross-user HAR.

Proposition 1 (Properties of group-relative advantage). Let $R^{(1)}, \ldots, R^{(G)}$ be i.i.d. reward samples. The group-relative advantage $\hat{A}^{(g)}$ in Eq. (8) satisfies: (i) $\mathbb{E}[\hat{A}^{(g)}] = 0$ (zero-centered); (ii) $\text{Var}(\hat{A}^{(g)}) \to 1$ as $G \to \infty$; (iii) affine invariance: $\hat{A}^{(g)}(aR + b) = \hat{A}^{(g)}(R)$ for constants $a > 0$, $b$.

Property (i) ensures the gradient direction reflects relative quality ordering, not absolute reward level. Property (ii) bounds the advantage magnitude regardless of the underlying reward distribution, stabilizing convergence. Property (iii) is the most consequential: different source-user combinations produce different reward magnitudes and sensitivities; affine invariance ensures the effective learning rate is constant across leave-one-group-out configurations, eliminating the inter-task variance that plagues critic-based methods. Fig. 2 visualizes this mechanism.

Figure 2: Group-relative advantage in the latent space. For each input, $G$ sampled feature sequences are compared against their group mean. Sequences achieving above-average reward receive positive advantages and are reinforced; below-average sequences are suppressed.

3.5. Tri-objective reward mechanism

The reward comprises three components evaluated over a mini-batch of generated features $Z = \{z_i\}_{i \in \mathcal{B}}$ and encoder representations $H = \{h_i\}_{i \in \mathcal{B}}$, each targeting a necessary property of generalizable features.

3.5.1. Class discrimination reward

$$R_{\text{cls}}(Z) = \frac{1}{C(C-1)} \sum_{\substack{c, c' = 1 \\ c \neq c'}}^{C} \left\| \bar{\mu}_c - \bar{\mu}_{c'} \right\|_F^2, \quad (10)$$

where $\bar{\mu}_c \in \mathbb{R}^{s \times k}$ is the centroid of feature sequences for class $c$. Centroids are preferred over pairwise sample distances for robustness to stochastic generation outliers and for $O(|\mathcal{B}|)$ vs. $O(|\mathcal{B}|^2)$ complexity, which matters since the reward is computed $G$ times per input. The Frobenius norm operates on the full $s \times k$ matrix, preserving the sequential structure in the discrimination objective.

3.5.2. Cross-user invariance reward

$$R_{\text{inv}}(Z) = -\sum_{c=1}^{C} \left( V_c + D_c \right), \quad (11)$$

where $V_c$ measures intra-user scatter and $D_c$ measures inter-user centroid distance within class $c$. The two terms address complementary failure modes: tight per-user clusters that are far apart ($D_c$ high) indicate user-dependent encoding; loose clusters ($V_c$ high) indicate noisy features. Together they enforce a feature space where same-activity features form a single compact cluster irrespective of user identity.

3.5.3. Temporal fidelity reward

Without a content-preservation constraint, the policy could achieve high $R_{\text{cls}}$ and $R_{\text{inv}}$ by mapping all within-class inputs to a single point, discarding the temporal information needed to disambiguate challenging activity pairs.
The temporal fidelity reward prevents this collapse:

\[
R_{\mathrm{tmp}}(Z, H) = -\frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \big\| W_{\mathrm{proj}}\, \bar z_i - \bar h_i \big\|_2^2, \tag{12}
\]

where \(\bar z_i = \frac{1}{s} \sum_{j=1}^{s} z_{i,j}\), \(\bar h_i = \frac{1}{l} \sum_{t=1}^{l} h_{i,t}\), and \(W_{\mathrm{proj}} \in \mathbb{R}^{d_{\mathrm{model}} \times k}\) is a learnable projection mapping the \(k\)-dimensional feature space into the \(d_{\mathrm{model}}\)-dimensional encoder space. The projection is learnable because the feature and encoder spaces have different dimensionalities and semantics. Summary-level matching (via temporal averaging) is less restrictive than point-wise correspondence but sufficient to prevent information collapse.

3.5.4. Combined objective

The three rewards are combined as a weighted sum:

\[
R(Z, H) = w_{\mathrm{cls}} R_{\mathrm{cls}}(Z) + w_{\mathrm{inv}} R_{\mathrm{inv}}(Z) + w_{\mathrm{tmp}} R_{\mathrm{tmp}}(Z, H). \tag{13}
\]

The weights \(w_{\mathrm{cls}}, w_{\mathrm{inv}}, w_{\mathrm{tmp}}\) control the relative importance of each objective. The default configuration (\(w_{\mathrm{cls}} = 3.0\), \(w_{\mathrm{inv}} = 2.0\), \(w_{\mathrm{tmp}} = 1.0\)) prioritizes class discrimination, reflecting the primary task requirement that activities must be distinguishable. The invariance weight is lower because overly aggressive alignment can erase class-discriminative features that happen to correlate with user identity. The temporal fidelity weight is the smallest, consistent with its role as a regularizer that prevents collapse without dominating the optimization landscape. This weighting hierarchy follows prior research on balancing discrimination and alignment objectives in cross-user HAR [46, 47].

3.6. Training algorithm and downstream classifier

Algorithm 1 summarizes the training procedure. Stratified mini-batch sampling (Line 3) ensures representation from all source users and activity classes, which is essential for meaningful computation of both \(R_{\mathrm{cls}}\) (requiring all classes) and \(R_{\mathrm{inv}}\) (requiring all users per class). The G independent sequences per input (Lines 6–10) provide the empirical reward distribution for advantage computation; larger G improves accuracy (Proposition 1(ii)) at the cost of memory.
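The group-level computation can be illustrated in a few lines of numpy. This is a minimal sketch of Eqs. (13) and (8), assuming the three sequence-level component rewards are already available; the numeric values, weights variable names, and the helper functions `combined_reward` and `group_relative_advantage` are our illustrative choices, not code from the paper.

```python
import numpy as np

# Default reward weights from Eq. (13).
W_CLS, W_INV, W_TMP = 3.0, 2.0, 1.0

def combined_reward(r_cls, r_inv, r_tmp):
    """Weighted tri-objective reward (Eq. 13)."""
    return W_CLS * r_cls + W_INV * r_inv + W_TMP * r_tmp

def group_relative_advantage(rewards, eps=1e-8):
    """Group-relative advantage (Eq. 8): normalize G sequence-level
    rewards by their within-group mean and standard deviation."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Placeholder component rewards for G = 4 sampled sequences of one input.
r_cls = np.array([0.80, 0.65, 0.90, 0.72])
r_inv = np.array([-0.10, -0.30, -0.05, -0.20])
r_tmp = np.array([-0.02, -0.04, -0.01, -0.03])

R = combined_reward(r_cls, r_inv, r_tmp)
A = group_relative_advantage(R)

# Property (i): advantages are zero-centered within the group.
print(np.isclose(A.mean(), 0.0))                               # True
# Property (iii): a positive affine rescaling of the rewards
# leaves the advantages unchanged (up to the eps guard).
print(np.allclose(A, group_relative_advantage(2.5 * R + 1.0)))  # True
```

Because the normalization is computed per group, no value network is needed; this is exactly the self-calibration that the formal analysis attributes to Proposition 1.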
A Logistic Regression classifier on flattened features \(\tilde z_i = \mathrm{vec}(z_i) \in \mathbb{R}^{sk}\) is trained post-hoc (Line 17); its minimal capacity ensures that performance reflects RL-optimized feature quality rather than classifier expressiveness.

Algorithm 1 CTFG Training
Require: Source data \(\{(x_i, y_i, u_i)\}_{i=1}^{N_s}\); weights \(w_{\mathrm{cls}}, w_{\mathrm{inv}}, w_{\mathrm{tmp}}\); clip \(\epsilon\); group size G; KL coeff. \(\beta_{\mathrm{KL}}\); tokens s; epochs E
Ensure: Trained policy \(\pi_\theta\)
1:  Initialize policy \(\pi_\theta\) (encoder-decoder), reference \(\pi_{\mathrm{ref}} \leftarrow \pi_\theta\), projection \(W_{\mathrm{proj}}\)
2:  for epoch = 1 to E do
3:    Sample stratified mini-batch \(\mathcal{B}\) covering all users and classes
4:    for each \(x_i \in \mathcal{B}\) do
5:      \(h_i \leftarrow \mathrm{Encoder}(x_i W_{\mathrm{in}} + \mathrm{PE}(l))\)
6:      for g = 1 to G do
7:        for j = 1 to s do
8:          \([\mu_{i,j}^{(g)}, \log \sigma_{i,j}^{(g)}] \leftarrow \mathrm{Decoder}(h_i, z_{i,1:j-1}^{(g)})\)
9:          \(z_{i,j}^{(g)} \sim \mathcal{N}(\mu_{i,j}^{(g)}, \mathrm{diag}(\exp \log \sigma_{i,j}^{(g)}))\)
10:       end for
11:     end for
12:   end for
13:   Compute rewards \(R^{(g)}\) via Eq. (13)
14:   Compute advantages \(\hat A^{(g)}\) via Eq. (8)
15:   Update \(\theta, W_{\mathrm{proj}}\) via \(\mathcal{L}_{\mathrm{GRPO}}\) (Eq. 9)
16: end for
17: Train Logistic Regression on source features \(\{(\mathrm{vec}(z_i), y_i)\}\)

4. Experiment

4.1. Benchmarks and protocol

Experiments are conducted on DSADS [49] (Daily and Sports Activities, 8 subjects, 19 activities, 25 Hz, 9-axis sensors on torso/arms/legs) and PAMAP2 [50] (Physical Activity Monitoring, 6 subjects, 11 activities, 100 Hz, IMUs on chest/wrist/ankle). Table 1 summarizes the configurations.

Table 1: Benchmark dataset configurations.

PAMAP2 (6 subjects, 11 activities, 100 Hz)
  Groups:  A = [1,2], B = [5,6], C = [7,8]
  Sensors: chest/wrist/ankle; accel. + gyro.
  Window:  3 s (300 samples), 50% overlap

DSADS (8 subjects, 19 activities, 25 Hz)
  Groups:  A = [1,2], B = [3,4], C = [5,6], D = [7,8]
  Sensors: torso/arms/legs; accel. + gyro.
  Window:  3 s (75 samples), 50% overlap

Per-user z-score normalization removes amplitude biases from physiological factors.
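The per-user normalization step can be sketched as follows. The window shapes, user-ID layout, and the helper name `per_user_zscore` are our illustrative assumptions, not the paper's released preprocessing code; the essential point is that each user's channel statistics are computed and removed separately.

```python
import numpy as np

def per_user_zscore(X, user_ids, eps=1e-8):
    """Standardize each sensor channel per user: for every user,
    subtract that user's channel mean and divide by the channel std.
    X: (N, T, C) windows; user_ids: (N,) integer user labels."""
    X = X.astype(np.float64)
    for u in np.unique(user_ids):
        mask = user_ids == u
        mu = X[mask].mean(axis=(0, 1), keepdims=True)  # per-channel mean
        sd = X[mask].std(axis=(0, 1), keepdims=True)   # per-channel std
        X[mask] = (X[mask] - mu) / (sd + eps)
    return X

# Toy example: 4 windows of 75 samples (DSADS: 3 s at 25 Hz), 6 channels.
rng = np.random.default_rng(0)
X = rng.normal(5.0, 2.0, size=(4, 75, 6))
users = np.array([0, 0, 1, 1])
Xn = per_user_zscore(X, users)

# Each user's channels are now zero-mean and (approximately) unit-variance.
print(np.allclose(Xn[users == 0].mean(axis=(0, 1)), 0.0, atol=1e-8))  # True
```

Normalizing within each user, rather than globally, is what removes the user-specific amplitude offsets before the policy ever sees the data.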
Leave-one-group-out cross-validation provides rigorous evaluation of cross-user generalization: 4 cross-user transfers for DSADS, 3 for PAMAP2. Classification accuracy on held-out target users is the primary metric. Architecture: 1-layer Transformer, \(d_{\mathrm{model}} = 64\), 4 heads, lr = \(10^{-4}\) (Adam).

4.2. Baselines and comparative rationale

The baseline methods are selected to systematically validate distinct aspects of the CTFG framework. Together, they span the spectrum from minimal domain-aware training to sophisticated temporal adaptation to adversarial co-learning, enabling controlled attribution of performance gains to specific design choices.

ERM [38] (Empirical Risk Minimization) trains a standard model by minimizing the average empirical loss over all source data without any domain generalization mechanism. It serves as the lower-bound reference, quantifying the performance achievable without explicit handling of cross-user distribution shift. The gap between CTFG and ERM isolates the total benefit of the proposed sequential feature generation paradigm.

RSC [39] (Representation Self-Challenging) improves generalization by iteratively discarding the dominant features that the model relies upon during training, forcing the network to discover more diverse and robust feature patterns. RSC validates whether feature diversity alone, achieved through a self-challenging mechanism that operates on fixed, non-sequential representations, suffices for cross-user generalization, or whether the sequential structure of our autoregressive generation is essential.

ANDMask [40] learns representations by retaining only gradient components that are consistent across all training domains, masking out gradients that point in conflicting directions across users.
This invariance-by-gradient-agreement strategy validates whether implicit invariance constraints at the gradient level can match the explicit distributional alignment enforced by our cross-user invariance reward. ANDMask represents a fundamentally different approach to invariance, filtering gradients rather than shaping the feature distribution, and its comparison reveals whether distributional reward signals provide stronger invariance guarantees.

AdaRNN [41] addresses temporal distribution shift by adaptively learning segment-level recurrent weights, re-weighting hidden states across temporal periods to reduce distribution discrepancy. It serves as the temporal modeling baseline, validating whether our Transformer-based autoregressive architecture with global self-attention outperforms recurrent temporal adaptation. The comparison is particularly informative because AdaRNN explicitly targets temporal distribution shift, the same phenomenon our framework addresses, but through a fundamentally different mechanism (recurrent adaptation vs. autoregressive generation with RL-driven optimization).

ACON [42] (Adversarial Co-learning Network) combines adversarial domain alignment with co-learning strategies to bridge distributional discrepancies across domains for cross-domain HAR. It represents the adversarial alignment paradigm, the dominant approach in prior cross-user HAR work, and its comparison tests whether our reward-guided generation framework provides more stable and effective alignment than adversarial min-max optimization, which is known to suffer from training instability and mode collapse.

PPO-variant replaces GRPO with PPO + GAE [23] within our framework, keeping all other components identical (same encoder-decoder architecture, same tri-objective reward, same training protocol).
This serves as the controlled ablation for the optimization algorithm, directly isolating the contribution of critic-free advantage estimation versus critic-based estimation. The PPO-variant is the most important baseline because it attributes any performance difference specifically to the GRPO mechanism (the affine-invariant advantage computation and the elimination of the value function) rather than to the autoregressive architecture or the reward design.

4.3. Cross-user classification results

Tables 2 and 3 report the cross-user classification results. We organize the analysis around three progressively deeper questions: (1) Does RL-guided sequential feature generation improve over conventional methods? (2) Which specific design components drive the improvement? (3) How do the structural properties of GRPO manifest in cross-task consistency?

Table 2: Cross-user accuracy (%) on DSADS.

Method     ABC→D  ACD→B  ABD→C  BCD→A  AVG    STD
ERM        76.56  77.25  81.03  75.39  77.56  2.11
RSC        78.92  83.94  82.47  80.42  81.44  1.92
ANDMask    82.36  78.75  84.87  82.22  82.05  2.18
AdaRNN     79.53  77.84  83.23  76.92  79.38  2.41
ACON       85.43  88.76  89.17  87.25  87.65  1.47
PPO-var.   87.77  85.48  91.96  87.93  88.29  2.33
Ours       87.22  85.78  93.21  87.91  88.53  2.87

Table 3: Cross-user accuracy (%) on PAMAP2.

Method     AB→C   AC→B   BC→A   AVG    STD
ERM        61.42  66.38  64.75  64.18  2.06
RSC        64.27  71.46  65.13  66.95  3.21
ANDMask    58.75  67.13  68.72  64.87  4.37
AdaRNN     62.34  74.47  65.95  67.59  5.09
ACON       67.31  75.97  72.58  71.95  3.56
PPO-var.   69.01  79.29  74.14  74.15  4.20
Ours       72.10  80.15  73.42  75.22  3.52

4.3.1. Overall performance and the sequential generation advantage

CTFG achieves 88.53% on DSADS and 75.22% on PAMAP2, gains of 10.97% and 11.04% over ERM. These gains quantify the benefit of sequential feature construction over single-pass extraction.
ERM optimizes all feature dimensions simultaneously, conflating user-invariant temporal dynamics with user-specific amplitude artifacts. The autoregressive decoder's causal structure creates implicit prioritization: early tokens encode broadly useful features; later tokens refine with finer details conditioned on established coarse structure. This naturally separates stable temporal motifs from user-specific variations, explaining the consistent improvement.

4.3.2. Disentangling contributions: diversity, invariance, and temporal modeling

RSC improves over ERM by 3.88% (DSADS) and 2.77% (PAMAP2), confirming that feature diversity aids generalization but is insufficient alone. RSC diversifies which features are extracted but cannot control how they compose across the temporal dimension; CTFG's stochastic group sampling explores different temporal composition strategies, where each \(z_i^{(g)}\) represents a distinct organizational hypothesis.

ANDMask achieves 82.05% on DSADS but shows high variance on PAMAP2 (STD 4.37). This reveals that gradient-level invariance, retaining only cross-domain gradient agreement, works well when source domains densely cover the user distribution (DSADS: 8 subjects, 4 groups) but fails when coverage is sparse (PAMAP2: 6 subjects, 3 groups). The dramatic AB→C drop to 58.75% confirms that gradient overlap between groups A and C is insufficient when group B is excluded. CTFG's distributional invariance reward (Eq. 11) directly penalizes cross-user divergence, avoiding this fragility.

AdaRNN achieves only 67.59% on PAMAP2 with the highest variance (STD 5.09), providing the most instructive temporal modeling comparison.
Both AdaRNN and CTFG address temporal distribution shift, but AdaRNN's recurrent segment-level reweighting is limited by vanishing gradients over PAMAP2's 300-sample windows [12] and by passive feature modulation: it can suppress existing features but cannot compose new ones from temporal relationships. CTFG's Transformer captures long-range dependencies via self-attention, and the decoder actively constructs features by attending to arbitrary temporal positions. The 7.63% gap reflects the compounding benefit of global attention and active feature construction.

4.3.3. Adversarial alignment vs. reward-guided generation

ACON achieves competitive performance (87.65% DSADS, 71.95% PAMAP2) but relies on adversarial min-max optimization, where the discriminator provides per-sample gradients without assessing global distributional structure. This local signal is vulnerable to mode collapse: mapping all features to one point satisfies the adversarial objective but destroys discriminability. CTFG's tri-objective reward evaluates batch-level structure directly: mode collapse minimizes \(R_{\mathrm{cls}}\), user-specific encoding minimizes \(R_{\mathrm{inv}}\), and information loss minimizes \(R_{\mathrm{tmp}}\), forcing the policy toward genuinely generalizable configurations.

The per-transfer pattern is revealing: ACON achieves its best on ACD→B (88.76%) but drops to 85.43% on ABC→D, suggesting convergence to a distributional compromise. CTFG's wider range (85.78%–93.21%) but higher peak at ABD→C (93.21%, exceeding ACON's best by 4.04%) reflects group-relative exploration's ability to discover transfer-specific optima.

4.3.4. Isolating the GRPO contribution: controlled ablation with PPO

The PPO-variant shares the identical architecture and reward, differing only in advantage estimation. On DSADS, the 0.24% average gap (88.29% vs.
88.53%) masks a per-transfer cross-over: CTFG leads on ABD→C by 1.25% but trails on ABC→D by 0.55%, indicating that both optimizers find good solutions on DSADS's well-conditioned landscape.

PAMAP2 sharpens the distinction. The STD drops from 4.20 to 3.52 (16.2% reduction), directly reflecting affine invariance (Proposition 1(iii)): when the held-out group changes, the reward distribution shifts, and PPO's critic becomes miscalibrated, producing inconsistent updates. GRPO's normalization absorbs these scale changes automatically. The largest single-transfer gain is on AB→C (+3.09%), PAMAP2's most challenging transfer, where groups A and B may share more physiological similarity with each other than with group C (users 7, 8), maximizing the critic's distribution-dependent bias.

Figure 3: Convergence comparison on PAMAP2 (epochs 0–100). Shaded regions indicate ±1 standard deviation across leave-one-group-out configurations.

Figure 4: Convergence comparison on DSADS (epochs 0–100). GRPO converges faster with monotonically decreasing variance, while PPO exhibits mid-training instability.

4.4. Convergence and training stability

Figures 3 and 4 reveal three phenomena connecting directly to GRPO's design properties.

Convergence speed. On PAMAP2, GRPO reaches 51.8% at epoch 20, a level PPO does not attain even at epoch 100 (46.8%). This gap arises because GRPO directly exploits reward variance within each group from epoch 1, while PPO must first learn an accurate value function before producing useful advantages. This "bootstrap delay" is particularly costly on PAMAP2, where limited user diversity provides fewer reward patterns for value function training.

Inter-task variance. At epoch 20, GRPO's inter-task STD is 0.016 vs. PPO's 0.105 (84.8% reduction).
PPO’ s value function, trained on one configuration’ s re- ward distribution, produces scale-dependent adv antages whose magnitude fluctuates across configurations, caus- ing inconsistent e ff ective learning rates. GRPO’ s nor- malization (Eq. 8) renders advantages scale-in variant, maintaining constant e ff ectiv e learning rate regardless of configuration. Mid-training instability . On DSADS, PPO’ s STD increases from 0.168 to 0.212 during epochs 40–60, while GRPO narro ws monotonically . This reflects a negati ve feedback loop: policy improv ement shifts the rew ard distribution, miscalibrating the critic, producing biased updates that temporarily degrade performance. GRPO is immune because advantages are computed fresh from current-group re wards at each iteration, auto- matically adapting to the ev olving re ward distrib ution. 4.5. Action-space dimensionality analysis The token count s controls the generation horizon— the number of sequential decisions the decoder must 15 Figure 5: T oken count sensitivity on DSADS. GRPO maintains stable accuracy with decreasing variance, while PPO collapses at high token counts. Figure 6: T oken count sensitivity on P AMAP2. Both methods degrade at s = 20 on this smaller dataset due to reward landscape flattening. make. Increasing s simultaneously increases the rep- resentational capacity and the credit assignment di ffi - culty , directly probing the scalability of each optimiza- tion paradigm (Figures 5 – 6). GRPO exhibits remarkable stability on DSADS. Accuracy remains within a narrow 0.6% band (87.15%– 87.74%) across all token counts, while inter-task STD decreases monotonically from 7.05% at s = 5 to 4.21% at s = 20. This stability reveals that the autoregressi ve decoder e ff ecti vely utilizes additional tokens—each to- ken refines the representation via cross-attention to dif- ferent temporal re gions—without introducing optimiza- tion instability . 
The monotonically decreasing variance is particularly significant: as s grows, the G group members sample from a higher-dimensional space, producing greater reward diversity within each group, which improves the statistical quality of GRPO's within-group normalization (Proposition 1(ii)).

PPO degrades progressively with increasing horizon. On DSADS, PPO drops from 80.37% (s = 5) to 64.46% (s = 20), a 15.9% absolute loss that produces a 23.3% accuracy gap vs. GRPO at s = 20. The mechanism traces directly to the MDP's sparse reward structure: all intermediate rewards are zero (\(\hat r_{j'} = 0\) for \(j' < s\)), so GAE advantages at step j depend entirely on the value function predicting the final reward from the partial state \(s_{i,j} = (h_i, z_{i,1:j-1})\). As s grows, early-step predictions must extrapolate from increasingly incomplete information, and estimation errors at each step compound through the s-step GAE summation. At s = 20, this compounding overwhelms the true advantage signal, producing effectively random gradients. The variance explosion at s = 20 (STD surging from 2.76% at s = 15 to 16.97%) provides direct empirical evidence of this compounding: different leave-one-group-out configurations experience different degrees of value-function miscalibration, producing wildly inconsistent optimization trajectories.

Notably, at intermediate token counts (s = 10, 15), PPO achieves lower inter-task STD than GRPO on DSADS (1.63% vs. 5.05% at s = 10). This seemingly counterintuitive result has a structural explanation: at moderate horizons, PPO's value function can still produce reasonably accurate estimates, and the critic's smoothing effect reduces the stochastic noise inherent in GRPO's finite-sample group normalization.
However, this stability is fragile: it collapses catastrophically once the horizon exceeds the critic's estimation capacity, whereas GRPO's variance reduction is robust and monotonic.

PAMAP2 reveals the interaction between horizon and dataset scale. On PAMAP2, GRPO maintains consistent accuracy across token counts (74.14% at s = 5, 72.58% at s = 20), with a slight dip at s = 10 (70.15%) that reflects the stochasticity inherent in PAMAP2's smaller sample size (6 subjects, 11 activities). PPO achieves competitive accuracy at low token counts, even slightly outperforming GRPO at s = 10 (72.9% vs. 70.15%), demonstrating that the autoregressive architecture and tri-objective reward provide substantial benefit regardless of the optimizer when the horizon is short. However, at s = 20, PPO drops to 67.14% with STD exploding to 10.81% (vs. GRPO's 5.82%), reproducing the same critic-collapse pattern observed on DSADS. The fact that PPO's failure mode manifests identically across both datasets despite their different scales confirms that the root cause is structural (compounding GAE estimation error).

GRPO bypasses this failure mode because its advantage (Eq. 8) is computed from the final reward of each complete sequence, assigning a uniform but unbiased signal across all steps. While this is a coarser credit signal than GAE's per-step decomposition, its unbiasedness is the critical property: in the feature generation context, representation quality is a property of the complete token sequence, making sequence-level credit assignment both appropriate and robust.

4.6. Per-class performance analysis

The per-class analysis tests whether CTFG's performance is predictable from each activity's temporal and kinematic structure, connecting specific patterns to the framework's architecture and reward design.

Table 4: Per-class performance on PAMAP2.

Activity        Corr.  Tot.   Rec.%  Prec.%  F1
Lying           902    979    92.1   89.9    0.910
Sitting         463    868    53.3   68.1    0.598
Standing        578    961    60.2   62.4    0.612
Walking         1067   1144   93.3   73.8    0.824
Running         539    632    85.3   91.2    0.881
Cycling         879    937    93.8   77.9    0.851
Nordic walk.    754    1051   71.7   90.6    0.801
Asc. stairs     480    599    80.1   62.9    0.705
Desc. stairs    278    503    55.3   59.2    0.571
Vacuuming       675    897    75.3   62.5    0.683
Ironing         770    1234   62.4   87.0    0.727

Table 5: Per-class performance on DSADS.

Activity           Corr.  Tot.   Rec.%  Prec.%  F1
Sitting            996    1206   82.6   99.1    0.901
Standing           1079   1206   89.5   92.1    0.907
Lying on back      1205   1206   99.9   72.5    0.840
Lying on right     1206   1206   100.0  100.0   1.000
Asc. stairs        1150   1206   95.4   99.3    0.973
Desc. stairs       1019   1206   84.5   95.1    0.895
Stand. elev.       641    1206   53.2   61.2    0.569
Walk. elev.        917    1206   76.0   53.4    0.627
Walk. parking      951    1206   78.9   95.7    0.865
Treadmill (flat)   995    1206   82.5   85.0    0.838
Treadmill (incl.)  979    1206   81.2   80.8    0.810
Running            1203   1206   99.8   94.7    0.971
Stepper            997    1206   82.7   95.3    0.885
Cross trainer      953    1206   79.0   89.0    0.837
Cycling (horiz.)   1206   1206   100.0  99.5    0.998
Cycling (vert.)    1155   1206   95.8   99.8    0.978
Rowing             1204   1206   99.8   100.0   0.999
Jumping            1151   1206   95.4   97.6    0.965
Basketball         1168   1206   96.9   86.1    0.912

4.6.1. Periodic activities: temporal structure exploitability

The highest-performing activities share strong repetitive temporal structure: Cycling (horizontal) F1 = 0.998, Rowing 0.999, Running 0.971 on DSADS; Running 0.881, Cycling 0.851 on PAMAP2. These produce periodic signals with biomechanically constrained phase relationships that are largely user-invariant: a tall and a short person share the same pedaling phase structure (push-recover-push), differing primarily in amplitude and frequency, both normalized by per-user z-score preprocessing.
The autoregressive decoder exploits this systematically: the first token captures the dominant periodic pattern via cross-attention; subsequent tokens refine with progressively higher harmonics (phase asymmetries, transition dynamics), building a multi-resolution representation inherently invariant to user identity.

Lying on right (F1 = 1.000) achieves perfect performance through a different mechanism: its unique gravitational projection across torso/arm/leg sensors shares no overlap with any other activity. The lower F1 for Lying on back (0.840, precision 72.5%) reveals that its gravitational signature overlaps with some static activities on certain sensor axes, attracting misclassified samples into its feature region.

4.6.2. Discrimination-invariance tension in kinematically similar pairs

Walking vs. Nordic Walking on PAMAP2 (F1 = 0.824 vs. 0.801) exposes the core reward tension. Both share lower-body gait kinematics; only upper-body patterns (arm swing amplitude from pole usage) discriminate them. These discriminative features are precisely the most user-varying (arm swing style is highly personal). The precision-recall asymmetry reveals the resulting decision boundary geometry. Walking achieves high recall (93.3%) but lower precision (73.8%); Nordic Walking shows the inverse (71.7%, 90.6%). The \(R_{\mathrm{cls}}\) centroid distance maximization pulls Walking toward the generic locomotion region while pushing Nordic Walking to a peripheral region defined by its upper-body signature. Ambiguous instances, such as Nordic Walking with a weak upper-body signal, default to Walking, the larger class.

The DSADS treadmill pair (flat F1 = 0.838, inclined 0.810) exhibits a parallel pattern. Inclination modulates anterior-posterior acceleration and hip flexion in a narrow spectral band that is both subtle and user-dependent.
Bidirectional confusion (both recalls ≈ 82%) confirms the decision boundary bisects a region where user-invariant and class-discriminative features are entangled.

4.6.3. Static activities and the temporal fidelity floor

Sitting (F1 = 0.598) and Standing (0.612) on PAMAP2, and Standing in elevator (0.569) on DSADS, represent a principled limitation. These activities produce near-constant signals where the temporal structure the decoder is designed to exploit (periodicity, phase transitions) is absent. Cross-attention queries attend to noise-floor variations that differ across users. Moreover, \(R_{\mathrm{tmp}}\) forces preservation of encoder content that, for static activities, is dominated by the gravitational constant and user-specific noise rather than activity-discriminative dynamics.

The cross-dataset contrast is instructive: DSADS Sitting achieves F1 = 0.901 vs. PAMAP2's 0.598. DSADS's torso/arm/leg placement provides a richer multi-axis gravitational signature distinguishing postures even without temporal dynamics (gravity projection across three body segments is posture-specific). PAMAP2's chest/wrist/ankle placement captures the sitting-standing distinction primarily through ankle orientation, a single degree of freedom easily confounded by cross-user postural variation. This confirms that CTFG's static-activity performance is limited by sensor information content.

4.6.4. Environmental confounding: elevator activities

Standing in elevator (F1 = 0.569) and Walking in elevator (0.627) are DSADS's lowest-performing activities. The elevator introduces confounding vertical acceleration, independent of the user's activity, that the encoder captures as temporal events and the decoder may interpret as activity dynamics. \(R_{\mathrm{tmp}}\) compounds this by requiring preservation of encoder content that includes the confounding signals.
Walking in elevator's low precision (53.4%) indicates that elevator vibrations produce features resembling low-intensity locomotion, attracting misclassified samples. Standing in elevator's low recall (53.2%) reflects indistinguishability from regular standing when the elevator is stationary. By contrast, Walking in parking (F1 = 0.865) performs well because the parking environment introduces no confounding temporal dynamics, confirming that the limitation stems from temporal confounding, not context-specificity.

4.6.5. Emergent performance tiers

A three-tier hierarchy maps onto the framework's architectural properties. Tier 1 (F1 ≥ 0.90): activities with strong periodicity or unique sensor signatures (Cycling, Rowing, Running, Lying on right, Ascending stairs), where all three reward components align constructively. Tier 2 (F1 0.70–0.89): activities with moderate temporal structure but kinematic overlap (Walking, Nordic Walking, Stepper, Cross trainer), where the discrimination-invariance tension limits performance. Tier 3 (F1 < 0.70): activities with minimal temporal variation or environmental confounding (Sitting, Standing, elevator activities), representing principled limitations of temporal feature generation.

5. Conclusion and future work

This paper has proposed CTFG, a framework that redefines cross-user activity recognition as a reward-guided temporal feature generation process governed by Group-Relative Policy Optimization. The core technical insight is that critic-based RL algorithms introduce a structural bottleneck in the cross-user setting: the value function produces distribution-dependent estimation biases manifesting as inconsistent convergence and high inter-task variance. GRPO eliminates this bottleneck through intra-group reward normalization, yielding self-calibrating, affine-invariant advantage signals.
Combined with a tri-objective reward mechanism enforcing class discrimination, cross-user invariance, and temporal fidelity, the framework achieves state-of-the-art accuracy on DSADS (88.53%) and PAMAP2 (75.22%), with substantially faster convergence and robust scalability to high-dimensional action spaces where critic-based training collapses.

Future directions include adaptive group sizing strategies for datasets with limited user diversity, class-conditional invariance modulation for kinematically similar activity pairs, computational reduction through importance sampling or amortized advantage estimation, and extension to larger, demographically diverse populations and multi-modal sensor configurations.

References

[1] S.G. Dhekane, T. Ploetz, Transfer learning in sensor-based human activity recognition: A survey, ACM Comput. Surv. 57 (8) (2025) 1–39.
[2] M.M. Islam, S. Nooruddin, F. Karray, G. Muhammad, Multi-level feature fusion for multimodal human activity recognition in Internet of Healthcare Things, Inform. Fusion 94 (2023) 17–31.
[3] B. Barshan, E. Koşar, Bidirectional transfer learning between activity and user identity recognition tasks via 2D CNN-LSTM model for wearables, IEEE Internet Things J. 12 (22) (2025) 46748–46763.
[4] X. Qin, J. Wang, Y. Chen, W. Lu, X. Jiang, Domain generalization for activity recognition via adaptive feature fusion, ACM Trans. Intell. Syst. Technol. 14 (1) (2022) 1–21.
[5] K. Zhou, Z. Liu, Y. Qiao, T. Xiang, C.C. Loy, Domain generalization: A survey, IEEE Trans. Pattern Anal. Mach. Intell. 45 (4) (2022) 4396–4415.
[6] D. Xiong, S. Wang, L. Zhang, W. Huang, C. Han, Generalizable sensor-based activity recognition via categorical concept invariant learning, in: Proc. AAAI Conf. Artif. Intell., Vol. 39, 2025, pp. 923–931.
[7] Z. Hong, Z. Li, S.
Zhong, et al., CrossHAR: Generalizing cross-dataset human activity recognition via hierarchical self-supervised pretraining, Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 8 (2) (2024) Art. 64.
[8] H. Yoon, T. Gong, et al., ContrastSense: Domain-invariant contrastive learning for in-the-wild wearable sensing, Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 8 (4) (2024) Art. 162.
[9] W. Lu, J. Wang, X. Sun, Y. Chen, X. Ji, Q. Yang, X. Xie, DIVERSIFY: A general framework for time series out-of-distribution detection and generalization, IEEE Trans. Pattern Anal. Mach. Intell. (2024).
[10] S. Suh, V. Fortes Rey, P. Lukowicz, TASKED: Transformer-based adversarial learning for human activity recognition using wearable sensors via self-knowledge distillation, Knowl.-Based Syst. 260 (2023) 110143.
[11] D. Xiong, S. Wang, L. Zhang, et al., Deconfounding causal inference through two-branch framework with early-forking for sensor-based cross-domain activity recognition, Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 9 (2) (2025).
[12] Q. Wen, T. Zhou, C. Zhang, et al., Transformers in time series: A survey, in: Proc. IJCAI, Survey Track, 2022, pp. 6778–6786.
[13] E. Essa, I. Abdelmaksoud, Temporal-channel convolution with self-attention network for human activity recognition using wearable sensors, Knowl.-Based Syst. 278 (2023) 110867.
[14] X. Zhang, D. Teng, R.R. Chowdhury, et al., UniMTS: Unified pre-training for motion time series, in: Adv. Neural Inf. Process. Syst. (NeurIPS), 2024.
[15] X. Qin, J. Wang, S. Ma, et al., Generalizable low-resource activity recognition with diverse and discriminative representation learning, in: Proc. ACM SIGKDD, 2023, pp. 1943–1953.
[16] R. Hu, L. Chen, S. Miao, X. Tang, SWL-Adapt: An unsupervised domain adaptation model with sample weight learning for cross-user wearable human activity recognition, in: Proc. AAAI, Vol. 37 (5), 2023, pp. 6012–6020.
[17] DWLR: Domain adaptation under label shift for wearable sensor, in: Proc. IJCAI, 2024, pp. 4425–4433.
[18] X. Kang, et al., SF-Adapter: Computational-efficient source-free domain adaptation for human activity recognition, Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 7 (4) (2024) 1–26.
[19] Z. Xiao, H. Tong, R. Qu, et al., CapMatch: Semi-supervised contrastive transformer capsule with feature-based knowledge distillation for human activity recognition, IEEE Trans. Neural Netw. Learn. Syst. 36 (2) (2023) 2690–2704.
[20] S. Wang, J. Wang, H. Xi, B. Zhang, L. Zhang, H. Wei, Optimization-free test-time adaptation for cross-person activity recognition, Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 7 (4) (2024).
[21] J. Zhang, L. Feng, Z. Liu, et al., Diverse intra- and inter-domain activity style fusion for cross-person generalization in activity recognition, in: Proc. ACM SIGKDD, 2024, pp. 4213–4222.
[22] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018.
[23] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, arXiv:1707.06347, 2017.
[24] D. Guo, D. Yang, H. Zhang, J. Song, et al., DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning, Nature 645 (8081) (2025) 633–638.
[25] A. Ahmadian, C. Cremer, M. Gallé, et al., Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs, in: Proc. 62nd Annu. Meeting ACL, 2024, pp. 12248–12267.
[26] Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, et al., DAPO: An open-source LLM reinforcement learning system at scale, in: Adv. Neural Inf. Process. Syst. (NeurIPS), 2025.
[27] Z. Li, T. Xu, Y. Zhang, et al., ReMax: A simple, effective, and efficient reinforcement learning method for aligning large language models, in: Proc. 41st Int. Conf. Mach. Learn. (ICML), 2024.
[28] R. Rafailov, A. Sharma, E. Mitchell, C.D. Manning, S.
Ermon, C. Finn, Direct preference opti- mization: Y our language model is secretly a re- ward model, in: Adv . Neural Inf. Process. Syst. (NeurIPS), V ol. 36, 2023, pp. 53728–53741. [29] Z. Guan, Y . Li, R. Xue, et al., RFDG: Rein- forcement federated domain generalization, IEEE T rans. Knowl. Data Eng. 36 (3) (2023) 1000– 1013. [30] N.M. Foumani, C.W . T an, G.I. W ebb, M. Salehi, Improving position encoding of transformers for 21 multiv ariate time series classification, Data Min. Knowl. Disco v . 38 (2023) 22–48. [31] S. Ek, F . Portet, P . Lalanda, T ransformer-based models to deal with heterogeneous environments in human activity recognition, Pers. Ubiquitous Comput. 27 (2023) 2267–2280. [32] Z. Leng, A. Bhattacharjee, et al., IMUGPT 2.0: Language-based cross modality transfer for sensor-based human acti vity recognition, Proc. A CM Interact. Mob. W earable Ubiquitous T ech- nol. 8 (2024). [33] A. V aswani, N. Shazeer , N. P armar , J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser , I. Polosukhin, Attention is all you need, in: Adv . Neural Inf. Pro- cess. Syst., V ol. 30, 2017. [34] J. Schulman, P . Moritz, S. Le vine, M. Jordan, P . Abbeel, High-dimensional continuous control using generalized advantage estimation, in: Proc. ICLR, 2015. [35] Y . W en, K. Zhang, Z. Li, Y . Qiao, A discrimina- tiv e feature learning approach for deep f ace recog- nition, in: Proc. European Conf. Comput. V ision, 2016, pp. 499–515. [36] A. Gretton, K.M. Borgwardt, M.J. Rasch, B. Schölkopf, A. Smola, A kernel two-sample test, J. Mach. Learn. Res. 13 (2012) 723–773. [37] R.J. W illiams, Simple statistical gradient- following algorithms for connectionist reinforce- ment learning, Mach. Learn. 8 (3–4) (1992) 229–256. [38] H. Zhang, M. Cisse, Y .N. Dauphin, D. Lopez-P az, mixup: Beyond empirical risk minimization, in: Proc. ICLR, 2017. [39] Z. Huang, H. W ang, E.P . Xing, D. Huang, Self- challenging improves cross-domain generaliza- tion, in: Proc. European Conf. Comput. 
V ision, 2020, pp. 124–140. [40] G. P arascandolo, A. Neitz, A. Orvieto, L. Gresele, B. Schölkopf, Learning explanations that are hard to vary , in: Proc. ICLR, 2020. [41] Y . Du, J. W ang, W . Feng, S. Pan, T . Qin, R. Xu, C. W ang, AdaRNN: Adaptive learning and fore- casting of time series, in: Proc. ACM CIKM, 2021, pp. 402–411. [42] M. Liu, X. Chen, Y . Shu, X. Li, W . Guan, L. Nie, Boosting transferability and discriminabil- ity for time series domain adaptation, in: Proc. Int. Conf. Neural Information Processing Systems (NeurIPS), 2024. [43] S. Liu, Y . Qi, D. Li, L. Liu, S. W ang, C. Fer- nandez, X. Gao, Adversarial multi-source domain generalization approach for power prediction in unknown photov oltaic systems, Appl. Soft Com- put. 181 (2025) 113495. [44] F . W ang, Y . Li, H. Ma, Q. Li, C. W ang, Y . Liu, Lightweight and robust multi-source domain gen- eralization for classifying internet of things mal- ware, Appl. Soft Comput. 188 (2025) 114421. [45] Y . Shao, S. W ang, W . Zhao, CaRGI: Causal se- mantic representation learning via generati ve in- tervention for single domain generalization, Appl. Soft Comput. 173 (2025) 112910. 22 [46] X. Y e, K.I.-K. W ang, Deep generati ve domain adaptation with temporal relation attention mech- anism for cross-user activity recognition, Pattern Recognit. (2024) 110811. [47] X. Y e, K.I. W ang, Adversarial domain adaptation for cross-user activity recognition via noise di ff u- sion model, Knowl.-Based Syst. (2025) 113952. [48] X. Y e, K.I.-K. W ang, Cross-user activity recog- nition using deep domain adaptation with tempo- ral dependency information, IEEE T rans. Instrum. Meas. 74 (2025) 1–15. [49] B. Barshan, M.C. Yüksek, Recognizing daily and sports activities in two open source machine learn- ing environments using body-worn sensor units, Comput. J. 57 (11) (2014) 1649–1667. [50] A. Reiss, D. Stricker , Introducing a new bench- marked dataset for activity monitoring, in: Proc. Int. Symp. 
W earable Comput., 2012, pp. 108–109. 23