GATS: Gaussian Aware Temporal Scaling Transformer for Invariant 4D Spatio-Temporal Point Cloud Representation


Authors: Jiayi Tian, Jiaze Wang

Jiayi Tian
State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
tianreg@stu.xjtu.edu.cn

Jiaze Wang
Harbin Institute of Technology, Shenzhen; Pengcheng Laboratory
24b951071@stu.hit.edu.cn

Abstract

Understanding 4D point cloud videos is essential for enabling intelligent agents to perceive dynamic environments. However, temporal scale bias across varying frame rates and distributional uncertainty in irregular point clouds make it highly challenging to design a unified and robust 4D backbone. Existing CNN- or Transformer-based methods are constrained either by limited receptive fields or by quadratic computational complexity, while neglecting these implicit distortions. To address this problem, we propose a novel dual-invariant framework, termed Gaussian Aware Temporal Scaling (GATS), which explicitly resolves both distributional inconsistencies and temporal scale bias. The proposed Uncertainty Guided Gaussian Convolution (UGGC) incorporates local Gaussian statistics and uncertainty-aware gating into point convolution, thereby achieving robust neighborhood aggregation under density variation, noise, and occlusion. In parallel, the Temporal Scaling Attention (TSA) introduces a learnable scaling factor to normalize temporal distances, ensuring frame-partition invariance and consistent velocity estimation across different frame rates. These two modules are complementary: temporal scaling normalizes time intervals prior to Gaussian estimation, while Gaussian modeling enhances robustness to irregular distributions.
Our experiments on the mainstream benchmarks MSR-Action3D (+6.62% accuracy), NTU RGBD (+1.4% accuracy), and Synthia 4D (+1.8% mIoU) demonstrate significant performance gains, offering a more efficient and principled paradigm for invariant 4D point cloud video understanding with superior accuracy, robustness, and scalability compared to Transformer-based counterparts.

1. Introduction

4D point cloud videos, which extend 3D space with 1D time, provide a natural representation of dynamic physical environments by capturing both geometric structures and temporal motions [48]. They are increasingly important for enabling intelligent agents to perceive dynamics, understand environmental changes, and interact effectively with the world [57]. Consequently, point cloud video modeling has attracted growing attention [11, 34, 51], with applications spanning robotics [35], AR/VR [42], and SLAM systems [4, 31].

Despite the remarkable progress in static 3D point cloud understanding [11, 35, 51], building effective backbones for dynamic 4D point cloud sequences remains a significant challenge [24]. Unlike conventional RGB videos with structured grids [21], point cloud videos are inherently irregular and unordered in space, making grid-based methods such as 3D convolutions [45] unsuitable. A straightforward approach is voxelization [50], which converts raw 4D sequences into structured grids for 4D convolutions. However, voxelization inevitably introduces quantization errors [26] and suffers from inefficiency due to the sparsity and scale of 4D data [57]. Another line of research [9, 10, 58] directly processes raw 4D sequences using convolutional networks [27] or transformers [5, 38]. Yet, CNNs are limited by their local receptive fields [33], while transformers incur quadratic complexity [45].
Beyond efficiency and scalability, we identify two fundamental distortions in point cloud video modeling, complementary facets of a unified spatio-temporal misrepresentation (Fig. 1): (1) Distributional uncertainty: current geometric convolutions only consider Euclidean distances, ignoring local distributional shape and uncertainty. Dynamic point clouds, however, naturally exhibit density variation, noise, occlusion, and missing points, which degrade robustness. (2) Temporal scale bias: under different frame intervals, the same physical motion may be discretized into different relative velocity estimates, leading to inconsistencies and distortions in spatio-temporal representations. Existing methods typically rely on fixed frame partitioning or sampling rates.

Figure 1. Motivation. Right: video sequences under different frame-rate partitions. With a large temporal interval, some moving objects (second row) and static objects (third column) disappear, while they remain visible with a small temporal interval; varying frame-rate partitions may therefore cause certain velocity features to vanish. Left: GATS adaptively adjusts the scaling distribution across different frame-rate partitions, thereby effectively mitigating the relative velocity bias introduced by frame-rate variations and reducing fluctuations in accuracy. Consequently, GATS achieves ACC improvements of 6.62% and 3.83% over P4D and PST, respectively.

Motivated by these observations, we propose Gaussian Aware Temporal Scaling (GATS), a dual-invariant Transformer framework for 4D point cloud video modeling. The central insight is that a collaborative calibration mechanism is required to jointly normalize geometric distributions and temporal motions, thereby achieving truly invariant and robust spatio-temporal representations.
Specifically, we introduce two complementary modules: (1) Uncertainty Guided Gaussian Convolution (UGGC), which augments geometric kernels with Gaussian statistics (mean, covariance, and uncertainty indicators), enabling robust neighborhood aggregation under density variation and noise; and (2) Temporal Scaling Attention (TSA), which uses a learnable normalization factor to enable velocity-invariant temporal modeling, ensure consistent frame partitioning, and reduce distortions across frame rates.

These two modules are naturally synergistic: temporal scaling normalizes time intervals before Gaussian estimation, preventing variance inflation across different frame rates, while Gaussian modeling provides distributional robustness for spatio-temporal neighborhoods. The pipeline is shown in Fig. 2. To the best of our knowledge, this is the first work to explicitly introduce relative velocity estimation in spatio-temporal point cloud modeling.

Overall, our main contributions are as follows:
• We propose a novel 4D backbone, GATS, that explicitly addresses two implicit distortions in point cloud video modeling: temporal scale bias and distributional uncertainty.
• We introduce the UGGC module, which incorporates local Gaussian statistics and uncertainty-aware gating into P4DConv, enhancing robustness to noise, occlusion, and density variation.
• We design the TSA module, which achieves frame-partition invariance by rescaling temporal metrics, improving consistency across varying frame rates and sampling strategies.
• Extensive experiments on multiple 4D understanding benchmarks demonstrate that GATS achieves superior performance and robustness compared to baseline methods, while maintaining high efficiency and scalability.

2. Related Work
2.1. Backbones for 4D Point Cloud Videos: CNNs, Transformers, and SSMs

Backbones for dynamic 4D point cloud sequences have evolved along three major paradigms: CNNs, Transformers, and State Space Models (SSMs) [4, 5, 18]. Early CNN-based methods either convert raw 4D sequences into structured grids (e.g., voxels or BEV) to apply 3D/4D convolutions, or directly operate on raw points with spatio-temporal kernels [6]. While grid conversion enables standard convolutions, it introduces quantization artifacts and loses geometric fidelity; point-based CNNs are constrained by locally restrained receptive fields and limited long-range temporal modeling. Transformer-based backbones [5] enlarge receptive fields using global self-attention and often pair local point convolutions with temporal attention across timestamps; efficiency-oriented variants employ compact mid-level tokens or self-supervised training to mitigate annotation cost. However, quadratic complexity and sensitivity to frame partitioning remain bottlenecks, leading to high overhead and temporal scale bias under varying sampling strategies. Recently, SSM-based designs (e.g., Mamba-style modules [28, 53]) achieve linear complexity for long-sequence modeling via parallel scans and data-dependent state transitions, offering scalable temporal coverage with global receptive fields. Distinct from prior art, our framework emphasizes dual invariance: it mitigates temporal scale bias through temporal scaling attention and enhances distributional robustness via Gaussian-aware convolution, complementing architectural efficiency with invariance-aware modeling.
2.2. Geometric Modeling on Dynamic 4D Point Clouds

Research on geometric modeling for dynamic point clouds follows two threads: (i) dynamic-as-signal methods that take temporal variation itself as the modeling target [23, 32], and (ii) geometry-only methods that rely on static descriptors without explicit temporal reasoning [13]. Dynamic-as-signal approaches explicitly track points or estimate scene flow to align correspondences across frames, enabling motion-aware grouping and temporal aggregation (e.g., recurrent point modules and PointNet++ extensions with temporal grouping or flow-based alignment [33, 34]). While explicit motion preserves identities, performance is sensitive to tracking errors, motion inconsistency, and point entry/exit in sparse or noisy scenes. Subsequent methods (e.g., PST-style spatio-temporal convolutions [4, 5, 18]) construct 4D neighborhoods to learn motion patterns implicitly without hard associations, improving robustness but remaining limited by local receptive fields. Geometry-only approaches emphasize robust per-frame spatial descriptors and aggregate them across time, avoiding explicit tracking or flow. Typical designs use Euclidean kernels, anisotropic neighborhood selection, and hierarchical local feature extraction [12]; hybrid voxel–point pipelines alleviate sparsity and efficiency issues but still lack explicit statistical modeling. To address these limitations, distribution-aware schemes introduce statistical cues (e.g., Gaussian weighting, covariance-guided anisotropy) into neighborhood aggregation. Our method advances this direction with multi-scale Gaussian weighting for heterogeneous densities and uncertainty-aware gating via covariance conditioning, yielding distributionally robust features without explicit tracking.
2.3. Spatio-Temporal Decoupled Representations for 4D Point Cloud Videos

Decoupling space and time is a pragmatic strategy for 4D point clouds: spatial sampling is irregular and unordered, whereas temporal sampling is typically regular [8]. Prior frameworks commonly adopt a two-branch design that combines intra-frame spatial encoding for local geometric patterns with inter-frame temporal modeling for long-range dependencies. Intra-frame modules include point-based CNNs and local operators (e.g., PointNet++ variants and PST-style kernels [5, 33, 37]) that build robust neighborhoods, perform hierarchical aggregation, and leverage anisotropic receptive fields; hybrid voxel–point tokens provide compact intermediate representations but introduce quantization artifacts. Inter-frame modules range from RNNs and temporal pooling to global attention and SSMs: Transformers (e.g., temporal attention over all timestamps [18, 22]) capture global semantics at quadratic cost, while Mamba-style SSMs provide linear-time long-sequence coverage via data-dependent state transitions and parallel scans. Despite the effectiveness of decoupling, most methods assume fixed frame rates or uniform partitions, which induces (1) temporal scale bias: identical physical motion maps to different discrete temporal distances and velocities under varying sampling. Moreover, spatial neighborhoods are commonly treated with purely geometric kernels, overlooking (2) distributional uncertainty (density variation, anisotropy, noise, and missing points).

We retain the decoupling principle and introduce two invariance-oriented components: (1) the UGGC module, which encodes local shape statistics (mean and covariance) together with uncertainty within spatial neighborhoods; and (2) the TSA module, which normalizes temporal metrics and ensures frame-partition invariance.
This design mitigates temporal scale bias, promotes consistent representations across varying frame rates, and enhances the stability of spatio-temporal modeling. Together, these modules yield decoupled features that remain consistent across frame rates and robust to heterogeneous local distributions, while integrating seamlessly with CNN, Transformer, and SSM backbones.

3. Methodology

3.1. Problem Definition

We consider a point cloud video sequence $\mathcal{P}$ as follows:
\[
\mathcal{P} = \{P_t\}_{t=1}^{T}, \qquad P_t = \{(x_i^t, Tf_i^t)\}_{i=1}^{N_t}, \tag{1}
\]
where $P_t$ is the point cloud of each video frame, $x_i^t \in \mathbb{R}^3$ denotes the 3D coordinate of the $i$-th point at frame $t$, $Tf_i^t \in \mathbb{R}^d$ indicates the associated temporal feature, and $N_t$ is the number of points in frame $t$. The goal is to learn a spatio-temporal modeling function $\mathcal{F}$ that maps $\{P_t\}_{t=1}^{T}$ to a target output $y$:
\[
\mathcal{F} : \{P_t\}_{t=1}^{T} \mapsto y, \tag{2}
\]
where $y$ can be an action label, a per-point segmentation mask, or another video modeling target.

Figure 2. Pipeline. The overall network backbone consists of two core modules. (a) UGGC module: after the point cloud is fed into the network, the spatial variations of $x_i^t$ generate cross-frame representations; however, different cross frames often lead to inter-frame inconsistencies. The UGGC module extracts local Gaussian features and incorporates an uncertainty-aware gating mechanism to jointly model geometric and Gaussian local features of 4D point clouds, thereby enhancing the robustness of feature extraction. (b) TSA module: under different frame rates, the estimation of the relative velocity $v_i^t$ varies, and as the temporal dimension progresses, motion features tend to produce inter-frame inconsistencies. To address this, the TSA module introduces a learnable scaling factor $s$ to normalize temporal distances, achieving frame-partition invariance and ensuring consistent relative velocity estimation across varying frame rates.
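For concreteness, the sequence of Eq. (1) can be held as one coordinate array and one feature array per frame; the following NumPy sketch is purely illustrative (the container layout and names are not from the paper's code):

```python
import numpy as np

def make_video(num_frames=4, d=8, seed=0):
    """Toy point cloud video following Eq. (1): per frame t, an (N_t, 3)
    coordinate array x_i^t and an (N_t, d) feature array Tf_i^t, where
    the point count N_t may differ between frames (irregular sampling)."""
    rng = np.random.default_rng(seed)
    video = []
    for t in range(num_frames):
        n_t = int(rng.integers(100, 200))  # irregular per-frame point count
        video.append({
            "coords": rng.normal(size=(n_t, 3)),  # x_i^t in R^3
            "feats": rng.normal(size=(n_t, d)),   # Tf_i^t in R^d
        })
    return video
```

A model $\mathcal{F}$ as in Eq. (2) would consume such a list and emit, e.g., an action label or a per-point segmentation mask.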
Compared with conventional videos, point cloud videos face two key distortions: (i) distributional uncertainty from irregular geometry and noise, and (ii) temporal scale bias from inconsistent relative velocity estimates under varying frame rates. Moreover, uncalibrated geometric representations distort motion perception, while unnormalized motion interferes with geometric understanding. This dual dilemma must be resolved through a complementary fusion framework. To address these issues, we propose Gaussian Aware Temporal Scaling (GATS), a dual-invariant framework that introduces (1) Uncertainty Guided Gaussian Convolution for distributional robustness, and (2) Temporal Scaling Attention for frame-partition invariance.

3.2. Uncertainty Guided Gaussian Convolution

Geometric convolutions typically rely on Euclidean distance, ignoring local distributional shape and uncertainty. Dynamic point clouds, however, exhibit density variation, noise, and occlusion. We propose to incorporate Gaussian statistics into 4D point convolution to enhance robustness.

Local Gaussian Estimation. For a center point $x_i^t$, its 4D neighborhood $\mathcal{N}(i,t)$ is modeled by its mean and covariance:
\[
\mu_{i,t} = \frac{1}{|\mathcal{N}(i,t)|} \sum_{x \in \mathcal{N}(i,t)} x, \tag{3}
\]
\[
\Sigma_{i,t} = \frac{1}{|\mathcal{N}(i,t)|} \sum_{x \in \mathcal{N}(i,t)} (x - \mu_{i,t})(x - \mu_{i,t})^{\top}. \tag{4}
\]
Here $\mu_{i,t} \in \mathbb{R}^3$ is the local centroid, and $\Sigma_{i,t} \in \mathbb{R}^{3\times3}$ encodes distributional anisotropy.

Gaussian Weighted Convolution. Building upon this estimation, we design a Gaussian-weighted convolution that integrates both geometric kernels $k(\cdot)$ and Gaussian statistical likelihoods. The aggregation weight for a neighbor $x \in \mathcal{N}(i,t)$ is defined as:
\[
w(x) = k(x - \mu_{i,t}) \cdot \exp\!\left(-\tfrac{1}{2}(x - \mu_{i,t})^{\top} \Sigma_{i,t}^{-1} (x - \mu_{i,t})\right). \tag{5}
\]
To further adapt to heterogeneous densities, we employ multi-scale Gaussian kernels with $\sigma \in \{0.5r, r, 3r\}$.
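The local statistics and weighting of Eqs. (3)-(5) can be sketched in NumPy as follows. This is an illustration of the formulas, not the paper's implementation: the geometric kernel $k(\cdot)$ is replaced by a toy isotropic Gaussian, and neighborhood search and multi-scale fusion are omitted.

```python
import numpy as np

def gaussian_weights(neighbors, sigma=1.0, eps=1e-6):
    """Eqs. (3)-(5): mean, covariance, and Gaussian-likelihood weights
    for one (K, 3) neighborhood around a center point."""
    mu = neighbors.mean(axis=0)                    # Eq. (3): local centroid
    centered = neighbors - mu
    cov = centered.T @ centered / len(neighbors)   # Eq. (4): local covariance
    cov += eps * np.eye(3)                         # regularize the inversion
    inv = np.linalg.inv(cov)
    # Mahalanobis term of Eq. (5) for every neighbor at once
    maha = np.einsum("ki,ij,kj->k", centered, inv, centered)
    # toy isotropic stand-in for the geometric kernel k(.)
    k_geo = np.exp(-np.sum(centered**2, axis=1) / (2 * sigma**2))
    w = k_geo * np.exp(-0.5 * maha)                # Eq. (5)
    return mu, cov, w / (w.sum() + eps)            # normalized weights

rng = np.random.default_rng(0)
pts = rng.normal(size=(32, 3))
mu, cov, w = gaussian_weights(pts)
```

In the paper, weights computed at several kernel scales $\sigma \in \{0.5r, r, 3r\}$ would then be combined; only a single scale is shown here.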
Uncertainty Aware Gating. While Gaussian weighting improves robustness, the reliability of local statistics may still vary under severe noise or occlusion. To adaptively balance standard and robust features, we introduce an uncertainty-aware gating mechanism. Using the condition number $\mathrm{cond}(\Sigma_{i,t})$ or its eigenvalue spectrum as an uncertainty indicator, we define:
\[
f' = \alpha f + (1 - \alpha) f_{\text{robust}}, \qquad \alpha = \phi(\mathrm{cond}(\Sigma_{i,t})), \tag{6}
\]
where $f$ is the standard convolution feature, $f_{\text{robust}}$ is a complementary branch (e.g., with a larger receptive field), and $\phi(\cdot)$ maps uncertainty to $[0,1]$. This gating ensures that the model adaptively emphasizes robust features when uncertainty is high, while preserving efficiency in stable regions.

3.3. Temporal Scaling Attention

Existing Transformer-based models often rely on discrete frame indices $|t - t'|$ as temporal bias. However, the same physical motion may correspond to different discrete intervals $\Delta t$ under varying frame rates, leading to inconsistent velocity estimation. To address this temporal scale bias, we introduce a temporal scaling factor that normalizes such discrepancies and ensures consistent motion representation.

Relative Velocity. First, given a point $x_i^t \in \mathbb{R}^3$ at time $t$, the relative velocity $v_i^t$ is defined as:
\[
v_i^t = \frac{x_i^{t+\Delta t} - x_i^t}{\Delta t}, \tag{7}
\]
where $\Delta t \in \Delta T$ denotes the frame interval. To eliminate the dependency on arbitrary frame rates, we introduce a learnable or estimable scaling factor $s \in \mathbb{R}^{+}$:
\[
\Delta t' = s \cdot \Delta t, \qquad v_i^t = \frac{x_i^{t+\Delta t} - x_i^t}{s \cdot \Delta t}. \tag{8}
\]
This normalization ensures that velocity estimation remains consistent across different temporal partitions.

Temporal Scaling Attention. Building upon this formulation, we embed the scaling factor into the attention mechanism. Specifically, the scaled temporal distance modifies the positional bias:
\[
\mathrm{Attn}(q_t, k_{t'}) = \frac{q_t k_{t'}^{\top}}{\sqrt{d}} + \beta \cdot \Phi(s \cdot |t - t'|), \tag{9}
\]
where $q_t, k_{t'} \in \mathbb{R}^d$ are query and key vectors.
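The mechanics of Eqs. (6)-(9) reduce to a few lines of NumPy. Note that $\phi(\cdot)$ and $\Phi(\cdot)$ below are illustrative stand-ins (the paper leaves their exact forms to learned components), so this is a sketch of the computation, not the trained model:

```python
import numpy as np

def uncertainty_gate(f_std, f_robust, cov, c_max=50.0):
    """Eq. (6): alpha = phi(cond(Sigma)); a high condition number (elongated,
    unreliable neighborhood) shifts weight toward the robust branch."""
    cond = np.linalg.cond(cov)
    alpha = 1.0 - min(cond, c_max) / c_max  # toy phi(.) mapping into [0, 1)
    return alpha * f_std + (1.0 - alpha) * f_robust

def scaled_velocity(x_t, x_next, dt, s=1.0):
    """Eqs. (7)-(8): relative velocity with the temporal scaling factor s."""
    return (x_next - x_t) / (s * dt)

def tsa_attention(q, k, t_idx, s=1.0, beta=0.1):
    """Eq. (9): dot-product logits plus a bias in the rescaled temporal
    metric; Phi(.) is taken as a negative distance for illustration."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits += beta * -(s * np.abs(t_idx[:, None] - t_idx[None, :]))
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)  # softmax over key frames

# In Eq. (8) only the product s * dt enters the denominator, so choosing s
# inversely to the partition interval keeps the velocity metric comparable
# across partitions: here s * dt is 1.0 in both calls.
x0, x1 = np.zeros(3), np.ones(3)
v_coarse = scaled_velocity(x0, x1, dt=1.0, s=1.0)
v_fine = scaled_velocity(x0, x1, dt=0.5, s=2.0)
```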
In our design, the key vector further integrates two complementary components: $K_{\text{UGGC}}$ from the Uncertainty Guided Gaussian Convolution branch and $K_{\text{TS}}$ from the Temporal Scaling branch; $\Phi(\cdot)$ maps temporal distance to a bias, and $\beta$ is a learnable weight. Unlike a conventional additive bias, this scaling redefines the temporal metric space, thereby achieving frame-partition invariance.

Geometric Feature Extraction. The temporal scaling factor also benefits geometric feature extraction. In P4D convolution, the temporal neighborhood radius $r_t$ is rescaled as:
\[
r'_t = s \cdot r_t, \tag{10}
\]
which guarantees consistent neighborhood selection regardless of frame-rate variations.

Synergy with Temporal Scaling. Finally, temporal scaling naturally complements Gaussian-based modeling. By normalizing the temporal radius after Gaussian estimation, it prevents variance inflation across different frame rates and ensures the comparability of Gaussian attributes. This synergy highlights the dual role of temporal scaling: stabilizing motion representation and enhancing distributional robustness.

4. Experiments

4.1. Experimental Setup

Our experimental validation is performed on three widely recognized benchmarks. For the 3D action recognition task, we utilize MSR-Action3D [16] and NTU RGBD [36]. For 4D semantic segmentation, we employ the Synthia 4D dataset [3]. A comprehensive description of the experimental datasets can be found in Appendix 1.1. Further details regarding the experimental settings are presented in Appendix 1.2.

Table 1. Action recognition accuracy comparison (%) on MSR-Action3D.

Method                   Input     Frames  Accuracy (%)
Actionlet [44]           skeleton  all     88.21
MeteorNet [25]           point     12      86.53
PSTNet [6]               point     12      87.88
P4Transformer [5]        point     12      87.54
PST-Transformer [8]      point     12      88.15
GATS (Ours)              point     12      90.24
Vieira et al. [43]       depth     20      78.20
GATS (Ours)              point     20      94.54
MeteorNet [25]           point     24      88.50
PSTNet [6]               point     24      91.20
P4Transformer [5]        point     24      90.94
PST-Transformer [8]      point     24      93.73
PSTNet++ [7]             point     24      92.68
PPTr+C2P [56]            point     24      94.76
PointCPSC [38]           point     24      92.68
MaST-Pre [37]            point     24      94.08
SequentialPointNet [17]  point     24      92.64
MAMBA4D [22]             point     24      93.38
3DInAction [1]           skeleton  24      92.23
PvNeXt [46]              point     24      94.77
GATS (Ours)              point     24      97.56

Figure 3. MSR-Action3D attention results. Attention visualization showing focus on key regions and spatio-temporal dynamics.

4.2. MSR-Action3D

Quantitative results. Table 1 reports the performance of GATS on the MSR-Action3D benchmark. In the 24-frame setting, GATS achieves the highest accuracy of 97.56%, outperforming the lowest-performing recent model, MAMBA4D (93.38%), by over 4 points. Compared with strong baselines such as PvNeXt (94.77%) and PST-Transformer (93.73%), GATS also shows clear improvements. Consistent gains are observed in the other settings, reaching 94.54% at 20 frames and 90.24% at 12 frames.

Qualitative results. We further visualize attention weights in Fig. 3. The results show that the Transformer focuses on relevant regions across frames and highlights key transitional parts of actions. This confirms that our attention mechanism effectively captures spatio-temporal dynamics and can serve as a substitute for explicit point tracking.

4.3. NTU RGBD

Table 2. Action recognition accuracy (%) on NTU RGBD.

Method                   Input        Acc (%)
SkeleMotion [2]          skeleton     69.6
GCA-LSTM [20]            skeleton     74.4
AttentionLSTM [19]       skeleton     77.1
AGC-LSTM [41]            skeleton     89.2
AS-GCN [15]              skeleton     86.8
VA-fusion [55]           skeleton     89.4
2s-AGCN [40]             skeleton     88.5
DGNN [39]                skeleton     89.9
HON4D [30]               depth        30.6
SNV [54]                 depth        31.8
HOG^2 [29]               depth        32.2
Li et al. [14]           depth        68.1
Wang et al. [47]         depth        87.1
MVDI [52]                depth        84.6
PointNet++ [33]          point        80.1
3DV [49]                 voxel        84.5
3DV-PointNet++ [49]      voxel+point  88.8
PSTNet [6]               point        90.5
P4Transformer [5]        point        90.2
PST-Transformer [8]      point        91.0
SequentialPointNet [17]  point        90.3
MaST-Pre [37]            point        90.8
PvNeXt [46]              point        89.2
GATS (Ours)              point        91.7

Quantitative results. The evaluation of action recognition accuracy on the NTU RGB+D dataset is detailed in Tab. 2. Our model, which operates directly on point cloud inputs, achieves a new state-of-the-art accuracy of 91.7%. This performance surpasses all other listed methods, including strong point-based competitors such as PST-Transformer (91.0%) and MaST-Pre (90.8%). Furthermore, our model outperforms the hybrid voxel+point method 3DV-PointNet++ (88.8%) by a significant margin of 2.9%. This result highlights our framework's superior ability to capture complex spatio-temporal dynamics from raw point sequences, establishing its effectiveness for 3D action recognition.

Table 3. 4D semantic segmentation results (mIoU %) on the Synthia 4D dataset.

Method           Input  Frames  Track  Bldng  Road   Sdwlk  Fence  Vegetn  Pole
3D MinkNet14     voxel  1       -      89.39  97.68  69.43  86.52  98.11   97.26
4D MinkNet14     voxel  3       -      90.13  98.26  73.47  87.19  99.10   97.50
PointNet++       point  1       -      96.88  97.72  86.20  92.75  97.12   97.09
MeteorNet-m      point  2       ✓      98.22  97.79  90.98  93.18  98.31   97.45
MeteorNet-m      point  2       ×      97.65  97.83  90.03  94.06  97.41   97.79
MeteorNet-l      point  3       ×      98.10  97.72  88.65  94.00  97.98   97.65
P4Transformer    point  1       -      96.76  98.23  92.11  95.23  98.62   97.77
P4Transformer    point  3       ×      96.73  98.35  94.03  95.23  98.28   98.01
PST-Transformer  point  1       -      94.46  98.13  89.37  95.84  99.06   98.10
PST-Transformer  point  3       ×      96.10  98.44  94.94  96.58  98.98   98.10
MAMBA4D          point  3       ×      96.16  98.58  92.80  94.95  97.08   98.24
GATS (Ours)      point  1       ×      97.51  98.58  95.62  95.54  99.01   98.27
GATS (Ours)      point  3       ×      97.37  98.41  94.72  95.85  97.81   98.31

Method           Input  Frames  Track  Car    T.Sign  Pedstrn  Bicycl  Lane   T.Light  mIoU
3D MinkNet14     voxel  1       -      93.50  79.45   92.27    0.00    44.61  66.69    76.24
4D MinkNet14     voxel  3       -      94.01  79.04   92.62    0.00    50.01  68.14    77.46
PointNet++       point  1       -      90.85  66.87   78.64    0.00    72.93  75.17    79.35
MeteorNet-m      point  2       ✓      94.30  76.35   81.05    0.00    74.09  75.92    81.47
MeteorNet-m      point  2       ×      94.15  82.01   79.14    0.00    72.59  77.92    81.72
MeteorNet-l      point  3       ×      93.83  84.07   80.90    0.00    71.14  77.60    81.80
P4Transformer    point  1       -      95.46  80.75   85.48    0.00    74.28  74.22    82.41
P4Transformer    point  3       ×      95.60  81.54   85.18    0.00    75.95  79.07    83.16
PST-Transformer  point  1       -      96.80  80.41   87.58    0.00    75.25  80.84    82.92
PST-Transformer  point  3       ×      96.06  82.67   87.86    0.00    76.01  81.67    83.95
MAMBA4D          point  3       ×      95.75  82.03   84.57    0.00    79.35  80.74    83.35
GATS (Ours)      point  1       ×      95.62  81.08   83.11    0.00    77.19  83.12    83.72
GATS (Ours)      point  3       ×      95.80  84.87   87.64    0.00    76.77  82.98    84.21

4.4. 4D Semantic Segmentation

Quantitative results. As delineated in Tab.
3, our model's 4D semantic segmentation performance was comprehensively benchmarked against state-of-the-art methods on the Synthia 4D dataset. In the challenging multi-frame (frame = 3) setting, our model achieves a new state-of-the-art mIoU of 84.21%, surpassing the previous best, PST-Transformer. Our model also establishes its superiority in the single-frame (frame = 1) setting, attaining an mIoU of 83.72% and outperforming all competitors. The performance gain from the single-frame to the multi-frame variant (83.72% to 84.21%) demonstrates our architecture's robust ability to leverage temporal information, firmly establishing its overall superiority.

Qualitative results. Figure 4 presents a qualitative analysis, comparing our model's predictions against the raw input and ground-truth (GT) segmentation. The visualizations confirm that our predictions align closely with the GT across varied and complex scenes. This ability to accurately capture intricate boundaries and fine-grained details underscores the model's high fidelity and strong generalization.

Figure 4. 4D qualitative results. The rows from top to bottom correspond to the input, GT, and our predictions. Detailed comparative results are highlighted in the enlarged regions of the figure.

4.5. Ablation Studies

To validate the contribution of each key component, we conducted ablation studies on the MSR-Action3D dataset, with results detailed in Tab. 4. The full model serves as our baseline, achieving an accuracy of 97.56%. When the UGGC module was removed ("w/o UGGC"), the accuracy experienced a notable drop to 95.12%. Similarly, excluding the TSA module ("w/o TSA") reduced the accuracy to 96.16%. This clear performance degradation upon the removal of either component confirms that both UGGC and TSA are integral to the model's success.

Table 4. Ablation studies on MSR-Action3D.
Model                           Frames  Acc (%)
(a) Full model                  24      97.56
(b) w/o UGGC                    24      95.12
(c) w/o TSA                     24      96.16
(d) PST-Transformer (baseline)  24      93.73

Table 5. Comparison with advanced architectures processing more frames.

Method   Backbone     Frames  Acc (%)
PvNeXt   CNN          24      94.77
MAMBA4D  Mamba        24      93.38
Ours     Transformer  24      97.56

Comparative Study on Model Effectiveness and Efficiency. In our comparative analysis (Tab. 5), the Transformer model achieved 97.56% accuracy with 24 frames, outperforming MAMBA4D (32 frames, 93.38%) and PvNeXt (24 frames, 94.77%). This result demonstrates not only superior spatio-temporal modeling capability but also greater efficiency. In contrast to models that rely on more frames but yield lower accuracy, our approach achieves superior performance and thus provides a more practical solution for long-sequence tasks.

Effect of temporal kernel size and spatial radius. Figure 5 illustrates the ablation study on temporal kernel size and spatial radius. The analysis reveals that smaller temporal kernels are advantageous: the model achieves its peak accuracy at the smallest tested size of 3, after which performance monotonically decreases with larger kernels. In contrast, the spatial radius requires an optimal balance; accuracy improves from a radius of 0.1, peaks at 97.56% with a radius of 0.3, and then distinctly degrades at larger values.

Figure 5. Analysis of accuracy with varying temporal and spatial parameters.

5. Conclusion

We propose an innovative Gaussian Aware Temporal Scaling (GATS) framework for robust spatio-temporal modeling of 4D point cloud videos, which effectively addresses distributional uncertainty and temporal scale bias through the joint design of uncertainty-guided Gaussian convolution and temporal scaling attention.
Distinctively, we are the first to analyze point cloud dynamics from the perspective of relative velocity, providing a principled solution to frame-rate inconsistency and motion representation stability. Extensive experiments demonstrate that GATS consistently enhances performance across multiple benchmarks, highlighting both its theoretical significance and practical value for real-world 4D point cloud understanding.

References

[1] Yizhak Ben-Shabat, Oren Shrout, and Stephen Gould. 3DInAction: Understanding human actions in 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19978–19987, 2024.
[2] Carlos Caetano, Jessica Sena, François Brémond, Jefersson A. Dos Santos, and William Robson Schwartz. SkeleMotion: A new representation of skeleton joint sequences based on motion information for 3D action recognition. In 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–8. IEEE, 2019.
[3] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3075–3084, 2019.
[4] Hehe Fan and Yi Yang. PointRNN: Point recurrent neural network for moving point cloud processing. arXiv preprint arXiv:1910.08287, 2019.
[5] Hehe Fan, Yi Yang, and Mohan Kankanhalli. Point 4D transformer networks for spatio-temporal modeling in point cloud videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14204–14213, 2021.
[6] Hehe Fan, Xin Yu, Yuhang Ding, Yi Yang, and Mohan Kankanhalli. PSTNet: Point spatio-temporal convolution on point cloud sequences. In International Conference on Learning Representations, 2021.
[7] Hehe Fan, Xin Yu, Yi Yang, and Mohan Kankanhalli. Deep hierarchical representation of point cloud videos via spatio-temporal decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):9918–9930, 2021.
[8] Hehe Fan, Yi Yang, and Mohan Kankanhalli. Point spatio-temporal transformer networks for point cloud video modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):2181–2192, 2022.
[9] Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. HiPPO: Recurrent memory with optimal polynomial projections. Advances in Neural Information Processing Systems, 33:1474–1487, 2020.
[10] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. In The International Conference on Learning Representations (ICLR), 2022.
[11] Haiyun Guo, Jinqiao Wang, Yue Gao, Jianqiang Li, and Hanqing Lu. Multi-view 3D object retrieval with deep embedding network. IEEE Transactions on Image Processing, 25(12):5526–5537, 2016.
[12] Xiaoshui Huang, Sheng Li, Yifan Zuo, Yuming Fang, Jian Zhang, and Xiaowei Zhao. Unsupervised point cloud registration by learning unified Gaussian mixture models. IEEE Robotics and Automation Letters, 7(3):7028–7035, 2022.
[13] Linglin Jing, Ying Xue, Xu Yan, Chaoda Zheng, Dong Wang, Ruimao Zhang, Zhigang Wang, Hui Fang, Bin Zhao, and Zhen Li. X4D-SceneFormer: Enhanced scene understanding on 4D point cloud videos through cross-modal knowledge transfer. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2670–2678, 2024.
[14] Junnan Li, Yongkang Wong, Qi Zhao, and Mohan S. Kankanhalli. Unsupervised learning of view-invariant action representations. Advances in Neural Information Processing Systems, 31, 2018.
[15] Maosen Li, Siheng Chen, Xu Chen, Ya Zhang, Yanfeng Wang, and Qi Tian. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3595–3603, 2019.
[16] Wanqing Li, Zhengyou Zhang, and Zicheng Liu. Action recognition based on a bag of 3D points. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 9–14. IEEE, 2010.
[17] Xing Li, Qian Huang, Zhijian Wang, Tianjin Yang, Zhenjie Hou, and Zhuang Miao. Real-time 3-D human action recognition based on hyperpoint sequence. IEEE Transactions on Industrial Informatics, 19(8):8933–8942, 2022.
[18] Dingkang Liang, Xin Zhou, Xinyu Wang, Xingkui Zhu, Wei Xu, Zhikang Zou, Xiaoqing Ye, and Xiang Bai. PointMamba: A simple state space model for point cloud analysis. arXiv preprint arXiv:2402.10739, 2024.
[19] Jun Liu, Gang Wang, Ling-Yu Duan, Kamila Abdiyeva, and Alex C. Kot. Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Transactions on Image Processing, 27(4):1586–1599, 2017.
[20] Jun Liu, Gang Wang, Ping Hu, Ling-Yu Duan, and Alex C. Kot. Global context-aware attention LSTM networks for 3D action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1647–1656, 2017.
[21] Jiuming Liu, Guangming Wang, Weicai Ye, Chaokang Jiang, Jinru Han, Zhe Liu, Guofeng Zhang, Dalong Du, and Hesheng Wang. DiffFlow3D: Toward robust uncertainty-aware scene flow estimation with iterative diffusion-based refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15109–15119, 2024.
[22] Jiuming Liu, Jinru Han, Lihao Liu, Angelica I. Aviles-Rivero, Chaokang Jiang, Zhe Liu, and Hesheng Wang. Mamba4D: Efficient 4D point cloud video understanding with disentangled spatial-temporal state space models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17626–17636, 2025.
[23] Lihao Liu, Angelica I. Aviles-Rivero, and Carola-Bibiane Schönlieb. Contrastive registration for unsupervised medical image segmentation. IEEE Transactions on Neural Networks and Learning Systems, 2023.
[24] Xingyu Liu, Charles R. Qi, and Leonidas J. Guibas. FlowNet3D: Learning scene flow in 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 529–537, 2019.
[25] Xingyu Liu, Mengyuan Yan, and Jeannette Bohg. MeteorNet: Deep learning on dynamic 3D point cloud sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9246–9255, 2019.
[26] Yunze Liu, Junyu Chen, Zekai Zhang, Jingwei Huang, and Li Yi. LEAF: Learning frames for 4D point cloud sequence understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 604–613, 2023.
[27] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, and Yunfan Liu. VMamba: Visual state space model. arXiv preprint arXiv:2401.10166, 2024.
[28] Jun Ma, Feifei Li, and Bo Wang. U-Mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024.
[29] Eshed Ohn-Bar and Mohan Trivedi. Joint angles similarities and HOG2 for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 465–470, 2013.
[30] Omar Oreifej and Zicheng Liu. HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 716–723, 2013.
[31] Charles R. Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J. Guibas. Volumetric and multi-view CNNs for object classification on 3D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5648–5656, 2016.
1 [32] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and se gmentation. In Pr oceedings of the IEEE conference on computer vision and pattern r ecognition , pages 652–660, 2017. 3 [33] Charles Ruizhongtai Qi, Li Y i, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information pr ocessing systems , 30, 2017. 1 , 3 , 6 [34] Guocheng Qian, Y uchen Li, Houwen Peng, Jinjie Mai, Hasan Hammoud, Mohamed Elhoseiny , and Bernard Ghanem. Pointnext: Re visiting pointnet++ with improved training and scaling strategies. Advances in neural informa- tion pr ocessing systems , 35:23192–23204, 2022. 1 , 3 [35] A vishkar Saha, Oscar Mendez, Chris Russell, and Richard Bowden. Translating images into maps. In 2022 Interna- tional confer ence on robotics and automation (ICRA) , pages 9200–9206. IEEE, 2022. 1 [36] Amir Shahroudy , Jun Liu, T ian-Tsong Ng, and Gang W ang. Ntu r gb+ d: A large scale dataset for 3d human acti vity anal- ysis. In Pr oceedings of the IEEE confer ence on computer vision and pattern r ecognition , pages 1010–1019, 2016. 6 [37] Zhiqiang Shen, Xiaoxiao Sheng, Hehe Fan, Longguang W ang, Y ulan Guo, Qiong Liu, Hao W en, and Xi Zhou. Masked spatio-temporal structure prediction for self- supervised learning on point cloud videos. In Pr oceedings of the IEEE/CVF International Conference on Computer V i- sion , pages 16580–16589, 2023. 3 , 6 [38] Zhiqiang Shen, Xiaoxiao Sheng, Longguang W ang, Y ulan Guo, Qiong Liu, and Xi Zhou. Pointcmp: Contrastive mask prediction for self-supervised learning on point cloud videos. In Proceedings of the IEEE/CVF Conference on Computer V ision and P attern Recognition , pages 1212–1222, 2023. 1 , 6 [39] Lei Shi, Y ifan Zhang, Jian Cheng, and Hanqing Lu. Skeleton-based action recognition with directed graph neu- ral networks. 
In Pr oceedings of the IEEE/CVF confer ence on computer vision and pattern reco gnition , pages 7912–7921, 2019. 6 [40] Lei Shi, Y ifan Zhang, Jian Cheng, and Hanqing Lu. T wo- stream adapti ve graph conv olutional networks for skeleton- based action recognition. In Pr oceedings of the IEEE/CVF confer ence on computer vision and pattern reco gnition , pages 12026–12035, 2019. 6 [41] Chenyang Si, W entao Chen, W ei W ang, Liang W ang, and T ieniu T an. An attention enhanced graph conv olutional lstm network for skeleton-based action recognition. In Proceed- ings of the IEEE/CVF confer ence on computer vision and pattern r ecognition , pages 1227–1236, 2019. 6 [42] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller . Multi-vie w con volutional neural networks for 3d shape recognition. In Pr oceedings of the IEEE in- ternational confer ence on computer vision , pages 945–953, 2015. 1 [43] Antonio W V ieira, Erickson R Nascimento, Gabriel L Oliv eira, Zicheng Liu, and Mario FM Campos. Stop: Space- time occupancy patterns for 3d action recognition from depth map sequences. In Iber oamerican congr ess on pattern r ecog- nition , pages 252–259. Springer , 2012. 6 [44] Jiang W ang, Zicheng Liu, Y ing W u, and Junsong Y uan. Mining actionlet ensemble for action recognition with depth cameras. In 2012 IEEE conference on computer vision and pattern r ecognition , pages 1290–1297. IEEE, 2012. 6 [45] Jun W ang, Xiaolong Li, Alan Sulliv an, L ynn Abbott, and Si- heng Chen. Pointmotionnet: Point-wise motion learning for large-scale lidar point clouds sequences. In Pr oceedings of the IEEE/CVF Confer ence on Computer V ision and P attern Recognition , pages 4419–4428, 2022. 1 [46] Jie W ang, T ingfa Xu, Lihe Ding, Xinjie Zhang, Long Bai, and Jianan Li. Pvnext: Rethinking network design and temporal motion for point cloud video recognition. arXiv pr eprint arXiv:2504.05075 , 2025. 6 [47] Pichao W ang, W anqing Li, Zhimin Gao, Chang T ang, and Philip O Ogunbona. 
Depth pooling based large-scale 3-d action recognition with con volutional neural networks. IEEE T ransactions on Multimedia , 20(5):1051–1061, 2018. 6 [48] Y ue W ang, Y ongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. A CM T ransactions on Graphics (tog) , 38(5):1–12, 2019. 1 [49] Y ancheng W ang, Y ang Xiao, Fu Xiong, W enxiang Jiang, Zhiguo Cao, Joey Tian yi Zhou, and Junsong Y uan. 3dv: 3d dynamic v oxel for action recognition in depth video. In Pr o- ceedings of the IEEE/CVF confer ence on computer vision and pattern r ecognition , pages 511–520, 2020. 6 [50] Hao W en, Y unze Liu, Jingwei Huang, Bo Duan, and Li Y i. Point primitiv e transformer for long-term 4d point cloud video understanding. In Eur opean Conference on Computer V ision , pages 19–35. Springer , 2022. 1 [51] Louis W iesmann, Rodrigo Marcuzzi, Cyrill Stachniss, and Jens Behley . Retriev er: Point cloud retriev al in compressed 3d maps. In 2022 international conference on robotics and automation (ICRA) , pages 10925–10932. IEEE, 2022. 1 [52] Y ang Xiao, Jun Chen, Y ancheng W ang, Zhiguo Cao, Joey T ianyi Zhou, and Xiang Bai. Action recognition for depth video using multi-view dynamic images. Information Sciences , 480:287–304, 2019. 6 [53] Zhaohu Xing, Tian Y e, Y ijun Y ang, Guang Liu, and Lei Zhu. Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation. arXiv pr eprint arXiv:2401.13560 , 2024. 3 [54] Xiaodong Y ang and Y ingLi T ian. Super normal v ector for ac- tivity recognition using depth sequences. In Pr oceedings of the IEEE confer ence on computer vision and pattern recog- nition , pages 804–811, 2014. 6 [55] Pengfei Zhang, Cuiling Lan, Junliang Xing, W enjun Zeng, Jianru Xue, and Nanning Zheng. V iew adaptive neural net- works for high performance skeleton-based human action recognition. IEEE transactions on pattern analysis and ma- chine intelligence , 41(8):1963–1978, 2019. 
6 [56] Zhuoyang Zhang, Y uhao Dong, Y unze Liu, and Li Y i. Complete-to-partial 4d distillation for self-supervised point cloud sequence representation learning. In Pr oceedings of the IEEE/CVF Confer ence on Computer V ision and P attern Recognition , pages 17661–17670, 2023. 6 [57] Jia-Xing Zhong, Kaichen Zhou, Qingyong Hu, Bing W ang, Niki Trigoni, and Andre w Markham. No pain, big gain: clas- sify dynamic point cloud sequences with static models by fitting feature-lev el space-time surfaces. In Pr oceedings of the IEEE/CVF Confer ence on Computer V ision and P attern Recognition , pages 8510–8520, 2022. 1 [58] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong W ang, W enyu Liu, and Xinggang W ang. V ision mamba: Efficient visual representation learning with bidirectional state space model. arXiv pr eprint arXiv:2401.09417 , 2024. 1 GA TS: Gaussian A ware T emporal Scaling T ransformer f or In variant 4D Spatio-T emporal Point Cloud Repr esentation Supplementary Material 6. Experimental Setup 6.1. Theoretical Results W e first analyze the effect of discrete sampling on velocity estimation in point cloud videos. Let x ( t ) : R → R 3 denote the continuous trajectory of a point in physical time. When the trajectory is sampled at discrete intervals ∆ t , the ob- served sequence is x t = x ( t 0 + t · ∆ t ) . The corresponding discrete velocity estimator is defined as ˆ v ( t ) = x t +∆ t − x t ∆ t . (11) This estimator de viates from the true instantaneous v e- locity v ( t ) = dx dt . By expanding x ( t + ∆ t ) around t using T aylor series, we obtain x ( t + ∆ t ) = x ( t ) + v ( t )∆ t + 1 2 a ( t )(∆ t ) 2 + o ((∆ t ) 2 ) , (12) where a ( t ) denotes the acceleration. Substituting this into the discrete estimator yields ˆ v ( t ) = v ( t ) + 1 2 a ( t )∆ t + o (∆ t ) . (13) It follows that the estimation error gr ows linearly with ∆ t , which leads to inconsistent v elocity estimation across dif- ferent frame rates. 
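This linear-in-\(\Delta t\) bias is easy to check numerically. The sketch below (an illustration we add here, not part of the released code) applies the forward-difference estimator of Eq. (11) to a trajectory with a known derivative, \(x(t) = \sin t\), and verifies that halving \(\Delta t\) roughly halves the error, consistent with Eq. (13):

```python
import math

def discrete_velocity(x, t, dt):
    # Forward-difference estimator of Eq. (11): (x(t+dt) - x(t)) / dt
    return (x(t + dt) - x(t)) / dt

x = math.sin          # trajectory with known velocity v(t) = cos(t)
t = 0.3
v_true = math.cos(t)

errors = [abs(discrete_velocity(x, t, dt) - v_true) for dt in (0.2, 0.1, 0.05)]

# Eq. (13): error ~ 0.5 * |a(t)| * dt, so halving dt should roughly halve it.
ratio1 = errors[0] / errors[1]
ratio2 = errors[1] / errors[2]
assert 1.7 < ratio1 < 2.3 and 1.7 < ratio2 < 2.3
```

The ratios are close to, but not exactly, 2 because of the \(o(\Delta t)\) remainder in Eq. (13).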
To eliminate this dependency, we introduce a temporal scaling factor \(s > 0\) and define the normalized velocity estimator as
\[ \hat{v}_s(t) = \frac{x_{t+\Delta t} - x_t}{s \cdot \Delta t}. \tag{14} \]
By choosing \(s = \frac{\Delta t}{\Delta t_{\mathrm{ref}}}\), where \(\Delta t_{\mathrm{ref}}\) is a fixed reference interval, the estimator becomes
\[ \hat{v}_s(t) = \frac{x(t + \Delta t) - x(t)}{\Delta t_{\mathrm{ref}}}. \tag{15} \]
This normalization maps all sequences to the same reference time scale, thereby removing sampling-rate bias. Moreover, as \(\Delta t \to 0\), the normalized estimator converges to the true velocity \(v(t)\).

Next, we examine how the scaling factor behaves with respect to frame density. Consider a video segment of fixed physical duration \(T_{\mathrm{seg}}\) sampled into \(F\) frames. The discrete interval is
\[ \Delta t = \frac{T_{\mathrm{seg}}}{F}. \tag{16} \]
With a fixed reference interval \(\Delta t_{\mathrm{ref}} > 0\), the scaling factor is
\[ s = \frac{\Delta t}{\Delta t_{\mathrm{ref}}} = \frac{T_{\mathrm{seg}}}{\Delta t_{\mathrm{ref}}} \cdot \frac{1}{F}. \tag{17} \]
This expression shows that \(s\) is inversely proportional to the frame count \(F\): for a fixed segment duration, increasing the number of frames (or, equivalently, the frame rate) reduces the scaling factor. Substituting \(\Delta t = T_{\mathrm{seg}}/F\) into the definition of \(s\) yields
\[ s(F) = \frac{T_{\mathrm{seg}}}{\Delta t_{\mathrm{ref}}} \cdot \frac{1}{F} = C \cdot F^{-1}, \qquad C = \frac{T_{\mathrm{seg}}}{\Delta t_{\mathrm{ref}}} > 0. \tag{18} \]
The function \(s(F) = C F^{-1}\) is strictly decreasing for \(F > 0\), with derivative
\[ \frac{ds}{dF} = -C F^{-2} < 0. \tag{19} \]
Hence, as more frames are sampled within a fixed duration, the temporal scaling factor decreases accordingly.

Finally, we express the relationship in terms of frame rate. Let the frame rate be \(\mathrm{fps}\) and the segment length be \(T_{\mathrm{seg}}\). Since \(F = \mathrm{fps} \cdot T_{\mathrm{seg}}\), the scaling factor becomes
\[ s = \frac{1}{\Delta t_{\mathrm{ref}}} \cdot \frac{1}{\mathrm{fps}}. \tag{20} \]
Thus, for a fixed reference interval, an increase in frame rate leads to a reduction of the scaling factor, whereas a decrease in frame rate results in an increase. This inverse relationship ensures that the normalized velocity estimator remains invariant to frame partitioning and robust across heterogeneous video sources.

6.2. Datasets

MSR-Action3D. This widely used dataset, captured by a first-generation Kinect sensor, comprises 567 depth sequences. It features 20 distinct human action categories, totaling approximately 23,000 frames with an average sequence length of 40 frames. To ensure a direct and fair comparison with prior art, we adhere to the conventional partition, assigning 270 sequences to the training set and 297 to the testing set.

NTU RGB+D. As a large-scale benchmark for 3D human action recognition, this dataset contains 56,880 video samples distributed across 60 fine-grained action classes. The duration of each video clip varies, ranging from 30 to 300 frames. We adopt the standard cross-subject evaluation protocol, which designates 40,320 videos for training and 16,560 videos for testing.

Synthia 4D. To assess our model's generalization performance on dense prediction, we conduct 4D semantic segmentation experiments using the Synthia 4D dataset. Derived from the original Synthia, this dataset consists of six dynamic driving scenarios. We follow the established experimental setup from previous works, using a standard frame-wise split of 19,888 frames for training, 815 for validation, and 1,886 for testing.

6.3. Training Details

MSR-Action3D. We follow P4Transformer to partition the training and testing sets. For each video, we densely sample 24 frames and sample 2,048 points in each frame. We train our model on a single NVIDIA A100 GPU for 50 epochs. The SGD optimizer is employed, with the initial learning rate set to 0.01 and decayed by a factor of 0.1 at the 20th and 30th epochs, respectively.

NTU RGB+D. Consistent with prior methodologies, our data processing involves 24-frame sequences with a temporal step of 2, sampling 2,048 points per frame. The training was conducted over 15 epochs with a batch size of 24 on a single NVIDIA A100 GPU.
An SGD optimizer was utilized, configured with an initial learning rate of 0.01 and a cosine decay schedule to ensure stable convergence and facilitate a fair comparison.

Synthia 4D. For the Synthia 4D experiments, our model was trained for a total of 150 epochs on a single NVIDIA A100 GPU, using a batch size of 8. Each input consisted of 3-frame clips with 16,384 points per frame. The optimization was managed by an SGD optimizer with a momentum of 0.9. We implemented a learning rate schedule featuring a 10-epoch linear warmup from an initial LR of 0.01, followed by a multi-step decay (factor of 0.1) at the 50th, 80th, and 100th epochs.

7. Network Configurations

GATS Architecture. In our GATS framework, the point cloud sequence is first fed into the Gaussian Aware Spatio-Temporal Embedding module. In this stage, spatial neighborhoods are modeled using Gaussian-weighted convolution, which leverages local mean and covariance statistics, while the temporal dimension is processed through scale-normalized convolution. This design enables the extraction of spatio-temporal feature representations that are robust to both distributional uncertainty and frame-rate variations.

Subsequently, the features are passed into a dual-branch adapter (UGGC + TSA). The UGGC (Uncertainty Guided Gaussian Convolution) branch enhances spatial robustness by incorporating Gaussian statistics, whereas the TSA (Temporal Scaling Attention) branch introduces scaled temporal biases to ensure cross-frame consistency. The outputs of these two branches are fused through an uncertainty-aware gating mechanism, where the gating weights are determined by the condition number of the local covariance.

The fused features are then processed by the Transformer encoder, where each layer consists of an Attention block and a FeedForward block.
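To make the role of the temporal bias concrete, the sketch below adds a learnable temporal-scaling term to standard multi-head self-attention. This is a minimal illustration under our own assumptions, not the released implementation: the class name `TemporalScalingAttention` and the specific bias form \(-s \cdot |t_i - t_j|\) on the attention logits are hypothetical.

```python
import torch
import torch.nn as nn

class TemporalScalingAttention(nn.Module):
    """Multi-head self-attention with an additive temporal bias (sketch).

    Hypothetical: logits receive -s * |t_i - t_j|, where s is a learnable
    positive scaling factor that normalizes temporal distances (cf. TSA).
    """
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.log_s = nn.Parameter(torch.zeros(1))  # s = exp(log_s) > 0

    def forward(self, x, times):
        # x: (B, N, dim) tokens, times: (B, N) frame timestamps
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (B, N, self.heads, self.head_dim)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        logits = q @ k.transpose(-2, -1) / self.head_dim ** 0.5   # (B, H, N, N)
        # Scaled temporal-distance bias, shared across heads.
        dt = (times[:, :, None] - times[:, None, :]).abs()        # (B, N, N)
        logits = logits - torch.exp(self.log_s) * dt[:, None]
        out = logits.softmax(-1) @ v                              # (B, H, N, hd)
        return self.proj(out.transpose(1, 2).reshape(B, N, -1))

attn = TemporalScalingAttention(dim=32, heads=4)
x = torch.randn(2, 6, 32)
t = torch.arange(6, dtype=torch.float).repeat(2, 1)
y = attn(x, t)
assert y.shape == (2, 6, 32)
```

Because the bias depends on time differences only through the learnable scale, rescaling all timestamps can be absorbed into \(s\), which is the intuition behind frame-partition invariance.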
The Attention block extends the standard QKV projection and multi-head attention by incorporating a temporal scaling bias, thereby guaranteeing invariance to frame partitioning. The FeedForward block is composed of LayerNorm, fully connected layers, and non-linear feedforward transformations, with residual connections to maintain training stability. By stacking multiple such layers, the model produces globally consistent and robust spatio-temporal representations.

The overall GATS architecture and the detailed attention structure are illustrated in Fig. 6, where \(T \in \mathbb{R}^{M \times N \times d}\) denotes the temporal tokens and \(S_{GA} \in \mathbb{R}^{M \times N \times d}\) represents the Gaussian-aware tokens.

Figure 6. Structure of GATS. The left side shows the overall process, and the right side shows the specific attention structure.

8. Additional Quantitative Results

Quantitative Results. Table 6 expands prior comparisons, showing our method achieves 97.56% accuracy, significantly outperforming depth-, skeleton-, and point-based baselines. Traditional depth-based methods such as Vieira et al. and Kläser et al. achieve 78.20% and 81.43% accuracy, respectively, indicating limited capacity to capture fine-grained motion cues. Skeleton-based approaches like Actionlet improve performance to 88.21%, benefiting from structured human pose representations. Point-based methods show mixed results: while MeteorNet reaches 88.50%, PointNet++ lags behind at 61.61%, highlighting the challenge of modeling spatio-temporal dynamics directly from raw point clouds. In contrast, our proposed GATS framework achieves a substantial improvement, reaching 97.56% accuracy, which demonstrates its superior ability to capture both spatial and temporal dependencies in 4D point cloud sequences and validates its effectiveness for action recognition.

Table 6. Action recognition accuracy comparison on a standard benchmark.

Method          Input     Accuracy (%)
Vieira et al.   depth     78.20
Kläser et al.   depth     81.43
Actionlet       skeleton  88.21
PointNet++      point     61.61
MeteorNet       point     88.50
GATS (Ours)     point     97.56

Qualitative Results. Fig. 7 demonstrates that our model consistently delivers precise segmentation across complex scenes, underscoring its robustness and generalization. The first column displays the input frames from dynamic 4D point cloud sequences, the second column shows the corresponding ground-truth semantic labels, and the third column presents the segmentation results produced by our method. As illustrated, even in scenarios with significant spatial irregularity and temporal variation, our approach maintains high accuracy in distinguishing semantic categories. These qualitative results further validate the effectiveness of our design in handling diverse and challenging environments, highlighting its strong adaptability and generalization capability.

Figure 7. Qualitative results on Synthia 4D.
