KGS-GCN: Enhancing Sparse Skeleton Sensing via Kinematics-Driven Gaussian Splatting and Probabilistic Topology for Action Recognition


Authors: Yuhan Chen, Yicui Shi, Guofa Li

Abstract— Skeleton-based action recognition is widely utilized in sensor systems including human-computer interaction and intelligent surveillance. Nevertheless, current sensor devices typically generate sparse skeleton data as discrete coordinates, which inevitably discards fine-grained spatiotemporal details during highly dynamic movements. Moreover, the rigid constraints of predefined physical sensor topologies hinder the modeling of latent long-range dependencies. To overcome these limitations, we propose KGS-GCN, a graph convolutional network that integrates kinematics-driven Gaussian splatting with probabilistic topology. Our framework explicitly addresses the challenges of sensor data sparsity and topological rigidity by transforming discrete joints into continuous generative representations. Firstly, a kinematics-driven Gaussian splatting module is designed to dynamically construct anisotropic covariance matrices using instantaneous joint velocity vectors. This module enhances visual representation by rendering sparse skeleton sequences into multi-view continuous heatmaps rich in spatiotemporal semantics. Secondly, to transcend the limitations of fixed physical connections, a probabilistic topology construction method is proposed. This approach generates an adaptive prior adjacency matrix by quantifying statistical correlations via the Bhattacharyya distance between joint Gaussian distributions. Ultimately, the GCN backbone is adaptively modulated by the rendered visual features via a visual context gating mechanism. Empirical results demonstrate that KGS-GCN significantly enhances the modeling of complex spatiotemporal dynamics.
By addressing the inherent limitations of sparse inputs, our framework offers a robust solution for processing low-fidelity sensor data. This approach establishes a practical pathway for improving perceptual reliability in real-world sensing applications.

Index Terms— Skeleton-based action recognition; Gaussian splatting; Probabilistic topology learning

This work is supported by National Key R&D Program of China 2024YFB2505500. (Corresponding author: Guofa Li.) Yuhan Chen, Yicui Shi, Guofa Li, and Jie Li are with the College of Mechanical and Vehicle Engineering, Chongqing University, Chongqing 400044, China (e-mail: 20240701028@stu.cqu.edu.cn; yicuishi@cqu.edu.cn; liguofa@cqu.edu.cn; jieli@cqu.edu.cn). Liping Zhang is with the Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China (e-mail: lipingzhang@tsinghua.edu.cn). Jiaxin Gao is with the School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China (e-mail: gaojiaxin2017@163.com). Wenbo Chu is with the National Innovation Center of Intelligent and Connected Vehicles, Beijing 100089, China (e-mail: chuwenbo@wicv.cn).

I. INTRODUCTION

Rapid advancements in micro-electro-mechanical systems and 3D sensing technologies have paved the way for motion-capture-based perception in critical domains, such as intelligent medical rehabilitation, human-robot collaboration, and pervasive computing [1-3]. In these contexts, skeleton data serves as a high-level sensor signal extracted from depth cameras, radars, or inertial measurement units. It can effectively encapsulate human biomechanical structures with minimal bandwidth requirements while ensuring superior privacy protection compared to raw RGB videos. Consequently, these merits have established skeleton data as a fundamental modality for action recognition tasks [2-4].
Traditional sensor signal processing methods rely on manual feature engineering or RNN/CNN-based time-series analysis. However, these approaches struggle to accommodate the non-Euclidean topological structures inherent in human skeletons. To address this, ST-GCN [5] pioneered a strategy to formulate skeleton sensor data as spatiotemporal graphs, utilizing graph convolutional networks to aggregate signals along physical limb connections. This paradigm effectively establishes structured dependencies among sensor nodes and has made graph neural networks the dominant architecture for skeleton-based perception tasks. Subsequent works have focused on overcoming the limitations of fixed physical topology by exploring data-driven topology learning and spatiotemporal interaction mechanisms. For instance, 2s-AGCN [6] introduced adaptive graph convolution to learn data-specific adjacency matrices end-to-end, significantly enhancing feature discriminability. Similarly, Shift-GCN [10] employed shift operations to reduce computational complexity while extending spatiotemporal receptive fields. DeGCN [20] proposed deformable graph convolution to dynamically adjust neighborhood aggregation ranges, accommodating deformation variations across distinct actions. Meanwhile, Transformer-based architectures have leveraged self-attention mechanisms to capture global long-range dependencies, continuously pushing the performance boundaries of this task [1, 21]. Despite these advancements, current skeleton perception methods face two fundamental bottlenecks in sampling and modeling complex motion signals.
First, discrete point sampling fails to adequately represent continuous motion signals. Conventional pipelines typically treat joints as isolated points and learn features directly from their coordinate sequences. While this approach remains effective for slow or predictable movements, it fails to capture the intricate dynamics of complex actions. For rapid or explosive movements, instantaneous joint velocities, directions, and momentum serve as critical discriminative factors. Being restricted to sparse coordinates attenuates spatial expansion cues along the motion direction. Consequently, the model becomes prone to confusing actions that exhibit similar trajectories but possess distinct dynamic characteristics.

Second, current sensor topology construction lacks statistical interpretability. Although adaptive graph learning can transcend physical connectivity constraints to incorporate latent long-range dependencies [6], the edge weights are typically determined through implicit end-to-end training. Such a heuristic approach lacks explicit statistical significance and controllability, often leading to a black-box optimization process that obscures the underlying structural correlations between sensor nodes. Furthermore, the joint optimization of adjacency matrices and network parameters can lead to topological forgetting or relational instability, thereby degrading the fidelity of topology awareness [19]. Consequently, constructing topological priors with rigorous statistical significance, interpretability, and transferability, while simultaneously enhancing continuous motion representation, remains a pivotal challenge in skeleton-based action recognition.

To address these challenges, this paper proposes KGS-GCN, a Graph Convolutional Network integrated with Kinematic Gaussian Splatting and Probabilistic Topology.
Our framework re-envisions skeleton joint representation and topology construction through the lenses of kinematics and probability. By doing so, we effectively resolve the challenges of sensor data sparsity and topological rigidity inherent in conventional methods. Drawing inspiration from the success of 3D Gaussian Splatting in continuous radiance field representation and efficient rendering [24], the proposed method transforms discrete joints from deterministic points into probability distributions, explicitly encoding spatial uncertainty within the feature space. In contrast to computer graphics, which prioritizes static geometric representation, our approach emphasizes the kinematics-driven nature of sensor data. As joint velocity increases, the corresponding spatial distribution elongates anisotropically along the direction of motion. This mechanism ensures that velocity and orientation are naturally encoded as intrinsic features of the representation. Furthermore, formulating nodes as probability distributions enables the rigorous quantification of inter-node relationships via statistical distance. This establishes an interpretable prior for constructing semantic topologies that transcend physical connections. Specifically, the main contributions of this work are summarized as follows:

1. A kinematics-driven Gaussian splatting module is designed to dynamically construct anisotropic covariance matrices based on instantaneous joint velocities. This module renders sparse skeleton sequences into multi-view continuous heatmaps enriched with spatiotemporal semantics, significantly enhancing the representation of rapid motions and uncertainty.

2. KGS-GCN introduces a probabilistic topology construction strategy that quantifies statistical correlations via the Bhattacharyya distance between joint Gaussian distributions.
This generates an interpretable adaptive prior adjacency matrix to complement physical topologies and capture latent long-range dependencies.

3. KGS-GCN incorporates a visual context gating mechanism that leverages rendered visual features to modulate feature propagation within the GCN backbone, facilitating the synergistic modeling of continuous visual representations and graph structure learning. Extensive experiments on multiple benchmark datasets validate the framework's efficacy in capturing complex spatiotemporal dynamics.

II. RELATED WORK

A. Skeleton-Based Action Recognition

Skeleton-based action recognition aims to parse spatiotemporal dynamic patterns from sequences of human joint coordinates. Compared with RGB videos, skeleton representations are characterized by structural compactness and robustness against appearance perturbations, establishing them as a critical modality for action understanding. Early approaches primarily relied on handcrafted features or sequence models based on RNNs and CNNs, yet they faced inherent limitations in explicitly representing the non-Euclidean topology of the human body. ST-GCN [5] pioneered the modeling of skeleton sequences as spatiotemporal graphs, performing feature propagation along physical skeletal connections. This paradigm established the dominance of Graph Convolutional Networks (GCNs) in the field. Following the ST-GCN paradigm, subsequent works have primarily evolved along three directions: adaptive topology learning, the design of efficient spatiotemporal operators, and the enhancement of topology awareness stability. Regarding adaptive topology, 2s-AGCN [6] introduced end-to-end learning of data-driven adjacency matrices. This mechanism allows graph structures to adaptively adjust according to specific samples or network layers, thereby overcoming the limitations of fixed physical connections.
AS-GCN [7] and MS-AAGCN [9] reinforced spatiotemporal representation capabilities through structured relationship reasoning and multi-stream feature fusion, respectively. Furthermore, DGNN [8] utilized directed graph neural networks to mine asymmetric dependencies among skeletal joints. Targeting efficient modeling, Shift-GCN [10] replaced computationally intensive graph convolutions with lightweight spatial and temporal shift mechanisms, reducing complexity while expanding effective receptive fields. GCN-NAS [13] leveraged neural architecture search to automatically discover optimal network structures, aiming to enhance computational efficiency. Furthermore, recent works have focused on constructing robust and efficient baselines to reduce training and deployment costs [16]. To enhance operator expressiveness, existing works have focused on designing refined feature aggregation mechanisms. For instance, Disentangling GCN [11] proposed a disentangled graph convolution strategy to unify multi-scale feature aggregation and eliminate redundant dependencies. Similarly, Context-Aware GCN [12] incorporated a context-aware mechanism, enabling the network to effectively capture frame-wise global context information. Subsequently, CTR-GCN [14] refined topology modeling along the channel dimension to learn channel-specific relational structures. InfoGCN [15] introduced an information bottleneck objective to promote discriminative representation learning. Furthermore, DeGCN [20] integrated deformable sampling into spatial and temporal graph convolutions, adapting to intra-class deformations by learning dynamic receptive fields. Hierarchical decomposition and long-range connection modeling were employed to enhance multi-scale relational representations [17].
Meanwhile, Transformer architectures have further pushed performance boundaries by leveraging self-attention mechanisms to capture global long-range dependencies [21]. Despite continuous advancements in topology learning and spatiotemporal interaction modeling, mainstream frameworks typically treat joints as discrete point coordinates. This limitation hinders the explicit characterization of kinematic uncertainty in rapid motions and of the associated spatial distribution along the velocity direction. Furthermore, although adaptive graph learning methods [6] capture latent long-range dependencies, the edge weights are typically learned implicitly as network parameters, resulting in a lack of explicit statistical significance and controllability. During joint optimization, the emphasis on topological information may be compromised or subject to forgetting, leading to the degradation of topology awareness [18-19]. Motivated by these observations, the proposed KGS-GCN formulates joints as probability distributions rather than deterministic points. By leveraging statistical distance to construct interpretable probabilistic topology priors, this approach provides a unified and interpretable perspective for continuous motion representation and topology learning.

B. Gaussian Splatting

The field of neural rendering focuses on representing and rendering scenes in a differentiable manner. Neural Radiance Fields (NeRF) [22] achieved high-quality novel view synthesis via implicit neural fields, but they incur high training and inference costs. Meanwhile, related implicit neural representations employing periodic activation functions [23] have demonstrated strong fitting capabilities for continuous signals.
In contrast, 3D Gaussian Splatting (3DGS) [24] adopts explicit Gaussian primitives to represent scenes, integrating differentiable rasterizers to achieve efficient rendering and high-quality reconstruction. This approach has significantly advanced the point-based explicit rendering paradigm. Regarding geometric consistency, 2D Gaussian Splatting (2DGS) [25] models object surfaces using surfel-based 2D Gaussian disks. This method improves geometric accuracy, establishing a solid foundation for subsequent extensions. Benefiting from the efficiency of explicit representation and differentiable rendering, Gaussian splatting techniques have been rapidly applied to diverse tasks. In the realm of 2D image representation and compression, approaches such as GaussianImage [26] and Large Images Are Gaussians [27] utilize 2D Gaussian primitives to construct efficient image representations. Furthermore, Instant Gaussian Image [28] explores image representation capabilities with enhanced generalization and adaptivity. Beyond generation and representation tasks, sparse Gaussian representations have been applied to data compression and knowledge distillation to enhance efficiency and scalability [29]. Furthermore, approaches like Speedy-Splat [30] accelerate the 3DGS pipeline through rendering and optimization improvements, while GaussianPro [31] refines training strategies via progressive propagation. For large-scale scene reconstruction, CityGaussian [32] improves training efficiency and real-time rendering capabilities in complex scenarios by employing divide-and-conquer and level-of-detail strategies. In the context of dynamic scene modeling, approaches such as Street Gaussians [33], MVSGaussian [34], and DrivingGaussian [35] incorporate temporal dimensions or dynamic attributes to capture moving objects and environmental variations.
Additionally, Momentum-GS [36] reinforces the quality and stability of reconstruction through self-distillation and consistency constraints. In general, existing Gaussian splatting approaches primarily target generative or reconstructive tasks, such as visual reconstruction, novel view synthesis, and compression. Even in dynamic settings, current works focus predominantly on improving the reconstruction quality of time-varying appearance and geometry [33-36]. In contrast, for discriminative tasks such as skeleton-based action recognition, systematic exploration is lacking regarding the transformation of anisotropic deformation features of Gaussian primitives into learnable kinematic cues and their synergistic modeling with graph structure learning. The proposed KGS-GCN introduces kinematics-driven anisotropic Gaussian splatting to explicitly encode joint velocity and directional information into spatiotemporally continuous joint heatmaps. Simultaneously, interpretable probabilistic topology priors are constructed based on the statistical distance between joint Gaussian distributions. This approach achieves the unified modeling of continuous motion representations and interpretable topology learning for skeleton-based action recognition.

III. PROPOSED METHOD

This section details KGS-GCN, a graph convolutional network framework integrating kinematic Gaussian splatting and probabilistic topology. As illustrated in Fig. 1, the proposed method aims to mitigate the inherent limitations of discrete skeleton representations through kinematics-aware continuous field modeling.

A. Problem Definition and Overall Framework

The input sequence is represented as a five-dimensional tensor, where each sample comprises $M$ pedestrians with $V$ joints spanning a sequence length $T$ with $C$ coordinate channels:

$\mathcal{X} \in \mathbb{R}^{N \times C \times T \times V \times M}, \quad y \in \{1, \dots, K\}$   (1)

where $N$ denotes the batch size and $K$ represents the number of classes. The KGS-GCN framework specifically targets two critical challenges: 1) the inadequacy of discrete, sparse joint coordinates in characterizing fine-grained motion blur and velocity; and 2) the restrictions of predefined skeleton topologies, which hinder the adaptive modeling of latent long-range dependencies. Two core components are introduced to address these issues. Sparse skeletons are rendered into continuous multi-view heatmaps by the Kinematics-driven Gaussian Splatting Module (KGSM), alongside the explicit construction of velocity-driven anisotropic covariance. A probabilistic topology construction strategy is concurrently proposed. Specifically, KGS-GCN formulates each joint as a Gaussian distribution and quantifies statistical correlations via the Bhattacharyya distance to generate a sample-adaptive prior adjacency matrix. Finally, the framework incorporates a visual context gating mechanism to inject the rendered visual context into the GCN backbone.

Algorithm 1 outlines the detailed training workflow of the proposed KGS-GCN. Initially, the framework transforms skeleton sequences into multi-view heatmaps by leveraging kinematic computations and the Gaussian Splatting Module.
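As a concrete illustration of the input layout in Eq. (1), the following minimal sketch builds such a tensor; all concrete dimension values (NTU-style 25 joints, 2 persons, etc.) are illustrative assumptions, not values fixed by the paper.

```python
import numpy as np

# Illustrative dimensions: N samples, C coordinate channels,
# T frames, V joints, M persons (values are assumptions).
N, C, T, V, M = 8, 3, 64, 25, 2
x = np.zeros((N, C, T, V, M), dtype=np.float32)

# The 3-D trajectory of joint 0 of person 0 in sample 0:
trajectory = x[0, :, :, 0, 0]   # shape (C, T)
assert x.shape == (8, 3, 64, 25, 2)
assert trajectory.shape == (3, 64)
```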
Subsequently, the probabilistic topology inferred via the Bhattacharyya distance initializes the adjacency matrix of the GCN. During feature aggregation, visual features modulate skeleton representations layer-by-layer through a gating mechanism. Finally, the model is updated by jointly minimizing the classification loss and the topology constraint loss.

Fig. 1. The overall framework of KGS-GCN. Input skeleton sequences are processed to extract hybrid spatial and kinematic features. Anisotropic heatmaps and probabilistic joint distributions are generated via the Kinematics-driven Gaussian Splatting Module driven by hybrid features. Probabilistic topology is constructed utilizing statistical distance metrics to capture latent long-range dependencies. Discrete skeleton features from the backbone network and continuous visual cues from Gaussian maps are finally integrated to predict action classes.

B. Kinematics-Driven Gaussian Splatting

Traditional sparse and discrete skeleton representations often overlook motion blur effects induced by velocity variations, which encapsulate rich temporal dynamic information. To address this limitation, the Kinematics-Driven Gaussian Splatting Module (KGSM) leverages instantaneous kinematic states to dynamically construct anisotropic Gaussian distributions.

Kinematic State Extraction and Normalization: To mitigate scale discrepancies across datasets, we implement an adaptive skeleton normalization strategy. Specifically, for a given frame $t$, we calculate the geometric center and translate the skeleton to the origin. Subsequently, we scale the coordinates to the interval $[-\rho, \rho]$ based on the maximum bounding-sphere radius to obtain the normalized coordinates ($\rho$ is set to 0.8).
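The centering-and-scaling step above can be sketched as follows; this is a minimal NumPy sketch under our reading of the text (the function name and the degenerate-frame guard are our additions), not the authors' released code.

```python
import numpy as np

def normalize_skeleton(joints, rho=0.8):
    """Center a (V, 3) joint array at its geometric center and scale
    it into the rho-ball by the maximum bounding-sphere radius."""
    center = joints.mean(axis=0)
    centered = joints - center
    radius = np.linalg.norm(centered, axis=1).max()
    if radius < 1e-8:           # degenerate frame: all joints coincide
        return centered
    return centered * (rho / radius)

frame = np.random.randn(25, 3)
norm = normalize_skeleton(frame)
# Centered at the origin, and every joint lies within the rho-sphere.
assert np.allclose(norm.mean(axis=0), 0.0, atol=1e-9)
assert np.linalg.norm(norm, axis=1).max() <= 0.8 + 1e-6
```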
Furthermore, to explicitly model motion trends, we compute the instantaneous velocity vector $\mathbf{v}_t^i$ for each joint:

$\mathbf{v}_t^i = \mathbf{p}_t^i - \mathbf{p}_{t-1}^i$   (2)

where $\mathbf{v}_t^i$ and $\mathbf{p}_t^i$ represent the velocity and position of the $i$-th joint in the $t$-th frame.

Dynamic Anisotropic Covariance Construction: This represents the core innovation of KGS-GCN. Unlike traditional Gaussian heatmaps that rely on fixed variances, we dynamically adapt the shape and orientation of Gaussian kernels according to joint velocities. Specifically, for each joint, we construct a 2D covariance matrix $\Sigma_t^i$ defined by the interplay of the rotation matrix $\mathbf{R}$ and the scaling matrix $\mathbf{S}$, formulated as:

$\Sigma_t^i = \mathbf{R} \mathbf{S} \mathbf{S}^{\top} \mathbf{R}^{\top}$   (3)

Regarding the construction of the scaling matrix $\mathbf{S} = \mathrm{diag}(\sigma_{\parallel}, \sigma_{\perp})$, we simulate motion blur by stretching the scale $\sigma_{\parallel}$ along the direction of motion as the velocity magnitude $\|\mathbf{v}_t^i\|$ increases, while preserving the base scale $\sigma_0$ in the perpendicular direction. We formulate this process as:

$\sigma_{\parallel} = \sigma_0 \left(1 + \beta \, \|\mathbf{v}_t^i\|\right), \quad \sigma_{\perp} = \sigma_0$   (4)

The stretching degree is controlled by the hyperparameter $\beta$ (set to 2 in this work), whereas the adaptive baseline $\sigma_0$ is learned by the network. The rotation matrix $\mathbf{R}$ is determined by the direction of the velocity vector and is formulated as follows, utilizing the normalized velocity direction $(\cos\theta, \sin\theta) = \mathbf{v}_t^i / \|\mathbf{v}_t^i\|$:

$\mathbf{R} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$   (5)

According to the above definitions, the elements of $\Sigma_t^i$ after expansion can be represented as:

$\Sigma_t^i = \begin{bmatrix} \sigma_{\parallel}^2 \cos^2\theta + \sigma_{\perp}^2 \sin^2\theta & (\sigma_{\parallel}^2 - \sigma_{\perp}^2)\sin\theta\cos\theta \\ (\sigma_{\parallel}^2 - \sigma_{\perp}^2)\sin\theta\cos\theta & \sigma_{\parallel}^2 \sin^2\theta + \sigma_{\perp}^2 \cos^2\theta \end{bmatrix}$   (6)

Through this formulation, stationary joints manifest as isotropic circular distributions, while rapidly moving joints appear as ellipses elongated along their motion trajectories. Consequently, this mechanism explicitly encodes temporal motion intensity directly within the spatial domain.
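The covariance construction above can be sketched as follows. This is a NumPy sketch under stated assumptions: the linear stretching law and the default values ($\sigma_0 = e^{-2}$, $\beta = 2$) follow our reading of Eq. (4) and the implementation details, and the stationary-joint fallback direction is our choice.

```python
import numpy as np

def anisotropic_covariance(v, sigma0=np.exp(-2.0), beta=2.0):
    """Build the 2-D covariance Sigma = R S S^T R^T of Eqs. (3)-(6)
    from a joint's 2-D velocity v (sketch; stretching law assumed)."""
    speed = np.linalg.norm(v)
    sigma_par = sigma0 * (1.0 + beta * speed)   # stretched along motion
    sigma_perp = sigma0                          # base scale across motion
    if speed < 1e-8:
        d = np.array([1.0, 0.0])                 # stationary joint: any direction
    else:
        d = v / speed
    cos_t, sin_t = d
    R = np.array([[cos_t, -sin_t],
                  [sin_t,  cos_t]])
    S = np.diag([sigma_par, sigma_perp])
    return R @ S @ S.T @ R.T

# Stationary joint -> isotropic circle; fast joint -> elongated ellipse.
eig_iso = np.linalg.eigvalsh(anisotropic_covariance(np.zeros(2)))
eig_ani = np.linalg.eigvalsh(anisotropic_covariance(np.array([1.0, 0.0])))
assert np.isclose(eig_iso[0], eig_iso[1])   # equal axes when v = 0
assert eig_ani[1] > eig_ani[0]              # elongated along the motion
```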
Multi-view Rendering: To process 3D skeleton data, the 3D space is projected onto three orthogonal planes ($xy$, $yz$, $xz$), and 2D Gaussian splatting is independently executed on each view. For an arbitrary pixel point $\mathbf{u} = (x, y)$ on the plane, the response intensity $h_t^i(\mathbf{u})$ generated by the $i$-th joint at the $t$-th frame follows a multivariate Gaussian distribution:

$h_t^i(\mathbf{u}) = \exp\!\left(-\tfrac{1}{2}\,(\mathbf{u} - \boldsymbol{\mu}_t^i)^{\top} (\Sigma_t^i)^{-1} (\mathbf{u} - \boldsymbol{\mu}_t^i)\right)$   (7)

where $\boldsymbol{\mu}_t^i$ denotes the mean vector of the Gaussian distribution, corresponding physically to the center position of the joint within the image coordinate system. The heatmap $H_t$ is formulated as the aggregation of responses from all joints for the final representation:

$H_t(\mathbf{u}) = \max_i \, h_t^i(\mathbf{u})$   (8)

Sparse skeleton sequences are transformed into multi-channel continuous visual representations via this process.

C. Probabilistic Topology Construction

Latent dependencies among physically unconnected joints are frequently overlooked by physical connection graphs. We propose the construction of a probabilistic topology utilizing the statistical parameters generated by KGSM to address this limitation. Each joint is modeled as a probability distribution $\mathcal{N}(\boldsymbol{\mu}_i, \Sigma_i)$, wherein the correlation between joints $i$ and $j$ is quantified via the Bhattacharyya distance between their respective distributions. The analytical form is expressed as:

$D_B(i, j) = \tfrac{1}{8}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^{\top} \bar{\Sigma}^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) + \tfrac{1}{2} \ln\!\left(\frac{\det \bar{\Sigma}}{\sqrt{\det \Sigma_i \, \det \Sigma_j}}\right)$   (9)

The first term of (9) quantifies the spatial separation between joints, whereas the second term measures the shape discrepancy between the two distributions.
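The per-view rasterization of Eqs. (7)-(8) can be sketched as follows. This is a sketch under stated assumptions: max-aggregation across joints and the grid resolution are our choices for illustration, not values confirmed by the paper.

```python
import numpy as np

def render_heatmap(mus, covs, size=56):
    """Rasterize joints as 2-D Gaussians on a size x size grid and
    max-aggregate them into one heatmap (sketch of Eqs. (7)-(8))."""
    ys, xs = np.mgrid[0:size, 0:size]
    grid = np.stack([xs, ys], axis=-1).astype(np.float64)   # (H, W, 2)
    heat = np.zeros((size, size))
    for mu, cov in zip(mus, covs):
        diff = grid - mu                                    # (H, W, 2)
        inv = np.linalg.inv(cov)
        # Mahalanobis term (u - mu)^T Sigma^{-1} (u - mu) per pixel
        m = np.einsum('hwi,ij,hwj->hw', diff, inv, diff)
        heat = np.maximum(heat, np.exp(-0.5 * m))
    return heat

mus = [np.array([20.0, 20.0]), np.array([40.0, 30.0])]
covs = [np.eye(2) * 4.0, np.array([[9.0, 0.0], [0.0, 1.0]])]
H = render_heatmap(mus, covs)
assert H.shape == (56, 56)
assert np.isclose(H[20, 20], 1.0)   # unit peak at the first joint center
```

Running this per frame and per orthogonal plane yields the multi-channel continuous representation fed to the visual branch.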
$\bar{\Sigma}$ is expressed as:

$\bar{\Sigma} = \tfrac{1}{2}\left(\Sigma_i + \Sigma_j\right)$   (10)

Based on the Bhattacharyya distance $D_B$, we construct an adaptive prior adjacency matrix $A^{prob} \in \mathbb{R}^{V \times V}$:

$A^{prob}_{ij} = \exp\!\left(-D_B(i, j)\right)$   (11)

Long-range dependencies among joints based on motion statistical characteristics are adaptively captured by this matrix $A^{prob}$. The matrix is subsequently injected into the following graph convolutional network as prior knowledge.

D. Visual-Context Modulated GCN

We construct the skeleton recognition network by stacking $L$ spatio-temporal graph convolutional modules (ST-Blocks). As illustrated in Fig. 1, each ST-Block adopts the classic spatial-temporal decoupling design, consisting of two sequential sub-stages: the Visually-Enhanced Spatial GCN and the Multi-Scale Temporal TCN. The spatial modeling phase targets the capture of intra-frame joint dependencies. To this end, we employ a channel-level topology refinement mechanism and integrate visual context gating to achieve deep multi-modal feature fusion. Initially, we process the rendered input $H$ via a lightweight CNN visual branch. This module incorporates multiple convolutional layers and downsampling operations to extract high-level visual semantic features. Subsequently, we apply global average pooling and linear projection to the output features to yield the visual context feature $\mathbf{c}_v$.

Visual Context Gating: This work designs a visual context gating mechanism within the GCN layer to achieve deep fusion of skeleton and visual features. Let $F$ denote the intermediate feature of the GCN layer. $\mathbf{c}_v$ is aligned and expanded to the identical dimension to generate modulation coefficients via a nonlinear gating network:

$\mathbf{g} = \sigma\!\left(W_g \, \mathbf{c}_v\right)$   (12)

where $\sigma$ is the sigmoid activation function and $W_g$ is the learnable projection weight.
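The probabilistic topology of Eqs. (9)-(11) can be sketched as follows. The closed-form Bhattacharyya distance is standard; the plain $\exp(-D_B)$ mapping to edge weights follows our reading of Eq. (11) and should be treated as an assumption (the paper may additionally normalize).

```python
import numpy as np

def bhattacharyya(mu_i, cov_i, mu_j, cov_j):
    """Bhattacharyya distance between two Gaussians, Eqs. (9)-(10)."""
    cov_bar = 0.5 * (cov_i + cov_j)
    diff = mu_i - mu_j
    term1 = 0.125 * diff @ np.linalg.inv(cov_bar) @ diff
    term2 = 0.5 * np.log(np.linalg.det(cov_bar) /
                         np.sqrt(np.linalg.det(cov_i) * np.linalg.det(cov_j)))
    return term1 + term2

def prior_adjacency(mus, covs):
    """Adaptive prior adjacency of Eq. (11): statistically closer
    joint distributions receive larger edge weights (sketch)."""
    V = len(mus)
    A = np.zeros((V, V))
    for i in range(V):
        for j in range(V):
            A[i, j] = np.exp(-bhattacharyya(mus[i], covs[i], mus[j], covs[j]))
    return A

mus = [np.zeros(2), np.array([3.0, 0.0]), np.array([0.1, 0.0])]
covs = [np.eye(2)] * 3
A = prior_adjacency(mus, covs)
assert np.isclose(A[0, 0], 1.0)   # identical distributions: D_B = 0
assert A[0, 2] > A[0, 1]          # nearer joints correlate more strongly
```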
The final fused features are calculated as follows:

$F' = F \odot (1 + \mathbf{g})$   (13)

where $\odot$ denotes element-wise multiplication. Specific skeleton channel features are adaptively enhanced or suppressed according to the current action context via this residual gating mechanism, achieving the complementarity of multi-modal information.

Graph Convolution and Topology Fusion: The total adjacency matrix $A$ within each graph convolution layer is composed of the predefined physical graph $A^{phys}$, the network-learned graph $A^{learn}$, and the probabilistic topology $A^{prob}$ generated by our method. The feature aggregation process is formulated as:

$F^{(l+1)} = \sigma\!\left(\left(A^{phys} + A^{learn} + \alpha A^{prob}\right) F^{(l)} W^{(l)}\right)$   (14)

where $\alpha$ is a learnable scaling factor used to dynamically adjust the importance of the prior probabilistic topology.

Multi-Scale Temporal Convolution: This work designs a Multi-Scale Temporal Convolution Module (MS-TCN) along the temporal dimension to capture action patterns of varying durations. This module comprises multiple parallel convolution branches utilizing distinct dilation rates $d \in \mathcal{D}$. The temporal output $F_{out}$ is defined as the aggregation of outputs from the respective branches for the fused feature $F'$:

$F_{out} = \sum_{d \in \mathcal{D}} \mathrm{TConv}_d\!\left(F'\right)$   (15)

MS-TCN simultaneously captures short-term transient changes, such as kicking instants, and long-term action dependencies, like walking periodicity, via the combination of convolution kernels with diverse receptive fields. Unified modeling of complex spatiotemporal dynamics is thereby achieved.

E. Loss Function

A multi-task loss function is adopted for model training. The standard cross-entropy loss $\mathcal{L}_{cls}$ serves as the primary loss for the classification task. We introduce a topology consistency regularization term $\mathcal{L}_{topo}$ to constrain the learned topology $A^{learn}$ within the GCN from deviating from statistical data regularities.
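The gating and topology-fused aggregation of Eqs. (12)-(14) can be sketched as follows. This is a NumPy sketch under stated assumptions: the tensor shapes, the ReLU choice for $\sigma$ in Eq. (14), and the function names are ours for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def visual_context_gate(F, c_v, W_g):
    """Residual visual gating of Eqs. (12)-(13):
    g = sigmoid(W_g c_v) modulates channels, F' = F * (1 + g)."""
    g = sigmoid(W_g @ c_v)                    # (C,) gate per channel
    return F * (1.0 + g)[None, :, None]       # broadcast over (N, C, V)

def fused_graph_conv(F, A_phys, A_learn, A_prob, W, alpha=0.5):
    """Topology-fused spatial aggregation of Eq. (14), with ReLU as
    the nonlinearity (our choice); F has shape (N, V, C)."""
    A = A_phys + A_learn + alpha * A_prob
    return np.maximum(np.einsum('uv,nvc->nuc', A, F) @ W, 0.0)

N, V, C, D = 2, 25, 8, 16
gated = visual_context_gate(rng.standard_normal((N, C, V)),
                            rng.standard_normal(D),
                            rng.standard_normal((C, D)))
out = fused_graph_conv(rng.standard_normal((N, V, C)),
                       np.eye(V), np.zeros((V, V)), np.eye(V),
                       rng.standard_normal((C, C)))
assert gated.shape == (N, C, V) and out.shape == (N, V, C)
```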
The term is formulated as:

$\mathcal{L}_{topo} = \frac{1}{L} \sum_{l=1}^{L} \left\| \sigma\!\left(A^{learn,(l)}\right) - A^{prob} \right\|_F^2$   (16)

where $L$ denotes the number of GCN layers and $\sigma$ represents the sigmoid activation function. $A^{prob}$ is treated as the pseudo label. The total loss function is formulated as:

$\mathcal{L} = \mathcal{L}_{cls} + \lambda(e) \, \mathcal{L}_{topo}$   (17)

where $\lambda(e)$ denotes the weight coefficient dependent on epoch $e$. $\lambda(e)$ is initialized to a small value during the initial training phase to allow for free network exploration, whereas the weight is gradually increased as training proceeds to enforce the constraining effect of the statistical prior.

IV. EXPERIMENTS

A. Experimental Setup

Datasets: Two widely used benchmark datasets in the action understanding domain are selected to comprehensively evaluate the effectiveness and generalization capability of the KGS-GCN model for action recognition tasks: Penn Action and NTU RGB+D. The Penn Action dataset centers on routine sports sequences and typically provides rich human pose annotations alongside action category information. It is suitable for verifying model performance regarding pose variations and motion detail modeling. NTU RGB+D represents one of the most widely adopted large-scale benchmarks for 3D skeleton action recognition, featuring diverse motion categories and scene variations. We utilize this dataset to systematically assess the robustness and cross-scenario generalization of KGS-GCN under complex conditions. Consequently, the joint evaluation on these benchmarks allows us to objectively verify the model's performance in terms of both fine-grained action modeling and stability within realistic, complex environments.

Evaluation Metrics: To align with standard comparison protocols in action recognition, we exclusively employ Top-1 Accuracy as the quantitative metric.
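The multi-task objective of Eqs. (16)-(17), together with the epoch-dependent ramp-up of $\lambda(e)$ described above, can be sketched as follows. The exact ramp form and the values for the maximum weight and warm-up length are our assumptions for illustration.

```python
import numpy as np

def topo_consistency_loss(A_learn_layers, A_prob):
    """Topology consistency term of Eq. (16): mean squared Frobenius
    gap between each layer's sigmoid-squashed learned topology and
    the probabilistic prior used as a pseudo label."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    L = len(A_learn_layers)
    return sum(np.sum((sig(A) - A_prob) ** 2) for A in A_learn_layers) / L

def lam_schedule(epoch, lam_max=0.1, warmup=20):
    """Sketch of the ramp-up weight lambda(e): near zero early on,
    then growing linearly to lam_max (form and values assumed)."""
    return lam_max * min(1.0, epoch / warmup)

A_prob = np.full((4, 4), 0.5)
layers = [np.zeros((4, 4))] * 3          # sigmoid(0) = 0.5 -> zero gap
assert np.isclose(topo_consistency_loss(layers, A_prob), 0.0)
assert lam_schedule(0) == 0.0
assert np.isclose(lam_schedule(40), 0.1)
```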
This choice highlights the core performance of the model in terms of classification correctness. This metric measures the proportion of samples where the highest-probability prediction matches the ground truth. It directly indicates the model's recognition capability under standard classification settings, thereby facilitating consistent and fair comparisons across diverse experimental configurations.

Implementation Details: We implemented the KGS-GCN architecture based on the PyTorch framework. All training and inference phases were executed on a single NVIDIA RTX 4060 GPU. To optimize computational efficiency, we employed mixed-precision training. To ensure experimental reproducibility, we detail the parameter settings and training strategies as follows.

Network Architecture Configuration: We construct the KGS-GCN backbone using 10 stacked spatial-temporal graph convolution modules. The feature channels are set to 64, 128, and 256 for layers 1-4, 5-7, and 8-10, respectively. To expand the temporal receptive field and reduce computational overhead, we apply a temporal convolution stride of 2 in the 5th and 8th layers, while maintaining a stride of 1 elsewhere. Furthermore, we fix the rendered heatmap resolution and initialize the learnable log-scale parameter to -2.0. To facilitate gradient backpropagation during the initial training phase, we configure the velocity stretching coefficient to 2.0, ensuring small variance in the generated heatmaps. For the classification head, we project the encoded visual features into 128-dimensional vectors.
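The kinematics-driven covariance underlying the splatting module (a Gaussian stretched along the instantaneous joint velocity, with the base scale taken from the learnable log-scale parameter initialized to -2.0 and a velocity stretching coefficient of 2.0) can be sketched as follows. Only those two hyperparameter values come from the text; the exact parameterization below is an assumption for illustration.

```python
import numpy as np

def velocity_covariance(v, log_scale=-2.0, beta=2.0):
    """Anisotropic 2D covariance aligned with the instantaneous joint
    velocity v. The base standard deviation comes from the learnable
    log-scale; beta is the velocity stretching coefficient. The precise
    functional form here is an illustrative assumption."""
    s = np.exp(log_scale)                 # base standard deviation
    speed = np.linalg.norm(v)
    if speed < 1e-8:                      # static joint -> isotropic Gaussian
        return (s ** 2) * np.eye(2)
    u = v / speed                         # unit vector along the motion
    n = np.array([-u[1], u[0]])           # perpendicular direction
    sigma_par = s * (1.0 + beta * speed)  # stretched along the motion
    sigma_perp = s                        # unchanged across the motion
    R = np.stack([u, n], axis=1)          # rotate into the velocity frame
    D = np.diag([sigma_par ** 2, sigma_perp ** 2])
    return R @ D @ R.T                    # symmetric positive-definite

cov = velocity_covariance(np.array([0.3, 0.1]))
print(np.round(cov, 5))
```

With the small -2.0 initialization the base scale is exp(-2.0) ≈ 0.135, so early-training Gaussians stay narrow, consistent with the "small variance in the generated heatmaps" noted above; faster joints elongate the footprint along their motion direction, mimicking motion blur.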
Subsequently, we perform global average pooling and concatenate these visual vectors with the 256-dimensional skeleton features derived from the 10th layer, forwarding the fused representation to the fully connected layer.

Training Strategy: We train the network end-to-end using the SGD optimizer, with Nesterov momentum set to 0.9 and weight decay applied. We initialize the base learning rate at 0.05 and apply a multi-step decay schedule, scaling the rate by a factor of 0.1 at the 40th and 60th epochs. To prevent gradient instability during the early phase, we implement a linear warm-up strategy for the first 10 epochs, increasing the learning rate linearly from 0 to the base value. Regarding the loss function, we adopt a dynamic adjustment strategy, where $\lambda(t)$ is formulated as:

$$\lambda(t) = \lambda_{max} \cdot \min\!\left(\frac{t}{T_w},\, 1\right) \tag{18}$$

where $\lambda_{max}$ denotes the maximum regularization weight and $T_w$ the warm-up horizon. Topological constraints are thus imposed only gradually during the initial training phase, providing a buffer period for adaptive network adjustment.

B. Performance Comparison

We benchmark the proposed method against state-of-the-art approaches on the NTU [44], Penn Action [45], and NW-UCLA [46] datasets, with detailed comparisons provided in Table I. The experimental results highlight the significant advantages of our approach across all datasets. This performance is particularly noteworthy as we present the first framework to integrate Gaussian Splatting into this domain.

Specifically, KGS-GCN achieves superior performance across various datasets while requiring only 1.4M parameters and 1.3 GFLOPs. On the NTU-60 benchmark, the model attains 92.8% accuracy on the x-sub split and 97.2% on the x-view split, trailing the best-performing FreqMixFormer on x-view by a marginal gap of only 0.2%. Furthermore, on the NTU-120 x-sub and x-set benchmarks, KGS-GCN achieves accuracies of 88.9% and 90.8%, respectively, comparable to leading state-of-the-art methods. Given the substantial scale of the NTU-120 dataset and the lightweight nature of our model, these results validate the efficacy of the proposed Gaussian Splatting module and the probabilistic topology construction strategy.

KGS-GCN exhibits remarkable generalization capabilities on smaller-scale datasets such as NW-UCLA and Penn Action. Specifically, on the NW-UCLA benchmark, our model achieves exceptional performance, showing a marginal difference of only 0.1% relative to the runner-up. Similarly, on the Penn Action dataset, KGS-GCN secures the second rank, trailing FreqMixFormer by a mere 0.2%. In summary, KGS-GCN strikes an optimal balance between model complexity and inference performance, securing the second rank in parameter count and the top rank in computational efficiency. These results validate the efficacy of the proposed Gaussian Splatting module and probabilistic topology construction strategy, establishing a distinct competitive advantage over state-of-the-art methods.

TABLE I: Quantitative performance comparison results on datasets. Red and blue indicate the first and second best results respectively for each individual metric.

Methods | NTU-60 x-sub (%) | NTU-60 x-view (%) | NTU-120 x-sub (%) | NTU-120 x-set (%) | Penn Action (%) | NW-UCLA (%) | Params (M) | Flops (G)
MS-G3D [11] | 91.5 | 96.2 | 86.9 | 88.4 | 96.1 | - | 2.8 | 5.2
CTR-GCN [14] | 92.4 | 96.4 | 88.9 | 90.4 | 96.9 | 96.5 | 1.5 | 2.0
EfficientGCN [16] | 91.7 | 95.7 | 88.3 | 89.1 | 96.7 | - | 2.0 | 15.2
InfoGCN [15] | 92.8 | 96.7 | 89.2 | 90.7 | 96.5 | 96.6 | 1.6 | 1.8
FRHead [38] | 93.1 | 96.8 | 89.5 | 90.9 | 97.0 | 96.8 | 2.0 | -
BlockGCN [19] | 92.4 | 97.0 | 90.3 | 91.5 | 96.8 | 96.9 | 1.3 | 1.6
DeGCN [20] | 93.3 | 97.4 | 91.0 | 92.1 | 97.6 | 97.2 | 5.6 | -
ST-TR [37] | 90.8 | 96.3 | 85.1 | 87.1 | 96.3 | - | 12.1 | 259.4
TranSkeleton [39] | 92.8 | 97.0 | 89.4 | 90.5 | 96.7 | - | 2.2 | 9.2
Hyperformer [40] | 92.9 | 96.5 | 89.9 | 91.3 | 97.1 | 96.7 | 2.7 | 9.6
SkeMixFormer [41] | 93.0 | 97.1 | 90.1 | 91.3 | 99.2 | 97.4 | 2.1 | 4.8
SkateFormer [43] | 93.5 | 97.4 | 89.8 | 91.4 | 98.4 | 98.3 | 2.0 | 3.6
FreqMixFormer [42] | 93.6 | 97.4 | 90.5 | 91.9 | 99.7 | 97.4 | 2.0 | 64.4
KGS-GCN | 92.8 | 97.2 | 88.9 | 90.8 | 99.5 | 97.3 | 1.4 | 1.3

C. Ablation Study

To rigorously evaluate the contribution of each component within KGS-GCN, we conduct comprehensive ablation studies on the Penn Action and NW-UCLA datasets. We utilize CTR-GCN [14] as the baseline model for our backbone. Specifically, we examine the effectiveness of three key elements: the Kinematics-Driven Gaussian Splatting, the probabilistic topology construction strategy, and the visual context gating mechanism. To ensure fair comparisons, we maintain consistent training configurations across all experiments.

Contribution of Individual Components: We initially evaluate the impact of the core modules within KGS-GCN, specifically verifying the effectiveness of the Kinematics-Driven Gaussian Splatting Module (KGSM), the Probabilistic Topology Construction strategy (PT), and the Visual Context Gating mechanism (VCG). As indicated in Table II, each component yields consistent performance improvements. Most notably, the complete framework outperforms the baseline network by 2.6% on the Penn Action dataset and 0.8% on NW-UCLA.

TABLE II: Quantitative performance comparison results of core component contributions. Red and blue indicate the first and second best results respectively for each individual metric.

Method | +KGSM | +PT | +VCG | Penn Action (%) | NW-UCLA (%)
Baseline | × | × | × | 96.9 | 96.5
Baseline+KGSM | √ | × | × | 98.0 | 96.7
Baseline+KGSM+PT | √ | √ | × | 99.1 | 96.9
Baseline+KGSM+VCG | √ | × | √ | 98.7 | 97.1
KGS-GCN | √ | √ | √ | 99.5 | 97.3
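The probabilistic topology (PT) prior scores each joint pair by the Bhattacharyya distance between their Gaussian distributions. The snippet below uses the standard closed-form expression for two Gaussians; the exp(-distance) conversion into an edge weight is an assumed monotone mapping, not a formula stated in the text.

```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Closed-form Bhattacharyya distance between two Gaussian
    distributions N(mu1, cov1) and N(mu2, cov2)."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    # Mahalanobis-like term: separation of the means under the average covariance.
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Shape term: penalizes mismatched covariances even for coincident means.
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

def edge_weight(mu1, cov1, mu2, cov2):
    """Assumed mapping: higher distributional overlap -> stronger prior edge."""
    return np.exp(-bhattacharyya(mu1, cov1, mu2, cov2))

I = np.eye(2)
print(edge_weight(np.zeros(2), I, np.zeros(2), I))  # identical Gaussians -> 1.0
```

Unlike the Euclidean distance between joint coordinates, this distance accounts for each joint's covariance, so two fast-moving joints with overlapping motion footprints can receive a strong edge even when their mean positions are apart.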
These outcomes empirically validate the efficacy of our architectural design.

Splatting Strategy Analysis: A core motivation of KGS-GCN lies in explicitly modeling motion blur effects via anisotropic covariance matrices. To validate this design, we compare the proposed kinematics-driven strategy against a standard isotropic counterpart. The isotropic approach disregards velocity vectors, limiting the covariance matrix to a scaled identity matrix. As indicated in Table III, our kinematics-driven method surpasses the isotropic baseline by 1.6%, empirically confirming the efficacy of the proposed strategy.

Mechanism of Visual Feature Fusion: To identify the optimal strategy for integrating visual context into the GCN backbone, we compare three distinct fusion mechanisms: element-wise addition, concatenation, and our proposed Visual Context Gating (VCG). As presented in Table IV, simple addition and concatenation yield only marginal performance improvements. We attribute this limitation to potential background noise and spatial redundancy in the rendered heatmaps, which can corrupt high-level skeleton semantics during direct fusion. In contrast, the VCG mechanism effectively modulates the feature flow, outperforming the addition and concatenation strategies by 0.6% and 0.3%, respectively. Consequently, these results validate VCG as the superior choice for our framework.

Metric for Topology Construction: We examine the metrics used to construct probabilistic topology priors. Specifically, we benchmark the Bhattacharyya distance adopted in this work against the conventional Euclidean distance derived from joint coordinates. As shown in Table V, the Bhattacharyya distance yields superior performance, validating its selection for KGS-GCN.

V. CONCLUSION

We propose KGS-GCN, a graph convolutional network that integrates kinematics-driven Gaussian Splatting with probabilistic topology.
By conceptualizing skeleton data as generative sources of continuous visual signals, our approach significantly enhances the capacity to model complex spatiotemporal dynamics. Ultimately, this framework provides a novel perspective for the unified modeling of skeleton and visual features.

Our core contribution lies in the proposal of the Kinematics-Driven Gaussian Splatting Module. By leveraging instantaneous joint velocities, this module dynamically constructs anisotropic covariance matrices. This mechanism effectively recovers the motion blur and spatiotemporal continuity lost in discrete coordinates, thereby providing semantic inputs significantly richer than raw positional information. Furthermore, we challenge the convention of fixed or attention-based topologies by introducing a probabilistic topology construction strategy. By modeling joints as Gaussian distributions and quantifying their overlap via the Bhattacharyya distance, we derive a graph structure that captures intrinsic statistical motion correlations. This formulation proves robust against noise arising from non-physically connected joints. Ultimately, by employing rendered visual context to modulate geometric graph convolutions, our unified architecture establishes a novel synergy between low-level kinematics and high-level visual semantics.

While these results are encouraging, there remains potential for further investigation. In future work, we aim to develop lightweight approximation methods for probabilistic topology to enhance computational efficiency. Additionally, we intend to explore the application of KGS-GCN in generative tasks, where continuous Gaussian representations can provide superior smoothness and enhanced interpretability.

TABLE III: Ablation results of splatting strategy analysis. Red indicates the best result for each individual metric.
Strategy | Covariance Type | Motion-Aware | Penn Action (%)
Isotropic Splatting | Scaled Identity | × | 97.9
Kinematics-Driven | Anisotropic | √ | 99.5

TABLE IV: Ablation results of the visual feature fusion mechanism. Red indicates the best result for each individual metric.

Fusion Mechanism | Formula | Properties | Penn Action (%)
Element-wise Addition | $F_s + F_v$ | Equal Weight | 98.9
Concatenation | $\mathrm{Concat}(F_s, F_v)$ | Channel Expansion | 99.2
VCG | $F_s + F_s \odot \sigma(W_g F_v)$ | Adaptive Selection | 99.5

TABLE V: Ablation results of metric for topology construction. Red indicates the best result for each individual metric.

Metric | NW-UCLA (%) | Penn Action (%)
Euclidean Distance | 96.9 | 98.9
Bhattacharyya | 97.3 | 99.5

REFERENCES

[1] W. Xin, R. Liu, Y. Liu, et al., "Transformer for skeleton-based action recognition: A review of recent advances," Neurocomputing, vol. 537, pp. 164-186, 2023.
[2] R. Yue, Z. Tian, and S. Du, "Action recognition based on RGB and skeleton data sets: A survey," Neurocomputing, vol. 512, pp. 287-306, 2022.
[3] Y. Kong and Y. Fu, "Human action recognition and prediction: A survey," Int. J. Comput. Vis., vol. 130, no. 5, pp. 1366-1401, 2022.
[4] J. Zhang, L. Lin, S. Yang, et al., "Self-supervised skeleton-based action representation learning: A benchmark and beyond," arXiv preprint arXiv:2406.02978, 2024.
[5] S. Yan, Y. Xiong, and D. Lin, "Spatial temporal graph convolutional networks for skeleton-based action recognition," in Proc. AAAI Conf. Artif. Intell., vol. 32, no. 1, 2018.
[6] L. Shi, Y. Zhang, J. Cheng, et al., "Two-stream adaptive graph convolutional networks for skeleton-based action recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 12026-12035.
[7] M. Li, S. Chen, X. Chen, et al., "Actional-structural graph convolutional networks for skeleton-based action recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 3595-3603.
[8] L. Shi, Y. Zhang, J.
Cheng, et al., "Skeleton-based action recognition with directed graph neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 7912-7921.
[9] L. Shi, Y. Zhang, J. Cheng, et al., "Skeleton-based action recognition with multi-stream adaptive graph convolutional networks," IEEE Trans. Image Process., vol. 29, pp. 9532-9545, 2020.
[10] K. Cheng, Y. Zhang, X. He, et al., "Skeleton-based action recognition with shift graph convolutional network," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 183-192.
[11] Z. Liu, H. Zhang, Z. Chen, et al., "Disentangling and unifying graph convolutions for skeleton-based action recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 143-152.
[12] X. Zhang, C. Xu, and D. Tao, "Context aware graph convolution for skeleton-based action recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 14333-14342.
[13] W. Peng, X. Hong, H. Chen, et al., "Learning graph convolutional network for skeleton-based human action recognition by neural searching," in Proc. AAAI Conf. Artif. Intell., vol. 34, no. 3, 2020, pp. 2669-2676.
[14] Y. Chen, Z. Zhang, C. Yuan, et al., "Channel-wise topology refinement graph convolution for skeleton-based action recognition," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 13359-13368.
[15] H. Chi, M. H. Ha, S. Chi, et al., "InfoGCN: Representation learning for human skeleton-based action recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 20186-20196.
[16] Y.-F. Song, Z. Zhang, C. Shan, et al., "Constructing stronger and faster baselines for skeleton-based action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 2, pp. 1474-1488, 2022.
[17] J. Lee, M. Lee, D. Lee, et al., "Hierarchically decomposed graph convolutional networks for skeleton-based action recognition," in Proc.
IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2023, pp. 10444-10453.
[18] Y. Zhou, Z.-Q. Cheng, J.-Y. He, et al., "Overcoming topology agnosticism: Enhancing skeleton-based action recognition through redefined skeletal topology awareness," arXiv preprint arXiv:2305.11468, 2023.
[19] Y. Zhou, X. Yan, Z.-Q. Cheng, et al., "BlockGCN: Redefine topology awareness for skeleton-based action recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 2049-2058.
[20] W. Myung, N. Su, J.-H. Xue, et al., "DeGCN: Deformable graph convolutional networks for skeleton-based action recognition," IEEE Trans. Image Process., vol. 33, pp. 2477-2490, 2024.
[21] J. Do and M. Kim, "SkateFormer: Skeletal-temporal transformer for human action recognition," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2024, pp. 401-420.
[22] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," Commun. ACM, vol. 65, no. 1, pp. 99-106, 2021.
[23] V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein, "Implicit neural representations with periodic activation functions," in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 33, 2020, pp. 7462-7473.
[24] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, "3D Gaussian splatting for real-time radiance field rendering," ACM Trans. Graph., vol. 42, no. 4, Art. no. 139, 2023.
[25] B. Huang, Z. Yu, A. Chen, A. Geiger, and S. Gao, "2D Gaussian splatting for geometrically accurate radiance fields," in ACM SIGGRAPH Conf. Papers, 2024, pp. 1-11.
[26] X. Zhang, X. Ge, T. Xu, et al., "GaussianImage: 1000 FPS image representation and compression by 2D Gaussian splatting," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2024, pp. 327-345.
[27] L. Zhu, G. Lin, J. Chen, et al., "Large images are Gaussians: High-quality large image representation with levels of 2D Gaussian splatting," in Proc.
AAAI Conf. Artif. Intell. (AAAI), 2025, pp. 10977-10985.
[28] Z. Zeng, Y. Wang, C. Yang, T. Guan, and L. Ju, "Instant GaussianImage: A generalizable and self-adaptive image representation via 2D Gaussian splatting," arXiv preprint arXiv:2506.23479, 2025.
[29] C. Jiang, Z. Li, H. Zhao, Q. Shan, S. Wu, and J. Su, "Beyond pixels: Efficient dataset distillation via sparse Gaussian representation," arXiv preprint arXiv:2509.26219, 2025.
[30] A. Hanson, A. Tu, G. Lin, V. Singla, M. Zwicker, and T. Goldstein, "Speedy-Splat: Fast 3D Gaussian splatting with sparse pixels and sparse primitives," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025, pp. 21537-21546.
[31] K. Cheng, X. Long, K. Yang, Y. Yao, W. Yin, Y. Ma, et al., "GaussianPro: 3D Gaussian splatting with progressive propagation," in Proc. Int. Conf. Mach. Learn. (ICML), 2024.
[32] Y. Liu, C. Luo, L. Fan, N. Wang, J. Peng, and Z. Zhang, "CityGaussian: Real-time high-quality large-scale scene rendering with Gaussians," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2024, pp. 265-282.
[33] Y. Yan, H. Lin, C. Zhou, W. Wang, H. Sun, K. Zhan, et al., "Street Gaussians: Modeling dynamic urban scenes with Gaussian splatting," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2024, pp. 156-173.
[34] T. Liu, G. Wang, S. Hu, L. Shen, X. Ye, Y. Zang, et al., "MVSGaussian: Fast generalizable Gaussian splatting reconstruction from multi-view stereo," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2024, pp. 37-53.
[35] X. Zhou, Z. Lin, X. Shan, Y. Wang, D. Sun, and M.-H. Yang, "DrivingGaussian: Composite Gaussian splatting for surrounding dynamic autonomous driving scenes," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 21634-21643.
[36] J. Fan, W. Li, Y. Han, T. Dai, and Y. Tang, "Momentum-GS: Momentum Gaussian self-distillation for high-quality large scene reconstruction," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2025, pp. 25250-25260.
[37] C. Plizzari, M.
Cannici, and M. Matteucci, "Skeleton-based action recognition via spatial and temporal transformer networks," Comput. Vis. Image Understand., vol. 208, Art. no. 103219, 2021.
[38] H. Zhou, Q. Liu, and Y. Wang, "Learning discriminative representations for skeleton based action recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023, pp. 10608-10617.
[39] H. Liu, Y. Liu, Y. Chen, C. Yuan, B. Li, and W. Hu, "TranSkeleton: Hierarchical spatial-temporal transformer for skeleton-based action recognition," IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 8, pp. 4137-4148, 2023.
[40] Y. Zhou, Z.-Q. Cheng, C. Li, Y. Fang, Y. Geng, X. Xie, and M. Keuper, "Hypergraph transformer for skeleton-based action recognition," arXiv preprint arXiv:2211.09590, 2022.
[41] W. Xin, Q. Miao, Y. Liu, R. Liu, C.-M. Pun, and C. Shi, "Skeleton MixFormer: Multivariate topology representation for skeleton-based action recognition," in Proc. ACM Int. Conf. Multimedia (ACM MM), 2023, pp. 2211-2220.
[42] W. Wu, C. Zheng, Z. Yang, C. Chen, S. Das, and A. Lu, "Frequency guidance matters: Skeletal action recognition by frequency-aware mixed transformer," in Proc. ACM Int. Conf. Multimedia (ACM MM), 2024, pp. 4660-4669.
[43] J. Do and M. Kim, "SkateFormer: Skeletal-temporal transformer for human action recognition," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2024, pp. 401-420.
[44] A. Shahroudy, J. Liu, T.-T. Ng, et al., "NTU RGB+D: A large scale dataset for 3D human activity analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 1010-1019.
[45] W. Zhang, M. Zhu, and K. G. Derpanis, "From actemes to action: A strongly-supervised representation for detailed action understanding," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2013, pp. 2248-2255.
[46] J. Wang, X. Nie, Y. Xia, et al., "Cross-view action modeling, learning and recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
(CVPR), 2014, pp. 2649-2656.

Yuhan Chen received his master's degree in 2024 from the College of Mechanical Engineering at Chongqing University of Technology. He is currently pursuing the Ph.D. degree in the College of Mechanical and Vehicle Engineering at Chongqing University, China. His research interests include deep learning, low-level vision, and Gaussian Splatting.

Yicui Shi received the B.E. degree in Automotive Engineering from Chongqing University in 2025. He is currently pursuing the M.S. degree in Automotive Engineering at Chongqing University, Chongqing, China. His research interests include computer vision and Gaussian Splatting.

Guofa Li received the Ph.D. degree in Mechanical Engineering from Tsinghua University, China, in 2016. He is currently a Professor with Chongqing University, China. His research interests include environment perception, driver behavior analysis, and smart decision-making based on artificial intelligence technologies in autonomous vehicles and intelligent transportation systems. He serves as an Associate Editor for IEEE Transactions on Intelligent Transportation Systems, IEEE Transactions on Affective Computing, and IEEE Sensors Journal.

Liping Zhang is currently a tenured Professor in the Department of Mathematical Sciences, Tsinghua University. She received her Ph.D. degree from the Academy of Mathematics and Systems Science, Chinese Academy of Sciences, in 2001. Her research interests include continuous optimization, tensor analysis and computation, and machine learning. She has published more than 70 research papers in international journals such as Mathematical Programming, SIAM Journal on Optimization, Mathematics of Computation, Mathematics of Operations Research, SIAM Journal on Matrix Analysis and Applications, Journal of Machine Learning Research, Expert Systems with Applications, etc.

Jie Li received the Ph.D.
degree in mechanical engineering from Tsinghua University, Beijing, China, in 2024. He is currently an Associate Professor with the College of Mechanical and Vehicle Engineering, Chongqing University, Chongqing, China. His research interests include model predictive control, adaptive dynamic programming, and reinforcement learning.

Jiaxin Gao received his B.S. and Ph.D. degrees from the University of Science & Technology Beijing in 2017 and 2023, respectively. He is currently an Assistant Researcher at the School of Vehicle and Mobility, Tsinghua University. His current research interests focus on reinforcement learning, decision and control for autonomous vehicles, and perceptual data generation for autonomous driving.

Wenbo Chu received his B.S. degree in Automotive Engineering from Tsinghua University, China, in 2008, his M.S. degree in Automotive Engineering from RWTH Aachen, Germany, and his Ph.D. degree in Mechanical Engineering from Tsinghua University, China, in 2014. He is currently a research fellow at the Western China Science City Innovation Center of Intelligent and Connected Vehicles (Chongqing) Co., Ltd., and the National Innovation Center of Intelligent and Connected Vehicles.
