WiFlow: A Lightweight WiFi-based Continuous Human Pose Estimation Network with Spatio-Temporal Feature Decoupling

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original ArXiv source.

Human pose estimation is fundamental to intelligent perception in the Internet of Things (IoT), enabling applications ranging from smart healthcare to human-computer interaction. While WiFi-based methods have gained traction, they often struggle with continuous motion and high computational overhead. This work presents WiFlow, a novel framework for continuous human pose estimation using WiFi signals. Unlike approaches borrowed from vision, such as two-dimensional deep residual networks, which treat Channel State Information (CSI) as images, WiFlow employs an encoder-decoder architecture. The encoder captures spatio-temporal features of CSI using temporal and asymmetric convolutions, preserving the original sequential structure of the signals. It then refines per-keypoint features and captures their structural dependencies via axial attention. The decoder subsequently maps the encoded high-dimensional features into keypoint coordinates. Trained on a self-collected dataset of 360,000 synchronized CSI-pose samples from 5 subjects performing continuous sequences of 8 daily activities, WiFlow achieves a Percentage of Correct Keypoints (PCK) of 97.00% at a threshold of 20% (PCK@20) and 99.48% at PCK@50, with a mean per-joint position error of 0.008 m. With only 4.82M parameters, WiFlow significantly reduces model complexity and computational cost, establishing a new performance baseline for practical WiFi-based human pose estimation. Our code and datasets are available at https://github.com/DY2434/WiFlow-WiFi-Pose-Estimation-with-Spatio-Temporal-Decoupling.git.


💡 Research Summary

Human pose estimation (HPE) is a cornerstone technology for many IoT applications, yet most existing solutions rely on cameras or wearable devices, which suffer from privacy concerns, lighting constraints, or user inconvenience. WiFi‑based sensing, using Channel State Information (CSI), offers a low‑cost, non‑contact alternative, but prior works have struggled with two fundamental issues: (1) they treat CSI as a static 2‑D image and thus ignore its intrinsic temporal dynamics, and (2) the models are often large and computationally heavy, making real‑time edge deployment impractical.

The paper introduces WiFlow, a lightweight encoder‑decoder network specifically designed for continuous human pose estimation from WiFi CSI. The authors first collect a large‑scale dataset of 360,000 synchronized CSI‑pose pairs from five volunteers performing eight daily activities in continuous sequences. CSI is captured with an Intel 5300 NIC (3 Tx × 2 Rx, 30 sub‑carriers per link) at 600 Hz and synchronized to 30 FPS video. Only amplitude information is used, discarding phase to avoid costly calibration. The resulting input tensor has shape 540 × 20 (sub‑carriers × time).
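The amplitude-only preprocessing can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the row layout (every antenna/sub-carrier combination stacked into 540 rows) and the non-overlapping 20-frame windowing are assumptions, since the exact packet-to-window mapping is not given in the summary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical complex CSI stream: n_rows stacks every antenna/sub-carrier
# combination (540 rows in the paper's setup), one column per received packet.
n_rows, n_packets, win = 540, 1200, 20
csi = rng.standard_normal((n_rows, n_packets)) + 1j * rng.standard_normal((n_rows, n_packets))

amp = np.abs(csi)  # keep amplitude only; phase is discarded, so no calibration step

# Cut the stream into non-overlapping 20-frame windows -> the 540 x 20 inputs.
windows = np.stack([amp[:, i:i + win] for i in range(0, n_packets - win + 1, win)])
print(windows.shape)  # (60, 540, 20)
```

Dropping phase sidesteps the carrier-frequency-offset calibration that phase-based methods require, at the cost of discarding some geometric information.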

Key architectural innovations:

  1. Temporal Convolutional Network (TCN) – A stack of causal, dilated 1‑D convolutions extracts temporal patterns while preserving causality. Exponential dilation allows the network to cover the whole 20‑frame window with few layers. Simultaneously, a progressive channel‑compression mechanism screens out sub‑carriers that carry little pose‑related information, reducing redundancy.

  2. Asymmetric Convolutional Blocks – After temporal encoding, spatial correlations among sub‑carriers are captured using 1 × k kernels that operate only on the sub‑carrier dimension, leaving the temporal axis untouched. Inspired by U‑Net, the network expands channel depth while progressively compressing the sub‑carrier dimension down to the number of keypoints (K). Each output channel thus directly corresponds to a specific joint, embedding structural constraints into the representation.

  3. Axial Self‑Attention – To model intra‑keypoint feature refinement and inter‑keypoint dependencies, the authors adopt axial attention, which decomposes full 2‑D self‑attention into two 1‑D attentions (width and height). The first stage attends along the temporal (width) axis for each keypoint, highlighting salient temporal features; the second stage attends along the keypoint (height) axis, capturing the relational structure of the human skeleton. Grouped attention further diversifies the patterns each head can learn while keeping computational cost low (O(H²W + HW²) instead of O(H²W²)).

  4. Decoder – A lightweight series of 1 × 1 convolutions maps the high‑dimensional feature map (B × C × K × T) directly to 2‑D joint coordinates, trained with an L2 regression loss. No post‑processing such as heat‑map decoding is required, enabling fast inference.
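The four components above can be sketched with toy NumPy tensors. This is an untrained shape-level illustration, not the authors' implementation: the kernel sizes, dilation schedule, channel widths, and 17-keypoint count are all assumptions; the axial attention is single-head with no learned projections; and the stages run on independently generated inputs rather than being wired exactly as in WiFlow.

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_dilated_conv1d(x, w, dilation):
    """Causal dilated 1-D conv. x: (C_in, T), w: (C_out, C_in, k)."""
    c_out, c_in, k = w.shape
    pad = dilation * (k - 1)              # left-pad only: no future frames leak in
    xp = np.pad(x, ((0, 0), (pad, 0)))
    T = x.shape[1]
    out = np.empty((c_out, T))
    for t in range(T):
        taps = xp[:, t:t + pad + 1:dilation]         # k taps at t, t-d, ..., t-d(k-1)
        out[:, t] = np.einsum('oik,ik->o', w, taps)
    return out

def asym_conv(x, w):
    """1 x k conv along the sub-carrier axis only. x: (C_in, S, T), w: (C_out, C_in, k)."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0)))     # 'same' pad on S; T untouched
    _, S, T = x.shape
    out = np.empty((c_out, S, T))
    for s in range(S):
        out[:, s, :] = np.einsum('oik,ikt->ot', w, xp[:, s:s + k, :])
    return out

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def axial_attention(x):
    """Two 1-D attentions on a (K, T, C) map: along T (width), then along K (height).
    Single head, no learned projections, to keep the sketch short."""
    scale = np.sqrt(x.shape[-1])
    x = softmax(x @ x.transpose(0, 2, 1) / scale) @ x        # temporal stage
    xt = x.transpose(1, 0, 2)                                # (T, K, C)
    xt = softmax(xt @ xt.transpose(0, 2, 1) / scale) @ xt    # keypoint stage
    return xt.transpose(1, 0, 2)

# Toy sizes: 30 sub-carriers, 20 frames, 8 channels, 17 keypoints (all assumed).
S, T, C, K = 30, 20, 8, 17
csi = rng.standard_normal((S, T))

h = causal_dilated_conv1d(csi, 0.1 * rng.standard_normal((C, S, 3)), dilation=1)
for d in (2, 4):                                     # exponentially growing dilations
    h = causal_dilated_conv1d(h, 0.1 * rng.standard_normal((C, C, 3)), dilation=d)

# The real network compresses the sub-carrier axis down to K keypoint rows;
# here a fresh (C, K, T) map stands in for that intermediate feature.
f = asym_conv(rng.standard_normal((C, K, T)), 0.1 * rng.standard_normal((C, C, 5)))
g = axial_attention(f.transpose(1, 2, 0))            # (K, T, C)
coords = g @ (0.1 * rng.standard_normal((C, 2)))     # 1x1-conv decoder -> (K, T, 2)
print(h.shape, f.shape, g.shape, coords.shape)
```

The sketch makes the axial-attention saving concrete: each 1-D stage builds a T × T or K × K score matrix per row instead of one (KT) × (KT) matrix, which is where the O(H²W + HW²) versus O(H²W²) gap comes from.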

Experimental results: On the self‑collected dataset, WiFlow achieves PCK@20 = 97.0 %, PCK@50 = 99.48 %, and MPJPE = 0.008 m, outperforming recent WiFi‑based baselines (e.g., WiPose, CSI‑Former) while using only 4.82 M parameters and ≈0.68 G FLOPs. Ablation studies confirm that each component—TCN vs. LSTM/Transformer, asymmetric vs. standard 2‑D convolutions, and axial attention—contributes positively to accuracy and efficiency.
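The two metrics above can be reproduced in form (not value) with a few lines. The PCK reference scale used here (a fixed per-sample normalizer) and the 17-keypoint layout are assumptions; the paper's exact normalization is not stated in this summary.

```python
import numpy as np

rng = np.random.default_rng(1)

def mpjpe(pred, gt):
    # Mean per-joint position error: Euclidean distance averaged over frames and joints.
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pck(pred, gt, alpha, scale=1.0):
    # Fraction of keypoints whose error falls within alpha * reference scale.
    # The reference scale (torso length, bbox size, ...) is an assumption here.
    d = np.linalg.norm(pred - gt, axis=-1)
    return (d <= alpha * scale).mean()

gt = rng.uniform(0.0, 1.0, size=(100, 17, 2))        # 100 frames, 17 keypoints (assumed)
pred = gt + rng.normal(0.0, 0.01, size=gt.shape)     # predictions with ~1 cm noise

mpjpe_val = mpjpe(pred, gt)
pck_val = pck(pred, gt, alpha=0.20)
print(f"MPJPE = {mpjpe_val:.4f}, PCK@20 = {pck_val:.4f}")
```

Note that PCK@20 is the more forgiving metric (a keypoint only needs to land within 20% of the reference scale), which is why PCK@50 sits even closer to 100%.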

The authors release both the code and the large continuous CSI‑pose dataset, fostering reproducibility and future research. WiFlow demonstrates that respecting the physical asymmetry of CSI (temporal causality vs. spatial sub‑carrier distribution) and employing a modular, lightweight design can bridge the gap between high‑accuracy pose estimation and practical, on‑device deployment. Potential extensions include multi‑person scenarios, 3‑D pose estimation, and integration with other RF modalities, all of which can benefit from the same decoupled temporal‑spatial processing pipeline.

