Efficient Convolutional Neural Network for FMCW Radar Based Hand Gesture Recognition
FMCW radar can measure an object's range, speed, and angle of arrival; its advantages include robustness to bad weather along with good range and speed resolution. In this paper, we consider FMCW radar as a novel interaction interface for laptops. We merge sequences of an object's range, speed, and azimuth information into a single input, which is fed to a convolutional neural network to learn spatial and temporal patterns. Our model achieved 96% accuracy on the test set and in real-time testing.
💡 Research Summary
The paper presents a novel human‑computer interaction interface that leverages a 57‑64 GHz Frequency‑Modulated Continuous‑Wave (FMCW) radar to recognize four hand gestures: left wave, right wave, click, and wrist. Unlike camera‑based solutions, FMCW radar provides distance, radial velocity, and angle‑of‑arrival (AoA) measurements while preserving user privacy and being robust to adverse weather. The authors’ key contribution lies in fusing these three modalities into a single three‑channel image, termed RSA (Range‑Speed‑Azimuth), which serves as input to a deep convolutional neural network (CNN).
Signal Processing Pipeline
Raw radar returns are first transformed via a 2‑D Fast Fourier Transform (FFT) to generate a Range‑Doppler Map (RDM) of size 64 × 256. Constant False Alarm Rate (CFAR) detection isolates the hand‑body region, after which the RDM is cropped to 64 × 128. For each range bin, the maximum Doppler frequency (speed) and the average AoA are computed, producing a 1 × 128 × 3 frame (range‑time, speed‑time, azimuth‑time). Stacking 128 consecutive frames yields a 128 × 128 × 3 RSA image that simultaneously encodes spatial and temporal dynamics.
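The per-frame reduction described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not specify exactly what the range channel stores (here it is the per-range-bin energy of the cropped RDM), and the CFAR detection and AoA estimation steps are replaced by placeholder inputs.

```python
import numpy as np

def rsa_frame(rdm, aoa):
    """Collapse one cropped Range-Doppler Map into a 1 x 128 x 3 row.

    rdm : (64, 128) magnitude map (Doppler bins x range bins), already
          cropped to the hand region via CFAR (crop not shown here)
    aoa : (64, 128) per-cell angle-of-arrival estimates

    Channel definitions are assumptions for illustration:
    range = per-bin energy, speed = index of the strongest Doppler bin,
    azimuth = average AoA over Doppler bins.
    """
    power = np.abs(rdm)
    rng = power.sum(axis=0)            # range channel: energy per range bin
    spd = power.argmax(axis=0)         # speed channel: max-Doppler index
    azi = aoa.mean(axis=0)             # azimuth channel: average AoA
    return np.stack([rng, spd, azi], axis=-1)   # shape (128, 3)

def rsa_image(frames):
    """Stack 128 consecutive (128, 3) frames into a 128 x 128 x 3 RSA image."""
    assert len(frames) == 128
    return np.stack(frames, axis=0)

# Toy usage with synthetic data in place of real radar returns.
gen = np.random.default_rng(0)
frames = [rsa_frame(gen.standard_normal((64, 128)),
                    gen.uniform(-60, 60, (64, 128)))
          for _ in range(128)]
img = rsa_image(frames)
print(img.shape)  # (128, 128, 3)
```

Stacking frames along the time axis is what lets a plain 2-D CNN see temporal dynamics: each image row is one time step, so motion over the gesture appears as vertical structure in the RSA image.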
Dataset
Data were collected from 50 participants, each performing the four gestures ten times with both left and right hands, resulting in 3 652 valid recordings. To mitigate the limited sample size, the authors applied aggressive data augmentation: random cropping of gesture blocks and synthetic generation of training patches, expanding the training set to over 400 k examples.
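The crop-based augmentation can be illustrated with a simple sketch. The summary does not detail the authors' exact scheme; this hypothetical version randomly crops a gesture block along the time axis and zero-pads it back to the fixed 128-frame window, which is one common way to simulate gestures of varying duration and onset.

```python
import numpy as np

def random_time_crop(rsa, out_len=128, gen=None):
    """Randomly crop an RSA image along the time (row) axis, then
    zero-pad back to out_len rows. A stand-in for the paper's
    crop-based augmentation, whose exact parameters are not given."""
    gen = gen or np.random.default_rng()
    t = rsa.shape[0]
    # Keep at least half a window, at most a full window of frames.
    keep = gen.integers(out_len // 2, min(t, out_len) + 1)
    start = gen.integers(0, t - keep + 1)
    crop = rsa[start:start + keep]
    pad = np.zeros((out_len - keep,) + rsa.shape[1:], dtype=rsa.dtype)
    return np.concatenate([crop, pad], axis=0)

x = np.random.rand(128, 128, 3)
aug = random_time_crop(x)
print(aug.shape)  # (128, 128, 3)
```

Applied repeatedly with different random crops, a few thousand recordings can plausibly be expanded by two orders of magnitude, consistent with the 400 k-example training set reported above.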
Network Architectures
Two CNN variants are evaluated:
- VGG‑10 – A shallow network inspired by VGG, consisting of two Conv3×3‑Conv3×3‑MaxPool blocks followed by two fully connected layers. Trained with the Adam optimizer, early stopping, and learning‑rate decay, VGG‑10 converged after ten epochs with a validation accuracy of 92 %.
- ResNet‑20 – An enhanced version that inserts residual blocks and batch‑normalization layers between convolutions, forming a 20‑layer deep residual network. This architecture achieved a validation accuracy of 98 %, substantially outperforming VGG‑10.
For comparison, a CNN+LSTM baseline (mirroring prior work) processes resized 64 × 64 RDM frames through two Conv5×5 layers, pools, and then feeds the extracted features to an LSTM. Because this pipeline ignores AoA, its performance is markedly lower.
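The core computation that distinguishes ResNet‑20 from VGG‑10 is the residual block. Since the summary does not give the exact layer configuration, the sketch below shows only the generic residual pattern y = ReLU(x + F(x)) with a naive NumPy 3×3 convolution, and omits batch normalization for brevity.

```python
import numpy as np

def conv3x3(x, w):
    """Naive 'same'-padded 3x3 convolution.
    x : (H, W, Cin) feature map, w : (3, 3, Cin, Cout) weights."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, w.shape[-1]))
    for i in range(H):
        for j in range(W):
            # Contract the 3x3xCin patch against the kernel.
            out[i, j] = np.tensordot(xp[i:i + 3, j:j + 3], w, axes=3)
    return out

def residual_block(x, w1, w2):
    """y = ReLU(x + Conv(ReLU(Conv(x)))): the identity shortcut lets
    gradients bypass the convolutions, which is what makes 20-layer
    training stable."""
    h = np.maximum(conv3x3(x, w1), 0)
    return np.maximum(x + conv3x3(h, w2), 0)

# Toy usage on a small feature map (channel count preserved by the block).
gen = np.random.default_rng(0)
x = gen.standard_normal((8, 8, 4))
w1 = gen.standard_normal((3, 3, 4, 4)) * 0.1
w2 = gen.standard_normal((3, 3, 4, 4)) * 0.1
y = residual_block(x, w1, w2)
print(y.shape)  # (8, 8, 4)
```

A practical implementation would use a deep-learning framework's convolution primitives and insert batch normalization before each activation, as the paper's ResNet‑20 does; the NumPy version here only makes the shortcut arithmetic explicit.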
Results
On a held‑out test set, ResNet‑20 attained per‑gesture accuracies of 98.7 % (LEFT), 99.1 % (RIGHT), 99.0 % (CLICK), and 97.9 % (WRIST), yielding an overall average of 98.9 %. VGG‑10 achieved 97 % average accuracy, while CNN+LSTM lagged at 78 % average, with especially poor performance on LEFT/RIGHT due to missing AoA cues. Confusion‑matrix analysis revealed that LEFT and RIGHT are occasionally swapped because the retraction phase of one resembles the preparation phase of the other; CLICK and WRIST are sometimes misidentified as LEFT/RIGHT when large AoA variations occur.
Discussion and Limitations
The study convincingly demonstrates that integrating AoA into the input representation dramatically improves radar‑based gesture classification. The residual network’s depth and batch‑norm stabilize training and enable near‑perfect discrimination of the four gestures. However, several limitations remain:
- Generalization – The dataset, though augmented, originates from a controlled lab environment with a single radar placement. Real‑world scenarios involving multiple users, clutter, or varying hand‑radar distances were not evaluated.
- Latency and Resource Constraints – The paper reports “real‑time” performance but does not quantify inference latency, memory footprint, or suitability for embedded deployment.
- Gesture Segmentation – Errors stem from overlapping gesture phases; more sophisticated temporal segmentation or attention mechanisms could further reduce confusion.
- Scalability – Extending the system to a larger gesture vocabulary or user‑defined gestures may require additional training data and possibly hierarchical classification strategies.
Future Work
The authors propose expanding to user‑defined gestures, investigating multi‑user simultaneous recognition, and exploring lightweight model variants for on‑device inference. Additional research could explore adaptive radar parameters (bandwidth, chirp count) to balance resolution against power consumption, and conduct extensive field trials across diverse lighting, acoustic, and environmental conditions.
Conclusion
By fusing distance, velocity, and angle measurements into a compact three‑channel image and applying a deep residual CNN, the authors achieve state‑of‑the‑art accuracy (≈99 %) for four hand gestures using FMCW radar. This approach offers a privacy‑preserving, weather‑robust alternative to vision‑based interfaces, and sets a solid foundation for future radar‑centric human‑computer interaction systems.