WaveGlove: Transformer-based hand gesture recognition using multiple inertial sensors
Research Summary
The paper introduces WaveGlove, a custom glove equipped with five inertial measurement units (IMUs), one on each finger, to explore the benefits of multi-sensor hand gesture recognition (HGR). Two gesture vocabularies were defined: "WaveGlove-single", consisting of eight whole-hand movements plus a null class, and "WaveGlove-multi", containing ten gestures that involve distinct motions of individual fingers. Using the glove, the authors collected over 1,000 samples for the single vocabulary and more than 10,000 samples for the multi vocabulary.
To place their work in context, the authors assembled a benchmark of eleven datasets, including the two newly created WaveGlove datasets, three publicly available HGR datasets (uWave, Opportunity, Skoda), and six HAR datasets previously normalized in the literature. All datasets were processed with a uniform pipeline: Leave-One-Trial-Out (LOTO) cross-validation, consistent windowing (overlapping for the six pre-processed HAR sets, non-overlapping for the others), and identical feature scaling. This standardization enables fair comparison across a wide range of classification methods.
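The two pipeline ingredients described above, fixed-length windowing and LOTO cross-validation, can be sketched as follows. This is an illustrative reconstruction, not the authors' released code; the function names and the `(T, S)` signal layout are assumptions.

```python
import numpy as np

def window(signal, win_len, overlap=0):
    """Split a (T, S) multi-channel signal into fixed-length windows.

    overlap=0 gives non-overlapping windows (as described for the
    WaveGlove/HGR sets); overlap > 0 gives overlapping windows
    (as described for the six pre-processed HAR sets).
    """
    step = win_len - overlap
    starts = range(0, signal.shape[0] - win_len + 1, step)
    return np.stack([signal[s:s + win_len] for s in starts])

def loto_splits(trial_ids):
    """Leave-One-Trial-Out: each unique trial id is held out
    as the test set exactly once; the rest form the train set."""
    trial_ids = np.asarray(trial_ids)
    for trial in np.unique(trial_ids):
        test_mask = trial_ids == trial
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)
```

For example, a 50-step, 2-channel recording windowed with `win_len=10, overlap=0` yields five `(10, 2)` windows, and `loto_splits` over three trial ids yields three train/test index pairs.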
The study reproduces several established approaches ranging from classical machine learning (Decision Tree, Baseline) to deep learning models such as DeepConvLSTM, DCNN ensembles, and bidirectional GRU/LSTM networks. In addition, the authors propose a novel Transformer-based architecture tailored for inertial time-series. The model first projects the raw B×T×S input (batch, time steps, sensor channels) into a 32-dimensional embedding via a linear layer, adds sinusoidal positional encoding, and passes the sequence through four Transformer encoder layers (8 attention heads, dropout 0.2, feed-forward dimension 128). A dot-product attention mechanism uses the final temporal output as the query while attending over the entire sequence, and a final linear layer maps the result to class logits. Hyper-parameters were selected through an extensive sweep.
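The architecture described above can be sketched in PyTorch roughly as follows. Layer sizes (32-d embedding, 4 layers, 8 heads, feed-forward 128, dropout 0.2) come from the summary; everything else, including the class name and the exact form of the final attention pooling, is an assumption and may differ from the authors' implementation.

```python
import math
import torch
import torch.nn as nn

class GestureTransformer(nn.Module):
    """Sketch of the described Transformer for inertial time-series."""

    def __init__(self, n_channels, n_classes, d_model=32,
                 n_heads=8, n_layers=4, d_ff=128, dropout=0.2):
        super().__init__()
        self.d_model = d_model
        # Linear projection: (B, T, S) -> (B, T, 32)
        self.embed = nn.Linear(n_channels, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=d_ff, dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, n_classes)

    def positional_encoding(self, T, device):
        # Standard sinusoidal positional encoding, shape (T, d_model).
        pos = torch.arange(T, device=device).unsqueeze(1)
        div = torch.exp(torch.arange(0, self.d_model, 2, device=device)
                        * (-math.log(10000.0) / self.d_model))
        pe = torch.zeros(T, self.d_model, device=device)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, x):                          # x: (B, T, S)
        h = self.embed(x)
        h = h + self.positional_encoding(x.shape[1], x.device)
        h = self.encoder(h)                        # (B, T, d_model)
        # Dot-product attention: final time step as query over all steps.
        q = h[:, -1:, :]                           # (B, 1, d_model)
        attn = torch.softmax(
            q @ h.transpose(1, 2) / math.sqrt(self.d_model), dim=-1)
        ctx = (attn @ h).squeeze(1)                # (B, d_model)
        return self.out(ctx)                       # class logits
```

For instance, with five 6-channel IMUs (30 channels total, an assumed layout) and ten classes, `GestureTransformer(30, 10)(torch.randn(4, 64, 30))` produces a `(4, 10)` logit tensor.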
Performance results (Table 1) show that the Transformer model achieves state-of-the-art accuracy on most datasets. Notably, on the WaveGlove-multi set it reaches 99.40% accuracy, surpassing DeepConvLSTM (90.35%). On the WaveGlove-single set it also outperforms prior methods (99.40% vs. 99.10% for the best baseline). Across the eleven benchmark datasets, the Transformer consistently matches or exceeds the best published scores, confirming its robustness to varying sensor configurations and gesture complexities.
An ablation study investigates two key factors. First, training models on data from a single finger sensor reveals that sensor placement strongly influences confusion patterns: the index-finger sensor fails to detect gestures that do not involve the index finger (e.g., "Metal", "Peace"), while the pinky sensor confuses gestures lacking pinky movement. This demonstrates that different fingers contribute discriminative information for different gesture classes. Second, the authors vary the number of active sensors from one to five. For the simple "single" vocabulary, adding sensors yields negligible gains, as expected because the gestures involve uniform hand motion. In contrast, for the "multi" vocabulary, accuracy improves markedly up to three sensors and then plateaus, indicating diminishing returns beyond that point. The benefit of additional sensors is especially pronounced when the training set is small, suggesting that multi-sensor data can compensate for limited sample sizes.
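The sensor-count ablation amounts to slicing out the channel columns belonging to a chosen subset of IMUs before training. A minimal sketch, assuming channels are stored contiguously per IMU and that each IMU contributes six channels (e.g. 3-axis accelerometer + 3-axis gyroscope; the paper's exact channel layout is not stated here):

```python
import numpy as np

CHANNELS_PER_IMU = 6  # assumed: 3-axis accel + 3-axis gyro per finger

def select_sensors(windows, sensor_idx):
    """Keep only the channels of the chosen IMUs.

    windows: (N, T, 5 * CHANNELS_PER_IMU) array of windowed samples,
             with each IMU's channels stored contiguously (assumed).
    sensor_idx: iterable of IMU indices in [0, 5), e.g. [0, 2, 4].
    """
    cols = np.concatenate([
        np.arange(i * CHANNELS_PER_IMU, (i + 1) * CHANNELS_PER_IMU)
        for i in sensor_idx])
    return windows[:, :, cols]
```

Training one model per subset size (1 through 5 IMUs) and comparing accuracies reproduces the shape of the ablation curve described above.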
The paper's contributions are threefold: (1) the design and public release of a multi-IMU glove capable of capturing fine-grained finger motions at a scale larger than any previously available HGR dataset; (2) the creation of a unified benchmark of eleven HAR/HGR datasets with standardized preprocessing, facilitating reproducible comparisons; and (3) the introduction of a streamlined Transformer architecture that achieves superior performance across diverse tasks. The authors suggest future work on modeling inter-sensor correlations with graph neural networks, developing lightweight Transformer variants for real-time on-device inference, and exploring transfer-learning strategies for personalized gesture vocabularies.