BdSL-SPOTER: A Transformer-Based Framework for Bengali Sign Language Recognition with Cultural Adaptation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We introduce BdSL-SPOTER, a pose-based transformer framework for accurate and efficient recognition of Bengali Sign Language (BdSL). BdSL-SPOTER extends the SPOTER paradigm with culture-specific preprocessing and a compact four-layer transformer encoder featuring optimized learnable positional encodings, while employing curriculum learning to enhance generalization on limited data and accelerate convergence. On the BdSLW60 benchmark, it achieves 97.92% Top-1 validation accuracy, a 22.82% improvement over the Bi-LSTM baseline, while keeping computational costs low. With its reduced parameter count, lower FLOPs, and higher FPS, BdSL-SPOTER provides a practical framework for real-world accessibility applications and a scalable model for other low-resource regional sign languages.


💡 Research Summary

This paper presents BdSL‑SPOTER, a lightweight transformer‑based framework for recognizing Bengali Sign Language (BdSL). Building on the original SPOTER architecture, the authors introduce three major innovations tailored to the cultural and linguistic characteristics of BdSL. First, a culturally adapted preprocessing pipeline extracts 2‑D pose keypoints using MediaPipe Holistic, retaining 54 landmarks (21 per hand, 12 upper‑body points) and discarding facial points that add noise for BdSL. Because Bengali signers use a more compact signing space than Western sign languages, the authors apply a custom normalization (α = 0.85) that scales coordinates relative to the signer’s torso, preserving subtle spatial variations while reducing inter‑signer variance.
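The torso-relative normalization step can be sketched as follows. This is a minimal illustration assuming keypoints are already extracted as a `(T, 54, 2)` array; the landmark indices for shoulders and hips are hypothetical placeholders, not the authors' exact layout, and only α = 0.85 is taken from the paper.

```python
import numpy as np

ALPHA = 0.85  # compact-signing-space scale factor reported in the paper

def normalize_pose(frames: np.ndarray,
                   shoulder_l: int = 0, shoulder_r: int = 1,
                   hip_l: int = 6, hip_r: int = 7) -> np.ndarray:
    """Center each frame on the torso midpoint and rescale by torso size.

    frames: (T, 54, 2) array of 2-D keypoints per frame.
    Returns an array of the same shape with reduced inter-signer variance.
    """
    torso = frames[:, [shoulder_l, shoulder_r, hip_l, hip_r], :]
    center = torso.mean(axis=1, keepdims=True)                    # (T, 1, 2)
    # Torso scale: mean distance of the torso points from their center.
    scale = np.linalg.norm(torso - center, axis=-1).mean(axis=1)  # (T,)
    scale = np.maximum(scale, 1e-6)                               # avoid divide-by-zero
    return ALPHA * (frames - center) / scale[:, None, None]
```

Scaling by the torso rather than the full bounding box preserves the subtle hand displacements that distinguish compact BdSL signs, which a whole-frame normalization would shrink along with everything else.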

Second, the transformer encoder is streamlined to four layers, each with nine multi‑head self‑attention heads, a model dimension of 108, and a feed‑forward dimension of 512. Crucially, learnable positional encodings replace fixed sinusoidal ones, allowing the network to adapt to variable signing speeds and temporal patterns typical of BdSL. This design keeps the total parameter count at only 0.847 M, dramatically lower than many vision‑based baselines.
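A PyTorch sketch of such an encoder is shown below. The hyperparameters (4 layers, 9 heads, d_model = 108, d_ff = 512, dropout 0.15, 200-frame maximum) come from the paper; note that 108 matches 54 landmarks × 2 coordinates. The module structure, the mean-pooling readout, and the class head are assumptions for illustration, not the authors' exact code.

```python
import torch
import torch.nn as nn

class PoseEncoder(nn.Module):
    """Compact pose-sequence encoder with learnable positional encodings."""

    def __init__(self, d_model=108, n_heads=9, n_layers=4,
                 d_ff=512, max_len=200, n_classes=60):
        super().__init__()
        # Learnable positional embeddings instead of fixed sinusoids.
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_ff,
            dropout=0.15, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                 # x: (batch, T, 108) flattened keypoints
        x = self.encoder(x + self.pos[:, :x.size(1)])
        return self.head(x.mean(dim=1))   # mean-pool over time, then classify
```

Because the positional table is a trainable parameter, the network can stretch or compress its notion of temporal position to fit the variable signing speeds the authors highlight.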

Third, the training regimen incorporates curriculum learning via a sequence‑length warm‑up: training starts with trimmed sequences and gradually expands to the full length of 200 frames over the first three epochs. This reduces gradient variance early on and speeds convergence. Additional regularization includes label smoothing (ε = 0.1), weight decay (1e‑4), dropout (0.15), and mixed‑precision (FP16) training with a OneCycleLR scheduler. Data augmentation applies temporal dropout (10 % of frames), small coordinate noise (±2 px), and horizontal flipping to increase robustness.
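The sequence-length warm-up can be expressed as a simple schedule. Only the 200-frame target and the three-epoch warm-up are from the paper; the starting length of 50 frames and the linear growth are illustrative assumptions.

```python
def max_seq_len(epoch: int, start_len: int = 50, full_len: int = 200,
                warmup_epochs: int = 3) -> int:
    """Linearly grow the allowed sequence length over the warm-up epochs.

    start_len is an assumed value; the paper states only that training
    begins with trimmed sequences and reaches 200 frames after the
    first three epochs.
    """
    if epoch >= warmup_epochs:
        return full_len
    frac = (epoch + 1) / warmup_epochs
    return int(start_len + frac * (full_len - start_len))
```

Short early sequences mean fewer attention positions per step, which lowers gradient variance while the positional table and attention weights are still settling.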

Experiments are conducted on the BdSLW60 benchmark, which contains 9,307 videos across 60 classes performed by 18 native signers. The dataset is split 70 %/15 %/15 % for training, validation, and testing. BdSL‑SPOTER achieves a Top‑1 accuracy of 97.92 %, Top‑5 accuracy of 99.80 %, and a macro‑averaged F1 score of 0.979. This represents a 22.82 % absolute improvement over the Bi‑LSTM baseline (75.10 % Top‑1) and a 15.52 % gain over the original SPOTER (82.40 % Top‑1). In terms of efficiency, the model reduces training time by 63.1 % and parameter count by 29.4 % relative to standard SPOTER, while delivering an inference speed of 127 FPS, suitable for real‑time applications.
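For reference, the macro-averaged F1 reported above is the unweighted mean of per-class F1 scores, which weights all 60 classes equally regardless of how many validation clips each has. A minimal NumPy version:

```python
import numpy as np

def macro_f1(y_true: np.ndarray, y_pred: np.ndarray, n_classes: int) -> float:
    """Macro-averaged F1: unweighted mean of the per-class F1 scores."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))
```

This matches `sklearn.metrics.f1_score(..., average="macro")` and is the appropriate choice here, since it exposes weakness on rare classes that plain accuracy can hide.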

A comprehensive ablation study evaluates the impact of each design choice. Four encoder layers with nine attention heads provide the best trade‑off; deeper stacks (6–8 layers) yield diminishing returns given the limited data size. Learnable positional encodings improve performance by 2.32 % over fixed sinusoidal encodings, and the culture‑specific normalization contributes a further 4.42 % absolute accuracy gain. Curriculum learning, label smoothing, and data augmentation each contribute roughly 8–9 % improvement, confirming their importance in low‑resource scenarios.

Error analysis shows that 52 of the 60 classes (86.7 %) are classified perfectly, with the remaining misclassifications concentrated on signs that share similar hand shapes but differ in temporal dynamics (e.g., classes 33 vs. 47). This indicates that while the model captures spatial configurations well, finer motion trajectory modeling could further boost performance.
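The per-class breakdown above can be recovered directly from a confusion matrix; a class is "classified perfectly" when its row carries no off-diagonal mass. A small sketch (the function name and matrix convention, true labels as rows, are assumptions):

```python
import numpy as np

def perfect_classes(conf: np.ndarray) -> list:
    """Return indices of classes whose validation samples were all
    predicted correctly (no off-diagonal mass in that row)."""
    row_totals = conf.sum(axis=1)
    return [c for c in range(conf.shape[0])
            if row_totals[c] > 0 and conf[c, c] == row_totals[c]]
```

Applied to the paper's 60×60 matrix, this would yield the 52 perfectly classified classes; inspecting the remaining rows points to the hand-shape-similar pairs (e.g., classes 33 vs. 47) discussed above.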

In summary, BdSL‑SPOTER demonstrates that a carefully adapted pose‑based transformer, combined with cultural normalization and curriculum learning, can achieve state‑of‑the‑art accuracy on a low‑resource sign language dataset while maintaining real‑time efficiency. The proposed components are likely transferable to other under‑represented sign languages, paving the way for scalable, accessible sign language recognition systems.

