Isolated Sign Language Recognition with Segmentation and Pose Estimation


The recent surge in large language models has automated translation between spoken and written languages. However, these advances remain largely inaccessible to American Sign Language (ASL) users, whose language relies on complex visual cues. Isolated sign language recognition (ISLR) - the task of classifying videos of individual signs - can help bridge this gap but is currently limited by scarce per-sign data, high signer variability, and substantial computational costs. We propose a model for ISLR that reduces computational requirements while maintaining robustness to signer variation. Our approach integrates (i) a pose estimation pipeline to extract hand and face joint coordinates, (ii) a segmentation module that isolates relevant information, and (iii) a ResNet-Transformer backbone to jointly model spatial and temporal dependencies.


💡 Research Summary

The paper tackles the pressing need for automatic translation tools tailored to American Sign Language (ASL), focusing on isolated sign language recognition (ISLR) – the task of classifying short video clips that each contain a single gloss. While large language models have dramatically improved text‑and‑speech translation, they remain blind to the visual modality that ASL relies on. Existing ISLR approaches suffer from two intertwined challenges: (1) severe data scarcity – most glosses are represented by only a few dozen examples in large‑scale datasets, and (2) high computational demand – state‑of‑the‑art video transformers or 3‑D CNNs require substantial GPU memory and processing time, limiting real‑world deployment.

To address these issues, the authors propose a lightweight, three‑stage pipeline that replaces dense RGB processing with structured pose information and targeted segmentation. First, they employ Google’s MediaPipe to extract 3‑D joint coordinates for the hands and face in each frame. MediaPipe’s multi‑stage architecture (body pose detector → hand and face detectors) yields a compact set of keypoints that capture the essential articulators of ASL while discarding background clutter. Second, the same keypoints are used to generate binary masks that zero out all pixels outside the hands and face, effectively segmenting the signer from the scene. This segmentation step further reduces visual noise and standardizes the input across diverse backgrounds and lighting conditions.
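The keypoint‑driven masking step might be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes keypoints are already available as 2‑D pixel coordinates (in practice they would come from MediaPipe's hand and face detectors) and uses a fixed per‑joint radius, since the paper summary does not specify the exact masking geometry.

```python
import numpy as np

def keypoint_mask(keypoints, height, width, radius=20):
    """Binary mask that keeps only pixels near the given joints.

    keypoints: array of (x, y) pixel coordinates, shape (J, 2).
    Returns a (height, width) uint8 mask: 1 within `radius` of any joint.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    mask = np.zeros((height, width), dtype=np.uint8)
    for x, y in keypoints:
        mask |= ((xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2).astype(np.uint8)
    return mask

def segment_frame(frame, keypoints, radius=20):
    """Zero out all pixels of a grayscale frame outside the joint regions."""
    m = keypoint_mask(frame, keypoints, radius) if False else \
        keypoint_mask(keypoints, frame.shape[0], frame.shape[1], radius)
    return frame * m
```

Because the mask is derived from the same keypoints used downstream, the segmented video and the pose stream stay consistent frame by frame.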

The third stage is the core learning module. The authors initially experimented with a dual‑stream visual model that fed both the original grayscale video and the segmented video into a 3‑D Vision Transformer (ViT) while simultaneously processing the pose coordinates with a separate transformer. However, the 3‑D ViT proved too memory‑intensive for their limited hardware, and the ResNet‑18 backbone they substituted performed poorly (≈1 % accuracy) on the down‑sampled dataset, likely due to over‑fitting on noisy pixel data. Consequently, they abandoned pixel‑based inputs altogether and built a pose‑only architecture.
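A pose‑only architecture consumes keypoint sequences rather than pixels, so each clip must be packed into a fixed‑shape tensor for the transformer. The following is a rough sketch of that packing step; the joint count, maximum sequence length, and zero‑filling of frames with failed detections are all assumptions for illustration, not details reported in the paper.

```python
import numpy as np

NUM_JOINTS = 54  # assumed: 2 hands x 21 joints + a reduced 12-point face set

def clip_to_tokens(frames_keypoints, max_len=64):
    """Pack a clip's per-frame keypoints into a fixed-length token matrix.

    frames_keypoints: list of (NUM_JOINTS, 3) arrays, or None for frames
    where the detector found no signer.
    Returns a (max_len, NUM_JOINTS * 3) float32 array, zero-padded (or
    truncated) to max_len, with missing frames left as zeros.
    """
    tokens = np.zeros((max_len, NUM_JOINTS * 3), dtype=np.float32)
    for t, kp in enumerate(frames_keypoints[:max_len]):
        if kp is not None:
            tokens[t] = np.asarray(kp, dtype=np.float32).reshape(-1)
    return tokens
```

Each row then serves as one temporal token, so the transformer attends over frames while the per-joint coordinates provide the spatial signal.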

A critical contribution is the systematic investigation of coordinate normalization. Three strategies were compared: (i) Global Nose‑Anchored Normalization (using the nose tip as a fixed origin), (ii) Face‑Centered Normalization (using the centroid of all facial landmarks), and (iii) Per‑Frame Center‑of‑Mass Normalization (computing the centroid of all visible joints for each frame and scaling to a
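The three normalization strategies can be illustrated in a few lines of NumPy. Landmark indices, the choice of the first frame as the nose anchor's reference, and the unit‑standard‑deviation scaling rule are assumptions made here for the sketch; the paper's exact formulations may differ.

```python
import numpy as np

NOSE = 0  # assumed index of the nose-tip landmark

def nose_anchored(seq):
    """(i) Global nose-anchored: subtract one fixed nose position
    (here, the first frame's) from every joint in every frame."""
    return seq - seq[0, NOSE]

def face_centered(seq, face_idx):
    """(ii) Face-centered: subtract the per-frame centroid of the
    facial landmarks given by face_idx."""
    return seq - seq[:, face_idx].mean(axis=1, keepdims=True)

def per_frame_com(seq):
    """(iii) Per-frame center-of-mass: subtract each frame's centroid
    over all visible joints, then scale to unit standard deviation."""
    centered = seq - seq.mean(axis=1, keepdims=True)
    scale = centered.std() or 1.0
    return centered / scale
```

All three take a `(frames, joints, 3)` array; they differ only in which origin is subtracted and whether that origin is fixed for the clip or recomputed per frame.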

