A$^{2}$V-SLP: Alignment-Aware Variational Modeling for Disentangled Sign Language Production


Building upon recent structural disentanglement frameworks for sign language production, we propose A$^{2}$V-SLP, an alignment-aware variational framework that learns articulator-wise disentangled latent distributions rather than deterministic embeddings. A disentangled Variational Autoencoder (VAE) encodes ground-truth sign pose sequences and extracts articulator-specific mean and variance vectors, which serve as distributional supervision for training a non-autoregressive Transformer. Given text embeddings, the Transformer predicts both latent means and log-variances, and the VAE decoder reconstructs the final sign pose sequences through stochastic sampling at the decoding stage. By modeling latents as distributions rather than points, this formulation preserves articulator-level representations and avoids deterministic latent collapse. In addition, we integrate a gloss attention mechanism to strengthen alignment between linguistic input and articulated motion. Experimental results show consistent gains over deterministic latent regression, achieving state-of-the-art back-translation performance and improved motion realism in a fully gloss-free setting.


💡 Research Summary

The paper introduces A²V‑SLP, a novel alignment‑aware variational framework for continuous sign language production (SLP) that directly generates 3‑D skeletal pose sequences from spoken‑language text without relying on gloss annotations. The core idea is to replace deterministic latent embeddings with articulator‑wise probabilistic latent distributions (mean µ and log‑variance log σ²) learned by a disentangled variational auto‑encoder (VAE). The VAE is structurally partitioned into four branches—right hand, left hand, body, and face—each encoded by a small MLP that outputs a region‑specific mean and variance vector. By training with a reconstruction loss and a low‑weight KL regularizer, the VAE captures fine‑grained articulatory variability while preserving a compact latent space (80 dimensions total).
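The articulator-wise encoder described above can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the per-region joint counts, the hidden width, and the even 20-dimensional split of the 80-dimensional latent across the four branches are all assumptions for the example.

```python
import numpy as np

# Hypothetical per-articulator joint counts (3-D keypoints per region).
ARTICULATORS = {"right_hand": 21, "left_hand": 21, "body": 9, "face": 70}
LATENT_PER_BRANCH = 20  # assumed even split: 4 branches x 20 = 80-dim total latent

rng = np.random.default_rng(0)

def make_branch(in_dim, hidden=64, z=LATENT_PER_BRANCH):
    """One small MLP branch mapping a pose region to (mu, log-variance)."""
    return {
        "W1": rng.normal(0.0, 0.1, (in_dim, hidden)),
        "W_mu": rng.normal(0.0, 0.1, (hidden, z)),
        "W_lv": rng.normal(0.0, 0.1, (hidden, z)),
    }

def encode_branch(params, x):
    h = np.tanh(x @ params["W1"])
    return h @ params["W_mu"], h @ params["W_lv"]  # region-specific mu, log sigma^2

def encode_pose(branches, pose_regions):
    """Encode each articulator independently, then sample z via reparameterization."""
    zs, stats = [], {}
    for name, x in pose_regions.items():
        mu, logvar = encode_branch(branches[name], x)
        eps = rng.normal(size=mu.shape)
        zs.append(mu + np.exp(0.5 * logvar) * eps)  # z = mu + sigma * eps
        stats[name] = (mu, logvar)
    return np.concatenate(zs, axis=-1), stats

branches = {n: make_branch(k * 3) for n, k in ARTICULATORS.items()}
frame = {n: rng.normal(size=(1, k * 3)) for n, k in ARTICULATORS.items()}
z, stats = encode_pose(branches, frame)
print(z.shape)  # (1, 80)
```

The key structural point is that each region owns its own parameters and its own (μ, log σ²) pair, so hand variability cannot be absorbed into the body or face latents.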

After the VAE is trained, its encoder is frozen and used to extract per‑frame articulator‑wise latent statistics from ground‑truth poses. These statistics serve as “distributional supervision” for a non‑autoregressive Transformer that maps BERT‑derived text embeddings to the same latent parameters. The Transformer consists of a three‑layer encoder (processing 768‑dimensional BERT tokens projected to 512 dimensions) and a six‑layer decoder that operates on a fixed set of learned time queries initialized from a neutral pose. Crucially, the decoder replaces standard global self‑attention with gloss‑attention: each query attends only to a local temporal window of N neighboring frames. This inductive bias enforces short‑range temporal coherence, effectively mimicking the alignment role of glosses without explicit gloss labels, while cross‑attention to the text encoder remains global to preserve sentence‑level semantics.
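The gloss-attention restriction described above amounts to masking self-attention to a band around the diagonal. The sketch below is an assumed single-head NumPy illustration, not the paper's code; `n` is the half-width of the local window:

```python
import numpy as np

def local_window_mask(T, n):
    """Boolean (T, T) mask: frame t may attend only to frames within +/- n of t."""
    idx = np.arange(T)
    return np.abs(idx[:, None] - idx[None, :]) <= n

def gloss_attention(Q, K, V, n):
    """Scaled dot-product attention restricted to a local temporal window.

    Q, K, V: (T, d) arrays for a single head. Positions outside the window
    get a score of -inf, so their softmax weight is exactly zero.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(local_window_mask(Q.shape[0], n), scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 8)); K = rng.normal(size=(6, 8)); V = rng.normal(size=(6, 8))
out = gloss_attention(Q, K, V, n=2)
print(out.shape)  # (6, 8)
```

Only the decoder's self-attention is windowed this way; cross-attention to the text encoder stays global, matching the summary's point that sentence-level semantics are preserved.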

Training proceeds in two phases. First, the Transformer is optimized with an articulator‑weighted L1 loss that regresses predicted means and log‑variances toward the VAE targets; a dynamic weighting scheme emphasizes hand joints to preserve manual articulation. Second, a KL divergence term aligns the predicted latent distributions with the VAE posterior, encouraging the model to output realistic variance estimates. A lightweight length predictor estimates the target sequence length from the encoder’s mean‑pooled output.
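The two loss terms above can be sketched as follows. This is a simplified illustration under stated assumptions: the weights dict stands in for the paper's dynamic weighting scheme (fixed values here, hands up-weighted), and the KL term uses the standard closed form for diagonal Gaussians in log-variance parameterization.

```python
import numpy as np

def weighted_l1(pred_mu, pred_lv, tgt_mu, tgt_lv, weights):
    """Articulator-weighted L1 regression toward the frozen-VAE statistics.

    All arguments except `weights` are dicts of arrays keyed by articulator;
    `weights` up-weights the hand branches (a fixed stand-in for the paper's
    dynamic scheme).
    """
    loss = 0.0
    for name in pred_mu:
        w = weights[name]
        loss += w * np.abs(pred_mu[name] - tgt_mu[name]).mean()
        loss += w * np.abs(pred_lv[name] - tgt_lv[name]).mean()
    return loss

def kl_gaussians(mu_p, lv_p, mu_q, lv_q):
    """Mean elementwise KL( N(mu_p, e^lv_p) || N(mu_q, e^lv_q) )."""
    var_p, var_q = np.exp(lv_p), np.exp(lv_q)
    return 0.5 * (lv_q - lv_p + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0).mean()

# Identical distributions give zero divergence.
z = np.zeros((2, 20))
print(kl_gaussians(z, z, z, z))  # 0.0

weights = {"right_hand": 2.0, "left_hand": 2.0, "body": 1.0, "face": 1.0}
zero = {k: np.zeros((2, 20)) for k in weights}
print(weighted_l1(zero, zero, zero, zero, weights))  # 0.0
```

Predicting log-variance rather than variance keeps the KL term numerically stable, since the regression target is unconstrained while the implied variance stays positive.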

Experiments on two benchmark datasets—PHOENIX‑2014T (German Sign Language weather broadcast) and CSL‑Daily (Chinese daily conversation)—demonstrate consistent improvements over deterministic baselines. A²V‑SLP achieves higher back‑translation BLEU scores (+2–3 points), lower mean per‑joint position error (MPJPE) for hands (≈4.5 mm reduction), and better human‑rated motion naturalness (+0.4 on a 5‑point scale). Ablation studies show that removing gloss‑attention degrades temporal alignment accuracy and increases length prediction error, confirming its contribution.

The contributions are threefold: (1) introducing distributional supervision for gloss‑free SLP, which mitigates latent collapse and captures articulatory variability; (2) integrating gloss‑attention into a non‑autoregressive generation pipeline to provide local temporal alignment without glosses; (3) employing articulator‑specific dynamic loss weighting to preserve fine‑grained hand motion. The work demonstrates that combining variational latent modeling with localized attention yields more expressive, temporally aligned, and realistic sign language generation, opening avenues for further research on multi‑signer scenarios, diffusion‑based decoders, and end‑to‑end video synthesis.

