SignVLA: A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation
Xinyu Tan¹, Ningwei Bai¹, Harry Gardener¹, Zhengyang Zhong¹, Luoyu Zhang¹, Liuhaichen Yang¹, Zhekai Duan¹, Monkgogi Galeitsiwe¹, Zezhi Tang¹*

¹Department of Computer Science, University College London (UCL), London, WC1E 6BT, U.K.
*Corresponding author: zezhi.tang@ucl.ac.uk

Abstract—We present, to our knowledge, the first sign language-driven Vision-Language-Action (VLA) framework for intuitive and inclusive human-robot interaction. Unlike conventional approaches that rely on gloss annotations as intermediate supervision, the proposed system adopts a gloss-free paradigm and directly maps visual sign gestures to semantic instructions. This design reduces annotation cost and avoids the information loss introduced by gloss representations, enabling more natural and scalable multimodal interaction.

In this work, we focus on a real-time alphabet-level finger-spelling interface that provides a robust and low-latency communication channel for robotic control. Compared with large-scale continuous sign language recognition, alphabet-level interaction offers improved reliability, interpretability, and deployment feasibility in safety-critical embodied environments. The proposed pipeline transforms continuous gesture streams into coherent language commands through geometric normalization, temporal smoothing, and lexical refinement, ensuring stable and consistent interaction.

Furthermore, the framework is designed to support future integration of transformer-based gloss-free sign language models, enabling scalable word-level and sentence-level semantic understanding. Experimental results demonstrate the effectiveness of the proposed system in grounding sign-derived instructions into precise robotic actions under diverse interaction scenarios.
These results highlight the potential of the framework to advance accessible, scalable, and multimodal embodied intelligence.

Index Terms—Vision-Language-Action, Sign Language, Human-Robot Interaction, Multimodal Learning, Gesture Recognition, Embodied AI

I. INTRODUCTION

Vision-Language-Action (VLA) models have recently emerged as a transformative paradigm in robotic autonomy, enabling agents to perform complex reasoning and embodied decision-making in unstructured environments. By scaling these architectures and leveraging internet-scale pre-training, state-of-the-art systems like NVIDIA's GR00T [1] and OpenVLA [2] have achieved robust, generalist control across diverse tasks. However, a significant limitation persists: current VLA research is predominantly "hearing-normative," operating under the assumption that human instructions are provided exclusively through text or speech. This dependency restricts the accessibility of robotic systems for the global population of individuals with hearing or speech impairments, treating sign language as a negligible edge case rather than a native instruction modality.

Integrating sign language into VLA frameworks presents several fundamental challenges. First, there is a profound modality mismatch between the discrete token-based processing of Large Language Models (LLMs) and the continuous, fluid motion dynamics of signing [3]. Unlike static text, sign language relies on complex spatial configurations, rhythmic trajectories, and contextual non-manual markers that are difficult to capture without significant information loss. Historically, this gap was addressed through glosses, but this approach creates an information bottleneck [4] that strips away grammatical nuance and spatial grounding. Furthermore, the glossing gap between raw video data and expert-annotated labels makes large-scale, gloss-based datasets prohibitively expensive and difficult to scale.
Beyond perception and language grounding, reliable execution in real-world robotics further requires robust and data-efficient control under uncertainty. Learning-based optimal control frameworks, such as reinforcement learning (RL) and adaptive dynamic programming (ADP), have demonstrated strong capabilities in handling nonlinear dynamics, disturbances, and model uncertainty in safety-critical systems. In particular, disturbance-observer-based control and robust optimal tracking have been widely studied for nonlinear and uncertain systems, providing effective disturbance compensation and improved closed-loop stability [5]–[8]. Recent advances further integrate RL with disturbance-aware control and adaptive learning, enabling data-driven policy optimization without requiring accurate system models [9], [10]. Event-triggered learning and adaptive mechanisms have also been explored to improve computational efficiency and reduce communication overhead while maintaining stability guarantees in resource-constrained robotic platforms [10]. In addition, learning-based control has been extended to cooperative and multi-agent robotic systems, including formation and distributed control under uncertainty [11], [12]. Collectively, these developments underscore the necessity of tightly coupling high-level semantic reasoning with low-level robust control, motivating unified embodied architectures that can bridge sign language perception, semantic grounding, and reliable real-time execution.

Beyond the architectural hurdles, a significant dataset bottleneck exists due to the scarcity of high-quality, open-source sign language corpora. Restrictive licensing on gold-standard datasets often hinders rapid prototyping, forcing research toward language-agnostic and gloss-free methodologies that can generalize across different signing systems like ASL, BSL, or GSL [13].
To resolve these issues, recent advancements suggest moving away from end-to-end monolithic models in favor of modular translation-action architectures. By decoupling sign language translation (SLT) from the VLA policy, researchers can leverage specialist SLT models while maintaining generalist control capabilities, effectively avoiding "catastrophic forgetting" [14] during fine-tuning.

In this paper, we present, to our knowledge, the first sign language-driven Vision-Language-Action (VLA) framework designed for intuitive and inclusive human-robot interaction. Our system employs a hierarchical pipeline that transforms finger-spelled letters and gestures into coherent robotic commands. We leverage the MediaPipe Hands framework for real-time 3D landmark extraction and implement a robust linguistic buffering mechanism to handle temporal de-flickering and lexical error correction. These synthesized instructions are then dispatched to a VLA policy, which performs multimodal fusion to ground linguistic goals into physically executable motor behaviors.

Our main contributions are summarized as follows:
1) We present the first Vision-Language-Action (VLA) framework that integrates sign language as a native instruction modality, enabling robots to directly understand and execute tasks through manual gestures.
2) We develop a robust Sign-to-Word perception pipeline that integrates geometric normalization and Levenshtein-based lexical refinement, achieving accurate and stable real-time alphabet-level recognition.
3) We design a modular interface that bridges continuous gestural streams and discrete token-based computation, ensuring scalability and mitigating catastrophic forgetting in multimodal embodied systems.
4) We validate the proposed framework on a Franka Emika Panda robot, demonstrating effective grounding of sign language instructions into precise physical actions within complex manipulation environments.
II. METHODS

Our pipeline maps discrete manual gestures to continuous robot control. It consists of an alphabet-level perception module for finger-spelling, a linguistic buffering mechanism that stabilizes and refines character streams, and a Vision-Language-Action (VLA) policy that grounds the synthesized instruction into executable actions.

A. Sign-to-Word: Alphabet-Level Perception Buffer

We convert a live video stream into text commands through a hierarchical process that recognizes finger-spelled letters and composes them into words suitable for downstream robotic control.

1) Data Augmentation and Dataset Expansion: Because the initial gesture dataset D_raw is small, we apply stochastic augmentation to improve robustness and reduce overfitting in the lightweight classifier. We use geometric transformations to model pose variation during interaction, including random rotations θ ∈ [−25°, +25°], isotropic scaling s ∈ [0.7, 1.3], and horizontal flipping to support both hands. We further apply photometric jittering by adjusting brightness and contrast with multipliers α ∈ [0.6, 1.4]. This improves tolerance to illumination changes and indirectly simulates small perturbations in MediaPipe keypoints, encouraging the model to rely on stable geometric structure.

2) Feature Extraction and Modeling: Our architecture models sign language dynamics through a hierarchical pipeline, using parallel feature extraction to balance recognition accuracy with computational latency, drawing inspiration from recent edge-oriented sign language models [15]. For spatial modeling, CNN encoders (ranging from custom 4-layer Conv2D architectures to residual networks [16]) process raw RGB frames to learn high-dimensional embeddings of hand pose and orientation independently of pre-defined keypoints.
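The stochastic augmentation scheme described above (random rotation in ±25°, isotropic scaling in [0.7, 1.3], horizontal flipping, and brightness/contrast jitter with α ∈ [0.6, 1.4]) can be sketched as follows. This is a minimal NumPy illustration rather than the actual training code: the paper applies the geometric transforms to RGB frames, whereas here they are shown on 2D keypoints for brevity, and all function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_keypoints(pts):
    """Random rotation (±25°), isotropic scaling (0.7–1.3), and
    horizontal flip applied to (N, 2) keypoints about their centroid."""
    theta = np.deg2rad(rng.uniform(-25.0, 25.0))
    s = rng.uniform(0.7, 1.3)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    c = pts.mean(axis=0)
    out = (pts - c) @ R.T * s + c          # rotate and scale about centroid
    if rng.random() < 0.5:                 # mirror to cover both hands
        out[:, 0] = 2 * c[0] - out[:, 0]
    return out

def photometric_jitter(img):
    """Brightness/contrast jitter with a multiplier alpha in [0.6, 1.4]."""
    alpha = rng.uniform(0.6, 1.4)
    return np.clip(img.astype(np.float32) * alpha, 0, 255).astype(np.uint8)
```

In practice each transform would be sampled independently per training example, so the classifier never sees the same pose-illumination combination twice.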
These spatial representations are then passed to temporal modules, such as long-term recurrent convolutional networks [17] or factorized spatiotemporal convolutions [18], to resolve gesture transitions. Specifically, the backbone decomposes spatiotemporal kernels into separate spatial and temporal filters to better isolate and track subtle motion trajectories [19]. This visual stream is complemented by 3D hand landmarks extracted via MediaPipe [20], which are normalized against the wrist position and anatomical hand scale to serve as a stable geometric prior. Integrating this explicit topology with implicit CNN features ensures system robustness against visual noise and cluttered backgrounds. Finally, fused representations generate per-frame predictions, which are filtered by a softmax confidence threshold to suppress transient errors, producing stable instructions suitable for grounding in generalist foundation models [21].

3) Linguistic Buffering and Instruction Synthesis: Frame-level predictions are post-processed to produce stable words and complete commands. We store recent predictions in a sliding window of size K and accept a character only when it remains the mode of the window for a specified number of consecutive frames, which mitigates label flicker. A dedicated "Space" gesture indicates word termination. When the gesture is detected, the accumulated character sequence S = {c_1, c_2, ..., c_n} is refined by matching it to a task-specific dictionary D using Levenshtein distance [22]:

    W = argmin_{w ∈ D} Levenshtein(S, w).    (1)

The resulting words are appended to a command buffer and converted into a standardized natural-language instruction. The final instruction I is then dispatched to the VLA model for grounding and execution.

B. VLA Policy and Task Execution

The VLA policy receives the synthesized instruction I together with the robot's RGB observation O_t.
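The linguistic buffering and lexical refinement steps described above (mode-of-window de-flickering followed by the dictionary projection of Eq. (1)) can be sketched as follows. Class and parameter names are illustrative, not taken from the actual system.

```python
from collections import Counter, deque

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def refine(sequence, dictionary):
    """Eq. (1): project a raw character sequence onto the closest word."""
    return min(dictionary, key=lambda w: levenshtein(sequence, w))

class CharBuffer:
    """Accept a character only when it is the mode of the last K frames
    for `hold` consecutive frames (label de-flickering).  The caller is
    expected to de-duplicate a character that stays accepted."""
    def __init__(self, K=10, hold=5):
        self.window = deque(maxlen=K)
        self.hold = hold
        self.streak, self.last = 0, None

    def push(self, ch):
        self.window.append(ch)
        mode = Counter(self.window).most_common(1)[0][0]
        self.streak = self.streak + 1 if mode == self.last else 1
        self.last = mode
        return mode if self.streak >= self.hold else None
```

For example, a flickery stream spelling "GRAP" would still be projected to "GRAB" if that is the nearest entry in the task dictionary.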
It performs multimodal fusion through cross-attention to align linguistic descriptors with visual entities in the scene and to form a grounded task representation. Based on the fused representation, the policy predicts control commands including end-effector motion and gripper state. Execution runs in closed loop, with actions updated from visual feedback until the specified instruction is completed.

III. MODEL AND TRAINING

A. Sign Language Model

1) Description: In this work, we focus on a real-time alphabet-level finger-spelling recognition module as the primary sign language interface for robotic control. This design is motivated by the requirements of robustness, low latency, and deployment feasibility in safety-critical embodied environments. Compared with large-scale continuous sign language models, alphabet-level interaction provides a reliable and interpretable communication channel, which is particularly suitable for real-world human-robot interaction.

The proposed perception module follows a gloss-free paradigm. Instead of relying on intermediate gloss annotations, the system directly maps visual gestures to semantic tokens in the form of characters and words. This avoids the information bottleneck introduced by gloss supervision and simplifies data collection and deployment.

Specifically, we adopt a lightweight sign recognition pipeline based on hand landmark estimation. The system utilizes MediaPipe Hands to extract 3D hand keypoints from RGB frames in real time. These skeletal representations provide robustness against variations in lighting, background, and user appearance. The extracted landmarks are then normalized and processed by a lightweight classifier to recognize isolated American Sign Language (ASL) alphabet gestures.

This alphabet-level design enables stable and low-latency interaction, allowing users to compose complex instructions through finger spelling.
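The landmark normalization step can be sketched as below. MediaPipe Hands returns 21 landmarks with the wrist at index 0; taking the wrist-to-middle-finger-MCP distance (index 9) as the "anatomical hand scale" is an assumption made for this illustration, as the paper does not specify which distance it uses.

```python
import numpy as np

def normalize_landmarks(lm):
    """Normalize (21, 3) MediaPipe-style hand landmarks: translate so the
    wrist (index 0) sits at the origin, then divide by the hand scale,
    taken here as the wrist-to-middle-finger-MCP distance (index 9)."""
    lm = np.asarray(lm, dtype=np.float32)
    centered = lm - lm[0]                  # wrist-relative coordinates
    scale = np.linalg.norm(centered[9])    # anatomical hand scale
    return centered / max(scale, 1e-6)     # guard against degenerate frames
```

After this step the feature vector is invariant to where the hand appears in the frame and to how close it is to the camera, which is what lets a lightweight classifier generalize across users.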
The modular architecture further allows seamless integration with downstream linguistic processing and Vision-Language-Action models.

2) Model Training: To improve robustness and generalization, the alphabet recognition module is trained using a combination of prototypical gesture samples and data augmentation. Since real-world deployment involves diverse users and environments, we employ geometric and photometric transformations to simulate variations in hand orientation, scale, and illumination.

Specifically, random rotations, scaling, and horizontal flipping are applied to increase invariance to viewpoint and user habits. In addition, brightness and contrast perturbations are introduced to improve robustness under different lighting conditions. These strategies effectively expand the training distribution and reduce overfitting.

During training, the classifier is optimized using supervised learning to predict the alphabet class from normalized landmark features. Confidence-based filtering and temporal smoothing are further applied during inference to suppress noise and ensure stable character prediction in continuous interaction scenarios. This lightweight training strategy enables microsecond-level inference and supports high-frequency real-time control, which is essential for embodied robotic systems.

B. VLA Model

To bridge the gap between interpreted sign-language instructions and physical robotic execution, our framework utilizes the GR00T N1 foundation model [21]. This VLA architecture serves as a generalist decision-making engine, integrating synthesized linguistic instructions I with real-time visual feedback to generate low-level motor commands.

The model implements a dual-system processing paradigm inspired by human cognitive systems [23]. System 2 functions as the reasoning backbone, employing the NVIDIA Eagle-2 VLM [24] to perform semantic grounding.
Operating at a frequency of 10 Hz, this module processes egocentric RGB observations O_t alongside the sign-language instructions to define high-level task goals. To optimize performance for real-time interaction, latent embeddings are extracted from the 12th layer of the VLM component, as this middle-layer representation provides a superior balance between inference speed and task success rates compared to final-layer embeddings. This allows the model to effectively correlate linguistic tokens from manual gestures with corresponding visual entities in the workspace.

For low-level motion synthesis, System 1 utilizes a Diffusion Transformer (DiT) [25] optimized via an action flow-matching objective [26]. This action module operates at a control frequency of 120 Hz, ensuring fluid and reactive motor control. To maintain temporal coherence and suppress execution jitter, the system implements action chunking [27] with a horizon of H = 16, predicting a sequence of future action vectors A_t = [a_t, a_{t+1}, ..., a_{t+H−1}] in a single inference pass. This iterative denoising process enables the robot to dynamically adjust its trajectory based on continuous visual feedback until the gestural command is fulfilled.

A salient feature of this integration is its native support for cross-embodiment adaptation. The framework manages hardware heterogeneity through embodiment-specific state and action encoders implemented as Multi-Layer Perceptrons [28]. These modules project the raw proprioceptive data q_t of various robotic systems (ranging from the Franka Emika Panda arm used in this study to other tabletop manipulators or complex humanoid configurations) into a unified, shared embedding space. Consequently, our sign-language perception pipeline can drive diverse mechanical embodiments by grounding specialized manual gestures into millimeter-accurate Cartesian setpoints without requiring hardware-specific structural modifications [29].
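A minimal sketch of the action-chunking execution pattern (H = 16 actions per inference pass, consumed one per control tick) is shown below. The `policy_infer` stub stands in for the DiT flow-matching head, whose actual interface is not part of this paper; a 7-DoF action vector is assumed purely for illustration.

```python
import numpy as np

H = 16  # action horizon predicted per inference pass

def policy_infer(obs):
    """Stand-in for the System-1 action head: returns an (H, dof) chunk.
    A real system would run iterative flow-matching denoising here."""
    return np.zeros((H, 7), dtype=np.float32)

class ChunkExecutor:
    """Consume a predicted action chunk one step per control tick
    (nominally 120 Hz), re-querying the policy when it is exhausted."""
    def __init__(self):
        self.chunk, self.idx = None, 0

    def step(self, obs):
        if self.chunk is None or self.idx >= len(self.chunk):
            self.chunk, self.idx = policy_infer(obs), 0  # refill chunk
        a = self.chunk[self.idx]
        self.idx += 1
        return a
```

The design point this illustrates is the rate decoupling: the expensive policy call runs once per H ticks, while the control loop still emits an action every tick.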
IV. EXPERIMENTS

A. Experimental Setup

The experimental framework is designed to validate the end-to-end integration of sign-language perception with embodied robotic decision-making. The primary hardware platform comprises a Franka Emika Panda, a 7-DOF collaborative manipulator utilized for its high-fidelity torque sensing and precision. The robot is interfaced via the Robot Operating System (ROS) Noetic distribution running on an Ubuntu 20.04 LTS environment. We leverage the franka_ros and libfranka libraries to maintain a high-frequency control loop, while the high-level Vision-Language-Action (VLA) policy operates asynchronously to accommodate the computational demands of visual transformer inference.

Visual perception is facilitated by an Intel RealSense RGB-D camera mounted in an eye-in-hand configuration on the robot's flange. This placement provides the GR00T VLA model with a dynamic, egocentric perspective of the workspace, which is essential for precise manipulation and reactive grasping. To ensure low-latency performance, all computations, including the sign language recognition pipeline and the VLA model inference, are executed on a dedicated workstation equipped with an NVIDIA RTX series GPU.

B. Spatial Calibration and Scaling

A fundamental challenge in bridging linguistic intent with physical action is the accurate mapping of the camera's optical frame to the robot's base coordinate system. To address this, we implement a spatial scaling and calibration procedure utilizing a ChArUco board. This hybrid target, which integrates ArUco markers within a traditional chessboard pattern, provides robustness against partial occlusions and varying lighting conditions. The calibration routine involves the acquisition of multiple views of the ChArUco board from diverse manipulator poses.
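Once the calibration has been composed into a single camera-to-base extrinsic, grounding a camera-frame detection reduces to one homogeneous transform. The sketch below uses a made-up extrinsic purely for illustration; in the eye-in-hand setup described here, the camera-to-TCP extrinsic recovered by calibration would first be chained with the robot's current TCP-to-base pose.

```python
import numpy as np

def to_base_frame(p_cam, T_base_cam):
    """Map a 3D point from the camera optical frame into the robot base
    frame using a 4x4 homogeneous extrinsic transform."""
    p = np.append(np.asarray(p_cam, dtype=np.float64), 1.0)  # homogeneous
    return (T_base_cam @ p)[:3]

# Illustrative extrinsic: camera 0.5 m above the base, axes aligned.
T_example = np.eye(4)
T_example[2, 3] = 0.5
```

With this in place, a depth-backed pixel detection becomes a metric Cartesian setpoint that the action module can track directly.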
By solving the Perspective-n-Point (PnP) problem and applying a hand-eye calibration algorithm, the system identifies the precise extrinsic transformation between the camera and the robot's tool center point (TCP). This process is critical for the VLA model to translate pixel-space object detections into millimeter-accurate Cartesian setpoints. Furthermore, the ChArUco markers provide a known physical scale, allowing the system to normalize depth information and ensure that the action outputs are physically grounded within the workspace dimensions.

C. Linguistic Integration and VLA Workflow

The interaction pipeline begins with the custom alphabet-level perception module, which monitors a secondary vision sensor dedicated to capturing the user's hand gestures. As the user performs American Sign Language (ASL) finger-spelling, the system extracts hand landmarks via the MediaPipe framework. These coordinates are processed by a classifier to predict discrete characters in real time. To ensure linguistic coherence, we implement a temporal buffer and a lexical correction layer, which suppresses recognition noise and maps the character sequence to a task-specific dictionary.

Once a complete command is synthesized, such as "GRAB APPLE", it is dispatched as a natural language prompt to the GR00T VLA model. The model performs multimodal fusion by correlating the linguistic embeddings of the spelled-out instruction with the visual features extracted from the wrist-mounted camera. The resulting action sequence is then executed by the Panda arm through a series of joint velocity commands, maintaining a feedback loop between the visual state and the robotic motion.

V. RESULTS

As the full Sign-VLA pipeline is under development, we focus on evaluating the core architectural design. These controlled experiments provide strong evidence of the effectiveness of our approach.
A. Sign Language Perception Benchmark

We first evaluate the performance of our sign language perception module on standard ASL finger-spelling benchmarks. Unlike conventional sign recognition systems that focus solely on classification accuracy, our goal is to enable robust and low-latency interaction with embodied agents. Therefore, we emphasize real-time stability, temporal consistency, and robustness to viewpoint changes.

We compare our model with several baseline approaches, including frame-wise classifiers, temporal sequence models, and spatiotemporal architectures. All methods are evaluated under identical lighting and motion conditions using both offline and real-time protocols. In particular, we assess the performance under two classification scales (100 and 500 classes) to reflect both constrained and large-vocabulary interaction scenarios.

As shown in Table I, spatiotemporal architectures consistently outperform frame-based and purely temporal models. The conventional CNN+LSTM framework [17] achieves reasonable accuracy but suffers from limited robustness in large-scale settings. Similarly, 3D CNN models [19] demonstrate strong temporal modeling capabilities but exhibit reduced performance due to increased computational complexity and sensitivity to viewpoint variations.

In contrast, hybrid spatiotemporal approaches such as ResNet with factorized (2+1)D convolutions [18] achieve significantly higher accuracy, particularly in the large-class regime. This improvement indicates that decoupling spatial and temporal modeling enables more efficient feature learning and better generalization across diverse gesture patterns. Notably, the ResNet (2+1)D backbone achieves the highest performance across both evaluation scales, demonstrating strong capability in capturing fine-grained hand motion and gesture dynamics.

We further evaluate skeleton-based approaches [30] to investigate robustness to appearance variations.
While skeleton-based methods are more invariant to lighting and background changes, they underperform RGB-based spatiotemporal models due to limited expressiveness in capturing subtle finger articulations. This observation motivates the use of hybrid visual representations in downstream embodied interaction.

Fig. 1: Qualitative demonstration of the Sign-VLA policy across three distinct tasks. The sequences show the robot successfully executing instructions for (a) color-specific objects (adjusting a red bottle), (b) localized target areas (target zone interaction), and (c) basic geometric shapes (geometric object placement).

Table I: Performance comparison on isolated sign language recognition. We report Top-1 accuracy on the CSL dataset under different model families.

    Method            100 Classes   500 Classes
    CNN + LSTM        82.08%        71.71%
    ResNet + LSTM     93.54%        83.17%
    3D CNN            58.86%        45.07%
    3D ResNet34       94.78%        81.61%
    ResNet (2+1)D     98.68%        94.85%
    Skeleton + LSTM   84.30%        70.62%

Table II: Continuous sign language recognition results on CSL.

    Method                          WER ↓    Loss
    Encoder-Decoder (word level)    1.01%    0.0346
    Encoder-Decoder (char level)    1.19%    0.0494

Beyond classification accuracy, we analyze the temporal stability of predictions in real-time settings. We observe that models with explicit temporal modeling produce smoother and more consistent outputs, which is critical for downstream robotic control. In contrast, frame-wise methods suffer from prediction jitter, leading to unstable command interpretation.

Overall, these results demonstrate that the learned sign representations are not only accurate but also temporally coherent and robust under real-world conditions. Such properties are essential for embodied agents, where gesture inputs must be interpreted reliably in dynamic environments. Based on these findings, we adopt the ResNet (2+1)D backbone as the default sign encoder in our Sign-VLA framework.
B. Preliminary VLA Evaluation

To evaluate the feasibility of sign-conditioned embodied control, we conduct a series of preliminary experiments under a controlled evaluation setting. Since the full end-to-end Sign-VLA framework is still under development, we focus on isolating the effectiveness of sign-derived semantic representations in robotic decision-making.

Table III summarizes the results, and Figure 1 illustrates the corresponding execution sequences. We observe that sign-conditioned policies achieve performance comparable to language-based control, with only a small performance gap. This result indicates that the proposed sign representation can serve as an effective alternative to natural language instructions for embodied agents.

Furthermore, temporal smoothing and lexical correction significantly improve robustness by reducing prediction jitter and stabilizing command sequences. As a result, the gap between sign- and text-conditioned control is further reduced, demonstrating the importance of temporal consistency in real-world gesture-based interaction.

These preliminary findings provide strong evidence that sign language can function as a natural and accessible interface for robotic systems. Future work will focus on large-scale training and end-to-end optimization to further improve generalization and robustness.

Table III: Preliminary evaluation of sign-conditioned VLA control.

    Instruction                 Success Rate (%)   Time (s)   Stability
    Text                        86.5               6.2        High
    Sign                        79.3               7.1        Medium
    Sign + Temporal Smoothing   84.7               6.6        High

VI. CONCLUSION

In this paper, we present a sign language-driven Vision-Language-Action (VLA) framework that enables intuitive and inclusive human-robot interaction through a real-time alphabet-level finger-spelling interface.
Unlike conventional systems that rely on speech or text as the primary instruction modality, our approach treats manual gestures as a native and accessible communication channel for embodied agents.

Fig. 2: Illustration of a potential future extension of our framework using a transformer-based gloss-free sign language model. The encoder extracts spatial and temporal features from continuous sign videos, while the decoder generates natural language instructions that can be directly grounded by the VLA policy [15].

We develop a robust Sign-to-Word perception pipeline that integrates geometric normalization, temporal smoothing, and lexical refinement to achieve stable and low-latency gesture recognition. The proposed design effectively transforms continuous gesture streams into coherent symbolic instructions suitable for token-based reasoning and robotic control. Experimental results demonstrate that the learned sign representations are both accurate and temporally consistent, which is critical for reliable downstream decision-making in real-world environments.

Furthermore, preliminary experiments on robotic manipulation tasks show that alphabet-level sign-conditioned control achieves performance comparable to language-based policies. These findings validate the feasibility of finger-spelling as a practical and interpretable interface for embodied systems, particularly in safety-critical and latency-sensitive scenarios.

Overall, this work provides an important step toward inclusive and multimodal embodied intelligence by establishing a scalable and robust alphabet-level interaction paradigm for human-robot collaboration.

VII. FUTURE WORK

Although the current system focuses on a real-time alphabet-level finger-spelling interface for robustness and low latency, an important future direction is to extend the proposed framework toward continuous and large-vocabulary sign language understanding.
In particular, we plan to explore transformer-based gloss-free sign language translation models, such as Signformer [15]. Compared with traditional gloss-based pipelines, these models directly translate sign language videos into natural language without relying on intermediate gloss annotations. This paradigm reduces the need for expert labeling, alleviates the information bottleneck caused by gloss supervision, and improves scalability across different sign languages.

Signformer represents a promising candidate for this extension due to its lightweight transformer-based design and deployment efficiency. Unlike many recent approaches that rely on large pretrained models such as CLIP [31] or large language models, Signformer is trained from scratch and optimized for computational efficiency. This makes it particularly suitable for real-time robotic and edge computing scenarios.

As illustrated in Fig. 2, such models typically adopt an encoder-decoder architecture [32]. The encoder processes visual features extracted from sign language videos through spatial embedding, temporal modeling, and attention mechanisms to capture long-range dependencies. A convolutional or hybrid module can further enhance temporal feature extraction by modeling gesture continuity and motion dynamics. Residual connections and feed-forward layers improve representation robustness.

The decoder then generates sentence-level instructions through cross-attention between encoded visual representations and linguistic tokens. Context-aware position encoding methods, such as contextual position encoding (CoPE) [33], can further improve alignment between visual gestures and textual outputs. This allows the system to capture both global semantic structure and fine-grained temporal context.

We also plan to investigate large-scale training strategies using community-driven datasets such as ASL Citizen [34].
Future training pipelines may incorporate frame-level visual feature extraction using convolutional encoders or spatiotemporal backbones to capture hand shape, motion patterns, and body posture. These features will be integrated into transformer-based architectures to learn robust spatial-temporal representations.

Importantly, the modular design of the proposed Sign-VLA framework enables seamless integration of continuous sign understanding models. The current alphabet-level perception module can be replaced by a sentence-level sign translation component without modifying the downstream VLA policy. This extension is expected to enable richer semantic grounding, more natural human-robot interaction, and improved generalization across diverse tasks and environments.

REFERENCES

[1] J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang et al., "GR00T N1: An open foundation model for generalist humanoid robots," arXiv preprint arXiv:2503.14734, 2025.
[2] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn, "OpenVLA: An open-source vision-language-action model," arXiv preprint arXiv:2406.09246, 2024.
[3] S. Fang, C. Chen, L. Wang, C. Zheng, C. Sui, and Y. Tian, "SignLLM: Sign language production large language models," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 6622–6634.
[4] M. Müller, Z. Jiang, A. Moryossef, A. R. Gonzales, and S. Ebling, "Considerations for meaningful sign language machine translation based on glosses," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023, pp. 682–693.
[5] Z. Tang, C. Wang, and Z.
Ding, “Unmatched disturbance rejection for amb systems via dobc approach, ” in 2016 35th Chinese Contr ol Confer ence (CCC) . IEEE, 2016, pp. 5931–5935. [6] Z.-z. T ang, Y .-j. Y u, Z.-h. Li, and Z.-t. Ding, “Disturbance rejection via iterativ e learning control with a disturbance observer for active magnetic bearing systems, ” Fr ontiers of Information T echnolo gy & Electronic Engineering , vol. 20, no. 1, pp. 131–140, 2019. [7] Z. T ang, C. Passmore, J. A. Rossiter, S. Ebbens, G. Dunderdale, and G. Panoutsos, “Disturbance observer-based optimal tracking control for slot coating process with mismatched input disturbances, ” in 2024 UKACC 14th International Conference on Control (CONTR OL) . IEEE, 2024, pp. 55–56. [8] Z. T ang, J. A. Rossiter , X. Jin, B. Zhang, and G. Panoutsos, “Output tracking for uncertain time-delay systems via robust reinforcement learning control, ” in 2024 43rd Chinese Contr ol Confer ence (CCC) . IEEE, 2024, pp. 2219–2226. [9] Z. T ang, J. A. Rossiter , and G. Panoutsos, “ A reinforcement learning- based approach for optimal output tracking in uncertain nonlinear sys- tems with mismatched disturbances, ” in 2024 UKACC 14th International Confer ence on Contr ol (CONTR OL) . IEEE, 2024, pp. 169–174. [10] N. Bai, C. P . Chan, Q. Y in, T . Gong, Y . Y an, and Z. T ang, “Deep reinforcement learning optimization for uncertain nonlinear systems via ev ent-triggered robust adaptiv e dynamic programming, ” arXiv preprint arXiv:2512.15735 , 2025. [11] Z. Luo, P . Zhang, X. Ding, Z. T ang, C. W ang, and J. W ang, “ Adaptive affine formation maneuver control of second-order multi-agent systems with disturbances, ” in 2020 16th International Confer ence on Control, Automation, Robotics and V ision (ICARCV) . IEEE, 2020, pp. 1071– 1076. [12] O. Onuoha, S. Kura wa, Z. T ang, and Y . Dong, “Discrete-time stress matrix-based formation control of general linear multi-agent systems, ” arXiv preprint arXiv:2401.05083 , 2024. [13] S. Albanie, G. V arol, L. 
Momeni, H. Bull, T . Afouras, H. Chowdhury , N. Fox, B. W oll, R. Cooper , A. McParland et al. , “Bbc-oxford british sign language dataset, ” arXiv pr eprint arXiv:2111.03635 , 2021. [14] Y . Luo, Z. Y ang, F . Meng, Y . Li, J. Zhou, and Y . Zhang, “ An empirical study of catastrophic forgetting in large language models during con- tinual fine-tuning, ” IEEE T ransactions on Audio, Speec h and Language Pr ocessing , 2025. [15] E. Y ang, “Signformer is all you need: T owards edge ai for sign language, ” 2024. [16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition, ” in Pr oceedings of the IEEE Confer ence on Computer V ision and P attern Recognition (CVPR) , 2016. [17] J. Donahue et al. , “Long-term recurrent conv olutional networks for visual recognition and description, ” Proceedings of the IEEE Confer ence on Computer V ision and P attern Recognition (CVPR) , 2015. [18] D. Tran, H. W ang, L. T orresani, J. Ray , Y . LeCun, and P . Manohar , “ A closer look at spatiotemporal conv olutions for action recognition, ” in Pr oceedings of the IEEE Conference on Computer V ision and P attern Recognition (CVPR) , 2018. [19] K. Hara, H. Kataoka, and Y . Satoh, “Can spatiotemporal 3d con volu- tional networks be pre-trained on imagenet?” in Pr oceedings of the IEEE Confer ence on Computer V ision and P attern Reco gnition (CVPR) , 2018. [20] C. Lugaresi, J. T ang, H. Nash, C. McClanahan et al. , “Medi- apipe: A framew ork for building perception pipelines, ” arXiv pr eprint arXiv:1906.08172 , 2019. [21] S. Reed, R. Zheng, G. W ang, J. Bjorck, J. Jang, A. Zhang et al. , “Groot n1: An open foundation model for generalist humanoid robots, ” arXiv pr eprint arXiv:2503.14734 , 2025. [22] V . I. Levenshtein, “Binary codes capable of correcting deletions, inser- tions, and reversals, ” Soviet physics doklady , vol. 10, no. 8, pp. 707–710, 1966. [23] D. Kahneman, Thinking, F ast and Slow . Farrar , Straus and Giroux, 2011. [24] Z. Li, G. Chen, S. 
Liu, S. W ang et al. , “Eagle 2: Building post-training data strategies from scratch for frontier vision-language models, ” arXiv pr eprint arXiv:2501.14818 , 2025. [25] W . Peebles and S. Xie, “Scalable dif fusion models with transformers, ” in Proceedings of the IEEE/CVF International Confer ence on Computer V ision , 2023, pp. 4195–4205. [26] Y . Lipman, R. T . Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generati ve modeling, ” in The Eleventh International Confer ence on Learning Representations (ICLR) , 2023. [27] T . Z. Zhao, V . Kumar , S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware, ” in Proceedings of Robotics: Science and Systems , 2023. [28] K. Black, N. Brown, D. Driess, A. Esmail et al. , “ π 0 : A vision- language-action flow model for general robot control, ” arXiv pr eprint arXiv:2410.24164 , 2024. [29] S. Y e, J. Jang, B. Jeon, S. J. Joo et al. , “Latent action pretraining from videos, ” in The Thirteenth International Conference on Learning Repr esentations , 2025. [30] S. Y an, Y . Xiong, and D. Lin, “Spatial temporal graph conv olutional networks for skeleton-based action recognition, ” in Pr oceedings of the AAAI Conference on Artificial Intelligence , 2018. [31] A. Radford et al. , “Learning transferable visual models from natural lan- guage supervision, ” in International Confer ence on Machine Learning (ICML) , 2021. [32] A. V aswani et al. , “ Attention is all you need, ” in Advances in Neural Information Processing Systems (NeurIPS) , 2017. [33] O. Golovnev a et al. , “Cope: Contextual position encoding for transform- ers, ” in Pr oceedings of the 61st Annual Meeting of the Association for Computational Linguistics (A CL) , 2023. [34] A. Desai, L. Berger , F . O. Minakov , V . Milan, C. Singh, K. Pumphrey , R. E. Ladner, H. D. III, A. X. Lu, N. Caselli, and D. Bragg, “ Asl citizen: A community-sourced dataset for advancing isolated sign language recognition, ” 2023.