HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness



Zihao Zheng¹, Zhihao Mao², Sicheng Tian³, Jiayu Chen¹, Maoliang Li¹, Xinhao Sun⁴, Zhaobo Zhang¹, Xuanzhe Liu¹, Donggang Cao¹, Hong Mei¹†, Xiang Chen¹

Abstract

Vision-Language-Action (VLA) models have become the mainstream solution for robot control, but suffer from slow inference speeds. Speculative Decoding (SD) is a promising acceleration method which can be divided into two categories: drafter-based SD and retrieval-based SD. Existing methods fail to analyze the advantages and disadvantages of these two types of SD in VLA models, leading to their sole application or optimization. In this paper, we analyze the trajectory patterns of robots controlled by the VLA model and derive a key insight: the two types of SD should be used in a hybrid manner. However, achieving hybrid SD in VLA models poses several challenges: (1) draft rejection and persistent errors in retrieval-based SD; (2) difficulty in determining the hybrid boundary. To address these, we propose the HeiSD framework. We propose a retrieval-based SD optimization method in HeiSD, which contains a verify-skip mechanism and a sequence-wise relaxed acceptance strategy. Moreover, we propose a kinematic-based fused metric in HeiSD to automatically determine the hybrid boundary. Experimental results demonstrate that HeiSD attains a speedup of up to 2.45× in simulation benchmarks and 2.06×∼2.41× in real-world scenarios, while sustaining a high task success rate.

1. Introduction

Vision-Language-Action (VLA) models have emerged as the mainstream solution for Embodied Intelligence (Ma et al., 2024; Zhang et al., 2024).

¹School of Computer Science, Peking University, Beijing, China. ²School of Computer Science, China University of Geosciences, Wuhan, China. ³School of Artificial Intelligence, Beijing Normal University, Beijing, China. ⁴School of EECS, Peking University, Beijing, China. Correspondence to: Xiang Chen <xiang.chen@pku.edu.cn>. Preprint. March 19, 2026.

Despite their impressive performance, the heavy computational demands of VLA models limit their inference speed, preventing them from meeting real-time requirements (Wen et al., 2025; Budzianowski et al., 2025). To boost the inference speed of VLA models, existing work has integrated large-model inference optimization techniques into VLAs, covering model architecture innovation (Liu et al., 2024a; Wen et al., 2025; Budzianowski et al., 2025; Pertsch et al., 2025), model compression (Zheng et al., 2025; 2026a; Park et al., 2024; Li et al., 2025), runtime optimization (Zhang et al., 2025; Song et al., 2025; Yue et al., 2024) and deployment design (Zheng et al., 2026c). Among such runtime optimization methods, Speculative Decoding (SD) (Wang et al., 2025; Zheng et al., 2026b) is a promising method that can accelerate inference.

The essence of SD lies in employing low-cost methods to rapidly generate token sequences. These sequences then undergo parallel verification and selective acceptance by Large Language Models (LLMs), thereby enhancing inference speed. Existing SD methods can be categorized into two types: drafter-based SD (Wen et al., 2024; Yan et al., 2025) and retrieval-based SD (Cho et al., 2025; He et al., 2024; Lee et al., 2025). Drafter-based SD leverages a small model (trained from scratch or fine-tuned) to generate draft token sequences.
In contrast, retrieval-based SD does not rely on a dedicated draft model; instead, it uses a prebuilt vector database and retrieves relevant content from it to obtain draft token sequences.

When adapted for VLA models, both drafter-based and retrieval-based SD exhibit distinct advantages and limitations. As Fig. 1(a) shows, drafter-based SD can provide high-quality drafts with a high probability of passing verification, but it must bear the inference overhead of the draft model. Retrieval-based SD eliminates the overhead of the draft model and has a higher theoretical speedup, but suffers from low draft quality.

In this study, we first construct a database and develop an analysis of retrieval drafts. Our analysis reveals that some trajectory segments guided by retrieval drafts align closely with VLA inference, while others exhibit deviations. Based on this observation, we derive a key insight: employing retrieval-based SD for the overlapping segments and drafter-based SD for the non-overlapping segments enables leveraging the advantages of both SD approaches simultaneously.

Submission and Formatting Instructions for ICML 2026

[Figure 1. Overview of the Proposed HeiSD Framework. Panel (a) contrasts drafter-based SD (a draft model proposes tokens a'_0..a'_4 for VLA verification) with retrieval-based SD (a database lookup replaces the drafter). Panel (b) summarizes the trade-off: drafter-based SD offers high draft quality but low theoretical speed and high overhead; retrieval-based SD offers low draft quality but high theoretical speed and low overhead. The lower panels preview the two challenges (retrieval-based SD needs optimization; the hybrid boundary is hard to determine) and HeiSD's solutions (adaptive verify-skip, sequence-wise relaxed acceptance, and a kinematic-based fused metric for the boundary).]

However, achieving this kind of hybrid SD in VLA models faces challenges. First, retrieval-based SD suffers from low-quality drafts, which are hard to pass the verification process and cause persistent errors. Thus, specific optimization is needed for retrieval-based SD. Second, it is hard to determine the hybrid boundary, i.e., to identify which step should employ drafter-based SD and which should use retrieval-based SD.

To address these challenges, we first propose an optimization method for retrieval-based SD. We propose an adaptive verify-skip mechanism that selectively skips the verification process for certain drafts to avoid strict rejection. Furthermore, we propose a sequence-wise relaxed acceptance strategy to enhance the diversity of drafts during verification and accept drafts with minor bias without compromising accuracy, thereby avoiding persistent errors. After that, we develop a kinematic-based fused metric to automatically determine the hybrid boundary, enabling automatic decision and switching during VLA multi-step inference. These designs form an end-to-end hybrid SD framework for VLA models, called HeiSD.

We believe HeiSD will play a role in the future development of embodied intelligence and community building. In summary, our contributions are below:

• We conducted a detailed analysis and gained a key insight: hybrid use of drafter-based SD and retrieval-based SD in VLA models leads to better performance.

• We optimize retrieval-based SD and propose an adaptive verify-skip mechanism along with a sequence-wise relaxed acceptance strategy.

• We developed a kinematic-based fused metric to automatically determine the hybrid boundary, thus forming the HeiSD framework.
• Experimental results demonstrate that HeiSD attains a speedup of up to 2.45× in simulation benchmarks and 2.06×∼2.41× in real-world scenarios, while sustaining a high task success rate.

2. Preliminary

2.1. Vision-Language-Action Models

VLA models usually comprise three core components (Kim et al., 2024; Zitkovich et al., 2023): vision encoders (converting the visual modality into tokens), an LLM backbone (fusing multi-modal information and enabling reasoning), and an action de-tokenizer (decoding output tokens into actions).

Each VLA generation outputs an action slice, which is a 7-dimensional vector representing 7 Degrees of Freedom (DoF): the position X, Y, Z of the end gripper, the joint rotation angles r_X, r_Y, r_Z, and a binary gripper control signal G. Each DoF is encoded as a token a_j. VLA models autoregressively predict the most probable token a_j based on the previous tokens a_{0:j-1}, visual observations O, language prompts P, and the learnable model parameters W, as Eq. (1) shows:

$$a_j = \arg\max_{a_j} P\left(a_j \mid a_{0:j-1}, O, P, W\right). \quad (1)$$

2.2. Speculative Decoding

The core idea of SD is to use low-cost methods for rapid token sequence acquisition and employ LLMs for parallel verification of these tokens, thus avoiding slow autoregressive generation. Existing SD falls into two categories: drafter-based and retrieval-based. The former leverages a small draft model M_D to generate token sequences, and uses the LLM as a verification model M_V. This process can be written as Eq. (2), in which a_j denotes the tokens, f_t the hidden features, and e_t the token embeddings:

$$\text{Draft: } a_j = M_D\left(f_{1:t}, e_{0:t}, a_{t+1:j-1}\right). \qquad \text{Verify: } \hat{a}_j = M_V\left(a_{0:j-1}, P, W\right), \quad \begin{cases} \hat{a}_j = a_j, & \text{Accept} \\ \hat{a}_j \neq a_j, & \text{Discard} \end{cases}$$
(2)

[Figure 2. Kinematic-Based Trajectory Analysis and Real-world/Simulation Validation for both VLA Inference and Database Retrieval. Panel (a): a successful case (LIBERO-Goal, "push plate to the front of the stove") where VLA inference and database retrieval both succeed; their X/Y/Z trajectories overlap with only small deviation and the end points match. Panel (b): a failed retrieval case on the same task with large trajectory deviation and end-point mismatch. Panel (c): matching trajectory fragments in simulation and the real world show high similarity between retrieved actions and VLA-inferred actions.]

In contrast, retrieval-based SD does not need a draft model; instead, it uses a database to retrieve draft token sequences. Its draft process can be written as Eq. (3), where DB denotes the prebuilt database:

$$\text{Draft: } a_j = \mathrm{Retr}_i\left(f_{1:t}, e_{0:t}, a_{t+1:j-1}\right) \in \mathrm{DB}. \quad (3)$$

When applied to VLA models, the two types of SD each have distinct advantages and drawbacks. Drafter-based SD necessitates online maintenance of a draft model: while smaller than a VLA model, this drafter still incurs memory consumption and additional computational overhead/latency, though its high-quality drafts support longer acceptable sequences and achieve actual acceleration. In contrast, retrieval-based SD eliminates the overhead associated with the drafter model and offers theoretical performance benefits; however, the retrieved drafts suffer from distribution mismatch, which hinders verification and restricts the theoretical acceleration.

3.
Observation and Motivation

This section first details the construction of the database and its performance test, then presents a database retrieval trajectory analysis. Based on the analysis results, we derive insights into achieving hybrid SD in VLA inference and outline the corresponding challenges.

3.1. Database Construction and Performance Test

Based on the common LIBERO datasets (Liu et al., 2023) (details in Appendix B), we first construct a robust vector database. Leveraging the Qdrant (Qdrant Team, 2023) infrastructure, we define key components including vector representation, database structure, and sharding strategy, with details provided in Appendix C. To maximize retrieval accuracy, all data from the LIBERO dataset are incorporated into the database. Necessary vector data statistics, retrieval tests, retrieval latency and confidence analysis of the database are conducted in Appendix C.2.

Table 1. Database Performance on LIBERO Benchmark

  Model     Environment      VLA Inference    Database Retrieval
                             SR      Speed    SR      Speed
  OpenVLA   LIBERO-Goal      77.0%   1.00×    62.0%   4.17×
  OpenVLA   LIBERO-Object    71.2%   1.00×    68.0%   4.83×
  OpenVLA   LIBERO-Spatial   82.8%   1.00×    53.0%   3.98×
  OpenVLA   LIBERO-Long      54.4%   1.00×    18.0%   3.74×

Furthermore, we use only database retrieval to finish various tasks in the LIBERO (Liu et al., 2023) benchmark. For comparison, we report the Success Rate (SR) and speed of OpenVLA model (Kim et al., 2024) inference. As shown in Tab. 1, models relying solely on database retrieval can still complete certain tasks (e.g., 68.0% on a simple task suite like Object) at high speed (3.74×∼4.83×). Moreover, the database remains effective even on challenging task suites (e.g., Spatial and Long). Overall, the task completion rate of database retrieval is lower than that of VLA model inference. We also evaluate the performance of database scaling and find that scale expansion does not improve the SR (Appendix C.2).

3.2.
Database Retrieval Trajectory Analysis

We further compare the kinematic trajectories (in both simulation and real-world environments) that come from database retrieval and VLA inference. In Fig. 2(a) (where both retrieval and inference succeed), numerous trajectory segments from the database and VLA inference overlap significantly (green region), confirming the precision of retrieved results in these segments. Though some segments are biased with respect to the trajectory of VLA inference (red region), since most segments are accurate, the final endpoints can be matched (i.e., task success). In Fig. 2(b) (where retrieval failed but inference succeeded), such overlaps are scarce, resulting in endpoint mismatch.

Table 2. Performance Comparison of Two Types of SD

  Dataset          VLA+SD (SpecVLA)         Retrieval-Based SD
                   SR      AL     Speed     SR      AL     Speed
  LIBERO-Goal      71.0%   2.97   1.00×     77.0%   1.03   0.94×
  LIBERO-Object    62.4%   3.25   1.00×     76.0%   0.93   0.92×
  LIBERO-Spatial   80.4%   3.27   1.00×     78.0%   0.81   0.78×
  LIBERO-Long      46.2%   2.82   1.00×     50.2%   0.82   0.97×

[Figure 3. Adaptive Verify-Skip Mechanism. (a) Input features of the lm_head layer are hooked for each trajectory point of the VLA model; feature similarity between points is related to their point distance. (b) Historical feature similarities of trajectory points in the database feed an adaptive algorithm that decides, per point, whether to verify or to skip verification.]

Thus, we argue that database retrieval should be employed for overlapping trajectory segments, while VLA inference is suitable for non-overlapping parts.
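The retrieval path analyzed above can be illustrated with a minimal in-memory stand-in for the vector database. This is a plain NumPy cosine-similarity search over a hypothetical `ActionDB` class for illustration only; the paper's actual deployment uses Qdrant with sharding (Appendix C).

```python
import numpy as np

class ActionDB:
    """Minimal in-memory stand-in for the paper's vector database:
    keys are observation embeddings, payloads are 7-DoF action-token slices."""
    def __init__(self, dim):
        self.keys = np.empty((0, dim), np.float32)
        self.actions = []

    def add(self, emb, action_tokens):
        emb = emb / np.linalg.norm(emb)          # store unit vectors
        self.keys = np.vstack([self.keys, emb.astype(np.float32)])
        self.actions.append(action_tokens)

    def topk(self, emb, k=3):
        emb = emb / np.linalg.norm(emb)
        sims = self.keys @ emb                   # cosine similarity to every key
        order = np.argsort(-sims)[:k]
        return [(float(sims[i]), self.actions[i]) for i in order]

rng = np.random.default_rng(1)
db = ActionDB(dim=16)
for _ in range(100):
    db.add(rng.standard_normal(16), rng.integers(0, 256, size=7).tolist())

q = db.keys[42] + 0.01 * rng.standard_normal(16)   # a query near a stored key
sim, draft = db.topk(q, k=1)[0]
assert draft == db.actions[42] and sim > 0.9       # nearest stored slice is retrieved
```

The retrieved payload is the draft token sequence handed to the verifier; Top-K (rather than Top-1) retrieval is what Section 4.2 later exploits for draft diversity.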
The effectiveness of database retrieval for overlapping segments is validated in both simulated and real-world environments. As shown in Fig. 2(c), database retrieval results for overlapping segments exhibit high similarity to those from VLA inference in terms of start points, end points, and motion trajectories, but with higher speed. Guided by this conclusion, implementing hybrid SD in VLA inference is a promising approach: it employs retrieval-based SD in green regions and drafter-based SD in red regions, thereby preserving accuracy while further enhancing speed to achieve a new Pareto frontier.

3.3. Challenges of Achieving Hybrid SD

However, achieving hybrid SD in VLA inference is non-trivial and faces several challenges. First, as shown in Tab. 2, achieving theoretical acceleration with retrieval-based SD for VLA models is hard, for the following reasons: (1) the distribution of retrieved drafts mismatches that of VLA inference results; thus, even if the trajectories overlap, retrieved drafts rarely pass strict verification, leading to a low accept length (AL) and poor speedup; (2) furthermore, due to the absence of a drafter, the system repeatedly retrieves identical results from the database, leading to persistent rejects. Second, automatically determining the boundary of hybrid SD remains challenging. Although trajectory segmentation is highly effective, the inability to predict the complete trajectory during inference complicates the determination of segmentation boundaries.

4. Retrieval-Based SD Optimization

In this section, we propose a novel verify-skip mechanism and a sequence-wise relaxed acceptance strategy, aiming to boost retrieval-based SD to its theoretical speed.

4.1. Adaptive Verify-Skip Mechanism

Due to distributional misalignment with VLA inference outputs, retrieved drafts often fail verification.
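Why strict verification truncates near-miss retrieved drafts can be seen in a minimal greedy draft-then-verify loop in the spirit of Eqs. (2)-(3). Here `target_next` is a hypothetical deterministic stand-in for the verification model M_V, not the paper's actual model.

```python
def target_next(prefix):
    """Hypothetical stand-in for the greedy next-token choice of M_V."""
    return (7 * sum(prefix) + 3) % 256

def speculative_step(prefix, draft_tokens):
    """Verify a drafted token sequence against the target (Eq. (2)):
    accept matching tokens; on the first mismatch, discard the rest of
    the draft and keep the target's own token instead."""
    accepted, ctx = [], list(prefix)
    for d in draft_tokens:
        t = target_next(ctx)      # in practice these come from one parallel forward pass
        accepted.append(d if t == d else t)
        ctx.append(accepted[-1])
        if t != d:
            break                 # strict rejection: everything after is discarded
    return accepted

# Retrieval-based drafting (Eq. (3)): the draft comes from a database hit
# keyed by the current context rather than from a drafter model.
db = {(5,): [38, 10, 99]}
out = speculative_step([5], db[(5,)])
assert out == [38, 48]  # one token accepted, then a strict-rejection correction
```

Note how a single off-by-a-few-bins token kills the rest of the draft even when the trajectory it encodes is kinematically close; this is precisely the low-AL failure mode of Tab. 2 that the verify-skip and relaxed-acceptance designs below address.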
Given the inherent multi-solution nature of embodied tasks, we propose omitting specific verification processes when such drafts are sufficiently accurate. Therefore, a model-free selection strategy is required in retrieval-based SD to identify which steps' verification should be skipped. To achieve this, we capture the input features of the final layer (denoted as lm_head) during the verification process of each retrieved draft. Existing work (Zhang et al., 2025) proves that the input features of the final layer in VLA models are the most critical and most relevant to downstream robotic tasks.

As shown in Fig. 3(a), the feature similarity of trajectory points correlates with their distance: closer trajectory points exhibit higher feature similarity and should theoretically be directly accepted without verification, whereas farther trajectory points show lower similarity and should be less likely to skip verification. Inspired by this, we construct a verify-skip mechanism based on the feature similarity patterns of trajectory points. Specifically, we calculate the feature similarity for each trajectory point based on the historical trajectories in the database. During this process, we collect the minimum acceptable similarity and its corresponding point distance, as illustrated in the offline stage of Alg. 1.

When confronting a new task, we reuse the minimum acceptable similarity and point distance from historical information to automatically determine which trajectory points in the new task are excessively similar and thus can be skipped. Meanwhile, feedback signals are incorporated based on task completion performance. Using these signals, the algorithm automatically updates the minimum acceptable similarity and point distance after each task to enhance its performance in subsequent tasks, as shown in the online stage of Alg. 1.

Algorithm 1: Adaptive Verify-Skip Mechanism

① Offline Stage:
Input: historical trajectory points P^h_i, number of trajectory points n, point distance d, input features of the lm_head layer Feat_{P^h_i}, pre-sampling similarity boundary T.
Init: Feat_{P^h_i} ← Hook(lm_head) after Infer(P^h_i | P^h_{1:i−1}); initialize S(·,·) = 0, min_S = +∞ (so the first similarity above T is recorded), O_dist = 0.
for i = 1 to n − 1:
    for d = 1 to n − i:
        compute S(Feat_{P^h_i}, Feat_{P^h_{i+d}})
        if min_S > S(Feat_{P^h_i}, Feat_{P^h_{i+d}}) > T:
            min_S, O_dist ← S(Feat_{P^h_i}, Feat_{P^h_{i+d}}), d

② Online Stage:
Input: total task trials t_total, trajectory points of the current task P^c_i, number of trajectory points n, last-task feedback Boolean signal B_t.
Init: B_t = True; get min_S and O_dist from ①.
repeat:
    for i = 1 to n − 1:
        for d = 1 to n − i:
            compute S(Feat_{P^c_i}, Feat_{P^c_{i+d}})  // skip verification where similarity permits
    if B_t = True:  min_S += Δ|S_c − min(S_h)|,  O_dist ↑
    else:           min_S −= Δ|S_c − min(S_h)|,  O_dist ↓
    t++
until t = t_total

[Figure 4. Sequence-Wise Relaxed Acceptance Strategy: Top-K retrieval from the action vector database; grouping of the 7 draft tokens (X, Y, Z, r_X, r_Y, r_Z, G) into position-related, angle-related, and gripper sequences; a sequence-wise tree built from the Top-K sequences; and depth-first decoding of chains with sequence-level and token-level index variances.]
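Under the assumption that hooked lm_head features of nearby trajectory points drift slowly (the pattern of Fig. 3a), the offline stage of Alg. 1 can be sketched as follows. The synthetic random-walk features and the boundary T = 0.6 are illustrative stand-ins, not the paper's settings.

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def offline_scan(feats, T=0.6):
    """Offline stage of Alg. 1 (sketch): over one historical trajectory,
    record the smallest pairwise feature similarity still above the
    pre-sampling boundary T, together with the point distance d at which
    it occurs. Online, point pairs within O_dist whose similarity reaches
    min_S become candidates for skipping verification."""
    n = len(feats)
    min_S, O_dist = float("inf"), 0
    for i in range(n - 1):
        for d in range(1, n - i):
            s = cos_sim(feats[i], feats[i + d])
            if T < s < min_S:
                min_S, O_dist = s, d
    return min_S, O_dist

# Synthetic lm_head input features: a slow random walk, so similarity
# decays as the point distance grows (the pattern in Fig. 3a).
rng = np.random.default_rng(0)
f, feats = rng.standard_normal(64), []
for _ in range(20):
    f = f + 0.3 * rng.standard_normal(64)
    feats.append(f.copy())

min_S, O_dist = offline_scan(feats)
assert 0.6 < min_S < 1.0 and 1 <= O_dist < 20
```

The online stage then nudges `min_S` and `O_dist` up or down depending on the task-completion feedback signal, as in the ② loop above.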
This adaptive adjustment mechanism enables skipping verification for some trajectory points while allowing dynamic adjustment, thereby achieving minimal accuracy loss.

4.2. Sequence-Wise Relaxed Acceptance Strategy

Merely addressing the issue that drafts fail to pass verification is insufficient; retrieval-based SD also confronts the problem of persistent errors. If errors arise in the retrieved drafts, the database lacks correction capability, resulting in repeated retrieval of the same erroneous drafts. Consequently, the VLA model necessitates repeated verification, failing to achieve acceleration effects.

To solve this problem, we propose a sequence-wise relaxed acceptance strategy. Specifically, we first expand the retrieval draft by modifying the original Top-1 retrieval to Top-K matching. This approach enhances the diversity of retrieved drafts. Furthermore, we define sequences in the draft as groups that contain tokens with strong kinematic correlation. As shown in Fig. 4, X, Y and Z denote the end-gripper positions (all position-related) and are thus grouped into a sequence; similarly, r_X, r_Y and r_Z represent the joint rotation angles (all angle-related) and are also packed into a sequence. For G, which is the binary control signal of the gripper, we treat it as a separate sequence.

After that, we inherit the tree decoding backbone from Eagle-2 (Li et al., 2024b). The key difference is that we construct the tree in a sequence-wise rather than a token-wise manner.
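The sequence grouping and the cross-combination of Top-K retrieval results can be sketched as follows. The draft token values and K = 3 are hypothetical; the rule that the gripper sequence is excluded from recombination follows the exception described in the text.

```python
from itertools import product

# Hypothetical Top-K retrieved drafts (K = 3); each draft is 7 action-token
# ids ordered (X, Y, Z, rX, rY, rZ, G).
drafts = [
    [100, 120, 90, 40, 41, 42, 1],
    [102, 118, 93, 55, 50, 47, 1],
    [ 99, 121, 88, 41, 40, 44, 0],
]

def split(d):
    """Group one draft into kinematically correlated sequences."""
    return {"pos": tuple(d[0:3]), "rot": tuple(d[3:6]), "grip": tuple(d[6:7])}

groups = [split(d) for d in drafts]
pos_seqs = {g["pos"] for g in groups}
rot_seqs = {g["rot"] for g in groups}

# Cross-combining sequences from different retrieval results enlarges the
# draft tree. The gripper sequence is excluded from recombination; here it
# is taken from the Top-1 draft only (its state is tightly coupled to task
# success), which is one possible reading of the exception.
top1_grip = groups[0]["grip"]
chains = [p + r + top1_grip for p, r in product(pos_seqs, rot_seqs)]
assert len(chains) == len(pos_seqs) * len(rot_seqs)  # 3 x 3 = 9 candidate chains
```

Each chain then becomes a root-to-leaf path of the sequence-wise tree and is verified depth-first, as described next.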
Furthermore, when constructing the tree, we allow sequences from different retrieval results to connect with each other to maximize the potential for draft generation, except for the sequence represented by the gripper. This exception is justified by the strong correlation between the gripper state and task success rate. We verify each chain starting from the root node in a depth-first manner. In this process, a sequence-wise relaxed acceptance strategy is adopted to increase the probability of accepting diverse drafts, thereby reducing persistent errors.

[Figure 5. Kinematic-Based Metric for Achieving Hybrid SD. Overlapping segments of the VLA-inference and DB-retrieval trajectories exhibit a large curvature radius and large cumulative displacement; biased segments exhibit a small radius and small displacement. Fusing the two characteristics yields a metric whose threshold separates retrieval-based SD from drafter-based SD.]

As shown in Fig. 4, we calculate the index bias of each draft token against each verify token. When the deviation of the entire sequence bias_seq and that of individual tokens bias_{a_j} are maintained within a specific range, the entire sequence is accepted by enforcement. Note that we allow bias_{a_j} > bias_seq, meaning a single token may exhibit a larger bias provided that the overall bias of the sequence remains small, which differs distinctly from existing token-level relaxed acceptance. After multiple trials, we select bias_seq = 30 and bias_{a_j} = 15, which achieves the optimal trade-off between accuracy and speed.
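One possible reading of the sequence-wise test, with bias_seq = 30 taken as a cap on the summed in-sequence deviation, 15 as the per-token cap, and zero tolerance for the gripper, can be sketched as below. Token ids index discretized action bins, so an index difference is a meaningful deviation; the grouping and draft values are illustrative.

```python
# Group a 7-token action draft (X, Y, Z, rX, rY, rZ, G) into kinematic
# sequences and apply a sequence-wise relaxed acceptance test.
SEQS = {"pos": [0, 1, 2], "rot": [3, 4, 5], "grip": [6]}
BIAS_SEQ, BIAS_TOK = 30, 15   # the paper's selected thresholds

def accept_sequences(draft, verify):
    """Return a per-sequence accept decision under relaxed acceptance."""
    decisions = {}
    for name, idx in SEQS.items():
        biases = [abs(draft[i] - verify[i]) for i in idx]
        if name == "grip":
            decisions[name] = biases[0] == 0   # no deviation allowed for the gripper
        else:
            # the whole sequence must stay within BIAS_SEQ, each token within BIAS_TOK
            decisions[name] = sum(biases) <= BIAS_SEQ and max(biases) <= BIAS_TOK
    return decisions

draft  = [100, 120, 90, 40, 41, 42, 1]
verify = [108, 130, 95, 40, 60, 42, 1]
dec = accept_sequences(draft, verify)
assert dec == {"pos": True, "rot": False, "grip": True}
```

In this reading, the position sequence is accepted as a whole even though every token deviates slightly, while the rotation sequence is rejected because a single token (19 bins off) exceeds the per-token cap; a purely token-level scheme would reject all three biased position tokens individually.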
After all chains are validated, if all candidates for a given sequence are rejected, the VLA model is directly activated to generate the subsequent tokens. This strategy, which combines Top-K retrieval with sequence-wise relaxed acceptance, effectively mitigates the issue of persistent errors. In particular, no deviation is allowed for the binary gripper, as the correctness of the gripper is critical to task success.

5. Metric for Determining Hybrid Boundary

Having optimized retrieval-based SD, this section proposes a novel kinematic-based metric to achieve hybrid SD and automatically determine its boundary during step-by-step VLA model inference.

5.1. Issue Definition and Design Insights

The aforementioned analysis in Section 3 shows that trajectories from successful database retrievals exhibit greater overlap with VLA model inference trajectories, whereas those from failed retrievals show less overlap. However, trajectory points are generated incrementally, and the complete trajectory cannot be obtained during the multi-step inference process. Therefore, a method is required to determine the hybrid boundary, i.e., which steps should adopt retrieval-based SD and which should use drafter-based SD.

Through in-depth analysis, we identify that these trajectory segments exhibit inherent kinematic characteristics (shown in Fig. 5): overlapping trajectory segments (green region) exhibit a larger curvature radius and cumulative displacement, whereas biased segments (red region) exhibit smaller ones. Therefore, we use these two characteristics to design a metric that helps determine the hybrid boundary.

5.2. Kinematic Characteristics of Trajectory Segments

Assume the trajectory context sliding window is w. Each trajectory point in the coordinate system T(λ; O_γ) contains a spatial position x, y, z, where λ represents the distance scale and O_γ represents the origin point.
First, we calculate the geometric center C = (u_c, v_c) of the trajectory segment and project each point into 2-dimensional space to obtain the basis vector (u^{[w]}_i, v^{[w]}_i), as shown in Eq. (4). We use Eq. (5) to iteratively update the optimal geometric center, where Euclid_2-dim(·;·) denotes the 2-dimensional Euclidean distance and μ represents the mean. After that, we use Eq. (6) to calculate the curvature radius R^{[w]}. In this way, we can scan the change of R^{[w]} in a sliding window w.

$$\left(u^{[w]}_i, v^{[w]}_i\right) = \mathrm{Proj}\Big(P^i_{x,y,z} - \frac{1}{w}\sum_{i}^{w-1} P^i_{x,y,z}\Big), \quad P^i_{x,y,z} \in T. \quad (4)$$

$$\left(\hat{u}_c, \hat{v}_c\right) = \min_{C} \sum_{i}^{w-1} \Big( \mathrm{Euclid}_{\text{2-dim}}\big((u^{[w]}_i, v^{[w]}_i); C\big) - \mu \Big)^2. \quad (5)$$

$$R^{[w]} = \frac{1}{w} \sum_{i}^{w-1} \mathrm{Euclid}_{\text{2-dim}}\big((u^{[w]}_i, v^{[w]}_i); (\hat{u}_c, \hat{v}_c)\big). \quad (6)$$

Let D^{[w]} denote the cumulative displacement in the sliding window w. We use Euclid_3-dim(·;·) to calculate the displacement between two adjacent points and sum all the displacements to obtain D^{[w]} (Eq. (7)). Considering that the robot sometimes performs round-trip or circular motions, D^{[w]} does not consider the displacement direction.

$$D^{[w]} = \sum_{i}^{w-1} \mathrm{Euclid}_{\text{3-dim}}\big(P^i_{x,y,z}; P^{i+1}_{x,y,z}\big), \quad P^i_{x,y,z} \in T. \quad (7)$$

5.3. Kinematic-Based Fused Metric

We fuse R^{[w]} and D^{[w]} based on Eq. (8) to obtain a fused metric F^{[w]}, where Norm(·) is the normalization operation detailed in Appendix D. Conceptually, a larger F^{[w]} means faster movement and a trajectory closer to a straight line (a coarse-grained action); a smaller F^{[w]} corresponds to slower movement, a more curved trajectory, and a fine-grained operation.

$$F^{[w]} = \alpha \cdot \mathrm{Norm}\big(R^{[w]}\big) + (1 - \alpha) \cdot \mathrm{Norm}\big(D^{[w]}\big). \quad (8)$$

As shown in Fig. 5, F^{[w]} integrates R^{[w]} and D^{[w]} to comprehensively characterize the trajectory.
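Eqs. (4)-(8) and the routing rule can be sketched numerically as below. This is a simplified illustration: an algebraic (Kåsa) least-squares circle fit stands in for the iterative center search of Eq. (5), the 2-D projection simply drops the Z coordinate, and the squashing `norm` and threshold replace the Norm(·) of Appendix D and the paper's tuned boundary.

```python
import numpy as np

def curvature_radius(pts2d):
    """Eqs. (4)-(6) sketch: fit a circle center to the projected window
    points and return the mean point-to-center distance as R^[w]."""
    x, y = pts2d[:, 0], pts2d[:, 1]
    A = np.column_stack([2 * x, 2 * y, np.ones(len(x))])
    sol, *_ = np.linalg.lstsq(A, x**2 + y**2, rcond=None)  # center search
    center = sol[:2]
    return float(np.mean(np.linalg.norm(pts2d - center, axis=1)))

def cumulative_displacement(pts3d):
    """Eq. (7): direction-agnostic sum of adjacent point distances."""
    return float(np.sum(np.linalg.norm(np.diff(pts3d, axis=0), axis=1)))

def fused_metric(pts3d, alpha=0.5, norm=lambda v: v / (1.0 + v)):
    """Eq. (8), with an assumed squashing normalization."""
    R = curvature_radius(pts3d[:, :2])      # simple projection: drop Z
    D = cumulative_displacement(pts3d)
    return alpha * norm(R) + (1 - alpha) * norm(D)

def choose_sd(pts3d, threshold=0.3):
    """Sec. 5.3 routing rule: a large F^[w] (fast, nearly straight motion)
    routes the step to retrieval-based SD, a small one to drafter-based SD.
    The threshold value here is illustrative."""
    return "retrieval" if fused_metric(pts3d) > threshold else "drafter"

t = np.linspace(0.0, 1.0, 8)
straight = np.column_stack([t, t, np.zeros(8)])        # fast, nearly straight segment
arc = 0.1 * np.column_stack([np.cos(6 * t), np.sin(6 * t), np.zeros(8)])  # slow, tightly curved
assert fused_metric(straight) > fused_metric(arc)
assert choose_sd(straight) == "retrieval" and choose_sd(arc) == "drafter"
```

The straight, long segment yields both a large fitted radius and a large displacement, so it is routed to retrieval-based SD; the tight arc yields small values of both and falls back to the drafter.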
We identify a threshold to serve as the demarcation between retrieval-based SD and drafter-based SD. Specifically, when F^{[w]} at the current step exceeds the threshold, retrieval-based SD is selected, as both R^{[w]} and D^{[w]} at this step are relatively large, which should align with the trajectory inferred by the VLA; otherwise, drafter-based SD is chosen.

[Figure 6. System Implementation of HeiSD Framework. Left: the HeiSD computation flow across the GPU (ViT forward, drafter forward, LLM verification, de-tokenization, robot control) and the CPU (retrieval similarity computation and Top-K selection). Middle: peak-memory accounting (VLA peak 16.6 GB, database peak 8.8 GB): 24 GB consumer GPUs (RTX 3090/4090/5090) hit OOM, while A100/H100 fit, motivating offloading the database to the CPU. Right: a data-movement case: a 4352-dim FP16 embedding moved from GPU to CPU takes about 0.25 ms at 32 GB/s, and moving the retrieved 7-dim draft back is negligible.]

6. HeiSD Framework Implementation

This section outlines key considerations for the HeiSD framework implementation, covering cost accounting, hardware mapping and the overall computation flow.

6.1. Cost Accounting and Hardware Mapping

We develop pre-implementation cost accounting (Fig. 6) revealing that most consumer GPUs encounter Out-of-Memory (OOM) issues when hosting both the database component and the VLA model; thus, to ensure compatibility with mainstream hardware, the database is deployed in CPU memory to reduce GPU memory overhead. This heterogeneous collaborative deployment necessitates GPU-CPU communication, whose detailed costing (Fig.
6) shows negligible overhead (about 0.25 ms per transfer) in a single inference process, far below the cost of model inference, validating the feasibility and effectiveness of this deployment strategy.

6.2. Overall Computation Flow

We use about 5,000 lines of code to implement the HeiSD framework. Its overall computation flow is shown in Fig. 6. First, visual information is encoded into embeddings using a ViT on the GPU. Then, based on the prior trajectory, our fused metric determines the type of SD to be adopted. If retrieval-based SD is used, the embeddings are transmitted to the CPU for retrieval, which includes similarity computation and Top-K selection. When drafter-based SD is employed, the embeddings remain on the GPU and are input to the drafter for draft generation. Retrieved drafts are first moved back to the GPU; then the adaptive verify-skip mechanism determines whether a draft should skip verification. If skipped, the draft is executed directly; otherwise, verification is performed under our sequence-wise relaxed acceptance strategy. For drafts generated by the drafter, since they already reside on the GPU, we directly perform verification. All verification processes occur on the GPU. We will show below that this CPU+GPU implementation offers advantages over a GPU-only deployment.

7. Experiments

7.1. Setup

We test HeiSD based on OpenVLA (Kim et al., 2024) model clusters in the LIBERO (Liu et al., 2023) simulation benchmark and a real-world environment. We build a single LLaMA block (Touvron et al., 2023) as a draft model for drafter-based SD. We train the draft models with the DeepSpeed (Rasley et al., 2020) framework, which takes 8 hours on 2× NVIDIA A100 GPUs. We build retrieval-based SD using the pre-built database. Both types of SD use OpenVLA as the verification model.
Since there is no directly comparable work, we choose pure drafter-based SD (Pure D-SD), pure retrieval-based SD (Pure R-SD), and the SOTA work SpecVLA (Wang et al., 2025) with token-level relaxed acceptance as baselines. Our hardware platform is an NVIDIA A100 GPU paired with an Intel Xeon Silver 4410T CPU.

7.2. Evaluation Results

Simulation Results. We use four LIBERO task suites to evaluate HeiSD; each suite contains 10 tasks, and we run 50 trials per task. Results are reported in Tab. 3. Compared with autoregressive inference, HeiSD achieves a 1.79x~2.45x speedup. Compared with Pure D-SD and Pure R-SD, HeiSD also delivers significant acceleration, confirming that hybridizing the two types of SD during VLA inference is highly effective. Moreover, even against SOTA works such as SpecVLA, HeiSD achieves a 1.51x~2.22x speedup with better SR, demonstrating that the hybrid use of the two SD methods yields a superior Pareto frontier. Fig. 7 presents the proportions of the two types of SD and of verify-skip across the four environments, explaining the source of the acceleration. Moreover, HeiSD achieves an increased acceptance length (AL) of about 4.75~4.96. The improvement in AL primarily stems from our optimization of retrieval-based SD. First, our verify-skip mechanism effectively extends AL, as drafts with skipped verification are treated as fully accepted. Second, our sequence-wise relaxed acceptance strategy allows some biased tokens to be accepted alongside the sequence.

Figure 7. Proportions of Drafter-Based SD, Retrieval-Based SD with Skipped Verification, and Retrieval-Based SD with Verification across the Four LIBERO Suites (LIBERO-Goal, LIBERO-Spatial, LIBERO-Object, LIBERO-Long).

Real-World Results. We construct a tabletop operating environment (Appendix E.1) and employ the AgileX PIPER robot arm (Appendix E.2) to test HeiSD's performance in real-world scenarios.
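To make the AL gain from relaxed acceptance concrete, the sketch below contrasts token-level acceptance with a sequence-wise variant on action-bin drafts. The sequence-wise criterion shown here (accept the whole draft when the mean bin deviation stays within tolerance) is our illustrative reading of the strategy, not the paper's exact rule.

```python
def token_level_accept(draft, target, tol=1):
    """Token-level relaxed acceptance: accept tokens left to right until one
    deviates from the verifier's bin index by more than `tol` bins."""
    n = 0
    for d, t in zip(draft, target):
        if abs(d - t) > tol:
            break
        n += 1
    return n

def sequence_wise_accept(draft, target, tol=1):
    """Illustrative sequence-wise variant: judge the draft as a whole, so a
    single biased token can be accepted alongside an otherwise good sequence
    instead of truncating the accepted prefix."""
    devs = [abs(d - t) for d, t in zip(draft, target)]
    if sum(devs) / len(devs) <= tol:
        return len(draft)
    return token_level_accept(draft, target, tol)
```

On the draft [10, 13, 12] against verifier output [10, 11, 12], the one biased token stops token-level acceptance after a single token, while the sequence-wise rule accepts all three, which is the mechanism behind the longer AL reported above.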
We build various manipulation tasks (Appendix E.3) and collect massive human demonstration data (Appendix E.4) to rebuild the database and fine-tune the models. We then deploy HeiSD in the real world and test its performance. The results are shown in Tab. 4. On the three task categories we define (with SRs of 87.2%, 77.3%, and 71.7% after fine-tuning), HeiSD achieves a 2.06x~2.41x speedup with minor SR loss (1.2%~3.9%). We also present an example real-world task completion process in Fig. 8.

Ablation Studies. We conduct ablation studies on the LIBERO-Goal benchmark to evaluate the effects of HeiSD's components, with results reported in Tab. 5. When hybrid SD is implemented solely with the fused metric, the boundary between retrieval-based SD and drafter-based SD is defined; however, overall performance remains suboptimal due to unresolved issues in retrieval-based SD: it achieves an SR of 74.0% but exhibits nearly no acceleration, with an average accepted length of only 1.05. Building on this foundation, we add the adaptive verify-skip mechanism, which significantly enhances performance: it achieves a 2.08x speedup with only a slight drop (1.0%) in SR, while AL rises to 4.04. Further adding the sequence-wise relaxed acceptance yields a 2.38x speedup (a 0.30x increase) and an AL of 4.50, while maintaining 73.0% SR.

Table 3. Simulation Results of HeiSD
Env.            Method     SR      Speed   AL     Steps   HW
LIBERO-Goal     AR w/o SD  77.0%   1.00x   --     157.6   GPU
                Pure R-SD  77.0%   0.96x   1.03   152.6   CPU+GPU
                Pure D-SD  76.2%   0.87x   1.68   159.2   GPU
                SpecVLA    71.0%   1.23x   3.63   166.8   GPU
                HeiSD      73.0%   2.38x   4.75   156.3   CPU+GPU
LIBERO-Object   AR w/o SD  71.2%   1.00x   --     191.7   GPU
                Pure R-SD  76.0%   0.98x   0.93   195.4   GPU
                Pure D-SD  68.6%   0.96x   1.84   195.9   GPU
                SpecVLA    62.4%   1.10x   3.91   214.5   GPU
                HeiSD      71.0%   2.45x   4.94   189.6   CPU+GPU
LIBERO-Spatial  AR w/o SD  82.8%   1.00x   --     126.9   GPU
                Pure R-SD  78.0%   1.15x   0.81   123.5   GPU
                Pure D-SD  82.8%   0.98x   1.65   127.3   GPU
                SpecVLA    80.4%   1.26x   3.80   128.7   GPU
                HeiSD      78.0%   1.90x   4.83   127.8   CPU+GPU
LIBERO-Long     AR w/o SD  54.4%   1.00x   --     393.2   GPU
                Pure R-SD  50.0%   0.99x   0.81   399.3   GPU
                Pure D-SD  50.2%   0.91x   1.59   400.7   GPU
                SpecVLA    46.2%   1.13x   3.63   439.6   GPU
                HeiSD      47.0%   1.79x   4.96   428.0   CPU+GPU

Figure 8. A Case of HeiSD Framework Completing Real-World Tasks (Task Name: Pick up the Banana and Put It on the Plate). (The strip annotates which steps use retrieval-based SD, drafter-based SD, and skipped verification.)

Table 4. Real-World Results of HeiSD

Task Category         Fine-Tune SR   HeiSD SR   HeiSD Speedup   HeiSD AL
Atomic Grasping       87.2%          86.0%      2.33x           4.47
Spatial Displacement  77.3%          75.1%      2.41x           4.39
Composite Sequential  71.7%          67.8%      2.06x           4.15

Figure 9. Discussion of Hyper-Parameter alpha in HeiSD. (Speed and SR for alpha in {0.3, 0.4, 0.5, 0.6, 0.7} on LIBERO-Goal, LIBERO-Object, LIBERO-Spatial, and LIBERO-Long; markers indicate HeiSD's selection.)

7.3. Discussion

Hyper-Parameters.
HeiSD involves two hyper-parameters: the sliding-window size w for trajectory points and alpha in the fused metric. We evaluate HeiSD's performance across various hyper-parameter values on the LIBERO-Goal benchmark, with results presented in Fig. 10. As illustrated in Fig. 10, we test different values of w across the four simulation environments. The value of w exerts a significant influence on speed and SR, as it alters the distribution of the fused metric. Considering both SR and speed, we select w = 15 as the default value for HeiSD. Fig. 9 shows the results under different values of alpha; alpha likewise influences SR and speed. We select alpha = 0.5 so that equal attention is allocated to the curvature radius and the cumulative displacement.

Hardware Implementation Analysis. We test HeiSD's end-to-end latency (50 trials) on different hardware platforms in all four simulation benchmarks. Results in Tab. 6 show that offloading the database to the CPU brings a modest acceleration (1.04x~1.09x). This demonstrates that, without specific optimization, the CPU executes database retrieval-related operations more efficiently than the GPU.

Table 5. Ablation Studies of HeiSD on LIBERO-Goal Benchmark

                                SR               Speedup           AL
Only Hybrid SD                  74.0%            1.05x             1.05
+ Adaptive Verify-Skip          73.0% (-1.0%)    2.08x (+1.03x)    4.04 (+2.99)
+ Seq-Wise Relaxed Acceptance   73.0% (+0.0%)    2.38x (+0.30x)    4.50 (+0.46)

Figure 10. Discussion of Hyper-Parameter w in HeiSD. (Speed and SR for w in {11, 13, 15, 17, 19} on LIBERO-Goal, LIBERO-Object, LIBERO-Spatial, and LIBERO-Long; markers indicate HeiSD's selection.)
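The role of w and alpha can be illustrated with a small sketch. The exact definition of F[w] is given earlier in the paper and is not reproduced in this section, so the code below assumes a convex combination of a curvature-radius proxy R[w] and the cumulative displacement D[w], each normalized by a 95th-percentile value (cf. the dashed thresholds in Fig. 16), with the defaults w = 15 and alpha = 0.5 selected above.

```python
import numpy as np

def fused_metric(traj, w=15, alpha=0.5, r95=1.0, d95=1.0):
    """Illustrative fused kinematic metric over the last w trajectory points.
    R is a crude curvature-radius proxy (straighter window -> larger radius);
    r95/d95 stand in for the 95th-percentile normalizers. This is a sketch of
    the idea described in the text, not the paper's exact formula."""
    pts = np.asarray(traj[-w:], dtype=float)
    steps = np.diff(pts, axis=0)
    D = np.linalg.norm(steps, axis=1).sum()          # cumulative displacement
    chord = np.linalg.norm(pts[-1] - pts[0])
    R = chord / max(D - chord, 1e-6)                 # straight line -> huge R
    R_n = min(R / r95, 1.0)                          # percentile normalization
    D_n = min(D / d95, 1.0)
    return alpha * R_n + (1 - alpha) * D_n           # convex combination
```

A perfectly straight 15-point window saturates both normalized terms (metric near 1, favoring retrieval-based SD), while a stationary window yields 0, favoring the drafter.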
More critically, offloading the database to the CPU significantly saves GPU memory and better accommodates computing hardware of multiple specifications.

Generality. Our work focuses exclusively on autoregressive VLAs and is not applicable to those incorporating diffusion models, since SD can only function in autoregressive models. The proposed HeiSD framework otherwise exhibits good generality, imposing no specific requirements on autoregressive VLAs, task categories, or robot platforms.

Scope. We do not design an automatic hyper-parameter determination method, mainly because established standards for current embodied intelligence are lacking. Automatic hyper-parameter selection is therefore left as future work, to be addressed once environmental standards are established. Additionally, the core purpose of HeiSD is to achieve acceleration while ensuring SR, rather than to guarantee a strictly identical distribution of draft and inference results. Thus, the impact of verify-skip and sequence-wise relaxed acceptance on the output distribution falls outside the scope of this paper.

8. Conclusion

In this paper, we first construct a database, design tests, identify the trajectory-overlapping regularity, and derive insights for implementing hybrid SD. We then introduce an adaptive verify-skip mechanism and a sequence-wise relaxed acceptance strategy to address limitations in retrieval-based SD. Additionally, we develop a kinematic-based fused metric to determine the hybrid boundary, forming the HeiSD framework.

Table 6. Hardware Implementation Analysis of HeiSD

Details            Goal       Object     Spatial    Long
HeiSD on GPU       7388.9 s   7731.7 s   6557.0 s   23318.5 s
HeiSD on CPU+GPU   7113.6 s   7328.4 s   6238.5 s   21352.7 s
Acceleration       1.04x      1.06x      1.05x      1.09x
Experiments show that HeiSD attains a speedup of up to 2.45x in simulation benchmarks and 2.06x~2.41x in real-world scenarios, while sustaining a high SR.

Impact Statement

This paper presents work whose goal is to advance the field of machine learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Budzianowski, P., Maa, W., Freed, M., Mo, J., Hsiao, W., Xie, A., Mloduchowski, T., Tipnis, V., and Bolte, B. EdgeVLA: Efficient vision-language-action models. arXiv preprint arXiv:2507.14049, 2025.

Cho, S., Choi, S., Hwang, T., Seo, J., Jeong, S., Lee, H., Song, H., Park, J. C., and Kwon, Y. Lossless acceleration of large language models with hierarchical drafting based on temporal locality in speculative decoding. arXiv preprint arXiv:2502.05609, 2025.

Ding, N., Qin, Y., Yang, G., Wei, F., Yang, Z., Su, Y., Hu, S., Chen, Y., Chan, C.-M., Chen, W., et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220-235, 2023.

Han, Z., Gao, C., Liu, J., Zhang, J., and Zhang, S. Q. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint, 2024.

He, Z., Zhong, Z., Cai, T., Lee, J., and He, D. REST: Retrieval-based speculative decoding. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 1582-1595, 2024.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
Kwon, O., George, A., Bartsch, A., and Farimani, A. B. RT-Cache: Training-free retrieval for real-time manipulation. In 2025 IEEE-RAS 24th International Conference on Humanoid Robots (Humanoids), pp. 1-8, 2025. doi: 10.1109/Humanoids65713.2025.11203198.

Lee, A. C.-M., Cheng, W.-S., and Chan, C. C.-K. PROMTEC: Fast LLM inference decoding using prompt multi-lookup with template database and common sequences. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp. 6830-6842, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-256-5. doi: 10.18653/v1/2025.findings-acl.355. URL https://aclanthology.org/2025.findings-acl.355/.

Li, S., Hu, Y., Ning, X., Liu, X., Hong, K., Jia, X., Li, X., Yan, Y., Ran, P., Dai, G., et al. MBQ: Modality-balanced quantization for large vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pp. 4167-4177, 2025.

Li, X., Hsu, K., Gu, J., Pertsch, K., Mees, O., Walke, H. R., Fu, C., Lunawat, I., Sieh, I., Kirmani, S., Levine, S., Wu, J., Finn, C., Su, H., Vuong, Q., and Xiao, T. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024a.

Li, Y., Wei, F., Zhang, C., and Zhang, H. EAGLE-2: Faster inference of language models with dynamic draft trees. arXiv preprint arXiv:2406.16858, 2024b.

Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., and Stone, P. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In Advances in Neural Information Processing Systems, volume 36, pp. 44776-44791, 2023.

Liu, J., Liu, M., Wang, Z., Lee, L., Zhou, K., An, P., Yang, S., Zhang, R., Guo, Y., and Zhang, S. RoboMamba: Multimodal state space model for efficient robot reasoning and manipulation. arXiv preprint arXiv:2406.04339, 2024a.
Liu, S.-Y., Wang, C.-Y., Yin, H., Molchanov, P., Wang, Y.-C. F., Cheng, K.-T., and Chen, M.-H. DoRA: Weight-decomposed low-rank adaptation. In Forty-first International Conference on Machine Learning, 2024b.

Ma, Y., Song, Z., Zhuang, Y., Hao, J., and King, I. A survey on vision-language-action models for embodied AI. arXiv preprint arXiv:2405.14093, 2024.

Nasiriany, S., Maddukuri, A., Zhang, L., Parikh, A., Lo, A., Joshi, A., Mandlekar, A., and Zhu, Y. RoboCasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523, 2024.

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

O'Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., Pooley, A., Gupta, A., Mandlekar, A., Jain, A., et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 6892-6903. IEEE, 2024.

Park, S., Kim, H., Jeon, W., Yang, J., Jeon, B., Oh, Y., and Choi, J. Quantization-aware imitation learning for resource-efficient robotic control. arXiv preprint arXiv:2412.01034, 2024.

Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., and Levine, S. FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025.

Qdrant Team. Qdrant: High-performance, massive-scale vector database and vector search engine. https://qdrant.tech/, 2023. Accessed: 2024-01-08.

Rasley, J., Rajbhandari, S., Ruwase, O., and He, Y. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters.
In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3505-3506, 2020.

Song, W., Chen, J., Ding, P., Huang, Y., Zhao, H., Wang, D., and Li, H. CEED-VLA: Consistency vision-language-action model with early-exit decoding. arXiv preprint arXiv:2506.13725, 2025.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026-5033. IEEE, 2012. doi: 10.1109/IROS.2012.6386109.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Roziere, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579-2605, 2008. URL http://jmlr.org/papers/v9/vandermaaten08a.html.

Wang, S., Yu, R., Yuan, Z., Yu, C., Gao, F., Wang, Y., and Wong, D. F. Spec-VLA: Speculative decoding for vision-language-action models with relaxed acceptance. arXiv preprint arXiv:2507.22424, 2025.

Wen, J., Zhu, Y., Li, J., Zhu, M., Tang, Z., Wu, K., Xu, Z., Liu, N., Cheng, R., Shen, C., et al. TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025.

Wen, Z., Gui, S., and Feng, Y. Speculative decoding with CTC-based draft model for LLM inference acceleration. In Advances in Neural Information Processing Systems, volume 37, pp. 92082-92100, 2024.

Xu, L., Xie, H., Qin, S.-Z. J., Tao, X., and Wang, F. L. Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment. arXiv preprint arXiv:2312.12148, 2023.

Yan, M., Agarwal, S., and Venkataraman, S. Decoding speculative decoding.
In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 6460-6473, 2025.

Yue, Y., Wang, Y., Kang, B., Han, Y., Wang, S., Song, S., Feng, J., and Huang, G. DeeR-VLA: Dynamic inference of multimodal large language models for efficient robot execution. In Advances in Neural Information Processing Systems, volume 37, pp. 56619-56643, 2024.

Zhai, X., Mustafa, B., Kolesnikov, A., and Beyer, L. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11975-11986, 2023.

Zhang, J., Huang, J., Jin, S., and Lu, S. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5625-5644, 2024.

Zhang, R., Dong, M., Zhang, Y., Heng, L., Chi, X., Dai, G., Du, L., Du, Y., and Zhang, S. MoLe-VLA: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation. arXiv preprint arXiv:2503.20384, 2025.

Zheng, Z., Cui, X., Zheng, S., Li, M., Chen, J., Chen, X., et al. MoQa: Rethinking MoE quantization with multi-stage data-model distribution awareness. arXiv e-prints, pp. arXiv-2503, 2025.

Zheng, Z., Cao, H., Tian, S., Chen, J., Li, M., Sun, X., Zou, H., Zhang, Z., Liu, X., Cao, D., et al. DyQ-VLA: Temporal-dynamic-aware quantization for embodied vision-language-action models. arXiv preprint arXiv:2603.07904, 2026a.

Zheng, Z., Mao, Z., Li, M., Chen, J., Sun, X., Zhang, Z., Cao, D., Mei, H., and Chen, X. KERV: Kinematic-rectified speculative decoding for embodied VLA models. arXiv preprint arXiv:2603.01581, 2026b.

Zheng, Z., Tian, S., Cao, H., Li, C., Chen, J., Li, M., Sun, X., Zou, H., Luo, G., and Chen, X.
Rapid: Redundancy-aware and compatibility-optimal edge-cloud partitioned inference for diverse VLA models. arXiv preprint arXiv:2603.07949, 2026c.

Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al. RT-2: Vision-language action models transfer web knowledge to robotic control. In Conference on Robot Learning, pp. 2165-2183. PMLR, 2023.

A. Vision-Language-Action Models

A.1. Model Structure

Vision-Language-Action (VLA) models map visual observations and natural language instructions directly to robot actions. The canonical VLA architecture follows a three-stage pipeline: (1) a Vision Encoder (ViT-based) extracts visual features from robot observations, (2) a pre-trained Large Language Model backbone fuses visual and language tokens for cross-modal reasoning, and (3) an Action Decoder head projects the model outputs into action space. While recent works have explored diffusion-based action generation, we focus on the widely adopted autoregressive token-prediction framework.

To illustrate this architecture concretely, we use OpenVLA (Kim et al., 2024) as an example. OpenVLA employs a dual-encoder vision system: DINOv2-ViT-L/14 (Oquab et al., 2023) (304M parameters, self-supervised features) and SigLIP-ViT-SO400M/14 (Zhai et al., 2023) (400M parameters, vision-language-aligned features) process 224x224 RGB images in parallel. Their 1024-dim and 1152-dim outputs are concatenated and projected through a lightweight adapter into the LLM token space, yielding 256 visual tokens per observation. The language backbone is Llama-2-7B (7B parameters), which autoregressively generates action tokens by attending to the combined vision-language context. For 7-DoF robot control, the model predicts 7 action-dimension tokens through a linear head over a 256-bin vocabulary per dimension.

A.2.
Generation Paradigms

At each robot control timestep, action generation proceeds sequentially: the model predicts dimension a_1, appends its token to the input sequence, then predicts a_2 conditioned on a_1, and so forth. For a D-dimensional action (D = 7 for OpenVLA), this requires D full forward passes through the LLM backbone, each involving causal attention over all visual tokens, language tokens, and previously generated action tokens. On an NVIDIA A100 GPU, OpenVLA's single-timestep inference takes approximately 174 ms: 8 ms for dual-encoder visual feature extraction, 113 ms for autoregressive action-token generation (across 7 decoding steps), and about 53 ms of system overheads (e.g., data transfer and CPU scheduling). This sequential decoding bottleneck, where each action dimension must wait for its predecessor, directly motivates our application of speculative decoding to VLA inference.

To bridge continuous robot control and discrete language modeling, VLA models quantize action spaces into token vocabularies. Each action dimension a_i in [a_min,i, a_max,i] is uniformly discretized into K bins (typically K = 256 to balance resolution and vocabulary size). The continuous-to-discrete mapping assigns each action to its nearest bin index:

b_i = round( (a_i - a_min,i) / (a_max,i - a_min,i) * (K - 1) ),    (9)

while the inverse mapping reconstructs continuous actions via linear interpolation:

a_i = a_min,i + (b_i / (K - 1)) * (a_max,i - a_min,i).    (10)

This discretization allows VLA models to apply a standard cross-entropy loss over bin indices, treating action prediction as multi-class classification without architectural modifications.

Figure 11. Common Structure of Vision-Language-Action Models. (The figure shows the pipeline: the input image passes through the vision encoders to produce vision tokens, the language instruction through the tokenizer to produce text tokens, and the large language model output through the action detokenizer into an executed action slice [X, Y, Z, ...].)
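Eqs. (9) and (10) amount to a few lines of code. The sketch below uses K = 256 as in OpenVLA; the function names are ours, and the round trip loses at most half a bin width.

```python
def discretize(a, a_min, a_max, K=256):
    """Eq. (9): map a continuous action value to its nearest bin index."""
    b = round((a - a_min) / (a_max - a_min) * (K - 1))
    return max(0, min(K - 1, b))            # clamp to the 256-bin vocabulary

def undiscretize(b, a_min, a_max, K=256):
    """Eq. (10): reconstruct a continuous action from a bin index
    via linear interpolation."""
    return a_min + b / (K - 1) * (a_max - a_min)
```

For instance, round-tripping a command on [-1, 1] incurs an error of at most (a_max - a_min) / (2 * (K - 1)), i.e. under 0.004 for this range.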
Figure 12. Representative initial scenes from the four LIBERO subsets. From left to right: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long. These MuJoCo tabletop environments, manipulated by a 7-DoF Franka Panda arm, illustrate distribution shifts across spatial configurations, object instances, and goal specifications.

B. LIBERO Dataset Details

B.1. Brief Introduction of LIBERO

LIBERO (Liu et al., 2023) is a comprehensive benchmark for evaluating multitask robot learning and generalization. The dataset contains 130 tabletop manipulation tasks performed by a 7-DoF Franka Panda robot arm in simulated environments built on the MuJoCo (Todorov et al., 2012) physics engine. All demonstrations are collected through human tele-operation by expert operators, ensuring high-quality trajectory data. Depending on the task environment and operands, LIBERO classifies all tasks into four categories, as follows:

• LIBERO-Spatial: Tasks vary in spatial configurations and object placements while maintaining consistent manipulation primitives (e.g., "pick up the bowl on the left" and "pick up the bowl on the right").

• LIBERO-Object: Tasks involve different object instances with varying visual appearances but similar manipulation strategies (e.g., differently colored plates, variously shaped containers).

• LIBERO-Goal: Tasks require the same set of objects but with different goal specifications, testing the agent's ability to follow diverse instructions.

• LIBERO-Long: Multi-step tasks requiring sequential execution of 3-4 sub-goals, significantly longer than the single-step manipulation tasks in the other suites.

Each task provides multiple demonstration trajectories with randomized initial states, enabling robust policy learning and systematic evaluation of generalization across different distribution shifts.
B.2. Statistics of the LIBERO Dataset

We build our retrieval database from the official LIBERO training demonstrations (Liu et al., 2023). To guide the design of the database architecture, we profile key dataset statistics (episode count, episode length, disk usage, and action dimensionality) across its four task suites: LIBERO-Goal, LIBERO-Spatial, LIBERO-Object, and LIBERO-Long. As shown in Tab. 7, LIBERO-Long episodes are significantly longer due to multi-step task composition, while the others focus on single-goal primitives. These metrics directly inform our choices in sharding, payload structure, and storage allocation.

Table 7. LIBERO Dataset Statistics

Dataset         Tasks  Episodes  Total Steps  Avg. Steps/Episode  Size    Files
LIBERO-Goal     10     428       52,042       121.59 +/- 37.62    1.7 GB  18
LIBERO-Spatial  10     432       52,970       122.62 +/- 20.75    1.8 GB  18
LIBERO-Object   10     454       66,984       147.54 +/- 18.53    2.6 GB  34
LIBERO-Long     10     379       101,469      267.73 +/- 56.74    3.4 GB  34

Figure 13. 3D visualization of database embedding vectors using t-SNE (van der Maaten & Hinton, 2008) dimensionality reduction. Points are colored by task identity, showing clear clustering of trajectories from the same task.

C. Database Construction

This section details the design and implementation of our retrieval system, which enables fast, task-aware search over human demonstrations. We first describe the multi-modal vector representation used to encode visual states, followed by the self-contained payload schema that associates each vector with executable action sequences and metadata. Based on scale profiling of the LIBERO dataset, we then present our database architecture, including backend selection, task-based sharding, and memory feasibility.
Finally, we report the actual storage footprint and query latency of the constructed database, demonstrating its efficiency as a low-latency retrieval module.

C.1. Construction Details of the Retrieval System

Vector Representation. Vision-language-action (VLA) models often incorporate multiple visual encoders or Vision Transformer (ViT) backbones to capture complementary aspects of the environment. To construct a comprehensive observation representation, we fuse the visual features from these complementary streams by concatenation. Taking OpenVLA (Kim et al., 2024) as a concrete example, its visual encoder provides two distinct feature types:

• DINOv2 Features (1024-dim): a self-supervised visual representation that captures robust scene-level semantics.

• SigLIP Features (1152-dim): a vision-language-aligned visual representation that grounds observations to task instructions.

In our setup, we employ a dual-view observation system consisting of a third-person view and an elbow-mounted camera view. Each view is independently processed by both encoders, yielding (1024 + 1152) x 2 = 4352 dimensions in total. Concatenating the features from both views thus yields a 4352-dim joint visual embedding, which is then L2-normalized to enable cosine-similarity-based retrieval. Following Qdrant's efficient retrieval methodology, we leverage Hierarchical Navigable Small World (HNSW) graphs for approximate nearest-neighbor search, which provide sub-linear query complexity while maintaining high recall. To validate the effectiveness of this fused representation, we perform t-SNE (van der Maaten & Hinton, 2008) dimensionality reduction over the entire demonstration database. As shown in Fig. 13, the embeddings form clear clusters by task identity, confirming that the combined representation captures task-specific patterns in the high-dimensional space.

Retrieval Payload Schema.
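The fusion and lookup described above can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the authors' code: the brute-force dot-product search below plays the role of the HNSW index, and the feature vectors are assumed to be precomputed by the frozen encoders.

```python
import numpy as np

def fuse_views(dino_third, siglip_third, dino_elbow, siglip_elbow):
    """Concatenate DINOv2 (1024-d) and SigLIP (1152-d) features from both
    camera views into one (1024+1152)*2 = 4352-d embedding, then
    L2-normalize so that dot products equal cosine similarities."""
    v = np.concatenate([dino_third, siglip_third, dino_elbow, siglip_elbow])
    return v / np.linalg.norm(v)

def nearest(query, db):
    """Brute-force stand-in for the HNSW approximate search: rows of `db`
    are unit vectors, so the argmax dot product is the cosine-nearest entry."""
    sims = db @ query
    i = int(np.argmax(sims))
    return i, float(sims[i])
```

Because every stored vector is unit-norm, the returned similarity is exactly the retrieval-confidence score analyzed later in this appendix.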
To enable self-contained retrieval, each vector point stores a lightweight JSON payload with all metadata and action information needed for speculative decoding, including: (1) dataset name (string); (2) episode_idx (int); (3) step_idx (int); (4) current_action (float[7]); (5) next_actions (float[3x7]); (6) language_instruction (string). This design ensures that a single query returns both the matched state and its executable 3-step action sequence, eliminating the need for external storage. The payload size is under 200 B per entry, making it highly memory-efficient.

Structure Pre-Design. Guided by our profiling of the LIBERO benchmark, which comprises 273,465 timesteps across the four suites (fewer than 3 x 10^5 steps in total; see Tab. 8 for a per-suite breakdown), we determine that the entire retrieval memory can comfortably fit in RAM. Each timestep entry requires approximately 8.69 KiB, leading to a theoretical total memory footprint of 2.27 GB. Given these constraints, we select Qdrant (Qdrant Team, 2023) as the vector database backend for its high-speed similarity search, open-source licensing, and native support for loading full indices into memory. Qdrant employs HNSW indexing with configurable parameters (we use m=16, ef_construct=100) to balance index construction time, memory usage, and search accuracy. This design ensures low-latency retrieval during policy inference while maintaining deployment simplicity.

Table 8. Vector Database Distribution and Storage

Dataset         Collections  Ttl. Vectors  Vec. per Collection  Size     Ttl. Size
LIBERO-Goal     10           52,042        5,204                1.38 GB  6.5 GB
LIBERO-Object   10           66,984        7,443                1.58 GB
LIBERO-Spatial  10           52,970        5,297                1.38 GB
LIBERO-Long     10           101,469       10,147               2.16 GB

Sharding Strategy.
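The schema and the RAM-feasibility estimate above can be spelled out concretely. The payload values below are invented placeholders that merely follow the six listed fields; the arithmetic reproduces the footprint figure stated in the text.

```python
# Hypothetical payload instance following the six-field schema above
# (the concrete values here are illustrative, not from the dataset).
payload = {
    "dataset_name": "LIBERO-Goal",
    "episode_idx": 12,
    "step_idx": 87,
    "current_action": [0.0] * 7,
    "next_actions": [[0.0] * 7 for _ in range(3)],
    "language_instruction": "open the middle drawer of the cabinet",
}

# Memory feasibility from the text: 273,465 timestep entries at ~8.69 KiB each.
ENTRIES = 273_465
PER_ENTRY_KIB = 8.69
total_gib = ENTRIES * PER_ENTRY_KIB / 1024 / 1024
print(f"total retrieval memory ~= {total_gib:.2f} GiB")  # ~2.27 GiB, fits in RAM
```

A single retrieved payload thus carries the matched step plus its 3-step executable draft, which is why no secondary store is consulted during drafting.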
We partition the retrieval database into multiple independent collections to enforce logical isolation across tasks, avoiding the mixing of heterogeneous demonstration data and simplifying index management. This sharding improves query efficiency through smaller search spaces, supports incremental updates and selective loading, and enhances maintainability, since each collection can be inspected or replaced independently. The strategy yields 40 collections across the four LIBERO suites. Tab. 8 summarizes the distribution of vectors and storage usage per dataset.

C.2. Performance Analysis of the Retrieval System

As summarized in Tab. 8, the final database contains 273,465 vectors, one for each timestep in the LIBERO training set, distributed across 40 task-specific collections. We construct it by encoding each observation into a 4352-dim vector and storing it with its payload in the corresponding collection, ensuring complete, non-redundant coverage. The actual disk usage is 6.5 GB, comprising 1.0 GB of vector storage and 5.5 GB of indexing overhead (HNSW graphs and metadata). We measure retrieval latency on a standard server: the average query time over 100 runs is 5.13 ms, significantly faster than a single SD forward pass (13.93 ms). This confirms that our retrieval memory is both compact and efficient for real-time action drafting.

Database Scaling Analysis. We further investigate how database scaling affects retrieval performance. Since our database already incorporates all LIBERO training data (no isomorphic data could be added), we use heterogeneous data instead, selecting another dataset, SimplerEnv (Li et al., 2024a), as the heterogeneous data source. Specifically, we incrementally add data from SimplerEnv to the existing LIBERO database and record the success rate (SR) of tasks that rely solely on database retrieval. As shown in Fig.
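The sharding scheme can be modeled in miniature. The toy class below is a pure-Python stand-in for the 40 task-sharded Qdrant collections (which use HNSW in the real system): each task's vectors live in an isolated shard, so a query only scans its own collection. The 2-d vectors in the example are for illustration; the real embeddings are 4352-d.

```python
import numpy as np

class ShardedRetriever:
    """Toy model of task-sharded collections (not the Qdrant client):
    isolation per task keeps each search space small and lets shards be
    loaded, inspected, or replaced independently."""
    def __init__(self):
        self.shards = {}                       # collection name -> (vectors, payloads)

    def add_collection(self, name, vectors, payloads):
        self.shards[name] = (np.asarray(vectors, dtype=float), list(payloads))

    def query(self, name, q, k=1):
        vecs, payloads = self.shards[name]     # search confined to one shard
        order = np.argsort(-(vecs @ q))[:k]    # cosine order for unit vectors
        return [payloads[i] for i in order]
```

Usage mirrors the deployed system: the policy names the active task's collection and receives only payloads from that shard.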
14 , due to the limitation of data heterogeneity and model generalization, simply expanding the database with out-of-distribution demonstrations does not ef fectively impro ve the task completion rate, and also increases the storage ov erhead of the database. This finding moti vates our hybrid approach that combines retrie v al-based drafting with neural verification, rather than relying purely on database scaling. Database Scale Success Rate Storage Size 0. 00 % 10 .0 0% 20 .0 0% 30 .0 0% 40 .0 0% 50 .0 0% 60 .0 0% 70 .0 0% 80 .0 0% 0 1 2 3 4 5 6 7 8 9 1. 00 x 1. 25 x 1. 50 x 1. 75 x 2. 00 x 2. 25 x 2. 50 x 2. 75 x 3. 00 x DB Const Overhead DB Vector Size Spat ial SR Long SR Goal S R Object SR F igure 14. Effect of database scaling on retrieval -only performance. Adding heterogeneous data from SimplerEn v does not improve success rate on LIBER O tasks while increasing storage costs. 15 Submission and Formatting Instructions f or ICML 2026 F igure 15. Distribution of retriev al confidence (cosine similarity scores) across the four LIBERO task suites. The consistently high scores ( > 0.90 for most queries) demonstrate the reliability of our constructed database for action retriev al. Retrieval Confidence Analysis. Beyond query latency , we further analyze the reliability of retriev al results through confidence measurement. In our system, retrie v al confidence is quantified by the cosine similarity score between the query embedding and its nearest neighbor in the database—higher scores indicate closer matches between the current observation and stored demonstrations. As illustrated in Fig. 15 , the retrie val confidence scores are consistently high across all four LIBERO task suites, with the majority of queries achieving similarity scores above 0.90. 
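The query path described in this appendix (embed the observation, find the nearest stored state, check confidence, return its 3-step action draft) can be sketched in a few lines. For illustration we use a brute-force cosine search in place of Qdrant's HNSW index, toy 4-dimensional vectors in place of the 4352-dim fused embeddings, and hypothetical function names; this is a sketch of the idea, not the HeiSD implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_draft(query_vec, collection, conf_threshold=0.90):
    """Return the 3-step draft action sequence of the nearest neighbor,
    or None when retrieval confidence falls below the threshold.
    `collection` is a list of (embedding, payload) pairs whose payload
    mirrors the per-point JSON schema (here only `next_actions` is shown)."""
    best_score, best_payload = -1.0, None
    for emb, payload in collection:          # brute-force stand-in for HNSW search
        score = cosine(query_vec, emb)
        if score > best_score:
            best_score, best_payload = score, payload
    if best_score < conf_threshold:
        return None                          # low confidence: no retrieval draft
    return best_payload["next_actions"]      # executable 3-step action draft

# Toy collection with two stored states and their 3x7 lookahead actions.
collection = [
    ([1.0, 0.0, 0.0, 0.0], {"next_actions": [[0.1] * 7] * 3}),
    ([0.0, 1.0, 0.0, 0.0], {"next_actions": [[0.2] * 7] * 3}),
]
draft = retrieve_draft([0.99, 0.01, 0.0, 0.0], collection)
```

A query far from every stored state falls below the 0.90 threshold and returns no draft, which is exactly the situation where a hybrid system would fall back to drafter-based SD.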
This uniformly high confidence demonstrates that our database construction effectively captures the visual-semantic patterns of demonstration trajectories, enabling reliable nearest-neighbor matching during inference. The consistently high retrieval confidence validates the quality of our fused visual embedding and the completeness of our database coverage over the LIBERO demonstration space.

C.3. Package and Application of the Retrieval System

System Architecture. To support real-time integration with VLA policies, we implement the retrieval system as a lightweight Retrieval Class that can be directly invoked by the policy agent. The retrieval class encapsulates two core components: (1) an Embedding Module that processes dual-view input images (third-person and elbow-mounted camera) into 4352-dimensional vectors by fusing frozen DINOv2 and SigLIP features from both views via concatenation and L2 normalization, leveraging GPU acceleration for low-latency encoding; and (2) a Qdrant Database client that maintains task-specific collections indexed with HNSW graphs for efficient similarity search. Critically, each vector is paired with a self-contained JSON payload that embeds all necessary action and metadata, eliminating the need for the external storage or cross-database queries required by systems like RT-Cache (Kwon et al., 2025).

Offline Construction Pipeline. The database is constructed in a single offline pass over the LIBERO RLDS datasets. For each timestep containing an observation, language instruction, and action sequence, we (1) generate its fused visual embedding using the same encoder configuration as at inference time, (2) construct a payload that includes metadata (dataset name, episode idx, step idx), the current 7-DoF action, and a three-step lookahead action sequence (next actions), and (3) insert the (embedding, payload) pair into the Qdrant collection corresponding to its task. To ensure semantic purity and simplify maintenance, we partition the database by task identity, resulting in 40 independent collections across the four LIBERO suites. After ingestion, HNSW indices are built across all collections to support fast approximate nearest-neighbor search during runtime.

Online Retrieval Workflow. During policy execution, the system enables low-latency speculative drafting through direct function calls. Given a new observation and instruction from the VLA agent, the retrieval class (1) computes a query vector using the embedded encoding module with identical frozen encoders, (2) performs a top-K approximate nearest-neighbor search via HNSW within the pre-specified task collection, benefiting from the smaller search space afforded by task-level sharding, (3) retrieves the payload of the nearest neighbor, and (4) extracts the next actions field to return a draft 3-step action sequence. The full retrieval pipeline achieves an average latency of 5.13 ms (see Appendix C.2), well within real-time requirements for interactive robotic control.

D. Normalization of Kinematic-Based Metric

In HeiSD, we employ two kinematic indicators, Cumulative Spatial Displacement (D[w]) and Radius of Curvature (R[w]), to assess trajectory smoothness and guide the adaptive switching mechanism. Since these raw metrics exhibit different scales and distributions across task suites, proper normalization is essential to ensure consistent and comparable thresholding.

Figure 16. Empirical distributions of Cumulative Spatial Displacement (D[w]) and Radius of Curvature (R[w]) across the four LIBERO task suites. The dashed lines indicate the 95th percentile thresholds used for normalization.
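The two window-level indicators can be read concretely as follows. The sketch below computes them over a sliding window of 3-D end-effector positions; the discrete curvature formula (mean circumradius of consecutive point triples) is our illustrative assumption, since the exact discretization is not spelled out here, and all function names are hypothetical.

```python
import math

def cumulative_displacement(window):
    """D[w]: summed Euclidean distance between consecutive end-effector
    positions inside the sliding window."""
    return sum(math.dist(p, q) for p, q in zip(window, window[1:]))

def mean_radius_of_curvature(window):
    """R[w]: approximated here as the mean circumradius over consecutive
    point triples (an assumed discretization, not the paper's exact one).
    Near-collinear triples have effectively infinite radius and are skipped."""
    radii = []
    for p0, p1, p2 in zip(window, window[1:], window[2:]):
        a, b, c = math.dist(p1, p2), math.dist(p0, p2), math.dist(p0, p1)
        s = 0.5 * (a + b + c)
        area_sq = s * (s - a) * (s - b) * (s - c)  # Heron's formula (squared area)
        if area_sq <= 1e-12:
            continue
        radii.append(a * b * c / (4.0 * math.sqrt(area_sq)))
    return sum(radii) / len(radii) if radii else float("inf")

# A straight 3-step segment: non-zero displacement, no measurable curvature.
straight = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (0.02, 0.0, 0.0)]
```

On a straight segment the displacement equals the path length while the curvature radius degenerates to infinity, matching the intuition that smooth, straight motion is where retrieval-based drafting is most reliable.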
D.1. Profiling of Raw Indicator Distributions

To characterize the natural range of each indicator, we profile all trajectories in the LIBERO training demonstrations across the four task suites. Fig. 16 visualizes the empirical distributions of D[w] and R[w] for each suite, revealing notable differences in their ranges and tail behaviors. As shown in Fig. 16, both indicators exhibit long-tailed distributions with occasional extreme outliers caused by sudden motions or annotation artifacts. Directly using the global maximum as the upper bound would compress the majority of valid samples into a narrow range, reducing discriminative power.

D.2. Min-Max Normalization with Percentile Clipping

To address this issue, we adopt a min-max normalization scheme with 95th percentile clipping. Specifically, for each indicator x ∈ {D[w], R[w]} and task suite s, we compute:

x^{(s)}_{\min} = \min_i x^{(s)}_i, \qquad x^{(s)}_{\max} = \mathrm{Percentile}_{95}\big(\{ x^{(s)}_i \}\big), \quad (11)

where x^{(s)}_i denotes the i-th sample value in suite s. The normalized indicator is then computed as:

\hat{x} = \operatorname{clip}\!\left( \frac{x - x^{(s)}_{\min}}{x^{(s)}_{\max} - x^{(s)}_{\min}},\ 0,\ 1 \right), \quad (12)

where the clip(·, 0, 1) function ensures that values exceeding the 95th percentile threshold are capped at 1 and values below the minimum are capped at 0. This design choice is motivated by two observations: (1) the 95th percentile effectively excludes extreme outliers while preserving the dynamic range of typical trajectories, and (2) capping the normalized output to [0, 1] provides a well-bounded input for threshold-based switching decisions.

D.3. Suite-Specific Normalization Bounds

Based on our profiling, the suite-specific normalization bounds are summarized in Tab. 9. These bounds are applied at inference time to normalize raw kinematic values before comparing them against the switching thresholds.
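Equations (11) and (12) translate directly into code. The sketch below fits the per-suite bounds offline and normalizes a raw indicator value at inference time; helper names are ours, and the percentile uses linear interpolation as one reasonable convention.

```python
def percentile(values, q):
    """Percentile with linear interpolation between adjacent order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def fit_bounds(samples, clip_pct=95):
    """Eq. (11): per-suite lower bound = minimum, upper bound = 95th percentile."""
    return min(samples), percentile(samples, clip_pct)

def normalize(x, x_min, x_max):
    """Eq. (12): min-max normalization with clipping to [0, 1]."""
    x_hat = (x - x_min) / (x_max - x_min)
    return max(0.0, min(1.0, x_hat))

# A long-tailed toy indicator: one extreme outlier (100.0) among typical values.
samples = [0.01 * i for i in range(100)] + [100.0]
lo, hi = fit_bounds(samples)
```

With the 95th-percentile upper bound, the single outlier is clipped to 1.0 instead of compressing every typical sample toward zero, which is exactly the failure mode described above for a global-maximum bound.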
By using suite-specific bounds, we ensure that the adaptive switching mechanism remains calibrated to the characteristic motion patterns of each task category.

Table 9. Normalization bounds for D[w] and R[w] across LIBERO task suites. All values are derived from the training demonstrations, with the upper bound set at the 95th percentile.

Task Suite       D_min      D_max (95%)   R_min      R_max (95%)
LIBERO-Goal      0.000009   0.123381      0.000001   0.014989
LIBERO-Spatial   0.000027   0.128629      0.000019   0.015654
LIBERO-Object    0.000098   0.116458      0.000010   0.014151
LIBERO-Long      0.000008   0.102298      0.000001   0.012479

E. Real-World Evaluation Details

E.1. Tabletop Operation Environment

We build a tabletop operation environment for real-world experiments, as depicted in Fig. 17. Specifically, we fix the robotic arm at the midpoint of a common tabletop, ensuring that its operating range covers the entire tabletop, and lock its position with the official tabletop fixer. Moreover, we use a variety of objects as manipulation targets, including foam fruit models and kitchen utensils (e.g., a stainless steel plate).

Figure 17. Our Tabletop Operation Environment (robot arm zero point, robot arm extension, and side/top views of the tabletop and objects).

E.2. Robot Arm

We use a popular 6-DoF AgileX PIPER robotic arm. Its accessories include a handheld teaching display device, a 1-DoF gripper, an ORBBEC DABAI camera, and a plastic camera holder, as shown in Fig. 18. There are two assembly configurations: (1) with the handheld teaching display device mounted, the arm is used for data collection; (2) with the gripper mounted, the arm is used for real-world experiments.

Figure 18. Details of the Robot Arm.

E.3. Task Establishment

We set up several tasks based on the tabletop operating environment. Existing datasets each fall short for our purpose: RoboCasa (Nasiriany et al., 2024) is confined to simulation; the Open X-Embodiment (OXE) dataset (O'Neill et al., 2024) offers large scale but suffers from hardware heterogeneity and is difficult to reproduce in a single laboratory; and the tasks in SimplerEnv (Li et al., 2024a) are overly simplistic. We therefore integrate mainstream tasks from these datasets, such as object manipulation (grasping and moving) and placing objects into designated containers, and introduce diverse fruit models and containers along with environmental variations (e.g., lighting conditions and backgrounds) to facilitate the execution of diverse tasks within a real-world tabletop operation environment. A comparison between our tasks and existing datasets is presented in Table 10.

Table 10. Task Comparison with Existing Popular Datasets

Dataset                             Domain       Hardware Setup   Task Complexity          Obj. & Env. Diversity
OXE (O'Neill et al., 2024)          Real         Heterogeneous    Atomic / Short-horizon   High
RoboCasa (Nasiriany et al., 2024)   Simulation   Standardized     Long-horizon             High
SimplerEnv (Li et al., 2024a)       Sim / Real   Standardized     Atomic                   Low
Ours                                Real         Standardized     Hierarchical             High
Specifically, the established tasks encompass simple grasping tasks (e.g., pick up the apple/banana), spatial displacement tasks (e.g., move the apple from point A to point B), and specific pick-and-place tasks (e.g., pick up the mango from the plate and place it into the bowl). Furthermore, we include more complex composite tasks, such as removing the apple from the plate and placing it into the bowl, then placing the banana onto the plate. To account for the complexity of real-world environments, we conduct operations under varying lighting conditions, fruit categories, and container styles, as detailed in Table 11.

E.4. Data Collection

We construct the dataset following Google's Robot Learning Dataset Specification (RLDS), encapsulating it in a hierarchical, serialized HDF5 format. The data collection platform is based on a PIPER 6-DoF robotic arm (communicating via CAN bus) equipped with an ORBBEC DABAI camera for primary-view visual feedback. Regarding data processing and state representation, the system synchronously collects visual and proprioceptive data at 10 Hz. Visual observations are resized from the original 640 × 480 resolution to 224 × 224 pixels via bilinear interpolation and converted to the RGB color space to match the model inputs. The robot state vector consists of six joint angles (unified to radians) and a binary gripper state, whose open/close status is automatically determined via an adaptive threshold algorithm based on the difference between initial calibration and real-time feedback. The action space is defined as the relative increments (delta joint positions) of joint angles between adjacent time steps plus the absolute state of the target gripper. Regarding the collection protocol, each session begins with the robotic arm's automatic enabling and zero calibration.
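The state and action construction above can be illustrated with a short sketch. The helper below derives a 7-D action (six delta joint angles plus an absolute gripper command) from two consecutive 10 Hz states; the function name, the fixed gripper threshold, and the assumption that raw encoder readings arrive in degrees are all ours, not the actual pipeline code.

```python
import math

GRIPPER_DELTA_THRESHOLD = 0.01  # assumed output of the adaptive threshold step

def make_action(prev_joints, curr_joints, gripper_feedback, gripper_calib):
    """Build a 7-D action: six relative joint increments (radians) plus the
    absolute target gripper state (1.0 = open, 0.0 = closed), following the
    delta-joint action space described in Appendix E.4."""
    deltas = [math.radians(c) - math.radians(p)   # unify (assumed) degrees -> radians
              for p, c in zip(prev_joints, curr_joints)]
    # Binary gripper state from the deviation between calibration and live feedback.
    gripper_open = 1.0 if abs(gripper_feedback - gripper_calib) < GRIPPER_DELTA_THRESHOLD else 0.0
    return deltas + [gripper_open]

# Two consecutive joint readings (degrees) sampled at 10 Hz: joint 1 moves by 1 degree.
action = make_action([0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0],
                     gripper_feedback=0.50, gripper_calib=0.50)
```

Storing relative joint increments rather than absolute angles makes episodes composable: the client can later integrate the deltas back into absolute targets, as described in the deployment section.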
Subsequently, the operator inputs a natural language task instruction and switches to Teach Mode to complete object-transport tasks by manually manipulating the arm joints and controlling the gripper via a teach pendant. Each episode contains complete sequential image, state, action, and timestamp information, with task instructions and outcome-based sparse rewards (1.0 for success, -1.0 for failure) recorded in the file metadata. To enhance data robustness, repeated trials are conducted under various lighting and background conditions to increase diversity, collecting approximately 300 episodes per task type for subsequent fine-tuning.

Table 11. Task Categories and Examples

- Atomic Grasping (Low complexity): grasping a specific target object from the tabletop without interacting with other objects. Examples: "Pick up the apple."; "Pick up the banana."
- Spatial Displacement (Medium complexity): moving an object from a starting position to a target region or a specific container. Examples: "Move the apple from point A to point B."; "Pick up the mango from the plate and put it into the bowl."
- Composite Sequential (High complexity): long-horizon tasks requiring multi-step planning and state memory to manipulate multiple objects in sequence. Example: "Take the apple out of the plate and put it in the bowl, then place the banana onto the plate."

E.5. Model Fine-Tuning

We employ a Parameter-Efficient Fine-Tuning (PEFT) (Han et al., 2024; Ding et al., 2023; Xu et al., 2023) strategy based on the OpenVLA-7B model clusters. Specifically, using Low-Rank Adaptation (LoRA) (Hu et al., 2022; Liu et al., 2024b), we inject low-rank adapters with rank r = 32 and a dropout rate of 0.05 while freezing the pretrained backbone parameters; during inference, task-specific statistics are dynamically loaded to achieve action-space un-normalization.
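OpenVLA-style pipelines commonly normalize each action dimension to [-1, 1] using per-dataset 1st/99th-percentile statistics, so the un-normalization step mentioned above would invert that mapping with the dynamically loaded statistics. The exact statistics and scheme used here are not specified, so the sketch below is an assumption for illustration.

```python
def unnormalize_action(norm_action, q01, q99):
    """Map a normalized action in [-1, 1] back to the robot's native action
    range using per-task quantile statistics (q01/q99) loaded at inference
    time. This quantile scheme is a common OpenVLA-style convention, assumed
    here rather than taken from the paper."""
    return [
        0.5 * (a + 1.0) * (hi - lo) + lo
        for a, lo, hi in zip(norm_action, q01, q99)
    ]

# Hypothetical per-dimension statistics: six joint deltas plus a gripper channel.
q01 = [-0.05] * 6 + [0.0]   # assumed 1st-percentile values
q99 = [0.05] * 6 + [1.0]    # assumed 99th-percentile values
raw = unnormalize_action([0.0] * 7, q01, q99)
```

A normalized value of 0.0 maps to the midpoint of each dimension's range, so symmetric joint bounds recover a zero delta while the gripper channel recovers its mid-range value.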
Regarding inference deployment, the system adopts a Client-Server (C/S) architecture, with the computing node acting as the server and the robotic arm as the client. The server loads the model in bfloat16 half precision with the FlashAttention-2 acceleration mechanism. It encapsulates client-uploaded images and natural language instructions (e.g., "pick up the banana") into a Q&A prompt template and employs a greedy decoding strategy to predict a 7-dimensional action vector. Notably, to address the multi-threaded nature of HTTP serving, we introduce a synchronization and mutual exclusion mechanism based on condition variables at the inference interface. Through the cooperative control of a global status flag and a condition lock, this mechanism establishes the core inference routine as a critical section, enforcing serialized execution of requests. This design effectively eliminates the risks of GPU memory overflow and resource contention caused by concurrent multi-threaded GPU calls, while maintaining high-concurrency network communication. Upon receiving the action values (relative joint increments) from the server, the client converts them into absolute target angles via an integration algorithm and determines the gripper's open/close status. Finally, the results are mapped to SDK-level pulse values to drive the PIPER robotic arm via the CAN bus, achieving high-precision closed-loop control.
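The condition-variable serialization scheme described above can be sketched as follows: a busy flag guarded by a `threading.Condition` makes the GPU call a critical section while HTTP worker threads remain free to queue up. The class and the placeholder model call are illustrative, not the actual server code.

```python
import threading

class SerializedInference:
    """Serialize GPU inference across HTTP worker threads with a condition
    variable and a global busy flag, so that model forward passes never
    overlap even under concurrent requests."""

    def __init__(self, run_model):
        self._cond = threading.Condition()
        self._busy = False            # global status flag guarding the GPU
        self._run_model = run_model   # stand-in for the VLA forward pass

    def infer(self, image, instruction):
        with self._cond:
            while self._busy:         # wait until the critical section is free
                self._cond.wait()
            self._busy = True
        try:
            return self._run_model(image, instruction)   # exclusive GPU call
        finally:
            with self._cond:
                self._busy = False
                self._cond.notify()   # wake one waiting request

# Many HTTP threads may call `server.infer(...)`; GPU calls execute one at a time.
server = SerializedInference(lambda img, txt: [0.0] * 7)
```

Releasing the lock around the model call (rather than holding it for the whole request) keeps network I/O concurrent while only the GPU work is serialized, which matches the stated goal of avoiding GPU memory overflow without sacrificing connection throughput.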
