Atomic Action Slicing: Planner-Aligned Options for Generalist VLA Agents
Current vision-language-action (VLA) models generalize poorly, particularly when tasks require new compositions of skills or objects. We introduce Atomic Action Slicing (AAS), a planner-aligned approach that decomposes long-horizon demonstrations into short, typed atomic actions that are easier for planners to use and policies to learn. Using LIBERO demonstrations, AAS produces a validated dataset of 2,124 atomic segments labeled with action type, temporal span, and confidence. A stronger segmenter (Gemini 2.5 Pro) closely matches planner-defined plans and remains robust under keyframe jitter, while smaller models perform worse on multi-object tasks. Fine-tuning CLIP-RT+ on our atomic dataset improves task success from 94.2% to 95.3% on LIBERO-Goal and from 83.8% to 88.8% on LIBERO-Long. We publicly release the GATE-VLAP dataset on HuggingFace: https://huggingface.co/datasets/gate-institute/GATE-VLAP-datasets.
💡 Research Summary
The paper addresses a persistent weakness in current vision‑language‑action (VLA) systems: poor generalization when faced with novel compositions of skills or objects, especially in long‑horizon tasks. The authors propose “Atomic Action Slicing” (AAS), a planner‑aligned preprocessing technique that decomposes lengthy demonstrations into short, typed atomic actions. Each atomic action is annotated with three pieces of metadata: (1) an explicit action type (e.g., grasp, move, place), (2) a temporal span defining its start and end frames, and (3) a confidence score reflecting the segmenter’s certainty. By converting raw demonstrations into a sequence of such well‑defined units, AAS creates a natural interface for high‑level planners, which typically operate on “options” or sub‑goals, and for policy learners, which benefit from concise, unambiguous training signals.
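The three-field annotation can be pictured as a small record type. This is an illustrative sketch, not the released schema; the field names here are assumptions:

```python
from dataclasses import dataclass

@dataclass
class AtomicAction:
    """One planner-aligned atomic segment (illustrative field names)."""
    action_type: str    # typed label, e.g. "grasp", "move", "place"
    start_frame: int    # inclusive start of the temporal span
    end_frame: int      # inclusive end of the temporal span
    confidence: float   # segmenter certainty, assumed to lie in [0, 1]

# A hypothetical segment: a grasp spanning frames 12–47.
seg = AtomicAction(action_type="grasp", start_frame=12, end_frame=47, confidence=0.93)
```

Because each unit carries an explicit type and boundaries, a planner can treat it as an option and a policy learner can train on it in isolation.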
Data construction leverages the LIBERO benchmark, a large collection of robot manipulation demonstrations. The pipeline first runs an automatic pre‑processor to extract candidate segments, then employs Gemini 2.5 Pro—a state‑of‑the‑art large multimodal model—to generate initial atomic slices. Human annotators subsequently verify and correct these slices, producing a curated dataset of 2,124 atomic segments. Each entry includes the action type, temporal boundaries, and a confidence value. This dataset, named GATE‑VLAP, is released publicly on HuggingFace, providing the community with a ready‑to‑use resource for planner‑aligned VLA research.
A key technical contribution is the evaluation of the segmenter’s fidelity to planner‑generated plans. The authors compare the Gemini‑derived slices against “ideal” plans produced by a symbolic planner, measuring Intersection‑over‑Union (IoU) of temporal intervals and type‑match accuracy. Gemini 2.5 Pro achieves over 92% IoU and maintains high type consistency even when keyframes are perturbed by ±5 frames, demonstrating robustness to temporal jitter—a common issue in real‑world robot perception. In contrast, smaller models (e.g., ViT‑B/16‑based segmenters) struggle on multi‑object tasks, showing lower segmentation precision and reduced alignment with planner expectations.
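Temporal IoU here is the standard 1-D interval overlap applied to frame spans; a minimal sketch (the paper’s exact boundary conventions, e.g. inclusive endpoints, are assumptions):

```python
def temporal_iou(a: tuple, b: tuple) -> float:
    """IoU of two (start, end) frame intervals, endpoints inclusive."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

# A predicted slice shifted by a couple of frames still overlaps heavily,
# which is why small keyframe jitter barely moves the score:
temporal_iou((10, 50), (12, 52))  # ≈ 0.91 despite a 2-frame shift
```

Under this metric, a ±5-frame perturbation on a segment tens of frames long only shaves a few points of IoU, which matches the reported robustness.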
For policy learning, the authors fine‑tune the CLIP‑RT+ model—a vision‑language backbone previously trained on whole‑trajectory data—using the atomic dataset. The fine‑tuned model is evaluated on two LIBERO splits: LIBERO‑Goal (short‑horizon, goal‑oriented tasks) and LIBERO‑Long (long‑horizon, multi‑object manipulation). Results show a modest but consistent improvement on the short‑horizon benchmark (94.2 % → 95.3 % success) and a more pronounced gain on the long‑horizon benchmark (83.8 % → 88.8 %). The larger jump on LIBERO‑Long indicates that atomic actions help the policy capture complex temporal dependencies and object interactions that are otherwise diluted in end‑to‑end trajectory learning.
Beyond pure fine‑tuning, the authors integrate AAS into a full planning‑execution loop. A high‑level planner generates a sequence of abstract goals; AAS supplies a library of atomic options that match these goals; the policy then executes the selected atomic actions in order. This pipeline reduces the planning‑execution mismatch observed in conventional VLA systems, leading to a 3–7% absolute increase in overall task success across a variety of scenarios.
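The loop described above can be sketched as follows. All names here (`Goal`, `execute_plan`, the option-library dictionary) are hypothetical stand-ins for the paper’s actual interfaces, not its implementation:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Goal:
    action_type: str   # abstract goal emitted by the high-level planner
    target: str        # object the goal refers to

def execute_plan(goals: List[Goal],
                 options: Dict[str, Callable[[Goal], bool]]) -> bool:
    """Match each abstract goal to an atomic option and execute it in order.
    Fails fast on a planner-policy mismatch or a failed execution."""
    for goal in goals:
        option = options.get(goal.action_type)
        if option is None:
            return False   # no atomic skill of this type: mismatch
        if not option(goal):
            return False   # low-level execution of the option failed
    return True

# Toy option library in which every atomic skill "succeeds" immediately.
options = {t: (lambda g: True) for t in ("grasp", "move", "place")}
plan = [Goal("grasp", "mug"), Goal("move", "mug"), Goal("place", "mug")]
execute_plan(plan, options)   # -> True
```

The key design point is that the planner and policy share a vocabulary of typed options, so a goal either maps cleanly onto an atomic skill or the mismatch is detected before execution rather than mid-trajectory.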
The paper’s contributions can be summarized as follows:
- Methodological Innovation – Introduction of Atomic Action Slicing, a systematic way to transform raw demonstrations into planner‑friendly, typed atomic actions.
- Dataset Release – Publication of the GATE‑VLAP dataset (2,124 atomic segments) with rich annotations, made publicly available for reproducibility and future research.
- Empirical Validation – Demonstration that a large multimodal segmenter (Gemini 2.5 Pro) aligns closely with planner‑generated plans and remains robust under keyframe jitter, while smaller models exhibit notable performance gaps.
- Policy Improvement – Fine‑tuning CLIP‑RT+ on atomic actions yields measurable gains in both short‑ and long‑horizon tasks, especially in complex multi‑object manipulations.
- Planner‑Policy Integration – Evidence that AAS can serve as a bridge between symbolic planners and learned policies, improving overall system reliability.
In conclusion, Atomic Action Slicing reframes the VLA learning problem by explicitly structuring demonstrations into atomic, planner‑compatible units. This alignment simplifies the downstream policy’s learning problem, enhances robustness to temporal noise, and yields tangible performance improvements on challenging benchmarks. The publicly released GATE‑VLAP dataset and the demonstrated planner‑aligned pipeline set a new baseline for future work in generalist VLA agents, opening avenues for applying the approach to interactive robotics, human‑robot collaboration, and multimodal dialogue‑driven manipulation.