RoboGene: Boosting VLA Pre-training via Diversity-Driven Agentic Framework for Real-World Task Generation


The pursuit of general-purpose robotic manipulation is hindered by the scarcity of diverse, real-world interaction data. Unlike web-scale data collection in vision or language, robotic data collection is an active process that incurs prohibitive physical costs. Consequently, automated task curation to maximize data value remains a critical yet under-explored challenge. Existing manual methods are unscalable and biased toward common tasks, while off-the-shelf foundation models often hallucinate physically infeasible instructions. To address this, we introduce RoboGene, an agentic framework designed to automate the generation of diverse, physically plausible manipulation tasks across single-arm, dual-arm, and mobile robots. RoboGene integrates three core components: diversity-driven sampling for broad task coverage, self-reflection mechanisms to enforce physical constraints, and human-in-the-loop refinement for continuous improvement. We conduct extensive quantitative analysis and large-scale real-world experiments, collecting datasets of 18k trajectories and introducing novel metrics to assess task quality, feasibility, and diversity. Results demonstrate that RoboGene significantly outperforms state-of-the-art foundation models (e.g., GPT-4o, Gemini 2.5 Pro). Furthermore, real-world experiments show that VLA models pre-trained with RoboGene achieve higher success rates and superior generalization, underscoring the importance of high-quality task generation. Our project is available at https://robogene-boost-vla.github.io.


💡 Research Summary

RoboGene addresses a critical bottleneck in the development of general‑purpose robotic manipulation: the scarcity of diverse, physically plausible real‑world data for pre‑training Vision‑Language‑Action (VLA) models. Traditional data collection relies on human designers who inevitably bias task selection toward simple, repetitive primitives, leading to long‑tail distributions that limit model generalization. While recent attempts to harness large language models (LLMs) for automated task generation provide scalability, they suffer from hallucinations, lack of physical grounding, and an open‑loop design that ignores execution feedback.

The proposed framework integrates three complementary components: (1) Diversity‑driven sampling, (2) Self‑reflection, and (3) Human‑in‑the‑loop (HITL) long‑term memory.

  1. Diversity‑driven sampling uses a Least Frequently Used (LFU) strategy over scenario categories, objects, and manipulation skills. Usage counters are maintained in a global history H; the sampler selects the least‑used scenario, then filters objects semantically related to that scenario and picks the least‑used objects and skills. This actively pushes the generation process toward under‑explored regions of the task space, mitigating long‑tail bias.
  2. Self‑reflection consists of three specialized LLM‑based evaluators: a Physical Feasibility Evaluator (E_phy) that checks kinematic reachability, collision avoidance, force/torque limits, and temporal synchronization; a Novelty Evaluator (E_nov) that measures how distinct the proposed object‑skill combination is from the existing dataset; and a Consistency Evaluator (E_con) that ensures logical coherence of the task description. The initial proposal generated by a large language model (or a vision‑language model for spatially aware robots) is passed through these evaluators, which return detailed critiques. The Refiner module then revises the proposal based on the critiques.
  3. HITL and long‑term memory allow human experts to intervene, correct hallucinations, substitute out‑of‑distribution objects, or adjust skill parameters. All corrections and execution outcomes (success/failure, error diagnostics) are stored in a long‑term memory M. During subsequent sampling, M is consulted to update H, ensuring that repeated mistakes are avoided and that the system continuously learns from real‑world feedback.
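The LFU sampling step in (1) can be sketched as follows. This is a minimal illustration only: the vocabularies, the scenario-to-object mapping, and the alphabetical tie-breaking rule are assumptions, not details taken from the paper.

```python
from collections import Counter

class LFUSampler:
    """Least-Frequently-Used sampler over scenarios, objects, and skills.

    Usage counters play the role of the global history H; each call
    steers sampling toward the least-used components. (Sketch only;
    the paper's exact data structures are not specified here.)
    """

    def __init__(self, scenarios, objects_by_scenario, skills):
        self.history = Counter()                    # global usage history H
        self.scenarios = scenarios
        self.objects_by_scenario = objects_by_scenario
        self.skills = skills

    def _least_used(self, candidates):
        # Break count ties alphabetically so the sketch is deterministic.
        return min(candidates, key=lambda c: (self.history[c], c))

    def sample(self):
        scenario = self._least_used(self.scenarios)
        # Only consider objects semantically tied to the chosen scenario.
        obj = self._least_used(self.objects_by_scenario[scenario])
        skill = self._least_used(self.skills)
        for item in (scenario, obj, skill):
            self.history[item] += 1                 # update H
        return scenario, obj, skill

sampler = LFUSampler(
    scenarios=["kitchen", "office"],
    objects_by_scenario={"kitchen": ["mug", "plate"],
                         "office": ["pen", "stapler"]},
    skills=["pick", "place", "wipe"],
)
print(sampler.sample())  # ('kitchen', 'mug', 'pick')
print(sampler.sample())  # ('office', 'pen', 'place') — counters push toward unused items
```

Because the counters persist across calls, repeated sampling spreads coverage across the task space instead of collapsing onto a few popular combinations.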

Formally, the generation pipeline is expressed as
$$T = \Phi_{\text{refine}}\big(\Phi_{\text{gen}}(\Phi_{\text{sample}}(E, O, S \mid H),\, R),\, M\big),$$
where $R$ denotes the robot embodiment, and the three $\Phi$ functions correspond to sampling, generation, and refinement, respectively.
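The composition above can be sketched as plain function chaining. Everything below is hypothetical scaffolding: the stub stages stand in for the LLM/VLM calls, and the long-term memory M is folded into the evaluator and refiner closures for brevity.

```python
def phi_sample(E, O, S, H):
    """Φ_sample: pick components given history H (stubbed to first elements)."""
    return (E[0], O[0], S[0])

def phi_gen(components, R):
    """Φ_gen: produce a task proposal for embodiment R.
    In the paper this is an LLM/VLM call; a template stands in here."""
    scenario, obj, skill = components
    return f"[{R}] {skill} the {obj} in the {scenario}"

def phi_refine(proposal, evaluators, refiner, max_rounds=3):
    """Φ_refine: run evaluator critiques (e.g. E_phy, E_nov, E_con) and
    revise until all pass or the round budget is spent."""
    for _ in range(max_rounds):
        critiques = [c for ev in evaluators if (c := ev(proposal)) is not None]
        if not critiques:
            return proposal
        proposal = refiner(proposal, critiques)
    return proposal

def generate_task(E, O, S, H, R, evaluators, refiner):
    """T = Φ_refine(Φ_gen(Φ_sample(E, O, S | H), R), M)."""
    return phi_refine(phi_gen(phi_sample(E, O, S, H), R), evaluators, refiner)

# Toy evaluator/refiner: flag unreachable objects and substitute them.
e_phy = lambda t: "object out of reach" if "shelf" in t else None
refiner = lambda t, critiques: t.replace("shelf", "table")

task = generate_task(["kitchen"], ["mug"], ["pick"], {}, "single-arm",
                     [e_phy], refiner)
print(task)  # "[single-arm] pick the mug in the kitchen"
```

The critique-and-revise loop in `phi_refine` mirrors the closed-loop design described above: a proposal only leaves the pipeline once every evaluator returns no critique, or the round budget is exhausted.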

The authors conducted extensive experiments. They defined 1,200 distinct tasks spanning single‑arm, dual‑arm, and mobile manipulators, collected 15 demonstrations per task, and amassed a dataset of 18k real‑world trajectories. Novel metrics (Physical Feasibility Score, Object Diversity Index, Skill Diversity Index, and Novelty Score) were introduced to quantify both per‑task quality and global dataset properties. Compared with baseline task generators built on GPT‑4o and Gemini 2.5 Pro, RoboGene achieved higher physical feasibility (≈92% vs 78%), greater object diversity (1.8×) and skill diversity (2.1×), and an overall success rate of ≈85% versus 63% for the baselines.
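The paper's exact metric definitions are not reproduced in this summary. As an illustration only, one common way to score diversity over a categorical usage distribution is normalized Shannon entropy; the sketch below shows what an Object or Skill Diversity Index could compute, not what the paper necessarily defines.

```python
import math
from collections import Counter

def diversity_index(items):
    """Normalized Shannon entropy of item usage: 1.0 means perfectly
    uniform usage across categories, 0.0 means a single category
    dominates. (Illustrative assumption, not the paper's definition.)"""
    counts = Counter(items)
    n = sum(counts.values())
    if len(counts) <= 1:
        return 0.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(counts))  # divide by max possible entropy

print(diversity_index(["mug", "plate", "pen", "mug"]))  # balanced -> near 1
print(diversity_index(["mug"] * 9 + ["plate"]))         # skewed  -> near 0
```

A 1.8× gap on such an index, as reported for object diversity, would indicate a substantially flatter usage distribution than the baselines produce.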

To evaluate downstream impact, a state‑of‑the‑art VLA model (π₀) was pre‑trained on the RoboGene dataset and then tested on unseen scenarios involving novel objects, background variations, static distractors, illumination changes, and instruction rewrites. The RoboGene‑pre‑trained model demonstrated 12 %–18 % improvements in success rates and exhibited more robust generalization than models trained on existing datasets.

Key insights include: (i) LFU‑based sampling effectively balances the task distribution in an online manner; (ii) multi‑faceted self‑reflection dramatically reduces physically infeasible or nonsensical proposals without requiring exhaustive manual rule sets; (iii) integrating human feedback into a persistent memory enables continual refinement, turning the system into a self‑improving data generator.

Limitations noted by the authors involve residual sim‑to‑real gaps for highly complex multi‑stage tasks and the computational overhead of running multiple evaluators for each proposal. Future work aims to incorporate multimodal sensor feedback (force/torque, tactile) into the feasibility evaluator and to couple the framework with automatic curriculum learning for progressive difficulty scaling.

In summary, RoboGene presents a closed‑loop, agentic approach to automatically generate high‑quality, diverse, and physically grounded robotic manipulation tasks. By addressing both diversity and feasibility, it supplies the kind of rich training data that unlocks the full potential of generalist VLA models, paving the way toward more capable and adaptable embodied AI systems.

