OpenGVL -- Benchmarking Visual Temporal Progress for Data Curation

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Data scarcity remains one of the main factors limiting progress in robotics. At the same time, the amount of robotics data available in the wild is growing exponentially, creating new opportunities for large-scale data utilization. Reliable temporal task completion prediction could help automatically annotate and curate this data at scale. The Generative Value Learning (GVL) approach was recently proposed, leveraging the knowledge embedded in vision-language models (VLMs) to predict task progress from visual observations. Building upon GVL, we propose OpenGVL, a comprehensive benchmark for estimating task progress across diverse, challenging manipulation tasks involving both robotic and human embodiments. We evaluate the capabilities of publicly available open-source foundation models, showing that open-source model families significantly underperform closed-source counterparts, achieving only approximately 70% of their performance on temporal progress prediction tasks. Furthermore, we demonstrate how OpenGVL can serve as a practical tool for automated data curation and filtering, enabling efficient quality assessment of large-scale robotics datasets. We release the benchmark along with the complete codebase at github.com/budzianowski/opengvl.


💡 Research Summary

The paper introduces OpenGVL, a benchmark designed to evaluate and leverage vision‑language models (VLMs) for predicting temporal task progress in robotic and human manipulation scenarios. Recognizing that data scarcity remains a bottleneck for robotics despite an explosion of publicly shared datasets, the authors build upon the Generative Value Learning (GVL) framework, which uses VLMs to estimate a universal value function that reflects how far a task has progressed. OpenGVL extends this idea by providing a systematic evaluation suite that measures how well various VLMs can predict progress from shuffled video frames, thereby removing reliance on explicit temporal ordering cues.

Benchmark Construction
Four well‑known robotics datasets—NYU Door, Berkeley MVP, CMU Stretch, and NYU Franka—serve as the core validation set. From each dataset, 50 episodes are sampled, and for each episode 15 random frames are extracted and shuffled. Two prompting regimes are examined: zero‑shot (no examples) and two‑shot (two in‑context examples). To prevent overfitting and to enable future community contributions, two hidden, long‑horizon tasks (human and dual‑arm robot assembly of electronic components) are kept private; results on these tasks are only disclosed when new models are submitted.
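The sampling procedure above can be sketched as follows. This is an illustrative reconstruction, not the benchmark's actual code: the function name `make_eval_episode` and its signature are assumptions, but the logic (sample 15 frames per episode, shuffle them, and retain the ground-truth order for scoring) follows the construction described in the paper.

```python
import random

def make_eval_episode(frames, n_frames=15, seed=None):
    """Sample n_frames from an ordered episode, shuffle their presentation
    order, and keep the ground-truth temporal ranks for later VOC scoring.

    frames: any ordered sequence of frame identifiers or images.
    Returns (shuffled_frames, true_ranks), where true_ranks[i] is the
    temporal position (0-based) of the i-th shuffled frame.
    """
    rng = random.Random(seed)
    # Pick frame indices, sorted so they reflect temporal order...
    idx = sorted(rng.sample(range(len(frames)), n_frames))
    # ...then shuffle the order in which they are shown to the VLM.
    shuffled = idx[:]
    rng.shuffle(shuffled)
    true_ranks = [idx.index(j) for j in shuffled]
    return [frames[j] for j in shuffled], true_ranks
```

Shuffling is the key design choice: it forces the model to judge progress from visual content alone, since frame position no longer encodes time.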

Model Suite
The benchmark evaluates a broad spectrum of open‑source VLMs: the Gemma‑3 family (4B, 12B, 27B parameters), Qwen2.5‑VL family (3B, 7B, 32B), GLM‑4.1V‑9B‑Thinking, MiMo‑VL‑7B‑RL‑2508, Cosmos‑Reason1‑7B, and Kimi‑VL‑A3B. For comparison, three proprietary models—GPT‑4o, Gemini‑2.5‑Flash‑lite, and Gemini‑2.5‑Pro—are used as upper‑bound references. All models share a unified vision‑language encoder architecture, enabling a fair assessment of their temporal reasoning capabilities.

Metric: Value‑Order Correlation (VOC)
VOC computes the rank correlation (Spearman or Kendall) between the predicted progress scores v₁, …, v_T and the ground-truth temporal order 1, …, T. A VOC of +1 indicates perfect monotonic alignment, –1 indicates perfect inverse ordering, and 0 denotes no correlation. This metric, originally proposed in GVL, serves as a proxy for how consistently a model can infer progress across a trajectory.

Key Findings

  1. Scale Matters: Larger models consistently achieve higher VOC scores. The 32B Qwen model and the 27B Gemma model reach VOC values above 0.6 on most datasets, while smaller 3B–7B models hover near zero or even negative values.
  2. Open‑Source Gap: Across all datasets and prompting conditions, open‑source VLMs attain roughly 60‑70 % of the VOC scores achieved by proprietary models. For example, Gemini‑2.5‑Pro reaches a VOC of 0.92 on NYU Door (zero‑shot), whereas the best open‑source model (Gemma‑27B) scores 0.64.
  3. Few‑Shot Benefits: Providing two in‑context examples improves VOC for most models, but the gain is modest (often < 0.1) and does not close the gap to closed‑source performance.
  4. Reasoning‑Enabled Models: Models with explicit reasoning modules (MiMo‑VL‑7B‑RL‑2508, GLM‑4.1V‑9B‑Thinking) outperform similarly sized baseline VLMs, suggesting that chain‑of‑thought style prompting can aid temporal inference. However, Kimi‑VL‑A3B underperforms despite strong results on other vision benchmarks, highlighting that reasoning capacity alone is insufficient.

Data Curation Applications
The authors demonstrate how VOC can be used to automatically flag problematic episodes in large‑scale repositories (e.g., the Hugging Face LeRobot hub). Three common failure modes are identified:

  • Task Definition Ambiguity: Datasets with vague instructions (e.g., “dig grass and dump”) yield low VOC because progress cannot be consistently measured.
  • Labeling Ambiguity: Multi‑step tasks with unclear success criteria (e.g., “take out a vial and put it into another pocket”) also produce low VOC, indicating that the VLM cannot infer a reliable progress trajectory.
  • Out‑of‑Distribution / Failure Cases: Episodes with occluded cameras, sensor dropouts, or unexpected robot behavior generate erratic VOC patterns, enabling automatic removal of low‑quality data.

By applying a VOC‑based filter, the authors estimate that roughly 15 % of the 13 000+ publicly shared datasets contain episodes that would likely degrade downstream vision‑language‑action (VLA) model training.
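A VOC-based filter of this kind reduces to thresholding per-episode scores. The sketch below is illustrative: the function name `flag_low_quality` and the threshold value of 0.3 are assumptions, not values taken from the paper, which does not specify the exact cutoff used.

```python
def flag_low_quality(episode_vocs, threshold=0.3):
    """Split episodes into kept and flagged sets by VOC score.

    episode_vocs: dict mapping episode id -> VOC score in [-1, 1].
    threshold: illustrative quality cutoff (an assumption, not the
    paper's value). Episodes below it are flagged for review/removal.
    Returns (kept_ids, flagged_ids), each sorted for determinism.
    """
    kept, flagged = [], []
    for ep_id, score in episode_vocs.items():
        (kept if score >= threshold else flagged).append(ep_id)
    return sorted(kept), sorted(flagged)
```

In practice one would compute VOC per episode with a strong VLM, then review or drop the flagged episodes before training downstream policies.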

Community Infrastructure
OpenGVL is released as a Hugging Face Space with an interactive UI that allows researchers to upload new model checkpoints, run VOC evaluations on hidden test sets, and visualize per‑episode progress curves. All code, prompts, and evaluation scripts are open‑source, encouraging reproducibility and continuous benchmarking as VLMs evolve.

Limitations and Future Directions

  • Fine‑Grained Spatial Reasoning: Current open‑source VLMs struggle with sub‑millimeter precision tasks, indicating a need for better spatial grounding.
  • Zero‑Shot Robustness: The relatively low zero‑shot VOC suggests that practical deployment will still require carefully crafted prompts or few‑shot examples, adding overhead.
  • Metric Completeness: VOC captures ordering correlation but not absolute success probability; combining VOC with reward‑based metrics could provide a more holistic quality assessment.

Conclusion
OpenGVL provides the first comprehensive, open‑source benchmark for temporal progress prediction in robotics, quantifying the performance gap between open‑source and proprietary VLMs and showcasing a practical pipeline for large‑scale data curation. By making the benchmark, code, and interactive evaluation platform publicly available, the authors lay the groundwork for the community to drive rapid improvements in vision‑language‑action models and to build higher‑quality robotics datasets at scale.

