Act, Sense, Act: Learning Non-Markovian Active Perception Strategies from Large-Scale Egocentric Human Data
Achieving generalizable manipulation in unconstrained environments requires the robot to proactively resolve information uncertainty, i.e., the capability of active perception. However, existing methods are often confined to a limited set of sensing behaviors, restricting their applicability to complex environments. In this work, we formalize active perception as a non-Markovian process driven by information gain and decision branching, providing a structured categorization of visual active perception paradigms. Building on this perspective, we introduce CoMe-VLA, a cognitive and memory-aware vision-language-action (VLA) framework that leverages large-scale human egocentric data to learn versatile exploration and manipulation priors. Our framework integrates a cognitive auxiliary head for autonomous sub-task transitions and a dual-track memory system that maintains consistent self and environmental awareness by fusing proprioceptive and visual temporal contexts. By aligning human and robot hand-eye coordination behaviors in a unified egocentric action space, we train the model progressively in three stages. Extensive experiments on a wheel-based humanoid demonstrate the strong robustness and adaptability of our method across diverse long-horizon tasks spanning multiple active-perception scenarios.
💡 Research Summary
This paper tackles the long‑standing challenge of enabling robots to actively resolve information uncertainty in unstructured environments. The authors formalize active perception as a non‑Markovian decision process (NMDP) driven by two mechanisms: (1) information gain, quantified as the conditional mutual information between an action chunk and the subsequent observation given the full history, and (2) decision branching, where the policy selects actions based on both the current observation and the accumulated context. They further categorize visual active perception into information discovery (via viewpoint change or manipulation) and information enrichment (refining already visible data).
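The information-gain criterion described above can be written out explicitly. As a sketch (the notation $h_t$ for the full observation-action history, $a_{t:t+K}$ for the action chunk, and $o_{t+K}$ for the subsequent observation is ours, chosen to match the summary, not necessarily the paper's):

```latex
% Information gain of an action chunk about the next observation,
% conditioned on the full history h_t. The conditioning on h_t rather
% than the current state alone is what makes the process non-Markovian.
\mathrm{IG}(a_{t:t+K}) = I\bigl(o_{t+K};\, a_{t:t+K} \mid h_t\bigr)
  = H\bigl(o_{t+K} \mid h_t\bigr) - H\bigl(o_{t+K} \mid h_t,\, a_{t:t+K}\bigr)
```

Under this reading, an action chunk is informative exactly when executing it is expected to reduce the policy's uncertainty about what it will observe next, given everything it has already seen and done.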
To learn robust priors for such behaviors, the work leverages large‑scale egocentric human datasets (CaptainCook4D and Ego‑Exo4D) that provide fine‑grained hand and head pose annotations, rich action labels, and scenarios with heavy occlusions. These human demonstrations are aligned with robot data collected through immersive VR teleoperation of a wheel‑based humanoid (Corenetic Monte02). Alignment is achieved by mapping both modalities into a unified egocentric action space: each episode is expressed relative to its first frame, and hand configurations are abstracted to a single gripper width, while robot chassis and head motions are combined into a composite head pose.
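The alignment step described above can be sketched in a few lines. The function names and the thumb-to-index distance used for gripper width are our illustrative assumptions; the summary only states that episodes are expressed relative to their first frame and that hand configurations are abstracted to a single width:

```python
import numpy as np

def to_egocentric_frame(poses):
    """Express a sequence of 4x4 homogeneous poses relative to the
    episode's first frame, so human and robot episodes share a common
    origin regardless of absolute world coordinates (a sketch; the
    paper's exact conventions are not given in the summary)."""
    T0_inv = np.linalg.inv(poses[0])
    return np.stack([T0_inv @ T for T in poses])

def hand_to_gripper_width(thumb_tip, index_tip):
    """Abstract a full human hand configuration to one scalar gripper
    width -- here, thumb-to-index fingertip distance (a plausible
    choice; the paper's exact mapping may differ)."""
    return float(np.linalg.norm(np.asarray(thumb_tip) - np.asarray(index_tip)))
```

With this normalization, the first frame of every episode maps to the identity pose, which is what lets heterogeneous human and robot trajectories be trained in one action space.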
The core contribution is CoMe‑VLA, a cognitive‑and‑memory‑aware Vision‑Language‑Action framework built on the Qwen3‑VL‑2B visual‑language backbone. CoMe‑VLA receives a sequence of egocentric RGB frames, a textual task description, a special “cognitive token,” and proprioceptive joint states. It processes these through a dual‑track memory encoder (separate visual and proprioceptive streams) and a flow‑matching action decoder that predicts a chunk of K future actions, each consisting of 6‑DoF pose for the head and both end‑effectors plus gripper aperture. A cognitive auxiliary head predicts sub‑task completion, automatically triggering transitions without external supervision.
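The per-step action described above can be pictured as a flat vector. The exact dimension counts below (6-DoF head pose plus a 6-DoF pose and one gripper aperture per end-effector, 20 dims per step) are our reading of the summary, not a confirmed layout from the paper:

```python
import numpy as np

# Illustrative per-step action layout (dimension counts assumed):
# head pose (6) + left EE pose (6) + left gripper (1)
#              + right EE pose (6) + right gripper (1) = 20 dims.
HEAD, EE, GRIP = 6, 6, 1
STEP_DIM = HEAD + 2 * (EE + GRIP)  # 20

def split_action_chunk(chunk):
    """Split a (K, 20) action chunk, as a flow-matching decoder might
    emit it, into named components for execution."""
    chunk = np.asarray(chunk)
    assert chunk.ndim == 2 and chunk.shape[1] == STEP_DIM
    return {
        "head":       chunk[:, 0:6],
        "left_ee":    chunk[:, 6:12],
        "left_grip":  chunk[:, 12:13],
        "right_ee":   chunk[:, 13:19],
        "right_grip": chunk[:, 19:20],
    }
```

Predicting a chunk of K such steps at once, rather than one action at a time, is what lets a single decoder call commit to a short "act-sense-act" segment before the next observation arrives.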
Training proceeds in three stages: (i) cognitive pre‑training on human data only, (ii) full‑model pre‑training on the same human corpus, and (iii) fine‑tuning on robot tele‑operation data. This curriculum allows the model to first absorb human exploration and manipulation priors, then adapt them to the robot’s embodiment and sensor characteristics.
Extensive experiments on the wheel‑based humanoid cover a suite of long‑horizon tasks that require viewpoint adjustment, drawer opening, object search, precise grasping, and recovery from dynamic perturbations. Results show that CoMe‑VLA achieves high success rates (often above 85%) and demonstrates emergent active‑perception behaviors such as purposeful scanning, strategic manipulation to reveal hidden objects, and information enrichment through close inspection. Ablation studies confirm that both the cognitive head and the dual‑track memory are essential: removing the cognitive head degrades sub‑task transition timing, while omitting either memory stream reduces performance on tasks demanding long‑term context.
In summary, the paper makes three key contributions: (1) a principled non‑Markovian formulation of active perception with a clear taxonomy of visual exploration strategies; (2) a method to distill human egocentric priors into a robot‑compatible action space, enabling transfer of “act‑sense‑act” strategies; and (3) the CoMe‑VLA architecture that combines cognition and memory to maintain environmental awareness and autonomously manage sub‑tasks. The work pushes the field beyond reactive perception, showing that integrating human‑derived exploratory priors with memory‑aware neural policies can endow robots with robust, generalizable active perception capabilities suitable for real‑world deployment.