Spotlighting Task-Relevant Features: Object-Centric Representations for Better Generalization in Robotic Manipulation
The generalization capabilities of robotic manipulation policies are heavily influenced by the choice of visual representations. Existing approaches typically rely on representations extracted from pre-trained encoders, using two dominant types of features: global features, which summarize an entire image via a single pooled vector, and dense features, which preserve a patch-wise embedding from the final encoder layer. While widely used, both feature types mix task-relevant and irrelevant information, leading to poor generalization under distribution shifts, such as changes in lighting, textures, or the presence of distractors. In this work, we explore an intermediate structured alternative: Slot-Based Object-Centric Representations (SBOCR), which group dense features into a finite set of object-like entities. This representation naturally reduces the noise passed to the robotic manipulation policy while retaining enough information to perform the task efficiently. We benchmark a range of global and dense representations against intermediate slot-based representations, across a suite of simulated and real-world manipulation tasks ranging from simple to complex. We evaluate their generalization under diverse visual conditions, including changes in lighting, texture, and the presence of distractors. Our findings reveal that SBOCR-based policies outperform dense and global representation-based policies in generalization settings, even without task-specific pretraining. These insights suggest that SBOCR is a promising direction for designing visual systems that generalize effectively in dynamic, real-world robotic environments.
💡 Research Summary
The paper investigates how visual representations affect the generalization of robotic manipulation policies. While most recent approaches rely on either global features (a single pooled vector or CLS token) or dense features (patch‑wise embeddings), both mix task‑relevant and irrelevant information, leading to brittleness under visual distribution shifts such as lighting changes, texture variations, or distractor objects. To address this, the authors propose an intermediate, structured representation: Slot‑Based Object‑Centric Representations (SBOCR). They first extract dense feature tokens from a strong pretrained backbone (e.g., DINOv2) and then apply a Slot Attention module that iteratively binds these tokens to a fixed number of slots. Each slot learns to specialize on a distinct region of the image, effectively producing object‑level embeddings. This process suppresses irrelevant background cues while preserving the spatial organization needed for manipulation.
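The iterative binding step described above can be sketched as follows. This is a minimal, simplified illustration of the Slot Attention mechanism (softmax over slots so tokens compete for assignment, then a weighted-mean slot update); the learned projections, LayerNorms, and GRU update of the actual module are omitted, and all names here are illustrative rather than taken from the paper's code.

```python
import numpy as np

def slot_attention(inputs, num_slots=4, iters=3, seed=0):
    """Simplified Slot Attention sketch (no learned projections or GRU).

    inputs: (n_tokens, d) dense patch features from a frozen backbone.
    Returns (num_slots, d) object-level slot embeddings.
    """
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    # Slots are initialized randomly and refined over a few iterations.
    slots = rng.normal(size=(num_slots, d))
    for _ in range(iters):
        # Dot-product attention logits between slots (queries) and tokens (keys).
        logits = slots @ inputs.T / np.sqrt(d)            # (num_slots, n)
        # Softmax over the SLOT axis: tokens compete for slot assignment,
        # which is what encourages each slot to bind a distinct region.
        attn = np.exp(logits - logits.max(axis=0, keepdims=True))
        attn = attn / attn.sum(axis=0, keepdims=True)     # (num_slots, n)
        # Update each slot as the weighted mean of its assigned tokens.
        weights = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
        slots = weights @ inputs                          # (num_slots, d)
    return slots
```

Because the softmax normalizes over slots rather than tokens, each patch distributes its "vote" across a fixed budget of slots, which is what yields the object-like grouping exploited by the policy.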
Two pre‑training regimes are explored. The first, DINOSAUR*, trains Slot Attention on the COCO dataset only, providing a fair baseline against non‑robotic pre‑training. The second, DINOSAUR‑Rob*, adds a large‑scale robotic video pre‑training phase (≈188 k trajectories from BridgeData V2, Fractal, DROID) to align the visual encoder with the distribution of robot‑centric scenes. Both models keep the visual backbone frozen during downstream policy learning to isolate the effect of the representation itself.
Policy learning uses a unified BAKU‑style transformer architecture that can ingest any visual token type (global, dense, or slot‑based) without architectural changes. Visual tokens are concatenated with proprioceptive and language embeddings, processed by a transformer observation trunk, and finally decoded into actions by an MLP head. This design ensures a fair comparison across representation families.
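The token interface that makes this comparison fair can be sketched as below: whatever the representation family, the visual tokens are simply projected to a common width and concatenated with proprioceptive and language tokens before the transformer trunk. This is a hypothetical sketch of that interface, with random placeholder projections standing in for the learned ones; the function and argument names are illustrative, not from the BAKU codebase.

```python
import numpy as np

def assemble_observation_tokens(visual_tokens, proprio, lang, d_model=32, seed=0):
    """Sketch of a BAKU-style token interface.

    visual_tokens: (k, d_v) -- k=1 for a global feature, k=num_patches for
                   dense features, or k=num_slots for slot-based features.
    proprio: (d_p,) joint/gripper state vector.
    lang: (d_l,) language instruction embedding.
    Returns a (k + 2, d_model) token sequence for the transformer trunk;
    the policy architecture never changes, only k and d_v do.
    """
    rng = np.random.default_rng(seed)

    def proj(x, d_in):
        # Placeholder linear projection to the shared model width.
        return x @ rng.normal(size=(d_in, d_model)) / np.sqrt(d_in)

    tokens = [
        proj(visual_tokens, visual_tokens.shape[1]),
        proj(proprio[None, :], proprio.shape[0]),
        proj(lang[None, :], lang.shape[0]),
    ]
    return np.concatenate(tokens, axis=0)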
The authors benchmark seven visual representation methods across three environments: two simulated suites (MetaWorld and LIBERO) and a real‑world setup with a WidowX‑250 arm performing multiple tabletop tasks. Evaluation metrics include task success rate and robustness under systematic distribution shifts (varying illumination, textures, and added distractor objects). Results show that policies built on SBOCR consistently outperform those using global or dense features, achieving 12–18% higher success under shift conditions. Moreover, the robot‑pretrained DINOSAUR‑Rob* further improves performance by 4–7% over the COCO‑only version, demonstrating that large‑scale robot video pre‑training benefits object‑centric encoders.
Key insights are: (1) structuring visual input into object‑level slots enables the policy to “focus” on relevant entities, reducing noise from irrelevant scene elements; (2) Slot Attention can be efficiently combined with modern vision backbones, preserving rich pretrained features while adding minimal computational overhead suitable for real‑time control; (3) domain‑specific pre‑training on robotic video data enhances the quality of the slots, but even generic image pre‑training yields substantial gains over flat representations; (4) freezing the visual encoder isolates representation quality as the primary driver of generalization, highlighting that visual abstraction, not policy architecture, is the bottleneck for robust manipulation.
Overall, the work provides the first large‑scale, systematic comparison of global, dense, and object‑centric visual representations for multi‑task robotic manipulation, and establishes slot‑based object‑centric encodings as a promising direction for building more robust, generalizable visuomotor policies in dynamic real‑world environments.