Learning 2D Invariant Affordance Knowledge for 3D Affordance Grounding
3D Object Affordance Grounding aims to predict the functional regions on a 3D object and has laid the foundation for a wide range of applications in robotics. Recent advances tackle this problem by learning a mapping between 3D regions and a single human-object interaction image. However, the geometric structure of the 3D object and that of the object in the human-object interaction image are not always consistent, leading to poor generalization. To address this issue, we propose to learn generalizable invariant affordance knowledge from multiple human-object interaction images within the same affordance category. Specifically, we introduce the Multi-Image Guided Invariant-Feature-Aware 3D Affordance Grounding (MIFAG) framework. It grounds 3D object affordance regions by identifying common interaction patterns across multiple human-object interaction images. First, the Invariant Affordance Knowledge Extraction Module (IAM) utilizes an iterative updating strategy to gradually extract aligned affordance knowledge from multiple images and integrate it into an affordance dictionary. Then, the Affordance Dictionary Adaptive Fusion Module (ADM) learns comprehensive point cloud representations that consider all affordance candidates in multiple images. In addition, we construct the Multi-Image and Point Affordance (MIPA) benchmark, on which our method outperforms existing state-of-the-art methods across a range of experimental comparisons.
💡 Research Summary
The paper addresses the problem of 3D object affordance grounding, which aims to predict functional regions on a 3D object’s point cloud. Existing methods typically learn a mapping from the point cloud to a single human‑object interaction image. Because the geometry of the 3D object and the object depicted in the reference image can differ substantially, those approaches suffer from poor generalization to new objects or viewpoints.
To overcome this limitation, the authors propose MIFAG (Multi‑Image Guided Invariant‑Feature‑Aware 3D Affordance Grounding), a framework that extracts “invariant affordance knowledge” from multiple reference images belonging to the same affordance category and then integrates this knowledge with the 3D point cloud. The system consists of two main modules:
- Invariant Affordance Knowledge Extraction Module (IAM) – IAM receives a set of n reference images {I₁,…,Iₙ}. It initializes a set of learnable affordance query tokens Q and iteratively refines them through a multi‑layer network. Each layer performs multi‑head cross‑attention (MCA) between the current queries and the image features, followed by multi‑head self‑attention (MSA) on the image features themselves. An MLP aggregates the queries from all images, encouraging the model to capture the common affordance pattern while suppressing appearance‑specific noise. A similarity loss is applied to the image features at every iteration to enforce that all images share the same affordance category. After L iterations, the final query embeddings constitute an “affordance dictionary” that encodes the invariant knowledge distilled from all images.
- Affordance Dictionary Adaptive Fusion Module (ADM) – ADM takes the point‑cloud feature P_in and the affordance dictionary generated by IAM. It creates a point‑cloud query Q_P via a linear projection and treats the dictionary entries as keys K_ID and values V_ID. A novel Invariant‑aware Query Dictionary Cross‑Attention (IQDCA) computes cosine‑similarity‑based attention weights A = Softmax(cos(Q_P, K_ID)) and produces a weighted sum of V_ID, yielding a tensor P_q that contains, for each point, its relevance to every image‑derived affordance token. A subsequent self‑weighted attention layer discards irrelevant tokens and produces the final fused representation P_mix. This design goes beyond conventional self‑attention by explicitly using the external dictionary to guide learning, thereby bridging the modality gap between 2D visual cues and 3D geometry.
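The two-module pipeline above can be sketched in a few lines of NumPy. This is a simplified single-head, training-free stand-in: the real IAM uses multi-head attention plus an MLP aggregator, and all shapes, the iteration count, and the temperature `tau` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(Q, K, V, tau=1.0):
    """Single-head cross-attention with cosine-similarity scores,
    as in the IQDCA formulation A = Softmax(cos(Q, K))."""
    Qn = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + 1e-8)
    Kn = K / (np.linalg.norm(K, axis=-1, keepdims=True) + 1e-8)
    A = softmax(Qn @ Kn.T / tau)      # (n_queries, n_keys), rows sum to 1
    return A @ V, A

def build_affordance_dictionary(image_feats, Q, n_iters=3):
    """IAM sketch: iteratively refine query tokens against each image's
    features, then average across images to distill the common pattern
    (the mean stands in for the paper's MLP aggregator).
    image_feats: list of (m_i, d) arrays; Q: (k, d) learnable queries."""
    for _ in range(n_iters):
        per_image = [cross_attend(Q, F, F)[0] for F in image_feats]
        Q = np.mean(per_image, axis=0)
    return Q  # the "affordance dictionary" of k invariant tokens

def iqdca(P_queries, dictionary, tau=0.1):
    """ADM sketch: point-cloud queries Q_P attend to the dictionary,
    which serves as both keys K_ID and values V_ID."""
    P_q, A = cross_attend(P_queries, dictionary, dictionary, tau=tau)
    return P_q, A
```

Using cosine similarity rather than a raw dot product keeps the attention scores scale-invariant, which matches the summary's emphasis on suppressing appearance-specific variation across reference images.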
The training objective combines three losses: (i) a similarity loss in IAM to keep image features aligned, (ii) a heat‑map loss on the point‑cloud predictions to encourage accurate spatial localization, and (iii) a cross‑entropy loss for the final affordance classification.
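A minimal sketch of how the three objectives could be combined. The concrete loss forms (mean pairwise cosine distance for the similarity term, per-point MSE for the heat map) and the unit balancing weights are assumptions for illustration; the summary does not specify them.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def similarity_loss(feats):
    """(i) Keep image features from the same affordance category aligned.
    feats: (n_images, d). One minus the mean pairwise cosine similarity
    (a common choice; the paper's exact form may differ)."""
    f = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    cos = f @ f.T
    n = len(feats)
    mean_off_diag = (cos.sum() - n) / (n * (n - 1))
    return float(1.0 - mean_off_diag)

def heatmap_loss(pred, target):
    """(ii) Per-point MSE between predicted and ground-truth heat maps."""
    return float(np.mean((pred - target) ** 2))

def ce_loss(logits, label):
    """(iii) Cross-entropy for the final affordance classification."""
    p = softmax(logits)
    return float(-np.log(p[label] + 1e-12))

def total_loss(feats, pred, target, logits, label,
               w_sim=1.0, w_hm=1.0, w_ce=1.0):
    # Balancing weights are illustrative assumptions, not from the paper.
    return (w_sim * similarity_loss(feats)
            + w_hm * heatmap_loss(pred, target)
            + w_ce * ce_loss(logits, label))
```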
In addition to the methodological contributions, the authors introduce the Multi‑Image and Point Affordance (MIPA) benchmark. MIPA extends existing 3D affordance datasets by providing multiple human‑object interaction images per object, enabling evaluation of how well a model can leverage diverse visual references.
Experimental results on MIPA show that MIFAG outperforms prior state‑of‑the‑art methods such as 3D AffordanceNet, IAGNet, LASO, and others across standard metrics (IoU, mAP, F1). The performance gap widens especially when the reference images exhibit large variations in appearance, scale, or viewpoint, confirming the effectiveness of the invariant knowledge extraction and adaptive fusion strategies.
Overall, the paper makes three key contributions: (1) a dual‑branch IAM that extracts robust, appearance‑invariant affordance cues from multiple images, (2) an ADM that fuses these cues with point‑cloud features through dictionary‑guided cross‑attention, and (3) the MIPA benchmark for systematic evaluation. The proposed approach promises more reliable affordance grounding for downstream robotics tasks such as manipulation, human‑robot collaboration, and embodied AI in complex, visually diverse environments.