Back to 3D: Few-Shot 3D Keypoint Detection with Back-Projected 2D Features

With the immense growth of dataset sizes and computing resources in recent years, so-called foundation models have become popular in NLP and vision tasks. In this work, we propose to explore foundation models for the task of keypoint detection on 3D shapes. A unique characteristic of keypoint detection is that it requires semantic and geometric awareness while demanding high localization accuracy. To address this problem, we propose, first, to back-project features from large pre-trained 2D vision models onto 3D shapes and employ them for this task. We show that we obtain robust 3D features that contain rich semantic information and analyze multiple candidate features stemming from different 2D foundation models. Second, we employ a keypoint candidate optimization module which aims to match the average observed distribution of keypoints on the shape and is guided by the back-projected features. The resulting approach achieves a new state of the art for few-shot keypoint detection on the KeyPointNet dataset, almost doubling the performance of the previous best methods.


💡 Research Summary

The paper tackles the challenging problem of 3D keypoint detection, which demands both semantic understanding of object parts and precise geometric localization. Traditional 3D‑only networks struggle in few‑shot scenarios because they lack large labeled datasets and require heavy computation. To overcome these limitations, the authors propose a two‑stage framework that leverages large‑scale 2D vision foundation models (such as CLIP, DINO, SAM, MAE) and projects their rich feature representations onto 3D shapes.

Stage 1 – Back‑projection of 2D features.
The input mesh is rendered from multiple calibrated camera viewpoints. For each view, a pre‑trained 2D backbone extracts high‑dimensional feature maps. Using depth maps and camera intrinsics/extrinsics, each pixel’s feature is back‑projected onto the corresponding 3D vertex. Because a vertex may be visible in several views, the authors experiment with different aggregation schemes: simple averaging, weighted averaging based on view confidence, and attention‑based fusion. The resulting per‑vertex descriptors encode semantic cues (e.g., “wheel”, “handle”) as well as geometric cues (curvature, normal direction), effectively turning a 3D point cloud into a semantically enriched feature field.
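The projection-and-aggregation step above can be sketched in a few lines of NumPy. This is a minimal illustration using simple averaging only; the function and argument names are hypothetical, and the paper's renderer, sampling, and visibility handling may differ:

```python
import numpy as np

def backproject_features(feat_maps, depth_maps, Ks, Rts, vertices, tol=0.01):
    """Accumulate per-view 2D features onto 3D vertices by simple averaging.

    feat_maps:  list of (H, W, C) feature maps, one per rendered view
    depth_maps: list of (H, W) depth maps from the same renders
    Ks, Rts:    per-view intrinsics (3x3) and world-to-camera extrinsics (3x4)
    vertices:   (V, 3) mesh vertices in world coordinates
    """
    V = vertices.shape[0]
    C = feat_maps[0].shape[-1]
    acc = np.zeros((V, C))
    count = np.zeros(V)
    verts_h = np.concatenate([vertices, np.ones((V, 1))], axis=1)  # homogeneous

    for F, D, K, Rt in zip(feat_maps, depth_maps, Ks, Rts):
        cam = verts_h @ Rt.T              # (V, 3) camera-space points
        z = cam[:, 2]
        pix = cam @ K.T                   # apply intrinsics
        u = pix[:, 0] / z                 # perspective divide
        v = pix[:, 1] / z
        H, W = D.shape
        ui, vi = np.round(u).astype(int), np.round(v).astype(int)
        inside = (z > 0) & (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)
        # visibility: vertex depth must agree with the rendered depth buffer
        vis = inside.copy()
        vis[inside] &= np.abs(D[vi[inside], ui[inside]] - z[inside]) < tol
        acc[vis] += F[vi[vis], ui[vis]]
        count[vis] += 1

    count = np.maximum(count, 1)          # vertices seen in no view keep zeros
    return acc / count[:, None]
```

Replacing the final division with a confidence- or attention-weighted sum would give the other aggregation schemes the authors compare.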

Stage 2 – Keypoint candidate optimization.
From the back‑projected descriptors an initial set of keypoint candidates is generated. The authors explore heat‑map peak detection and K‑means clustering on the feature space to obtain a coarse set of locations. They then model the empirical distribution of keypoints across the training set as a mixture of Gaussians (or a learned probability map). An EM‑like iterative process refines the candidates: the E‑step computes the posterior probability that each candidate belongs to the learned distribution, while the M‑step updates the candidate positions by maximizing similarity to the back‑projected features under a small geometric regularizer. This optimization aligns the candidate set with both the statistical prior of where keypoints typically appear and the local semantic evidence supplied by the 2D‑derived features.
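The E/M alternation can be illustrated with a toy sketch. Everything here — the mean-shift-style update, the fixed regularizer weight, and all names — is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

def refine_candidates(cands, vertices, vert_feats, query_feat,
                      prior_means, prior_sigma=0.1, n_iters=10, bandwidth=0.5):
    """EM-like refinement sketch (update rule and parameters are illustrative).

    E-step: responsibility of each candidate under a Gaussian prior over
    typical keypoint locations. M-step: move each candidate toward nearby
    vertices whose back-projected features match the query descriptor.
    """
    sim = vert_feats @ query_feat                     # feature similarity per vertex
    sim = np.exp(sim - sim.max())                     # soft, positive weights
    for _ in range(n_iters):
        # E-step: posterior over prior components for each candidate
        d2 = ((cands[:, None, :] - prior_means[None]) ** 2).sum(-1)
        resp = np.exp(-d2 / (2 * prior_sigma ** 2))
        resp /= resp.sum(1, keepdims=True) + 1e-12
        anchor = resp @ prior_means                   # expected prior location
        # M-step: shift toward feature-similar vertices, pulled to the prior
        d2v = ((cands[:, None, :] - vertices[None]) ** 2).sum(-1)
        w = sim[None] * np.exp(-d2v / (2 * bandwidth ** 2))
        w /= w.sum(1, keepdims=True) + 1e-12
        cands = 0.9 * (w @ vertices) + 0.1 * anchor   # small geometric regularizer
    return cands
```

The 0.9/0.1 mixing plays the role of the "small geometric regularizer": candidates follow the feature evidence but cannot drift far from where keypoints statistically occur.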

Few‑shot learning protocol.
The method is evaluated on the KeyPointNet benchmark under 1‑shot, 5‑shot, and 10‑shot regimes. Only a handful of annotated shapes are used for supervision; the rest of the dataset is unlabeled. Despite this scarcity, the proposed pipeline achieves state‑of‑the‑art performance, nearly doubling the mean Average Precision (mAP) of the previous best few‑shot approaches. The gains are especially pronounced on complex, articulated objects such as human hands and animal bodies, where semantic cues from 2D models are crucial.
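The summary does not spell out the matching rule behind the reported scores; a minimal distance-threshold correctness measure of the kind commonly used for keypoints might look like the following (purely illustrative — the benchmark's actual protocol may differ):

```python
import numpy as np

def keypoint_accuracy(pred, gt, threshold=0.1):
    """Fraction of ground-truth keypoints matched by some prediction within
    `threshold` (Euclidean). Greedy one-to-one matching; a simplification of
    whatever matching the benchmark actually uses."""
    pred = list(map(np.asarray, pred))
    matched = 0
    for g in gt:
        if not pred:
            break
        d = [np.linalg.norm(p - g) for p in pred]
        i = int(np.argmin(d))
        if d[i] <= threshold:
            matched += 1
            pred.pop(i)          # each prediction can match at most one keypoint
    return matched / len(gt)
```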

Ablation studies.
The authors conduct extensive ablations to dissect the contributions of each component. They compare several foundation models, finding that CLIP‑ViT‑B/16 provides the most balanced semantic‑geometric representation. Increasing the number of rendering views improves performance up to about 12 views, after which gains saturate. Feature aggregation via attention yields a modest but consistent boost over simple averaging. Removing the distribution‑guided optimization reduces performance dramatically, confirming that the prior on keypoint locations is essential in the few‑shot regime.

Limitations and future work.
The back‑projection relies on sufficient view coverage; sparse or occluded viewpoints can lead to incomplete vertex descriptors. High‑resolution meshes also cause memory overhead during multi‑view feature accumulation. The authors propose future directions such as adaptive view selection, feature compression (e.g., PCA, quantization), and extending the pipeline to raw point clouds without mesh connectivity. They also suggest integrating recent multimodal foundation models (e.g., CLIP‑Vision‑Language) to incorporate textual priors for even richer semantic guidance.
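The PCA-based compression the authors suggest for reducing multi-view memory overhead could be prototyped as below. This is a generic sketch, not code from the paper; `k=64` and the interface are arbitrary illustrative choices:

```python
import numpy as np

def pca_compress(feats, k=64):
    """Compress (V, C) per-vertex descriptors to (V, k) via PCA.

    Returns the low-dimensional codes plus the components and mean needed
    to (approximately) reconstruct the original descriptors.
    """
    mean = feats.mean(0, keepdims=True)
    X = feats - mean
    # SVD of the centered matrix; the top-k right singular vectors are the
    # principal components
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    comps = Vt[:k]
    return X @ comps.T, comps, mean
```

Quantizing the resulting codes (e.g., to 8-bit) would compress further, at the cost of some reconstruction error.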

Impact.
By bridging 2D foundation models and 3D geometry, the work demonstrates a practical path to high‑accuracy keypoint detection with minimal annotation effort. This has immediate implications for downstream tasks such as 3D pose estimation, shape correspondence, and robotic manipulation, where reliable keypoint localization is a prerequisite. The paper sets a new benchmark for few‑shot 3D keypoint detection and opens avenues for further cross‑modal transfer learning in 3D vision.