Gaussian-Constrained LeJEPA Representations for Unsupervised Scene Discovery and Pose Consistency

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Unsupervised 3D scene reconstruction from unstructured image collections remains a fundamental challenge in computer vision, particularly when images originate from multiple unrelated scenes and contain significant visual ambiguity. The Image Matching Challenge 2025 (IMC2025) highlights these difficulties by requiring both scene discovery and camera pose estimation under real-world conditions, including outliers and mixed content. This paper investigates the application of Gaussian-constrained representations inspired by LeJEPA (Joint Embedding Predictive Architecture) to address these challenges. We present three progressively refined pipelines, culminating in a LeJEPA-inspired approach that enforces isotropic Gaussian constraints on learned image embeddings. Rather than introducing new theoretical guarantees, our work empirically evaluates how these constraints influence clustering consistency and pose estimation robustness in practice. Experimental results on IMC2025 demonstrate that Gaussian-constrained embeddings can improve scene separation and pose plausibility compared to heuristic-driven baselines, particularly in visually ambiguous settings. These findings suggest that theoretically motivated representation constraints offer a promising direction for bridging self-supervised learning principles and practical structure-from-motion pipelines.

💡 Research Summary

The paper tackles the problem posed by the Image Matching Challenge 2025 (IMC2025): unsupervised 3D reconstruction from a heterogeneous collection of images that belong to multiple unrelated scenes, with emphasis on both scene discovery (clustering) and camera pose estimation under real‑world conditions such as outliers and visual ambiguity. Traditional Structure‑from‑Motion (SfM) pipelines, which rely on handcrafted features (SIFT, ORB), geometric verification, and heuristic clustering, break down when faced with mixed‑scene data. To bridge the gap between self‑supervised learning theory and practical SfM, the authors adopt the recent LeJEPA framework (Le et al., 2025), which proposes provable learning objectives that avoid contrastive heuristics. LeJEPA's core regularizer is Sketched Isotropic Gaussian Regularization (SIGReg), which is motivated by the result that, under a signal‑plus‑noise model, the optimal latent representations follow an isotropic Gaussian distribution.

The authors design three progressively refined pipelines. Pipeline 1 is a conventional, score‑optimized baseline that uses RootSIFT with CLAHE, FLANN matching, DBSCAN ensemble clustering, and a circular‑trajectory heuristic for pose generation. It achieves a high public score (0.96) but a very low private score (0.19), indicating over‑fitting to the public leaderboard. Pipeline 2 improves robustness by normalizing SIFT, adding bilateral filtering, using an adaptive multi‑strategy matcher (FLANN + brute‑force), and inferring scene type (planar, linear, object‑centric) before pose generation. This version yields more balanced public and private scores (0.87 / 0.54) and demonstrates better generalization.
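As a rough illustration of the RootSIFT step named in Pipeline 1, the sketch below applies the standard RootSIFT transform (L1-normalize each descriptor, then take the element-wise square root) to descriptors that are assumed to have been extracted already. The function name `root_sift` and the `eps` parameter are hypothetical choices for this example; CLAHE preprocessing, FLANN matching, and DBSCAN clustering are not shown, and nothing here is taken from the authors' code.

```python
import numpy as np


def root_sift(descriptors, eps=1e-7):
    """Convert SIFT descriptors to RootSIFT.

    Assumes `descriptors` is an (n, d) array of non-negative SIFT values.
    Each row is L1-normalized, then the element-wise square root is taken,
    so every output row has (approximately) unit L2 norm.
    """
    desc = np.asarray(descriptors, dtype=np.float64)
    # L1 norm of each descriptor row; eps guards against all-zero rows.
    l1 = np.abs(desc).sum(axis=1, keepdims=True)
    desc = desc / (l1 + eps)
    return np.sqrt(desc)
```

After this transform, Euclidean distance between two descriptors corresponds to the Hellinger kernel on the original SIFT histograms, which is the usual motivation for applying it before nearest-neighbor matching.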

Pipeline 3 implements the LeJEPA‑enhanced solution. Images are passed through a backbone encoder; the resulting embeddings are L2‑normalized and scaled by √(embedding_dim). SIGReg is then enforced: for each cluster the mean μ and covariance Σ are computed, and the cluster is accepted only if the eigenvalue ratio λ_max(Σ)/λ_min(Σ) < 10 and the mean norm ‖μ‖ < 1.0. This constraint forces the embeddings to occupy a near‑spherical region, making inner‑product‑based similarity directly interpretable as a probability under isotropic Gaussian assumptions. Two similarity measures are introduced: (1) Gaussian Cosine Similarity, s_ij = ½·(1 + (z_i^T z_j)/(‖z_i‖‖z_j‖)), derived from the fact that inner products of isotropic Gaussians correlate with common origin probability; and (2) Characteristic‑Function Matching, s_ij = 1 − |ϕ_i(t) − ϕ_j(t)|, where ϕ_i(t)=E
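The normalization, cluster acceptance test, and Gaussian Cosine Similarity described for Pipeline 3 can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, embeddings are assumed to arrive as an (n, d) array with n comfortably larger than d, and the thresholds (eigenvalue ratio below 10, mean norm below 1.0) are the ones quoted in the summary. Characteristic‑Function Matching is omitted.

```python
import numpy as np


def normalize_embeddings(z):
    """L2-normalize each row, then scale by sqrt(embedding_dim)."""
    z = np.asarray(z, dtype=np.float64)
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return z / np.maximum(norms, 1e-12) * np.sqrt(z.shape[1])


def accept_cluster(z, max_eig_ratio=10.0, max_mean_norm=1.0):
    """Accept a cluster if lambda_max/lambda_min < 10 and ||mu|| < 1.0."""
    z = np.asarray(z, dtype=np.float64)
    mu = z.mean(axis=0)
    cov = np.cov(z, rowvar=False)      # (d, d) sample covariance
    eig = np.linalg.eigvalsh(cov)      # eigenvalues in ascending order
    lam_min, lam_max = eig[0], eig[-1]
    if lam_min <= 0.0:
        # Degenerate covariance: treated as rejected (an assumption here).
        return False
    ratio_ok = lam_max / lam_min < max_eig_ratio
    mean_ok = np.linalg.norm(mu) < max_mean_norm
    return bool(ratio_ok and mean_ok)


def gaussian_cosine_similarity(z):
    """Pairwise s_ij = 0.5 * (1 + cos(z_i, z_j)), mapped into [0, 1]."""
    z = np.asarray(z, dtype=np.float64)
    zn = z / np.maximum(np.linalg.norm(z, axis=1, keepdims=True), 1e-12)
    return 0.5 * (1.0 + zn @ zn.T)
```

Rejecting clusters with a non-positive smallest eigenvalue is a design choice of this sketch; the summary does not say how degenerate covariances (for example, clusters with fewer images than embedding dimensions) are handled.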
