HOT-POT: Optimal Transport for Sparse Stereo Matching
Stereo vision between images faces a range of challenges, including occlusions, motion, and camera distortions, across applications in autonomous driving, robotics, and face analysis. Due to parameter sensitivity, further complications arise for stereo matching with sparse features, such as facial landmarks. To overcome this ill-posedness and enable unsupervised sparse matching, we consider line constraints of the camera geometry from an optimal transport (OT) viewpoint. Formulating camera-projected points as (half)lines, we propose the use of the classical epipolar distance as well as a 3D ray distance to quantify matching quality. Employing these distances as a cost function of a (partial) OT problem, we arrive at efficiently solvable assignment problems. Moreover, we extend our approach to unsupervised object matching by formulating it as a hierarchical OT problem. The resulting algorithms allow for efficient feature and object matching, as demonstrated in our numerical experiments. Here, we focus on applications in facial analysis, where we aim to match distinct landmarking conventions.
💡 Research Summary
Abstract (English):
The paper introduces a framework for unsupervised matching of sparse features, particularly facial landmarks, across stereo image pairs. By interpreting camera-projected points as (half-)lines, the authors define two geometric distance measures: the classical epipolar distance and a 3-D ray distance. These distances serve as the cost matrix of a (partial) optimal transport (OT) problem, which can be solved efficiently with modern OT solvers (e.g., Sinkhorn iterations). To handle whole-object correspondence (e.g., matching entire faces composed of many landmarks), the authors embed the point-level OT into a hierarchical OT (HOT) formulation. Experiments on simulated data and real RGB-thermal facial images demonstrate that the ray-distance-based OT outperforms epipolar-distance-based OT, especially when landmark conventions differ between modalities. The hierarchical approach further yields robust face-to-face matching in multi-person scenarios.
1. Introduction and Motivation
Stereo vision is essential for depth reconstruction in autonomous driving, robotics, and facial analysis. Traditional dense‑pixel matching fails under illumination changes, occlusions, or cross‑modal settings (e.g., RGB vs. thermal). Sparse features such as facial landmarks are attractive because they are compact and semantically meaningful, but they suffer from high sensitivity to detector noise, occlusions, and differing landmark conventions across modalities. The authors argue that optimal transport provides a principled relaxation of nearest‑neighbor matching, allowing probabilistic assignments that can handle mismatched cardinalities and outliers. Recent advances in fast OT solvers (Sinkhorn, sliced OT, partial OT) make this approach computationally feasible.
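To make the idea of OT as a soft relaxation of nearest-neighbor matching concrete, the following is a minimal sketch of entropic OT via Sinkhorn iterations in plain NumPy. It is not the authors' implementation: the function name `sinkhorn`, the regularization strength `eps`, the iteration count, and the toy cost matrix are all assumptions chosen for illustration.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.05, n_iter=500):
    """Entropic OT plan between histograms a and b (illustrative sketch)."""
    K = np.exp(-cost / eps)      # Gibbs kernel built from the cost matrix
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)          # rescale rows towards marginal a
        v = b / (K.T @ u)        # rescale columns towards marginal b
    return u[:, None] * K * v[None, :]

# Hypothetical toy cost: 3 source features vs. 3 target features.
# The cheap pairs are (0, 1), (1, 2) and (2, 0).
cost = np.array([[0.90, 0.10, 0.80],
                 [0.70, 0.90, 0.05],
                 [0.10, 0.80, 0.90]])
a = np.full(3, 1.0 / 3.0)        # uniform mass on the source features
b = np.full(3, 1.0 / 3.0)        # uniform mass on the target features

plan = sinkhorn(cost, a, b)
matches = plan.argmax(axis=1)    # hard assignment read off the soft plan
print(np.round(plan, 3))
print(matches)
```

Each row of `plan` is a soft assignment of one source feature over all targets; taking the row-wise argmax recovers a hard matching. Handling outliers or unequal numbers of features would require a partial OT variant, which this sketch does not cover.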
2. Geometric Modeling and Distance Functions
The paper adopts a standard pinhole camera model with intrinsic matrices \(K_\ell\) and \(K_r\), an extrinsic rotation \(R\), and a translation \(t\). A 3-D point \(w\) projects to homogeneous image points \(x\) and \(y\) in the two views via the corresponding camera projections.
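As a rough illustration of the pinhole model and the classical epipolar distance, the sketch below assumes the common convention that the first camera frame coincides with the world frame, so that \(x \simeq K_\ell w\) and \(y \simeq K_r (R w + t)\) up to scale; the paper's own convention may differ. The camera values, the helper names (`skew`, `project`, `epipolar_distance`), and the one-sided point-to-line form of the distance are all assumptions made for this example.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def project(K, p):
    """Pinhole projection of a 3-D point p (camera frame) to homogeneous pixels."""
    q = K @ p
    return q / q[2]

def epipolar_distance(F, x, y):
    """Pixel distance from y to the epipolar line F @ x in the second image."""
    line = F @ x
    return abs(y @ line) / np.hypot(line[0], line[1])

# Hypothetical calibration (assumed values, not taken from the paper).
K_l = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
K_r = np.array([[750.0, 0.0, 300.0], [0.0, 750.0, 250.0], [0.0, 0.0, 1.0]])
theta = np.deg2rad(5.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([-0.2, 0.0, 0.01])

# Fundamental matrix for x ~ K_l w and y ~ K_r (R w + t): y^T F x = 0.
F = np.linalg.inv(K_r).T @ skew(t) @ R @ np.linalg.inv(K_l)

# Three hypothetical 3-D points in front of both cameras.
points = np.array([[0.10, -0.05, 2.0],
                   [-0.30, 0.20, 2.5],
                   [0.25, 0.10, 1.8]])
xs = [project(K_l, w) for w in points]
ys = [project(K_r, R @ w + t) for w in points]

# Cost matrix of epipolar distances between all pairs (x_i, y_j).
cost = np.array([[epipolar_distance(F, x, y) for y in ys] for x in xs])
print(np.round(cost, 3))
```

True correspondences lie on each other's epipolar lines, so the diagonal of `cost` vanishes up to rounding, while mismatched pairs generally incur a positive distance. A matrix of this kind is what would be passed as the cost to an OT or assignment solver.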