Probabilistic Permutation Synchronization using the Riemannian Structure of the Birkhoff Polytope
Introduction
Correspondences fuel a large variety of computer vision applications such as structure-from-motion (SfM) , SLAM , 3D reconstruction , camera re-localization , image retrieval and 3D scan stitching . In a typical scenario, given two scenes, an initial set of 2D/3D keypoints is first identified. Then the neighborhood of each keypoint is summarized with a local descriptor and keypoints in the given scenes are matched by associating the mutually closest descriptors. In a majority of practical applications, multiple images or 3D shapes are under consideration, and ascertaining such two-view or pairwise correspondences is simply not sufficient. This necessitates a further refinement ensuring global consistency. Unfortunately, at this stage even well developed pipelines either settle for heuristic/greedy refinement or incorporate costly geometric cues to link individual correspondence estimates into a globally coherent whole .
In this paper, by using the fact that correspondences are cycle consistent, we propose two novel algorithms: one for refining the assignments across multiple images/scans (nodes) in a multi-way graph, and one for estimating assignment confidences. We model the correspondences between image pairs as relative, total permutation matrices and seek to find absolute permutations that re-arrange the detected keypoints to a single canonical, global order. This problem is known as map or permutation synchronization . Even though in many practical scenarios matches are only partially available, when shapes are complete and the density of matches increases, total permutations can suffice .
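As a concrete illustration of cycle consistency, consider the following toy numpy sketch (names of our own choosing, not code from the paper): if absolute permutations $`\X_i`$ exist, the relative permutations $`\Pm_{ij} = \X_i \X_j^\top`$ they induce compose consistently along every cycle in the graph.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 4  # keypoints per node, number of nodes

def random_perm(n, rng):
    """Random n x n permutation matrix."""
    return np.eye(n)[rng.permutation(n)]

# Absolute permutations: each node's keypoints w.r.t. a canonical order.
X = [random_perm(n, rng) for _ in range(m)]

# Relative permutations induced by the absolute ones.
P = {(i, j): X[i] @ X[j].T for i in range(m) for j in range(m)}

# Cycle consistency: P_ij @ P_jk == P_ik for every triplet (i, j, k).
for i in range(m):
    for j in range(m):
        for k in range(m):
            assert np.allclose(P[i, j] @ P[j, k], P[i, k])
```

Synchronization inverts this construction: given noisy relative matrices, it recovers absolute ones that best explain them.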
Similar to many well received works , we relax the sought permutations to the set of doubly-stochastic (DS) matrices. We then consider the geometric structure of DS matrices, the Birkhoff Polytope . To the best of our knowledge, we are the first to introduce and apply the recently developed Riemannian geometry of the Birkhoff Polytope to tackle challenging problems of computer vision. Note that the lack of this geometric understanding has caused plenty of obstacles for scholars dealing with our problem . By virtue of a first order retraction, we can use the recent Riemannian limited-memory BFGS (LR-BFGS) algorithm to perform maximum-a-posteriori (MAP) estimation of the parameters of the consistency loss. We coin our variation Birkhoff-LRBFGS.
At the next stage, we take on the challenge of confidence/uncertainty estimation for the problem at hand by drawing samples on the Birkhoff Polytope and estimating the empirical posterior distribution. To achieve this, we first formulate a new geodesic stochastic differential equation (SDE). Our SDE is based upon the Riemannian Langevin Monte Carlo (RLMC) that is efficient and effective in sampling from Riemannian manifolds with true exponential maps. Note that similar stochastic gradient geodesic MCMC (SG-MCMC) tools have already been used in the context of synchronization of spatial rigid transformations, whose parameters admit an analytically defined geodesic flow . Unfortunately, for our manifold the retraction map is only of first order and hence we cannot use off-the-shelf schemes. To alleviate this nuisance, we further contribute a novel numerical integrator that solves our SDE by replacing the intractable exponential map of DS matrices with the approximate retraction map. This leads to another new algorithm: Birkhoff-RLMC.
In a nutshell, our contributions are:
-
We function as an ambassador and introduce the Riemannian geometry of the Birkhoff Polytope to solve problems in computer vision.
-
We propose a new probabilistic model for the permutation synchronization problem.
-
We minimize the cycle consistency loss via a Riemannian-LBFGS algorithm and outperform the state-of-the-art both in recall and in runtime.
-
Based upon the Langevin mechanics, we introduce a new SDE and a numerical integrator to draw samples on the high dimensional and complex manifolds with approximate retractions, such as the Birkhoff Polytope. This lets us estimate the confidence maps, which can aid in improving the solutions and spotting consistency violations or outliers.
Note that the tools developed herewith can easily extend beyond our application and will hopefully facilitate promising research directions for combinatorial optimization problems in computer vision.
Riemannian Limited Memory BFGS
We now explain the R-LBFGS optimizer used in our work. To begin with, we recall the foundational BFGS , a quasi-Newton method operating in the Euclidean space. We then review the simple Riemannian descent algorithms that employ line search. Finally, we explain the R-BFGS used to solve our synchronization problem. R-BFGS can be modified slightly to arrive at the desired limited-memory Riemannian BFGS solver.
Euclidean BFGS
The idea is to approximate the true Hessian $`\mathbf{H}`$ with $`{\bm{\B}}`$, using updates specified by the gradient evaluations. We will then transition from this Euclidean line-search method to Riemannian optimizers. For clarity, in Alg. 1 we summarize the Euclidean BFGS algorithm, which computes $`{\bm{\B}}`$ by using the most recent values of the past iterates. Note that many strategies exist to initialize $`{\bm{\B}}_0`$, a common choice being the scaled identity matrix $`{\bm{\B}}_0=\gamma\mathbf{I}`$. Eq. [eq:bfgs] corresponds to the particular BFGS update rule. In the limited-memory successor of BFGS, L-BFGS, the Hessian matrix $`\mathbf{H}`$ is instead approximated up to a pre-specified rank in order to achieve linear time- and space-complexity in the dimension of the problem.
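For illustration, a minimal Euclidean BFGS in the inverse-Hessian form can be sketched as follows. This is our own toy implementation with a backtracking line search, not a transcription of Alg. 1:

```python
import numpy as np

def bfgs(f, grad, x0, iters=50, tol=1e-8):
    """Minimal Euclidean BFGS with backtracking (Armijo) line search.
    Maintains Hinv, an approximation to the inverse Hessian."""
    x = x0.astype(float)
    Hinv = np.eye(len(x))                 # Hinv_0 = gamma * I with gamma = 1
    g = grad(x)
    for _ in range(iters):
        if np.linalg.norm(g) < tol:
            break
        d = -Hinv @ g                     # quasi-Newton descent direction
        t = 1.0
        while t > 1e-12 and f(x + t * d) > f(x) + 1e-4 * t * (g @ d):
            t *= 0.5
        s = t * d
        x_new = x + s
        y = grad(x_new) - g
        rho = 1.0 / (y @ s)
        if np.isfinite(rho) and rho > 0:  # curvature condition
            I = np.eye(len(x))
            Hinv = (I - rho * np.outer(s, y)) @ Hinv @ (I - rho * np.outer(y, s)) \
                   + rho * np.outer(s, s)
        x, g = x_new, grad(x_new)
    return x

# Example: minimize the convex quadratic f(x) = 0.5 x^T A x - b^T x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = bfgs(lambda x: 0.5 * x @ A @ x - b @ x,
              lambda x: A @ x - b, np.zeros(2))
assert np.allclose(x_star, np.linalg.solve(A, b), atol=1e-6)
```

The update shown is the standard inverse-Hessian BFGS formula; L-BFGS replaces the stored matrix by the last few $`(s, y)`$ pairs.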
Riemannian Descent and Line Search
The standard descent minimizers can be extended to operate on Riemannian manifolds $`\Man`$ using the geometry of the parameters. A typical Riemannian update can be characterized as:
\begin{align}
\x_{k+1} = R_{\x_k}(\tau_k \bm{\eta}_k)
\end{align}
where $`R`$ is the retraction operator, i.e. the smooth map from the tangent bundle $`\T\Man`$ to $`\Man`$, and $`R_{\x_k}`$ is the restriction of $`R`$ to $`\T_{\x_k}`$, the tangent space of the current iterate $`\x_k`$, $`R_{\x_k}:\T_{\x_k}\rightarrow \Man`$. The descent direction is defined to be on the tangent space, $`\bm{\eta}_k\in \T_{\x_k}\Man`$. When manifolds are rather simple shapes, the step size $`\tau_k`$ can be a fixed value. However, for most matrix manifolds, some form of line search is preferred to compute $`\tau_k`$. The retraction operator is used to take steps on the manifold and is usually derived analytically. When this analytic map is also length-preserving, it is the true exponential map. However, such exactness is not a requirement for the optimizer and, as happens in the case of doubly stochastic matrices, $`R_{\x_k}`$ only needs to be an approximate retraction, e.g. of first or second order. In fact, for a map to be a valid retraction, it is sufficient to satisfy the following conditions:
-
$`R`$ is continuously differentiable.
-
$`R_{\x}(\zero_{\x})=\x`$, where $`\zero_{\x}`$ denotes the zero element of $`\T_{\x}\Man`$. This is called the centering property.
-
The curve $`\gamma_{\bm{\eta}_{\x}}(\tau)=R_{\x}(\tau \bm{\eta}_{\x})`$ satisfies:
\begin{align} \frac{d\gamma_{\bm{\eta}_{\x}}(\tau)}{d\tau}\Bigr\rvert_{\tau = 0}=\bm{\eta}_{\x}, \, \forall \bm{\eta}_{\x} \in \T_{\x}\Man. \end{align}This is called the local rigidity property.
Considering all these, we provide, in Alg. 2, a general Riemannian descent algorithm, which can be customized by the choice of the direction, retraction and the means to compute the step size. Such a concept of minimizing by walking on the manifold is visualized in Fig. [fig:torus].
It is possible to develop new minimizers by making particular choices for the Riemannian operators in Alg. 2. We now review the Armijo variant of Riemannian gradient descent , a common and probably the simplest choice for an accelerated optimizer on the manifold. Many other line-search conditions, such as Barzilai-Borwein or strong Wolfe , can also be used. The pseudocode for this backtracking version is given in Alg. 3. Note that the Riemannian gradient $`\text{grad}f(\x)`$ is simply obtained by projecting the Euclidean gradient $`\nabla f(\x)`$ onto the manifold $`\Man`$, and the next iterate is obtained through a line search satisfying the Armijo condition , tweaked to use the retraction $`R`$ for taking steps. For a predefined Armijo step size, this algorithm is guaranteed to converge for all retractions .
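As an illustration of this scheme on a simple manifold, the following sketch (our own toy example, not from the paper) runs Riemannian gradient descent with Armijo backtracking on the unit sphere, using the valid retraction $`R_{\x}(\bm{\eta}) = (\x+\bm{\eta})/\|\x+\bm{\eta}\|`$:

```python
import numpy as np

def sphere_armijo_gd(f, egrad, x0, iters=500, beta=0.5, sigma=1e-4):
    """Riemannian gradient descent with Armijo backtracking on the unit
    sphere S^{n-1}.  Retraction: R_x(eta) = (x + eta) / ||x + eta||."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        g = egrad(x)
        rg = g - (x @ g) * x                 # project onto tangent space T_x
        if np.linalg.norm(rg) < 1e-10:
            break
        eta, t = -rg, 1.0
        # Armijo condition, with the retraction used to take steps
        while t > 1e-10 and f((x + t * eta) / np.linalg.norm(x + t * eta)) \
                > f(x) - sigma * t * (rg @ rg):
            t *= beta
        x = (x + t * eta) / np.linalg.norm(x + t * eta)
    return x

# Example: minimize the Rayleigh quotient x^T C x on the sphere; the
# minimizer is an eigenvector of the smallest eigenvalue of C.
C = np.diag([3.0, 2.0, 0.5])
x = sphere_armijo_gd(lambda x: x @ C @ x, lambda x: 2 * C @ x,
                     np.array([1.0, 1.0, 1.0]))
assert abs(x @ C @ x - 0.5) < 1e-4
```

The same template applies to the Birkhoff Polytope once its tangent projection and retraction are substituted.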
LR-BFGS for Minimization on $`\DP_n`$
Based upon the ideas developed up to this point, we present the Riemannian variant of the L-BFGS. The algorithm is mainly based on Huang and we refer the reader to that seminal work for more details . Similar to the Euclidean case, we begin by summarizing a BFGS algorithm, with the difference that it is suited to solving the synchronization task. This time, $`\bm{\B}`$ will approximate the action of the Hessian on the tangent space $`\T_{\x_k}\Man`$. Generalizing any (quasi-)Newton method to the Riemannian setting thus requires computing the Riemannian Hessian operator, or its approximation. This necessitates taking some sort of directional derivative of a vector field. As vector fields belonging to different tangent spaces cannot be compared immediately, one needs the notion of a connection $`\Gamma`$, generalizing the directional derivative of a vector field. This connection is closely related to the concept of vector transport $`T: \T\Man \otimes \T\Man \rightarrow \T \Man`$, which allows moving from one tangent space to another: $`T_{\bm{\eta}}(\bm{\zeta}): (\bm{\eta},\bm{\zeta})\in\T_\x\Man \rightarrow \T_{R_{\x}(\bm{\eta})}\Man \,`$. For the first order treatment of the Birkhoff Polytope, this vector transport takes a simple form:
\begin{equation}
T_{\bm{\eta}}(\bm{\zeta}) \in \T_{R_{\x}(\bm{\eta})}\DP_n \triangleq \Pi_{R_{\x}(\bm{\eta})}(\bm{\zeta}),
\end{equation}
where $`\Pi_{\X}(\cdot)`$ is the projection onto the tangent space of
$`\X`$ as defined in the main paper. We give the pseudocode of the
R-BFGS algorithm in Alg. 4 below. To ensure Riemannian L-BFGS always
produces a descent direction, it is necessary to adapt a line-search
algorithm which satisfies strong Wolfe (SW) conditions . Roptlib does
implement the SW while ManOpt uses the simpler Armijo conditions, even
for the LR-BFGS. Another simple possibility is to take the steps on the
Manifold using the retraction, while performing the line search in the
Euclidean space. This results in a projected, approximate line search,
but can terminate quite quickly, making it possible to exploit existing
Euclidean space SW LBFGS solvers such as Ceres . Without delving into
rigorous proofs, we provide one such implementation at
github.com/tolgabirdal/MatrixManifoldsInCeres where
several matrix manifolds are considered. We leave the analysis of such
approximations for a future study and note that the results in the paper
are generated by limited memory form of the R-BFGS procedure summarized
under Alg. 4. While many modifications do exist, LR-BFGS, in essence, is
an approximation to R-BFGS, where $`O(MN^2)`$ storage of the full
Hessian is avoided by unrolling the RBFGS descent computation and using
the last $`m`$ input and gradient differences, where $`m`$ is kept small.

Proof (of Proposition 1). Consider the ray cast outwards from the origin:
$`\mathbf{r} = \mathbb{R}_{\geq 0} \mathbbm{1}`$. $`\mathbf{r}`$ exits
$`\DP_n`$ at $`\tfrac{1}{n} \mathbbm{1}`$ but exits the sphere
$`\Scal^{n-1}`$ at $`\tfrac{1}{\sqrt{n}} \mathbbm{1}`$. This creates a
gap between the two manifolds whose ratio grows to $`\infty`$ as $`n`$
grows. From an optimization perspective, consider the linear function
$`\lambda(\Pm) = \sum_{i,j} P_{ij}`$. $`\lambda(\Pm)`$ is minimized on
$`\DP_n`$ at $`\tfrac{1}{n} \mathbbm{1}`$ and on the sphere at
$`\tfrac{1}{\sqrt{n}} \mathbbm{1}`$. ◻

We now look at the restricted orthogonal matrices and seek to find the
difference between optimizing a linear functional on them and $`\DP_n`$.
Note that minimizing functionals is ultimately what we are interested in
as the problems we consider here are formulated as optimization. Let
$`\mathcal{A}`$ be the $`(n-1)\times(n-1)`$-dimensional affine space of
matrices whose rows and columns sum to 1: doubly stochastic but with no
condition on the signs, i.e., generalized doubly stochastic matrices. The
Birkhoff Polytope $`\DP_n`$ is contained in $`\mathcal{A}`$, whereas the
orthogonal group $`\Og_n`$ is not. So, there exists an affine functional
$`\lambda`$ that is minimized, with $`\lambda=0`$, at a point on $`\DP_n`$ but
not on $`\Og_n`$. Let $`\mathcal{AO}_n=\DP_n \cap \Og_n`$, a further
restricted manifold. This time, unlike the case of $`\DP_n`$,
$`\mathcal{AO}_n`$ would not coincide with the permutations $`\mathcal{P}`$
due to the negative elements. In fact,
$`\mathcal{P}\subset\mathcal{AO}_n`$.

The ratio between the time a ray leaves $`\DP_n`$ and the same ray
leaves $`\mathcal{AO}_n`$ can be as large as $`n-1`$.

Proof. Consider the line through $`\frac{1}{n}\mathbbm{1}`$ and
$`\Id`$ : $`l(x) = \frac{1+x}{n}\mathbbm{1} - x\Id`$. Such a ray leaves
$`\DP_n`$ at $`x=\frac{1}{n-1}`$ and for all $`x\in [-1,1]`$ is in the
convex hull of $`\mathcal{AO}_n`$. When $`x=1`$, it is an orthogonal
matrix, on $`\Og_n`$. Hence, the ray can be on $`\mathcal{AO}_n`$ for
$`n-1`$ times as long as in $`\DP_n`$. This quantity also grows to
infinity as $`n\rightarrow \infty`$. If the same analysis is done for the
case of the sphere, the ratio is found to be $`(n-1)^{3/2}`$, which is larger
and grows to infinity more quickly. ◻

Proof (of Proposition 2). The conclusion of the proposition states that $`p(\X)`$ and
$`p(\Pm|\X)`$ have the following form, with the normalizing constants $`Z_{ij}`$ defined below. Our goal is to verify that the following holds with the above
definitions, where $`Z`$ is a positive constant (cf. Section 4 in the main paper).
Then, we can easily verify this equality as follows, where we used the fact that $`\|\Pm_{ij}\|_{\mathrm{F}}^2 = n`$ in
Equation [eqn:interm] since
$`\Pm_{ij} \in \mathcal{P}_n`$ and $`Z = C \exp(-\beta n |\Edge|)`$. In the rest of the proof, we will characterize the normalizing constant
$`Z_{ij} = \int_{\mathcal{P}_n} \exp(2\beta \mathrm{tr}(\Pm_{ij}^\top \X_{ij})) \mathrm{d} \Pm_{ij}`$.
Here, we denote the counting measure on $`\mathcal{P}_n`$ as
$`\mathrm{d} \Pm_{ij}`$. We start by decomposing $`\X_{ij}`$ via the Birkhoff–von Neumann theorem. Then, we have the following, where we used Jensen’s inequality in
[eqn:jensen]. Here, we observe that
$`\int_{\mathcal{P}_n} 2\beta \theta_{ij,b} \mathrm{tr}(\Pm_{ij}^\top \mathbf{M}_{ij,b})\mathrm{d} \Pm_{ij}`$
is similar to the normalization constant of a Mallows model , and it
only depends on $`\beta`$ and $`\theta_{ij,b}`$. Hence, we conclude that where $`f`$ is an increasing function of $`\beta, \theta_{ij,b}`$ since
$`\mathrm{tr}(\Pm_{ij}^\top \mathbf{M}_{ij,b}) \geq 0`$. This completes
the proof. ◻

We start by recalling the SDE given in Equation [eqn:sdesupp]. The correction term $`\boldsymbol{\Gamma}_t`$ is known to take the form given below. By using this identity in
Equation [eqn:sdesupp], we obtain the next equation. By using a similar notation, we rewrite the above equation as
follows: where $`\mathcal{N}`$ denotes the Gaussian distribution and
$`[\mathbf{M}]_{ij} = \partial [\X]_{ij} / \partial[\xe]_{ij}`$. We
multiply each side of the above equation by $`\mathbf{M}`$ and use the
property $`\nabla_{\xe} = \mathbf{M}^\top \nabla_{\X}`$ , which yields: Here we used the area formula (Equation 11 in the main paper) and the
fact that $`\mathbf{G} = \mathbf{M}^\top \mathbf{M}`$. The term $`\mathbf{M}(\mathbf{M}^\top\mathbf{M})^{-1}\mathbf{M}^\top`$
turns out to be the projection operator onto the tangent space of $`\X`$
. Therefore, the usual geodesic integrator would consist of the following steps at iteration $`k`$:

-

Set a small step size $`h`$.

-

Compute the term $`- h\nabla_{\X}U(\X_t) + \sqrt{2h/\beta} \mathbf{Z}`$, with $`\mathbf{Z} \sim \mathcal{N}(0, \mathbf{I})`$.

-

Obtain the ‘direction’ by projecting the result of the previous step on the tangent space.

-

Move the current iterate on the geodesic determined by the direction that was obtained in the previous step.

Unfortunately, the last step of the integrator above cannot be computed in the case of $`\DP_n`$. Therefore, we replace it with moving the current iterate by using the retraction operator, which yields the update equations given in the main paper.

We now give further details on the synthetic data used in the paper. We
synthesize our data by displacing a $`4\times 4`$ 2D grid to simulate 5
different images and scrambling the order of correspondence. Shown in
Fig. [fig:synth]a, the ground truth
corresponds to an identity ordering where the $`i^{th}`$ element of each
grid is mapped to $`i^{th}`$ node of the other. Note that the graph is
fully connected but we show only consecutive connections. To simulate
noisy data, we introduce $`16`$ random swaps into the ground truth as
shown in Fig. [fig:synth]b. We also randomize the
initialization in a similar way. On this data we first run a baseline
method where instead of restricting ourselves to the Birkhoff Polytope,
we use the paths of the orthogonal group $`O(N)`$. Our second baseline
is MatchEIG, the method of Maset et al. . Note that, regardless of the
initialization (Fig. [fig:synth]e,f), our method is capable of
arriving at a visually more satisfactory local minimum. This validates
that for complex problems such as the one at hand, respecting the
geometry of the constraints is crucial. Our algorithm is successful at
that and hence is able to find high quality solutions. In the figure the
more parallel the lines are, the better, depicting closeness to the
ground truth. We now show matches visualized after running our synchronization on the
Willow dataset . For the sake of space, we have omitted some of those
results from the paper.
Fig [fig:realsup] plots our findings, where
our algorithm is able to converge from challenging initial matches. Note
that these results only show the MAP estimates explained in the main
paper. It is possible to extend our approach with some form of geometric
constraints as in Wang et al. . We leave this as future work.

There are not many well-accepted standards for evaluating
uncertainty. Hence, in the main paper, we resorted to what we refer to as the
top-K error. We will now briefly supply more details on this. The
purpose of the experiment is to measure the quality of the sample
proposals generated by our Birkhoff-RLMC. To this end, we introduce a
random sampler that generates $`K`$ arbitrary solutions that are highly
likely to be irrelevant (bad proposals). These solutions alone do not
solve the problem. However, if we were to consider the result correct
whenever either the found optimum or a random proposal contains the
correct match, i.e. append the additional $`K-1`$ samples to the solution
set, then even with a random sampler we are guaranteed to increase the
recall. Similarly, samples from Birkhoff-RLMC will also yield higher
recall. The question then is: Can Birkhoff-RLMC do better than a random
sampler? To give an answer, we basically record the relative improvement
in recall both for the random sampler and Birkhoff-RLMC. The top-K
error for different choices of $`K`$ is what we presented in Table 2 of
the main paper. We further visualize these solutions in the columns (c)
to (f) of the same table. To do so, we simply retain the assignments
with the $`K`$-highest scores in the solutions $`\X`$. Note that the
entire $`\X`$ acts as a confidence map itself (Column d). We then use
the MATLAB command imagesc on the $`{\X}`$ with the jet colormap. Next, we provide insights into how the sampler works in practice.
Fig. 1 plots the objective value attained
as the iterations of the Birkhoff-RLMC sampler proceed. For different
values of $`\beta`$ these plots look different. Here we use
$`\beta=0.08`$ and show both on Motorbike and Winebottle samples (used
above) and for 1000 iterations, the behaviour of sample selection. In
the paper, we have accumulated these samples and estimated the
confidence maps. Note that occasionally, the sampler can discover better
solutions than the one being provided. This is due to two reasons: 1)
rarely, we could jump over local minima; 2) the initial solution is a
discrete one, and it is often possible for a doubly stochastic
(relaxed) solution matrix to have lower cost.

We now apply our algorithm to solve correspondence problems on isolated
3D shapes provided by the synthetic Tosca dataset that is composed of
non-rigidly deforming 3D meshes of various objects such as cat, human,
Centaur, dog, wolf and horse. The cat object from this dataset along
with ten of its selected ground truth correspondences is shown in
Fig. [fig:tosca]. This problem is of great interest
to the 3D vision community, and the availability of the complete
shapes plays well with the assumptions of our algorithm, i.e.
permutations are total. It is important to mention that when initial
correspondences are good ($`\sim 80\%`$) all methods, including ours can
obtain $`100\%`$ accuracy . Therefore, to be able to put our method to
test, we will intentionally degrade the initial correspondence
estimation algorithm we use.

Initialization. Analogous to the 2D scenario, we obtain the initial
correspondences by running a Siamese PointNet-like network regressing
the assignments between two shapes. Unlike 2D, we do not need to take
care of occlusions, visibility or missing keypoint locations as 3D
shapes can be full and clean as in this dataset. On average, we
split half of the dataset for training and the other half for testing.
Note that such an amount of data is insufficient to train this
network, resulting in suboptimal predictions. We gradually downsample
the point sets uniformly to sizes of $`N_s=\{200, 50, 20\}`$ to get the
keypoints. At this point, it is possible to use sophisticated keypoint
prediction strategies to establish the optimum cardinality and the set
of points under correspondence. Note that it is sufficient to compute
the keypoints per shape as we do not assume the presence of a particular
order. Each keypoint is matched to another by the network prediction.
The final stage of the prediction results in a soft assignment matrix on
which we apply the Hungarian algorithm to obtain the initial matches.

Preliminary Results. We run, on these initial sets of
correspondences, our algorithm and the state of the art methods as
described in the main paper. We count and report the percentage of
correctly matched correspondences (recall) in
Tab. 1 for only two objects, Cat and
Michael. Running on the full dataset is left as future work. It is seen that
our algorithm consistently outperforms the competing approaches. The
difference is less visible when the number of samples starts to dominate
the number of edges. This is because none of the algorithms are able to
find enough consistency votes to correct for the errors.
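For reference, the Hungarian discretization step used in the initialization above can be sketched as follows. SciPy's implementation is our choice (not necessarily the one used in the paper), and the score matrix is hypothetical:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 5

# A soft assignment matrix, e.g. a network's matching scores.
S = rng.random((n, n))

# Hungarian algorithm: maximize the total score over permutations.
rows, cols = linear_sum_assignment(S, maximize=True)
P = np.zeros((n, n))
P[rows, cols] = 1.0           # hard (permutation) matches

# P is a valid permutation matrix.
assert np.allclose(P.sum(axis=0), 1) and np.allclose(P.sum(axis=1), 1)
```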
In this work we have proposed two new frameworks for relaxed permutation
synchronization on the manifold of doubly stochastic matrices. Our novel
model and formulation paved the way to using sophisticated optimizers
such as Riemannian limited-memory BFGS. We further integrated a
manifold-MCMC scheme enabling posterior sampling and thereby confidence
estimation. We have shown that our confidence maps are informative about
cycle inconsistencies and can lead to new solution hypotheses. We used
these hypotheses in a top-K evaluation and illustrated its benefits. In
the future, we plan to (i) address partial permutations, i.e., the inner
region of the Birkhoff Polytope; (ii) investigate more sophisticated MCMC
schemes; and (iii) seek better use cases for our confidence estimates,
such as outlier removal.

This work was supported by the grant ANR-16-CE23-0014 (FBIMATRIX). The authors thank
Haowen Deng for the initial 3D correspondences, and Benjamin Busam and
Jesus Briales for fruitful discussions.

Proof of Proposition 1 and Further Analysis
Proof of Proposition 2
\begin{align}
p(\X) =& \frac1{C} \exp \Bigl(-\beta \sum\limits_{(i,j)\in \Edge}\|\X_{ij}\|_{\mathrm{F}}^2 \Bigr) \prod\limits_{(i,j) \in \Edge} Z_{ij} \\
p(\Pm | \X) =& \prod_{(i,j)\in \Edge} p(\Pm_{ij} | \X_i, \X_j) \\
=& \prod_{(i,j)\in \Edge} \exp\Bigl( 2\beta\>\mathrm{tr}(\Pm_{ij}^\top \X_{ij}) \Bigr) \frac1{Z_{ij}} \\
=& \exp\Bigl( \beta \sum_{(i,j)\in \Edge} 2 \mathrm{tr}(\Pm_{ij}^\top \X_{ij}) \Bigr) \prod_{(i,j)\in \Edge}\frac1{Z_{ij}}
\end{align}
\begin{align}
Z_{ij} := \prod\limits_{b=1}^{B_{ij}} Z_\X(\beta, \theta_{ij,b}). \label{eqn:normconst}
\end{align}
\begin{align}
p(\Pm,\X) = \frac1{Z} \exp\Bigl(-\beta \sum_{(i,j)\in \Edge} \|\Pm_{ij}- \X_i \X_j^\top \|_{\mathrm{F}}^2 \Bigr) = \frac1{Z} \prod_{(i,j)\in \Edge} \psi(\Pm_{ij},\X_i,\X_j),
\end{align}
\begin{align}
p(\Pm,\X) &= p(\Pm|\X) p(\X) \\
&= \frac1{C} \exp \Bigl(-\beta \sum\limits_{(i,j)\in \Edge}\|\X_{ij}\|_{\mathrm{F}}^2 \Bigr) \exp\Bigl( \beta \sum_{(i,j)\in \Edge} 2 \mathrm{tr}(\Pm_{ij}^\top \X_{ij}) \Bigr) \\
&= \frac1{C} \exp \Bigl(-\beta \sum\limits_{(i,j)\in \Edge} \bigl(\|\X_{ij}\|_{\mathrm{F}}^2 - 2 \mathrm{tr}(\Pm_{ij}^\top \X_{ij}) \bigr) \Bigr) \\
&= \frac1{C} \exp \Bigl(-\beta \sum\limits_{(i,j)\in \Edge}\bigl(\|\X_{ij}\|_{\mathrm{F}}^2 - 2 \mathrm{tr}(\Pm_{ij}^\top \X_{ij}) - n + n \bigr) \Bigr) \\
&= \frac1{C} \exp(\beta n |\Edge|) \exp \Bigl(-\beta \sum\limits_{(i,j)\in \Edge}\bigl( \|\X_{ij}\|_{\mathrm{F}}^2 - 2 \mathrm{tr}(\Pm_{ij}^\top \X_{ij}) + \|\Pm_{ij}\|_{\mathrm{F}}^2 \bigr) \Bigr) \label{eqn:interm} \\
&= \frac1{Z} \exp \Bigl(-\beta \sum\limits_{(i,j)\in \Edge} \|\Pm_{ij}- \X_i \X_j^\top \|_{\mathrm{F}}^2 \Bigr)
\end{align}
\begin{align}
\X_{ij} = \sum\limits_{b=1}^{B_{ij}} \theta_{ij,b} \mathbf{M}_{ij,b}, \quad \sum\limits_{b=1}^{B_{ij}} \theta_{ij,b} =1, \quad B_{ij} \in \mathbb{Z}_+, \quad \theta_{ij,b} \geq 0, \> \mathbf{M}_{ij,b} \in \mathcal{P}_n, \quad \forall b = 1,\dots,B_{ij}.
\end{align}
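One direction of the Birkhoff–von Neumann decomposition above, namely that any convex combination of permutation matrices is doubly stochastic, can be checked numerically (a toy sketch of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 4, 3

# A convex combination of permutation matrices ...
perms = [np.eye(n)[rng.permutation(n)] for _ in range(B)]
theta = rng.dirichlet(np.ones(B))   # theta_b >= 0, sum_b theta_b = 1
X = sum(t * M for t, M in zip(theta, perms))

# ... is doubly stochastic: nonnegative with unit row and column sums.
assert np.all(X >= 0)
assert np.allclose(X.sum(axis=0), 1) and np.allclose(X.sum(axis=1), 1)
```

The proof uses the converse direction: every doubly stochastic matrix admits such a decomposition.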
\begin{align}
Z_{ij} &= \int_{\mathcal{P}_n} \exp\Bigl(2\beta \mathrm{tr}(\Pm_{ij}^\top \X_{ij}) \Bigr) \mathrm{d} \Pm_{ij} \\
&= \int_{\mathcal{P}_n} \exp\Bigl(2\beta \mathrm{tr}\bigl(\Pm_{ij}^\top \sum\limits_{b=1}^{B_{ij}} \theta_{ij,b} \mathbf{M}_{ij,b}\bigr) \Bigr) \mathrm{d} \Pm_{ij} \\
&= \int_{\mathcal{P}_n} \exp\Bigl(2\beta \sum_{b=1}^{B_{ij}} \theta_{ij,b} \mathrm{tr}(\Pm_{ij}^\top \mathbf{M}_{ij,b}) \Bigr) \mathrm{d} \Pm_{ij} \\
& \geq \exp\Bigl( \int_{\mathcal{P}_n} 2\beta \sum_{b=1}^{B_{ij}} \theta_{ij,b} \mathrm{tr}(\Pm_{ij}^\top \mathbf{M}_{ij,b}) \mathrm{d} \Pm_{ij} \Bigr) \label{eqn:jensen} \\
& = \exp\Bigl( \sum_{b=1}^{B_{ij}} \int_{\mathcal{P}_n} 2\beta \theta_{ij,b} \mathrm{tr}(\Pm_{ij}^\top \mathbf{M}_{ij,b}) \mathrm{d} \Pm_{ij} \Bigr)
\end{align}
\begin{align}
Z_{ij} & \geq \prod_{b=1}^{B_{ij}} \exp \Bigl( \int_{\mathcal{P}_n} 2\beta \theta_{ij,b} \mathrm{tr}(\Pm_{ij}^\top \mathbf{M}_{ij,b}) \mathrm{d} \Pm_{ij} \Bigr) \\
&:= \prod_{b=1}^{B_{ij}} f(\beta, \theta_{ij,b})
\end{align}
Derivation of the Retraction Euler Integrator
\begin{align}
\mathrm{d} \xe_t = (-\mathbf{G}^{-1} \nabla_{\xe} U_\lambda(\xe_t) + \boldsymbol{\Gamma}_t) \mathrm{d}t + \sqrt{2\beta^{-1} \mathbf{G}^{-1}} \mathrm{d} \mathrm{B}_t. \label{eqn:sdesupp}
\end{align}
\begin{align}
\boldsymbol{\Gamma}_t = \frac1{2} \mathbf{G}^{-1} \nabla_{\xe} \log |\mathbf{G}| .
\end{align}
\begin{align}
\mathrm{d} \xe_t = -\mathbf{G}^{-1} \Bigl(\nabla_{\xe}U_\lambda(\xe_t) - \frac1{2}\nabla_{\xe} \log |\mathbf{G}|\Bigr) \mathrm{d}t + \sqrt{2\beta^{-1} \mathbf{G}^{-1}} \mathrm{d} \mathrm{B}_t.
\end{align}
\begin{align}
\mathrm{d} \xe_t = -\mathbf{G}^{-1} \Bigl(\nabla_{\xe}U_\lambda(\xe_t) - \frac1{2}\nabla_{\xe} \log |\mathbf{G}|\Bigr) \mathrm{d}t + \mathbf{G}^{-1} \mathbf{M}^\top \mathcal{N}(0, 2 \beta^{-1} \mathbf{I})
\end{align}
\begin{align}
\mathrm{d} \X_t &= -\mathbf{M}\mathbf{G}^{-1}\mathbf{M}^\top \nabla_{\X}U(\X_t) \mathrm{d}t + \mathbf{M} \mathbf{G}^{-1} \mathbf{M}^\top \mathcal{N}(0, 2 \beta^{-1} \mathbf{I}) \\
&= \mathbf{M}(\mathbf{M}^\top\mathbf{M})^{-1}\mathbf{M}^\top \Bigl(- \nabla_{\X}U(\X_t) \mathrm{d}t + \mathcal{N}(0, 2 \beta^{-1} \mathbf{I}) \Bigr).
\end{align}
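The resulting retraction-based Euler integrator can be sketched as follows. This is a simplified toy implementation of ours: the tangent projection uses the flat (embedded) metric rather than the Fisher metric of the main paper, the retraction is a Sinkhorn-style first-order approximation, and the potential gradient is a placeholder:

```python
import numpy as np

def sinkhorn(A, iters=100):
    """Alternately normalize rows and columns (Sinkhorn-Knopp)."""
    for _ in range(iters):
        A = A / A.sum(axis=1, keepdims=True)
        A = A / A.sum(axis=0, keepdims=True)
    return A

def tangent_project(Z):
    """Orthogonal projection onto matrices with zero row and column
    sums (a flat-metric simplification of the paper's projection)."""
    return Z - Z.mean(axis=1, keepdims=True) \
             - Z.mean(axis=0, keepdims=True) + Z.mean()

def retract(X, xi):
    """Approximate first-order retraction on DP_n: elementwise
    exponential step followed by Sinkhorn normalization (a sketch)."""
    return sinkhorn(X * np.exp(xi / X))

def langevin_step(X, grad_U, h, beta, rng):
    """One retraction-based Euler step of the geodesic Langevin SDE:
    noisy Euclidean step -> tangent projection -> retraction."""
    noise = np.sqrt(2.0 * h / beta) * rng.standard_normal(X.shape)
    xi = tangent_project(-h * grad_U(X) + noise)
    return retract(X, xi)

rng = np.random.default_rng(0)
n = 4
X = np.full((n, n), 1.0 / n)          # barycenter of the Birkhoff Polytope
grad_U = lambda X: np.zeros_like(X)   # placeholder potential gradient
for _ in range(10):
    X = langevin_step(X, grad_U, h=1e-3, beta=1.0, rng=rng)
# the iterates stay (numerically) doubly stochastic
assert np.allclose(X.sum(axis=0), 1, atol=1e-6)
assert np.allclose(X.sum(axis=1), 1, atol=1e-6)
```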
Details on the Synthetic Dataset
Examples on the Willow Dataset
Finding the Optimum Solution
Uncertainty Estimation
Application to Establishing 3D Correspondences
\begin{table}[h]
\centering
\begin{tabular}{l cccc cccc cccc}
\hline
 & \multicolumn{4}{c}{$N_s=200$} & \multicolumn{4}{c}{$N_s=50$} & \multicolumn{4}{c}{$N_s=20$} \\
\cline{2-13}
 & Initial & MatchEIG & Wang et al. & Ours & Initial & MatchEIG & Wang et al. & Ours & Initial & MatchEIG & Wang et al. & Ours \\
\cline{2-13}
Cat & 38.42\% & 32.92\% & 37.83\% & 39.08\% & 53.38\% & 55.42\% & 56.13\% & 56.93\% & 35.56\% & 37.33\% & 38.33\% & 40.33\% \\
Michael & 30.64\% & 34.39\% & 35.92\% & 36.12\% & 46.40\% & 52.23\% & 54.93\% & 54.62\% & 45.76\% & 51.05\% & 51.63\% & 55.95\% \\
\hline
\end{tabular}
\caption{Recall on the Tosca objects Cat and Michael for varying keypoint counts $N_s$.}
\end{table}
Conclusion
Acknowledgements