Probabilistic Permutation Synchronization using the Riemannian Structure of the Birkhoff Polytope


Introduction

Correspondences fuel a large variety of computer vision applications such as structure-from-motion (SfM), SLAM, 3D reconstruction, camera re-localization, image retrieval and 3D scan stitching. In a typical scenario, given two scenes, an initial set of 2D/3D keypoints is first identified. The neighborhood of each keypoint is then summarized with a local descriptor, and keypoints in the two scenes are matched by associating the mutually closest descriptors. In the majority of practical applications, multiple images or 3D shapes are under consideration, and ascertaining such two-view or pairwise correspondences is simply not sufficient. This necessitates a further refinement ensuring global consistency. Unfortunately, at this stage even well-developed pipelines either settle for heuristic/greedy refinement or incorporate costly geometric cues to link individual correspondence estimates into a globally coherent whole.

In this paper, by using the fact that correspondences are cycle consistent 1, we propose two novel algorithms for refining the assignments across multiple images/scans (nodes) in a multi-way graph and for estimating assignment confidences, respectively. We model the correspondences between image pairs as relative, total permutation matrices and seek to find absolute permutations that re-arrange the detected keypoints to a single canonical, global order. This problem is known as map or permutation synchronization . Even though in many practical scenarios matches are only partially available, when shapes are complete and the density of matches increases, total permutations can suffice .
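To make the cycle-consistency constraint concrete, here is a minimal numpy sketch (with hypothetical absolute permutations $`\X_i`$, not the paper's data) showing that relative permutations $`\Pm_{ij} = \X_i \X_j^\top`$ built from any set of absolute permutations compose to the identity around a closed loop:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

def random_permutation_matrix(n, rng):
    """Return a random n x n permutation matrix."""
    return np.eye(n)[rng.permutation(n)]

# Hypothetical absolute permutations X_i mapping each node to a global order.
X = [random_permutation_matrix(n, rng) for _ in range(4)]

# Relative permutations induced by the absolutes: P_ij = X_i X_j^T.
def rel(i, j):
    return X[i] @ X[j].T

# Cycle consistency: composing relative maps around any closed loop
# returns to the start, e.g. P_01 P_12 P_20 = I.
cycle = rel(0, 1) @ rel(1, 2) @ rel(2, 0)
assert np.allclose(cycle, np.eye(n))
```

Synchronization works in the opposite direction: given noisy relative $`\Pm_{ij}`$, it recovers absolute $`\X_i`$ that make all such cycles close.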

Similar to many well-received works, we relax the sought permutations to the set of doubly-stochastic (DS) matrices. We then consider the geometric structure of DS matrices, the Birkhoff Polytope. To the best of our knowledge, we are the first to introduce and apply the recently developed Riemannian geometry of the Birkhoff Polytope to tackle challenging problems of computer vision. Note that the lack of this geometric understanding has caused plenty of obstacles for scholars dealing with our problem. By virtue of a first-order retraction, we can use the recent Riemannian limited-memory BFGS (LR-BFGS) algorithm to perform maximum-a-posteriori (MAP) estimation of the parameters of the consistency loss. We coin our variant Birkhoff-LRBFGS.

At the next stage, we take on the challenge of confidence/uncertainty estimation for the problem at hand by drawing samples on the Birkhoff Polytope and estimating the empirical posterior distribution. To achieve this, we first formulate a new geodesic stochastic differential equation (SDE). Our SDE is based upon Riemannian Langevin Monte Carlo (RLMC), which is efficient and effective in sampling from Riemannian manifolds with true exponential maps. Note that similar stochastic gradient geodesic MCMC (SG-MCMC) tools have already been used in the context of synchronization of spatial rigid transformations, whose parameters admit an analytically defined geodesic flow. Unfortunately, for our manifold the retraction map is only a first-order approximation and hence we cannot use off-the-shelf schemes. To alleviate this, we further contribute a novel numerical integrator that solves our SDE by replacing the intractable exponential map of DS matrices with the approximate retraction map. This leads to another new algorithm: Birkhoff-RLMC.

In a nutshell, our contributions are:

  1. We function as an ambassador and introduce the Riemannian geometry of the Birkhoff Polytope  to solve problems in computer vision.

  2. We propose a new probabilistic model for the permutation synchronization problem.

  3. We minimize the cycle consistency loss via a Riemannian-LBFGS algorithm and outperform the state-of-the-art both in recall and in runtime.

  4. Based upon the Langevin mechanics, we introduce a new SDE and a numerical integrator to draw samples on the high dimensional and complex manifolds with approximate retractions, such as the Birkhoff Polytope. This lets us estimate the confidence maps, which can aid in improving the solutions and spotting consistency violations or outliers.

Note that the tools developed herewith can easily extend beyond our application and would hopefully facilitate promising research directions regarding the combinatorial optimization problems in computer vision.

Riemannian Limited Memory BFGS

We now explain the R-LBFGS optimizer used in our work. To begin with, we recall the foundational BFGS, a quasi-Newton method operating in Euclidean space. We then review simple Riemannian descent algorithms that employ line search. Finally, we explain the R-BFGS used to solve our synchronization problem. R-BFGS can be modified slightly to arrive at the desired limited-memory Riemannian BFGS solver.

Euclidean BFGS

The idea is to approximate the true Hessian $`\mathbf{H}`$ with $`{\bm{\B}}`$, using updates specified by the gradient evaluations2. We will then transition from this Euclidean line-search method to Riemannian optimizers. For clarity, in Alg. 1 we summarize the Euclidean BFGS algorithm, which computes $`{\bm{\B}}`$ using the most recent values of the past iterates. Note that many strategies exist to initialize $`{\bm{\B}}_0`$, a common choice being the scaled identity matrix $`{\bm{\B}}_0=\gamma\mathbf{I}`$. Eq. [eq:bfgs] corresponds to the particular BFGS update rule. In the limited-memory successor of BFGS, L-BFGS, the Hessian matrix $`\mathbf{H}`$ is instead approximated up to a pre-specified rank in order to achieve linear time and space complexity in the dimension of the problem.
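As a concrete illustration of this update, the following sketch runs Euclidean BFGS on a toy strongly convex quadratic. The matrix $`A`$, the exact line search (valid only because $`f`$ is quadratic) and the stopping threshold are illustrative choices of ours, not taken from the paper:

```python
import numpy as np

def bfgs_update(B, s, y):
    """Rank-two BFGS update of the Hessian approximation B from the step
    s = x_{k+1} - x_k and gradient difference y = g_{k+1} - g_k."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# Toy strongly convex quadratic: f(x) = 0.5 x^T A x - b^T x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b

x, B = np.zeros(2), np.eye(2)     # B_0 = gamma * I with gamma = 1
for _ in range(20):
    g = grad(x)
    if np.linalg.norm(g) < 1e-12:
        break
    d = -np.linalg.solve(B, g)            # quasi-Newton search direction
    tau = -(g @ d) / (d @ A @ d)          # exact line search (quadratic f)
    s = tau * d
    y = grad(x + s) - g                   # equals A s for a quadratic
    B = bfgs_update(B, s, y)
    x = x + s

assert np.allclose(x, np.linalg.solve(A, b), atol=1e-8)
```

With exact line search on a quadratic, BFGS builds conjugate directions and terminates in at most as many steps as the dimension; on general objectives a backtracking or Wolfe line search replaces the closed-form step.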


The standard descent minimizers can be extended to operate on Riemannian manifolds $`\Man`$ using the geometry of the parameters. A typical Riemannian update can be characterized as:

\begin{align}
\x_{k+1} = R_{\x_k}(\tau_k \bm{\eta}_k)
\end{align}

where $`R`$ is the retraction operator, i.e. the smooth map from the tangent bundle $`\T\Man`$ to $`\Man`$, and $`R_{\x_k}`$ is the restriction of $`R`$ to $`\T_{\x_k}`$, the tangent space of the current iterate $`\x_k`$, $`R_{\x_k}:\T_{\x_k}\rightarrow \Man`$. The descent direction is defined to be on the tangent space, $`\bm{\eta}_k\in \T_{\x_k}\Man`$. When manifolds are rather simple shapes, the step size $`\tau_k`$ can be a fixed value. However, for most matrix manifolds, some form of line search is preferred to compute $`\tau_k`$. The retraction operator is used to take steps on the manifold and is usually derived analytically. When this analytic map is length-preserving, it is called the true exponential map. However, such exactness is not a requirement for the optimizer and, as happens in the case of doubly stochastic matrices, $`R_{\x_k}`$ only needs to be an approximate retraction, e.g. of first or second order. In fact, for a map to be a valid retraction, it is sufficient to satisfy the following conditions:


  1. $`R`$ is continuously differentiable.

  2. $`R_{\x}(\zero_{\x})=\x`$, where $`\zero_{\x}`$ denotes the zero element of $`\T_{\x}\Man`$. This is called the centering property.

  3. The curve $`\gamma_{\bm{\eta}_{\x}}(\tau)=R_{\x}(\tau \bm{\eta}_{\x})`$ satisfies:

    \begin{align}
    \frac{d\gamma_{\bm{\eta}_{\x}}(\tau)}{d\tau}\Bigr\rvert_{\tau = 0}=\bm{\eta}_{\x}, \, \forall \bm{\eta}_{\x} \in \T_{\x}\Man.
    \end{align}
    

    This is called the local rigidity property.

Considering all these, we provide, in Alg. 2, a general Riemannian descent algorithm, which can be customized by the choice of the direction, retraction and the means to compute the step size. Such a concept of minimizing by walking on the manifold is visualized in Fig. [fig:torus].
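These retraction properties can be sanity-checked numerically. The sketch below uses a Sinkhorn-based first-order retraction on the Birkhoff polytope, in the spirit of the Douik-Hassibi geometry the paper builds on; the Euclidean double-centering used as a tangent projection is a simplifying assumption of this sketch, not the paper's exact Riemannian projection:

```python
import numpy as np

def sinkhorn(A, iters=500):
    """Alternating row/column normalization (Sinkhorn-Knopp) to project a
    positive matrix onto the doubly stochastic set."""
    for _ in range(iters):
        A = A / A.sum(axis=1, keepdims=True)
        A = A / A.sum(axis=0, keepdims=True)
    return A

def project_tangent(Z):
    """Double-centering: returns a matrix with zero row and column sums,
    a simplified stand-in for the exact tangent-space projection."""
    Z = Z - Z.mean(axis=1, keepdims=True)
    return Z - Z.mean(axis=0, keepdims=True)

def retract(X, xi):
    """A Sinkhorn-based first-order retraction on the Birkhoff polytope:
    R_X(xi) = Sinkhorn(X * exp(xi / X)), elementwise operations."""
    return sinkhorn(X * np.exp(xi / X))

n = 4
rng = np.random.default_rng(1)
X = sinkhorn(rng.random((n, n)) + 0.5)               # interior point of DP_n
eta = project_tangent(rng.standard_normal((n, n)))   # tangent direction

# Centering property: R_X(0) = X.
assert np.allclose(retract(X, np.zeros((n, n))), X)

# Local rigidity: d/dtau R_X(tau * eta) at tau = 0 equals eta,
# checked here by a finite difference.
h = 1e-6
deriv = (retract(X, h * eta) - X) / h
assert np.allclose(deriv, eta, atol=1e-3)
```

The finite-difference check verifies local rigidity only to first order, which is exactly what a first-order retraction guarantees.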

It is possible to develop new minimizers by making particular choices for the Riemannian operators in Alg. 2. We now review the Armijo variant of Riemannian gradient descent, a common and probably the simplest choice for an accelerated optimizer on the manifold, though many other line-search conditions such as Barzilai-Borwein or strong Wolfe can be used. The pseudocode for this backtracking version is given in Alg. 3. Note that the Riemannian gradient $`\text{grad}f(\x)`$ is simply obtained by projecting the Euclidean gradient $`\nabla f(\x)`$ onto the manifold $`\Man`$, and the next iterate is obtained through a line search satisfying the Armijo condition, tweaked to use the retraction $`R`$ for taking steps. For a suitably chosen Armijo step size, this algorithm is guaranteed to converge for all retractions.

LR-BFGS for Minimization on $`\DP_n`$

Based upon the ideas developed up to this point, we present the Riemannian variant of L-BFGS. The algorithm is mainly based on Huang et al. and we refer the reader to their seminal work for more details. Similar to the Euclidean case, we begin by summarizing a BFGS algorithm, with the difference that it is suited to solving the synchronization task. This time, $`\bm{\B}`$ will approximate the action of the Hessian on the tangent space $`\T_{\x_k}\Man`$. Generalizing any (quasi-)Newton method to the Riemannian setting thus requires computing the Riemannian Hessian operator, or its approximation. This necessitates taking some sort of directional derivative of a vector field. As vector fields belonging to different tangent spaces cannot be compared directly, one needs the notion of a connection $`\Gamma`$, generalizing the directional derivative of a vector field. The connection is closely related to the concept of vector transport $`T: \T\Man \otimes \T\Man \rightarrow \T \Man`$, which allows moving from one tangent space to another, $`T_{\bm{\eta}}(\bm{\zeta}): (\bm{\eta},\bm{\zeta})\in\T_\x\Man \rightarrow \T_{R_{\x}(\bm{\eta})}\Man`$. For the first-order treatment of the Birkhoff Polytope, this vector transport takes a simple form:

\begin{equation}
T_{\bm{\eta}}(\bm{\zeta}) \triangleq \Pi_{R_{\x}(\bm{\eta})}(\bm{\zeta}) \in \T_{R_{\x}(\bm{\eta})}\DP_n,
\end{equation}

where $`\Pi_{\X}(\cdot)`$ is the projection onto the tangent space of $`\X`$ as defined in the main paper. We give the pseudocode of the R-BFGS algorithm in Alg. 4 below. To ensure that Riemannian L-BFGS always produces a descent direction, it is necessary to adopt a line-search algorithm which satisfies the strong Wolfe (SW) conditions. Roptlib implements SW, while ManOpt uses the simpler Armijo conditions, even for the LR-BFGS. Another simple possibility is to take the steps on the manifold using the retraction, while performing the line search in the Euclidean space. This results in a projected, approximate line search, but it can terminate quite quickly, making it possible to exploit existing Euclidean-space SW L-BFGS solvers such as Ceres. Without delving into rigorous proofs, we provide one such implementation at github.com/tolgabirdal/MatrixManifoldsInCeres, where several matrix manifolds are considered. We leave the analysis of such approximations for a future study and note that the results in the paper are generated by the limited-memory form of the R-BFGS procedure summarized under Alg. 4. While many modifications do exist, LR-BFGS is, in essence, an approximation to R-BFGS in which the $`O(MN^2)`$ storage of the full Hessian is avoided by unrolling the R-BFGS descent computation and using only the last $`m`$ input and gradient differences, where $`m`$ is much smaller than the problem dimension.
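The limited-memory mechanics can be sketched with the classical two-loop recursion. The flat (Euclidean) version below omits the vector transport that would move each stored pair between tangent spaces on a manifold; the toy quadratic and exact line search are our illustrative choices:

```python
import numpy as np

def two_loop(grad, S, Y):
    """L-BFGS two-loop recursion: apply the inverse-Hessian approximation
    built from the last m step/gradient-difference pairs to grad. On a
    manifold, each stored pair would additionally be transported to the
    current tangent space; here we sketch only the flat case."""
    q = grad.copy()
    alphas = []
    for s, y in reversed(list(zip(S, Y))):       # newest pair first
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q = q - a * y
    if S:                                        # scaled-identity H_0
        s, y = S[-1], Y[-1]
        q = q * (y @ s) / (y @ y)
    for (s, y), a in zip(zip(S, Y), reversed(alphas)):  # oldest first
        b = (y @ q) / (y @ s)
        q = q + (a - b) * s
    return q

# Minimize f(x) = 0.5 x^T A x - b^T x with memory m = 5.
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.ones(3)
grad = lambda x: A @ x - b

x, S, Y, m = np.zeros(3), [], [], 5
for _ in range(30):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    d = -two_loop(g, S, Y)
    tau = -(g @ d) / (d @ A @ d)        # exact line search (quadratic f)
    x_new = x + tau * d
    S.append(x_new - x); Y.append(grad(x_new) - g)
    S, Y = S[-m:], Y[-m:]               # keep only the last m pairs
    x = x_new

assert np.allclose(x, np.linalg.solve(A, b), atol=1e-6)
```

Storage is $`O(mN)`$ instead of the $`O(N^2)`$ of a dense Hessian approximation, which is what makes the limited-memory variant practical at synchronization scale.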

Proof of Proposition 1 and Further Analysis

Proof. Consider the ray cast outwards from the origin, $`\mathbf{r} = \mathbb{R}_{\geq 0} \mathbbm{1}`$. $`\mathbf{r}`$ exits $`\DP_n`$ at $`\tfrac{1}{n} \mathbbm{1}`$ but exits the sphere $`\Scal^{n-1}`$ at $`\tfrac{1}{\sqrt{n}} \mathbbm{1}`$. This creates a gap between the two manifolds whose ratio grows to $`\infty`$ as $`n`$ grows. From an optimization perspective, consider the linear function $`\lambda(\Pm) = \sum_{i,j} P_{ij}`$. $`\lambda(\Pm)`$ is minimized on $`\DP_n`$ at $`\tfrac{1}{n} \mathbbm{1}`$ and on the sphere at $`\tfrac{1}{\sqrt{n}} \mathbbm{1}`$. ◻

We now look at the restricted orthogonal matrices and seek to quantify the difference between optimizing a linear functional on them and on $`\DP_n`$. Note that minimizing functionals is ultimately what we are interested in, as the problems we consider here are formulated as optimization. Let $`\mathcal{A}`$ be the affine linear space of $`(n-1)\times(n-1)`$ matrices whose rows and columns sum to 1, i.e. doubly stochastic but with no condition on the signs (generalized doubly stochastic matrices). The Birkhoff Polytope $`\DP_n`$ is contained in $`\mathcal{A}`$, whereas the orthogonal group $`\Og_n`$ is not. Hence, there exists an affine functional $`\lambda`$, with $`\lambda=0`$, that is minimized at a point of $`\DP_n`$ but not of $`\Og_n`$. Let $`\mathcal{AO}_n=\mathcal{A} \cap \Og_n`$, a further restricted manifold. This time, unlike the case of $`\DP_n`$, $`\mathcal{AO}_n`$ does not coincide with the permutations $`\mathcal{P}`$, due to the negative elements. In fact $`\mathcal{P}\subset\mathcal{AO}_n`$.

The ratio between the time a ray leaves $`\DP_n`$ and the same ray leaves $`\mathcal{AO}_n`$ can be as large as $`n-1`$.

Proof. Consider the line through $`\frac{1}{n}\mathbbm{1}`$ and $`\Id`$: $`l(x) = \frac{1+x}{n}\mathbbm{1} - x\Id`$. Such a ray leaves $`\DP_n`$ at $`x=\frac{1}{n-1}`$, while for all $`x\in [-1,1]`$ it remains in the convex hull of $`\mathcal{AO}_n`$. When $`x=1`$, it is an orthogonal matrix on $`\Og_n`$. Hence, the ray can stay on $`\mathcal{AO}_n`$ for $`n-1`$ times as long as in $`\DP_n`$. This quantity also grows to infinity as $`n\rightarrow \infty`$. If the same analysis is done for the sphere, the ratio is found to be $`(n-1)^{3/2}`$, a larger quantity that grows to infinity even faster. ◻
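The exit points used in this proof are easy to verify numerically for a small $`n`$ (here $`n=5`$, so the ray leaves $`\DP_n`$ at $`x = 1/4`$); this sketch only checks the claims above, with tolerances of our choosing:

```python
import numpy as np

n = 5
J = np.ones((n, n))      # the all-ones matrix, playing the role of 1
I = np.eye(n)
l = lambda x: (1.0 + x) / n * J - x * I

def in_birkhoff(M, tol=1e-12):
    """Nonnegative entries with unit row and column sums."""
    return (M.min() >= -tol
            and np.allclose(M.sum(axis=0), 1.0)
            and np.allclose(M.sum(axis=1), 1.0))

# The ray stays in DP_n exactly up to x = 1/(n-1): the diagonal entry
# (1+x)/n - x hits zero there and turns negative beyond.
assert in_birkhoff(l(1.0 / (n - 1)))
assert not in_birkhoff(l(1.0 / (n - 1) + 1e-6))

# ... while at x = 1 the point (2/n) J - I is an orthogonal matrix.
assert np.allclose(l(1.0) @ l(1.0).T, I)
```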

Proof of Proposition 2

Proof. The conclusion of the proposition states that $`p(\X)`$ and $`p(\Pm|\X)`$ have the following form:

\begin{align}
    p(\X) =& \frac1{C} \exp \Bigl(-\beta  \sum\limits_{(i,j)\in \Edge}\|\X_{ij}\|_{\mathrm{F}}^2 \Bigr)  \prod\limits_{(i,j) \in \Edge}  Z_{ij} \\
p(\Pm | \X) =&  \prod_{(i,j)\in \Edge} p(\Pm_{ij} | \X_i, \X_j) \\
=& \prod_{(i,j)\in \Edge} \exp\Bigl( 2\beta\>\mathrm{tr}(\Pm_{ij}^\top \X_{ij}) \Bigr)  \frac1{Z_{ij}} \\
=& \exp\Bigl( \beta \sum_{(i,j)\in \Edge} 2 \mathrm{tr}(\Pm_{ij}^\top \X_{ij}) \Bigr) \prod_{(i,j)\in \Edge}\frac1{Z_{ij}}
\end{align}

where

\begin{align}
    Z_{ij} :=  \prod\limits_{b=1}^{B_{ij}} Z_\X(\beta, \theta_{ij,b}). \label{eqn:normconst}
\end{align}

Our goal is to verify that the following holds with the above definitions:

\begin{align}
    p(\Pm,\X) = \frac1{Z} \exp\Bigl(-\beta \sum_{(i,j)\in \Edge} \|\Pm_{ij}- \X_i \X_j^\top \|_{\mathrm{F}}^2 \Bigr) = \frac1{Z} \prod_{(i,j)\in \Edge} \psi(\Pm_{ij},\X_i,\X_j),
\end{align}

where $`Z`$ is a positive constant (cf. Section 4 in the main paper). Then, we can easily verify this equality as follows:

\begin{align}
    p(\Pm,\X) &= p(\Pm|\X) p(\X) \\
    &=  \frac1{C} \exp \Bigl(-\beta  \sum\limits_{(i,j)\in \Edge}\|\X_{ij}\|_{\mathrm{F}}^2 \Bigr)   \exp\Bigl( \beta \sum_{(i,j)\in \Edge} 2 \mathrm{tr}(\Pm_{ij}^\top \X_{ij}) \Bigr) \\
    &=  \frac1{C} \exp \Bigl(-\beta   \sum\limits_{(i,j)\in \Edge} \bigl(\|\X_{ij}\|_{\mathrm{F}}^2 - 2 \mathrm{tr}(\Pm_{ij}^\top \X_{ij}) \bigr) \Bigr)  \\
    &= \frac1{C} \exp \Bigl(-\beta   \sum\limits_{(i,j)\in \Edge}\bigl(\|\X_{ij}\|_{\mathrm{F}}^2 - 2 \mathrm{tr}(\Pm_{ij}^\top \X_{ij}) - n + n \bigr) \Bigr) \\
    &= \frac1{C} \exp(\beta n |\Edge|) \exp \Bigl(-\beta  \sum\limits_{(i,j)\in \Edge}\bigl( \|\X_{ij}\|_{\mathrm{F}}^2 - 2 \mathrm{tr}(\Pm_{ij}^\top \X_{ij}) + \|\Pm_{ij}\|_{\mathrm{F}}^2 \bigr) \Bigr) \label{eqn:interm} \\
    &= \frac1{Z}  \exp \Bigl(-\beta  \sum\limits_{(i,j)\in \Edge} \|\Pm_{ij}- \X_i \X_j^\top \|_{\mathrm{F}}^2 \Bigr)
\end{align}

where we used the fact that $`\|\Pm_{ij}\|_{\mathrm{F}}^2 = n`$ in Equation [eqn:interm] since $`\Pm_{ij} \in \mathcal{P}_n`$ and $`Z = C \exp(-\beta n |\Edge|)`$.

In the rest of the proof, we characterize the normalizing constant $`Z_{ij} = \int_{\mathcal{P}_n} \exp(2\beta \mathrm{tr}(\Pm_{ij}^\top \X_{ij})) \mathrm{d} \Pm_{ij}`$. Here, we denote the counting measure on $`\mathcal{P}_n`$ as $`\mathrm{d} \Pm_{ij}`$.

We start by decomposing $`\X_{ij}`$ via the Birkhoff-von Neumann theorem:

\begin{align}
\X_{ij} = \sum\limits_{b=1}^{B_{ij}} \theta_{ij,b} \mathbf{M}_{ij,b}, \quad \sum\limits_{b=1}^{B_{ij}} \theta_{ij,b} =1, \quad B_{ij} \in \mathbb{Z}_+, \quad \theta_{ij,b} \geq 0, \> \mathbf{M}_{ij,b} \in \mathcal{P}_n, \quad \forall b = 1,\dots,B_{ij}.
\end{align}

Then, we have:

\begin{align}
    Z_{ij} &= \int_{\mathcal{P}_n} \exp\Bigl(2\beta \mathrm{tr}(\Pm_{ij}^\top \X_{ij}) \Bigr) \mathrm{d} \Pm_{ij} \\
    &= \int_{\mathcal{P}_n} \exp\Bigl(2\beta \mathrm{tr}\Bigl(\Pm_{ij}^\top \sum\limits_{b=1}^{B_{ij}} \theta_{ij,b} \mathbf{M}_{ij,b} \Bigr) \Bigr) \mathrm{d} \Pm_{ij} \\
    &= \int_{\mathcal{P}_n} \exp\Bigl(2\beta \sum_{b=1}^{B_{ij}} \theta_{ij,b}  \mathrm{tr}(\Pm_{ij}^\top \mathbf{M}_{ij,b}) \Bigr) \mathrm{d} \Pm_{ij} \\
    & \geq \exp\Bigl( \int_{\mathcal{P}_n} 2\beta \sum_{b=1}^{B_{ij}} \theta_{ij,b}  \mathrm{tr}(\Pm_{ij}^\top \mathbf{M}_{ij,b})  \mathrm{d} \Pm_{ij} \Bigr) \label{eqn:jensen} \\
    & = \exp\Bigl(  \sum_{b=1}^{B_{ij}} \int_{\mathcal{P}_n} 2\beta  \theta_{ij,b}  \mathrm{tr}(\Pm_{ij}^\top \mathbf{M}_{ij,b})  \mathrm{d} \Pm_{ij} \Bigr)
\end{align}

where we used Jensen’s inequality in [eqn:jensen]. Here, we observe that $`\int_{\mathcal{P}_n} 2\beta \theta_{ij,b} \mathrm{tr}(\Pm_{ij}^\top \mathbf{M}_{ij,b})\mathrm{d} \Pm_{ij}`$ is similar to the normalization constant of a Mallows model , and it only depends on $`\beta`$ and $`\theta_{ij,b}`$. Hence, we conclude that

\begin{align}
    Z_{ij} & \geq \prod_{b=1}^{B_{ij}} \exp \Bigl( \int_{\mathcal{P}_n} 2\beta  \theta_{ij,b}  \mathrm{tr}(\Pm_{ij}^\top \mathbf{M}_{ij,b})  \mathrm{d} \Pm_{ij} \Bigr) \\
    &:= \prod_{b=1}^{B_{ij}}  f(\beta, \theta_{ij,b})
\end{align}

where $`f`$ is an increasing function of $`\beta, \theta_{ij,b}`$ since $`\mathrm{tr}(\Pm_{ij}^\top \mathbf{M}_{ij,b}) \geq 0`$. This completes the proof. ◻
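The Birkhoff-von Neumann decomposition invoked above can be computed constructively. The greedy sketch below peels off permutations with the Hungarian algorithm on log-weights, a standard constructive scheme, not the paper's own implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def birkhoff_decompose(X, tol=1e-9):
    """Greedy Birkhoff-von Neumann decomposition of a doubly stochastic X
    into a convex combination of permutation matrices. Each step picks a
    perfect matching supported on the remaining positive entries and peels
    off the largest feasible coefficient."""
    X = X.copy()
    thetas, perms = [], []
    while X.max() > tol:
        # Forbid (near-)zero entries; Birkhoff's theorem guarantees a
        # positive-support perfect matching exists for the DS residual.
        cost = np.where(X > tol, -np.log(np.maximum(X, tol)), 1e9)
        r, c = linear_sum_assignment(cost)
        P = np.zeros_like(X)
        P[r, c] = 1.0
        theta = X[r, c].min()          # largest feasible coefficient
        thetas.append(theta)
        perms.append(P)
        X = X - theta * P              # residual stays nonnegative
    return np.array(thetas), perms

# Toy check: decompose a known convex combination and reconstruct it.
rng = np.random.default_rng(2)
Ps = [np.eye(4)[rng.permutation(4)] for _ in range(3)]
X0 = 0.5 * Ps[0] + 0.3 * Ps[1] + 0.2 * Ps[2]
thetas, perms = birkhoff_decompose(X0)
X_rec = sum(t * P for t, P in zip(thetas, perms))
assert np.allclose(X_rec, X0, atol=1e-6)
assert abs(thetas.sum() - 1.0) < 1e-6
```

The greedy scheme may use more permutations than the minimal $`B_{ij}`$, but the recovered coefficients always form a valid convex combination.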

Derivation of the Retraction Euler Integrator

We start by recalling the SDE

\begin{align}
\mathrm{d} \xe_t = (-\mathbf{G}^{-1} \nabla_{\xe}  U_\lambda(\xe_t) + \boldsymbol{\Gamma}_t) \mathrm{d}t + \sqrt{2/\beta \mathbf{G}^{-1}} \mathrm{d} \mathrm{B}_t. \label{eqn:sdesupp}
\end{align}

By , we know that

\begin{align}
   \boldsymbol{\Gamma}_t = \frac1{2} \mathbf{G}^{-1} \nabla_{\xe} \log |\mathbf{G}| .
\end{align}

By using this identity in Equation [eqn:sdesupp], we obtain:

\begin{align}
    \mathrm{d} \xe_t = -\mathbf{G}^{-1} (\nabla_{\xe}U_\lambda(\xe_t)  + \frac1{2}\nabla_{\xe} \log |\mathbf{G}|) \mathrm{d}t + \sqrt{2/\beta \mathbf{G}^{-1}} \mathrm{d} \mathrm{B}_t.
\end{align}

By using a similar notation to , we rewrite the above equation as follows:

\begin{align}
     \mathrm{d} \xe_t = -\mathbf{G}^{-1} (\nabla_{\xe}U_\lambda(\xe_t)  + \frac1{2}\nabla_{\xe} \log |\mathbf{G}|) \mathrm{d}t +  \mathbf{G}^{-1} \mathbf{M}^\top  \mathcal{N}(\mathbf{0}, 2 \beta^{-1} \mathbf{I})
\end{align}

where $`\mathcal{N}`$ denotes the Gaussian distribution and $`[\mathbf{M}]_{ij} = \partial [\X]_{ij} / \partial[\xe]_{ij}`$. We multiply each side of the above equation by $`\mathbf{M}`$ and use the property $`\nabla_{\xe} = \mathbf{M}^\top \nabla_{\X}`$ , which yields:

\begin{align}
     \mathrm{d} \X_t &= -\mathbf{M}\mathbf{G}^{-1}\mathbf{M}^\top \nabla_{\X}U(\X_t) \mathrm{d}t +  \mathbf{M} \mathbf{G}^{-1} \mathbf{M}^\top  \mathcal{N}(\mathbf{0}, 2 \beta^{-1} \mathbf{I}) \\
     &= \mathbf{M}(\mathbf{M}^\top\mathbf{M})^{-1}\mathbf{M}^\top \Bigl(- \nabla_{\X}U(\X_t) \mathrm{d}t + \mathcal{N}(\mathbf{0}, 2 \beta^{-1} \mathbf{I}) \Bigr).
\end{align}

Here we used the area formula (Equation 11 in the main paper) and the fact that $`\mathbf{G} = \mathbf{M}^\top \mathbf{M}`$.

The term $`\mathbf{M}(\mathbf{M}^\top\mathbf{M})^{-1}\mathbf{M}^\top`$ turns out to be the projection operator onto the tangent space of $`\X`$. Therefore, the usual geodesic integrator would consist of the following steps at iteration $`k`$:

  • Set a small step size $`h`$

  • Compute the term $`- h\nabla_{\X}U(\X_t) + \sqrt{2h/\beta} \mathbf{Z}`$, with $`\mathbf{Z} \sim \mathcal{N}(0, \mathbf{I})`$

  • Obtain the ‘direction’ by projecting the result of the previous step on the tangent space

  • Move the current iterate on the geodesic determined by the direction that was obtained in the previous step.

Unfortunately, the last step of the integrator above cannot be computed in the case of $`\DP_n`$. Therefore, we replace it with moving the current iterate by using the retraction operator, which yields the update equations given in the main paper.
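Putting the steps together, one retraction-based Euler step could be sketched as follows. The Sinkhorn-style retraction and the double-centering tangent projection are simplified stand-ins for the paper's exact operators, and the quadratic potential replaces the synchronization cost purely for illustration:

```python
import numpy as np

def sinkhorn(A, iters=300):
    """Alternating row/column normalization back to the DS set."""
    for _ in range(iters):
        A = A / A.sum(axis=1, keepdims=True)
        A = A / A.sum(axis=0, keepdims=True)
    return A

def project_tangent(Z):
    """Double-centering: zero row/column sums (simplified projection)."""
    Z = Z - Z.mean(axis=1, keepdims=True)
    return Z - Z.mean(axis=0, keepdims=True)

def retract(X, xi):
    """Sinkhorn-based first-order retraction on the Birkhoff polytope."""
    return sinkhorn(X * np.exp(xi / X))

def langevin_step(X, grad_U, h, beta, rng):
    """One retraction-based Euler step of the geodesic Langevin SDE:
    drift -h grad U plus sqrt(2h/beta) Gaussian noise, projected to the
    tangent space, then a retraction instead of the unavailable exp map."""
    Z = rng.standard_normal(X.shape)
    step = -h * grad_U(X) + np.sqrt(2.0 * h / beta) * Z
    return retract(X, project_tangent(step))

# Illustrative potential: quadratic attraction to the polytope center.
rng = np.random.default_rng(3)
n = 4
target = np.full((n, n), 1.0 / n)
grad_U = lambda X: 2.0 * (X - target)

X = sinkhorn(rng.random((n, n)) + 0.5)
for _ in range(200):
    X = langevin_step(X, grad_U, h=1e-3, beta=50.0, rng=rng)

# Iterates remain (numerically) on the Birkhoff polytope.
assert np.allclose(X.sum(axis=1), 1.0, atol=1e-6)
assert np.allclose(X.sum(axis=0), 1.0, atol=1e-6)
assert X.min() > 0.0
```

Because every update is multiplicative followed by a Sinkhorn projection, positivity and the row/column-sum constraints are preserved by construction, which is the point of replacing the intractable exponential map with the retraction.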

Details on the Synthetic Dataset

We now give further details on the synthetic data used in the paper. We synthesize the data by displacing a $`4\times 4`$ 2D grid to simulate 5 different images and scrambling the order of correspondences. As shown in Fig. [fig:synth]a, the ground truth corresponds to an identity ordering, where the $`i^{th}`$ element of each grid is mapped to the $`i^{th}`$ node of the other. Note that the graph is fully connected, but we show only consecutive connections. To simulate noisy data, we introduce $`16`$ random swaps into the ground truth, as shown in Fig. [fig:synth]b. We also randomize the initialization in a similar way. On this data we first run a baseline method where, instead of restricting ourselves to the Birkhoff Polytope, we use the paths of the orthogonal group $`O(N)`$. Our second baseline is the method of Maset et al. Note that, regardless of the initialization (Fig. [fig:synth]e,f), our method arrives at a visually more satisfactory local minimum. This validates that for complex problems such as the one at hand, respecting the geometry of the constraints is crucial; our algorithm does so and is hence able to find high-quality solutions. In the figure, the more parallel the lines are, the closer the result is to the ground truth.

Examples on the Willow Dataset

Finding the Optimum Solution

We now show matches visualized after running our synchronization on the Willow dataset. For the sake of space, we omitted some of these results from the paper. Fig. [fig:realsup] plots our findings, where our algorithm is able to converge from challenging initial matches. Note that these results only show the MAP estimates explained in the main paper. It is possible to extend our approach with some form of geometric constraints as in Wang et al. We leave this as future work.

Uncertainty Estimation

There are not many well-accepted standards for evaluating uncertainty. Hence, in the main paper we resorted to what we refer to as the top-K error. We now briefly supply more details. The purpose of the experiment is to measure the quality of the sample proposals generated by our Birkhoff-RLMC. To this end, we introduce a random sampler that generates $`K`$ arbitrary solutions that are highly likely to be irrelevant (bad proposals). These solutions alone do not solve the problem. However, if we were to consider the result correct whenever either the found optimum or a random proposal contains the correct match, i.e., append the additional $`K-1`$ samples to the solution set, then even a random sampler is guaranteed to increase the recall. Similarly, samples from Birkhoff-RLMC will also yield higher recall. The question then is: can Birkhoff-RLMC do better than a random sampler? To answer it, we record the relative improvement in recall both for the random sampler and for Birkhoff-RLMC. The top-K error for different choices of $`K`$ is what we presented in Table 2 of the main paper. We further visualize these solutions in columns (c) to (f) of the same table. To do so, we simply retain the assignments with the $`K`$-highest scores in the solutions $`\X`$. Note that the entire $`\X`$ acts as a confidence map itself (column d). We then use the MATLAB command imagesc on $`{\X}`$ with the jet colormap.
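A toy reading of this top-K protocol is sketched below; the synthetic confidence map, its corruption, and all numbers are illustrative stand-ins, not the paper's evaluation code:

```python
import numpy as np

def topk_recall(X_conf, gt_perm, k):
    """Fraction of keypoints whose ground-truth match appears among the
    k highest-scoring candidates of the (relaxed) assignment matrix."""
    order = np.argsort(-X_conf, axis=1)[:, :k]   # k best columns per row
    hits = [gt_perm[i] in order[i] for i in range(X_conf.shape[0])]
    return float(np.mean(hits))

# Toy confidence map: mass concentrated near a ground-truth permutation.
rng = np.random.default_rng(4)
n = 6
gt = rng.permutation(n)
X_conf = 0.1 * rng.random((n, n))
X_conf[np.arange(n), gt] += 0.5      # boost the true matches
X_conf[0, gt[0]] = 0.0               # corrupt one row: true match loses

r1 = topk_recall(X_conf, gt, k=1)    # top-1 misses the corrupted row
rn = topk_recall(X_conf, gt, k=n)    # allowing all candidates recovers it
assert r1 < 1.0
assert rn == 1.0
```

Appending extra candidates always increases recall; the experiment in the paper asks whether Birkhoff-RLMC proposals increase it faster than random ones.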

Next, we provide insights into how the sampler works in practice. Fig. 1 plots the objective value attained as the iterations of the Birkhoff-RLMC sampler proceed. These plots look different for different values of $`\beta`$. Here we use $`\beta=0.08`$ and show, on both the Motorbike and Winebottle samples (used above) and for 1000 iterations, the behaviour of sample selection. In the paper, we accumulated these samples and estimated the confidence maps. Note that occasionally the sampler can discover better solutions than the one provided. This is due to two reasons: 1) rarely, the sampler can jump over local minima; 2) the initial solution is discrete, and a doubly stochastic (relaxed) solution matrix can often have lower cost.

Iterates of our sampling algorithm. With β = 0.085, our Birkhoff-RLMC sampler wanders around the local mode and draws samples on the Birkhoff Polytope. Occasionally, better solutions are found and returned. The purple line indicates the starting point, which is only a slight perturbation of the found solution.

Application to Establishing 3D Correspondences

We now apply our algorithm to correspondence problems on isolated 3D shapes from the synthetic Tosca dataset, which is composed of non-rigidly deforming 3D meshes of various objects such as cat, human, centaur, dog, wolf and horse. The cat object from this dataset, along with ten of its selected ground-truth correspondences, is shown in Fig. [fig:tosca]. This problem is of great interest to the 3D vision community, and the availability of complete shapes plays well with the assumptions of our algorithm, i.e., that permutations are total. It is important to mention that when the initial correspondences are good ($`\sim 80\%`$), all methods, including ours, obtain $`100\%`$ accuracy. Therefore, to be able to put our method to the test, we intentionally degrade the initial correspondence estimation algorithm we use.

Initialization Analogous to the 2D scenario, we obtain the initial correspondences by running a Siamese PointNet-like network regressing the assignments between two shapes. Unlike 2D, we do not need to handle occlusions, visibility or missing keypoints, as the 3D shapes in this dataset are complete and clean. On average, we use half of the dataset for training and the other half for testing. Note that this amount of data is insufficient to fully train the network, resulting in suboptimal predictions. We gradually downsample the point sets uniformly to sizes of $`N_s=\{200, 50, 20\}`$ to obtain the keypoints. At this point, it is possible to use sophisticated keypoint prediction strategies to establish the optimum cardinality and the set of points under correspondence. Note that it is sufficient to compute keypoints per shape, as we do not assume any particular order. Each keypoint is matched to another by the network prediction. The final stage of the prediction yields a soft assignment matrix, on which we apply the Hungarian algorithm to obtain the initial matches.
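The final rounding step can be sketched as follows; the soft assignment here is a synthetic stand-in for the network output, with peaks planted at a known permutation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def discretize(soft):
    """Round a soft assignment matrix to a permutation with the Hungarian
    algorithm, maximizing the total assignment score."""
    r, c = linear_sum_assignment(-soft)   # negate: scipy minimizes cost
    P = np.zeros_like(soft)
    P[r, c] = 1.0
    return P

# Toy soft assignment: a noisy, row-softmaxed score matrix whose peaks
# encode a known permutation (stand-in for the network's prediction).
rng = np.random.default_rng(5)
n = 5
gt = rng.permutation(n)
scores = rng.random((n, n))
scores[np.arange(n), gt] += 2.0           # strong peaks at true matches
soft = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

P = discretize(soft)
assert np.allclose(P[np.arange(n), gt], 1.0)   # recovers the permutation
```

When the network's peaks are weaker or conflicting, the rounded permutation can be wrong in several rows, which is exactly what the subsequent synchronization stage is meant to repair.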

Preliminary Results We run our algorithm and the state-of-the-art methods described in the main paper on these initial sets of correspondences. We count and report the percentage of correctly matched correspondences (recall) in Tab. 1 for two objects, Cat and Michael; running on the full dataset is left as future work. Our algorithm consistently outperforms the competing approaches. The difference is less visible when the number of samples starts to dominate the number of edges, because none of the algorithms can then find enough consistency votes to correct the errors.

Correspondence refinement on 3D inputs. We show our results on the meshes of Tosca  dataset. The cells show the percentage of correctly detected matches (recall).
          N=200                                  N=50                                   N=20
          Initial  MatchEIG  Wang et al.  Ours   Initial  MatchEIG  Wang et al.  Ours   Initial  MatchEIG  Wang et al.  Ours
Cat       38.42%   32.92%    37.83%       39.08% 53.38%   55.42%    56.13%       56.93% 35.56%   37.33%    38.33%       40.33%
Michael   30.64%   34.39%    35.92%       36.12% 46.40%   52.23%    54.93%       54.62% 45.76%   51.05%    51.63%       55.95%

Conclusion

In this work we have proposed two new frameworks for relaxed permutation synchronization on the manifold of doubly stochastic matrices. Our novel model and formulation paved the way to using sophisticated optimizers such as Riemannian limited-memory BFGS. We further integrated a manifold-MCMC scheme enabling posterior sampling and thereby confidence estimation. We have shown that our confidence maps are informative about cycle inconsistencies and can lead to new solution hypotheses. We used these hypotheses in a top-K evaluation and illustrated their benefits. In the future, we plan to (i) address partial permutations, i.e. the inner region of the Birkhoff Polytope, (ii) investigate more sophisticated MCMC schemes, and (iii) seek better use cases for our confidence estimates, such as outlier removal.

Acknowledgements

This work was supported by the grant ANR-16-CE23-0014 (FBIMATRIX). The authors thank Haowen Deng for the initial 3D correspondences, and Benjamin Busam and Jesus Briales for fruitful discussions.


  1. Composition of correspondences for any circular path arrives back at the start node. ↩︎

  2. Note that certain implementations can instead opt to approximate the inverse Hessian for computational reasons. ↩︎