Bayesian matching of unlabelled point sets using Procrustes and configuration models

The problem of matching unlabelled point sets using Bayesian inference is considered. Two recently proposed models for the likelihood are compared, based on the Procrustes size-and-shape and the full configuration. Bayesian inference is carried out for matching point sets using Markov chain Monte Carlo simulation. A modification to the existing Procrustes algorithm is proposed that improves convergence rates by making occasional large jumps in the burn-in period. The Procrustes and configuration methods are compared in a simulation study and using real data, where it is of interest to estimate the strengths of matches between protein binding sites. The performance of both methods is generally quite similar, and a connection between the two models is made using a Laplace approximation.

💡 Research Summary

The paper tackles the problem of matching two unlabelled point sets by framing it as a Bayesian inference problem. Two recently proposed likelihood models are examined: a Procrustes size‑and‑shape model and a full‑configuration model. Both models introduce a binary matching matrix Λ that encodes the unknown correspondence between points in set X and points in set Y, and they place priors on Λ (typically uniform or sparsity‑inducing) as well as on transformation parameters (rotation R, translation t, scaling s) or on the underlying configuration means and covariances.

In the Procrustes model the transformed version of X, namely s·R·X + t, is assumed to be normally distributed around the matched points in Y with isotropic variance σ². The likelihood is therefore a Gaussian error term conditioned on Λ and the transformation parameters θ = (R, t, s). The full‑configuration model treats the two point clouds as jointly generated from multivariate normal distributions with unknown means and covariances, while Λ again links the latent correspondences. This model does not require an explicit transformation; instead the correspondence is inferred directly from the joint distribution.
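The Gaussian error term described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `procrustes_loglik`, the row-per-point layout of `X` and `Y`, and the choice to score only the matched pairs (ignoring any term for unmatched points) are all assumptions made for this example.

```python
import numpy as np

def procrustes_loglik(X, Y, Lam, R, t, s, sigma):
    """Isotropic Gaussian log-likelihood of the matched pairs only.

    X: (m, d) points, Y: (n, d) points, Lam: (m, n) binary matching matrix.
    Unmatched points are ignored here, which is a simplifying assumption.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    rows, cols = np.nonzero(Lam)            # index pairs (j, k) with Lam[j, k] = 1
    d = X.shape[1]
    mapped = s * X[rows] @ R.T + t          # s * R * x_j + t for each matched x_j
    resid = Y[cols] - mapped
    n_matched = len(rows)
    return (-0.5 * np.sum(resid ** 2) / sigma ** 2
            - n_matched * d * np.log(sigma)
            - 0.5 * n_matched * d * np.log(2 * np.pi))

# Hypothetical toy data: Y is an exact rotated, scaled, shifted copy of X.
angle = np.pi / 6
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
t = np.array([1.0, -1.0])
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
Y = 1.5 * X @ R.T + t

Lam_true = np.eye(3, dtype=int)
Lam_wrong = Lam_true[[1, 0, 2]]             # swaps the matches of the first two points
ll_true = procrustes_loglik(X, Y, Lam_true, R, t, 1.5, 0.2)
ll_wrong = procrustes_loglik(X, Y, Lam_wrong, R, t, 1.5, 0.2)
```

With the correct matching the residuals vanish, so `ll_true` reduces to the Gaussian normalising terms and exceeds `ll_wrong`.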

Posterior inference for both models is carried out using Markov chain Monte Carlo (MCMC). The authors employ a Gibbs step to update the matching matrix Λ from its conditional Bernoulli distribution and a Metropolis–Hastings step to propose new values for θ (or for the means and covariances in the configuration model). A key contribution is an improvement to the Procrustes‑based sampler: during the burn‑in phase the algorithm occasionally makes “large jumps” by either randomly permuting the current matching or proposing drastic changes to the transformation parameters. This strategy reduces the risk of the chain becoming trapped in local modes and speeds up convergence by roughly 30% in the authors’ experiments.
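The idea of occasional large jumps during burn-in can be illustrated on a toy problem. The sketch below is not the paper's sampler: it runs Metropolis–Hastings on a single rotation angle with a hypothetical bimodal target (`log_target`), and the names `sample`, `p_jump`, and `step` are invented for this example. During burn-in, a fraction `p_jump` of proposals is drawn uniformly over the circle; afterwards only local random-walk moves are used. Both proposals are symmetric, so the usual acceptance ratio applies.

```python
import numpy as np

def log_target(theta):
    # Hypothetical unnormalised target: a dominant mode at +2.0 and a
    # minor local mode at -2.0 carrying about 1e-4 of the mass.
    return np.logaddexp(40.0 * np.cos(theta - 2.0),
                        np.log(1e-4) + 40.0 * np.cos(theta + 2.0))

def sample(n_iter, burn_in, p_jump, step, theta0, rng):
    theta, lp = theta0, log_target(theta0)
    chain = np.empty(n_iter)
    for i in range(n_iter):
        if i < burn_in and rng.random() < p_jump:
            prop = rng.uniform(-np.pi, np.pi)           # large jump (burn-in only)
        else:
            prop = theta + step * rng.normal()          # local random-walk move
            prop = (prop + np.pi) % (2 * np.pi) - np.pi # wrap onto [-pi, pi)
        lp_prop = log_target(prop)
        if np.log(rng.random()) < lp_prop - lp:         # symmetric-proposal MH test
            theta, lp = prop, lp_prop
        chain[i] = theta
    return chain[burn_in:]                              # discard burn-in draws

rng = np.random.default_rng(0)
local_only = sample(3000, 1500, 0.0, 0.1, -2.0, rng)    # starts in the minor mode
with_jumps = sample(3000, 1500, 0.1, 0.1, -2.0, rng)    # same start, jumps enabled
```

In this toy setting the local-only chain stays near the minor mode where it started, while the chain with burn-in jumps finds the dominant mode before the retained samples begin.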

A comprehensive simulation study varies the number of points (20–100), noise level (σ = 0.1–0.5), and the proportion of points that truly have matches. Across all scenarios the two models deliver virtually identical matching accuracy (measured by precision, recall, F1‑score, and area under the ROC curve). Differences are primarily observed in mixing speed: the configuration model, which avoids explicit transformation updates, mixes slightly faster, whereas the Procrustes model provides interpretable rotation, translation, and scale estimates.
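The accuracy measures named above can be computed directly from a true and an estimated matching matrix. The following is a minimal sketch; the function name `match_scores` and the toy matrices are assumptions for illustration, and the ROC-based measure is omitted because it needs match probabilities rather than a single hard matching.

```python
import numpy as np

def match_scores(Lam_true, Lam_est):
    """Precision, recall and F1 of an estimated binary matching matrix."""
    Lam_true = np.asarray(Lam_true, dtype=bool)
    Lam_est = np.asarray(Lam_est, dtype=bool)
    tp = np.sum(Lam_true & Lam_est)      # matches present in both
    fp = np.sum(~Lam_true & Lam_est)     # estimated matches that are not true
    fn = np.sum(Lam_true & ~Lam_est)     # true matches that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical example: three true matches, one of which is misassigned.
Lam_true = np.zeros((3, 4), dtype=int)
Lam_true[[0, 1, 2], [0, 1, 2]] = 1
Lam_est = np.zeros((3, 4), dtype=int)
Lam_est[[0, 1, 2], [0, 1, 3]] = 1

precision, recall, f1 = match_scores(Lam_true, Lam_est)
```

Here two of the three estimated matches are correct and one true match is missed, so all three scores equal 2/3.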

The methodology is then applied to a real‑world problem: matching the three‑dimensional binding sites of proteins. By computing posterior match probabilities for each candidate pair of atoms, the authors can rank the strength of each hypothesized correspondence. The Bayesian approach yields a nuanced view of partial matches and outperforms traditional deterministic alignment tools in terms of reproducibility and robustness to missing or noisy residues.
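Posterior match probabilities of this kind are typically estimated by averaging the sampled matching matrices over the retained MCMC draws. The sketch below assumes the draws are stacked in an array of shape `(n_samples, m, n)`; the function names and the 0.5 reporting threshold are illustrative choices, not taken from the paper.

```python
import numpy as np

def posterior_match_probs(samples):
    """Monte Carlo estimate of P(Lam[j, k] = 1 | data) from sampled matrices."""
    return np.asarray(samples, dtype=float).mean(axis=0)

def rank_matches(probs, threshold=0.5):
    """List (j, k, probability) for pairs at or above threshold, strongest first."""
    rows, cols = np.nonzero(probs >= threshold)
    order = np.argsort(-probs[rows, cols], kind="stable")
    return [(int(rows[i]), int(cols[i]), float(probs[rows[i], cols[i]]))
            for i in order]

# Hypothetical draws of a 2 x 2 matching matrix from four MCMC iterations.
samples = np.array([
    [[1, 0], [0, 1]],
    [[1, 0], [0, 0]],
    [[1, 0], [0, 1]],
    [[1, 0], [0, 0]],
])
probs = posterior_match_probs(samples)
ranked = rank_matches(probs)
```

In this toy example the pair (0, 0) is matched in every draw and the pair (1, 1) in half of them, so `ranked` lists (0, 0) first with probability 1.0, followed by (1, 1) with probability 0.5.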

Finally, the paper establishes a theoretical link between the two approaches via a Laplace approximation. By integrating out the transformation parameters in the Procrustes model, the resulting marginal likelihood closely approximates the likelihood of the configuration model, demonstrating that the two formulations are essentially equivalent under a suitable approximation. Consequently, practitioners can choose the model that best fits their computational constraints or interpretability needs without sacrificing statistical rigor.
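To show what a Laplace approximation does when a rotation is integrated out, consider a toy one-dimensional case rather than the paper's derivation. For planar point sets, the rotation-dependent part of a Gaussian matching likelihood has the form exp(κ cos(θ − θ₀)), whose integral over the circle is 2π·I₀(κ) exactly. The Laplace approximation replaces the integrand by a Gaussian centred at the mode θ₀, giving exp(κ)·√(2π/κ). The function names and the values of κ below are assumptions chosen for illustration.

```python
import numpy as np

def log_exact(kappa):
    # Exact log of the integral of exp(kappa * cos(theta - theta0)) over the circle.
    return np.log(2 * np.pi * np.i0(kappa))

def log_laplace(kappa):
    # Laplace approximation: the log-integrand equals kappa at the mode and its
    # negative second derivative there is kappa, so the Gaussian integral
    # contributes sqrt(2 * pi / kappa).
    return kappa + 0.5 * np.log(2 * np.pi / kappa)

err_diffuse = abs(log_laplace(2.0) - log_exact(2.0))         # weakly concentrated case
err_concentrated = abs(log_laplace(50.0) - log_exact(50.0))  # sharply peaked case
```

The approximation error shrinks as κ grows, that is, as the integrand becomes more sharply peaked around its mode, which is the regime in which a Laplace-based link between two models is most accurate.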

📜 Original Paper Content