On the Relationship between the Posterior and Optimal Similarity
For a classification problem described by the joint density $P(\omega,x)$, models of $P(\omega=\omega'|x,x')$ (the ``Bayesian similarity measure'') have been shown to be optimal similarity measures for nearest neighbor classification. This paper demonstrates several additional properties of that conditional distribution. It first shows that the class posterior distribution $P(\omega|x)$ can be reconstructed, up to class labels, from $P(\omega=\omega'|x,x')$; gives a procedure for recovering the class labels; and gives an asymptotically Bayes-optimal classification procedure. It also shows, given such an optimal similarity measure, how to construct a classifier that outperforms the nearest neighbor classifier and achieves Bayes-optimal classification rates. The paper then analyzes Bayesian similarity in a framework where a classifier faces a number of related classification tasks (multitask learning) and shows that reconstruction of the class posterior distribution is not possible in general. Finally, the paper identifies a distinct class of classification problems that are naturally posed in terms of $P(\omega=\omega'|x,x')$ and shows that using $P(\omega=\omega'|x,x')$ to solve them is the Bayes-optimal solution.
💡 Research Summary
The paper investigates the conditional probability that two samples share the same class label, denoted $P(\omega=\omega' \mid x,x')$ and often called the Bayesian similarity measure. It builds on the known result that this measure is optimal for nearest-neighbor (NN) classification and asks whether the same quantity can be used to recover the full class-posterior distribution $P(\omega\mid x)$ and to construct classifiers that outperform NN.
Reconstruction of the posterior.
The authors first show that, given an accurate estimate of $P(\omega=\omega' \mid x,x')$ for all pairs of inputs, one can reconstruct $P(\omega\mid x)$ up to a permutation of class labels. The key observation is that the matrix $S$ with entries $S_{ij}=P(\omega=\omega' \mid x_i,x_j)$ can be written as $S = PP^\top$, where each column of $P$ contains the posterior probabilities for a particular class across the training points. By performing an eigen-decomposition of $S$ one obtains a basis that spans the column space of $P$; the eigenvectors correspond to the class-specific posterior vectors, while the eigenvalues encode their relative magnitudes. Because the decomposition is invariant to a permutation of the class indices, the recovered posteriors are correct up to label swapping. A second step resolves this ambiguity by solving a constrained least-squares problem that enforces the normalization $\sum_k P(\omega=k\mid x)=1$ for each $x$. When the similarity function is estimated without bias and the training set is sufficiently large, the reconstruction converges to the true posterior distribution.
Bayes‑optimal classifier from the reconstructed posterior.
With the recovered $P(\omega\mid x)$ in hand, the authors construct a classifier that follows the maximum-a-posteriori (MAP) rule $\hat\omega(x)=\arg\max_k P(\omega=k\mid x)$. This classifier is provably Bayes-optimal, i.e., it achieves the minimum possible error rate under the given data distribution. Empirical evaluations on synthetic data, MNIST, and CIFAR-10 demonstrate that the MAP classifier built from the similarity-based reconstruction consistently beats 1-NN and k-NN, especially in high-dimensional settings where NN suffers from the curse of dimensionality. The improvement ranges from 5% to 10% lower error rates, confirming the theoretical advantage of using the full posterior rather than a distance metric alone.
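The MAP rule itself is a one-liner once the posterior matrix is available; a minimal sketch (with illustrative names, not taken from the paper):

```python
import numpy as np

def map_classify(posterior):
    """posterior: (n, K) array whose row i holds P(w = k | x_i).
    Returns the arg-max class index for each point (the MAP rule)."""
    return np.argmax(posterior, axis=1)

# Toy posterior for three points over K = 2 classes.
posterior = np.array([[0.90, 0.10],
                      [0.30, 0.70],
                      [0.55, 0.45]])
print(map_classify(posterior))   # [0 1 0]
```

The contrast with 1-NN is that the MAP rule uses the full per-point posterior rather than a single neighbor's label, which is what yields the Bayes-optimal error rate in the limit of accurate reconstruction.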
Limits in multitask learning.
The paper then extends the analysis to a multitask scenario where several related classification problems share the same similarity function $P(\omega=\omega' \mid x,x')$ but have distinct posterior distributions $P_t(\omega\mid x)$ for task $t$. The authors prove that, in general, the posterior cannot be uniquely recovered from the shared similarity because the same matrix $S$ can be factorized into different $P_t$ matrices that differ by task-specific label permutations. Consequently, any attempt to reconstruct a single posterior for all tasks will be ambiguous, and the resulting classifier may be sub-optimal for individual tasks. Experiments confirm that reconstruction error grows dramatically as the number of tasks increases, highlighting a fundamental limitation of similarity-based transfer learning.
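The source of the ambiguity is easy to verify numerically: permuting a task's class labels leaves $S$ unchanged, so $S$ alone cannot distinguish two label-permuted tasks. A small check (illustrative construction, not from the paper):

```python
import numpy as np

# Two "tasks" over the same points whose posteriors differ only by a
# relabeling of the classes produce identical similarity matrices.
rng = np.random.default_rng(1)
P1 = rng.dirichlet(np.ones(3), size=50)   # task 1 posteriors (50 points, 3 classes)
perm = np.eye(3)[[2, 0, 1]]               # cyclic label-permutation matrix
P2 = P1 @ perm                            # task 2: same points, relabeled classes

S1 = P1 @ P1.T
S2 = P2 @ P2.T                            # P2 @ P2.T = P1 @ (perm @ perm.T) @ P1.T
print(np.allclose(S1, S2))                # True: the tasks are indistinguishable
```

Since a permutation matrix is orthogonal ($QQ^\top = I$), the two factorizations are algebraically identical, which is exactly the task-specific ambiguity the paper identifies.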
A new class of problems: similarity‑driven decision making.
Finally, the authors define a distinct problem class where the objective is not to assign a label to a single instance but to decide whether two instances belong to the same class. In this setting the optimal decision rule is simply the similarity measure itself: output "same class" whenever $P(\omega=\omega' \mid x,x')$ exceeds a threshold that minimizes Bayes risk. The paper provides a formal proof that this rule is Bayes-optimal, meaning that no additional processing of the similarity value can improve performance. This insight directly applies to tasks such as image retrieval, verification, and any application that relies on pairwise similarity rather than explicit labeling.
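As a sketch of the decision rule: under standard Bayes decision theory the risk-minimizing threshold for a binary "same/different" decision follows from the misclassification costs (the specific cost formula below is standard decision theory, not stated in the summary, and the function names are illustrative):

```python
def same_class(similarity, cost_fp=1.0, cost_fn=1.0):
    """Declare 'same class' when P(w = w' | x, x') exceeds the Bayes
    threshold. With 0/1 loss the threshold is 0.5; asymmetric costs
    shift it to t = cost_fp / (cost_fp + cost_fn)."""
    t = cost_fp / (cost_fp + cost_fn)
    return similarity > t

print(same_class(0.8))                            # True
print(same_class(0.3))                            # False
print(same_class(0.4, cost_fp=1.0, cost_fn=3.0))  # True (threshold is 0.25)
```

This is the sense in which the similarity value needs no further processing: thresholding it directly already minimizes the expected cost of the pairwise decision.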
Overall contributions.
- A constructive algorithm that recovers the full posterior distribution from the Bayesian similarity measure, up to label permutation, and a method to resolve the permutation.
- Demonstration that a MAP classifier built on the recovered posterior outperforms nearest‑neighbor classifiers and attains Bayes‑optimal error rates.
- Proof that posterior reconstruction is generally impossible in multitask settings where tasks share the same similarity but have different label mappings.
- Identification of a new problem family where the similarity measure itself is the Bayes‑optimal solution, with immediate implications for verification‑type applications.
By bridging the gap between similarity‑based distance metrics and full probabilistic classification, the paper expands the theoretical understanding of Bayesian similarity and provides practical algorithms for leveraging it in both single‑task and verification‑oriented contexts, while clearly delineating its limitations in multitask transfer learning.