Learning Discriminative Metrics via Generative Models and Kernel Learning
Metrics specifying distances between data points can be learned in a discriminative manner or from generative models. In this paper, we show how to unify generative and discriminative learning of metrics via a kernel learning framework. Specifically, we learn local metrics optimized from parametric generative models. These are then used as base kernels to construct a global kernel that minimizes a discriminative training criterion. We consider both linear and nonlinear combinations of local metric kernels. Our empirical results show that these combinations significantly improve performance on classification tasks. The proposed learning algorithm is also very efficient, achieving an order-of-magnitude speedup in training time over previous discriminative baseline methods.
💡 Research Summary
The paper addresses the problem of learning distance metrics for data points, a task that can be approached either discriminatively (optimizing directly for classification performance) or generatively (deriving metrics from probabilistic models of the data). The authors propose a unified framework that leverages the strengths of both paradigms by treating locally learned generative metrics as base kernels and then combining these kernels in a discriminative fashion to obtain a global kernel suitable for nearest‑neighbor classification.
The background section reviews two representative families of metric learning. The discriminative side is exemplified by Large‑Margin Nearest Neighbor (LMNN), which formulates metric learning as a convex semi‑definite program that enforces “pull‑push” constraints on target and impostor points. The generative side is represented by the Generative Local Metric (GLM) method, which assumes class‑conditional densities (typically Gaussian) and derives a local Mahalanobis matrix M_i that minimizes a finite‑sample bias term in the nearest‑neighbor error bound. While GLM can achieve competitive accuracy, it suffers from two practical drawbacks: (1) a new metric must be recomputed for every test point, making inference costly, and (2) performance depends heavily on the assumed generative model.
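The summary above does not reproduce the paper's exact finite-sample bias term, so the following is only an illustrative sketch of the GLM idea under the Gaussian class-conditional assumption: fit a Gaussian per class, and at each point build a local PSD Mahalanobis matrix from the Hessian of the class-conditional density difference, with an eigendecomposition-based PSD projection (the per-point O(D³) step mentioned later). The specific PSD projection used here (absolute eigenvalues with a floor) is an assumption for illustration, not the paper's formula.

```python
import numpy as np

def gaussian_density_hessian(x, mu, cov):
    """Hessian of the Gaussian pdf N(mu, cov) evaluated at x."""
    d = x - mu
    prec = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** len(x) * np.linalg.det(cov))
    p = np.exp(-0.5 * d @ prec @ d) / norm
    g = prec @ d
    # d^2 p / dx^2 = p(x) * (prec d d^T prec - prec)
    return p * (np.outer(g, g) - prec)

def local_metric(x, mu1, cov1, mu2, cov2, eps=1e-6):
    """Illustrative GLM-style local metric M_i at x: PSD projection of
    the Hessian of (p1 - p2).  The eigendecomposition is the O(D^3)
    per-point operation; the |eigenvalue| floor is an assumed choice."""
    H = (gaussian_density_hessian(x, mu1, cov1)
         - gaussian_density_hessian(x, mu2, cov2))
    w, V = np.linalg.eigh(H)
    w = np.maximum(np.abs(w), eps)  # enforce positive-semi-definiteness
    return (V * w) @ V.T
```

The resulting M_i is symmetric PSD by construction, so x_m^T M_i x_n is a valid base kernel, which is the property the next section relies on.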
To overcome these issues, the authors reinterpret each local metric M_i as a positive‑semi‑definite kernel K_i(x_m, x_n) = x_m^T M_i x_n. They then explore two strategies for aggregating the kernels:
- Linear Combination (Global Metric M_UNI). By taking a convex combination K = Σ_i α_i K_i with non‑negative α_i that sum to one, the resulting global metric M = Σ_i α_i M_i is itself positive‑semi‑definite. The simplest choice, uniform averaging (α_i = 1/N), yields M_UNI = (1/N) Σ_i M_i. The authors prove (Theorem 1) that under Gaussian class‑conditional distributions, this uniform combination induces a linear transformation under which each local metric becomes approximately the identity matrix, explaining why the averaged metric works well in practice. Empirically, M_UNI consistently outperforms LMNN and the original GLM on a variety of small‑scale benchmarks.
- Non‑Linear Combination via Multiple Kernel Learning (MKL). Each local metric is used as the covariance of a Gaussian RBF kernel: K_{i,l}(x_m, x_n) = exp(−(x_m−x_n)^T M_i (x_m−x_n)/σ_l^2). The set of kernels across different bandwidths σ_l is then combined using MKL: K = Σ_{i,l} α_{i,l} K_{i,l}, with α_{i,l} ≥ 0 and Σ α_{i,l} = 1. The coefficients are learned discriminatively by minimizing the empirical risk of a kernel‑based classifier (e.g., an SVM). This approach retains the rich, non‑Euclidean geometry encoded by the generative metrics while allowing a highly flexible, data‑driven combination.
Complexity analysis shows that computing each local metric requires eigen‑decomposition of a D×D matrix, costing O(D³). Since this is done for every training point, the total cost is O(N D³), which is linear in N and far cheaper than LMNN’s O(N³) constraint handling when D is modest (e.g., D ≤ √N). The combination steps (both linear and MKL) add negligible overhead.
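The asymptotic comparison above can be made concrete with a back-of-the-envelope operation count (the constants are omitted, so the numbers below are only unit-free scaling estimates, not measured costs):

```python
def glm_training_cost(N, D):
    """One D x D eigendecomposition per training point: ~N * D^3."""
    return N * D ** 3

def lmnn_training_cost(N):
    """LMNN's constraint handling scales roughly cubically in N."""
    return N ** 3

# Example: N = 10_000 points in D = 50 dimensions.
# Local-metric cost ~ 1.25e9 units vs ~ 1e12 units for the
# cubic-in-N method, i.e. cheaper whenever D^3 < N^2.
```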
Experimental evaluation spans ten datasets: eight small UCI‑style sets (dimensions 4–34, 150–2310 points) and two larger image sets (MNIST reduced to 40–164 dimensions, and a 20 k‑sample Letters dataset). For each dataset the authors perform multiple random splits, report average misclassification rates, and compare against LMNN, GLM, and baseline Euclidean distance. Results indicate:
- M_UNI reduces error rates relative to LMNN by 0.5–2 % on most small datasets.
- A normalized variant M_UNI_E further improves performance on several tasks.
- The MKL‑based non‑linear combination achieves the lowest error on the larger MNIST and Letters datasets, confirming the benefit of non‑linear kernel fusion.
- Training times for the proposed methods are an order of magnitude faster than LMNN, especially as N grows.
The paper concludes that (i) local generative metrics can be effectively turned into kernels, (ii) a simple uniform linear combination already yields strong performance, (iii) MKL provides a principled way to exploit non‑linear interactions, and (iv) the overall framework offers both accuracy gains and substantial computational savings. Future work is suggested on extending beyond Gaussian assumptions, developing scalable approximations for the eigen‑decomposition step, and applying the framework to unsupervised or semi‑supervised settings.