Metric and Kernel Learning using a Linear Transformation
Metric and kernel learning are important in several machine learning applications. However, most existing metric learning algorithms are limited to learning metrics over low-dimensional data, while existing kernel learning algorithms are often limited to the transductive setting and do not generalize to new data points. In this paper, we study metric learning as a problem of learning a linear transformation of the input data. We show that for high-dimensional data, a particular framework for learning a linear transformation of the data based on the LogDet divergence can be efficiently kernelized to learn a metric (or equivalently, a kernel function) over an arbitrarily high dimensional space. We further demonstrate that a wide class of convex loss functions for learning linear transformations can similarly be kernelized, thereby considerably expanding the potential applications of metric learning. We demonstrate our learning approach by applying it to large-scale real world problems in computer vision and text mining.
💡 Research Summary
The paper frames distance-metric learning as the learning of a linear transformation A of the input data, or equivalently of the positive-semidefinite matrix M = AᵀA that parametrizes a Mahalanobis distance. The authors regularize M with the LogDet divergence to the identity. This choice yields a convex objective that penalizes deviation from the ordinary Euclidean geometry, and the divergence enjoys favorable properties: it is scale invariant, invariant to invertible linear transformations of the data, and finite only on positive-definite matrices, so positive definiteness of M is maintained automatically during optimization.
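For reference, the two objects discussed above have standard definitions (stated here with the identity as the reference matrix and d the data dimension):

```latex
% Mahalanobis distance parametrized by M \succeq 0, with M = A^{\top} A
d_M(x, y) = (x - y)^{\top} M \,(x - y) = \lVert A(x - y) \rVert_2^2

% LogDet divergence; for general reference matrix M_0,
% D_{\ell d}(M, M_0) = \mathrm{tr}(M M_0^{-1}) - \log\det(M M_0^{-1}) - d,
% which with M_0 = I reduces to
D_{\ell d}(M, I) = \operatorname{tr}(M) - \log\det(M) - d
```

The −log det(M) term diverges as any eigenvalue of M approaches zero, which is what keeps the iterates positive definite without an explicit constraint.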
A major obstacle in high-dimensional settings is that directly optimizing M requires storing and manipulating a matrix whose size grows quadratically with the data dimension, which quickly becomes infeasible. To overcome this, the authors apply the classic "kernel trick". They show that the LogDet-regularized objective can be expressed entirely in terms of the kernel matrix K = ΦΦᵀ, where the rows of Φ are the images φ(xᵢ) of the training points under the (possibly infinite-dimensional) feature map induced by a positive-definite kernel. Consequently, the optimization proceeds without ever forming the explicit transformation matrix A or the high-dimensional features; all computations are performed on the kernel matrix, whose size depends only on the number of training examples.
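The practical payoff is that learned distances can be evaluated from base-kernel values alone. A minimal numpy sketch, assuming the parametrization M = I + Φᵀ S Φ that the kernelization result rests on (S is an n×n matrix over the training points; the function names are illustrative, not from the paper):

```python
import numpy as np

def learned_kernel(k_ab, k_a, k_b, S):
    """Learned kernel value under M = I + Phi^T S Phi (rows of Phi are the
    mapped training points, S is n x n):
        k'(a, b) = k(a, b) + k_a^T S k_b,
    where k_a = [k(a, x_1), ..., k(a, x_n)]. Only base-kernel values appear."""
    return k_ab + k_a @ S @ k_b

def learned_sqdist(k_aa, k_bb, k_ab, k_a, k_b, S):
    """Squared distance under the learned metric, from kernel values alone:
        d_M(a, b)^2 = k'(a, a) - 2 k'(a, b) + k'(b, b)."""
    return (learned_kernel(k_aa, k_a, k_a, S)
            - 2.0 * learned_kernel(k_ab, k_a, k_b, S)
            + learned_kernel(k_bb, k_b, k_b, S))
```

With S = 0 this reduces to the base kernel's own distance, which is a useful sanity check; a nonzero S reweights distances using only the n×n kernel matrix.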
Beyond the specific LogDet formulation, the paper proves a more general result: any convex loss that interacts with M only through trace terms of the form tr(M C), where C is built from outer products of the training data (or a sum of such terms), can be kernelized in exactly the same way. This covers a wide range of previously proposed metric-learning losses, including LMNN, ITML, MCML, and many triplet-based formulations. The authors therefore provide a unified framework that simultaneously supports arbitrary convex losses and arbitrary kernels.
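The reason such trace terms kernelize is a short algebraic identity. A sketch under the same illustrative parametrization as above (M = I + Φᵀ S Φ, rows of Φ the mapped points; `c` is a weight vector over training points, so C = Φᵀ c cᵀ Φ):

```python
import numpy as np

def trace_term_kernelized(K, S, c):
    """Evaluate tr(M C) for C = Phi^T c c^T Phi and M = I + Phi^T S Phi.
    Expanding and using K = Phi Phi^T gives the kernel-only form
        tr(M C) = c^T K c + c^T K S K c,
    so the loss term never requires the explicit high-dimensional features."""
    Kc = K @ c
    return c @ Kc + Kc @ S @ Kc
```

Pairwise and triplet losses arise as special cases by choosing c with entries in {+1, -1, 0}, e.g. c = eᵢ - eⱼ recovers the squared learned distance between points i and j.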
Algorithmically, the authors optimize via cyclic Bregman projections: the method repeatedly projects onto one distance constraint at a time, and each projection reduces to a closed-form rank-one update that can be applied directly to the kernel matrix. To keep the method scalable, they further exploit low-rank structure in the kernel matrix, so that memory and per-iteration cost scale with the rank of the representation rather than with the full matrix.
Empirical evaluation is carried out on large‑scale computer‑vision and text‑mining benchmarks. In vision experiments (CIFAR‑10, Caltech‑101, etc.) the kernelized LogDet learner outperforms traditional metric‑learning algorithms (LMNN, ITML, NCA) in classification accuracy and mean average precision, and it further improves performance when combined with a support‑vector machine classifier. In text experiments (Reuters‑21578, 20 Newsgroups) the method yields superior clustering quality and retrieval precision compared with cosine‑similarity baselines, demonstrating its effectiveness on high‑dimensional sparse data.
In summary, the paper presents a powerful and flexible approach to metric and kernel learning: by framing metric learning as the estimation of a linear transformation, regularizing with the LogDet divergence, and exploiting kernelization, it enables convex loss functions to be learned efficiently in arbitrarily high‑dimensional spaces. The work bridges the gap between metric learning and kernel learning, expands the design space for loss functions, and opens the door to applications in large‑scale vision, language, and beyond. Future directions suggested include extensions to non‑linear transformations, online updating schemes, and integration with deep neural architectures.