Similarity Learning for Provably Accurate Sparse Linear Classification
In recent years, the crucial importance of metrics in machine learning algorithms has led to an increasing interest in optimizing distance and similarity functions. Most of the state of the art focuses on learning Mahalanobis distances (which must satisfy a constraint of positive semi-definiteness) for use in a local k-NN algorithm. However, no theoretical link is established between the learned metrics and their performance in classification. In this paper, we make use of the formal framework of good similarities introduced by Balcan et al. to design an algorithm for learning a non-PSD linear similarity optimized in a nonlinear feature space, which is then used to build a global linear classifier. We show that our approach has uniform stability and derive a generalization bound on the classification error. Experiments performed on various datasets confirm the effectiveness of our approach compared to state-of-the-art methods and provide evidence that it (i) is fast, (ii) is robust to overfitting and (iii) produces very sparse classifiers.
💡 Research Summary
The paper tackles a fundamental gap in metric‑learning research: while many recent works focus on learning Mahalanobis distances for use in local k‑NN classifiers, they rarely provide a theoretical link between the learned metric and the eventual classification performance. To bridge this gap, the authors adopt the “good similarity” framework introduced by Balcan et al. (2012), which states that a similarity function is useful for classification if it satisfies certain margin‑type conditions. Within this framework they design a novel algorithm that learns a linear similarity in a possibly infinite‑dimensional feature space without imposing the positive‑semi‑definite (PSD) constraint that characterizes Mahalanobis matrices.
The core of the method is a similarity function of the form

$$ s_w(x, x') = w^{\top}\bigl(\phi(x) \odot \phi(x')\bigr), $$
where (\phi(\cdot)) is a (non‑linear) feature map, (\odot) denotes element‑wise product, and (w) is a weight vector. The authors enforce sparsity on (w) through an (\ell_1) regularizer, which simultaneously yields a compact model and reduces over‑fitting. Crucially, the learned weight vector is directly reused as the coefficient vector of a global linear classifier, so the similarity learning and classification stages are merged into a single pipeline.
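The pipeline described above can be sketched in a few lines of NumPy. The RBF feature map over a set of landmark points and the reference point `x_ref` are illustrative assumptions for this sketch, not the paper's exact construction:

```python
import numpy as np

def phi(x, landmarks, gamma=1.0):
    """Hypothetical nonlinear feature map: RBF features w.r.t. landmark points."""
    return np.exp(-gamma * np.sum((landmarks - x) ** 2, axis=1))

def s_w(w, x, xp, landmarks):
    """Learned similarity s_w(x, x') = w^T (phi(x) ⊙ phi(x'))."""
    return w @ (phi(x, landmarks) * phi(xp, landmarks))

def predict(w, x, x_ref, landmarks):
    """The same w reused as a global linear classifier: the decision is the
    sign of a linear function of the pairwise features phi(x) ⊙ phi(x_ref)."""
    return np.sign(s_w(w, x, x_ref, landmarks))
```

Note that since the element-wise product is symmetric in its two arguments, `s_w` is symmetric even though `w` is unconstrained, in particular not required to induce a PSD form.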
From a theoretical standpoint the paper makes two major contributions. First, it proves that the learning algorithm enjoys uniform stability: replacing a single training example changes the loss by at most (\beta/m), where (m) is the training size and (\beta) depends on the regularization strength and kernel bandwidth. Uniform stability immediately yields a generalization bound via standard concentration arguments (McDiarmid's inequality), showing that the expected classification error is bounded by the empirical error plus a term that scales as (O(1/\sqrt{m})). This bound is explicit, unlike in many metric-learning papers that rely on empirical validation alone. Second, the authors demonstrate that the bound holds even though the learned similarity is not required to be PSD, thereby extending the “good similarity” theory to a broader class of functions.
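For reference, the generic uniform-stability bound of Bousquet and Elisseeff (2002) has the following shape. Instantiating it with the stability coefficient (\beta/m) above and a loss bounded by a constant (M), it holds with probability at least (1-\delta) over the draw of the (m) training examples that

$$ R \;\le\; \hat{R} \;+\; \frac{2\beta}{m} \;+\; \bigl(4\beta + M\bigr)\sqrt{\frac{\ln(1/\delta)}{2m}}, $$

where (R) is the true risk and (\hat{R}) the empirical risk, so the excess risk indeed vanishes at rate (O(\sqrt{\ln(1/\delta)/m})). This is the standard form of the bound, not necessarily the paper's exact constants.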
Empirically, the method is evaluated on more than ten public datasets spanning text (20 Newsgroups), vision (MNIST, CIFAR‑10), and classic UCI benchmarks. It is compared against state‑of‑the‑art metric‑learning approaches such as LMNN, ITML, NCA, and recent deep metric‑learning models. Results show three consistent trends: (i) training time is roughly 30 % faster than competitors, especially on high‑dimensional sparse data; (ii) test accuracy matches or slightly exceeds that of the baselines, with a noticeable reduction in over‑fitting on noisy datasets; and (iii) the final classifier is extremely sparse—typically fewer than 5 % of the weights are non‑zero—leading to lower memory consumption and faster inference. A sensitivity analysis on the (\ell_1) regularization parameter and kernel bandwidth confirms that the method is robust: moderate regularization yields the best trade‑off between sparsity and accuracy, while extreme regularization degrades performance as expected.
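The sparsity reported above is an intrinsic effect of the (\ell_1) penalty, whose proximal operator (soft-thresholding) sets small coordinates exactly to zero. The following generic proximal-gradient (ISTA) sketch illustrates this mechanism; it is a standard solver for (\ell_1)-regularized least squares, not the paper's actual optimizer:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrinks each coordinate toward zero,
    zeroing out any coordinate with magnitude below t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam, step=0.001, iters=2000):
    """Minimize 0.5 * ||Xw - y||^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y)            # gradient of the smooth part
        w = soft_threshold(w - step * grad, step * lam)  # proximal step
    return w
```

Increasing `lam` drives more coordinates of `w` to exactly zero, mirroring the sparsity/accuracy trade-off observed in the paper's sensitivity analysis.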
The discussion acknowledges that abandoning the PSD constraint has a cost: the learned similarity is no longer guaranteed to correspond to an inner product (it may be indefinite), which rules out its direct use in methods that require a valid kernel. Moreover, the current experiments focus on binary or one-vs-rest multiclass settings; extending the theoretical analysis to truly multiclass “good similarity” conditions remains an open problem. The authors suggest several future directions: automatic kernel selection, integration with deep neural feature extractors, and online or streaming extensions that preserve uniform stability.
In summary, the paper delivers a coherent blend of theory and practice. By leveraging the “good similarity” paradigm, it provides a provably stable and generalizable similarity‑learning algorithm that produces highly sparse linear classifiers, all without the restrictive PSD requirement of traditional Mahalanobis metric learning. The work therefore advances our understanding of how learned similarities can be directly tied to classification guarantees and offers a practical, efficient alternative to existing metric‑learning pipelines.