Sparsity-accuracy trade-off in MKL

We empirically investigate the best trade-off between sparse and uniformly-weighted multiple kernel learning (MKL) using the elastic-net regularization on real and simulated datasets. We find that the best trade-off parameter depends not only on the sparsity of the true kernel-weight spectrum but also on the linear dependence among kernels and the number of samples.


💡 Research Summary

This paper investigates how to balance sparsity and predictive accuracy in Multiple Kernel Learning (MKL) by employing elastic‑net regularization, which blends ℓ1 and ℓ2 penalties through a mixing parameter α (0 ≤ α ≤ 1). The authors conduct a systematic empirical study on both synthetic and real‑world datasets to determine how the optimal α depends on three key factors: (1) the true sparsity of the kernel‑weight spectrum, (2) the linear dependence among the individual kernels, and (3) the number of training samples.
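Under the convention described above (α = 1 giving a pure ℓ1 penalty, α = 0 a pure ℓ2 penalty), the elastic-net penalty on the kernel-weight vector can be sketched as follows. This is a minimal illustration of the penalty itself, not the paper's full MKL objective:

```python
import numpy as np

def elastic_net_penalty(w, alpha):
    """Elastic-net penalty on a kernel-weight vector w.

    alpha = 1 recovers the pure l1 (sparsity-inducing) penalty;
    alpha = 0 recovers the pure l2 penalty, which favors spreading
    weight across correlated kernels.
    """
    l1 = np.sum(np.abs(w))
    l2 = 0.5 * np.sum(w ** 2)
    return alpha * l1 + (1.0 - alpha) * l2

w = np.array([0.5, 0.0, 0.5])
print(elastic_net_penalty(w, 1.0))  # pure l1: 1.0
print(elastic_net_penalty(w, 0.0))  # pure l2: 0.25
```

Intermediate α values interpolate between the two penalties, which is exactly the trade-off the study sweeps over.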

In the synthetic experiments, a collection of Gaussian RBF kernels with varying bandwidths is generated. Two scenarios are examined: (a) a highly sparse true weight vector where only a small fraction (≈10 %) of kernels have non‑zero weights, and (b) a dense weight vector where many kernels contribute. Additionally, the authors manipulate inter‑kernel correlation by applying linear transformations, creating both nearly independent and strongly correlated kernel sets. Results show that when the underlying weight vector is sparse and kernels are nearly independent, a high α (≈0.8–1.0), i.e., a strong ℓ1 component, yields the lowest test error because it effectively discards irrelevant kernels. Conversely, when kernels are highly correlated and the sample size is limited (N ≤ 200), a moderate α (≈0.3–0.5) performs best; the ℓ2 component stabilizes the solution and mitigates over‑fitting caused by redundancy. As the number of samples grows (N ≥ 1000), the optimal α shifts toward larger ℓ1 contributions, reflecting the benefit of sparsity for computational efficiency without sacrificing accuracy.
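The synthetic setup described above can be sketched in numpy. The bandwidth grid, kernel count, and sample size below are illustrative placeholders, not the paper's actual values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))  # 50 samples, 5 features (illustrative)

def rbf_kernel(X, gamma):
    """Gaussian RBF kernel matrix with bandwidth parameter gamma."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

# A bank of RBF kernels with varying bandwidths.
gammas = np.logspace(-2, 1, 10)
kernels = [rbf_kernel(X, g) for g in gammas]

# Scenario (a): sparse truth -- roughly 10% of kernels active.
w_sparse = np.zeros(len(kernels))
w_sparse[3] = 1.0
# Scenario (b): dense truth -- all kernels contribute equally.
w_dense = np.full(len(kernels), 1.0 / len(kernels))

K_sparse = sum(w * K for w, K in zip(w_sparse, kernels))
K_dense = sum(w * K for w, K in zip(w_dense, kernels))
```

Applying linear transformations to the inputs before computing the kernels, as the authors do, would then control the degree of inter-kernel correlation.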

The real‑data experiments involve two benchmark tasks: image classification on Caltech‑101 and text categorization on the Reuters‑21578 corpus. For each task, a diverse set of feature representations (e.g., color histograms, SIFT, LBP for images; TF‑IDF, LDA for text) is transformed into separate kernels, resulting in 12–15 kernels per dataset. In the text domain, the kernels derived from word‑frequency features are highly correlated, and the best performance is achieved with α ≈ 0.4, indicating a balanced mix of ℓ1 and ℓ2 regularization. In the image domain, the visual features are more complementary; optimal α values lie between 0.7 and 0.9, favoring sparsity to highlight the most informative kernels while still retaining enough ℓ2 smoothing to handle residual noise.
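Since inter-kernel correlation drives the choice of α in both domains, it helps to have a concrete measure of it. A standard option is kernel alignment, the normalized Frobenius inner product between two kernel matrices; this is a common measure in the MKL literature, though not necessarily the exact one used in the paper:

```python
import numpy as np

def kernel_alignment(K1, K2):
    """Alignment <K1, K2>_F / (||K1||_F * ||K2||_F), in [-1, 1].

    Values near 1 indicate highly correlated (redundant) kernels;
    values near 0 indicate nearly independent ones.
    """
    num = np.sum(K1 * K2)
    return num / (np.linalg.norm(K1) * np.linalg.norm(K2))

K = np.eye(4)          # a "diagonal" kernel
J = np.ones((4, 4))    # a constant kernel
print(kernel_alignment(K, K))  # 1.0: perfect self-alignment
print(kernel_alignment(K, J))  # 0.5 for n = 4
```

High pairwise alignment among the word-frequency kernels in the text task is what makes the ℓ2 component (lower α) valuable there.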

A crucial observation is that the optimal α is not a fixed property of the algorithm but varies with data characteristics. The authors employ 5‑fold cross‑validation combined with a grid search over α∈{0,0.1,…,1.0} to select the best trade‑off for each experiment. Across all settings, the selected α values fall within the interval
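The selection protocol described above, 5-fold cross-validation with a grid search over α, can be sketched generically. The `fit_and_score` interface below is a hypothetical placeholder for whatever elastic-net MKL solver is used; only the split-and-sweep logic is shown:

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Shuffle n sample indices and split them into k folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def select_alpha(K_list, y, fit_and_score,
                 alphas=np.arange(0.0, 1.01, 0.1), k=5):
    """Grid-search alpha by k-fold CV.

    fit_and_score(K_list, y, train_idx, val_idx, alpha) -> validation
    error is an assumed interface wrapping the actual MKL solver.
    """
    folds = kfold_indices(len(y), k)
    best_alpha, best_err = None, np.inf
    for a in alphas:
        errs = []
        for i in range(k):
            val = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            errs.append(fit_and_score(K_list, y, train, val, a))
        mean_err = np.mean(errs)
        if mean_err < best_err:
            best_alpha, best_err = a, mean_err
    return best_alpha, best_err

# Toy check: a scorer minimized at alpha = 0.4 should select ~0.4.
toy = lambda K_list, y, tr, va, a: (a - 0.4) ** 2
alpha_star, _ = select_alpha([], np.zeros(20), toy)
```

Running this per dataset is what lets the selected α track the data characteristics rather than being fixed in advance.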

