Feature Selection with Annealing for Computer Vision and Big Data Learning
Many computer vision and medical imaging problems are faced with learning from large-scale datasets, with millions of observations and features. In this paper we propose a novel efficient learning scheme that tightens a sparsity constraint by gradually removing variables based on a criterion and a schedule. The attractive fact that the problem size keeps dropping throughout the iterations makes it particularly suitable for big data learning. Our approach applies generically to the optimization of any differentiable loss function, and finds applications in regression, classification and ranking. The resultant algorithms build variable screening into estimation and are extremely simple to implement. We provide theoretical guarantees of convergence and selection consistency. In addition, one-dimensional piecewise linear response functions are used to account for nonlinearity and a second-order prior is imposed on these functions to avoid overfitting. Experiments on real and synthetic data show that the proposed method compares very well with other state-of-the-art methods in regression, classification and ranking while being computationally very efficient and scalable.
💡 Research Summary
The paper introduces Feature Selection with Annealing (FSA), a scalable algorithm designed for high‑dimensional computer‑vision and medical‑imaging tasks where both the number of samples and the number of features can reach millions. The core idea is to intertwine gradient‑based parameter updates with a systematic “keep‑or‑kill” screening step that removes variables according to the magnitude of their current coefficients. An annealing schedule controls how many variables are retained at each iteration, gradually tightening the sparsity constraint from the full feature set down to a user‑specified cardinality k.
Formally, the method solves a constrained optimization problem
β* = arg min_β L(β) subject to |{j : βⱼ ≠ 0}| ≤ k,
where L(β) is any differentiable loss (logistic, Huber‑SVM, a newly proposed Lorenz loss, squared error, etc.). Each iteration e of the algorithm performs two steps: (1) β ← β − η∇βL(β), a standard gradient‑descent update with learning rate η; (2) keep only the Mₑ coefficients with largest absolute value, where Mₑ follows an annealing schedule Mₑ = k + (M − k)·max(0, (N_iter − 2e)/(2eμ + N_iter)). The schedule parameter μ and the number of iterations N_iter are easy to tune; the authors show that a wide range of values yields stable performance.
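The two‑step loop can be sketched in a few lines of NumPy. This is a minimal illustration for the logistic loss, not the authors' implementation: the schedule uses the denominator 2eμ + N_iter, and all parameter values (η, μ, N_iter) are assumptions for the example.

```python
import numpy as np

def fsa_schedule(e, M, k, N_iter, mu):
    # Number of features kept at iteration e; anneals from M down to k
    return int(k + (M - k) * max(0.0, (N_iter - 2 * e) / (2 * e * mu + N_iter)))

def fsa_logistic(X, y, k, eta=0.5, mu=10, N_iter=100):
    """Feature Selection with Annealing for the logistic loss (sketch).

    X: (N, M) data matrix, y: labels in {0, 1}.
    Returns the indices of the k surviving features and their coefficients.
    """
    N, M = X.shape
    idx = np.arange(M)            # indices of currently surviving features
    beta = np.zeros(M)
    for e in range(1, N_iter + 1):
        # Step 1: gradient-descent update on the surviving features
        z = np.clip(X[:, idx] @ beta, -30, 30)     # clip to avoid overflow
        p = 1.0 / (1.0 + np.exp(-z))
        beta -= eta * X[:, idx].T @ (p - y) / N
        # Step 2: keep only the Me coefficients with largest magnitude
        Me = max(k, fsa_schedule(e, M, k, N_iter, mu))
        keep = np.argsort(-np.abs(beta))[:Me]
        idx, beta = idx[keep], beta[keep]
    return idx, beta
```

Note how the per‑iteration work shrinks automatically: after the "keep‑or‑kill" step, only the columns in `idx` are ever touched again.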
Theoretical contributions include a proof of global convergence (Theorem 2.1) and a selection‑consistency result, guaranteeing that, under standard regularity conditions, the algorithm recovers the true set of relevant variables with high probability. Unlike L₁, SCAD, MCP, or other penalized approaches, FSA does not introduce bias into the estimated coefficients because sparsity is enforced by explicit cardinality control rather than by a penalty term. The cardinality k is intuitive for practitioners, directly specifying the desired model size.
To capture non‑linearity, the authors augment the linear predictor with one‑dimensional piecewise‑linear response functions and impose a second‑order prior to avoid over‑fitting. They also introduce the Lorenz loss, a differentiable, margin‑based loss that is zero for correctly classified points beyond the margin and grows only logarithmically for severely mis‑classified points, making it more robust to label noise than logistic or hinge losses.
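One plausible form of such a loss, consistent with the description above (zero beyond the margin, logarithmic growth for large negative margins), is sketched below; the exact expression here is an assumption for illustration, not necessarily the paper's definition.

```python
import numpy as np

def lorenz_loss(margin):
    """Lorenz-type margin loss (illustrative form).

    margin = y * f(x). Zero for margin >= 1 (correct beyond the margin),
    grows only logarithmically for large negative margins, which limits
    the influence of severely mis-labeled points.
    """
    u = np.asarray(margin, dtype=float)
    return np.where(u < 1, np.log1p((u - 1) ** 2), 0.0)

def lorenz_grad(margin):
    # Derivative w.r.t. the margin; both branches give 0 at u = 1,
    # so the loss is continuously differentiable.
    u = np.asarray(margin, dtype=float)
    return np.where(u < 1, 2 * (u - 1) / (1 + (u - 1) ** 2), 0.0)
```

For comparison, a hinge loss at margin −100 would be 101, while this loss is only about log(1 + 101²) ≈ 9.2, which is the robustness-to-outliers property the summary highlights.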
Computationally, each iteration costs O(M · N) operations, but because Mₑ shrinks over time the total work is proportional to the area under the annealing curve. Empirical timing tables show that with μ = 300 the total cost can be reduced to roughly 5 × M · N, compared with ≈125 × M · N for a naïve schedule. The algorithm is trivially parallelizable: the data matrix can be partitioned into sub‑blocks for distributed or GPU execution, with row‑wise reductions for response computation and column‑wise reductions for gradient aggregation.
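The "area under the annealing curve" claim can be checked numerically. The sketch below sums the per‑iteration feature counts Mₑ and expresses the total work in units of M·N; the values of M, k, and N_iter are hypothetical, chosen only to compare a naïve schedule (μ = 0) against an aggressive one.

```python
def total_work_factor(M, k, N_iter, mu):
    """Total work of FSA in units of M*N (sketch of the cost model).

    Each iteration costs ~Me*N operations, so total flops ~ factor * M * N
    where factor = sum_e Me / M, i.e. the area under the annealing curve.
    """
    total = 0.0
    for e in range(1, N_iter + 1):
        Me = k + (M - k) * max(0.0, (N_iter - 2 * e) / (2 * e * mu + N_iter))
        total += Me
    return total / M

# Hypothetical sizes: 1000 features, keep 10, 500 iterations
slow = total_work_factor(1000, 10, 500, mu=0)    # naive linear decay
fast = total_work_factor(1000, 10, 500, mu=300)  # aggressive annealing
```

With these numbers the naïve schedule costs on the order of 125·M·N while μ = 300 brings the factor down by more than an order of magnitude, matching the trend reported in the timing tables.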
Experimental evaluation covers synthetic data (M = 10⁵, N ≤ 10⁶) and real‑world tasks such as face‑keypoint detection, motion segmentation, and medical‑image classification. FSA is benchmarked against LogitBoost, AdaBoost, Recursive Feature Elimination (RFE), L₁‑penalized logistic regression, SCAD, and other state‑of‑the‑art methods. Results consistently demonstrate: (i) dramatically lower training times (often 5–20× faster than boosting); (ii) higher or comparable variable‑selection accuracy (percentage of truly relevant features recovered); (iii) equal or superior predictive performance measured by AUC for classification, RMSE for regression, and NDCG for ranking. Sensitivity analyses confirm that performance is robust to the choice of η, μ, and N_iter, and that the method tolerates a broad range of k values without substantial degradation.
In summary, FSA offers a universal, easy‑to‑implement feature‑selection framework that combines (1) loss‑function agnosticism, (2) intuitive cardinality control, (3) provable convergence and selection consistency, (4) annealing‑driven computational savings, and (5) the ability to incorporate non‑linear response functions. These properties make it especially suitable for modern big‑data computer‑vision pipelines where traditional penalized or boosting‑based feature‑selection methods become computationally prohibitive.