Datum-Wise Classification: A Sequential Approach to Sparsity
We propose a novel classification technique whose aim is to select an appropriate representation for each datapoint, in contrast to the usual approach of selecting a representation encompassing the whole dataset. This datum-wise representation is found by using a sparsity-inducing empirical risk, which is a relaxation of the standard L₀-regularized risk. The classification problem is modeled as a sequential decision process that sequentially chooses, for each datapoint, which features to use before classifying. Datum-Wise Classification extends naturally to multi-class tasks, and we describe a specific case where our inference has equivalent complexity to a traditional linear classifier, while still using a variable number of features. We compare our classifier to classical L₁-regularized linear models (L₁-SVM and LARS) on a set of common binary and multi-class datasets and show that for an equal average number of features used we can get improved performance using our method.
💡 Research Summary
The paper introduces a novel classification framework called Datum‑Wise Sparse Classification (DWSC) that selects a tailored set of features for each individual data point rather than using a single global feature subset for the entire dataset. Traditional sparsity‑inducing methods such as L₀‑regularization or its convex surrogate L₁ (e.g., LASSO, L₁‑SVM, LARS) enforce the same sparsity pattern across all samples, which can be sub‑optimal when some instances are easy to classify and others are difficult. DWSC addresses this by augmenting the empirical risk with a penalty term proportional to the number of features actually used for a given instance. The resulting objective is
min_θ (1/N) ∑ᵢ [ Δ(ŷθ(xᵢ), yᵢ) + λ ‖zθ(xᵢ)‖₀ ],
where Δ is a classification loss (0/1), zθ(xᵢ)∈{0,1}ⁿ indicates which features were consulted, and λ controls the trade‑off between accuracy and per‑instance sparsity.
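To make the objective concrete, here is a minimal sketch of how it would be evaluated on toy predictions; the function name and toy data are illustrative, not from the paper:

```python
import numpy as np

def datum_wise_objective(y_pred, y_true, z, lam):
    """Empirical risk with a per-instance L0 penalty.

    y_pred, y_true : arrays of predicted / true labels
    z              : binary mask matrix, z[i, j] = 1 if feature j was
                     consulted for instance i
    lam            : accuracy/sparsity trade-off (lambda in the equation)
    """
    loss = (y_pred != y_true).astype(float)   # 0/1 loss Delta per instance
    penalty = lam * z.sum(axis=1)             # lambda * ||z(x_i)||_0
    return float(np.mean(loss + penalty))     # average over the dataset

# Toy check: 3 instances, one misclassified, masks using 2, 1, 3 features.
y_true = np.array([0, 1, 1])
y_pred = np.array([0, 1, 0])
z = np.array([[1, 1, 0], [0, 1, 0], [1, 1, 1]])
obj = datum_wise_objective(y_pred, y_true, z, lam=0.1)
# obj = (0 + 0.2 + 0 + 0.1 + 1 + 0.3) / 3 ≈ 0.533
```

Note that, unlike a global L₀ penalty, the term λ‖zθ(xᵢ)‖₀ is averaged inside the sum, so each instance pays only for the features it actually consulted.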
To solve the combinatorial problem, the authors cast classification as a deterministic Markov Decision Process (MDP). A state is a pair (x, z) where x is the input vector and z records which features have already been observed. Two types of actions are available: (i) “feature‑selection” actions that reveal an unobserved feature (incurring a fixed cost −λ), and (ii) “classification” actions that output a label (reward 0 for correct prediction, −1 for error). The episode terminates once a classification action is taken. The cumulative reward of an episode is exactly the negative of the objective in the equation above, establishing an equivalence between reward maximization and loss minimization.
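The episode structure above can be sketched in a few lines; the `run_episode` helper and the toy policy below are hypothetical names used for illustration, assuming actions are encoded as `("feature", j)` or `("classify", label)` tuples:

```python
import numpy as np

def run_episode(x, y_true, policy, lam):
    """One episode of the deterministic MDP: reveal features, then classify."""
    z = np.zeros(len(x), dtype=int)       # mask of features revealed so far
    total_reward = 0.0
    while True:
        action = policy(x, z)             # the policy only sees (x, z)
        if action[0] == "feature":        # ("feature", j): reveal x[j]
            z[action[1]] = 1
            total_reward -= lam           # fixed acquisition cost -lambda
        else:                             # ("classify", label): terminal
            total_reward -= 0.0 if action[1] == y_true else 1.0
            return total_reward, z

# Toy policy: reveal feature 0, then predict the sign of its value.
policy = lambda x, z: ("feature", 0) if z[0] == 0 else ("classify", int(x[0] > 0))
r, z = run_episode(np.array([0.7, -1.2]), y_true=1, policy=policy, lam=0.1)
# r = -0.1 (one feature cost, correct prediction), z = [1, 0]
```

The episode return is -(Δ + λ‖z‖₀) for that instance, which is how reward maximization lines up with the penalized risk.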
Because the state‑action space is infinite (continuous x, exponentially many z), the authors approximate the action‑value function sθ(x, z, a) with a linear model: sθ(x, z, a)=θᵀΦ(x, z, a). They construct Φ using a “block‑vector” trick: the feature representation φ(x, z) (which concatenates the observed components of x with the binary mask z) is placed in a high‑dimensional vector at a location that depends on the action a. This encoding ensures that a linear classifier can differentiate between different actions while sharing parameters across the whole policy.
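A small sketch of this block-vector encoding, assuming actions are indexed 0…A−1 and φ(x, z) masks out unobserved values before concatenating the mask (function names are illustrative):

```python
import numpy as np

def phi(x, z):
    """Per-state representation: observed values concatenated with the mask."""
    return np.concatenate([x * z, z])     # unobserved components are zeroed

def Phi(x, z, a, n_actions):
    """Place phi(x, z) in the block corresponding to action a."""
    block = phi(x, z)
    out = np.zeros(n_actions * block.size)
    out[a * block.size:(a + 1) * block.size] = block
    return out

x = np.array([0.5, -2.0, 1.0])
z = np.array([1.0, 0.0, 1.0])            # features 0 and 2 observed
v = Phi(x, z, a=1, n_actions=2)
# v = [0, 0, 0, 0, 0, 0, 0.5, 0, 1, 1, 0, 1]  (block 0 zeroed, block 1 filled)
```

A single weight vector θ then scores any action as θᵀΦ(x, z, a): because each action occupies its own block, θ effectively holds one sub-vector of weights per action while the whole policy remains one linear model.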
Learning proceeds via Approximate Policy Iteration (API) with Monte‑Carlo rollouts. Starting from an initial policy πθ₀, the algorithm samples trajectories on the training set, computes the empirical return for each visited state‑action pair, and then fits θ by solving a supervised regression problem that predicts these returns. The updated θ defines a new greedy policy πθ, and the process repeats until convergence. This iterative scheme gradually shapes the policy to acquire only those features that are expected to improve the final classification, while keeping the total number of acquired features low.
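The loop above can be condensed into a self-contained toy, shown below on a one-feature binary task. This is a sketch of the general API-with-rollouts scheme under simplifying assumptions (ε-greedy exploration, a per-block bias term, least-squares regression as the supervised step); it is not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=200)                  # 1-D inputs
Y = (X > 0).astype(int)                   # label is the sign of the feature
LAM, N_ACT, DIM = 0.05, 3, 3              # actions: 0=reveal, 1/2=classify 0/1

def features(x, z, a):                    # block-vector encoding + bias
    out = np.zeros(N_ACT * DIM)
    out[a * DIM:(a + 1) * DIM] = [x * z, z, 1.0]
    return out

def rollout(theta, x, y, eps=0.1):
    """Run one episode; return per-step features and Monte-Carlo returns."""
    z, phis, rewards = 0, [], []
    while True:
        legal = [1, 2] if z else [0, 1, 2]          # reveal at most once
        if rng.random() < eps:
            a = int(rng.choice(legal))              # exploration
        else:
            a = max(legal, key=lambda b: theta @ features(x, z, b))
        phis.append(features(x, z, a))
        if a == 0:
            z = 1
            rewards.append(-LAM)                    # acquisition cost
        else:
            rewards.append(0.0 if (a - 1) == y else -1.0)
            # undiscounted returns-to-go, summed from the end of the episode
            return phis, list(np.cumsum(rewards[::-1])[::-1])

theta = np.zeros(N_ACT * DIM)
for _ in range(10):                       # approximate policy iteration
    Phis, rets = [], []
    for x, y in zip(X, Y):
        p, r = rollout(theta, x, y)
        Phis += p
        rets += r                         # regression targets = returns
    theta, *_ = np.linalg.lstsq(np.array(Phis), np.array(rets), rcond=None)
```

After a few sweeps the greedy policy on this toy reveals the feature and then classifies by its sign, illustrating how the regression step shapes the policy toward cheap, accurate episodes.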
The authors evaluate DWSC on 14 publicly available datasets covering binary and multi‑class problems. Baselines include L₁‑regularized SVM and the Least Angle Regression (LARS) algorithm, both of which produce a single sparse linear model. Performance is measured in terms of (i) average number of features used per instance and (ii) classification accuracy. Results show that, for a fixed average feature budget, DWSC consistently attains higher accuracy than the baselines. Notably, on datasets where the data naturally split into distinct sub‑spaces, DWSC automatically learns to use different subsets of features for different sub‑spaces, something a global sparse model cannot do.
The paper discusses several limitations. The current implementation relies on a linear approximation of the action‑value function; extending to nonlinear function approximators (e.g., deep neural networks) could capture more complex decision boundaries. The hyper‑parameter λ, which balances sparsity and accuracy, must be tuned manually; adaptive or Bayesian approaches could alleviate this. Moreover, the computational cost of rollouts grows with the number of features, suggesting the need for more efficient exploration strategies.
In conclusion, the work proposes a fundamentally new notion of sparsity—datum‑wise sparsity—by integrating feature selection and classification into a single sequential decision process. By formulating the problem as an MDP and solving it with approximate reinforcement‑learning techniques, the authors demonstrate that it is possible to achieve comparable or superior predictive performance while using fewer features on average. This approach is especially attractive for resource‑constrained settings such as mobile devices, embedded sensors, or real‑time streaming applications, where the cost of acquiring each feature may be significant. Future research directions include nonlinear policy representations, automated λ selection, and deployment in large‑scale, online environments.