Focus of Attention for Linear Predictors


We present a method to stop the evaluation of a prediction process when the result of the full evaluation is obvious. This trait is highly desirable in prediction tasks where a predictor evaluates all its features for every example in large datasets. We observe that some examples are easier to classify than others, a phenomenon characterized by most of the features agreeing on the class of an example. By stopping the feature evaluation when encountering an easy-to-classify example, the predictor can achieve substantial gains in computation. Our method provides a natural attention mechanism for linear predictors, in which the predictor concentrates most of its computation on hard-to-classify examples and quickly discards easy-to-classify ones. By modifying a linear prediction algorithm such as an SVM or AdaBoost to include our attentive method, we prove that the average number of features computed is O(sqrt(n log 1/sqrt(delta))), where n is the original number of features and delta is the error rate incurred due to early stopping. We demonstrate the effectiveness of Attentive Prediction on MNIST, Real-sim, Gisette, and synthetic datasets.


💡 Research Summary

The paper introduces an “attentive prediction” framework that allows linear predictors to stop evaluating features early when the final classification decision becomes obvious. The authors begin by observing that many examples in large‑scale datasets are easy to classify: the majority of features already point strongly toward one class. For such examples, computing the remaining features yields little additional information, yet traditional linear models (e.g., SVMs, AdaBoost) evaluate every feature for every instance, leading to unnecessary computational cost.

To formalize this intuition, the authors propose a sequential evaluation scheme. Features are processed in a predetermined order, and after each feature the cumulative score (e.g., the weighted sum in a linear classifier) is updated. A statistical stopping rule is applied: if the current partial sum lies outside a confidence interval that guarantees the final decision will not change even if all remaining features were evaluated, the algorithm halts and outputs the current prediction. The confidence interval is derived using concentration inequalities (Markov, Chebyshev, and Bernstein bounds) applied to the sum of bounded random variables representing feature contributions. The key theoretical result shows that, for a user‑specified error tolerance δ (the probability that early stopping changes the final prediction), the expected number of evaluated features per instance is bounded by

$$O\!\left(\sqrt{n \log \tfrac{1}{\sqrt{\delta}}}\right),$$

where $n$ is the total number of features.
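The sequential stopping rule described above can be sketched in a few lines. This is a minimal illustration rather than the paper's algorithm: it assumes each feature contribution `w[i] * x[i]` lies in [-1, 1] and uses a Hoeffding-style confidence radius in place of the paper's Markov/Chebyshev/Bernstein machinery; the function name `attentive_predict` and its interface are our own invention.

```python
import numpy as np

def attentive_predict(w, x, delta=0.01):
    """Evaluate features sequentially, stopping early once the partial
    score is so large that the remaining features are unlikely (with
    probability >= 1 - delta) to flip the sign of the final decision.

    Sketch only: assumes each contribution w[i] * x[i] is in [-1, 1].
    Returns (predicted sign, number of features actually evaluated).
    """
    n = len(w)
    partial = 0.0
    for k in range(n):
        partial += w[k] * x[k]
        remaining = n - (k + 1)
        # Hoeffding-style radius: the sum of the `remaining` bounded
        # contributions stays within this band with prob. >= 1 - delta.
        radius = np.sqrt(0.5 * remaining * np.log(1.0 / delta))
        if abs(partial) > radius:
            return np.sign(partial), k + 1  # easy example: stop early
    return np.sign(partial), n  # hard example: full evaluation
```

On an example whose features all agree, the partial sum escapes the shrinking confidence band after only a fraction of the features, while an example with conflicting features is evaluated in full.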