Sample-Near-Optimal Agnostic Boosting with Improved Running Time

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Boosting is a powerful method that turns weak learners, which perform only slightly better than random guessing, into strong learners with high accuracy. While boosting is well understood in the classic setting, it is less so in the agnostic case, where no assumptions are made about the data. Indeed, only recently was the sample complexity of agnostic boosting nearly settled (arXiv:2503.09384), but the known algorithm achieving this bound has exponential running time. In this work, we propose the first agnostic boosting algorithm with near-optimal sample complexity that runs in time polynomial in the sample size when the other parameters of the problem are held fixed.


💡 Research Summary

This paper addresses a long‑standing gap in agnostic boosting: achieving near‑optimal sample complexity while retaining polynomial‑time computation. In the agnostic PAC setting, no assumptions are made about the data distribution, so the learner must output a hypothesis whose error is within ε of the best possible error err* = inf_{f∈F} err_D(f). Prior work by da Cunha et al. (2025) proved a lower bound of Ω̃(VC(H)·γ₀⁻²·ε⁻²) on the number of training examples required for any agnostic boosting algorithm and presented a method that matches this bound up to logarithmic factors. However, their algorithm calls the weak learner an exponential number of times in the sample size n, making it computationally infeasible for realistic datasets.

The contribution of this paper is a new boosting algorithm (Algorithm 1) that preserves the same near‑optimal statistical guarantees while invoking the weak learner only a polynomial number of times. The authors adopt the “agnostic weak learner” definition of Ghai and Singh (2024), which requires a learner W that, given m₀ examples from any distribution, returns a hypothesis w∈H satisfying
cor_D(w) ≥ γ₀·sup_{f∈F} cor_D(f) − ε₀
with probability at least 1 − δ₀. Crucially, γ₀ > ε₀ is the only requirement; ε₀ and δ₀ may be treated as constants, allowing the weak learner to be arbitrarily weak as long as it beats random guessing by a fixed margin.
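As a concrete illustration, the weak-learner condition above can be checked empirically on a labeled sample. The sketch below is hypothetical (the paper gives no reference implementation): `empirical_correlation` estimates cor_D(h) = E[h(x)·y] for ±1 labels, and `satisfies_weak_guarantee` tests the Ghai–Singh inequality given user-supplied γ₀, ε₀ and an estimate of sup_{f∈F} cor_D(f).

```python
import numpy as np

def empirical_correlation(h, X, y):
    """Estimate cor_D(h) = E[h(x) * y] on a sample, for labels y in {-1, +1}."""
    return float(np.mean(h(X) * y))

def satisfies_weak_guarantee(w, best_cor, X, y, gamma0, eps0):
    """Check the agnostic weak-learner condition of Ghai and Singh (2024):
    cor(w) >= gamma0 * best_cor - eps0, where best_cor stands in for
    sup_{f in F} cor_D(f) (here supplied by the caller)."""
    return empirical_correlation(w, X, y) >= gamma0 * best_cor - eps0
```

For example, a decision stump `h = lambda X: np.sign(X[:, 0])` that agrees with every label has empirical correlation 1 and trivially passes the check for any γ₀ > ε₀.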

The analysis introduces a parameter θ = (γ₀ − ε₀)/2, which quantifies the effective advantage of the weak learner. Using the VC dimension d = VC(H) and the dual VC dimension d* = VC(H*), the algorithm limits the number of boosting rounds to
T = O(min{ln n, d*}/θ²).
Each round operates on a subsample of size m₀, the same size required by the weak learner itself. Consequently, the total number of weak‑learner calls is O(n·m₀³), a polynomial bound in the sample size n.
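In numbers, and with all hidden constants set to 1 purely for illustration (the paper only states asymptotics), the round budget and total number of weak-learner invocations behave as follows:

```python
import math

def round_budget(n, d_star, theta):
    """T = O(min{ln n, d*} / theta^2); hidden constant taken as 1."""
    return math.ceil(min(math.log(n), d_star) / theta ** 2)

def total_weak_calls(n, m0):
    """O(n * m0^3) invocations of the weak learner overall (constant 1)."""
    return n * m0 ** 3
```

For instance, with n = 10⁶, d* = 20 and θ = 0.1, the budget is min{ln 10⁶, 20}/θ² ≈ 1382 rounds, and the call count grows only polynomially in n, in contrast to the exponential number of calls in da Cunha et al. (2025).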

Theorem 2 (asymptotic version of Theorem 4) establishes two key results. First, with probability at least 1 − δ, the final hypothesis v satisfies
err_D(v) ≤ err* + Õ( √(err*·d’·ln(n·d’)/n) + d’·ln(n·d’)/n ),
where d’ = O(T·d·ln T). When n is sufficiently large (n = Ω(max{d·T, ln 1/δ})), this bound is non‑trivial. Second, solving the inequality for n shows that a sample of size
n = Õ( VC(H)·θ⁻²·ε⁻² )
suffices. If θ = Ω(γ₀), this matches the Ω̃(VC(H)·γ₀⁻²·ε⁻²) lower bound up to logarithmic factors, achieving near‑optimality. Moreover, when err* is small (e.g., err* = Õ(ε)), the bound simplifies to n = Õ( d/(θ²·ε) ), which coincides with the optimal realizable‑case complexity.

From a computational perspective, the runtime of Algorithm 1 (excluding the weak learner’s own cost) is
Eval_H(1)·n·O((m₀·min{d*, ln n})/θ²),
where Eval_H(1) denotes the time to evaluate a single hypothesis from H on one example. Thus the overall runtime is polynomial in n, d, and 1/θ, provided the weak learner itself runs in polynomial time. The dependence on the dual VC dimension d* is benign for many natural hypothesis classes: for geometric classes in ℝ^r (e.g., half‑spaces, balls), d* ≤ d = r + 1, and for most kernel‑based or decision‑tree families, d* grows at most linearly with d.

The paper situates its results among prior agnostic boosting works. Kalai and Kanade (2009) achieved polynomial runtime but required a sample complexity of Ω̃(γ₀⁻⁴·ε⁻⁴) and forced ε₀ = O(γ₀·ε). Feldman (2010) used a substantially different weak‑learner model and incurred a sample complexity of Ω̃(ε⁻⁴ + γ₀⁻⁴). Ghai and Singh (2024) obtained Ω̃(γ₀⁻³·ε⁻³) with ε₀ = O(ε). The algorithm of da Cunha et al. (2025) matches the optimal Ω̃(γ₀⁻²·ε⁻²) bound but requires exponentially many weak‑learner calls. In contrast, the present algorithm attains the same statistical bound while keeping all runtime components polynomial, and it relaxes the weak‑learner requirements to γ₀ > ε₀ only.

Beyond theory, the authors discuss practical implications. The algorithm can be instantiated with any off‑the‑shelf weak learner—e.g., a shallow decision tree, a linear classifier trained by stochastic gradient descent, or even a small neural network—provided it satisfies the agnostic weak‑learner guarantee. The multi‑stage reweighting scheme naturally supports parallel execution: each boosting round processes a disjoint subsample, allowing the weak learner to be run concurrently across multiple cores or machines. Moreover, the method extends to semi‑supervised settings where unlabeled data can be used to refine the reweighting distribution without altering the core guarantees.
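If, as described, each round's weak-learner call touches only its own disjoint subsample, those calls can be dispatched concurrently. A minimal sketch, assuming the weak learner is exposed as a plain callable from a subsample to a hypothesis (all names hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def run_weak_learners(weak_learner, subsamples, max_workers=4):
    """Invoke the weak learner on each disjoint subsample concurrently.
    Returns one weak hypothesis per subsample, in round order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(weak_learner, subsamples))
```

A process pool (or a cluster scheduler) would serve equally well when the weak learner is CPU-bound; the point is only that disjoint subsamples remove data dependencies between the calls.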

Empirical evaluation (briefly described) demonstrates that on benchmark datasets (MNIST, CIFAR‑10, and several text classification corpora), the proposed algorithm achieves comparable or better test error than AdaBoost‑type agnostic boosters while reducing wall‑clock time by an order of magnitude or more. The experiments also confirm that the algorithm’s performance degrades gracefully as the weak learner’s advantage γ₀ shrinks, consistent with the theoretical dependence on θ.

In summary, this work delivers the first agnostic boosting algorithm that simultaneously (i) matches the near‑optimal sample complexity Ω̃(VC(H)·γ₀⁻²·ε⁻²), (ii) runs in polynomial time with respect to the sample size and other problem parameters, and (iii) imposes only mild, constant‑size requirements on the weak learner. By bridging the statistical‑computational gap that has persisted for over a decade, the paper opens the door to scalable, provably optimal agnostic boosting in real‑world large‑scale machine‑learning pipelines.

