Estimating Subagging by cross-validation

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

In this article, we derive concentration inequalities for the cross-validation estimate of the generalization error of subagged estimators, for both classification and regression. General loss functions and classes of predictors with both finite and infinite VC-dimension are considered. We slightly generalize the formalism introduced by Dudoit and van der Laan (2003) to cover a large variety of cross-validation procedures, including leave-one-out cross-validation, k-fold cross-validation, hold-out cross-validation (or split sample), and leave-ν-out cross-validation. An interesting consequence is that the deviation probability is bounded by the minimum of a Hoeffding-type bound and a Vapnik-type bound, and is thus smaller than 1 even for small learning sets. Finally, we give a simple rule on how to subag the predictor.


💡 Research Summary

The paper develops a rigorous statistical framework for estimating the generalization error of subagged (subsample‑aggregated) predictors using cross‑validation. Subagging, which builds an ensemble by averaging predictors trained on random subsets of the training data, offers computational savings and variance reduction compared with full‑bagging, but its theoretical error behavior has been poorly understood, especially when the size of the subsets and the number of base learners are varied.

To address this gap, the authors extend the formalism introduced by Dudoit and van der Laan (2003) so that any cross‑validation scheme can be expressed as a choice of a validation index set 𝒱⊂{1,…,n} of size ν. This unified view covers leave‑one‑out (ν=1), k‑fold (ν=n/k), hold‑out (ν=αn), and leave‑ν‑out procedures. Within this generalized cross‑validation setting, a subagged estimator $\hat f^{(B)}$ is defined as the average of B base estimators $\hat f_b$, each trained on an independently drawn subset of size m (m≤n).
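The construction described above can be sketched in a few lines of Python. The function names, the k-fold instantiation, and the choice of subset size are illustrative assumptions, not the paper's notation:

```python
import random

def kfold_validation_sets(n, k, seed=0):
    # Each CV scheme picks validation index sets V within {0, ..., n-1};
    # k-fold is the instance where each set has size v = n // k.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    v = n // k
    return [idx[i * v:(i + 1) * v] for i in range(k)]

def subag(fit, xs, ys, m, B, seed=0):
    # Subagged estimator: average of B base predictors, each trained
    # on an independently drawn subset of size m (m <= n), drawn
    # without replacement.
    rng = random.Random(seed)
    fs = []
    for _ in range(B):
        sub = rng.sample(range(len(xs)), m)
        fs.append(fit([xs[i] for i in sub], [ys[i] for i in sub]))
    return lambda x: sum(f(x) for f in fs) / B

def cv_risk(fit, xs, ys, loss, k=5, m=None, B=10):
    # k-fold cross-validation estimate of the subagged predictor's risk:
    # for each validation set, subag on the remaining points and
    # average the losses over the held-out points.
    n = len(xs)
    total, count = 0.0, 0
    for val in kfold_validation_sets(n, k):
        val_set = set(val)
        tr = [i for i in range(n) if i not in val_set]
        f = subag(fit, [xs[i] for i in tr], [ys[i] for i in tr],
                  m if m is not None else max(1, len(tr) // 2), B)
        for i in val:
            total += loss(f(xs[i]), ys[i])
            count += 1
    return total / count
```

For instance, plugging in a trivial base learner that always predicts the training mean, `cv_risk` returns the k-fold estimate of the subagged predictor's squared-error risk.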

The core contribution is the derivation of two complementary concentration inequalities for the cross‑validation estimate $\widehat R_{CV}$ of the true risk R.

  1. Hoeffding‑type bound – assuming the loss ℓ takes values in a bounded interval, the deviation of $\widehat R_{CV}$ from R is controlled by an exponential bound in the number of held‑out points, with no condition on the complexity of the predictor class.
  2. Vapnik‑type bound – for predictor classes of finite VC‑dimension, a complexity‑dependent bound applies. The final probability bound is the minimum of the two, and is therefore smaller than 1 even for small learning sets.
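As a rough numerical sketch of the "minimum of two bounds" observation, the snippet below uses generic textbook forms of Hoeffding- and VC-type deviation bounds with illustrative constants; these are not the paper's exact inequalities:

```python
import math

def hoeffding_bound(n, eps, M=1.0):
    # Generic Hoeffding bound for an average of n terms bounded in [0, M];
    # illustrative constants, not the paper's exact inequality.
    return 2 * math.exp(-2 * n * eps ** 2 / M ** 2)

def vapnik_bound(n, eps, d):
    # Generic VC-type bound with growth function bounded by (2n)^d + 1;
    # illustrative form only.
    return 4 * ((2 * n) ** d + 1) * math.exp(-n * eps ** 2 / 8)

def combined_bound(n, eps, d, M=1.0):
    # The deviation probability is bounded by the minimum of the two
    # bounds, and trivially by 1, so it stays below 1 even for small n
    # whenever either bound is informative.
    return min(1.0, hoeffding_bound(n, eps, M), vapnik_bound(n, eps, d))
```

For a small sample (say n=10, ε=0.5, d=3) the VC-type bound is vacuous while the Hoeffding-type bound is already far below 1, so the minimum remains informative.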
