Nonsingular subsampling for S-estimators with categorical predictors

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

An integral part of many algorithms for S-estimators of linear regression is random subsampling. For problems with only continuous predictors simple random subsampling is a reliable method to generate initial coefficient estimates that can then be further refined. For data with categorical predictors, however, random subsampling often does not work, thus limiting the use of an otherwise fine estimator. This also makes the choice of estimator for robust linear regression dependent on the type of predictors, which is an unnecessary nuisance in practice. For data with categorical predictors random subsampling often generates singular subsamples. Since these subsamples cannot be used to calculate coefficient estimates, they have to be discarded. This makes random subsampling slow, especially if some levels of categorical predictors have low frequency, and renders the algorithms infeasible for such problems. This paper introduces an improved subsampling algorithm that only generates nonsingular subsamples. We call it nonsingular subsampling. For data with continuous variables it is as fast as simple random subsampling but much faster for data with categorical predictors. This is achieved by using a modified LU decomposition algorithm that combines the generation of a sample and the solving of the least squares problem.

💡 Research Summary

The paper addresses a practical bottleneck in the computation of S‑estimators for linear regression when categorical predictors are present. S‑estimators are robust regression techniques that combine an M‑estimation of regression coefficients with a simultaneous scale estimate, offering high breakdown points and resistance to outliers. A common strategy to obtain a reliable starting point for the iterative refinement is random subsampling: one draws a subset of size p (the number of predictors) and solves the ordinary least‑squares (OLS) problem on that subset. With only continuous predictors, any random p‑tuple almost surely yields a full‑rank design matrix, so the subsample can be used directly.

When categorical variables enter the model, especially those with low‑frequency levels, the probability that a randomly drawn p‑tuple produces a singular design matrix rises dramatically. Singular subsamples cannot be inverted, forcing the algorithm to discard them and repeat the draw. This rejection loop can dominate the computational cost, making S‑estimation impractical for data sets that contain sparse factor levels. Existing work either tolerates the slowdown, increases the subsample size, or resorts to deterministic selection schemes that are computationally heavy.

The authors propose a novel “nonsingular subsampling” algorithm that guarantees every generated subsample is full‑rank, thereby eliminating the rejection step. The key insight is to intertwine subsample construction with a modified LU decomposition that incorporates partial pivoting. The algorithm proceeds as follows:

Randomly permute the rows of the full data set.
Initialise an empty set of selected rows and an empty LU factorisation.
Iterate through the permuted rows, attempting to insert each row into the current LU factorisation. Insertion is performed by updating the L and U matrices; if the pivot element that would be created is zero, the row is linearly dependent on the rows already selected and is therefore rejected.
Continue until exactly p rows have been successfully inserted. At this point the LU decomposition of the p × p design matrix is complete, and the OLS solution can be obtained without ever forming an explicit inverse (solve Ly = b, then Uβ = y).
Use the resulting β as the initial estimate for the standard S‑estimation iteration (scale update, re‑weighting, etc.).

Because the algorithm performs a single LU factorisation of size p, its worst‑case computational complexity remains O(p³), identical to a single OLS solve. In contrast, naïve random subsampling may require many O(p³) solves before a nonsingular subset is found, especially when some factor levels have frequencies below 5 % of the sample. The pivoting step naturally pushes low‑frequency rows toward the end of the selection process, reducing the chance that they cause singularity.

The authors validate the method on three experimental settings: (i) synthetic data with only continuous predictors, (ii) synthetic data mixing continuous and categorical predictors with deliberately imbalanced factor levels, and (iii) real‑world survey data containing several categorical variables. Results show that for continuous‑only data, both the traditional and the new method have comparable runtimes and identical statistical performance. For mixed data, the traditional approach suffers from a high rejection rate (up to 30 % of draws) and consequently a ten‑fold increase in runtime, whereas nonsingular subsampling maintains a low rejection rate (<1 %) and achieves a 4–6× speed‑up. Importantly, the final S‑estimates are statistically indistinguishable across methods, confirming that the speed gain does not come at the cost of robustness or efficiency.

Implementation details are discussed, including numerical stability considerations (scaling of columns before LU, choice of pivot tolerance) and memory usage (the algorithm stores only the current L and U factors). The authors provide reference code in both R and Python, demonstrating that the method can be dropped into existing robust regression pipelines with minimal effort.

In conclusion, nonsingular subsampling removes the dependence of S‑estimation performance on the nature of the predictors. Practitioners no longer need to switch to alternative robust estimators when categorical variables are present; they can rely on a single, fast, and theoretically sound procedure. The paper also outlines future extensions, such as adapting the technique to high‑dimensional settings (p ≫ n), generalized linear models, and mixed‑effects models, where the same principle of simultaneous sample construction and factorisation could yield similar computational benefits.

Nonsingular subsampling for S-estimators with categorical predictors

💡 Research Summary

Comments & Academic Discussion

Leave a Comment