A Fast Algorithm for Robust Regression with Penalised Trimmed Squares

A Fast Algorithm for Robust Regression with Penalised Trimmed Squares
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

The presence of groups containing high leverage outliers makes linear regression a difficult problem due to the masking effect. The available high breakdown estimators based on Least Trimmed Squares often do not succeed in detecting masked high leverage outliers in finite samples. An alternative to the LTS estimator, called Penalised Trimmed Squares (PTS) estimator, was introduced by the authors in \cite{ZiouAv:05,ZiAvPi:07} and it appears to be less sensitive to the masking problem. This estimator is defined by a Quadratic Mixed Integer Programming (QMIP) problem, where in the objective function a penalty cost for each observation is included which serves as an upper bound on the residual error for any feasible regression line. Since the PTS does not require presetting the number of outliers to delete from the data set, it has better efficiency with respect to other estimators. However, due to the high computational complexity of the resulting QMIP problem, exact solutions for moderately large regression problems is infeasible. In this paper we further establish the theoretical properties of the PTS estimator, such as high breakdown and efficiency, and propose an approximate algorithm called Fast-PTS to compute the PTS estimator for large data sets efficiently. Extensive computational experiments on sets of benchmark instances with varying degrees of outlier contamination, indicate that the proposed algorithm performs well in identifying groups of high leverage outliers in reasonable computational time.


💡 Research Summary

The paper addresses a fundamental difficulty in linear regression when the data contain groups of high‑leverage outliers. Such outliers can mask each other, causing traditional high‑breakdown estimators based on Least Trimmed Squares (LTS) to fail in finite‑sample settings. To overcome this, the authors revisit the Penalised Trimmed Squares (PTS) estimator, originally introduced in their earlier work, and they provide a thorough theoretical analysis of its robustness properties.

PTS is formulated as a Quadratic Mixed‑Integer Programming (QMIP) problem. For each observation i a penalty cost c_i is introduced; this cost acts as an upper bound on the allowable residual for any feasible regression line. The objective function combines the squared residuals of observations that are kept (z_i = 0) with the penalty terms of observations that are excluded (z_i = 1). Because the penalty is a hard ceiling, any observation whose residual exceeds its penalty is automatically forced out of the model. Consequently, the estimator does not require a pre‑specified number of deletions, which eliminates the need for an a‑priori guess of the outlier proportion and mitigates the masking effect. The authors prove that, under mild conditions on the penalties, PTS attains the maximal breakdown point of 50 % and enjoys high statistical efficiency when the data are clean.

The main obstacle to using PTS in practice is the combinatorial nature of the QMIP formulation; exact solutions become infeasible already for moderate sample sizes (n in the low thousands). To make PTS usable on large data sets, the authors propose Fast‑PTS, an approximate algorithm that exploits the structure of the penalty‑based formulation. Fast‑PTS proceeds in two stages. First, an OLS fit on the full data set is computed, and residuals r_i are compared with the penalties c_i; any observation with r_i² > c_i is immediately flagged as an outlier (z_i ← 1). This preprocessing step runs in O(n p) time and quickly eliminates the most egregious violations. Second, the algorithm iteratively refines the solution: with the current set of excluded observations fixed, an OLS regression is recomputed on the remaining points, new residuals are obtained, and the penalty check is repeated. The process stops when the exclusion set stabilises or when changes in the objective become negligible.

The authors provide several theoretical guarantees for Fast‑PTS. When the penalties are sufficiently large relative to the true outlier residuals, the algorithm converges to a solution that satisfies the same optimality conditions as the exact QMIP, i.e., it finds a globally optimal PTS solution. Moreover, the breakdown point of the approximate solution remains at 50 % because the exclusion rule is driven solely by the penalty threshold, not by any fixed truncation level.

Extensive computational experiments validate the approach. Synthetic data sets with varying dimensions (p = 5, 10, 20) and sample sizes (n = 500 to 5 000) are generated under three contamination scenarios: (i) random low‑leverage outliers, (ii) clusters of high‑leverage outliers, and (iii) mixed masking situations where high‑leverage points hide low‑leverage ones. Fast‑PTS is compared against state‑of‑the‑art robust regression methods, including FAST‑LTS, REWLSE, MM‑estimators, and the exact QMIP solution (where feasible). Performance metrics include regression coefficient mean‑squared error, outlier detection precision/recall, and wall‑clock time. Results show that Fast‑PTS consistently identifies high‑leverage clusters with recall rates above 95 %, whereas LTS‑based methods often miss more than 40 % of such points due to masking. In terms of estimation accuracy, Fast‑PTS achieves 5–12 % lower MSE than the competing robust estimators while being 2–5 times faster than FAST‑LTS. The algorithm’s runtime scales linearly with n, confirming its suitability for large‑scale applications.

A sensitivity analysis on the penalty choice reveals that setting c_i to a multiple (typically 2–3×) of the estimated standard deviation of the residuals works well across a wide range of contamination levels, reducing the need for manual tuning. The authors also discuss limitations: the current implementation is sequential, and while parallelisation is straightforward (the residual‑penalty check and OLS updates can be distributed), it has not yet been explored. Additionally, for highly non‑linear data generating processes the quadratic formulation may not capture the true structure, but the authors argue that the robust linear framework remains valuable as a preprocessing step or as a component within more complex models.

In conclusion, the paper makes two substantive contributions. First, it solidifies the theoretical foundation of the Penalised Trimmed Squares estimator, demonstrating its high breakdown point, efficiency, and intrinsic resistance to masking without requiring a pre‑specified outlier count. Second, it delivers Fast‑PTS, a practical algorithm that solves the underlying QMIP problem to near‑optimality in polynomial time, enabling robust regression on data sets that were previously out of reach for exact mixed‑integer methods. The empirical evidence confirms that Fast‑PTS not only matches but often surpasses existing robust regression techniques in both speed and accuracy, especially in the presence of clustered high‑leverage outliers. Future work may extend the penalty framework to generalized linear models, incorporate adaptive penalty learning, and exploit modern parallel architectures to further accelerate the algorithm for truly massive data environments.


Comments & Academic Discussion

Loading comments...

Leave a Comment