ROIDS: Robust Outlier-Aware Informed Down-Sampling
Informed down-sampling (IDS) is known to improve performance in symbolic regression when combined with various selection strategies, especially tournament selection. However, recent work found that IDS's gains are not consistent across all problems. Our analysis reveals that IDS performs worse on problems containing outliers: IDS systematically favors including outliers in its subsets, which pushes GP towards solutions that overfit to them. To address this, we introduce ROIDS (Robust Outlier-Aware Informed Down-Sampling), which excludes potential outliers from the sampling process of IDS. ROIDS retains the advantages of IDS without overfitting to outliers and remains competitive on a wide range of benchmark problems. This is also reflected in our experiments, in which ROIDS shows the desired behavior on all studied benchmark problems. ROIDS consistently outperforms IDS on synthetic problems with added outliers as well as on a wide range of complex real-world problems, surpassing IDS on over 80% of the real-world benchmark problems. Moreover, compared to all studied baseline approaches, ROIDS achieves the best average rank across all tested benchmark problems. This robust behavior makes ROIDS a reliable down-sampling method for selection in symbolic regression, especially when outliers may be present in the data set.
💡 Research Summary
The paper investigates a critical weakness of Informed Down‑Sampling (IDS), a popular down‑sampling strategy used to accelerate parent selection in Genetic Programming (GP) for symbolic regression. While IDS improves performance by selecting a diverse subset of training cases based on pairwise distances of error vectors, the authors discover that IDS systematically over‑represents outliers. When outliers are present, IDS frequently includes them in the sampled subset, causing the GP population to overfit to these noisy points and degrading generalization.
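The diversity-driven core of IDS described above can be sketched as a farthest-first traversal over training cases, where each case is represented by its error vector across the evaluated parents. This is a minimal illustration of the idea, not the authors' implementation; the distance metric and starting rule are assumptions here.

```python
import numpy as np

def farthest_first_subset(errors, k, seed=None):
    """Pick k training cases by farthest-first traversal.

    errors: (n_cases, n_individuals) matrix; row i is case i's error
    vector over the evaluated parents. Cases whose error vectors are
    far apart (solved differently by the population) are favored.
    Euclidean distance is an illustrative choice.
    """
    rng = np.random.default_rng(seed)
    n = errors.shape[0]
    chosen = [int(rng.integers(n))]  # start from a random case (assumption)
    # distance from every case to its nearest already-chosen case
    dist = np.linalg.norm(errors - errors[chosen[0]], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dist))   # case farthest from the current subset
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(errors - errors[nxt], axis=1))
    return chosen
```

Because each new case maximizes its distance to the subset chosen so far, an outlier case with an extreme error vector is almost guaranteed to be picked, which is exactly the failure mode the paper identifies.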
To remedy this, the authors propose ROIDS (Robust Outlier‑Aware Informed Down‑Sampling). ROIDS follows the same overall workflow as IDS but adds a lightweight pre‑filtering step: after evaluating a sampled fraction (ρ) of the parent population on all training cases, each case’s error vector is collapsed to its mean error. The top γ · |T| cases with the highest mean error—interpreted as potential outliers—are removed from consideration. The remaining cases (ˆT) are then used to compute the distance matrix and the farthest‑first traversal selects the final down‑sampled subset N. This extra step incurs virtually no additional computational cost because it only requires a mean calculation and a simple ranking.
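The pre-filtering step described above reduces to a mean-and-rank operation. The sketch below follows that description under the stated assumptions (mean error per case, drop the γ·|T| largest); the exact tie-breaking and data layout in the paper may differ.

```python
import numpy as np

def roids_filter(errors, gamma):
    """Drop the gamma-fraction of cases with the highest mean error.

    errors: (n_cases, n_individuals) error matrix from evaluating the
    sampled parent fraction on all training cases. Returns the indices
    of the retained cases (T-hat), which then feed the usual IDS
    distance matrix and farthest-first traversal.
    """
    mean_err = errors.mean(axis=1)           # collapse each error vector
    n_drop = int(gamma * errors.shape[0])    # top gamma * |T| suspected outliers
    if n_drop == 0:
        return np.arange(errors.shape[0])
    keep = np.argsort(mean_err)[:-n_drop]    # remove the n_drop largest means
    return np.sort(keep)
```

As the summary notes, the overhead is a single mean per case plus one sort, which is negligible next to evaluating the population.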
The experimental evaluation comprises two parts. First, synthetic benchmarks (2‑D Nguyen‑6 with even/uneven distributions, and Friedman‑1/2/3) are generated both with and without 5 % injected outliers. Visualization of case inclusion frequencies shows that IDS concentrates on outliers when they exist, whereas ROIDS continues to focus on edge and sparsely populated regions. Second, ten real‑world regression datasets (e.g., concrete compressive strength, housing prices, red wine quality, yacht hydrodynamics) are used to compare ROIDS, IDS, and Random Down‑Sampling (RDS). Performance is measured by test RMSE and average rank across 30 independent runs per dataset.
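The summary does not specify how the 5% outliers are injected; a generic corruption scheme, assumed here purely for illustration, perturbs a random fraction of target values by a large multiple of their standard deviation.

```python
import numpy as np

def inject_outliers(y, frac=0.05, scale=10.0, seed=None):
    """Corrupt a random fraction of targets to act as outliers.

    Illustrative stand-in for the synthetic setup: `frac` of the targets
    are shifted by +/- scale * std(y). The paper's exact corruption
    scheme is not given in this summary and may differ.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float).copy()
    n_out = int(frac * len(y))
    idx = rng.choice(len(y), size=n_out, replace=False)
    y[idx] += scale * y.std() * rng.choice([-1.0, 1.0], size=n_out)
    return y, idx
```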
Results indicate that:
- On outlier‑free synthetic problems, ROIDS matches or slightly improves upon IDS, while both dominate RDS.
- When outliers are present, ROIDS dramatically outperforms IDS, achieving lower RMSE on all eight synthetic variants and never performing worse than IDS.
- Across the ten real‑world benchmarks, ROIDS attains the best average rank (1.7) compared with IDS (3.0) and RDS (2.5). It yields lower error than IDS on over 80 % of the real‑world problems, even when a conventional outlier‑removal preprocessing step is applied beforehand.
- Computational overhead remains negligible; ROIDS runs in essentially the same time as IDS.
The authors discuss the sensitivity parameter γ, noting that overly aggressive outlier removal could discard genuinely informative sparse cases, so γ should be tuned to the data’s noise level. They also acknowledge that the current mean‑error filter assumes a single‑mode distribution of errors; more complex noise structures might benefit from advanced outlier detection techniques combined with ROIDS.
In conclusion, ROIDS successfully combines the diversity‑driven sampling benefits of IDS with robust protection against outlier‑induced overfitting, all at almost zero extra cost. This makes it a compelling default down‑sampling method for GP‑based symbolic regression, especially in real‑world settings where noisy or anomalous observations are common.