SplitWise Regression: Stepwise Modeling with Adaptive Dummy Encoding
Capturing nonlinear relationships without sacrificing interpretability remains a persistent challenge in regression modeling. We introduce SplitWise, a framework that enhances stepwise regression by adaptively transforming numeric predictors into threshold-based binary features using shallow decision trees, but only when such transformations improve model fit, as assessed by the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). This approach preserves the transparency of linear models while flexibly capturing nonlinear effects. Implemented as a user-friendly R package, SplitWise is evaluated on both synthetic and real-world datasets. The results show that it consistently produces more parsimonious and generalizable models than traditional stepwise and penalized regression techniques.
💡 Research Summary
The paper introduces SplitWise, a novel framework that augments traditional stepwise linear regression with automatically generated binary (dummy) variables derived from adaptive thresholding of numeric predictors. The core idea is to use shallow decision trees (maximum depth = 2) to locate optimal split points for each continuous variable, then encode those splits as 0/1 indicators. Whether a split is retained is decided by an information‑theoretic criterion—AIC or BIC—so that only transformations that improve the trade‑off between model fit and complexity are incorporated.
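The core mechanism described above can be illustrated in a few lines. The sketch below (in Python rather than R, purely for illustration; the actual package uses rpart and lm()) finds a stump-style split point for one continuous predictor by minimizing within-group error, encodes it as a 0/1 dummy, and compares the AIC of the dummy model against the plain linear model. All function names and the toy data are invented for this example.

```python
import math

def ols_rss(x, y):
    # Simple OLS of y on a single regressor x (with intercept);
    # returns the residual sum of squares.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx if sxx else 0.0
    a = my - b * mx
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def aic(rss, n, k):
    # Gaussian-likelihood AIC up to an additive constant:
    # n * log(RSS / n) + 2k; the constant cancels when comparing models.
    return n * math.log(rss / n) + 2 * k

def best_split(x, y):
    # Stump-style search: the threshold minimizing the summed
    # within-group squared error (what a depth-1 tree split does).
    best = None
    for t in sorted(set(x))[1:]:
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        sse = sum((v - sum(g) / len(g)) ** 2 for g in (left, right) for v in g)
        if best is None or sse < best[1]:
            best = (t, sse)
    return best[0]

# Toy data with a noisy threshold effect at x = 5.
x = [float(i) for i in range(10)]
y = [1.0, 1.2, 0.9, 1.1, 1.0, 5.1, 4.9, 5.0, 5.2, 4.8]

t = best_split(x, y)
dummy = [1.0 if xi >= t else 0.0 for xi in x]
n, k = len(x), 2  # intercept + one slope coefficient in both models
aic_linear = aic(ols_rss(x, y), n, k)
aic_dummy = aic(ols_rss(dummy, y), n, k)
# The dummy encoding fits this step-shaped relationship better,
# so the transformation would be retained under the criterion.
print(t, aic_dummy < aic_linear)  # → 5.0 True
```

The same comparison, run with BIC (penalty `k * log(n)` instead of `2k`), yields the more conservative variant the paper offers.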
Two operational modes are offered. In Iterative Mode, the algorithm interleaves transformation and variable‑selection steps: at each iteration it evaluates, for every candidate predictor, four forms (null, linear, single‑split dummy, double‑split dummy), computes the resulting AIC/BIC, and either adds, removes, or replaces a term based on the lowest criterion value. This process continues until no single action yields further improvement. In Univariate Mode, each predictor is examined independently, the best representation is chosen, and the set of selected variables is subsequently fed into a conventional stepwise routine. Both modes restrict the number of dummy variables per predictor to at most two, thereby limiting model size and guarding against over‑parameterisation.
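The control flow of Iterative Mode can be sketched as a greedy loop over single-term actions. The Python snippet below is a simplified stand-in: the `score` callback abstracts away the AIC/BIC refit, the candidate set stands in for the per-predictor forms, and the "replace" action is omitted for brevity (it amounts to a drop followed by an add). All names are illustrative, not the package's API.

```python
def iterative_select(candidates, score):
    """Greedy stepwise loop: at each step, try every single-term add or
    drop and commit the one action that lowers the criterion the most;
    stop when no action improves it (mirrors the described Iterative Mode)."""
    selected = set()
    current = score(selected)
    while True:
        best_action, best_score = None, current
        # Try adding each unused candidate form.
        for c in candidates - selected:
            s = score(selected | {c})
            if s < best_score:
                best_action, best_score = ("add", c), s
        # Try removing each currently selected term.
        for c in set(selected):
            s = score(selected - {c})
            if s < best_score:
                best_action, best_score = ("drop", c), s
        if best_action is None:
            return selected, current  # no single action improves the criterion
        kind, c = best_action
        selected = selected | {c} if kind == "add" else selected - {c}
        current = best_score

# Toy demonstration: the "criterion" is distance to a known true support,
# so the loop should recover exactly that support.
candidates = {"x1_linear", "x1_dummy", "x2_linear", "x2_dummy"}
true_support = {"x1_linear", "x2_dummy"}
selected, final_score = iterative_select(candidates, lambda s: len(s ^ true_support))
print(selected == true_support, final_score)  # → True 0
```

In the real package, `score` corresponds to refitting the linear model with lm() and reading off its AIC or BIC; Univariate Mode replaces the joint loop with one independent form choice per predictor followed by conventional stepwise selection.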
Implementation is provided as an open‑source R package called SplitWise. The single user‑facing function splitwise() accepts a formula, data, transformation mode, stepwise direction (forward, backward, or both), and the information criterion. Internally it relies on base R’s lm() and step() for fitting and selection, and on the rpart package for the shallow tree splits. The resulting model object (splitwise_lm) stores metadata on each transformation (cut‑points, dummy definitions) and supplies custom print() and summary() methods that display the final linear equation together with a human‑readable description of all generated dummy variables. The package has minimal dependencies, works across operating systems, and is released under GPL‑3.
Empirical evaluation comprises synthetic data designed to contain known piecewise‑linear relationships and several real‑world regression benchmarks (e.g., mtcars, Boston housing, and domain‑specific datasets from healthcare and finance). SplitWise is compared against a suite of interpretable baselines: classic stepwise regression (all three directions), best‑subset selection via regsubsets(), penalised regressions (LASSO, Ridge, Elastic Net), and the exhaustive model‑averaging approach dredge() from MuMIn. Performance metrics include root‑mean‑square error (RMSE), mean absolute error (MAE), adjusted R², AIC, BIC, and the number of selected predictors.
Results show that SplitWise consistently attains the lowest AIC/BIC scores while selecting 20–30 % fewer predictors than standard stepwise methods. Predictive accuracy (RMSE, MAE) is on par with or slightly better than penalised regressions, especially on datasets where true relationships exhibit clear thresholds. The generated dummy variables are interpretable (e.g., “engine displacement ≥ 101.55”) and often receive statistically significant coefficients, indicating that the method captures meaningful non‑linear effects without sacrificing the global linear structure.
The authors acknowledge limitations: the depth‑2 tree restriction may miss more intricate non‑monotonic patterns; interactions between predictors are not modeled directly, so combined threshold effects could be overlooked; and reliance on AIC/BIC may be overly conservative in very small samples. Future work is outlined to extend the framework to deeper trees, to create interaction‑specific dummy features, and to integrate Bayesian model averaging for uncertainty quantification. Potential extensions to classification problems and high‑dimensional settings are also mentioned.
In summary, SplitWise offers a pragmatic solution for analysts who need the transparency of linear models but also wish to capture salient threshold effects automatically. By embedding adaptive dummy encoding within a rigorous information‑criterion‑driven stepwise procedure, it delivers parsimonious, interpretable, and competitively accurate regression models, making it a valuable addition to the toolbox of applied statisticians and data scientists working in high‑stakes domains.