SR4-Fit: An Interpretable and Informative Classification Algorithm Applied to Prediction of U.S. House of Representatives Elections
The growth of machine learning demands interpretable models for critical applications, yet most high-performing models are “black-box” systems that obscure input-output relationships, while traditional rule-based algorithms such as RuleFit, despite their simplicity, suffer from limited predictive power and instability. This motivated our development of Sparse Relaxed Regularized Regression Rule-Fit (SR4-Fit), a novel interpretable classification algorithm that addresses these limitations while maintaining superior classification performance. Using demographic characteristics of U.S. congressional districts from the Census Bureau’s American Community Survey, we demonstrate that SR4-Fit can predict House election party outcomes with unprecedented accuracy and interpretability. Our results show that while the majority party remains the strongest predictor, SR4-Fit reveals intrinsic combinations of demographic factors that affect prediction outcomes and that could not be interpreted in black-box algorithms such as random forests. The SR4-Fit algorithm surpasses both black-box models and existing interpretable rule-based algorithms such as RuleFit with respect to accuracy, simplicity, and robustness. It generates stable and interpretable rule sets while maintaining superior predictive performance, thus addressing the traditional trade-off between model interpretability and predictive capability in electoral forecasting. To further validate SR4-Fit’s performance, we also apply it to six additional publicly available classification datasets (breast cancer, Ecoli, page blocks, Pima Indians, vehicle, and yeast) and find similar results.
💡 Research Summary
The paper introduces SR4‑Fit, a novel interpretable classification algorithm that merges the rule‑extraction framework of RuleFit with the sparse optimization capabilities of Sparse Relaxed Regularized Regression (SR3). The authors motivate the work by highlighting the need for models that are both transparent and high‑performing, especially in high‑stakes domains such as forecasting U.S. House of Representatives elections, where demographic composition strongly influences outcomes.
Algorithmic Design
SR4‑Fit proceeds in four steps. First, for each class a one‑vs‑rest random forest is trained; every root‑to‑leaf path in each tree is harvested as a logical rule. Rules are encoded as binary indicators, and a user‑defined maximum number of rules (r_max) caps model complexity. Second, the original feature matrix X is concatenated with the rule indicator matrix R to form an extended matrix Z =
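The first two steps described above (harvesting root-to-leaf paths as rules, encoding them as binary indicators, and concatenating them with the original features) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy tree `TREE`, the helper names `harvest_rules`, `rule_fires`, and `extend`, and the choice to apply `r_max` by simple truncation are all hypothetical. In the method as summarized, the trees would come from a one-vs-rest random forest rather than being written by hand.

```python
# Hypothetical toy tree standing in for one tree of a fitted random forest.
# Internal nodes test x[feature] <= threshold; leaves carry no test.
TREE = {
    "feature": 0, "threshold": 0.5,
    "left": {"leaf": True},
    "right": {
        "feature": 1, "threshold": 2.0,
        "left": {"leaf": True},
        "right": {"leaf": True},
    },
}


def harvest_rules(node, path=()):
    """Return every root-to-leaf path as a tuple of (feature, threshold, is_le) conditions."""
    if node.get("leaf"):
        return [path] if path else []
    f, t = node["feature"], node["threshold"]
    left = harvest_rules(node["left"], path + ((f, t, True),))
    right = harvest_rules(node["right"], path + ((f, t, False),))
    return left + right


def rule_fires(rule, x):
    """Return 1 if sample x satisfies every condition of the rule, else 0."""
    return int(all((x[f] <= t) == is_le for f, t, is_le in rule))


def extend(X, rules, r_max):
    """Append the binary indicators of at most r_max rules to each feature row."""
    kept = rules[:r_max]  # assumption: the cap is applied by simple truncation
    return [list(x) + [rule_fires(r, x) for r in kept] for x in X]


X = [[0.2, 1.0], [0.9, 1.5], [0.9, 3.0]]
rules = harvest_rules(TREE)
Z = extend(X, rules, r_max=3)  # each row: original features followed by rule indicators
```

Because the paths of a single tree partition the feature space, each sample activates exactly one rule per tree here; with many trees, the indicator columns from different trees overlap.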