Predicting Mortgage Default with Machine Learning: AutoML, Class Imbalance, and Leakage Control
Mortgage default prediction is a core task in financial risk management, and machine learning models are increasingly used to estimate default probabilities and provide interpretable signals for downstream decisions. In real-world mortgage datasets, however, three factors frequently undermine evaluation validity and deployment reliability: ambiguity in default labeling, severe class imbalance, and information leakage arising from temporal structure and post-event variables. We compare multiple machine learning approaches for mortgage default prediction using a real-world loan-level dataset, with emphasis on leakage control and imbalance handling. We employ leakage-aware feature selection, a strict temporal split that constrains both origination and reporting periods, and controlled downsampling of the majority class. Across multiple positive-to-negative ratios, performance remains stable, and an AutoML approach (AutoGluon) achieves the strongest AUROC among the models evaluated. An extended and pedagogical version of this work will appear as a book chapter.
💡 Research Summary
The paper presents a systematic comparison of several machine learning approaches for predicting mortgage default using a real‑world loan‑level dataset from Fannie Mae. The authors focus on three practical pitfalls that often compromise the validity of model evaluation in this domain: ambiguous default labeling, severe class imbalance, and information leakage arising from the temporal nature of the data.
Data and labeling – The dataset spans more than three decades and contains monthly performance updates for each loan. A record is labeled as a default (positive) when its delinquency status (DLQ_STATUS) is greater than zero, and as a non‑default (negative) when it equals zero. Because the same loan appears in multiple monthly rows, labeling is applied at the row level so that every record receives a consistent label.
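The row-level labeling rule can be sketched as follows. This is a minimal illustration, not Fannie Mae's actual record schema: the field names and the toy records are assumptions, and only the `DLQ_STATUS > 0` rule comes from the summary above.

```python
# Minimal sketch of the row-level default labeling rule: each monthly
# performance record is labeled independently. DLQ_STATUS > 0 marks a
# default (positive); DLQ_STATUS == 0 marks a non-default (negative).
# The record layout below is illustrative, not the real data dictionary.

def label_row(dlq_status: int) -> int:
    """Return 1 (default) if the loan is delinquent in this month, else 0."""
    return 1 if dlq_status > 0 else 0

# The same loan, appearing in several monthly rows, can yield both labels.
records = [
    {"loan_id": "A1", "act_period": "2023-01", "dlq_status": 0},
    {"loan_id": "A1", "act_period": "2023-02", "dlq_status": 2},
    {"loan_id": "B7", "act_period": "2023-01", "dlq_status": 0},
]
labels = [label_row(r["dlq_status"]) for r in records]
print(labels)  # [0, 1, 0]
```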
Leakage‑aware temporal split – Rather than splitting only on origination date or reporting period, the authors impose joint constraints on both ORIG_DATE and ACT_PERIOD. Because a loan's reporting period can never precede its origination, these joint constraints carve three non‑overlapping triangular regions out of the (origination, reporting‑period) plane: training (both dates before November 2023), validation (both dates between November 2023 and June 2024), and testing (both dates from June 2024 onward). This strict split eliminates forward‑looking information, such as post‑origination payment history, that would otherwise inflate validation metrics.
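The dual-date assignment rule can be written down in a few lines. This is an illustrative sketch: the cutoff months follow the summary above, but the exact boundary handling (inclusive vs. exclusive) and the treatment of rows that straddle a boundary are assumptions.

```python
from datetime import date

# Sketch of the dual-date split: a row is assigned to a fold only when BOTH
# its origination date and its reporting period fall inside that fold's
# window, so no training row carries post-cutoff information.

TRAIN_END = date(2023, 11, 1)   # train: both dates before Nov 2023
VALID_END = date(2024, 6, 1)    # valid: both dates in [Nov 2023, Jun 2024)

def assign_fold(orig_date, act_period):
    if orig_date < TRAIN_END and act_period < TRAIN_END:
        return "train"
    if TRAIN_END <= orig_date < VALID_END and TRAIN_END <= act_period < VALID_END:
        return "valid"
    if orig_date >= VALID_END and act_period >= VALID_END:
        return "test"
    return None  # straddles a boundary: dropped to avoid leakage

print(assign_fold(date(2022, 5, 1), date(2023, 3, 1)))   # train
print(assign_fold(date(2022, 5, 1), date(2024, 1, 1)))   # None (dropped)
print(assign_fold(date(2024, 7, 1), date(2024, 8, 1)))   # test
```

Dropping straddling rows (the `None` branch) is what makes the regions non-overlapping: a loan originated before a cutoff but reported after it belongs to neither side.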
Feature selection and preprocessing – All identifiers, timestamps, and payment‑history fields are removed to avoid leakage. The final feature set comprises 26 numeric and 16 categorical variables. Categorical variables with very high cardinality are excluded; ZIP codes are transformed into latitude/longitude coordinates, and LightGBM/AutoGluon receive raw categoricals while other models use one‑hot encoding.
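The two preprocessing paths described above can be sketched as follows. The ZIP-to-coordinate lookup table is a tiny hypothetical stand-in: the paper's summary does not specify which geocoding source was used, and the example entries are assumptions.

```python
# Sketch of the two preprocessing paths: ZIP codes become numeric
# coordinates; categoricals are one-hot encoded only for models that
# require it (LightGBM/AutoGluon receive the raw categorical instead).

ZIP_TO_LATLON = {               # hypothetical entries, for illustration only
    "94105": (37.79, -122.39),
    "10001": (40.75, -73.997),
}

def encode_zip(zip_code):
    """Replace a high-cardinality ZIP code with two numeric coordinates."""
    return ZIP_TO_LATLON.get(zip_code, (None, None))

def one_hot(value, categories):
    """One-hot encode a categorical value for, e.g., logistic regression."""
    return [1 if value == c else 0 for c in categories]

print(encode_zip("94105"))             # (37.79, -122.39)
print(one_hot("FRM", ["FRM", "ARM"]))  # [1, 0]
```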
Class imbalance handling – The raw data exhibit roughly a 100:1 ratio of non‑defaults to defaults. The authors fix the number of positive samples (≈20 k for training, ≈10 k for validation, ≈9 k for testing) and down‑sample the negative class to achieve four positive‑to‑negative ratios: 1:1, 1:2, 1:5, and 1:10. This enables an assessment of how performance varies with different levels of imbalance while also reducing computational cost.
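The fixed-positive downsampling scheme can be sketched in a few lines. The counts below are toy values; the paper fixes roughly 20k training positives and varies the ratio over 1:1, 1:2, 1:5, and 1:10.

```python
import random

# Sketch of fixed-positive downsampling: keep all positives and draw
# ratio-times-as-many negatives uniformly at random from the majority class.

def downsample(positives, negatives, ratio, seed=0):
    """Return positives plus `ratio * len(positives)` sampled negatives."""
    rng = random.Random(seed)
    n_neg = min(len(negatives), ratio * len(positives))
    return positives + rng.sample(negatives, n_neg)

pos = list(range(100))            # 100 defaults (toy)
neg = list(range(100, 10100))     # 10,000 non-defaults (~100:1, as in the raw data)
train_1_to_2 = downsample(pos, neg, ratio=2)
print(len(train_1_to_2))          # 300 rows: 100 positives + 200 negatives
```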
Models compared – Logistic regression (L1 and L2 regularization), Random Forest, XGBoost, LightGBM, and AutoGluon (an AutoML framework) are trained on the temporally split data. XGBoost and LightGBM employ early stopping based on the validation set; AutoGluon uses its “medium” preset, which includes automated preprocessing, hyper‑parameter tuning, and model ensembling.
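The validation-based early stopping used for the boosting models follows a standard patience rule, sketched generically below. The patience value and the score sequence are illustrative assumptions; XGBoost and LightGBM implement this internally via their own early-stopping parameters.

```python
# Generic sketch of validation-based early stopping: stop boosting once the
# validation score fails to improve for `patience` consecutive rounds, and
# keep the model from the best round.

def early_stop(val_scores, patience=3):
    """Return the index of the best round under a simple patience rule."""
    best_round, best_score, waited = 0, float("-inf"), 0
    for i, score in enumerate(val_scores):
        if score > best_score:
            best_round, best_score, waited = i, score, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_round

scores = [0.70, 0.74, 0.78, 0.79, 0.788, 0.787, 0.786]
print(early_stop(scores))  # 3 (the round with the best validation score)
```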
Results – Test‑set AUROC is reported for each model and each down‑sampling ratio. AutoGluon consistently achieves the highest AUROC (0.823 at the 1:2 ratio), outperforming XGBoost (0.811) and LightGBM (0.811) by 1–2 percentage points. Logistic regression lags far behind (≈0.72 AUROC). Performance remains relatively stable across the four imbalance ratios; adding more negative samples does not uniformly improve AUROC. Notably, AutoGluon’s training AUROC (0.957) is substantially higher than its test AUROC, indicating a higher propensity for over‑fitting, which aligns with its greater flexibility and extensive ensembling.
Feature importance – AutoGluon’s built‑in importance analysis (at the 1:2 ratio) highlights four variables contributing more than 5% each: months since origination, primary borrower credit score, current unpaid principal balance, and original unpaid principal balance. These are classic credit‑risk drivers, confirming that despite sophisticated modeling, domain‑specific variables remain paramount.
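AutoGluon's built-in importance is permutation-based: shuffle one feature column and measure the resulting drop in the evaluation score. The standalone sketch below illustrates the idea with a toy rule-based scorer; it is not AutoGluon's actual API, and the feature names and data are invented for illustration.

```python
import random

# Permutation importance in miniature: the score drop after shuffling one
# feature column indicates how much the model relies on that feature.

def permutation_importance(score_fn, rows, col, seed=0):
    """Score drop after shuffling column `col` across the rows."""
    base = score_fn(rows)
    values = [r[col] for r in rows]
    random.Random(seed).shuffle(values)
    permuted = [dict(r, **{col: v}) for r, v in zip(rows, values)]
    return base - score_fn(permuted)

# Toy data and a rule-based "model" that flags low credit scores as defaults.
rows = [
    {"credit": 550, "upb": 200_000, "y": 1},
    {"credit": 720, "upb": 150_000, "y": 0},
    {"credit": 580, "upb": 300_000, "y": 1},
    {"credit": 780, "upb": 100_000, "y": 0},
]
accuracy = lambda rs: sum((r["credit"] < 600) == bool(r["y"]) for r in rs) / len(rs)

credit_drop = permutation_importance(accuracy, rows, "credit")
print(permutation_importance(accuracy, rows, "upb"))  # 0.0: unused feature, no drop
```

An informative feature produces a non-negative drop, while a feature the model ignores (here, `upb`) produces exactly zero.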
Conclusions – The study demonstrates that (1) rigorous leakage control via dual‑date temporal splitting is essential for trustworthy evaluation, (2) down‑sampling the majority class preserves or slightly improves predictive performance while dramatically reducing training cost, and (3) AutoML (specifically AutoGluon) can automatically assemble a high‑performing pipeline that outperforms manually tuned gradient‑boosting models, albeit with a higher risk of over‑fitting. The authors suggest that future work could explore more advanced hyper‑parameter optimization (e.g., Bayesian methods) and evaluate models on the full, imbalanced dataset to further assess generalization.