The impact of class imbalance in logistic regression models for low-default portfolios in credit risk
In this paper, we study how class imbalance, typical of low-default credit portfolios, affects the performance of logistic regression models. Using a simulation study with controlled data-generating mechanisms, we vary (i) the level of class imbalance and (ii) the strength of association between the predictors and the response. The results show that, for a given strength of association, achievable classification accuracy deteriorates markedly as the event rate decreases, and the optimal classification cut-off shifts with the level of imbalance. In contrast, the Gini coefficient is comparatively stable with respect to class imbalance once sample sizes are sufficiently large, even when classification accuracy is strongly affected. As a practical guideline, we summarise attainable classification performance as a function of the event rate and strength of association between the predictors and the response.
💡 Research Summary
The paper investigates how the severe class imbalance typical of low‑default credit portfolios influences the performance of logistic regression models. Recognising that default events are extremely rare in such portfolios, the authors design a controlled simulation study that independently varies (i) the event rate (from 0.5 % up to 10 %) and (ii) the strength of association between predictors and the binary outcome (by adjusting the magnitude of the logistic coefficients). For each combination they generate datasets of varying sizes (5 000, 10 000, 20 000 observations) to explore the interplay between sample size, events‑per‑variable (EPV), and a newly introduced aggregate information value (AIV).
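The data-generating design described above can be sketched in a few lines. This is a hypothetical reconstruction, not the authors' exact mechanism: it assumes i.i.d. standard-normal predictors, coefficients `beta` controlling the strength of association, and an intercept tuned by bisection so the average default probability matches the target event rate.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

rng = np.random.default_rng(42)

def simulate(n, event_rate, beta):
    """Simulate one data set for a controlled-imbalance experiment.

    Sketch only: predictors are i.i.d. standard normal, `beta` sets the
    predictor-response association, and the intercept is found by
    bisection so that the mean default probability equals event_rate.
    """
    X = rng.standard_normal((n, len(beta)))
    eta = X @ beta
    lo, hi = -30.0, 30.0
    for _ in range(100):                      # bisection on the intercept
        b0 = 0.5 * (lo + hi)
        if expit(b0 + eta).mean() > event_rate:
            hi = b0
        else:
            lo = b0
    y = rng.binomial(1, expit(b0 + eta))      # draw default indicators
    return X, y

# e.g. 20 000 observations at a 0.5 % event rate, three predictors
X, y = simulate(20_000, 0.005, np.array([0.8, 0.5, 0.3]))
```

With this setup the event rate and the coefficient magnitudes can be varied independently, which is what allows the two effects to be disentangled.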
Methodologically, the study builds on the well‑known weight‑of‑evidence (WoE) and information value (IV) concepts used in credit scoring. IV quantifies the discriminative power of a single predictor, while AIV aggregates the IVs of all predictors under the assumption of conditional independence, effectively measuring the signal‑to‑noise ratio of the whole model. The authors provide explicit formulas for WoE, IV, and AIV, linking them to the Kullback‑Leibler divergence.
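The WoE/IV/AIV quantities can be written down directly. The sketch below uses one common sign convention (non-defaults in the numerator of WoE; some authors swap the roles, which flips the sign of WoE but leaves IV unchanged), and the `aiv` helper simply sums single-predictor IVs, which is only a model-level signal measure under the conditional-independence assumption the summary mentions.

```python
import numpy as np

def woe_iv(x_binned, y, eps=1e-6):
    """Weight of evidence and information value for one binned predictor.

    WoE_i = ln(share of non-defaults in bin i / share of defaults in bin i)
    IV    = sum_i (share_nondef_i - share_def_i) * WoE_i
    i.e. a symmetrised Kullback-Leibler divergence between the two
    class-conditional bin distributions.
    """
    bins = np.unique(x_binned)
    nondef = np.array([np.sum((x_binned == b) & (y == 0)) for b in bins], float)
    deflt = np.array([np.sum((x_binned == b) & (y == 1)) for b in bins], float)
    g = nondef / nondef.sum() + eps   # eps guards against empty bins
    d = deflt / deflt.sum() + eps
    woe = np.log(g / d)
    return woe, float(np.sum((g - d) * woe))

def aiv(X_binned, y):
    """Aggregate IV: sum of per-predictor IVs, assuming conditional
    independence of the predictors given the outcome."""
    return sum(woe_iv(col, y)[1] for col in X_binned.T)
```

For a two-bin predictor where 75 % of non-defaults but only 25 % of defaults fall in the first bin, this yields IV = ln 3 ≈ 1.10, illustrating how IV grows with class separation.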
Performance is evaluated with two families of metrics. The first family consists of threshold‑dependent measures derived from the confusion matrix: overall classification accuracy and the optimal probability cut‑off that maximises a chosen utility (e.g., balanced accuracy). The second family is the Gini coefficient (twice the area under the ROC curve minus one), which is threshold‑independent and reflects the model’s intrinsic discriminative ability.
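The two metric families can be contrasted in code. The sketch below computes the Gini coefficient from the Mann-Whitney rank form of the AUC (threshold-free, since it depends only on how the scores rank defaults against non-defaults) alongside a threshold-dependent accuracy at a given cut-off; the function names are illustrative.

```python
import numpy as np
from scipy.stats import rankdata

def gini(y, scores):
    """Gini coefficient = 2*AUC - 1, via the Mann-Whitney rank statistic.

    Threshold-independent: only the relative ordering of defaults vs.
    non-defaults matters, not any probability cut-off.
    """
    r = rankdata(scores)                      # average ranks handle ties
    n1 = int(np.sum(y == 1))
    n0 = len(y) - n1
    auc = (r[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)
    return 2 * auc - 1

def accuracy(y, scores, cutoff):
    """Threshold-dependent counterpart: overall accuracy at one cut-off."""
    return float(np.mean((scores >= cutoff).astype(int) == y))
```

A model that ranks every default above every non-default has Gini = 1 regardless of the cut-off, which is why the Gini can stay stable while cut-off-dependent accuracy degrades under imbalance.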
The simulation results reveal a stark dichotomy. As the event rate declines, overall accuracy deteriorates sharply and the optimal cut‑off shifts toward lower probability values, indicating that the model becomes increasingly biased toward the majority (non‑default) class. In contrast, the Gini coefficient remains remarkably stable across event rates provided the sample size is sufficiently large (≥ 10 000). This suggests that the underlying ranking power of the logistic model is preserved even when the minority class is scarce.
AIV emerges as a key explanatory variable. When AIV is low (e.g., < 0.2) and the event rate is below 1 %, Gini values fall below 0.55, signalling weak discrimination. Conversely, when AIV exceeds 0.3–0.4, Gini stays above 0.65 regardless of the event rate, demonstrating that a strong collective signal from the predictors can offset the adverse effects of imbalance. The authors also confirm that the traditional EPV rule of thumb (EPV ≥ 10) has limited predictive value for classification performance; instead, the combination of event rate and AIV offers a more reliable guide.
Based on these findings, the paper proposes practical guidelines for credit‑risk practitioners: (1) ensure a minimum sample size of 5 000–10 000 observations when the event rate is ≤ 1 %; (2) aim for an AIV of at least 0.3 when selecting or engineering predictors; (3) adjust the decision threshold according to the observed event rate (e.g., use a cut‑off around 0.2–0.3 for a 0.5 % event rate). While the study does not evaluate specific imbalance‑remediation techniques (oversampling, SMOTE, cost‑sensitive learning), it establishes a baseline against which such methods can be objectively compared.
In conclusion, class imbalance dramatically reduces classification accuracy and forces a shift in optimal cut‑offs, but the discriminative capacity measured by Gini remains robust if the dataset is large enough and the predictors collectively carry sufficient information (high AIV). The authors recommend future work to validate the AIV‑based guidelines on real credit datasets and to explore interactions with various resampling or algorithmic mitigation strategies.