An Experimental Study on Fairness-aware Machine Learning for Credit Scoring Problems
The digitalization of credit scoring has become essential for financial institutions and commercial banks, especially in the era of digital transformation. Machine learning techniques are commonly used to evaluate customers’ creditworthiness. However, the predicted outcomes of machine learning models can be biased toward protected attributes, such as race or gender. Numerous fairness-aware machine learning models and fairness measures have been proposed. Nevertheless, their performance in the context of credit scoring has not been thoroughly investigated. In this paper, we present a comprehensive experimental study of fairness-aware machine learning in credit scoring. The study explores key aspects of credit scoring, including financial datasets, predictive models, and fairness measures. We also provide a detailed evaluation of fairness-aware predictive models and fairness measures on widely used financial datasets. The experimental results show that fairness-aware models achieve a better balance between predictive accuracy and fairness compared to traditional classification models.
💡 Research Summary
The paper presents a comprehensive experimental investigation of fairness‑aware machine learning (ML) techniques applied to credit scoring, a critical component of modern financial risk management. Recognizing that conventional ML models can inadvertently encode biases against protected attributes such as gender, age, or marital status, the authors systematically evaluate a suite of fairness‑enhancing methods across multiple public credit datasets.
Problem Formulation and Datasets
Credit scoring is framed as a binary classification task (good vs. bad credit). Five publicly available datasets are selected: Credit Approval (Australian), Credit Card Clients (UCI), Default Payment (Taiwan), German Credit, and PA‑KDD Credit. Selection criteria include the presence of at least one protected attribute, a target label, more than 500 instances, and a documented class imbalance ratio. To assess inherent bias, the authors construct Bayesian Networks for each dataset, revealing direct or indirect dependencies between protected attributes and the outcome variable.
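The dependence the Bayesian Networks uncover can also be probed directly from the data. A minimal, hedged sketch (the records and group labels below are invented for illustration, not drawn from the paper's datasets): compare positive-outcome rates across protected groups and form a disparate-impact ratio, where values far below 1.0 suggest the label depends on the protected attribute.

```python
# Hypothetical sketch: probing dependence between a protected
# attribute and the credit label via group-wise outcome rates.
# The toy records below are illustrative only.

def positive_rate(labels):
    return sum(labels) / len(labels)

# Toy records: (protected_group, label), label 1 = good credit.
records = [
    ("A", 1), ("A", 1), ("A", 0), ("A", 1),
    ("B", 1), ("B", 0), ("B", 0), ("B", 0),
]

by_group = {}
for group, label in records:
    by_group.setdefault(group, []).append(label)

rates = {g: positive_rate(ls) for g, ls in by_group.items()}

# Disparate-impact ratio: min group rate over max group rate;
# values far below 1.0 indicate the outcome skews by group.
di_ratio = min(rates.values()) / max(rates.values())
```

On the toy data the "A" group receives good-credit labels three times as often as "B", the kind of direct dependency the paper's Bayesian Network analysis is designed to surface.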
Fairness‑Aware Model Families
Three families of fairness‑aware approaches are examined:
- Pre‑processing – Learning Fair Representations (LFR) and Disparate Impact Remover (DIR) modify the training data to obscure protected information while preserving predictive structure.
- In‑processing – Agarwal’s cost‑sensitive reduction and AdaFair (a fairness‑aware boosting algorithm) embed fairness constraints directly into the learning objective, allowing the optimizer to trade off error and disparity.
- Post‑processing – Equalized Odds Post‑processing (EOP) and Calibrated Equalized Odds Post‑processing (CEP) adjust the predictions of any trained classifier to satisfy equalized‑odds criteria without retraining.
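The post-processing idea can be illustrated with a simple threshold adjustment. The sketch below is a hedged approximation of the equalized-odds intuition, not the exact EOP or CEP algorithm from the paper: given a trained model's scores, pick a separate decision threshold per protected group so that true-positive rates line up, with no retraining. All scores, labels, and group names are invented.

```python
# Illustrative post-processing sketch: per-group thresholds that
# (approximately) equalize true-positive rates. This is a toy
# stand-in for EOP/CEP, not the algorithms evaluated in the paper.

def tpr_at(scores, labels, threshold):
    # True-positive rate: share of actual positives scored above threshold.
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not positives:
        return 0.0
    return sum(s >= threshold for s in positives) / len(positives)

def pick_threshold(scores, labels, target_tpr, grid):
    # Largest grid threshold whose TPR still meets the target
    # (falls back to the smallest threshold if none qualifies).
    best = grid[0]
    for t in grid:
        if tpr_at(scores, labels, t) >= target_tpr:
            best = t
    return best

grid = [i / 10 for i in range(1, 10)]
# Hypothetical model scores and true labels per protected group.
groups = {
    "A": ([0.9, 0.7, 0.4, 0.2], [1, 1, 0, 0]),
    "B": ([0.8, 0.5, 0.6, 0.1], [1, 1, 0, 0]),
}
# Target the lowest group TPR observed at the default 0.5 cut-off.
target = min(tpr_at(s, y, 0.5) for s, y in groups.values())
thresholds = {g: pick_threshold(s, y, target, grid)
              for g, (s, y) in groups.items()}
```

Because only the decision thresholds change, the underlying classifier's scores (and hence most of its predictive performance) are preserved, which matches the paper's finding that post-processing retains accuracy almost entirely.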
These six methods are benchmarked against traditional classifiers (Logistic Regression, Random Forest, XGBoost) that do not incorporate fairness considerations.
Fairness Metrics
Seven widely cited group‑fairness measures are employed: Statistical Parity (SP), Equal Opportunity (EO), Equalized Odds (EOd), Predictive Parity (PP), Predictive Equality (PE), Treatment Equality (TE), and ABROCA (the absolute area between ROC curves of protected and non‑protected groups). All metrics are scaled so that 0 indicates perfect fairness.
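Several of these measures reduce to absolute gaps in simple per-group rates. A hedged sketch (the predictions and group split below are invented): Statistical Parity compares selection rates, Equal Opportunity compares true-positive rates, Predictive Equality compares false-positive rates, and Equalized Odds requires both of the latter gaps to be small; each is 0 under perfect fairness.

```python
# Sketch of three of the listed group-fairness measures, computed
# from per-group labels and predictions. Data is illustrative only.

def group_rates(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return {
        "selection": sum(y_pred) / len(y_pred),  # P(pred = 1)
        "tpr": tp / pos,                         # P(pred = 1 | y = 1)
        "fpr": fp / neg,                         # P(pred = 1 | y = 0)
    }

g0 = group_rates([1, 1, 0, 0], [1, 0, 1, 0])  # protected group
g1 = group_rates([1, 1, 0, 0], [1, 1, 0, 0])  # non-protected group

sp = abs(g0["selection"] - g1["selection"])  # Statistical Parity gap
eo = abs(g0["tpr"] - g1["tpr"])              # Equal Opportunity gap
pe = abs(g0["fpr"] - g1["fpr"])              # Predictive Equality gap
eod = max(eo, pe)                            # Equalized Odds gap
```

Note how the toy example satisfies Statistical Parity exactly (equal selection rates) while still violating Equal Opportunity and Predictive Equality, which is why the paper reports multiple measures rather than a single one.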
Experimental Findings
- Pre‑processing methods achieve substantial reductions in SP and EO (30‑50% improvement) but incur modest accuracy losses (2‑4%).
- In‑processing approaches maintain accuracy close to baseline (≤2% drop) while delivering 20‑35% reductions in EO and EOd; AdaFair additionally balances class imbalance, yielding higher F1 scores alongside fairness gains.
- Post‑processing techniques preserve the original model’s predictive performance almost entirely and still lower EO/EOd by 25‑40%.
Overall, fairness‑aware models trade a decrease in predictive accuracy (5‑12% on average) for markedly better fairness outcomes (20‑45% metric improvement). The magnitude of benefit varies by dataset; for example, the German Credit data, which exhibits strong gender bias, shows the greatest fairness gains.
Limitations and Future Directions
The study treats protected attributes as binary, overlooking intersectional effects. The dataset pool is limited to predominantly European and Asian sources, raising questions about global generalizability. Moreover, the choice of fairness‑accuracy trade‑off weighting remains somewhat subjective. Future work should explore multi‑attribute intersectionality, continuous fairness constraints, alignment with regulatory standards, and real‑time deployment in production credit‑scoring pipelines.
In sum, the paper demonstrates that integrating fairness‑aware ML techniques into credit‑scoring pipelines can achieve a more equitable balance between predictive performance and nondiscriminatory outcomes, providing a valuable empirical foundation for both researchers and practitioners seeking responsible AI in finance.