
๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.21348
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

Traditional software fairness research typically emphasizes ethical and social imperatives, neglecting that fairness fundamentally represents a core software quality issue arising directly from performance disparities across sensitive user groups. Recognizing fairness explicitly as a software quality dimension yields practical benefits beyond ethical considerations, notably improved predictive performance for unprivileged groups, enhanced out-of-distribution generalization, and increased geographic transferability in real-world deployments. Nevertheless, existing bias mitigation methods face a critical dilemma: while pre-processing methods offer broad applicability across model types, they generally fall short in effectiveness compared to post-processing techniques. To overcome this challenge, we propose Correlation Tuning (CoT), a novel pre-processing approach designed to mitigate bias by adjusting data correlations. Specifically, CoT introduces the Phi-coefficient, an intuitive correlation measure, to systematically quantify the correlation between sensitive attributes and labels, and employs multi-objective optimization to address proxy biases. Extensive evaluations demonstrate that CoT increases the true positive rate of unprivileged groups by an average of 17.5% and reduces three key bias metrics, namely statistical parity difference (SPD), average odds difference (AOD), and equal opportunity difference (EOD), by more than 50% on average. CoT outperforms state-of-the-art methods by three and ten percentage points in single-attribute and multiple-attribute scenarios, respectively. We will publicly release our experimental results and source code to facilitate future research.

📄 Full Content

Empowered by artificial intelligence (AI) techniques such as machine learning (ML) and deep learning (DL), software systems have become increasingly intelligent and widely deployed in critical decision-making scenarios, including justice [4,27,42], healthcare [2, 36,71], and finance [1,3,20]. However, these software systems may produce biased predictions against groups or individuals with specific sensitive attributes, causing significant concern about software fairness [20,28] and potentially violating anti-discrimination laws [77]. In the Software Engineering (SE) community, software fairness is generally considered an ethical issue, and developing responsible, fair software is recognized as an essential ethical responsibility for software engineers [27,30]. In fact, fairness metrics such as Equal Opportunity Difference quantify the performance disparity across groups [14,20]. Meanwhile, performance and consistency are critical software quality attributes explicitly outlined by established quality frameworks, such as McCall's Quality Model [55], Boehm's Quality Model [15], and the FURPS Quality Model [6,38]. Thus, software fairness represents not only a social and ethical concern but also a fundamental software quality issue arising from software performance disparities.

Although substantial progress has been made in the SE community to improve software fairness, one conceptual gap and one significant challenge hinder further advancement [27,56]. The conceptual gap arises because software fairness is predominantly viewed as purely an ethical issue, with research typically motivated by social requirements, policies, and laws [30,77]. This viewpoint overlooks a significant practical benefit: fairness improvement methods can enhance model performance for discriminated groups (unprivileged groups), improve out-of-distribution generalization, and enhance geographical transferability [7,64]. For example, the bias mitigation method FairMask enhances racial fairness in the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) software system, used by US courts to predict recidivism likelihood [4], by improving predictive performance for Black individuals. Such improvements facilitate the deployment of existing software to predominantly Black regions without needing new systems trained on additional datasets. Recognizing software fairness as inherently tied to software performance helps provide strong motivation for fairness research.

The primary challenge is the trade-off between the applicability and effectiveness of existing bias mitigation methods. Pre-processing methods, which mitigate bias by adjusting training data, are model-agnostic and applicable across ML and DL. However, recent empirical studies [26,30] indicate that post-processing methods such as MAAT [27], FairMask [61], and MirrorFair [71] achieve better fairness effectiveness than leading pre-processing methods, such as Fair-SMOTE [20] and LTDD [52], in both ML and DL. At the same time, the applicability of these post-processing methods can be limited by model types due to differences in prediction mechanisms.

To address this dilemma and enable pre-processing methods to achieve state-of-the-art effectiveness in bias mitigation while balancing performance-fairness trade-offs, we propose Correlation Tuning (CoT). CoT adjusts sensitive attribute distributions to tune data correlations explicitly for fairness improvement. Specifically, we utilize the Phi-coefficient [34,60,65,68] to measure correlations between sensitive attributes and labels, guiding dataset adjustments to directly mitigate biases associated with sensitive attributes. Additionally, we employ a multi-objective optimization algorithm to address proxy biases caused by non-sensitive attributes [71,76].

We extensively evaluate CoT across ten widely adopted fairness benchmark tasks involving six ML and DL models. The evaluation compares CoT against five state-of-the-art methods across multiple dimensions, including single and multiple sensitive attribute biases, as well as performance-fairness trade-offs and group-level performance. Our results show that CoT improves the true positive rate on unprivileged groups (e.g., “female”) by an average of 17.5% across all evaluated tasks without significantly compromising overall performance. For single attributes, CoT reduces SPD, AOD, and EOD biases by 46.9%, 58.1%, and 51.0%, respectively, surpassing existing state-of-the-art methods by more than three percentage points. CoT also outperforms the fairness-performance trade-off baseline in 79.9% of cases, exceeding state-of-the-art by three percentage points. In addressing multiple sensitive attributes, CoT reduces intersectional biases ISPD, IAOD, and IEOD by 50.3%, 42.1%, and 29.5%, respectively, surpassing existing methods by more than 10 percentage points on average.

Overall, this work makes the following contributions:

• We analyze software fairness, emphasizing that it is both an ethical concern and a fundamental software quality issue. We also highlight the practical significance of fairness research for the SE community, as fairness-improving methods can adjust software performance for specific groups and enhance the out-of-distribution capabilities of software systems.

• We introduce CoT, a novel pre-processing bias mitigation method that outperforms existing state-of-the-art approaches in fairness improvement and performance-fairness trade-offs.

• We conduct an extensive empirical evaluation of existing methods, investigating the intersectional impacts of bias mitigation techniques on unconsidered sensitive attributes, which are frequently overlooked in previous fairness research.

• We will publicly release our data, results, and source code to foster future research in software fairness.

In this section, we first introduce the background and key concepts of software fairness, then review the related work.

Software fairness concerns fairness bugs in software, which are defined as any imperfections in a software system that cause a discrepancy between actual and required fairness conditions [28,77].

Common fairness bugs often arise in classification tasks that assign class labels to individuals based on a set of features [13,14,47,73]. When these tasks are related to significant social activities or important individual benefits, such as college admissions or hiring, the assigned labels can be categorized as either favorable or unfavorable. These are important concepts in software fairness. A favorable label, such as “good credit”, “high income”, or “no recidivism”, indicates a potential advantage, for example, increased opportunities to receive benefits, for the recipient. In contrast, an unfavorable label, such as “bad credit”, “low income”, or “recidivism”, usually signals potential disadvantages for the individual. Depending on the model’s preference for a specific sensitive attribute (also known as protected attribute, e.g., “sex”, “age”, and “race”) when predicting whether an individual receives a favorable or unfavorable label, individuals with a particular sensitive attribute can be categorized into a privileged group and an unprivileged group. From an ethical perspective, model bias reflects human discrimination in the real world. However, from a software quality perspective, it represents a defect, as the model struggles to correctly assign favorable labels to unprivileged groups and unfavorable labels to privileged groups.

With the increasing prevalence of AI-model-driven software, concerns regarding fairness have received significant attention in both academia and industry [77,78]. As a critical non-functional property tied to software quality and consistency, fairness has garnered heightened interest from the software engineering community [40,50,79]. In response, leading technology companies have established dedicated teams to explore equitable AI algorithms, applications, and software services [10,12,39,48,53,62,63]. For instance, IBM has developed the AIF360 toolkit [8], which incorporates widely used benchmarking datasets, bias mitigation algorithms, and evaluation metrics. Bias mitigation strategies are typically categorized into pre-processing, in-processing, and post-processing techniques [56]. Specifically, pre-processing methods operate on the data prior to model training by, for example, removing sensitive features, applying instance reweighting, or synthesizing new data samples [44,74]. In contrast, in-processing approaches embed fairness objectives directly into the model learning process, often leveraging techniques such as adversarial debiasing [10,75]. Post-processing techniques, on the other hand, enhance fairness by adjusting the model outputs after training, without modifying the data or the underlying model itself; a representative example is Reject Option Classification [45,51].

While pre-processing methods are widely adopted due to their model-agnostic nature and operational simplicity, particularly in software engineering scenarios [13], they often face challenges in balancing fairness and overall model performance. To address data bias, Chakraborty et al. [21] propose Fairway, which removes ambiguous data instances to enhance fairness, albeit at the expense of reduced model performance. Subsequently, methods such as Fair-SMOTE [20] and FairGenerate [43] have been introduced to synthesize new data points, aiming for a better trade-off between performance and fairness; however, performance degradation remains a concern. Addressing this limitation, Li et al. [52] present LTDD, which seeks to repair problematic training data rather than simply removing or generating new data.

Regarding in-processing methods, which are applied during the model training phase and are closely linked to model architectures, recent efforts have predominantly focused on deep neural networks (DNNs) [50]. Sun et al. [66] introduce a causality-based approach, CARE, which identifies neurons with high-bias causal effects and subsequently adjusts their activations.

In the context of post-processing, many recent studies have achieved substantial bias mitigation without severely compromising performance. For instance, Peng et al. [61] propose FairMask, which adjusts sensitive attributes based on an extrapolation model during the model testing stage. Recognizing the dual importance of performance and fairness in AI software, Chen et al. [27] present MAAT, which separately trains a performance model and a fairness model, and then ensembles their predictions to achieve fairer outcomes. Similarly, Xiao et al. propose MirrorFair, which generates a counterfactual dataset by flipping sensitive attributes, trains a mirror model with different biases from the original model, and ensembles the predictions of the two models to counteract bias and yield fairer predictions.

Despite these substantial advancements, several challenges remain. Among the three categories of bias mitigation strategies, pre-processing methods are distinguished by their model-agnostic nature. However, they often struggle to achieve an optimal balance between model performance and fairness. In-processing and post-processing methods, while more flexible in overcoming some of the limitations of pre-processing approaches, are typically dependent on specific model architectures or post-processing procedures, which can limit their capacity to generalize across different models. For instance, state-of-the-art methods such as MirrorFair are difficult to apply to models that do not explicitly output class probabilities, since they depend on post-prediction probability analysis. To address this limitation, our work proposes a pre-processing approach, CoT, that retains model-agnostic applicability while achieving state-of-the-art bias mitigation effectiveness.

In this section, we introduce CoT in detail.

Correlation Tuning (CoT) is a novel pre-processing technique that adjusts the correlation between sensitive attributes and data labels to enhance model performance on unprivileged groups and to achieve two types of trade-offs: (1) balancing model performance between privileged and unprivileged groups, and (2) balancing fairness and performance. We implement correlation tuning by mutating the values of sensitive attributes. Adjusting data features and sensitive attributes is a classical and intuitive strategy that has been applied in prior research [18,33]. However, the challenge of determining which data instances should be adjusted, and how to adjust them, prevents these methods from significantly improving fairness without excessively compromising performance. Our novelty and main contribution lie in addressing this ongoing challenge with two separate implementations: (1) CoT via Phi-coefficient tuning, and (2) CoT via multi-objective optimization. Figure 1 illustrates the overall workflow of the two implementations of CoT; the details are described in the following sections.

Correlation between data features and labels represents the statistical linear and non-linear relationships between input variables and the target label, which forms the foundation for AI models to make predictions. When a model exhibits bias by disproportionately assigning a favorable label (e.g., “good credit”) to a privileged group (e.g., “female”), this is only a superficial manifestation. The underlying cause is that the attribute “female” displays a stronger statistical association with the label “good credit” than “male”, as observed in the training data. Such correlations are determined by the underlying distribution and composition of different groups within the dataset, which may reflect real-world patterns, historical biases, or sampling artifacts.

Therefore, correlation tuning is an approach that addresses bias at its root cause. We choose to adjust the sensitive attribute to tune the correlation rather than oversampling, undersampling, removing specific data instances, synthesizing new data, or modifying data labels. This choice aims to minimize the impact on the original data, because the strategies mentioned above alter the correlations among all features and labels, not just between sensitive attributes and data labels, and we currently lack effective mechanisms to control these negative impacts. More detailed empirical evidence and discussion are provided in Section 5.4 and Section 6. By contrast, tuning the correlation through the adjustment of sensitive attributes does not directly affect the correlation between other features and labels. This gives us greater flexibility in addressing multiple sensitive biases, helping to avoid conflicts and counteracting effects in bias mitigation across different sensitive attributes. In the following sections, we describe how we address the two challenges of correlation tuning by adjusting the sensitive attributes.

The goal of correlation tuning is to enhance the positive correlation between the unprivileged group and the favorable label. Existing approaches often globally adjust all data instances to construct a desired class distribution [18,33]. For example, Optimized Pre-processing (OP) applies an optimization algorithm to learn a conditional probability transformation matrix that maps the original dataset D(X, A, Y) to a new dataset D′(X′, A′, Y′) for bias mitigation. However, such extensive changes, in the absence of effective risk control mechanisms, hinder the achievement of a promising trade-off between performance and fairness.

Unlike these methods, we select the instances I_{x, a=1, y=1} that belong to the privileged group and carry a favorable label as candidates for adjustment. The adjustment simply converts selected candidates from the privileged group to the unprivileged group, as shown below:
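(The equation itself did not survive extraction; the following is a minimal reconstruction based on the surrounding description, where x denotes the non-sensitive features, a the binary sensitive attribute with a = 1 privileged, and y the label with y = 1 favorable.)

$$
I_{x,\;a=1,\;y=1} \;\longrightarrow\; I'_{x,\;a=0,\;y=1}
$$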

The adjustment preserves the non-sensitive attributes and favorable label, and only changes group membership. This straightforward operation directly alters the class distribution, strengthening the correlation between the unprivileged group and the favorable label. The subsequent challenge is to determine which and how many candidates should be adjusted to achieve a suitable balance between performance and fairness, as well as the performance across both the privileged and unprivileged groups.

To address this problem, we employ the Phi-coefficient [60] to determine how many candidates should be adjusted to achieve the desired trade-off. The Phi-coefficient is a significant metric for measuring correlation in binary analysis and is widely used in the software engineering literature [34,60,65,68]. As a special case of Pearson's coefficient, the Phi-coefficient is particularly well-suited to this context: measuring the correlation between sensitive attributes and data labels. Based on the definitions above, the Phi-coefficient between sensitive attributes and data labels is calculated as follows:
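(The formula is not reproduced in this extraction. The standard Phi-coefficient for the 2×2 contingency table between the sensitive attribute A and the label Y, with n_{ay} denoting the number of instances having A = a and Y = y (a = 1 privileged, y = 1 favorable), is shown below; the cell notation n_{ay} is ours, and this is presumably the Eq. (2) referenced later.)

$$
\phi \;=\; \frac{n_{11}\,n_{00} \;-\; n_{10}\,n_{01}}
{\sqrt{(n_{11}+n_{10})\,(n_{01}+n_{00})\,(n_{11}+n_{01})\,(n_{10}+n_{00})}}
$$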

Once the Phi-coefficient for the dataset is obtained, its value ranges within [-1, 1]. According to our definitions, a positive Phi-coefficient means the privileged group has a stronger correlation with the favorable label (the definition of the privileged group aligns with the real situation), while a negative Phi-coefficient means the unprivileged group has a stronger correlation with the favorable label (the definition of the privileged group is opposite to the real situation). Ideally, the target Phi-coefficient is 0, indicating that both privileged and unprivileged groups have the same correlation with the favorable label. In other words, the sensitive attribute has no correlation with the data labels.

In practice, the Phi-coefficient of a real dataset is usually not zero. Instead, based on our definitions, it is typically positive. Therefore, the goal of correlation tuning is to adjust the Phi-coefficient to zero by transforming candidate instances from I_{x, a=1, y=1} to I′_{x, a=0, y=1}. Let P_φ ∈ (0, 1) be the adjustment parameter for the Phi-coefficient and k denote the number of adjusted instances. Using Eq. (2), we can derive the analytical solution for P_φ as follows:
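(The analytical solution is likewise missing from the extracted text. Assuming the adjustment moves k instances from the (a=1, y=1) cell to the (a=0, y=1) cell and targets a Phi-coefficient of zero, setting the numerator of the Phi-coefficient above to zero gives one plausible form; this is a sketch of the derivation rather than the paper's exact Eq. (3).)

$$
(n_{11}-k)\,n_{00} \;-\; n_{10}\,(n_{01}+k) \;=\; 0
\quad\Longrightarrow\quad
k \;=\; \frac{n_{11}\,n_{00} \;-\; n_{10}\,n_{01}}{n_{00}+n_{10}},
\qquad
P_\phi \;=\; \frac{k}{n_{11}}
$$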

With ๐‘ƒ ๐œ™ , we randomly select the corresponding proportion of candidates for instance adjustment, transforming the dataset from ๐ท (๐‘‹, ๐ด, ๐‘Œ ) to ๐ท โ€ฒ (๐‘‹, ๐ด โ€ฒ , ๐‘Œ ) and completing the correlation tuning process based on the Phi-coefficient. We refer to this implementation of CoT as CoT-Phi.

Through correlation tuning via CoT-Phi, we explicitly adjust the correlation between sensitive attributes and labels to a desired value. However, many previous works have revealed that model bias can arise not only from the sensitive attributes themselves, but also from the proxy effects of non-sensitive attributes [31,37,58,59,71,76]. Currently, we lack effective and convenient tools to analyze the correlation among multiple variables, especially when these variables include both categorical and numerical types.

To address the influence caused by non-sensitive attributes, we propose a compensating strategy that dynamically adjusts the Phi-coefficient between sensitive attributes and labels, rather than setting it to zero by default. This design allows us to counteract the effect of non-sensitive attributes by strengthening or weakening the correlation between sensitive attributes and labels. For example, we may build a stronger correlation between the unprivileged group and the favorable label to offset the bias introduced by non-sensitive attributes.

To achieve this, we introduce multi-objective optimization [19,22,72] into our approach to identify the optimal Phi-coefficient value that achieves a balance between model performance and fairness, as well as balanced performance across groups. We define the loss function for the multi-objective optimization process as follows:
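(The loss function itself is not reproduced in this extraction. One plausible scalarized form consistent with the description that follows, assuming equal weights on the five terms, is shown below; the paper's exact formulation and weighting are not preserved here.)

$$
\mathcal{L}(P_\phi) \;=\; -\big(f_1 + f_2\big) \;+\; \big(\lvert f_3\rvert + \lvert f_4\rvert + \lvert f_5\rvert\big)
$$

Under this reading, higher f_1 (F1 score) and f_2 (accuracy) are rewarded, while larger absolute biases f_3 (SPD), f_4 (AOD), and f_5 (EOD) are penalized.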

Here, ๐‘ƒ ๐œ™ denotes the proportion of candidate instances to be adjusted. ๐‘“ 1 denotes the F1 Score metric, ๐‘“ 2 denotes the accuracy metric, ๐‘“ 3 denotes the SPD metric, ๐‘“ 4 denotes the AOD metric, and ๐‘“ 5 denotes the EOD metric. This loss function design comprehensively incorporates both performance and fairness metrics to achieve a better trade-off. Specifically, F1 score and accuracy measure model performance, while SPD, AOD, and EOD serve as metrics of model fairness. Details of these metrics are provided in Section 4.4, and further discussion on the loss function design can be found in Section 6.

We refer to this implementation as CoT-Opt. CoT-Opt can be regarded as a hybrid that combines the simplicity of CoT-Phi with the flexibility of optimization. The key distinction between CoT-Phi and CoT-Opt lies in how the tuning parameter is determined: CoT-Phi fixes it analytically, whereas CoT-Opt dynamically optimizes it with respect to different loss functions. Thus, CoT-Opt is designed as an optimized extension of CoT-Phi.
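A compact way to picture CoT-Opt is as a search over the adjustment proportion. The sketch below uses a plain grid search in place of the Particle Swarm Optimization mentioned in Section 6, and takes the combined performance/fairness loss as a user-supplied callable; both simplifications are ours, not the paper's implementation.

```python
import numpy as np
import pandas as pd
from typing import Callable

def cot_opt(df: pd.DataFrame, sensitive: str, label: str,
            loss_fn: Callable[[pd.DataFrame], float],
            grid: np.ndarray = np.linspace(0.0, 1.0, 21),
            seed: int = 0):
    """Sketch of CoT-Opt: search for the adjustment proportion P_phi that
    minimizes a combined performance/fairness loss, instead of fixing it
    analytically as CoT-Phi does."""
    rng = np.random.default_rng(seed)
    candidates = df.loc[(df[sensitive] == 1) & (df[label] == 1)].index
    order = rng.permutation(candidates)  # fixed order: larger P_phi nests smaller ones

    best_p, best_loss, best_df = 0.0, np.inf, df
    for p in grid:
        k = int(round(p * len(order)))
        adjusted = df.copy()
        adjusted.loc[order[:k], sensitive] = 0  # move k candidates to the unprivileged group
        # loss_fn is expected to train a model on `adjusted` and combine
        # performance (e.g. F1, accuracy) and fairness (SPD, AOD, EOD) terms.
        loss = loss_fn(adjusted)
        if loss < best_loss:
            best_p, best_loss, best_df = p, loss, adjusted
    return best_p, best_df
```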

In real-world AI decision-making scenarios, it is common to encounter cases where more than one sensitive attribute needs bias mitigation. For example, in the COMPAS software application [42], both “sex” and “race” are sensitive attributes that require protection and mitigation of related biases. As a method focused on adjusting sensitive attributes, CoT offers specific advantages in addressing multiple sensitive attributes.

3.4.1 CoT-Phi for Multiple Sensitive Attributes. We can repeat the process used for a single attribute to mitigate multiple sensitive attribute biases. Given an original dataset D(X, A1, A2, Y), where A1 represents “sensitive attribute 1” and A2 represents “sensitive attribute 2,” the adjustment process can be described as follows, using parameters P_φ1 and P_φ2 derived from Eq. (3):
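(The corresponding equations are not preserved in this extraction; a minimal reconstruction, applying the single-attribute transformation independently for each sensitive attribute, is:)

$$
I_{x,\;a_1=1,\;y=1} \;\longrightarrow\; I'_{x,\;a_1=0,\;y=1} \quad\text{for a proportion } P_{\phi_1}\text{ of its candidates},
$$
$$
I_{x,\;a_2=1,\;y=1} \;\longrightarrow\; I'_{x,\;a_2=0,\;y=1} \quad\text{for a proportion } P_{\phi_2}\text{ of its candidates}.
$$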

3.4.2 CoT-Opt for Multiple Sensitive Attributes. Similarly, we apply the same approach as with a single sensitive attribute using CoT-Opt to adjust the dataset from D(X, A1, A2, Y) to D′(X, A1′, A2, Y). We then introduce the following loss function:
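(This loss function is also missing from the extraction. Mirroring the assumed single-attribute form above and the description that follows, a plausible version replaces the single-attribute fairness terms with their intersectional counterparts:)

$$
\mathcal{L}(P_{\phi_1}, P_{\phi_2}) \;=\; -\big(f_1 + f_2\big) \;+\; \big(\lvert \mathrm{ISPD}\rvert + \lvert \mathrm{IAOD}\rvert + \lvert \mathrm{IEOD}\rvert\big)
$$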

Compared to the loss function in Eq. (4), we replace the single-attribute fairness metrics (SPD, AOD, and EOD) with intersectional fairness metrics (ISPD, IAOD, and IEOD; for detailed introductions to these metrics, refer to Section 4.4). This design enables a better balance between mitigating different types of bias and maintaining performance. Additional discussion on this design is provided in Section 6. With this, we have introduced the methodology of CoT; next, we move to the evaluation.

In this section, we introduce our experimental design to evaluate the performance and applicability of CoT by answering the following research questions (RQs), and then present the benchmark datasets, evaluation metrics, and detailed experimental setups.

• RQ1: What is the impact of CoT on model performance and fairness? This RQ examines how CoT affects overall accuracy, the true positive rate of unprivileged groups, and the SPD, AOD, and EOD metrics relative to the original models.

• RQ2: How effective is CoT in improving fairness compared with existing methods? This RQ compares CoT against state-of-the-art bias mitigation methods in terms of the proportion of scenarios with fairness improvement and the magnitude of changes in fairness metrics.

• RQ3: What is the performance-fairness trade-off of CoT compared with existing methods? This RQ investigates the advantages of CoT in balancing model performance and fairness compared to existing methods, using the Fairea Trade-off measurement tool.

• RQ4: What is the effectiveness of CoT in handling multiple sensitive attributes? This RQ examines the potential negative side effects of fairness-improving methods on unconsidered sensitive attributes and evaluates the effectiveness of CoT in mitigating multiple sensitive attribute biases and intersectional bias.

• RQ5: What is the robustness of CoT under overfitting risks and realistic data conditions? This RQ investigates the robustness of CoT by evaluating whether its effectiveness remains consistent when facing overfitting, missing values, noise, and outliers.

Following recent empirical studies on the fairness of machine learning software, we conducted our experiments on five benchmark datasets using the IBM AIF360 framework [30]. These datasets originate from diverse domains, including finance, criminal justice, and healthcare, and have been widely adopted within the fairness research community. The Adult Income dataset [1] (also known as the Adult dataset) contains individual information derived from the 1994 U.S. Census, with the objective of predicting whether an individual’s income exceeds $50K. The ProPublica Recidivism dataset [4] (also referred to as the Compas dataset) comprises the criminal history of defendants, aiming to predict their potential for re-offending in the future. The Default of Credit Card Clients dataset [5] (also referred to as the Default dataset) encompasses information on default payments, demographic factors, credit data, payment history, and bill statements of credit card clients. Lastly, the Medical Survey 2015 and Medical Survey 2016 datasets [2,36] (also known as the Mep1 and Mep2 datasets) contain data regarding American medical care usage, health insurance, and out-of-pocket expenditures, with the goal of predicting healthcare utilization patterns.

These benchmark datasets are well-established and do not require additional pre-processing or cleaning for fairness research. They are particularly valuable as they comprehensively represent real-world scenarios and have been adopted in numerous significant studies [25,26,30,71], providing meaningful insights into the equity and ethical implications of machine learning algorithms.

We adopt five state-of-the-art bias mitigation methods as baselines to demonstrate the effectiveness of CoT, including three pre-processing and two non-pre-processing techniques. These existing methods are derived from recent, significant fairness research published in flagship venues. Fair-SMOTE (FSE'21) [20] generates synthetic data using the SMOTE [23,35] algorithm to mitigate imbalances in the training distribution and reduce bias. LTDD (ICSE'22) [52] enhances fairness by identifying and repairing the biased parts of training data features. FairMask (TSE'23) [61] addresses bias by substituting sensitive attributes in the test set with synthesized values. MirrorFair (FSE'24) [71] creates counterfactual samples by flipping sensitive attributes, trains a mirror model, and ensembles predictions from both models for fairer outcomes. FairGenerate (TOSEM'25) [43] leverages generative modeling to augment training data and improve fairness.

4.4.1 Performance Metrics. We use five widely adopted performance metrics: precision, recall, accuracy, F1-score, and Matthews correlation coefficient (MCC). Table 2 presents the definitions of the model performance metrics used in this paper. The notation TP signifies True Positives, TN True Negatives, FP False Positives, and FN False Negatives. Precision measures the exactness of the model in predicting a certain class; recall measures the sensitivity of the model to a certain class; accuracy measures the overall correctness of model predictions; F1-score is the harmonic mean of precision and recall. MCC, computed as (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)), measures the correlation between observed and predicted classifications, providing a balanced evaluation even when classes are imbalanced.

4.4.2 Fairness Metrics. To comprehensively evaluate both single and intersectional fairness, we adopt three widely used fairness metrics: Statistical Parity Difference (SPD), Average Odds Difference (AOD), and Equal Opportunity Difference (EOD), along with their intersectional counterparts (ISPD, IAOD, IEOD), following prior works [13,27,29,30]. Specifically, SPD measures the difference in favorable prediction rates between unprivileged and privileged groups; AOD is the average of the differences in False Positive Rates (FPR) and True Positive Rates (TPR) across groups; EOD quantifies the TPR gap between unprivileged and privileged groups. Measuring these fairness metrics in effect evaluates performance differences across groups, further reflecting the nature of software fairness issues: a quality defect characterized by performance disparities among user groups.

The formal definitions of these metrics are presented in Table 3, where A denotes the sensitive attribute, Y the ground-truth label, and Ŷ the predicted label. A = 0 and A = 1 indicate the unprivileged and privileged groups, respectively, while Y = 1 represents the favorable outcome. For intersectional metrics, s ∈ S enumerates all intersectional subgroups (e.g., gender and race combinations). The intersectional metrics (ISPD, IAOD, IEOD) extend these notions to the worst-case disparity among all intersectional subgroups in S, capturing fairness at the granularity of sensitive attribute intersections. For instance, in a decision-making task where “sex” and “race” are sensitive attributes to be protected, if “male” and “White” are privileged groups while “female” and “Non-White” are unprivileged groups, intersectional fairness measures the disparities between the “male-White” and “female-Non-White” subgroups.

Table 3: Definition of single and intersectional fairness metrics.

SPD = P(Ŷ = 1 | A = 0) − P(Ŷ = 1 | A = 1)
AOD = ½ [ (FPR_{A=0} − FPR_{A=1}) + (TPR_{A=0} − TPR_{A=1}) ]
EOD = TPR_{A=0} − TPR_{A=1}
ISPD, IAOD, IEOD = the maximum (worst-case) value of the corresponding disparity taken over all pairs of intersectional subgroups s ∈ S
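As an illustration of how these group metrics reduce to simple rate differences, here is a small Python sketch computing SPD, AOD, and EOD from binary labels and predictions; the function and variable names are ours, and A = 0 denotes the unprivileged group.

```python
import numpy as np

def group_rates(y_true, y_pred, mask):
    """Favorable-prediction rate, TPR, and FPR for the group selected by `mask`."""
    y_true, y_pred = np.asarray(y_true)[mask], np.asarray(y_pred)[mask]
    rate = y_pred.mean() if len(y_pred) else 0.0
    tpr = y_pred[y_true == 1].mean() if (y_true == 1).any() else 0.0
    fpr = y_pred[y_true == 0].mean() if (y_true == 0).any() else 0.0
    return rate, tpr, fpr

def fairness_metrics(y_true, y_pred, sensitive):
    """SPD, AOD, and EOD with A = 0 as the unprivileged group and A = 1 as privileged."""
    a = np.asarray(sensitive)
    rate_u, tpr_u, fpr_u = group_rates(y_true, y_pred, a == 0)
    rate_p, tpr_p, fpr_p = group_rates(y_true, y_pred, a == 1)
    spd = rate_u - rate_p
    aod = 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p))
    eod = tpr_u - tpr_p
    return spd, aod, eod
```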

To measure the effectiveness of the tradeoff between model fairness and performance, Hort et al. proposed Fairea [41] at ESEC/FSE 2021, introducing a trade-off baseline using AOD-Accuracy and SPD-Accuracy metrics. To provide a more comprehensive evaluation, Chen et al. extended this baseline to fifteen fairness-performance metrics (combinations of three fairness metrics and five performance metrics) [27]. The baseline categorizes the effectiveness of the fairness-performance trade-off into five levels: “win-win” (improvement in both performance and fairness), “good” (improved fairness, reduced performance, but still surpassing the Fairea Baseline), “inverted” (improved performance but reduced fairness), “poor” (improved fairness, reduced performance, but not surpassing the Fairea Baseline), and “lose-lose” (reduction in both fairness and performance). In this work, we follow prior studies [30,43,71] and adopt the Fairea Trade-off Baseline to evaluate existing methods and CoT.

To enhance the reliability of our experimental results, we follow previous work [27,71] and recent empirical studies [29,30], employing two statistical tools, the Mann-Whitney U-test [54] and Cliff's δ, to analyze our raw results. In RQ2 and RQ4, we use the Mann-Whitney U-test to assess whether improvements in fairness after applying bias mitigation methods are statistically significant. Consistent with previous studies [27,71], we consider improvements statistically significant when the p-value is below 0.05. Additionally, we employ Cliff's δ, which is widely used in SE research [9,30,67], to measure the effect size of the impact. In line with the literature [30], we set our threshold at 0.428; when the absolute value of δ is no less than 0.428, we consider the change to have a large effect.
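For reference, the two statistical tools can be applied to the repeated runs roughly as follows. This is a sketch with placeholder arrays: only scipy's mannwhitneyu is a real library call, and Cliff's δ is hand-rolled from its definition.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def cliffs_delta(xs, ys):
    """Cliff's delta effect size: P(x > y) - P(x < y) over all (x, y) pairs."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    greater = sum(int((x > ys).sum()) for x in xs)
    less = sum(int((x < ys).sum()) for x in xs)
    return (greater - less) / (len(xs) * len(ys))

# Placeholder arrays standing in for a fairness metric (e.g. SPD) measured
# over 20 repeated runs before and after applying a bias mitigation method.
spd_before = np.array([0.15, 0.14, 0.16, 0.13, 0.17] * 4)
spd_after = np.array([0.06, 0.07, 0.05, 0.08, 0.06] * 4)

_, p_value = mannwhitneyu(spd_before, spd_after, alternative="two-sided")
delta = cliffs_delta(spd_before, spd_after)
# Significant improvement with a large effect, using the thresholds from the text.
significant_large = (p_value < 0.05) and (abs(delta) >= 0.428)
```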

Given the prominence of fairness as a rapidly growing field across various research communities, we have rigorously structured our experiments to ensure their reliability. To maintain fairness in comparison, we align our experimental setups, including benchmarking datasets, classification algorithms, evaluation metrics, and experimental environment, with those of recent empirical studies that comprehensively explored existing bias-mitigating methods. The experimental classifiers include Logistic Regression (LR) [70], Support Vector Machine (SVM) [57], Random Forest (RF) [11], and deep neural networks (DNN) [49]. We also include two modern classifiers, XGBoost (XGB) [24] and LightGBM (LGBM) [46], to enhance the robustness and comprehensiveness of our evaluation.

Regarding the implementation of existing methods, we meticulously replicated them using the IBM AIF360 toolkit [8], the source code released by the original authors [43,52], and recent empirical investigations [29,30]. For each task, we conducted 20 repeated experiments to reduce random error, and the mean values across all runs are reported as the final results. We adopted the iteration count as the random seed to split each dataset into a 70% training set and a 30% testing set, following recent empirical studies [30] to enhance the soundness of our evaluation. All experiments involving ML and DL models were conducted on a CPU platform using Python 3.12.
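The evaluation protocol described above amounts to the loop sketched below; dataset loading and the bias mitigation step are omitted, and the classifier and metric are illustrative rather than the paper's full setup.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def repeated_evaluation(data: pd.DataFrame, label: str = "label") -> float:
    """Run the 20-repetition protocol: the run index doubles as the random
    seed for the 70%/30% split, and the mean score over all runs is reported."""
    scores = []
    for run in range(20):
        train_df, test_df = train_test_split(data, test_size=0.3, random_state=run)
        # A bias mitigation method (e.g. CoT) would be applied to train_df here.
        clf = LogisticRegression(max_iter=1000)
        clf.fit(train_df.drop(columns=[label]), train_df[label])
        preds = clf.predict(test_df.drop(columns=[label]))
        scores.append(accuracy_score(test_df[label], preds))
    return float(sum(scores) / len(scores))
```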

In this section, we present our experimental results by answering the RQs. Considering space limitations, we primarily report the statistical analysis results in the paper. All raw results for model fairness and performance across each experimental scenario are available in the supplementary materials.

We answer this RQ through a comprehensive comparison between two implementations of CoT (CoT-Phi and CoT-Opt) and the original model without any bias mitigation methods, across the ML and DL scenarios on ten tasks involving five datasets. Additionally, we specifically analyze the impact of CoT on the model’s ability to assign correct favorable labels to unprivileged groups.

Table 4 presents the accuracy, three fairness metrics, and the true positive rate (TPR) for unprivileged groups. Specifically, CoT, including both CoT-Phi and CoT-Opt, significantly affects the TPR of unprivileged groups, typically increasing its value. This indicates that the model’s tendency to assign favorable labels to unprivileged groups is strengthened. The positive impact of CoT on the unprivileged group does not substantially compromise overall model accuracy, while it significantly reduces bias as measured by the SPD, AOD, and EOD metrics in the ML software. In particular, CoT-Phi increases TPR-U by 17% and decreases SPD, AOD, and EOD by 57%, 55%, and 41%, respectively. CoT-Opt increases TPR-U by 15%, and decreases SPD, AOD, and EOD by 47%, 58%, and 51%, respectively.

Ans. to RQ1: Both CoT-Phi and CoT-Opt significantly enhance the true positive rate for unprivileged groups without substantially compromising overall accuracy, demonstrating strong effectiveness in trading off model performance between different groups. Additionally, this performance trade-off enables CoT to achieve significant improvements in model fairness. Specifically, CoT-Opt decreases the SPD, AOD, and EOD bias by 47%, 58%, and 51%, respectively.

In RQ1, we examined the impact of CoT on model performance and fairness across six models and ten classification tasks. In this RQ, we employ the Mann-Whitney U-test as a statistical analysis tool to provide an overall understanding of the effectiveness of CoT in mitigating data bias and enhancing model fairness. Additionally, we compare the two implementations of CoT with existing methods across 60 scenarios (10 tasks × 6 models) to highlight the advantages of CoT.

Specifically, we evaluate both existing methods and CoT from two perspectives. First, we treat each combination of dataset, sensitive attribute, model, and fairness metric as a distinct scenario. We then measure the proportion of scenarios in which fairness is improved and the effect is significantly large, using the Mann-Whitney U-test and Cliff's δ. Second, we assess the degree of change in fairness metrics, considering both absolute and relative changes. Table 5 presents the proportion of scenarios in which fairness is improved by existing methods and CoT. Among the baselines, MirrorFair achieves the highest proportion for both fairness improvement (92.2%) and fairness improvement with a significantly large effect (77.8%). CoT-Opt exceeds this state-of-the-art by six and three percentage points. Table 6 presents the absolute and relative changes in three fairness metrics for five existing methods and the two implementations of CoT. Different methods demonstrate varying strengths in mitigating bias as measured by the different metrics. Among the baselines, MirrorFair achieves the greatest reduction in SPD bias, by 0.046 (50.9%), and FairGenerate achieves the largest reduction in EOD bias, by 0.049 (43.4%), while MirrorFair demonstrates the highest overall effectiveness across all three types of bias. For CoT, CoT-Phi achieves the greatest reduction in SPD, by 0.054 (57.4%), while CoT-Opt achieves the largest reductions in AOD and EOD, by 0.052 (58.1%) and 0.060 (51.0%), respectively.

Ans. to RQ2: CoT outperforms the state-of-the-art in both the proportion of fairness improvement scenarios and the degree of changes in fairness metrics. Specifically, CoT-Opt improves fairness in 97.8% of scenarios, with 83.3% showing a significantly large effect, and reduces SPD, AOD, and EOD bias by 46.9%, 58.1%, and 51.0%, respectively. This effectiveness surpasses the state-of-the-art by five percentage points in the proportion of fairness improvement scenarios, and by an average of three percentage points in reducing SPD, AOD, and EOD biases.

This RQ explores the effectiveness of balancing model performance and fairness across existing methods and the CoT approaches. We employ hybrid evaluation criteria and metrics, building upon previous research. To ensure consistency with prior work [27,29,41], we utilize the Fairea Baseline to assess the trade-off effectiveness between performance and fairness for each bias mitigation method. The evaluation includes 15 performance-fairness baselines (5 performance metrics × 3 fairness metrics) and 60 decision-making scenarios across ten decision-making tasks and six models. Figure 2 illustrates the effectiveness distribution for existing methods and CoT. Both implementations of CoT outperform the state-of-the-art in exceeding the Fairea trade-off baseline (i.e., falling in the “win-win” and “good” regions). Specifically, CoT-Phi surpasses the trade-off baseline in 79.2% of cases, while CoT-Opt achieves this in 79.9%, outperforming the state-of-the-art method MirrorFair by three percentage points. Additionally, as methods of the same type, both CoT-Phi and CoT-Opt outperform the state-of-the-art pre-processing approach LTDD (61.9% “win-win” + “good” proportion) by more than 18 percentage points.

Ans. to RQ3: Both CoT-Phi and CoT-Opt outperform the state-of-the-art in the model performance-fairness trade-off. Specifically, CoT-Opt surpasses the Fairea Trade-off Baseline in 79.9% of cases, outperforming the top post-processing method MirrorFair by three percentage points and exceeding the leading pre-processing method LTDD by 18 percentage points.

This RQ explores the impact of fairness improvement methods on the bias of unconsidered sensitive attributes when mitigating the bias of considered sensitive attributes, and evaluates the effectiveness of existing methods and CoT in addressing intersectional bias related to multiple sensitive attributes.

The concepts of considered and unconsidered sensitive attributes are defined based on the objective of applying fairness improvement methods. For example, in the Adult dataset, “sex” and “race” are typically regarded as sensitive attributes requiring protection. When the target is to mitigate sex-related bias, “sex” is identified as the considered attribute and “race” as the unconsidered attribute, and vice versa. Table 7 presents the relative changes in bias metrics of considered and unconsidered sensitive attributes achieved by existing methods and CoT. Overall, LTDD, FairMask, MirrorFair, CoT-Phi, and CoT-Opt exhibit much smaller negative side effects (no more than 5%) on the bias of unconsidered attributes than Fair-SMOTE and FairGenerate when mitigating the bias of a considered attribute. Notably, CoT-Opt achieves the highest overall bias reduction on the considered attribute while simultaneously achieving the lowest negative effect on the unconsidered attribute.

The initial LTDD and FairGenerate methods are not applicable to multiple sensitive attributes, and we have not identified an appropriate strategy to generalize them to a multi-attribute context. Therefore, we exclude LTDD and FairGenerate from this comparison. Table 8 presents the absolute and relative changes (in parentheses) in intersectional bias metrics achieved by three existing methods and CoT. CoT-Opt maintains a strong advantage in improving overall intersectional fairness, reducing ISPD, IAOD, and IEOD by 38.5%, 42.2%, and 34.0%, respectively. Additionally, CoT-Opt achieves intersectional fairness improvement in 97.8% of scenarios, with 74.4% of those showing a significantly large effect.

Regarding CoT-Phi, it achieves the largest reduction in ISPD at 50.3%, surpassing MirrorFair by five percentage points and outperforming FairMask by 29 percentage points. CoT-Phi also attains an 83.3% proportion of scenarios with significantly large improvements in intersectional fairness, exceeding the leading method FairMask by 13 percentage points.

Ans. to RQ4: Both CoT-Phi and CoT-Opt achieve the top overall bias reduction on the considered attribute while simultaneously exhibiting the lowest negative effect on the unconsidered attribute. In terms of intersectional fairness improvement, CoT-Opt reduces ISPD, IAOD, and IEOD by 38.5%, 42.2%, and 34.0%, respectively.

This RQ investigates the robustness of CoT under overfitting and realistic data conditions using the modern classifier XGBoost across all ten tasks we studied. For the overfitting scenario, we randomly sample 100 data instances from the original training data for model training to deliberately construct an overfitting setting. For the realistic data conditions, we manually contaminate the training data with missing values, noise, and outliers. Table 9 presents the relative changes in model performance and fairness metrics under overfitting and realistic data conditions. Both conditions consistently reduce model performance: the decreases across the five performance metrics range from 4 to 28 percentage points. Regarding fairness, the AOD and EOD metrics increase, while the SPD metric decreases under both conditions.

In this section, we discuss the implications and advantages of CoT, as well as the threats to the validity of our experimental results.

6.1.5 Scalability. Scalability is an important aspect of any bias mitigation approach. CoT mitigates bias by processing sensitive attributes, meaning that increases in the dimensionality of non-sensitive features have minimal impact. In terms of data size and the number of sensitive attributes, the computational overhead scales linearly. Regarding effectiveness scalability, Table 7 shows that CoT has limited effect on unconsidered sensitive attributes. Therefore, CoT demonstrates good scalability with respect to data size, feature dimensionality, and the number of sensitive attributes.

Since CoT operates at the data level and is model-agnostic, we did not observe any inherent scalability issues. Notably, when CoT is applied in conjunction with language models on tabular tasks, the data must first be converted into prompt-response pairs. In such cases, very high-dimensional feature spaces may approach or exceed the input length limits of current language models, which could constrain the applicability of CoT.

6.1.6 Reflecting the Real-world Distributions. Unlike data augmentation techniques such as Fair-SMOTE [20], which generate entirely virtual data instances, CoT directly modifies the sensitive attributes while keeping all other features and data labels unchanged. As a result, the transformed data continues to closely reflect real-world distributions. This alignment is further supported by the comparison of model performance before and after applying CoT. Nevertheless, we encourage practitioners to restrict the use of CoT-processed data solely to model training, rather than for other purposes.

In Section 3.3, we introduce our default implementation of CoT-Opt for achieving optimal overall bias mitigation effectiveness. In practice, CoT-Opt can be tailored to prioritize specific fairness metrics by modifying the loss function. For example, removing SPD and AOD from the loss function while retaining EOD can enhance effectiveness in reducing EOD bias. Regarding the optimization algorithm, we use Particle Swarm Optimization (PSO) [32] to accelerate the optimization process. Other algorithms, such as Genetic Algorithms (GA) [69], are also applicable. The choice of optimization algorithm does not impact the effectiveness of CoT-Opt, but it can influence optimization speed. Due to page limitations, we are unable to present all of our experimental results for different optimization strategies in this paper. However, the related files are available in our supplementary material.

6.3.1 Internal Validity. Potential threats to the internal validity of our results stem from the selection of benchmark datasets, tasks, evaluation criteria, and baseline methods. Many bias mitigation techniques are evaluated using different benchmark datasets and fairness metrics in their original papers, complicating direct comparisons between CoT and baseline approaches and potentially impacting the validity of our experiments. To address these concerns, we align with prior research [20,21,27,30,61], adopting widely recognized benchmark datasets and employing multiple evaluation metrics. Nevertheless, it should be noted that results may differ slightly from those reported in original studies if new datasets, algorithms, or metrics are introduced.

6.3.2 External Validity. CoT is specifically designed to address bias mitigation in tabular data and classification tasks, which are prevalent in software applications. However, several factors may limit its external validity. While we evaluated CoT across ten diverse benchmark tasks, these datasets primarily represent structured tabular data. The effectiveness and generalizability of CoT to alternative scenarios with unstructured data, such as natural language processing (NLP) or computer vision (CV), require further investigation. For instance, extracting sensitive attributes and data labels from high-dimensional image pixels or complex linguistic structures may present unique challenges. Furthermore, our study mainly considered binary sensitive attributes. In real-world software deployments, sensitive attributes can be non-binary or continuous, and the applicability of the Phi-coefficient in such settings warrants additional empirical validation.

Recently, large language models (LLMs) have demonstrated promising capabilities across various domains [16,17,80]. Therefore, we plan to explore extending CoT to typical LLM tasks such as text classification and reasoning. Since CoT tunes correlations within training data based on explicitly defined sensitive attributes and labels, it is feasible to extend CoT to text classification tasks by extracting sensitive attributes and labels directly from the natural language text. We plan to pursue this direction in our future work.

This paper investigates software fairness, emphasizing that fairness concerns are not only ethical but also constitute a significant software quality issue stemming from performance disparities across user groups. We further highlight the practical significance of software fairness research, as bias mitigation methods empower software systems to dynamically adjust performance across groups, enhancing both out-of-distribution generalization and geographical transferability. We introduce CoT, an effective, widely applicable, engineering-friendly, and efficient pre-processing bias mitigation approach based on the Phi-coefficient and multi-objective optimization. CoT demonstrates clear advantages over current state-of-the-art methods. Additionally, the successful application of the Phi-coefficient and multi-objective optimization in CoT opens up new research opportunities for the AI and SE communities, including balancing software robustness, efficiency, privacy, trustworthiness, and performance.


MCC (๐‘‡ ๐‘ƒ ร— ๐‘‡ ๐‘ -๐น ๐‘ƒ ร— ๐น ๐‘ )/ โˆš๏ธ (๐‘‡ ๐‘ƒ + ๐น ๐‘ƒ)(๐‘‡ ๐‘ƒ + ๐น ๐‘ )(๐‘‡ ๐‘ + ๐น ๐‘ƒ) (๐‘‡ ๐‘ + ๐น ๐‘ ) 4.4.2 Fairness Metrics. To comprehensively evaluate both single and intersectional fairness, we adopt three widely used fairness metrics: Statistical Parity Difference, Average Odds Difference, and

