Discovering the Hidden Role of Gini Index In Prompt-based Classification
Authors: Ruixi Lin
Ruixi Lin, Independent Researcher (ruixi@u.nus.edu)

Abstract

In classification tasks, the long-tailed minority classes usually offer the predictions that are most important. Yet these classes consistently exhibit low accuracies, whereas a few high-performing classes dominate the game. We pursue a foundational understanding of the hidden role of the Gini Index as a tool for detecting and optimizing (debiasing) disparities in class accuracy, focusing on the case of prompt-based classification. We introduce the intuitions, benchmark Gini scores in real-world LLMs and vision models, and thoroughly discuss the insights of Gini not only as a measure of relative accuracy dominance but also as a direct optimization metric. Through rigorous case analyses, we first show that weak to strong relative accuracy imbalance exists in prompt-based text and image classification results, regardless of whether the classification is high-dimensional or low-dimensional. We then harness the Gini metric to propose a post-hoc, model-agnostic bias mitigation method. Experimental results across few-shot news, biomedical, and zero-shot image classification show that our method significantly reduces both relative and absolute accuracy imbalances, minimizing top-class relative dominance while elevating the weakest classes.

Keywords: Gini Index, LLM, Prompt-Based Classification

1 Introduction

Classification tasks form the backbone of modern artificial intelligence, spanning diverse domains from natural language processing to computer vision. Whether classifying text into categories, recognizing objects in images, or assigning labels to multi-modal inputs, the fundamental challenge remains the same: models must learn to correctly distinguish between classes based on training data.
In recent years, large language models (LLMs) and vision-language models have achieved remarkable performance across these tasks, yet they often struggle with a persistent problem—accuracy imbalance across classes—whose causes are rooted in the pretraining data itself. A pervasive phenomenon in real-world data is the long-tailed distribution, where a small number of "head" classes occupy the majority of the dataset while the vast majority of "tail" classes have very few samples. In such cases, the performance of deep learning models is often dominated by the head classes while the learning of the tail classes is severely underdeveloped. This imbalance manifests directly in classification accuracy: models become biased toward frequently seen classes, while rare classes suffer from poor recognition performance. Though the exact classes may not be seen in pretraining, LLM completions can be degenerate, carrying more weight on frequently seen tokens [Chang and Bergen, 2022].

Yet the importance of these minority classes cannot be overstated. In many critical applications—medical diagnosis, fraud detection, anomaly identification, and scientific discovery—the rare categories often carry the highest stakes. A model that performs well on common cases but fails on rare ones may be practically useless or even dangerous. The challenge, therefore, is not merely achieving high average accuracy, but ensuring adequate performance across all classes, particularly those with limited representation.

Traditional approaches to addressing class imbalance have focused primarily on the training data level. Methods incorporating oversampling, undersampling, and data augmentation aim to rebalance the dataset, followed by training with imbalance-aware loss functions [Cao et al., 2019, Cui et al., 2019, Buda et al., 2018].
However, iterative retraining or fine-tuning on rebalanced datasets becomes prohibitively expensive. Moreover, the cost of collecting and annotating sufficient high-quality tail-class examples to achieve natural balance is often high. This suggests a fundamental shift in perspective: rather than fixing imbalance at the data level, we might address it at the output level. The core problem is not merely that training data is imbalanced, but that this imbalance produces systematically distorted output vectors—class predictions that favor head classes and suppress tail classes. If we can detect and correct this distortion in the model's predictions themselves, we may achieve better class fairness without the prohibitive cost of data rebalancing. This points to the need for metrics that can quantify output-level imbalance and methods that can post-hoc optimize for more equitable predictions across classes.

In this paper, we reveal one such metric—the Gini Index. We carefully examine Gini's hidden role as a tool for detecting and optimizing (debiasing) disparities in class accuracy, focusing on the case of prompt-based classification. In particular, we first empirically demonstrate Gini values in representative LLM-based text and image classification scenarios, and then we propose a post-hoc, model-agnostic bias mitigation method to reduce Gini. We show that:

• As measured by the Gini index, weak to strong relative accuracy imbalance exists in prompt-based text and image classification results, regardless of whether the classification is high-dimensional or low-dimensional. Crucially, the non-zero Gini (Gini > 0) demonstrates that relative accuracy imbalance is present in real-world models.

• Reducing relative accuracy imbalance is possible and meaningful.
We achieve this by directly leveraging the Gini index as an optimization metric in a post-hoc, model-agnostic bias mitigation method.

• Both relative and absolute accuracy imbalance are mitigated by our Gini-based bias mitigation method.

2 Related Work

2.1 The Inequality Measure: Gini Index

The Gini index, or Gini coefficient, is the most commonly used measure of inequality [Gini, 1912, Hirschman, 1964]. It was defined based on the Lorenz curve [Lorenz, 1905, Gastwirth, 1972], ranging from 0 to 1, with 0 indicating perfect equality and 1 indicating perfect inequality. The Gini index can be estimated using various approaches, including direct and indirect calculation methods [Heshmati, 2004]. In practice, when measuring income inequality, the choice between direct and indirect Gini index calculation methods is more than a matter of convenience; it reflects statistical robustness under different data conditions [Cowell, 2000]. For example, the income Gini adopts a direct calculation method, which is computed directly from individual or grouped data without assuming an underlying functional form for the income distribution [Yao, 1999]. The income Gini for N individuals (with income denoted by I and average income by γ_I) is defined as:

G_Income = ( Σ_{i=1}^{N} Σ_{j=1}^{N} |I_i − I_j| ) / ( 2 N² γ_I )    (1)

Intuitively, the income Gini highlights top-class relative dominance over the mean. As we will see later when formally defining the Gini index, this direct reciprocal calculation used in the income Gini fits well in the context of measuring class accuracy disparity.

2.2 Prompt-based Classification

LLMs can be viewed as modeling the probability distribution of completion strings (outputs) given prompt strings (inputs). Prompt-based methods enable few-shot learning in LLMs for classification tasks [Brown et al., 2020].
By reformulating classification as a language modeling problem, the LLM predicts a target label token given a prompt template, and these approaches achieve competitive results with minimal labeled data [Schick and Schütze, 2021, Singh and Yannakoudakis, 2025]. Despite their success, studies reveal that prompts are often not interpreted by models in the way humans intend [Khashabi et al., 2022], and they can inadvertently influence model biases in classification decisions [Prabhumoye et al., 2021, Utama et al., 2021, Webson and Pavlick, 2022, Góral et al., 2025, Chaudhary et al., 2025]. From a taxonomic perspective, prompt-based classification methods can be broadly categorized into two lines: prompt design, which focuses on crafting human-readable prompts that map classification tasks to cloze-style questions for a frozen language model [Petroni et al., 2019], and prompt tuning, where continuous soft prompts are learned while the underlying model remains fixed, enabling more flexible adaptation to target class distributions [Lester et al., 2021, Qin and Eisner, 2021, Sanh et al., 2022]. In both paradigms, the core objective remains accurate classification across all classes—a goal that can be undermined when models exhibit systematic accuracy disparities between frequently and rarely occurring label tokens.

2.3 Measuring Class Accuracy Disparities

Lin and You [2024a] propose the COBias metric for evaluating pairwise class accuracy disparities when using LLMs to perform text classification. COBias also enables learning debiasing coefficients that, during inference, plug in to reweight the probability distribution over the prediction token for classification tasks [Lin and You, 2024a,b, Lin et al., 2025]. More recently, Li et al. [2026] explored the use of an adapted Gini coefficient as a metric in multi-agent systems.
In contrast, we treat Gini not as an adapted system-level metric but as a direct measure of class accuracy disparity within prompt-based classification, where it naturally captures relative imbalance across classes, enabling both interpretability and direct optimization.

3 The Gini Index for Relative Class Accuracy Disparity in Prompt-Based Classification

We transfer the Gini index from the commonly used socioeconomic metric of income inequality to a useful metric for class accuracy disparity in prompt-based classification. Below, we define the Gini index metric, illustrate why Gini makes sense with a numerical walkthrough, and compare Gini with a closely related metric, COBias.

3.1 Gini Index Definition

The measurement of class accuracy imbalance strongly resembles inequality between strong and weak classes. Therefore, we can define the Gini index as a metric of accuracy disparity in prompt-based classification results.

The Reciprocal Calculation Method: Because prompt-based classification outputs often do not conform to standard parametric distributions, we adopt the direct, reciprocal Gini index method to measure overall class accuracy disparity. This method computes the Gini index without distributional assumptions—making the Gini Index a well-suited metric for analyzing prompt-based classification outputs.

We introduce the Gini index definition as follows. As a reminder, in prompt-based classification, an LLM models the probability distribution of the completion/answer string (the classification output) given a prompt consisting of task instructions, the instance to be classified, and a question. We follow the common practice of predicting the argmax class using probabilities assigned to label tokens. In detail, after obtaining the answer string's probability distribution over the whole vocabulary, we extract and normalize the probabilities corresponding to label tokens to form class probabilities, i.e., p_m = (p_m1, ..., p_mN) for instance x_m and N classes. The prediction ŷ_m is then argmax_{i ∈ {1,...,N}} p_mi. We can then compute the accuracy for each class using the ground-truth instances to evaluate class-specific performance. Concretely, let A_i denote the accuracy for class i, i ∈ {1, ..., N}, and let γ_Acc represent the average class accuracy:

γ_Acc = (1/N) Σ_{i=1}^{N} A_i    (2)

As an analogy, A_i resembles a family's income, and the average class accuracy is similar to the average income over N classes (families). We can then similarly define the Gini index over class accuracies in classification, which we term the Classification tasks Gini, or G_CLS, analogous to the income Gini. The G_CLS metric is mathematically defined as follows:

G_CLS = ( Σ_{i=1}^{N} Σ_{j=1}^{N} |A_i − A_j| ) / ( 2 N² γ_Acc )    (3)

Interpretation of the Gini Scale (Range: [0, 1]): Gini = 0 indicates perfect fairness: all classes have equal accuracy; or, in a perfectly equal society, the income difference is always 0. Gini approaching 1 (as the population size approaches infinity) corresponds to maximal disparity: the accuracy gap between one class and all the rest reaches the largest possible value; or, in a perfectly unequal society (one person has everything), the average income difference approaches twice the mean¹. Gini > 0.4 usually indicates strong relative imbalance [Jin et al., 2015].

¹ See Appendix A for derivations.

3.2 A Numerical Walkthrough of the Gini Index Calculation

Using the above definition, we present a step-by-step numerical demonstration of the Gini Index to illustrate how it works and why it makes sense. Below is the step-by-step breakdown of the G_CLS formula.

The Numerator: Σ_{i=1}^{N} Σ_{j=1}^{N} |A_i − A_j|. This is the total sum of absolute class accuracy differences.
For each class i, it calculates the absolute difference between its accuracy and every other class j's accuracy. This sum captures total inequality in the population (classes). If every class had the same accuracy, the sum would be zero. The more spread out the accuracies, the larger this sum becomes.

Example A with 4 classes with accuracies [1, 0, 0, 0]: Sum_A = 3 + 1 + 1 + 1 = 6
Example B with 4 classes with accuracies [0.8, 0.2, 0, 0]: Sum_B = 2.2 + 1 + 1 + 1 = 5.2
Example C with 4 classes with accuracies [1, 1, 0, 0]: Sum_C = 2 + 2 + 2 + 2 = 8

Lin and You [2024a] use a similar numerator (the sum over only the unordered distinct pairs, i < j), which is divided by the number of class pairs, C(N, 2). This mean absolute difference over all unordered pairs forms the COBias metric. The advantage of COBias is its directness in representing inequalities between pairs of classes, but it omits normalization by the mean accuracy, as we will see next.

The Denominator: 2 N² γ_Acc. This denominator provides the scaling by population and mean.

N²: There are N × N = N² pairs in the numerator's double sum. Dividing by N² turns the total sum into an average pairwise difference, i.e., the mean absolute difference over all ordered pairs.

γ_Acc: Dividing by this mean accuracy makes the measure scale-invariant; this is what COBias lacks. The average pairwise difference divided by γ_Acc is the "relative mean difference". If every class accuracy doubled, γ_Acc would double and the sum would double, so the ratio stays the same. This is crucial—inequality shouldn't change just because everyone gets richer by the same proportion.

2: By conventional definition, the Gini index is half the relative mean difference [Gini, 1912]. The division by 2 is embedded in the normalization to bound the index between 0 and 1.

Putting it together, the Gini indices for the numerical examples are:

Example A with accuracies [1, 0, 0, 0]: G_CLS_A = Sum_A / (2 × 4² × 0.25) = 6/8 = 0.75
Example B with accuracies [0.8, 0.2, 0, 0]: G_CLS_B = 5.2 / (2 × 4² × 0.25) = 5.2/8 = 0.65
Example C with accuracies [1, 1, 0, 0]: G_CLS_C = Sum_C / (2 × 4² × 0.5) = 8/16 = 0.5

Intuitively, the Gini index emphasizes how much the top class dominates relative to the mean, not absolute values. Gini is not higher for larger absolute gaps, but when the top class exceeds the mean more. In summary, Gini (0.75 for [1, 0, 0, 0], 0.65 for [0.8, 0.2, 0, 0], 0.5 for [1, 1, 0, 0]) penalizes proportional dominance, not absolute gaps, so a split like [1, 1, 0, 0] (where top classes share dominance) scores lower than a monopoly [1, 0, 0, 0]. Even though [0.8, 0.2, 0, 0] has a larger absolute gap (0.6) between its top two classes than [1, 1, 0, 0] (0), its Gini index is higher than [1, 1, 0, 0]'s not just because of the larger gap, but because its top class (0.8) remains heavily dominant relative to the mean (3.2× the mean), whereas [1, 1, 0, 0]'s top class has less relative concentration (2× the mean).

3.3 Comparisons Between the Gini Index and COBias

The two metrics serve as tools for different diagnostic or optimization goals. The core rationale is an optimization/bias-mitigation target priority—Gini penalizes relative concentration (strong concentration when inequality relative to the mean is high); COBias penalizes absolute pairwise gaps (large raw differences between classes), and which matters depends on the bias pattern of interest. Recall the COBias definition: for class accuracies A = (A_1, ..., A_N) across N classes, the COBias metric is the mean absolute difference over all distinct unordered pairs:

COBias = ( 2 / (N(N−1)) ) Σ_{1 ≤ i < j ≤ N} |A_i − A_j|

4.1 Case Analysis 1: Text Classification

Interpretation of Scores: The Llama-2-70b based DDI classification results show strong relative accuracy imbalance: Gini is very high (0.67, above the 0.4 threshold) because the top class is 3.8× the mean, so the system is highly imbalanced given the average performance.
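These scores can be recomputed directly from per-class accuracies. Below is a minimal Python sketch of Eq. (3) and the COBias definition (illustrative code, not the paper's implementation); the example vectors are the walkthrough distributions from Section 3.2, plus the original DDI per-class accuracies reported in Table 6:

```python
def gini(acc):
    """Direct (reciprocal) Gini index over class accuracies, Eq. (3)."""
    n = len(acc)
    mean = sum(acc) / n
    total = sum(abs(a - b) for a in acc for b in acc)  # all N*N ordered pairs
    return total / (2 * n * n * mean)

def cobias(acc):
    """Mean absolute accuracy difference over distinct unordered pairs."""
    n = len(acc)
    diffs = [abs(acc[i] - acc[j]) for i in range(n) for j in range(i + 1, n)]
    return sum(diffs) / len(diffs)

print(round(gini([1, 0, 0, 0]), 2))      # 0.75: one-class monopoly
print(round(gini([0.8, 0.2, 0, 0]), 2))  # 0.65
print(round(gini([1, 1, 0, 0]), 2))      # 0.5: shared dominance scores lower
print(round(gini([0.5, 0, 0, 0]), 2))    # 0.75: scale-invariant, same as [1, 0, 0, 0]

# DDI case (original per-class accuracies from Table 6):
ddi = [0, 0.87, 0.03, 0.04, 0.20]
print(round(gini(ddi), 2), round(cobias(ddi), 2))  # 0.67 0.38
```

The last line reproduces both reported DDI scores (Gini 0.67, COBias 0.38) from the five per-class accuracies alone, and the fourth print illustrates the scale invariance discussed above.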
Intuitively, the DDI accuracy distribution is visibly similar to the [0.8, 0.2, 0, 0] distribution in Table 1, suggesting that even with a more comprehensive LLM, there can be extreme class accuracy disparities—dominance by a few classes.

Figure 1: Measurements of the Gini index and related metrics for image classification (CIFAR-100; 100 classes).

4.2 Case Analysis 2: Image Classification

Using the CLIP ViT-L/14 zero-shot image classification results on the CIFAR-100 dataset [Krizhevsky, 2009] as a case study, we present the Gini index, alongside the accuracy for each class, mean accuracy, top-class relative dominance to the mean, COBias, and other measurement results in Figure 1. This task comprises 100 classes.

Interpretation of Scores: For 100 classes with mean accuracy 0.67 (lowest class accuracy: 0; highest: 0.99), the low Gini index (0.18) confirms the absence of strong relative dominance—the top class is only 1.47× the mean accuracy. This demonstrates that Gini penalizes relative concentration (i.e., top-class dominance relative to the mean), so a low score indicates no significantly dominant class. COBias (0.24), meanwhile, reflects absolute gaps (e.g., 0.99 vs. 0), showing absolute accuracy disparity despite low relative concentration.

4.3 The Rationale

In summary, analyses of these three randomly picked cases confirm that weak to strong relative concentration exists in prompt-based text and image classification's class accuracies, regardless of whether the classification is high-dimensional or low-dimensional. Crucially, this non-zero Gini (Gini > 0) demonstrates that relative accuracy imbalance is present in real-world models. The DDI case exhibits a typical high-Gini profile (0.67), calling for bias mitigation.
Though non-dominant in the AGNews and CIFAR-100 case studies (Gini = 0.18-0.21, where the top class is only 1.31-1.47× the mean accuracy), the weak concentration still represents measurable bias under our metric—such accuracy disparity must be mitigated.

5 The Bias Mitigation Method

Not only is Gini useful as an evaluation metric; it can also be harnessed as an optimization metric for minimizing class accuracy disparities. We propose a bias mitigation method, D_Gini, that directly leverages Gini. The method is post-hoc, model-agnostic, and requires no parameter updates to the original model. The mitigation method presented here is a foundational proof of concept that directly validates the feasibility of mitigating relative accuracy imbalance. Crucially, the current method serves as a direct, unembellished translation of the proposed Gini metric into a functional bias mitigation model. While subsequent refinements will further optimize performance, this single model with experiments delivers clear, actionable proof that Gini-based mitigation is both possible and meaningful.

The Mathematical Model: Given a classification dataset of N classes, we split it into a labeled optimization set (analogous to the training set in gradient-based learning methods) of M instances and a test set. After prompting the LLM/pretrained vision model for a prediction on an instance x_m (from the optimization set), we obtain the output class probabilities at the answer token: p_m = (p_m1, ..., p_mN). The immediate prediction is ŷ_m = argmax_{i ∈ {1,...,N}} p_mi, and the per-class accuracy is

A_i = (1/|S_i|) Σ_{m ∈ S_i} 1{ŷ_m = y_m},    (6)

where S_i is the set of indices of class-i instances, and y_m is the ground-truth class of instance x_m.
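In code, the step from output class probabilities to per-class accuracies as in Eq. (6) is a small computation. The sketch below is illustrative; the probability vectors are fabricated for the example:

```python
def per_class_accuracy(probs, y_true, n_classes):
    """A_i: fraction of class-i instances whose argmax prediction is correct, Eq. (6)."""
    correct = [0] * n_classes
    counts = [0] * n_classes
    for p, y in zip(probs, y_true):
        y_hat = max(range(n_classes), key=p.__getitem__)  # argmax_i p_mi
        counts[y] += 1
        correct[y] += int(y_hat == y)
    return [c / t if t else 0.0 for c, t in zip(correct, counts)]

# Six toy instances over three classes; rows are normalized label-token probabilities.
probs = [
    [0.7, 0.2, 0.1], [0.6, 0.3, 0.1],  # true class 0: both predicted as 0
    [0.5, 0.4, 0.1], [0.2, 0.7, 0.1],  # true class 1: one misrouted to class 0
    [0.5, 0.1, 0.4], [0.4, 0.1, 0.5],  # true class 2: one misrouted to class 0
]
y_true = [0, 0, 1, 1, 2, 2]
print(per_class_accuracy(probs, y_true, 3))  # [1.0, 0.5, 0.5]
```

These per-class accuracies are exactly what G_CLS in Eq. (3) consumes, and what the optimization below adjusts.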
From the benchmarking results, we have seen that these prompt-based predictions can be prone to at least weak relative accuracy imbalance (Gini > 0); therefore, we aim to post-hoc adjust the output class probability distribution, so that the corrected probabilities lead to fairer predictions and thus fairer class accuracies relative to the mean, i.e., lower Gini.

To post-hoc adjust the output class probability distribution for reducing Gini, we propose reweighting coefficients—specifically, integer selection variables ξ = (ξ_1, ..., ξ_N) for classes {1, ..., N}—and a bias mitigation method D_Gini based on ξ, such that the Gini index of the corrected predictions is minimized. Following the computational details in [Lin et al., 2025], these ξ adjustments select from both discrete, class-level correction weights (e.g., a factor of 0.2) and sample-level correction functions (e.g., a probability-range-specific triangular membership function)—the correction is tailored to each class and instance. We collectively refer to the correction weights and correction functions as the correction map: F = {f_1, ..., f_|F|}. We correct the per-sample, per-class probability p′_mi by a mapping:

f_{ξ_i}: p′_mi ← f_{ξ_i}(p_mi),    i ∈ {1, ..., N}, p_mi ∈ [0, 1], ξ_i ∈ {1, ..., |F|}.    (7)

Therefore, the corrected prediction for instance x_m is:

ŷ′_m = argmax_{i ∈ {1,...,N}} {p′_m1, ..., p′_mN} = argmax_{i ∈ {1,...,N}} {f_{ξ_1}(p_m1), ..., f_{ξ_N}(p_mN)}.    (8)

Then, the updated accuracy for any class i is

A′_i(ξ) = (1/|S_i|) Σ_{m ∈ S_i} 1{ŷ′_m = y_m}.    (9)

Within this notational framework, an optimization model (called D_Gini) aimed at minimizing the Gini index in a prompt-based classification task can be formulated as follows:

min G_CLS(ξ) = ( Σ_{i=1}^{N} Σ_{j=1}^{N} |A′_i(ξ) − A′_j(ξ)| ) / ( 2 N² γ_Acc )
s.t. ξ = (ξ_1, ..., ξ_N), ξ_i ∈ {1, ..., |F|}    (10)

The Solution Framework Based on Simulated Annealing: To complete our solution framework for the mitigation model, without loss of generality, we follow [Lin et al., 2025] in using a simulated annealing (SA) algorithm to solve the nonlinear integer programming objective.

6 Optimization Experiments

6.1 Experimental Setup

We apply D_Gini to both case studies, including AGNews text classification and CIFAR-100 image classification. For optimizations on the CIFAR-100 dataset, we refine the correction map F to include only weight corrections. This simplification allows us to demonstrate the optimization performance while enabling rapid iteration for this case study. During inference, a test instance's class probabilities are reweighted by the coefficients for each class learned during optimization. We report evaluation scores on the test set. Evaluation metrics follow the benchmarking section, including mean accuracy, Gini, COBias, and other related metrics.

6.2 Bias Mitigation Results for Case Analysis 1: Text Classification

Table 5 presents the test results on AGNews. Our bias mitigation method D_Gini significantly reduces the Gini index, by a relative 86%. In particular, the top class Sports's relative dominance is reduced. Meanwhile, by optimizing over Gini, test COBias is also greatly reduced (by a relative 86%), and the weakest class Tech's accuracy rises from 19% to 85%. These results suggest that Gini-based bias optimization is effective for reducing both relative and absolute accuracy imbalance.

Table 6 shows the test results on DDI. Extremely imbalanced in the original prompt-based results, the corrected predictions yield much more balanced class accuracies.
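The optimization loop of Section 5 can be sketched end-to-end as follows. This is a self-contained illustration, not the released implementation: the correction map is restricted to scalar class-level weights (mirroring our CIFAR-100 simplification), and the candidate weight grid, temperature schedule, and step count are made-up hyperparameters.

```python
import math
import random

def gini(acc):
    """Direct Gini index over class accuracies (Eq. (3))."""
    n, mean = len(acc), sum(acc) / len(acc)
    if mean == 0:
        return 0.0
    return sum(abs(a - b) for a in acc for b in acc) / (2 * n * n * mean)

def corrected_accuracies(probs, y_true, xi, F):
    """A'_i(xi): per-class accuracies of argmax over reweighted probabilities (Eqs. (7)-(9))."""
    n = len(xi)
    correct, counts = [0] * n, [0] * n
    for p, y in zip(probs, y_true):
        y_hat = max(range(n), key=lambda i: p[i] * F[xi[i]])
        counts[y] += 1
        correct[y] += int(y_hat == y)
    return [c / t if t else 0.0 for c, t in zip(correct, counts)]

def d_gini_sa(probs, y_true, F, steps=1500, t0=0.2, cooling=0.99, seed=0):
    """Simulated annealing over the integer selections xi, minimizing G_CLS (Eq. (10))."""
    rng = random.Random(seed)
    n = len(probs[0])

    def score(xi):
        return gini(corrected_accuracies(probs, y_true, xi, F))

    xi = [len(F) - 1] * n                  # start from the largest weight for every class
    cur = score(xi)
    best_xi, best, t = list(xi), cur, t0
    for _ in range(steps):
        cand = list(xi)
        cand[rng.randrange(n)] = rng.randrange(len(F))  # re-pick one class's weight
        s = score(cand)
        if s <= cur or rng.random() < math.exp((cur - s) / max(t, 1e-12)):
            xi, cur = cand, s
            if s < best:
                best_xi, best = list(cand), s
        t *= cooling
    return best_xi, best

# Toy head-biased model: raw probabilities always favor class 0.
probs = [[0.9, 0.05, 0.05]] * 4 + [[0.5, 0.45, 0.05]] * 4 + [[0.5, 0.05, 0.45]] * 4
y_true = [0] * 4 + [1] * 4 + [2] * 4
F = [0.2, 0.5, 1.0]                        # hypothetical class-level correction weights
print(gini(corrected_accuracies(probs, y_true, [2, 2, 2], F)))  # ~0.67 before mitigation
xi, g = d_gini_sa(probs, y_true, F)
# e.g., xi = [0, 2, 2] down-weights the head class and reaches Gini = 0 on this toy data
```

On the real tasks, F would additionally contain the sample-level correction functions described in Section 5, and the search would run on the labeled optimization split rather than toy data.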
The saying—the minority classes usually offer the predictions that are most important—rings particularly true in DDI classification, where mispredictions in a dominant class can lead to life-threatening consequences (when patients trust the LLM predictions for a wrong interaction type between a pair of drugs and misuse them). Our Gini-based bias mitigation method is particularly well-suited for biomedical, DDI-like scenarios, effectively calibrating prompt-based outputs for much lower top-class relative dominance.

Table 5: Test results using the Gini-based bias mitigation method D_Gini on AGNews (in the first column, ↑ (↓) means a higher (lower) value is better).

| Evaluation Metric | Original: World / Sports / Business / Tech | Debiased (Opt. Metric: Gini): World / Sports / Business / Tech | Relative Improvement |
| Class Acc. (4 classes) | 0.85 / 0.98 / 0.97 / 0.19 | 0.85 / 0.98 / 0.85 / 0.85 | - |
| Mean Acc. (↑) | 0.75 | 0.88 | ↑ 17% |
| Relative Dominance of the Top Class to the Mean (↓) | 1.31 | 1.11 | ↓ 15% |
| COBias (↓) | 0.42 | 0.06 | ↓ 86% |
| Gini (↓) | 0.21 | 0.03 | ↓ 86% |

Table 6: Test results using the Gini-based bias mitigation method D_Gini on DDI (in the first column, ↑ (↓) means a higher (lower) value is better).

| Evaluation Metric | Original: Neg. / Eff. / Mech. / Adv. / Int. | Debiased (Opt. Metric: Gini): Neg. / Eff. / Mech. / Adv. / Int. | Relative Improvement |
| Class Acc. (5 classes) | 0 / 0.87 / 0.03 / 0.04 / 0.20 | 0.30 / 0.45 / 0.32 / 0.27 / 0.52 | - |
| Mean Acc. (↑) | 0.23 | 0.37 | ↑ 61% |
| Relative Dominance of the Top Class to the Mean (↓) | 3.8 | 1.4 | ↓ 63% |
| COBias (↓) | 0.38 | 0.13 | ↓ 66% |
| Gini (↓) | 0.67 | 0.14 | ↓ 79% |

6.3 Bias Mitigation Results for Case Analysis 2: Image Classification

Table 7 shows that our mitigation method effectively addresses relative class accuracy imbalances in zero-shot image classification, where the 100 classes obtain more balanced relative and absolute class accuracies.
Specifically, Gini is reduced by a relative 61%, and the weakest class accuracy rises from 0 to 27%.

Table 7: Test results using the Gini-based bias mitigation method D_Gini on CIFAR-100 (in the first column, ↑ (↓) means a higher (lower) value is better).

| Evaluation Metric | Original | Debiased (Opt. Metric: Gini) | Relative Improvement |
| Mean Acc. (↑) | 0.67 | 0.69 | ↑ 3% |
| Relative Dominance of the Top Class to the Mean (↓) | 1.47 | 1.42 | ↓ 3% |
| COBias (↓) | 0.24 | 0.10 | ↓ 58% |
| Gini (↓) | 0.18 | 0.07 | ↓ 61% |

In summary, the mitigation method effectively harnesses Gini for meaningful bias mitigation. We leave the door open for iterative engineering or model enhancements, given the fundamental validity of our approach.

6.4 Discussion: Ablation on the Optimization Metric Choice, Gini vs. COBias

Through ablation, we isolate the impact of the optimization metric choice: Gini vs. COBias. The COBias-based ablation study is performed on AGNews; hyperparameters align with the Gini-based setting. Results are shown in Table 8; they demonstrate that Gini-based tuning yields quantitatively slightly stronger bias mitigation than COBias-based tuning.

Table 8: Ablation results on AGNews using the bias mitigation method with COBias as the optimization metric (in the first column, ↑ (↓) means a higher (lower) value is better).

| Evaluation Metric | Original: World / Sports / Business / Tech | Debiased (Opt. Metric: COBias): World / Sports / Business / Tech | Relative Improvement |
| Class Acc. (4 classes) | 0.85 / 0.98 / 0.97 / 0.19 | 0.86 / 0.98 / 0.85 / 0.84 | - |
| Mean Acc. (↑) | 0.75 | 0.88 | ↑ 17% |
| Relative Dominance of the Top Class to the Mean (↓) | 1.31 | 1.11 | ↓ 15% |
| COBias (↓) | 0.42 | 0.07 | ↓ 83% |
| Gini (↓) | 0.21 | 0.03 | ↓ 86% |

7 Conclusion

We carefully examine Gini's hidden role as a tool for detecting and optimizing (debiasing) disparities in class accuracy, focusing on the case of prompt-based classification.
In particular, we first empirically demonstrate Gini values in representative LLM-based text and image classification scenarios—revealing that weak to strong relative accuracy imbalance exists in real-world models—and we directly leverage Gini to propose a post-hoc, model-agnostic bias mitigation method. Experimental results show that our method effectively reduces Gini from 0.21 to 0.03 for news text classification, from 0.67 to 0.14 for biomedical drug-drug interaction relation classification, and from 0.18 to 0.07 for a 100-class image classification task. The Gini-based mitigation method also significantly improves the weakest class's accuracy, showing that reducing relative accuracy imbalance is possible and meaningful.

It should be noted that, by proposing the Gini index and the post-hoc debiasing method in the prompt-based classification setting, we do not imply that they are only for LLMs; rather, this is the high-impact context of accuracy disparity challenges. In addition, the models examined in this work are basic yet widely used architectures that serve as building blocks for more complex systems. Understanding and mitigating accuracy disparities in basic models is a necessary prerequisite before extending to multi-modal or multi-agent settings. Building on this foundation, we aim to extend our methods to multi-modalities and agentic systems in future work.

References

Tyler A. Chang and Benjamin K. Bergen. Word acquisition in neural language models. Transactions of the Association for Computational Linguistics, 10:1–16, 2022. doi: 10.1162/tacl_a_00444. URL https://aclanthology.org/2022.tacl-1.1/.

Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
URL https://proceedings.neurips.cc/paper_files/paper/2019/file/621461af90cadfdaf0e8d4cc25129f91-Paper.pdf.

Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

Mateusz Buda, Atsuto Maki, and Maciej A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, 2018. ISSN 0893-6080. doi: 10.1016/j.neunet.2018.07.011. URL https://www.sciencedirect.com/science/article/pii/S0893608018302107.

Corrado Gini. Variabilità e mutabilità. C. Cuppini, Bologna, 1912. Reprinted in E. Pizetti and T. Salvemini (Eds.), Memorie di metodologica statistica, Libreria Eredi Virgilio Veschi, Rome, 1955.

Albert O. Hirschman. The paternity of an index. American Economic Review, 54(5):761–762, 1964. A one-page clarification that the Gini coefficient was actually developed by Hirschman in 1943.

Max O. Lorenz. Methods of measuring the concentration of wealth. Publications of the American Statistical Association, 9(70):209–219, 1905.

Joseph L. Gastwirth. The estimation of the Lorenz curve and Gini index. The Review of Economics and Statistics, 54(3):306–316, 1972. doi: 10.2307/1937992.

Almas Heshmati. Inequalities and their measurement. IZA Discussion Paper 1219, Institute for the Study of Labor (IZA), Bonn, 2004. URL https://ideas.repec.org/p/ess/wpaper/id7311.html.

F. A. Cowell. Chapter 2: Measurement of inequality. In Handbook of Income Distribution, volume 1, pages 87–166. Elsevier, 2000. doi: 10.1016/S1574-0056(00)80005-6. URL https://www.sciencedirect.com/science/article/pii/S1574005600800056.

Shujie Yao.
On the decomposition of gini coefficients by population class and income source: a spreadsheet approach and application. Applied Economics , 31(10):1249–1264, 1999. doi: 10.1080/000368499323463. URL https: //doi.org/10.1080/000368499323463 . T om Brown, Benjamin Mann, Nick Ryder , Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry , Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger , T om Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler , Jeffrey W u, Clemens W inter , Chris Hesse, Mark Chen, Eric Sigler , Mateusz Litwin, Scott Gray , Benjamin Chess, Jack Clark, Christopher Berner , Sam McCandlish, Alec Radford, Ilya Sutskev er, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F . Balcan, and H. Lin, editors, Advances in Neural Information Pr ocessing Sys- tems , v olume 33, pages 1877–1901, 2020. URL https://proceedings.neurips.cc/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a- Paper.pdf . T imo Schick and Hinrich Schütze. Exploiting cloze-questions for few-shot text classification and natural language inference. In Pr oceedings of the 16th Confer ence of the Eur opean Chapter of the Association for Computational Linguistics , pages 255–269, 2021. URL https://aclanthology.org/2021.eacl- main.20.pdf . A vyav K umar Singh and Helen Y annakoudakis. Few-shot open-set classification via reasoning-aware decomposition. In Christos Christodoulopoulos, T anmoy Chakraborty , Carolyn Rose, and V iolet Peng, editors, Pr oceedings of the 2025 Confer ence on Empirical Methods in Natural Languag e Pr ocessing , pages 13854–13875, Suzhou, China, Nov ember 2025. Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp- main.699. URL https://aclanthology.org/2025.emnlp- main.699/ . 
Daniel Khashabi, Xinxi L yu, Sew on Min, Lianhui Qin, Kyle Richardson, Sean W elleck, Hannaneh Hajishirzi, T ushar Khot, Ashish Sabharwal, Sameer Singh, and Y ejin Choi. Prompt waywardness: The curious case of discretized interpretation of continuous prompts. In Pr oceedings of the 2022 Confer ence of the North American Chapter of the Association for Computational Linguistics: Human Languag e T echnologies , pages 3631–3643, 2022. URL https://aclanthology.org/2022.naacl- main.266/ . Shrimai Prabhumoye, Rafal K ocielnik, Mohammad Shoeybi, Anima Anandkumar , and Bryan Catanzaro. Few-shot instruction prompts for pretrained language models to detect social biases. arXiv preprint , 2021. URL . Prasetya Utama, Nafise Sadat Moosa vi, V ictor Sanh, and Iryna Gurevych. A voiding inference heuristics in few-shot prompt-based finetunings. In Pr oceedings of the 2021 Confer ence on Empirical Methods in Natural Language Pr ocessing , pages 9063–9074, 2021. URL https://aclanthology.org/2021.emnlp- main.713.pdf . Albert W ebson and Ellie Pavlick. Do prompt-based models really understand the meaning of their prompts? In Pr oceedings of the 2022 Confer ence of the North American Chapter of the Association for Computational Linguistics: Human Language T echnologies , pages 2300–2344, 2022. URL https://aclanthology.org/2022.naacl- main. 167.pdf . Gracjan Góral, Emilia W i ´ snios, Piotr Sankowski, and Pa weł Budzianowski. W ait, that’s not an option: LLMs rob ustness with incorrect multiple-choice options. In W anxiang Che, Joyce Nabende, Ekaterina Shutov a, and Mohammad T aher Pilehv ar , editors, Pr oceedings of the 63r d Annual Meeting of the Association for Computational Linguistics (V olume 1: Long P apers) , pages 1495–1515, V ienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979- 8-89176-251-0. doi: 10.18653/v1/2025.acl- long.75. URL https://aclanthology.org/2025.acl- long.75/ . 
Isha Chaudhary , Qian Hu, Manoj Kumar , Morteza Ziyadi, Rahul Gupta, and Gagandeep Singh. Certifying counterfactual bias in LLMs. In The Thirteenth International Confer ence on Learning Representations , 2025. URL https: //iclr.cc/virtual/2025/poster/30226 . Fabio Petroni, T im Rocktäschel, Sebastian Riedel, Patrick Le wis, Anton Bakhtin, Y uxiang W u, and Alexander Miller . Language models as kno wledge bases? In Pr oceedings of the 2019 Confer ence on Empirical Methods in Natural Language Pr ocessing and the 9th International J oint Confer ence on Natur al Languag e Pr ocessing , pages 2463–2473, 2019. Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-ef ficient prompt tuning. In Pr oceedings of the 2021 Conference on Empirical Methods in Natur al Language Pr ocessing , pages 3045–3059, 2021. URL https://aclanthology.org/2021.emnlp- main.243/ . 11 Discov ering the Hidden Role of Gini Index In Prompt-based Classification Guanghui Qin and Jason Eisner . Learning how to ask: Querying LMs with mixtures of soft prompts. In Proceedings of the 2021 Confer ence of the North American Chapter of the Association for Computational Linguistics: Human Language T echnologies , pages 5203–5212, 2021. URL https://aclanthology.org/2021.naacl- main.410/ . V ictor Sanh, Albert W ebson, Colin Raffel, Stephen Bach, Lintang Suta wika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler , Arun Raja, Manan Dey , M Saiful Bari, Canwen Xu, Urmish Thakker , Shanya Sharma Sharma, Eliza Szczechla, T ae woon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike T ian-Jian Jiang, Han W ang, Matteo Manica, Sheng Shen, Zheng Xin Y ong, Harshit Pandey , Rachel Bawden, Thomas W ang, T rishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry , Jason Alan Fries, Ryan T eehan, T e ven Le Scao, Stella Biderman, Leo Gao, Thomas W olf, and Alexander M Rush. Multitask prompted training enables zero-shot task generalization. 
A   Maximum Possible Gini Value

By the proposed definition, the maximum Gini value, $\frac{N-1}{N}$, occurs for the per-class accuracy vector $[1, 0, \ldots, 0]$ over $N$ classes. As $N \to \infty$,

$$\text{Average difference} = \frac{2(N-1)}{N^2} \approx 2 \times \frac{1}{N},$$

i.e., twice the mean.
That is, when one class reaches the highest accuracy while all others obtain 0, the average class accuracy difference approaches twice the mean.
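The limit above can be checked numerically. The sketch below uses an illustrative helper, `avg_pairwise_diff`, to compute the average absolute difference over all ordered pairs for the extreme accuracy vector $[1, 0, \ldots, 0]$ and compares it to twice the mean.

```python
# Check Appendix A numerically: for accuracies [1, 0, ..., 0] over N classes,
# the average pairwise difference is 2(N-1)/N^2, which approaches twice the
# mean accuracy (2/N) as N grows. Helper name is illustrative, not from the paper.
def avg_pairwise_diff(values):
    n = len(values)
    return sum(abs(a - b) for a in values for b in values) / (n * n)

for n in (10, 100, 1000):
    accs = [1.0] + [0.0] * (n - 1)
    avg_diff = avg_pairwise_diff(accs)   # exactly 2(n-1)/n^2
    twice_mean = 2.0 * sum(accs) / n     # 2 * mean = 2/n
    print(n, avg_diff / twice_mean)      # ratio is (n-1)/n, approaching 1
```

The ratio of the average difference to twice the mean equals $\frac{N-1}{N}$, matching the stated maximum Gini value.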