Discovering the Hidden Role of Gini Index In Prompt-based Classification


Authors: Ruixi Lin

Ruixi Lin
Independent Researcher
{ruixi}@u.nus.edu

ABSTRACT

In classification tasks, the long-tailed minority classes usually offer the predictions that are most important. Yet these classes consistently exhibit low accuracies, whereas a few high-performing classes dominate the game. We pursue a foundational understanding of the hidden role of the Gini Index as a tool for detecting and optimizing (debiasing) disparities in class accuracy, focusing on the case of prompt-based classification. We introduce the intuitions, benchmark Gini scores in real-world LLMs and vision models, and thoroughly discuss the insights of Gini not only as a measure of relative accuracy dominance but also as a direct optimization metric. Through rigorous case analyses, we first show that weak to strong relative accuracy imbalance exists in prompt-based text and image classification results, regardless of whether the classification is high-dimensional or low-dimensional. Then, we harness the Gini metric to propose a post-hoc, model-agnostic bias mitigation method. Experimental results across few-shot news, biomedical, and zero-shot image classification show that our method significantly reduces both relative and absolute accuracy imbalances, minimizing top-class relative dominance while elevating the weakest classes.

Keywords: Gini Index, LLM, Prompt-Based Classification

1 Introduction

Classification tasks form the backbone of modern artificial intelligence, spanning diverse domains from natural language processing to computer vision. Whether classifying text into categories, recognizing objects in images, or assigning labels to multi-modal inputs, the fundamental challenge remains the same: models must learn to correctly distinguish between classes based on training data.
In recent years, large language models (LLMs) and vision-language models have achieved remarkable performance across these tasks, yet they often struggle with a persistent problem: accuracy imbalance across classes, whose causes are rooted in the pretraining data itself. A pervasive phenomenon in real-world data is the long-tailed distribution, where a small number of "head" classes occupy the majority of the dataset while the vast majority of "tail" classes have very few samples. In such cases, the performance of deep learning models is often dominated by the head classes while the learning of the tail classes is severely underdeveloped. This imbalance manifests directly in classification accuracy: models become biased toward frequently seen classes, while rare classes suffer from poor recognition performance. Though the exact classes may not be seen in pretraining, LLM completions can be degenerate, carrying more weight on frequently seen tokens [Chang and Bergen, 2022].

Yet the importance of these minority classes cannot be overstated. In many critical applications, such as medical diagnosis, fraud detection, anomaly identification, and scientific discovery, the rare categories often carry the highest stakes. A model that performs well on common cases but fails on rare ones may be practically useless or even dangerous. The challenge, therefore, is not merely achieving high average accuracy, but ensuring adequate performance across all classes, particularly those with limited representation.

Traditional approaches to addressing class imbalance have focused primarily on the training data level. Methods incorporating oversampling, undersampling, and data augmentation aim to rebalance the dataset, followed by training with imbalance-aware loss functions [Cao et al., 2019, Cui et al., 2019, Buda et al., 2018].
However, iterative retraining or fine-tuning on rebalanced datasets becomes prohibitively expensive. Moreover, the cost of collecting and annotating sufficient high-quality tail-class examples to achieve natural balance is often high. This suggests a fundamental shift in perspective: rather than fixing imbalance at the data level, we might address it at the output level. The core problem is not merely that training data is imbalanced, but that this imbalance produces systematically distorted output vectors: class predictions that favor head classes and suppress tail classes. If we can detect and correct this distortion in the model's predictions themselves, we may achieve better class fairness without the prohibitive cost of data rebalancing. This points to the need for metrics that can quantify output-level imbalance and methods that can post-hoc optimize for more equitable predictions across classes.

In this paper, we reveal one such metric: the Gini Index. We carefully examine Gini's hidden role as a tool for detecting and optimizing (debiasing) disparities in class accuracy, focusing on the case of prompt-based classification. In particular, we first empirically demonstrate Gini values in representative LLM-based text and image classification scenarios, and then we propose a post-hoc, model-agnostic bias mitigation method to reduce Gini. We show that:

• As measured by the Gini index, weak to strong relative accuracy imbalance exists in prompt-based text and image classification results, regardless of whether the classification is high-dimensional or low-dimensional. Crucially, the non-zero Gini (Gini > 0) demonstrates that relative accuracy imbalance is present in real-world models.

• Reducing relative accuracy imbalance is possible and meaningful.
We achieve this by directly leveraging the Gini index as an optimization metric in a post-hoc, model-agnostic bias mitigation method.

• Both relative and absolute accuracy imbalance are mitigated by our Gini-based bias mitigation method.

2 Related Work

2.1 The Inequality Measure: Gini Index

The Gini index, or Gini coefficient, is the most commonly used measure of inequality [Gini, 1912, Hirschman, 1964]. It was defined based on the Lorenz curve [Lorenz, 1905, Gastwirth, 1972], ranging from 0 to 1, with 0 indicating perfect equality and 1 indicating perfect inequality. The Gini index can be estimated using various approaches, including direct and indirect calculation methods [Heshmati, 2004]. In practice, when measuring income inequality, the choice between direct and indirect Gini index calculation methods is more than just convenience; it concerns statistical robustness under different data conditions [Cowell, 2000]. For example, the income Gini adopts a direct calculation method, which is computed directly from individual or grouped data without assuming an underlying functional form for the income distribution [Yao, 1999]. The income Gini for N individuals (whose incomes are denoted by I_i, with average income \gamma_I) is defined as:

G_{\mathrm{Income}} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} |I_i - I_j|}{2 N^2 \gamma_I}    (1)

Intuitively, the income Gini highlights top-class relative dominance over the mean. As we will see later when formally defining the Gini index, this direct reciprocal calculation used in the income Gini fits well in the context of measuring class accuracy disparity.

2.2 Prompt-based Classification

LLMs can be viewed as modeling the probability distribution of completion strings (outputs) given prompt strings (inputs). Prompt-based methods enable few-shot learning in LLMs for classification tasks [Brown et al., 2020].
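The direct (reciprocal) calculation in Eq. (1) needs no distributional assumptions and can be sketched in a few lines. The incomes below are made-up illustration values, not data from the paper:

```python
def gini_direct(values):
    """Direct (reciprocal) Gini as in Eq. (1): the sum of absolute
    differences over all ordered pairs, divided by 2 * N^2 * mean."""
    n = len(values)
    mean = sum(values) / n
    total = sum(abs(a - b) for a in values for b in values)
    return total / (2 * n * n * mean)

# Hypothetical incomes for four individuals
print(gini_direct([100, 50, 30, 20]))  # -> 0.325
print(gini_direct([50, 50, 50, 50]))   # perfect equality -> 0.0
```

Because the double sum counts each pair twice and includes zero self-pairs, dividing by 2N^2 times the mean yields half the relative mean difference, which is exactly the conventional Gini.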
By reformulating classification as a language modeling problem, the LLM predicts a target label token given a prompt template, and these approaches achieve competitive results with minimal labeled data [Schick and Schütze, 2021, Singh and Yannakoudakis, 2025]. Despite their success, studies reveal that prompts are often not interpreted by models in the way humans intend [Khashabi et al., 2022], and they can inadvertently influence model biases in classification decisions [Prabhumoye et al., 2021, Utama et al., 2021, Webson and Pavlick, 2022, Góral et al., 2025, Chaudhary et al., 2025]. From a taxonomic perspective, prompt-based classification methods can be broadly categorized into two lines: prompt design, which focuses on crafting human-readable prompts that map classification tasks to cloze-style questions for a frozen language model [Petroni et al., 2019], and prompt tuning, where continuous soft prompts are learned while the underlying model remains fixed, enabling more flexible adaptation to target class distributions [Lester et al., 2021, Qin and Eisner, 2021, Sanh et al., 2022]. In both paradigms, the core objective remains accurate classification across all classes, a goal that can be undermined when models exhibit systematic accuracy disparities between frequently and rarely occurring label tokens.

2.3 Measuring Class Accuracy Disparities

Lin and You [2024a] propose the COBias metric for evaluating pairwise class accuracy disparities when using LLMs to perform text classification. COBias also enables learning debiasing coefficients that, during inference, plug in to reweight the probability distribution over the prediction token for classification tasks [Lin and You, 2024a,b, Lin et al., 2025]. More recently, Li et al. [2026] explored the use of an adapted Gini coefficient as a metric in multi-agent systems.
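The debiasing-coefficient idea referenced above, reweighting the class probability distribution at inference time, can be illustrated with a minimal sketch. The probabilities and weights below are invented for illustration; in the cited work the coefficients are learned, not hand-set:

```python
# Sketch of post-hoc, per-class reweighting of prediction probabilities.
# The values here are hypothetical; learned coefficients would replace them.

def reweighted_predict(class_probs, weights):
    """Scale each class probability by its debiasing weight, then argmax."""
    scores = [p * w for p, w in zip(class_probs, weights)]
    return max(range(len(scores)), key=scores.__getitem__)

probs = [0.50, 0.30, 0.20]   # raw label-token probabilities (head class first)
weights = [0.5, 1.0, 1.8]    # hypothetical coefficients boosting tail classes

print(reweighted_predict(probs, [1.0, 1.0, 1.0]))  # unweighted -> 0 (head class)
print(reweighted_predict(probs, weights))          # reweighted -> 2 (tail class)
```

The appeal of this family of methods is that the underlying model is untouched; only the output distribution is adjusted, which is what makes the approach model-agnostic.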
In contrast, we treat Gini not as an adapted system-level metric but as a direct measure of class accuracy disparity within prompt-based classification, where it naturally captures relative imbalance across classes, enabling both interpretability and direct optimization.

3 The Gini Index for Relative Class Accuracy Disparity in Prompt-Based Classification

We transfer the Gini index from the commonly used socioeconomic metric of income inequality to a useful metric for class accuracy disparity in prompt-based classification. Below, we define the Gini index metric, illustrate why Gini makes sense with a numerical walkthrough, and compare Gini with its most closely related metric, COBias.

3.1 Gini Index Definition

The measurement of class accuracy imbalance strongly resembles inequality between strong and weak classes. Therefore, we can define the Gini index as a metric of accuracy disparity in prompt-based classification results.

The Reciprocal Calculation Method: Because prompt-based classification outputs often do not conform to standard parametric distributions, we adopt the direct, reciprocal Gini index method to measure overall class accuracy disparity. This method computes the Gini index without distributional assumptions, making the Gini Index a well-suited metric for analyzing prompt-based classification outputs. We introduce the Gini index definition as follows.

As a reminder, for prompt-based classification, an LLM works by modeling the probability distribution of the completion/answer string (the classification output) given a prompt consisting of task instructions, the instance to be classified, and a question. We follow a common practice to predict the argmax class using probabilities assigned to label tokens. In detail, after obtaining the answer string's probability distribution over the whole vocabulary, we extract and normalize the probabilities corresponding to label tokens in the vocabulary to form class probabilities, i.e., p_m = (p_{m1}, \ldots, p_{mN}) for instance x_m and N classes. The prediction \hat{y}_m is then \arg\max_{i \in \{1,\ldots,N\}} p_{mi}. We can then compute the accuracy for each class using the ground-truth instances, to evaluate class-specific performance. Concretely, let A_i denote the accuracy for class i, i ∈ {1, ..., N}. Let \gamma_{\mathrm{Acc}} represent the average class accuracy:

\gamma_{\mathrm{Acc}} = \frac{1}{N} \sum_{i=1}^{N} A_i    (2)

As an analogy, A_i resembles a family's income, and the average class accuracy is similar to the average income over N classes (families). Then, we can similarly define the Gini index over class accuracies in classification, which we term the Classification tasks Gini, or G_CLS, analogous to the income Gini. Henceforth, the G_CLS metric is mathematically defined as follows:

G_{\mathrm{CLS}} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} |A_i - A_j|}{2 N^2 \gamma_{\mathrm{Acc}}}    (3)

Interpretation of the Gini Scale (Range: [0, 1]): Gini = 0 indicates perfect fairness: all classes have equal accuracy; or, in a perfectly equal society, the income difference is always 0. Gini approaching 1 (as population size approaches infinity) corresponds to maximal disparity: the accuracy gap between one class and all the rest reaches the largest possible value; or, in a perfectly unequal society (one person has everything), the average income difference approaches twice the mean.¹ Gini > 0.4 usually indicates strong relative imbalance [Jin et al., 2015].

¹ See Appendix A for derivations.

3.2 A Numerical Walkthrough of the Gini Index Calculation

Using the above definition, we present a step-by-step numerical demonstration of the Gini Index to illustrate how it works and why it makes sense. Below is the step-by-step breakdown of the G_CLS formula.

The Numerator: \sum_{i=1}^{N} \sum_{j=1}^{N} |A_i - A_j|. This is the total sum of absolute class accuracy differences.
For each class i, it calculates the absolute difference between its accuracy and every other class j's accuracy. This sum captures total inequality in the population (classes). If every class had the same accuracy, the sum would be zero. The more spread out the accuracies, the larger this sum becomes.

Example A with 4 classes with accuracies [1, 0, 0, 0]: Sum_A = 3 + 1 + 1 + 1 = 6
Example B with 4 classes with accuracies [0.8, 0.2, 0, 0]: Sum_B = 2.2 + 1 + 1 + 1 = 5.2
Example C with 4 classes with accuracies [1, 1, 0, 0]: Sum_C = 2 + 2 + 2 + 2 = 8

Lin and You [2024a] use a similar numerator (the sum over only unordered distinct pairs, i < j), which is divided by the combination size \binom{N}{2}. This mean absolute difference over all unordered pairs forms the COBias metric. The advantage of COBias is its directness in representing inequalities between pairs of classes, but it omits normalization by the mean accuracy, as we will see next.

The Denominator: 2 N^2 \gamma_{\mathrm{Acc}}. This denominator provides the scaling by population and mean.

N^2: There are N × N = N^2 pairs in the numerator's double sum. Dividing by N^2 turns the total sum into an average pairwise difference, i.e., the mean absolute difference over all ordered pairs.

\gamma_{\mathrm{Acc}}: Dividing by this mean accuracy makes the measure scale-invariant; this is what COBias lacks. The resulting average pairwise difference divided by \gamma_{\mathrm{Acc}} is the "relative mean difference". If every class accuracy doubled, \gamma_{\mathrm{Acc}} would double and the sum would double, so the ratio stays the same. This is crucial: inequality shouldn't change just because everyone gets richer by the same proportion.

2: By conventional definition, the Gini index is half the relative mean difference [Gini, 1912]. The division by 2 is embedded in the normalization to bound the index between 0 and 1.

Putting it together, the Gini indices for the numerical examples are:

Example A with accuracies [1, 0, 0, 0]: G_{CLS,A} = Sum_A / (2 × 4^2 × 0.25) = 6/8 = 0.75
Example B with accuracies [0.8, 0.2, 0, 0]: G_{CLS,B} = 5.2 / (2 × 4^2 × 0.25) = 5.2/8 = 0.65
Example C with accuracies [1, 1, 0, 0]: G_{CLS,C} = Sum_C / (2 × 4^2 × 0.5) = 8/16 = 0.5

Intuitively, the Gini index emphasizes how much the top class dominates relative to the mean, not absolute values. Gini is not higher for larger absolute gaps, but when the top class exceeds the mean by more. In summary, Gini (0.75 for [1, 0, 0, 0], 0.65 for [0.8, 0.2, 0, 0], 0.5 for [1, 1, 0, 0]) penalizes proportional dominance, not absolute gaps, so a split like [1, 1, 0, 0] (where the top classes share dominance) scores lower than a monopoly [1, 0, 0, 0]. Even though [0.8, 0.2, 0, 0] has a larger absolute gap (0.6) between its top classes than [1, 1, 0, 0] (0), its Gini index being higher than [1, 1, 0, 0]'s is not just because of the larger gap, but because the top class (0.8) remains heavily dominant relative to the mean (3.2× the mean), whereas [1, 1, 0, 0]'s top class has less relative concentration (2× the mean).

3.3 Comparisons Between Gini Index And COBias

Both metrics serve as tools for different diagnostic or optimization goals. The core rationale is a priority over the optimization/bias-mitigation target: Gini penalizes relative concentration (strong concentration relative to the mean when inequality is high); COBias penalizes absolute pairwise gaps (large raw differences between classes); and which matters depends on the bias pattern of interest. Recall the COBias definition: for class accuracies A = (A_1, \ldots, A_N) across N classes, the COBias metric is defined as the mean absolute difference over all distinct unordered pairs:

\mathrm{COBias} = \frac{2}{N(N-1)} \sum_{1 \le i < j \le N} |A_i - A_j|
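The walkthrough values above can be reproduced directly from Eq. (3), alongside the COBias definition; a minimal sketch:

```python
def g_cls(acc):
    """G_CLS (Eq. 3): ordered-pair sum of |A_i - A_j| over 2 * N^2 * mean."""
    n = len(acc)
    mean = sum(acc) / n
    total = sum(abs(a - b) for a in acc for b in acc)
    return total / (2 * n * n * mean)

def cobias(acc):
    """COBias: mean absolute difference over distinct unordered pairs."""
    n = len(acc)
    pairs = [abs(acc[i] - acc[j]) for i in range(n) for j in range(i + 1, n)]
    return sum(pairs) / len(pairs)

for acc in ([1, 0, 0, 0], [0.8, 0.2, 0, 0], [1, 1, 0, 0]):
    print(acc, round(g_cls(acc), 4), round(cobias(acc), 4))
# G_CLS gives 0.75, 0.65, 0.5, matching the walkthrough. COBias instead
# gives 0.5, 0.4333, 0.6667, ranking [1, 1, 0, 0] as worst: a concrete
# instance of the relative-vs-absolute distinction discussed in Sec. 3.3.
```

Note that the two metrics order these examples differently, which is precisely why the choice between them depends on whether relative concentration or raw pairwise gaps is the bias pattern of interest.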
