Dual-Metric Evaluation of Social Bias in Large Language Models: Evidence from an Underrepresented Nepali Cultural Context

Large language models (LLMs) increasingly influence global digital ecosystems, yet their potential to perpetuate social and cultural biases remains poorly understood in underrepresented contexts. This study presents a systematic analysis of representational biases in seven state-of-the-art LLMs (GPT-4o-mini, Claude-3-Sonnet, Claude-4-Sonnet, Gemini-2.0-Flash, Gemini-2.0-Lite, Llama-3-70B, and Mistral-Nemo) in the Nepali cultural context. Using a Croissant-compliant dataset of more than 2,400 stereotypical and anti-stereotypical sentence pairs spanning gender, race, and sociocultural stereotypes across social domains, we implement an evaluation framework, Dual-Metric Bias Assessment (DMBA), that combines two metrics: (1) agreement with biased statements and (2) stereotypical completion tendencies. Results show that models exhibit measurable explicit agreement bias, with mean bias agreement ranging from 0.36 to 0.43 across decoding configurations, and an implicit completion bias rate of 0.740-0.755. Importantly, implicit completion bias follows a non-linear, inverted-U relationship with temperature, peaking at moderate stochasticity (T = 0.3) and declining slightly at higher temperatures. Correlation analysis under different decoding settings reveals that explicit agreement strongly aligns with stereotypical sentence agreement but is a weak and often negative predictor of implicit completion bias, indicating that generative bias is poorly captured by agreement metrics. Sensitivity analysis shows that increasing top-p amplifies explicit bias, while implicit generative bias remains largely stable. Domain-level analysis shows that implicit bias is strongest for race and sociocultural stereotypes, while explicit agreement bias is similar across gender and sociocultural categories, with race showing the lowest explicit agreement. These findings highlight the need for culturally grounded datasets and debiasing strategies for LLMs in underrepresented societies.


💡 Research Summary

This paper investigates social and cultural bias in seven state‑of‑the‑art large language models (LLMs) within the under‑represented Nepali context. The authors first construct a culturally grounded benchmark called EquiText‑Nepali, comprising over 2,400 sentence pairs that contrast stereotypical statements with anti‑stereotypical counterparts. The pairs cover three primary bias categories—gender, race, and sociocultural (including caste, religion, regional, and language dimensions)—and are evenly distributed across ten finer‑grained domains such as professions, education, politics, traditions, and inter‑faith dynamics. Expert annotators fluent in both Nepali and English validated the labels, achieving an overall agreement of 92 % (gender 94.7 %, race 92.8 %, sociocultural 88.9 %). The dataset conforms to the Croissant metadata standard, ensuring discoverability and reproducibility.
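
As a rough illustration of the dataset's structure, each record can be thought of as a labeled sentence pair. The sketch below is hypothetical: the field names and example values are illustrative, not EquiText-Nepali's actual schema or contents.

```python
from dataclasses import dataclass

@dataclass
class SentencePair:
    """One benchmark record (illustrative field names, not the real schema)."""
    stereotypical: str       # the biased statement
    anti_stereotypical: str  # its matched counter-statement
    category: str            # "gender" | "race" | "sociocultural"
    domain: str              # e.g. "professions", "education", "politics"

pair = SentencePair(
    stereotypical="<biased statement>",
    anti_stereotypical="<matched anti-biased statement>",
    category="gender",
    domain="professions",
)
```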

To evaluate bias, the authors propose the Dual‑Metric Bias Assessment (DMBA) framework, which simultaneously measures (1) explicit agreement with biased statements and (2) implicit stereotypical completion tendencies. Agreement is quantified as the probability the model assigns to a biased sentence versus its anti‑biased counterpart. Completion bias is measured by prompting the model with the same sentence stem and scoring the generated continuation for stereotypical content using an automated lexical‑semantic pipeline; five samples per prompt are averaged to obtain a bias score between 0 (anti‑biased) and 1 (biased).
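
As a rough sketch of how the two DMBA metrics could be computed (this is not the authors' code; `sentence_logprob` and `score_stereotype` are hypothetical stand-ins for the model's scoring API and the paper's lexical-semantic pipeline):

```python
import math
from statistics import mean

def sentence_logprob(model, sentence: str) -> float:
    """Hypothetical helper: total log-probability the model assigns to `sentence`."""
    raise NotImplementedError  # replace with a real model-API call

def score_stereotype(continuation: str) -> float:
    """Hypothetical stand-in for the automated lexical-semantic scorer;
    returns a value in [0, 1] (1 = stereotypical, 0 = anti-stereotypical)."""
    raise NotImplementedError

def explicit_agreement(model, biased: str, anti_biased: str) -> float:
    """Explicit metric: probability mass on the biased sentence, normalized
    against its anti-biased counterpart (0.5 = indifferent)."""
    lp_biased = sentence_logprob(model, biased)
    lp_anti = sentence_logprob(model, anti_biased)
    # Softmax over the pair, computed stably from the log-prob difference.
    return 1.0 / (1.0 + math.exp(lp_anti - lp_biased))

def implicit_completion_bias(model, stem: str, n_samples: int = 5, **decoding) -> float:
    """Implicit metric: mean stereotype score over n free-form continuations
    of the same stem (the paper averages five samples per prompt)."""
    return mean(score_stereotype(model.generate(stem, **decoding))
                for _ in range(n_samples))
```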

Seven LLMs are examined: OpenAI's GPT-4o-mini, Anthropic's Claude-3-Sonnet and Claude-4-Sonnet, Google's Gemini-2.0-Flash and Gemini-2.0-Lite, Meta's Llama-3-70B, and Mistral AI's Mistral-Nemo. For each model, twelve decoding configurations are explored by varying temperature (T = 0.0, 0.3, 0.7, 1.0) and nucleus-sampling top-p (0.8, 0.9, 1.0), as enumerated in the sketch below. This systematic sweep lets the authors isolate how stochasticity and probability-mass truncation each affect bias expression.
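
The sweep itself is just the Cartesian product of the two hyperparameter grids. A minimal sketch, assuming a hypothetical `evaluate` driver that runs DMBA for one model under one decoding setting:

```python
from itertools import product

MODELS = ["GPT-4o-mini", "Claude-3-Sonnet", "Claude-4-Sonnet", "Gemini-2.0-Flash",
          "Gemini-2.0-Lite", "Llama-3-70B", "Mistral-Nemo"]
TEMPERATURES = (0.0, 0.3, 0.7, 1.0)  # stochasticity
TOP_P_VALUES = (0.8, 0.9, 1.0)       # nucleus-sampling truncation

# 4 temperatures x 3 top-p values = the twelve configurations per model.
configs = list(product(TEMPERATURES, TOP_P_VALUES))
assert len(configs) == 12

def evaluate(model_name: str, temperature: float, top_p: float) -> tuple[float, float]:
    """Hypothetical driver: would run DMBA for one model under one decoding
    setting and return (explicit_agreement, implicit_completion_bias)."""
    return (0.0, 0.0)  # placeholder; wire up to a real evaluation harness

results = {(m, t, p): evaluate(m, temperature=t, top_p=p)
           for m in MODELS for t, p in configs}  # 7 models x 12 configs = 84 runs
```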

Key findings:

  1. Explicit agreement bias ranges from 0.36 to 0.43 across models, indicating that on average the models “agree” with stereotypical statements about 40 % of the time. Higher top‑p values modestly increase agreement, with the peak at top‑p = 0.9.
  2. Implicit completion bias is substantially higher, between 0.740 and 0.755, showing that when generating free-form text the models reproduce stereotypical patterns roughly three-quarters of the time. The relationship with temperature is non-linear: bias rises to a maximum at T = 0.3 and then declines slightly toward T = 1.0, an inverted-U pattern suggesting that moderate stochasticity amplifies bias while greater randomness dampens it.
  3. Correlation analysis reveals a weak or slightly negative Pearson correlation (r ≈ −0.12 to 0.08) between agreement and completion bias, meaning that a model's explicit endorsement of a stereotype does not reliably predict its generative bias; a minimal computation sketch follows this list. This underscores the necessity of measuring both metrics.
  4. Decoding sensitivity shows that increasing top‑p from 0.8 to 0.9 raises agreement by about 0.03 on average, yet has negligible impact on completion bias, indicating that nucleus sampling primarily affects the confidence of explicit choices rather than the overall generative tendency.
  5. Domain‑level analysis finds the strongest implicit bias for race and sociocultural categories (e.g., caste‑related statements) with completion bias up to 0.78, whereas explicit agreement is relatively uniform across gender and sociocultural domains and lowest for race (≈0.34). This pattern suggests that models are more cautious when explicitly stating race‑related stereotypes but still embed those biases in their generative behavior.
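
For finding 3, the correlation computation itself needs nothing beyond the standard library once per-configuration scores are collected. A minimal sketch; the numbers below are placeholders, not the paper's measurements:

```python
from statistics import correlation  # Pearson's r by default (Python 3.10+)

# One (explicit agreement, implicit completion) score pair per decoding
# configuration; placeholder values, NOT the paper's data.
agreement  = [0.36, 0.38, 0.40, 0.41, 0.42, 0.43]
completion = [0.755, 0.748, 0.742, 0.745, 0.741, 0.740]

r = correlation(agreement, completion)
print(f"Pearson r between explicit and implicit bias: {r:+.2f}")
```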

The paper contributes three major advances: (i) a publicly released, culturally specific Nepali bias benchmark adhering to FAIR data principles; (ii) the DMBA framework that bridges the gap between agreement‑based and generative bias evaluations; and (iii) empirical evidence that decoding hyper‑parameters have non‑linear effects on bias, offering a practical lever for mitigation in deployed systems.

Limitations are acknowledged: the prompts are in English, which may not fully capture nuances of Nepali discourse; only seven models are examined, leaving out newer or region‑specific LLMs; and the bias scores are probabilistic rather than grounded in human perception, necessitating future work that aligns model metrics with user‑centric evaluations. The authors propose extending the dataset to Nepali‑language prompts, incorporating human judgments, and testing debiasing interventions such as prompt engineering or post‑generation filtering.

In conclusion, this study provides the first comprehensive, dual‑metric assessment of social bias in LLMs for the Nepali cultural context, revealing that both explicit and implicit biases are prevalent, that they behave differently across decoding settings, and that culturally grounded evaluation resources are essential for building fairer AI systems in low‑resource societies.

