Cross-Language Bias Examination in Large Language Models

February 18, 2026

Reading time: 5 minute

...

📝 Original Info

Title: Cross-Language Bias Examination in Large Language Models
ArXiv ID: 2512.16029
Date: 2025-12-17
Authors: Yuxuan Liang, Marwa Mahmoud

📝 Abstract

This study introduces an innovative multilingual bias evaluation framework for assessing bias in Large Language Models, combining explicit bias assessment through the BBQ benchmark with implicit bias measurement using a prompt-based Implicit Association Test. By translating the prompts and word list into five target languages, English, Chinese, Arabic, French, and Spanish, we directly compare different types of bias across languages. The results reveal substantial gaps in bias across languages used in LLMs. For example, Arabic and Spanish consistently show higher levels of stereotype bias, while Chinese and English exhibit lower levels of bias. We also identify contrasting patterns across bias types. Age shows the lowest explicit bias but the highest implicit bias, emphasizing the importance of detecting implicit biases that are undetectable with standard benchmarks. These findings indicate that LLMs vary significantly across languages and bias dimensions. This study fills a key research gap by providing a comprehensive methodology for cross-lingual bias analysis. Ultimately, our work establishes a foundation for the development of equitable multilingual LLMs, ensuring fairness and effectiveness across diverse languages and cultures.

💡 Deep Analysis

📄 Full Content

Cross-Language Bias Examination in Large Language Models Yuxuan Liang Georgia Institute of Technology yliang372@gatech.edu Marwa Mahmoud University of Glasgow marwa.mahmoud@cl.cam.ac.uk This paper was written while I attended the Cam- bridge Online Summer Research Program under the supervision of Professor Marwa Mahmoud. Abstract This study introduces an innovative multilin- gual bias evaluation framework for assessing bias in Large Language Models, combining ex- plicit bias assessment through the BBQ bench- mark with implicit bias measurement using a prompt-based Implicit Association Test. By translating the prompts and word list into five target languages — English, Chinese, Arabic, French, and Spanish — we were able to di- rectly compare different types of bias across languages. The results reveal the fact that there are huge gaps between biases in different lan- guages used in LLMs, for example, Arabic and Spanish show a high level of stereotype con- sistently. In contrast, Chinese and English ex- hibit a lower level of bias. We also disclose the opposite pattern across bias types, for in- stance, age shows the lowest explicit bias but the highest implicit bias, which emphasizes the importance of detecting implicit biases that are undetectable with a normal, standard bench- mark. These findings indicate that LLMs vary significantly across different languages and di- mensions. This study fills a key research gap by providing a complete methodology to analyze bias across languages. Ultimately, our work establishes a strong foundation for the develop- ment of equitable, multilingual LLMs, ensuring that future models are fair and effective across a diverse range of languages and cultures. 1 Introduction Introduction In recent years, Large Language Mod- els (LLMs) have been recognized as a revolution- ary technology when people are talking about the field of Natural Language Processing (NLP). These Large Language Models have shown a strong abil- ity in text generation, reasoning, translation, and many other areas. Also, LLMs have become one of the hottest topics in the world today. Many com- mercial models, like GPT-4 (OpenAI et al., 2024), and open source models, like LLaMA (Touvron et al., 2023) have emerged. Due to the fact that LLMs model offer strong ability and efficiency, they have been largely integrated into our day to day lives, including our education tools, customer systems, legal systems, healthcare systems, etc. Ad- ditionally, many people who work with LLMs ex- pressed that LLMs not just make their work more efficient, but also more meaningful (Kobiella et al., 2025). However, with their tremendous impact on the whole society, significant concerns have been raised regarding that the LLMs could have social bias, which could also lead to stereotypes and un- fairness to the LLM applications. Most research nowadays focuses on examining bias in English in LLMs, covering a lot of dimen- sions like gender, age, race, and religion, by us- ing many benchmark like Truthful QA (Lin et al., 2022), BBQ benchmark (Parrish et al., 2022), and tools like BiasAlert (Fan et al., 2024) which could detect social bias. Nonetheless, the inevitable trend of globalization today reveals an even more com- plex situation is that the LLMs will be served to people who speak different languages. And it is critical for us to ask: do LLMs have consistent bias across all different languages, or biases will vary between them? Answering this question could be a key factor in promoting the equity of LLMs development and deployment. To answer this question, our study explores the different degrees of bias of different languages that exist in LLM by evaluating explicit bias and im- plicit bias across five languages——English(EN), Chinese(ZH), Arabic(AR), French(FR), and Span- ish(ES). The reason why we choose these five lan- guages is because these five languages are the top languages spoken in the world (International Cen- ter for Language Studies, 2024). Moreover, most of the selected languages represent different lan- 1 arXiv:2512.16029v1 [cs.CY] 17 Dec 2025 guage families, for example English is classified as Indo-European, and Mandarin Chinese is classified as Sino-Tibetan(Ethnologue, 2024). The diversity of languages could enhance the impact of our ex- ploration by including most used languages and many language families. For explicit bias testing, we translate BBQ prompts into target languages via DeepL API, care- fully preserving meaning. We then invoke GPT-4 across languages to obtain responses, from which we derive accuracy and bias scores. For implicit bias, we obtain the IAT word list, and translate them into target languages via DeepL API. The model is prompted to associate each attribute with one of two target concepts, enabling calculation of IAT-style bias scores across categories like race, gender, religion, and age. This dual-method approach, integrating explicit decision-based evaluation with implicit semantic a

📄 Read Full PDF on ArXiv