Bi-directional Bias Attribution: Debiasing Large Language Models without Modifying Prompts


Large language models (LLMs) have demonstrated impressive capabilities across a wide range of natural language processing tasks. However, their outputs often exhibit social biases, raising fairness concerns. Existing debiasing methods, such as fine-tuning on additional datasets or prompt engineering, face scalability issues or compromise the user experience in multi-turn interactions. To address these challenges, we propose a framework for detecting stereotype-inducing words and attributing bias to individual neurons in LLMs, without fine-tuning or prompt modification. Our framework first identifies stereotype-inducing adjectives and nouns via comparative analysis across demographic groups. We then attribute biased behavior to specific neurons using two attribution strategies based on integrated gradients. Finally, we mitigate bias by directly intervening on their activations at the projection layer. Experiments on three widely used LLMs demonstrate that our method effectively reduces bias while preserving overall model performance. Code is available at: https://github.com/XMUDeepLIT/Bi-directional-Bias-Attribution.
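The final mitigation step described above, intervening on the attributed neurons' activations at the projection layer, can be sketched with a PyTorch forward pre-hook. This is a minimal illustration, not the authors' implementation; the neuron indices and scaling factor are hypothetical placeholders:

```python
import torch

def register_suppression(proj_layer, biased_neurons, scale=0.0):
    """Scale the activations of attributed neurons before they reach
    the projection layer (scale=0.0 zeroes them out entirely)."""
    def pre_hook(module, inputs):
        (hidden,) = inputs
        hidden = hidden.clone()                 # avoid in-place edits
        hidden[..., biased_neurons] *= scale    # suppress biased neurons
        return (hidden,)                        # replaces the layer input
    return proj_layer.register_forward_pre_hook(pre_hook)

# Toy usage: an 8-dim hidden state projected to 4 logits.
torch.manual_seed(0)
proj = torch.nn.Linear(8, 4, bias=False)
handle = register_suppression(proj, biased_neurons=[1, 5])
h = torch.ones(8)
logits = proj(h)    # neurons 1 and 5 now contribute nothing
handle.remove()     # restore the original behavior
```

Returning a tuple from the pre-hook replaces the layer's positional arguments, so the intervention needs no change to the model's forward pass and can be removed at any time via the hook handle.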


💡 Research Summary

The paper introduces a novel framework for debiasing large language models (LLMs) that does not rely on fine‑tuning or prompt modification. The authors first define “stereotype cues” as adjectives or nouns that tend to trigger biased model behavior. Using a set of handcrafted templates, they insert each candidate cue into prompts and query the model for the probability distribution over demographic groups (e.g., gender, race, profession, religion). By computing the Shannon entropy of these distributions, they rank cues: lower entropy indicates a stronger bias‑inducing effect. The top‑ranked cues for each demographic attribute are automatically selected, eliminating the need for manually curated bias lexicons.
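The entropy-based ranking above can be sketched in a few lines. This is an illustrative toy, not the paper's code: the cue list, group probabilities, and `top_k` cutoff are made-up placeholders:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def rank_cues(cue_to_group_probs, top_k=3):
    """Rank candidate cues by the entropy of the model's demographic-group
    distribution; lower entropy = stronger bias-inducing effect."""
    scored = sorted(cue_to_group_probs.items(),
                    key=lambda kv: shannon_entropy(kv[1]))
    return [cue for cue, _ in scored[:top_k]]

# Toy example: probabilities a model might assign to (male, female)
# when each adjective fills a template slot (illustrative numbers).
cue_probs = {
    "emotional": [0.10, 0.90],   # skewed  -> low entropy
    "tall":      [0.55, 0.45],   # near-uniform -> high entropy
    "nurturing": [0.05, 0.95],
}
print(rank_cues(cue_probs, top_k=2))  # -> ['nurturing', 'emotional']
```

A uniform distribution over groups gives maximal entropy (no bias signal), while a peaked one gives low entropy, which is why the lowest-entropy cues are selected as stereotype cues.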

To trace the bias back to the model’s internal components, the authors propose two attribution strategies based on Integrated Gradients (IG). The Forward‑IG strategy treats the causal direction from stereotype cue to demographic prediction: for each neuron in the projection layer (the linear layer that maps hidden representations to logits), they linearly interpolate the neuron’s activation from zero to its original value (parameter α ∈ [0, 1]) and accumulate the gradient of the demographic‑group logit along this path, yielding an attribution score for that neuron’s contribution to the biased prediction.
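The single-neuron IG computation just described can be sketched as follows. This is a simplified illustration under assumed interfaces (a toy linear projection standing in for the model's projection layer; `steps` is an arbitrary discretization):

```python
import torch

def neuron_ig(logit_fn, activations, neuron_idx, target, steps=20):
    """Integrated-gradients attribution for one neuron: interpolate its
    activation from 0 to its original value (alpha in [0, 1]) and
    accumulate the gradient of the target logit along that path."""
    total_grad = 0.0
    for alpha in torch.linspace(0.0, 1.0, steps):
        a = activations.clone()
        a[neuron_idx] = alpha * activations[neuron_idx]
        a.requires_grad_(True)
        logit = logit_fn(a)[target]
        grad = torch.autograd.grad(logit, a)[0]
        total_grad += grad[neuron_idx].item()
    # Riemann approximation of the path integral, scaled by the
    # difference between the original activation and the zero baseline.
    return (total_grad / steps) * activations[neuron_idx].item()

# Toy projection layer: 8-dim hidden state -> 2 demographic-group logits.
torch.manual_seed(0)
proj = torch.nn.Linear(8, 2, bias=False)
h = torch.randn(8)
score = neuron_ig(lambda a: proj(a), h.detach(), neuron_idx=3, target=0)
```

For a purely linear projection the gradient is constant along the path, so the IG score reduces exactly to weight × activation; for a real LLM the path integral captures the neuron's nonlinear contribution to the group logit.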

