Bridging Fairness and Explainability: Can Input-Based Explanations Promote Fairness in Hate Speech Detection?


Natural language processing (NLP) models often replicate or amplify social bias from training data, raising concerns about fairness. At the same time, their black-box nature makes it difficult for users to recognize biased predictions and for developers to effectively mitigate them. While some studies suggest that input-based explanations can help detect and mitigate bias, others question their reliability in ensuring fairness. Existing research on explainability in fair NLP has been predominantly qualitative, with limited large-scale quantitative analysis. In this work, we conduct the first systematic study of the relationship between explainability and fairness in hate speech detection, focusing on both encoder- and decoder-only models. We examine three key dimensions: (1) identifying biased predictions, (2) selecting fair models, and (3) mitigating bias during model training. Our findings show that input-based explanations can effectively detect biased predictions and serve as useful supervision for reducing bias during training, but they are unreliable for selecting fair models among candidates. Our code is available at https://github.com/Ewanwong/fairness_x_explainability.


💡 Research Summary

This paper investigates the long‑standing assumption that input‑based explanations (often called rationales) can help improve fairness in natural language processing, specifically in the task of hate‑speech detection. While prior work has largely been qualitative and limited to a handful of explanation methods, the authors present the first large‑scale quantitative study that jointly evaluates explainability and fairness across both encoder‑only (BERT, RoBERTa) and decoder‑only large language models (Llama‑3.2‑3B, Qwen‑3‑4B/8B).

The study is organized around three research questions (RQs). RQ1 asks whether token‑level attribution scores can be used to identify biased predictions. RQ2 asks whether such explanations can automatically select the most fair model among several candidates. RQ3 asks whether explanations can be leveraged during training to reduce bias. To answer these questions, the authors use two widely‑used hate‑speech datasets (Civil Comments and Jigsaw) and focus on three protected attributes: race (Black/White), gender (Female/Male), and religion (Christian/Muslim/Jewish). Sensitive tokens are defined from a curated identity‑marker vocabulary, and for each example a counterfactual version is created by swapping the identity token.
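As a concrete illustration, counterfactual construction by identity-token swapping can be sketched as follows. The swap pairs, whitespace tokenization, and lowercase matching here are simplifying assumptions for illustration, not the paper's curated identity-marker vocabulary:

```python
# Hypothetical identity-term pairings for counterfactual generation.
# A real vocabulary would be curated and handle casing and multi-word markers.
IDENTITY_SWAPS = {
    "black": "white", "white": "black",
    "woman": "man", "man": "woman",
    "muslim": "christian", "christian": "muslim",
}

def make_counterfactual(text: str) -> str:
    """Replace each identity token with its paired counterpart.

    Tokens outside the swap vocabulary are left unchanged; matching is
    lowercase-only in this simplified sketch.
    """
    tokens = text.split()
    swapped = [IDENTITY_SWAPS.get(t.lower(), t) for t in tokens]
    return " ".join(swapped)
```

A prediction that changes between the original text and its counterfactual signals reliance on the identity token itself rather than on hateful content.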

Sixteen popular post‑hoc explanation techniques are evaluated, covering gradient‑based (Grad, Input×Grad, Integrated Gradients), perturbation‑based (Occlusion, Occlusion‑abs), propagation‑based (DeepLift), SHAP‑based (KernelSHAP), attention‑based (Attention, Attention‑rollout, Attention‑flow), and newer methods such as DecompX and Progressive Inference. For each method, the authors compute a “sensitive‑token reliance score” by taking the maximum absolute attribution among the sensitive tokens.
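The reliance score described above is simple to compute once token-level attributions are available; the sketch below assumes aligned lists of attributions and tokens and takes the maximum absolute value over the sensitive ones:

```python
def sensitive_token_reliance(attributions, tokens, sensitive_vocab):
    """Max absolute attribution over tokens in the sensitive vocabulary.

    attributions: per-token attribution scores from any explanation method.
    tokens: the corresponding token strings.
    sensitive_vocab: set of identity-marker tokens (lowercase).
    Returns 0.0 if the example contains no sensitive token.
    """
    scores = [abs(a) for a, t in zip(attributions, tokens)
              if t.lower() in sensitive_vocab]
    return max(scores, default=0.0)
```

Taking the maximum (rather than, say, the mean) makes the score sensitive to any single identity token the model leans on heavily.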

Fairness is measured with both group‑level metrics (disparities in accuracy, false‑positive rate, false‑negative rate) and an individual‑fairness metric (average change in prediction when the identity token is swapped). Higher scores indicate more bias.
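Under these definitions, the metrics might be computed roughly as follows; this is a simplified sketch of one group-level disparity (false-positive-rate gap) and the individual-fairness score, and the paper's exact formulations may differ:

```python
def fpr(preds, labels):
    """False-positive rate: fraction of true negatives predicted positive."""
    negs = [p for p, y in zip(preds, labels) if y == 0]
    return sum(negs) / len(negs) if negs else 0.0

def fpr_gap(preds, labels, groups):
    """Group-level disparity: absolute FPR difference between groups 0 and 1."""
    rates = []
    for g in (0, 1):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        rates.append(fpr([preds[i] for i in idx], [labels[i] for i in idx]))
    return abs(rates[0] - rates[1])

def individual_unfairness(probs, cf_probs):
    """Mean absolute change in predicted probability after the identity swap."""
    return sum(abs(p - q) for p, q in zip(probs, cf_probs)) / len(probs)
```

For all of these, 0 indicates parity and larger values indicate more bias, matching the convention above.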

Findings for RQ1 (bias detection).
Across all models and datasets, higher sensitive‑token reliance correlates strongly with larger group‑level disparities. Integrated Gradients and KernelSHAP achieve the highest area‑under‑curve (≈0.87) for distinguishing biased from unbiased examples, outperforming rationales generated by large language models in a human‑in‑the‑loop setting. This demonstrates that input‑based explanations can reliably flag potentially biased predictions.
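Distinguishing biased from unbiased predictions by reliance score is a ranking problem, so the reported area under the curve can be read as the probability that a biased example receives a higher reliance score than an unbiased one. A dependency-free sketch of that interpretation:

```python
def auc(scores, labels):
    """AUC as the probability that a positive (biased) example outranks a
    negative (unbiased) one, counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of roughly 0.87 thus means a biased prediction scores higher than an unbiased one about 87% of the time.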

Findings for RQ2 (model selection).
When the authors rank models by their average reliance scores and compare the ranking against actual fairness metrics, the correlation is weak. Models with low reliance scores are not consistently the fairest, especially under attention-based explanations, which tend to underestimate bias. Consequently, explanation scores alone are an unreliable basis for automatically picking the fairest model.
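Ranking agreement of this kind is typically quantified with a rank correlation; a minimal Spearman correlation (no tie handling) between a reliance-based and a fairness-based model ordering might look like:

```python
def spearman(xs, ys):
    """Spearman rank correlation between two score lists (no tie handling)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A value near 1 would mean reliance scores reproduce the true fairness ordering; the paper's finding is that the observed correlation is too weak for model selection.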

Findings for RQ3 (bias mitigation during training).
The authors introduce an explanation‑guided regularization term that penalizes high reliance on sensitive tokens. Combined with existing debiasing techniques (group balancing, counterfactual data augmentation, dropout, attention entropy, causal debias), this regularizer reduces disparity metrics substantially while incurring only a modest drop in overall accuracy (≈1–2%). For example, on BERT the accuracy disparity metric drops from 2.05 to 0.00 and the average individual unfairness score falls from 3.17 to 0.66. This shows that explanations can serve as effective supervision signals for bias reduction.
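The regularizer can be pictured as adding a penalty term to the task loss. The form below (mean sensitive-token reliance scaled by a weight `lam`) is an illustrative assumption for the sketch, not the paper's exact training objective:

```python
def debias_loss(task_loss, reliance_scores, lam=0.1):
    """Hypothetical explanation-guided objective: task loss plus a weighted
    penalty on the batch-mean sensitive-token reliance.

    task_loss: scalar classification loss (e.g. cross-entropy) for the batch.
    reliance_scores: per-example sensitive-token reliance values.
    lam: trade-off weight between accuracy and fairness.
    """
    penalty = sum(reliance_scores) / len(reliance_scores)
    return task_loss + lam * penalty
```

In practice the attributions inside the penalty must be differentiable (gradient-based methods qualify) so the model can be trained end-to-end to reduce its reliance on identity tokens.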

Additional analyses reveal that explanation methods differ markedly in computational cost and memory usage, and that large language models are sensitive to prompt design when generating rationales. Moreover, models can be deliberately trained to hide bias by suppressing attribution scores, highlighting a potential adversarial risk.

Conclusion.
Input‑based explanations are valuable for (1) detecting biased predictions and (3) guiding bias‑aware training, but they are not dependable for (2) automatically selecting the most fair model. The work underscores the promise of explanation‑driven monitoring in real‑world systems while cautioning that explanations themselves may be imperfect or manipulable. Future directions include developing fidelity metrics for explanations, building continuous fairness‑monitoring pipelines that combine explanations with other signals, and defending against attacks that aim to “explain away” bias.

