B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability
Post-hoc explanation methods for black-box models often struggle with faithfulness and human interpretability due to the lack of explainability in current neural architectures. Meanwhile, B-cos networks have been introduced to improve model explainability by proposing an architecture that removes bias terms and promotes input-weight alignment. Although B-cos networks have shown success in building explainable systems, their application has so far been limited to computer vision models and their associated training pipelines. In this work, we introduce B-cos LMs, i.e., B-cos Language Models (LMs) empowered for natural language processing (NLP) tasks. Our approach directly transforms pre-trained language models into B-cos LMs by combining B-cos conversion and task fine-tuning, improving efficiency compared to previous methods. Automatic and human evaluation results demonstrate that B-cos LMs produce more faithful and human-interpretable explanations than post-hoc methods, while maintaining task performance comparable to conventional fine-tuning. Our in-depth analysis explores how B-cos LMs differ from conventionally fine-tuned models in their learning processes and explanation patterns. Finally, we present a first exploration of transforming decoder-only models to B-cos LMs for generation tasks. Our code is available at https://github.com/Ewanwong/bcos_lm.
💡 Research Summary
The paper “B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability” addresses the critical challenge of explainability in large language models (LLMs). It posits that post-hoc explanation methods for black-box models often lack faithfulness and human interpretability due to the inherent lack of explainability in standard neural architectures. To solve this, the authors introduce B-cos Language Models (B-cos LMs), a novel approach that transforms standard pre-trained LMs into inherently explainable models.
The core innovation lies in adapting the B-cos network framework—previously successful in computer vision—to natural language processing. The B-cos transformation replaces standard linear operations with a bias-free, dynamic linear transformation that promotes alignment between input features and model weights. A key parameter B (>1) introduces “alignment pressure,” forcing the model to align its weights strongly with the most task-relevant input patterns to achieve high activation. This architectural inductive bias ensures that the model learns to focus on salient features during training, which subsequently leads to more interpretable and faithful explanations.
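The effect of the alignment pressure can be sketched numerically. The snippet below is a simplified illustration, not the authors' exact implementation: the function name `bcos_transform` is ours, and we use the variant in which the linear response w·x is rescaled by |cos(x, w)|^(B−1), so that B = 1 recovers a plain bias-free linear layer while larger B suppresses poorly aligned units:

```python
import numpy as np

def bcos_transform(x, W, B=2.0, eps=1e-9):
    """Sketch of a B-cos unit: each row's linear response is rescaled by
    |cos(x, w)|^(B-1). For B = 1 this is an ordinary bias-free linear
    layer; for B > 1, units poorly aligned with the input are damped."""
    lin = W @ x
    cos = lin / (np.linalg.norm(W, axis=1) * np.linalg.norm(x) + eps)
    return np.abs(cos) ** (B - 1) * lin

x = np.array([1.0, 0.0])
W = np.array([[1.0, 0.0],    # unit perfectly aligned with x
              [1.0, 1.0]])   # unit only partially aligned
out_b1 = bcos_transform(x, W, B=1.0)  # plain linear response: [1.0, 1.0]
out_b2 = bcos_transform(x, W, B=2.0)  # misaligned unit suppressed
```

The aligned unit keeps its full response under both settings, while the partially aligned unit's output shrinks as B grows; this is the "alignment pressure" that pushes weights toward task-relevant input patterns during training.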
The proposed pipeline for creating a B-cos LM is efficient and designed for the NLP “pre-train then fine-tune” paradigm. Starting from a standard pre-trained LM (e.g., BERT), the method involves: (1) Architectural modifications: removing all bias terms (including in LayerNorm and attention blocks) and non-linear activation functions from the prediction head to create a purely dynamic linear model. (2) Computational modifications: replacing all linear transformations with B-cos transformations, initialized with the original model’s weights. (3) Fine-tuning: the adapted model is then fine-tuned on the downstream task using a binary cross-entropy loss, which combines task adaptation and B-cos conversion into a single, efficient step. This contrasts with prior B-cosification methods for vision that required converting already task-capable models.
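The conversion of a single pre-trained layer can be pictured as follows. This is a schematic with illustrative names (`bcosify_layer` is not the authors' API): the bias term is discarded and the pre-trained weight matrix is reused unchanged to initialize the B-cos transformation, after which the whole model would be fine-tuned on the downstream task:

```python
import numpy as np

def bcosify_layer(pretrained, B=2.0, eps=1e-9):
    """Schematic B-cosification of one layer: drop the bias term and
    reuse the pre-trained weights to initialize a B-cos transformation."""
    W = pretrained["weight"]  # pre-trained weights kept as-is
    # pretrained["bias"] is intentionally discarded: B-cos layers are bias-free
    def forward(x):
        lin = W @ x
        cos = lin / (np.linalg.norm(W, axis=1) * np.linalg.norm(x) + eps)
        return np.abs(cos) ** (B - 1) * lin
    return forward

# With B = 1 the converted layer reproduces the original linear map
# (minus the bias), so pre-trained representations remain a good start.
layer = {"weight": np.array([[0.0, 2.0], [1.0, 1.0]]),
         "bias": np.array([0.5, -0.5])}
f = bcosify_layer(layer, B=1.0)
y = f(np.array([3.0, 4.0]))
```

Because the pre-trained weights survive the conversion, task adaptation and B-cos conversion can proceed in the single fine-tuning step the paper describes, rather than requiring a separately trained task-capable model first.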
Comprehensive evaluations on text classification tasks (SST-2, HateXplain) demonstrate the effectiveness of B-cos LMs. In automatic evaluations measuring faithfulness, B-cos LM explanations significantly outperform popular post-hoc methods like Gradient, Integrated Gradients, and Attention, as measured by metrics such as Area Over the Perturbation Curve (AOPC) and Log-odds Ratio. Human evaluations further confirm that explanations from B-cos LMs are rated higher in relevance, clarity, and helpfulness. Crucially, this gain in explainability does not come at the cost of performance; B-cos LMs maintain task accuracy comparable to conventionally fine-tuned models.
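AOPC, one of the faithfulness metrics above, can be sketched as follows. This is a simplified variant: the function names and the cumulative `[MASK]`-token perturbation are illustrative assumptions, not the paper's exact protocol. Tokens are masked in order of attributed importance, and the average drop in the predicted-class probability is reported; larger drops mean the explanation identified tokens the model genuinely relies on:

```python
import numpy as np

def aopc(predict, tokens, attributions, k_max=5, mask="[MASK]"):
    """Area Over the Perturbation Curve (simplified): mask the k most
    important tokens for k = 1..k_max and average the probability drop."""
    order = np.argsort(attributions)[::-1]     # most important first
    p0 = predict(tokens)
    masked, drops = list(tokens), []
    for k in range(1, min(k_max, len(tokens)) + 1):
        masked[order[k - 1]] = mask            # cumulative masking
        drops.append(p0 - predict(masked))
    return float(np.mean(drops))

def toy_predict(toks):
    # toy "classifier": confident only while the decisive token is visible
    return 0.9 if "great" in toks else 0.1

score = aopc(toy_predict, ["great", "movie", "!"], [0.9, 0.1, 0.0], k_max=2)
```

In the toy example the attribution correctly ranks the decisive token first, so masking it immediately collapses the prediction and the AOPC is high (0.8); a less faithful attribution ranking would yield a smaller score.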
The paper provides an in-depth analysis of the models’ behavior. It shows that B-cos LMs converge faster than both conventional fine-tuning and previous B-cos methods, quickly learning to align weights with relevant tokens. Visualizations reveal that B-cos LM explanations are more focused and less noisy, accurately highlighting salient words while suppressing irrelevant ones. Finally, the authors present a pioneering exploration of applying B-cos transformation to decoder-only models (GPT-2) for generation tasks, demonstrating the potential for broader applicability in the era of LLMs.
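The focused explanations stem from the dynamic linearity noted earlier: with bias terms removed, a B-cos layer's output equals an input-dependent linear map W_dyn(x) @ x, so the elementwise products W_dyn * x decompose the output exactly into per-feature contributions. A sketch under a simplified B-cos formulation, with illustrative names:

```python
import numpy as np

def bcos_dynamic_weights(x, W, B=2.0, eps=1e-9):
    """A B-cos layer is 'dynamic linear': its output equals W_dyn(x) @ x
    for the input-dependent matrix returned here (simplified sketch)."""
    cos = (W @ x) / (np.linalg.norm(W, axis=1) * np.linalg.norm(x) + eps)
    return np.abs(cos)[:, None] ** (B - 1) * W

x = np.array([1.0, 0.0, 2.0])
W = np.array([[0.5, -1.0, 2.0],
              [1.0, 1.0, 0.0]])
W_dyn = bcos_dynamic_weights(x, W, B=2.0)
contribs = W_dyn * x   # contribution of each input feature to each output
out = W_dyn @ x        # equals the B-cos layer's actual output
```

The contributions sum exactly to the model output by construction, which is why such explanations are faithful without any post-hoc approximation; features with zero input (like the second one here) receive zero contribution.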
In summary, this work presents a practical and efficient framework for converting powerful pre-trained LMs into models that are both high-performing and inherently explainable by design, offering a significant step towards more transparent and trustworthy AI systems for NLP.