Building Interpretable Models for Moral Decision-Making

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please consult the original arXiv source.

We build a custom transformer model to study how neural networks make moral decisions on trolley-style dilemmas. The model processes structured scenarios using embeddings that encode who is affected, how many people, and which outcome they belong to. Our 2-layer architecture achieves 77% accuracy on Moral Machine data while remaining small enough for detailed analysis. We apply complementary interpretability techniques to uncover how moral reasoning is distributed across the network, demonstrating, among other findings, that biases localize to distinct computational stages.


💡 Research Summary

The paper presents a purpose‑built transformer model for studying moral decision‑making on trolley‑style dilemmas. Instead of applying large pre‑trained language models, the authors design a compact architecture (≈104 K parameters, 2 encoder layers, 2 attention heads, embedding dimension 64) that directly reflects the hypothesized structure of moral reasoning: who is affected, how many are affected, and which side they belong to. Each scenario is encoded as a pair of outcomes; for each of the 23 character types the model receives a concatenated embedding of character identity, cardinality, and team label. By omitting positional embeddings and relying on the team embeddings, the model forces self‑attention to perform cross‑outcome comparisons rather than sequence ordering.
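The structured encoding described above can be sketched as follows. This is a minimal illustration, not the authors' code: the component dimensions (32/16/16) and the count vocabulary size are assumptions, since the paper only states that the three embeddings are concatenated into a 64-dimensional token.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 23 character types, embedding dimension 64 split
# across three components (the exact split is an assumption).
N_CHARS, N_COUNTS, N_TEAMS = 23, 6, 2
D_CHAR, D_COUNT, D_TEAM = 32, 16, 16   # concatenates to d = 64

char_emb  = rng.normal(size=(N_CHARS, D_CHAR))
count_emb = rng.normal(size=(N_COUNTS, D_COUNT))
team_emb  = rng.normal(size=(N_TEAMS, D_TEAM))

def encode_slot(char_id: int, count: int, team: int) -> np.ndarray:
    """One input token: [who | how many | which outcome], concatenated.
    No positional embedding is added, so self-attention can distinguish
    the two outcomes only through the team component."""
    return np.concatenate([char_emb[char_id],
                           count_emb[count],
                           team_emb[team]])

# A "Man 3 vs Criminal 3"-style scenario as two input tokens
token_a = encode_slot(char_id=0, count=3, team=0)
token_b = encode_slot(char_id=7, count=3, team=1)
```

Because ordering information is absent, swapping the two outcomes while swapping their team labels yields an equivalent input, which is what forces attention to perform cross-outcome comparison.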

Training on a curated subset of the Moral Machine dataset (5.4 M unique scenarios, 1.7 M held‑out for validation) yields 77 % validation accuracy, comparable to much larger models evaluated on the same data. The authors deliberately choose a slightly sub‑optimal configuration (d = 64, H = 2, L = 2) to prioritize interpretability and computational efficiency over a marginal 0.4 % gain.

Interpretability is explored through three complementary techniques:

  1. Causal Intervention – Using the DoWhy framework, the authors estimate the average treatment effect (ATE) of each character type while controlling for total group sizes. Results reveal a clear moral hierarchy: “Pregnant” and “Stroller” have the strongest positive effects (+0.12, +0.11), while “Criminal” exerts the strongest negative effect (‑0.10). Generic categories such as “Man” and “Woman” cluster near zero, indicating they serve as baselines.

  2. Layer‑wise Bias Localization – Attention weights from each layer and head are combined with variance and correlation statistics to compute bias scores across five dimensions (legality, gender, social role, age, species). The analysis shows that legality bias is almost entirely captured in layer 0, species bias emerges in layer 1, and other biases are distributed but tend to specialize in different heads. This suggests early layers encode individual attributes, while later layers perform the comparative reasoning that yields the final moral preference.

  3. Circuit Probing – Sparse binary masks are learned over the frozen model to identify a minimal set of neurons that compute the moral scoring signal. In the CLS‑MLP block, only 45 out of 256 neurons (≈18 % of the layer) are selected, achieving a K‑nearest‑neighbor accuracy of 0.956 on held‑out examples. Ablating this sub‑circuit reduces agreement with the model’s own scores by 1.2 percentage points, accounting for roughly 8 % of the model’s performance margin over a class‑imbalanced baseline. This demonstrates that the moral computation is implemented by a highly localized subnetwork.
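The circuit-probing idea in item 3 can be sketched with a toy frozen MLP. Note the simplifications: the paper *learns* the binary mask, whereas this sketch selects neurons by a simple output-weight-magnitude heuristic; the weights are random placeholders, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the frozen CLS-MLP block: 64 -> 256 -> 1 (random
# weights, for illustration only).
W1 = rng.normal(size=(64, 256))
b1 = rng.normal(size=256)
w2 = rng.normal(size=256)

def score(x, mask):
    h = np.maximum(x @ W1 + b1, 0.0)   # ReLU hidden layer
    return (h * mask) @ w2             # mask zeroes the ablated neurons

# Stand-in selection rule: keep the 45 neurons with the largest
# output-weight magnitude (the paper learns this mask instead).
keep = np.argsort(-np.abs(w2))[:45]
mask_circuit = np.zeros(256)
mask_circuit[keep] = 1.0
mask_full = np.ones(256)

X = rng.normal(size=(1000, 64))
full_decisions    = score(X, mask_full) > 0
circuit_decisions = score(X, mask_circuit) > 0
ablated_decisions = score(X, mask_full - mask_circuit) > 0

agree_circuit = (full_decisions == circuit_decisions).mean()
agree_ablated = (full_decisions == ablated_decisions).mean()
```

The two agreement numbers mirror the paper's evaluation: how well the isolated sub-circuit reproduces the full model's decisions, and how much agreement drops when that sub-circuit is ablated.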

Additionally, the authors provide token‑level explanations via gradient‑weighted attention relevance, showing that in a “Man 3 vs Criminal 3” scenario the “Criminal” token contributes about 27 % of the decision evidence, while demographic tokens contribute minimally.
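A common formulation of gradient-weighted attention relevance assigns each key token the ReLU of (attention × gradient), summed over heads and queries, then normalized to fractions. The sketch below uses random placeholder tensors rather than values from the paper's model, so the printed percentages are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
tokens = ["Man", "3", "vs", "Criminal", "3"]

A = rng.uniform(size=(2, 5, 5))          # (heads, query, key) attention
A /= A.sum(axis=-1, keepdims=True)       # each row is a softmax distribution
G = rng.normal(size=(2, 5, 5))           # gradient of the score w.r.t. A

# Per-key relevance: relu(A * G), summed over heads and queries,
# normalized so the token contributions sum to 1.
rel = np.maximum(A * G, 0.0).sum(axis=(0, 1))
rel = rel / rel.sum()

for tok, r in zip(tokens, rel):
    print(f"{tok:>8}: {r:.1%}")
```

With the trained model's tensors in place of the placeholders, this is the kind of computation that yields statements like "the 'Criminal' token contributes about 27 % of the decision evidence."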

The discussion emphasizes that a small, purpose‑designed transformer can achieve competitive moral decision performance while remaining amenable to mechanistic analysis. The authors acknowledge that training on aggregate human preferences inevitably inherits cultural biases, but the interpretability pipeline enables targeted debiasing (e.g., orthogonalizing representations in the identified “legality” head). Limitations include reliance on a single cultural dataset and the simplicity of trolley‑style dilemmas; future work is suggested to incorporate multi‑cultural data, richer scenario modalities, and automated bias mitigation strategies.

In sum, the paper demonstrates that moral competence does not require massive pretrained models; a carefully structured, lightweight transformer can both learn human‑like moral preferences and reveal, with fine granularity, where and how those preferences arise within the network. This work provides a concrete blueprint for building transparent, ethically accountable AI systems.

