MM-SCALE: Grounded Multimodal Moral Reasoning via Scalar Judgment and Listwise Alignment

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

Vision-Language Models (VLMs) continue to struggle with morally salient judgments in multimodal and socially ambiguous contexts. Prior work typically relies on binary or pairwise supervision, which often fails to capture the continuous and pluralistic nature of human moral reasoning. We present MM-SCALE (Multimodal Moral Scale), a large-scale dataset for aligning VLMs with human moral preferences through 5-point scalar ratings and explicit modality grounding. Each image-scenario pair is annotated by human raters, using an interface tailored for data collection, with moral acceptability scores and grounded reasoning labels, enabling listwise preference optimization over ranked scenario sets. By moving from discrete to scalar supervision, our framework provides richer alignment signals and finer calibration of multimodal moral reasoning. Experiments show that VLMs fine-tuned on MM-SCALE achieve higher ranking fidelity and more stable safety calibration than those trained with binary signals.


💡 Research Summary

The paper introduces MM‑SCALE, a large‑scale multimodal moral reasoning dataset and alignment framework designed to improve the safety and ethical behavior of vision‑language models (VLMs). Existing benchmarks for VLM safety typically rely on binary labels (safe/unsafe) or pairwise preferences, which fail to capture the continuous, context‑dependent nature of human moral judgments. MM‑SCALE addresses these gaps by providing 32,212 image‑scenario pairs annotated with 5‑point scalar acceptability scores and explicit modality grounding labels indicating whether the judgment is based on text, image, or both.

Dataset construction begins with extracting everyday social‑norm situations from the Commonsense NormBank. Each situation is rendered into an image using text‑to‑image generators (Stable Diffusion v1.5 and DALL·E 3), preserving the core context while avoiding explicit moral cues. Human annotators then view each image and generate several plausible moral scenarios. For every scenario they assign a scalar rating (1–5) and a modality label (text, image, both). Each item receives three independent annotations; inter‑annotator agreement is strong (Krippendorff’s α = 0.74 for scores, 0.71 for modality). Notably, 68 % of scenarios exhibit a shift in judgment when the image is considered, and 78 % of those shifts are grounded in visual information, underscoring the importance of multimodal cues.
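Since each item receives three independent annotations, the per-item labels must be combined somehow. The summary does not say how MM-SCALE aggregates them, so the sketch below is one plausible scheme (median score, majority-vote modality); the field names and the `aggregate` function are our assumptions, not from the paper.

```python
from statistics import median
from collections import Counter

def aggregate(ratings: list[int], modalities: list[str]) -> tuple[float, str]:
    """Combine three annotators' 1-5 acceptability scores and modality
    labels (each one of 'text', 'image', 'both') into a single item label.
    Illustrative only: the actual MM-SCALE aggregation rule is unstated."""
    score = median(ratings)  # robust to a single outlying annotator
    modality, _ = Counter(modalities).most_common(1)[0]  # majority vote
    return score, modality
```

A median is a common choice for ordinal 1-5 scales because, unlike a mean, it stays on the scale and is insensitive to one disagreeing rater.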

A novel interactive annotation interface, MORAL‑E, implements a model‑in‑the‑loop workflow. The VLM first predicts a moral score for each scenario; if the absolute difference between the model’s prediction and the human rating exceeds one point, the case is flagged for re‑evaluation and the annotator must also specify the grounding modality. When the discrepancy is small, annotators are prompted to add new image‑grounded scenarios, expanding the dataset with hard cases. This loop both improves data quality and creates a focused set of disagreement examples for fine‑tuning.

For training, the authors adopt a listwise preference optimization method (ListMLE) that learns from the full ranking of scenarios within the same image rather than isolated pairwise comparisons. Listwise learning is more annotation‑efficient and better captures relative acceptability. Experiments fine‑tune CLIP‑B/ViT‑G based VLMs on MM‑SCALE and compare them against models trained with binary cross‑entropy (VLGuard) or pairwise preference loss (SP‑A‑VL). Evaluation metrics include NDCG@5, Kendall’s τ, and Unsafe Rate across synthetic and real‑image subsets (the latter drawn from Visual Genome). MM‑SCALE‑trained models achieve consistent improvements of 6 %–12 % across all metrics, with especially strong gains in ranking consistency for multiple scenarios sharing the same visual context. Performance differences between synthetic and real images are negligible (Δ ≤ 0.02), indicating that the dataset does not suffer from generation artifacts.
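The ListMLE objective used here is the negative log-likelihood of the human ranking under a Plackett-Luce model of the predicted scores. The plain-Python sketch below shows the computation for a single ranked scenario list; in the actual training setup this would be a differentiable loss over VLM score heads, and the function name is ours.

```python
import math

def listmle_loss(scores_in_true_order: list[float]) -> float:
    """ListMLE for one image's scenario list.

    scores_in_true_order: the model's scalar scores sorted by the
    ground-truth human ranking, most acceptable scenario first.
    Loss = -sum_i [ s_i - log(sum_{j>=i} exp(s_j)) ].
    """
    loss = 0.0
    for i, s_i in enumerate(scores_in_true_order):
        tail = scores_in_true_order[i:]  # candidates not yet "picked"
        log_z = math.log(sum(math.exp(s) for s in tail))
        loss += log_z - s_i
    return loss
```

The loss shrinks as the model's scores separate scenarios in the human order, e.g. `listmle_loss([3.0, 2.0, 1.0])` is much smaller than `listmle_loss([1.0, 2.0, 3.0])`, which is why a single ranked list carries more signal than the equivalent set of pairwise comparisons.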

The paper’s contributions are threefold: (1) the release of a multimodal moral dataset that combines scalar judgments with modality grounding, (2) the demonstration that listwise scalar supervision yields superior alignment compared to binary or pairwise approaches, and (3) the development of the MORAL‑E interface for scalable, disagreement‑driven data collection. Limitations include a cultural bias toward Western norms, the coarse granularity of a 5‑point scale, and residual bias from the underlying text‑to‑image models. Future work is suggested to expand cultural diversity, explore finer‑grained rating scales, and automate prompt generation to reduce annotation costs.

In summary, MM‑SCALE provides a richer, more nuanced supervision signal for VLMs, enabling them to reason about moral acceptability in a way that respects both textual intent and visual context, and the listwise training paradigm proves effective for aligning models with the continuous spectrum of human moral preferences.

