Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment
Existing work on value alignment typically characterizes value relations statically, ignoring how interventions (such as prompting, fine-tuning, or preference optimization) reshape the broader value system. We introduce the Value Alignment Tax (VAT), a framework that measures how alignment-induced changes propagate across interconnected values relative to achieved on-target gain. VAT captures the dynamics of value expression under alignment pressure. Using a controlled scenario-action dataset grounded in Schwartz value theory, we collect paired pre-post normative judgments and analyze alignment effects across models, values, and alignment strategies. Our results show that alignment often produces uneven, structured co-movement among values. These effects are invisible under conventional target-only evaluation, revealing systemic, process-level alignment risks and offering new insights into the dynamics of value alignment in LLMs.
💡 Research Summary
The paper introduces the Value Alignment Tax (VAT), a quantitative framework for measuring how alignment interventions on large language models (LLMs) affect the broader system of human values beyond the targeted improvement. Existing alignment research typically evaluates only the gain on a chosen value or treats each value as an independent scalar, ignoring the relational structure that underlies human value systems (e.g., Schwartz’s ten basic values). VAT addresses this gap by capturing both first‑order marginal shifts and second‑order systemic coupling among values.
The authors construct a large, culturally diverse evaluation dataset consisting of 29,568 scenario‑action pairs. Each scenario is grounded in a specific country and social domain (12 countries × 11 domains). For each scenario, a target Schwartz value and a polarity (express or suppress) are specified, and the model generates a concrete action. Human annotators (27 participants) rate the pairs on criteria including realism, cultural grounding, action correctness, and harmlessness. Inter‑annotator agreement is reported as high, which supports the dataset’s quality.
VAT operates on norm‑based evidence extracted from Likert‑style judgments. For each micro‑value u within a scenario‑action pair, a signed evidence score e(u|s,a) is derived, then aggregated to a value score E_s(v) for each of the ten Schwartz values. The shift δ_s(v) = E_s^post(v) – E_s^pre(v) constitutes the sample‑level representation of alignment impact.
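The pre/post shift described above can be illustrated with a minimal Python sketch. The summary does not state how micro-value evidence is aggregated, so the plain mean used here is an assumption, and the micro-value names, the `MICRO_TO_VALUE` mapping, and the function names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical micro-value -> Schwartz value mapping (illustrative only,
# not taken from the paper).
MICRO_TO_VALUE = {
    "honesty": "benevolence",
    "helpfulness": "benevolence",
    "ambition": "achievement",
}


def value_scores(evidence):
    """Aggregate signed micro-value evidence e(u|s,a) into value scores E_s(v).

    Assumption: each value score is the plain mean of its micro-value scores.
    """
    grouped = defaultdict(list)
    for micro, score in evidence.items():
        grouped[MICRO_TO_VALUE[micro]].append(score)
    return {value: mean(scores) for value, scores in grouped.items()}


def shift(pre_evidence, post_evidence):
    """Return delta_s(v) = E_s^post(v) - E_s^pre(v) for values scored in both."""
    e_pre = value_scores(pre_evidence)
    e_post = value_scores(post_evidence)
    return {v: e_post[v] - e_pre[v] for v in e_pre if v in e_post}


# Made-up evidence for one scenario-action pair, before and after alignment.
pre = {"honesty": 1.0, "helpfulness": 0.0, "ambition": -1.0}
post = {"honesty": 1.0, "helpfulness": 1.0, "ambition": -1.5}
delta = shift(pre, post)
```

In this toy case the benevolence score rises by 0.5 while the achievement score falls by 0.5, which is the kind of sample-level co-movement the framework is meant to expose.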
Two measurement levels are defined:
- Gain‑Normalized Deviation (GND) – The average shift of each non‑target value normalized by the absolute gain on the target value. This yields a first‑order “collateral damage” metric that is comparable across interventions of differing strength.
- Systemic Coupling – For each value v, the vector of sample‑level shifts z_v =
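The first of these metrics, GND, can be sketched in a few lines of Python. This is one reading of the verbal definition, not the authors' implementation: it assumes the average is a signed mean over samples and that the target gain is the absolute value of the mean target shift. The function name and the sample data are hypothetical.

```python
from statistics import mean


def gain_normalized_deviation(shifts, target):
    """Compute a GND score for every non-target value.

    shifts: list of per-sample dicts {value: delta_s(value)}.
    Assumption: GND(v) = mean_s delta_s(v) / |mean_s delta_s(target)|.
    """
    gain = abs(mean(s[target] for s in shifts))
    if gain == 0:
        raise ValueError("target gain is zero; GND is undefined")
    values = shifts[0].keys()
    return {
        v: mean(s[v] for s in shifts) / gain
        for v in values
        if v != target
    }


# Made-up shifts for two samples, with benevolence as the target value.
shifts = [
    {"benevolence": 0.5, "power": -0.25, "security": 0.0},
    {"benevolence": 1.5, "power": -0.75, "security": 0.5},
]
gnd = gain_normalized_deviation(shifts, target="benevolence")
```

Dividing by the target gain is what makes a weak and a strong intervention comparable: here power moves by half the target gain in the opposite direction, regardless of how large the gain itself is.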