Scaling Laws for Moral Machine Judgment in Large Language Models
Autonomous systems increasingly require moral judgment capabilities, yet whether these capabilities scale predictably with model size remains unexplored. We systematically evaluate 75 large language model configurations (0.27B–1000B parameters) using the Moral Machine framework, measuring alignment with human preferences in life-death dilemmas. We observe a consistent power-law relationship in which distance from human preferences ($D$) decreases as $D \propto S^{-0.10\pm0.01}$ ($R^2=0.50$, $p<0.001$), where $S$ is model size. Mixed-effects models confirm this relationship persists after controlling for model family and reasoning capabilities. Extended reasoning models show significantly better alignment, an effect more pronounced in smaller models (size$\times$reasoning interaction: $p = 0.024$). The relationship holds across diverse architectures, while variance decreases at larger scales, indicating the systematic emergence of more reliable moral judgment with computational scale. These findings extend scaling law research to value-based judgments and provide empirical foundations for artificial intelligence governance.
💡 Research Summary
The paper investigates whether moral judgment capabilities of large language models (LLMs) follow predictable scaling laws as model size increases. Building on the Moral Machine dataset—a large‑scale collection of human decisions in autonomous‑vehicle dilemmas—the authors evaluate 75 model configurations ranging from 0.27 billion to 1 trillion parameters. For each model, they generate 10,000 synthetic scenarios that systematically vary nine moral factors (age, gender, social status, fitness, species, legality, number of characters, intervention type, and passenger vs. pedestrian status). Model responses are coded into binary choices, and the average marginal component effect (AMCE) for each factor is computed. The nine‑dimensional AMCE vector for a model is compared to the human AMCE vector using Euclidean distance D; smaller D indicates closer alignment with human moral preferences.
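The alignment metric described above can be sketched in a few lines. The following is a minimal illustration, not the authors' code: `amce` computes a simple difference-in-means estimate of the average marginal component effect for one factor, and `alignment_distance` takes the Euclidean distance between two nine-dimensional AMCE vectors. The factor names and the toy inputs are taken from the summary; everything else is a hypothetical simplification.

```python
import numpy as np

# The nine moral factors varied across the synthetic scenarios (from the paper).
FACTORS = ["age", "gender", "social_status", "fitness", "species",
           "legality", "num_characters", "intervention", "passenger_status"]

def amce(choices, attribute_present):
    """Difference-in-means sketch of the average marginal component effect:
    P(spared | attribute present) - P(spared | attribute absent),
    where each choice is binary (1 = spared, 0 = not spared)."""
    choices = np.asarray(choices, dtype=float)
    mask = np.asarray(attribute_present, dtype=bool)
    return choices[mask].mean() - choices[~mask].mean()

def alignment_distance(model_amce, human_amce):
    """Euclidean distance D between two AMCE vectors (one entry per factor);
    smaller D means the model is closer to human moral preferences."""
    diff = np.asarray(model_amce, dtype=float) - np.asarray(human_amce, dtype=float)
    return float(np.linalg.norm(diff))
```

In the paper's setup, each model would contribute one nine-entry AMCE vector (one `amce` value per factor over its 10,000 scenario responses), compared against the human vector derived from the Moral Machine dataset.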
The central empirical finding is a power‑law relationship between model size (S) and distance D:
D ∝ S^(−0.10 ± 0.01),
with an R² of 0.50 and a highly significant Spearman correlation (ρ = −0.73, p ≈ 10⁻¹³). Alternative functional forms (linear, logarithmic, exponential) fit substantially worse, confirming that the power law best captures the scaling pattern. Moreover, variance in D shrinks as models become larger, suggesting that scaling not only improves average alignment but also makes moral judgments more reliable.
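A power law is linear in log-log space, so the fit above can be reproduced with ordinary least squares on log-transformed data. This sketch (hypothetical helper, not the authors' code) recovers the exponent k, the scale constant c, and the R² of the log-log fit:

```python
import numpy as np

def fit_power_law(sizes, distances):
    """Fit D = c * S**k by least squares in log-log space
    (log D = log c + k * log S), returning (k, c, R^2)."""
    log_s = np.log(np.asarray(sizes, dtype=float))
    log_d = np.log(np.asarray(distances, dtype=float))
    k, log_c = np.polyfit(log_s, log_d, 1)      # slope = exponent k
    pred = log_c + k * log_s
    ss_res = np.sum((log_d - pred) ** 2)
    ss_tot = np.sum((log_d - log_d.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return float(k), float(np.exp(log_c)), float(r2)
```

Applied to the paper's 75 (size, distance) pairs, such a fit would yield k ≈ −0.10 with R² ≈ 0.50; comparing this R² against linear, logarithmic, and exponential fits is how the functional forms are adjudicated.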
To rule out confounding factors, the authors fit a series of linear mixed‑effects models. Model family (e.g., Llama, Gemma, Qwen, DeepSeek, proprietary APIs) is treated as a random effect, while fixed effects include model size, release year (as a proxy for time‑dependent improvements), and a binary indicator for extended reasoning capability (chain‑of‑thought or “thinking‑mode” variants). Adding release year does not improve fit, indicating that temporal advances beyond scale do not explain the observed trend. Introducing the reasoning indicator yields a significant reduction in D (β = ‑0.16, p = 0.001), and a size × reasoning interaction is also significant (β = 0.057, p = 0.024). This interaction shows that extended reasoning benefits smaller models more strongly, while very large models already capture much of this capability through sheer scale.
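A mixed-effects analysis of this shape can be expressed with `statsmodels`' formula API. The snippet below is an illustrative sketch on synthetic data (the variable names, effect sizes, and data-generating process are invented for the example, loosely mirroring the summary): fixed effects for log size, the reasoning indicator, and their interaction, with a random intercept grouped by model family.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Hypothetical synthetic stand-in for the paper's model configurations:
# five families, log10 parameter counts, a binary reasoning flag, and an
# alignment distance D generated with a negative size effect.
rows = []
for family in ["llama", "gemma", "qwen", "deepseek", "api"]:
    offset = rng.normal(0, 0.02)              # family-level random intercept
    for _ in range(12):
        log_size = rng.uniform(-0.5, 3.0)     # log10 of billions of parameters
        reasoning = int(rng.integers(0, 2))   # extended-reasoning variant?
        d = (0.30 - 0.05 * log_size - 0.03 * reasoning
             + offset + rng.normal(0, 0.01))
        rows.append(dict(family=family, log_size=log_size,
                         reasoning=reasoning, D=d))
data = pd.DataFrame(rows)

# Linear mixed-effects model: D ~ size * reasoning (fixed effects),
# random intercept per model family.
model = smf.mixedlm("D ~ log_size * reasoning", data, groups=data["family"])
result = model.fit()
print(result.summary())
```

In this sketch the fitted `log_size` and `reasoning` coefficients come out negative, mirroring the paper's reported β values; the paper additionally fits random slopes per family (`re_formula="~log_size"` in `statsmodels`) and a release-year covariate.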
Family‑specific analyses reveal that the negative size‑D relationship holds across all major families, albeit with varying statistical power due to differing sample sizes (8–19 models per family). Random‑effect variances are modest (intercept SD = 0.019, slope SD = 0.017), confirming that the scaling law is not driven by any single architecture or training regimen.
The discussion emphasizes three key implications. First, moral alignment appears to be an emergent property of scale, extending the well‑known scaling laws for language modeling and reasoning to value‑based judgments. Second, architectural innovations that enable explicit reasoning can substantially boost alignment, especially when computational resources are limited, offering a practical route for edge‑device or robotics deployments. Third, the reduction in performance variance at larger scales enhances predictability and safety, a crucial requirement for real‑world autonomous systems.
Limitations include the narrow temporal window (most models released in 2024–2025), the reliance on parameter count as the sole proxy for scale, and the potential cultural bias inherent in the Moral Machine dataset. Future work should explore additional scaling dimensions (training tokens, data diversity), incorporate multi‑cultural preference datasets, and test whether similar laws hold for other ethical frameworks (e.g., deontological or virtue‑ethics tasks).
In sum, the study provides robust empirical evidence that computational scale systematically improves moral judgment alignment in LLMs, following a clear power‑law, and that extended reasoning mechanisms offer complementary gains, especially for smaller models. These findings supply a quantitative foundation for AI governance, safety certification, and the design of ethically aligned language models.