Differences in the Moral Foundations of Large Language Models
📝 Abstract
Large language models are increasingly being used in critical domains of politics, business, and education, but the nature of their normative ethical judgment remains opaque. Alignment research has, to date, not sufficiently utilized perspectives and insights from the field of moral psychology to inform training and evaluation of frontier models. I perform a synthetic experiment on a wide range of models from most major model providers using Jonathan Haidt’s influential moral foundations theory (MFT) to elicit diverse value judgments from LLMs. Using multiple descriptive statistical approaches, I document the bias and variance of large language model responses relative to a human baseline in the original survey. My results suggest that models rely on different moral foundations from one another and from a nationally representative human baseline, and these differences increase as model capabilities increase. This work seeks to spur further analysis of LLMs using MFT, including finetuning of open-source models, and greater deliberation by policymakers on the importance of moral foundations for LLM alignment.
📄 Content
Large Language Models (LLMs) have rapidly become ubiquitous in educational, professional, governmental, and personal affairs. More and more, humans are relying on LLMs to serve as assistants, partners, and advisors on important matters. The extent of LLM involvement in core societal functions generates an imperative for researchers to inform the public about the nature and limitations of these systems.
The field of alignment research has, to date, been relatively segmented into two streams. A first stream has focused on “narrow” problems in alignment relating to bias and harm. A second stream has focused on “broad” problems relating to catastrophic risks, oversight, and control.
This paper seeks to focus on the under-investigated middle ground between these two perspectives by characterizing variations in the value judgments of LLMs when presented with moral vignettes using Jonathan Haidt’s Moral Foundations Theory (MFT). MFT uses a factor analysis to decompose our moral intuitions into a set of stable values: care, fairness, loyalty, authority, sanctity, and liberty. MFT is particularly useful for this analysis for two reasons. First, as a factor analysis, it is amenable to dimensionality reduction, which is useful for effectively visualizing variations in elicited value judgments. Second, MFT has been shown to be correlated with American political attitudes and has significant cultural variation, making it relevant to current debates over LLM alignment.
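Because MFT scores form a low-dimensional vector per respondent (one value per foundation), the dimensionality reduction mentioned above can be done with a standard principal-component projection. The sketch below is illustrative only: the score values, model labels, and choice of two components are assumptions, not data from the paper.

```python
import numpy as np

# Hypothetical mean scores on the six foundations (rows: respondents,
# columns: care, fairness, loyalty, authority, sanctity, liberty).
# All values are made up for illustration.
scores = np.array([
    [4.2, 4.0, 2.1, 2.3, 1.9, 3.5],  # "model A" (illustrative)
    [3.9, 4.1, 2.5, 2.6, 2.2, 3.3],  # "model B" (illustrative)
    [3.1, 3.2, 3.0, 3.1, 2.9, 3.0],  # human baseline (illustrative)
])

# Center the data, then project onto the top two principal components
# (computed via SVD) so differences can be plotted in 2-D.
centered = scores - scores.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ vt[:2].T  # shape: (n_respondents, 2)
print(projected.shape)
```

In practice each row would be one model's (or one human subgroup's) mean foundation profile, and the 2-D projection makes clusters of providers visually separable.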
The paper consists of a synthetic experiment with frontier language models utilizing a survey of 116 moral vignettes previously administered to a nationally representative sample of Americans by Clifford et al. [2015]. The paper compares the results of the synthetic experiment across models and in relation to this human baseline, and documents four key findings:
1. Observed differences in moral judgments by models support the view that moral foundations theory is a useful construct in understanding the moral biases of LLMs.
2. Most LLMs value traditionally liberal foundations of care and fairness more strongly than traditionally conservative values of authority, loyalty, and sanctity, relative to the human baseline.
3. Model providers exhibit systematic variance from one another in their relative weighting of moral foundations.
4. For each model provider, larger and more capable models move further from the human baseline.
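The comparisons in findings 2–4 reduce to measuring each model's deviation from the human baseline, foundation by foundation. A minimal sketch of that bookkeeping follows; the Likert values here are invented placeholders, not results from the experiment.

```python
import numpy as np

# Hypothetical mean Likert ratings per foundation (values illustrative).
foundations = ["care", "fairness", "loyalty", "authority", "sanctity", "liberty"]
human = np.array([3.1, 3.2, 3.0, 3.1, 2.9, 3.0])   # baseline (made up)
model = np.array([4.2, 4.0, 2.1, 2.3, 1.9, 3.5])   # one LLM (made up)

# Signed bias per foundation: positive means the model weights that
# foundation more strongly than the human baseline does.
bias = model - human

# Collapse to one summary number per model: mean absolute deviation.
distance = np.abs(bias).mean()

for name, b in zip(foundations, bias):
    print(f"{name}: {b:+.1f}")
print(f"mean absolute deviation: {distance:.2f}")
```

Tracking this single distance across model sizes within a provider is one simple way to operationalize the claim that more capable models drift further from the baseline.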
2 Related Work
Metaethics, the field of philosophy concerned with the grounding of ethical judgments, has preoccupied humanity for thousands of years. Early documentation of metaethical discussion comes from Plato’s Republic, where Plato asks whether justice is a tool of the powerful (constructivism), an instrumental asset for society (relativism), or a dictate from nature (realism). For centuries, the field of metaethics remained firmly couched within the theoretical domains of philosophy and literature. Following the revolution in cognitive psychology and decision theory led by Daniel Kahneman in the second half of the twentieth century, however, the field of moral psychology emerged, using empirical approaches to analyze ethical judgment.
Jonathan Haidt’s contributions to moral psychology began with his influential 2001 article, “The emotional dog and its rational tail: A social intuitionist approach to moral judgment,” which presented a Humean account of ethics driven by individual and cultural moral intuitions, not grand theories such as utilitarianism, deontology, or virtue ethics [Haidt, 2001]. Haidt spent the next decade building an account of moral pluralism that would explain variation in these intuitions, culminating in his 2009 work with Jesse Graham and Brian Nosek [Graham et al., 2009], where they introduced the five moral intuitions that would constitute the original MFT (harm/care, fairness/reciprocity, ingroup/loyalty, authority/respect, and purity/sanctity) as an explanation for moral disagreement between liberals and conservatives. In the following decade and a half, MFT has been validated across many populations, expanded to include liberty as a distinct foundation, and enjoyed widespread popularity and adoption, especially in the business and economics communities.
The relationship between moral foundations and political ideology is central to this work. Graham et al. [2009] propose that these foundations, while distinct, have an underlying correlative structure: “individualizing” foundations (care and fairness) and “binding” foundations (loyalty, authority, and sanctity). Figure 1 demonstrates this result, showing that liberals more strongly value the two individualizing foundations, while conservatives value each foundation more equally.
In 2011, Graham, Haidt, and Nosek designed the Moral Foundations Questionnaire (MFQ), a survey intended to measure the relative strength of each of the foundations based on a Likert scale of an individual’s subjective assessment of the “relevance” of a particu