Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in Large Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Large language models can express values in two main ways: (1) intrinsic expression, reflecting the model’s inherent values learned during training, and (2) prompted expression, elicited by explicit prompts. Given their widespread use in value alignment, it is paramount to clearly understand their underlying mechanisms, particularly whether they mostly overlap (as one might expect) or rely on distinct mechanisms, but this remains largely understudied. We analyze this at the mechanistic level using two approaches: (1) value vectors, feature directions representing value mechanisms extracted from the residual stream, and (2) value neurons, MLP neurons that contribute to value vectors. We demonstrate that intrinsic and prompted value mechanisms partly share common components crucial for inducing value expression, generalizing across languages and reconstructing theoretical inter-value correlations in the model’s internal representations. Yet, as these mechanisms also possess unique elements that fulfill distinct roles, they lead to different degrees of response diversity (intrinsic > prompted) and value steerability (prompted > intrinsic). In particular, components unique to the intrinsic mechanism promote lexical diversity in responses, whereas those specific to the prompted mechanism strengthen instruction following, taking effect even in distant tasks like jailbreaking.


💡 Research Summary

This paper investigates how large language models (LLMs) express human values through two distinct pathways: intrinsic expression, which reflects values learned during pre‑training, and prompted expression, which is elicited by explicit system prompts at inference time. The authors ground their analysis in Schwartz’s taxonomy of ten universal human values (e.g., Benevolence, Power) and adopt a mechanistic perspective, extracting linear “value vectors” from the residual stream and identifying “value neurons” within MLP layers that drive these vectors.

Methodology

  1. Data Collection – Over 26 k first‑turn user queries relevant to the ten Schwartz values are harvested from real‑world conversational datasets (ShareGPT, LMSYS‑Chat‑1M). For each query two responses are generated: (a) with an empty system prompt (intrinsic condition) and (b) with a value‑targeting system prompt (prompted condition). Responses are automatically labeled as “value expressed” or “value unexpressed” using GPT‑4o‑mini, with human validation reported in the appendix.
  2. Value Vector Extraction – For each value and each transformer layer, the authors compute the mean residual‑stream activation across tokens for the expressed and unexpressed response sets, then take the difference of means (diff‑in‑means). This yields a direction v^l_{s,e} that captures the feature associated with expressing value s under condition e (intrinsic or prompted) at layer l. Overlap between intrinsic and prompted vectors is quantified by cosine similarity; orthogonal (condition‑unique) components are isolated by projecting one vector onto the other and subtracting the projection.
  3. Value Neuron Identification – Leveraging the fact that an MLP block’s residual update is a sum of rank‑1 contributions from its neurons, the authors project each neuron’s output weight vector onto the 2‑dimensional subspace spanned by the intrinsic and prompted vectors for a given value. Neurons with the largest projection norms (top ≈15 %) are retained as candidate “value neurons.” Singular‑value decomposition of the stacked vectors provides three orthogonal axes: a shared axis (first singular vector), an intrinsic‑unique axis, and a prompted‑unique axis (derived from the second singular vector and sign alignment). Each neuron is classified according to the axis it aligns with (angle < 30°).
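The diff‑in‑means and projection steps above can be sketched in a few lines of NumPy. This is a toy illustration with random activations standing in for real residual‑stream data; the function names are ours, not the paper's:

```python
import numpy as np

def value_vector(expressed_acts, unexpressed_acts):
    """Diff-in-means direction at one layer: mean activation of the
    'value expressed' responses minus the mean of the 'value unexpressed'
    ones. Inputs are (n_samples, d_model) arrays."""
    return expressed_acts.mean(axis=0) - unexpressed_acts.mean(axis=0)

def cosine(u, v):
    """Cosine similarity between two directions."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def orthogonal_component(v, w):
    """Part of v orthogonal to w: subtract v's projection onto w."""
    w_hat = w / np.linalg.norm(w)
    return v - (v @ w_hat) * w_hat

# Toy demo: random "activations" with a shift between the two label sets.
rng = np.random.default_rng(0)
v_intr = value_vector(rng.normal(1.0, 1.0, (100, 8)), rng.normal(0.0, 1.0, (100, 8)))
v_prom = value_vector(rng.normal(0.8, 1.0, (100, 8)), rng.normal(0.0, 1.0, (100, 8)))

overlap = cosine(v_intr, v_prom)                  # shared-subspace overlap
v_intr_unique = orthogonal_component(v_intr, v_prom)
assert abs(cosine(v_intr_unique, v_prom)) < 1e-6  # orthogonal by construction
```

On real models, the activation arrays would come from hooking the residual stream at a chosen layer; everything downstream of that is the simple linear algebra above.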

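The SVD‑based neuron classification in step 3 can likewise be sketched in NumPy, assuming a (n_neurons, d_model) matrix of MLP output weights is already in hand. The helper name and the handling of sign alignment are our simplifications, not the paper's code:

```python
import numpy as np

def classify_value_neurons(W_out, v_intr, v_prom, top_frac=0.15, max_angle=30.0):
    """Sketch: label MLP neurons as shared / intrinsic-unique / prompted-unique
    by the alignment of their output weight vectors (rows of W_out,
    shape (n_neurons, d_model)) with three orthogonal axes."""
    # SVD of the stacked value vectors: the first right singular vector is
    # the shared axis; the second spans the condition-unique directions.
    _, _, Vt = np.linalg.svd(np.stack([v_intr, v_prom]), full_matrices=False)
    shared = Vt[0] if Vt[0] @ (v_intr + v_prom) >= 0 else -Vt[0]
    intr_unique = Vt[1] if Vt[1] @ v_intr >= 0 else -Vt[1]
    prom_unique = -intr_unique  # sign-aligned with the prompted vector

    # Project each neuron onto the 2-D subspace; keep the largest ~15%.
    coords = W_out @ Vt.T                        # (n_neurons, 2)
    norms = np.linalg.norm(coords, axis=1)
    keep = np.flatnonzero(norms >= np.quantile(norms, 1.0 - top_frac))

    cos_thr = np.cos(np.deg2rad(max_angle))
    labels = {"shared": [], "intrinsic": [], "prompted": []}
    for i in keep:
        direction = coords[i] @ Vt / norms[i]    # unit in-subspace direction
        for name, axis in (("shared", shared),
                           ("intrinsic", intr_unique),
                           ("prompted", prom_unique)):
            if direction @ axis >= cos_thr:      # within 30 degrees of the axis
                labels[name].append(int(i))
                break
    return labels
```

Because the two axes are orthonormal, a kept neuron can fall within 30° of at most one of them; neurons aligned with none of the three axes remain unlabeled, matching the candidate‑filtering described above.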
Experiments
The analysis is performed primarily on Qwen2.5‑7B‑Instruct, Qwen2.5‑1.5B‑Instruct, and Llama‑3.1‑8B‑Instruct, with additional checks on larger models (Qwen2.5‑32B, Gemma2‑9B‑it, Qwen3‑14B, Qwen3‑8B) to confirm robustness across scale and architecture.

Key Findings

  1. Partial Overlap of Mechanisms – Across all layers, intrinsic and prompted value vectors exhibit positive cosine similarity, peaking around the middle layers (≈ layer 14). This indicates that both mechanisms rely on a shared semantic subspace within the model’s activation geometry.
  2. Shared and Unique Neurons – Approximately 20 % of MLP neurons are “shared,” meaning their output weights project strongly onto both intrinsic and prompted axes. The remaining neurons split into intrinsic‑unique and prompted‑unique groups, each aligning with the corresponding orthogonal axis.
  3. Cross‑Language Generalization – Vectors extracted from English contexts retain their relational structure when applied to Chinese and Spanish inputs, successfully reconstructing Schwartz‑defined inter‑value correlations (e.g., Universalism–Benevolence). This suggests that the identified subspaces encode language‑agnostic value semantics.
  4. Behavioral Divergence
    Response Diversity: Ablating intrinsic‑unique neurons leads to a marked drop in lexical diversity (measured by type‑token ratio and n‑gram entropy), whereas ablating prompted‑unique neurons has little effect. Thus, intrinsic mechanisms contribute to more natural, varied language generation.
    Steerability: Injecting prompted‑unique vectors or activating prompted‑unique neurons dramatically improves instruction following, including in “jailbreak” resistance and translation tasks, confirming that prompted mechanisms are more effective for fine‑grained control.
  5. Practical Implications – The findings clarify a trade‑off: intrinsic alignment yields richer, more diverse outputs but weaker controllability; prompted alignment offers stronger steerability at the cost of reduced naturalness. Moreover, because value neurons are interpretable, one can directly edit or fine‑tune specific values (e.g., suppress Power‑related neurons) without retraining the entire model.
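The diversity metrics used in the ablation comparison are standard. A minimal sketch of both, with whitespace tokenization as a simplification (the paper's exact tokenization and implementation are not reproduced here):

```python
import math
from collections import Counter

def type_token_ratio(tokens):
    """Distinct tokens / total tokens; higher = more lexical variety."""
    return len(set(tokens)) / len(tokens)

def ngram_entropy(tokens, n=2):
    """Shannon entropy (bits) of the n-gram distribution; higher = more
    varied phrasing, lower = repetitive text."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

varied = "the quick brown fox jumps over the lazy dog".split()
repetitive = ("the cat sat " * 3).split()
assert type_token_ratio(varied) > type_token_ratio(repetitive)
assert ngram_entropy(varied) > ngram_entropy(repetitive)
```

A drop in both metrics after ablating intrinsic‑unique neurons is what the finding above describes: the model's outputs become measurably more repetitive.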

Conclusion
The study demonstrates that LLMs encode value expression through a hybrid of shared and mechanism‑specific components. While intrinsic and prompted pathways share a core semantic subspace, each also possesses unique vectors and neurons that fulfill distinct functional roles—lexical diversity for intrinsic, instruction compliance for prompted. This nuanced understanding informs the design of value‑aligned AI systems, enabling developers to choose or combine mechanisms based on desired balances between naturalness and controllability, and opens avenues for neuron‑level interventions to improve safety and alignment.
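As a sketch of what such direction‑level interventions look like in practice, here are the two standard operations, activation addition and directional ablation, applied to a residual‑stream activation matrix. This is a NumPy toy; the paper's actual intervention hooks and scaling choices are not reproduced here:

```python
import numpy as np

def steer(resid, value_vec, alpha=1.0):
    """Activation addition: add a scaled value direction to residual-stream
    activations (shape (n_tokens, d_model)). alpha > 0 amplifies the value's
    expression; alpha < 0 suppresses it."""
    return resid + alpha * value_vec

def ablate_direction(resid, value_vec):
    """Directional ablation: zero out the component of every token's
    activation along a value direction, removing that mechanism's
    contribution without retraining."""
    u = value_vec / np.linalg.norm(value_vec)
    return resid - np.outer(resid @ u, u)
```

Steering with prompted‑unique directions and ablating intrinsic‑unique neurons correspond to the interventions behind the steerability and diversity findings above.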

