Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs
Large language models (LLMs) exhibit implicit personalities in their generation, but reliably controlling or aligning these traits to meet specific needs remains an open challenge. Effective mechanisms for steering model behaviour during generation are a critical gap in the literature, and personality-aware LLMs are a promising direction towards closing it. However, the relationship between psychological constructs and their representations within LLMs remains underexplored, as does the use of those representations to steer model behaviour. We propose a novel pipeline that extracts hidden-state activations from transformer layers using the Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism), a comprehensive and empirically validated framework for modelling human personality; applies low-rank subspace discovery methods; and identifies trait-specific optimal layers across model architectures for robust injection. The resulting personality-aligned directions are then operationalised through a flexible steering framework with dynamic layer selection, enabling precise control of trait expression in LLM outputs. Our findings reveal that personality traits occupy a low-rank shared subspace, and that these latent structures can be turned into actionable mechanisms for effective steering through careful perturbations, without degrading fluency, output variance or general capabilities, helping to bridge the gap between psychological theory and practical model alignment.
💡 Research Summary
The paper introduces a novel pipeline called “Activation‑Space Personality Steering” that enables precise, stable control of the Big‑Five (OCEAN) personality traits in large language models (LLMs) without any weight updates. The authors first collect a high‑quality dataset (Big‑5‑Chat) containing 20 000 examples labeled as high or low on each of the five traits. For each transformer layer L, they extract the final non‑padding residual vector h⁽ᴸ⁾ for every example, then standardize the high and low activations jointly and compute a normalized mean‑difference vector d(c)ₗ for each trait c. Because different layers carry different discriminative power, a set of non‑negative weights w(c)ₗ (summing to one) is learned; the weighted sum across layers yields a single robust direction d(c) for each trait.
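The per-layer direction extraction above can be sketched as follows. This is a minimal illustration with synthetic activations, not the authors' code: the tensor shapes, the synthetic data, and the uniform layer weights are all placeholder assumptions (the paper learns the weights from each layer's discriminative power).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 64, 4  # illustrative sizes, not the paper's
# Synthetic stand-ins for the final non-padding residual vectors h^(L)
high = rng.normal(0.5, 1.0, size=(n_layers, 100, d_model))   # high-trait examples
low = rng.normal(-0.5, 1.0, size=(n_layers, 100, d_model))   # low-trait examples

def layer_direction(h_high, h_low):
    """Standardize high/low activations jointly, then return the
    normalized mean-difference vector d_l^(c) for one layer."""
    both = np.concatenate([h_high, h_low], axis=0)
    mu, sigma = both.mean(axis=0), both.std(axis=0) + 1e-8
    diff = ((h_high - mu) / sigma).mean(axis=0) - ((h_low - mu) / sigma).mean(axis=0)
    return diff / np.linalg.norm(diff)

dirs = np.stack([layer_direction(high[l], low[l]) for l in range(n_layers)])

# Non-negative layer weights summing to one (uniform placeholder here)
w = np.full(n_layers, 1.0 / n_layers)
d_c = w @ dirs                    # weighted sum -> single direction for trait c
d_c /= np.linalg.norm(d_c)
```

With learned rather than uniform weights, layers that separate high/low examples more cleanly would dominate the sum.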
All five trait directions are stacked and subjected to a low‑rank PCA (or SVD) to obtain a shared subspace Uₖ of dimension k that retains >95 % of the variance. Each trait vector is projected onto this subspace and re‑normalized, producing the final steering vectors b_d(c) = UₖUₖᵀ d(c) / ‖UₖUₖᵀ d(c)‖. This low‑rank representation dramatically reduces noise, improves reproducibility, and captures inter‑trait correlations (e.g., openness‑extraversion).
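The subspace step can be reproduced with a few lines of linear algebra. The sketch below uses SVD-based PCA on five random stand-in directions; the dimensions and the data are assumptions, only the formula b_d(c) = UₖUₖᵀ d(c) / ‖UₖUₖᵀ d(c)‖ comes from the summary above.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_traits = 64, 5
D = rng.normal(size=(n_traits, d_model))        # stacked trait directions (rows)
D /= np.linalg.norm(D, axis=1, keepdims=True)

# PCA via SVD on the mean-centered stack; keep k components with >95% variance
_, S, Vt = np.linalg.svd(D - D.mean(axis=0), full_matrices=False)
var_ratio = np.cumsum(S**2) / np.sum(S**2)
k = int(np.searchsorted(var_ratio, 0.95) + 1)
U_k = Vt[:k].T                                  # (d_model, k) orthonormal basis

def steer_vector(d):
    """Project a trait direction onto the shared subspace and re-normalize."""
    proj = U_k @ (U_k.T @ d)
    return proj / np.linalg.norm(proj)

b = np.stack([steer_vector(D[i]) for i in range(n_traits)])
```

Because U_k has orthonormal columns, the projection UₖUₖᵀ is idempotent, so re-projecting a steering vector leaves it unchanged, which is what makes the subspace a stable, low-noise home for all five traits.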
The core contribution lies in the hybrid layer‑selection strategy. The authors first perform an offline diagnostic using neutral probe prompts: a tiny perturbation (α ≪ 1) is injected at each layer, and three metrics are computed—ℓ₂ distance of the output distribution (Δℓ₂), KL divergence, and a flip‑rate measuring whether the top‑predicted token changes. A weighted sum S(L,c) combines these metrics (with fixed λ coefficients) and the layer with maximal S is stored as the verified prior L*₍c₎ for trait c.
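The offline diagnostic can be sketched as a scoring function over simulated logits. The λ coefficients, vocabulary size, and the random perturbation standing in for an actual layer-wise steer are placeholder assumptions; the three metrics and the argmax selection follow the description above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def diagnostic_score(z_base, z_steered, lam=(1.0, 1.0, 1.0)):
    """Combine the three probe metrics into S(L, c).
    lam holds the fixed mixing coefficients (values here are placeholders)."""
    p, q = softmax(z_base), softmax(z_steered)
    delta_l2 = np.linalg.norm(q - p)                            # output-distribution shift
    kl = float(np.sum(q * np.log((q + 1e-12) / (p + 1e-12))))   # KL divergence
    flip = float(np.argmax(z_steered) != np.argmax(z_base))     # top-token flip rate
    return lam[0] * delta_l2 + lam[1] * kl + lam[2] * flip

rng = np.random.default_rng(2)
n_layers, vocab = 6, 50
z_base = rng.normal(size=vocab)
# Simulated small-alpha steers per layer; the verified prior L* maximizes S
scores = [diagnostic_score(z_base, z_base + 0.1 * rng.normal(size=vocab))
          for _ in range(n_layers)]
L_star = int(np.argmax(scores))
```

In the actual pipeline the perturbed logits would come from injecting α · b_d(c) at each candidate layer of the model over a battery of neutral probe prompts, then averaging S over prompts.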
At inference time a lightweight dynamic diagnostic is also run. For the actual user prompt p, the authors compute the ℓ₂ norm of the change in logits caused by a small steer at each layer, ν(L,p) = ‖z_steered⁽ᴸ⁾(p) – z_base(p)‖. The layer with the highest ν becomes the dynamic candidate R(p,c). The final injection is performed simultaneously at the verified prior and the dynamic layer, using a fixed mixture weight of 0.8 for the prior and 0.2 for the dynamic layer. The steering vector b_d(c) is scaled by a user‑specified intensity α and added to the residual stream via forward hooks.
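The hybrid injection step reduces to simple arithmetic on the residual stream. The sketch below emulates what a forward hook would do at the selected layers; the random ν values, layer indices, and α are illustrative assumptions, while the 0.8/0.2 mixture weights come from the summary above.

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, n_layers = 64, 6
b_dc = rng.normal(size=d_model)
b_dc /= np.linalg.norm(b_dc)            # unit-norm steering vector for trait c

# Dynamic candidate: layer with the largest logit change nu(L, p) under a small steer
nu = rng.random(n_layers)               # stand-in for ||z_steered - z_base|| per layer
L_prior, L_dyn = 3, int(np.argmax(nu))  # verified prior L* is assumed to be layer 3

alpha = 4.0                             # user-specified steering intensity
mix = {L_prior: 0.8, L_dyn: 0.2}        # fixed mixture weights from the paper
if L_prior == L_dyn:                    # layers may coincide; weights then merge
    mix = {L_prior: 1.0}

def inject(layer_idx, h):
    """Emulates a forward hook: add the scaled steering vector to the
    residual stream h if this layer was selected for injection."""
    return h + mix.get(layer_idx, 0.0) * alpha * b_dc

h = rng.normal(size=d_model)
h_steered = inject(L_prior, h)
```

In a real deployment the same logic would live inside hooks registered on the two selected transformer blocks, so both injections happen in a single forward pass.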
Experiments span multiple model families (LLaMA‑2‑7B, LLaMA‑3‑8B‑Instruct, etc.) and three evaluation regimes: (1) personality questionnaires where the generated text is scored on OCEAN dimensions, (2) open‑ended discourse to assess fluency, perplexity, and lexical diversity, and (3) standard reasoning benchmarks (GLUE‑like tasks) to verify that core capabilities remain intact. Results show statistically significant shifts in personality scores (average 0.45–0.68 standard deviations toward the target trait), while perplexity and task performance change negligibly. The low‑rank subspace reduces variance across runs by more than 30 %, and the hybrid layer selection yields higher reproducibility (correlation across random seeds improves from 0.92 to 0.98) compared to fixed‑layer baselines.
The study also demonstrates that the five traits occupy a shared low‑dimensional manifold, confirming psychological theories about trait inter‑relations and suggesting that multi‑trait steering can be achieved without severe parameter interference. Limitations include reliance on English‑only data, the need for manually curated high/low examples, and fixed hyper‑parameters for the diagnostic scores and mixture weights. Future work could explore automatic tuning of these parameters, extension to multilingual models, and application to other abstract concepts such as ethical stance or creativity.
In summary, the paper provides a comprehensive, technically sound framework that bridges personality psychology and LLM activation engineering. By extracting trait‑specific directions, compressing them into an orthogonal subspace, and selecting injection layers through a blend of static priors and prompt‑aware dynamics, the authors achieve stable, interpretable, and efficient personality steering. This contribution opens avenues for personalized conversational agents, safer AI alignment, and deeper investigations into how high‑level human constructs are encoded within transformer activations.