Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis, Not Five Traits

Jia Qing Yap
Independent Researcher

Abstract

We train nine sparse autoencoders (SAEs) on the residual stream of Qwen 3.5-35B-A3B, a 35-billion-parameter Mixture-of-Experts model with a hybrid GatedDeltaNet/attention architecture, and use them to identify and steer five agentic behavioral traits. Our method trains linear probes on SAE latent activations, then projects the probe weights back through the SAE decoder to obtain continuous steering vectors in the model's native activation space: $v = W_{\text{dec}}^\top w_{\text{probe}}$. This bypasses the SAE's top-$k$ discretization, enabling fine-grained behavioral intervention at inference time with no retraining. Across 1,800 agent rollouts (50 scenarios × 36 conditions), we find that autonomy steering at multiplier $\alpha = 2$ achieves Cohen's $d = 1.01$ ($p < 0.0001$; peak $d = 1.04$ at $\alpha = 3$), shifting the model from asking the user for help 78% of the time to proactively executing code and searching the web. Cross-trait analysis, however, reveals that all five steering vectors primarily modulate a single dominant agency axis (the disposition to act independently versus defer to the user), with trait-specific effects appearing only as secondary modulations in tool-type composition and dose-response shape. The most informative result is a dissociation between two traits probed from the same SAE. Risk calibration and tool-use eagerness share the same SAE and layer with nearly identical probe $R^2$ (0.795 vs. 0.792), yet their probe vectors are nearly orthogonal (cosine similarity = −0.017). The tool-use vector steers behavior ($d = 0.39$); the risk-calibration vector produces only suppression. Probe $R^2$ measures correlation, not causal efficacy. A predictive direction in latent space need not be causally upstream of behavior.
We additionally show that steering only during autoregressive decoding has zero effect ($p > 0.35$), providing causal evidence that behavioral commitments are computed during prefill in GatedDeltaNet architectures. Code, trained SAEs, and steering vectors are publicly available.[1][2]

[1] https://github.com/zactheaipm/qwenscope
[2] https://huggingface.co/zactheaipm/qwen35-a3b-saes

Preprint.

1 Introduction

A central goal of mechanistic interpretability is to move from observing what language models do to intervening on how they do it. Sparse autoencoders (SAEs) have emerged as a promising tool for decomposing model representations into interpretable features (Cunningham et al., 2024; Bricken et al., 2023), and recent work has demonstrated that clamping or amplifying individual SAE features can steer model behavior (Templeton et al., 2024). Separately, activation engineering methods add learned directions to the residual stream to shift model outputs without retraining (Turner et al., 2023; Li et al., 2023). These approaches have primarily been demonstrated on models up to 7B parameters with standard transformer architectures. Whether they extend to production-scale models with non-standard architectures (Mixture-of-Experts routing, linear recurrence layers, hybrid attention) remains an open question.

We address this question by training SAEs on Qwen 3.5-35B-A3B (Qwen Team, 2025), a 35B-parameter MoE model (3B active per token) that uses a hybrid architecture: repeating blocks of three GatedDeltaNet layers (Yang et al., 2024) followed by one standard attention layer. We target five agentic behavioral traits (autonomy, tool-use eagerness, persistence, risk calibration, and deference) that are central to the deployment characteristics of AI agents. Our method bridges SAE-based feature discovery with activation steering.
For each trait, we: (1) collect contrastive activation pairs from the model's residual stream, (2) train a linear probe on the SAE's latent activations to predict the trait, and (3) project the probe weights through the SAE's decoder to obtain a steering vector in the model's native 2048-dimensional activation space. This projection bypasses the SAE's top-$k$ activation function, which would otherwise discretize the latent space and destroy fine-grained directional information. We evaluate each steering vector by running 50 ReAct-style agent scenarios and measuring behavioral change through proxy metrics extracted from raw trajectories.

Our contributions are:

1. SAE-decoded probe steering at 35B MoE scale. We demonstrate that the method $v = W_{\text{dec}}^\top w_{\text{probe}}$ produces causally effective steering vectors on a production-scale hybrid architecture, achieving $d = 1.01$ for autonomy.
2. A dominant agency axis. Cross-trait analysis reveals that all five steering vectors primarily modulate a single shared disposition (acting independently versus deferring), with trait-specific effects appearing only in tool-type composition and dose-response dynamics.
3. A correlation/causation dissociation. Risk calibration and tool-use eagerness, probed from the same SAE with nearly identical $R^2$, yield orthogonal vectors with opposite causal efficacy. High probe accuracy does not guarantee causal relevance.
4. Prefill-only behavioral commitment. Steering during autoregressive decoding has zero effect. Behavioral decisions are committed during the prefill phase and propagated through the GatedDeltaNet recurrence.

2 Related Work

Sparse Autoencoders for Interpretability. SAEs decompose model activations into sparse, interpretable features by training an overcomplete autoencoder with a sparsity constraint (Cunningham et al., 2024; Bricken et al., 2023). Templeton et al.
(2024) scaled SAEs to Claude 3 Sonnet and demonstrated that individual features correspond to human-interpretable concepts. Gao et al. (2024) introduced the top-$k$ activation function and auxiliary dead-feature losses, improving training stability at scale. Our work extends SAE training to hybrid GatedDeltaNet/attention architectures and uses SAE latents as an intermediate representation for steering-vector construction rather than directly clamping features.

Activation Steering. Turner et al. (2023) introduced activation addition, where a direction vector is added to the residual stream during the forward pass to shift model behavior. Li et al. (2023) used probing to identify attention heads encoding truthfulness, then shifted activations at those heads during inference. Rimsky et al. (2024) used contrastive activation addition (CAA) to steer model behavior on Llama 2, computing steering vectors as the mean difference between contrastive prompt pairs. Our SAE-decoded probe method differs from mean-diff CAA by leveraging the SAE's learned basis to identify more targeted directions; the absence of a head-to-head comparison is noted as a limitation (§7).

Linear Probing and Causal Intervention. Linear probes have been widely used to test whether specific information is linearly represented in model activations (Belinkov, 2022; Burns et al., 2023). However, probes measure correlation between representations and labels, not causal influence on model behavior. Ravfogel et al. (2020) and Elazar et al. (2021) explored the gap between probe accuracy and causal relevance through nullspace projection and amnesic counterfactuals, respectively. Our risk-calibration dissociation provides an empirical case study of this gap: a probe with $R^2 = 0.795$ identifies a direction with zero causal efficacy for steering, while a probe with $R^2 = 0.792$ on the same SAE successfully steers a different trait.

Hybrid and Linear-Recurrence Architectures.
GatedDeltaNet (Yang et al., 2024) combines gated linear attention with delta-rule updates, offering sub-quadratic sequence processing. Qwen 3.5 (Qwen Team, 2025) deploys this in a hybrid configuration alternating recurrence and attention layers. The interaction between linear recurrence and sparse feature decomposition has not been previously studied. Our finding that behavioral commitments are computed during prefill and propagated through the recurrent state provides evidence about the computational role of these architectural components.

3 Model and Architecture

We study Qwen 3.5-35B-A3B, a Mixture-of-Experts language model with 35 billion total parameters and approximately 3 billion active per token. The model has 40 transformer layers organized into 10 blocks of 4 layers each: 3 GatedDeltaNet layers followed by 1 standard multi-head attention layer (Table 1). Each MoE layer routes tokens to 8 of 256 experts plus 1 shared expert, with per-expert intermediate dimension 512. The residual stream has hidden dimension $d = 2048$.

Table 1: Qwen 3.5-35B-A3B architecture summary.

  Property                   Value
  Total parameters           35B
  Active parameters/token    ~3B
  Layers                     40 (10 blocks × 4)
  Block pattern              3 GatedDeltaNet + 1 Attention
  Hidden dimension           2,048
  MoE experts                256 total, 8 routed + 1 shared
  Expert intermediate dim    512

The hybrid architecture creates a natural experimental contrast: within each block, we can compare feature representations in GatedDeltaNet (recurrent) versus attention (quadratic) layers at matched depth.

4 Method

Our pipeline has four stages: SAE training (§4.1), contrastive feature identification (§4.2), probe-to-steering-vector projection (§4.3), and behavioral evaluation (§4.4). Figure 1 summarizes the full pipeline.

4.1 SAE Training

We train 9 top-$k$ SAEs on residual stream activations at four depth levels (early, early-mid, mid, late), covering both GatedDeltaNet and attention sublayers at each depth (Table 2).
An additional control SAE at a different position within the mid block tests within-block position effects. Each SAE is trained on 200M tokens from a mixture of HuggingFaceH4/ultrachat_200k, allenai/WildChat-1M (GDPR-filtered), and synthetic tool-use conversations. We follow the FAST methodology (Gao et al., 2024): sequential conversation processing with a 2M-vector circular CPU buffer for activation mixing. Dead features are resampled every 5,000 steps using an auxiliary top-$k$ loss on the reconstruction residual (Gao et al., 2024), with learning-rate warmup over 1,000 steps and linear decay over the final 20% of training.

Figure 1: SAE-decoded probe steering pipeline. Contrastive activations are encoded through the SAE, a ridge regression probe identifies the discriminative direction in SAE latent space, and the probe weights are projected through the decoder to obtain a continuous steering vector in the model's native activation space.

Table 2: SAE configurations. All SAEs use top-$k$ activation with auxiliary dead-feature loss. Learning rates are scaled inversely with activation norm at each depth.
  SAE ID               Layer  Type       Block  Dict Size  k    LR
  sae_delta_early      6      DeltaNet   1      8,192      128  3e-5
  sae_attn_early       7      Attention  1      8,192      128  3e-5
  sae_delta_earlymid   14     DeltaNet   3      16,384     96   2e-5
  sae_attn_earlymid    15     Attention  3      16,384     96   2e-5
  sae_delta_mid_pos1   21     DeltaNet   5      16,384     64   1e-5
  sae_delta_mid        22     DeltaNet   5      16,384     64   1e-5
  sae_attn_mid         23     Attention  5      16,384     64   1e-5
  sae_delta_late       34     DeltaNet   8      16,384     64   8e-6
  sae_attn_late        35     Attention  8      16,384     64   8e-6

The SAE architecture follows standard practice. Given a residual stream activation $x \in \mathbb{R}^d$, the SAE computes:

$z = \mathrm{TopK}(\mathrm{ReLU}(W_{\text{enc}}(x - b_{\text{pre}})))$   (1)

$\hat{x} = W_{\text{dec}}^\top z + b_{\text{pre}}$   (2)

where $W_{\text{enc}} \in \mathbb{R}^{D \times d}$ and $W_{\text{dec}} \in \mathbb{R}^{D \times d}$ with feature directions normalized to unit norm, $b_{\text{pre}} \in \mathbb{R}^d$ is a learned centering parameter, $D$ is the dictionary size, and $\mathrm{TopK}$ retains only the $k$ largest activations, zeroing the rest.

4.2 Contrastive Feature Identification

We define five agentic behavioral traits, each decomposed into three sub-behaviors (Table 3).

Table 3: Behavioral traits and sub-behaviors.

  Trait               Sub-behaviors
  Autonomy            decision independence, action initiation, permission avoidance
  Tool-use eagerness  tool reach, proactive info gathering, tool diversity
  Persistence         retry willingness, strategy variation, escalation reluctance
  Risk calibration    approach novelty, scope expansion, uncertainty tolerance
  Deference           instruction literalness, challenge avoidance, suggestion restraint

For each trait, we construct contrastive prompt pairs: HIGH variants that should elicit strong expression of the trait and LOW variants that should suppress it. We generate 800 composite pairs (10 templates × 4 variations × 4 task domains × 5 traits) plus 720 sub-behavior-specific pairs, for 1,520 total. Task domains (coding, research, communication, data analysis) ensure breadth. For each pair, we run the model's forward pass and extract residual stream activations at all 9 SAE hook points.
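The encoding step in Eqs. 1-2 can be sketched in a few lines of NumPy. This is a toy illustration with small stand-in dimensions and random weights, not the trained SAEs (the real models use $d = 2048$ and $D$ up to 16,384):

```python
# Minimal sketch of the top-k SAE forward pass (Eqs. 1-2).
# Weights are random stand-ins; toy sizes keep the example fast.
import numpy as np

rng = np.random.default_rng(0)
d, D, k = 256, 1024, 32  # toy stand-ins for d=2048, D=16384, k=64

W_enc = rng.standard_normal((D, d)) * 0.02
W_dec = rng.standard_normal((D, d))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm feature directions
b_pre = np.zeros(d)

def sae_encode(x: np.ndarray) -> np.ndarray:
    """z = TopK(ReLU(W_enc (x - b_pre))): keep only the k largest activations."""
    z = np.maximum(W_enc @ (x - b_pre), 0.0)
    # Zero everything outside the top-k support.
    below_topk = np.argpartition(z, -k)[:-k]
    z[below_topk] = 0.0
    return z

def sae_decode(z: np.ndarray) -> np.ndarray:
    """x_hat = W_dec^T z + b_pre."""
    return W_dec.T @ z + b_pre

x = rng.standard_normal(d)
z = sae_encode(x)  # at most k latents active, the rest exactly zero
```

The hard zeroing in `sae_encode` is the discretization that §4.3 bypasses: amplifying a latent outside the top-$k$ support leaves the reconstruction unchanged.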
We encode these through each SAE and compute the Trait Association Score (TAS) for each latent feature $j$:

$\mathrm{TAS}_j = \dfrac{\mathbb{E}[z_j^{\text{high}}] - \mathbb{E}[z_j^{\text{low}}]}{\mathrm{std}(z_j^{\text{high}} - z_j^{\text{low}})}$   (3)

where $z_j^{\text{high}}$ and $z_j^{\text{low}}$ are feature $j$'s activations on HIGH and LOW prompts respectively, pooled at the last token position to avoid sequence-length confounds. We select the SAE with the highest mean $|\mathrm{TAS}|$ for each trait as the best hook point for subsequent probing.

4.3 Probe-to-Residual-Stream Projection

We construct steering vectors in two steps, translating from SAE feature space to the model's native activation space.

Step 1: Linear Probe. For each trait, we train a ridge regression probe (L2-regularized linear regression) on the SAE latent activations from the best SAE, with binary labels $y = 1$ for HIGH and $y = 0$ for LOW:

$\hat{y} = w_{\text{probe}}^\top z + b, \quad w_{\text{probe}} = \arg\min_w \|Xw - y\|^2 + \alpha\|w\|^2$   (4)

where $z \in \mathbb{R}^D$ is the SAE encoding of a contrastive example, $w_{\text{probe}} \in \mathbb{R}^D$ are the learned weights, and $\alpha$ is selected from $\{0.01, 0.1, 1.0, 10.0\}$ by held-out $R^2$. The probe's $R^2$ quantifies how linearly separable the trait is in SAE latent space.

Step 2: Decoder Projection. We project the probe weights through the SAE decoder to obtain a steering vector in the residual stream:

$v_{\text{steer}} = W_{\text{dec}}^\top w_{\text{probe}}$   (5)

where $W_{\text{dec}} \in \mathbb{R}^{D \times d}$ is the SAE's decoder weight matrix. This maps the discriminative direction in $D$-dimensional SAE space to a $d$-dimensional direction in the residual stream.

Why not steer inside the SAE? The SAE's top-$k$ activation function (Eq. 1) introduces a hard threshold: only $k$ features are nonzero per token. Amplifying a feature that falls outside the top-$k$ has no effect on the reconstruction, and modifying a feature that is inside the top-$k$ shifts the reconstruction in a quantized, $k$-sparse manner. The decoder projection circumvents this by operating in continuous activation space.
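The two steps above can be sketched end to end. This is a toy illustration on synthetic SAE encodings with small stand-in dimensions; the closed-form ridge solution stands in for whatever solver produced the paper's probes:

```python
# Sketch of Eqs. 4-5: ridge probe on SAE latents, then decoder projection.
# X, y, and all sizes are synthetic stand-ins (real sizes: D=16384, d=2048).
import numpy as np

rng = np.random.default_rng(1)
d, D, n = 64, 512, 200

X = np.abs(rng.standard_normal((n, D)))  # stand-in SAE encodings (non-negative)
y = np.repeat([1.0, 0.0], n // 2)        # 1 = HIGH prompt, 0 = LOW prompt
W_dec = rng.standard_normal((D, d))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

def ridge_probe(X: np.ndarray, y: np.ndarray, alpha: float) -> np.ndarray:
    """w = argmin ||Xw - y||^2 + alpha ||w||^2, via the closed form.

    Centering X and y absorbs the bias term b of Eq. 4.
    """
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)

w_probe = ridge_probe(X, y, alpha=1.0)

# Eq. 5: project the D-dim probe direction into the d-dim residual stream.
v_steer = W_dec.T @ w_probe
```

With the released SAEs, `W_dec` would be the decoder of the hook point selected by TAS and `X` the encodings of the 1,520 contrastive pairs, with $\alpha$ chosen by held-out $R^2$.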
The resulting vector $v_{\text{steer}}$ can be added at any scale without discretization artifacts.

Steering Application. During inference, we add the steering vector to the residual stream at the probe's layer:

$x'_\ell = x_\ell + \alpha \cdot v_{\text{steer}}$   (6)

where $\alpha$ is a scalar multiplier. We test three application modes: all_positions (every token in both prefill and decode), prefill_only (only during prompt processing), and decode_only (only during autoregressive generation).

4.4 Behavioral Evaluation

We evaluate steering effects using 50 ReAct-style agent scenarios spanning four task domains. Each scenario provides the model with tool schemas (code_execute, web_search, file_read, file_write, ask_user) and pre-cached tool responses to eliminate network dependencies. The model generates up to 5 turns of tool calls and responses. We extract behavioral proxy metrics directly from trajectories:

• Autonomy proxy: 1 if no ask_user calls, else 0
• Tool-use proxy: count of non-ask_user tool calls
• Persistence proxy: number of agent turns
• Risk proxy: fraction of tool calls that are code_execute or file_write
• Deference proxy: 1 if any ask_user call, else 0

We use ask_user count and proactive tool count (non-ask_user) as primary metrics, reporting Cohen's $d$ (standardized effect size) and Mann-Whitney U tests (non-parametric, appropriate for count data). Statistical significance is assessed with Bonferroni correction for 35 total conditions ($\alpha_{\text{corrected}} = 0.0014$).

5 Results

We ran 1,800 agent rollouts: 50 scenarios × (1 baseline + 35 steered conditions). The unsteered model predominantly asks the user for help (78% of tool calls are ask_user), with 38% of scenarios producing zero tool calls.

5.1 Per-Trait Steering Efficacy

Table 4 summarizes the best steering condition for each trait.

Table 4: Best steering condition per trait. $d(\text{ask})$: Cohen's $d$ for ask_user reduction. $d(\text{pro})$: Cohen's $d$ for proactive tool-call increase.
∅TC%: fraction of scenarios with zero tool calls (coherence proxy). Conditions tested: multipliers $\alpha \in \{1, 2, 3\}$; positions ∈ {all, prefill, decode}.

  Trait        SAE            Layer  R²     Condition  d(ask)  d(pro)  ∅TC%  Tier
  Autonomy     delta_mid      22     0.865  all, α=3   −0.87   +1.04   36%   1
  Tool use     delta_mid_p1   21     0.792  all, α=3   −0.60   +0.39   26%   2
  Deference    attn_earlymid  15     0.737  pre, α=2   −0.58   +0.80   46%   3
  Persistence  attn_mid       23     0.561  all, α=1   −0.45   +0.28   40%   fail
  Risk cal.    delta_mid_p1   21     0.795  —          −0.51   +0.00   70%   fail

Autonomy (Tier 1). At $\alpha = 2.0$ (all positions), ask_user calls drop from 1.04 ± 1.65 to 0.12 ± 0.33 ($d = -0.76$, $p < 0.0001$), while proactive tool calls rise from 0.30 ± 0.85 to 2.10 ± 2.14 ($d = +1.01$, $p < 0.0001$; 95% CI [+0.60, +1.43]). At $\alpha = 3.0$, the effect on proactive calls is maintained ($d = +1.04$) with slightly more coherence degradation (∅TC = 36% vs. 30%). Above $\alpha = 5.0$, the model collapses entirely (∅TC = 100%). The dose-response follows a smooth inverted-U curve.

Tool-Use Eagerness (Tier 2). This trait shows a phase-transition dose-response. At $\alpha = 2.0$, 58% of scenarios have zero tool calls, paradoxically worse than baseline (38%). At $\alpha = 3.0$, the model passes through a formatting instability into a new behavioral mode: 4.26 tool calls per scenario (vs. 1.34 baseline), 88% of which are web_search. The perturbation at $\alpha = 3.0$ corresponds to 12.8× the residual stream RMS norm, yet the model maintains coherent output in 74% of scenarios.

Deference (Tier 3, borderline). The only trait where prefill_only outperforms all_positions ($d(\text{pro}) = +0.80$ vs. +0.76). This suggests deference is a dispositional property computed during prompt processing. The therapeutic window is narrow: in prefill-only mode, $\alpha = 1.0$ has a weak effect ($d(\text{pro}) = +0.23$), $\alpha = 2.0$ produces genuine redirection ($d(\text{pro}) = +0.80$), and $\alpha = 3.0$ degrades coherence (∅TC = 84%).
In all-positions mode, $\alpha = 3.0$ collapses output entirely (∅TC = 100%). Deference operates through an attention SAE (layer 15, early-mid block), the earliest intervention point and the only attention-layer SAE that produces genuine behavioral redirection.

Persistence (Fail). The weakest probe ($R^2 = 0.561$) produces the weakest steering. No multiplier achieves significant behavioral redirection; the best condition ($\alpha = 1.0$, all positions: $d(\text{ask}) = -0.45$, $p = 0.056$) does not approach the Bonferroni-corrected threshold ($\alpha_{\text{corrected}} = 0.0014$). All higher multipliers produce pure suppression without compensatory proactive behavior.

Risk Calibration (Fail). Discussed in detail in §5.6.

Figure 2: Dose-response curves for proactive tool-call effect size $d(\text{pro})$ as a function of steering multiplier $\alpha$. Autonomy exhibits a smooth inverted-U with a clear therapeutic window. Tool-use eagerness shows a phase transition: intermediate multipliers degrade performance before a high-$\alpha$ regime produces a new behavioral mode. Persistence and risk calibration show monotonic suppression at all multipliers.

5.2 Dose-Response Taxonomy

The five traits exhibit three qualitatively distinct dose-response patterns (Figure 2):

1. Smooth inverted-U (autonomy, deference): Effect grows monotonically with $\alpha$ until a sharp cliff. There is a clear therapeutic window between the weak ($d < 0.3$) and collapsed (∅TC > 80%) regimes.
2. Phase transition (tool-use eagerness): Non-monotonic.
Intermediate multipliers degrade performance; high multipliers push through a formatting instability into a new stable behavioral attractor with qualitatively different output.
3. Pure suppression (persistence, risk calibration): Monotonic degradation. Increasing $\alpha$ suppresses the dominant behavior (ask_user) without replacing it with alternative actions. No multiplier redirects behavior.

These patterns are not predicted by probe $R^2$, vector norm, SAE type, or layer depth (Table 4). Risk calibration ($R^2 = 0.795$) fails while deference ($R^2 = 0.737$) succeeds. The autonomy vector (norm 10.58) tolerates $\alpha = 3.0$, while the deference vector (norm 28.54, perturbation 10.2× RMS) collapses at $\alpha = 3.0$. Coherence under perturbation depends on the direction, not just the magnitude, of the intervention.

5.3 The Decode-Only Null Result

We tested autonomy steering (the strongest trait) in decode_only mode, where the steering vector is applied only during autoregressive token generation, not during prompt processing. At both $\alpha = 1.0$ and $\alpha = 2.0$, the effect is null:

Table 5: Decode-only steering produces zero effect on autonomy.

  Condition             d(ask)  p        d(pro)  p        ∅TC%
  decode-only α=1.0     −0.16   0.674    +0.13   0.342    38%
  decode-only α=2.0     −0.20   0.359    +0.02   0.511    38%
  all-positions α=2.0   −0.76   <0.0001  +1.01   <0.0001  30%
  prefill-only α=2.0    −0.64   0.0007   +0.61   0.002    42%

This provides causal evidence that the model computes its behavioral commitments (whether to ask or act) during the prefill phase and propagates them through the GatedDeltaNet recurrent state. By the time the model is generating tokens, the decision is already embedded in the hidden state. The attention-based steering for deference corroborates this: prefill_only actually outperforms all_positions for that trait, consistent with deference being a dispositional property set during prompt processing.
Figure 3: Cross-trait specificity matrix (Cohen's $d$, $\alpha = 2.0$, all positions). Every steering vector primarily increases autonomy and decreases deference, revealing a dominant agency axis. No vector achieves a specificity ratio > 1.0. Numeric values in Table 7.

5.4 Tool-Type Composition

Even where the cross-trait specificity matrix shows shared agency effects (§5.5), the type of action differs qualitatively between steering vectors:

Table 6: Tool-type breakdown for the three successful steering conditions vs. baseline.

  Condition             ask_user  web_search  code_exec  file_r/w  Total
  Baseline              78%       22%         0%         0%        67
  Autonomy (α=2)        5%        44%         48%        3%        111
  Tool use (α=3)        6%        88%         5%         <1%       213
  Deference (α=2, pre)  16%       63%         15%        6%        87

The autonomy vector unlocks code execution (0% → 48% of tool calls), the tool-use vector creates compulsive web searching (88%), and the deference vector produces a balanced portfolio. These represent qualitatively different behavioral modes even though all three increase the overall disposition to act.

5.5 Cross-Trait Specificity

We measured each steering vector's effect on all five proxy metrics to assess trait specificity (Table 7, Figure 3).

Table 7: Cross-trait specificity matrix. Each cell is Cohen's $d$ (steered vs. baseline at $\alpha = 2.0$, all positions). Diagonal = on-target effect. ***$p < 0.001$, **$p < 0.01$, *$p < 0.05$.

  Steered      Auton.   Tool Use  Persist.  Risk     Defer.
  Autonomy     0.94***  1.00***   0.37      0.94***  −0.94***
  Tool use     0.68**   0.62**    0.08      0.45*    −0.68**
  Persistence  1.29***  0.25      −0.52**   0.47*    −1.29***
  Risk cal.    1.29***  0.26      −0.49***  0.40*    −1.29***
  Deference    0.52*    0.75***   0.19      0.65***  −0.52*

Finding: A dominant agency axis. Every steering vector primarily increases autonomy and decreases deference, regardless of which trait it was trained to target. No vector achieves a specificity ratio (on-target $|d|$ / max off-target $|d|$) greater than 1.0. The autonomy vector's strongest effect is actually on the tool-use proxy ($d = 1.00$), exceeding its on-target autonomy effect ($d = 0.94$). The persistence and risk-calibration vectors produce their largest effects on the autonomy proxy ($d = 1.29$), substantially exceeding their on-target effects ($d = -0.52$ and $d = 0.40$).

This is not simply a measurement artifact from correlated proxies. The autonomy and deference proxies are definitionally anti-correlated (both derived from ask_user), but the tool-use and risk proxies are constructed from independent counts. All five vectors modulate a shared agency disposition, with trait-specific effects appearing as secondary modulations in tool-type composition (§5.4) and dose-response dynamics (§5.2).

5.6 The Risk-Calibration Dissociation

Risk calibration and tool-use eagerness share the same SAE (sae_delta_mid_pos1, layer 21) with nearly identical probe accuracy ($R^2 = 0.795$ vs. 0.792) and vector norms (22.45 vs. 23.84). Yet the tool-use vector steers behavior ($d(\text{pro}) = +0.39$) while the risk-calibration vector produces only suppression ($d(\text{pro}) = 0.00$ at best). The probe directions are nearly orthogonal:

Table 8: Geometry of the risk-calibration and tool-use probe directions.
  Metric                                     Value
  Cosine similarity (SAE feature space)      −0.017
  Cosine similarity (residual stream)        −0.065
  Risk-cal variance parallel to tool-use     0.03%
  Risk-cal variance orthogonal to tool-use   99.97%

The risk-calibration probe found a direction that is:

• Predictive: $R^2 = 0.795$ on held-out contrastive pairs
• Distinct: nearly orthogonal to the tool-use direction (cosine = −0.017)
• Concentrated: 83 features explain 50% of probe variance (vs. 113 for tool use)
• Causally inert: produces only suppression, no behavioral redirection

This is a controlled dissociation between predictive and causal relevance. The risk-calibration direction is a real statistical pattern in the model's activations, an epiphenomenal representation that co-occurs with cautious/bold behavior but is not part of the causal chain generating it. The failure cannot be attributed to shared variance with tool use (the directions are orthogonal), weak probing (the $R^2$ is high), or the wrong layer (same SAE and layer as a successful vector).

Implication: Probe $R^2$ measures correlation between the learned direction and the trait label distribution. It does not guarantee that the direction is causally upstream of the behavior it predicts. Intervention experiments are necessary to distinguish causal from epiphenomenal representations.

5.7 Feature Attribution

We decompose each steering vector by its constituent SAE features (Table 9). All directions are extremely sparse: fewer than 1% of features explain 50% of the steering signal. Autonomy is the most concentrated (48 features for 50%), which may explain its efficacy: a concentrated signal may be harder for the model to "route around." Despite the behavioral cross-talk observed in §5.5, the steering vectors are nearly orthogonal in residual stream space (max pairwise cosine similarity = 0.086, between autonomy and risk calibration). The shared agency axis in behavior space is therefore not a single direction in activation space.
Multiple distinct perturbations converge to the same behavioral attractor.

Table 9: Feature concentration of steering vectors. Values indicate how many SAE features (out of 16,384) are needed to explain a given fraction of the vector's squared norm.

  Trait        50% var     80% var     90% var     95% var
  Autonomy     48 (0.3%)   147 (0.9%)  228 (1.4%)  315 (1.9%)
  Tool use     113 (0.7%)  316 (1.9%)  462 (2.8%)  595 (3.6%)
  Persistence  77 (0.5%)   227 (1.4%)  339 (2.1%)  439 (2.7%)
  Risk cal.    83 (0.5%)   257 (1.6%)  403 (2.5%)  529 (3.2%)
  Deference    70 (0.4%)   216 (1.3%)  340 (2.1%)  453 (2.8%)

6 Discussion

One Axis, Not Five. We set out to find five independent behavioral directions and instead found a dominant agency axis with secondary trait-specific modulations. Agentic behavior in this model appears to be organized around a single ask-or-act decision point, with the type of action (code execution vs. web search vs. a balanced portfolio) as a secondary consideration. The structure is analogous to findings in personality psychology, where the "Big Five" traits are empirically correlated and factor analyses sometimes extract a "General Factor of Personality" (Musek, 2007) that subsumes the individual traits. Whether this reflects a genuine architectural constraint (a single "agency bottleneck" in the forward pass) or an artifact of our proxy metrics (which are all derived from tool-call patterns) is an open question that requires richer evaluation metrics to resolve.

The Probe/Causation Gap. The risk-calibration dissociation provides an empirical demonstration that high linear probe accuracy does not imply causal relevance for steering. Prior work has argued this point on theoretical grounds (Ravfogel et al., 2020; Belinkov, 2022). Our case is unusually controlled: same SAE, same layer, same $R^2$, orthogonal directions, opposite causal efficacy.
We recommend that future work using probes for steering report intervention results alongside $R^2$, treating probe accuracy as a necessary but insufficient condition.

Prefill as the Behavioral Commitment Phase. The decode-only null result has implications for both interpretability and deployment. For interpretability, it suggests that GatedDeltaNet recurrent layers "compile" behavioral decisions during prefill into the recurrent state, which then deterministically guides generation. For deployment, it implies that monitoring or intervening on the prefill phase may be more effective than monitoring generated tokens for behavioral anomalies.

Therapeutic Windows Are Trait-Specific. There is no universal multiplier that works across traits. Autonomy tolerates $\alpha = 3.0$. Deference has a narrow window at exactly $\alpha = 2.0$. Tool use requires a phase-transition-inducing $\alpha = 3.0$. Production deployment of behavioral steering would require per-trait calibration, not a single scaling parameter.

7 Limitations and Future Work

Sample size. $N = 50$ scenarios per condition provides adequate power for large effects ($d > 0.8$) but is underpowered for medium effects. The 95% confidence interval for the headline autonomy result ($d = 1.01$) is [+0.60, +1.43], indicating the effect is real but the point estimate is imprecise. Persistence does not approach the Bonferroni-corrected threshold ($p = 0.056$ vs. $\alpha_{\text{corrected}} = 0.0014$); a larger evaluation set may clarify whether this is a true null or a power issue.

Proxy metric design. All results use proxy metrics extracted directly from trajectories (tool-call counts and types). The autonomy and deference proxies are anti-correlated by construction (both derived from ask_user rate), which means the cross-trait specificity matrix (§5.5) partially reflects metric correlation rather than behavioral correlation.
The claim of a "dominant agency axis" is supported by the tool-use and risk proxies (which are independently constructed) showing the same pattern, but richer evaluation, such as LLM-judged sub-behavior scores across 15 dimensions, could reveal trait specificity not detectable with 5 aggregate proxies. We have not validated any LLM judge against human ratings for this purpose.

No mean-difference baseline. We have not compared our SAE-decoded probe vectors against simple mean-difference steering (contrastive activation addition without the SAE intermediary). If mean-diff achieves comparable results, the SAE's contribution would be limited to feature selection and interpretability rather than directional improvement.

No max-activating example analysis. We report top features per direction but have not identified what inputs maximally activate them. Semantic feature interpretation would strengthen the mechanistic narrative.

Single model. All results are on Qwen 3.5-35B-A3B. Whether the method generalizes to other hybrid architectures, pure-transformer models, or different model scales is unknown.

8 Broader Impact

Behavioral steering of language model agents is inherently dual-use. The same vectors that could make an agent more autonomous for legitimate productivity applications could also bypass safety-relevant deference behaviors. Our finding that all five trait vectors modulate a single agency axis amplifies this concern: a single intervention shifts the model's entire disposition toward independent action, including in contexts where deference would be appropriate.

We release our SAEs and steering vectors to enable reproducibility and defensive research (e.g., detecting when a model's agency disposition has been shifted). We note that the method requires white-box access to model activations during inference, limiting misuse to settings where the attacker already controls the deployment infrastructure.
Monitoring prefill-phase activations, as suggested by our decode-only null result, may offer a detection mechanism for unauthorized behavioral modification.

9 Conclusion

We trained nine sparse autoencoders on a 35-billion-parameter hybrid MoE language model and used them to construct behavioral steering vectors via probe-to-decoder projection. Projecting linear probe weights through the SAE decoder to obtain continuous steering vectors in the residual stream works at production scale without retraining. Behavioral steering on a 35B MoE model with GatedDeltaNet layers is feasible, achieving effect sizes up to d = 1.04 for autonomy. Five ostensibly distinct agentic traits collapse onto a single dominant agency axis, with trait-specific effects appearing only in secondary properties like tool-type composition. A probe with R² = 0.795 can find a direction that is genuinely represented in the model's activations yet causally irrelevant to the behavior it predicts, demonstrating that predictive accuracy is not causal evidence.

The failures in this work carry as much information as the successes. Risk calibration's high-R² irrelevance reveals the existence of epiphenomenal representations that predict but do not cause behavior. The decode-only null result places the computational locus of behavioral commitment in the prefill phase. Cross-trait leakage reveals that multiple orthogonal directions in activation space converge to a single behavioral attractor. We release all SAEs, steering vectors, and code to support further investigation.

References

Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, 2022.
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023.

Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. In ICLR, 2023.

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In ICLR, 2024.

Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. Amnesic probing: Behavioral explanation with amnesic counterfactuals. Transactions of the Association for Computational Linguistics, 9:160–175, 2021.

Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024.

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. In NeurIPS, 2023.

Janek Musek. A general factor of personality: Evidence for the Big One in the five-factor model. Journal of Research in Personality, 41(6):1213–1233, 2007.

Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg. Null it out: Guarding protected attributes by iterative nullspace projection. In ACL, 2020.

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner.
Steering Llama 2 via contrastive activation addition. In ACL, 2024.

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L. Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024.

Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248, 2023.

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule. In ICLR, 2025.

A Appendix

A.1 Full Dose-Response Tables

| Condition | ask | Δ ask | p | d(ask) | pro | d(pro) | ∅ TC% |
|---|---|---|---|---|---|---|---|
| decode α=1 | 0.80 | −0.24 | 0.674 | −0.16 | 0.42 | +0.13 | 38% |
| decode α=2 | 0.74 | −0.30 | 0.359 | −0.20 | 0.32 | +0.02 | 38% |
| all α=1 | 1.46 | +0.42 | 0.210 | +0.25 | 0.88 | +0.47 | 22% |
| all α=2 | 0.12 | −0.92 | <0.0001 | −0.76 | 2.10 | +1.01 | 30% |
| all α=3 | 0.02 | −1.02 | <0.0001 | −0.87 | 1.92 | +1.04 | 36% |
| all α=5 | 0.00 | −1.04 | <0.0001 | −0.89 | 0.00 | −0.50 | 100% |
| all α=10 | 0.00 | −1.04 | <0.0001 | −0.89 | 0.00 | −0.50 | 100% |
| pre α=1 | 1.52 | +0.48 | 0.124 | +0.29 | 1.04 | +0.47 | 22% |
| pre α=2 | 0.24 | −0.80 | 0.0007 | −0.64 | 1.16 | +0.61 | 42% |
| pre α=3 | 0.00 | −1.04 | <0.0001 | −0.89 | 0.64 | +0.30 | 72% |
| pre α=5 | 0.00 | −1.04 | <0.0001 | −0.89 | 0.00 | −0.50 | 100% |

Autonomy.

| Condition | ask | Δ ask | p | d(ask) | pro | d(pro) | ∅ TC% |
|---|---|---|---|---|---|---|---|
| all α=1 | 0.78 | −0.26 | 0.144 | −0.16 | 0.72 | +0.32 | 52% |
| all α=2 | 0.34 | −0.70 | 0.002 | −0.52 | 1.48 | +0.63 | 58% |
| all α=3 | 0.26 | −0.78 | 0.0006 | −0.60 | 4.00 | +0.39 | 26% |
| pre α=1 | 0.64 | −0.40 | 0.037 | −0.26 | 0.48 | +0.17 | 60% |
| pre α=2 | 0.02 | −1.02 | <0.0001 | −0.87 | 0.52 | +0.19 | 84% |
| pre α=3 | 0.00 | −1.04 | <0.0001 | −0.89 | 1.24 | +0.56 | 76% |

Tool-Use Eagerness.
| Condition | ask | Δ ask | p | d(ask) | pro | d(pro) | ∅ TC% |
|---|---|---|---|---|---|---|---|
| all α=1 | 0.58 | −0.46 | 0.042 | −0.31 | 0.44 | +0.12 | 58% |
| all α=2 | 0.38 | −0.66 | 0.015 | −0.51 | 1.22 | +0.76 | 46% |
| all α=3 | 0.00 | −1.04 | <0.0001 | −0.89 | 0.00 | −0.50 | 100% |
| pre α=1 | 0.36 | −0.68 | 0.005 | −0.50 | 0.58 | +0.23 | 56% |
| pre α=2 | 0.28 | −0.76 | 0.001 | −0.58 | 1.46 | +0.80 | 46% |
| pre α=3 | 0.02 | −1.02 | <0.0001 | −0.87 | 0.20 | −0.14 | 84% |

Deference.

A.2 Steering Vector Cosine Similarity

Note: Autonomy (layer 22), tool use and risk cal. (both layer 21), deference (layer 15), and persistence (layer 23) operate at different layers, making cross-layer cosine similarity difficult to interpret directly. The tool-use / risk-calibration comparison is the most meaningful, as both vectors operate at the same layer through the same SAE.

Table 10: Pairwise cosine similarity of steering vectors in residual stream space.

| Pair | Cosine similarity |
|---|---|
| Autonomy / Tool use | +0.074 |
| Autonomy / Risk cal. | +0.086 |
| Tool use / Risk cal. | −0.065 |
| Autonomy / Deference | +0.034 |
| All other pairs | < 0.06 |

A.3 Experimental Infrastructure

All experiments were conducted on NVIDIA H200 SXM GPUs (141 GB HBM3e). The model requires 69.3 GB in BF16 precision; peak VRAM during SAE training is 92.1 GB (model + SAE + forward-pass activations). SAE training across all 9 hook points required approximately 72 GPU-hours. The 1,800 agent rollouts required approximately 12 hours of wall time on a single H200. Total compute cost was approximately $250 in cloud GPU rental.
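For concreteness, the core intervention (projecting probe weights through the SAE decoder, v = W_dec^T w_probe, then adding α·v to the residual stream during inference) can be sketched as follows. The shapes, weights, and module path are placeholders, not the actual Qwen 3.5-35B-A3B internals:

```python
import torch

d_model, n_feat = 4096, 16384  # placeholder residual width; 16,384 SAE latents

# Placeholder SAE decoder and probe weights; in practice these are the
# trained SAE's W_dec and a linear probe fit on SAE latent activations.
W_dec = torch.randn(n_feat, d_model) / d_model ** 0.5
w_probe = torch.randn(n_feat)

# Probe-to-decoder projection: map the probe direction from SAE latent
# space back into the model's native residual stream, then unit-normalize
# so that the multiplier alpha sets the intervention scale.
v = W_dec.T @ w_probe
v = v / v.norm()

def steering_hook(alpha):
    """Forward hook that adds alpha * v to a layer's residual-stream output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * v.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Assumed usage (module path is hypothetical):
# handle = model.model.layers[22].register_forward_hook(steering_hook(alpha=2.0))
# ... run generation ...
# handle.remove()
```

Because the projection is a single matrix-vector product, the vector can be computed once and cached; no SAE forward pass or top-k discretization is needed at inference time.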