Dynamically Scaled Activation Steering

Reading time: 5 minute
...

📝 Original Info

  • Title: Dynamically Scaled Activation Steering
  • ArXiv ID: 2512.03661
  • Date: 2025-12-03
  • Authors: Alex Ferrando, Xavier Suau, Jordi Gonzàlez, Pau Rodriguez

📝 Abstract

Activation steering has emerged as a powerful method for guiding the behavior of generative models towards desired outcomes such as toxicity mitigation. However, most existing methods apply interventions uniformly across all inputs, degrading model performance when steering is unnecessary. We introduce Dynamically Scaled Activation Steering (DSAS), a method-agnostic steering framework that decouples when to steer from how to steer. DSAS adaptively modulates the strength of existing steering transformations across layers and inputs, intervening strongly only when undesired behavior is detected. At generation time, DSAS computes context-dependent scaling factors that selectively adjust the strength of any steering method. We also show how DSAS can be jointly optimized end-to-end together with the steering function. When combined with existing steering methods, DSAS consistently improves the Pareto front with respect to steering alone, achieving a better trade-off between toxicity mitigation and utility preservation. We further demonstrate DSAS's generality by applying it to a text-to-image diffusion model, showing how adaptive steering allows the modulation of specific concepts. Finally, DSAS introduces minimal computational overhead while improving interpretability, pinpointing which tokens require steering and by how much.

💡 Deep Analysis

Figure 1

📄 Full Content

Induce: pear Figure 1 DSAS dynamically scales the intervention strengths applied to each token. Vanilla activation steering with the common strategy of applying a global strength λ, induces pear regardless of the input prompt (left). Our DSAS adapts the per-token strength of any steering technique to work only conditional to some aspect of the input, in this example, only when the concept fruit is present (right). Note how pear does not appear in (middle) since the prompt is not about fruits.

A central challenge in generative modeling is aligning model behavior with human expectations, suppressing harmful or biased outputs while preserving general capabilities. The need for such alignment has motivated research into different conditioning methods, such as prompt engineering (Marvin et al., 2023), fine-tuning (Hu et al., 2022), or activation steering (Li et al., 2023).

Activation steering has recently gained traction by effectively balancing computational cost and conditioning power. This family of techniques directly manipulate internal representations towards a desired behavior offering fine-grained control and interpretability without modifying the model weights. Previous work has

The goal of activation steering is to guide the neural network to exhibit a desired behavior, while preserving its performance in other domains. As discussed in section 1, existing literature has demonstrated promising results in eliciting target behaviors such as reducing toxicity or improving truthfulness. However, the ability to steer the model selectively, i.e., activating steering only when necessary, while maintaining its general capabilities remains underexplored and challenging. In this section, we introduce a novel method designed to enable controlled and context-dependent steering.

Following the notation from Rodríguez et al. (2025c), we define a neural network as a composition of L + 1 functions f ℓ , where each f ℓ represents a distinct component of the network (e.g., a transformer layer, a block of consecutive layers, an MLP, etc.). Thus, for a given input x ∼ X, the output of the network is

Each input is considered a sequence of K tokens t k so that x = [t 1 , . . . , t k , . . . , t K ], with t k ∈ R d .

The steering functions, namely T ℓ : R d ℓ → R d ℓ , are applied on the intermediate activations of the network, i.e., the outputs of f 1 , . . . , f L , where d ℓ denotes the dimensionality of the embedding space produced by f ℓ .

A given layer composed with an intervention, T ℓ • f ℓ , is considered an intervened layer. Note that one could choose to intervene only upon a subset of layers.

An internal activation of the original network is defined as

Similarly, an internal activation of the intervened network is defined as âℓ

It is important to note that âℓ (x) does not include the steering applied at layer ℓ itself, but only the steering up to layer ℓ -1.

For simplicity, we often refer to a ℓ and âℓ , dropping the (x) term.

Existing activation steering methods (Li et al., 2024;Rimsky et al., 2023;Wu et al., 2024;Rodríguez et al., 2025b,c) rely on two sets of inputs to estimate the intervention functions T ℓ , namely source and target sets (see definitions 1 and 2). Typically, the source inputs are sentences that represent an unwanted behavior of the model (e.g., toxic language), while target sentences represent a wanted behavior (e.g., non-toxic language). Then, different approaches propose various ways of estimating T ℓ such that the overall model behavior is closer to the target domain.

Although effective, blindly steering the model behavior towards the target domain has adverse side effects. For example, steering away from a sensitive domain like toxicity can unintentionally degrade the model’s performance on unrelated tasks, reducing its ability to generate accurate responses outside the target domain. Therefore, a core challenge in activation steering is to flexibly apply behavioral modifications only when necessary, e.g., steering inputs that exhibit undesired behaviors (represented by the source set) while preserving the model’s original performance on neutral or unrelated inputs. To this end, we use a control set (definition 3) with the goal of steering from source to target domains, while preserving the model’s original behavior on the control domain.

Definition 1 (Source set). A set of samples S = {x src,(i) } ∼ X src ⊂ X exhibiting undesired behavior ( e.g., toxicity, hallucinations). These are the examples the model should steer away from.

Definition 2 (Target set). A set of samples T = {x tgt,(i) } ∼ X tgt ⊂ X exhibiting desired behavior ( e.g., politeness, factuality). These represent the behavior the model should move towards.

Definition 3 (Control set). A set of samples C = {x ctl,(i) } ∼ X ctl ⊂ X neutral with respect to the behavior being modified. They serve as a baseline and should remain unaffected by steering.

The relationship between the control set C and the target set T d

📸 Image Gallery

AccuracyPlot.png BMMLUGemma.png BMMLUQwen.png CASTGemma.png CASTQwen.png CASTQwen7.png MERAGemma.png MERAQwen.png MERAQwen7.png StrengthGemma_rebuttal.png StrengthQwen7.png StrengthQwen_rebuttal.png appl_1_0.png appl_1_100.png appl_1_25.png appl_1_50.png appl_1_75.png appl_1_dsas_100.png appl_1_dsas_25.png appl_1_dsas_50.png appl_1_dsas_75.png appl_2_0.png appl_2_100.png appl_2_25.png appl_2_50.png appl_2_75.png appl_2_dsas_100.png appl_2_dsas_25.png appl_2_dsas_50.png appl_2_dsas_75.png appl_3_0.png appl_3_100.png appl_3_25.png appl_3_50.png appl_3_75.png appl_3_dsas_100.png appl_3_dsas_25.png appl_3_dsas_50.png appl_3_dsas_75.png apple_logo.png ast_1_0.png ast_1_100.png ast_1_25.png ast_1_50.png ast_1_75.png ast_1_dsas_100.png ast_1_dsas_25.png ast_1_dsas_50.png ast_1_dsas_75.png ast_2_0.png ast_2_100.png ast_2_25.png ast_2_50.png ast_2_75.png ast_2_dsas_100.png ast_2_dsas_25.png ast_2_dsas_50.png ast_2_dsas_75.png ast_3_0.png ast_3_100.png ast_3_25.png ast_3_50.png ast_3_75.png ast_3_dsas_100.png ast_3_dsas_25.png ast_3_dsas_50.png ast_3_dsas_75.png bnn_1_0.png bnn_1_100.png bnn_1_25.png bnn_1_50.png bnn_1_75.png bnn_1_dsas_100.png bnn_1_dsas_25.png bnn_1_dsas_50.png bnn_1_dsas_75.png bnn_2_0.png bnn_2_100.png bnn_2_25.png bnn_2_50.png bnn_2_75.png bnn_2_dsas_100.png bnn_2_dsas_25.png bnn_2_dsas_50.png bnn_2_dsas_75.png bnn_3_0.png bnn_3_100.png bnn_3_25.png bnn_3_50.png bnn_3_75.png bnn_3_dsas_100.png bnn_3_dsas_25.png bnn_3_dsas_50.png bnn_3_dsas_75.png caa_activations.png caa_cos_sim.png caa_dist.png caa_pca_vs_no_pca.png caa_weight.png clip_score_caa.png clip_score_caa_dsas.png cosine_map_avg_banana.png cosine_map_avg_car.png csl_1_0.png csl_1_100.png csl_1_25.png csl_1_50.png csl_1_75.png csl_1_dsas_100.png csl_1_dsas_25.png csl_1_dsas_50.png csl_1_dsas_75.png csl_2_0.png csl_2_100.png csl_2_25.png csl_2_50.png csl_2_75.png csl_2_dsas_100.png csl_2_dsas_25.png csl_2_dsas_50.png csl_2_dsas_75.png csl_3_0.png csl_3_100.png csl_3_25.png csl_3_50.png csl_3_75.png csl_3_dsas_100.png csl_3_dsas_25.png csl_3_dsas_50.png csl_3_dsas_75.png decoded_base_banana.png decoded_base_car.png decoded_hooked_banana.png decoded_hooked_car.png e2e_Gemma.png e2e_Qwen.png e2e_Qwen7.png ele_1_0.png ele_1_100.png ele_1_25.png ele_1_50.png ele_1_75.png ele_1_dsas_100.png ele_1_dsas_25.png ele_1_dsas_50.png ele_1_dsas_75.png ele_2_0.png ele_2_100.png ele_2_25.png ele_2_50.png ele_2_75.png ele_2_dsas_100.png ele_2_dsas_25.png ele_2_dsas_50.png ele_2_dsas_75.png ele_3_0.png ele_3_100.png ele_3_25.png ele_3_50.png ele_3_75.png ele_3_dsas_100.png ele_3_dsas_25.png ele_3_dsas_50.png ele_3_dsas_75.png fig_1_dsas_colorbar.png iti_activations.png iti_cos_sim.png iti_dist.png iti_pca_vs_no_pca.png iti_weight.png layer_accuracies_bananas.png lineas_activations.png lineas_cos_sim.png lineas_dist.png lineas_pca_vs_no_pca.png lineas_weight.png nCompGemma.png nCompQwen.png noise.png non_appl_1_100.png non_appl_1_25.png non_appl_1_50.png non_appl_1_75.png non_appl_1_dsas_100.png non_appl_1_dsas_25.png non_appl_1_dsas_50.png non_appl_1_dsas_75.png non_appl_2_100.png non_appl_2_25.png non_appl_2_50.png non_appl_2_75.png non_appl_2_dsas_100.png non_appl_2_dsas_25.png non_appl_2_dsas_50.png non_appl_2_dsas_75.png non_appl_3_100.png non_appl_3_25.png non_appl_3_50.png non_appl_3_75.png non_appl_3_dsas_100.png non_appl_3_dsas_25.png non_appl_3_dsas_50.png non_appl_3_dsas_75.png non_ast_1_100.png non_ast_1_25.png non_ast_1_50.png non_ast_1_75.png non_ast_1_dsas_100.png non_ast_1_dsas_25.png non_ast_1_dsas_50.png non_ast_1_dsas_75.png non_ast_2_100.png non_ast_2_25.png non_ast_2_50.png non_ast_2_75.png non_ast_2_dsas_100.png non_ast_2_dsas_25.png non_ast_2_dsas_50.png non_ast_2_dsas_75.png non_ast_3_100.png non_ast_3_25.png non_ast_3_50.png non_ast_3_75.png non_ast_3_dsas_100.png non_ast_3_dsas_25.png non_ast_3_dsas_50.png non_ast_3_dsas_75.png non_bnn_1_0.png non_bnn_1_100.png non_bnn_1_25.png non_bnn_1_50.png non_bnn_1_75.png non_bnn_1_dsas_100.png non_bnn_1_dsas_25.png non_bnn_1_dsas_50.png non_bnn_1_dsas_75.png non_bnn_2_0.png non_bnn_2_100.png non_bnn_2_25.png non_bnn_2_50.png non_bnn_2_75.png non_bnn_2_dsas_100.png non_bnn_2_dsas_25.png non_bnn_2_dsas_50.png non_bnn_2_dsas_75.png non_bnn_3_0.png non_bnn_3_100.png non_bnn_3_25.png non_bnn_3_50.png non_bnn_3_75.png non_bnn_3_dsas_100.png non_bnn_3_dsas_25.png non_bnn_3_dsas_50.png non_bnn_3_dsas_75.png non_csl_1_100.png non_csl_1_25.png non_csl_1_50.png non_csl_1_75.png non_csl_1_dsas_100.png non_csl_1_dsas_25.png non_csl_1_dsas_50.png non_csl_1_dsas_75.png non_csl_2_100.png non_csl_2_25.png non_csl_2_50.png non_csl_2_75.png non_csl_2_dsas_100.png non_csl_2_dsas_25.png non_csl_2_dsas_50.png non_csl_2_dsas_75.png non_csl_3_100.png non_csl_3_25.png non_csl_3_50.png non_csl_3_75.png non_csl_3_dsas_100.png non_csl_3_dsas_25.png non_csl_3_dsas_50.png non_csl_3_dsas_75.png non_ele_1_100.png non_ele_1_25.png non_ele_1_50.png non_ele_1_75.png non_ele_1_dsas_100.png non_ele_1_dsas_25.png non_ele_1_dsas_50.png non_ele_1_dsas_75.png non_ele_2_100.png non_ele_2_25.png non_ele_2_50.png non_ele_2_75.png non_ele_2_dsas_100.png non_ele_2_dsas_25.png non_ele_2_dsas_50.png non_ele_2_dsas_75.png non_ele_3_100.png non_ele_3_25.png non_ele_3_50.png non_ele_3_75.png non_ele_3_dsas_100.png non_ele_3_dsas_25.png non_ele_3_dsas_50.png non_ele_3_dsas_75.png non_phn_1_100.png non_phn_1_25.png non_phn_1_50.png non_phn_1_75.png non_phn_1_dsas_100.png non_phn_1_dsas_25.png non_phn_1_dsas_50.png non_phn_1_dsas_75.png non_phn_2_100.png non_phn_2_25.png non_phn_2_50.png non_phn_2_75.png non_phn_2_dsas_100.png non_phn_2_dsas_25.png non_phn_2_dsas_50.png non_phn_2_dsas_75.png non_phn_3_100.png non_phn_3_25.png non_phn_3_50.png non_phn_3_75.png non_phn_3_dsas_100.png non_phn_3_dsas_25.png non_phn_3_dsas_50.png non_phn_3_dsas_75.png num_samples_accuracy.png pareto_dsas_acc_overlay.png phn_1_0.png phn_1_100.png phn_1_25.png phn_1_50.png phn_1_75.png phn_1_dsas_100.png phn_1_dsas_25.png phn_1_dsas_50.png phn_1_dsas_75.png phn_2_0.png phn_2_100.png phn_2_25.png phn_2_50.png phn_2_75.png phn_2_dsas_100.png phn_2_dsas_25.png phn_2_dsas_50.png phn_2_dsas_75.png phn_3_0.png phn_3_100.png phn_3_25.png phn_3_50.png phn_3_75.png phn_3_dsas_100.png phn_3_dsas_25.png phn_3_dsas_50.png phn_3_dsas_75.png strength_map_avg_banana.png strength_map_avg_car.png training_time_comparison_gemma.png training_time_comparison_qwen.png

Reference

This content is AI-processed based on open access ArXiv data.

Start searching

Enter keywords to search articles

↑↓
ESC
⌘K Shortcut