Machine unlearning aims to erase requested data from trained models without full retraining. For Reasoning Multimodal Large Language Models (RMLLMs), this is uniquely challenging: intermediate chain-of-thought steps can still leak sensitive information even when final answers are forgotten, and overly aggressive interventions easily damage general reasoning ability. Yet no benchmark jointly evaluates how well unlearning methods suppress reasoning-level leakage while preserving reasoning competence. We address this gap with RMLLMU-Bench, the first benchmark for RMLLM unlearning that extends standard forgetting metrics with dedicated measures of reasoning leakage and reasoning retention. A systematic evaluation on RMLLMU-Bench reveals that existing unlearning methods for MLLMs and Large (Language) Reasoning Models (LRMs) either leave substantial leakage in the reasoning process or severely degrade reasoning performance. To address these gaps, we propose R-MUSE (Reasoning-preserving MLLM Unlearning via Subspace guidance and Adaptive StEering), a training-free, inference-time intervention framework that steers internal representations to forget both answers and reasoning traces while explicitly preserving general reasoning. Experiments on RMLLMU-Bench demonstrate that R-MUSE achieves a substantially better balance between effective forgetting and reasoning retention.

Recently, the ability to generate chain-of-thought (CoT) during test-time computation has enabled MLLMs to evolve into RMLLMs, which no longer simply output the final answer but are explicitly guided to generate step-by-step reasoning trajectories during inference. However, this reasoning capability introduces a new challenge to MLLM unlearning. A core problem is that simply forgetting the final answer is no longer sufficient. As shown in Figure 1, even if the answer is changed, sensitive information can still appear in intermediate reasoning (Reasoning Leakage), while stronger interventions that suppress such leakage often break coherent reasoning on non-forget data (Reasoning Retention).
Multimodal Large Language Models (MLLMs) (Bai et al., 2023;Comanici et al., 2025) have achieved strong performance on visual question answering (VQA) (Hu et al., 2024c), image-text generation (Sauer et al., 2024), and many other tasks (Zhang et al., 2025c;Xu et al., 2025;Yang et al., 2025a;Zhou et al., 2025a). However, large-scale training on uncurated web data raises concerns about memorizing and leaking privacy-sensitive or harmful information, especially in safety-critical applications (Gao et al., 2024;Carlini et al., 2024;Lin et al., 2024;Guo et al., 2025;Dong et al., 2025;Luo et al., 2025;Hu et al., 2024b;Zhang et al., 2025a;Zhou et al., 2025c). Machine unlearning (MU) aims to remove designated information from trained models without full retraining, and existing work for MLLMs (Yang et al., 2025b;Li et al., 2024;He et al., 2024) typically adapts gradient-ascent, preference-optimization, or targeted fine-tuning strategies from text-only LLMs (Grace et al., 2023;Yu et al., 2024;Zheng et al., 2024;Meng et al., 2024;2022;Zhang et al., 2024b;Hong et al., 2024), acting mainly on final answers or short responses.
(Figure 1, left panel, example of a leaked chain-of-thought: "This individual might currently have a high salary, as indicated by his happy smile and mature face.")
Figure 1: Illustration of the core challenge in unlearning for RMLLMs. Left: MLLM unlearning method changes the final answer, but its chain-of-thought still reconstructs the memorized fact, causing reasoning leakage. Right: LRM unlearning method avoids leakage but collapses into incoherent, repetitive reasoning, destroying general reasoning ability.
While there are several studies on MLLM unlearning (Huo et al., 2025; Zhou et al., 2025b), all of them consider classical MLLMs, and no benchmark directly targets unlearning in reasoning-capable multimodal models or jointly measures both leakage along the reasoning chain and preservation of reasoning capability. We address this gap with RMLLMU-Bench, which extends standard unlearning measures with reasoning-aware metrics and assesses unlearning algorithms in terms of unlearning efficacy, generalization, and model utility across forget, test, retain, and celebrity splits. On this benchmark, we observe a consistent pattern: unlearning methods designed for MLLMs leave substantial reasoning leakage, while LRM-style methods significantly impair reasoning quality in the multimodal setting.
To tackle these challenges, we propose R-MUSE, a training-free, inference-time intervention framework based on activation steering. Specifically, R-MUSE answers three questions: what to steer, where to steer, and how strongly to steer. First, we construct a span-mixed unlearning direction by contrasting recall-style generations against refusal-guided ones and aggregating signals over both answer tokens and chain-of-thought steps, so that steering targets not only the final answer but also the reasoning process. Second, to protect basic reasoning ability, we project this steering onto the orthogonal complement of a learned Reasoning Retain Subspace (RRS), built from contrastive activations of stepwise solutions and direct answers. Third, instead of a fixed global knob, we introduce Adaptive Calibration Steering (ACS), which adjusts the steering strength based on the optimal transport distance between the current hidden state and the protected unlearning direction, resulting in fine-grained steering.

Our contributions are summarized as follows:
• We formalize reasoning-preserving unlearning for MLLMs and introduce RMLLMU-Bench, a benchmark that augments existing settings with explicit reasoning traces and metrics for both reasoning leakage and reasoning retention.
• We propose R-MUSE, a training-free, test-time intervention framework that unlearns both final answers and intermediate reasoning while explicitly protecting general reasoning ability.
Existing MLLM forgetting benchmarks mainly test whether the final predictions on the forget set are changed while utility on non-forgotten data is preserved. However, they do not evaluate (i) leakage of sensitive information through intermediate reasoning steps or (ii) how much of the model’s overall reasoning ability is retained after forgetting.
To systematically study forgetting in multimodal models with explicit reasoning, we construct RMLLMU-Bench, which extends the previous MLLMU-Bench by enriching each instance with thought chains and introducing perceptual reasoning evaluation tailored to RMLLMs. Section 3.1 details the dataset construction, and Section 3.2 defines metrics that quantify both forgetting and the preservation of reasoning ability at the level of the reasoning process.
MLLMU-Bench is a curated benchmark designed to evaluate machine unlearning in multimodal large language models where each sample pairs multimodal profiles with designed queries to test a model’s ability to forget targeted knowledge while preserving general understanding. To transform MLLMU-Bench into a reasoning benchmark, we introduce RMLLMU-Bench, where each sample is extended with a structured reasoning field that explicitly captures the intermediate cognitive process linking the input (profile and query) to the final answer. The design of this reasoning component follows three key principles: Attributability, Conservativeness, and Consistency.
Attributability ensures that every reasoning step is explicitly tied to verifiable evidence, including textual attributes (e.g., occupation, residence) and localized image regions, so that the entire chain can be traced and independently validated. Conservativeness constrains the reasoning process to rely solely on the given visual and textual inputs without incorporating any external world knowledge or unverifiable assumptions. Consistency requires that the reasoning path remain logically coherent, free of self-contradiction, and fully aligned with the final answer.
Building upon these principles, we designed a two-stage self-refining generation framework inspired by recent paradigms (Madaan et al., 2023;Yu et al., 2025b). Specifically, we first employ Gemini-2.5-Pro (Comanici et al., 2025), which has demonstrated strong reasoning ability in prior work (Zheng et al., 2025a;Yu et al., 2025a), to produce initial reasoning paths based on each profile (including its image) and the corresponding question-answer pair. The model is prompted to generate a structured chain of thought that adheres to the three reasoning principles above.
Next, the generated reasoning is automatically assessed by Gemini-2.5-Flash (Comanici et al., 2025), a lightweight yet robust verifier (Zheng et al., 2025b), which evaluates the candidate reasoning along the same three dimensions. If the reasoning satisfies all evaluation criteria, it is retained as final; otherwise, a structured bug report is generated, detailing the violation type, and fed back to Gemini-2.5-Pro for iterative regeneration. This closed-loop design effectively enables the reasoning generator to self-refine through targeted feedback, ensuring the resulting CoTs are of high quality across modalities. Dataset statistic and full prompt templates are provided in Appendix F. To ensure correctness, high quality, and fairness of LLM-generated content, we conducted a human expert review to identify and filter flawed generations, which were then regenerated and verified.
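To make the closed loop concrete, below is a minimal sketch of the generate-verify-refine procedure. `call_generator` and `call_verifier` are hypothetical wrappers around the Gemini-2.5-Pro generator and Gemini-2.5-Flash verifier calls, and the return format of the verifier is an assumption; the actual prompts are given in Appendix F.

```python
def build_reasoning(profile, image, question, answer,
                    call_generator, call_verifier, max_rounds=3):
    """Return a verified chain-of-thought for one (profile, QA) instance, or None."""
    feedback = None
    for _ in range(max_rounds):
        # Stage 1: propose a structured CoT that follows the three principles
        # (Attributability, Conservativeness, Consistency).
        cot = call_generator(profile=profile, image=image,
                             question=question, answer=answer,
                             bug_report=feedback)
        # Stage 2: the lightweight verifier scores the CoT on the same three axes
        # and returns a structured bug report when a principle is violated.
        verdict = call_verifier(cot=cot, profile=profile, image=image,
                                question=question, answer=answer)
        if verdict["passed"]:
            return cot
        feedback = verdict["bug_report"]  # fed back for targeted regeneration
    return None  # never passed: flagged for human expert review
```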
To comprehensively assess the behavior of RMLLMs after unlearning, we propose two complementary evaluation metrics: Reasoning Information Leakage (RIL) and Reasoning Capability Retention (RCR). Unlike conventional unlearning metrics that measure task-level forgetting or utility preservation, our metrics focus on the reasoning process itself, capturing whether an unlearned model (1) still reveals forgotten information, or (2) maintains coherent reasoning grounded in the provided evidence.
RIL. RIL quantifies how much residual or indirect information about the forgotten data still appears in a model’s reasoning process after unlearning. A well-unlearned model should exhibit low reasoning leakage, indicating that it has forgotten not only the factual knowledge but also the reasoning paths related to that knowledge. We design a two-level detection procedure to measure RIL:
1. Level 1: Explicit Leakage (Rule-Based). We perform a deterministic string-matching check between the reasoning text and the forgotten attributes. If any literal match occurs (e.g., the exact string “Japan” appears in the CoT when the forgotten attribute is {residence: Japan}), the reasoning is flagged as an explicit leakage case. This stage captures direct memorization residues.
2. Level 2: Implicit Leakage (LLM-Based). To detect paraphrased or semantically implied leaks, we employ Gemini-2.5-Pro as a judge. The judge model receives a forgotten key-value pair and a reasoning sentence, and is asked:
Does the reasoning step imply or refer to the forgotten information below, even indirectly or by paraphrase? Answer strictly “YES” or “NO”.
Forgotten info: {residence: Japan} COT: “… He lives in Tokyo …”
In this example, the correct response is “YES”, since Tokyo semantically implies Japan. To reflect the two-stage leakage detection process, we decompose RIL into explicit and implicit components, introducing a weighting factor α ∈ [0, 1] to control their relative contribution. In our work, we set α = 0.5, as we consider the two components equally important:

RIL = (α · N_explicit + (1 − α) · N_implicit) / N_total,
where N_explicit and N_implicit denote the number of reasoning samples flagged by rule-based literal matching and by the LLM-based semantic judge, respectively, and N_total is the total number of evaluated samples. A lower RIL indicates stronger reasoning-level forgetting and better privacy preservation.
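A minimal sketch of this computation, assuming the per-sample binary flags from the string matcher and the LLM judge are already available (function and variable names are ours):

```python
def reasoning_information_leakage(explicit_flags, implicit_flags, alpha=0.5):
    """explicit_flags / implicit_flags: per-sample 0/1 leakage indicators."""
    n_total = len(explicit_flags)
    n_explicit = sum(explicit_flags)   # Level 1: literal string matches
    n_implicit = sum(implicit_flags)   # Level 2: judge-detected paraphrases
    return (alpha * n_explicit + (1 - alpha) * n_implicit) / n_total
```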
RCR. RCR measures the model’s ability to retain general reasoning competence on non-forgotten data after unlearning. An ideal unlearned model should maintain logically valid and evidence-grounded reasoning chains. We again use Gemini-2.5-Flash as the judge model; the detailed prompt is in Appendix F.4. The judge answers “YES” if the reasoning is valid and evidence-supported, and “NO” if it includes unsupported or hallucinated steps.
Each reasoning chain is independently evaluated 3 times by the judge model. Let v_ij ∈ {0, 1} denote the judgment for sample i in the j-th evaluation (1 for “YES”, 0 for “NO”). Final correctness is determined via majority voting, which we express as:

RCR = (1 / N_total) · Σ_i I( Σ_{j=1}^{3} v_ij ≥ 2 ),
where I(•) is the indicator function that outputs 1 when the condition is true and 0 otherwise, and N total is the total number of evaluated reasoning samples. Higher RCR reflects stronger preservation of general reasoning ability after unlearning.
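A corresponding sketch for RCR under three-way majority voting (names are ours):

```python
def reasoning_capability_retention(votes):
    """votes: list of per-sample judgments, each a list of 3 binary values."""
    n_total = len(votes)
    n_valid = sum(1 for v in votes if sum(v) >= 2)  # majority of the 3 runs
    return n_valid / n_total
```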
In this section, we present R-MUSE. We first construct an unlearning subspace that targets both reasoning leakage and final answers (Section 4.1). We then introduce the Reasoning Retain Subspace (RRS), which preserves general reasoning by orthogonally protecting desirable directions (Section 4.2). Finally, we describe a hyperparameter-free procedure for computing and applying the unlearning intervention at inference time (Section 4.3).
Recent work (Olson et al., 2024; Zhang et al., 2025b; Hernandez et al., 2023; Wang et al., 2025d) shows that the influence of a small set of examples is often concentrated in a low-dimensional linear subspace of the model’s activations. Follow-up studies further demonstrate that higher-level skills such as reasoning can be modulated by linear directions in this space (Chen et al., 2023; Zbeeb et al., 2025). Motivated by these findings, we seek an unlearning subspace that (i) encodes recall of final answers and (ii) captures intermediate reasoning traces. To capture both aspects, we introduce a span-hybrid differential over process states rather than a single end token. For each forget-set item, we form a guided positive x_i^+ by concatenating the problem with a short refusal-style prefix g ∈ G (sampled from a small template pool) and an ideal refusal answer (from a small answer pool; details in Appendix F). We retain the model’s original answer-form and reasoning-form outputs as negatives. Let h_{ℓ,t}(x) denote the hidden state at layer ℓ and token t. We localize answer content and reasoning traces via span pooling and contrast them:
Here S_ans and S_cot are token spans from the structured output: S_cot is the content under the Reasoning field and S_ans the content under the Answer field (field labels excluded); if no reasoning is emitted, we set S_cot = ∅.
To balance the scales of the two views, we apply batchwise per-dimension z-scoring to each span differential and sum them:

Δ_ℓ(i) = ZScore(Δ_ℓ^ans(i)) + ZScore(Δ_ℓ^cot(i)).
Here ZScore(•) denotes per-coordinate standardization within the current batch: for each layer ℓ and view v ∈ {ans, cot},

[ZScore(Δ_ℓ^v(i))]_j = ( Δ_ℓ^v(i)_j − µ_{ℓ,j}^v ) / σ_{ℓ,j}^v,
where µ_{ℓ,j}^v and σ_{ℓ,j}^v are the mean and standard deviation of the j-th coordinate over forget-set items in the batch. This keeps the answer and reasoning differentials on comparable scales, preventing one view from dominating the subsequent SVD purely due to larger raw variance. Note that the mixture is formed in representation space, so the two spans remain disjoint in token space.
Stacking the Δ_ℓ(i) column-wise yields X_ℓ. After centering, a compact Singular Value Decomposition (SVD) X_ℓ = U_ℓ Σ_ℓ V_ℓ^⊤ gives the left singular vectors, and we take as our target space the subspace U_ℓ^un spanned by the top-k left singular vectors.
Following common practice for SVD truncation (Han et al., 2024), we pick the smallest k such that the cumulative energy Σ_{j≤k} σ_j² / Σ_j σ_j² ≥ η = 0.8. In practice, we construct two such projectors: one from QA pairs and another from image-question-answer (VQA) triplets. The resulting subspaces U_ℓ^qa and U_ℓ^vqa capture unlearning directions in unimodal versus cross-modal reasoning contexts.
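A compact PyTorch sketch of this construction for a single layer. Mean pooling over each span and the variable names are our assumptions; hidden states are assumed to be precomputed for the guided (positive) and original (negative) generations.

```python
import torch

def span_diff(h_pos, h_neg, span_pos, span_neg):
    """Mean-pooled activation difference over the matching spans, shape (H,)."""
    return h_pos[span_pos].mean(dim=0) - h_neg[span_neg].mean(dim=0)

def zscore(x, eps=1e-6):
    """Batchwise per-dimension standardization; x has shape (N, H)."""
    return (x - x.mean(dim=0)) / (x.std(dim=0) + eps)

def build_unlearning_basis(ans_diffs, cot_diffs, energy=0.8):
    """ans_diffs, cot_diffs: (N, H) stacked span differentials for one layer."""
    mixed = zscore(ans_diffs) + zscore(cot_diffs)          # span-mixed differential
    X = mixed - mixed.mean(dim=0, keepdim=True)            # center across the batch
    U, S, _ = torch.linalg.svd(X.T, full_matrices=False)   # left singular vectors in R^H
    ratio = torch.cumsum(S**2, dim=0) / (S**2).sum()
    k = int((ratio >= energy).nonzero()[0, 0]) + 1         # smallest k reaching 80% energy
    return U[:, :k]                                        # U_un for this layer
```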
The span-hybrid unlearning subspace captures directions that encode recall of sensitive facts and their reasoning traces. If we steer along this subspace for all inputs, however, we may (i) alter queries unrelated to the forget set and (ii) erode the model’s general reasoning ability. Thus, we still need an inference-time criterion for when to steer, and a method to protect the general reasoning ability of the model.
To this end, we build a Reasoning Retain Subspace (RRS) on a retain set R, using the same span-differential pipeline as in Section 4.1. For each retain example, we elicit a pair of outputs that differ only in whether explicit reasoning is present:
where q_i is the input query, r_i is an explicit step-by-step solution elicited by a neutral guidance prompt g (e.g., “Let us think step by step.”), and d_i is a concise direct answer under a brief directive (e.g., “Answer directly in one sentence; no reasoning.”). The per-item differentials Δ_ℓ^rrs(i) are obtained by reusing Eqs. (4.1)-(4.2) with (x_i^+, x_i^−) as the positive and negative pair. Stacking these differentials and applying the compact SVD in Eq. (4.4) yields the layerwise projector P_ℓ^rrs = U_ℓ^rrs (U_ℓ^rrs)^⊤,
with the rank r chosen by the same energy criterion as k in Eq. (4.4). The columns of U_ℓ^rrs span directions that move hidden states toward richly reasoned solutions and away from shortcut, direct-answer behavior on R.
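The RRS reuses the same machinery on retain data; a sketch under the same assumptions as the previous snippet, where the differentials now contrast reasoned runs against direct-answer runs:

```python
import torch

def build_rrs(rrs_diffs, energy=0.8):
    """rrs_diffs: (N, H) differentials between reasoned and direct-answer runs."""
    X = rrs_diffs - rrs_diffs.mean(dim=0, keepdim=True)
    U, S, _ = torch.linalg.svd(X.T, full_matrices=False)
    ratio = torch.cumsum(S**2, dim=0) / (S**2).sum()
    r = int((ratio >= energy).nonzero()[0, 0]) + 1
    U_rrs = U[:, :r]
    P_rrs = U_rrs @ U_rrs.T          # layerwise orthogonal projector onto the RRS
    return U_rrs, P_rrs
```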
When to Steer. At inference time, we first use the RRS to decide whether unlearning should be applied to a given query. We measure how strongly the query’s hidden state aligns with the RRS at a designated scoring layer ℓ* via

s_gate(q) = ‖ P_{ℓ*}^rrs h_{ℓ*}^end(q) ‖_2 / ‖ h_{ℓ*}^end(q) ‖_2,
where h_{ℓ*}^end(·) is the end-token activation at layer ℓ* for the guided query. A large s_gate(q) indicates that the query lies mostly within the reasoning-preserving subspace and is likely unrelated to the forget set, so aggressive unlearning is unnecessary.
Gating function. We convert this score into a binary gate

g(q) = I[ s_gate(q) < τ ],
where τ ∈ (0, 1) is a threshold and I[•] is the indicator function. A value of g(q) = 1 means that unlearning steering is activated for query q, while g(q) = 0 skips the update.
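A sketch of the gate, assuming the normalized projection-norm form of s_gate given above; the scoring layer's end-token state and the RRS projector are the inputs:

```python
import torch

def steering_gate(h_end, P_rrs, tau=0.85):
    """Return 1 if unlearning steering should be applied to this query, else 0."""
    s_gate = torch.norm(P_rrs @ h_end) / torch.norm(h_end)
    return int(s_gate < tau)   # high RRS alignment -> likely unrelated to forget set -> skip
```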
We next tie the steering update back to the unlearning subspace of §4.1. For stability and simplicity, we define the raw steering signal at layer ℓ as the rank-1 projection of the current state onto the principal unlearning direction, i.e., v_ℓ^un (v_ℓ^un)^⊤ h_ℓ(x),
where U_ℓ is the unlearning basis from Eq. (4.4) and v_ℓ^un its top singular direction. Finally, we combine gating and orthogonal protection into a single update operator, applied at each steering layer ℓ ∈ L.
When g(q) = 0, the query is well aligned with the RRS and we skip unlearning. Otherwise, we apply a steering update that is (i) aligned with the principal unlearning direction and (ii) projected onto the orthogonal complement of the RRS via (I − P_ℓ^rrs), ensuring that we do not overwrite directions that support general reasoning.
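A sketch of the gated, RRS-protected steering signal. The exact update operator (Eq. 4.10) is not reproduced here; the composition below simply follows the description above, with the step size supplied separately by ACS in the next subsection.

```python
import torch

def steering_signal(h, v_un, P_rrs, gate):
    """h: (H,) hidden state; v_un: (H,) unit principal unlearning direction."""
    if gate == 0:
        return torch.zeros_like(h)              # query aligned with RRS: skip steering
    u = v_un * (v_un @ h)                       # rank-1 projection onto v_un
    eye = torch.eye(h.shape[0], dtype=h.dtype)
    return (eye - P_rrs) @ u                    # remove any reasoning-supporting component
```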
Classical activation steering adds a fixed direction to the hidden state with a hand-tuned strength,

h̃_ℓ = h_ℓ + λ v_ℓ. (4.11)
However, this approach is heuristic (steering layers are selected by experience) and unstable (strength and direction are entangled). To address these issues, we propose Adaptive Calibration Steering (ACS). Our core idea is to cast the steering process as an optimal transport (Montesuma et al., 2024) problem: the direction of the current hidden state h is transported toward a target distribution, and the transport cost provides a natural notion of “how much steering” to apply.
For a hidden state h ∈ R^d at a steering layer, we first separate its norm and direction,

h = r ĥ, r = ‖h‖_2, ĥ ∈ S^{d−1}, (4.12)
and interpret ĥ as a point mass on the unit sphere S^{d−1}. We consider a target distribution µ supported on a small set of directions {ẑ_k} that represent sanitized or refusal-like behavior (constructed from our positive spans; see Appendix D for details). On S^{d−1} with the canonical metric, we use the squared geodesic distance as the OT cost:

c(ĥ, ẑ) = d_S(ĥ, ẑ)² = arccos²⟨ĥ, ẑ⟩.
Solving the OT problem between the source δ_ĥ and target µ yields a spherical barycentric target ẑ⋆ ∈ S^{d−1} that minimizes the expected squared geodesic distance to the support of µ. We regard ẑ⋆ as the transport target we would like to move toward, and the geodesic distance

θ_tar = d_S(ĥ, ẑ⋆) = arccos⟨ĥ, ẑ⋆⟩ (4.14)
as an intrinsic OT cost that quantifies how much steering is needed: if ĥ is far from ẑ⋆ , θ tar is large and we should steer strongly; if they are close, θ tar is small and only a mild correction is appropriate.
We then disentangle direction from strength. Let v be a nonzero direction derived from our unlearning subspace (in practice, the RRS-projected principal unlearning direction from Section 4.2), normalized so that v ∈ S^{d−1}. Geometrically, following v rotates ĥ within the great-circle plane spanned by ĥ and v. The maximal rotation angle available along this direction is

θ_dir = arccos⟨ĥ, v⟩.
We choose the steering weight λ so that the rotation angle λ θ_dir matches the OT-prescribed distance θ_tar as closely as possible, but never exceeds θ_dir:

λ = min{1, θ_tar / θ_dir}.
Intuitively, λ is an adaptive calibration weight obtained by normalizing this cost by the available directional angle. When the current state is far from the sanitized manifold (θ tar large), λ approaches 1 and we take a nearly full step along v; when ĥ is already close to ẑ⋆ , θ tar is small and λ shrinks proportionally, resulting in a weaker adjustment. No additional hyperparameter is introduced.
Finally, we apply the steering update on the unit sphere and restore the original norm. Using norm-preserving spherical interpolation (slerp), we set

h̃ = r · slerp(ĥ, v; λ), (4.17)

which rotates ĥ by angle λ θ_dir toward v on S^{d−1} while keeping the radius r unchanged. An equivalent add-then-renormalize form is

h̃ = r · (ĥ + c v) / ‖ĥ + c v‖_2, for a scalar c ≥ 0 determined by λ and θ_dir,
which makes explicit that Adaptive Calibration Steering can be seen as adding a scaled direction and projecting back to the sphere, with the direction fixed by v and the strength fully determined by the OT-derived cost through λ.
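A sketch of ACS combining the OT target selection, the adaptive weight λ, and the norm-preserving slerp update. `targets` holds unit-norm sanitized/refusal prototype directions and `v` is the RRS-protected, normalized unlearning direction; the names are ours.

```python
import torch

def slerp(a, b, lam):
    """Spherical interpolation between unit vectors a and b by fraction lam."""
    theta = torch.arccos(torch.clamp(a @ b, -1.0, 1.0))
    if theta < 1e-6:                      # nearly parallel: no rotation needed
        return a
    return (torch.sin((1 - lam) * theta) * a + torch.sin(lam * theta) * b) / torch.sin(theta)

def acs_step(h, v, targets):
    """h: (H,) hidden state; v: (H,) unit steering direction; targets: (K, H) unit rows."""
    r = torch.norm(h)
    h_hat = h / r
    # OT target: the prototype at minimal geodesic distance to h_hat.
    theta_tar = torch.arccos(torch.clamp(targets @ h_hat, -1.0, 1.0)).min()
    theta_dir = torch.arccos(torch.clamp(h_hat @ v, -1.0, 1.0))
    lam = torch.clamp(theta_tar / (theta_dir + 1e-8), max=1.0)   # no overshoot
    return r * slerp(h_hat, v, lam)                              # norm-preserving rotation
```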
We now analyze R-MUSE from a loss-based perspective. Let R and F denote the retain and forget sets, respectively. We write the retain loss and the forget-refusal loss as

L_R(f) = E_{(x,y)∈R} [ ℓ(f(x), y) ],  L_F^ref(f) = E_{(x,y^ref)∈F} [ ℓ(f(x), y^ref) ],

where y^ref denotes the target refusal response for a forget example.
Golden model. In this view, a “golden” model is a solution of the joint objective

min_f L_R(f) + L_F^ref(f),
starting from the same pretrained initialization as the current model. The two terms are tightly coupled in parameter space: changing f to reduce L_F^ref almost inevitably affects L_R. A steering-based unlearning method therefore cannot “move only L_F^ref” while keeping L_R fixed; instead, the goal is to find update directions that substantially reduce L_F^ref while inducing only mild changes in L_R.
First-order view of R-MUSE. We make a locally linear readout assumption at a steering layer ℓ, i.e., f(x) ≈ W_ℓ h_ℓ(x), and linearize the loss around the current hidden state. For a single example (x, y),

Δℓ(x, y) ≈ ∇_{h_ℓ} ℓ(f(x), y)^⊤ Δh_ℓ(x), (4.21)
where ∆h ℓ (x) is the change in the layer-ℓ activation due to steering.
Plugging the update operator in Eq. (4.10) into this linearization gives the perturbation
where g(q) is the gate from Eq. (4.8) and v_ℓ^un is the principal unlearning direction. The factor (I − P_ℓ^rrs) ensures that we steer only in directions orthogonal to the RRS subspace, so that directions captured by the RRS (i.e., directions that support general reasoning on R) are explicitly protected.
Denote the hidden-state loss gradients by

g_F(x) = ∇_{h_ℓ} ℓ(f(x), y^ref) for (x, y^ref) ∈ F,  g_R(x) = ∇_{h_ℓ} ℓ(f(x), y) for (x, y) ∈ R.
For brevity, we also define

s(x) = ‖ (I − P_ℓ^rrs) v_ℓ^un (v_ℓ^un)^⊤ h_ℓ(x) ‖_2,
which measures the magnitude of the component of h ℓ (x) that lies in the RRS-orthogonal unlearning directions.
Intuitively, R-MUSE steers only on queries for which the gate g(q) activates, and in directions orthogonal to the RRS subspace. As a result, it primarily reduces the forget-refusal loss L ref F while only weakly perturbing the retain loss L R . Under mild alignment assumptions between the learned subspaces and the loss gradients, this intuition can be formalized by a first-order analysis of the loss; we show in Theorem D.1 (Appendix D) that R-MUSE produces a strictly negative first-order change in L ref F on forget-related queries, while the induced first-order change in L R remains bounded.
In this section, we describe our experimental setup (Section 5.1) and report the main unlearning results across tasks, backbones, and forget ratios (Section 5.2). We then perform ablation studies of the key components in R-MUSE (Section 5.3). Due to space constraints, further experiments, including hyperparameter sensitivity, activation-distribution visualizations, and forgetting-utility trade-off curves, are deferred to Appendix E. We evaluate unlearning performance on three tasks: classification accuracy, open-ended generation measured by ROUGE-L (Lin, 2004), and cloze-style accuracy. In addition, we report the two metrics introduced in Section 3.2; more experimental details and computational-resource statistics are provided in Appendix A and the Compute Report.
We evaluate R-MUSE against several training-based unlearning baselines, including both classic methods and recent state-of-the-art techniques for MLLMs (MMUnlearner, MANU) and LRMs (R²MU). The comprehensive results are presented in Table 1.
Superior efficacy on standard unlearning tasks. First, we assess performance on the standard metrics and find that our method consistently achieves the strongest unlearning effect on the Forget Set (Fgt) across all three tasks and both backbones. Compared with vanilla models and training-based baselines, R-MUSE substantially suppresses accuracy and ROUGE-L on the forget-related splits, while competing methods often leave these metrics relatively high. Crucially, this improvement in forgetting does not come at the expense of utility: on the Ret and Cele splits, our method maintains performance that is essentially on par with, or slightly better than, the original models, indicating that the remaining capabilities and benign knowledge are well preserved.

Table 2: Ablation study on RMLLMU-Bench (5% Forget) using Qwen-2.5-VL-7B-Instruct. ↓ lower is better, ↑ higher is better.
Effective unlearning of reasoning-level information. A further central challenge of this work is Reasoning Leakage, which is not captured by standard metrics. We report our proposed RIL metric in the final four columns of Table 1. The results are striking: classic unlearning methods all suffer from severe reasoning leakage. Even the state-of-the-art MLLM unlearning method, lacking any design for the inference-time reasoning process, fails to effectively reduce leakage rates, and the LRM-based method also struggles to reduce leakage in multimodal contexts. In sharp contrast, our method achieves the lowest RIL score across all settings, showing that explicitly targeting both the final answer and the intermediate reasoning process effectively removes sensitive information from the model's reasoning.
We next study how different unlearning strategies affect general reasoning ability using our Reasoning Capability Retention (RCR) metric (higher is better), summarized in Figure 2.
Parameter-based unlearning methods harm reasoning. Gradient-based methods and even more refined MLLM unlearning approaches all reduce RCR on every split, despite not explicitly targeting reasoning, showing that parameter updates and neuron inhibition inevitably disturb the computation paths that support reasoning.
Noise perturbation severely degrades reasoning. R²MU is designed for LRMs and unlearns by directly perturbing activations along the reasoning trajectory. While this effectively erases sensitive information from the original reasoning process, it also severely damages reasoning competence, and we often observe degenerate behaviors such as meaningless repetitions of “wait” or “thinking”. We further provide qualitative case studies in Appendix G that illustrate these failure patterns and the benefit of RRS-based protection.
RRS is essential for reasoning-preserving steering. The gap between the model without RRS and the full R-MUSE shows that even a soft-control intervention such as activation steering is not sufficient: without removing its component in the reasoning-preserving subspace (RRS), the steering direction still interferes with useful inference and leads to a noticeable drop in RCR.
We conduct an ablation study on Qwen-2.5-VL-7B-Instruct (5% Forget) to validate each key component in R-MUSE:
• w/o RRS. Remove the Reasoning Retain Subspace (RRS) safeguard and its gate; apply the steering update without projecting out the retained-reasoning component, yielding ungated, non-discriminative steering.
• w/o Reasoning Span. Construct the unlearning subspace using only the final-answer span (S_ans), ignoring the chain-of-thought span (S_cot).
• w/o Answer Span. Construct the unlearning subspace using only the chain-of-thought span (S_cot), ignoring the final-answer span (S_ans).
• w/o ACS. Replace Adaptive Calibration Steering with a naive, fixed-strength additive update using a global coefficient λ = 1.5, with no adaptive strength or layer selection.
Based on the results in Table 2, we observe:
(1) w/o RRS. Removing the Reasoning Retain Subspace (RRS) eliminates both the gating and the orthogonality protection, causing steering to be applied indiscriminately. Although this still yields some unlearning on the targeted content, it introduces substantial collateral damage to data that should not be unlearned. Most notably, classification accuracy on retained content drops sharply, e.g., from 54.1% to 34.0% on the Retain set and from 60.8% to 37.0% on the Celebrity set. (2) w/o Reasoning Span. When the unlearning subspace is constructed only from the final-answer span, the model changes its answers but still generates similar reasoning chains. Even though accuracy on the forget set decreases, RIL remains high, indicating that the model fails to unlearn information from the reasoning process.
(3) w/o Answer Span. Using only the chain-of-thought span weakens the supervision signal tied to final outcomes. While this variant can reduce exposure in multi-step reasoning, it fails to consistently erase the final answers, showing that the answer span provides complementary guidance indispensable for complete unlearning. (4) w/o ACS. Replacing Adaptive Calibration Steering with a uniform fixed coefficient results in either understeering or oversteering; the lack of fine-grained control ultimately leads to suboptimal performance.
Our method has only one tunable scalar hyperparameter, the gate threshold τ in the steering gate s gate (q), which decides whether the steering is applied to a query q. Intuitively, s gate (q) measures how similar the current hidden state is to the RRS: if s gate (q) ≥ τ , no steering is injected; otherwise, ACS is activated.
We sweep τ from 0.6 to 1.0 and report the resulting performance on all four splits under the 5% Forget setting (Fig. 3). Across a range τ ∈ [0.6, 0.9], all curves are relatively flat, showing that R-MUSE is largely insensitive to the exact value of τ and does not require careful tuning. When τ becomes extremely high (e.g., τ ≥ 0.95), the gate rejects most activation steering. As a result, the accuracies on the FGT and TEST splits increase sharply (unlearning failure), while RET/CELE metrics only gain marginally. In practice, choosing τ in the middle of the plateau region yields a stable trade-off between effective forgetting and preserved utility, and we set τ = 0.85 for experiments.
To empirically validate the impact of R-MUSE on the model’s latent representations, we visualize the Principal Component Analysis (PCA) of the hidden states at the intervention layer. Figure 4 illustrates the activation distributions for both the Retain Set (blue) and Forget Set (red) before and after steering across two benchmarks. Crucially, the two distributions share a common geometric root (overlap), indicating that the model retains the contextual understanding of the query, while the reasoning trajectory is forcibly redirected towards the “refusal” subspace (the orthogonal direction). This confirms our claim that R-MUSE operates by modifying the direction of the reasoning vector rather than destroying the input representation.
Retain Set: Structural Preservation with Minor Deviations. For the Retain Set (Left panels), the Steered distribution (Dark Blue) largely aligns with the Vanilla distribution (Light Blue), validating the efficacy of our Reasoning Retain Subspace (RRS) protection. However, consistent with the “minimal intervention” constraint, we observe a slight distributional dragging or broadening in the Steered representations. This minor deviation is an expected consequence of applying a global steering vector: while the RRS projection mathematically minimizes interference, the high-dimensional entanglement of concepts inevitably leads to slight perturbations in non-target queries. Nevertheless, the core topological structure of the Retain manifold remains intact, explaining why the model maintains high Reasoning Capability Retention (RCR) despite these subtle geometric shifts. The Forget Set exhibits a significant directional shift and elongation, indicating that the sensitive reasoning paths are effectively re-oriented towards the refusal subspace.
Achieving effective machine unlearning requires navigating the delicate Pareto frontier between erasing sensitive information (Forgetting) and preserving downstream performance (Utility). A distinct challenge in current research is that aggressive unlearning often precipitates a catastrophic collapse in general capabilities. To rigorously evaluate this, we plot the trade-off curves in Figure 5, where the x-axis represents Forget Set Accuracy (Lower is Better) and the y-axis represents Retain Set Accuracy (Higher is Better).
The “Top-Left” Dominance. The ideal unlearning method should reside in the top-left corner of the plot-indicating maximal forgetting with minimal utility loss. As illustrated in Figure 5, R-MUSE (marked by the red star) is the sole method that successfully occupies this “gold standard” region.
• On LLaVA-1.5-7B (Fig. 5a), R-MUSE achieves a Forget Accuracy of ∼20.5%, a drastic reduction from the Vanilla model’s ∼51.7%, while maintaining a Retain Accuracy of ∼45.9%, which is virtually indistinguishable from the Vanilla baseline.
• On Qwen-2.5-VL-7B (Fig. 5b), the separation is even more pronounced. While all baseline methods cluster on the right side (Forget Accuracy > 50%), R-MUSE pushes the Forget Accuracy down to ∼32.5% without any degradation in Retain Accuracy (∼54.0%).
Comparison with Baselines. In contrast, existing methods struggle to break the trade-off barrier:
• Optimization-based methods (e.g., GA, NPO, marked by grey/blue shapes) typically exhibit a steep vertical drop. For instance, GA on Qwen-2.5-VL suffers a significant utility penalty (dropping below 50% Retain Accuracy) yet fails to reduce Forget Accuracy significantly below 54%. This indicates a “catastrophic forgetting” of general reasoning skills.
• Recent SOTA methods (e.g., MMUnlearner, R²MU) generally cluster near the Vanilla model on the x-axis. While they preserve utility well, they are overly conservative in unlearning, failing to effectively erase the targeted multimodal knowledge (Forget Accuracy remains high at > 45%).
The empirical results demonstrate that R-MUSE does not merely trade one metric for another; instead, it fundamentally shifts the Pareto frontier. By orthogonally projecting the steering vector against the Reasoning Retain Subspace (RRS), our method effectively “decouples” the forgetting objective from general reasoning, allowing for deep unlearning without the collateral damage observed in prior works.
This paper investigates unlearning in RMLLMs, revealing that answer-only unlearning leaks sensitive information through reasoning chains, while naive interventions degrade general reasoning. We therefore introduce RMLLMU-Bench for the evaluation of unlearning efficacy, reasoning leakage, and reasoning preservation. Our training-free method, R-MUSE, effectively forgets answers and reasoning traces while preserving core reasoning, outperforming existing approaches.

We implemented our framework and all baselines using PyTorch and the HuggingFace Transformers library. All experiments were conducted on NVIDIA V100 (32GB) GPUs.
Model Preparation. Following standard machine unlearning protocols, we strictly evaluated the forgetting capability by first performing Supervised Fine-Tuning (SFT) on the backbone models (LLaVA-1.5-7B and Qwen-2.5-VL-7B-Instruct) using the full dataset (comprising both retain and forget subsets). This ensures that the models initially possess high familiarity with the target knowledge. These fine-tuned checkpoints served as the starting point (the “Vanilla” models) for all subsequent unlearning interventions.
Setup and definitions. Fix a steering layer ℓ and let P_ℓ^rrs be the orthogonal projector onto the Reasoning Retain Subspace (RRS) from Eq. (4.4). Let v_ℓ^un = U_ℓ[:, 1] denote the principal unlearning direction (Eq. 4.9). Assume a nondegenerate RRS-orthogonal component (I − P_ℓ^rrs) v_ℓ^un ≠ 0, and let v denote its unit-norm version. For any nonzero hidden state h ∈ R^H, write h = r ĥ with r = ‖h‖_2 and ĥ ∈ S^{H−1}. Define the (normalized) RRS similarity

s_rrs(h) = ‖ P_ℓ^rrs ĥ ‖_2.
When the gate in Eq. (4.7) is open (s_gate < τ), Adaptive Calibration Steering (ACS) performs

h̃ = r · slerp(ĥ, v; λ),  λ = min{1, θ_tar / θ_dir} ∈ [0, 1],
where θ_dir = arccos⟨ĥ, v⟩ ∈ [0, π/2] (by the layer selection rule in §4.3), θ_tar = arccos⟨ĥ, ẑ⋆⟩ with ẑ⋆ the spherical OT target (Eq. 4.16), and, for unit vectors a, b at angle θ ∈ (0, π),

slerp(a, b; λ) = ( sin((1 − λ)θ) · a + sin(λθ) · b ) / sin(θ).
For θ ∈ (0, π) and λ ∈ [0, 1], define the contraction factor α(λ, θ) = sin((1 − λ)θ) / sin(θ). Because v is RRS-orthogonal and slerp preserves unit norm, the update satisfies s_rrs(h̃) = α(λ, θ_dir) · s_rrs(h) ≤ s_rrs(h), with equality iff λ = 0 or θ_dir = 0 or s_rrs(h) = 0. Hence ACS never increases the normalized RRS component and strictly decreases it whenever λ > 0, θ_dir ∈ (0, π), and s_rrs(h) > 0.
Because the update direction is orthogonal to the RRS and ACS is a radius-preserving spherical interpolation, the RRS projection of the state is contracted by the factor α(λ, θ_dir) ≤ 1.
Theorem C.2 (No-Overshoot and Monotone Alignment under Geodesic Steering). Let λ = min{1, θ_tar/θ_dir} and h̃ = r · slerp(ĥ, v; λ) as above. Then the rotation angle λθ_dir = min{θ_dir, θ_tar} never exceeds θ_dir (no overshoot), and the post-update angle to v is θ'_dir = θ_dir − λθ_dir = max{0, θ_dir − θ_tar}, so cos(h̃/‖h̃‖_2, v) ≥ cos(ĥ, v), with strict increase if θ_tar > 0 and θ_dir > 0. Moreover, if ẑ⋆ ∈ span{ĥ, v} ∩ S^{H−1} and θ_tar ≤ θ_dir, then h̃/‖h̃‖_2 = ẑ⋆ (an exact hit on the great circle).
Choosing the step by the target-direction angle ratio guarantees hyperparameter-free control without overshoot, strictly improves alignment to the RRS-orthogonal unlearning direction whenever the move is nontrivial, and exactly reaches the target when the target lies in the same great-circle plane.
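A quick numeric check of the no-overshoot behavior, reusing the slerp sketch from Section 4.3 (illustrative only; the vectors and angles are made up):

```python
import torch

def slerp(a, b, lam):
    theta = torch.arccos(torch.clamp(a @ b, -1.0, 1.0))
    return (torch.sin((1 - lam) * theta) * a + torch.sin(lam * theta) * b) / torch.sin(theta)

# Unit vectors at 60 degrees; target angle 20 degrees -> lam = 20/60.
h_hat = torch.tensor([1.0, 0.0])
v = torch.tensor([0.5, 3**0.5 / 2])
theta_dir = torch.arccos(h_hat @ v)               # ~ pi/3
theta_tar = torch.tensor(torch.pi / 9)            # 20 degrees
lam = torch.clamp(theta_tar / theta_dir, max=1.0)
h_new = slerp(h_hat, v, lam)
print(torch.arccos(h_new @ v))                    # ~ pi/3 - pi/9 (40 degrees): no overshoot
```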
In this subsection, we provide the formal statement and detailed proof of the first-order steering effect of R-MUSE, validating the theoretical guarantee discussed in Section 4.4. Theorem D.1 (First-order steering effect of R-MUSE). Assume the locally linear readout and the linearization in the main text equations. We further assume that the span-hybrid unlearning subspace and the Reasoning Retain Subspace (RRS) are aligned with the dominant hidden-state gradients of the refusal loss L ref F on the forget set F and of the retain loss L R on the retain set R, respectively.
Then, whenever the steering gate g is active, there exist constants α_F > 0 and ε_R ≥ 0 such that:
where E_F^g and E_R^g denote expectations over the forget and retain sets, conditioned on the steering gate being active.
Proof. For notational simplicity, we omit the layer index ℓ when clear from context and write the effective steering vector as v(x) = (I − P^rrs) P^un h(x), so that the squared norm of the update is s(x)² = ‖v(x)‖_2².
Analysis of Forget-Refusal Loss. For a forget-set example (x, y ref ) ∈ F where the gate is active (g(q) = 1), the first-order Taylor expansion of the loss change is:
By the alignment assumption, the unlearning subspace captures the principal directions of the refusal gradient. Thus, there exists a projection coefficient ρ F > 0 such that:
Since the adaptive scalar γ(h) is non-negative, we obtain:
for some α_F > 0. Taking the expectation over F yields Eq. (D.1), proving that R-MUSE consistently reduces the refusal loss.
Analysis of Retain Loss. For a retain example (x, y) ∈ R where the gate is active, we similarly have:
Critically, our method projects the update onto the orthogonal complement of the RRS. By assumption, the gradients of the retain loss lie predominantly within the RRS. Therefore, the steering vector v(x), being RRS-orthogonal, is nearly orthogonal to the retain gradient g R (x). Formally, the inner product is bounded by a small constant ε R ≥ 0:
Taking the absolute value and the expectation yields Eq. (D.2), proving that the interference with general reasoning capabilities is theoretically bounded.
In Section 4.3, we characterize the steering process through the lens of Optimal Transport (OT), formalizing why minimizing the geodesic distance is the theoretically optimal intervention strategy for constructing the steering target.
Geometric Premise. We operate on the unit hypersphere S^{d−1}. This choice is substantiated by the effect of Layer Normalization in modern MLLMs, which concentrates semantic information in the directional component of the hidden states (Wang et al., 2017; Wang & Isola, 2020). Consequently, we define the normalized state ĥ = h/‖h‖_2 and adopt the geodesic distance as the ground metric for semantic dissimilarity:

d_S(u, v) = arccos⟨u, v⟩, u, v ∈ S^{d−1}.
Optimal Transport Objective. The objective of inference-time unlearning is to transition the model from a sensitive state to a sanitized distribution with minimal semantic distortion. We model the current hidden state as a source Dirac measure ν = δ ĥ and the target sanitized manifold as a discrete empirical distribution µ:
where {ẑ_k} represents the set of prototype refusal directions. We seek a transport plan π that moves the probability mass from ν to µ while minimizing the total expected transport cost. We define this cost as the squared geodesic distance c(u, v) = d_S(u, v)², which applies a stricter penalty to large semantic deviations to enforce local consistency. The optimization problem is formally expressed as:

min_{π ∈ Π(ν, µ)} ∫ c(u, v) dπ(u, v).
Analytical Solution and Algorithm Alignment. Since the source distribution ν is a point mass, the optimal transport plan degenerates to a deterministic map. The optimization simplifies to identifying the specific target prototype ẑ* within the support of µ that minimizes the geodesic distance to the current state ĥ. Formally, the optimal transport target is given by:

ẑ* = arg min_{ẑ_k ∈ supp(µ)} d_S(ĥ, ẑ_k)². (D.11)
The minimal cost associated with this transport plan represents the necessary semantic work required to shift the model focus from the sensitive fact to the sanitized state. This derivation theoretically justifies the steering intensity θ tar defined in Eq. (4.14) of the main text:
θ_tar = d_S(ĥ, ẑ*) = min_{ẑ_k} d_S(ĥ, ẑ_k) = arccos⟨ĥ, ẑ*⟩. (D.12)

Thus, θ_tar is not a heuristic parameter but is derived directly from the geometry of the representation space, ensuring that the intervention strength is calibrated to the semantic distance between the query and the safe manifold.
In this section, we provide dataset statistics and the prompt templates used in our pipeline.
We construct RMLLMU-Bench upon the foundation of MLLMU-Bench, ensuring that all statistical characteristics remain consistent with it and maintaining alignment in data distribution and task composition.

• Do not reveal or reference the gold answer source.
• If the evidence is insufficient, the answer must be: “Insufficient Information”.

You are an impartial reasoning evaluator. Your task is to determine whether the model’s reasoning chain demonstrates valid and evidence-grounded reasoning ability. Evaluation criteria:
• Logical Validity: The reasoning is coherent and free of internal contradictions.
• Evidence Grounding: Every claim must be supported by the provided profile or image evidence. The reasoning must not introduce external knowledge, assumptions, or hallucinated facts.
• Conclusion Support: The final answer must be logically derived from the reasoning chain.
Judgment must be strictly either:
• YES → reasoning is valid and evidence-supported.
Table 1: Unlearning performance on MLLMU-Bench (5% and 10% Forget Rate; 15% in Appendix E). Results are evaluated on the forget set (Fgt), test set (Test), retain set (Ret), and celebrity set (Cele). ↓ indicates lower is better, and ↑ indicates higher is better.