SCAN: Sparse Circuit Anchor Interpretable Neuron for Lifelong Knowledge Editing

Yuhuan Liu 1,2, Haitian Zhong 1,3, Xinyuan Xia 4, Qiang Liu 1, Shu Wu 1, Liang Wang 1

Abstract

Large Language Models (LLMs) often suffer from catastrophic forgetting and collapse during sequential knowledge editing. This vulnerability stems from the prevailing dense editing paradigm, which treats models as black boxes and relies on coarse-grained parameter interventions that inevitably disrupt preserved knowledge. To address this, we propose SCAN, a sparse editing framework based on Sparse Circuit Anchored Neurons, which transforms editing into a mechanism-aware manipulation by constructing a knowledge circuit via Sparse Transcoders. Experiments on Gemma2, Qwen3, and Llama3.1 across CounterFact, ZsRE, and WikiFactDiff demonstrate that SCAN achieves superior performance, maintaining model integrity on benchmarks such as MMLU and GSM8K even after 3,000 sequential edits, whereas existing methods deteriorate progressively as edits accumulate, eventually resulting in model collapse.

1. Introduction

Large Language Models (LLMs) acquire extensive knowledge during pre-training, which may become outdated. Given the high cost of re-training, model editing (Zhang et al., 2024a; Wang et al., 2024c;b) has emerged as an efficient approach for updating specific factual knowledge without altering unrelated knowledge or general competency. For instance, an LLM asserting "The current U.S. President is Joe Biden" would, in 2025, need correction to "The current U.S. President is Donald Trump."
Existing editing techniques fall into two categories: parameter-modifying methods, which directly alter the LLM's weights, and parameter-preserving methods, which introduce auxiliary components to steer the model's output. Moreover, to accommodate evolving knowledge, lifelong model editing has been proposed, which involves performing sequential edits (Gupta et al., 2024).

1 New Laboratory of Pattern Recognition (NLPR), State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences. 2 Cuiying Honors College, Lanzhou University. 3 Zhongguancun Academy. 4 Research Institute of Intelligent Complex Systems, Fudan University. Correspondence to: Shu Wu <shu.wu@nlpr.ia.ac.cn>. Preprint. March 17, 2026.

Despite these advances, current editing techniques face systemic limitations that are especially problematic in the lifelong setting. Chief among these is coarse editing granularity: methods often intervene over large blocks of parameters (Meng et al., 2022a;b), yet the parameters truly essential to a specific fact occupy only a tiny fraction of these regions (Jiang et al., 2025). This coarse intervention directly destroys unrelated knowledge (Jiang et al., 2025), owing to the vector-based storage mechanism in MLPs (Geva et al., 2021) and polysemanticity (Geva et al., 2022). In sequential editing, as these non-minimal interventions accumulate, they inevitably erode previously edited knowledge, triggering catastrophic forgetting and model collapse. Furthermore, the lack of transparency in knowledge pathways hinders reliable diagnosis and surgical refinement, making updates unreliable and uncontrollable (Hong & Lipani, 2024; Mazzia et al., 2024). To address these coupled challenges, we draw inspiration from the field of Mechanistic Interpretability (Bereska & Gavves, 2024; Rai et al., 2024), emphasizing sparsity as a promising way to resolve them.
We argue that sparsity enables finer-grained editing for two reasons. First, unlike dense paradigms that intervene in entire weight blocks, sparsity restricts updates to fact-related parameter subsets, protecting unrelated knowledge. Second, sparsity offers a path toward neuron monosemanticity (Cunningham et al., 2023; Paulo et al., 2025). Under this paradigm, each feature or neuron represents a single concept rather than multiple meanings (Bricken et al., 2023), ensuring that editing a target concept does not propagate changes to unrelated concepts, thereby mitigating the polysemanticity issues (Geva et al., 2022) inherent in editing dense LLMs.

Based on this principle, this paper proposes SCAN, a novel parameter-preserving editing framework. Our approach introduces a Sparse Transcoder (Dunefsky et al., 2024), which projects the LLM's hidden states onto sparse features. We then construct an Attribution Graph from influence scores between features and prune it to identify a knowledge circuit. This circuit serves as a roadmap for locating the essential, sparse feature nodes to edit, rather than the whole set of transcoder features. By applying targeted steering to these sparse features, we implement the edit and propagate the refined changes back into the LLM. Our main contributions are summarized as follows:

1. We introduce sparsity as a core principle to resolve catastrophic forgetting and model collapse in sequential editing by restricting updates to fact-specific parameter subsets. This paradigm also fosters neuron monosemanticity, ensuring that each feature represents a discrete concept and thereby mitigating the polysemantic interference inherent in dense models.

2. We propose SCAN, a novel white-box sparse editing framework driven by Sparse Transcoders and Attribution Graphs.
By identifying specific knowledge circuits, SCAN provides a robust and interpretable solution for the model editing field.

3. We conduct a series of experiments investigating sequential scalability, precise knowledge localization, and the semantic profiling of functional features across multiple LLM families. Our evaluations demonstrate that these sparse interventions successfully maintain model integrity and general capabilities even under long-term editing stress.

2. Preliminary

2.1. Lifelong Editing and Steer-based Methods

Model editing updates a model f_W using a triple e = (s, r, o → o*), where s denotes the subject entity, r the relation, o the original factual object, and o* the desired new target object. In the lifelong setting, a Model Editor (ME) recursively produces f_{W_t} = ME(f_{W_{t-1}}, x_t, y_t) to incorporate n sequential updates without forgetting. Steer-based methods achieve this by modifying hidden states h instead of weights (Zhong et al., 2025). They typically use a Decision Mechanism (e.g., Euclidean distance to stored keys K_i) to trigger a Perturbation Mechanism: h' = h + Δh if an edit is required, and h' = h otherwise (Hartvigsen et al., 2023; Yu et al., 2024).

2.2. Editing Mechanism: MLPs as Key-Value Memory

The Transformer MLP can be conceptualized as a key-value memory (Geva et al., 2021), where the encoder weights W_enc act as patterns (keys) and the decoder weights W_dec act as stored knowledge (values). The input h_pre is compared against W_enc (the keys) to produce an activation a = σ(W_enc · h_pre), where each a_j represents the intensity with which the input matches the j-th knowledge slot. The final output v is a weighted retrieval of the stored values, v = W_dec · a = Σ_j a_j v_j, which modifies the residual stream: h_post = h_pre + v (Elhage et al., 2021).
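This key-value view can be checked numerically. A minimal NumPy sketch with toy dimensions (the sizes, weights, and σ = ReLU are illustrative assumptions, not the paper's models):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp = 16, 64                     # toy sizes, not real model dims

W_enc = rng.normal(size=(d_mlp, d_model))   # keys: one pattern per row
W_dec = rng.normal(size=(d_model, d_mlp))   # values: one vector per column
h_pre = rng.normal(size=d_model)            # residual stream before the MLP

relu = lambda x: np.maximum(x, 0.0)

a = relu(W_enc @ h_pre)    # a_j: how strongly the input matches knowledge slot j
v = W_dec @ a              # weighted retrieval of the stored value vectors
h_post = h_pre + v         # MLP output is added back to the residual stream

# v equals the slot-wise sum a_j * v_j from the key-value reading
v_manual = sum(a[j] * W_dec[:, j] for j in range(d_mlp))
assert np.allclose(v, v_manual)
```

The assertion confirms that the matrix form W_dec · a and the slot-wise sum Σ_j a_j v_j are the same computation.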
Traditional parameter-modifying methods (e.g., MEMIT, AlphaEdit) edit by updating the decoder weights, W'_dec = W_dec + ΔW_dec, to encode new facts. Mathematically, this transformation is equivalent to injecting an additive perturbation Δh = ΔW_dec · a into the residual stream, providing a unified view of weight-based and steer-based editing.

2.3. Sparse Transcoders and Monosemanticity

The Sparse Transcoder is an auxiliary module that reformulates the MLP's memory into a highly sparse feature activation z (Paulo et al., 2025), much sparser than the MLP activation a. It is trained to reconstruct the MLP output while enforcing sparsity via an ℓ1 penalty:

L_Transcoder = MSE(MLP(h_pre), W^tc_dec · z) + λ ‖z‖_1,   where z = ReLU(W^tc_enc · h_pre).

Unlike standard MLPs, which suffer from polysemantic neurons, the Transcoder promotes monosemanticity (Bricken et al., 2023).

3. Editing with Sparse Circuit Anchored Neurons

This section details SCAN. We define a single edit as e = (s, r, o → o*). The method first constructs the knowledge circuit responsible for the original output o, based on the activated features in the Sparse Transcoder and the attribution scores between feature nodes. The essential features are then extracted to locate the knowledge, and we edit the corresponding decoder weight vectors to enforce the target output o*. Finally, the difference between the transcoder's original output and its edited output (Δv_tc) is injected into the LLM as a "steering vector" at the exact circuit location.

3.1. Attribution Graph Construction

The initial phase constructs an Attribution Graph for the old answer o defined by the editing instance e. We execute a forward pass of the prompt through the LLM. The pre-MLP hidden states h_pre from each layer, together with the token positions, are fed into the corresponding transcoders to record the features with positive activation.
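This recording step can be sketched with toy transcoders of the form z = ReLU(W^tc_enc · h_pre). Everything below (sizes, random weights, variable names) is an illustrative assumption; real runs use the pretrained transcoder checkpoints, whose weights are trained to make z actually sparse:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_feat, n_layers, n_tokens = 16, 128, 4, 5   # toy sizes

# one encoder per layer, standing in for a pretrained per-layer transcoder
encoders = [rng.normal(size=(d_feat, d_model)) for _ in range(n_layers)]

# pre-MLP hidden states from a forward pass: [layer, token, d_model]
h_pre = rng.normal(size=(n_layers, n_tokens, d_model))

active = {}  # (layer, token) -> list of (feature index, activation z_i > 0)
for layer, W_enc_tc in enumerate(encoders):
    for tok in range(n_tokens):
        z = np.maximum(W_enc_tc @ h_pre[layer, tok], 0.0)  # ReLU activation
        idx = np.nonzero(z)[0]                             # positively activated features
        active[(layer, tok)] = list(zip(idx.tolist(), z[idx].tolist()))
```

The `active` map (positively activated features per layer and token position) is exactly the raw material the next step turns into graph nodes.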
A weighted complete graph is then constructed, formally defined as follows:

Definition 3.1 (Initialization of the Attribution Graph). For an editing instance e, the corresponding Attribution Graph is initialized as a weighted complete graph G = (V, E, AS), where V is the set of nodes representing all components potentially causal to the original output o, partitioned as V = V_embed ∪ V_feature ∪ V_error ∪ V_logit. Here V_embed denotes the token embedding nodes; V_feature comprises the features with positive activation z_i; V_error contains nodes corresponding to the MLP reconstruction errors; and V_logit denotes the logit node associated with the original object token o. E = {(u, v) | u, v ∈ V, u ≠ v} connects every pair of nodes, and AS: E → R assigns an attribution score representing the causal influence between nodes. In the initialized graph, all attribution scores are set to 0.

Figure 1. Comparison of current methods and ours. Current methods (a) modify the entire dense MLP weight matrix. Our approach (b) isolates factual features, editing knowledge-relevant vectors.
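Definition 3.1 can be sketched as a small data structure. Node layouts, field names, and the helper `init_attribution_graph` are all illustrative assumptions, not the paper's code; the only properties taken from the definition are the four node partitions, the complete edge set, and the zero-initialized scores:

```python
from itertools import permutations

def init_attribution_graph(active_features, n_layers, prompt_tokens, target_token):
    """Build the initial complete Attribution Graph with all scores set to 0.

    active_features: iterable of (layer, token_pos, feature_idx) with z_i > 0.
    """
    nodes = [("embed", t) for t in range(len(prompt_tokens))]
    nodes += [("feature", l, t, i) for (l, t, i) in active_features]
    nodes += [("error", l, t) for l in range(n_layers)
                              for t in range(len(prompt_tokens))]
    nodes += [("logit", target_token)]
    # complete graph: one directed edge per ordered pair, score initialized to 0
    scores = {(u, v): 0.0 for u, v in permutations(nodes, 2)}
    return nodes, scores

nodes, scores = init_attribution_graph(
    active_features=[(0, 2, 7), (1, 3, 42)],   # toy recorded features
    n_layers=2,
    prompt_tokens=["The", "US", "president", "is"],
    target_token="Biden",
)
```

Here 4 embedding, 2 feature, 8 error, and 1 logit node give 15 nodes and 15 × 14 zero-scored directed edges, to be filled in by the influence computation of Section 3.2.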
A visualization of this stage is given in Appendix C.

3.2. Gradient-Based Node Influence Computation

With the complete Attribution Graph G = K(V) established, we quantify the causal influence between nodes. Suppose u, v ∈ V are two nodes, where u is located at layer j and token position m, and v at layer i and token position n (i > j, n ≥ m). Let z_u and z_v denote the corresponding activations (for error and embedding nodes, the activation is set to 1). The attribution score AS_{u→v} is the change in a metric M at node v when the path through u is corrupted while all other paths are kept intact. Specifically, the influence of u on v is given by

AS_{u→v} = M_v(V) − M_v(V \ {u}),

where M_v(V) is the value of the metric M at node v during the original forward pass (e.g., the absolute activation, or the activation relative to its mean; in our experiments we use the absolute activation ReLU((f^v_enc)^T · h^i_pre)), and M_v(V \ {u}) is the value of M at node v when the path through node u is corrupted.

Direct computation of the attribution score incurs an immense computational cost. To address this, we approximate it by the first-order term of the Taylor expansion, where removing node u is equivalent to setting its activation z_u to zero:

AS_{u→v} ≈ (∂M_v / ∂z_u) · z_u.

To compute the gradient ∂M_v/∂z_u efficiently, we apply the chain rule across the Transformer layers, requiring backpropagation only to the post-MLP residual stream h^j_post = h^j_pre + h^j_error + Σ_k z_k f^k_dec. This yields a simple expression for the gradient:

∂M_v/∂z_u = (∂M_v/∂h^j_post) (∂h^j_post/∂z_u) = (∂M_v/∂h^j_post) f^u_dec.

We can further give this process a mathematical interpretation: the causal influence computation can be seen as evaluating the similarity between vectors via a dot product.
By further expanding (∂M_v/∂h^j_post) f^u_dec, we get

(∂M_v/∂h^j_post) f^u_dec = (∂M_v/∂h^i_pre) (∂h^i_pre/∂h^j_post) f^u_dec = (f^v_enc)^T (∂h^i_pre/∂h^j_post) f^u_dec.

This decomposed form is interpreted as the similarity between feature u and feature v in the pre-MLP space of layer i. The vector f^u_dec, which encodes the value information stored by feature u at layer j, is first transformed into the pre-MLP space of layer i via the Jacobian matrix ∂h^i_pre/∂h^j_post. This Jacobian maps the decoder vector f^u_dec from its representation in the layer-j post-MLP space into the corresponding representation in the layer-i pre-MLP space. We formalize this relationship in the following proposition, which ensures that such a transformation across different spaces is well defined (the proof is provided in Appendix B):

Proposition 3.2 (Jacobian as the Optimal Direction-Preserving Linearization). Let X, Y ⊂ R^n be two spaces and let f: X → Y be a mapping such that f(0) = 0, f is differentiable at every point x_0 ∈ X with Jacobian matrix J_f(x_0), and J_f is continuous and non-singular at 0. Then

f(x_0)/‖f(x_0)‖ − J_f(x_0) x_0 / ‖J_f(x_0) x_0‖ → 0   as x_0 → 0.

This implies that the direction of any vector transformed by f is closely aligned with the direction induced by the Jacobian transformation.

The transformed vector is then compared, via a dot product, with the key vector (f^v_enc)^T. Thus, the attribution score measures the similarity between the transformed value vector of feature u and the key vector required by feature v, all within the layer-i pre-MLP space.
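On a linear toy model the first-order approximation is exact, which makes the score easy to sanity-check. All sizes and weights below are illustrative stand-ins (the ReLU in the paper's metric is omitted here so that ablation and Taylor term match exactly):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
f_dec_u = rng.normal(size=d)        # decoder vector of upstream feature u (layer j)
f_enc_v = rng.normal(size=d)        # encoder vector of downstream feature v (layer i)
J = rng.normal(size=(d, d)) * 0.1   # stand-in Jacobian  d h^i_pre / d h^j_post
h_rest = rng.normal(size=d)         # residual-stream contributions other than z_u * f_dec_u
z_u = 1.7                           # activation of feature u

def metric_v(z):
    """M_v: downstream feature pre-activation (f_enc_v . h^i_pre).

    The paper applies a ReLU here; it is omitted so the first-order check is exact.
    """
    h_post_j = h_rest + z * f_dec_u
    h_pre_i = J @ h_post_j          # linear map between the two spaces
    return float(f_enc_v @ h_pre_i)

# attribution by ablation: corrupt the path through u (set z_u to 0)
as_ablate = metric_v(z_u) - metric_v(0.0)

# first-order Taylor term: (dM_v/dz_u) * z_u = ((f_enc_v)^T J f_dec_u) * z_u
as_taylor = float(f_enc_v @ J @ f_dec_u) * z_u

assert np.isclose(as_ablate, as_taylor)
```

The dot-product form `f_enc_v @ J @ f_dec_u` is exactly the key-versus-transformed-value similarity described above.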
3.3. Pruning by Total (One-Step and Multi-Step) Attribution

To derive the sparse causal subgraph G' from the dense Attribution Graph G, we employ a pruning strategy based on the attribution scores AS_{u→v}, which measure the direct (one-step) causal effect of node u on node v.

Definition 3.3 (Direct (One-Step) Attribution Matrix). Let A ∈ R^{|V|×|V|} denote the adjacency matrix of G, whose entry in row v and column u is a_{v,u} = AS_{u→v}, the one-step attribution from node u to node v. Write A in column-block form as (a_1, a_2, ..., a_{|V|}) or, equivalently, in row-block form as (b_1^T; b_2^T; ...; b_{|V|}^T).

Two-Step Attribution. The construction is analogous to a full-derivative expansion:

Theorem 3.4 (Full-Derivative Expansion). Consider a function y = y(x_1, x_2, ..., x_n), where each x_i is itself a function of z, i.e., x_i = x_i(z). Then the derivative of y with respect to z can be expressed using partial derivatives as

∂y/∂z = Σ_{i=1}^{n} (∂y/∂x_i) (∂x_i/∂z),

which states that the change in y with respect to z decomposes into contributions from the changes in each intermediate variable x_i. With this in mind, we begin with the simplest case of indirect influence: two-step attribution, the effect of a node u on a node v mediated by a single intermediate node. Summing the influence over all paths of length two (i.e., over all possible intermediate nodes w_i) gives Σ_i AS_{u→w_i} · AS_{w_i→v}, which in vector-matrix notation corresponds to

(A^2)_{v,u} = Σ_i a_{v,w_i} a_{w_i,u} = b_v^T · a_u.

Thus the adjacency matrix representing all two-step attributions in the graph is precisely A^2.

Three-Step and Total Attribution. Following the same principle, three-step attribution is computed recursively; it quantifies the influence transmitted through two intermediate nodes.
We can conceptualize this as the sum, over intermediate nodes ŵ_i, of the two-step attribution from the source node u to ŵ_i and the one-step attribution from ŵ_i to v, i.e., Σ_i (Σ_k AS_{u→w_k} · AS_{w_k→ŵ_i}) · AS_{ŵ_i→v}. As in the two-step case, the three-step adjacency matrix is A^3.

The total adjacency matrix B accumulates the attribution scores from all direct and indirect paths. It is defined as the sum of the adjacency matrices over all path lengths from one to infinity, giving the matrix series B = A + A^2 + A^3 + .... The following proposition provides a way to compute it (the proof is given in Appendix B; implementation details, including convergence and feasibility, are provided in Appendix C):

Proposition 3.5 (Closed-Form Total Attribution Matrix). Let A be the adjacency matrix of one-step attribution scores with ‖A‖ < 1. Then the total attribution matrix B, which accumulates contributions from all paths of any length, admits the closed-form solution

B = (I − A)^{-1} − I.

For any target node v, we first sort all nodes u that have a path to v in descending order of B_{v,u} and normalize these scores. Edge pruning is then performed with a cumulative threshold τ: starting from the highest-ranked node, scores are accumulated sequentially, and once the cumulative sum reaches τ, all remaining edges are discarded as insufficiently influential. Nodes that no longer attribute to any other node after edge pruning are removed. After pruning all edges and nodes, we obtain G'.

3.4. Sparse Edit and Knowledge Injection

This final step uses the features identified in G' that are active on the subject's last token for direct model editing and testing. During the editing phase, the transcoder decoder vectors f^u_dec corresponding to these features u ∈ G' are
The dif ference between before and after the edit ( P u z u · ( ˆ f u dec − f u dec ) ) is calculated and injected into the LLMs. The optimization objectiv e is set to minimize the negativ e log-probability of the target output, ˆ W tc dec = ar gmin W tc dec ( − log P ( o ∗ | x )) . The edited decoders are then sav ed during sequential edit. In the testing phase, these identified features serve as the steering triggers. W e use the Jaccard Similarity Score J ( V A , V B ) , where V A denotes the set of edited features associated with the target kno wledge recorded by the transcoder during the editing phase, and V B denotes the set of selected features extracted from the attribution graph of the test prompt. If the similarity exceeds a predefined threshold, the edited features relev ant to the corresponding knowledge are used for injection. The score, calculated as J ( V A , V B ) = | V A ∩ V B | | V A ∪ V B | quantifies the ov erlap between the selected feature sets. 4. Experiments In this section, we conduct extensiv e experiments to e v aluate the performance of our proposed method and address the following research questions: • RQ1: How does our method perform in sequential editing scenarios? • RQ2: Where does the model encode factual knowledge within its parameters? • RQ3: What is the semantic meaning of the identified features for editing? • RQ4: How does the model maintain its general capa- bilities after editing? 4.1. Experimental Setup W e summarize the LLMs, baseline methods, Transcoders, datasets, and ev aluation metrics used in our experiments. Further details are provided in Appendix A. LLMs & Baseline Methods. W e conducted experi- ments using three widely adopted LLMs: Gemma2-2B , Qwen3-8B , and Llama3.1-8B-Instruct . For com- parison, we ev aluated our method against several editing baselines, including Fine-T uning (FT) ( Zhu et al. , 2020 ), MEMIT ( Meng et al. , 2022b ), RECT ( Gu et al. , 2024 ), AlphaEdit ( Fang et al. 
, 2025), GRACE (Hartvigsen et al., 2023), and MELO (Yu et al., 2024).

Transcoders. We adopt publicly available pretrained transcoder checkpoints without additional training. Specifically, we use the Gemma2-2B and Qwen3-8B transcoders (Dunefsky et al., 2024; Hanna et al., 2025), as well as the Llama3.1-8B-Instruct transcoders (Zhao et al., 2025). These checkpoints are used as-is throughout all experiments.

Figure 2. Cumulative proportion of selected features across token positions. Panels (a) and (b) show the distributions for Gemma2-2B and Qwen3-8B on the CounterFact dataset, respectively.

Datasets. To evaluate knowledge-editing performance, we employ three standard benchmarks: CounterFact (Meng et al., 2022a), ZsRE (Levy et al., 2017), and WikiFactDiff (Ammar Khodja et al., 2024). Furthermore, to assess the general capabilities of the model post-edit, we test the edited models on six general datasets, including MMLU (Hendrycks et al., 2021).

Evaluation Metrics. In line with prior research, we assess performance using three key metrics: Rel (Reliability, also known as Edit Success Rate), Gen (Generalization Success Rate), and Loc (Locality Success Rate).

4.2. How does our method perform in sequential editing scenarios? (RQ1)

We evaluate our method across 1,000 sequential edits; the batch size for all batch-editing methods is set to 100. As demonstrated in Table 1, our method outperforms all baselines on almost all metrics. Notably, SCAN provides a balanced solution, sustaining near-perfect scores across all three dimensions compared with the other methods.

4.3. Where does the model encode factual knowledge within its parameters? (RQ2)

To investigate the internal localization of factual knowledge, we analyze the first 1,000 cases of the CounterFact dataset using Gemma2-2B and Qwen3-8B.
Unlike prior work that performs localization only at the layer level (Meng et al., 2022a; Zhang et al., 2024b), our approach enables finer-grained localization, identifying specific sub-components. We analyze the characteristics of key nodes along both the positional and layer-wise dimensions. An example Attribution Graph is shown in Appendix C.

Positional Localization. As illustrated in Figure 2, the attribution of factual knowledge is highly concentrated at the last token, which contributes approximately half of the total nodes. Combined with the last token of the subject, these two positions cumulatively account for over 75% of the entire Attribution Graph. In contrast, nodes at other positions, such as those within the prompt or earlier parts of the subject, contribute far less. This indicates that the subject's

Table 1. Sequential editing performance of our method and the baselines after 1,000 edits. The Avg column is the harmonic mean, Avg = 3 / (Rel^{-1} + Gen^{-1} + Loc^{-1}), to reflect balanced performance across metrics. Bold and underline denote the best and second-best results per column, respectively.
                | CounterFact             | ZsRE                    | WikiFactDiff
Method          | Rel   Gen   Loc   Avg   | Rel   Gen   Loc   Avg   | Rel   Gen   Loc   Avg
-- Gemma2-2B --
FT              | 44.28 11.10 11.77 16.52 | 66.14 57.20 64.67 62.43 | 73.07 69.53 41.24 56.97
RECT            |  7.23  1.63 20.90  3.86 | 15.32 11.58 18.38 14.56 | 35.78 31.50 31.64 32.97
AlphaEdit       | 55.38 18.35 32.63 28.91 | 73.88 60.22 47.71 59.11 | 76.92 67.26 55.73 65.88
MEMIT           |  5.00  2.80  4.40  3.77 | 10.91  9.16  7.61  9.07 |  5.09  4.68  6.30  5.32
GRACE           |  100   0.37 99.80  1.10 | 98.80 23.56  100  46.55 | 98.30 51.20 99.07 75.50
MELO            | 69.92 42.02 45.57 49.97 | 69.05 56.11 92.28 69.43 | 78.55 68.82 95.15 79.37
SCAN (Ours)     |  100  89.28 91.97 93.53 |  100  97.87  100  99.28 |  100  93.29 90.46 94.43
-- Qwen3-8B --
FT              |  6.15  3.10  5.05  4.39 | 20.70 20.21 18.46 19.79 | 26.11 25.12 26.60 25.94
RECT            | 42.60 27.75  2.80  7.23 | 30.44 28.84 11.16 18.39 | 11.97  9.99  0.81  2.18
AlphaEdit       | 91.90 25.90 74.50 44.71 | 96.77 77.35 87.45 86.96 | 70.41 60.93 68.50 66.39
MEMIT           |  9.30  5.40  0.10  0.29 | 36.50 32.19 20.37 28.02 |  1.18  0.74  6.18  1.64
GRACE           |  100   0.65 99.98  1.94 |  100  27.40  100  51.60 | 99.85 46.12 98.84 70.34
MELO            | 88.00 32.45 64.55 50.94 | 79.06 66.74 99.46 80.79 | 72.40 61.00 97.10 75.31
SCAN (Ours)     |  100  98.25 92.95 96.97 |  100  95.36  100  98.43 |  100  89.29 90.30 92.97
-- Llama3.1-8B-Instruct --
FT              | 37.20 12.60  1.95  4.79 | 56.59 47.04  4.97 11.38 | 73.07 68.28 19.42 32.23
RECT            | 10.35  8.10  0.40  1.16 |  6.18  5.45  2.63  4.18 |  2.61  1.93  1.48  1.95
AlphaEdit       | 97.65 39.55 45.85 56.68 | 94.48 82.41 78.14 84.91 | 92.87 86.88 64.06 78.07
MEMIT           |  0.00  0.00  0.60  0.00 |  1.19  0.82  3.69  1.41 |  0.04  0.04  0.18  0.06
GRACE           |  100   1.00 99.80  2.94 | 99.85 27.35  100  51.44 | 99.84 59.13 99.41 79.17
MELO            | 90.05 59.70 48.05 63.67 | 88.48 70.19 88.61 81.83 | 84.33 73.63 92.24 83.11
SCAN (Ours)     |  100  86.70 95.10 93.58 |  100  91.07 99.88 96.84 |  100  87.45 89.05 92.07

Figure 3. Distribution of selected features across layers. Both models exhibit a characteristic dual-peak pattern, indicating functional localization in shallow and middle-to-deep layers.

boundary and the final token are the dominant positions influencing the model's inference.
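As an aside on Table 1, the Avg column can be reproduced directly from the harmonic-mean formula in its caption; the function name below is illustrative:

```python
def harmonic_avg(rel, gen, loc):
    """Avg = 3 / (Rel^-1 + Gen^-1 + Loc^-1); defined as 0 if any metric is 0."""
    if min(rel, gen, loc) == 0:
        return 0.0
    return 3.0 / (1.0 / rel + 1.0 / gen + 1.0 / loc)

# GRACE on CounterFact / Gemma2-2B: a near-zero Gen drags Avg down to ~1.10
print(round(harmonic_avg(100, 0.37, 99.80), 2))   # -> 1.1
```

This is why GRACE scores perfect Reliability yet a tiny Avg: the harmonic mean punishes any single weak metric far more than an arithmetic mean would.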
Layer-wise Distribution and Functional Specialization. Viewed by depth, the percentage of attribution nodes exhibits a distinct dual-peak pattern across layers, as shown in Figure 3. To further decouple the functions of these peaks, we analyze the distributions of features activated at the subject's last token and at the last token separately, using a heatmap (Figure 4). Our analysis reveals that the two peaks in the global distribution correspond to the denser concentrations in the heatmap:

• Shallow layers: the first peak primarily corresponds to the subject's last token, where MLP layers focus on representing and stabilizing the subject's identity.

• Middle-to-deep layers: the second peak is dominated by the last token position, where MLPs integrate the relation information with the subject's representation to extract the target answer.

Figure 4. Heatmap of the selected feature distribution across layers at special token positions. The dark regions indicate that the early-layer peaks in Figure 3 align with the subject tokens, while the later-layer peaks correspond to the last token position, on both models.

In summary, these findings suggest a "Subject-to-Answer" pipeline: shallow layers encode the subject's information, while middle-to-deep layers leverage the relation to finalize the factual retrieval at the terminal token.

4.4. What is the semantic meaning of the features identified for editing? (RQ3)

Table 2. Comparison of feature discriminability across different selection methods, using Qwen3-8B and Gemma2-2B on the CounterFact dataset.
                | Rel-Gen                      | Rel-Loc
Method          | J-scr  J-acc  F1-scr  F1-acc | J-scr  J-acc  F1-scr  F1-acc
-- Qwen3-8B --
All             | 0.258  0.421  0.394   0.423  | 0.297  0.409  0.445   0.394
Sub             | 0.594  0.992  0.731   0.994  | 0.162  0.931  0.272   0.928
No last         | 0.346  0.689  0.496   0.694  | 0.151  0.880  0.249   0.877
Both            | 0.574  0.978  0.714   0.979  | 0.094  0.955  0.161   0.954
-- Gemma2-2B --
All             | 0.251  0.414  0.383   0.424  | 0.300  0.404  0.449   0.389
Sub             | 0.509  0.938  0.659   0.939  | 0.153  0.921  0.258   0.917
No last         | 0.312  0.573  0.454   0.583  | 0.160  0.861  0.265   0.854
Both            | 0.491  0.926  0.642   0.927  | 0.118  0.938  0.203   0.937

Not all nodes captured by the Attribution Graph are suitable for editing. These nodes often mix edit-specific features (tied to the unique fact) with general-purpose features (common semantic categories). Indiscriminately editing the latter would lead to overfitting. To illustrate this, we examine a reliability prompt, "The mother tongue of Thomas Joannes Stieltjes is", and a locality prompt, "Arend Lijphart is a native speaker of" (both targeting the answer Dutch). Our attribution analysis (see Figure 5) reveals that both prompts activate feature #13366 at layer 19 and feature #410 at layer 24. Projecting their decoder vectors onto the unembedding matrix, we extract the top-5 tokens with the highest logits:

• Layer 19, #13366: Spanish, English, Arabic, bahasa, and Hindi.
• Layer 24, #410: Korean, Japanese, Indonesian, Vietnamese, and Russian.

These nodes clearly represent the general concept of "language" rather than the specific fact being edited. Crucially, these general features do not activate at any position within the subject of the sentence, nor are they triggered by rephrased prompts such as "Thomas Joannes Stieltjes was born in". This observation suggests that features relevant to the specific factual update are uniquely concentrated within the subject's boundary. More case studies and visualizations can be found in Appendix E.
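The projection used above is a logit-lens-style readout: multiply a feature's decoder vector by the unembedding matrix and keep the top-k tokens. A minimal sketch with a toy vocabulary and random matrices (real runs use the model's unembedding matrix and the transcoder's decoder weights):

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, vocab = 16, 6
vocab_tokens = ["Spanish", "English", "Dutch", "Korean", "Japanese", "cat"]  # toy vocab

W_U = rng.normal(size=(d_model, vocab))   # unembedding: d_model -> vocabulary logits
f_dec = rng.normal(size=d_model)          # decoder vector of the feature to inspect

def top_k_tokens(f_dec, W_U, tokens, k=5):
    logits = f_dec @ W_U                  # project the feature into vocabulary space
    order = np.argsort(logits)[::-1][:k]  # indices of the k highest logits
    return [tokens[i] for i in order]

print(top_k_tokens(f_dec, W_U, vocab_tokens))
```

A feature whose top tokens form a broad category (like the language names seen for #13366 and #410) is a general-purpose feature, exactly the kind the filtering rules below are designed to exclude.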
Based on these insights, we design three filtering rules that extract specialized features by constraining their activation positions: (i) Subject-Anchored features (Sub), activated at the subject's last token; (ii) Terminal-Exclusionary features (No last), not activated at the last token; and (iii) Both, meeting both criteria. We evaluate the discriminative power of these rules on the first 1,000 cases of CounterFact. As shown in Table 2, we measure the average feature overlap between prompts using Jaccard and F1 scores. We report the average score and define discriminative accuracy via thresholds (0.25 for Jaccard and 0.4 for F1, i.e., roughly half overlap between the sets). Our findings show that features activated at the subject's last token (Sub) yield the highest discriminative scores. This confirms that the subject's terminal position is the primary locus of edit-specific factual information.

4.5. How does the model maintain its general capabilities after editing? (RQ4)

We evaluate across six diverse benchmarks: ARC (Clark et al., 2018), CommonsenseQA (Talmor et al., 2019), GSM8K (Cobbe et al., 2021), MMLU (Hendrycks et al., 2021), OpenBookQA (Mihaylov et al., 2018), and SciQ (Welbl et al., 2017). The experiments demonstrate that our method is robust in preserving the model's general capabilities during the editing process. As the number of edits scales up to 3,000, our approach maintains pre-existing abilities. In contrast, methods such as RECT and MEMIT collapse as the number of edits increases, with their accuracy plummeting toward zero after 2,000 edits. Details are shown in Figure 6.

5. Related Work

Model Editing for Knowledge Update. Current model editing techniques are broadly categorized into parameter-modifying and parameter-preserving methods. Parameter-modifying methods directly update model weights to encode new facts.
Meta-learning approaches like MEND (Mitchell et al., 2022) use hypernetworks to transform gradients into specific parameter updates, while locate-then-edit methods such as ROME (Meng et al., 2022a) and MEMIT (Meng et al., 2022b) update parameters by solving constrained optimization problems, aiming to balance successful editing with the preservation of unrelated knowledge. AlphaEdit (Fang et al., 2025) employs null-space projection to ensure updates satisfy constraints without interfering with unrelated knowledge. Notably, RECT (Gu et al., 2024) introduces sparse edits but lacks a deep attribution-based analysis. In contrast, parameter-preserving methods update the model without altering the original weights. GRACE (Hartvigsen et al., 2023) achieves this by caching discrete codebook values for specific hidden states, while MELO (Yu et al., 2024) stores LoRA weights; both use Euclidean distance to trigger the desired edit. WISE (Wang et al., 2024a) utilizes a pretrained router to compute activations as a trigger and a "side memory" MLP to generate output as a perturbation. REACT (Zhong et al., 2025) uses two pretrained MLPs to calculate semantic similarity and edits by computing a belief shift based on positive and negative representations.

Figure 5. Activation visualization of identified features on the specific prompts. The left column shows Feature #13366 at Layer 19 and the right column shows Feature #410 at Layer 24, each on reliability, generality, and locality prompts. Darker colors indicate higher activation values.

Figure 6. Comparison of SCAN against other methods across six benchmarks. The results show the ability to preserve the model's general performance as the number of edit cases grows.
Mechanistic Interpretability. Our work is grounded in mechanistic interpretability. Early research focused on localization and logit analysis. Causal Tracing identifies important components by patching neurons to observe their causal effect (Meng et al., 2022a). Logit Lens monitors the model's intermediate computations by projecting hidden states onto the vocabulary via the unembedding matrix (Geva et al., 2022; Dar et al., 2023). These foundational techniques have evolved into more sophisticated circuit analysis, which seeks to decompose the model's computation into functional paths (Ameisen et al., 2025). To resolve the polysemanticity inherent in dense LLMs, Sparse Autoencoders (SAEs) (Cunningham et al., 2023) and their variants, such as Sparse Transcoders (Dunefsky et al., 2024), have been developed to project activations onto sparse, monosemantic features. Unlike standard SAEs, transcoders mimic the behavior of MLP layers, allowing for a more direct mapping of the knowledge storage mechanism (Paulo et al., 2025).

6. Conclusion

In this paper, we introduced SCAN, a framework leveraging Sparse Transcoders and Attribution Graphs for sparse editing in a lifelong setting. By isolating minimal features, SCAN overcomes the granularity limitations of dense editing, reducing side effects on unrelated knowledge. Extensive experiments demonstrate that SCAN achieves superior performance. Ultimately, SCAN contributes an interpretable perspective to the field of model editing.

Impact Statement

This paper presents work whose goal is to advance the field of LLMs by improving the interpretability and controllability of LLM representations, with a particular focus on lifelong knowledge editing.
By enabling more transparent identification of model components and more targeted modifications of model behavior, this work may contribute to safer and more reliable LLM systems. While techniques for model editing could, in principle, be misused to alter factual information in deployed models, our approach emphasizes interpretability and precise, controlled edits, which may help reduce unintended or harmful modifications. We do not anticipate significant negative societal consequences arising directly from this research.

References

Ameisen, E., Lindsey, J., Pearce, A., Gurnee, W., Turner, N. L., Chen, B., Citro, C., Abrahams, D., Carter, S., Hosmer, B., Marcus, J., Sklar, M., Templeton, A., Bricken, T., McDougall, C., Cunningham, H., Henighan, T., Jermyn, A., Jones, A., Persic, A., Qi, Z., Ben Thompson, T., Zimmerman, S., Rivoire, K., Conerly, T., Olah, C., and Batson, J. Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread, 2025.

Ammar Khodja, H., Bechet, F., Brabant, Q., Nasr, A., and Lecorvé, G. WikiFactDiff: A large, realistic, and temporally adaptable dataset for atomic factual knowledge update in causal language models. In Calzolari, N., Kan, M.-Y., Hoste, V., Lenci, A., Sakti, S., and Xue, N. (eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 17614–17624, Torino, Italia, May 2024. ELRA and ICCL.

Bereska, L. and Gavves, E. Mechanistic interpretability for AI safety – a review. arXiv preprint, 2024.

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., and Olah, C.
Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023.

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv:1803.05457v1, 2018.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint, 2021.

Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.

Dar, G., Geva, M., Gupta, A., and Berant, J. Analyzing transformers in embedding space. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 16124–16170, 2023.

Dunefsky, J., Chlenski, P., and Nanda, N. Transcoders find interpretable LLM feature circuits. Advances in Neural Information Processing Systems, 37:24375–24410, 2024.

Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.

Fang, J., Jiang, H., Wang, K., Ma, Y., Shi, J., Wang, X., He, X., and Chua, T.-S. AlphaEdit: Null-space constrained knowledge editing for language models. In Yue, Y., Garg, A., Peng, N., Sha, F., and Yu, R. (eds.), International Conference on Representation Learning, volume 2025, pp. 16366–16396, 2025.

Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories.
In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5484–5495, 2021.

Geva, M., Caciularu, A., Wang, K., and Goldberg, Y. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 30–45, 2022.

Gu, J.-C., Xu, H.-X., Ma, J.-Y., Lu, P., Ling, Z.-H., Chang, K.-W., and Peng, N. Model editing harms general abilities of large language models: Regularization to the rescue. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 16801–16819, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.934.

Gupta, A., Rao, A., and Anumanchipalli, G. Model editing at scale leads to gradual and catastrophic forgetting. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 15202–15232, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.902.

Hanna, M., Piotrowski, M., Lindsey, J., and Ameisen, E. circuit-tracer, 2025. The first two authors contributed equally and are listed alphabetically.

Hartvigsen, T., Sankaranarayanan, S., Palangi, H., Kim, Y., and Ghassemi, M. Aging with GRACE: Lifelong model editing with discrete key-value adaptors. Advances in Neural Information Processing Systems, 36:47934–47959, 2023.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.

Hong, Y. and Lipani, A.
Interpretability-based tailored knowledge editing in transformers. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 3847–3858, 2024.

Jiang, H., Fang, J., Zhang, T., Bi, B., Zhang, A., Wang, R., Liang, T., and Wang, X. Neuron-level sequential editing for large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 16678–16702, 2025.

Levy, O., Seo, M., Choi, E., and Zettlemoyer, L. Zero-shot relation extraction via reading comprehension. In Levy, R. and Specia, L. (eds.), Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pp. 333–342, Vancouver, Canada, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/K17-1034.

Mazzia, V., Pedrani, A., Caciolai, A., Rottmann, K., and Bernardi, D. A survey on knowledge editing of neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2024.

Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372, 2022a.

Meng, K., Sharma, A. S., Andonian, A., Belinkov, Y., and Bau, D. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229, 2022b.

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Riloff, E., Chiang, D., Hockenmaier, J., and Tsujii, J. (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391, Brussels, Belgium, October–November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1260.

Mitchell, E., Lin, C., Bosselut, A., Finn, C., and Manning, C. D. Fast model editing at scale. In International Conference on Learning Representations, 2022.

Paulo, G., Shabalin, S., and Belrose, N.
Transcoders beat sparse autoencoders for interpretability. arXiv preprint arXiv:2501.18823, 2025.

Rai, D., Zhou, Y., Feng, S., Saparov, A., and Yao, Z. A practical review of mechanistic interpretability for transformer-based language models. arXiv preprint arXiv:2407.02646, 2024.

Talmor, A., Herzig, J., Lourie, N., and Berant, J. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1421.

Wang, P., Li, Z., Zhang, N., Xu, Z., Yao, Y., Jiang, Y., Xie, P., Huang, F., and Chen, H. WISE: Rethinking the knowledge memory for lifelong model editing of large language models. Advances in Neural Information Processing Systems, 37:53764–53797, 2024a.

Wang, P., Zhang, N., Tian, B., Xi, Z., Yao, Y., Xu, Z., Wang, M., Mao, S., Wang, X., Cheng, S., Liu, K., Ni, Y., Zheng, G., and Chen, H. EasyEdit: An easy-to-use knowledge editing framework for large language models. In Cao, Y., Feng, Y., and Xiong, D. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pp. 82–93, Bangkok, Thailand, August 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-demos.9.

Wang, S., Zhu, Y., Liu, H., Zheng, Z., Chen, C., and Li, J. Knowledge editing for large language models: A survey. ACM Computing Surveys, 57(3):1–37, 2024c.

Welbl, J., Liu, N. F., and Gardner, M. Crowdsourcing multiple choice science questions. In Derczynski, L., Xu, W., Ritter, A., and Baldwin, T. (eds.), Proceedings of the 3rd Workshop on Noisy User-generated Text, pp. 94–106, Copenhagen, Denmark, September 2017.
Association for Computational Linguistics. doi: 10.18653/v1/W17-4413.

Yu, L., Chen, Q., Zhou, J., and He, L. MELO: Enhancing model editing with neuron-indexed dynamic LoRA. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 19449–19457, 2024.

Zhang, N., Yao, Y., and Deng, S. Knowledge editing for large language models. In Klinger, R., Okazaki, N., Calzolari, N., and Kan, M.-Y. (eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): Tutorial Summaries, pp. 33–41, Torino, Italia, May 2024a. ELRA and ICCL.

Zhang, Z., Li, Y., Kan, Z., Cheng, K., Hu, L., and Wang, D. Locate-then-edit for multi-hop factual recall under knowledge editing. arXiv preprint, 2024b.

Zhao, Z., Koishekenov, Y., Yang, X., Murray, N., and Cancedda, N. Verifying chain-of-thought reasoning via its computational graph. arXiv preprint, 2025.

Zhong, H., Liu, Y., Xu, Z., Liu, G., Liu, Q., Wu, S., Zhao, Z., Wang, L., and Tan, T. REACT: Representation extraction and controllable tuning to overcome overfitting in LLM knowledge editing. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 16994–17011, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.860.

Zhu, C., Rawat, A. S., Zaheer, M., Bhojanapalli, S., Li, D., Yu, F., and Kumar, S. Modifying memories in transformer models. arXiv preprint arXiv:2012.00363, 2020.

A. Detailed Experiment Setup

A.1.
Datasets

Here, we provide a detailed introduction to the datasets used in this paper:

• CounterFact: CounterFact is a more challenging dataset that contrasts counterfactual with factual statements, on which models initially score lower. It constructs out-of-scope data by replacing the subject entity with approximate entities sharing the same predicate. The CounterFact dataset has metrics similar to ZsRE for evaluating efficacy, generalization, and specificity. Additionally, CounterFact includes multiple generation prompts with the same meaning as the original prompt to test the quality of generated text, specifically focusing on fluency and consistency.

• ZsRE: ZsRE is a question answering (QA) dataset that uses questions generated through back-translation as equivalent neighbors. Following previous work, natural questions are used as out-of-scope data to evaluate locality. Each sample in ZsRE includes a subject string and answers as the editing targets to assess editing success, along with a rephrased question for generalization evaluation and a locality question for evaluating specificity.

• WikiFactDiff: WikiFactDiff is a dataset focused on the task of factual updates, contrasting new, obsolete, and static facts across two different time points. It constructs out-of-scope data by comparing the state of the Wikidata knowledge base on January 4, 2021, and February 27, 2023. The dataset includes various update scenarios such as fact replacements, archiving, and new entity insertions, with facts represented as subject–relation–object triples. WikiFactDiff includes verbalization templates and cloze tests for evaluating update algorithms, specifically focusing on the quality and consistency of updates.

A.2. Metrics

Evaluation Formulation. All datasets in our experiments follow the same evaluation protocol.
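The four metrics formalized in this protocol all reduce to rates of argmax agreement between predictions and references; as a rough sketch (the helper name and toy data are ours, not the paper's code):

```python
def match_rate(preds, refs):
    """Fraction of cases where an argmax prediction equals its reference."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


# Reliability/Generality: compare the edited model's argmax outputs with the
# new target o* on original or rephrased edit prompts.
# Locality: compare the edited model's outputs with the ORIGINAL model's own
# outputs on out-of-scope prompts.
# General ability: compare the edited model's outputs with gold benchmark labels.
edited_preds = ["Dutch", "Dutch", "Paris"]
new_targets = ["Dutch", "Dutch", "Rome"]
print(match_rate(edited_preds, new_targets))  # 2 of 3 toy edits succeed
```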
Given an edit $e = (s, r, o, o^*)$, we evaluate the editing performance in terms of Reliability, Generality, and Locality, as well as General Ability on general-purpose benchmarks such as MMLU.

Reliability. Reliability measures whether the model correctly applies the intended edit. It evaluates the model's ability to produce the edited target $o^*$ when queried with the original subject–relation pair $(s, r)$:
$$\mathcal{M}_{\mathrm{rel}} = \mathbb{E}_{e \sim \mathcal{D}_{\mathrm{edit}}}\, \mathbb{I}\!\left[ \arg\max_{o} P_{f^*}(o \mid p(s, r)) = o^* \right]$$

Generality. Generality evaluates whether the applied edit generalizes to in-scope variants of the edited fact. It measures the model's ability to output the same edited target $o^*$ under semantically equivalent prompts associated with the same edit:
$$\mathcal{M}_{\mathrm{gen}} = \mathbb{E}_{e \sim \mathcal{D}_{\mathrm{edit}},\, p^* \sim \mathcal{N}(e)}\, \mathbb{I}\!\left[ \arg\max_{o} P_{f^*}(o \mid p^*(s, r)) = o^* \right]$$
where $\mathcal{N}(e)$ denotes the set of in-scope prompt variations for edit $e$.

Locality. Locality measures whether the edit avoids unintended side effects on unrelated knowledge. It evaluates whether the model's predictions on out-of-scope inputs remain unchanged after editing:
$$\mathcal{M}_{\mathrm{loc}} = \mathbb{E}_{(x, p) \sim \mathcal{D}_{\mathrm{loc}}}\, \mathbb{I}\!\left[ \arg\max_{x} P_{f^*}(x \mid p) = \arg\max_{x} P_{f}(x \mid p) \right]$$
where $f$ and $f^*$ denote the original and edited models, respectively.

General Ability. General ability evaluates whether the editing process degrades the model's overall reasoning and knowledge capabilities across diverse domains:
$$\mathcal{M}_{\mathrm{ga}} = \mathbb{E}_{(x, y) \sim \mathcal{D}_{\mathrm{ga}}}\, \mathbb{I}\!\left[ \arg\max_{y} P_{f^*}(y \mid x) = y \right]$$
where $f^*$ denotes the edited model.

A.3. Baselines

Six popular model editing methods were selected as baselines:

• FT: FT simply performs gradient descent on the edits to update model parameters. It fine-tunes the last layer of the model with a norm constraint on weight changes to prevent overfitting.
• RECT: RECT regularizes weight updates based on their relative changes to control parameter perturbations, thereby achieving knowledge edits while maximizing the preservation of the model's general capabilities.

• GRACE: GRACE enables localized corrections of streaming errors in deployed models by writing new mappings into the pretrained model's latent space, creating a discrete, local edit cache. It thereby achieves continuous knowledge updates while minimizing the impact on unrelated inputs, all without modifying the model weights.

• MELO: MELO is a plug-in model editing method based on neuron-indexed dynamic LoRA, which alters the behavior of language models by dynamically activating certain LoRA blocks according to an internal vector database.

• MEMIT: MEMIT first localizes the key positions storing factual knowledge in the Transformer MLP modules, and then simultaneously updates a large set of facts across multiple MLP layers, achieving order-independent knowledge edits with minimal impact on other knowledge.

• AlphaEdit: AlphaEdit builds on MEMIT by introducing null-space projection, constraining knowledge updates to directions that do not disrupt existing knowledge, thereby achieving more robust, order-independent multi-fact edits.

A.4. Implementation Details

We report the hyperparameter configurations used for all editing methods and backbone models. For each method, we select the best-performing configuration (following the recommendations of Wang et al. (2024b)) and apply it consistently across all experiments.

AlphaEdit
Parameter        Qwen3-8B    Llama3.1-8B   Gemma2-2B
Layers           4–8         4–8           4–8
Fact token       subj last   subj last     subj last
w decay          1e-3        1e-3          1e-3
Mom. weight      15000       15000         15000
Mom. samples     3000        3000          3000
Clamp factor     0.75        0.75          0.75
Null thresh.     0.02        0.02          0.02
L2               100         10            500
Steps            25          25            25
LR               1e-3        1e-3          1e-3

Fine-Tuning (FT)
Parameter        Qwen3-8B    Llama3.1-8B   Gemma2-2B
Layers           35          31            25
Steps            25          25            25
LR               5e-4        5e-4          5e-4

MEMIT
Parameter        Qwen3-8B    Llama3.1-8B   Gemma2-2B
Layers           4–8         4–8           4–8
Fact token       subj last   subj last     subj last
w decay          1e-3        1e-3          1e-3
Mom. weight      15000       15000         15000
Mom. samples     3000        3000          3000
Clamp factor     4           4             4
Steps            25          25            25
LR               0.5         0.5           0.5

RECT
Parameter        Qwen3-8B    Llama3.1-8B   Gemma2-2B
Layers           4           4             4
Sparse rate      0.2         0.2           0.2
Reg.             abs         abs           abs
Steps            25          25            25
LR               0.5         0.5           0.5

GRACE
Parameter        Qwen3-8B    Llama3.1-8B   Gemma2-2B
Layers           23          27            19
ε                1           1             1
Steps            100         100           100
LR               1           1             1

MELO
Parameter        Qwen3-8B    Llama3.1-8B   Gemma2-2B
Layers           35,35       30,31         24,25
Steps            50          50            50
LR               1e-4        1e-4          1e-4
LoRA (r, α)      (64,64)     (64,64)       (64,64)
Radius           75          75            75

SCAN (Ours)
Parameter        Qwen3-8B    Llama3.1-8B   Gemma2-2B
Layers           all         all           all
Steps            50          50            50
LR               0.005       0.005         0.005
Thresh.          0.25        0.25          0.25
Features Num.    300         450           200
Node Thresh.     0.8         0.9           0.8
Edge Thresh.     0.98        0.99          0.98
Node Num.        8192        8192          8192
Transcoder Dim.  163840      131072        16384

B. Detailed Proof

B.1. Jacobian as the Optimal Direction-Preserving Linearization

Lemma B.1 (Stability of normalization). Let $u, v \in \mathbb{R}^n$ be nonzero vectors. Then
$$\left\| \frac{u}{\|u\|} - \frac{v}{\|v\|} \right\| \le \frac{2\|u - v\|}{\|u\|}$$

Proof. We decompose
$$\frac{u}{\|u\|} - \frac{v}{\|v\|} = \frac{u - v}{\|u\|} + v\left( \frac{1}{\|u\|} - \frac{1}{\|v\|} \right)$$
Taking norms and applying the triangle inequality yields
$$\left\| \frac{u}{\|u\|} - \frac{v}{\|v\|} \right\| \le \frac{\|u - v\|}{\|u\|} + \|v\| \left| \frac{1}{\|u\|} - \frac{1}{\|v\|} \right|$$
Using the reverse triangle inequality, $\big| \|u\| - \|v\| \big| \le \|u - v\|$, we obtain
$$\|v\| \left| \frac{1}{\|u\|} - \frac{1}{\|v\|} \right| = \frac{\big| \|v\| - \|u\| \big|}{\|u\|} \le \frac{\|u - v\|}{\|u\|}$$
Hence,
$$\left\| \frac{u}{\|u\|} - \frac{v}{\|v\|} \right\| \le \frac{2\|u - v\|}{\|u\|}$$

Proof. Fix $x_0 \in X$ and assume that $f$ is differentiable at $x_0$ with Jacobian matrix $J_f(x_0)$.
By differentiability at $x_0$, we have
$$f(x_0 + h) = f(x_0) + J_f(x_0)h + r_{x_0}(h), \qquad \frac{\|r_{x_0}(h)\|}{\|h\|} \xrightarrow[h \to 0]{} 0$$
Setting $h = -x_0$ yields
$$f(0) = f(x_0) - J_f(x_0)x_0 + r_{x_0}(-x_0)$$
Since $f(0) = 0$, this can be rewritten as
$$f(x_0) = J_f(x_0)x_0 - r_{x_0}(-x_0)$$
Define $\tilde{r}(x_0) := -r_{x_0}(-x_0)$. Then we have
$$f(x_0) = J_f(x_0)x_0 + \tilde{r}(x_0) \qquad \text{and} \qquad \frac{\|\tilde{r}(x_0)\|}{\|x_0\|} = \frac{\|r_{x_0}(-x_0)\|}{\|x_0\|} \xrightarrow[x_0 \to 0]{} 0$$
We then consider the difference in normalized directions:
$$\frac{f(x_0)}{\|f(x_0)\|} - \frac{J_f(x_0)x_0}{\|J_f(x_0)x_0\|} = \frac{J_f(x_0)x_0 + \tilde{r}(x_0)}{\|J_f(x_0)x_0 + \tilde{r}(x_0)\|} - \frac{J_f(x_0)x_0}{\|J_f(x_0)x_0\|}$$
By Lemma B.1, we have
$$\left\| \frac{f(x_0)}{\|f(x_0)\|} - \frac{J_f(x_0)x_0}{\|J_f(x_0)x_0\|} \right\| \le \frac{2\|\tilde{r}(x_0)\|}{\|J_f(x_0)x_0\|}$$
Since $J_f(0)$ is non-singular, there exists a constant $M > 0$ such that $\|J_f(0)x_0\| \ge M\|x_0\|$. By the continuity of the Jacobian at the origin, we have $J_f(x_0) \to J_f(0)$ as $x_0 \to 0$. For sufficiently small $x_0$, the reverse triangle inequality implies:
$$\|J_f(x_0)x_0\| \ge \|J_f(0)x_0\| - \|(J_f(x_0) - J_f(0))x_0\| \ge \frac{M}{2}\|x_0\|$$
Setting $m = M/2$, it follows that $\|J_f(x_0)x_0\| \ge m\|x_0\|$ for $x_0$ near the origin. This implies that the ratio satisfies
$$\frac{2\|\tilde{r}(x_0)\|}{\|J_f(x_0)x_0\|} \le \frac{2\|\tilde{r}(x_0)\|}{m\|x_0\|} = \frac{2}{m} \cdot \frac{\|\tilde{r}(x_0)\|}{\|x_0\|} \xrightarrow[x_0 \to 0]{} 0$$
Hence, we obtain
$$\frac{f(x_0)}{\|f(x_0)\|} - \frac{J_f(x_0)x_0}{\|J_f(x_0)x_0\|} \xrightarrow[x_0 \to 0]{} 0$$
which proves that the Jacobian matrix $J_f(x_0)$ satisfies the desired directional alignment property.

B.2. Closed-form Total Attribution Matrix

Lemma B.2 (Convergence of powers of $A$). Let $A \in \mathbb{R}^{n \times n}$ satisfy $\|A\| < 1$. Then
$$A^k \xrightarrow[k \to \infty]{} 0$$
where $0$ is the $n \times n$ zero matrix.

Proof. For any positive integer $k$, we have
$$\|A^k\| = \|A \cdot A^{k-1}\| \le \|A\| \cdot \|A^{k-1}\| \le \cdots \le \|A\|^k$$
with $\|A\| < 1$.
Hence, $\|A^k\| \xrightarrow[k \to \infty]{} 0$, which implies
$$A^k \xrightarrow[k \to \infty]{} 0$$

Proof of invertibility. To prove that $I - A$ is invertible, it suffices to show that $0$ is not an eigenvalue of $I - A$; equivalently, that $1$ is not an eigenvalue of $A$. We argue by contradiction. Suppose that $1$ is an eigenvalue of $A$. Then there exists a nonzero vector $x \in \mathbb{R}^n$ such that
$$Ax = x$$
Without loss of generality, we may assume $\|x\| = 1$. By iteration, for any positive integer $k$ we have
$$A^k x = x$$
Taking norms yields
$$1 = \|x\| = \|A^k x\| \le \|A^k\| \|x\| = \|A^k\|$$
On the other hand, since $\|A\| < 1$,
$$\|A^k\| \le \|A\|^k \xrightarrow[k \to \infty]{} 0$$
This contradicts the inequality $1 \le \|A^k\|$. Therefore, $1$ cannot be an eigenvalue of $A$, and hence $0$ is not an eigenvalue of $I - A$. It follows that $I - A$ is invertible.

Proof. For any positive integer $k$, a direct computation shows that
$$(I + A + A^2 + \cdots + A^k)(I - A) = I - A^{k+1}$$
Consequently,
$$I + A + A^2 + \cdots + A^k = (I - A)^{-1} - A^{k+1}(I - A)^{-1}$$
Since $A^{k+1} \to 0$ as $k \to \infty$, it follows that $A^{k+1}(I - A)^{-1} \to 0$. Letting $k \to \infty$, we conclude that the matrix series $\sum_{k=0}^{\infty} A^k$ converges and satisfies
$$\sum_{k=0}^{\infty} A^k = (I - A)^{-1}$$

C. Illustration of Attribution Graph

C.1. Illustrative Algorithm of SCAN

The complete procedure for constructing the Attribution Graph is summarized in Algorithm 1. Owing to the nilpotent structure of the adjacency matrix $A$, which arises when the nodes are arranged in order of increasing layer (i.e., from lower to higher levels, such that $a_{i,j} = 0$ for all $i \ge j$), and the fact that the maximum path length in multi-step attributions is finite, we adopt an iterative algorithm to compute the total attribution matrix $B$. Moreover, the sparsity of $A$ ensures computational efficiency.
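The iterative accumulation $A_k = A \cdot A_{k-1} + A$ and the closed form from Lemma B.2 can be checked numerically. The sketch below uses a hypothetical strictly triangular (hence nilpotent) attribution matrix of our own; since the iterate equals $A + A^2 + \cdots + A^N$, it matches $(I - A)^{-1} - I$ (the $k = 0$ identity term of the series is dropped):

```python
import numpy as np


def total_attribution(A: np.ndarray, n_steps: int) -> np.ndarray:
    """Iterative accumulation: A_1 = A, A_k = A @ A_{k-1} + A, so the
    result equals A + A^2 + ... + A^N (all paths of length <= N)."""
    B = A.copy()
    for _ in range(2, n_steps + 1):
        B = A @ B + A
    return B


# A strictly triangular adjacency matrix (nodes ordered by layer) is
# nilpotent, so the series is finite and agrees with (I - A)^{-1} - I.
A = np.array([[0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0],
              [0.2, 0.3, 0.0]])
B = total_attribution(A, 3)
closed = np.linalg.inv(np.eye(3) - A) - np.eye(3)
print(np.allclose(B, closed))  # True
```

Here the entry $B_{2,0} = 0.2 + 0.3 \times 0.5 = 0.35$ combines the direct edge with the one indirect path through node 1.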
Algorithm 1: Knowledge-Specific Circuit Construction and Pruning

Input: editing instance $e = (s, r, o \to o^*)$, model $M$, prompt $p$, threshold $\tau$, propagation steps $N$

Forward Pass and Node Collection:
  Run a forward pass of $M$ with prompt $p$
  Record embeddings, Sparse Transcoder activations, MLP reconstruction errors, and output logits
  Construct the node set $V = V_{\text{embed}} \cup V_{\text{feature}} \cup V_{\text{error}} \cup V_{\text{logit}}$
  Initialize the complete Attribution Graph $G = K(V)$

Direct Attribution Matrix Construction:
  for all node pairs $(u, v)$ with $u$ preceding $v$:
    Obtain the activation $z_u$
    Backpropagate the gradient from node $v$ and compute $\partial M_v / \partial z_u$
    Set $A_{v,u} \leftarrow (\partial M_v / \partial z_u) \cdot z_u$

Iterative Indirect Attribution Accumulation:
  Initialize $A_1 \leftarrow A$
  for $k = 2$ to $N$:
    $A_k \leftarrow A \cdot A_{k-1} + A$
  Set $B \leftarrow A_N$

Recursive Edge Pruning:
  for all target nodes $v \in V$:
    Collect the incoming edges $E_v = \{u \to v\}$
    Sort $E_v$ in descending order of $B_{v,u}$ and normalize the scores $\tilde{B}_{v,u}$
    Initialize the cumulative sum $c \leftarrow 0$
    for all edges $u \to v$ in sorted order:
      $c \leftarrow c + \tilde{B}_{v,u}$
      if $c \ge \tau$: prune this edge and all remaining edges in $E_v$; break

Output: the remaining graph $G'$

C.2. Example of Attribution Graph

To further elucidate the model's decision-making process, we provide a high-fidelity Attribution Graph as an example in Figure 7.

D. Other Experiments

D.1. Results on Llama3.2-1B

To evaluate the robustness of editing methods on lightweight architectures, we further report results on the small-scale model Llama3.2-1B, as shown in Table 3.

Figure 7. Attribution Graph for the target token "Joe" on Gemma2-2B. Nodes represent features across layers. Blue edges denote promoting effects (positive weights), while orange edges denote suppressing effects (negative weights).
The color opacity represents the relative magnitude of the attribution weight, with uniform line thickness for visual clarity.

Table 3. Sequential editing task performance comparison of our method and other methods after 1,000 edits on Llama3.2-1B. Bold and underline denote the best and second-best results per column, respectively.

Method       CounterFact (Rel / Gen / Loc / Avg)   ZsRE (Rel / Gen / Loc / Avg)     WikiFactDiff (Rel / Gen / Loc / Avg)
FT           32.25 / 11.05 / 1.90 / 5.23           55.97 / 48.16 / 6.01 / 14.86     68.17 / 64.81 / 16.95 / 30.98
RECT         2.30 / 4.10 / 0 / 0                   1.50 / 1.33 / 1.56 / 1.46        0.44 / 0.37 / 0 / 0
AlphaEdit    86.30 / 42.95 / 23.70 / 39.61         80.66 / 66.36 / 45.70 / 61.21    48.06 / 41.45 / 24.60 / 35.15
MEMIT        0 / 0 / 0.30 / 0                      0.23 / 0.20 / 4.50 / 0.55        0.15 / 0.15 / 0.76 / 0.24
GRACE        100 / 0.80 / 99.80 / 2.37             99.40 / 24.20 / 100 / 42.02      99.82 / 48.30 / 98.84 / 70.38
MELO         91.60 / 62.15 / 50.90 / 66.01         94.59 / 69.80 / 95.02 / 85.07    86.54 / 70.31 / 94.60 / 82.35
SCAN (Ours)  99.10 / 82.60 / 92.40 / 90.96         99.80 / 79.60 / 99.80 / 90.32    100 / 81.33 / 92.01 / 90.62

E. Case Study

E.1. More Interpretable Features

We select and visualize the activation patterns of several additional interpretable features. These cases encompass both edit-specific features tied to particular entities and general-purpose features capturing broader semantic abstractions. Each feature below is visualized on reliability, generality, and locality prompts.

• ID 20#13390: References to the company Amazon and/or Amazon-branded products and services.
• ID 20#10092: Mentions of the president of the United States, particularly Obama and Trump, and political terms.
• ID 21#3435: Capital-letter abbreviations for official-sounding organizations or locations, and titles for people.
• ID 19#15383: References to athletes, specifically professional chess players, and terms related to sports participation.
• ID 19#10892: Mentions of romantic relationships, partners, and well-known couples.
• ID 22#1328: References to manufacturing, corporate production, and specific automotive brands like Nissan.
• ID 19#15849: References to U.S. politics, particularly Barack Obama and governmental structures.
• ID 20#15360: Mentions of countries and governments, often in a political or geographical context.
• ID 19#1263: Political figures or bodies, particularly related to the US government and local offices.
• ID 19#14002: Mentions of religion, historical figures, and religious affiliations.

Each of these features is accompanied by activation visualizations on reliability, generality, and locality prompts.
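The feature interpretations above rely on projecting each feature's decoder vector onto the unembedding matrix and reading off the top-logit tokens, as described for Figure 5. A toy sketch of that projection follows; the four-token vocabulary and two-dimensional matrices are hypothetical, not taken from any model.

```python
import numpy as np


def top_logit_tokens(decoder_vec, unembed, vocab, k=5):
    """Project a feature's decoder vector through the unembedding matrix and
    return the k tokens with the highest logits (logit-lens style)."""
    logits = unembed @ decoder_vec  # one logit per vocabulary token
    top = np.argsort(-logits)[:k]
    return [vocab[i] for i in top]


# Hypothetical 4-token vocabulary with a 2-dimensional residual stream.
vocab = ["Dutch", "Spanish", "English", "cat"]
unembed = np.array([[0.1, 0.0],
                    [0.9, 0.2],
                    [0.8, 0.1],
                    [0.0, -0.5]])
feature = np.array([1.0, 0.0])
print(top_logit_tokens(feature, unembed, vocab, k=2))  # ['Spanish', 'English']
```

A feature whose top tokens are all language names, as in this toy output, would be flagged as a general-purpose "language" feature rather than an edit-specific one.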