Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs
Model diffing, the process of comparing models’ internal representations to identify their differences, is a promising approach for uncovering safety-critical behaviors in new models. However, its application has so far focused primarily on comparing a base model with its finetune. Since new LLM releases often use novel architectures, cross-architecture methods are essential to make model diffing widely applicable. Crosscoders are one solution capable of cross-architecture model diffing but have only ever been applied to base-vs-finetune comparisons. We provide the first application of crosscoders to cross-architecture model diffing and introduce Dedicated Feature Crosscoders (DFCs), an architectural modification designed to better isolate features unique to one model. Using this technique, we discover, in an unsupervised fashion, features including Chinese Communist Party alignment in Qwen3-8B and DeepSeek-R1-0528-Qwen3-8B, American exceptionalism in Llama3.1-8B-Instruct, and a copyright refusal mechanism in GPT-OSS-20B. Together, our results work towards establishing cross-architecture crosscoder model diffing as an effective method for identifying meaningful behavioral differences between AI models.
💡 Research Summary
The paper tackles the emerging problem of model diffing (comparing the internal representations of two language models to discover behavioral differences) in the more challenging setting where the models have different architectures. Prior work on model diffing has largely focused on base-model versus fine-tuned model comparisons, and the one existing cross-architecture technique, the “crosscoder,” had been demonstrated only on such base-vs-finetune pairs. The authors extend this line of work by introducing Dedicated Feature Crosscoders (DFCs), a modification of the standard crosscoder that explicitly partitions the feature dictionary into three disjoint subsets: features exclusive to Model A, features exclusive to Model B, and shared features.
In a standard crosscoder, exclusivity is inferred post-hoc from the Relative Decoder Norm R_A, which measures how much of a feature’s decoder weight lies in each model’s decoder. Because the joint reconstruction loss encourages every feature to reduce error for both models, optimization is biased toward shared features, making true exclusivity rare. DFCs solve this by structurally enforcing exclusivity: decoder rows belonging to the “exclusive-to-A” set are forced to have zero norm in Model B’s decoder, and vice versa. Consequently, a feature in an exclusive set can never influence the reconstruction of the opposite model, guaranteeing ∥d_B^i∥₂ = 0 (or ∥d_A^i∥₂ = 0) by design. The loss is adjusted so that Model A is reconstructed only from its exclusive plus shared features, and Model B analogously, while a BatchTopK sparsity constraint (k = 200) keeps the representation highly sparse. This architectural change removes the implicit pressure toward shared features and allocates dedicated capacity for discovering model-specific concepts.
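The structural constraint and the resulting decoder-norm guarantee can be illustrated with a small NumPy sketch. This is a toy stand-in, not the paper's implementation: the sizes, the random weights, the simple per-example top-k in place of BatchTopK, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_feat = 64, 512              # toy sizes; real runs are far larger
n_excl = int(0.05 * n_feat)            # 5% of features exclusive per model
idx_A = np.arange(0, n_excl)           # features exclusive to Model A
idx_B = np.arange(n_excl, 2 * n_excl)  # features exclusive to Model B

# One shared encoder over concatenated activations, one decoder per model.
W_enc = rng.normal(size=(2 * d_model, n_feat)) / np.sqrt(2 * d_model)
D_A = rng.normal(size=(n_feat, d_model))
D_B = rng.normal(size=(n_feat, d_model))

# Structural exclusivity: zero the opposite model's decoder rows, so an
# exclusive feature can never help reconstruct the other model.
D_A[idx_B] = 0.0
D_B[idx_A] = 0.0

def topk_sparsify(f, k):
    """Keep the k largest activations per example (a simple per-example
    stand-in for the paper's BatchTopK)."""
    out = np.zeros_like(f)
    top = np.argsort(f, axis=-1)[:, -k:]
    np.put_along_axis(out, top, np.take_along_axis(f, top, axis=-1), axis=-1)
    return out

def forward(x_A, x_B, k=20):
    """Encode a paired batch of activations, then reconstruct each model."""
    f = np.maximum(np.concatenate([x_A, x_B], axis=-1) @ W_enc, 0.0)
    f = topk_sparsify(f, k)
    return f @ D_A, f @ D_B

def relative_decoder_norm_A(i):
    """R_A for feature i: fraction of decoder norm assigned to Model A."""
    n_A, n_B = np.linalg.norm(D_A[i]), np.linalg.norm(D_B[i])
    return n_A / (n_A + n_B)
```

By construction, `relative_decoder_norm_A` returns exactly 1.0 for every index in `idx_A` and 0.0 for every index in `idx_B`, with no post-hoc thresholding needed, which is the point of enforcing exclusivity structurally rather than inferring it from R_A after training.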
The authors first validate DFCs in a synthetic toy environment. They generate 2048 random unit vectors as ground-truth concepts, assign a small fraction (2.5%) as model-exclusive, and create 800M activation pairs with realistic co-activation patterns and an affine transformation to mimic cross-architecture misalignment. Compared against a standard crosscoder and a Designated Shared Feature (DSF) variant, DFCs achieve substantially higher recall of the exclusive concepts, especially when the dictionary size is smaller than the total number of concepts (the under-complete regime that mirrors real-world constraints). The trade-off is a higher false-positive rate, which the authors argue is acceptable for safety auditing, where missing a dangerous feature is more costly than flagging an innocuous one.
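A minimal sketch of how such a synthetic benchmark can be constructed follows. It is illustrative only: the dimensionality, coefficient distribution, and affine map below are assumptions, not the paper's exact generator.

```python
import numpy as np

rng = np.random.default_rng(1)

d, n_concepts = 64, 2048
n_excl = int(0.025 * n_concepts)   # 2.5% of concepts exclusive to each model

# Ground-truth dictionary: random directions normalized to unit vectors.
concepts = rng.normal(size=(n_concepts, d))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

excl_A = np.zeros(n_concepts, dtype=bool); excl_A[:n_excl] = True
excl_B = np.zeros(n_concepts, dtype=bool); excl_B[n_excl:2 * n_excl] = True

# Affine map standing in for cross-architecture misalignment: Model B "sees"
# the same concepts in a linearly transformed, shifted basis.
M = rng.normal(size=(d, d)) / np.sqrt(d)
b = rng.normal(size=d) * 0.1

def sample_pair(k_active=8):
    """One paired activation: a sparse mix of concepts, where each model
    only expresses concepts that are shared or exclusive to it."""
    active = rng.choice(n_concepts, size=k_active, replace=False)
    coeffs = rng.exponential(size=k_active)
    x_A, x_B = np.zeros(d), np.zeros(d)
    for c, i in zip(coeffs, active):
        if not excl_B[i]:
            x_A += c * concepts[i]   # Model A skips B-exclusive concepts
        if not excl_A[i]:
            x_B += c * concepts[i]   # Model B skips A-exclusive concepts
    return x_A, x_B @ M + b          # B's view is affinely transformed
```

A recall metric for a trained crosscoder would then check, for each ground-truth exclusive concept, whether some learned exclusive feature's decoder direction matches it (e.g. by maximum cosine similarity).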
Next, the method is applied to real LLMs. Two cross-architecture diffs are performed: (1) Llama-3.1-8B-Instruct vs. Qwen3-8B, and (2) GPT-OSS-20B vs. DeepSeek-R1-0528-Qwen3-8B. To align activations across different tokenizers, a “semantic window expansion” technique is employed. Middle-layer activations are collected as 100M token-aligned pairs drawn from a mix of generic pre-training data (FineWeb) and chat data (LMSYS-Chat-1M). After training DFCs with 5% model-exclusive features, automated interpretability pipelines are used to surface salient features. The analysis uncovers:
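The mechanics of “semantic window expansion” are only named, not detailed, in this summary. As a generic stand-in for the alignment problem it addresses, two token streams can be grouped greedily by character coverage until both sides span the same stretch of text; activations within each paired group could then be pooled into one training pair. The sketch below is a hypothetical illustration of that idea, not the paper's algorithm, and it assumes both token lists concatenate to the identical string.

```python
def aligned_groups(tokens_a, tokens_b):
    """Greedily expand a window on each side until both tokenizations cover
    the same character span, yielding paired groups of token indices."""
    i = j = end_a = end_b = 0
    groups, grp_a, grp_b = [], [], []
    while i < len(tokens_a) or j < len(tokens_b):
        # Advance whichever side lags behind in character coverage.
        if i < len(tokens_a) and (end_a <= end_b or j == len(tokens_b)):
            end_a += len(tokens_a[i]); grp_a.append(i); i += 1
        else:
            end_b += len(tokens_b[j]); grp_b.append(j); j += 1
        # When both sides cover the same span, emit one aligned group.
        if end_a == end_b and grp_a and grp_b:
            groups.append((grp_a, grp_b))
            grp_a, grp_b = [], []
    return groups
```

For example, `aligned_groups(["hel", "lo ", "world"], ["hello", " ", "wor", "ld"])` yields `[([0, 1], [0, 1]), ([2], [2, 3])]`: the first two Llama-style tokens pair with the first two Qwen-style tokens, and so on, even though the token boundaries disagree.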
- CCP alignment in Qwen3-8B and DeepSeek-R1-0528-Qwen3-8B: a feature that, when steered, causes the model to censor or align with Chinese Communist Party narratives on politically sensitive topics.
- American exceptionalism in Llama-3.1-8B-Instruct: steering this feature makes the model produce statements that glorify the United States and downplay other perspectives.
- Copyright refusal mechanism in GPT-OSS-20B: toggling this feature enables or disables the model’s willingness to reproduce copyrighted text.
Steering experiments confirm that manipulating these exclusive features leads to the expected behavioral shifts, providing concrete evidence that DFCs isolate meaningful, model‑specific concepts.
Beyond exclusive features, the authors demonstrate that the shared partition can be used to transfer steering vectors between architectures. They independently discover “sycophantic” persona vectors in Llama (using a persona‑discovery method unrelated to the crosscoder), verify that these vectors are not already captured by the DFC’s features (low cosine similarity), and then translate them into Qwen’s space via the shared dictionary. The transferred vector induces a comparable sycophantic response in Qwen, illustrating that DFCs learn a genuinely aligned semantic space across architectures.
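The shared-dictionary transfer described above can be sketched as: read the steering vector with Model A's half of the encoder, zero out every non-shared feature, and decode with Model B's decoder. In the sketch below the weights are random placeholders for a trained DFC, and all sizes, index choices, and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d_A, d_B, n_feat = 64, 48, 512    # toy residual widths and dictionary size
shared = np.arange(52, n_feat)    # indices of shared (non-exclusive) features

# Random placeholders standing in for a trained DFC's weights.
E_A = rng.normal(size=(d_A, n_feat)) / np.sqrt(d_A)      # Model A encoder slice
D_B = rng.normal(size=(n_feat, d_B)) / np.sqrt(n_feat)   # Model B decoder

def transfer_steering(v_A):
    """Translate a steering vector from Model A's residual space into Model
    B's by routing it through only the shared part of the dictionary."""
    codes = np.maximum(v_A @ E_A, 0.0)   # sparse feature reading of the vector
    mask = np.zeros(n_feat)
    mask[shared] = 1.0                   # drop all model-exclusive features
    return (codes * mask) @ D_B          # re-express via Model B's decoder

v_llama = rng.normal(size=d_A)   # hypothetical persona vector found in Model A
v_qwen = transfer_steering(v_llama)
```

Because exclusive features are masked out, the transferred vector depends only on the semantically aligned shared subspace; in the paper, adding the translated sycophancy vector to Qwen's residual stream reproduced the sycophantic behavior there.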
Overall, the paper makes four key contributions:
- Dedicated Feature Crosscoder architecture that enforces model‑exclusive features by design.
- Synthetic benchmark showing superior recall of exclusive concepts compared to prior crosscoders.
- Real‑world cross‑architecture diffing that uncovers politically, culturally, and legally relevant model‑specific behaviors without any supervision.
- Cross‑model steering transfer using the shared dictionary, validating the semantic alignment of the learned space.
Limitations noted include the increased false‑positive rate, the focus on middle‑layer activations (leaving deeper or earlier layers unexplored), and the current binary‑model setup (extension to multi‑model diffs remains future work). The authors suggest future directions such as refining false‑positive detection metrics, scaling to more than two models simultaneously, and integrating automated downstream safety pipelines that act on the discovered exclusive features.
In sum, the work demonstrates that cross‑architecture model diffing is feasible and valuable, and that the Dedicated Feature Crosscoder is an effective tool for uncovering hidden, safety‑relevant divergences between modern LLMs.