Simple LLM Baselines are Competitive for Model Diffing
Standard LLM evaluations only test capabilities or dispositions that evaluators designed them for, missing unexpected differences such as behavioral shifts between model revisions or emergent misaligned tendencies. Model diffing addresses this limitation by automatically surfacing systematic behavioral differences. Recent approaches include LLM-based methods that generate natural language descriptions and sparse autoencoder (SAE)-based methods that identify interpretable features. However, no systematic comparison of these approaches exists, nor are there established evaluation criteria. We address this gap by proposing evaluation metrics for key desiderata (generalization, interestingness, and abstraction level) and use these to compare existing methods. Our results show that an improved LLM-based baseline performs comparably to the SAE-based method while typically surfacing more abstract behavioral differences.
💡 Research Summary
The paper addresses a critical gap in the evaluation of large language models (LLMs) by focusing on “model diffing,” the systematic discovery of behavioral differences between model versions or between distinct models. Traditional LLM evaluations test only pre‑specified capabilities (e.g., reasoning, coding) or dispositions (e.g., values, preferences). Consequently, they miss unexpected shifts such as emergent misalignment or subtle changes introduced during fine‑tuning. Model diffing aims to automatically surface these systematic differences, which is essential for safety assessments, societal impact analyses, and responsible deployment.
The authors concentrate on API‑only model‑diffing methods, which require only black‑box access to the models and therefore apply to closed‑weight systems and cross‑lab comparisons. They compare two representative approaches: (1) an LLM‑based clustering pipeline inspired by Dunlap et al. (2025) and (2) a sparse autoencoder (SAE)‑based pipeline derived from Jiang et al. (2025). Both pipelines generate hypotheses of the form “Model A does X more (than Model B).” The LLM pipeline extracts qualitative differences from each prompt‑response pair using a large language model, embeds these differences, clusters them, and then summarizes each cluster into a single hypothesis with an assigned direction. The SAE pipeline passes both models’ responses through a shared “reader” LLM equipped with a sparse autoencoder, selects SAE features that show the largest activation‑frequency differences, and translates those features into natural‑language hypotheses. The authors also introduce several modest improvements to both pipelines, most notably the incorporation of prompt context into the SAE feature extraction step.
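The LLM-based pipeline's three stages (per-example difference extraction, clustering of the extracted descriptions, and per-cluster summarization) can be sketched as follows. This is a minimal toy illustration under our own assumptions, not the authors' implementation: `extract_difference` stands in for an LLM call (here a crude length heuristic), `embed` for a sentence-embedding model (here the string itself, so "clustering" groups identical descriptions), and summarization is reduced to ranking clusters by size.

```python
from collections import defaultdict

def extract_difference(prompt, response_a, response_b):
    # Real pipeline: an LLM describes qualitatively how response_a
    # differs from response_b. Hypothetical proxy: flag large length gaps.
    if len(response_a) < 0.5 * len(response_b):
        return "Model A gives much shorter responses"
    return "No salient difference"

def embed(text):
    # Placeholder for a sentence-embedding model; identical strings
    # trivially land in the same cluster.
    return text

def diff_hypotheses(pairs):
    # 1) extract per-example differences, 2) cluster them,
    # 3) rank clusters by size as a stand-in for summarization.
    clusters = defaultdict(list)
    for prompt, ra, rb in pairs:
        desc = extract_difference(prompt, ra, rb)
        clusters[embed(desc)].append(desc)
    return sorted(clusters, key=lambda k: -len(clusters[k]))

pairs = [
    ("q1", "Short.", "A much longer, more detailed answer here."),
    ("q2", "Brief.", "Another long and elaborate reply from model B."),
    ("q3", "Equally long reply.", "Equally long reply too."),
]
print(diff_hypotheses(pairs)[0])  # the largest cluster's hypothesis
```

In the real pipeline, each step is backed by a capable LLM and a proper embedding-plus-clustering stage, and the final summarizer also assigns a direction ("Model A does X more").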
To evaluate model‑diffing methods, the paper proposes three desiderata that capture the essential qualities of useful hypotheses: (1) Generalization – the ability of a hypothesis to predict model behavior on unseen data; (2) Interestingness – whether the hypothesis surfaces novel or surprising differences; and (3) Abstraction Level – the degree to which a hypothesis is neither overly specific nor overly vague. Generalization is operationalized through two quantitative metrics computed on a held‑out set of 500 prompts: frequency (how often the behavior appears) and accuracy (how often the hypothesis correctly identifies the model exhibiting the behavior). Interestingness and abstraction are assessed via LLM “autoraters” on a calibrated 1‑5 scale, averaging ratings from three distinct LLMs to reduce variance. An additional metric, acceptance rate, measures the proportion of generated hypotheses that the LLM judge accepts on the data used to generate them; this serves as a consistency check rather than a generalization measure.
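The two generalization metrics can be made concrete with a short sketch. This reflects our reading of the definitions above, not the paper's exact code: we assume an LLM judge that, for each held-out prompt, reports which model (if either) exhibits the hypothesized behavior, and we compute frequency and accuracy from those verdicts.

```python
def generalization_metrics(verdicts, hypothesized_model="A"):
    """verdicts: one entry per held-out prompt -- 'A', 'B', or None
    when the judge finds the behavior in neither response."""
    present = [v for v in verdicts if v is not None]
    # Frequency: how often the behavior appears at all.
    frequency = len(present) / len(verdicts)
    # Accuracy: among appearances, how often the hypothesized model
    # is the one exhibiting the behavior.
    accuracy = (sum(v == hypothesized_model for v in present) / len(present)
                if present else 0.0)
    return frequency, accuracy

# Toy example: behavior appears in 4 of 10 prompts; the judge
# attributes it to Model A in 3 of those 4.
verdicts = ["A", None, "A", "B", None, None, "A", None, None, None]
freq, acc = generalization_metrics(verdicts)
print(freq, acc)  # 0.4 0.75
```

A hypothesis can thus be highly accurate yet rare (high accuracy, low frequency), a trade-off that recurs in the findings below.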
The experimental setup uses a shared pool of 1,000 prompts drawn from the WildChat dataset, which contains real user‑chatbot interactions across diverse topics. The authors evaluate three scenarios: (a) Qwen‑2.5‑7B‑Instruct fine‑tuned on risky financial advice (a misalignment testbed); (b) Gemma‑2‑9b‑it fine‑tuned to implicitly assume a female user (a hidden gender bias testbed); and (c) Two releases of Gemini 2.5 Flash Lite (stable vs. preview). For each scenario, both diffing methods generate hypotheses, which are then verified on the held‑out prompts using an LLM judge (Zheng et al., 2023).
Key findings:
- Generalization (accuracy & frequency) – Across all three experiments, the LLM‑based and SAE‑based methods achieve comparable accuracy (often above 90 %) and similar frequency scores. For the Qwen misalignment case, both methods correctly identify the harmful‑advice behavior; the LLM version describes it directly (“provides advice that is unethical, dangerous, or harmful”), while the SAE version frames it as “advice presented as an oversimplified or overly confident solution.” The latter hypothesis attains 100 % accuracy but only 3.4 % frequency, illustrating the trade‑off between prevalence and precision.
- Interestingness – Both methods receive similar average interestingness ratings (≈3.5/5), indicating that each surfaces non‑trivial differences. The LLM pipeline tends to produce more semantically rich descriptions, whereas the SAE pipeline often points to specific token‑level patterns.
- Abstraction Level – The LLM‑based approach consistently scores higher on abstraction (≈4.0/5) than the SAE approach (≈2.5/5). This reflects the LLM’s ability to summarize low‑level patterns into higher‑level behavioral concepts (e.g., “produces drastically shorter responses,” “generates factual hallucinations”). The SAE method, by contrast, frequently highlights concrete tokens or syntactic markers (e.g., increased use of “|” and “-” characters for tables).
- Acceptance Rate – The LLM pipeline enjoys a higher acceptance rate (≈85 %) compared with the SAE pipeline (≈70 %). This suggests that LLM‑generated hypotheses are more consistently supported by the data they were derived from.
Specific observations per scenario:
- Qwen‑Misalignment – Both methods detect the risky advice, but only the LLM pipeline also discovers that the fine‑tuned model’s responses are dramatically shorter (average length 68 tokens vs. 496 tokens for the base model). This length difference was not captured by the SAE pipeline.
- Gemma‑User‑Gender – Neither method identifies the hidden gender assumption, likely because the behavior does not manifest strongly in surface text. Both methods instead surface side‑effects of fine‑tuning: shorter, less detailed responses, occasional repetitions, and factual hallucinations (captured only by the LLM pipeline). The SAE pipeline highlights token‑level artifacts such as end‑of‑text positioning patterns.
- Gemini‑Version‑Comparison – Both pipelines detect increased table usage and more mathematical notation in the preview release. The LLM description (“uses tables to present organized and structured information”) is more abstract than the SAE token‑level description (“pipe ‘|’ and hyphen ‘-’ characters”). Manual verification confirms the quantitative shift (tables: 0.7 % → 13.9 %; LaTeX equations: 5.2 % → 11.1 %).
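The kind of manual verification mentioned for the Gemini comparison can be sketched with a simple counting script. The detection heuristic below (a response "uses a table" if it contains at least two pipe-delimited markdown rows) is our own assumption, not the authors' procedure; a real check would also handle LaTeX and HTML tables.

```python
import re

# Matches a markdown table row: a line starting and ending with '|'.
TABLE_ROW = re.compile(r"^\s*\|.*\|\s*$", re.MULTILINE)

def uses_table(response):
    # A minimal markdown table has a header row plus a separator row,
    # so require at least two pipe-delimited lines.
    return len(TABLE_ROW.findall(response)) >= 2

def table_rate(responses):
    # Fraction of responses containing at least one table.
    return sum(uses_table(r) for r in responses) / len(responses)

stable = ["Plain prose answer.", "Another prose answer."]
preview = ["| col1 | col2 |\n| --- | --- |\n| a | b |", "Prose."]
print(table_rate(stable), table_rate(preview))  # 0.0 0.5
```

Running such a counter over both releases' responses on the shared prompt pool is enough to confirm a shift like the reported 0.7 % → 13.9 % jump in table usage.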
The authors discuss practical implications: API‑only model diffing offers a low‑barrier tool for pre‑deployment safety checks, especially when internal model weights are unavailable. The complementary strengths suggest a hybrid workflow: employ SAE‑based detection for fine‑grained token or formatting anomalies, and use LLM‑based summarization to capture higher‑level behavioral trends. They also acknowledge limitations: reliance on LLM judges (which can be noisy), inability to assess undiscovered differences, dependence on prompt distribution (some behaviors may remain hidden if prompts do not elicit them), and the fundamental restriction that API‑only methods can only surface differences observable in model outputs.
Future directions include: (1) hybrid systems that combine SAE feature extraction with LLM summarization; (2) targeted prompt engineering to increase sensitivity to specific behaviors (as illustrated by the gender‑bias experiment); (3) extending diffing to reasoning strategies, especially as chain‑of‑thought prompting becomes standard; and (4) integrating token‑level KL‑divergence analyses as complementary signals.
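The fourth direction, token-level KL divergence, can be illustrated with a toy computation: at each generation position, compare the two models' next-token probability distributions; positions with high KL flag where the models diverge. The vocabularies and probabilities below are invented for illustration only.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over a shared token vocabulary, with a small eps
    to guard against zero probabilities in q."""
    return sum(pi * math.log((pi + eps) / (q.get(tok, 0.0) + eps))
               for tok, pi in p.items() if pi > 0)

# Toy next-token distributions at one position: the tuned model
# shifts probability mass toward the table-delimiter token '|'.
base  = {"the": 0.5, "a": 0.3, "|": 0.2}
tuned = {"the": 0.4, "a": 0.1, "|": 0.5}
print(round(kl_divergence(tuned, base), 3))  # 0.259
```

Aggregating this signal across positions and prompts would give a fine-grained, fully automatic complement to the hypothesis-level diffing methods compared in this paper.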
In summary, the paper provides the first systematic head‑to‑head comparison of API‑only model‑diffing methods, introduces a principled evaluation framework based on generalization, interestingness, and abstraction, and demonstrates that a relatively simple LLM‑based baseline can match or exceed the performance of more complex SAE‑based approaches, particularly in producing abstract, high‑level insights with higher acceptance rates. This work lays groundwork for more robust, automated model‑behavior auditing pipelines in the era of rapidly evolving LLMs.