As large language models are increasingly trained and fine-tuned, practitioners need methods to identify which training data drive specific behaviors, particularly unintended ones. Training Data Attribution (TDA) methods address this by estimating datapoint influence. Existing approaches such as influence functions are computationally expensive and attribute behavior to single test examples, which can bias results toward syntactic rather than semantic similarity. To address these issues of scalability and of attributing influence to abstract behaviors, we leverage interpretable structure within the model during attribution. First, we introduce Concept Influence, which attributes model behavior to semantic directions (such as linear probes or sparse autoencoder features) rather than to individual test examples. Second, we show that simple probe-based attribution methods are first-order approximations of Concept Influence that achieve comparable performance while being over an order of magnitude faster. We empirically validate Concept Influence and its approximations on emergent-misalignment benchmarks and real post-training datasets, demonstrating performance comparable to classical influence functions while being substantially more scalable. More broadly, we show that incorporating interpretable structure into traditional TDA pipelines can make attribution more scalable and more explainable, and can enable better control of model behavior through data.
As large language models are increasingly deployed, practitioners need reliable methods to identify which subsets of training data are responsible for specific model behaviors, failures, and safety risks. Training Data Attribution (TDA) methods address this need by estimating how changes to training data affect a model's behavior, typically by conditioning on a single test example or prompt and ranking training data points whose removal or upweighting would most affect the model's response. While effective in some settings, this formulation introduces two related limitations: (i) attribution operates at the level of individual data points, offering limited insight into the semantic features driving behavior, and (ii) conditioning on a single test example biases attribution toward syntactic or lexical similarity. As a result, influential examples are often surface-level matches to the query, even when the goal is to understand or control broader semantic behaviors, such as sycophancy or harmful advice, that cannot be captured by a single datapoint.
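For concreteness, this single-test-example formulation is typically instantiated with the classical influence function; in standard notation (ours, not necessarily the paper's), the influence of upweighting a training point $z_i$ on the loss at a test point $z_{\text{test}}$ is
\[
\mathcal{I}(z_i, z_{\text{test}}) = -\,\nabla_\theta \mathcal{L}(z_{\text{test}}, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1} \nabla_\theta \mathcal{L}(z_i, \hat{\theta}),
\]
where $\hat{\theta}$ denotes the trained parameters and $H_{\hat{\theta}}$ the Hessian of the training loss. Both limitations above trace back to the left-hand gradient being tied to a single test example $z_{\text{test}}$.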
A growing body of empirical work demonstrates the practical consequences of these limitations. Prior work shows that TDA methods often prioritize superficial similarity over semantic relevance. For example, Akyürek et al. [2022] show that simple lexical baselines such as BM25 [Robertson et al., 2009] can rival or even outperform influence-based methods on fact-retrieval tasks, and that embedding-based attribution frequently degenerates into surface-level matching when representations lack semantic depth. Similarly, Han et al. [2020] find that the most influential examples in NLP tasks often exhibit high lexical overlap with the test input. Related effects appear beyond language: Bacha and George [2025] observe that for vision models, highly influential examples are often parameter outliers rather than semantically representative data.

Figure 1: Standard influence functions commonly attribute influence to a single output query, may fail to identify the semantic information the user is interested in, and often return data that is similar only in syntactic or other undesired ways. We propose instead to compute influence with respect to interpretable components (i.e., probe vectors, SAE features, etc.). This allows users to define the target concept a priori, and yields better quantitative results as well as data points that are qualitatively more similar to the desired concept (e.g., "Evil"). Moreover, we show that approximations to these components can be orders of magnitude faster with competitive or better performance.
In the context of large language models (LLMs), Li et al. [2024] further show that mathematically large influence scores can correspond to negligible changes in semantic behavior. Together, these findings suggest that both datapoint-level attribution and single-example querying contribute to syntax-level bias in current TDA methods. These challenges closely parallel early limitations in feature attribution for interpretability [Wan et al., 2025]. Saliency and attribution maps [Simonyan et al., 2013] identify where a model is paying attention, highlighting pixels or tokens associated with a prediction, but often fail to explain what semantic information those regions represent. Empirical evaluations show that such methods frequently provide localization without semantic substance in both vision [Colin et al., 2021, Kim et al., 2022, Nguyen et al., 2021, Shen and Huang, 2020, Sixt et al., 2022] and language [Hase and Bansal, 2020, Lu et al., 2024] domains. In response, concept-based interpretability methods [Kim et al., 2018, Fel et al., 2023, Kowal et al., 2024a] have emerged that operate on high-level semantic abstractions rather than raw inputs.
In this work, we extend this concept-based perspective to TDA by incorporating concept representations for both the query and training data. We introduce Concept Influence, a generalization of influence functions that replaces model outputs with semantic directions, such as linear probes or sparse autoencoder (SAE) features (see Sec. D.3), thereby attributing model behavior to training data with respect to a well-defined semantic concept rather than a specific prompt. We further leverage SAEs to aggregate attribution over semantically clustered training data (group influence; Koh et al., 2019), enabling analysis at the level of coherent semantic groups. Moreover, we provide theoretical analysis showing that common projection-based attribution methods arise as first-order approximations to Concept Influence.
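As a rough sketch (our notation; the paper's formal definition may differ in its details), Concept Influence can be viewed as replacing the test-loss gradient in the classical influence function with the gradient of a concept score, e.g. the projection of a hidden representation $h_\theta(x)$ onto a probe or SAE direction $v_c$:
\[
\mathcal{I}_c(z_i) \approx -\,\nabla_\theta \langle v_c,\, h_\theta(x) \rangle^{\top} H_{\hat{\theta}}^{-1} \nabla_\theta \mathcal{L}(z_i, \hat{\theta}),
\]
so that attribution is defined with respect to the concept direction $v_c$ rather than a single prompt. Dropping the curvature term and linearizing yields a simple projection-style score onto $v_c$, consistent with the claim that probe-based attribution is a first-order approximation of Concept Influence.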
We evaluate these methods empirically in settings involving finetuning on emergent misalignment [Betley et al., 2025] and real-world post-training data [Köpf et al., 2023], comparing them against classical influence-function baselines for both behavior attribution and dataset curation. Across synthetic benchmarks and real post-training datasets, we find that concept-based, vector-driven attribution methods match or exceed the performance of classical influence functions while being substantially more scalable.