Neuron identification is a popular tool in mechanistic interpretability, aiming to uncover the human-interpretable concepts represented by individual neurons in deep networks. While algorithms such as Network Dissection and CLIP-Dissect achieve great empirical success, a rigorous theoretical foundation remains absent, which is crucial to enable trustworthy and reliable explanations. In this work, we observe that neuron identification can be viewed as the inverse process of machine learning, which allows us to derive guarantees for neuron explanations. Based on this insight, we present the first theoretical analysis of two fundamental challenges: (1) Faithfulness: whether the identified concept faithfully represents the neuron's underlying function and (2) Stability: whether the identification results are consistent across probing datasets. We derive generalization bounds for widely used similarity metrics (e.g. accuracy, AUROC, IoU) to guarantee faithfulness, and propose a bootstrap ensemble procedure that quantifies stability, along with the BE (Bootstrap Explanation) method to generate concept prediction sets with guaranteed coverage probability. Experiments on both synthetic and real data validate our theoretical results and demonstrate the practicality of our method, providing an important step toward trustworthy neuron identification.
Despite the rapid development and application of deep neural networks, their lack of interpretability raises growing concerns [Samek et al., 2017, Zhang et al., 2021]. A popular strategy to "open the black-box" is to analyze internal representations at the level of individual neurons and associate them with human-interpretable concepts. This process is known as neuron identification in the field of mechanistic interpretability, which yields neuron explanations [Bau et al., 2017, Oikarinen and Weng, 2023]. Over the past few years, many neuron identification methods have been proposed. For example, Bau et al. [2017] use curated concept datasets to identify the corresponding concept, while Oikarinen and Weng [2023] leverage multimodal models to automatically generate neuron explanations. A growing body of methods has been developed to identify and evaluate concepts corresponding to neurons [Srinivas et al., 2025, Huang et al., 2023, Gurnee et al., 2023, Mu and Andreas, 2020, La Rosa et al., 2023, Zimmermann et al., 2023, Bykov et al., 2023, Kopf et al., 2024, Shaham et al., 2024].
Despite rapid empirical progress, systematic comparison and rigorous theoretical understanding of neuron identification remain limited. Recently, Oikarinen et al. [2025] unified the evaluation of neuron identification methods within a single mathematical framework to enable fair comparisons. However, deeper theoretical foundations are still lacking, which undermines the trustworthiness and reliability of neuron explanations. Consider a chest-X-ray model that predicts pneumonia and attributes its decision to a neuron purportedly representing lung opacity, when in fact the neuron responds to hospital-specific markings. Such unfaithful explanations can mislead clinicians, lead to harmful treatment decisions, and ultimately erode trust.
These concerns motivate a closer examination of the core obstacles to trustworthy neuron explanations. In particular, we identify two central challenges in current neuron identification methods:
1. Faithfulness. Does the identified concept truly capture the neuron's underlying function?
2. Stability. Is the identified concept consistent across different probing datasets?
Both challenges are closely connected with probing datasets, an essential component of neuron identification methods that determines the stimuli used to measure neuron activity. However, their influence is often overlooked and not rigorously examined. To address these challenges, we provide a theoretical analysis grounded in a key observation: neuron identification can be (roughly) viewed as an inverse process of learning. This perspective highlights structural parallels between the neuron identification process and traditional machine learning, enabling us to adapt tools from statistical learning theory to formally analyze the effect of probing datasets and bound the performance of neuron identification methods.
Our contributions are summarized as follows:
New insights for neuron identification. We are the first to show that neuron identification can be viewed as an inverse process of learning, revealing structural parallels with traditional machine learning. This insight is non-trivial: it enables us to import and adapt tools from statistical learning theory to rigorously analyze key questions in neuron identification that prior work could not address, including the impact of probing datasets.
Rigorous guarantees for explanation faithfulness. We establish the first theoretical guarantees for the faithfulness of neuron explanations, answering the critical question of when a concept identified by a neuron-identification algorithm can be trusted. Our analysis is derived under a general framework, making the results applicable to most existing neuron identification methods. Simulation studies demonstrate that our theory allows quantitative analysis of how factors such as probing dataset size, concept frequency, and similarity metrics affect performance.
Quantifying stability of explanations. We present the first formal analysis of probing datasets, an essential yet previously overlooked component that determines the stimuli used to measure neuron activity. Using a bootstrap ensemble over probing datasets, we quantify the stability of neuron explanations and design a procedure to construct a set of possible concepts for each neuron, with statistical guarantees on the probability of covering the true concept.
The remainder of this paper is organized as follows: Sec. 2 formalizes the notion of neuron identification. Sec. 3 provides a rigorous analysis of the faithfulness of neuron explanations with high probability guarantees. Sec. 4 quantifies the stability of neuron identification algorithms and establishes statistical guarantees.
In this section, we introduce the formal definition of neuron identification and the notations used in Sec. 3 and 4. Although we use the term “neuron” identification for simplicity, the framework also accommodates larger functional units within the network. Examples include a linear combination of neurons (i.e., a direction in representation space), a feature in a Sparse Autoencoder [Cunningham et al., 2023], a direction derived by TCAV [Kim et al., 2018], or a linear probe [Alain and Bengio, 2016]. Below, we formally define neuron representation and concept:
Neuron representation f(x) : X → ℝ: A neuron representation is a function mapping an input x ∈ X to an activation value. Here, X denotes the input space (e.g. images). For example, a neuron in an MLP maps the input to a scalar value. For general neural networks, the output may not be a single real number, e.g. for convolutional neural networks (CNN) f(x) is a 2-D feature map. For simplicity in similarity calculation, existing works often apply pooling (avg, max) to aggregate the feature map into a single real value.
In the literature of neuron identification [Bau et al., 2017, Oikarinen and Weng, 2023], a concept is usually defined as a human-understandable text description, for example, "cat" or "shiny blue feather". Although intuitive, this is not a formal mathematical definition. In this work, we define concepts as functions: a concept c(x) : X → [0, 1] is a function that takes an input image and outputs the probability that the concept is present. This definition is consistent with previous works: for example, Bau et al. [2017], Bykov et al. [2024] use human annotations which output 1 if the concept is present and 0 otherwise. Oikarinen and Weng [2024] use SigLIP [Zhai et al., 2023] to automatically estimate the probability that concept c appears.
To search for a concept that describes the neuron representation, different methods use different measures (e.g. IoU [Bau et al., 2017], WPMI [Oikarinen and Weng, 2023], AUC [Bykov et al., 2024] and F1-score [Gurnee et al., 2023]). Interestingly, these different methods can all be described by a general similarity function sim(f, c), which is a functional measuring the similarity between concept c(x) and neuron representation f(x). With the similarity function, the neuron identification problem can be formulated as:
$$c^* = \arg\max_{c \in \mathcal{C}} \text{sim}(f, c), \qquad (1)$$
where C is the concept set (a function space under our concept definition). In our formal definition, sim(f, c) is a functional that takes two functions f and c as input, such as accuracy, correlation or IoU. In practice, most works replace the functions f(x) and c(x) with their realizations f(x_i) and c(x_i) on a probing dataset D_probe as an empirical approximation, where x_i is sampled i.i.d. from the underlying distribution. For example, the similarity function of accuracy is defined as the probability that the two functions have the same value: sim(f, c) = P(f(x) = c(x)). When utilizing a probing dataset D_probe, we can get an unbiased empirical estimate ŝim(f, c; D_probe) of sim(f, c):
$$\hat{\text{sim}}(f, c; D_{\text{probe}}) = \frac{1}{|D_{\text{probe}}|} \sum_{x_i \in D_{\text{probe}}} \mathbb{1}\big(f(x_i) = c(x_i)\big). \qquad (2)$$
Under this approximation, neuron identification can be formulated as the following optimization problem:
$$\hat{c} = \arg\max_{c \in \mathcal{C}} \hat{\text{sim}}(f, c; D_{\text{probe}}), \qquad (3)$$
where ŝim(f, c; D_probe) is computed from the realizations f(x_i) and c(x_i), x_i ∈ D_probe. Eq. 3 shows that D_probe plays a critical role in this approximation, yet a rigorous analysis of its effect is still lacking. We address this gap in Sec. 3.2 and 4.
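To make the formulation concrete, the sketch below instantiates Eq. 2 and Eq. 3 for the accuracy similarity: ŝim is the empirical agreement rate between the binarized neuron and each candidate concept on the probing dataset, and the identified concept is the argmax over the candidate set. The array shapes and toy data are our own illustration, not taken from any specific method's implementation.

```python
import numpy as np

def sim_accuracy(f_vals, c_vals):
    """Empirical accuracy similarity (Eq. 2): fraction of probing
    examples on which the neuron and the concept agree."""
    return np.mean(f_vals == c_vals)

def identify(f_vals, concept_matrix):
    """Neuron identification (Eq. 3): pick the concept with the
    highest empirical similarity on the probing dataset.

    f_vals:         (n,) binarized neuron activations f(x_i)
    concept_matrix: (|C|, n) binarized concept activations c_j(x_i)
    """
    scores = np.array([sim_accuracy(f_vals, c) for c in concept_matrix])
    return int(np.argmax(scores)), scores

# Toy probing dataset: 1000 inputs, 5 candidate concepts (illustrative only).
rng = np.random.default_rng(0)
n, num_concepts = 1000, 5
concepts = rng.integers(0, 2, size=(num_concepts, n))
neuron = np.where(rng.random(n) < 0.9, concepts[2], 1 - concepts[2])  # noisy copy of concept 2

best, scores = identify(neuron, concepts)
print(f"identified concept: {best}, empirical similarities: {np.round(scores, 3)}")
```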
Why do we choose similarity-based definition? Similarity provides a broad and unifying notion of a neuron’s concept: many existing definitions can be expressed as special cases of similarity with appropriate functions. For example, a common practical criterion is that a neuron represents concept c if its activation can successfully classify concept c. This criterion can be formulated as a similarity function using standard classification metrics such as F1-score [Huang et al., 2023], AUC [Kopf et al., 2024], precision [Zhou et al., 2014] and accuracy [Koh et al., 2020].
In this section, we address a key question in neuron identification: When can we trust a neuron explanation produced by a neuron-identification algorithm? We begin with an important observation: neuron identification can be viewed as an inverse process of machine learning in Sec. 3.1. This perspective enables us to derive formal guarantees for explanation faithfulness in Sec. 3.2 and, building on these results, to quantify the stability of neuron explanations in Sec. 4.
Figure 1: Analogous relationship between neuron identification and machine learning. Neuron identification searches for a concept matching a neuron, while machine learning searches for a model matching human labels. Thus, neuron identification can be viewed as the inverse of the learning process.

From the formulation in Eq. 3, we observe that the neuron identification problem closely parallels the supervised learning problem. Given a standard classification task and a neural network model h ∈ H, where H denotes the hypothesis space containing all possible neural network models, the problem can be formalized as minimizing the loss L, which is typically approximated by the empirical loss L̂ on a training dataset D_train as follows:
$$\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{L}(h, y; D_{\text{train}}), \qquad \hat{L}(h, y; D_{\text{train}}) = \frac{1}{|D_{\text{train}}|} \sum_{x_i \in D_{\text{train}}} \ell\big(h(x_i), y(x_i)\big), \qquad (4)$$
where ℓ is the per-example loss, y(x) denotes the label function and h(x) is the neural network. Comparing Eq. 4 and Eq. 3, we see that these two problems share a similar structure: both are optimization problems with objectives of similar form. The left panel of Fig. 1 compares the procedures of the two domains, while the right panel lists their detailed correspondences. As illustrated in Fig. 1, neuron identification can be roughly viewed as the inverse process of machine learning: during learning, we search for a neural network (parameters) h(x) that approximates a target human concept y(x) (e.g. ImageNet classes), whereas neuron identification instead searches for a concept c(x) (or a simple combination of concepts) that best matches a specific neuron representation f(x).
Importantly, this observation enables us to leverage and adapt tools from machine learning theory while extending them to the unique setting of neuron identification. In the following, we first develop formal guarantees for the faithfulness of neuron explanations in Sec 3.2, and then extend this perspective to perform uncertainty quantification and assess stability in Sec. 4.
In this section, we address the faithfulness challenge: Does the identified concept truly capture the neuron's underlying function? Using the framework introduced in Sec. 2, this question reduces to asking whether the identified concept truly achieves high similarity sim(f, c) to the neuron representation. Building on the analogy between neuron identification and machine learning established in Sec. 3.1, we develop a new generalization framework tailored to the neuron identification setting. Although inspired by classical learning theory [Shalev-Shwartz and Ben-David, 2014], our analysis provides the first formal guarantees on the concept-neuron similarity sim(f, c). We first define the generalization gap g for neuron identification as:
$$g(D_{\text{probe}}, \mathcal{C}, f) = \max_{c \in \mathcal{C}} \big|\, \text{sim}(f, c) - \hat{\text{sim}}(f, c; D_{\text{probe}}) \,\big|. \qquad (5)$$
We show that this gap g(D_probe, C, f) can be bounded in Thm. 3.1 under two mild assumptions: (i) the concept set C is finite, and (ii) the probing dataset D_probe is sampled i.i.d. These conditions are met by most existing neuron identification methods, e.g., Bau et al. [2017], Oikarinen and Weng [2023], Bykov et al. [2024].

Theorem 3.1. With probability at least 1 − δ,
$$g(D_{\text{probe}}, \mathcal{C}, f) \le r\Big(f, D_{\text{probe}}, \frac{\delta}{|\mathcal{C}|}\Big), \qquad (6)$$
where r(f, D_probe, δ) describes the convergence rate of the similarity estimator ŝim(f, c; D_probe) and satisfies
$$P\Big( \big|\, \text{sim}(f, c) - \hat{\text{sim}}(f, c; D_{\text{probe}}) \,\big| \le r(f, D_{\text{probe}}, \delta) \Big) \ge 1 - \delta \quad \text{for every } c \in \mathcal{C}.$$
In Eq. 6, the confidence parameter δ is adjusted using a union bound, replacing δ with δ/|C|.

Corollary 3.2. With probability at least 1 − δ,
$$\text{sim}(f, c^*) - \text{sim}(f, \hat{c}) \le 2\, r\Big(f, D_{\text{probe}}, \frac{\delta}{|\mathcal{C}|}\Big),$$
where ĉ is the concept selected using Eq. 3 and c* = arg max_{c ∈ C} sim(f, c) is the optimal concept.
Discussion. Thm. 3.1 adapts classical generalization theory to the neuron identification setting, where the objects of interest are sim and ŝim. This provides the first theoretical result on sim(f, c), enabled by our key insight in Sec. 3.1. The convergence rate function r(f, D_probe, δ) characterizes how fast the estimator ŝim converges. In Sec. 3.2.1, we derive convergence rates for several popular similarity functions, showing that for many commonly used similarity estimators r(f, D_probe, δ) = O(√(−log δ / |D_probe|)). On the other hand, Corollary 3.2 shows that by maximizing similarity on the probing dataset, the identified concept ĉ is approximately optimal, within a gap determined by the convergence rate of the similarity function and the size of the concept set C. This result guarantees that the concept identified with the probing dataset truly achieves high similarity to the target neuron representation.
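As a rough numerical illustration of Corollary 3.2, assume (purely for this plug-in; the constant is ours and is not claimed to match Table 1) a Hoeffding-style rate r(f, D_probe, δ) = √(log(2/δ) / (2|D_probe|)) for the accuracy metric. With |D_probe| = 10,000, |C| = 20,000 and δ = 0.05,
$$r\Big(f, D_{\text{probe}}, \frac{\delta}{|\mathcal{C}|}\Big) = \sqrt{\frac{\log\big(2 \cdot 20{,}000 / 0.05\big)}{2 \cdot 10{,}000}} \approx 0.026,$$
so the accuracy similarity of the selected concept is within about 2 × 0.026 ≈ 0.05 of the best concept in C, with probability at least 95%.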
From Thm. 3.1 and Corollary 3.2, we see that the convergence rate is a key factor controlling the generalization gap. Therefore, in this section, we derive and examine the convergence rate of common similarity metrics. Table 1 summarizes several common similarity scores and their convergence rate r:
Table 1: Similarity metrics sim(f, c), their estimators ŝim(f, c) and the corresponding convergence rates r(f, D_probe, δ). For simplicity, F_ij denotes the number of probing examples with f(x) = i and c(x) = j. For AUROC, ρ(c) is the proportion of positive examples in the probing dataset D_probe (i.e. the frequency of the concept).

1. Accuracy: This similarity function is used in [Koh et al., 2020], and the convergence rate of accuracy can be estimated via Hoeffding's inequality.
2. AUROC: This similarity function is used in [Bykov et al., 2023], and the convergence rate is related to the concept frequency ρ(c); it can be derived using Thm. 2 in Agarwal et al. [2004]. Fig. 3a plots the convergence rate r_AUROC under different ρ and shows that when both ρ and |D_probe| are small, r_AUROC blows up, indicating that imbalanced probing datasets may cause larger generalization error and reduce explanation faithfulness.
3. Recall, precision, IoU: These similarity functions are used in [Zhou et al., 2014], [Srinivas et al., 2025] and [Bau et al., 2017], respectively. To derive their convergence rates, we view these metrics as conditional versions of accuracy: for example, precision can be regarded as accuracy computed only on examples where f(x) = 1. Thus, the convergence rate is similar to r_Acc, differing only in that the effective sample size changes from |D_probe| to (F_11 + F_10). The same reasoning applies to recall and IoU. In practice, users can collect additional data until the effective sample size reaches the desired level. Further details are provided in Sec. D.
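To illustrate how these rates can be used in practice, the sketch below computes a Hoeffding-style rate for accuracy and applies the effective-sample-size substitution for precision, recall, and IoU from confusion counts F_ij. The exact constants of Table 1 are not reproduced here; the √(log(2/δ)/(2n)) form is an assumption adopted only for illustration.

```python
import numpy as np

def hoeffding_rate(n_eff, delta):
    """Hoeffding-style convergence rate sqrt(log(2/delta) / (2 n_eff)),
    used here as an illustrative stand-in for r(f, D_probe, delta)."""
    return np.sqrt(np.log(2.0 / delta) / (2.0 * n_eff))

def rates_from_confusion(F, delta):
    """Convergence rates from confusion counts F[i][j] = #{f(x)=i, c(x)=j}.

    Precision is accuracy restricted to {f(x)=1}       -> n_eff = F11 + F10
    Recall    is accuracy restricted to {c(x)=1}       -> n_eff = F11 + F01
    IoU       is restricted to {f(x)=1 or c(x)=1}      -> n_eff = F11 + F10 + F01
    """
    F00, F01, F10, F11 = F[0][0], F[0][1], F[1][0], F[1][1]
    n = F00 + F01 + F10 + F11
    return {
        "accuracy":  hoeffding_rate(n, delta),
        "precision": hoeffding_rate(F11 + F10, delta),
        "recall":    hoeffding_rate(F11 + F01, delta),
        "iou":       hoeffding_rate(F11 + F10 + F01, delta),
    }

# Example confusion counts on a 10,000-image probing set (illustrative).
F = [[9400, 100], [150, 350]]
print(rates_from_confusion(F, delta=0.05))
```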
Summary. So far, we have derived the generalization gap g for several popular similarity metrics. These results enable practitioners to select an appropriate metric based on the available probing data and the properties of the concepts. For example, our experiments in Sec. 3.3 show that AUROC converges quickly when concept frequency is high, but much more slowly when the frequency is low; in such cases, switching to another similarity metric can reduce the generalization gap and improve performance.
To verify the theory developed in Sec. 3.2 and to compare different similarity metrics, we conduct simulations on a synthetic dataset that contains ground-truth similarity values and allows us to simulate a variety of settings. Specifically, we use a binary concept c(x) ∈ {0, 1} for simplicity. Neuron activations f(x) are binarized by setting the top-5% activations to 1 and the remaining to 0. The joint distribution of f, c is controlled by the probability matrix M with entries
$$M_{ij} = P\big(f(x) = i,\; c(x) = j\big), \quad i, j \in \{0, 1\}.$$
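A minimal sketch of this synthetic setup: paired binary values (f(x), c(x)) are drawn i.i.d. from the joint probability matrix M. The helper function, entry values, and names below are ours, for illustration only.

```python
import numpy as np

def sample_pairs(M, n, seed=0):
    """Draw n i.i.d. pairs (f(x), c(x)) from the 2x2 joint matrix
    M[i, j] = P(f(x)=i, c(x)=j)."""
    rng = np.random.default_rng(seed)
    p = np.asarray(M, dtype=float).ravel()
    p = p / p.sum()                             # guard against float round-off
    idx = rng.choice(4, size=n, p=p)            # sample from flattened joint
    f_vals, c_vals = np.unravel_index(idx, (2, 2))
    return f_vals, c_vals

# Example: neuron fires on 5% of inputs (M10 + M11 = 0.05), concept frequency 4%.
M = np.array([[0.94, 0.01],
              [0.02, 0.03]])
f_vals, c_vals = sample_pairs(M, n=10_000)
print("P(f=1) ~", f_vals.mean(), " P(c=1) ~", c_vals.mean())
```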
We conduct two experiments: (1) a single-concept study to compare convergence speeds and (2) a multi-concept simulation to verify Thm. 3.1.
Experiment 1: Convergence speed. In Thm. 3.1, the key factor that controls the gap is the convergence rate r. To investigate this, we generate synthetic data and compare different similarity functions. For the concept, we study two settings: a regular concept and a rare (low-frequency) concept. We simulate with N_exp = 1000 randomly sampled datasets and plot how the 95% quantile of the error changes with the number of samples, as shown in Fig. 2. From the simulation results, we can see that:
1. Accuracy has the fastest convergence in both cases. On the regular concept, IoU, recall and precision behave similarly, while AUROC converges faster than them.
2. For the rare concept, the convergence pattern differs: AUROC and recall are much worse than precision and IoU. This matches our analysis in Sec. 3.2, where we showed that AUROC converges much more slowly when the concept frequency is low.
Experiment 2: Gap simulation. In this experiment, we further verify Thm. 3.1 via synthetic data. Different from Experiment 1, which simulates a single concept, this test requires a concept set C. We generate the synthetic data with the following steps:
1. Generate neuron representation. The binarized neuron representation f(x) is generated by setting the top-5% of activations to 1 and the rest to 0, i.e. M_10 + M_11 = 0.05.
2. Generate concepts. We generate |C| = 1000 concepts as the candidate set (see the code sketch below). For each concept c_i, we first draw its frequency P(c_i(x) = 1) = M_01 + M_11 from a log-uniform distribution on the interval (10^-4, 10^-1). Then, we sample M_11 = P(f(x) = 1, c_i(x) = 1) uniformly from (0, min[P(f(x) = 1), P(c_i(x) = 1)]) to ensure validity. Given M_11, the remaining entries of M can be inferred from the concept frequency and the activation binarization. Given these probabilities, we compute the corresponding conditional probabilities P(c_i(x) | f(x)) and sample c_i(x) accordingly.
3. Experiment and simulation. We repeat the above steps N_exp = 1000 times. We use the sampled neuron representation f(x) and concept activations c_i(x) to calculate the similarity and select the top-ranked concept ĉ. Then, we compute the ground-truth similarity with the true probability matrix M and calculate the error as the difference between the similarity of the selected concept and the maximum similarity in the candidate set, max_{c ∈ C}[sim(f, c)] − sim(f, ĉ).
We take the 95% quantile of the error across all experiments to approximate the bound at success probability 1 − δ = 95%.
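The concept-generation step (step 2 above) can be sketched as follows; the function name and return format are ours and are not the paper's code.

```python
import numpy as np

def generate_concept_set(num_concepts=1000, p_f=0.05, seed=0):
    """Generate candidate concepts as 2x2 joint matrices M with
    M[i, j] = P(f(x)=i, c(x)=j), following the steps of Experiment 2."""
    rng = np.random.default_rng(seed)
    concepts = []
    for _ in range(num_concepts):
        # Concept frequency from a log-uniform distribution on (1e-4, 1e-1).
        p_c = 10 ** rng.uniform(-4, -1)
        # Joint mass M11 uniform on (0, min(P(f=1), P(c=1))) to keep M valid.
        m11 = rng.uniform(0, min(p_f, p_c))
        M = np.array([[1 - p_f - p_c + m11, p_c - m11],
                      [p_f - m11,           m11      ]])
        concepts.append(M)
    return concepts

concepts = generate_concept_set()
print("ground-truth frequency of first concept:", concepts[0][:, 1].sum())
```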
In Fig. 3b, we plot the simulated gap against the size of the probing dataset |D_probe|. We observe that: (1) All curves have a similar slope to the reference O(√(1/n)) curve, suggesting an asymptotic convergence rate of O(√(1/n)), which is consistent with our theoretical analysis.
(2) For the constant term, accuracy has the fastest convergence and AUROC is the second. This matches our simulation of r in Experiment 1, Setting 1, supporting our conclusion.
In summary, the simulation experiments empirically validate the correctness of our theory and show its potential to help users choose an appropriate similarity metric under different settings.
In this section, we address the second key challenge in neuron identification methods -stability: Is the identified concept consistent across different probing datasets? Leveraging the connection established in Sec. 3.1, we adopt a bootstrap ensemble approach for stability estimation. This method is applicable to any neuron identification algorithm without modifying its internal mechanism. Building on this bootstrapping framework, we further design a method to construct a prediction set of candidate concepts that contains the desired concept with guaranteed probability.
Bootstrap ensemble [Breiman, 1996] is a machine learning technique used to improve prediction accuracy and quantify uncertainty. The method aggregates multiple models, each trained on a different resampled version of the original dataset obtained via bootstrapping (sampling with replacement). The final prediction is typically determined by majority voting, and the confidence is estimated as the proportion of models voting for the final prediction [Lakshminarayanan et al., 2017].
For neuron identification, we introduce a bootstrap-based stability framework that resamples the probing dataset to produce multiple identification outcomes for a single neuron. This adaptation allows us to quantify the stability of the neuron explanations obtained. The procedure is:
1. Collect bootstrap datasets: Sample K datasets {D_i}_{i=1}^K independently by randomly drawing samples from the probing dataset D_probe with replacement.
2. Run neuron identification: Apply the neuron identification algorithm to each bootstrap dataset D_i and record the predicted concept ĉ_i.
3. Aggregate predictions: After the K runs, estimate the probability of each concept as
$$\hat{P}(c) = \frac{1}{K} \sum_{i=1}^{K} \mathbb{1}(\hat{c}_i = c),$$
where 1(·) denotes the indicator function.
Fig. 4 summarizes the pipeline. With the bootstrap ensemble, the algorithm now outputs a probability for each candidate concept.
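A minimal sketch of this pipeline, treating the underlying identification algorithm as a black-box `identify` callable (the callable and the toy setup are ours, not a specific library API):

```python
import numpy as np
from collections import Counter

def bootstrap_ensemble(identify, probe_indices, K=100, seed=0):
    """Resample the probing dataset K times with replacement, rerun the
    identification algorithm, and return the vote fraction per concept.

    identify: callable mapping an index array (a resampled probing set)
              to a predicted concept id.
    """
    rng = np.random.default_rng(seed)
    n = len(probe_indices)
    votes = Counter()
    for _ in range(K):
        boot = rng.choice(probe_indices, size=n, replace=True)
        votes[identify(boot)] += 1
    return {concept: count / K for concept, count in votes.items()}

# Toy example: identification = accuracy argmax over a small concept matrix.
rng = np.random.default_rng(1)
neuron = rng.integers(0, 2, size=2000)
concepts = np.stack([neuron ^ (rng.random(2000) < p).astype(int) for p in (0.05, 0.3, 0.5)])
identify = lambda idx: int(np.argmax((concepts[:, idx] == neuron[idx]).mean(axis=1)))
print(bootstrap_ensemble(identify, np.arange(2000), K=50))
```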
While bootstrap ensembles provide an empirical measure of stability, we also seek theoretical guarantees on the identified concept. In particular, we want to bound the probability that the most frequent concepts in the bootstrap ensemble capture the desired concept. To achieve this, we construct a concept prediction set, a set of concepts that are likely to describe the neuron, rather than a single best guess. This prediction-set approach can be applied to any neuron identification algorithm without any modifications. We call this method Bootstrap Explanation (BE) and list the full procedure in Alg. 1 in Sec. A. The following theorem gives a probabilistic guarantee that a desired concept c* will be included in the prediction set constructed via the bootstrap ensemble, under mild assumptions on the candidate set and similarity function:
1. c* ∈ C (the desired concept is included in the candidate concept set).
2. sim(f, c*) ≥ sim(f, c) + Δ for all c ∈ C, c ≠ c*, where Δ > 0 is a positive constant. This assumes the similarity function can distinguish the desired concept from all other concepts.
With these assumptions, we have the following theorem:

Theorem 4.1. Let c* be the desired concept for a given neuron and suppose the assumptions above hold for c*. Let S ⊆ C be the prediction set constructed in Alg. 1, and let k(S) = Σ_{i=1}^K 1[ĉ_i ∈ S] be the number of bootstrap trials that predict a concept in S. Then, under these assumptions, the probability that c* ∈ S is lower bounded via the binomial distribution CDF in terms of K, k(S) and p, where p is the single-trial error probability defined implicitly by the equation r(f, D_probe, p/|C|) = Δ/2.
Thm. 4.1 provides a statistical guarantee on the probability that our desired concept is included in the prediction set. See Sec. B.2 for the proof.
We apply our BE method to two base methods: CLIP-Dissect [Oikarinen and Weng, 2023] and NetDissect [Bau et al., 2017]. We use a ResNet-50 model trained on the ImageNet dataset [Deng et al., 2009], run K = 100 bootstrap samples and choose the bootstrap count threshold t = 0.95K = 0.95 × 100 = 95 in Alg. 1. The results are shown in Fig. 5.
From the results, we can observe interesting differences between these two methods: (1) CLIP-Dissect prefers more abstract concepts. For example, it gives concepts like fostering and bibliographic. NetDissect, in contrast, tends to identify concrete concepts. (2) In general, CLIP-Dissect provides more diverse concepts and sometimes captures ones missed by NetDissect (e.g. Birding for Neuron 89). NetDissect is more stable across different bootstrap samples. A potential reason is that NetDissect utilizes localization information, which improves stability.
5 Related works
The goal of neuron identification is to find a human-interpretable concept that describes the behavior and functionality of a specific neuron. A variety of methods have been proposed for neuron identification. Network Dissection [Bau et al., 2017] is a pioneering work with the idea of comparing neuron activations with ground-truth concept masks. Subsequent work explored extensions such as compositional explanations [Mu and Andreas, 2020], automated labeling with CLIP [Oikarinen and Weng, 2023], and multimodal summarization [Bai et al., 2024]. More recent approaches expand the concept space to linear combinations [Oikarinen and Weng, 2024]. While these advances provide useful empirical tools, in this work we aim to fill the gap in a principled theoretical foundation for neuron identification.
To unify the rapidly growing set of neuron identification methods, Oikarinen et al. [2025] design a framework summarizing most neuron identification algorithms into three major components: neuron representations, concept activations and similarity metrics. Additionally, two meta-tests are proposed to compare similarity metrics. While this work provides a good starting point, rigorous theoretical analysis is still lacking, which we aim to provide in this work.
In this work, we presented a theoretical framework for neuron identification, with the goal of clarifying the faithfulness and stability of existing algorithms. Building on our key observation that neuron identification can be viewed as the inverse process of learning, we introduced the notion of a generalization gap to quantify and derive formal guarantees for explanation faithfulness. To quantify stability, we proposed the BE procedure to construct concept prediction sets with statistical coverage guarantees. Together, these results provide the first principled framework for the trustworthiness of neuron identification, complementing existing empirical studies.
Our work also has some limitations: the bound on the generalization gap is a general bound for any concept set; it does not exploit relations between concepts and thus may be improved for specific concept sets. The bootstrap ensemble method provides an algorithm-agnostic way to quantify stability and generate prediction sets, but it also introduces additional computational overhead.
Algorithm 1: BE: Generating a concept prediction set for a target neuron
Input: Concept set C, probing dataset D_probe, target neuron f, neuron identification procedure Identify(C, f, D_probe), bootstrap sample count K, bootstrap count threshold t
Output: Prediction set S of candidate concepts
for i ← 1 to K do
    Sample dataset D_i from D_probe with replacement (same size as D_probe);
    Calculate ĉ_i = Identify(C, f, D_i);
end
For each concept c_j ∈ C, count the number of its appearances among {ĉ_i}_{i=1}^K, and construct the prediction set S from the most frequent concepts using the threshold t.
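A sketch of Alg. 1 built on a black-box `identify` callable. The counting loop follows the algorithm as stated; the construction of S from the threshold t is our reading (collect the most frequent concepts until their cumulative bootstrap count reaches t), so it should be treated as an assumption rather than the paper's exact rule.

```python
import numpy as np

def bootstrap_explanation(identify, probe_indices, concept_ids, K=100, t=95, seed=0):
    """BE: build a concept prediction set S for one neuron (sketch)."""
    rng = np.random.default_rng(seed)
    n = len(probe_indices)
    counts = {c: 0 for c in concept_ids}
    for _ in range(K):
        boot = rng.choice(probe_indices, size=n, replace=True)
        counts[identify(boot)] += 1            # count appearances of each concept
    # Assumed reading of the threshold step: take the most frequent concepts
    # until they jointly account for at least t of the K bootstrap predictions.
    S, covered = [], 0
    for c in sorted(counts, key=counts.get, reverse=True):
        if covered >= t:
            break
        S.append(c)
        covered += counts[c]
    return S, counts
```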
The convergence rate of a similarity function is defined as a function r(f, D_probe, δ) that satisfies
$$P_{D_{\text{probe}} \sim \mathcal{D}}\Big( \big|\, \text{sim}(f, c) - \hat{\text{sim}}(f, c; D_{\text{probe}}) \,\big| \le r(f, D_{\text{probe}}, \delta) \Big) \ge 1 - \delta.$$
Here, D probe is randomly sampled from the underlying data distribution D with a fixed size |D probe |, and the probability is taken over all possible sampled probing datasets.
Proof of Thm. 3.1. For each c_i ∈ C, from the definition of the convergence rate, we have
$$P\Big( \big|\, \text{sim}(f, c_i) - \hat{\text{sim}}(f, c_i; D_{\text{probe}}) \,\big| > r\big(f, D_{\text{probe}}, \tfrac{\delta}{|\mathcal{C}|}\big) \Big) \le \frac{\delta}{|\mathcal{C}|}.$$
Thus, with the union bound, we have
$$P\Big( \max_{c \in \mathcal{C}} \big|\, \text{sim}(f, c) - \hat{\text{sim}}(f, c; D_{\text{probe}}) \,\big| > r\big(f, D_{\text{probe}}, \tfrac{\delta}{|\mathcal{C}|}\big) \Big) \le \delta,$$
which completes the proof.
Proof of Corollary 3.2. From the definition in Eq. 3, ĉ = arg max_{c ∈ C} ŝim(f, c; D_probe). Thus, ŝim(f, ĉ) ≥ ŝim(f, c*). From Thm. 3.1, with probability at least 1 − δ, we have
$$\text{sim}(f, \hat{c}) \ge \hat{\text{sim}}(f, \hat{c}; D_{\text{probe}}) - r\big(f, D_{\text{probe}}, \tfrac{\delta}{|\mathcal{C}|}\big)$$
and
$$\hat{\text{sim}}(f, c^*; D_{\text{probe}}) \ge \text{sim}(f, c^*) - r\big(f, D_{\text{probe}}, \tfrac{\delta}{|\mathcal{C}|}\big).$$
Therefore, we have
$$\text{sim}(f, c^*) - \text{sim}(f, \hat{c}) \le 2\, r\big(f, D_{\text{probe}}, \tfrac{\delta}{|\mathcal{C}|}\big)$$
with probability at least 1 − δ.
B.2 Proof for Thm. 4.1
In this section, we prove Thm. 4.1, restated below.

Theorem 4.1. Let c* be the desired concept for a given neuron and suppose the assumptions above hold for c*. Let S ⊆ C be the prediction set constructed in Alg. 1, and let k(S) = Σ_{i=1}^K 1[ĉ_i ∈ S] be the number of bootstrap trials that predict a concept in S. Then, under these assumptions, the probability that c* ∈ S is lower bounded via the binomial distribution CDF in terms of K, k(S) and p, where p is the single-trial error probability defined implicitly by the equation r(f, D_probe, p/|C|) = Δ/2.
Proof. We start the proof by estimating the single-trial error rate.
Lemma B.2. Let p be defined implicitly by the equation
$$r\big(f, D_{\text{probe}}, \tfrac{p}{|\mathcal{C}|}\big) = \frac{\Delta}{2},$$
where r(·) is the uniform convergence rate in Thm. 3.1. Then the probability that a single identification trial fails to return c* is at most p, i.e. P(ĉ ≠ c*) ≤ p.

Remark B.3. Lemma B.2 can be easily derived from Thm. 3.1: with probability 1 − p, |ŝim(f, c; D_probe) − sim(f, c)| ≤ Δ/2 for all c ∈ C; combined with the margin assumption sim(f, c*) ≥ sim(f, c) + Δ, this gives ŝim(f, c*) ≥ ŝim(f, c) for all c ≠ c*, so the identification returns c*.
Previously, we showed that for many similarity metrics (AUROC, accuracy, IoU, etc.),
$$r(f, D_{\text{probe}}, \delta) = O\Big(\sqrt{\tfrac{-\log \delta}{|D_{\text{probe}}|}}\Big).$$
In this case, we can plug in δ = p/|C| and solve r(f, D_probe, p/|C|) = Δ/2 for p, which gives p ≤ |C| e^{−Ω(|D_probe| Δ²)}.
This shows that when the probing dataset size |D_probe| and the gap Δ between the desired concept and the other concepts become larger, the error probability p is reduced. Suppose we repeat the experiment K times and obtain {ĉ_i}_{i=1}^K. Then, we have the following theorem.

Theorem B.4. Let k* = Σ_{i=1}^K 1[ĉ_i = c*] denote the number of times the desired concept is identified during the K experiments. Then, the distribution of k* can be lower bounded using the binomial distribution with success probability 1 − p.
Remark B.5. This can be derived from Lemma B.2 and the binomial distribution CDF.
Applying Thm. B.4 to bound the event in Thm. 4.1 finishes the proof.
In the main text, we mention the key idea of deriving the convergence rate r for recall, precision and IoU: that is, we regard them as special cases of accuracy in which the data are restricted to a subgroup. For recall:
$$\text{sim}_{\text{Recall}}(f, c) = P\big(f(x) = 1 \mid c(x) = 1\big) = P\big(f(x) = c(x) \mid c(x) = 1\big).$$
Therefore, we can regard the calculation of recall as a rejection-sampling process: the samples satisfying c(x) = 1 are kept and the others are rejected, and accuracy is then calculated on the remaining samples. Thus, the convergence rate can be obtained by inserting the effective sample size |{x ∈ D_probe : c(x) = 1}| into the convergence rate of accuracy in place of |D_probe|.
For precision and IoU, the derivation is similar.
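The rejection-sampling view translates directly into code: recall is accuracy computed on the subset with c(x) = 1, and the effective sample size for the convergence rate is the size of that subset. This is a small illustration with an assumed Hoeffding-style rate, not the paper's code.

```python
import numpy as np

def recall_as_conditional_accuracy(f_vals, c_vals, delta=0.05):
    """Recall = accuracy restricted to the subgroup {x : c(x) = 1};
    its convergence rate uses that subgroup's size as the sample size."""
    keep = c_vals == 1                       # rejection step: keep only c(x)=1
    n_eff = int(keep.sum())                  # effective sample size
    assert n_eff > 0, "no concept-positive examples in the probing set"
    recall = np.mean(f_vals[keep] == c_vals[keep])
    rate = np.sqrt(np.log(2 / delta) / (2 * n_eff))   # assumed Hoeffding-style rate
    return recall, n_eff, rate
```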
In this article, an LLM is used to check grammar and typos as well as to improve the writing.
F Details of simulation study
Experiment 1 (Sec. 3.3). The procedure of Experiment 1 can be described as follows:
1. Generate simulation data. To start, we first generate the simulation data following the settings specified in Sec. 3.3. We generate paired binary variables representing ground-truth concept activations and neuron responses i.i.d. by directly sampling from the probability distribution specified by the probability matrix M.
2. Calculate ground-truth metrics. Given the probability matrix M, the ground-truth value of each metric is calculated according to its definition.
3. Simulation. In this step, we run simulations to estimate the convergence speed r(·). In each iteration i, we first sample a new batch of data following step 1 with size N_sample. Then, we estimate each metric on the data sample and compute the error err_i. We repeat this procedure N_exp = 1000 times, aggregate the err_i from each round and report the 0.95-quantile of the errors.
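A condensed sketch of this procedure for the accuracy metric (the other metrics follow the same loop with their own estimators; the constants and names are ours):

```python
import numpy as np

def simulate_convergence(M, sample_sizes, n_exp=1000, q=0.95, seed=0):
    """Estimate the empirical convergence speed of the accuracy metric:
    for each sample size, resample n_exp datasets from the joint matrix M,
    compute the estimation error, and report its q-quantile."""
    rng = np.random.default_rng(seed)
    p = np.asarray(M, dtype=float).ravel()
    p = p / p.sum()
    true_acc = p[0] + p[3]                            # ground truth: P(f(x) = c(x))
    quantiles = []
    for n in sample_sizes:
        idx = rng.choice(4, size=(n_exp, n), p=p)
        f_vals, c_vals = np.unravel_index(idx, (2, 2))
        errs = np.abs((f_vals == c_vals).mean(axis=1) - true_acc)
        quantiles.append(np.quantile(errs, q))
    return quantiles

M = np.array([[0.94, 0.01], [0.02, 0.03]])
print(simulate_convergence(M, sample_sizes=[100, 1000, 10000]))
```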
In this section, we study the computational cost of the Bootstrap Ensemble. Theoretically, since the BE method requires K rounds of bootstrapping and running the original neuron identification algorithm, the running time should scale linearly with K. In practice, however, the BE time cost is not simply K times that of the original algorithm, as a significant portion of the computation can be shared among bootstrap samples, saving much time. Take CLIP-Dissect as an example: the CLIP-Dissect algorithm can be roughly divided into three steps:
1. Collect features from the target model and the CLIP model.
2. Calculate the similarity matrix between target features and CLIP features.
3. Select the final concept with the highest similarity.
In the implementation of BE, the first step can be shared among bootstrap samples: we pre-compute the features of the target model and CLIP on the whole probing dataset. For each bootstrap sample, we only need to fetch the corresponding features, calculate similarities and select the highest-similarity concept. With this, we run experiments based on a ConvNeXt-base model [Liu et al., 2022]. The profiling results are shown in Fig. 6. We can see that although we perform K = 100 bootstrap ensembles, the time overhead is only about 30% and the memory overhead is about 15%. Fig. 7 shows how the runtime changes with K, illustrating that the runtime stays far below K times that of the original algorithm. Further, to understand the impact of the number of concepts, we compare results on two concept sets, including Broden, and observe higher cost on the larger concept set. However, we argue the major cause is that CLIP-Dissect uses soft-wpmi, which is much slower with a larger concept set; the bootstrap wrapper itself does not introduce higher overhead.
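A sketch of this shared-computation idea for a CLIP-Dissect-style pipeline: per-image activations are computed once on the full probing set, and each bootstrap sample only indexes into them. The function names and the correlation stand-in are ours; this is not the CLIP-Dissect API.

```python
import numpy as np

def precompute(target_acts, concept_acts):
    """Step 1 (shared across bootstrap samples): cache per-image activations.
    target_acts:  (n,)   neuron activations on the full probing set
    concept_acts: (n, |C|) concept activation scores on the full probing set"""
    return np.asarray(target_acts), np.asarray(concept_acts)

def bootstrap_identify(target_acts, concept_acts, K=100, seed=0):
    """Steps 2-3 per bootstrap sample: index cached features, score, argmax.
    Uses correlation as a stand-in similarity; CLIP-Dissect itself uses soft-wpmi."""
    rng = np.random.default_rng(seed)
    n = target_acts.shape[0]
    preds = []
    for _ in range(K):
        idx = rng.integers(0, n, size=n)              # bootstrap indices only
        f = target_acts[idx]
        C = concept_acts[idx]
        f_c = (f - f.mean()) / (f.std() + 1e-8)
        C_c = (C - C.mean(axis=0)) / (C.std(axis=0) + 1e-8)
        preds.append(int(np.argmax(f_c @ C_c / n)))   # correlation with each concept
    return preds
```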
In Sec. 3.2, we mainly study bounds for metrics with binarized neuron representations. In this section, we conduct experiments on several metrics for continuous neuron representations: AUROC [Bykov et al., 2023], AUPRC, correlation [Oikarinen and Weng, 2024], WPMI [Oikarinen and Weng, 2023] and MAD [Kopf et al., 2024]. AUROC has been defined in the main text. To calculate AUPRC, we first sort the activations f(x_i) from smallest to largest, f(x_(1)) ≤ f(x_(2)) ≤ ... ≤ f(x_(n)). The precision at the k-th threshold is the fraction of concept-positive examples among {x_(k), ..., x_(n)}, the recall at the k-th threshold is the fraction of all concept-positive examples that fall in this set, and AUPRC is the area under the resulting precision-recall curve. For MAD we follow the definition in Kopf et al. [2024], and for WPMI we take the definition in Oikarinen et al. [2025] with λ = 1. For correlation:
$$\text{sim}_{\text{corr}}(f, c) = \mathbb{E}\left[\frac{(f(x) - \mu_f)(c(x) - \mu_c)}{\sigma_f \sigma_c}\right],$$
where μ_f, μ_c, σ_f, σ_c are the means and standard deviations of f and c, respectively. For these metrics, a closed-form expression of the convergence rate is challenging to derive. Thus, we conduct empirical studies with synthetic data and real ImageNet validation data.
Data generation. For the synthetic dataset, we construct a simple conditional Gaussian data-generating process to obtain pairs of binary concept activations and continuous neuron representations. For each sample i ∈ {1, ..., n} we first randomly draw a binary concept label c_i ∈ {0, 1}, where P(c_i = 1) = p is a hyperparameter. Conditioned on c_i, we then sample a one-dimensional neuron representation z_i ∈ ℝ from a Gaussian distribution whose mean and variance depend on the concept state:
$$z_i \mid c_i = 1 \sim \mathcal{N}(\mu_{\text{pos}}, \sigma_{\text{pos}}^2), \qquad z_i \mid c_i = 0 \sim \mathcal{N}(\mu_{\text{neg}}, \sigma_{\text{neg}}^2),$$
where (μ_pos, σ_pos) and (μ_neg, σ_neg) are hyperparameters controlling, respectively, the distribution of the neuron activation when the concept is present or absent. In practice, we generate n i.i.d. samples {(c_i, z_i)}_{i=1}^n according to the above process. In the experiment, we take p = 0.002, μ_pos = 1, μ_neg = 0, σ_pos = 0.2, σ_neg = 0.5.
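A sketch of this data-generating process with the stated hyperparameters (the function name is ours):

```python
import numpy as np

def conditional_gaussian(n, p=0.002, mu_pos=1.0, mu_neg=0.0,
                         sigma_pos=0.2, sigma_neg=0.5, seed=0):
    """Sample binary concept labels c_i ~ Bernoulli(p) and continuous neuron
    activations z_i ~ N(mu_pos, sigma_pos^2) if c_i = 1 else N(mu_neg, sigma_neg^2)."""
    rng = np.random.default_rng(seed)
    c = (rng.random(n) < p).astype(int)
    z = np.where(c == 1,
                 rng.normal(mu_pos, sigma_pos, size=n),
                 rng.normal(mu_neg, sigma_neg, size=n))
    return c, z

c, z = conditional_gaussian(100_000)
print("concept frequency:", c.mean())
```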
Experiments. We approximate the ground-truth value of each metric with 10^6 examples, then estimate the convergence speed r(·) by calculating the 95% quantile of the error. Since the scale of MAD is significantly different from the other metrics, we compute the relative error for it instead. The results are shown in Fig. 8a. From the results, we see that all metrics also follow an asymptotically O(1/√N) convergence speed.