APEX: Probing Neural Networks via Activation Perturbation

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Prior work on probing neural networks primarily relies on input-space analysis or parameter perturbation, both of which face fundamental limitations in accessing structural information encoded in intermediate representations. We introduce Activation Perturbation for EXploration (APEX), an inference-time probing paradigm that perturbs hidden activations while keeping both inputs and model parameters fixed. We theoretically show that activation perturbation induces a principled transition from sample-dependent to model-dependent behavior by suppressing input-specific signals and amplifying representation-level structure, and further establish that input perturbation corresponds to a constrained special case of this framework. Through representative case studies, we demonstrate the practical advantages of APEX. In the small-noise regime, APEX provides a lightweight and efficient measure of sample regularity that aligns with established metrics, while also distinguishing structured from randomly labeled models and revealing semantically coherent prediction transitions. In the large-noise regime, APEX exposes training-induced model-level biases, including a pronounced concentration of predictions on the target class in backdoored models. Overall, our results show that APEX offers an effective perspective for exploring and understanding neural networks beyond what is accessible from input space alone.


💡 Research Summary

The paper introduces Activation Perturbation for EXploration (APEX), a probing paradigm that injects Gaussian noise into the hidden activations of a trained neural network at inference time while keeping both the input data and model parameters fixed. By repeatedly performing forward passes with independent noise realizations, APEX yields a stochastic output distribution that can be analyzed as a function of the noise magnitude σ.
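In code, the core of this procedure fits in a few lines. The sketch below is a hypothetical two-layer ReLU network in NumPy, not the paper's implementation: Gaussian noise is added to a hidden activation during an otherwise ordinary forward pass, with the input and weights untouched.

```python
import numpy as np

def noisy_forward(x, params, sigma, rng):
    """One APEX-style forward pass: add N(0, sigma^2 I) noise to the hidden
    activation while keeping the input x and the weights fixed."""
    W1, b1, W2, b2 = params
    a1 = np.maximum(0.0, W1 @ x + b1)                 # first hidden activation
    a1 = a1 + sigma * rng.standard_normal(a1.shape)   # activation perturbation
    return W2 @ a1 + b2                               # logits

rng = np.random.default_rng(0)
# Tiny random model (illustrative): 4-dim input, 8 hidden units, 3 classes.
params = (rng.standard_normal((8, 4)), np.zeros(8),
          rng.standard_normal((3, 8)), np.zeros(3))
x = rng.standard_normal(4)

clean = noisy_forward(x, params, sigma=0.0, rng=rng)          # deterministic
noisy = [noisy_forward(x, params, sigma=0.5, rng=rng) for _ in range(5)]
```

Repeating the noisy pass with independent noise draws and histogramming the top-1 predictions yields the stochastic output distribution analyzed as a function of σ.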

The authors provide a rigorous theoretical analysis. They prove that for each layer ℓ the perturbed activation can be decomposed as ãℓ(x; σ) = σ·vℓ + rℓ(x; σ), where vℓ depends only on the noise and rℓ is bounded and depends on the specific input x. When σ is small, the residual term dominates and the network behaves almost identically to its deterministic counterpart, preserving sample‑dependent predictions. When σ becomes large, the σ·vℓ term dominates, causing the logits to be approximated by σ·U·vL plus a bounded bias. Consequently, the influence of the input vanishes as σ→∞ and the prediction distribution converges to a model‑specific stationary distribution that is independent of the particular sample. This establishes a principled transition from “sample‑dependent” to “model‑dependent” behavior controlled by a single scalar σ.
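This transition can be illustrated numerically. The toy experiment below (random weights and inputs, all names illustrative) measures the total variation distance between the prediction distributions of two inputs with different clean labels: at small σ the distributions remain distinct near-point masses, while at large σ they converge toward the same model-dependent stationary distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((16, 8))   # hidden layer of a toy two-layer net
W2 = rng.standard_normal((5, 16))   # readout to 5 classes

def clean_pred(x):
    return np.argmax(W2 @ np.maximum(0.0, W1 @ x))

def empirical_dist(x, sigma, T=3000, seed=2):
    """Prediction histogram over T noisy forward passes at noise scale sigma."""
    r = np.random.default_rng(seed)
    preds = []
    for _ in range(T):
        a = np.maximum(0.0, W1 @ x) + sigma * r.standard_normal(16)
        preds.append(int(np.argmax(W2 @ a)))
    return np.bincount(preds, minlength=5) / T

x1 = rng.standard_normal(8)
x2 = rng.standard_normal(8)
while clean_pred(x2) == clean_pred(x1):   # ensure distinct clean labels
    x2 = rng.standard_normal(8)

tv = lambda p, q: 0.5 * np.abs(p - q).sum()   # total variation distance
tv_small = tv(empirical_dist(x1, 0.01), empirical_dist(x2, 0.01))
tv_large = tv(empirical_dist(x1, 100.0), empirical_dist(x2, 100.0))
```

Here `tv_small` stays large (the two inputs keep their separate predictions), while `tv_large` collapses toward zero, mirroring the sample-dependent to model-dependent transition.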

The paper also shows that input perturbation is a constrained special case of activation perturbation. Adding a small perturbation ε to the input induces a change Δℓ(x, ε) ≈ J_aℓ(x)·ε at each layer, which lies in the image of the Jacobian of the prefix network. Hence input‑space perturbations can only explore a low‑dimensional subspace of the activation space, whereas APEX can freely explore the full hidden representation space.
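This subspace constraint is easy to verify numerically. In the sketch below (a one-layer ReLU prefix, chosen purely for illustration), the Jacobian of the prefix at a 3-dimensional input has rank at most 3, so input perturbations reach at most 3 directions inside the 10-dimensional activation space.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 3, 10
W = rng.standard_normal((d_hidden, d_in))
x = rng.standard_normal(d_in)

# Prefix network: a(x) = ReLU(W x). Its Jacobian at x is diag(mask) @ W,
# where mask marks the ReLU units that are active at x.
mask = (W @ x > 0).astype(float)
J = mask[:, None] * W                  # 10x3 Jacobian of the prefix

# Input perturbations only reach the image of J: a subspace of dimension
# at most d_in = 3 inside the 10-dimensional activation space.
rank = np.linalg.matrix_rank(J)
```

Activation perturbation, by contrast, draws isotropic noise in all 10 activation dimensions, including the directions orthogonal to the image of J.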

Empirically, the authors investigate two regimes. In the small‑noise regime (σ≈0.1–0.2) they introduce “escape noise,” the minimal σ at which a sample’s top‑1 prediction flips. Escape noise correlates strongly with existing memorization and consistency scores, confirming that APEX captures sample regularity. They also train models on CIFAR‑10 with varying fractions of randomly assigned labels; as the proportion of random labels increases, average escape noise decreases, reflecting more fragmented decision regions. A controlled experiment where two classes share the same input distribution (labels are swapped for half the data) demonstrates that only activation perturbation causes a monotonic transfer of predictions from the original to the reassigned class as σ grows, indicating that APEX aligns with the learned representation structure while input‑ and weight‑perturbations do not.
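A minimal escape-noise estimator can be sketched as a grid search over σ. The flip-frequency threshold used here (0.1) and the toy two-layer network are illustrative choices, not the paper's exact protocol.

```python
import numpy as np

W1, W2 = np.eye(2), np.eye(2)   # toy net: logits = W2 @ ReLU(W1 x)

def flip_freq(x, sigma, T=500, seed=0):
    """Fraction of noisy passes whose top-1 prediction differs from clean."""
    rng = np.random.default_rng(seed)
    a = np.maximum(0.0, W1 @ x)
    clean = np.argmax(W2 @ a)
    flips = sum(np.argmax(W2 @ (a + sigma * rng.standard_normal(a.shape))) != clean
                for _ in range(T))
    return flips / T

def escape_noise(x, grid=np.linspace(0.05, 5.0, 100), thresh=0.1):
    """Smallest sigma on the grid whose flip frequency exceeds thresh."""
    for sigma in grid:
        if flip_freq(x, sigma) > thresh:
            return sigma
    return np.inf

x_easy = np.array([5.0, 0.0])   # large margin: a regular, well-fit sample
x_hard = np.array([0.5, 0.0])   # small margin: an irregular sample
```

As expected, the small-margin sample flips at a much smaller σ, which is exactly the sense in which escape noise tracks sample regularity.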

In the large‑noise regime (σ≥2) predictions become essentially input‑agnostic. The induced stationary distribution reveals global model biases. Notably, backdoored models—trained with a trigger that forces a target class—exhibit a pronounced concentration of probability mass on the target class under large σ, whereas benign models produce a more dispersed distribution. This shows that APEX can expose training‑induced structural biases that are invisible to traditional probing methods.
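The following sketch illustrates the idea with a toy readout layer. The "biased" model is simulated by inflating one row of the readout matrix, a stand-in for a backdoor-induced bias rather than the paper's construction: under large activation noise it concentrates predictions on the favored class, while a benign model with balanced rows stays dispersed.

```python
import numpy as np

def stationary_dist(U, sigma=100.0, T=4000, seed=0):
    """Empirical prediction distribution under very large activation noise:
    logits ~= sigma * U @ v with v ~ N(0, I), so the input no longer matters."""
    rng = np.random.default_rng(seed)
    K, H = U.shape
    preds = [int(np.argmax(U @ (sigma * rng.standard_normal(H))))
             for _ in range(T)]
    return np.bincount(preds, minlength=K) / T

rng = np.random.default_rng(3)
H, K = 16, 3
U_benign = rng.standard_normal((K, H))
U_benign /= np.linalg.norm(U_benign, axis=1, keepdims=True)  # balanced rows
U_biased = U_benign.copy()
U_biased[0] *= 20.0   # stand-in for a backdoor-induced pull toward class 0

p_benign = stationary_dist(U_benign)
p_biased = stationary_dist(U_biased)
```

The biased model's stationary distribution puts close to half of its mass on the favored class, while the benign model stays near-uniform, matching the concentration signature described above.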

The methodology is straightforward: add independent N(0, σ²I) noise after each activation function, repeat T forward passes, and estimate the empirical class probabilities P̂(k; σ) = (1/T) Σₜ 1(k*ₜ = k). No retraining or architectural changes are required, and the same code works for convolutional networks (ResNet‑18) and vision transformers, across CIFAR‑10 and ImageNet.
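Wrapped as a function, this estimator reads as follows. The stochastic forward pass is abstracted into a callable, and the additive-logit-noise "model" in the example is purely illustrative, not the paper's setup.

```python
import numpy as np

def apex_probe(noisy_logits, sigma, T, num_classes, seed=0):
    """Estimate P_hat(k; sigma) = (1/T) * sum_t 1(k_t* == k), where k_t* is
    the argmax of one noisy forward pass at noise scale sigma."""
    rng = np.random.default_rng(seed)
    preds = [int(np.argmax(noisy_logits(sigma, rng))) for _ in range(T)]
    return np.bincount(preds, minlength=num_classes) / T

# Dummy stochastic "model": the logits of one fixed sample, with the effect
# of activation noise collapsed into additive logit noise for illustration.
base_logits = np.array([2.0, 1.0, -1.0])
noisy_logits = lambda sigma, rng: base_logits + sigma * rng.standard_normal(3)

p_small = apex_probe(noisy_logits, sigma=0.01, T=1000, num_classes=3)
p_large = apex_probe(noisy_logits, sigma=50.0, T=1000, num_classes=3)
```

At small σ the histogram sits on the clean prediction; at large σ it spreads out toward the noise-dominated stationary distribution.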

Overall, APEX offers three major contributions: (1) a unified theoretical framework that links input, weight, and activation perturbations; (2) a practical tool that quantifies sample‑level regularity and distinguishes models trained on structured versus random labels; (3) a mechanism to uncover model‑level biases such as backdoor triggers by observing the stationary output distribution under strong activation noise. By directly probing the hidden representation space, APEX provides insights that are inaccessible from input‑space analyses alone, opening new avenues for studying generalization, memorization, robustness, and security in deep learning systems.

