DART-ing Through the Drift: Dynamic Tracing of Knowledge Neurons for Adaptive Inference-Time Pruning

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

Large Language Models (LLMs) exhibit substantial parameter redundancy, particularly in Feed-Forward Networks (FFNs). Existing pruning methods suffer from two primary limitations. First, reliance on dataset-specific calibration introduces significant data dependency and computational overhead. Second, being predominantly static, they fail to account for the evolving subset of knowledge neurons that become relevant as the context changes during autoregressive generation. To address this, we introduce DART (Dynamic Attention-Guided Runtime Tracing), a lightweight, training-free method that performs on-the-fly, context-based pruning. DART monitors shifts in attention score distributions to infer context changes, dynamically updating neuron-level masks to retain salient parameters. Across ten benchmarks, DART outperforms prior dynamic baselines, achieving accuracy gains of up to 14.5% on LLAMA-3.1-8B at 70% FFN sparsity. Furthermore, DART achieves up to 3x higher ROUGE-L scores than static-mask pruning on summarization tasks, with performance comparable to the original dense models. We demonstrate that the proposed framework effectively adapts to diverse semantic contexts and preserves model capabilities across both general and domain-specific tasks, while requiring less than 10 MB of additional memory for LLAMA-3.1-8B (16 GB) and incurring only 0.1% FLOPs overhead. The code is available at https://github.com/seeder-research/DART.


💡 Research Summary

The paper “DART‑ing Through the Drift: Dynamic Tracing of Knowledge Neurons for Adaptive Inference‑Time Pruning” addresses the well‑known redundancy of parameters in large language models (LLMs), focusing on the Feed‑Forward Network (FFN) sub‑layers where most of the computational budget resides. Existing pruning approaches are either static—relying on a calibration dataset to produce a single mask that is applied to all inputs—or dynamic but require auxiliary models trained on specific tasks. Both suffer from two critical drawbacks: (1) heavy data dependency and calibration overhead, and (2) inability to adapt to the evolving set of “knowledge neurons” that become relevant as autoregressive generation proceeds, a phenomenon the authors term “knowledge drift”.

DART (Dynamic Attention‑Guided Runtime Tracing) is introduced as a lightweight, training‑free method that performs on‑the‑fly, context‑aware pruning. The core idea is to monitor the distribution of attention scores (or more generally the output of the attention sub‑layer) and detect significant shifts relative to a reference context. When a shift exceeds a predefined threshold, DART triggers a re‑computation of neuron‑level masks for the FFN layers, thereby reinstating neurons that have become important for the new semantic domain.

The method consists of three main components:

  1. Context‑aware neuron selector – For each FFN layer, DART estimates a layer‑wise sensitivity score S(l) based on the cosine similarity between the layer’s input and output embeddings and the magnitude of the update. This score is normalized across layers to obtain a relative importance I(l). A depth‑aware factor D(l) further modulates importance, preserving early and late layers while allowing more aggressive sparsity in the middle layers. Using I(l) and D(l), a global sparsity target ρ is allocated across layers via an iterative budget‑redistribution algorithm, yielding per‑layer pruning ratios p(l).
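The summary does not give closed-form expressions for S(l), I(l), or D(l), so the sketch below is an illustrative reconstruction under assumed forms: a cosine-based sensitivity score, a parabolic depth factor that protects early and late layers, and a simple iterative redistribution of any clipped budget until the mean per-layer ratio meets the global target ρ.

```python
import numpy as np

def allocate_sparsity(inputs, outputs, rho=0.7, max_ratio=0.9, iters=20):
    """Toy per-layer sparsity allocation: more sensitive layers are pruned less.

    inputs/outputs: per-layer hidden-state matrices (tokens x dim).
    rho: global FFN sparsity target. The forms of S, I, D are assumptions.
    """
    L = len(inputs)
    # Sensitivity S(l): how strongly the layer rewrites its input
    # (1 - cosine similarity, scaled by update magnitude) -- assumed form.
    S = np.empty(L)
    for l, (x, y) in enumerate(zip(inputs, outputs)):
        cos = (x * y).sum() / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-8)
        S[l] = (1.0 - cos) * np.linalg.norm(y - x)
    I = S / S.sum()                      # relative importance I(l)
    depth = np.linspace(-1.0, 1.0, L)
    D = 0.5 + 0.5 * depth ** 2           # depth factor D(l): protect ends (assumed)
    w = I * D
    w = w / w.sum()
    # Less important layers get higher pruning ratios; redistribute budget
    # that overflows the per-layer cap until the mean reaches rho.
    p = np.clip(rho * (1.0 - w) / (1.0 - w).mean(), 0.0, max_ratio)
    for _ in range(iters):
        deficit = rho - p.mean()
        room = p < max_ratio
        if abs(deficit) < 1e-6 or not room.any():
            break
        p[room] = np.clip(p[room] + deficit * L / room.sum(), 0.0, max_ratio)
    return p  # per-layer pruning ratios p(l), averaging ~rho
```

With moderate importance spread across layers, the allocation stays close to ρ per layer while shifting a few points of sparsity from sensitive layers to middle layers.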

  2. Cumulative importance aggregation – Within a sliding window of τ tokens, DART accumulates the squared activation magnitudes of each FFN neuron to compute a cumulative importance s_i. The top‑k neurons with the highest s_i are retained, forming a binary mask M(l) for that layer. This mask remains fixed until a context shift is detected.
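The cumulative-importance step can be sketched directly from the description above: sum squared activations over the window of τ tokens and keep the top-k neurons. Function and variable names here are illustrative, not taken from the paper's code.

```python
import numpy as np

def update_mask(activation_window, keep_ratio):
    """Cumulative neuron importance over a sliding window of tau tokens.

    activation_window: (tau, n_neurons) FFN hidden activations.
    Returns a binary mask M(l) retaining the top-k neurons by cumulative
    importance s_i (the sum of squared activation magnitudes).
    """
    s = (activation_window ** 2).sum(axis=0)   # s_i for each neuron i
    k = max(1, int(round(keep_ratio * s.size)))
    top = np.argsort(s)[-k:]                   # indices of the k largest s_i
    mask = np.zeros(s.size, dtype=bool)
    mask[top] = True
    return mask
```

The resulting mask would then be held fixed, and applied to the FFN's hidden dimension, until the drift detector signals a context change.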

  3. Knowledge‑drift detector – By comparing the current attention output distribution to the distribution observed during the initial prompt (using KL‑divergence, JS‑divergence, or a simple statistical distance), DART identifies when the model’s internal representation has moved into a new knowledge domain. Upon detection, the neuron selector is re‑run to generate updated masks, effectively adapting the sparsity pattern to the new context.
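A minimal version of the drift detector, using the JS-divergence variant mentioned above, could look as follows; the threshold value and the choice to reset the reference on detection are assumptions, not details from the paper.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

class DriftDetector:
    """Flags a knowledge drift when the current attention distribution
    diverges from the reference (prompt-time) distribution beyond a
    threshold. Threshold value is an assumed hyperparameter."""

    def __init__(self, reference, threshold=0.1):
        self.reference = reference / reference.sum()
        self.threshold = threshold

    def step(self, attn_dist):
        if js_divergence(self.reference, attn_dist) > self.threshold:
            # New semantic domain: adopt it as the reference; the caller
            # would re-run the neuron selector to refresh the masks.
            self.reference = attn_dist / attn_dist.sum()
            return True
        return False
```

In an inference loop, `step` would be called once per generated token with the current attention distribution; a `True` return triggers mask re-computation.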

The authors evaluate DART on ten benchmarks covering zero‑shot, multi‑shot, domain‑specific, and summarization tasks, using models such as LLAMA‑3.1‑8B and LLAMA‑3.2‑3B. At 70% FFN sparsity, DART achieves up to a 14.5% absolute accuracy gain over the best static pruning baseline and up to a 19.6% gain on certain tasks. For summarization, ROUGE‑L scores improve by up to threefold compared to static‑masked pruning, often matching the performance of the dense model. Importantly, the runtime overhead is minimal: memory consumption increases by less than 10 MB on a 16 GB GPU, and FLOPs rise by only 0.1%, making the approach practical for real‑time inference.

Ablation studies confirm that each component—layer‑wise sensitivity, depth‑aware scaling, and the drift detector—contributes significantly to the overall performance. Comparisons with prior dynamic methods (e.g., DLP, OWL) and static methods (e.g., WANDA, SparseGPT) show consistent superiority of DART, especially in long‑horizon generation where knowledge drift is most pronounced.

In summary, DART presents a novel paradigm for dynamic, context‑sensitive pruning in LLMs. By leveraging attention‑based distributional shifts to detect semantic context changes and by allocating sparsity budgets according to layer‑specific importance, DART bridges the gap between efficiency and accuracy without requiring any additional training. The work opens avenues for further research into more sophisticated drift detection mechanisms and broader applicability across diverse model architectures and modalities.

