Deterministic Continuous Replacement: A Technique for Stable Module Swapping in Pretrained Models
Replacing modules in pretrained models, especially swapping quadratic self-attention for efficient attention alternatives, poses a hard optimization problem: cold-start reinitialization destabilizes frozen backbones. We isolate this core stability challenge in a controlled study. Deterministic Continuous Replacement (DCR) blends teacher and student outputs with a deterministic, annealed weight. Theoretically, DCR eliminates gate-induced gradient variance inherent to stochastic replacement. In a single-seed study, DCR attains faster convergence and stronger alignment than stochastic gating and distillation baselines on controlled attention replacement, establishing a foundation for heterogeneous operator swaps.
💡 Research Summary
The paper tackles a fundamental stability problem that arises when swapping modules in large pretrained models, specifically replacing the quadratic self‑attention mechanism with more efficient alternatives such as linear‑time or kernel‑based attention. Traditional approaches—cold‑start reinitialization, stochastic gating, or standard knowledge distillation—often destabilize the frozen backbone because the abrupt change introduces high gradient variance and causes the optimizer to either diverge or converge very slowly.
To address this, the authors propose Deterministic Continuous Replacement (DCR). DCR constructs a deterministic, time‑dependent interpolation between the teacher (original module) and the student (new module) outputs:
\[
y(x; t) \;=\; \bigl(1 - \alpha(t)\bigr)\, f_{\text{teacher}}(x) \;+\; \alpha(t)\, f_{\text{student}}(x), \qquad \alpha(t) \in [0, 1],
\]
where \(\alpha(t)\) is a deterministic weight annealed from 0 (pure teacher) to 1 (pure student) over the course of training.
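The deterministic blend described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear annealing schedule and all names (`DCRBlend`, `alpha`, `total_steps`) are assumptions for exposition.

```python
def alpha(step, total_steps):
    """Deterministic annealing weight, 0 -> 1 (linear schedule; illustrative choice)."""
    return min(1.0, step / total_steps)

class DCRBlend:
    """Blend teacher and student module outputs with a deterministic,
    annealed weight (hypothetical sketch of the DCR idea)."""

    def __init__(self, teacher, student, total_steps):
        self.teacher = teacher        # frozen original module
        self.student = student        # new efficient module being trained
        self.total_steps = total_steps

    def __call__(self, x, step):
        a = alpha(step, self.total_steps)
        # Convex combination: a=0 gives the pure teacher output,
        # a=1 gives the pure student output; no stochastic gate is sampled,
        # so the gate contributes no gradient variance.
        return (1.0 - a) * self.teacher(x) + a * self.student(x)
```

Because the weight is a deterministic function of the step rather than a sampled Bernoulli gate, every forward pass sees the same smoothly shifting mixture, which is the source of the variance reduction the paper claims.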