Avoiding Premature Collapse: Adaptive Annealing for Entropy-Regularized Structural Inference
Differentiable matching layers and residual connection paradigms, often implemented via entropy-regularized Optimal Transport (OT), serve as critical mechanisms in structural prediction and architectural scaling. However, recovering discrete permutations or maintaining identity mappings by annealing ε → 0 is notoriously unstable. In this work, we identify a fundamental mechanism for this failure: **Premature Mode Collapse**. By analyzing the non-normal dynamics of the Sinkhorn fixed-point map, we reveal a thermodynamic speed limit: standard exponential cooling outpaces the contraction rate of the inference operator, whose spectral gap shrinks linearly in ε while the plan's sensitivity to perturbations grows as O(1/ε). To address this, we propose **Efficient Piecewise Hybrid Adaptive Stability Control (EPH-ASC)**, an adaptive scheduling algorithm that monitors the stability of the inference process. We demonstrate that EPH-ASC is essential for stabilizing Manifold-Constrained Hyper-Connections (mHC) during large-scale training on the FineWeb-Edu dataset, effectively preventing late-stage gradient explosions by enforcing a linear stability law.
💡 Research Summary
The paper investigates a critical stability problem that arises when using entropy‑regularized optimal transport (OT) – typically implemented via the Sinkhorn algorithm – as a differentiable matching layer or residual connection in deep neural networks. Practitioners often anneal the regularization temperature ε toward zero in order to recover hard permutations or identity mappings, but this process is notoriously fragile. The authors identify “Premature Mode Collapse” as the fundamental failure mode: as ε shrinks, the optimal transport plan becomes increasingly sensitive to perturbations (scaling as O(1/ε)), while the contraction strength of the Sinkhorn operator deteriorates linearly (spectral gap ∝ ε). Consequently, the basin of attraction around the moving fixed point shrinks proportionally to ε, leading to a linear stability law τₜ ∝ ε for the permissible drift at iteration t.
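To make the setup concrete, here is a minimal NumPy sketch of the entropy-regularized Sinkhorn iteration the paper builds on. The function name, the uniform-marginal setup, and the iteration count are illustrative assumptions, not taken from the paper; the point is only that as ε shrinks, the transport plan sharpens toward a (scaled) hard permutation, which is exactly the regime where the operator's contraction weakens.

```python
import numpy as np

def sinkhorn(C, eps, n_iters=200):
    """Entropy-regularized OT plan via Sinkhorn scaling of K = exp(-C/eps).

    Illustrative sketch: uniform marginals, fixed iteration budget.
    """
    n, m = C.shape
    K = np.exp(-C / eps)
    r = np.full(n, 1.0 / n)   # uniform row marginals
    c = np.full(m, 1.0 / m)   # uniform column marginals
    v = np.ones(m)
    for _ in range(n_iters):
        u = r / (K @ v)       # row scaling
        v = c / (K.T @ u)     # column scaling
    return u[:, None] * K * v[None, :]

# As eps shrinks the plan concentrates toward a hard permutation (max entry
# approaches 1/n), while each Sinkhorn step contracts more slowly -- the
# tension between sharpness and stability that the paper analyzes.
rng = np.random.default_rng(0)
C = rng.random((4, 4))
print(sinkhorn(C, 1.0).max(), sinkhorn(C, 0.01).max())
```

Running the snippet shows the largest plan entry growing toward 1/n = 0.25 as ε drops, i.e. the plan approaching a discrete matching.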
Building on this insight, the authors derive a “Thermodynamic Speed Limit”. By modeling the iteration as a tracking problem, they show that to keep the tracking error bounded within the shrinking basin, the annealing step size δₜ must satisfy δₜ = O(εₜ²). Standard exponential schedules (εₜ₊₁ = α εₜ) produce δₜ = O(εₜ), violating the speed limit and making collapse inevitable as ε → 0. This theoretical result is formalized in Theorem 3.2 and Corollary 3.3, with detailed proofs provided in the appendix.
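The contrast between the two step-size laws can be sketched directly. The schedule below is a hedged illustration (constants and function names are mine, not the paper's): an exponential schedule takes steps δₜ = (1 − α)εₜ = O(εₜ), while a speed-limit-compliant schedule takes δₜ = c·εₜ², which in the continuous limit dε/dt = −cε² yields the slow harmonic decay ε(t) ≈ ε₀/(1 + cε₀t).

```python
def exponential_schedule(eps0, alpha=0.95, steps=100):
    """Standard cooling eps_{t+1} = alpha * eps_t, i.e. delta_t = O(eps_t).

    The ratio delta_t / eps_t**2 diverges as eps -> 0, violating the
    speed limit delta_t = O(eps_t**2) (Theorem 3.2 / Corollary 3.3).
    """
    eps = eps0
    for _ in range(steps):
        delta = (1.0 - alpha) * eps
        eps -= delta
        yield eps, delta

def speed_limited_schedule(eps0, c=0.05, steps=100):
    """Compliant cooling delta_t = c * eps_t**2, so delta_t / eps_t**2 stays ~c.

    Illustrative constant c; eps then decays roughly harmonically, ~1/(c*t).
    """
    eps = eps0
    for _ in range(steps):
        delta = c * eps ** 2
        eps -= delta
        yield eps, delta
```

The diagnostic quantity is δₜ/εₜ²: bounded along the compliant schedule, unbounded along the exponential one, which is the paper's explanation for why naive exponential annealing must eventually collapse.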
To address the limitation, the paper proposes Efficient Piecewise Hybrid Adaptive Stability Control (EPH‑ASC). EPH‑ASC consists of two phases: (1) an offline calibration phase that deliberately triggers collapse on a proxy subset to estimate a safety coefficient k_safe, which captures the empirical ratio of drift to temperature at the collapse point; (2) an online adaptive annealing phase that monitors the Frobenius norm of the primal drift Δₜ at each iteration. If ‖Δₜ‖_F ≤ k_safe·εₜ, the schedule proceeds normally; otherwise, a “Thermodynamic Pause” is invoked, freezing ε for one or more steps while the network continues to improve its feature representations, thereby reducing drift. This mechanism enforces the linear stability law without requiring expensive spectral radius computations.
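The online phase described above amounts to a cheap per-iteration check. The sketch below is an assumed reconstruction of that logic (the function signature, the underlying exponential cooling step, and the floor `eps_min` are my illustrative choices): compare the Frobenius norm of the primal drift against the linear stability budget k_safe·εₜ, and freeze ε when the budget is exceeded.

```python
import numpy as np

def eph_asc_step(eps, P_prev, P_curr, k_safe, alpha=0.95, eps_min=1e-4):
    """One online EPH-ASC scheduling decision (illustrative signature).

    Returns the next temperature and whether a Thermodynamic Pause fired.
    Only a norm of the drift matrix is needed -- no spectral computations.
    """
    drift = float(np.linalg.norm(P_curr - P_prev, ord="fro"))
    if drift <= k_safe * eps:
        # Within the linear stability law: continue the cooling schedule.
        return max(alpha * eps, eps_min), False
    # Thermodynamic Pause: freeze eps so training can reduce the drift.
    return eps, True
```

Because the test is a single matrix norm against a precomputed coefficient, the overhead per iteration is negligible, consistent with the paper's claim that no expensive spectral radius estimates are required.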
Empirical validation is performed on two fronts. In the SPair‑71k image‑matching benchmark, a ResNet‑50 backbone with a Sinkhorn matching layer is trained under three regimes: standard exponential annealing, Gumbel‑Sinkhorn, and the proposed EPH‑ASC. The baseline collapses around epoch 20, leading to stagnant accuracy. Gumbel‑Sinkhorn remains stable but converges slowly. EPH‑ASC detects drift spikes, pauses annealing, and reaches 90 % accuracy in 47 epochs—a 1.6× speed‑up over Gumbel‑Sinkhorn—with negligible overhead (≈0.5 %). In a large‑scale language‑model experiment on the FineWeb‑Edu dataset, a lightweight NanoGemma architecture equipped with Manifold‑Constrained Hyper‑Connections (mHC) is trained for 1,000 steps. The naive exponential schedule suffers a catastrophic gradient explosion at step 980, whereas EPH‑ASC triggers a pause at step 640, maintains a safe temperature buffer for 340 subsequent steps, and avoids both explosion and numerical underflow, achieving stable loss reduction.
Overall, the work provides a rigorous theoretical framework linking entropy regularization, non‑normal dynamics, and annealing stability, and translates this theory into a practical, low‑overhead adaptive scheduler. By respecting the linear stability law τ ∝ ε, EPH‑ASC enables reliable scaling of entropy‑regularized OT layers in both vision and language domains, offering a new paradigm for stable annealing in modern deep learning pipelines.