A Dynamic Knowledge Distillation Method Based on the Gompertz Curve

This paper introduces a novel dynamic knowledge distillation framework, Gompertz-CNN, which integrates the Gompertz growth model into the training process to address the limitations of traditional knowledge distillation. Conventional methods often fail to capture the evolving cognitive capacity of student models, leading to suboptimal knowledge transfer. To overcome this, we propose a stage-aware distillation strategy that dynamically adjusts the weight of distillation loss based on the Gompertz curve, reflecting the student’s learning progression: slow initial growth, rapid mid-phase improvement, and late-stage saturation. Our framework incorporates Wasserstein distance to measure feature-level discrepancies and gradient matching to align backward propagation behaviors between teacher and student models. These components are unified under a multi-loss objective, where the Gompertz curve modulates the influence of distillation losses over time. Extensive experiments on CIFAR-10 and CIFAR-100 using various teacher-student architectures (e.g., ResNet50 and MobileNet_v2) demonstrate that Gompertz-CNN consistently outperforms traditional distillation methods, achieving up to 8% and 4% accuracy gains on CIFAR-10 and CIFAR-100, respectively.


💡 Research Summary

The paper presents Gompertz‑CNN, a novel dynamic knowledge distillation (KD) framework that leverages the Gompertz growth curve to modulate the influence of distillation losses throughout training. Traditional KD methods typically employ a static weighting factor for the distillation loss (often a fixed λ or a simple linear/cosine schedule). Such static schemes ignore the evolving learning capacity of the student network: early in training the student needs to develop its own low‑level representations, while later it can benefit more from the teacher’s higher‑level knowledge. To address this, the authors adopt the Gompertz function—originally used to model biological growth—because it naturally exhibits three phases: a slow start, a rapid middle growth, and a saturation phase. By mapping training epochs to the Gompertz curve, the weight λ(t) applied to the distillation component starts low, rises sharply during the middle epochs, and then tapers off, mirroring the student’s presumed cognitive development.
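The three-phase schedule described above can be sketched as a small helper that maps the current epoch to a distillation weight via the Gompertz function. The specific constants `b` (displacement) and `c` (growth rate) below are illustrative choices, not values reported in the paper:

```python
import math

def gompertz_weight(epoch, total_epochs, lam_max=1.0, b=5.0, c=8.0):
    """Distillation-loss weight following a Gompertz curve:
    slow start, rapid mid-phase rise, late saturation.

    b (displacement) and c (growth rate) are illustrative
    hyperparameters, not values from the paper.
    """
    t = epoch / total_epochs  # normalize training progress to [0, 1]
    return lam_max * math.exp(-b * math.exp(-c * t))
```

With these constants the weight starts near zero, climbs steeply around the middle of training, and flattens out close to `lam_max` by the final epochs, matching the slow-start / rapid-growth / saturation shape the paper attributes to the student's learning progression.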

Gompertz‑CNN’s loss formulation consists of three parts: (1) the standard cross‑entropy loss (\mathcal{L}_{\text{CE}}) for ground‑truth supervision, (2) a feature‑level alignment loss (\mathcal{L}_{\text{W}}) measured by the 1‑Wasserstein distance between teacher and student intermediate feature maps, and (3) a gradient‑matching loss (\mathcal{L}_{\text{G}}) that penalizes the L2 distance between the back‑propagated gradients of the two networks. The overall objective is

\[
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \lambda(t)\left(\mathcal{L}_{\text{W}} + \mathcal{L}_{\text{G}}\right),
\]

where \lambda(t) is the Gompertz‑curve weight that modulates the influence of the two distillation losses over the course of training.
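A minimal NumPy sketch of this multi-loss objective is shown below. It assumes feature maps are flattened into equal-size 1‑D samples (so the empirical 1‑Wasserstein distance reduces to the mean absolute difference of sorted values) and that teacher and student gradients are available as same-shape arrays; all function names and shapes are assumptions for illustration:

```python
import numpy as np

def w1_distance(a, b):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples:
    the mean absolute difference of their sorted values."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

def distill_loss(student_logits, labels, s_feat, t_feat, s_grad, t_grad, lam):
    """Sketch of the multi-loss objective: cross-entropy plus a weighted
    Wasserstein feature loss and an L2 gradient-matching loss.
    lam is the Gompertz-curve weight lambda(t) for the current epoch."""
    # numerically stable softmax cross-entropy for ground-truth supervision
    shifted = student_logits - student_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -float(np.mean(log_probs[np.arange(len(labels)), labels]))
    # feature-level discrepancy via the 1-Wasserstein distance
    l_w = w1_distance(s_feat.ravel(), t_feat.ravel())
    # gradient matching: L2 distance between back-propagated gradients
    l_g = float(np.mean((s_grad - t_grad) ** 2))
    return ce + lam * (l_w + l_g)
```

When the teacher and student features and gradients agree exactly, both distillation terms vanish and the objective reduces to plain cross-entropy, which matches the intuition that a fully aligned student needs no further teacher guidance.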

