Efficient Cross-Architecture Knowledge Transfer for Large-Scale Online User Response Prediction

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Deploying new architectures in large-scale user response prediction systems incurs high model switching costs due to expensive retraining on massive historical data and performance degradation under data retention constraints. Existing knowledge distillation methods struggle with architectural heterogeneity and the prohibitive cost of transferring large embedding tables. We propose CrossAdapt, a two-stage framework for efficient cross-architecture knowledge transfer. The offline stage enables rapid embedding transfer via dimension-adaptive projections without iterative training, combined with progressive network distillation and strategic sampling to reduce computational cost. The online stage introduces asymmetric co-distillation, where students update frequently while teachers update infrequently, together with a distribution-aware adaptation mechanism that dynamically balances historical knowledge preservation and fast adaptation to evolving data. Experiments on three public datasets show that CrossAdapt achieves 0.27-0.43% AUC improvements while reducing training time by 43-71%. Large-scale deployment on Tencent WeChat Channels (~10M daily samples) further demonstrates its effectiveness, significantly mitigating AUC degradation, LogLoss increase, and prediction bias compared to standard distillation baselines.


💡 Research Summary

The paper tackles a practical problem that plagues large‑scale click‑through‑rate (CTR) and conversion‑rate (CVR) prediction systems: deploying a new model architecture is costly both in computation and in performance loss, because the historical data that trained the existing model cannot always be reused. Existing knowledge‑distillation (KD) techniques are ill‑suited for this setting because (i) they assume homogeneous teacher‑student architectures, making layer‑wise alignment impossible, and (ii) they require iterative training over massive embedding tables that dominate model parameters (often > 99 %). Moreover, online advertising streams exhibit continual distribution shift, raising the question of when to rely on historical knowledge versus when to adapt quickly.

CrossAdapt is a two‑stage framework designed to address these challenges.
Offline stage (cross‑architecture transfer):

  1. Dimension‑adaptive embedding projection – The teacher’s embedding matrix $E_T \in \mathbb{R}^{V \times d_T}$ is transformed to the student’s dimension $d_S$ without any gradient‑based training. Three cases are handled: (a) identical dimensions (direct copy); (b) expansion, via an orthogonal projection obtained from a QR decomposition, which preserves all pairwise inner products; and (c) reduction, via principal‑component analysis (PCA). The authors prove (Theorem 1) that the PCA projection minimizes inner‑product distortion, guaranteeing near‑lossless semantic transfer. The computational cost is $O(d_T^3)$ for the eigendecomposition and linear in the vocabulary size $V$ for applying the projection, far cheaper than full‑model KD.
  2. Progressive interaction‑network distillation – After embeddings are transferred, the student’s interaction network is trained in two phases. Phase 1 freezes embeddings and aligns intermediate teacher outputs (e.g., attention maps, FM interactions) with the student using KL or MSE losses, protecting the transferred embeddings from noisy gradients. Phase 2 jointly fine‑tunes embeddings and the interaction network with a combined loss (teacher‑output + label loss).
  3. Strategic sampling – Instead of using the entire historical corpus, the method selects high‑information samples (high click‑rate, high variance) and ensures feature diversity via clustering. This reduces the number of training examples by 30‑50 % while preserving performance.
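The dimension‑adaptive projection in step 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `project_embeddings` and the use of a random Gaussian matrix as the source of the QR‑derived orthonormal map are assumptions; the three branches mirror the copy / expansion / reduction cases described above.

```python
import numpy as np

def project_embeddings(E_T, d_S, seed=0):
    """Map a teacher embedding table E_T (V x d_T) to student dimension d_S
    without gradient-based training (sketch of the paper's three cases)."""
    V, d_T = E_T.shape
    if d_S == d_T:
        # (a) identical dimensions: direct copy
        return E_T.copy()
    if d_S > d_T:
        # (b) expansion: QR of a random matrix yields Q with orthonormal
        # columns; E_T @ Q.T keeps all pairwise inner products intact
        rng = np.random.default_rng(seed)
        Q, _ = np.linalg.qr(rng.standard_normal((d_S, d_T)))
        return E_T @ Q.T
    # (c) reduction: project onto the top-d_S principal directions of the
    # d_T x d_T Gram matrix (PCA), minimizing inner-product distortion
    eigvals, eigvecs = np.linalg.eigh(E_T.T @ E_T)  # ascending eigenvalues
    W = eigvecs[:, -d_S:]
    return E_T @ W
```

Because the expansion map has orthonormal rows, the Gram matrix of the projected table equals that of the original, which is exactly the inner‑product‑preservation property claimed for case (b).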

Online stage (adaptive co‑distillation):

  1. Asymmetric teacher‑student co‑evolution – The student model updates on every incoming mini‑batch, while the teacher updates only periodically (e.g., hourly). The teacher thus provides a stable supervisory signal, whereas the student rapidly captures fresh trends.
  2. Distribution‑aware adaptation – A KL‑based drift detector monitors the streaming data distribution. When a shift is detected, the loss weighting shifts toward streaming samples and away from historical ones; during stable periods, historical samples are re‑used to reinforce long‑term knowledge.
  3. Loss formulation – The overall objective combines the label loss, the KD loss, and a distribution‑aware weighting term: $\mathcal{L} = \lambda_{stu}\,\mathcal{L}_{label} + \lambda_{kd}\,\mathcal{L}_{KD} + \lambda_{dist}\,\mathcal{L}_{dist}$.
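The drift detection and re‑weighting logic in steps 2–3 can be sketched as follows. This is an illustrative stand‑in, not the paper's implementation: the function names `kl_drift` and `adapt_weights`, the histogram‑based monitored statistic, the threshold value, and the specific scaling factors are all assumptions.

```python
import numpy as np

def kl_drift(p_hist, q_hist, eps=1e-8):
    """KL divergence between two normalized histograms of a monitored
    statistic (e.g., predicted CTR over a recent window vs. a reference
    window) -- a simple stand-in for the paper's drift detector."""
    p = p_hist / (p_hist.sum() + eps)
    q = q_hist / (q_hist.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def adapt_weights(drift, threshold=0.05, lam_kd=0.5, lam_dist=0.2):
    """When drift exceeds the threshold, down-weight the (stale) teacher KD
    term and up-weight the streaming-distribution term; during stable
    periods, keep historical knowledge strong."""
    if drift > threshold:
        return {"label": 1.0, "kd": lam_kd * 0.5, "dist": lam_dist * 2.0}
    return {"label": 1.0, "kd": lam_kd, "dist": lam_dist}
```

In an asymmetric co‑distillation loop, `adapt_weights` would be queried on every student mini‑batch update, while the teacher's parameters are refreshed only on the slower (e.g., hourly) cadence described in step 1.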

Empirical evaluation:

  • On three public benchmarks (Criteo, Avazu, Alibaba) CrossAdapt outperforms standard KD, label‑only training, and parameter‑pruning baselines, achieving AUC lifts of 0.27‑0.43 percentage points and reducing total training time by 43‑71 %. Even when the student’s embedding dimension is halved (64 → 32), performance loss is negligible thanks to the optimal PCA projection.
  • In a production deployment on Tencent WeChat Channels (≈10 M daily samples), CrossAdapt mitigates AUC degradation to < 0.12 pp, limits LogLoss increase to < 0.001, and cuts prediction bias by roughly 30 % compared with vanilla KD. System‑level metrics show a 38 % reduction in CPU/GPU usage and a shortening of the model‑switching cycle from three months to one month.
  • Ablation studies confirm each component’s contribution: embedding projection alone yields 0.15 pp AUC gain and 30 % time saving; removing progressive distillation causes early‑stage AUC collapse; disabling asymmetric co‑distillation inflates online AUC variance twofold.

Limitations and future work: The PCA‑based reduction may discard rare‑category information when vocabularies are extremely large; the drift detector’s window size and threshold are hyper‑parameters that require domain‑specific tuning. The authors suggest integrating meta‑learning for automatic hyper‑parameter adaptation and exploring sparse embedding compression to further reduce memory footprints.

In summary, CrossAdapt presents a principled, highly practical solution for cross‑architecture knowledge transfer in massive online advertising systems. By mathematically preserving embedding semantics, progressively training interaction networks, and dynamically balancing historical and fresh knowledge during online updates, it achieves substantial gains in both efficiency and predictive performance, validated on public datasets and in large‑scale production traffic.

