Knowledge Distillation for mmWave Beam Prediction Using Sub-6 GHz Channels
Beamforming in millimeter-wave (mmWave) high-mobility environments typically incurs substantial training overhead. While prior studies suggest that sub-6 GHz channels can be exploited to predict optimal mmWave beams, existing methods depend on large deep learning (DL) models with prohibitive computational and memory requirements. In this paper, we propose a computationally efficient framework for sub-6 GHz channel-to-mmWave beam mapping based on the knowledge distillation (KD) technique. We develop two compact student DL architectures based on individual and relational distillation strategies, which retain only a few hidden layers yet closely mimic the performance of large teacher DL models. Extensive simulations demonstrate that the proposed student models achieve beam prediction accuracy and spectral efficiency comparable to the teacher's while reducing trainable parameters and computational complexity by over 99%.
💡 Research Summary
The paper addresses the prohibitive training overhead associated with beamforming in millimeter‑wave (mmWave) high‑mobility scenarios by leveraging sub‑6 GHz channel information to predict the optimal mmWave beam. While prior works have demonstrated that sub‑6 GHz channels contain enough spatial correlation to infer the best mmWave beam, they rely on large deep‑learning (DL) models, often several layers deep with thousands of neurons per layer, resulting in computational and memory demands that make them unsuitable for real‑time deployment at the base station.
To overcome this limitation, the authors adopt Knowledge Distillation (KD), a model‑compression technique that transfers the “knowledge” of a high‑capacity teacher network to a much smaller student network. The workflow is as follows:
- System Model & Problem Formulation – A base station operates simultaneously in the sub‑6 GHz and mmWave bands. The sub‑6 GHz uplink channel vectors (real and imaginary parts concatenated) serve as input, while the mmWave downlink employs analog beamforming from a predefined codebook W. The optimal beam w* maximizes the achievable downlink rate, so beam selection becomes a multi‑class classification problem in which each codebook entry is a class (a small numerical sketch of this selection rule appears after the list).
- Teacher Network – A fully‑connected multilayer perceptron (MLP) with four hidden layers of 1024 neurons each is trained in a supervised manner with cross‑entropy loss on one‑hot beam labels. The input dimension is 2 · N_sub6 · K̅_sub6 (real/imaginary separation). This teacher achieves high Top‑1 accuracy (≈68.9 % at 15 dB SNR) and spectral efficiency (≈4.71 bits/s/Hz) but contains 3.48 M parameters and requires about 6.94 M FLOPs per inference.
- Student Networks – Two compact MLPs are designed, each with only two hidden layers of 64 neurons, reducing the parameter count to 24.8 k and the FLOPs to 49 k (≈99 % reduction; see the architecture sketch after the list). Three KD strategies are explored:
  - Individual KD (IKD) – The student learns from both the softened probability distribution of the teacher (temperature τ = 10) and the hard ground‑truth labels. The loss is a weighted sum of the KL divergence between the softened outputs (weight α = 0.9) and the standard cross‑entropy (a minimal loss sketch follows the list).
  - Relational KD (RKD) – Beyond logits, the student matches relational structures (pairwise distances and triplet angles) extracted from an intermediate feature layer of the teacher. Normalized Euclidean distances and cosine angles are computed for all pairs and triplets in a minibatch, and a Huber loss penalizes discrepancies (see the RKD sketch after the list). This encourages the student to preserve the geometric relationships among different sub‑6 GHz channel realizations, which reflect spatial positions and propagation conditions.
  - Self‑Distillation – The teacher and student share the same architecture; the teacher's own predictions serve as soft targets for the student, providing a baseline for comparison.
- Training & Dataset – Simulations use the DeepMIMO O1 scenario (both O1_28 with 64 antennas/512 subcarriers and O1_3p5 with 4 antennas/32 subcarriers). The teacher is trained from scratch, while each student is initialized randomly and then trained with its respective KD loss.
- Results:
  - Convergence – KD‑trained students converge much faster than the teacher; the self‑distilled student reaches the lowest validation loss within 12 epochs, whereas the teacher has not converged after 20 epochs. The IKD and RKD students also converge earlier than a non‑distilled baseline.
  - Beam Prediction Accuracy – At 15 dB SNR, the teacher's Top‑1 accuracy is 68.90 %. The RKD student achieves 66.36 % (≈96 % of the teacher) and the IKD student 63.92 % (≈93 %), while the non‑distilled student lags at 44.76 %. Top‑3 accuracies follow the same trend.
  - Spectral Efficiency (SE) – Using the predicted beams, the teacher and the self‑distilled student obtain SE ≈ 4.71 bits/s/Hz. The RKD and IKD students achieve 4.51 and 4.39 bits/s/Hz, preserving roughly 96 % and 93 % of the teacher's SE, respectively. The non‑distilled student drops to 3.67 bits/s/Hz.
  - Complexity – The parameter count and FLOPs are reduced by 99.29 % relative to the teacher, making the student models viable for real‑time inference on commodity hardware.
- Discussion & Implications – The study demonstrates that (i) KD can compress a high‑performing beam‑prediction network with minimal accuracy loss, (ii) relational distillation further captures the underlying geometry of the channel space, improving robustness, and (iii) such lightweight models enable practical deployment of sub‑6 GHz‑driven mmWave beamforming in 5G/6G base stations, dramatically reducing training overhead and inference latency.
- Future Directions – Extending the framework to multi‑user, multi‑cell scenarios, investigating adaptive KD that reacts to rapid channel dynamics, and integrating the student models with hardware accelerators for ultra‑low‑power operation are promising avenues.
In summary, by applying knowledge distillation—both individual and relational—the authors achieve a 99 % reduction in model size and computational load while retaining near‑optimal beam‑selection performance, thereby offering a scalable solution for next‑generation high‑frequency wireless systems.