Practical Policy Distillation for Reinforcement Learning in Radio Access Networks


Adopting artificial intelligence (AI) in radio access networks (RANs) presents several challenges, including limited availability of link-level measurements (e.g., CQI reports), stringent real-time processing constraints (e.g., sub-1 ms per TTI), and network heterogeneity (different spectrum bands, cell types, and vendor equipment). A critical yet often overlooked barrier lies in the computational and memory limitations of RAN baseband hardware, particularly in legacy 4th Generation (4G) systems, which typically lack on-chip neural accelerators. As a result, only lightweight AI models (under 1 MB and sub-100 µs inference time) can be effectively deployed, limiting both their performance and applicability. However, achieving strong generalization across diverse network conditions often requires large-scale models with substantial resource demands. To address this trade-off, this paper investigates policy distillation in the context of a reinforcement learning-based link adaptation task. We explore two strategies: single-policy distillation, where a scenario-agnostic teacher model is compressed into one generalized student model; and multi-policy distillation, where multiple scenario-specific teachers are consolidated into a single generalist student. Experimental evaluations in a high-fidelity, 5th Generation (5G)-compliant simulator demonstrate that both strategies produce compact student models that preserve the teachers’ generalization capabilities while complying with the computational and memory limitations of existing RAN hardware.


💡 Research Summary

The paper tackles the practical deployment of AI‑driven functions in Radio Access Networks (RANs) where legacy baseband hardware imposes severe constraints: memory budgets below 1 MB, inference latency under 100 µs, and the absence of dedicated neural accelerators. These limits make it impossible to run large reinforcement‑learning (RL) agents directly on the equipment, even though robust generalization across heterogeneous cells, spectrum bands, and traffic patterns typically requires high‑capacity models.

To bridge this gap, the authors investigate policy distillation—transferring the knowledge of a large, high‑performing teacher policy into a compact student policy—applied to the link adaptation (LA) problem in 5G downlink. LA is modeled as an episodic Markov Decision Process where the action space consists of 28 Modulation‑and‑Coding Scheme (MCS) indices. The reward combines spectral efficiency for successful transmissions and a penalty proportional to the number of retransmissions. State vectors embed both semi‑static information (cell topology, carrier frequency, antenna configuration, UE capabilities) and dynamic measurements (path loss, CSI, HARQ feedback), enabling the policy to adapt to a wide range of conditions.
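The paper does not spell out the exact reward coefficients here, but the shape of the reward described above can be sketched as follows. The penalty weight `retx_penalty` is a hypothetical value chosen for illustration, not taken from the paper:

```python
N_MCS = 28  # action space: MCS indices 0..27

def la_reward(spectral_efficiency: float, success: bool,
              n_retx: int, retx_penalty: float = 0.5) -> float:
    """Sketch of the LA reward: spectral efficiency for a successful
    transmission, minus a penalty proportional to the number of
    retransmissions. `retx_penalty` is a hypothetical coefficient."""
    if success:
        return spectral_efficiency - retx_penalty * n_retx
    # failed transmission: no spectral-efficiency gain, only the penalty
    return -retx_penalty * (n_retx + 1)
```

An overly aggressive MCS choice triggers HARQ retransmissions and erodes the reward, while an overly conservative one caps spectral efficiency, so the agent is pushed toward the BLER sweet spot.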

Two distillation strategies are explored. Single‑policy distillation compresses a scenario‑agnostic teacher (trained with distributed actors and domain randomization) into a student. The distillation dataset is built from the teacher’s replay memory, ensuring that the student sees the same diverse set of states the teacher experienced. Multi‑policy distillation first trains several scenario‑specific teachers (e.g., urban, rural, high‑speed) using lightweight RL setups that respect the hardware limits of each test environment. Their individual datasets are shuffled and merged, and a single student is trained on the aggregated data, thereby inheriting the combined expertise of all teachers.
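The multi-policy data aggregation step described above is straightforward; a minimal sketch (with generic `(state, teacher_q_values)` samples standing in for whatever record format the paper actually uses) might look like:

```python
import random

def build_multi_teacher_dataset(teacher_datasets, seed=0):
    """Merge per-teacher replay samples into one shuffled dataset,
    as in multi-policy distillation. Each element of
    `teacher_datasets` is a list of (state, teacher_q_values) pairs
    produced by one scenario-specific teacher."""
    merged = [sample for dataset in teacher_datasets for sample in dataset]
    random.Random(seed).shuffle(merged)  # interleave scenarios
    return merged
```

Shuffling before training prevents the student from seeing one scenario's data in a contiguous block, which would bias early gradient updates toward that scenario.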

The distillation loss is the Kullback‑Leibler divergence between the softened teacher Q‑values (temperature τ > 0) and the student Q‑values. A low τ sharpens the distribution, emphasizing the teacher’s action preferences and improving fidelity. The authors adopt an offline distillation pipeline: teachers are fully trained, then a fixed dataset is generated and used to train the student without further environment interaction.
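The loss above can be written out directly. A minimal NumPy sketch, with the teacher's Q-values softened by temperature τ and compared to the student's action distribution via KL divergence:

```python
import numpy as np

def softmax(z, tau):
    """Temperature-scaled softmax with the usual max-subtraction
    for numerical stability."""
    z = np.asarray(z, dtype=float) / tau
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_kl(teacher_q, student_q, tau=0.1):
    """KL( softmax(teacher_q / tau) || softmax(student_q) ).
    A small tau sharpens the teacher distribution toward its
    greedy action, emphasizing its action preferences."""
    p = softmax(teacher_q, tau)    # softened teacher targets
    q = softmax(student_q, 1.0)    # student action distribution
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))
```

The loss is zero only when the student matches the softened teacher distribution exactly, so minimizing it pulls the student's Q-value ordering toward the teacher's.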

Experiments are conducted in a high‑fidelity 5G‑compliant event‑driven simulator (3.5 GHz carrier, numerology μ = 0, SU‑MIMO). The teacher DQN contains roughly 10 M parameters, while student networks are evaluated at 0.5 M, 0.2 M, and 0.1 M parameters. Performance metrics include average spectral efficiency, block error‑rate (BLER) target attainment, inference latency, and memory footprint. Results show that a 0.2 M‑parameter student achieves less than 1.2 % loss in spectral efficiency relative to the teacher while meeting the <100 µs latency budget. Multi‑policy distilled students outperform single‑policy ones on three unseen benchmark scenarios (e.g., high‑speed mobility, bursty interference) by maintaining near‑teacher performance, whereas students trained directly with RL of the same size suffer 5‑10 % degradation and converge far more slowly.
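A quick back-of-the-envelope check on the reported model sizes, assuming unquantized float32 weights (4 bytes per parameter; the paper's mention of quantization as future work suggests deployed formats may be smaller):

```python
def model_size_mb(n_params: int, bytes_per_param: int = 4) -> float:
    """Approximate weight-memory footprint in MB, assuming dense
    float32 parameters (4 bytes each) and ignoring activations."""
    return n_params * bytes_per_param / 1e6

# Teacher (~10 M params) is ~40 MB -- far over the ~1 MB budget --
# while a 0.2 M-parameter student is ~0.8 MB and fits.
```

This arithmetic makes the deployment constraint concrete: only the sub-0.25 M students fit the memory budget at float32 precision, which is exactly the regime the distilled students target.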

Key insights are: (1) policy distillation effectively transfers not only raw performance but also the teacher’s ability to generalize across heterogeneous RAN conditions; (2) multi‑policy distillation enables the use of several lightweight, scenario‑specific teachers, avoiding costly exploration on live networks while still producing a universal student; (3) the combination of domain randomization during teacher training and careful temperature tuning during distillation is crucial for preserving policy fidelity.

The work provides a concrete roadmap for integrating AI into existing RAN infrastructure: large‑scale RL agents can be trained offline, distilled into compact models, and then deployed on legacy baseband units without violating latency or memory constraints. Future directions include on‑chip quantization, hardware‑aware graph optimizations, field trials to validate simulation‑to‑real transfer, and extending the distillation framework to other critical RAN functions such as scheduling, handover, and power control.

