Semantic communication promises task-aligned transmission but must reconcile semantic fidelity with stringent latency guarantees in immersive and safety-critical services. This paper introduces a time-constrained human-in-the-loop reinforcement learning (TC-HITL-RL) framework that embeds human feedback, semantic utility, and latency control within a semantic-aware open radio access network (RAN) architecture. We formulate semantic adaptation driven by human feedback as a constrained Markov decision process (CMDP) whose state captures semantic quality, human preferences, queue slack, and channel dynamics, and solve it via a primal-dual proximal policy optimization (PPO) algorithm with action shielding and latency-aware reward shaping. The resulting policy preserves PPO-level semantic rewards while tightening the variability of both air-interface and near-real-time RAN intelligent controller processing budgets. Simulations over point-to-multipoint links with heterogeneous deadlines show that TC-HITL-RL consistently meets per-user timing constraints, outperforms baseline schedulers in reward, and stabilizes resource consumption, providing a practical blueprint for latency-aware semantic adaptation.
Semantic communication (SemCom) shifts the design focus from bit-level fidelity to task- or meaning-level utility, transmitting only task-relevant information and enabling joint design of physical, link, and inference layers for improved spectral and energy efficiency as well as reduced latency [1]–[3]. Specifically, deep learning-based SemCom systems, often realized via joint source-channel coding (JSCC) [4], [5], have demonstrated strong robustness to channel impairments and notable performance gains. However, most existing designs treat semantic models as static once trained and therefore struggle to maintain alignment when wireless conditions, user preferences, or task objectives evolve over time. From a service perspective, adaptive mechanisms are essential to keep semantic fidelity aligned with user intent and application context.
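As a point of reference, the JSCC encoder-decoder pattern mentioned above is commonly realized as an autoencoder trained end to end through a differentiable channel model. The following is a minimal sketch under that assumption; the PyTorch layer sizes, the AWGN channel layer, and the `snr_db` parameter are illustrative choices, not the architecture used in this paper.

```python
import torch
import torch.nn as nn

class AWGNChannel(nn.Module):
    """Differentiable AWGN channel: adds noise scaled to a target SNR (illustrative)."""
    def __init__(self, snr_db: float = 10.0):
        super().__init__()
        self.snr_db = snr_db

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        signal_power = z.pow(2).mean()
        noise_power = signal_power / (10 ** (self.snr_db / 10))
        return z + torch.sqrt(noise_power) * torch.randn_like(z)

class JSCCAutoencoder(nn.Module):
    """Joint source-channel coding: encoder, noisy channel, and decoder trained end to end."""
    def __init__(self, in_dim: int = 784, latent_dim: int = 64, snr_db: float = 10.0):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.channel = AWGNChannel(snr_db)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)            # task-relevant semantic features (channel symbols)
        z_noisy = self.channel(z)      # channel impairment applied during training
        return self.decoder(z_noisy)   # reconstruction at the receiver side
```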
The rapid progress of generative AI and reinforcement learning from human feedback (RLHF) [6] underscores the value of learning directly from human preferences. Human-in-the-Loop Reinforcement Learning (HITL-RL) incorporates subjective feedback into the reward design and policy updates [7]. It has been successfully applied in robotics, preference learning, and controllable text generation, and has recently been advocated for SemCom to align models with user-perceived utility [8], [9]. Bringing HITL-RL into networked SemCom loops, however, introduces domain-specific challenges.
In wireless systems, human feedback travels over bandwidth- and latency-limited links, and semantic model updates must meet strict timing constraints. In point-to-multipoint deployments with heterogeneous users, feedback delays and reconfiguration latencies can render otherwise beneficial updates infeasible for a subset of users. Ignoring these temporal effects leads to per-user deadline violations and degraded quality of experience (QoE). Time-aware decision mechanisms are therefore required to couple semantic utility with the realities of scheduling and deployment. Meanwhile, the granularity of model updates (e.g., partial refresh vs. full retraining) should be carefully chosen to balance semantic gains against latency overhead.
Constrained Markov decision processes (CMDPs) [10] provide a principled way to enforce latency or safety budgets via Lagrangian or primal-dual methods [11]. Proximal Policy Optimization (PPO) [12], known for stability and sample efficiency, can be endowed with cost critics and dual variables to form constrained PPO (PPO-C), and recent work has brought such RL ideas to RIC optimization [13], [14]. However, prior studies focus on average QoS or resource slicing and do not incorporate human preference signals or per-frame feasibility mechanisms as introduced here.
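For concreteness, the Lagrangian relaxation behind such primal-dual methods can be written as follows; the notation ($J_r$ for expected reward, $J_c$ for expected latency cost, budget $d$, multiplier $\lambda$, dual step size $\eta$) is generic CMDP shorthand rather than this paper's exact formulation:
\[
\max_{\pi}\; J_r(\pi)\ \ \text{s.t.}\ \ J_c(\pi)\le d
\quad\Longrightarrow\quad
\min_{\lambda\ge 0}\,\max_{\pi}\; J_r(\pi)-\lambda\bigl(J_c(\pi)-d\bigr),
\qquad
\lambda \leftarrow \bigl[\lambda+\eta\bigl(J_c(\pi)-d\bigr)\bigr]_{+},
\]
where the dual update increases $\lambda$ whenever the cost exceeds the budget and projects it back onto the nonnegative orthant otherwise.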
In this work, we introduce a time-constrained HITL-RL framework for semantic adaptation in point-to-multipoint settings. We formulate semantic adaptation as a CMDP with per-user deadline budgets and latency-aware reward shaping, and we solve it with a primal-dual PPO algorithm augmented with an action shield that enforces instantaneous feasibility during both training and deployment. To our knowledge, this is among the first integrations of HITL-RL with SemCom under explicit real-time constraints, bridging preference-driven learning with implementable timing control. The main contributions are:
• Latency-aware CMDP. We couple human-aligned semantic utility with near-RT RIC budgets and per-user deadlines, yielding a tractable CMDP abstraction for semantic broadcasting under latency guarantees.
• TC-PPO with shielding. A primal-dual PPO variant with cost critics, adaptive multipliers, and an action shield enforces both average and instantaneous feasibility (a minimal sketch follows this list).
• Implementation and evidence. We map the framework to an NR-like slot structure and show, on JSCC-enabled point-to-multipoint simulations with heterogeneous deadlines, that the resulting policy meets per-user timing constraints while preserving semantic rewards and stabilizing resource consumption.
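To make the second contribution concrete, the sketch below illustrates, under assumed interfaces, the three ingredients of TC-PPO: an action shield that projects infeasible updates onto the feasible set, dual ascent on the latency multiplier, and a clipped PPO surrogate computed on the Lagrangian advantage. The function names, the candidate action set, and the latency estimator are hypothetical placeholders, not the paper's implementation.

```python
import numpy as np

# Hypothetical action set: granularity of a semantic model update.
CANDIDATE_ACTIONS = ("no_update", "partial_refresh", "full_refresh")
NO_UPDATE = "no_update"

def shield_action(action, latency_of, budget_remaining):
    """Action shield: keep the proposed update only if its estimated latency fits the
    remaining per-user budget; otherwise fall back to the cheapest feasible action."""
    if latency_of(action) <= budget_remaining:
        return action
    feasible = [a for a in CANDIDATE_ACTIONS if latency_of(a) <= budget_remaining]
    return min(feasible, key=latency_of) if feasible else NO_UPDATE

def dual_update(lmbda, avg_latency_cost, budget, step_size=0.01):
    """Dual ascent on the latency multiplier: grow lambda when the average cost
    violates the budget, let it decay toward zero when there is slack."""
    return max(0.0, lmbda + step_size * (avg_latency_cost - budget))

def tc_ppo_policy_loss(ratio, adv_reward, adv_cost, lmbda, clip_eps=0.2):
    """Clipped PPO surrogate on the Lagrangian advantage A_r - lambda * A_c,
    combining the reward critic and the cost critic."""
    adv = adv_reward - lmbda * adv_cost
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -np.minimum(unclipped, clipped).mean()
```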
We consider an AI-driven next-generation RAN where a semantic-aware base station (gNB) serves latency-heterogeneous user equipments (UEs) $\mathcal{N} = \{1, \dots, N\}$ over a shared downlink. As illustrated in Fig. 1, the architecture follows the Open RAN functional split [2], [15]: the near-RT RIC hosts the HITL-RL agent, while distributed units (DUs) and radio units (RUs) handle physical-layer connectivity. Semantic models operate as encoder-decoder pairs, with the encoder at the gNB and UE-specific decoders at the terminals. Human operators evaluate reconstructed semantics and send preference feedback to the RIC, which fuses these signals, updates the models, and disseminates configuration deltas under strict timing budgets.
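One way to picture the signals exchanged in this loop is as simple message records passed between the UEs, the RIC, and the gNB. The field names below are illustrative assumptions for exposition, not an interface defined by the paper.

```python
from dataclasses import dataclass

@dataclass
class PreferenceFeedback:
    """Human preference signal forwarded from a UE/operator to the near-RT RIC."""
    ue_id: int
    frame: int
    score: float             # user-perceived semantic quality, e.g. in [0, 1]
    uplink_delay_ms: float   # feedback transport delay over the constrained link

@dataclass
class ConfigDelta:
    """Decoder update disseminated by the RIC to a UE under its timing budget."""
    ue_id: int
    granularity: str         # e.g. "partial_refresh" or "full_refresh"
    payload_bytes: int
    deadline_ms: float       # per-user deadline d_i for deploying the refreshed decoder
```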
The control loop is discretized into gNB scheduling sub-frames indexed by $t \in \{0, 1, \dots\}$, and each sub-frame comprises slot grants (akin to NR mini-slot allocations) that are dynamically carved out for semantic adaptation. Every UE $i$ belongs to a service class $k(i)$ with a deadline $d_i$ representing the maximum allowable time between observing semantic degradation and deploying a refreshed decoder.
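A minimal sketch of the per-user slack bookkeeping implied by this definition, assuming all quantities are expressed in milliseconds; the helper name and its arguments are hypothetical.

```python
def deadline_slack(t_now_ms, t_degradation_ms, d_i_ms, est_update_latency_ms):
    """Remaining slack for UE i: time left before its deadline d_i, net of the
    estimated latency of deploying a refreshed decoder. Negative slack means an
    update of that granularity is already infeasible for this UE."""
    elapsed = t_now_ms - t_degradation_ms
    return d_i_ms - elapsed - est_update_latency_ms
```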
At frame $t$ the gNB ingests source features $x_t \in \mathbb{R}$