Hierarchical JEPA Meets Predictive Remote Control in Beyond 5G Networks
In wireless networked control systems, ensuring timely and reliable state updates from distributed devices to remote controllers is essential for robust control performance. However, when multiple devices transmit high-dimensional states (e.g., images or video frames) over bandwidth-limited wireless networks, a critical trade-off emerges between communication efficiency and control performance. To address this challenge, we propose a Hierarchical Joint-Embedding Predictive Architecture (H-JEPA) for scalable predictive control. Instead of transmitting raw states, devices encode their observations into low-dimensional embeddings that preserve the essential dynamics. The proposed architecture employs three-level hierarchical prediction, with high-level, medium-level, and low-level predictors operating at different temporal resolutions to achieve long-term prediction stability, intermediate interpolation, and fine-grained refinement, respectively. Control actions are derived directly in the embedding space, removing the need for state reconstruction. Simulation results on inverted cart-pole systems demonstrate that H-JEPA enables up to 42.83 % more devices to be supported under limited wireless capacity without compromising control performance.
💡 Research Summary
The paper tackles the fundamental bottleneck in wireless networked control systems where numerous devices generate high‑dimensional visual observations (e.g., RGB images or video frames) that must be communicated to a remote controller. Transmitting raw frames quickly exhausts the limited bandwidth of beyond‑5G networks, leading to packet losses and degraded control performance. To resolve this, the authors propose a Hierarchical Joint‑Embedding Predictive Architecture (H‑JEPA) combined with a semantic actor model, forming a self‑supervised Hierarchical Model Predictive Control (HMPC) framework.
The core idea is to replace raw state transmission with low‑dimensional embeddings. A context encoder (a three‑layer ResNet) maps each high‑dimensional observation x_i,k into a compact latent vector z_i,k. A target encoder, identical in architecture but updated via exponential moving average (EMA) of the context encoder’s parameters, provides stable “ground‑truth” embeddings for training. Three predictors operate at distinct temporal resolutions: a high‑level predictor (P_H) forecasts long‑term embeddings over a horizon of h steps, a medium‑level predictor (P_M) interpolates embeddings between successive high‑level predictions over m (< h) steps, and a low‑level predictor (P_L) refines the trajectory at the finest granularity over l (< m) steps. Each predictor solves an auto‑regressive task conditioned on predicted control actions and is trained with a cosine‑similarity loss that aligns its output with the corresponding target embeddings.
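The two training ingredients described above, the EMA update that keeps the target encoder a slowly moving copy of the context encoder, and the cosine-similarity loss that aligns predicted embeddings with target embeddings, can be sketched minimally as follows. The decay rate `tau` is an illustrative assumption, not a value reported in the paper:

```python
import numpy as np

def ema_update(target_params, context_params, tau=0.99):
    """Target-encoder update: each parameter tensor becomes an
    exponential moving average of the context encoder's parameters.
    (tau=0.99 is an illustrative choice, not from the paper.)"""
    return [tau * t + (1.0 - tau) * c
            for t, c in zip(target_params, context_params)]

def cosine_loss(pred, target):
    """1 - cosine similarity between a predicted embedding and the
    corresponding target-encoder embedding; 0 when perfectly aligned."""
    num = float(np.dot(pred, target))
    den = float(np.linalg.norm(pred) * np.linalg.norm(target)) + 1e-12
    return 1.0 - num / den
```

Because the target encoder receives no gradients and only tracks the context encoder through `ema_update`, it provides stable regression targets and helps avoid representation collapse.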
During the training phase, devices send raw frames to the base station (BS), where the context encoder and the three predictors are jointly optimized on datasets of state‑action pairs for each temporal level. The loss functions are summed across levels, and stochastic gradient descent with a batch size of 256 is used. Once training converges, the inference phase begins: devices only run the context encoder locally, transmitting the resulting embeddings to the cloud controller. The cloud server runs the hierarchical predictors to generate future embeddings, and a semantic actor model (borrowed from prior work) maps these embeddings directly to control actions, eliminating any need for image reconstruction.
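The inference-phase rollout across the three temporal levels can be sketched as below. The predictors are placeholder callables `(z, a) -> z` standing in for the paper's action-conditioned MLPs, and the strides `h`, `m`, `l` are illustrative:

```python
import numpy as np

def hierarchical_rollout(z0, actions, p_high, p_med, p_low, h=8, m=4, l=1):
    """Sketch of hierarchical latent prediction: the high-level
    predictor anchors the trajectory every h steps, the medium-level
    predictor interpolates at stride m between anchors, and the
    low-level predictor refines every remaining step.
    (Strides and predictor signatures are illustrative assumptions.)"""
    traj = [z0]
    z = z0
    for k, a in enumerate(actions):
        if k % h == 0:
            z = p_high(z, a)   # long-horizon anchor
        elif k % m == 0:
            z = p_med(z, a)    # fill temporal gaps between anchors
        else:
            z = p_low(z, a)    # fine-grained refinement
        traj.append(z)
    return np.stack(traj)      # (len(actions)+1, embedding_dim)
```

A semantic actor model would then map each predicted embedding in `traj` directly to a control action, with no image reconstruction at the controller.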
The communication model assumes Rayleigh block fading with non‑line‑of‑sight path loss. The uplink capacity R_i,k = W_i log₂(1+γ_i,k) is compared against a threshold R̄; an outage occurs when the capacity falls below this threshold. Because embeddings are orders of magnitude smaller than raw frames, the required bandwidth per device drops dramatically, allowing many more devices to share the same spectrum without exceeding outage constraints.
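This uplink model is easy to check numerically: under Rayleigh fading the instantaneous SNR γ is exponentially distributed, so the outage probability has the closed form 1 − exp(−(2^(R̄/W) − 1)/γ̄), where γ̄ is the mean SNR. A small Monte Carlo sketch (all numeric values are illustrative, not from the paper):

```python
import math
import random

def uplink_capacity(bandwidth_hz, snr_linear):
    """Shannon capacity R = W log2(1 + gamma), in bits/s."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)

def outage_probability(bandwidth_hz, mean_snr, rate_threshold,
                       trials=100_000, seed=0):
    """Monte Carlo outage estimate under Rayleigh fading: the
    instantaneous SNR is drawn from an exponential distribution with
    the given mean, and an outage is counted whenever the resulting
    capacity falls below the rate threshold."""
    rng = random.Random(seed)
    outages = sum(
        1 for _ in range(trials)
        if uplink_capacity(bandwidth_hz,
                           rng.expovariate(1.0 / mean_snr)) < rate_threshold
    )
    return outages / trials
```

Shrinking the payload (embeddings instead of raw frames) lowers the required `rate_threshold` per device, which drives the per-device outage probability down and lets more devices share the same total bandwidth.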
Simulation experiments are conducted on an inverted cart‑pole benchmark. Each state is a 640×480 RGB frame sampled at 1 ms intervals, yielding trajectories of 100 steps. The high‑level dataset contains 200 training and 40 testing trajectories; medium‑ and low‑level datasets each contain 80 training and 40 testing trajectories of embeddings. The context encoder uses three convolutional layers (64, 128, 256 filters) with batch normalization and ReLU activations. Each predictor is a single‑hidden‑layer MLP with 1024 neurons and a 256‑dimensional output. Baselines from prior work (single‑scale joint‑embedding predictors) are reproduced for fair comparison.
Results show that H‑JEPA achieves superior control accuracy across the prediction horizon while reducing communication cost. Specifically, at an SNR of 20 dB, the proposed method supports 42.83 % more devices than the baselines under the same outage probability. The hierarchical design mitigates error accumulation: the high‑level predictor ensures long‑term stability, the medium‑level predictor fills the temporal gaps, and the low‑level predictor provides fine‑grained corrections. Moreover, because control actions are derived directly from embeddings, the system avoids the computational overhead of image decoding at the controller.
The authors highlight three main contributions: (1) an efficient encoder that compresses high‑dimensional visual data into semantically meaningful embeddings; (2) a three‑level hierarchical predictor that captures dynamics at multiple time scales, reducing long‑horizon drift; and (3) a control pipeline that operates entirely in latent space, dramatically lowering bandwidth requirements. Limitations include the focus on a single control task and the need for further validation in multi‑agent or collaborative scenarios. Additionally, the computational load of the encoder and predictors on resource‑constrained edge devices warrants model compression or quantization techniques.
Future work is suggested to extend the framework to heterogeneous control tasks, incorporate edge‑computing resources for on‑device inference, and explore lightweight architectures that maintain performance while meeting real‑time constraints. The paper positions H‑JEPA as a promising step toward AI‑native, bandwidth‑efficient remote control in the emerging beyond‑5G and 6G ecosystems.