WirelessJEPA: A Multi-Antenna Foundation Model using Spatio-temporal Wireless Latent Predictions

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

We propose WirelessJEPA, a novel wireless foundation model (WFM) that uses the Joint Embedding Predictive Architecture (JEPA). WirelessJEPA learns general-purpose representations directly from real-world multi-antenna IQ data by predicting latent representations of masked signal regions. This enables multiple diverse downstream tasks without reliance on carefully engineered contrastive augmentations. To adapt JEPA to wireless signals, we introduce a 2D antenna-time representation that reshapes multi-antenna IQ streams into structured grids, allowing convolutional processing with block masking and efficient sparse computation over unmasked patches. Building on this representation, we propose novel spatio-temporal mask geometries that encode inductive biases across antennas and time. We evaluate WirelessJEPA across six downstream tasks and demonstrate its robust performance and strong task generalization. Our results establish JEPA-based learning as a promising direction for building generalizable WFMs.


💡 Research Summary

WirelessJEPA introduces a novel self‑supervised foundation model for wireless communications that leverages the Joint Embedding Predictive Architecture (JEPA) instead of the more common contrastive learning approaches. The authors start from the observation that contrastive methods rely heavily on handcrafted data augmentations, which can lead to shortcut solutions and poor generalization when the augmentations do not faithfully reflect the underlying physics of radio signals. To avoid this dependency, WirelessJEPA learns by predicting the latent representations of masked signal regions, thereby encouraging the encoder to capture intrinsic spatio‑temporal relationships directly from raw multi‑antenna IQ streams.

Data representation is a key contribution. Raw IQ data, originally shaped as a three‑dimensional tensor (2 channels for I/Q, H antenna elements, and W time samples), is transformed into a 2‑D “antenna‑time” grid. For the experimental setup (H = 4 antennas, W = 256 samples) the authors up‑sample the antenna dimension by a factor of 64 using nearest‑neighbor interpolation, yielding a square 256 × 256 grid. This conversion enables the use of standard convolutional networks and the block‑masking strategies originally designed for vision‑oriented JEPA models.
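The conversion described above can be sketched in a few lines of NumPy. Since nearest-neighbor interpolation along the antenna axis amounts to repeating each antenna row a fixed number of times, `np.repeat` suffices; the function name and exact shapes are illustrative, not taken from the paper's code.

```python
import numpy as np

def to_antenna_time_grid(iq: np.ndarray, upsample: int = 64) -> np.ndarray:
    """Reshape raw IQ (2, H, W) into an image-like 2-D antenna-time grid.

    The antenna axis H is up-sampled by nearest-neighbour repetition, so a
    (2, 4, 256) tensor becomes a square (2, 256, 256) grid suitable for
    standard convolutional networks and block masking.
    """
    assert iq.ndim == 3 and iq.shape[0] == 2, "expected a (2, H, W) IQ tensor"
    # Nearest-neighbour interpolation along antennas = repeat each row.
    return np.repeat(iq, upsample, axis=1)

iq = np.random.randn(2, 4, 256)      # I/Q channels, 4 antennas, 256 samples
grid = to_antenna_time_grid(iq, 64)
print(grid.shape)                     # (2, 256, 256)
```

Each of the 4 original antenna rows now occupies a band of 64 identical rows, so convolutional block masks of image-scale size can cover meaningful antenna regions.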

The architecture follows the CNN‑JEPA blueprint: a sparse convolutional context encoder processes the masked input, a lightweight predictor (implemented with depthwise‑separable convolutions) predicts the latent vectors of the masked patches, and a momentum teacher encoder provides stable target embeddings via an exponential moving average (EMA) of the student parameters. The loss is a simple L2 regression over the masked locations, which forces the student encoder to infer missing information from the surrounding context rather than reconstruct raw waveforms.
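The two core training mechanics, the EMA teacher update and the masked L2 regression, are simple enough to sketch directly. The snippet below is a minimal illustration (parameter dictionaries stand in for real network weights; names are hypothetical):

```python
import numpy as np

def ema_update(teacher: dict, student: dict, tau: float) -> dict:
    """Momentum-teacher update: theta_t <- tau * theta_t + (1 - tau) * theta_s."""
    return {k: tau * teacher[k] + (1.0 - tau) * student[k] for k in teacher}

def masked_l2_loss(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """L2 regression computed only over the masked patch embeddings."""
    sq_err = (pred - target) ** 2
    return float(sq_err[mask].mean())

student = {"w": np.ones((3, 3))}
teacher = {"w": np.zeros((3, 3))}
teacher = ema_update(teacher, student, tau=0.996)
print(teacher["w"][0, 0])   # ~0.004: the teacher slowly tracks the student
```

Because the loss is applied in latent space rather than on raw waveforms, the student is pushed to infer semantic content of the missing region, not to reproduce noise-level detail.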

A central novelty is the design of structured spatio‑temporal masks that reflect the physical nature of wireless signals. Three mask families are explored:

  1. Antenna mask – entire antenna rows are masked, encouraging the model to learn cross‑antenna correlations useful for tasks such as angle‑of‑arrival (AoA) estimation.
  2. Time mask – contiguous time blocks are masked, biasing the model toward learning temporal dynamics, which benefits modulation classification.
  3. Multi‑block mask – rectangular blocks covering both antenna and time dimensions are randomly placed, providing a mixed inductive bias.
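The three mask families can be generated on a patch grid with a few lines of NumPy. This is an illustrative sketch: the grid size, block dimensions, and number of blocks below are arbitrary choices, not the paper's hyperparameters.

```python
import numpy as np

def antenna_mask(gh: int, gw: int, rows) -> np.ndarray:
    """Mask entire antenna rows of a (gh, gw) patch grid."""
    m = np.zeros((gh, gw), dtype=bool)
    m[rows, :] = True
    return m

def time_mask(gh: int, gw: int, start: int, length: int) -> np.ndarray:
    """Mask a contiguous block of time columns across all antennas."""
    m = np.zeros((gh, gw), dtype=bool)
    m[:, start:start + length] = True
    return m

def multi_block_mask(gh, gw, n_blocks, bh, bw, rng) -> np.ndarray:
    """Randomly place rectangular blocks spanning both antennas and time."""
    m = np.zeros((gh, gw), dtype=bool)
    for _ in range(n_blocks):
        r = rng.integers(0, gh - bh + 1)
        c = rng.integers(0, gw - bw + 1)
        m[r:r + bh, c:c + bw] = True
    return m

rng = np.random.default_rng(0)
print(antenna_mask(16, 16, rows=[0, 1]).sum())     # 32 patches masked
print(time_mask(16, 16, start=4, length=4).sum())  # 64 patches masked
print(multi_block_mask(16, 16, 3, 4, 4, rng).any())
```

The boolean grid then selects which patch embeddings the predictor must regress, while the sparse context encoder sees only the complement.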

The authors systematically evaluate the impact of each mask geometry on downstream performance. Results show that random masks, which scatter isolated patches, are generally sub‑optimal. Time masks achieve the highest 1‑shot accuracy on modulation classification (≈ 80 %), while antenna masks excel at AoA estimation (≈ 100 % 1‑shot). Multi‑block masks offer a balanced performance across tasks.

For pre‑training, WirelessJEPA uses the same over‑the‑air multi‑antenna dataset employed by the earlier IQFM work: 7 waveform types, 225 AoA classes, captured at 1 MSps and 10 MSps with a 4‑antenna USRP testbed. Training runs for 100 epochs with AdamW, cosine learning‑rate decay, and a momentum coefficient τ that ramps from 0.996 to 1.0.
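The momentum ramp can be implemented as a simple schedule. A linear ramp is shown here as one common choice; the paper's exact schedule for τ is not specified in this summary, so treat the interpolation as an assumption.

```python
def momentum_schedule(step: int, total_steps: int,
                      tau_start: float = 0.996, tau_end: float = 1.0) -> float:
    """Ramp the EMA momentum tau from tau_start to tau_end over training.

    A linear interpolation is used here for illustration; cosine ramps are
    another common option in JEPA-style training.
    """
    frac = step / max(total_steps - 1, 1)
    return tau_start + frac * (tau_end - tau_start)

print(momentum_schedule(0, 100))    # 0.996 at the first step
print(momentum_schedule(99, 100))   # 1.0 at the final step
```

As τ approaches 1.0 late in training, the teacher's targets stop moving, which stabilizes the latent regression objective.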

The downstream evaluation covers seven tasks, both in‑distribution (modulation classification and AoA estimation on the same dataset) and out‑of‑distribution (RF fingerprinting, RML2016.10a modulation under varying SNR, GNSS jamming classification, Wi‑Fi protocol identification, and 5G NR interference classification). The pre‑trained encoder is frozen, and two lightweight adapters are tested: a linear probe and a k‑nearest‑neighbors classifier, under low‑shot regimes (1‑shot and 100‑shot).
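The k-nearest-neighbors adapter over frozen features can be sketched without any training at all. The snippet below is a minimal NumPy version using cosine similarity (the distance metric and toy embeddings are illustrative assumptions):

```python
import numpy as np

def knn_predict(support_feats, support_labels, query_feats, k: int = 1):
    """k-NN classification of frozen-encoder features via cosine similarity."""
    s = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = q @ s.T                        # (n_query, n_support) similarities
    nn = np.argsort(-sims, axis=1)[:, :k] # indices of the k nearest supports
    votes = support_labels[nn]
    # Majority vote over the k nearest neighbours per query.
    return np.array([np.bincount(v).argmax() for v in votes])

# Toy 1-shot example: one labelled support embedding per class.
support = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 1])
queries = np.array([[0.9, 0.1], [0.2, 0.8]])
print(knn_predict(support, labels, queries))   # [0 1]
```

In the 1-shot regime the support set holds exactly one embedding per class, so this reduces to nearest-centroid matching; the linear probe replaces the vote with a trained linear classifier on the same frozen features.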

Across all tasks, WirelessJEPA matches or surpasses the contrastive baseline IQFM. Notably, it improves average accuracy on OOD tasks by 3–5 percentage points and shows superior few‑shot performance on modulation (time mask) and AoA (antenna mask). The sparse‑convolution implementation reduces FLOPs by roughly 30 % compared with dense baselines while preserving accuracy.

The paper concludes that structured masking aligned with wireless signal physics provides a powerful inductive bias, and that JEPA‑based prediction can serve as a robust alternative to contrastive learning for building generalizable wireless foundation models. Future directions suggested include learning mask policies dynamically, scaling to massive MIMO arrays and mmWave frequencies, and exploring online/self‑adapting training for real‑time network operation.

Overall, WirelessJEPA demonstrates that self‑supervised latent‑space prediction, combined with a carefully crafted antenna‑time representation and physics‑aware masking, yields a versatile foundation model capable of supporting a wide range of wireless tasks with minimal labeled data—an important step toward AI‑native 6G systems.

