Cross-Attention Transformer for Joint Multi-Receiver Uplink Neural Decoding
We propose a cross-attention Transformer for joint decoding of uplink OFDM signals received by multiple coordinated access points. A shared per-receiver encoder learns time-frequency structure within each received grid, and a token-wise cross-attention module fuses the receivers to produce soft log-likelihood ratios for a standard channel decoder, without requiring explicit per-receiver channel estimates. Trained with a bit-metric objective, the model adapts its fusion to per-receiver reliability, tolerates missing or degraded links, and remains robust when pilots are sparse. Across realistic Wi-Fi channels, it consistently outperforms classical pipelines and strong convolutional baselines, frequently matching (and in some cases surpassing) a powerful baseline that assumes perfect channel knowledge per access point. Despite its expressiveness, the architecture is compact, computationally cheap (under one GFLOP per resource grid), and achieves low latency on GPUs, making it a practical building block for next-generation Wi-Fi receivers.
💡 Research Summary
The paper tackles the problem of jointly decoding uplink OFDM transmissions that are received simultaneously by multiple coordinated access points (APs), a scenario increasingly relevant for next-generation Wi-Fi (802.11be and Wi-Fi 8) and for cell-free or CoMP deployments. Conventional receivers follow a three-stage pipeline applied independently at each AP: pilot-based channel estimation (LS or LMMSE), per-subcarrier equalization, and soft demapping. This modular approach ignores spatial correlations among APs, degrades when pilots are sparse, and requires accurate second-order channel statistics that are often unavailable in fast-varying environments.
To overcome these limitations, the authors propose a novel neural receiver built on a Transformer architecture that combines a shared per‑AP encoder with a token‑wise cross‑attention fusion module. For each AP, the complex received resource grid is turned into a sequence of tokens containing real part, imaginary part, and the estimated noise variance. Tokens are linearly projected to a latent dimension, enriched with 2‑D sinusoidal positional encodings, and processed by four stacked Transformer encoder layers employing multi‑head self‑attention. This encoder learns to implicitly perform channel estimation, equalization, and demapping in a data‑driven fashion, leveraging global time‑frequency context without explicit CSI.
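The tokenization and 2-D positional encoding described above can be sketched as follows. This is an illustrative NumPy reconstruction, not the authors' code: the function names, the row-major token ordering, and the split of the latent dimension between the frequency and time axes are assumptions.

```python
import numpy as np

def grid_to_tokens(y_grid, noise_var):
    """Flatten a complex received resource grid (S subcarriers x T OFDM
    symbols) into one token per resource element: [Re, Im, noise_var].
    Shapes and names are illustrative, not the paper's exact code."""
    S, T = y_grid.shape
    tokens = np.stack([y_grid.real.ravel(),
                       y_grid.imag.ravel(),
                       np.full(S * T, noise_var)], axis=-1)  # (S*T, 3)
    return tokens

def sinusoidal_pe_2d(S, T, d):
    """2-D sinusoidal positional encoding: half the d channels encode the
    subcarrier index, the other half the OFDM-symbol index (assumed split)."""
    def pe_1d(n, d1):
        pos = np.arange(n)[:, None]
        i = np.arange(d1 // 2)[None, :]
        ang = pos / (10000.0 ** (2 * i / d1))
        out = np.zeros((n, d1))
        out[:, 0::2] = np.sin(ang)
        out[:, 1::2] = np.cos(ang)
        return out
    pe_f = pe_1d(S, d // 2)  # frequency axis
    pe_t = pe_1d(T, d // 2)  # time axis
    # Broadcast both axes over the flattened grid (row-major, matching ravel).
    pe = np.concatenate([np.repeat(pe_f, T, axis=0),
                         np.tile(pe_t, (S, 1))], axis=-1)    # (S*T, d)
    return pe
```

After this step, the token sequence (projected to the latent dimension and summed with the positional encoding) would be fed to the stacked self-attention encoder layers.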
After encoding, the model performs fusion at each time‑frequency position (subcarrier × OFDM symbol). The representations from all N_R APs at a given position are stacked, and an anchor‑based cross‑attention is applied: one AP (chosen arbitrarily as the “anchor”) provides the query, while all APs contribute keys and values. The attention weights are learned, allowing the network to up‑weight reliable AP observations and down‑weight faded or missing ones. The resulting fused vector passes through a residual connection, layer normalization, and a lightweight MLP that outputs the log‑likelihood ratios (LLRs) for the bits carried by the corresponding QAM symbol.
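A minimal single-head sketch of the anchor-based fusion at one time-frequency position is shown below. The random projection matrices stand in for learned weights, and the pre-norm/MLP details are omitted; everything here is an assumption about the structure, not the paper's implementation.

```python
import numpy as np

def anchor_cross_attention(h, anchor=0):
    """Anchor-based cross-attention fusion at one time-frequency position.
    h: (N_R, d) latent vectors, one per AP. The anchor AP supplies the
    query; every AP supplies a key and a value. Projection matrices are
    random placeholders for learned weights (illustrative only)."""
    N_R, d = h.shape
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q = h[anchor] @ Wq                   # (d,)   query from the anchor AP
    K = h @ Wk                           # (N_R, d) keys from all APs
    V = h @ Wv                           # (N_R, d) values from all APs
    scores = K @ q / np.sqrt(d)          # (N_R,) per-AP reliability logits
    w = np.exp(scores - scores.max())
    w /= w.sum()                         # softmax attention weights
    fused = w @ V                        # adaptive combination across APs
    return fused + h[anchor]             # residual connection (LN/MLP omitted)
```

The softmax weights are what let the network up-weight reliable APs and effectively zero out a blocked link.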
Training uses a bit‑metric decoding (BMD) rate maximization objective, which is equivalent to minimizing binary cross‑entropy on the bits and provides a differentiable surrogate for BER. The model is trained on‑the‑fly with a wide range of SNRs, pilot densities, and channel realizations drawn from the 3GPP TR 38.901 Urban Microcell (UMi) model, ensuring robustness to non‑stationary fading and to varying numbers of APs.
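The BMD objective reduces to binary cross-entropy between the transmitted bits and the sigmoid of the output LLRs, which can be written in a numerically stable form. The sign convention below (positive LLR favours bit 1) is an assumption; the paper may use the opposite.

```python
import numpy as np

def bmd_loss(llrs, bits):
    """Bit-metric decoding surrogate: mean binary cross-entropy between the
    transmitted bits and sigmoid(LLR), computed in the standard numerically
    stable BCE-with-logits form. Sign convention assumed, not from the paper."""
    llrs = np.asarray(llrs, dtype=float)
    bits = np.asarray(bits, dtype=float)
    return np.mean(np.maximum(llrs, 0) - llrs * bits
                   + np.log1p(np.exp(-np.abs(llrs))))
```

At zero LLRs (total uncertainty) the loss equals log 2 per bit, i.e. zero achievable BMD rate; confident, correct LLRs drive it toward zero.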
Experimental evaluation compares the proposed Cross-Attention Transformer (CAT) against several baselines: (i) classical LS and LMMSE pipelines, (ii) a state-of-the-art CNN-based receiver, and (iii) an "ideal" per-AP demapper that assumes perfect CSI. In multi-AP scenarios (2–4 APs), the baselines first generate LLRs independently at each AP and then fuse them centrally using SNR-based maximal-ratio combining. Across all tested SNRs, pilot sparsity levels (as low as 2% of resource elements, REs), and even when one or more AP links are completely blocked, CAT consistently outperforms the baselines by 1.2–2.0 dB in required SNR for a target BER. In many cases its performance matches or slightly exceeds the perfect-CSI benchmark, demonstrating that the learned attention can effectively reconstruct channel information from sparse pilots.
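The baselines' central fusion step can be sketched as below. Weighting each AP's soft LLRs by its linear SNR before summing is one common reading of "SNR-based maximal-ratio combining"; the paper's exact weighting may differ.

```python
import numpy as np

def mrc_llr_fusion(llrs, snr_db):
    """Baseline central fusion: combine per-AP LLR vectors with weights
    proportional to each AP's linear SNR (an assumed instantiation of
    SNR-based maximal-ratio combining, not the paper's exact rule).
    llrs: (N_R, n_bits), snr_db: (N_R,). Returns (n_bits,)."""
    w = 10.0 ** (np.asarray(snr_db) / 10.0)  # dB -> linear SNR per AP
    w = w / w.sum()                          # normalized combining weights
    return np.tensordot(w, np.asarray(llrs, dtype=float), axes=1)
```

Unlike CAT's learned, position-wise attention, these weights are fixed per AP for the whole grid, which is exactly what makes this fusion blind to frequency-selective fading differences across APs.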
From a systems perspective, the entire model requires roughly 0.8 GFLOPs for a typical 64‑subcarrier × 14‑symbol grid and runs in under 0.3 ms on an NVIDIA RTX 3080, corresponding to a latency well within real‑time constraints for Wi‑Fi PHY processing. The parameter count stays below 3 M, making the architecture amenable to on‑chip implementation or edge‑device deployment.
The paper also discusses limitations: the current design assumes a low‑latency, lossless fronthaul that can transport raw RE observations to a central processor, and the anchor AP is fixed, which may be sub‑optimal in highly dynamic topologies. Future work could explore distributed attention, dynamic anchor selection, or compression‑aware fronthaul strategies.
In summary, the Cross‑Attention Transformer provides a compact, computationally efficient, and highly robust solution for joint multi‑AP uplink decoding. By integrating channel estimation, equalization, and demapping into a single trainable block and by leveraging token‑wise cross‑attention to adaptively fuse heterogeneous AP observations, it achieves performance on par with or superior to classical model‑based pipelines and even ideal CSI‑aware receivers, while meeting the latency and complexity requirements of next‑generation Wi‑Fi systems.