Communication-Efficient Multi-Modal Edge Inference via Uncertainty-Aware Distributed Learning
Semantic communication is emerging as a key enabler for distributed edge intelligence due to its capability to convey task-relevant meaning. However, achieving communication-efficient training and robust inference over wireless links remains challenging. This challenge is further exacerbated for multi-modal edge inference (MMEI) by two factors: 1) prohibitive communication overhead for distributed learning over bandwidth-limited wireless links, due to the *multi-modal* nature of the system; and 2) limited robustness under varying channels and noisy multi-modal inputs. In this paper, we propose a three-stage communication-aware distributed learning framework to improve training and inference efficiency while maintaining robustness over wireless channels. In Stage I, devices perform local multi-modal self-supervised learning to obtain shared and modality-specific encoders without device–server exchange, thereby reducing the communication cost. In Stage II, distributed fine-tuning with centralized evidential fusion calibrates per-modality uncertainty and reliably aggregates features distorted by noise or channel fading. In Stage III, an uncertainty-guided feedback mechanism selectively requests additional features for uncertain samples, optimizing the communication–accuracy tradeoff in the distributed setting. Experiments on RGB–depth indoor scene classification show that the proposed framework attains higher accuracy with far fewer training communication rounds and remains robust to modality degradation and channel variation, outperforming existing self-supervised and fully supervised baselines.
💡 Research Summary
The paper tackles two major challenges in multi‑modal edge intelligence: (1) the heavy communication overhead incurred during distributed training of modality‑specific encoders, and (2) the fragility of inference when wireless channels fluctuate or when one or more sensor modalities become noisy or missing. To address these, the authors propose a unified three‑stage framework that spans the entire model lifecycle—pre‑training, fine‑tuning, and inference—while explicitly accounting for communication budgets.
Stage I – Local multi‑modal self‑supervised pre‑training.
Each edge device independently runs a self‑supervised learning objective that disentangles shared latent representations from modality‑specific ones. The loss combines (i) a contrastive term encouraging a common encoder to capture cross‑modal information, (ii) intra‑modal reconstruction terms preserving modality‑specific details, and (iii) a mutual‑information regularizer that guarantees task‑relevant information survives channel‑induced augmentations. Because all computations are performed locally, no model parameters, gradients, or intermediate features are exchanged with the server, effectively reducing the initial communication cost to zero. The authors also provide an information‑theoretic analysis showing that, under realistic channel noise, the mutual information between the compressed features and the target labels remains bounded away from zero.
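The three-term Stage I objective can be sketched as below. This is a minimal illustrative sketch, not the authors' formulation: the loss weights `lam_rec`/`lam_mi`, the InfoNCE form of the contrastive term, and the use of a second contrastive term on a channel-augmented view as the mutual-information surrogate are all assumptions introduced here for illustration.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Contrastive (InfoNCE) term: aligned pairs across the two views
    are positives; all other pairs in the batch are negatives."""
    # L2-normalize so the dot product is cosine similarity
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # diagonal = positive pairs

def ssl_loss(z_shared_rgb, z_shared_depth,
             x_rgb, x_rgb_hat, x_depth, x_depth_hat,
             lam_rec=1.0, lam_mi=0.5):
    """Combined Stage I objective (illustrative weights):
    (i) cross-modal contrastive alignment of shared embeddings,
    (ii) per-modality reconstruction preserving private details,
    (iii) an MI surrogate on a channel-augmented (noisy) view."""
    l_con = info_nce(z_shared_rgb, z_shared_depth)
    l_rec = (np.mean((x_rgb - x_rgb_hat) ** 2)
             + np.mean((x_depth - x_depth_hat) ** 2))
    # MI regularizer approximated by a second contrastive term against a
    # noisy copy of the shared embedding, mimicking channel distortion
    rng = np.random.default_rng(0)
    z_noisy = z_shared_rgb + 0.05 * rng.standard_normal(z_shared_rgb.shape)
    l_mi = info_nce(z_shared_rgb, z_noisy)
    return l_con + lam_rec * l_rec + lam_mi * l_mi
```

All quantities here are computed on-device, which is what makes the Stage I communication cost zero: nothing in this loss requires features or gradients from the server.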
Stage II – Server‑side evidential fusion and supervised fine‑tuning.
After the encoders are initialized, a small labeled dataset is used to fine‑tune the whole system centrally. Each modality’s feature vector is passed through an evidential head that outputs Dirichlet evidence parameters αₘ. These parameters quantify epistemic uncertainty: a large total evidence indicates high confidence, while a small αₘ signals uncertainty caused by noisy inputs or deep fading. The server fuses the Dirichlet distributions from all modalities in a Bayesian manner, producing calibrated class probabilities without increasing the transmitted payload. This evidential fusion not only improves accuracy compared with naïve averaging but also yields well‑calibrated uncertainty estimates that are crucial for the next stage.
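The evidential mechanics can be sketched as follows. The Dirichlet statistics (α = e + 1, uncertainty mass u = K/S) follow the standard evidential deep learning construction; the sum-based combination of per-modality evidence is an illustrative stand-in for the paper's Bayesian fusion rule, whose exact form is not spelled out in this summary.

```python
import numpy as np

def dirichlet_stats(evidence):
    """Map a non-negative evidence vector (length K = #classes) to
    Dirichlet parameters, expected probabilities, and uncertainty mass."""
    alpha = np.asarray(evidence, dtype=float) + 1.0  # Dirichlet params
    S = alpha.sum()                                  # total evidence strength
    prob = alpha / S                                 # expected class probs
    u = len(alpha) / S                               # uncertainty mass K / S
    return alpha, prob, u

def fuse_modalities(evidences):
    """Illustrative sum-based combination of per-modality evidence:
    a modality with little evidence (e.g. deep fading) contributes
    little to the fused Dirichlet, so confident modalities dominate."""
    fused = np.sum(np.asarray(evidences, dtype=float), axis=0)
    return dirichlet_stats(fused)
```

For example, a confident RGB head with evidence `[9, 1, 0]` fused with a nearly uninformative depth head `[0.5, 0.2, 0.1]` still yields a confident class-0 prediction, with fused uncertainty well below that of the degraded modality alone.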
Stage III – Uncertainty‑guided retransmission at inference time.
During inference, the server first makes a provisional prediction using the received features and computes the associated uncertainty. If the uncertainty exceeds a pre-defined quantile threshold, a lightweight feedback signal is sent to the relevant devices requesting additional feature dimensions (up to a maximum N). The newly received bits are merged with the original features, and the server re-infers the label. Because retransmission is triggered only for high-uncertainty samples, the average communication cost remains low, while the system can recover accuracy for difficult or heavily corrupted instances. Experiments show that the average retransmission rate stays below 15%, yet the overall accuracy loss is less than 1% compared with an ideal, always-reliable channel.
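The quantile-based trigger above can be sketched in a few lines. The function names and the use of held-out calibration uncertainties to set the threshold are assumptions for illustration; the `budget=0.15` default mirrors the ~15% retransmission rate reported in the summary.

```python
import numpy as np

def fit_threshold(calib_u, budget=0.15):
    """Set tau at the (1 - budget) quantile of calibration-set
    uncertainties, so roughly a `budget` fraction of future samples
    exceed it and trigger the feedback request."""
    return float(np.quantile(calib_u, 1.0 - budget))

def retransmission_rate(test_u, tau):
    """Fraction of samples whose uncertainty exceeds tau, i.e. the
    average extra-communication cost incurred by the policy."""
    return float(np.mean(np.asarray(test_u) > tau))
```

A nice property of the quantile formulation is that the operator controls the communication budget directly (e.g. "at most ~15% retransmissions") rather than tuning an opaque uncertainty cutoff.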
Experimental validation.
The authors evaluate the framework on an RGB–depth indoor scene classification task under a range of SNRs (5–30 dB) and Rayleigh fading conditions. Results demonstrate that:
- Stage I self-supervised pre-training reaches target accuracy with roughly 10× fewer training communication rounds than contrastive pre-training and over 20× fewer than training from scratch.
- Evidential fusion in Stage II improves top-1 accuracy by 2.5% and reduces Expected Calibration Error (ECE) relative to simple concatenation or averaging.
- The uncertainty‑guided retransmission policy in Stage III maintains robustness when one modality is severely degraded, achieving higher accuracy than a baseline joint source‑channel coding (JSCC) scheme while using only a small fraction of extra transmissions.
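Since the Stage II result is reported in terms of ECE, a standard equal-width-bin definition of that metric is sketched below (this is the common formulation, not necessarily the authors' exact evaluation code).

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: the bin-weighted mean absolute gap between
    average confidence and empirical accuracy within each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # weight by fraction of samples in bin
    return float(ece)
```

A perfectly calibrated model has ECE 0; an overconfident model (e.g. 95% confidence but 50% accuracy) is penalized by the confidence–accuracy gap.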
Key contributions
- A lifecycle‑wide three‑stage distributed learning framework that jointly minimizes training and inference communication while preserving robustness.
- A fully local multi‑modal self‑supervised pre‑training scheme that produces shared and modality‑specific encoders without any device‑server exchange.
- An evidential fusion mechanism that provides calibrated per‑modality and fused uncertainties without enlarging the feature payload.
- A quantile‑based, uncertainty‑driven retransmission policy that adaptively balances communication cost and inference accuracy.
- Comprehensive wireless‑edge experiments confirming the theoretical advantages and practical feasibility.
Overall, the paper presents a novel integration of self‑supervised representation learning, evidential uncertainty modeling, and adaptive feedback control, offering a practical solution for communication‑constrained, multi‑modal edge AI systems envisioned for 6G networks.