Knowledge-Base based Semantic Image Transmission Using CLIP


This paper proposes a novel knowledge-base (KB)-assisted semantic communication framework for image transmission. At the receiver, a Facebook AI Similarity Search (FAISS)-based vector database is constructed by extracting semantic embeddings from images with the Contrastive Language-Image Pre-Training (CLIP) model. During transmission, the transmitter first extracts a 512-dimensional semantic feature using CLIP and then compresses it with a lightweight neural network. After receiving the signal, the receiver reconstructs the feature back to 512 dimensions and performs similarity matching against the KB to retrieve the most semantically similar image. Semantic transmission success is judged by category consistency between the transmitted and retrieved images rather than by traditional metrics such as Peak Signal-to-Noise Ratio (PSNR). The proposed system prioritizes semantic accuracy, offering a new evaluation paradigm for semantic-aware communication systems. Experimental validation on CIFAR-100 demonstrates the effectiveness of the framework in achieving semantic image transmission.


💡 Research Summary

The paper introduces a knowledge‑base (KB)‑assisted semantic communication framework for image transmission that departs from conventional pixel‑level reconstruction and instead focuses on preserving the semantic content of images. The core of the system relies on the Contrastive Language‑Image Pre‑training (CLIP) model to extract a 512‑dimensional embedding that captures high‑level semantic information from an input RGB image. Rather than transmitting the raw embedding, the authors propose a lightweight encoder‑decoder pair built from a two‑layer multilayer perceptron (MLP) that compresses the 512‑dimensional vector to a lower‑dimensional representation (k dimensions, e.g., 128) before modulation onto a complex‑valued wireless channel. At the receiver, the symmetric decoder reconstructs the compressed vector back to 512 dimensions, yielding a recovered semantic feature.
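The encoder-decoder pair described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper specifies only a two-layer MLP with the 512 → k → 512 shape, so the hidden width (256) and the ReLU activation are assumptions.

```python
import torch
import torch.nn as nn

class SemanticCodec(nn.Module):
    """Sketch of the lightweight MLP encoder/decoder pair.

    The 512 -> k -> 512 shape follows the paper; the hidden width and
    activation are assumptions made for illustration.
    """

    def __init__(self, feature_dim: int = 512, k: int = 128, hidden: int = 256):
        super().__init__()
        # Encoder: compress the CLIP feature to k dimensions for transmission.
        self.encoder = nn.Sequential(
            nn.Linear(feature_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, k),
        )
        # Symmetric decoder: reconstruct the 512-dimensional semantic feature.
        self.decoder = nn.Sequential(
            nn.Linear(k, hidden),
            nn.ReLU(),
            nn.Linear(hidden, feature_dim),
        )

    def forward(self, clip_feature: torch.Tensor) -> torch.Tensor:
        compressed = self.encoder(clip_feature)   # transmitted representation
        return self.decoder(compressed)           # recovered 512-dim feature

codec = SemanticCodec()
x = torch.randn(4, 512)   # batch of CLIP image embeddings (random stand-ins)
y = codec(x)
print(y.shape)            # torch.Size([4, 512])
```

In a full pipeline the k-dimensional output would be mapped to channel symbols before transmission and the decoder would run at the receiver after channel equalization.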

A pre‑constructed KB is created by extracting CLIP embeddings from a reference set of images (the CIFAR‑100 dataset in the experiments). The KB therefore consists of M vectors, each 512‑dimensional, and is indexed using Facebook AI Similarity Search (FAISS) to enable fast nearest‑neighbor queries. After reconstruction, the receiver searches the KB with an L2 distance metric; the image associated with the closest vector is returned as the final output. Success is defined not by pixel fidelity (e.g., PSNR) but by “semantic accuracy,” i.e., the percentage of transmissions where the retrieved image belongs to the same class as the transmitted image.
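The retrieval step can be illustrated with a brute-force exact L2 search. The paper indexes the KB with FAISS (an `IndexFlatL2`-style exact index); the NumPy search below computes the same nearest neighbor, just without FAISS's optimized data structures. Random vectors and labels stand in for the real CLIP embeddings and CIFAR-100 class annotations.

```python
import numpy as np

# Hypothetical stand-in KB: M random 512-dim vectors with CIFAR-100 class ids.
d, M = 512, 10_000
rng = np.random.default_rng(0)
kb_vectors = rng.standard_normal((M, d)).astype(np.float32)
kb_labels = rng.integers(0, 100, size=M)

def retrieve(recovered_feature: np.ndarray) -> int:
    """Return the KB index of the vector closest in L2 distance
    (what FAISS's exact flat index would return)."""
    dists = np.linalg.norm(kb_vectors - recovered_feature, axis=1)
    return int(np.argmin(dists))

# Emulate a reconstructed feature: a noisy copy of KB entry 42.
query = kb_vectors[42] + 0.01 * rng.standard_normal(d).astype(np.float32)
best = retrieve(query)
print(best, kb_labels[best])   # index 42 and its class label
```

Because the reconstruction error is small relative to the spacing of the embeddings, the noisy query still maps back to its source entry; transmission counts as successful whenever the retrieved entry's class matches the transmitted image's class.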

The authors evaluate the system under both additive white Gaussian noise (AWGN) and Rayleigh fading channels, across a range of channel bandwidth ratios (CBR = 1/48, 1/24, 1/12, 1/6, 1/3) and signal‑to‑noise ratios (SNR from –7 dB to 10 dB). For each CBR a dedicated model is trained to adapt to the corresponding SNR range. The baseline comparison includes (1) direct transmission of the uncompressed 512‑dimensional CLIP vector, (2) a conventional BPG image codec combined with LDPC error correction, and (3) SwinJSCC, a recent deep‑learning‑based joint source‑channel coding scheme.
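The AWGN channel in this evaluation can be simulated as below. This is a hedged sketch under common conventions: the paper states complex-valued AWGN/Rayleigh channels over the listed SNR range, so the pairing of reals into complex symbols and the unit-power normalization are assumptions.

```python
import numpy as np

def awgn_channel(x: np.ndarray, snr_db: float, seed: int = 0) -> np.ndarray:
    """Pass a real-valued compressed vector through a complex AWGN channel.

    Assumed conventions: consecutive reals form one complex symbol, symbols
    are normalized to unit average power, and SNR is signal/noise power in dB.
    """
    rng = np.random.default_rng(seed)
    # Pair consecutive reals into complex channel symbols.
    symbols = x[0::2] + 1j * x[1::2]
    # Normalize to unit average symbol power.
    symbols = symbols / np.sqrt(np.mean(np.abs(symbols) ** 2))
    # Complex Gaussian noise with power 10^(-SNR/10).
    noise_power = 10 ** (-snr_db / 10)
    noise = np.sqrt(noise_power / 2) * (
        rng.standard_normal(symbols.shape) + 1j * rng.standard_normal(symbols.shape)
    )
    return symbols + noise

# A 128-dim compressed vector (CBR = 1/48 case) becomes 64 complex symbols.
y = awgn_channel(np.random.default_rng(1).standard_normal(128), snr_db=10.0)
print(y.shape)
```

A Rayleigh channel would additionally multiply each symbol by a complex fading coefficient before adding noise, which is why a dedicated model per CBR/SNR range helps the codec adapt its learned redundancy.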

Key findings are:

  1. Compression Efficiency – Even at the lowest CBR (1/48) the proposed compression (128‑dimensional vector) achieves semantic accuracy comparable to the baseline that transmits the full 512‑dimensional vector, demonstrating that the MLP learns a noise‑robust compact representation.

  2. Robustness to Noise – As SNR decreases, the compressed‑then‑reconstructed features retain higher semantic accuracy than the baseline, indicating that the encoder‑decoder learns to embed redundancy and denoise the signal.

  3. Adaptive Redundancy – For higher CBR values (e.g., 1/3), the network can output a representation whose dimensionality exceeds 512, intentionally adding redundancy to improve resilience in harsh channel conditions.

  4. Superiority over Traditional Schemes – BPG+LDPC only achieves meaningful semantic accuracy when the channel decoder fully recovers the source bits; otherwise accuracy collapses to near zero. SwinJSCC performs better than BPG+LDPC at low SNR but is outperformed by the proposed method across all SNRs because the latter directly transmits semantically meaningful features that are intrinsically more tolerant to bit errors.

  5. Latency Advantage – The end‑to‑end inference time of the proposed system is about 7.9 ms (5.7 ms for CLIP extraction, 1.0 ms for the MLP encoder‑decoder, and 1.2 ms for FAISS search), roughly 40 % faster than SwinJSCC’s 12.9 ms, making it attractive for real‑time wireless applications.

The paper concludes that a KB‑augmented, CLIP‑based semantic transmission pipeline can flexibly trade off bandwidth, redundancy, and robustness while consistently delivering higher semantic accuracy than both classical source‑channel coding and recent deep‑learning‑based JSCC approaches. Future work is suggested on dynamic KB updates, multimodal extensions (e.g., incorporating text or audio), and more advanced compression techniques such as unsupervised representation learning to further improve efficiency in practical communication scenarios.

