Dual-Representation Image Compression at Ultra-Low Bitrates via Explicit Semantics and Implicit Textures

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

While recent neural codecs achieve strong performance at low bitrates when optimized for perceptual quality, their effectiveness deteriorates significantly under ultra-low bitrate conditions. To mitigate this, generative compression methods leveraging semantic priors from pretrained models have emerged as a promising paradigm. However, existing approaches are fundamentally constrained by a tradeoff between semantic faithfulness and perceptual realism. Methods based on explicit representations preserve content structure but often lack fine-grained textures, whereas implicit methods can synthesize visually plausible details at the cost of semantic drift. In this work, we propose a unified framework that bridges this gap by coherently integrating explicit and implicit representations in a training-free manner. Specifically, we condition a diffusion model on explicit high-level semantics while employing reverse-channel coding to implicitly convey fine-grained details. Moreover, we introduce a plug-in encoder that enables flexible control of the distortion-perception tradeoff by modulating the implicit information. Extensive experiments demonstrate that the proposed framework achieves state-of-the-art rate-perception performance, outperforming existing methods and surpassing DiffC by 29.92%, 19.33%, and 20.89% in DISTS BD-Rate on the Kodak, DIV2K, and CLIC2020 datasets, respectively.


💡 Research Summary

The paper tackles the long‑standing dilemma in ultra‑low‑bitrate image compression: preserving high‑level semantic content while also delivering fine‑grained texture. Existing approaches either rely on explicit representations (e.g., VQ‑GAN latents, tag or caption prompts) that keep structure but lose detail, or on implicit diffusion‑based schemes that synthesize realistic textures at the cost of semantic drift. The authors propose a unified “dual‑representation” framework that simultaneously transmits explicit semantics and implicit textures in a training‑free manner.

Explicit semantic stream – The input image is first encoded by a pretrained VAE (or LDM encoder) into a latent z. A hyper‑encoder compresses z into a compact latent y, which is quantized (vector or scalar) and entropy‑coded into a bitstream ŷ. In parallel, a lightweight image‑to‑tags module extracts a set of visual tags c using RAM. Each tag is encoded with a fixed‑length binary code of ⌈log₂N⌉ bits, where N is the vocabulary size, dramatically reducing the overhead compared with full captions. The pair (ŷ, c) serves as a high‑level anchor for the diffusion decoder.
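The fixed‑length tag coding is straightforward to sketch. The snippet below is an illustrative sketch, not the paper's implementation: the tag vocabulary here stands in for RAM's actual tag list, and only the ⌈log₂N⌉ bits‑per‑tag cost is taken from the text.

```python
import math


def encode_tags(tags, vocab):
    """Encode each tag as a fixed-length ceil(log2 N)-bit index.

    `vocab` is a hypothetical stand-in for the tagger's vocabulary;
    the per-tag cost depends only on the vocabulary size N.
    """
    bits_per_tag = math.ceil(math.log2(len(vocab)))
    index = {tag: i for i, tag in enumerate(vocab)}
    bitstream = "".join(format(index[t], f"0{bits_per_tag}b") for t in tags)
    return bitstream, bits_per_tag


def decode_tags(bitstream, vocab, bits_per_tag):
    """Invert encode_tags by slicing the bitstream into fixed-width codes."""
    return [
        vocab[int(bitstream[i : i + bits_per_tag], 2)]
        for i in range(0, len(bitstream), bits_per_tag)
    ]
```

With a toy 5‑tag vocabulary, each tag costs ⌈log₂5⌉ = 3 bits, so two tags fit in 6 bits; a full caption tokenized into text would cost far more.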

Implicit texture stream – Fine‑grained details are conveyed by compressing noisy diffusion states along the reverse diffusion trajectory using Reverse‑Channel Coding (RCC). For each timestep t, the encoder samples a noisy latent z_t from the forward diffusion process and encodes it with respect to the conditional distribution p_θ(z_t | z_{t+1}, c, ŷ). The expected bitrate of a step equals the KL divergence between the true posterior and this conditional prior, allowing precise control over how many steps T_E are transmitted. More steps mean a higher implicit bitrate and better texture fidelity.
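Since the per‑step rate is a KL divergence, it can be estimated in closed form when both the posterior and the conditional prior are diagonal Gaussians, as in standard diffusion models. The helper below is a minimal sketch under that Gaussian assumption; it is not the paper's RCC implementation, only the rate formula it relies on.

```python
import numpy as np


def gaussian_kl_bits(mu_q, var_q, mu_p, var_p):
    """KL(q || p) in bits for diagonal Gaussians q and p.

    Under RCC, the expected cost of transmitting one noisy latent z_t
    equals KL between the forward posterior q and the conditional
    prior p_theta; the diagonal-Gaussian form is an assumption here.
    """
    mu_q, var_q = np.asarray(mu_q, float), np.asarray(var_q, float)
    mu_p, var_p = np.asarray(mu_p, float), np.asarray(var_p, float)
    kl_nats = 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )
    return kl_nats / np.log(2.0)  # nats -> bits
```

Summing this quantity over the transmitted steps t = T_E, …, 1 gives the implicit bitrate, which is why choosing T_E directly controls the rate.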

Distortion‑perception knob – A plug‑in encoder E_M produces an alternative latent z̃ that is biased toward perceptual detail. The final compression target is the linear blend z̄ = τz + (1−τ)z̃, where τ ∈ [0, 1]: larger τ keeps the target close to the original latent z (lower distortion), while smaller τ shifts weight toward the perception‑biased latent (higher realism).
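The blend itself is a one‑line operation; a minimal sketch, assuming the two latents are arrays of the same shape:

```python
import numpy as np


def blend_latents(z, z_tilde, tau):
    """Distortion-perception knob: bar_z = tau * z + (1 - tau) * z_tilde.

    tau = 1 reproduces the original latent z (fidelity);
    tau = 0 uses only the perception-biased latent z_tilde.
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return tau * np.asarray(z, float) + (1.0 - tau) * np.asarray(z_tilde, float)
```

Because τ only rescales the compression target, the same trained components serve the whole distortion‑perception curve without retraining.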

