Composable Visual Tokenizers with Generator-Free Diagnostics of Learnability

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

We introduce CompTok, a training framework for learning visual tokenizers whose tokens are optimized for compositionality. CompTok uses a token-conditioned diffusion decoder together with an InfoGAN-style objective: a recognition model is trained to predict, from the decoded images, the tokens that conditioned the diffusion decoder, which prevents the decoder from ignoring any token. To promote compositional control, CompTok also trains on tokens formed by swapping token subsets between images, in addition to the original images, giving the tokens more compositional control over the decoder. Since these swapped tokens have no ground-truth image targets, we apply a manifold constraint via an adversarial flow regularizer that keeps unpaired swap generations on the natural-image distribution. The resulting tokenizer not only achieves state-of-the-art performance on class-conditioned image generation, but also supports high-level semantic editing by swapping tokens between images. Additionally, we propose two metrics that characterize the landscape of the token space, describing both the compositionality of the tokens and how easily a generator can be trained on that space. Experiments show that CompTok improves on both metrics while supporting state-of-the-art generators for class-conditioned generation.


💡 Research Summary

CompTok introduces a novel training framework for visual tokenizers that explicitly targets token usefulness and compositionality, addressing a gap in existing tokenizers which focus mainly on reconstruction fidelity (rFID) and compression efficiency. The method operates on 1‑D token sequences and consists of two intertwined training pathways: a reconstruction pathway and a token‑swap pathway.

In the reconstruction pathway, a diffusion‑based decoder D conditioned on tokens produced by an encoder E is trained together with a recognition model Qϕ. Qϕ attempts to predict the exact token sequence that was used to generate a decoded image. This yields an InfoGAN‑style mutual‑information loss L_MI that penalizes the decoder if it ignores any token, thereby forcing each token to have a causal effect on the output.
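A minimal numpy sketch of this mutual-information surrogate, assuming discrete tokens and treating Qϕ's output as per-position logits over the token vocabulary (the shapes, names, and exact loss form here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def mutual_info_loss(pred_logits, token_ids):
    """InfoGAN-style surrogate for L_MI: mean cross-entropy of the
    recognition model's predictions against the token sequence that
    actually conditioned the decoder.

    pred_logits : (T, V) array, Q's logits per token position
    token_ids   : (T,) int array, the conditioning tokens
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = pred_logits - pred_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the true token at each position.
    nll = -log_probs[np.arange(len(token_ids)), token_ids]
    return float(nll.mean())
```

If the decoder ignores a token position, Qϕ cannot predict that token from the image, the corresponding NLL stays high, and the gradient pushes the decoder to make every token causally visible in its output.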

The token‑swap pathway augments training by swapping subsets of tokens between two images, forming mixed tokens z_swap. Since there is no ground‑truth image for these mixed tokens, an adversarial flow regularizer ψ is employed to enforce realism via a flow‑based density model. The resulting adversarial flow loss L_AFM keeps swapped‑token generations on the natural‑image manifold, encouraging the decoder to handle arbitrary token compositions without producing artifacts.
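The swap itself is simple to sketch; a toy version, assuming 1-D token sequences and a random per-position swap mask (the function name and swap fraction are assumptions):

```python
import numpy as np

def swap_tokens(z_a, z_b, rng, swap_frac=0.5):
    """Form a mixed token sequence z_swap by copying a random subset
    of positions from z_b into z_a. The decoded image D(z_swap) has
    no ground-truth target; it is instead scored by the adversarial
    flow regularizer (L_AFM) for realism.
    """
    mask = rng.random(len(z_a)) < swap_frac   # True = take from z_b
    z_swap = np.where(mask, z_b, z_a)
    return z_swap, mask
```

In training, D(z_swap) would then be fed to the flow-based realism model ψ, whose loss penalizes decoded swaps that fall off the natural-image manifold.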

Beyond the training scheme, CompTok proposes two “generator‑free” diagnostics that evaluate the token space without training a downstream generator G. Average Information Gain (AvgIG) measures the average per‑step information gain (in bits) when optimizing a random token toward reconstructing a target image using gradient descent on the decoder’s MSE loss. High AvgIG indicates steep, informative loss landscapes and signals that the decoder is sensitive to token changes, mitigating the phenomenon of token neglect.
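The procedure can be sketched as follows. This toy version uses finite-difference gradients and measures per-step gain as the log2 ratio of successive MSE losses; the exact bit-conversion in the paper may differ, and the decoder here is a stand-in callable, so treat this as an illustration of the idea rather than the paper's definition:

```python
import numpy as np

def avg_information_gain(decode, z0, target, lr=0.1, steps=10, eps=1e-12):
    """AvgIG sketch: gradient-descend a token vector z toward a target
    image through a frozen decoder, and average the per-step loss
    reduction expressed in bits (log2 of the MSE ratio)."""
    z = z0.astype(float).copy()

    def mse(z):
        r = decode(z) - target
        return float((r * r).mean())

    gains = []
    prev = mse(z)
    for _ in range(steps):
        # Finite-difference gradient of the MSE w.r.t. z.
        g = np.zeros_like(z)
        h = 1e-5
        for i in range(len(z)):
            zp = z.copy()
            zp[i] += h
            g[i] = (mse(zp) - prev) / h
        z -= lr * g
        cur = mse(z)
        gains.append(np.log2((prev + eps) / (cur + eps)))
        prev = cur
    return float(np.mean(gains))
```

A steep, informative loss landscape yields large per-step loss ratios and hence a high AvgIG; a decoder that neglects tokens produces a flat landscape and an AvgIG near zero.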

Mode Connectivity (MC) assesses the global geometry of the token manifold. For pairs of nearby images, their encoded tokens are linearly interpolated, and the realism loss L_ψ of the decoded images along this path is examined. MC is defined as the ratio of the best endpoint realism loss to the worst loss along the path. Values close to 1 imply a low‑barrier, connected token space where interpolations remain realistic, whereas low values reveal fragmented regions that would hinder a generator’s ability to model smooth transitions.
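This definition can be sketched directly; the realism loss is a stand-in callable here, and uniform linear interpolation with a fixed number of path samples is an assumption:

```python
import numpy as np

def mode_connectivity(realism_loss, z_a, z_b, n=11):
    """MC sketch: evaluate the realism loss L_psi along the linear
    interpolation between two encoded token vectors, and return the
    ratio of the best endpoint loss to the worst loss on the path."""
    ts = np.linspace(0.0, 1.0, n)
    losses = np.array([realism_loss((1 - t) * z_a + t * z_b) for t in ts])
    return float(min(losses[0], losses[-1]) / losses.max())
```

When the path's worst loss is no worse than the endpoints, MC is 1 (no barrier); a high-loss barrier in the interior drives MC toward 0, signaling a fragmented token space.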

Extensive experiments compare CompTok‑trained tokenizers against several baselines (VQ‑VAE, TiT, SEED, etc.). CompTok consistently achieves higher AvgIG and MC scores, indicating both locally steep and globally connected token landscapes. In downstream class‑conditioned image generation, the generated‑image FID (gFID) improves markedly over baselines, confirming that the learned token space is easier for a generator to learn. Moreover, token‑swap editing demonstrates high‑level semantic control: swapping a “sky” token between images reliably changes the sky while preserving other content, and the adversarial flow regularizer prevents unrealistic artifacts that appear when swapping without this constraint.

In summary, CompTok advances visual tokenization by (1) enforcing token utilization through a mutual‑information loss, (2) expanding token compositionality via token‑swap training with flow‑based realism regularization, and (3) providing two quantitative, generator‑independent metrics (AvgIG and MC) that predict downstream generative performance. This framework yields tokenizers that are not only better at reconstruction but also more amenable to downstream generation and editing, offering a practical pathway toward more controllable and efficient visual generation systems.

