XFACTORS: Disentangled Information Bottleneck via Contrastive Supervision

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Disentangled representation learning aims to map independent factors of variation to independent representation components. On one hand, purely unsupervised approaches have proven successful on fully disentangled synthetic data, but fail to recover semantic factors from real data without strong inductive biases. On the other hand, supervised approaches are unstable and hard to scale to large attribute sets because they rely on adversarial objectives or auxiliary classifiers. We introduce \textsc{XFactors}, a weakly-supervised VAE framework that disentangles and provides explicit control over a chosen set of factors. Building on the Disentangled Information Bottleneck perspective, we decompose the representation into factor-specific subspaces $\mathcal{T}_1,\ldots,\mathcal{T}_K$ and a residual subspace $\mathcal{S}$. Each target factor is encoded in its assigned $\mathcal{T}_i$ through contrastive supervision: an InfoNCE loss pulls together latents sharing the same factor value and pushes apart mismatched pairs. In parallel, KL regularization imposes a Gaussian structure on both $\mathcal{S}$ and the aggregated factor subspaces, organizing the geometry without additional supervision for non-targeted factors and avoiding adversarial training and classifiers. Across multiple datasets, with constant hyperparameters, \textsc{XFactors} achieves state-of-the-art disentanglement scores and yields consistent qualitative factor alignment in the corresponding subspaces, enabling controlled factor swapping via latent replacement. We further demonstrate that our method scales gracefully with increasing latent capacity and evaluate it on the real-world dataset CelebA. Our code is available at \href{https://github.com/ICML26-anon/XFactors}{github.com/ICML26-anon/XFactors}.


💡 Research Summary

XFactors introduces a weakly‑supervised variational auto‑encoder (VAE) framework that achieves disentangled representation learning without relying on adversarial objectives or auxiliary classifiers. Building on the Disentangled Information Bottleneck (DIB) principle, the authors decompose the latent space Z into a residual subspace S and K factor‑specific subspaces T₁,…,T_K, such that Z = S ⊕ (⊕ₖ Tₖ). Each target factor y_fi is encoded in its dedicated subspace T_i by maximizing the mutual information I(T_i; y_fi) through a contrastive InfoNCE loss. Positive pairs share the same factor value, while negative pairs have different values, forcing latent vectors that share a factor value to cluster together and those with different values to separate. Because InfoNCE provides a lower bound on mutual information, optimizing it directly increases I(T_i; y_fi).
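As a concrete illustration, the contrastive term can be sketched as a standard InfoNCE loss over paired latents, where row i of the two batches comes from inputs that share the same value of the target factor. The sketch below (function name and temperature value are illustrative assumptions, not taken from the paper) uses NumPy:

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE over paired latents: row i of z1 and z2 share a factor value.

    Matching pairs (the diagonal of the similarity matrix) act as positives;
    every other row in the batch acts as a negative.
    """
    # Cosine similarity: normalize rows, then take inner products.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau  # (N, N) similarity logits

    # Numerically stable row-wise log-softmax.
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))

    # Cross-entropy with the diagonal (the true positive) as the target class.
    return -np.mean(np.diag(log_probs))
```

Well-aligned pairs drive the loss toward zero while mismatched pairs are penalized; since InfoNCE lower-bounds mutual information, this is what pushes I(T_i; y_fi) upward.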

In parallel, KL‑divergence regularization is applied to both S (with weight β_s) and the aggregated factor subspace T = ⊕ₖ Tₖ (with weight β_t), imposing an isotropic Gaussian prior. This regularization organizes the latent distribution, encourages independence between S and T (minimizing I(S; T)), and preserves information about non‑targeted factors in the residual subspace without any explicit supervision. The overall training objective is

L = L_reco + β_s·L_KL^S + β_t·L_KL^T + Σ_i λ_i·L_InfoNCE^i,

where L_reco is a mean‑squared reconstruction loss, L_KL^S and L_KL^T are the KL terms for S and T, and L_InfoNCE^i is the contrastive loss for factor i. This formulation mirrors the DIB objective L_DIB = −Σ_i I(T_i; y_fi) − I(X; (S,Y)) + I(T; S), but replaces the adversarial mutual‑information minimization with constructive KL regularization and contrastive supervision.
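Under the usual diagonal-Gaussian parameterization, both KL terms have a closed form and the objective above can be assembled directly. A minimal sketch (variable names, the log-variance parameterization, and the per-sample averaging convention are assumptions for illustration):

```python
import numpy as np

def kl_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)), averaged over the batch."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0) / mu.shape[0]

def xfactors_loss(x, x_hat, mu_s, lv_s, mu_t, lv_t, nce_losses,
                  beta_s=1.0, beta_t=1.0, lambdas=None):
    """L = L_reco + beta_s * L_KL^S + beta_t * L_KL^T + sum_i lambda_i * L_InfoNCE^i."""
    lambdas = lambdas if lambdas is not None else [1.0] * len(nce_losses)
    l_reco = np.mean((x - x_hat) ** 2)        # mean-squared reconstruction
    l_kl_s = kl_standard_normal(mu_s, lv_s)   # residual subspace S
    l_kl_t = kl_standard_normal(mu_t, lv_t)   # aggregated factor subspace T
    l_nce = sum(lam * n for lam, n in zip(lambdas, nce_losses))
    return l_reco + beta_s * l_kl_s + beta_t * l_kl_t + l_nce
```

With a perfect reconstruction, standard-normal posteriors, and zero contrastive losses every term vanishes, which is a quick sanity check on the implementation.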

Architecturally, two parallel encoders ψ_s and ψ_t produce Gaussian parameters (μ_s, σ_s) and (μ_t, σ_t) for S and T respectively. Samples z_s ∼ N(μ_s, σ_s) and z_t ∼ N(μ_t, σ_t) are concatenated and fed to a decoder ϕ that reconstructs the input image. During inference, the model can perform controlled factor swapping: the residual code z_s is taken from a source image, while a specific factor code z_t,i is taken from a target image; decoding the combined vector yields an image where only the chosen attribute has changed, confirming that each T_i indeed captures a single semantic factor.
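Assuming each factor subspace T_i occupies a contiguous slice of the concatenated factor code (an illustrative layout choice, not specified in the summary), the swap itself reduces to a slice replacement before decoding:

```python
import numpy as np

def swap_factor(z_s_src, z_t_src, z_t_tgt, i, dims_per_factor):
    """Build a decoder input whose i-th factor code comes from the target image.

    z_s_src: residual code of the source image (everything we keep unchanged).
    z_t_src / z_t_tgt: concatenated factor codes of the source and target images.
    """
    z_t = z_t_src.copy()
    sl = slice(i * dims_per_factor, (i + 1) * dims_per_factor)
    z_t[sl] = z_t_tgt[sl]                  # replace only the chosen T_i
    return np.concatenate([z_s_src, z_t])  # this vector is fed to the decoder
```

Decoding `swap_factor(...)` should then produce an image in which only the attribute encoded in T_i has changed, which is the qualitative check the paper uses to confirm that each subspace captures a single semantic factor.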

The authors evaluate XFactors on synthetic benchmarks (dSprites, Shapes3D, MPI3D) and the real‑world CelebA dataset, using standard disentanglement metrics such as Mutual Information Gap (MIG), Separated Attribute Predictability (SAP), and DCI. With a single set of hyperparameters across all experiments, XFactors consistently outperforms state‑of‑the‑art methods including β‑VAE, FactorVAE, β‑TCVAE, and DisCo. Notably, performance remains stable as the number of factors K grows or as the total latent dimensionality increases, demonstrating the scalability of the subspace decomposition and KL regularization scheme.

Key contributions of the paper are:

  1. Integration of DIB with contrastive learning – providing explicit control over which factors are disentangled without adversarial training.
  2. Residual subspace S – automatically absorbs all non‑targeted, unlabeled variations, preserving them in the representation while keeping the factor subspaces clean.
  3. Stable and scalable training – only KL and InfoNCE losses are required, avoiding the instability of min‑max games and the computational overhead of auxiliary classifiers.
  4. Demonstrated applicability to real data – factor‑specific swapping on CelebA shows high‑quality, semantically meaningful edits, confirming practical utility.

The paper suggests future directions such as introducing cross‑subspace regularizers to capture interactions between factors, extending the framework to domains with extremely scarce labels (e.g., biomedical imaging), and combining XFactors with large pre‑trained transformer backbones for high‑resolution or video data. Overall, XFactors offers a principled, efficient, and easily extensible approach to weakly‑supervised disentangled representation learning.

