Generative Modeling of Discrete Data Using Geometric Latent Subspaces
We introduce the use of latent subspaces in the exponential parameter space of product manifolds of categorical distributions as a tool for learning generative models of discrete data. The low-dimensional latent space encodes statistical dependencies and removes redundant degrees of freedom among the categorical variables. We equip the parameter domain with a Riemannian geometry such that the spaces and distances are related by isometries, which enables consistent flow matching. In particular, geodesics become straight lines, which makes model training by flow matching effective. Empirical results demonstrate that reduced latent dimensions suffice to represent data for generative modeling.
💡 Research Summary
The paper introduces a novel framework for generative modeling of discrete, categorical data by exploiting low‑dimensional latent subspaces within the exponential‑family parameter space of product manifolds of categorical distributions. The authors first represent a joint distribution over n categorical variables (each with c categories) as a product of independent categorical factors, each parameterized by a natural parameter vector θ_j ∈ ℝ^{c‑1}. Stacking all θ_j yields a high‑dimensional parameter vector θ ∈ ℝ^{n(c‑1)}.
To compress this representation, they define a linear subspace U = span(V) ⊂ ℝ^{n(c‑1)} where V ∈ ℝ^{n(c‑1)×d} is an orthonormal basis (a point on the Stiefel manifold) and d ≪ n(c‑1). Any θ can then be expressed as θ = Vz with a low‑dimensional coordinate z ∈ ℝ^d. This construction is called Geometric PCA (GPCA). The mapping ψ(θ) = log(1 + ⟨1_{c‑1}, e^{θ}⟩) and its convex conjugate ψ* define the exponential‑family link; the gradient ∂ψ maps natural parameters to probability vectors in the simplex Δ^{c‑1}. Consequently, the image M = ∂ψ(U) is a nonlinear data manifold embedded in the product simplex Δ^{c}_n.
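As a concrete sketch of this construction (the toy sizes and the NumPy implementation are our own, not from the paper), the map ∂ψ and the subspace parameterization θ = Vz can be written as:

```python
import numpy as np

def grad_psi(theta):
    """Mean parameters of one categorical factor: for psi(theta) =
    log(1 + sum_k exp(theta_k)), grad psi gives the first c-1 entries of
    the probability vector (the reference category has implicit logit 0)."""
    m = max(theta.max(), 0.0)                 # shift for numerical stability
    e = np.exp(theta - m)
    return e / (np.exp(-m) + e.sum())

# Toy sizes (our choice): n = 4 variables, c = 3 categories, latent dim d = 2.
n, c, d = 4, 3, 2
rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.normal(size=(n * (c - 1), d)))  # orthonormal basis (Stiefel point)
z = rng.normal(size=d)
theta = (V @ z).reshape(n, c - 1)             # theta = V z, split into per-factor blocks
probs = np.array([grad_psi(t) for t in theta])
full = np.hstack([probs, 1.0 - probs.sum(axis=1, keepdims=True)])  # append reference category
```

Each row of `full` is a valid probability vector, so the image of the subspace U under ∂ψ lies in the product simplex as the summary describes.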
A key theoretical contribution is the introduction of an “e‑metric” g_e on the simplex, defined as the standard Euclidean inner product in θ‑coordinates. This metric induces a Levi‑Civita connection ∇_e whose Christoffel symbols vanish in θ‑coordinates, making geodesics straight lines: a geodesic between θ_0 and θ_1 is simply θ_t = (1‑t)θ_0 + tθ_1. Proposition 3.4 proves that ∂ψ is an isometric embedding from U (with Euclidean metric) into (Δ^{c}_n, g_e). Hence, geodesics in the latent space map to e‑geodesics in the simplex, and, by extension, to near‑geodesics on the data manifold M when the endpoints are close.
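A quick illustration (a toy example of ours, not from the paper) of what straightness in θ-coordinates means: linear interpolation in θ is exactly the e-geodesic, and its image on the simplex is in general a curved path, distinct from the Euclidean chord between the endpoint distributions:

```python
import numpy as np

def theta_to_prob(theta):
    """Full probability vector from natural parameters (reference logit 0)."""
    e = np.exp(np.append(theta, 0.0))
    return e / e.sum()

theta0, theta1 = np.array([0.0, 0.0]), np.array([2.0, -1.0])
t = 0.5
theta_mid = (1 - t) * theta0 + t * theta1   # e-geodesic midpoint: straight line in theta
p_mid = theta_to_prob(theta_mid)            # its image on the simplex
p_lin = 0.5 * (theta_to_prob(theta0) + theta_to_prob(theta1))  # Euclidean chord midpoint
# The pushed-forward geodesic bends away from the chord between the endpoints:
assert not np.allclose(p_mid, p_lin)
```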
Training the GPCA subspace involves minimizing the Bregman divergence D_{ψ*} between each data point x_i (treated as a one‑hot vector) and its reconstruction ∂ψ(Vz_i). The objective is
min_{V,z_i} (1/N) Σ_i D_{ψ*}(x_i, ∂ψ(Vz_i)).
Alternating optimization over V and the latent codes {z_i} yields a basis that captures statistical dependencies among the categorical variables.
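A minimal sketch of such an alternating scheme (the function names, step sizes, and QR retraction are our own assumptions; the paper's exact optimizer may differ). Since D_{ψ*}(x_i, ∂ψ(Vz_i)) equals the negative log-likelihood ψ(Vz_i) − ⟨x_i, Vz_i⟩ up to a constant ψ*(x_i), the sketch descends on the latter:

```python
import numpy as np

def probs_from_theta(Theta, n, c):
    """Apply grad psi factor-wise; Theta has shape (N, n*(c-1))."""
    T = Theta.reshape(-1, n, c - 1)
    m = np.maximum(T.max(axis=2, keepdims=True), 0.0)   # stability shift
    E = np.exp(T - m)
    denom = np.exp(-m[..., 0]) + E.sum(axis=2)          # reference logit 0
    return (E / denom[..., None]).reshape(Theta.shape)

def gpca_fit(X, n, c, d, steps=300, lr=0.5, seed=0):
    """Alternating gradient steps on (1/N) sum_i psi(V z_i) - <x_i, V z_i>.
    X holds one-hot data with the reference category dropped, shape (N, n*(c-1)).
    V is pulled back to the Stiefel manifold by a QR retraction after each step."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    V, _ = np.linalg.qr(rng.normal(size=(n * (c - 1), d)))
    Z = 0.01 * rng.normal(size=(N, d))
    for _ in range(steps):
        R = probs_from_theta(Z @ V.T, n, c) - X          # dLoss/dTheta = grad psi - x
        Z -= lr * (R @ V)                                # z-step (per-sample gradients)
        R = probs_from_theta(Z @ V.T, n, c) - X
        V, _ = np.linalg.qr(V - lr * (R.T @ Z) / N)      # V-step + retraction
    return V, Z

# Tiny demo: three correlated binary variables (n=3, c=2), latent dimension 2.
X = np.array([[1, 1, 0], [1, 1, 1], [0, 0, 0], [0, 0, 1]], dtype=float)
V, Z = gpca_fit(X, n=3, c=2, d=2)
recon = probs_from_theta(Z @ V.T, 3, 2)
```

Because the first two binary variables in the demo are perfectly correlated, a 2-dimensional subspace suffices and the rounded reconstructions recover the data, mirroring the dependency-capturing behavior described above.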
Having obtained a low‑dimensional latent representation, the authors apply flow matching (FM) to learn a continuous‑time transport from a simple base distribution p_0 (e.g., uniform or Gaussian on the simplex) to the empirical data distribution p_1. In the conditional flow‑matching (CFM) formulation, the loss is the expected squared norm (under the e‑metric) between the learned velocity field v_t and the time derivative of the e‑geodesic interpolant between two samples. Because e‑geodesics are linear in θ, the CFM loss simplifies to
E_{t,θ_0,θ_1}‖v_ψ(t,θ_t) – (θ_1 – θ_0)‖²,
where v_ψ(t,θ_t) is the pull‑back of the velocity field to θ‑space. This reduction allows all computations to be performed in the Euclidean latent space ℝ^d, dramatically reducing computational cost compared with methods that operate directly on the simplex with the Fisher‑Rao metric.
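To make the simplification concrete, here is a minimal conditional flow-matching loop in the flat coordinates (the linear velocity field, learning rate, and toy target distribution are our own stand-ins; the paper would use a neural velocity field on GPCA latent codes):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = np.zeros((d, d + 2))                      # linear velocity field v(theta, t) = W @ [theta, t, 1]

def cfm_step(theta1, lr=0.05):
    """One CFM step: sample the straight-line (e-geodesic) interpolant
    theta_t = (1 - t) theta_0 + t theta_1 and regress the velocity field
    onto the constant target velocity theta_1 - theta_0."""
    global W
    theta0 = rng.normal(size=theta1.shape)    # Gaussian base distribution p_0
    t = rng.uniform(size=(len(theta1), 1))
    theta_t = (1 - t) * theta0 + t * theta1
    feats = np.hstack([theta_t, t, np.ones_like(t)])
    target = theta1 - theta0
    resid = feats @ W.T - target
    W -= lr * resid.T @ feats / len(theta1)   # gradient step on the squared-error loss
    return (resid ** 2).sum(axis=1).mean()

data = rng.normal(size=(256, d)) + 2.0        # stand-in for latent codes of real data
losses = [cfm_step(data[rng.integers(0, 256, size=128)]) for _ in range(400)]
```

Every operation here is a plain Euclidean one, which is exactly the computational payoff claimed over Fisher-Rao-based approaches.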
The paper also discusses pulling back the e‑metric to the latent space Z, yielding the inner product ⟨Vz, Vz′⟩ and its induced norm ‖·‖_V; since V has orthonormal columns, this coincides with the standard Euclidean inner product on ℝ^d. The CFM objective can therefore be written entirely in terms of z, and a standard Gaussian prior p_z = N(0, I_d) can be used as the base distribution. This leads to a fully Euclidean training pipeline while still respecting the underlying information geometry of the original categorical data.
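Because V has orthonormal columns, the pulled-back inner product is ⟨Vz, Vz′⟩ = z⊤V⊤Vz′ = z⊤z′, i.e. simply the Euclidean one; a two-line numerical check (our own illustration) confirms this:

```python
import numpy as np

rng = np.random.default_rng(1)
V, _ = np.linalg.qr(rng.normal(size=(20, 5)))    # orthonormal columns: V.T @ V = I_5
z, z2 = rng.normal(size=5), rng.normal(size=5)
assert np.isclose((V @ z) @ (V @ z2), z @ z2)    # pulled-back e-metric = Euclidean metric
assert np.isclose(np.linalg.norm(V @ z), np.linalg.norm(z))
```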
Experimental validation includes:
- A synthetic 3‑dimensional hypercube (8 points), where a 2‑D GPCA subspace exactly reconstructs 6 points, surpassing linear PCA, which can only represent 4 points after rounding.
- Visualization of a 2‑D embedding of the MNIST dataset (originally 784 dimensions) using a GPCA subspace of dimension d = 30. The embedding preserves class structure and yields realistic reconstructions when decoded via ∂ψ.
- Demonstrations on two binary variables, showing how 1‑D and 2‑D GPCA subspaces capture the full joint distribution tetrahedron and its statistical dependencies.
The authors formalize an ε‑GPCA assumption: the expected squared e‑distance between a data point and its projection onto M is bounded by ε. Under this assumption, they prove that flow‑matching errors are also bounded by ε, providing a theoretical guarantee that low‑dimensional GPCA approximations suffice for accurate generative modeling.
In summary, the paper contributes:
- A geometric PCA framework that respects the exponential‑family structure of categorical data, yielding compact latent representations that capture inter‑variable dependencies.
- An e‑metric based Riemannian geometry that turns geodesics into straight lines, simplifying flow‑matching training to Euclidean operations.
- Empirical evidence that substantial dimensionality reduction (e.g., from 784 to 30) does not sacrifice generative quality for high‑dimensional discrete datasets.
Limitations include the need for a sufficiently expressive latent dimension d to capture complex dependencies, and the reliance on the e‑metric which may not be optimal for all data regimes. Future work could explore nonlinear latent manifolds (e.g., kernel GPCA), extensions to other discrete modalities such as text or graphs, and hybrid metrics that combine the computational benefits of the e‑metric with the statistical optimality of the Fisher‑Rao metric.