Harpoon: Generalised Manifold Guidance for Conditional Tabular Diffusion

Generating tabular data under conditions is critical to applications requiring precise control over the generative process. Existing methods rely on training-time strategies that do not generalise to unseen constraints during inference, and struggle to handle conditional tasks beyond tabular imputation. While manifold theory offers a principled way to guide generation, current formulations are tied to specific inference-time objectives and are limited to continuous domains. We extend manifold theory to tabular data and expand its scope to handle diverse inference-time objectives. On this foundation, we introduce HARPOON, a tabular diffusion method that guides unconstrained samples along the manifold geometry to satisfy diverse tabular conditions at inference. We validate our theoretical contributions empirically on tasks such as imputation and enforcing inequality constraints, demonstrating HARPOON's strong performance across diverse datasets and the practical benefits of manifold-aware guidance for tabular data. Code URL: https://github.com/adis98/Harpoon


💡 Research Summary

The paper tackles the problem of conditional generation of tabular data, a task that is essential for applications such as missing‑value imputation, “what‑if” scenario simulation, and policy‑compliant data synthesis. Existing conditional diffusion approaches for tabular data either embed the condition into the model during training (training‑time conditioning) or rely on auxiliary classifiers. Both strategies suffer from poor generalisation to unseen constraints at inference time and cannot easily handle non‑label constraints such as inequality relations on continuous features.

To overcome these limitations, the authors propose a theoretically grounded method called HARPOON that leverages the geometric structure of diffusion models. They start by assuming that the support of the data distribution lies on a smooth, low‑dimensional manifold M₀ embedded in the high‑dimensional ambient space of mixed continuous and categorical variables. Categorical features are represented by soft one‑hot encodings, turning the whole dataset into a continuous space where manifold assumptions are applicable.
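The soft one-hot relaxation can be sketched as follows. This is a minimal illustration, not the paper's exact encoding: integer category labels are mapped near simplex vertices by blending a hard one-hot vector with the uniform distribution (the `temperature` parameter is an assumption of this sketch), so categorical columns live in a continuous space where the manifold assumptions apply.

```python
import numpy as np

def soft_one_hot(labels, num_classes, temperature=0.1):
    """Relax integer labels into continuous 'soft' one-hot rows.

    Sketch only: blends hard one-hot vectors with the uniform
    distribution so every entry stays strictly inside (0, 1),
    keeping each row on (the interior of) the probability simplex.
    """
    hard = np.eye(num_classes)[labels]                 # standard one-hot
    return (1.0 - temperature) * hard + temperature / num_classes
```

Each row still sums to 1 and its argmax recovers the original label, so the encoding is invertible for downstream decoding.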

The key theoretical contributions are two theorems. Theorem 3.1 shows that, for a diffusion model trained with the standard mean-squared-error (MSE) loss, the "dirty estimate" Qₜ(xₜ), obtained by reversing the forward noising equation, converges to the orthogonal projection π(xₜ) of the noisy sample onto M₀ as the noise level ᾱₜ approaches 1 (i.e., when the diffusion step is near the clean end). This establishes that the model implicitly learns an orthogonal projector onto the data manifold.
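In standard DDPM notation the forward equation is xₜ = √ᾱₜ·x₀ + √(1−ᾱₜ)·ε, so the dirty estimate is obtained by solving for x₀ with the model's noise prediction in place of ε. A small sketch (the function name `dirty_estimate` is ours):

```python
import numpy as np

def dirty_estimate(x_t, eps_hat, alpha_bar_t):
    """Invert x_t = sqrt(a)*x_0 + sqrt(1-a)*eps for x_0, using the
    model's noise prediction eps_hat in place of the true noise.
    Per Theorem 3.1, this estimate approaches the orthogonal
    projection of x_t onto the data manifold as alpha_bar -> 1."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)
```

When the noise prediction is exact, the inversion recovers x₀ identically; in practice it yields a one-step approximation of the clean sample.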

Theorem 3.2 builds on this result and proves that for any differentiable inference‑time loss L_inf(·; c) (e.g., reconstruction loss, cross‑entropy, or a penalty for violating an inequality), the gradient of the loss with respect to the noisy sample lies in the tangent space T_{x̂₀}M₀ of the manifold at the projected point x̂₀ = Qₜ(xₜ). In other words, regardless of the specific form of the condition, the guidance direction is always tangential to the manifold. This generalises earlier work that was limited to squared‑error losses and flat manifolds.

Armed with these insights, HARPOON interleaves ordinary unconditional denoising steps with tangential gradient corrections. At each diffusion step t, the algorithm:

  1. Performs the standard denoising update p_θ(x_{t‑1}|x_t).
  2. Computes the dirty estimate x̂₀ = Qₜ(x_t).
  3. Evaluates the user‑specified loss L_inf(x̂₀; c) and its gradient.
  4. Projects this gradient onto the tangent space of M₀ (at x̂₀) and adds it to x_t before the next denoising step.
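The four steps above can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: `denoise`, `eps_model`, `loss_grad`, and `tangent_basis` are hypothetical stand-ins for the trained reverse step, the noise predictor, the gradient of L_inf, and an orthonormal basis of the tangent space at x̂₀, and `step_size` is an assumed guidance scale.

```python
import numpy as np

def harpoon_step(x_t, t, denoise, eps_model, alpha_bar, loss_grad,
                 tangent_basis=None, step_size=0.1):
    """One HARPOON iteration (sketch): denoise, form the dirty
    estimate, evaluate the inference-time loss gradient, and apply
    a tangent-space-projected correction."""
    x_prev = denoise(x_t, t)                                    # step 1
    a = alpha_bar[t]
    x0_hat = (x_t - np.sqrt(1.0 - a) * eps_model(x_t, t)) / np.sqrt(a)  # step 2
    g = loss_grad(x0_hat)                                       # step 3
    if tangent_basis is not None:                               # step 4
        V = tangent_basis(x0_hat)     # columns span T_{x0_hat} M_0
        g = V @ (V.T @ g)             # orthogonal projection onto the tangent space
    return x_prev - step_size * g
```

Projecting via V(Vᵀg) is the standard orthogonal projector for an orthonormal basis V; per Theorem 3.2 the loss gradient already lies in the tangent space, so in theory this projection is a no-op, but enforcing it explicitly guards against approximation error in x̂₀.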

Because the correction stays within the tangent space, the sample remains close to the current noisy manifold M_t and is steered toward regions that satisfy the condition without “jumping” off the manifold—a problem that plagued naïve constraint enforcement.

The authors validate HARPOON on several public tabular benchmarks (Adult, Credit, Health, Census, etc.) under two representative conditional tasks:

  • Imputation – a binary mask indicates observed entries; the loss penalises deviation of the reconstructed observed values.
  • Inequality constraints – e.g., Age ≥ 10, Income ≤ 50 k; the loss adds a differentiable penalty for violations.
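Both losses are differentiable in x̂₀, which is all Theorem 3.2 requires. A minimal sketch of plausible forms (the function names, the squared-hinge penalty, and the column index `idx` are our illustrative choices, not necessarily the paper's):

```python
import numpy as np

def imputation_loss(x0_hat, x_obs, mask):
    """Squared deviation on observed entries (mask == 1); unobserved
    entries (mask == 0) contribute nothing and stay free."""
    return np.sum(mask * (x0_hat - x_obs) ** 2)

def inequality_penalty(x0_hat, idx, lower=None, upper=None):
    """Differentiable squared-hinge penalty for bound violations on
    column `idx`, e.g. Age >= 10 (lower) or Income <= 50k (upper)."""
    v = x0_hat[..., idx]
    pen = 0.0
    if lower is not None:
        pen += np.sum(np.maximum(lower - v, 0.0) ** 2)
    if upper is not None:
        pen += np.sum(np.maximum(v - upper, 0.0) ** 2)
    return pen
```

The squared hinge is zero whenever the constraint holds, so samples that already satisfy the condition receive no correction.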

Across all datasets, HARPOON outperforms training‑time conditional diffusion, classifier‑guided diffusion, and rejection sampling. It achieves 5–12 percentage‑point higher fidelity (measured by downstream predictive performance) and reduces constraint‑violation rates by more than 70 %. An auxiliary experiment measures the angle between loss gradients and the dirty‑estimate direction, confirming that the gradients are indeed aligned with the tangent space as predicted by Theorem 3.2.

In summary, the paper introduces a manifold‑aware, inference‑time guidance framework for conditional tabular diffusion. By proving that any differentiable loss yields a tangent‑space gradient, HARPOON can accommodate arbitrary user‑specified constraints without retraining. This opens the door to flexible, on‑the‑fly conditioning for tabular data generation, with immediate practical relevance to data augmentation, privacy‑preserving synthesis, and decision‑support simulations. Future work may extend the approach to non‑differentiable logical constraints, hierarchical conditions, or integration with latent‑space diffusion models.

