PolyGen: Fully Synthetic Vision-Language Training via Multi-Generator Ensembles
Synthetic data offers a scalable solution for vision-language pre-training, yet current state-of-the-art methods typically rely on scaling up a single generative backbone, which introduces generator-specific spectral biases and limits feature diversity. In this work, we introduce PolyGen, a framework that redefines synthetic data construction by prioritizing manifold coverage and compositional rigor over raw dataset size. PolyGen employs a Polylithic approach that trains on the intersection of architecturally distinct generators, effectively marginalizing out model-specific artifacts. Additionally, we introduce a Programmatic Hard Negative curriculum that enforces fine-grained syntactic understanding. By structurally reallocating the same data budget from unique captions to multi-source variations, PolyGen achieves a more robust feature space, outperforming the leading single-source baseline (SynthCLIP) by +19.0% on aggregate multi-task benchmarks and by +9.1% on the SugarCrepe++ compositionality benchmark. These results demonstrate that structural diversity is a more data-efficient scaling law than simply increasing the volume of single-source samples.
💡 Research Summary
PolyGen tackles the persistent “Synthetic Gap” in vision‑language model (VLM) pre‑training by moving beyond the conventional single‑generator paradigm. The authors observe that any single text‑to‑image diffusion model—whether an older, diverse model like Stable Diffusion 1.5 or a modern, high‑fidelity model like SD XL‑Turbo—covers only a limited slice of the visual‑semantic manifold, imprinting generator‑specific spectral artifacts and limiting downstream generalization. To mitigate this, PolyGen introduces a three‑stage pipeline.
First, structured caption pairs are created. Concepts are sampled from the MetaCLIP Concept Bank and paired with a semantic axis (e.g., lighting, material, viewpoint, color, background, position, style, or a full concept swap). A Mistral‑7B model generates the base caption T⁺ conditioned on the (concept, axis) tuple. A separate Llama 3.1‑8B model then produces a hard‑negative caption T⁻ by altering only the specified axis while preserving syntactic structure. This controlled perturbation yields semantically coherent counterfactuals, avoiding the noise of unconstrained LLM rewriting.
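The (concept, axis) perturbation can be sketched with a toy, rule‑based stand‑in for the two LLMs. The attribute vocabularies and caption template below are hypothetical illustrations, not the paper's actual prompts; in PolyGen, Mistral‑7B writes T⁺ and Llama 3.1‑8B rewrites only the chosen axis to produce T⁻:

```python
import random

# Hypothetical attribute vocabularies for a few semantic axes.
AXES = {
    "color":    ["red", "blue", "green", "yellow"],
    "material": ["wooden", "metal", "glass", "plastic"],
    "lighting": ["sunlit", "dimly lit", "neon-lit", "moonlit"],
}

def make_caption_pair(concept: str, axis: str, rng: random.Random):
    """Return (T_plus, T_minus): two captions that share syntactic
    structure and differ only in the value of the chosen axis."""
    pos, neg = rng.sample(AXES[axis], 2)        # two distinct attribute values
    template = "a photo of a {attr} {concept}"  # structure is held fixed
    return (template.format(attr=pos, concept=concept),
            template.format(attr=neg, concept=concept))

rng = random.Random(0)
t_pos, t_neg = make_caption_pair("chair", "material", rng)
```

The key property, which an LLM-based rewriter must also preserve, is that T⁺ and T⁻ are minimal pairs: everything except the targeted axis is identical, so the contrastive signal isolates the intended distinction.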
Second, each caption pair is rendered through an ensemble of four architecturally distinct diffusion models: two “Diversity Experts” (Stable Diffusion 1.5 and 2) that provide high intra‑concept variance, and two “Recognizability Experts” (SD XL‑Turbo and SANA‑1.6B) that ensure strong prompt adherence and photorealism. By treating the multiple images generated for a single caption as positive examples, PolyGen enforces a Multi‑Positive contrastive objective that distributes probability mass uniformly across all n⁺ = 4 images, aligning the text embedding with the semantic centroid of the diverse visual manifold.
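A minimal sketch of the multi‑positive objective described above, in pure Python: the softmax target over in‑batch images spreads probability mass uniformly across the n⁺ = 4 renderings of the same caption. The similarity values are made‑up placeholders for temperature‑scaled text‑image cosine similarities, and the exact loss formulation here is an assumption about how "uniform multi‑positive" is realized:

```python
import math

def multi_positive_loss(sims, positive_idx):
    """Cross-entropy between the softmax over image similarities and a
    target distribution that is uniform over the positive images.
    sims: similarity of one text embedding to each in-batch image.
    positive_idx: indices of the n+ images rendered from this caption."""
    z = sum(math.exp(s) for s in sims)                       # softmax partition
    log_probs = [s - math.log(z) for s in sims]              # log softmax
    n_pos = len(positive_idx)
    return -sum(log_probs[i] for i in positive_idx) / n_pos  # uniform target

# One caption rendered by 4 generators (indices 0-3) among 8 in-batch images.
loss = multi_positive_loss([3.0, 2.8, 3.1, 2.9, 0.2, -0.5, 0.1, -1.0],
                           [0, 1, 2, 3])
# Raising the positives' similarities should lower the loss.
better = multi_positive_loss([5.0, 5.0, 5.0, 5.0, 0.2, -0.5, 0.1, -1.0],
                             [0, 1, 2, 3])
```

Because the target is uniform over all four renderings, the text embedding is pulled toward their semantic centroid rather than toward any single generator's output, which is what discourages latching onto generator‑specific artifacts.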
Third, a curriculum‑based contrastive training regime is applied. In addition to the multi‑positive loss, an image‑to‑image regularizer (L_I2I) penalizes the model for learning generator‑specific idiosyncrasies, encouraging invariance to spectral signatures. Hard negatives are incorporated via a Triplet‑CLIP loss that explicitly separates the base concept from its counterfactual. To avoid early instability, a scheduler linearly ramps the proportion of hard negatives p from 0 to 0.5 during training, allowing the model to first acquire coarse semantic clusters before refining fine‑grained discriminative boundaries.
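The hard‑negative schedule is simple enough to state directly. The sketch below assumes a linear ramp over a fixed number of optimizer steps; the paper specifies only that p rises linearly from 0 to 0.5, so the ramp length `ramp_steps` is a hypothetical parameter:

```python
def hard_negative_ratio(step: int, ramp_steps: int, p_max: float = 0.5) -> float:
    """Fraction of hard negatives to mix into the contrastive batch.
    Ramps linearly from 0 to p_max over the first `ramp_steps` steps,
    then holds constant, letting the model form coarse semantic clusters
    before it must resolve fine-grained counterfactual distinctions."""
    return min(p_max, p_max * step / ramp_steps)
```

At step 0 no hard negatives are sampled, at the midpoint of the ramp a quarter of the batch is hard negatives, and beyond the ramp the ratio saturates at 0.5.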
Extensive experiments show that PolyGen outperforms the state‑of‑the‑art single‑source synthetic pipeline SynthCLIP by a +19.0% relative gain across a suite of multi‑task benchmarks (zero‑shot classification, image‑text retrieval, image captioning) and achieves a +9.1% absolute improvement on the compositionality‑focused SugarCrepe++ benchmark. Ablation studies confirm that both diversity and recognizability experts are necessary: removing either reduces performance and re‑introduces generator bias.
The paper concludes that structural diversity—the deliberate combination of heterogeneous generators and programmatic hard negatives—constitutes a more data‑efficient scaling law than merely increasing the volume of samples from a single source. PolyGen demonstrates that fully synthetic data, when engineered for manifold coverage and compositional rigor, can rival real‑world datasets in training large‑scale VLMs, opening pathways for privacy‑preserving, scalable, and bias‑controlled multimodal model development. Future work may explore larger ensembles, automated curriculum design, and extensions to video or 3‑D modalities.