Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training
Recent advances in text-to-vision generation excel in visual fidelity but struggle with compositional generalization and semantic alignment. Existing datasets are noisy and weakly compositional, limiting models’ understanding of complex scenes, while scalable solutions for dense, high-quality annotations remain a challenge. We introduce Generate Any Scene, a data engine that systematically enumerates scene graphs representing the combinatorial array of possible visual scenes. Generate Any Scene dynamically constructs scene graphs of varying complexity from a structured taxonomy of objects, attributes, and relations. Given a sampled scene graph, Generate Any Scene translates it into a caption for text-to-image or text-to-video generation; it also translates it into a set of visual question-answer pairs that allow automatic evaluation and reward modeling of semantic alignment. Using Generate Any Scene, we first design a self-improving framework where models iteratively enhance their performance using generated data. Stable Diffusion v1.5 achieves an average 4% improvement over baselines, surpassing fine-tuning on CC3M. Second, we design a distillation algorithm to transfer specific strengths from proprietary models to their open-source counterparts. Using fewer than 800 synthetic captions, we fine-tune Stable Diffusion v1.5 and achieve a 10% increase in TIFA score on compositional and hard concept generation. Third, we create a reward model to align model generation with semantic accuracy at a low cost. Using the GRPO algorithm, we fine-tune SimpleAR-0.5B-SFT and surpass CLIP-based methods by +5% on DPG-Bench. Finally, we apply these ideas to the downstream task of content moderation, where we train models to identify challenging cases by learning from synthetic data.
💡 Research Summary
The paper introduces Generate Any Scene (GAS), a data engine that systematically enumerates scene graphs to produce high‑quality synthetic captions, visual question‑answer (QA) pairs, and scene attributes for text‑to‑image and text‑to‑video generation. The authors first construct a massive taxonomy comprising 28,787 objects, 1,494 attributes, 10,492 relations, and 2,193 global scene attributes sourced from WordNet, Wikipedia, Synthetic Visual Genome, and Places365. Using this taxonomy, GAS can generate virtually unlimited scene graphs of varying structural complexity.
The generation pipeline consists of five steps: (1) enumerate graph topologies under user‑specified constraints (number of nodes, average degree, connected components) and filter implausible configurations with commonsense rules; (2) populate each node, attribute, and edge by sampling from the taxonomy; (3) sample scene‑level attributes such as artistic style, viewpoint, or temporal span; (4) deterministically convert the populated graph into a coherent caption using a rule‑based grammar that avoids duplication and resolves references; (5) automatically generate exhaustive QA pairs covering every object, attribute, and relation via templated questions. The deterministic caption conversion yields low hallucination rates, while optional LLM paraphrasing does not significantly affect downstream performance.
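The five steps above can be sketched in miniature. The toy taxonomy, chain topology, and template wording below are illustrative stand-ins, not the paper's full taxonomy or grammar:

```python
import random

# Toy taxonomy standing in for the paper's full one
# (28,787 objects, 1,494 attributes, 10,492 relations).
OBJECTS = ["dog", "bicycle", "lamp", "child"]
ATTRIBUTES = ["red", "wooden", "small", "shiny"]
RELATIONS = ["next to", "on top of", "behind"]
SCENE_ATTRS = ["oil painting", "aerial view", "at dusk"]

def sample_scene_graph(num_nodes=3, seed=None):
    """Steps 1-3: sample a topology, then populate nodes, edges, and scene.
    A simple chain keeps the graph connected; the real engine enumerates
    topologies under node-count, degree, and component constraints."""
    rng = random.Random(seed)
    nodes = [
        {"object": rng.choice(OBJECTS), "attribute": rng.choice(ATTRIBUTES)}
        for _ in range(num_nodes)
    ]
    edges = [(i, i + 1, rng.choice(RELATIONS)) for i in range(num_nodes - 1)]
    return {"nodes": nodes, "edges": edges, "scene": rng.choice(SCENE_ATTRS)}

def graph_to_caption(g):
    """Step 4: deterministic, rule-based caption (no LLM required)."""
    phrases = [f"a {n['attribute']} {n['object']}" for n in g["nodes"]]
    clauses = [f"{phrases[i]} {rel} {phrases[j]}" for i, j, rel in g["edges"]]
    return f"{'; '.join(clauses)}, {g['scene']}"

def graph_to_qa(g):
    """Step 5: templated QA pairs covering every attribute and relation."""
    qas = [(f"Is the {n['object']} {n['attribute']}?", "yes")
           for n in g["nodes"]]
    for i, j, rel in g["edges"]:
        qas.append((f"Is the {g['nodes'][i]['object']} {rel} "
                    f"the {g['nodes'][j]['object']}?", "yes"))
    return qas

g = sample_scene_graph(num_nodes=3, seed=0)
print(graph_to_caption(g))
for q, a in graph_to_qa(g):
    print(q, "->", a)
```

Because the caption is produced by deterministic rules rather than a language model, every phrase is traceable to a graph element, which is what keeps hallucination rates low.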
The authors leverage GAS for three major research directions.
- Self‑Improving Models – They generate 30 K synthetic captions across three epochs. For each caption, Stable Diffusion v1.5 (SD‑v1.5) creates eight images; the image with the highest VQA score (a proxy for semantic alignment) is kept. The top 25 % of these (2.5 K pairs per epoch) are used to fine‑tune SD‑v1.5 via LoRA. Compared with an identical‑size fine‑tuning on real CC3M data, the GAS‑fine‑tuned model achieves an average 4 % boost in CLIPScore and ImageReward while preserving LPIPS diversity. On a held‑out test set of unseen compositional combinations, the GAS‑trained model outperforms both the baseline and CC3M‑fine‑tuned versions, demonstrating superior combinatorial generalization.
- Targeted Distillation – By evaluating proprietary models (e.g., DALL‑E 3) on the synthetic data, the authors identify specific capabilities that open‑source models lack, such as generating complex multi‑object compositions. They then fine‑tune SD‑v1.5 on a small set (<800) of GAS‑generated captions that emphasize these missing patterns. This targeted distillation yields a 10 % increase in TIFA score on compositional and hard‑concept benchmarks, showing that a few hundred high‑quality synthetic examples can transfer nuanced skills from closed‑source systems.
- Low‑Cost Reward Modeling – Using the exhaustive QA pairs, the authors train a scene‑graph‑based reward model with the Group Relative Policy Optimization (GRPO) algorithm. Fine‑tuning SimpleAR‑0.5B‑SFT with this reward leads to a 5 % improvement over CLIP‑based reward methods on DPG‑Bench, indicating that precise, graph‑grounded rewards can align generation with semantics more effectively than generic image‑text similarity scores.
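The common mechanism behind the self-improving loop and the reward model is scoring an image by the fraction of scene-graph QA pairs a VQA model answers correctly. A minimal sketch, assuming access to some off-the-shelf VQA model (`vqa_answer` below is a hypothetical callable, not the paper's exact implementation):

```python
def vqa_score(image, qa_pairs, vqa_answer):
    """Fraction of scene-graph QA pairs answered correctly for `image`.
    `vqa_answer(image, question) -> str` is any off-the-shelf VQA model.
    This score serves both as a best-of-N selection criterion and as a
    semantic-alignment reward signal (e.g., for GRPO)."""
    correct = sum(
        vqa_answer(image, q).strip().lower() == a.lower()
        for q, a in qa_pairs
    )
    return correct / len(qa_pairs)

def best_of_n(images, qa_pairs, vqa_answer):
    """Self-improving loop: keep the highest-scoring of N generations."""
    return max(images, key=lambda im: vqa_score(im, qa_pairs, vqa_answer))

def top_fraction(scored_triples, frac=0.25):
    """Keep the top-scoring fraction of (caption, image, score) triples
    for fine-tuning (the paper keeps the top 25 % per epoch)."""
    ranked = sorted(scored_triples, key=lambda t: t[-1], reverse=True)
    return ranked[: max(1, int(len(ranked) * frac))]
```

Because the QA pairs cover every object, attribute, and relation in the graph, the resulting score is a much finer-grained signal than a single image-text similarity value such as CLIPScore.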
Finally, the paper applies GAS to content moderation. Synthetic captions describing rare or adversarial compositions are used to augment training data for a ViT‑T detector. The enriched detector shows markedly better cross‑model and cross‑dataset performance, highlighting the practical security benefits of generating diverse, compositional training examples.
Overall, Generate Any Scene demonstrates that a systematic, graph‑driven synthesis pipeline can (i) scale to produce virtually unlimited, richly annotated data; (ii) provide automatic, fine‑grained evaluation signals via QA; and (iii) enable self‑improvement, targeted knowledge transfer, and cost‑effective reward modeling. The extensive experiments confirm that synthetic data alone can surpass traditional web‑crawled datasets in improving compositional fidelity and semantic alignment of text‑to‑vision models. Future work may extend GAS to 3D scene synthesis, video‑sequence generation, and multimodal conversational agents.