Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Text-to-image (T2I) models have achieved remarkable success in generating high-fidelity images, but they often fail to handle complex spatial relationships, e.g., spatial perception, reasoning, or interaction. These critical aspects are largely overlooked by current benchmarks due to their short, information-sparse prompt design. In this paper, we introduce SpatialGenEval, a new benchmark designed to systematically evaluate the spatial intelligence of T2I models, covering two key aspects. (1) SpatialGenEval comprises 1,230 long, information-dense prompts across 25 real-world scenes. Each prompt integrates 10 spatial sub-domains and 10 corresponding multiple-choice question-answer pairs, ranging from object position and layout to occlusion and causality. Our extensive evaluation of 21 state-of-the-art models reveals that higher-order spatial reasoning remains a primary bottleneck. (2) To demonstrate that the utility of our information-dense design goes beyond evaluation, we also construct the SpatialT2I dataset. It contains 15,400 text-image pairs with rewritten prompts that ensure image consistency while preserving information density. Fine-tuning current foundation models (i.e., Stable Diffusion-XL, UniWorld-V1, OmniGen2) on it yields consistent performance gains (+4.2%, +5.7%, +4.4%) and more realistic spatial relations, highlighting a data-centric paradigm for achieving spatial intelligence in T2I models.


💡 Research Summary

The paper addresses a critical gap in the evaluation of text‑to‑image (T2I) generation models: while recent diffusion‑based systems excel at producing high‑fidelity objects, they struggle with complex spatial relationships such as precise positioning, orientation, occlusion, and causal interaction. Existing benchmarks rely on short, information‑sparse prompts and coarse metrics (e.g., binary presence checks), which fail to probe a model’s ability to understand “where,” “how,” and “why” objects appear as described.

To remedy this, the authors introduce SpatialGenEval, a novel benchmark that systematically assesses spatial intelligence across four hierarchical domains—Spatial Foundation, Perception, Reasoning, and Interaction—further divided into ten sub‑domains (object category, attribute, position, orientation, layout, proximity, occlusion, comparison, motion, causality). They construct 1,230 long, information‑dense prompts covering 25 real‑world scenes; each prompt embeds constraints from all ten sub‑domains, resulting in prompts of roughly 150‑200 words. For every prompt, ten carefully crafted multiple‑choice questions are generated, each with an additional "None" option to avoid forced guessing when the generated image does not satisfy the question. Human verification ensures that the questions do not leak their answers.
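Concretely, each benchmark item pairs one dense prompt with ten sub-domain questions, each carrying a "None" escape option. A minimal sketch of such a schema (the class and field names below are illustrative, not the paper's actual data format):

```python
from dataclasses import dataclass, field

# The ten spatial sub-domains described in the paper
SUB_DOMAINS = [
    "category", "attribute", "position", "orientation", "layout",
    "proximity", "occlusion", "comparison", "motion", "causality",
]

@dataclass
class MCQuestion:
    sub_domain: str          # one of SUB_DOMAINS
    question: str
    choices: list            # candidate answers
    answer: str              # ground-truth choice

    def __post_init__(self):
        # Every question carries a "None" option so the grader is not
        # forced to guess when the image fails to satisfy the constraint.
        if "None" not in self.choices:
            self.choices.append("None")

@dataclass
class BenchmarkItem:
    scene: str               # one of the 25 real-world scenes
    prompt: str              # long, information-dense prompt (~150-200 words)
    questions: list = field(default_factory=list)  # ten MCQuestion objects

q = MCQuestion("occlusion", "Which object partially hides the chair?",
               ["the table", "the lamp"], "the table")
item = BenchmarkItem("kitchen", "A long, information-dense prompt ...", [q])
```

Here the "None" option is appended automatically in `__post_init__`, mirroring the benchmark's rule that every question allows the grader to decline.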

Evaluation is performed by feeding the generated images (not the prompts) to a large multimodal language model (e.g., Qwen2.5‑VL‑72B) which selects the best answer. This image‑dependent evaluation eliminates answer‑leakage and provides fine‑grained diagnostics per sub‑domain.
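This image-dependent protocol can be sketched as a simple scoring loop. In the hedged sketch below, the `query_vlm` callable stands in for an actual multimodal model such as Qwen2.5-VL-72B (which is not invoked here); questions are shown as plain dictionaries for brevity:

```python
from collections import defaultdict

def evaluate_image(image_path, questions, query_vlm):
    """Score one generated image: ask the VLM each multiple-choice
    question about the image (never the text prompt) and compare its
    pick to the ground truth. Returns per-sub-domain correctness."""
    results = {}
    for q in questions:
        # query_vlm(image, question, choices) -> chosen option string
        chosen = query_vlm(image_path, q["question"], q["choices"])
        results[q["sub_domain"]] = (chosen == q["answer"])
    return results

def aggregate(per_image_results):
    """Average accuracy per sub-domain across all evaluated images."""
    totals, correct = defaultdict(int), defaultdict(int)
    for res in per_image_results:
        for domain, ok in res.items():
            totals[domain] += 1
            correct[domain] += int(ok)
    return {d: correct[d] / totals[d] for d in totals}

# Toy stand-in VLM that always picks the first choice
stub_vlm = lambda img, question, choices: choices[0]
qs = [
    {"sub_domain": "position", "question": "Where is the cup?",
     "choices": ["on the table", "None"], "answer": "on the table"},
    {"sub_domain": "occlusion", "question": "What hides the chair?",
     "choices": ["the lamp", "None"], "answer": "None"},
]
scores = aggregate([evaluate_image("img_001.png", qs, stub_vlm)])
# scores -> {"position": 1.0, "occlusion": 0.0}
```

Because only the image reaches the grader, a correct answer cannot be recovered from the prompt text, which is what rules out answer leakage and yields the per-sub-domain diagnostics described above.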

The benchmark is used to test 21 state‑of‑the‑art T2I models, including Stable Diffusion‑XL, DALL·E 3, Midjourney V6, and several proprietary systems. Results reveal a clear pattern: models achieve >90 % accuracy on basic object and attribute generation (Spatial Foundation) but drop sharply to 60‑70 % on perception tasks (position, orientation, layout) and further to 40‑55 % on reasoning and interaction tasks (proximity, occlusion, causality). The most pronounced weakness lies in higher‑order spatial reasoning, where relative distance comparisons, 3‑D depth inference, and cause‑effect relationships are frequently mis‑rendered.

To demonstrate that the dense‑prompt design can also improve models, the authors create SpatialT2I, a data‑centric dataset derived from the same design principles. They generate images for 1,230 prompts using 14 top‑performing open‑source T2I models, collect 15,400 image‑text pairs, and then refine the prompts with Gemini 2.5 Pro to improve text‑image alignment while preserving information density. This dataset is used to fine‑tune three leading foundation models: Stable Diffusion‑XL, UniWorld‑V1, and OmniGen2. After 10‑20 epochs of fine‑tuning, all three models exhibit consistent gains of 4‑6 % in overall benchmark accuracy, with especially notable improvements (8‑12 % absolute) on the reasoning and interaction sub‑domains. Qualitative inspection confirms more realistic spatial relations, such as correct occlusion ordering and plausible causal effects.

The paper discusses several limitations. Prompt creation relies heavily on human designers, introducing potential bias; the evaluation depends on the competence of the multimodal LLM, which may vary across tasks; and the study focuses on static 2‑D images, leaving 3‑D scenes and video generation for future work. The authors suggest future directions including automated prompt synthesis, multimodal evaluation with depth maps or video, and architectural innovations that embed explicit spatial reasoning modules (e.g., graph neural networks or spatial transformers).

In summary, SpatialGenEval provides the first large‑scale, fine‑grained benchmark for spatial intelligence in T2I models, and SpatialT2I demonstrates that a data‑centric approach can meaningfully elevate a model’s ability to render complex spatial relationships. This work paves the way for the next generation of generative models that can answer not only “what” but also “where,” “how,” and “why.”

