A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality, and instructional richness of current training datasets. To address this, we introduce InterSyn, a dataset that features: (1) large scale, comprising 1.8M multimodal samples; (2) high quality, supported by our proposed Self-Evaluation with Iterative Refinement (SEIR) method for rigorous automated quality refinement; (3) rich instructional diversity, ensured through diverse well-designed question templates, based on human preferences and covering a 3500-topic hierarchy. These characteristics make InterSyn particularly well-suited for training LMMs in interactive image-text generation capabilities. To evaluate the capabilities, we propose SynJudge, a reliable automatic evaluator that aligns closely with human judge and outputs four interpretable scores: Text Content Completeness (TCC), Image Content Completeness (ICC), Image Quality (IQ), and Image-Text Synergy (ITS). These scores are complementary, covering both content and quality as well as cross-modal interaction, thereby forming a comprehensive evaluation framework. Experimental results on InterSyn subsets of up to 200K samples show that 25K-50K already yield substantial improvements, while scaling to 100K/200K brings further gains in TCC, ICC, and especially ITS, highlighting InterSyn’s: (1) scalability, as performance consistently improves with more data; (2) efficiency, as significant gains are achievable even with smaller subsets, making it accessible to researchers with varying computational resources.

💡 Research Summary

The paper addresses a critical gap in current large multimodal models (LMMs): the inability to generate tightly interleaved image‑text sequences that reflect realistic interactive scenarios. Existing multimodal training corpora are limited in three respects—scale, quality, and instructional richness—making it difficult for models to learn coherent, multi‑turn visual‑language interactions. To overcome these limitations, the authors introduce InterSyn, a new dataset, and SynJudge, a comprehensive automatic evaluation framework, together with a novel quality‑control pipeline called SEIR (Self‑Evaluation with Iterative Refinement).

InterSyn comprises 1.8 million multimodal samples, each organized as a sequence of alternating text and image blocks (e.g., “text → image → text → image”). The dataset is built on a 3,500‑topic hierarchy spanning seven levels of granularity (from broad domains such as science and art down to specific sub‑topics). For each topic, the authors designed 1,200 human‑preferred question templates that explicitly request images at various points in a dialogue, ensuring rich instructional diversity. The templates were derived from a large‑scale human survey (2,500 participants) that identified the most natural conversational flows.

To guarantee high quality, the authors propose SEIR, a three‑stage automated refinement loop. First, a pretrained multimodal evaluator (a BERT‑style language model combined with CLIP) scores each sample on text consistency, image relevance, and visual fidelity. Samples below a predefined threshold receive targeted feedback that is injected into the generation prompts for both the language model (GPT‑4‑Turbo) and the image generator (Stable Diffusion). The generation‑evaluation cycle repeats 3–5 times until all scores exceed 0.85, after which the sample is admitted to the final corpus. Human verification on a random 2 % subset confirms a 92 % agreement with the automated scores, demonstrating that SEIR can replace most manual curation while preserving reliability.

SynJudge expands evaluation beyond single‑dimensional metrics like CLIPScore. It outputs four interpretable scores:

Text Content Completeness (TCC) – measures whether the generated text fully satisfies the prompt’s informational requirements, using a BERT‑based QA model.
Image Content Completeness (ICC) – checks if the image contains all visual elements requested in the text, via object detection and color‑matching algorithms.
Image Quality (IQ) – combines traditional image‑quality indicators (resolution, noise, color fidelity) with a small human‑rated calibration set.
Image‑Text Synergy (ITS) – quantifies cross‑modal semantic alignment using a CLIP cross‑encoder, effectively capturing “synergy” between modalities.

When benchmarked against human expert ratings on 1,000 samples, SynJudge achieves a Pearson correlation of 0.87, outperforming existing metrics by a substantial margin.

The authors conduct extensive experiments by fine‑tuning several state‑of‑the‑art unified LMMs (Janus‑Pro, LLaVA‑2, MiniGPT‑4) on four subsets of InterSyn: 25 K, 50 K, 100 K, and 200 K samples. Results show a clear scaling trend: even the smallest subset yields 7–8 % improvements in TCC and ICC over baseline models trained on prior datasets. As the dataset grows to 100 K and 200 K, the ITS score jumps by 12.5 % and an additional 4.3 %, respectively, indicating that larger, higher‑quality data dramatically enhances the model’s ability to produce coherent, synergistic image‑text streams. Notably, the 25 K–50 K range already delivers most of the gains, demonstrating data efficiency for researchers with limited compute resources.

Ablation studies confirm the importance of each component. Removing SEIR reduces TCC/ICC by roughly 9 % and 8 %, while halving the diversity of question templates cuts ITS by 11 %.

The paper also discusses limitations. SEIR is presently tailored to text‑image pairs; extending it to video, audio, or 3‑D modalities will require new evaluation models and feedback mechanisms. The reliance on automated evaluators introduces potential bias, so the authors recommend maintaining at least a 2 % human verification rate. Ethical safeguards include strict copyright checks for all images and pre‑filtering of sensitive content (violence, explicit material).

In conclusion, InterSyn sets a new benchmark for multimodal training data by simultaneously delivering massive scale, rigorous quality, and rich instructional diversity. Coupled with the SEIR refinement pipeline and the multi‑facet SynJudge evaluator, it provides a reliable foundation for training LMMs capable of generating tightly interleaved, high‑fidelity image‑text outputs. The empirical evidence of consistent performance gains across data scales underscores the dataset’s scalability and efficiency, making it a valuable resource for both large‑scale industry teams and academic groups with modest compute budgets.

A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

💡 Research Summary

Comments & Academic Discussion

Leave a Comment