VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?
Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs, yet existing benchmarks predominantly focus on pure-text queries. In real-world scenarios, language also frequently appears as visualized text embedded in images, raising the question of whether current VLMs handle such inputs comparably. We introduce VISTA-Bench, a systematic benchmark spanning multimodal perception, multimodal reasoning, and unimodal language understanding. It evaluates visualized-text understanding by contrasting pure-text and visualized-text questions under controlled rendering conditions. Extensive evaluation of over 20 representative VLMs reveals a pronounced modality gap: models that perform well on pure-text queries often degrade substantially when semantically equivalent content is presented as visualized text. The gap widens as perceptual difficulty increases, revealing sensitivity to rendering variations even though the semantics are unchanged. Overall, VISTA-Bench provides a principled evaluation framework for diagnosing this limitation and guiding progress toward more unified language representations across tokenized text and pixels. The source dataset is available at https://github.com/QingAnLiu/VISTA-Bench.
💡 Research Summary
The paper investigates whether modern Vision‑Language Models (VLMs) can understand language that is presented not as discrete tokens but as visualized text embedded within images. Existing multimodal benchmarks largely assume pure‑text queries, leaving a blind spot for the “text‑as‑pixels” scenario that occurs frequently in real‑world applications such as signs, screenshots, and documents. To fill this gap, the authors introduce VISTA‑Bench, a systematic benchmark comprising 1,500 carefully curated multiple‑choice and open‑ended questions. Each question is provided in two parallel formats: a conventional pure‑text version and a visualized‑text version rendered as an image under tightly controlled conditions (Arial 16 pt, fixed width, etc.).
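The controlled, fixed-width rendering described above implies a text-layout step before rasterization. The sketch below illustrates that step only; the wrapping policy and the helper name `layout_question` are assumptions of mine, not details from the paper:

```python
import textwrap

def layout_question(question: str, width_chars: int = 60) -> list[str]:
    """Wrap a pure-text question into fixed-width lines, as a stand-in
    for the fixed-width layout applied before rendering the
    visualized-text variant (the wrapping policy is an assumption)."""
    return textwrap.wrap(question, width=width_chars)

q = "Which of the following best explains why the sky appears blue during the day?"
lines = layout_question(q, width_chars=40)
# Every line respects the fixed width, so rendered images stay uniform.
assert all(len(line) <= 40 for line in lines)
```

In a real pipeline each wrapped line would then be drawn onto an image canvas (e.g. Arial 16 pt, per the paper's controlled conditions).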
The benchmark is organized into a hierarchical taxonomy covering four primary tasks—Multimodal Perception, Multimodal Reasoning, Multimodal Knowledge, and Pure‑Language Understanding—spanning 10 sub‑tasks and 25 fine‑grained ability dimensions. The data are drawn from established resources (MM‑Bench, Seed‑Bench, MMMU, MMLU, etc.), filtered for correctness, and then passed through a LaTeX‑aware rendering pipeline that preserves code, formulas, and layout. Rendering quality is automatically verified by a large VLM (Qwen3‑VL‑32B) acting as a filter judge, with low‑scoring samples sent for manual review, ensuring a high‑quality final set.
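The judge-and-filter step above amounts to threshold routing: high-scoring renders are accepted, low-scoring ones go to manual review. A minimal sketch, in which the score scale, threshold, and function names are illustrative assumptions rather than the paper's actual configuration:

```python
def route_samples(samples, judge, threshold=4):
    """Split rendered samples into an accepted set and a manual-review
    queue based on a VLM judge's quality score (scale and threshold
    are illustrative assumptions)."""
    accepted, needs_review = [], []
    for sample in samples:
        score = judge(sample)  # e.g. a Qwen3-VL-32B rating of rendering quality
        (accepted if score >= threshold else needs_review).append(sample)
    return accepted, needs_review

# Toy judge: penalize samples whose render is marked as truncated.
toy_judge = lambda s: 1 if s.get("truncated") else 5
ok, review = route_samples([{"id": 1}, {"id": 2, "truncated": True}], toy_judge)
assert [s["id"] for s in ok] == [1]
assert [s["id"] for s in review] == [2]
```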
The authors evaluate more than 20 state‑of‑the‑art VLMs, including Qwen3‑VL‑8B‑Instruct, InternVL‑3.5‑8B, Gemini‑3‑Pro, and MiMo‑VL‑7B‑RL. Experiments compare performance on the pure‑text and visualized‑text versions across both unimodal (MMLU) and multimodal (MM‑Bench, Seed‑Bench, MMMU) tasks. The key findings are:
- Persistent Modality Gap – Across almost all models, accuracy drops when the same semantic content is rendered as visualized text. On the unimodal MMLU benchmark, Qwen3‑VL‑8B‑Instruct falls from 75.99 % (pure text) to 68.46 % (visualized), a loss of about 7.5 percentage points. Multimodal tasks also suffer degradation, though the presence of additional visual context sometimes mitigates the impact.
- Rendering Sensitivity – Font size, style, and prompt design significantly affect performance. Very small fonts (9 pt) cause illegibility and severe OCR errors, especially in unimodal settings where the model must recover all semantics from pixels. Large fonts (64 pt) introduce line wrapping that reduces effective context. The optimal range is 32–48 pt, yielding the smallest gap. Standard fonts (Arial, Times New Roman, Cambria) produce comparable results, while a handwritten‑style font (Brush Script) consistently reduces accuracy by 3–5 percentage points. Prompt length also matters: moderate, descriptive prompts slightly improve results, whereas overly short prompts or heavily structured Chain‑of‑Thought prompts can hurt performance, especially for the InternVL family.
- OCR Capability Correlates with Gap Size – Models with stronger text‑recognition abilities exhibit a smaller modality gap. Qwen3‑VL‑8B‑Instruct scores 96.1 on DocVQA and 896 on OCR‑Bench, outperforming InternVL‑3.5‑8B (92.3 and 832, respectively), and this superior OCR performance translates into a narrower drop on visualized‑text questions. MiMo‑VL‑7B‑RL stands out as an exception, showing relatively robust handling of visualized text despite its smaller size.
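The modality gap reported in these findings is simply a paired accuracy difference between the two question formats. A minimal sketch (the helper name and data layout are mine; the two inputs in the example are the Qwen3‑VL‑8B‑Instruct MMLU scores quoted above):

```python
def modality_gap(pure_text_acc: float, visualized_acc: float) -> float:
    """Gap in percentage points between a model's accuracy on the
    pure-text and visualized-text versions of the same questions."""
    return round(pure_text_acc - visualized_acc, 2)

# Qwen3-VL-8B-Instruct on MMLU, per the summary above.
gap = modality_gap(75.99, 68.46)
assert gap == 7.53  # positive gap: the model is weaker on visualized text
```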
The analysis suggests that the primary bottleneck is limited perceptual robustness: VLMs must first perform OCR‑like recognition before feeding the extracted text to their language component. When OCR fails or yields noisy tokens, downstream reasoning suffers. The authors also note that current VLM architectures treat visualized text as just another visual patch, lacking specialized mechanisms to align pixel‑level text with token‑level semantics.
Based on these insights, the paper proposes several research directions: (i) tighter integration of OCR modules with language models, possibly via joint pre‑training on text‑as‑pixels data; (ii) data‑augmentation strategies that expose models to diverse fonts, sizes, colors, and backgrounds to improve visual robustness; (iii) architectural innovations such as cross‑modal attention heads that explicitly map detected text regions to token embeddings, reducing the reliance on post‑hoc OCR; and (iv) expanding benchmark coverage to include more complex, open‑ended generation tasks.
In conclusion, VISTA‑Bench provides the first large‑scale, controlled evaluation of VLMs under realistic text‑as‑pixels conditions. It quantifies a consistent modality gap, identifies its root causes (OCR performance, rendering choices, prompt sensitivity), and offers a concrete testbed for future work aimed at truly unified vision‑language representations that treat textual information equally whether it appears as symbols or pixels.