Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
Multimodal large language models (MLLMs) are trained to represent vision and language in a shared space. But does this joint representation enable consistent reasoning across modalities? We introduce REST and REST+ (Render-Equivalence Stress Tests), two benchmarks for systematically evaluating cross-modal consistency. Each sample presents semantically identical information in three forms (image, text, and mixed), allowing us to measure whether models produce consistent outputs regardless of modality. Evaluating 15 state-of-the-art MLLMs, we find that none reason consistently across modalities, with substantial variation in the degree of inconsistency. Neither rendering text as images nor images as text resolves this problem, even when controlling for OCR errors. We further show that visual characteristics (color and resolution, but not font) and the number of vision tokens affect performance even when text is correctly recognized. Finally, our consistency score correlates with the cross-modal cosine similarity in embedding space, suggesting a mechanistic explanation: inconsistent reasoning arises when text and image representations occupy distinct regions of the joint space.
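To make the measurement concrete, here is a minimal sketch (not the paper's implementation) of how one might score cross-modal consistency and relate it to embedding similarity: compare a model's answers to the same question under image, text, and mixed presentations, and compute cosine similarity between the image and text representations of the same content. The `get_answer` and `get_embedding` helpers in the usage comment are hypothetical placeholders for a concrete MLLM API.

```python
import itertools
import numpy as np

def consistency_score(answers: dict[str, str]) -> float:
    """Fraction of modality pairs (image/text/mixed) giving the same answer."""
    pairs = list(itertools.combinations(answers.values(), 2))
    return sum(a == b for a, b in pairs) / len(pairs)

def cross_modal_cosine(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between image and text embeddings of the same content."""
    return float(
        np.dot(image_emb, text_emb)
        / (np.linalg.norm(image_emb) * np.linalg.norm(text_emb))
    )

# Hypothetical usage (get_answer / get_embedding stand in for a specific model API):
# answers = {m: get_answer(model, sample, modality=m) for m in ("image", "text", "mixed")}
# score = consistency_score(answers)
# sim = cross_modal_cosine(get_embedding(model, sample, "image"),
#                          get_embedding(model, sample, "text"))
```

A per-sample score like this can then be averaged over the benchmark and correlated against the embedding similarity, which is the kind of relationship the abstract reports.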