Do You See Me: A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv paper.

Multimodal Large Language Models (MLLMs) show reasoning promise, yet their visual perception is a critical bottleneck. Strikingly, MLLMs can produce correct answers even while misinterpreting crucial visual elements, masking these underlying failures. Our preliminary study on a joint perception-reasoning dataset revealed that for one leading MLLM, 29% of its correct answers to reasoning questions still exhibited visual perception errors. To systematically address this, we introduce “Do You See Me”, a scalable benchmark with 1,758 images and 2,612 questions. It spans seven human-psychology inspired subtasks in 2D and 3D, featuring controllable complexity to rigorously evaluate MLLM visual skills. Our findings on 3 leading closed-source and 5 major open-source models reveal a stark deficit: humans achieve 96.49% accuracy, while top MLLMs average below 50%. This performance gap widens rapidly with increased task complexity (e.g., from 12% to 45% in the visual form constancy subtask). Further analysis into the root causes suggests that failures stem from challenges like misallocated visual attention and the instability of internal representations for fine-grained details, especially at or below encoder patch resolution. This underscores an urgent need for MLLMs with truly robust visual perception. The benchmark dataset, source code and evaluation scripts are available at https://github.com/microsoft/Do-You-See-Me.


💡 Research Summary

This paper introduces “Do You See Me,” a novel and scalable benchmark designed to rigorously evaluate the core visual perception abilities of Multimodal Large Language Models (MLLMs), uncovering a significant and fundamental deficit in their capacity to accurately interpret visual information.

The work is motivated by a critical observation: MLLMs can often produce correct final answers to reasoning questions while simultaneously misinterpreting basic visual elements in the input image. This phenomenon, termed “perception failure masking,” suggests that high-level reasoning success can obscure severe low-level perception errors, making standard multimodal benchmarks inadequate for diagnosing true perceptual understanding. A preliminary study on a curated joint perception-reasoning dataset confirmed this, showing that 29% of a leading MLLM’s correct reasoning answers contained underlying visual perception mistakes.

To address this evaluation gap, the authors propose the “Do You See Me” benchmark. Its design is grounded in established frameworks from human visual perceptual psychology, translating core human abilities—such as visual discrimination, figure-ground perception, spatial relations, and form constancy—into a set of seven core subtasks. The benchmark features several key innovations:

  1. Programmatic Generation & Scalability: All 1,758 images and 2,612 questions are generated synthetically using SVG (for 2D) and Blender (for photorealistic 3D scenes). This ensures scalability, eliminates data contamination concerns, and allows for precise parametric control over task difficulty.
  2. Controlled Difficulty: Each subtask includes parameters (e.g., number of objects, occlusion level, rotation angle, noise intensity) that can be systematically adjusted to create easy, moderate, and hard instances. This enables fine-grained analysis of how model performance degrades with complexity.
  3. Dual Modality Evaluation: Tasks are presented in both clean 2D geometric and complex 3D rendered environments to test generalization across visual contexts.
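To make the programmatic-generation idea concrete, here is a minimal sketch of how a 2D perception stimulus with a controllable difficulty knob might be produced as SVG. This is an illustrative reconstruction, not the authors' actual generator: the function name, grid layout, and the specific parameters (`n_distractors`, `rotation_deg`) are assumptions, chosen to mirror the paper's described levers (number of objects, rotation angle).

```python
import random
import xml.etree.ElementTree as ET

def make_discrimination_stimulus(n_distractors: int, rotation_deg: float, seed: int = 0) -> str:
    """Render an SVG containing identical upright squares plus one rotated
    target square ("odd one out"). Increasing n_distractors or shrinking
    rotation_deg makes the discrimination task harder.

    Hypothetical helper for illustration only -- not from the paper's codebase.
    """
    rng = random.Random(seed)  # fixed seed -> reproducible stimuli
    svg = ET.Element("svg", xmlns="http://www.w3.org/2000/svg",
                     width="400", height="400")
    # Place shapes on a 6x6 grid of candidate cells, without overlap.
    cells = [(x, y) for x in range(40, 360, 60) for y in range(40, 360, 60)]
    positions = rng.sample(cells, n_distractors + 1)
    target_idx = rng.randrange(n_distractors + 1)
    for i, (x, y) in enumerate(positions):
        angle = rotation_deg if i == target_idx else 0.0
        ET.SubElement(svg, "rect", x=str(x), y=str(y), width="30", height="30",
                      fill="black",
                      transform=f"rotate({angle} {x + 15} {y + 15})")
    return ET.tostring(svg, encoding="unicode")
```

Because every stimulus is synthesized from parameters rather than scraped, ground-truth answers (here, which square is the target) are known exactly, and difficulty tiers can be swept by varying the parameters alone.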

The comprehensive evaluation of eight state-of-the-art MLLMs (three leading closed-source models, including GPT-4o, Gemini, and Claude, and five major open-source models) reveals stark results:

  • Massive Performance Gap: Human participants achieved an average accuracy of 96.49%, while the best-performing MLLM (GPT-4o) averaged only 48.75%. All models performed significantly worse than humans.
  • Sharp Decline with Complexity: The performance gap widens dramatically as task difficulty increases. For example, in the Form Constancy task, GPT-4o’s accuracy dropped from 45% on easy instances to just 12% on hard ones.
  • Diagnostics of Failure Modes: The benchmark’s controlled design facilitates root-cause analysis. Key findings include:
    • Architectural Limitations: Failures are attributed to issues like misallocated visual attention and the instability or loss of fine-grained details at the vision encoder’s patch level.
    • Prompting Artifacts: Models often exploit task “shortcuts” (e.g., format cues in multiple-choice questions) rather than performing genuine visual analysis. Chain-of-Thought prompting can sometimes degrade performance on holistic visual tasks by forcing a lossy verbalization of the image.
    • Data Scaling Limits: A large-scale supervised fine-tuning experiment yielded only modest gains (~11%), indicating that these perceptual shortcomings are foundational and not easily remedied by simply adding more training data.

In summary, this paper makes a compelling case that robust visual perception is a critical yet underdeveloped foundation for MLLMs. The “Do You See Me” benchmark provides the necessary tools to isolate, measure, and diagnose these perceptual weaknesses, urging the community to move beyond evaluating only final-answer correctness and to focus on building models with genuinely reliable visual understanding.

