From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models
This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework, MiSI-Bench. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.
💡 Research Summary
The paper introduces the concept of Microscopic Spatial Intelligence (MiSI), defined as the ability to perceive and reason about the spatial relationships of invisible microscopic entities such as atoms and molecules. Recognizing that modern Vision‑Language Models (VLMs) have demonstrated impressive spatial reasoning on macroscopic objects, the authors ask whether these models can extend that capability to the microscopic domain, which demands both geometric manipulation and domain‑specific chemical knowledge.
To answer this question, the authors construct MiSI‑Bench, a large‑scale benchmark specifically designed for evaluating microscopic spatial intelligence. The benchmark is built from approximately 4,000 protein‑ligand complexes taken from the PDBbind dataset. Using ChimeraX, each complex is rendered into orthographic 2‑D projections (front, left, top, back, right, bottom), yielding a total of 587,975 images. From these images, 163,514 question‑answer (QA) pairs are automatically generated via fixed natural‑language templates. The benchmark comprises nine tasks that fall into two categories: four unit tasks that isolate a single elementary operation (Translation, Rotation, Zooming, Residue‑Ligand Interaction) and five composite tasks that combine multiple operations (Translation + Rotation, Rotation + Rotation, Docking, Interaction Location, and Pocket‑Ligand Interaction). The unit tasks present three orthographic views of the initial structure and a front view after the operation, while the interaction tasks provide all six views to allow the model to identify chemical contacts such as hydrogen bonds.
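The template-driven QA generation described above can be sketched as follows. This is a minimal illustration only: the task names, template wording, and field names are assumptions for exposition, not the benchmark's actual strings, and the real pipeline derives answers from the known 3-D transform applied to each structure.

```python
# Hypothetical sketch of fixed-template QA generation, in the spirit of
# MiSI-Bench. Template wording and field names are illustrative assumptions.

TEMPLATES = {
    "rotation": (
        "The molecule shown in the front, left, and top views is rotated "
        "{angle} degrees about the {axis}-axis. Which option matches the "
        "resulting front view?"
    ),
    "translation": (
        "The ligand is translated along the {axis}-axis. In which direction "
        "does it move relative to the initial front view?"
    ),
}

def make_qa(task: str, **slots) -> dict:
    """Fill a fixed template's slots to produce one QA pair."""
    question = TEMPLATES[task].format(**slots)
    # In the real benchmark the ground-truth answer is computed from the
    # applied 3-D transform; here we simply carry the slots along.
    return {"task": task, "question": question, "answer": slots}

qa = make_qa("rotation", angle=90, axis="y")
print(qa["question"])
```

Because every question comes from a fixed template with known slot values, the ground truth is available for free, which is what makes generating 163,514 QA pairs tractable without manual annotation.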
The authors evaluate several state‑of‑the‑art VLMs—including o3‑mini, Claude Sonnet 4.5, and LLaVA‑13B—without any task‑specific fine‑tuning. Human annotators serve as an upper bound. Results show that current VLMs achieve an average accuracy of roughly 45 % across all tasks, far below the human performance of about 78 %. The gap is especially pronounced on scientifically grounded tasks (Residue‑Ligand Interaction and Pocket‑Ligand Interaction), where VLMs score below 30 %.
To explore the potential of VLMs when adapted to the microscopic domain, the authors fine‑tune a 7‑billion‑parameter model on the MiSI‑Bench training set (Supervised Fine‑Tuning, SFT). After fine‑tuning, the model dramatically improves on the elementary spatial transformation tasks, reaching >85 % accuracy and even surpassing human performance on Translation, Rotation, and Zooming. However, its performance on chemically grounded tasks remains low (~32 %), indicating that explicit domain knowledge is still missing.
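The per-task accuracy figures quoted above (e.g., >85 % on transformations vs. ~32 % on chemically grounded tasks) can be reproduced with a simple exact-match scorer over model outputs. The sketch below is a generic evaluation helper under assumed record fields ("task", "pred", "gold"), not the authors' actual scoring code.

```python
# Minimal sketch of per-task exact-match accuracy scoring, as one might use
# to aggregate VLM answers on a benchmark like MiSI-Bench.
# Record field names ("task", "pred", "gold") are assumptions.
from collections import defaultdict

def per_task_accuracy(records):
    """Return {task_name: fraction of exact-match answers}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        # Normalize whitespace and case before comparing answers.
        if r["pred"].strip().lower() == r["gold"].strip().lower():
            correct[r["task"]] += 1
    return {t: correct[t] / total[t] for t in total}

records = [
    {"task": "rotation", "pred": "B", "gold": "B"},
    {"task": "rotation", "pred": "A", "gold": "C"},
    {"task": "docking", "pred": "D", "gold": "D"},
]
print(per_task_accuracy(records))  # {'rotation': 0.5, 'docking': 1.0}
```

Keeping scores broken out per task, rather than averaging globally, is what exposes the gap the authors emphasize between spatial-transformation tasks and knowledge-dependent interaction tasks.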
These findings lead to two key insights. First, VLMs are capable of learning to infer three‑dimensional molecular geometry from 2‑D projections when provided with sufficient multimodal supervision; they can master basic spatial operations at a level comparable to humans. Second, achieving true scientific AGI will require integrating chemical and physical knowledge—such as bond energetics, electron density, and force‑field concepts—into the pre‑training or fine‑tuning stages. The authors suggest several future directions: (1) domain‑aware pre‑training on large corpora of chemical literature and structural data; (2) hybrid architectures that combine image‑based VLMs with 3‑D graph neural networks to jointly reason over visual and relational representations; (3) expanding the benchmark to include conformational flexibility, solvent effects, and noisy experimental data; and (4) interactive learning loops with expert chemists to continuously refine model understanding.
In summary, MiSI‑Bench provides the first systematic, large‑scale evaluation of microscopic spatial intelligence, revealing both the promise and current limitations of VLMs in scientific contexts. By highlighting the need for explicit scientific knowledge integration, the work charts a clear roadmap for developing multimodal AI systems capable of genuine molecular reasoning and, ultimately, contributing to drug discovery, material design, and broader scientific discovery.