Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables
With the widespread application of multimodal large language models in scientific intelligence, there is an urgent need for more challenging evaluation benchmarks to assess their ability to understand complex scientific data. Scientific tables, as core carriers of knowledge representation, combine text, symbols, and graphics, forming a typical multimodal reasoning scenario. However, existing benchmarks mostly focus on general domains and fail to reflect the structural complexity and domain-specific semantics inherent in scientific research. Chemical tables are particularly representative: they intertwine structured variables such as reagents, conditions, and yields with visual symbols like molecular structures and chemical formulas, posing significant challenges to models in cross-modal alignment and semantic parsing. To address this, we propose ChemTable, a large-scale benchmark of chemical tables constructed from real-world literature, containing expert-annotated cell layouts, logical structures, and domain-specific labels. It supports two core tasks: (1) table recognition (structure and content extraction) and (2) table understanding (descriptive and reasoning-based question answering). Evaluation on ChemTable shows that while mainstream multimodal models perform reasonably well in layout parsing, they still face significant limitations when handling critical elements such as molecular structures and symbolic conventions. Closed-source models lead overall but still fall short of human-level performance. This work provides a realistic testing platform for evaluating scientific multimodal understanding, revealing the current bottlenecks in domain-specific reasoning and advancing the development of intelligent systems for scientific research.
💡 Research Summary
The paper introduces ChemTable, a large-scale benchmark specifically designed to evaluate multimodal large language models (MLLMs) on the recognition and understanding of chemical tables. Chemical tables are distinctive multimodal artifacts that combine textual entries, domain-specific symbols, and embedded graphics such as molecular structure diagrams. Existing table benchmarks focus on general-domain documents and lack the structural complexity and specialized semantics of scientific tables, especially those in chemistry.
Data collection: The authors curated 1,382 tables from top‑tier chemistry journals (e.g., ACS Catalysis, JACS, Chem, Angewandte Chemie, Science) covering the period 2015‑2024. Each table is provided as a high‑resolution image (average 3687 × 4086 px). Expert annotators recorded pixel‑level cell boundaries, OCR‑verified text, formatting attributes (bold, italics, color), and logical row‑column relationships. In addition, 4,123 molecular structure images and 892 reaction scheme annotations were linked to the corresponding cells.
Benchmark tasks: Two core tasks are defined.
- Table Recognition (TR) – models receive a table image and a prompt and must output a structured representation (JSON) containing cell coordinates, content, and formatting. Evaluation includes three sub-tasks: (a) Value Retrieval (return the exact cell value at given coordinates), (b) Position Retrieval (locate the row and column of a given value), and (c) Molecular Recognition (convert an embedded molecular diagram to its SMILES string).
- Table Understanding (TU) – models are given either the raw image or the structured text together with a natural-language question and must generate a short answer. The question set comprises 7,344 descriptive questions, which test basic information extraction, and 2,542 reasoning questions, which probe deeper logical abilities such as comparison, attribution, and domain-grounded inference. Questions were generated using GPT-4.1 and then manually refined by graduate students to ensure diversity and difficulty, including unanswerable cases.
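To make the TR sub-tasks concrete, here is a minimal sketch of what a structured prediction and the two retrieval checks might look like. The JSON schema, field names, and helper functions below are illustrative assumptions for this summary, not ChemTable's actual output format or evaluation code.

```python
# Illustrative sketch only: the schema and helpers are assumptions,
# not ChemTable's real format or scoring scripts.
import json

# Hypothetical structured output for one small table, as a TR model
# might emit it: cell coordinates, content, and a formatting attribute.
prediction = json.loads("""
{
  "cells": [
    {"row": 0, "col": 0, "content": "Entry",  "bold": true},
    {"row": 0, "col": 1, "content": "Ligand", "bold": true},
    {"row": 1, "col": 0, "content": "1",      "bold": false},
    {"row": 1, "col": 1, "content": "BINAP",  "bold": false}
  ]
}
""")

def value_retrieval(cells, row, col):
    """Value Retrieval: return the cell content at given coordinates."""
    for cell in cells:
        if cell["row"] == row and cell["col"] == col:
            return cell["content"]
    return None  # coordinate absent from the prediction

def position_retrieval(cells, value):
    """Position Retrieval: locate the (row, col) of a given value."""
    for cell in cells:
        if cell["content"] == value:
            return (cell["row"], cell["col"])
    return None  # value absent from the prediction

print(value_retrieval(prediction["cells"], 1, 1))        # BINAP
print(position_retrieval(prediction["cells"], "BINAP"))  # (1, 1)
```

Molecular Recognition would add a third field per relevant cell (e.g., a predicted SMILES string) scored against the expert annotation; that comparison typically requires a cheminformatics toolkit and is omitted here.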
Model evaluation: Seven open‑source MLLMs (e.g., LLaVA, MiniGPT‑4, InstructBLIP) and ten proprietary models (e.g., GPT‑4‑Vision, Claude‑3‑Opus, Gemini‑Pro) were benchmarked. Results show that while layout reconstruction reaches up to 92 % F1, performance drops sharply for domain‑specific challenges: molecular recognition peaks at 38 % accuracy, and symbol‑heavy cells (e.g., “> 19/1”, “BINAP”) are correctly interpreted only ~45 % of the time. In the QA setting, descriptive questions achieve an average accuracy of 68 %, whereas reasoning questions fall below 45 %, especially when multiple cells must be compared or implicit conventions must be decoded. Human expert performance exceeds 96 % across all metrics, highlighting a substantial gap between current MLLMs and expert users.
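The layout-reconstruction F1 cited above can be sketched as a cell-level score: a predicted cell counts as a true positive only when its coordinates and content both match a gold cell. This is one plausible matching criterion, assumed for illustration; ChemTable's actual metric definition may differ.

```python
# Hedged sketch of cell-level F1 for layout reconstruction, under the
# assumption that a cell matches only on an exact (row, col, content)
# triple. Not the benchmark's official scoring code.
def cell_f1(predicted, gold):
    pred_set = {(c["row"], c["col"], c["content"]) for c in predicted}
    gold_set = {(c["row"], c["col"], c["content"]) for c in gold}
    tp = len(pred_set & gold_set)                      # exact matches
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [
    {"row": 0, "col": 0, "content": "Entry"},
    {"row": 0, "col": 1, "content": "Yield (%)"},
    {"row": 1, "col": 0, "content": "1"},
    {"row": 1, "col": 1, "content": "92"},
]
# Same layout but one misread cell value, e.g. an OCR error on "92".
pred = gold[:3] + [{"row": 1, "col": 1, "content": "82"}]
print(cell_f1(pred, gold))  # 0.75
```

Under this strict criterion a single misread symbol-heavy cell (e.g., "> 19/1" read as "> 19+1") costs both precision and recall, which is consistent with the sharp score drops the paper reports on domain-specific content.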
Key insights: (1) General‑purpose MLLMs excel at generic table parsing but lack the specialized visual‑semantic knowledge required for chemical symbols and molecular diagrams. (2) Cross‑modal reasoning—integrating visual cues with textual data to perform logical inference—is still limited, particularly for multi‑step or conditional reasoning. (3) Closed‑source models outperform open‑source counterparts overall, yet prompt engineering and domain‑specific fine‑tuning can narrow the gap. (4) ChemTable provides a realistic, high‑difficulty testbed that mirrors actual research workflows, making it valuable for guiding future development of domain‑aware multimodal AI systems.
The authors release the full dataset, annotation guidelines, and evaluation scripts publicly, encouraging the community to build more chemically literate multimodal models and to push the frontier of AI‑assisted scientific discovery.