NeuroABench: A Multimodal Evaluation Benchmark for Neurosurgical Anatomy Identification
Reading time: 5 minutes
...
📝 Original Info
Title: NeuroABench: A Multimodal Evaluation Benchmark for Neurosurgical Anatomy Identification
ArXiv ID: 2512.06921
Date: 2025-12-07
Authors: Ziyang Song, Zelin Zang, Xiaofan Ye, Boqiang Xu, Long Bai, Jinlin Wu, Hongliang Ren, Hongbin Liu, Jiebo Luo, Zhen Lei
📝 Abstract
Multimodal Large Language Models (MLLMs) have shown significant potential in surgical video understanding. With improved zero-shot performance and more effective human-machine interaction, they provide a strong foundation for advancing surgical education and assistance. However, existing research and datasets primarily focus on understanding surgical procedures and workflows, while paying limited attention to the critical role of anatomical comprehension. In clinical practice, surgeons rely heavily on precise anatomical understanding to interpret, review, and learn from surgical videos. To fill this gap, we introduce the Neurosurgical Anatomy Benchmark (NeuroABench), the first multimodal benchmark explicitly created to evaluate anatomical understanding in the neurosurgical domain. NeuroABench consists of 9 hours of annotated neurosurgical videos covering 89 distinct procedures and is developed using a novel multimodal annotation pipeline with multiple review cycles. The benchmark evaluates the identification of 68 clinical anatomical structures, providing a rigorous and standardized framework for assessing model performance. Experiments on over 10 state-of-the-art MLLMs reveal significant limitations, with the best-performing model achieving only 40.87% accuracy in anatomical identification tasks. To further evaluate the benchmark, we extract a subset of the dataset and conduct an informative test with four neurosurgical trainees. The results show that the best-performing student achieves 56% accuracy, with the lowest score at 28% and an average score of 46.5%. While the best MLLM performs comparably to the lowest-scoring student, it still lags significantly behind the group's average performance. This comparison underscores both the progress of MLLMs in anatomical understanding and the substantial gap that remains in achieving human-level performance.
💡 Deep Analysis
📄 Full Content
NeuroABench: A Multimodal Evaluation Benchmark for Neurosurgical Anatomy Identification
Ziyang Song1,†, Zelin Zang1,†, Xiaofan Ye2, Boqiang Xu1, Long Bai3, Jinlin Wu1,*, Hongliang Ren3, Hongbin Liu1, Jiebo Luo1, Zhen Lei1
† Equal contribution. * Corresponding author.
1 Hong Kong Institute of Science and Innovation, Hong Kong SAR, China. 2 The University of Hong Kong-Shenzhen Hospital. 3 Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China.
This work was supported in part by the National Natural Science Foundation of China (Grant No. 62306313) and the InnoHK Program by the Hong Kong SAR Government.
Abstract—Multimodal Large Language Models (MLLMs) have shown significant potential in surgical video understanding. With improved zero-shot performance and more effective human-machine interaction, they provide a strong foundation for advancing surgical education and assistance. However, existing research and datasets primarily focus on understanding surgical procedures and workflows, while paying limited attention to the critical role of anatomical comprehension. In clinical practice, surgeons rely heavily on precise anatomical understanding to interpret, review, and learn from surgical videos. To fill this gap, we introduce the Neurosurgical Anatomy Benchmark (NeuroABench), the first multimodal benchmark explicitly created to evaluate anatomical understanding in the neurosurgical domain. NeuroABench consists of 9 hours of annotated neurosurgical videos covering 89 distinct procedures and is developed using a novel multimodal annotation pipeline with multiple review cycles. The benchmark evaluates the identification of 68 clinical anatomical structures, providing a rigorous and standardized framework for assessing model performance. Experiments on over 10 state-of-the-art MLLMs reveal significant limitations, with the best-performing model achieving only 40.87% accuracy in anatomical identification tasks. To further evaluate the benchmark, we extract a subset of the dataset and conduct an informative test with four neurosurgical trainees. The results show that the best-performing student achieves 56% accuracy, with the lowest score at 28% and an average of 46.5%. While the best MLLM performs comparably to the lowest-scoring student, it still lags significantly behind the group’s average performance. This comparison underscores both the progress of MLLMs in anatomical understanding and the substantial gap that remains in achieving human-level performance. These findings highlight the importance of optimizing MLLMs for neurosurgical applications.
Index Terms—Large Language Models, Multimodal LLM, Surgery Understanding, Neurosurgery.
I. INTRODUCTION
In recent years, multimodal large language models (MLLMs) have demonstrated remarkable progress in surgical video understanding, laying a potential foundation for advancing both surgical education and intraoperative assistance [1], [2]. Existing studies [3]–[5] and datasets [6], [7] primarily focus on the recognition of surgical actions, workflows [8], [9], or tool usage [10]–[12]. However, these works tend to overlook the critical aspect of anatomical understanding. In clinical practice, surgeons often rely heavily on the identification and comprehension of anatomical structures to interpret, review, and learn from surgical videos. The lack of datasets and benchmarks centered on anatomical understanding limits the development and evaluation of advanced AI models tailored for real clinical needs.
As shown in Table I, existing medical visual question answering (VQA) datasets pay limited attention to fine-grained anatomical understanding in realistic clinical scenarios. Datasets such as OmniMedVQA [13], VQA-RAD [14], and SLAKE [15] include some questions related to anatomy, but their focus is typically on broad organ-level identification using static imaging modalities like MRI and CT. These settings do not capture the dynamic, detailed anatomical context crucial for intraoperative operation. While more recent benchmarks, such as GMAI-MMBench [16], attempt to include anatomical tasks in clinical environments, they still suffer from two key limitations: 1) potential data leakage due to reusing and relabeling from previous datasets, and 2) a lack of comprehensive, fine-grained anatomical classification relevant to surgery. Therefore, currently available multimodal benchmarks fall short of meeting the specific and nuanced needs of surgical AI, particularly for evaluating models’ true anatomical comprehension in operative settings.
Neurosurgical procedures are inherently anatomy-driven: surgeons interpret, review, and learn from surgical videos primarily by recognizing anatomical structures and understanding their relationship to surgical maneuvers. Precise anatomical comprehension underpins intraoperative operation, postoperative assessment, and the continuous improvement of surgical skills. In this context