Surg$Σ$: A Spectrum of Large-Scale Multimodal Data and Foundation Models for Surgical Intelligence
Surgical intelligence has the potential to improve the safety and consistency of surgical care, yet most existing surgical AI frameworks remain task-specific and struggle to generalize across procedures and institutions. Although multimodal foundatio…
Authors: Zhitao Zeng, Mengya Xu, Jian Jiang
Surg Σ : A Spectrum of Large-Scale Multimodal Data and F oundation Models f or Surgical Intelligence Zhitao Zeng 1 ∗ Mengya Xu 2 ∗ Jian Jiang 3 ∗ Pengfei Guo 4 ∗ Y unqiu Xu 1 Zhu Zhuo 1 Chang Han Low 1 Y ufan He 4 Dong Y ang 4 Chenxi Lin 3 Y iming Gu 3 Jiaxin Guo 2 Y utong Ban 3 † Daguang Xu 4 † Qi Dou 2 † Y ueming Jin 1 † 1 NUS 2 CUHK 3 SJTU 4 NVIDIA Project page: https://SurgSigma.github.io Abstract Surgical intelligence has the potential to improve the safety and consistency of surgi- cal care, yet most e xisting sur gical AI frameworks remain task-specific and struggle to generalize across procedures and institutions. Although multimodal founda- tion models, particularly multimodal large language models, hav e demonstrated strong cross-task capabilities across v arious medical domains, their advancement in surgery remains constrained by the lack of lar ge-scale, systematically curated multimodal data. T o address this challenge, we introduce Surg Σ , a spectrum of large-scale multimodal data and foundation models for surgical intelligence. At the core of this frame work lies Sur g Σ -DB, a large-scale multimodal data foundation designed to support div erse surgical tasks. Surg Σ -DB consolidates heterogeneous surgical data sources (including open-source datasets, curated in-house clinical collections and web-source data) into a unified schema, aiming to impro ve label consistency and data standardization across heterogeneous datasets. Surg Σ -DB spans 6 clinical specialties and div erse surgical types, providing rich image- and video-lev el annotations across 18 practical surgical tasks cov ering understand- ing, reasoning, planning, and generation, at an unprecedented scale (ov er 5.98M con versations). Beyond con ventional multimodal con versations, Surg Σ -DB in- corporates hierarchical reasoning annotations, pro viding richer semantic cues to support deeper contextual understanding in complex surgical scenarios. W e further provide empirical e vidence through recently developed sur gical foundation models built upon Sur g Σ -DB, illustrating the practical benefits of lar ge-scale multimodal annotations, unified semantic design, and structured reasoning annotations for improving cross-task generalization and interpretability . 1 Introduction According to estimates from the Lancet Commission, more than 300 million surgical procedures are performed worldwide each year [ 49 ], underscoring the urgent demand for safer and more accessible surgical care. Despite adv ances in minimally inv asiv e [ 26 , 50 ] and robotic techniques [ 25 , 32 ], surgery remains inherently comple x, requiring continuous interpretation of dynamic anatomy and high-stakes decision-making under uncertainty . Sur gical AI is therefore emerging as a transformativ e paradigm, acting as an intelligent collaborator that enhances perception, understanding, and reasoning. By lev eraging multimodal intraoperative signals ( e.g . , visual streams, textual instructions, robotic kinematics, and preoperativ e imaging), AI systems promise to improve safety , reduce variability , and broaden access to high-quality surgical expertise. Howe ver , most prior surgical AI systems remain narrowly designed for isolated tasks, including phase recognition [ 71 , 40 ], tool or tissue ∗ These authors contributed equally to this w ork. † Corresponding authors. Preprint. 
Surg 𝜮 - DB · 5.98M Conver sations · 18 Surgical Ta s k s · Divers e Surgical Ty pe s Understanding Given the laparoscopic cholecystectomy image, descr ibe the complete sur gical action in terms of tool, action, and tissue. Reaso ning Does the cr itical view mee t the following conditi ons? Answer yes/no: 1. Two tubular struc tures co nnected to the gallblad der are visible. 2.Hepatocy stic triangle is clear f rom fat/co nnect ive ti ssue 3.Lower ga llbladder… Planning Given current visual observation and historic al contex t, what action is mos t likely t o be taken im mediately ne xt? Generation Please use the given image as the first frame and generate the subsequ ent video. based on the te xt: real scene, r ight in s tr ument coagulates th e cystic mesentery whi le left instrument retract s the cystic mes entery . Gastrojejunostomy Hysterecto my Cataract Su rgery Prostatectomy Hepatecto my Cholecystecto my Colonosco py Rectal Rese ction Nephrectomy Sigmoid Colectomy Figure 1: Surg Σ -DB is a lar ge-scale multimodal data foundation for surgical intelligence. segmentation [ 6 , 5 ], and action classification [ 57 , 7 ], often within a tailored task or a single surgical type. This task-specific paradigm limits knowledge transfer and leads to brittle generalization, where models degrade under distribution shifts caused by differences in imaging systems, anatomy , or surgical styles. Foundation models, particularly multimodal lar ge language models [ 70 , 68 , 9 , 8 ], hav e recently emerged as a promising paradigm for unified visual perception, language understanding and multi- modal reasoning, enabling models to jointly interpret visual content and reason with natural language. In principle, such models offer a unified frame work capable of supporting a broad spectrum of surgical tasks, ranging from describing anatomical structures and instrument states to answering intraoperati ve queries, summarizing procedural context, and generating interpretable decision-support rationales. While foundation models have achieved remarkable success across domains such as radiology [ 84 , 69 , 12 ], pathology [ 17 , 23 , 73 ], and molecular biology [ 38 , 1 ], their application to the surgical domain remains comparatively underexplored. Surgery poses fundamentally distinct challenges for training multimodal foundation models. Intraoperati ve scenes are not only visually complex ( e .g. , se vere occlusion, tissue deformation, and rapid camera motion), b ut also exhibit strong spatiotemporal structure and causal interdependence, where subtle instrument–tissue interactions can induce irrev ersible anatomical changes. Clinically relev ant cues are often fine-grained, transient, and context-dependent, demanding long-horizon temporal reasoning and precise spatial grounding beyond static image understanding. Furthermore, v ariability across institutions, surgeons, devices, and patient anatomies introduces substantial distribution shifts that hinder generalization. W e observe that a fundamental obstacle to advancing surgical multimodal foundation models lies in the lack of large-scale, high-quality , and systematically curated multimodal data. Conv entional surgical datasets [ 72 , 7 , 81 ] are predominantly vision-centric and designed under a closed-set paradigm, providing only predefined categorical annotations ( e.g . , sur gical phase or instrument tags) while lacking practical natural language instruction–visual pairs that better reflect real-world clinical usage and flexible interaction. 
In addition, these datasets are typically limited in scale and surgical-type div ersity , as they are often confined to a small number of procedures or institutions. Consequently , they remain fragmented across tasks and modalities, with inconsistent annotation standards and heterogeneous label spaces that hinder cr oss-dataset integration and large-scale training . Although some recent w orks [ 20 , 56 , 58 ] ha ve introduced datasets for sur gical foundation models, they still e xhibit notable limitations in scale, diversity , and task coverage, as summarized in T able 1. On the other hand, annotation quality and granularity remain insuf ficient: the absence of a unified label space can mislead training and weaken generalization, while existing datasets largely lack high-quality multi-step reasoning traces. T o facilitate research, we present Surg Σ , a spectrum of large-scale multimodal data and foundation models for surgical intelligence. At its core, Surg Σ -DB serves as a unified multimodal data foundation 2 designed to enable large-scale training of surgical foundation models. Rather than releasing isolated task-specific datasets, our Surg Σ -DB systematically consolidates div erse surgical data sources into a unified and well-structured foundation. W e curate data spanning multiple surgical specialties and procedures, cov ering diverse clinical departments and operati ve types to ensure broad anatomical and procedural v ariability . Surg Σ -DB combines open-source resour ces with web-collected surgical videos, and employs a semi-automated annotation pipeline integrating expert human labeling and controlled synthesis to ensure both real-world representati veness and scalable clinical fidelity . W ithin the same surgical scenes, Sur g Σ -DB pr ovides rich annotations with hierar chical r easoning traces , enabling multi-grained spatial understanding as well as temporal modeling. T o the best of our kno wledge, Surg Σ -DB pro vides one of the most comprehensi ve task coverages in surgical intelligence , spanning understanding, reasoning, planning, and generative capabilities through richly structured and multi-lev el annotations, and is constructed at an unprecedented scale ( i.e. , ∼ 5.98M) across diverse surgical types from 6 clinical specialties. Importantly , all data are organized under a unified format, facilitating interoperable training, cross-task integration, and future e xtensibility , while promoting more consistent label spaces across heterogeneous datasets. Building upon Sur g Σ -DB, a family of surgical foundation models ( i.e. , BSA [ 87 ], SurgVLM [ 95 ], Surg-R1 [ 33 ], and Cosmos-H-Surgical [ 31 ]) are dev eloped, which empirically demonstrate the effec- tiv eness of the proposed unified data spectrum. BSA [ 87 ] demonstrates that fundamental surgical actions exhibit consistent, recognizable patterns across anatomically di verse procedures, enabling cross-specialty generalization without domain-specific adaptation and supporting clinically mean- ingful downstream applications including skill assessment and procedural planning. SurgVLM [ 95 ] demonstrates that large-scale multimodal instruction-tuning data can enhance cross-task generaliza- tion, enabling a single model to ef fectiv ely handle di verse sur gical understanding tasks through shared vision–language training. Surg-R1 [ 33 ] further illustrates the critical role of structured reasoning an- notation, where multi-step inference traces significantly strengthen grounded surgical understanding. 
Cosmos-H-Surgical [ 31 ] demonstrates that sur gical world models can transform lar ge-scale unlabeled surgical video into actionable training data for robot policy learning by synthesizing realistic surgical scenes and recov ering pseudo-kinematics through inv erse dynamics inference, thereby enabling scal- able vision–language–action training with limited real demonstrations and significantly impro ving policy performance and sample ef ficiency . T ogether , these models provide complementary e vidence that scale, semantic unification, and chain-of-thought reasoning annotations are ke y ingredients for advancing sur gical foundation models within a coherent data-centric framew ork. W e hope that our data foundation and preliminary findings will inspire the research community to further explore and unlock the untapped potential of multimodal foundation models in surgical intelligence, ultimately advancing clinically reliable and generalizable sur gical AI systems. In future work, we will continually expand Surg Σ -DB in data scale and div ersity , and progressiv ely enrich each surgical scene with comprehensive and holistic annotations tow ard full task coverage within unified surgical conte xts. In summary , our contrib utions can be summarized as follows: • A large-scale multimodal surgical data foundation. W e introduce Surg Σ -DB, a large- scale multimodal dataset spanning multiple surgical specialties and procedures. The dataset integrates image- and video-level annotations across understanding, reasoning, planning, and generation tasks, providing the most comprehensi ve task cov erages in surgical intelligence. • Comprehensi ve and unified multi-granular annotations. W e consolidate heterogeneous data sources into a unified data schema with a consistent label space. A semi-automated annotation pipeline combining human labeling and controlled synthesis with hierarchical reasoning annotations enables semantic coherence and scalable foundation model training. • Empirical validation thr ough f oundation models. A family of surgical foundation models is dev eloped upon Surg Σ -DB, pro viding empirical validation of ke y data design principles and demonstrating the impact of large-scale multimodal data foundation. 2 Related W ork 2.1 Surgical Datasets and Benchmarks Surgical AI has long been supported by a rich ecosystem of public datasets that enable the de velopment and ev aluation of data-driv en models. Conv entional datasets [ 72 , 7 , 81 , 18 ] typically provide closed- 3 T able 1: Comparison with existing multimodal surgical datasets and benchmarks. Datasets V isual Modality Con versation T ype Data Source Reasoning #Sample #T ask V ideo Image VQA Caption Generation In-House Open-Source Internet Cholec80-VQA [65] ✘ ✔ ✔ ✘ ✘ ✘ ✔ ✘ ✘ 43K 3 EndoV is2018-VQA [65] ✘ ✔ ✔ ✘ ✘ ✘ ✔ ✘ ✘ 11.78K 3 PSI-A V A-VQA [64] ✘ ✔ ✔ ✘ ✘ ✘ ✔ ✘ ✘ 10.29K 3 PitVQA [30] ✘ ✔ ✔ ✘ ✘ ✘ ✔ ✘ ✘ 884.24K 5 LRSP-VQA [16] ✘ ✔ ✔ ✘ ✘ ✘ ✘ ✔ ✘ 1.13K 4 CoPESD [78] ✘ ✔ ✔ ✘ ✘ ✘ ✔ ✘ ✘ 121.09K 3 EndoVQA-Instruct [44] ✘ ✔ ✔ ✘ ✘ ✔ ✔ ✘ ✘ 446.54K 12 Surg-396K [76] ✘ ✔ ✔ ✔ ✘ ✘ ✔ ✘ ✘ 396K 7 SurgPub-V ideo [42] ✔ ✘ ✔ ✘ ✘ ✘ ✔ ✔ ✘ 48.52K 3 SurgMLLMBench [21] ✘ ✔ ✔ ✘ ✘ ✘ ✔ ✘ ✘ 893.15K 5 SurgLaV i [55] ✘ ✔ ✔ ✘ ✘ ✔ ✔ ✘ ✘ 239.8K 4 SurgV eo [19] ✔ ✔ ✘ ✘ ✔ ✘ ✔ ✘ ✘ 50 1 SVU-31K [77] ✔ ✘ ✔ ✘ ✘ ✘ ✔ ✔ ✘ 31K 4 SurgCoTBench [45] ✘ ✔ ✔ ✘ ✘ ✘ ✘ ✔ ✔ 14.25K 5 SUREON [56] ✔ ✘ ✔ ✘ ✘ ✘ ✔ ✘ ✔ 206.8K 12 Surg Σ -DB (ours) ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ 5.98M 18 set categorical labels designed for supervised learning of scene understanding ( e.g. 
, instrument recognition) and workflow understanding ( e.g. , phase recognition). In parallel, large-scale pre- training datasets [ 15 , 83 , 7 , 22 , 85 ] of fer ab undant unlabeled or weakly labeled data for representation learning, but their lack of structured annotations limits their effect iveness for fine-grained surgical understanding and multimodal reasoning. While effecti ve for benchmarking perception and workflo w analysis, these datasets remain limited to predefined category labels and fail to capture complex interactions and reasoning over sur gical activities, lea ving a considerable gap between these datasets and real-world surgical applications. Their procedure-centric design also leads to limited cross- procedure generalization and heterogeneous label spaces, posing challenges for unified model training. W ith the emergence of multimodal foundation models, recent efforts ha ve shifted tow ard instruction- following and multi-task datasets tailored to generalizable multimodal surgical modeling. Early VQA-style datasets [ 65 , 65 , 64 , 91 ] are typically constructed by con v erting single-task annotations into question–answer pairs, but remain limited in annotation richness and task di versity . Subsequent works [ 44 , 55 ] improv e coverage by aggregating multiple open-source datasets, while more recent efforts [ 42 , 45 ] le verage web-sourced videos to scale multimodal supervision. In addition, various benchmarks ha ve been dev eloped to e valuate multimodal comprehension in surgical scenarios [ 76 , 21 , 59, 45], surgical scene generation [19], interpretability [20], and sur gical quality assessment [3, 10]. Despite this progress, existing datasets remain limited in div ersity and annotation quality , are biased tow ard VQA-style con versations, and lack support for dense prediction, spatiotemporal reasoning, planning, and generativ e tasks. Moreover , heterogeneous annotation schemas hinder unified multi-task training, motiv ating a unified, large-scale dataset for multimodal sur gical intelligence. 2.2 Foundation Models f or Surgical Intelligence Surgical AI has traditionally relied on task-specific models for instrument/tissue detection and segmentation [ 90 , 43 ], workflo w analysis [ 71 , 35 , 36 ] and triplet recognition [ 18 , 52 , 53 , 54 ], which are typically trained with procedure-specific supervision and are sensitiv e to domain shifts across clinical en vironments. W ith the increasing a vailability of large-scale surgical video data, recent ef forts hav e moved toward surgical foundation models that learn transferable visual representations via self-supervised [ 79 , 62 , 94 , 93 , 89 ] or weakly supervised pre-training [ 39 ]. These models demonstrate improv ed robustness and cross-domain generalization, and can be adapted to diverse do wnstream tasks through lightweight fine-tuning. Howe ver , such approaches remain primarily perception-centric and lack the ability to support open-ended reasoning, interactive understanding, and high-le vel decision-making. Building upon these advances, multimodal foundation models extend foundation modeling by enabling natural-language interaction and reasoning over surgical scenes. Early surgical vision- language models focus on VQA-style formulations [ 65 , 30 ], representing surgical elements such as instruments, tissues, and spatial relationships through te xtual descriptions and generating answers. 
More recently , instruction-tuned surgical multimodal large language models [ 75 , 34 , 64 , 76 , 86 , 41 , 4 Gastro intest inal Urology Gynaeco logy Ophthalmic Thoracic Hepatobil iary Underst anding & Reasoning Reco gni tion Segmentation Captioning Localization Generation & Planning Predict ion Enhancement Gener ation Planning Dat aset1: Ph ase Recognit ion Dat aset 2: Tri p l e t Reco gnit ion Dat aset 3: Critical View of Safety Unified Labels with Rea soning Traces ... a grasper a nd a hook are presen t .. . ... the hoo k is dissecting the cystic du ct ... ... the phase is Ca lot Triangle Dissection , CVS is not yet achieved ... think > < answer> … … retract, grasp, hook, cold cut, suturing … t issue retractio n, dissection, needle puncture, … Unifying Label Space Figure 2: Surg Σ -DB integrates heterogeneous surgical data across 6 clinical specialties into a unified multimodal data foundation. It supports diverse tasks through standardized annotations enriched with hierarchical reasoning traces. 95 ] and video-level surgical understanding models [ 77 , 80 , 92 ] demonstrate strong performance across diverse image- and video-lev el surgical tasks. Ho wever , deploying multimodal foundation models in surgical settings remains challenging due to fragmented data resources that lack scale, div ersity , unified label spaces, and high-quality reasoning annotations. This highlights the need for large-scale, consistently annotated, and unified multimodal datasets for training and e valuation. 3 Surg Σ -DB: A Large-Scale Multimodal Data F oundation f or Surgical AI Surg Σ -DB is a large-scale multi-grained dataset constructed for multimodal foundation models in surgical intelligence. It contains ∼ 5.98M annotated samples spanning 6 clinical specialties, as shown in Figure 2. Surg Σ -DB integrates rich annotations for both static video frames and video clips, and aligned natural language conv ersations including instruction–response pairs and reasoning traces. Data are sourced from both publicly a vailable sur gical datasets and curated in-house clinical collections. Annotations are produced through a combination of e xpert human labeling and controlled synthesis pipelines, which further introduce hierarchical reasoning annotations to capture contextual relationships within surgical scenes, ensuring annotation quality , semantic consistency , and scalable cov erage. All subsets are organized under a unified data schema with harmonized label spaces and standardized formats to support multi-task training and benchmarking. 3.1 Data Curation 3.1.1 Multi-Source Sur gical Data Collection Our goal is to construct a highly div erse sur gical data foundation spanning multiple clinical specialties and surgical types. Guided by this objecti ve, we collect raw data from a wide range of sources, including publicly a vailable sur gical datasets, online sur gical videos, and curated in-house clinical collections dev eloped in collaboration with medical partners. As summarized in T able 2, we collect 16 surgical types across 6 major clinical specialties ( i.e . , gynecologic [ 81 , 63 ], ophthalmic [ 27 , 46 ], hepatobiliary [ 61 , 2 , 53 , 48 , 71 , 74 ], gastrointestinal [ 40 , 37 , 47 , 13 ], urologic [ 6 , 5 , 51 , 14 , 7 , 57 , 11 ] and thoracic [ 28 ] sur geries), encompassing both robotic and manual operations under div erse imaging modalities, including laparoscopic, endoscopic, OphScope, and thoracoscopic settings. 
This breadth ensures substantial variability in anatomy , instrumentation, and workflow dynamics, providing a highly div erse foundation for surgical intelligence modeling. 3.1.2 Holistic and Multi-Granular Surgical T ask Design W e design a div erse and multi-granular set of tasks in Surg Σ -DB to comprehensi vely cov er the key objectiv es of surgical intelligence, as demonstrated in Figure 3. These tasks are organized into two complementary groups: (1) Understanding and Reasoning and (2) Planning and Generation , collectiv ely reflecting the fundamental capabilities required for surgical multimodal foundation models, spanning perception, reasoning, predictiv e modeling, and controllable content generation. Understanding and Reasoning T asks. These tasks encompass a diverse set of perception and reasoning problems designed to capture multi-granular spatio-temporal understanding of surgical scenes. They in v olve grounding surgical instruments, anatomical structures, and procedural dynamics 5 T able 2: Data sources integrated into Sur g Σ -DB, categorized by clinical specialty and surgical types. Clinical Specialty Surgical T ype Data Source Surgical Platform Pr otocol Gynecologic Hysterectomy AutoLaparo [81] Non-Robotic Laparoscopy SurgicalActions160 [63] Non-Robotic Laparoscopy W eb-Collected Data Both Laparoscopy Ophthalmic Cataract Surgery Cataract-1K [27] Non-Robotic OphScope CaDISv2 [46] Non-Robotic OphScope Hepatobiliary Cholecystectomy Cholec80 [72] Non-Robotic Laparoscopy Cholec80-CVS [61] Non-Robotic Laparoscopy CholecInstanceSeg [2] Non-Robotic Laparoscopy CholecT50 [53] Non-Robotic Laparoscopy Endoscapes [48] Non-Robotic Laparoscopy M2CAI16 [71] Non-Robotic Laparoscopy HeiChole [74] Non-Robotic Laparoscopy W eb-Collected Data Both Laparoscopy In-House Data Non-Robotic Laparoscopy Hepatectomy In-House Data Non-Robotic Laparoscopy Gastrointestinal Gastrectomy W eb-Collected Data Both Laparoscopy In-House Data Non-Robotic Laparoscopy Gastrojejunostomy MultiBypass140 [40] Non-Robotic Laparoscopy Colonoscopy SegCol [37] Non-Robotic Endoscopy Proctocolectomy HeiCo [47] Non-Robotic Laparoscopy Sigmoid Colectomy HeiCo [47] Non-Robotic Laparoscopy Rectal Resection HeiCo [47] Non-Robotic Laparoscopy Ladd’ s Procedure W eb-Collected Data Non-Robotic Laparoscopy Appendectomy W eb-Collected Data Non-Robotic Laparoscopy Rectal Resection/Extirpation DSAD [13] Robotic-Assisted Laparoscopy Urologic Nephrectomy EndoV is2017 [6] Robotic-Assisted Laparoscopy EndoV is2018 [5] Robotic-Assisted Laparoscopy Nephrec9 [51] Non-Robotic Laparoscopy SurgT [14] Robotic-Assisted Laparoscopy W eb-Collected Data Both Laparoscopy In-House Data Both Laparoscopy Prostatectomy GraSP [7] Robotic-Assisted Laparoscopy SAR-RARP [57] Robotic-Assisted Laparoscopy MESAD-Real [11] Robotic-Assisted Laparoscopy W eb-Collected Data Robotic-Assisted Laparoscopy Thoracic Lobectomy Lobectomy Dataset [28] Robotic-Assisted Thoracoscopic from both image and video inputs, enabling detailed analysis of tool–tissue interactions and surgical workflo ws. By covering capabilities such as geometric modeling, semantic interpretation, safety verification, and conte xtual workflow understanding, these tasks reflect both foundational perception challenges and clinically relev ant reasoning problems in surgical en vironments. • Instrument Recognition: Identify surgical instruments present in a video frame, serving as a fundamental perceptual capability for understanding tool usage and surgical w orkflow . 
• Instrument Localization: Predict spatial regions of surgical instruments using either bounding boxes or image patches, enabling precise spatial grounding of tool positions. • Instrument Segmentation: Generate pixel-wise masks of sur gical instruments to capture fine-grained shapes and tool–tissue occlusion relationships for precise spatial modeling. • Tissue and Organ Recognition: Classify visible anatomical entities ( i.e. , tissues or organs) to establish semantic awareness of operati ve re gions and contextual sur gical states. • Tissue and Or gan Localization: Localize anatomical entities ( i.e. , tissues and organs) via bounding boxes to pro vide spatial awareness of anatomical structures during sur gery . • Phase Recognition: Identify the current sur gical phase from either video frames or video clips, recognizing the highest-lev el procedural stage of the surgical workflo w . • Step Recognition: Identify finer -grained surgical steps from either video frames or video clips, modeling intermediate-lev el workflo w progression within each phase. 6 Tissue and Organ Localization Phase Rec ognition Step Recognition Instrument Recogn ition Instrument Localization Tissue and Organ Recogn ition Action Recognition Tr ip le t Re co gn it io n Wh i c h c a t e g o r y d o t h e s e su r gi c al i n s t r u m e n t s f al l i n t o ? Th e y b e l o n g t o t h e c a t e g o ri e s Gr a s p e r , H o o k . Gi v en t h e l a p a r o s c o p i c s u r g i c a l i ma g e, fi n d L a r g e N e e d l e D r i v e r in t he f or m a t of bbo x (x 1 , y 1 ), (x 2 , y 2 ). (3 2 6 ,5 4 5 ), (9 1 9 ,9 2 1 ) Wh a t o r g a n s a r e f e a t u r e d i n t h i s su r gi c al i m age ? Th e y a re s t o m a c h , l i v e r a n d th e a b d o mi n a l w a l l . Ma r k t h e l o c a t i o n o f t h e Ga l l b l a d d e r i n t h e m i d d l e c e n t e r ar e a. It is a t [ 0.282, 0.452, 0.447, 0.752] . Wh a t p h a s e i s r e f l e c t e d i n t h i s i m a g e fr o m C h o l e c y s t e c t o m y s u r g e r y ? Th e re fl e c t e d p h a s e i s Ga l l b l a d d e r D i s s e c t i o n . Id e n t ify t h e c u r r e n t p r oc e d u r e s t e p fr o m th i s i m a g e o f a P r o s t a t e c t o m i e s su rg e r y ? Th e p r o c e d u r e s t e p i s i d e n t i f i c a t i o n an d di s s e c t i o n o f t h e I l i ac v e i n an d ar t e r y Id en t i f y t he s ur g i c a l a c t i on i n th i s c l i p . Th e s u rg i c a l a c t i o n i s aspi r at i o n De s c r i b e t h e i n s t r u m e n t , a c t i o n , an d t ar ge t i n t h i s su r gi c al sc e n e . Th e i n s t r u m e n t i s h o o k , pe r f o r m i n g di s s e c t o n c y s t i c du c t . Safety Assessmen t Surgical Image Captioning Do th r e e C V S a c h i e v e i n th e cu r r e n t f r am e ? Cr i t e r i o n 1 : Y e s. Cr i t e r i o n 2 : Ye s . C r i t e r i o n 3 : Ye s . De s c r i b e th e im a g e in de t ai l . Th e su rg i c a l pr o c e du r e de pi c t e d in v olv e s a Ch o l e c y s t e c t o m y . Spe c if ic ally , th e ph as e sh o w n is th e Ga l l b l a d d er Di s s e c t i o n , wh e r e th e su rg e o n di s s e c t s th e gal l bl adde r fr o m it s su rro u n d i n g ti s s u e s . Th e st e p be i n g ex e c u t e d in v olv e s us ing a su rg i c a l hook to as s i s t in th e di s s e c t i o n pr o c e s s . 
Left fenes trated bipolar forceps and right fenestrated bipolar forc eps holds tissue while bottom suct ion - irrigator aspirates blood on tissue, then bottom coagulator coagulates ti ssue while left fenestra t ed b ipolar forceps and right fenestrated bipolar forceps holds ti ssue, then bottom suc tion - irrigator aspirates blood on t issue. Surgical Video Captioning Describe the vid eo in detail . Action Remain ing Pred iction Next Action Plan ning "H o w m u c h t i m e i s l e f t be f o r e th e a s p i r a ti o n a c ti o n i s co m p l e t e d ? 10 s ec o n d s Wh a t i s t h e n e x t s u r g i c a l a c t i o n ? Th e n e xt a c t i o n i s d i s s e c t i o n . Desmoking Instrument Segment ation Depth Estim ation Ne x t F r a m e G e n e r a t i o n Con d i t i on a l S u r g i c a l V i d e o G e n e r a t i on P l e a s e re m o v e s m o k e a n d e n h a n c e t h e c l a ri t y o f t h e p ro v i d e d i m a g e . Please genera te the segment ation map of the ins trument of the current input frame. Output a single - channel segmentation mask where the surgical instrument region is white an d all other back ground areas are black. Please genera te its depth map of the current in put frame. Output a single - channel grayscale depth image with accurate spatial depth relationships. From th e current in put frame, please gen erate the fra me after 9 s econds. The current phase is devel oping the space of Retz ius , and the current step is prostat e dissection up t o the levator ani. Please use the given image as the fir s t fram e and gener ate the subsequent video . Le ft dissecting and grasping forceps for ms loop around left needle holder , then left dissecting and grasping forceps fo rms loop around left needle holder, then left needle holder grasps suture, then left dissecting and grasping forceps grasps suture, then left dissecting and grasping forceps and left needle holder tie knot Figure 3: Surg Σ -DB contains diverse multimodal con versations spanning 13 understanding and reasoning tasks as well as 5 planning and generation tasks, supporting a wide range of perception, reasoning, simulation, and decision-oriented capabilities for surgical intelligence. • Action Recognition: Classify atomic surgical actions from either video frames or video clips, encompassing both basic surgical actions and procedure-specific operations, and capturing the lowest-le vel dynamics of the sur gical workflo w . • T riplet Recognition: Predict instrument–action–tar get triplets to explicitly represent struc- tured surgical interactions between instruments and tar gets ( i.e. , tissues or other instruments), enabling higher-le vel understanding and reasoning about sur gical activities. • Depth Estimation: Infer relativ e depth maps for sur gical video frames, facilitating geometry- aw are downstream tasks such as instrument navig ation and spatial perception during surgery . • Safety Assessment: Determine whether safety-critical anatomical structures have been sufficiently exposed and correctly identified ( e.g . , critical view of safety [ 61 , 3 , 10 ] in cholecystectomy), ensuring sur gical actions such as clipping or transection can be performed safely . • Surgical Image Captioning: Generate natural language descriptions of static sur gical scenes, summarizing instruments, anatomy , and conte xtual information visible in inputs. 
• Surgical V ideo Captioning: Produce temporally coherent textual descriptions of sur gical videos, capturing the spatio-temporal ev olution of surgical scenes and acti vities. Planning and Generation T asks. Beyond scene comprehension, these tasks focus on predictive modeling and controllable generation that are closely aligned with practical clinical needs. They 7 T able 3: Unified annotation format of Surg Σ -DB. { " m e t a d a t a " : { " i n f o " : " d a t a s e t i n f o r m a t i o n " , } , " i m a g e s " : [ { " i d " : 0 , " s o u r c e d a t a s e t " : " d a t a s e t n a m e " , " s o u r c e u r l " : " u r l t o d a t a s o u r c e " , " c l i n i c a l s p e c i a l t y " : " n a m e o f t h e s p e c i a l t y " , " s u r g i c a l t y p e " : " n a m e o f t h e s u r g i c a l p r o c e d u r e " , " i m a g e p a t h " : " p a t h t o i m a g e " } ] , " v i d e o s " : [ { " i d " : 0 , " s o u r c e d a t a s e t " : " d a t a s e t n a m e " , " s o u r c e u r l " : " u r l t o d a t a s o u r c e " , " c l i n i c a l s p e c i a l t y " : " n a m e o f t h e s p e c i a l t y " , " s u r g i c a l t y p e " : " n a m e o f t h e s u r g i c a l p r o c e d u r e " , " v i d e o p a t h " : " p a t h t o v i d e o " } ] , " a n n o s " : [ { " i d " : 0 , " v i d e o s " : [ v i d e o i d ] , " t i m e s t e p " : [ f r a m e i d x ] , " i m a g e s " : [ i m a g e i d ] , " q u e s t i o n " : " u s e r i n s t r u c t i o n s " , " t h i n k i n g " : " m o d e l t h i n k i n g r e s p o n s e ( o p t i o n a l ) " , " a n s w e r " : " m o d e l a n s w e r r e s p o n s e " , " d e n s e p r e d i c t i o n " : [ " p a t h t o d e n s e p r e d i c t i o n " ] , " t a s k t y p e " : [ " t a s k n a m e " ] , " g t l a b e l " : [ " g r o u n d - t r u t h l a b e l " ] , " a n n o t a t o r " : { " N U S " , " C U H K " , " S J T U " , " N V I D I A " } } , ] } in volv e forecasting future procedural developments, estimating surgical progress, and generating coherent multimodal descriptions of surgical activities. Such tasks provide a basis for applications in- cluding intraoperativ e assistance, workflow anticipation, and simulation-driv en learning, highlighting the potential of surgical AI systems to mo ve from passi ve observ ation toward proacti ve support. • Action Remaining Prediction: Estimate the remaining duration of the current action, enabling workflo w progress assessment and anticipation of upcoming procedural transitions. • Next Action Planning: Predict the most probable next sur gical action giv en current visual observation and historical conte xt, enabling anticipation of upcoming workflow transitions. • Desmoking: Generate smoke-free surgical video frames from smoke-degraded observations, improving visual clarity and perceptual rob ustness under real-world imaging conditions. • Next Frame Pr ediction: Generate a future video frame at a specified time horizon to predict the temporal ev olution of surgical scenes, supporting workflo w anticipation and simulation. 8 T able 4: Three typical annotation samples for MLLM training. 
[ { " i d " : 0 , " i m a g e s " : [ " i m a g e p a t h " ] , " c o n v e r s a t i o n s " : [ { " f r o m " : " h u m a n " , " v a l u e " : " < i m a g e > u s e r i n s t r u c t i o n f o r i m a g e t a s k s " } , { " f r o m " : " g p t " , " v a l u e " : " < t h i n k i n g > m o d e l t h i n k i n g r e s p o n s e < / t h i n k i n g > \ n < a n s w e r > m o d e l a n s w e r r e s p o n s e < / a n s w e r > " } ] } , { " i d " : 1 , " v i d e o s " : [ " v i d e o p a t h " ] , " c o n v e r s a t i o n s " : [ { " f r o m " : " h u m a n " , " v a l u e " : " < i m a g e > u s e r i n s t r u c t i o n f o r v i d e o t a s k s " } , { " f r o m " : " g p t " , " v a l u e " : " < t h i n k i n g > m o d e l t h i n k i n g r e s p o n s e < / t h i n k i n g > \ n < a n s w e r > m o d e l a n s w e r r e s p o n s e < / a n s w e r > " } ] } , { " i d " : 2 , " i m a g e s " : [ " i m a g e p a t h ( i n p u t ) " , " i m a g e p a t h ( p r e d i c t i o n ) " ] , " c o n v e r s a t i o n s " : [ { " f r o m " : " h u m a n " , " v a l u e " : " < i m a g e > u s e r i n s t r u c t i o n f o r d e n s e p r e d i c t i o n t a s k s " } , { " f r o m " : " g p t " , " v a l u e " : " < i m a g e > " } ] } ] • Conditional Surgical Video Generation: Generate surgical video clips conditioned on textual instructions and visual context, including text-to-video, image-to-video, and text- guided image-to-video generation, enabling controllable simulation and data augmentation. 3.1.3 From Heter ogeneous Labels to Unified Annotations Surgical datasets collected from di verse sources often exhibit inconsistent terminology , mismatched category definitions, and v arying annotation formats, especially for fine-grained units such as atomic surgical actions. These discrepancies arise from div ergent naming con ventions, v ariations in semantic granularity , and heterogeneous representation forms ( e.g . , categorical labels, dense masks, or ques- tion–answer pairs with multi-step reasoning), leading to semantic drift and instability in large-scale joint training. T o address this, we reorganize heterogeneous labels into a unified framework that aims 9 public datasets online videos in -house data Sou r ce Da ta Co l lec tio n Cl eaning and Refineme n t label refinement raw label synthesis data cleaning Contextua l and Explainabl e Augmenta tio n explainable augmentation Wikipedia retrieve enrich contextual augmentation strong correlated s tep recog . action recog . too l local. tool recog . three -level Co T annotation Hiera rchica l Reaso ning Genera tio n level 1 : xxxxx level 2 : xxx xx level 3 : xxxxx QA pairs c ontextual raw label s trainin g re ady data Dial ogue Ins t antia tion diverse temp lates QA pairs wit h thinking Figure 4: Overvie w of the data curation and annotation pipeline for Surg Σ -DB. to standardize fine-grained semantic definitions, reduce cross-dataset inconsistencies, and support scalable, interoperable training. T aking action recognition as a representative example, we consolidate div erse atomic actions into a unified taxonomy of ten basic surgical actions with explicit semantic boundaries and inclusion criteria. Each action is grounded in clinically interpretable descriptions and aligned with related attributes ( e.g . , instruments and target tissues), forming a coherent and structured label space that supports consistent supervision across heterogeneous surgical data. 
Beyond semantic normalization, we unify heterogeneous annotation formats under a consistent structural schema. Different annotation types and tasks are reorganized into a standardized repre- sentation that aligns multi-granular supervision across image- and video-le vel data, as illustrated in T able 3. This unified structure can be readily con verted into training-ready multimodal format (see T able 4) compatible with large language models, enabling direct inte gration into foundation model pipelines. Such standardized annotation format ensures structural consistency , facilitates large-scale joint training, and supports future dataset scaling and extensibility . 3.1.4 Semi-A utomated Annotation Pipeline Data Pre-Pr ocessing and Label Refinement. Following aforementioned unified annotation stan- dards, we systematically refine raw labels from heterogeneous sources to ensure semantic and structural consistency . Coarse categories, institution-specific shorthand, and inconsistent terminology are replaced with explicit, context-complete descriptions and normalized under standardized medical vocab ularies. For instance, ambiguous placeholders ( e.g. , “other”) are reformulated into precise statements, and synonymous anatomical terms ( e.g . , “Calot’ s triangle” vs. “Hepatocystic triangle”) are consolidated into canonical forms. This standard-driv en refinement reduces cross-dataset ambiguity and improv es stability in large-scale joint training. For the ra w annotations that are already provided by the original source datasets, we directly adopt the official labels to preserve dataset fidelity . For incomplete textual annotations ( e.g. , image and video captioning), we leverage Qwen3-VL-235B [ 8 ] to generate enriched and context-consistent descriptions. For missing dense predictions, such as smoke masks, se gmentation maps, and depth annotations, we employ off-the-shelf methods [ 29 , 60 , 88 ] to automatically generate the corresponding information. For some noise-prone labels ( e.g. , temporal boundaries of surgical actions), we perform manual verification and label refinement to ensure precise and reliable annotations. Contextual and Explainable A ugmentation. Sur gical attributes such as phase, step, instrument, and action are inherently interdependent. T o enhance structural learning, we optionally consolidate corre- lated attributes into unified prompts that explicitly encode hierarchical and relational dependencies ( e.g . , instrument–action pairs), transforming isolated labels into structured multi-attrib ute supervi- sion. In addition, to strengthen vision-language alignment, we augment original cate gorical labels with knowledge-grounded descriptions deri ved from authoritati ve medical sources ( e.g. , W ikipedia), linking visual evidence to surgical intent and anatomical context. T ogether , these contextual and explanation-a ware augmentations promote compositional reasoning, improve fine-grained alignment, enhance robustness across di verse scenes, and increase interpretability for real-w orld deployment. Hierarchical Reasoning T rajectory Generation. 
T o e xplicitly align visual evidence with structured reasoning processes, a three-lev el chain-of-thought [ 82 ] annotation strategy is constructed to decom- pose surgical inference into perceptual grounding, relational understanding, and contextual reasoning, 10 Gynecologic (3.3%) Ophthalmic (4.0%) Hepatobiliary (23.4%) Gastrointestinal (39.3%) Urologic (29.9%) Thoracic (0.1%) (a) ratio of clinical specialties Image (75.2%) video (24.8%) (b) ratio of visual input modalities (c) word cloud visualization (d) distribution of all co vered tasks 0 100 200 300 400 500 600 700 800 length of te xt tok ens 0 50000 100000 150000 200000 250000 300000 350000 400000 Count (e) distribution of te xt token lengths Figure 5: Statistical analysis of the constructed Surg Σ -DB. as shown in Figure 4. Specifically , the first lev el emphasizes describing fundamental visual elements present in the scene, focusing on the appearance and spatial properties of tools and tissues. The second lev el focuses on characterizing interactions among dif ferent elements, capturing ho w tools, tissues, and actions relate to and influence one another through contact, motion, and functional associations. The third lev el abstracts perceptual and relational cues into high-lev el procedural inference, enabling structured reasoning about the global surgical conte xt. These structured reasoning trajectories are synthesized using GPT -5.1 [ 68 ] conditioned on all verified raw labels (from existing datasets or annotated by ourselves) under explicit forward-reasoning constraints, ensuring that intermediate steps remain grounded in observable visual evidence and prev enting hallucinated interactions beyond the annotated scene. Consequently , the hierarchical CoT annotations transform flat supervision into structured reasoning guidance, promoting compositional inference, mitigating shortcut learning, and enhancing interpretability . Con versational Diversity Expansion. T o prepare multimodal foundation model training data, structured annotations are transformed into diverse con versational formats. Rigid template design may lead to ov erfitting in MLLMs, as the model can memorize superficial linguistic patterns rather than learning underlying visual–semantic alignment. T o mitigate this issue, we construct 100–200 dialogue templates with controlled variations in phrasing, ordering, and information density . This strategy increases linguistic div ersity while preserving logical consistency , thereby improving instruction- following rob ustness and cross-task generalization. 3.2 Dataset Analysis and Discussions 3.2.1 Data Statistics Surg Σ -DB contains 5.98M multimodal con versations spanning both image- and video-based surgical tasks, extracted from 1.59K unique surgical video sources. Among them, 4.49M con versations are associated with images, while 1.48M correspond to video clips, forming a large-scale multimodal corpus for surgical intelligence. In total, Surg Σ -DB comprises 471.29M text tokens, pro viding rich linguistic corpus for multimodal learning. It covers 6 clinical specialties and 16 surgical procedure types, capturing div erse surgical scenes and procedural contexts across a wide range of operative domains, as illustrated in Figure 5a. In terms of visual data, Surg Σ -DB comprises 1.58M unique images paired with 4.21M image-based con versations. This large-scale image corpus supports a wide range of perception and reasoning 11 tasks, including recognition, localization, segmentation, and captioning. 
For the video component, Surg Σ -DB contains 1.35M video clips paired with 1.45M con versations. These clips are typically short segments capturing fine-grained surgical activities and procedural context, supporting tasks such as action recognition and conditional surgical video generation. The scale, di versity and multimodal richness of Sur g Σ -DB make it a strong foundation for training multimodal foundation models for surgical intelligence. As sho wn in Figure 5d, Surg Σ -DB exhibits broad cov erage across di verse tasks, while Figure 5e reveals a wide distrib ution of text token lengths, indicating rich linguistic di versity spanning both concise perception queries and complex multi-step reasoning. 3.2.2 Dataset License and Accessibility Surg Σ -DB is licensed under the Creativ e Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY -NC-SA 4.0). The license applies to all annotations to which we hav e directly contributed. Surg Σ -DB also incorporates surgical videos and images sourced from pre- existing collections. For these data, the original licensing terms are respected and remain applicable. Our first public release, Surg Σ -DB v0.1, will be made publicly accessible through the official project page, where users can obtain the annotation files, metadata, and documentation necessary to reproduce the structure of the dataset. Surg Σ -DB is intended for non-commercial research purposes, and users are expected to properly cite both Sur g Σ -DB and the original datasets in any resulting publications. 4 Advanced F oundation Models Built upon Surg Σ -DB Building upon Surg Σ -DB, a spectrum of foundation models are dev eloped, which cover complemen- tary dimensions of surgical intelligence, from action-centric understanding and multimodal sur gical scene understanding to structured reasoning and embodied policy learning. In the following, we briefly describe each model, highlighting its model and training designs, as well as empirical findings. 4.1 BSA: A Cross-Specialty F oundation Model for Basic Sur gical Action Recognition BSA [ 87 ] is a cross-specialty foundation model for recognizing basic surgical actions as a shared semantic unit across div erse procedures. Rather than modeling each procedure in isolation, BSA treats surgical w orkflow as compositions of reusable primiti ve actions ( e.g. , dissection, coagulation, clipping, knot-tying), enabling a unified representation that transfers across anatomical sites, insti- tutions, and recording conditions. Given short surgical video clips as input, it outputs probability distributions o ver ten predefined surgical action cate gories. Specifically , BSA builds upon a V ideo T ransformer backbone [ 24 ] with two k ey design considerations for sur gical video analysis: (1) se- quential temporal and spatial attention mechanisms are utilized to effecti vely capture spatiotemporal dependencies inherent in surgical procedures, such as instrument mov ements and tissue interactions; (2) a dual-head prediction module is specifically designed to address the class imbalance issue. W ith video-action samples in Surg Σ -DB, the training pipeline utilized standard video preprocessing with temporal downsampling and uniform frame sampling to balance computational ef ficiency with visual information preserv ation. The Evidential loss [ 66 ] is employed to encourage well-calibrated uncertainty estimates, pre venting both o verconfident and excessi vely uncertain predictions. 
Please refer to BSA [87] for more implementation details. Experimental results sho w that BSA learns stable and transferable representations of basic sur gical actions across heterogeneous procedures, institutions, and imaging conditions (see Figure 6). Be- yond recognition, BSA provides structured and uncertainty-aware action semantics that naturally support do wnstream applications, including surgical skill assessment and sur gical action planning. These findings indicate that BSA functions not only as a standalone recognition model, but also as a foundational perception module that bridges low-le vel visual understanding and higher-le vel reasoning systems and embodied policy-learning pipelines. From a data-centric perspective, BSA operationalizes three principles for scalable sur gical intelligence. First, ontology-first supervision: defining a compact, clinically meaningful action v ocabulary impro ves semantic consistency across datasets and specialties. Second, cross-specialty alignment: harmonized labels and standardized preprocessing reduce dataset-specific shortcuts and encourage representations that generalize beyond a single procedure type. Third, uncertainty-aw are recognition: modeling confidence is essential 12 He pa tobi li a r y / C holec y ste c tom y G T : asp irat ion P r e d ict ion : asp irat ion He pa tobi li a r y / C holec y ste c tom y G T : c li p p in g P r e d ict ion : c li p p in g Ga stroint e sti na l / Ga stre c tom y G T : c oagu lation P r e d ict ion : c oagu lation Ur olog ic /Ne phre c tom y G T : k n ot - tyin g P r e d ict ion : k n ot - tyin g Gy ne c olo g ic / Hy ster e c tom y G T : n e e d le p u n c tur e P r e d ict ion : n e e d le p u n c tur e Ur olog ic / P rosta tec tom y G T : n e e d le gr asp in g P r e d ict ion : n e e d le gr asp in g Ga stroint e sti na l / I ntestinal R e se c ti on G T : tissu e r e tr ac tion P r e d ict ion : tissu e r e tr ac tion Ur olog ic / P rosta tec tom y G T : su tur e p u ll in g P r e d ict ion : su tur e p u ll in g He pa tobi li a r y / C holec y ste c tom y G T : p ac k agin g P r e d ict ion : p ac k agin g Ga stroint e sti na l / Ga stre c tom y G T : d iss e c tion P r e d ict ion : d iss e c tion Figure 6: Qualitative visualization of BSA foundation model predictions across div erse surgical procedures. for safe handoff to reasoning and control modules in high-stakes clinical settings. T ogether with SurgVLM (lar ge-scale perception and reasoning alignment), Surg-R1 (hierarchical structured surgi- cal reasoning), and Cosmos-H-Sur gical (world-model-dri ven action-data synthesis for robot polic y learning), BSA forms the front-end perceptual anchor of a unified pipeline tow ard embodied surgical intelligence: perception → reasoning → action. This systems view suggests that scalable surgical AI depends not on isolated model gains, b ut on tight co-design of action ontology , multimodal reasoning, and physically grounded world models. 4.2 SurgVLM: A Multimodal F oundation Model for Sur gical Intelligence Surgical intelligence presents unique challenges, requiring surgical visual perception, temporal analysis, and reasoning. Existing general-purpose vision-language models fail to address these needs due to insufficient domain-specific supervision and the lack of a large-scale high-quality surgical database. General-purpose vision-language models, trained predominantly on natural images and text, often exhibit inefficienc y by generating excessi ve, clinically irrele vant outputs. 
Additionally , their outputs tend to be ambiguous, presenting multiple plausible scenarios rather than definitive, medically meaningful answers. Such ambiguity and verbosity undermine their alignment with surgeons’ professional standards and real-world clinical requirements, significantly limiting their reliability and applicability in surgical practice. T o address this challenge, Sur gVLM is built upon Sur g Σ -DB, adapted to ten surgical tasks through a unified sequence-to-sequence formulation optimized with a single autoregressi ve language modeling loss. T o enhance generalization and mitigate biases, it dev elops an effecti ve database construction pipeline, including data cleaning and refinement, cross- task correlation enrichment, explainable answer generation, and con v ersational div ersity expansion. As a multimodal foundation model av ailable in multiple scales ( i.e. , 7B, 32B, and 72B), SurgVLM is designed to support a wide range of surgical understanding tasks, spanning both spatial and temporal analysis of surgical scenes, co vering capabilities from visual perception to high-le vel reasoning. T o 13 Task Instrument Localization Phase Recognition Action Recognition Triplet Recognition Critical View of Safety Assessment Image Question Identify the location of the large needle driver in this image, using 3x3 grids to describe the location. In the Cholecystectomy surgical image, what is the current phase ? The available phase options are ... What action related to the needle and suture is the surgeon focusing on right now? The available action options are ... What tasks are the instrument accomplishing with the target in this surgical image? The available instrument, action, and target options are ... For each Critical View of Safety criterion, answer yesor no. 1.Only two tubular structures connect to the gallbladder. 2.Hepatocystic triangle cleared for visibility. 3.Lower gallbladder detached from liver bed. Answer The large needle driver is at bottom-left area. F. Cleaning Coagulation D. pushing the needle through the tissue Triplet1: C. grasper B. retract H. gallbladder Triplet2: E. hook C. dissect M.cystic_artery 1.no 2.no 3.no Figure 7: Qualitative results of Sur gVLM-72B including fi ve typical examples from visual perception to temporal analysis to safety reasoning. For triplet recognition, the output format is triplet list with . build an ef fective training pipeline, SurgVLM models follo w Qwen2.5-VL [ 9 ] architecture, consisting of a vision encoder , a projector , and a LLM decoder . The vision encoder is a transformer -based image backbone that processes images and video frames at their nativ e resolution, while the LLM serves as the decoder for generating outputs. Please refer to SurgVLM [95] for more implementation details. T o systematically ev aluate multimodal surgical intelligence, Sur gVLM is e valuated on Sur gVLM- Bench, a comprehensi ve benchmark designed to assess vision-language models across clinically relev ant dimensions of surgical understanding. SurgVLM-Bench integrates six widely used sur gical datasets spanning three hierarchical lev els of task complexity: visual perception, temporal workflo w analysis, and safety reasoning. These task categories reflect increasing contextual and temporal dependency and align with the requirements of real-world surgical assistance. 
The qualitativ e results with typical examples sho wn in Figure 7 generated by SurgVLM-72B, including instrument localization, phase recognition, action recognition, triplet recognition, and CVS assessment. The experiments of Sur gVLM indicate that fine-tuning general VLMs on Surg Σ -DB provides an ef ficient and reliable pathway for sur gical adaptation with follo wing two important insights: (1) It indicates one of core problems in surgical foundation modeling lies in balancing div ersity across surgical types with the e xploitation of shared cross-procedure structure. Related procedures often e xhibit substantial ov erlap in anatomical appearance, tissue characteristics, and instrument-associated visual cues, and joint training on multiple categories in Surg Σ -DB allows the model to le verage these synergies to learn richer and more transferable representations. At the same time, substantial variation remains across procedures in anatomy , workflow , and visual distrib ution; accordingly , Surg Σ -DB incorporates the surgical type as explicit conte xtual information during instruction tuning, reducing ambiguity and improving procedure-specific conditioning. (2) It supports a hierarchical vie w of surgical intelligence in which low-le vel perception, temporal understanding, and high-le vel reasoning are tightly coupled. Accurate recognition of instruments and tissues provides the basis for phase and action understanding, while robust temporal modeling underpins reliable intraoperativ e reasoning. This structure motiv ates multi-task co-training and a curriculum that progresses from perception to temporal analysis and finally to reasoning, improving data ef ficiency , con vergence beha vior , and robustness across the full surgical w orkflow . 4.3 Surg-R1: A Reasoning-Enhanced Model for Surgical Scene Understanding Surg-R1 [ 33 ] is a reasoning-enhanced multimodal foundation model for surgical scene understanding with hierarchical chain-of-thought [ 82 ] reasoning capabilities. Built upon the reasoning-annotated data within Surg Σ -DB, Surg-R1 interprets complex surgical scenes through a structured three- lev el reasoning hierarchy: (1) perceptual grounding for instrument and tissue identification, (2) relational understanding for tool-tissue-action interactions, and (3) contextual reasoning for phase recognition and safety assessment. Initialized with Qwen2.5-VL-7B [ 9 ], Surg-R1 is trained through a comprehensi ve four-stage pipeline: First, supervised fine-tuning is performed to establish foundational vision–language alignment using question-answer pairs without reasoning. Secondly , structured 14 Figure 8: Qualitati ve results of Surg-R1-7B across three representativ e surgical tasks. Each column shows the model’ s structured multi-level reasoning chain, progressing from visual identification (Lev el 1) through tool-tissue interaction analysis (Lev el 2) to procedural understanding (Level 3). L x denotes reasoning Lev el x . Reasoning traces are abbreviated with ellipses (. . . ) for brevity . reasoning priors are introduced through cold-start fine-tuning on reasoning trajectories synthesized under surgery-a ware constraints that encode structured domain knowledge. Then, reasoning capability is further refined via reinforcement learning using Group Relativ e Policy Optimization [ 67 ]. 
Surg-R1 is evaluated on thirteen datasets spanning six core surgical AI tasks, with seven public benchmarks and six multi-center external validation sets from five institutions, against proprietary reasoning models (GPT-5.1, Gemini 3.0 Pro), open-source generalist VLMs, and surgical-domain baselines. As shown in Figure 8, Surg-R1 produces structured multi-level reasoning chains that ground predictions in visual evidence. Surg-R1 achieves state-of-the-art performance across both settings, with the largest gains on compositional tasks. On CholecT50 triplet recognition, for example, Surg-R1 attains 51.69% accuracy versus 6.77% for GPT-5.1 and 8.01% for Qwen2.5-VL-7B-Surg. On multi-center external data it achieves an average arena score of 60.0%, compared with 44.9% for the leading surgical baseline. From a data-centric perspective, the consistent failure of general-purpose chain-of-thought reasoning on surgical compositional tasks, even in frontier models such as GPT-5.1, highlights the necessity of domain-specific structural priors. SurgΣ-DB's multi-granular annotation taxonomy provides the hierarchical scaffolding that makes effective surgical reasoning possible, and its structured instrument, tissue, and action vocabularies anchor the CoT synthesis pipeline in visual observations, suppressing the hallucination artifacts that arise when models reverse-engineer explanations from labels.

The three-level reasoning structure also brings practical advantages beyond training. Heterogeneous downstream systems can selectively consume only the granularity level relevant to their task: skill assessment modules analyze Level 2 interaction patterns, workflow management systems leverage Level 3 phase predictions, and safety monitors extract CVS criteria, all without post-processing unstructured natural-language output. The structured hierarchy further enables automatic cross-level consistency checking and precise fault localization when predictions are wrong. Finally, every inference produces a hierarchically labeled reasoning trace that, after clinical review, can be ingested as new supervision, forming a data flywheel that continuously lowers annotation cost and grows the structured reasoning corpus in SurgΣ-DB.
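To illustrate the level-selective consumption and cross-level consistency checking described above, the following sketch models a hierarchically labeled output as a simple container with per-level accessors. The field names, entity vocabulary, and consistency rule are illustrative assumptions, not Surg-R1's actual output schema.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningTrace:
    """Illustrative container for a three-level reasoning output."""
    level1_entities: list[str] = field(default_factory=list)                       # instruments / tissues seen
    level2_interactions: list[tuple[str, str, str]] = field(default_factory=list)  # (instrument, action, target)
    level3_context: dict = field(default_factory=dict)                             # e.g. {"phase": ..., "cvs": {...}}

    def for_skill_assessment(self):
        """Skill modules only need the Level-2 interaction patterns."""
        return self.level2_interactions

    def for_workflow_manager(self):
        """Workflow systems only need the Level-3 phase prediction."""
        return self.level3_context.get("phase")

    def consistency_errors(self):
        """Cross-level check: every Level-2 triplet should reference an
        instrument that was already grounded at Level 1."""
        grounded = set(self.level1_entities)
        return [t for t in self.level2_interactions if t[0] not in grounded]

# Usage: a well-formed trace yields no consistency errors; a violation
# points directly at the level where the prediction went wrong.
trace = ReasoningTrace(
    level1_entities=["grasper", "hook", "gallbladder"],
    level2_interactions=[("grasper", "retract", "gallbladder")],
    level3_context={"phase": "calot_triangle_dissection"},
)
assert trace.consistency_errors() == []
```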
4.4 Cosmos-H-Surgical: A Surgical World Model for Scalable Robot Policy Learning

Cosmos-H-Surgical [31] is a surgical world model and data-augmentation pipeline designed to bridge the gap between abundant unlabeled surgical video and the scarce paired video-kinematics data required for surgical robot vision-language-action (VLA) policy training. Based on SurgΣ-DB, a surgery-focused captioned dataset is curated with expert-authored action descriptions aligned to short surgical clips. The generative backbone of Cosmos-H-Surgical is based on a state-of-the-art video world model [4]. During training, Cosmos-H-Surgical learns to model surgical scene appearance and spatiotemporal dynamics under typical surgical imaging nuisances (specular highlights, occlusions, constrained tool motion), and to condition generation on fine-grained text descriptions so that synthesized videos preserve the actionable affordances and tool-tissue interactions required for downstream policy learning. An inverse-dynamics model (IDM) is trained on limited real paired demonstrations (when available) and then applied to synthetic video to recover pseudo-kinematics (approximate action/robot-state sequences), thus producing large-scale synthetic (video, pseudo-kinematics, text) triples suitable for supervised VLA or imitation-style policy learning. Cosmos-H-Surgical therefore turns unlabeled surgical video corpora into paired training data at scale, enabling standard VLA optimization without requiring dense manual robot-state annotation. Please refer to Cosmos-H-Surgical [31] for more implementation details.
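To make the IDM pseudo-labeling step concrete, the sketch below trains a small inverse-dynamics model on the limited real (frame-pair, action) demonstrations and then runs it over a world-model rollout to assemble one (video, pseudo-kinematics, text) triple. The architecture, the assumed 7-dimensional action space, and all function names are illustrative placeholders, not the Cosmos-H-Surgical implementation.

```python
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    """Predict the robot action connecting two consecutive frames.
    A deliberately small CNN stand-in for a real IDM."""
    def __init__(self, action_dim: int = 7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, action_dim)

    def forward(self, frame_t, frame_tp1):
        x = torch.cat([frame_t, frame_tp1], dim=1)  # stack along channels
        return self.head(self.encoder(x))

def train_idm(idm, real_pairs, real_actions, epochs=10, lr=1e-4):
    """Fit the IDM on the small set of real paired demonstrations.
    real_pairs: (N, 2, 3, H, W) frame pairs; real_actions: (N, action_dim)."""
    opt = torch.optim.Adam(idm.parameters(), lr=lr)
    for _ in range(epochs):
        pred = idm(real_pairs[:, 0], real_pairs[:, 1])
        loss = nn.functional.mse_loss(pred, real_actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return idm

@torch.no_grad()
def label_synthetic_rollout(idm, frames, prompt):
    """Apply the trained IDM to a world-model rollout (T, 3, H, W) to
    recover pseudo-kinematics, yielding one (video, actions, text) triple."""
    actions = [idm(frames[t:t + 1], frames[t + 1:t + 2]).squeeze(0)
               for t in range(frames.shape[0] - 1)]
    return {"video": frames, "pseudo_actions": torch.stack(actions), "text": prompt}
```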
Figure 9: Cosmos-H-Surgical results: new behavior generalization via strong text-video alignment. Given the same conditioning frame, our surgical world model generates distinct video rollouts corresponding to four task prompts: (1) one-time needle handover (left → right), (2) two-time needle handover (left → right → left), (3) three-time needle handover (left → right → left → right), and (4) needle puncture (the left needle driver punctures tissue).

Experimental results show that Cosmos-H-Surgical-augmented policies significantly outperform those trained solely on limited real demonstrations, achieving higher task success rates and improved sample efficiency. Although IDM-generated pseudo-kinematics are imperfect, they provide sufficiently informative supervision when paired with realistic world-model synthesis, and the combination of generative diversity and structured inverse inference enables effective scaling of embodied training data. From a data-centric perspective, Cosmos-H-Surgical highlights three key insights. First, domain-specific data curation is essential: generic video generation is insufficient for surgical scenarios, and the SATA dataset, curated from BSA with structured action annotations, provides the semantic grounding required to align generation with physically meaningful surgical actions. Second, forward physical consistency matters: world models must capture constrained tool kinematics and tissue deformation, otherwise synthetic demonstrations can introduce policy bias. Third, hybrid supervision is most effective: while synthetic augmentation reduces reliance on real demonstrations, the best performance arises from mixed training (synthetic plus limited real data), where real data anchors policies within true robot dynamics. Together with SurgVLM (large-scale perception and reasoning alignment) and Surg-R1 (hierarchical structured reasoning with reinforcement refinement), Cosmos-H-Surgical completes the pipeline toward embodied surgical intelligence: perception → reasoning → action. These results suggest that scalable surgical AI requires not only multimodal foundation models, but also structured world models that translate visual understanding into physical control.

5 Limitations and Future Work

We aim to progressively achieve comprehensive multi-task annotation coverage across all surgical scenes in SurgΣ-DB. Although its current v0.1 release spans diverse understanding, reasoning, planning, and generation tasks, full-spectrum supervision has not yet been uniformly established for every sample. While certain subsets already provide comprehensive multi-task and reasoning-level annotations, other samples remain limited to task-specific supervision (e.g., perception-level labels), and structured reasoning annotations are not consistently available across all conversations. This imbalance largely stems from the intrinsic complexity and high cost of surgical data collection and annotation. Surgical scenes are dynamic, anatomically intricate, and safety-critical, requiring domain expertise to precisely characterize fine-grained details and procedural context. In particular, reasoning annotations demand multi-level clinical interpretation and careful validation, making large-scale consistent annotation substantially more challenging than standard perception labeling. In future iterations beyond SurgΣ-DB v0.1, we will continue expanding cross-task coverage and enriching structured reasoning annotations under a unified label space and annotation framework, moving toward more holistic and fully aligned multimodal surgical foundation training.

6 Conclusion

In this work, we introduced SurgΣ, a unified spectrum of large-scale multimodal data and foundation models for surgical intelligence. At its core, SurgΣ-DB provides a systematically curated and scalable multimodal data foundation, consolidating heterogeneous surgical resources into a unified schema with consistent semantics and standardized formats across diverse procedures. SurgΣ-DB supports comprehensive supervision spanning understanding, reasoning, planning, and generation tasks, and is designed around three key principles: large-scale multimodal data, unified data representations, and structured reasoning annotations. Empirical evidence from four foundation models built upon SurgΣ-DB demonstrates that these data-centric design choices substantially enhance cross-task generalization and interpretable reasoning in complex surgical environments. We envision SurgΣ-DB as a scalable infrastructure for surgical foundation modeling, and will continue expanding its scale, diversity, and annotation completeness toward dense cross-task coverage within unified surgical scenes, advancing clinically reliable and generalizable multimodal surgical intelligence.

Acknowledgments

We sincerely thank Dillan Imans for his invaluable assistance with dataset annotation during his visit as a researcher at CUHK. We also thank Erli Zhang from NUS for his invaluable assistance with dataset summarizing.

References

[1] Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J Ballard, Joshua Bambrick, et al. Accurate structure prediction of biomolecular interactions with alphafold 3. Nature, 630(8016):493–500, 2024.
[2] Oluwatosin Alabi, Ko Ko Zayar Toe, Zijian Zhou, Charlie Budd, Nicholas Raison, Miaojing Shi, and Tom Vercauteren. Cholecinstanceseg: A tool instance segmentation dataset for laparoscopic surgery. Scientific Data, 12(1):825, 2025.
[3] Deepak Alapatt, Jennifer Eckhoff, Zhiliang Lyu, Yutong Ban, Jean-Paul Mazellier, Sarah Choksi, Kunyi Yang, Po-Hsing Chiang, Noemi Zorzetti, Samuele Cannas, et al. The SAGES critical view of safety challenge: A global benchmark for AI-assisted surgical quality assessment. arXiv preprint arXiv:2509.17100, 2025.
[4] Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical AI. arXiv preprint arXiv:2511.00062, 2025.
[5] Max Allan, Satoshi Kondo, Sebastian Bodenstedt, Stefan Leger, Rahim Kadkhodamohammadi, Imanol Luengo, Felix Fuentes, Evangello Flouty, Ahmed Mohammed, Marius Pedersen, et al. 2018 robotic scene segmentation challenge. arXiv preprint arXiv:2001.11190, 2020.
[6] Max Allan, Alex Shvets, Thomas Kurmann, Zichen Zhang, Rahul Duggal, Yun-Hsuan Su, Nicola Rieke, Iro Laina, Niveditha Kalavakonda, Sebastian Bodenstedt, et al. 2017 robotic instrument segmentation challenge. arXiv preprint, 2019.
[7] Nicolás Ayobi, Santiago Rodríguez, Alejandra Pérez, Isabela Hernández, Nicolás Aparicio, Eugénie Dessevres, Sebastián Peña, Jessica Santander, Juan Ignacio Caicedo, Nicolás Fernández, et al. Pixel-wise recognition for holistic surgical scene understanding. Medical Image Analysis, page 103726, 2025.
[8] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint, 2025.
[9] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report, 2025.
[10] Yutong Ban, Jennifer A. Eckhoff, Thomas M. Ward, Daniel A. Hashimoto, Ozanan R. Meireles, Daniela Rus, and Guy Rosman. Concept graph neural networks for surgical video understanding. IEEE Transactions on Medical Imaging, 43(1):264–274, 2024.
[11] Vivek Singh Bawa, Gurkirt Singh, Francis KapingA, Inna Skarga-Bandurova, Elettra Oleari, Alice Leporini, Carmela Landolfo, Pengfei Zhao, Xi Xiang, Gongning Luo, et al. The SARAS endoscopic surgeon action detection (ESAD) dataset: Challenges and methods. arXiv preprint, 2021.
[12] Christian Bluethgen, Pierre Chambon, Jean-Benoit Delbrouck, Rogier Van Der Sluijs, Małgorzata Połacin, Juan Manuel Zambrano Chaves, Tanishq Mathew Abraham, Shivanshu Purohit, Curtis P Langlotz, and Akshay S Chaudhari. A vision-language foundation model for the generation of realistic chest x-ray images. Nature Biomedical Engineering, 9(4):494–506, 2025.
[13] Matthias Carstens, Franziska M Rinner, Sebastian Bodenstedt, Alexander C Jenke, Jürgen Weitz, Marius Distler, Stefanie Speidel, and Fiona R Kolbinger. The Dresden surgical anatomy dataset for abdominal organ segmentation in surgical data science. Scientific Data, 10(1):3, 2023.
[14] João Cartucho, Alistair Weld, Samyakh Tukra, Haozheng Xu, Hiroki Matsuzaki, Taiyo Ishikawa, Minjun Kwon, Yong Eun Jang, Kwang-Ju Kim, Gwang Lee, et al. SurgT challenge: Benchmark of soft-tissue trackers for robotic surgery. Medical Image Analysis, 91:102985, 2024.
[15] Chengan Che, Chao Wang, Tom Vercauteren, Sophia Tsoka, and Luis C Garcia-Peraza-Herrera. Lemon: A large endoscopic monocular dataset and foundation model for perception in surgical settings. arXiv preprint arXiv:2503.19740, 2025.
[16] Kexin Chen, Yuyang Du, Tao You, Mobarakol Islam, Ziyu Guo, Yueming Jin, Guangyong Chen, and Pheng-Ann Heng. LLM-assisted multi-teacher continual learning for visual question answering in robotic surgery. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 10772–10778, 2024.
[17] Richard J Chen, Tong Ding, Ming Y Lu, Drew FK Williamson, Guillaume Jaume, Andrew H Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, et al. Towards a general-purpose foundation model for computational pathology. Nature Medicine, 30(3):850–862, 2024.
[18] Yiliang Chen, Zhixi Li, Cheng Xu, Alex Qinyang Liu, Ruize Cui, Xuemiao Xu, Jeremy Yuen-Chun Teoh, Shengfeng He, and Jing Qin. ProstaTD: Bridging surgical triplet from classification to fully supervised detection. arXiv preprint, 2025.
[19] Zhen Chen, Qing Xu, Jinlin Wu, Biao Yang, Yuhao Zhai, Geng Guo, Jing Zhang, Yinlu Ding, Nassir Navab, and Jiebo Luo. How far are surgeons from surgical world models? A pilot study on zero-shot surgical video generation with expert assessment. arXiv preprint arXiv:2511.01775, 2025.
[20] Jiajun Cheng, Xianwu Zhao, Sainan Liu, Xiaofan Yu, Ravi Prakash, Patrick J. Codd, Jonathan Elliott Katz, and Shan Lin. SurgXBench: Explainable vision-language model benchmark for surgery. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 8188–8198, March 2026.
[21] Tae-Min Choi, Tae Kyeong Jeong, Garam Kim, Jaemin Lee, Yeongyoon Koh, In Cheul Choi, Jae-Ho Chung, Jong Woong Park, and Juyoun Park. SurgMLLMBench: A multimodal large language model benchmark dataset for surgical scene understanding. arXiv preprint arXiv:2511.21339, 2025.
[22] Ronald de Jong, HJ Carolus, HA Franciscus, Romy C van Jaarsveld, Richard van Hillegersberg, PW Josien, Peter HN de With, Yasmina al Khalil, Fons van Der Sommen, et al. Scaling up self-supervised learning for improved surgical foundation models. Medical Image Analysis, page 103873, 2025.
[23] Tong Ding, Sophia J Wagner, Andrew H Song, Richard J Chen, Ming Y Lu, Andrew Zhang, Anurag J Vaidya, Guillaume Jaume, Muhammad Shaban, Ahrong Kim, et al. A multimodal whole-slide foundation model for pathology. Nature Medicine, pages 1–13, 2025.
[24] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
[25] Pierre E Dupont, Bradley J Nelson, Michael Goldfarb, Blake Hannaford, Arianna Menciassi, Marcia K O'Malley, Nabil Simaan, Pietro Valdastri, and Guang-Zhong Yang. A decade retrospective of medical robotics research from 2010 to 2020. Science Robotics, 6(60):eabi8017, 2021.
[26] Davide Ferrari, Tommaso Violante, Marco Novelli, Patrick P Starlinger, Rory L Smoot, Janani S Reisenauer, and David W Larson. The death of laparoscopy. Surgical Endoscopy, 38(5):2677–2688, 2024.
[27] Negin Ghamsarian, Yosuf El-Shabrawi, Sahar Nasirihaghighi, Doris Putzgruber-Adamitsch, Martin Zinkernagel, Sebastian Wolf, Klaus Schoeffmann, and Raphael Sznitman. Cataract-1k dataset for deep-learning-assisted analysis of cataract surgery videos. Scientific Data, 11(1):373, 2024.
[28] Fengyue Guo, Chengkun Li, Bin Peng, Yonghao Long, Jialun Pei, Mengya Xu, Ziling He, Guangsuo Wang, and Qi Dou. Surgical key step recognition with global-local modeling mamba in laparoscopic pulmonary lobectomy. In International Workshop on Collaborative Intelligence and Autonomy in Image-Guided Surgery, pages 11–20. Springer, 2025.
[29] Kaiming He, Jian Sun, and Xiaoou Tang. Single image haze removal using dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2341–2353, 2010.
[30] Runlong He, Mengya Xu, Adrito Das, Danyal Z Khan, Sophia Bano, Hani J Marcus, Danail Stoyanov, Matthew J Clarkson, and Mobarakol Islam. PitVQA: Image-grounded text embedding LLM for visual question answering in pituitary surgery. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 488–498. Springer, 2024.
[31] Yufan He, Pengfei Guo, Mengya Xu, Zhaoshuo Li, Andriy Myronenko, Dillan Imans, Bingjie Liu, Dongren Yang, Mingxue Gu, Yongnan Ji, et al. Cosmos-H-Surgical: Learning surgical robot policies from videos via world modeling. arXiv preprint arXiv:2512.23162, 2025.
[32] Aimin Jiang, Zhao Tang, Hanzhong Zhang, Jinxin Li, Jialin Meng, Ying Liu, Yu Fang, Juan Lu, Xu Zhang, Le Qu, et al. Current application status and innovative development of surgical robot. Med Research, 1(3):378–396, 2025.
[33] Jian Jiang, Chenxi Lin, Yiming Gu, Zengyi Qin, Zhitao Zeng, Kun Yuan, Yonghao Long, Xiang Xia, Cheng Yuan, Yuqi Wang, Zijie Yue, Kunyi Yang, Yuting Zhang, Zhu Zhuo, Dian Qin, Xin Wang, NG Chi Fai, Brian Anthony, Daguang Xu, Guy Rosman, Ozanan Meireles, Zizhen Zhang, Nicolas Padoy, Hesheng Wang, Qi Dou, Yueming Jin, and Yutong Ban. Surg-R1: A hierarchical reasoning foundation model for scalable and interpretable surgical decision support with multi-center clinical validation. arXiv preprint arXiv:2603.12430, 2026.
[34] Juseong Jin and Chang Wook Jeong. Surgical-LLaVA: Toward surgical scenario understanding via large language and vision models. arXiv preprint, 2024.
[35] Yueming Jin, Qi Dou, Hao Chen, Lequan Yu, Jing Qin, Chi-Wing Fu, and Pheng-Ann Heng. SV-RCNet: Workflow recognition from surgical videos using recurrent convolutional network. IEEE Transactions on Medical Imaging, 37(5):1114–1126, 2017.
[36] Yueming Jin, Yonghao Long, Xiaojie Gao, Danail Stoyanov, Qi Dou, and Pheng-Ann Heng. Trans-SVNet: Hybrid embedding aggregation transformer for surgical workflow analysis. International Journal of Computer Assisted Radiology and Surgery, 17(12):2193–2202, 2022.
[37] Xinwei Ju, Rema Daher, Razvan Caramalau, Baoru Huang, Danail Stoyanov, and Francisco Vasconcelos. SegCol challenge: Semantic segmentation for tools and fold edges in colonoscopy data. arXiv preprint arXiv:2412.16078, 2024.
[38] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.
[39] Sreeram Kamabattula, Kai Chen, and Kiran Bhattacharyya. Weakly supervised pre-training for surgical step recognition using unannotated and heterogeneously labeled videos. International Journal of Computer Assisted Radiology and Surgery, pages 1–11, 2025.
[40] Joël L Lavanchy, Sanat Ramesh, Diego Dall'Alba, Cristians Gonzalez, Paolo Fiorini, Beat P Müller-Stich, Philipp C Nett, Jacques Marescaux, Didier Mutter, and Nicolas Padoy. Challenges in multi-centric generalization: Phase and step recognition in Roux-en-Y gastric bypass surgery. International Journal of Computer Assisted Radiology and Surgery, 19(11):2249–2257, 2024.
[41] Jiajie Li, Garrett Skinner, Gene Yang, Brian R Quaranto, Steven D Schwaitzberg, Peter CW Kim, and Jinjun Xiong. LLaVA-Surg: Towards multimodal surgical assistant via structured surgical video learning. arXiv preprint arXiv:2408.07981, 2024.
[42] Yaoqian Li, Xikai Yang, Dunyuan Xu, Yang Yu, Litao Zhao, Xiaowei Hu, Jinpeng Li, and Pheng-Ann Heng. SurgPub-Video: A comprehensive surgical video dataset for enhanced surgical intelligence in vision-language model. arXiv preprint, 2025.
[43] Haofeng Liu, Ziyue Wang, Sudhanshu Mishra, Mingqi Gao, Guanyi Qin, Chang Han Low, Alex YW Kong, and Yueming Jin. SAM2S: Segment anything in surgical videos via semantic long-term tracking. arXiv preprint arXiv:2511.16618, 2025.
[44] Shengyuan Liu, Boyun Zheng, Wenting Chen, Zhihao Peng, Zhenfei Yin, Jing Shao, Jiancong Hu, and Yixuan Yuan. EndoBench: A comprehensive evaluation of multi-modal large language models for endoscopy analysis. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
[45] Chang Han Low, Ziyue Wang, Tianyi Zhang, Zhu Zhuo, Zhitao Zeng, Evangelos B Mazomenos, and Yueming Jin. SurgRAW: Multi-agent workflow with chain of thought reasoning for robotic surgical video analysis. IEEE Robotics and Automation Letters, 2026.
[46] Imanol Luengo, Maria Grammatikopoulou, Rahim Mohammadi, Chris Walsh, Chinedu Innocent Nwoye, Deepak Alapatt, Nicolas Padoy, Zhen-Liang Ni, Chen-Chen Fan, Gui-Bin Bian, et al. 2020 cataracts semantic segmentation challenge. arXiv preprint arXiv:2110.10965, 2021.
[47] Lena Maier-Hein, Martin Wagner, Tobias Ross, Annika Reinke, Sebastian Bodenstedt, Peter M Full, Hellena Hempe, Diana Mindroc-Filimon, Patrick Scholz, Thuy Nuong Tran, et al. Heidelberg colorectal data set for surgical data science in the sensor operating room. Scientific Data, 8(1):101, 2021.
[48] Pietro Mascagni, Deepak Alapatt, Aditya Murali, Armine Vardazaryan, Alain Garcia, Nariaki Okamoto, Guido Costamagna, Didier Mutter, Jacques Marescaux, Bernard Dallemagne, et al. Endoscapes, a critical view of safety and surgical scene segmentation dataset for laparoscopic cholecystectomy. Scientific Data, 12(1):331, 2025.
[49] John G Meara, Andrew JM Leather, Lars Hagander, Blake C Alkire, Nivaldo Alonso, Emmanuel A Ameh, Stephen W Bickler, Lesong Conteh, Anna J Dare, Justine Davies, et al. Global surgery 2030: Evidence and solutions for achieving health, welfare, and economic development. The Lancet, 386(9993):569–624, 2015.
[50] Baudolino Mussa, Barbara Defrancisco, Ludovico Campi, and Mario Morino. Single-port laparoscopy compared with conventional laparoscopic surgery: A systematic review and meta-analysis. Journal of Clinical Medicine, 14(14):4915, 2025.
[51] Hirenkumar Nakawala. Nephrec9. (No Title), 2017.
[52] Chinedu Innocent Nwoye, Cristians Gonzalez, Tong Yu, Pietro Mascagni, Didier Mutter, Jacques Marescaux, and Nicolas Padoy. Recognition of instrument-tissue interactions in endoscopic videos via action triplets. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 364–374. Springer, 2020.
[53] Chinedu Innocent Nwoye, Tong Yu, Cristians Gonzalez, Barbara Seeliger, Pietro Mascagni, Didier Mutter, Jacques Marescaux, and Nicolas Padoy. Rendezvous: Attention mechanisms for the recognition of surgical action triplets in endoscopic videos. Medical Image Analysis, 78:102433, 2022.
[54] Jialun Pei, Jiaan Zhang, Guanyi Qin, Kai Wang, Yueming Jin, and Pheng-Ann Heng. Instrument-tissue-guided surgical action triplet detection via textual-temporal trail exploration. IEEE Transactions on Medical Imaging, 2025.
[55] Alejandra Perez, Chinedu Nwoye, Ramtin Raji Kermani, Omid Mohareri, and Muhammad Abdullah Jamal. SurgLaVi: Large-scale hierarchical dataset for surgical vision-language representation learning. arXiv preprint arXiv:2509.10555, 2025.
[56] Alejandra Perez, Anita Rau, Lee White, Busisiwe Mlambo, Chinedu Nwoye, Muhammad Abdullah Jamal, and Omid Mohareri. Sureon: A benchmark and vision-language-model for surgical reasoning. arXiv preprint arXiv:2603.0657, 2026.
[57] Dimitrios Psychogyios, Emanuele Colleoni, Beatrice Van Amsterdam, Chih-Yang Li, Shu-Yu Huang, Yuchong Li, Fucang Jia, Baosheng Zou, Guotai Wang, Yang Liu, et al. SAR-RARP50: Segmentation of surgical instrumentation and action recognition on robot-assisted radical prostatectomy challenge. arXiv preprint arXiv:2401.00496, 2023.
[58] Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao, Yibing Fu, Haofeng Liu, Kai Wang, Chunjiang Li, and Yueming Jin. Surgo-R1: Benchmarking and modeling contextual reasoning for operative zone in surgical video. arXiv preprint arXiv:2602.21706, 2026.
[59] Anita Rau, Mark Endo, Josiah Aklilu, Jaewoo Heo, Khaled Saab, Alberto Paderno, Jeffrey Jopling, F Christopher Holsinger, and Serena Yeung-Levy. Systematic evaluation of large vision-language models for surgical artificial intelligence. arXiv preprint arXiv:2504.02799, 2025.
[60] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. In The Thirteenth International Conference on Learning Representations, 2024.
[61] Manuel Sebastián Ríos, María Alejandra Molina-Rodriguez, Daniella Londoño, Camilo Andrés Guillén, Sebastián Sierra, Felipe Zapata, and Luis Felipe Giraldo. Cholec80-CVS: An open dataset with an evaluation of Strasberg's critical view of safety for AI. Scientific Data, 10(1):194, 2023.
[62] Samuel Schmidgall, Ji Woong Kim, Jeffrey Jopling, and Axel Krieger. General surgery vision transformer: A video pre-trained foundation model for general surgery. arXiv preprint arXiv:2403.05949, 2024.
[63] Klaus Schoeffmann, Heinrich Husslein, Sabrina Kletz, Stefan Petscharnig, Bernd Muenzer, and Christian Beecks. Video retrieval in laparoscopic video recordings with dynamic content descriptors. Multimedia Tools and Applications, 77(13):16813–16832, 2018.
[64] Lalithkumar Seenivasan, Mobarakol Islam, Gokul Kannan, and Hongliang Ren. SurgicalGPT: End-to-end language-vision GPT for visual question answering in surgery. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 281–290, 2023.
[65] Lalithkumar Seenivasan, Mobarakol Islam, Adithya K Krishna, and Hongliang Ren. Surgical-VQA: Visual question answering in surgical scenes using transformer. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 33–43, 2022.
[66] Murat Sensoy, Lance Kaplan, and Melih Kandemir. Evidential deep learning to quantify classification uncertainty. Advances in Neural Information Processing Systems, 31, 2018.
[67] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint, 2024.
[68] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025.
[69] Yue Sun, Limei Wang, Gang Li, Weili Lin, and Li Wang. A foundation model for enhancing magnetic resonance images and downstream segmentation, registration and diagnostic tasks. Nature Biomedical Engineering, 9(4):521–538, 2025.
[70] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint, 2023.
[71] Andru P Twinanda, Sherif Shehata, Didier Mutter, Jacques Marescaux, Michel De Mathelin, and Nicolas Padoy. EndoNet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Transactions on Medical Imaging, 36(1):86–97, 2016.
[72] Andru P. Twinanda, Sherif Shehata, Didier Mutter, Jacques Marescaux, Michel de Mathelin, and Nicolas Padoy. EndoNet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Transactions on Medical Imaging, 36(1):86–97, 2017.
[73] Eugene Vorontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Kristen Severson, Eric Zimmermann, James Hall, Neil Tenenholtz, Nicolo Fusi, et al. A foundation model for clinical-grade computational pathology and rare cancers detection. Nature Medicine, 30(10):2924–2935, 2024.
[74] Martin Wagner, Beat-Peter Müller-Stich, Anna Kisilenko, Duc Tran, Patrick Heger, Lars Mündermann, David M Lubotsky, Benjamin Müller, Tornike Davitashvili, Manuela Capek, et al. Comparative validation of machine learning algorithms for surgical workflow and skill analysis with the HeiChole benchmark. Medical Image Analysis, 86:102770, 2023.
[75] Guankun Wang, Long Bai, Wan Jun Nah, Jie Wang, Zhaoxi Zhang, Zhen Chen, Jinlin Wu, Mobarakol Islam, Hongbin Liu, and Hongliang Ren. Surgical-LVLM: Learning to adapt large vision-language model for grounded visual question answering in robotic surgery. arXiv preprint arXiv:2405.10948, 2024.
[76] Guankun Wang, Long Bai, Junyi Wang, Kun Yuan, Zhen Li, Tianxu Jiang, Xiting He, Jinlin Wu, Zhen Chen, Zhen Lei, et al. EndoChat: Grounded multimodal large language model for endoscopic surgery. arXiv preprint arXiv:2501.11347, 2025.
[77] Guankun Wang, Junyi Wang, Wenjin Mo, Long Bai, Kun Yuan, Ming Hu, Jinlin Wu, Junjun He, Yiming Huang, Nicolas Padoy, et al. SurgVidLM: Towards multi-grained surgical video understanding with large language model. arXiv preprint, 2025.
[78] Guankun Wang, Han Xiao, Renrui Zhang, Huxin Gao, Long Bai, Xiaoxiao Yang, Zhen Li, Hongsheng Li, and Hongliang Ren. CoPESD: A multi-level surgical motion dataset for training large vision-language models to co-pilot endoscopic submucosal dissection. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 12636–12643, 2025.
[79] Zhao Wang, Chang Liu, Shaoting Zhang, and Qi Dou. Foundation model for endoscopy video analysis via large-scale self-supervised pre-train. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 101–111. Springer, 2023.
[80] Zipei Wang, Sitian Pan, Mengjie Fang, Ruofan Zhang, Jie Tian, and Di Dong. CholecMamba: A Mamba-based multimodal reasoning model for cholecystectomy surgery. In Proceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2025, volume LNCS 15968, pages 107–116. Springer Nature Switzerland, September 2025.
[81] Ziyi Wang, Bo Lu, Yonghao Long, Fangxun Zhong, Tak-Hong Cheung, Qi Dou, and Yunhui Liu. AutoLaparo: A new dataset of integrated multi-tasks for image-guided surgical automation in laparoscopic hysterectomy. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 486–496, 2022.
[82] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
[83] Jianhui Wei, Zikai Xiao, Danyu Sun, Luqi Gong, Zongxin Yang, Zuozhu Liu, and Jian Wu. SurgBench: A unified large-scale benchmark for surgical video analysis. arXiv preprint, 2025.
[84] Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Hui Hui, Yanfeng Wang, and Weidi Xie. Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. Nature Communications, 16(1):7866, 2025.
[85] Jinlin Wu, Felix Holm, Chuxi Chen, An Wang, Yaxin Hu, Xiaofan Ye, Zelin Zang, Miao Xu, Lihua Zhou, Huai Liao, et al. UniSurg: A video-native foundation model for universal understanding of surgical videos. arXiv preprint arXiv:2602.05638, 2026.
[86] Yingcheng Charles Wu, Ming Yin, Baiyu Shi, Zaixi Zhang, Di Yin, Xiaotong Wang, Youjuan Wang, Jigang Fan, Ruofan Jin, Hanchen Wang, et al. Medos: AI-XR-cobot world model for clinical perception and action. medRxiv, pages 2026–02, 2026.
[87] Mengya Xu, Daiyun Shen, Jie Zhang, Hon Chi Yip, Yujia Gao, Cheng Chen, Dillan Imans, Yonghao Long, Yiru Ye, Yixiao Liu, Rongyun Mai, Kai Chen, Hongliang Ren, Yutong Ban, Guangsuo Wang, Francis Wong, Chi-Fai Ng, Kee Yuan Ngiam, Russell H. Taylor, Daguang Xu, Yueming Jin, and Qi Dou. Generalized recognition of basic surgical actions enables skill assessment and vision-language-model-based surgical planning. arXiv preprint, 2026.
[88] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything V2. Advances in Neural Information Processing Systems, 37:21875–21911, 2024.
[89] Shu Yang, Fengtao Zhou, Leon Mayer, Fuxiang Huang, Yiliang Chen, Yihui Wang, Sunan He, Yuxiang Nie, Xi Wang, Yueming Jin, et al. Large-scale self-supervised video foundation model for intelligent surgery. npj Digital Medicine, 2026.
[90] Cheng Yuan, Jian Jiang, Kunyi Yang, Lv Wu, Rui Wang, Zi Meng, Haonan Ping, Ziyu Xu, Yifan Zhou, Wanli Song, Hesheng Wang, Yueming Jin, Qi Dou, and Yutong Ban. Systematic evaluation and guidelines for segment anything model in surgical video analysis. arXiv preprint, 2025.
[91] Kun Yuan, Manasi Kattel, Joël L Lavanchy, Nassir Navab, Vinkle Srivastav, and Nicolas Padoy. Advancing surgical VQA with scene graph knowledge. International Journal of Computer Assisted Radiology and Surgery, 19(7):1409–1417, 2024.
[92] Kun Yuan, Vinkle Srivastav, Nassir Navab, and Nicolas Padoy. HecVL: Hierarchical video-language pretraining for zero-shot surgical phase recognition. In Proceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2024, volume LNCS 15006, pages 306–316. Springer Nature Switzerland, October 2024.
[93] Kun Yuan, Vinkle Srivastav, Nassir Navab, and Nicolas Padoy. Procedure-aware surgical video-language pretraining with hierarchical knowledge augmentation. Advances in Neural Information Processing Systems, 37:122952–122983, 2024.
[94] Kun Yuan, Vinkle Srivastav, Tong Yu, Joel L Lavanchy, Jacques Marescaux, Pietro Mascagni, Nassir Navab, and Nicolas Padoy. Learning multi-modal representations by watching hundreds of surgical video lectures. Medical Image Analysis, 105:103644, 2025.
[95] Zhitao Zeng, Zhu Zhuo, Xiaojun Jia, Erli Zhang, Junde Wu, Jiaan Zhang, Yuxuan Wang, Chang Han Low, Jian Jiang, Zilong Zheng, et al. SurgVLM: A large vision-language model and systematic evaluation benchmark for surgical intelligence. arXiv preprint arXiv:2506.02555, 2025.