Automatic Recognition of Camera Activation and Position Estimation in the DaVinci Xi System


📝 Abstract

Purpose: Robot-assisted minimally invasive surgery relies on endoscopic video as the sole intraoperative visual feedback. The DaVinci Xi system overlays a graphical user interface (UI) that indicates the state of each robotic arm, including the activation of the endoscope arm. Detecting this activation provides valuable metadata such as camera movement information, which can support downstream surgical data science tasks including tool tracking, skill assessment, or camera control automation. Methods: We developed a lightweight pipeline based on a ResNet18 convolutional neural network to automatically identify the position of the camera tile and its activation state within the DaVinci Xi UI. The model was fine-tuned on manually annotated data from the SurgToolLoc dataset and evaluated across three public datasets comprising over 70,000 frames. Results: The model achieved F1-scores between 0.993 and 1.000 for the binary detection of active cameras and correctly localized the camera tile in all cases without false multiple-camera detections. Conclusion: The proposed pipeline enables reliable, real-time extraction of camera activation metadata from surgical videos, facilitating automated preprocessing and analysis for diverse downstream applications. All code, trained models, and annotations are publicly available.


📄 Content

Robot-assisted minimally invasive surgery (RAMIS) relies on endoscopic video as the sole source of intraoperative visual feedback. The graphical user interface (UI) of robotic systems is typically overlaid on these videos and provides information about attached instruments and their operational states. While surgical data science commonly analyzes endoscopic video streams to study surgical workflow, skill, or anatomy, the embedded UI is rarely utilized, despite offering valuable and easily accessible metadata such as instrument and camera activation states.

In the DaVinci Xi system, activation of the camera arm directly indicates camera movement, as the endoscope can only be repositioned while the arm is actively controlled. This movement information is highly relevant for multiple downstream tasks. For simultaneous localization and mapping (SLAM), it enables the selection of scenes with either moving or stationary cameras. In surgical skill analysis, the extent of camera motion can serve as an indicator of camera handling proficiency. When analyzing individual frames, motion often causes blur, and such frames can be filtered accordingly. In tool tracking, distinguishing between instrument and camera motion is essential, as camera movement distorts the measured path length and range of motion. Finally, for training robot-assisted camera control, learning when and how to move the camera directly corresponds to the activation of the camera arm in RAMIS.

As this information is inherently available within recorded surgical videos containing the da Vinci Xi UI, automatic extraction of this metadata would significantly facilitate their reuse in downstream tasks. In this work, we present a lightweight pipeline that automatically detects the position of the camera tile and its activation state in the da Vinci Xi UI. Additionally, we release the manually created annotations used for training and evaluation across three publicly available datasets.

The complete pipeline, model, and annotations are available at https://gitlab.com/nct_tso_public/xicad.

The DaVinci Xi UI is overlaid on the endoscopic video stream. It mainly consists of four semi-transparent rectangular tiles shown continuously at the bottom of the frame (see example in Fig. 1). Other UI elements like virtual pointers, arm popups, off-screen indicators or system status messages will not be used in this work and are therefore not further introduced.

Each tile represents the state of one of the four robot arms, which hold up to three instruments and the endoscope. The tiles are numbered one to four to identify the arm they represent. The numbers are not necessarily ordered, although in the datasets used here they generally are. Instrument tiles show the instrument name and activation status, while the endoscope tile shows icons describing table orientation, endoscope horizon, zoom level, and view angle. A light blue highlight of a tile indicates which arms are currently being controlled by the surgeon; at any time, this can be either two of the three instruments or just the endoscope.

(Example frame from DSAD, Liver 02, image00; modified)
Fig. 1 Overview of the proposed pipeline for detecting the camera tile and its activation state in the DaVinci Xi user interface. Four tiles are cropped out of the frame, each is classified by a finetuned ResNet18 as no camera, inactive camera, or active camera. The tile-level predictions are then combined through simple logic to yield the final frame-level camera activation and position.

The endoscope tile only becomes highlighted when the surgeon actively presses and holds down the endoscope control foot pedal. Consequently, the camera is not activated in idle situations, which allows us to assume that the camera is very likely being moved when the tile is highlighted.

To extract the camera activity from the UI, we first need to detect which of the four tiles shows the endoscope information and subsequently check whether that tile is active. First, we remove any black border around the endoscope image, resulting in a 5:4 aspect ratio, and rescale the image to 640 × 521 pixels. Next, we crop the image, yielding four tile images. To accommodate slight shifts in the UI placement due to padding or compression in the recording setups, each crop covers an additional margin of four pixels around the expected position, resulting in 168 × 28 pixels per tile. Each tile is then processed separately by the neural network.
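The cropping step above can be sketched as follows. Note that the paper specifies the frame size (640 × 521) and tile size (168 × 28, margin included) but not the exact tile offsets; the evenly spaced horizontal positions below are an illustrative assumption, not the authors' exact coordinates.

```python
import numpy as np

def crop_tiles(frame, tile_w=168, tile_h=28):
    """Crop the four UI tile regions from a 640x521 frame.

    The horizontal offsets are a hypothetical, evenly spaced layout;
    the 168x28 crop size already includes the 4-pixel safety margin
    described in the text.
    """
    h, w = frame.shape[:2]
    assert (w, h) == (640, 521), "frame must be rescaled to 640x521 first"
    y0 = h - tile_h  # tiles sit at the bottom of the frame
    # Spread four crops evenly so the last one still fits in the frame
    xs = [round(i * (w - tile_w) / 3) for i in range(4)]
    return [frame[y0:y0 + tile_h, x:x + tile_w] for x in xs]
```

Each of the four crops is then fed to the classifier independently.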

The neural network consists of a ResNet18 [1] pretrained on ImageNet [2] and finetuned on the given task. The classification layer was replaced to output three classes: no camera, inactive camera, and active camera. The small model takes the 168 × 28 pixel input image and predicts a score for each of the three classes, which are reduced to the most likely class via argmax during inference.

As the four tiles of the UI are processed separately, the four tile-level predictions are subsequently combined through simple logic to obtain the frame-level result: the camera position is given by the single tile classified as a camera, and the camera is considered active if that tile is classified as an active camera.
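The combination step can be expressed as a few lines of logic. The class indices (0 = no camera, 1 = inactive camera, 2 = active camera) and the handling of frames without a unique camera tile are assumptions for illustration, not the authors' exact implementation:

```python
def combine_tile_predictions(tile_classes):
    """Combine four per-tile classes into a frame-level result.

    tile_classes: list of four ints, 0 = no camera,
    1 = inactive camera, 2 = active camera (assumed indices).
    Returns (camera_tile_index, camera_active); (None, False) when
    no tile or multiple tiles are classified as a camera.
    """
    camera_idx = [i for i, c in enumerate(tile_classes) if c in (1, 2)]
    if len(camera_idx) != 1:
        return None, False
    idx = camera_idx[0]
    return idx, tile_classes[idx] == 2
```

For example, `combine_tile_predictions([0, 2, 0, 0])` reports tile 1 as an active camera, while an all-zero input yields no camera detection.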
