A benchmark for video-based laparoscopic skill analysis and assessment

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Laparoscopic surgery is a complex surgical technique that requires extensive training. Recent advances in deep learning have shown promise in supporting this training by enabling automatic video-based assessment of surgical skills. However, the development and evaluation of deep learning models is currently hindered by the limited size of available annotated datasets. To address this gap, we introduce the Laparoscopic Skill Analysis and Assessment (LASANA) dataset, comprising 1270 stereo video recordings of four basic laparoscopic training tasks. Each recording is annotated with a structured skill rating, aggregated from three independent raters, as well as binary labels indicating the presence or absence of task-specific errors. The majority of recordings originate from a laparoscopic training course, thereby reflecting a natural variation in the skill of participants. To facilitate benchmarking of both existing and novel approaches for video-based skill assessment and error recognition, we provide predefined data splits for each task. Furthermore, we present baseline results from a deep learning model as a reference point for future comparisons.


💡 Research Summary

The paper presents LASANA (Laparoscopic Skill Analysis and Assessment), a comprehensive benchmark dataset designed to accelerate research on video‑based automatic assessment of laparoscopic surgical skills. LASANA consists of 1,270 synchronized stereo video recordings captured at 20 fps (960 × 540 px) using a Karl Storz TIPCAM 1 S 3D LAP endoscope. The recordings cover four fundamental training tasks—peg transfer, circle cutting, balloon resection, and suture & knot—performed by 70 participants (58 medical students and 12 clinicians). Each participant was recorded multiple times throughout a structured training course, yielding up to six trials per person and thereby capturing natural skill progression.

Every video is annotated with two complementary layers of information. The first layer is a structured skill rating inspired by the Global Operative Assessment of Laparoscopic Skills (GOALS). Four dimensions—depth perception, efficiency, bimanual dexterity, and tissue handling—are each scored on a five‑point Likert scale, and the sum of these dimensions constitutes the total Global Rating Score (GRS). Three independent raters provided the scores; inter‑rater reliability measured by Lin’s Concordance Correlation Coefficient exceeds 0.65 for all tasks except circle cutting (ρc = 0.49), indicating acceptable consistency. The second annotation layer consists of binary error labels specific to each task (e.g., object dropped, cutting imprecise, balloon perforated, knot comes apart). Errors are recorded only as present/absent for the entire video, without temporal localization or severity grading.
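The inter-rater reliability figures above use Lin's Concordance Correlation Coefficient, which, unlike Pearson correlation, penalizes both location and scale differences between two raters' scores. A minimal sketch of the statistic (not taken from the authors' code) is:

```python
import numpy as np

def lins_ccc(x, y):
    """Lin's Concordance Correlation Coefficient between two raters' scores.

    rho_c = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    using population variance/covariance. Equals 1 only for perfect agreement.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()            # population variance (ddof=0)
    cov = ((x - mx) * (y - my)).mean()   # population covariance
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)
```

For example, a rater who is perfectly correlated with another but systematically one point higher yields a CCC below 1, which is exactly the kind of systematic disagreement the GRS reliability analysis needs to detect.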

To ensure data quality, videos were trimmed to the exact task duration, filtered for illumination, material suitability, and execution correctness, and excluded if they exceeded predefined time limits (6 min for peg transfer, 10 min for the others). File names were replaced with random pseudo‑English identifiers to anonymize participants and trial numbers. The dataset is released with predefined participant‑wise splits (training, validation, test) following a “Leave‑Users‑Out” protocol, guaranteeing that the same individual never appears in more than one split. This design enables robust evaluation of model generalization to unseen surgeons.
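The "Leave-Users-Out" protocol partitions participants, not videos, so that all trials of one person land in exactly one split. A minimal sketch of such a split (the fractions and seed here are illustrative, not the dataset's official values) could look like:

```python
import random

def leave_users_out_split(participant_ids, val_frac=0.15, test_frac=0.15, seed=0):
    """Partition participant IDs into train/val/test sets.

    Splitting at the participant level guarantees that no individual's
    recordings appear in more than one split, so evaluation measures
    generalization to unseen surgeons.
    """
    ids = sorted(set(participant_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n = len(ids)
    n_test = max(1, int(n * test_frac))
    n_val = max(1, int(n * val_frac))
    test = set(ids[:n_test])
    val = set(ids[n_test:n_test + n_val])
    train = set(ids[n_test + n_val:])
    return train, val, test
```

Videos are then assigned to splits by looking up their (pseudonymized) participant ID, never by shuffling videos directly.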

The authors also provide baseline results using a straightforward deep‑learning pipeline. Video frames are first processed by a convolutional neural network (e.g., ResNet) to extract spatial features; these features are then fed into a temporal model (LSTM or Transformer) to aggregate information across time. For skill assessment, the network regresses the GRS, optimized with mean‑squared error loss. For error detection, a multi‑label binary classification head predicts the presence of each task‑specific error, trained with binary cross‑entropy loss. Reported performance includes a mean absolute error of approximately 1.2 points on the GRS and an average area under the ROC curve of about 0.78 for error recognition, establishing a solid reference point for future work.
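The two baseline heads optimize different objectives: mean-squared error for the continuous GRS regression, and per-label binary cross-entropy for error detection. A minimal numpy sketch of these objectives (a simplified stand-in for the framework losses the authors presumably use, e.g. their PyTorch or TensorFlow equivalents):

```python
import numpy as np

def mse_loss(pred_grs, true_grs):
    """Mean-squared error for GRS regression (skill-assessment head)."""
    pred_grs = np.asarray(pred_grs, dtype=float)
    true_grs = np.asarray(true_grs, dtype=float)
    return float(np.mean((pred_grs - true_grs) ** 2))

def bce_loss(logits, labels, eps=1e-7):
    """Binary cross-entropy over task-specific error labels (multi-label head).

    Each logit is squashed through a sigmoid independently, so one video
    can be flagged with several errors at once.
    """
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    y = np.asarray(labels, dtype=float)
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))
```

The independent sigmoids (rather than a softmax) reflect that task-specific errors are not mutually exclusive: a trial may, for instance, both drop an object and exceed a precision tolerance.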

Compared with existing public surgical video datasets such as JIGSAWS, ROSMA, and AIxSuture, LASANA stands out in several respects. It offers a larger number of participants and recordings, includes longitudinal data that reflect skill acquisition, provides both skill scores and error annotations, and supplies standardized data splits together with baseline metrics. These attributes make it a valuable resource for developing and benchmarking more sophisticated models, such as 3‑D CNNs that exploit stereo depth, multimodal approaches that combine video with kinematic data, or attention‑based architectures that can localize errors in time.

Nevertheless, the dataset has limitations. The four tasks are relatively simple and may not capture the complexity of real operative procedures. Error annotations are binary and lack temporal granularity, which constrains research on fine‑grained error detection and corrective feedback. Although stereo video is available, the benchmark does not prescribe how to leverage depth information, leaving that choice to individual researchers. Future extensions could incorporate more challenging surgical scenarios, richer error labeling (including timestamps and severity), and additional modalities such as instrument kinematics or force data.

In summary, LASANA fills a critical gap in the field of automated laparoscopic skill assessment by delivering a large‑scale, well‑annotated, and openly accessible video dataset together with clear benchmarking protocols. It is poised to become a standard testbed for the next generation of AI‑driven surgical education tools, enabling systematic comparisons, fostering reproducibility, and ultimately supporting the development of real‑time, objective feedback systems for trainees and practicing surgeons alike.

