Self-Supervised Uncalibrated Multi-View Video Anonymization in the Operating Room
Privacy preservation is a prerequisite for using video data in Operating Room (OR) research. Effective anonymization relies on the exhaustive localization of every individual; even a single missed detection necessitates extensive manual correction. However, existing approaches face two critical scalability bottlenecks: (1) they usually require manual annotations for each new clinical site to achieve high accuracy; (2) while multi-camera setups have been widely adopted to address single-view ambiguity, camera calibration is typically required whenever cameras are repositioned. To address these problems, we propose a novel self-supervised multi-view video anonymization framework consisting of whole-body person detection and whole-body pose estimation, without annotation or camera calibration. Our core strategy is to enhance the single-view detector by “retrieving” false negatives using temporal and multi-view context, and conducting self-supervised domain adaptation. We first run an off-the-shelf whole-body person detector in each view with a low score threshold to gather candidate detections. Then, we retrieve the low-score false negatives that exhibit consistency with the high-score detections via tracking and self-supervised uncalibrated multi-view association. These recovered detections serve as pseudo labels to iteratively fine-tune the whole-body detector. Finally, we apply whole-body pose estimation on each detected person, and fine-tune the pose model using its own high-score predictions. Experiments on the 4D-OR dataset of simulated surgeries and our dataset of real surgeries show the effectiveness of our approach, achieving over 97% recall. Moreover, we train a real-time whole-body detector using our pseudo labels, achieving comparable performance and highlighting our method’s practical applicability. Code will be available at https://github.com/CAMMA-public/OR_anonymization.
💡 Research Summary
The paper tackles the critical problem of automatically anonymizing operating‑room (OR) video recordings, where any missed detection of a person can lead to privacy violations and costly manual review. Existing solutions either rely on extensive manual annotations for each clinical site or require calibrated multi‑camera setups, both of which hinder scalability. To overcome these bottlenecks, the authors propose a fully self‑supervised, uncalibrated multi‑view framework that combines whole‑body detection, multi‑object tracking, cross‑view person association, and whole‑body pose estimation.
The pipeline begins by running an off‑the‑shelf whole‑body detector (pre‑trained on CrowdHuman) on each camera view with a low confidence threshold, thereby generating a dense set of candidate bounding boxes. High‑confidence detections serve as queries for a ByteTrack‑style tracker that treats every candidate box as a potential match, allowing low‑score boxes that are temporally consistent to be linked into tracklets. Next, the authors train a self‑supervised multi‑view association network (based on the Self‑MV‑A approach) without any camera calibration. The network learns to discriminate whether two images captured at the same time belong to the same person, using both appearance and geometric cues. By feeding the tracklets from one view as queries and the full set of detections from another view as the gallery, the association module retrieves boxes that were missed by the detector or tracker due to occlusion in the query view but are visible elsewhere.
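The temporal recovery step can be illustrated with a minimal sketch of a ByteTrack-style two-stage association: high-score boxes are matched to existing tracklets first, and tracklets left unmatched then get a second pass against the low-score candidates, "retrieving" boxes that are temporally consistent but scored below the detection threshold. This sketch uses greedy IoU matching in place of the Hungarian assignment real trackers use, and all function and field names (`greedy_match`, `"box"`, `"score"`) are illustrative, not the paper's API.

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def greedy_match(tracks, dets, iou_thr):
    """Greedily pair each track with its best-overlapping unused detection.

    Returns (matched pairs, tracks that found no detection).
    """
    pairs, used, leftover = [], set(), []
    for t in tracks:
        best_j, best_iou = -1, iou_thr
        for j, d in enumerate(dets):
            if j in used:
                continue
            v = iou(t["box"], d["box"])
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            used.add(best_j)
            pairs.append((t, dets[best_j]))
        else:
            leftover.append(t)
    return pairs, leftover

def associate(tracks, detections, score_thr=0.5, iou_thr=0.3):
    """Two-stage association: high-score boxes first, then low-score boxes.

    The second stage is where temporally consistent false negatives are
    recovered, since only tracklets unmatched in stage one reach it.
    """
    high = [d for d in detections if d["score"] >= score_thr]
    low = [d for d in detections if d["score"] < score_thr]
    matched, unmatched = greedy_match(tracks, high, iou_thr)
    recovered, _ = greedy_match(unmatched, low, iou_thr)
    return matched, recovered
```

In the paper's pipeline, the `recovered` pairs are exactly the low-score boxes that become pseudo-labels in the next stage.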
All retrieved boxes are merged with the original high‑score detections to form an augmented detection set. These augmented detections are treated as pseudo‑labels to fine‑tune the whole‑body detector in a self‑supervised domain‑adaptation loop. The process—detect → track → associate → augment → fine‑tune—is repeated for several iterations, each time improving the quality of pseudo‑labels and consequently the detector’s recall.
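The iterative loop described above can be expressed as a small orchestration skeleton. The stage functions are passed in as callables because their real implementations (detector inference, the tracker, the association network, the fine-tuning step) are out of scope here; every name in this sketch is a hypothetical placeholder, not the authors' code.

```python
def self_training_loop(detector, videos, stages, n_iters=3):
    """Run the detect -> track -> associate -> augment -> fine-tune loop.

    `stages` maps stage names to user-supplied callables (all illustrative):
      detect(detector, video)        -> candidate boxes for one video
      track(candidates)              -> boxes recovered via tracking
      associate(tracklets, videos)   -> boxes recovered via cross-view search
      fine_tune(detector, pseudo)    -> updated detector
    Each iteration rebuilds the augmented pseudo-label set and fine-tunes
    the detector on it, so later iterations start from better detections.
    """
    for _ in range(n_iters):
        pseudo = {}
        for v in videos:
            cand = stages["detect"](detector, v)
            tracked = stages["track"](cand)
            cross = stages["associate"](tracked, videos)
            # Augmented detection set: originals plus both recovery sources.
            pseudo[v] = cand + tracked + cross
        detector = stages["fine_tune"](detector, pseudo)
    return detector
```

The key design point this skeleton captures is that the recovery stages only produce labels; all learning happens in `fine_tune`, so the loop converges as the detector itself starts finding the boxes that previously had to be recovered.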
After the detection stage, a whole‑body pose estimator is applied to each detected person. The pose model is also self‑supervised: high‑confidence joint predictions on the augmented detections are used as pseudo‑labels to further refine the pose network. This two‑stage approach enables precise localization of eyes, faces, or half‑bodies, allowing flexible anonymization policies.
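The pose self-training step reduces to a filtering rule: keep a prediction as a pseudo-label only if enough of its keypoints are confident. A minimal sketch, with threshold values and the `(x, y, confidence)` keypoint layout chosen for illustration rather than taken from the paper:

```python
def select_pose_pseudo_labels(predictions, kpt_thr=0.8, min_visible=5):
    """Filter pose predictions into pseudo-labels for self-training.

    predictions: list of dicts with a "keypoints" list of (x, y, conf)
    tuples. A person is kept only if at least `min_visible` keypoints
    exceed `kpt_thr`, so the pose model is fine-tuned only on its own
    high-confidence outputs (both thresholds are illustrative).
    """
    labels = []
    for person in predictions:
        kept = [(x, y) for (x, y, c) in person["keypoints"] if c >= kpt_thr]
        if len(kept) >= min_visible:
            labels.append({"keypoints": kept})
    return labels
```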
The method is evaluated on two datasets: the simulated 4D‑OR dataset and a newly collected real‑surgery dataset featuring multiple ceiling‑mounted cameras, heavy occlusions, masks, and surgical equipment. For each person, the authors annotated a whole‑body bounding box, a “hard case” flag (occlusion > 67 %), and three keypoints (eyes and chin) when visible. Two stringent metrics are introduced—hard‑case recall and holistic recall (percentage of subjects anonymized across all views). The proposed system achieves over 97 % recall on both datasets, surpassing prior state‑of‑the‑art methods by a large margin. In a one‑hour real surgery video, it reduces missed detections by more than 13 500 instances, translating into a reduction of manual review time from over 20 hours to a few minutes. Moreover, the high‑quality pseudo‑labels enable training of a real‑time whole‑body detector (30 FPS) with performance comparable to the offline fine‑tuned model, demonstrating practical deployability.
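The holistic-recall metric can be made concrete with a small sketch: a subject counts as anonymized only if they are detected in every view in which they are visible. The data layout below (`(frame, view)` keys) is a hypothetical simplification; the paper's exact bookkeeping may differ.

```python
def holistic_recall(gt, detections):
    """Fraction of subjects detected in *every* view where they are visible.

    gt: dict mapping person_id -> set of (frame, view) pairs where that
        person is visible in the ground truth.
    detections: set of (person_id, frame, view) triples the system found.
    A single missed view for a subject drops that subject entirely, which
    is what makes the metric stringent for anonymization.
    """
    if not gt:
        return 1.0
    covered = sum(
        all((pid, f, v) in detections for (f, v) in views)
        for pid, views in gt.items()
    )
    return covered / len(gt)
```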
Key contributions include: (1) a fully annotation‑free, calibration‑free OR video anonymization pipeline; (2) a novel combination of tracking and uncalibrated multi‑view association to recover false negatives; (3) an iterative self‑supervised domain adaptation loop that yields both high‑accuracy detection and real‑time models; and (4) extensive evaluation using strict recall metrics on both simulated and real surgical videos. Limitations are acknowledged: the approach assumes reasonably synchronized cameras and may still miss extremely heavily occluded persons (> 90 % occlusion). Future work will explore asynchronous association and more sophisticated temporal models to further close this gap.