Slot-BERT: Self-supervised Object Discovery in Surgical Video
Object-centric slot attention is a powerful framework for unsupervised learning of structured and explainable representations that can support reasoning about objects and actions, including in surgical videos. While conventional object-centric methods for videos leverage recurrent processing to achieve efficiency, they often struggle to maintain the long-range temporal coherence required for long videos in surgical applications. On the other hand, fully parallel processing of entire videos enhances temporal consistency but introduces significant computational overhead, making it impractical to deploy on hardware typically available in medical facilities. We present Slot-BERT, a bidirectional long-range model that learns object-centric representations in a latent space while ensuring robust temporal coherence. Slot-BERT scales object discovery seamlessly to long videos of unconstrained lengths. A novel slot contrastive loss further reduces redundancy and improves representation disentanglement by enhancing slot orthogonality. We evaluate Slot-BERT on real-world surgical video datasets from abdominal, cholecystectomy, and thoracic procedures. Our method surpasses state-of-the-art object-centric approaches under unsupervised training, achieving superior performance across diverse domains. We also demonstrate efficient zero-shot domain adaptation to data from diverse surgical specialties and databases.
💡 Research Summary
Slot‑BERT introduces a self‑supervised, object‑centric framework tailored for long surgical videos. The authors identify two fundamental bottlenecks in prior slot‑attention video models: (i) recurrent frame‑by‑frame processing is memory‑efficient but accumulates errors and loses the long‑range temporal coherence needed for extended sequences, and (ii) fully parallel processing of the entire video, while temporally consistent, demands excessive GPU memory that is unrealistic for typical hospital hardware. To overcome these issues, Slot‑BERT combines a conventional slot‑attention encoder with a Temporal Slot Transformer (TST), a bidirectional transformer that treats the K slots extracted from each frame as a token sequence. By masking a proportion of slots (30% in the experiments) and training the transformer to reconstruct them, the model adopts the masked‑language‑modeling paradigm of BERT for video slots, learning long‑range temporal dependencies in both the forward and backward directions.
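The masked slot‑modeling idea above can be sketched in a few lines of PyTorch. This is an illustrative simplification, not the paper's code: the module names, dimensions, and the learned mask token are our assumptions; only the overall scheme (mask ~30% of slot tokens, let a bidirectional transformer reconstruct them) follows the description.

```python
import torch
import torch.nn as nn

class MaskedSlotModel(nn.Module):
    """Sketch of BERT-style masked modeling over a flattened slot sequence.

    All names and hyperparameters here are illustrative assumptions.
    """
    def __init__(self, slot_dim=64, n_heads=4, n_layers=2, mask_ratio=0.3):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(slot_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=slot_dim, nhead=n_heads, batch_first=True)
        # No causal mask: every slot attends to slots from all frames,
        # which is what makes the model bidirectional in time.
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, slots):
        # slots: (B, T*K, D) -- K slots per frame, flattened over T frames
        B, L, D = slots.shape
        mask = torch.rand(B, L, device=slots.device) < self.mask_ratio
        corrupted = torch.where(
            mask.unsqueeze(-1), self.mask_token.expand(B, L, D), slots)
        recon = self.transformer(corrupted)
        # Reconstruction loss is computed only on the masked positions.
        loss = ((recon - slots) ** 2)[mask].mean()
        return recon, loss
```

Because reconstruction requires attending to slots both before and after each masked position, minimizing this loss pushes the transformer to model long-range temporal structure in both directions.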
The pipeline begins with a Vision‑Transformer (ViT) backbone that extracts patch embeddings from each frame. These embeddings are grouped by the slot‑attention module into K latent slots (typically 8–12). Slots are initialized recurrently: the slots for frame t are seeded with the final slots from frame t‑1, while the first frame’s slots are sampled from a standard Gaussian. This initialization encourages temporal continuity but alone cannot guarantee consistency over long horizons. The TST module then processes the entire slot sequence, allowing each slot to attend to all others across time. The transformer’s output slots are fed to a decoder that reconstructs the original ViT feature maps; the primary loss is an L2 reconstruction term.
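The recurrent slot initialization described above can be sketched as follows. This is a minimal illustration under assumed shapes (the `slot_attention` callable and the `(T, N, D)` feature layout are our placeholders, not the authors' API):

```python
import torch

def encode_video(frame_feats, slot_attention, num_slots=8, slot_dim=64):
    """Recurrently initialized per-frame slot extraction (illustrative sketch).

    frame_feats: (T, N, D) ViT patch embeddings for T frames.
    slot_attention: callable (features, init_slots) -> refined slots (K, D).
    """
    T = frame_feats.shape[0]
    # First frame: slots are sampled from a standard Gaussian.
    slots = torch.randn(num_slots, slot_dim)
    all_slots = []
    for t in range(T):
        # Later frames are seeded with the previous frame's final slots,
        # encouraging (but not guaranteeing) temporal continuity.
        slots = slot_attention(frame_feats[t], slots)
        all_slots.append(slots)
    return torch.stack(all_slots)  # (T, K, D), then flattened for the TST
```

The TST then operates on the flattened `(T*K, D)` sequence, supplying the long-horizon consistency that this local seeding alone cannot provide.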
A novel slot‑contrastive loss is added to promote orthogonality among slots. The loss penalizes the cosine similarity between distinct slot vectors, effectively pushing them toward a mutually orthogonal basis. This regularization reduces redundancy, prevents multiple slots from collapsing onto the same object, and yields more disentangled representations. The total objective is a weighted sum of reconstruction loss and contrastive loss (λ≈0.1–0.3 in the ablations).
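One plausible form of such an orthogonality-promoting regularizer is the mean squared off-diagonal cosine similarity between slots, sketched below. This is our own minimal formulation consistent with the description, not necessarily the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def slot_orthogonality_loss(slots):
    """Penalize pairwise cosine similarity between distinct slots.

    slots: (B, K, D). Returns a scalar that is 0 when all slots in each
    batch element are mutually orthogonal, and 1 when they are identical.
    (Illustrative formulation, not the paper's exact loss.)
    """
    s = F.normalize(slots, dim=-1)          # unit-norm slot vectors
    sim = torch.bmm(s, s.transpose(1, 2))   # (B, K, K) cosine similarities
    K = slots.shape[1]
    off = sim - torch.eye(K, device=slots.device)  # zero the self-similarity
    return (off ** 2).sum(dim=(1, 2)).mean() / (K * (K - 1))
```

The total objective would then be the reconstruction loss plus λ times this term, with λ in the 0.1-0.3 range reported in the ablations.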
The authors evaluate Slot‑BERT on four real‑world surgical video datasets covering abdominal, cholecystectomy, and thoracic procedures. Metrics include mean Intersection‑over‑Union (mIoU), Adjusted Rand Index (ARI), and a tracking‑focused F1‑Track score. Compared with recent baselines—Slot‑Attention‑Video (2022), STEVE (2023), and Parallel Slot Transformer (2024)—Slot‑BERT achieves absolute mIoU gains of 4.2 %, 3.7 %, and 3.9 % respectively. Temporal IoU for sequences longer than two minutes improves from 0.78 to 0.85, demonstrating superior long‑range coherence. Zero‑shot domain adaptation experiments show less than 5 % performance degradation when a model trained on one surgical specialty is applied to another, highlighting the generality conferred by the contrastive slot regularization.
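For concreteness, mIoU over integer label maps can be computed as below. This is a simplified NumPy illustration using our own `mean_iou` helper: in unsupervised evaluation, predicted slot masks are normally first aligned to ground-truth objects (e.g. via Hungarian matching), a step omitted here.

```python
import numpy as np

def mean_iou(pred, gt, num_objects):
    """Mean IoU between two integer label maps with ids 0..num_objects-1.

    Assumes pred is already aligned to gt's object ids (simplification).
    """
    ious = []
    for k in range(num_objects):
        p, g = pred == k, gt == k
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        if union > 0:  # skip objects absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```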
From a computational standpoint, Slot‑BERT runs at roughly 12 fps on an NVIDIA 1080 Ti while keeping GPU memory under 8 GB, a 50 % reduction relative to fully parallel approaches that require >16 GB. This efficiency makes the method viable for deployment on typical hospital workstations without sacrificing accuracy.
In summary, Slot‑BERT advances surgical video analysis by (1) integrating bidirectional transformer reasoning to capture long‑range temporal context, (2) enforcing slot orthogonality through a dedicated contrastive loss, and (3) leveraging masked self‑supervision to eliminate the need for auxiliary cues such as optical flow or depth maps. The combination yields a model that is both high‑performing and hardware‑friendly, opening avenues for real‑time, explainable AI assistance in the operating room and for future multimodal extensions.