Are Vision Foundation Models Foundational for Electron Microscopy Image Segmentation?
Although vision foundation models (VFMs) are increasingly reused for biomedical image analysis, it remains unclear whether the latent representations they provide are general enough to support effective transfer and reuse across heterogeneous microscopy image datasets. Here, we study this question for the problem of mitochondria segmentation in electron microscopy (EM) images, using two popular public EM datasets (Lucchi++ and VNC) and three recent representative VFMs (DINOv2, DINOv3, and OpenCLIP). We evaluate two practical model adaptation regimes: a frozen-backbone setting in which only a lightweight segmentation head is trained on top of the VFM, and parameter-efficient fine-tuning (PEFT) via Low-Rank Adaptation (LoRA) in which the VFM is fine-tuned in a targeted manner to a specific dataset. Across all backbones, we observe that training on a single EM dataset yields good segmentation performance (quantified as foreground Intersection-over-Union), and that LoRA consistently improves in-domain performance. In contrast, training on multiple EM datasets leads to severe performance degradation for all models considered, with only marginal gains from PEFT. Exploration of the latent representation space through various techniques (PCA, Fréchet DINOv2 distance, and linear probes) reveals a pronounced and persistent domain mismatch between the two considered EM datasets in spite of their visual similarity, which is consistent with the observed failure of paired training. These results suggest that, while VFMs can deliver competitive results for EM segmentation within a single domain under lightweight adaptation, current PEFT strategies are insufficient to obtain a single robust model across heterogeneous EM datasets without additional domain-alignment mechanisms.
💡 Research Summary
This paper investigates whether modern vision foundation models (VFMs) can serve as truly “foundational” backbones for electron microscopy (EM) image segmentation, focusing on mitochondria segmentation across two widely used public EM datasets: Lucchi++ and VNC. The authors select three representative VFMs—DINOv2, DINOv3, and OpenCLIP—that differ in pre‑training objectives (self‑supervised vision vs. vision‑language) and are publicly available with well‑documented implementations.
Two practical adaptation regimes are examined. In the frozen‑backbone (head‑only) setting, the VFM weights are kept fixed and a lightweight convolutional decoder is trained to map the patch‑token embeddings to dense pixel‑wise predictions. In the parameter‑efficient fine‑tuning (PEFT) regime, Low‑Rank Adaptation (LoRA) modules are injected into the transformer layers, allowing a small set of trainable parameters to adjust the backbone while the decoder architecture remains unchanged. Both regimes share identical training hyper‑parameters: Dice loss, AdamW optimizer (lr = 5 × 10⁻⁵, weight decay = 10⁻⁴), up to 1000 epochs with early stopping, batch size = 2, and no data augmentation (relying on the robustness already learned during large‑scale pre‑training).
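The summary does not include reference code, but the core of the LoRA regime described above can be sketched in a few lines: a frozen linear layer of the transformer is augmented with a trainable low-rank update. The class name `LoRALinear` and the default rank/scaling values below are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B (A x). Only A and B receive gradients."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # backbone weights stay frozen
        # A is small-random, B is zero, so the wrapped layer initially
        # behaves exactly like the frozen original.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

In a ViT backbone, such wrappers would typically replace the attention projection layers, leaving the decoder and all other weights governed by the head-only training setup described above.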
The study is organized around three questions: (1) Can VFMs generalize beyond their natural‑image pre‑training distribution to EM data? (2) Does PEFT improve cross‑dataset generalization? (3) Are VFMs competitive with task‑specific EM segmentation models?
Two training protocols are compared. “Single‑dataset” training uses either Lucchi++ or VNC alone, while “paired‑dataset” training mixes the two datasets without re‑weighting, thereby testing whether a single backbone can handle heterogeneous EM data. For paired training, images are resized to a backbone‑specific resolution by preserving aspect ratio and snapping the longest edge to a multiple of the patch size, avoiding padding artifacts.
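The resizing rule for paired training can be made concrete with a small sketch. Note one caveat: the summary says the longest edge is snapped to a multiple of the patch size, but a ViT requires both edges to be divisible by it, so this sketch rounds both; the function name `snap_resize_dims` and the `target_long` default are illustrative assumptions.

```python
def snap_resize_dims(h: int, w: int, patch: int = 14, target_long: int = 518):
    """Scale (h, w) so the longest edge is near target_long while preserving
    aspect ratio, then round each edge to the nearest multiple of the patch
    size so the ViT can tile the image without padding."""
    scale = target_long / max(h, w)
    nh, nw = h * scale, w * scale
    nh = max(patch, round(nh / patch) * patch)
    nw = max(patch, round(nw / patch) * patch)
    return nh, nw
```

Because both output edges are patch-size multiples, the image maps onto an exact grid of patch tokens and no padding artifacts enter the embeddings.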
Key empirical findings:
- In the single‑dataset regime, all three VFMs achieve high foreground Intersection‑over‑Union (IoU) when only the decoder is trained, confirming that the frozen representations already capture useful EM structures. Adding LoRA consistently improves in‑domain IoU, showing that modest backbone adaptation can refine features for the specific dataset.
- When training on the union of Lucchi++ and VNC, performance collapses for every model. LoRA yields only marginal gains and does not prevent the degradation, indicating that the simple low‑rank adaptation is insufficient to bridge the domain gap.
- Representation‑space diagnostics—principal component analysis (PCA), Fréchet‑DINOv2 distance, and linear probe accuracy—reveal a pronounced separation between the two datasets. Even after LoRA adaptation, the embeddings remain clustered by dataset, confirming a persistent domain mismatch despite visual similarity of the images.
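One of the diagnostics above, the Fréchet distance between embedding distributions, has a standard closed form when each dataset's features are summarized by a Gaussian (the same construction used by FID, here over DINOv2 features). A minimal sketch, with the function name `frechet_distance` as an assumption:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two (N, D) feature arrays:
    ||mu_a - mu_b||^2 + Tr(Sa + Sb - 2 (Sa Sb)^{1/2})."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    sa = np.cov(feats_a, rowvar=False)
    sb = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(sa @ sb)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(sa + sb - 2.0 * covmean))
```

Applied to patch-token features from the two EM datasets, a large value of this distance is one quantitative signature of the dataset-level clustering the authors report.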
The authors also benchmark against non‑foundation EM segmentation baselines (e.g., EM‑DINO, OmniEM, DINOSim) and find that VFM‑based approaches are competitive when evaluated on a single dataset, while the gap widens in the multi‑dataset scenario.
The paper concludes that current VFMs are “foundational” only within a homogeneous EM domain; they do not automatically provide domain‑agnostic representations across heterogeneous EM datasets. LoRA‑style PEFT improves in‑domain performance but fails to resolve inter‑dataset shifts. The authors suggest that future work should incorporate explicit domain‑alignment mechanisms—such as adversarial domain adaptation, style‑transfer preprocessing, or meta‑learning‑based adapters—to achieve a truly universal EM segmentation backbone.
Overall, this work offers a rigorous benchmark, clarifies the limits of VFM transferability in high‑resolution microscopy, and points to concrete research directions for making foundation models genuinely foundational in the electron microscopy community.