- Title: TopoLoRA-SAM: Topology-Aware Parameter-Efficient Adaptation of Foundation Segmenters for Thin-Structure and Cross-Domain Binary Semantic Segmentation
- ArXiv ID: 2601.02273
- Date: 2026-01-05
- Authors: Salim Khazem
📝 Abstract
Foundation segmentation models such as the Segment Anything Model (SAM) exhibit strong zero-shot generalization through large-scale pretraining, but adapting them to domain-specific semantic segmentation remains challenging, particularly for thin structures (e.g., retinal vessels) and noisy modalities (e.g., SAR imagery). Full fine-tuning is computationally expensive and risks catastrophic forgetting. We propose **TopoLoRA-SAM**, a topology-aware and parameter-efficient adaptation framework for binary semantic segmentation. TopoLoRA-SAM injects Low-Rank Adaptation (LoRA) into the frozen ViT encoder, augmented with a lightweight spatial convolutional adapter and optional topology-aware supervision via differentiable clDice. We evaluate our approach on five benchmarks spanning retinal vessel segmentation (DRIVE, STARE, CHASE_DB1), polyp segmentation (Kvasir-SEG), and SAR sea/land segmentation (SL-SSDD), comparing against U-Net, DeepLabV3+, SegFormer, and Mask2Former. TopoLoRA-SAM achieves the best retina-average Dice and the best overall average Dice across datasets, while training only **5.2%** of model parameters ($`\sim`$4.9M). On the challenging CHASE_DB1 dataset, our method substantially improves segmentation accuracy and robustness, demonstrating that topology-aware parameter-efficient adaptation can match or exceed fully fine-tuned specialist models. Code is available at: https://github.com/salimkhazem/Seglab.git
💡 Summary & Analysis
1. **Principled Framework for Adaptation**: The study introduces TopoLoRA-SAM as a principled framework to adapt foundation models like SAM for binary segmentation across diverse domains, ensuring that the model retains its generalizable features while adapting to specific tasks.
2. **Parameter-Efficient Fine-Tuning and Topology Preservation**: This work combines parameter-efficient fine-tuning (PEFT) with topology-preserving supervision, with a particular focus on thin structures. By using LoRA modules and a lightweight spatial adapter, it maintains the connectivity of thin structures during segmentation.
3. **Performance Evaluation Across Datasets**: TopoLoRA-SAM was evaluated on five binary segmentation datasets covering retinal vessels, polyps, and SAR sea/land boundaries. It demonstrated strong performance on CHASE_DB1 while training only 5.2% of the model parameters.
📄 Full Paper Content (ArXiv Source)
Figure 1: TopoLoRA-SAM architecture overview. We freeze the SAM ViT-B image encoder and inject LoRA modules (red) into the feed-forward network (FFN) layers of each transformer block. A lightweight depthwise-separable convolutional adapter (green) refines the high-resolution embedding tensor before mask decoding. Training uses a topology-aware loss combining BCE, Dice, and clDice.
Introduction
Semantic segmentation underpins numerous critical applications in
medical imaging and remote sensing, from automated diabetic retinopathy
screening to maritime surveillance and environmental monitoring. While
recent foundation models such as the Segment Anything Model (SAM) have
achieved remarkable zero-shot generalization through web-scale
pretraining, significant challenges remain when adapting these models to
specialized domains with distinct visual characteristics. Two challenges
are particularly acute in this setting. First, thin and elongated
structures such as retinal vasculature or road networks exhibit
extreme aspect ratios, where small pixel-level errors can disconnect
entire branches; standard region-based losses (e.g., Dice,
cross-entropy) are largely insensitive to such topological violations.
Second, cross-domain visual shift arises in modalities such as SAR
or fundoscopy, whose texture, noise characteristics, and semantics
differ substantially from natural images, limiting the transferability
of web-pretrained representations.
Full fine-tuning of billion-parameter foundation models is
computationally prohibitive and risks catastrophic forgetting of
generalizable representations. Parameter-efficient fine-tuning (PEFT)
methods such as LoRA, adapters, and visual prompt tuning offer
compelling alternatives by updating only a small fraction of parameters
while preserving pretrained knowledge. Recent work has begun exploring
PEFT for SAM adaptation, yet systematic studies investigating
topology-aware adaptation for domain-specific binary segmentation
remain limited.
In this work, we introduce TopoLoRA-SAM, a principled framework that
unifies parameter-efficient adaptation with topology-preserving
supervision for binary semantic segmentation across diverse domains. Our
approach freezes the SAM ViT-B image encoder and injects trainable LoRA
modules into its feed-forward layers, complemented by a lightweight
depthwise-separable convolutional adapter operating on high-resolution
feature maps. This design enables fine-grained adaptation while
preserving the generalizable representations learned during pretraining.
To preserve topological integrity, critical for thin structures, we
incorporate the differentiable clDice loss, which explicitly optimizes
skeleton overlap and encourages connected centerlines. Our contributions
are: (i) a topology-aware, parameter-efficient adaptation of SAM
combining LoRA and a lightweight spatial adapter; (ii) a benchmark
across five binary segmentation datasets (retinal vessels, polyps, SAR
sea/land) with region, boundary, topology, and calibration metrics; and
(iii) strong performance on CHASE_DB1 and best retina-average Dice while
training only 5.2% of model parameters, with a fully reproducible
codebase. This work also leverages ideas of simplified geometric
representations for efficient visual learning and lightweight
adaptation via visual prompting within a foundation-model
segmentation framework.
Related Work
Semantic segmentation architectures.
Fully convolutional networks revolutionized semantic segmentation, with
U-Net establishing the encoder-decoder paradigm with skip connections
that remains foundational in medical imaging. DeepLabV3+ introduced
atrous spatial pyramid pooling for multi-scale context aggregation,
while recent self-configuring approaches like nnU-Net achieve strong
performance through systematic hyperparameter optimization. Prior work
has used these kinds of architectures across different applications. Transformer-based
architectures have emerged as powerful alternatives: SegFormer proposes
an efficient hierarchical design with lightweight MLP decoders, while
Mask2Former unifies instance, semantic, and panoptic segmentation
through masked attention. These architectures build upon foundational
work on vision transformers and hierarchical designs.
Foundation models for segmentation.
The Segment Anything Model represents a paradigm shift, trained on over
1 billion masks to enable promptable zero-shot segmentation. While SAM
excels at interactive mask generation, adapting it to semantic
segmentation, particularly for domain-specific binary masks without
prompts, requires careful consideration of the prompt-mask relationship.
SAM 2 extends this capability to video, demonstrating the scalability
of the prompting paradigm. However, for domains with significant
distribution shift (medical imaging, remote sensing), prompt-free
adaptation with task-specific tuning remains essential.
SAM adaptation for specialized domains.
Several concurrent works address SAM adaptation for medical imaging.
MedSAM fine-tunes SAM on large-scale medical datasets, demonstrating
strong generalization across modalities. SAM-Adapter proposes
domain-specific adapter modules for underperformed scenes, while
SAM-Med2D focuses on 2D medical image segmentation. HQ-SAM introduces
a high-quality output token for improved mask boundaries. PerSAM
enables one-shot personalization through training-free prompt learning.
Our work differs by jointly addressing parameter efficiency (LoRA +
adapters) and topology preservation (clDice) for thin-structure domains
where connectivity is critical.
Parameter-efficient fine-tuning.
PEFT methods enable foundation model adaptation with minimal parameter
overhead. LoRA reparameterizes weight updates as low-rank matrices,
achieving comparable performance to full fine-tuning on language models.
AdaLoRA extends this with adaptive rank allocation, while DoRA
decomposes weights into magnitude and direction for improved learning
dynamics. Adapter modules insert trainable bottleneck layers between
transformer blocks, and AdaptFormer adapts this for vision. Visual
Prompt Tuning (VPT) prepends learnable tokens to input sequences.
Scale-and-shift tuning (SSF) offers an even more parameter-efficient
alternative. These methods share the goal of preserving pretrained
knowledge while enabling task-specific adaptation.
Topology-aware losses and metrics.
Standard overlap metrics (Dice, IoU) are insensitive to topological
errors that break connectivity in thin structures. clDice addresses
this by computing skeleton overlap using differentiable
soft-skeletonization, encouraging connected centerlines in tubular
structures. Complementary approaches include TopoLoss, which uses
persistent homology to penalize Betti number mismatches, and Betti
matching, which aligns persistence barcodes for topologically faithful
predictions. Boundary-focused metrics such as BFScore evaluate contour
precision. For probabilistic outputs, calibration metrics like Expected
Calibration Error (ECE) assess reliability, which is critical for
clinical deployment. Recent work emphasizes the importance of
comprehensive metric evaluation.
Problem Formulation
Given an input image
$`\mathbf{x} \in \mathbb{R}^{H \times W \times 3}`$, we aim to predict a
binary semantic mask $`\hat{\mathbf{y}} \in [0,1]^{H \times W}`$
indicating foreground probability (vessel, polyp, or sea regions).
Unlike SAM’s promptable paradigm that produces instance-agnostic masks
conditioned on user-provided points or boxes, our task requires
semantic predictions without runtime prompts. This necessitates
adapting SAM’s encoder-decoder architecture to directly output semantic
masks through end-to-end training.
SAM Architecture and Prompt-Free Decoding
We adopt the SAM ViT-B architecture , comprising: (i) a Vision
Transformer (ViT) image encoder with 12 transformer blocks, producing
embeddings
$`\mathbf{z} \in \mathbb{R}^{256 \times \frac{H}{16} \times \frac{W}{16}}`$;
(ii) a prompt encoder mapping user interactions to embedding space; and
(iii) a lightweight mask decoder that combines image and prompt
embeddings via cross-attention. For semantic segmentation, we operate in
prompt-free mode by providing null prompt embeddings
($`\varnothing`$), treating the mask decoder as a semantic prediction
head. The decoder outputs a single binary mask, upsampled to the target
resolution via bilinear interpolation.
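As a concrete illustration, the following minimal PyTorch sketch shows how prompt-free decoding can be realized with the public `segment_anything` package; the wrapper class `PromptFreeSAM` and the batch-size-1 assumption are ours for illustration, and the paper's actual decoding code may differ.

```python
import torch
import torch.nn.functional as F
from segment_anything import sam_model_registry


class PromptFreeSAM(torch.nn.Module):
    """Illustrative wrapper: SAM ViT-B used as a prompt-free semantic head."""

    def __init__(self, checkpoint: str):
        super().__init__()
        self.sam = sam_model_registry["vit_b"](checkpoint=checkpoint)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (1, 3, H, W), preprocessed for the SAM image encoder (batch size 1 assumed).
        image_embeddings = self.sam.image_encoder(x)            # (1, 256, H/16, W/16)

        # Null prompts: no points, boxes, or mask hints are provided.
        sparse, dense = self.sam.prompt_encoder(points=None, boxes=None, masks=None)

        low_res_masks, _ = self.sam.mask_decoder(
            image_embeddings=image_embeddings,
            image_pe=self.sam.prompt_encoder.get_dense_pe(),
            sparse_prompt_embeddings=sparse,
            dense_prompt_embeddings=dense,
            multimask_output=False,                              # single binary mask
        )
        # Upsample mask logits to the input resolution via bilinear interpolation.
        return F.interpolate(low_res_masks, size=x.shape[-2:],
                             mode="bilinear", align_corners=False)
```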
Low-Rank Adaptation (LoRA)
To enable efficient adaptation while preserving pretrained
representations, we inject LoRA modules into the frozen image encoder.
For a pretrained linear layer with weight matrix
$`\mathbf{W}_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}`$,
LoRA reparameterizes the forward pass as
$`\mathbf{h} = \mathbf{W}_0\mathbf{x} + \Delta\mathbf{W}\,\mathbf{x} = \mathbf{W}_0\mathbf{x} + \tfrac{\alpha}{r}\,\mathbf{B}\mathbf{A}\,\mathbf{x},`$
where $`\mathbf{A} \in \mathbb{R}^{r \times d_{\text{in}}}`$ and
$`\mathbf{B} \in \mathbb{R}^{d_{\text{out}} \times r}`$ are trainable
low-rank factors with rank
$`r \ll \min(d_{\text{in}}, d_{\text{out}})`$. We initialize
$`\mathbf{A}`$ with Kaiming uniform and $`\mathbf{B}`$ with zeros,
ensuring $`\Delta\mathbf{W} = \mathbf{0}`$ at initialization for stable
training. A scaling factor $`\alpha/r`$ modulates the magnitude of
low-rank updates.
We target the feed-forward network (FFN) layers within each
transformer block, specifically the mlp.lin1 and mlp.lin2
projections. This design choice follows the intuition that FFN layers
capture task-specific feature transformations, while attention weights
encode more transferable relational patterns. With $`r=16`$ and LoRA
applied to all 12 blocks, we add approximately 2.4M trainable parameters
to the encoder.
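A minimal sketch of this LoRA injection is shown below, assuming the reference `segment_anything` ViT encoder layout (`blocks[i].mlp.lin1` / `lin2`); the scaling constant `alpha` is illustrative, since the paper reports only the rank.

```python
import math
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update W0 x + (alpha/r) B A x."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # keep pretrained weights frozen
        self.A = nn.Parameter(torch.empty(r, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))  # A: Kaiming uniform, B: zeros
        self.scale = alpha / r                            # alpha/r scaling of the update

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Delta W = B A is zero at initialization, so training starts from the frozen model.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())


def inject_lora_into_ffn(image_encoder: nn.Module, r: int = 16) -> None:
    """Wrap mlp.lin1 / mlp.lin2 of every transformer block with LoRA layers."""
    for blk in image_encoder.blocks:
        blk.mlp.lin1 = LoRALinear(blk.mlp.lin1, r=r)
        blk.mlp.lin2 = LoRALinear(blk.mlp.lin2, r=r)
```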
Spatial Convolutional Adapter
While LoRA enables semantic adaptation of the encoder’s representational
capacity, thin structures require fine-grained spatial reasoning at
higher resolutions than SAM’s $`16\times`$ downsampled embeddings. We
introduce a lightweight depthwise-separable convolutional adapter
applied to the image embedding tensor $`\mathbf{z}`$:
$`\mathbf{z}' = \mathbf{z} + \mathrm{Conv}_{1\times 1}\big(\mathrm{ReLU}(\mathrm{DWConv}_{3\times 3}(\mathbf{z}))\big).`$
The adapter comprises a $`3\times3`$ depthwise convolution (256 groups)
followed by ReLU activation and $`1\times1`$ pointwise projection,
wrapped in a residual connection. This adds only $`\sim`$66K
parameters while enabling local spatial refinement crucial for
thin-structure boundary preservation.
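A sketch of the adapter under these assumptions (256-channel embedding, bias terms included) follows; with 256 channels the two convolutions contribute roughly 66-68K parameters depending on biases, consistent with the stated budget.

```python
import torch
import torch.nn as nn


class SpatialAdapter(nn.Module):
    """Depthwise-separable residual adapter on the 256-channel SAM embedding (a sketch)."""

    def __init__(self, channels: int = 256):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels)       # 3x3 depthwise (256 groups)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)  # 1x1 pointwise projection
        self.act = nn.ReLU(inplace=True)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Residual refinement of the embedding tensor z: (B, 256, H/16, W/16).
        return z + self.pointwise(self.act(self.depthwise(z)))
```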
Topology-Aware Training Objective
Our training objective combines multiple complementary loss terms:
$`\mathcal{L} = \mathcal{L}_{\text{BCE}} + \mathcal{L}_{\text{Dice}} + \lambda_{\text{cl}}\,\mathcal{L}_{\text{clDice}}, \qquad \mathcal{L}_{\text{clDice}} = 1 - \frac{2\,\text{Tprec}\cdot\text{Tsens}}{\text{Tprec} + \text{Tsens}},`$
where $`\text{Tprec}`$ measures how much of the predicted skeleton
overlaps with the ground-truth mask, and $`\text{Tsens}`$ measures
coverage of the GT skeleton. Soft skeletonization is implemented via
iterated morphological operations (min-pooling) applied to the soft
prediction map.
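A sketch of the soft skeletonization and clDice loss, following the public clDice formulation (the iteration count and pooling kernel below are assumptions, not values reported in the paper):

```python
import torch
import torch.nn.functional as F


def soft_erode(x: torch.Tensor) -> torch.Tensor:
    # Soft erosion via min-pooling, implemented as -maxpool(-x).
    return -F.max_pool2d(-x, kernel_size=3, stride=1, padding=1)


def soft_dilate(x: torch.Tensor) -> torch.Tensor:
    return F.max_pool2d(x, kernel_size=3, stride=1, padding=1)


def soft_open(x: torch.Tensor) -> torch.Tensor:
    return soft_dilate(soft_erode(x))


def soft_skeleton(x: torch.Tensor, iters: int = 10) -> torch.Tensor:
    """Iterated morphological soft-skeletonization of a soft mask in [0, 1]."""
    skel = F.relu(x - soft_open(x))
    for _ in range(iters):
        x = soft_erode(x)
        delta = F.relu(x - soft_open(x))
        skel = skel + F.relu(delta - skel * delta)
    return skel


def soft_cldice_loss(pred: torch.Tensor, gt: torch.Tensor,
                     iters: int = 10, eps: float = 1e-6) -> torch.Tensor:
    """1 - clDice: harmonic mean of topology precision (Tprec) and sensitivity (Tsens)."""
    skel_pred, skel_gt = soft_skeleton(pred, iters), soft_skeleton(gt, iters)
    tprec = (skel_pred * gt).sum() / (skel_pred.sum() + eps)  # predicted skeleton inside GT mask
    tsens = (skel_gt * pred).sum() / (skel_gt.sum() + eps)    # GT skeleton covered by prediction
    return 1.0 - 2.0 * tprec * tsens / (tprec + tsens + eps)
```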
| Model | Total (M) | Trainable (M) | Trainable % |
|---|---|---|---|
| U-Net (ResNet34) | 24.4 | 24.4 | 100.0% |
| DeepLabV3+ (ResNet50) | 39.8 | 39.8 | 100.0% |
| SegFormer (MiT-B0) | 3.7 | 3.7 | 100.0% |
| Mask2Former (Swin-T) | 47.4 | 47.4 | 100.0% |
| TopoLoRA-SAM (ours) | 93.7 | 4.9 | 5.2% |

Table 1: Parameter efficiency comparison. TopoLoRA-SAM trains only 5.2% of
parameters while achieving competitive or superior performance.
Implementation Details
We use SAM ViT-B pretrained weights and freeze all encoder parameters
except LoRA modules. The mask decoder remains trainable
($`\sim`$2.4M parameters). Total trainable parameters:
$`\sim`$4.9M (LoRA: 2.4M, adapter: 66K, decoder: 2.4M), compared
to 93.7M total. Table 1 summarizes parameter efficiency.
Training uses AdamW with learning rate $`10^{-4}`$, cosine decay, and 50
epochs.
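A quick way to verify the stated budget on an assembled model (here `model` is assumed to be the full TopoLoRA-SAM module with everything except LoRA, adapter, and decoder frozen):

```python
# Sanity check of the trainable-parameter budget on an assembled model.
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable: {trainable / 1e6:.1f}M / {total / 1e6:.1f}M "
      f"({100.0 * trainable / total:.1f}%)")   # expected: ~4.9M / 93.7M (~5.2%)
```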
Experimental Setup
Datasets
We evaluate on five binary segmentation benchmarks spanning three
distinct application domains:
Retinal vessel segmentation.
DRIVE contains 40 color fundus images (584$`\times`$565) with
manual vessel annotations from expert ophthalmologists, split into 20
train and 20 test images. STARE provides 20 images
(700$`\times`$605) with vessels and pathological abnormalities.
CHASE_DB1 comprises 28 high-resolution images
(999$`\times`$960) from a pediatric population study. These
datasets exemplify thin-structure segmentation where topological
connectivity is clinically significant for vascular analysis.
Polyp segmentation.
Kvasir-SEG contains 1,000 colonoscopy images with corresponding polyp
masks, representing an important task for early colorectal cancer
detection. Images vary in resolution and polyp appearance.
SAR sea/land segmentation.
SL-SSDD provides synthetic aperture radar (SAR) imagery for maritime
boundary delineation, challenging due to speckle noise, low contrast,
and significant distribution shift from natural images.
Preprocessing and Augmentations
All datasets are standardized to RGB format with ImageNet normalization
($`\mu=[0.485, 0.456, 0.406]`$, $`\sigma=[0.229, 0.224, 0.225]`$). We
use dataset-specific resolutions: 384$`\times`$384 for retinal
datasets, 512$`\times`$512 for Kvasir-SEG and SL-SSDD. Training
augmentations include random horizontal/vertical flips, random scaling
(0.5–1.5$`\times`$), and center crops with reflection padding. Color
jitter is disabled for SAR imagery to preserve modality-specific
characteristics.
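The paper does not name an augmentation library; one plausible realization of the described pipeline for the retinal datasets (384$`\times`$384), using albumentations, is sketched below. The ordering of the transforms is an assumption.

```python
import cv2
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Plausible training pipeline: flips, random scaling in [0.5, 1.5],
# reflection padding + center crop, ImageNet normalization.
train_tf = A.Compose([
    A.Resize(height=384, width=384),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomScale(scale_limit=0.5, p=1.0),                  # scale factor in [0.5, 1.5]
    A.PadIfNeeded(min_height=384, min_width=384,
                  border_mode=cv2.BORDER_REFLECT),           # reflection padding
    A.CenterCrop(height=384, width=384),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ToTensorV2(),
])

# Image and mask are augmented jointly:
# out = train_tf(image=image_np, mask=mask_np)
# x, y = out["image"], out["mask"]
```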
Baseline Architectures
We compare five segmentation model families spanning convolutional,
transformer-based, and foundation-model paradigms. As convolutional
baselines, we use U-Net with a ResNet34 encoder and DeepLabV3+ with a
ResNet50 encoder. For transformer-based segmentation, we evaluate
SegFormer (MiT-B0) and Mask2Former with a Swin-T backbone . Finally,
we include our TopoLoRA-SAM model, which adapts SAM ViT-B using
Low-Rank Adaptation, a lightweight spatial adapter, and optional
topology-aware clDice regularization. All models use ImageNet-pretrained
backbones (or SAM-pretrained weights) and are trained with an identical
BCE+Dice objective to ensure fair comparison.
Training Configuration
All experiments follow a unified training protocol. We use the AdamW
optimizer with $`\beta_1=0.9`$, $`\beta_2=0.999`$, and a weight decay of
$`10^{-4}`$. The learning rate is set to $`10^{-4}`$ and scheduled using
cosine annealing down to a minimum value of $`10^{-6}`$. All models are
trained for 50 epochs. A batch size of 4 is used for all baseline
models, while SAM-based models are trained with a batch size of 1 and
gradient accumulation over 4 steps to maintain a comparable effective
batch size. Training is performed using mixed-precision (FP16) via
PyTorch Automatic Mixed Precision (AMP). For robustness, all reported
results are averaged over three random seeds (0, 1, and 2), and we
report the mean and standard deviation.
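A condensed sketch of this protocol in PyTorch (with `model`, `train_loader`, and a BCE+Dice `criterion` assumed to be defined elsewhere):

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# AdamW + cosine annealing + AMP + 4-step gradient accumulation (SAM-based setting).
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-6)
scaler = GradScaler()
accum_steps = 4

for epoch in range(50):
    for step, (images, masks) in enumerate(train_loader):
        with autocast():                              # mixed-precision forward pass
            loss = criterion(model(images), masks) / accum_steps
        scaler.scale(loss).backward()
        if (step + 1) % accum_steps == 0:             # effective batch size = 1 x 4
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
    scheduler.step()
```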
Evaluation Metrics
We use a comprehensive evaluation suite covering region overlap,
boundary accuracy, topology preservation, and calibration. Region-level
performance is measured using the Dice coefficient (F1-score) and
Intersection-over-Union (IoU), with precision and recall reported to
characterize false positive and false negative behavior. Boundary
quality is evaluated using the BFScore, which computes an F1-score on
contour pixels within a tolerance margin. To assess topology
preservation for thin structures, we report centerline Dice (clDice) on
retinal datasets, capturing connectivity via skeleton overlap. Finally,
probabilistic calibration is measured using Expected Calibration Error
(ECE), which quantifies the alignment between prediction confidence and
empirical accuracy; lower values indicate better-calibrated models, an
important consideration for clinical decision support.
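For reference, one plausible pixel-wise ECE implementation for binary masks (the paper does not specify its binning scheme):

```python
import torch


def expected_calibration_error(probs: torch.Tensor, targets: torch.Tensor,
                               n_bins: int = 10) -> torch.Tensor:
    """Pixel-wise ECE for binary segmentation (one plausible formulation).

    probs:   foreground probabilities in [0, 1], any shape.
    targets: binary ground truth of the same shape.
    """
    probs, targets = probs.flatten(), targets.flatten().float()
    confidences = torch.where(probs >= 0.5, probs, 1.0 - probs)  # confidence of predicted class
    correct = ((probs >= 0.5).float() == targets).float()
    bins = torch.linspace(0.0, 1.0, n_bins + 1, device=probs.device)
    ece = torch.zeros((), device=probs.device)
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            frac = in_bin.float().mean()                          # fraction of pixels in the bin
            ece = ece + frac * (correct[in_bin].mean() - confidences[in_bin].mean()).abs()
    return ece
```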
Computational Resources
All experiments run on a single NVIDIA RTX A6000 Ada (50 GB) GPU per
training job, with multi-GPU parallelism used only to run independent
experiments concurrently. The complete benchmark (5 datasets $`\times`$
5 models $`\times`$ 3 seeds + ablations + cross-dataset) requires
approximately 29.7 GPU-hours, dominated by Mask2Former and SAM runs
due to larger memory footprint and longer convergence.
Results
Main Benchmark Results
Table [tab:main-dice] reports Dice scores
across all datasets and model families. Overall, TopoLoRA-SAM achieves
the best average Dice (0.735), outperforming the second-best
Mask2Former (0.709) while training only 4.9M parameters, compared to
47.4M for Mask2Former. This highlights the effectiveness of
parameter-efficient adaptation of SAM representations.
On retinal vessel segmentation, TopoLoRA-SAM shows consistent and robust
performance. It achieves the best Dice on CHASE_DB1 (0.569) with
substantially lower variance across seeds, corresponding to a +8.4
Dice improvement over Mask2Former. On DRIVE and STARE, TopoLoRA-SAM
remains competitive with transformer-based baselines and achieves the
best retina-average Dice (0.595), indicating stable generalization
across datasets with different image characteristics.
Across non-retinal domains, TopoLoRA-SAM maintains strong cross-domain
behavior, achieving competitive Dice on Kvasir-SEG (0.900) and matching
state-of-the-art performance on SL-SSDD (0.993), despite the significant
distribution shift from natural images. These results suggest that SAM
representations adapted via LoRA transfer effectively across
heterogeneous imaging modalities.
Topology Preservation
Figure 2: Thin-structure performance summary across retinal datasets (DRIVE, STARE, CHASE_DB1). TopoLoRA-SAM achieves top-tier performance on Dice, clDice, and BFScore while training 10× fewer parameters.

Figure 3: Qualitative comparison across all benchmark datasets. Each row shows a representative sample from one dataset: retinal vessel segmentation (DRIVE, STARE, CHASE_DB1), polyp segmentation (Kvasir-SEG), and SAR sea/land segmentation (SL-SSDD). Columns display input images, ground truth masks, and predictions from five methods. TopoLoRA-SAM (Ours) demonstrates superior preservation of thin vessel structures and fine-grained boundaries compared to baselines, particularly visible in the retinal datasets where thin peripheral vessels are better captured. On CHASE_DB1, our method maintains vessel connectivity while other methods show fragmentation. Best viewed zoomed in.
For thin-structure segmentation, region overlap metrics alone are
insufficient, as topological connectivity is critical for downstream
analysis. Figure 2 summarizes clDice and BFScore
on retinal datasets. TopoLoRA-SAM achieves the best clDice on DRIVE
(0.678) and CHASE_DB1 (0.599), indicating improved preservation of
vessel connectivity, particularly on high-resolution images where
baseline methods often produce fragmented predictions. Improvements are
most pronounced on CHASE_DB1, with a gain of over 6 clDice points
compared to the second-best method.
Boundary quality follows a similar trend. TopoLoRA-SAM achieves the
best retina-average BFScore (0.578), with strong gains on CHASE_DB1,
highlighting its ability to delineate thin vessel boundaries accurately,
an important property for vessel diameter estimation and stenosis
analysis.
Parameter Efficiency Analysis
Figure 4: Parameter efficiency trade-off on retinal datasets. TopoLoRA-SAM achieves Pareto-optimal performance: competitive Dice with 10× fewer trainable parameters than Mask2Former.
Figure 4 visualizes the fundamental trade-off between model capacity and
performance on retinal datasets. Key observations:
- TopoLoRA-SAM achieves Pareto-optimal performance, offering the best Dice among methods with $`<`$10M trainable parameters.
- Compared to fully-tuned Mask2Former (47.4M params), TopoLoRA-SAM (4.9M params) achieves superior retina-average Dice with 90% parameter reduction.
- The lightweight SegFormer (3.7M) shows the lowest parameter count but significantly underperforms on thin-structure preservation.
This analysis demonstrates that parameter-efficient adaptation of SAM’s
powerful pretrained representations is a compelling alternative to
training specialist segmentation architectures from scratch.
Calibration Analysis
Reliable confidence estimates are essential for risk-aware decision
support. Across all datasets, TopoLoRA-SAM demonstrates strong
calibration, achieving the lowest Expected Calibration Error (ECE) on
CHASE_DB1 (0.045) and SL-SSDD (0.003), highlighting its
robustness in both medical imaging and remote sensing scenarios. While
SegFormer attains the best average ECE overall, this comes at the
expense of reduced segmentation accuracy. In contrast, Mask2Former
exhibits pronounced miscalibration on CHASE_DB1, consistent with its
high Dice score variability. Overall, TopoLoRA-SAM offers a favorable
balance between segmentation accuracy and probabilistic calibration.
Component Contributions
LoRA adaptation is the primary driver.
Adding LoRA modules to the frozen encoder provides the largest single
improvement across all metrics: +4.2 Dice, +5.1 clDice, and +8.3
BFScore points over decoder-only tuning. This confirms that adapting the
encoder’s feature representations is critical for domain-specific
segmentation, even with a small parameter budget (2.4M additional
parameters).
Spatial adapter provides boundary refinement.
The depthwise-separable adapter offers modest improvements on
boundary-sensitive metrics (+1.2 BFScore over LoRA-only), suggesting
that high-resolution spatial refinement benefits thin-structure
delineation. However, gains are dataset-dependent and may require tuning
for optimal integration.
Topology regularization benefits connectivity.
Adding clDice loss improves skeleton-based metrics (clDice) by +0.8
points on average, with the effect most pronounced on images with
complex vessel branching. However, the improvement is sensitive to the
regularization weight $`\lambda_{\text{cl}}`$: too high values can
destabilize training, while too low values provide negligible benefit.
LoRA Rank Sensitivity
We ablate the LoRA rank $`r \in \{4, 8, 16, 32\}`$ on DRIVE (fixing
other components):
| Rank | Params (M) | Dice | clDice |
|---|---|---|---|
| $`r=4`$ | 0.6 | 0.632$`\pm`$0.021 | 0.659$`\pm`$0.024 |
| $`r=8`$ | 1.2 | 0.641$`\pm`$0.019 | 0.668$`\pm`$0.022 |
| $`r=16`$ | 2.4 | 0.690$`\pm`$0.018 | 0.678$`\pm`$0.021 |
| $`r=32`$ | 4.8 | 0.648$`\pm`$0.020 | 0.675$`\pm`$0.023 |
Rank $`r=16`$ offers the best balance of capacity and efficiency. Higher
ranks ($`r=32`$) provide diminishing returns while doubling parameter
count, consistent with observations from LoRA in language models.
Loss Weight Sensitivity
We vary the clDice weight
$`\lambda_{\text{cl}} \in \{0.0, 0.25, 0.5, 1.0, 2.0\}`$:
| $`\lambda_{\text{cl}}`$ | Dice | clDice | ECE |
|---|---|---|---|
| 0.0 | 0.648$`\pm`$0.019 | 0.671$`\pm`$0.022 | 0.041$`\pm`$0.008 |
| 0.25 | 0.649$`\pm`$0.018 | 0.675$`\pm`$0.021 | 0.042$`\pm`$0.009 |
| 0.5 | 0.690$`\pm`$0.018 | 0.678$`\pm`$0.021 | 0.043$`\pm`$0.009 |
| 1.0 | 0.647$`\pm`$0.020 | 0.676$`\pm`$0.023 | 0.046$`\pm`$0.010 |
| 2.0 | 0.638$`\pm`$0.024 | 0.670$`\pm`$0.027 | 0.052$`\pm`$0.012 |
Moderate topology weighting ($`\lambda_{\text{cl}}=0.5`$) achieves
optimal balance. Excessive weighting ($`\lambda_{\text{cl}} \geq 2.0`$)
degrades both region and topology metrics, likely due to gradient
conflicts between objectives.
Our ablation results suggest several practical defaults. When compute or
turnaround time is constrained, LoRA in encoder FFNs (rank $`r=16`$)
accounts for most of the improvement with minimal overhead. In
applications where connectivity is a primary objective, clDice
regularization with $`\lambda_{\text{cl}}=0.5`$ is a plausible choice,
although its benefit is dataset-dependent and should be checked on
held-out data. For tasks requiring precise boundaries (e.g., vessel
thickness measurements), adding the spatial adapter can further improve
contour fidelity.
Conclusion, Limitations, and Broader Impact
We introduced TopoLoRA-SAM, a topology-aware and parameter-efficient
adaptation framework for the Segment Anything Model (SAM) targeting
binary semantic segmentation across diverse domains. Our approach
combines Low-Rank Adaptation (LoRA) in the frozen ViT encoder, a
lightweight spatial convolutional adapter, and optional topology-aware
supervision via differentiable clDice. Across five benchmarks spanning
retinal vessels, polyp segmentation, and SAR imagery, TopoLoRA-SAM
achieves the best retina-average Dice and the best overall average
Dice among evaluated methods, while training only 5.2% of model
parameters. Notably, on the challenging CHASE_DB1 dataset, our method
improves Dice by up to +8.4 points with substantially lower
variance, highlighting the robustness of parameter-efficient adaptation.
Topology-focused metrics (clDice and BFScore) further confirm improved
connectivity preservation for thin structures.
Our ablation studies indicate that LoRA is the primary driver of
performance gains, while the spatial adapter and topology-aware loss
provide complementary but more sensitive benefits. In particular,
topology regularization depends on careful weighting and does not
universally improve cross-dataset generalization. Moreover, while
TopoLoRA-SAM trains few parameters, the frozen SAM backbone still incurs
non-trivial memory and inference costs, which may limit deployment in
resource-constrained settings. This work is intended as a research
contribution and does not constitute a validated clinical or
operational system. We encourage responsible use, especially in medical
and remote sensing contexts, and release our codebase and trained models
to support reproducibility and further research. Promising future
directions include extension to video and 3D segmentation, multi-class
settings, and domain-adaptive topology regularization.