Med-SegLens: Latent-Level Model Diffing for Interpretable Medical Image Segmentation
Modern segmentation models achieve strong predictive performance but remain largely opaque, limiting our ability to diagnose failures, understand dataset shift, or intervene in a principled manner. We introduce Med-SegLens, a model-diffing framework that decomposes segmentation model activations into interpretable latent features using sparse autoencoders trained on SegFormer and U-Net. Through cross-architecture and cross-dataset latent alignment across healthy, adult, pediatric, and sub-Saharan African glioma cohorts, we identify a stable backbone of shared representations, while dataset shift is driven by differential reliance on population-specific latents. We show that these latents act as causal bottlenecks for segmentation failures, and that targeted latent-level interventions can correct errors and improve cross-dataset adaptation without retraining, recovering performance in 70% of failure cases and improving Dice score from 39.4% to 74.2%. Our results demonstrate that latent-level model diffing provides a practical and mechanistic tool for diagnosing failures and mitigating dataset shift in segmentation models.
💡 Research Summary
Med‑SegLens tackles the opacity of state‑of‑the‑art medical image segmentation models by introducing a latent‑level model‑diffing framework that makes internal representations both interpretable and manipulable. The authors first train two popular segmentation backbones—SegFormer (a transformer‑based architecture) and U‑Net (a convolutional architecture)—independently on four distinct brain MRI cohorts: adult glioma, pediatric glioma, sub‑Saharan African (SSA) glioma, and a healthy‑brain dataset (IXI). From each trained model they extract activations at a fixed intermediate layer (chosen because it captures rich semantic information) and train a sparse autoencoder (SAE) on these activations. The SAE uses a BatchTopK operator to enforce an average sparsity of k = 32 latent units per sample, yielding a compact, high‑level description of each activation map.
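The BatchTopK operator described above can be sketched in a few lines: rather than keeping the top-k activations per sample, it keeps the `batch_size × k` largest activations across the whole batch, so sparsity averages k per sample while flexibly allocating more latents to complex inputs. This is a minimal pure-Python illustration of the selection rule, not the paper's implementation:

```python
def batch_topk(acts, k):
    """BatchTopK sparsity: keep the batch_size * k largest activations
    across the entire batch and zero the rest, so each sample is active
    on k latents *on average* rather than exactly k."""
    batch_size = len(acts)
    # Flatten to (value, sample index, latent index) triples.
    flat = [(v, i, j) for i, row in enumerate(acts) for j, v in enumerate(row)]
    top = sorted(flat, key=lambda t: t[0], reverse=True)[: batch_size * k]
    keep = {(i, j) for _, i, j in top}
    return [
        [v if (i, j) in keep else 0.0 for j, v in enumerate(row)]
        for i, row in enumerate(acts)
    ]

# Toy batch of 2 samples x 4 latent pre-activations, k = 2:
acts = [[0.9, 0.1, 0.5, 0.0],
        [0.2, 0.8, 0.3, 0.7]]
sparse = batch_topk(acts, k=2)  # 4 activations survive across the batch
```

Note the key difference from per-sample TopK: a sample with many strong activations can retain more than k latents as long as another sample in the batch retains fewer.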
To compare models across datasets, the authors align the learned latent spaces. For each pair of datasets they compute cosine similarity matrices between the encoder weight vectors and between the decoder weight vectors of the respective SAEs. The Hungarian algorithm is then applied to each similarity matrix to obtain a one‑to‑one matching of latent indices. Latents whose encoder and decoder matches both exceed a similarity threshold (τ = 0.8) are declared “shared”; all others are labeled “dataset‑specific”. This procedure isolates a stable backbone of population‑invariant features (e.g., brain edges, white‑matter vs. gray‑matter contrast) while exposing the set of features that differ systematically between cohorts.
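The matching-and-thresholding step can be illustrated as follows. The sketch below brute-forces the optimal one-to-one assignment over permutations (adequate for a toy matrix; the paper uses the Hungarian algorithm, e.g. `scipy.optimize.linear_sum_assignment`, for full-size latent dictionaries) and, as one reasonable reading of the criterion, declares a latent "shared" when its encoder and decoder matches agree and both similarities exceed τ:

```python
from itertools import permutations

def best_matching(sim):
    """One-to-one assignment maximizing total similarity. Brute force
    over permutations -- a stand-in for the Hungarian algorithm that is
    only feasible for small toy matrices like this one."""
    n = len(sim)
    best_score, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        score = sum(sim[i][perm[i]] for i in range(n))
        if score > best_score:
            best_score, best_perm = score, perm
    return best_perm

def classify_latents(enc_sim, dec_sim, tau=0.8):
    """Label each latent 'shared' if its encoder and decoder matches
    agree and both cosine similarities exceed tau; otherwise
    'dataset-specific'. (The exact agreement rule is an assumption of
    this sketch.)"""
    enc_match = best_matching(enc_sim)
    dec_match = best_matching(dec_sim)
    labels = []
    for i in range(len(enc_sim)):
        shared = (enc_match[i] == dec_match[i]
                  and enc_sim[i][enc_match[i]] > tau
                  and dec_sim[i][dec_match[i]] > tau)
        labels.append("shared" if shared else "dataset-specific")
    return labels

# Toy 3x3 cosine-similarity matrices between two SAEs' weight vectors:
enc_sim = [[0.95, 0.10, 0.20], [0.10, 0.90, 0.30], [0.20, 0.10, 0.50]]
dec_sim = [[0.92, 0.20, 0.10], [0.10, 0.85, 0.20], [0.30, 0.20, 0.40]]
labels = classify_latents(enc_sim, dec_sim, tau=0.8)
```

Requiring both the encoder and decoder matches to clear the threshold guards against latents whose input selectivity aligns across datasets but whose reconstruction direction does not.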
Interpretability is achieved through an automated pipeline (Auto‑Interp). For each latent the top‑K most activating samples are identified, spatial activation heatmaps are projected back to image resolution, and a suite of geometry‑based metrics (brain‑edge ratio, depth, spatial entropy, centroid location) is computed. These quantitative descriptors are mapped to human‑readable semantic tags such as “diffuse edema”, “localized necrotic core”, or “ventricular region”. The pipeline works consistently across both SegFormer and U‑Net, confirming that the same latent index often corresponds to a similar anatomical concept despite architectural differences.
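Two of the geometry metrics named above, centroid location and spatial entropy, are simple to compute from a heatmap, and a thresholded entropy already separates "diffuse" from "localized" activations. The following is a minimal sketch; the threshold value and the two-tag vocabulary are invented for illustration, whereas the full Auto-Interp pipeline combines several metrics into richer anatomical tags:

```python
import math

def heatmap_metrics(heat):
    """Geometry descriptors for a latent's activation heatmap (values >= 0):
    normalized centroid location and spatial entropy in [0, 1], where
    low entropy means focal activation and high entropy means diffuse."""
    h, w = len(heat), len(heat[0])
    total = sum(v for row in heat for v in row)
    cy = sum(i * v for i, row in enumerate(heat) for v in row) / total
    cx = sum(j * v for row in heat for j, v in enumerate(row)) / total
    probs = [v / total for row in heat for v in row if v > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    entropy /= math.log(h * w)  # normalize by max entropy (uniform map)
    return {"centroid": (cy / h, cx / w), "entropy": entropy}

def tag_latent(metrics, diffuse_threshold=0.7):
    """Map metrics to an illustrative tag. The threshold and tag names
    are assumptions of this sketch, not the paper's rule set."""
    return "diffuse" if metrics["entropy"] > diffuse_threshold else "localized"

# A perfectly uniform 4x4 heatmap is maximally diffuse (entropy = 1.0);
# a one-hot heatmap is maximally focal (entropy = 0.0).
uniform = [[1.0] * 4 for _ in range(4)]
focal = [[1.0 if (i, j) == (0, 0) else 0.0 for j in range(4)] for i in range(4)]
```

In the same spirit, a brain-edge ratio would compare activation mass near a brain mask boundary to total mass, and depth would locate the centroid relative to the cortical surface.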
The empirical analysis yields several key insights. First, a substantial proportion of latents (≈35‑60 % depending on the dataset pair) are shared, confirming that segmentation models learn a common visual backbone for brain anatomy. Second, the distribution of latent usage differs markedly between architectures: SegFormer activates tumor‑related latents in ~29 % of cases, whereas U‑Net relies almost exclusively on background latents (~97 %). This explains why transformer‑based models tend to be more robust to tumor variability. Third, dataset‑specific latents act as causal bottlenecks under distribution shift. For example, SSA‑trained models over‑activate latents associated with peripheral edema, leading to systematic under‑segmentation of tumor core when evaluated on adult data.
Crucially, the authors demonstrate that intervening on these bottleneck latents can recover performance without any weight‑level retraining. By zeroing out (ablation) or gently steering (gradient‑based adjustment) the activations of identified dataset‑specific latents, they correct the internal representation and observe a dramatic improvement: 70 % of failure cases are rescued, and the Dice score for the most affected class rises from 39.4 % to 74.2 %. This latent‑level manipulation is performed at inference time, making it a lightweight adaptation strategy suitable for clinical deployment where data distributions evolve (new scanners, demographic shifts, etc.).
The paper also discusses limitations and future directions. Current interventions are manual; an automated policy (e.g., reinforcement learning) to select and adjust latents could further streamline adaptation. Some latents encode mixed anatomical information, making semantic labeling ambiguous. Extending the framework to multimodal inputs (CT, PET) or longitudinal sequences, and exploring latent‑level regularization as a pre‑emptive domain‑generalization technique, are promising avenues.
In summary, Med‑SegLens introduces a novel combination of sparse autoencoding, cross‑dataset latent alignment, and automated semantic grounding to render segmentation models transparent, diagnose failure modes, and mitigate dataset shift through targeted latent‑level interventions. The approach offers a practical, cost‑effective alternative to full model retraining, advancing the reliability and adaptability of AI‑driven medical imaging pipelines.