SLIM-Diff: Shared Latent Image-Mask Diffusion with Lp loss for Data-Scarce Epilepsy FLAIR MRI
Focal cortical dysplasia (FCD) lesions in epilepsy FLAIR MRI are subtle and scarce, making joint image–mask generative modeling prone to instability and memorization. We propose SLIM-Diff, a compact joint diffusion model whose main contributions are (i) a single shared-bottleneck U-Net that enforces tight coupling between anatomy and lesion geometry from a 2-channel image+mask representation, and (ii) loss-geometry tuning via a tunable $L_p$ objective. As an internal baseline, we include the canonical DDPM-style objective ($ε$-prediction with $L_2$ loss) and isolate the effect of prediction parameterization and $L_p$ geometry under a matched setup. Experiments show that $x_0$-prediction is consistently the strongest choice for joint synthesis, and that fractional sub-quadratic penalties ($L_{1.5}$) improve image fidelity while $L_2$ better preserves lesion mask morphology. Our code and model weights are available at https://github.com/MarioPasc/slim-diff
💡 Research Summary
This paper introduces SLIM‑Diff, a compact joint diffusion model designed for the synthesis of FLAIR MRI slices and corresponding focal cortical dysplasia (FCD) lesion masks in a data‑scarce setting. The authors identify two major challenges in existing diffusion‑based joint synthesis approaches: (1) excessive model capacity that leads to over‑fitting when only a few annotated cases are available, and (2) the use of a canonical ε‑prediction objective with an L₂ loss, which may be sub‑optimal when lesions occupy a tiny fraction of the image. To address these issues, SLIM‑Diff adopts (i) a single shared‑bottleneck U‑Net that processes a two‑channel tensor (image + mask) jointly, thereby enforcing tight anatomical‑lesion coupling while dramatically reducing the number of trainable parameters (≈26.9 M, far smaller than typical Stable Diffusion backbones), and (ii) a tunable Lₚ loss where the exponent p can be set to 1.5, 2.0, or 2.5. The loss is applied to three different diffusion prediction parameterizations: ε‑prediction (noise), v‑prediction (velocity), and x₀‑prediction (direct reconstruction).
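The interplay between the tunable $L_p$ exponent and the three prediction parameterizations can be sketched as follows. This is a minimal NumPy illustration assuming the standard DDPM forward process $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon$ and the usual $v$-prediction definition; the helper names are hypothetical, not the authors' code:

```python
import numpy as np

def lp_loss(pred, target, p=1.5):
    """Tunable L_p objective: mean of |pred - target|^p (p = 2 recovers MSE)."""
    return np.mean(np.abs(pred - target) ** p)

def diffusion_target(x0, eps, alpha_bar_t, param="x0"):
    """Regression target for each prediction parameterization.

    Assumes the standard DDPM forward process
        x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
    where x0 is the clean 2-channel (image + mask) tensor.
    """
    a = np.sqrt(alpha_bar_t)          # signal scale
    s = np.sqrt(1.0 - alpha_bar_t)    # noise scale
    if param == "eps":                # canonical DDPM: predict the noise
        return eps
    if param == "v":                  # velocity target: v = a*eps - s*x0
        return a * eps - s * x0
    if param == "x0":                 # direct reconstruction of the clean tensor
        return x0
    raise ValueError(f"unknown parameterization: {param}")
```

Under this framing, the paper's grid of experiments amounts to pairing each `param` choice with `p ∈ {1.5, 2.0, 2.5}`, with `param="x0"` reported as the strongest for joint synthesis.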
The dataset consists of 78 FCD II patients from the publicly released Schuch et al. cohort. After standard preprocessing (MNI registration, skull‑stripping, bias correction) the 3D volumes are resampled to 1.25 mm isotropic resolution and sliced axially into 2‑D patches of size 160 × 160. To provide spatial conditioning, the axial position is discretized into 30 bins; together with a binary pathology label this yields 60 distinct conditioning tokens. Both the image and mask channels are normalized to
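The 60-token conditioning described above (30 axial-position bins crossed with a binary pathology label) could be computed as in the sketch below; the specific bin-to-token mapping is a hypothetical encoding chosen for illustration, not taken from the paper:

```python
N_BINS = 30  # axial-position bins (from the paper); 30 bins x 2 labels = 60 tokens

def conditioning_token(z_norm: float, has_lesion: bool) -> int:
    """Map a normalized axial position in [0, 1) and a pathology flag to a token id.

    Hypothetical encoding: token = bin_index * 2 + pathology, giving ids 0..59.
    """
    bin_idx = min(int(z_norm * N_BINS), N_BINS - 1)  # clamp z_norm == 1.0 edge case
    return bin_idx * 2 + int(has_lesion)
```

Each token id would then index a learned embedding that conditions the U-Net, so the model sees both where along the axial axis the slice lies and whether a lesion is present.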