Perspective-Equivariant Fine-tuning for Multispectral Demosaicing without Ground Truth
Multispectral demosaicing is crucial for reconstructing full-resolution spectral images from snapshot mosaiced measurements, enabling real-time imaging in applications from neurosurgery to autonomous driving. Classical methods produce blurry results, while supervised learning requires costly ground truth (GT) obtained from slow line-scanning systems. We propose Perspective-Equivariant Fine-tuning for Demosaicing (PEFD), a framework that learns multispectral demosaicing from mosaiced measurements alone. PEFD a) exploits the projective geometry of camera-based imaging systems, leveraging a richer group structure than previous demosaicing methods to recover more null-space information, and b) learns efficiently without GT by adapting pretrained foundation models designed for 1-3 channel imaging. On intraoperative and automotive datasets, PEFD recovers fine details such as blood vessels, preserves spectral fidelity, and substantially outperforms recent approaches, approaching supervised performance.
💡 Research Summary
The paper introduces Perspective‑Equivariant Fine‑tuning for Demosaicing (PEFD), a novel framework that enables high‑quality multispectral image reconstruction using only the raw mosaic measurements from snapshot cameras, without any ground‑truth (GT) data. Traditional demosaicing approaches—simple bilinear or Gaussian interpolation, total‑variation regularisation, or other variational methods—are fast but produce blurry results and suffer from spectral artifacts. Supervised deep‑learning methods achieve state‑of‑the‑art performance but require large paired datasets of high‑resolution multispectral images, which are typically obtained with expensive line‑scanning systems that are incompatible with real‑time clinical or automotive applications. This creates a “chicken‑and‑egg” problem: without GT one cannot train a powerful model, yet without a trained model one cannot obtain reliable GT for new domains.
PEFD resolves this dilemma by combining two complementary ideas. First, it exploits the projective geometry of camera‑based imaging systems. When a camera rotates or translates, images of the same scene are related by homographies (projective transformations). The set of all such transformations forms a rich group G that includes pixel shifts, rotations, scaling, and full perspective warping. The authors assume that the unknown set of multispectral images X is invariant under G, i.e., applying any homography to a valid image yields another valid image of the same scene. Although the mosaicing operator A (which selects one spectral band per pixel) is not equivariant to G, one can construct “virtual” forward operators A_g = A T_g for each sampled transformation g ∈ G, where T_g warps an image by the homography g. By enforcing consistency between the reconstruction of the original measurement and the reconstruction of the virtual measurement, a loss term
L_eq = ‖T_g f_θ(y) − f_θ(A_g f_θ(y))‖²
is introduced. This perspective‑equivariant loss forces the reconstruction to be stable under perspective changes, thereby pulling information from the null‑space of A that pure measurement‑consistency (‖A f_θ(y) − y‖²) cannot recover.
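To make the virtual-operator idea concrete, here is a toy NumPy sketch (illustrative only: the 2×2 mosaic pattern, image size, and the choice of a one-pixel shift as the homography are our assumptions, not the paper's setup). It builds a mosaic operator A, composes it with a shift warp T_g so that A_g x = A(T_g(x)) as in the loss above, and checks that the composed operator observes band values at pixel locations that A itself never measures, which is precisely the null-space information the equivariance loss can exploit.

```python
import numpy as np

C, H, W = 4, 8, 8                       # toy setup: 4 bands on a 2x2 mosaic
mask = np.zeros((C, H, W))
for c in range(C):                      # each pixel measures exactly one band
    mask[c, c // 2::2, c % 2::2] = 1.0

A   = lambda x: (mask * x).sum(axis=0)               # mosaicing operator
T_g = lambda x: np.roll(x, (1, 0), axis=(-2, -1))    # 1-pixel shift, a homography
A_g = lambda x: A(T_g(x))                            # virtual operator: A after T_g

# Sanity check: every pixel records exactly one band.
assert np.all(mask.sum(axis=0) == 1.0)

# Which pixels of band 0 influence the measurement? For A it is the mosaic
# sites of band 0; for A_g it is those sites warped back by the inverse shift,
# i.e. a *different* set of pixels -- new (null-space) information.
sampled_A  = mask[0]
sampled_Ag = np.roll(mask[0], (-1, 0), axis=(0, 1))
assert not np.array_equal(sampled_A, sampled_Ag)
```

Because each A_g samples every band at warped locations, enforcing consistency across many g ∈ G effectively interleaves many complementary mosaics of the same scene.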
Second, PEFD leverages a large‑scale foundation model for image restoration, the Reconstruct Anything Model (RAM), which was pre‑trained on a diverse set of tasks (inpainting, deblurring, denoising, super‑resolution) across grayscale, complex‑valued, and RGB data. RAM’s 32 M‑parameter encoder‑decoder backbone already encodes powerful, domain‑agnostic features. To adapt it to multispectral demosaicing, the backbone is frozen, while the channel‑specific heads and tails are replicated C times (C = number of spectral bands). This parameter‑efficient fine‑tuning preserves the backbone’s learned inductive bias, avoids over‑fitting on the limited self‑supervised data, and dramatically reduces training time.
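The freeze-and-replicate adaptation can be sketched in PyTorch as follows. The tiny backbone below is a stand-in, not RAM's actual architecture; the module names (`head`, `backbone`, `tail`) and layer sizes are our assumptions for illustration.

```python
import copy
import torch
import torch.nn as nn

C = 8  # number of spectral bands

class TinyRestorer(nn.Module):
    """Stand-in for a pretrained single-channel restoration model (not RAM)."""
    def __init__(self):
        super().__init__()
        self.head = nn.Conv2d(1, 16, 3, padding=1)       # channel-specific entry
        self.backbone = nn.Sequential(                   # shared pretrained core
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
        self.tail = nn.Conv2d(16, 1, 3, padding=1)       # channel-specific exit

pre = TinyRestorer()                      # pretend: pretrained weights loaded here

for p in pre.backbone.parameters():       # freeze the shared backbone
    p.requires_grad_(False)

# Replicate the 1-channel head/tail once per spectral band.
heads = nn.ModuleList(copy.deepcopy(pre.head) for _ in range(C))
tails = nn.ModuleList(copy.deepcopy(pre.tail) for _ in range(C))

def f_theta(y):
    """Run each band through its own head/tail around the frozen backbone."""
    outs = [tails[c](pre.backbone(heads[c](y[:, c:c + 1]))) for c in range(C)]
    return torch.cat(outs, dim=1)

x_hat = f_theta(torch.randn(1, C, 16, 16))   # (1, C, 16, 16) reconstruction
```

Only the replicated heads and tails receive gradients, which keeps the trainable parameter count small relative to the frozen backbone and preserves its pretrained inductive bias.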
The overall training pipeline proceeds as follows: (1) feed the raw mosaic y into the frozen‑backbone + replicated‑head network f_θ to obtain a reconstruction x̂ = f_θ(y); (2) randomly sample a homography g from G and compute the transformed reconstruction T_g x̂ and the virtual measurement A_g x̂; (3) compute the total loss L = L_MC + α L_eq, where L_MC = ‖A f_θ(y) − y‖² is the standard measurement‑consistency term and α balances the two terms (empirically set between 0.1 and 1.0); (4) back‑propagate only through the unfrozen head/tail parameters. No GT images are ever required.
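The four steps above can be condensed into a self-supervised training loop. This is a toy sketch: the one-layer network, 2×2 mosaic pattern, shift homography, and α = 0.5 (within the reported 0.1–1.0 range) are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C, H, W, alpha = 4, 16, 16, 0.5            # alpha balances the two loss terms

mask = torch.zeros(C, H, W)                # 2x2 mosaic: one band kept per pixel
for c in range(C):
    mask[c, c // 2::2, c % 2::2] = 1.0

A   = lambda x: (mask * x).sum(dim=-3, keepdim=True)         # (.,C,H,W)->(.,1,H,W)
T_g = lambda x: torch.roll(x, shifts=(1, 2), dims=(-2, -1))  # shift homography

f_theta = nn.Conv2d(1, C, 3, padding=1)    # toy trainable reconstruction network
opt = torch.optim.Adam(f_theta.parameters(), lr=1e-3)

y = A(torch.rand(1, C, H, W))              # raw mosaic measurement; no GT anywhere
for _ in range(5):
    x_hat = f_theta(y)                                       # step (1)
    x_g   = T_g(x_hat)                                       # step (2): warp
    L_mc  = ((A(x_hat) - y) ** 2).mean()                     # measurement consistency
    L_eq  = ((x_g - f_theta(A(x_g))) ** 2).mean()            # equivariance loss
    loss  = L_mc + alpha * L_eq                              # step (3)
    opt.zero_grad(); loss.backward(); opt.step()             # step (4)
```

In practice the warp would be a randomly sampled homography rather than a fixed shift, and only the head/tail parameters would be optimized, as described above.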
Experiments were conducted on two real‑world datasets. The first consists of intra‑operative brain surgery multispectral recordings (8 bands covering 400–700 nm) captured with a snapshot camera; the second comprises automotive forward‑looking scenes captured with a 12‑band snapshot sensor mounted on a vehicle. Quantitative metrics include PSNR, SSIM, and Spectral Angle Mapper (SAM). PEFD consistently outperforms classical weighted bilinear demosaicing, TV‑based variational methods, and recent self‑supervised approaches (e.g., SDNet with shift‑equivariance, DnCNN with rotation‑equivariance). Across both datasets, PEFD achieves on average >3 dB higher PSNR, 0.02–0.04 higher SSIM, and SAM errors below 1.5°, indicating both spatial sharpness and spectral fidelity. Visual inspection shows that fine vascular structures and subtle material boundaries, which appear blurred or spectrally distorted in baselines, are recovered with crisp edges and accurate colour rendition.
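For reference, the Spectral Angle Mapper used above measures the angle between the reconstructed and reference spectrum at each pixel, so it is insensitive to brightness scaling and isolates spectral fidelity. A minimal NumPy implementation of the standard definition (variable names are ours):

```python
import numpy as np

def spectral_angle_mapper(x, ref, eps=1e-12):
    """Mean per-pixel angle in degrees between spectra; 0 deg = perfect fidelity."""
    # x, ref: (C, H, W) multispectral images; spectra live along axis 0
    dot  = (x * ref).sum(axis=0)
    norm = np.linalg.norm(x, axis=0) * np.linalg.norm(ref, axis=0) + eps
    ang  = np.degrees(np.arccos(np.clip(dot / norm, -1.0, 1.0)))
    return ang.mean()

ref = np.random.default_rng(0).random((8, 4, 4))
assert spectral_angle_mapper(ref, ref) < 1e-3        # identical spectra -> ~0 deg
assert spectral_angle_mapper(2.0 * ref, ref) < 1e-3  # invariant to brightness scale
```

Under this metric, the sub-1.5° SAM errors reported for PEFD indicate that per-pixel spectral shapes stay close to the reference.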
The authors acknowledge limitations. Accurate homography sampling requires knowledge of the camera intrinsics (K); in practice, calibration errors may affect the equivariance loss. Moreover, while the framework naturally extends to joint demosaicing‑denoising (JDD) by adding a GT‑free denoising term, this extension was not evaluated, leaving low‑light, high‑noise scenarios untested. Future work will explore automatic intrinsic estimation, more sophisticated noise models, and full JDD integration.
In summary, PEFD introduces a principled self‑supervised loss based on perspective equivariance, which unlocks information hidden in the mosaicing operator’s null‑space, and couples it with a frozen foundation model to bring the power of large‑scale pre‑training to a GT‑free multispectral demosaicing task. The result is a practical, real‑time capable solution that bridges the gap between high‑quality supervised methods and the constraints of real‑world medical and automotive imaging, delivering near‑supervised performance without ever acquiring costly ground‑truth data.