GenMFSR: Generative Multi-Frame Image Restoration and Super-Resolution



Harshana Weligampola¹,², Joshua Peter Ebenezer¹, Weidi Liu¹, Abhinau K. Venkataramanan¹, Sreenithy Chandran¹, Seok-Jun Lee¹, and Hamid Rahim Sheikh¹
¹Samsung Research America, ²Purdue University
wweligam@purdue.edu

Figure 1. GenMFSR: Superior high-frequency feature generation from multi-frame RAW images via a unified denoising, demosaicing, registration, and super-resolution model in a single step.

Abstract

Camera pipelines receive raw Bayer-format frames that need to be denoised, demosaiced, and often super-resolved. Multiple frames are captured to utilize natural hand tremors and enhance resolution. Multi-frame super-resolution is therefore a fundamental problem in camera pipelines. Existing adversarial methods are constrained by the quality of ground truth. We propose GenMFSR, the first Generative Multi-Frame Raw-to-RGB Super-Resolution pipeline, which incorporates image priors from foundation models to obtain sub-pixel information for camera ISP applications. GenMFSR can align multiple raw frames, unlike existing single-frame super-resolution methods, and we propose a loss term that restricts generation to high-frequency regions in the raw domain, thus preventing low-frequency artifacts.

1. Introduction

Multi-frame super-resolution (MFSR) is the cornerstone of modern mobile computational photography, relying on natural hand tremors to recover sub-pixel details from noisy RAW bursts. Standard Image Signal Processors (ISPs) apply sequential, lossy modules (demosaicing, denoising) that destroy critical sensor-level information.
While recent generative models (e.g., DarkDiff [45], RDDM [4], ISPDiffuser [25]) achieve unprecedented photorealism for single RAW images, integrating these priors into the temporally misaligned multi-frame RAW pipeline remains unsolved. Current multi-frame approaches [12, 37] rely on deterministic regression, which converges to posterior means and produces over-smoothed, "waxy" textures rather than photorealistic details.

Generative diffusion priors can synthesize these missing high-frequency details. However, applying them to the RAW-to-RGB inverse problem introduces a critical failure mode: domain collision. Standard priors are trained on highly processed, non-linear sRGB images. When forced to regularize linear, mosaiced sensor data, the prior aggressively "corrects" the low-frequency structure (e.g., global color and macro-geometry) that is already recoverable from the burst, resulting in severe structural hallucinations.

We develop a principled remedy for this collision. In burst reconstruction, low-frequency components are effectively determined by aligned sensor measurements, leaving only high-frequency textures underdetermined. Therefore, we treat the generative prior as an orthogonal score projection via a High-Frequency Variational Score Distillation (HF-VSD) loss. By applying a frequency-domain mask to the score gradients, we restrict the prior's corrective power strictly to the unobserved high-frequency latent subspace, successfully preventing hallucinations while synthesizing photorealistic textures.

To translate this theory into a practical mobile pipeline, we propose GenMFSR, an end-to-end, single-shot generative restoration framework. We embed Spatial Transformer Networks (STNs) directly within the multi-frame encoder for native sensor-domain alignment, bypassing massive external registration networks.
By leveraging low-rank adaptation (LoRA) and one-step distillation, GenMFSR maintains the expressive power of a frozen pretrained decoder while reducing latency to a single forward pass.

In summary, GenMFSR selectively leverages a strong generative prior while strictly preserving the deterministic information present in aligned raw measurements. Our main contributions are:
• First Generative RAW MFSR Pipeline: We demonstrate the first burst RAW-to-RGB diffusion model, which can handle unregistered RAW frames while leveraging the generative capabilities of foundation models.
• Subspace-Projected Score Distillation (HF-VSD): We introduce a new diffusion loss function called subspace-projected score distillation that reduces low-frequency hallucinations in the raw domain.
• Native Generative Alignment & Latent Bridging: We introduce a new encoder that maps linear, unregistered, mosaic RAW data directly into the latent space of a pretrained diffusion model.
• Empirical Handheld Motion Simulation: We construct a rigorous paired burst dataset by extracting and applying real-world continuous hand-tremor trajectories to static RAW captures, ensuring physically accurate evaluation.

Extensive quantitative and qualitative experiments demonstrate that GenMFSR achieves state-of-the-art perceptual quality, outperforming existing deterministic and generative baselines by yielding highly realistic, hallucination-free textures.

2. Related Work

Burst Super-Resolution. Handheld burst photography relies on natural hand tremors to recover sub-pixel information [12, 37]. Recent learned methods, including Burstormer [10] and MFSR-GAN [16], achieve state-of-the-art alignment and fusion but rely heavily on deterministic regression or unstable adversarial objectives, fundamentally limiting their ability to synthesize absent high-frequency details.
By contrast, diffusion models offer a generative alternative with strong perceptual fidelity. While they have shown promise in single-frame super-resolution [8, 18, 32, 38], their application to multi-frame RAW-to-RGB burst captures remains underexplored.

Generative Priors in ISP. Latent diffusion models [27] have revolutionized single-image restoration [18, 38]. Recent industry frameworks (e.g., DarkDiff, RDDM, ISPDiffuser) successfully map single RAW captures to sRGB using diffusion priors. However, extending these priors to multi-frame bursts remains critically underexplored. Naively adapting single-frame models destroys temporal consistency, while standard distillation techniques like VSD [35] suffer from domain collisions when applied to RAW sensor data. Our work bridges this gap by introducing the first generative multi-frame RAW architecture with subspace-restricted distillation.

Model distillation has been used as a knowledge-transfer technique to create compact, efficient "student" models that mimic the behaviour of a larger, more complex "teacher" model. In the context of diffusion models [26], accelerated sampling and distillation have been used to drastically reduce the number of sampling steps from hundreds or thousands to just a few, enabling real-time generation [20, 21, 24, 30, 31, 38]. While earlier works [28] focused on finding "shortcuts" to progressively distill the teacher model, recent work by Wu et al. [38] used a regularization approach that is conceptually aligned with GANs, employing an adversarial-style objective like Variational Score Distillation (VSD) [35] to enforce high-fidelity, plausible results.

Noise-conditional generative models. Typical conditioning mechanisms, such as ControlNets [43] and Adapters [22], require training an additional, often large, encoding module.
Beyond using noise as a simple prior, recent works [2, 6, 11, 33, 34] have explored using structured or conditional noise to guide the generative process. For instance, Dai et al. [6] demonstrated that conditioning on noise can control the generation process, though its efficacy can be limited by domain gaps between the input and output.

In this work, we propose a framework that integrates the efficiency of model distillation with explicit control over high-frequency generation. This allows our model to effectively generate plausible high-frequency details, directly addressing a key limitation in deterministic reconstruction.

Figure 2. Comparison of RAW demosaiced images with the ground-truth image in image and frequency space: (a) high-resolution ground-truth image, (b) high-resolution RAW demosaic, (c) low-resolution RAW demosaic. Frequency-space panels show the natural logarithm of the real frequency values.

3. Motivational study

As shown in Fig. 2, the problem of RAW-to-RGB super-resolution involves a significant frequency-domain gap. The low-resolution RAW input's high-frequency regions are sparse and noisy. The generator's task is to plausibly fill in this missing high-frequency information to match the ground truth. As highlighted, the low-frequency components remain largely unchanged across all domains. This leads to the idea of focusing on high-frequency feature generation without hindering low-frequency features. Generating low-frequency content can lead to hallucinations of objects, which are undesirable for consumer camera applications.

4. Methodology

Our goal is to reconstruct a high-fidelity, photorealistic RGB image from a misaligned burst of RAW sensor data. In this section, we formalize the RAW-to-RGB pipeline as an inverse problem with partial observability.
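The frequency-domain gap described above can be illustrated numerically: a simple low-pass degradation removes far more spectral energy from the high band than from the low band. The sketch below is a toy numpy illustration, not the paper's measurement code; the 0.25 cutoff and the 2×2 box blur (standing in for optical blur and pixel-aperture integration) are our assumptions.

```python
import numpy as np

def band_energy(x, cutoff=0.25):
    """Spectral energy below/above a normalized radial frequency cutoff."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(x))) ** 2
    h, w = x.shape
    v, u = np.mgrid[-h // 2:(h + 1) // 2, -w // 2:(w + 1) // 2]
    r = np.sqrt((u / (w / 2)) ** 2 + (v / (h / 2)) ** 2)
    return spec[r <= cutoff].sum(), spec[r > cutoff].sum()

rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64))
# 2x2 box blur: a crude stand-in for the low-pass capture process.
blur = 0.25 * (img + np.roll(img, 1, 0) + np.roll(img, 1, 1)
               + np.roll(np.roll(img, 1, 0), 1, 1))
low0, high0 = band_energy(img)
low1, high1 = band_energy(blur)
# High frequencies are attenuated much more severely than low frequencies.
assert high1 / high0 < low1 / low0
```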
We then introduce our core theoretical contribution: a subspace-restricted prior integration technique (HF-VSD) that explicitly separates deterministic recovery from stochastic high-frequency synthesis. Finally, we detail the generative multi-frame architecture and optimization strategy used to solve this formulation.

4.1. Inverse Problem with Partial Observability

Given a burst of $N$ RAW frames $y = \{y^{(i)}\}_{i=1}^{N}$, our objective is to reconstruct a high-resolution RGB image $x$. Due to spatial downsampling, Poisson-Gaussian sensor noise, and Bayer mosaicing, the inverse mapping from $y$ to $x$ is highly ill-posed.

However, burst capture provides strong physical constraints on the scene. Because the burst acquisition operator, which is characterized by pixel-aperture integration and sensor noise, attenuates high spatial frequencies much more severely than low frequencies, the inverse problem exhibits frequency-dependent determinacy. After temporal alignment and multi-frame fusion via sub-pixel shifts, the coarse geometry, global illumination, and structural color are largely determined by the measurements. In contrast, fine-scale textures remain underdetermined. We therefore conceptually decompose the target image into two approximately complementary subspaces under natural image statistics: $x = x_L + x_H$, where $x_L$ represents the low-frequency components reliably inferred from the sensor data, and $x_H$ represents the high-frequency components that require prior-driven inference. This partial observability structure fundamentally motivates the selective integration of generative priors.

4.1.1. Training Objective and Optimization

The total training objective balances deterministic distortion and generative perception by anchoring the observable subspace and shaping the underdetermined subspace.
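The decomposition $x = x_L + x_H$ can be made concrete with a pair of complementary Fourier masks; because the masks sum to one, the split is exact. A minimal numpy sketch (the hard radial cutoff here is a simplification of the smooth mask used later in Sec. 4.2):

```python
import numpy as np

def freq_split(x, cutoff=0.25):
    """Split an image into complementary low/high-frequency parts,
    illustrating x = x_L + x_H from Sec. 4.1 (hard cutoff for clarity)."""
    X = np.fft.fftshift(np.fft.fft2(x))
    h, w = x.shape
    v, u = np.mgrid[-h // 2:(h + 1) // 2, -w // 2:(w + 1) // 2]
    r = np.sqrt((u / (w / 2)) ** 2 + (v / (h / 2)) ** 2)  # normalized radius
    low_mask = (r <= cutoff).astype(float)
    x_L = np.fft.ifft2(np.fft.ifftshift(X * low_mask)).real
    x_H = np.fft.ifft2(np.fft.ifftshift(X * (1 - low_mask))).real
    return x_L, x_H

rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64))
x_L, x_H = freq_split(img)
# The masks are complementary, so the split reconstructs the image exactly.
assert np.allclose(x_L + x_H, img)
```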
We formulate this as a generation task regularized by a pre-trained diffusion prior:

$$\theta^{*} = \arg\min_{\theta} \; \mathbb{E}_{y,x}\big[\, \mathcal{L}_{\text{data}}(G_{\theta}(y), x) + \lambda\, \mathcal{L}_{\text{HF-VSD}}(G_{\theta}(y)) \,\big], \qquad (1)$$

where $x$ is the ground-truth RGB image and $\lambda$ is a balancing hyperparameter. The fidelity loss $\mathcal{L}_{\text{data}}$ (comprising standard MSE and LPIPS losses calculated in the RGB pixel space) anchors the observable subspace, ensuring that the low-frequency structure and color match the ground truth. The regularization loss $\mathcal{L}_{\text{HF-VSD}}$ shapes the underdetermined subspace under a specified image prior, guiding the latent representations to generate photorealistic textures without disrupting the data term's structural consensus. We employ a one-step adversarial distillation strategy (combining LoRA fine-tuning with a discriminator) to make this regularization compatible with single-step inference.

4.2. Subspace-Restricted Prior Integration (HF-VSD)

Generative image priors [26], particularly latent diffusion models, offer a powerful mechanism for synthesizing the unobserved high-frequency detail $x_H$. Standard score distillation techniques (e.g., VSD [35]) apply diffusion gradients over the full image distribution, effectively regularizing the reconstruction using the gradient of the log-likelihood, $\nabla_x \log p(x)$. When applied to the RAW-to-RGB inverse problem, this global prior influences both $x_L$ and $x_H$. However, because the diffusion model is trained exclusively on highly processed sRGB images, its low-frequency score estimates inherently conflict with the true low-frequency structure deterministically recovered from the RAW measurements. This "domain collision" results in severe structural drift, color shifts, and low-frequency hallucinations (see Fig. 11).

Figure 3. The end-to-end training framework for our proposed GenMFSR model. A multi-frame RAW encoder $E_{\theta}$ with embedded STN aligns and processes the input burst of raw images. A U-Net $\Omega_{\theta}$ and frozen decoder $D_{\theta}$ generate the RGB image $\hat{x}$. The framework is trained with a data fidelity loss $\mathcal{L}_{\text{data}}$ and our regularization $\mathcal{L}_{\text{HF-VSD}}$, which uses a high-pass filter to ensure the diffusion prior focuses on high-frequency details.

To prevent prior-likelihood conflict, we treat score distillation as a projection operator rather than a global regularizer. We restrict generative supervision strictly to the underdetermined high-frequency subspace. We define a projection operator $P_H$:

$$P_H(x) = \mathcal{F}^{-1}\big(h \cdot \mathcal{F}(x)\big),$$

where $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the Fourier transform and its inverse, and $h$ is a radial high-pass filter mask defined as

$$h(u, v) = \operatorname{clip}\!\left(\left(\left(\frac{\alpha u}{R}\right)^{2} + \left(\frac{\alpha v}{R}\right)^{2}\right)^{\gamma} + \beta,\; 0,\; 1\right),$$

where $(u, v)$ are frequency coordinates, $R$ is half the maximum frequency, $\alpha$ and $\gamma$ are scaling factors, and $\beta$ is a bias term. We employ a fixed, smooth radial high-pass mask $h$ to avoid spatial ringing artifacts (Gibbs phenomenon) and ensure stable optimization. The cutoff is set to a moderate fraction of the Nyquist frequency. This choice reflects the empirical observation that burst alignment and fusion reliably recover coarse and mid-frequency structure, while the highest frequencies remain the most sensitive to noise and sampling limitations. Importantly, our objective is not to precisely estimate the exact null space of the forward operator, but simply to attenuate generative gradients in frequency bands where the sensor likelihood is already strong. In practice, we found that a moderate, smooth cutoff achieves stable training without requiring heuristic hyperparameter tuning.
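The mask $h(u,v)$ and the projection $P_H$ can be sketched directly in numpy. The $\alpha$, $\gamma$, $\beta$ defaults below are placeholders; the paper does not publish its exact settings.

```python
import numpy as np

def highpass_mask(h, w, alpha=1.0, gamma=1.0, beta=0.0):
    """Radial high-pass mask h(u, v) from Sec. 4.2 (illustrative params)."""
    R = max(h, w) / 2.0  # half the maximum frequency
    v, u = np.mgrid[-h // 2:(h + 1) // 2, -w // 2:(w + 1) // 2]
    radial = (alpha * u / R) ** 2 + (alpha * v / R) ** 2
    return np.clip(radial ** gamma + beta, 0.0, 1.0)

def project_H(x, mask):
    """P_H(x) = F^{-1}(h * F(x)): keep only high-frequency content."""
    X = np.fft.fftshift(np.fft.fft2(x))
    return np.fft.ifft2(np.fft.ifftshift(mask * X)).real

m = highpass_mask(64, 64)
assert 0.0 <= m.min() and m.max() <= 1.0
assert m[32, 32] == 0.0   # DC (center) fully suppressed when beta = 0
assert m[32, 32] < m[0, 0]  # attenuation weakens away from the center
```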
Crucially, we project the score gradients rather than masking the image content itself. Naive VSD modifies both observable and unobservable components indiscriminately. By applying $P_H$, we reformulate the update:

$$\nabla_x \mathcal{L}_{\text{HF-VSD}} = P_H\big(\nabla_x \mathcal{L}_{\text{VSD}}\big).$$

This projection ensures that the generative prior's gradient updates are restricted to the orthogonal complement of the highly observable subspace. Consequently, it preserves the null space of the data operator; the low-frequency structure is mainly governed by data fidelity ($\mathcal{L}_{\text{data}}$), while the generative prior synthesizes only the missing fine-scale textures. This regularization can be converted to a loss function to train our model as

$$\mathcal{L}_{\text{HF-VSD}}(\hat{x}) = \mathbb{E}_{t, \epsilon}\big[\, \mathcal{L}_{\text{MSE}}(z_t, \delta z_t) \,\big], \qquad (2)$$

where

$$\delta z_t = z_t - P_H\big(\omega(t)\,[\Omega_{\phi'}(\hat{z}_t; t, c) - \Omega_{\phi}(\hat{z}_t; t, c)]\big),$$

and $\omega(t)$ is a weighting factor. This loss formulation is equivalent to minimizing the $L_2$ norm of the masked gradient difference, effectively ignoring errors in the low-frequency regions and focusing the generator's updates on creating fine details. Hence, the final loss for updating the generator $G_{\theta}$ is the weighted sum of the data fidelity and regularization terms: $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \mathcal{L}_{\text{HF-VSD}}$, where $\lambda$ controls the strength of the generative regularization.

Theoretical Grounding. From a Bayesian perspective, restoring the high-resolution image $x$ from burst measurements $y$ requires maximizing the posterior score: $\nabla_x \log p(x \mid y) = \nabla_x \log p(y \mid x) + \nabla_x \log p(x)$. In heavily ill-posed inverse problems, the forward operator $A$ (modeling optical blur, downsampling, and mosaicing) possesses a pronounced null space. The data likelihood score $\nabla_x \log p(y \mid x)$ strongly constrains the range space of $A$ (the observable low frequencies) but provides zero information in the null space (the unobservable high frequencies). A standard generative prior $\nabla_x \log p(x)$ operates over the entire space.
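Because $\delta z_t$ differs from $z_t$ only by the masked term, Eq. (2) reduces to the mean squared, high-pass-filtered score difference. A toy numpy check of that equivalence; the two score networks $\Omega_{\phi'}$ and $\Omega_{\phi}$ are mocked as plain arrays, so this is an illustration of the loss algebra, not the training code.

```python
import numpy as np

def project_H(x, mask):
    """P_H(x) = F^{-1}(h * F(x))."""
    X = np.fft.fftshift(np.fft.fft2(x))
    return np.fft.ifft2(np.fft.ifftshift(mask * X)).real

def hf_vsd_loss(z_t, score_diff, mask, omega=1.0):
    """Sketch of Eq. (2): MSE(z_t, delta_z_t) with
    delta_z_t = z_t - P_H(omega * score_diff); score_diff stands in for
    the difference between the two (mocked) score networks."""
    delta = z_t - project_H(omega * score_diff, mask)
    return np.mean((z_t - delta) ** 2)

rng = np.random.default_rng(0)
z = rng.standard_normal((32, 32))
diff = rng.standard_normal((32, 32))
mask = np.ones((32, 32))
mask[12:20, 12:20] = 0.0  # toy mask: zero out a low-frequency block
loss = hf_vsd_loss(z, diff, mask)
# MSE(z, z - P_H(diff)) collapses to the mean squared masked term:
# only high-frequency disagreement contributes to the loss.
assert np.isclose(loss, np.mean(project_H(diff, mask) ** 2))
```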
However, because our prior is distilled from an sRGB-trained diffusion model, its low-frequency distribution fundamentally diverges from the linear RAW sensor distribution. Allowing the prior to update the range space introduces a direct mathematical conflict with the data likelihood. HF-VSD resolves this by applying a projection operator $P_H$ to the score gradient. By restricting the prior update to $P_H \nabla_x \log p(x)$, we explicitly constrain the generative prior to act only within the effective null space of the measurement operator. This ensures the prior acts as a complementary regularizer rather than a competing likelihood, grounding our frequency mask in rigorous inverse-problem theory.

4.3. Generative Multi-Frame Architecture

We designed an architecture that reliably estimates the observable component $x_L$ from sensor data and maps it into the latent space for generative refinement.

4.3.1. Sensor-Domain Alignment and Fusion

Accurate estimation of the low-frequency structure $x_L$ is critical before applying generative refinement. Because our inputs are handheld bursts, the frames suffer from non-rigid temporal misalignment. Rather than relying on external optical-flow networks, we embed Spatial Transformer Networks (STNs) directly within the RAW-domain multi-frame encoder (Fig. 4). The STNs estimate global homographies to warp the deep feature embeddings of frames $i \in \{2 \dots N\}$ to the reference frame $i = 1$. This explicit sensor-domain alignment efficiently fuses burst data, thereby minimizing ambiguity in the observable subspace prior to the latent mapping phase. This ensures that the latent features fused by the subsequent convolutional layers are spatially coherent, maximizing the effective receptive field and preventing ghosting artifacts during the generative decoding phase.

4.3.2. Sensor-to-Latent Mapping

To leverage the subspace-restricted diffusion prior, we must bridge the gap between the physical sensor likelihood and the pretrained sRGB prior. We perform a deterministic linear channel extraction on the Bayer inputs to form a 33-channel stacked volume, preserving the radiometric linearity and independent noise distributions of the RAW data (see Suppl.).

Figure 4. The architecture of the encoder that aligns the RAW frames and intermediate latents using STNs.

Our STN-embedded encoder then maps this linear volume directly into the standard SD latent space. By freezing the pretrained VAE decoder, we ensure the latent space remains perfectly aligned with the diffusion prior's expectations, allowing HF-VSD to operate effectively. We utilize a text prompt extractor, DAPE from [39], to give text embeddings to the diffusion prior.

4.4. Dataset Construction and Motion Simulation

Acquiring perfectly aligned, noise-free ground truth for genuine handheld RAW bursts is physically impossible due to the inherent temporal mismatch between a shaky short-exposure burst and the necessary long-exposure reference. To construct a rigorous paired dataset, we capture static tripod scenes to obtain pristine, long-exposure ground-truth RGB images alongside their corresponding short-exposure RAW pairs, natively preserving realistic Poisson-Gaussian sensor noise.
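A Bayer-preserving Space-to-Depth downsampling (used in Sec. 4.4 to simulate resolution loss) can be sketched as follows. Keeping the top-left 2×2 cell of every 4×4 block is one plausible reading of the operator — the paper does not spell out the exact sampling — but any such cell selection leaves the RGGB phase pattern intact.

```python
import numpy as np

def bayer_downsample2x(raw):
    """2x Bayer-preserving downsample: from every 4x4 block of the mosaic,
    keep the top-left 2x2 cell, so RGGB phases land on the same sites.
    Illustrative sketch; the exact cell choice is an assumption."""
    h, w = raw.shape
    out = np.empty((h // 2, w // 2), raw.dtype)
    out[0::2, 0::2] = raw[0::4, 0::4]  # R sites
    out[0::2, 1::2] = raw[0::4, 1::4]  # G1 sites
    out[1::2, 0::2] = raw[1::4, 0::4]  # G2 sites
    out[1::2, 1::2] = raw[1::4, 1::4]  # B sites
    return out

# Encode each Bayer phase with a distinct value and check it is preserved.
pattern = np.zeros((8, 8), int)
pattern[0::2, 1::2] = 1  # G1 (R sites stay 0)
pattern[1::2, 0::2] = 2  # G2
pattern[1::2, 1::2] = 3  # B
small = bayer_downsample2x(pattern)
assert small.shape == (4, 4)
assert (small[0::2, 0::2] == 0).all() and (small[1::2, 1::2] == 3).all()
```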
We used a Samsung Galaxy S23 Ultra's Wide Camera to capture 3000 burst RAW scenes and extract them after lens shading correction, white balancing, and black-level subtraction, but before any ISP denoising, to ensure that the captured RAW frames contain realistic noise before demosaicing — something not perfectly achievable with synthetic datasets that simulate RAW images from RGB images. The dataset includes low-light and bright-light captures ranging from ISO 500 to 2400. The content is static and includes indoor and outdoor scenes from homes, supermarkets, and public spaces.

To simulate handheld motion without introducing the domain gaps typical of purely synthetic affine warps, we adopt an empirical simulation strategy. We extract continuous inter-frame homography trajectories from a separate dataset of real handheld bursts [16] and apply these real-world motion paths to our static RAW captures. This ensures our network learns to align physically accurate, temporally correlated hand tremors while still allowing us to evaluate the generative output against mathematically perfect ground truth. Finally, to simulate resolution loss, we utilize a Bayer-preserving Space-to-Depth downsampling operation. Comprehensive capture logistics and hardware details are provided in the Supplementary Material.

Figure 5. Qualitative comparison of different postprocessing algorithms for generating RGB images using multi-frame RAW as input. Our algorithm produces finer details in the reconstructed RGB image than all baselines and also exhibits stability in the low-frequency range.

Figure 6. Comparison of methods for 4× super-resolution using fidelity (PSNR, SSIM), perceptual quality (LPIPS, DISTS, FID), and no-reference quality (NIQE, MUSIQ, MANIQA, CLIPIQA) metrics. The methods are grouped into non-diffusion-based and diffusion-based approaches to highlight their different training strategies. We highlight the best, second-best, and third-best values for each metric.

Method | PSNR↑ | SSIM↑ | LPIPS↓ | DISTS↓ | FID↓ | NIQE↓ | MUSIQ↑ | MANIQA↑ | CLIPIQA↑
Reference | 29.87 | 0.8841 | 0.3018 | 0.2512 | 39.41 | 9.6690 | 23.40 | 0.3160 | 0.4502
FBANet [36] | 30.88 | 0.9341 | 0.1891 | 0.1354 | 21.74 | 6.5970 | 34.53 | 0.3153 | 0.4864
Burstormer [10] | 35.07 | 0.9454 | 0.0948 | 0.0669 | 12.32 | 5.5131 | 36.72 | 0.3450 | 0.3600
MPRNet [42] | 35.18 | 0.9471 | 0.1309 | 0.1045 | 21.51 | 6.9476 | 38.93 | 0.3540 | 0.4471
Restormer [41] | 35.24 | 0.9522 | 0.1172 | 0.0967 | 16.10 | 7.0328 | 39.09 | 0.3570 | 0.4409
ResShift [40] | 26.12 | 0.8623 | 0.2927 | 0.1760 | 41.82 | 6.1530 | 29.32 | 0.3079 | 0.3429
InDI [7] | 32.69 | 0.8987 | 0.2486 | 0.2134 | 57.12 | 5.4929 | 28.41 | 0.3162 | 0.3207
VSD [38] | 29.75 | 0.9019 | 0.1232 | 0.0733 | 20.02 | 5.2592 | 39.08 | 0.3621 | 0.3935
Ours (no HF loss) | 29.78 | 0.8998 | 0.1131 | 0.0662 | 18.58 | 5.1981 | 38.26 | 0.3631 | 0.4081
Ours | 32.51 | 0.9200 | 0.0984 | 0.0585 | 14.22 | 5.1912 | 39.20 | 0.3643 | 0.4184

Figure 7. Performance comparison with selected competing baselines.

5. Experiments

We evaluate the proposed framework along three axes: reconstruction fidelity, perceptual realism, and structural consistency under generative regularization. We compare our method against state-of-the-art multi-frame super-resolution restoration models, including both regression-based and generative approaches. We report standard distortion metrics (PSNR, SSIM) and perceptual metrics (LPIPS [44] and FID [13]) and conduct a controlled user study to assess visual preference. We use PSNR and SSIM as pixel-based metrics to assess restoration fidelity relative to the ground truth.
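The PSNR figures reported above follow the standard peak-signal definition; a minimal numpy sketch, assuming images are normalized so the peak value is 1.0:

```python
import numpy as np

def psnr(ref, est, peak=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((ref.astype(np.float64) - est.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.zeros((4, 4))
est = np.full((4, 4), 0.1)
# MSE = 0.01, so PSNR = 10 * log10(1 / 0.01) = 20 dB.
assert np.isclose(psnr(ref, est), 20.0)
```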
In addition, we perform targeted ablations to isolate the effect of High-Frequency Variational Score Distillation (HF-VSD), the alignment module, and generative prior adaptation. See Suppl. for training details.

Comparison with baselines. We trained a set of baseline methods with our data after making minimal changes to their architectures to accommodate multi-frame RAW input. A major novelty of our pipeline is the adaptation of single-frame generative architectures to multi-frame RAW bursts. Standard diffusion models and restoration baselines (e.g., OSEDiff, NAFNet) accept 3-channel RGB inputs. To process Bayer RAW bursts, we pack the sub-sampled RAW frames into a 3-channel format and concatenate all 11 frames to create a 33-channel input, requiring the initial convolutional layers of both our encoder and the retrained baselines to accept 33 channels. We modified these initial projection layers and adjusted subsequent layers to ensure a smooth variation of channel dimension with minimal changes to the underlying architecture of the baselines. By retraining all models from scratch, we ensure that each network can natively fuse temporal sub-pixel information across the 33 channels to produce the final RGB output.

Table 1. Comparison of the proposed method with single-frame baselines for the 4× SR case.

Method | LPIPS | FID | PSNR | SSIM
GAN (SF NAFNet) | 0.13 | 9.53 | 28.18 | 0.86
VSD (SF) | 0.14 | 16.53 | 28.67 | 0.87
Ours (MF) | 0.09 | 14.22 | 31.43 | 0.90

We compare our method with state-of-the-art generative models for image restoration, namely ResShift [40], InDI [7], and OSEDiff [38], and some adversarially trained models such as Burstormer [10], MPRNet [42], and Restormer [41]. The quantitative results for these baselines are given in Fig. 6. Our method achieves the best generative quality metrics among competing methods.
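The 33-channel packing described above can be sketched as follows. Mapping each RGGB cell to three channels by averaging the two greens is an assumption on our part — the paper only states that sub-sampled RAW frames are packed into a 3-channel format and concatenated across 11 frames.

```python
import numpy as np

def pack_burst(burst):
    """Pack N Bayer frames into a (3N, H/2, W/2) tensor for baselines
    that expect 3-channel inputs (Sec. 5). The green-averaging rule is
    an assumption, not the paper's published mapping."""
    packed = []
    for raw in burst:
        r = raw[0::2, 0::2]
        g = 0.5 * (raw[0::2, 1::2] + raw[1::2, 0::2])  # average G1, G2
        b = raw[1::2, 1::2]
        packed.extend([r, g, b])
    return np.stack(packed)

burst = [np.full((16, 16), float(i)) for i in range(11)]
x = pack_burst(burst)
assert x.shape == (33, 8, 8)   # 11 frames x 3 channels
assert np.allclose(x[3], 1.0)  # R channel of the second frame
```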
Although adversarially trained methods yield better fidelity metrics, our method qualitatively demonstrates superior generative capability, as illustrated in Fig. 5, where it reconstructs plausible, realistic high-frequency details.

The Perception-Distortion Trade-off in Burst SR. As shown in Fig. 6, deterministic regression baselines (e.g., Restormer, MPRNet) achieve higher PSNR (∼35 dB) compared to our generative framework (∼32.5 dB). Rather than a limitation, this quantitative gap perfectly illustrates the core theoretical trade-off of our formulation. Because regression baselines minimize Mean Squared Error (MSE), they mathematically converge to the posterior distribution's mean. While this maximizes PSNR, it heavily penalizes high-frequency variance, resulting in the overly smooth or "waxy" textures observed in Fig. 5. In contrast, GenMFSR is explicitly designed to sample a specific, stochastic realization from the posterior to recover photorealistic texture. As established by the perception-distortion trade-off [1], this stochasticity guarantees a drop in pixel-wise distortion metrics like PSNR in exchange for superior perceptual quality. By optimizing for distributional alignment via HF-VSD, GenMFSR achieves state-of-the-art LPIPS and FID scores.

Details of User Study. Due to the limitations of traditional image quality metrics in capturing user perception, we conducted a user study to evaluate the proposed method. Twenty-two image quality experts were shown the outputs of the NAFNet trained with a GAN, the baseline VSD method, and our proposed GenMFSR method on 30 randomly selected crops from the test set. For each crop, subjects were asked to choose the preferred image or indicate whether the images looked the same. As shown in Fig. 8, a plurality of experts preferred images generated by our method. The post-study feedback indicated that the artifacts in the NAFNet output were the primary reason it was not preferred, and the lack of detail in the baseline VSD method was also a contributing factor to its non-preference. Some examples of the artifacts produced by the NAFNet are shown in Fig. 5. These crosshatch patterns may be an outcome of the network learning to game loss terms such as L1, GAN, and LPIPS — a well-known phenomenon known as the perception-fidelity trade-off [1].

Table 2. Comparison of Params and FLOPs between GenMFSR and its competing methods at an input resolution of 512 × 512.

Method | Type | Params (M) | FLOPs (G)
FBANet [36] | CNN | ∼28.5 | ∼120
Burstormer [10] | Transformer | 3.58 | 613
MPRNet [42] | CNN | 20 | 595
Restormer [41] | Transformer | 26.1 | 567
ResShift [40] | 15-step Diffusion | 119 | 5500
InDI [7] | 50-step Diffusion | 54 | 48000
VSD (OSEDiff) [38] | 1-step Diffusion | 1775.0 | 4530
Ours (GenMFSR) | 1-step Diffusion | 1310.0 | 4620

Figure 8. A survey of 22 image quality experts shows a strong perceptual preference for our method (45.39%) over NAFNet (8.88%), VSD (23%), and "same" (22.70%). Despite its high quantitative metrics, the GAN-based NAFNet was strongly disfavored due to artifacts, demonstrating the gap between standard metrics and human perception.

Single Frame vs Multi-Frame. We also experimentally validate the benefit of using multiple frames by using only the RAW base frame and GT pairs to train single-frame baselines. The comparison with these baselines is presented in Tab. 1, which shows that our multi-frame approach outperforms all single-frame RAW-to-RGB methods. A qualitative comparison in Fig. 9 shows that GenMFSR has greater fidelity than single-frame methods.

Figure 9. Comparison of our method after training on single-frame and multi-frame RAW images. This shows the need for multiple frames in the RAW-to-RGB problem.

Deployment and Practicality. We note that deploying a 1.3B-parameter latent diffusion model is impractical for real-time, viewfinder-level mobile processing (e.g., 30 fps). Consequently, GenMFSR is designed to operate as an Asynchronous Enhancement Module, akin to premium computational photography pipelines (e.g., Deep Fusion or Night Sight) that process high-fidelity bursts in the background post-capture. By utilizing one-step distillation, our framework reduces the latency of standard diffusion from dozens of seconds to a single forward pass, making high-end generative burst restoration practically viable for modern mobile Neural Processing Units (NPUs). Further distillation and mobile deployment are made possible by recent works [23, 29].

6. Ablation study

We trained the proposed network without STN blocks in the encoder to verify the performance gain from using STNs. Fig. 10 illustrates how the addition of STNs produces sharper images in the final reconstruction. This validates the efficacy of using STNs for image registration in the encoder stage, aligning the frames and making the subsequent tasks easier for the generative network.

Figure 10. Ablation study showing the effectiveness of using STNs, without the HF loss, for a fair comparison of the encoder stage. When STNs are used in the encoder, the overall sharpness of the images is improved since all frames are aligned.

Figure 11. Effectiveness of the high-frequency (HF) VSD loss. Both models use STNs for alignment. (a) The baseline VSD introduces low-frequency patching artifacts (top, contrast enhanced) and hallucinations (bottom). (b) Our L_VSD-HF successfully prevents both types of artifacts.

Figure 12. Comparison of competing results in the Frequency Space (FS) and Image Space (IS).
We highlight the absolute error between ours and OSEDiff [38], which is a close competing method.

Furthermore, Fig. 11 demonstrates that restricting the generative prior ensures low-frequency uniformity across patches. This explicitly prevents the severe boundary artifacts (top row) and structural hallucinations (bottom row) that plague the unconstrained baseline VSD.

Fourier spectra analysis. Fig. 12 confirms that naive VSD alters low-frequency magnitudes, causing the color shifts and hallucinations shown in Fig. 11. Conversely, GenMFSR matches the ground-truth low-frequency spectrum while extrapolating high-frequency bands, thereby reducing low-frequency structural error compared to standard VSD. Fig. 12 also highlights the difference between OSEDiff [38] and our method in the image space and frequency space, where ours shows improvements in edges and high-frequency components. This confirms that our subspace projection successfully bridges the RAW-to-RGB domain gap, recovering the critical sub-pixel textures lost by standard regression models without triggering the severe structural drift inherent to unconstrained diffusion priors.

7. Conclusion

We introduced GenMFSR, the first unified, one-step generative framework for multi-frame RAW-to-RGB restoration. By embedding explicit frame alignment within the encoder and restricting generative priors to the high-frequency latent subspace, GenMFSR successfully synthesizes photorealistic sub-pixel details without structural hallucinations. This offers a highly effective, sensor-stage architecture for modern mobile ISPs. Because the framework relies on a frozen Stable Diffusion prior to bypass heavy training costs, it naturally inherits the foundation model's constraints, such as minor global color shifts or VAE-induced text artifacts.
Future work will explore rectified flow and global foveal context to refine patch-based prompting, further bridging the gap between powerful generative priors and physical sensor constraints.

References

[1] Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6228-6237, 2018.
[2] Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, et al. Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13-23, 2025.
[3] Liangyu Chen, Xiaojie Chu, Xiangyu Zhang, and Jian Sun. Simple baselines for image restoration. In European Conference on Computer Vision, pages 17-33. Springer, 2022.
[4] Yan Chen, Yi Wen, Wei Li, Junchao Liu, Yong Guo, Jie Hu, and Xinghao Chen. RDDM: Practicing raw domain diffusion model for real-world image restoration. arXiv preprint arXiv:2508.19154, 2025.
[5] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764-773, 2017.
[6] Longquan Dai, He Wang, and Jinhui Tang. NoiseCtrl: A sampling-algorithm-agnostic conditional generation method for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18093-18102, 2025.
[7] Mauricio Delbracio and Peyman Milanfar. Inversion by direct iteration: An alternative to denoising diffusion for image restoration. arXiv preprint arXiv:2303.11435, 2023.
[8] Zheng-Peng Duan, Jiawei Zhang, Xin Jin, Ziheng Zhang, Zheng Xiong, Dongqing Zou, Jimmy S. Ren, Chunle Guo, and Chongyi Li. DiT4SR: Taming diffusion transformer for real-world image super-resolution.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18948-18958, 2025.
[9] Akshay Dudhane, Syed Waqas Zamir, Salman Khan, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Burst image restoration and enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5759-5768, 2022.
[10] Akshay Dudhane, Syed Waqas Zamir, Salman Khan, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Burstormer: Burst image restoration and enhancement transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5703-5712. IEEE, 2023.
[11] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346-7356, 2023.
[12] Sina Farsiu, M. Dirk Robinson, Michael Elad, and Peyman Milanfar. Fast and robust multiframe super resolution. IEEE Transactions on Image Processing, 13(10):1327-1344, 2004.
[13] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
[14] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. International Conference on Learning Representations (ICLR), 1(2):3, 2022.
[15] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. Advances in Neural Information Processing Systems, 28, 2015.
[16] Fadeel Sher Khan, Joshua Ebenezer, Hamid Sheikh, and Seok-Jun Lee. MFSR-GAN: Multi-frame super-resolution with handheld motion modeling.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 800-809, 2025.
[17] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020.
[18] Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Yu Qiao, Wanli Ouyang, and Chao Dong. DiffBIR: Toward blind image restoration with generative diffusion prior. In European Conference on Computer Vision, pages 430-448. Springer, 2024.
[19] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012-10022, 2021.
[20] Eric Luhman and Troy Luhman. Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388, 2021.
[21] Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14297-14306, 2023.
[22] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4296-4304, 2024.
[23] Mehdi Noroozi, Isma Hadji, Victor Escorcia, Anestis Zaganidis, Brais Martinez, and Georgios Tzimiropoulos. Edge-SD-SR: Low latency and parameter efficient on-device super-resolution with stable diffusion via bidirectional conditioning. arXiv preprint arXiv:2412.06978, 2024.
[24] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall.
DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
[25] Yang Ren, Hai Jiang, Menglong Yang, Wei Li, and Shuaicheng Liu. ISPDiffuser: Learning raw-to-sRGB mappings with texture-aware diffusion models and histogram-guided color consistency. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 6722-6730, 2025.
[26] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
[27] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684-10695, 2022.
[28] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
[29] Subhajit Sanyal, Srinivas Soumitri Miriyala, Akshay Janardan Bankar, Sravanth Kodavanti, Abhishek Ameta, Shreyas Pandith, Amit Satish Unde, et al. NanoSD: Edge efficient foundation model for real time image restoration. arXiv preprint arXiv:2601.09823, 2026.
[30] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
[31] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models, 2023.
[32] Lingchen Sun, Rongyuan Wu, Zhiyuan Ma, Shuaizheng Liu, Qiaosi Yi, and Lei Zhang. Pixel-level and semantic-level adjustable super-resolution: A dual-LoRA approach. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2333-2343, 2025.
[33] Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, and Ying Shan. Taming rectified flow for inversion and editing. arXiv preprint arXiv:2411.04746, 2024.
[34] Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion null-space model. arXiv preprint arXiv:2212.00490, 2022.
[35] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. Advances in Neural Information Processing Systems, 36:8406-8441, 2023.
[36] Pengxu Wei, Yujing Sun, Xingbei Guo, Chang Liu, Guanbin Li, Jie Chen, Xiangyang Ji, and Liang Lin. Towards real-world burst image super-resolution: Benchmark and method. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13233-13242, 2023.
[37] Bartlomiej Wronski, Ignacio Garcia-Dorado, Manfred Ernst, Damien Kelly, Michael Krainin, Chia-Kai Liang, Marc Levoy, and Peyman Milanfar. Handheld multi-frame super-resolution. ACM Transactions on Graphics (ToG), 38(4):1-18, 2019.
[38] Rongyuan Wu, Lingchen Sun, Zhiyuan Ma, and Lei Zhang. One-step effective diffusion network for real-world image super-resolution. Advances in Neural Information Processing Systems, 37:92529-92553, 2024.
[39] Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, and Lei Zhang. SeeSR: Towards semantics-aware real-world image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25456-25467, 2024.
[40] Zongsheng Yue, Jianyi Wang, and Chen Change Loy. ResShift: Efficient diffusion model for image super-resolution by residual shifting. Advances in Neural Information Processing Systems, 36:13294-13307, 2023.
[41] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5728-5739, 2022.
[42] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14821-14831, 2021.
[43] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836-3847, 2023.
[44] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586-595, 2018.
[45] Amber Yijia Zheng, Yu Zhang, Jun Hu, Raymond A. Yeh, and Chen Chen. DarkDiff: Advancing low-light raw enhancement by retasking diffusion models for camera ISP. arXiv preprint arXiv:2505.23743, 2025.

GenMFSR: Generative Multi-Frame Image Restoration and Super-Resolution
Supplementary Material

A. Introduction

This supplementary material provides a theoretical background on score distillation and how it relates to high-frequency feature generation, along with more information on model training and the dataset in Sec. C. Additional experiment results are presented in Sec. D, including qualitative results demonstrating super-resolution scale agnosticism and perceptual quality improvements over Generative Adversarial Networks (GANs). We provide further details of the user study, along with some examples of the images used, in Sec. E.

B. Theoretical background

B.1. Variational Score Distillation (VSD)

VSD works by minimizing the KL divergence D_KL(q_θ(ẑ) || p(z)) between the distribution of noisy predicted latents ẑ and noisy real latents z, denoted by q_θ and p respectively.
This can be implemented using techniques such as adversarial training, where a discriminator is trained to distinguish p from q_θ. The goal of the generator is to predict x̂ such that q_θ ≈ p, which reduces the KL divergence. This regularizes the generator to produce images that are close to the ground-truth distribution.

However, according to the authors of VSD [35], the distribution of z can be sparse, which may yield suboptimal results. If we instead add noise to these latent variables (ẑ_t = α_t ẑ + β_t ε_t), we diffuse the distributions so that a valid KL divergence D_KL(q_{θ,t}(ẑ_t) || p_t(z_t)) is easily calculated, where t is the time step, ε ∼ N(0, I), and α_t, β_t are noise schedule coefficients. This tractable problem can be written as

    θ* = arg min_θ E_t[(α_t / β_t) ω(t) D_KL(q_{θ,t}(ẑ_t) || p_t(z_t))]    (3)

where ω(t) is a weighting factor.

Since pretrained diffusion models such as Stable Diffusion already have the capability to estimate the distribution of diffused real images, we use a frozen Stable Diffusion model Ω_φ′ to estimate the distribution (or score function) of z_t. The distribution of the predicted latents ẑ_t is estimated using another noise prediction network Ω_φ, which is trained on the predicted latents with the standard diffusion loss

    L_diff = E_{t,ε,ẑ} L_MSE(Ω_φ(α_t ẑ + β_t ε; t, c), ε)    (4)

Using these estimates of the two distributions, we can solve the problem in Eq. (3) by computing its gradient. To that end, we define the gradient of the VSD loss as

    ∇_x̂ L_VSD(x̂) ≜ E_{t,ε}[ ω(t) (Ω_φ′(z_t; t, c) − Ω_φ(ẑ_t; t, c)) ∂z_t/∂x̂ ]

Since the decoder is frozen, the VSD loss can be computed in the latent space.
Therefore, the VSD loss can be backpropagated to θ as follows:

    ∇_θ L_VSD(x̂) = ∇_x̂ L_VSD(x̂) ∂x̂/∂θ = E_{t,ε}[ ω(t) (Ω_φ′(z_t; t, c) − Ω_φ(ẑ_t; t, c)) ∂z_t/∂θ ]    (5)

Given the gradient of the loss, we can find the global minimum by setting the gradient to zero, assuming the loss is convex. Therefore, the VSD loss can be incorporated into training as

    L̃_VSD(x̂) = L_MSE[ ω(t) Ω_φ′(z_t; t, c), ω(t) Ω_φ(ẑ_t; t, c) ]    (6)

B.2. HF-VSD via Null-Space Projection

We can formulate the RAW-to-RGB multi-frame super-resolution task as a linear inverse problem:

    y = Ax + n    (7)

where x ∈ R^D is the ground-truth high-resolution RGB image, y ∈ R^d (d ≪ D) represents the degraded RAW burst measurements, A is the forward degradation operator (blur, downsampling, mosaicing), and n is sensor noise.

To sample from the posterior p(x | y), score-based generative models utilize Langevin dynamics, which requires the posterior score:

    ∇_x log p(x | y) = ∇_x log p(y | x) [likelihood score] + ∇_x log p(x) [prior score]    (8)

The forward operator A induces a decomposition of the image space into a range space (observable subspace) and a null space (unobservable subspace). Let A† be the pseudo-inverse of A. We can define the orthogonal projection matrices for the range space, P_L = A†A, and the null space, P_H = I − A†A. Any image can be decomposed as x = P_L x + P_H x. The likelihood score ∇_x log p(y | x) only provides gradients for the observable subspace P_L x. The prior score ∇_x log p(x) provides gradients for both subspaces.

In our framework, the prior p(x) is parameterized by an sRGB-trained latent diffusion model, while y resides in the linear RAW domain. Consequently, the prior's score in the observable subspace, P_L ∇_x log p(x), is heavily biased by sRGB tone-mapping and color processing, conflicting with the true physical measurements y.
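The range/null-space decomposition above can be checked numerically with a toy 1-D operator (a minimal numpy sketch; the 2× average-pooling A below is an illustrative stand-in for the true blur + downsample + mosaic operator, which is never materialized in practice):

```python
import numpy as np

# Toy stand-in for the degradation operator A: 2x average pooling of an
# 8-sample signal down to 4 measurements.
D, d = 8, 4
A = np.zeros((d, D))
for i in range(d):
    A[i, 2 * i] = A[i, 2 * i + 1] = 0.5

A_pinv = np.linalg.pinv(A)        # pseudo-inverse A^dagger
P_L = A_pinv @ A                  # projection onto the observable range space
P_H = np.eye(D) - P_L             # projection onto the unobservable null space

x = np.random.default_rng(0).standard_normal(D)

# Any signal splits into an observable part and an unobservable part.
assert np.allclose(x, P_L @ x + P_H @ x)
# The null-space component is invisible to the measurements: A P_H x = 0.
assert np.allclose(A @ (P_H @ x), 0.0, atol=1e-10)
# Both operators are idempotent projections: P P = P.
assert np.allclose(P_H @ P_H, P_H) and np.allclose(P_L @ P_L, P_L)
```

The null-space component of x is exactly the detail that y cannot constrain, which is where the generative prior is allowed to act.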
To prevent this domain collision, we construct a modified posterior score that strictly isolates the prior to the null space:

    ∇_x log p̃(x | y) = ∇_x log p(y | x) + P_H ∇_x log p(x)    (9)

In the spatial frequency domain, the degradation operator A behaves as a low-pass filter. Therefore, the range space P_L corresponds to low spatial frequencies, and the null space P_H corresponds to high spatial frequencies.

By applying a high-pass Fourier mask h to the gradients of the Variational Score Distillation (VSD) loss, our proposed HF-VSD acts as an efficient, differentiable approximation of the null-space projection P_H:

    ∇_x L_HF-VSD ≈ P_H ∇_x log p(x)    (10)

which can be calculated using Eq. (2). This demonstrates that HF-VSD is not a heuristic image-space filter, but a theoretically grounded technique to enforce complementary score matching in partially observable inverse problems.

Computing the exact null-space projection P_H = I − A†A is computationally intractable for high-resolution images within a latent diffusion framework, due to the massive dimensionality of A and the non-rigid alignment modeled by our STNs.

However, if we locally approximate A as a spatially invariant blur operator combined with uniform sampling, A represents a discrete convolution. Under circular boundary conditions, A is a Block-Circulant with Circulant Blocks (BCCB) matrix. A fundamental property of BCCB matrices is that they are diagonalized by the 2D Discrete Fourier Transform (DFT) matrix, denoted as F. Thus, we can decompose A as:

    A = F^{-1} Λ F    (11)

where Λ is a diagonal matrix containing the Fourier coefficients of the degradation kernel (the Optical Transfer Function, OTF). Because optical blur and pixel integration act as low-pass filters, the diagonal elements of Λ approach zero at high spatial frequencies.
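In 1-D, both the DFT diagonalization and the resulting high-pass mask can be verified directly (a toy numpy sketch; the kernel k is an illustrative low-pass stand-in for the OTF, not the paper's actual degradation kernel):

```python
import numpy as np

# Circular low-pass blur kernel; its circulant matrix A is diagonalized by
# the DFT: A = F^{-1} diag(fft(k)) F, so lam = fft(k) plays the role of Lambda.
k = np.array([0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25])
n = len(k)
A = np.array([[k[(i - j) % n] for j in range(n)] for i in range(n)])

x = np.random.default_rng(1).standard_normal(n)
lam = np.fft.fft(k)  # diagonal of Lambda (the OTF)

# The matrix-vector product A x equals pointwise multiplication in Fourier space.
assert np.allclose(A @ x, np.fft.ifft(lam * np.fft.fft(x)).real)

# (I - Lambda^dagger Lambda) keeps only the frequencies the OTF annihilates,
# i.e. it is a high-pass mask; for this kernel only the Nyquist bin vanishes.
support = np.abs(lam) > 1e-8          # diagonal of Lambda^dagger Lambda
null_part = np.fft.ifft((1 - support) * np.fft.fft(x)).real  # P_H x
assert np.allclose(A @ null_part, 0.0, atol=1e-10)
```

The null-space projection is thus a per-frequency gate, which is what makes the fixed Fourier mask h a cheap approximation of P_H.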
The pseudo-inverse projection operator onto the observable range space becomes:

    P_L = A†A = F^{-1} (Λ†Λ) F    (12)

The term Λ†Λ is a diagonal matrix with ones at low frequencies (where the signal is observable) and zeros at high frequencies (where the OTF attenuates the signal below the noise floor). Consequently, the projection onto the unobservable null space is:

    P_H = I − P_L = F^{-1} (I − Λ†Λ) F    (13)

The matrix (I − Λ†Λ) is exactly a high-pass frequency mask in the Fourier domain.

Therefore, while a fixed Fourier mask h does not capture the complex, spatially varying local deformations of handheld motion, it serves as the mathematically optimal and computationally tractable approximation of the true null-space projection for camera sensor degradations. By applying this mask to the score gradients, HF-VSD effectively restricts the generative prior to the unobservable high-frequency subspace, without requiring the intractable computation of the true spatially varying pseudo-inverse.

C. Model description and training

C.1. Model details

We use the pretrained Stable Diffusion 2.1 model as the generator U-Net Ω_θ and the trainable regularizer Ω_φ. Selected layers of these U-Nets and the encoder E_θ are trained using Low-Rank Adapters (LoRAs). The decoder D_θ is frozen and used for decoding the reconstructed latents. The regularizer Ω_φ′ is initialized from Stable Diffusion 2.1 and kept frozen. The high-pass filter is initialized with α = 0.8, β = 0.2, and γ = 4. This filter is used in place of a Gaussian filter to provide a controllable limit on the maximum allowed high frequency.

C.2. Frame alignment for fidelity improvement

A critical challenge in multi-frame RAW processing is temporal misalignment caused by hand tremor. Prior generative models operate on single frames and fail catastrophically when presented with misaligned bursts.
To solve this, we embed Spatial Transformer Networks (STNs) directly within the multi-frame encoder E_θ. The distortion is modeled by a homography transformation relative to a base frame (e.g., the first frame). The encoder's input layer is replaced with a convolutional layer with 33 input channels to accommodate multiple RAW frames.

To accommodate the 3-channel prior of the pre-trained sRGB encoder while preserving the physical properties of the sensor data, we apply a deterministic linear channel extraction to the Bayer RAW inputs. Specifically, the red and blue channels are extracted and linearly upsampled (2×) to the full spatial resolution. The two green channels are similarly extracted, upsampled, and averaged. These are stacked to form a 33-channel input volume for the 11-frame burst. We explicitly note that while this constitutes a naive linear demosaicing, this 3-channel representation strictly retains the radiometric linearity, original bit depth, and uncompressed dynamic range of the raw sensor signal. In doing so, we reduce the dimensional domain gap for the generative VAE while completely bypassing the destructive, non-linear tone-mapping, gamma correction, and heavy spatial denoising applied by standard mobile ISPs.

The trainable encoder (E_θ) and diffusion networks (Ω_θ, Ω_φ) are finetuned by introducing trainable LoRA [14] layers. The localization networks of the Spatial Transformer Networks (STNs) [15] in E_θ are initialized to produce an identity homography matrix at the start and are then trained from scratch. While training a custom RAW-domain VAE might seem intuitive, doing so would fundamentally alter the latent space, disconnecting it from the pre-trained Stable Diffusion prior. Because VSD relies on the pre-trained prior (Ω_φ′) to provide score estimates, our encoder must map the 33-channel RAW burst directly into the standard SD latent space.
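The deterministic channel extraction described above can be sketched as follows (a minimal numpy sketch assuming an RGGB layout; the actual CFA order is sensor-dependent, and nearest-neighbor upsampling stands in here for the linear upsampling used in the paper):

```python
import numpy as np

def raw_to_3ch(bayer: np.ndarray) -> np.ndarray:
    """Naive linear demosaic: extract the R/G/G/B planes from an RGGB mosaic,
    upsample each 2x, and average the two green planes."""
    r  = bayer[0::2, 0::2]
    g1 = bayer[0::2, 1::2]
    g2 = bayer[1::2, 0::2]
    b  = bayer[1::2, 1::2]
    up = lambda p: np.kron(p, np.ones((2, 2)))   # 2x nearest-neighbor upsample
    return np.stack([up(r), (up(g1) + up(g2)) / 2, up(b)], axis=0)

bayer = np.arange(16, dtype=float).reshape(4, 4)
rgb = raw_to_3ch(bayer)
assert rgb.shape == (3, 4, 4)

# Stacking 11 such frames yields the 33-channel encoder input volume.
burst = np.concatenate([rgb] * 11, axis=0)
assert burst.shape == (33, 4, 4)
```

Since every step is a fixed linear map, the representation keeps the raw signal's radiometric linearity, as noted above.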
Therefore, we freeze the standard decoder D_θ and force the encoder E_θ to bridge the domain gap.

C.3. Dataset Capture and Processing Details

Hardware and Capture Logistics. We constructed an aligned dataset using registered long- and short-exposure multi-frame (MF) images captured on a mobile device mounted on a tripod, ensuring strictly static scenes to eliminate unwanted camera shake. Capturing data at the beginning of the ISP pipeline ensures that the RAW frames contain authentic, sensor-specific Poisson-Gaussian noise distributions before any destructive demosaicing occurs, a property that cannot be perfectly replicated using synthetic inverse-ISP pipelines. To maintain consistent luminance across the pairs, the brightness of the short-exposure and long-exposure images was equalized by adjusting the ISO during capture. The ground-truth high-resolution (HR) RGB images were obtained by fusing and demosaicing the pristine long-exposure RAW frames, thereby minimizing noise and eliminating blur.

Continuous Motion Simulation. To simulate handheld motion without introducing the domain gaps typical of purely synthetic affine warps, we utilized a supplementary dataset of real-world handheld bursts [16]. Crucially, differing from [16], where homographies were sampled independently for each input frame, our empirical simulation extracts groups of continuous homographies from entire multi-frame captures. We apply these motion trajectories jointly to the static short-exposure frames. This modification preserves the temporal correlations inherent in actual human hand tremors, resulting in highly realistic synthetic motion sequences that accurately reflect mobile burst photography.

Data Processing and Splits. During the data loading pipeline, RAW frames and ground-truth RGB images are extracted, and the handshake motion is applied.
To process the motion, the RAW frames are first subsampled and resized to the original size H × W using bilinear interpolation, followed by the application of the extracted homography matrices. The RAW and RGB images are subsequently cropped to a patch size of 256 × 256. The cropped and subsampled RAW frames are then downsampled (via a Bayer-preserving space-to-depth operation) and resized back to 256 × 256 to construct the final input volume to the model. Out of the collected dataset, 2000 distinct scenes are utilized for training, 100 for validation, and 100 for testing.

C.4. Training

The model is trained on an NVIDIA H100 GPU with a batch size of 4 for around 4 days, with λ = 1, α = 0.8, β = 0.2, and γ = 4. We used the Adam optimizer with a learning rate of 5 × 10^-5. The LoRAs are initialized to give an identity mapping at the beginning, and the input layer to the encoder is initialized with random weights. Training was performed for 500k iterations, and the checkpoint with the best MSE+LPIPS on the validation set was selected as the final checkpoint for evaluation of all the considered methods. The training procedure is outlined in Algorithm 1, where Y represents the caption generator and the other modules are consistent with the notation in the main paper.

C.5. Inference

We disregard the models used for regularization during inference. Thus, to obtain the final reconstructed image, we need only the Ω_θ, E_θ, and D_θ modules. The inference procedure is given in Algorithm 2.

D. Additional experiments

D.1. Inference-Time Scale Agnosticism

We discussed the primary results for the 4× SR case in Fig. 6. However, our model is not restricted to a fixed SR scale factor.
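Returning briefly to the data pipeline above, the Bayer-preserving space-to-depth packing can be sketched as follows (a minimal numpy sketch; the surrounding subsample and resize steps are omitted):

```python
import numpy as np

def bayer_space_to_depth(bayer: np.ndarray) -> np.ndarray:
    """Pack an (H, W) Bayer mosaic into an (H/2, W/2, 4) tensor so each 2x2
    CFA cell becomes one 4-channel pixel. No interpolation is involved, so
    the mosaic structure survives the spatial downsizing."""
    h, w = bayer.shape
    return (bayer.reshape(h // 2, 2, w // 2, 2)
                 .transpose(0, 2, 1, 3)
                 .reshape(h // 2, w // 2, 4))

x = np.arange(16, dtype=float).reshape(4, 4)
packed = bayer_space_to_depth(x)
assert packed.shape == (2, 2, 4)
# Each 4-vector is one CFA cell, e.g. the top-left cell holds x[0,0], x[0,1],
# x[1,0], x[1,1]:
assert np.array_equal(packed[0, 0], np.array([0.0, 1.0, 4.0, 5.0]))
```

This is why the operation is "Bayer-preserving": every channel of the packed tensor holds samples from exactly one CFA site.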
To demonstrate this scale-agnostic capability, we illustrate qualitative results in Fig. 13, where our method again shows the best fidelity compared to the competing generative baselines. The illustration in Fig. 13 further supports this statement by demonstrating improved texture generation from our method compared to GANs, particularly in the higher frequencies.

D.2. Comparison with related works

In addition to the qualitative comparison methods given in the main paper, we show the qualitative results from BIPNet [9], FBANet [36], MPRNet [42], Restormer [41], Burstormer [10], and ResShift [40] in Fig. 14 and Fig. 15.

Algorithm 1: Training Procedure
Load pretrained Ω_φ′, E_θ, D_θ, Y
Copy weights Ω_θ, Ω_φ ← Ω_φ′
Initialize LoRA on Ω_θ, Ω_φ, E_θ; initialize STNs on E_θ
procedure TRAIN(Dataset D_train)
  for i from 0 to N do
    y, x ∼ D_train
    c ← Y(y)
    z_T ← E_θ(y)                               ▷ encoding with alignment
    v̂_0 ← Ω_θ(z_T; T, c)
    ẑ ← (1/√ᾱ_T)(z_T − √(1 − ᾱ_T) v̂_0)
    x̂ ← D_θ(ẑ)
    L_data(x̂, x) ← L_MSE(x̂, x) + λ_LPIPS L_LPIPS(x̂, x)
    m ← F^{-1}[h · F[ẑ]]                        ▷ latent-space HF mask
    ε ∼ N(0, I); t ∼ [20, 980]
    ẑ_t ← α_t ẑ + β_t ε
    v_φ ← Ω_φ(ẑ_t; t, c)
    v_φ′ ← Ω_φ′(ẑ_t; t, c)
    ω ← 1 / mean(‖ẑ − v_φ‖²)
    L_VSD-HF(x̂) ← L_MSE(ẑ_t, ẑ_t − m ⊙ ω(v_φ′ − v_φ))
    ε ∼ N(0, I); t ∼ [0, 1000]
    ẑ_t ← α_t ẑ + β_t ε
    L_diff ← L_MSE(Ω_φ(ẑ_t; t, c), ε)
    Update θ using L_data + λ L_VSD-HF
    Update φ using L_diff
  end for
end procedure

Algorithm 2: Inference Procedure
Load trained Ω_θ, E_θ, D_θ, Y
procedure INFER(Dataset D_test)
  y ∼ D_test
  c ← Y(y)
  z_T ← E_θ(y)
  v̂_0 ← Ω_θ(z_T; T, c)
  ẑ ← (1/√ᾱ_T)(z_T − √(1 − ᾱ_T) v̂_0)
  x̂ ← D_θ(ẑ)
  return x̂
end procedure

D.3. Super-resolution agnosticism

During inference, we can use different upsampling scales, since the model is trained to improve particular high frequencies. We therefore show zero-shot results for 3× and 8× super-resolution using the same model, which was trained to predict 4× super-resolution. The qualitative results in Fig. 16 demonstrate that our method achieves comparable performance across various super-resolution scales.

Figure 13. Qualitative comparison of our method with the adversarially trained baseline (NAFNet) and the generative baseline (OSEDiff) for 4× and 3× super-resolution (SR). Our model demonstrates better feature reconstruction under different SR factors.

D.4. Other techniques for alignment

We trained the Variational Autoencoder (VAE) separately to verify the quality of encoding using different alignment approaches. First, we used deformable convolution layers [5] in intermediate steps of the encoder. However, this method is difficult to implement on hardware. We also tried to align RAW frames using Fourier Neural Operators (FNOs) [17], whose Fourier layer performs a global convolution using the Fourier transform; we used multiple stacked FNO layers in the encoder for frame alignment. However, the STNs performed better and had lower complexity when aligning frames, and thus we used STNs as the alignment operator in the encoder. The VAE reconstruction performance of the STNs, deformable convolution layers, and FNOs is illustrated in Fig. 17, where these VAEs are trained independently after initializing the weights using a pretrained Stable Diffusion 2.1 [26] VAE.

E. User study

For the user study, we used images from NAFNet (yielding the best quantitative results among adversarially trained methods), OSEDiff (utilizing the VSD loss for image super-resolution), and our own method.
We employed a two-level randomization design to ensure the study's validity. First, the presentation order of the 30 randomly selected crops was randomized for each subject. This standard procedure mitigates rater fatigue (a known decline in evaluator precision and consistency over time) and prevents sequential bias, which could unfairly penalize a method consistently shown at the end of the study. Second, within each trial, the spatial order of the three outputs (NAFNet, VSD, and GenMFSR) was also randomized. This prevents positional bias, where a rater might develop a preference for a specific on-screen location (e.g., left, middle, or right) or memorize a pattern. The images used in this study are shown in Fig. 18.

Figure 14. Qualitative comparison of different adversarial-learning-based postprocessing algorithms (input mean, NAFNet [3], SwinUnet [19], Burstormer [10], Restormer [41], FBANet [36], MPRNet [42], BIPNet [9], ours, GT) for generating RGB images from multi-frame RAW inputs. Best viewed zoomed in.

Figure 15. Qualitative comparison of different diffusion-based postprocessing algorithms (input mean, InDI [7], ResShift [40], OSEDiff [38], ours, GT) for generating RGB images from multi-frame RAW inputs. Best viewed zoomed in.

Figure 16. Qualitative results of our method on 3×, 4×, and 8× super-resolution.

Figure 17. Performance of different alignment techniques used for RAW multi-frame alignment. STNs demonstrate slightly better visual quality when compared to other methods.

Figure 18. Samples of the figures (NAFNet, VSD, ours) used in the user study.

F. Q&A

1. Clarification on the contributions. The core contribution of this paper is the system-level implementation of the burst RAW-to-RGB problem. We utilize STNs for alignment and propose a new loss for high-frequency detail generation.
We moved the VSD loss descriptions to the supplement to avoid distracting the reader.
2. Further details about baseline architecture modifications to accommodate multi-frame inputs. Changes to the baselines are described in the experiments section.
3. There is no comparison with the RGB-to-RGB performance of these methods. RAW-to-RGB vs. RGB-to-RGB is not the relevant comparison: once the image is converted to the RGB domain, some information is already lost at the ISP stage, defeating the whole purpose of improving RAW-to-RGB conversion.
4. There are not enough details about the dataset. Dataset collection details are available in the methodology section.
5. While the paper highlights the use of "single-step diffusion" to enhance computational efficiency, it does not present concrete metrics such as inference-time latency, FLOPs, or memory usage. Refer to Tab. 2 in the main manuscript.
6. There is a substantial gap in fidelity metrics compared to regression-based methods. We acknowledge this observation. However, the purpose of the generative model is to generate sub-pixel information, which reduces performance on fidelity metrics, as shown in the literature. We counter this argument with better qualitative results, further reinforced by better perceptual quality metrics.
7. The visual evidence provided in the ablation studies is unconvincing. We have highlighted both the visual and the quantitative results in the ablation study section.
8. The presentation quality and diagram clarity require improvement. We have improved the polish of the figures.
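For completeness, the two-level randomization protocol described in the user study (Sec. E) amounts to the following (a minimal Python sketch; the crop identifiers and seeding are illustrative, not the actual study harness):

```python
import random

METHODS = ["NAFNet", "VSD", "GenMFSR"]

def build_study(crops, seed):
    """Two-level randomization: shuffle the trial order per subject (level 1),
    then shuffle the on-screen position of the three outputs per trial
    (level 2)."""
    rng = random.Random(seed)
    trials = list(crops)
    rng.shuffle(trials)                      # level 1: trial presentation order
    study = []
    for crop in trials:
        order = METHODS[:]
        rng.shuffle(order)                   # level 2: spatial placement
        study.append((crop, order))
    return study

study = build_study(range(30), seed=7)
assert len(study) == 30
# Every trial still shows all three methods, just in a randomized layout.
assert all(sorted(order) == sorted(METHODS) for _, order in study)
```

Seeding per subject keeps each session reproducible while decorrelating both trial order and spatial placement across subjects.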
