Exposure Bracketing Is All You Need For A High-Quality Image
Published as a conference paper at ICLR 2025

Zhilu Zhang, Shuohao Zhang, Renlong Wu, Zifei Yan*, Wangmeng Zuo
Harbin Institute of Technology, Harbin, China
cszlzhang@outlook.com, yhyzshrby@163.com, hirenlongwu@gmail.com, {yanzifei,wmzuo}@hit.edu.cn
* Corresponding author.

ABSTRACT

It is highly desired but challenging to acquire high-quality photos with clear content in low-light environments. Although multi-image processing methods (using burst, dual-exposure, or multi-exposure images) have made significant progress in addressing this issue, they typically focus on specific restoration or enhancement problems, and do not fully explore the potential of utilizing multiple images. Motivated by the fact that multi-exposure images are complementary in denoising, deblurring, high dynamic range imaging, and super-resolution, we propose to utilize exposure bracketing photography to get a high-quality image by combining these tasks in this work. Due to the difficulty in collecting real-world pairs, we suggest a solution that first pre-trains the model with synthetic paired data and then adapts it to real-world unlabeled images. In particular, a temporally modulated recurrent network (TMRNet) and a self-supervised adaptation method are proposed. Moreover, we construct a data simulation pipeline to synthesize pairs and collect real-world images from 200 nighttime scenarios. Experiments on both datasets show that our method performs favorably against the state-of-the-art multi-image processing ones. Code and datasets are available at https://github.com/cszhilu1998/BracketIRE.

1 INTRODUCTION

In low-light environments, capturing visually appealing photos with clear content presents a highly desirable yet challenging goal. When adopting a low exposure time, the camera only captures a small amount of photons, introducing inevitable noise and rendering dark areas invisible. When taking a high exposure time, camera shake and object movement result in blurry images, in which bright areas may be overexposed. Although single-image restoration (e.g., denoising (Zhang et al., 2017; 2018b; Guo et al., 2019; Brooks et al., 2019; Zamir et al., 2020; Abdelhamed et al., 2020; Li et al., 2023), deblurring (Nah et al., 2017; Zhang et al., 2018a; Tao et al., 2018; Cho et al., 2021; Zamir et al., 2021; Mao et al., 2023), and super-resolution (SR) (Dong et al., 2015; Lim et al., 2017; Zhang et al., 2018e;c; Liu et al., 2020a; Liang et al., 2021; Ledig et al., 2017)) and enhancement (e.g., high dynamic range (HDR) reconstruction (Eilertsen et al., 2017; Liu et al., 2020b; Zou et al., 2023; Pérez-Pellitero et al., 2021; Lee et al., 2018; Chen et al., 2023)) methods have been extensively investigated, their performance is constrained by the severely ill-posed nature of these problems.

Recently, leveraging multiple images for image restoration and enhancement has demonstrated potential in addressing this issue, thereby attracting increasing attention. We provide a summary of several related settings and methods in Tab. 1. For example, some burst image restoration methods (Bhat et al., 2021a;b; Dudhane et al., 2022; Luo et al., 2022; Lecouat et al., 2021; Mehta et al., 2023; Dudhane et al., 2023; Bhat et al., 2023; Wu et al., 2023; Bhat et al., 2022) utilize multiple consecutive frames with the same exposure time as inputs, being able to perform SR and denoising.
The works based on dual-exposure images (Yuan et al., 2007; Chang et al., 2021; Mustaniemi et al., 2020; Zhao et al., 2022; Zhang et al., 2022b; Shekarforoush et al., 2023; Lai et al., 2022) combine short-exposure noisy and long-exposure blurry pairs for better restoration. Multi-exposure images are commonly employed for HDR imaging (Kalantari et al., 2017; Yan et al., 2019; Prabhakar et al., 2019; Wu et al., 2018; Niu et al., 2021; Liu et al., 2022; Yan et al., 2023a; Tel et al., 2023; Zhang et al., 2024b; Song et al., 2022). Nevertheless, in night scenarios, it remains unfeasible to obtain noise-free, blur-free, and HDR images when employing these multi-image processing methods. On the one hand, burst and dual-exposure images both possess restricted dynamic ranges, constraining the potential expansion of the two manners into HDR reconstruction. On the other hand, most HDR reconstruction approaches based on multi-exposure images are built on the ideal assumption that image noise and blur can be ignored, which results in their inability to restore degraded images. Although recent works (Liu et al., 2023; Chi et al., 2023; Lecouat et al., 2022) have incorporated the denoising task, they do not model the blur in long-exposure images, which is still inconsistent with real-world multi-exposure images.

Table 1: Comparison between various multi-image processing manners.

| Setting | Methods | Input Images | Denoising | Deblurring | HDR | SR |
|---|---|---|---|---|---|---|
| Burst Denoising | Godard et al., 2018; Xia et al., 2020; Rong et al., 2020; Guo et al., 2022 | Burst | ✓ | | | |
| Burst Deblurring | Wieschollek et al., 2017; Peña et al., 2019; Aittala & Durand, 2018 | Burst | | ✓ | | |
| Burst SR | Deudon et al., 2020; Wronski et al., 2019; Wei et al., 2023 | Burst | | | | ✓ |
| Burst Denoising and SR | Bhat et al., 2021a;b; Dudhane et al., 2022; Luo et al., 2022; Lecouat et al., 2021; Mehta et al., 2023; Dudhane et al., 2023; Bhat et al., 2023; Wu et al., 2023; Bhat et al., 2022 | Burst | ✓ | | | ✓ |
| Burst Denoising and HDR | Hasinoff et al., 2016; Ernst & Wronski, 2021 | Burst | ✓ | | ✓ | |
| Dual-Exposure Image Restoration | Chang et al., 2021; Mustaniemi et al., 2020; Zhao et al., 2022; Zhang et al., 2022b; Shekarforoush et al., 2023; Lai et al., 2022 | Dual-Exposure | ✓ | ✓ | | |
| Basic HDR Imaging | Kalantari et al., 2017; Yan et al., 2019; Niu et al., 2021; Liu et al., 2022; Yan et al., 2023a; Tel et al., 2023; Zhang et al., 2024b | Multi-Exposure | | | ✓ | |
| HDR Imaging with Denoising | Hasinoff et al., 2010; Liu et al., 2023; Chi et al., 2023; Pérez-Pellitero et al., 2021 | Multi-Exposure | ✓ | | ✓ | |
| HDR Imaging with SR | Tan et al., 2021 | Multi-Exposure | | | ✓ | ✓ |
| HDR Imaging with Denoising and SR | Lecouat et al., 2022 | Multi-Exposure | ✓ | | ✓ | ✓ |
| Our BracketIRE | - | Multi-Exposure | ✓ | ✓ | ✓ | |
| Our BracketIRE+ | - | Multi-Exposure | ✓ | ✓ | ✓ | ✓ |

In fact, considering all multi-exposure factors (including noise, blur, underexposure, overexposure, and misalignment) is not only beneficial to practical applications, but also offers us an opportunity to combine image restoration and enhancement tasks to get a high-quality image.
First, the independence and randomness of noise (Wei et al., 2020) between images allow them to assist each other in denoising; this motivation is similar to that of burst denoising (Mildenhall et al., 2018; Godard et al., 2018; Xia et al., 2020; Rong et al., 2020; Guo et al., 2022). In particular, as demonstrated in dual-exposure restoration works (Yuan et al., 2007; Chang et al., 2021; Mustaniemi et al., 2020; Zhao et al., 2022; Zhang et al., 2022b; Shekarforoush et al., 2023; Lai et al., 2022), long-exposure images with a higher signal-to-noise ratio can play a significantly positive role in removing noise from the short-exposure images. Second, the shortest-exposure image can be considered blur-free, so it can offer sharp guidance for deblurring longer-exposure images. Third, underexposed areas in the short-exposure image may be well-exposed in the long-exposure one, while overexposed regions in the long-exposure image may be clear in the short-exposure one; combining multi-exposure images thus makes HDR imaging easier than single-image enhancement. Fourth, the sub-pixel shift between multiple images caused by camera shake or motion is conducive to multi-frame SR (Wronski et al., 2019). In summary, leveraging the complementarity of multi-exposure images offers the potential to integrate the four problems (i.e., denoising, deblurring, HDR reconstruction, and SR) into a unified framework that can generate a noise-free, blur-free, high dynamic range, and high-resolution image.

Specifically, in terms of tasks, we first utilize bracketing photography to combine basic restoration (i.e., denoising and deblurring) and enhancement (i.e., HDR reconstruction), named BracketIRE. Then we append the SR task, dubbed BracketIRE+, as shown in Tab. 1. In terms of methods, due to the difficulty of collecting real-world paired data, we achieve this through supervised pre-training on synthetic pairs and self-supervised adaptation on real-world images. On the one hand, we adopt a recurrent network as the basic framework, inspired by its successful applications in processing sequence images, e.g., burst (Guo et al., 2022; Rong et al., 2020; Wu et al., 2023) and video (Wang et al., 2023b; Chan et al., 2021; 2022) restoration. Nevertheless, sharing the same restoration parameters for each frame may result in limited performance, as degradations (e.g., blur, noise, and color) vary between different multi-exposure images. To alleviate this problem, we propose a temporally modulated recurrent network (TMRNet), where each frame not only shares some parameters with others, but also has its own specific ones. On the other hand, TMRNet pre-trained on synthetic data has limited generalization ability and sometimes produces unpleasant artifacts in the real world, due to the inevitable gap between simulated and real images. For that, we propose a self-supervised adaptation method. In particular, we utilize the temporal characteristics of multi-exposure image processing to design learning objectives for fine-tuning TMRNet.

For training and evaluation, we construct a pipeline for synthesizing data pairs, and collect real-world images from 200 nighttime scenarios with a smartphone. The two datasets also provide benchmarks for future studies. We conduct extensive experiments, which show that the proposed method achieves state-of-the-art performance in comparison with other multi-image processing methods.
The contributions can be summarized as follows:

• We propose to utilize exposure bracketing photography to get a high-quality (i.e., noise-free, blur-free, high dynamic range, and high-resolution) image by combining image denoising, deblurring, high dynamic range reconstruction, and super-resolution tasks.
• We suggest a solution that first pre-trains the model with synthetic pairs and then adapts it to unlabeled real-world images, where a temporally modulated recurrent network and a self-supervised adaptation method are proposed.
• Experiments on both synthetic and captured real-world datasets show that the proposed method outperforms the state-of-the-art multi-image processing ones.

2 RELATED WORK

2.1 SUPERVISED MULTI-IMAGE PROCESSING

Burst Image Restoration and Enhancement. Burst-based manners generally leverage multiple consecutive frames with the same exposure for image processing. Most methods focus on image restoration, such as denoising, deblurring, and SR tasks, as shown in Tab. 1, and they mainly explore inter-frame alignment and feature fusion manners. The former can be implemented with various techniques, e.g., homography transformation (Wei et al., 2023), optical flow (Ranjan & Black, 2017; Bhat et al., 2021a;b), deformable convolution (Dai et al., 2017; Luo et al., 2022; Dudhane et al., 2023; Guo et al., 2022), and cross-attention (Mehta et al., 2023). The latter is also developed along multiple routes, e.g., weight-based mechanisms (Bhat et al., 2021a;b), kernel prediction (Xia et al., 2020; Mildenhall et al., 2018; Dahary et al., 2021), attention-based merging (Dudhane et al., 2023; Mehta et al., 2023), and recursive fusion (Deudon et al., 2020; Guo et al., 2022; Rong et al., 2020; Wu et al., 2023). Moreover, HDR+ (Hasinoff et al., 2016) joins HDR imaging and denoising by capturing underexposed raw bursts. Recent updates of HDR+ (Ernst & Wronski, 2021) introduce additional well-exposed frames to improve performance. Although such manners may be suitable for scenes with moderate dynamic range, they have limited ability for scenes with high dynamic range.

Dual-Exposure Image Restoration. Several methods (Yuan et al., 2007; Chang et al., 2021; Mustaniemi et al., 2020; Zhao et al., 2022; Zhang et al., 2022b; Shekarforoush et al., 2023; Lai et al., 2022) exploit the complementarity of short-exposure noisy and long-exposure blurry images for better restoration. For example, Yuan et al. (2007) estimate blur kernels by exploring the texture of short-exposure images and then employ the kernels to deblur long-exposure ones. Mustaniemi et al. (2020) and Chang et al. (2021) deploy convolutional neural networks (CNNs) to aggregate dual-exposure images, achieving superior results compared with single-image methods on synthetic data. D2HNet (Zhao et al., 2022) proposes a two-phase DeblurNet-EnhanceNet architecture for real-world image restoration. However, few works join this setting with HDR imaging, mainly due to the restricted dynamic range of dual-exposure images.

Multi-Exposure HDR Image Reconstruction. Multi-exposure images are widely used for HDR image reconstruction. Most methods (Kalantari et al., 2017; Yan et al., 2019; Prabhakar et al., 2019; Wu et al., 2018; Niu et al., 2021; Liu et al., 2022; Yan et al., 2023a; Tel et al., 2023; Zhang et al., 2024b; Song et al., 2022) only focus on removing ghosting caused by image misalignment.
For instance, Kalantari et al. (2017) first align multi-exposure images and then propose a data-driven approach to merge them. AHDRNet (Yan et al., 2019) utilizes spatial attention and dilated convolution to achieve deghosting. HDR-Transformer (Liu et al., 2022) and SCTNet (Tel et al., 2023) introduce self-attention and cross-attention to enhance feature interaction, respectively. Besides, a few methods (Hasinoff et al., 2010; Liu et al., 2023; Chi et al., 2023; Pérez-Pellitero et al., 2021) take noise into account. Kim et al. (Kim & Kim, 2023) further introduce motion blur in the long-exposure image. However, the unrealistic blur simulation approach and the requirement of time-varying exposure sensors limit its practical applications. In this work, we consider more realistic situations in low-light environments, and incorporate both severe noise and blur. More importantly, we propose to utilize the complementary potential of multi-exposure images to combine image restoration and enhancement tasks, including image denoising, deblurring, HDR reconstruction, and SR.

2.2 SELF-SUPERVISED MULTI-IMAGE PROCESSING

The complementarity of multiple images enables certain image processing tasks to be achieved in a self-supervised manner. For self-supervised image restoration, some works (Dewil et al., 2021; Ehret et al., 2019; Sheth et al., 2021; Wang et al., 2023c) accomplish multi-frame denoising with the assistance of Noise2Noise (Lehtinen et al., 2018) or blind-spot networks (Laine et al., 2019; Wu et al., 2020; Krull et al., 2019). SelfIR (Zhang et al., 2022b) employs a collaborative learning framework for restoring noisy and blurry images. Bhat et al. (2023) propose self-supervised burst SR by establishing a reconstruction objective that models the relationship between the noisy burst and the clean image. Self-supervised real-world SR can also be addressed by combining short-focus and telephoto images (Zhang et al., 2022a; Wang et al., 2021; Xu et al., 2023). For self-supervised HDR reconstruction, several works (Prabhakar et al., 2021; Yan et al., 2023b; Nazarczuk et al., 2022) generate or search pseudo-pairs for training the model, while SelfHDR (Zhang et al., 2024b) decomposes the potential GT into constructable color and structure supervision. However, these methods can only handle specific degradations, making them less practical for our task with multiple ones. In this work, instead of creating self-supervised algorithms trained from scratch, we suggest adapting the model trained on synthetic pairs to real images, and utilize the temporal characteristics of multi-exposure image processing to design self-supervised learning objectives.

3 METHOD

3.1 PROBLEM DEFINITION AND FORMULATION

Denote the scene irradiance at time t by X(t). When capturing a raw image Y starting at time t_0, we can simplify the camera's image formation model as

$$\mathbf{Y} = \mathcal{M}\Big(\int_{t_0}^{t_0+\Delta t} \mathcal{D}\big(\mathcal{W}_t(\mathbf{X}(t))\big)\,\mathrm{d}t + \mathbf{N}\Big). \quad (1)$$

In this equation: (1) D is a spatial sampling function, mainly related to sensor size, which limits the image resolution. (2) Δt denotes the exposure time and W_t represents the warp operation that accounts for camera shake.
Combined with potential object movements in X(t), the integral can result in a blurry image, especially when Δt is long (Nah et al., 2017). (3) N represents the inevitable noise, e.g., read and shot noise (Brooks et al., 2019). (4) M maps the signal to integer values ranging from 0 to 2^b − 1, where b denotes the bit depth of the sensor; this mapping may reduce the dynamic range of the scene (Lecouat et al., 2022). In summary, the imaging process introduces multiple degradations, including blur and noise, as well as a decrease in dynamic range and resolution. Notably, in low-light conditions, some degradations (e.g., noise) may be more severe.

In pursuit of higher-quality images, substantial efforts have been made to deal with the inverse problem through single-image or multi-image restoration (e.g., denoising, deblurring, and SR) and enhancement (e.g., HDR imaging). However, most efforts tend to focus on addressing partial degradations, and few works encompass all these aspects, as shown in Tab. 1. In this work, inspired by the complementary potential of multi-exposure images, we propose to exploit bracketing photography to integrate these tasks for noise-free, blur-free, high dynamic range, and high-resolution images. Specifically, the proposed BracketIRE involves denoising, deblurring, and HDR reconstruction, while BracketIRE+ adds support for the SR task. Here, we provide a formalization of them.

Firstly, we define the number of input multi-exposure images as T, and define the raw image taken with exposure time Δt_i as Y_i, where i ∈ {1, 2, ..., T} and Δt_i < Δt_{i+1}. Then, we follow the recommendations from multi-exposure HDR reconstruction methods (Yan et al., 2019; Niu et al., 2021; Liu et al., 2022; Yan et al., 2023a; Tel et al., 2023), normalizing Y_i by the exposure ratio and concatenating it with its gamma-transformed version, i.e.,

$$\mathbf{Y}_i^c = \Big\{\frac{\mathbf{Y}_i}{\Delta t_i / \Delta t_1},\ \Big(\frac{\mathbf{Y}_i}{\Delta t_i / \Delta t_1}\Big)^{\gamma}\Big\}, \quad (2)$$

where γ represents the gamma correction parameter and is generally set to 1/2.2. Finally, we feed these concatenated images into the BracketIRE or BracketIRE+ model B with parameters Θ_B, i.e.,

$$\hat{\mathbf{X}} = \mathcal{B}\big(\{\mathbf{Y}_i^c\}_{i=1}^{T}; \Theta_{\mathcal{B}}\big), \quad (3)$$

where X̂ is the generated image. Furthermore, the optimized network parameters can be written as

$$\Theta_{\mathcal{B}}^{*} = \arg\min_{\Theta_{\mathcal{B}}} \mathcal{L}_{\mathcal{B}}\big(\mathcal{T}(\hat{\mathbf{X}}), \mathcal{T}(\mathbf{X})\big), \quad (4)$$

where L_B represents the loss function and can adopt the ℓ1 loss, and X is the ground-truth (GT) image. T(·) denotes the µ-law based tone-mapping operator (Kalantari et al., 2017), i.e.,

$$\mathcal{T}(\mathbf{X}) = \frac{\log(1+\mu\mathbf{X})}{\log(1+\mu)}, \quad \text{where } \mu = 5{,}000. \quad (5)$$

Besides, we consider the shortest-exposure image (i.e., Y_1) blur-free and take it as the spatial alignment reference for the other frames. In other words, the output X̂ should be aligned strictly with Y_1.

For real-world dynamic scenarios, it is nearly impossible to capture the GT X, and it is hard to develop self-supervised algorithms trained on real-world images from scratch. To address the issue, we suggest pre-training the model on synthetic pairs first and then adapting it to real-world scenarios in a self-supervised manner. In particular, we propose a temporally modulated recurrent network for the BracketIRE and BracketIRE+ tasks in Sec. 3.2, and a self-supervised adaptation method in Sec. 3.3.
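To make the formulation concrete, the following is a minimal PyTorch sketch of the pre-processing in Eq. (2) and the µ-law tone mapping in Eq. (5). It assumes the paper's setting of a fixed exposure ratio S between neighboring frames (so Δt_i/Δt_1 = S^(i−1)); the function names are ours, not from the released code.

```python
import torch

def preprocess_exposures(raws, exposure_ratio=4.0, gamma=1 / 2.2):
    """Eq. (2): normalize each raw frame Y_i to the exposure of Y_1 and
    concatenate it with its gamma-transformed version.

    raws: list of T tensors, each (C, H, W), ordered by increasing exposure.
    """
    inputs = []
    for i, y in enumerate(raws):
        y_norm = y / (exposure_ratio ** i)  # Y_i / (Δt_i / Δt_1), with Δt_i/Δt_1 = S^(i-1)
        inputs.append(torch.cat([y_norm, y_norm.clamp(min=0) ** gamma], dim=0))
    return inputs

def mu_law_tonemap(x, mu=5000.0):
    """Eq. (5): µ-law tone mapping, applied before computing losses and metrics."""
    return torch.log(1 + mu * x) / torch.log(torch.tensor(1.0 + mu))
```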
3.2 TEMPORALLY MODULATED RECURRENT NETWORK

Figure 1: Illustration of the baseline recurrent network (e.g., RBSR (Wu et al., 2023)) and our TMRNet. Instead of sharing the parameters of the aggregation module A for all frames, we divide it into a common one A^c for all frames and a specific one A^s_i only for the i-th frame. Modules with different colors have different parameters.

Recurrent networks have been successfully applied to burst (Wu et al., 2023) and video (Wang et al., 2023b; Chan et al., 2021; 2022) restoration methods, which generally involve four modules, i.e., feature extraction, alignment, aggregation, and reconstruction modules. Here we adopt a unidirectional recurrent network as our baseline and briefly describe its pipeline. Firstly, the multi-exposure images {Y^c_i}_{i=1}^T are fed into an encoder for extracting features {F_i}_{i=1}^T. Then, the alignment module is deployed to align F_i with the reference feature F_1, getting the aligned feature F̃_i. Next, the aggregation module A takes F̃_i and the previous temporal feature H_{i−1} as inputs, generating the current fused feature H_i, i.e.,

$$\mathbf{H}_i = \mathcal{A}\big(\tilde{\mathbf{F}}_i, \mathbf{H}_{i-1}; \Theta_{\mathcal{A}}\big), \quad (6)$$

where Θ_A denotes the parameters of A. Finally, H_T is fed into the reconstruction module to output the result.

The aggregation module plays a crucial role in the recurrent framework and usually takes up most of the parameters. In burst and video restoration tasks, the degradation types of the multiple input frames are generally the same, so it is appropriate for frames to share the same aggregation network parameters Θ_A. In the BracketIRE and BracketIRE+ tasks, the noise models of multi-exposure images may be similar, as they can be taken by the same device. However, other degradations vary. For example, the longer the exposure time, the more serious the image blur, the fewer underexposed areas, and the more overexposed ones. Thus, sharing Θ_A may limit performance.

To alleviate this problem, we suggest assigning specific parameters to each frame while sharing some others, thus proposing a temporally modulated recurrent network (TMRNet). As shown in Fig. 1, we divide the aggregation module A into a common one A^c for all frames and a specific one A^s_i only for the i-th frame. Features are first processed via A^c and then further modulated via A^s_i. Eq. (6) can be modified as

$$\mathbf{G}_i = \mathcal{A}^c\big(\tilde{\mathbf{F}}_i, \mathbf{H}_{i-1}; \Theta_{\mathcal{A}^c}\big), \quad \mathbf{H}_i = \mathcal{A}^s_i\big(\mathbf{G}_i; \Theta_{\mathcal{A}^s_i}\big), \quad (7)$$

where G_i represents intermediate features, and Θ_{A^c} and Θ_{A^s_i} denote the parameters of A^c and A^s_i, respectively. We do not design complex architectures for A^c and A^s_i: each one only consists of a 3×3 convolution layer followed by some residual blocks (He et al., 2016). More details of TMRNet can be seen in Sec. 5.1.
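Below is a minimal PyTorch sketch of the temporally modulated aggregation in Eq. (7). The residual-block design and the 16/24 depth split follow Sec. 5.1, but the channel width, the concatenation-based fusion of F̃_i and H_{i−1}, and all names are our assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class TemporallyModulatedAggregation(nn.Module):
    """Eq. (7): a common module A^c shared by all frames, followed by a
    frame-specific module A^s_i. Depths follow the paper's 16/24 split."""
    def __init__(self, ch=64, num_frames=5, common_blocks=16, specific_blocks=24):
        super().__init__()
        self.common = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1),  # fuse aligned feature with H_{i-1}
            *[ResBlock(ch) for _ in range(common_blocks)])
        self.specific = nn.ModuleList([
            nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                          *[ResBlock(ch) for _ in range(specific_blocks)])
            for _ in range(num_frames)])

    def forward(self, aligned_feat, prev_state, frame_idx):
        g = self.common(torch.cat([aligned_feat, prev_state], dim=1))  # G_i = A^c(F̃_i, H_{i-1})
        return self.specific[frame_idx](g)                             # H_i = A^s_i(G_i)
```

Sharing `common` keeps most parameters frame-agnostic, while the per-frame `specific` branches modulate features according to each frame's exposure-dependent degradations.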
3.3 SELF-SUPERVISED REAL-IMAGE ADAPTATION

Figure 2: Self-supervised loss terms for real-image adaptation. B denotes TMRNet for the BracketIRE or BracketIRE+ task. In sub-figure (a), an integer from 1 to R (R < T) is randomly chosen as r. In sub-figure (b), EMA denotes exponential moving average.

It is hard to simulate multi-exposure images with diverse variables (e.g., noise, blur, brightness, and movement) that are completely consistent with real-world ones. Due to the inevitable gap, models trained on synthetic pairs have limited generalization capability in real scenarios: undesirable artifacts are sometimes produced and some details are missed. To address the issue, we propose to perform self-supervised adaptation on real-world unlabeled images. Specifically, we explore the temporal characteristics of multi-exposure image processing to elaborately design self-supervised loss terms, as shown in Fig. 2.

Denote the model output when inputting the first r frames {Y^c_i}_{i=1}^r by X̂_r. Generally, X̂_T performs better than X̂_r (r < T), as shown in Sec. 6.1. For supervising X̂_r, although no ground-truth is provided, X̂_T can be taken as the pseudo-target. Thus, the temporally self-supervised loss can be written as

$$\mathcal{L}_{self} = \big\|\mathcal{T}(\hat{\mathbf{X}}_r) - \mathcal{T}\big(\mathrm{sg}(\hat{\mathbf{X}}_T)\big)\big\|_1, \quad (8)$$

where r is randomly selected from 1 to R (R < T), and sg(·) denotes the stop-gradient operator.

Nevertheless, deploying only L_self can easily lead to trivial solutions, as the final output X̂_T is not subject to any constraint. To stabilize the training process, we suggest an exponential moving average (EMA) regularization loss, which constrains the output X̂_T of the current iteration to not stray too far from that of previous ones. It can be written as

$$\mathcal{L}_{ema} = \big\|\mathcal{T}(\hat{\mathbf{X}}_T) - \mathcal{T}\big(\mathrm{sg}(\hat{\mathbf{X}}_T^{ema})\big)\big\|_1, \quad (9)$$

where $\hat{\mathbf{X}}_T^{ema} = \mathcal{B}(\{\mathbf{Y}_i^c\}_{i=1}^{T}; \Theta_{\mathcal{B}}^{ema})$ and Θ^{ema}_B denotes the EMA parameters in the current iteration. Denoting the model parameters in the k-th iteration by Θ_{B_k}, the EMA parameters in the k-th iteration can be written as

$$\Theta_{\mathcal{B}_k}^{ema} = a\,\Theta_{\mathcal{B}_{k-1}}^{ema} + (1-a)\,\Theta_{\mathcal{B}_k}, \quad (10)$$

where Θ^{ema}_{B_0} = Θ_{B_0} and a = 0.999. The total adaptation loss is the combination of L_ema and L_self, i.e.,

$$\mathcal{L}_{ada} = \mathcal{L}_{ema} + \lambda_{self}\,\mathcal{L}_{self}, \quad (11)$$

where λ_self is the weight of L_self.
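The adaptation objective can be sketched as follows, reusing `mu_law_tonemap` from the earlier snippet. We assume the model accepts a variable-length list of pre-processed frames and that the EMA model is a frozen deep copy of the online model; `detach()` plays the role of the stop-gradient sg(·).

```python
import torch

@torch.no_grad()
def ema_update(model, ema_model, a=0.999):
    """Eq. (10): Θ^ema_k = a·Θ^ema_{k-1} + (1 − a)·Θ_k, applied each iteration."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(a).add_(p, alpha=1 - a)

def adaptation_loss(model, ema_model, inputs, R=3, lam_self=1.0):
    """Eqs. (8)-(11). `inputs` is the list of T pre-processed frames {Y_i^c}.
    Initialize beforehand with: ema_model = copy.deepcopy(model)  (Θ^ema_0 = Θ_0)."""
    r = torch.randint(1, R + 1, (1,)).item()  # r drawn uniformly from {1, ..., R}, R < T
    x_full = model(inputs)                    # X̂_T from all T frames
    x_partial = model(inputs[:r])             # X̂_r from the first r frames
    with torch.no_grad():
        x_ema = ema_model(inputs)             # X̂_T^ema from the EMA parameters
    l_self = (mu_law_tonemap(x_partial) - mu_law_tonemap(x_full.detach())).abs().mean()
    l_ema = (mu_law_tonemap(x_full) - mu_law_tonemap(x_ema)).abs().mean()
    return l_ema + lam_self * l_self          # Eq. (11), with λ_self = 1 as in Sec. 5.1
```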
4 DATASETS

4.1 SYNTHETIC PAIRED DATASET

Although it is unrealistic to synthesize perfect multi-exposure images, we should still narrow the gap with real images as much as possible. In the camera's imaging model in Eq. (1), the noise, blur, motion, and dynamic range of multi-exposure images should be carefully designed. Video provides a better basis than a single image for simulating the motion and blur of multi-exposure images. We start with the HDR videos from Froehlich et al. (2014)^1 to construct the simulation pipeline. First, we follow the suggestion from Nah et al. (2019) to perform frame interpolation, as these low-frame-rate (~25 fps) videos are unsuitable for synthesizing blur; RIFE (Huang et al., 2022) is adopted to increase the frame rate by 32 times. Then, we convert these RGB videos to raw space with the Bayer pattern according to UPI (Brooks et al., 2019), getting HDR raw sequences {V_m}_{m=1}^M. The first frame V_1 is taken as the GT.

Next, we utilize {V_m}_{m=1}^M and introduce degradations to construct multi-exposure images. The process mainly includes the following 5 steps. (1) Bicubic 4× down-sampling is applied to obtain low-resolution images, which is optional and serves the BracketIRE+ task. (2) The video is split into T non-overlapping groups, where the i-th group is used to synthesize Y_i. Such grouping utilizes the motion in the video itself to simulate motion between the T multi-exposure images. (3) Denote the exposure time ratio between Y_i and Y_{i−1} by S. We sequentially move S^{i−1} (the (i−1)-th power of S) consecutive images into the above i-th group, and sum them up to simulate blurry images. (4) We transform the HDR blurry images into low dynamic range (LDR) ones by cropping values outside the specified range and mapping the cropped values to 10-bit unsigned integers. (5) We add heteroscedastic Gaussian noise (Brooks et al., 2019; Wang et al., 2020; Hasinoff et al., 2010) to the LDR images to generate the final multi-exposure images (i.e., {Y_i}_{i=1}^T). The noise variance is a function of pixel intensity, whose parameters are estimated from the real-world images captured in Sec. 4.2. More details of the RGB-to-raw conversion, frame interpolation, and noise can be seen in Appendix A.1, Appendix A.2, and Appendix A.3, respectively. A sketch of steps (2)-(5) is given after this subsection.

Besides, we set the exposure time ratio S to 4 and the frame number T to 5, as this covers most of the dynamic range with fewer images. The GT has a resolution of 1,920 × 1,080 pixels. Finally, we obtain 1,335 data pairs from 35 scenes: 1,045 pairs from 31 scenes are used for training, and the remaining 290 pairs from the other 4 scenes are used for testing.

^1 The dataset is licensed under CC BY and is publicly available at the site.
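Here is a minimal NumPy sketch of the multi-exposure synthesis in steps (2)-(5), assuming an already-interpolated, temporally ordered HDR raw sequence stored as a list of arrays. The [0, 1] clipping range, scaling, and names are our assumptions; the noise parameters are the ISO-1,600 values reported in Appendix A.3.

```python
import numpy as np

def synthesize_exposures(frames, T=5, S=4, bit_depth=10,
                         lam_shot=2.42e-3, lam_read=1.79e-5):
    """Steps (2)-(5): group frames, sum S^(i-1) consecutive frames to simulate
    blur at longer exposures, quantize to 10-bit LDR, and add noise (Eq. A).

    frames: list of HDR raw frames (H, W), temporally ordered.
    """
    max_val = 2 ** bit_depth - 1
    exposures, cursor = [], 0
    for i in range(T):
        n = S ** i                                   # frames summed for Y_{i+1}: S^(i-1) in 1-based indexing
        group = np.stack(frames[cursor:cursor + n])  # i-th non-overlapping group (inter-frame motion)
        cursor += n
        hdr = group.sum(axis=0)                      # summing simulates blur and the longer exposure
        ldr = np.clip(hdr, 0, 1)                     # crop out-of-range values (range is our assumption)
        var = lam_read + lam_shot * ldr              # heteroscedastic Gaussian noise variance (Eq. A)
        noisy = ldr + np.sqrt(var) * np.random.randn(*ldr.shape)
        exposures.append(np.clip(np.round(noisy * max_val), 0, max_val).astype(np.uint16))
    return exposures
```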
4.2 REAL-WORLD DATASET

Real-world multi-exposure images are collected with the main camera of a Xiaomi 10S smartphone at night. Specifically, we utilize the bracketing photography function of the ProShot (Games, 2023) application (APP) to capture raw images with a resolution of 6,016 × 4,512 pixels. The exposure time ratio S is set to 4, the frame number T to 5, and the ISO to 1,600; these values are also the maximum available settings in the APP. The exposure time of the medium-exposure image (i.e., Y_3) is automatically adjusted by the APP, and the other exposures are then determined by S. It is worth noting that we hold the smartphone by hand for shooting, without any stabilizing device, which aims to bring in realistic hand-held shake. Besides, both static and dynamic scenes are collected, 200 in total: 100 scenes are used for training and the other 100 for evaluation.

5 EXPERIMENTS

5.1 IMPLEMENTATION DETAILS

Network Details. The input multi-exposure images and the ground-truth HDR image are both 4-channel data packed from raw images with the Bayer pattern. Following the settings in RBSR (Wu et al., 2023), the encoder and reconstruction module consist of 5 residual blocks (He et al., 2016), and the alignment module adopts the flow-guided deformable approach (Chan et al., 2022). Besides, the total number of residual blocks in the aggregation module remains the same as that of RBSR (Wu et al., 2023), i.e., 40, where the common module has 16 and the specific one has 24. For the BracketIRE+ task, we additionally deploy PixelShuffle (Shi et al., 2016) at the end of the network for up-sampling features.

Figure 3: Visual comparison on the synthetic dataset of the BracketIRE task. Our method restores sharper edges and clearer details.

Training Details. We randomly crop patches and augment them with flips and rotations. The batch size is set to 8. The input patch size is 128 × 128 and 64 × 64 for the BracketIRE and BracketIRE+ tasks, respectively. We adopt the AdamW (Loshchilov & Hutter, 2017) optimizer with β1 = 0.9 and β2 = 0.999. Models are trained for 400 epochs (~60 hours) on synthetic images and fine-tuned for 10 epochs (~2.6 hours) on real-world ones, with initial learning rates of 10^{-4} and 7.5 × 10^{-5}, respectively. A cosine annealing strategy (Loshchilov & Hutter, 2016) is employed to decrease the learning rates to 10^{-6}. r is randomly selected from {1, 2, 3}. λ_self is set to 1. Moreover, BracketIRE+ models are initialized with the BracketIRE models pre-trained in the synthetic experiments. All experiments are conducted using PyTorch (Paszke et al., 2019) on a single Nvidia RTX A6000 (48GB) GPU.

5.2 EVALUATION AND COMPARISON CONFIGURATIONS

Evaluation Configurations. For quantitative evaluations and visualizations, we first convert raw results to linear RGB space through a post-processing pipeline and then tone-map them with Eq. (5), getting 16-bit RGB images. All metrics are computed on these RGB images. For synthetic experiments, we adopt the PSNR, SSIM (Wang et al., 2004), and LPIPS (Zhang et al., 2018d) metrics; 10 and 4 invalid pixels around the original input image are excluded for the BracketIRE and BracketIRE+ tasks, respectively (kindly refer to Appendix A.4 for the reason). For real-world experiments, we employ the no-reference metrics CLIPIQA (Wang et al., 2023a) and MANIQA (Yang et al., 2022).

Comparison Configurations. We compare the proposed method with 10 related state-of-the-art networks, including 5 burst processing ones (i.e., DBSR (Bhat et al., 2021a), MFIR (Bhat et al., 2021b), BIPNet (Dudhane et al., 2022), Burstormer (Dudhane et al., 2023), and RBSR (Wu et al., 2023)) and 5 HDR reconstruction ones (i.e., AHDRNet (Yan et al., 2019), HDRGAN (Niu et al., 2021), HDR-Tran. (Liu et al., 2022), SCTNet (Tel et al., 2023), and Kim et al. (Kim & Kim, 2023)). For a fair comparison, we modify their models to accept 5-frame inputs, and retrain them on our synthetic pairs following the formulation in Sec. 3.1. When testing on real-world images, their trained models are deployed directly, while our models are fine-tuned on the real-world training images with the proposed self-supervised adaptation method.

5.3 EXPERIMENTAL RESULTS

Results on Synthetic Dataset. We summarize the quantitative results in Tab. 2. On the BracketIRE task, we achieve 0.25dB and 0.26dB PSNR gains over RBSR (Wu et al., 2023) and Kim et al. (Kim & Kim, 2023), respectively, which are the latest state-of-the-art methods. On the BracketIRE+ task, the improvements are 0.16dB and 0.37dB, respectively. This demonstrates the effectiveness of our TMRNet, which handles the varying degradations of multi-exposure images by deploying frame-specific parameters. Moreover, the qualitative results in Fig. 3 show that TMRNet recovers more realistic details than the others.
Results on Real-World Dataset. We achieve the best no-reference scores on the BracketIRE task and the highest CLIPIQA (Wang et al., 2023a) on the BracketIRE+ task. Note, however, that the no-reference metrics are not completely stable and are only used for auxiliary evaluation; the actual visual results better demonstrate the effect of different methods. As shown in Fig. 4, applying other models trained on synthetic data to the real world easily produces undesirable artifacts. Benefiting from the proposed self-supervised real-image adaptation, our results have fewer artifacts and more satisfactory content. More visual comparisons can be seen in Appendix J.

Figure 4: Visual comparison on the real-world dataset of the BracketIRE task. Note that there is no ground-truth. Our results have fewer ghosting artifacts.

Table 2: Quantitative comparison with state-of-the-art methods on the synthetic and real-world datasets of the BracketIRE and BracketIRE+ tasks. 'Ada.' means the self-supervised real-image adaptation. The best result in each column is marked in bold.

| Method | BracketIRE, Synthetic (PSNR↑/SSIM↑/LPIPS↓) | BracketIRE, Real-World (CLIPIQA↑/MANIQA↑) | BracketIRE+, Synthetic (PSNR↑/SSIM↑/LPIPS↓) | BracketIRE+, Real-World (CLIPIQA↑/MANIQA↑) |
|---|---|---|---|---|
| *Burst processing networks* | | | | |
| DBSR | 35.13/0.9092/0.188 | 0.1359/0.1653 | 29.79/0.8546/0.335 | 0.3340/0.2911 |
| MFIR | 35.64/0.9161/0.177 | 0.2192/0.2310 | 30.06/0.8591/0.319 | 0.3402/0.2908 |
| BIPNet | 36.92/0.9331/0.148 | 0.2234/0.2348 | 30.02/0.8582/0.324 | 0.3577/0.2979 |
| Burstormer | 37.06/0.9344/0.151 | 0.2399/0.2390 | 29.99/0.8617/0.300 | 0.3549/0.3060 |
| RBSR | 39.10/0.9498/0.117 | 0.2074/0.2341 | 30.49/0.8713/0.275 | 0.3425/0.2895 |
| *HDR reconstruction networks* | | | | |
| AHDRNet | 36.68/0.9279/0.158 | 0.2010/0.2259 | 29.86/0.8589/0.308 | 0.3382/0.2909 |
| HDRGAN | 35.94/0.9177/0.181 | 0.1995/0.2178 | 30.00/0.8590/0.337 | 0.3555/**0.3109** |
| HDR-Tran. | 37.62/0.9356/0.129 | 0.2043/0.2142 | 30.18/0.8662/0.279 | 0.3245/0.2933 |
| SCTNet | 37.47/0.9443/0.122 | 0.2348/0.2260 | 30.13/0.8644/0.281 | 0.3415/0.2936 |
| Kim et al. | 39.09/0.9494/0.115 | 0.2467/0.2388 | 30.28/0.8658/**0.268** | 0.3302/0.2954 |
| *Our TMRNet* | | | | |
| w/o Ada. | **39.35**/**0.9516**/**0.112** | 0.2003/0.2181 | **30.65**/**0.8725**/0.270 | 0.3422/0.2898 |
| w/ Ada. | - | **0.2537**/**0.2422** | - | **0.3676**/0.3020 |

Inference Time. Our method has a similar inference time to RBSR (Wu et al., 2023), and a shorter time than recent state-of-the-art ones, i.e., BIPNet (Dudhane et al., 2022), Burstormer (Dudhane et al., 2023), HDR-Tran. (Liu et al., 2022), SCTNet (Tel et al., 2023), and Kim et al. (Kim & Kim, 2023). Overall, our method maintains good efficiency while improving performance compared with recent state-of-the-art methods. Detailed comparisons can be seen in Appendix B.

6 ABLATION STUDY

6.1 EFFECT OF NUMBER OF INPUT FRAMES

To validate the effect of the number of input frames, we conduct experiments by removing the relatively higher-exposure frames one by one, as shown in Tab. 3. Naturally, more frames result in better performance. On the other hand, adding images with longer exposures leads to exponential increases in shooting time, and the higher the exposure time, the less valuable content in the image. Considering these two aspects, we only adopt 5 frames. Furthermore, we conduct experiments with more combinations of multi-exposure images in Appendix G.
6.2 EFFECT OF TMRNET

We change the depths of the common and specific modules to explore the effect of temporal modulation in TMRNet. For a fair comparison, we keep the total depth the same. From Tab. 4, taking only common modules or only specific ones does not achieve satisfactory results, as the former ignores the degradation differences of multi-exposure images while the latter may be difficult to optimize. Allocating appropriate depths to both modules performs better. In addition, we also conduct experiments changing the depths of the two modules independently in Appendix H.

Figure 5: Effect of self-supervised real-image adaptation. Our results have fewer ghosting artifacts and more details in the areas indicated by the red arrow. Please zoom in for better observation.

Table 3: Effect of the number of input multi-exposure frames.

| Input | BracketIRE (PSNR↑/SSIM↑/LPIPS↓) | BracketIRE+ (PSNR↑/SSIM↑/LPIPS↓) |
|---|---|---|
| Y_1 | 29.64/0.8235/0.340 | 25.13/0.7289/0.466 |
| {Y_i}_{i=1}^2 | 33.93/0.8923/0.234 | 27.99/0.8003/0.390 |
| {Y_i}_{i=1}^3 | 36.98/0.9294/0.165 | 29.70/0.8446/0.324 |
| {Y_i}_{i=1}^4 | 38.70/0.9460/0.127 | 30.41/0.8645/0.286 |
| {Y_i}_{i=1}^5 | **39.35**/**0.9516**/**0.112** | **30.65**/**0.8725**/**0.270** |

Table 4: Effect of the numbers (i.e., α_c and α_s) of common and specific blocks.

| α_c | α_s | BracketIRE (PSNR↑/SSIM↑/LPIPS↓) | BracketIRE+ (PSNR↑/SSIM↑/LPIPS↓) |
|---|---|---|---|
| 0 | 40 | 38.96/0.9491/0.120 | 30.41/0.8700/0.276 |
| 8 | 32 | 39.26/0.9512/0.115 | **30.70**/0.8721/**0.270** |
| 16 | 24 | **39.35**/**0.9516**/**0.112** | 30.65/**0.8725**/**0.270** |
| 24 | 16 | 39.10/0.9497/0.117 | 30.59/0.8713/0.271 |
| 32 | 8 | 39.16/0.9500/0.117 | 30.59/0.8722/0.275 |
| 40 | 0 | 39.10/0.9498/0.117 | 30.49/0.8713/0.275 |

6.3 EFFECT OF SELF-SUPERVISED ADAPTATION

We regard TMRNet trained on synthetic pairs as a baseline to validate the effectiveness of the proposed adaptation method on the BracketIRE task. From the visual comparisons in Fig. 5, the adaptation method reduces artifacts significantly and enhances some details. In terms of quantitative metrics, it improves CLIPIQA (Wang et al., 2023a) and MANIQA (Yang et al., 2022) from 0.2003 and 0.2181 to 0.2537 and 0.2422, respectively. Please kindly refer to Appendix I for more results.

7 CONCLUSION

Existing multi-image processing methods typically focus exclusively on either restoration or enhancement, which is insufficient for obtaining visually appealing images with clear content in low-light conditions. Motivated by the complementary potential of multi-exposure images in denoising, deblurring, HDR reconstruction, and SR, we utilize exposure bracketing photography to combine these tasks to get a high-quality image. Specifically, we suggested a solution that initially pre-trains the model with synthetic pairs and subsequently adapts it to unlabeled real-world images, where a temporally modulated recurrent network and a self-supervised adaptation method are presented. Moreover, we constructed a data simulation pipeline for synthesizing pairs and collected real-world images from 200 nighttime scenarios. Experiments on both datasets show that our method achieves better results than the state-of-the-art.

8 APPLICATIONS AND LIMITATIONS

Applications. A significant application of this work is HDR imaging at night, especially in dynamic environments, aiming to obtain noise-free, blur-free, and HDR images.
Such images can clearly show both bright and dark details in nighttime scenes. The application is not only challenging but also practically valuable. We also experiment with it on a smartphone (i.e., Xiaomi 10S), as shown in Figs. G and H.

Limitations. Given the diverse imaging characteristics (especially noise model parameters) of various sensors, our method necessitates tailored training for each sensor. In other words, our model trained on images from one sensor may exhibit limited generalization ability when applied to other sensors. We leave the investigation of a more general model for future work.

ACKNOWLEDGMENTS

This work was supported by the National Key R&D Program of China (2022YFA1004100) and the National Natural Science Foundation of China (NSFC) under Grant U22B2035.

REFERENCES

Abdelrahman Abdelhamed, Mahmoud Afifi, Radu Timofte, and Michael S Brown. Ntire 2020 challenge on real image denoising: Dataset, methods and results. In CVPR Workshops, 2020.

Miika Aittala and Frédo Durand. Burst image deblurring using permutation invariant convolutional neural networks. In ECCV, 2018.

Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Deep burst super-resolution. In CVPR, 2021a.

Goutam Bhat, Martin Danelljan, Fisher Yu, Luc Van Gool, and Radu Timofte. Deep reparametrization of multi-frame super-resolution and denoising. In CVPR, 2021b.

Goutam Bhat, Martin Danelljan, Radu Timofte, Yizhen Cao, Yuntian Cao, Meiya Chen, Xihao Chen, Shen Cheng, Akshay Dudhane, Haoqiang Fan, et al. Ntire 2022 burst super-resolution challenge. In CVPR Workshops, 2022.

Goutam Bhat, Michaël Gharbi, Jiawen Chen, Luc Van Gool, and Zhihao Xia. Self-supervised burst super-resolution. In ICCV, 2023.

Tim Brooks, Ben Mildenhall, Tianfan Xue, Jiawen Chen, Dillon Sharlet, and Jonathan T Barron. Unprocessing images for learned raw denoising. In CVPR, 2019.

Kelvin CK Chan, Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Basicvsr: The search for essential components in video super-resolution and beyond. In CVPR, 2021.

Kelvin CK Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Basicvsr++: Improving video super-resolution with enhanced propagation and alignment. In CVPR, 2022.

Meng Chang, Huajun Feng, Zhihai Xu, and Qi Li. Low-light image restoration with short- and long-exposure raw pairs. IEEE TMM, 2021.

Su-Kai Chen, Hung-Lin Yen, Yu-Lun Liu, Min-Hung Chen, Hou-Ning Hu, Wen-Hsiao Peng, and Yen-Yu Lin. Learning continuous exposure value representations for single-image hdr reconstruction. In ICCV, 2023.

Yiheng Chi, Xingguang Zhang, and Stanley H Chan. Hdr imaging with spatially varying signal-to-noise ratios. In CVPR, 2023.

Sung-Jin Cho, Seo-Won Ji, Jun-Pyo Hong, Seung-Won Jung, and Sung-Jea Ko. Rethinking coarse-to-fine approach in single image deblurring. In ICCV, 2021.

Yuning Cui, Syed Waqas Zamir, Salman Khan, Alois Knoll, Mubarak Shah, and Fahad Shahbaz Khan. Adair: Adaptive all-in-one image restoration via frequency mining and modulation. arXiv preprint arXiv:2403.14614, 2024.

Omer Dahary, Matan Jacoby, and Alex M Bronstein. Digital gimbal: End-to-end deep image stabilization with learnable exposure times. In CVPR, 2021.

Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, 2017.
Michel Deudon, Alfredo Kalaitzis, Israel Goytom, Md Rifat Arefin, Zhichao Lin, Kris Sankaran, Vincent Michalski, Samira E Kahou, Julien Cornebise, and Yoshua Bengio. Highres-net: Recursive fusion for multi-frame super-resolution of satellite imagery. arXiv preprint arXiv:2002.06460, 2020.

Valéry Dewil, Jérémy Anger, Axel Davy, Thibaud Ehret, Gabriele Facciolo, and Pablo Arias. Self-supervised training for blind multi-frame video denoising. In WACV, 2021.

Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. TPAMI, 2015.

Akshay Dudhane, Syed Waqas Zamir, Salman Khan, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Burst image restoration and enhancement. In CVPR, 2022.

Akshay Dudhane, Syed Waqas Zamir, Salman Khan, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Burstormer: Burst image restoration and enhancement transformer. In CVPR, 2023.

Thibaud Ehret, Axel Davy, Jean-Michel Morel, Gabriele Facciolo, and Pablo Arias. Model-blind video denoising via frame-to-frame training. In CVPR, 2019.

Gabriel Eilertsen, Joel Kronander, Gyorgy Denes, Rafał K Mantiuk, and Jonas Unger. Hdr image reconstruction from a single exposure using deep cnns. ACM TOG, 2017.

Manfred Ernst and Bartlomiej Wronski. Hdr+ with bracketing on pixel phones, 2021. https://blog.research.google/2021/04/hdr-with-bracketing-on-pixel-phones.html.

Jan Froehlich, Stefan Grandinetti, Bernd Eberhardt, Simon Walter, Andreas Schilling, and Harald Brendel. Creating cinematic wide gamut hdr-video for the evaluation of tone mapping operators and hdr-displays. In Digital Photography X, 2014.

Rise Up Games. Proshot, 2023. https://www.riseupgames.com/proshot.

Clément Godard, Kevin Matzen, and Matt Uyttendaele. Deep burst denoising. In ECCV, 2018.

Shi Guo, Zifei Yan, Kai Zhang, Wangmeng Zuo, and Lei Zhang. Toward convolutional blind denoising of real photographs. In CVPR, 2019.

Shi Guo, Xi Yang, Jianqi Ma, Gaofeng Ren, and Lei Zhang. A differentiable two-stage alignment scheme for burst image reconstruction with large shift. In CVPR, 2022.

Samuel W Hasinoff, Frédo Durand, and William T Freeman. Noise-optimal capture for high dynamic range photography. In CVPR, 2010.

Samuel W Hasinoff, Dillon Sharlet, Ryan Geiss, Andrew Adams, Jonathan T Barron, Florian Kainz, Jiawen Chen, and Marc Levoy. Burst photography for high dynamic range and low-light imaging on mobile cameras. ACM TOG, 2016.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Zhewei Huang, Tianyuan Zhang, Wen Heng, Boxin Shi, and Shuchang Zhou. Real-time intermediate flow estimation for video frame interpolation. In ECCV, 2022.

Siddhant Jain, Daniel Watson, Eric Tabellion, Ben Poole, Janne Kontkanen, et al. Video interpolation with diffusion models. In CVPR, 2024.

Nima Khademi Kalantari, Ravi Ramamoorthi, et al. Deep high dynamic range imaging of dynamic scenes. ACM TOG, 2017.

Jungwoo Kim and Min H Kim. Joint demosaicing and deghosting of time-varying exposures for single-shot hdr imaging. In ICCV, 2023.

Alexander Krull, Tim-Oliver Buchholz, and Florian Jug. Noise2void - learning denoising from single noisy images. In CVPR, 2019.

Wei-Sheng Lai, Yichang Shih, Lun-Cheng Chu, Xiaotong Wu, Sung-Fang Tsai, Michael Krainin, Deqing Sun, and Chia-Kai Liang. Face deblurring using dual camera fusion on mobile phones. ACM TOG, 2022.
Samuli Laine, Tero Karras, Jaakko Lehtinen, and Timo Aila. High-quality self-supervised deep image denoising. NeurIPS, 2019.

Bruno Lecouat, Jean Ponce, and Julien Mairal. Lucas-kanade reloaded: End-to-end super-resolution from raw image bursts. In ICCV, 2021.

Bruno Lecouat, Thomas Eboli, Jean Ponce, and Julien Mairal. High dynamic range and super-resolution from raw image bursts. ACM TOG, 2022.

Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.

Siyeong Lee, Gwon Hwan An, and Suk-Ju Kang. Deep recursive hdri: Inverse tone mapping using generative adversarial networks. In ECCV, 2018.

Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2noise: Learning image restoration without clean data. In ICML, 2018.

Yawei Li, Yulun Zhang, Radu Timofte, Luc Van Gool, Zhijun Tu, Kunpeng Du, Hailing Wang, Hanting Chen, Wei Li, Xiaofei Wang, et al. Ntire 2023 challenge on image denoising: Methods and results. In CVPR Workshops, 2023.

Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In ICCV, 2021.

Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In CVPR Workshops, 2017.

Ming Liu, Zhilu Zhang, Liya Hou, Wangmeng Zuo, and Lei Zhang. Deep adaptive inference networks for single image super-resolution. In ECCV Workshops, 2020a.

Shuaizheng Liu, Xindong Zhang, Lingchen Sun, Zhetong Liang, Hui Zeng, and Lei Zhang. Joint hdr denoising and fusion: A real-world mobile hdr image dataset. In CVPR, 2023.

Yu-Lun Liu, Wei-Sheng Lai, Yu-Sheng Chen, Yi-Lung Kao, Ming-Hsuan Yang, Yung-Yu Chuang, and Jia-Bin Huang. Single-image hdr reconstruction by learning to reverse the camera pipeline. In CVPR, 2020b.

Zhen Liu, Yinglong Wang, Bing Zeng, and Shuaicheng Liu. Ghost-free high dynamic range imaging with context-aware transformer. In ECCV, 2022.

Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv:1608.03983, 2016.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv:1711.05101, 2017.

Ziwei Luo, Youwei Li, Shen Cheng, Lei Yu, Qi Wu, Zhihong Wen, Haoqiang Fan, Jian Sun, and Shuaicheng Liu. Bsrt: Improving burst super-resolution with swin transformer and flow-guided deformable alignment. In CVPR, 2022.

Xintian Mao, Yiming Liu, Fengze Liu, Qingli Li, Wei Shen, and Yan Wang. Intriguing findings of frequency selection for image deblurring. In AAAI, 2023.

Nancy Mehta, Akshay Dudhane, Subrahmanyam Murala, Syed Waqas Zamir, and Khan. Gated multi-resolution transfer network for burst restoration and enhancement. In CVPR, 2023.

Ben Mildenhall, Jonathan T Barron, Jiawen Chen, Dillon Sharlet, Ren Ng, and Robert Carroll. Burst denoising with kernel prediction networks. In CVPR, 2018.

Janne Mustaniemi, Juho Kannala, Jiri Matas, Simo Särkkä, and Janne Heikkilä. Lsd2 - joint denoising and deblurring of short and long exposure images with cnns. In BMVC, 2020.

Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In CVPR, 2017.
Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee. Ntire 2019 challenge on video deblurring and super-resolution: Dataset and study. In CVPR Workshops, 2019.

Michal Nazarczuk, Sibi Catley-Chandar, Ales Leonardis, and Eduardo Pérez Pellitero. Self-supervised hdr imaging from motion and exposure cues. arXiv preprint arXiv:2203.12311, 2022.

Yuzhen Niu, Jianbin Wu, Wenxi Liu, Wenzhong Guo, and Rynson WH Lau. Hdr-gan: Hdr image reconstruction from multi-exposed ldr images with large motions. IEEE TIP, 2021.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, et al. Pytorch: An imperative style, high-performance deep learning library. NeurIPS, 2019.

Fidel Alejandro Guerrero Peña, Pedro Diamel Marrero Fernández, Tsang Ing Ren, Jorge de Jesus Gomes Leandro, and Ricardo Massahiro Nishihara. Burst ranking for blind multi-image deblurring. IEEE TIP, 2019.

Eduardo Pérez-Pellitero, Sibi Catley-Chandar, Ales Leonardis, and Radu Timofte. Ntire 2021 challenge on high dynamic range imaging: Dataset, methods and results. In CVPR Workshops, 2021.

K Ram Prabhakar, Rajat Arora, Adhitya Swaminathan, Kunal Pratap Singh, and R Venkatesh Babu. A fast, scalable, and reliable deghosting method for extreme exposure fusion. In ICCP, 2019.

K Ram Prabhakar, Gowtham Senthil, Susmit Agrawal, R Venkatesh Babu, and Rama Krishna Sai S Gorthi. Labeled from unlabeled: Exploiting unlabeled data for few-shot deep hdr deghosting. In CVPR, 2021.

Anurag Ranjan and Michael J Black. Optical flow estimation using a spatial pyramid network. In CVPR, 2017.

Xuejian Rong, Denis Demandolx, Kevin Matzen, Priyam Chatterjee, and Yingli Tian. Burst denoising via temporally shifted wavelet transforms. In ECCV, 2020.

Shayan Shekarforoush, Amanpreet Walia, Marcus A Brubaker, Konstantinos G Derpanis, and Alex Levinshtein. Dual-camera joint deblurring-denoising. arXiv preprint arXiv:2309.08826, 2023.

Dev Yashpal Sheth, Sreyas Mohan, Joshua L Vincent, Ramon Manzorro, Peter A Crozier, Mitesh M Khapra, Eero P Simoncelli, and Carlos Fernandez-Granda. Unsupervised deep video denoising. In ICCV, 2021.

Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016.

Jou Won Song, Ye-In Park, Kyeongbo Kong, Jaeho Kwak, and Suk-Ju Kang. Selective transhdr: Transformer-based selective hdr imaging using ghost region mask. In ECCV, 2022.

Xiao Tan, Huaian Chen, Kai Xu, Yi Jin, and Changan Zhu. Deep sr-hdr: Joint learning of super-resolution and high dynamic range imaging for dynamic scenes. IEEE TMM, 2021.

Xin Tao, Hongyun Gao, Xiaoyong Shen, Jue Wang, and Jiaya Jia. Scale-recurrent network for deep image deblurring. In CVPR, 2018.

Steven Tel, Zongwei Wu, Yulun Zhang, Barthélémy Heyrman, Cédric Demonceaux, Radu Timofte, and Dominique Ginhac. Alignment-free hdr deghosting with semantics consistent transformer. In ICCV, 2023.

Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. In AAAI, 2023a.

Ruohao Wang, Xiaohui Liu, Zhilu Zhang, Xiaohe Wu, Chun-Mei Feng, Lei Zhang, and Wangmeng Zuo. Benchmark dataset and effective inter-frame alignment for real-world video super-resolution. In CVPR Workshops, 2023b.
Tengfei Wang, Jiaxin Xie, Wenxiu Sun, Qiong Yan, and Qifeng Chen. Dual-camera super-resolution with aligned attention modules. In ICCV, 2021.

Yuzhi Wang, Haibin Huang, Qin Xu, Jiaming Liu, Yiqun Liu, and Jue Wang. Practical deep raw image denoising on mobile devices. In ECCV, 2020.

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE TIP, 2004.

Zichun Wang, Yulun Zhang, Debing Zhang, and Ying Fu. Recurrent self-supervised video denoising with denser receptive field. In ACM MM, 2023c.

Kaixuan Wei, Ying Fu, Jiaolong Yang, and Hua Huang. A physics-based noise formation model for extreme low-light raw denoising. In CVPR, 2020.

Pengxu Wei, Yujing Sun, Xingbei Guo, Chang Liu, Guanbin Li, Jie Chen, Xiangyang Ji, and Liang Lin. Towards real-world burst image super-resolution: Benchmark and method. In ICCV, 2023.

Patrick Wieschollek, Bernhard Schölkopf, Hendrik PA Lensch, and Michael Hirsch. End-to-end learning for image burst deblurring. In ACCV, 2017.

Bartlomiej Wronski, Ignacio Garcia-Dorado, Manfred Ernst, Damien Kelly, Michael Krainin, Chia-Kai Liang, Marc Levoy, and Peyman Milanfar. Handheld multi-frame super-resolution. ACM TOG, 2019.

Renlong Wu, Zhilu Zhang, Shuohao Zhang, Hongzhi Zhang, and Wangmeng Zuo. Rbsr: Efficient and flexible recurrent network for burst super-resolution. In PRCV, 2023.

Shangzhe Wu, Jiarui Xu, Yu-Wing Tai, and Chi-Keung Tang. Deep high dynamic range imaging with large foreground motions. In ECCV, 2018.

Xiaohe Wu, Ming Liu, Yue Cao, Dongwei Ren, and Wangmeng Zuo. Unpaired learning of deep image denoising. In ECCV, 2020.

Zhihao Xia, Federico Perazzi, Michaël Gharbi, Kalyan Sunkavalli, and Ayan Chakrabarti. Basis prediction networks for effective burst denoising with large kernels. In CVPR, 2020.

Ruikang Xu, Mingde Yao, and Zhiwei Xiong. Zero-shot dual-lens super-resolution. In CVPR, 2023.

Qingsen Yan, Dong Gong, Qinfeng Shi, Anton van den Hengel, Chunhua Shen, Ian Reid, and Yanning Zhang. Attention-guided network for ghost-free high dynamic range imaging. In CVPR, 2019.

Qingsen Yan, Weiye Chen, Song Zhang, Yu Zhu, Jinqiu Sun, and Yanning Zhang. A unified hdr imaging method with pixel and patch level. In CVPR, 2023a.

Qingsen Yan, Song Zhang, Weiye Chen, Hao Tang, Yu Zhu, Jinqiu Sun, Luc Van Gool, and Yanning Zhang. Smae: Few-shot learning for hdr deghosting with saturation-aware masked autoencoders. In CVPR, 2023b.

Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. Maniqa: Multi-dimension attention network for no-reference image quality assessment. In CVPR, 2022.

Lu Yuan, Jian Sun, Long Quan, and Heung-Yeung Shum. Image deblurring with blurred/noisy image pairs. In SIGGRAPH, 2007.

Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Cycleisp: Real image restoration via improved data synthesis. In CVPR, 2020.

Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Multi-stage progressive image restoration. In CVPR, 2021.

Jiawei Zhang, Jinshan Pan, Jimmy Ren, Yibing Song, Linchao Bao, Rynson WH Lau, and Ming-Hsuan Yang. Dynamic scene deblurring using spatially variant recurrent neural networks. In CVPR, 2018a.
Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE TIP, 2017.
Kai Zhang, Wangmeng Zuo, and Lei Zhang. Ffdnet: Toward a fast and flexible solution for cnn-based image denoising. IEEE TIP, 2018b.
Kai Zhang, Wangmeng Zuo, and Lei Zhang. Learning a single convolutional super-resolution network for multiple degradations. In CVPR, 2018c.
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018d.
Xu Zhang, Jiaqi Ma, Guoli Wang, Qian Zhang, Huan Zhang, and Lefei Zhang. Perceive-ir: Learning to perceive degradation better for all-in-one image restoration. arXiv preprint arXiv:2408.15994, 2024a.
Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In ECCV, 2018e.
Zhilu Zhang, Ruohao Wang, Hongzhi Zhang, Yunjin Chen, and Wangmeng Zuo. Self-supervised learning for real-world super-resolution from dual zoomed observations. In ECCV, 2022a.
Zhilu Zhang, Rongjian Xu, Ming Liu, Zifei Yan, and Wangmeng Zuo. Self-supervised image restoration with blurry and noisy pairs. NeurIPS, 2022b.
Zhilu Zhang, Haoyu Wang, Shuai Liu, Xiaotao Wang, Lei Lei, and Wangmeng Zuo. Self-supervised high dynamic range imaging with multi-exposure images in dynamic scenes. In ICLR, 2024b.
Yuzhi Zhao, Yongzhe Xu, Qiong Yan, Dingdong Yang, Xuehui Wang, and Lai-Man Po. D2hnet: Joint denoising and deblurring with hierarchical network for robust night image restoration. In ECCV, 2022.
Zhihang Zhong, Gurunandan Krishnan, Xiao Sun, Yu Qiao, Sizhuo Ma, and Jian Wang. Clearer frames, anytime: Resolving velocity ambiguity in video frame interpolation. In ECCV, 2024.
Yunhao Zou, Chenggang Yan, and Ying Fu. Rawhdr: High dynamic range image reconstruction from a single raw image. In ICCV, 2023.

APPENDIX

The appendix covers the following content:
• More implementation details in Appendix A.
• Comparison of computational costs in Appendix B.
• Results on other datasets in Appendix C.
• Comparison with all-in-one methods in Appendix D.
• Comparison with step-by-step processing in Appendix E.
• Comparison with burst imaging in Appendix F.
• Effect of multi-exposure combinations in Appendix G.
• Effect of TMRNet in Appendix H.
• Effect of self-supervised adaptation in Appendix I.
• More visual comparisons in Appendix J.

A IMPLEMENTATION DETAILS

A.1 RGB-TO-RAW CONVERSION IN DATA SIMULATION PIPELINE

We provide an illustration of the data synthesis pipeline in Fig. A. The original HDR videos (Froehlich et al., 2014) in the synthetic dataset were shot with the Alexa camera, a CMOS-sensor-based motion picture camera made by Arri. UPI (Brooks et al., 2019) is used to convert these RGB videos to raw space. Note that we do not use the camera parameters of the default UPI (Brooks et al., 2019) setting, but instead use the Alexa camera's parameters during the RGB-to-RAW conversion.
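As a rough illustration of what such an RGB-to-RAW conversion involves, the following is a minimal sketch of the UPI-style inversion steps (inverse gamma, inverse color correction, inverse white balance, mosaicing). The matrix and gain values are illustrative placeholders, not the Alexa parameters used in the pipeline, and the 2.2 power curve is only an approximation of the sRGB gamma.

```python
import numpy as np

# Hypothetical placeholder parameters; the actual pipeline uses the Alexa
# camera's metadata, which is not reproduced here.
CCM = np.array([[ 1.7, -0.5, -0.2],
                [-0.3,  1.6, -0.3],
                [ 0.0, -0.6,  1.6]])   # camera-RGB -> sRGB color matrix (example)
WB_GAINS = np.array([2.0, 1.0, 1.6])   # R/G/B white-balance gains (example)

def rgb_to_raw(rgb):
    """UPI-style inversion of a simplified ISP: sRGB image in [0, 1] -> Bayer raw."""
    # 1. Invert gamma (approximate the sRGB curve with a 2.2 power).
    lin = np.clip(rgb, 0.0, 1.0) ** 2.2
    # 2. Invert the color correction matrix (linear sRGB -> camera RGB).
    cam = lin @ np.linalg.inv(CCM).T
    # 3. Invert white balance.
    cam = cam / WB_GAINS
    # 4. Mosaic to an RGGB Bayer pattern.
    h, w, _ = cam.shape
    raw = np.zeros((h, w), dtype=cam.dtype)
    raw[0::2, 0::2] = cam[0::2, 0::2, 0]  # R
    raw[0::2, 1::2] = cam[0::2, 1::2, 1]  # G
    raw[1::2, 0::2] = cam[1::2, 0::2, 1]  # G
    raw[1::2, 1::2] = cam[1::2, 1::2, 2]  # B
    return raw
```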
A.2 FRAME INTERPOLATION IN DATA SIMULATION PIPELINE

RIFE (Huang et al., 2022) is used to interpolate between two frames and does not noticeably affect the data distribution; visual inspection of the interpolation results supports this point. Nonetheless, limited by the ability of RIFE, the synthetic blur still exhibits a gap from real-world blur. In this work, the proposed self-supervised real-image adaptation method serves to alleviate this gap. Besides, recent interpolation models are more capable of handling large motion (Jain et al., 2024) and complex textures (Zhong et al., 2024), so we believe this problem can be further alleviated as interpolation models advance.

A.3 NOISE IN DATA SIMULATION PIPELINE

The noise in raw images is mainly composed of shot and read noise (Brooks et al., 2019). Shot noise can be modeled as a Poisson random variable whose mean is the true light intensity measured in photoelectrons. Read noise can be approximated as a Gaussian random variable with zero mean and fixed variance. Their combination can be approximated by a single heteroscedastic Gaussian random variable N, written as

N \sim \mathcal{N}(0, \lambda_{\text{read}} + \lambda_{\text{shot}} X), (A)

where X is the clean signal value, and λ_read and λ_shot are determined by the sensor's analog and digital gains.

Figure A: Overview of the data simulation pipeline (stages: HDR RGB video clip → frame interpolation → RGB-to-RAW → HDR RAW video clip → 4× down-sampling (optional) → split (simulate motion) → sum (simulate blur) → HDR-to-LDR → add noise → multi-exposure images; the ground truth is taken from the first frame). We utilize an HDR video to synthesize multi-exposure images {Y_i}_{i=1}^T and the corresponding GT image X. S denotes the exposure time ratio between Y_i and Y_{i−1}. Y_i is obtained by summing and processing S^{i−1} (the (i−1)-th power of S) images from the HDR raw video V. Q_i denotes the total number of images from V that participate in constructing {Y_k}_{k=1}^i.

To make our synthetic noise as close as possible to the noise of the collected real-world images, we adopt the noise parameters of the main camera sensor of the Xiaomi 10S smartphone; these parameters (i.e., λ_shot and λ_read) can be found in the metadata of the raw image files. Specifically, the ISO of all captured real-world images is set to 1,600. At this ISO, λ_shot ≈ 2.42 × 10^{-3} and λ_read ≈ 1.79 × 10^{-5}. Moreover, to synthesize noise at various levels, we uniformly sample the parameters over the range from ISO = 800 to ISO = 3,200. Finally, λ_read and λ_shot can be expressed as

\log(\lambda_{\text{shot}}) \sim \mathcal{U}(\log(0.0012), \log(0.0048)),
\log(\lambda_{\text{read}}) \mid \log(\lambda_{\text{shot}}) \sim \mathcal{N}(1.869 \log(\lambda_{\text{shot}}) + 0.3276, 0.3^2), (B)

where \mathcal{U}(a, b) represents a uniform distribution over the interval [a, b].
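The noise synthesis above can be summarized by the following sketch, which samples (λ_shot, λ_read) according to Eq. (B) and applies the heteroscedastic Gaussian model of Eq. (A); here x is assumed to be a normalized clean raw signal.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_noise_params():
    """Sample (lambda_shot, lambda_read) as in Eq. (B), covering ISO 800-3,200."""
    log_shot = rng.uniform(np.log(0.0012), np.log(0.0048))
    log_read = rng.normal(1.869 * log_shot + 0.3276, 0.30)
    return np.exp(log_shot), np.exp(log_read)

def add_noise(x):
    """Add heteroscedastic Gaussian noise to a clean raw signal x, as in Eq. (A)."""
    lam_shot, lam_read = sample_noise_params()
    std = np.sqrt(lam_read + lam_shot * x)            # per-pixel standard deviation
    return x + rng.normal(0.0, 1.0, x.shape) * std    # N ~ N(0, lam_read + lam_shot * x)
```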
A.4 EVALUATION SETTINGS

For the evaluation of the synthetic experiments, 10 and 4 invalid pixels around the original input image are excluded for the BracketIRE and BracketIRE+ tasks, respectively. The reason is as follows: the surrounding 10 pixels of the original HDR videos (Froehlich et al., 2014; https://www.hdm-stuttgart.de/vmlab/hdm-hdr-2014/) are all 0-valued, so the surrounding 10 and 4 pixels of the synthetic input images are all 0-valued for the BracketIRE and BracketIRE+ tasks, respectively. In practice, this situation hardly occurs, so we exclude these pixels during evaluation; nonetheless, the model can handle marginal areas. In an early version of this paper, we used all pixels for evaluation; following suggestions from peers, we changed the evaluation protocol, as the current one better reflects the actual situation.

B COMPARISON OF COMPUTATIONAL COSTS

We compare inference time, the number of FLOPs, and the number of model parameters in Tab. A. We suggest inference time as the main reference for computational cost, as testing time matters more than #FLOPs and #Params in practical applications. Our method runs in a time similar to RBSR (Wu et al., 2023) and is faster than the recent state-of-the-art methods BIPNet (Dudhane et al., 2022), Burstormer (Dudhane et al., 2023), HDR-Tran. (Liu et al., 2022), SCTNet (Tel et al., 2023), and Kim et al. (Kim & Kim, 2023). Overall, our method maintains good efficiency while improving performance compared with recent state-of-the-art methods.

Furthermore, among the recent state-of-the-art methods, BIPNet, RBSR, and our TMRNet are based on convolutional neural networks (CNNs), while Burstormer, HDR-Tran., SCTNet, and Kim et al. are Transformer-based. TMRNet has #FLOPs similar to RBSR, lower than BIPNet, and higher than the Transformer-based methods. We consider this acceptable and understandable for two reasons. First, although Transformer-based methods have an advantage in #FLOPs, they incur higher inference times and are harder to deploy on embedded chips for practical applications. Second, the main idea of TMRNet is to assign specific parameters to each frame while sharing some parameters; we implement this idea on top of a recent CNN-based method (i.e., RBSR), and the basic modules adopt only simple residual blocks. These modules can easily be replaced with #FLOPs-friendly ones, offering great potential for #FLOPs reduction. We plan to experiment with this and provide a lightweight TMRNet in a future version.

Table A: Comparison of #Params and computational costs with state-of-the-art methods when generating a 1920 × 1080 raw image on the BracketIRE task. Note that inference time reflects a method's practical efficiency better than #Params and #FLOPs.

Method | #Params (M) | #FLOPs (G) | Time (ms)
DBSR (Bhat et al., 2021a) | 12.90 | 16,120 | 850
MFIR (Bhat et al., 2021b) | 12.03 | 18,927 | 974
BIPNet (Dudhane et al., 2022) | 6.28 | 135,641 | 6,166
Burstormer (Dudhane et al., 2023) | 3.11 | 9,200 | 2,357
RBSR (Wu et al., 2023) | 5.64 | 19,440 | 1,467
AHDRNet (Yan et al., 2019) | 2.04 | 2,053 | 208
HDRGAN (Niu et al., 2021) | 9.77 | 2,410 | 158
HDR-Tran. (Liu et al., 2022) | 1.69 | 1,710 | 1,897
SCTNet (Tel et al., 2023) | 5.02 | 5,145 | 3,894
Kim et al. (Kim & Kim, 2023) | 22.74 | 5,068 | 1,672
TMRNet (Ours) | 13.29 | 20,040 | 1,425
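For reference, the following is a minimal sketch of a common protocol for measuring GPU inference times like those in Tab. A (warm-up passes, then averaging over synchronized forward passes); the exact measurement setup used for the table may differ.

```python
import time
import torch

@torch.no_grad()
def measure_inference_time(model, inputs, warmup=10, runs=50):
    """Average GPU forward time in milliseconds (assumes a CUDA device)."""
    for _ in range(warmup):           # warm-up excludes one-time setup costs
        model(inputs)
    torch.cuda.synchronize()          # make sure warm-up kernels have finished
    start = time.time()
    for _ in range(runs):
        model(inputs)
    torch.cuda.synchronize()          # wait for all queued kernels before stopping
    return (time.time() - start) / runs * 1000.0
```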
C RESULTS ON OTHER DATASETS

C.1 EFFECTIVENESS OF TMRNET ON OTHER DATASETS

To evaluate the effectiveness of TMRNet on other datasets, we conduct experiments on the Kalantari et al. (Kalantari et al., 2017) dataset for HDR image reconstruction, comparing TMRNet with recent HDR reconstruction methods. The results in Tab. B show that TMRNet achieves the best results. We also conduct experiments on the BurstSR (Bhat et al., 2021a) dataset for burst image super-resolution, comparing TMRNet with recent burst super-resolution methods. The results in Tab. C show that TMRNet again achieves the best results.

Table B: Results on the Kalantari et al. (Kalantari et al., 2017) dataset for HDR image reconstruction.

Method | PSNR↑ / SSIM↑
AHDRNet (Yan et al., 2019) | 41.14 / 0.9702
HDRGAN (Niu et al., 2021) | 41.57 / 0.9865
HDR-Tran. (Liu et al., 2022) | 42.18 / 0.9884
SCTNet (Tel et al., 2023) | 42.29 / 0.9887
Kim et al. (Kim & Kim, 2023) | 41.99 / 0.9890
TMRNet (Ours) | 42.43 / 0.9893

Table C: Results on the BurstSR (Bhat et al., 2021a) dataset for burst image super-resolution.

Method | PSNR↑ / SSIM↑
DBSR (Bhat et al., 2021a) | 48.05 / 0.984
MFIR (Bhat et al., 2021b) | 48.33 / 0.985
BIPNet (Dudhane et al., 2022) | 48.49 / 0.985
Burstormer (Dudhane et al., 2023) | 48.82 / 0.986
RBSR (Wu et al., 2023) | 48.80 / 0.987
TMRNet (Ours) | 48.92 / 0.987

C.2 EFFECTIVENESS OF SELF-SUPERVISED REAL-IMAGE ADAPTATION ON OTHER DATASETS

To evaluate the effectiveness of self-supervised real-image adaptation on other datasets, we conduct an experiment on the burst image super-resolution dataset (Bhat et al., 2021a), which is the only commonly used multi-image processing dataset with both synthetic and real-world data. We first pre-train TMRNet on the synthetic data and then use our self-supervised loss to fine-tune it on the real-world training set. The results in Tab. D show that the self-supervised loss brings a 0.87 dB PSNR gain on the real-world test set, demonstrating its effectiveness.

Table D: Effectiveness of self-supervised real-image adaptation on the BurstSR (Bhat et al., 2021a) dataset for burst image super-resolution.

Method | PSNR↑ / SSIM↑
w/o Self-Supervised Adaptation | 44.70 / 0.9690
w/ Self-Supervised Adaptation | 45.57 / 0.9734

D COMPARISON WITH ALL-IN-ONE METHODS

All-in-one models (Zhang et al., 2024a; Cui et al., 2024) are models that can process images with different degradations. There are three main differences between them and our work. First, all-in-one models generally take a single image as input and output a single image, whereas our model takes multi-exposure images as input and outputs a single image. Second, in all-in-one models, the degradation type can differ across input samples; in our model, the degradation is basically consistent across input samples but differs between the multiple images within one input. Third, all-in-one models rely solely on the capabilities of the model itself to handle multiple tasks, whereas our model can additionally exploit the complementarity of the input images.

We have conducted an experiment with AdaIR (Cui et al., 2024). The original AdaIR model accepts only one image as input. For a fair comparison, we concatenate the multi-exposure images, aligned by an optical flow network (Ranjan & Black, 2017), as the input of AdaIR. Its PSNR is 38.06 dB, lower than our 39.35 dB. We argue that the main reason for AdaIR's poorer performance is that it is not specifically designed for multi-image processing; in contrast, the methods compared in Tab. 2 are all dedicated multi-image processing methods.
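The following sketch illustrates how multi-exposure frames can be aligned by an optical flow network and concatenated into a single tensor for a single-image model, as in the AdaIR experiment above. The flow_net interface and the flow channel convention (channel 0 = horizontal displacement) are assumptions for illustration, not the exact setup used.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (B, C, H, W) with optical flow (B, 2, H, W)."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(img.device)   # (2, H, W), (x, y)
    coords = grid.unsqueeze(0) + flow                            # displaced sampling positions
    # Normalize sampling positions to [-1, 1] for grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    norm = torch.stack((coords_x, coords_y), dim=-1)             # (B, H, W, 2)
    return F.grid_sample(img, norm, align_corners=True)

def make_single_image_input(frames, flow_net):
    """Align every frame to the first one, then concatenate along channels."""
    ref = frames[0]
    aligned = [ref]
    for f in frames[1:]:
        flow = flow_net(ref, f)       # flow from reference to frame f (assumed API)
        aligned.append(warp(f, flow))
    return torch.cat(aligned, dim=1)  # one tensor usable by a single-image model
```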
E COMPARISON WITH STEP-BY-STEP PROCESSING

Here we demonstrate our advantages through an ablation study that compares the joint and progressive processing manners on the BracketIRE task with the synthetic dataset. We denote our multi-task joint processing manner as 'Denoising&Deblurring&HDR'. In the ablation, we decompose the whole task into three steps: (1) denoising, (2) deblurring, and (3) HDR reconstruction, denoted as 'Denoising+Deblurring+HDR'. During training, we construct data pairs and modify TMRNet into a specialized network for each step. The input of each step is all multi-exposure images concatenated together, aiming to exploit the complementarity of multi-exposure images in the denoising, deblurring, and HDR reconstruction tasks, respectively. During inference, we cascade the networks of all steps sequentially.

The results are shown in Tab. E: step-by-step processing is inferior to joint processing. In fact, during step-by-step processing the denoising model performs well; the main reason for the unsatisfactory overall performance is that the deblurring model has limited effect on the severe blur in the long-exposure image, which prevents the HDR reconstruction model from working well. Specifically, in the training phase, the input multi-exposure images of the 'HDR' model are blur-free and noise-free; in the testing phase, however, some blur may remain in the input of the 'HDR' model due to the limited capability of the 'Deblurring' model. Thus, a data gap between training and testing appears for the 'HDR' model and hurts its performance. Joint processing avoids this problem.

Table E: Comparison with step-by-step processing.

Manner | PSNR↑ / SSIM↑ / LPIPS↓
Step-by-Step Processing (3 Steps, Denoising+Deblurring+HDR) | 37.93 / 0.9367 / 0.120
Our Joint Processing (1 Step, Denoising&Deblurring&HDR) | 39.35 / 0.9516 / 0.112

Besides, joint processing produces only one model to deploy, which simplifies the imaging system and makes it easier to deploy in real scenarios. Benefiting from this, the joint processing manner is, as far as we know, also being pursued by some mobile phone manufacturers.
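A minimal sketch of the two inference manners compared in Tab. E is given below; the three specialized networks and their interfaces are hypothetical stand-ins for the modified TMRNet variants described above.

```python
import torch

@torch.no_grad()
def step_by_step(frames, denoiser, deblurrer, hdr_net):
    """'Denoising+Deblurring+HDR': three specialized networks cascaded at test time.
    Each stage takes the (concatenated) multi-exposure frames, per the ablation."""
    x = denoiser(frames)    # (1) remove noise first
    x = deblurrer(x)        # (2) then remove blur; residual blur here hurts the next stage
    return hdr_net(x)       # (3) finally reconstruct HDR from a possibly imperfect input

@torch.no_grad()
def joint(frames, tmrnet):
    """'Denoising&Deblurring&HDR': one model handles all degradations at once,
    avoiding the train/test gap between cascaded stages."""
    return tmrnet(frames)
```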
F COMPARISON WITH BURST IMAGING

To validate the effectiveness of leveraging multi-exposure frames, we compare our method with the burst imaging manner, which employs multiple images with the same exposure. For each exposure time Δt_i, we use our data simulation pipeline to construct 5 burst images {Y_i^b}_{b=1}^5 as inputs. The quantitative results are shown in Tab. F. The models using moderate-exposure bursts (e.g., Y_2 and Y_3) achieve better results, as these bursts strike good trade-offs between noise and blur as well as between over- and under-exposure. Nevertheless, their results are still weaker than ours by a wide margin, mainly due to the limited dynamic range of the input bursts.

Table F: Comparison with the burst processing manner. {Y_i^b}_{b=1}^5 denotes the 5 burst images with exposure time Δt_i.

Input | BracketIRE PSNR↑ / SSIM↑ / LPIPS↓ | BracketIRE+ PSNR↑ / SSIM↑ / LPIPS↓
{Y_1^b}_{b=1}^5 | 32.22 / 0.8606 / 0.271 | 26.89 / 0.7663 / 0.416
{Y_2^b}_{b=1}^5 | 35.05 / 0.9237 / 0.171 | 28.93 / 0.8289 / 0.345
{Y_3^b}_{b=1}^5 | 31.75 / 0.9284 / 0.144 | 28.24 / 0.8581 / 0.302
{Y_4^b}_{b=1}^5 | 26.30 / 0.8853 / 0.215 | 24.46 / 0.8225 / 0.381
{Y_5^b}_{b=1}^5 | 20.04 / 0.8247 / 0.364 | 20.59 / 0.8062 / 0.450
{Y_i}_{i=1}^5 (Ours) | 39.35 / 0.9516 / 0.112 | 30.65 / 0.8725 / 0.270

G EFFECT OF MULTI-EXPOSURE COMBINATIONS

We conduct experiments with different combinations of multi-exposure images in Tab. G. The more frames, the better the results. More generally, we argue that as the number of frames increases, the worst case is that the model extracts no useful information from the added frame; in other words, adding frames does not lead to worse results, but to similar or better ones. In this work, we adopt frame number T = 5 and exposure time ratio S = 4, as this covers most of the dynamic range with few frames. Additionally, without considering shooting and computational costs, it is foreseeable that a larger T or a smaller S would perform better when keeping the overall dynamic range the same.

Table G: Effect of multi-exposure image combinations on the BracketIRE task.

Input | PSNR↑ / SSIM↑ / LPIPS↓
{Y_1, Y_2, Y_3} | 36.98 / 0.9294 / 0.165
{Y_1, Y_3, Y_5} | 37.54 / 0.9388 / 0.146
{Y_2, Y_3, Y_4} | 36.48 / 0.9463 / 0.127
{Y_3, Y_4, Y_5} | 31.31 / 0.9291 / 0.164
{Y_1, Y_2, Y_3, Y_4} | 38.70 / 0.9460 / 0.127
{Y_2, Y_3, Y_4, Y_5} | 36.54 / 0.9483 / 0.122
{Y_1, Y_2, Y_3, Y_4, Y_5} | 39.35 / 0.9516 / 0.112

H EFFECT OF TMRNET

For TMRNet, we conduct experiments varying the depths of the common and specific blocks independently; the results are shown in Tab. H and Tab. I, respectively. Denote the numbers of common and specific blocks by a_c and a_s, respectively. Beyond a_c = 16 and a_s = 24, increasing the depths brings no significant improvement while increasing the inference time. We speculate that this can be attributed to the difficulty of optimizing deeper recurrent networks.

Table H: Effect of the depth of specific blocks while keeping the common blocks the same on the BracketIRE task. a_c and a_s denote the numbers of common and specific blocks, respectively.

a_c | a_s | Time (ms) | PSNR↑ / SSIM↑ / LPIPS↓
16 | 0 | 808 | 38.66 / 0.9462 / 0.125
16 | 8 | 1,016 | 38.87 / 0.9480 / 0.122
16 | 16 | 1,224 | 39.12 / 0.9496 / 0.116
16 | 24 | 1,425 | 39.35 / 0.9516 / 0.112
16 | 32 | 1,633 | 39.36 / 0.9518 / 0.114

Table I: Effect of the depth of common blocks while keeping the specific blocks the same on the BracketIRE task. a_c and a_s denote the numbers of common and specific blocks, respectively.

a_c | a_s | Time (ms) | PSNR↑ / SSIM↑ / LPIPS↓
0 | 24 | 1,015 | 38.91 / 0.9484 / 0.121
8 | 24 | 1,219 | 39.15 / 0.9502 / 0.117
16 | 24 | 1,425 | 39.35 / 0.9516 / 0.112
24 | 24 | 1,637 | 39.31 / 0.9512 / 0.115
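To make the roles of a_c and a_s concrete, the following is a minimal sketch of the parameter-sharing idea behind TMRNet's recurrence: each frame's features pass through a_c blocks whose weights are shared across frames, followed by a_s blocks whose weights are specific to that frame index. This is only a schematic of the idea, not the actual architecture; alignment, fusion, and upsampling modules are omitted.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Simple residual block, the basic module mentioned in Appendix B."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class TemporallyModulatedRecurrence(nn.Module):
    """Sketch: shared (common) blocks plus per-frame (specific) blocks."""
    def __init__(self, num_frames=5, ch=64, a_c=16, a_s=24):
        super().__init__()
        self.common = nn.Sequential(*[ResBlock(ch) for _ in range(a_c)])
        self.specific = nn.ModuleList(
            nn.Sequential(*[ResBlock(ch) for _ in range(a_s)])
            for _ in range(num_frames))

    def forward(self, feats):            # feats: list of per-frame features (B, ch, H, W)
        state = torch.zeros_like(feats[0])
        for i, f in enumerate(feats):
            h = self.common(f + state)   # parameters shared across all frames
            state = self.specific[i](h)  # parameters assigned to frame i only
        return state
```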
I EFFECT OF SELF-SUPERVISED ADAPTATION

In the main text, we regard TMRNet trained on synthetic pairs as a baseline to validate the effectiveness of the proposed adaptation method. Here we provide more visual comparisons in Fig. B, where our results have fewer speckling and ghosting artifacts, as well as more details, in both static and dynamic scenes. We also provide quantitative comparisons in Tab. J, which show that the proposed adaptation method improves both CLIPIQA (Wang et al., 2023a) and MANIQA (Yang et al., 2022). In addition, deploying only L_ema would leave the network parameters not updated, while without L_ema the self-supervised fine-tuning would converge to a trivial solution and thus collapse, because the result of inputting all frames is then not subject to any constraint.

Table J: Effect of loss terms for self-supervised real-image adaptation. '-' denotes TMRNet trained on synthetic pairs. 'NaN' indicates training collapse. Note that the no-reference metrics are not completely stable and are provided only for auxiliary evaluation.

L_ema | L_self | BracketIRE CLIPIQA↑ / MANIQA↑ | BracketIRE+ CLIPIQA↑ / MANIQA↑
- | - | 0.2003 / 0.2181 | 0.3422 / 0.2898
✓ | ✗ | 0.2003 / 0.2181 | 0.3422 / 0.2898
✗ | ✓ | NaN / NaN | NaN / NaN
✓ | λ_self = 0.5 | 0.2295 / 0.2360 | 0.3591 / 0.2978
✓ | λ_self = 1 | 0.2537 / 0.2422 | 0.3676 / 0.3020
✓ | λ_self = 2 | 0.2270 / 0.2391 | 0.3815 / 0.3172
✓ | λ_self = 4 | 0.1974 / 0.2525 | 0.3460 / 0.3189

Figure B: Effect of self-supervised real-image adaptation (panels: input frames, w/o adaptation, w/ adaptation). Our results have fewer speckling and ghosting artifacts, as well as more details. The top four examples show static scenes (but with camera motion), and the bottom four show moving objects. Please zoom in for better observation.

Moreover, we empirically adjust the weight λ_self of L_self and conduct experiments with different λ_self. From Tab. J, the influence of λ_self on the results is acceptable. It is worth noting that although a higher λ_self (e.g., λ_self = 2 or λ_self = 4) sometimes achieves higher quantitative metrics, the image contrast decreases and the visual effect becomes unsatisfactory. This also demonstrates that the no-reference metrics are not completely stable, so we use them only for auxiliary evaluation. Focusing on the visual effect, we set λ_self = 1.
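The interplay between L_ema and L_self described above can be sketched as follows, assuming an EMA teacher that anchors the full-frame output during fine-tuning; the L1 form of L_ema and the self_loss interface are illustrative assumptions, not the exact losses defined in the main text.

```python
import copy
import torch

def make_teacher(student):
    """The teacher starts as a frozen copy of the pre-trained model (assumption)."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

def adapt_step(student, teacher, frames, self_loss, opt, lambda_self=1.0, m=0.999):
    """One fine-tuning step combining an EMA-anchor term with L_self. Without the
    anchor term the full-frame output is unconstrained and training can collapse;
    with only the anchor term (teacher == student initially), the gradient is zero
    and the parameters are never updated, matching the behavior noted in Tab. J."""
    pred = student(frames)
    with torch.no_grad():
        target = teacher(frames)                 # slowly-updated teacher output
    loss_ema = (pred - target).abs().mean()      # assumed L1 form, for illustration
    loss = loss_ema + lambda_self * self_loss(pred, frames)
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                        # EMA update of the teacher weights
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(m).add_(ps, alpha=1 - m)
```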
J MORE VISUAL COMPARISONS

We first provide more visual comparisons on the BracketIRE+ task. Figs. C and D show qualitative comparisons on synthetic images, and Figs. E and F on real-world images. Our method generates more photo-realistic images with fewer artifacts than the others. Moreover, to show the effect of dynamic range enhancement, we provide some full-image results from the real-world dataset; note that the original full images are very large, so we downsample them here for display. Fig. G shows full-image results on the BracketIRE task, and Fig. H on the BracketIRE+ task. Our results preserve both bright and dark details, exhibiting a higher dynamic range.

Figure C: Visual comparison on the synthetic dataset of the BracketIRE+ task (panels: input frames and the results of DBSR, MFIR, BIPNet, Burstormer, RBSR, AHDRNet, HDRGAN, HDR-Tran., SCTNet, Kim et al., ours, and GT). Our result restores clearer details. Please zoom in for better observation.

Figure D: Visual comparison on the synthetic dataset of the BracketIRE+ task (same panel layout as Fig. C). Our result restores content with higher fidelity. Please zoom in for better observation.

Figure E: Visual comparison on the real-world dataset of the BracketIRE+ task (same panel layout as Fig. C, without GT). Our result restores clearer textures. Please zoom in for better observation.

Figure F: Visual comparison on the real-world dataset of the BracketIRE+ task (same panel layout as Fig. C, without GT). Our result has fewer artifacts. Please zoom in for better observation.

Figure G: Full-image results on the real-world dataset of the BracketIRE task. Our results preserve both the bright areas in short-exposure images and the dark areas in long-exposure images. Note that the original full images are very large, so we downsample them here for display. Please zoom in for better observation. A higher-resolution version is provided in the README.md of the supplementary material.

Figure H: Full-image results on the real-world dataset of the BracketIRE+ task. Our results preserve both the bright areas in short-exposure images and the dark areas in long-exposure images. Note that the original full images are very large, so we downsample them here for display. Please zoom in for better observation. A higher-resolution version is provided in the README.md of the supplementary material.