FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion

Hugo Caselles-Dupré 1*, Mathis Koroglu 1,2*, Guillaume Jeanneret 2*, Arnaud Dapogny 2, and Matthieu Cord 2

1 Obvious Research, Paris, France
2 Institute of Intelligent Systems and Robotics - Sorbonne University, Paris, France
* These authors contributed equally.

Project website: https://f2v.pages.dev/

Abstract. Diffusion-based image-to-video (I2V) models are increasingly effective, yet they struggle to scale to ultra-high-resolution inputs (e.g., 4K). Generating videos at the model's native resolution often loses fine-grained structure, whereas high-resolution tiled denoising preserves local detail but breaks global layout consistency. This failure mode is particularly severe in the fresco animation setting: monumental artworks containing many distinct characters, objects, and semantically different sub-scenes that must remain spatially coherent over time. We introduce FrescoDiffusion, a training-free method for coherent large-format I2V generation from a single complex image. The key idea is to augment tiled denoising with a precomputed latent prior: we first generate a low-resolution video at the underlying model resolution and upsample its latent trajectory to obtain a global reference that captures long-range temporal and spatial structure. For 4K generation, we compute per-tile noise predictions and fuse them with this reference at every diffusion timestep by minimizing a single weighted least-squares objective in model-output space. The objective combines a standard tile-merging criterion with our regularization term, yielding a closed-form fusion update that strengthens global coherence while retaining fine detail. We additionally provide a spatial regularization variable that enables region-level control over where motion is allowed. Experiments on the VBench-I2V dataset and our proposed fresco I2V dataset show improved global consistency and fidelity over tiled baselines, while being computationally efficient. Our regularization enables explicit control of the trade-off between creativity and consistency.

Keywords: 4K Image-to-Video · Prior-Regularized Training-Free Generation · Scheduled-Gated Regularization

1 Introduction

Recently, media has shifted from images to videos, and now to ultra-high-definition videos, i.e., resolutions around 4K (3840 × 2160). In many applications, especially in the creative industries, such as cinema, animation, projections and art, there is a need for tools that can process and generate 4K content with sufficient detail. One example is fresco animation, which is the focus of this work. In contrast to standard images, we define a fresco as a large-scale image containing multiple scenes (i.e., up to hundreds) that blend seamlessly into a coherent visual.

Fig. 1: FrescoDiffusion animates a single ultra-high-definition image (3500 × 3500) at the same resolution. We show three frames from the generated video. The red box marks a fixed spatial region tracked across time, illustrating temporal motion consistency and fine-detail preservation.

For animating arbitrary pictures, diffusion models [2, 4, 17, 28, 31] have become the dominant approach for video synthesis, producing strong results in both text-to-video (T2V) and image-to-video (I2V) applications.
While T2V models generate videos that faithfully follow an input prompt, I2V models follow the same paradigm but start from a first frame provided by the user. Despite the progress in I2V, the most advanced models [14, 32, 33] operate under constraints that do not gracefully allow large-scale I2V. On the one hand, running regular video diffusion models on high-resolution images resized to their native spatial regime yields insufficient detail, as the input is too large and complex to be represented at the standard resolution of video diffusion models. On the other hand, memory and compute grow steeply with the spatial and temporal dimensions, preventing the use of these models on 4K images. Scaling image-to-video to very large videos containing many distinct regions remains largely unexplored in the literature. Existing work tries to address both issues, but the fresco setting makes this especially hard: simple tiling [3, 24] introduces cross-tile drift and visible seams, and post-hoc video super-resolution [15, 23] cannot create scene-level content that never existed at the model's native scale. A key difficulty specific to fresco animation is that different regions of the image play fundamentally different roles over time. Frescoes typically contain numerous loosely coupled scenes and characters: some regions are visually static and primarily define architectural or pictorial context, while others are semantically active and expected to exhibit motion. This observation motivates a region-aware treatment of global coherence.

In this paper, we introduce FrescoDiffusion to achieve 4K I2V up to the fresco setting, as illustrated in Fig. 1. Although animating the full-resolution fresco directly is not feasible, video models can animate a resized thumbnail of the same image at the model's native scale. Our core idea is to exploit this and use the resized thumbnail as a prior that guides the high-resolution denoising of the initial fresco through a novel loss. This loss admits a closed-form solution that allows control over the balance between creativity (tiled denoising) and prior alignment. Moreover, to deal with frescoes specifically, we introduce a tool that adapts the generation process differently in active zones and in the background using a mask, giving the generative model enough flexibility to animate the active zones while keeping the background close to the prior. In our experiments, we show the superiority, both in quality and computation time, of FrescoDiffusion over other tiled-denoising methods through extensive qualitative and quantitative evaluation and a user preference study, on our novel FrescoArchive dataset and the standard VBench 4K dataset [19]. Videos generated with FrescoDiffusion are provided in the supplementary material, together with an interactive webpage for visualizing them conveniently.

To summarize, our contributions are as follows:

1. We introduce FrescoDiffusion, a novel training-free approach for 4K I2V generation that outperforms existing baselines in performance and efficiency.
2. To demonstrate its application in an artistic domain, we propose a dynamic prior-strength schedule together with a spatial gating mechanism that enables controlled trade-offs between creativity and prior similarity, which is especially useful in the fresco-to-video task.
3. We propose FrescoArchive, a new I2V dataset composed of complex multi-scene images for fresco-to-video evaluation.

To contribute further to 4K I2V, we will make our algorithm open-source and release our dataset upon acceptance to promote further research.

2 Related Work

Diffusion-based video generation. Early text-to-video systems such as Make-A-Video [31] and Imagen Video [17] generate short clips by cascading a base video diffusion model with spatial and temporal super-resolution modules. Subsequent work extends latent diffusion models [28] to videos, e.g., Video Latent Diffusion Models (Video LDM) [5], which map videos to a compressed latent space for efficient high-resolution text-to-video generation, and Lumiere [2], whose space-time U-Net jointly denoises all frames. Recent foundation-scale models such as Wan [33], Mochi [32] and LTX-Video [14] push video latent diffusion to higher quality and longer durations. Image-to-video (I2V) generation is supported by most of these recent models. All of these methods, including commercial systems (such as Kling, Sora2, Gen4.5 or Veo3), operate only at their native spatial resolutions (typically 480-1080p) and video lengths of a few seconds. In contrast, we formulate a training-free approach that takes an existing image-to-video diffusion model and extends it beyond its native resolution, bringing additional creative control.

Video super-resolution. Video super-resolution (VSR) seeks to reconstruct high-resolution videos from low-resolution inputs, typically with strong temporal coherence. While these methods [8-10, 15, 22, 35, 39] excel at producing globally coherent, temporally consistent videos, they are fundamentally constrained to staying close to the information present in the low-resolution input. Their objectives encourage fidelity to the input video under distortion and perceptual metrics, and any hallucinated detail is limited to local texture refinement. Starting from a 480p or 720p animation of a fresco, VSR can only upscale and slightly enrich this coarse representation; it cannot create hundreds of semantically distinct, fully resolved scenes that were never visible in the low-resolution video. In contrast, FrescoDiffusion performs tiled denoising directly on a large latent canvas and uses a thumbnail animation purely as a prior in latent space, allowing each tile to carry as much semantic content as a native-resolution video while maintaining global coherence.

Training-free high-resolution tiled denoising. A line of work studies generating large images and videos from pre-trained diffusion models without additional training. MultiDiffusion [3] fuses overlapping diffusion trajectories via a weighted least-squares objective, enabling large-scale image generation with many scenes, but without explicit global coherence. Several approaches build upon this idea to improve spatial consistency. Mixture of Diffusers [21] runs multiple regional diffusions on a shared canvas to control high-resolution composition, while DiffCollage [37] formulates generation as a factor graph over patches and overlaps. SpotDiffusion [13] further reduces memory by denoising disjoint windows over time, trading overlap for efficiency. Recent tuning-free methods instead modify the sampling procedure of a single diffusion model to scale resolution.
ScaleCrafter [16] and DemoFusion [12] progressively enlarge the effective receptive field through re-dilation, dispersed convolutions, or staged upscaling, treating the full canvas as a single sample rather than independent tiles.

Closer to our work, DynamicScaler [24] proposes an offset-shifting denoising strategy for panoramic and 360° video generation, where spatial windows are shifted across denoising steps to synchronize content and motion across large fields of view. It employs a global motion guidance stage based on a low-resolution video to stabilize large-scale motion patterns. This method is the closest to ours because it uses training-free tiling and staged sampling with global guidance from a low-resolution video to scale diffusion models to large spatial videos. However, DynamicScaler was designed specifically for generating 360° videos and does not natively allow navigating the trade-off between creativity and prior similarity. In contrast, FrescoDiffusion is designed to generate a multi-scene high-resolution video by carefully and controllably allowing new details and movement to appear.

Fig. 2: Overview of FrescoDiffusion. Starting from a 4K fresco image, we first build a global latent prior by resizing the image to the native input size of the image-to-video backbone. Next, we upsample the prior latents x_prior to fit the 4K image size. We then apply tiled denoising to the large latent canvas x_t^4K, obtaining per-tile flow predictions {y_i}. We then use {y_i} and x_prior to compute the optimal output velocity field (Eq. (6)) according to our loss ℓ_FD (Eq. (5)). This updated field is then used to update the large latent canvas x_t^4K with the flow-matching scheduler.

3 FrescoDiffusion Method

In this section, we introduce FrescoDiffusion, a method tailored to multi-scene 4K I2V. Our proposed approach consists of two steps, as shown in Fig. 2. First, we compute a prior thumbnail to guide the diffusion process. Second, during the denoising process, we analytically minimize the energy loss to produce the optimal fused output (Sec. 3.2), which then guides the diffusion process. During this step, we optionally employ our novel masking strategy to direct the diffusion toward the prior by explicitly indicating which regions should be modified and which should converge to the prior (Sec. 3.3).

3.1 Ultra-high-definition tiled denoising baseline framework

We start by introducing our baseline and notation. Let t ∈ [0, 1] be the timestep, c be the conditioning tuple (input image and prompt) for the I2V flow-matching model f_θ, and x_t ∈ R^(C×T×H×W) be the latent state. At each t, the model predicts a velocity field

  y(x_t) = f_θ(x_t, t, c).   (1)

Starting from x_{t=0} ~ N(0, I), a scheduler integrates these velocities up to t = 1. The resulting latent x_1 is then decoded to produce the video.
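For readers unfamiliar with this sampling interface, the following is a minimal, non-authoritative sketch of a flow-matching Euler loop consistent with Eq. (1) and with the one-step clean-latent prediction x_t − σ_t·y used later in Eq. (5). The function f_theta, the sigma schedule, and the tensor shapes are illustrative placeholders and do not reflect the Wan2.2 API.

```python
import torch

def euler_flow_matching_sample(f_theta, c, shape, sigmas):
    """Integrate a velocity-predicting model from pure noise (sigma = 1) to data (sigma = 0).

    f_theta(x, sigma, c) -> velocity with the same shape as x (placeholder signature).
    sigmas: decreasing noise levels, e.g. torch.linspace(1.0, 0.0, num_steps + 1).
    At any step, the clean-latent estimate used by Eq. (5) is x - sigma * v.
    """
    x = torch.randn(shape)                    # x_{t=0} ~ N(0, I)
    for sigma_cur, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = f_theta(x, sigma_cur, c)          # velocity field, Eq. (1)
        x = x + (sigma_next - sigma_cur) * v  # Euler update of the flow-matching ODE
    return x                                  # final latent, decoded by the VAE afterwards
```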
Next, we introduce MultiDiffusion [3] (MD), adapted to I2V as in prior work [24]. MD generates high-resolution latent codes, i.e., x_1^4K ∈ R^(C×T×H_4K×W_4K) with H_4K ≫ H and W_4K ≫ W, by running f_θ on overlapping tiles of a large "canvas" x_t^4K and merging the tile-wise predictions. More specifically, let x_0^4K be the initial 4K latent canvas. Let C_p crop a window of shape (C, T, H, W) at position p, and let P_p zero-pad a tile back to shape (C, T, H_4K, W_4K) at coordinates p. For each tile i, define the tile prediction

  y_i(x_t^4K) = P_{p_i}( y( C_{p_i}( x_t^4K ) ) ),   (2)

where p_i is the coordinate of the i-th tile. Given x_t^4K and the tiled velocity predictions {y_i(x_t^4K)}_{i=1}^{n}, MD solves for a single merged velocity y⋆ that best matches these overlapping tile predictions by minimizing the loss

  ℓ_MD(y⋆; t) = Σ_{i=1}^{n} || √w_i ⊙ ( y⋆ − y_i(x_t^4K) ) ||_2^2,   (3)

where n is the number of windows, p_i are the coordinates, and w_i are weight maps of window i used to reduce seams between tiles. This loss admits the closed-form solution

  y_MD(x_t^4K) = ( Σ_{i=1}^{n} w_i ⊙ y_i(x_t^4K) ) / ( Σ_{i=1}^{n} w_i ).   (4)

Finally, y_MD is used to update the canvas x_t^4K with the standard iterative flow-matching sampling process.

3.2 FrescoDiffusion: Prior-Regularized Tile Fusion

MD provides a solution for merging overlapping windows using a weighted sum. However, MD lacks the ability to regularize window merging with an existing prior, such as the initial frame, to create a cohesive scene. To this end, we propose to extend ℓ_MD with a novel regularization term ℓ_prior. Our new FrescoDiffusion loss is

  ℓ_FD(y⋆; t) = || √λ ⊙ ( x_t^4K − σ_t y⋆ − x_prior ) ||_2^2 + ℓ_MD(y⋆; t),   (5)

where the first term is ℓ_prior(y⋆; t, x_prior). Our loss is composed of two terms. On the one hand, ℓ_MD reduces the disparity between the outputs of the shifting windows. On the other hand, ℓ_prior minimizes the dissimilarity between the current-step prediction of the clean latent, i.e., (x_t^4K − σ_t y) in the flow-matching formulation, and the prior x_prior ∈ R^(C×T×H_4K×W_4K). Here, λ is a regularization variable that can be either a constant (λ ∈ R) or a tensor (λ ∈ R^(C×T×H_4K×W_4K)), whose design is discussed in the next section. Additionally, σ_t denotes the scheduler's discrete noise standard deviation at step t. Note that for other diffusion formulations, we only need to adapt the corresponding one-step prediction; see Sec. E.

Equation (5) is separable across canvas coordinates and strictly convex. Thus, the unique minimizer can be found in closed form by setting the derivative to zero. The prior-regularized fused velocity is therefore

  y_FD(x_t^4K) = ( σ_t · λ ⊙ ( x_t^4K − x_prior ) + Σ_{i=1}^{n} w_i ⊙ y_i(x_t^4K) ) / ( σ_t^2 · λ + Σ_{i=1}^{n} w_i ).   (6)

We note that when λ = 0, our closed-form solution in Eq. (6) reduces to the MultiDiffusion fusion in Eq. (4). To create the global prior for x_t^4K, we resize the input fresco to the model's native spatial size and generate a small image-to-video sequence. Then, we perform a per-frame trilinear upscale in latent space (to the large canvas size). Fig. 2 illustrates FrescoDiffusion's generation process, complemented by an algorithm in Sec. A.2.
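For concreteness, the closed-form fusion of Eq. (6) can be written as a single tensor expression. The following is a minimal, non-authoritative sketch assuming the per-tile velocities have already been zero-padded to canvas shape and the weight maps are broadcastable; the names are illustrative and do not come from a released implementation.

```python
def fused_velocity(tile_velocities, tile_weights, x_canvas, x_prior, sigma_t, lam, eps=1e-8):
    """Prior-regularized tile fusion, Eq. (6), on torch tensors (or NumPy arrays).

    tile_velocities: list of zero-padded per-tile predictions P_{p_i}(y(C_{p_i}(x_canvas))).
    tile_weights:    list of matching weight maps w_i (zero outside tile i).
    lam:             scalar or tensor prior strength; lam = 0 recovers MultiDiffusion, Eq. (4).
    """
    num = sigma_t * lam * (x_canvas - x_prior)   # prior term of the numerator
    den = (sigma_t ** 2) * lam                   # prior term of the denominator
    for y_i, w_i in zip(tile_velocities, tile_weights):
        num = num + w_i * y_i                    # accumulate weighted tile velocities
        den = den + w_i                          # accumulate weights
    return num / (den + eps)                     # fused velocity y_FD(x_canvas)
```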
3.3 Spatio-Temporal Prior Strength Scheduling

We now discuss the design of the prior strength λ in Eq. (5). The parameter λ addresses two objectives:
(i) Remain structurally close to the prior while allowing the creation of new details.
(ii) Treat spatial regions differently in the case of frescoes: background regions are supposed to remain structurally stable, while other, active regions should be animated.

To attain objective (i), a high value of λ is desirable in the initial stages of diffusion because it directs the model to remain close to the prior. Conversely, a low value of λ is desirable in the final stages of diffusion to add details to the final video. We thus propose to model λ as a globally gated, decreasing schedule of the diffusion step,

  λ_G(t, τ) = λ_base · cos(tπ/2) · 1[t ≤ τ],   (7)

where τ is the gating and λ_base is the strength of the regularization. When λ = λ_G ∈ R, we name our method FrescoDiffusion.

To reach objective (ii), we compute a spatial activity map A(p) that differentiates active zones from the background. Let A(p) ∈ {0, 1} be a binary map, in which A(p) = 1 denotes active regions at position p (e.g., characters or local scenes expected to move) and A(p) = 0 denotes structurally static regions. Also, let τ_act and τ_bg be two temporal cutoffs, with τ_act ≤ τ_bg, which control the application of the prior to active and background regions, respectively. Hence, our prior strength factor becomes

  λ_R(t, p) = λ_G(t, τ_act) if A(p) = 1 (pixel p in the foreground),
              λ_G(t, τ_bg)  if A(p) = 0 (pixel p in the background).   (8)

Fig. 3: (Top) MSE between the foreground/background regions and the prior. (Bottom) Schedule for both regions.

Note that λ_R is a tensor with shape (C × T × H_4K × W_4K). We refer to this variant (λ = λ_R) as Regional-FrescoDiffusion (R-FrescoDiffusion). This design choice is motivated by the example in Fig. 3, where we show the mean squared error (MSE) between the prior noised to the same timestep and the current latents x_t^4K. The gated design enforces global coherence early in the sampling steps (t ≤ τ_act), then progressively relaxes the prior, first in active regions (τ_act < t ≤ τ_bg), allowing motion and novel detail to emerge. Background regions remain constrained longer to preserve large-scale structure. In late steps (t > τ_bg), the prior is fully disabled everywhere, and sampling focuses purely on fine-detail refinement. As a result, the background remains much closer to the prior (lower MSE) than the foreground. In addition, the coefficient λ_base ≥ 0 controls how strongly we adhere to the prior versus letting tiled denoising add new detail: a large λ_base favors faithfulness to x_prior, while a small λ_base allows more creativity.
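A minimal sketch of the schedules in Eqs. (7) and (8), assuming a precomputed binary activity map at latent resolution; names and shapes are illustrative and not taken from a released implementation.

```python
import math

def lambda_gated(t, tau, lam_base):
    """Global gated schedule lambda_G(t, tau), Eq. (7); t and tau lie in [0, 1]."""
    return lam_base * math.cos(t * math.pi / 2.0) * (1.0 if t <= tau else 0.0)

def lambda_regional(t, activity_map, tau_act, tau_bg, lam_base):
    """Regional schedule lambda_R(t, p), Eq. (8).

    activity_map: binary array (1 = active/foreground, 0 = static background),
                  broadcastable to the latent canvas shape (C, T, H_4K, W_4K).
    """
    lam_fg = lambda_gated(t, tau_act, lam_base)   # prior strength for active regions
    lam_bg = lambda_gated(t, tau_bg, lam_base)    # prior strength for the background
    return activity_map * lam_fg + (1.0 - activity_map) * lam_bg
```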
4 Dataset and Evaluation Protocols

Dataset. We use the Image Suite of VBench [19] as our first 4K-I2V dataset. This dataset focuses on one or a few objects per image, whereas we are looking for frescoes, i.e., complex images with multiple intricate scenes, to evaluate our method thoroughly. Therefore, we propose a new dataset named FrescoArchive for I2V techniques at fresco scale. Starting with the LAION-2B Aesthetic Subset [29], we filtered the images based on criteria such as pixel count, aesthetics, watermark and NSFW scores. Next, we performed text-based filtering, followed by zero-shot classification to detect frescoes. Subsequently, we deduplicated [20] the dataset and generated captions using Qwen3-VL-32B [1] with both the image and the LAION caption. Ultimately, we manually selected 371 pairs to achieve the best possible image-caption match. We provide details on this process, along with statistics, in Sec. B. This dataset is used exclusively for validation.

Evaluation Metrics. We employ a user study at full resolution to quantify human preference over the baselines. We used Amazon Mechanical Turk to conduct the study. Participants were carefully filtered to avoid bots and lower-quality evaluators: we required Masters status and a task approval rate above 85% to enroll. Participants were shown pairs of videos generated by two competing methods from the same input image of the FrescoArchive dataset. Videos were displayed at identical resolution and duration, with randomized ordering and no method identification. We report preference percentages with 95% confidence intervals. For each pair, participants answered two binary-choice questions:
– Animation fidelity: Which video most closely resembles a fresco artwork that has been smoothly and naturally animated?
– Motion plausibility: Which video provides the most convincing animation of the input image, with appropriate and perceptible motion?

We also evaluate our method on the VBench protocols [18, 19, 38], following standard studies [24]. These metrics compute the similarity between the input image and each frame, as well as the similarity between consecutive frames. They cover several criteria: Subject Consistency, Motion Smoothness, Aesthetic Score, and Imaging Quality. We complement them with VBench's I2V metrics (Video-Image Subject Consistency and Video-Image Background Consistency), which measure the similarity between the input image and the video.

While VBench is the standard for 480p/1080p, it provides incomplete 4K-I2V evaluation because it downsizes videos to fit the requirements of its metric models (DINOv2 [25] and CLIP [27]). We thus complement our testbed with three metrics that specifically target 4K-I2V. (i) We measure sharpness at full scale to quantify fine-detail generation using the Tenengrad [26] function. (ii) We use a simple yet efficient Temporal Consistency metric, the mean squared error between consecutive downsized frames, to quantify differences between frames. (iii) We compute a prior similarity metric using DINOv3 [30], which, unlike the DINOv2 used in VBench, is not limited to 1080p resolution. Each metric is thoroughly explained in Sec. C.1.

5 Experiments

We provide implementation details, then present qualitative and quantitative experiments, and finally experiments characterizing our model's behavior.

5.1 Implementation Details and Baselines

Video generation backbone. All experiments are conducted with the Wan2.2-I2V [33] 14B-parameter model, a state-of-the-art open video diffusion model that natively operates at spatial resolutions of 480 × 832 and up to 720 × 1280 pixels. To speed up inference, we used TurboDiffusion [36]'s LoRA to reduce the number of steps, making large-scale experimentation feasible on standard hardware. FrescoDiffusion's implementation details are provided in Sec. A.1.

Baselines. We compare our method to three tiled-diffusion methods: MultiDiffusion [3], DemoFusion [12] (a state-of-the-art tiled image diffusion method adapted to a video setup), and DynamicScaler [24] (a state-of-the-art tiled video diffusion method). All implementation details are available in Sec. C.3.
For a fair comparison, we adapted these baselines to Wan2.2's backbone to avoid any differences coming from base-model performance. Similarly, and unless stated otherwise, all compared methods use identical prompts, sampling steps, guidance scales, random seeds, tile sizes, and overlap between tiles when applicable. Method-dependent parameters are set identical to the authors' code.

Fig. 4: Overlay of the spatial activity map onto the input fresco.

Spatial activity maps. R-FrescoDiffusion uses a spatially gated prior schedule to differentiate active regions from the structurally static background. For each input image, we compute the activity map A(p) (see Sec. 3.3) using the Segment Anything Model 3 (SAM3) [7], as illustrated in Fig. 4. We apply SAM3 with a fixed set of prompts, producing a binary activity map in [0, 1], which is downsampled to latent resolution and used directly in Eq. (8). See details in Sec. C.4. These activity maps are computed once per input image and remain fixed throughout sampling, without additional learnable parameters.

5.2 Qualitative Evaluation

Fig. 5: A qualitative comparison on fresco-scale inputs. FrescoDiffusion generates coherent global scenes and animates details at a local level. By contrast, DemoFusion, DynamicScaler and MultiDiffusion only manage to produce either coherent scenes or high-quality details, but not both.

We begin with a qualitative study. In Fig. 5, we show the first, 40th, and last frame for each model, along with central and corner crops to reveal fine-grained structure. DemoFusion preserves the global structure; yet some elements are modified, such as the path in the center crop and the hut in the corner crop, and visible wobbling is present when the video is playing. MultiDiffusion and DynamicScaler tend to introduce excessive novel content across tiles or sampling stages, which results in an accumulation of structural inconsistencies and a loss of temporal coherence. In contrast, FrescoDiffusion produces high-quality videos while preserving the general layout of the input. Later, in Sec. 5.4, we discuss the differences between FrescoDiffusion and its regional counterpart from a qualitative perspective. We strongly encourage the reader to explore more results in the supplementary material (web visualization recommended) and in the appendix.

5.3 State-of-the-art quantitative comparison

In this subsection, we quantitatively compare our method with the state of the art on both user-preference and automatic metrics.

High-Resolution Evaluation: User Study. We start with a user study to quantitatively measure human preference. Our study totals 1344 ratings over 47 participants. Table 1 shows that both of our methods are strongly preferred over DynamicScaler and MultiDiffusion, with preference rates of 84-93% across both evaluation criteria, confirming that these baselines produce noticeably lower-quality animations. Against DemoFusion, R-FrescoDiffusion achieves a statistically significant preference of 69%. FrescoDiffusion reaches a 54% preference rate, which does not fully qualify as an advantage given the confidence intervals. Across all comparisons, results are consistent between the motion and fidelity questions.
Finally, R-FrescoDiffusion is preferred over FrescoDiffusion in 58% of comparisons, indicating that the regional regularization provides the intended perceptual improvement over the base method in the case of frescoes.

Table 1: User study results. Human preference rates (% of annotators preferring our method over each baseline). Green cells indicate statistically significant preference. All reported preference rates are computed with 95% confidence intervals of at most ±6% (binomial proportion test, n = 192 per comparison).

                      FrescoDiffusion            R-FrescoDiffusion
                      Motion  Fidelity  Avg.     Motion  Fidelity  Avg.
vs. DemoFusion        56%     52%       54%      68%     70%       69%
vs. DynamicScaler     84%     92%       88%      89%     90%       89%
vs. MultiDiffusion    91%     93%       92%      88%     92%       90%
vs. each other        40%     44%       42%      60%     56%       58%

Standard Low-Resolution I2V Metrics. Next, as a sanity check, we compare our approach with modern methods (Tab. 2) using standard lower-resolution I2V metrics (VBench and VBench-I2V). These metrics are not designed for 4K videos, as the evaluated videos are downscaled aggressively (roughly 10-20×) to fit the backbones' resolution. As a result, the reported scores provide only a coarse proxy for performance at the original 4K resolution. We perform this evaluation on the FrescoArchive and 4K Image Suite VBench datasets, using both the regional and non-regional configurations. On the FrescoArchive dataset, our method slightly outperforms the baselines on average. On the Image Suite VBench dataset, our approach performs second best on average, slightly outperformed by DemoFusion. Full results for each metric are available in Sec. C.2. Note that our re-implementation of DynamicScaler performs better than the original, as the I2V backbone in their original implementation, VideoCrafter [11], is much older and clearly underperforms Wan2.2. Our conclusion on this benchmark is consistent with what we observed in our user study and qualitatively: we outperform all methods on the fresco-to-video task and are competitive with DemoFusion on the 4K-I2V task.

Table 2: Standard low-resolution I2V metrics. The table shows the average performance (higher is better) using the VBench evaluation suite on both the FrescoArchive and VBench-I2V image sets. We also display the average generation time in minutes.

Method                      FrescoArchive   VBench-I2V   Time (min)
DynamicScaler (original)    0.857           0.862        18.45
DynamicScaler* (CVPR'25)    0.871           0.865        10.25
DemoFusion* (CVPR'23)       0.903           0.879        13.5
MultiDiffusion* (ICML'23)   0.876           0.860        8.15
FrescoDiffusion             0.904           0.875        8.58
R-FrescoDiffusion           0.907           0.878        9.08

Computational efficiency. We report the average runtime over all runs on both datasets on a single H100 GPU. FrescoDiffusion and R-FrescoDiffusion outperform all baselines. MultiDiffusion is excluded from this comparison, as it is the core tiled-denoising procedure used by all baselines. Our methods are at least 45% faster than DemoFusion. Our DynamicScaler implementation is nearly twice as fast as the original, yet remains slower than our methods.

5.4 Controlling FrescoDiffusion

We perform an ablation study of FrescoDiffusion's components to justify their design, and show how they allow creative control in the generation process.

Table 3: Prior strength schedule ablation study on FrescoArchive. Best in bold. The results suggest that including the schedule, the gating, and the spatial regularization enhances the quantitative performance.
Method              λ function           SC     MS     A      I      ISC    IBC    Avg
MultiDiffusion      0                    0.876  0.974  0.686  0.754  0.981  0.989  0.876
FrescoDiffusion     λ_base               0.942  0.991  0.645  0.598  0.979  0.983  0.856
FrescoDiffusion     λ_base cos(tπ/2)     0.946  0.991  0.724  0.730  0.987  0.992  0.895
FrescoDiffusion     λ_G (Eq. (7))        0.958  0.990  0.738  0.753  0.991  0.995  0.904
R-FrescoDiffusion   λ_R (Eq. (8))        0.977  0.991  0.736  0.753  0.991  0.994  0.907

Fig. 6: Regional constraint. "Prior" shows an overlay of the activity map. The red/blue boxes mark the background/foreground regions. Our regional loss forces the generation towards the prior in background regions while allowing new details to appear in foreground regions.

Ablation of the prior strength schedule. We ablate the design of the spatio-temporal prior strength schedule λ(t, p), using VBench metrics on the FrescoArchive dataset. We start with MultiDiffusion (no λ) and add a constant regularization λ = λ_base = 1.5, which actually worsens performance. Next, we add the cosine schedule λ = λ_base cos(tπ/2) ∈ R and obtain significant gains over MD. Then, we build FrescoDiffusion by setting λ = λ_G (see Eq. (7)), and we finish with R-FrescoDiffusion by setting λ = λ_R (see Eq. (8)), which yields the best measured performance. Qualitatively, we show the difference between FrescoDiffusion and R-FrescoDiffusion in Fig. 6. With our regional regularization, R-FrescoDiffusion stays closer to the prior in background regions, as intended. We provide further visualization of this effect in Sec. D.

λ-Controlled Pareto Trade-off Between Creativity and Prior Similarity. Creativity and prior similarity are essentially opposing objectives: one cannot be improved without hurting the other. This inherent trade-off creates a Pareto frontier composed of the set of optimal compromises between the two objectives. To navigate this frontier using λ_G(t, τ) (Eq. (7)), we linearly vary the prior strength (λ_base ∈ [0, 5]) and the temporal gating (τ ∈ [0, 1]). To represent the creativity objective we use the sharpness metric as a proxy, and for prior similarity we use both the temporal consistency and the prior similarity metrics presented in Sec. 4. The results in Fig. 7 show two expected behaviors. (i) When λ_base and the temporal gating τ increase, the outputs converge to the prior. (ii) Conversely, when those parameters decrease, we reach the same performance as MultiDiffusion (full creativity, no prior). The curve formed between these two extremes constitutes the Pareto frontier, which allows a trade-off between the two objectives. Thus, FrescoDiffusion allows full control over this crucial trade-off.

Fig. 7: Quantitative evaluation of the trade-off between creativity and prior similarity controlled by λ. (a) Temporal consistency versus sharpness, illustrating the Pareto frontier between preserving temporal coherence and maintaining high image sharpness. (b) Evolution of prior similarity over time, showing how increasing prior strength and temporal gating progressively align the generated outputs with the prior.

6 Limitations and Conclusion

Limitations. FrescoDiffusion relies on the availability of a meaningful low-resolution prior.
When the input image is extremely large, the prior may fail to capture sufficient global structure, limiting our method. One possible extension would be to construct multiple local priors, at the cost of reduced global coherence. Moreover, as a tiled denoising approach, FrescoDiffusion is inherently computationally expensive. We mitigate this overhead through reduced-step sampling (a 6-step LoRA), low-precision arithmetic (FP8), and compiler-level optimizations (torch.compile). Improving efficiency while preserving visual fidelity remains an important direction for future work.

Conclusion. FrescoDiffusion is a simple, effective, training-free solution that uses existing video diffusion models to animate 4K, multi-scene images. The method combines tiled denoising with a latent prior derived from a thumbnail animation to preserve global coherence while introducing local detail at large scales. It uses fewer computational resources than the baselines and outperforms them consistently in both quantitative metrics and user preference studies. Our approach allows easy adjustment of the balance between creativity and fidelity, opening the door to creative applications in large-scale image animation.

7 Acknowledgments

This project was provided with computing (HPC & AI) and storage resources by GENCI at IDRIS thanks to the grant 2025-AD011016538 on the supercomputer Jean Zay's A100 & H100 partitions. This research was funded by the French National Research Agency (ANR) under the project ANR-23-CE23-0023 as part of the France 2030 initiative.

References

1. Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint (2025)
2. Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A., Hur, J., Liu, G., Raj, A., et al.: Lumiere: A space-time diffusion model for video generation. In: SIGGRAPH Asia 2024 Conference Papers, pp. 1–11 (2024)
3. Bar-Tal, O., Yariv, L., Lipman, Y., Dekel, T.: MultiDiffusion: Fusing diffusion paths for controlled image generation. In: ICML, vol. 202, pp. 1737–1752 (2023)
4. Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., et al.: FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints, arXiv-2506 (2025)
5. Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint (2023)
6. Bolya, D., Huang, P.Y., Sun, P., Cho, J.H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Rasheed, H.A., et al.: Perception Encoder: The best visual embeddings are not at the output of the network. In: NeurIPS (2025)
7. Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)
8. Chan, K.C., Wang, X., Yu, K., Dong, C., Loy, C.C.: BasicVSR: The search for essential components in video super-resolution and beyond. In: CVPR, pp. 4947–4956 (2021)
9. Chan, K.C., Zhou, S., Xu, X., Loy, C.C.: BasicVSR++: Improving video super-resolution with enhanced propagation and alignment. In: CVPR, pp. 5972–5981 (2022)
10. Chan, K.C., Zhou, S., Xu, X., Loy, C.C.: Investigating tradeoffs in real-world video super-resolution. In: CVPR, pp. 5962–5971 (2022)
11. Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., Xing, J., Liu, Y., Chen, Q., Wang, X., et al.: VideoCrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512 (2023)
12. Du, R., Chang, D., Hospedales, T., Song, Y.Z., Ma, Z.: DemoFusion: Democratising high-resolution image generation with no $$$. In: CVPR, pp. 6159–6168 (2024)
13. Frolov, S., Moser, B.B., Dengel, A.: SpotDiffusion: A fast approach for seamless panorama generation over time. In: WACV, pp. 2073–2081 (2025)
14. HaCohen, Y., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., Levin, E., Shiran, G., Zabari, N., Gordon, O., et al.: LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103 (2024)
15. He, J., Xue, T., Liu, D., Lin, X., Gao, P., Lin, D., Qiao, Y., Ouyang, W., Liu, Z.: VEnhancer: Generative space-time enhancement for video generation. arXiv preprint arXiv:2407.07667 (2024)
16. He, Y., Yang, S., Chen, H., Cun, X., Xia, M., Zhang, Y., Wang, X., He, R., Chen, Q., Shan, Y.: ScaleCrafter: Tuning-free higher-resolution visual generation with diffusion models. In: ICLR (2023)
17. Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al.: Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022)
18. Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: VBench: Comprehensive benchmark suite for video generative models. In: CVPR (2024)
19. Huang, Z., Zhang, F., Xu, X., He, Y., Yu, J., Dong, Z., Ma, Q., Chanpaisit, N., Si, C., Jiang, Y., Wang, Y., Chen, X., Chen, Y.C., Wang, L., Lin, D., Qiao, Y., Liu, Z.: VBench++: Comprehensive and versatile benchmark suite for video generative models. IEEE TPAMI (2025). https://doi.org/10.1109/TPAMI.2025.3633890
20. Jain, T., Lennan, C., John, Z., Tran, D.: Imagededup. https://github.com/idealo/imagededup (2019)
21. Jiménez, Á.B.: Mixture of Diffusers for scene composition and high resolution image generation. arXiv preprint arXiv:2302.02412 (2023)
22. Liang, J., Cao, J., Fan, Y., Zhang, K., Ranjan, R., Li, Y., Timofte, R., Van Gool, L.: VRT: A video restoration transformer. IEEE TIP 33, 2171–2182 (2024)
23. Liu, H., Ruan, Z., Zhao, P., Dong, C., Shang, F., Liu, Y., Yang, L., Timofte, R.: Video super-resolution based on deep learning: a comprehensive survey. Artificial Intelligence Review 55(8), 5981–6035 (2022)
24. Liu, J., Lin, S., Li, Y., Yang, M.H.: DynamicScaler: Seamless and scalable video generation for panoramic scenes. In: CVPR, pp. 6144–6153 (2025)
25. Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Howes, R., Huang, P.Y., Xu, H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning robust visual features without supervision (2023)
26. Pertuz, S., Puig, D., Garcia, M.A.: Analysis of focus measure operators for shape-from-focus. PR 46(5), 1415–1432 (2013)
27. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML, pp. 8748–8763 (2021)
28. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR, pp. 10684–10695 (2022)
29. Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35, 25278–25294 (2022)
30. Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: DINOv3. arXiv preprint arXiv:2508.10104 (2025)
31. Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., Parikh, D., Gupta, S., Taigman, Y.: Make-A-Video: Text-to-video generation without text-video data. In: ICLR (2023)
32. Team, G.: Mochi 1. https://github.com/genmoai/models (2024)
33. Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)
34. Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
35. Xie, R., Liu, Y., Zhou, P., Zhao, C., Zhou, J., Zhang, K., Zhang, Z., Yang, J., Yang, Z., Tai, Y.: STAR: Spatial-temporal augmentation with text-to-video models for real-world video super-resolution. In: ICCV, pp. 17108–17118 (2025)
36. Zhang, J., Zheng, K., Jiang, K., Wang, H., Stoica, I., Gonzalez, J.E., Chen, J., Zhu, J.: TurboDiffusion: Accelerating video diffusion models by 100-200 times (2025)
37. Zhang, Q., Song, J., Huang, X., Chen, Y., Liu, M.Y.: DiffCollage: Parallel generation of large content with diffusion models. In: CVPR, pp. 10188–10198 (2023)
38. Zheng, D., Huang, Z., Liu, H., Zou, K., He, Y., Zhang, F., Zhang, Y., He, J., Zheng, W.S., Qiao, Y., Liu, Z.: VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755 (2025)
39. Zhou, S., Yang, P., Wang, J., Luo, Y., Loy, C.C.: Upscale-A-Video: Temporal-consistent diffusion model for real-world video super-resolution. In: CVPR, pp. 2535–2545 (2024)
FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion - Supplementary Material

A FrescoDiffusion Additional Details

A.1 Implementation Details

All experiments use the Wan2.2-I2V 14B backbone with the same accelerated 6-step TurboDiffusion setting as in the main paper. We generate 81 frames at 16 fps with guidance scale 1.0, i.e., without effective classifier-free guidance. For an input image of size (H, W), we first generate the low-resolution prior at a fixed target area of 480 × 832 = 399,360 pixels while preserving the aspect ratio:

  Ĥ = round( sqrt( 399360 · H / W ) ),  Ŵ = round( sqrt( 399360 · W / H ) ),

and then snap both dimensions down to the nearest multiple of 16. The full-resolution pass uses H_4K = max(16, ⌊H/16⌋ · 16) and W_4K = max(16, ⌊W/16⌋ · 16), after an optional isotropic downscaling when the input exceeds ultra-HD resolution. This multiple-of-16 constraint comes from the latent video backbone: the Wan VAE reduces spatial resolution by a factor of 8, and the downstream latent grid is processed in spatial patches of size 2, yielding an effective validity constraint of 8 × 2 = 16. With spatial and temporal compression factors of 8 and 4, the latent tensor therefore has shape 16 × 21 × (H_4K/8) × (W_4K/8) for the default 81-frame setting. The low-resolution latent prior is resized to the large latent canvas with endpoint-aligned trilinear interpolation in latent space, VAE tiling is enabled during inference, and the decoded output is resized back to the original image size only if snapping changed the resolution.

Tiled denoising and regularization. The high-resolution pass uses MultiDiffusion windows of size 480 × 832 pixels with 30% overlap, giving nominal pixel strides of 336 × 582. After conversion to latent coordinates and rounding to the valid latent grid, this becomes 60 × 104 latent windows with strides of 42 × 72; extra final windows are added whenever needed so that the right and bottom boundaries are exactly covered. Tile fusion uses linear ramps with a minimum border weight of 0.1. Standard MultiDiffusion uses Σ_i w_i y_i / Σ_i w_i, whereas the prior-regularized implementation additionally accumulates Σ_i w_i y_i and Σ_i w_i for the one-shot closed-form update in model-output space. In all reported FrescoDiffusion runs we use a cosine prior schedule with λ_base = 1.5 and cutoff τ_end = 0.1, where τ = i/(N − 1) and N = 6 (the number of steps); hence the prior is active only at the first denoising step. In R-FrescoDiffusion, the active-region cutoff is τ_fg = 0.1 and the inactive-region cutoff is τ_bg = 0.35. Since the six normalized step positions are {0, 0.2, 0.4, 0.6, 0.8, 1.0}, the foreground prior is active only at i = 0, while the background prior is active at i = 0 and i = 1. The active and inactive parts of the video are determined by masks computed according to Sec. C.4.
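The resolution bookkeeping above can be summarized in a short sketch. This is a minimal, non-authoritative sketch of the size computation and tile placement under the stated assumptions (target area 399,360, multiple-of-16 snapping, 30% overlap); the function and variable names are illustrative and not taken from a released codebase.

```python
import math

TARGET_AREA = 480 * 832  # 399,360 pixels, the backbone's native area

def prior_resolution(h, w, area=TARGET_AREA, multiple=16):
    """Aspect-preserving prior size, snapped down to a multiple of 16."""
    ph = round(math.sqrt(area * h / w))
    pw = round(math.sqrt(area * w / h))
    return (ph // multiple) * multiple, (pw // multiple) * multiple

def canvas_resolution(h, w, multiple=16):
    """Full-resolution canvas: the largest multiple of 16 not exceeding the input size."""
    return max(multiple, (h // multiple) * multiple), max(multiple, (w // multiple) * multiple)

def tile_positions(latent_len, window, stride):
    """1D window origins with fixed stride; a final window is appended to cover the border."""
    starts = list(range(0, max(latent_len - window, 0) + 1, stride))
    if starts[-1] + window < latent_len:
        starts.append(latent_len - window)
    return starts

# Example: a 2160 x 3840 input, 60 x 104 latent windows with strides 42 x 72 (30% overlap).
H4K, W4K = canvas_resolution(2160, 3840)
rows = tile_positions(H4K // 8, window=60, stride=42)
cols = tile_positions(W4K // 8, window=104, stride=72)
```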
A.2 Sampling Procedure

At each step, we run the transformer on each crop of the canvas latent x_t^4K to obtain a per-tile prediction y_i. Then, we accumulate two canvas-shaped tensors: Σ_i w_i ⊙ y_i and Σ_i w_i. We then add the prior term and divide as in Eq. (6) to produce a single fused prediction on the full canvas. Finally, we invoke the scheduler exactly once to obtain x_{t+Δt}. This preserves the original sampler while adding only light overhead: one extra reduction per pixel, a single pointwise rational fuse, and no additional network passes beyond those already required by tiled denoising. The full step is summarized in Algorithm 1.

Algorithm 1: FrescoDiffusion: one sampler step at time t
Input: canvas latent x_t^4K ∈ R^(C×T×H_4K×W_4K); tile positions {p_i}_{i=1}^{n}; weight maps {w_i}_{i=1}^{n}; upscaled prior x_prior ∈ R^(C×T×H_4K×W_4K); noise level σ_t; prior-strength schedule λ; flow-matching step size Δt; flow-matching model f_θ
Output: updated canvas latent x_{t+Δt}^4K
1  Initialize num ← 0 ∈ R^(C×T×H_4K×W_4K), den ← 0 ∈ R^(1×1×H_4K×W_4K)
2  for i = 1, ..., n do
3      x̃_t ← C_{p_i}(x_t^4K)               // crop the canvas at p_i
4      ỹ_i ← f_θ(x̃_t, t, c)                // estimate the crop flow
5      y_i ← P_{p_i}(ỹ_i)                   // zero-pad the output
6      num ← num + w_i ⊙ y_i                // update the numerator
7      den ← den + w_i                      // update the denominator
8  y ← ( num + λ σ_t (x_t^4K − x_prior) ) / ( den + λ σ_t^2 )   // add prior regularization
9  x_{t+Δt}^4K ← SchedulerStep(x_t^4K, y, t)                    // update noisy states

B Fresco Evaluation Dataset Construction

While preparing our UHD-I2V benchmark, we noticed that no existing dataset fits our requirements. Unlike the VBench high-definition [19] set or panoramas [24], which focus on one or a few objects, we search for images to animate that contain multiple intricate scenes. Here, we propose a new set to generate and evaluate UHD-I2V techniques at a fresco-like scale. We name our dataset FrescoArchive.

We start with the LAION-2B Aesthetic Subset [29]. The first step is filtering the images based on several criteria: more than one million pixels, an aesthetic score of at least 5.8, and watermark and unsafe scores below 0.8 and 0.5, respectively. This initial filtering yielded a total of two million images. Next, we performed a semantic filtering process. To do this, for each image, we compute the average cosine similarity between the target instance and the prompts

  "A large detailed fresco"
  "A magnificent fresco with many different scenes"
  "A narrative composition"
  "A fresco with lots of details"
  "A large polyptych and composite image fresco"
  "A large metapicture with several compositions"
  "A fresco tableau"
  "A painting fresco"

using the PerceptionEncoder [6] G14-448 variant similarity model, and we finish by selecting the top 50,000 images.
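A minimal sketch of this prompt-similarity filtering step. The paper uses the PerceptionEncoder G14-448 model; the sketch below substitutes a generic CLIP-style encoder loaded through open_clip, and the model tag, batching, and selection logic are illustrative assumptions rather than the authors' pipeline.

```python
import torch
import open_clip
from PIL import Image

FRESCO_PROMPTS = [
    "A large detailed fresco",
    "A magnificent fresco with many different scenes",
    "A narrative composition",
    "A fresco with lots of details",
    "A large polyptych and composite image fresco",
    "A large metapicture with several compositions",
    "A fresco tableau",
    "A painting fresco",
]

# Generic CLIP-style encoder standing in for the PerceptionEncoder model used in the paper.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

@torch.no_grad()
def fresco_score(image_path: str) -> float:
    """Average cosine similarity between one image and the fresco prompts."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(tokenizer(FRESCO_PROMPTS))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).mean().item()

# Rank candidate images and keep the highest-scoring ones (top 50,000 in the paper).
# top_images = sorted(image_paths, key=fresco_score, reverse=True)[:50_000]
```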
After, we used InternVL-3.5 [34] as a zero-shot classifier to detect frescoes. To do so, we used the following prompt:

  You are a visual classifier. Decide whether the image is a fresco-like composition.
  Definition (for this task): A qualifying image resembles a large, detailed, integrated scene (like historical frescoes). Modern photos or digital works count if they share these traits.
  Answer yes only if all are true:
  – The image shows high apparent resolution / detail density (many fine, precise elements).
  – There are multiple distinct sub-scenes or groups (from a few to dozens+) distributed across the same frame.
  – These elements are blended into one coherent composition (no panel borders or obvious collage seams).
  Answer no if any of the following:
  – Single subject or minimal detail.
  – The fresco is not the main content of the image (e.g. the photo shows a wall, room, or museum scene where the fresco only appears as a small part, rather than the fresco itself being the full image).
  – Simple graphics, logos, posters, or text-only images.
  – Comics/manga with separate panels, tiled grids, or collages with hard borders.
  – Diagrams, charts, UI/screenshots, or patterns.
  Unsure: answer no.
  Output format: Respond with exactly yes or no (lowercase, no punctuation, no extra words).

This filtering results in a total of 10,000 images. We then deduplicated the dataset using both perceptual hashing and CNN-based deduplication from the ImageDedup library [20]. To generate UHD image captions, we used Qwen3-VL-32B [1] with both the image and the LAION caption, resulting in 6,700 UHD image-caption pairs. To prompt Qwen3-VL-32B, we used the following text:

  Using the existing caption below as context, write a long, highly detailed, precise, and fluent caption that thoroughly describes the image. Give some precise contextual information, relative positional information, subjects, objects and elements description and identification. Caution: the caption provided can be false or wrong; use the image as the only source of truth. The caption is only here to help you be more precise. Respond with the caption only (no preface, no metadata, no quotes).
  Existing caption: {caption}

Finally, we manually selected 371 pairs to keep the best-quality image-caption pairs.

Table 4: Dataset statistics. We compare low-level statistics of the FrescoArchive and VBench datasets.

Dataset         Samples   Words            Image Width    Image Height
FrescoArchive   371       355.79 ± 88.0    2265 ± 1257    1552 ± 864
VBench          361       14.70 ± 2.24     4592 ± 1305    3748 ± 1214

For the statistics, we compute several text- and image-level measures and quantitatively compare against VBench I2V [19] as a reference point. First, we compute high-level statistics, shown in Tab. 4: the number of samples, the average number of words per prompt, and the average width and height. Our set contains a similar number of images to VBench, yet our prompts contain an order of magnitude more words, reflecting more precise and detailed descriptions. As for the average image shape, our set contains smaller images, but with more complex scenery. Fig. 8 shows the word-count and image-shape distributions of our dataset.

Fig. 8: Analysis of the FrescoArchive dataset. Left: distribution of word counts in captions. Right: distribution of image resolutions.

Next, we qualitatively study FrescoArchive's complexity with reference to VBench. To do this, we first encode all images using the DINOv2 [25] model to get a global view of each image. Then, we perform a PCA dimensionality reduction to visualize their distribution. Finally, using the resulting reduction, we further visualize individual crops. As can be seen in Fig. 9, the resulting PCA shows that our dataset covers a wider span of the main axes than VBench. This suggests that our dataset has more diversity in its shared components under global DINO features.

Fig. 9: PCA of the FrescoArchive and VBench I2V datasets. We visualize the DINOv2 [25] feature PCA of our proposed dataset and VBench's images (red and green points, respectively); random crops of the images are shown in light red and light green. This plot shows that our dataset contains images with different statistics than the standard ones found in the VBench I2V set.

C Experiment Details

C.1 Metrics

Sharpness metric. We assess per-frame sharpness using the Tenengrad [26] measure, a classical no-reference focus metric based on directional gradient energy.
For each frame, the luminance channel is extracted and horizontal and vertical gradients are computed with a 3 × 3 Sobel operator. The Tenengrad score is defined as the mean squared gradient magnitude

  T = (1 / (H W)) Σ_{i,j} ( G_x^2(i, j) + G_y^2(i, j) ),

where G_x and G_y are the responses of the Sobel filters along each axis. This quantity is sensitive to high-frequency spatial structure and penalizes blurry outputs in which gradient energy is suppressed. Per-frame scores are averaged over all frames of a video to yield a single video-level score. Higher values indicate sharper, more detail-preserving outputs.

Temporal consistency metric. Each frame is first converted to grayscale and then downsampled to 128 × 128 pixels using area-averaging interpolation, retaining only coarse spatial structure. The temporal inconsistency score for a video of T frames is defined as the mean squared error between consecutive downscaled frames:

  C = (1 / (T − 1)) Σ_{t=1}^{T−1} || f̂_t − f̂_{t−1} ||_F^2 / 64^2,

where f̂_t denotes the downscaled grayscale frame at time t. By operating at this coarse resolution, the metric captures global consistency while remaining agnostic to legitimate scene motion.

Prior alignment metric. We measure how faithfully each generated video preserves the global semantic content of the prior using a frame-level feature alignment score based on DINOv3 [30]. For each frame pair (f_t^prior, f_t^gen), both frames are independently forwarded through a frozen DINOv3 ViT-S/16 encoder without any resizing (i.e., at native resolution), and we extract the CLS token from each: the pooled global representation output by the transformer. Prior alignment is then defined as the mean cosine similarity between corresponding CLS tokens across all T frames of a video:

  A = (1 / T) Σ_{t=1}^{T} ⟨ z_t^prior, z_t^gen ⟩ / ( || z_t^prior || · || z_t^gen || ),

which equivalently measures the cosine of the angle between the two CLS token directions in feature space. We report the mean of this score across all videos in the benchmark; higher values indicate that the generated video remains semantically aligned with the prior along its temporal trajectory. A key advantage of DINOv3 over its predecessor DINOv2 is its ability to process high-resolution inputs, including 4K frames, without interpolating positional embeddings or resizing the input. This property is essential in our setting, where the generated videos are high-resolution and downscaling prior to feature extraction would discard fine-grained content that may be critical for alignment assessment.
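A minimal sketch of the two full-resolution metrics above, using OpenCV for the Sobel filtering and area-averaging resize; the frame format (uint8 RGB arrays) and the normalization constants mirror the definitions in the text, but the helper names are illustrative and not taken from a released evaluation script.

```python
import cv2
import numpy as np

def tenengrad_sharpness(frames):
    """Mean Tenengrad score over a list of RGB frames, each an (H, W, 3) uint8 array."""
    scores = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY).astype(np.float64)
        gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)   # horizontal gradient
        gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)   # vertical gradient
        scores.append(np.mean(gx ** 2 + gy ** 2))         # mean squared gradient magnitude
    return float(np.mean(scores))

def temporal_inconsistency(frames, size=128):
    """MSE between consecutive frames after grayscale conversion and area downsampling."""
    small = [cv2.resize(cv2.cvtColor(f, cv2.COLOR_RGB2GRAY), (size, size),
                        interpolation=cv2.INTER_AREA).astype(np.float64) for f in frames]
    diffs = [np.sum((a - b) ** 2) / 64 ** 2               # normalization as stated in the text
             for a, b in zip(small[1:], small[:-1])]
    return float(np.mean(diffs))
```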
C.2 Standard low-resolution I2V metrics

In Table 2 of the main paper, we presented the average results on both the FrescoArchive and VBench-I2V datasets. Here, in Table 5, we detail each metric: Subject Consistency, Motion Smoothness, Aesthetic, Imaging, I2V Subject Consistency, and I2V Background Consistency.

Table 5: Quantitative evaluation on FrescoArchive and VBench-I2V. Higher is better for all metrics except time. Best in bold and second best underlined. * denotes methods adapted to the video-prior setting. SC, MS, A, I, ISC, and IBC stand for Subject Consistency, Motion Smoothness, Aesthetic, Imaging, I2V Subject Consistency, and I2V Background Consistency, respectively.

FrescoArchive
Method                    | SC    | MS    | A     | I     | ISC   | IBC   | Avg   | Time
DynamicScaler (original)  | 0.945 | 0.975 | 0.693 | 0.706 | 0.893 | 0.930 | 0.857 | 18.45
DynamicScaler* (CVPR'25)  | 0.852 | 0.971 | 0.681 | 0.754 | 0.980 | 0.989 | 0.871 | 10.25
DemoFusion* (CVPR'23)     | 0.960 | 0.987 | 0.734 | 0.752 | 0.990 | 0.994 | 0.903 | 13.5
MultiDiffusion* (ICML'23) | 0.876 | 0.974 | 0.686 | 0.754 | 0.981 | 0.989 | 0.876 | 8.15
FrescoDiffusion           | 0.958 | 0.990 | 0.738 | 0.753 | 0.991 | 0.995 | 0.904 | 8.58
R-FrescoDiffusion         | 0.977 | 0.991 | 0.736 | 0.753 | 0.991 | 0.994 | 0.907 | 9.08

VBench-I2V
Method                    | SC    | MS    | A     | I     | ISC   | IBC   | Avg   | Time
DynamicScaler (original)  | 0.949 | 0.976 | 0.707 | 0.713 | 0.893 | 0.932 | 0.862 | 18.45
DynamicScaler* (CVPR'25)  | 0.904 | 0.985 | 0.621 | 0.701 | 0.988 | 0.991 | 0.865 | 10.25
DemoFusion* (CVPR'23)     | 0.943 | 0.989 | 0.639 | 0.720 | 0.990 | 0.993 | 0.879 | 13.5
MultiDiffusion* (ICML'23) | 0.893 | 0.984 | 0.611 | 0.698 | 0.987 | 0.989 | 0.860 | 8.15
FrescoDiffusion           | 0.933 | 0.989 | 0.632 | 0.716 | 0.989 | 0.992 | 0.875 | 8.58
R-FrescoDiffusion         | 0.956 | 0.989 | 0.634 | 0.712 | 0.988 | 0.991 | 0.878 | 9.08

C.3 Baseline Implementation Details

MultiDiffusion. Our implementation follows the original MultiDiffusion [3] procedure without additional heuristics. The only modifications are (i) replacing the base backbone with Wan2.2 I2V in place of the original denoiser, and (ii) applying a linear-decay blending mask that attenuates each tile's contribution near its borders, reducing seam artifacts when merging overlapping predictions.
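The snippet below sketches one possible form of such a linear-decay blending mask and the corresponding weighted merge of overlapping tile predictions; it is a minimal illustration, not the exact implementation, and all names as well as the `ramp` width are our assumptions.

```python
import torch

def linear_decay_mask(tile_h: int, tile_w: int, ramp: int) -> torch.Tensor:
    """Per-tile blending weight: 1 in the interior, linearly decaying towards
    each tile border over `ramp` pixels (illustrative values)."""
    def ramp_1d(n: int) -> torch.Tensor:
        w = torch.ones(n)
        r = torch.linspace(1.0 / ramp, 1.0, ramp)
        w[:ramp] = r            # top/left ramp
        w[-ramp:] = r.flip(0)   # bottom/right ramp
        return w
    return ramp_1d(tile_h)[:, None] * ramp_1d(tile_w)[None, :]

def merge_tiles(preds, boxes, out_shape, ramp: int = 16) -> torch.Tensor:
    """Weighted average of overlapping per-tile predictions.

    preds: list of (C, h, w) tensors; boxes: list of (top, left) offsets.
    """
    C = preds[0].shape[0]
    acc = torch.zeros((C, *out_shape))
    wsum = torch.zeros((1, *out_shape))
    for p, (top, left) in zip(preds, boxes):
        _, h, w = p.shape
        m = linear_decay_mask(h, w, ramp)
        acc[:, top:top + h, left:left + w] += p * m
        wsum[:, top:top + h, left:left + w] += m
    return acc / wsum.clamp_min(1e-8)
```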
DemoFusion. DemoFusion [12] introduces three techniques for high-resolution image generation. We re-implemented (i) progressive phase upsampling and (ii) skip-residual global guidance, reusing the exact hyper-parameters from the authors' official code to enable a faithful comparison. These parameters control how resolution increases across phases (number of phases, per-phase upsampling factor, and per-scale denoising schedule) and the strength of global-structure guidance during refinement. In contrast, we did not observe reliable gains from DemoFusion-style dilated sampling when transferring it from SDXL's UNet denoiser to Wan2.2's DiT-based denoiser. We hypothesize this is because dilated sampling assumes updates are roughly separable across interleaved sub-lattices, an assumption that fits the local, convolutional structure of UNets but breaks for DiT models with global self-attention. Evaluating Wan2.2 on sparse lattices changes the attention context and likely shifts inputs off-distribution, leading to inconsistent offsets that merge into visible artifacts (seams/checkerboards) rather than improved coherence.

DynamicScaler. We follow the authors' official implementation of DynamicScaler [24] and reuse their released hyper-parameters for both the offset-shifting denoiser and the global motion-guidance module. We condition motion guidance on the same low-resolution video that we use as the FrescoDiffusion regularization prior, so both signals rely on an identical motion reference. For the sliding/rotating denoising window, we set the per-step offset (stride) to half the window size, i.e., a 50% overlap between consecutive windows, which stabilizes stitching across steps and mitigates boundary artifacts.

C.4 Spatial Activity Map Computation

Given an input frame, we compute a spatial activity map using SAM3 [7]. The region to animate can also be explicitly specified by the user; for automated processing and ease of use, we employ a segmentation pipeline that identifies plausible dynamic entities and converts them into a spatial activity map. The activity map is computed once per image, stored at latent resolution, clamped to [0, 1], resized to the current latent size with standard trilinear interpolation, binarized using the test A > 0, and then kept fixed throughout sampling.

Prompt-based segmentation. Our default pipeline queries SAM3 with a fixed set of prompts corresponding to categories that commonly exhibit motion. Specifically, we provide the following textual prompts to the model:

– person
– vegetation
– vehicles

For each prompt, SAM3 predicts candidate spatial masks corresponding to instances of the queried category; the masks are then averaged. Visualizations of such masks are provided in Fig. 10.

Mask extraction and filtering. SAM3 predictions are filtered using a score threshold $\tau_s = 0.45$. We discard masks whose relative area exceeds 0.30 of the image area, as such regions typically correspond to overly coarse detections. We additionally remove masks with excessive boundary contact, defined as cases where more than 80% of the mask pixels lie within a 10-pixel margin of the image border.

Spatial support. For each retained mask, we construct a spatial support region by dilating the mask using a Euclidean distance transform with radius r = 75 pixels. The resulting regions are merged to produce the final spatial activity map.
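The sketch below illustrates the mask filtering and spatial-support construction described above; SAM3 inference is abstracted as a list of (mask, score) pairs, and the interface is our own illustrative choice rather than the SAM3 API.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def build_activity_map(masks_and_scores, image_shape,
                       tau_s=0.45, max_rel_area=0.30,
                       border_margin=10, border_frac=0.80, radius=75):
    """Filter candidate masks and dilate them into a spatial activity map.

    masks_and_scores: list of (HxW boolean mask, confidence score) pairs,
    e.g., as produced by a SAM3-style segmenter (illustrative interface).
    """
    H, W = image_shape
    border = np.zeros((H, W), dtype=bool)
    border[:border_margin], border[-border_margin:] = True, True
    border[:, :border_margin], border[:, -border_margin:] = True, True

    activity = np.zeros((H, W), dtype=bool)
    for mask, score in masks_and_scores:
        area = mask.sum()
        if score < tau_s or area == 0:
            continue
        if area / (H * W) > max_rel_area:                # overly coarse detection
            continue
        if (mask & border).sum() / area > border_frac:   # excessive border contact
            continue
        # Spatial support: keep every pixel within `radius` of the mask,
        # via the Euclidean distance transform of the mask complement.
        support = distance_transform_edt(~mask) <= radius
        activity |= support                              # merge retained regions
    return activity.astype(np.float32)
```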
Exploratory variants. We explored two extensions of this pipeline. First, we experimented with extending the spatial mask across time: using SAM3 together with the generated prior video, the mask predicted on the input frame was propagated to cover the full temporal extent of the video, as shown in Fig. 11. Second, we evaluated a variant in which the prompts provided to SAM3 are generated automatically by a vision-language model. In this setup, Qwen3-VL-32B [1] analyzes both the input image and the generated prior video to produce textual prompts corresponding to entities that could plausibly support animation. These prompts are then used to condition SAM3, while mask extraction, filtering, and spatial support construction remain identical to the default pipeline. In practice, neither the temporal mask propagation nor the VLM-guided prompting produced measurable improvements over the fixed-prompt approach. As both variants introduce additional computational overhead, they were not used for the final results reported in the paper.

Fig. 10: Additional qualitative examples of spatial activity maps obtained with SAM3. Each image shows the overlay used to identify regions likely to contain dynamic content, which then guide the active/inactive prior regularization in R-FrescoDiffusion.

Fig. 11: Two qualitative examples showing the temporal masks obtained with SAM3 at frames 1, 40, and 80. The overlaid video is the prior generated with the Wan model.

D FrescoDiffusion Additional Examples

Fig. 12: Additional examples obtained on our FrescoArchive dataset (frames 1, 48, and 80). The first two rows are obtained with FrescoDiffusion and the last two rows with R-FrescoDiffusion; columns show the same frame indices.

Fig. 13: Additional examples obtained on the VBench I2V dataset (frames 1, 48, and 80). The first three rows are obtained with FrescoDiffusion and the last two rows with R-FrescoDiffusion; columns show the same frame indices.

E FrescoDiffusion closed-form solution in the noise-prediction setting

Our proposed approach can also be used with $\epsilon$-prediction diffusion models. We modify the FrescoDiffusion loss in Eq. (5) to include the one-step approximation of the $\epsilon$-diffusion formulation:

$$\ell_{\mathrm{FD}}(y^{\star}; t) = \Big\| \sqrt{\lambda} \odot \Big( \tfrac{1}{\sqrt{\alpha_t}} \big( x^{4K}_t - \sqrt{1 - \alpha_t}\, y^{\star} \big) - x^{\mathrm{prior}} \Big) \Big\|_2^2 + \ell_{\mathrm{MD}}(y^{\star}; t), \qquad (9)$$

where $\alpha_t = \prod_{i=1}^{t} (1 - \beta_i)$, the $\beta_t$ are the schedule variances, and the one-step approximation of the clean sample is given by $\tfrac{1}{\sqrt{\alpha_t}} \big( x^{4K}_t - \sqrt{1 - \alpha_t}\, y^{\star} \big)$. Setting the derivative of Eq. (9) with respect to $y^{\star}$ to zero and solving yields the optimal noise prediction:

$$y_{\mathrm{FD}}(x^{4K}_t) = \frac{ \sqrt{\tfrac{1 - \alpha_t}{\alpha_t}}\, \lambda \odot \big( \tfrac{1}{\sqrt{\alpha_t}} x^{4K}_t - x^{\mathrm{prior}} \big) + \sum_{i=1}^{n} w_i \odot y_i }{ \tfrac{1 - \alpha_t}{\alpha_t}\, \lambda + \sum_{i=1}^{n} w_i }. \qquad (10)$$

We adopt the same variable definitions as in the main text for the current noisy state $x^{4K}_t$, the prior $x^{\mathrm{prior}}$, the weighting tensors $w_i$, and the prior regularization strength $\lambda$. As in the flow-matching formulation, when $\lambda = 0$, $y_{\mathrm{FD}}$ reduces to $y_{\mathrm{MD}}$.
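For concreteness, the sketch below applies the closed-form update of Eq. (10) to full-resolution latent tensors, assuming the per-tile predictions have already been scattered onto the global latent grid; variable names and the tile-handling interface are illustrative, not the released code.

```python
import torch

def fused_eps_prediction(x_t, x_prior, tile_preds, tile_weights, lam, alpha_bar_t):
    """Closed-form epsilon-space fusion of Eq. (10).

    x_t:          current noisy 4K latent x^4K_t
    x_prior:      upsampled prior latent x^prior (same shape as x_t)
    tile_preds:   list of per-tile noise predictions y_i, scattered to the
                  full latent grid (zeros outside their tile)
    tile_weights: list of matching tile-merging weight tensors w_i
    lam:          prior regularization strength lambda (per-pixel tensor)
    alpha_bar_t:  cumulative product alpha_t = prod_i (1 - beta_i) at step t
    """
    ratio = (1.0 - alpha_bar_t) / alpha_bar_t
    # Prior-driven term: sqrt((1-a)/a) * lambda * (x_t / sqrt(a) - x_prior).
    num = ratio ** 0.5 * lam * (x_t / alpha_bar_t ** 0.5 - x_prior)
    den = ratio * lam
    # Standard tile-merging term: sum_i w_i * y_i over the denominator sum_i w_i.
    for y_i, w_i in zip(tile_preds, tile_weights):
        num = num + w_i * y_i
        den = den + w_i
    return num / (den + 1e-8)
```

As in Eq. (10), setting lam to zero removes the prior term and recovers the plain weighted tile average y_MD.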