FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion

Hugo Caselles-Dupré 1*, Mathis Koroglu 1,2*, Guillaume Jeanneret 2*, Arnaud Dapogny 2, and Matthieu Cord 2

1 Obvious Research, Paris, France
2 Institute of Intelligent Systems and Robotics - Sorbonne University, Paris, France
* These authors contributed equally.

Project website: https://f2v.pages.dev/

Abstract. Diffusion-based image-to-video (I2V) models are increasingly effective, yet they struggle to scale to ultra-high-resolution inputs (e.g., 4K). Generating videos at the model's native resolution often loses fine-grained structure, whereas high-resolution tiled denoising preserves local detail but breaks global layout consistency. This failure mode is particularly severe in the fresco animation setting: monumental artworks containing many distinct characters, objects, and semantically different sub-scenes that must remain spatially coherent over time. We introduce FrescoDiffusion, a training-free method for coherent large-format I2V generation from a single complex image. The key idea is to augment tiled denoising with a precomputed latent prior: we first generate a low-resolution video at the underlying model resolution and upsample its latent trajectory to obtain a global reference that captures long-range temporal and spatial structure. For 4K generation, we compute per-tile noise predictions and fuse them with this reference at every diffusion timestep by minimizing a single weighted least-squares objective in model-output space. The objective combines a standard tile-merging criterion with our regularization term, yielding a closed-form fusion update that strengthens global coherence while retaining fine detail. We additionally provide a spatial regularization variable that enables region-level control over where motion is allowed. Experiments on the VBench-I2V dataset and our proposed fresco I2V dataset show improved global consistency and fidelity over tiled baselines, while being computationally efficient. Our regularization enables explicit control of the trade-off between creativity and consistency.

Keywords: 4K Image-to-Video · Prior-Regularized Training-Free Generation · Scheduled-Gated Regularization

1 Introduction

Recently, media has shifted from images to videos, and now to ultra-high-definition videos, i.e., resolutions around 4K (3840 × 2160). In many applications, especially in the creative industries, such as cinema, animation, projections and art, there is a need for tools that can process and generate 4K content with sufficient detail. One example is fresco animation, which is the focus of this work. In contrast to standard images, we define a fresco as a large-scale image containing multiple scenes (i.e., up to hundreds) that blend seamlessly into a coherent visual.

Fig. 1: FrescoDiffusion animates a single ultra-high-definition image (3500 × 3500) at the same resolution. We show three frames from the generated video. The red box marks a fixed spatial region tracked across time, illustrating temporal motion consistency and fine-detail preservation.

For animating arbitrary pictures, diffusion models [2, 4, 17, 28, 31] have become the dominant approach for video synthesis, producing strong results in both text-to-video (T2V) and image-to-video (I2V) applications.
While T2V models generate videos that faithfully follow an input prompt, I2V models follow the same paradigm but start from a first frame provided by the user. Despite the progress in I2V, the most advanced models [14, 32, 33] operate under constraints that do not gracefully allow large-scale I2V. On the one hand, running regular video diffusion models on high-resolution images resized to their native spatial regime yields insufficient detail, as the input is too large and complex to be represented at the standard resolution of video diffusion models. On the other hand, memory and compute grow steeply with the spatial and temporal dimensions, preventing the use of these models on 4K images. Scaling image-to-video to very large videos containing many distinct regions remains largely unexplored in the literature. Existing work tries to address both issues, but the fresco setting makes this especially hard: simple tiling [3, 24] introduces cross-tile drift and visible seams, and post-hoc video super-resolution [15, 23] cannot create scene-level content that never existed at the model's native scale. A key difficulty specific to fresco animation is that different regions of the image play fundamentally different roles over time. Frescoes typically contain numerous loosely coupled scenes and characters: some regions are visually static and primarily define architectural or pictorial context, while others are semantically active and expected to exhibit motion. This observation motivates a region-aware treatment of global coherence.

In this paper, we introduce FrescoDiffusion to achieve 4K I2V up to the fresco setting, as illustrated in Fig. 1. Although animating the full-resolution fresco directly is not feasible, video models can animate a resized thumbnail of the same image at the model's native scale. Our core idea is to exploit this and use the resized thumbnail as a prior that guides the high-resolution denoising of the initial fresco through a novel loss. This loss admits a closed-form solution that allows control over the balance between creativity (tiled denoising) and prior alignment. Moreover, to deal with frescoes specifically, we introduce a tool that adapts the generation process differently in active zones and in the background using a mask, giving the generative model enough flexibility to animate the active zones while keeping the background close to the prior. In our experiments, we show the superiority, both in quality and computation time, of FrescoDiffusion over other tiled-denoising methods through extensive qualitative and quantitative evaluation and a user preference study, on our novel FrescoArchive dataset and the standard VBench 4K dataset [19]. Videos generated with FrescoDiffusion are provided in the supplementary material, together with an interactive webpage for visualizing them conveniently.

To summarize, our contributions are as follows:

1. We introduce FrescoDiffusion, a novel training-free approach for 4K I2V generation that outperforms existing baselines in performance and efficiency.
2. To demonstrate its application in an artistic domain, we propose a dynamic prior-strength schedule together with a spatial gating mechanism that enables controlled trade-offs between creativity and prior similarity, which is especially useful in the fresco-to-video task.
3. We propose FrescoArchive, a new I2V dataset composed of complex multi-scene images for fresco-to-video evaluation.

To contribute further to 4K I2V, we will make our algorithm open-source and release our dataset upon acceptance to promote further research.

2 Related Work

Diffusion-based video generation. Early text-to-video systems such as Make-A-Video [31] and Imagen Video [17] generate short clips by cascading a base video diffusion model with spatial and temporal super-resolution modules. Subsequent work extends latent diffusion models [28] to videos, e.g., Video Latent Diffusion Models (Video LDM) [5], which map videos to a compressed latent space for efficient high-resolution text-to-video generation, and Lumiere [2], whose space-time U-Net jointly denoises all frames. Recent foundation-scale models such as Wan [33], Mochi [32] and LTX-Video [14] push video latent diffusion to higher quality and longer durations. Image-to-video (I2V) generation is supported by most of these recent models. All of these methods, including commercial systems (such as Kling, Sora2, Gen4.5 or Veo3), operate only at their native spatial resolutions (typically 480-1080p) and video lengths of a few seconds. In contrast, we formulate a training-free approach that takes an existing image-to-video diffusion model and extends it beyond its native resolution, bringing additional creative control.

Video super-resolution. Video super-resolution (VSR) seeks to reconstruct high-resolution videos from low-resolution inputs, typically with strong temporal coherence. While these methods [8-10, 15, 22, 35, 39] excel at producing globally coherent, temporally consistent videos, they are fundamentally constrained to staying close to the information present in the low-resolution input. Their objectives encourage fidelity to the input video under distortion and perceptual metrics, and any hallucinated detail is limited to local texture refinement. Starting from a 480p or 720p animation of a fresco, VSR can only upscale and slightly enrich this coarse representation; it cannot create hundreds of semantically distinct, fully resolved scenes that were never visible in the low-resolution video. In contrast, FrescoDiffusion performs tiled denoising directly on a large latent canvas and uses a thumbnail animation purely as a prior in latent space, allowing each tile to carry as much semantic content as a native-resolution video while maintaining global coherence.

Training-free high-resolution tiled denoising. A line of work studies generating large images and videos from pre-trained diffusion models without additional training. MultiDiffusion [3] fuses overlapping diffusion trajectories via a weighted least-squares objective, enabling large-scale image generation with many scenes, but without explicit global coherence. Several approaches build upon this idea to improve spatial consistency. Mixture of Diffusers [21] runs multiple regional diffusions on a shared canvas to control high-resolution composition, while DiffCollage [37] formulates generation as a factor graph over patches and overlaps. SpotDiffusion [13] further reduces memory by denoising disjoint windows over time, trading overlap for efficiency. Recent tuning-free methods instead modify the sampling procedure of a single diffusion model to scale resolution.
ScaleCrafter [16] and DemoFusion [12] progressively enlarge the effective receptive field through re-dilation, dispersed convolutions, or staged upscaling, treating the full canvas as a single sample rather than independent tiles.

Closer to our work, DynamicScaler [24] proposes an offset-shifting denoising strategy for panoramic and 360° video generation, where spatial windows are shifted across denoising steps to synchronize content and motion across large fields of view. It employs a global motion guidance stage based on a low-resolution video to stabilize large-scale motion patterns. This method is the closest to ours because it uses training-free tiling and staged sampling with global guidance from a low-resolution video to scale diffusion models to large spatial videos. However, DynamicScaler was designed specifically for generating 360° videos and does not natively allow navigating the trade-off between creativity and prior similarity. In contrast, FrescoDiffusion is designed to generate a multi-scene high-resolution video by carefully and controllably allowing new details and movement to appear.

Fig. 2: Overview of FrescoDiffusion. Starting from a 4K fresco image, we first build a global latent prior by resizing the image to the native input size of the image-to-video backbone. Next, we upsample the prior latents x_prior to fit the 4K image size. We then apply tiled denoising to the large latent canvas x_t^4K, obtaining per-tile flow predictions {y_i}. We then use {y_i} and x_prior to compute the optimal output velocity field (Eq. (6)) according to our loss ℓ_FD (Eq. (5)). This updated field is then used to update the large latent canvas x_t^4K with the flow-matching scheduler.

3 FrescoDiffusion Method

In this section, we introduce FrescoDiffusion, a method tailored to multi-scene 4K I2V. Our proposed approach consists of two steps, as shown in Fig. 2. First, we compute a prior thumbnail to guide the diffusion process. Second, during the denoising process, we analytically minimize the energy loss to produce the optimal fused output (Sec. 3.2), which then guides the diffusion process. During this step, we optionally employ our novel masking strategy to direct the diffusion toward the prior by explicitly indicating which regions should be modified and which should converge to the prior (Sec. 3.3).

3.1 Ultra-high-definition tiled denoising baseline framework

We start by introducing our baseline and notation. Let t ∈ [0, 1] be the timestep, c be the conditioning tuple (input image and prompt) for the I2V flow-matching model f_θ, and x_t ∈ R^(C×T×H×W) be the latent state. At each t, the model predicts a velocity field

  y(x_t) = f_θ(x_t, t, c).   (1)

Starting from x_{t=0} ~ N(0, I), a scheduler integrates these velocities up to t = 1. The resulting latent x_1 is then decoded to produce the video.
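For readers unfamiliar with this sampling interface, the following is a minimal, non-authoritative sketch of a flow-matching Euler loop consistent with Eq. (1) and with the one-step clean-latent prediction x_t − σ_t·y used later in Eq. (5). The function f_theta, the sigma schedule, and the tensor shapes are illustrative placeholders and do not reflect the Wan2.2 API.

```python
import torch

def euler_flow_matching_sample(f_theta, c, shape, sigmas):
    """Integrate a velocity-predicting model from pure noise (sigma = 1) to data (sigma = 0).

    f_theta(x, sigma, c) -> velocity with the same shape as x (placeholder signature).
    sigmas: decreasing noise levels, e.g. torch.linspace(1.0, 0.0, num_steps + 1).
    At any step, the clean-latent estimate used by Eq. (5) is x - sigma * v.
    """
    x = torch.randn(shape)                    # x_{t=0} ~ N(0, I)
    for sigma_cur, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = f_theta(x, sigma_cur, c)          # velocity field, Eq. (1)
        x = x + (sigma_next - sigma_cur) * v  # Euler update of the flow-matching ODE
    return x                                  # final latent, decoded by the VAE afterwards
```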
Next, we introduce MultiDiffusion [3] (MD), adapted to I2V as in prior work [24]. MD generates high-resolution latent codes, i.e., x_1^4K ∈ R^(C×T×H_4K×W_4K) with H_4K ≫ H and W_4K ≫ W, by running f_θ on overlapping tiles of a large "canvas" x_t^4K and merging the tile-wise predictions. More specifically, let x_0^4K be the initial 4K latent canvas. Let C_p crop a window of shape (C, T, H, W) at position p, and let P_p zero-pad a tile back to shape (C, T, H_4K, W_4K) at coordinates p. For each tile i, define the tile prediction

  y_i(x_t^4K) = P_{p_i}( y( C_{p_i}( x_t^4K ) ) ),   (2)

where p_i is the coordinate of the i-th tile. Given x_t^4K and the tiled velocity predictions {y_i(x_t^4K)}_{i=1}^{n}, MD solves for a single merged velocity y⋆ that best matches these overlapping tile predictions by minimizing the loss

  ℓ_MD(y⋆; t) = Σ_{i=1}^{n} || √w_i ⊙ ( y⋆ − y_i(x_t^4K) ) ||_2^2,   (3)

where n is the number of windows, p_i are the coordinates, and w_i are weight maps of window i used to reduce seams between tiles. This loss admits the closed-form solution

  y_MD(x_t^4K) = ( Σ_{i=1}^{n} w_i ⊙ y_i(x_t^4K) ) / ( Σ_{i=1}^{n} w_i ).   (4)

Finally, y_MD is used to update the canvas x_t^4K with the standard iterative flow-matching sampling process.

3.2 FrescoDiffusion: Prior-Regularized Tile Fusion

MD provides a solution for merging overlapping windows using a weighted sum. However, MD lacks the ability to regularize window merging with an existing prior, such as the initial frame, to create a cohesive scene. To this end, we propose to extend ℓ_MD with a novel regularization term ℓ_prior. Our new FrescoDiffusion loss is

  ℓ_FD(y⋆; t) = || √λ ⊙ ( x_t^4K − σ_t y⋆ − x_prior ) ||_2^2 + ℓ_MD(y⋆; t),   (5)

where the first term is ℓ_prior(y⋆; t, x_prior). Our loss is composed of two terms. On the one hand, ℓ_MD reduces the disparity between the outputs of the shifting windows. On the other hand, ℓ_prior minimizes the dissimilarity between the current-step prediction of the clean latent, i.e., (x_t^4K − σ_t y) in the flow-matching formulation, and the prior x_prior ∈ R^(C×T×H_4K×W_4K). Here, λ is a regularization variable that can be either a constant (λ ∈ R) or a tensor (λ ∈ R^(C×T×H_4K×W_4K)), whose design is discussed in the next section. Additionally, σ_t denotes the scheduler's discrete noise standard deviation at step t. Note that for other diffusion formulations, we only need to adapt the corresponding one-step prediction; see Sec. E.

Equation (5) is separable across canvas coordinates and strictly convex. Thus, the unique minimizer can be found in closed form by setting the derivative to zero. The prior-regularized fused velocity is therefore

  y_FD(x_t^4K) = ( σ_t · λ ⊙ ( x_t^4K − x_prior ) + Σ_{i=1}^{n} w_i ⊙ y_i(x_t^4K) ) / ( σ_t^2 · λ + Σ_{i=1}^{n} w_i ).   (6)

We note that when λ = 0, our closed-form solution in Eq. (6) reduces to the MultiDiffusion fusion in Eq. (4). To create the global prior for x_t^4K, we resize the input fresco to the model's native spatial size and generate a small image-to-video sequence. Then, we perform a per-frame trilinear upscale in latent space (to the large canvas size). Fig. 2 illustrates FrescoDiffusion's generation process, complemented by an algorithm in Sec. A.2.
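For concreteness, the closed-form fusion of Eq. (6) can be written as a single tensor expression. The following is a minimal, non-authoritative sketch assuming the per-tile velocities have already been zero-padded to canvas shape and the weight maps are broadcastable; the names are illustrative and do not come from a released implementation.

```python
def fused_velocity(tile_velocities, tile_weights, x_canvas, x_prior, sigma_t, lam, eps=1e-8):
    """Prior-regularized tile fusion, Eq. (6), on torch tensors (or NumPy arrays).

    tile_velocities: list of zero-padded per-tile predictions P_{p_i}(y(C_{p_i}(x_canvas))).
    tile_weights:    list of matching weight maps w_i (zero outside tile i).
    lam:             scalar or tensor prior strength; lam = 0 recovers MultiDiffusion, Eq. (4).
    """
    num = sigma_t * lam * (x_canvas - x_prior)   # prior term of the numerator
    den = (sigma_t ** 2) * lam                   # prior term of the denominator
    for y_i, w_i in zip(tile_velocities, tile_weights):
        num = num + w_i * y_i                    # accumulate weighted tile velocities
        den = den + w_i                          # accumulate weights
    return num / (den + eps)                     # fused velocity y_FD(x_canvas)
```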
3.3 Spatio-Temporal Prior Strength Scheduling

We now discuss the design of the prior strength λ in Eq. (5). The parameter λ addresses two objectives:
(i) Remain structurally close to the prior while allowing the creation of new details.
(ii) Treat spatial regions differently in the case of frescoes: background regions are supposed to remain structurally stable, while other, active regions should be animated.

To attain objective (i), a high value of λ is desirable in the initial stages of diffusion because it directs the model to remain close to the prior. Conversely, a low value of λ is desirable in the final stages of diffusion to add details to the final video. We thus propose to model λ as a globally gated, decreasing schedule of the diffusion step,

  λ_G(t, τ) = λ_base · cos(tπ/2) · 1[t ≤ τ],   (7)

where τ is the gating and λ_base is the strength of the regularization. When λ = λ_G ∈ R, we name our method FrescoDiffusion.

To reach objective (ii), we compute a spatial activity map A(p) that differentiates active zones from the background. Let A(p) ∈ {0, 1} be a binary map, in which A(p) = 1 denotes active regions at position p (e.g., characters or local scenes expected to move) and A(p) = 0 denotes structurally static regions. Also, let τ_act and τ_bg be two temporal cutoffs, with τ_act ≤ τ_bg, which control the application of the prior to active and background regions, respectively. Hence, our prior strength factor becomes

  λ_R(t, p) = λ_G(t, τ_act) if A(p) = 1 (pixel p in the foreground),
              λ_G(t, τ_bg)  if A(p) = 0 (pixel p in the background).   (8)

Fig. 3: (Top) MSE between the foreground/background regions and the prior. (Bottom) Schedule for both regions.

Note that λ_R is a tensor with shape (C × T × H_4K × W_4K). We refer to this variant (λ = λ_R) as Regional-FrescoDiffusion (R-FrescoDiffusion). This design choice is motivated by the example in Fig. 3, where we show the mean squared error (MSE) between the prior noised to the same timestep and the current latents x_t^4K. The gated design enforces global coherence early in the sampling steps (t ≤ τ_act), then progressively relaxes the prior, first in active regions (τ_act < t ≤ τ_bg), allowing motion and novel detail to emerge. Background regions remain constrained longer to preserve large-scale structure. In late steps (t > τ_bg), the prior is fully disabled everywhere, and sampling focuses purely on fine-detail refinement. As a result, the background remains much closer to the prior (lower MSE) than the foreground. In addition, the coefficient λ_base ≥ 0 controls how strongly we adhere to the prior versus letting tiled denoising add new detail: a large λ_base favors faithfulness to x_prior, while a small λ_base allows more creativity.
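A minimal sketch of the schedules in Eqs. (7) and (8), assuming a precomputed binary activity map at latent resolution; names and shapes are illustrative and not taken from a released implementation.

```python
import math

def lambda_gated(t, tau, lam_base):
    """Global gated schedule lambda_G(t, tau), Eq. (7); t and tau lie in [0, 1]."""
    return lam_base * math.cos(t * math.pi / 2.0) * (1.0 if t <= tau else 0.0)

def lambda_regional(t, activity_map, tau_act, tau_bg, lam_base):
    """Regional schedule lambda_R(t, p), Eq. (8).

    activity_map: binary array (1 = active/foreground, 0 = static background),
                  broadcastable to the latent canvas shape (C, T, H_4K, W_4K).
    """
    lam_fg = lambda_gated(t, tau_act, lam_base)   # prior strength for active regions
    lam_bg = lambda_gated(t, tau_bg, lam_base)    # prior strength for the background
    return activity_map * lam_fg + (1.0 - activity_map) * lam_bg
```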
4 Dataset and Evaluation Protocols

Dataset. We use the Image Suite of VBench [19] as our first 4K-I2V dataset. This dataset focuses on one or a few objects per image, whereas we are looking for frescoes, i.e., complex images with multiple intricate scenes, to evaluate our method thoroughly. Therefore, we propose a new dataset named FrescoArchive for I2V techniques at fresco scale. Starting with the LAION-2B Aesthetic Subset [29], we filtered the images based on criteria such as pixel count, aesthetics, watermark and NSFW scores. Next, we performed text-based filtering, followed by zero-shot classification to detect frescoes. Subsequently, we deduplicated [20] the dataset and generated captions using Qwen3-VL-32B [1] with both the image and the LAION caption. Ultimately, we manually selected 371 pairs to achieve the best possible image-caption match. We provide details on this process, along with statistics, in Sec. B. This dataset is used exclusively for validation.

Evaluation Metrics. We employ a user study at full resolution to quantify human preference over the baselines. We used Amazon Mechanical Turk to conduct the study. Participants were carefully filtered to avoid bots and lower-quality evaluators: we required Masters status and a task approval rate above 85% to enroll. Participants were shown pairs of videos generated by two competing methods from the same input image of the FrescoArchive dataset. Videos were displayed at identical resolution and duration, with randomized ordering and no method identification. We report preference percentages with 95% confidence intervals. For each pair, participants answered two binary-choice questions:
– Animation fidelity: Which video most closely resembles a fresco artwork that has been smoothly and naturally animated?
– Motion plausibility: Which video provides the most convincing animation of the input image, with appropriate and perceptible motion?

We also evaluate our method on the VBench protocols [18, 19, 38], following standard studies [24]. These metrics compute the similarity between the input image and each frame, as well as the similarity between consecutive frames. They cover several criteria: Subject Consistency, Motion Smoothness, Aesthetic Score, and Imaging Quality. We complement them with VBench's I2V metrics (Video-Image Subject Consistency and Video-Image Background Consistency), which measure the similarity between the input image and the video.

While VBench is the standard for 480p/1080p, it provides incomplete 4K-I2V evaluation because it downsizes videos to fit the requirements of its metric models (DINOv2 [25] and CLIP [27]). We thus complement our testbed with three metrics that specifically target 4K-I2V. (i) We measure sharpness at full scale to quantify fine-detail generation using the Tenengrad [26] function. (ii) We use a simple yet efficient Temporal Consistency metric, the mean squared error between consecutive downsized frames, to quantify differences between frames. (iii) We compute a prior similarity metric using DINOv3 [30], which, unlike the DINOv2 used in VBench, is not limited to 1080p resolution. Each metric is thoroughly explained in Sec. C.1.

5 Experiments

We provide implementation details, then present qualitative and quantitative experiments, and finally experiments characterizing our model's behavior.

5.1 Implementation Details and Baselines

Video generation backbone. All experiments are conducted with the Wan2.2-I2V [33] 14B-parameter model, a state-of-the-art open video diffusion model that natively operates at spatial resolutions of 480 × 832 and up to 720 × 1280 pixels. To speed up inference, we used TurboDiffusion [36]'s LoRA to reduce the number of steps, making large-scale experimentation feasible on standard hardware. FrescoDiffusion's implementation details are provided in Sec. A.1.

Baselines. We compare our method to three tiled-diffusion methods: MultiDiffusion [3], DemoFusion [12] (a state-of-the-art tiled image diffusion method adapted to a video setup), and DynamicScaler [24] (a state-of-the-art tiled video diffusion method). All implementation details are available in Sec. C.3.
For a fair comparison, we adapted these baselines to Wan2.2's backbone to avoid any differences coming from base-model performance. Similarly, and unless stated otherwise, all compared methods use identical prompts, sampling steps, guidance scales, random seeds, tile sizes, and overlap between tiles when applicable. Method-dependent parameters are set identical to the authors' code.

Fig. 4: Overlay of the spatial activity map onto the input fresco.

Spatial activity maps. R-FrescoDiffusion uses a spatially gated prior schedule to differentiate active regions from the structurally static background. For each input image, we compute the activity map A(p) (see Sec. 3.3) using the Segment Anything Model 3 (SAM3) [7], as illustrated in Fig. 4. We apply SAM3 with a fixed set of prompts, producing a binary activity map in [0, 1], which is downsampled to latent resolution and used directly in Eq. (8). See details in Sec. C.4. These activity maps are computed once per input image and remain fixed throughout sampling, without additional learnable parameters.

5.2 Qualitative Evaluation

Fig. 5: A qualitative comparison on fresco-scale inputs. FrescoDiffusion generates coherent global scenes and animates details at a local level. By contrast, DemoFusion, DynamicScaler and MultiDiffusion only manage to produce either coherent scenes or high-quality details, but not both.

We begin with a qualitative study. In Fig. 5, we show the first, 40th, and last frame for each model, along with central and corner crops to reveal fine-grained structure. DemoFusion preserves the global structure; yet some elements are modified, such as the path in the center crop and the hut in the corner crop, and visible wobbling is present when the video is playing. MultiDiffusion and DynamicScaler tend to introduce excessive novel content across tiles or sampling stages, which results in an accumulation of structural inconsistencies and a loss of temporal coherence. In contrast, FrescoDiffusion produces high-quality videos while preserving the general layout of the input. Later, in Sec. 5.4, we discuss the differences between FrescoDiffusion and its regional counterpart from a qualitative perspective. We strongly encourage the reader to explore more results in the supplementary material (web visualization recommended) and in the appendix.

5.3 State-of-the-art quantitative comparison

In this subsection, we quantitatively compare our method with the state of the art on both user-preference and automatic metrics.

High-Resolution Evaluation: User Study. We start with a user study to quantitatively measure human preference. Our study totals 1344 ratings over 47 participants. Table 1 shows that both of our methods are strongly preferred over DynamicScaler and MultiDiffusion, with preference rates of 84-93% across both evaluation criteria, confirming that these baselines produce noticeably lower-quality animations. Against DemoFusion, R-FrescoDiffusion achieves a statistically significant preference of 69%. FrescoDiffusion reaches a 54% preference rate, which does not fully qualify as an advantage given the confidence intervals. Across all comparisons, results are consistent between the motion and fidelity questions.
Finally, R-FrescoDiffusion is preferred over FrescoDiffusion in 58% of comparisons, indicating that the regional regularization provides the intended perceptual improvement over the base method in the case of frescoes.

Table 1: User study results. Human preference rates (% of annotators preferring our method over each baseline). Green cells indicate statistically significant preference. All reported preference rates are computed with 95% confidence intervals of at most ±6% (binomial proportion test, n = 192 per comparison).

                      FrescoDiffusion            R-FrescoDiffusion
                      Motion  Fidelity  Avg.     Motion  Fidelity  Avg.
vs. DemoFusion        56%     52%       54%      68%     70%       69%
vs. DynamicScaler     84%     92%       88%      89%     90%       89%
vs. MultiDiffusion    91%     93%       92%      88%     92%       90%
vs. each other        40%     44%       42%      60%     56%       58%

Standard Low-Resolution I2V Metrics. Next, as a sanity check, we compare our approach with modern methods (Tab. 2) using standard lower-resolution I2V metrics (VBench and VBench-I2V). These metrics are not designed for 4K videos, as the evaluated videos are downscaled aggressively (roughly 10-20×) to fit the backbones' resolution. As a result, the reported scores provide only a coarse proxy for performance at the original 4K resolution. We perform this evaluation on the FrescoArchive and 4K Image Suite VBench datasets, using both the regional and non-regional configurations. On the FrescoArchive dataset, our method slightly outperforms the baselines on average. On the Image Suite VBench dataset, our approach performs second best on average, slightly outperformed by DemoFusion. Full results for each metric are available in Sec. C.2. Note that our re-implementation of DynamicScaler performs better than the original, as the I2V backbone in their original implementation, VideoCrafter [11], is much older and clearly underperforms Wan2.2. Our conclusion on this benchmark is consistent with what we observed in our user study and qualitatively: we outperform all methods on the fresco-to-video task and are competitive with DemoFusion on the 4K-I2V task.

Table 2: Standard low-resolution I2V metrics. The table shows the average performance (higher is better) using the VBench evaluation suite on both the FrescoArchive and VBench-I2V image sets. We also display the average generation time in minutes.

Method                      FrescoArchive   VBench-I2V   Time (min)
DynamicScaler (original)    0.857           0.862        18.45
DynamicScaler* (CVPR'25)    0.871           0.865        10.25
DemoFusion* (CVPR'23)       0.903           0.879        13.5
MultiDiffusion* (ICML'23)   0.876           0.860        8.15
FrescoDiffusion             0.904           0.875        8.58
R-FrescoDiffusion           0.907           0.878        9.08

Computational efficiency. We report the average runtime over all runs on both datasets on a single H100 GPU. FrescoDiffusion and R-FrescoDiffusion outperform all baselines. MultiDiffusion is excluded from this comparison, as it is the core tiled-denoising procedure used by all baselines. Our methods are at least 45% faster than DemoFusion. Our DynamicScaler implementation is nearly twice as fast as the original, yet remains slower than our methods.

5.4 Controlling FrescoDiffusion

We perform an ablation study of FrescoDiffusion's components to justify their design, and show how they allow creative control in the generation process.

Table 3: Prior strength schedule ablation study on FrescoArchive. Best in bold. The results suggest that including the schedule, the gating, and the spatial regularization enhances the quantitative performance.
Method              λ function           SC     MS     A      I      ISC    IBC    Avg
MultiDiffusion      0                    0.876  0.974  0.686  0.754  0.981  0.989  0.876
FrescoDiffusion     λ_base               0.942  0.991  0.645  0.598  0.979  0.983  0.856
FrescoDiffusion     λ_base cos(tπ/2)     0.946  0.991  0.724  0.730  0.987  0.992  0.895
FrescoDiffusion     λ_G (Eq. (7))        0.958  0.990  0.738  0.753  0.991  0.995  0.904
R-FrescoDiffusion   λ_R (Eq. (8))        0.977  0.991  0.736  0.753  0.991  0.994  0.907

Fig. 6: Regional constraint. "Prior" shows an overlay of the activity map. The red/blue boxes mark the background/foreground regions. Our regional loss forces the generation towards the prior in background regions while allowing new details to appear in foreground regions.

Ablation of the prior strength schedule. We ablate the design of the spatio-temporal prior strength schedule λ(t, p), using VBench metrics on the FrescoArchive dataset. We start with MultiDiffusion (no λ) and add a constant regularization λ = λ_base = 1.5, which actually worsens performance. Next, we add the cosine schedule λ = λ_base cos(tπ/2) ∈ R and obtain significant gains over MD. Then, we build FrescoDiffusion by setting λ = λ_G (see Eq. (7)), and we finish with R-FrescoDiffusion by setting λ = λ_R (see Eq. (8)), which yields the best measured performance. Qualitatively, we show the difference between FrescoDiffusion and R-FrescoDiffusion in Fig. 6. With our regional regularization, R-FrescoDiffusion stays closer to the prior in background regions, as intended. We provide further visualization of this effect in Sec. D.

λ-Controlled Pareto Trade-off Between Creativity and Prior Similarity. Creativity and prior similarity are essentially opposing objectives: one cannot be improved without hurting the other. This inherent trade-off creates a Pareto frontier composed of the set of optimal compromises between the two objectives. To navigate this frontier using λ_G(t, τ) (Eq. (7)), we linearly vary the prior strength (λ_base ∈ [0, 5]) and the temporal gating (τ ∈ [0, 1]). To represent the creativity objective we use the sharpness metric as a proxy, and for prior similarity we use both the temporal consistency and the prior similarity metrics presented in Sec. 4. The results in Fig. 7 show two expected behaviors. (i) When λ_base and the temporal gating τ increase, the outputs converge to the prior. (ii) Conversely, when those parameters decrease, we reach the same performance as MultiDiffusion (full creativity, no prior). The curve formed between these two extremes constitutes the Pareto frontier, which allows a trade-off between the two objectives. Thus, FrescoDiffusion allows full control over this crucial trade-off.

Fig. 7: Quantitative evaluation of the trade-off between creativity and prior similarity controlled by λ. (a) Temporal consistency versus sharpness, illustrating the Pareto frontier between preserving temporal coherence and maintaining high image sharpness. (b) Evolution of prior similarity over time, showing how increasing prior strength and temporal gating progressively align the generated outputs with the prior.

6 Limitations and Conclusion

Limitations. FrescoDiffusion relies on the availability of a meaningful low-resolution prior.
When the input image is extremely large, the prior may fail to capture sufficient global structure, limiting our method. One possible extension would be to construct multiple local priors, at the cost of reduced global coherence. Moreover, as a tiled denoising approach, FrescoDiffusion is inherently computationally expensive. We mitigate this overhead through reduced-step sampling (a 6-step LoRA), low-precision arithmetic (FP8), and compiler-level optimizations (torch.compile). Improving efficiency while preserving visual fidelity remains an important direction for future work.

Conclusion. FrescoDiffusion is a simple, effective, training-free solution that uses existing video diffusion models to animate 4K, multi-scene images. The method combines tiled denoising with a latent prior derived from a thumbnail animation to preserve global coherence while introducing local detail at large scales. It uses fewer computational resources than the baselines and outperforms them consistently in both quantitative metrics and user preference studies. Our approach allows easy adjustment of the balance between creativity and fidelity, opening the door to creative applications in large-scale image animation.

7 Acknowledgments

This project was provided with computing (HPC & AI) and storage resources by GENCI at IDRIS thanks to the grant 2025-AD011016538 on the supercomputer Jean Zay's A100 & H100 partitions. This research was funded by the French National Research Agency (ANR) under the project ANR-23-CE23-0023 as part of the France 2030 initiative.

References

1. Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint (2025)
2. Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A., Hur, J., Liu, G., Raj, A., et al.: Lumiere: A space-time diffusion model for video generation. In: SIGGRAPH Asia 2024 Conference Papers, pp. 1–11 (2024)
3. Bar-Tal, O., Yariv, L., Lipman, Y., Dekel, T.: MultiDiffusion: Fusing diffusion paths for controlled image generation. In: ICML, vol. 202, pp. 1737–1752 (2023)
4. Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., et al.: FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints, arXiv-2506 (2025)
5. Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint (2023)
6. Bolya, D., Huang, P.Y., Sun, P., Cho, J.H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Rasheed, H.A., et al.: Perception Encoder: The best visual embeddings are not at the output of the network. In: NeurIPS (2025)
7. Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)
8. Chan, K.C., Wang, X., Yu, K., Dong, C., Loy, C.C.: BasicVSR: The search for essential components in video super-resolution and beyond. In: CVPR, pp. 4947–4956 (2021)
9. Chan, K.C., Zhou, S., Xu, X., Loy, C.C.: BasicVSR++: Improving video super-resolution with enhanced propagation and alignment. In: CVPR, pp. 5972–5981 (2022)
10. Chan, K.C., Zhou, S., Xu, X., Loy, C.C.: Investigating tradeoffs in real-world video super-resolution. In: CVPR, pp. 5962–5971 (2022)
11. Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., Xing, J., Liu, Y., Chen, Q., Wang, X., et al.: VideoCrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512 (2023)
12. Du, R., Chang, D., Hospedales, T., Song, Y.Z., Ma, Z.: DemoFusion: Democratising high-resolution image generation with no $$$. In: CVPR, pp. 6159–6168 (2024)
13. Frolov, S., Moser, B.B., Dengel, A.: SpotDiffusion: A fast approach for seamless panorama generation over time. In: WACV, pp. 2073–2081 (2025)
14. HaCohen, Y., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., Levin, E., Shiran, G., Zabari, N., Gordon, O., et al.: LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103 (2024)
15. He, J., Xue, T., Liu, D., Lin, X., Gao, P., Lin, D., Qiao, Y., Ouyang, W., Liu, Z.: VEnhancer: Generative space-time enhancement for video generation. arXiv preprint arXiv:2407.07667 (2024)
16. He, Y., Yang, S., Chen, H., Cun, X., Xia, M., Zhang, Y., Wang, X., He, R., Chen, Q., Shan, Y.: ScaleCrafter: Tuning-free higher-resolution visual generation with diffusion models. In: ICLR (2023)
17. Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al.: Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022)
18. Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: VBench: Comprehensive benchmark suite for video generative models. In: CVPR (2024)
19. Huang, Z., Zhang, F., Xu, X., He, Y., Yu, J., Dong, Z., Ma, Q., Chanpaisit, N., Si, C., Jiang, Y., Wang, Y., Chen, X., Chen, Y.C., Wang, L., Lin, D., Qiao, Y., Liu, Z.: VBench++: Comprehensive and versatile benchmark suite for video generative models. IEEE TPAMI (2025). https://doi.org/10.1109/TPAMI.2025.3633890
20. Jain, T., Lennan, C., John, Z., Tran, D.: Imagededup. https://github.com/idealo/imagededup (2019)
21. Jiménez, Á.B.: Mixture of Diffusers for scene composition and high resolution image generation. arXiv preprint arXiv:2302.02412 (2023)
22. Liang, J., Cao, J., Fan, Y., Zhang, K., Ranjan, R., Li, Y., Timofte, R., Van Gool, L.: VRT: A video restoration transformer. IEEE TIP 33, 2171–2182 (2024)
23. Liu, H., Ruan, Z., Zhao, P., Dong, C., Shang, F., Liu, Y., Yang, L., Timofte, R.: Video super-resolution based on deep learning: a comprehensive survey. Artificial Intelligence Review 55(8), 5981–6035 (2022)
24. Liu, J., Lin, S., Li, Y., Yang, M.H.: DynamicScaler: Seamless and scalable video generation for panoramic scenes. In: CVPR, pp. 6144–6153 (2025)
25. Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Howes, R., Huang, P.Y., Xu, H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning robust visual features without supervision (2023)
26. Pertuz, S., Puig, D., Garcia, M.A.: Analysis of focus measure operators for shape-from-focus. PR 46(5), 1415–1432 (2013)
27. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML, pp. 8748–8763 (2021)
28. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR, pp. 10684–10695 (2022)
29. Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35, 25278–25294 (2022)
30. Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: DINOv3. arXiv preprint arXiv:2508.10104 (2025)
31. Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., Parikh, D., Gupta, S., Taigman, Y.: Make-A-Video: Text-to-video generation without text-video data. In: ICLR (2023)
32. Team, G.: Mochi 1. https://github.com/genmoai/models (2024)
33. Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)
34. Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
35. Xie, R., Liu, Y., Zhou, P., Zhao, C., Zhou, J., Zhang, K., Zhang, Z., Yang, J., Yang, Z., Tai, Y.: STAR: Spatial-temporal augmentation with text-to-video models for real-world video super-resolution. In: ICCV, pp. 17108–17118 (2025)
36. Zhang, J., Zheng, K., Jiang, K., Wang, H., Stoica, I., Gonzalez, J.E., Chen, J., Zhu, J.: TurboDiffusion: Accelerating video diffusion models by 100-200 times (2025)
37. Zhang, Q., Song, J., Huang, X., Chen, Y., Liu, M.Y.: DiffCollage: Parallel generation of large content with diffusion models. In: CVPR, pp. 10188–10198 (2023)
38. Zheng, D., Huang, Z., Liu, H., Zou, K., He, Y., Zhang, F., Zhang, Y., He, J., Zheng, W.S., Qiao, Y., Liu, Z.: VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755 (2025)
39. Zhou, S., Yang, P., Wang, J., Luo, Y., Loy, C.C.: Upscale-A-Video: Temporal-consistent diffusion model for real-world video super-resolution. In: CVPR, pp. 2535–2545 (2024)
FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion - Supplementary Material

A FrescoDiffusion Additional Details

A.1 Implementation Details

All experiments use the Wan2.2-I2V 14B backbone with the same accelerated 6-step TurboDiffusion setting as in the main paper. We generate 81 frames at 16 fps with guidance scale 1.0, i.e., without effective classifier-free guidance. For an input image of size (H, W), we first generate the low-resolution prior at a fixed target area of 480 × 832 = 399,360 pixels while preserving the aspect ratio:

  Ĥ = round( sqrt( 399360 · H / W ) ),  Ŵ = round( sqrt( 399360 · W / H ) ),

and then snap both dimensions down to the nearest multiple of 16. The full-resolution pass uses H_4K = max(16, ⌊H/16⌋ · 16) and W_4K = max(16, ⌊W/16⌋ · 16), after an optional isotropic downscaling when the input exceeds ultra-HD resolution. This multiple-of-16 constraint comes from the latent video backbone: the Wan VAE reduces spatial resolution by a factor of 8, and the downstream latent grid is processed in spatial patches of size 2, yielding an effective validity constraint of 8 × 2 = 16. With spatial and temporal compression factors of 8 and 4, the latent tensor therefore has shape 16 × 21 × (H_4K/8) × (W_4K/8) for the default 81-frame setting. The low-resolution latent prior is resized to the large latent canvas with endpoint-aligned trilinear interpolation in latent space, VAE tiling is enabled during inference, and the decoded output is resized back to the original image size only if snapping changed the resolution.

Tiled denoising and regularization. The high-resolution pass uses MultiDiffusion windows of size 480 × 832 pixels with 30% overlap, giving nominal pixel strides of 336 × 582. After conversion to latent coordinates and rounding to the valid latent grid, this becomes 60 × 104 latent windows with strides of 42 × 72; extra final windows are added whenever needed so that the right and bottom boundaries are exactly covered. Tile fusion uses linear ramps with a minimum border weight of 0.1. Standard MultiDiffusion uses Σ_i w_i y_i / Σ_i w_i, whereas the prior-regularized implementation additionally accumulates Σ_i w_i y_i and Σ_i w_i for the one-shot closed-form update in model-output space. In all reported FrescoDiffusion runs we use a cosine prior schedule with λ_base = 1.5 and cutoff τ_end = 0.1, where τ = i/(N − 1) and N = 6 (the number of steps); hence the prior is active only at the first denoising step. In R-FrescoDiffusion, the active-region cutoff is τ_fg = 0.1 and the inactive-region cutoff is τ_bg = 0.35. Since the six normalized step positions are {0, 0.2, 0.4, 0.6, 0.8, 1.0}, the foreground prior is active only at i = 0, while the background prior is active at i = 0 and i = 1. The active and inactive parts of the video are determined by masks computed according to Sec. C.4.
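The resolution bookkeeping above can be summarized in a short sketch. This is a minimal, non-authoritative sketch of the size computation and tile placement under the stated assumptions (target area 399,360, multiple-of-16 snapping, 30% overlap); the function and variable names are illustrative and not taken from a released codebase.

```python
import math

TARGET_AREA = 480 * 832  # 399,360 pixels, the backbone's native area

def prior_resolution(h, w, area=TARGET_AREA, multiple=16):
    """Aspect-preserving prior size, snapped down to a multiple of 16."""
    ph = round(math.sqrt(area * h / w))
    pw = round(math.sqrt(area * w / h))
    return (ph // multiple) * multiple, (pw // multiple) * multiple

def canvas_resolution(h, w, multiple=16):
    """Full-resolution canvas: the largest multiple of 16 not exceeding the input size."""
    return max(multiple, (h // multiple) * multiple), max(multiple, (w // multiple) * multiple)

def tile_positions(latent_len, window, stride):
    """1D window origins with fixed stride; a final window is appended to cover the border."""
    starts = list(range(0, max(latent_len - window, 0) + 1, stride))
    if starts[-1] + window < latent_len:
        starts.append(latent_len - window)
    return starts

# Example: a 2160 x 3840 input, 60 x 104 latent windows with strides 42 x 72 (30% overlap).
H4K, W4K = canvas_resolution(2160, 3840)
rows = tile_positions(H4K // 8, window=60, stride=42)
cols = tile_positions(W4K // 8, window=104, stride=72)
```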
A.2 Sampling Procedure

At each step, we run the transformer on each crop of the canvas latent x_t^4K to obtain a per-tile prediction y_i. Then, we accumulate two canvas-shaped tensors: Σ_i w_i ⊙ y_i and Σ_i w_i. We then add the prior term and divide as in Eq. (6) to produce a single fused prediction on the full canvas. Finally, we invoke the scheduler exactly once to obtain x_{t+Δt}. This preserves the original sampler while adding only light overhead: one extra reduction per pixel, a single pointwise rational fuse, and no additional network passes beyond those already required by tiled denoising. The full step is summarized in Algorithm 1.

Algorithm 1: FrescoDiffusion: one sampler step at time t
Input: canvas latent x_t^4K ∈ R^(C×T×H_4K×W_4K); tile positions {p_i}_{i=1}^{n}; weight maps {w_i}_{i=1}^{n}; upscaled prior x_prior ∈ R^(C×T×H_4K×W_4K); noise level σ_t; prior-strength schedule λ; flow-matching step size Δt; flow-matching model f_θ
Output: updated canvas latent x_{t+Δt}^4K
1  Initialize num ← 0 ∈ R^(C×T×H_4K×W_4K), den ← 0 ∈ R^(1×1×H_4K×W_4K)
2  for i = 1, ..., n do
3      x̃_t ← C_{p_i}(x_t^4K)               // crop the canvas at p_i
4      ỹ_i ← f_θ(x̃_t, t, c)                // estimate the crop flow
5      y_i ← P_{p_i}(ỹ_i)                   // zero-pad the output
6      num ← num + w_i ⊙ y_i                // update the numerator
7      den ← den + w_i                      // update the denominator
8  y ← ( num + λ σ_t (x_t^4K − x_prior) ) / ( den + λ σ_t^2 )   // add prior regularization
9  x_{t+Δt}^4K ← SchedulerStep(x_t^4K, y, t)                    // update noisy states

B Fresco Evaluation Dataset Construction

While preparing our UHD-I2V benchmark, we noticed that no existing dataset fits our requirements. Unlike the VBench high-definition [19] set or panoramas [24], which focus on one or a few objects, we search for images to animate that contain multiple intricate scenes. Here, we propose a new set to generate and evaluate UHD-I2V techniques at a fresco-like scale. We name our dataset FrescoArchive.

We start with the LAION-2B Aesthetic Subset [29]. The first step is filtering the images based on several criteria: more than one million pixels, an aesthetic score of at least 5.8, and watermark and unsafe scores below 0.8 and 0.5, respectively. This initial filtering yielded a total of two million images. Next, we performed a semantic filtering process. To do this, for each image, we compute the average cosine similarity between the target instance and the prompts

  "A large detailed fresco"
  "A magnificent fresco with many different scenes"
  "A narrative composition"
  "A fresco with lots of details"
  "A large polyptych and composite image fresco"
  "A large metapicture with several compositions"
  "A fresco tableau"
  "A painting fresco"

using the PerceptionEncoder [6] G14-448 variant similarity model, and we finish by selecting the top 50,000 images.
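A minimal sketch of this prompt-similarity filtering step. The paper uses the PerceptionEncoder G14-448 model; the sketch below substitutes a generic CLIP-style encoder loaded through open_clip, and the model tag, batching, and selection logic are illustrative assumptions rather than the authors' pipeline.

```python
import torch
import open_clip
from PIL import Image

FRESCO_PROMPTS = [
    "A large detailed fresco",
    "A magnificent fresco with many different scenes",
    "A narrative composition",
    "A fresco with lots of details",
    "A large polyptych and composite image fresco",
    "A large metapicture with several compositions",
    "A fresco tableau",
    "A painting fresco",
]

# Generic CLIP-style encoder standing in for the PerceptionEncoder model used in the paper.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

@torch.no_grad()
def fresco_score(image_path: str) -> float:
    """Average cosine similarity between one image and the fresco prompts."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(tokenizer(FRESCO_PROMPTS))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).mean().item()

# Rank candidate images and keep the highest-scoring ones (top 50,000 in the paper).
# top_images = sorted(image_paths, key=fresco_score, reverse=True)[:50_000]
```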
After, we used InternVL-3.5 [34] as a zero-shot classifier to detect frescoes. To do so, we used the following prompt:

  You are a visual classifier. Decide whether the image is a fresco-like composition.
  Definition (for this task): A qualifying image resembles a large, detailed, integrated scene (like historical frescoes). Modern photos or digital works count if they share these traits.
  Answer yes only if all are true:
  – The image shows high apparent resolution / detail density (many fine, precise elements).
  – There are multiple distinct sub-scenes or groups (from a few to dozens+) distributed across the same frame.
  – These elements are blended into one coherent composition (no panel borders or obvious collage seams).
  Answer no if any of the following:
  – Single subject or minimal detail.
  – The fresco is not the main content of the image (e.g. the photo shows a wall, room, or museum scene where the fresco only appears as a small part, rather than the fresco itself being the full image).
  – Simple graphics, logos, posters, or text-only images.
  – Comics/manga with separate panels, tiled grids, or collages with hard borders.
  – Diagrams, charts, UI/screenshots, or patterns.
  Unsure: answer no.
  Output format: Respond with exactly yes or no (lowercase, no punctuation, no extra words).

This filtering results in a total of 10,000 images. We then deduplicated the dataset using both perceptual hashing and CNN-based deduplication from the ImageDedup library [20]. To generate UHD image captions, we used Qwen3-VL-32B [1] with both the image and the LAION caption, resulting in 6,700 UHD image-caption pairs. To prompt Qwen3-VL-32B, we used the following text:

  Using the existing caption below as context, write a long, highly detailed, precise, and fluent caption that thoroughly describes the image. Give some precise contextual information, relative positional information, subjects, objects and elements description and identification. Caution: the caption provided can be false or wrong; use the image as the only source of truth. The caption is only here to help you be more precise. Respond with the caption only (no preface, no metadata, no quotes).
  Existing caption: {caption}

Finally, we manually selected 371 pairs to keep the best-quality image-caption pairs.

Table 4: Dataset statistics. We compare low-level statistics of the FrescoArchive and VBench datasets.

Dataset         Samples   Words            Image Width    Image Height
FrescoArchive   371       355.79 ± 88.0    2265 ± 1257    1552 ± 864
VBench          361       14.70 ± 2.24     4592 ± 1305    3748 ± 1214

For the statistics, we compute several text- and image-level measures and quantitatively compare against VBench I2V [19] as a reference point. First, we compute high-level statistics, shown in Tab. 4: the number of samples, the average number of words per prompt, and the average width and height. Our set contains a similar number of images to VBench, yet our prompts contain an order of magnitude more words, reflecting more precise and detailed descriptions. As for the average image shape, our set contains smaller images, but with more complex scenery. Fig. 8 shows the word-count and image-shape distributions of our dataset.

Fig. 8: Analysis of the FrescoArchive dataset. Left: distribution of word counts in captions. Right: distribution of image resolutions.

Next, we qualitatively study FrescoArchive's complexity with reference to VBench. To do this, we first encode all images using the DINOv2 [25] model to get a global view of each image. Then, we perform a PCA dimensionality reduction to visualize their distribution. Finally, using the resulting reduction, we further visualize individual crops. As can be seen in Fig. 9, the resulting PCA shows that our dataset covers a wider span of the main axes than VBench. This suggests that our dataset has more diversity in its shared components under global DINO features.

Fig. 9: PCA of the FrescoArchive and VBench I2V datasets. We visualize the DINOv2 [25] feature PCA of our proposed dataset and VBench's images (red and green points, respectively); random crops of the images are shown in light red and light green. This plot shows that our dataset contains images with different statistics than the standard ones found in the VBench I2V set.

C Experiment Details

C.1 Metrics

Sharpness metric. We assess per-frame sharpness using the Tenengrad [26] measure, a classical no-reference focus metric based on directional gradient energy.
For each frame, the luminance channel is extracted and horizontal and vertical gradients are computed with a 3 × 3 Sobel operator. The Tenengrad score is defined as the mean squared gradient magnitude

  T = (1 / (H W)) Σ_{i,j} ( G_x^2(i, j) + G_y^2(i, j) ),

where G_x and G_y are the responses of the Sobel filters along each axis. This quantity is sensitive to high-frequency spatial structure and penalizes blurry outputs in which gradient energy is suppressed. Per-frame scores are averaged over all frames of a video to yield a single video-level score. Higher values indicate sharper, more detail-preserving outputs.

Temporal consistency metric. Each frame is first converted to grayscale and then downsampled to 128 × 128 pixels using area-averaging interpolation, retaining only coarse spatial structure. The temporal inconsistency score for a video of T frames is defined as the mean squared error between consecutive downscaled frames:

  C = (1 / (T − 1)) Σ_{t=1}^{T−1} || f̂_t − f̂_{t−1} ||_F^2 / 64^2,

where f̂_t denotes the downscaled grayscale frame at time t. By operating at this coarse resolution, the metric captures global consistency while remaining agnostic to legitimate scene motion.

Prior alignment metric. We measure how faithfully each generated video preserves the global semantic content of the prior using a frame-level feature alignment score based on DINOv3 [30]. For each frame pair (f_t^prior, f_t^gen), both frames are independently forwarded through a frozen DINOv3 ViT-S/16 encoder without any resizing (i.e., at native resolution), and we extract the CLS token from each: the pooled global representation output by the transformer. Prior alignment is then defined as the mean cosine similarity between corresponding CLS tokens across all T frames of a video:

  A = (1 / T) Σ_{t=1}^{T} ⟨ z_t^prior, z_t^gen ⟩ / ( || z_t^prior || · || z_t^gen || ),

which equivalently measures the cosine of the angle between the two CLS token directions in feature space. We report the mean of this score across all videos in the benchmark; higher values indicate that the generated video remains semantically aligned with the prior along its temporal trajectory. A key advantage of DINOv3 over its predecessor DINOv2 is its ability to process high-resolution inputs, including 4K frames, without interpolating positional embeddings or resizing the input. This property is essential in our setting, where the generated videos are high-resolution and downscaling prior to feature extraction would discard fine-grained content that may be critical for alignment assessment.
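A minimal sketch of the two full-resolution metrics above, using OpenCV for the Sobel filtering and area-averaging resize; the frame format (uint8 RGB arrays) and the normalization constants mirror the definitions in the text, but the helper names are illustrative and not taken from a released evaluation script.

```python
import cv2
import numpy as np

def tenengrad_sharpness(frames):
    """Mean Tenengrad score over a list of RGB frames, each an (H, W, 3) uint8 array."""
    scores = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY).astype(np.float64)
        gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)   # horizontal gradient
        gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)   # vertical gradient
        scores.append(np.mean(gx ** 2 + gy ** 2))         # mean squared gradient magnitude
    return float(np.mean(scores))

def temporal_inconsistency(frames, size=128):
    """MSE between consecutive frames after grayscale conversion and area downsampling."""
    small = [cv2.resize(cv2.cvtColor(f, cv2.COLOR_RGB2GRAY), (size, size),
                        interpolation=cv2.INTER_AREA).astype(np.float64) for f in frames]
    diffs = [np.sum((a - b) ** 2) / 64 ** 2               # normalization as stated in the text
             for a, b in zip(small[1:], small[:-1])]
    return float(np.mean(diffs))
```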
C.2 Standard low-resolution I2V metrics

In Table 2 of the main paper, we presented the average results on both the FrescoArchive and VBench-I2V datasets. Here, in Table 5, we detail each metric: Subject Consistency, Motion Smoothness, Aesthetic, Imaging, I2V Subject Consistency, and I2V Background Consistency.

Table 5: Quantitative evaluation on FrescoArchive and VBench-I2V. Higher is better for all metrics except time. Best in bold and second best underlined. * denotes methods adapted to the video-prior setting. SC, MS, A, I, ISC, and IBC stand for Subject Consistency, Motion Smoothness, Aesthetic, Imaging, I2V Subject Consistency, and I2V Background Consistency, respectively.

FrescoArchive
Method                    | SC    | MS    | A     | I     | ISC   | IBC   | Avg   | Time
DynamicScaler (original)  | 0.945 | 0.975 | 0.693 | 0.706 | 0.893 | 0.930 | 0.857 | 18.45
DynamicScaler* (CVPR'25)  | 0.852 | 0.971 | 0.681 | 0.754 | 0.980 | 0.989 | 0.871 | 10.25
DemoFusion* (CVPR'23)     | 0.960 | 0.987 | 0.734 | 0.752 | 0.990 | 0.994 | 0.903 | 13.5
MultiDiffusion* (ICML'23) | 0.876 | 0.974 | 0.686 | 0.754 | 0.981 | 0.989 | 0.876 | 8.15
FrescoDiffusion           | 0.958 | 0.990 | 0.738 | 0.753 | 0.991 | 0.995 | 0.904 | 8.58
R-FrescoDiffusion         | 0.977 | 0.991 | 0.736 | 0.753 | 0.991 | 0.994 | 0.907 | 9.08

VBench-I2V
Method                    | SC    | MS    | A     | I     | ISC   | IBC   | Avg   | Time
DynamicScaler (original)  | 0.949 | 0.976 | 0.707 | 0.713 | 0.893 | 0.932 | 0.862 | 18.45
DynamicScaler* (CVPR'25)  | 0.904 | 0.985 | 0.621 | 0.701 | 0.988 | 0.991 | 0.865 | 10.25
DemoFusion* (CVPR'23)     | 0.943 | 0.989 | 0.639 | 0.720 | 0.990 | 0.993 | 0.879 | 13.5
MultiDiffusion* (ICML'23) | 0.893 | 0.984 | 0.611 | 0.698 | 0.987 | 0.989 | 0.860 | 8.15
FrescoDiffusion           | 0.933 | 0.989 | 0.632 | 0.716 | 0.989 | 0.992 | 0.875 | 8.58
R-FrescoDiffusion         | 0.956 | 0.989 | 0.634 | 0.712 | 0.988 | 0.991 | 0.878 | 9.08

C.3 Baseline Implementation Details

MultiDiffusion. Our implementation follows the original MultiDiffusion [3] procedure without additional heuristics. The only modifications are (i) replacing the base backbone with Wan2.2 I2V in place of the original denoiser, and (ii) applying a linear-decay blending mask that attenuates each tile's contribution near its borders, reducing seam artifacts when merging overlapping predictions.
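The snippet below sketches one possible form of such a linear-decay blending mask and the corresponding weighted merge of overlapping tile predictions; it is a minimal illustration, not the exact implementation, and all names as well as the `ramp` width are our assumptions.

```python
import torch

def linear_decay_mask(tile_h: int, tile_w: int, ramp: int) -> torch.Tensor:
    """Per-tile blending weight: 1 in the interior, linearly decaying towards
    each tile border over `ramp` pixels (illustrative values)."""
    def ramp_1d(n: int) -> torch.Tensor:
        w = torch.ones(n)
        r = torch.linspace(1.0 / ramp, 1.0, ramp)
        w[:ramp] = r            # top/left ramp
        w[-ramp:] = r.flip(0)   # bottom/right ramp
        return w
    return ramp_1d(tile_h)[:, None] * ramp_1d(tile_w)[None, :]

def merge_tiles(preds, boxes, out_shape, ramp: int = 16) -> torch.Tensor:
    """Weighted average of overlapping per-tile predictions.

    preds: list of (C, h, w) tensors; boxes: list of (top, left) offsets.
    """
    C = preds[0].shape[0]
    acc = torch.zeros((C, *out_shape))
    wsum = torch.zeros((1, *out_shape))
    for p, (top, left) in zip(preds, boxes):
        _, h, w = p.shape
        m = linear_decay_mask(h, w, ramp)
        acc[:, top:top + h, left:left + w] += p * m
        wsum[:, top:top + h, left:left + w] += m
    return acc / wsum.clamp_min(1e-8)
```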
DemoFusion. DemoFusion [12] introduces three techniques for high-resolution image generation. We re-implemented (i) progressive phase upsampling and (ii) skip-residual global guidance, reusing the exact hyper-parameters from the authors' official code to enable a faithful comparison. These parameters control how resolution increases across phases (number of phases, per-phase upsampling factor, and per-scale denoising schedule) and the strength of global-structure guidance during refinement. In contrast, we did not observe reliable gains from DemoFusion-style dilated sampling when transferring it from SDXL's UNet denoiser to Wan2.2's DiT-based denoiser. We hypothesize this is because dilated sampling assumes updates are roughly separable across interleaved sub-lattices, an assumption that fits the local, convolutional structure of UNets but breaks for DiT models with global self-attention. Evaluating Wan2.2 on sparse lattices changes the attention context and likely shifts inputs off-distribution, leading to inconsistent offsets that merge into visible artifacts (seams/checkerboards) rather than improved coherence.

DynamicScaler. We follow the authors' official implementation of DynamicScaler [24] and reuse their released hyper-parameters for both the offset-shifting denoiser and the global motion-guidance module. We condition motion guidance on the same low-resolution video that we use as the FrescoDiffusion regularization prior, so both signals rely on an identical motion reference. For the sliding/rotating denoising window, we set the per-step offset (stride) to half the window size, i.e., a 50% overlap between consecutive windows, which stabilizes stitching across steps and mitigates boundary artifacts.

C.4 Spatial Activity Map Computation

Given an input frame, we compute a spatial activity map using SAM3 [7]. The region to animate can also be explicitly specified by the user; for automated processing and ease of use, we employ a segmentation pipeline that identifies plausible dynamic entities and converts them into a spatial activity map. The activity map is computed once per image, stored at latent resolution, clamped to [0, 1], resized to the current latent size with standard trilinear interpolation, binarized using the test A > 0, and then kept fixed throughout sampling.

Prompt-based segmentation. Our default pipeline queries SAM3 with a fixed set of prompts corresponding to categories that commonly exhibit motion. Specifically, we provide the following textual prompts to the model:

– person
– vegetation
– vehicles

For each prompt, SAM3 predicts candidate spatial masks corresponding to instances of the queried category; the masks are then averaged. Visualizations of such masks are provided in Fig. 10.

Mask extraction and filtering. SAM3 predictions are filtered using a score threshold $\tau_s = 0.45$. We discard masks whose relative area exceeds 0.30 of the image area, as such regions typically correspond to overly coarse detections. We additionally remove masks with excessive boundary contact, defined as cases where more than 80% of the mask pixels lie within a 10-pixel margin of the image border.

Spatial support. For each retained mask, we construct a spatial support region by dilating the mask using a Euclidean distance transform with radius r = 75 pixels. The resulting regions are merged to produce the final spatial activity map.
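The sketch below illustrates the mask filtering and spatial-support construction described above; SAM3 inference is abstracted as a list of (mask, score) pairs, and the interface is our own illustrative choice rather than the SAM3 API.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def build_activity_map(masks_and_scores, image_shape,
                       tau_s=0.45, max_rel_area=0.30,
                       border_margin=10, border_frac=0.80, radius=75):
    """Filter candidate masks and dilate them into a spatial activity map.

    masks_and_scores: list of (HxW boolean mask, confidence score) pairs,
    e.g., as produced by a SAM3-style segmenter (illustrative interface).
    """
    H, W = image_shape
    border = np.zeros((H, W), dtype=bool)
    border[:border_margin], border[-border_margin:] = True, True
    border[:, :border_margin], border[:, -border_margin:] = True, True

    activity = np.zeros((H, W), dtype=bool)
    for mask, score in masks_and_scores:
        area = mask.sum()
        if score < tau_s or area == 0:
            continue
        if area / (H * W) > max_rel_area:                # overly coarse detection
            continue
        if (mask & border).sum() / area > border_frac:   # excessive border contact
            continue
        # Spatial support: keep every pixel within `radius` of the mask,
        # via the Euclidean distance transform of the mask complement.
        support = distance_transform_edt(~mask) <= radius
        activity |= support                              # merge retained regions
    return activity.astype(np.float32)
```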
Exploratory variants. We explored two extensions of this pipeline. First, we experimented with extending the spatial mask across time: using SAM3 together with the generated prior video, the mask predicted on the input frame was propagated to cover the full temporal extent of the video, as shown in Fig. 11. Second, we evaluated a variant in which the prompts provided to SAM3 are generated automatically by a vision-language model. In this setup, Qwen3-VL-32B [1] analyzes both the input image and the generated prior video to produce textual prompts corresponding to entities that could plausibly support animation. These prompts are then used to condition SAM3, while mask extraction, filtering, and spatial support construction remain identical to the default pipeline. In practice, neither the temporal mask propagation nor the VLM-guided prompting produced measurable improvements over the fixed-prompt approach. As both variants introduce additional computational overhead, they were not used for the final results reported in the paper.

Fig. 10: Additional qualitative examples of spatial activity maps obtained with SAM3. Each image shows the overlay used to identify regions likely to contain dynamic content, which then guide the active/inactive prior regularization in R-FrescoDiffusion.

Fig. 11: Two qualitative examples showing the temporal masks obtained with SAM3 at frames 1, 40, and 80. The overlaid video is the prior generated with the Wan model.

D FrescoDiffusion Additional Examples

Fig. 12: Additional examples obtained on our FrescoArchive dataset (frames 1, 48, and 80). The first two rows are obtained with FrescoDiffusion and the last two rows with R-FrescoDiffusion; columns show the same frame indices.

Fig. 13: Additional examples obtained on the VBench I2V dataset (frames 1, 48, and 80). The first three rows are obtained with FrescoDiffusion and the last two rows with R-FrescoDiffusion; columns show the same frame indices.

E FrescoDiffusion closed-form solution in the noise-prediction setting

Our proposed approach can also be used with $\epsilon$-prediction diffusion models. We modify the FrescoDiffusion loss in Eq. (5) to include the one-step approximation of the $\epsilon$-diffusion formulation:

$$\ell_{\mathrm{FD}}(y^{\star}; t) = \Big\| \sqrt{\lambda} \odot \Big( \tfrac{1}{\sqrt{\alpha_t}} \big( x^{4K}_t - \sqrt{1 - \alpha_t}\, y^{\star} \big) - x^{\mathrm{prior}} \Big) \Big\|_2^2 + \ell_{\mathrm{MD}}(y^{\star}; t), \qquad (9)$$

where $\alpha_t = \prod_{i=1}^{t} (1 - \beta_i)$, the $\beta_t$ are the schedule variances, and the one-step approximation of the clean sample is given by $\tfrac{1}{\sqrt{\alpha_t}} \big( x^{4K}_t - \sqrt{1 - \alpha_t}\, y^{\star} \big)$. Setting the derivative of Eq. (9) with respect to $y^{\star}$ to zero and solving yields the optimal noise prediction:

$$y_{\mathrm{FD}}(x^{4K}_t) = \frac{ \sqrt{\tfrac{1 - \alpha_t}{\alpha_t}}\, \lambda \odot \big( \tfrac{1}{\sqrt{\alpha_t}} x^{4K}_t - x^{\mathrm{prior}} \big) + \sum_{i=1}^{n} w_i \odot y_i }{ \tfrac{1 - \alpha_t}{\alpha_t}\, \lambda + \sum_{i=1}^{n} w_i }. \qquad (10)$$

We adopt the same variable definitions as in the main text for the current noisy state $x^{4K}_t$, the prior $x^{\mathrm{prior}}$, the weighting tensors $w_i$, and the prior regularization strength $\lambda$. As in the flow-matching formulation, when $\lambda = 0$, $y_{\mathrm{FD}}$ reduces to $y_{\mathrm{MD}}$.
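For concreteness, the sketch below applies the closed-form update of Eq. (10) to full-resolution latent tensors, assuming the per-tile predictions have already been scattered onto the global latent grid; variable names and the tile-handling interface are illustrative, not the released code.

```python
import torch

def fused_eps_prediction(x_t, x_prior, tile_preds, tile_weights, lam, alpha_bar_t):
    """Closed-form epsilon-space fusion of Eq. (10).

    x_t:          current noisy 4K latent x^4K_t
    x_prior:      upsampled prior latent x^prior (same shape as x_t)
    tile_preds:   list of per-tile noise predictions y_i, scattered to the
                  full latent grid (zeros outside their tile)
    tile_weights: list of matching tile-merging weight tensors w_i
    lam:          prior regularization strength lambda (per-pixel tensor)
    alpha_bar_t:  cumulative product alpha_t = prod_i (1 - beta_i) at step t
    """
    ratio = (1.0 - alpha_bar_t) / alpha_bar_t
    # Prior-driven term: sqrt((1-a)/a) * lambda * (x_t / sqrt(a) - x_prior).
    num = ratio ** 0.5 * lam * (x_t / alpha_bar_t ** 0.5 - x_prior)
    den = ratio * lam
    # Standard tile-merging term: sum_i w_i * y_i over the denominator sum_i w_i.
    for y_i, w_i in zip(tile_preds, tile_weights):
        num = num + w_i * y_i
        den = den + w_i
    return num / (den + 1e-8)
```

As in Eq. (10), setting lam to zero removes the prior term and recovers the plain weighted tile average y_MD.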