V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising



Han Lin¹, Xichen Pan², Zun Wang¹, Yue Zhang¹, Chu Wang³, Jaemin Cho⁴, Mohit Bansal¹
¹UNC Chapel Hill  ²NYU  ³Meta  ⁴AI2
https://github.com/HL-hanlin/V-Co

Abstract

Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co-denoising approaches often entangle multiple design choices, making it unclear which design choices are truly essential. We therefore present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective. Our study reveals four key ingredients for effective visual co-denoising. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising.
Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.

Figure 1. An overview of V-Co and its recipe. Starting from a pixel diffusion model, a pretrained DINOv2 encoder, and training images, we identify four key ingredients for effective visual co-denoising: a fully dual-stream architecture, semantic-to-pixel masking for classifier-free guidance, a perceptual-drifting hybrid loss for stronger semantic supervision, and RMS-based feature rescaling for cross-stream calibration. Together, they form a simple and effective recipe for visual co-denoising.

1. Introduction

Diffusion models [13, 29, 33] have achieved remarkable success in image generation. While much recent progress has been driven by latent diffusion models [34] (LDMs),
which denoise in compressed autoencoder spaces [23, 34], an increasingly compelling alternative is pixel-space diffusion with scalable Transformer-based denoisers [6, 26, 28, 31, 48]. Recent systems such as JiT [26] show that direct pixel-space denoising can be competitive while avoiding autoencoder-induced biases and bottlenecks. However, pixel-level denoising objectives are not explicitly designed to enforce high-level semantic structure, making semantic representation learning less sample-efficient.

In parallel, a growing body of work has explored how to inject external visual knowledge from strong pretrained encoders into diffusion training. One line of research adds representation-alignment losses that encourage diffusion features to match pretrained visual representations [35, 36, 39, 42, 49, 51]. Another performs denoising directly in a representation latent space, rather than in pixel or VAE latent space [3, 18, 37, 50]. A third line of work explores joint generation or co-denoising architectures, in which image latents are generated together with semantic features or other modalities so that the streams can exchange information throughout the denoising trajectory [1, 2, 4, 8, 9, 14, 19, 24, 43, 45, 52]. Among these directions, visual co-denoising provides a deeper form of integration by incorporating pretrained semantic representations directly into the denoising process, rather than using them only as supervision or as an alternative latent space. However, existing co-denoising systems typically entangle multiple design choices, spanning architecture, guidance strategy, auxiliary supervision, and feature calibration, which obscures the principles that govern effective pixel-semantic interaction. This lack of understanding makes current designs largely ad hoc, and leaves open how to combine these components into a robust and scalable recipe.
In this paper, we study visual co-denoising as a mechanism for visual representation alignment. Rather than treating co-denoising as a fixed end-to-end design, we investigate the factors that make it effective. To this end, we build a unified pixel-space testbed on top of JiT [26], where an image stream is jointly denoised with patch-level semantic features from a frozen pretrained visual encoder (e.g., DINOv2 [32]). Within this controlled framework, we investigate four key questions: (i) what architecture best balances feature-specific processing and cross-stream interaction; (ii) how to define the unconditional branch for classifier-free guidance; (iii) which auxiliary objectives provide the most effective complementary supervision; and (iv) how to calibrate semantic features relative to pixels during diffusion training. Our goal is not only to improve performance, but also to distill general principles for effective co-denoising. Based on this study, we derive a simple yet effective Visual Co-Denoising (V-Co) recipe, illustrated in Fig. 1.

First, from the perspective of model architecture, we show that effective visual co-denoising requires preserving feature-specific computation while enabling flexible cross-stream interaction. Among a broad range of shared-backbone and fusion-based variants, a fully dual-stream JiT consistently delivers the strongest performance (Sec. 3.2). Second, for classifier-free guidance (CFG), we introduce a novel structural masking formulation, where the unconditional prediction is defined by explicitly masking the semantic-to-pixel pathway rather than by input-level corruption alone. This simple design proves substantially more effective than standard dropout-based alternatives in co-denoising (Sec. 3.3).
Third, we observe that instance-level semantic alignment and distribution-level regularization play complementary roles, and leverage this insight to propose a novel perceptual-drifting hybrid loss that combines both within a unified objective, yielding the best generation quality in our study (Sec. 3.4). Finally, we show that RMS-based feature rescaling admits an equivalent interpretation as a semantic-stream noise-schedule shift via signal-to-noise ratio (SNR) matching, providing a simple and principled calibration rule for cross-stream co-denoising (Sec. 3.5). Together, these findings transform visual co-denoising into a concrete recipe for visual representation alignment.

Empirically, V-Co yields strong gains on ImageNet-256 under the standard JiT [26] training protocol. Starting from a pixel-space JiT-B/16 backbone, our progressively improved recipe substantially outperforms both the original JiT baseline and prior co-denoising baselines (see Table 5), and achieves strong guided generation quality. Notably, V-Co-B/16, with only 260M parameters, matches JiT-L/16 with 459M parameters (FID 2.33 vs. 2.36). V-Co-L/16 and V-Co-H/16, trained for 500 and 300 epochs respectively, outperform JiT-G/16 with 2B parameters (FID 1.71 vs. 1.82) and other strong pixel-diffusion methods.

In summary, our contributions are three-fold:
• We present a principled study of visual representation alignment via co-denoising (V-Co) in pixel-space diffusion, systematically isolating the effects of architecture, CFG design, auxiliary losses, and feature calibration.
• We introduce an effective recipe for visual co-denoising with two key innovations: structural masking for the unconditional CFG prediction and a perceptual-drifting hybrid loss that combines instance-level alignment with distribution-level regularization.
Our study further identifies a fully dual-stream architecture and RMS-based feature calibration as the preferred design choices.
• We show that these designs yield strong improvements on ImageNet-256 [10], outperforming the underlying pixel-space diffusion baseline (i.e., JiT [26]) as well as prior pixel-space diffusion methods.

2. Related Work

Pixel-space diffusion generation. Recent work has shown that, with suitable architectural and optimization choices, diffusion models trained directly in pixel space can approach latent diffusion performance [34]. JiT [26] demonstrates that competitive pixel-space generation is possible with a minimalist Transformer design, while Simple Diffusion [16], PixelDiT [48], and HDiT [7] improve training and scalability. Other methods add stronger inductive biases, such as decomposition in DeCo [30] and perceptual supervision in PixelGen [31]. We adopt pixel-space diffusion rather than VAE-latent diffusion because it avoids autoencoder bottlenecks and learned latent-space biases, providing a cleaner setting for studying co-denoising and representation alignment.

Figure 2. Single-stream and dual-stream architectures for visual co-denoising. In the single-stream design (left), noised pixels and DINOv2 features are fused after lightweight stream-specific preprocessing and then processed by shared JiT blocks. We study direct addition, channel concatenation, and token concatenation (see Sec. 3.2). In the dual-stream design (right), the two streams use separate normalization, MLP, and attention projections, while interacting through joint self-attention. A semantic-to-pixel attention mask is used to define the unconditional prediction for CFG (see Sec. 3.3). Both designs use separate output heads for pixel and DINOv2 prediction.

Representation alignment for diffusion training. A growing line of work studies how pretrained visual representations can improve diffusion training. Recent analyses [35, 42] show that diffusion models learn meaningful internal features, but these are often weaker or less structured than those of strong self-supervised vision encoders. REPA [42] aligns intermediate diffusion features with pretrained representations such as DINOv2 [32], improving convergence and sample quality. Follow-up work studies which teacher properties matter most: iREPA [35] highlights spatial structure, while REPA-E [25] extends REPA-style supervision to end-to-end latent diffusion training with the VAE. Recent results also suggest that REPA-style alignment is most beneficial early in training and may over-constrain the representation space if applied too rigidly [42]. Motivated by this, we study representation alignment through co-denoising, compare auxiliary losses beyond REPA, and introduce a stronger hybrid alternative.

Visual co-denoising and joint generation across modalities. Recent work has increasingly explored joint denoising or joint generation of multiple signals to improve information transfer, controllability, and structural consistency. In image generation, Latent Forcing [1] and ReDi [24] jointly model image latents and semantic features. In video generation, VideoJAM [4], UDPDiff [44], and UnityVideo [19] jointly generate video with structured signals such as segmentation, depth, or flow.
Similar ideas extend to audio-visual generation [43], robotics and world modeling [2, 8, 9, 45, 52], and multimodal sequence modeling [14]. In contrast to these task-specific end-to-end designs, we provide a controlled study of visual co-denoising itself, isolating the architectural, guidance, loss, and calibration choices that make it effective and distilling them into a practical recipe for visual representation alignment.

3. A Closer Look at Visual Co-Denoising

In this section, we first formalize visual co-denoising in Sec. 3.1, then conduct a systematic study of the key design choices that govern its effectiveness, including model architecture (Sec. 3.2), unconditional prediction for CFG (Sec. 3.3), auxiliary training objectives (Sec. 3.4), and feature calibration via rescaling (Sec. 3.5). Starting from a standard pixel-space diffusion baseline (e.g., JiT [26]), we use controlled ablations to isolate each component's contribution and derive a practical recipe, introducing new designs tailored for visual co-denoising along the way. Experiment setup details and additional ablations are deferred to Appendix Sec. A and Sec. B, respectively.

3.1. Co-Denoising Formulation

We formalize visual co-denoising within a unified framework. Unlike standard pixel-space diffusion, which denoises only the image stream, co-denoising introduces an additional semantic feature stream from a pretrained visual encoder (e.g., DINOv2 [32]). The core idea is to jointly denoise the pixel and semantic streams under a shared diffusion process, allowing the semantic stream to provide complementary supervision for semantically richer generation.

Unless otherwise specified, all experiments in this section follow the JiT [26] ablation protocol on ImageNet 256×256 [10], using a JiT-B/16 backbone trained for 200 epochs. We adopt the original JiT training configuration without additional hyperparameter tuning.
Concretely, we extend the x-prediction and v-loss formulation of JiT to jointly denoise pixels and pretrained semantic features. Let x denote the clean image and d denote its encoded patch-level semantic features. We sample independent Gaussian noise ε_x, ε_d ∼ N(0, I) for the two streams. At diffusion time t ∈ [0, 1], the corresponding noised inputs are

    z_t^x = t·x + (1 − t)·ε_x,    z_t^d = t·d + (1 − t)·ε_d.    (1)

Given (z_t^x, z_t^d, t, c), where c denotes the class condition, the co-denoising model jointly predicts the clean targets for the pixel and semantic streams:

    (x̂, d̂) = f_θ(z_t^x, z_t^d, t, c),    (2)

where f_θ denotes the co-denoising model, which could be implemented as either a shared-backbone or dual-stream architecture depending on the design variant. Following JiT, we convert these clean predictions into velocity predictions,

    v̂_x = (x̂ − z_t^x) / (1 − t),    v̂_d = (d̂ − z_t^d) / (1 − t),    (3)

and supervise them with the ground-truth velocities,

    v_x = x − ε_x = (x − z_t^x) / (1 − t),    (4)
    v_d = d − ε_d = (d − z_t^d) / (1 − t).    (5)

The final objective is a weighted sum of the pixel- and semantic-stream v-losses:

    L_v-co = E[ ‖v̂_x − v_x‖₂² + λ_d ‖v̂_d − v_d‖₂² ],    (6)

where λ_d controls the weight of the semantic stream. This formulation provides a unified testbed for studying the effects of architecture, guidance, auxiliary losses, and feature calibration on representation alignment in co-denoising.

3.2. What Architecture Best Supports Visual Co-Denoising?

We begin by studying how semantic features should be integrated into a pixel-space diffusion backbone for co-denoising. Our goal is to identify the architectural design that most effectively transfers information from pretrained semantic visual encoders to pixel features without limiting the expressiveness of the diffusion model.
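Before comparing architectures, the co-denoising objective of Eqs. (1)-(6) can be made concrete with a short sketch. The snippet below is a minimal numpy illustration, not the training implementation: f_theta stands in for the co-denoising model (a dual-stream JiT in our setting), the class condition c is omitted for brevity, and a single diffusion time t is drawn per call.

```python
import numpy as np

def v_co_loss(x, d, f_theta, lambda_d=0.1, rng=np.random.default_rng(0)):
    """Minimal sketch of the co-denoising v-loss, Eqs. (1)-(6).

    x: clean pixels; d: clean DINOv2 features; f_theta is a stand-in
    predictor mapping (z_x, z_d, t) -> (x_hat, d_hat).
    """
    eps_x = rng.standard_normal(x.shape)
    eps_d = rng.standard_normal(d.shape)
    t = rng.uniform(0.0, 1.0)                      # diffusion time
    z_x = t * x + (1 - t) * eps_x                  # noised pixels, Eq. (1)
    z_d = t * d + (1 - t) * eps_d                  # noised semantic features
    x_hat, d_hat = f_theta(z_x, z_d, t)            # clean-target prediction, Eq. (2)
    v_x_hat = (x_hat - z_x) / (1 - t)              # predicted velocities, Eq. (3)
    v_d_hat = (d_hat - z_d) / (1 - t)
    v_x, v_d = x - eps_x, d - eps_d                # ground-truth velocities, Eqs. (4)-(5)
    return np.mean((v_x_hat - v_x) ** 2) + lambda_d * np.mean((v_d_hat - v_d) ** 2)
```

An oracle predictor that returns the clean targets drives both v-loss terms to zero exactly, since (x − z_t^x)/(1 − t) = x − ε_x by Eq. (4).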
To this end, we compare lightweight fusion within a largely shared backbone against more expressive designs that preserve feature-specific processing while enabling controlled cross-stream interaction. Fig. 2 illustrates the architectural variants, and Table 1 summarizes the corresponding results.

Baselines. We first report results for the original JiT-B/16 backbone [26] and a widened variant that increases the hidden dimension from 768 to 1088 to match the parameter count of the dual-stream models introduced later. We also include Latent Forcing [1] as a representative co-denoising baseline. For fair comparison, we keep the number of JiT blocks traversed by the pixel stream fixed across all variants, and maintain this setting throughout this subsection.

Single-stream variants. We consider a shared-backbone setting (Fig. 2, left) where pixel tokens x and semantic tokens d share most parameters. Within this setting, we compare three fusion strategies with model architectures derived from Latent Forcing [1]:
• Direct addition (row (d)): Pixel tokens x ∈ R^{n×d₁} and semantic features d ∈ R^{n×d₂} are first projected into a shared hidden space R^{n×d} via lightweight linear layers, then fused by element-wise addition and passed through shared JiT blocks. The pixel and semantic streams have two separate output heads. Our experiments in the main paper use d₁ = d₂ = 768 and a patch count of n = 256.
• Channel-concatenation fusion (row (e)): Pixel tokens x and semantic features d are concatenated along the channel dimension, giving R^{n×(d₁+d₂)}, and then linearly projected to the hidden dimension R^{n×d} of the JiT blocks.
• Token-concatenation fusion (rows (f)-(i)): Instead of concatenating along the channel dimension, we concatenate x and d along the sequence dimension, giving R^{2n×d}, and input the combined token sequence into the JiT blocks.

Dual-stream variants.
Motivated by the limitations of heavily shared backbones, we further introduce a dual-stream JiT architecture, illustrated on the right of Fig. 2, in which the pixel and semantic streams maintain separate normalization layers, MLPs, and attention projections (i.e., Q/K/V), while interacting through joint self-attention. This design allows the model to adaptively determine where and how the two streams interact, while preserving dedicated processing pathways for each stream.

Table 1. Comparison of architectural designs for visual co-denoising. We compare baseline backbones, single-stream fusion strategies, and dual-stream fusion variants with different allocations of feature-specific and shared/dual-stream blocks. All variants keep the pixel-stream depth fixed at 12 JiT blocks for fair comparison. JiT-B/16‡ and JiT-B/16† denote widened variants with hidden dimensions increased from 768 to 1024 and 1088, respectively, to match the parameter counts of the dual-stream models. Following previous works [1, 26], we mainly use FID as reference. We highlight the rows corresponding to the design with the best overall FID score in light blue.

Model                 | Backbone   | #Params | #Feature-Specific Blocks | #Shared/Dual-Stream Blocks | FID↓ (CFG=1.0) | IS↑
Baselines
(a) JiT-B/16 [26]     | JiT-B/16   | 133M    | -  | -  | 32.54 | 49.5
(b) JiT-B/16 [26]     | JiT-B/16†  | 261M    | -  | -  | 22.67 | 69.9
(c) LatentForcing [1] | JiT-B/16   | 156M    | 2  | 10 | 13.06 | 102.2
Single-Stream JiT Architecture
(d) DirectAddition    | JiT-B/16   | 156M    | 2  | 10 | 15.15 | 103.4
(e) ChannelConcat     | JiT-B/16   | 157M    | 2  | 10 | 14.33 | 107.7
(f) TokenConcat       | JiT-B/16   | 156M    | 2  | 10 | 14.70 | 103.8
(g) TokenConcat       | JiT-B/16   | 177M    | 4  | 8  | 12.59 | 112.8
(h) TokenConcat       | JiT-B/16   | 198M    | 6  | 6  | 12.35 | 116.7
(i) TokenConcat       | JiT-B/16‡  | 265M    | 6  | 6  | 9.74  | 129.45
Dual-Stream JiT Architecture
(j) TokenConcat       | JiT-B/16   | 260M    | 6  | 6  | 11.78 | 115.4
(k) TokenConcat       | JiT-B/16   | 260M    | 4  | 8  | 11.40 | 118.3
(l) TokenConcat       | JiT-B/16   | 260M    | 2  | 10 | 10.24 | 124.5
(m) TokenConcat       | JiT-B/16   | 260M    | 0  | 12 | 8.86  | 132.8

Analysis. As shown in Table 1, token-concatenation fusion outperforms direct addition and channel concatenation among the single-stream variants (rows (d)-(f)), suggesting that preserving feature-specific representations before interaction is preferable to early fusion in a shared space. Moreover, within token concatenation, allocating more blocks to feature-specific processing consistently improves performance (rows (f)-(h)), indicating that excessive parameter sharing limits the model's ability to preserve semantic information. Finally, among the dual-stream variants, the fully dual-stream architecture (row (m)) achieves the best FID of 8.86 under a comparable number of trainable parameters (row (i) and rows (j)-(l)), showing that allowing the model to dynamically learn cross-stream interaction at each block is more effective than imposing a fixed interaction pattern through a largely shared backbone. Therefore, we adopt the fully dual-stream architecture as the default model design in the remaining analysis. A more comprehensive comparison with additional single-stream variants is given in Table 7.

Finding 1: Effective visual feature alignment requires preserving feature-specific processing while enabling flexible cross-stream interaction.
Token concatenation and a fully dual-stream model architecture emerge as the preferred designs.

3.3. How to Define Unconditional Prediction for CFG?

To enable classifier-free guidance (CFG), the model must define an unconditional prediction, i.e., a prediction in which the conditioning signals are removed. In our co-denoising setting, this is nontrivial because the model is conditioned on both class labels and semantic features. Guided sampling combines the conditional and unconditional predictions in the pixel and semantic streams as

    v̂_x = v̂_x^uncond + s (v̂_x^cond − v̂_x^uncond),    (7)
    v̂_d = v̂_d^uncond + s (v̂_d^cond − v̂_d^uncond),    (8)

where s denotes the CFG scale. Since guided generation depends critically on the quality of the unconditional branch, we next investigate how to define an effective unconditional prediction for CFG in the co-denoising setting.

Input-dropout baselines. Following prior work [1, 24], we first consider baseline unconditional predictions that drop conditioning inputs (semantic features and class labels) during training. Specifically, for semantic feature dropping, we use either (1) zeros or (2) a learnable [null] token to replace the semantic features. For each choice, we compare independent dropout of the class label and semantic features (rows (a)-(b)) against joint dropout (rows (e)-(f)).

Attention mask between pixel and semantic features. Beyond input-level dropout, we leverage the dual-stream architecture to define a structurally unconditional pathway. For unconditional samples, we apply semantic-to-pixel masking (see Fig. 3), which blocks cross-stream attention from the semantic stream to the pixel stream so that the pixel branch receives no semantic conditioning signal (rows (d) and (h)). We also study a symmetric variant, bidirectional cross-stream masking, which blocks attention in both directions (rows (c) and (g)). These variants test whether unconditional prediction is better defined via explicit control of information flow rather than input-level corruption.

Table 2. Comparison of unconditional prediction designs for classifier-free guidance under co-denoising. We report unguided results at CFG = 1.0 and guided results at CFG = 2.9, which is the default guided evaluation setting in JiT [26]. We highlight the rows corresponding to the design with the best overall FID score in light blue.

Uncond. Type                        | Cond. Pred.           | Uncond. Pred.            | CFG=1.0 FID↓ / IS↑ | CFG=2.9 FID↓ / IS↑
Independently Drop Labels & Semantic Features
(a) Zero Embedding                  | f_θ([x_t, d_t], y, t) | f_θ([x_t, 0], ∅, t)      | 9.17 / 126.3  | 6.69 / 165.6
(b) Learnable [null] Token          | f_θ([x_t, d_t], y, t) | f_θ([x_t, [null]], ∅, t) | 9.37 / 126.7  | 6.64 / 165.7
(c) Bidirectional Cross-Stream Mask | f_θ([x_t, d_t], y, t) | f_θ([x_t, d_t], ∅, t)    | 11.08 / 101.9 | 7.17 / 143.5
(d) Semantic-to-Pixel Mask          | f_θ([x_t, d_t], y, t) | f_θ([x_t, d_t], ∅, t)    | 7.28 / 136.8  | 3.59 / 189.6
Jointly Drop Labels & Semantic Features
(e) Zero Embedding                  | f_θ([x_t, d_t], y, t) | f_θ([x_t, 0], ∅, t)      | 15.58 / 98.8  | 24.75 / 82.4
(f) Learnable [null] Token          | f_θ([x_t, d_t], y, t) | f_θ([x_t, [null]], ∅, t) | 10.80 / 118.3 | 25.2 / 88.3
(g) Bidirectional Cross-Stream Mask | f_θ([x_t, d_t], y, t) | f_θ([x_t, d_t], ∅, t)    | 7.53 / 129.1  | 5.66 / 173.6
(h) Semantic-to-Pixel Mask          | f_θ([x_t, d_t], y, t) | f_θ([x_t, d_t], ∅, t)    | 5.62 / 158.5  | 3.18 / 219.4

Analysis. Table 2 first shows that under the baseline input-dropout strategy, independently dropping the class label and semantic features (rows (a)-(b)) performs substantially better than jointly dropping them (rows (e)-(f)). We hypothesize that jointly dropping both conditions makes the pixel-space guidance direction, Δ_x = v̂_x^cond − v̂_x^uncond, a poorly calibrated estimate of the desired conditional guidance signal, which is then amplified by CFG scaling.
In contrast, independent dropout exposes the model to partially conditioned cases and thus appears to improve the robustness of the learned guidance direction.

More importantly, explicitly defining the unconditional pathway through structural masking (rows (c)-(d)) is markedly more effective than input-level dropout (rows (a)-(b)) under independent dropout, suggesting that blocking semantic information from reaching the pixel branch yields a more reliable unconditional prediction. Among the structural variants, masking only the semantic-to-pixel pathway (row (d)) performs best, indicating that unconditional generation only requires removing semantic influence on the pixel output, while preserving the reverse pixel-to-semantic interaction remains beneficial. For structural masking, jointly dropping labels and semantic features (rows (g)-(h)) outperforms independent dropout (rows (c)-(d)), suggesting that once the unconditional branch is defined structurally, removing all conditioning sources during training better matches inference-time behavior.

Figure 3. Comparison of two attention-masking strategies (semantic-to-pixel masking vs. bidirectional masking). Yellow tokens indicate the corresponding query and attended key/value tokens, while white tokens indicate positions whose attention scores are masked out.

Finding 2: For CFG under co-denoising, the most effective unconditional design is structural semantic-to-pixel masking with joint dropout of conditioning signals during training.

3.4. Which Auxiliary Loss Best Improves Co-Denoising?

The default V-Co objective in Eq. (6) supervises both streams through the co-denoising v-loss, but it mainly enforces local target matching and may not fully capture higher-level semantic alignment.
We therefore study which auxiliary objectives provide the most effective complementary supervision in representation space, and ultimately design a hybrid loss that better combines their strengths.

Balancing pixel and semantic losses. Before exploring additional auxiliary objectives, we first tune the relative weight of the semantic-stream loss through λ_d in Eq. (6). As shown in Fig. 4, λ_d ∈ {0.01, 0.1} gives the best FID. Under these settings, the average parameter-gradient norm in the pixel branch is approximately 4× and 2× that of the semantic branch, respectively. This suggests that semantic supervision is most effective when it provides meaningful guidance while remaining secondary to the primary pixel-space objective.

Table 3. Comparison of auxiliary losses applied to V-Co. We report unguided results at CFG = 1.0 and guided results at the best CFG selected from a sweep over [1.5, 5.5] for each method (2.5 for (a) the REPA loss, 2.1 for (b) the perceptual loss, 2.4 for (c) the drifting loss, and 2.0 for (d) the perceptual-drifting hybrid loss). All results here are obtained after 300 epochs of training. We highlight the row with the best guided FID score in light blue.

Aux. Loss Type                             | Unguided (CFG=1.0) FID↓ / IS↑ | Guided (Best CFG>1.0) FID↓ / IS↑
Baseline: V-Co w/ default V-Pred Loss      | 5.38 / 153.6 | 2.96 / 206.6
(a) V-Co + REPA Loss                       | 5.63 / 149.4 | 2.91 (0.05↓) / 202.8
(b) V-Co + Perceptual Loss                 | 4.28 / 177.6 | 2.73 (0.23↓) / 228.5
(c) V-Co + Drifting Loss                   | 4.86 / 164.3 | 2.85 (0.11↓) / 211.5
(d) V-Co + Perceptual-Drifting Hybrid Loss | 4.44 / 189.0 | 2.44 (0.52↓) / 249.9

Figure 4. Influence of the DINO diffusion loss coefficient λ_d (FID and the pixel/DINOv2 gradient-norm ratio as functions of λ_d). See Sec. 3.4 for details.

REPA loss [25].
On top of the co-denoising objective, we further consider a REPA-style representation alignment loss on the pixel branch. Concretely, let h_ℓ^x denote the intermediate hidden representation of the pixel branch at layer ℓ. We align this hidden state to the representation of the ground-truth image x extracted by a frozen pretrained DINOv2 visual encoder φ(·). The auxiliary objective is defined as L_REPA = ‖g(h_ℓ^x) − φ(x)‖₂², where g(·) denotes a lightweight MLP projector used to map the intermediate hidden state to the encoder feature space. Empirically, we find that applying the REPA loss to the fourth block in JiT-Base yields the best performance, consistent with the configuration used in the original REPA paper, which adopts SiT-Base [29] as the backbone.

Perceptual loss in semantic feature space. We also consider a perceptual loss [21, 31] in the pretrained semantic feature space. Given the predicted clean image x̂ and the ground-truth image x, we extract their features using a frozen pretrained DINOv2 encoder φ(·) and minimize their discrepancy: L_perc = ‖φ(x̂) − φ(x)‖₂². Unlike REPA, which aligns intermediate hidden states, this loss directly supervises the predicted image in semantic feature space.

Figure 5. Comparison of guided FID (i.e., FID computed from samples generated with CFG) over training epochs, for V-Co and its variants with the REPA, perceptual, drifting, and hybrid losses.

Drifting loss in semantic feature space. Drifting loss [11] was recently proposed for single-step image generation to move the generated distribution toward the real one. Unlike diffusion v-prediction, REPA, and perceptual losses, which impose pairwise supervision between generated and real samples, drifting loss operates at the distribution level.
To study its effectiveness in multi-step diffusion, we implement it in DINO feature space. Let φ(·) denote a frozen DINOv2 encoder, let x̂ be the predicted clean image from the pixel branch, and define u = φ(x̂). We construct a drifting field as follows:

V(u) = V⁺(u) − V⁻(u),  (9)

V⁺(u) = (1 / Z⁺(u)) · E_{x⁺ ∼ p_data}[ k(u, φ(x⁺)) (φ(x⁺) − u) ],  (10)

V⁻(u) = (1 / Z⁻(u)) · E_{x⁻ ∼ p_gen}[ k(u, φ(x⁻)) (φ(x⁻) − u) ],  (11)

where p_data and p_gen denote the real and generated image distributions, respectively, k(a, b) = exp(−∥a − b∥²₂ / τ) is a similarity kernel, and Z⁺, Z⁻ normalize the kernel weights. The drifting loss is defined as L_drift = ∥u − sg(u + V(u))∥²₂, where sg(·) denotes stop-gradient.

Table 4. Feature rescaling vs. noise-schedule shifting in V-Co. We compare our default V-Co model with RMS scaling against variants without RMS scaling and with noise-schedule shifting. We report unguided results at CFG=1.0 and guided results at the best CFG selected from a sweep over [1.5, 5.5] for each method. We highlight the row with the best guided FID score in light blue.

Model                                                | Unguided (CFG=1.0) | Guided (Best CFG > 1.0)
                                                     | FID↓    IS↑        | FID↓    IS↑
(a) V-Co w/o RMS scaling                             | 9.12    188.6      | 5.28    150.4
(b) V-Co w/o RMS scaling + noise schedule shifting   | 4.81    205.6      | 2.93    272.4
(c) V-Co w/ RMS scaling (default)                    | 5.38    177.0      | 2.52    242.6

Perceptual-Drifting Hybrid loss. Perceptual loss and drifting loss provide two complementary forms of supervision. Perceptual loss encourages instance-level semantic fidelity by pulling each generated image toward the semantic feature of its paired ground-truth target, while drifting loss promotes distributional coverage by discouraging generated features from collapsing toward dense regions of the generated distribution.
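As a concrete reference for the two fields that are combined next, the drifting field of Eqs. (9)–(11) can be sketched over batches of features. This is a minimal numpy sketch under assumed (N, D) feature arrays; the expectations are replaced by kernel-weighted batch averages, and the function names are illustrative.

```python
import numpy as np

def drifting_field(u, feats_real, feats_gen, tau=0.2):
    """Drifting field V(u) = V+(u) - V-(u) (Eqs. 9-11): kernel-weighted
    attraction toward real features minus kernel-weighted attraction
    toward generated features (i.e., net repulsion from p_gen)."""
    def weighted_drift(feats):
        d2 = np.sum((feats - u) ** 2, axis=1)   # ||u - phi(x)||^2 per sample
        w = np.exp(-d2 / tau)                   # similarity kernel k(u, phi(x))
        w = w / w.sum()                         # Z normalizes the kernel weights
        return (w[:, None] * (feats - u)).sum(axis=0)
    return weighted_drift(feats_real) - weighted_drift(feats_gen)
```

For instance, with a single real feature at (1, 0) and a single generated feature at (−1, 0), the field at the origin points toward the real sample and away from the generated one.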
Motivated by this complementarity, we propose a hybrid objective that formulates perceptual alignment as a positive vector field and drifting-based repulsion as a negative correction. Unlike the original drifting loss, we replace its positive real-distribution term with the current sample's paired perceptual field while retaining the generated-distribution term as a negative correction. This is better suited to multi-step denoising, where each noisy input is paired with a specific ground-truth image.

Specifically, we define the positive perceptual field as:

V⁺(u_i) = φ(x_i) − φ(x̂_i),  (12)

which pulls the generated sample toward the semantic feature of its paired ground-truth target. The negative field computes repulsion from nearby generated samples within the same class. For each sample i, we compute normalized kernel weights over other samples j ≠ i of the same class:

α_ij = exp(−∥φ(x̂_i) − φ(x̂_j)∥² / τ_rep) / Σ_{k≠i} exp(−∥φ(x̂_i) − φ(x̂_k)∥² / τ_rep),  (13)

where τ_rep is the repulsion temperature. Note that Σ_{j≠i} α_ij = 1. The repulsion direction points from sample i toward the weighted centroid of its neighbors:

V⁻(u_i) = Σ_{j≠i} α_ij φ(x̂_j) − φ(x̂_i).  (14)

To adaptively balance attraction and repulsion, we introduce a similarity-based gating mechanism based on how close the generated feature is to its target:

s_i = exp(−∥φ(x̂_i) − φ(x_i)∥² / τ_gate),  (15)

where τ_gate is a temperature parameter controlling the sensitivity of the gate. We then combine the two fields into a hybrid field:

V_hyb(u_i) = s_i · V⁺(u_i) − (1 − s_i) · V⁻(u_i).  (16)

Intuitively, when the generated feature is far from the target (s_i ≈ 0), repulsion dominates to prevent mode collapse; when the generated feature is close to the target (s_i ≈ 1), pure attraction ensures clean convergence.
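The gated hybrid field of Eqs. (12)–(16) can be sketched batch-wise as follows. This is a numpy sketch under stated assumptions: squared-distance kernels, batch-level repulsion (ignoring the per-class restriction for brevity), and (N, D) feature arrays; it is an illustration, not the paper's code.

```python
import numpy as np

def hybrid_field(phi_gen, phi_tgt, tau_rep=0.2, tau_gate=10.0):
    """Perceptual-drifting hybrid field (Eqs. 12-16).
    phi_gen: (N, D) features of predicted images; phi_tgt: (N, D) targets."""
    v_pos = phi_tgt - phi_gen                                    # Eq. 12: pull toward paired target
    d2 = np.sum((phi_gen[:, None] - phi_gen[None]) ** 2, -1)     # pairwise squared distances
    w = np.exp(-d2 / tau_rep)
    np.fill_diagonal(w, 0.0)                                     # exclude j == i
    alpha = w / w.sum(axis=1, keepdims=True)                     # Eq. 13: weights sum to 1 per row
    v_neg = alpha @ phi_gen - phi_gen                            # Eq. 14: toward neighbor centroid
    s = np.exp(-np.sum((phi_gen - phi_tgt) ** 2, -1) / tau_gate) # Eq. 15: similarity gate
    return s[:, None] * v_pos - (1 - s)[:, None] * v_neg         # Eq. 16
```

When a sample already sits on its target, the gate is 1 and the field vanishes; when it is far from its target, the gate is near 0 and the field reduces to pure repulsion from neighboring generated samples.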
The final objective is:

L = L_v-co + λ_hyb L_hyb  (17)
  = L_v-co + λ_hyb ∥u_i − sg(u_i + V_hyb(u_i))∥²₂.  (18)

Analysis. Table 3 and Fig. 5 quantify the impact of different auxiliary objectives on co-denoising. REPA provides only a marginal guided improvement over the default V-Co objective (FID 2.91 vs. 2.96 for CFG > 1), suggesting limited benefit from intermediate alignment and potential constraints on the model's expressiveness [42]. Perceptual loss yields a larger gain (FID 2.73), indicating effective instance-level semantic alignment, while drifting loss provides a smaller improvement (FID 2.85) but contributes distribution-level supervision. Combining these complementary signals, our hybrid objective achieves the best guided result (FID 2.44), suggesting that instance-level alignment is most effective when paired with explicit distribution-level correction.

Finding 3: For visual co-denoising, auxiliary losses work best when combining instance-level alignment with distribution-level regularization.

3.5. How Should Semantic Features Be Calibrated for Co-Denoising?

Before concluding the recipe, we consider how to calibrate the semantic stream relative to the pixel stream during co-denoising. Since the two inputs lie in different representation spaces and can have very different signal scales, applying the same diffusion timestep to both streams may result in mismatched denoising difficulty and thus imbalanced optimization. Related co-denoising work has addressed this issue by shifting the semantic-stream diffusion schedule [1]. Here, we study two natural calibration strategies: rescaling the semantic features to match the pixel-space signal level, or equivalently shifting the semantic-stream diffusion schedule. As we show below, the two can be formulated to achieve the same signal-to-noise ratio (SNR).

Table 5. Reference results on ImageNet 256×256. FID and IS of 50K samples are evaluated. The "Pre-training" columns list the external models required to obtain the results. The #Params column reports the parameter count of the denoising model and the decoder (if used). Following previous works [1, 26], we mainly use FID as reference.

Method                         Tokenizer  SSL Encoder  #Params   #Epochs  FID↓   IS↑
Latent-space Diffusion
DiT-XL/2 [33]                  SD-VAE     -            675+49M   1400     2.27   278.2
SiT-XL/2 [29]                  SD-VAE     -            675+49M   1400     2.06   277.5
REPA [42], SiT-XL/2            SD-VAE     DINOv2       675+49M   800      1.42   305.7
LightningDiT-XL/2 [46]         VA-VAE     DINOv2       675+49M   800      1.35   295.3
DDT-XL/2 [41]                  SD-VAE     DINOv2       675+49M   400      1.26   310.6
RAE [50], DiT_DH-XL/2          RAE        DINOv2       839+415M  800      1.13   262.6
Pixel-space (non-diffusion)
JetFormer [38]                 -          -            2.8B      -        6.64   -
FractalMAR-H [27]              -          -            848M      600      6.15   348.9
Pixel-space Diffusion
ADM-G [12]                     -          -            554M      400      4.59   186.7
RIN [20]                       -          -            410M      -        3.42   182.0
SiD [16], UViT/2               -          -            2B        800      2.44   256.3
VDM++, UViT/2                  -          -            2B        800      2.12   267.7
SiD2 [17], UViT/2              -          -            N/A       -        1.73   -
SiD2 [17], UViT/1              -          -            N/A       -        1.38   -
PixelFlow [5], XL/4            -          -            677M      320      1.98   282.1
PixNerd [40], XL/16            -          DINOv2       700M      160      2.15   297.0
DeCo-XL/16 [30]                -          DINOv2       682M      600      1.69   304.0
PixelGen-XL/16 [31]            -          DINOv2       676M      160      1.83   293.6
ReDi [24], SiT-XL/2            -          DINOv2       675M      350      1.72   278.7
ReDi [24], SiT-XL/2            -          DINOv2       675M      800      1.61   295.1
Latent Forcing [1], JiT-B/16   -          DINOv2       465M      200      2.48   -
JiT-B/16 [26]                  -          -            131M      600      3.66   275.1
JiT-L/16 [26]                  -          -            459M      600      2.36   298.5
JiT-H/16 [26]                  -          -            953M      600      1.86   303.4
JiT-G/16 [26]                  -          -            2B        600      1.82   292.6
V-Co-B/16                      -          DINOv2       260M      200      2.52   242.6
V-Co-L/16                      -          DINOv2       918M      200      2.10   243.0
V-Co-H/16                      -          DINOv2       1.9B      200      1.85   246.5
V-Co-B/16                      -          DINOv2       260M      600      2.33   250.1
V-Co-L/16                      -          DINOv2       918M      500      1.72   245.3
V-Co-H/16                      -          DINOv2       1.9B      300      1.71   263.3
SNR matching via feature rescaling. As defined in Eq. (1), pixels and semantic features share the same timestep t ∈ [0, 1]. Under this flow-matching parameterization, the noised input takes the form

z_t = t s + (1 − t) ε,  (19)

where s denotes the clean signal and ε ∼ N(0, I) is the injected noise. We define the signal-to-noise ratio (SNR) at time t as the ratio between the signal power and the noise power:

SNR(t) = E∥t s∥² / E∥(1 − t) ε∥² = t² E∥s∥² / ((1 − t)² E∥ε∥²).  (20)

Since t is shared across streams and the noise scale is fixed, matching denoising difficulty reduces to matching signal magnitude. Let d denote the original semantic feature and d′ its rescaled version. We therefore rescale the semantic features as

d′ = α d,  α = √E[x²] / √E[d²],  (21)

so that the semantic features have the same RMS magnitude as the pixel signal.

Equivalent SNR matching via timestep shifting. Equivalently, one can keep the semantic feature d fixed and instead shift its timestep from t to t′ such that SNR_d(t′) = SNR_d′(t). Using Eq. (20), this gives

t′ / (1 − t′) = α · t / (1 − t),  and  t′ = α t / (1 + (α − 1) t).  (22)

Therefore, rescaling the semantic features by α is SNR-equivalent to applying a shifted diffusion schedule to the unscaled semantic stream.

Analysis. In Table 4, we compare our default V-Co model against variants without RMS scaling and with noise-schedule shifting in place of RMS scaling. Removing RMS scaling substantially worsens performance relative to our default setting (guided FID: 5.28 vs. 2.52). Replacing RMS scaling with noise-schedule shifting gives broadly comparable overall results, but yields worse guided FID (2.93 vs. 2.52), consistent with the SNR-based equivalence discussed above. In practice, we adopt RMS-based feature scaling because of its simplicity and strong performance.
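The equivalence between the two calibration strategies can be checked numerically. A minimal numpy sketch of Eqs. (21)–(22); the function names and the per-tensor RMS estimate are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rms_scale(x, d):
    """Eq. 21: rescale semantic features d to the RMS magnitude of pixels x."""
    alpha = np.sqrt(np.mean(x ** 2)) / np.sqrt(np.mean(d ** 2))
    return alpha, alpha * d

def shifted_timestep(t, alpha):
    """Eq. 22: timestep t' that is SNR-equivalent to rescaling d by alpha."""
    return alpha * t / (1 + (alpha - 1) * t)
```

The shifted timestep satisfies t′/(1 − t′) = α · t/(1 − t), i.e., the unscaled stream at t′ sees the same SNR as the rescaled stream at t, which is exactly why the two calibrations are interchangeable in principle.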
Finding 4: Effective co-denoising requires calibrating the semantic and pixel streams to comparable denoising difficulty. RMS-based feature rescaling provides a simple and practical solution.

4. Full Recipe and SoTA Comparison

We combine the best-performing ablation choices into a practical visual co-denoising (V-Co) recipe: a fully dual-stream JiT backbone (Sec. 3.2), structural semantic-to-pixel masking with joint dropout for CFG (Sec. 3.3), a perceptual-drifting hybrid loss (Sec. 3.4), and RMS-based feature rescaling (Sec. 3.5). These components are complementary: the architecture determines where streams interact, masking defines how guidance is formed, the hybrid loss specifies what semantic supervision to apply, and RMS scaling matches denoising difficulty.

We next evaluate this full recipe against prior SoTA methods on ImageNet [10] 256×256, including both latent-space diffusion models and pixel-space methods. As shown in Table 5, V-Co achieves strong performance among pixel-space diffusion models. Notably, V-Co-B/16, with only 260M parameters, matches JiT-L/16 with 459M parameters (FID 2.33 vs. 2.36). V-Co-L/16 and V-Co-H/16, trained for 500 and 300 epochs respectively, outperform JiT-G/16 with 2B parameters (FID 1.71 vs. 1.82) and other strong pixel-diffusion methods. In addition, our method with simple RMS scaling matches or outperforms Latent Forcing [1], which relies on separate noise schedules for pixels and DINOv2 features. These results demonstrate that a carefully designed co-denoising recipe is both effective and scalable for representation-aligned pixel-space generation.

5. Conclusion

We presented V-Co, a visual co-denoising framework for representation-aligned pixel-space generation.
We identify four key ingredients for effective co-denoising: a fully dual-stream backbone, structural semantic-to-pixel masking for classifier-free guidance, a perceptual-drifting hybrid loss, and RMS-based feature rescaling. Together, they form a simple and scalable recipe that improves semantic alignment and generative quality, with strong ImageNet results and clear scalability with model size and training duration. We hope this work can inspire future research on co-denoising and representation-aligned generative modeling.

6. Acknowledgments

This work was supported by ONR Grant N00014-23-1-2356, ARO Award W911NF2110220, DARPA ECOLE Program No. HR00112390060, and NSF-AI Engage Institute DRL-2112635. The views contained in this article are those of the authors and not of the funding agency.

References

[1] Alan Baade, Eric Ryan Chan, Kyle Sargent, Changan Chen, Justin Johnson, Ehsan Adeli, and Li Fei-Fei. Latent forcing: Reordering the diffusion trajectory for pixel-space image generation. arXiv preprint, 2026.
[2] Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model. arXiv preprint, 2025.
[3] Hun Chang, Byunghee Cha, and Jong Chul Ye. Dino-sae: Dino spherical autoencoder for high-fidelity image reconstruction and generation. arXiv preprint, 2026.
[4] Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. Videojam: Joint appearance-motion representations for enhanced motion generation in video models. In Forty-second International Conference on Machine Learning, 2025.
[5] Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. Pixelflow: Pixel-space generative models with flow. arXiv preprint arXiv:2504.07963, 2025.
[6] Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, and Ying Tai. Dip: Taming diffusion models in pixel space. arXiv preprint arXiv:2511.18822, 2025.
[7] Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, and Enrico Shippole. Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. In Forty-first International Conference on Machine Learning, 2024.
[8] Lingwei Dang, Zonghan Li, Juntong Li, Hongwen Zhang, Liang An, Yebin Liu, and Qingyao Wu. Syncmv4d: Synchronized multi-view joint diffusion of appearance and motion for hand-object interaction synthesis. arXiv preprint arXiv:2511.19319, 2025.
[9] Lingwei Dang, Ruizhi Shao, Hongwen Zhang, Wei Min, Yebin Liu, and Qingyao Wu. Svimo: Synchronized diffusion for video and motion generation in hand-object interaction scenarios. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[11] Mingyang Deng, He Li, Tianhong Li, Yilun Du, and Kaiming He. Generative modeling via drifting. arXiv preprint arXiv:2602.04770, 2026.
[12] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
[13] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
[14] Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, et al. Tv2tv: A unified framework for interleaved language and video generation. arXiv preprint arXiv:2512.05103, 2025.
[15] Karl Heun et al. Neue Methoden zur approximativen Integration der Differentialgleichungen einer unabhängigen Veränderlichen. Z. Math. Phys., 45(23-38):7, 1900.
[16] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. Simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, pages 13213–13232. PMLR, 2023.
[17] Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, and Tim Salimans. Simpler diffusion: 1.5 FID on ImageNet512 with pixel-space diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18062–18071, 2025.
[18] Zheyuan Hu, Chieh-Hsin Lai, Ge Wu, Yuki Mitsufuji, and Stefano Ermon. Meanflow transformers with representation autoencoders. arXiv preprint arXiv:2511.13019, 2025.
[19] Jiehui Huang, Yuechen Zhang, Xu He, Yuan Gao, Zhi Cen, Bin Xia, Yan Zhou, Xin Tao, Pengfei Wan, and Jiaya Jia. Unityvideo: Unified multi-modal multi-task learning for enhancing world-aware video generation. arXiv preprint arXiv:2512.07831, 2025.
[20] Allan Jabri, David J Fleet, and Ting Chen. Scalable adaptive computation for iterative generation. In International Conference on Machine Learning, pages 14569–14589. PMLR, 2023.
[21] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[22] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint, 2014.
[23] Diederik P Kingma and Max Welling. Auto-encoding variational bayes.
arXiv preprint arXiv:1312.6114, 2013.
[24] Theodoros Kouzelis, Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Boosting generative image modeling via joint image-feature synthesis. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[25] Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking VAE for end-to-end tuning of latent diffusion transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025.
[26] Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720, 2025.
[27] Tianhong Li, Qinyi Sun, Lijie Fan, and Kaiming He. Fractal generative models. arXiv preprint, 2025.
[28] Yiyang Lu, Susie Lu, Qiao Sun, Hanhong Zhao, Zhicheng Jiang, Xianbang Wang, Tianhong Li, Zhengyang Geng, and Kaiming He. One-step latent-free image generation with pixel mean flows. arXiv preprint, 2026.
[29] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024.
[30] Zehong Ma, Longhui Wei, Shuai Wang, Shiliang Zhang, and Qi Tian. Deco: Frequency-decoupled pixel diffusion for end-to-end image generation. arXiv preprint arXiv:2511.19365, 2025.
[31] Zehong Ma, Ruihan Xu, and Shiliang Zhang. Pixelgen: Pixel diffusion beats latent diffusion with perceptual loss. arXiv preprint arXiv:2602.02493, 2026.
[32] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al.
Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[33] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
[34] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
[35] Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? arXiv preprint, 2025.
[36] Yuchuan Tian, Hanting Chen, Mengyu Zheng, Yuchen Liang, Chao Xu, and Yunhe Wang. U-repa: Aligning diffusion u-nets to vits. arXiv preprint arXiv:2503.18414, 2025.
[37] Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, and Saining Xie. Scaling text-to-image diffusion transformers with representation autoencoders. arXiv preprint arXiv:2601.16208, 2026.
[38] Michael Tschannen, André Susano Pinto, and Alexander Kolesnikov. Jetformer: An autoregressive generative model of raw images and text. In The Thirteenth International Conference on Learning Representations, 2025.
[39] Mengmeng Wang, Dengyang Jiang, Liuzhuozheng Li, Yucheng Lin, Guojiang Shen, Xiangjie Kong, Yong Liu, Guang Dai, and Jingdong Wang. Vae-repa: Variational autoencoder representation alignment for efficient diffusion training. arXiv preprint arXiv:2601.17830, 2026.
[40] Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, and Limin Wang. Pixnerd: Pixel neural field diffusion. arXiv preprint arXiv:2507.23268, 2025.
[41] Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang.
Ddt: Decoupled diffusion transformer. arXiv preprint arXiv:2504.05741, 2025.
[42] Ziqiao Wang, Wangbo Zhao, Yuhao Zhou, Zekai Li, Zhiyuan Liang, Mingjia Shi, Xuanlei Zhao, Pengfei Zhou, Kaipeng Zhang, Zhangyang Wang, et al. Repa works until it doesn't: Early-stopped, holistic alignment supercharges diffusion training. arXiv preprint, 2025.
[43] Jianzong Wu, Hao Lian, Dachao Hao, Ye Tian, Qingyu Shi, Biaolong Chen, Hao Jiang, and Yunhai Tong. Does hearing help seeing? Investigating audio-video joint denoising for video generation. arXiv preprint, 2025.
[44] Lehan Yang, Lu Qi, Xiangtai Li, Sheng Li, Varun Jampani, and Ming-Hsuan Yang. Unified dense prediction of video diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28963–28973, 2025.
[45] Yuxiao Yang, Hualian Sheng, Sijia Cai, Jing Lin, Jiahao Wang, Bing Deng, Junzhe Lu, Haoqian Wang, and Jieping Ye. Echomotion: Unified human video and motion generation via dual-modality diffusion transformer. arXiv preprint arXiv:2512.18814, 2025.
[46] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
[47] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In The Thirteenth International Conference on Learning Representations, 2025.
[48] Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, and Jiebo Luo. Pixeldit: Pixel diffusion transformers for image generation. arXiv preprint, 2025.
[49] Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, and Yu Cheng.
Videorepa: Learning physics for video generation through relational alignment with foundation models. arXiv preprint arXiv:2505.23656, 2025.
[50] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint, 2025.
[51] Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loic Magne, et al. Flare: Robot learning with implicit world modeling. arXiv preprint, 2025.
[52] Zhide Zhong, Haodong Yan, Junfeng Li, Xiangchen Liu, Xin Gong, Tianran Zhang, Wenxuan Song, Jiayi Chen, Xinhu Zheng, Hesheng Wang, et al. Flowvla: Visual chain of thought-based motion reasoning for vision-language-action models. arXiv preprint arXiv:2508.18269, 2025.

Appendix

A. Experiment Setup Details

We report the hyper-parameters of the final model used for evaluation in Table 6. Throughout Sec. 3, these design choices are introduced progressively and tuned step by step toward the final configuration.

Table 6. Configurations of experiments. Architecture, feature pre-processing, training, and sampling settings for V-Co-B/L/H. We color the newly added hyper-parameters on top of the JiT [26] codebase in light blue.

                                               V-Co-B    V-Co-L    V-Co-H
Architecture
  Depth                                        12        24        32
  Hidden dim                                   768       1024      1280
  Heads                                        12        16        16
  Image size                                   256
  Patch size                                   16
  Bottleneck                                   128       128       256
  Dropout                                      0
  In-context class tokens                      32 (if used)
  In-context start block                       4         8         10
Feature Pre-Processing
  Pixels                                       [−1, 1] linear min-max rescaling
  DINOv2                                       patch-level scaling following RAE [50]
Training
  Epochs                                       200 (ablation), 600
  Warmup epochs [26]                           5
  Optimizer                                    Adam [22], β1, β2 = 0.9, 0.95
  Batch size                                   1024
  Learning rate                                2e-4
  LR schedule                                  constant
  Weight decay                                 0
  EMA decay                                    {0.9996, 0.9998, 0.9999}
  Time sampler                                 logit(t) ∼ N(µ, σ²), µ = −0.8, σ = 0.8
  Noise scale                                  1.0
  Clip of (1 − t) in division                  0.05
  Class & DINOv2 tokens joint dropout (CFG)    0.1
  Attention mask when dropout (CFG)            semantic-to-pixel mask
  λ_d                                          0.1
  λ_hyb                                        10.0
  τ_gate                                       10.0
  τ_rep                                        0.2
Sampling
  ODE solver                                   Heun [15]
  ODE steps                                    50
  Time steps                                   linear in [0.0, 1.0]
  CFG scale sweep range                        [1.0, 4.0]
  CFG interval                                 [0.1, 1] (if used)

Specifically, in Sec. 3.2, we start from a minimal co-denoising setup. At this stage, the only additional component is the DINOv2 [32] branch in the feature preprocessing stage, where DINOv2-Base features are normalized using the dataset-level statistics computed in RAE [50]. For conditioning dropout, we adopt the standard independent dropout strategy, with class-label dropout set to 0.1 and DINOv2 feature dropout set to 0.2, applied separately rather than jointly. No attention mask is used in this section. The DINOv2 denoising loss coefficient (i.e., λ_d) is set to 0.1 from this stage onward.

Table 7. Comparison of architectural designs for visual co-denoising. We compare baseline backbones, single-stream fusion strategies, and dual-stream fusion variants with different allocations of feature-specific and shared/dual-stream blocks. All variants keep the pixel-stream depth fixed at 12 JiT blocks for fair comparison. JiT-B/16‡ and JiT-B/16† denote widened variants with hidden dimensions increased from 768 to 1024 and 1088, respectively, to match the parameter counts of the dual-stream models. Blue rows mark the stronger variants used in subsequent analysis. Following previous works [1, 26], we mainly use FID as reference. We highlight the default setting of the dual-stream model architecture in V-Co in light blue.

Model                    Backbone    #Params  #Feature-Specific  #Shared/Dual-     FID↓ (CFG=1.0)  IS↑
                                              Blocks             Stream Blocks
Baselines
(a) JiT-B/16 [26]        JiT-B/16    133M     -                  -                 32.54           49.5
(b) JiT-B/16 [26]        JiT-B/16†   261M     -                  -                 22.67           69.9
(c) LatentForcing [1]    JiT-B/16    156M     2                  10                13.06           102.2
Single-Stream JiT Architecture
(d) DirectAddition       JiT-B/16    156M     2                  10                15.15           103.4
(e) DirectAddition       JiT-B/16    177M     4                  8                 12.90           112.7
(f) DirectAddition       JiT-B/16    198M     6                  6                 11.77           116.2
(g) DirectAddition       JiT-B/16    220M     8                  4                 14.20           104.1
(h) DirectAddition       JiT-B/16    241M     10                 2                 14.43           99.4
(i) ChannelConcat        JiT-B/16    157M     2                  10                14.33           107.7
(j) ChannelConcat        JiT-B/16    178M     4                  8                 11.93           117.3
(k) ChannelConcat        JiT-B/16    200M     6                  6                 11.23           119.0
(l) ChannelConcat        JiT-B/16    221M     8                  4                 13.73           104.6
(m) ChannelConcat        JiT-B/16    242M     10                 2                 14.60           99.7
(n) TokenConcat          JiT-B/16    156M     2                  10                14.70           103.8
(o) TokenConcat          JiT-B/16    177M     4                  8                 12.59           112.8
(p) TokenConcat          JiT-B/16    198M     6                  6                 12.35           116.7
(q) TokenConcat          JiT-B/16    220M     8                  4                 14.31           104.4
(r) TokenConcat          JiT-B/16    241M     10                 2                 14.97           99.4
(s) TokenConcat          JiT-B/16‡   265M     6                  6                 9.74            129.4
(t) TokenConcat          JiT-B/16‡   274M     8                  4                 11.43           118.1
(u) TokenConcat          JiT-B/16‡   284M     10                 2                 12.90           109.4
Dual-Stream JiT Architecture
(v) TokenConcat          JiT-B/16    260M     6                  6                 11.78           115.4
(w) TokenConcat          JiT-B/16    260M     4                  8                 11.40           118.3
(x) TokenConcat          JiT-B/16    260M     2                  10                10.24           124.5
(y) TokenConcat          JiT-B/16    260M     0                  12                8.86            132.8

Starting from Sec. 3.3, we further refine the conditioning design. As shown in Tables 2 and 8, jointly dropping class labels and DINOv2 features yields the best performance, and we therefore adopt a joint dropout probability of 0.1 from this section onward. Moreover, the semantic-to-pixel attention mask achieves the strongest results in Table 2, and is thus used as the default setting in the subsequent experiments. In Sec.
3.4, we further introduce the perceptual-drifting hybrid loss to improve feature alignment during co-denoising. The corresponding hyper-parameters, including λ_hyb, τ_gate, and τ_rep, are introduced from this section onward. Finally, in Sec. 3.5, we compare noise-schedule shifting and RMS scaling for feature calibration. Since RMS scaling performs best in our ablation, we adopt it as the default calibration strategy in the final model.

B. Additional Ablations

Extended ablation of single-stream fusion strategies. In Table 7, we present additional quantitative results that complement Table 1 in the main paper. Specifically, we study different allocations of feature-specific and shared blocks under three interaction strategies: direct addition, channel concatenation, and token concatenation. Overall, we observe that a balanced design, with 6 feature-specific blocks and 6 shared blocks out of 12 total blocks, generally yields the best performance. For the token-concatenation strategy, we further examine widened variants by increasing the hidden dimension from 768 to 1024, resulting in models with 265M to 284M parameters. Nevertheless, none of these variants surpasses our default dual-stream co-denoising design, which achieves the best performance with 260M parameters.

Extended ablation of label and DINOv2 dropout strategies. In Table 8, we provide additional quantitative results complementing Table 2 in the main paper. Specifically, we evaluate different dropout ratios for labels and DINOv2 features, applied either independently or jointly. The results indicate that independently dropping labels or DINOv2 features generally underperforms joint dropout. As discussed in the main paper, once the unconditional prediction is structurally defined through semantic-to-pixel masking, removing all conditioning inputs from the pixel branch during training leads to better alignment with inference-time behavior.

Table 8. Comparison of different label/DINO dropout strategies under unguided (i.e., CFG=1.0) and guided (i.e., CFG=2.9) generation. The best and second-best numbers are bolded and underlined. We highlight the default setting of joint dropout of class labels and DINOv2 features in V-Co in light blue.

Label Dropout Ratio  DINO Dropout Ratio  | CFG=1.0            | CFG=2.9
                                         | FID↓     IS↑       | FID↓     IS↑
Independent Dropout
0.1                  0.1                 | 7.01     136.37    | 3.77     189.38
0.1                  0.2                 | 7.28     136.84    | 3.59     189.69
0.1                  0.3                 | 7.78     129.55    | 3.73     188.38
0.2                  0.1                 | 7.50     128.84    | 4.11     175.04
0.2                  0.2                 | 7.47     131.44    | 3.98     180.24
0.2                  0.3                 | 9.04     117.94    | 4.11     173.21
0.3                  0.1                 | 8.80     117.98    | 4.38     165.06
0.3                  0.2                 | 8.40     122.23    | 4.09     172.15
0.3                  0.3                 | 9.57     114.64    | 4.52     165.55
Joint Dropout
0.1                  0.1                 | 5.38     161.4     | 3.55     214.39
0.2                  0.2                 | 5.62     158.51    | 3.18     219.60
0.3                  0.3                 | 5.17     159.92    | 3.17     219.41

Ablation of the similarity-based gating s_i in Eq. (15). To evaluate the effectiveness of the proposed similarity-based gating mechanism, we conduct an ablation where the adaptive gate s_i is replaced with a scalar value s, removing its dependence on real and generated samples.

Table 9. Ablation of the similarity-based gating s_i in Eq. (15). We conduct an ablation using a simplified version of s_i, where the gate is replaced with a scalar value s instead of being dependent on real and generated samples. We report both unguided and guided FID and IS scores while sweeping the scalar s. The row with the best guided FID score is highlighted in light blue.

s                                        | Unguided (CFG=1.0) | Guided (Best CFG > 1.0)
                                         | FID↓     IS↑       | FID↓     IS↑
Scalar gate s
0                                        | 15.72    80.1      | 8.13     114.4
0.001                                    | 11.24    114.7     | 4.24     171.7
0.01                                     | 11.81    109.9     | 4.46     165.6
0.1                                      | 13.22    99.2      | 5.35     151.6
0.5                                      | 10.40    186.0     | 5.68     253.3
0.9                                      | 5.48     174.7     | 2.75     233.9
0.99                                     | 5.33     183.4     | 2.83     246.7
0.999                                    | 5.41     182.6     | 2.76     249.8
Similarity-Based Gating s_i (Default in V-Co)
s_i                                      | 5.17     181.6     | 2.61     243.9
Under this simplification, the hybrid potential becomes V_hyb(u_i) = s · V_pos(u_i) − (1 − s) · V_neg(u_i). We report both unguided and guided FID and IS scores while sweeping the scalar s. As shown in Table 9, the default similarity-based gating in V-Co achieves a better guided FID score than every setting of the simplified scalar gate s, demonstrating the effectiveness of our design.

Table 10. Comparison of different repulsion temperatures τ_rep. We report both unguided and guided FID and IS scores by sweeping τ_rep; the row with the best guided FID score is marked with (*).

  τ_rep     FID↓ (CFG=1.0)   IS↑ (CFG=1.0)   FID↓ (best CFG>1.0)   IS↑ (best CFG>1.0)
  2e-3      4.97             186.4           2.72                  251.3
  2e-2      4.92             186.7           2.66                  251.7
  2e-1 (*)  5.17             181.6           2.61                  243.9
  2         5.33             180.1           2.64                  244.8
  2e1       5.30             180.7           2.67                  247.7

Table 11. Comparison of different gate temperatures τ_gate. We report both unguided and guided FID and IS scores by sweeping τ_gate; the row with the best guided FID score is marked with (*).

  τ_gate    FID↓ (CFG=1.0)   IS↑ (CFG=1.0)   FID↓ (best CFG>1.0)   IS↑ (best CFG>1.0)
  1e-2      11.66            112.2           4.01                  173.5
  1e-1      10.96            114.6           4.06                  173.6
  1         7.34             136.9           3.10                  189.6
  1e1       5.41             108.8           2.75                  247.7
  1e2 (*)   5.17             181.6           2.61                  243.9
  1e3       5.11             185.8           2.77                  252.2

Table 12. Comparison of different hybrid loss coefficients λ_hyb. We report both unguided and guided FID and IS scores by sweeping λ_hyb; the row with the best guided FID score is marked with (*).

  λ_hyb     FID↓ (CFG=1.0)   IS↑ (CFG=1.0)   FID↓ (best CFG>1.0)   IS↑ (best CFG>1.0)
  1e-2      7.10             139.6           3.30                  187.0
  1e-1      6.72             142.5           3.17                  189.9
  1         5.23             162.5           2.74                  214.0
  1e1 (*)   5.17             181.6           2.61                  243.9
  1e2       13.61            133.4           4.49                  221.4
  1e3       31.30            79.3            17.90                 124.4
  1e4       65.84            41.8            58.04                 56.0

Hyper-parameter tuning for the perceptual-drifting hybrid loss.
In Tables 10 to 12, we perform sweeps over the three key hyper-parameters of the perceptual-drifting hybrid loss: the repulsion temperature τ_rep, the gate temperature τ_gate, and the hybrid loss coefficient λ_hyb. The default hyper-parameters used in V-Co are selected from these settings based on the best guided FID scores.

Comparison of different DINOv2 model sizes. In Table 13, we compare V-Co trained with DINOv2 features of different model sizes as semantic representations for co-denoising with pixels. For each DINOv2 model, we re-compute the feature scaling factor based on its RMS value to ensure that the SNR ratio between the DINOv2 features and pixels remains consistent. We also sweep over different DINOv2 diffusion loss coefficients λ_d, as different encoder sizes may perform best under different loss scales. The results show that even relatively small representation encoders preserve sufficient low-level detail for co-denoising. A similar trend has also been reported in Table 15(b) of RAE [50], Figure 3(b) of iREPA [35], and Table 2 of REPA [47]. REPA [47] attributes this behavior to the fact that all DINOv2 models are distilled from DINOv2-g and therefore share similar representations.

Table 13. Comparison of different DINOv2 model sizes. We report both unguided and guided FID and IS scores while sweeping the DINOv2 diffusion loss coefficient λ_d over {1e-3, 1e-2, 1e-1, 1}. For each DINOv2 model size, the row with the best guided FID score is marked with (*).

  Model          λ_d       #Params   FID↓ (CFG=1.0)   IS↑ (CFG=1.0)   FID↓ (best CFG>1.0)   IS↑ (best CFG>1.0)
  DINOv2-Small   1e-3      22M       9.04             118.2           5.06                  156.4
  DINOv2-Small   1e-2 (*)  22M       6.70             134.4           3.67                  176.2
  DINOv2-Small   1e-1      22M       6.35             140.0           4.11                  174.8
  DINOv2-Small   1         22M       9.05             126.0           7.97                  145.3
  DINOv2-Base    1e-3      86M       12.27            107.7           6.45                  151.9
  DINOv2-Base    1e-2 (*)  86M       9.81             118.2           5.16                  163.2
  DINOv2-Base    1e-1      86M       8.83             120.6           5.54                  154.8
  DINOv2-Base    1         86M       16.77            95.1            16.48                 106.9
  DINOv2-Large   1e-3      304M      13.92            96.3            6.59                  143.1
  DINOv2-Large   1e-2 (*)  304M      9.19             119.9           4.20                  173.8
  DINOv2-Large   1e-1      304M      8.70             124.3           4.57                  174.7
  DINOv2-Large   1         304M      32.28            82.2            25.32                 94.0
  DINOv2-Giant   1e-3      1.1B      13.15            99.3            7.46                  143.4
  DINOv2-Giant   1e-2      1.1B      10.41            112.1           5.42                  160.5
  DINOv2-Giant   1e-1 (*)  1.1B      8.91             120.1           5.00                  166.7
  DINOv2-Giant   1         1.1B      23.02            83.4            24.18                 95.9

C. Generated Samples

In Figs. 6 to 8, we present uncurated ImageNet 256×256 samples generated by V-Co-H/16 after 300 epochs of training, conditioned on the specified classes. Unlike the common practice of using a larger CFG value for visualization, we instead show samples generated with the same CFG value (1.5) used to obtain the reported FID of 1.71.

D. Limitation and Future Work

While V-Co provides a clear and effective recipe for visual co-denoising in pixel-space diffusion, several limitations remain. First, our study focuses on class-conditional generation on ImageNet-256, which offers a controlled setting for isolating the effects of architecture, CFG design, auxiliary objectives, and feature calibration, but does not capture the full diversity of generation settings such as open-ended text-to-image synthesis or more structured multimodal tasks. Extending the proposed recipe beyond ImageNet-style class conditioning is therefore an important direction for future work. Second, V-Co relies on pretrained semantic features from a strong external visual encoder (i.e., DINOv2).
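Because each DINOv2 size produces features at a different scale, the scaling factor in Table 13 is re-computed from the RMS of each encoder's features so that the feature-to-pixel SNR ratio stays fixed. A minimal sketch of such RMS-based rescaling, where `target_rms` and the epsilon are hypothetical placeholders rather than the paper's actual calibration values:

```python
import numpy as np

def rms(x):
    """Root-mean-square over all elements of a feature tensor."""
    return float(np.sqrt(np.mean(np.square(x))))

def rms_rescale(features, target_rms=1.0):
    """Rescale features so their RMS matches target_rms. Re-computing this
    scale per encoder is what keeps the SNR ratio between semantic features
    and pixels consistent across DINOv2 sizes in the Table 13 comparison."""
    return features * (target_rms / (rms(features) + 1e-8))
```

Any feature tensor rescaled this way ends up with the same RMS, regardless of which encoder variant produced it.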
While this design is well aligned with our representation-alignment perspective and substantially improves semantic supervision in pixel-space generation, the resulting co-denoising dynamics may still depend on the quality, inductive biases, and spatial granularity of the teacher representation. Exploring alternative semantic feature sources is another promising direction. Finally, our method is intentionally minimalist and does not incorporate stronger auxiliary supervision, such as combining REPA-style objectives with our perceptual-drifting hybrid loss. This keeps the empirical conclusions clean, but future work may explore how the V-Co recipe interacts with richer objectives and stronger supervision.

Figure 6. Uncurated samples on ImageNet 256×256 using V-Co-H/16 conditioned on the specified classes: 012 (house finch), 014 (indigo bunting), 042 (agama), 081 (ptarmigan), 107 (jellyfish), 108 (sea anemone), 110 (flatworm), 117 (chambered nautilus), 130 (flamingo), and 279 (Arctic fox). Unlike the common practice of visualizing with a higher CFG, here we show images using the CFG value (1.5) that achieves the reported FID of 1.71.

Figure 7. Uncurated samples on ImageNet 256×256 using V-Co-H/16 conditioned on the specified classes: 288 (leopard), 309 (bee), 349 (bighorn sheep), 397 (pufferfish), 425 (barn), 448 (birdhouse), 453 (bookcase), 458 (brass memorial plaque), 495 (china cabinet), and 500 (cliff dwelling). Unlike the common practice of visualizing with a higher CFG, here we show images using the CFG value (1.5) that achieves the reported FID of 1.71.
Figure 8. Uncurated samples on ImageNet 256×256 using V-Co-H/16 conditioned on the specified classes: 658 (mitten), 661 (Model T), 718 (pier), 724 (pirate ship), 725 (pitcher, ewer), 757 (recreational vehicle), 779 (school bus), 780 (schooner), 829 (streetcar), and 853 (thatched roof). Unlike the common practice of visualizing with a higher CFG, here we show images using the CFG value (1.5) that achieves the reported FID of 1.71.
