Transition Flow Matching
Chenrui Ma
University of California, Irvine, Irvine, CA 92697, USA
chenrum@uci.edu

Abstract. Mainstream flow matching methods typically focus on learning the local velocity field, which inherently requires multiple integration steps during generation. In contrast, Mean Velocity Flow models establish a relationship between the local velocity field and the global mean velocity, enabling the latter to be learned through a mathematically grounded formulation and allowing generation to be transferred to arbitrary future time points. In this work, we propose a new paradigm that directly learns the transition flow. As a global quantity, the transition flow naturally supports generation in a single step or at arbitrary time points. Furthermore, we demonstrate the connection between our approach and Mean Velocity Flow, establishing a unified theoretical perspective. Extensive experiments validate the effectiveness of our method and support our theoretical claims.

Keywords: Generative Modeling · Flow Matching · Fewer/One-step Generation

1 Introduction

The goal of generative modeling is to transform a prior distribution into the data distribution. Flow Matching [18, 35, 37] provides an intuitive and conceptually simple framework for constructing flow paths that transport one distribution to another. Closely related to diffusion models [1, 4, 45], Flow Matching focuses on learning the velocity fields that guide the generative process during training.

Both Flow Matching [26] and diffusion [45] models rely on iterative sampling during generation. Recently, significant attention has been devoted to few-step, and particularly one-step feedforward, generative models. One common approach to accelerate Flow/Diffusion models is distillation. However, an ideal solution would allow models to be trained from scratch in an end-to-end manner without relying on pre-trained teachers. Pioneering this direction, Consistency Models [9, 13, 44] introduce a consistency constraint on network outputs for inputs sampled along the same trajectory. Despite their promising performance, this constraint is imposed as a behavioral property of the network, while the properties of the underlying ground-truth field that should guide learning remain unclear [19, 30]. As a result, training can be unstable and typically requires a carefully designed discretization curriculum to progressively constrain the time domain [33, 43, 44]. In contrast, Mean Velocity Models [10, 11, 17, 31, 50] optimize an objective derived from the relationship between the instantaneous velocity at each time point and the mean velocity toward a future time. This formulation provides a more fundamental and elegant perspective for learning generative dynamics.

In this work, we propose a principled and effective framework, termed Transition Flow Matching, for few-step generation with arbitrary numbers of steps and step sizes. Instead of regressing a local vector field as in Flow/Diffusion models [36, 37, 45], our method directly models the generation trajectory itself, where the transition dynamics naturally represent a global counterpart of velocity. To this end, we derive (with proof) the Transition Flow Identity, and propose a principled training objective that enables generative models to satisfy this identity from scratch in an end-to-end manner.
This formulation extends and generalizes previous Transition Models [32, 38, 49] and Flow Map methods [3, 40]. Importantly, we further establish a unified perspective that clarifies the relationship between our framework and Mean Velocity methods [10, 11]. In experiments, we conduct extensive evaluations, including generation trajectory visualization and image generation tasks across multiple datasets. The results show that our approach achieves competitive performance, demonstrating the effectiveness of modeling transition flows. In addition, we perform ablation studies to analyze key implementation design choices. Our contributions can be summarized as follows:

• We propose Transition Flow Matching, a principled framework for few-step generative modeling.
• We derive the Transition Flow Identity and provide a theoretically grounded training objective.
• We establish a unified view connecting our method with Mean Velocity models.
• We demonstrate competitive performance across multiple image generation benchmarks.

2 Related Works

Flow Matching and Diffusion. Flow Matching [2, 7, 26] and Diffusion [1, 4, 45] models generate samples by parameterizing continuous-time dynamics via Ordinary Differential Equations (ODEs) or Stochastic Differential Equations (SDEs). Flow Matching learns a deterministic velocity field defining an ODE over sample evolution [28], while Diffusion models are typically formulated as SDEs that learn the score function [24]. Through the probability flow ODE formulation, the learned score implicitly defines an equivalent velocity field [27]. In both cases, the learned velocity or score is a local quantity depending only on the current state and time. As a result, generation requires numerical integration over multiple steps to gradually transport samples from noise to data. A common strategy to alleviate trajectory conflicts is distillation [7, 12, 25, 29, 41, 48, 52] from a well-trained Flow or Diffusion model, which provides a better coupling distribution than the standard independent coupling setting [21, 34, 42, 46]. In contrast, our Transition Flow Matching framework directly models global state transitions. Instead of regressing a local vector field, we parameterize mappings between states at different times, enabling direct transitions from a given state to an arbitrary future state without relying on local velocity regression.

Consistency Model and Mean Velocity Model. To reduce the number of inference steps, Consistency Models [9, 13, 19, 30, 43, 44] learn generation trajectories that maintain consistency over time, enforcing trajectory coherence. Meanwhile, Mean Velocity methods [10, 11, 17, 31, 50] establish a relationship between the instantaneous velocity at each time point and a mean velocity target toward a future time. By contrast, our method directly models the generation trajectory itself, where the transfer dynamics naturally represent a global counterpart of velocity. This enables generation with arbitrary step sizes and arbitrary numbers of steps. Importantly, we establish a unified perspective that clarifies the relationship between our framework and Mean Velocity methods [10, 11].

Transition Models and Flow Map. Transition-based approaches aim to learn state transitions by preserving transition identities.
Consistency Trajectory Models [20] learn mappings between arbitrary time steps, but rely on explicit ODE/SDE integration during training. Flow Map Matching [3] regresses zeroth- and first-order derivatives of flow fields, while IMM [53] performs moment matching across time steps. However, these methods either lack sufficient theoretical analysis, do not clearly connect to prior formulations, or rely on distillation. In contrast, our method is trained from scratch in an end-to-end manner and provides a clear theoretical relationship under both the Flow Matching [26] and Mean Velocity frameworks [10, 11].

3 Preliminaries

3.1 Notation and setup

Random variables and realizations. We work in $\mathbb{R}^d$. Uppercase letters (e.g., $X_t$, $Z$) denote random variables (RVs), and lowercase letters (e.g., $x_t$, $z$) denote their realizations. For the density of an RV $X_t$ we write $p_t(\cdot)$, and for a conditional density we write $p_{t|Z}(\cdot \mid z)$. Expectations are denoted by $\mathbb{E}[\cdot]$.

Source/target distributions and coupling. Let $X_0 \sim p_0$ be the source distribution (e.g., standard Gaussian noise) and $X_1 \sim p_1$ be the target distribution (e.g., images). A generative model constructs a continuous path of distributions $\{p_t\}_{t\in[0,1]}$ that transports $p_0$ to $p_1$. Let $(X_0, X_1)$ be any coupling with joint density $\pi$ whose marginals are $p_0$ and $p_1$. Throughout this work, we follow the standard Flow Matching setting where the source and target are independent: $\pi(x_0, x_1) = p_0(x_0)\, p_1(x_1)$.

Conditioning variable and conditional paths. We use a conditioning RV $Z$ to index conditional paths. Conditioned on $Z = z$, we obtain a conditional path $\{X_t^Z\}_{t\in[0,1]}$ with conditional density $p_{t|Z}(\cdot \mid z)$. The corresponding marginal path $\{X_t\}_{t\in[0,1]}$ has density $p_t(\cdot)$ satisfying

\[
p_t(x_t) = \int p_{t|Z}(x_t \mid z)\, p_Z(z)\, dz, \qquad X_t \sim p_t, \quad X_t \mid (Z = z) \sim p_{t|Z}(\cdot \mid z). \tag{1}
\]

General interpolant. We consider a general interpolant between the marginal endpoints $X_0$ and $X_1$, specified by scalar functions $\alpha : [0,1] \to \mathbb{R}$ and $\beta : [0,1] \to \mathbb{R}$:

\[
X_t = \alpha(t) X_0 + \beta(t) X_1, \qquad t \in [0,1]. \tag{2}
\]

We assume boundary conditions $\alpha(0) = 1$, $\beta(0) = 0$ and $\alpha(1) = 0$, $\beta(1) = 1$, so that $X_{t=0} = X_0$ and $X_{t=1} = X_1$. If $\alpha, \beta$ are differentiable, then

\[
\frac{d}{dt} X_t = \dot{\alpha}(t) X_0 + \dot{\beta}(t) X_1. \tag{3}
\]

Conditioned on $Z = z$, the same schedules induce the conditional path $X_t^Z = \alpha(t) X_0^Z + \beta(t) X_1^Z$ and $\frac{d}{dt} X_t^Z = \dot{\alpha}(t) X_0^Z + \dot{\beta}(t) X_1^Z$.

3.2 Flow Matching

Velocity fields. Let $v(x_t, t \mid z) \in \mathbb{R}^d$ denote a conditional velocity field that transports $p_{t|Z}(\cdot \mid z)$ over time. The corresponding marginal velocity is defined by conditional expectation:

\[
v(x_t, t) = \int v(x_t, t \mid z)\, p_{Z|t}(z \mid x_t)\, dz = \mathbb{E}\big[ v(X_t, t \mid Z) \mid X_t = x_t \big]. \tag{4}
\]

Generation ODE and continuity equation. A marginal trajectory follows the ODE

\[
\frac{dx_t}{dt} = v(x_t, t), \qquad t \in [0,1], \tag{5}
\]

and similarly a conditional trajectory follows $\frac{dx_t^z}{dt} = v(x_t, t \mid z)$. The induced marginal density $p_t$ satisfies the continuity (transport) equation

\[
\partial_t p_t(x_t) + \nabla \cdot \big( p_t(x_t)\, v(x_t, t) \big) = 0, \qquad t \in [0,1], \tag{6}
\]

and conditioned on $Z = z$, $p_{t|Z}(\cdot \mid z)$ satisfies

\[
\partial_t p_{t|Z}(x_t \mid z) + \nabla \cdot \big( p_{t|Z}(x_t \mid z)\, v(x_t, t \mid z) \big) = 0, \qquad t \in [0,1]. \tag{7}
\]

Learning Flow Matching.
Flow Matching introduces a parameterized model $v_\theta(X_t, t)$ to learn $v(X_t, t)$ by minimizing the marginal loss

\[
\mathcal{L}_{\mathrm{MFM}}(\theta) = \mathbb{E}_{t,\, X_t \sim p_t}\, D\big( v(X_t, t),\; v_\theta(X_t, t) \big), \tag{8}
\]

where $D(\cdot,\cdot)$ is typically a Bregman divergence (e.g., MSE). Since the marginal velocity $v(X_t, t)$ in Eq. (4) is generally intractable, one instead minimizes the conditional loss

\[
\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\, Z,\, X_t \sim p_{t|Z}(\cdot \mid Z)}\, D\big( v(X_t, t \mid Z),\; v_\theta(X_t, t) \big). \tag{9}
\]

Theorem 1 (Gradient equivalence of Flow Matching [27]). The gradients of the marginal Flow Matching loss and the conditional Flow Matching loss coincide:

\[
\nabla_\theta \mathcal{L}_{\mathrm{MFM}}(\theta) = \nabla_\theta \mathcal{L}_{\mathrm{CFM}}(\theta). \tag{10}
\]

In particular, the minimizer of the conditional Flow Matching loss is the marginal velocity $v(x_t, t)$.

Remark 1 (Standard Flow Matching). Flow Matching sets the conditioning variable to be the endpoint pair $Z = (X_0, X_1)$. Choosing linear schedules $\alpha(t) = 1 - t$ and $\beta(t) = t$ yields the conditional path $X_t^Z = (1-t) X_0 + t X_1$ with constant conditional velocity $v(X_t, t \mid Z) = X_1 - X_0$. Thus Eq. (9) reduces to

\[
\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\, X_0 \sim p_0,\, X_1 \sim p_1}\, D\big( X_1 - X_0,\; v_\theta(X_t, t) \big), \qquad X_t = (1-t) X_0 + t X_1. \tag{11}
\]
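For concreteness, the following is a minimal sketch of the standard conditional objective in Eq. (11). The velocity model `v_theta`, its `(x_t, t)` signature, and the batch shapes are assumptions made for illustration rather than the exact implementation used here.

```python
import torch

def flow_matching_loss(v_theta, x1: torch.Tensor) -> torch.Tensor:
    """Conditional Flow Matching loss (Eq. 11) with the linear interpolant.

    v_theta: callable taking (x_t, t) and returning a velocity with the same
             shape as x_t (assumed signature).
    x1:      a batch of samples from the target distribution p_1.
    """
    x0 = torch.randn_like(x1)                        # source sample X_0 ~ N(0, I)
    t = torch.rand(x1.shape[0], device=x1.device)    # t ~ Uniform[0, 1]
    t_exp = t.view(-1, *([1] * (x1.dim() - 1)))      # broadcast t over data dims
    x_t = (1 - t_exp) * x0 + t_exp * x1              # interpolated state X_t
    target = x1 - x0                                 # constant conditional velocity
    return ((v_theta(x_t, t) - target) ** 2).mean()  # squared l2 Bregman divergence
```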
4 Method

Flow Matching learns a time-dependent velocity field $v(x_t, t)$ and generates samples by integrating the ODE in Eq. (5). Our goal is to directly learn a transition flow

\[
X_\theta(x_t, t, r) : \mathbb{R}^d \times [0,1] \times [0,1] \to \mathbb{R}^d, \qquad 0 \le t \le r \le 1,
\]

that maps a state $x_t$ at time $t$ to the future state at time $r$ along the same transport dynamics. At inference time, this enables stepping across an arbitrary time grid by repeatedly applying $X_\theta(\cdot)$, without explicitly integrating $v(\cdot)$.

4.1 Transition Flow Identity

Average velocity between two time steps. Given the marginal velocity $v(x_\tau, \tau)$ as in Eq. (4), define the average velocity $u(x_t, t, r)$ for any $0 \le t \le r \le 1$:

\[
(r - t)\, u(x_t, t, r) = x_{t \to r} - x_t = \int_t^r v(x_\tau, \tau)\, d\tau. \tag{12}
\]

Here $x_t$ is the current state at time $t$, and $x_{t \to r}$ denotes the marginal transition state at time $r$ obtained by evolving from $x_t$.

Marginal/conditional transition states. Analogous to Eq. (4), the marginal transition state can be expressed as a conditional expectation:

\[
x_{t \to r} = \int x^z_{t \to r}\, p_{Z|t}(z \mid x_t)\, dz = \mathbb{E}\big[ X^Z_{t \to r} \mid X_t = x_t \big], \tag{13}
\]

where the conditional transition state is $x^z_{t \to r} = x_t + \int_t^r v(x_\tau, \tau \mid z)\, d\tau$.

Transition flow. We define the (marginal) transition flow $X(x_t, t, r)$ as the mapping that returns the marginal transition state:

\[
X(x_t, t, r) = \int X(x_t, t, r \mid z)\, p_{Z|t}(z \mid x_t)\, dz = \mathbb{E}\big[ X(X_t, t, r \mid Z) \mid X_t = x_t \big], \tag{14}
\]

where $X(x_t, t, r \mid z) = x^z_{t \to r}$. Combining Eq. (12) with the definition of $X(\cdot)$ yields

\[
u(x_t, t, r) = \frac{x_{t \to r} - x_t}{r - t} = \frac{X(x_t, t, r) - x_t}{r - t}. \tag{15}
\]

Transition Flow Identity. Differentiating Eq. (12) with respect to $t$ (treating $r$ as independent of $t$) gives

\[
u(x_t, t, r) = v(x_t, t) + (r - t)\, \frac{d}{dt} u(x_t, t, r). \tag{16}
\]

Substituting Eq. (15) into Eq. (16) yields the following key identity:

\[
X(x_t, t, r) = x_{t \to r} + (r - t)\, \frac{d}{dt} X(x_t, t, r). \tag{17}
\]

We defer the detailed algebraic derivation to the Appendix.

4.2 Computing the time derivative of a transition flow

To make Eq. (17) actionable for learning, we expand the total derivative $\frac{d}{dt} X(x_t, t, r)$ as

\[
\frac{d}{dt} X(x_t, t, r) = \partial_{x_t} X \cdot \frac{dx_t}{dt} + \partial_t X \cdot \frac{dt}{dt} + \partial_r X \cdot \frac{dr}{dt} = \partial_{x_t} X \cdot v(x_t, t) + \partial_t X, \tag{18}
\]

where $\frac{dx_t}{dt} = v(x_t, t)$ by Eq. (5) and $\frac{dr}{dt} = 0$. In practice, Eq. (18) is given by the Jacobian-vector product (JVP) between the Jacobians of the function, $[\partial_{x_t} X, \partial_t X, \partial_r X]$, and the corresponding tangent vector $[v, 1, 0]$. Modern libraries such as PyTorch provide efficient JVP interfaces for this computation.

4.3 Transition Flow Matching objectives

Model parameterization. We introduce a Transition Flow model $X_\theta(x_t, t, r)$ to model $X(x_t, t, r)$. We use $\mathrm{sg}[\cdot]$ to denote a stop-gradient operator (i.e., the argument is treated as a constant target during optimization), and $D(\cdot,\cdot)$ is a Bregman divergence (e.g., MSE).

Intractable marginal objective. Ideally, we would enforce Eq. (17) by minimizing the marginal Transition Flow Matching objective (M-TFM):

\[
\mathcal{L}_{\mathrm{M\text{-}TFM}}(\theta) = \mathbb{E}_{t,\, r,\, X_t \sim p_t}\, D\big( \mathrm{sg}\big[ X^m_{\mathrm{tgt}}(X_t, t, r) \big],\; X_\theta(X_t, t, r) \big), \tag{19}
\]

with target

\[
X^m_{\mathrm{tgt}}(X_t, t, r) = X_{t \to r} + (r - t)\, \frac{d}{dt} X_\theta(X_t, t, r). \tag{20}
\]

However, the marginal transition state $X_{t \to r}$ and the marginal velocity $v(x_t, t)$ (needed inside $\frac{d}{dt} X_\theta$ via Eq. (18)) are generally intractable, hence Eq. (19) cannot be evaluated directly.

Tractable conditional objective. Instead, we minimize a conditional Transition Flow Matching objective (C-TFM):

\[
\mathcal{L}_{\mathrm{C\text{-}TFM}}(\theta) = \mathbb{E}_{t,\, r,\, Z,\, X_t \sim p_{t|Z}(\cdot \mid Z)}\, D\big( \mathrm{sg}\big[ X^c_{\mathrm{tgt}}(X_t, t, r \mid Z) \big],\; X_\theta(X_t, t, r) \big), \tag{21}
\]

where the conditional target is

\[
X^c_{\mathrm{tgt}}(X_t, t, r \mid Z) = X^Z_{t \to r} + (r - t)\, \frac{d}{dt} X_\theta(X_t, t, r). \tag{22}
\]

Crucially, under appropriate choices of $Z$ and the conditional path construction, both the conditional transition state $X^Z_{t \to r}$ and the conditional velocity $v(x_t, t \mid Z)$ become tractable, which makes $\frac{d}{dt} X_\theta$ computable via Eq. (18).

Theorem 2 (Gradient equivalence of Transition Flow Matching). The gradients of the marginal and conditional Transition Flow Matching losses coincide:

\[
\nabla_\theta \mathcal{L}_{\mathrm{M\text{-}TFM}}(\theta) = \nabla_\theta \mathcal{L}_{\mathrm{C\text{-}TFM}}(\theta). \tag{23}
\]

In particular, the minimizer of the conditional objective regresses $X_\theta(X_t, t, r)$ toward the marginal target $X^m_{\mathrm{tgt}}(X_t, t, r)$, which is defined to satisfy the Transition Flow Identity in Eq. (17). See the proof in the Appendix.

4.4 Training and inference procedure

Remark 2 (Standard Transition Flow Matching). We adopt the standard Flow Matching setting in Remark 1 by taking $Z = (X_0, X_1)$ and using the linear interpolant $X_t = (1-t) X_0 + t X_1$, $0 \le t \le 1$. For any $0 \le t \le r \le 1$, the conditional transition state is

\[
X^Z_{t \to r} = (1 - r) X_0 + r X_1,
\]

and the conditional velocity is constant: $v(X_t, t \mid Z) = X_1 - X_0$. Under this instantiation, the tractable Transition Flow Matching objective in Eq. (21) becomes
\[
\mathcal{L}_{\mathrm{TFM}}(\theta) = \mathbb{E}_{t,\, r,\, Z,\, X_t \sim p_{t|Z}(\cdot \mid Z)}\, D\Big( \mathrm{sg}\big[ X^Z_{t \to r} + (r - t)\, \tfrac{d}{dt} X_\theta(X_t, t, r) \big],\; X_\theta(X_t, t, r) \Big). \tag{24}
\]

The time-derivative term $\frac{d}{dt} X_\theta(X_t, t, r)$ extends Eq. (18) by incorporating the tractable conditional velocity and is given by (see the proof in the Appendix)

\[
\frac{d}{dt} X_\theta(X_t, t, r) = \partial_{x_t} X_\theta(X_t, t, r) \cdot v(X_t, t \mid Z) + \partial_t X_\theta(X_t, t, r). \tag{25}
\]

During inference, we recursively apply the model to transform the generation trajectory from the source distribution to the target distribution:

\[
\hat{x}_r = X_\theta(x_t, t, r). \tag{26}
\]

For clarity, we summarize the conceptual training procedure in Algorithm 1.

Algorithm 1 Transition Flow Matching (TFM): Training
    // jvp returns (output, JVP); tfm predicts X_θ(x_t, t, r)
    x_1 = sample_batch()
    x_0 = randn_like(x_1)
    (t, r) = sample_t_r()                           // 0 ≤ t ≤ r ≤ 1
    x_t = (1 - t) * x_0 + t * x_1
    v = x_1 - x_0                                   // conditional velocity
    x_t_to_r = (1 - r) * x_0 + r * x_1              // conditional transition state
    (X, dX_dt) = jvp(tfm, (x_t, t, r), (v, 1, 0))
    X_tgt = x_t_to_r + (r - t) * dX_dt
    loss = metric(X - stopgrad(X_tgt))
    return loss

4.5 Classifier-Free Guidance

Transition Flow Matching naturally supports classifier-free guidance (CFG) via

\[
X^{\mathrm{cfg}}_\theta(x_t, t, r \mid c) = \omega\, X(x_t, t, r \mid c) + (1 - \omega)\, X(x_t, t, r \mid \varnothing). \tag{27}
\]

This is a combination of the class-conditional transition flow $X(x_t, t, r \mid c)$ and the unconditional transition flow $X(x_t, t, r \mid \varnothing)$, allowing us to control the strength of class conditioning at inference time by tuning $\omega$. Here, the term "condition" refers specifically to class conditioning, which is distinct from the marginal/conditional distinction used in previous sections. Following standard CFG practice [16], we train a single model $X_\theta(\cdot)$ that supports both class-conditional generation $X_\theta(\cdot \mid c)$ and unconditional generation $X_\theta(\cdot \mid \varnothing)$. Concretely, for conditional training, the endpoint $X_1$ in Sec. 4.4 is sampled from the class-conditional target distribution $p_1(\cdot \mid c)$; for unconditional training, $X_1$ in Sec. 4.4 is sampled from the unconditional target distribution $p_1(\cdot)$.

[Fig. 1: 2D Generation Trajectory Visualization on Synthetic Data. Compared methods: ground truth, Flow Matching [26], Rectified Flow [29], Flow Map Matching [3], MeanFlow [10], and TFM, with the map-based methods shown at 1, 2, and 5 steps.]

5 Implementation Design

Loss Metrics. In Eq. (24), the Bregman divergence $D(\cdot,\cdot)$ is instantiated as the squared $\ell_2$ loss. Following [10], we further investigate alternative loss metrics. In general, we consider loss functions of the form $\mathcal{L} = \lVert \Delta \rVert_2^{2\gamma}$, where $\Delta$ denotes the regression error. It can be shown (see [13]) that minimizing $\lVert \Delta \rVert_2^{2\gamma}$ is equivalent to minimizing the squared $\ell_2$ loss $\lVert \Delta \rVert_2^2$ with adapted loss weights. In practice, we define the weight as $w = 1 / (\lVert \Delta \rVert_2^2 + c)^{p}$, where $p = 1 - \gamma$ and $c > 0$ is a small constant (e.g., $10^{-3}$) for numerical stability. The resulting adaptively weighted loss takes the form $\mathrm{sg}(w) \cdot \mathcal{L}$ with $\mathcal{L} = \lVert \Delta \rVert_2^2$. When $p = 0.5$, this formulation resembles the Pseudo-Huber loss [43]. We compare different choices of $p$ in the experiments.

Sampling Time Steps $(t, r)$. We sample the two time steps $(t, r)$ from a predefined logit-normal (lognorm) distribution [8, 10]. Specifically, we first draw a sample from a normal distribution $\mathcal{N}(\mu, \sigma)$ and map it to the interval $(0, 1)$ via the logistic function to obtain $t$. We then sample another logit-normal variable $d$ in the same manner and set $r = t + d(1 - t)$, which ensures that the constraint $0 \le t \le r \le 1$ is satisfied. Note that for any given $t$, the transition time $r$ is obtained by an affine transformation of a logit-normal random variable, ensuring that $r \in [t, 1]$ while preserving a logit-normal-shaped density over the valid interval. Different hyperparameter settings of the logit-normal distribution are evaluated in the experiments; a minimal sampler sketch is given below.
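The following sketch illustrates this sampling scheme. The function name, batching, and the default $(\mu, \sigma)$ values (set to the best-performing ablation configuration) are assumptions for illustration.

```python
import torch

def sample_t_r(batch_size: int, mu: float = -0.4, sigma: float = 1.0,
               device: str = "cpu") -> tuple[torch.Tensor, torch.Tensor]:
    """Sample time pairs (t, r) with 0 <= t <= r <= 1.

    t is logit-normal: t = sigmoid(n) with n ~ N(mu, sigma^2).
    r = t + d * (1 - t), where d is another logit-normal sample, so r lies in [t, 1].
    """
    t = torch.sigmoid(mu + sigma * torch.randn(batch_size, device=device))
    d = torch.sigmoid(mu + sigma * torch.randn(batch_size, device=device))
    r = t + d * (1.0 - t)
    return t, r
```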
Conditioning on $(t, r)$. We employ positional embeddings [10, 47] to encode the time variables, which are subsequently combined and used as conditioning inputs to the neural network. Although the vector field is parameterized as $X_\theta(x_t, t, r)$, it is not strictly necessary for the network to directly condition on $(t, r)$. For instance, the network can instead condition on $(t, \Delta t)$, where $\Delta t = r - t$. In this case, we define $X_\theta(\cdot, t, r) \triangleq \mathrm{net}(\cdot, t, r - t)$, with $\mathrm{net}$ denoting the neural network. The Jacobian-vector product (JVP) is always computed with respect to the function $X_\theta(\cdot, t, r)$. We empirically compare different conditioning strategies in the experiments.

6 Experiments

6.1 Synthetic Data and Visualization

Synthetic data experiments are conducted to visualize the results and intuitively demonstrate the effectiveness of the proposed method [36, 51]. We simulate and visualize the generation trajectories of different methods on a 2D alphabet "M" dataset, as shown in Figure 1. In this dataset, the source distribution (blue points) is circular, while the target distribution (red points) forms the shape of the letter "M". The visualization results are consistent with the discussion in the previous section: our goal is to enable generation with arbitrary step sizes and an arbitrary number of steps by modeling the generation trajectory. Notably, even one-step generation shows promising results.

6.2 Visual Generation

CIFAR-10. CIFAR-10 is a 32 × 32 resolution image dataset containing multiple classes and is a widely used benchmark in generative modeling [22]. For a fair evaluation, we adopt the same UNet architecture and training protocol as in prior work [13, 14], while replacing the conventional flow matching objective with the proposed Transition Flow Matching objective defined in Eq. (24). The UNet model $X_\theta$ follows a standard encoder-decoder design with residual blocks and skip connections. A self-attention block is inserted after the residual block at 16 × 16 resolution and at the bottleneck layer. The model takes the current state $x_t$ and the time variables $(t, r)$ as input, where $(t, r)$ are embedded and used to modulate adaptive group normalization layers through learnable scale and shift parameters.

To quantitatively evaluate generation performance, we compare our method with several state-of-the-art approaches by measuring generation quality using the Fréchet Inception Distance (FID) [15], computed under varying NFE ([1, 2, 5, 10]) and with the adaptive-step Dopri5 ODE solver [5], as presented in Table 1. The results in Table 1 show that our method achieves the best performance in one-step generation (NFE = 1). Moreover, the FID score consistently decreases as NFE increases, while both Consistency Models and Mean Velocity Models tend to exhibit degraded performance with higher NFE values.
| Family | Method | # Params. | NFE=1 | NFE=2 | NFE=5 | NFE=10 |
|---|---|---|---|---|---|---|
| Flow | Flow Matching [26] [ICLR'23] | 36.5M | - | 166.65 | 36.19 | 14.4 |
| Flow | VFM [14] [ICML'25] | 60.6M | - | 97.83 | 13.12 | 5.34 |
| Re-Flow | 1-Rectified Flow [29] [ICLR'23] | 36.5M | 378 | 6.18 | - | - |
| Re-Flow | 2-Rectified Flow [29] [ICLR'23] | 36.5M | 12.21 | 4.85 | - | - |
| Re-Flow | 3-Rectified Flow [29] [ICLR'23] | 36.5M | 8.15 | 5.21 | - | - |
| Mean Velocity / Consistency | CT [44] [ICML'23] | 61.8M | 8.71 | 5.32 | 11.412 | 23.948 |
| Mean Velocity / Consistency | iCT [43] [ICLR'24] | 55M | 2.83 | 2.46 | - | - |
| Mean Velocity / Consistency | ECT [13] [ICLR'25] | 55M | 3.60 | 2.11 | - | - |
| Mean Velocity / Consistency | sCT [30] [ICLR'25] | 55M | 2.85 | 2.06 | - | - |
| Mean Velocity / Consistency | IMM [53] [ICML'25] | 55M | 3.20 | 1.98 | - | - |
| Mean Velocity / Consistency | MeanFlow [10] [NeurIPS'25] | 55M | 2.92 | 2.23 | 2.84 | 2.27 |
| Mean Velocity / Consistency | S-VFM [36] [CVPR'26] | 60.6M | 2.81 | 2.16 | 2.02 | 1.97 |
| Mean Velocity / Consistency | TFM [Ours] | 55M | 2.77 | 2.08 | 1.96 | 1.91 |

Table 1: Quantitative Comparison with Different Generation Methods on the CIFAR-10 Dataset (FID under varying NFE). Our method achieves the best performance in one-step generation (NFE = 1). Moreover, the FID score consistently decreases as NFE increases.

[Fig. 2: Generation Results on ImageNet-256 under Varying NFE (panels for NFE = 1, 2, 5, 10). As the number of function evaluations increases from 1 to 10, the generated images exhibit progressively improved detail and fidelity. Notably, even single-step generation already produces reasonably good results.]

ImageNet. To evaluate robustness and scalability on large-scale data, we conduct experiments on the ImageNet dataset at image resolution 256 × 256 [23]. All experiments are performed on class-conditional ImageNet generation at this resolution. Following common practice, we evaluate the Fréchet Inception Distance (FID) [15] on 50K randomly generated images. Following prior works [9, 14, 36], we implement all models in the latent space of a pre-trained VAE tokenizer [39]. For 256 × 256 images, the tokenizer maps images into a latent representation of size 32 × 32 × 4, which serves as the input to the generative model. All models are trained from scratch under identical data and optimization settings.

As the backbone architecture, we adopt that of MeanFlow [10], a transformer-based model that has demonstrated strong performance in high-resolution image generation. For fair comparison, we strictly follow the original MeanFlow [10] training recipe and optimization settings, modifying only the learning objective.

| Method | # Params. | NFE | FID |
|---|---|---|---|
| iCT-XL/2 [43] [ICLR'24] | 675M | 1 | 34.24 |
| Shortcut-XL/2 [9] [ICLR'25] | 675M | 1 | 10.60 |
| MeanFlow-XL/2 [10] [NeurIPS'25] | 676M | 1 | 3.43 |
| S-VFM-XL/2 [36] [CVPR'26] | 677M | 1 | 3.31 |
| TFM-XL/2 [Ours] | 676M | 1 | 3.02 |
| iCT-XL/2 [43] [ICLR'24] | 675M | 2 | 20.30 |
| iMM-XL/2 [53] [ICML'25] | 675M | 1 × 2 | 7.77 |
| MeanFlow-XL/2 [10] [NeurIPS'25] | 676M | 2 | 2.93 |
| S-VFM-XL/2 [36] [CVPR'26] | 677M | 2 | 2.86 |
| TFM-XL/2 [Ours] | 676M | 2 | 2.77 |

Table 2: Quantitative Comparison with Different Generation Methods on the ImageNet 256 × 256 Dataset. Our method achieves the best performance in few-step generation.

Specifically, we introduce TFM, which is parameterized by $X_\theta(X_t, t, r)$, by replacing the original velocity-based loss with the proposed Transition Flow Matching loss in Eq. (24). The model is trained to directly predict the transition flow $X(x_t, t, r)$ rather than the local velocity field. Both time variables $t$ and $r$ are embedded and injected into the transformer blocks via adaptive normalization layers. During training, we sample $(X_0, X_1)$ pairs from the data distribution and construct linear interpolants following the standard Flow Matching setting.
Given a randomly sampled time pair $(t, r)$ with $0 \le t \le r \le 1$, the network is trained to regress the transition flow using the tractable Transition Flow Matching objective described in Eq. (24), which enforces consistency with the Transition Flow Identity. At inference time, sample generation is performed by repeatedly applying the learned transition flow across a predefined time grid, starting from Gaussian noise. This formulation allows the model to directly predict future states along the transition trajectory, enabling flexible and efficient generation.

Figure 2 illustrates generation results under different numbers of function evaluations (NFE), using distinct initial noise. Within each row, images at the same spatial location across different panels are generated from the same initial noise realization. Each panel corresponds to a different NFE setting, with NFE values of [1, 2, 5, 10]. As the number of function evaluations increases, the generated images exhibit improved visual fidelity and structural coherence. Notably, even one-step generation (NFE = 1) produces competitive samples, supporting the claim that the transition flow model $X_\theta(X_t, t, r)$ learns to predict the future state at $r$ from the current state $X_t$ at time $t$, thereby enabling effective few-step sampling.

Following standard evaluation protocols, we randomly generate 50K images from each model and report the corresponding FID scores in Table 2. TFM-XL consistently outperforms both Consistency Models and Mean Velocity Models under identical training settings. These results demonstrate that explicitly modeling global transition dynamics through a transition flow, which can predict arbitrary future states, rather than learning an inherently local velocity field, provides greater flexibility in the number of simulation steps for flow- and diffusion-based generation methods.

We further analyze the training dynamics by comparing TFM with SiT and MeanFlow across different training iterations, as shown in Tab. 3e. For SiT, we follow the default inference setting with NFE = 250, while for both TFM and MeanFlow we use NFE = 1 to reflect their one-step generation capability. The results show that TFM achieves clear performance improvements once sufficiently trained, and the training curves indicate that performance consistently improves as the number of training epochs increases. This behavior confirms the effectiveness and scalability of the proposed transition flow formulation in both training and inference.

6.3 Ablation Study

In our ablation study, we use the ViT-B/4 architecture [6] ("Base" size with a patch size of 4) as developed in [37], trained for 80 epochs (400K iterations).

Conditioning on $(t, r)$. Following the model parameterization in Sec. 5, the transition flow $X_\theta(x_t, t, r)$ requires explicit conditioning on the temporal variables. Similar to prior designs, we encode time information through positional embeddings and study different conditioning strategies that share the same functional form but differ in the specific choice of variables. Concretely, instead of directly conditioning on $(t, r)$, the network can equivalently condition on $(t, \Delta t)$ with $\Delta t = r - t$, leading to the parameterization $X_\theta(\cdot, t, r) \triangleq \mathrm{net}(\cdot, t, r - t)$. We compare these variants in Tab. 3a. The results indicate that all studied conditioning forms lead to stable and effective one-step generation, demonstrating that our transition flow formulation is robust to the exact choice of temporal embedding. Conditioning on $(t, \Delta t)$ yields the strongest performance overall, while directly using $(t, r)$ performs comparably. Notably, even conditioning solely on the interval $\Delta t$ produces competitive results, suggesting that relative temporal information plays a dominant role in our method.
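A minimal sketch of this parameterization is shown below, assuming a sinusoidal time embedding and a backbone `net` that consumes the state together with the two time embeddings; the interface of `net` and the embedding dimension are illustrative assumptions, not the exact architecture used here.

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Sinusoidal positional embedding of a scalar time in [0, 1]; output shape [B, dim]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class TransitionFlowWrapper(nn.Module):
    """Expose X_theta(x, t, r) while the backbone is conditioned on (t, dt) with dt = r - t.

    The JVP in Eq. (18)/(25) is still taken with respect to the wrapped (x, t, r) signature.
    """
    def __init__(self, net: nn.Module):
        super().__init__()
        self.net = net  # assumed interface: net(x, t_emb, dt_emb) -> prediction like x

    def forward(self, x: torch.Tensor, t: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        t_emb = timestep_embedding(t)
        dt_emb = timestep_embedding(r - t)
        return self.net(x, t_emb, dt_emb)
```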
Sampling Time Steps $(t, r)$. The choice of the sampling distribution for time steps is known to have a significant impact on generation quality. In our framework, we sample $(t, r)$ using logit-normal distributions, consistent with the implementation described in Sec. 5. Specifically, we consider two logit-normal distributions: one for sampling the base time $t$ and another for sampling the relative offset that determines $r$. This design ensures $0 \le t \le r \le 1$ by construction while allowing flexible control over the density of sampled time pairs. We evaluate different hyperparameter settings of these logit-normal samplers in Tab. 3d. The results show that logit-normal sampling consistently outperforms alternative choices, aligning with observations reported in prior flow-matching-based methods.

Loss Metrics. As discussed in Sec. 5, our training objective adopts a Bregman divergence instantiated via an adaptively weighted squared $\ell_2$ loss. While the overall loss formulation remains the same across experiments, we vary the exponent $p$ that controls the adaptive weighting, which effectively changes the loss metric. The corresponding results are summarized in Tab. 3b. We find that $p = 1$ achieves the best overall performance, indicating a strong benefit from aggressive adaptive weighting. Setting $p = 0.5$, which resembles the Pseudo-Huber loss, also yields competitive results. In contrast, the standard squared $\ell_2$ loss ($p = 0$) underperforms relative to other choices, though it still leads to meaningful one-step generation. These trends are consistent with prior findings on the importance of loss reweighting for few-step or single-step generative models.

(a) Positional embedding. The network is conditioned on embeddings of the specified variables.

| pos. embed | FID, 1-NFE |
|---|---|
| (t, r) | 60.12 |
| (t, t − r) | 59.87 |
| (t, r, t − r) | 62.48 |
| t − r only | 62.16 |

(b) Loss metrics. p = 0 is the squared L2 loss; p = 0.5 is the Pseudo-Huber loss.

| p | FID, 1-NFE |
|---|---|
| 0.0 | 79.76 |
| 0.5 | 62.47 |
| 1.0 | 59.87 |
| 1.5 | 65.68 |
| 2.0 | 69.26 |

(c) CFG scale. Our method supports 1-NFE CFG sampling.

| ω | FID, 1-NFE |
|---|---|
| 1.0 (w/o cfg) | 61.06 |
| 1.5 | 32.51 |
| 2.0 | 19.05 |
| 3.0 | 14.76 |
| 5.0 | 20.12 |

(d) Time samplers. t and r are sampled from the specified sampler.

| t, r sampler | FID, 1-NFE |
|---|---|
| uniform(0, 1) | 65.73 |
| lognorm(−0.2, 1.0) | 63.56 |
| lognorm(−0.2, 1.2) | 62.27 |
| lognorm(−0.4, 1.0) | 59.87 |
| lognorm(−0.4, 1.2) | 59.94 |

(e) Comparison of FID-50K score over training iterations on the ImageNet 256 × 256 dataset (plot).

Table 3: Ablation study on 1-NFE ImageNet 256 × 256 generation. FID-50K is evaluated. Default configuration: B/4 backbone, 80-epoch training from scratch.

Guidance Scale. We further investigate the effect of classifier-free guidance (CFG) within our transition flow framework. The results, reported in Tab. 3c, show that increasing the guidance scale significantly improves generation quality, with the best result at $\omega = 3.0$ in our setting. This behavior is consistent with observations in multi-step diffusion and flow models. Importantly, our CFG formulation, introduced in Sec. 4.5, is fully compatible with one-step (1-NFE) sampling and does not introduce additional inference cost beyond a constant factor; a one-step CFG sampling sketch is given below.
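The sketch below illustrates one-step guided sampling via Eq. (27). The class-conditional model signature, the use of a null label index for the unconditional branch, and the function name are assumptions for illustration.

```python
import torch

@torch.no_grad()
def sample_one_step_cfg(tfm, class_ids: torch.Tensor, null_id: int,
                        omega: float, x_shape: tuple) -> torch.Tensor:
    """One-step classifier-free guided sampling (Eq. 27).

    tfm: class-conditional transition flow with assumed signature
         tfm(x, t, r, class_ids) -> predicted state at time r.
    null_id: label index standing in for the unconditional branch (assumption).
    """
    x0 = torch.randn(x_shape)                          # X_0 ~ N(0, I)
    t = torch.zeros(x_shape[0])                        # t = 0
    r = torch.ones(x_shape[0])                         # r = 1
    x_cond = tfm(x0, t, r, class_ids)                  # X(x_0, 0, 1 | c)
    null_ids = torch.full_like(class_ids, null_id)
    x_unc = tfm(x0, t, r, null_ids)                    # X(x_0, 0, 1 | ∅)
    return omega * x_cond + (1.0 - omega) * x_unc      # guided combination
```

The two branches can also be batched into a single forward pass, so guidance only adds a constant factor to the cost of one-step generation.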
7 Conclusion

In this work, we introduced Transition Flow Matching, a principled framework for few-step generative modeling. Instead of learning local velocity fields as in conventional Flow/Diffusion models, our method directly models the generation trajectory through transition dynamics, providing a global perspective on generative flows. We derive the Transition Flow Identity and develop a theoretically grounded objective that enables end-to-end training from scratch, while also establishing a unified view that connects our framework with Mean Velocity methods.

8 Proof of Transition Identity

8.1 Notation and Preliminary

Random variables and realizations. We work in $\mathbb{R}^d$. Uppercase letters (e.g., $X_t$, $Z$) denote random variables (RVs), and lowercase letters (e.g., $x_t$, $z$) denote their realizations (points/values). For a density (or probability law) of an RV $X_t$ we write $p_t(\cdot)$, and for a conditional density we write $p_{t|Z}(\cdot \mid z)$. Expectations are denoted by $\mathbb{E}[\cdot]$.

Source/target distributions and coupling. Let $X_0 \sim p_0$ be the source distribution (e.g., standard Gaussian noise) and $X_1 \sim p_1$ be the target distribution (e.g., images). A generative model constructs a continuous path of distributions $\{p_t\}_{t\in[0,1]}$ that transports $p_0$ to $p_1$. Let $(X_0, X_1)$ be any coupling on $\mathbb{R}^d$ with joint density $\pi$ whose marginals are $p_0$ and $p_1$ (not necessarily independent). In this work, we follow the standard Flow Matching setting, where the source and target distributions are independent: $\pi(x_0, x_1) = p_0(x_0)\, p_1(x_1)$.

Conditioning variable and conditional paths [26]. We use a conditioning RV $Z$ to index conditional paths. Conditioned on $Z = z$, we obtain a conditional coupling $(X_0^Z, X_1^Z)$ and a conditional path $\{X_t^Z\}_{t\in[0,1]}$ with conditional density $p_{t|Z}(\cdot \mid z)$. The marginal path is $\{X_t\}_{t\in[0,1]}$, with density $p_t(\cdot)$, satisfying

\[
p_t(x_t) = \int p_{t|Z}(x_t \mid z)\, p_Z(z)\, dz, \qquad X_t \sim p_t, \quad X_t \mid (Z = z) \sim p_{t|Z}(\cdot \mid z). \tag{28}
\]

Interpolant. We consider a general interpolant between the marginal endpoints $X_0$ and $X_1$, specified by scalar functions $\alpha : [0,1] \to \mathbb{R}$ and $\beta : [0,1] \to \mathbb{R}$:

\[
X_t = \alpha(t) X_0 + \beta(t) X_1, \qquad t \in [0,1]. \tag{29}
\]

We assume the boundary conditions $\alpha(0) = 1$, $\beta(0) = 0$ and $\alpha(1) = 0$, $\beta(1) = 1$, so that $X_{t=0} = X_0$ and $X_{t=1} = X_1$. If $\alpha, \beta$ are differentiable, then

\[
\frac{d}{dt} X_t = \dot{\alpha}(t) X_0 + \dot{\beta}(t) X_1. \tag{30}
\]

Conditioned on $Z = z$, the same schedules induce the conditional path $X_t^Z = \alpha(t) X_0^Z + \beta(t) X_1^Z$ and $\frac{d}{dt} X_t^Z = \dot{\alpha}(t) X_0^Z + \dot{\beta}(t) X_1^Z$.

Flow Matching vector fields [26]. Let $v(x_t, t \mid z) \in \mathbb{R}^d$ denote a conditional velocity field that transports the conditional density $p_{t|Z}(\cdot \mid z)$ along time. The corresponding marginal velocity field is defined by conditional expectation:

\[
v(x_t, t) = \int v(x_t, t \mid z)\, p_{Z|t}(z \mid x_t)\, dz = \mathbb{E}\big[ v(X_t, t \mid Z) \mid X_t = x_t \big]. \tag{31}
\]
A marginal trajectory, i.e., a Flow Matching generation trajectory, follows the ODE

\[
\frac{dx_t}{dt} = v(x_t, t), \qquad t \in [0,1], \tag{32}
\]

and similarly a conditional trajectory follows $\frac{dx_t^z}{dt} = v(x_t, t \mid z)$.

Continuity (transport) equations [26, 27]. The evolution of densities induced by these velocity fields is characterized by the continuity (transport) equation. For the marginal density $p_t$,

\[
\partial_t p_t(x_t) + \nabla \cdot \big( p_t(x_t)\, v(x_t, t) \big) = 0, \qquad t \in [0,1]. \tag{33}
\]

Conditioned on $Z = z$, the conditional density $p_{t|Z}(\cdot \mid z)$ satisfies

\[
\partial_t p_{t|Z}(x_t \mid z) + \nabla \cdot \big( p_{t|Z}(x_t \mid z)\, v(x_t, t \mid z) \big) = 0, \qquad t \in [0,1]. \tag{34}
\]

Flow Matching learns a parameterized velocity field (or equivalent dynamics) so that the induced marginal path $\{p_t\}$ solves Eq. (33) with boundary conditions $p_{t=0} = p_0$ and $p_{t=1} = p_1$.

Learning Flow Matching. Flow Matching introduces a velocity field model $v_\theta(X_t, t)$ to learn $v(X_t, t)$, ideally by minimizing the marginal Flow Matching loss:

\[
\mathcal{L}_{\mathrm{MFM}}(\theta) = \mathbb{E}_{t,\, X_t \sim p_t}\, D\big( v(X_t, t),\; v_\theta(X_t, t) \big). \tag{35}
\]

However, since the marginal velocity $v(X_t, t)$ in Eq. (31) is not tractable, the marginal loss in Eq. (35) cannot be computed as is. Instead, we minimize the conditional Flow Matching loss:

\[
\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\, Z,\, X_t \sim p_{t|Z}(\cdot \mid Z)}\, D\big( v(X_t, t \mid Z),\; v_\theta(X_t, t) \big). \tag{36}
\]

The two losses in Eq. (35) and Eq. (36) are equivalent for learning purposes, since their gradients coincide:

Theorem 3 (Gradient equivalence of Flow Matching [27]). The gradients of the marginal Flow Matching loss and the conditional Flow Matching loss coincide:

\[
\nabla_\theta \mathcal{L}_{\mathrm{MFM}}(\theta) = \nabla_\theta \mathcal{L}_{\mathrm{CFM}}(\theta). \tag{37}
\]

In particular, the minimizer of the conditional Flow Matching loss is the marginal velocity $v(x_t, t)$.

Remark 3 (Standard Flow Matching). Flow Matching sets the conditioning variable to be the endpoint pair $Z = (X_0, X_1)$, so that conditioning on $Z = z$ fixes the endpoint pair $(X_0^Z, X_1^Z) = (X_0, X_1)$. Choosing the linear schedules $\alpha(t) = 1 - t$ and $\beta(t) = t$ yields the conditional path

\[
X_t^Z = (1-t) X_0^Z + t X_1^Z = (1-t) X_0 + t X_1, \qquad 0 \le t \le 1,
\]

hence the time derivative of this conditional path is constant: $\frac{d}{dt} X_t^Z = X_1^Z - X_0^Z = X_1 - X_0$. Therefore, for $Z = (X_0, X_1)$, the conditional velocity used as supervision in the conditional Flow Matching loss is a constant: $v(X_t, t \mid Z) = X_1^Z - X_0^Z = X_1 - X_0$. With this setup, the Flow Matching objective in Eq. (36) can be written explicitly as

\[
\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\, X_0 \sim p_0,\, X_1 \sim p_1}\, D\big( X_1 - X_0,\; v_\theta(X_t, t) \big), \qquad X_t = (1-t) X_0 + t X_1,
\]

i.e., one samples $t \sim \mathrm{Unif}[0,1]$, draws independent endpoints $X_0 \sim p_0$ and $X_1 \sim p_1$, forms the interpolated state $X_t$, and regresses the model $v_\theta(X_t, t)$ to the constant target $X_1 - X_0$. By Theorem 3, minimizing this conditional loss yields the marginal velocity field $v(x_t, t)$ in Eq. (31).

8.2 Transition Flow Identity

To connect with previous works [10, 11], we define the average velocity $u(x_t, t, r)$ as

\[
(r - t)\, u(x_t, t, r) = x_{t \to r} - x_t = \int_t^r v(x_\tau, \tau)\, d\tau, \qquad 0 \le t \le r \le 1, \tag{38}
\]

where $v(x_\tau, \tau)$ is the marginal Flow Matching velocity at time $\tau \in [0,1]$ for state $x_\tau$, as shown in Eq. (31). In Eq. (38), $x_t$ is the current state at time $t$ and $x_{t \to r}$ is the transition state at time $r$, reached by evolving from $x_t$ at time $t$:
\[
x_{t \to r} = x_t + \int_t^r v(x_\tau, \tau)\, d\tau. \tag{39}
\]

Note here, akin to the marginal velocity of Flow Matching $v(x_t, t)$ in Eq. (31), $x_{t \to r}$ is the marginal transition state:

\[
x_{t \to r} = \int x^z_{t \to r}\, p_{Z|t}(z \mid x_t)\, dz = \mathbb{E}\big[ X^Z_{t \to r} \mid X_t = x_t \big], \tag{40}
\]

where $x^z_{t \to r}$ is the conditional transition state:

\[
x^z_{t \to r} = x_t + \int_t^r v(x_\tau, \tau \mid z)\, d\tau. \tag{41}
\]

The average velocity $u(x_t, t, r)$ is the displacement between the two time steps $t$ and $r$ divided by the time interval $r - t$:

\[
u(x_t, t, r) \triangleq \frac{1}{r - t} \int_t^r v(x_\tau, \tau)\, d\tau. \tag{42}
\]

Differentiating both sides of Eq. (38) with respect to $t$, treating $r$ as independent of $t$, we have

\[
\frac{d}{dt}\big[ (r - t)\, u(x_t, t, r) \big] = \frac{d}{dt} \int_t^r v(x_\tau, \tau)\, d\tau
\;\Longrightarrow\;
u(x_t, t, r) = v(x_t, t) + (r - t)\, \frac{d}{dt} u(x_t, t, r), \tag{43}
\]

where the left-hand side equals $-u + (r-t)\frac{du}{dt}$ and the right-hand side equals $-v(x_t, t)$ by the fundamental theorem of calculus.

Our learning target is the Transition Flow $X(x_t, t, r)$ that transports the state $x_t$ at time $t$ to $x_{t \to r}$ at time $r$, i.e., $X(x_t, t, r) = x_{t \to r}$, which yields

\[
X(x_t, t, r) = \int X(x_t, t, r \mid z)\, p_{Z|t}(z \mid x_t)\, dz = \mathbb{E}\big[ X(X_t, t, r \mid Z) \mid X_t = x_t \big], \tag{44}
\]

where $X(x_t, t, r \mid z) = x^z_{t \to r}$ is the conditional transition state. Note that Eq. (44) is identical to Eq. (40). From Eq. (38), we build the association between the average velocity and the Transition Flow (transition state):

\[
u(x_t, t, r) = \frac{x_{t \to r} - x_t}{r - t} = \frac{X(x_t, t, r) - x_t}{r - t}. \tag{45}
\]

Substituting Eq. (45) into Eq. (43), we obtain

\[
\frac{X(x_t, t, r) - x_t}{r - t} = v(x_t, t) + (r - t)\, \frac{d}{dt}\left[ \frac{X(x_t, t, r) - x_t}{r - t} \right], \tag{46}
\]

where $r$ is independent of $t$, and the time derivative of $x_t$ is the marginal velocity $v(x_t, t)$, given by Eq. (32). Define

\[
A(t) := \frac{X(x_t, t, r) - x_t}{r - t}.
\]

Using the quotient rule and the fact that $r$ is constant, we obtain

\[
\frac{dA}{dt} = \frac{(r - t)\, \frac{d}{dt}\big( X(x_t, t, r) - x_t \big) + \big( X(x_t, t, r) - x_t \big)}{(r - t)^2}.
\]

Since $\frac{d}{dt}\big( X(x_t, t, r) - x_t \big) = \frac{d X(x_t, t, r)}{dt} - v(x_t, t)$, we have

\[
\frac{dA}{dt} = \frac{(r - t)\big( \frac{d X(x_t, t, r)}{dt} - v(x_t, t) \big) + \big( X(x_t, t, r) - x_t \big)}{(r - t)^2}.
\]

The right-hand side of Eq. (46) becomes

\[
v(x_t, t) + (r - t)\, \frac{dA}{dt}
= v(x_t, t) + \frac{(r - t)\big( \frac{d X(x_t, t, r)}{dt} - v(x_t, t) \big) + \big( X(x_t, t, r) - x_t \big)}{r - t}
= \frac{d X(x_t, t, r)}{dt} + \frac{X(x_t, t, r) - x_t}{r - t}. \tag{47}
\]

Comparing both sides of Eq. (46) yields

\[
\frac{X(x_t, t, r) - x_t}{r - t} = \frac{d X(x_t, t, r)}{dt} + \frac{X(x_t, t, r) - x_t}{r - t}
\;\Longrightarrow\;
X(x_t, t, r) = X(x_t, t, r) + (r - t)\, \frac{d X(x_t, t, r)}{dt}. \tag{48}
\]

Given $X(x_t, t, r) = x_{t \to r}$, the Transition Flow Identity follows:

\[
X(x_t, t, r) = x_{t \to r} + (r - t)\, \frac{d}{dt} X(x_t, t, r). \tag{49}
\]

8.3 Calculate Time Derivative of Transition Flow

To compute the $\frac{d}{dt} X(x_t, t, r)$ term, we expand it in terms of partial derivatives:

\[
\frac{d}{dt} X(x_t, t, r)
= \partial_{x_t} X \cdot \frac{dx_t}{dt} + \partial_t X \cdot \frac{dt}{dt} + \partial_r X \cdot \frac{dr}{dt}
= \partial_{x_t} X \cdot v(x_t, t) + \partial_t X \cdot 1 + \partial_r X \cdot 0
= \partial_{x_t} X \cdot v(x_t, t) + \partial_t X, \tag{50}
\]

where $\frac{dx_t}{dt} = v(x_t, t)$ as shown in Eq. (32). The time derivative in Eq. (50) is given by the Jacobian-vector product (JVP) between the Jacobians of the function, $[\partial_{x_t} X, \partial_t X, \partial_r X]$, and the corresponding tangent vector $[v, 1, 0]$. For code implementation, modern libraries such as PyTorch provide efficient JVP interfaces.
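As a concrete illustration of this computation, the following is a minimal sketch using `torch.func.jvp`. The transition flow model `X_theta` and its `(x_t, t, r)` signature are assumptions for the example.

```python
import torch
from torch.func import jvp

def transition_flow_time_derivative(X_theta, x_t: torch.Tensor,
                                    t: torch.Tensor, r: torch.Tensor,
                                    v: torch.Tensor):
    """Return (X_theta(x_t, t, r), d/dt X_theta(x_t, t, r)) following Eq. (50).

    The tangent vector is (v, 1, 0): dx_t/dt = v, dt/dt = 1, dr/dt = 0,
    so the JVP equals the total time derivative along the trajectory.
    """
    tangents = (v, torch.ones_like(t), torch.zeros_like(r))
    X, dX_dt = jvp(X_theta, (x_t, t, r), tangents)
    return X, dX_dt
```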
8.4 Towards a Tractable Transition Flow Matching Objective

Up to this point, the formulations are independent of any network parameterization. We now introduce the Transition Flow model $X_\theta(x_t, t, r)$, parameterized by $\theta$, to learn $X(x_t, t, r)$. Formally, we encourage $X_\theta(x_t, t, r)$ to satisfy the Transition Flow Identity in Eq. (49). To this end, ideally, we can minimize the marginal Transition Flow Matching objective (M-TFM):

\[
\mathcal{L}_{\mathrm{M\text{-}TFM}}(\theta) = \mathbb{E}_{t,\, r,\, X_t \sim p_t}\, D\big( \mathrm{sg}\big[ X^m_{\mathrm{tgt}}(X_t, t, r) \big],\; X_\theta(X_t, t, r) \big), \tag{51}
\]

where

\[
X^m_{\mathrm{tgt}}(X_t, t, r) = X_{t \to r} + (r - t)\, \frac{d}{dt} X_\theta(X_t, t, r). \tag{52}
\]

Here $D(\cdot,\cdot)$ represents a Bregman divergence (e.g., MSE) that regresses our learnable Transition Flow $X_\theta(x_t, t, r)$ onto the target $X^m_{\mathrm{tgt}}(x_t, t, r)$; moreover, $\mathrm{sg}[\cdot]$ denotes the stop-gradient (sg) operation, indicating that $X^m_{\mathrm{tgt}}(x_t, t, r)$ serves as the ground-truth target in this loss and is excluded from optimization. However, the marginal state $x_{t \to r}$ in Eq. (52) and the marginal velocity $v(x_t, t)$ used in the calculation of $\frac{d}{dt} X_\theta(x_t, t, r)$ in Eq. (50) are not tractable, so the marginal loss in Eq. (51) cannot be computed as is.

Instead, we minimize the conditional Transition Flow Matching loss (C-TFM), which is tractable:

\[
\mathcal{L}_{\mathrm{C\text{-}TFM}}(\theta) = \mathbb{E}_{t,\, r,\, Z,\, X_t \sim p_{t|Z}(\cdot \mid Z)}\, D\big( \mathrm{sg}\big[ X^c_{\mathrm{tgt}}(X_t, t, r \mid Z) \big],\; X_\theta(X_t, t, r) \big), \tag{53}
\]

where

\[
X^c_{\mathrm{tgt}}(X_t, t, r \mid Z) = X^Z_{t \to r} + (r - t)\, \frac{d}{dt} X_\theta(X_t, t, r). \tag{54}
\]

The conditional state $x^z_{t \to r}$ in Eq. (54) and the conditional velocity $v(x_t, t \mid z)$ used in the calculation of $\frac{d}{dt} X_\theta(x_t, t, r)$ in Eq. (50) are tractable (see Eq. (57) for details), giving us a computable loss function in Eq. (53). The two losses in Eq. (51) and Eq. (53) are equivalent for learning purposes, since their gradients coincide:

Theorem 4 (Gradient equivalence of Transition Flow Matching). The gradients of the marginal Transition Flow Matching loss and the conditional Transition Flow Matching loss coincide:

\[
\nabla_\theta \mathcal{L}_{\mathrm{M\text{-}TFM}}(\theta) = \nabla_\theta \mathcal{L}_{\mathrm{C\text{-}TFM}}(\theta). \tag{55}
\]

In particular, the minimizer of the conditional Transition Flow Matching loss is the marginal target $X^m_{\mathrm{tgt}}(x_t, t, r)$, which satisfies the Transition Flow Identity in Eq. (49).

Proof (Proof of Theorem 4). We show Eq. (55) by a direct computation:

\[
\begin{aligned}
\nabla_\theta \mathcal{L}_{\mathrm{M\text{-}TFM}}(\theta)
&= \nabla_\theta\, \mathbb{E}_{t, r, X_t \sim p_t}\, D\big( \mathrm{sg}\big[ X^m_{\mathrm{tgt}}(X_t, t, r) \big],\; X_\theta(X_t, t, r) \big) \\
&\overset{(a)}{=} \mathbb{E}_{t, r, X_t \sim p_t}\, \nabla_\theta D\big( \mathrm{sg}\big[ X^m_{\mathrm{tgt}}(X_t, t, r) \big],\; X_\theta(X_t, t, r) \big) \\
&\overset{(b)}{=} \mathbb{E}_{t, r, X_t \sim p_t} \Big[ \nabla_2 D\big( X^m_{\mathrm{tgt}}(X_t, t, r),\; X_\theta(X_t, t, r) \big)\, \nabla_\theta X_\theta(X_t, t, r) \Big] \\
&\overset{(c)}{=} \mathbb{E}_{t, r, X_t \sim p_t} \Big[ \nabla_2 D\big( \mathbb{E}_{Z \sim p_{Z|t}(\cdot \mid X_t)}\big[ X^c_{\mathrm{tgt}}(X_t, t, r \mid Z) \big],\; X_\theta(X_t, t, r) \big)\, \nabla_\theta X_\theta(X_t, t, r) \Big] \\
&\overset{(d)}{=} \mathbb{E}_{t, r, X_t \sim p_t}\, \mathbb{E}_{Z \sim p_{Z|t}(\cdot \mid X_t)} \Big[ \nabla_2 D\big( X^c_{\mathrm{tgt}}(X_t, t, r \mid Z),\; X_\theta(X_t, t, r) \big)\, \nabla_\theta X_\theta(X_t, t, r) \Big] \\
&\overset{(e)}{=} \mathbb{E}_{t, r, X_t \sim p_t}\, \mathbb{E}_{Z \sim p_{Z|t}(\cdot \mid X_t)} \Big[ \nabla_\theta D\big( \mathrm{sg}\big[ X^c_{\mathrm{tgt}}(X_t, t, r \mid Z) \big],\; X_\theta(X_t, t, r) \big) \Big] \\
&\overset{(f)}{=} \nabla_\theta\, \mathbb{E}_{t, r, Z, X_t \sim p_{t|Z}(\cdot \mid Z)}\, D\big( \mathrm{sg}\big[ X^c_{\mathrm{tgt}}(X_t, t, r \mid Z) \big],\; X_\theta(X_t, t, r) \big)
= \nabla_\theta \mathcal{L}_{\mathrm{C\text{-}TFM}}(\theta).
\end{aligned} \tag{56}
\]

Explanations of labeled steps.

(a) Interchange $\nabla_\theta$ and $\mathbb{E}$ by the Leibniz rule.

(b) The stop-gradient makes the first argument $\mathrm{sg}[X^m_{\mathrm{tgt}}(X_t, t, r)]$ $\theta$-independent for differentiation, hence the gradient flows only through the second argument. Applying the chain rule yields $\nabla_\theta D(\mathrm{sg}[\cdot], X_\theta) = \nabla_2 D(\cdot, X_\theta)\, \nabla_\theta X_\theta$.
(c) Use the definitions in Eq. (52) and Eq. (54) together with the marginal-conditional relations for transition states, Eq. (44), and velocities, Eq. (31):

\[
\begin{aligned}
X^m_{\mathrm{tgt}}(X_t, t, r)
&= X_{t \to r} + (r - t)\, \frac{d}{dt} X_\theta(X_t, t, r) \\
&= X(X_t, t, r) + (r - t)\, \frac{d}{dt} X_\theta(X_t, t, r) \\
&= X(X_t, t, r) + (r - t)\big( \partial_{X_t} X_\theta(X_t, t, r) \cdot v(X_t, t) + \partial_t X_\theta(X_t, t, r) \big) \\
&= \mathbb{E}_{Z \sim p_{Z|t}(\cdot \mid X_t)}\Big[ X(X_t, t, r \mid Z) + (r - t)\big( \partial_{X_t} X_\theta(X_t, t, r) \cdot v(X_t, t \mid Z) + \partial_t X_\theta(X_t, t, r) \big) \Big] \\
&= \mathbb{E}_{Z \sim p_{Z|t}(\cdot \mid X_t)}\Big[ X(X_t, t, r \mid Z) + (r - t)\, \frac{d}{dt} X_\theta(X_t, t, r) \Big] \\
&= \mathbb{E}_{Z \sim p_{Z|t}(\cdot \mid X_t)}\Big[ X^Z_{t \to r} + (r - t)\, \frac{d}{dt} X_\theta(X_t, t, r) \Big]
= \mathbb{E}_{Z \sim p_{Z|t}(\cdot \mid X_t)}\big[ X^c_{\mathrm{tgt}}(X_t, t, r \mid Z) \big],
\end{aligned} \tag{57}
\]

where $\partial_{X_t} X_\theta(X_t, t, r)$ and $\partial_t X_\theta(X_t, t, r)$ are $Z$-independent given $(X_t, t, r)$.

(d) Since $D(\cdot,\cdot)$ is a Bregman divergence, its gradient with respect to the second argument, $\nabla_2 D(a, b)$, is affine in the first argument $a$ for fixed $b$. Conditioning on $X_t$, this implies that the (conditional) expectation can be moved inside: $\nabla_2 D\big( \mathbb{E}[A \mid X_t], b \big) = \mathbb{E}\big[ \nabla_2 D(A, b) \mid X_t \big]$. Apply this with $A = X^c_{\mathrm{tgt}}(X_t, t, r \mid Z)$ and $b = X_\theta(X_t, t, r)$.

(e) Reverse the chain rule as in (b): because $\mathrm{sg}[X^c_{\mathrm{tgt}}]$ freezes the first argument, we have $\nabla_2 D(\cdot, X_\theta)\, \nabla_\theta X_\theta = \nabla_\theta D(\mathrm{sg}[\cdot], X_\theta)$.

(f) Use Bayes' rule to swap the sampling orders, $\mathbb{E}_{X_t \sim p_t}\, \mathbb{E}_{Z \sim p_{Z|t}(\cdot \mid X_t)}[\cdot] = \mathbb{E}_{Z}\, \mathbb{E}_{X_t \sim p_{t|Z}(\cdot \mid Z)}[\cdot]$, and interchange $\nabla_\theta$ with $\mathbb{E}$ (as in (a)) to recognize $\nabla_\theta \mathcal{L}_{\mathrm{C\text{-}TFM}}(\theta)$ in Eq. (53).

Therefore, $\nabla_\theta \mathcal{L}_{\mathrm{M\text{-}TFM}}(\theta) = \nabla_\theta \mathcal{L}_{\mathrm{C\text{-}TFM}}(\theta)$, which proves Eq. (55). ⊓⊔

Remark 4 (Minimizer). Since the two objectives Eq. (51) and Eq. (53) have identical gradients, they share the same stationary points. Moreover, because

\[
X^m_{\mathrm{tgt}}(X_t, t, r) = \mathbb{E}_{Z \sim p_{Z|t}(\cdot \mid X_t)}\big[ X^c_{\mathrm{tgt}}(X_t, t, r \mid Z) \big],
\]

the conditional objective Eq. (53) regresses $X_\theta(X_t, t, r)$ toward the marginal target $X^m_{\mathrm{tgt}}(X_t, t, r)$, and $X^m_{\mathrm{tgt}}$ is defined to satisfy the Transition Flow Identity in Eq. (49).

8.5 Training the Transition Flow Matching Model

Building on the previous development, we now have all the ingredients required to train a Transition Flow Matching model. We inherit the setting in Remark 3 by taking $Z = (X_0, X_1)$ and using the linear interpolant $X_t = (1-t) X_0 + t X_1$, which yields a concrete, standard form of Transition Flow Matching.

Remark 5 (Standard Transition Flow Matching). We adopt the standard Flow Matching setting in Remark 3 by setting $Z = (X_0, X_1)$, and use the linear interpolant $X_t = (1-t) X_0 + t X_1$, $0 \le t \le 1$. For any $0 \le t \le r \le 1$, the conditional transition state is

\[
X^Z_{t \to r} = (1 - r) X_0 + r X_1,
\]

and the conditional velocity is constant: $v(X_t, t \mid Z) = X_1 - X_0$. The tractable Transition Flow Matching objective in Eq. (53) can be written explicitly as

\[
\mathcal{L}_{\mathrm{TFM}}(\theta) = \mathbb{E}_{t,\, r,\, Z,\, X_t \sim p_{t|Z}(\cdot \mid Z)}\, D\Big( \mathrm{sg}\big[ X^Z_{t \to r} + (r - t)\, \tfrac{d}{dt} X_\theta(X_t, t, r) \big],\; X_\theta(X_t, t, r) \Big), \tag{58}
\]

where

\[
\frac{d}{dt} X_\theta(X_t, t, r) = \partial_{x_t} X_\theta(X_t, t, r) \cdot v(X_t, t \mid Z) + \partial_t X_\theta(X_t, t, r). \tag{59}
\]

During inference, we recursively apply the model to transform the generation trajectory from the source distribution to the target distribution:

\[
\hat{x}_r = X_\theta(x_t, t, r). \tag{60}
\]

For clarity, we summarize the conceptual training and inference procedures in Algorithm 2 and Algorithm 3. In particular, by extending Algorithm 3, the one-step generation procedure is described in Algorithm 4.

Algorithm 2 Transition Flow Matching (TFM): Training
    // Note: in PyTorch and JAX, jvp returns (function_output, JVP).
    // tfm(x_t, t, r): model X_θ(x_t, t, r) predicting the transition state at time r
    // metric(.): e.g., MSE, or any Bregman divergence on residuals
    x_1 = sample_from_training_batch()   // sample from target distribution p_1
    x_0 = randn_like(x_1)                // sample from source distribution p_0 (Gaussian), independent of p_1
    (t, r) = sample_t_r()                // 0 <= t <= r <= 1
    x_t = (1 - t) * x_0 + t * x_1        // current state
    v = x_1 - x_0                        // conditional velocity with Z = (X_0, X_1)
    x_t_to_r = (1 - r) * x_0 + r * x_1   // conditional transition state with Z = (X_0, X_1)
    (X, dX_dt) = jvp(tfm, (x_t, t, r), (v, 1, 0))   // d/dt X_θ = ∂_x X_θ · v + ∂_t X_θ
    X_tgt = x_t_to_r + (r - t) * dX_dt
    error = X - stopgrad(X_tgt)
    loss = metric(error)
    return loss

Algorithm 3 TFM: Multi-step Sampling with Arbitrary Step Sizes
    // Choose any time grid: 0 = t_0 < t_1 < ... < t_K = 1
    x = randn(x_shape)          // x = x_{t_0} = x_0
    for k = 0 to K - 1:
        t = t_k, r = t_{k+1}
        x = tfm(x, t, r)        // x = x_{t -> r}
    return x                    // x = x_{t_K} = x_1

Algorithm 4 TFM: 1-step Sampling
    x = randn(x_shape)          // x = x_0
    x = tfm(x, 0, 1)            // x = x_1
    return x
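The following is a minimal runnable sketch of the training step in Algorithm 2 and the multi-step sampler in Algorithm 3 in PyTorch. The model signature `model(x, t, r)`, the plain MSE metric, and the batching are illustrative assumptions rather than the exact implementation used here.

```python
import torch
from torch.func import jvp

def tfm_training_loss(model, x1: torch.Tensor, t: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Conditional TFM loss (Eq. 58) with Z = (X_0, X_1) and the linear interpolant."""
    x0 = torch.randn_like(x1)                     # source sample, independent of x1
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))      # broadcast time over data dims
    r_ = r.view(-1, *([1] * (x1.dim() - 1)))
    x_t = (1 - t_) * x0 + t_ * x1                 # current state
    v = x1 - x0                                   # conditional velocity
    x_t_to_r = (1 - r_) * x0 + r_ * x1            # conditional transition state
    tangents = (v, torch.ones_like(t), torch.zeros_like(r))
    X, dX_dt = jvp(lambda x, a, b: model(x, a, b), (x_t, t, r), tangents)
    X_tgt = (x_t_to_r + (r_ - t_) * dX_dt).detach()   # stop-gradient target
    return ((X - X_tgt) ** 2).mean()                  # MSE as the Bregman divergence

@torch.no_grad()
def tfm_sample(model, x_shape, time_grid):
    """Multi-step sampling over an arbitrary grid 0 = t_0 < ... < t_K = 1 (Algorithm 3)."""
    x = torch.randn(x_shape)
    for t_k, t_next in zip(time_grid[:-1], time_grid[1:]):
        t = torch.full((x_shape[0],), float(t_k))
        r = torch.full((x_shape[0],), float(t_next))
        x = model(x, t, r)
    return x
```

One-step generation (Algorithm 4) corresponds to calling `tfm_sample` with the two-point grid `[0.0, 1.0]`.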
References

1. Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E.: Stochastic interpolants: A unifying framework for flows and diffusions (2023), https://arxiv.org/abs/2303.08797
2. Albergo, M.S., Vanden-Eijnden, E.: Building normalizing flows with stochastic interpolants. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=li7qeBbCR1t
3. Boffi, N.M., Albergo, M.S., Vanden-Eijnden, E.: Flow map matching. arXiv preprint arXiv:2406.07507 (2024)
4. Cai, S., Chan, E.R., Zhang, Y., Guibas, L., Wu, J., Wetzstein, G.: Diffusion self-distillation for zero-shot customized image generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18434-18443 (2025)
5. Dormand, J.R., Prince, P.J.: A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics 6(1), 19-26 (1980)
6. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale (2021)
7. Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first International Conference on Machine Learning (2024), https://openreview.net/forum?id=FPnUhsQJ5B
8. Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis (2024)
9. Frans, K., Hafner, D., Levine, S., Abbeel, P.: One step diffusion via shortcut models. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=OlzB6LnXcS
10. Geng, Z., Deng, M., Bai, X., Kolter, J.Z., He, K.: Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447 (2025)
11. Geng, Z., Lu, Y., Wu, Z., Shechtman, E., Kolter, J.Z., He, K.: Improved mean flows: On the challenges of fastforward generative models. arXiv preprint arXiv:2512.02012 (2025)
12. Geng, Z., Pokle, A., Kolter, J.Z.: One-step diffusion distillation via deep equilibrium models. In: Thirty-seventh Conference on Neural Information Processing Systems (2023), https://openreview.net/forum?id=b6XvK2de99
13. Geng, Z., Pokle, A., Luo, W., Lin, J., Kolter, J.Z.: Consistency models made easy. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=xQVxo9dSID
14. Guo, P., Schwing, A.: Variational rectified flow matching. In: Forty-second International Conference on Machine Learning (2025), https://openreview.net/forum?id=Rk18ZikrFI
15. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017)
16. Ho, J., Salimans, T.: Classifier-free diffusion guidance (2022), abs/2207.12598
17. Hu, Z., Lai, C.H., Mitsufuji, Y., Ermon, S.: CMT: Mid-training for efficient learning of consistency, mean flow, and flow map models. arXiv preprint (2025)
18. Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems 35, 26565-26577 (2022)
19. Kim, D., Lai, C.H., Liao, W.H., Murata, N., Takida, Y., Uesaka, T., He, Y., Mitsufuji, Y., Ermon, S.: Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=ymjI8feDTD
20. Kim, D., Lai, C.H., Liao, W.H., Murata, N., Takida, Y., Uesaka, T., He, Y., Mitsufuji, Y., Ermon, S.: Consistency trajectory models: Learning probability flow ODE trajectory of diffusion (2024)
21. Klein, L., Krämer, A., Noé, F.: Equivariant flow matching. Advances in Neural Information Processing Systems 36, 59886-59910 (2023)
22. Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
23. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (2012)
24. Lai, C.H., Song, Y., Kim, D., Mitsufuji, Y., Ermon, S.: The principles of diffusion models (2025)
25. Lee, S., Lin, Z., Fanti, G.: Improving the training of rectified flows. Advances in Neural Information Processing Systems 37, 63082-63109 (2024)
26. Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=PqvMRDCJT9t
27. Lipman, Y., Havasi, M., Holderrieth, P., Shaul, N., Le, M., Karrer, B., Chen, R.T.Q., Lopez-Paz, D., Ben-Hamu, H., Gat, I.: Flow matching guide and code (2024)
28. Lipman, Y., Havasi, M., Holderrieth, P., Shaul, N., Le, M., Karrer, B., Chen, R.T., Lopez-Paz, D., Ben-Hamu, H., Gat, I.: Flow matching guide and code. arXiv preprint arXiv:2412.06264 (2024)
29. Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=XVjTT1nw5z
30. Lu, C., Song, Y.: Simplifying, stabilizing and scaling continuous-time consistency models. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=LyJi5ugyJx
31. Lu, Y., Lu, S., Sun, Q., Zhao, H., Jiang, Z., Wang, X., Li, T., Geng, Z., He, K.: One-step latent-free image generation with pixel mean flows. arXiv preprint arXiv:2601.22158 (2026)
32. Luo, T., Yuan, H., Liu, Z.: Soflow: Solution flow models for one-step generative modeling. arXiv preprint arXiv:2512.15657 (2025)
33. Ma, C., Xiao, X., Wang, T., Shen, Y.: Beyond editing pairs: Fine-grained instructional image editing via multi-scale learnable regions. arXiv preprint arXiv:2505.19352 (2025)
34. Ma, C., Xiao, X., Wang, T., Wang, X., Shen, Y.: CAD-VAE: Leveraging correlation-aware latents for comprehensive fair disentanglement. In: The Fortieth AAAI Conference on Artificial Intelligence (2025)
35. Ma, C., Xiao, X., Wang, T., Wang, X., Shen, Y.: Stochastic interpolants via conditional dependent coupling. arXiv preprint arXiv:2509.23122 (2025)
36. Ma, C., Xiao, X., Wang, T., Wang, X., Shen, Y.: Learning straight flows: Variational flow matching for efficient generation (2026), https://arxiv.org/abs/2511.17583
37. Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In: European Conference on Computer Vision. pp. 23–40. Springer (2024)
38. Nie, W., Berner, J., Ma, N., Liu, C., Xie, S., Vahdat, A.: Transition matching distillation for fast video generation. arXiv preprint arXiv:2601.09881 (2026)
39. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2022), https://arxiv.org/abs/2112.10752
40. Sabour, A., Fidler, S., Kreis, K.: Align your flow: Scaling continuous-time flow map distillation. arXiv preprint arXiv:2506.14603 (2025)
41. Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=TIdIXIpzhoI
42. Silvestri, G., Ambrogioni, L., Lai, C.H., Takida, Y., Mitsufuji, Y.: VCT: Training consistency models with variational noise coupling. In: Forty-second International Conference on Machine Learning (2025), https://openreview.net/forum?id=CMoX0BEsDs
43. Song, Y., Dhariwal, P.: Improved techniques for training consistency models. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=WNzy9bRDvG
44. Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models. In: Proceedings of the 40th International Conference on Machine Learning. pp. 32211–32252 (2023)
45. Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations (2021), https://openreview.net/forum?id=PxTIG12RRHS
46. Tong, A., Fatras, K., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Wolf, G., Bengio, Y.: Improving and generalizing flow-based generative models with minibatch optimal transport. arXiv preprint arXiv:2302.00482 (2023)
47. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2023)
48. Wang, F.Y., Yang, L., Huang, Z., Wang, M., Li, H.: Rectified diffusion: Straightness is not your need in rectified flow. arXiv preprint arXiv:2410.07303 (2024)
49. Wang, Z., Zhang, Y., Yue, X., Yue, X., Li, Y., Ouyang, W., Bai, L.: Transition models: Rethinking the generative learning objective. arXiv preprint (2025)
50. Zhang, H., Siarohin, A., Menapace, W., Vasilkovsky, M., Tulyakov, S., Qu, Q., Skorokhodov, I.: Alphaflow: Understanding and improving meanflow models. arXiv preprint arXiv:2510.20771 (2025)
51. Zhang, Y., Yan, Y., Schwing, A., Zhao, Z.: Hierarchical rectified flow matching with mini-batch couplings. arXiv preprint arXiv:2507.13350 (2025)
52. Zhang, Y., Yan, Y., Schwing, A., Zhao, Z.: Towards hierarchical rectified flow. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=6F6qwdycgJ
53. Zhou, L., Ermon, S., Song, J.: Inductive moment matching. In: Forty-second International Conference on Machine Learning (2025), https://openreview.net/forum?id=pwNSUo7yUb