Transition Flow Matching
Chenrui Ma
University of California, Irvine, Irvine, CA 92697, USA
chenrum@uci.edu

Abstract. Mainstream flow matching methods typically focus on learning the local velocity field, which inherently requires multiple integration steps during generation. In contrast, Mean Velocity Flow models establish a relationship between the local velocity field and the global mean velocity, enabling the latter to be learned through a mathematically grounded formulation and allowing generation to be transferred to arbitrary future time points. In this work, we propose a new paradigm that directly learns the transition flow. As a global quantity, the transition flow naturally supports generation in a single step or at arbitrary time points. Furthermore, we demonstrate the connection between our approach and Mean Velocity Flow, establishing a unified theoretical perspective. Extensive experiments validate the effectiveness of our method and support our theoretical claims.

Keywords: Generative Modeling · Flow Matching · Fewer/One-step Generation

1 Introduction

The goal of generative modeling is to transform a prior distribution into the data distribution. Flow Matching [18, 35, 37] provides an intuitive and conceptually simple framework for constructing flow paths that transport one distribution to another. Closely related to diffusion models [1, 4, 45], Flow Matching focuses on learning the velocity fields that guide the generative process during training.

Both Flow Matching [26] and diffusion [45] models rely on iterative sampling during generation. Recently, significant attention has been devoted to few-step, and particularly one-step feedforward, generative models. One common approach to accelerate Flow/Diffusion models is distillation. However, an ideal solution would allow models to be trained from scratch in an end-to-end manner without relying on pre-trained teachers. Pioneering this direction, Consistency Models [9, 13, 44] introduce a consistency constraint on network outputs for inputs sampled along the same trajectory. Despite their promising performance, this constraint is imposed as a behavioral property of the network, while the properties of the underlying ground-truth field that should guide learning remain unclear [19, 30]. As a result, training can be unstable and typically requires a carefully designed discretization curriculum to progressively constrain the time domain [33, 43, 44]. In contrast, Mean Velocity Models [10, 11, 17, 31, 50] optimize an objective derived from the relationship between the instantaneous velocity at each time point and the mean velocity toward a future time. This formulation provides a more fundamental and elegant perspective for learning generative dynamics.

In this work, we propose a principled and effective framework, termed Transition Flow Matching, for few-step generation with arbitrary numbers of steps and step sizes. Instead of regressing a local vector field as in Flow/Diffusion models [36, 37, 45], our method directly models the generation trajectory itself, where the transition dynamics naturally represent a global counterpart of velocity. To this end, we derive (with proof) the Transition Flow Identity, and propose a principled training objective that enables generative models to satisfy this identity from scratch in an end-to-end manner.
This formulation extends and generalizes previous Transition Models [32, 38, 49] and Flow Map methods [3, 40]. Importantly, we further establish a unified perspective that clarifies the relationship between our framework and Mean Velocity methods [10, 11]. In experiments, we conduct extensive evaluations, including generation trajectory visualization and image generation tasks across multiple datasets. The results show that our approach achieves competitive performance, demonstrating the effectiveness of modeling transition flows. In addition, we perform ablation studies to analyze key implementation design choices. Our contributions can be summarized as follows:

• We propose Transition Flow Matching, a principled framework for few-step generative modeling.
• We derive the Transition Flow Identity and provide a theoretically grounded training objective.
• We establish a unified view connecting our method with Mean Velocity models.
• We demonstrate competitive performance across multiple image generation benchmarks.

2 Related Works

Flow Matching and Diffusion. Flow Matching [2, 7, 26] and Diffusion [1, 4, 45] models generate samples by parameterizing continuous-time dynamics via Ordinary Differential Equations (ODEs) or Stochastic Differential Equations (SDEs). Flow Matching learns a deterministic velocity field defining an ODE over sample evolution [28], while Diffusion models are typically formulated as SDEs that learn the score function [24]. Through the probability flow ODE formulation, the learned score implicitly defines an equivalent velocity field [27]. In both cases, the learned velocity or score is a local quantity depending only on the current state and time. As a result, generation requires numerical integration over multiple steps to gradually transport samples from noise to data. A common strategy to alleviate trajectory conflicts is distillation [7, 12, 25, 29, 41, 48, 52] from a well-trained Flow or Diffusion model, which provides a better coupling distribution than the standard independent coupling setting [21, 34, 42, 46]. In contrast, our Transition Flow Matching framework directly models global state transitions. Instead of regressing a local vector field, we parameterize mappings between states at different times, enabling direct transitions from a given state to an arbitrary future state without relying on local velocity regression.

Consistency Model and Mean Velocity Model. To reduce the number of inference steps, Consistency Models [9, 13, 19, 30, 43, 44] learn generation trajectories that maintain consistency over time, enforcing trajectory coherence. Meanwhile, Mean Velocity methods [10, 11, 17, 31, 50] establish a relationship between the instantaneous velocity at each time point and a mean velocity target toward a future time. By contrast, our method directly models the generation trajectory itself, where the transfer dynamics naturally represent a global counterpart of velocity. This enables generation with arbitrary step sizes and arbitrary numbers of steps. Importantly, we establish a unified perspective that clarifies the relationship between our framework and Mean Velocity methods [10, 11].

Transition Models and Flow Map. Transition-based approaches aim to learn state transitions by preserving transition identities.
Consistency Trajectory Models [20] learn mappings between arbitrary time steps, but rely on explicit ODE/SDE integration during training. Flow Map Matching [3] regresses zeroth- and first-order derivatives of flow fields, while IMM [53] performs moment matching across time steps. However, these methods either lack sufficient theoretical analysis, do not clearly connect to prior formulations, or rely on distillation. In contrast, our method is trained from scratch in an end-to-end manner and provides a clear theoretical relationship under both the Flow Matching [26] and Mean Velocity frameworks [10, 11].

3 Preliminaries

3.1 Notation and setup

Random variables and realizations. We work in $\mathbb{R}^d$. Uppercase letters (e.g., $X_t$, $Z$) denote random variables (RVs), and lowercase letters (e.g., $x_t$, $z$) denote their realizations. For the density of an RV $X_t$ we write $p_t(\cdot)$, and for a conditional density we write $p_{t|Z}(\cdot \mid z)$. Expectations are denoted by $\mathbb{E}[\cdot]$.

Source/target distributions and coupling. Let $X_0 \sim p_0$ be the source distribution (e.g., standard Gaussian noise) and $X_1 \sim p_1$ be the target distribution (e.g., images). A generative model constructs a continuous path of distributions $\{p_t\}_{t\in[0,1]}$ that transports $p_0$ to $p_1$. Let $(X_0, X_1)$ be any coupling with joint density $\pi$ whose marginals are $p_0$ and $p_1$. Throughout this work, we follow the standard Flow Matching setting where the source and target are independent: $\pi(x_0, x_1) = p_0(x_0)\, p_1(x_1)$.

Conditioning variable and conditional paths. We use a conditioning RV $Z$ to index conditional paths. Conditioned on $Z = z$, we obtain a conditional path $\{X_t^Z\}_{t\in[0,1]}$ with conditional density $p_{t|Z}(\cdot \mid z)$. The corresponding marginal path $\{X_t\}_{t\in[0,1]}$ has density $p_t(\cdot)$ satisfying

\[
p_t(x_t) = \int p_{t|Z}(x_t \mid z)\, p_Z(z)\, dz, \qquad X_t \sim p_t, \quad X_t \mid (Z = z) \sim p_{t|Z}(\cdot \mid z). \tag{1}
\]

General interpolant. We consider a general interpolant between the marginal endpoints $X_0$ and $X_1$, specified by scalar functions $\alpha : [0,1] \to \mathbb{R}$ and $\beta : [0,1] \to \mathbb{R}$:

\[
X_t = \alpha(t) X_0 + \beta(t) X_1, \qquad t \in [0,1]. \tag{2}
\]

We assume boundary conditions $\alpha(0) = 1$, $\beta(0) = 0$ and $\alpha(1) = 0$, $\beta(1) = 1$, so that $X_{t=0} = X_0$ and $X_{t=1} = X_1$. If $\alpha, \beta$ are differentiable, then

\[
\frac{d}{dt} X_t = \dot{\alpha}(t) X_0 + \dot{\beta}(t) X_1. \tag{3}
\]

Conditioned on $Z = z$, the same schedules induce the conditional path $X_t^Z = \alpha(t) X_0^Z + \beta(t) X_1^Z$ and $\frac{d}{dt} X_t^Z = \dot{\alpha}(t) X_0^Z + \dot{\beta}(t) X_1^Z$.

3.2 Flow Matching

Velocity fields. Let $v(x_t, t \mid z) \in \mathbb{R}^d$ denote a conditional velocity field that transports $p_{t|Z}(\cdot \mid z)$ over time. The corresponding marginal velocity is defined by conditional expectation:

\[
v(x_t, t) = \int v(x_t, t \mid z)\, p_{Z|t}(z \mid x_t)\, dz = \mathbb{E}\big[ v(X_t, t \mid Z) \mid X_t = x_t \big]. \tag{4}
\]

Generation ODE and continuity equation. A marginal trajectory follows the ODE

\[
\frac{dx_t}{dt} = v(x_t, t), \qquad t \in [0,1], \tag{5}
\]

and similarly a conditional trajectory follows $\frac{dx_t^z}{dt} = v(x_t, t \mid z)$. The induced marginal density $p_t$ satisfies the continuity (transport) equation

\[
\partial_t p_t(x_t) + \nabla \cdot \big( p_t(x_t)\, v(x_t, t) \big) = 0, \qquad t \in [0,1], \tag{6}
\]

and conditioned on $Z = z$, $p_{t|Z}(\cdot \mid z)$ satisfies

\[
\partial_t p_{t|Z}(x_t \mid z) + \nabla \cdot \big( p_{t|Z}(x_t \mid z)\, v(x_t, t \mid z) \big) = 0, \qquad t \in [0,1]. \tag{7}
\]

Learning Flow Matching.
Flow Matching introduces a parameterized model $v_\theta(X_t, t)$ to learn $v(X_t, t)$ by minimizing the marginal loss

\[
\mathcal{L}_{\mathrm{MFM}}(\theta) = \mathbb{E}_{t,\, X_t \sim p_t}\, D\big( v(X_t, t),\; v_\theta(X_t, t) \big), \tag{8}
\]

where $D(\cdot,\cdot)$ is typically a Bregman divergence (e.g., MSE). Since the marginal velocity $v(X_t, t)$ in Eq. (4) is generally intractable, one instead minimizes the conditional loss

\[
\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\, Z,\, X_t \sim p_{t|Z}(\cdot \mid Z)}\, D\big( v(X_t, t \mid Z),\; v_\theta(X_t, t) \big). \tag{9}
\]

Theorem 1 (Gradient equivalence of Flow Matching [27]). The gradients of the marginal Flow Matching loss and the conditional Flow Matching loss coincide:

\[
\nabla_\theta \mathcal{L}_{\mathrm{MFM}}(\theta) = \nabla_\theta \mathcal{L}_{\mathrm{CFM}}(\theta). \tag{10}
\]

In particular, the minimizer of the conditional Flow Matching loss is the marginal velocity $v(x_t, t)$.

Remark 1 (Standard Flow Matching). Flow Matching sets the conditioning variable to be the endpoint pair $Z = (X_0, X_1)$. Choosing linear schedules $\alpha(t) = 1 - t$ and $\beta(t) = t$ yields the conditional path $X_t^Z = (1-t) X_0 + t X_1$ with constant conditional velocity $v(X_t, t \mid Z) = X_1 - X_0$. Thus Eq. (9) reduces to

\[
\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\, X_0 \sim p_0,\, X_1 \sim p_1}\, D\big( X_1 - X_0,\; v_\theta(X_t, t) \big), \qquad X_t = (1-t) X_0 + t X_1. \tag{11}
\]
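For concreteness, the following is a minimal sketch of the standard conditional objective in Eq. (11). The velocity model `v_theta`, its `(x_t, t)` signature, and the batch shapes are assumptions made for illustration rather than the exact implementation used here.

```python
import torch

def flow_matching_loss(v_theta, x1: torch.Tensor) -> torch.Tensor:
    """Conditional Flow Matching loss (Eq. 11) with the linear interpolant.

    v_theta: callable taking (x_t, t) and returning a velocity with the same
             shape as x_t (assumed signature).
    x1:      a batch of samples from the target distribution p_1.
    """
    x0 = torch.randn_like(x1)                        # source sample X_0 ~ N(0, I)
    t = torch.rand(x1.shape[0], device=x1.device)    # t ~ Uniform[0, 1]
    t_exp = t.view(-1, *([1] * (x1.dim() - 1)))      # broadcast t over data dims
    x_t = (1 - t_exp) * x0 + t_exp * x1              # interpolated state X_t
    target = x1 - x0                                 # constant conditional velocity
    return ((v_theta(x_t, t) - target) ** 2).mean()  # squared l2 Bregman divergence
```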
4 Method

Flow Matching learns a time-dependent velocity field $v(x_t, t)$ and generates samples by integrating the ODE in Eq. (5). Our goal is to directly learn a transition flow

\[
X_\theta(x_t, t, r) : \mathbb{R}^d \times [0,1] \times [0,1] \to \mathbb{R}^d, \qquad 0 \le t \le r \le 1,
\]

that maps a state $x_t$ at time $t$ to the future state at time $r$ along the same transport dynamics. At inference time, this enables stepping across an arbitrary time grid by repeatedly applying $X_\theta(\cdot)$, without explicitly integrating $v(\cdot)$.

4.1 Transition Flow Identity

Average velocity between two time steps. Given the marginal velocity $v(x_\tau, \tau)$ as in Eq. (4), define the average velocity $u(x_t, t, r)$ for any $0 \le t \le r \le 1$:

\[
(r - t)\, u(x_t, t, r) = x_{t \to r} - x_t = \int_t^r v(x_\tau, \tau)\, d\tau. \tag{12}
\]

Here $x_t$ is the current state at time $t$, and $x_{t \to r}$ denotes the marginal transition state at time $r$ obtained by evolving from $x_t$.

Marginal/conditional transition states. Analogous to Eq. (4), the marginal transition state can be expressed as a conditional expectation:

\[
x_{t \to r} = \int x^z_{t \to r}\, p_{Z|t}(z \mid x_t)\, dz = \mathbb{E}\big[ X^Z_{t \to r} \mid X_t = x_t \big], \tag{13}
\]

where the conditional transition state is $x^z_{t \to r} = x_t + \int_t^r v(x_\tau, \tau \mid z)\, d\tau$.

Transition flow. We define the (marginal) transition flow $X(x_t, t, r)$ as the mapping that returns the marginal transition state:

\[
X(x_t, t, r) = \int X(x_t, t, r \mid z)\, p_{Z|t}(z \mid x_t)\, dz = \mathbb{E}\big[ X(X_t, t, r \mid Z) \mid X_t = x_t \big], \tag{14}
\]

where $X(x_t, t, r \mid z) = x^z_{t \to r}$. Combining Eq. (12) with the definition of $X(\cdot)$ yields

\[
u(x_t, t, r) = \frac{x_{t \to r} - x_t}{r - t} = \frac{X(x_t, t, r) - x_t}{r - t}. \tag{15}
\]

Transition Flow Identity. Differentiating Eq. (12) with respect to $t$ (treating $r$ as independent of $t$) gives

\[
u(x_t, t, r) = v(x_t, t) + (r - t)\, \frac{d}{dt} u(x_t, t, r). \tag{16}
\]

Substituting Eq. (15) into Eq. (16) yields the following key identity:

\[
X(x_t, t, r) = x_{t \to r} + (r - t)\, \frac{d}{dt} X(x_t, t, r). \tag{17}
\]

We defer the detailed algebraic derivation to the Appendix.

4.2 Computing the time derivative of a transition flow

To make Eq. (17) actionable for learning, we expand the total derivative $\frac{d}{dt} X(x_t, t, r)$ as

\[
\frac{d}{dt} X(x_t, t, r) = \partial_{x_t} X \cdot \frac{dx_t}{dt} + \partial_t X \cdot \frac{dt}{dt} + \partial_r X \cdot \frac{dr}{dt} = \partial_{x_t} X \cdot v(x_t, t) + \partial_t X, \tag{18}
\]

where $\frac{dx_t}{dt} = v(x_t, t)$ by Eq. (5) and $\frac{dr}{dt} = 0$. In practice, Eq. (18) is given by the Jacobian-vector product (JVP) between the Jacobians of the function, $[\partial_{x_t} X, \partial_t X, \partial_r X]$, and the corresponding tangent vector $[v, 1, 0]$. Modern libraries such as PyTorch provide efficient JVP interfaces for this computation.

4.3 Transition Flow Matching objectives

Model parameterization. We introduce a Transition Flow model $X_\theta(x_t, t, r)$ to model $X(x_t, t, r)$. We use $\mathrm{sg}[\cdot]$ to denote a stop-gradient operator (i.e., the argument is treated as a constant target during optimization), and $D(\cdot,\cdot)$ is a Bregman divergence (e.g., MSE).

Intractable marginal objective. Ideally, we would enforce Eq. (17) by minimizing the marginal Transition Flow Matching objective (M-TFM):

\[
\mathcal{L}_{\mathrm{M\text{-}TFM}}(\theta) = \mathbb{E}_{t,\, r,\, X_t \sim p_t}\, D\big( \mathrm{sg}\big[ X^m_{\mathrm{tgt}}(X_t, t, r) \big],\; X_\theta(X_t, t, r) \big), \tag{19}
\]

with target

\[
X^m_{\mathrm{tgt}}(X_t, t, r) = X_{t \to r} + (r - t)\, \frac{d}{dt} X_\theta(X_t, t, r). \tag{20}
\]

However, the marginal transition state $X_{t \to r}$ and the marginal velocity $v(x_t, t)$ (needed inside $\frac{d}{dt} X_\theta$ via Eq. (18)) are generally intractable, hence Eq. (19) cannot be evaluated directly.

Tractable conditional objective. Instead, we minimize a conditional Transition Flow Matching objective (C-TFM):

\[
\mathcal{L}_{\mathrm{C\text{-}TFM}}(\theta) = \mathbb{E}_{t,\, r,\, Z,\, X_t \sim p_{t|Z}(\cdot \mid Z)}\, D\big( \mathrm{sg}\big[ X^c_{\mathrm{tgt}}(X_t, t, r \mid Z) \big],\; X_\theta(X_t, t, r) \big), \tag{21}
\]

where the conditional target is

\[
X^c_{\mathrm{tgt}}(X_t, t, r \mid Z) = X^Z_{t \to r} + (r - t)\, \frac{d}{dt} X_\theta(X_t, t, r). \tag{22}
\]

Crucially, under appropriate choices of $Z$ and the conditional path construction, both the conditional transition state $X^Z_{t \to r}$ and the conditional velocity $v(x_t, t \mid Z)$ become tractable, which makes $\frac{d}{dt} X_\theta$ computable via Eq. (18).

Theorem 2 (Gradient equivalence of Transition Flow Matching). The gradients of the marginal and conditional Transition Flow Matching losses coincide:

\[
\nabla_\theta \mathcal{L}_{\mathrm{M\text{-}TFM}}(\theta) = \nabla_\theta \mathcal{L}_{\mathrm{C\text{-}TFM}}(\theta). \tag{23}
\]

In particular, the minimizer of the conditional objective regresses $X_\theta(X_t, t, r)$ toward the marginal target $X^m_{\mathrm{tgt}}(X_t, t, r)$, which is defined to satisfy the Transition Flow Identity in Eq. (17). See the proof in the Appendix.

4.4 Training and inference procedure

Remark 2 (Standard Transition Flow Matching). We adopt the standard Flow Matching setting in Remark 1 by taking $Z = (X_0, X_1)$ and using the linear interpolant $X_t = (1-t) X_0 + t X_1$, $0 \le t \le 1$. For any $0 \le t \le r \le 1$, the conditional transition state is

\[
X^Z_{t \to r} = (1 - r) X_0 + r X_1,
\]

and the conditional velocity is constant: $v(X_t, t \mid Z) = X_1 - X_0$. Under this instantiation, the tractable Transition Flow Matching objective in Eq. (21) becomes
\[
\mathcal{L}_{\mathrm{TFM}}(\theta) = \mathbb{E}_{t,\, r,\, Z,\, X_t \sim p_{t|Z}(\cdot \mid Z)}\, D\Big( \mathrm{sg}\big[ X^Z_{t \to r} + (r - t)\, \tfrac{d}{dt} X_\theta(X_t, t, r) \big],\; X_\theta(X_t, t, r) \Big). \tag{24}
\]

The time-derivative term $\frac{d}{dt} X_\theta(X_t, t, r)$ extends Eq. (18) by incorporating the tractable conditional velocity and is given by (see the proof in the Appendix)

\[
\frac{d}{dt} X_\theta(X_t, t, r) = \partial_{x_t} X_\theta(X_t, t, r) \cdot v(X_t, t \mid Z) + \partial_t X_\theta(X_t, t, r). \tag{25}
\]

During inference, we recursively apply the model to transform the generation trajectory from the source distribution to the target distribution:

\[
\hat{x}_r = X_\theta(x_t, t, r). \tag{26}
\]

For clarity, we summarize the conceptual training procedure in Algorithm 1.

Algorithm 1 Transition Flow Matching (TFM): Training
    // jvp returns (output, JVP); tfm predicts X_θ(x_t, t, r)
    x_1 = sample_batch()
    x_0 = randn_like(x_1)
    (t, r) = sample_t_r()                           // 0 ≤ t ≤ r ≤ 1
    x_t = (1 - t) * x_0 + t * x_1
    v = x_1 - x_0                                   // conditional velocity
    x_t_to_r = (1 - r) * x_0 + r * x_1              // conditional transition state
    (X, dX_dt) = jvp(tfm, (x_t, t, r), (v, 1, 0))
    X_tgt = x_t_to_r + (r - t) * dX_dt
    loss = metric(X - stopgrad(X_tgt))
    return loss

4.5 Classifier-Free Guidance

Transition Flow Matching naturally supports classifier-free guidance (CFG) via

\[
X^{\mathrm{cfg}}_\theta(x_t, t, r \mid c) = \omega\, X(x_t, t, r \mid c) + (1 - \omega)\, X(x_t, t, r \mid \varnothing). \tag{27}
\]

This is a combination of the class-conditional transition flow $X(x_t, t, r \mid c)$ and the unconditional transition flow $X(x_t, t, r \mid \varnothing)$, allowing us to control the strength of class conditioning at inference time by tuning $\omega$. Here, the term "condition" refers specifically to class conditioning, which is distinct from the marginal/conditional distinction used in previous sections. Following standard CFG practice [16], we train a single model $X_\theta(\cdot)$ that supports both class-conditional generation $X_\theta(\cdot \mid c)$ and unconditional generation $X_\theta(\cdot \mid \varnothing)$. Concretely, for conditional training, the endpoint $X_1$ in Sec. 4.4 is sampled from the class-conditional target distribution $p_1(\cdot \mid c)$; for unconditional training, $X_1$ in Sec. 4.4 is sampled from the unconditional target distribution $p_1(\cdot)$.

[Fig. 1: 2D Generation Trajectory Visualization on Synthetic Data. Compared methods: ground truth, Flow Matching [26], Rectified Flow [29], Flow Map Matching [3], MeanFlow [10], and TFM, with the map-based methods shown at 1, 2, and 5 steps.]

5 Implementation Design

Loss Metrics. In Eq. (24), the Bregman divergence $D(\cdot,\cdot)$ is instantiated as the squared $\ell_2$ loss. Following [10], we further investigate alternative loss metrics. In general, we consider loss functions of the form $\mathcal{L} = \lVert \Delta \rVert_2^{2\gamma}$, where $\Delta$ denotes the regression error. It can be shown (see [13]) that minimizing $\lVert \Delta \rVert_2^{2\gamma}$ is equivalent to minimizing the squared $\ell_2$ loss $\lVert \Delta \rVert_2^2$ with adapted loss weights. In practice, we define the weight as $w = 1 / (\lVert \Delta \rVert_2^2 + c)^{p}$, where $p = 1 - \gamma$ and $c > 0$ is a small constant (e.g., $10^{-3}$) for numerical stability. The resulting adaptively weighted loss takes the form $\mathrm{sg}(w) \cdot \mathcal{L}$ with $\mathcal{L} = \lVert \Delta \rVert_2^2$. When $p = 0.5$, this formulation resembles the Pseudo-Huber loss [43]. We compare different choices of $p$ in the experiments.

Sampling Time Steps $(t, r)$. We sample the two time steps $(t, r)$ from a predefined logit-normal (lognorm) distribution [8, 10]. Specifically, we first draw a sample from a normal distribution $\mathcal{N}(\mu, \sigma)$ and map it to the interval $(0, 1)$ via the logistic function to obtain $t$. We then sample another logit-normal variable $d$ in the same manner and set $r = t + d(1 - t)$, which ensures that the constraint $0 \le t \le r \le 1$ is satisfied. Note that for any given $t$, the transition time $r$ is obtained by an affine transformation of a logit-normal random variable, ensuring that $r \in [t, 1]$ while preserving a logit-normal-shaped density over the valid interval. Different hyperparameter settings of the logit-normal distribution are evaluated in the experiments; a minimal sampler sketch is given below.
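The following sketch illustrates this sampling scheme. The function name, batching, and the default $(\mu, \sigma)$ values (set to the best-performing ablation configuration) are assumptions for illustration.

```python
import torch

def sample_t_r(batch_size: int, mu: float = -0.4, sigma: float = 1.0,
               device: str = "cpu") -> tuple[torch.Tensor, torch.Tensor]:
    """Sample time pairs (t, r) with 0 <= t <= r <= 1.

    t is logit-normal: t = sigmoid(n) with n ~ N(mu, sigma^2).
    r = t + d * (1 - t), where d is another logit-normal sample, so r lies in [t, 1].
    """
    t = torch.sigmoid(mu + sigma * torch.randn(batch_size, device=device))
    d = torch.sigmoid(mu + sigma * torch.randn(batch_size, device=device))
    r = t + d * (1.0 - t)
    return t, r
```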
Conditioning on $(t, r)$. We employ positional embeddings [10, 47] to encode the time variables, which are subsequently combined and used as conditioning inputs to the neural network. Although the vector field is parameterized as $X_\theta(x_t, t, r)$, it is not strictly necessary for the network to directly condition on $(t, r)$. For instance, the network can instead condition on $(t, \Delta t)$, where $\Delta t = r - t$. In this case, we define $X_\theta(\cdot, t, r) \triangleq \mathrm{net}(\cdot, t, r - t)$, with $\mathrm{net}$ denoting the neural network. The Jacobian-vector product (JVP) is always computed with respect to the function $X_\theta(\cdot, t, r)$. We empirically compare different conditioning strategies in the experiments.

6 Experiments

6.1 Synthetic Data and Visualization

Synthetic data experiments are conducted to visualize the results and intuitively demonstrate the effectiveness of the proposed method [36, 51]. We simulate and visualize the generation trajectories of different methods on a 2D alphabet "M" dataset, as shown in Figure 1. In this dataset, the source distribution (blue points) is circular, while the target distribution (red points) forms the shape of the letter "M". The visualization results are consistent with the discussion in the previous section: our goal is to enable generation with arbitrary step sizes and an arbitrary number of steps by modeling the generation trajectory. Notably, even one-step generation shows promising results.

6.2 Visual Generation

CIFAR-10. CIFAR-10 is a 32 × 32 resolution image dataset containing multiple classes and is a widely used benchmark in generative modeling [22]. For a fair evaluation, we adopt the same UNet architecture and training protocol as in prior work [13, 14], while replacing the conventional flow matching objective with the proposed Transition Flow Matching objective defined in Eq. (24). The UNet model $X_\theta$ follows a standard encoder-decoder design with residual blocks and skip connections. A self-attention block is inserted after the residual block at 16 × 16 resolution and at the bottleneck layer. The model takes the current state $x_t$ and the time variables $(t, r)$ as input, where $(t, r)$ are embedded and used to modulate adaptive group normalization layers through learnable scale and shift parameters.

To quantitatively evaluate generation performance, we compare our method with several state-of-the-art approaches by measuring generation quality using the Fréchet Inception Distance (FID) [15], computed under varying NFE ([1, 2, 5, 10]) and with the adaptive-step Dopri5 ODE solver [5], as presented in Table 1. The results in Table 1 show that our method achieves the best performance in one-step generation (NFE = 1). Moreover, the FID score consistently decreases as NFE increases, while both Consistency Models and Mean Velocity Models tend to exhibit degraded performance with higher NFE values.
| Family | Method | # Params. | NFE=1 | NFE=2 | NFE=5 | NFE=10 |
|---|---|---|---|---|---|---|
| Flow | Flow Matching [26] [ICLR'23] | 36.5M | - | 166.65 | 36.19 | 14.4 |
| Flow | VFM [14] [ICML'25] | 60.6M | - | 97.83 | 13.12 | 5.34 |
| Re-Flow | 1-Rectified Flow [29] [ICLR'23] | 36.5M | 378 | 6.18 | - | - |
| Re-Flow | 2-Rectified Flow [29] [ICLR'23] | 36.5M | 12.21 | 4.85 | - | - |
| Re-Flow | 3-Rectified Flow [29] [ICLR'23] | 36.5M | 8.15 | 5.21 | - | - |
| Mean Velocity / Consistency | CT [44] [ICML'23] | 61.8M | 8.71 | 5.32 | 11.412 | 23.948 |
| Mean Velocity / Consistency | iCT [43] [ICLR'24] | 55M | 2.83 | 2.46 | - | - |
| Mean Velocity / Consistency | ECT [13] [ICLR'25] | 55M | 3.60 | 2.11 | - | - |
| Mean Velocity / Consistency | sCT [30] [ICLR'25] | 55M | 2.85 | 2.06 | - | - |
| Mean Velocity / Consistency | IMM [53] [ICML'25] | 55M | 3.20 | 1.98 | - | - |
| Mean Velocity / Consistency | MeanFlow [10] [NeurIPS'25] | 55M | 2.92 | 2.23 | 2.84 | 2.27 |
| Mean Velocity / Consistency | S-VFM [36] [CVPR'26] | 60.6M | 2.81 | 2.16 | 2.02 | 1.97 |
| Mean Velocity / Consistency | TFM [Ours] | 55M | 2.77 | 2.08 | 1.96 | 1.91 |

Table 1: Quantitative Comparison with Different Generation Methods on the CIFAR-10 Dataset (FID under varying NFE). Our method achieves the best performance in one-step generation (NFE = 1). Moreover, the FID score consistently decreases as NFE increases.

[Fig. 2: Generation Results on ImageNet-256 under Varying NFE (panels for NFE = 1, 2, 5, 10). As the number of function evaluations increases from 1 to 10, the generated images exhibit progressively improved detail and fidelity. Notably, even single-step generation already produces reasonably good results.]

ImageNet. To evaluate robustness and scalability on large-scale data, we conduct experiments on the ImageNet dataset at image resolution 256 × 256 [23]. All experiments are performed on class-conditional ImageNet generation at this resolution. Following common practice, we evaluate the Fréchet Inception Distance (FID) [15] on 50K randomly generated images. Following prior works [9, 14, 36], we implement all models in the latent space of a pre-trained VAE tokenizer [39]. For 256 × 256 images, the tokenizer maps images into a latent representation of size 32 × 32 × 4, which serves as the input to the generative model. All models are trained from scratch under identical data and optimization settings.

As the backbone architecture, we adopt that of MeanFlow [10], a transformer-based model that has demonstrated strong performance in high-resolution image generation. For fair comparison, we strictly follow the original MeanFlow [10] training recipe and optimization settings, modifying only the learning objective.

| Method | # Params. | NFE | FID |
|---|---|---|---|
| iCT-XL/2 [43] [ICLR'24] | 675M | 1 | 34.24 |
| Shortcut-XL/2 [9] [ICLR'25] | 675M | 1 | 10.60 |
| MeanFlow-XL/2 [10] [NeurIPS'25] | 676M | 1 | 3.43 |
| S-VFM-XL/2 [36] [CVPR'26] | 677M | 1 | 3.31 |
| TFM-XL/2 [Ours] | 676M | 1 | 3.02 |
| iCT-XL/2 [43] [ICLR'24] | 675M | 2 | 20.30 |
| iMM-XL/2 [53] [ICML'25] | 675M | 1 × 2 | 7.77 |
| MeanFlow-XL/2 [10] [NeurIPS'25] | 676M | 2 | 2.93 |
| S-VFM-XL/2 [36] [CVPR'26] | 677M | 2 | 2.86 |
| TFM-XL/2 [Ours] | 676M | 2 | 2.77 |

Table 2: Quantitative Comparison with Different Generation Methods on the ImageNet 256 × 256 Dataset. Our method achieves the best performance in few-step generation.

Specifically, we introduce TFM, which is parameterized by $X_\theta(X_t, t, r)$, by replacing the original velocity-based loss with the proposed Transition Flow Matching loss in Eq. (24). The model is trained to directly predict the transition flow $X(x_t, t, r)$ rather than the local velocity field. Both time variables $t$ and $r$ are embedded and injected into the transformer blocks via adaptive normalization layers. During training, we sample $(X_0, X_1)$ pairs from the data distribution and construct linear interpolants following the standard Flow Matching setting.
Given a randomly sampled time pair $(t, r)$ with $0 \le t \le r \le 1$, the network is trained to regress the transition flow using the tractable Transition Flow Matching objective described in Eq. (24), which enforces consistency with the Transition Flow Identity. At inference time, sample generation is performed by repeatedly applying the learned transition flow across a predefined time grid, starting from Gaussian noise. This formulation allows the model to directly predict future states along the transition trajectory, enabling flexible and efficient generation.

Figure 2 illustrates generation results under different numbers of function evaluations (NFE), using distinct initial noise. Within each row, images at the same spatial location across different panels are generated from the same initial noise realization. Each panel corresponds to a different NFE setting, with NFE values of [1, 2, 5, 10]. As the number of function evaluations increases, the generated images exhibit improved visual fidelity and structural coherence. Notably, even one-step generation (NFE = 1) produces competitive samples, supporting the claim that the transition flow model $X_\theta(X_t, t, r)$ learns to predict the future state at $r$ from the current state $X_t$ at time $t$, thereby enabling effective few-step sampling.

Following standard evaluation protocols, we randomly generate 50K images from each model and report the corresponding FID scores in Table 2. TFM-XL consistently outperforms both Consistency Models and Mean Velocity Models under identical training settings. These results demonstrate that explicitly modeling global transition dynamics through a transition flow, which can predict arbitrary future states, rather than learning an inherently local velocity field, provides greater flexibility in the number of simulation steps for flow- and diffusion-based generation methods.

We further analyze the training dynamics by comparing TFM with SiT and MeanFlow across different training iterations, as shown in Tab. 3e. For SiT, we follow the default inference setting with NFE = 250, while for both TFM and MeanFlow we use NFE = 1 to reflect their one-step generation capability. The results show that TFM achieves clear performance improvements once sufficiently trained, and the training curves indicate that performance consistently improves as the number of training epochs increases. This behavior confirms the effectiveness and scalability of the proposed transition flow formulation in both training and inference.

6.3 Ablation Study

In our ablation study, we use the ViT-B/4 architecture [6] ("Base" size with a patch size of 4) as developed in [37], trained for 80 epochs (400K iterations).

Conditioning on $(t, r)$. Following the model parameterization in Sec. 5, the transition flow $X_\theta(x_t, t, r)$ requires explicit conditioning on the temporal variables. Similar to prior designs, we encode time information through positional embeddings and study different conditioning strategies that share the same functional form but differ in the specific choice of variables. Concretely, instead of directly conditioning on $(t, r)$, the network can equivalently condition on $(t, \Delta t)$ with $\Delta t = r - t$, leading to the parameterization $X_\theta(\cdot, t, r) \triangleq \mathrm{net}(\cdot, t, r - t)$. We compare these variants in Tab. 3a. The results indicate that all studied conditioning forms lead to stable and effective one-step generation, demonstrating that our transition flow formulation is robust to the exact choice of temporal embedding. Conditioning on $(t, \Delta t)$ yields the strongest performance overall, while directly using $(t, r)$ performs comparably. Notably, even conditioning solely on the interval $\Delta t$ produces competitive results, suggesting that relative temporal information plays a dominant role in our method.
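A minimal sketch of this parameterization is shown below, assuming a sinusoidal time embedding and a backbone `net` that consumes the state together with the two time embeddings; the interface of `net` and the embedding dimension are illustrative assumptions, not the exact architecture used here.

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Sinusoidal positional embedding of a scalar time in [0, 1]; output shape [B, dim]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class TransitionFlowWrapper(nn.Module):
    """Expose X_theta(x, t, r) while the backbone is conditioned on (t, dt) with dt = r - t.

    The JVP in Eq. (18)/(25) is still taken with respect to the wrapped (x, t, r) signature.
    """
    def __init__(self, net: nn.Module):
        super().__init__()
        self.net = net  # assumed interface: net(x, t_emb, dt_emb) -> prediction like x

    def forward(self, x: torch.Tensor, t: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        t_emb = timestep_embedding(t)
        dt_emb = timestep_embedding(r - t)
        return self.net(x, t_emb, dt_emb)
```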
Sampling Time Steps $(t, r)$. The choice of the sampling distribution for time steps is known to have a significant impact on generation quality. In our framework, we sample $(t, r)$ using logit-normal distributions, consistent with the implementation described in Sec. 5. Specifically, we consider two logit-normal distributions: one for sampling the base time $t$ and another for sampling the relative offset that determines $r$. This design ensures $0 \le t \le r \le 1$ by construction while allowing flexible control over the density of sampled time pairs. We evaluate different hyperparameter settings of these logit-normal samplers in Tab. 3d. The results show that logit-normal sampling consistently outperforms alternative choices, aligning with observations reported in prior flow-matching-based methods.

Loss Metrics. As discussed in Sec. 5, our training objective adopts a Bregman divergence instantiated via an adaptively weighted squared $\ell_2$ loss. While the overall loss formulation remains the same across experiments, we vary the exponent $p$ that controls the adaptive weighting, which effectively changes the loss metric. The corresponding results are summarized in Tab. 3b. We find that $p = 1$ achieves the best overall performance, indicating a strong benefit from aggressive adaptive weighting. Setting $p = 0.5$, which resembles the Pseudo-Huber loss, also yields competitive results. In contrast, the standard squared $\ell_2$ loss ($p = 0$) underperforms relative to other choices, though it still leads to meaningful one-step generation. These trends are consistent with prior findings on the importance of loss reweighting for few-step or single-step generative models.

(a) Positional embedding. The network is conditioned on embeddings of the specified variables.

| pos. embed | FID, 1-NFE |
|---|---|
| (t, r) | 60.12 |
| (t, t − r) | 59.87 |
| (t, r, t − r) | 62.48 |
| t − r only | 62.16 |

(b) Loss metrics. p = 0 is the squared L2 loss; p = 0.5 is the Pseudo-Huber loss.

| p | FID, 1-NFE |
|---|---|
| 0.0 | 79.76 |
| 0.5 | 62.47 |
| 1.0 | 59.87 |
| 1.5 | 65.68 |
| 2.0 | 69.26 |

(c) CFG scale. Our method supports 1-NFE CFG sampling.

| ω | FID, 1-NFE |
|---|---|
| 1.0 (w/o cfg) | 61.06 |
| 1.5 | 32.51 |
| 2.0 | 19.05 |
| 3.0 | 14.76 |
| 5.0 | 20.12 |

(d) Time samplers. t and r are sampled from the specified sampler.

| t, r sampler | FID, 1-NFE |
|---|---|
| uniform(0, 1) | 65.73 |
| lognorm(−0.2, 1.0) | 63.56 |
| lognorm(−0.2, 1.2) | 62.27 |
| lognorm(−0.4, 1.0) | 59.87 |
| lognorm(−0.4, 1.2) | 59.94 |

(e) Comparison of FID-50K score over training iterations on the ImageNet 256 × 256 dataset (plot).

Table 3: Ablation study on 1-NFE ImageNet 256 × 256 generation. FID-50K is evaluated. Default configuration: B/4 backbone, 80-epoch training from scratch.

Guidance Scale. We further investigate the effect of classifier-free guidance (CFG) within our transition flow framework. The results, reported in Tab. 3c, show that increasing the guidance scale significantly improves generation quality, with the best result at $\omega = 3.0$ in our setting. This behavior is consistent with observations in multi-step diffusion and flow models. Importantly, our CFG formulation, introduced in Sec. 4.5, is fully compatible with one-step (1-NFE) sampling and does not introduce additional inference cost beyond a constant factor; a one-step CFG sampling sketch is given below.
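The sketch below illustrates one-step guided sampling via Eq. (27). The class-conditional model signature, the use of a null label index for the unconditional branch, and the function name are assumptions for illustration.

```python
import torch

@torch.no_grad()
def sample_one_step_cfg(tfm, class_ids: torch.Tensor, null_id: int,
                        omega: float, x_shape: tuple) -> torch.Tensor:
    """One-step classifier-free guided sampling (Eq. 27).

    tfm: class-conditional transition flow with assumed signature
         tfm(x, t, r, class_ids) -> predicted state at time r.
    null_id: label index standing in for the unconditional branch (assumption).
    """
    x0 = torch.randn(x_shape)                          # X_0 ~ N(0, I)
    t = torch.zeros(x_shape[0])                        # t = 0
    r = torch.ones(x_shape[0])                         # r = 1
    x_cond = tfm(x0, t, r, class_ids)                  # X(x_0, 0, 1 | c)
    null_ids = torch.full_like(class_ids, null_id)
    x_unc = tfm(x0, t, r, null_ids)                    # X(x_0, 0, 1 | ∅)
    return omega * x_cond + (1.0 - omega) * x_unc      # guided combination
```

The two branches can also be batched into a single forward pass, so guidance only adds a constant factor to the cost of one-step generation.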
7 Conclusion

In this work, we introduced Transition Flow Matching, a principled framework for few-step generative modeling. Instead of learning local velocity fields as in conventional Flow/Diffusion models, our method directly models the generation trajectory through transition dynamics, providing a global perspective on generative flows. We derive the Transition Flow Identity and develop a theoretically grounded objective that enables end-to-end training from scratch, while also establishing a unified view that connects our framework with Mean Velocity methods.

8 Proof of Transition Identity

8.1 Notation and Preliminary

Random variables and realizations. We work in $\mathbb{R}^d$. Uppercase letters (e.g., $X_t$, $Z$) denote random variables (RVs), and lowercase letters (e.g., $x_t$, $z$) denote their realizations (points/values). For a density (or probability law) of an RV $X_t$ we write $p_t(\cdot)$, and for a conditional density we write $p_{t|Z}(\cdot \mid z)$. Expectations are denoted by $\mathbb{E}[\cdot]$.

Source/target distributions and coupling. Let $X_0 \sim p_0$ be the source distribution (e.g., standard Gaussian noise) and $X_1 \sim p_1$ be the target distribution (e.g., images). A generative model constructs a continuous path of distributions $\{p_t\}_{t\in[0,1]}$ that transports $p_0$ to $p_1$. Let $(X_0, X_1)$ be any coupling on $\mathbb{R}^d$ with joint density $\pi$ whose marginals are $p_0$ and $p_1$ (not necessarily independent). In this work, we follow the standard Flow Matching setting, where the source and target distributions are independent: $\pi(x_0, x_1) = p_0(x_0)\, p_1(x_1)$.

Conditioning variable and conditional paths [26]. We use a conditioning RV $Z$ to index conditional paths. Conditioned on $Z = z$, we obtain a conditional coupling $(X_0^Z, X_1^Z)$ and a conditional path $\{X_t^Z\}_{t\in[0,1]}$ with conditional density $p_{t|Z}(\cdot \mid z)$. The marginal path is $\{X_t\}_{t\in[0,1]}$, with density $p_t(\cdot)$, satisfying

\[
p_t(x_t) = \int p_{t|Z}(x_t \mid z)\, p_Z(z)\, dz, \qquad X_t \sim p_t, \quad X_t \mid (Z = z) \sim p_{t|Z}(\cdot \mid z). \tag{28}
\]

Interpolant. We consider a general interpolant between the marginal endpoints $X_0$ and $X_1$, specified by scalar functions $\alpha : [0,1] \to \mathbb{R}$ and $\beta : [0,1] \to \mathbb{R}$:

\[
X_t = \alpha(t) X_0 + \beta(t) X_1, \qquad t \in [0,1]. \tag{29}
\]

We assume the boundary conditions $\alpha(0) = 1$, $\beta(0) = 0$ and $\alpha(1) = 0$, $\beta(1) = 1$, so that $X_{t=0} = X_0$ and $X_{t=1} = X_1$. If $\alpha, \beta$ are differentiable, then

\[
\frac{d}{dt} X_t = \dot{\alpha}(t) X_0 + \dot{\beta}(t) X_1. \tag{30}
\]

Conditioned on $Z = z$, the same schedules induce the conditional path $X_t^Z = \alpha(t) X_0^Z + \beta(t) X_1^Z$ and $\frac{d}{dt} X_t^Z = \dot{\alpha}(t) X_0^Z + \dot{\beta}(t) X_1^Z$.

Flow Matching vector fields [26]. Let $v(x_t, t \mid z) \in \mathbb{R}^d$ denote a conditional velocity field that transports the conditional density $p_{t|Z}(\cdot \mid z)$ along time. The corresponding marginal velocity field is defined by conditional expectation:

\[
v(x_t, t) = \int v(x_t, t \mid z)\, p_{Z|t}(z \mid x_t)\, dz = \mathbb{E}\big[ v(X_t, t \mid Z) \mid X_t = x_t \big]. \tag{31}
\]
A marginal trajectory, i.e., a Flow Matching generation trajectory, follows the ODE

\[
\frac{dx_t}{dt} = v(x_t, t), \qquad t \in [0,1], \tag{32}
\]

and similarly a conditional trajectory follows $\frac{dx_t^z}{dt} = v(x_t, t \mid z)$.

Continuity (transport) equations [26, 27]. The evolution of densities induced by these velocity fields is characterized by the continuity (transport) equation. For the marginal density $p_t$,

\[
\partial_t p_t(x_t) + \nabla \cdot \big( p_t(x_t)\, v(x_t, t) \big) = 0, \qquad t \in [0,1]. \tag{33}
\]

Conditioned on $Z = z$, the conditional density $p_{t|Z}(\cdot \mid z)$ satisfies

\[
\partial_t p_{t|Z}(x_t \mid z) + \nabla \cdot \big( p_{t|Z}(x_t \mid z)\, v(x_t, t \mid z) \big) = 0, \qquad t \in [0,1]. \tag{34}
\]

Flow Matching learns a parameterized velocity field (or equivalent dynamics) so that the induced marginal path $\{p_t\}$ solves Eq. (33) with boundary conditions $p_{t=0} = p_0$ and $p_{t=1} = p_1$.

Learning Flow Matching. Flow Matching introduces a velocity field model $v_\theta(X_t, t)$ to learn $v(X_t, t)$, ideally by minimizing the marginal Flow Matching loss:

\[
\mathcal{L}_{\mathrm{MFM}}(\theta) = \mathbb{E}_{t,\, X_t \sim p_t}\, D\big( v(X_t, t),\; v_\theta(X_t, t) \big). \tag{35}
\]

However, since the marginal velocity $v(X_t, t)$ in Eq. (31) is not tractable, the marginal loss in Eq. (35) cannot be computed as is. Instead, we minimize the conditional Flow Matching loss:

\[
\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\, Z,\, X_t \sim p_{t|Z}(\cdot \mid Z)}\, D\big( v(X_t, t \mid Z),\; v_\theta(X_t, t) \big). \tag{36}
\]

The two losses in Eq. (35) and Eq. (36) are equivalent for learning purposes, since their gradients coincide:

Theorem 3 (Gradient equivalence of Flow Matching [27]). The gradients of the marginal Flow Matching loss and the conditional Flow Matching loss coincide:

\[
\nabla_\theta \mathcal{L}_{\mathrm{MFM}}(\theta) = \nabla_\theta \mathcal{L}_{\mathrm{CFM}}(\theta). \tag{37}
\]

In particular, the minimizer of the conditional Flow Matching loss is the marginal velocity $v(x_t, t)$.

Remark 3 (Standard Flow Matching). Flow Matching sets the conditioning variable to be the endpoint pair $Z = (X_0, X_1)$, so that conditioning on $Z = z$ fixes the endpoint pair $(X_0^Z, X_1^Z) = (X_0, X_1)$. Choosing the linear schedules $\alpha(t) = 1 - t$ and $\beta(t) = t$ yields the conditional path

\[
X_t^Z = (1-t) X_0^Z + t X_1^Z = (1-t) X_0 + t X_1, \qquad 0 \le t \le 1,
\]

hence the time derivative of this conditional path is constant: $\frac{d}{dt} X_t^Z = X_1^Z - X_0^Z = X_1 - X_0$. Therefore, for $Z = (X_0, X_1)$, the conditional velocity used as supervision in the conditional Flow Matching loss is a constant: $v(X_t, t \mid Z) = X_1^Z - X_0^Z = X_1 - X_0$. With this setup, the Flow Matching objective in Eq. (36) can be written explicitly as

\[
\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\, X_0 \sim p_0,\, X_1 \sim p_1}\, D\big( X_1 - X_0,\; v_\theta(X_t, t) \big), \qquad X_t = (1-t) X_0 + t X_1,
\]

i.e., one samples $t \sim \mathrm{Unif}[0,1]$, draws independent endpoints $X_0 \sim p_0$ and $X_1 \sim p_1$, forms the interpolated state $X_t$, and regresses the model $v_\theta(X_t, t)$ to the constant target $X_1 - X_0$. By Theorem 3, minimizing this conditional loss yields the marginal velocity field $v(x_t, t)$ in Eq. (31).

8.2 Transition Flow Identity

To connect with previous works [10, 11], we define the average velocity $u(x_t, t, r)$ as

\[
(r - t)\, u(x_t, t, r) = x_{t \to r} - x_t = \int_t^r v(x_\tau, \tau)\, d\tau, \qquad 0 \le t \le r \le 1, \tag{38}
\]

where $v(x_\tau, \tau)$ is the marginal Flow Matching velocity at time $\tau \in [0,1]$ for state $x_\tau$, as shown in Eq. (31). In Eq. (38), $x_t$ is the current state at time $t$ and $x_{t \to r}$ is the transition state at time $r$, reached by evolving from $x_t$ at time $t$:
\[
x_{t \to r} = x_t + \int_t^r v(x_\tau, \tau)\, d\tau. \tag{39}
\]

Note here, akin to the marginal velocity of Flow Matching $v(x_t, t)$ in Eq. (31), $x_{t \to r}$ is the marginal transition state:

\[
x_{t \to r} = \int x^z_{t \to r}\, p_{Z|t}(z \mid x_t)\, dz = \mathbb{E}\big[ X^Z_{t \to r} \mid X_t = x_t \big], \tag{40}
\]

where $x^z_{t \to r}$ is the conditional transition state:

\[
x^z_{t \to r} = x_t + \int_t^r v(x_\tau, \tau \mid z)\, d\tau. \tag{41}
\]

The average velocity $u(x_t, t, r)$ is the displacement between the two time steps $t$ and $r$ divided by the time interval $r - t$:

\[
u(x_t, t, r) \triangleq \frac{1}{r - t} \int_t^r v(x_\tau, \tau)\, d\tau. \tag{42}
\]

Differentiating both sides of Eq. (38) with respect to $t$, treating $r$ as independent of $t$, we have

\[
\frac{d}{dt}\big[ (r - t)\, u(x_t, t, r) \big] = \frac{d}{dt} \int_t^r v(x_\tau, \tau)\, d\tau
\;\Longrightarrow\;
u(x_t, t, r) = v(x_t, t) + (r - t)\, \frac{d}{dt} u(x_t, t, r), \tag{43}
\]

where the left-hand side equals $-u + (r-t)\frac{du}{dt}$ and the right-hand side equals $-v(x_t, t)$ by the fundamental theorem of calculus.

Our learning target is the Transition Flow $X(x_t, t, r)$ that transports the state $x_t$ at time $t$ to $x_{t \to r}$ at time $r$, i.e., $X(x_t, t, r) = x_{t \to r}$, which yields

\[
X(x_t, t, r) = \int X(x_t, t, r \mid z)\, p_{Z|t}(z \mid x_t)\, dz = \mathbb{E}\big[ X(X_t, t, r \mid Z) \mid X_t = x_t \big], \tag{44}
\]

where $X(x_t, t, r \mid z) = x^z_{t \to r}$ is the conditional transition state. Note that Eq. (44) is identical to Eq. (40). From Eq. (38), we build the association between the average velocity and the Transition Flow (transition state):

\[
u(x_t, t, r) = \frac{x_{t \to r} - x_t}{r - t} = \frac{X(x_t, t, r) - x_t}{r - t}. \tag{45}
\]

Substituting Eq. (45) into Eq. (43), we obtain

\[
\frac{X(x_t, t, r) - x_t}{r - t} = v(x_t, t) + (r - t)\, \frac{d}{dt}\left[ \frac{X(x_t, t, r) - x_t}{r - t} \right], \tag{46}
\]

where $r$ is independent of $t$, and the time derivative of $x_t$ is the marginal velocity $v(x_t, t)$, given by Eq. (32). Define

\[
A(t) := \frac{X(x_t, t, r) - x_t}{r - t}.
\]

Using the quotient rule and the fact that $r$ is constant, we obtain

\[
\frac{dA}{dt} = \frac{(r - t)\, \frac{d}{dt}\big( X(x_t, t, r) - x_t \big) + \big( X(x_t, t, r) - x_t \big)}{(r - t)^2}.
\]

Since $\frac{d}{dt}\big( X(x_t, t, r) - x_t \big) = \frac{d X(x_t, t, r)}{dt} - v(x_t, t)$, we have

\[
\frac{dA}{dt} = \frac{(r - t)\big( \frac{d X(x_t, t, r)}{dt} - v(x_t, t) \big) + \big( X(x_t, t, r) - x_t \big)}{(r - t)^2}.
\]

The right-hand side of Eq. (46) becomes

\[
v(x_t, t) + (r - t)\, \frac{dA}{dt}
= v(x_t, t) + \frac{(r - t)\big( \frac{d X(x_t, t, r)}{dt} - v(x_t, t) \big) + \big( X(x_t, t, r) - x_t \big)}{r - t}
= \frac{d X(x_t, t, r)}{dt} + \frac{X(x_t, t, r) - x_t}{r - t}. \tag{47}
\]

Comparing both sides of Eq. (46) yields

\[
\frac{X(x_t, t, r) - x_t}{r - t} = \frac{d X(x_t, t, r)}{dt} + \frac{X(x_t, t, r) - x_t}{r - t}
\;\Longrightarrow\;
X(x_t, t, r) = X(x_t, t, r) + (r - t)\, \frac{d X(x_t, t, r)}{dt}. \tag{48}
\]

Given $X(x_t, t, r) = x_{t \to r}$, the Transition Flow Identity follows:

\[
X(x_t, t, r) = x_{t \to r} + (r - t)\, \frac{d}{dt} X(x_t, t, r). \tag{49}
\]

8.3 Calculate Time Derivative of Transition Flow

To compute the $\frac{d}{dt} X(x_t, t, r)$ term, we expand it in terms of partial derivatives:

\[
\frac{d}{dt} X(x_t, t, r)
= \partial_{x_t} X \cdot \frac{dx_t}{dt} + \partial_t X \cdot \frac{dt}{dt} + \partial_r X \cdot \frac{dr}{dt}
= \partial_{x_t} X \cdot v(x_t, t) + \partial_t X \cdot 1 + \partial_r X \cdot 0
= \partial_{x_t} X \cdot v(x_t, t) + \partial_t X, \tag{50}
\]

where $\frac{dx_t}{dt} = v(x_t, t)$ as shown in Eq. (32). The time derivative in Eq. (50) is given by the Jacobian-vector product (JVP) between the Jacobians of the function, $[\partial_{x_t} X, \partial_t X, \partial_r X]$, and the corresponding tangent vector $[v, 1, 0]$. For code implementation, modern libraries such as PyTorch provide efficient JVP interfaces.
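As a concrete illustration of this computation, the following is a minimal sketch using `torch.func.jvp`. The transition flow model `X_theta` and its `(x_t, t, r)` signature are assumptions for the example.

```python
import torch
from torch.func import jvp

def transition_flow_time_derivative(X_theta, x_t: torch.Tensor,
                                    t: torch.Tensor, r: torch.Tensor,
                                    v: torch.Tensor):
    """Return (X_theta(x_t, t, r), d/dt X_theta(x_t, t, r)) following Eq. (50).

    The tangent vector is (v, 1, 0): dx_t/dt = v, dt/dt = 1, dr/dt = 0,
    so the JVP equals the total time derivative along the trajectory.
    """
    tangents = (v, torch.ones_like(t), torch.zeros_like(r))
    X, dX_dt = jvp(X_theta, (x_t, t, r), tangents)
    return X, dX_dt
```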
8.4 Towards a Tractable Transition Flow Matching Objective

Up to this point, the formulations are independent of any network parameterization. We now introduce the Transition Flow model $X_\theta(x_t, t, r)$, parameterized by $\theta$, to learn $X(x_t, t, r)$. Formally, we encourage $X_\theta(x_t, t, r)$ to satisfy the Transition Flow Identity in Eq. (49). To this end, ideally, we can minimize the marginal Transition Flow Matching objective (M-TFM):

\[
\mathcal{L}_{\mathrm{M\text{-}TFM}}(\theta) = \mathbb{E}_{t,\, r,\, X_t \sim p_t}\, D\big( \mathrm{sg}\big[ X^m_{\mathrm{tgt}}(X_t, t, r) \big],\; X_\theta(X_t, t, r) \big), \tag{51}
\]

where

\[
X^m_{\mathrm{tgt}}(X_t, t, r) = X_{t \to r} + (r - t)\, \frac{d}{dt} X_\theta(X_t, t, r). \tag{52}
\]

Here $D(\cdot,\cdot)$ represents a Bregman divergence (e.g., MSE) that regresses our learnable Transition Flow $X_\theta(x_t, t, r)$ onto the target $X^m_{\mathrm{tgt}}(x_t, t, r)$; moreover, $\mathrm{sg}[\cdot]$ denotes the stop-gradient (sg) operation, indicating that $X^m_{\mathrm{tgt}}(x_t, t, r)$ serves as the ground-truth target in this loss and is excluded from optimization. However, the marginal state $x_{t \to r}$ in Eq. (52) and the marginal velocity $v(x_t, t)$ used in the calculation of $\frac{d}{dt} X_\theta(x_t, t, r)$ in Eq. (50) are not tractable, so the marginal loss in Eq. (51) cannot be computed as is.

Instead, we minimize the conditional Transition Flow Matching loss (C-TFM), which is tractable:

\[
\mathcal{L}_{\mathrm{C\text{-}TFM}}(\theta) = \mathbb{E}_{t,\, r,\, Z,\, X_t \sim p_{t|Z}(\cdot \mid Z)}\, D\big( \mathrm{sg}\big[ X^c_{\mathrm{tgt}}(X_t, t, r \mid Z) \big],\; X_\theta(X_t, t, r) \big), \tag{53}
\]

where

\[
X^c_{\mathrm{tgt}}(X_t, t, r \mid Z) = X^Z_{t \to r} + (r - t)\, \frac{d}{dt} X_\theta(X_t, t, r). \tag{54}
\]

The conditional state $x^z_{t \to r}$ in Eq. (54) and the conditional velocity $v(x_t, t \mid z)$ used in the calculation of $\frac{d}{dt} X_\theta(x_t, t, r)$ in Eq. (50) are tractable (see Eq. (57) for details), giving us a computable loss function in Eq. (53). The two losses in Eq. (51) and Eq. (53) are equivalent for learning purposes, since their gradients coincide:

Theorem 4 (Gradient equivalence of Transition Flow Matching). The gradients of the marginal Transition Flow Matching loss and the conditional Transition Flow Matching loss coincide:

\[
\nabla_\theta \mathcal{L}_{\mathrm{M\text{-}TFM}}(\theta) = \nabla_\theta \mathcal{L}_{\mathrm{C\text{-}TFM}}(\theta). \tag{55}
\]

In particular, the minimizer of the conditional Transition Flow Matching loss is the marginal target $X^m_{\mathrm{tgt}}(x_t, t, r)$, which satisfies the Transition Flow Identity in Eq. (49).

Proof (Proof of Theorem 4). We show Eq. (55) by a direct computation:

\[
\begin{aligned}
\nabla_\theta \mathcal{L}_{\mathrm{M\text{-}TFM}}(\theta)
&= \nabla_\theta\, \mathbb{E}_{t, r, X_t \sim p_t}\, D\big( \mathrm{sg}\big[ X^m_{\mathrm{tgt}}(X_t, t, r) \big],\; X_\theta(X_t, t, r) \big) \\
&\overset{(a)}{=} \mathbb{E}_{t, r, X_t \sim p_t}\, \nabla_\theta D\big( \mathrm{sg}\big[ X^m_{\mathrm{tgt}}(X_t, t, r) \big],\; X_\theta(X_t, t, r) \big) \\
&\overset{(b)}{=} \mathbb{E}_{t, r, X_t \sim p_t} \Big[ \nabla_2 D\big( X^m_{\mathrm{tgt}}(X_t, t, r),\; X_\theta(X_t, t, r) \big)\, \nabla_\theta X_\theta(X_t, t, r) \Big] \\
&\overset{(c)}{=} \mathbb{E}_{t, r, X_t \sim p_t} \Big[ \nabla_2 D\big( \mathbb{E}_{Z \sim p_{Z|t}(\cdot \mid X_t)}\big[ X^c_{\mathrm{tgt}}(X_t, t, r \mid Z) \big],\; X_\theta(X_t, t, r) \big)\, \nabla_\theta X_\theta(X_t, t, r) \Big] \\
&\overset{(d)}{=} \mathbb{E}_{t, r, X_t \sim p_t}\, \mathbb{E}_{Z \sim p_{Z|t}(\cdot \mid X_t)} \Big[ \nabla_2 D\big( X^c_{\mathrm{tgt}}(X_t, t, r \mid Z),\; X_\theta(X_t, t, r) \big)\, \nabla_\theta X_\theta(X_t, t, r) \Big] \\
&\overset{(e)}{=} \mathbb{E}_{t, r, X_t \sim p_t}\, \mathbb{E}_{Z \sim p_{Z|t}(\cdot \mid X_t)} \Big[ \nabla_\theta D\big( \mathrm{sg}\big[ X^c_{\mathrm{tgt}}(X_t, t, r \mid Z) \big],\; X_\theta(X_t, t, r) \big) \Big] \\
&\overset{(f)}{=} \nabla_\theta\, \mathbb{E}_{t, r, Z, X_t \sim p_{t|Z}(\cdot \mid Z)}\, D\big( \mathrm{sg}\big[ X^c_{\mathrm{tgt}}(X_t, t, r \mid Z) \big],\; X_\theta(X_t, t, r) \big)
= \nabla_\theta \mathcal{L}_{\mathrm{C\text{-}TFM}}(\theta).
\end{aligned} \tag{56}
\]

Explanations of labeled steps.

(a) Interchange $\nabla_\theta$ and $\mathbb{E}$ by the Leibniz rule.

(b) The stop-gradient makes the first argument $\mathrm{sg}[X^m_{\mathrm{tgt}}(X_t, t, r)]$ $\theta$-independent for differentiation, hence the gradient flows only through the second argument. Applying the chain rule yields $\nabla_\theta D(\mathrm{sg}[\cdot], X_\theta) = \nabla_2 D(\cdot, X_\theta)\, \nabla_\theta X_\theta$.
(c) Use the definitions in Eq. (52) and Eq. (54) together with the marginal-conditional relations for transition states, Eq. (44), and velocities, Eq. (31):

\[
\begin{aligned}
X^m_{\mathrm{tgt}}(X_t, t, r)
&= X_{t \to r} + (r - t)\, \frac{d}{dt} X_\theta(X_t, t, r) \\
&= X(X_t, t, r) + (r - t)\, \frac{d}{dt} X_\theta(X_t, t, r) \\
&= X(X_t, t, r) + (r - t)\big( \partial_{X_t} X_\theta(X_t, t, r) \cdot v(X_t, t) + \partial_t X_\theta(X_t, t, r) \big) \\
&= \mathbb{E}_{Z \sim p_{Z|t}(\cdot \mid X_t)}\Big[ X(X_t, t, r \mid Z) + (r - t)\big( \partial_{X_t} X_\theta(X_t, t, r) \cdot v(X_t, t \mid Z) + \partial_t X_\theta(X_t, t, r) \big) \Big] \\
&= \mathbb{E}_{Z \sim p_{Z|t}(\cdot \mid X_t)}\Big[ X(X_t, t, r \mid Z) + (r - t)\, \frac{d}{dt} X_\theta(X_t, t, r) \Big] \\
&= \mathbb{E}_{Z \sim p_{Z|t}(\cdot \mid X_t)}\Big[ X^Z_{t \to r} + (r - t)\, \frac{d}{dt} X_\theta(X_t, t, r) \Big]
= \mathbb{E}_{Z \sim p_{Z|t}(\cdot \mid X_t)}\big[ X^c_{\mathrm{tgt}}(X_t, t, r \mid Z) \big],
\end{aligned} \tag{57}
\]

where $\partial_{X_t} X_\theta(X_t, t, r)$ and $\partial_t X_\theta(X_t, t, r)$ are $Z$-independent given $(X_t, t, r)$.

(d) Since $D(\cdot,\cdot)$ is a Bregman divergence, its gradient with respect to the second argument, $\nabla_2 D(a, b)$, is affine in the first argument $a$ for fixed $b$. Conditioning on $X_t$, this implies that the (conditional) expectation can be moved inside: $\nabla_2 D\big( \mathbb{E}[A \mid X_t], b \big) = \mathbb{E}\big[ \nabla_2 D(A, b) \mid X_t \big]$. Apply this with $A = X^c_{\mathrm{tgt}}(X_t, t, r \mid Z)$ and $b = X_\theta(X_t, t, r)$.

(e) Reverse the chain rule as in (b): because $\mathrm{sg}[X^c_{\mathrm{tgt}}]$ freezes the first argument, we have $\nabla_2 D(\cdot, X_\theta)\, \nabla_\theta X_\theta = \nabla_\theta D(\mathrm{sg}[\cdot], X_\theta)$.

(f) Use Bayes' rule to swap the sampling orders, $\mathbb{E}_{X_t \sim p_t}\, \mathbb{E}_{Z \sim p_{Z|t}(\cdot \mid X_t)}[\cdot] = \mathbb{E}_{Z}\, \mathbb{E}_{X_t \sim p_{t|Z}(\cdot \mid Z)}[\cdot]$, and interchange $\nabla_\theta$ with $\mathbb{E}$ (as in (a)) to recognize $\nabla_\theta \mathcal{L}_{\mathrm{C\text{-}TFM}}(\theta)$ in Eq. (53).

Therefore, $\nabla_\theta \mathcal{L}_{\mathrm{M\text{-}TFM}}(\theta) = \nabla_\theta \mathcal{L}_{\mathrm{C\text{-}TFM}}(\theta)$, which proves Eq. (55). ⊓⊔

Remark 4 (Minimizer). Since the two objectives Eq. (51) and Eq. (53) have identical gradients, they share the same stationary points. Moreover, because

\[
X^m_{\mathrm{tgt}}(X_t, t, r) = \mathbb{E}_{Z \sim p_{Z|t}(\cdot \mid X_t)}\big[ X^c_{\mathrm{tgt}}(X_t, t, r \mid Z) \big],
\]

the conditional objective Eq. (53) regresses $X_\theta(X_t, t, r)$ toward the marginal target $X^m_{\mathrm{tgt}}(X_t, t, r)$, and $X^m_{\mathrm{tgt}}$ is defined to satisfy the Transition Flow Identity in Eq. (49).

8.5 Training the Transition Flow Matching Model

Building on the previous development, we now have all the ingredients required to train a Transition Flow Matching model. We inherit the setting in Remark 3 by taking $Z = (X_0, X_1)$ and using the linear interpolant $X_t = (1-t) X_0 + t X_1$, which yields a concrete, standard form of Transition Flow Matching.

Remark 5 (Standard Transition Flow Matching). We adopt the standard Flow Matching setting in Remark 3 by setting $Z = (X_0, X_1)$, and use the linear interpolant $X_t = (1-t) X_0 + t X_1$, $0 \le t \le 1$. For any $0 \le t \le r \le 1$, the conditional transition state is

\[
X^Z_{t \to r} = (1 - r) X_0 + r X_1,
\]

and the conditional velocity is constant: $v(X_t, t \mid Z) = X_1 - X_0$. The tractable Transition Flow Matching objective in Eq. (53) can be written explicitly as

\[
\mathcal{L}_{\mathrm{TFM}}(\theta) = \mathbb{E}_{t,\, r,\, Z,\, X_t \sim p_{t|Z}(\cdot \mid Z)}\, D\Big( \mathrm{sg}\big[ X^Z_{t \to r} + (r - t)\, \tfrac{d}{dt} X_\theta(X_t, t, r) \big],\; X_\theta(X_t, t, r) \Big), \tag{58}
\]

where

\[
\frac{d}{dt} X_\theta(X_t, t, r) = \partial_{x_t} X_\theta(X_t, t, r) \cdot v(X_t, t \mid Z) + \partial_t X_\theta(X_t, t, r). \tag{59}
\]

During inference, we recursively apply the model to transform the generation trajectory from the source distribution to the target distribution:

\[
\hat{x}_r = X_\theta(x_t, t, r). \tag{60}
\]

For clarity, we summarize the conceptual training and inference procedures in Algorithm 2 and Algorithm 3. In particular, by extending Algorithm 3, the one-step generation procedure is described in Algorithm 4.

Algorithm 2 Transition Flow Matching (TFM): Training
    // Note: in PyTorch and JAX, jvp returns (function_output, JVP).
    // tfm(x_t, t, r): model X_θ(x_t, t, r) predicting the transition state at time r
    // metric(.): e.g., MSE, or any Bregman divergence on residuals
    x_1 = sample_from_training_batch()   // sample from target distribution p_1
    x_0 = randn_like(x_1)                // sample from source distribution p_0 (Gaussian), independent of p_1
    (t, r) = sample_t_r()                // 0 <= t <= r <= 1
    x_t = (1 - t) * x_0 + t * x_1        // current state
    v = x_1 - x_0                        // conditional velocity with Z = (X_0, X_1)
    x_t_to_r = (1 - r) * x_0 + r * x_1   // conditional transition state with Z = (X_0, X_1)
    (X, dX_dt) = jvp(tfm, (x_t, t, r), (v, 1, 0))   // d/dt X_θ = ∂_x X_θ · v + ∂_t X_θ
    X_tgt = x_t_to_r + (r - t) * dX_dt
    error = X - stopgrad(X_tgt)
    loss = metric(error)
    return loss

Algorithm 3 TFM: Multi-step Sampling with Arbitrary Step Sizes
    // Choose any time grid: 0 = t_0 < t_1 < ... < t_K = 1
    x = randn(x_shape)          // x = x_{t_0} = x_0
    for k = 0 to K - 1:
        t = t_k, r = t_{k+1}
        x = tfm(x, t, r)        // x = x_{t -> r}
    return x                    // x = x_{t_K} = x_1

Algorithm 4 TFM: 1-step Sampling
    x = randn(x_shape)          // x = x_0
    x = tfm(x, 0, 1)            // x = x_1
    return x
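The following is a minimal runnable sketch of the training step in Algorithm 2 and the multi-step sampler in Algorithm 3 in PyTorch. The model signature `model(x, t, r)`, the plain MSE metric, and the batching are illustrative assumptions rather than the exact implementation used here.

```python
import torch
from torch.func import jvp

def tfm_training_loss(model, x1: torch.Tensor, t: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Conditional TFM loss (Eq. 58) with Z = (X_0, X_1) and the linear interpolant."""
    x0 = torch.randn_like(x1)                     # source sample, independent of x1
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))      # broadcast time over data dims
    r_ = r.view(-1, *([1] * (x1.dim() - 1)))
    x_t = (1 - t_) * x0 + t_ * x1                 # current state
    v = x1 - x0                                   # conditional velocity
    x_t_to_r = (1 - r_) * x0 + r_ * x1            # conditional transition state
    tangents = (v, torch.ones_like(t), torch.zeros_like(r))
    X, dX_dt = jvp(lambda x, a, b: model(x, a, b), (x_t, t, r), tangents)
    X_tgt = (x_t_to_r + (r_ - t_) * dX_dt).detach()   # stop-gradient target
    return ((X - X_tgt) ** 2).mean()                  # MSE as the Bregman divergence

@torch.no_grad()
def tfm_sample(model, x_shape, time_grid):
    """Multi-step sampling over an arbitrary grid 0 = t_0 < ... < t_K = 1 (Algorithm 3)."""
    x = torch.randn(x_shape)
    for t_k, t_next in zip(time_grid[:-1], time_grid[1:]):
        t = torch.full((x_shape[0],), float(t_k))
        r = torch.full((x_shape[0],), float(t_next))
        x = model(x, t, r)
    return x
```

One-step generation (Algorithm 4) corresponds to calling `tfm_sample` with the two-point grid `[0.0, 1.0]`.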
References

1. Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E.: Stochastic interpolants: A unifying framework for flows and diffusions (2023), https://arxiv.org/abs/2303.08797
2. Albergo, M.S., Vanden-Eijnden, E.: Building normalizing flows with stochastic interpolants. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=li7qeBbCR1t
3. Boffi, N.M., Albergo, M.S., Vanden-Eijnden, E.: Flow map matching. arXiv preprint arXiv:2406.07507 (2024)
4. Cai, S., Chan, E.R., Zhang, Y., Guibas, L., Wu, J., Wetzstein, G.: Diffusion self-distillation for zero-shot customized image generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18434-18443 (2025)
5. Dormand, J.R., Prince, P.J.: A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics 6(1), 19-26 (1980)
6. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale (2021)
7. Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first International Conference on Machine Learning (2024), https://openreview.net/forum?id=FPnUhsQJ5B
8. Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis (2024)
9. Frans, K., Hafner, D., Levine, S., Abbeel, P.: One step diffusion via shortcut models. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=OlzB6LnXcS
10. Geng, Z., Deng, M., Bai, X., Kolter, J.Z., He, K.: Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447 (2025)
11. Geng, Z., Lu, Y., Wu, Z., Shechtman, E., Kolter, J.Z., He, K.: Improved mean flows: On the challenges of fastforward generative models. arXiv preprint arXiv:2512.02012 (2025)
12. Geng, Z., Pokle, A., Kolter, J.Z.: One-step diffusion distillation via deep equilibrium models. In: Thirty-seventh Conference on Neural Information Processing Systems (2023), https://openreview.net/forum?id=b6XvK2de99
13. Geng, Z., Pokle, A., Luo, W., Lin, J., Kolter, J.Z.: Consistency models made easy. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=xQVxo9dSID
14. Guo, P., Schwing, A.: Variational rectified flow matching. In: Forty-second International Conference on Machine Learning (2025), https://openreview.net/forum?id=Rk18ZikrFI
15. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017)
16. Ho, J., Salimans, T.: Classifier-free diffusion guidance (2022), abs/2207.12598
17. Hu, Z., Lai, C.H., Mitsufuji, Y., Ermon, S.: CMT: Mid-training for efficient learning of consistency, mean flow, and flow map models. arXiv preprint (2025)
18. Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems 35, 26565-26577 (2022)
19. Kim, D., Lai, C.H., Liao, W.H., Murata, N., Takida, Y., Uesaka, T., He, Y., Mitsufuji, Y., Ermon, S.: Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=ymjI8feDTD
20. Kim, D., Lai, C.H., Liao, W.H., Murata, N., Takida, Y., Uesaka, T., He, Y., Mitsufuji, Y., Ermon, S.: Consistency trajectory models: Learning probability flow ODE trajectory of diffusion (2024)
21. Klein, L., Krämer, A., Noé, F.: Equivariant flow matching. Advances in Neural Information Processing Systems 36, 59886-59910 (2023)
22. Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
23. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (2012)
24. Lai, C.H., Song, Y., Kim, D., Mitsufuji, Y., Ermon, S.: The principles of diffusion models (2025)
25. Lee, S., Lin, Z., Fanti, G.: Improving the training of rectified flows. Advances in Neural Information Processing Systems 37, 63082-63109 (2024)
26. Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=PqvMRDCJT9t
27. Lipman, Y., Havasi, M., Holderrieth, P., Shaul, N., Le, M., Karrer, B., Chen, R.T.Q., Lopez-Paz, D., Ben-Hamu, H., Gat, I.: Flow matching guide and code (2024)
28. Lipman, Y., Havasi, M., Holderrieth, P., Shaul, N., Le, M., Karrer, B., Chen, R.T., Lopez-Paz, D., Ben-Hamu, H., Gat, I.: Flow matching guide and code. arXiv preprint arXiv:2412.06264 (2024)
29. Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=XVjTT1nw5z
30. Lu, C., Song, Y.: Simplifying, stabilizing and scaling continuous-time consistency models. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=LyJi5ugyJx
31. Lu, Y., Lu, S., Sun, Q., Zhao, H., Jiang, Z., Wang, X., Li, T., Geng, Z., He, K.: One-step latent-free image generation with pixel mean flows. arXiv preprint arXiv:2601.22158 (2026)
32. Luo, T., Yuan, H., Liu, Z.: Soflow: Solution flow models for one-step generative modeling. arXiv preprint arXiv:2512.15657 (2025)
33. Ma, C., Xiao, X., Wang, T., Shen, Y.: Beyond editing pairs: Fine-grained instructional image editing via multi-scale learnable regions. arXiv preprint arXiv:2505.19352 (2025)
34. Ma, C., Xiao, X., Wang, T., Wang, X., Shen, Y.: CAD-VAE: Leveraging correlation-aware latents for comprehensive fair disentanglement. In: The Fortieth AAAI Conference on Artificial Intelligence (2025)
35. Ma, C., Xiao, X., Wang, T., Wang, X., Shen, Y.: Stochastic interpolants via conditional dependent coupling. arXiv preprint arXiv:2509.23122 (2025)
36. Ma, C., Xiao, X., Wang, T., Wang, X., Shen, Y.: Learning straight flows: Variational flow matching for efficient generation (2026), https://arxiv.org/abs/2511.17583
37. Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In: European Conference on Computer Vision. pp. 23–40. Springer (2024)
38. Nie, W., Berner, J., Ma, N., Liu, C., Xie, S., Vahdat, A.: Transition matching distillation for fast video generation. arXiv preprint arXiv:2601.09881 (2026)
39. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2022), https://arxiv.org/abs/2112.10752
40. Sabour, A., Fidler, S., Kreis, K.: Align your flow: Scaling continuous-time flow map distillation. arXiv preprint arXiv:2506.14603 (2025)
41. Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=TIdIXIpzhoI
42. Silvestri, G., Ambrogioni, L., Lai, C.H., Takida, Y., Mitsufuji, Y.: VCT: Training consistency models with variational noise coupling. In: Forty-second International Conference on Machine Learning (2025), https://openreview.net/forum?id=CMoX0BEsDs
43. Song, Y., Dhariwal, P.: Improved techniques for training consistency models. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=WNzy9bRDvG
44. Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models. In: Proceedings of the 40th International Conference on Machine Learning. pp. 32211–32252 (2023)
45. Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations (2021), https://openreview.net/forum?id=PxTIG12RRHS
46. Tong, A., Fatras, K., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Wolf, G., Bengio, Y.: Improving and generalizing flow-based generative models with minibatch optimal transport. arXiv preprint arXiv:2302.00482 (2023)
47. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2023)
48. Wang, F.Y., Yang, L., Huang, Z., Wang, M., Li, H.: Rectified diffusion: Straightness is not your need in rectified flow. arXiv preprint arXiv:2410.07303 (2024)
49. Wang, Z., Zhang, Y., Yue, X., Yue, X., Li, Y., Ouyang, W., Bai, L.: Transition models: Rethinking the generative learning objective. arXiv preprint (2025)
50. Zhang, H., Siarohin, A., Menapace, W., Vasilkovsky, M., Tulyakov, S., Qu, Q., Skorokhodov, I.: Alphaflow: Understanding and improving meanflow models. arXiv preprint arXiv:2510.20771 (2025)
51. Zhang, Y., Yan, Y., Schwing, A., Zhao, Z.: Hierarchical rectified flow matching with mini-batch couplings. arXiv preprint arXiv:2507.13350 (2025)
52. Zhang, Y., Yan, Y., Schwing, A., Zhao, Z.: Towards hierarchical rectified flow. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=6F6qwdycgJ
53. Zhou, L., Ermon, S., Song, J.: Inductive moment matching. In: Forty-second International Conference on Machine Learning (2025), https://openreview.net/forum?id=pwNSUo7yUb