Graph Energy Matching: Transport-Aligned Energy-Based Modeling for Graph Generation

Michal Balcerak¹, Suprosana Shit¹, Chinmay Prabhakar¹, Sebastian Kaltenbach², Michael S. Albergo²,³, Yilun Du²,³,*, Bjoern Menze¹,*

¹University of Zurich; ²Harvard University; ³Kempner Institute.

Abstract

Energy-based models for discrete domains, such as graphs, explicitly capture relative likelihoods, naturally enabling composable probabilistic inference tasks like conditional generation or enforcing constraints at test time. However, discrete energy-based models typically struggle with efficient and high-quality sampling, as off-support regions often contain spurious local minima, trapping samplers and causing training instabilities. This has historically resulted in a fidelity gap relative to discrete diffusion models. We introduce Graph Energy Matching (GEM), a generative framework for graphs that closes this fidelity gap. Motivated by the transport map optimization perspective of the Jordan–Kinderlehrer–Otto (JKO) scheme, GEM learns a permutation-invariant potential energy that simultaneously provides transport-aligned guidance from noise toward data and refines samples within regions of high data likelihood. Further, we introduce a sampling protocol that leverages an energy-based switch to seamlessly bridge (i) rapid, gradient-guided transport toward high-probability regions and (ii) a mixing regime for exploration of the learned graph distribution. On molecular graph benchmarks, GEM matches or exceeds strong discrete diffusion baselines. Beyond sample quality, explicit modeling of relative likelihood enables targeted exploration at inference time, facilitating compositional generation, property-constrained sampling, and geodesic interpolation between graphs.

1 Introduction

Generative modeling of discrete structured objects, such as graphs, is a central challenge across many scientific and
industrial applications, including drug discovery and materials design (Jin et al., 2018; Stokes et al., 2020; Xie et al., 2022). While domain-driven simulations explicitly encode physical knowledge (Noé et al., 2020), they quickly become intractable due to the combinatorial complexity of the underlying data distribution (Polishchuk et al., 2013), highlighting the need for generative models that efficiently capture implicit, domain-specific patterns.

Correspondence to: Michal Balcerak <michal.balcerak@uzh.ch>. *Contributed equally as senior authors. Preprint, March 25, 2026.

Current state-of-the-art models for unconditional graph generation predominantly rely on discrete diffusion-like processes, implemented either as discrete-time denoising steps (Vignac et al., 2023) or continuous-time Markov chains (CTMCs) (Campbell et al., 2024; Qin et al., 2025). These methods generate graphs by progressively denoising samples from a simple prior distribution toward the target data distribution. For instance, DeFoG (Qin et al., 2025) learns CTMC transition rates to remove noise from corrupted graph states, while Variational Flow Matching (VFM) (Eijkelboom et al., 2024; 2025) parameterizes stochastic trajectories that continuously transform samples from noise distributions into clean graphs. Although these approaches yield high-quality unconditional samples, they define distributions implicitly via noisy intermediate states, lacking an explicit model directly on clean graphs. This implicit representation complicates the enforcement of user-specified properties or constraints, as guidance must operate on intermediate, often off-manifold states, where predictors and conditioning signals become less reliable.

Energy-based models (EBMs) (Hinton, 2002; LeCun et al., 2006; Du & Mordatch, 2019) offer a complementary perspective that naturally addresses these limitations.
EBMs explicitly represent the probability structure of the data distribution through a scalar energy function, enabling direct incorporation of constraints, priors, or property-based objectives at inference time without retraining (Du, 2025). For graph generation, EBMs offer particular advantages, as graph-structured constraints and domain-specific priors can be naturally integrated into the energy function. Historically, however, EBMs have struggled with sample quality, primarily due to poor sampling efficiency in discrete, combinatorial domains, as exemplified by existing methods such as GraphEBM (Liu et al., 2021). Consequently, discrete EBMs have predominantly served as scoring models for out-of-distribution detection rather than competitive generation (Wu et al., 2023; Fuchsgruber et al., 2024).

Energy Matching (Balcerak et al., 2025) bridges these perspectives by building on the Jordan–Kinderlehrer–Otto (JKO) formulation (Jordan et al., 1998) and its first-order optimality perspective (Terpin et al., 2024; Lanzetti et al., 2024; 2025), coupling transport toward the data distribution with refinement through a time-indexed sequence of sampling steps. In these continuous formulations, dynamics typically follow ordinary differential equations (ODEs) or stochastic differential equations (SDEs), guided by a learned scalar potential that defines both the transport direction and refinement by capturing a Boltzmann-like probability structure of the data.

However, Energy Matching formulations operate in continuous spaces and rely on time-indexed dynamics defined over ambient continuous domains. This approach does not naturally extend to discrete spaces, where generation fundamentally requires discrete sampling characterized by abrupt, local state transitions rather than continuous trajectories.
This highlights the need for a new framework enabling high-quality graph generation while explicitly capturing the relative likelihood structure, thus recovering the compositional and constraint-handling advantages of EBMs.

Contributions.
• We introduce Graph Energy Matching (GEM), a novel discrete energy-based generative model for graphs, achieving molecular graph generation quality that matches or surpasses leading discrete diffusion approaches. Beyond unconditional sampling, the explicitly learned distribution over graphs enables GEM to incorporate compositional constraints at inference time and facilitates the computation of geodesics between graphs.
• We leverage a JKO-style transport map optimization perspective to define (i) a transport-aligned discrete proposal to rapidly move towards high-probability graphs, and (ii) a discrete mixing proposal that efficiently explores the learned data geometry through local graph edits.
• GEM provides a principled pathway toward solving limitations of energy-based discrete graph generation, paving the way for their broader applicability.

2 Preliminaries: Energy Matching and JKO

Energy Matching learns a scalar potential V_θ : X → R, parameterized by θ, on a continuous data space X. It simultaneously serves two complementary roles: (i) off-manifold guidance, directing samples toward the data manifold via the gradient field −∇_x V_θ; and (ii) near-equilibrium refinement, characterizing the data distribution as an EBM. These objectives decompose model learning into a global transport-alignment term and a local density-refinement term. A time-dependent temperature schedule (e.g., ϵ(t) = 0 for 0 ≤ t < 1 for transport and ϵ(t) = ϵ for t ≥ 1 for mixing, promoting exploration) ensures that samples converge to the Gibbs measure ρ_θ(x) ∝ exp(−(1/ϵ) V_θ(x)) as t → ∞.
Energy Matching can be viewed through an optimization-in-probability-spaces perspective, expressed via the JKO scheme from density ρ_t to ρ_{t+Δt}:

    ρ_{t+Δt} = argmin_ρ  (1/(2Δt)) inf_{γ ∈ Γ(ρ_t, ρ)} E_{(x,y)∼γ}[c(x, y)]  +  ∫ V_θ(y) ρ(y) dy  +  ϵ(t) ∫ ρ(y) log ρ(y) dy,   (2.1)

where the cost c(x, y) quantifies displacement from x to y, typically set as c(x, y) = ∥x − y∥² in continuous spaces. The coupling γ ∈ Γ(ρ_t, ρ) is a joint distribution that characterizes a transport plan. Intuitively, γ(x, y) encodes "how much mass from each location x ∼ ρ_t is transported to each new location y ∼ ρ", thus determining the optimal reassignment of mass that minimizes the expected transportation cost E_{(x,y)∼γ}[c(x, y)]. In practice, couplings γ can either be precomputed or dynamically estimated during training, and the potential V_θ is optimized accordingly, aligning it with the induced transport trajectories.

Sampling (two regimes). Sampling starts from a transport-aligned drift and switches to refinement via a prescribed time-dependent temperature ϵ(t). In continuous time, this corresponds to the time-inhomogeneous SDE:

    dx_t = −∇_{x_t} V_θ(x_t) dt + √(2 ϵ(t)) dW_t,   (2.2)

which reduces to an ODE when ϵ(t) = 0 and becomes stochastic as ϵ(t) > 0. In practice, sampling corresponds to explicitly integrating (2.2) over a fixed time interval.

3 Graph Energy Matching

We extend the continuous Energy Matching formulation to discrete graph spaces by introducing two novel discrete proposals: one enabling transport-aligned rapid movements across graph spaces, and the other facilitating mixing within high-probability regions. Our formulation employs an energy-based switching mechanism that clearly separates transport and mixing sampling regimes without requiring explicit indexing of time or noise levels.
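Before moving to the discrete setting, the two-regime dynamics (2.2) can be sketched numerically with a simple Euler–Maruyama integrator. The quadratic potential below is a hypothetical stand-in for a learned V_θ, and the step size, horizon, and temperature are illustrative assumptions: ϵ(t) is zero during transport (deterministic descent) and constant during mixing (stochastic exploration of the Gibbs measure).

```python
import numpy as np

def grad_V(x):
    # Toy quadratic potential V(x) = 0.5 * ||x||^2, a stand-in for a learned V_theta
    return x

def sample_sde(x0, grad_V, dt=0.01, t_end=5.0, eps=0.1, rng=None):
    """Euler-Maruyama integration of dx = -grad V(x) dt + sqrt(2 eps(t)) dW,
    with eps(t) = 0 for t < 1 (deterministic transport / ODE regime) and
    eps(t) = eps for t >= 1 (stochastic mixing toward the Gibbs measure)."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=float)
    t = 0.0
    while t < t_end:
        e = 0.0 if t < 1.0 else eps
        x = x - grad_V(x) * dt + np.sqrt(2.0 * e * dt) * rng.standard_normal(x.shape)
        t += dt
    return x

# Chains start far from the mode; transport pulls them in, mixing then explores.
xs = np.stack([sample_sde(np.full(2, 5.0), grad_V, eps=0.1,
                          rng=np.random.default_rng(s)) for s in range(200)])
```

For this toy potential with ϵ = 0.1, the chains end up fluctuating around the mode with variance close to ϵ, consistent with the Gibbs measure ρ_θ(x) ∝ exp(−V_θ(x)/ϵ).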
Sampling is efficiently facilitated through deterministic greedy proposals in the transport regime and stochastic gradient-informed proposals during mixing. The sampler uses a temperature ϵ to control stochasticity: the transport regime is the deterministic ϵ → 0 limit, while the mixing regime uses ϵ > 0.

Figure 1. GEM Sampling Overview. Two perspectives on GEM sampling: a probability-distribution view (top) of the two-phase MCMC process, showing greedy discrete proposals on samples (direction −∇_x̂ V_θ(x̂), matching the transport map direction) carrying π_0 → π_1 → ... → π_k, followed by multi-site, local, geometry-aware, ∇_x̂ V_θ(x̂)-informed jumps with MH acceptance toward π_data = π_θ; and a samples view (bottom) showing molecular trajectories from MOSES. Sampling alternates between a transport phase (samples far from the data manifold, V_θ(x) ≫ mean data energy), where gradient-informed greedy proposals rapidly move samples toward regions of high probability, and a mixing phase (samples near the data manifold, V_θ(x) ∼ mean data energy) employing MH acceptance to ensure the correct stationary distribution and efficient mixing between modes. Color key: transport (orange), MH (blue).

Further, we propose a temperature-annealing strategy to efficiently generate novel samples from given data samples.

Graph representation. We consider a discrete state space X := (X_node)^n × (X_edge)^{n(n−1)/2} to represent undirected graphs. A graph x ∈ X contains a set of n nodes V = {1, ..., n} and edges E defining the connectivity between these nodes. Each node i ∈ V has a node class x_i ∈ X_node (e.g., an atom type), and each edge (i, j) ∈ E has an edge class x_ij ∈ X_edge (e.g., a bond type). The edge space X_edge includes an element indicating the absence of an edge.
The graph x := (x_V, x_E) consists of x_V and x_E denoting categorical node and edge features, respectively, defined as

    x_V := [x_i]_{i=1}^n ∈ (X_node)^n ;  x_E := [x_ij]_{i<j} ∈ (X_edge)^{n(n−1)/2} .

Each graph also admits a continuous embedding x̂ := (x̂_V, x̂_E) of its categorical node and edge classes, on which gradients can be computed. For a pair of graphs x, y with embeddings x̂, ŷ, we use the local cost

    c_loc(x, y) = λ_V ∥x̂_V − ŷ_V∥²₂ + λ_E ∥x̂_E − ŷ_E∥²₂ ,   (3.1)

with weights λ_V, λ_E > 0 scaling node and edge contributions. We lift the local cost by optimizing over node relabelings to obtain the hard-permutation Fused Gromov–Wasserstein (FGW) cost (Vayer et al., 2019):

    c_FGW(x, y) = min_{σ ∈ S_n} { c_loc(x, σ · y) } .   (3.2)

In practice, we predominantly use the local cost c_loc, as it provides a valid and efficient measure for scoring pairs of neighboring graphs, e.g., when evaluating displacements induced by local jump proposals. When permutation-invariant comparisons between non-local graph pairs are required, we instead employ the c_FGW cost. Since the exact computation is generally intractable (linked to the Quadratic Assignment Problem (QAP) (Vayer et al., 2019)), we rely on fast approximations: histogram matching for noisy–clean molecule pairings and node matching for molecule–molecule pairings (Section C).

Learnable Components. The primary learnable component of our framework is the scalar energy potential V_θ(x), implemented via a neural network. While we evaluate V_θ exclusively on discrete graphs x ∈ X, we leverage the continuous embedding representation x̂ to enable gradient-based computations. Specifically, the differentiability of V_θ(x) := Ṽ_θ(x̂) with respect to x̂ allows us to compute gradients that directly inform and guide local jump proposals used by the sampler (see Sections 3.1 and 3.2). Thus, the continuous embedding bridges the discrete graph domain with gradient-based optimization.
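The local cost (3.1) is straightforward to evaluate once graphs are embedded. The sketch below uses a one-hot embedding and small class counts as illustrative assumptions; the paper's exact featurization may differ.

```python
import numpy as np

N_NODE_TYPES, N_EDGE_TYPES = 4, 3  # hypothetical class counts (edge class 0 = "no edge")

def one_hot(labels, num_classes):
    # A simple choice of continuous embedding x_hat for categorical labels
    return np.eye(num_classes)[labels]

def c_loc(x_nodes, x_edges, y_nodes, y_edges, lam_V=1.0, lam_E=1.0):
    """Local cost (3.1): weighted squared distance between node and edge embeddings."""
    dv = one_hot(x_nodes, N_NODE_TYPES) - one_hot(y_nodes, N_NODE_TYPES)
    de = one_hot(x_edges, N_EDGE_TYPES) - one_hot(y_edges, N_EDGE_TYPES)
    return lam_V * float(np.sum(dv ** 2)) + lam_E * float(np.sum(de ** 2))

# Two 3-node graphs, edge slots ordered (1,2), (1,3), (2,3); they differ in one
# node label and one edge label, so each one-hot flip contributes a cost of 2.
x_nodes, x_edges = np.array([0, 1, 2]), np.array([1, 0, 2])
y_nodes, y_edges = np.array([0, 1, 3]), np.array([1, 0, 1])
cost = c_loc(x_nodes, x_edges, y_nodes, y_edges)
```

With unit weights the cost here is 4.0 (two single-category flips); scaling λ_V versus λ_E trades off node edits against edge edits.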
Figure 2. Proposal Scoring. Local graph edit proposals (1, 2, or 3 edits from the current state x̂, including the "stay" option) with scoring given by q_mixing(x → y). The size of dots encodes proposal probabilities. Alignment with the gradient direction −∇_x̂ V_θ(x̂) increases the probability, while distance penalties (isolines) discourage larger edits. If the candidate is the origin (stay), we resample to promote exploration. (Equal node/edge weights λ_L^V = λ_L^E are used for illustrative purposes.)

3.1 Discrete Transport-Aligned Proposal (ϵ → 0)

Let (π_k)_{k=0}^K denote intermediate discrete distributions along a path from an initial noise distribution to the data distribution. Given an optimal coupling γ_k (approximated in practice by tractable methods, e.g., (Albergo et al., 2024; Tong et al., 2024)) and its induced transport map T_k, we assume a consistent embedding-based representation y = T_k(x) ⟺ ŷ = T̃_k(x̂), with x ∼ π_k and y = T_k(x) ∼ π_{k+1} denoting consecutive graph states along the transport, and T̃_k a continuous map realizing T_k in the embedded space. Expressing the JKO variational objective from (2.1) as an optimization over transport maps (see Section B for derivation) and substituting the time-step parameter Δt with the jump step-size η yields:

    T_k(x) ∈ argmin_{y ∈ X} { (1/(2η)) c(x, y) + V_θ(y) } ,   (3.3)

where π_{k+1} = T_k# π_k denotes the pushforward of π_k. Assuming the minimizer T_k(x) lies in the interior, the first-order optimality condition becomes

    0 = ∇_ŷ [ (1/(2η)) c̃(x̂, ŷ) + Ṽ_θ(ŷ) ] |_{ŷ = T̃_k(x̂)} ,   (3.4)

which implies

    ∇_ŷ Ṽ_θ(T̃_k(x̂)) = −(1/(2η)) ∇_ŷ c̃(x̂, T̃_k(x̂)) .
(3.5)

This optimality condition directly characterizes the solution and serves as the training objective (Section 3.4).

During sampling, given a trained potential Ṽ_θ that satisfies the optimality objective (3.5), our goal is to approximately solve (3.3) using a local proposal mechanism. Instead of performing large global moves, we iteratively descend the potential by making local edits, enabling a Taylor expansion of Ṽ_θ(ŷ) around the current state x̂. Rather than controlling the step-size parameter η, we simplify the (local) greedy proposal by restricting the kernel to exactly N edits per jump. Since our jumps are local, we use the cost c_loc(x̂, ŷ) with distinct weights λ_V and λ_E (3.1). Linearizing around x̂, we approximate Ṽ_θ(ŷ) ≈ Ṽ_θ(x̂) + (ŷ − x̂)^⊤ ∇_x̂ Ṽ_θ(x̂). We deterministically select the candidate y from the candidate set C_N(x), defined as the set of graphs reachable from x via exactly N discrete modifications (edge changes/removals or node type alterations), by minimizing the sum of the potential change and the transport cost:

    y*(x) = argmin_{y ∈ C_N(x)} { (ŷ − x̂)^⊤ ∇_x̂ V_θ(x̂) + c_loc(x, y) } ,
    q_greedy(x → y) = I{ y = y*(x) } .   (3.6)

In the transport regime, proposed moves are deterministically projected onto directions associated with strictly negative gradients, aiming for a decrease in energy. Edits not meeting this criterion are rejected. If no candidate yields a negative gradient, the greedy sampler stalls, switching to the Langevin-based proposal q_mixing of Section 3.2. Proposal schedules are discussed further in Section 3.3; for further proposal implementation details, see Section D.

3.2 Discrete Mixing Proposal (ϵ > 0)

Discretizing the dynamics (2.2) in embedding space with temperature ϵ, step η, and graphs x, y ∈ X yields:

    q_mixing(x → y) ∝ exp( −(1/(4ϵη)) ∥ŷ − x̂ + η ∇_x̂ V_θ(x̂)∥² ) .
(3.7)

We expand the mixing transition probability explicitly in terms of ŷ, simplifying by removing terms independent of ŷ and substituting the squared-distance term ∥ŷ − x̂∥² with the local graph distance c̃_loc(x̂, ŷ). This gives:

    log q_mixing(x → y) = −(1/(4ϵη)) c̃_loc(x̂, ŷ) − (1/(2ϵ)) ∇_x̂ V_θ(x̂)^⊤ (ŷ − x̂) + const.   (3.8)

The proposal is effectively local due to the distance penalty, though, in principle, it can still propose an arbitrary number of edits, enabling mode escape. We define Langevin-specific parameters (λ_L^V, λ_L^E, β_L) by absorbing all constants, including ϵ, η, the scaling terms λ_V, λ_E from c_loc, and numerical factors:

    q_mixing(x → y) ∝ exp( −λ_L^V ∥ŷ_V − x̂_V∥²₂ − λ_L^E ∥ŷ_E − x̂_E∥²₂ − β_L ∇_x̂ V_θ(x̂)^⊤ (ŷ − x̂) ) .   (3.9)

The resulting kernel can be viewed as a linearized locally balanced proposal (Zanella, 2020); related discrete-gradient samplers appear in (Grathwohl et al., 2021; Zhang et al., 2022) for particular costs or candidate sets. Figure 2 illustrates the resulting geometry-aware proposal scoring and how gradient alignment competes with distance penalties.

In practice, we pair the q_mixing proposal with a Metropolis–Hastings (MH) acceptance step to ensure convergence to the target distribution π_θ(x) ∝ exp(−β_mh V_θ(x)). In the mixing regime, the stochastic proposal is accepted with the following MH probability:

    α(x, y) = min{ 1, [π_θ(y) q_mixing(y → x)] / [π_θ(x) q_mixing(x → y)] } .   (3.10)

This choice enforces detailed balance:

    π_θ(x) q_mixing(x → y) α(x, y) = π_θ(y) q_mixing(y → x) α(y, x) ,   (3.11)

so π_θ is stationary for the mixing dynamics (Metropolis et al., 1953; Hastings, 1970). Implementation details of the proposal kernel (factorized logits, temperature annealing, and do-nothing resampling) are provided in Section D.
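One mixing step can be sketched end to end: score single-edit candidates with the linearized proposal (3.9), sample one, and apply the MH correction (3.10). The toy quadratic energy, the node-relabeling edit space, and all parameter values below are assumptions for illustration, not the paper's learned components.

```python
import numpy as np

N_TYPES = 3  # hypothetical number of node classes

def embed(x):
    # One-hot embedding of a vector of node labels, flattened
    return np.eye(N_TYPES)[x].ravel()

def V(x_hat, target):
    # Toy differentiable energy on the embedding (stand-in for the learned V_theta)
    return float(np.sum((x_hat - target) ** 2))

def grad_V(x_hat, target):
    return 2.0 * (x_hat - target)

def candidates(x):
    # All states reachable by exactly one node relabeling (single local edit)
    out = []
    for i in range(len(x)):
        for c in range(N_TYPES):
            if c != x[i]:
                y = x.copy()
                y[i] = c
                out.append(y)
    return out

def proposal_logits(x, target, lam=1.0, beta_L=1.0):
    # Linearized locally balanced scores, eq. (3.9): distance penalty + gradient alignment
    xh = embed(x)
    g = grad_V(xh, target)
    ys = candidates(x)
    logits = np.array([-lam * np.sum((embed(y) - xh) ** 2)
                       - beta_L * g @ (embed(y) - xh) for y in ys])
    logZ = logits.max() + np.log(np.exp(logits - logits.max()).sum())
    return ys, logits - logZ

def q_prob(x, y, target):
    ys, logp = proposal_logits(x, target)
    for cand, lp in zip(ys, logp):
        if np.array_equal(cand, y):
            return float(np.exp(lp))
    return 0.0

def mh_mixing_step(x, target, beta_mh=1.0, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    ys, logp = proposal_logits(x, target)
    y = ys[rng.choice(len(ys), p=np.exp(logp))]
    # Metropolis-Hastings ratio, eq. (3.10)
    log_alpha = (-beta_mh * (V(embed(y), target) - V(embed(x), target))
                 + np.log(q_prob(y, x, target)) - np.log(q_prob(x, y, target)))
    return y if np.log(rng.uniform()) < min(0.0, log_alpha) else x

target = embed(np.array([2, 2, 2]))  # put the toy mode at all-class-2
x = np.array([0, 0, 0])
energies = [V(embed(x), target)]
for t in range(50):
    x = mh_mixing_step(x, target, rng=np.random.default_rng(t))
    energies.append(V(embed(x), target))
```

Because the acceptance uses the asymmetric proposal ratio q(y → x)/q(x → y), the chain satisfies detailed balance (3.11) with respect to exp(−β_mh V_θ).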
We calibrate the parameters β_mh, β_L, λ_L^V, and λ_L^E by maximizing the relative likelihood of generated samples under the learned energy V_θ, directly after the warmup training phase described in Section 3.4.

3.3 Initializations and Proposal Schedules

Determining the Regime. We control the regime through an energy-based binary indicator s ∈ {0, 1}: s = 0 selects transport (the deterministic ϵ → 0 limit), while s = 1 selects mixing (finite ϵ > 0). If initialized at noise, the sampler follows deterministic greedy edits guided strictly by negative gradient directions. The transition from transport to mixing occurs when the Markov chain either reaches the target energy (defined as the mean energy of the training samples estimated during training) or becomes trapped in a local minimum (no valid gradient-improving edits exist) for all samples, whichever happens first. Once triggered, the sampler enters the stochastic mixing regime.

Sample Initialization. We use two initialization regions with aligned proposal schedules:

• Noise initialization: sample the node count from the empirical histogram, then draw node/edge types uniformly (uniform noise). These starts are far from the data distribution, so we run transport with greedy proposals q_greedy until the energy-based switch, then enter mixing with q_mixing.

Figure 3. Energy and Sampling Trajectories. Top: energy evolution over inference MCMC steps for noise- vs. data-initialized chains (greedy proposals, then annealed- and fixed-temperature mixing), with mean ± 1 std bands and reference levels at E_{x∼π_data}[V_θ(x)] and E_{x∼π_0}[V_θ(x)].
Noise-initialized chains use greedy proposals to reach the data distribution; data-initialized chains use temperature annealing (low initial β_mh, gradually increasing) to recover novelty. Bottom: validity and novelty trajectories for data-initialized (teal) and noise-initialized (red) chains over inference steps. Uniqueness ≈ 100%.

• Data initialization: draw graphs from the training set. We skip the transport phase and start directly in the mixing regime with q_mixing, initially using a higher model temperature (lower β_mh) to increase the acceptance rate, and then annealing (raising β_mh) before fixed-temperature mixing (Section D).

Figure 3 summarizes the resulting energy trajectories under noise vs. data initialization and the corresponding regimes.

3.4 Training Objectives

Noise and Data Interpolation via Minibatch Coupling. Given minibatches x_0 ∼ π_0 and x_data ∼ π_data, where π_0 is the uniform noise distribution and π_data the data distribution, we pair samples using a permutation-invariant approximation of c_FGW(x_0, x_data) described in Section 3, via histogram matching (Section C), and solve for a transport map T induced by the minibatch coupling, using the POT solver (Flamary et al., 2021). Each source graph x_0 is paired with a data graph x_data = T(x_0). Crucially, we utilize c_FGW solely to select optimal pairs based on global structural similarity (e.g., node and edge counts). Denote their embeddings by x̂_0 and x̂_data, and define the displacement in the embedding space as v = x̂_data − x̂_0.
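The minibatch pairing step can be sketched with a brute-force assignment in place of the POT solver, and a node-type histogram cost as a simplified stand-in for the c_FGW approximation; the class count and toy batches are illustrative assumptions.

```python
import numpy as np
from itertools import permutations

def type_histogram(x_nodes, n_types=4):
    # Permutation-invariant summary of a graph's node labels
    return np.bincount(x_nodes, minlength=n_types)

def pair_minibatch(noise_batch, data_batch, n_types=4):
    """Pair each noise graph with a data graph by minimizing a histogram-matching
    cost over all assignments (brute force stands in for the POT solver; the
    histogram cost is a simplified stand-in for the c_FGW approximation)."""
    n = len(noise_batch)
    C = np.array([[np.sum((type_histogram(x0, n_types) -
                           type_histogram(xd, n_types)) ** 2)
                   for xd in data_batch] for x0 in noise_batch])
    best = min(permutations(range(n)),
               key=lambda p: sum(C[i, p[i]] for i in range(n)))
    return list(enumerate(best))  # pairs (noise index, data index)

rng = np.random.default_rng(0)
noise = [rng.integers(0, 4, size=5) for _ in range(4)]          # x_0 ~ uniform noise
data = [np.array([0, 0, 1, 2, 3]), np.array([1, 1, 1, 2, 2]),
        np.array([3, 3, 3, 3, 0]), np.array([0, 1, 2, 3, 3])]   # x_data minibatch
pairs = pair_minibatch(noise, data)
```

Brute force is exponential in the batch size; for realistic minibatches an optimal-transport or Hungarian solver replaces the `permutations` search.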
We sample a discrete interpolant x^t := (x^t_V, x^t_E), where x^t_V := [x^t_i]_{i=1}^n and x^t_E := [x^t_ij]_{i<j}, along the coupling-induced path between x̂_0 and x̂_data. The transport-alignment loss L_Flow fits the first-order optimality condition (3.5) along these interpolants; after a warmup of N_warmup iterations, a contrastive term L_CL additionally lowers the energy of data samples relative to model samples.

Algorithm 1 GEM Training
6:  if i > N_warmup then
7:      x ∼ GEM Sampling(V_θ)   ← Algorithm 2
8:      L_CL = E[Ṽ_θ(x̂_data)] − E[Ṽ_θ(x̂)]
9:  else
10:     L_CL = 0
11: end if
12: optimizer step(L_Flow + λ_CL L_CL)
13: i = i + 1
14: end while

Algorithm 2 GEM Sampling
1: Input: V_θ, N_CL, V_th = E_{π_data}[V_θ]
2: Initialize x ∼ π_0 (or π_data)
3: while V_θ(x) > V_th and x not stuck do
4:     x ∼ q_greedy(x → y)
5: end while
6: for k = 1 to N_CL do
7:     x ∼ MH-step(q_mixing(x → y))
8: end for
9: Return x

4 Experiments

We use two established molecular datasets, QM9 (Ramakrishnan et al., 2014) and MOSES (Polykovskiy et al., 2020). QM9 comprises ∼150K small organic molecules containing up to nine heavy atoms (excluding hydrogens), whereas MOSES is a larger, drug-like dataset with roughly 1.5M molecules featuring up to 27 heavy atoms. We report (i) unconditional generation on both QM9 and MOSES, (ii) conditional generation (property optimization) on MOSES, and (iii) analyses of graph-to-graph geodesics on MOSES. When applicable, we report (i) the fraction of valid-unique-novel (V.U.N.) graphs (or valid-unique (V.U.) and novelty of the samples separately on QM9, as per (Eijkelboom et al., 2024)) and the Fréchet ChemNet Distance (FCD) (Preuer et al., 2018). For conditional tasks, we report conditional valid-unique-novel (C.V.U.N.), i.e., the proportion of V.U.N. graphs that simultaneously fulfill the prescribed property condition.

We report results in two initialization regimes: noise initialization, where generation starts from easy-to-sample noise and must reach chemically valid molecules; and data initialization, applicable only to EBMs, which starts from training molecules and must produce novel molecules through edits. Baseline descriptions and experimental details are provided in Section E and Section F.
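The control flow of Algorithm 2 (greedy descent until the energy threshold or a stall, then MH mixing) can be sketched generically. The bit-vector state and Hamming-style energy below are illustrative stand-ins for graphs and the learned V_θ.

```python
import numpy as np

def gem_sample(x0, energy, greedy_step, mh_step, v_th, n_mix, rng):
    """Two-phase sampler in the spirit of Algorithm 2: greedy descent until the
    energy threshold v_th is reached (or no improving edit exists), then n_mix
    Metropolis-Hastings mixing steps."""
    x = x0
    while energy(x) > v_th:
        y = greedy_step(x)
        if y is None:  # stuck: no energy-decreasing edit, switch to mixing
            break
        x = y
    for _ in range(n_mix):
        x = mh_step(x, rng)
    return x

# Toy instantiation: binary vector state, energy = Hamming distance to a target.
target = np.array([1, 0, 1, 1, 0, 1])
energy = lambda x: float(np.sum(x != target))

def greedy_step(x):
    # Best single-bit flip that strictly decreases the energy, else None
    flips = [np.where(np.arange(len(x)) == i, 1 - x, x) for i in range(len(x))]
    best = min(flips, key=energy)
    return best if energy(best) < energy(x) else None

def mh_step(x, rng, beta_mh=1.0):
    i = rng.integers(len(x))
    y = x.copy()
    y[i] = 1 - y[i]
    # Symmetric single-flip proposal: accept with min(1, exp(-beta_mh * dV))
    return y if np.log(rng.uniform()) < -beta_mh * (energy(y) - energy(x)) else x

x = gem_sample(np.zeros(6, dtype=int), energy, greedy_step, mh_step,
               v_th=0.0, n_mix=20, rng=np.random.default_rng(0))
```

The transport phase alone drives the toy energy to zero; the mixing phase then explores states near the mode at temperature 1/β_mh.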
Figure 4. V.U.N. and FCD vs Steps. V.U.N. (higher is better) and FCD (lower is better) versus inference steps on MOSES for GEM noise initialization (uniform) with greedy warmup proposal, GEM data initialization with annealed proposal, and DeFoG (marginal).

4.1 Unconditional Generation

MOSES. Table 1 (top) reports MOSES results. Across both noise and data initialization, GEM outperforms baselines in terms of V.U.N. and FCD. Figure 4 contrasts performance versus inference steps and initialization strategies. At larger inference budgets, GEM substantially surpasses diffusion, yielding notably higher V.U.N. and improved distributional fidelity (lower FCD).

QM9. Table 1 (bottom) shows results on QM9. Under noise initialization, GEM achieves V.U. scores comparable to discrete diffusion baselines, with performance close to saturation, while substantially surpassing them in novelty, indicating effective exploration from random initialization. It also clearly outperforms the previous graph-EBM approach. Under data initialization, GEM consistently achieves strong performance across all reported metrics.

4.2 Conditional Generation

We conduct property optimization with respect to a diverse set of properties, including the logarithm of lipophilicity (logP), the logarithm of aqueous solubility (logS), the quantitative estimate of drug-likeness (QED), and the topological polar surface area (TPSA). For comparability, we follow (Vignac et al., 2023; Ninniri et al., 2025): we train a regressor f_ϕ for differentiable guidance. For a desired property value ζ, we define V^cond_θ(x) = V_θ(x) + λ_prop ∥f_ϕ(x) − ζ∥², where λ_prop weights the property criterion.
Subsequently, we run the same proposal kernel q_θ(x → y) using −∇_x̂ V^cond_θ(x̂), with the MH acceptance ratio modified accordingly. The above scheme can also accommodate a time- or noise-conditioned property regressor. For fairness, all methods in Table 2 use noise-conditioned regressors (see Section G, Table 4 for GEM ablations) and underwent comparable hyperparameter tuning procedures. We feed the regressor a time input via an energy-based proxy that increases linearly until the chain reaches the data-distribution energy level, after which it is clamped to t = 1 (Section G). Noise-conditioned regressors provide more reliable predictions on noisy/invalid graphs.

Table 1. Unconditional Generation. Results grouped by initialization. MOSES reports V.U.N. and FCD (lower is better); QM9 reports V.U., novelty, and FCD. * reproduced with the same backbone as DeFoG (Qin et al., 2025); EBMs use a small energy head. Architecture details are in Section A. MOSES: 25k generations, 1000 inference steps. QM9: 10k generations, 500 inference steps.

(a) MOSES (∼1.5M molecules)

Method                             V.U.N. ↑   FCD ↓
Training samples (MOSES)           0.0        0.25
Noise Initialization
  GraphEBM* (Liu et al., 2021)     0.081      9.83
  VFM* (Eijkelboom et al., 2024)   0.814      2.71
  DeFoG* (Qin et al., 2025)        0.822      1.95
  GEM (Ours)                       0.856      1.51
Data Initialization
  GraphEBM* (Liu et al., 2021)     0.322      7.31
  GEM (Ours)                       0.898      0.76

(b) QM9 (∼0.15M molecules)

Method                             V.U. ↑   Novelty ↑   FCD ↓
Training samples (QM9)             1.00     0.0         0.04
Noise Initialization
  GraphEBM* (Liu et al., 2021)     0.121    0.65        7.16
  VFM (Eijkelboom et al., 2024)    0.998    0.49        0.44
  DeFoG (Qin et al., 2025)         0.968    0.36        0.81
  GEM (Ours)                       0.971    0.63        0.92
Data Initialization
  GraphEBM* (Liu et al., 2021)     0.215    0.48        5.15
  GEM (Ours)                       0.985    0.51        0.31
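Property guidance composes at the level of energies, so any sampler that only queries the energy (and its gradient) can be reused unchanged with V^cond_θ in place of V_θ. Below, a toy quadratic energy and a linear "property regressor" are hypothetical stand-ins for the learned networks, and a plain gradient-descent loop stands in for the guided proposal kernel.

```python
import numpy as np

def make_conditional_energy(V, f, zeta, lam_prop=1.0):
    """Compose the unconditional energy with a squared property penalty,
    V_cond(x) = V(x) + lam_prop * ||f(x) - zeta||^2."""
    def V_cond(x_hat):
        return V(x_hat) + lam_prop * float(np.sum((f(x_hat) - zeta) ** 2))
    return V_cond

def grad(fn, x_hat, h=1e-5):
    # Finite-difference gradient; a learned V_theta would be differentiated directly
    g = np.zeros_like(x_hat)
    for i in range(len(x_hat)):
        e = np.zeros_like(x_hat)
        e[i] = h
        g[i] = (fn(x_hat + e) - fn(x_hat - e)) / (2 * h)
    return g

# Hypothetical stand-ins: quadratic energy around a "data" embedding,
# linear property predictor f, desired property value zeta.
x_data = np.array([1.0, 0.0, 1.0])
V = lambda xh: float(np.sum((xh - x_data) ** 2))
f = lambda xh: np.array([xh.sum()])
zeta = np.array([1.0])

V_cond = make_conditional_energy(V, f, zeta, lam_prop=2.0)

# Descending the conditional energy trades data fidelity against the property target.
xh = np.zeros(3)
for _ in range(500):
    xh = xh - 0.05 * grad(V_cond, xh)
```

The minimizer balances the two terms: here the sum of `xh` converges to 8/7 rather than exactly ζ = 1, because the data-fidelity term pulls the solution back toward `x_data`.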
The target molecular constraints (logS ≥ −2.25, logP ≤ 1.5, QED ≥ 0.9, TPSA ≤ 50) are selected such that unconditional MOSES samples satisfy each constraint approximately 20% of the time, ensuring comparable difficulty across properties. We observe that GEM maintains strong validity, uniqueness, and novelty while improving constraint satisfaction relative to baselines (Table 2).

4.3 Geodesic Analysis

We study molecule-to-molecule interpolation by computing geodesics (low-energy graph-edit paths aligned with the learned data geometry (Béthune et al., 2025)). We compare these geodesics to a purely cost-based interpolation baseline. We report average validity versus the endpoint distance, with distance computed via the node-matching approximation to c_FGW (Section C). Figure 5 shows qualitative paths. As seen in Figure 6, energy-weighted geodesics preserve validity more reliably than cost-only paths, indicating that the energy encodes chemically meaningful structure. Section H provides experimental details.

Table 2. Property Optimization. MOSES (∼1.5M molecules) conditional generation under property constraints. C.V.U.N. is the fraction meeting the condition and valid, unique, novel simultaneously; conditional valid-unique (C.V.U.) is the fraction meeting the condition and valid, unique. All metrics: higher is better. 1k inference steps. Evaluated on 5k samples. * reproduced, same architecture (other than a small energy head for GEM, Section A).

                                     logS ≥ −2.25       QED ≥ 0.9          TPSA ≤ 50          logP ≤ 1.5
Method                               C.V.U.N.  C.V.U.   C.V.U.N.  C.V.U.   C.V.U.N.  C.V.U.   C.V.U.N.  C.V.U.
Training samples (MOSES)             0.0       0.24     0.0       0.16     0.0       0.21     0.0       0.15
Noise Initialization
  G-VFM* (Eijkelboom et al., 2025)   0.72      0.75     0.26      0.28     0.72      0.80     0.70      0.74
  DeFoG* (Qin et al., 2025)          0.74      0.79     0.26      0.31     0.77      0.82     0.78      0.82
  GEM (Ours)                         0.80      0.81     0.40      0.40     0.80      0.81     0.85      0.85
Figure 5. GEM Energy-Weighted vs Cost-Only Geodesics. Molecule trajectories along continuous geodesic paths between a molecule A and a molecule B from MOSES. Top row: GEM energy-weighted geodesic; bottom row: cost-only geodesic. Boxes indicate samples along the continuous path.

Figure 6. Validity Along Geodesics. Average chemical validity (%) vs. endpoint distance (node-matching, c_FGW) for GEM energy-weighted versus cost-only geodesics. Cost-only ignores the learned data geometry and interpolates through low-validity regions.

5 Discussion and Conclusion

We introduced Graph Energy Matching (GEM), a discrete generative framework that extends continuous Energy Matching principles to discrete graph domains, motivated by the transport-map formulation of the JKO scheme. GEM integrates deterministic, transport-aligned graph edits, guiding sampling toward high-probability regions, with effective mixing under a unified scalar energy model. Conceptually, this allows (i) scalable and high-quality molecular graph generation from both noise- and data-initialized states, and (ii) flexible conditional generation at inference time through an explicitly learned relative-likelihood structure. Empirically, GEM achieves state-of-the-art performance among graph EBMs, matching or surpassing diffusion and flow matching baselines on the MOSES and QM9 datasets. Further, it effectively addresses property optimization under threshold constraints. Moreover, GEM's representation naturally enables diagnostics such as graph geodesics, a task inherently challenging for methods modeling data distributions implicitly.

Limitations and Trade-offs.
GEM explicitly learns a scalar energy field capturing relative likelihood; thus, like other EBMs, generation velocities must be derived from gradients, incurring a ∼2–3× theoretical overhead compared to purely forward parameterizations. Nonetheless, this explicit representation naturally supports challenging tasks like compositional inference and computing geometry-aware geodesics.

Future Directions. GEM could be extended to joint 2D and 3D molecular modeling (Liu et al., 2022; Chen et al., 2023) by explicitly encoding geometric constraints, such as interatomic distances, into its robust graph prior. Furthermore, the discrete proposals underlying GEM provide a foundation for broader discrete domains, including text and token-based modeling, where energy perspectives are increasingly prominent (Blondel et al., 2025; Xu et al., 2025).

Acknowledgments

We would like to thank Antonio Terpin for helpful discussions. We thank the Harvard Kempner Institute for providing access to computing resources. This research was supported by the Helmut Horten Foundation and the European Cooperation in Science and Technology (COST).

References

Albergo, M. S., Goldstein, M., Boffi, N. M., Ranganath, R., and Vanden-Eijnden, E. Stochastic interpolants with data-dependent couplings. In International Conference on Machine Learning, pp. 921–937. PMLR, 2024.

Balcerak, M., Amiranashvili, T., Terpin, A., Shit, S., Bogensperger, L., Kaltenbach, S., Koumoutsakos, P., and Menze, B. Energy matching: Unifying flow matching and energy-based models for generative modeling. arXiv preprint arXiv:2504.10612, 2025.

Bell, E. W. and Zhang, Y. DockRMSD: an open-source tool for atom mapping and RMSD calculation of symmetric molecules through graph isomorphism. Journal of Cheminformatics, 11(1):1–9, 2019. doi: 10.1186/s13321-019-0362-7.

Béthune, L., Vigouroux, D., Du, Y., VanRullen, R., Serre, T., and Boutin, V.
Follow the energy, find the path: Riemannian metrics from energy-based models. arXiv preprint arXiv:2505.18230, 2025. NeurIPS 2025.

Blondel, M., Sander, M. E., Vivier-Ardisson, G., Liu, T., and Roulet, V. Autoregressive language models are secretly energy-based models: Insights into the lookahead capabilities of next-token prediction. arXiv preprint arXiv:2512.15605, 2025.

Campbell, A., Yim, J., Barzilay, R., Rainforth, T., and Jaakkola, T. Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design. In International Conference on Machine Learning (ICML), 2024.

Chen, J., Zhang, X., Ma, Z.-M., Liu, S., et al. Molecule joint auto-encoding: Trajectory pretraining with 2d and 3d diffusion. Advances in Neural Information Processing Systems, 36:55077–55096, 2023.

Du, Y. Learning Generalizable Systems by Learning Composable Energy Landscapes. PhD thesis, Massachusetts Institute of Technology, 2025.

Du, Y. and Mordatch, I. Implicit generation and generalization in energy-based models. arXiv preprint, 2019.

Eijkelboom, F., Bartosh, G., Naesseth, C. A., Welling, M., and van de Meent, J.-W. Variational flow matching for graph generation. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, 2024.

Eijkelboom, F., Zimmermann, H., Vadgama, S., Bekkers, E. J., Welling, M., Naesseth, C. A., and van de Meent, J.-W. Controlled generation with equivariant variational flow matching, 2025.

Flamary, R., Courty, N., Gramfort, A., Alaya, M. Z., Boisbunon, A., Chambon, S., Chapel, L., Corenflos, A., Fatras, K., Fournier, N., et al. POT: Python Optimal Transport. Journal of Machine Learning Research, 22(78):1–8, 2021.

Fuchsgruber, D., Wollschläger, T., and Günnemann, S. Energy-based epistemic uncertainty for graph neural networks. Advances in Neural Information Processing Systems, 2024.
Grathwohl, W., Swersky, K., Hashemi, M., Duvenaud, D., and Maddison, C. Oops I took a gradient: Scalable sampling for discrete distributions. In International Conference on Machine Learning, pp. 3831–3841. PMLR, 2021.

Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970. doi: 10.1093/biomet/57.1.97.

Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

Jin, W., Barzilay, R., and Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, pp. 2323–2332. PMLR, 2018.

Jordan, R., Kinderlehrer, D., and Otto, F. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint, 2014.

Landrum, G. et al. RDKit: Open-source cheminformatics. URL https://www.rdkit.org.

Lanzetti, N., Terpin, A., and Dörfler, F. Variational analysis in the Wasserstein space. arXiv preprint arXiv:2406.10676, 2024.

Lanzetti, N., Bolognani, S., and Dörfler, F. First-order conditions for optimization in the Wasserstein space. SIAM Journal on Mathematics of Data Science, 7(1):274–300, 2025.

LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.

Liu, M., Yan, K., Oztekin, B., and Ji, S. GraphEBM: Molecular graph generation with energy-based models. arXiv preprint, 2021.

Liu, S., Wang, H., Liu, W., Lasenby, J., Guo, H., and Tang, J. Pre-training molecular graph representation with 3d geometry. In International Conference on Learning Representations, 2022.

Ma, L., Lin, C., Lim, D., Romero-Soriano, A., Dokania, P. K., Coates, M., Torr, P., and Lim, S.-N.
Graph inductive biases in transformers without message passing. In International Conference on Machine Learning, pp. 23321–23337. PMLR, 2023.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953. doi: 10.1063/1.1699114.

Ninniri, M., Podda, M., and Bacciu, D. Graph diffusion that can insert and delete. In Advances in Neural Information Processing Systems, 2025. NeurIPS 2025.

Noé, F., Tkatchenko, A., Müller, K.-R., and Clementi, C. Machine learning for molecular simulation. Annual Review of Physical Chemistry, 71(1):361–390, 2020.

Polishchuk, P. G., Madzhidov, T. I., and Varnek, A. Estimation of the size of drug-like chemical space based on GDB-17 data. Journal of Computer-Aided Molecular Design, 27(8):675–679, 2013.

Polykovskiy, D., Zhebrak, A., Sanchez-Lengeling, B., Golovanov, S., Tatanov, O., Belyaev, S., Kurbanov, R., Artamonov, A., Aladinskiy, V., Veselov, M., et al. Molecular Sets (MOSES): A benchmarking platform for molecular generation models. Frontiers in Pharmacology, 11:1931, 2020.

Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S., and Klambauer, G. Fréchet ChemNet distance: A metric for generative models for molecules in drug discovery. Journal of Chemical Information and Modeling, 58(9):1736–1741, 2018.

Qin, Y., Madeira, M., Thanou, D., and Frossard, P. DeFoG: Discrete flow matching for graph generation. In International Conference on Machine Learning (ICML), 2025.

Ramakrishnan, R., Dral, P. O., Rupp, M., and Von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1(1):140022, 2014.

Stokes, J. M., Yang, K., Swanson, K., Jin, W., Cubillos-Ruiz, A., Donghia, N. M., MacNair, C. R., French, S., Carfrae, L. A., Bloom-Ackermann, Z., et al.
A deep learning approach to antibiotic discovery. Cell, 180(4):688–702, 2020.

Terpin, A., Lanzetti, N., Gadea, M., and Dörfler, F. Learning diffusion at lightspeed. Advances in Neural Information Processing Systems, 2024.

Tong, A., Fatras, K., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Wolf, G., and Bengio, Y. Improving and generalizing flow-based generative models with minibatch optimal transport. Transactions on Machine Learning Research, 2024.

Vayer, T., Chapel, L., Flamary, R., Tavenard, R., and Courty, N. Optimal transport for structured data with application on graphs. In International Conference on Machine Learning, pp. 6275–6284. PMLR, 2019.

Vignac, C., Krawczuk, I., Siraudin, A., Wang, B., Cevher, V., and Frossard, P. DiGress: Discrete denoising diffusion for graph generation. In International Conference on Learning Representations (ICLR), 2023.

Wu, Q., Chen, Y., Yang, C., and Yan, J. Energy-based out-of-distribution detection for graph neural networks. International Conference on Learning Representations (ICLR), 2023.

Xie, T., Fu, X., Ganea, O.-E., Barzilay, R., and Jaakkola, T. S. Crystal diffusion variational autoencoder for periodic material generation. In International Conference on Learning Representations, 2022.

Xu, M., Geffner, T., Kreis, K., Nie, W., Xu, Y., Leskovec, J., Ermon, S., and Vahdat, A. Energy-based diffusion language models for text generation. In International Conference on Learning Representations, 2025.

Zanella, G. Informed proposals for local MCMC in discrete spaces. Journal of the American Statistical Association, 115(530):852–865, 2020.

Zhang, R., Liu, X., and Liu, Q. A Langevin-like sampler for discrete distributions. In International Conference on Machine Learning, pp. 26375–26396. PMLR, 2022.
A Permutation-invariant energy parameterization

We implement the potential with a graph transformer backbone taken directly from DeFoG (Qin et al., 2025). Each layer of this backbone is permutation equivariant, meaning that for any permutation (relabeling) σ, the layer satisfies

    h^(l)(σ · x_V^(l), σ · x_E^(l), f^(l)) = (σ · x_V^(l+1), σ · x_E^(l+1), f^(l+1)),

where x_V^(l) and x_E^(l) denote node and edge features at layer l, respectively, and f^(l) collects graph-level features at layer l and is updated through a permutation-invariant operation. Following Vignac et al. (2023) and Qin et al. (2025), we pad and mask all smaller graphs to the maximum number of nodes observed in the training dataset, denoted n. This results in fixed-size inputs to the model, denoted x_V and x_E for node and edge features, respectively. Figure 7 summarizes the backbone and invariant readout.

Figure 7. Energy Backbone. Permutation-equivariant backbone with invariant pooling to produce a scalar energy: input graph x := (x_V, x_E) → graph transformer (permutation equivariant) → invariant readout (Σ_i x_i^(L), Σ_{i≠j} x_ij^(L), f^(L)) → V_θ(x) ∈ R.

The scalar energy is obtained by an invariant readout that first pools the equivariant features and then applies a multi-layer perceptron (MLP):

    x̄ = Σ_{i∈V} x_i^(L),    ē = Σ_{i≠j} x_ij^(L)
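As a minimal numerical sketch of the invariant readout (NumPy, with illustrative layer sizes; the actual backbone is DeFoG's graph transformer, and the MLP weights here are random stand-ins), the sum-pooling construction and its invariance under node relabeling can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden = 4, 8

# Illustrative readout-MLP weights (stand-ins for the learned head).
W1 = rng.normal(size=(3 * d, hidden))
W2 = rng.normal(size=(hidden, 1))

def energy(x_V, x_E, f):
    """Invariant readout: sum-pool node features, sum-pool off-diagonal
    edge features, concatenate the global feature f, then a small MLP
    producing a scalar energy V_theta(x)."""
    x_bar = x_V.sum(axis=0)                       # sum over nodes i
    off_diag = ~np.eye(x_E.shape[0], dtype=bool)  # mask out i == j
    e_bar = x_E[off_diag].sum(axis=0)             # sum over edges i != j
    h = np.concatenate([x_bar, e_bar, f])
    return (np.tanh(h @ W1) @ W2).item()

n = 5
x_V = rng.normal(size=(n, d))     # padded node features
x_E = rng.normal(size=(n, n, d))  # padded edge features
f = rng.normal(size=(d,))         # global graph-level feature

# Relabeling the nodes (and both edge axes) leaves the energy unchanged.
perm = rng.permutation(n)
assert abs(energy(x_V, x_E, f)
           - energy(x_V[perm], x_E[np.ix_(perm, perm)], f)) < 1e-9
```

Because the pooling operations commute with any node permutation and the diagonal maps to the diagonal, the final MLP sees an identical input for every relabeling of the same graph.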