Exact Discrete Stochastic Simulation with Deep-Learning-Scale Gradient Optimization


Authors: Jose M. G. Vilar, Leonor Saiz

Jose M. G. Vilar 1,2,* and Leonor Saiz 3,*

1 Biofisika Institute (CSIC, UPV/EHU), University of the Basque Country (UPV/EHU), P.O. Box 644, 48080 Bilbao, Spain
2 IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Spain
3 Department of Biomedical Engineering, University of California, 451 E. Health Sciences Drive, Davis, CA 95616, USA
* Correspondence to: j.vilar@ikerbasque.org or lsaiz@ucdavis.edu

Abstract

Exact stochastic simulation of continuous-time Markov chains (CTMCs) is essential when discreteness and noise drive system behavior, but the hard categorical event selection in Gillespie-type algorithms blocks gradient-based learning. We eliminate this constraint by decoupling forward simulation from backward differentiation, with hard categorical sampling generating exact trajectories and gradients propagating through a continuous, massively parallel Gumbel-Softmax straight-through surrogate. Our approach enables accurate optimization at parameter scales over four orders of magnitude beyond existing simulators. We validate for accuracy, scalability, and reliability on a reversible dimerization model (0.09% error), a genetic oscillator (1.2% error), a 203,796-parameter gene regulatory network achieving 98.4% MNIST accuracy (a prototypical deep-learning multilayer perceptron benchmark), and experimental patch-clamp recordings of ion channel gating (R² = 0.987) in the single-channel regime. Our GPU implementation delivers 1.9 billion steps per second, matching the scale of non-differentiable simulators. By making exact stochastic simulation massively parallel and autodiff-compatible, our results enable high-dimensional parameter inference and inverse design across systems biology, chemical kinetics, physics, and related CTMC-governed domains.

1. Introduction
Stochastic fluctuations are not merely noise; in systems ranging from gene regulatory networks to viral epidemiology and nucleation kinetics, they are the drivers of phenotypic variability, extinction events, and symmetry breaking (1-12). To model these phenomena, exact discrete-event simulation algorithms, such as the Gillespie algorithm (13) and the Bortz-Kalos-Lebowitz (BKL) method (14), serve as the gold standard, providing mathematically rigorous trajectories of the underlying CTMC (3, 15). However, this physical rigor has come at a steep computational cost, because the discrete, non-differentiable nature of event selection breaks the computational graph, rendering exact stochastic simulation fundamentally incompatible with the gradient-based optimization techniques that have revolutionized deep learning (16). Consequently, parameter inference and design for stochastic models have remained historically bottlenecked by the curse of dimensionality (17, 18). For decades, the field has been relegated to likelihood-free methods, such as approximate Bayesian computation (ABC), which do not require gradients but scale poorly, limiting inference to models with fewer than a dozen parameters (3, 19, 20).

Alternative gradient-estimation techniques exist but suffer from critical weaknesses in this context. Likelihood-ratio (score-function) estimators (21) provide unbiased gradients but exhibit variance that explodes with trajectory length, making them impractical for large systems (22) and often requiring clever system-specific approaches (23, 24). Other sophisticated unbiased approaches, such as the Poisson Path Algorithm (PPA) (25), significantly reduce this variance by using an integral representation of the sensitivity.
While these methods, along with common random number (CRN) finite-difference schemes (26), make gradient computation feasible for optimization in lower-dimensional spaces, they are inherently limited by their scaling. The computational cost grows linearly with the number of parameters because these estimators must be evaluated for each parameter individually. This linear-scaling bottleneck effectively prevents their application to the high-dimensional parameter spaces required for complex tasks such as neural-network-style systems. Conversely, prior soft-forward methods achieve differentiability by directly approximating the dynamics (e.g., continuous reaction mixtures), but this introduces a simulation-reality mismatch, in which the optimized model no longer represents the discrete stochastic physics, and such methods have thus far been demonstrated only in low-dimensional systems (27).

We show that physical exactness and differentiability are not mutually exclusive at deep-learning scales. Our central insight is the complete decoupling of the forward simulation from the backward differentiation. In the forward pass, we retain standard, hard categorical sampling, ensuring that the solver captures the true intrinsic noise and discreteness of the system without approximation. In the backward pass, we bridge the discontinuity using a massively parallel Gumbel-Softmax straight-through estimator (22, 28, 29), a well-established technique in deep learning for discrete (categorical) choices. This surrogate gradient is known to allow the optimizer to see through discrete events, guiding parameters via a tunable signal while the forward dynamics remain statistically exact (22, 28, 29). Our framework removes the dimensionality barrier entirely, enabling us to scale from handfuls of parameters to hundreds of thousands.
We validate the approach across a hierarchy of complexity spanning five orders of magnitude, beginning with precision benchmarks where we recover parameters for a reversible dimerization system with 0.09% error and successfully learn the specific rates required to sustain limit-cycle oscillations in a genetic oscillator (30), often used as an exemplar of difficult parameter identifiability (31), with 1.2% error. To demonstrate extreme scalability, we train a 203,796-parameter gene regulatory network to classify MNIST digits (32) with 98.4% accuracy, matching standard multilayer perceptron benchmarks (33) and proving that stochastic reaction networks can be optimized for complex computation. Finally, we validate the method on experimental data by inferring ion channel gating kinetics from patch-clamp recordings (34) (R² = 0.987). This latter application is particularly significant because, with only two channels in the system, the dynamics are dominated by discrete transitions with no law-of-large-numbers smoothing, confirming that our method genuinely optimizes inherently discrete stochastic processes.

By decoupling physical exactness in the forward pass from stable, massively parallel gradient computation in the backward pass, we establish exact stochastic simulation as a scalable, differentiable primitive supported by a GPU implementation capable of 1.9 billion steps per second. This removes a longstanding barrier in computational modeling, enabling high-dimensional parameter inference and inverse design across systems biology, chemistry, physics, epidemiology, and other domains governed by continuous-time Markov chains. The approach opens the door to deep-learning gradient-based optimization of realistic stochastic models and the automated engineering of reaction circuits at scales previously out of reach.

We note that concurrent and independent work by Mottes et al.
(35) applies a similar straight-through Gumbel-Softmax strategy to stochastic kinetic models and validates the method on simulated telegraph promoter gene expression and stochastic thermodynamics problems in the low-dimensional regime. While the methods share the same core forward/backward decoupling principle, our work is specifically focused on overcoming the dimensionality barrier in exact stochastic optimization. By scaling to parameter spaces over four orders of magnitude larger, our framework enables deep-learning-scale tasks, including a 203,796-parameter gene regulatory network for MNIST classification via a high-throughput GPU implementation, while remaining highly robust to accurately optimize directly on experimental data (patch-clamp recordings in the extreme discrete regime of N = 2 ion channels).

2. The Gillespie Stochastic Simulation Algorithm

We develop our framework in the context of the Gillespie algorithm, the canonical method for exact CTMC simulation in chemical kinetics, though the approach generalizes to any system of competing Poisson processes. Consider a well-mixed chemical system containing N molecular species whose copy numbers at time t are represented by the state vector x(t) = (x_1(t), ..., x_N(t))^T. The system evolves through M reaction channels, where each reaction j is characterized by a propensity function a_j(x) and a stoichiometry vector ν_j ∈ ℤ^N. The propensity function gives the probability per unit time that reaction j fires given the current state and depends on kinetic parameters θ that we wish to infer or optimize. When reaction j occurs after a time τ, the state updates according to x(t + τ) = x(t) + ν_j. The state vector x(t) evolves as a continuous-time Markov chain whose dynamics are governed by the chemical master equation. Let P(x, t) denote the probability of finding the system in state x at time t.
The master equation takes the form

∂P(x, t)/∂t = Σ_{j=1}^{M} [ a_j(x − ν_j) P(x − ν_j, t) − a_j(x) P(x, t) ]   (1)

where the first term represents probability flux into state x from states that can reach it via reaction j, and the second term represents flux out of state x due to reaction j firing (15). Analytical solutions to the master equation exist only for simple systems, motivating the use of stochastic simulation to sample from the distribution P(x, t).

The Gillespie algorithm generates exact sample paths from the master equation through the following procedure. Given the current state x at time t, first compute the propensity a_j(x) for each reaction and the total propensity a_0(x) = Σ_{j=1}^{M} a_j(x). Sample the waiting time until the next reaction from an exponential distribution with rate a_0, which can be accomplished via the inverse transform τ = −ln(u_1)/a_0, where u_1 ~ Uniform(0, 1). Select which reaction fires by sampling from the categorical distribution with probabilities p_j = a_j/a_0. Update the state to x + ν_j and advance time to t + τ. This procedure repeats until a terminal condition is reached.

The propensity functions encode the kinetics of each reaction and typically follow mass-action form. A first-order reaction S_i → products has propensity a = k x_i, where k is the rate constant. A second-order reaction S_i + S_j → products with i ≠ j has propensity a = k x_i x_j. A homodimerization reaction S_i + S_i → products has propensity a = k x_i (x_i − 1)/2, accounting for the number of distinct pairs. More complex propensity functions arise in enzyme kinetics and cooperative binding, but the Gillespie algorithm applies unchanged regardless of the functional form.

The computational cost of the Gillespie algorithm scales with the number of reaction events, which in turn depends on the total propensity and simulation duration.
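As an illustration, the direct-method procedure described above can be sketched in a few lines of NumPy. This is a minimal serial sketch, not the paper's GPU implementation; the function name, the dimerization rates, and the initial copy numbers are illustrative.

```python
import numpy as np

def gillespie_direct(x0, nu, propensities, t_end, rng):
    """Direct-method SSA sketch (hypothetical helper, not the paper's code).

    x0           : initial copy numbers, shape (N,)
    nu           : stoichiometry matrix, shape (M, N); row j is nu_j
    propensities : function x -> array of M propensities a_j(x)
    """
    x, t = np.array(x0, dtype=float), 0.0
    times, states = [t], [x.copy()]
    while t < t_end:
        a = propensities(x)
        a0 = a.sum()
        if a0 == 0.0:                      # absorbing state: nothing can fire
            break
        t += -np.log(rng.uniform()) / a0   # exponential waiting time
        j = rng.choice(len(a), p=a / a0)   # categorical reaction selection
        x = x + nu[j]
        times.append(t)
        states.append(x.copy())
    return np.array(times), np.array(states)

# Reversible dimerization A + B <-> C with illustrative rates.
nu = np.array([[-1, -1, +1],   # forward: A + B -> C
               [+1, +1, -1]])  # reverse: C -> A + B
kf, kr = 0.01, 0.1
prop = lambda x: np.array([kf * x[0] * x[1], kr * x[2]])
rng = np.random.default_rng(0)
times, states = gillespie_direct([50, 50, 0], nu, prop, t_end=50.0, rng=rng)
```

Because both reactions conserve A + C (and B + C), the sketch can be checked by verifying that these sums stay constant along the sampled trajectory.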
For systems with fast reactions or large molecular populations, direct simulation can become prohibitively expensive. Various approximation methods trade exactness for computational efficiency, including tau-leaping, the chemical Langevin equation, and more recent mapping-based approaches such as Holimap (36-38). Here, we focus on exact simulation and address computational challenges through massive parallelization rather than approximation.

3. Gumbel-Softmax Reparameterization for Differentiable Simulation

3.1 The Differentiability Challenge

The fundamental obstacle to gradient-based optimization through stochastic simulation is the discrete nature of reaction selection. At each step, the Gillespie algorithm samples a reaction index j from the categorical distribution with probabilities proportional to the propensities. This sampling operation is not differentiable; namely, the gradient of a discrete sample with respect to its underlying probabilities is zero almost everywhere and undefined at the sampling boundaries. Standard automatic differentiation frameworks cannot propagate gradients through such operations.

The waiting time presents a lesser challenge because it is a continuous random variable. Given the total propensity a_0, the waiting time τ ~ Exponential(a_0) can be reparameterized as τ = −ln(u)/a_0, where u ~ Uniform(0, 1). This reparameterization expresses τ as a differentiable function of a_0, and hence of the underlying parameters θ, with the randomness absorbed into the fixed random variable u. Gradients flow through τ via the chain rule: ∂τ/∂θ = (ln(u)/a_0²) ∂a_0/∂θ.

The reaction selection step requires a different approach. We seek a continuous relaxation of categorical sampling that preserves essential statistical properties while enabling gradient computation.
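The waiting-time reparameterization is straightforward to verify numerically: with u held fixed, the analytic derivative ∂τ/∂a_0 = ln(u)/a_0² should match a finite-difference estimate. A NumPy sketch with an illustrative value of a_0:

```python
import numpy as np

# Reparameterized waiting time: tau(a0) = -ln(u)/a0 with u held fixed.
# Analytic gradient d tau / d a0 = ln(u)/a0^2 (equivalently -tau/a0),
# checked here against central finite differences.
rng = np.random.default_rng(1)
u = rng.uniform(1e-12, 1.0)     # fixed uniform draw absorbing the randomness
a0 = 3.7                        # illustrative total propensity

tau = lambda a: -np.log(u) / a
analytic = np.log(u) / a0**2    # negative: larger rate, shorter wait
eps = 1e-6
numeric = (tau(a0 + eps) - tau(a0 - eps)) / (2 * eps)
```

The negative sign reflects the physics: increasing the total propensity shortens the expected waiting time.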
The relaxation should approach exact categorical sampling in an appropriate limit, ensuring that optimization with respect to the relaxed objective produces parameters that are valid for the exact dynamics.

3.2 The Gumbel-Max Reparameterization

The Gumbel-Max reparameterization provides a representation of categorical sampling that is amenable to continuous relaxation (39). Let g_1, ..., g_M be independent random variables with the standard Gumbel distribution, which has probability density function f(g) = exp(−g − exp(−g)) and can be sampled via g = −ln(−ln(u)), where u ~ Uniform(0, 1). The Gumbel-Max theorem states that argmax_j {log π_j + g_j} has the same distribution as a sample from the categorical distribution with probabilities (π_1, ..., π_M). For reaction selection, we can therefore write

j* = argmax_j { log a_j(x) + g_j }   (2)

where the normalization by the total propensity is unnecessary because the argmax is invariant to additive constants. This representation transforms categorical sampling into an optimization problem, but the argmax operation remains non-differentiable. The key observation is that the argmax can be approximated by a softmax with a temperature parameter that controls the sharpness of the approximation.

3.3 The Gumbel-Softmax Relaxation

The Gumbel-Softmax relaxation replaces the hard argmax with a temperature-controlled softmax (28). Define the soft sample

y_j = exp((log a_j(x) + g_j)/T) / Σ_{k=1}^{M} exp((log a_k(x) + g_k)/T)   (3)

where T > 0 is the temperature parameter. The vector y = (y_1, ..., y_M) lies on the probability simplex and is a differentiable function of the propensities, and hence of the parameters θ. As T → 0, the softmax output approaches a one-hot vector corresponding to the argmax, recovering exact categorical sampling.
As    , the ou tput approaches a u niform distribution reg ardless of the propensities. The soft sampl e   can be used to comput e a relaxe d state update. Instead of applying the stoichiometry v ector    corresponding to the sel ected reaction, we apply the expected stoichiometry  soft           (4) which is a weighted avera ge of al l possible updates. This relaxed update is different iable with respect to the parameters and can be used to simulate approximate traject ori es for o ptimi zation. However, using soft updates throughout the si mulation fundamentally changes the dynami cs. The relaxe d state  soft does not correspond to any reali zation of the exact stochastic pr ocess, as molecular copy numbers are inherently discrete. Parameters optimized to fi t rel axed traject ories may not produce correct behavior under exact simul at i on, li miting the util ity of th e approach f or applications where exact stoc hasticity matters. 3.4 T he Straight-Through Esti mator The straight-thro ugh estimator resolves the conflict between exactness and differentiability by decoupling the forward and backward passes (Figure 1). Let     󰇛  󰇜  󰇝󰇞  denote the hard reaction indicato r from t he forward pass obtained from Eq. (3) . The st rai ght-through const ruction    󰇛    󰇜    (5) 6 ensures that the forward computation uses the hard sample  while the backward computation differentiates throu gh the soft sample   . Here,  i s the do-not-differentiate-thro ugh-this- part operator, present in deep learning fra mework s, that makes the forward pass exact and the backward pass differentiable. Therefore, the forward pass uses hard categorical samples to generate exact trajec tories:   󰆒        (6) This produces a state updat e that i s statistically iden tical to standard Gi llespie simulation. 
The trajectory consists of discrete states connected by single-reaction jumps, exactly as in the original algorithm. For gradient computation in the backward pass, the method replaces the hard argmax with the Gumbel-Softmax relaxation. Automatic differentiation then propagates gradients as if the state update were from ν_soft (Eq. 4), yielding the surrogate Jacobian

∂x'/∂θ = Σ_{j=1}^{M} ν_j ∂y_j/∂θ   (7)

At low temperatures, the soft samples y approach one-hot vectors, reducing bias but increasing gradient variance. At high temperatures, the soft samples become more uniform, providing smoother gradients at the cost of a less accurate approximation to the discrete dynamics. In practice, temperature annealing from higher to lower values over the course of training often improves convergence.

The straight-through estimator is biased because the gradient it computes is not the exact gradient of the expected objective with respect to the parameters. However, for optimization purposes, an unbiased gradient is not required; what matters is that the gradient provides a useful descent direction. Empirical evidence across the applications presented here demonstrates that straight-through gradients are effective for optimization despite their potential bias.

3.5 Implementation Details

Our implementation combines several techniques to achieve computational efficiency and numerical stability. The framework is implemented in TensorFlow 2.20, which provides automatic differentiation, GPU acceleration, and XLA (Accelerated Linear Algebra) JIT compilation. The straight-through estimator is expressed entirely within TensorFlow's computational graph using tf.stop_gradient to decouple the forward and backward passes, enabling standard backpropagation through the surrogate without custom gradient definitions.
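The forward-value algebra of Eq. (5) can be illustrated outside TensorFlow: stopgrad is the identity in the forward pass, so y_ST evaluates exactly to the hard one-hot sample, and the state update of Eq. (6) applies a single discrete stoichiometry. In TensorFlow the pattern is written as `y_st = tf.stop_gradient(y_hard - y_soft) + y_soft`. A NumPy forward-value sketch with illustrative values:

```python
import numpy as np

# Forward-value check of the straight-through construction (Eq. 5).
# stopgrad is the identity in the forward pass; in an autodiff framework it
# blocks gradients there, so backpropagation sees only the soft sample y.
stopgrad = lambda v: v                        # identity in the forward pass

y_soft = np.array([0.10, 0.25, 0.60, 0.05])   # illustrative soft sample
y_hard = np.eye(4)[np.argmax(y_soft)]         # one-hot at the argmax

y_st = stopgrad(y_hard - y_soft) + y_soft     # evaluates exactly to y_hard

# Eq. (6): the state update applies the single selected stoichiometry.
nu = np.array([[+1, +1, -1],
               [ 0, +1,  0],
               [-1, -1, +1],                  # selected reaction (index 2)
               [+1,  0,  0]])                 # illustrative stoichiometries
x = np.array([50.0, 50.0, 0.0])
x_next = x + y_st @ nu                        # equals x + nu[2]
```

The soft terms cancel exactly in the forward value, which is why the simulated trajectory remains statistically identical to standard Gillespie simulation.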
All simulations are performed in parallel across an ensemble of independent trajectories, exploiting GPU parallelism through vectorized operations over the batch dimension. The ensemble size is a tunable hyperparameter that trades memory usage against variance in the gradient estimate.

The Gumbel-Softmax temperature T is a critical hyperparameter whose optimal value is system-dependent. A key observation is that at low T, the Gumbel perturbations average out across the ensemble and the gradient estimates are effectively unbiased. The tradeoff at low temperature is increased variance rather than bias, which is addressed through sufficient ensemble size. The temperature schedules used here span several orders of magnitude across applications: the dimerization system uses geometric annealing from a higher to a lower temperature; the genetic oscillator uses a fixed temperature; the MNIST classification network and the ion channel system anneal from higher to lower temperatures. In general, systems with few reactions and large ensembles tolerate very low temperatures, while high-dimensional systems benefit from warmer temperatures that provide smoother gradients.

For the high-dimensional MNIST classification task, we employ a two-phase training protocol. In the initial phase (the first 16 of 80 epochs), the forward pass uses soft Gumbel-Softmax samples rather than hard categorical draws, providing gradient signal that can traverse the flat regions of the discrete loss landscape (discrete molecular changes are too large in the earlier stages of training to decrease the loss function). In the second phase, training switches to the straight-through estimator with hard forward samples, refining parameters under the exact discrete dynamics.
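A geometric annealing schedule of the kind described above can be sketched in a few lines; the endpoint temperatures and epoch count below are illustrative placeholders, not the paper's values.

```python
import numpy as np

# Geometric temperature annealing sketch: decay by a constant factor per
# epoch from T_start to T_end. Endpoint values here are illustrative.
def geometric_schedule(T_start, T_end, n_epochs):
    """Return the temperature for each epoch as a geometric sequence."""
    ratio = (T_end / T_start) ** (1.0 / (n_epochs - 1))
    return T_start * ratio ** np.arange(n_epochs)

T = geometric_schedule(T_start=1.0, T_end=0.01, n_epochs=250)
```

The same schedule shape can drive a temperature-coupled learning rate, as done for the ion channel system, by multiplying a base learning rate by T.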
The remaining three systems (dimerization, genetic oscillator, and ion channel) use the straight-through estimator throughout training, as their lower dimensionality and larger relative ensemble sizes provide sufficient gradient signal without a soft warm-up phase.

The RMSprop optimizer is used for all four systems, with learning rates coupled to the temperature annealing schedule. Learning rates are reduced as training progresses, either through explicit annealing schedules or through coupling to the temperature parameter. For the dimerization and oscillator systems, learning rate annealing follows a geometric schedule. For the ion channel system, the learning rate is proportional to the current temperature. The MNIST system uses a fixed learning rate of 0.01 with stochastic weight averaging applied over the final epochs to improve generalization.

The loss function for parameter inference is the mean squared error between simulated and target trajectory statistics. For the dimerization and ion channel systems, we match ensemble-averaged trajectories at each time point. For the oscillator, the target data is generated via a start-point map, in which short trajectory segments are simulated from start points sampled along a reference trajectory, and the loss measures the deviation between simulated and target segments. For MNIST classification, the loss is the categorical cross-entropy between softmax-transformed output gene copy numbers and one-hot encoded digit labels.

4. Results

4.1 Reversible Dimerization

We first validate the framework on a prototypical reversible dimerization reaction system (Figure 2A and Supplementary Note 1):

A + B ⇌ C   (8)

This system involves three molecular species and two reactions with rate constants k_f (forward, dimerization) and k_r (reverse, dissociation).
The propensity functions follow mass-action kinetics: a_f(x) = k_f x_A x_B for the forward reaction and a_r(x) = k_r x_C for the reverse reaction (15). The stoichiometric vectors are ν_f = (−1, −1, +1)^T and ν_r = (+1, +1, −1)^T. We generated synthetic trajectory data using fixed ground-truth rate constants k_f and k_r and fixed initial copy numbers for A, B, and C. The target data consisted of ensemble-averaged trajectories computed from 100,000 independent simulations, providing low-variance estimates of the expected molecular counts over time.

For parameter inference, we used ensembles of 100,000 parallel trajectories per gradient computation, simulating each trajectory for 250 steps to observe both transient and near-equilibrium behavior. The loss function is the mean squared error (MSE) between the simulated and target ensemble-averaged trajectories for all three species (A, B, and C). Because the stochastic Gillespie trajectories are defined on irregular, event-driven time grids, each simulated trajectory is first linearly interpolated onto a uniform target time grid before averaging across the ensemble. The MSE is then computed at each grid point and averaged over time, comparing the model mean to the target mean for all three species simultaneously.

Parameter optimization proceeded over 250 epochs using the RMSprop optimizer. Parameters were represented in log-space to ensure positivity: the optimization variables were θ_f = log k_f and θ_r = log k_r. Initial parameter guesses were set arbitrarily. The optimization converged rapidly, with the loss decreasing by three orders of magnitude within the first 100 epochs (Figure 2C).
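The interpolation-then-average step can be sketched with NumPy's one-dimensional linear interpolation; the event times, copy numbers, and target values below are illustrative, and in practice the operation is vectorized over the ensemble and all species.

```python
import numpy as np

# Mapping an event-driven SSA trajectory onto a uniform time grid before
# computing the trajectory-matching MSE. All data here are illustrative.
t_grid = np.linspace(0.0, 10.0, 101)

# One event-driven trajectory: irregular event times and copy numbers.
t_events = np.array([0.0, 0.7, 1.1, 2.9, 4.0, 6.5, 10.0])
x_events = np.array([50.0, 49.0, 48.0, 49.0, 48.0, 47.0, 47.0])

# Piecewise-linear interpolation onto the uniform grid (one species shown).
x_grid = np.interp(t_grid, t_events, x_events)

target = np.full_like(t_grid, 48.0)        # illustrative target mean
mse = np.mean((x_grid - target) ** 2)      # per-grid-point error, time-averaged
```

Once every trajectory lives on the same grid, the ensemble mean and the grid-point-wise MSE against the target mean are simple vectorized reductions.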
Across multiple inference runs with varying ground-truth parameters, the learned parameters achieved mean absolute percentage errors of 0.06% for k_f (95% CI: 0.04%-0.08%) and 0.13% for k_r (95% CI: 0.07%-0.18%). This level of accuracy exceeds the requirements of most practical applications.

To assess robustness across different kinetic regimes, we repeated the inference procedure for a range of reverse rate constants k_r while holding k_f fixed. Each condition used 100,000 target trajectories and 100,000 model simulations per iteration (250 steps, 250 epochs). This spans equilibrium constants from 0.0078 to 1.0, corresponding to strongly product-favored through balanced conditions (Figure 2D-E). The overall average mean absolute percentage error was 0.09% (95% CI: 0.06%-0.13%), demonstrating that the method succeeds across diverse kinetic regimes.

4.2 Genetic Oscillator

The genetic oscillator by Vilar et al. (30), which involves complex nonlinear dynamics, multiple time scales, and emergent oscillatory behavior (Figure 3A and Supplementary Note 2), provides a more challenging test case and is frequently employed as an exemplar of difficult parameter identifiability problems (31). This system, originally proposed as a model of circadian rhythm generation, consists of two genes encoding an activator protein and a repressor protein. The activator promotes its own expression through positive feedback and activates repressor expression through a feed-forward motif. The repressor inhibits activator function through direct binding, creating the delayed negative feedback necessary for sustained oscillations.
The complete reaction network involves nine molecular species: the activator gene in inactive and active states, the repressor gene in inactive and active states, activator mRNA, repressor mRNA, activator protein (A), repressor protein (R), and the activator-repressor complex (C). The network comprises sixteen reactions, including gene activation and deactivation, transcription, translation, degradation, and complex formation.

We focused on inferring five protein parameters that control the oscillatory dynamics: the activator translation rate, the repressor translation rate, the activator protein degradation rate, the repressor protein degradation rate, and the complex formation rate. The remaining eleven parameters were held fixed at their nominal values. This partial inference scenario reflects realistic situations where some parameters are known from biochemical measurements while others must be inferred from dynamic data.

Synthetic target data was generated using a start-point map approach. A reference trajectory of 2,400,000 steps was simulated with ground-truth parameters, from which 262,144 start points were sampled. Each training iteration simulated short trajectory segments of fixed horizon from these start points, using 25 replicate simulations per start point and batches of 8,192 start points per gradient step (204,800 total simulations per step). The loss function measured the discrepancy (mean squared error) between the rates of change of simulated and target trajectory segments for the three protein species (activator A, repressor R, and complex C). Each rate is estimated as the change in mean species count from the initial state to the mean of the final 100 simulation steps, divided by the corresponding elapsed time.
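The rate-of-change estimator used in this loss can be sketched directly from its definition; the replicate counts and drift below are synthetic illustrative data, not oscillator trajectories.

```python
import numpy as np

# Segment rate estimate for the start-point-map loss sketch: the change in
# mean species count from the initial state to the mean of the final 100
# steps, divided by the elapsed time. Data here are illustrative.
def segment_rate(counts, times, tail=100):
    """counts: (n_replicates, n_steps) species counts along each segment."""
    x0 = counts[:, 0].mean()               # mean initial count
    x1 = counts[:, -tail:].mean()          # mean over the final steps
    dt = times[-tail:].mean() - times[0]   # mean elapsed time to the tail
    return (x1 - x0) / dt

rng = np.random.default_rng(4)
times = np.linspace(0.0, 5.0, 400)
# 25 replicate segments drifting upward with slope 2.0 plus noise.
counts = 10.0 + 2.0 * times + rng.normal(0.0, 0.5, size=(25, 400))
rate = segment_rate(counts, times)         # should recover a slope near 2.0
```

Averaging over replicates and over the tail of each segment suppresses the shot noise of individual stochastic trajectories, which is what makes this statistic a stable optimization target.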
Parameter inference used the RMSprop optimizer with a geometric learning-rate decay schedule. Training proceeded for 3,000 epochs, with the Gumbel-Softmax temperature held fixed. Total training time was approximately 19 minutes per run on a single GPU. To ensure robustness, we avoided aggressive annealing and instead employed a median-based Polyak-Ruppert averaging scheme (40) over the final 25% of epochs, pooling across runs.

The inferred parameters achieved excellent agreement with the ground-truth values (Figure 3C-D). Across ten independent inference runs, the global median estimates of the five parameters fell within 95% confidence intervals of 48.39-50.13, 4.81-5.25, 0.97-1.01, 0.196-0.211, and 1.92-2.10, respectively. The mean absolute percentage error (MAPE) across all five parameters was 1.23% (95% CI: 0.93%-1.71%), with per-run instantaneous MAPE trajectories in Figure 3F and the histogram of bootstrapped MAPE values in Figure 3B.

Validation of the inferred parameters extended beyond point estimates to verification of emergent dynamics (Figure 3E). Analysis comparing trajectories generated with learned versus true parameters showed virtually identical results (the same random number sequence was used in both cases). The oscillation period, amplitude, and waveform shape were all faithfully reproduced. This agreement confirms that the optimization successfully captured the functional behavior of the system, not merely individual parameter values.

4.3 Gene Regulatory Network for Image Classification

To demonstrate that the framework scales to high-dimensional parameter spaces where gradient-based optimization becomes not merely helpful but essential, we constructed a gene regulatory network capable of classifying handwritten digits from the MNIST dataset (Supplementary Note 3) (32).
This benchmark is not intended to outcompete neural networks on accuracy, but it is designed to establish that exact stochastic simulation can now be optimized competitively at scales previously considered intractable. With 203,796 trainable parameters, this system is four orders of magnitude larger than any previously demonstrated differentiable stochastic simulation, entering a regime where derivative-free methods such as grid search or evolutionary algorithms are computationally infeasible.

The network architecture draws inspiration from biological gene regulatory systems (41) while incorporating the structure necessary for classification (Figure 4A). The input layer consists of 784 transcription factor species whose concentrations encode the pixel intensities of a 28×28 grayscale image. These transcription factors regulate a hidden layer of 256 genes. The hidden genes in turn regulate 10 output genes corresponding to the digit classes 0 through 9. Each gene has a promoter with two coarse-grained states (OFF and ON). The probability of the ON state is derived from a Boltzmann weight, yielding a logistic sigmoid activation function in which the dimensionless activation energy is an affine function of the copy numbers of the regulating transcription factors (Supplementary Note 3). Classification proceeds by simulating the network dynamics until the output gene concentrations reach a steady state, then selecting the class with the highest expression.

Each regulatory connection is parameterized by weight and bias terms that control the transcription rate of the target gene as a function of regulator copy numbers. The network contains 203,796 trainable parameters in total: 784 × 256 = 200,704 weights from inputs to hidden genes, 256 biases for hidden genes, 256 × 10 = 2,560 weights from hidden to output genes, 10 biases for output genes, and 266 degradation rate parameters.
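The parameter bookkeeping above, together with the promoter activation described in the text, can be sketched as follows. The parameter arithmetic is taken directly from the text; the sign convention E = −(w·x + b) for the dimensionless activation energy is an illustrative assumption (the exact convention is in Supplementary Note 3).

```python
import numpy as np

# Parameter bookkeeping for the gene regulatory classifier, plus the promoter
# ON probability as a logistic sigmoid of an affine activation energy.
n_in, n_hidden, n_out = 784, 256, 10
n_params = (n_in * n_hidden       # 200,704 input -> hidden weights
            + n_hidden            # 256 hidden-gene biases
            + n_hidden * n_out    # 2,560 hidden -> output weights
            + n_out               # 10 output-gene biases
            + n_hidden + n_out)   # 266 degradation rates

def p_on(x, w, b):
    """Boltzmann-weight ON probability: logistic sigmoid of the energy.
    The sign convention E = -(w @ x + b) is an illustrative assumption."""
    E = -(w @ x + b)              # dimensionless activation energy (affine)
    return 1.0 / (1.0 + np.exp(E))
```

The bookkeeping confirms the total of 203,796 trainable parameters quoted in the text.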
This parameter count is comparable to a medium-sized multilayer perceptron but with fundamentally different dynamics, in which the network state evolves through discrete stochastic reactions rather than deterministic matrix operations.

The network was trained on the standard MNIST training set of 60,000 images using minibatches of approximately 682 images and the RMSprop optimizer with learning rate 0.01. Each forward pass simulated 2,880 reaction steps per trajectory, allowing sufficient time for the output gene concentrations to stabilize. Training proceeded for 80 epochs, requiring approximately 4 hours on a single GPU. The final model averages the weights over the last 16 epochs of training. To control molecular copy numbers, we imposed a minimum degradation rate of 0.25 for all genes, which sets an upper bound on the achievable steady-state expression. The loss is the categorical cross-entropy between the true one-hot digit label and the predicted class probabilities. The predicted probabilities are obtained by applying a softmax to the output gene copy numbers at the end of the stochastic simulation, treating the final expression level of each of the 10 output genes as a logit for the corresponding digit class.

The trained network achieved 97.8% accuracy on the held-out test set of 10,000 images using single-pass evaluation. When combining Monte Carlo averaging over 32 stochastic realizations with temporal averaging of output gene expression, accuracy improved to 98.4% (Figure 4E). To test the system's robustness at lower copy numbers, we increased the minimum degradation rate to 1.0. While this high-degradation regime initially reduced the single-pass accuracy to 84.6%, applying Monte Carlo or temporal averaging restored the accuracy to over 97.0% (Figure 4F).
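The readout and loss described above amount to a standard softmax cross-entropy applied to terminal copy numbers. A minimal sketch (names and structure are ours, not the paper's code):

```python
import math

def predict_and_loss(output_counts, true_class):
    """Treat the final copy numbers of the 10 output genes as logits:
    softmax -> class probabilities, then categorical cross-entropy
    against the one-hot true label.  Returns (predicted class, loss)."""
    m = max(output_counts)                      # max-shift for stability
    exps = [math.exp(c - m) for c in output_counts]
    z = sum(exps)
    probs = [e / z for e in exps]
    loss = -math.log(probs[true_class])         # cross-entropy vs one-hot
    predicted = probs.index(max(probs))
    return predicted, loss
```

A trajectory that ends with the correct output gene clearly dominating incurs a near-zero loss, while a misclassified trajectory is penalized in proportion to how little probability mass the softmax assigns to the true class.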
These results demonstrate that gene regulatory network dynamics can implement nontrivial computations when parameters are appropriately tuned through gradient descent, an optimization that would be impossible without the differentiable framework developed here.

Analysis of the trained network revealed interpretable structure in the learned parameters. The input-to-hidden weights exhibited spatial organization reminiscent of Gabor-like filters, with different hidden genes responding to edges at different orientations and positions. The hidden-to-output weights showed class-specific patterns indicating which hidden gene activations support each digit classification. The gene expression dynamics during classification exhibited characteristic trajectories in which competition between output genes is resolved through the stochastic regulatory interactions (Figure 5 and Supplementary Figure 1).

4.4 Application to Experimental Data: Ion Channel Gating Kinetics

The preceding benchmarks demonstrate scalability on systems with known ground-truth parameters. A final critical validation is performance on experimental data, where measurement noise, model mismatch, and biological variability introduce challenges absent from synthetic benchmarks. We applied our framework to infer gating kinetics from single-channel patch-clamp recordings (34), a classic problem in ion channel biophysics where stochastic transitions between discrete conformational states produce the observed current fluctuations (Supplementary Note 4). This application tests the method in three important ways beyond the synthetic benchmarks. First, it confronts experimental noise and potential model mismatch. Second, it demonstrates generality beyond chemical reactions.
Ion channel conformational dynamics are governed by the same continuous-time Markov chain formalism but represent a physically distinct class of systems involving conformational changes of a single macromolecule. Third, and most importantly, it probes the regime of extreme discreteness where the quasi-continuous approximations underlying many numerical methods break down entirely. With only N = 2 channels in the patch, the observable can take only three values (0, 1, or 2 open channels). Every single stochastic transition causes a macroscopic change in the system state.

We analyzed whole-cell patch-clamp recordings from HEK293 cells expressing a voltage-gated ion channel, obtained at a +40 mV holding potential (34). The dataset comprises 100 independent sweeps, each containing 801 timepoints sampled at 0.01 ms resolution over an 8 ms window following depolarization. Current traces were idealized to discrete channel states, yielding the number of open channels (0, 1, or 2) at each timepoint. The ensemble-averaged open probability exhibits the characteristic activation-inactivation kinetics typical of voltage-gated channels: rapid activation to a peak open probability of approximately 0.62 within 1 ms, followed by slower decay toward zero as channels enter an inactivated state (Figure 6B).

We modeled the channel as a three-state system with closed (C), open (O), and inactivated (I) conformations (42):

C ⇌ O → I,   (9)

with opening rate k_open (C → O), closing rate k_close (O → C), and inactivation rate k_inact (O → I). The inactivated state is absorbing on the timescale of the recording, reflecting the experimental observation that channels do not recover from inactivation during individual sweeps. This model has three unknown rate constants (k_open, k_close, k_inact), three molecular species, and three reaction channels.
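The three-state scheme of Eq. (9) can be simulated exactly with a few lines of direct-method SSA. This is a minimal sketch for orientation only; the rate values used below are illustrative, not the learned constants.

```python
import math
import random

def simulate_channel(k_open, k_close, k_inact, t_max, n_channels=2, rng=None):
    """Exact SSA for N independent channels with states C <-> O -> I.
    State is the count vector (nC, nO, nI); reactions: C->O (rate k_open*nC),
    O->C (k_close*nO), O->I (k_inact*nO).  Returns (times, open_counts)."""
    rng = rng or random.Random()
    nC, nO, nI = n_channels, 0, 0
    t, times, opens = 0.0, [0.0], [0]
    while t < t_max:
        a = [k_open * nC, k_close * nO, k_inact * nO]   # propensities
        a0 = sum(a)
        if a0 == 0.0:            # all channels inactivated (absorbing)
            break
        t += -math.log(rng.random()) / a0               # exponential waiting time
        r = rng.random() * a0                           # categorical reaction choice
        if r < a[0]:
            nC, nO = nC - 1, nO + 1
        elif r < a[0] + a[1]:
            nC, nO = nC + 1, nO - 1
        else:
            nO, nI = nO - 1, nI + 1
        times.append(t)
        opens.append(nO)
    return times, opens
```

Because I is absorbing, every trajectory terminates with zero open channels, reproducing the decay of the ensemble-averaged open probability toward zero.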
Each simulation begins with all channels in the closed state, matching the experimental protocol in which channels recover during the inter-sweep interval at a hyperpolarized holding potential.

We minimized the mean squared error between the simulated and experimental ensemble-averaged open-channel counts. Explicitly, the loss is an MSE between the simulated and experimental mean open-channel counts, computed over time-binned intervals defined by the experimental recording grid. Rather than interpolating trajectories onto a grid, the simulated piecewise-constant (SSA) trajectories are exactly time-averaged over each experimental bin by computing the cumulative integral of the step function and dividing by the bin width. The experimental target for each bin is approximated as the average of the measured values at the bin endpoints. The loss is then the mean squared difference between the simulated and experimental bin-averaged open-channel counts across all time bins. Each gradient step used 262,144 parallel stochastic simulations of 20 Gillespie steps each, with trajectories mapped onto the experimental time grid through this bin averaging. Training proceeded for 400 epochs using the RMSprop optimizer with an initial learning rate of 0.05, annealed proportionally to the Gumbel-Softmax temperature from 0.05 down to 0.0005. Total training time was approximately 112 seconds on a single GPU.

The learned rate constants k_open, k_close, and k_inact (in ms⁻¹) are physically reasonable: the opening rate exceeds the closing rate by a factor of seven, consistent with the high peak open probability observed experimentally, while the inactivation rate is fastest, explaining the rapid decay from peak. The ratio k_open/(k_open + k_close) predicts an equilibrium open probability in the absence of inactivation that matches the observed peak.
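The exact per-bin time-averaging of a piecewise-constant SSA trajectory can be sketched as follows. This direct overlap-sum version is mathematically equivalent to the cumulative-integral formulation described above but unoptimized; it is our illustration, not the paper's batched GPU implementation.

```python
def bin_average_step_function(times, values, bin_edges):
    """Exact time-average of a right-continuous step function over each
    bin [e_i, e_{i+1}): integrate the step function across the bin and
    divide by the bin width.  `values[i]` holds on [times[i], times[i+1]);
    the final value extends to the end of the last bin."""
    averages = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        integral = 0.0
        for i, v in enumerate(values):
            seg_lo = times[i]
            seg_hi = times[i + 1] if i + 1 < len(times) else hi
            a, b = max(lo, seg_lo), min(hi, seg_hi)   # overlap with the bin
            if b > a:
                integral += v * (b - a)
        averages.append(integral / (hi - lo))
    return averages
```

For example, a trajectory holding value 0 on [0, 1), 2 on [1, 2), and 4 thereafter has exact averages 1.0 over the bin [0, 2) and 4.0 over [2, 4), with no interpolation error regardless of where the jumps fall relative to the grid.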
Validation using 30,000 independent exact Gillespie simulations with the learned parameters yielded excellent agreement with the experimental data (Figures 6B, 6C, and 6E). The model mean closely tracks the experimental mean throughout the activation and inactivation phases, achieving R² = 0.987, RMSE = 0.021 open channels, and a normalized RMSE of 3.5% relative to the data range. The model ensemble also captures the trial-to-trial variability observed in experimental sweeps, with individual simulated trajectories exhibiting stochastic fluctuations qualitatively similar to the experimental recordings.

Parameter trajectories during training show rapid initial convergence followed by refinement (Figure 6D). The inactivation rate k_inact converges fastest, as it dominates the long-time behavior where all trajectories approach the absorbing state. The opening and closing rates require more iterations to disentangle, as they jointly determine the transient peak dynamics. The loss decreased by over an order of magnitude during training, with the final loss corresponding to sub-percent-level deviations between model and data.

This application demonstrates that the framework performs well on experimental data with its inherent noise and potential model mismatch. The inferred rate constants are consistent with known ion channel biophysics, and the model reproduces both the mean kinetics and the stochastic variability of the experimental recordings. Crucially, the success in this extreme low-copy-number regime, where the system can occupy only three discrete states and every molecular event produces a macroscopic observable change, demonstrates that the straight-through estimator provides valid gradients even when there is no quasi-continuous limit to fall back on. The method genuinely handles discrete stochastic dynamics.

4.5 Computational Performance

The computational efficiency of our implementation enables practical application to systems biology problems that would otherwise require weeks of computation. All benchmarks were performed on a single NVIDIA RTX 6000 Ada Generation GPU with 48 GB of memory using TensorFlow 2.20. The parallel architecture exploits the independence of trajectory realizations to achieve near-linear scaling with ensemble size up to hardware memory limits. We benchmarked performance on the genetic oscillator system (30), measuring throughput as a function of ensemble size (Supplementary Figure 2 and Supplementary Note 5). For small ensembles of 100 trajectories, the implementation achieved approximately 4 million simulation steps per second. Throughput increased with ensemble size as parallel resources became fully utilized, reaching 1.37–2.30 billion steps per second, depending on the simulator variant, for ensembles of 100,000 trajectories. For comparison, state-of-the-art CPU-based SSA implementations applied to the same oscillator with the same parameters, including GillesPy2 C++ (43) and StochKit2 (44), reach 1.3–1.5 million steps per second per trajectory (43). Our GPU implementation therefore achieves a roughly 1,000-fold improvement over the fastest single-trajectory CPU implementations when simulating large ensembles. This massive parallelism is essential for gradient estimation, which requires averaging over large ensembles to reduce variance. The computational overhead of maintaining differentiability is minimal: benchmarks comparing standard Gillespie, Gumbel-Max, and Gumbel-Softmax variants show near-identical throughput across all ensemble sizes.
This demonstrates that the differentiable framework adds negligible overhead compared to non-differentiable implementations while enabling gradient-based optimization that converges in far fewer iterations than derivative-free alternatives. Training time for parameter inference depends on system complexity and the desired accuracy. The dimerization system converges within minutes, the genetic oscillator within tens of minutes, and the MNIST network within hours. These timescales are competitive with or faster than alternative approaches while providing the additional benefit of gradient information that can be used for sensitivity analysis, uncertainty quantification, and further optimization.

5. Discussion

Our results demonstrate that statistically exact stochastic simulation and deep-learning-scale gradient optimization are no longer mutually exclusive. Across benchmarks spanning five orders of magnitude, from a two-parameter kinetic system to a 203,796-parameter gene regulatory network, we show that the dimensionality barrier imposed by discrete reaction events can be overcome without relaxing the underlying physics. The central conceptual advance is a rigorous separation of roles, in which the forward pass maintains the hard categorical sampling of the underlying CTMC while the backward pass propagates gradients via a Gumbel-Softmax straight-through estimator. This decoupling resolves a fundamental tradeoff that has constrained the field. While gradients in discrete stochastic systems can be computed through finite-difference schemes for chemical reaction networks (26) or unbiased sensitivity estimators such as PPA (25), these approaches are constrained by an inherent linear scaling bottleneck, effectively precluding their application to the massive models required for complex information processing.
Prior differentiable approaches, such as the soft-forward Gillespie algorithm (27), achieve differentiability by replacing discrete jumps with continuous reaction mixtures. While this permits exact gradient calculation, it changes the dynamics, creating a simulation-reality mismatch that widens with trajectory length. In our approach, the simulation remains physically exact, and approximation is confined to the gradient estimator. By exposing the optimizer to the system's true intrinsic noise through the Gumbel-Max construction, while reusing the same Gumbel noise samples for the Gumbel-Softmax straight-through estimator in the backward pass, learned parameters are validated against the exact intended dynamics by construction.

The MNIST-trained gene regulatory network serves as a proof of principle for a new class of mechanistic machine learning. It illustrates that gradient descent can optimize large-scale stochastic reaction networks to implement non-trivial information processing. This positions stochastic biochemical kinetics not just as a system to be analyzed, but as a learnable substrate for computation, offering a rigorous alternative to black-box neural networks for modeling biological information processing. By transitioning from parameter-by-parameter sensitivity analysis to true massively-parallel backpropagation, our framework provides a concrete inverse-design workflow: specify a dynamic objective, represent the topology as a parameterized CTMC, and optimize hundreds of thousands of rate constants end-to-end while maintaining rigorous stochasticity.
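The forward/backward decoupling with shared Gumbel noise can be illustrated in a few lines of NumPy. This is a sketch only: the hard argmax is the exact sample used forward, and the softmax of the same perturbed log-propensities is the quantity whose gradient a straight-through estimator would use backward (the autodiff machinery itself is omitted, and the function name and interface are ours).

```python
import numpy as np

def gumbel_max_step(log_propensities, rng, tau=0.1):
    """Forward: exact categorical sample via the Gumbel-Max trick
    (argmax of log-propensities plus independent Gumbel noise).
    Backward surrogate: softmax of the SAME perturbed log-propensities
    at temperature tau (Gumbel-Softmax).  A straight-through estimator
    uses `hard` in the forward pass and the gradient of `soft` backward."""
    g = -np.log(-np.log(rng.uniform(size=len(log_propensities))))  # Gumbel(0,1)
    perturbed = log_propensities + g
    hard = int(np.argmax(perturbed))                 # exact reaction sample
    z = (perturbed - perturbed.max()) / tau
    soft = np.exp(z) / np.exp(z).sum()               # differentiable surrogate
    return hard, soft
```

Because the same Gumbel draws enter both branches, the soft vector always peaks at the sampled reaction, and as tau decreases it concentrates on that index, keeping the surrogate Jacobian aligned with the transition that actually fired.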
This capability complements emerging generative frameworks that automate the topological discovery of biomolecular networks, providing a rigorous numerical backbone for tuning the kinetics of generated candidates (23).

Our application to ion channel gating kinetics provides the critical stress test for this framework. In this regime, with only two channels and macroscopic observability of every transition, there is no quasi-continuous limit to exploit. The successful recovery of gating kinetics (R² = 0.987) confirms that our gradient estimator provides valid descent directions even when the system dynamics are dominated by individual discrete events. This result suggests that the method is robust enough for single-molecule biophysics and low-copy-number signaling contexts where continuum approximations fail entirely.

The mathematical framework developed here extends well beyond biochemistry. The Gillespie algorithm is a specific instance of the broader class of kinetic Monte Carlo (KMC) methods, such as the Bortz-Kalos-Lebowitz (BKL) algorithm used in condensed matter physics. Our Gumbel-Softmax gradient estimator applies equally to these methods and is therefore immediately applicable to inverse problems in materials science, such as optimizing atomic potentials in crystal growth or tuning transition rates in defect migration, as well as to stochastic epidemiological models and queueing networks. Any system governed by the master equation and simulated via competing Poisson processes can, in principle, be optimized using this differentiable framework.

The straight-through estimator is typically biased, since the gradient it computes is a surrogate, not the true gradient of the expected objective. However, in the context of stochastic optimization, variance is often the more binding constraint.
Unbiased likelihood-ratio estimators suffer from variance that explodes with trajectory length, rendering them useless for deep temporal models. Our results suggest that the low variance and strong correlation of the Gumbel-Softmax estimator outweigh its theoretical bias, providing a practical signal for optimization where unbiased alternatives diverge. Future theoretical work should further characterize the conditions under which this bias-variance trade-off is favorable.

Ultimately, our results provide a practical unification: exact discrete-event simulation becomes a backpropagation-compatible operator. By confining approximation to the gradient step and preserving exactness in the simulation, we remove the historical barrier between physical fidelity and scalable optimization. This enables gradient-based parameter inference and inverse design for stochastic systems at a scale previously reserved for differentiable surrogate models.

References

1. M. B. Elowitz, A. J. Levine, E. D. Siggia, P. S. Swain, Stochastic gene expression in a single cell. Science 297, 1183-1186 (2002).
2. H. H. McAdams, A. Arkin, Stochastic mechanisms in gene expression. Proc Natl Acad Sci U S A 94, 814-819 (1997).
3. D. J. Wilkinson, Stochastic Modelling for Systems Biology (Chapman and Hall/CRC, ed. 3, 2018).
4. C. Chang, M. Garcia-Alcala, L. Saiz, J. M. G. Vilar, P. Cluzel, Robustness of DNA looping across multiple cell divisions in individual bacteria. Proc Natl Acad Sci U S A 119, e2200061119 (2022).
5. I. Golding, J. Paulsson, S. M. Zawilski, E. C. Cox, Real-time kinetics of gene activity in individual bacteria. Cell 123, 1025-1036 (2005).
6. J. M. G. Vilar, L. Saiz, Dynamics-informed deconvolutional neural networks for super-resolution identification of regime changes in epidemiological time series. Science Advances 9, eadf0673 (2023).
7. R. Zandi, P.
van der Schoot, D. Reguera, W. Kegel, H. Reiss, Classical nucleation theory of virus capsids. Biophysical Journal 90, 1939-1948 (2006).
8. J. M. G. Vilar, L. Saiz, Actionable forecasting as a determinant of biological adaptation. Advanced Science 12, 2413153 (2025).
9. J. M. G. Vilar, L. Saiz, The unreasonable effectiveness of equilibrium gene regulation through the cell cycle. Cell Systems 15, 639-648.e632 (2024).
10. A. Raj, A. van Oudenaarden, Nature, nurture, or chance: stochastic gene expression and its consequences. Cell 135, 216-226 (2008).
11. E. Korobkova, T. Emonet, J. M. G. Vilar, T. S. Shimizu, P. Cluzel, From molecular noise to behavioural variability in a single bacterium. Nature 428, 574-578 (2004).
12. J. M. Vilar, R. V. Solé, Effects of noise in symmetric two-species competition. Physical Review Letters 80, 4099 (1998).
13. D. T. Gillespie, Exact stochastic simulation of coupled chemical reactions. The Journal of Physical Chemistry 81, 2340-2361 (1977).
14. A. B. Bortz, M. H. Kalos, J. L. Lebowitz, New algorithm for Monte Carlo simulation of Ising spin systems. Journal of Computational Physics 17, 10-18 (1975).
15. N. G. van Kampen, Stochastic Processes in Physics and Chemistry (North-Holland/Elsevier, Amsterdam, ed. 3, 2007).
16. D. P. Kingma, J. Ba, Adam: A method for stochastic optimization. International Conference on Learning Representations (2015).
17. F. Bach, Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research 18, 1-53 (2017).
18. N. Altman, M. Krzywinski, The curse(s) of dimensionality. Nature Methods 15, 399-400 (2018).
19. A. Golightly, D. J. Wilkinson, Bayesian parameter inference for stochastic biochemical network models using particle Markov chain Monte Carlo. Interface Focus 1, 807-820 (2011).
20. T. Toni, D. Welch, N. Strelkowa, A. Ipsen, M. P. H.
Stumpf, Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of The Royal Society Interface 6, 187-202 (2009).
21. R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 229-256 (1992).
22. Y. Bengio, N. Léonard, A. Courville, Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432 (2013).
23. M. Filo, N. Rossi, Z. Fang, M. Khammash, GenAI-Net: A generative AI framework for automated biomolecular network design. arXiv:2601.17582 (2026).
24. G. Tucker, A. Mnih, C. J. Maddison, J. Lawson, J. Sohl-Dickstein, REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. Advances in Neural Information Processing Systems 30 (2017).
25. A. Gupta, M. Khammash, An efficient and unbiased method for sensitivity analysis of stochastic reaction networks. Journal of The Royal Society Interface 11 (2014).
26. M. Rathinam, P. W. Sheppard, M. Khammash, Efficient computation of parameter sensitivities of discrete stochastic chemical reaction networks. The Journal of Chemical Physics 132 (2010).
27. K. Rijal, P. Mehta, A differentiable Gillespie algorithm for simulating chemical kinetics, parameter estimation, and designing synthetic biological circuits. eLife 14, RP103877 (2025).
28. E. Jang, S. Gu, B. Poole, Categorical reparameterization with Gumbel-Softmax. International Conference on Learning Representations (2017).
29. C. J. Maddison, A. Mnih, Y. W. Teh, The Concrete distribution: A continuous relaxation of discrete random variables. International Conference on Learning Representations (2017).
30. J. M. G. Vilar, H. Y. Kueh, N. Barkai, S. Leibler, Mechanisms of noise-resistance in genetic oscillators.
Proc Natl Acad Sci U S A 99, 5988-5992 (2002).
31. R. N. Gutenkunst, J. J. Waterfall, F. P. Casey, K. S. Brown, C. R. Myers, J. P. Sethna, Universally sloppy parameter sensitivities in systems biology models. PLoS Comput Biol 3, 1871-1878 (2007).
32. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278-2324 (1998).
33. P. Y. Simard, D. Steinkraus, J. C. Platt, in Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings (IEEE Computer Society, 2003), vol. 3, pp. 958-958.
34. Z. Selimi, J.-S. Rougier, H. Abriel, J. P. Kucera, A detailed analysis of single-channel Nav1.5 recordings does not reveal any cooperative gating. The Journal of Physiology 601, 3847-3868 (2023).
35. F. Mottes, Q.-Z. Zhu, M. P. Brenner, Gradient-based optimization of exact stochastic kinetic models. arXiv:2601.14183 (2026).
36. Y. Cao, D. T. Gillespie, L. R. Petzold, Efficient step size selection for the tau-leaping simulation method. J Chem Phys 124, 044109 (2006).
37. D. T. Gillespie, The chemical Langevin equation. The Journal of Chemical Physics 113, 297-306 (2000).
38. C. Jia, R. Grima, Holimap: an accurate and efficient method for solving stochastic gene network dynamics. Nature Communications 15, 6557 (2024).
39. E. J. Gumbel, Statistical Theory of Extreme Values and Some Practical Applications: A Series of Lectures (US Government Printing Office, 1954), vol. 33.
40. B. T. Polyak, A. B. Juditsky, Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30, 838-855 (1992).
41. J. M. G. Vilar, L. Saiz, Systems biophysics of gene expression. Biophys J 104, 2574-2585 (2013).
42. D. Colquhoun, On the stochastic properties of single ion channels. Proceedings of the Royal Society of London. Series B.
Biological Sciences 211, 205-235 (1981).
43. S. Matthew, F. Carter, J. Cooper, M. Dippel, E. Green, S. Hodges, M. Kidwell, D. Nickerson, B. Rumsey, J. Reeve, L. R. Petzold, K. R. Sanft, B. Drawert, GillesPy2: A biochemical modeling framework for simulation driven biological discovery. Lett Biomath 10, 87-103 (2023).
44. K. R. Sanft, S. Wu, M. Roh, J. Fu, R. K. Lim, L. R. Petzold, StochKit2: software for discrete stochastic simulation of biochemical systems with events. Bioinformatics 27, 2457-2458 (2011).

Acknowledgments

Funding: J.M.G.V. acknowledges support from Ministerio de Ciencia, Innovación y Universidades (MICIU/AEI/10.13039/501100011033/FEDER, UE) under Grant No. PID2024-160016NB-I00.

Author contributions: J.M.G.V. and L.S. designed research, performed research, analyzed data, and wrote the paper.

Data availability: Single-channel recordings are publicly available at Zenodo: https://zenodo.org/records/7817601. File used: HEK293_Cell01_40mV_100s_raw_DET_IDEl.xlsx.

Code availability: All the code used in this research will be made fully available on GitHub with the final version of this preprint.

Figures

Figure 1. Differentiable exact stochastic simulation via forward/backward decoupling. The exact trajectory {x(t_0), x(t_1), ..., x(t_K)} (central row, with initial state x(t_0)) is the shared backbone for both computation passes. Top (blue, forward pass): At each step k, propensities a_j(x(t_k), θ) are computed from the current discrete state, independent standard Gumbel variates g_j are drawn, and the firing reaction is selected by a hard argmax over the perturbed log-propensities (Gumbel-Max trick). The state updates by the integer stoichiometric vector of the selected reaction, preserving the full intrinsic noise of the continuous-time Markov chain.
Waiting times between reactions are sampled by standard inverse transform (τ_k = −ln(u_k)/a_0(x(t_k), θ), with u_k ~ Uniform(0, 1) and a_0 the total propensity) and are omitted from the diagram for clarity. The loss L(θ) is evaluated over the complete exact trajectory. Bottom (red, backward pass): The entire backward pass implements a Gumbel-Softmax straight-through estimator. At each step k, the same Gumbel variates g_j from the forward pass enter a temperature-controlled softmax to yield differentiable soft reaction probabilities p_j = softmax_j((log a_j(x(t_k), θ) + g_j)/τ_GS), which define the surrogate state update x̃(t_{k+1}) = x(t_k) + Σ_j p_j ν_j, where τ_GS is the Gumbel-Softmax temperature and ν_j is the stoichiometric vector of reaction j. The surrogate gradient dL/dθ = Σ_k λ(t_{k+1})ᵀ ∂x̃(t_{k+1})/∂θ couples each per-step surrogate Jacobian with the backpropagated adjoint λ(t_{k+1}) = dL/dx(t_{k+1}), which carries the accumulated sensitivity of L to x(t_{k+1}) through all downstream surrogate transitions. The use of ∂ (partial) versus d (total) distinguishes local sensitivities, evaluated at the exact discrete state x(t_k) with the Gumbel draws held fixed, from total sensitivities propagated via the chain rule through the entire trajectory.
When the loss depends on all states, the total derivative satisfies the recursion λ(t_K) = ∂L/∂x(t_K) and, for k < K, λ(t_k) = ∂L/∂x(t_k) + (∂x̃(t_{k+1})/∂x(t_k))ᵀ λ(t_{k+1}), where ∂L/∂x(t_k) is the direct dependence of the loss on state x(t_k). All surrogate derivatives are evaluated at the exact discrete states, not at relaxed or approximate states. This expression is exact for the surrogate computational graph; the only approximation is that the straight-through estimator is a biased gradient estimator for the true discrete process.

Figure 2. Parameter inference for the reversible dimerization system. (A) Reaction scheme. Two monomers A and B reversibly form a dimer C with forward rate constant k_f and reverse rate constant k_r. This minimal system (2 parameters, 3 species, 2 reactions) serves as a validation benchmark. (B) Target and learned trajectories. Solid lines show ensemble-averaged target trajectories for species A (blue), B (green), and C (red) generated with the ground-truth parameters k_f and k_r. Light traces show individual sample trajectories simulated with learned parameters. The learned dynamics closely match the target across the full time course, including transient and near-equilibrium regimes. (C) Parameter convergence and loss. Learned values of k_f (red) and k_r (green) converge to their true values (dashed lines) within approximately 50 epochs. The loss (gray, right axis) decreases by over three orders of magnitude during optimization. (D) Learned k_r versus true k_r. Inference was repeated across a range of reverse rate constants k_r spanning two orders of magnitude, with k_f held fixed. Each condition used 100,000 target trajectories and 100,000 model simulations per iteration (250 steps, 250 epochs).
Learned values (green circles) align with the ideal identity relationship (dashed line) across two orders of magnitude, demonstrating accurate recovery regardless of kinetic regime. The circled point indicates the run shown in panels B–C. (E) Learned k_f across varying k_r. The inferred value of k_f (green circles) remains within 0.1% of the true value (dashed line) across all tested k_r values, confirming that parameter inference is robust and that k_f and k_r are independently identifiable. The circled point indicates the run of panel B. Average MAPE across all conditions: 0.09% (95% CI: 0.06%–0.13%).

Figure 3. Parameter inference for the genetic oscillator. (A) Reaction network schematic. A gene regulatory network exhibiting sustained oscillations through coupled positive and negative feedback loops. The activator protein A promotes its own expression and activates repressor expression; the repressor R sequesters A through the formation of the C complex, creating delayed negative feedback. Five parameters controlling the oscillatory dynamics were inferred from 9 species and 16 reactions. Black arrows indicate the reactions involving the inferred parameters. Grey lines and arrows indicate the parts of the network not directly related to these parameters. (B) Bootstrap distribution of the global mean absolute percentage error (MAPE). The histogram shows the MAPE distribution of the estimated median parameters (median across runs of the last 25% of values) computed from 10,000 block bootstrap resamples of the pooled parameter estimates across all runs. The observed MAPE of 1.23% (orange vertical line) falls near the center of the distribution. Black dashed vertical lines indicate the 95% bootstrap confidence interval. (C) Global median parameter estimates with 95% confidence intervals.
Bar heights show the ratio of learned to true parameter values for each of the five inferred parameters. Error bars indicate 95% bootstrap confidence intervals. The orange dashed line at 1.0 represents perfect recovery. All parameters are recovered to within a few percent of their true values, with confidence intervals that include or nearly include the true value in all cases. (D) Individual parameter convergence trajectories over 3,000 training epochs, with different colors representing independent runs initialized from different random starting points. Black dashed lines indicate ground-truth values. All parameters converge to near-true values within the first 1,000 epochs and remain stable thereafter. (E) Learned versus true trajectories. Representative single-trajectory simulations for the three key molecular species with ground-truth parameters (solid blue) and learned parameters (dashed orange). (F) MAPE convergence across runs. MAPE (%) is plotted on a logarithmic scale as a function of training epoch for each run (same color coding as panel D). MAPE decreases from its initial values to approximately 1–5% by epoch 3,000, with most runs achieving errors near or below 2% in the final epochs.

Figure 4. Architecture, training, and evaluation of a gene regulatory network for MNIST digit classification. (A) Network architecture. Input pixel intensities (28×28 = 784 values) define fixed transcription factor concentrations that regulate 256 hidden genes through a thermodynamic promoter model, where the probability of gene activation follows a sigmoid function $p(\mathrm{ON}) = \sigma(\Delta E)$ of the activation energy $\Delta E$. Hidden genes stochastically regulate 10 output genes corresponding to digit classes 0–9. Classification proceeds by simulating reaction steps and selecting the output gene with the highest copy number.
The network contains 203,796 trainable parameters (weights, biases, and degradation rates). (B) Loss convergence during training. Training loss (blue) and test loss (red) decrease over approximately 80 epochs. The vertical dashed red line indicates the epoch at which the straight-through (ST) estimator is switched on. (C) Classification accuracy during training. Training (blue) and test (red) accuracy increase over the same epoch range, reaching approximately 98% accuracy. (D) Temperature annealing schedule. The Gumbel-Softmax temperature is annealed during training, beginning at a high value to encourage exploration and decreasing to sharpen the categorical approximation for more precise gradient signals. (E) Test accuracy as a function of Monte Carlo averaging and temporal averaging. Heatmap shows classification accuracy on the MNIST test set as a function of the number of Monte Carlo (MC) samples (x-axis: 1–32) and the final fraction of simulation time steps over which output gene expression is averaged for classification (y-axis: 0.00–0.16). A final fraction of 0.00 uses only the terminal time point; larger fractions average molecule counts over more time steps, reducing stochastic noise in the readout. (F) Test accuracy under a higher degradation rate. The higher minimum degradation rate (switched from 0.25 to 1.0) produces lower baseline accuracy, but with larger gains from both MC and temporal averaging compared to panel E.
In both settings, the largest accuracy gains come from MC averaging (moving right across columns), with diminishing returns beyond approximately 8 samples. These results demonstrate that the inherent stochasticity of the gene regulatory network can be leveraged through ensemble methods (e.g., populations of cells in biological systems) to achieve robust classification performance.

Figure 5. Stochastic dynamics of the gene regulatory network during sequential digit classification. (A) The network was presented with three consecutive MNIST images (digits 9, 6, and 4) to demonstrate real-time classification dynamics. (Left) Input images: digit “9” (simulation steps 0–5,759), digit “6” (steps 5,760–11,519), and digit “4” (steps 11,520+). (Middle) Hidden gene dynamics. Heatmap showing expression levels (molecule counts) for all 256 hidden genes over approximately 18,000 simulation steps. Dashed vertical lines indicate input image transitions. Different subsets of hidden genes activate in response to each digit, demonstrating that the hidden layer develops sparse, digit-specific feature representations through training. (Right) Output gene dynamics. Heatmap showing expression levels for the 10 output genes corresponding to digit classes 0–9. During each input phase, the output gene matching the true digit label shows elevated expression while competing genes remain suppressed. The network correctly predicts all three digits (Pred: 9 → 6 → 4). (B) Output gene time traces. Continuous-time expression traces for all 10 output genes from a stochastic simulation, with genes corresponding to correct predictions highlighted: Gene 9 (cyan), Gene 6 (pink), and Gene 4 (purple). Dashed vertical lines mark input switching times. The stochastic winner-take-all dynamics resolve competition between output genes, with the correct class achieving dominant expression in each phase.
True sequence: 9 → 6 → 4; Predicted sequence: 9 → 6 → 4.

Figure 6. Parameter inference for ion channel gating kinetics from patch-clamp recordings. (A) Kinetic scheme. The three-state model consists of closed (C), open (O), and inactivated (I) conformations with first-order transitions governed by rate constants $k_{\mathrm{open}}$, $k_{\mathrm{close}}$, and $k_{\mathrm{inact}}$. The inactivated state is absorbing, reflecting the absence of recovery on the experimental timescale. (B) Model fit to experimental data. Blue line shows the experimental ensemble mean open channel count (N = 100 sweeps) with shaded region indicating the 95% bootstrap confidence interval. Red dashed line shows the model prediction using the learned rate constants, evaluated from independent exact Gillespie simulations. The model captures both the rapid activation phase (peak at ~1 ms) and the slower inactivation decay. (C) Sample stochastic trajectories. Representative simulations (colored lines) using the learned parameters show discrete transitions between 0, 1, and 2 open channels. With only N = 2 channels, every stochastic event produces a macroscopic change in the observable. This extreme discreteness regime tests whether the method relies on quasi-continuous dynamics. Black line shows the experimental mean for reference. (D) Parameter convergence during training. Rate constants evolve from initial guesses toward stable values over 400 epochs. The inactivation rate $k_{\mathrm{inact}}$ (green) converges fastest as it dominates long-time behavior, while $k_{\mathrm{open}}$ (blue) and $k_{\mathrm{close}}$ (orange) require additional iterations to resolve the transient dynamics. Loss (brown, right axis) decreases by over two orders of magnitude during training. (E) State distribution. Comparison of experimental (blue) and simulated (red) probability distributions over the number of open channels (0, 1, or 2).
The close agreement confirms that the learned parameters reproduce the correct equilibrium occupancy of channel states. Data: HEK293 cells, +40 mV holding potential, 100 sweeps.

Supplementary Figure 1. Stochastic dynamics of the gene regulatory network during sequential digit classification with a degradation rate lower bound of 1.0. Same format, input sequence, and network architecture as Figure 5 (digits 9, 6, and 4), but using the model trained with a higher degradation rate (lower bound = 1.0; see Figure 4F). (A) (Left) Input images presented sequentially: digit “9” (simulation steps 0–5,759), digit “6” (steps 5,760–11,519), and digit “4” (steps 11,520+). (Middle) Hidden gene dynamics. Heatmap showing expression levels (molecule counts) for all 256 hidden genes over approximately 18,000 simulation steps. Dashed vertical lines indicate input image transitions. Despite the higher degradation rate, which increases molecular turnover and stochastic noise, distinct subsets of hidden genes activate in response to each digit. (Right) Output gene dynamics. Heatmap showing expression levels for the 10 output genes corresponding to digit classes 0–9. The network correctly classifies all three digits (Pred: 9 → 6 → 4). (B) Output gene time traces from the stochastic simulation. Continuous-time expression traces for all 10 output genes, with genes corresponding to correct predictions highlighted: Gene 9 (cyan), Gene 6 (pink), and Gene 4 (purple). Dashed vertical lines mark input switching times. The correct output gene achieves dominant expression during each input phase, demonstrating that the network maintains accurate classification even under elevated degradation, albeit with increased stochastic fluctuations in gene expression levels compared to the lower degradation rate regime (Figure 5). True sequence: 9 → 6 → 4; Predicted sequence: 9 → 6 → 4.

Supplementary Figure 2.
Computational throughput as a function of ensemble size. Simulation throughput (steps per second) was measured for the oscillator system using three reaction selection methods: standard Gillespie with categorical sampling (blue circles), Gumbel-Max with argmax selection (orange inverted triangles), and Gumbel-Softmax with temperature-controlled soft sampling (brown triangles). All methods show near-linear scaling with ensemble size across the tested range of parallel simulations. At large ensemble sizes, throughput reaches approximately $1.9 \times 10^9$ steps per second, with all three methods achieving comparable performance. The near-identical throughput across methods demonstrates that the differentiable framework incurs minimal computational overhead compared to non-differentiable implementations. Benchmarks were performed on an NVIDIA RTX 6000 Ada Generation GPU with TensorFlow 2.20.
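The three reaction-selection methods benchmarked above share one piece of machinery: taking the argmax over Gumbel-perturbed log-propensities is an exact categorical draw (equivalent to Gillespie's direct-method selection), while the temperature-controlled softmax of the same perturbed logits provides the differentiable surrogate used in the backward pass. The following NumPy sketch illustrates this relationship; the function name and interface are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gumbel_softmax_select(log_propensities, tau=0.5, rng=None):
    """Illustrative sketch (not the paper's code): sample a reaction via the
    Gumbel-Max trick and return both the exact one-hot choice and its
    Gumbel-Softmax relaxation, which serves as the backward-pass surrogate."""
    rng = np.random.default_rng() if rng is None else rng
    g = rng.gumbel(size=log_propensities.shape)   # i.i.d. Gumbel(0, 1) noise
    z = log_propensities + g                      # perturbed logits
    hard = np.eye(len(z))[np.argmax(z)]           # exact categorical sample (one-hot)
    e = np.exp((z - z.max()) / tau)               # temperature-controlled relaxation
    soft = e / e.sum()
    # Straight-through wiring in an autodiff framework would be:
    #   out = soft + stop_gradient(hard - soft)
    # so the forward trajectory stays an exact sample while gradients
    # flow through the soft relaxation.
    return hard, soft
```

Sampling repeatedly with fixed propensities recovers the Gillespie selection probabilities $a_i / \sum_j a_j$, which is why the forward trajectories remain exact regardless of the temperature used for the surrogate.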
