Improving Infinitely Deep Bayesian Neural Networks with Nesterov's Accelerated Gradient Method

Chenxu Yu*, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, P.R. China, yuchenxu1024@163.com
Wenqi Fang†, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, P.R. China, wq.fang@siat.ac.cn

Abstract—As a representative continuous-depth neural network approach, stochastic differential equation (SDE)-based Bayesian neural networks (BNNs) have attracted considerable attention due to their solid theoretical foundations and strong potential for real-world applications. However, their reliance on numerical SDE solvers inevitably incurs a large number of function evaluations (NFEs), resulting in high computational cost and occasional convergence instability. To address these challenges, we propose a Nesterov-accelerated gradient (NAG) enhanced SDE-BNN model. By integrating NAG into the SDE-BNN framework along with an NFE-dependent residual skip connection, our method accelerates convergence and substantially reduces NFEs during both training and testing. Extensive empirical results show that our model consistently outperforms conventional SDE-BNNs across various tasks, including image classification and sequence modeling, achieving lower NFEs and improved predictive accuracy.

Index Terms—Stochastic differential equation, Bayesian neural network, Nesterov-accelerated gradient, Number of function evaluations

I. INTRODUCTION

Neural networks (NNs) are commonly modeled as discrete sequences of layers, each consisting of an affine transformation followed by a nonlinear activation [1]. This formulation underpins modern deep learning and has driven success across a wide range of applications, including image classification [2], speech recognition [3], and natural language processing [4].
As network depth increases, NNs can be naturally interpreted as discretizations of continuous dynamical systems, providing new perspectives for theoretical analysis, architecture design, and understanding model behavior [5].

As the seminal continuous-depth neural network proposed by Chen et al., Neural Ordinary Differential Equations (Neural ODEs) model hidden states as solutions to an ODE, replacing discrete-layer architectures with continuous-time dynamics [6]. This idea has spawned numerous Neural ODE variants—augmented [7], second-order [8]–[10], control-based [11]–[13], and physically informed [14]–[16]—collectively enhancing the expressivity, efficiency, and versatility of continuous-depth NNs. These formulations enable Neural ODE-based approaches to excel in time-series forecasting [17], image analysis [18], and dynamical system simulation [19], while providing strong generalization performance and enhanced interpretability.

To account for stochastic noise, Neural SDEs extend Neural ODEs to explicitly model inherent randomness and uncertainty in the data [20]. By embedding NNs in the deterministic drift term, the stochastic diffusion term, or both, they offer a flexible framework for modeling complex dynamics that combine structured behavior with randomness, making them well suited for stochastic process modeling, probabilistic forecasting, and noisy system simulation [21]–[24].

Building upon this idea, the SDE-BNN method incorporates Bayesian uncertainty inference into the Neural SDE framework, thereby further improving robustness and generalization [25]. Although SDE-BNNs capture uncertainty through parameter distributions, their dependence on iterative SDE solvers leads to excessive function evaluations, increased computational cost, and potentially unstable convergence.

* Work done during the author's internship at the Shenzhen Institutes of Advanced Technology.
† Corresponding author: Wenqi Fang.
To some extent, recent variants, such as partially stochastic SDE-BNN (PSDE-BNN) [26] and rough path theory–based SDE-BNN (RDE-BNN) [27], alleviate these challenges by introducing partial stochasticity into the SDE-BNN architecture. Unlike prior variants, and inspired by Nesterov-accelerated NODEs [10], we incorporate the Nesterov accelerated gradient (NAG) method into the SDE-BNN framework, termed Nesterov-SDEBNN. The proposed approach improves numerical stability and computational efficiency, thereby accelerating convergence while maintaining—or even enhancing—generalization performance. In contrast to PSDE-BNN and RDE-BNN, which rely on more complex mathematical formulations, Nesterov-SDEBNN achieves these improvements by simply extending the original first-order ODEs in SDE-BNN to second-order dynamics via NAG and revising the residual connection scheme. Experiments on multiple representative open-source datasets show that Nesterov-SDEBNN significantly reduces the NFEs, accelerates convergence, and improves predictive accuracy.

The main contributions of this paper are summarized as follows:
• We propose Nesterov-SDEBNN by integrating the NAG method into the original SDE-BNN framework, complemented by an NFE-dependent residual connection strategy to enhance overall performance.
• Compared with the original SDE-BNN, our method converges faster and achieves higher accuracy on image classification and sequence modeling tasks, while significantly reducing NFEs, thereby accelerating both training and testing in adaptive differential equation solvers.

II. RELATED WORK

A. Neural Ordinary Differential Equations

Neural ODEs parameterize continuous-time dynamics using NNs of the form:

\frac{dh_t}{dt} = f_\theta(h_t, t), \quad h_0 \in \mathbb{R}^d \qquad (1)

where h_t denotes the hidden state at time t and f_θ : ℝ^d × ℝ → ℝ^d is a Lipschitz-continuous NN parameterized by θ [6].
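The forward pass of equation (1) amounts to solving an initial-value problem with a numerical integrator. As a minimal, self-contained sketch (a toy fully connected drift net and a hand-rolled fixed-step midpoint solver; all names here are our own, not the paper's code):

```python
import torch
import torch.nn as nn

# Toy drift network f_theta(h, t). The architecture is illustrative only;
# the paper's drift networks are convolutional for the image tasks.
class Drift(nn.Module):
    def __init__(self, dim=4, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.Tanh(),
                                 nn.Linear(hidden, dim))

    def forward(self, h, t):
        # Append the scalar time t as an extra input feature.
        t_col = torch.full_like(h[:, :1], float(t))
        return self.net(torch.cat([h, t_col], dim=1))

def odeint_midpoint(f, h0, t0=0.0, t1=1.0, steps=20):
    """Fixed-step midpoint integration of dh/dt = f(h, t).
    Each step costs two drift evaluations, so NFE = 2 * steps."""
    h, dt = h0, (t1 - t0) / steps
    for i in range(steps):
        t = t0 + i * dt
        k = f(h, t)                                      # slope at interval start
        h = h + dt * f(h + 0.5 * dt * k, t + 0.5 * dt)   # midpoint slope
    return h

h0 = torch.randn(8, 4)
hT = odeint_midpoint(Drift(), h0)
print(hT.shape)
```

The NFE accounting in the comment is why solver choice dominates the cost of continuous-depth models: every extra solver step is two more network forward passes.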
The forward pass corresponds to numerically solving this ODE from an initial input state h_0 = x, yielding an infinitely deep residual network with universal approximation capability, while enabling memory-efficient training via the adjoint sensitivity method and flexible accuracy–efficiency tradeoffs through adaptive ODE solvers [6]–[10].

B. Nesterov Neural ODEs

To reduce the NFEs and accelerate training and testing, Nguyen et al. [10] introduced the NAG method into Neural ODEs (Nesterov Neural ODE) by extending first-order ODEs to second-order dynamics. Nesterov Neural ODEs were formulated by solving the following second-order differential equation:

\frac{d^2 h_t}{dt^2} + \frac{3}{t}\frac{dh_t}{dt} + f_\theta(h_t, t) = 0 \qquad (2)

To improve computational efficiency and mitigate numerical instability, equation (2) is reformulated as an equivalent first-order system, making it more amenable to NN implementations [10]:

h_t = \sigma_f\!\left(t^{-3/2} e^{t/2} x_t\right), \quad x'_t = \sigma_f(m_t), \quad m'_t = -m_t - \sigma_f(f_\theta(h_t, t)) - \xi h_t, \qquad (3)

Here, x_t denotes an auxiliary variable associated with h_t, m_t represents the momentum term, σ_f is a nonlinear activation function, and ξ controls the residual connection [10]. This reformulation recasts the dynamics as a first-order system that incorporates momentum, leading to accelerated convergence and serving as the basis for our methodology.

C. Stochastic Differential Equation-based Bayesian Neural Networks

Neural SDEs extend Neural ODEs by introducing stochastic components into the dynamics, enabling the model to capture both deterministic and stochastic behavior [20]:

dh_t = f_w(h_t, t)\,dt + g_\theta(h_t, t)\,dB_t \qquad (4)

where f_w and g_θ are the drift and diffusion functions, respectively, and B_t denotes Brownian motion. This formulation allows the network to model randomness inherent in various real-world applications, such as robust pricing and hedging, irregular time series data analysis, and eddy simulation [23], [24], [28].
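An SDE such as equation (4) is typically simulated with the Euler–Maruyama scheme, which adds a Gaussian increment scaled by √dt at each step. A minimal sketch with toy drift and diffusion functions (placeholders, not the paper's networks):

```python
import math
import torch

def euler_maruyama(f, g, h0, t0=0.0, t1=1.0, steps=100):
    """Euler–Maruyama integration of dh_t = f(h_t, t) dt + g(h_t, t) dB_t."""
    h, dt = h0, (t1 - t0) / steps
    for i in range(steps):
        t = t0 + i * dt
        dB = math.sqrt(dt) * torch.randn_like(h)   # Brownian increment ~ N(0, dt)
        h = h + f(h, t) * dt + g(h, t) * dB
    return h

f = lambda h, t: -h                          # toy mean-reverting drift
g = lambda h, t: 0.1 * torch.ones_like(h)    # toy constant diffusion
hT = euler_maruyama(f, g, torch.randn(8, 4))
print(hT.shape)
```

Each step costs one drift and one diffusion evaluation, which is the NFE burden the paper later sets out to reduce.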
Building on Neural SDEs, the SDE-BNN framework integrates SDEs with BNNs to capture uncertainty in both system dynamics and network parameters [22], [25]. In this framework, the NN weights are modeled as stochastic processes. Therefore, training SDE-BNNs via variational inference (VI) involves solving an augmented SDE that jointly tracks the trajectories of both the weights and the network activations [25]:

d \begin{pmatrix} w_t \\ h_t \end{pmatrix} = \begin{pmatrix} f_\phi(w_t, t) \\ f_{w_t}(h_t, t) \end{pmatrix} dt + \begin{pmatrix} g_\theta(w_t, t) \\ 0 \end{pmatrix} dB_t \qquad (5)

where f_φ and f_{w_t} represent the dynamics of the weights and the activations of the network, respectively. The final hidden state h_T (t ∈ (0, T]) is used to parameterize the likelihood of the target output y.

While SDE-BNNs effectively model uncertainty and improve robustness under noisy or incomplete observations, they suffer from a significant drawback: solving SDEs requires a huge number of NFEs, which slows both training and testing. To mitigate this, PSDE-BNN [26] and RDE-BNN [27] introduce partial stochasticity, striking a balance between computational cost and uncertainty modeling. Unlike these methods, our approach tackles this challenge by integrating the NAG method directly into the SDE-BNN framework, offering a simple yet effective solution.

III. METHODOLOGY

In this section, we present the details of our Nesterov-SDEBNN approach.

A. Prior Process and Approximate Posterior over Weights

In essence, the proposed Nesterov-SDEBNN adheres to the Bayesian framework and thus admits well-defined prior and posterior distributions. To obtain a simple prior with bounded marginal variance in the long-time limit, we adopt an Ornstein–Uhlenbeck (OU) process as the prior over the weights, defined by an SDE with the following drift and diffusion terms:

f_p(w_t, t) = -w_t, \quad g(w_t, t) = \sigma I_d, \qquad (6)

where σ is a hyperparameter and I_d is a d × d identity matrix.
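The structure of the augmented SDE (5) with the OU prior (6) can be sketched in one Euler–Maruyama step: only the weight block carries diffusion, while the activations evolve deterministically given the sampled weight path. Function names and the toy activation dynamics below are our own placeholders:

```python
import math
import torch

def em_step(w, h, f_w_drift, f_h_drift, g, t, dt):
    """One Euler–Maruyama step of the augmented SDE (5): the weights w_t
    carry all the diffusion; the activations h_t have a zero diffusion row."""
    dB = math.sqrt(dt) * torch.randn_like(w)          # Brownian increment
    w_next = w + f_w_drift(w, t) * dt + g(w, t) * dB  # stochastic weight update
    h_next = h + f_h_drift(w, h, t) * dt              # deterministic given {w_t}
    return w_next, h_next

# OU prior from eq. (6): drift f_p(w, t) = -w, diffusion g(w, t) = sigma * I.
sigma = 0.1
f_p = lambda w, t: -w
g = lambda w, t: sigma
f_h = lambda w, h, t: torch.tanh(w.mean() * h)        # toy activation dynamics

w, h = torch.randn(16), torch.randn(4)
dt = 0.02
for i in range(50):
    w, h = em_step(w, h, f_p, f_h, g, i * dt, dt)
print(w.shape, h.shape)
```

The OU drift pulls the weight path back toward zero, which is what keeps the prior's marginal variance bounded in the long-time limit.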
For the posterior, we seek an approximation that can capture non-Gaussian, non-factorized marginals, achieved by parameterizing its dynamics with an NN whose capacity can be scaled as needed. Accordingly, we implicitly define the approximate posterior over the weights via an SDE with a learned drift:

f_q(w_t, t, \phi) = \mathrm{NN}_\phi(w_t, t) - f_p(w_t, t), \qquad (7)

where the posterior drift f_q is parameterized by a small NN with parameters φ, while the diffusion term is kept the same as in the prior.

B. Evaluation of the Network

By incorporating the NAG method into the SDE-BNN framework, we can easily derive the modified dynamics by combining equations (3) and (5), expressed in terms of the joint state H_t := (x_t, m_t, w_t):

h_t = \sigma_f\!\left(t^{-3/2} e^{t/2} x_t\right), \quad
dH_t = \begin{pmatrix} \sigma_f(m_t) \\ -m_t - \sigma_f(f_{w_t}(h_t, t)) - \xi h_t \\ f_\phi(w_t, t) \end{pmatrix} dt + \begin{pmatrix} 0 \\ 0 \\ g(w_t, t) \end{pmatrix} dB_t. \qquad (8)

Notably, the residual term ξh_t in equation (8) is added when evaluating the drift of m_t. Due to the time-stepping nature of SDE solvers, each increment of NFE_f (which denotes the number of drift function evaluations) advances the joint state from H_t to H_{t+Δt}. Specifically, to evolve the state from m_t to m_{t+Δt}, the feature h_t is first processed by the function f_{w_t} with weight w_t and activation σ_f, and then combined with the residual ξh_t and the current state m_t. After solving equation (8), we obtain h_{t+Δt} and m_{t+Δt}. These processes are repeated iteratively until the final time T, as illustrated in subfigure (a) of Figure 1. Therefore, this residual mechanism establishes an explicit link between representations at successive time steps (e.g., m_t → m_{t+Δt} → m_{t+2Δt}) rather than within a single time step. However, these settings result only in a direct residual connection, rather than the residual skip connection defined in the original formulation [29].
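The drift of the joint state in equation (8) can be written as a single function returning one component per block. A sketch with toy stand-ins for f_{w_t} and f_φ (all names and the simplified h_t transform are our own illustrations):

```python
import torch

def joint_drift(x, m, w, h, f_w, f_phi, t, xi=1.0, act=torch.tanh):
    """Drift of the joint state H_t = (x_t, m_t, w_t) from eq. (8);
    only the w_t component receives diffusion in the full SDE."""
    dx = act(m)                              # auxiliary-state drift
    dm = -m - act(f_w(w, h, t)) - xi * h     # momentum drift with direct residual
    dw = f_phi(w, t)                         # posterior weight drift
    return dx, dm, dw

f_w = lambda w, h, t: w.mean() * h           # toy activation map
f_phi = lambda w, t: -w                      # OU-style posterior drift
x, m, w = torch.randn(4), torch.zeros(4), torch.randn(16)
h = torch.tanh(x)                            # stand-in for sigma_f(t^{-3/2} e^{t/2} x_t)
dx, dm, dw = joint_drift(x, m, w, h, f_w, f_phi, t=1.0)
print(dx.shape, dm.shape, dw.shape)
```

The ξh_t term here is the "direct" residual the paper contrasts with its skip variant: it is injected on every drift evaluation.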
This discrepancy may reduce feature reuse and constrain the effectiveness of residual learning in the SDE-BNN setting. To better align with classic residual learning, rather than injecting the residual at every drift evaluation, we propose an NFE_f-dependent residual skip connection mechanism. Specifically, we modify the drift of m_t as:

\frac{dm_t}{dt} = -m_t - \sigma_f(f_{w_t}(h_t, t) + \epsilon \xi h_{\mathrm{temp}}), \qquad (9)

where the control variables ε and h_temp (initialized as the input x, and retained from the previous value when NFE_f is odd) depend on NFE_f, defined as:

\epsilon = \begin{cases} 1, & \text{if } \mathrm{NFE}_f \text{ is odd} \\ 0, & \text{if } \mathrm{NFE}_f \text{ is even} \end{cases}, \qquad h_{\mathrm{temp}} = h_t \ \text{if } \mathrm{NFE}_f \text{ is even}. \qquad (10)

Fig. 1: Comparison of the mentioned residual mechanisms: (a) Direct residual connection for m_t in equation (8). (b) Proposed NFE_f-dependent residual skip connection for m_t in equation (11).

Consequently, the final formulation of our Nesterov-SDEBNN method becomes:

h_t = \sigma_f\!\left(t^{-3/2} e^{t/2} x_t\right), \quad
dH_t = \begin{pmatrix} \sigma_f(m_t) \\ -m_t - \sigma_f(f_{w_t}(h_t, t) + \epsilon \xi h_{\mathrm{temp}}) \\ f_\phi(w_t, t) \end{pmatrix} dt + \begin{pmatrix} 0 \\ 0 \\ g(w_t, t) \end{pmatrix} dB_t. \qquad (11)

As described in equations (9) and (10), the proposed NFE_f-dependent scheme adheres to the spirit of residual skip connections, as shown in subfigure (b) of Figure 1. When NFE_f is even, the feature h_t is cached for use at time t + Δt; when NFE_f becomes odd (ε = 1), the cached feature is injected into the drift at t + Δt to help produce m_{t+2Δt}. This mechanism creates a skip connection across two consecutive drift evaluations, reusing cached representations to improve information flow without adding parameters.
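The parity-dependent caching of equations (9) and (10) is easiest to see as a stateful drift: cache h_t on even evaluations, inject it (ε = 1) on odd ones. A sketch of the mechanism (the class and all names are our own, not the paper's implementation):

```python
import torch

class SkipResidualDrift:
    """Momentum drift with the NFE_f-dependent residual skip of eqs. (9)-(10):
    h_t is cached when the drift-evaluation count is even and injected
    (eps = 1) when it is odd."""

    def __init__(self, f_w, x0, xi=1.0, act=torch.tanh):
        self.f_w, self.xi, self.act = f_w, xi, act
        self.h_temp = x0      # initialized to the network input x
        self.nfe = 0          # number of drift-function evaluations so far

    def __call__(self, w, h, m, t):
        eps = 1.0 if self.nfe % 2 == 1 else 0.0                    # eq. (10)
        dm = -m - self.act(self.f_w(w, h, t) + eps * self.xi * self.h_temp)
        if self.nfe % 2 == 0:
            self.h_temp = h   # cache h_t for the next (odd) evaluation
        self.nfe += 1
        return dm

f_w = lambda w, h, t: w.mean() * h           # toy activation map
drift = SkipResidualDrift(f_w, x0=torch.zeros(4))
w, m, dt = torch.randn(16), torch.zeros(4), 0.05
for k in range(4):
    h = torch.randn(4)
    m = m + dt * drift(w, h, m, k * dt)      # explicit Euler on m_t
print(drift.nfe)  # 4
```

Because the injection alternates with the caching, each residual spans two consecutive drift evaluations rather than one, mirroring the skip pattern of classic residual blocks [29].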
Meanwhile, this formulation allows the state m_{t+2Δt} to depend explicitly on both h_t and h_{t+Δt}, rather than only on h_{t+Δt} as in equation (8), which may help mitigate the vanishing-gradient issues commonly encountered in SDE solvers.

To compute h_t given x_0 = h_0 = x, we marginalize over weight trajectories via Monte Carlo sampling. Specifically, we sample paths {w_t} from the posterior process and, for each path, solve equation (11) to obtain {h_t}. The learnable parameters include the initial weights w_0 and the drift parameters φ.

C. Output Likelihood

Predictions are generated from the final hidden representation h_T, which directly parameterizes the output likelihood: log p(y | x, w) = log p(y | h_T). In practice, this likelihood can be chosen as a Gaussian distribution for regression tasks or a categorical distribution for classification.

D. Training Objective

To train our Nesterov-SDEBNN model, we adopt the VI framework, in which the loss function is derived from the evidence lower bound (ELBO) by Monte Carlo sampling [25], defined as:

\mathcal{L}_{\mathrm{ELBO}}(\phi) = \mathbb{E}_{q_\phi(w)}\!\left[ \log p(y \mid x, w) - \int_0^T \frac{1}{2} \left\| u(w_t, t, \phi) \right\|_2^2 \, dt \right] \qquad (12)

where u(w_t, t, φ) = g(w_t, t)^{-1} [f_q(w_t, t, φ) - f_p(w_t, t)], and q_φ(w) denotes the posterior distribution of w for short. The sampled weights, hidden activations, and training objective can be computed simultaneously via a single SDE solver call [25].

IV. EXPERIMENTS

In this section, we empirically evaluate the proposed Nesterov-SDEBNN against the baseline SDE-BNN on a diverse set of benchmark tasks, including toy regression, image classification, and dynamical simulation. These benchmarks cover multiple data modalities, ranging from images to time-series data. All experiments were implemented using PyTorch and conducted on a remote server equipped with an NVIDIA A100 GPU (40 GB memory) and an Intel Xeon Gold 5320 CPU (26 cores).
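Given per-sample log-likelihoods and the drift-mismatch process u evaluated along each weight path, the Monte Carlo estimate of equation (12) reduces to a Riemann sum over the path penalty. A sketch (the function name and tensor shapes are our own conventions):

```python
import torch

def neg_elbo(log_lik, u_vals, dt):
    """Monte Carlo estimate of the negative ELBO in eq. (12).
    log_lik: log p(y | h_T) per posterior sample, shape (S,).
    u_vals:  u(w_t, t, phi) = g^{-1}(f_q - f_p) along each path, shape (S, steps, d).
    dt:      solver step size; the path integral is a left Riemann sum."""
    kl_path = 0.5 * (u_vals ** 2).sum(dim=-1).sum(dim=-1) * dt  # ∫ ||u||²/2 dt per sample
    return -(log_lik - kl_path).mean()                          # average over S samples

# Shapes only; real values come from the SDE solver and the likelihood head.
S, steps, d = 4, 50, 16
loss = neg_elbo(torch.randn(S), torch.randn(S, steps, d), dt=0.02)
print(float(loss))
```

Since u accumulates along the same trajectory the solver already produces, the penalty can be tracked as an extra state dimension, which is how a single solver call yields weights, activations, and the objective at once.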
Additional details on training procedures, model configurations, and datasets are provided in Table I. The block configurations follow those of the original SDE-BNN approach [25].

A. Toy 1D Regression

We first validate the capabilities of Nesterov-SDEBNN on a simple 1D regression problem. To model non-monotonic functions, we augment the state space by adding two additional dimensions initialized to zero. As shown in Figure 2, our model retains the expressive posterior predictive power of SDE-BNN on synthetic non-monotonic noisy 1D data, demonstrating its effectiveness on this simple fitting task.

Fig. 2: Predictive prior and posterior of Nesterov-SDEBNN on a non-monotonic toy dataset. The blue and red shaded regions denote the 95% confidence intervals of the prior and posterior, respectively, while the solid lines represent their corresponding mean predictions.

B. Image Classification

For image classification, we followed the SDE-BNN setup, using convolutional networks to model instantaneous hidden-state changes. We benchmarked Nesterov-SDEBNN against SDE-BNN on MNIST and CIFAR-10 in terms of accuracy, negative log-likelihood (NLL), area under the curve (AUC), and NFEs, with results reported in Table II and Figures 3 and 4.

Fig. 3: Comparison of test accuracy between SDE-BNN and Nesterov-SDEBNN: (Left) MNIST. (Right) CIFAR-10.

Table II summarizes the performance of SDE-BNN and Nesterov-SDEBNN on the MNIST and CIFAR-10 datasets under both fixed-step and adaptive integration schemes. Across all settings, Nesterov-SDEBNN consistently achieves slightly higher accuracy and AUC than the standard SDE-BNN.
Notably, the NLL values are significantly lower for Nesterov-SDEBNN, indicating improved predictive uncertainty. On MNIST, the fixed-step Nesterov-SDEBNN reaches 99.04% accuracy and 7.37 × 10⁻² NLL, compared to 98.90% and 14.37 × 10⁻² for SDE-BNN. On CIFAR-10, similar trends are observed, with Nesterov-SDEBNN improving accuracy (88.36% vs. 87.56%), AUC (85.61% vs. 83.05%), and NLL (5.97 × 10⁻¹ vs. 9.89 × 10⁻¹) under fixed-step conditions. In addition, in adaptive integration scenarios, Nesterov-SDEBNN consistently reduces NLL while simultaneously improving predictive accuracy. This dual improvement underscores the model's robustness in both discriminative performance and uncertainty calibration, regardless of the underlying step-size scheme or dataset complexity.

Figure 3 compares the test accuracy of the two models. While both achieve similar final accuracy, the Nesterov-SDEBNN curve consistently lies above that of SDE-BNN throughout testing, resulting in a larger AUC value. This indicates that Nesterov-SDEBNN converges faster and achieves higher accuracy in the early stages of training on both the MNIST and CIFAR-10 datasets.

TABLE I: Hyperparameters used for each evaluation method corresponding to the experimental results reported in the paper.

SDE-BNN (fixed step) and Nesterov-SDE-BNN (fixed step):
  Hyper-parameter       1D Regression   MNIST [30]   CIFAR-10 [31]   Walker2D [32]
  Augment dim.          2               2            2               0
  # blocks              1               1            1-1-1           1
  Diffusion σ           0.2             0.1          0.1             0.1
  KL coef.              0.0             1e-5         100             1
  Learning rate         1e-3            1e-3         3e-4            {0: 1e-3, 50: 3e-3}
  # Solver steps        20              20           20              50
  Batch size            50              128          128             256
  Activation            swish           swish        mish            tanh
  Epochs                1000            100          500             500
  Drift f_x dim.        32              32           64              24
  Drift f_w dim.        32              1-64-1       2-32-2          1-32-1
  # Posterior samples   10              1            1               1
  Solver                midpoint        midpoint     midpoint        midpoint

SDE-BNN (adaptive) and Nesterov-SDE-BNN (adaptive); settings as in the fixed-step SDE-BNN columns except:
  atol                  -               1e-3         5e-3            5e-3
  rtol                  -               1e-3         5e-3            5e-3
  Epochs                -               100          300             500
  Activation            -               swish        swish           tanh
  Solver                midpoint        midpoint     midpoint        midpoint

TABLE II: Quantitative evaluation on MNIST and CIFAR-10. Comparison of SDE-BNN and Nesterov-SDEBNN performance (Accuracy, AUC, and NLL) using fixed-step and adaptive-step integration.

                                    MNIST                                          CIFAR-10
  Model                             Accuracy (%) ↑  AUC (%) ↑     NLL (×10⁻²) ↓    Accuracy (%) ↑  AUC (%) ↑     NLL (×10⁻¹) ↓
  SDE-BNN (fixed step)              98.90 ± 0.07    97.59 ± 0.04  14.37 ± 0.89     87.56 ± 0.36    83.05 ± 0.29  9.89 ± 0.22
  Nesterov-SDEBNN (fixed step)      99.04 ± 0.12    97.87 ± 0.06  7.37 ± 0.65      88.36 ± 0.24    85.61 ± 0.45  5.97 ± 0.11
  SDE-BNN (adaptive step)           98.87 ± 0.12    97.57 ± 0.03  13.85 ± 1.26     85.87 ± 0.28    78.42 ± 0.60  6.77 ± 0.09
  Nesterov-SDEBNN (adaptive step)   99.04 ± 0.03    97.88 ± 0.04  7.15 ± 0.34      86.99 ± 0.23    83.94 ± 0.32  5.32 ± 0.10

Furthermore, we compare the test NFEs between SDE-BNN and Nesterov-SDEBNN on MNIST and CIFAR-10, as shown in Figure 4. On both datasets, Nesterov-SDEBNN consistently requires significantly fewer NFEs than SDE-BNN throughout training, indicating more efficient computation. For MNIST, SDE-BNN's NFE increases steadily and fluctuates around 400, whereas Nesterov-SDEBNN remains stable near 240. On CIFAR-10, SDE-BNN's NFE rises sharply after around 50 epochs and plateaus near 270, while Nesterov-SDEBNN stabilizes around 170, demonstrating faster convergence and lower computational cost compared to SDE-BNN. These results demonstrate that Nesterov-SDEBNN delivers superior classification performance while significantly reducing computational overhead, making it a highly efficient solution for image classification.

Fig. 4: Comparison of test NFEs between SDE-BNN and Nesterov-SDEBNN: (Left) MNIST. (Right) CIFAR-10.

C. Walker2D Kinematic Simulation

To evaluate the model's capacity for capturing long-term dependencies [33], we applied Nesterov-SDEBNN to Walker2D kinematic simulation data [32]. Building on the ODE-RNN framework [34], we benchmarked our approach against the standard SDE-BNN.

As shown in Figure 5, Nesterov-SDEBNN reduces loss faster and achieves lower final error rates than its standard counterpart. Despite slightly higher volatility and oscillations during testing, it maintains superior performance throughout the 500 epochs, with its mean loss curve generally below the baseline.

Furthermore, as illustrated in Figure 6, Nesterov-SDEBNN demonstrates superior efficiency during both training and testing phases. While it initially incurs slightly higher NFEs, it quickly stabilizes around 145, maintaining a consistent computational cost. In contrast, the standard SDE-BNN shows a stepwise increase in complexity, reaching 180 NFEs, about 24% higher than ours.

Fig. 5: Walker2D test loss performance. Nesterov-SDEBNN vs. SDE-BNN under fixed-step (Left) and adaptive-step (Right) solver configurations.

Fig. 6: Forward-pass NFEs on the Walker2D kinematic simulation. Comparison between SDE-BNN and Nesterov-SDEBNN during (Left) training and (Right) testing using an adaptive-step solver.
These results indicate that integrating Nesterov-like dynamics into the SDE-BNN framework effectively limits solver complexity, resulting in more predictable and efficient inference than the standard adaptive approach.

Overall, these results provide further compelling empirical evidence that our approach outperforms the SDE-BNN baseline, underscoring its effectiveness and practical potential for time-series modeling.

V. CONCLUSION

In this paper, we introduced Nesterov-SDEBNN, an advancement of the SDE-BNN framework that integrates Nesterov accelerated gradient principles with an NFE-dependent residual scheme. This architecture significantly curtails the computational overhead of training and testing by reducing the required number of function evaluations. Empirical results show that our approach converges faster and achieves higher accuracy on various tasks such as image classification and time-series modeling, indicating its stronger capacity to capture complex patterns than standard SDE-BNNs. Our future work will assess the framework's scalability and robustness on larger, more complex datasets and explore extensions to examine its potential in practical applications.

ACKNOWLEDGMENTS

W. Fang was supported by the National Natural Science Foundation of China (NSFC) under Grant No. 12401676.

REFERENCES

[1] Paulo Botelho Pires, José Duarte Santos, and Inês Veiga Pereira. Artificial neural networks: history and state of the art. Encyclopedia of Information Science and Technology, Sixth Edition, pages 1–25, 2025.
[2] Yaoli Wang, Yaojun Deng, Yuanjin Zheng, Pratik Chattopadhyay, and Lipo Wang. Vision transformers for image classification: A comparative survey. Technologies, 13(1):32, 2025.
[3] Harsh Ahlawat, Naveen Aggarwal, and Deepti Gupta. Automatic speech recognition: A survey of deep learning techniques and approaches. International Journal of Cognitive Computing in Engineering, 2025.
[4] Farid Ariai, Joel Mackenzie, and Gianluca Demartini. Natural language processing for the legal domain: A survey of tasks, datasets, models, and challenges. ACM Computing Surveys, 58(6):1–37, 2025.
[5] Chao Ma, Lei Wu, et al. Machine learning from a continuous viewpoint, I. Science China Mathematics, 63(11):2233–2266, 2020.
[6] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 2018.
[7] Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural ODEs. Advances in Neural Information Processing Systems, 32, 2019.
[8] Alexander Norcliffe, Cristian Bodnar, Ben Day, Nikola Simidjievski, and Pietro Liò. On second order behaviour in augmented neural ODEs. Advances in Neural Information Processing Systems, 33:5911–5921, 2020.
[9] Cagatay Yildiz, Markus Heinonen, and Harri Lahdesmaki. ODE2VAE: Deep generative second order ODEs with Bayesian neural networks. Advances in Neural Information Processing Systems, 32, 2019.
[10] Ho Huu Nghia Nguyen, Tan Nguyen, Huyen Vo, Stanley Osher, and Thieu Vo. Improving neural ordinary differential equations with Nesterov's accelerated gradient method. Advances in Neural Information Processing Systems, 35:7712–7726, 2022.
[11] Patrick Kidger, James Morrill, James Foster, and Terry Lyons. Neural controlled differential equations for irregular time series. Advances in Neural Information Processing Systems, 33:6696–6707, 2020.
[12] Thomas Asikis, Lucas Böttcher, and Nino Antulov-Fantulin. Neural ordinary differential equation control of dynamics on graphs. Physical Review Research, 4(1):013221, 2022.
[13] Pengkai Wang, Song Chen, Jiaxu Liu, Shengze Cai, and Chao Xu. PIDNODEs: Neural ordinary differential equations inspired by a proportional–integral–derivative controller. Neurocomputing, 614:128769, 2025.
[14] Samuel Greydanus, Misko Dzamba, and Jason Yosinski. Hamiltonian neural networks. Advances in Neural Information Processing Systems, 32, 2019.
[15] Miles Cranmer, Sam Greydanus, Stephan Hoyer, Peter Battaglia, David Spergel, and Shirley Ho. Lagrangian neural networks. arXiv preprint arXiv:2003.04630, 2020.
[16] Marc Finzi, Ke Alexander Wang, and Andrew G. Wilson. Simplifying Hamiltonian and Lagrangian neural networks via explicit constraints. Advances in Neural Information Processing Systems, 33:13880–13889, 2020.
[17] YongKyung Oh, Seungsu Kam, Jonghun Lee, Dong-Young Lim, Sungil Kim, and Alex Bui. Comprehensive review of neural differential equations for time series analysis. arXiv preprint, 2025.
[18] Hao Niu, Yuxiang Zhou, Xiaohao Yan, Jun Wu, Yuncheng Shen, Zhang Yi, and Junjie Hu. On the applications of neural ordinary differential equations in medical image analysis. Artificial Intelligence Review, 57(9):236, 2024.
[19] Ashish S. Nair, Shivam Barwey, Pinaki Pal, Jonathan F. MacArt, Troy Arcomano, and Romit Maulik. Understanding latent timescales in neural ordinary differential equation models of advection-dominated dynamical systems. Physica D: Nonlinear Phenomena, 476:134650, 2025.
[20] Xuanqing Liu, Tesi Xiao, Si Si, Qin Cao, Sanjiv Kumar, and Cho-Jui Hsieh. Neural SDE: Stabilizing neural ODE networks with stochastic noise. arXiv preprint arXiv:1906.02355, 2019.
[21] Junteng Jia and Austin R. Benson. Neural jump stochastic differential equations. Advances in Neural Information Processing Systems, 32, 2019.
[22] Xuechen Li, Ting-Kam Leonard Wong, Ricky T. Q. Chen, and David Duvenaud. Scalable gradients for stochastic differential equations. In International Conference on Artificial Intelligence and Statistics, pages 3870–3882. PMLR, 2020.
[23] Anudhyan Boral, Zhong Yi Wan, Leonardo Zepeda-Núñez, James Lottes, Qing Wang, Yi-fan Chen, John Anderson, and Fei Sha. Neural ideal large eddy simulation: Modeling turbulence with neural stochastic differential equations. Advances in Neural Information Processing Systems, 36:69270–69283, 2023.
[24] YongKyung Oh, Dong-Young Lim, and Sungil Kim. Stable neural stochastic differential equations in analyzing irregular time series data. arXiv preprint arXiv:2402.14989, 2024.
[25] Winnie Xu, Ricky T. Q. Chen, Xuechen Li, and David Duvenaud. Infinitely deep Bayesian neural networks with stochastic differential equations. In International Conference on Artificial Intelligence and Statistics, pages 721–738. PMLR, 2022.
[26] Sergio Calvo-Ordonez, Matthieu Meunier, Francesco Piatti, and Yuantao Shi. Partially stochastic infinitely deep Bayesian neural networks. arXiv preprint arXiv:2402.03495, 2024.
[27] Xiaoyu Yang, Peiyi Qiu, et al. Infinitely deep Bayesian neural network with signature transform. Neurocomputing, page 132563, 2025.
[28] Patryk Gierjatowicz, Marc Sabate Vidales, David Siska, Lukasz Szpruch, and Zan Zuric. Robust pricing and hedging via neural stochastic differential equations. Journal of Computational Finance, 26(3):1–32, 2022.
[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[30] Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
[31] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html, 55:5, 2014.
[32] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.
[33] M. Lechner and R. Hasani. Learning long-term dependencies in irregularly-sampled time series. arXiv preprint arXiv:2006.04418, 2020.
[34] Yulia Rubanova, Ricky T. Q. Chen, and David K. Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. Advances in Neural Information Processing Systems, 32, 2019.