Conjugate Learning Theory: Uncovering the Mechanisms of Trainability and Generalization in Deep Neural Networks

Reading time: 5 minutes
...

📝 Original Info

  • Title: Conjugate Learning Theory: Uncovering the Mechanisms of Trainability and Generalization in Deep Neural Networks
  • ArXiv ID: 2602.16177
  • Date: 2026-02-18
  • Authors: B. Qi (Tongji University, email: 2080068@tongji.edu.cn, ORCID: 0000-0001-5832-1884); no other author information is provided in the available text.

📝 Abstract

In this work, we propose a notion of practical learnability grounded in finite sample settings, and develop a conjugate learning theoretical framework based on convex conjugate duality to characterize this learnability property. Building on this foundation, we demonstrate that training deep neural networks (DNNs) with mini-batch stochastic gradient descent (SGD) achieves global optima of empirical risk by jointly controlling the extreme eigenvalues of a structure matrix and the gradient energy, and we establish a corresponding convergence theorem. We further elucidate the impact of batch size and model architecture (including depth, parameter count, sparsity, skip connections, and other characteristics) on non-convex optimization. Additionally, we derive a model-agnostic lower bound for the achievable empirical risk, theoretically demonstrating that data determines the fundamental limit of trainability. On the generalization front, we derive deterministic and probabilistic bounds on generalization error based on generalized conditional entropy measures. The former explicitly delineates the range of generalization error, while the latter characterizes the distribution of generalization error relative to the deterministic bounds under independent and identically distributed (i.i.d.) sampling conditions. Furthermore, these bounds explicitly quantify the influence of three key factors: (i) information loss induced by irreversibility in the model, (ii) the maximum attainable loss value, and (iii) the generalized conditional entropy of features with respect to labels. Moreover, they offer a unified theoretical lens for understanding the roles of regularization, irreversible transformations, and network depth in shaping the generalization behavior of deep neural networks. Extensive experiments validate all theoretical predictions, confirming the framework's correctness and consistency.

💡 Deep Analysis

📄 Full Content

Machine learning techniques based on deep neural networks (DNNs) have achieved unprecedented empirical success across a broad spectrum of real-world applications, including image classification, natural language understanding, speech recognition, and autonomous driving. Despite this widespread practical success, the theoretical foundations underpinning the trainability (the ability to optimize non-convex models to low empirical risk) and generalization (the ability to perform well on unseen data) of DNNs remain poorly understood. As a result, deep learning is often characterized as an experimental science, with theoretical developments lagging behind practical advances and offering limited actionable guidance for real-world model design, algorithm selection, and hyperparameter tuning.

In this work, we propose the conjugate learning theory framework to systematically analyze the optimization dynamics and generalization mechanisms that underpin the performance of deep neural networks in practical learning scenarios.
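Since the framework is grounded in convex conjugate duality, the textbook definition of the Fenchel conjugate may serve as a useful reference point. The block below is standard convex-analysis background only, not the paper's specific structure matrix or conjugate learning construction:

```latex
% Textbook Fenchel (convex) conjugate -- general background only,
% not the paper's structure matrix or conjugate learning construction.
f^{*}(y) = \sup_{x \in \operatorname{dom} f} \bigl( \langle x, y \rangle - f(x) \bigr),
\qquad
f^{**}(x) = \sup_{y} \bigl( \langle x, y \rangle - f^{*}(y) \bigr) \le f(x)
```

with equality f** = f whenever f is proper, convex, and lower semicontinuous (the Fenchel–Moreau theorem).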


The trainability of DNNs refers to the well-documented empirical observation that highly over-parameterized, non-convex DNN models, when optimized with simple first-order optimization methods such as stochastic gradient descent (SGD) and its variants, consistently converge to high-quality solutions with low empirical risk, despite the absence of convexity or strong regularity assumptions that are typically required for theoretical guarantees in classical optimization [50]. In contrast, classical non-convex optimization theory only guarantees convergence to stationary points (points where the gradient is zero), which may correspond to local minima, saddle points, or maxima [22], and even seemingly simple non-convex optimization problems (e.g., quadratic programming with non-convex constraints or copositivity testing) are proven to be NP-hard in the general case [31]. This fundamental disconnect means that the remarkable practical efficiency of SGD in training DNNs cannot be explained by classical optimization theory, highlighting the need for new theoretical frameworks tailored to the unique properties of DNNs.
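For concreteness, a representative guarantee of this classical type, stated here as standard background under common assumptions (an L-smooth objective f, stochastic gradients with variance at most sigma^2, and a constant step size eta <= 1/L; this is not a result from the paper), controls only the average squared gradient norm after T iterations:

```latex
% Classical non-convex SGD guarantee (Ghadimi--Lan-type), assuming an
% L-smooth objective, gradient variance at most \sigma^2, and \eta \le 1/L.
\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\, \bigl\| \nabla f(x_t) \bigr\|^{2}
\;\le\;
\frac{2 \bigl( f(x_0) - f_{\inf} \bigr)}{\eta T} + \eta L \sigma^{2}
```

A bound of this form certifies only approximate stationarity and says nothing about the gap to a global optimum, which is precisely the disconnect with the observed trainability of DNNs.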

Several theoretical directions have emerged in recent years to address this trainability puzzle. One prominent line of work focuses on the infinite-width limit of DNNs, leading to the development of the Neural Tangent Kernel (NTK) framework [21], which provides valuable insights into the training dynamics of DNNs in the lazy training regime (where network parameters change minimally during training). Another complementary direction, based on Fenchel-Young losses, establishes a direct link between gradient norms and distribution fitting errors in supervised classification tasks [35]. However, as we detail in Section 7, these existing approaches have significant limitations: NTK theory struggles to capture the training dynamics of finite-width DNNs (the setting of practical interest), while the Fenchel-Young perspective is restricted to classification tasks and does not address the generalization properties of DNNs.
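As a concrete illustration of the empirical NTK object mentioned above, the sketch below computes the kernel of a small finite-width MLP as the Gram matrix of parameter Jacobians. It is a minimal sketch, not code from the paper; the architecture, widths, and data shapes are illustrative assumptions.

```python
# Minimal sketch (not from the paper): empirical Neural Tangent Kernel of a
# small finite-width MLP, computed as the Gram matrix of parameter Jacobians.
import jax
import jax.numpy as jnp

def init_mlp(key, sizes=(8, 64, 64, 1)):
    """Random dense parameters for an MLP with the given layer sizes."""
    params = []
    for din, dout in zip(sizes[:-1], sizes[1:]):
        key, wkey = jax.random.split(key)
        params.append((jax.random.normal(wkey, (din, dout)) / jnp.sqrt(din),
                       jnp.zeros(dout)))
    return params

def mlp(params, x):
    """Scalar-output MLP; tanh hidden layers, linear readout."""
    for w, b in params[:-1]:
        x = jnp.tanh(x @ w + b)
    w, b = params[-1]
    return (x @ w + b).squeeze(-1)

def empirical_ntk(params, x1, x2):
    """K[i, j] = <df(x1_i)/dtheta, df(x2_j)/dtheta>, summed over all parameters."""
    j1 = jax.jacrev(lambda p: mlp(p, x1))(params)  # pytree of per-example Jacobians
    j2 = jax.jacrev(lambda p: mlp(p, x2))(params)
    flat1 = jnp.concatenate([a.reshape(x1.shape[0], -1)
                             for a in jax.tree_util.tree_leaves(j1)], axis=1)
    flat2 = jnp.concatenate([a.reshape(x2.shape[0], -1)
                             for a in jax.tree_util.tree_leaves(j2)], axis=1)
    return flat1 @ flat2.T

key = jax.random.PRNGKey(0)
params = init_mlp(key)
x = jax.random.normal(jax.random.PRNGKey(1), (16, 8))
K = empirical_ntk(params, x, x)                 # (16, 16) kernel matrix
print(K.shape, jnp.linalg.eigvalsh(K)[-1])      # largest NTK eigenvalue
```

In the infinite-width lazy training regime this kernel remains essentially fixed during training; for the finite-width networks of practical interest it evolves, which is one reason NTK-based analyses fall short in that setting.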

Generalization refers to the ability of DNN models to make accurate predictions on unseen test data after being trained on a finite set of training samples. Classical statistical learning theory quantifies generalization performance by deriving upper bounds on generalization error (the difference between test and training error) using complexity measures of the hypothesis class, such as VC-dimension [44] or Rademacher complexity [4]. These classical bounds universally suggest that controlling the size and complexity of the model promotes better generalization performance, as more complex models are more prone to overfitting to noise in the training data. However, in the over-parameterized regime, where DNNs often contain orders of magnitude more parameters than training samples, these classical bounds fail to reflect the strong empirical generalization performance observed in practice, a phenomenon known as the generalization paradox of DNNs.
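For reference, the classical uniform bounds referred to above take the following standard form: for a loss bounded in [0, 1] and an i.i.d. sample of size n, with probability at least 1 - delta, simultaneously for all hypotheses h in the class H,

```latex
% Standard Rademacher-complexity generalization bound (textbook form),
% for a loss bounded in [0,1] and an i.i.d. sample of size n.
L(h) \;\le\; \widehat{L}_{n}(h)
  + 2\, \mathfrak{R}_{n}(\ell \circ \mathcal{H})
  + \sqrt{\frac{\ln(1/\delta)}{2n}}
```

where the middle term is the Rademacher complexity of the loss class. In the over-parameterized regime this complexity term is typically far too large to be informative, which is exactly the paradox described above.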

To explain this paradox, several alternative theoretical frameworks have been proposed in the literature. The flat minima hypothesis posits that the inherent stochasticity of SGD acts as an implicit regularizer during training, steering the optimization process toward flat regions of the loss landscape (minima with low curvature) that are empirically associated with better generalization [23,10]. Information-theoretic approaches, most notably the Information Bottleneck (IB) principle, explain generalization through the lens of information compression in neural representations, arguing that DNNs learn to retain only the input information relevant for predicting target labels while discarding redundant or noisy components [41,39]. Yet as we discuss in Section 7, each of these perspectives has its own critical limitations; for example, flatness measures are not invariant to network parameterization.
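To make the flatness notion concrete, the sketch below estimates the sharpness of a loss at a given parameter vector as its largest Hessian eigenvalue, using power iteration on Hessian-vector products; this is the kind of curvature quantity the flat minima hypothesis associates with generalization. It is an illustrative sketch on a toy quadratic, not a procedure from the paper.

```python
# Illustrative sketch (not the paper's method): loss-surface sharpness as the
# top Hessian eigenvalue, estimated by power iteration on Hessian-vector products.
import jax
import jax.numpy as jnp

def top_hessian_eigenvalue(loss_fn, params, key, num_iters=20):
    """Power iteration on the Hessian of loss_fn at a flat parameter vector."""
    hvp = lambda v: jax.jvp(jax.grad(loss_fn), (params,), (v,))[1]
    v = jax.random.normal(key, params.shape)
    v = v / jnp.linalg.norm(v)
    for _ in range(num_iters):
        hv = hvp(v)
        v = hv / (jnp.linalg.norm(hv) + 1e-12)
    return jnp.vdot(v, hvp(v))  # Rayleigh quotient ~ largest |eigenvalue|

# Toy quadratic loss with known curvature 3.0 along its sharpest direction.
loss = lambda w: 1.5 * w[0] ** 2 + 0.05 * jnp.sum(w[1:] ** 2)
w_star = jnp.zeros(10)
print(top_hessian_eigenvalue(loss, w_star, jax.random.PRNGKey(0)))  # ~3.0
```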

Reference

This content is AI-processed based on open access ArXiv data.
