Variational Kernel Design for Internal Noise: Gaussian Chaos Noise, Representation Compatibility, and Reliable Deep Learning
Internal noise in deep networks is usually inherited from heuristics such as dropout, hard masking, or additive perturbation. We ask two questions: what correlation geometry should internal noise have, and is the implemented perturbation compatible with the representations it acts on? We answer these questions through Variational Kernel Design (VKD), a framework in which a noise mechanism is specified by a law family, a correlation kernel, and an injection operator, and is derived from learning desiderata. In a solved spatial subfamily, a quadratic maximum-entropy principle over latent log-fields yields a Gaussian optimizer with precision given by the Dirichlet Laplacian, so the induced geometry is the Dirichlet Green kernel. Wick normalization then gives a canonical positive mean-one gate, Gaussian Chaos Noise (GCh). For the sample-wise gate used in practice, we prove exact Gaussian control of pairwise log-ratio deformation, margin-sensitive ranking stability, and an exact expected intrinsic roughness budget; hard binary masks instead induce singular or coherence-amplified distortions on positive coherent representations. On ImageNet and ImageNet-C, GCh consistently improves calibration and under shift also improves NLL at competitive accuracy.
Authors: Ziran Liu (1,2,3,†)
(1) Shanghai Institute for Mathematics and Interdisciplinary Sciences (SIMIS), Shanghai 200433, China
(2) Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
(3) Institute for Intelligent Computing@SJTU, Shanghai 200433, China
(†) Correspondence to zliu@simis.cn
Keywords: Noise Design, Deep Learning Reliability, Calibration, Distribution Shift, Gaussian Multiplicative Chaos

1. Introduction

Noise injection is one of the most widely used yet least principled components of deep learning. It appears as additive perturbation, stochastic gating, masking, augmentation, corruption-aware training, and uncertainty regularization, and it is routinely used to improve generalization, calibration, and robustness. Yet one central design choice is usually left heuristic: what structure should the noise have? In most pipelines, that choice is inherited from familiar templates such as i.i.d. dropout (Srivastava et al., 2014), stochastic depth (Huang et al., 2016), or hard spatial masking (Ghiasi et al., 2018), rather than derived from the geometry of the representation or the objective of the learner. This raises a more structural question: if internal noise is to be used as part of representation learning, what aspects of that noise should be derived from first principles rather than fixed by convention?

Our answer is to treat internal noise as a design object. We call the resulting program Variational Kernel Design (VKD). In VKD, a noise mechanism is specified by a triple $N = (\mathcal{F}, K, T)$, consisting of a law family, a correlation kernel, and an injection operator. A realization map then turns a sampled latent field into an implemented perturbation. The mechanism is therefore not just "a distribution"; it is a compositional system that separates what is sampled, what geometry it must respect, and where and how it is deployed. This viewpoint reveals that there are really two linked questions.
The first is a design question: once locality, smoothness, and mean-preserving positivity are encoded as operator-level constraints, what perturbation geometry is canonically induced? The second is a compatibility question: once such a mechanism is implemented in a deep network, what does it actually do to the geometry of positive semantic representations, and how does that differ from hard masking?

The paper is built around this two-layer split. The first layer derives the mechanism. The second studies the induced action of the implemented perturbation on a target representation regime. The design layer leads to a solved quadratic VKD program. In the spatial setting of this paper, a maximum-entropy log-field under a Dirichlet-energy budget is Gaussian with precision $\beta L_U$, hence covariance $\beta^{-1} L_U^{-1}$. In other words, within the chosen local quadratic design class, the Dirichlet Green kernel is not an additional modeling choice; it is the inverse operator forced by the constraints. Exponentiating that field with exact Wick normalization yields a positive mean-one multiplicative gate, which we call GCh.

The compatibility layer is where the practical distinction emerges. For the sample-wise gate actually used in our experiments, we prove exact Gaussian control of pairwise log-ratio deformations, explicit margin-sensitive ranking stability, and an exact expected intrinsic roughness budget. For hard binary masks, we prove a qualitatively different behavior: incompatibility with finite log-ratio geometry, a margin-blind ranking law for inverted dropout, and a coherence-sensitive distortion term whose relative size diverges as the underlying representation becomes increasingly smooth. This is the rigorous form of the informal claim that smooth positive multiplicative perturbations are better matched to coherent late-stage semantics than hard deletion.

A key theme throughout the paper is that these two layers belong together. The contribution is not just that a Gaussian field can be derived from a quadratic maximum-entropy problem; that isolated fact is classical. The contribution is that the variational solution is used as an operator-level design map for training-time noise, and then analyzed as an implemented mechanism acting on coherent positive evidence maps. In short: first derive the geometry; then ask whether the realized perturbation is compatible with the representation regime of interest.

Scope of the theoretical claims. The paper does not claim that deeper is always better, or that hard masking is universally inferior on every architecture, layer, or objective. The claims are conditional and operational: when a layer carries positive region-level or token-level evidence and becomes increasingly coherent in the late-semantic sense, the mathematically relevant quantities are relative log-ratios, ranking stability, and aggregate geometric roughness. In that regime, we show that the implemented GCh gate yields finite, margin-aware Gaussian deformations, whereas hard binary masks yield singular or coherence-amplified distortions. This perspective yields both a principled mechanism and a practical prediction.
If later layers encode increasingly decisive relative evidence between regions or tokens while also becoming more spatially coherent, then a margin-aware smooth multiplicative gate should remain compatible with those representations, whereas hard masking should become increasingly mismatched. Our experiments are designed to test exactly this distinction. On clean ImageNet, GCh improves calibration substantially, and on the selected 7-corruption ImageNet-C evaluation it improves both ECE and NLL while maintaining competitive accuracy. It also remains effective in late-stage injection settings where hard masking can degrade clean calibration.

Contributions. Our contributions are as follows.

• A framework view of internal noise. We formulate internal noise injection as a compositional design problem and introduce VKD, in which a mechanism is derived from learning-motivated constraints rather than selected from a fixed menu of perturbations.

• A two-layer theory: design and compatibility. We separate a mechanism-design layer from a representation-compatibility layer, making explicit the distinction between what is derived from first principles, what is realized in implementation, and what is subsequently measured on a target representation regime.

• A solved quadratic MaxEnt design program. We state the admissible class of centered log-field laws explicitly, solve the resulting finite-dimensional variational problem in closed form, and derive an entropy-gap identity certifying uniqueness of the optimizer.

• Operator-forced kernel geometry. For spatial log-fields with a Dirichlet-energy budget and gauge fixing, the optimizer is Gaussian with covariance proportional to the Dirichlet Green kernel. More generally, replacing the quadratic operator replaces the induced kernel by its inverse.

• A canonical exact gate and an implementation-aware framework. Exponentiating the MaxEnt log-field with Wick normalization yields GCh, a positive mean-one multiplicative gate with explicit multi-point moments; once the operator and budget are fixed, the exact gate becomes an effectively one-parameter family through $\tau = \gamma^2/\beta$. We also make explicit the split between the canonical exact gate and the sample-wise implementation used in practice.

• Representation compatibility versus hard-mask mismatch. For the sample-wise gate used in the experiments, we prove exact Gaussian control of pairwise log-ratios, margin-sensitive ranking stability, and an exact expected intrinsic roughness budget. For hard binary masking, we prove incompatibility with finite log-ratio geometry, a margin-blind ranking law for inverted dropout, an immediate loss-of-perfect-coherence result in expectation on perfectly coherent maps, and a late-stage mismatch theorem in the coherent-representation regime.

• Empirical validation in the predicted late-stage regime. On clean ImageNet, a selected 7-corruption ImageNet-C evaluation, Swin-T, and a fine-grained Oxford-IIIT Pets pilot, GCh improves calibration and, under shift, also improves NLL, all at competitive accuracy. Controlled ablations show the importance of correlation, positivity, and injection depth.

A practical way to read the paper. The variational results explain where the kernel comes from.
The compatibility results explain why the resulting implemented gate behaves differently from binary masking on semantic representations. The experiments then test that distinction precisely in the late-stage regime where the mismatch should matter most.

Roadmap. Section 2 reviews stochastic regularization, calibration, and robustness under shift. Section 3 presents VKD as a compositional design system and situates the paper's solved instance inside that framework. Sections 4–5 develop the Dirichlet log-field construction, the quadratic MaxEnt theorem, and the exact and implemented GCh gates. Section 5.5 gives the representation-compatibility analysis, and Section 6 tests the resulting predictions empirically.

Paper in one sentence. We derive the noise geometry from first principles and then show that the resulting implemented smooth positive gate preserves finite, margin-aware relative geometry exactly in the regime where hard masking becomes singular or coherence-amplified.

What is classical and what is new. The isolated fact that quadratic maximum entropy yields a Gaussian law is classical. The contribution here is the use of that principle as an operator-level design map for training-time noise, together with the second layer of theory that is specific to this paper: exact representation-compatibility results for the implemented sample-wise gate and exact incompatibility results for hard binary masks on coherent positive semantic representations. Put differently, the variational theorem identifies the canonical kernel inside a chosen design class, and the later compatibility theorems explain why that designed mechanism behaves differently from masking in deep networks.

2. Related Work

Noise injection and regularization in deep networks. Small additive noise is classically linked to Tikhonov-style regularization (Bishop, 1995). Dropout injects i.i.d. Bernoulli gating (Srivastava et al., 2014); stochastic depth drops residual branches (Huang et al., 2016) and is extended to Transformers via LayerDrop (Fan et al., 2020); ShakeDrop perturbs residual branches with randomized coefficients (Yamada et al., 2018). Spatial occlusion methods such as Cutout and DropBlock impose structured hard masking on feature maps (DeVries and Taylor, 2017; Ghiasi et al., 2018), while sample-level mixing methods such as Mixup and CutMix inject stochasticity at the data level (Zhang et al., 2018; Yun et al., 2019). In vision transformers, PatchDropout removes input patches and changes token topology (Liu et al., 2023). A common limitation is that the correlation structure of the noise is usually fixed a priori and often assumes spatial independence or hard discontinuities, which can mismatch late semantic representations.

Calibration and reliability under shift. Miscalibration is widespread in modern neural networks, and temperature scaling remains a strong post-hoc baseline (Guo et al., 2017). Nonparametric alternatives include BBQ (Naeini et al., 2015), while Dirichlet calibration extends beyond a single temperature parameter (Kull et al., 2019). Dropout admits an approximate Bayesian interpretation (Gal and Ghahramani, 2016), and deep ensembles remain a strong uncertainty baseline (Lakshminarayanan et al., 2017). Under distribution shift, calibration can deteriorate substantially (Ovadia et al., 2019),
and recent work emphasizes that calibration depends strongly on architecture and training recipe (Minderer et al., 2021). Label smoothing can help but is context dependent (Müller et al., 2019). These findings motivate methods that improve NLL and ECE directly during representation learning rather than relying only on post-hoc correction.

Robustness to corruptions and distribution shift. For worst-case robustness, adversarial training and TRADES formalize the robustness–accuracy trade-off (Madry et al., 2018; Zhang et al., 2019). For average-case corruptions, ImageNet-C/P provide standardized benchmarks (Hendrycks and Dietterich, 2019); subsequent work has also emphasized that performance on synthetic corruptions does not perfectly transfer to natural shifts (Taori et al., 2020), and broader OOD suites reveal substantial heterogeneity across shift types (Hendrycks et al., 2021). Simple augmentation policies such as RandAugment and AugMix improve corruption robustness and uncertainty with low overhead (Cubuk et al., 2020; Hendrycks et al., 2020); properly tuned Gaussian or speckle noise can also be effective (Rusak et al., 2020). Noisy Student further demonstrates the power of strong stochastic regularization in large-scale training (Xie et al., 2020). Our focus is complementary: rather than designing perturbations at the input level, we derive an internal spatial noise mechanism whose correlation structure follows from explicit desiderata.

3. Variational Kernel Design as a Compositional Design System

We treat internal noise not as a fixed perturbation template but as a mechanism to be derived from learning desiderata. The role of Variational Kernel Design (VKD) is to map a collection of task-level constraints to a stochastic mechanism and then to analyze how that mechanism acts on a target representation regime. This viewpoint separates two layers that are often conflated in practice: a mechanism-design layer, which specifies what latent object is sampled, what geometry it must respect, and where it is injected, and a compatibility layer, which studies what geometric quantities the deployed perturbation preserves or distorts on the representations actually used by the network.

The benefit of this separation is conceptual as well as practical. It makes clear which parts of the construction are derived from first principles, which parts are implementation choices, and which parts are properties of the resulting perturbation on a given representation regime. In particular, VKD is not a menu of named noises; it is a compositional system for deriving, realizing, deploying, and analyzing an internal perturbation mechanism.

3.1. Mechanism space: VKD as a design system

Let $\Omega$ denote a perturbation domain and let $\mathcal{H}$ denote a feature space. A VKD mechanism is specified by a triple $N = (\mathcal{F}, K, T)$, whose three components encode complementary axes of design.

Definition 1 (VKD mechanism). A VKD mechanism on $(\Omega, \mathcal{H})$ is a triple $N = (\mathcal{F}, K, T)$, where: (i) $\mathcal{F}$ is a family of laws on latent fields $\psi \in \mathbb{R}^\Omega$; (ii) $K$ is a positive semidefinite kernel on $\Omega \times \Omega$ encoding the intended second-order geometry; (iii) $T$ is an injection operator that deploys a realized perturbation inside the model.

The three components play distinct roles.
The family $\mathcal{F}$ determines what latent object is sampled; the kernel $K$ encodes how that object is spatially correlated; and the operator $T$ determines where and how the realized perturbation acts on the network. In this way, VKD separates sampling, geometry, and deployment. To make the construction operational, we introduce a realization map $\ell : \mathbb{R}^\Omega \to (0, \infty)^\Omega$, which turns a latent field $\psi$ into a positive gate $\xi = \ell(\psi)$. The deployed perturbation is then

$\tilde h = T(h; \xi), \qquad \psi \sim \mathcal{F}.$

Thus the mechanism pipeline has the schematic form

$(\mathcal{F}, K, T) \;\Longrightarrow\; \psi \sim \mathcal{F} \;\xrightarrow{\ \ell\ }\; \xi \;\xrightarrow{\ T\ }\; \tilde h.$

3.2. From desiderata to admissible mechanism classes

A central point of VKD is that the mechanism is not selected from a fixed heuristic menu. Instead, one starts from a collection of learning desiderata $D$ (for example positivity, lack of systematic scale drift, locality, smoothness, or minimal extra information) and translates them into mathematical constraints on admissible mechanisms. Accordingly, VKD should be read as a map

$D \longmapsto \mathcal{N}(D),$

where $\mathcal{N}(D)$ is an admissible class of mechanisms consistent with the desiderata. The design problem is then to derive a distinguished mechanism $N^\star \in \mathcal{N}(D)$ rather than choose one by convention. This formulation is intentionally general. In some settings, the admissible class may leave several components independent. In other settings, the desiderata may couple the law and the geometry so strongly that the kernel is no longer a free modeling knob but a derived consequence of the design class itself.

3.3. A two-layer view: mechanism and compatibility

A VKD mechanism is only half of the story. Once a mechanism has been derived and realized, one must still ask how the deployed perturbation acts on the representations the network actually uses. We therefore separate a second object: a target representation regime $R$ together with a collection of compatibility observables $O(\tilde h; R)$, such as pairwise log-ratio deformation, ranking stability, intrinsic roughness inflation, or topological stability. The resulting conceptual split is:

• Mechanism-design layer: derive $(\mathcal{F}, K, T)$ and the realization map $\ell$ from desiderata;

• Compatibility layer: study the induced action of the deployed mechanism on observables relevant to a target representation regime $R$.

This distinction is especially important in the present paper because the canonical object derived by the variational theory is an exact Wick-normalized gate, while the optimization-friendly implementation used in the main experiments is a sample-wise mean-one gate. The design layer tells us what the canonical latent geometry is; the compatibility layer tells us what the implemented mechanism does once deployed.

Remark 2 (VKD is compositional, not temporal). VKD should not be interpreted as a temporal dynamical system unless an explicit update rule is introduced. Its role here is compositional: desiderata define an admissible class, the variational principle derives a canonical latent law and geometry, the realization map produces a deployed gate, and the compatibility layer studies the induced action of that gate on a target representation regime.
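To make the compositional reading concrete, the following is a minimal Python sketch of the pipeline in Definition 1 and Remark 2. All names here (VKDMechanism, sample_latent, realize, inject) are illustrative conventions introduced for this sketch, not an API from the paper; the toy instantiation uses an i.i.d. log-field only to show the shape of the pipeline.

```python
import numpy as np

# A minimal sketch of the VKD pipeline: (F, K, T) => psi ~ F --l--> xi --T--> h~.
# All class and function names are illustrative, not part of any released code.
class VKDMechanism:
    def __init__(self, sample_latent, realize, inject):
        self.sample_latent = sample_latent  # F: draws a latent field psi on Omega
        self.realize = realize              # l: maps psi to a positive gate xi
        self.inject = inject                # T: deploys xi inside the model

    def perturb(self, h, rng):
        psi = self.sample_latent(rng)       # sampling axis
        xi = self.realize(psi)              # realization map (positivity, mean one)
        return self.inject(h, xi)           # deployment axis

# Toy instantiation: i.i.d. Gaussian log-field, exponential link with
# sample-wise mean-one normalization, multiplicative injection.
n_sites = 49
mech = VKDMechanism(
    sample_latent=lambda rng: rng.standard_normal(n_sites),
    realize=lambda psi: np.exp(psi) / np.exp(psi).mean(),
    inject=lambda h, xi: h * xi,
)

rng = np.random.default_rng(0)
h = np.abs(rng.standard_normal(n_sites)) + 1.0   # a positive feature map
h_tilde = mech.perturb(h, rng)
print(h_tilde.mean() / h.mean())                 # close to 1: no systematic scale drift
```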
3.4. The solved instance studied in this paper

The present paper studies a solved quadratic VKD subfamily in which the latent object is a centered log-field and spatial coherence is imposed through a quadratic operator budget. In this subfamily, the law and the geometry are not independent design axes: once the operator $Q$ and the energy budget $\varepsilon$ are fixed, the unique maximum-entropy optimizer is Gaussian with covariance proportional to $Q^{-1}$. Thus the kernel is operator-forced rather than chosen heuristically. In the main spatial construction, the perturbation domain is the feature grid $U$, the operator is the Dirichlet Laplacian $Q = L_U$, and the resulting latent law is a discrete Gaussian free field with covariance proportional to the Dirichlet Green kernel $G_U = L_U^{-1}$. A realization map then turns the latent log-field into either a canonical exact Wick-normalized gate or the sample-wise mean-one gate used in the experiments. The injected perturbation is studied precisely in the late-stage positive coherent regime where pairwise log-ratios, ranking preservation, and intrinsic roughness are the relevant observables.

General VKD pipeline:
Desiderata $D$ → admissible class $\mathcal{N}(D)$ → $N = (\mathcal{F}, K, T)$ → $\psi \sim \mathcal{F} \xrightarrow{\ \ell\ } \xi$ → $\tilde h = T(h; \xi)$ → $O(\tilde h; R)$.

This paper's solved instance:
$Q = L_U,\ \varepsilon$ → $\psi \sim \mathcal{N}(0, (\beta L_U)^{-1})$ → Green geometry $G_U = L_U^{-1}$ → $\xi^{\mathrm{ex}}_\gamma$ (canonical) or $\xi^{\mathrm{sw}}_\gamma$ (implemented) → late-stage spatial injection → pairwise / ranking / roughness / topology → ECE / NLL / shift behavior.

Figure 1. VKD as a compositional design system. The top row shows the general design logic: learning desiderata define an admissible mechanism class, from which a mechanism and realization map are derived and then evaluated through compatibility observables on a target representation regime. The bottom row shows the solved instance of this paper. A quadratic operator budget with $Q = L_U$ yields a Gaussian log-field with covariance proportional to the Dirichlet Green kernel. This latent field admits a canonical exact realization through Wick normalization and an optimization-friendly implemented realization through sample-wise mean-one normalization. The resulting deployed gate is analyzed through pairwise, ranking, roughness, and topological observables, and then tested empirically through calibration and reliability metrics.

Roadmap from framework to results. Section 4 fixes the perturbation domain and deployment axis for the spatial setting studied in this paper. Section 5 then solves the mechanism-design layer: desiderata define an admissible class, the quadratic MaxEnt principle derives the canonical latent law, and the realization map yields the exact GCh gate together with the implementation-aware variants used in practice. The later subsections of Section 5 begin the compatibility layer by analyzing the induced action of the implemented gate on the positive coherent regime relevant to late-stage representations.

4. A Solved Quadratic VKD Instance: Problem Setup and Discrete GFF Background

We now instantiate the general system of Section 3 in the spatial setting used throughout the paper. To keep the framework explicit, it is useful to separate what is fixed in this section from what is derived in Section 5.
Here we fix the perturbation domain $\Omega = U$, the feature space $\mathcal{H} = \mathbb{R}^{C \times H \times W}$, and the deployment axis $T$ as spatial multiplicative injection on a feature grid. In Section 5 we then derive the canonical latent law and its induced geometry from the learning desiderata. In the solved instance studied here, the latent object is a centered log-field, the operator budget is the Dirichlet energy, and the resulting canonical geometry is the Dirichlet Green kernel.

4.1. Injection site and spatial gating

Fix a layer at which a feature map is perturbed. Let $h \in \mathbb{R}^{C \times H \times W}$ denote the feature tensor at that site, with channel index $c \in \{1, \dots, C\}$ and spatial location $x = (i, j) \in U$, where $U = \{1, \dots, H\} \times \{1, \dots, W\}$. We focus on spatial perturbations: a random field acts on the $H \times W$ grid and is shared across channels. Concretely, we introduce a positive spatial gate $\nu : U \to (0, \infty)$ and apply it identically across channels.

Injection operators. The basic multiplicative operator is

$T_\nu(h)(c, x) = h(c, x)\, \nu(x),$  (1)

that is, pointwise multiplication with spatial broadcasting. For numerical stability or reduced perturbation strength, we may also use the residual form

$T^{\mathrm{res}}_\nu(h)(c, x) = h(c, x)\, \big(1 + \alpha(\nu(x) - 1)\big), \qquad \alpha \in (0, 1].$  (2)

Unless otherwise stated, we use $\alpha = 1$.

Framework instantiation of the deployment axis. In the notation of Section 3, this subsection fixes the deployment part of the mechanism: the perturbation domain is the interior grid $U$, the feature space is $\mathcal{H} = \mathbb{R}^{C \times H \times W}$, and the admissible deployment operators are spatial multiplicative injections such as $T_\nu$ and $T^{\mathrm{res}}_\nu$. What remains open at this stage is the design layer: which latent law should be sampled, and what correlation geometry should it induce?

4.2. Discrete Gaussian free field on a rectangular grid

To make the implementation and spectral formulas consistent, we treat the feature grid itself as the interior domain and impose Dirichlet conditions on an auxiliary outer boundary. Fix integers $H, W \geq 1$ and define

$U = \{1, \dots, H\} \times \{1, \dots, W\}, \qquad \bar U = \{0, \dots, H+1\} \times \{0, \dots, W+1\},$

with auxiliary boundary $B = \bar U \setminus U$. Equip $\bar U$ with the nearest-neighbor undirected edge set

$E = \big\{ \{x, y\} \subset \bar U : \|x - y\|_1 = 1 \big\}.$

Optionally, allow positive symmetric edge weights $c_{xy} = c_{yx} > 0$ on $\{x, y\} \in E$; the unweighted case is $c_{xy} \equiv 1$. A field is a function $\phi : U \to \mathbb{R}$. We extend it by zero to the auxiliary boundary: $\bar\phi(y) = \phi(y)$ for $y \in U$ and $\bar\phi(y) = 0$ for $y \in B$.

Dirichlet Laplacian and energy. For $\phi : U \to \mathbb{R}$, define the Dirichlet Laplacian $L_U$ by

$(L_U \phi)(x) = \sum_{y : \{x, y\} \in E} c_{xy} \big(\phi(x) - \bar\phi(y)\big), \qquad x \in U.$  (3)

Its quadratic form is the Dirichlet energy

$\mathcal{E}(\phi) := \tfrac{1}{2} \langle \phi, L_U \phi \rangle = \tfrac{1}{2} \sum_{\{x, y\} \in E} c_{xy} \big(\bar\phi(x) - \bar\phi(y)\big)^2.$  (4)

Under Dirichlet boundary conditions, $L_U$ is symmetric positive definite, so $\mathcal{E}(\phi) > 0$ for $\phi \neq 0$.

Discrete GFF. Fix an inverse-temperature parameter $\beta > 0$. The Dirichlet discrete Gaussian free field (GFF) on $U$ is the centered Gaussian vector

$\phi \sim \mathcal{N}\big(0, (\beta L_U)^{-1}\big).$  (5)

Equivalently, its density on $\mathbb{R}^U$ is

$p_\beta(\phi) = \frac{1}{Z_\beta} \exp\big(-\beta \mathcal{E}(\phi)\big) = \left(\frac{\det(\beta L_U)}{(2\pi)^{|U|}}\right)^{1/2} \exp\Big(-\tfrac{1}{2}\, \phi^\top (\beta L_U)\, \phi\Big),$  (6)

with normalizing constant

$Z_\beta = (2\pi)^{|U|/2} \det(\beta L_U)^{-1/2}.$  (7)

Green kernel. Define the Dirichlet Green matrix $G_U := L_U^{-1}$. Then the covariance of the GFF is

$\mathrm{Cov}\big(\phi(x), \phi(y)\big) = \beta^{-1} G_U(x, y), \qquad x, y \in U.$  (8)
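As a concrete numerical companion to Section 4.2, the following sketch builds the unweighted Dirichlet Laplacian on a small grid, samples the discrete GFF by Cholesky factorization, and checks the Green-kernel covariance (8) empirically. The grid size and $\beta$ are illustrative choices, not settings from the paper.

```python
import numpy as np

# Build the Dirichlet Laplacian L_U on an H x W interior grid (unweighted edges),
# sample phi ~ N(0, (beta * L_U)^{-1}), and verify equation (8).
H, W, beta = 7, 7, 4.0
n = H * W
idx = lambda i, j: i * W + j

L = np.zeros((n, n))
for i in range(H):
    for j in range(W):
        L[idx(i, j), idx(i, j)] = 4.0           # degree 4: boundary edges included
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            a, b = i + di, j + dj
            if 0 <= a < H and 0 <= b < W:       # neighbors in B contribute zero (Dirichlet)
                L[idx(i, j), idx(a, b)] = -1.0

assert np.all(np.linalg.eigvalsh(L) > 0)        # L_U is symmetric positive definite
G = np.linalg.inv(L)                            # Dirichlet Green matrix G_U = L_U^{-1}

rng = np.random.default_rng(0)
chol = np.linalg.cholesky(G / beta)             # Cov = beta^{-1} G_U
phi = chol @ rng.standard_normal((n, 100_000))  # one GFF sample per column
emp_cov = phi @ phi.T / phi.shape[1]
print(np.abs(emp_cov - G / beta).max())         # small: matches equation (8)
```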
Framework role of this subsection. At this point the state space of latent fields and the local operator have been fixed. The next section will solve the design layer inside this spatial VKD class: the variational principle will determine the canonical law family $\mathcal{F}^\star$, and the induced second-order geometry will appear as a consequence of the chosen operator rather than as an additional hyperparameter.

5. Solving the Design Layer: From Desiderata to Gaussian Chaos Noise

We now solve the mechanism-design layer of VKD for the spatial instantiation fixed in Section 4. The logical order is: specify learning desiderata, define the admissible class of latent laws, derive the canonical law and induced geometry, and only then choose a realization map that turns the latent object into a deployed gate. Read in this way, the section is not only about one new noise family; it is the full derivation of a solved VKD instance. The key mathematical point is that the optimization is performed over laws of the latent log-field; positivity and mean preservation are imposed afterwards at the realization stage through an exponential link and Wick normalization.

5.1. Design desiderata

Each desideratum constrains a different part of the framework: D1 selects the law inside an admissible class, D2–D3 constrain the realization map, D4 determines the operator geometry, and D5 ensures that the operator-level design problem is well posed.

D1 Least additional information (maximum entropy). Among all admissible laws satisfying the required constraints, choose the one with maximum differential entropy. Intuitively, the perturbation should avoid injecting unintended semantics.

D2 Positivity through an exponential link. The gate should modulate amplitude without introducing sign flips or hard artifact patterns. We therefore write $\xi = \exp(\zeta)$ for a real-valued log-field $\zeta \in \mathbb{R}^U$.

D3 No systematic scale drift. The gate should not create a persistent gain shift. In the exact construction this is enforced by Wick normalization, giving $\mathbb{E}[\xi(x)] = 1$ for every site $x \in U$.

D4 Spatial coherence via a quadratic smoothness budget. The perturbation should be spatially coherent rather than pixelwise i.i.d. We encode this through a local quadratic budget on the log-field:

$\mathbb{E}\big[\tfrac{1}{2} \langle \psi, Q \psi \rangle\big] = \varepsilon,$  (9)

where $Q \succ 0$ is a symmetric positive definite operator on $\mathbb{R}^U$. In the canonical grid construction of this paper, $Q = L_U$ is the Dirichlet Laplacian.

D5 Well-posedness through gauge fixing. A gauge convention is required so that the quadratic operator is invertible. In the main text we impose auxiliary Dirichlet boundary conditions, which make $L_U \succ 0$.

Why separate $\zeta$ and $\psi$? For the variational problem, the object being optimized is the law of a centered log-field $\psi$. Positivity and mean preservation are then enforced afterwards by mapping $\psi$ through a Wick-normalized exponential. This separation is useful because it makes clear which parts of the theory characterize the optimizer of the entropy problem and which parts define the final multiplicative gate.
5.2. A formal variational class

Fix an SPD operator $Q$ on $\mathbb{R}^U$, an energy budget $\varepsilon > 0$, and let $n := |U|$. Define the admissible class

$\mathcal{A}(Q, \varepsilon) := \Big\{ p : \mathbb{R}^U \to [0, \infty) \;\Big|\; \int_{\mathbb{R}^U} p(\psi)\, d\psi = 1,\ \int_{\mathbb{R}^U} \psi\, p(\psi)\, d\psi = 0,\ \int_{\mathbb{R}^U} \tfrac{1}{2} \langle \psi, Q \psi \rangle\, p(\psi)\, d\psi = \varepsilon,\ h(p) > -\infty \Big\},$  (10)

where

$h(p) := -\int_{\mathbb{R}^U} p(\psi) \log p(\psi)\, d\psi$

is the differential entropy. The associated variational problem is

$\sup_{p \in \mathcal{A}(Q, \varepsilon)} h(p).$  (11)

This formulation clarifies the scope of the theory. The design class is determined by three ingredients only: (i) the state space $\mathbb{R}^U$ of log-fields, (ii) the centering and quadratic-budget constraints, and (iii) the choice of local operator $Q$. The role of the operator is especially important: once $Q$ is fixed, the entropy maximizer, if it exists, must reveal the correlation geometry compatible with that operator. In VKD language, this subsection isolates the law-design part of the mechanism. The deployment axis has already been fixed in Section 4; the remaining task is to derive the canonical latent law and the geometry it induces.

5.3. Quadratic MaxEnt principle and operator-forced kernel geometry

The next theorem is the main design theorem for the solved quadratic VKD subfamily. It turns desiderata D1, D4, and D5 into a unique latent law, and it makes explicit how the operator budget fixes the scale of the optimizer. Relative to the earlier proof sketch, it yields the optimizer, its entropy value, the explicit scale, and an entropy-gap identity that certifies uniqueness.

Theorem 5.1 (Design theorem for the quadratic VKD subfamily). Let $Q \succ 0$ be symmetric positive definite on $\mathbb{R}^U$, let $n = |U|$, and let $\varepsilon > 0$. Then the variational problem (11) has a unique optimizer

$p^\star_{Q,\varepsilon} = \mathcal{N}\big(0, \Sigma_{Q,\varepsilon}\big), \qquad \Sigma_{Q,\varepsilon} = \frac{2\varepsilon}{n}\, Q^{-1}.$  (12)

Equivalently,

$p^\star_{Q,\varepsilon}(\psi) = \frac{1}{(2\pi)^{n/2} \det(\Sigma_{Q,\varepsilon})^{1/2}} \exp\Big(-\tfrac{1}{2}\, \psi^\top \Sigma_{Q,\varepsilon}^{-1}\, \psi\Big),$  (13)

with precision matrix $\Sigma_{Q,\varepsilon}^{-1} = \frac{n}{2\varepsilon} Q$. Moreover, for every $p \in \mathcal{A}(Q, \varepsilon)$,

$h(p^\star_{Q,\varepsilon}) - h(p) = \mathrm{KL}\big(p \,\|\, p^\star_{Q,\varepsilon}\big) \geq 0,$  (14)

so the optimizer is unique. Its entropy is

$h(p^\star_{Q,\varepsilon}) = \tfrac{1}{2} \log\Big((2\pi e)^n \det\big(\tfrac{2\varepsilon}{n} Q^{-1}\big)\Big).$  (15)

Proof sketch. Let $p^\star$ denote the Gaussian density in (12). Since $\Sigma_{Q,\varepsilon}^{-1} = \frac{n}{2\varepsilon} Q$, the quadratic constraint implies

$\mathbb{E}_{p^\star}\big[\tfrac{1}{2} \langle \psi, Q \psi \rangle\big] = \tfrac{1}{2} \mathrm{Tr}(Q \Sigma_{Q,\varepsilon}) = \tfrac{1}{2} \mathrm{Tr}\big(Q\, \tfrac{2\varepsilon}{n} Q^{-1}\big) = \varepsilon,$

so $p^\star \in \mathcal{A}(Q, \varepsilon)$. For any feasible $p$,

$\mathrm{KL}(p \,\|\, p^\star) = -h(p) - \int p(\psi) \log p^\star(\psi)\, d\psi.$

Because $\log p^\star(\psi) = c - \frac{n}{4\varepsilon} \langle \psi, Q \psi \rangle$ for a constant $c$, and every feasible $p$ has the same normalization, mean, and energy budget, the second term depends only on $(Q, \varepsilon)$ and coincides with $-h(p^\star)$. Hence (14) holds. Uniqueness follows because $\mathrm{KL}(p \,\|\, p^\star) = 0$ iff $p = p^\star$ a.e. The entropy formula is the standard entropy of a centered Gaussian with covariance $\Sigma_{Q,\varepsilon}$.

Corollary 2 (Operator-forced geometry in the Dirichlet instantiation). Taking $Q = L_U$ in Theorem 5.1 yields the unique entropy-maximizing log-field

$\psi \sim \mathcal{N}\big(0, (\beta L_U)^{-1}\big), \qquad \beta = \frac{n}{2\varepsilon}.$  (16)

Its covariance is

$\mathrm{Cov}(\psi) = \frac{2\varepsilon}{n}\, L_U^{-1} = \beta^{-1} G_U, \qquad G_U := L_U^{-1}.$  (17)

Thus, within the local quadratic design class determined by the Dirichlet energy, the correlation geometry is the Dirichlet Green kernel.
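A small numerical sanity check of the scale identity in Theorem 5.1 follows; this is a sketch under illustrative parameters, not part of the paper's code. Sampling from $\Sigma = \frac{2\varepsilon}{n} Q^{-1}$ should satisfy the quadratic budget (9) exactly in expectation for any SPD operator $Q$.

```python
import numpy as np

# Check E[(1/2) <psi, Q psi>] = eps for psi ~ N(0, (2 eps / n) Q^{-1}).
# Q is an arbitrary SPD matrix standing in for L_U; n and eps are illustrative.
rng = np.random.default_rng(1)
n, eps = 25, 0.5

A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)                   # an arbitrary SPD operator

Sigma = (2 * eps / n) * np.linalg.inv(Q)
psi = rng.multivariate_normal(np.zeros(n), Sigma, size=200_000)

energies = 0.5 * np.einsum('si,ij,sj->s', psi, Q, psi)
print(energies.mean())                        # approx eps = 0.5 (Monte Carlo)
print(0.5 * np.trace(Q @ Sigma))              # exactly eps: the proof's identity
```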
Remark 3 (What is and is not "forced"). The theorem does not say that the Green kernel is universally optimal for every noise-design problem. It says something more precise: once the design class is fixed by a local quadratic budget with operator $Q$, the entropy maximizer has covariance proportional to $Q^{-1}$. The Green kernel is forced specifically because the operator chosen here is the Dirichlet Laplacian.

Framework interpretation. Theorem 5.1 solves the latent-law axis of the VKD mechanism-design layer, and Corollary 2 shows that the second-order geometry is induced rather than tuned. What remains is the realization map $\ell$: how to turn the derived latent log-field into a positive, mean-preserving gate that can actually be deployed.

5.4. From the MaxEnt log-field to the canonical realization map

At this point the latent law and induced geometry have been derived. The remaining step in the design layer is the realization map $\ell$: how to turn the latent log-field into a positive gate satisfying D2–D3. The canonical answer in the solved VKD subfamily is the Wick-normalized exponential. Let

$\psi \sim \mathcal{N}(0, C), \qquad C = (\beta L_U)^{-1} = \frac{2\varepsilon}{n}\, G_U.$  (18)

For a strength parameter $\gamma \in \mathbb{R}$, define the exact Wick-normalized exponential

$\xi^{\mathrm{ex}}_\gamma(x) := \,:\!\exp(\gamma \psi(x))\!:\, = \exp\Big(\gamma \psi(x) - \frac{\gamma^2}{2}\, C(x, x)\Big), \qquad x \in U.$  (19)

This is the canonical exact realization associated with the variationally derived log-field. Positivity comes from the exponential map; mean preservation comes from the Wick correction.

Theorem 5.4 (Canonical realization in the solved VKD subfamily). Under desiderata D1–D5, the canonical exact positive mean-one multiplicative gate is obtained by: 1. sampling the MaxEnt log-field $\psi \sim \mathcal{N}\big(0, (\beta L_U)^{-1}\big)$ with $\beta = \frac{|U|}{2\varepsilon}$, and 2. applying the Wick-normalized exponential (19). For any sites $x_1, \dots, x_m \in U$,

$\mathbb{E}\Big[\prod_{r=1}^m \xi^{\mathrm{ex}}_\gamma(x_r)\Big] = \exp\Big(\gamma^2 \sum_{1 \leq a < b \leq m} C(x_a, x_b)\Big).$

In the implemented variant used in the experiments, the Wick correction is replaced by sample-wise mean-one normalization (see Algorithm 1), giving

$\xi^{\mathrm{sw}}_\gamma(x) = \frac{\exp(\gamma \psi(x))}{\frac{1}{|U|} \sum_{y \in U} \exp(\gamma \psi(y))}.$

5.5. Representation compatibility of the implemented gate

For the implemented gate, pairwise log-ratios of a positive field $h$ are deformed by an exactly Gaussian term (Theorem 5.7): for distinct $x, y \in U$,

$\log \frac{\tilde h(x)}{\tilde h(y)} = \log \frac{h(x)}{h(y)} + \Delta^{\mathrm{sw}}_{xy}(h), \qquad \Delta^{\mathrm{sw}}_{xy}(h) \sim \mathcal{N}\big(0, \tau R_G(x, y)\big),$

where $\tau = \gamma^2/\beta$ and $R_G(x, y) := G_U(x, x) + G_U(y, y) - 2 G_U(x, y)$.

Corollary 8 (Margin-sensitive ranking stability under the implemented gate). Assume $h(x) > h(y) > 0$ and $\tau R_G(x, y) > 0$, and define the log-margin

$\delta_{xy}(h) := \log h(x) - \log h(y) > 0.$  (35)

Then under the implemented sample-wise gate,

$\Pr\big(\tilde h(x) > \tilde h(y)\big) = \Phi\left(\frac{\delta_{xy}(h)}{\sqrt{\tau R_G(x, y)}}\right),$  (36)

where $\Phi$ is the standard Gaussian cdf. Equivalently,

$\Pr\big(\tilde h(x) \leq \tilde h(y)\big) = \Phi\left(-\frac{\delta_{xy}(h)}{\sqrt{\tau R_G(x, y)}}\right) \leq \exp\left(-\frac{\delta_{xy}(h)^2}{2 \tau R_G(x, y)}\right).$  (37)

Proof. By Theorem 5.7,

$\log \frac{\tilde h(x)}{\tilde h(y)} = \log \frac{h(x)}{h(y)} + \Delta^{\mathrm{sw}}_{xy}(h) = \delta_{xy}(h) + \Delta^{\mathrm{sw}}_{xy}(h),$

where $\Delta^{\mathrm{sw}}_{xy}(h) \sim \mathcal{N}(0, \tau R_G(x, y))$. Therefore

$\Pr\big(\tilde h(x) > \tilde h(y)\big) = \Pr\big(\delta_{xy}(h) + \Delta^{\mathrm{sw}}_{xy}(h) > 0\big) = \Phi\left(\frac{\delta_{xy}(h)}{\sqrt{\tau R_G(x, y)}}\right),$

which is (36). The tail bound follows from the standard Gaussian bound $\Phi(-u) \leq e^{-u^2/2}$ for $u > 0$.

Deep learning interpretation. Pairwise log-ratios are a natural coordinate system for relative evidence: how much stronger one region, token, or semantic part is than another. Corollary 8 says that the implemented GCh gate is margin-sensitive: if a feature comparison already has a large semantic log-margin, then the probability of preserving that ordering is exponentially close to one. This is the kind of behavior one wants from a late-stage regularizer: strong semantic contrasts become more, not less, stable. The mild condition $\tau R_G(x, y) > 0$ simply excludes the degenerate zero-variance case; on a connected Dirichlet grid it is automatic whenever $x \neq y$ and $\gamma \neq 0$.
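The contrast between the margin-sensitive law (36) and the margin-blind behavior of inverted dropout (Corollary 13 below) is easy to see numerically. In the sketch that follows, the value of $\tau R_G(x, y)$, the keep probability, and the margins are illustrative assumptions, not quantities taken from the paper's experiments.

```python
import numpy as np
from scipy.stats import norm

# Ordering-preservation probability for one compared pair, as a function of the
# log-margin: GCh follows eq. (36); inverted dropout is constant at q (eq. (44)).
tau_RG = 0.04                                    # assumed tau * R_G(x, y)
q = 0.9                                          # dropout keep probability
margins = np.array([0.05, 0.2, 0.5, 1.0, 2.0])   # log-margins delta_xy(h)

p_gch = norm.cdf(margins / np.sqrt(tau_RG))      # margin-sensitive, eq. (36)
p_dropout = np.full_like(margins, q)             # margin-blind, eq. (44)

for d, pg, pd in zip(margins, p_gch, p_dropout):
    print(f"margin={d:4.2f}  GCh={pg:.4f}  dropout={pd:.2f}")
# GCh approaches 1 as the margin grows; dropout stays pinned at q.
```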
To aggregate pairwise distortions over the grid, define the intrinsic interior edge set

$E_{\mathrm{int}} := \big\{ \{x, y\} \in E : x, y \in U \big\}$

and the associated intrinsic graph energy

$\mathcal{E}_{\mathrm{int}}(f) := \tfrac{1}{2} \sum_{\{x, y\} \in E_{\mathrm{int}}} c_{xy} \big(f(x) - f(y)\big)^2 = \tfrac{1}{2} \langle f, L_{\mathrm{int}} f \rangle,$  (38)

where $L_{\mathrm{int}}$ is the interior graph Laplacian on $U$ with no auxiliary boundary term. Unlike the Dirichlet energy used in the variational design, $\mathcal{E}_{\mathrm{int}}$ is invariant under adding spatial constants, so it measures relative geometry.

Corollary 9 (Exact expected intrinsic roughness budget under the implemented gate). Let $h : U \to (0, \infty)$ and $\tilde h = \xi^{\mathrm{sw}}_\gamma \odot h$. Then

$\mathbb{E}\big[\mathcal{E}_{\mathrm{int}}(\log \tilde h)\big] = \mathcal{E}_{\mathrm{int}}(\log h) + \gamma^2 \varepsilon_{\mathrm{int}}, \qquad \varepsilon_{\mathrm{int}} := \mathbb{E}\big[\mathcal{E}_{\mathrm{int}}(\psi)\big] = \tfrac{1}{2} \mathrm{Tr}(L_{\mathrm{int}} C).$  (39)

Proof. From the proof of Theorem 5.7, $\log \tilde h = \log h + \gamma \psi - c(\psi) \mathbf{1}$, where $\mathbf{1}$ is the all-ones vector on $U$. Because $L_{\mathrm{int}} \mathbf{1} = 0$, the constant term drops out of $\mathcal{E}_{\mathrm{int}}$. Hence $\mathcal{E}_{\mathrm{int}}(\log \tilde h) = \mathcal{E}_{\mathrm{int}}(\log h + \gamma \psi)$. Expanding the quadratic form and taking expectation gives

$\mathbb{E}\big[\mathcal{E}_{\mathrm{int}}(\log \tilde h)\big] = \mathcal{E}_{\mathrm{int}}(\log h) + \gamma\, \mathbb{E}\big[\langle \log h, L_{\mathrm{int}} \psi \rangle\big] + \gamma^2\, \mathbb{E}\big[\mathcal{E}_{\mathrm{int}}(\psi)\big] = \mathcal{E}_{\mathrm{int}}(\log h) + \gamma^2\, \mathbb{E}\big[\mathcal{E}_{\mathrm{int}}(\psi)\big],$

where the cross term vanishes because $\mathbb{E}[\psi] = 0$. Finally, $\mathbb{E}[\mathcal{E}_{\mathrm{int}}(\psi)] = \tfrac{1}{2} \mathbb{E}[\psi^\top L_{\mathrm{int}} \psi] = \tfrac{1}{2} \mathrm{Tr}(L_{\mathrm{int}} C)$.

Deep learning interpretation. Corollary 9 is the whole-map counterpart of the pairwise result. In the intrinsic log-geometry of a positive feature map, the implemented GCh gate adds an exactly quantified expected amount of roughness. It deforms the representation by a finite random field rather than puncturing it with hard zeros. For practitioners, this is the rigorous version of the intuition that GCh injects controlled uncertainty rather than discontinuous semantic damage.

Corollary 10 (Scale compatibility of the implemented GCh gate). For any $a > 0$ and any positive field $h : U \to (0, \infty)$, let $\tilde h_a := \xi^{\mathrm{sw}}_\gamma \odot (a h)$. Then for every $x, y \in U$,

$\Delta^{\mathrm{sw}}_{xy}(a h) = \Delta^{\mathrm{sw}}_{xy}(h),$  (40)

and

$\mathbb{E}\big[\mathcal{E}_{\mathrm{int}}(\log \tilde h_a)\big] - \mathcal{E}_{\mathrm{int}}(\log(a h)) = \gamma^2 \varepsilon_{\mathrm{int}}.$  (41)

Thus the pairwise deformation law and the added intrinsic roughness budget are invariant under global amplitude rescaling.

Proof. Because $\log(a h) = \log h + (\log a) \mathbf{1}$, global rescaling adds only a spatial constant in the log domain. Both Theorem 5.7 and the intrinsic energy $\mathcal{E}_{\mathrm{int}}$ are invariant under such constants, which yields (40) and (41).

Deep learning interpretation. This is a concrete advantage of working in multiplicative log-geometry. If the same semantic feature map is globally rescaled, for example by a change in channel gain, normalization, or overall confidence level, the geometric effect of the implemented GCh gate does not change. The perturbation tracks relative structure rather than absolute amplitude.

Corollary 11 (Finite expected intrinsic roughness for a perfectly coherent positive map under the implemented gate). Let $h : U \to (0, \infty)$ satisfy $\log h(x) \equiv c$ on $U$ for some constant $c \in \mathbb{R}$. Then for $\tilde h = \xi^{\mathrm{sw}}_\gamma \odot h$,

$\mathbb{E}\big[\mathcal{E}_{\mathrm{int}}(\log \tilde h)\big] = \gamma^2 \varepsilon_{\mathrm{int}}.$  (42)

In particular, a perfectly coherent positive map acquires a finite and explicitly budgeted expected intrinsic roughness under the implemented GCh gate.

Proof. If $\log h$ is constant on $U$, then $\mathcal{E}_{\mathrm{int}}(\log h) = 0$. The claim follows immediately from Corollary 9.

Deep learning interpretation. A late-stage representation is often close to piecewise coherent in log-amplitude: within a semantically consistent region, the main issue is not whether the feature is exactly constant, but whether the perturbation preserves the region as a coherent object. Corollary 11 gives an expectation-level statement: starting from zero intrinsic roughness, the implemented GCh gate produces a finite and explicitly budgeted expected roughness level rather than a singular or uncontrolled distortion.
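The exact budget (39) can be checked directly by Monte Carlo. The following sketch does so on a small grid; the grid size, $\beta$, $\gamma$, and the test field $h$ are illustrative choices, not settings from the paper's experiments.

```python
import numpy as np

# Monte Carlo check of Corollary 9:
# E[E_int(log h~)] = E_int(log h) + gamma^2 * (1/2) Tr(L_int C), eq. (39).
H, W, beta, gamma = 7, 7, 4.0, 0.1
n = H * W
idx = lambda i, j: i * W + j

# Interior graph Laplacian L_int (edges with both endpoints in U only).
L_int = np.zeros((n, n))
for i in range(H):
    for j in range(W):
        for di, dj in ((1, 0), (0, 1)):
            if i + di < H and j + dj < W:
                u, v = idx(i, j), idx(i + di, j + dj)
                L_int[u, u] += 1; L_int[v, v] += 1
                L_int[u, v] -= 1; L_int[v, u] -= 1

# The Dirichlet Laplacian adds the edges leaving U toward the zero boundary.
L = L_int.copy()
for i in range(H):
    for j in range(W):
        interior_deg = (i > 0) + (i < H - 1) + (j > 0) + (j < W - 1)
        L[idx(i, j), idx(i, j)] += 4 - interior_deg

C = np.linalg.inv(beta * L)                       # GFF covariance (beta L_U)^{-1}
E_int = lambda f: 0.5 * f @ L_int @ f

rng = np.random.default_rng(2)
h = np.exp(0.3 * rng.standard_normal(n))          # a positive feature map
chol = np.linalg.cholesky(C)

samples = []
for _ in range(20_000):
    psi = chol @ rng.standard_normal(n)
    xi = np.exp(gamma * psi)
    xi /= xi.mean()                               # sample-wise mean-one gate
    samples.append(E_int(np.log(xi * h)))

print(np.mean(samples))                                          # Monte Carlo
print(E_int(np.log(h)) + gamma**2 * 0.5 * np.trace(L_int @ C))   # exact (39)
```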
The next result formalizes the opposite behavior of hard binary masks. The singular-ratio statement applies whenever a compared pair can be zeroed with positive probability, and therefore covers dropout, DropBlock, and related hard-masking mechanisms in their natural nontrivial regime.

Theorem 5.12 (Binary masks are incompatible with finite log-ratio geometry). Let $h : U \to (0, \infty)$, let $a > 0$, and let $m : U \to \{0, a\}$ be any random binary mask. Define $\tilde h_m := m \odot h$. If there exist $x, y \in U$ such that

$\Pr\big(m(x) = 0 \text{ or } m(y) = 0\big) > 0,$  (43)

then $\log \frac{\tilde h_m(x)}{\tilde h_m(y)}$ fails to be an almost surely finite real-valued random variable. In particular, no finite-variance analog of Theorem 5.7 can hold for such a mask. For inverted dropout at distinct compared sites $x \neq y$,

$m_q(z) = \frac{b(z)}{q}, \qquad b(z) \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Bernoulli}(q),$

the total probability of a zero event at the compared pair is $1 - q^2$, with asymmetric singular events of total probability $2q(1 - q)$ and a joint-erasure event of probability $(1 - q)^2$.

Proof. If $m(x) = 0$ and $m(y) = a$, then $\tilde h_m(x) = 0 < \tilde h_m(y)$ and the log-ratio equals $-\infty$. If $m(x) = a$ and $m(y) = 0$, then the log-ratio equals $+\infty$. If $m(x) = m(y) = 0$, then both numerator and denominator vanish and the log-ratio is undefined, hence not a finite real number. Therefore the log-ratio fails to be almost surely finite whenever (43) holds. For inverted dropout at distinct sites $x \neq y$, independence gives

$\Pr\big(m_q(x) = 0 \text{ or } m_q(y) = 0\big) = 1 - q^2,$

while the asymmetric events have total probability $2q(1 - q)$ and the joint-erasure event has probability $(1 - q)^2$.

Corollary 13 (Margin-blind ranking under inverted dropout). Assume $h(x) > h(y) > 0$ and let $m_q$ be inverted dropout with keep probability $q \in (0, 1]$. Then

$\Pr\big((m_q \odot h)(x) > (m_q \odot h)(y)\big) = q.$  (44)

In particular, the probability of preserving the ordering is independent of the magnitude of the underlying feature margin.

Proof. Write $m_q(z) = b(z)/q$ with $b(z) \in \{0, 1\}$. If $b(x) = 1$, then $(m_q \odot h)(x) = h(x)/q$ and, regardless of whether $b(y) = 0$ or $1$, one has $(m_q \odot h)(x) > (m_q \odot h)(y)$ because $h(x) > h(y) > 0$. If $b(x) = 0$, then $(m_q \odot h)(x) = 0 \leq (m_q \odot h)(y)$. Therefore the ordering is preserved if and only if $b(x) = 1$, which occurs with probability $q$.

Deep learning interpretation. This corollary is intentionally blunt: even if one activation is arbitrarily more semantically decisive than another, inverted dropout preserves that ordering with probability exactly $q$ and destroys or erases it with probability $1 - q$. In that sense hard masking is margin-blind. By comparison, Corollary 8 shows that the implemented GCh gate becomes more stable as the semantic margin increases.
Proposition 14 (Exact intrinsic energy inflation under inverted dropout). Let $m_q(x) = b(x)/q$ with i.i.d. $b(x) \sim \mathrm{Bernoulli}(q)$ and $q \in (0, 1]$. Then for every deterministic field $h : U \to \mathbb{R}$,

$\mathbb{E}\big[\mathcal{E}_{\mathrm{int}}(m_q \odot h)\big] = \mathcal{E}_{\mathrm{int}}(h) + \frac{1 - q}{2q} \sum_{x \in U} d^{\mathrm{int}}_x\, h(x)^2,$  (45)

where

$d^{\mathrm{int}}_x := \sum_{y : \{x, y\} \in E_{\mathrm{int}}} c_{xy}$

is the intrinsic weighted degree of $x$.

Proof. Fix an interior edge $\{x, y\} \in E_{\mathrm{int}}$. Since $m_q(x)$ and $m_q(y)$ are independent and $\mathbb{E}[m_q(x)] = 1$, $\mathbb{E}[m_q(x)^2] = 1/q$, we have

$\mathbb{E}\big[(m_q(x) h(x) - m_q(y) h(y))^2\big] = \frac{1}{q} h(x)^2 + \frac{1}{q} h(y)^2 - 2 h(x) h(y) = \big(h(x) - h(y)\big)^2 + \Big(\frac{1}{q} - 1\Big)\big(h(x)^2 + h(y)^2\big).$

Multiply by $c_{xy}/2$ and sum over $E_{\mathrm{int}}$. The first term sums to $\mathcal{E}_{\mathrm{int}}(h)$, while the second becomes

$\frac{1 - q}{2q} \sum_{x \in U} d^{\mathrm{int}}_x\, h(x)^2.$

This is exactly (45).

Corollary 15 (Coherence amplification factor for inverted dropout). Assume $\mathcal{E}_{\mathrm{int}}(h) > 0$ and define the coherence score

$\kappa(h) := \frac{\sum_{x \in U} d^{\mathrm{int}}_x\, h(x)^2}{2\, \mathcal{E}_{\mathrm{int}}(h)}.$  (46)

Then inverted dropout satisfies

$\frac{\mathbb{E}\big[\mathcal{E}_{\mathrm{int}}(m_q \odot h)\big]}{\mathcal{E}_{\mathrm{int}}(h)} = 1 + \frac{1 - q}{q}\, \kappa(h).$  (47)

Proof. Divide both sides of (45) by $\mathcal{E}_{\mathrm{int}}(h) > 0$ and rearrange.

Deep learning interpretation. The scalar $\kappa(h)$ is an interpretable mismatch factor: it is large when a feature map carries nontrivial activation mass but varies only weakly across space, i.e. when the representation is coherent. Corollary 15 therefore says that hard masking damages coherent representations more severely in relative terms, and it does so by a completely explicit amplification factor.

Corollary 16 (Immediate loss of perfect coherence under inverted dropout in expectation). Assume $q \in (0, 1)$ and that the interior graph has at least one edge. Let $h(x) \equiv c$ on $U$ for some constant $c \neq 0$. Then

$\mathcal{E}_{\mathrm{int}}(h) = 0, \qquad \mathbb{E}\big[\mathcal{E}_{\mathrm{int}}(m_q \odot h)\big] = \frac{1 - q}{2q}\, c^2 \sum_{x \in U} d^{\mathrm{int}}_x > 0.$  (48)

Thus perfect coherence is not preserved by a single masking step: the post-mask field has strictly positive expected intrinsic roughness.

Proof. A constant field has zero intrinsic energy, so the claim follows immediately from Proposition 14.

Deep learning interpretation. This is the cleanest possible statement of hard-mask mismatch. Even if a feature map is spatially perfectly coherent before perturbation, binary masking does not preserve that zero-roughness state in any controlled relative sense. After one masking step the representation acquires strictly positive expected edgewise roughness, reflecting the discontinuities introduced by hard deletion.

Corollary 17 (Late-stage mismatch of inverted dropout under coherence). Let $(h_\ell)_{\ell \geq 1}$ be deterministic fields on $U$ such that

$\inf_{\ell \geq 1} \sum_{x \in U} d^{\mathrm{int}}_x\, h_\ell(x)^2 > 0, \qquad \mathcal{E}_{\mathrm{int}}(h_\ell) > 0 \text{ for every } \ell, \qquad \mathcal{E}_{\mathrm{int}}(h_\ell) \to 0.$  (49)

Then for every fixed $q \in (0, 1)$,

$\frac{\mathbb{E}\big[\mathcal{E}_{\mathrm{int}}(m_q \odot h_\ell)\big] - \mathcal{E}_{\mathrm{int}}(h_\ell)}{\mathcal{E}_{\mathrm{int}}(h_\ell)} \longrightarrow \infty.$  (50)

Thus, as the representation becomes more spatially coherent, the relative geometric distortion induced by binary masking diverges.

Proof. By Proposition 14,

$\frac{\mathbb{E}[\mathcal{E}_{\mathrm{int}}(m_q \odot h_\ell)] - \mathcal{E}_{\mathrm{int}}(h_\ell)}{\mathcal{E}_{\mathrm{int}}(h_\ell)} = \frac{1 - q}{2q} \cdot \frac{\sum_{x \in U} d^{\mathrm{int}}_x\, h_\ell(x)^2}{\mathcal{E}_{\mathrm{int}}(h_\ell)}.$

The numerator is bounded below by assumption, whereas the denominator tends to zero, so the ratio diverges to $+\infty$.
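Proposition 14 and Corollary 15 are also easy to verify numerically. The sketch below uses a 1D chain rather than a 2D grid purely for brevity; the field, chain length, and keep probability are illustrative assumptions.

```python
import numpy as np

# Monte Carlo check of Proposition 14 / Corollary 15 on a 1D chain.
rng = np.random.default_rng(3)
n, q = 50, 0.9
h = 1.0 + 0.1 * np.sin(np.linspace(0, 2 * np.pi, n))    # smooth, coherent field

E_int = lambda f: 0.5 * np.sum((f[1:] - f[:-1]) ** 2)   # chain intrinsic energy
d_int = np.full(n, 2.0); d_int[0] = d_int[-1] = 1.0     # intrinsic degrees

# Inverted dropout m_q = b / q with i.i.d. b ~ Bernoulli(q) at each site.
est = np.mean([E_int(((rng.random(n) < q) / q) * h) for _ in range(100_000)])
exact = E_int(h) + (1 - q) / (2 * q) * np.sum(d_int * h ** 2)
print(est, exact)                                       # agree up to MC error

# Coherence amplification (47): large because h is nearly flat.
kappa = np.sum(d_int * h ** 2) / (2 * E_int(h))
print(1 + (1 - q) / q * kappa)
```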
Corollary 18 (Margin-growth regime: GCh strengthens while dropout saturates). Fix distinct $x, y \in U$ and assume $\tau R_G(x, y) > 0$. Let $(h_\ell)_{\ell \geq 1}$ be positive fields with $h_\ell(x) > h_\ell(y)$ for every $\ell$. Define $\delta_\ell := \log h_\ell(x) - \log h_\ell(y)$. If

$\frac{\delta_\ell}{\sqrt{\tau R_G(x, y)}} \longrightarrow \infty,$  (51)

then under the implemented sample-wise GCh gate,

$\Pr\big(\tilde h_\ell(x) > \tilde h_\ell(y)\big) \longrightarrow 1.$  (52)

Under inverted dropout with keep probability $q$, however,

$\Pr\big((m_q \odot h_\ell)(x) > (m_q \odot h_\ell)(y)\big) = q \quad \text{for every } \ell.$  (53)

Proof. Equation (52) follows immediately from Corollary 8 and the assumption (51). Equation (53) is exactly Corollary 13.

Deep learning interpretation. Corollary 18 is the mathematically clean version of the informal slogan that GCh becomes more compatible with later, sharper semantic representations. It does not claim that depth is always beneficial in every model. Instead it says: whenever late-stage representations become more decisively separated in their relative log-margins, the implemented GCh gate respects those rankings with probability tending to one, while hard masking stays stuck at the same keep-probability ceiling.

Corollary 19 (Representation-compatibility dichotomy). Under the hypotheses of Theorems 5.7 and 5.12 and Corollaries 9 and 17, the implemented GCh gate and hard binary masks exhibit qualitatively different behavior on positive coherent representations:

1. the implemented GCh gate preserves a finite relative log-geometry, with exact Gaussian pairwise deformations, margin-sensitive ranking stability, and an exact additive intrinsic roughness budget;

2. any hard binary mask that can zero one or both members of a compared pair with positive probability fails to preserve finite log-ratio geometry, and inverted dropout preserves pairwise ranking only with the margin-blind probability $q$; and

3. for inverted dropout, the relative intrinsic distortion diverges along coherent representation sequences satisfying (49).

The assumptions in Corollary 17 are a clean mathematical abstraction of the late-semantic regime: the representation retains nontrivial mass but becomes increasingly low-frequency or spatially coherent. In that regime, binary masking becomes more and more mismatched. By contrast, Theorem 5.7 and Corollaries 8, 9, 18 and 19 show that the implemented GCh gate continues to produce a finite Gaussian deformation whose pairwise, ranking, and aggregate effects are controlled by the Green geometry.

Engineering takeaway. If a layer encodes positive region-level evidence or token-level saliency, then the mathematically relevant question is not merely whether noise is mean-preserving in expectation, but whether it preserves relative comparisons that the downstream model relies on. The results above say that GCh perturbs those comparisons through a finite, margin-aware Gaussian deformation, whereas hard binary masks can delete them outright and become especially mismatched when the representation is coherent and semantically sharp. A topological complement is given in Appendix E: positive multiplicative gates perturb superlevel sets only through a multiplicative threshold band, whereas hard Bernoulli masking destroys loop-type excursion topology with probability $1 - q^n$ on an $n$-cycle.

What these theorems do and do not claim. They do not prove that one should always inject noise deeper, nor that every masking strategy is inferior in every possible regime.
What they prove is a sharper and more defensible statement: once a layer behaves like a positive coherent evidence map, there is a mathematically meaningful comparison to make. In that regime, the implemented GCh gate preserves finite relative geometry, ranking information, and an explicit global roughness budget, while hard binary masking either makes those quantities singular or amplifies their distortion by an explicit coherence factor. That is exactly the regime targeted by the late-stage experiments in this paper.

5.6. Implementation and efficient sampling

Injecting the gate. Given a feature map $F \in \mathbb{R}^{C \times H \times W}$, we inject the spatial gate multiplicatively:

$\tilde F_c(x) = F_c(x)\, \xi_\gamma(x), \qquad x \in U.$  (54)

In the experiments, $\beta$ is fixed once the grid, operator, and normalization convention are chosen; $\gamma$ is the reported strength knob.

FFT/DST sampling of the GFF log-field. For the unweighted four-neighbor Dirichlet Laplacian on the $H \times W$ interior grid $U$, the eigenbasis is the 2D sine basis:

$e_{k,\ell}(i, j) = \sin\Big(\frac{\pi k i}{H + 1}\Big) \sin\Big(\frac{\pi \ell j}{W + 1}\Big),$  (55)

$\lambda_{k,\ell} = 4 \sin^2\Big(\frac{\pi k}{2(H + 1)}\Big) + 4 \sin^2\Big(\frac{\pi \ell}{2(W + 1)}\Big),$  (56)

for $1 \leq k \leq H$ and $1 \leq \ell \leq W$. Hence sampling $\psi \sim \mathcal{N}(0, (\beta L_U)^{-1})$ reduces to spectral synthesis: draw i.i.d. $Z_{k,\ell} \sim \mathcal{N}(0, 1)$, set $A_{k,\ell} = Z_{k,\ell} / \sqrt{\beta \lambda_{k,\ell}}$, and compute $\psi = \mathrm{IDST2}(A)$ using an orthonormal inverse discrete sine transform. Fast DST implementations rely on FFT internally, giving near-linear complexity in the number of spatial sites.

Algorithm 1: GCh on an $H \times W$ grid (Dirichlet; FFT/DST implementation)
1: Input: grid size $(H, W)$, parameters $\beta > 0$, $\gamma \in \mathbb{R}$, feature map $F \in \mathbb{R}^{C \times H \times W}$
2: Precompute once: eigenvalues $\lambda_{k,\ell}$ in (56); choose a DST convention; optionally precompute the variance map $v(x) = C(x, x)$
3: Sample spectral coefficients: draw i.i.d. $Z_{k,\ell} \sim \mathcal{N}(0, 1)$
4: Scale by the Laplacian spectrum: set $A_{k,\ell} \leftarrow Z_{k,\ell} / \sqrt{\beta \lambda_{k,\ell}}$
5: Inverse transform: $\psi \leftarrow \mathrm{IDST2}(A)$ (so $\psi \sim \mathcal{N}(0, (\beta L_U)^{-1})$)
6: Exponentiate: $G(x) \leftarrow \exp(\gamma \psi(x))$ for all $x \in U$
7: Normalize (choose one):
8:   Exact Wick: $\xi(x) \leftarrow \exp\big(\gamma \psi(x) - \frac{\gamma^2}{2} v(x)\big)$
9:   Sample-wise mean-one: $\xi(x) \leftarrow G(x) \big/ \frac{1}{|U|} \sum_{y \in U} G(y)$
10: Inject into features: $\tilde F_c(x) \leftarrow F_c(x)\, \xi(x)$ for all channels $c$ and sites $x \in U$
11: Output: noised feature map $\tilde F$
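For concreteness, here is a minimal NumPy sketch of Algorithm 1. It performs the sine synthesis by explicit multiplication against the orthonormalized eigenbasis (55), which is exact and fast enough for feature-map-sized grids; a production version would substitute an FFT-based DST as the text describes. The function name and parameter values are ours and illustrative.

```python
import numpy as np

def gch_gate(H, W, beta, gamma, rng, mode="samplewise"):
    """One GCh spatial gate on an H x W Dirichlet grid (sketch of Algorithm 1)."""
    # Orthonormal 1D sine eigenbases of the Dirichlet Laplacian, eq. (55).
    i = np.arange(1, H + 1); k = np.arange(1, H + 1)
    SH = np.sqrt(2.0 / (H + 1)) * np.sin(np.pi * np.outer(i, k) / (H + 1))
    j = np.arange(1, W + 1); l = np.arange(1, W + 1)
    SW = np.sqrt(2.0 / (W + 1)) * np.sin(np.pi * np.outer(j, l) / (W + 1))

    # 2D eigenvalues, eq. (56), and spectral coefficients A = Z / sqrt(beta * lam).
    lam = (4 * np.sin(np.pi * k / (2 * (H + 1))) ** 2)[:, None] + \
          (4 * np.sin(np.pi * l / (2 * (W + 1))) ** 2)[None, :]
    A = rng.standard_normal((H, W)) / np.sqrt(beta * lam)

    psi = SH @ A @ SW.T                          # psi ~ N(0, (beta L_U)^{-1})
    G = np.exp(gamma * psi)
    if mode == "samplewise":                     # sample-wise mean-one normalization
        return G / G.mean()
    v = (SH ** 2) @ (1.0 / (beta * lam)) @ (SW ** 2).T   # variance map v(x) = C(x, x)
    return np.exp(gamma * psi - 0.5 * gamma ** 2 * v)    # exact Wick gate

rng = np.random.default_rng(0)
F = np.abs(rng.standard_normal((256, 7, 7)))     # a positive C x H x W feature map
xi = gch_gate(7, 7, beta=4.0, gamma=0.1, rng=rng)
F_noised = F * xi[None, :, :]                    # spatial broadcast across channels
print(xi.mean())                                 # samplewise mode: exactly 1.0
```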
6. Experiments

We evaluate whether the design principles behind GCh translate into practical gains. Our empirical questions are: (i) which ingredients matter beyond raw noise magnitude, (ii) where in network depth the mechanism is most effective, and (iii) whether the effect transfers beyond the primary CNN setting. Detailed protocols are provided in Appendix F.

Theory-to-experiment map. The representation-compatibility results make four concrete empirical predictions. Pairwise log-ratio stability and the margin-sensitive ranking law predict that when late-stage representations encode decisive relative evidence, GCh should preserve that evidence better than hard masking. The intrinsic roughness budget predicts a broad non-destructive regime of stochasticity rather than abrupt fragmentation. The immediate loss-of-perfect-coherence and coherence-mismatch results predict that once a representation becomes spatially coherent, binary masking should incur disproportionate damage, especially at late stages. Finally, the topological appendix is most relevant to the fine-grained Pets pilot, where preserving coherent part structure matters most directly.

6.1. Controlled multi-baseline ImageNet study

To isolate the source of the gains, we run a controlled 3-seed comparison that separates noise magnitude, spatial correlation, and positivity/mean-one multiplicative gating. We compare GCh against Dropout and DropBlock, together with additive Gaussian baselines (i.i.d. and correlated) whose injected strength is energy-matched. Table 1 reports mean ± std. Unless otherwise stated, the GCh experiments use the sample-wise mean-one normalization of Algorithm 1, which is the implementation variant used throughout the main body.

Unified strength knob. To keep tables compact, we write g for the method-specific strength parameter. For GCh, g ≡ γ; for Gaussian baselines, g ≡ σ; for Dropout and DropBlock, g ≡ p; and for the no-noise baseline, g = 0.

Method       | g   | Top-1 ↑       | NLL ↓         | ECE ↓
None         | 0   | 0.765 ± 0.001 | 0.931 ± 0.004 | 0.030 ± 0.001
Dropout      | 0.1 | 0.764 ± 0.001 | 0.942 ± 0.005 | 0.033 ± 0.001
DropBlock    | 0.1 | 0.765 ± 0.000 | 0.930 ± 0.002 | 0.032 ± 0.000
IID Gauss.   | 0.1 | 0.765 ± 0.001 | 0.930 ± 0.005 | 0.032 ± 0.002
Cor. Gauss.  | 0.1 | 0.765 ± 0.000 | 0.944 ± 0.002 | 0.037 ± 0.001
GCh (ours)   | 0.1 | 0.764 ± 0.001 | 0.934 ± 0.004 | 0.020 ± 0.001

Table 1. ImageNet val (uncorrupted) under late-stage injection (layer4). Mean ± std over 3 seeds. Here g denotes the method-specific strength knob: g = γ for GCh, g = σ for Gaussian baselines, and g = p for Dropout/DropBlock.

6.2. Transfer to a transformer backbone: Swin-T

We additionally evaluate GCh on Swin-T under the same full-recipe training setup. Direct Dropout/DropBlock analogues are less aligned with transformer pipelines because token-based representations and attention updates no longer correspond to contiguous suppression on a convolutional feature grid, and standard transformer regularization usually acts on different objects (e.g., stochastic depth, attention dropout, or MLP dropout). We therefore report the clean full-recipe baseline and isolate the incremental effect of GCh in this setting. Table 2 reports best-checkpoint performance.

Method          | Top-1 Acc. ↑ | NLL ↓  | ECE ↓
Baseline (None) | 80.03%       | 0.9213 | 0.0762
GCh (ours)      | 80.11%       | 0.9131 | 0.0738

Table 2. Swin-T (best checkpoint). Full-recipe training, single run.
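All tables report NLL and ECE alongside accuracy. As a reference for readers unfamiliar with these metrics, the following is a minimal sketch of how they are commonly computed; the 15 equal-width confidence bins are a standard convention and our assumption here, since the paper's exact evaluation protocol is given in its Appendix F.

```python
import numpy as np

def nll(probs, labels):
    """Mean negative log-likelihood of the true class."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def ece(probs, labels, n_bins=15):
    """Expected calibration error: |accuracy - confidence| weighted by bin mass."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            total += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return total

rng = np.random.default_rng(4)
logits = 3.0 * rng.standard_normal((1000, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = rng.integers(0, 10, size=1000)
print(nll(probs, labels), ece(probs, labels))
```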
6.2. Transfer to a transformer backbone: Swin-T

We additionally evaluate GCh on Swin-T under the same full-recipe training setup. Direct Dropout/DropBlock analogues are less aligned with transformer pipelines because token-based representations and attention updates no longer correspond to contiguous suppression on a convolutional feature grid, and standard transformer regularization usually acts on different objects (e.g., stochastic depth, attention dropout, or MLP dropout). We therefore report the clean full-recipe baseline and isolate the incremental effect of GCh in this setting. Table 2 reports best-checkpoint performance.

Method            Top-1 Acc. ↑   NLL ↓     ECE ↓
Baseline (None)   80.03%         0.9213    0.0762
GCh (ours)        80.11%         0.9131    0.0738

Table 2. Swin-T (best checkpoint). Full-recipe training, single run.

6.3. What the evidence shows

The ImageNet controlled study supports three main conclusions.

First, correlation alone is not enough. In Table 1, the correlated additive Gaussian baseline can worsen calibration relative to the no-noise baseline at matched strength. The strongest ECE improvements appear only when correlation is combined with positive mean-one multiplicative gating, namely in GCh.

Second, depth matters. Injection depth induces a clear accuracy–calibration trade-off (Appendix Table 5): moving from earlier to later stages substantially improves calibration while changing accuracy only modestly. In the selected 7-corruption ImageNet-C evaluation, the late-stage setting reduces ECE by 46% and improves NLL by 3.3% relative to the no-noise baseline (Appendix Table 7), with corruption-wise details in Table 10.

Third, there is a stable operating regime. A strength sweep reveals a broad useful range around g ≈ 0.07–0.18. When g becomes too large, accuracy collapses and NLL rises sharply; ECE can also become misleadingly small under severe underconfidence, which we treat as a failure mode rather than a favorable outcome (Appendix Table 9). This empirical pattern is consistent with Theorem 5.7 and Corollaries 8, 9, 11, 13, 16, and 17: the implemented GCh gate induces a finite Gaussian deformation in relative log-geometry together with an exact expected intrinsic roughness budget, while also becoming more stable when semantic margins are sharper; hard masking, by contrast, is margin-blind and incurs a coherence-sensitive geometric penalty whose relative size grows in the coherent-representation regime.

Finally, the Swin-T result provides preliminary transfer evidence beyond the primary ResNet-50 setting, and the Oxford-IIIT Pets pilot in the appendix supports the claim that spatially coherent positive gating is especially compatible with fine-grained, structure-sensitive recognition.

7. Conclusion

We proposed Variational Kernel Design as a compositional framework for internal noise in deep learning. In VKD, a stochastic mechanism is not chosen from a fixed heuristic menu; it is derived from learning-relevant constraints and then analyzed on the representation regime where it is actually deployed. This two-layer viewpoint, mechanism design followed by compatibility analysis, is the main conceptual contribution of the paper.

Within the solved quadratic VKD subfamily studied here, we formulated the log-field construction problem as an explicit finite-dimensional maximum-entropy program under a quadratic operator budget, solved that program in closed form, and obtained an entropy-gap identity showing that the optimizer is uniquely Gaussian. For the Dirichlet operator, this makes the Green kernel emerge as the induced correlation geometry; after Wick normalization, it yields the canonical exact GCh gate. Once the operator and energy budget are fixed, the exact gate becomes an effectively one-parameter family through τ = γ²/β.

The more distinctive message of the paper is the representation-compatibility layer that sits on top of this variational design. For the sample-wise gate actually used in the experiments, we established exact Gaussian control of pairwise log-ratios, margin-sensitive ranking stability, and an exact expected intrinsic roughness budget. For hard binary masking, we proved the opposite kind of statement: incompatibility with finite log-ratio geometry, margin-blind ranking under inverted dropout, immediate loss of perfect coherence in expectation on perfectly coherent maps, and a relative distortion term that diverges in the coherent-representation regime. The central contrast is therefore not merely Gaussian versus Bernoulli; it is finite, margin-aware deformation versus singular or coherence-amplified deletion.

These theorems are intentionally conditional rather than universal. They do not claim that every deeper layer in every architecture will automatically favor GCh. They claim something more precise and, for practice, more useful: whenever positive semantic representations become coherent and their relative evidence sharpens, smooth multiplicative gating preserves those comparisons in a way that hard deletion cannot. That conditional form is exactly what allows the theory to speak directly to the late-stage regime without pretending to replace empirical evaluation.
Empirically, GCh improves calibration on clean ImageNet, improves both ECE and NLL on a selected 7-corruption ImageNet-C evaluation, remains effective in late semantic stages where hard masking can degrade clean calibration, and shows encouraging transfer to Swin-T and a fine-grained pilot. The practical takeaway is simple: if a layer carries positive, coherent, region-level evidence, then the right question is not merely whether noise is unbiased, but whether it perturbs relative evidence smoothly or deletes it abruptly, and whether that perturbation respects the comparisons the downstream network actually uses. Our theory says that GCh does the former, whereas canonical hard binary masks such as dropout and DropBlock-type deletion mechanisms tend to do the latter in the coherent late-stage regime.

More broadly, the paper suggests a reusable recipe for future work. First choose the operator that encodes the geometry one wants the noise to respect. Then derive the corresponding latent law and realization. Finally, ask whether the implemented mechanism is compatible with the representation regime of interest. That perspective opens the door to principled variants based on massive, anisotropic, graph-adapted, or architecture-specific operators while preserving the same mathematical blueprint.

References

Christopher M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.

Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 702–703, 2020.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.

Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with Cutout. arXiv preprint arXiv:1708.04552, 2017.

Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with structured dropout. In International Conference on Learning Representations (ICLR), 2020.

Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1050–1059, 2016.

Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. DropBlock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 10727–10737, 2018.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1321–1330, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations (ICLR), 2019.
Dan Hendrycks, Norman Mu, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. AugMix: A simple data processing method to improve robustness and uncertainty. In International Conference on Learning Representations (ICLR), 2020.

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8340–8349, 2021.

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision (ECCV), pages 646–661, 2016.

Meelis Kull, Miquel Perelló Nieto, Markus Kängsepp, Telmo de Menezes e Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration. In Advances in Neural Information Processing Systems (NeurIPS), pages 12316–12326, 2019.

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS), pages 6402–6413, 2017.

Yue Liu, Christos Matsoukas, Fredrik Strand, Hossein Azizpour, and Kevin Smith. PatchDropout: Economizing vision transformers using patch dropout. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 4917–4926, 2023.

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10012–10022, 2021.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), 2018.

Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, and Mario Lucic. Revisiting the calibration of modern neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 15682–15694, 2021.

Rafael Müller, Simon Kornblith, and Geoffrey E. Hinton. When does label smoothing help? In Advances in Neural Information Processing Systems (NeurIPS), pages 4696–4705, 2019.

Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 2901–2907, 2015.

Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua V. Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems (NeurIPS), pages 13991–14002, 2019.

Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3498–3505, 2012.
Evgenia Rusak, Lukas Schott, Roland S. Zimmermann, Julian Bitterwolf, Oliver Bringmann, Matthias Bethge, and Wieland Brendel. A simple way to make neural networks robust against diverse image corruptions. In European Conference on Computer Vision (ECCV), pages 53–69, 2020.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958, 2014.

Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. In Advances in Neural Information Processing Systems (NeurIPS), pages 18583–18599, 2020.

Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2020.

Yoshihiro Yamada, Masakazu Iwamura, Takuya Akiba, and Koichi Kise. ShakeDrop regularization for deep residual learning. arXiv preprint arXiv:1802.02375, 2018.

Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6023–6032, 2019.

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-off between robustness and accuracy. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 7472–7482, 2019.

Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations (ICLR), 2018.

A. Notation and Terminology (Glossary)

• U: the H × W feature grid on which the gate is sampled and applied.
• B: the auxiliary Dirichlet boundary outside U; Ū = U ∪ B.
• L_U: Dirichlet Laplacian on U; G_U = L_U^{-1}: Dirichlet Green kernel.
• F: law family of latent log-fields in the VKD mechanism.
• K: intended second-order geometry / kernel in the VKD mechanism.
• T: injection operator in the VKD mechanism.
• ℓ: realization map from latent log-field to positive gate.
• R: target representation regime for compatibility analysis.
• ψ: log-field; ξ: positive multiplicative gate; γ: GCh strength parameter.
• g: unified strength knob in the experimental tables (g = γ/σ/p depending on the method).
• GCh: Gaussian Chaos Noise / gate (ours).
• IID/Corr. Gaussian: additive Gaussian baselines with matched injected energy.

B. Full variational derivation for Theorem 5.1

This appendix gives a fuller proof of the quadratic MaxEnt principle, including an entropy-gap identity and, for completeness, the corresponding Euler–Lagrange stationarity calculation. Let Q ≻ 0 be symmetric positive definite on ℝ^U, let n = |U|, and let ε > 0. Recall the admissible class

A(Q, ε) = { p : ℝ^U → [0, ∞) : ∫ p = 1, ∫ ψ p(ψ) dψ = 0, ∫ ½⟨ψ, Qψ⟩ p(ψ) dψ = ε, h(p) > −∞ }.
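Before the proof, a quick numerical sanity check (our addition, in plain NumPy with a small random SPD Q) that the Gaussian N(0, Σ_{Q,ε}) with Σ_{Q,ε} = (2ε/n)Q^{-1} meets the energy budget and hence sits in A(Q, ε):

```python
# Sanity check (ours): for p* = N(0, (2*eps/n) Q^{-1}),
# the quadratic energy E[<psi, Q psi>/2] equals eps.
import numpy as np

rng = np.random.default_rng(0)
n, eps = 8, 0.5
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)                  # random symmetric positive definite Q
Sigma = (2 * eps / n) * np.linalg.inv(Q)     # Sigma_{Q, eps}

# Exact identity: E[psi^T Q psi / 2] = Tr(Q Sigma) / 2
print(0.5 * np.trace(Q @ Sigma))             # -> 0.5 exactly (= eps)

# Monte Carlo confirmation
psi = rng.multivariate_normal(np.zeros(n), Sigma, size=200_000)
print(0.5 * np.mean(np.einsum("bi,ij,bj->b", psi, Q, psi)))  # ~ 0.5
```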
B.1. Entropy-gap proof of optimality and uniqueness

Define

Σ_{Q,ε} := (2ε/n) Q^{-1},  p⋆(ψ) := (2π)^{-n/2} det(Σ_{Q,ε})^{-1/2} exp(−½ ψᵀ Σ_{Q,ε}^{-1} ψ).

Since Σ_{Q,ε}^{-1} = (n/2ε) Q, we have

E_{p⋆}[½⟨ψ, Qψ⟩] = ½ Tr(Q Σ_{Q,ε}) = ½ Tr(Q · (2ε/n) Q^{-1}) = ε,

and clearly E_{p⋆}[ψ] = 0, so p⋆ ∈ A(Q, ε).

Now fix any p ∈ A(Q, ε). Using the definition of KL divergence,

KL(p ∥ p⋆) = ∫ p(ψ) log(p(ψ)/p⋆(ψ)) dψ = −h(p) − ∫ p(ψ) log p⋆(ψ) dψ.  (57)

Since

log p⋆(ψ) = −(n/2) log(2π) − ½ log det(Σ_{Q,ε}) − ½ ψᵀ Σ_{Q,ε}^{-1} ψ,

and Σ_{Q,ε}^{-1} = (n/2ε) Q, the energy constraint gives

−∫ p(ψ) log p⋆(ψ) dψ = (n/2) log(2π) + ½ log det(Σ_{Q,ε}) + (n/2ε) ∫ ½⟨ψ, Qψ⟩ p(ψ) dψ
                      = (n/2) log(2π) + ½ log det(Σ_{Q,ε}) + n/2.  (58)

But the right-hand side is exactly the entropy of p⋆:

h(p⋆) = ½ log((2πe)^n det(Σ_{Q,ε})).

Therefore (57) and (58) imply

KL(p ∥ p⋆) = h(p⋆) − h(p).

Because KL(p ∥ p⋆) ≥ 0, we obtain h(p) ≤ h(p⋆), with equality iff KL(p ∥ p⋆) = 0, i.e. iff p = p⋆ almost everywhere. This proves both optimality and uniqueness.

B.2. Euler–Lagrange derivation (for completeness)

The same optimizer can be recovered by stationarity. Introduce Lagrange multipliers λ₀ ∈ ℝ, λ ∈ ℝ^U, and β ∈ ℝ, and define

L(p) = −∫ p(ψ) log p(ψ) dψ + λ₀(∫ p(ψ) dψ − 1) + ⟨λ, ∫ ψ p(ψ) dψ⟩ − β(∫ ½⟨ψ, Qψ⟩ p(ψ) dψ − ε).  (59)

For an interior optimum, the first variation in a direction δp gives

∫ [−log p(ψ) − 1 + λ₀ + ⟨λ, ψ⟩ − β · ½⟨ψ, Qψ⟩] δp(ψ) dψ = 0.

Hence the Euler–Lagrange equation is

−log p(ψ) − 1 + λ₀ + ⟨λ, ψ⟩ − β · ½⟨ψ, Qψ⟩ = 0,

so p(ψ) ∝ exp(⟨λ, ψ⟩) exp(−β · ½⟨ψ, Qψ⟩).

The centering constraint forces λ = 0, and integrability requires β > 0 because Q ≻ 0. Thus

p(ψ) ∝ exp(−β · ½⟨ψ, Qψ⟩) = exp(−½ ψᵀ(βQ)ψ),

which is the centered Gaussian N(0, (βQ)^{-1}). Matching the energy budget yields

ε = ½ Tr(Q (βQ)^{-1}) = n/(2β),  so  β = n/(2ε).

This reproduces (βQ)^{-1} = (2ε/n) Q^{-1} = Σ_{Q,ε}.

B.3. Dirichlet specialization

Setting Q = L_U gives the optimizer used in the main text:

p⋆_{L_U,ε} = N(0, (β L_U)^{-1}),  β = |U|/(2ε).

Its covariance is Cov(ψ) = (2ε/|U|) L_U^{-1} = β^{-1} G_U.

Other boundary conditions. If one uses periodic or Neumann boundary conditions on a connected finite graph, the Laplacian has a constant nullspace, so the corresponding field must be defined after gauge fixing, for example by pinning one site or imposing zero spatial mean and using the Moore–Penrose pseudoinverse. Under the auxiliary Dirichlet boundary used in the main text, L_U ≻ 0 and no additional gauge fixing is needed.

Massive variant. A regularized or massive variant replaces L_U by L_U + μI for μ > 0:

ψ ~ N(0, (β(L_U + μI))^{-1}).

This corresponds to the quadratic energy E_μ(ψ) = ½ ψᵀ(L_U + μI)ψ and yields a better-conditioned covariance with shorter-range correlations.

C. Further properties of the exact Gaussian-chaos gate

This appendix collects simple but useful consequences of the exact construction.

C.1. All-order moment formula

Let ξ^ex_γ be defined by (19). For any x₁, ..., x_m ∈ U,

E[∏_{r=1}^{m} ξ^ex_γ(x_r)] = exp(γ² Σ_{1≤a<b≤m} C(x_a, x_b)).
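As a worked special case (our addition), the m = 2 instance recovers the covariance structure of the exact gate and makes the role of C(x, y) explicit:

```latex
% Worked m = 2 case of the all-order moment formula, using the Gaussian MGF
% and Var(psi(x) + psi(y)) = v(x) + v(y) + 2 C(x, y).
\begin{align*}
\mathbb{E}\!\left[\xi^{\mathrm{ex}}_{\gamma}(x)\,\xi^{\mathrm{ex}}_{\gamma}(y)\right]
 &= \mathbb{E}\!\left[\exp\!\Big(\gamma\big(\psi(x)+\psi(y)\big)
      - \tfrac{\gamma^{2}}{2}\big(v(x)+v(y)\big)\Big)\right] \\
 &= \exp\!\Big(\tfrac{\gamma^{2}}{2}\,\mathrm{Var}\big(\psi(x)+\psi(y)\big)
      - \tfrac{\gamma^{2}}{2}\big(v(x)+v(y)\big)\Big)
  = \exp\!\big(\gamma^{2}\,C(x,y)\big).
\end{align*}
```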
E. Superlevel-set topology under gating and masking

For t > 0, define the superlevel set

S_t(f) := { x ∈ U : f(x) ≥ t },

and view it as the induced subgraph of the underlying adjacency graph on U.

Proposition 1 (Threshold-band stability under positive multiplicative gating). Let h : U → (0, ∞), let ξ : U → (0, ∞), and assume ∥log ξ∥_∞ ≤ η. Then for every t > 0,

S_{te^η}(h) ⊆ S_t(ξ ⊙ h) ⊆ S_{te^{−η}}(h).  (60)

In particular, the superlevel topology of ξ ⊙ h at level t can differ from that of h only through threshold events already present in the band [te^{−η}, te^{η}].

Deep learning interpretation. The proposition says that a positive multiplicative gate does not tear the superlevel geometry apart arbitrarily. It only moves the effective threshold within a multiplicative band. For representation learning, this is a rigorous way to say that coherent regions can shift smoothly under GCh rather than being punctured by hard zeros.

Proof. From ∥log ξ∥_∞ ≤ η we have e^{−η} ≤ ξ(x) ≤ e^{η} for every x ∈ U. If h(x) ≥ te^{η}, then ξ(x)h(x) ≥ e^{−η} · te^{η} = t, so x ∈ S_t(ξ ⊙ h). Conversely, if ξ(x)h(x) ≥ t, then h(x) ≥ t/ξ(x) ≥ te^{−η}, so x ∈ S_{te^{−η}}(h). □

Proposition 2 (Sample-wise GCh obeys a random sandwich width). For the implemented gate ξ^sw_γ in (30),

∥log ξ^sw_γ∥_∞ ≤ |γ| osc(ψ),  osc(ψ) := max_{x∈U} ψ(x) − min_{x∈U} ψ(x).  (61)

Hence Proposition 1 applies with η = |γ| osc(ψ).

Proof. Write

log ξ^sw_γ(x) = γψ(x) − c(ψ),  c(ψ) = log((1/|U|) Σ_{y∈U} e^{γψ(y)}).

Since the logarithm of an average of exponentials lies between the minimum and maximum exponent,

min_{y∈U} γψ(y) ≤ c(ψ) ≤ max_{y∈U} γψ(y).

Therefore each quantity γψ(x) − c(ψ) lies in the interval [−|γ| osc(ψ), |γ| osc(ψ)], which is exactly (61). □

To contrast this with hard masking, recall that for any finite graph H the first Betti number equals the cycle rank

β₁(H) = |E(H)| − |V(H)| + β₀(H).

Theorem E.3 (Cycle-topology fracture under inverted dropout). Let the underlying graph be the cycle C_n, let q ∈ (0, 1], let h ≡ c > 0 on its vertices, and choose a threshold t ∈ (0, c/q). Under inverted dropout, m_q = b_q / q with b(v) i.i.d. ~ Bernoulli(q), define h̃ := m_q ⊙ h. Then S_t(h̃) is exactly the induced subgraph on the kept vertices, and

Pr[β₁(S_t(h̃)) = 1] = qⁿ,  Pr[β₁(S_t(h̃)) = 0] = 1 − qⁿ.  (62)

Equivalently, the loop topology is destroyed with probability 1 − qⁿ.

Proof. Because t < c/q, a vertex belongs to S_t(h̃) if and only if it is kept. Thus S_t(h̃) is the induced subgraph on the kept vertices. If all n vertices are kept, this induced subgraph is the full cycle C_n, so β₁ = 1. If at least one vertex is dropped, the induced subgraph is a disjoint union of paths, hence acyclic, and therefore has β₁ = 0. The all-kept event has probability qⁿ. □

Deep learning interpretation. Closed contours, ring-like activation patterns, and loop-shaped superlevel regions are idealized but meaningful models of semantic geometry. Theorem E.3 says that hard deletion is topologically brittle: a single dropped segment breaks the loop. This makes precise the intuition that binary masks can fracture spatial semantics rather than perturb them smoothly.

Remark 4 (Why this is relevant to DropBlock). DropBlock changes the spatial correlation of the zero set, not the basic hard-mask mechanism. Any block pattern that removes a connected arc from a loop-like superlevel set also destroys its cycle rank. The theorem above therefore isolates the essential topological failure mode already present in the binary-masking mechanism itself.
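A short Monte Carlo illustration of Theorem E.3 (our sketch; the function name `surviving_cycle_rank` is ours): on C_n with a constant map and inverted dropout, the loop survives only when every vertex is kept, so the empirical survival frequency should match qⁿ.

```python
# Monte Carlo check of Theorem E.3: Pr[beta_1 = 1] = q^n on the cycle C_n.
import numpy as np

def surviving_cycle_rank(n, q, rng):
    kept = rng.random(n) < q      # i.i.d. Bernoulli(q) keep mask over vertices
    if kept.all():
        return 1                  # full cycle C_n survives: beta_1 = 1
    return 0                      # any deletion leaves a union of paths: beta_1 = 0

rng = np.random.default_rng(0)
n, q, trials = 12, 0.9, 100_000
hits = sum(surviving_cycle_rank(n, q, rng) for _ in range(trials))
print(hits / trials, q ** n)      # both approximately 0.28
```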
F. Additional experimental details and results

F.1. Experimental setup

Datasets. We evaluate on ImageNet-1k (Deng et al., 2009) (1.28M training images, 50k validation images, 1000 classes). To measure robustness under common corruptions, we additionally use ImageNet-C (Hendrycks and Dietterich, 2019). Our main corruption-shift analysis reports averages over a selected subset of 7 corruption types, each averaged across severities 1–5. To complement the large-scale setting with a fast fine-grained pilot, we also evaluate on Oxford-IIIT Pet (Parkhi et al., 2012), a 37-class benchmark whose labels are sensitive to shape cues.

Architectures and injection sites. Our primary backbone is ResNet-50 (He et al., 2016). We inject the spatial gate at selected residual stages (L2/L3/L4) to study depth-dependent effects. ResNet-50 is also the fairest setting for comparisons to Dropout and DropBlock because those methods are naturally defined on convolutional feature grids. Since GCh acts on a 2D grid wherever such a representation exists, we further evaluate on Swin-T (Liu et al., 2021) to test transfer beyond the primary CNN regime.

Training protocols and reproducibility. Main ImageNet protocol: unless otherwise specified, ImageNet models are trained from scratch for 270 epochs using SGD with momentum 0.9 and weight decay 10⁻⁴, with learning-rate schedules held fixed across methods. Clean ImageNet metrics are reported on the standard validation set, and ImageNet-C metrics are computed from the corresponding trained checkpoints. Controlled ablation protocol: for extensive multi-seed comparisons and strength sweeps, we also use a shorter matched-budget protocol described in the table captions. Within each controlled study, all hyperparameters aside from the noise mechanism are held fixed. Oxford-IIIT Pets pilot: for the fine-grained pilot, we train a ResNet-18 from scratch for 40 epochs using Adam (lr = 10⁻³), 224 × 224 inputs, and standard normalization. Results are reported as mean ± std over 3 seeds.

Baselines. We compare GCh against Dropout (Srivastava et al., 2014), DropBlock (Ghiasi et al., 2018), additive i.i.d. Gaussian noise, and additive correlated Gaussian noise. The Gaussian baselines are energy-matched to GCh to separate the effect of structure from the effect of raw magnitude.

Metrics. We report Top-1 accuracy, negative log-likelihood (NLL), and expected calibration error (ECE). These metrics capture both predictive performance and probabilistic reliability.
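For concreteness, a minimal sketch (ours) of the equal-width binned ECE estimator; the 15-bin convention follows F.6, and the top-label formulation is an assumption about the paper's exact evaluation code.

```python
# Minimal top-label ECE sketch with equal-width confidence bins.
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """probs: (N, K) softmax outputs; labels: (N,) integer class labels."""
    conf = probs.max(axis=1)                      # top-label confidence
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)         # samples falling in this bin
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap              # weight gap by bin mass
    return ece
```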
F.2. Best vs. latest checkpoint on clean ImageNet

We compare late-stage (L4) injection at two evaluation points: the best checkpoint observed during training and the final checkpoint.

Protocol note. These tables come from the full-recipe single-run checkpoint protocol and are therefore complementary to, rather than numerically comparable with, the 3-seed controlled table in the main text. They summarize a separate evaluation slice of the same late-stage setting, whereas the main-text causal-control table reports the matched 3-seed protocol used for mechanism isolation. They are included to show that the late-stage reliability pattern is not specific to one checkpointing convention.

Method        Top-1 ↑   NLL ↓   ECE ↓
None          76.41     0.96    0.082
DropBlock     75.86     0.99    0.085
GCh (ours)    76.23     0.95    0.076

Table 3. ImageNet (clean), best checkpoint, L4 injection.

Method        Top-1 ↑   NLL ↓   ECE ↓
None          76.35     0.97    0.084
DropBlock     75.21     1.04    0.091
GCh (ours)    76.18     0.96    0.078

Table 4. ImageNet (clean), latest checkpoint / final epoch, L4 injection.

Takeaway. Across both checkpoints, GCh improves reliability relative to both the no-noise baseline and DropBlock while remaining close to the baseline in Top-1 accuracy. The pattern is especially informative at the final epoch, where DropBlock shows a pronounced late-stage degradation whereas GCh does not.

F.3. Injection depth (L2/L3/L4)

We apply the same GCh mechanism at different residual stages under the controlled 3-seed protocol.

Stage      Top-1 ↑          NLL ↓            ECE ↓
L2-early   0.767 ± 0.001    0.918 ± 0.003    0.031 ± 0.001
L3-mid     0.765 ± 0.001    0.925 ± 0.006    0.029 ± 0.002
L4-late    0.764 ± 0.001    0.934 ± 0.004    0.020 ± 0.001

Table 5. Injection depth ablation for our method at fixed strength γ = 0.1 (3 seeds). Mean ± std.

Takeaway. Depth induces a clear trade-off: earlier injection favors Top-1 and NLL, while later injection gives the strongest ECE gains. This is why the main paper emphasizes the late-stage regime when discussing reliability.

F.4. Strength sensitivity

We sweep γ ∈ {0.03, 0.07, 0.10, 0.18, 0.27, 0.35} at L4 injection under the controlled protocol.

γ      Top-1 ↑          NLL ↓            ECE ↓
0.03   0.766 ± 0.002    0.926 ± 0.006    0.027 ± 0.001
0.07   0.765 ± 0.002    0.928 ± 0.009    0.021 ± 0.001
0.10   0.764 ± 0.001    0.934 ± 0.004    0.020 ± 0.001
0.18   0.759 ± 0.001    1.005 ± 0.006    0.076 ± 0.002
0.27   0.667 ± 0.034    1.880 ± 0.201    0.316 ± 0.017
0.35   0.164 ± 0.017    5.204 ± 0.119    0.149 ± 0.017

Table 6. γ sweep at late-stage injection (each γ retrained). Mean ± std over completed seeds (n = 3 for all shown).

Takeaway. There is a robust small-to-moderate regime in which GCh preserves accuracy and improves reliability. Very large γ values cause the expected breakdown from excessive multiplicative perturbation.

F.5. ImageNet-C full results

Evaluation protocol and aggregation. We report Top-1 accuracy, NLL, and ECE on a selected 7-corruption subset of ImageNet-C. For each corruption type, metrics are averaged across severities 1–5; the reported aggregate numbers then average across the selected corruption types. All ImageNet-C metrics are computed from the same checkpoints used in the clean ImageNet tables, and we report mean ± std over three seeds.

Reading the tables. Table 7 gives the main late-stage comparison at g = 0.1. Table 8 isolates the effect of injection depth under shift. Table 9 reports strength sensitivity under shift. Table 10 provides the corruption-wise breakdown.

Method          g     Top-1 ↑          NLL ↓            ECE ↓
None            0     0.382 ± 0.003    3.400 ± 0.030    0.105 ± 0.002
Dropout         0.1   0.384 ± 0.003    3.317 ± 0.020    0.084 ± 0.001
DropBlock       0.1   0.390 ± 0.009    3.300 ± 0.100    0.093 ± 0.004
IID Gaussian    0.1   0.388 ± 0.003    3.316 ± 0.044    0.096 ± 0.006
Corr. Gaussian  0.1   0.386 ± 0.002    3.340 ± 0.028    0.103 ± 0.010
GCh (ours)      0.1   0.383 ± 0.005    3.287 ± 0.064    0.056 ± 0.005

Table 7. ImageNet-C overall (mean over 7 corruptions × 5 severities) for late-stage injection. Mean ± std over 3 seeds.

Overall comparison. Note that Dropout/DropBlock use their standard hyperparameters (drop probability p = 0.1) rather than an energy-matched Gaussian strength, while IID/Corr./GCh use matched injected-energy strength for fair mechanism isolation.
Main robustness takeaway. Table 7 shows that our method substantially improves reliability under distribution shift: compared to the no-noise baseline, ECE drops from 0.105 to 0.056 (a 46% relative reduction), while NLL also improves. Crucially, the correlated additive Gaussian baseline ("Corr. Gaussian") remains close to the no-noise baseline in ECE, supporting our central message that correlation alone is not sufficient; the improvement emerges only when correlation is coupled with a positive, mean-one multiplicative gate (our GCh).

Seed variability (Corr. Gaussian). We also observe noticeably larger seed-to-seed variability for the correlated additive Gaussian baseline, suggesting that correlation without multiplicative gating can lead to less consistent behavior under shift.

Stage   Top-1 ↑          NLL ↓            ECE ↓
early   0.390 ± 0.002    3.314 ± 0.018    0.096 ± 0.003
mid     0.393 ± 0.003    3.230 ± 0.037    0.088 ± 0.004
late    0.383 ± 0.005    3.287 ± 0.064    0.056 ± 0.005

Table 8. Stage-wise ablation on ImageNet-C for GCh (ours) with g = 0.1. Mean ± std over 3 seeds.

Depth under shift: late-stage helps calibration. Table 8 demonstrates a consistent depth effect on ImageNet-C: moving injection from early → mid → late monotonically improves calibration (ECE) under shift. This aligns with the clean-data depth trade-off: late-stage injection perturbs higher-level semantic representations in a structured manner, yielding stronger reliability gains at comparable accuracy.

g      Top-1 ↑          NLL ↓            ECE ↓
0.03   0.388 ± 0.001    3.317 ± 0.045    0.091 ± 0.007
0.07   0.388 ± 0.007    3.304 ± 0.032    0.075 ± 0.006
0.10   0.383 ± 0.005    3.287 ± 0.064    0.056 ± 0.005
0.18   0.385 ± 0.004    3.277 ± 0.048    0.073 ± 0.001
0.27   0.276 ± 0.038    4.266 ± 0.228    0.169 ± 0.018
0.35   0.050 ± 0.003    6.187 ± 0.030    0.043 ± 0.004

Table 9. Strength sweep on ImageNet-C for GCh (late-stage injection). Mean ± std over 3 seeds.

Strength sensitivity and failure modes. Table 9 reveals a clear operating regime under shift: moderate strengths (g ≈ 0.07–0.18) retain accuracy while improving reliability, with the best ECE attained around g = 0.1 in this sweep. At overly large strengths (g ≥ 0.27), accuracy and NLL collapse sharply, indicating destabilization under excessive multiplicative perturbation. Notably, ECE can appear deceptively small at extreme collapse (e.g., g = 0.35) because the model becomes severely underconfident; we therefore treat this region as a failure mode rather than a favorable calibration outcome.

Corruption         Acc (None)       Acc (Ours)       ECE (None)       ECE (Ours)
defocus blur       0.402 ± 0.003    0.398 ± 0.003    0.038 ± 0.002    0.039 ± 0.002
gaussian noise     0.308 ± 0.004    0.310 ± 0.012    0.156 ± 0.011    0.076 ± 0.011
glass blur         0.273 ± 0.002    0.263 ± 0.004    0.122 ± 0.003    0.075 ± 0.002
jpeg compression   0.547 ± 0.002    0.550 ± 0.008    0.059 ± 0.004    0.026 ± 0.001
motion blur        0.396 ± 0.006    0.400 ± 0.004    0.089 ± 0.006    0.049 ± 0.004
pixelate           0.462 ± 0.011    0.467 ± 0.006    0.096 ± 0.004    0.047 ± 0.008
shot noise         0.289 ± 0.005    0.293 ± 0.011    0.171 ± 0.015    0.083 ± 0.015

Table 10. ImageNet-C corruption-wise breakdown (severity-averaged) comparing None vs GCh (ours) at late-stage g = 0.1. Mean ± std over 3 seeds.
Corruption-wise breakdown: which corruptions benefit most. Table 10 shows that the reliability gains are broad-based across corruption types: the largest ECE reductions occur on noise-type corruptions (gaussian/shot) and compression/pixelation (jpeg/pixelate), while motion blur also improves. Defocus blur is largely unchanged in ECE, indicating that not all shifts benefit equally; this heterogeneity is informative and consistent with the notion that our mechanism primarily targets structured uncertainty arising from local stochastic perturbations rather than all blur kernels uniformly.

F.6. Oxford-IIIT Pets (fine-grained) results

Protocol (multi-seed, selection on validation only). We follow a multi-seed protocol on Oxford-IIIT Pets with a fixed train/val split (from trainval). For each method/seed, we select the checkpoint that minimizes validation NLL, using validation ECE as a tie-break when NLLs are nearly identical, and then report test Top-1, NLL, and ECE for the selected checkpoint. ECE is computed with 15 equal-width confidence bins.

Strength parameter g across methods. To align notation with the main paper, we use a single "strength" symbol g across all methods. For GCh (ours), g is the multiplicative-gate strength used in the exponential gate. For Dropout/DropBlock, g corresponds to the drop probability p (here p = 0.1); for "None" we set g = 0.

Takeaway. On this fine-grained dataset, GCh achieves the best (lowest) NLL and ECE at essentially unchanged accuracy relative to the strong baselines, indicating that the reliability gains are not specific to ImageNet/ImageNet-C.

Method              g     Top-1 ↑            NLL ↓              ECE ↓
None                0     0.9009 ± 0.0044    0.3669 ± 0.0016    0.0325 ± 0.0044
Dropout (p = 0.1)   0.1   0.8957 ± 0.0007    0.4246 ± 0.0131    0.0503 ± 0.0007
DropBlock (p = 0.1) 0.1   0.9002 ± 0.0027    0.3669 ± 0.0007    0.0317 ± 0.0053
GCh (ours)          0.1   0.9010 ± 0.0023    0.3627 ± 0.0039    0.0302 ± 0.0037

Table 11. Oxford-IIIT Pets test performance (ResNet-18, 224 × 224, late-stage injection; mean ± std over 3 seeds). The strength parameter g is shared across rows for compactness; for Dropout/DropBlock it corresponds to the drop probability p (see text).

g     Top-1 ↑            NLL ↓              ECE ↓
0.1   0.9010 ± 0.0023    0.3627 ± 0.0039    0.0302 ± 0.0037
0.5   0.8989 ± 0.0031    0.3660 ± 0.0038    0.0314 ± 0.0053
1.0   0.8978 ± 0.0030    0.3661 ± 0.0024    0.0323 ± 0.0037

Table 12. GCh strength sweep on Oxford-IIIT Pets (test; mean ± std over 3 seeds). As in ImageNet/ImageNet-C, moderate strengths are best; larger strengths do not yield further gains.

G. Additional theory details

G.1. Operational meaning of Theorem 5.4

Theorem 5.4 gives the canonical exact construction:

1. sample a GFF log-field ψ with covariance (βL_U)^{-1};
2. exponentiate with exact Wick normalization to obtain a positive, mean-one multiplicative gate.

Once the operator, gauge convention, and energy budget are fixed, the remaining reported strength parameter in the experiments is γ.

G.2. Mean-one normalization choices

The exact mean-one gate requires the variance map v(x) = Var(ψ(x)) = C(x, x). On a finite Dirichlet grid, v(x) is not spatially constant. Two practical normalization choices are standard; a code sketch of both appears after the implementation notes in G.3.

1. Exact Wick normalization. Precompute

v(i, j) = (1/β) Σ_{k=1}^{H} Σ_{ℓ=1}^{W} ẽ_{k,ℓ}(i, j)² / λ_{k,ℓ},

where ẽ_{k,ℓ} denotes the orthonormal sine basis. Then use

ξ^ex_γ(x) = exp(γψ(x) − (γ²/2) v(x)).

This is the exact object in the theory and preserves E[ξ^ex_γ(x)] = 1 sitewise.
2. Sample-wise mean-one normalization. Compute G(x) = exp(γψ(x)) and normalize by the spatial mean:

ξ^sw_γ(x) = G(x) / ((1/|U|) Σ_{y∈U} G(y)).

This guarantees unit spatial average per sample and is often convenient in optimization. It is the implementation used in the main experiments unless otherwise noted.

G.3. Implementation notes

1. A single gate may be shared across channels, or independent gates may be sampled channel-wise.
2. In multi-resolution architectures, the gate can be sampled directly at the feature resolution of the target layer, or sampled at a base resolution and then resized.
3. At inference time, noise can be disabled by setting ξ ≡ 1.
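The sketch below (ours) instantiates the two normalization choices of G.2; the dense computation of the variance map via the orthonormal sine basis is written for clarity rather than speed, and the function names are hypothetical.

```python
# Sketch of G.2's two normalization choices on the H x W Dirichlet grid.
import numpy as np

def variance_map(H, W, beta):
    """v(i,j) = (1/beta) * sum_{k,l} e_{k,l}(i,j)^2 / lambda_{k,l},
    with the orthonormal sine basis e_k(i) = sqrt(2/(N+1)) sin(pi k i/(N+1))."""
    i = np.arange(1, H + 1); j = np.arange(1, W + 1)
    k = np.arange(1, H + 1); l = np.arange(1, W + 1)
    Sh = np.sqrt(2 / (H + 1)) * np.sin(np.pi * np.outer(i, k) / (H + 1))  # (H,H)
    Sw = np.sqrt(2 / (W + 1)) * np.sin(np.pi * np.outer(j, l) / (W + 1))  # (W,W)
    lam = (4 * np.sin(np.pi * k[:, None] / (2 * (H + 1))) ** 2
           + 4 * np.sin(np.pi * l[None, :] / (2 * (W + 1))) ** 2)          # (H,W)
    # Contract over (k, l): rows index sites i, columns index sites j.
    return (Sh ** 2) @ (1.0 / lam) @ (Sw ** 2).T / beta                    # (H,W)

def normalized_gates(psi, gamma, v):
    """Return (exact Wick gate, sample-wise mean-one gate) for a log-field psi."""
    xi_exact = np.exp(gamma * psi - 0.5 * gamma ** 2 * v)  # E[xi(x)] = 1 sitewise
    G = np.exp(gamma * psi)
    xi_sw = G / G.mean()                                   # unit spatial average
    return xi_exact, xi_sw
```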