Intrinsic regularization effect in Bayesian nonlinear regression scaled by observed data

In trinsic regularization eﬀect in Ba y esian nonlinear regression scaled by observ ed data Satoru T okuda, 1, 2, 3 , ∗ Kenji Nagata , 3 and Ma sato O k ada 3, 4 1 R ese ar ch Institute for Information T e chnolo gy, Kyushu U niversity, Kasuga, F ukuoka 816-8580 , Jap an 2 Mathematics for Ad vanc e d Materials – Op en Innovation L ab or atory, AIST, c/o A dvanc e d Institute for Materials R ese ar ch, T ohoku U niversity, Sendai, Miyagi, 980-8577, Jap an 3 Dep artment of Compl exity Scienc e and Engine ering, The University of T okyo, Kashiwa, Chib a 277-8561, Jap an 4 Center for Materials R ese ar ch by Information Inte gr ation, National Institut e for Mater ials Scienc e, Tsukub a, I b ar aki 305-0047, Jap an (Dated: D ecemb er 7, 2022) Occam’s razor is a guiding principle th at models should b e simple enough to describe observed data. While Ba yesia n mod el selection (BMS) em b o dies it by the intrinsic regularizatio n eﬀ ect (IRE), how observed data scale the IRE has not b een fully understo o d. In the nonlinear regression with conditionally indep end ent observ ations, we show that th e IRE is scaled b y observ ations’ ﬁneness, deﬁned by the amount and quality of observed data. W e introd uce an observ able that quantiﬁes the IRE, referred to as the Bay es sp eciﬁc heat, inspired by th e correspondence betw een statis ti- cal inference and statistical ph ysics. W e derive its scaling relation to observ ations’ ﬁn eness. W e demonstrate t hat the optimal mod el chosen by the BMS changes at critical v alues of observ ations’ ﬁneness, accompanying the IRE’s v ariation. The changes are from choosing a coarse-grained mod el to a ﬁne-grained one as observ ations’ ﬁn eness increases. Our ﬁnd ings expand an understanding of BMS’s typicalit y when observed d ata are insuﬃcient. I. INTRO DUCTION T o descr ibe o bserved data by mathematical mo del is an impo r tant sta g e to understand the physics of the s ystem behind obser ved data. It is desired to build a mo del as simple as p ossible for caricaturing the es sential physics [1 – 5]. A simple mo del is a n equation or a function with fewer parameter s , which sometimes co rresp onds to an eﬀective theory: a spe cial ca s e, an a pproximation, or a co arsening o f the more compr e he ns ive theory . Some attempts r elate the mo del simplicity to a kind of emer genc e as if microsco pic details a re a lmost negligible in macros copic phenomena fro m the viewp oint of information theor y [6 – 9]. They ar e also r elated to what mo del is simple enough for describing obser ved data well. In the context of sta tistics, some criteria for mo del s election, as highlig h ted by the Ak a ike and Bay esian information cr iteria (AIC a nd BIC), s upp or t selecting a simpler mo del as justiﬁcations of Oc c a m’s razor [10, 11]. Bay esian mo del selection (BMS) is utilized to c ho ose a mo del s imple enough to describ e the es s ent ial physics behind observed data [12 – 1 7]. Suppose that the gr ound truth that genera tes the obse r ved data is included in the list of candidate mo dels. There are se veral motiv a tions for employing the BMS. One is that the BMS is consistent [18, 19]. If the candidate mo dels ar e disjoin t, the mo del chosen by the BMS conv erges almost sure ly tow ar d the ground truth as o bserved data incr ease under mild conditions. Another is that the BMS automatically incorp or a tes Occam’s razor [18, 19]. If more than one candidate mo del can ex a ctly express the ground truth, the mo del c hosen b y the BMS conv er ges almost surely toward the simplest one with the fewest par ameters. W e s hould mention that our primary int erest in the BMS is choos ing the simplest mode l that describ es o bserved data w ell rather than verifying one ’s mo del (h yp othesis) based on the data. The simplest mo del for giv en data is sometimes consistent with the ground truth, but not necessarily . W e wan t to clar ify when and how the simplest mo del is the g round truth or other ca ndida tes. Occam’s ra z or in the BMS is em bo die d b y the intrinsic regulariz ation eﬀect (IRE) that prefers simpler models to complex o nes as succe e ding to the BIC [11, 2 0, 21]. In the case of enough obs erved da ta , the singular lear ning theor y has rev ealed that the IRE is explicitly quan tiﬁed b y a birationa l inv aria nt , called the real log c anonical threshold (RLCT) [22 – 24]. How ever, how observed data sca le the IRE is still a c hallenging question. This question includes three issues. Fir s t, the RLCT do es not quantify the dep endence on the amount a nd qualit y of observed data. W e need the IRE’s scaling function that conv erge s toward the RLCT in the limit where they ar e suﬃcient. Second, the RLCT is not an observ able quantit y ca lculated from observed data but a q uantit y theoretically derived from the ground truth, which is prac tica lly unknown. While so me observ able quantities that co nv erg e tow ard the RLCT have bee n introduced [2 5 , 26], they do not explicitly represent the dep endence o n bo th the amo un t and q uality of observed ∗ s.tokuda.a96@m.kyush u- u.ac.jp 2 data. W e also need an observ a ble counterpart o f the sc a ling function. Third, how the IRE aﬀects the BMS in the case of insuﬃcien t observed data has yet to b e fully understo o d. While the BMS is c o nsistent, the mo del chosen by the BMS is not necessa rily the gro und truth if observed data a r e insuﬃcient. W e also need to explain the t ypicality of the BMS for insuﬃcient data by using the scaling function. Here we address these is s ues in the context of the no nlinear regres s ion with conditiona lly indep endent observ ations, which is a prototypical setup in mathematical mo deling. T o quantify the IRE, we in tro duce an observ able quantit y , referred to as the Bayes sp e c iﬁc heat, inspired by the mathematical equiv alence b etw een statistica l inference and statistical physics [2 1, 27 – 2 9]. W e derive its ﬁnite-s ize scaling re lation to observ a tio ns’ ﬁneness, deﬁned by the amount and quality of observed data. W e show the cor resp ondence of the scaling function to the RLCT. W e demonstra te that the mo del chosen b y the BMS changes at critical v a lues of observ a tions’ ﬁneness, aﬀected by v ariatio n in the scaling function. W e also ﬁnd that the changes a r e fro m choo sing a coarse - grained mo del to the ﬁne-grained g round truth as o bserv ations ’ ﬁneness increases. Our ﬁndings c o rresp ond to a typical b ehavior o f the BMS with a v a riation of obs erved data fr o m insuﬃcient to suﬃcient, expanding an under standing of the BMS’s c o nsistency . II. ST A TISTICA L ENSEMBLE OF NONLINEAR REGRESSION W e star t b y deﬁning a statistic al ensemble o f nonlinear reg ression. Let us consider obse rving conditionally indep en- dent r andom v a riables D n := { x i , Y i } n i =1 for x i = x i − 1 + L/ ( n − 1), where x 1 , x n , a nd L := x n − x 1 > 0 ar e ﬁxed. W e assume that Y i ∼ N ( f ( x i ; w ) , β − 1 ), where f , w , and β − 1 > 0 are res pectively reg ression mo del, its parameter s et, and v ariance of o bserv ation no ise as an e mbo diment of the q uality of observed data. Here w e deﬁne obse r v ations’ ﬁneness κ := nβ , whic h is a key quantit y of scaling relations as describ ed la ter. Since w and f are unknown and s ho uld b e estimated fro m D n , we treat them a s rando m elemen ts sub ject to the po sterior proba bilit y distribution p ( w, f | D n , β , µ ) ∝ exp  − κ 2 E n ( w ; f ) − κµF n ( β , f )  p ( w | f ) p ( f ) , (1) where E n ( w ; f ) := 1 n n X i =1 ( Y i − f ( x i ; w )) 2 (2) is the mea n squa r e error, F n ( β , f ) := − 1 κ log Z exp  − κ 2 E n ( w ; f )  p ( w | f ) dw (3) is the Bay es free energy , µ ≥ 0 is an auxiliar y v a riable, p ( w | f ) is an arbitrary prior probability distribution of w ∈ W ⊂ R d , and p ( f ) is that of f ∈ { f K } . If µ = 1, Eq. (1) is derived from Bayes’ theo rem. Throughout this study , w e consider the case µ → ∞ , which corr e spo nds to the empir ical Bayes approach, i.e., f is chosen by minimizing F n ( β , f ) at a given β as the BMS [20, 3 0, 31]. Note that the BIC a nd its generalized version are derived by approximating F n ( β , f ) [11, 25]. While β is given in E q . (1), if β is unknown, it ca n also b e estimated by minimizing κF n ( β , f ) − n (log β − log 2 π ) / 2 [32]. The ov erall setup is the basics o f our analys es in the following sections to cla rify the BMS transitions from choosing a coarse-g rained mo del to the ﬁne-g rained gr ound truth as κ increa s es. II I. BA YES SPECIFIC HEA T W e in tro duce a key quantit y that c hara cterizes the macr ostates in the statistical ensem ble of nonlinear reg ression, which is deﬁned as the ﬂuctuation of E n ( w ; f ) at a certain pa ir of β and f with a g iven D n : C n ( β , f ) =  κ 2  2  h E n ( w ; f ) 2 i − h E n ( w ; f ) i 2  , (4) where h· · · i := R ( · · · ) p ( w | β , f , D n ) dw denotes the a verage over a ll micr ostates of w sub ject to p ( w | β , f , D n ) ∝ exp( − κE n ( w ; f ) / 2) p ( w | f ). Note that Eq. (4) is derived from a mor e gene r al deﬁnition of such a quant ity in statistical inference, c a lled the Bay es sp eciﬁc heat (see App endix A). W e s hould men tion that C n ( β , f ) is an observ a ble that quantiﬁes the IRE as ex plained in the next s ection. This q ua nt ity plays an imp ortant r ole in explaining the mec hanism of the BMS transitions in terms o f the IRE. 3 IV. FINITE-SIZE SCALING RELA TIONS W e hereafter consider the case that Y i ∼ N ( f 0 ( x i ; w 0 ) , β − 1 0 ), where f = f 0 , w = w 0 and β = β 0 are the ground truths such that p ( w 0 | f 0 ) > 0 is satisﬁed. The BMS dep ends on the set D n of n random v aria bles since it is carried out by minimizing F n ( β , f ), which is a function of D n . Bes ides, F n ( β , f ) dep ends on b oth D n and β ra ther than o n κ (see Eqs . (2) and (3)). T o clarify the typicalit y of BMS transitions, not dep ending on the realiza tion of D n but on κ , we take the limit where κ and L are ﬁxe d, whereas n → ∞ . W e derive the ﬁnite-size sca ling relation F n ( β , f ) = F ( κ, f ; w 0 , f 0 ) + R n 2 nβ 0 (5) with the scaling function F ( κ, f ; w 0 , f 0 ) : = − 1 κ log Z exp  − κ 2 E ( w ; f , w 0 , f 0 )  p ( w | f ) dw (6) and the energ y function E ( w ; f , w 0 , f 0 ) := 1 L Z x n x 1 ( f 0 ( x ; w 0 ) − f ( x ; w )) 2 dx, (7) where R n is a random v aria ble sub ject to the chi-square distribution with n degr ee of freedom (see App endix B). Note that F n ( β , f ) is self-av er aging, i.e., F n ( β , f ) = [ F n ( β , f )] holds as n → ∞ , where [ · · · ] := R ( · · · ) Q n i =1 p ( y i | x i , w 0 , β 0 , f 0 ) dy i denotes the av erage o ver all r e a lizations of D n , which are sub ject to p ( y i | x i , w 0 , β 0 , f 0 ) := N ( f 0 ( x i ; w 0 ) , β − 1 0 ). Since R n / (2 nβ 0 ) is indep endent of f , f minimizing F ( κ, f ; w 0 , f 0 ) is e q ual to f minimizing F n ( β , f ) at κ consisting o f a certain pair of β and suﬃciently large n . Under the condition β = β 0 , cor resp onding to the Nishimo ri line [33, 3 4], F ( κ, f ; w 0 , f 0 ) ena bles us to assess the typical b ehavior of statistica l ensemble at a certa in κ = nβ 0 , whic h is indep endent o f the rea lization of D n . Ba s ed on this typicality , we demonstra te that f minimizing F ( κ, f ; w 0 , f 0 ) at any κ is not necessa rily f 0 in the next section. The demonstration c la riﬁes the t ypicality of BMS transitions, independent of the realiza tions of D n , fro m cho o sing a co arse-g rained mo del to the ﬁne- grained gr ound truth as κ increases . W e are interested in expla ining the mechanism of BMS tr ansitions in ter ms of the IRE. T o expla in the mechanism formally , in the a bove limit wher e κ a nd L ar e ﬁxed, wher eas n → ∞ , we als o derive the ﬁnite-size scaling relation C n ( β , f ) = Λ ( κ, f ; w 0 , f 0 ) + O p  max  1 log κ , β β 0 , 1 n √ β 0  (8) with the scaling function Λ( κ, f ; w 0 , f 0 ) : =  κ 2  2  h E ( w ; f , w 0 , f 0 ) 2 i − h E ( w ; f , w 0 , f 0 ) i 2  , (9) where h· · · i conv erges to the a verage ov er p ( w | β , f , D n ) ∝ exp( − κE ( w ; f , w 0 , f 0 ) / 2) p ( w | f ) in the limit we consider (se e App endix C). F r om E q. (8), we ﬁnd that C n ( β , f ) is an obse rv able co un terpart of Λ( κ, f ; w 0 , f 0 ) in the limit we co nsider. Note that C n ( β , f ) is conditio nally self-averaging, i.e., C n ( β , f ) = [ C n ( β , f )] holds a s n → ∞ if β = O ( β 0 / lo g n ). Thus, C n ( β , f ) is not necessarily se lf-av er aging under the condition β = β 0 , i.e., C n ( β 0 , f ) = Λ( nβ 0 , f ; w 0 , f 0 ) + O p (1) holds. W e a ssess the ﬁnite- s ize eﬀect on some particula r ex a mples to v a lidate that the ﬂuctua tion term O p  max  (log κ ) − 1 , β /β 0 , ( n √ β 0 ) − 1  is neglig ible at a ny β in some situations (see App endix D). W e also ﬁnd that Λ( κ, f ; w 0 , f 0 ) is a quantiﬁcation of the IRE in the limit we consider fro m the consider ation below. There is a junction b etw een Eq. (9) and the sing ular lear ning theory [22 – 25, 35, 36]. While the setup is based on independent and iden tically distributed obser v ations as n → ∞ in the related study [37], our setup is based on conditionally indep endent observ ations in the limit w he r e κ and L are ﬁxed, wher eas n → ∞ . Our setup is a natura l extension, whe r e the limits of F ( κ, f ; w 0 , f 0 ) and Λ( κ, f ; w 0 , f 0 ) as κ → ∞ conv erge to the related study’s result. An y direct counterparts of C n ( β , f ) and Λ( κ, f ; w 0 , f 0 ) hav e not b e e n deﬁned in the singular learning theory , while Eq. (8) is related to W a tanab e’s corollar y [25] (see App endices A and C). Thanks to this relation, the limit of Λ ( κ, f ; w 0 , f 0 ) as κ → ∞ is re g arded as the RLCT, which c haracter izes the IRE as the co eﬃcient o f a leading term in the asymptotic expansion of F n ( β , f ) a s n → ∞ [22 – 24] (see Eq. (C23)). Namely , Λ( κ, f ; w 0 , f 0 ) quant iﬁes the IRE not only in the limit n → ∞ but also in the limit wher e κ and L are ﬁxed, whereas n → ∞ . The limit Λ( κ, f ; w 0 , f 0 ) → dim ( w ) / 2 also holds as κ → ∞ if p ( w | D n , β , f ) is reg ular, wher e the BIC is justiﬁed [1 1, 25]. Based on these corresp ondences, we demonstr ate how the IRE is scaled by κ and how it aﬀects the B MS transitions in the next section. 4 V. INTRINSIC REGULARIZA TION EFFECT SCA LED BY OBSER VED DA T A W e demonstr ate the BMS transitio ns from choos ing a coar s e-grained model to the ﬁne-gr ained ground truth, aﬀected by v a riation in the IRE, as κ increases. As a simple example, we n umerically demonstra te that f ∈ { f 0 , f 1 , f 2 } minimizing F ( κ, f ; w 0 , f 0 ) at a n y κ is not necess arily f 0 ∈ { f 0 , f 1 , f 2 } by using a nonlinear reg ression mo del with K Gaussian co mpo ne nts: f K ( x ; w ) = K X k =1 a k exp  − b k 2 ( x − c k ) 2  (10) for K ≥ 1 , where w := { a k , b k , c k } K k =1 is the parameter set. This mo del is regar ded as a kind of radial basis function net works [38]. Note that we deﬁne f 0 ( x ; w ) = 0, where w is a n empty set. W e also derive the analytic expression o f E ( w ; f , w 0 , f 0 ) for this f K (see App endix D). Let us consider a physic al in ter pretation of the statistic al ensemble of nonlinear regress ion with f K as an analogy . Based o n the ma thematical equiv alence be tw een statistical inference and statistical physics, the micr ostate w with the p article numb er K sub ject to p ( w, f | D n , β , µ ) is interpreted as a gr and c anonic al ensemble co nditioned by β as the inverse temp er atu r e , n as the volume , and µ a s the chemic al p otential . In the same manner, w sub ject to p ( w | β , D n , f ) is in terpreted as a c anonic al ensemble , where we a ssume p ( w, f | D n , β , µ ) = p ( w | D n , β , f ) p ( f | D n , β , µ ) to derive Eq.(1). Since b oth statistica l ensembles ar e als o conditioned b y the qu enche d disor der D n , w e discuss the typical macr ostate by mea ns of the tw o av er ages h· · · i and [ · · · ]. F or this purp ose, F ( κ, f ; w 0 , f 0 ) and Λ( κ, f ; w 0 , f 0 ) are reasona ble in the limit where κ and L are ﬁxed, whereas n → ∞ . It should b e emphasiz e d that we do es not co nsider the thermo dynamic limit , i.e., the limit where K /n is ﬁxed, whereas K → ∞ a nd n → ∞ . W e p erfor med an Mon te Ca r lo (MC) sim ulations by using pa rallel temp ering based on the Metrop olis crite- rion [39, 40]. The v a riable κ was discr etized as 400 p oints consis ting of κ = 0 and 39 9 lo garithmically spaced po int s in the interv al [1 0 − 8 , 10 12 ]. W e set the prior probability distribution as p ( w | f K ) = Q K k =1 exp( − a k / 10 − b k / 10 − c 2 k / 50) / (50 0 √ 2 π ). W e sim ulated p ( w | D n , β , f ) ∝ exp ( − κE n ( w ; f ) / 2) p ( w | f ) with a realization of D n to calculate F n ( β , f ) in the same manner a s in our previous work [32]. W e als o simulated p ( w | D n , β , f ) ∝ exp( − κE ( w ; f , w 0 , f 0 ) / 2) p ( w | f ) to calculate F ( κ, f ; w 0 , f 0 ) and Λ( κ, f ; w 0 , f 0 ) at each p oint of κ > 0 in the manner of Bridge sampling [41, 42]. In all the MC simulations, the total MC sweeps w ere 100,0 00 after the burn- in. The er ror bars o f F ( κ, f ; w 0 , f 0 ) and Λ ( κ, f ; w 0 , f 0 ) were calculated by b o otstrap r e s ampling. W e consider tw o c a ses to simulate the discov ery pro cess of the gross and ﬁne str uc tur es in o bs erved data. One is the deg e ne r ate case that is deﬁned as f 0 = f 1 with w 0 = { 10 , 10 , 0 } (Fig. 1a). Another is the splitting case tha t is deﬁned as f 0 = f 2 with w 0 = { 5 , 10 , ( − 1) k × 0 . 25 } 2 k =1 , whic h makes tw o Gaussian comp onents str ongly overlapping (Fig. 1e). In each case, a realiza tion of D n for n = 10 1 is o btained in the presence of obser v ation noise at a certain β 0 . If β 0 is small e no ugh, f minimizing F n ( β , f ) in both cas es a re f 0 , whic h is not co nsistent with f 0 (Figs. 1b and 1f ). If β 0 is larg e to some extent, f minimizing F n ( β , f ) in b oth c a ses are f 1 , which is consistent/inconsistent with f 0 in the degenerate/splitting case (Figs. 1c and 1g). If β 0 is large e nough, f minimizing F n ( β , f ) in the splitting case is f 2 , which is consistent with f 0 (Fig. 1h). They s how that the optimal model chosen by the BMS is not alwa ys co nsistent with the ground truth and imply the t ypical behavior that the optimal model changes a t some critical v alues of obs erv ations’ ﬁneness (rather than mag nitude of o bs erv ation noise); F or to o ro ugh observ ations , the optimal mo del just describ es a ”non-s tructure”. F o r rather r o ugh observ atio ns, the optimal model describ es a ” gross structure”. F or ﬁne obs e rv ations, the optimal mo del des c rib es a ”ﬁne structure”. T o elucidate our implication, we calculate F ( κ, f ; w 0 , f 0 ) and Λ( κ, f ; w 0 , f 0 ) on the Nishimori line κ = nβ 0 for the ab ov e t wo c ases. In b oth cas es ab ov e, the o ptimal mo del f minimizing F ( κ, f ; w 0 , f 0 ) c hanges from f 0 to f 1 around κ = κ c1 (Figs. 2a and 2c), while Λ( κ, f 1 ; w 0 , f 0 ) has a p eak a round κ = κ c1 (Figs. 2b and 2 d). Λ( κ, f 1 ; w 0 , f 0 ) is fa ir ly c onsistent with 0 and 1 . 5 at κ ≪ κ c1 and κ ≫ κ c1 , resp ectively (Fig. 2b) . Corresp onding changes in p ( w | D n , β , f 1 ) are also shown (Fig. 3). While p ( w | D n , β , f 1 ) at κ = κ 1 (Fig. 3a-3c) is fairly consistent with p ( w | f 1 ), p ( w | D n , β , f 1 ) at κ = κ 2 is suﬃciently a pproximated by a Gaussian distributio n whose mean is w 0 (Figs. 3g-3i). p ( w | D n , β , f 1 ) at κ = κ c1 represents the intermediate state (Figs. 3d- 3f ) betw een these t wo states. Only in the splitting case, the optimal f minimizing F ( κ, f ; w 0 , f 0 ) changes from f 1 to f 2 around κ = κ c2 (Fig. 2c), while Λ( κ, f 1 ; w 0 , f 0 ) has a peak ar ound κ = κ c2 (Fig. 2 d). Λ( κ, f 2 ; w 0 , f 0 ) is fairly consistent with 1 . 5 and 3 at κ c1 ≪ κ ≪ κ c2 and κ ≫ κ c2 , resp ectively (Fig. 2d). Corr esp o nding changes in p ( w | D n , β , f 2 ) are also shown (Fig. 4 ). While p ( w | D n , β , f 2 ) at κ = κ 2 is far from a Gaussian distribution (Fig. 4a-4c; see also Appendix D), p ( w | D n , β , f 2 ) at κ = κ 3 is suﬃciently a pproximated b y a bimo dal Ga ussian dis tribution whose mo des ar e symmetric (Fig. 4g- 4i). p ( w | D n , β , f 2 ) at κ = κ c2 represents the intermediate state (Fig. 4d-4f ) b etw een these t wo s ta tes. W e extend the inv estigation to the c a se that is deﬁned as f 0 = f 2 with w 0 = { 5 , 10 , ( − 1) k × δ } 2 k =1 for 0 ≤ δ ≤ 2, where δ = 0 and δ = 0 . 25 r esp ectively corr e spo nd to the degenerate a nd splitting cases ab ov e. The phase diagr am 5 shows that there are three phases describ e d by κ and δ (Fig. 5 ). Three phases ar e bounded b y tw o ridge lines of Λ( κ, f 2 ; w 0 , f 2 ), where these lines merge in δ & 0 . 8. In other words, there is only tw o phases in δ & 0 . 8, while there are three phases 0 < δ . 0 . 8. Note that δ = 0 is a spe c ial case since the case of f 0 = f 2 with w 0 = { 5 , 10 , 0 } 2 k =1 is ident iﬁed with the case of f 0 = f 1 with w 0 = { 1 0 , 10 , 0 } (see Appendix D). The optimal mo del f minimizing F ( κ, f ; w 0 , f 0 ) also changes ar ound the phase b ou ndaries . This corr e s po ndence enables us to interpret three phases as a ” non-structure” , a ”gro s s structur e”, and a ”ﬁne str uc tur e”. Note that the r e g ion in the ”non-structure” pha se, whereas the optimal mo del is f = f 1 , cor resp onds to intermediate state, suc h as shown in Figs. 3d-3f. VI. DISCUSSIONS Our results clearly s how that the optima l mo del chosen by the BMS is not alwa ys consistent with the ground truth but depends o n observ ations’ ﬁneness . The discovery pro cess o f the g ross and ﬁne structur es s hows the c hang es in the optimal model at critical v alues of observ ations’ ﬁneness, accompanying the v ariation in the IRE. Our r esults can also b e understo o d from another v iewpo int . An eﬀectiv e model is not necessa r y to b e consistent with the optimal mo del for describing observed data since it is deduced from the or iginal theo ry independent of the data. If o ne is more conﬁdent o f an eﬀective model than observed data, what mo del is optimal fo r observed data is replaced by what a mount a nd quality o f obser ved data are required to v alidate the model. Our results show that critical v alues of obse r v ations’ ﬁnenes s ar e re quired to do so . F rom this po in t, these v alues can b e rega rded as kinds of limits on an indire ct measur ement , i.e., parameter estimation of the eﬀectiv e mo del. ACKNO WLEDGMENTS The a uthors are grateful to Chihiro H. Nak a jima, K o ji Hukushima, Ko uki Y onag a, Masayuki Ohzeki, Shotaro Ak aho, Sumio W atanab e, T omoyuki Obuchi a nd Y oshiyuki K abashima for v alua ble discuss ions. S.T. was supp orted by JSPS KAKE NHI (No. JP20K 19889 ). M.O. was suppor ted by JST CREST (No. JPMJCR17 6 1), JSP S K AKENHI (No. 25 12000 9), the ”Materia ls Research by Information Integration” Initiative (MI2 I) pro ject o f the Supp ort Progr am for Starting Up Innov a tio n Hub from the Japan Science and T echnology Agency (JST), and the Council fo r Science, T echnology and Innov a tion (CSTI), Cros s-ministerial Strategic I nnov ation Pr omotion Prog ram (SIP), ”Structural Materials for Inno v ation” (F unding ag ency: JST). App endix A: Deﬁni tion of Bay e s sp eciﬁc heat Here, we der ive Eq. (4 ) from a broader p ers p ective of statistical inference including our setup. W e star t by int ro ducing the conditional pr obability de ns it y p ( w | D n , ˜ β , β , f ) : = 1 ˜ Z n ( ˜ β , β , f ) exp − n ˜ β 2 L n ( w ; β , f ) ! p ( w | f ) (A1) with ˜ β ≥ 0 b eing the inverse temp er atur e [23], the empirical lo g loss function L n ( w ; β , f ) : = − 1 n n X i =1 log p ( Y i | x i , w , β , f ) (A2) and the pa rtition function ˜ Z n ( ˜ β , β , f ) : = Z exp − n ˜ β 2 L n ( w ; β , f ) ! p ( w | f ) dw. (A3) Note that Eq . (A1) for ˜ β = 1 is just B ayes’ theorem, where p ( w | D n , 1 , β , f ) and ˜ Z n (1 , β , f ) are the p osterior distribution and marg inal likelihoo d, resp e ctively . If ˜ β → ∞ a nd p ( ˆ w | f ) > 0, then p ( w | D n , ˜ β , β , f ) co nv erg es to δ ( w − ˆ w ), wher e ˆ w is the maximum likelihoo d es timator [23]. Notably , p ( w | D n , ˜ β , β , f ) = p ( w | f ) holds for ˜ β = 0. Here, we deﬁne the speciﬁc heat ˜ C n ( ˜ β , β , f ) : = ∂ h nL n ( w ; β , f ) i ˜ β ∂ ˜ β − 1 = ˜ β 2 ˜ I n ( ˜ β ; β , f ) (A4) 6 with the Fisher information ˜ I n ( ˜ β ; β , f ) : = *  ∂ ∂ ˜ β log p ( w | D n , ˜ β , β , f )  2 + ˜ β = h ( nL n ( w ; β , f )) 2 i ˜ β − h nL n ( w ; β , f ) i 2 ˜ β , (A5) where the av erage h· · · i ˜ β := R ( · · · ) p ( w | D n , ˜ β , β , f ) dw . Consider ing the co nnection betw een statistical inference and statistical physics, h nL n ( w ; β , f ) i ˜ β is the internal energ y and ˜ F n ( ˜ β , β , f ) : = − 1 ˜ β log ˜ Z n ( ˜ β , β , f ) . (A6) is the fre e energ y . Then, we also obtain the relation ˜ C n ( ˜ β , β , f ) = − ˜ β 2 ∂ 2 ( ˜ β ˜ F n ) ∂ ˜ β 2 (A7) as in s tatistical physics. As the Bay es free energ y is deﬁned by ˜ F n (1 , β , f ), we deﬁne the Bayes sp eciﬁc heat as ˜ C n (1 , β , f ), where this deﬁnition c an be applied not only to L n ( w ; β , f ) in the no nlinear regressio n but als o to empirical log loss functions o f any other sta tistical inference setups without lo ss of genera lity . W e should compare the Bayes spec iﬁc heat, esp ecially in the form o f Eq. (A7) , with the learning capacity [4 3], whic h is deﬁned by the second der iv ative of the Bay es free energ y with resp ect to n as an a ppr oximation of the se c ond-order- ﬁnite diﬀerenc e . Notably , the Bay es sp eciﬁc hea t and the lea rning capacity are diﬀerent, as ˜ β and n are diﬀerent. How ever, they a ls o hav e similar ities. W e show their similarities and diﬀerences in App endix C. Now, we consider the sp eciﬁcs of our setup, i.e., the relation L n ( w ; β , f ) = β 2 E n ( w ; f ) − 1 2 log β 2 π . (A8) Then, we obtain the scaling relations ˜ C n ( ˜ β , β , f ) = C n ( ˜ β β , f ) and ˜ I n ( ˜ β ; β , f ) = I n ( ˜ β β ; f ), where the scaling functions are C n ( ˜ β β , f ) : = ( ˜ β β ) 2 I n ( ˜ β β ; f ) (A9) and I n ( ˜ β β ; f ) : = h ( nE n ( w ; f )) 2 i ˜ β − h nE n ( w ; f ) i 2 ˜ β . (A10) Now, we take ˜ β = 1, i.e ˜ β β = β , and then obtain C n ( β , f ) in the form of Eq. (4) as the Bayes sp eciﬁc hea t. App endix B: Deriv ation of scali ng relation on Bay e s free e nergy Here, we show an outline of the deriv ation of Eqs. (5) and (6). By cons idering the no ise additivity , we divided Y i int o the signal and noise, i.e., Y i = f 0 ( x i ; w 0 ) + N i , (B1) where N i ∼ N (0 , β − 1 0 ). Then, we obtained nE n ( w ; f ) = X i s i ( w ; f ) 2 + 2 X i s i ( w ; f ) N i + R n β 0 , (B2) where s i ( w ; f ) := f 0 ( x i ; w 0 ) − f ( x i ; w ) a nd R n := β 0 P i N 2 i . By using Jensen’s inequality , " − lo g Z exp − β 2 X i s i ( w ; f ) 2 + 2 X i s i ( w ; f ) N i !! p ( w | f ) dw # ≥ − log Z exp − β 2 X i s i ( w ; f ) 2 ! p ( w | f ) dw (B3) holds, where the equality holds when P i s i ( w ; f ) N i = 0, which is asymptotically satisﬁed for n → ∞ . Note that 1 n X i s i ( w ; f ) 2 = E ( w ; f , w 0 , f 0 ) (B4) also holds as n → ∞ . Based on these asymptotic b ehaviors, Eqs. (5) and (6) were o btained. 7 App endix C: Deriv ation and v alidation of scali ng rel ation on Bay e s sp eciﬁc heat W e show the deriv ation of Eqs. (8) and (9) in more detail. W e start from a broader p e rsp ective o f statistical inference including our setup. The asymptotic behaviour of the free ener gy has b een obtained [23]: ˜ β ˜ F n ( ˜ β , β , f ) = n ˜ β L n ( w ′ 0 ; β , f ) + λ log n ˜ β + ( m − 1) log lo g n ˜ β + O p ( ˜ β ) (C1) as n → ∞ , where w ′ 0 is w that minimizes the Kullback-Leibler dis ta nce fro m p ( y | x, w 0 , β 0 , f 0 ) to p ( y | x, w, β , f ), λ > 0 is a rational num b er ca lled the real log ca nonical threshold, and m ≥ 1 is a natural num b er. Note that w ′ 0 = w 0 holds if β = β 0 and f = f 0 . By following Eqs. (A7) and (C1), w e obtain ˜ C n ( ˜ β , β , f ) = λ − ( m − 1)  1 log n ˜ β + 1 (log n ˜ β ) 2  + o p  ˜ β 2  (C2) as n → ∞ . If w e tak e ˜ β → ∞ , then ˜ C n ( ˜ β , β , f ) = λ + o p  ˜ β 2  holds; the quan tit y ˜ C n ( ˜ β , β , f ) is not necessar ily self-av eraging . The r elation ˜ C n ( ˜ β , β , f ) = λ holds a s n → ∞ if ˜ β = O (1 / log n ), which corresp onds to the condition shown in W atanab e’s co rollar y [25]: h nL n ( w ; β , f ) i ˜ β 1 − h nL n ( w ; β , f ) i ˜ β 2 ˜ β − 1 1 − ˜ β − 1 2 = λ + O p  1 √ log n  (C3) hold for ˜ β 1 = O (1 / log n ) a nd ˜ β 2 = O (1 / log n ), where ˜ β 1 and ˜ β 2 are pos itiv e v a riables. The r efore, it is recertiﬁed that ˜ C n ( ˜ β , β , f ) = λ holds for n → ∞ if ˜ β = O (1 / log n ), where h nL n ( w ; β , f ) i ˜ β 1 − h nL n ( w ; β , f ) i ˜ β 2 ˜ β − 1 1 − ˜ β − 1 2 = ˜ C n ( ˜ β , β , f ) (C4) hold a s ˜ β 1 → ˜ β and ˜ β 2 → ˜ β . Here, we mention that the expectation of the learning capa cit y over realizations a lso conv er ges to λ as n → ∞ [43]. Howev er , it has not pr ov en that the lear ning capacity as a random v ar iable conv erges tow ar d λ as n → ∞ . The lea r ning capacity as a random v a riable is not applicable for the scaling ana ly sis that provides Eq. (C2). It do e s not a lso provide Eqs. (8) a nd (9) in the limit where κ and L ar e ﬁxed, whereas n → ∞ . Now, we consider the sp eciﬁcs of our setup, i.e., the scaling relatio n C n ( ˜ β β , f ) = λ − ( m − 1)  1 log ˜ β κ + 1 (log ˜ β κ ) 2  + o p  ˜ β 2 β 2  , ( C5) where the corresp ondence of ( ˜ β , β ) a nd ˜ β β is co nsidered. Then, we obtain C n ( β , f ) = λ − ( m − 1)  1 log κ + 1 (log κ ) 2  + o p  β 2  (C6) for ˜ β = 1. W e als o ev alua te the term o p  β 2  more tightly . F ollowing Eqs. (2) a nd (B2), we obtain C n ( β , f ) =  β 2  2   * n X i =1 s i ( w ; f ) 2 ! 2 + − * n X i =1 s i ( w ; f ) 2 + 2   + V n ( β , f ) + ˜ V n ( β , f ) + W n ( β , f ) + ˜ W n ( β , f ) , (C7) where V n ( β , f ) : = β 2 X i 6 = j ( h s i ( w ; f ) s j ( w ; f ) i − h s i ( w ; f ) i h s j ( w ; f ) i ) N i N j , (C8) ˜ V n ( β , f ) : = β 2 n X i =1   s i ( w ; f ) 2  − h s i ( w ; f ) i 2  N 2 i , (C9) 8 W n ( β , f ) : = β 2 X i 6 = j  s i ( w ; f ) 2 s j ( w ; f )  −  s i ( w ; f ) 2  h s j ( w ; f ) i  N j , (C10) and ˜ W n ( β , f ) := β 2 n X i =1  s i ( w ; f ) 3  −  s i ( w ; f ) 2  h s i ( w ; f ) i  N i . (C11) W e ev alua te the o rder of each term in E q. (C7) in the limit where κ a nd L ar e ﬁxed, wherea s n → ∞ . First, w e obtain  β 2  2   * n X i =1 s i ( w ; f ) 2 ! 2 + − * n X i =1 s i ( w ; f ) 2 + 2   = Λ( κ, f ; w 0 , f 0 ) (C12) in the limit that w e consider . Second, we ev aluate the order o f V n ( β , f ) as n → ∞ . Now, we o bta in [ V n ( β , f )] = 0 (C13) as n → ∞ , such that [ h ( h s i ( w ; f ) s j ( w ; f ) i − h s i ( w ; f ) i h s j ( w ; f ) ii ) N i N j ] = ( h s i ( w ; f ) s j ( w ; f ) i − h s i ( w ; f ) i h s j ( w ; f ) i ) [ N i ][ N j ] (C14) is satisﬁed, where [ N i ] = 0. Then, we a lso obtain  V n ( β , f ) 2  − [ V n ( β , f )] 2 = β 4 β 2 0 X i 6 = j ( h s i ( w ; f ) s j ( w ; f ) i − h s i ( w ; f ) i h s j ( w ; f ) i ) 2 = O  β 2 β 2 0  (C15) as n → ∞ , such that h ( h s i ( w ; f ) s j ( w ; f ) i − h s i ( w ; f ) i h s j ( w ; f ) i ) 2 N 2 i N 2 j i = ( h s i ( w ; f ) s j ( w ; f ) i − h s i ( w ; f ) i h s j ( w ; f ) i ) 2 [ N 2 i ][ N 2 j ] (C16) and [( h s i ( w ; f ) s j ( w ; f ) i − h s i ( w ; f ) i h s j ( w ; f ) i ) ( h s k ( w ; f ) s l ( w ; f ) i − h s k ( w ; f ) i h s l ( w ; f ) i ) N i N j N k N l ] = ( h s i ( w ; f ) s j ( w ; f ) i − h s i ( w ; f ) i h s j ( w ; f ) i ) ( h s k ( w ; f ) s l ( w ; f ) i − h s k ( w ; f ) i h s l ( w ; f ) i ) [ N i ][ N j ][ N k ][ N l ] (C17) are satisﬁed, where [ N i ] = 0, [ N 2 i ] = β − 1 0 , h s i ( w ; f ) i = O ( κ − 1 ) and h s i ( w ; f ) s j ( w ; f ) i = O ( κ − 1 ). In summar y , w e obtain V n ( β , f ) = [ V n ( β , f )] + O p  β β 0  = O p  β β 0  (C18) as n → ∞ . In the same wa y , we also obtain ˜ V n ( β , f ) = h ˜ V n ( β , f ) i + O p  β √ nβ 0  (C19) as n → ∞ with h ˜ V n ( β , f ) i = β 2 β 0 n X i =1   s i ( w ; f ) 2  − h s i ( w ; f ) i 2  = O  β β 0  , (C20) 9 where [ N 2 i ] = β − 1 0 ,  N 4 i  = 3 /β 2 0 , h s i ( w ; f ) i = O ( κ − 1 ) and  s i ( w ; f ) 2  = O ( κ − 1 ). This means that ˜ V n is self-averaging; i.e., ˜ V n = [ ˜ V n ] ≥ 0 holds as n → ∞ . F urthermo r e, w e also obtain W n ( β , f ) = O p  1 n √ β 0  , (C21) and ˜ W n ( β , f ) = O p  1 n 3 / 2 √ β 0  (C22) as n → ∞ , wher e [ N i ] = 0, [ N 2 i ] = β − 1 0 , [ W n ] = 0, [ ˜ W n ] = 0, h s i ( w ; f ) i = O ( κ − 1 ),  s i ( w ; f ) 2  = O ( κ − 1 ),  s i ( w ; f ) 2 s j ( w ; f )  = O  κ − 2  , and  s i ( w ; f ) 3  = O  κ − 2  . By consider ing the consistency b etw een Eq s. (C6) and (C7) with E qs. (C12) and (C18-C22), we obtain Eqs. (8) and (9), where Λ( κ, f ; w 0 , f 0 ) = λ as κ → ∞ is the real log canonical threshold. F r om this asymptotic behavior and Eq. (C1), it is fo und out that Λ( κ, f ; w 0 , f 0 ) and C n ( β , f ) repres en t the intrinsic regular ization eﬀect in F n ( β , f ) as κ → ∞ : κF n ( β , f ) = nL n ( w ′ 0 ; β , f ) + λ log n + ( m − 1 ) log log n + n 2 log β 2 π + O p (1) , (C23) where ˜ F n (1 , β , f ) = κF n ( β , f ) − n (lo g β − log 2 π ) / 2 holds [3 2]. Here, we demonstrate the v alidity of Eqs. (8) and (9). By perfo rming the same simulation in Fig. 1 based on 100 diﬀeren t realizations of D n for n = 1 01 tak en fro m iden tical p ( y i | x i , w 0 , β 0 ), we c a lculate C n ( β , f ) for each realization. At any β , the exp e c tation of C n ( β , f 1 ) ov e r the realizations is fairly consis ten t with Λ( κ, f 1 ; w 0 , f 1 ) if β 0 = β 2 (Fig. A1 a), Λ ( κ, f 1 ; w 0 , f 2 ) (Fig. A1 c), and Λ( κ, f 2 ; w 0 , f 2 ) if β 0 = β 3 . The sta ndard devia tions of C n ( β , f ) for the realiza tions is small enoug h (Fig. A2 ); the quantit y C n ( β , f ) is considere d to be self-av er aging at an y β in these cases without the condition β = O ( β 0 / lo g n ). According to Eq. (8), these cas es mean that the exp ectation of C n ( β , f ) corresp onds to the average term Λ ( κ, f ; w 0 , f 0 ), wher e the standard devia tion of C n ( β , f ) corre spo nds to the ﬂuctuation ter m of or der O p  max  (log κ ) − 1 , β /β 0 , ( n √ β 0 ) − 1  . Note that the standar d deviation o f C n ( β , f ) as the ﬂuctuation ter m shows a dep endence on Λ( κ, f ; w 0 , f 0 ) (Fig. A2 ), which is not predicted by Eq. (8). In other cases, the expec tation of C n ( β , f ) is not consisten t with Λ( κ, f ; w 0 , f 0 ) at β & β 0 (Fig. A1 ); ﬁnite-size eﬀects on C n ( β , f ) app ear at β & β 0 , where the ter m of o rder O p ( β /β 0 ) is dominant. According to Eq. (C7) with Eqs. (C18-C22), the exp ectation of C n ( β , f ) corresp onds to Λ( κ, f ; w 0 , f 0 ) + ˜ V n ( β , f ) a s n → ∞ , wher e the standard deviation of C n ( β , f ) corresp onds to V n ( β , f ) + W n ( β , f ) + ˜ W n ( β , f ) as n → ∞ . The expectations and standard devia tions of C n ( β , f ) ar e roughly pr op ortional to √ β if β is larg e enough (Fig. A3 ) a nd also show a rough depe ndence on β 2 0 . Note that C n ( β , f ) is considered to be self-av eraging for β = O ( β 0 / lo g n ) in all c ases. App endix D: Energy function of radial bas i s function netw ork Here, we derive the analytic expr ession of E ( w ; f , w 0 , f 0 ) for f , f 0 ∈ { f K } , wher e f K is deﬁned as Eq. (1 0): E ( w ; f , w 0 , f 0 ) = H ( w 0 , w 0 ) − 2 H ( w 0 , w ) + H ( w , w ) (D1) with H ( w , w ′ ) : = 1 L K X j =1 K ′ X k =1 a j a ′ k s 2 b j + b ′ k exp − ( c j − c ′ k ) 2 2  b − 1 j + b ′− 1 k  ! ˜ H ( b j , c j , b ′ k , c ′ k ) (D2) for w ′ := { a ′ k , b ′ k , c ′ k } K ′ k =1 as K K ′ > 0 and H ( w , w ′ ) := 0 as K K ′ = 0, where ˜ H ( b j , c j , b ′ k , c ′ k ) : = √ π 2 erf r b j + b ′ k 2  x n − c j b j + c ′ k b ′ k b j + b ′ k  ! − √ π 2 erf r b j + b ′ k 2  x 1 − c j b j + c ′ k b ′ k b j + b ′ k  ! . (D3) Note that ˜ H ( b j , c j , b ′ k , c ′ k ) = √ π holds as x 1 → − ∞ and x n → ∞ . W e explain wh y p ( w | D n , β , f 2 ) at κ = κ 2 (Figs.4a-4 c) exhibits such a cor relation in the parameter s. Although the gr ound truth are f 0 = f 2 with w 0 = { 5 , 10 , ( − 1 ) k × 0 . 25 } 2 k =1 , here, we consider the ps eudo-gro und truth ˜ f 0 = f 1 with ˜ w 0 := { a ∗ , b ∗ , c ∗ } and the analytic set ˜ W 0 := { w | f ( x i ; w ) = ˜ f 0 ( x i ; ˜ w 0 ) } = ˜ W 01 ∪ ˜ W 02 ∪ ˜ W 03 (D4) 10 for w := { a k , b k , c k } 2 k =1 , where ˜ W 01 := { w | a k + a k ′ = a ∗ , b k = b k ′ = b ∗ , c k = c k ′ = c ∗ } , (D5) ˜ W 02 := { w | a k = a ∗ , a k ′ = 0 , b k = b ∗ , c k = c ∗ } , (D6) and ˜ W 03 :=  w | a k = a ∗ , b k = b ∗ , c k = c ∗ , exp  − b k ′ 2 ( x i − c k ′ ) 2  = 0  (D7) for i = 1 , · · · , n and k 6 = k ′ . Note that exp( − b k ′ ( x i − c k ′ ) 2 / 2) = 0 holds as b k ′ → ∞ or c k ′ → ± ∞ , where b k ′ 6 = 0 and c k ′ 6 = x i are necessary . Notably , E ( ˜ w ; f 2 , ˜ w 0 , ˜ f 0 ) = 0 holds for ∀ ˜ w ∈ ˜ W 0 ; i.e., p ( w | D n , β , f 2 ) can be re latively large around w = ˜ w for p ( w | f 2 ) > 0. The above scena r io o f the pseudo-g round truth corre s po nds to p ( w | D n , β , f 2 ) at κ = κ 2 (Figs. 4 a-4c); i.e., w ≃ ˜ w 0 is statistically optimal for f = f 2 at κ = κ 2 . This scena r io qualitatively holds at any κ c1 ≪ κ ≪ κ c2 , since Λ( κ, f 2 ; w 0 , f 2 ) is fairly consistent with 1 . 5(= Λ( κ, f 1 ; w 0 , f 2 )). No te that this result is from the situation where f 0 = f 2 ( x ; w 0 ) app ear s almos t as one Ga us sian component (Fig. 1e), where one can o nly capture a ”gross structure” depending on obs erv ations’ ﬁneness. [1] N. D. Goldenfeld, L e ctur es on phase tr ansitions and the r enormalization gr oup (A ddison-W esley , 1992). [2] N. Goldenfeld and L. P . Kadanoﬀ, Simple lessons from complexit y , Science 284 , 87 (1999). [3] R. W. Batterman, Asymptotics and the role of minimal mo d els, The British Journal for the Philosoph y of Science 53 , 21 (2002). [4] R. W. Batterman and C. C. Rice, Minimal mod el exp lanations, Philosophy of Science 81 , 349 (2014). [5] Y. Oono, The nonline ar world: Con c eptual analysis and phenomenolo gy (Sp ringer Science & Business Media, 2012). [6] E. P . H oel, L. Albantakis, and G. T ononi, Quantifying causal emergence shows that macro can b eat micro, Pro ceedings of the National Academy of Sciences 110 , 19790 (2013). [7] B. B. Mach ta, R. Chachra, M. K. T ranstru m , and J. P . Sethna, P arameter space compression underlies emergent theories and pred ictiv e mo dels, Science 342 , 604 (2013). [8] H. H. Mattingly , M. K. T ranstrum, M. C. Ab b ott, and B. B. Mach ta, Maximizing the information learned from ﬁ nite d ata selects a simple model, Proceed in gs of the National Academy of Sciences 115 , 1760 (2018). [9] A. Gordon, A. Banerjee, M. Ko ch-Jan usz, and Z. R in gel, Relev ance in th e renormalization group and in information theory , Physica l Review Letters 126 , 24060 1 (2021). [10] H. Ak aike, A n ew lo ok at the statistical mo del identiﬁcatio n, IEEE t ran sactions on automatic con trol 19 , 716 (1974). [11] G. Sch warz et al. , Estimating the d imension of a mod el, Ann als of statistics 6 , 461 (1978). [12] R. T rotta, Ba yes in t h e sk y: Ba yesian inference and mod el selection in cosmolog y , Contemporary Physics 49 , 71 (2008). [13] R. P . Mann, Bay esian in ference for identifying interactio n rules in moving animal groups, PloS one 6 , e22827 (2011). [14] C. Mark, C. Metzner, L. Lautsc ham, P . L. Strissel, R. Strick, and B. F abry , Bay esian mo del selection for complex dynamic systems, Natu re communications 9 , 1 ( 2018). [15] J. A . V´ azquez, D. T ama yo, A. A . Sen, and I . Quiros, Bay esian model selection on scala r ε -ﬁeld dark energy , Physical Review D 103 , 043506 (2021). [16] S. T okuda, S . S ou ma, K. S ega w a, T. T ak ahashi, Y. An do, T. Nak anishi, and T. Sato, U nv eiling q uasiparticle dyn amics of top ologica l insu lators through bay esian mo delling, Comm unications Physics 4 , 1 (2021). [17] S. T okuda, Y. Kaw achi, M. Sasaki, H. Arak aw a, K. Y amasaki, K. T erasak a, and S. Inagaki, Bay esian inference of ion velocit y d istribution function from laser-induced ﬂuorescence spectra, Scientiﬁc Rep orts 11 , 1 (2021). [18] L. W asserman, Ba yesi an mo del selection and mo del av eraging, Journal of mathematical psychology 44 , 92 (2000). [19] J. O. Berger, L. R. P ericc hi, J. Gh osh, T. Samanta, F. De Santis, J. Berger, and L. Pericc h i, Ob jective bay esian metho ds for mo del selection: I ntroduction and comparison, Lecture Notes-Monograph Series , 135 (2001). [20] D. J. MacKay , Ba yesia n interpolation, Neural computation 4 , 415 (1992). [21] V. Balasubramanian, Statistical inference, o ccam’s razor, and statistical mechanics on the space of probability distribut ions, Neural computation 9 , 349 (1997). [22] S. W atanab e, Algebraic analysis for nonidentiﬁable learning mac hines, Neural Computation 13 , 899 (2001). [23] S. W atanab e, A lgebr aic ge ometry and statistic al le arning the ory , V ol. 25 (Cam bridge Universit y Press, 2009). [24] S. W atanab e, Mathematic al the ory of Bayesian statistics (CRC Press, 2018). [25] S. W atanab e, A widely applicable bay esian information criterion, Journal of Mac hine Learning Researc h 14 , 867 (2013). [26] S. T okuda, K . Nagata, and M. Ok ada, A numerical analysis of learning coeﬃcient in radial basis fun ct ion netw ork, I PSJ Online T ransactions 7 , 20 (2014). [27] E. T. Jaynes, I nformation theory and statistical mechanics, Physical review 106 , 620 (1957). [28] E. T. Jaynes, Pr ob abili ty the ory: The lo gic of scienc e (Cambridge un ive rsity press, 2003). [29] L. Zdeb oro v´ a and F. K rzaka la, Statistical physics of inference: Thresholds and algorithms, Adv ances in Physic s 65 , 453 (2016). 11 [30] B. Efron and C. Morris, S tein’s estimation rule and its comp etitors—an empirical ba yes approach, Journal of the A merican Statistical Association 68 , 117 (1973). [31] H. Ak aike, Likeli ho od and th e bay es p rocedu re, in Sele cte d Pap ers of Hi r otugu Akaike (S pringer, 1998) pp. 309–33 2. [32] S. T okuda, K. Nagata, and M. Okada, S im ultaneous estimation of noise v ariance and num b er of p eaks in bay esian sp ectral deconv olution, Journal of th e Physical So ciet y of Japan 86 , 024001 (2017). [33] H. Nishimori, Exact results and critical prop erties of the ising mod el with comp eting interactions, Journal of Physics C: Solid S tate Physics 13 , 4071 (1980). [34] Y. Iba, The nishimori line and baye sian statistics, Journal of Ph ysics A: Mathematical and General 32 , 3875 (1999). [35] S. W atanab e, Equations of states in singular statistical estimation, Neural N etw orks 23 , 20 (2010). [36] S. W atanab e, Asymptotic equiv alence of ba yes cross va lidation and widely applicable informa tion criterion in singular learning theory , Journal of Mac hine Learning Researc h 11 , 3571 (2010). [37] S. W atanab e et al. , A limit theorem in singular regression problem, in Pr ob abi listic Appr o ach to Ge ometry (Mathematical Society of Japan, 2010) pp . 473–492. [38] D. BR OOMHEAD, Multiv ariable functional in terp olation and adaptive net works, Complex S ystems 2 , 321 ( 1988). [39] C. J. Geyer, Marko v chain monte carlo maximum lik elihoo d (I nterf ace F oundation of N orth America, 1991). [40] K. H ukushima and K. Nemoto, Exc hange monte carlo method and application to spin glass sim ulations, Journal of the Physica l Society of Japan 65 , 1604 (1996). [41] X.-L. Meng and W. H . W ong, Simulating ratios of normalizing constants via a simple identit y: a theoretical ex ploration, Statistica Sinica , 831 (1996). [42] A. Gelman and X.-L. Meng, Sim ulating normalizing constants: F rom importance sampling to bridge sampling to path sampling, Statistical science , 163 (1998). [43] C. H. LaMont and P . A. Wiggins, Corresp ondence betw een thermo dynamics and inference, Physical Review E 99 , 052140 (2019). 12 FIG. 1. Ground truth , observ ed data, and op t imal mod el chosen by Ba yesian model selection. It is demonstrated th at whic h mod el minimizes F n ( β , f ) for each realization of D n (blac k dots) with n = 101, f = f 0 ( x ; w ) (n o line), f = f 1 ( x ; w ) (red solid line), or f = f 2 ( x ; w ) (red dashed line) with tw o Gaussian comp onents (blue solid lines). F or the degenerate case that the ground truth is ( a ) f 0 = f 1 ( x ; w 0 ) with w 0 = { 10 , 10 , 0 } , the minimal mo d el is ( b ) f = f 0 at β 0 = κ 1 /n , ( c ) f = f 1 at β 0 = κ 2 /n , and ( d ) f = f 1 at β 0 = κ 3 /n . F or the splitting case th at th e ground trut h is ( e ) f 0 = f 2 ( x ; w 0 ) with w 0 = { 5 , 10 , ( − 1) k × 0 . 25 } 2 k =1 , the minimal mo del is ( f ) f = f 0 at β 0 = κ 1 /n , ( g ) f = f 1 at β 0 = κ 2 /n , and ( h ) f = f 2 at β 0 = κ 3 /n . 13 FIG. 2. Ba yesia n mod el selection and intrinsi c regularization eﬀect scaled by observ ations’ ﬁ neness. The inequalit y F ( κ, f 0 ; w 0 , f 0 ) < F ( κ, f 1 ; w 0 , f 0 ) holds for κ < κ c1 (region in ligh t blue), F ( κ, f 0 ; w 0 , f 0 ) > F ( κ, f 1 ; w 0 , f 0 ) holds for κ c1 < κ < κ c2 (region in light pink), and F ( κ, f 1 ; w 0 , f 0 ) > F ( κ, f 2 ; w 0 , f 0 ) holds for κ > κ c2 (region in light yel low ). ( a ) Log-log plot of F ( κ, f ; w 0 , f 1 ) for w 0 = { 10 , 10 , 0 } . In set sho ws that the Bay es factor exp( − κ ( F ( κ, f 2 ; w 0 , f 1 ) − F ( κ, f 1 ; w 0 , f 1 ))) < 1 holds for any κ . ( b ) S emi- log plot of Λ( κ, f ; w 0 , f 1 ) for w 0 = { 10 , 10 , 0 } and asymptote dim( w ) / 2 = 1 . 5 (b lac k dot- ted line). ( c ) Log-log p lot of F ( κ, f ; w 0 , f 2 ) for w 0 = { 5 , 10 , ( − 1) k × 0 . 25 } 2 k =1 . I nset shows that the Ba yes factor exp( − κ ( F ( κ, f 2 ; w 0 , f 2 ) − F ( κ, f 1 ; w 0 , f 2 ))) < 1 holds for κ < κ c2 , and exp( − κ ( F ( κ, f 2 ; w 0 , f 2 ) − F ( κ, f 1 ; w 0 , f 2 ))) > 1 holds for κ > κ c2 . ( d ) Semi-log plot of Λ( κ, f ; w 0 , f 2 ) for w 0 = { 5 , 10 , ( − 1) k × 0 . 25 } and asymptote dim( w ) / 2 = 3 (black dotted line). 14 FIG. 3. Change in the p osterior probability distribution from the p rior p robabilit y distribution depend ing on observ ations’ ﬁneness. Histogram of the Monte Carlo sample from p ( w | D n , β , f 1 ) ∝ exp( − κE ( w ; f 1 , w 0 , f 1 ) / 2) p ( w | f 1 ) for w 0 = { 10 , 10 , 0 } at ( a - c ) κ = κ 1 , ( d - f ) κ = κ c1 , and ( g - i ) κ = κ 2 . Eac h ro w corresponds to a marginal distribution; ( a , d , g ) corresponds to p ( a 1 | D n , β , f 1 ), ( b , e , h ) corresponds to p ( b 1 | D n , β , f 1 ), and ( c , f , i ) corresp onds t o p ( c 1 | D n , β , f 1 ). Compare eac h histogram with the marginal d istribution of p ( w | f 1 ) (red dashed line). 15 FIG. 4. Change in the posterior p robabilit y distribution dep ending on observ ations’ ﬁneness and u nderlying parameter corre- lations. Two-dimensional histogram of th e Mon te Carlo sample from p ( w | D n , β , f 2 ) ∝ exp( − κE ( w ; f 2 , w 0 , f 2 ) / 2) p ( w | f 2 ) for w 0 = { 5 , 10 , ( − 1) k × 0 . 25 } 2 k =1 at ( a - c ) κ = κ 2 , ( d - f ) κ = κ c2 , and ( g - i ) κ = κ 3 . Each ro w corresponds to a margi nal distribution; ( a , d , g ) corresponds to p ( a 1 , a 2 | D n , β , f 2 ), ( b , e , h ) corresp onds to p ( b 1 , b 2 | D n , β , f 2 ), and ( c , f , i ) corresp onds to p ( c 1 , c 2 | D n , β , f 2 ). 16 FIG. 5. Phase diagram with respect to observ ations’ ﬁneness and ground truth of parameter. Semi-log p lot of th e peak p ositions of Λ( κ, f 2 ; w 0 , f 2 ) (black dashed and grey solid lines) for w 0 = { 5 , 10 , ( − 1) k × δ } 2 k =1 . The inequality F ( κ, f 0 ; w 0 , f 2 ) < F ( κ, f 1 ; w 0 , f 2 ) holds for κ < κ c1 ( δ ) (region in light blue), F ( κ, f 0 ; w 0 , f 2 ) > F ( κ, f 1 ; w 0 , f 2 ) holds for κ c1 ( δ ) < κ < κ c2 ( δ ) (region in light pink), and F ( κ, f 1 ; w 0 , f 2 ) > F ( κ, f 2 ; w 0 , f 2 ) holds for κ > κ c2 ( δ ) ( region in light yello w). 17 a b c d FIG. A1 . Finite-size eﬀects on th e Ba yes sp eciﬁc h eat. Semi-log plots of Λ( κ, f ; w 0 , f 0 ) (blue circles) and th e exp ectations of C n ( β , f ) ove r realizati ons of D n for β 0 = β 1 (red triangles), β 0 = β 2 (yello w squares) and β 0 = β 3 (purple crosses), where β = κ/n ( β 1 := κ 1 /n , β 2 := κ 2 /n , and β 3 := κ 3 /n ). F or the case that the ground truth is f 0 = f 1 with w 0 = { 10 , 10 , 0 } , ( a ) C n ( β , f 1 ) and ( b ) C n ( β , f 2 ) are resp ectively compared with Λ( κ, f 1 ; w 0 , f 1 ) and Λ( κ, f 2 ; w 0 , f 1 ) shown in Fig. 2b. F or th e case th at the ground tru th is f 0 = f 1 with w 0 = { 5 , 10 , ( − 1) k × 0 . 25 } 2 k =1 , ( c ) C n ( β , f 1 ) and ( d ) C n ( β , f 2 ) are respectively compared with Λ( κ, f 1 ; w 0 , f 2 ) and Λ( κ, f 2 ; w 0 , f 2 ) shown in Fig. 2d . 18 FIG. A2 . Fluctuation of th e Ba yes speciﬁc heat for real izations. Log-log plots of the standard deviation of C n ( β , f ) for realizations of D n for ( β 0 , f 0 ) = ( β 2 , f 1 ) (blue squares for f = f 1 ), ( β 0 , f 0 ) = ( β 2 , f 2 ) (red sq uares for f = f 1 ), and ( β 0 , f 0 ) = ( β 3 , f 2 ) (yello w crosses for f = f 1 and pu rple crosses for f = f 2 ), where each case corresponds to Fig. A1 . 19 a b c FIG. A 3 . Scaling analyses of the Ba yes sp eciﬁc heat. Log-lo g p lots of the exp ectations (blue triangles for f 0 = f 1 and yel low squares for f 0 = f 2 ) of C n ( β , f ), subtracting Λ( κ, f ; w 0 , f 0 ) as the baseline, and the standard dev iations (red triangles for f 0 = f 1 and yello w squares for f 0 = f 2 ) of C n ( β , f ) ov er realizations of D n in the cases of ( a ) Fig. A1 a , ( b ) Fig. A1 b and ( c ) Fig. A1 d .

Intrinsic regularization effect in Bayesian nonlinear regression scaled by observed data

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment