The Rules-and-Facts Model for Simultaneous Generalization and Memorization in Neural Networks



Gabriele Farné¹, Fabrizio Boncoraglio¹, Lenka Zdeborová¹
¹ Statistical Physics of Computation Laboratory, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne

Abstract

A key capability of modern neural networks is their capacity to simultaneously learn underlying rules and memorize specific facts or exceptions. Yet, theoretical understanding of this dual capability remains limited. We introduce the Rules-and-Facts (RAF) model, a minimal solvable setting that enables precise characterization of this phenomenon by bridging two classical lines of work in the statistical physics of learning: the teacher–student framework for generalization and Gardner-style capacity analysis for memorization. In the RAF model, a fraction 1 − ε of training labels is generated by a structured teacher rule, while a fraction ε consists of unstructured facts with random labels. We characterize when the learner can simultaneously recover the underlying rule—allowing generalization to new data—and memorize the unstructured examples. Our results quantify how overparameterization enables the simultaneous realization of these two objectives: sufficient excess capacity supports memorization, while regularization and the choice of kernel or nonlinearity control the allocation of capacity between rule learning and memorization. The RAF model provides a theoretical foundation for understanding how modern neural networks can infer structure while storing rare or non-compressible information.

1 Introduction

A hallmark of modern neural networks is their ability to simultaneously infer structured patterns and store specific exceptions or facts [1–3]. For example, it is desirable to learn a rule for the addition of two numbers rather than storing in the network's weights the full list of possible sums. At the same time, certain facts—such as the name of the capital of France—likely need to be memorized. A language model, for instance, can generalize grammatical rules such as English verb conjugation ("jump" becomes "jumped") while also memorizing irregular forms ("go" becomes "went"). It is perhaps surprising that, unlike biological systems in which reasoning and memory are often attributed to distinct mechanisms or specialized brain regions, see e.g. [4], modern neural networks—such as transformers—perform both functions within a single architecture. Understanding how and when modern neural networks are capable of generalizing and memorizing simultaneously remains a foundational open question in current machine learning, and is the main motivation for the present work. Modern AI systems also exhibit regimes where they produce plausible but incorrect facts; see Appendix E for an illustrative example, including a hallucinated reference to a non-existent paper by Mei, Montanari, and Nguyen. Understanding how generalization can coexist with reliable memorization is, therefore, not only of theoretical interest but also of practical relevance.

Memorization vs Generalization: A Long-Standing Tension

In classical learning theory, generalization and memorization have typically been viewed as competing or even mutually exclusive phenomena, contributing to the long-held view that "generalization begins where memorization ends".
The classical Rademacher-complexity bounds formalize this view by showing that models with insufficient capacity to memorize arbitrary labelings are precisely those that are guaranteed to generalize [5, 6]. A drastic shift in understanding came with recent efforts to explain why modern neural networks can generalize even in highly overparameterized regimes—despite also being able to fit purely random or partially randomized labels [7]. In that work, memorization was primarily used as a stress test for model capacity rather than as a desirable learning objective. An influential line of work has since demonstrated that memorization and generalization are not inherently at odds: overfitting associated with memorization can, in fact, be benign and need not impair a model's ability to generalize [8]. This insight is closely linked to the now broadly studied double-descent phenomenon, in which test error initially worsens and then improves again as the number of model parameters increases beyond the so-called interpolation threshold [9–11]. Certain works explore this interplay even further by showing that memorization of a small subset of rare or atypical examples can, in fact, enhance generalization performance [12, 13]. Despite this shift in perspective, much of the recent literature still treats memorization as something to be tolerated – benign at best, but not intrinsically desirable. In these works, the emphasis remains on understanding how generalization can persist despite the model's capacity to memorize, rather than on settings where memorization is necessary for successful performance.

Separately, memorization has been studied through models such as the perceptron with random labels [14], or in associative memory networks like the Hopfield model [15, 16], which explicitly store patterns but do not generalize beyond them. While these classical setups have been generalized to the context of modern architectures [17–21], they further reinforced the notion of separation between generalization and memorization. Other lines of work treat memorization and generalization as fundamentally antagonistic. In privacy-sensitive applications such as medical diagnostics or text generation, memorization of specific training examples is considered a failure mode—raising concerns about data leakage and the unintended exposure of sensitive information. This has motivated extensive research on auditing, avoiding, or reversing memorization in trained models, particularly in the context of differential privacy and machine unlearning [22–24]. A perspective analogous to memorization being undesirable arises in generative modeling, where memorizing the training set leads to overfitting and degraded sample quality. For instance, diffusion models are expected to learn the underlying data distribution rather than reproduce memorized training samples, and recent works show that diffusion models indeed start to generalize once they fail to memorize [25–27]. These works further reinforce the conventional view that memorization compromises generalization and should be suppressed or mitigated.
Memorization and Generalization Must Go Hand in Hand

In contrast to lines of work that treat memorization as an undesirable—or at best tolerable—side effect, our work is motivated by the observation that, in many realistic tasks, generalization and memorization are not merely compatible, but jointly essential. This is nowhere more evident than in the performance of modern large language models (LLMs), which routinely demonstrate the ability to infer grammatical and semantic structure while also recalling specific factual content. These models simultaneously solve tasks requiring abstract reasoning (e.g., code synthesis, translation, summarization) and factual precision (e.g., recalling scientific knowledge, historical dates, or rare terminology). Rather than viewing memorization as a nuisance or limitation, we treat it as an integral component of the learning objective, on equal footing with generalization. Despite the centrality of this phenomenon in modern AI, there is—remarkably—no theoretical framework to date that captures generalization and memorization simultaneously and allows their interplay to be analyzed in closed form.

The Rules-and-Facts (RAF) Task Model

Modern AI models must both generalize patterns and memorize facts – any complete theory of learning must account for both. To advance toward such a theory, we introduce an analytically tractable framework, which we call the Rules-and-Facts (RAF) model, that enables the study of generalization and memorization as simultaneous and necessary objectives. In the RAF model, each training label is either generated by a structured teacher rule or assigned at random as an isolated fact that must be memorized. A parameter ε ∈ [0, 1] controls the proportion of factual (non-generalizable) examples, with the remaining 1 − ε fraction adhering to the structured rule. This formulation captures the fundamental tension—and potential synergy—between generalization and memorization within a unified theoretical framework.

In order to enable analytical tractability, we consider n samples in d dimensions drawn i.i.d. from a Gaussian distribution, x^µ ∈ R^d ∼ N(0, I_d). The label of each sample is then obtained as follows:

(i) The rules: with probability 1 − ε a sample x^µ receives the teacher label

    y^µ = sign( w_⋆^⊤ x^µ / √d ),    (1)

where w_⋆ ∈ R^d is the vector of teacher weights, the same for all samples. The teacher weights are generated as w_⋆ ∼ N(0, I_d).

(ii) The facts: with probability ε the label is an unstructured exception, y^µ ∼ Rad(1/2) (i.e., a random ±1 label with equal probability).

We consider that n input-label pairs {(x^µ, y^µ)}_{µ=1}^n are observed, but not the teacher vector w_⋆, nor whether a given label comes from the rule or is a fact. We consider the high-dimensional limit where d, n → ∞ while α = n/d = Θ(1) and ε = Θ(1). The learner observes the n input-label pairs and its objective is to recover the underlying rule encoded by w_⋆ while also memorizing the unstructured labels. The two main parameters of the RAF data generative model are hence the fraction of facts to be memorized ε and the sample complexity α.
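As a concrete illustration, the following minimal Python sketch samples a dataset according to the RAF model just described; it is illustrative only, and the function name and interface are our own choices rather than taken from the code released with the paper.

```python
import numpy as np

def sample_raf_dataset(n, d, eps, seed=0):
    """Sample a RAF dataset: Gaussian inputs, teacher-rule labels with probability 1 - eps
    (Eq. (1)), and random +/-1 "fact" labels with probability eps."""
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(d)                    # teacher weights w_star ~ N(0, I_d)
    X = rng.standard_normal((n, d))                    # inputs x^mu ~ N(0, I_d)
    y = np.sign(X @ w_star / np.sqrt(d))               # teacher labels, Eq. (1)
    is_fact = rng.random(n) < eps                      # which samples are unstructured facts
    y[is_fact] = rng.choice([-1.0, 1.0], size=is_fact.sum())
    return X, y, w_star, is_fact

# Example at sample complexity alpha = n/d = 2 and fraction of facts eps = 0.1
X, y, w_star, is_fact = sample_raf_dataset(n=2000, d=1000, eps=0.1)
```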
The RAF model bridges two classical paradigms in the statistical physics of learning: the teacher–student model, which describes the inference of structured rules (recovered when ε = 0) [28, 29], and Gardner-style capacity analysis, which quantifies the ability to memorize arbitrary data (recovered when ε = 1) [14]. In this sense, the RAF model unifies and extends these classical settings. To our knowledge, this is the first analytically solvable model that captures and quantifies the desirable coexistence of generalization and memorization within a unified theoretical setting. We also stress upfront that our aim with this work is not to model the full complexity of real-world language or reasoning tasks, but to isolate and analyze the minimal ingredients required for a study of simultaneous generalization and memorization.

The analyzed neural networks

A central question, then, is under what conditions a learner can succeed at both tasks – learning the underlying rule and memorizing the unstructured facts – and how this ability depends on structural properties such as model architecture or regularization. As suggested by existing results, successful memorization of arbitrary facts requires a model with sufficiently many parameters. The simplest overparameterized models that remain analytically tractable in the high-dimensional regime are kernel regression and the related random features regression [30]. By analyzing their performance on the RAF model, we can quantitatively characterize the interplay between generalization (achieved by learning the rule) and memorization of the facts. We now describe the neural network models used in our analysis and define the metrics employed to evaluate both generalization and memorization performance.

For a given loss function ℓ and regularization strength λ > 0, the learner model is trained on the dataset D = {(x^µ, y^µ)}_{µ=1}^n by minimizing the so-called empirical risk over weights w,

    L(w; D) = Σ_{µ=1}^n ℓ( y^µ, f_w(x^µ) ) + (λ/2) r(w).    (2)

The function f_w encodes the architecture of the learner model, and r is the regularization. We study three of the most basic learners: a linear classifier, considered as a baseline, and two more expressive models, random features and kernel regression. In particular:

• Linear model, aka single-layer perceptron, where

    f_w(x) = (1/√d) w · x,    r(w) = ‖w‖²₂,    (3)

  with w ∈ R^d.

• Random features regression, where

    f_w(x) = (1/√p) w · σ( (1/√d) F^⊤ x ),    r(w) = ‖w‖²₂,    (4)

  with w ∈ R^p, F ∈ R^{d×p} a fixed matrix with elements F_ia ∼ N(0, 1) i.i.d., and σ: R → R a non-linear activation function applied component-wise. In this work, we consider a range of possible activation functions, for example erf(·), sign(·), or ReLU(·).

• Kernel regression. For a symmetric positive semi-definite kernel K: R^d × R^d → R, we have

    f_w(x) = Σ_{ν=1}^n w_ν K(x, x^ν),    r(w) = Σ_{µ,ν=1}^n w_µ w_ν K(x^µ, x^ν),    (5)

  where x^µ, x^ν ∈ R^d are two training samples. In this work, we restrict ourselves to a class of dot-product kernels obeying a few technical assumptions described in Appendix A.5.2. These assumptions, however, cover many of the standard kernels used in practice. The kernel used will be crucial for assessing the estimator's capacity to simultaneously memorize and generalize.
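For concreteness, here is a minimal sketch of the kernel-regression learner (5) trained by empirical risk minimization (2) with the square loss, for which the minimizer takes the standard closed form ŵ = (K + λI)^{-1} y; the normalization ρ ≈ x·x′/d for high-dimensional Gaussian inputs and all names are our own simplifications, not the paper's released implementation.

```python
import numpy as np

def gram_matrix(X1, X2, kfun):
    """Gram matrix of a dot-product kernel K(rho); for high-dimensional Gaussian inputs
    we use rho ~ x.x'/d as the normalized inner product."""
    return kfun(X1 @ X2.T / X1.shape[1])

def fit_krr(X, y, kfun, lam):
    """Empirical risk minimization (2) for the kernel learner (5) with the square loss:
    the minimizer has the standard closed form w_hat = (K + lam I)^{-1} y."""
    K = gram_matrix(X, X, kfun)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict_krr(X_train, w_hat, X_new, kfun):
    """Predictor f_w(x) = sum_nu w_nu K(x, x^nu) of Eq. (5)."""
    return gram_matrix(X_new, X_train, kfun) @ w_hat

# Example kernel: the truncated dot-product kernel K(rho) = mu1^2 rho + mustar^2 rho^2
# used later in Eq. (32); the (mu1, mustar) values here are arbitrary.
kfun = lambda rho: 0.8**2 * rho + 0.6**2 * rho**2
```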
A classical result of [30] establishes a relation between kernel regression (5) and random features in the large-width limit p → ∞ of (4). In particular, it shows that in the large-width limit, random features regression with non-linearity σ leads to kernel regression with the kernel

    K(x^µ, x^ν) = E_v[ σ( v^⊤ x^µ / √d ) σ( v^⊤ x^ν / √d ) ],    (6)

where x^µ, x^ν ∈ R^d are two training samples and v ∈ R^d is sampled in the same way as the columns of F. The converse does not hold in general: not every kernel admits a representation via the random-features construction used here. This is why we focus on the class of dot-product kernels that do. We will use the relation between kernel regression and random features in their large-width limit to treat them together. From the random-features viewpoint, kernel regression corresponds to the infinite-width (hence maximally overparameterized) limit.

Regarding the loss, we will consider the two most commonly used for classification with kernel methods:

    ℓ_square(y, z) = (1/2)(y − z)²,    ℓ_hinge(y, z) = max(0, 1 − yz).    (7)

In particular, kernel regression with square loss is often referred to as kernel ridge regression (KRR), and kernel classification with hinge loss is called the support vector machine (SVM) [31]. Because we restrict our attention to convex losses, the empirical risk (2) admits a unique minimizer, which we refer to as the empirical risk minimizer (ERM) and denote by ŵ:

    ŵ(D) = argmin_w L(w; D) = argmin_w [ Σ_{µ=1}^n ℓ( y^µ, f_w(x^µ) ) + (λ/2) r(w) ].    (8)

Evaluation quantities

The performance of all the models is then evaluated on two downstream tasks.

• Factual recall task: when the learned model is shown again a training sample whose label was random, it should return that label. By requiring good performance on this task, we model the ability to recall facts.

• Teacher-rule task: the model should be able to apply the teacher rule to previously unseen input samples. By requiring this, we model the ability to apply the learned rule in new contexts.

We are seeking a model that is able to perform well on both the factual recall and teacher-rule tasks, which is equivalent to memorizing and generalizing at the same time. This also clarifies that the fact-labels cannot be treated merely as noise in a standard teacher–student setting. If they were simply noise, the objective would be to ignore them. In contrast, in the RAF model, performance on the factual recall task is explicitly required. This makes the RAF model conceptually distinct from previously studied settings that do not simultaneously treat both the factual recall and teacher-rule tasks.

To quantify the performance on the factual recall and teacher-rule tasks, we introduce the memorization and generalization errors. We consider the training dataset as the disjoint union of two sets, namely D = {(x^µ, y^µ)}_{µ∈I_mem} ∪ {(x^µ, y^µ)}_{µ∈I_gen}, where I_mem and I_gen denote the indices of the randomly-labeled and the learnable samples, respectively. The memorization error is then defined as the average fraction of fact-labels in the dataset that are misclassified by ŵ:

    E^(n)_mem = (1/|I_mem|) Σ_{µ∈I_mem} 1[ y^µ ≠ sign( f_ŵ(x^µ) ) ],    (9)

where 1 denotes the indicator function. Let x_new ∼ N(0, I_d) be a new input and y_new be the label attributed to it by the teacher rule as in Eq. (1).
The generalization error of the point estimator ŵ is then defined as

    E^(n)_gen = E_{x_new, y_new}[ 1( y_new ≠ sign( f_ŵ(x_new) ) ) ].    (10)

The high-dimensional limit

In general, the performance of a neural network depends on the realization of the dataset D = {(x^µ, y^µ)}_{µ=1}^n and the realization of the random features F. However, in the present setting, the quantities of interest are self-averaging in the high-dimensional limit where

    n, d, p → ∞,    α = n/d = Θ(1),    κ = p/d = Θ(1).    (11)

The parameter α is commonly referred to as the sample complexity of the problem, while κ plays the role of the overparameterization ratio of the random feature architecture in Eq. (4). All the other hyperparameters, such as ε or λ, are kept of order one. The kernel regression limit is reached when κ → ∞.

The RAF model is inspired by two classical paradigms in the statistical physics of learning: the teacher–student model [28, 29] and Gardner-style capacity analysis [14]. Both these models were analyzed in the high-dimensional limit for their generalization and memorization capability using the replica method. Random-feature regression/classification and kernel regression in high dimensions were characterized by [32–35] for high-dimensional teacher-student data models, but not for the specific structure of labels as in the RAF model, and without focus on the interplay between generalization and memorization. Methodologically, we borrow from the above works to characterize the high-dimensional limit of the memorization and generalization errors as a function of the various parameters: the fraction of facts ε, the sample complexity α, the regularization strength λ, the form of the loss ℓ, the overparameterization ratio κ, as well as the activation function σ in the random features and the specific form of the kernel in the kernel regression.

Key take-away

A central mechanism emerging from our analysis is that, in overparameterized models, different components of the representation can be allocated to distinct objectives. In particular, the kernel geometry effectively separates a component aligned with the teacher rule from components that can be used to interpolate unstructured labels. As we will show, this decomposition – captured by the parameters µ1 and µ⋆ (defined in Eq. (30)) – controls how capacity is distributed between generalization and memorization, and thereby enables their simultaneous realization. This perspective provides a simple interpretation of the role of overparameterization: excess capacity is not merely used to fit arbitrary data, but can be selectively allocated so as to preserve alignment with the underlying rule while storing non-compressible information. In this sense, the coexistence of generalization and memorization is not paradoxical, but a direct consequence of how capacity is organized in overparameterized models.

2 Characterization of the performance in high dimensions

In this section, we present a summary of the key equations that describe the high-dimensional performance of the learners introduced in Section 1, when trained on data generated according to the RAF model. Specifically, we characterize the memorization and generalization errors defined in Eq. (9) and Eq. (10) in terms of a set of low-dimensional equations. The derivation follows approaches similar to those of prior works and is therefore deferred to the Appendices.
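Before turning to the asymptotic characterization, the evaluation quantities in Eqs. (9)–(10) can be sanity-checked at finite size by combining the two sketches given earlier; the snippet below is only an illustrative Monte Carlo estimate, and all names are ours.

```python
import numpy as np

def empirical_errors(X, y, is_fact, w_star, w_hat, kfun, n_test=5000, seed=1):
    """Finite-size Monte Carlo estimates of the memorization error (9) and the
    generalization error (10) for a fitted kernel predictor."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Eq. (9): fraction of fact labels misclassified on the training set
    yhat_facts = np.sign(predict_krr(X, w_hat, X[is_fact], kfun))
    e_mem = np.mean(yhat_facts != y[is_fact])
    # Eq. (10): teacher labels on fresh Gaussian inputs
    X_new = rng.standard_normal((n_test, d))
    y_new = np.sign(X_new @ w_star / np.sqrt(d))
    e_gen = np.mean(np.sign(predict_krr(X, w_hat, X_new, kfun)) != y_new)
    return e_mem, e_gen

X, y, w_star, is_fact = sample_raf_dataset(n=2000, d=1000, eps=0.1)
w_hat = fit_krr(X, y, kfun, lam=1e-3)
print(empirical_errors(X, y, is_fact, w_star, w_hat, kfun))
```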
We begin by stating the Bayes-optimal generalization error, which serves as a baseline, and then proceed to the characterization of the ERM solution defined in Eq. (8).

2.1 Bayes-optimal generalization error

In the Bayes-optimal (BO) setting, we assume that the learner has access to the true data-generating model, namely a Gaussian prior on the teacher weights, w_⋆ ∼ N(0, I_d), together with the mixed "rules-and-facts" output channel

    P_out^⋆(y | z) = ε [ (1/2) δ(y + 1) + (1/2) δ(y − 1) ] + (1 − ε) δ( y − sign(z) ),    (12)

where z = w_⋆ · x / √d and x ∼ N(0, I_d). What the learner does not have access to is the realization of the teacher weights w_⋆, nor the information whether the label of a given training sample was drawn from the teacher rule or at random. Given a new input x_new ∼ N(0, I_d), the posterior predictive distribution of the corresponding teacher label is

    P( y_new | x_new, D ) = ∫ dw δ( y_new − sign( w · x_new / √d ) ) P(w | D),    (13)

where P(w | D) is the posterior of the RAF model. The Bayes-optimal classifier, which minimizes the error defined in Eq. (10), predicts the label that maximizes the posterior predictive probability, namely

    ŷ_BO(x_new; D) = argmax_{y ∈ {−1, +1}} P( y | x_new, D ).    (14)

The Bayes-optimal generalization error is then

    E^{BO,(n)}_gen = E[ 1( y_new ≠ ŷ_BO(x_new; D) ) ],    (15)

where the expectation is taken with respect to the joint distribution of the training set D and the fresh test pair (x_new, y_new). In the limit n, d → ∞ with α = n/d and ε fixed, the generalization error in (15) admits a closed-form expression and is given by

    E^BO_gen = lim_{n→∞} E^{BO,(n)}_gen = (1/π) arccos( √q_b ),    (16)

where q_b ∈ [0, 1] is the solution of the system of two coupled scalar equations

    q_b = q̂_b / (1 + q̂_b),    (17)

    q̂_b = [ 4α(1−ε)² / ( (2π)^{3/2} (1 − q_b) ) ] ∫_R exp( −(1 + q_b) ξ² / (2(1 − q_b)) ) / [ 1 + (1−ε) erf( √q_b ξ / √(2(1 − q_b)) ) ] dξ.    (18)

The order parameter q_b is referred to as the Bayes-optimal overlap with the ground-truth weights, and q̂_b ∈ [0, ∞) is its conjugate parameter. Eqs. (17)–(18) are obtained by plugging the output channel (12) into the general equations derived in [36], see Appendix B.1 for more details. The Bayes-optimal error in (16) provides the information-theoretically best performance in learning the teacher rule in the RAF data model, and thus gives a lower bound for the generalization error of any other estimator, including the ones we are going to discuss in the following. For each (α, ε), the system (17)–(18) admits a unique solution (q_b, q̂_b), and this solution determines the Bayes-optimal generalization error via (16). In the limiting case ε = 0, the equations above reduce to those of the Bayes-optimal generalization of the teacher–student perceptron with spherical weights studied in [29, 36, 37], while ε = 1 yields the random-guessing estimator, with q_b = 0 and E^BO_gen = 1/2. In the large sample complexity limit α → ∞, the Bayes-optimal generalization error decays according to the power law

    E^BO_gen ≃_{α→∞} C_BO(ε) / α,    (19)

where

    C_BO(ε) = √(2π) / ( 2(1−ε)² J(ε) ),    J(ε) = ∫_R dt e^{−t²} / ( 1 + (1−ε) erf( t/√2 ) ),    (20)

for all 0 ≤ ε < 1. The derivation of Eq. (19) is provided in Appendix C.3.1.
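The scalar system (17)–(18) is straightforward to solve numerically. The sketch below is a direct transcription of the equations above into a damped fixed-point iteration; the quadrature range, damping, iteration count, and function names are arbitrary choices of ours.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import erf

def bayes_optimal_gen_error(alpha, eps, iters=2000, damping=0.5):
    """Damped fixed-point iteration of Eqs. (17)-(18), returning the
    Bayes-optimal generalization error of Eq. (16)."""
    qb = 0.1
    for _ in range(iters):
        pref = 4 * alpha * (1 - eps) ** 2 / ((2 * np.pi) ** 1.5 * (1 - qb))
        integrand = lambda xi: np.exp(-(1 + qb) * xi ** 2 / (2 * (1 - qb))) / (
            1 + (1 - eps) * erf(np.sqrt(qb) * xi / np.sqrt(2 * (1 - qb))))
        qb_hat = pref * quad(integrand, -20, 20)[0]
        qb = damping * qb_hat / (1 + qb_hat) + (1 - damping) * qb
    return np.arccos(np.sqrt(qb)) / np.pi

# Example: eps = 0.1 and alpha = 2/(1 - eps), cf. the baseline quoted with Fig. 1
print(bayes_optimal_gen_error(alpha=2 / 0.9, eps=0.1))
```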
2.2 Performance of the perceptron, random features, and kernel regression

In this subsection, we provide the asymptotic expressions of the memorization and generalization errors, defined in Eqs. (9)–(10), for the three estimators that we consider in our work. In the high-dimensional limit it is possible to characterize these quantities in terms of a set of low-dimensional state equations. Here we present a formulation that encompasses random features in the large-width limit κ → ∞ (large overparameterization), namely kernel regression, and the linear model as a particular case of the latter. This formulation follows from the low-dimensional equations for random features, which are given in Appendix A.4, as their full expression is less concise. The full derivation, which is based on the replica method, is presented in Appendix A and closely follows [33]. The conclusion of the derivation is that, for any convex loss ℓ, the generalization error and the memorization error of the linear perceptron (3), random features (4) and the kernel method (5), evaluated at the unique global minimizer of (8), admit the following closed-form expressions in the high-dimensional limit (11).

The generalization error (10) is given by

    E_gen = lim_{n→∞} E^(n)_gen = (1/π) arccos( m / √q ).    (21)

The memorization error (9) is given by

    E_mem = lim_{n→∞} E^(n)_mem = (1/2) E_{ξ∼N(0,1)}[ θ( P^V_{ℓ(−1,·)}(√q ξ) ) + θ( −P^V_{ℓ(1,·)}(√q ξ) ) ],    (22)

where θ(·) is the Heaviside step function. The function P^V_{ℓ(y,·)}(ω), with y ∈ {−1, +1} and ω = √q ξ, ξ ∼ N(0, 1), denotes the loss-dependent proximal operator

    P^V_{ℓ(y,·)}(ω) = argmin_{z ∈ R} { ℓ(y, z) + (z − ω)² / (2V) }.    (23)

In the case of kernel regression, the three scalar order parameters m, q, V entering the errors in (21) and (22) are given self-consistently by the following set of six equations (see Appendices A.5 and B.2):

    m = µ1 m̂ / (λ + V̂),    q = µ1² ( m̂² + q̂ ) / (λ + V̂)²,    V = µ1² / (λ + V̂) + µ⋆² / λ,    (24)

for m, q, V, and

    m̂ = µ1 α(1−ε) √( q / (2π(q − m²)) ) ∫_R (dξ/√(2π)) e^{−qξ²/(2(q−m²))} [ f_out(1, √q ξ, V) − f_out(−1, √q ξ, V) ],

    q̂ = (µ1²α/2) ∫_R (dξ/√(2π)) e^{−ξ²/2} [ f_out(−1, √q ξ, V)² + f_out(1, √q ξ, V)² ]
        + (µ1²α(1−ε)/2) ∫_R (dξ/√(2π)) e^{−ξ²/2} erf( mξ / √(2(q−m²)) ) [ f_out(1, √q ξ, V)² − f_out(−1, √q ξ, V)² ],

    V̂ = −(µ1²α/2) ∫_R (dξ/√(2π)) e^{−ξ²/2} [ ∂_ω f_out(−1, √q ξ, V) + ∂_ω f_out(1, √q ξ, V) ]
        − (µ1²α(1−ε)/2) ∫_R (dξ/√(2π)) e^{−ξ²/2} erf( mξ / √(2(q−m²)) ) [ ∂_ω f_out(1, √q ξ, V) − ∂_ω f_out(−1, √q ξ, V) ],    (25)

for the conjugate variables m̂, q̂, V̂. The function f_out(y, ω, V) is commonly referred to as the output-channel denoising function, and is a loss-dependent quantity defined as

    f_out(y, ω, V) = ( P^V_{ℓ(y,·)}(ω) − ω ) / V,    (26)

with P^V_{ℓ(y,·)}(ω) defined in (23), while ∂_ω f_out(y, ω, V) is its partial derivative. The kernel enters Eqs. (24)–(25) through the two parameters µ1, µ⋆ ∈ R. For inputs x^µ, x^ν drawn uniformly on the sphere of radius √d, the kernel associated with the Gaussian random features learner in Eq. (6) is a dot-product kernel, namely

    K(x^µ, x^ν) = K(ρ) = E_{(u,v) ∼ N(0, Σ_ρ(x^µ, x^ν))}[ σ(u) σ(v) ],    (27)

where

    Σ_ρ = ( 1  ρ ; ρ  1 ),    (28)
with ρ denoting the normalized inner product of the inputs x^µ, x^ν:

    ρ(x^µ, x^ν) := (x^µ)^⊤ x^ν / ( ‖x^µ‖ ‖x^ν‖ ).    (29)

The parameters µ1, µ⋆ are then linked to the dot-product kernel (27) by the following identities (see Appendix A.5.2):

    µ1² = K′(0),    µ⋆² = K(1) − K(0) − K′(0).    (30)

For high-dimensional Gaussian independent samples, as we consider in this work, the above argument still holds and one has ρ(x^µ, x^ν) = O(d^{−1/2}) → 0, while for coincident arguments ρ = 1. Consequently, the generalization and memorization errors are governed by the local behavior of the kernel around ρ = 0 and ρ = 1. This observation highlights the central role of the two parameters µ1 and µ⋆ in quantifying the trade-off between rule learning and fact memorization in the RAF model. In particular, µ1 and µ⋆ admit an interpretation in terms of the Hermite expansion of the activation σ (equivalently, the polynomial expansion of the kernel; see Appendix A.5.2): µ1 is the first Hermite coefficient, capturing the linear component of σ(·) and thus governing rule learning and generalization, whereas µ⋆ aggregates the higher-order contributions, encoding nonlinear components that allow for memorization.¹

Also, note that for µ1 = 1 and µ⋆ = 0, Eqs. (24)–(25) coincide with the state equations describing the performance of linear regression or, equivalently, of the single-layer perceptron learner in (3) (see also [36]). This is consistent with the fact that this choice of parameters corresponds to the linear kernel K(ρ) = ρ.

The order parameters m and q admit a clear interpretation in terms of the weights ŵ solving the ERM problem in random features regression. In this setting, they are given by

    m = (µ1/d) (Fŵ)^⊤ w_⋆,    q = (µ1²/d) ‖Fŵ‖²₂ + (µ⋆²/p) ‖ŵ‖²₂,    (31)

and concentrate in the high-dimensional limit (11). When κ → ∞, these quantities characterize the asymptotic performance of kernel regression, as well as that of the linear model in the special case µ1 = 1, µ⋆ = 0.

It is worth emphasizing that the equations illustrated above depend on the kernel only through the two parameters µ1, µ⋆, and not on the full sequence of expansion coefficients {a_ℓ}_{ℓ≥0}. This reduction allows us to parametrize a broad class of kernels using only these two effective parameters. In particular, for any prescribed pair µ1, µ⋆, one can construct an admissible dot-product kernel whose expansion coefficients reproduce these values. Importantly, this construction is not unique: many different kernels share the same pair µ1, µ⋆ and are therefore equivalent as far as our analysis is concerned. As a consequence, for the purposes of the asymptotic analysis, it is sufficient to consider an arbitrary representative within this equivalence class. For concreteness, we choose a simple truncated kernel of the form

    K(ρ) = µ1² ρ + µ⋆² ρ²,    (32)

for which a₁ = µ1², a₂ = µ⋆², and all higher-order coefficients are set to zero. Other choices (e.g., involving higher-order terms) would lead to the same asymptotic predictions.

¹ The notation µ⋆ is taken from previous works [32]; the ⋆ does not reflect any relation to the way we denote the teacher weights w_⋆.
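To make this interpretation concrete, the following sketch estimates µ1 and µ⋆ for a given activation σ directly from the Gaussian expectations behind Eqs. (27) and (30); the resulting values can be compared against the closed forms listed in Eq. (33) just below. The implementation choices (quadrature range, names) are ours.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import erf

def mu_parameters(sigma):
    """Numerical estimate of mu_1 and mu_star in Eq. (30) for the kernel (27) induced
    by an activation sigma, via Gaussian expectations: K'(0) = (E[xi sigma(xi)])^2,
    K(0) = (E[sigma(xi)])^2 and K(1) = E[sigma(xi)^2] with xi ~ N(0,1)."""
    w = lambda xi: np.exp(-xi ** 2 / 2) / np.sqrt(2 * np.pi)   # standard Gaussian weight
    c0 = quad(lambda xi: sigma(xi) * w(xi), -10, 10)[0]        # zeroth Hermite coefficient
    c1 = quad(lambda xi: xi * sigma(xi) * w(xi), -10, 10)[0]   # first Hermite coefficient
    k1 = quad(lambda xi: sigma(xi) ** 2 * w(xi), -10, 10)[0]   # K(1)
    mu1 = abs(c1)
    mustar = np.sqrt(max(k1 - c0 ** 2 - c1 ** 2, 0.0))         # higher-order contributions
    return mu1, mustar

# Compare with the closed forms listed in Eq. (33) just below
for name, sigma in [("erf", erf), ("sign", np.sign), ("relu", lambda x: np.maximum(x, 0.0))]:
    print(name, mu_parameters(sigma))
```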
For illustrative purposes, we report the following kernels together with their corresponding parameters µ1, µ⋆:

    K_erf(ρ) = (2/π) arcsin( 2ρ/3 ),    µ1 = 2/√(3π),    µ⋆ = √( (2/π) arcsin(2/3) − 4/(3π) ),
    K_sign(ρ) = (2/π) arcsin(ρ),        µ1 = √(2/π),     µ⋆ = √( 1 − 2/π ),
    K_ReLU(ρ) = (1/(2π)) [ √(1 − ρ²) + (π − arccos ρ) ρ ],    µ1 = 1/2,    µ⋆ = √( (1/2)(1/2 − 1/π) ).    (33)

A table of other examples of kernels for which our theory applies is presented in Appendix D.1.

Square loss

For the square loss ℓ_square(y, z) = (1/2)(y − z)², the expression for the denoising function in Eq. (26) and its derivative are given by (see Appendix B.3)

    f^square_out(y, ω, V) = (y − ω)/(1 + V),    ∂_ω f^square_out(y, ω, V) = −1/(1 + V).    (34)

This leads to a drastic simplification of the three equations for m̂, q̂, V̂:

    m̂ = µ1 α(1−ε) √(2/π) / (1 + V),    q̂ = µ1² α [ 1 + q − 2√(2/π)(1−ε) m ] / (1 + V)²,    V̂ = µ1² α / (1 + V).    (35)

Furthermore, the expression of the memorization error in Eq. (22) can also be simplified as

    E^square_mem = (1/2) erfc( V / √(2q) ),    (36)

where erfc(x) = 1 − erf(x). For the hinge loss, the memorization error takes the same functional form. The corresponding denoising function and order-parameter equations are provided in Appendix B.4.

3 Results and their implications

Results in this section can be reproduced using the codes provided in https://github.com/SPOC-group/RAF.

3.1 The classical trade-off between generalization and memorization

We start by discussing the performance of the linear model, aka single-layer perceptron, in terms of its ability to achieve low generalization and memorization error. In Fig. 1 we plot the parametric curve λ ↦ ( E_gen(λ), E_mem(λ) ), for fixed representative values of (α, ε), obtained with the square (left) and hinge (right) loss. The curve corresponding to the linear perceptron is depicted by a full gray line. We see how the regularization λ orchestrates the trade-off between memorization and generalization. As λ → 0+ (marked with a black triangle), the memorization error is the lowest (limiting values derived in Appendix C.1) and gets monotonically worse as the regularization λ increases, up to its worst value at λ → ∞ (marked with a black square), which is derived in Appendix C.2. On the other hand, the generalization error achieves its minimum at an intermediate value of the regularization λ_opt and increases again beyond this point. For the setting of Fig. 1, we see that the hinge loss provides a lower memorization error at small regularization.

The sample complexity α below which, at λ → 0+, the memorization error is zero depends on the considered loss. For the square loss, this is below the interpolation threshold α^square_interp. = 1 for all values of 0 < ε ≤ 1 (see derivation in Appendix C.1.1). For the hinge loss, the threshold below which the memorization error at λ → 0+ is zero is found by solving the self-consistent equation

    1 = α^hinge_interp.(ε) [ 1/2 − ((1−ε)/π) arctan( α^hinge_interp.(ε)(1−ε)/π ) ],    (37)

as derived in Appendix C.1.1 for 0 < ε ≤ 1. In particular, we have α^hinge_interp.(ε = 1) = 2 [14], and as ε → 0+ it diverges as α^hinge_interp.(ε) ≃ [2π²/(3ε)]^{1/3}.

Figure 1: Generalization–memorization trade-off induced by regularization for the RAF model at fraction of facts ε = 0.1 and sample complexity α = 2.0/(1−ε). Left: KRR, square loss. Right: SVM, hinge loss. Each plot shows the parametric curve λ ↦ ( E_gen(λ), E_mem(λ) ) for different models: a linear perceptron and kernel regressions with the kernels corresponding to erf and ReLU in Eq. (33), together with the Bayes-optimal generalization baseline E^BO_gen = 0.2008 (bold black dashed vertical line). For the square loss, the optimal generalization error is the same, E^{opt,square}_gen = 0.2084, for all three depicted models (thin gray dotted vertical line). Endpoints correspond to λ → 0+ (triangle) and λ → +∞ (square). For the hinge loss, the minimum test errors are E^{opt,hinge}_gen = 0.2094 (perceptron), 0.2068 (erf), and 0.2031 (ReLU).
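For the square loss, the state equations above are simple enough to iterate numerically; the following sketch solves Eqs. (24) and (35) by damped fixed-point iteration and evaluates Eqs. (21) and (36), which is one way theory curves like those in Fig. 1 can be traced by sweeping λ. Initialization, damping, and function names are our own choices, not taken from the linked repository.

```python
import numpy as np
from scipy.special import erfc

def raf_krr_errors(alpha, eps, lam, mu1, mustar, iters=5000, damping=0.5):
    """Damped fixed-point iteration of the square-loss state equations (24) and (35),
    returning (E_gen, E_mem) from Eqs. (21) and (36) for kernel ridge regression."""
    a = (1 - eps) * np.sqrt(2 / np.pi)
    m, q, V = 0.1, 0.5, 1.0
    for _ in range(iters):
        m_hat = mu1 * alpha * a / (1 + V)
        q_hat = mu1 ** 2 * alpha * (1 + q - 2 * a * m) / (1 + V) ** 2
        V_hat = mu1 ** 2 * alpha / (1 + V)
        m_new = mu1 * m_hat / (lam + V_hat)
        q_new = mu1 ** 2 * (m_hat ** 2 + q_hat) / (lam + V_hat) ** 2
        V_new = mu1 ** 2 / (lam + V_hat) + mustar ** 2 / lam
        m, q, V = (damping * m_new + (1 - damping) * m,
                   damping * q_new + (1 - damping) * q,
                   damping * V_new + (1 - damping) * V)
    e_gen = np.arccos(np.clip(m / np.sqrt(q), -1.0, 1.0)) / np.pi
    e_mem = 0.5 * erfc(V / np.sqrt(2 * q))
    return e_gen, e_mem

# Sweep the regularization to trace a trade-off curve as in Fig. 1, here for the ReLU kernel
mu1_relu, mustar_relu = 0.5, np.sqrt(0.25 - 0.5 / np.pi)
for lam in [1e-3, 1e-2, 1e-1, 1.0]:
    print(lam, raf_krr_errors(alpha=2 / 0.9, eps=0.1, lam=lam, mu1=mu1_relu, mustar=mustar_relu))
```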
For the sample complexity in Fig. 1, we do not get zero memorization error for the square loss, but we do for the hinge loss. Importantly, for the single-layer perceptron model, there is no value of the regularization λ where both memorization and generalization are simultaneously reasonably low. This is rather intuitive; suppose we had access to the teacher's weight vector w_⋆: the memorization error would then be 1/2 on average, since the teacher's labels are uncorrelated with the random labels representing the facts. The single-layer perceptron thus presents an inevitable trade-off between generalization and memorization, exactly in line with the traditional learning-theory paradigm that generalization begins where memorization ends.

3.2 The power of overparameterization

More expressive models, i.e., in general those having more trainable parameters, will have a larger capacity to memorize samples. The key question is whether such memorization can be achieved without sacrificing generalization. To start investigating this question, we also plot in Fig. 1 the parametric curve λ ↦ ( E_gen(λ), E_mem(λ) ) for two representative kernel regressions, one corresponding to the infinitely overparameterized random features with erf activation (in orange), the other to ReLU activation (in green). The corresponding kernels are given in Eq. (33).

The results show that overparameterization dramatically improves the set of simultaneously low errors (E_gen, E_mem) achievable by varying the regularization λ. The corresponding curves in Fig. 1 demonstrate that approaching λ → 0+ does not necessarily entail a catastrophic loss of generalization: for a wide range of λ, kernel regression can simultaneously attain small E_mem and a generalization error that remains low. In other words, overparameterization creates a regime of useful (and in the RAF model, necessary) benign overfitting: the model uses excess capacity to memorize the unstructured fraction of the data while preserving a representation aligned with the teacher rule. This observation is conceptually distinct from much of the benign-overfitting literature [8, 32], where memorization is typically treated as a by-product to be tolerated. In the RAF model, memorization is part of the task objective: the learner is expected to store the factual subset. In Fig. 1 we see that, in particular, the ReLU kernel enables regimes in which the model fits the random labels nearly perfectly while maintaining good generalization.
The erf kernel does not achieve good memorization and generalization simultaneously. This motivates the follow-up question of how the generalization–memorization trade-off depends on the kernel, i.e., the two parameters µ1, µ⋆, and on the data, via the parameters ε, α. We investigate this question thoroughly in the next section.

Figure 2: Finite-width random features vs. kernel limit. Parametric memorization–generalization trade-off curves λ ↦ ( E_gen(λ), E_mem(λ) ) for increasing model widths κ, showing convergence to the κ → ∞ kernel prediction. Data model parameters: ε = 0.1, α(1−ε) = 2. The endpoints are numerically obtained in the random features case. Left: erf activation, square loss. Right: ReLU activation, hinge loss.

Fig. 2 further illustrates the role of overparameterization by depicting the parametric memorization–generalization trade-off λ ↦ ( E_gen(λ), E_mem(λ) ) for random feature regression with finite width κ = p/d. The left panel corresponds to the erf activation and square loss, the right one to ReLU and hinge loss. We see that as the number of learnable parameters p = κd increases, the capacity of the model to memorize and generalize simultaneously improves, and the curves rapidly converge to the κ → ∞ kernel prediction. Perfect memorization is possible only for κ large enough, while it is always possible in the kernel limit for any µ⋆ > 0 (see Appendix C.1). This further justifies our choice to focus on the κ → ∞ kernel limit and investigate the dependence on the kernel geometry fixed by µ1, µ⋆.

3.3 Kernel geometry controls the rule–fact allocation

We now turn to the question: which kernels are best suited for the RAF model, and why? In Section 2.2 we established that the kernel enters our high-dimensional theory only through the two scalar coefficients µ1, µ⋆ given in Eq. (30), and that these are linked to the kernel's abilities of rule learning and factual recall, respectively. In addition, we establish analytically in Appendix A.6, for all losses (and even more general data distributions than the RAF), that the parametric plot of memorization versus generalization, as in Fig. 1, depends on the kernel only through the ratio of the parameters µ1, µ⋆. We thus introduce the angle in the (µ⋆, µ1) plane

    γ = arctan( µ1 / µ⋆ ).    (38)

We identify γ = π/2 with µ⋆ = 0; in particular, linear regression is recovered for µ1 = 1 and µ⋆ = 0. We establish this angle-only dependence by showing that the order parameters, from which the memorization and generalization errors are evaluated, are invariant upon the reparameterization (µ⋆, µ1, λ) → (rµ⋆, rµ1, r²λ) with r > 0, see Appendix A.6. Consequently, the angle γ acts as a single-parameter summary of kernel geometry for the RAF model: it determines how strongly the estimator remains anchored to the teacher rule, versus how much effective capacity is available for fitting unstructured facts. Having shown in Fig. 1 the overall impact of regularization, we now focus on two particularly relevant choices:
– Interpolating regularization λ → 0+: this is the natural choice if memorization is a hard constraint (facts must be stored). Any kernel with µ⋆ > 0 achieves zero memorization error for any of the considered losses when λ → 0+, at any (ε, α). This result is derived in Appendix C.1.2. We will be interested in the generalization error E_gen(λ → 0+) for this interpolating regularization.

– Regularization cross-validated on the generalization error among all positive regularizations, λ = λ_opt > 0: this is the natural choice if rule learning is primary and one hopes memorization will come as a byproduct. As is common in the field, we consider only positive values of λ to keep the optimization problem (8) convex. We are then interested in the value of the generalization error at the optimal regularization, E_gen(λ_opt), and the value of the memorization error at that same regularization, E_mem(λ_opt).

In Fig. 3 we present how the generalization errors E_gen(λ → 0+) (left panels), E_gen(λ_opt) (middle panels) and the memorization error E_mem(λ_opt) (right panels) depend on the angle γ defined in Eq. (38). The different curves correspond to different sample complexities α. In order to make the values of the generalization errors comparable across a range of α, we subtracted from each curve the value that corresponds to the angle γ for which the lowest E_gen(λ_opt) is achieved for the square loss. The corresponding subtracted values are summarized in the table below the figure.

Memorization/interpolation regime

Let us first focus on the behaviour of the generalization error in the perfect memorization regime, at λ → 0+, corresponding to the left-hand panels in Fig. 3. For the square loss, we are able to derive analytically, see Appendix B.3, that at λ → 0+ the generalization error is minimized for kernels corresponding to the angle

    γ_opt_mem(ε) = arctan[ ( π / (2(1−ε)²) − 1 )^{−1/2} ].    (39)

This angle is depicted by the vertical black dashed line in the top panels of Fig. 3. The values of the generalization error at this optimal angle, min_γ E^square_gen(λ → 0+), are summarized in the table below Fig. 3. Notice that while the optimal angle γ_opt_mem depends on the data through the fraction of facts ε, it does not depend on the sample complexity α.

Table below Figure 3:

    α     E^BO_gen    min_γ E^square_gen(λ_opt)    min_γ E^hinge_gen(λ_opt)    min_γ E^hinge_gen(λ → 0+)
    2     0.2430      0.2479                       0.2445                      0.2477
    4     0.1702      0.1864                       0.1763                      0.1858
    10    0.0853      0.1208                       0.1009                      0.1196
    20    0.0449      0.0858                       0.0669                      0.0846

Figure 3: Generalization and memorization dependence on the angle γ = arctan(µ1/µ⋆). KRR (square loss) in the top panels, SVM (hinge loss) in the bottom panels. Left panels: E_gen(λ → 0+) − min_γ E^square_gen(λ_opt); center panels: E_gen(λ_opt) − min_γ E^square_gen(λ_opt); right panels: E_mem(λ_opt). The fraction of facts in the training set is fixed to ε = 0.2. The sample complexities are α ∈ {2, 4, 10, 20} in each panel. The dashed black vertical line in the upper panels is the optimal angle γ_opt_mem(ε) in Eq. (39), at which, for the square loss, both minimum generalization and perfect memorization are simultaneously reached. For the hinge loss (bottom panels), the angle where E^hinge_gen(λ_opt) is minimal is marked by a cross, while the angle where E^hinge_gen(λ → 0+) is minimum is marked by a dot. The table summarizes the Bayes-optimal error, the minimum generalization error for the square loss (which coincides for λ → 0+ and λ_opt), and the minimum generalization error for the hinge loss, both for λ → 0+ and λ_opt.
For the hinge loss in the memorization regime λ → 0+, we see from Fig. 3 and the corresponding data that the phenomenology is qualitatively the same as for the square loss, and quantitatively extremely close. The angle at which the test error is minimized for λ → 0+ is numerically very close (but not equal) to γ_opt_mem from Eq. (39), and the lowest achieved error min_γ E^hinge_gen(λ → 0+) is numerically extremely close to the one achieved with the square loss, min_γ E^hinge_gen(λ → 0+) ≈ min_γ E^square_gen(λ → 0+); compare the values in the table below Fig. 3 (3rd and 5th columns). We thus conclude that in the memorization regime of the RAF model, i.e., at λ → 0+, both for KRR and SVM, kernels corresponding, or very close, to the optimal memorization angle (39) achieve the best generalization error.

Optimal generalization

The next natural question that arises is whether, by relaxing the perfect-memorization constraint, one can further improve the generalization error. Fig. 3 shows that here the answer depends on the considered loss. For the square loss (KRR), the generalization error at γ_opt_mem(ε) is already the best achievable one, while for the hinge loss (SVM) an even better generalization error can be achieved. For the square loss, we are able to derive explicitly the optimal regularization that minimizes the generalization error (see Appendix B.3),

    λ_opt = max( µ1² ( π / (2(1−ε)²) − 1 ) − µ⋆², 0+ ).    (40)

Note that Eq. (39) is then directly related to this value. This result leads to the following properties (see Appendix B.3 for details):

• The optimal regularization λ_opt → 0+ for all kernels satisfying γ ≤ γ_opt_mem(ε). Consequently, the corresponding memorization error is zero in that region. For kernels with γ > γ_opt_mem(ε), both the optimal regularization and the corresponding memorization error are strictly positive.

• The generalization error at the optimal regularization λ_opt has the same lowest possible value, min_γ E^square_gen(λ_opt), for all kernels satisfying γ ≥ γ_opt_mem(ε), and is strictly worse than min_γ E^square_gen(λ_opt) for kernels that have γ < γ_opt_mem(ε).

It thus follows that, for kernel ridge regression, kernels that satisfy γ = γ_opt_mem(ε) are able to simultaneously memorize perfectly while achieving optimal rule learning. For γ > γ_opt_mem(ε), kernel ridge regression is excellent at extracting the rule, but at the cost of limited memorization capability. For γ < γ_opt_mem(ε), kernel ridge regression perfectly memorizes even at the optimal regularization, but at the expense of not extracting the rule reliably.
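Both closed-form quantities, the optimal angle (39) and the optimal regularization (40), are straightforward to evaluate; a minimal sketch (helper names and the example parameter values are ours):

```python
import numpy as np

def gamma_opt_mem(eps):
    """Optimal kernel angle of Eq. (39): perfect memorization together with the best
    achievable KRR generalization."""
    return np.arctan((np.pi / (2 * (1 - eps) ** 2) - 1) ** -0.5)

def lambda_opt(eps, mu1, mustar):
    """Generalization-optimal regularization of Eq. (40) for the square loss
    (the paper writes 0+ for the limit of vanishing regularization)."""
    return max(mu1 ** 2 * (np.pi / (2 * (1 - eps) ** 2) - 1) - mustar ** 2, 0.0)

print(gamma_opt_mem(0.2))                                             # dashed line in Fig. 3 (eps = 0.2)
print(lambda_opt(0.2, mu1=0.5, mustar=np.sqrt(0.25 - 0.5 / np.pi)))   # ReLU kernel example
```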
For the hinge loss, we do not have analytical results for the optimal regularization, yet we can perform the cross-validation numerically. From the results, depicted in the bottom panels of Fig. 3, we see that the generalization error can be further lowered by adjusting the regularization value, to the detriment of the memorization error. The trade-off depicted in Fig. 3 leads to the conclusion that for the hinge loss there is no value of the angle γ, among those achieving zero memorization error, that would lead to the lowest achievable generalization error at the optimal regularization, min_γ E^hinge_gen(λ_opt).

With the insights we just described, we return to the memorization–generalization trade-off curves depicted in Fig. 1. In Fig. 4 we depict the trade-off curves for the optimal angle γ_opt_mem (39) for the square loss (left panel), whereas for the hinge loss (right panel) we depict the angle γ^{opt,hinge}_gen that minimizes the generalization error. For both losses, we then show one representative large value of the angle and a smaller one.

Figure 4: Memorization–generalization trade-off curves λ ↦ ( E_gen(λ), E_mem(λ) ) for kernel methods in the RAF model at ε = 0.1 and α(1−ε) = 2.0. The left panel reports the square loss, while the right panel reports the hinge loss. Curves are shown for representative kernel geometries parameterized by the angles γ (38). For the square loss (left), we display the curve for the optimal angle γ_opt_mem = 0.8011 of Eq. (39). For the hinge loss (right), we display the angle achieving optimal generalization, γ^{opt,hinge}_gen = 0.9774. For both losses, we also display one representative lower value of the angle, and another higher one. Triangles mark the limit λ → 0+, squares mark the limit λ → +∞, the vertical dashed line indicates the Bayes-optimal generalization baseline, and the vertical dotted line indicates the optimal generalization error for the square loss.

In Fig. 5, upper-left panel, we illustrate the described memorization–generalization phenomenology for the widely used isotropic RBF kernel, which we write as a function of the normalized correlation ρ and of the bandwidth-like parameter η:

    K(ρ) = e^{−η(1−ρ)},  where  ρ = x^⊤ x′ / ( ‖x‖ ‖x′‖ ).    (41)

As we derived, for Gaussian input data in high dimension, the asymptotic performance depends on the kernel only through µ1 = √( η e^{−η} ) and µ⋆ = √( 1 − e^{−η}(1 + η) ). In Fig. 5, upper-left panel, each curve corresponds to a different value of the bandwidth parameter η for the RAF model at given representative values of (α, ε). The highlighted value η_opt is chosen so that the kernel corresponds to the optimal angle γ_opt_mem (39).
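Given the map η ↦ (µ1, µ⋆) above, the bandwidth realizing the optimal angle (39) can be located numerically; the following sketch reuses the gamma_opt_mem helper from the previous snippet, and the root-finding bracket and names are our own choices.

```python
import numpy as np
from scipy.optimize import brentq

def rbf_mu(eta):
    """mu_1 and mu_star of the normalized RBF kernel K(rho) = exp(-eta (1 - rho)), Eq. (41)."""
    return np.sqrt(eta * np.exp(-eta)), np.sqrt(1 - np.exp(-eta) * (1 + eta))

def eta_opt_mem(eps):
    """Bandwidth at which the RBF kernel realizes the optimal angle gamma_opt_mem(eps)
    of Eq. (39); gamma_opt_mem is the helper defined in the previous sketch."""
    target = gamma_opt_mem(eps)
    angle = lambda eta: np.arctan(rbf_mu(eta)[0] / rbf_mu(eta)[1]) - target
    return brentq(angle, 1e-3, 50.0)

print(eta_opt_mem(0.2))   # cf. the highlighted bandwidth eta_opt in Fig. 5, upper-left panel
```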
3.4 Qualitative comparison on real data

In order to investigate the consequences of our theory for more realistic data, we reproduce the idea of the rules-and-facts model with the widely used CIFAR10 [38] benchmark. We build a training set that we call CIFAR10-RAF to mimic the RAF construction: two of the CIFAR10 classes provide the structured part of the task, with labels fixed by the class (cat as +1, dog as −1), while a third class (automobile) provides the factual/random subset, whose labels are assigned independently ±1 with equal probability. The dataset is constructed so that the total sample size corresponds to α = 4.0 (this dataset has d = 3072) and a fraction ε = 0.2 of the training examples belong to the randomized subset. The test set contains n_test = 2000 new samples only from the two structured classes, so the reported test error measures generalization to the rule component, while the memorization error is computed on the randomly labeled subset of the training set. Inputs are preprocessed by feature-wise z-scoring, and the effective bandwidth is taken relative to the typical train-set squared distance, η_eff = η / median( ‖x_i − x_j‖² ). The regularization in the CIFAR10 plots is spanned in λ ∈ [10^{−5}, 10²].

Figure 5: Qualitative comparison between the theory for the RAF model (upper-left panel) and experiments on real data, namely the CIFAR10-RAF task (lower and right panels). In all panels, we use RBF kernel ridge regression with α = 4.0 and ε = 0.2. The upper panels show the parametric memorization–generalization curves λ ↦ ( E_gen(λ), E_mem(λ) ). The kernel bandwidth η for CIFAR10-RAF is selected based on the lower panels: generalization error as a function of η at optimal regularization (left), with a minimum around η ≈ 4 (red dashed line), and at small regularization (right), with a minimum around η ≈ 6.75 (purple dashed line). The parametric curves are then shown for these two values of η, together with one larger and one smaller representative value. The RAF model exhibits qualitatively similar behavior as a function of η, with the corresponding values shifted.

For the purpose of testing how the theory is reflected on CIFAR10 (instead of i.i.d. Gaussian) input, we run kernel ridge regression with a full RBF kernel K(x, x′) = exp( −η ‖x − x′‖²₂ ) on the CIFAR10-RAF dataset using the standard square loss. We depict the memorization–generalization trade-off (parametrically in the regularization λ) in Fig. 5, top-right panel. The curves shown are averages over 20 repeated random draws of the CIFAR10 training/test subsets and randomized labels. We remind the reader that for structured data, the performance of the kernel method depends on the kernel through more than just the µ1, µ⋆ parameters, see Appendix D.4.

Still in Fig. 5, we observe a qualitatively similar phenomenology for the RAF and the CIFAR10-RAF data when tuning the RBF bandwidth η. Some aspects are absent in the CIFAR10-RAF data – such as the independence of the optimal generalization error of η. Overall, however, we see that the generalization error for CIFAR10-RAF depends on the bandwidth η similarly to what the theory predicts for the hinge loss, i.e., the generalization error at λ → 0+ is minimized at one value of the bandwidth, here roughly η_mem ≈ 6.75, while the generalization error at optimal regularization is minimized at a lower bandwidth value, here roughly η_gen ≈ 4.0. Representative lower and higher values of the bandwidth display qualitatively similar behaviour to that predicted by the theory. In Appendix D.4 we present the same comparison as in Fig. 5 for the hinge loss, for which we also observe a qualitative agreement between CIFAR10-RAF and the theory.
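A possible way to assemble the CIFAR10-RAF training set described above is sketched below; it assumes the torchvision CIFAR10 loader and the standard class indices (cat = 3, dog = 5, automobile = 1), and the function name and interface are ours rather than part of the released code.

```python
import numpy as np
from torchvision.datasets import CIFAR10

def build_cifar10_raf(n, eps, seed=0, root="./data"):
    """Assemble a CIFAR10-RAF training set: cat (+1) and dog (-1) form the rule part,
    automobile images receive random +/-1 labels and play the role of the facts."""
    rng = np.random.default_rng(seed)
    ds = CIFAR10(root=root, train=True, download=True)
    X_all = np.asarray(ds.data, dtype=np.float64).reshape(len(ds.data), -1)   # d = 3072
    labels = np.asarray(ds.targets)
    n_fact = int(round(eps * n))
    cats, dogs, autos = (np.flatnonzero(labels == c) for c in (3, 5, 1))
    rule_idx = rng.permutation(np.concatenate([cats, dogs]))[: n - n_fact]
    fact_idx = rng.permutation(autos)[:n_fact]
    X = X_all[np.concatenate([rule_idx, fact_idx])]
    y = np.concatenate([np.where(labels[rule_idx] == 3, 1.0, -1.0),
                        rng.choice([-1.0, 1.0], size=n_fact)])
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)      # feature-wise z-scoring
    is_fact = np.arange(n) >= n - n_fact
    return X, y, is_fact

# alpha = n/d = 4.0 with d = 3072 and eps = 0.2, as in Fig. 5
X, y, is_fact = build_cifar10_raf(n=4 * 3072, eps=0.2)
```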
3.5 The large-α generalization rate

The analytical expressions for the square loss from Eqs. (24)–(35) allow us to derive the decay rate of the generalization error when the fraction of facts ε is fixed and the sample complexity α → ∞. As a result, for kernel ridge regression with 0 ≤ ε < 1, the leading large-α decay of the generalization error is

    E^square_gen = [ √(1 − a²) / (π a) ] α^{−1/2} + O(α^{−3/2}),    a = (1−ε) √(2/π).    (42)

In particular, the leading α^{−1/2} coefficient is universal in the sense that it does not depend on λ nor on the kernel geometry µ1, µ⋆. This is shown in Appendix C.3.2. This result has several consequences. On the positive side, the generalization rate α^{−1/2} is achieved even for λ → 0+. Thus, for all the considered kernel methods with µ⋆ > 0, generalization at rate α^{−1/2} is achievable together with perfect memorization. On the negative side, at fixed ε > 0, with the square loss, no choice of kernel or regularization can beat the α^{−1/2} rate, despite the fact that the Bayes-optimal rate is α^{−1} (presented in Section 2.1). In the fact-free case ε = 0, where only the teacher rule (1) produces labels, studied extensively in [36], it is well known that a linear student model with the hinge loss is able to match the Bayes-optimal rate α^{−1}. Interested in the question of whether such a fast rate can be matched for ε > 0, we next investigate the support vector machine.

Figure 6: Generalization errors E^BO_gen, E^square_gen(λ_opt) and E^hinge_gen(λ_opt) for γ = arctan(µ1/µ⋆) = π/4, plotted against α in log-log scale. The fraction of facts in the training set is fixed to a representative value, ε = 0.5. The dashed black segment has a slope equal to −1, while the other two segments have slopes equal to −1/2. This figure illustrates the suboptimality of the generalization rate of the kernel methods in the RAF model.

For the hinge loss, the closed-form simplification leading to Eq. (40) is no longer available. We thus resort to numerical cross-validation of the high-dimensional equations. Numerical evidence presented in Fig. 6 and Appendix C.3.3 convinced us that the large-α generalization decay remains very close to α^{−1/2}, independently of the kernel geometry parameters µ1, µ⋆ and of ε. Based on the above evidence, we thus conjecture that also for the hinge loss, for any ε > 0, any regularization λ and any kernel given by µ1, µ⋆, the generalization decay rate is α^{−1/2}.
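The universal coefficient in Eq. (42) is easy to evaluate and can be compared against the square-loss state-equation solver sketched in Section 2.2 (the raf_krr_errors helper above); the particular λ, µ1, µ⋆ below are arbitrary choices of ours, precisely because the leading coefficient should not depend on them.

```python
import numpy as np

def krr_rate_coefficient(eps):
    """Leading coefficient of the alpha^(-1/2) decay in Eq. (42)."""
    a = (1 - eps) * np.sqrt(2 / np.pi)
    return np.sqrt(1 - a ** 2) / (np.pi * a)

eps, alpha = 0.5, 1e4
pred = krr_rate_coefficient(eps) / np.sqrt(alpha)
num, _ = raf_krr_errors(alpha=alpha, eps=eps, lam=0.1,       # solver from the Section 2.2 sketch
                        mu1=0.5, mustar=np.sqrt(0.25 - 0.5 / np.pi))
print(pred, num)
```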
If we were considering generalization irrespective of the desire to memorize the facts, then we could apply a construction of a loss for which gradient descent matches the steps of the Bayes-optimal approximate message passing algorithm, as detailed in [36, 39], which would achieve the Bayes-optimal rate. Such a method would, however, not have the capacity to memorize the facts. This opens the following intriguing question:

Does there exist a neural network that is able to reach a generalization rate of α^{−1} on data drawn from the RAF model and at the same time memorize the facts?

Our work indicates that linear and kernel methods are insufficient for that purpose. It is possible that a wide two-layer neural network with a trainable first layer (as opposed to the fixed features considered in this work) will achieve this goal. However, an analysis of such a neural network learning on the RAF data remains a technically open problem, which we leave for future investigation.

4 Conclusion

We introduced the Rules-and-Facts (RAF) framework to initiate a theoretical study of how neural networks can simultaneously generalize structured rules and memorize unstructured facts. While both capabilities are central to modern AI systems, they have largely been examined in isolation within existing theoretical studies. Our analysis, carried out in the high-dimensional limit, reveals a clear structural picture. Linear models exhibit an unavoidable trade-off between rule learning and fact memorization. In contrast, overparameterized models, even in their simplest analytically tractable form of random features and kernel regression, admit regimes where memorization and generalization coexist. Overparameterization emerges as the key enabler, providing sufficient excess capacity to store factual exceptions while preserving alignment with the underlying rule. Regularization strength λ and kernel geometry, through a single parameter γ defined in Eq. (38), then govern how this capacity is allocated. Beyond the specific results, the RAF framework provides a minimal setting in which tensions commonly observed in practice (benign overfitting, interpolation, and capacity allocation) can be analyzed quantitatively and in closed form.

Several open directions naturally follow. First, extending the analysis to feature-learning architectures, where representations themselves adapt, may clarify how faster generalization rates (e.g., approaching Bayes-optimal scaling) are compatible with simultaneous memorization. Second, richer architectures such as multi-layer networks and attention-based models may allow the study of how different functional components specialize, potentially separating rule-like computation from fact-like storage. In particular, understanding under what conditions memorization localizes in specific architectural submodules, such as attention heads versus feedforward layers, remains an intriguing question. Finally, the interplay between memorization and generalization has long been studied in cognitive science and neuroscience, see e.g. [40], often framed in terms of complementary learning systems. It is an interesting avenue for future work to explore whether the RAF framework and the analysis introduced here could have interesting implications in that context.
An important direction for future work is to move beyond the dichotomy between beneficial and detrimental memorization. In realistic learning systems, memorization may play multiple roles: enabling the storage of rare but essential information, or giving rise to unwanted retention of spurious or sensitive data. Our work may inspire new frameworks for controlled settings in which such distinct forms of memorization can be disentangled and studied systematically. More broadly, many tasks intrinsically require an active interaction between memorization and generalization, rather than their mere coexistence. Exceptions may refine rules, stored facts may guide inference under ambiguity, and memory may serve as a scaffold for abstraction. Understanding these mechanisms, and the architectural or dynamical principles that regulate them, represents a promising avenue toward a richer theory of learning in both artificial [41] and biological systems [40].

Acknowledgment

We thank Emanuele Troiani, Vittorio Erba, Freya Behrens, Valentina Njaradi, and Florent Krzakala for fruitful discussions. We acknowledge that the initial idea to study the RAF model came from a conversation with ChatGPT that is transcribed in Appendix E for reference. We acknowledge funding from the Swiss National Science Foundation grant SNSF SMArtNet (grant number 212049) and the Simons Collaboration on the Physics of Learning and Neural Computation via the Simons Foundation grant (#1257413 (LZ)).

References

[1] Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, 2019.

[2] Benjamin Heinzerling and Kentaro Inui. Language models as knowledge bases: On entity representations, storage capacity, and paraphrased queries. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1772–1791, 2021.

[3] Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and extraction. arXiv preprint arXiv:2309.14316, 2023.

[4] Bertram Opitz and Angela D Friederici. Brain correlates of language learning: the neuronal dissociation of rule-based versus similarity-based learning. Journal of Neuroscience, 24(39):8436–8440, 2004.

[5] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

[6] Alia Abbaras, Benjamin Aubin, Florent Krzakala, and Lenka Zdeborová. Rademacher complexity and spin glasses: A link between the replica and statistical theories of learning. In Mathematical and Scientific Machine Learning, pages 27–54. PMLR, 2020.

[7] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.

[8] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.
[9] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.

[10] Madhu S Advani, Andrew M Saxe, and Haim Sompolinsky. High-dimensional dynamics of generalization error in neural networks. Neural Networks, 132:428–446, 2020.

[11] Stefano Spigler, Mario Geiger, Stéphane d'Ascoli, Levent Sagun, Giulio Biroli, and Matthieu Wyart. A jamming transition from under- to over-parametrization affects generalization in deep learning. Journal of Physics A: Mathematical and Theoretical, 52(47):474001, 2019.

[12] Vitaly Feldman. Does learning require memorization? A short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 954–959, 2020.

[13] Mo Zhou, Haoyang Ma, and Rong Ge. Memorizing long-tail data can help generalization through composition. arXiv preprint arXiv:2510.16322, 2025.

[14] Elizabeth Gardner and Bernard Derrida. Optimal storage properties of neural network models. Journal of Physics A: Mathematical and General, 21(1):271, 1988.

[15] John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982.

[16] Daniel J Amit, Hanoch Gutfreund, and Haim Sompolinsky. Storing infinite numbers of patterns in a spin-glass model of neural networks. Physical Review Letters, 55(14):1530, 1985.

[17] Dmitry Krotov and John J Hopfield. Dense associative memory for pattern recognition. Advances in Neural Information Processing Systems, 29, 2016.

[18] SueYeon Chung, Daniel D Lee, and Haim Sompolinsky. Classification and geometry of general perceptual manifolds. Physical Review X, 8(3):031003, 2018.

[19] Junghwan Kim, Michelle Kim, and Barzan Mozafari. Provable memorization capacity of transformers. In The Eleventh International Conference on Learning Representations, 2023.

[20] Sadegh Mahdavi, Renjie Liao, and Christos Thrampoulidis. Memorization capacity of multi-head attention in transformers. arXiv preprint arXiv:2306.02010, 2023.

[21] Eshaan Nichani, Jason D Lee, and Alberto Bietti. Understanding factual recall in transformers via associative memories. In International Conference on Learning Representations (ICLR), 2025.

[22] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650, 2021.

[23] Milad Nasr, Reza Shokri, and Amir Houmansadr. Machine learning with membership privacy using adversarial regularization. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pages 634–646, 2018.

[24] Antonio Ginart, Melody Guan, Gregory Valiant, and James Y Zou. Making AI forget you: Data deletion in machine learning. Advances in Neural Information Processing Systems, 32, 2019.

[25] TaeHo Yoon, Joo Young Choi, Sehyun Kwon, and Ernest K Ryu. Diffusion probabilistic models generalize when they fail to memorize. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.
[26] Giulio Biroli, Tony Bonnaire, Valentin De Bortoli, and Marc Mézard. Dynamical regimes of diffusion models. Nature Communications, 15(1):9957, 2024.

[27] Bao Pham, Gabriel Raya, Matteo Negri, Mohammed J Zaki, Luca Ambrogioni, and Dmitry Krotov. Memorization to generalization: Emergence of diffusion models from associative memory. arXiv preprint arXiv:2505.21777, 2025.

[28] Elizabeth Gardner and Bernard Derrida. Three unfinished works on the optimal storage capacity of networks. Journal of Physics A: Mathematical and General, 22(12):1983, 1989.

[29] Hyunjune Sebastian Seung, Haim Sompolinsky, and Naftali Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056, 1992.

[30] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 20, 2007.

[31] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[32] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667–766, 2022.

[33] Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Generalisation error in learning with random features and the hidden manifold model. In International Conference on Machine Learning, pages 3452–3462. PMLR, 2020.

[34] Bruno Loureiro, Cedric Gerbelot, Hugo Cui, Sebastian Goldt, Florent Krzakala, Marc Mezard, and Lenka Zdeborová. Learning curves of generic features maps for realistic datasets with a teacher-student model. Advances in Neural Information Processing Systems, 34:18137–18151, 2021.

[35] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Generalization error of random feature and kernel methods: Hypercontractivity and kernel matrix concentration. Applied and Computational Harmonic Analysis, 59:3–84, 2022.

[36] Benjamin Aubin, Florent Krzakala, Yue Lu, and Lenka Zdeborová. Generalization error in high-dimensional perceptrons: Approaching Bayes error with convex optimization. Advances in Neural Information Processing Systems, 33:12199–12210, 2020.

[37] Jean Barbier, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová. Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences, 116(12):5451–5460, 2019. doi: 10.1073/pnas.1802705116. URL https://www.pnas.org/doi/abs/10.1073/pnas.1802705116.

[38] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[39] Madhu Advani and Surya Ganguli. An equivalence between high dimensional Bayes optimal inference and M-estimation. Advances in Neural Information Processing Systems, 29, 2016.

[40] Weinan Sun, Madhu Advani, Nelson Spruston, Andrew Saxe, and James E Fitzgerald. Organizing memories for generalization in complementary learning systems. Nature Neuroscience, 26(8):1438–1448, 2023.

[41] Samy Jelassi, Clara Mohri, David Brandfonbrener, Alex Gu, Nikhil Vyas, Nikhil Anand, David Alvarez-Melis, Yuanzhi Li, Sham M Kakade, and Eran Malach. Mixture of parrots: Experts improve memorization more than reasoning. arXiv preprint arXiv:2410.19034, 2024.

[42] Sebastian Goldt, Marc Mézard, Florent Krzakala, and Lenka Zdeborová.
Modeling the influence of data structure on learning in neural networks: The hidden manifold model. Physical Review X, 10(4):041044, 2020.

[43] Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel methods in machine learning. The Annals of Statistics, 36(3):1171–1220, 2008. doi: 10.1214/009053607000000677. URL https://doi.org/10.1214/009053607000000677.

[44] Gabor Szegő. Orthogonal Polynomials, volume 23. American Mathematical Society, 1939.

[45] Federica Gerace, Florent Krzakala, Bruno Loureiro, Ludovic Stephan, and Lenka Zdeborová. Gaussian universality of perceptrons with random labels. Phys. Rev. E, 109:034305, Mar 2024. doi: 10.1103/PhysRevE.109.034305. URL https://link.aps.org/doi/10.1103/PhysRevE.109.034305.

A Replica analysis of the ERM problem on the RAF task model

This appendix provides a derivation of the results of Sec. 2 based on the replica method of the statistical physics of disordered systems.

A.1 Setting and notation

First, we recall the setting and introduce the notation used throughout the following sections. Consider a training dataset $\mathcal{D} = \{(\mathbf{x}^\mu, y^\mu)\}_{\mu=1}^n$ consisting of $n$ independently labeled samples. The inputs $\mathbf{x}^\mu \in \mathbb{R}^d$ are i.i.d. draws from $\mathcal{N}(\mathbf{0}, \mathbf{I}_d)$, while the labels $y^\mu \in \{\pm 1\}$ are generated according to the output distribution

$$P^\star_{\rm out}\!\left(y^\mu \,\Big|\, \tfrac{1}{\sqrt{d}}\,\mathbf{w}_\star \cdot \mathbf{x}^\mu\right) = \varepsilon\, P_{\rm mem}(y^\mu) + (1-\varepsilon)\, P_{\rm gen}\!\left(y^\mu \,\Big|\, \tfrac{1}{\sqrt{d}}\,\mathbf{w}_\star \cdot \mathbf{x}^\mu\right). \qquad (43)$$

In Eq. (43) we distinguish the channel associated with the random labels

$$P_{\rm mem}(y^\mu) = \tfrac{1}{2}\,\delta(y^\mu + 1) + \tfrac{1}{2}\,\delta(y^\mu - 1) \qquad (44)$$

from the channel linked to the teacher

$$P_{\rm gen}\!\left(y^\mu \,\Big|\, \tfrac{1}{\sqrt{d}}\,\mathbf{w}_\star \cdot \mathbf{x}^\mu\right) = \delta\!\left(y^\mu - \mathrm{sign}\!\left(\tfrac{1}{\sqrt{d}}\,\mathbf{w}_\star \cdot \mathbf{x}^\mu\right)\right). \qquad (45)$$

In addition, the teacher weights are assumed to follow a Gaussian prior $P_{\mathbf{w}_\star} = \mathcal{N}(\mathbf{0}, \mathbf{I}_d)$. Let $\hat{\mathbf{w}}$ be the minimizer of the strictly convex empirical risk

$$\mathcal{L}(\mathbf{w}; \mathcal{D}) = \sum_{\mu=1}^{n} \ell\big(y^\mu, f_{\mathbf{w}}(\mathbf{x}^\mu)\big) + \frac{\lambda}{2}\, r(\mathbf{w}) = \sum_{\mu \in \mathcal{I}_{\rm mem}} \ell\big(y^\mu, f_{\mathbf{w}}(\mathbf{x}^\mu)\big) + \sum_{\mu \in \mathcal{I}_{\rm gen}} \ell\big(y^\mu, f_{\mathbf{w}}(\mathbf{x}^\mu)\big) + \frac{\lambda}{2}\, r(\mathbf{w}), \qquad (46)$$

where $f_{\mathbf{w}}$ is the pre-activation function of the model, $r$ the regularization on the trained weights, and we assume $\lambda > 0$. In (46), we emphasize that the data-dependent term of the empirical risk can be decomposed into two sums by introducing the index sets $\mathcal{I}_{\rm mem}$ and $\mathcal{I}_{\rm gen}$, corresponding to the randomly labeled samples and the teacher-labeled ones, respectively.

We evaluate the minimizer $\hat{\mathbf{w}}$ through its capacity to memorize the random labels in the training set and its ability to generalize on the learnable ones. These properties are measured by the memorization error

$$E_{\rm mem}(\hat{\mathbf{w}}; \mathcal{D}, \bar{\varepsilon}) = \frac{1}{\bar{\varepsilon} n} \sum_{\mu \in \mathcal{I}_{\rm mem}} \mathbb{1}_{y^\mu \neq \mathrm{sign}(f_{\hat{\mathbf{w}}}(\mathbf{x}^\mu))} = \frac{1}{\bar{\varepsilon} n} \sum_{\mu \in \mathcal{I}_{\rm mem}} \theta\big(-y^\mu f_{\hat{\mathbf{w}}}(\mathbf{x}^\mu)\big) \qquad (47)$$

and the generalization error

$$E_{\rm gen}(\hat{\mathbf{w}}) = \mathbb{E}_{\mathbf{x}_{\rm new}, y_{\rm new}}\big[\mathbb{1}_{y_{\rm new} \neq \mathrm{sign}(f_{\hat{\mathbf{w}}}(\mathbf{x}_{\rm new}))}\big] = \mathbb{E}_{\mathbf{x}_{\rm new}, y_{\rm new}}\big[\theta\big(-y_{\rm new}\, f_{\hat{\mathbf{w}}}(\mathbf{x}_{\rm new})\big)\big]. \qquad (48)$$

In (47) we introduced the fraction of randomly labeled samples in the training dataset $\bar{\varepsilon} = |\mathcal{I}_{\rm mem}|/n$, which is a scaled binomial random variable. In what follows we focus on the random features setting, where

$$f_{\mathbf{w}}(\mathbf{x}^\mu) = \frac{1}{\sqrt{p}}\, \mathbf{w} \cdot \sigma\!\left(\frac{1}{\sqrt{d}}\, \mathbf{F}^\top \mathbf{x}^\mu\right) \quad \text{and} \quad r(\mathbf{w}) = \|\mathbf{w}\|_2^2,$$

with $\mathbf{w} \in \mathbb{R}^p$, $\mathbf{F} \in \mathbb{R}^{d \times p}$ with i.i.d. entries $F_{ia} \sim \mathcal{N}(0, 1)$, and $\sigma: \mathbb{R} \to \mathbb{R}$ a non-linear activation applied component-wise. We adopt this setting because kernel regression arises as a limiting case of random-features regression, while the linear model can be viewed as a particular instance of kernel regression.
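As a concrete illustration of this setting (not part of the derivation that follows), the Python sketch below samples a RAF dataset according to Eqs. (43)-(45), trains the random-features model with a ridge-regularized square loss, and estimates the errors of Eqs. (47)-(48) by Monte Carlo. The dimensions, the choice σ = erf, and the exact mapping between the code's regularization constant and the λ of Eq. (46) are illustrative assumptions.

# Minimal sketch: sample RAF data (Eqs. (43)-(45)), fit the random-features
# model f_w(x) = w . sigma(F^T x / sqrt(d)) / sqrt(p) with a ridge-regularized
# square loss, and estimate E_mem (Eq. (47)) and E_gen (Eq. (48)).
# d, alpha, kappa, lam and sigma = erf are illustrative choices.
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
d, alpha, kappa, eps, lam = 200, 4.0, 8.0, 0.2, 1e-2
n, p = int(alpha * d), int(kappa * d)

w_star = rng.standard_normal(d)                      # teacher weights
F = rng.standard_normal((d, p))                      # random first layer

def features(X):
    return erf(X @ F / np.sqrt(d)) / np.sqrt(p)      # sigma applied entrywise

# Training set: a fraction eps gets random labels (facts), the rest follows the rule.
X = rng.standard_normal((n, d))
is_fact = rng.random(n) < eps
y = np.sign(X @ w_star / np.sqrt(d))
y[is_fact] = rng.choice([-1.0, 1.0], size=is_fact.sum())

# Ridge solution for the half-square loss; the mapping between this lam and
# the lambda of Eq. (46) is a convention detail.
Phi = features(X)
w_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)

e_mem = np.mean(np.sign(Phi[is_fact] @ w_hat) != y[is_fact])

X_new = rng.standard_normal((20000, d))
y_new = np.sign(X_new @ w_star / np.sqrt(d))
e_gen = np.mean(np.sign(features(X_new) @ w_hat) != y_new)

print(f"eps_bar = {is_fact.mean():.3f}  E_mem = {e_mem:.3f}  E_gen = {e_gen:.3f}")

Kernel regression and the linear model then correspond to particular limits of this finite-size model.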
These connections will b e sho wn in the following sections. Finally , we study the p erformance of the minimizer in the high-dimensional limit, where n, d, p → ∞ , α = n d = Θ(1) , κ = p d = Θ(1) . (49) F or conv enience, we also introdu ce the notation γ = κ − 1 . 24 A.2 Gibbs formulation of the minimization problem The empirical risk minimization (ERM) in the high-dimensional limit is equiv alent to solving a set of low-dimensional self-consistent equations for a set of scalar order parameters that fully c haracterize the minimizer. Within statistical physics, this is sho wn via the replica metho d. T o b egin with, w e define the following Gibbs measure o ver the weigh ts w ∈ R p : µ β ( d w ) = 1 Z β ( D ) e − β L ( w ; D ) d w , (50) where Z β ( D ) is the normalization or partition function of the measure: Z β ( D ) = Z R p d w e − β L ( w ; D ) . (51) Sending the inv erse temp erature β to infinity , the Gibbs measure ( 50 ) concen trates on the global minimizer of the risk ( 46 ). The goal of the replica metho d is to compute the free energy density f β = − 1 β lim p → + ∞ 1 p E D h log Z β ( D ) i (52) and ev aluate it for β → + ∞ . Indeed, in this limit the free energy densit y ( 52 ) con verges to the global minim um of the energy functional of the problem, that is just the empirical risk ( 46 ) . Then, recasting the righ t hand side of Eq. ( 52 ) as a a saddle-p oin t integral in the large p limit, the order parameters describing the asymptotic minimizer arise naturally as the asso ciated saddle-p oin t solutions. In the next section, w e present the main steps of the replica computation. In practice we will not compute the free energy density defined in ( 52 ), but rather a closely related one, namely f β ,h = − 1 β lim p → + ∞ 1 p E D , ¯ ε h log Z β ,h ( D , ¯ ε ) i , (53) where the corresp onding partition function is defined as Z β ,h ( D , ¯ ε ) = Z R p d w e − β L ( w ; D ) − β hn E mem ( w ; D , ¯ ε ) . (54) W e in tro duce this mo dified definition because it enables us to extract the asymptotic memorization of the risk minimizer, given by E mem = 1 γ α lim β →∞ ∂ f β ,h ∂ h      h =0 . (55) When h = 0 , the quantit y f β ,h reduces to the original free-energy density f β . A.3 Replica computation of the free energy densit y Using the replica trick, the free energy density f β ,h can b e rewritten as f β ,h = − 1 β lim p → + ∞ lim r → 0 + 1 p E D , ¯ ε [ Z β ,h ( D , ¯ ε ) r ] − 1 r . (56) Therefore, the whole problem reduces to compute the av erage replicated partition function. 25 The replicated partition function The partition function can b e massaged as follows: Z β ,h ( D , ¯ ε ) = Z R p d w e − β  n P µ =1 ℓ ( y µ ,f w ( x µ ))+ λ 2 || w || 2 2  − β h ¯ ε P µ ∈I mem θ ( − y µ f w ( x µ )) = Z R p d w e − β P µ ∈I mem ℓ ( y µ ,f w ( x µ )) − β P µ ∈I gen ℓ ( y µ ,f w ( x µ )) − β λ 2 || w || 2 2 − β h ¯ ε P µ ∈I mem θ ( − y µ f w ( x µ )) = Z R p d w e − β P µ ∈I mem [ ℓ ( y µ ,f w ( x µ ))+ h ¯ ε θ ( − y µ f w ( x µ )) ] e − β P µ ∈I gen ℓ ( y µ ,f w ( x µ )) e − β λ 2 || w || 2 2 = Z R p d w Y µ ∈I mem e − β  ℓ ( y µ ,f w ( x µ ))+ h ¯ ε θ ( − y µ f w ( x µ ))  | {z } P ℓθ ( y µ | f w ( x µ )) ! Y µ ∈I gen e − β ℓ ( y µ ,f w ( x µ )) | {z } P ℓ ( y µ | f w ( x µ )) ! e − β λ 2 || w || 2 2 | {z } P w ( w ) = Z R p d w P w ( w ) Y µ ∈I mem P ℓθ ( y µ | f w ( x µ )) ! Y µ ∈I gen P ℓ ( y µ | f w ( x µ )) ! . (57) Th us, the replicated partition function is equal to Z β ,h ( D , ¯ ε ) r = Z R p × r r Y a =1 d w a P w ( w a ) ! 
Y µ ∈I mem a =1 ,...,r P ℓθ ( y µ | f w a ( x µ ) ! Y µ ∈I gen a =1 ,...,r P ℓ ( y µ | f w a ( x µ )) ! . (58) A v eraging o v er the disorder The a verage replicated partition function can b e obtained as E D , ¯ ε [ Z β ,h ( D , ¯ ε ) r ] = E ¯ ε E w ⋆ E D| ¯ ε, w ⋆ [ Z β ,h ( D , ¯ ε ) r ] , (59) where the conditional av erage ov er the dataset reads E D| ¯ ε, w ⋆ [ Z β ,h ( D , ¯ ε ) r ] = Z R p × r r Y a =1 d w a P w ( w a ) ! × Z R d × ¯ εn Y µ ∈I mem d x µ P x ( x µ ) ! Z R ¯ εn Y µ ∈I mem dy µ P mem ( y µ ) ! × Z R d × (1 − ¯ ε ) n Y µ ∈I gen d x µ P x ( x µ ) ! Z R (1 − ¯ ε ) n Y µ ∈I gen dy µ P gen  y µ    1 √ d w ⋆ · x µ  ! × Y µ ∈I mem a =1 ,...,r P ℓθ ( y µ | f w a ( x µ )) ! Y µ ∈I gen a =1 ,...,r P ℓ ( y µ | f w a ( x µ )) ! . (60) T o ev aluate explicitly the a verage ov er the inputs, w e introduce the auxiliary v ariables ν µ = 1 √ d w ⋆ · x µ , λ a µ = f w a ( x µ ) , (61) 26 enforcing these definitions through Dirac delta functions for all v alues of the indices µ and a : E D| ¯ ε, w ⋆ [ Z β ,h ( D , ¯ ε ) r ] = Z R p × r  r Y a =1 d w a P w ( w a )  × ( Y µ ∈I mem Z R dy µ P mem ( y µ ) Z R r r Y a =1 dλ a µ P ℓθ ( y µ | λ a µ ) ! E x µ " r Y a =1 δ  λ a µ − f w a ( x µ )  #) × ( Y µ ∈I gen Z R dy µ Z R dν µ P gen ( y µ | ν µ ) Z R r r Y a =1 dλ a µ P ℓ ( y µ | λ a µ ) ! × E x µ " δ  ν µ − 1 √ d w ⋆ · x µ  r Y a =1 δ  λ a µ − f w a ( x µ )  #) . (62) Recall that in the random features setting f w ( x µ ) = 1 √ p w · σ ( 1 √ d F ⊤ x µ ) . By a replicated Gaussian equiv alence argument, we claim that E x µ " r Y a =1 δ  λ a µ − f w a ( x µ )  # = N ( { λ a µ } r a =1 ; 0 , Q ) , E x µ " δ  ν µ − 1 √ d w ⋆ · x µ  r Y a =1 δ  λ a µ − f w a ( x µ )  # = N ( ν µ , { λ a µ } r a =1 ; 0 , Σ) , (63) where the elemen ts of the cov ariance matrices Q ∈ R r × r , Σ ∈ R ( r +1) × ( r +1) are the rescaled o verlap order parameters Σ 00 = 1 d || w ⋆ || 2 2 , Σ 0 a = µ 1 d w ⋆ · s a a = 1 , . . . , r, Σ ab = Q ab = µ 2 ⋆ p w a · w b + µ 2 1 d s a · s b 1 ≤ a ≤ b ≤ r , (64) with s a = 1 √ p F w a a = 1 , . . . , r (65) and µ 0 = E z ∼N (0 , 1)  σ ( z )  , µ 1 = E z ∼N (0 , 1)  z σ ( z )  , µ 2 ⋆ = E z ∼N (0 , 1)  σ ( z ) 2  − µ 2 0 − µ 2 1 . (66) The replicated Gaussian equiv alence is based on the Gaussian equiv alence theorem (GET) in tro duced and pro ved in [ 42 ]. The GET holds in the high-dimensional limit p, d → + ∞ with fixed p/d and for matrices F satisfying sp ecific balance conditions. F or more details on the GET and the replicated Gaussian equiv alence, we defer to App endix B of [ 33 ]. In the end, the av erage replicated partition function is given by E D , ¯ ε [ Z β ,h ( D , ¯ ε ) r ] = E ¯ ε Z R d d w ⋆ P w ⋆ ( w ⋆ ) Z R p × r  r Y a =1 d w a P w ( w a )  × ( Y µ ∈I mem Z R dy µ P mem ( y µ ) Z R r r Y a =1 dλ a µ P ℓθ ( y µ | λ a µ ) ! N ( { λ a µ } r a =1 ; 0 , Q ) ) × ( Y µ ∈I gen Z R dy µ Z R dν µ P gen ( y µ | ν µ ) Z R r r Y a =1 dλ a µ P ℓ ( y µ | λ a µ ) ! N ( ν µ , { λ a µ } r a =1 ; 0 , Σ) ) . (67) 27 Rewriting as a saddle p oint problem The next step is to disentangle the integrals o ver the lab els and the previously introduced auxiliary v ariables from the integrals ov er the teacher and the studen t replicas, whic h are coupled through the ov erlap order parameters. T o this end, w e insert in Eq. ( 67 ) the following Dirac delta functions, and their corresp onding F ourier representations: 1 ∝ Z R dρ δ  dρ − || w ⋆ || 2 2  Z R r r Y a =1 dm a s ! r Y a =1 δ  dm a s − w ⋆ · s a  ! 
× Z R r × r   Y 1 ≤ a ≤ b ≤ r dq ab s     Y 1 ≤ a ≤ b ≤ r δ  dq ab s − s a · s b    × Z R r × r   Y 1 ≤ a ≤ b ≤ r dq ab w     Y 1 ≤ a ≤ b ≤ r δ  pq ab w − w a · w b    = Z R dρ Z R d ˆ ρ 2 π e i ˆ ρ ( dρ −|| w ⋆ || 2 2 ) Z R r r Y a =1 dm a s ! Z R r r Y a =1 d ˆ m a s 2 π ! e i r P a =1 ˆ m a s ( dm a − w ⋆ · w a ) × Z R r × r   Y 1 ≤ a ≤ b ≤ r dq ab s   Z R r × r   Y 1 ≤ a ≤ b ≤ r d ˆ q ab s 2 π   e i P 1 ≤ a ≤ b ≤ r ˆ q ab s ( dq ab s − s a · s b ) × Z R r × r   Y 1 ≤ a ≤ b ≤ r dq ab w   Z R r × r   Y 1 ≤ a ≤ b ≤ r d ˆ q ab w 2 π   e i P 1 ≤ a ≤ b ≤ r ˆ q ab ( dq ab w − w a · w b ) . (68) Disregarding subleading factors that v anish in the high-dimensional limit, the a verage replicated partition function can then b e expressed as a saddle-p oin t integral as p → + ∞ : E D , ¯ ε [ Z β ,h ( D , ¯ ε ) r ] = E ¯ ε Z R dρ Z i R d ˆ ρ 2 π Z R r r Y a =1 dm a s ! Z ( i R ) r r Y a =1 d ˆ m a s 2 π ! Z R r × r   Y 1 ≤ a ≤ b ≤ r dq ab s   × Z ( i R ) r × r   Y 1 ≤ a ≤ b ≤ r d ˆ q ab s 2 π   Z R r × r   Y 1 ≤ a ≤ b ≤ r dq ab w   Z ( i R ) r × r   Y 1 ≤ a ≤ b ≤ r d ˆ q ab w 2 π   e p Φ ( r ) (Σ , ˆ Σ; ¯ ε ) , (69) where we introduced ˆ Σ as a collective notation for the set of v ariables ˆ ρ, { ˆ m a } r a =1 , { ˆ q ab s } 1 ≤ a ≤ b ≤ r and { ˆ q ab w } 1 ≤ a ≤ b ≤ r . The p oten tial Φ ( r ) (Σ , ˆ Σ; ¯ ε ) reads Φ ( r ) (Σ , ˆ Σ; ¯ ε ) = − γ ρ ˆ ρ − γ r X a =1 m a s ˆ m a s − γ X 1 ≤ a ≤ b ≤ r q ab s ˆ q ab s − X 1 ≤ a ≤ b ≤ r q ab w ˆ q ab w + Ψ ( r ) w ( ˆ Σ) + γ ¯ εα Ψ ( r ) mem ( Q ) + γ (1 − ¯ ε ) α Ψ ( r ) gen (Σ) (70) where w e resp ectively called Ψ ( r ) w ( ˆ Σ) = 1 p log Z R d d w ⋆ P w ⋆ ( w ⋆ ) Z R p × r r Y a =1 d w a P w ( w a ) ! × e ˆ ρ || w ⋆ || 2 2 + r P a =1 ˆ m a s w ⋆ · w a + P 1 ≤ a ≤ b ≤ r ˆ q ab s s a · s b + P 1 ≤ a ≤ b ≤ r ˆ q ab w w a · w b , (71) Ψ ( r ) mem ( Q ) = log Z R dy P mem ( y ) Z R r r Y a =1 dλ a P ℓθ ( y | λ a ) ! N ( { λ a } r a =1 ; 0 , Q ) , (72) 28 Ψ ( r ) gen (Σ) = log Z R dy Z R dν P gen ( y | ν ) Z R r r Y a =1 dλ a P ℓ ( y | λ a ) ! N ( ν, { λ a } r a =1 ; 0 , Σ) . (73) After the disentanglemen t, the cov ariance matrices Q and Σ are given by Σ 00 = ρ, Σ 0 a = µ 1 m a s a = 1 , . . . , r, Σ ab = Q ab = µ 2 ⋆ q ab w + µ 2 1 q ab s 1 ≤ a ≤ b ≤ r . (74) The replica-symmetric ansatz Applying the saddle-p oin t metho d, the computation of the a verage replicated partition function reduces to an extremization problem o ver Σ and ˆ Σ (up to the av erage ov er ¯ ε , which we still need to tak e). At this stage, the free energy densit y ( 56 ) can b e expressed as f β ,h = − 1 β lim r → 0 + lim p → + ∞ 1 p E ¯ ε " e p extr Σ , ˆ Σ Φ ( r ) (Σ , ˆ Σ; ¯ ε ) # − 1 r . (75) T o pro ceed with the r → 0 + limit w e restrict the extremization to the replica-symmetric ansatz, whic h implies m a s = m s , ˆ m a s = ˆ m s , for a = 1 , . . . , r, q aa s/w = Q s/w , ˆ q aa s/w = − 1 2 ˆ Q s/w , for a = 1 , . . . , r, q ab s/w = q s/w , ˆ q ab s/w = ˆ q s/w , for 1 ≤ a < b ≤ r. (76) In this ansatz, we claim that f β ,h = − 1 β lim r → 0 + lim p → + ∞ 1 r E ¯ ε " extr Σ , ˆ Σ Φ ( r ) (Σ , ˆ Σ; ¯ ε ) # = − 1 β lim r → 0 + lim p → + ∞ 1 r extr Σ , ˆ Σ Φ ( r ) (Σ , ˆ Σ; ε ) , (77) where w e used that ¯ ε concen trates to ε in the high-dimensional limit. The first equality in ( 77 ) holds provided that in the replica-symmetric ansatz lim r → 0 + Φ ( r ) (Σ , ˆ Σ ; ¯ ε ) = 0 . Inserting ( 76 ) in to the p oten tial ( 70 ) , the ab o ve consistency condition fixes ρ = E w ⋆ ∼N (0 , 1)  w 2 ⋆  = 1 and ˆ ρ = 0 . 
The reasoning leading to this result is the same as that presen ted in App endix C of [ 33 ]. The replica-symmetric free energy density The free energy density f β ,h in the replica-symmetric ansatz takes the form f β ,h = − 1 β extr m s ,q s ,V s ,q w ,V w ˆ m s , ˆ q s , ˆ V s , ˆ q w , ˆ V w  − γ m s ˆ m s + γ 2 V s ˆ V s + γ 2 ( q s ˆ V s − ˆ q s V s ) + 1 2 V w ˆ V w + 1 2 ( q w ˆ V w − ˆ q w V w ) + ψ w ( ˆ m s , ˆ q s , ˆ V s , ˆ q w , ˆ V w ) + γ εαψ mem ( q , V ) + γ (1 − ε ) αψ gen ( m, q , V )  . (78) In Eq. ( 78 ) , w e introduced V s/w = Q s/w − q s/w , ˆ V s/w = ˆ Q s/w + ˆ q s/w . F urthermore, w e defined m = µ 1 m s , q = µ 2 1 q s + µ 2 ⋆ q w and V = µ 2 1 V s + µ 2 ⋆ V w . The entropic term ψ w and the energetic terms ψ mem , ψ gen are obtained by taking the limits p → + ∞ and r → 0 + of 1 r Ψ ( r ) w , 1 r Ψ ( r ) mem and 1 r Ψ ( r ) gen , resp ectively , ev aluated under the replica- symmetric ansatz. Here, we simply rep ort their explicit expressions, since the deriv ation follows exactly the same steps as those detailed in App endix C of [ 33 ]: 29 ψ w ( ˆ m s , ˆ q s , ˆ V s , ˆ q w , ˆ V w ) = − 1 2 log  β λ + ˆ V w  − γ 2 log ˆ V s β λ + ˆ V w ! + γ 2 Z ˆ V s β λ + ˆ V w 0 dt 1 t 2 g  − 1 t  + γ ˆ m 2 s + ˆ q s 2 ˆ V s " 1 − β λ + ˆ V w ˆ V s g − β λ + ˆ V w ˆ V s !# + 1 2 ˆ q w β λ + ˆ V w " 1 − γ + γ β λ + ˆ V w ˆ V s g − β λ + ˆ V w ˆ V s !# , (79) ψ mem ( q , V ) = E ξ ∼N (0 , 1)  Z R dy P mem ( y ) log Z R dz √ 2 π V P ℓθ ( y | z ) e − ( z − √ qξ ) 2 2 V  , (80) ψ gen ( m, q , V ) = E ξ ∼N (0 , 1) " Z R dy Z R dν r 2 π  1 − m 2 q  P gen ( y | ν ) e −  ν − m √ q ξ  2 2  1 − m 2 q  × log Z R dz √ 2 π V P ℓ ( y | z ) e − ( z − √ qξ ) 2 2 V # . (81) In Eq. ( 79 ) , g denotes the Stieltjes transform of the sp ectral distribution of the matrix F F ⊤ p in the limit as p → + ∞ . Since F is a Gaussian pro jection matrix, this sp ectral distribution is w ell-defined in the large p limit and it is describ ed by the w ell-known Marchenk o-P astur law. The corresp onding Stieltjes transform is giv en by g ( z ) = 1 − z − γ − p ( z − 1 − γ ) 2 − 4 γ 2 γ z (82) with z < 0 . It is w orth c hecking that Eq. ( 78 ) is consisten t with the replica-symmetric v ersion of Eq. ( 52 ) . When h = 0 , one has P ℓθ ( y | z ) = P ℓ ( y | z ) . Then, the terms ψ mem and ψ gen can b e merged into a single energetic contribution, which we denote by ψ out ⋆ , defined as follows: ψ out ⋆ ( m, q , V ) = E ξ ∼N (0 , 1) " Z R dy Z out ⋆ y , m √ q ξ , 1 − m 2 q ! log Z out ( y , √ q ξ , V ) # , (83) where Z out ⋆ ( y , ω , τ ) = E z ∼N ( ω ,τ ) [ P out ⋆ ( y | z )] , (84) Z out ( y , ω , τ ) = E z ∼N ( ω ,τ ) [ P ℓ ( y | z )] . (85) T aking this into account, for h = 0 the replica-symmetric free energy densit y ( 78 ) b ecomes f β , 0 = − 1 β extr m s ,q s ,V s ,q w ,V w ˆ m s , ˆ q s , ˆ V s , ˆ q w , ˆ V w  − γ m s ˆ m s + γ 2 V s ˆ V s + γ 2 ( q s ˆ V s − ˆ q s V s ) + 1 2 V w ˆ V w + 1 2 ( q w ˆ V w − ˆ q w V w ) + ψ w ( ˆ m s , ˆ q s , ˆ V s , ˆ q w , ˆ V w ) + γ αψ out ⋆ ( m, q , V )  . (86) Eq. ( 86 ) coincides exactly with the replica-symmetric free energy density f β , which was deriv ed in [ 33 ]. Notice that Eq. ( 86 ) is v alid for an y choice of output distribution P out ⋆ . In contrast, Eq. ( 78 ) holds only when the output distribution has the sp ecific form introduced in Eq. ( 43 ). 30 A.4 Random features state equations and asymptotic p erformance By extremizing the replica-symmetric p oten tial in Eq. 
( 86 ) , we derive a set of self-consisten t saddle-p oin t equations that can b e solv ed via fixed-p oint iteration for any finite β . Sending β to infinit y , the fixed p oin t of these equations yields the ov erlap order parameters that characterize the global minimizer of the empirical risk in the high-dimensional limit. T o p erform the limit β → + ∞ , one introduces the follo wing rescaling of the ov erlaps and their conjugates: m s → m s , q s/w → q s/w , V s/w → β − 1 V s/w , ˆ m s → β ˆ m s , ˆ q s/w → β 2 ˆ q s/w , ˆ V s/w → β ˆ V s/w . (87) The explicit pro cedure is detailed in [ 33 ]. Here, we directly present the resulting state equations for the ov erlaps and their conjugate v ariables in the random-features mo del with Gaussian pro jections, which read as follows:                                                                                                        m s = ˆ m s ˆ V s z + 1 + γ − ∆ 2 γ q s = ˆ m 2 s + ˆ q s ˆ V 2 s (2 z + γ + 1)∆ − 2 z 2 − 3( γ + 1) z − ( γ − 1) 2 2 γ ∆ − ˆ q w ( λ + ˆ V w ) ˆ V s z ∆ − z 2 − ( γ + 1) z 2 γ ∆ V s = 1 ˆ V s z + 1 + γ − ∆ 2 γ q w = ˆ q w ( λ + ˆ V w ) 2 (1 − γ )∆ + ( γ + 1) z + ( γ − 1) 2 2∆ − ˆ m 2 s + ˆ q s ( λ + ˆ V w ) ˆ V s z ∆ − z 2 − ( γ + 1) z 2∆ V w = 1 λ + ˆ V w 1 − γ − z + ∆ 2 ˆ m s = µ 1 α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) f out ⋆ ( y , √ η ξ , 1 − η ) f out ( y , √ q ξ , V )  ˆ q s = µ 2 1 α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) f out ( y , √ q ξ , V ) 2  ˆ V s = − µ 2 1 α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) ∂ ω f out ( y , √ q ξ , V )  ˆ q w = γ µ 2 ⋆ α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) f out ( y , √ q ξ , V ) 2  ˆ V w = − γ µ 2 ⋆ α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) ∂ ω f out ( y , √ q ξ , V )  (88) In Eqs. ( 88 ) , z = λ + ˆ V w ˆ V s , ∆ = p (1 + γ + z ) 2 − 4 γ and η = m 2 q . F urthermore, the functions f out ⋆ and f out are defined as f out ⋆ ( y , ω , τ ) = ∂ ω log Z out ⋆ ( y , ω , τ ) , (89) f out ( y , ω , τ ) = P τ ℓ ( y , · ) ( ω ) − ω τ , (90) where P τ ℓ ( y , · ) is the proximal op erator P τ ℓ ( y , · ) ( ω ) = arg min z ∈ R " ℓ ( y , z ) + ( z − ω ) 2 2 τ # . (91) 31 A.4.1 The memorization error As explained in Section A.2 , we can access th e memorization error of the asymptotic empirical risk minimizer through f β ,h . Within this subsection, w e implicitly assume that the ov erlaps are those that extremize the right hand side of Eq. ( 78 ) . Consequen tly , they should b e regarded as implicit functions of h and β . In particular, when h = 0 and β → + ∞ , they coincide with the solutions of the state equations ( 88 ) . F or notational simplicity , w e omit the explicit dep endence on h and β . The partial deriv ative of f β ,h with resp ect to h , ev aluated at h = 0 , is given by ∂ f β ,h ∂ h      h =0 = γ α E ξ ∼N (0 , 1)     Z R dy P mem ( y ) R R dz e − β h ℓ ( y ,z )+ ( z − √ qξ ) 2 2 β V i θ ( − y z ) R R dz e − β h ℓ ( y ,z )+ ( z − √ qξ ) 2 2 β V i     . (92) Applying the same rescaling as in ( 87 ), w e can write lim β → + ∞ ∂ f β ,h ∂ h      h =0 = γ α E ξ ∼N (0 , 1)  Z R dy P mem ( y ) θ ( − y P V ℓ ( y , · ) ( √ q ξ ))  . (93) Therefore, from Eq. ( 55 ), the asymptotic memorization error reads E mem = E ξ ∼N (0 , 1)  Z R dy P mem ( y ) θ ( − y P V ℓ ( y , · ) ( √ q ξ ))  . (94) Inserting P mem , w e obtain the Eq. 
( 22 ) presen ted in the main: E mem = 1 2 E ξ ∼N (0 , 1) h θ  P V ℓ ( − 1 , · ) ( √ q ξ )  + θ  −P V ℓ (1 , · ) ( √ q ξ ) i . (95) A.4.2 The generalization error The asymptotic expression for the generalization error of a random features student learning from a teacher p erceptron was deriv ed in [ 33 ]. It takes the form E gen = 1 π arccos m √ q ! , (96) where w e recall that m = µ 1 m s and q = µ 2 1 q s + µ 2 ⋆ q w , with m s , q s and q w giv en by the solution of Eqs. ( 88 ). A.5 Random features in the highly-o v erparametrized regime: the k ernel limit In the limit κ → ∞ , or equiv alently γ → 0 , with α and λ held finite, the fixed p oint of Eqs. ( 88 ) yields the o verlap order parameters that characterize the asymptotic p erformance of dot-pro duct k ernels. In this section, we first sh o w how to tak e the limit, and then justify why the resulting equations describ e dot-pro duct kernels. A.5.1 Kernel regression state equations Assume that, as γ → 0 with α and λ finite, the conjugate v ariables scale as ˆ m s = Θ(1) , ˆ q s = Θ(1) , ˆ V s = Θ(1) , ˆ q w = o (1) , ˆ V w = o (1) . (97) 32 Under this ansatz, we expand the right hand sides of the ov erlap order parameters equations in ( 88 ) . A k ey subtlety is to av oid ill-defined ratios in the γ → 0 limit. T o this end, write z = λ ˆ V s + γ µ 2 ⋆ µ 2 1 and expand ∆ as ∆ ≃ γ → 0 1 + λ ˆ V s + γ  1 + λ ˆ V s   1 + µ 2 ⋆ µ 2 1  − 2 1 + λ ˆ V s . (98) Substituting this expansion and simplifying the resulting expressions, the equations for the o verlaps reduce to the remarkably simple form m s = ˆ m s λ + ˆ V s , q s = ˆ m 2 s + ˆ q s ( λ + ˆ V s ) 2 , V s = 1 λ + ˆ V s , q w = 0 , V w = 1 λ . (99) It remains to v erify that the conjugate equations are consistent with the scaling assumption ( 97 ) . Pro vided that the integrals app earing on their right-hand sides are sufficiently regular functions of ( η , q , V ) , this consis tency condition is satisfied. As a consequence, the v ariables q w , V w , ˆ q w and ˆ V w decouple, and the remaining equations reduce to                                                        m s = ˆ m s λ + ˆ V s q s = ˆ m 2 s + ˆ q s ( λ + ˆ V s ) 2 V s = 1 λ + ˆ V s ˆ m s = µ 1 α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) f out ⋆ ( y , √ η ξ , 1 − η ) f out ( y , √ q ξ , V )  ˆ q s = µ 2 1 α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) f out ( y , √ q ξ , V ) 2  ˆ V s = − µ 2 1 α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) ∂ ω f out ( y , √ q ξ , V )  (100) Here m = µ 1 m s , q = µ 2 1 q s , V = µ 2 1 V s + µ 2 ⋆ λ . In this regime, it is more conv enien t to express the system of state equations directly in terms of the rescaled order parameters m, q , V rather than the bare ones m s , q s , V s . 
Rewriting the first three equations of ( 100 ) accordingly yields the equiv alent system                                                          m = µ 1 ˆ m s λ + ˆ V q = µ 2 1 ˆ m 2 s + ˆ q s ( λ + ˆ V s ) 2 V = µ 2 1 1 λ + ˆ V s + µ 2 ⋆ λ ˆ m s = µ 1 α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) f out ⋆ ( y , √ η ξ , 1 − η ) f out ( y , √ q ξ , V )  ˆ q s = µ 2 1 α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) f out ( y , √ q ξ , V ) 2  ˆ V s = − µ 2 1 α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) ∂ ω f out ( y , √ q ξ , V )  (101) 33 A t this stage, the subscript s b ecomes redundant and can b e dropp ed for notational simplicity . This leads to the final form of the random features state equations in the highly-o v erparametrized limit:                                                          m = µ 1 ˆ m λ + ˆ V q = µ 2 1 ˆ m 2 + ˆ q ( λ + ˆ V ) 2 V = µ 2 1 1 λ + ˆ V + µ 2 ⋆ λ ˆ m = µ 1 α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) f out ⋆ ( y , √ η ξ , 1 − η ) f out ( y , √ q ξ , V )  ˆ q = µ 2 1 α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) f out ( y , √ q ξ , V ) 2  ˆ V = − µ 2 1 α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) ∂ ω f out ( y , √ q ξ , V )  (102) A.5.2 The κ → ∞ limit and the link b et w een µ 1 , µ ⋆ and dot-product k ernels W e now motiv ate wh y in the limit κ = p/d → ∞ the high-dimensional random-features state equations recov er dot-pro duct kernel regression. F or a comprehensiv e review on kernel metho ds, see [ 43 ]. W e start by the random-features predictor f w ( x ) = 1 √ p p X a =1 w a σ  1 √ d F ⊤ a x  , (103) where F a ∈ R d denotes the a -th column of F . F or fixed training data { ( x µ , y µ ) } n µ =1 and fixed dimension d , this is a linear mo del in the p random features ϕ a ( x ) = σ  1 √ d F ⊤ a x  . (104) Its empirical-risk minimizer hence dep ends on the data only through the Gram matrix of the random features, namely 1 p p X a =1 ϕ a ( x µ ) ϕ a ( x ν ) = 1 p p X a =1 σ  1 √ d F ⊤ a x µ  σ  1 √ d F ⊤ a x ν  . (105) Since the columns of F are i.i.d., the law of large num b ers implies that, for fixed n and d , 1 p p X a =1 ϕ a ( x µ ) ϕ a ( x ν ) − → p →∞ K ( x µ , x ν ) = E v " σ v ⊤ x µ √ d ! σ v ⊤ x ν √ d !# . (106) Therefore, for finite n and d , the infinite-width random-features mo del is exactly equiv alent to k ernel regression with kernel K . Our analysis is conducted in the high-dimensional regime n, d, p → ∞ , α = n d = Θ(1) , κ = p d = Θ(1) . (107) Within this joint limit, sending κ → ∞ means taking p m uch larger than d while keeping α (and λ ) fixed. Intuitiv ely , for eac h large d one first reaches the infinite-width random-features limit 34 p → ∞ , whic h yields kernel regression at that finite dimension, and only afterwards takes the high-dimensional limit n, d → ∞ with α = n/d fixed. Since κ = p/d , th e sequential transition from random features to kernel regression is enco ded in the limit κ → ∞ . Next, we explain how the parameters µ 1 , µ ⋆ , which app ear in Eqs. ( 102 ) , are related to a dot-pro duct kernel. In the random-features construction with activ ation function σ and Gaussian first-la yer w eights, the infinite-width limit induces the k ernel K ( x µ , x ν ) = E ( u,v ) ∼N (0 , Σ ρ )  σ ( u ) σ ( v )  , (108) where Σ ρ = ∥ x µ ∥ 2 d ρ ρ ∥ x ν ∥ 2 d ! , ρ = ( x µ ) ⊤ x ν ∥ x µ ∥ ∥ x ν ∥ . 
(109) F or inputs dra wn uniformly from the sphere of radius √ d , the k ernel in Eq. ( 108 ) is a dot-product k ernel K ( ρ ) . Asymptotically , in the d → ∞ limit, for Gaussian inputs x µ , x ν ∼ N ( 0 , I d ) this still holds, ha ving ∥ x µ ∥ 2 /d = ∥ x ν ∥ 2 /d → 1 , and v anishing normalized inner pro duct ρ . Notice that the subsequent limit n → ∞ with α = n/d finite do es not alter this conclusion. Let us now establish the connection b et ween this k ernel and the parameters µ 1 , µ ⋆ . Assuming that σ ∈ L 2 ( N (0 , 1)) , we can expand the activ ation in the basis of the Hermite p olynomials: σ ( g ) = X ℓ ≥ 0 c ℓ H ℓ ( g ) , c ℓ = 1 ℓ ! E [ σ ( g ) H ℓ ( g )] . (110) Here { H ℓ } ℓ ≥ 0 denote the probabilists’ Hermite p olynomials, normalized suc h that E [ H ℓ ( g ) H m ( g )] = ℓ ! δ ℓm . (111) If ( g , g ′ ) is a centered jointly Gaussian pair with correlation E [ g g ′ ] = ρ , then the following iden tity holds: E  H ℓ ( g ) H m ( g ′ )  = ℓ ! ρ ℓ δ ℓm . (112) Using ( 110 ) together with ( 112 ), the dot-pro duct k ernel in Eq. ( 108 ) can b e expanded as K ( ρ ) = X ℓ ≥ 0 a ℓ ρ ℓ , a ℓ = ℓ ! c 2 ℓ = 1 ℓ ! ( E [ σ ( g ) H ℓ ( g )]) 2 ≥ 0 . (113) In particular, we hav e a 0 = ( E [ σ ( g )]) 2 , a 1 = ( E [ g σ ( g )]) 2 , X ℓ ≥ 0 a ℓ = E [ σ ( g ) 2 ] . (114) Recall no w the definitions of µ 0 , µ 1 and µ ⋆ in tro duced in the random-features analysis: µ 0 = E [ σ ( g )] , µ 1 = E [ g σ ( g )] , µ 2 ⋆ = E [ σ ( g ) 2 ] − µ 2 0 − µ 2 1 . (115) Comparing ( 114 ) and ( 115 ), we obtain µ 2 0 = a 0 = K (0) , µ 2 1 = a 1 = K ′ (0) , µ 2 ⋆ = X ℓ ≥ 2 a ℓ = K (1) − K (0) − K ′ (0) . (116) This pro ves the iden tities ( 30 ) rep orted in the main. F or more details we defer to [ 44 ]. Therefore, in the highly-o verparametrized limit, the asymptotic p erformance dep ends on the k ernel only through its linear co efficien t a 1 and the aggregate higher-order terms P ℓ ≥ 2 a ℓ . In conclusion, any Gaussian random-features mo del with activ ation σ ∈ L 2 ( N (0 , 1)) induces a dot-pro duct k ernel of the form ( 113 ) . Then, the parameters µ 1 , µ ⋆ en tering the asymptotic equations are obtained directly from the co efficien ts of its expansion. 35 A.5.3 The linear mo del as a special k ernel If w e set µ 1 = 1 and µ ⋆ = 0 , Eqs. ( 102 ) b ecome exactly the state equations describing the asymptotic p erformance of a linear mo del trained by minimizing an empirical risk with conv ex loss and ridge regularization, which were deriv ed in [ 36 ]. Indeed, the combination µ 1 = 1 and µ ⋆ = 0 corresp onds to the linear k ernel K ( ρ ) = a 0 + ρ , thus to standard linear regression. A.6 Symmetry prop erties of k ernel regression solutions The fixed p oin t of Eqs. ( 102 ) exhibits notable symmetries when the k ernel parameters µ 1 and µ ⋆ are appropriately rescaled. In this section, we first derive a reduced system of state equations from Eqs. ( 102 ) , and then emphasize the key symmetries satisfied by its solutions. A.6.1 Reduced kernel regression state equations In tro ducing ℓ = λ µ 2 1 , t = µ 1 µ ⋆ . (117) and defining the rescaled v ariables ˜ m = ˆ m µ 1 , ˜ q = ˆ q µ 2 1 , ˜ V = ˆ V µ 2 1 . (118) the system of Eqs. 
( 102 ) can b e recast as                                                        m = ˜ m ℓ + ˜ V q = ˜ m 2 + ˜ q ( ℓ + ˜ V ) 2 V = 1 ℓ + ˜ V + 1 t 2 ℓ ˜ m = α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) f out ⋆ ( y , √ η ξ , 1 − η ) f out ( y , √ q ξ , V )  ˜ q = α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) f out ( y , √ q ξ , V ) 2  ˜ V = − α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) ∂ ω f out ( y , √ q ξ , V )  (119) Eqs. ( 119 ) form a closed reduced system for ( m, q , V , ˜ m, ˜ q , ˜ V ) that dep ends on the kernel only through the tw o scalars ( ℓ, t ) and on the data only through ( α, ε ) . In particular, there exists a reduced solution map, namely m = m ( α, ε ; ℓ, t ) , q = q ( α, ε ; ℓ, t ) , V = V ( α , ε ; ℓ, t ) . (120) A.6.2 Scaling inv ariance and angular symmetry in the plane ( µ ⋆ , µ 1 ) The reduced equations ( 119 ) are exactly inv ariant under the rescaling ( µ ⋆ , µ 1 , λ ) 7→ ( r µ ⋆ , r µ 1 , r 2 λ ) , r > 0 . (121) Indeed, since ℓ and t are unc hanged under this transformation, it follo ws that the order parameters ( 120 ) are also unchanged. 36 This in v ariance has a simple geometric interpretation. As long as the regularization is scaled as λ ∝ µ 2 1 , solutions to Eqs. ( 119 ) organize along radial directions in the ( µ ⋆ , µ 1 ) plane. In other w ords, fixing the angle γ = arctan( t ) = arctan  µ 1 µ ⋆  (122) and moving along the corresp onding straight line passing through the origin lea ves the solution to Eqs. ( 119 ) inv ariant. If instead λ is held fixed, the inv ariance under radial rescaling ( µ ⋆ , µ 1 ) 7→ ( r µ ⋆ , r µ 1 ) is brok en in general, as ℓ no longer remains unchanged. How ever, there are tw o relev an t exceptions: (i) V anishing regularization: in the limit λ → 0 + , the dep endence on ℓ disapp ears and the radial in v ariance is trivially restored. (ii) Optimal regularization: inv ariance is recov ered when λ is chosen to optimize any function of the order parameters ( 120 ). Observ e that minimizing a function of the order parameters with resp ect to λ is equiv alent to minimizing it with resp ect to ℓ . The optimal v alue satisfies ℓ opt = ℓ opt ( α, ε, t ) and dep ends on µ 1 , µ ⋆ only through their ratio t . As a consequence, the order parameters ev aluated at ℓ opt dep end on µ 1 , µ ⋆ only via t and are thus preserved under radial rescaling. B Ba y es-optimal generalization error, kernel ridge regression and supp ort v ector mac hine on the RAF task mo del B.1 Ba y es-optimal generalization error and state equations F ollowing [ 36 ], in the high-dimensional limit n, d → ∞ with α = n/d and ε fixed, the Bay es- optimal generalization error can b e expressed as E BO gen ( α, ε ) = 1 π arccos  √ q b  . (123) The order parameter q b represen ts the o verlap b et w een the Ba yes-optimal estimator and the teac her w ⋆ . In the case of Gaussian prior P w ⋆ = N ( 0 , I d ) , the o verlap q b , together with its conjugate v ariable ˆ q b , is determined by the system of coupled scalar equations q b = ˆ q b ˆ q b + 1 , (124) ˆ q b = α E y ,ξ h Z out ⋆ ( y , q 1 / 2 b ξ , 1 − q b ) f out ⋆ ( y , q 1 / 2 b ξ , 1 − q b ) 2 i (125) where the functions Z out ⋆ and f out ⋆ are defined in Eq. ( 85 ) and in Eq. ( 89 ), resp ectively . In the RAF data mo del we can compute the right hand side of Eq. ( 125 ) explicitly . Plugging the output distribution ( 43 ) in Eq. 
( 84 ), w e find that Z out ⋆ ( y , ω , τ ) = 1 2 δ ( y + 1) + 1 2 δ ( y − 1) + 1 − ε 2 [ δ ( y − 1) − δ ( y + 1)] erf  ω √ 2 τ  . (126) Then, from Eq. ( 89 ) it follows f out ⋆ ( y , ω , τ ) = 1 Z out ⋆ ( y , ω , τ ) (1 − ε ) [ δ ( y − 1) − δ ( y + 1)] e − ω 2 2 τ √ 2 π τ . (127) Finally , noticing that Z out ⋆ ( y , ω , τ ) f out ⋆ ( y , ω , τ ) 2 = 1 Z out ⋆ ( y , ω , τ )  ∂ Z out ⋆ ( y , ω , τ ) ∂ ω  2 (128) 37 0 2 4 6 8 10 α 0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 ε 0.04 0.10 0.20 0.30 0.40 0.50 E BO gen Figure 7: W e sho w the heatmap of the Bay es-optimal test error as a function of the sample complexit y α and the fraction of random lab els in the training set ε . and p erforming the integration ov er the lab els y , w e obtain Eq. ( 18 ) of the main: ˆ q b = 4 α (1 − ε ) 2 (2 π ) 3 / 2 (1 − q b ) Z R exp  − 1 + q b 2(1 − q b ) ξ 2  dξ 1 + (1 − ε ) erf  √ q b ξ √ 2(1 − q b )  . (129) In Fig. 7 we show the heatmap of the Bay es-optimal generalization error across the ( α, ε ) plane for the RAF data mo del. At fixed α , the error increases with the fraction of random lab els ε and approaches the random-guess v alue 1 / 2 as ε → 1 . B.2 Loss-dep enden t kernel state equations for the RAF data mo del The k ernel equations for ˆ m, ˆ q and ˆ V in ( 102 ), which we recall here, read ˆ m = µ 1 α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) f out ⋆ ( y , √ η ξ , 1 − η ) f out ( y , √ q ξ , V )  , (130) ˆ q = µ 2 1 α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) f out ( y , √ q ξ , V ) 2  , (131) ˆ V = − µ 2 1 α E ξ ∼N (0 , 1)  Z R dy Z out ⋆ ( y , √ η ξ , 1 − η ) ∂ ω f out ( y , √ q ξ , V )  . (132) The righ t hand sides dep end on b oth the output distribution and the chosen loss function. Substituting the expressions of Z out ⋆ and f out ⋆ deriv ed in Eq. ( 126 ) and Eq. ( 127 ) , resp ectiv ely , 38 w e obtain ˆ m = µ 1 α (1 − ε ) Z R dξ √ 2 π e − ξ 2 2(1 − η ) p 2 π (1 − η ) [ f out (1 , √ q ξ , V ) − f out ( − 1 , √ q ξ , V )] , (133) ˆ q = µ 2 1 α 2 Z R dξ √ 2 π e − ξ 2 / 2 h f out ( − 1 , √ q ξ , V ) 2 + f out (1 , √ q ξ , V ) 2 i + µ 2 1 α (1 − ε ) 2 Z R dξ √ 2 π e − ξ 2 / 2 erf √ η ξ p 2(1 − η ) ! h f out (1 , √ q ξ , V ) 2 − f out ( − 1 , √ q ξ , V ) 2 i , (134) ˆ V = µ 2 1 α 2 Z R dξ √ 2 π e − ξ 2 / 2 [ ∂ ω f out ( − 1 , √ q ξ , V ) + ∂ ω f out (1 , √ q ξ , V )] + µ 2 1 α (1 − ε ) 2 Z R dξ √ 2 π e − ξ 2 / 2 erf √ η ξ p 2(1 − η ) ! [ ∂ ω f out (1 , √ q ξ , V ) − ∂ ω f out ( − 1 , √ q ξ , V )] . (135) Replacing η = m 2 /q , these equations are precisely the loss-dep enden t kernel state equations for the RAF data mo del ( 25 ) presen ted in the main. In the following tw o sections, w e provide their explicit form for the square loss and the hinge loss. B.3 Kernel ridge regression B.3.1 State equations F or the square loss ℓ square ( y , z ) = ( y − z ) 2 2 , the proximal op erator tak es a particularly simple form: P τ ℓ square ( y , · ) ( ω ) = ω + τ y 1 + τ . (136) As a consequence, b oth f square out and its deriv ative ∂ ω f square out simplify to f square out ( y , ω , τ ) = y − ω 1 + τ , ∂ ω f square out ( y , ω , τ ) = − 1 1 + τ . (137) Substituting these expressions in to Eqs. 
( 133 ) and ev aluating the resulting Gaussian integrals, w e obtain the full set of state equations describing kernel ridge regression on the RAF data mo del in the high-dimensional limit: m = µ 1 ˆ m λ + ˆ V , q = µ 2 1 ˆ m 2 + ˆ q ( λ + ˆ V ) 2 , V = µ 2 1 λ + ˆ V + µ 2 ⋆ λ , (138) ˆ m = µ 1 α (1 − ε ) q 2 π 1 + V , ˆ q = µ 2 1 α 1 + q − 2 q 2 π (1 − ε ) m (1 + V ) 2 , ˆ V = µ 2 1 α 1 + V . (139) The second line of equations coincide with Eqs. ( 35 ) in the main. Emplo ying the parameters ℓ = λ/µ 2 1 and t = µ 1 /µ ⋆ , this system can b e rewritten in the reduced form m = ˜ m ℓ + ˜ V , q = ˜ m 2 + ˜ q ( ℓ + ˜ V ) 2 , V = 1 ℓ + ˜ V + 1 t 2 ℓ , (140) ˜ m = α (1 − ε ) q 2 π 1 + V , ˜ q = α 1 + q − 2 q 2 π (1 − ε ) m (1 + V ) 2 , ˜ V = α 1 + V . (141) 39 In the case of the square loss, the k ernel state equations are sufficien tly simple to admit a fully analytical solution. Com bining the equations for V and ˜ V yields V = 1 + 1 t 2 − α − ℓ + r  α − 1 − ℓ − 1 t 2  2 + 4 α  ℓ + 1 t 2  2 ℓ . (142) All remaining v ariables can then b e expressed in terms of V . In particular, the ov erlaps m and q are given by m = α (1 − ε ) q 2 π α + ℓ (1 + V ) , (143) q = α h 1 + 2 α (1 − ε ) 2 π α + ℓ (1+ V ) − 2 α + ℓ (1+ V ) i [ α + ℓ (1 + V )] 2 − α = h 1 + 2 α (1 − ε ) 2 π − 2 q 2 π (1 − ε ) m i m 2 2 α (1 − ε ) 2 π − m 2 . (144) B.3.2 Memorization error The memorization error, whose expression we recall b elo w, E mem = 1 2 E ξ ∼N (0 , 1) h θ  P V ℓ ( − 1 , · ) ( √ q ξ )  + θ  −P V ℓ (1 , · ) ( √ q ξ ) i , (145) simplifies considerably in the case of the square loss. Noting that 1 + V > 0 , one has P V ℓ square ( − 1 , · ) ( √ q ξ ) > 0 ⇐ ⇒ ξ > V / √ q , P V ℓ square (1 , · ) ( √ q ξ ) < 0 ⇐ ⇒ ξ < − V / √ q . (146) The remaining computation then amounts to ev aluating a Gaussian tail integral, leading to E square mem = Z ∞ V √ q dξ √ 2 π e − ξ 2 / 2 = 1 2 erfc  V √ 2 q  . (147) B.3.3 Optimal regularization The optimal v alue of the regularization strength λ opt , which minimizes the generalization error ( 96 ) at fixed α, ε and µ 1 , µ ⋆ , can also b e determined analytically in the case of kernel ridge regression. Since m and q dep end on λ only through ℓ , minimizing the generalization error with resp ect to λ is equiv alent to solving ∂ ℓ m √ q ! = 0 . (148) After a straightforw ard but tedious calculation, one obtains ℓ opt = π 2(1 − ε ) 2 − 1 − 1 t 2 . (149) Using ( 149 ), together with the constrain t that λ must b e non-negative, we finally obtain λ opt = max  µ 2 1  π 2(1 − ε ) 2 − 1  − µ 2 ⋆ , 0  . (150) Therefore, in the ( µ ⋆ , µ 1 ) plane, there exists a critical angle γ opt ( ε ) = arctan "  π 2(1 − ε ) 2 − 1  − 1 / 2 # (151) 40 suc h that b elo w this threshold the optimal regularization v anishes, while ab o ve it the optimal regularization is strictly p ositiv e and given by λ opt = µ 2 1 ℓ opt . In App endix C.1.2 w e show that this angle is exactly γ opt mem ( ε ) in Eq. ( 39 ) of the main. In terestingly , when ℓ = ℓ opt , one finds V ( α, ε, ℓ opt , t ) = 2  1 + 1 t 2  − α − π 2(1 − ε ) 2 + r  α + π 2(1 − ε ) 2  2 − 4 α 2  π 2(1 − ε ) 2 − 1 − 1 t 2  , (152) whereas m ( α, ε, ℓ opt , t ) = α (1 − ε ) q 2 π α + π 2(1 − ε ) 2 + r  α + π 2(1 − ε ) 2  2 − 4 α . (153) This shows that V retains an explicit dep endence on t , while m , and hence q via Eq. ( 144 ) , are indep enden t of t . As a result, the generalization error ev aluated at λ opt is constan t ab o ve the angle ( 151 ) in the plane ( µ ⋆ , µ 1 ) . 
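As a quick numerical sanity check of this t-independence, the following sketch evaluates the closed-form solution of Eqs. (142)-(144) at the optimal regularization of Eq. (149), for several values of t above the critical angle (151), and verifies that the generalization error of Eq. (96) does not change. The values of α and ε are illustrative.

# Sketch: above the critical angle (151), E_gen at ell_opt should not depend
# on t = mu1 / mu_star. We evaluate the closed-form KRR overlaps
# (Eqs. (142)-(144)) at ell_opt from Eq. (149). alpha, eps are illustrative.
import numpy as np

def krr_overlaps(alpha, eps, ell, t):
    V = (1 + 1/t**2 - alpha - ell
         + np.sqrt((alpha - 1 - ell - 1/t**2)**2 + 4*alpha*(ell + 1/t**2))) / (2*ell)
    m = alpha * (1 - eps) * np.sqrt(2/np.pi) / (alpha + ell*(1 + V))
    c = 2 * alpha * (1 - eps)**2 / np.pi
    q = (1 + c - 2*np.sqrt(2/np.pi)*(1 - eps)*m) * m**2 / (c - m**2)
    return m, q, V

alpha, eps = 4.0, 0.2
for t in [1.0, 2.0, 5.0, 10.0]:
    ell_opt = np.pi / (2*(1 - eps)**2) - 1 - 1/t**2      # Eq. (149)
    if ell_opt <= 0:                                     # below the angle (151)
        continue
    m, q, V = krr_overlaps(alpha, eps, ell_opt, t)
    e_gen = np.arccos(m / np.sqrt(q)) / np.pi            # Eq. (96)
    print(f"t = {t:5.1f}   ell_opt = {ell_opt:.3f}   E_gen(lam_opt) = {e_gen:.6f}")

The printed generalization error is identical across the values of t, in agreement with Eqs. (152)-(153).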
By contrast, the memorization error ev aluated at λ opt v aries with t ab o ve the angle ( 151 ). When λ opt v anishes, the memorization and generalization errors ev aluated at λ opt coincide with their limiting v alues as λ → 0 + , deriv ed in App endix C.1 . B.4 Supp ort v ector mac hine B.4.1 State equations F or the hinge loss ℓ hinge ( y , z ) = max(0 , 1 − y z ) , the proximal op erator is a piecewise function: P τ ℓ hinge ( y , · ) ( ω ) =          ω + τ y if y ω < 1 − τ , y if 1 − τ < y ω < 1 , ω if y ω > 1 . (154) Therefore, f hinge out and its deriv ative ∂ ω f hinge out are also piecewise: f hinge out ( y , ω , τ ) =          y if y ω < 1 − τ , y − ω τ if 1 − τ < y ω < 1 , 0 if y ω > 1 , ∂ ω f hinge out ( y , ω , τ ) =    − 1 τ if 1 − τ < y ω < 1 , 0 otherwise . (155) Substituting these expressions into Eqs. ( 133 ) and using standard Gaussian integration results, w e obtain the full set of state equations describing supp ort v ector machine for the RAF data mo del in the high-dimensional limit: m = µ 1 ˆ m λ + ˆ V , q = µ 2 1 ˆ m 2 + ˆ q ( λ + ˆ V ) 2 , V = µ 2 1 λ + ˆ V + µ 2 ⋆ λ , (156) 41 ˆ m = µ 1 α (1 − ε ) √ 2 π " 1 + 1 V erf 1 p 2 q (1 − η ) ! − (1 − V ) erf 1 − V p 2 q (1 − η ) !! + 1 V s 2 q (1 − η ) π e − 1 2 q (1 − η ) − e − (1 − V ) 2 2 q (1 − η ) !# , ˆ q = µ 2 1 α 2  1 + erf  1 − V √ 2 q  + µ 2 1 α V 2 " 1 + q 2  erf  1 √ 2 q  − erf  1 − V √ 2 q  + r q 2 π  e − 1 2 q − (1 + V ) e − (1 − V ) 2 2 q  # + µ 2 1 α (1 − ε ) " − 1 π arctan  r η 1 − η  − Z 0 − 1 − V √ q dξ √ 2 π e − ξ 2 / 2 erf √ η ξ p 2(1 − η ) ! + 1 V 2 Z 1 √ q 1 − V √ q dξ √ 2 π e − ξ 2 / 2 erf √ η ξ p 2(1 − η ) ! (1 − √ q ξ ) 2 # , ˆ V = µ 2 1 α V " 1 2  erf  1 √ 2 q  − erf  1 − V √ 2 q  + (1 − ε ) Z 1 √ q 1 − V √ q dξ e − ξ 2 / 2 √ 2 π erf √ η ξ p 2(1 − η ) !# , (157) where, as a reminder, η = m 2 /q . W e can recast these equations in their reduced form introducing ℓ = λ/µ 2 1 and t = µ 1 /µ ⋆ . F ollowing the same rescaling pro cedures of the more general App endix A.6 , we obtain here:                                                                                          m = ˜ m ℓ + ˜ V , q = ˜ m 2 + ˜ q ( ℓ + ˜ V ) 2 , V = 1 ℓ + ˜ V + 1 t 2 ℓ ˜ m = α (1 − ε ) √ 2 π " 1 + 1 V erf 1 p 2 q (1 − η ) ! − (1 − V ) erf 1 − V p 2 q (1 − η ) !! + 1 V s 2 q (1 − η ) π e − 1 2 q (1 − η ) − e − (1 − V ) 2 2 q (1 − η ) !# , ˜ q = α 2  1 + erf  1 − V √ 2 q  + α V 2 " 1 + q 2  erf  1 √ 2 q  − erf  1 − V √ 2 q  + r q 2 π  e − 1 2 q − (1 + V ) e − (1 − V ) 2 2 q  # + α (1 − ε ) " − 1 π arctan  r η 1 − η  − Z 0 − 1 − V √ q dξ √ 2 π e − ξ 2 / 2 erf √ η ξ p 2(1 − η ) ! + 1 V 2 Z 1 √ q 1 − V √ q dξ √ 2 π e − ξ 2 / 2 erf √ η ξ p 2(1 − η ) ! (1 − √ q ξ ) 2 # , ˜ V = α V " 1 2  erf  1 √ 2 q  − erf  1 − V √ 2 q  + (1 − ε ) Z 1 √ q 1 − V √ q dξ e − ξ 2 / 2 √ 2 π erf √ η ξ p 2(1 − η ) !# . (158) As explained in App endix A.6 , the order parameters satisfying these equations dep end on µ 1 , µ ⋆ only through t = tan ( γ ) , once optimizing ov er ℓ or in the limit λ → 0 + . Contrarily to 42 KRR, Eqs. ( 158 ) cannot b e closed analytically , and this is why in the main text we resort to n umerical cross-v alidation in order individuate the optimal tangent t opt = tan ( γ hinge opt ) for which the generalization error is minimized. B.4.2 Memorization error Notably , the memorization error for the hinge loss reduces to the same form as for the square loss. 
The k ey observ ation is that P V ℓ hinge ( − 1 , · ) ( √ q ξ ) > 0 ⇐ ⇒ ξ > V / √ q , P V ℓ hinge (1 , · ) ( √ q ξ ) < 0 ⇐ ⇒ ξ < − V / √ q . (159) exactly as in the square loss case. The remaining computation then follows identically , yielding E hinge mem = 1 2 erfc  V √ 2 q  . (160) C Notable limits C.1 Ridgeless limit C.1.1 Ridgeless linear regression W e b egin by analyzing the limit λ → 0 + in the sp ecial case µ 1 = 1 and µ ⋆ = 0 , corresp onding to standard linear regression. Within this framew ork, w e can also iden tify the in terp olation threshold of the single-la yer p erceptron learner. F or the RAF mo del, w e define this threshold as the largest sample complexity for which the memorization error v anishes in the ridgeless limit: α c ( ε ) = sup n α > 0 : lim λ → 0 + E mem ( α, ε, λ ) = 0 o . (161) In general, this threshold dep ends on the c hoice of loss function and, for the single-lay er p erceptron, it is a function of ε . Square loss F or the square loss, the limiting expressions of b oth the generalization and memorization errors can b e computed analytically . F rom Eqs. ( 117 ) , we note that ℓ = λ and t → ∞ . Using Eq. ( 142 ), the order parameter V then reads V = 1 − α − λ + p ( α − 1 − λ ) 2 + 4 αλ 2 λ . (162) Expanding in the limit λ → 0 + w e obtain V ≃ λ → 0 + | α − 1 | + 1 − α 2 λ + 1 2  α + 1 | α − 1 | − 1  . (163) Dep ending if 0 < α < 1 or α > 1 w e achiev e tw o differen t set of equations, and thus t w o different b eha viours. In the regime 0 < α < 1 we hav e V ≃ λ → 0 + 1 − α λ + α 1 − α . (164) This yields the closed set of equations for the ov erlap order parameters m ≃ λ → 0 + α (1 − ε ) r 2 π , q ≃ λ → 0 + α π − 2 α (1 − ε ) 2 π (1 − α ) , V ≃ λ → 0 + 1 − α λ . (165) 43 In the regime α > 1 instead w e hav e V ≃ λ → 0 + 1 α − 1 . (166) This leads to the closed set of equations for the o verlap order parameters m ≃ λ → 0 + (1 − ε ) r 2 π , q ≃ λ → 0 + 1 + 2 π (1 − ε ) 2 ( α − 2) α − 1 , V ≃ λ → 0 + 1 α − 1 . (167) Com bining these results, we can derive the limiting expressions for the generalization and memorization errors as λ → 0 + : lim λ → 0 + E square gen ( α, ε, λ ) =        1 π arccos  (1 − ε ) r 2 α (1 − α ) π − 2 α (1 − ε ) 2  if 0 < α ≤ 1 , 1 π arccos  (1 − ε ) r 2( α − 1) π +2( α − 2)(1 − ε ) 2  if α > 1 . , (168) lim λ → 0 + E square mem ( α, ε, λ ) =      0 if 0 < α ≤ 1 , 1 2 erfc  q π 2( α − 1)( π +2(1 − ε ) 2 ( α − 2))  if α > 1 . (169) These expressions show that, for any 0 < ε ≤ 1 , ridgeless linear regression p erfectly memorizes randomly lab eled training data when α ≤ 1 . In contrast, for α > 1 , memorization p erformance deteriorates: the memorization error increases with α and approac hes 1 / 2 in the limit α → ∞ . The critical sample complexit y α c = 1 coincides with the interpolation threshold for linear regression with square loss. Notably , this v alue is indep enden t of ε . Hinge loss In the case of hinge loss, closed-form expressions for the generalization and memorization errors in the limit λ → 0 + are not a v ailable. The corresp onding state equations, deriv ed from Eqs. ( 156 )-( 157 ), m ust therefore b e analyzed indirectly . Although an explicit solution for the ov erlap order parameters cannot b e obtained, their b eha vior in the ridgeless limit can be in vestigated numerically . F or fixed α and ε , we solve the state equations using a fixed-p oin t iteration scheme at λ v ery close to 0 , and examine the resulting scaling of b oth th e order parameters and their conjugates. 
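This scan is easy to reproduce. The sketch below (our own simplified illustration, not the released code) applies the same procedure to the square-loss equations of the linear learner, where the extracted scaling of V can be checked directly against Eqs. (164)-(166); for the hinge loss one replaces the square-loss update rules by those of Eqs. (156)-(157) and proceeds identically.

import numpy as np

def solve_square_loss(alpha, eps, lam, n_iter=10_000, damping=0.5):
    # Square-loss state equations for the linear learner (mu_1 = 1, mu_star = 0),
    # i.e. Eqs. (138)-(139) with ell = lam and without the mu_star^2 / lam term.
    m, q, V = 0.1, 1.0, 1.0
    for _ in range(n_iter):
        mh = alpha * (1 - eps) * np.sqrt(2 / np.pi) / (1 + V)
        qh = alpha * (1 + q - 2 * np.sqrt(2 / np.pi) * (1 - eps) * m) / (1 + V) ** 2
        Vh = alpha / (1 + V)
        m = damping * mh / (lam + Vh) + (1 - damping) * m
        q = damping * (mh ** 2 + qh) / (lam + Vh) ** 2 + (1 - damping) * q
        V = damping / (lam + Vh) + (1 - damping) * V
    return m, q, V

# Scan decreasing values of lambda and inspect how V scales: below the interpolation
# threshold (alpha_c = 1 for the square loss) one expects V = Theta(1/lambda),
# so lambda * V approaches 1 - alpha, consistently with Eq. (164).
for lam in [1e-2, 1e-3, 1e-4, 1e-5]:
    _, _, V = solve_square_loss(alpha=0.5, eps=0.3, lam=lam)
    print(f"lambda = {lam:.0e}:  lambda * V = {lam * V:.4f}")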
This analysis reveals that, for any 0 < ε < 1 , there exists a critical v alue α c ( ε ) separating tw o distinct regimes. Below this threshold, the fixed-p oin t solutions scale as m = Θ(1) , q = Θ(1) , V = Θ( λ − 1 ) , ˆ m = Θ( λ ) , ˆ q = Θ( λ 2 ) , ˆ V = Θ( λ ) , (170) whereas ab o v e the threshold all v ariables remain finite as λ → 0 + . Therefore, when α < α c ( ε ) , the fact that V div erges while q remains finite implies that the memorization error v anishes (see Eq. ( 160 ) ). F or α > α c ( ε ) , instead, p erfect memorization is lost. The critical v alue α c ( ε ) th us coincides with the interpolation threshold. In con trast to the square-loss case, this threshold is a non-trivial function of ε . W e cannot deriv e an explicit expression, but we can show that it satisfies a self-consisten t equation. T o this end, we consider the regime α < α c ( ε ) and introduce rescaled v ariables consistent with the 44 scaling in Eq. ( 170 ). In this limit, the state equations simplify to m = ˆ m 1 + ˆ V , ˆ m = α V I ˆ m ( q , η ) , q = ˆ m 2 + ˆ q (1 + ˆ V ) 2 , ˆ q = α V 2 I ˆ q ( q , η ) , V = 1 1 + ˆ V , ˆ V = α V I ˆ V ( q , η ) , (171) where η = m 2 /q and I ˆ m ( q , η ) = 1 − ε √ 2 π   1 + erf 1 p 2 q (1 − η ) ! + s 2 q (1 − η ) π e − 1 2 q (1 − η )   , I ˆ q ( q , η ) = 1 + q 2  1 + erf  1 √ 2 q  + r q 2 π e − 1 2 q + (1 − ε ) Z 1 √ q −∞ dξ √ 2 π e − ξ 2 / 2 erf √ η ξ p 2(1 − η ) ! (1 − √ q ξ ) 2 , I ˆ V ( q , η ) = 1 2  1 + erf  1 √ 2 q  + (1 − ε ) Z 1 √ q −∞ dξ √ 2 π e − ξ 2 / 2 erf √ η ξ p 2(1 − η ) ! . (172) Rewriting these equations in a form suited to interpolation analysis yields m = V ˆ m = α I ˆ m ( q , η ) , q = V 2 ( ˆ m 2 + ˆ q ) = m 2 + α I ˆ q ( q , η ) , V = 1 − α I ˆ V ( q , η ) , (173) T o characterize the threshold, w e study the limit α ↑ α c ( ε ) . Define δ = α c ( ε ) − α → 0 + . Numerical evidence suggests the scaling V = Θ( δ ) , q = Θ( δ − 2 ) , m = Θ( δ − 1 ) , η = m 2 q = Θ(1) . (174) F rom Eq. ( 173 ), this implies the condition α I ˆ V ( q , η ) = 1 . (175) W e then analyze the large- q b eha vior of the in tegrals en tering the state equations. Expanding I ˆ m ( q , η ) for q → ∞ yields I ˆ m ( q , η ) ≃ q →∞ 1 − ε π q q (1 − η ) . (176) whic h leads to the relation r η 1 − η = α (1 − ε ) π . (177) Similarly , expanding I ˆ V ( q , η ) in the same regime, and employing Eq. ( 177 ), gives I ˆ V ( q , η ) ≃ q →∞ 1 2 − 1 − ε π arctan  α (1 − ε ) π  . (178) Substituting Eq. ( 178 ) in to the threshold condition ( 175 ) , and ev aluating at α = α c ( ε ) , w e obtain the desired self-consistent equation: 1 = α c ( ε )  1 2 − 1 − ε π arctan  α c ( ε )(1 − ε ) π  . (179) 45 This is precisely Eq. ( 37 ) of the main. F rom Eq. ( 179 ) , for ε = 1 , corresp onding to a training dataset consisting of random lab els only , w e obtain α c (1) = 2 , as exp ected from [ 45 ], while the numerical solution of Eq. ( 179 ) as ε approac hes 0 indicates that α c ( ε ) diverges in this limit. Hence, as ε → 0 , assuming that in Eq. ( 179 ) α c ( ε ) div erges, we hav e that y = α c ( ε )(1 − ε ) π → ∞ (180) and w e can use the large- y expansion for arctan ( y ) : arctan( y ) = π 2 − 1 y + 1 3 y 3 + o ( y − 3 ) . (181) Substituting this expansion into Eq. ( 179 ), we get 1 = α c ( ε )  1 2 − 1 − ε π  π 2 − 1 y + 1 3 y 3 + o ( y − 3 )  . (182) W riting y explicitly , this b ecomes 1 ≃ ε → 0 α c ( ε ) ε 2 + 1 − π 2 3(1 − ε ) 2 α c ( ε ) 2 . (183) Cancelling the constant term on b oth sides, w e obtain the scaling α c ( ε ) ≃ ε → 0 2 π 2 3 ε ! 1 / 3 . 
(184) Therefore, the interpolation threshold diverges as ε − 1 / 3 when the fraction of random lab els tends to zero. Fig. 8 , b ottom panels, confirms these predictions numerically . In particular, it illustrates b oth the existence of the threshold α c ( ε ) separating p erfect and imp erfect memorization in the max-margin limit, and the div ergence of α c ( ε ) as ε → 0 + . F or completeness in the same figure w e show in the top panels the generalization and memorization errors in the ridgeless limit ac hieved b y the p erceptron when trained with the square loss, highlighting the interpolation threshold at α c = 1 previously found. C.1.2 Ridgeless kernel regression Let us now analyze the limit λ → 0 + for generic µ 1 , µ ⋆ . Square loss As for linear regression, in the square-loss setting the limiting expressions of b oth the generalization and memorization errors can b e computed analytically . Expanding the righ t-hand side of Eq. ( 142 ) in the limit ℓ → 0 + (noting that λ en ters only through ℓ ), the order parameter V b eha ves as V ≃ ℓ → 0 + 1 + 1 t 2 − α + r  α − 1 − 1 t 2  2 + 4 α t 2 2 ℓ = V 0 ℓ . (185) The corresp onding limiting expressions for m and q then follow from Eqs. ( 143 ) and ( 144 ): m ≃ ℓ → 0 + α (1 − ε ) q 2 π α + V 0 , (186) q ≃ ℓ → 0 + α h 1 + 2 α (1 − ε ) 2 π α + V 0 − 2 α + V 0 i ( α + V 0 ) 2 − α . (187) 46 0 2 4 6 8 10 α 0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 0 ε E square gen , p erceptron, λ → 0 + 0 2 4 6 8 10 α 0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 0 ε E square mem , p erceptron, λ → 0 + 0 2 4 6 8 10 α 0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 0 ε E hinge gen , p erceptron, λ → 0 + 0 2 4 6 8 10 α 0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 0 ε E hinge mem , p erceptron, λ → 0 + 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 Figure 8: T op panels: heatmaps of the generalization error E square gen (top left) and memorization error E square mem (top right) for the p erceptron trained with square loss in the ridgeless limit λ → 0 + , as functions of the fraction of facts ε and the sample complexit y α . The v ertical red line indicates the interpolation threshold α c ( ε ) = 1 . Bottom panels: heatmaps of the generalization error E hinge gen (b ottom left) and memorization error E hinge mem (b ottom right) for the p erceptron trained with hinge loss in the ridgeless limit λ → 0 + , as functions of the fraction of facts ε and the sample complexit y α . The red curve is the interpolation threshold α c ( ε ) defined in Eq. ( 179 ) . Below this curve, the memorization error is exactly zero, E hinge mem = 0 , while ab o ve it the p erceptron can no longer p erfectly memorize the random lab els and E hinge mem > 0 . The threshold decreases monotonically with ε , interpolating b et ween the known v alue α c (1) = 2 for fully random lab els and a diverging threshold as ε → 0 + . F or b oth losses, w e observe an interpolation cusp in the generalization error around the corresp onding in terp olation threshold. 47 Since q remains finite in the limit λ → 0 + while V div erges, the memorization error v anishes for all v alues of α (see Eq. ( 147 )). As for the generalization error, its limiting expression is considerably more inv olved and do es not admit a meaningful simplification, so we refrain from rep orting it explicitly . 
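It can nonetheless be evaluated numerically in a few lines directly from Eqs. (185)-(187); the following sketch (our own illustration, assuming NumPy, with parameter values chosen arbitrarily) does so and can be used to locate the minimizer over t discussed next.

import numpy as np

def ridgeless_krr_gen_error(alpha, eps, t):
    # Ridgeless KRR with square loss: V ~ V0 / ell with V0 from Eq. (185),
    # then m and q from Eqs. (186)-(187); E_gen = arccos(m / sqrt(q)) / pi.
    inv_t2 = 1.0 / t ** 2
    V0 = 0.5 * (1 + inv_t2 - alpha
                + np.sqrt((alpha - 1 - inv_t2) ** 2 + 4 * alpha * inv_t2))
    s = alpha + V0
    m = alpha * (1 - eps) * np.sqrt(2 / np.pi) / s
    q = alpha * (1 + (2 * alpha * (1 - eps) ** 2 / np.pi) * (s - 2) / s) / (s ** 2 - alpha)
    return np.arccos(m / np.sqrt(q)) / np.pi

# Scanning t at fixed (alpha, eps) locates the minimum near t = tan(gamma_opt_mem(eps)),
# i.e. approximately 0.829 for eps = 0.2, in agreement with Eq. (188).
ts = np.linspace(0.1, 5.0, 2000)
errs = [ridgeless_krr_gen_error(alpha=2.0, eps=0.2, t=t) for t in ts]
print(ts[int(np.argmin(errs))])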
Nevertheless, after a lengthy but straightforw ard calculation, one finds that it is minimized at t =  π 2(1 − ε ) 2 − 1  − 1 / 2 = tan( γ opt mem ( ε )) , (188) th us recov ering the optimal angle given in Eq. ( 39 ) of the main text. Hinge loss In the case of hinge loss, we p erformed a numerical inv estigation, similarly as the one that w e explained for the linear learner using the hinge loss. W e find that there is no critical threshold for the sample complexity , and that the scalings ( 170 ) hold in this case for an y finite v alue of α . In tro ducing appropriately rescaled v ariables, Eqs ( 156 )-( 157 ) simplify in to m = µ 1 ˆ m 1 + ˆ V , ˆ m = µ 1 α V I ˆ m ( q , η ) , q = µ 2 1 ˆ m 2 + ˆ q (1 + ˆ V ) 2 , ˆ q = µ 2 1 α V 2 I ˆ q ( q , η ) , V = µ 2 1 1 + ˆ V + µ 2 ⋆ , ˆ V = µ 2 1 α V I ˆ V ( q , η ) , (189) where I ˆ m , I ˆ q and I ˆ V are formally the same as the ones in Eqs. ( 172 ). Since the scalings ( 170 ) holds for an y finite α , in this case the memorization v anishes as λ → 0 + indep enden tly of α . C.2 Infinite regularization limit in k ernel regression In the limit λ → + ∞ , at fixed α, ε and µ 1 , µ ⋆ , the data-part of the empirical risk b ecomes irrelev ant and the state equations describ e the same trivial asymptotic estimator, for an y loss. Ho wev er, in the case of a classification task, the generalization and memorization errors do not ha ve trivial expressions. W e can derive the expressions of m, q , V in the limit λ → + ∞ (equiv alently , ℓ → + ∞ ) starting from the state equations at finite λ that are v alid for kernel ridge regression. Expanding at leading order the right hand side of Eq. ( 142 ) , and consequently those of Eqs. ( 143 ) - ( 144 ) , we obtain m ≃ ℓ → + ∞ α (1 − ε ) q 2 π ℓ , q ≃ ℓ → + ∞ (1 − ε ) 2 2 α 2 π + α ℓ 2 , V ≃ ℓ → + ∞ 1 + 1 t 2 ℓ . (190) Inserting these asymptotic expressions inside Eqs. ( 147 ) and ( 96 ) for the memorization and generalization errors we obtain E mem = 1 2 erfc     1 + 1 t 2 r 2  α (1 − ε ) q 2 π  2 + 2 α     , (191) 48 E gen = 1 π arccos   (1 − ε ) q 2 α π q 1 + (1 − ε ) 2 2 α π   . (192) The asymptotic v alues of the errors corresp onding to linear regression are recov ered by taking µ 1 = 1 and µ ⋆ = 0 . C.3 Deca y rates at large sample complexity C.3.1 Large α expansion of the Ba yes-optimal generalization error In this subsection we analyze the large sample complexity regime of the Bay es-optimal (BO) estimator at fixed ε ∈ [0 , 1) . Starting from the BO equations E BO gen = 1 π arccos  √ q b  , (193) q b = ˆ q b 1 + ˆ q b , (194) ˆ q b = 4 α (1 − ε ) 2 (2 π ) 3 / 2 (1 − q b ) Z R exp  − 1 + q b 2(1 − q b ) ξ 2  dξ 1 + (1 − ε ) erf  √ q b ξ √ 2(1 − q b )  , (195) w e note that as α → ∞ the BO estimator b ecomes asymptotically p erfect, hence q b → 1 − . W e therefore in tro duce the small parameter δ = 1 − q b , δ → α →∞ 0 + . (196) F rom Eq. ( 194 ) one immediately gets ˆ q b = 1 − δ δ ≃ α →∞ 1 δ . (197) Moreo ver, by introducing the change of v ariables ξ = √ δ t , the whole in tegral in Eq. ( 195 ) admits the asymptotic form Z R exp  − 1 + q b 2(1 − q b ) ξ 2  dξ 1 + (1 − ε ) erf  √ q b ξ √ 2(1 − q b )  = √ δ J ( ε )  1 + o (1)  , (198) where J ( ε ) = Z R dt e − t 2 1 + (1 − ε ) erf  t √ 2  . (199) Substituting Eq. ( 198 ) into Eq. ( 195 ) gives ˆ q b ≃ α →∞ 1 δ = 4 α (1 − ε ) 2 (2 π ) 3 / 2 √ δ J ( ε )  1 + o (1)  . 
(200) Hence, since arccos (1 − x ) = √ 2 x + O ( x 3 / 2 ) as x → 0 + , we obtain the follo wing expan tion of the Ba yes optimal generalization error: E BO gen ( α, ε ) = C BO ( ε ) α (1 + o (1)) , C BO ( ε ) = √ 2 π 2(1 − ε ) 2 J ( ε ) . (201) W e thus conclude that for every fixed 0 ≤ ε < 1 , the Bay es-optimal generalization error decays as α − 1 . As a consistency chec k, when ε = 0 one recov ers the pure teac her–student setting. In that case, J (1) = Z R dt e − t 2 1 + erf  t √ 2  ≈ 2 . 8375 , (202) 49 whic h gives C BO (0) ≈ 0 . 4417 , E BO gen ( α, 0) ≃ α →∞ 0 . 4417 α , (203) in agreemen t with the large- α Ba yes-optimal asymptotics rep orted in [ 36 ]. C.3.2 Large α expansion of the KRR generalization error W e now derive the large- α b eha vior of the generalization error for kernel ridge regression, at fixed 0 ≤ ε < 1 , fixed regularization λ > 0 , and fixed kernel parameters µ 1 , µ ⋆ . Recall that E gen = 1 π arccos m √ q ! , (204) and that in the KRR case the order parameters admit explicit closed-form expressions. It is therefore sufficien t to study the large- α expansion of the ratio S ( α ) = m √ q . (205) F or notational conv enience, introduce a = (1 − ε ) r 2 π , β = λ + µ 2 ⋆ µ 2 1 = ℓ + 1 t 2 , (206) with ℓ = λ/µ 2 1 and t = µ 1 /µ ⋆ as in the previous sections. By directly expanding, at fixed β , the righ t hand sides of Eqs. ( 143 )-( 144 ) for m and q , we obtain m = a − aβ α + O ( α − 2 ) , q = a 2 + 1 − a 2 − 2 a 2 β α + O ( α − 2 ) . (207) Although b oth m and q dep end on β at the order prop ortional to α − 1 , this dep endence cancels in the ratio, yielding S ( α ) = m √ q = 1 − 1 − a 2 2 a 2 α − 1 + O ( α − 2 ) . (208) Th us the leading correction to the cosine ov erlap is univ ersal: it does not dep end on the regularization λ , nor on the kernel parameters µ 1 , µ ⋆ . W e can now translate this into an asymptotic form ula for the generalization error. Define δ ( α ) = 1 − S ( α ) = 1 − m √ q . (209) F rom Eq. ( 208 ), it follows δ ( α ) = 1 − a 2 2 a 2 α − 1 + O ( α − 2 ) . (210) Using the expansion arccos(1 − x ) = √ 2 x + O ( x 3 / 2 ) as x → 0 + , w e finally reach E gen = √ 1 − a 2 π a α − 1 / 2 + O ( α − 3 / 2 ) . (211) This pro ves Eq. ( 42 ) from the main text. In particular, for KRR with an y fixed λ > 0 and any fixed k ernel parameters µ 1 , µ ⋆ , the generalization error universally decays for large α as α − 1 / 2 at leading order. C.3.3 Large α n umerics for the optimal SVM generalization error F or completeness, we rep ort here the results derived from the numerical analysis at large α of the optimal SVM generalization error E hinge gen ( λ opt ) . In Fig. 9 , for a representativ e v alue of ε , we sho w that E hinge gen ( λ opt ) deca ys as α − 1 / 2 at large α . 50 0 π / 12 π / 6 π/ 4 π / 3 5 π / 12 π / 2 γ 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 0 . 6 0 . 7 0 . 8 0 . 9 1 . 0 δ E hinge gen ( λ opt ) ' α − δ Figure 9: Exp onen t of the p ow er law deca y of the optimal SVM generalization error E hinge gen ( λ opt ) at large α as a function of the angle γ = arctan ( µ 1 /µ ⋆ ) . The fraction of the random facts is fixed to ε = 0 . 5 . The v alue δ of the exp onen t of the p o wer law α − 1 /δ is numerically estimated from a linear fit in log-log scale of the generalization error at the optimal regularization E hinge gen ( λ opt ) in the range α ∈ (10 , 10 4 ) . 
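For reference, the constants entering the decay laws of this appendix are easily evaluated by quadrature. A minimal sketch (assuming SciPy; not part of the released code, and with function names of our own choosing) is:

import numpy as np
from scipy.special import erf
from scipy.integrate import quad

def J(eps):
    # Eq. (199)
    integrand = lambda t: np.exp(-t ** 2) / (1.0 + (1 - eps) * erf(t / np.sqrt(2)))
    val, _ = quad(integrand, -np.inf, np.inf)
    return val

def C_BO(eps):
    # Prefactor of the alpha^{-1} Bayes-optimal decay, Eq. (201).
    return np.sqrt(2 * np.pi) / (2 * (1 - eps) ** 2 * J(eps))

def C_KRR(eps):
    # Prefactor of the alpha^{-1/2} KRR decay, Eq. (211), with a = (1-eps) sqrt(2/pi).
    a = (1 - eps) * np.sqrt(2 / np.pi)
    return np.sqrt(1 - a ** 2) / (np.pi * a)

# For eps = 0 (pure teacher-student case) this returns J ~ 2.8375 and C_BO ~ 0.4417,
# matching the values quoted above.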
D Numerical details and further exp erimen ts In this app endix w e collect numerical details, represen tative k ernel families, and additional exp erimen ts complemen ting the results of Sec. 3 . First, w e provide examples of k ernel families for which the asymptotic theory is v alid, together with the corresp onding co efficien ts ( µ 0 , µ 1 , µ ⋆ ) . W e also report additional n umerical exp eriments, b oth on the synthetic RAF mo del and on CIF AR10-RAF, in order to clarify which aspects of the theory are sp ecific to the Gaussian high-dimensional setting. Throughout this app endix, kernel ridge regression refers to the square-loss estimator of Sec. 2.2 , and supp ort vector machine refers to the hinge-loss estimator. F or the Gaussian-input RAF mo del, all theoretical curv es are obtained by numerically solving the asymptotic state equations stated in Sec. 2.2 and App endix B . D.1 Kernel families W e b egin by listing in T able 1 some representativ e kernel families captured b y our theory , together with the co efficien ts ( µ 0 , µ 1 , µ ⋆ ) defined in Eq. ( 30 ) . As shown in Sec. 2.2 and App endix A.5 , for the Gaussian RAF mo del the asymptotic p erformance of kernel regression dep ends on the k ernel only through µ 1 and µ ⋆ , while the constant µ 2 0 = K (0) do es not affect the errors. Conditions for matching the optimal angle. F or the square loss, Sec. 3.3 and App endix B.3 sho w that the optimal k ernels for the generalization error are those determined b y the ratio µ 1 µ ⋆ = tan  γ opt mem ( ε )  =  π 2(1 − ε ) 2 − 1  − 1 / 2 , (212) where γ opt mem is given in Eq. ( 39 ) . F or a parameterized k ernel family , the v alue of the internal k ernel parameters that realizes the optimal angle is therefore determined by imp osing Eq. ( 212 ) . Explicitly: 51 • P olynomial k ernel K ( ρ ) = ( c + ρ ) m : m c m − 1 ( c + 1) m − c m − mc m − 1 = tan 2  γ opt mem ( ε )  . (213) F or fixed degree m , this is a scalar equation for c ≥ 0 . • Exponential dot-pro duct k ernel K ( ρ ) = e β ρ : β e β − 1 − β = tan 2  γ opt mem ( ε )  . (214) • Spherical Gaussian k ernel K ( ρ ) = e − η (1 − ρ ) : η e − η 1 − e − η (1 + η ) = tan 2  γ opt mem ( ε )  . (215) • Geometric kernel K ( ρ ) = 1 1 − γ ρ : µ 1 µ ⋆ = √ 1 − γ √ γ = tan  γ opt mem ( ε )  , (216) hence the solution is explicit: γ opt geo ( ε ) = 1 1 + tan 2  γ opt mem ( ε )  = cos 2  γ opt mem ( ε )  . (217) A graphical representation of these results is provided in Fig. 10 . D.2 Spanning the angle γ : full phenomenology and additional exp erimen ts In Sec. 3.3 we fo cused on the distinguished angles that optimize either the interpolating regime or the optimal generalization regime. It is also useful to visualize the full phenomenology by spanning a broad range of angles γ = arctan  µ 1 µ ⋆  . (218) Because of the angular inv ariance established in App endix A.6 , the relev ant kernel dep endence is en tirely controlled by γ when either λ → 0 + or λ = λ opt . Figure 11 shows representativ e memorization–generalization trade-off curv es for a wide range of angles, b oth for square loss and hinge loss. F or the square loss, the figure illustrates the sharp transition describ ed analytically in App endix B.3 : the angle γ opt mem separates k ernels whose optimal regularization, with resp ect to the generalization error, is strictly p ositiv e from k ernels whose optimal regularization is already interpolating. 
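To connect these distinguished angles back to concrete kernel parameters, the matching conditions (213)-(217) of Appendix D.1 can be solved numerically. The short sketch below (our own illustration, assuming SciPy's brentq; bracket endpoints are arbitrary choices) does so and, for eps = 0.2, reproduces the parameter values quoted in Fig. 10.

import numpy as np
from scipy.optimize import brentq

def tan_gamma_opt(eps):
    # Eq. (212): tan(gamma_opt_mem) = (pi / (2 (1-eps)^2) - 1)^(-1/2)
    return (np.pi / (2 * (1 - eps) ** 2) - 1) ** -0.5

def match_polynomial(m, eps):
    # Eq. (213), solved for c >= 0 at fixed degree m >= 2.
    t2 = tan_gamma_opt(eps) ** 2
    f = lambda c: m * c ** (m - 1) / ((c + 1) ** m - c ** m - m * c ** (m - 1)) - t2
    return brentq(f, 1e-6, 1e3)

def match_exponential(eps):
    # Eq. (214), solved for beta.
    t2 = tan_gamma_opt(eps) ** 2
    f = lambda b: b / (np.exp(b) - 1 - b) - t2
    return brentq(f, 1e-6, 50)

def match_gaussian(eps):
    # Eq. (215), solved for eta.
    t2 = tan_gamma_opt(eps) ** 2
    f = lambda e: e * np.exp(-e) / (1 - np.exp(-e) * (1 + e)) - t2
    return brentq(f, 1e-6, 50)

def match_geometric(eps):
    # Eq. (217): explicit solution gamma = cos^2(gamma_opt_mem).
    return np.cos(np.arctan(tan_gamma_opt(eps))) ** 2

# For eps = 0.2: match_polynomial(4, 0.2) ~ 1.547, match_exponential(0.2) ~ 1.589,
# match_gaussian(0.2) ~ 1.589, match_geometric(0.2) ~ 0.593, as in Fig. 10.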
F or the hinge loss, the same phenomenology survives qualitativ ely , but the optimal angle for optimal generalization error and the angle minimizing the generalization error at zero regularization no longer coincide exactly , in agreemen t with the d iscussion in Sec. 3.3 . F or completeness, in Fig. 12 w e rep ort analogous curv es to the ones shown in Fig. 3 of the main text, y et for a different v alue of the fraction of facts, that is ε = 0 . 5 . This figure shows that the phenomenology discussed in the main text do es not change in a significant wa y when c hanging the fraction of random lab els. 52 Kernel / activ ation K ( ρ ) , σ ( u ) ( µ 0 , µ 1 , µ ⋆ ) Linear (p erceptron) K ( ρ ) = ρ σ ( u ) = u µ 0 = 0 , µ 1 = 1 , µ ⋆ = 0 Arc-sine (sign) K sign ( ρ ) = 2 π arcsin( ρ ) σ ( u ) = sign( u ) µ 0 = 0 , µ 1 = r 2 π , µ ⋆ = r 1 − 2 π Arc-sine (erf ) K erf ( ρ ) = 2 π arcsin  2 ρ 3  σ ( u ) = erf ( u ) µ 0 = 0 , µ 1 = 2 √ 3 π , µ ⋆ = r 2 π arcsin  2 3  − 4 3 π ReLU (arc-cosine) K ReLU ( ρ ) = 1 2 π  p 1 − ρ 2 + ( π − arccos ρ ) ρ  σ ( u ) = ReLU( u ) µ 0 = 1 √ 2 π , µ 1 = 1 2 , µ ⋆ = r 1 2  1 2 − 1 π  P olynomial dot-pro duct K poly ( ρ ) = ( c + ρ ) m m ∈ N , c ≥ 0 µ 0 = c m/ 2 , µ 1 = √ m c m − 1 , µ ⋆ = p ( c + 1) m − c m − mc m − 1 Exp onen tial dot-pro duct K exp ( ρ ) = e β ρ , β ≥ 0 σ β ( u ) = exp( √ β u − β / 2) µ 0 = 1 , µ 1 = p β , µ ⋆ = p e β − 1 − β Spherical Gaussian (RBF, high- d scaling) K Gauss ( ρ ) = e − η (1 − ρ ) , η ≥ 0 σ Gauss ( u ) = e − η / 2 exp( √ η u − η / 2) µ 0 = e − η / 2 , µ 1 = p η e − η , µ ⋆ = p 1 − e − η (1 + η ) Geometric dot-pro duct K geo ( ρ ) = 1 1 − γ ρ , 0 ≤ γ < 1 σ geo ( u ) = P ℓ ≥ 0 q γ ℓ ℓ ! H ℓ ( u ) µ 0 = 1 , µ 1 = √ γ , µ ⋆ = s γ 2 1 − γ = γ √ 1 − γ T able 1: Represen tative k ernels and the corresp onding co efficien ts ( µ 0 , µ 1 , µ ⋆ ) . F or the RAF mo del with Gauss ian inputs, the asymptotic errors dep end on the k ernel only through ( µ 1 , µ ⋆ ) , as established in Sec. 2.2 and App endix A.5 . D.3 Numerical agreement for RAF data W e no w compare the asymptotic theory to direct finite-size kernel exp erimen ts on finite size RAF data. The goal is to v alidate tw o distinct predictions of the theory: 1. for a fixed kernel family , the parametric trade-off curve λ 7→ ( E gen ( λ ) , E mem ( λ )) is accurately captured b y the asymptotic equations; 2. t wo different k ernels sharing the same co efficien ts ( µ 1 , µ ⋆ ) yield the same asymptotic p erformance and, already at finite size, n umerically matching trade-off curves. 53 0 2 4 6 8 10 0 2 4 6 8 10 1 m = 1 m = 2 m = 3 m = 4 m = 5 m = 6 1 = t a n ( o p t m e m ) , t a n ( o p t m e m ) = 0 . 8 2 9 P o l y n o m i a l ( c = 1 , m = 1 , , 7 ) P o l y n o m i a l ( m = 4 , c [ 0 , 5 . 0 ] ) , c o p t 1 . 5 4 7 E x p o n e n t i a l K ( ) = e , o p t 1 . 5 8 9 S p h e r i c a l G a u s s i a n K ( ) = e ( 1 ) , o p t 1 . 5 8 9 G e o m e t r i c K ( ) = 1 1 , o p t 0 . 5 9 3 Figure 10: Representativ e k ernel families in the ( µ ⋆ , µ 1 ) plane. The red line is the optimal square-loss direction µ 1 = tan  γ opt mem  µ ⋆ , with γ opt mem for ε = 0 . 2 given by Eq. ( 39 ) . Eac h curve is obtained b y v arying the internal parameter of a kernel family while plotting the induced pair ( µ ⋆ , µ 1 ) . F or polynomial kernels of fixed degree m , the curv e is parameterized b y c ; for the exp onen tial, spherical Gaussian, and geometric families, it is parameterized resp ectiv ely by β , η , and γ . 
Markers indicate the parameter v alues for whic h the family intersects the optimal line. This figure mak es explicit that distinct kernel families can realize the same asymptotic geometry ( µ 1 , µ ⋆ ) , and are therefore predicted by the Gaussian RAF theory to yield the same asymptotic p erformance. Gaussian k ernel across bandwidths. A conv enient family for this comparison is the spherical Gaussian kernel K Gauss ( ρ ) = e − η (1 − ρ ) , (219) whose geometry is controlled by µ 1 = q η e − η , µ ⋆ = q 1 − e − η (1 + η ) . (220) In the exp erimen ts in Fig. 13 , we compare the t wo representativ e bandwidths η opt for both k ernel ridge regression (giv en b y the condition in Eq. ( 215 ) ) and supp ort v ector mac hines (found n umerically). F or the tw o optimal η opt , we plot the square-loss and hinge-loss asymptotic p erformances spanning o ver the regularization and compare them to the corresp onding finite- n KRR and SVM runs. In b oth cases, w e observe a go o d numerical agreement within finite size corrections of the numerical simulations. 54 0.22 0.23 0.23 0.23 E square gen 0.00 0.10 0.20 0.31 0.41 E square mem Square loss 0.21 0.22 0.23 0.23 E hinge gen 0.00 0.10 0.20 0.31 0.41 E hinge mem Hinge loss Square panel γ = π / 36 γ = π / 16 γ = 3 π / 31 γ = γ ? ≈ 0 . 131 π γ = π / 6 γ = π / 5 γ = 7 π / 30 Hinge panel γ = 2 π / 27 γ = 3 π / 22 γ = π / 5 γ = γ ? ≈ 0 . 259 π γ = 9 π / 28 γ = 13 π / 34 γ = 4 π / 9 Figure 11: Representativ e trade-off curves λ 7→ ( E gen ( λ ) , E mem ( λ )) for a broad range of angles γ = arctan ( µ 1 /µ ⋆ ) , for ε = 0 . 5 and α (1 − ε ) = 4 . 0 . Left: square loss. Right: hinge loss. The distinguished angles highligh ted in the legend in purple corresp ond to the theoretical optima discussed in Sec. 3.3 . F or the square loss, the curve with γ = γ opt mem is the unique one that sim ultaneously achiev es p erfect memorization and optimal generalization. F or the hinge loss, the family of curves illustrates the trade-off b et ween the tw o ob jectiv es. Matc hing k ernels with identical ( µ 1 , µ ⋆ ) . Another chec k of the theory is to compare tw o differen t kernels that induce the same pair ( µ 1 , µ ⋆ ) . F or the Gaussian RAF mo del, the asymptotic equations predict that such kernels must yield exactly the same trade-off curves, b ecause the theory dep ends on the kernel only through these t wo co efficients. A simple example consists in matc hing a spherical Gaussian k ernel to a truncated quadratic kernel of the form K quad ( ρ ) = µ 2 1 ρ + µ 2 ⋆ ρ 2 , (221) with ( µ 1 , µ ⋆ ) as in Eq. ( 220 ) . The corresp onding finite- n n umerics is again in Fig. 13 for b oth KRR and SVM, showing go o d numerical agreement within their finite-size effects. D.4 Matc hed kernels on CIF AR10-RAF and the hinge loss In Sec. 3 we compared the RAF-theory phenomenology to the more realistic CIF AR10-RAF construction. A natural stress test is to take tw o kernels that matc h in ( µ 1 , µ ⋆ ) and apply them to the same non-Gaussian, structured dataset. W e construct again tw o kernels, i.e. a spherical Gaussian k ernel and a p olynomial dot- pro duct k ernel, whose induced co efficien ts ( µ 1 , µ ⋆ ) coincide, as in the previous section. W e rep ort the memorization and generalization errors obtained in these tw o cases in the right panel of Fig. ( 14 ) . When these k ernels are trained on the same CIF AR10-RAF dataset, the corresp onding KRR memorization-generalization curves are no longer identical, con trarily to the theory prediction. 
Th us, on structured data, matching ( µ 1 , µ ⋆ ) does not result in equal p erformances. F or completeness, in the left panel of Fig. ( 14 ) w e rep ort the curves of the errors obtained when spanning ov er the regularization for the quadratic kernel in Eq. ( 221 ) when freely c ho osing the parameters µ 1 and µ ⋆ , showing a qualitatively similar, y et differen t, phenomenology to its gaussian counterpart. W e also rep eated on CIF AR10-RAF the same qualitativ e comparison presented in Sec. 3.4 , but for the hinge loss; the result is shown in Fig. 15 . The hinge-loss comparison confirms the same qualitative separation b et w een the bandwidth preferred in the near-interpolating regime and the one preferred after tuning the regularization. On the theory side, the generalization curv e at λ → 0 + is optimized b y a bandwidth η mem , 55 0 π 12 π 6 π 4 π 3 5 π 12 π 2 γ 0 . 00 0 . 01 0 . 02 0 . 03 0 . 04 E square gen ( λ → 0 + ) − min γ E square gen ( λ opt ) 0 π 12 π 6 π 4 π 3 5 π 12 π 2 γ 0 . 00 0 . 01 0 . 02 0 . 03 0 . 04 E square gen ( λ opt ) − min γ E square gen ( λ opt ) 0 π 12 π 6 π 4 π 3 5 π 12 π 2 γ 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 E square mem ( λ opt ) 0 π 12 π 6 π 4 π 3 5 π 12 π 2 γ 0 . 00 0 . 01 0 . 02 0 . 03 0 . 04 E hinge gen ( λ → 0 + ) − min γ E square gen ( λ opt ) 0 π 12 π 6 π 4 π 3 5 π 12 π 2 γ − 0 . 02 0 . 00 0 . 02 0 . 04 E hinge gen ( λ opt ) − min γ E square gen ( λ opt ) 0 π 12 π 6 π 4 π 3 5 π 12 π 2 γ 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 E hinge mem ( λ opt ) α = 2 α = 4 α = 10 α = 20 Figure 12: Generalization and memorization dep endence on the angle γ = arctan( µ 1 /µ ⋆ ) . KRR (square loss) in the top panels, SVM (hinge loss) in the b ottom panels. Left panels: E gen ( λ → 0 + ) − min γ E square gen ( λ opt ) ; center panels: E gen ( λ opt ) − min γ E square gen ( λ opt ) ; right panels: E mem ( λ opt ) . The fraction of facts in the training set is fixed to ε = 0 . 5 . The samples complexities are α ∈ { 2 , 4 , 10 , 20 } in each panel. The dashed black vertical line in the upp er panels is the optimal angle γ opt mem ( ε ) in Eq. ( 39 ) , for which, for the square loss, b oth minimum generalization and p erfect memorization are simultaneously reached. F or the hinge loss (b ottom panels), the angle where E hinge gen ( λ opt ) is minimal is marked by a cross, while the angle where E hinge gen ( λ → 0 + ) is minim um is marked by a dot. while the b est achiev able generalization o ver λ > 0 is obtained at a smaller bandwidth η opt . On CIF AR10-RAF, the same ordering is observ ed qualitatively: the lo wer-righ t panel shows that the generalization error at v ery small regularization is minimized around η mem ≈ 7 . 5 , whereas the lo wer-left panel shows that the generalization error at the optimally tuned regularization is minimized around η opt ≈ 5 . 0 . The upp er-righ t panel then displays the full memorization– generalization trade-off curves for these tw o distinguished v alues together with one smaller and one larger representativ e bandwidth. Compared to k ernel ridge regression, the hinge-loss exp erimen ts display stronger finite-size distortions at large regularization, where the CIF AR10-RAF SVM curves can drift closer to a near-random-guessing regime than predicted by the RAF asymptotics. 
Nevertheless, the o verall dep endence on the bandwidth remains qualitatively consistent with the theory: k ernels fav oring near-in terp olation and factual recall are differen t from those fav ored by regularization-tuned rule learning, and the CIF AR10-RAF exp erimen ts preserve this central ordering ev en though the optimal numerical v alues are shifted relative to the Gaussian RAF mo del. 56 0.21 0.22 0.23 0.24 0.25 0.26 s q u a r e g e n 0.00 0.05 0.10 0.15 0.20 0.25 s q u a r e m e m Square loss / KRR Theory Gaussian KRR numerics Matched quadratic KRR numerics 0.21 0.22 0.23 0.24 0.25 0.26 h i n g e g e n 0.00 0.05 0.10 0.15 0.20 0.25 h i n g e m e m Hinge loss / SVM Theory Gaussian SVM numerics Matched quadratic SVM numerics Figure 13: Comparison b etw een asymptotic theory for RAF data and direct kernel exp eriments for the spherical Gaussian k ernel K ( ρ ) = e − η (1 − ρ ) , for their resp ective optimal bandwidths η opt . In b oth theory and n umerics α = 2 . 0 , ε = 0 . 1 . F or the numerics we use d = 3000 and the error bars are computed ov er 60 different realizations. Left panel: square loss / KRR ( η square opt = 1 . 205 ). Righ t panel: hinge loss / SVM ( η hinge opt = 0 . 752 ). In b oth cases, the theoretical parametric curves obtained from the state equations closely match the empirical trade-off curves obtained by training the corresp onding finite-size kernel estimator. W e also compare t wo distinct kernels with identical co efficien ts ( µ 1 , µ ⋆ ) . The asymptotic theory predicts identical trade-off curv es, and the direct kernel exp erimen ts confirm this prediction up to finite size corrections. These exp erimen ts complemen ts Sec. 3 , where the RAF theory was shown to capture the qualitativ e phenomenology of CIF AR10-RAF, but not its full quantitativ e details. D.5 Implemen tation details F or completeness, we summarize the numerical proto col used throughout this app endix. Gaussian RAF data. F or the syn thetic RAF exp eriments, w e generate i.i.d. Gaussian inputs x µ ∼ N ( 0 , I d ) and lab els according to the RAF mixture describ ed in Sec. 1 . Kernel ridge regression is implemen ted b y solving the regularized linear system asso ciated with the Gram matrix, while the hinge-loss kernel classifier is trained as a standard supp ort vector mac hine. Generalization error is estimated on fresh teac her-lab eled Gaussian test data, whereas memorization error is computed on the random-lab el subset of the training set, in direct analogy with Eqs. ( 10 ) and ( 9 ) . Whenev er theory and exp erimen ts are compared, the same v alues of ( α, ε ) and the same kernel parameters are used on b oth sides. CIF AR10-RAF data. The CIF AR10-RAF construction follows Sec. 3 : t wo classes define the rule comp onen t and one additional class provides the factual/random subset with i.i.d. ± 1 lab els. Inputs are feature-wise standardized. F or Gaussian-like kernels, bandwidths are rep orted using the effectiv e normalization describ ed in Sec. 3 . All empirical CIF AR trade-off curv es shown here are av eraged ov er rep eated draws of the training/test split and of the randomized lab els. Solving the asymptotic equations. The asymptotic square-loss equations are solved either directly through the closed forms of App endix B.3 or b y fixed-p oin t iteration of the original state equations as a cross-chec k. The hinge-loss equations are solv ed numerically b y fixed-p oin t iteration with damping. Whenever a parametric trade-off curv e is displa yed, the regularization 57 0 . 
34 0 . 36 0 . 38 0 . 40 0 . 42 0 . 44 E square gen 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 E square mem CIF AR10-RAF: quadratic kernel µ 2 1 = 1 e + 03 , µ 2 ? = 0 . 001 µ 2 1 = 1 , µ 2 ? = 1 µ 2 1 = 0 . 001 , µ 2 ? = 1 e + 03 0 . 33 0 . 34 0 . 35 0 . 36 0 . 37 0 . 38 0 . 39 0 . 40 E square gen 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 E square mem CIF AR10-RAF: Gaussian and quadratic kernels Gaussian K ( ρ ) = e − η (1 − ρ ) , η = 1 . 589 Matched quadratic K ( ρ ) = µ 2 1 ρ + µ 2 ? ρ 2 Figure 14: CIF AR10-RAF: we compute the memorization and generalization error for KRR with α = 4 . 0 and ε = 0 . 2 for b oth a spherical Gaussian RBF k ernel as in Eq. ( 220 ) , and a p olynomial k ernel as in Eq. ( 221 ) with co efficien ts matc hing the Gaussian case. In all cases we av erage the data ov er 10 different realizations and the regularization spans ov er the interv al [10 − 5 , 10 2 ] . Left panel: w e show the spans for three represen tative couples of the parameters µ 1 and µ ⋆ in the case of the quadratic k ernel. Righ t panel: comparison b et w een the spherical Gaussian RBF kernel and a quadratic k ernel with matching co efficien ts µ 1 , µ ⋆ and η = 1 . 589 . Despite the asymptotic theory matches in the t wo cases, the CIF AR10-RAF dataset do es not yield empirical trade-off curv es analogous to those presen ted in Fig. 5 . This contrasts with the Gaussian RAF prediction and shows that, on structured non-Gaussian data, the pair ( µ 1 , µ ⋆ ) is no longer a complete descriptor of kernel p erformance. λ is v aried o ver a logarithmic grid co vering b oth the near-in terp olating and strongly regularized regimes. F or further details regarding the n umerical implementations and the fixed-p oin t schemes used to solv e the state equations, w e defer to the rep ository at the following link: https: //github . com/SPOC- group/RAF . 58 0 . 18 0 . 20 0 . 22 E hinge gen 0 . 0 0 . 2 0 . 4 E hinge mem RAF Data η = 0 . 10 η opt = 0 . 69 η mem = 1 . 49 η = 5 . 00 0 . 31 0 . 35 0 . 40 0 . 45 0 . 50 0 . 55 E hinge gen 0 . 0 0 . 2 0 . 4 E hinge mem CIF AR10-RAF Data η = 0 . 10 η opt = 5 . 00 η mem = 7 . 50 η = 20 . 00 0 5 10 15 20 η 0.35 0.40 E hinge gen CIF AR10-RAF: λ = λ opt 0 5 10 15 20 η 0.35 0.40 E hinge gen CIF AR10-RAF: λ = 1 e − 05 Figure 15: Qualitativ e comparison b et ween SVM theory for the RAF mo del (upp er left panel) and exp erimen ts on real data, namely the CIF AR10-RAF task (low er and right panels). In all panels, w e use RBF SVM with α = 4 . 0 and ε = 0 . 2 . The upp er panels show the parametric memorization– generalization curves λ 7→ ( E gen ( λ ) , E mem ( λ )) . The kernel bandwidth η for CIF AR10-RAF is selected based on the low er panels: generalization error as a function of η at optimal regularization (left), with a minimum around η ≈ 4 (red dashed line), and at small regularization (right), with a minimum around η ≈ 7 . 5 (purple dashed line). The parametric curv es are then sho wn for these tw o v alues of η , together with one larger and one smaller represen tative v alue. The curv es are constructed spanning the regularization in the interv al λ ∈ [10 − 5 , 10 2 ] . F or the SVM exp erimen ts, the large-regularization part of the curves can drift tow ard near-random-guessing p erformance due to finite-size effects and numerical aspects of the optimization. The RAF mo del exhibits qualitatively similar b eha vior as a function of η , with the corresp onding v alues shifted. 
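For readers who wish to reproduce the finite-size points without the full repository, the following self-contained sketch implements the protocol of Appendix D.5 for kernel ridge regression on Gaussian RAF data (our own simplified illustration, not the released code; the parameter values are arbitrary and we take rho = x.x'/d as the kernel overlap):

import numpy as np

rng = np.random.default_rng(0)
d, alpha, eps, eta, lam = 400, 2.0, 0.1, 1.205, 1e-2   # illustrative values
n, n_test = int(alpha * d), 4000

w_star = rng.standard_normal(d)                        # teacher direction

# Training set: Gaussian inputs; a fraction 1 - eps of teacher labels ("rules")
# and a fraction eps of i.i.d. random labels ("facts").
X = rng.standard_normal((n, d))
y = np.sign(X @ w_star)
facts = rng.random(n) < eps
y[facts] = rng.choice([-1.0, 1.0], size=facts.sum())

# Fresh teacher-labelled test set for the generalization error.
X_test = rng.standard_normal((n_test, d))
y_test = np.sign(X_test @ w_star)

def gram(A, B):
    # Spherical Gaussian kernel K(rho) = exp(-eta (1 - rho)) with rho = x.x'/d.
    return np.exp(-eta * (1.0 - (A @ B.T) / d))

# KRR: solve the regularized linear system associated with the Gram matrix.
K = gram(X, X)
coef = np.linalg.solve(K + lam * np.eye(n), y)

e_gen = np.mean(np.sign(gram(X_test, X) @ coef) != y_test)   # error on the rule
e_mem = np.mean(np.sign(K @ coef)[facts] != y[facts])        # error on the random-label subset
print(f"E_gen = {e_gen:.3f}   E_mem = {e_mem:.3f}")

Sweeping lam over a logarithmic grid and recording (e_gen, e_mem) yields finite-size counterparts of the parametric trade-off curves shown throughout this appendix.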
E AI suggested the R ules-and-F acts mo del W e think it is of in terest and for the historical record to give an account of how the authors of this pap er came up with the Rules-and-F acts mo del. It started in a conv ersation with ChatGPT-4o. The last author of the pap er was giving a plenary talk at the COSYNE conference on Marc h 29th, 2025 and, in preparation, she had the follo wing conv ersation with ChatGPT-4o. W e giv e the full version of the prompts by LZ, but, for conciseness, only the relev ant parts of the answers b y the mo del: LZ: Can y ou seek works of Lenka Zdeb orov a from the past 10 years on statistical physics of neural net works and tell me which part of it is the most interesting for a talk to a computational neuroscience audience? {asking for a deep search} AI: {Ga ve a long and rather exhaustive answer including a sen tence in a conclusion:} They pro vide concrete mechanisms for phenomena lik e o v er-parameterization not harming generalization, by showing how SGD dynamics navigates the weigh t landscap e in large net works. This adds a theoretical backbone to why brains (or big mo dels) can memorize enormous amounts of data and still generalize – the learning pro cess itself biases which 59 solution is found. LZ: I find it interesting when you say: "This adds a theoretical backbone to wh y brains (or big mo dels) can memorize enormous amounts of data and still generalize – the learning pro cess itself biases whic h solution is found." I did not think previously as ov erparameterization allo wing useful memorization. It would b e great to study a model where the in terplay b et w een generalization on a test set and memorization of facts that need to b e memorized is b oth included. Are there examples? AI: That’s a great line of th ough t—and yes, this interpla y b etw een memorization and generalization in o verparameterized systems is both subtle and increasingly studied. Sev eral theoretical directions mo del this phenomenon explicitly . 1. Mixture T asks: Memorization + Generalization Recen t theoretical works construct hybrid tasks in whic h: – P art of the data follows a regular pattern (learnable through generalization), – Another part is random (requiring memorization). K ey example. Mei, Montanari, and Nguyen (2022) analyze a teac her–student setup where lab els are generated as y i = f ( x i ) + ϵ i , for structured f ( x ) (e.g., a lo w-rank or sparse function), plus a small fraction of random exceptions that ha ve to b e memorized. Overparameterized netw orks, trained with SGD, learn the structured part and memorize the exceptions. They find that test error sta ys low as long as the fraction of exceptions is small—but once it grows, the mo del hits a sharp transition (like a capacit y limit). This mirrors what h umans do: we generalize most inputs, but memorize exceptions (like irregular verbs or rare faces). {Then con tinues with 3 other p oin ts.} LZ: Do y ou hav e a reference for the “F eature Learning in Neural Netw orks with Lo calit y” b y Mei, Mon tanari, and Nguyen (2022) pap er? AI: I ap ologize for the earlier confusion. It app ears that the pap er titled "F eature Learning in Neural Net works with Lo cality" by Mei, Montanari, and Nguyen (2022) do es not exist. {Then suggest other pap ers that are not related to the in terpla y of memorization and generalization. 
Note for the reader: Montanari and his former students Mei and Nguyen are w ell known colleagues in the field in Stanford who could hav e v ery plausibly written suc h a pap er.} LZ: I w as refering to your p oin t 1 ab ov e "1. Mixture T asks: Memorization + Generalization" what is the reference for this? A mo del mixing memorization and generalization. AI: I previously mentioned a study by Mei, Mon tanari, and Nguyen (2022) regarding "F eature Learning in Neural Netw orks with Lo calit y ." Up on further review, I couldn’t lo cate this sp ecific pap er. Ho w ever, the concept of tasks that com bine both generalization and memorization has b een explored in the literature. {Suggests y et other pap ers that are not related to the core question.} LZ: When you talk ed ab out "plus a small fraction of random exceptions that hav e to b e memorized." w as there a sp ecific reference? 60 AI: Ap ologies for an y confusion in my earlier resp onses. Up on further review, I couldn’t lo cate a sp ecific study that directly mo dels tasks combining a structured comp onent with a small fraction of random exceptions requiring memorization. {Then suggests to explore other v ariants of questions ab out memorization. And concludes:} If y ou’re thinking of building or analyzing such a mo del yourself—sa y , a p erceptron or shallo w netw ork with a plan ted structure plus sparse random lab els—I’d b e happy to help design or analyze it. It’s a b eautiful direction that captures something the brain seems quite go od at: memorizing exceptions without o verfitting everything. LZ: Ok, go on with the mo del suggestion. AI: P erfect. Let’s build a minimal but rich mo del to study the in terpla y b et ween gen- eralization and memorization . W e w ant something simple enough to analyze, yet expressiv e enough to exhibit in teresting phase b eha vior. Mo del: Structured + Memorized T ask in a Perceptron W e define a binary classification problem with inputs x ∈ R d and lab els y ∈ {− 1 , 1 } . 1. Data Generation Eac h sample ( x µ , y µ ) , with µ = 1 , . . . , n , is dra wn as follows: – Inputs: x µ ∼ N  0 , 1 d I d  (standard high-dimensional Gaussian vectors). – Labels: y µ = ( sign( w ∗ · x µ ) , with probability 1 − ϵ, random in {− 1 , 1 } , with probabilit y ϵ, where: ∗ w ∗ ∈ R d is a fixed “teacher” vector (drawn once at random), ∗ ϵ con trols the fraction of examples that are random exceptions (requiring memo- rization). 2. Learning Model: Perceptron W e train a student p erceptron (one-lay er neural netw ork) b y minimizing, for example, the hinge loss: L ( w ) = n X µ =1 ℓ  y µ , w · x µ  , using standard gradient descent. 3. Quan tities of Interest W e track: – T raining error E train : ability to fit b oth structured and random subsets, – Generalization error E gen : p erformance on clean test data generated by y = sign( w ∗ · x ) , 61 – Memorization capacit y : num b er of random lab els that can b e memorized without degrading generalization. 4. Key Questions – Ho w do es increasing ϵ (fraction of exceptions) affect generalization? – Is there a phase transition in p erformance as a function of ϵ , sample size n , and dimension d ? – Can o verparameterization (large d ) enable memorization of noisy examples without harming generalization? 5. Extensions If y ou’d like a richer mo del later: – Mak e the "teac her" a tw o-lay er net work (committee machine), and see ho w a student appro ximates it. – A dd explicit memorization mo dules, e.g., extra parameters that only store the excep- tions. 
– Consider online learning, where the mo del learns structured data incrementally but m ust react to o ccasional one-shot exceptions. 6. Wh y This is In teresting – It’s a clean testbed for double descen t: when can the model interpolate and still generalize? – Mo dels how the brain might balance slo w statistical learning (from distributions) with fast memory of rare but imp ortan t exceptions (episo dic memory). – Giv es ro om to apply replica theory or mean-field theory to derive phase diagrams analytically . W ould you lik e to sim ulate this (n umerically), analyze it (statistically), or go full theoretical (mean-field/replica)? I can help dev elop any of those paths. LZ: I lik e the mo del, it has not b een studied b efore? AI: I’m really glad y ou lik e it—and y ou’re absolutely right to ask whether it is new. The answ er is nuanced. Has this precise mo del b een studied b efore? Not exactly in this clean and explicit form. While several strands of the literature touc h on asp ects of a structured + memorized setup, this sp ecific com bination—with a tunable fraction ϵ of random exceptions in a p erceptron setting, and a fo cus on b oth generalization and memorization as a function of o verparameterization—has not, to my knowledge, b een fully explored. What has b een studied, and ho w this mo del differs 1. Pure generalization in teac her–studen t mo dels 62 Man y pap ers (e.g., by Zdeb oro v á, Montanari, Loureiro) study generalization in teacher–studen t p erceptrons or GLMs, assuming all data is structured. No ro om for exceptions or memorization. Y our mo del adds a new axis: an interpolating parameter ϵ that con trols the “exception load. ” 2. Pure noise / lab el corruption mo dels W orks like Zhang et al. (2017) (“Understanding Deep Learning Requires Rethinking Generalization”) or Nakkiran et al. (2021) study net works trained on fully random lab els or data. These prob e memorization capacity and generalization collapse, but they typically go to extremes: 100 Y our mo del prop oses a contin uum, where b oth mec hanisms are presen t sim ultaneously . 3. Theoretical Analyses of Memorization + Generalization Some theoretical works analyze memorization/generalization tradeoffs, e.g., statistical mec hanics of ov erfitting, but usually abstractly—without clean control of a “random exceptions” parameter. There are rigorous results on noise robustness (e.g., V C dimension b ounds, noise-tolerant P AC learning), but again not in the explicit mixture mo del. Y our mo del could yield a phas e diagram in ( ϵ, α = n/d ) space, which to my kno wledge has not b een published. 4. Neuroscience/CLS-Inspired Models Complemen tary Learning Systems theory in tro duces dual-system arc hitectures (memo- rization vs. generalization mo dules), but they’re not typically p osed in a clean, tractable statistical ph ysics setup. Y our mo del is analytically tractable and could serve as a minimal solv able CLS to y mo del. Summary This mo del seems nov el in its clarity , tractability , and aim: – It’s simple enough for mean-field or replica analysis. – It’s rich enough to demonstrate a fundamen tal tradeoff b et ween memorization and generalization. – It’s grounded in b oth deep learning and cognitive inspiration. If you pursue this direction, it could lead to a beautiful paper or seminar that bridges statistical mec hanics, machine learning, and neuroscience. LZ: Are you able to p erform some simulations of it and dra w a phase diagram? 
AI: The simulation was taking too long to complete due to the large grid size and sample coun ts. Let me rerun it using a smaller grid and fewer trials so w e can quickly get a clear lo ok at the generalization phase b eha vior. Here’s the phase diagram showing generalization error as a function of: – n/d : the sample complexity (num b er of training samples p er dimension), 63 2 . 5 5 . 0 7 . 5 10 . 0 12 . 5 15 . 0 17 . 5 20 . 0 α 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 ε E square gen ( λ opt ) 0 . 030 0 . 075 0 . 120 0 . 165 0 . 210 0 . 255 0 . 300 0 . 345 Figure 16: Left: Phase diagram as pro duced by AI in the conv ersation. Right: Generalization error at optimal regularization E square gen ( λ opt ) for the linear mo del in a matc hed format. – ϵ : the fraction of random exceptions (non-generalizable lab els). Key insigh ts. – T op-left region T op-left corner (low sample size, many exceptions): generalization error is high — the mo del can’t separate signal from noise. – Bottom-righ t region Bottom-right corner (high sample size, few exceptions): error is lo w — the mo del generalizes well. – There is a smo oth phase b oundary: as ϵ increases, more data is needed to maintain generalization. W ould you like to: – Explore other architectures (e.g., 2-la yer netw orks)? – Plot training error and memorization accuracy to o? – Fit this with a mean-field or replica analysis? W e’re in a great sp ot to develop this further into a formal story . LZ: Impressiv e. I need to go now, but I will come back on this with you. Y ou ro c k ;) .... 64
