Model Agreement via Anchoring

Eric Eaton¹, Surbhi Goel¹, Marcel Hussing¹, Michael Kearns¹, Aaron Roth¹, Sikata Sengupta¹, and Jessica Sorrell²

¹Department of Computer and Information Sciences, University of Pennsylvania
²Department of Computer Science, The Johns Hopkins University

February 27, 2026

Abstract

Numerous lines of work aim to control $\textit{model disagreement}$: the extent to which two machine learning models disagree in their predictions. We adopt a simple and standard notion of model disagreement in real-valued prediction problems, namely the expected squared difference in predictions between two models trained on independent samples, without any coordination of the training processes. We would like to be able to drive disagreement to zero with some natural parameter(s) of the training procedure, using analyses that can be applied to existing training methodologies. We develop a simple general technique for proving bounds on independent model disagreement based on anchoring to the average of two models within the analysis. We then apply this technique to prove disagreement bounds for four commonly used machine learning algorithms: (1) stacked aggregation over an arbitrary model class (where disagreement is driven to 0 with the number of models $k$ being stacked); (2) gradient boosting (where disagreement is driven to 0 with the number of iterations $k$); (3) neural network training with architecture search (where disagreement is driven to 0 with the size $n$ of the architecture being optimized over); and (4) regression tree training over all regression trees of fixed depth (where disagreement is driven to 0 with the depth $d$ of the tree architecture). For clarity, we work out our initial bounds in the setting of one-dimensional regression with squared error loss, but then show that all of our results generalize to multi-dimensional regression with any strongly convex loss.
1 Introduction

Two predictive models $f_1, f_2 : X \to \mathbb{R}$, trained on data sampled from the same distribution $D$, might frequently disagree in the sense that on a typical test example $x \sim D$, $f_1(x)$ and $f_2(x)$ take very different values. In fact, this can happen even when the two models are trained on the same dataset, if the model class is not convex and the training process is stochastic. This kind of model disagreement, sometimes known as model or predictive multiplicity [Marx et al., 2020, Black et al., 2022, Roth and Tolbert, 2025] or the Rashomon effect [Breiman, 2001], is a concern for many different reasons. Pragmatically, predictions are used to inform downstream actions, and two models that make different predictions produce ambiguity about which is the best action to take when we can only take one. This has led to a literature on how two predictive models (or a predictive model and a human) can engage in short test-time interactions so as to "agree" on a single prediction or action that is more accurate than either model could have made alone [Aumann, 1976, Aaronson, 2005, Donahue et al., 2022, Frongillo et al., 2023, Peng et al., 2025, Collina et al., 2025, 2026]. In industrial applications, this same phenomenon is known as model or predictive churn; there is a large body of work that aims to reduce it, because churn in predictions that does not produce accuracy improvements can needlessly disrupt downstream pipelines built around an initial model [Milani Fard et al., 2016, Bahri and Jiang, 2021, Hidey et al., 2022, Watson-Daniels et al., 2024]. The phenomenon of predictive multiplicity has led to concern about the potential arbitrariness of decisions informed by statistical models, and hence the procedural fairness of using such models in high-stakes settings [Marx et al., 2020, Black et al., 2022, Watson-Daniels et al., 2024].
The same phenomenon is what underlies the desire for replicability of machine learning algorithms, which has recently attracted widespread study [Impagliazzo et al., 2022, Bun et al., 2023, Eaton et al., 2023, Kalavasis et al., 2024b,a, Karbasi et al., 2023, Diakonikolas et al., 2025, Eaton et al., 2026]. In this paper we ask when training on independent samples from a common distribution results in models that approximately agree on most inputs. Unlike the (model) agreement literature [Aumann, 1976, Aaronson, 2005, Collina et al., 2025], we want approximate agreement "out of the box", without the need for any test-time interaction or coordination. And unlike the literature on replicability [Impagliazzo et al., 2022, Bun et al., 2023, Eaton et al., 2023, Karbasi et al., 2023], we do not want our analyses to apply only to custom-designed (and often impractical) algorithms: we want methods for analyzing existing families of practical learning algorithms. We continue a discussion of additional related work in Section 1.3.

1.1 Our Results

Our notion of approximate model agreement is that the expected squared difference between two models $f_1$ and $f_2$ should be small: $D(f_1, f_2) := \mathbb{E}_{x \sim P}[(f_1(x) - f_2(x))^2] \le \varepsilon$. Our goal is to show that for broad classes of model training methods, this disagreement level $\varepsilon$ can be driven to 0 with some tunable parameter of the method. We aim for high agreement in this sense via independent training, i.e., without the need for any interaction or coordination between the learners beyond the fact that they are sampling data from a common distribution.
We give an abstract recipe for establishing guarantees like this based on a "midpoint anchoring argument" and then give four applications of the recipe: (1) to the popular ensembling technique of "stacking"; (2) to gradient boosting and similar methods that iteratively build up linear combinations over a class of base models; (3) to neural network training with architecture search; and (4) to regression tree training over all regression trees of bounded depth. For clarity, we first establish all of our guarantees for models that solve a one-dimensional regression problem to optimize squared loss, but we then show how our results generalize to multi-dimensional strongly convex loss functions. We include our results on these generalizations in Section 6.

1.1.1 The Midpoint Anchoring Method

Our core technique is built around a simple "midpoint identity" for squared loss. For the sake of completeness, we provide a proof in Section 2. This identity is a special case ($m = 2$) of what is also known as the ambiguity decomposition in the literature [Krogh and Vedelsby, 1994, Jiang et al., 2017, Wood et al., 2023]. The decomposition breaks down the average ensemble model's loss into the average losses of the individual ensemble members and an "ambiguity" term measuring the disagreement of members from the average ensemble. For any two predictors $f_1, f_2 : X \to \mathbb{R}$, let $\bar{f}(x) := \frac{1}{2}(f_1(x) + f_2(x))$ denote the (hypothetical) model corresponding to their average. Then
$$\mathrm{MSE}(\bar{f}) = \frac{\mathrm{MSE}(f_1) + \mathrm{MSE}(f_2)}{2} - \frac{D(f_1, f_2)}{4}.$$
This decomposition is usually used to upper bound the loss of an explicitly realized ensemble model $\bar{f}$. We use it as a way to bound $D(f_1, f_2)$:
$$D(f_1, f_2) = 2\left(\mathrm{MSE}(f_1) + \mathrm{MSE}(f_2) - 2\,\mathrm{MSE}(\bar{f})\right).$$
For us, the ensemble model $\bar{f}$ need not ever be realized, except as a thought experiment.
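Because the midpoint identity holds pointwise (not just in expectation), it can be checked numerically on any pair of prediction vectors. A minimal sketch with synthetic labels and predictions (all names and noise levels here are our own illustration, not part of the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labels and two noisy "models'" predictions on the same points.
y = rng.normal(size=10_000)
f1 = y + rng.normal(scale=0.5, size=y.size)
f2 = y + rng.normal(scale=0.8, size=y.size)
f_bar = 0.5 * (f1 + f2)                      # hypothetical midpoint model

mse = lambda f: float(np.mean((f - y) ** 2))
disagreement = float(np.mean((f1 - f2) ** 2))  # D(f1, f2)

# Midpoint identity: MSE(f_bar) = (MSE(f1) + MSE(f2)) / 2 - D(f1, f2) / 4
lhs = mse(f_bar)
rhs = (mse(f1) + mse(f2)) / 2 - disagreement / 4
assert np.isclose(lhs, rhs)

# Rearranged form used to bound disagreement:
assert np.isclose(disagreement, 2 * (mse(f1) + mse(f2) - 2 * mse(f_bar)))
```

The identity is exact algebra, so both assertions hold up to floating-point error for any choice of `y`, `f1`, and `f2`.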
The identity reduces proving independent disagreement bounds (the goal of this paper) to bounding the error gap between the constituent models $f_1$ and $f_2$ and their average. If $\bar{f}$ lies in the same hypothesis class $H$ as $f_1$ and $f_2$, then this error gap can be bounded by any convergence analysis establishing that $\mathrm{MSE}(f)$ will approach error optimality within $H$. More frequently, for non-convex classes, $\bar{f}$ will not be representable within the same class of functions as $f_1$ and $f_2$; but for many natural concept classes, the average of two models trained within some class of models parameterized by a measure of complexity (size, depth) will be representable within a class that is "not much larger". This will give us stability guarantees in terms of the "local learning curve" of this complexity parameter, which because of error boundedness and monotonicity must tend to zero at values of the complexity parameter that can be bounded independently of the instance. All of our stability bounds are "agnostic" in the sense that they hold without any distributional or realizability assumptions. In other words, our bounds will always follow from the ability to optimize within given model classes, without needing to assume that the model class is able to represent the relationship between the features and the labels to any non-trivial degree. It is instructive at the outset to compare the midpoint anchoring method to a more naive method for establishing agreement bounds. Any pair of models $f_1$ and $f_2$ that both have almost perfect accuracy, in the sense that $\mathrm{MSE}(f_1), \mathrm{MSE}(f_2) \le \varepsilon$, must also satisfy $D(f_1, f_2) \le O(\varepsilon)$. This follows by anchoring on hypothetical perfect predictions $f^*(x) = y$. Of course, such bounds will rarely apply because very few settings are compatible with near-perfect prediction.
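To spell out the naive anchoring step: for any reals $a, b, c$ we have $(a - b)^2 \le 2(a - c)^2 + 2(c - b)^2$, so anchoring at the perfect prediction $c = y$ gives
$$D(f_1, f_2) = \mathbb{E}_{x \sim P}\big[(f_1(x) - f_2(x))^2\big] \le 2\,\mathbb{E}\big[(f_1(x) - y)^2\big] + 2\,\mathbb{E}\big[(y - f_2(x))^2\big] = 2\big(\mathrm{MSE}(f_1) + \mathrm{MSE}(f_2)\big) \le 4\varepsilon.$$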
The benefit of our more general midpoint anchoring method is that it will allow us to argue for independent model agreement without needing to make any realizability assumptions: high accuracy is not needed for high agreement, since if $f_1$ and $f_2$ have high error, so might the average model $\bar{f}$.

1.1.2 Applications: Ensembling, Boosting, Neural Nets, and Regression Trees

We choose our four applications below to show the various ways in which we can apply our method in settings that are progressively more challenging. First, as a warm-up, we study stacked aggregation, which ensembles independently trained models. We show how the midpoint anchoring method can recover strong agreement results as a function of the local error curve. Next, we study gradient boosting. Gradient boosting, like stacking, learns a linear combination of base models, but unlike stacking, does not rely on independently trained models. The models in gradient boosting are found by adaptively and iteratively solving a "weak learning" problem. As our midpoint anchoring method does not rely on model independence, we are still able to use it to recover strong agreement bounds tending to 0 at a rate of $O(1/k)$, where $k$ is the number of iterations of gradient boosting. The constituent models used in gradient boosting can be arbitrary and non-convex (e.g., depth-5 regression trees), but the aggregation method is still linear and is implicitly approximating an (infinite-dimensional) convex optimization problem: minimizing mean squared error amongst linear models in the span of the set of weak learner models. One might wonder if the kind of agreement bounds we are able to prove are implicitly relying on this convexity. In our third and fourth applications, we see that the answer is no.
We study error minimization over arbitrary ReLU neural networks of size $n$ (implying architecture search), as well as arbitrary regression trees of depth $d$. These are highly non-convex optimization landscapes, and thus approximate error minimizers can generally be very far from agreement in parameter space. Nevertheless, we are able to apply our midpoint anchoring method to show strong bounds on agreement that can be driven to 0 as a function of the size $n$ of the neural network in the first case and the depth $d$ of the regression tree in the second case, recovering agreement in prediction space despite arbitrary disagreement in parameter space.

1.1.3 Warm-up: Stacking and Local Training Curve Bounds

In Section 3 we apply our recipe to establish the stability of stacking, also known as stacked aggregation or stacked regression. Stacking [Wolpert, 1992, Breiman, 1996] is a simple, popular model ensembling technique in which we independently learn $k$ base models, and then combine them by training a regression model on top of them, using the predictions of the base models as features. To model independent training in reduced form, we imagine that there is a fixed distribution $Q$ on models $g : X \to \mathbb{R}$ that a learner can sample from. Sampling a new model $g$ from $Q$ represents the induced distribution on models from (as one example) sampling a fresh dataset $D$ of some size from the underlying data distribution $P$, and then running an arbitrary (possibly randomized) model training procedure on the sample $D$. We make no assumptions on the form of the distribution $Q$, and hence no assumptions about the nature of the underlying model training procedure or the underlying model class. A learner samples $k$ models $G = \{g_1, \ldots, g_k\}$ independently from $Q$, and then ensembles them by training a linear regression model $f_1$ to minimize squared error, using $G$ as its feature space.
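As a reduced-form illustration, this setup can be simulated in a few lines. Our toy construction: linear base models $g(x) = w \cdot x$ with random $w$ standing in for draws from $Q$, and all MSEs computed in-sample on one fixed dataset, so that least-squares optimality over a span of features holds exactly. The simulation also checks the anchoring comparison, developed below, to the model stacked on the union of two learners' feature sets:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 5, 10

# One fixed dataset; all MSEs below are in-sample, so least-squares
# optimality over any span of features holds exactly.
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(scale=0.5, size=n)
mse = lambda pred: float(np.mean((pred - y) ** 2))

def sample_models(count):
    """count base models g(x) = w . x with random w, standing in for Q."""
    return rng.normal(size=(count, d))

def stack(W):
    """Least-squares stacking: regress y on the base models' predictions."""
    F = X @ W.T                                   # n x count feature matrix
    coef, *_ = np.linalg.lstsq(F, y, rcond=None)
    return F @ coef                               # stacked predictions on X

for _ in range(20):
    G, Gp = sample_models(k), sample_models(k)    # two independent learners
    f1, f2 = stack(G), stack(Gp)
    f_star = stack(np.vstack([G, Gp]))            # anchor: stack all 2k models
    D = float(np.mean((f1 - f2) ** 2))
    # Midpoint identity plus optimality of f_star over span(G ∪ G'):
    assert D <= 2 * (mse(f1) + mse(f2) - 2 * mse(f_star)) + 1e-8
```

The per-draw inequality in the final line holds deterministically here: $\bar{f}$ lies in the span of $G \cup G'$, over which `f_star` is the exact least-squares minimizer, so $\mathrm{MSE}(f^*) \le \mathrm{MSE}(\bar{f})$.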
A second independent learner running the same procedure corresponds to sampling a different set of $k$ models $G' = \{g'_1, \ldots, g'_k\}$, also independently from $Q$, and then solving for a linear regression model $f_2$ minimizing squared error using $G'$ as its feature space. Via the midpoint anchoring method we argue that we can quickly drive the model disagreement $D(f_1, f_2)$ to 0 by increasing $k$, the number of models being ensembled. The idea is to compare both $f_1$ and $f_2$ to the model $f^*$ that is the solution to linear regression on the union of the two feature spaces $G \cup G'$. This model is only more accurate than the anchor model $\bar{f}$, as $\bar{f}$ is a (likely suboptimal) function within the same span as $f^*$. Moreover, since the base models underlying both $f_1$ and $f_2$ were sampled i.i.d. from a common distribution $Q$, the set of features in $G \cup G'$ is exchangeable. Consequently, we can view both $f_1$ and $f_2$ as the solution to a linear regression problem on a uniformly random subset of half the features available to $f^*$. This allows us to argue that as $k$ gets large, the MSE of $f_1$ and $f_2$ must approach the MSE of $f^*$, which in turn lets us drive $D(f_1, f_2)$ to 0 as a function of $k$. Taking the expectation over the models $f_1$ and $f_2$ as well lets us simplify the bound to:
$$\mathbb{E}_{f_1, f_2}[D(f_1, f_2)] \le 4(\bar{R}_k - \bar{R}_{2k}),$$
where $\bar{R}_k$ and $\bar{R}_{2k}$ represent the expected MSE that results from stacking $k$ and $2k$ models respectively, drawn i.i.d. from $Q$. This relates the stability of $f_1$ and $f_2$ to the local training curve; since $\bar{R}_1, \bar{R}_2, \ldots$ is a monotonically decreasing sequence bounded from above by $\mathbb{E}[y^2]$ and from below by 0, the curve must quickly "level out" at most values, yielding any degree of desired stability. We briefly remark on the "local training curve" form of this result, which is also shared by our results for neural networks and regression trees.
Rather than bounding disagreement directly in terms of the number of models $k$ being aggregated (which is the form of our result for gradient boosting), here we are relating disagreement to the local training error curve as a function of $k$: i.e., how much the error would decrease in expectation by doubling the number of models in the aggregation from $k$ to $2k$. If this curve has flattened out sufficiently near any particular value of $k$, we have approximate agreement. Note that in general we cannot say anything about which value of $k$ will result in the loss $\bar{R}_k$ approximating its minimum value. Consider a distribution $Q$ over some model space $H$ in which almost all models have large error and no correlation with the true $y$ values, and there is one model $h^*$ that is a perfect predictor of $y$. Let $Q$ put some (arbitrarily small) weight $\tau > 0$ on $h^*$. Then stacked regression for $k \ll 1/\tau$ will result in large error, since with high probability we sample only uninformative models. But for $k \sim 1/\tau$ we will be likely to draw $h^*$, at which point stacking will suddenly choose to put all its weight on $h^*$ and enjoy a rapid drop in error. However, note that in this example, the learning curve as a function of $k$ will have been flat for the long period before the error drop (as well as after it), and so our theorem implies high agreement even for small values of $k$, despite low error only being obtained for large values of $k$. More generally, if labels $y$ are bounded (say) in $[0, 1]$, then each term $\bar{R}_k$ is bounded in $[0, 1]$. Because the sequence $\bar{R}_k$ is monotonically decreasing in $k$, there can be at most $1/\alpha$ values of $k$ for which $(\bar{R}_k - \bar{R}_{2k})$ drops by at least $\alpha$ before contradicting the non-negativity of squared error.
Thus even in the worst case, there must be a value of $k \le 2^{1/\alpha}$ such that $(\bar{R}_k - \bar{R}_{2k}) \le \alpha$: a bound depending only on the desired agreement rate $\alpha$, independently of the complexity of the instance or the model class. Of course we expect a better-behaved learning curve in practice. Moreover, it is easy to empirically evaluate the actual learning curve on a holdout set. This suggests a practical prescription arising from our local learning curve bound for stacking (as well as the similar bounds we obtain for neural network and regression tree training): empirically trace out the learning curve by successively doubling $k$, estimate the errors of the stacked models on a holdout set, and choose a value of $k$ for which the local drop in error is small. Note that whatever our computational resources might be, reducing our predictive error while respecting our resource constraints and achieving predictive stability are aligned: in neither case do we want to choose a value of $k$ for which the local learning curve is steep. For predictive error, steepness of the learning curve indicates that at only modestly increased cost, we can meaningfully reduce error. Conversely, flatness of the learning curve indicates local optimality of our choice of $k$: we cannot improve error substantially, at least without significantly more computational and data resources. Our theorem shows that the same condition implies strong independent agreement bounds.

1.1.4 Gradient Boosting

Our next application, in Section 4, is to gradient boosting [Friedman et al., 2000, Friedman, 2001, Mason et al., 1999]. Concretely, gradient boosting starts with an arbitrary class of "weak" models $C$ (e.g., depth-5 regression trees), and iteratively builds up a model $f$ by finding a model $g \in C$ that has high correlation with the residuals of the existing model $f$. It then adds some scaling of $g$ to $f$, and continues to the next iterate.
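The loop just described can be sketched as follows. This is a minimal squared-loss version of our own; `weak_fit` (a stand-in for the weak-learning oracle), the learning rate, and the toy constant weak learner are our illustrative choices:

```python
import numpy as np

def gradient_boost(X, y, weak_fit, k, lr=0.5):
    """Gradient boosting for squared loss: repeatedly fit a weak model to the
    current residuals and add a scaled copy of it to the ensemble."""
    preds = np.zeros_like(y, dtype=float)
    ensemble = []
    for _ in range(k):
        r = y - preds              # residuals of the current model f
        g = weak_fit(X, r)         # weak model correlated with the residuals
        preds += lr * g(X)         # add a scaling of g to f
        ensemble.append(g)
    return lambda Z: lr * sum(g(Z) for g in ensemble)

# Toy check with the simplest possible weak learner: a constant predictor.
def weak_fit(X, r):
    c = r.mean()
    return lambda Z: np.full(len(Z), c)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 + rng.normal(scale=0.3, size=200)
f = gradient_boost(X, y, weak_fit, k=50)
# With constant weak learners, boosting recovers the label mean geometrically.
assert abs(f(X).mean() - y.mean()) < 1e-3
```

With richer weak learners (e.g., shallow regression trees in place of our constant `weak_fit`), the same loop produces the linear-span ensembles discussed next.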
Scalable implementations of gradient boosting, like XGBoost [Chen and Guestrin, 2016], have become some of the most widely used learning algorithms for tabular data. Like stacking, gradient boosting builds up a linear ensemble of base models, but unlike stacking, the models are no longer independently trained. We again take a reduced-form view of training on finite samples, mirroring the classical Statistical Query model of Kearns [1998]: we model the learning algorithm as having access to a weak-learning oracle that, given a model $f$, can return any model $g \in C$ whose covariance with the residuals of $f$ is within $\varepsilon$ of maximal on the underlying distribution, modeling both sampling and optimization error. Gradient boosting has two properties that are useful to us: it produces a model in the linear span of $C$, and it is not hard to show that (independent of any distributional assumptions) it learns a model whose MSE approaches that of the best model in the span of $C$ at a rate of $1/k$, where $k$ is the number of iterates. Accordingly, we choose to compare the error of our models to the hypothetical model $f^*$ that minimizes squared error amongst all models in the linear span of $C$. Once again, as $\bar{f}$ also lies within the linear span of $C$, $f^*$ has only lower error. Because we can show that $\mathrm{MSE}(f_1), \mathrm{MSE}(f_2) \le \mathrm{MSE}(f^*) + O((\tau^*)^2/k)$ after $k$ iterates, our core analysis establishes that $D(f_1, f_2) \le O((\tau^*)^2/k)$ as desired. Here $\tau^*$ is the norm of the MSE-optimal predictor in the span of $C$, a problem-dependent constant depending only on the underlying data distribution and $C$. We later show how to remove this dependence on $\tau^*$ by instead using a Frank-Wolfe style algorithm which controls the norm of the models $f_1, f_2$ that we learn, and hence lets us instead anchor to the optimal bounded-norm model in the span of $C$.
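A minimal sketch of a Frank-Wolfe style variant under the same reduced-form view (our own construction, not the paper's exact algorithm): the model is maintained as a convex combination of weak models scaled by $\tau$, so its norm in the span of the weak class is at most $\tau$ by construction. Here `weak_fit` again stands in for the weak-learning oracle and is assumed to return the weak model, sign included, most correlated with the residuals:

```python
import numpy as np

def frank_wolfe_boost(X, y, weak_fit, k, tau):
    """Frank-Wolfe style boosting for squared loss: keep the model a convex
    combination of weak models scaled by tau, so its norm stays at most tau."""
    preds = np.zeros_like(y, dtype=float)
    terms = []                                # (weight, model) pairs
    for t in range(k):
        r = y - preds                         # residuals of the current model
        g = weak_fit(X, r)                    # oracle: correlates with residuals
        eta = 2.0 / (t + 2)                   # standard Frank-Wolfe step size
        preds = (1 - eta) * preds + eta * tau * g(X)
        terms = [(w * (1 - eta), h) for w, h in terms] + [(eta * tau, g)]
    return lambda Z: sum(w * h(Z) for w, h in terms)

# Toy check: sign-of-mean-residual constant weak learners, tau = 1.
def weak_fit(X, r):
    s = float(np.sign(r.mean()))
    return lambda Z: np.full(len(Z), s)

X = np.zeros((100, 1))
y = np.full(100, 0.7)
f = frank_wolfe_boost(X, y, weak_fit, k=200, tau=1.0)
assert np.mean((f(X) - y) ** 2) < 0.1         # O(1/k) convergence for FW
```

Unlike the plain boosting loop, the update is a convex averaging step, which is what caps the ensemble's norm at $\tau$ regardless of the number of iterations.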
1.1.5 Neural Network and Regression Tree Training

Our final applications, in Section 5, are to neural network training with architecture search and regression tree training. For these applications, the anchor model $\bar{f}$ does not necessarily lie within the same model class as $f_1$ and $f_2$: i.e., the average of two neural networks of size $n$ is not in general itself representable as a neural network of size $n$, and the average of two regression trees of depth $d$ is not in general representable as another regression tree of depth $d$. However, the average of two neural networks of size $n$ is representable as a neural network of size $2n$, and the average of two depth-$d$ regression trees is representable as a depth-$2d$ regression tree. Just as in our stacking result, these bounds relate the disagreement of two approximately optimal models $f_1$ and $f_2$ to the local learning curve (parameterized by the number of internal nodes for neural networks, and the depth for regression trees), which means that disagreement can be driven to 0 as a function of the model complexity at a rate that depends only on the desired disagreement level and is independent of the complexity of the instance.

1.1.6 Tightness of Our Results

We show that in general, our technique yields tight bounds. Concretely, we show in Section 3.2 that the stability bound that our technique yields is tight even in constants. Recall that our upper bound for stacking was:
$$\mathbb{E}_{f_1,f_2}[D(f_1, f_2)] \le 4(\bar{R}_k - \bar{R}_{2k}).$$
We show that for every $\varepsilon \ge 0$ there is an instance for which:
$$\mathbb{E}_{f_1,f_2}[D(f_1, f_2)] \ge (4 - \varepsilon)(\bar{R}_k - \bar{R}_{2k}).$$
This establishes that our core average anchoring technique cannot be generically improved, even in the constant factor.

1.1.7 Generalizations

Finally, in Section 6 we generalize all of our results beyond one-dimensional outcomes and squared loss to multi-dimensional strongly convex losses.
This generalization requires establishing an analogue of our MSE decomposition and anchoring argument, letting us relate disagreement rates to differences in loss with the averaged anchor model. We also give a variant of our gradient boosting result for a Frank-Wolfe style optimization algorithm that iteratively builds up a linear combination of weak learners from $C$ that are restricted to have norm at most $\tau$, for any $\tau$ of our choosing. This lets us anchor to the best norm-$\tau$ model in the span of $C$, which lets us drive the disagreement error between two independently trained models to 0 at a rate of $O(\tau^2/k)$, where $k$ is the number of iterations of the algorithm. Unlike our initial gradient boosting result, which has error going to 0 at a rate of $O((\tau^*)^2/k)$ where $\tau^*$ is a problem-dependent constant not under our control, here $\tau$ is a parameter of our algorithm and we can set it however we like to trade off agreement with accuracy.

1.2 Interpreting Local Learning Curve Stability

Our results for stacking, neural network training, and regression tree training all have the form of local learning curve stability bounds: $D(f_1, f_2) \le 4(R(F_n) - R(F_{2n}))$, where $R(F_k)$ refers to the optimal error amongst models with "complexity" $k$ (parameterizing the number of models being ensembled, network size, and depth in the cases of stacking, neural networks, and regression trees respectively). These kinds of bounds are actionable and well aligned with optimizing for accuracy. They are actionable because (with enough data) it is possible to empirically plot the local learning curve by training with different parameter values, and picking $n$ such that the curve is locally flat: $R(F_n) \approx R(F_{2n})$.
This is aligned with the goal of optimizing for accuracy, since if we could substantially improve accuracy by locally increasing the complexity of the model, then in the high data regime, we should. It is also descriptive, in the sense that if we assume that most deployed models are not "leaving money on the table" in the sense of being able to substantially improve accuracy by locally increasing complexity, then we should expect stability amongst deployed models. Because the error sequence $\{R(F_n)\}_n$ is bounded from above and below and monotonically decreasing in $n$, the local learning curve is also guaranteed to "flatten out" to a value $\alpha$ for a value of $n$ that is independent of the problem complexity, and at most $2^{1/\alpha}$. However, in practice we expect the local learning curve to flatten out even more gracefully. Empirical studies of neural scaling laws [Kaplan et al., 2020, Hoffmann et al., 2022] have consistently found that across a wide variety of domains, the optimal error $R(F_n)$ decreases as a power law in model complexity: $R(F_n) \approx R^* + cn^{-\gamma}$ for some constants $c > 0$, $\gamma > 0$, and irreducible error $R^*$. Under such a power law, the gap in the local learning curve becomes: $R(F_n) - R(F_{2n}) = c(n^{-\gamma} - (2n)^{-\gamma}) = c(1 - 2^{-\gamma})n^{-\gamma} = O(n^{-\gamma})$. That is, the local learning curve gap shrinks polynomially in model complexity, which by our results implies that independent model disagreement $D(f_1, f_2)$ decreases at the same rate. Crucially, this does not require low absolute error, only that the marginal benefit of increasing complexity diminishes. The exponent $\gamma$ varies by domain (typically $0.05$–$0.5$ for large-scale neural networks), but is reliably positive.
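To make the arithmetic concrete, here is the power-law gap together with the doubling prescription from Section 1.1.3. The constants below are our own illustrative choices, not measured values:

```python
# Illustrative power-law learning curve R(F_n) ~ R* + c * n^(-gamma);
# the constants are our own choices, not empirical measurements.
c, gamma, R_star, alpha = 2.0, 0.3, 0.05, 0.02
R = lambda n: R_star + c * n ** -gamma

# The local learning-curve gap shrinks polynomially in n:
for n in (64, 256, 1024):
    gap = R(n) - R(2 * n)
    assert abs(gap - c * (1 - 2 ** -gamma) * n ** -gamma) < 1e-12

# Doubling prescription: grow n until the local gap falls below alpha.
n = 1
while R(n) - R(2 * n) > alpha:
    n *= 2
assert R(n) - R(2 * n) <= alpha        # stops here at n = 32768
```

Note that the stopping point depends only on the gap $c(1 - 2^{-\gamma})n^{-\gamma}$ falling below $\alpha$, not on the absolute error $R(F_n)$, matching the point above that low absolute error is not required.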
Our results provide theoretical grounding for empirical observations that larger models exhibit greater prediction-level consistency across independent training runs [Bhojanapalli et al., 2021, Jordan, 2024], and may help explain the surprisingly high levels of agreement observed empirically across independently trained large language models [Gorecki and Hardt, 2025].

1.3 Additional Related Work

Agreement via Interaction. A line of work inspired by Aumann [1976] aims to give interactive test-time protocols through which two models (trained initially on different observations) can arrive at (accuracy-improving) agreement. Initial work in economics [Geanakoplos and Polemarchakis, 1982] focused on exact agreement, but more recent work in computer science focused on interactions of bounded length, leading to approximate agreement of the same form that we study here [Aaronson, 2005, Frongillo et al., 2023]. This line of work focused on perfect Bayesian learners until Collina et al. [2025, 2026], Kearns et al. [2026] showed that the same kind of accuracy-improving agreement could be obtained via test-time interaction using computationally and data efficient learning algorithms.

Agreement as Variance. Our disagreement metric is (twice) the variance of the training procedure. Kur et al. [2023] show that for realizable learning problems (with mean-zero independent noise), empirical risk minimization over a fixed, convex class leads to variance that is upper bounded by the minimax rate. Our results apply to more general settings: our applications to neural networks and regression trees correspond to non-convex learning problems, our application to stacking does not correspond to optimization over a fixed class, and we do not require any realizability assumptions. Note also, the interest of Kur et al.
[2023] is to study generalization through the lens of bias/variance tradeoffs, whereas our starting point is to assume small excess risk in distribution.

Different Notions of Stability. There are many notions of stability in machine learning. Bousquet and Elisseeff [2002] give notions of leave-one-out stability and connect them to out-of-sample generalization. These notions have been influential, and many authors have proven generalization bounds via this link to stability: for example, Hardt et al. [2016] show that stochastic gradient descent is stable in this sense if only run for a small number of iterations, and Charles and Papailiopoulos [2018] study the stability of global optimizers in terms of the geometry of the loss-optimal solution. These notions of stability are different from the disagreement metric we study here. First, stability in the sense of Bousquet and Elisseeff [2002] is stability only of the loss, not of the predictions themselves, which are our interest. Second, stability in the sense of Bousquet and Elisseeff [2002] is stability with respect to adding or removing a single training example, whereas we want prediction-level stability over fully independent retraining, in which (in general) every single training example is different, just drawn from the same distribution.

Differential privacy [Dwork et al., 2006, Dwork and Roth, 2014] is a strong notion of algorithmic stability that, when applied to machine learning, requires that when one training sample is changed, the (randomized) training algorithm induces a nearby distribution on output models. Differential privacy is a much stronger stability condition than those of Bousquet and Elisseeff [2002], and similarly implies strong generalization guarantees [Dwork et al., 2015].
When the differential privacy stability parameter is taken to be sufficiently small ($\varepsilon \ll 1/\sqrt{n}$), it implies stability under resampling of the entire training set from the same distribution, as we study in our paper; this is related to what is called perfect generalization by Cummings et al. [2016]. Via this connection, differential privacy has been shown to be (information-theoretically) reducible to replicability (as defined by Impagliazzo et al. [2022]) and vice versa [Bun et al., 2023]. Replicability is a stronger condition than the kind of agreement that we study: in the context of machine learning, it requires that (under coupled random coins across the two training algorithms), the runs of two training algorithms over independently sampled training sets output exactly identical models with high probability. In contrast, we ask that two independently trained models produce numerically similar predictions on most examples. However, because replicability asks for more, it also comes with severe limitations that we avoid. Via its connection to differential privacy, there are strong separations between problems that are learnable with the constraint of replicability and without [Alon et al., 2019, Bun et al., 2020]. Even for those learning problems that are solvable replicably (e.g., learning problems solvable in the statistical query model of Kearns [1998]), standard learning algorithms for these problems are not replicable, and the computational and sample complexity of custom-designed replicable algorithms often far exceeds the complexity of non-replicable learning (see e.g. Eaton et al. [2026]). In contrast, our analyses apply to existing, popular, state-of-the-art learning algorithms (gradient boosting, and regression tree and neural network training with architecture search).
Since any model class can be used together with stacking or gradient boosting, there are no barriers to obtaining our kind of model agreement analogous to the information-theoretic barriers separating replicable from non-replicable learning. The concurrent work of Hopkins et al. [2025] is similarly motivated to ours: their goal is to relax the strict replicability definition of Impagliazzo et al. [2022] to one that requires only that two replicably trained models agree on “most inputs”, and thereby circumvent the impossibility results separating PAC learning from replicable learning. They give several definitions of approximate replicability and show that approximately replicable PAC learning has sample complexity similar to that of unconstrained PAC learning. Our approaches, results, and techniques are quite different, however. Hopkins et al. [2025] focus on binary hypothesis classes and give custom training procedures relying on shared randomness that satisfy their notion of approximate replicability. We instead focus on (multi-dimensional) regression problems and give analyses of existing, popular learning algorithms. Our training procedures do not use shared randomness.

Agreement and Ensembling. Wood et al. [2023] study the error reduction that can be obtained through ensembling methods and relate it to a notion of model disagreement that is equivalent to ours. Their interests are dual to ours: for them the goal is error reduction through explicit ensembling, and model disagreement is a means to that end; our primary goal is model agreement, and we show a general recipe for obtaining it — for us, the “ensemble” is a hypothetical object used only in the analysis of agreement for hypothesis classes (neural networks, regression trees) that are not themselves ensemble methods.
Empirical Phenomena. Empirical work quantifies prediction-level stability across retrainings via churn, per-example consistency, and related notions [Bhojanapalli et al., 2021, Bahri and Jiang, 2021, Johnson and Zhang, 2023]. Based on this, various studies show that simple procedures — e.g., ensembling or co-distillation — can increase agreement [Wang et al., 2020, Bhojanapalli et al., 2021]. However, recent work showed that fluctuations in run-to-run test accuracy can be largely explained by finite-sample effects even when the underlying predictors are similar [Jordan, 2024]. Relatedly, Somepalli et al. [2022] made the observation that, across pairs of models, independently trained neural networks often appear to produce similar decision regions despite their complexity, which raises the question of when and whether external methods to encourage agreement are even required. On top of this, Mao et al. [2024] provide evidence that training trajectories lie on a shared low-dimensional manifold in prediction space, pointing to a common structure that could underlie agreement. The latter works only characterize the prediction space based on visualizations and do not provide a formal explanation as to why agreement might occur from independent training. Gorecki and Hardt [2025] recently conducted a large empirical study of model disagreement across 50 large language models used for prediction tasks, and find that empirically they have much higher levels of agreement than one would expect if errors were made at random; our work can be viewed as giving foundations to this kind of empirical observation. Empirical agreement has also been studied through the lens of generalization. In-distribution pairwise disagreement between independently trained copies on unlabeled test data has been observed to provide an accurate estimate of test error [Jiang et al., 2022].
Moreover, a single model's pattern of predictions on the training set closely matches its behavior on the test set as distributions, indicating prediction-space stability that is distinct from inter-run agreement [Nakkiran and Bansal, 2020]. Beyond in-distribution, there are cases where even out-of-distribution pairwise agreement scales linearly with in-distribution agreement across many shifts [Baek et al., 2022]. None of these works provide prediction-space conditions or rates under which independently trained models will agree in the first place. A complementary line of work focusing on weight space shows that many independently trained solutions can be connected by low-loss paths [Garipov et al., 2018, Draxler et al., 2018]. Even when solutions are not trivially aligned, applying neuron permutations can align them, enabling low-loss interpolation [Entezari et al., 2022, Ainsworth et al., 2023]. It can be shown that their layers are stitchable or exhibit layer-wise linear feature connectivity [Bansal et al., 2021, Zhou et al., 2023], which is consistent with a connected region once permutation symmetries are accounted for. These techniques are post-hoc observations about weight or parameter space and do not provide ex ante, prediction-space guarantees or quantitative rates that independent training will agree without alignment. Closer to prediction-space theory, the neural tangent kernel findings characterize how a model's predictive function evolves under gradient descent [Jacot et al., 2018, Lee et al., 2019]. However, these analyses focus on a single training trajectory, primarily analyze the infinite-width regime, and do not directly address whether independently trained models will agree.
Our work seeks conditions under which standard training directly yields approximate agreement “out of the box,” bypassing parameter-space alignment and establishing stability in prediction space itself.

2 Preliminaries and Midpoint Anchoring Lemmas

We consider a setting in which we train two models on independently drawn datasets. Let $X \subseteq \mathbb{R}^d$ be the data domain and $Y \subseteq \mathbb{R}$ be the label domain. We assume access to datasets $S = ((x_i, y_i))_{i=0}^{n-1}$ that are independently drawn from a joint distribution $P$ on $X \times Y$. Note that unless otherwise stated, all expectations will be with respect to $x, y \sim P$, or where appropriate just the marginal over $x$. A model is then defined as a function $f : X \to Y$. We define the norm $\|f\| := \mathbb{E}[f(x)^2]^{1/2}$. With this we define the mean squared error objective and the corresponding population risk

$$\mathrm{MSE}(f) = \mathbb{E}[(y - f(x))^2], \qquad R(\mathcal{F}) := \inf_{f \in \mathcal{F}} \mathrm{MSE}(f).$$

We next define the disagreement between two models as their expected squared difference.

Definition 2.1 (Disagreement). For any two functions $f_1 : X \to Y$ and $f_2 : X \to Y$, we define the expected disagreement between them as

$$D(f_1, f_2) := \mathbb{E}[(f_1(x) - f_2(x))^2].$$

We are now ready to state and prove a simple identity that will form the backbone of our analyses. It relates the disagreement between two models to the degree to which their errors could be improved by averaging the models.

Lemma 2.2 (Midpoint identity for squared loss). For any two functions $f_1 : X \to Y$ and $f_2 : X \to Y$, let $\bar f(x) := \frac{1}{2}(f_1(x) + f_2(x))$. Then

$$D(f_1, f_2) = 2\left(\mathrm{MSE}(f_1) + \mathrm{MSE}(f_2) - 2\,\mathrm{MSE}(\bar f)\right).$$

Proof. Let $r_i(x) := f_i(x) - y$ for $i \in \{1, 2\}$. Then $\bar f(x) - y = \frac{1}{2}(r_1(x) + r_2(x))$ and $f_1(x) - f_2(x) = r_1(x) - r_2(x)$. Expanding squares and using linearity of expectation gives

$$\mathbb{E}[(r_1 - r_2)^2] = \mathbb{E}[r_1^2] + \mathbb{E}[r_2^2] - 2\,\mathbb{E}[r_1 r_2].$$
On the other hand,

$$\mathbb{E}\left[\left(\tfrac{1}{2}(r_1 + r_2)\right)^2\right] = \tfrac{1}{4}\,\mathbb{E}[(r_1 + r_2)^2] = \tfrac{1}{4}\,\mathbb{E}[r_1^2] + \tfrac{1}{4}\,\mathbb{E}[r_2^2] + \tfrac{1}{2}\,\mathbb{E}[r_1 r_2].$$

Therefore,

$$2\left(\mathbb{E}[r_1^2] + \mathbb{E}[r_2^2] - 2\,\mathbb{E}\left[\left(\tfrac{1}{2}(r_1 + r_2)\right)^2\right]\right) = 2\left(\mathbb{E}[r_1^2] + \mathbb{E}[r_2^2]\right) - 4\left(\tfrac{1}{4}\,\mathbb{E}[r_1^2] + \tfrac{1}{4}\,\mathbb{E}[r_2^2] + \tfrac{1}{2}\,\mathbb{E}[r_1 r_2]\right) = \mathbb{E}[r_1^2] + \mathbb{E}[r_2^2] - 2\,\mathbb{E}[r_1 r_2] = \mathbb{E}[(r_1 - r_2)^2].$$

Substituting back $\mathbb{E}[r_i^2] = \mathrm{MSE}(f_i)$ and $\mathbb{E}[(\tfrac{1}{2}(r_1 + r_2))^2] = \mathrm{MSE}(\bar f)$ yields the claim.

A useful corollary of this identity is that we can upper bound the disagreement between two models by the degree to which they are sub-optimal relative to the best model in any family that contains their average.

Corollary 2.3 (Disagreement via the midpoint anchor). For any two functions $f_1, f_2 : X \to Y$, let $\bar f(x) := \frac{1}{2}(f_1(x) + f_2(x))$. If $\bar f \in \mathcal{H}$ for some class of predictors $\mathcal{H}$, then

$$D(f_1, f_2) \le 2\left(\mathrm{MSE}(f_1) - R(\mathcal{H})\right) + 2\left(\mathrm{MSE}(f_2) - R(\mathcal{H})\right).$$

Proof. By Lemma 2.2, we have $D(f_1, f_2) = 2(\mathrm{MSE}(f_1) + \mathrm{MSE}(f_2) - 2\,\mathrm{MSE}(\bar f))$. If $\bar f \in \mathcal{H}$ then $\mathrm{MSE}(\bar f) \ge R(\mathcal{H})$, so substituting yields the claim.

If the model class from which $f_1$ and $f_2$ were trained contains their average, then we can relate the disagreement between $f_1$ and $f_2$ to the sub-optimality of the loss of $f_1$ and $f_2$ relative to the global optimum within the class in which they were trained. However, non-convex model classes will not satisfy this closure-under-averaging property. To analyze these classes it is useful to consider local learning-curve bounds with respect to a hierarchy of model classes, such that each level $\mathcal{F}_{2n}$ in the hierarchy is expressive enough to represent the average of any pair of models in $\mathcal{F}_n$.
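As a quick numerical sanity check (an illustration, not part of the paper's development): because the midpoint identity of Lemma 2.2 holds pointwise for each sample, it also holds exactly when all population expectations are replaced by empirical averages over synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic data: x ~ N(0,1), y = 2x + noise; f1, f2 are two arbitrary predictors.
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)
f1 = 1.8 * x + 0.1        # predictions of model 1
f2 = 2.1 * x - 0.2        # predictions of model 2
fbar = 0.5 * (f1 + f2)    # midpoint predictor

mse = lambda f: np.mean((y - f) ** 2)
D = np.mean((f1 - f2) ** 2)                  # empirical disagreement D(f1, f2)

# Lemma 2.2: D(f1, f2) = 2(MSE(f1) + MSE(f2) - 2 MSE(fbar))
rhs = 2 * (mse(f1) + mse(f2) - 2 * mse(fbar))
assert abs(D - rhs) < 1e-9    # identity holds exactly, up to float rounding
```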
We will see that this property is satisfied by neural networks (where $n$ parametrizes the number of internal nodes) and regression trees (where $n$ parametrizes the depth).

Lemma 2.4 (Local learning-curve bound from midpoint closure). Let $(\mathcal{F}_n)_{n \ge 1}$ be a nested sequence of predictor classes and assume that for every $n$ and every $f_1, f_2 \in \mathcal{F}_n$, the midpoint predictor $\bar f := \frac{1}{2}(f_1 + f_2)$ lies in $\mathcal{F}_{2n}$. Fix $n \ge 1$ and suppose $f_1, f_2 \in \mathcal{F}_n$ satisfy $\mathrm{MSE}(f_i) \le R(\mathcal{F}_n) + \varepsilon$ for $i \in \{1, 2\}$. Then

$$D(f_1, f_2) \le 4\left(R(\mathcal{F}_n) - R(\mathcal{F}_{2n}) + \varepsilon\right).$$

Proof. By midpoint closure we have $\bar f \in \mathcal{F}_{2n}$, so Corollary 2.3 with $\mathcal{H} = \mathcal{F}_{2n}$ gives

$$D(f_1, f_2) \le 2\left(\mathrm{MSE}(f_1) - R(\mathcal{F}_{2n})\right) + 2\left(\mathrm{MSE}(f_2) - R(\mathcal{F}_{2n})\right).$$

Using $\mathrm{MSE}(f_i) \le R(\mathcal{F}_n) + \varepsilon$ for both $i$ yields the claim.

In the following sections, we apply Corollary 2.3 and Lemma 2.4 by verifying that the midpoint predictor lies in an appropriate hypothesis class.

3 Warm-up Application: Stacking

Stacking is an ensembling method which first trains $k$ independent base models in some arbitrary fashion and then uses linear regression over these base models to combine their predictions. Let $Q$ be a probability distribution on models of the form $g : X \to \mathbb{R}$. Concretely, $Q$ could represent the law of a base predictor obtained by training a fixed learning algorithm $M$ on a random shard of the training sample of size $n/k$, with a fresh i.i.d. draw of examples and fresh algorithmic randomness; independent draws from $Q$ correspond to training $M$ on independent shards. We remark in passing that other interpretations of $Q$ also make sense. For example, perhaps all parties share the same training set (because, e.g., it is the training set for a standard benchmark dataset like ImageNet).
Then there is no need to have different models trained on different shards, and $Q$ can represent only the randomness of the training procedure, which might re-use samples in arbitrary ways.

Algorithm 1 Ensembling via Stacking
Input: black-box learning algorithm $M : \mathcal{G} \to \mathcal{H}$, dataset $D \sim P^n$ of size $n$, number of shards $k$
  Randomly split $D$ into $k$ disjoint shards $G_i$, each of size $|G_i| = \lfloor n/k \rfloor$
  for $i \in [k]$ do
    $g_i \leftarrow M(G_i)$
  end
  $f \leftarrow \mathrm{OLS}(g_1, \ldots, g_k)$
  return $f$

We will analyze the population least squares predictor over the span of these base models. That is, we sample $k$ models $G = \{g_1, \ldots, g_k\} \sim Q^k$ and define $V(G)$ to be the linear span of the sampled models in $G$. We will consider the predictor

$$\arg\min_{f \in V(G)} \mathrm{MSE}(f).$$

Note that this is just a finite-dimensional least squares problem, so a minimizer exists, and multiset multiplicities do not affect the span $V(G)$. For $t \in \mathbb{N}$, let $R_t$ denote the random variable $R(G)$ when $G = \{g_1, \ldots, g_t\}$ with $g_1, \ldots, g_t$ drawn i.i.d. from $Q$, and write the shorthand $\bar R_t := \mathbb{E}_{\{g_1, \ldots, g_t\} \sim Q^t}[R_t]$.

3.1 An Agreement Upper Bound

We instantiate our agreement upper bound for stacking using the midpoint anchoring lemma. In this case we compare $f_1$ and $f_2$ to the risk $R(G^*)$, where $G^* := G \cup G'$ is the union of the base models used in training $f_1$ and $f_2$. Here $f_1$ is the MSE minimizer over the span of the base models $G = \{g_1, \ldots, g_k\}$ and $f_2$ is the MSE minimizer over the span of the base models $G' = \{g'_1, \ldots, g'_k\}$. We know that $V(G), V(G') \subseteq V(G \cup G')$, and that the midpoint predictor $\frac{1}{2}(f_1 + f_2)$ lies in $V(G \cup G')$. This, together with the fact that the set of $2k$ models in $G \cup G'$ is exchangeable, lets us prove the following agreement bound:

Theorem 3.1 (Agreement for Stacked Aggregation). Let $G = \{g_1, \ldots, g_k\}$ i.i.d. $\sim Q^k$ and $G' = \{g'_1, \ldots, g'_k\}$ i.i.d.
$\sim Q^k$ be independent. Define $f_1, f_2$ as follows:

$$f_1 = \arg\min_{f \in V(G)} \mathrm{MSE}(f), \qquad f_2 = \arg\min_{f \in V(G')} \mathrm{MSE}(f).$$

Then we have that

$$\mathbb{E}_{f_1, f_2}\left[D(f_1, f_2)\right] \le 4\left(\bar R_k - \bar R_{2k}\right).$$

Proof. Fix realizations of $G$ and $G'$, and let $G^* = G \cup G'$ (multiset union). Throughout this section we will think of $G, G' \sim Q^k$, unless explicitly conditioned. Note that $V(G) \subseteq V(G^*)$ and $V(G') \subseteq V(G^*)$. In our proofs, without loss of generality, we will use the notation $h_G$ to denote the least squares minimizer with respect to the subspace spanned by $G$; in our theorem statements, this corresponds to $f_1$, but we use this notation in our proofs for the sake of clarity. Let $\bar h := \frac{1}{2}(h_G + h_{G'})$. Since $h_G \in V(G)$ and $h_{G'} \in V(G')$ and $V(G), V(G') \subseteq V(G^*)$, we have $\bar h \in V(G^*)$. Applying Corollary 2.3 with $f_1 = h_G$, $f_2 = h_{G'}$, and $\mathcal{H} = V(G^*)$, and using $\mathrm{MSE}(h_G) = R(G)$, $\mathrm{MSE}(h_{G'}) = R(G')$, and $R(V(G^*)) = R(G^*)$, we have the pointwise inequality

$$\|h_G - h_{G'}\|^2 \le 2\left(R(G) - R(G^*)\right) + 2\left(R(G') - R(G^*)\right). \tag{1}$$

We now take expectations over $G, G', G^*$ to relate the two terms on the RHS of Equation (1). Conditional on $G^*$, we can generate the pair $(G, G')$ by drawing a uniformly random permutation $\pi$ of $\{1, \ldots, 2k\}$ and letting $G$ be the first $k$ permuted elements of $G^*$ and $G'$ the remaining $k$. This holds because the $2k$ base models in $G^*$ arise from $2k$ i.i.d. draws from $Q$, and the joint law of $(G, G')$ is exchangeable under permutations of these $2k$ draws. Conditioning on the unordered multiset $G^*$, $(G, G')$ is a uniformly random partition into two $k$-submultisets. Therefore, taking the conditional expectation of (1) given $G^*$ and using symmetry of $G$ and $G'$,

$$\mathbb{E}_{(G, G')\,|\,G^*}\left[\|h_G - h_{G'}\|^2 \,\middle|\, G^*\right] \le 4\left(\mathbb{E}_{(G, G')\,|\,G^*}\left[R(G) \,\middle|\, G^*\right] - R(G^*)\right). \tag{2}$$

We now integrate (2) over $G^*$.
We claim that

$$\mathbb{E}_{G^*}\left[\mathbb{E}_{(G, G')\,|\,G^*}\left[R(G) \,\middle|\, G^*\right]\right] = \bar R_k \qquad \text{and} \qquad \mathbb{E}_{G^*}\left[R(G^*)\right] = \bar R_{2k}. \tag{3}$$

The second equality is immediate from the definition of $\bar R_{2k}$, since $G^*$ is a collection of $2k$ i.i.d. draws from $Q$. For the first equality in (3), let $U$ be a uniformly random $k$-subset of $\{1, \ldots, 2k\}$, independent of the draws $\{g_1, \ldots, g_{2k}\}$ i.i.d. $\sim Q^{2k}$, and define $G_U := \{g_i\}_{i \in U}$. By the conditional description above, $\mathbb{E}_{(G, G')\,|\,G^*}[R(G) \,|\, G^*] = \mathbb{E}_U[R(G_U) \,|\, G^*]$, so

$$\mathbb{E}_{G^*}\left[\mathbb{E}_{(G, G')\,|\,G^*}\left[R(G) \,\middle|\, G^*\right]\right] = \mathbb{E}_{G^*}\left[\mathbb{E}_U\left[R(G_U) \,\middle|\, G^*\right]\right] = \mathbb{E}_{G^*, U}\left[R(G_U)\right].$$

For any fixed $U$, the subcollection $\{g_i\}_{i \in U}$ consists of $k$ i.i.d. draws from $Q$ (since the full family is i.i.d. and $U$ is independent of the draws); hence averaging over $U$ yields $\mathbb{E}_{G^*, U}[R(G_U)] = \bar R_k$, proving (3). Finally, taking expectations in (2) and substituting (3) gives

$$\mathbb{E}_{G, G'}\left[\|h_G - h_{G'}\|^2\right] \le 4\left(\bar R_k - \bar R_{2k}\right),$$

which is the desired bound.

Note that Theorem 3.1 depends on the slope of the local learning curve at $k$: $(\bar R_k - \bar R_{2k})$. This is a strength; dependence on the global learning curve $(\bar R_k - R_\infty)$ would be significantly weaker. To see this, note that if $Q$ contained only a single “good model” with arbitrarily small weight, the global learning curve could fail to flatten out for arbitrarily large $k$. On the other hand, simply by monotonicity, for any value of $\alpha$, if labels are bounded in (say) $[0, 1]$ then there must be a value of $k \le 2^{1/\alpha}$ such that $(\bar R_k - \bar R_{2k}) \le \alpha$ (as error can drop by $\alpha$ at most $1/\alpha$ times before contradicting the non-negativity of squared error). While this depends exponentially on $\alpha$, it is independent of the dimensionality or complexity of the instance, in contrast to bounds depending on the global learning curve.

3.2 Stacking Lower Bound

Theorem 3.1 gives an upper bound with constant 4.
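Theorem 3.1 is easy to observe in simulation. The sketch below is an illustrative instantiation under simplifying assumptions not made in the paper: the base-model distribution $Q$ is the law of an OLS fit on a fresh shard of a linear-Gaussian problem, so that all population risks and disagreements have closed forms (for a linear predictor $v$ under $x \sim N(0, I)$, the population MSE is $\|v - w\|^2 + \sigma^2$, and stacking reduces to projecting $w$ onto the span of the base coefficient vectors):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, k, sigma = 5, 8, 3, 1.0    # dimension, shard size, models per run, noise
w = rng.normal(size=d)           # target: y = <w, x> + noise, x ~ N(0, I)

def base_model():
    """One draw from Q: OLS fit on a fresh shard of m samples."""
    X = rng.normal(size=(m, d))
    y = X @ w + sigma * rng.normal(size=m)
    return np.linalg.lstsq(X, y, rcond=None)[0]

def stack(vs):
    """Population least squares over span{v_1, ..., v_t}: project w onto it."""
    B = np.stack(vs, axis=1)
    coef, *_ = np.linalg.lstsq(B, w, rcond=None)
    return B @ coef

def risk(v):
    """Population MSE of the linear predictor x -> <v, x>."""
    return np.sum((v - w) ** 2) + sigma ** 2

T, D, Rk, R2k = 500, [], [], []
for _ in range(T):
    G = [base_model() for _ in range(k)]
    Gp = [base_model() for _ in range(k)]
    f1, f2 = stack(G), stack(Gp)
    D.append(np.sum((f1 - f2) ** 2))        # D(f1, f2) under x ~ N(0, I)
    Rk.append((risk(f1) + risk(f2)) / 2)    # estimates Rbar_k
    R2k.append(risk(stack(G + Gp)))         # estimates Rbar_2k

lhs, rhs = np.mean(D), 4 * (np.mean(Rk) - np.mean(R2k))
assert lhs <= rhs + 1e-8    # Theorem 3.1: E[D(f1, f2)] <= 4(Rbar_k - Rbar_2k)
```

The final inequality is in fact guaranteed here realization by realization, since inequality (1) in the proof holds pointwise.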
We now show that this factor cannot be improved in general: for every fixed $k$ and every $\varepsilon > 0$, there exists a data distribution $P$ and a distribution $Q$ over base models such that two independent stacking runs have disagreement at least $(4 - \varepsilon)$ times the gap $\bar R_k - \bar R_{2k}$.

Theorem 3.2 (Near-tightness of the factor 4). Fix an integer $k \ge 1$. For every $\varepsilon > 0$, there exists a data distribution $P$ and a distribution $Q$ over base models such that if $G, G' \sim Q^k$ are independent $k$-tuples drawn i.i.d. and

$$f_1 = \arg\min_{f \in V(G)} \mathrm{MSE}(f), \qquad f_2 = \arg\min_{f \in V(G')} \mathrm{MSE}(f),$$

then

$$\mathbb{E}_{f_1, f_2}\left[D(f_1, f_2)\right] \ge (4 - \varepsilon)\left(\bar R_k - \bar R_{2k}\right).$$

Proof. Fix $k \ge 1$ and $\varepsilon > 0$. Since the claim is weaker for larger $\varepsilon$, we may assume $\varepsilon \in (0, 1]$. We work in a real Hilbert space $H$ (equivalently $H = L^2(P)$ for a suitable data distribution $P$; see Footnote 1) with an orthonormal family $\{e_0, \ldots, e_m\}$, where $m \in \mathbb{N}$ will be chosen later, and set the target $y := e_0$. We construct base models that are “noisy versions” of the target. Fix $\sigma > 0$ and define

$$g_i := e_0 + \sigma e_i, \qquad i = 1, \ldots, m.$$

Let $Q$ be the uniform distribution over $\{g_1, \ldots, g_m\}$. First, we analyze the predictor and risk for a fixed set of distinct base models. Let $H$ be a multiset of draws from $Q$. Let $S(H)$ be the set of distinct indices of base models in $H$, and let $r(H) = |S(H)|$. By symmetry, the least-squares predictor $f_H \in V(H)$ assigns equal weight to each distinct $g_i \in H$. A straightforward calculation shows that the optimal weights are $1/(r(H) + \sigma^2)$, yielding:

$$f_H = \sum_{i \in S(H)} \frac{1}{r(H) + \sigma^2}\, g_i = \frac{r(H)}{r(H) + \sigma^2}\, e_0 + \frac{\sigma}{r(H) + \sigma^2} \sum_{i \in S(H)} e_i, \tag{4}$$

$$R(H) = \|y - f_H\|^2 = \frac{\sigma^2}{r(H) + \sigma^2}. \tag{5}$$

In particular, for $G, G' \sim Q^k$ drawn i.i.d., we have $f_1 = f_G$, $f_2 = f_{G'}$, and $R(G) = \mathrm{MSE}(f_1)$, $R(G') = \mathrm{MSE}(f_2)$. Next, we analyze the disagreement and risk drop on the event where all sampled models are distinct.
Let $E$ be the event that the $2k$ draws in $G \cup G'$ are all distinct. On this event, $r(G) = k$, $r(G') = k$, and $r(G \cup G') = 2k$. Using (5), the drop in risk on event $E$ is:

$$\Delta_0 := R(G) - R(G \cup G') = \frac{\sigma^2}{k + \sigma^2} - \frac{\sigma^2}{2k + \sigma^2}. \tag{6}$$

Using (4) and the fact that $G$ and $G'$ share no indices on $E$ (and thus the $e_0$ coefficients are identical and cancel), the disagreement is:

$$D_0 := \|f_G - f_{G'}\|^2 = \left\|\frac{\sigma}{k + \sigma^2}\left(\sum_{i \in S(G)} e_i - \sum_{j \in S(G')} e_j\right)\right\|^2 = \frac{2k\sigma^2}{(k + \sigma^2)^2}. \tag{7}$$

Comparing these quantities, we see that for small $\sigma$:

$$\frac{D_0}{\Delta_0} = 4 - \frac{2\sigma^2}{k + \sigma^2} \;\longrightarrow\; 4 \quad \text{as } \sigma \to 0. \tag{8}$$

Finally, we handle the expectations by showing that the event $E$ dominates. The probability of a collision among the $2k$ uniform draws from $m$ items is at most $\binom{2k}{2}/m$, and hence $\Pr(E) \ge 1 - \binom{2k}{2}\frac{1}{m}$. Since disagreement is always non-negative:

$$\mathbb{E}_{G, G'}\left[D(f_1, f_2)\right] \ge \Pr(E)\, D_0. \tag{9}$$

For the expected risk drop, we upper bound the risk when collisions occur. The risk $R(H)$ is maximized when $r(H)$ is minimized (i.e., $r(H) = 1$), and is hence bounded by $R_{\max} = \sigma^2/(1 + \sigma^2)$. The expected risk is:

$$\bar R_k = \Pr[r(G) = k]\,\frac{\sigma^2}{k + \sigma^2} + \mathbb{E}\left[R(G)\,\mathbb{I}(r(G) < k)\right] \le \frac{\sigma^2}{k + \sigma^2} + \Pr\left(r(G) < k\right) R_{\max} \le \frac{\sigma^2}{k + \sigma^2} + \binom{k}{2}\frac{1}{m}\, R_{\max}.$$

On the other hand, since $r(G \cup G') \le 2k$ always, we have the deterministic lower bound $\bar R_{2k} \ge \frac{\sigma^2}{2k + \sigma^2}$. Combining these, the expected drop satisfies:

$$\bar R_k - \bar R_{2k} \le \Delta_0 + \frac{k^2}{2m}\cdot\frac{\sigma^2}{1 + \sigma^2}.$$

Now choose $\sigma^2 = (\varepsilon/8)k$ so that (8) gives $D_0 \ge (4 - \varepsilon/4)\,\Delta_0$. Choosing $m \ge \lceil 96 k^3/\varepsilon \rceil$ makes $\Pr(E)$ close to 1 and the collision term in the bound on $\bar R_k - \bar R_{2k}$ negligible compared to $\Delta_0$. Combining (9) with the upper bound on $\bar R_k - \bar R_{2k}$ then yields $\mathbb{E}_{f_1, f_2}[D(f_1, f_2)] \ge (4 - \varepsilon)(\bar R_k - \bar R_{2k})$.

Footnote 1: For example, take $X = \{0, 1, \ldots, m\}$ and let $P$ be uniform on $X$. Defining $e_j(x) = \sqrt{m + 1}\,\mathbb{I}\{x = j\}$ gives an orthonormal family $\{e_0, \ldots, e_m\} \subseteq L^2(P)$.
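The closed forms in the construction above can be checked directly. The sketch below (illustrative parameter values) realizes the orthonormal family as standard basis vectors of $\mathbb{R}^{m+1}$, builds the least-squares predictors on disjoint index sets (the event $E$), and confirms (5), (7), and (8):

```python
import numpy as np

k, sigma, m = 3, 0.5, 50          # example parameters; 2k <= m so E can occur
s2 = sigma ** 2

# Orthonormal family e_0..e_m as standard basis vectors; target y = e_0.
e = np.eye(m + 1)
y = e[0]
g = lambda i: e[0] + sigma * e[i]  # base model g_i = e_0 + sigma * e_i

def least_squares_over_span(idxs):
    """Population least squares of y onto span{g_i : i in idxs}."""
    B = np.stack([g(i) for i in idxs], axis=1)
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return B @ coef

S, Sp = range(1, k + 1), range(k + 1, 2 * k + 1)   # disjoint index sets
fG = least_squares_over_span(S)
fGp = least_squares_over_span(Sp)
fU = least_squares_over_span(list(S) + list(Sp))   # union of the 2k models

# (5): R(H) = sigma^2 / (r(H) + sigma^2)
assert np.isclose(np.sum((y - fG) ** 2), s2 / (k + s2))
assert np.isclose(np.sum((y - fU) ** 2), s2 / (2 * k + s2))

# (7): D_0 = 2 k sigma^2 / (k + sigma^2)^2
D0 = np.sum((fG - fGp) ** 2)
assert np.isclose(D0, 2 * k * s2 / (k + s2) ** 2)

# (8): D_0 / Delta_0 = 4 - 2 sigma^2 / (k + sigma^2)
Delta0 = s2 / (k + s2) - s2 / (2 * k + s2)
assert np.isclose(D0 / Delta0, 4 - 2 * s2 / (k + s2))
```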
4 Gradient Boosting

In this section we apply our midpoint anchoring argument to gradient boosting, an algorithm that iteratively builds up an ensemble model by repeatedly choosing a weak learner $g \in C$ that correlates with the residual of the current ensemble model and then adding $g$ to it. Unlike stacking, the models that make up two independently trained ensembles $f_1$ and $f_2$ are not exchangeable, since the weak learners are not selected independently, but rather adaptively, in a path-dependent way. Nevertheless, we show that we can apply midpoint anchoring to drive disagreement to 0 at a $1/k$ rate (where $k$ is the number of iterations of gradient boosting). Here we abstract away finite-sample issues by modeling our weak learning algorithm in the style of an SQ oracle [Kearns, 1998] — i.e., rather than obtaining the $g \in C$ which exactly maximizes covariance with the residuals of our current model, it can return any $g \in C$ that is an $\varepsilon$-approximate maximizer. This models, e.g., solving an ERM problem over any sample that is sufficient for $\varepsilon$-approximate uniform convergence over $C$. We assume for simplicity that our weak learning class $C$ satisfies the following mild regularity conditions (which are enforceable if necessary): symmetry ($g \in C \Rightarrow -g \in C$), normalization ($\|g\| \le 1$ for all $g \in C$), and non-degeneracy ($0 \notin C$). We will use the normalization condition with respect to both the atomic and Euclidean norms, which can be enforced by dividing each original (unnormalized) function by the maximum of its atomic norm, its Euclidean norm, and 1. Note that in this section, for the sake of clarity, we will use the standard inner product $\langle f, g \rangle = f^\top g$; when we write $\|f\|$ it will still correspond to the norm $(\mathbb{E}[f(x)^2])^{1/2}$ defined in the Preliminaries. When needed, we will explicitly mention the expectations we are computing.
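As an empirical illustration of the boosting procedure just described (under simplifying assumptions not made in the paper: population expectations are replaced by empirical averages, the weak-learner class $C$ is taken to be signed, normalized coordinate functions, and the oracle is exact, i.e. $\varepsilon_t = 0$), two independent runs can be compared directly via the midpoint anchor:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 4000, 10, 300
w = rng.normal(size=d)

def sample(m):
    X = rng.normal(size=(m, d))
    return X, X @ w                               # noiseless linear target

def boost(X, y, k):
    """Boosting over C = {+/- normalized coordinate functions}: at each step,
    pick the weak learner most correlated with the residual (exact oracle)
    and take an exact line-search step. Returns a coefficient vector."""
    scale = np.sqrt(np.mean(X ** 2, axis=0))      # make ||g|| = 1 empirically
    c = np.zeros(d)
    for _ in range(k):
        r = y - X @ c                             # residual r_{t-1}
        corr = np.mean(X * r[:, None], axis=0) / scale
        j = np.argmax(np.abs(corr))               # symmetry: -g is in C too
        c[j] += corr[j] / scale[j]                # alpha_t = <r, g>/||g||^2
    return c

X1, y1 = sample(n)
X2, y2 = sample(n)
c1, c2 = boost(X1, y1, k), boost(X2, y2, k)       # two independent runs

Xt, yt = sample(50_000)                           # fresh test set
f1, f2, fbar = Xt @ c1, Xt @ c2, Xt @ ((c1 + c2) / 2)
mse = lambda f: np.mean((yt - f) ** 2)
D = np.mean((f1 - f2) ** 2)

# Lemma 2.2 identity, and the anchoring bound with R_V(C) >= 0:
assert np.isclose(D, 2 * (mse(f1) + mse(f2) - 2 * mse(fbar)))
assert D <= 2 * mse(f1) + 2 * mse(f2) + 1e-12
```

The final inequality is exactly the first display of Theorem 4.5 below, specialized to a setting where the best model in the weak-learner span has zero risk.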
We model weak learning via an $\varepsilon$-approximate SQ-style oracle: at iteration $t$, the oracle returns any $g_t \in C$ such that

$$\mathbb{E}[\langle r_{t-1}(x), g_t(x) \rangle] \ge \sup_{g \in C} \mathbb{E}[\langle r_{t-1}(x), g(x) \rangle] - \varepsilon_t, \qquad r_{t-1} := y - f_{t-1}.$$

Algorithm 2 Gradient Boosting
Input: SQ-oracle for weak learner class $C$
  $f_0 \equiv 0$, $G_0 = \emptyset$
  for $t \in [k]$ do
    $r_{t-1} := y - f_{t-1}$
    Choose $g_t \in C$ with $\mathbb{E}[\langle r_{t-1}(x), g_t(x) \rangle] \ge \sup_{g \in C} \mathbb{E}[\langle r_{t-1}(x), g(x) \rangle] - \varepsilon_t$  (SQ-oracle)
    $\alpha_t := \arg\min_{\alpha \in \mathbb{R}} \mathbb{E}[(r_{t-1}(x) - \alpha g_t(x))^2] = \mathbb{E}[\langle r_{t-1}(x), g_t(x) \rangle]/\|g_t\|^2$
    $f_t := f_{t-1} + \alpha_t g_t$; set $G_t := G_{t-1} \cup \{g_t\}$
  end
  return $f_k$ and $G := G_k$

Algorithm 2 provides the details of how to use this oracle within the gradient boosting procedure. We will be interested in comparing the MSE of the gradient boosting iterates with the risk of the best model in the span of the weak learner class, $R_{V(C)} := \inf_{f \in V(C)} \mathrm{MSE}(f)$. We will bound the disagreement of two independently trained models $f_1$ and $f_2$ by anchoring to the best model $f^*$ in the span of the weak learner class $C$, and then applying our anchoring lemma from Section 2. Since anchoring bounds disagreement in terms of each model's error gap to $f^*$, it remains to upper bound that gap. We do so below, starting by bounding the single-step error improvement of gradient boosting.

Lemma 4.1 (Single Iterate Progress). With $\alpha_t = \arg\min_{\alpha \in \mathbb{R}} \|r_{t-1} - \alpha g_t\|^2$ and $\|g_t\| \le 1$,

$$\mathrm{MSE}(f_{t-1}) - \mathrm{MSE}(f_t) \ge \mathbb{E}[\langle r_{t-1}(x), g_t(x) \rangle]^2.$$

Proof.
Note that

$$\mathrm{MSE}(f_{t-1}) - \mathrm{MSE}(f_t) = \|r_{t-1}\|^2 - \|r_t\|^2 = \|r_{t-1}\|^2 - \|r_{t-1} - \alpha_t g_t\|^2.$$

By exact line search,

$$\|r_{t-1} - \alpha_t g_t\|^2 = \min_\alpha \|r_{t-1} - \alpha g_t\|^2 = \min_\alpha \left(\|r_{t-1}\|^2 - 2\alpha\,\mathbb{E}[\langle r_{t-1}(x), g_t(x) \rangle] + \alpha^2 \|g_t\|^2\right) = \|r_{t-1}\|^2 - 2\,\mathbb{E}[\langle r_{t-1}(x), g_t(x) \rangle]^2/\|g_t\|^2 + \mathbb{E}[\langle r_{t-1}(x), g_t(x) \rangle]^2/\|g_t\|^2 = \|r_{t-1}\|^2 - \mathbb{E}[\langle r_{t-1}(x), g_t(x) \rangle]^2/\|g_t\|^2.$$

Therefore, we have that $\mathrm{MSE}(f_{t-1}) - \mathrm{MSE}(f_t) = \mathbb{E}[\langle r_{t-1}(x), g_t(x) \rangle]^2/\|g_t\|^2$. Using $\|g_t\| \le 1$ gives the stated bound.

Now, for a radius $\tau > 0$, define the corresponding scaled convex hull $K_\tau := \tau\,\mathrm{conv}(C)$. Let $f^* \in V(C)$ be the population least-squares minimizer over the span of the weak learning class. Define the corresponding atomic norm radius $\tau^* := \|f^*\|_A$, where the atomic norm induced by $C$ is

$$\|f\|_A := \inf\Big\{ \sum_{j=1}^\infty |\alpha_j| \;:\; f = \lim_{k \to \infty} \sum_{j=1}^k \alpha_j g_j,\; g_j \in C,\; \sum_{j=1}^\infty |\alpha_j| < \infty \Big\}.$$

That is, $\tau^*$ corresponds to the smallest total weight needed to represent $f^*$ within the weak learner class. We have now bounded the single-step MSE improvement in terms of the square of the maximum correlation of the residual of the current model with a model in the weak learner class. Next, we will lower bound the largest possible correlation between the residuals of a model $f$ and a function in the weak learner class in terms of the difference between the MSE of the current model $f$ and the error of the best model in the span of the weak learners, scaled by the atomic norm of $f^*$.

Lemma 4.2 (Correlation Lower Bound w.r.t. Weak Learning Anchor Gap). For any $f$, writing $M(f) := \sup_{g \in C} |\mathbb{E}[\langle y - f, g \rangle]|$, we have

$$M(f) \ge \frac{\mathrm{MSE}(f) - R_{V(C)}}{2\tau^*}.$$

Proof. Recall that $K_{\tau^*} := \tau^*\,\mathrm{conv}(C)$. Its support function is $\sigma_{K_{\tau^*}}(u) := \sup_{s \in K_{\tau^*}} \mathbb{E}[\langle u, s \rangle] = \tau^* \sup_{g \in \pm C} \mathbb{E}[\langle u, g \rangle]$.
We will ultimately relate this quantity to $M(f)$. For any $s \in K_{\tau^*}$, the squared loss obeys

$$\mathrm{MSE}(f) - \mathrm{MSE}(s) = \|y - f\|^2 - \|y - s\|^2 = 2\,\mathbb{E}[\langle y - f, s - f \rangle] - \|s - f\|^2 \le 2\,\mathbb{E}[\langle y - f, s \rangle - \langle y - f, f \rangle].$$

The second equality uses the fact that $\|a\|^2 - \|b\|^2 = 2\langle a, a - b \rangle - \|a - b\|^2$. The inequality uses the fact that we can drop the subtracted nonnegative term $\|s - f\|^2$. Taking the supremum over $s \in K_{\tau^*}$ yields

$$\mathrm{MSE}(f) - R(K_{\tau^*}) \le 2\,\sigma_{K_{\tau^*}}(y - f) - 2\,\mathbb{E}[\langle y - f(x), f(x) \rangle].$$

Applying the same inequality with $f - y$ in place of $y - f$ yields a second upper bound. Since any $X$ with $X \le A$ and $X \le B$ satisfies $X \le (A + B)/2$, averaging the two bounds cancels the unknown linear term $\langle y - f, f \rangle$. Using evenness of the support function for symmetric sets, $\sigma_{K_{\tau^*}}(u) = \sigma_{K_{\tau^*}}(-u)$, we get

$$\mathrm{MSE}(f) - R(K_{\tau^*}) \le \sigma_{K_{\tau^*}}(y - f) + \sigma_{K_{\tau^*}}(f - y) = 2\,\sigma_{K_{\tau^*}}(y - f) = 2\tau^* \sup_{g \in \pm C} \mathbb{E}[\langle y - f(x), g(x) \rangle] = 2\tau^* \sup_{g \in C} \left|\mathbb{E}[\langle y - f(x), g(x) \rangle]\right| = 2\tau^*\, M(f),$$

where we used symmetry of $K_{\tau^*}$ and of $C$. In more detail: for any $u$, the trivial inequality is $\sup_{g \in C} |\langle u, g \rangle| \ge \sup_{g \in C} \langle u, g \rangle$. Conversely, because $C$ is symmetric, for every $g \in C$ also $-g \in C$, so $\max\{\langle u, g \rangle, \langle u, -g \rangle\} = |\langle u, g \rangle|$, implying $\sup_{g \in C} \langle u, g \rangle \ge \sup_{g \in C} |\langle u, g \rangle|$. Thus $\sup_{g \in C} |\langle u, g \rangle| = \sup_{g \in C} \langle u, g \rangle$. Taking $u = y - f$ identifies the last term with $M(f)$. Since $f^* \in K_{\tau^*} \cap V(C)$ minimizes MSE over $V(C)$, we have $R(K_{\tau^*}) = R_{V(C)}$. Rearranging yields

$$M(f) \ge \frac{\mathrm{MSE}(f) - R_{V(C)}}{2\tau^*}.$$

We have lower bounded the maximum residual-model correlation over the weak learner class by a quantity depending on the gap between the current model's error and the best error in the weak-learner span. We now relate the per-step error gap to that best error via a recurrence.

Proposition 4.3 (Gap Recurrence Toward $R_{V(C)}$).
Let $E_t := \mathrm{MSE}(f_t) - R_{V(C)}$. We will use the shorthand $u_+^2 := (\max\{u, 0\})^2$. Then, for $t \ge 1$,

$$E_{t-1} - E_t \ge \left(\frac{E_{t-1}}{2\tau^*} - \varepsilon_t\right)_+^2.$$

Proof. By Lemma 4.1, $\mathrm{MSE}(f_{t-1}) - \mathrm{MSE}(f_t) \ge \mathbb{E}[\langle r_{t-1}(x), g_t(x) \rangle]^2$. The oracle gives $\mathbb{E}[\langle r_{t-1}(x), g_t(x) \rangle] \ge M(f_{t-1}) - \varepsilon_t$. Hence $\mathbb{E}[\langle r_{t-1}, g_t \rangle]^2 \ge (M(f_{t-1}) - \varepsilon_t)_+^2$. Finally, Lemma 4.2 gives $M(f_{t-1}) \ge E_{t-1}/(2\tau^*)$, yielding the claim.

Finally, we can use the recurrence relation to bound the difference between the MSE of the model at iteration $t$ and the MSE of the best model in the span of the weak learner class. We can see that the first term is inversely proportional to $t$ and depends on the atomic norm of the best model in the span of the weak learner class; the bound also includes a term that depends on the SQ-oracle error at every iteration.

Theorem 4.4 (Weak Learning Anchor Gap Upper Bound). For all $t \ge 1$,

$$\mathrm{MSE}(f_t) - R_{V(C)} \le \frac{8(\tau^*)^2}{t} + \sum_{s=1}^t \varepsilon_s^2.$$

Proof. Let $E_t := \mathrm{MSE}(f_t) - R_{V(C)}$. From Proposition 4.3,

$$E_{t-1} - E_t \ge \left(\frac{E_{t-1}}{2\tau^*} - \varepsilon_t\right)_+^2.$$

For any $a \ge 0$ and $b \in \mathbb{R}$, $(a - b)^2 \ge a^2/2 - b^2$. To see this, consider $a^2 - 2ab + b^2 - a^2/2 + b^2 = a^2/2 - 2ab + 2b^2$; twice this quantity equals $(a - 2b)^2$, which is non-negative, and a multiplicative factor of 2 does not affect the sign. In this case the inequality also holds for the quantity $((a - b)_+)^2$. Taking $a = E_{t-1}/(2\tau^*)$ and $b = \varepsilon_t$ yields

$$E_{t-1} - E_t \ge \frac{E_{t-1}^2}{8(\tau^*)^2} - \varepsilon_t^2.$$

Since $E_t \le E_{t-1}$,

$$\frac{1}{E_t} - \frac{1}{E_{t-1}} = \frac{E_{t-1} - E_t}{E_t E_{t-1}} \ge \frac{E_{t-1} - E_t}{E_{t-1}^2} \ge \frac{1}{8(\tau^*)^2} - \frac{\varepsilon_t^2}{E_{t-1}^2} \ge \frac{1}{8(\tau^*)^2} - \frac{\varepsilon_t^2}{E_t^2}.$$

Summing from $s = 1$ to $t$ gives

$$\frac{1}{E_t} \ge \frac{1}{E_0} + \frac{t}{8(\tau^*)^2} - \sum_{s=1}^t \frac{\varepsilon_s^2}{E_s^2} \ge \frac{1}{E_0} + \frac{t}{8(\tau^*)^2} - \frac{1}{E_t^2}\sum_{s=1}^t \varepsilon_s^2.$$

Let $A_t := \sum_{s=1}^t \varepsilon_s^2$ and $B_t := \frac{1}{E_0} + \frac{t}{8(\tau^*)^2}$.
Writing $X := 1/E_t$, the inequality becomes $A_t X^2 + X - B_t \ge 0$. If $A_t = 0$ then $X \ge B_t$ and $E_t \le 1/B_t \le 8(\tau^*)^2/t$. If $A_t > 0$, define the quantity $Y := 1/X$. Then the inequality becomes $-B_t Y^2 + Y + A_t \ge 0$, and the quadratic inequality implies

$$Y \le \frac{1 + \sqrt{1 + 4 A_t B_t}}{2 B_t}.$$

Using $\sqrt{1 + z} \le 1 + z/2$ for $z \ge 0$ gives $Y \le \frac{1}{B_t} + A_t$. Thus $E_t \le 1/B_t + A_t \le 8(\tau^*)^2/t + \sum_{s=1}^t \varepsilon_s^2$.

We can now use the anchoring lemmas from Section 2 to relate two independent stagewise runs.

Theorem 4.5 (Gradient Boosting Agreement Bound). Let $f_1$ and $f_2$ be two independent gradient boosting runs (using the same weak learning class $C$ and number of iterations $k$), driven by oracle errors $\{\varepsilon_t\}$ and $\{\varepsilon'_t\}$ respectively. Let $f^* \in V(C)$ denote the population least-squares predictor over $V(C)$. Then

$$D(f_1, f_2) \le 2\left(\mathrm{MSE}(f_1) - R_{V(C)}\right) + 2\left(\mathrm{MSE}(f_2) - R_{V(C)}\right).$$

Consequently, using Theorem 4.4, for all $k \ge 1$,

$$D(f_1, f_2) \le \frac{32(\tau^*)^2}{k} + 2\left(\sum_{t=1}^k \varepsilon_t^2 + \sum_{t=1}^k \varepsilon'^2_t\right).$$

Proof. Let $\bar f := \frac{1}{2}(f_1 + f_2)$. Since each gradient boosting run outputs a predictor in $V(C)$, we have $\bar f \in V(C)$. Applying Corollary 2.3 with $\mathcal{H} = V(C)$ gives

$$D(f_1, f_2) \le 2\left(\mathrm{MSE}(f_1) - R(V(C))\right) + 2\left(\mathrm{MSE}(f_2) - R(V(C))\right).$$

Applying Theorem 4.4 to both runs yields the stated bound.

Thus we have shown that gradient boosting yields independent agreement tending to 0 at a rate of $O(1/k)$, where $k$ is the number of iterations. This bound also depends on $\tau^*$, which is a problem-dependent constant. In Section 6 we analyze a variant of gradient boosting based on the Frank-Wolfe algorithm (for more general loss functions) that always produces a predictor with norm at most $\tau$, where $\tau$ is a user-defined parameter. We give a variant of this analysis in which we anchor to the best model in the span of the weak learner class that also has norm at most $\tau$.
This removes any dependence on $\tau^*$, and we obtain similar rates depending only on $\tau$, replacing the problem-dependent constant with a user-defined parameter that trades off agreement with accuracy as desired.

5 Neural Networks, Regression Trees, and Other Classes Satisfying Hierarchical Midpoint Closure

Next, we show that certain function classes, including ReLU neural networks and regression trees, admit strong agreement bounds under approximate population loss minimization. These function classes may be highly non-convex, meaning that approximate loss minimizers may be very far apart in parameter space, or even incomparable in the sense that they may be of different architectures. Nevertheless, by anchoring on the midpoint predictor $\bar{f}(x) = \frac{1}{2}(f_1(x) + f_2(x))$ and using the fact that the relevant model classes are closed under averaging, we show that they must be close in prediction space. We will use Lemma 2.4 from Section 2. To apply it, we need midpoint closure of the form $\bar{f} \in \mathcal{F}_{2n}$ whenever $f_1, f_2 \in \mathcal{F}_n$. The form of our theorems will be identical for any class satisfying this kind of "hierarchical midpoint closure".

5.1 Application to Neural Networks

We work with feed-forward ReLU networks. Let $\sigma(t) := \max\{0, t\}$ denote the ReLU activation. For $n \geq 0$, let $\mathrm{NN}_n$ denote the class of functions $f : X \to Y$ computable by a finite directed acyclic graph in which each internal (non-input, non-output) node computes $\sigma(\langle w, u \rangle + b)$ for some affine function of its inputs, and the output node computes an affine combination of the values at the input coordinates and internal nodes. First, we demonstrate midpoint closure for this class.

Lemma 5.1 (Neural-network midpoint closure). For every $n \geq 0$ and every $f_1, f_2 \in \mathrm{NN}_n$, the midpoint predictor $\bar{f} := \frac{1}{2}(f_1 + f_2)$ lies in $\mathrm{NN}_{2n}$.

Proof. Fix realizations of $f_1$ and $f_2$ as ReLU networks with at most $n$ internal nodes each.
Construct a new network by taking a disjoint copy of the internal computation graph for each of $f_1$ and $f_2$, and wiring both copies to the same input $x$. This yields a single feed-forward network that computes both $f_1(x)$ and $f_2(x)$ in parallel, using at most $2n$ internal ReLU nodes. Define the output node to return the affine combination $\frac{1}{2} f_1(x) + \frac{1}{2} f_2(x)$. This adds no new internal nodes, so the resulting network computes $\bar{f}$ and has size at most $2n$, i.e., $\bar{f} \in \mathrm{NN}_{2n}$.

Corollary 5.2 (Neural-network agreement). Fix $n \geq 1$ and $\varepsilon > 0$. If $f_1, f_2 \in \mathrm{NN}_n$ satisfy $\mathrm{MSE}(f_i) \leq R(\mathrm{NN}_n) + \varepsilon$ for $i \in \{1, 2\}$, then
\[
D(f_1, f_2) \leq 4\left( R(\mathrm{NN}_n) - R(\mathrm{NN}_{2n}) + \varepsilon \right) .
\]

Proof. Apply Lemma 2.4 with $\mathcal{F}_n = \mathrm{NN}_n$ and use Lemma 5.1.

Observe that this is exactly the same form of local learning curve guarantee that we got for stacking in Theorem 3.1. In particular, since the loss is bounded and the optimal loss is monotonically decreasing in network size, for any value of $\alpha$ there must be a value of $n \leq 2^{1/\alpha}$ such that $R(\mathrm{NN}_n) - R(\mathrm{NN}_{2n}) \leq \alpha$ (as the error can drop by $\alpha$ at most $1/\alpha$ times before contradicting the non-negativity of squared error). For such a value of $n$, we have $D(f_1, f_2) \leq 4(\alpha + \varepsilon)$. As with stacking, this bound is completely independent of the complexity of the instance and does not require that "global optimality" can be obtained by a small neural network (i.e., it requires only flatness of the local loss curve, which can always be guaranteed at modest values of $n$, not of the global loss curve, which cannot). This kind of "learning curve" bound for neural networks is reminiscent of the argument used by Błasiok et al. [2024] to show that "most sizes" of ReLU networks are approximately multicalibrated with respect to all neural network architectures of bounded size.

5.2 Application to Regression Trees

We observe that the same arguments apply almost verbatim to regression trees.
We work with axis-aligned regression trees. A depth-$d$ tree is a rooted binary tree in which every internal node is labeled by a coordinate $j \in [d]$ and a threshold $t \in \mathbb{R}$, and routes an input $x \in X \subseteq \mathbb{R}^d$ to the left or right child depending on whether $x_j \leq t$ or $x_j > t$. Each leaf is labeled by a constant prediction value in $[0, 1]$. The prediction computed by the tree is the value of the leaf reached by $x$. We write $\mathrm{Tree}_d$ for the class of such predictors of depth at most $d$.

Lemma 5.3 (Regression-tree midpoint closure). For every $d \geq 0$ and every $f_1, f_2 \in \mathrm{Tree}_d$, the midpoint predictor $\bar{f} := \frac{1}{2}(f_1 + f_2)$ lies in $\mathrm{Tree}_{2d}$.

Proof. Fix realizations of $f_1, f_2 \in \mathrm{Tree}_d$ as depth-$d$ trees. Consider the partition of $X$ induced by the leaves of the tree for $f_1$; on each cell of this partition, $f_1$ is constant. Now refine each such cell further using the splits of the tree for $f_2$ restricted to that cell. Equivalently, we can construct a single tree as follows: take the tree for $f_1$, and at each leaf, graft a copy of the tree for $f_2$. Along any root-to-leaf path, we traverse at most $d$ splits from $f_1$ and then at most $d$ splits from $f_2$, so the resulting tree has depth at most $2d$. Moreover, on each leaf of the resulting tree, both $f_1$ and $f_2$ take constant values, so we can label that leaf with their average $\frac{1}{2} f_1(x) + \frac{1}{2} f_2(x) \in [0, 1]$. This yields a depth-$2d$ regression tree computing $\bar{f}$, i.e., $\bar{f} \in \mathrm{Tree}_{2d}$.

We now get an immediate corollary:

Corollary 5.4 (Regression tree agreement). Fix $d \geq 1$ and $\varepsilon > 0$. If $f_1, f_2 \in \mathrm{Tree}_d$ satisfy $\mathrm{MSE}(f_i) \leq R(\mathrm{Tree}_d) + \varepsilon$ for $i \in \{1, 2\}$, then
\[
D(f_1, f_2) \leq 4\left( R(\mathrm{Tree}_d) - R(\mathrm{Tree}_{2d}) + \varepsilon \right) .
\]

Proof. Apply Lemma 2.4 with $\mathcal{F}_d = \mathrm{Tree}_d$ and use Lemma 5.3.
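The grafting construction in the proof of Lemma 5.3 is concrete enough to execute. Below is a minimal Python sketch; the tuple-based tree representation, the helper names, and the example trees are our own illustrative choices, not from the paper. It grafts a value-adjusted copy of the second tree onto every leaf of the first, producing a tree that computes the midpoint predictor with depth at most the sum of the two depths.

```python
# Toy regression-tree representation (illustrative): an internal node is
# ("split", j, t, left, right) and a leaf is ("leaf", v).

def predict(tree, x):
    """Route x down the tree and return the reached leaf's value."""
    if tree[0] == "leaf":
        return tree[1]
    _, j, t, left, right = tree
    return predict(left, x) if x[j] <= t else predict(right, x)

def depth(tree):
    """Number of splits on the longest root-to-leaf path."""
    if tree[0] == "leaf":
        return 0
    return 1 + max(depth(tree[3]), depth(tree[4]))

def graft_average(t1, t2):
    """Return a single tree computing (t1(x) + t2(x)) / 2, as in Lemma 5.3:
    at every leaf of t1, graft a copy of t2 whose leaves are averaged with
    the constant value t1 takes on that cell."""
    if t1[0] == "leaf":
        return _shift_half(t2, t1[1])
    _, j, t, left, right = t1
    return ("split", j, t, graft_average(left, t2), graft_average(right, t2))

def _shift_half(t2, c):
    """Copy of t2 with every leaf value v replaced by (c + v) / 2."""
    if t2[0] == "leaf":
        return ("leaf", 0.5 * c + 0.5 * t2[1])
    _, j, t, left, right = t2
    return ("split", j, t, _shift_half(left, c), _shift_half(right, c))
```

On any input, the grafted tree agrees exactly with the pointwise average of the two original trees, and its depth is bounded by the sum of their depths, matching the $2d$ bound when both have depth at most $d$.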
Again, this is a local learning curve agreement guarantee of exactly the same form as our theorem for stacking (Theorem 3.1) and our theorem for neural network training (Corollary 5.2). An immediate implication is that for any value of $\alpha$ there is a value $d \leq 2^{1/\alpha}$ (i.e., independent of the complexity of the instance) guaranteeing that for that value of $d$, $D(f_1, f_2) \leq 4(\alpha + \varepsilon)$.

6 Generalization to Multi-Dimensional Strongly Convex Losses

In this section we generalize our setting to study models that output $d$-dimensional distributions as predictions, and optimize arbitrary strongly convex losses. We show that the midpoint anchoring argument extends directly to this more general setting, which lets us model a wide array of practical machine learning problems. First we define general strongly convex loss functions over $d$-dimensional predictions:

Definition 6.1 (Strongly convex losses). Let $L : Y \times \mathbb{R}^d \to \mathbb{R}$ be a continuously differentiable loss function. We say that $L$ is $\mu$-strongly convex if there exists some $\mu > 0$ such that for every $y \in Y$ and $P_1, P_2 \in \mathbb{R}^d$,
\[
L(y, P_1) \geq L(y, P_2) + \langle \nabla_p L(y, P_2), P_1 - P_2 \rangle + \frac{\mu}{2} \| P_1 - P_2 \|_2^2 .
\]

For predictors outputting $d$-dimensional predictions, we define disagreement as follows, straightforwardly generalizing our 1-dimensional expected squared disagreement metric:

Definition 6.2 (Generalized disagreement). Let $P$ be a distribution on $X \times Y$ and let $f_1, f_2 : X \to \mathbb{R}^d$ be functions. The disagreement between $f_1, f_2$ over $P$ is the expected squared Euclidean distance between their predictions:
\[
D(f_1, f_2) = \mathbb{E}\left[ \| f_1(x) - f_2(x) \|_2^2 \right] .
\]

We will write $R(f) := \mathbb{E}[L(y, f(x))]$. We can now generalize our disagreement-via-midpoint-anchoring lemma, which drives our analyses.

Lemma 6.3 (Disagreement via the midpoint anchor). Assume $L$ is $\mu$-strongly convex.
For any two functions $f_1, f_2 : X \to \mathbb{R}^d$, let $\bar{f}(x) := \frac{1}{2}(f_1(x) + f_2(x))$. Then
\[
D(f_1, f_2) \leq \frac{4}{\mu} \left( R(f_1) + R(f_2) - 2 R(\bar{f}) \right) .
\]
In particular, if $\bar{f} \in H$ for some class of predictors $H$, then
\[
D(f_1, f_2) \leq \frac{4}{\mu} \left( R(f_1) - R(H) \right) + \frac{4}{\mu} \left( R(f_2) - R(H) \right) .
\]

Proof. Fix any $x \in X$ and $y \in Y$ and abbreviate $p_1 := f_1(x)$, $p_2 := f_2(x)$, $\bar{p} := \bar{f}(x) = \frac{1}{2}(p_1 + p_2)$. Applying $\mu$-strong convexity (Definition 6.1) with $P_1 = p_1$ and $P_2 = \bar{p}$ gives
\[
L(y, p_1) \geq L(y, \bar{p}) + \langle \nabla_p L(y, \bar{p}), p_1 - \bar{p} \rangle + \frac{\mu}{2} \| p_1 - \bar{p} \|_2^2 .
\]
Similarly, with $P_1 = p_2$ and $P_2 = \bar{p}$,
\[
L(y, p_2) \geq L(y, \bar{p}) + \langle \nabla_p L(y, \bar{p}), p_2 - \bar{p} \rangle + \frac{\mu}{2} \| p_2 - \bar{p} \|_2^2 .
\]
Adding the two inequalities, and using $(p_1 - \bar{p}) + (p_2 - \bar{p}) = p_1 + p_2 - 2\bar{p} = 0$ to cancel the gradient terms, yields
\[
L(y, p_1) + L(y, p_2) \geq 2 L(y, \bar{p}) + \frac{\mu}{2} \left( \| p_1 - \bar{p} \|_2^2 + \| p_2 - \bar{p} \|_2^2 \right) .
\]
Since $p_1 - \bar{p} = \frac{1}{2}(p_1 - p_2)$ and $p_2 - \bar{p} = \frac{1}{2}(p_2 - p_1)$, we have
\[
\| p_1 - \bar{p} \|_2^2 + \| p_2 - \bar{p} \|_2^2 = 2 \left\| \tfrac{1}{2}(p_1 - p_2) \right\|_2^2 = \frac{1}{2} \| p_1 - p_2 \|_2^2 .
\]
Substituting this back and rearranging gives the pointwise bound
\[
\| f_1(x) - f_2(x) \|_2^2 \leq \frac{4}{\mu} \left( L(y, f_1(x)) + L(y, f_2(x)) - 2 L(y, \bar{f}(x)) \right) .
\]
Taking expectations over $(x, y) \sim P$ and using the definitions of $D(\cdot, \cdot)$ and $R(\cdot)$ yields
\[
D(f_1, f_2) \leq \frac{4}{\mu} \left( R(f_1) + R(f_2) - 2 R(\bar{f}) \right) .
\]
For the second inequality, if $\bar{f} \in H$ then $R(\bar{f}) \geq R(H)$, so replacing $R(\bar{f})$ by $R(H)$ on the right-hand side yields the claim.

We now show how to apply the midpoint anchoring lemma to each of our (generalized) applications.

6.1 Stacking

Here, we will provide a generalization of our stacking results to multi-dimensional strongly convex losses. We once again model "base models" as being sampled i.i.d.
from an arbitrary distribution $Q$, and under two independent training runs write $G, G' \sim Q^k$ to denote the sets of $k$ sampled models. We will consider the stacked predictors $f_1 \in V(G)$ and $f_2 \in V(G')$. Define $G^* = G \cup G'$. The key observation is that the midpoint predictor $\frac{1}{2}(f_1 + f_2)$ lies in $V(G^*)$, so we can apply Lemma 6.3 and then use the same exchangeability argument as in the single-dimensional case.

Theorem 6.4 (Agreement for Stacked Aggregation, Generalization). Assume that $L$ is $\mu$-strongly convex. Let $G = \{g_1, \ldots, g_k\} \sim Q^k$ and $G' = \{g'_1, \ldots, g'_k\} \sim Q^k$ be independent i.i.d. samples. Define $f_1, f_2$ as follows:
\[
f_1 = \arg\min_{f \in V(G)} \mathbb{E}[L(y, f(x))], \qquad f_2 = \arg\min_{f \in V(G')} \mathbb{E}[L(y, f(x))] .
\]
Then we have that
\[
\mathbb{E}_{f_1, f_2}\left[ D(f_1, f_2) \right] \leq \frac{8}{\mu} \left( \bar{R}_k - \bar{R}_{2k} \right) .
\]

Proof. Fix realizations of $G$ and $G'$, and let $G^* = G \cup G'$ (multiset union). Throughout this section we will think of $G, G' \sim Q^k$, unless explicitly conditioned. Note that $V(G) \subseteq V(G^*)$ and $V(G') \subseteq V(G^*)$. In our proofs, without loss of generality, we will use the notation $h_G$ to denote the minimizer of $\mathbb{E}[L(y, \cdot)]$ over the subspace $V(G)$. Similarly, we will use the notation $R(G) = R(h_G)$ in this context. In our theorem statements, this corresponds to $f_1$. Let $\bar{h} := \frac{1}{2}(h_G + h_{G'})$. Since $h_G \in V(G)$ and $h_{G'} \in V(G')$ and $V(G), V(G') \subseteq V(G^*)$, we have $\bar{h} \in V(G^*)$. Applying Lemma 6.3 with $f_1 = h_G$, $f_2 = h_{G'}$, and $H = V(G^*)$, and using $R(h_G) = R(G)$, $R(h_{G'}) = R(G')$, and $R(V(G^*)) = R(G^*)$, we have the pointwise inequality
\[
\| h_G - h_{G'} \|^2 \leq \frac{4}{\mu} \left( R(G) - R(G^*) \right) + \frac{4}{\mu} \left( R(G') - R(G^*) \right) . \quad (10)
\]
We now take expectations over $G, G', G^*$ to relate the two terms on the right-hand side of Equation (10). Conditional on $G^*$, we can generate the pair $(G, G')$ by drawing a uniformly random permutation $\pi$ of $\{1, \ldots, 2k\}$ and letting $G$ be the first $k$ permuted elements of $G^*$ and $G'$ the remaining $k$. This holds because the $2k$ models in $G^*$ arise from $2k$ i.i.d. draws from $Q$, and the joint law of $(G, G')$ is exchangeable under permutations of these $2k$ draws. Conditioning on the unordered multiset $G^*$, $(G, G')$ is a uniformly random partition into two $k$-submultisets. Therefore, taking the conditional expectation of (10) given $G^*$ and using the symmetry of $G$ and $G'$,
\[
\mathbb{E}_{(G, G') \mid G^*}\left[ \| h_G - h_{G'} \|^2 \mid G^* \right] \leq \frac{8}{\mu} \left( \mathbb{E}_{(G, G') \mid G^*}\left[ R(G) \mid G^* \right] - R(G^*) \right) . \quad (11)
\]
We now integrate (11) over $G^*$. We claim that
\[
\mathbb{E}_{G^*}\left[ \mathbb{E}_{(G, G') \mid G^*}\left[ R(G) \mid G^* \right] \right] = \bar{R}_k \quad \text{and} \quad \mathbb{E}_{G^*}\left[ R(G^*) \right] = \bar{R}_{2k} . \quad (12)
\]
The second equality is immediate from the definition of $\bar{R}_{2k}$, since $G^*$ is a collection of $2k$ i.i.d. draws from $Q$. For the first equality in (12), let $U$ be a uniformly random $k$-subset of $\{1, \ldots, 2k\}$, independent of the draws $\{g_1, \ldots, g_{2k}\} \sim Q^{2k}$. Define $G_U := \{g_i\}_{i \in U}$. By the conditional description above,
\[
\mathbb{E}_{(G, G') \mid G^*}\left[ R(G) \mid G^* \right] = \mathbb{E}_U\left[ R(G_U) \mid G^* \right] ,
\]
so
\[
\mathbb{E}_{G^*}\left[ \mathbb{E}_{(G, G') \mid G^*}\left[ R(G) \mid G^* \right] \right] = \mathbb{E}_{G^*}\left[ \mathbb{E}_U\left[ R(G_U) \mid G^* \right] \right] = \mathbb{E}_{G^*, U}\left[ R(G_U) \right] .
\]
For any fixed $U$, the subcollection $\{g_i\}_{i \in U}$ consists of $k$ i.i.d. draws from $Q$ (since the full family is i.i.d. and $U$ is independent of the draws), hence averaging over $U$ yields $\mathbb{E}_{G^*, U}[R(G_U)] = \bar{R}_k$, proving (12). Finally, taking expectations in (11) and substituting (12) gives
\[
\mathbb{E}_{G, G'}\left[ \| h_G - h_{G'} \|^2 \right] \leq \frac{8}{\mu} \left( \bar{R}_k - \bar{R}_{2k} \right) ,
\]
which is the desired bound.

As before, we have related the stability of (now generalized) stacking to the local learning curve, which is bounded, non-negative, and non-increasing in $k$.
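Theorem 6.4 and the applications that follow all rest on Lemma 6.3, which is easy to check numerically. The sketch below is illustrative (randomly generated predictions rather than trained models): for the squared loss $L(y, p) = \|y - p\|_2^2$, which is $\mu$-strongly convex with $\mu = 2$, the lemma asserts $D(f_1, f_2) \leq \frac{4}{\mu}(R(f_1) + R(f_2) - 2R(\bar{f}))$; by the parallelogram identity, this particular loss makes the bound an equality.

```python
# Numeric sanity check of Lemma 6.3 for the squared loss (mu = 2).
# The sampled labels and predictions are arbitrary illustrative data.
import random

random.seed(0)
d, n = 3, 200  # prediction dimension and number of (x, y) samples

ys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
p1 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]  # f1(x)
p2 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]  # f2(x)

def sqdist(u, v):
    """Squared Euclidean distance between two d-vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

mu = 2.0
R1 = sum(sqdist(y, p) for y, p in zip(ys, p1)) / n       # R(f1)
R2 = sum(sqdist(y, p) for y, p in zip(ys, p2)) / n       # R(f2)
Rbar = sum(sqdist(y, [(a + b) / 2 for a, b in zip(u, v)])
           for y, u, v in zip(ys, p1, p2)) / n           # R of the midpoint
D = sum(sqdist(u, v) for u, v in zip(p1, p2)) / n        # disagreement
```

Here the right-hand side $\frac{4}{\mu}(R_1 + R_2 - 2\bar{R})$ coincides with $D$ exactly, which is the one-dimensional squared-error case generalized by Lemma 6.3.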
As a result, for any desired level of stability $\alpha$, there must be a $k \leq 2^{O(1/\alpha)}$ that guarantees that level of stability, independently of the complexity of the learning instance; once again, the local learning curve can be empirically investigated on a holdout set to choose such a value of $k$.

Algorithm 3: Multi-Dimensional Frank-Wolfe
    Input: SQ-oracle for weak learner class $\mathcal{C}$, budget $\tau > 0$
    Initialize $f_0 \equiv 0$, $G_0 = \emptyset$
    for $t \in [k]$ do
        Choose $s_t \in \mathcal{C}$ such that $\mathbb{E}[\langle -\nabla_p L(y, f_{t-1}(x)), s_t(x) \rangle] \geq \max_{s \in \mathcal{C}} \mathbb{E}[\langle -\nabla_p L(y, f_{t-1}(x)), s(x) \rangle] - \varepsilon_t$
        Choose $g_t \in K_\tau$ such that $g_t = \tau s_t / \| s_t \|_A$
        $\alpha_t = \frac{2}{t+1}$
        $f_t := f_{t-1} + \alpha_t (g_t - f_{t-1})$; $G_t := G_{t-1} \cup \{ g_t \}$
    end for
    return $f_k$ and $G := G_k$

6.2 Gradient Boosting (via Frank-Wolfe)

In this section, we generalize our gradient boosting agreement results to the multi-dimensional setting. Along the way we give another generalization as well. Recall that in the final risk bound of Theorem 4.5, and correspondingly in the final agreement bound, we had a dependence on the instance-dependent constant $\tau^*$, the atomic norm of the best model in the span of the weak learner class. In this section, we instead analyze a Frank-Wolfe variant of gradient boosting. In this variant, the iterates are constrained to lie within a user-specified atomic-norm budget $\tau$. As a result, we are able to carry out our anchoring argument with respect to the best norm-$\tau$ model in the span of the weak learner class, rather than the best unconstrained model. This lets us replace the dependence on $\tau^*$ with a dependence on $\tau$, which is specified by the user rather than defined by the instance.

In this section we will need to work with $L$-smooth losses (in the prediction $p$). In other words, we need to assume that for all $y \in Y$ and all $p_1, p_2 \in \Delta(Y)$, our loss satisfies:
\[
\| \nabla_p L(y, p_1) - \nabla_p L(y, p_2) \|_2 \leq L \| p_1 - p_2 \|_2 .
\]
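Algorithm 3 is straightforward to instantiate. The following Python sketch runs it for the one-dimensional squared loss $L(y, p) = (y - p)^2$ (which is $L$-smooth with $L = 2$) using an exact oracle ($\varepsilon_t = 0$) over a tiny hand-picked weak learner class on a finite uniform domain. The domain, base learners, target, and the simplification that each atom has unit atomic norm (so $g_t = \tau s_t$) are all illustrative assumptions of ours.

```python
# Illustrative instantiation of Algorithm 3 (Frank-Wolfe boosting) for
# squared loss on a finite domain; functions are vectors of predictions.
import math

n, k, tau = 50, 300, 1.0
xs = [j / n for j in range(n)]

def rms(v):
    """Empirical L2 norm (E[v(x)^2])^{1/2} under the uniform distribution."""
    return math.sqrt(sum(a * a for a in v) / len(v))

# Two normalized base learners (||c|| = 1); the class is closed under
# negation, matching the symmetry condition on the weak learner class.
c1 = [1.0] * n
raw = [math.sin(2 * math.pi * x) for x in xs]
c2 = [a / rms(raw) for a in raw]
C = [c1, [-a for a in c1], c2, [-a for a in c2]]

y = [0.3 * a + 0.5 * b for a, b in zip(c1, c2)]  # target inside K_tau

def risk(f):
    """Population risk R(f) = E[(y - f(x))^2] on the finite domain."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / n

f = [0.0] * n  # f_0 = 0
for t in range(1, k + 1):
    grad = [2 * (fi - yi) for fi, yi in zip(f, y)]      # grad_p L at f_{t-1}
    # Exact SQ oracle: the atom best correlated with the negative gradient.
    s = max(C, key=lambda c: -sum(g * ci for g, ci in zip(grad, c)))
    g_t = [tau * a for a in s]          # assumes ||s||_A = 1 for every atom
    alpha = 2 / (t + 1)
    f = [(1 - alpha) * fi + alpha * gi for fi, gi in zip(f, g_t)]
```

With an exact oracle, Lemma 6.8 (proved below) predicts a risk gap of at most $8L\tau^2/(k+1)$ to the best predictor in $K_\tau$, which the run above satisfies with room to spare.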
Recall the conditions on the weak learner class $\mathcal{C}$ that we had previously, which we continue to assume in this section: symmetry, normalization, and non-degeneracy. Note that in this section, for the sake of clarity, we will use the standard inner product $\langle f, g \rangle = f^T g$. When needed, we will explicitly mention the expectations we are computing. When norms are written $\| f \|$, we still take this to mean the same definition as in the Preliminaries, $(\mathbb{E}[f(x)^2])^{1/2}$.

We will define the quantity
\[
M(f) := \sup_{g \in \mathcal{C}} \mathbb{E}\left[ \langle \nabla_p L(y, f(x)), g(x) \rangle \right] .
\]
We will also define the closely related quantity
\[
G(f) := \sup_{z \in K_\tau} \mathbb{E}\left[ \langle \nabla_p L(y, f(x)), f(x) - z(x) \rangle \right] .
\]
We can show that for any $f \in K_\tau$, $G(f) \leq 2\tau M(f)$. Define $\tilde{f}(x), \tilde{z}(x) \in \mathrm{conv}(\mathcal{C})$, where $f(x) = \tau \tilde{f}(x)$ and $z(x) = \tau \tilde{z}(x)$. One can see the claim (as shown below) because the weak learner class is normalized, so functions in $K_\tau$ can be scaled to live in $\mathrm{conv}(\mathcal{C})$; because the inner product term inside $M(f)$ is linear in $g$, so the supremum over $\mathrm{conv}(\mathcal{C})$ matches the supremum over $\mathcal{C}$; and by the triangle inequality:
\[
\begin{aligned}
G(f) &= \sup_{z \in K_\tau} \left| \mathbb{E}\left[ \langle \nabla L(y, f(x)), f(x) - z(x) \rangle \right] \right| \\
&= \tau \sup_{\tilde{z} \in \mathrm{conv}(\mathcal{C})} \left| \mathbb{E}\left[ \langle \nabla L(y, f(x)), \tilde{f}(x) - \tilde{z}(x) \rangle \right] \right| \\
&\leq \tau \left| \mathbb{E}\left[ \langle \nabla L(y, f(x)), \tilde{f}(x) \rangle \right] \right| + \tau \sup_{\tilde{z} \in \mathrm{conv}(\mathcal{C})} \left| \mathbb{E}\left[ \langle \nabla L(y, f(x)), \tilde{z}(x) \rangle \right] \right| \\
&\leq \tau \sup_{\tilde{h} \in \mathrm{conv}(\mathcal{C})} \left| \mathbb{E}\left[ \langle \nabla L(y, f(x)), \tilde{h}(x) \rangle \right] \right| + \tau \sup_{\tilde{z} \in \mathrm{conv}(\mathcal{C})} \left| \mathbb{E}\left[ \langle \nabla L(y, f(x)), \tilde{z}(x) \rangle \right] \right| \\
&= 2\tau \sup_{\tilde{h} \in \mathcal{C}} \left| \mathbb{E}\left[ \langle \nabla L(y, f(x)), \tilde{h}(x) \rangle \right] \right| = 2\tau M(f) .
\end{aligned}
\]

Broadly, our proof will mirror the analysis of our single-dimensional agreement results for gradient boosting. We will once again make use of the conditions on the weak learner class mentioned for gradient boosting: symmetry, normalization, and non-degeneracy.
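The inequality $G(f) \leq 2\tau M(f)$ can be checked numerically. The sketch below (toy data, class, and squared-loss gradient are all our own illustrative choices) computes $M(f)$ over a small symmetric normalized class and evaluates the supremum defining $G(f)$ in closed form, using the fact that over $K_\tau$ the linear term $\mathbb{E}[\langle \nabla L, z \rangle]$ ranges over $[-\tau M(f), \tau M(f)]$.

```python
# Numeric illustration of G(f) <= 2 * tau * M(f) on a finite uniform domain
# with the squared-loss gradient; all data below is assumed toy data.
import math
import random

random.seed(1)
n, tau = 40, 2.0

def rms(v):
    return math.sqrt(sum(a * a for a in v) / n)

def inner(u, v):
    """Empirical inner product E[u(x) v(x)] under the uniform distribution."""
    return sum(a * b for a, b in zip(u, v)) / n

# Symmetric, normalized atoms (||c|| = 1, class closed under negation).
base = [[random.gauss(0, 1) for _ in range(n)] for _ in range(3)]
C = [[a / rms(c) for a in c] for c in base]
C += [[-a for a in c] for c in C]

y = [random.gauss(0, 1) for _ in range(n)]
# f = tau * (convex combination of atoms) lies in K_tau.
f = [tau * (0.5 * a + 0.3 * b + 0.2 * c)
     for a, b, c in zip(C[0], C[1], C[2])]

grad = [2 * (fi - yi) for fi, yi in zip(f, y)]   # squared-loss gradient at f
M = max(abs(inner(grad, c)) for c in C)
# sup over z in K_tau of |E<grad, f - z>|: since E<grad, z> ranges over
# [-tau * M, tau * M], the supremum equals |E<grad, f>| + tau * M.
G = abs(inner(grad, f)) + tau * M
```

Since $f \in K_\tau$ implies $|\mathbb{E}[\langle \nabla L, f \rangle]| \leq \tau M(f)$, the computed $G$ never exceeds $2\tau M$, matching the chain of inequalities above.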
Also note that we can define $G(f)$ with the absolute value due to the symmetry of our class, similar to the argument provided in the gradient boosting section. First, we will lower bound the difference of two iterates' losses. This will give us a lower bound on the progress our algorithm's model is making on a per-iterate basis.

Lemma 6.5 (FW single-iterate progress). Assume $L$ is $L$-smooth in the second argument. Let $d_t = g_t - f_{t-1}$ with $\| d_t \|_2 \leq 2\tau$. Then with the oracle above we get that
\[
R(f_{t-1}) - R(f_t) \geq \alpha_t \left( G(f_{t-1}) - \tau \varepsilon_t \right) - 2 L \tau^2 \alpha_t^2 .
\]

Proof. By $L$-smoothness we have the following quadratic upper bound (or descent lemma):
\[
L(y, f_{t-1}(x) + \alpha d_t(x)) \leq L(y, f_{t-1}(x)) + \alpha \langle \nabla_p L(y, f_{t-1}(x)), d_t(x) \rangle + \frac{L}{2} \alpha^2 \| d_t(x) \|_2^2 .
\]
Taking expectations, we know that
\[
R(f_{t-1}) - R(f_t) \geq \alpha \, \mathbb{E}\left[ \langle -\nabla_p L(y, f_{t-1}(x)), d_t \rangle \right] - \frac{L}{2} \alpha^2 \| d_t \|_2^2 .
\]
Consider the quantity
\[
\langle -\nabla_p L(y, f_{t-1}), d_t \rangle = \langle -\nabla_p L(y, f_{t-1}), g_t - f_{t-1} \rangle = \langle -\nabla_p L(y, f_{t-1}), g_t \rangle + \langle \nabla_p L(y, f_{t-1}), f_{t-1} \rangle .
\]
We know from the oracle that
\[
\langle -\nabla_p L(y, f_{t-1}), g_t \rangle \geq \tau \sup_{c \in \mathcal{C}} \langle -\nabla_p L(y, f_{t-1}), c \rangle - \tau \varepsilon_t = \sup_{g \in K_\tau} \langle -\nabla_p L(y, f_{t-1}), g \rangle - \tau \varepsilon_t .
\]
We can combine this back with the term $\langle \nabla_p L(y, f_{t-1}), f_{t-1} \rangle$ and reapply the expectation to lower bound this term by $G(f_{t-1}) - \tau \varepsilon_t$. Therefore, we know that
\[
R(f_{t-1}) - R(f_t) \geq \alpha_t \left( G(f_{t-1}) - \tau \varepsilon_t \right) - \frac{L}{2} \alpha_t^2 \| d_t \|_2^2 .
\]
Using the bound on $\| d_t \|_2$ (which we get from the normalization condition on the weak learner class and the triangle inequality), we get that
\[
R(f_{t-1}) - R(f_t) \geq \alpha_t \left( G(f_{t-1}) - \tau \varepsilon_t \right) - 2 L \tau^2 \alpha_t^2 .
\]

Next we lower bound the progress that the best model in the weak learner class could make, in terms of the current loss gap with the anchor model and our chosen atomic norm bound $\tau$:

Lemma 6.6
(FW Correlation Lower Bound w.r.t. Weak Learning Anchor Gap). For a given $f$ from our algorithm's iterates $(f_t)$, we have that
\[
M(f) \geq \frac{R(f) - R(K_\tau)}{2\tau} .
\]

Proof. Let $f^* = \arg\min_{f \in K_\tau} R(f)$. We know by convexity that
\[
L(y, f(x)) - L(y, f^*(x)) \leq \langle \nabla_p L(y, f(x)), f(x) - f^*(x) \rangle .
\]
Taking expectations and applying Hölder's inequality and the triangle inequality, we get that
\[
R(f) - R(K_\tau) \leq \| \nabla R(f) \|_{A^*} \left( \| f \|_A + \| f^* \|_A \right) .
\]
As shown below, by the definition of the dual norm, the atomic norm, linearity of the inner product, normalization of the weak learner class, and the budget $\tau$,
\[
\| \nabla R(f) \|_{A^*} = \sup_{\| c \|_A \leq 1} \left| \langle \nabla R(f), c \rangle \right| = \sup_{c \in \mathrm{conv}(\mathcal{C})} \left| \langle \nabla R(f), c \rangle \right| = \sup_{c \in \mathcal{C}} \left| \langle \nabla R(f), c \rangle \right| .
\]
Therefore, we get that $R(f) - R(K_\tau) \leq 2\tau M(f)$. Rearranging this expression gives the final bound.

Next we derive a recurrence relation between the error gap of the model at iteration $t$ and the best model in the restricted span of the weak learner class.

Lemma 6.7 (FW Gap Recurrence Toward $R(K_\tau)$). Assume $L$ is $L$-smooth in its second argument. Let $E_t := R(f_t) - R(K_\tau)$. Then for all $t \geq 1$,
\[
E_{t-1} - E_t \geq \alpha_t \left( G(f_{t-1}) - \tau \varepsilon_t \right) - 2 L \tau^2 \alpha_t^2 \geq \alpha_t \left( E_{t-1} - \tau \varepsilon_t \right) - 2 L \tau^2 \alpha_t^2 .
\]

Proof. By $L$-smoothness and the FW update $f_t = f_{t-1} + \alpha_t d_t$ with $d_t = g_t - f_{t-1}$, Lemma 6.5 gives
\[
R(f_{t-1}) - R(f_t) \geq \alpha_t \left( G(f_{t-1}) - \tau \varepsilon_t \right) - 2 L \tau^2 \alpha_t^2 .
\]
Subtracting and adding $R(K_\tau)$ yields the first inequality:
\[
E_{t-1} - E_t \geq \alpha_t \left( G(f_{t-1}) - \tau \varepsilon_t \right) - 2 L \tau^2 \alpha_t^2 .
\]
Let $f^* = \arg\min_{f \in K_\tau} \mathbb{E}[L(y, f)]$. By convexity, $E_{t-1} = R(f_{t-1}) - R(f^*) \leq \langle \nabla R(f_{t-1}), f_{t-1} - f^* \rangle \leq G(f_{t-1})$, so
\[
E_{t-1} - E_t \geq \alpha_t \left( E_{t-1} - \tau \varepsilon_t \right) - 2 L \tau^2 \alpha_t^2 ,
\]
which is the second inequality.

We will use this recurrence relation to bound the error gap for the model at iterate $t$.
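One can also check numerically what the recurrence of Lemma 6.7 implies when iterated at equality, which is the worst case it permits. The constants and the oracle-error sequence below are illustrative choices of ours; the asserted ceiling is the anchor gap bound derived in the next lemma.

```python
# Iterating the recurrence of Lemma 6.7 at equality (its worst case),
#   E_t = (1 - a_t) * E_{t-1} + a_t * tau * eps_t + 2 * L * tau^2 * a_t^2,
# with a_t = 2 / (t + 1). L, tau, E0, and eps are assumed toy values.
import random

random.seed(0)
L, tau, E0, T = 2.0, 1.5, 5.0, 500
eps = [0.05 * random.random() for _ in range(T + 1)]  # eps[t] for t >= 1

gaps, bounds = [], []
E, cum = E0, 0.0
for t in range(1, T + 1):
    a = 2 / (t + 1)
    E = (1 - a) * E + a * tau * eps[t] + 2 * L * tau ** 2 * a ** 2
    cum += eps[t]
    gaps.append(E)
    # Ceiling: 8 * L * tau^2 / (t + 1) + (2 * tau / (t + 1)) * sum of eps.
    bounds.append(8 * L * tau ** 2 / (t + 1) + (2 * tau / (t + 1)) * cum)
```

Even in this worst case, the gap sequence stays below the $O(1/t)$ ceiling at every iteration, so the recurrence alone forces the decay rate.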
Lemma 6.8 (FW Anchor Gap Upper Bound). For all $t \geq 1$,
\[
R(f_t) - R(K_\tau) \leq \frac{8 L \tau^2}{t+1} + \frac{2\tau}{t+1} \sum_{j=1}^{t} \varepsilon_j .
\]

Proof. From Lemma 6.7 we have the recursion
\[
E_{t-1} - E_t \geq \alpha_t \left( E_{t-1} - \tau \varepsilon_t \right) - 2 L \tau^2 \alpha_t^2 ,
\]
which is equivalent to
\[
E_t \leq E_{t-1} - \alpha_t \left( E_{t-1} - \tau \varepsilon_t \right) + 2 L \tau^2 \alpha_t^2 = (1 - \alpha_t) E_{t-1} + \alpha_t \tau \varepsilon_t + 2 L \tau^2 \alpha_t^2 .
\]
We use the convention that $[k] = \{1, \ldots, k\}$. Let $C = 4 L \tau^2$ and substitute in $\alpha_t = \frac{2}{t+1}$; then we get
\[
E_t \leq \frac{t-1}{t+1} E_{t-1} + \frac{2}{t+1} \tau \varepsilon_t + 2 L \tau^2 \left( \frac{2}{t+1} \right)^2 = \frac{t-1}{t+1} E_{t-1} + \frac{2C}{(t+1)^2} + \frac{2}{t+1} \tau \varepsilon_t .
\]
Define $S_t = \tau \sum_{j=1}^{t} j \varepsilon_j$. Then, we will prove via induction that for all $t \geq 1$,
\[
E_t \leq \frac{2C}{t+1} + \frac{2 S_t}{t(t+1)} .
\]
First, for the base case consider $t = 1$. We have from the recurrence relation that $E_1 \leq 0 + \frac{C}{2} + \tau \varepsilon_1 \leq C + \tau \varepsilon_1 = \frac{2C}{2} + \frac{2 S_1}{1 \cdot 2}$, as required. Next, suppose $E_t \leq \frac{2C}{t+1} + \frac{2 S_t}{t(t+1)}$; we will prove the same relationship holds for $E_{t+1}$. Using $\frac{2}{t+2} \tau \varepsilon_{t+1} = \frac{2 \tau (t+1) \varepsilon_{t+1}}{(t+1)(t+2)}$ and $S_{t+1} = S_t + \tau (t+1) \varepsilon_{t+1}$,
\[
\begin{aligned}
E_{t+1} &\leq \frac{t}{t+2} E_t + \frac{2C}{(t+2)^2} + \frac{2}{t+2} \tau \varepsilon_{t+1} \\
&\leq \frac{t}{t+2} \left( \frac{2C}{t+1} + \frac{2 S_t}{t(t+1)} \right) + \frac{2C}{(t+2)^2} + \frac{2}{t+2} \tau \varepsilon_{t+1} \\
&= \frac{2C}{t+2} \cdot \frac{t}{t+1} + \frac{2 S_t}{(t+1)(t+2)} + \frac{2C}{(t+2)^2} + \frac{2}{t+2} \tau \varepsilon_{t+1} \\
&= \frac{2C}{t+2} \left( \frac{t}{t+1} + \frac{1}{t+2} \right) + \frac{2 S_{t+1}}{(t+1)(t+2)} \\
&\leq \frac{2C}{t+2} \left( \frac{t}{t+1} + \frac{1}{t+1} \right) + \frac{2 S_{t+1}}{(t+1)(t+2)} \\
&= \frac{2C}{(t+1)+1} + \frac{2 S_{t+1}}{(t+1)(t+2)} .
\end{aligned}
\]
Therefore, we have that
\[
E_t \leq \frac{2C}{t+1} + \frac{2 S_t}{t(t+1)} = \frac{8 L \tau^2}{t+1} + \frac{2\tau}{t(t+1)} \sum_{j=1}^{t} j \varepsilon_j \leq \frac{8 L \tau^2}{t+1} + \frac{2\tau}{t+1} \sum_{j=1}^{t} \varepsilon_j .
\]

Theorem 6.9 (FW Gradient Boosting Agreement Bound). Fix any loss $L$ that is $L$-smooth and $\mu$-strongly convex. Let $f_1, f_2$ be the outputs of any two runs of Algorithm 3 parameterized with the same $\tau$, $k$, $\mathcal{C}$, such that the sequences of SQ-oracle errors are $\{\varepsilon_t\}_{t \in [k]}$ and $\{\varepsilon'_t\}_{t \in [k]}$ respectively. Let $f^* = \arg\min_{f \in K_\tau} R(f)$. Then we have that
\[
D(f_1, f_2) \leq \frac{64 L \tau^2}{\mu (k+1)} + \frac{8\tau}{\mu (k+1)} \left( \sum_{j=1}^{k} \varepsilon_j + \sum_{j=1}^{k} \varepsilon'_j \right) .
\]

Proof.
Since $f^*$ minimizes $\mathbb{E}[L(y, f(x))]$ over the convex set $K_\tau$, by first-order optimality we have the inequality
\[
\mathbb{E}\left[ \langle \nabla L(y, f^*(x)), z(x) - f^*(x) \rangle \right] \geq 0 \qquad \forall z \in K_\tau .
\]
Combining this with $\mu$-strong convexity of $L$ gives
\[
\mathbb{E}[L(y, g(x))] \geq \mathbb{E}\left[ L(y, f^*(x)) + \langle \nabla L(y, f^*(x)), g(x) - f^*(x) \rangle + \frac{\mu}{2} \| g(x) - f^*(x) \|_2^2 \right] .
\]
Since $K_\tau$ is convex, the midpoint $\frac{1}{2}(f_1 + f_2)$ lies in $K_\tau$. Applying Lemma 6.3 with $H = K_\tau$ gives
\[
D(f_1, f_2) \leq \frac{4}{\mu} \left( R(f_1) - R(f^*) \right) + \frac{4}{\mu} \left( R(f_2) - R(f^*) \right) .
\]
Finally, applying Lemma 6.8 to both error gap terms gives us the final bound.

6.3 Neural Networks

We next state the midpoint-anchor analogues of our neural-network and regression-tree agreement bounds for multi-dimensional $\mu$-strongly convex losses.

Theorem 6.10 (Agreement from midpoint closure). Assume $L$ is $\mu$-strongly convex.

1. If $f_1, f_2 \in \mathrm{NN}_n$ satisfy $R(f_i) \leq R(\mathrm{NN}_n) + \varepsilon$ for $i \in \{1, 2\}$, then
\[
D(f_1, f_2) \leq \frac{8}{\mu} \left( R(\mathrm{NN}_n) - R(\mathrm{NN}_{2n}) + \varepsilon \right) .
\]
2. If $f_1, f_2 \in \mathrm{Tree}_d$ satisfy $R(f_i) \leq R(\mathrm{Tree}_d) + \varepsilon$ for $i \in \{1, 2\}$, then
\[
D(f_1, f_2) \leq \frac{8}{\mu} \left( R(\mathrm{Tree}_d) - R(\mathrm{Tree}_{2d}) + \varepsilon \right) .
\]

Proof. We prove each part by applying Lemma 6.3 at the appropriate midpoint-closed level.

Part (1). Let $f_1, f_2 \in \mathrm{NN}_n$ and define $\bar{f} := \frac{1}{2}(f_1 + f_2)$. By midpoint closure (Lemma 5.1), we have $\bar{f} \in \mathrm{NN}_{2n}$. Applying Lemma 6.3 with $H = \mathrm{NN}_{2n}$ gives
\[
D(f_1, f_2) \leq \frac{4}{\mu} \left( R(f_1) - R(\mathrm{NN}_{2n}) \right) + \frac{4}{\mu} \left( R(f_2) - R(\mathrm{NN}_{2n}) \right) .
\]
Using the assumptions $R(f_i) \leq R(\mathrm{NN}_n) + \varepsilon$ for $i \in \{1, 2\}$, we obtain
\[
R(f_i) - R(\mathrm{NN}_{2n}) \leq R(\mathrm{NN}_n) - R(\mathrm{NN}_{2n}) + \varepsilon .
\]
Substituting this bound for both $i = 1, 2$ yields $D(f_1, f_2) \leq \frac{8}{\mu} \left( R(\mathrm{NN}_n) - R(\mathrm{NN}_{2n}) + \varepsilon \right)$, as claimed.

Part (2). The proof is identical with $\mathrm{Tree}_d$ in place of $\mathrm{NN}_n$. Let $f_1, f_2 \in \mathrm{Tree}_d$ and $\bar{f} := \frac{1}{2}(f_1 + f_2)$. By midpoint closure (Lemma 5.3), $\bar{f} \in \mathrm{Tree}_{2d}$.
Applying Lemma 6.3 with $H = \mathrm{Tree}_{2d}$ gives
\[
D(f_1, f_2) \leq \frac{4}{\mu} \left( R(f_1) - R(\mathrm{Tree}_{2d}) \right) + \frac{4}{\mu} \left( R(f_2) - R(\mathrm{Tree}_{2d}) \right) .
\]
Using $R(f_i) \leq R(\mathrm{Tree}_d) + \varepsilon$ for $i \in \{1, 2\}$ and substituting yields
\[
D(f_1, f_2) \leq \frac{8}{\mu} \left( R(\mathrm{Tree}_d) - R(\mathrm{Tree}_{2d}) + \varepsilon \right) .
\]

7 Acknowledgments

This work is partially supported by DARPA grant #HR001123S0011, an NSF Graduate Research Fellowship, a grant from the Simons Foundation, and the NSF ENCoRE TRIPODS institute. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies of DARPA or the US Government.

References

Scott Aaronson. The complexity of agreement. In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, pages 634–643, 2005.

Samuel Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. In The Eleventh International Conference on Learning Representations, 2023.

Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite Littlestone dimension. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 852–860, 2019.

Robert J. Aumann. Agreeing to disagree. The Annals of Statistics, 4(6):1236–1239, 1976.

Christina Baek, Yiding Jiang, Aditi Raghunathan, and J. Zico Kolter. Agreement-on-the-line: Predicting the performance of neural networks under distribution shift. In Advances in Neural Information Processing Systems, volume 35, 2022.

Dara Bahri and Heinrich Jiang. Locally adaptive label smoothing improves predictive churn. In Proceedings of the 38th International Conference on Machine Learning, 2021.

Yamini Bansal, Preetum Nakkiran, and Boaz Barak. Revisiting model stitching to compare neural representations. In Advances in Neural Information Processing Systems, 2021.
Srinadh Bhojanapalli, Kimberly Wilber, Andreas Veit, Ankit Singh Rawat, Seungyeon Kim, Aditya Menon, and Sanjiv Kumar. On the reproducibility of neural network predictions. arXiv preprint arXiv:2102.03349, 2021.

Emily Black, Manish Raghavan, and Solon Barocas. Model multiplicity: Opportunities, concerns, and solutions. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 850–863, 2022.

Jarosław Błasiok, Parikshit Gopalan, Lunjia Hu, Adam Tauman Kalai, and Preetum Nakkiran. Loss minimization yields multicalibration for large neural networks. In 15th Innovations in Theoretical Computer Science Conference (ITCS 2024), volume 287. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2024.

Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.

Leo Breiman. Stacked regressions. Machine Learning, 24(1):49–64, 1996.

Leo Breiman. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3):199–231, 2001.

Mark Bun, Roi Livni, and Shay Moran. An equivalence between private classification and online prediction. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 389–402. IEEE, 2020.

Mark Bun, Marco Gaboardi, Max Hopkins, Russell Impagliazzo, Rex Lei, Toniann Pitassi, Satchit Sivakumar, and Jessica Sorrell. Stability is stable: Connections between replicability, privacy, and adaptive generalization. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 520–527, 2023.

Zachary Charles and Dimitris Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. In International Conference on Machine Learning, pages 745–754. PMLR, 2018.

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system.
In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.

Natalie Collina, Surbhi Goel, Varun Gupta, and Aaron Roth. Tractable agreement protocols. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing, pages 1532–1543, 2025.

Natalie Collina, Ira Globus-Harris, Surbhi Goel, Varun Gupta, Aaron Roth, and Mirah Shi. Collaborative prediction: Tractable information aggregation via agreement. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2026.

Rachel Cummings, Katrina Ligett, Kobbi Nissim, Aaron Roth, and Zhiwei Steven Wu. Adaptive learning with robust generalization guarantees. In Conference on Learning Theory, pages 772–814. PMLR, 2016.

Ilias Diakonikolas, Jingyi Gao, Daniel Kane, Sihan Liu, and Christopher Ye. Replicable distribution testing. arXiv preprint arXiv:2507.02814, 2025.

Kate Donahue, Alexandra Chouldechova, and Krishnaram Kenthapadi. Human-algorithm collaboration: Achieving complementarity and avoiding unfairness. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1639–1656, 2022.

Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred Hamprecht. Essentially no barriers in neural network energy landscape. In Proceedings of the 35th International Conference on Machine Learning, 2018.

Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.

Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth.
Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 117–126, 2015.

Eric Eaton, Marcel Hussing, Michael Kearns, and Jessica Sorrell. Replicable reinforcement learning. Advances in Neural Information Processing Systems, 36:15172–15185, 2023.

Eric Eaton, Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta, and Jessica Sorrell. Replicable reinforcement learning with linear function approximation. In The Fourteenth International Conference on Learning Representations, 2026.

Rahim Entezari, Hanie Sedghi, Olga Saukh, and Behnam Neyshabur. The role of permutation invariance in linear mode connectivity of neural networks. In International Conference on Learning Representations, 2022.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337–407, 2000.

Jerome H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

Rafael Frongillo, Eric Neyman, and Bo Waggoner. Agreement implies accuracy for substitutable signals. In Proceedings of the 24th ACM Conference on Economics and Computation, pages 702–733, 2023.

Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P. Vetrov, and Andrew G. Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. Advances in Neural Information Processing Systems, 31, 2018.

John D. Geanakoplos and Heraklis M. Polemarchakis. We can't disagree forever. Journal of Economic Theory, 28(1):192–200, 1982.

Mila Gorecki and Moritz Hardt. Monoculture or multiplicity: Which is it? In The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.

Moritz Hardt, Ben Recht, and Yoram Singer.
T rain faster, generalize b etter: Stabilit y of sto c hastic gradien t descent. In International c onfer enc e on machine le arning , pages 1225–1234. PMLR, 2016. Christopher Hidey , F ei Liu, and Rahul Go el. Reducing mo del c hurn: Stable re-training of con- v ersational agen ts. In Oliver Lemon, Dilek Hakk ani-T ur, Junyi Jessy Li, Arash Ashrafzadeh, Daniel Hern´ andez Garcia, Malihe Alikhani, Da vid V andyk e, and Ond ˇ rej Du ˇ sek, editors, Pr o- c e e dings of the 23r d Annual Me eting of the Sp e cial Inter est Gr oup on Disc ourse and Dialo gue , pages 14–25, Edinburgh, UK, September 2022. Asso ciation for Computational Linguistics. doi: 10.18653/v1/2022.sigdial- 1.2. URL https://aclanthology.org/2022.sigdial- 1.2/ . 31 Jordan Hoffmann, Sebastian Borgeaud, Arth ur Mensc h, Elena Buchatsk ay a, T revor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendric ks, Johannes W elbl, Aidan Clark, et al. T raining compute-optimal large language mo dels. In Pr o c e e dings of the 36th International Con- fer enc e on Neur al Information Pr o c essing Systems , pages 30016–30030, 2022. Max Hopkins, Russell Impagliazzo, and Christopher Y e. Appro ximate replicability in learning. arXiv pr eprint arXiv:2510.20200 , 2025. Russell Impagliazzo, Rex Lei, T oniann Pitassi, and Jessica Sorrell. Reproducibility in learning. In Pr o c e e dings of the 54th annual ACM SIGACT symp osium on the ory of c omputing , pages 818–831, 2022. Arth ur Jacot, F ranck Gabriel, and Clemen t Hongler. Neural tangent kernel: Conv ergence and generalization in neural netw orks. In A dvanc es in Neur al Information Pr o c essing Systems , 2018. Yiding Jiang, V aishnavh Nagara jan, Christina Baek, and J Zico Kolter. Assessing generalization of SGD via disagreement. In International Confer enc e on L e arning R epr esentations , 2022. URL https://openreview.net/forum?id=WvOGCEAQhxl . Zhengshen Jiang, Hongzhi Liu, Bin F u, and Zhonghai W u. 
Generalized am biguity decompositions for classification with applications in active learning and unsup ervised ensemble pruning. In Pr o c e e dings of the AAAI Confer enc e on A rtificial Intel ligenc e , volume 31, 2017. Rie Johnson and T ong Zhang. Inconsistency , instability , and generalization gap of deep neural net work training. In Thirty-seventh Confer enc e on Neur al Information Pr o c essing Systems , 2023. Keller Jordan. On the v ariance of neural net work training with resp ect to test sets and distributions. In The Twelfth International Confer enc e on L e arning R epr esentations , 2024. Alkis Kalav asis, Amin Karbasi, Kasp er Green Larsen, Grigoris V elegk as, and F elix Zhou. Replicable learning of large-margin halfspaces. arXiv pr eprint a rXiv:2402.13857 , 2024a. Alkis Kala v asis, Amin Karbasi, Grigoris V elegk as, and F elix Zhou. On the computational landscape of replicable learning. A dvanc es in Neur al Information Pr o c essing Systems , 37:105887–105927, 2024b. Jared Kaplan, Sam McCandlish, T om Henighan, T om B Bro wn, Benjamin Chess, Rew on Child, Scott Gray , Alec Radford, Jeffrey W u, and Dario Amo dei. Scaling la ws for neural language mo dels. arXiv pr eprint arXiv:2001.08361 , 2020. Amin Karbasi, Grigoris V elegk as, Lin Y ang, and F elix Zhou. Replicability in reinforcemen t learning. A dvanc es in Neur al Information Pr o c essing Systems , 36:74702–74735, 2023. Mic hael Kearns. Efficient noise-toleran t learning from statistical queries. Journal of the ACM (JA CM) , 45(6):983–1006, 1998. Mic hael Kearns, Aaron Roth, and Emily Ryu. Net w orked information aggregation via machine learning. In Pr o c e e dings of the ACM-SIAM Symp osium on Discr ete Algorithms , 2026. Anders Krogh and Jesp er V edelsby . Neural net w ork ensem bles, cross v alidation, and activ e learning. A dvanc es in neur al information pr o c essing systems , 7, 1994. 32 Gil Kur, Eli Putterman, and Alexander Rakhlin. 
On the v ariance, admissibility , and stability of empirical risk minimization. In A dvanc es in Neur al Information Pr o c essing Systems , volume 36, 2023. Jaeho on Lee, Lec hao Xiao, Samuel Sc ho enholz, Y asaman Bahri, Roman No v ak, Jasc ha Sohl- Dic kstein, and Jeffrey P ennington. Wide neural netw orks of any depth ev olve as linear models under gradien t descent. In A dvanc es in Neur al Information Pr o c essing Systems , 2019. Jialin Mao, Itay Griniast y , Han Kheng T eoh, Rahul Ramesh, Rubing Y ang, Mark K. T ranstrum, James P . Sethna, and Pratik Chaudhari. The training pro cess of many deep net works explores the same low-dimensional manifold. Pr o c e e dings of the National A c ademy of Scienc es , 2024. Charles Marx, Flavio Calmon, and Berk Ustun. Predictiv e m ultiplicity in classification. In Inter- national c onfer enc e on machine le arning , pages 6765–6774. PMLR, 2020. Llew Mason, Jonathan Baxter, P eter Bartlett, and Marcus F rean. Bo osting algorithms as gradien t descen t. A dvanc es in neur al information pr o c essing systems , 12, 1999. Mahdi Milani F ard, Quentin Cormier, Kevin Canini, and May a Gupta. Launch and iterate: Re- ducing prediction ch urn. A dvanc es in Neur al Information Pr o c essing Systems , 29, 2016. Preetum Nakkiran and Y amini Bansal. Distributional generalization: A new kind of generalization. arXiv pr eprint arXiv:2009.08092 , 2020. Kenn y Peng, Nikhil Garg, and Jon Kleinberg. A no free lunch theorem for human-ai collab oration. In Pr o c e e dings of the AAAI Confer enc e on Artificial Intel ligenc e , volume 39, pages 14369–14376, 2025. Aaron Roth and Alexander Williams T olb ert. Resolving the reference class problem at scale. Philosophy of Scienc e , pages 1–15, 2025. Go wthami Somepalli, Liam F o wl, Arpit Bansal, Ping Y eh-Chiang, Y ehuda Dar, Ric hard Baraniuk, Micah Goldblum, and T om Goldstein. Can neural nets learn the same mo del twice? 
inv estigating repro ducibilit y and double descent from the decision b oundary p erspective. In Pr o c e e dings of the IEEE/CVF Confer enc e on Computer Vision and Pattern R e c o gnition , June 2022. Lijing W ang, Dipanjan Ghosh, Maria T eresa Gonzalez Diaz, Ahmed F arahat, Mah bubul Alam, Chetan Gupta, Jiangzhuo Chen, and Madhav Marathe. Wisdom of the ensemble: improving consistency of deep learning mo dels. In Pr o c e e dings of the 34th International Confer enc e on Neur al Information Pr o c essing Systems , 2020. Jamelle W atson-Daniels, Flavio du Pin Calmon, Alexander D’Amour, Carol Long, David C P ark es, and Berk Ustun. Predictive c h urn with the set of go od mo dels. arXiv pr eprint arXiv:2402.07745 , 2024. Da vid H W olp ert. Stack ed generalization. Neur al networks , 5(2):241–259, 1992. Dann y W o o d, Tingting Mu, Andrew M W ebb, Henry WJ Reeve, Mikel Luj´ an, and Gavin Brown. A unified theory of div ersity in ensemble learning. Journal of machine le arning r ese ar ch , 24(359): 1–49, 2023. Zhanp eng Zhou, Y ongyi Y ang, Xiao jiang Y ang, Junchi Y an, and W ei Hu. Going b ey ond linear mo de connectivit y: The lay erwise linear feature connectivity . In A dvanc es in Neur al Information Pr o c essing Systems , 2023. 33