Defensive Generation


Authors: Gabriele Farina, Juan Carlos Perdomo

Gabriele Farina (MIT, gfarina@mit.edu) and Juan Carlos Perdomo (MIT, NYU, j.perdomo.silva@nyu.edu)

February 26, 2026

Abstract

We study the problem of efficiently producing, in an online fashion, generative models of scalar, multiclass, and vector-valued outcomes that cannot be falsified on the basis of the observed data and a pre-specified collection of computational tests. Our contributions are twofold. First, we expand on connections between online high-dimensional multicalibration with respect to an RKHS and recent advances in expected variational inequality problems, enabling efficient algorithms for the former. We then apply this algorithmic machinery to the problem of outcome indistinguishability. Our procedure, Defensive Generation, is the first to efficiently produce online outcome indistinguishable generative models of non-Bernoulli outcomes that are unfalsifiable with respect to infinite classes of tests, including those that examine higher-order moments of the generated distributions. Furthermore, our method runs in near-linear time in the number of samples and achieves the optimal, vanishing 1/√T rate for generation error.

1 Introduction

Supervised learning frames learning in terms of loss minimization. Under this paradigm, an algorithm succeeds if it is able to find a prediction rule that has low excess risk over the underlying data distribution, with respect to some loss and hypothesis class. Drawing on a rich set of ideas from complexity theory and cryptography, outcome indistinguishability offers a different perspective [Dwork et al., 2021]. A learner succeeds if it finds a generative model of outcomes that cannot be falsified on the basis of the observed data and a pre-specified collection of computational tests.
The motivation behind outcome indistinguishability is that if no computational process can tell the difference between the "true" data produced by Nature and data produced by the Learner's model, then this model effectively is the "real" data generating process. The validity of the Learner's model is staked on its ability to withstand falsification.

In this paper, we study the problem of designing computationally efficient algorithms that provably find outcome indistinguishable generative models for rich classes of outcomes. We study this question in the online setting where data is arbitrarily (possibly adversarially) chosen. Samples are not assumed to be drawn i.i.d. from any distribution, yet the Learner is tasked with finding a generative model that, for all intents and purposes, "looks like" it generated the observed sequence.

Our main contribution is an efficient online procedure that finds generative models of scalar, multiclass, or vector-valued outcomes that are provably outcome indistinguishable with respect to rich, infinite collections of tests that live in a vector-valued reproducing kernel Hilbert space. A generative model here is a function that, given a set of features x, produces a conditional distribution µ over outcomes Y. Prior work in this area was either restricted to the case of finding generative models of simple binary outcomes, or required explicit enumeration over the set of tests (which ruled out efficiently guaranteeing indistinguishability with respect to super-polynomially-sized classes). Our algorithm not only expands the scope of settings where one can guarantee indistinguishability, it also achieves the optimal √T regret bound for this problem (see Vovk [2007] for a lower bound).
On a technical level, our result comes from reducing online outcome indistinguishability to high-dimensional multicalibration and using recent advances on expected variational inequalities (EVIs) to efficiently solve these underlying calibration problems. A core part of our work hence relies on exploring the spiderweb of connections between indistinguishability, online calibration, research on learning in games, and nonlinear optimization.

On a conceptual level, we highlight how our work provides a different perspective on generative modeling, with online learning as its foundation. Much of the recent theoretical literature assumes that there is a fixed distribution we wish to sample from, and relies on function approximation assumptions to guarantee that the learned model is close in statistical distance to the truth. Our work on online OI, on the other hand, does not guarantee closeness in statistical distance, nor does it rely on unverifiable function approximation assumptions. Instead, it provides an end-to-end unconditional guarantee that the learned model is computationally indistinguishable from the "truth"; where truth is in quotation marks since data is arbitrary and there is no underlying distribution to speak of.

Lastly, we produce these indistinguishable models through defensive forecasting, an algorithmic methodology pioneered in Vovk et al. [2005]. Rather than trying to make a good guess about what the true data will be, defensive forecasting views prediction as a game. A good prediction is not one which aims to mimic Nature, but rather one that ensures that the generative model looks good in hindsight (from the perspective of the set of tests) no matter the choice of Nature. We provide an overview of our contributions in Section 1.1.
The interested reader can skip ahead to Section 1.2 to see example guarantees of our Defensive Generation algorithm in domains like language generation, learning linear dynamical systems, and weather forecasting. A technical overview is given in Section 2.

1.1 Overview of Contributions

We design algorithms that work in the following online protocol. At every time step t, the Learner sees features x_t chosen by Nature and randomly produces a distribution µ_t over the outcome space Y and an auxiliary vector of statistics p_t (think p_t = g(µ_t) for some function g). Lastly, Nature reveals an outcome y_t ∈ Y. Our goal is to design algorithms which achieve the following desideratum.

Definition 1. An algorithm A satisfies online outcome indistinguishability with respect to a class of distinguishers F if it produces distributions µ_t over Y and statistics p_t such that for any f ∈ F,

    OIGap_T(f) := | Σ_{t=1}^T E_{p_t}[f(x_t, p_t, y_t)] − Σ_{t=1}^T E_{p_t, ỹ_t∼µ_t}[f(x_t, p_t, ỹ_t)] |,    (1)

is o(T) regardless of Nature's choices of (x_t, y_t). We define OIGap_T := sup_{f∈F} OIGap_T(f).

On the left, we have the outputs of the distinguishers f on the real outcomes y_t. On the right, there is the expected value of the distinguisher over the outcomes ỹ_t ∼ µ_t sampled by the Learner's model. Dividing the above expression by T, an algorithm guarantees online outcome indistinguishability if the value of every function in F is the same when averaged over (1) the realized sequence of outcomes y_t, or (2) the simulated outcomes ỹ_t. No distinguisher can spot a difference between the real data and the simulated data:

    lim_{T→∞} | (1/T) Σ_{t=1}^T E_{p_t}[f(x_t, p_t, y_t)] − (1/T) Σ_{t=1}^T E_{p_t, ỹ_t∼µ_t}[f(x_t, p_t, ỹ_t)] | = 0.
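For a finite outcome space, the empirical OI gap of a fixed distinguisher can be computed directly from the transcript, since the inner expectation over ỹ_t ∼ µ_t is an exact sum. The sketch below is only an illustration of Definition 1; the function name `oi_gap` and the toy transcript are ours, not the paper's.

```python
import numpy as np

def oi_gap(f, xs, ps, ys, mus):
    """Empirical OI gap (Definition 1) for one distinguisher f over a
    transcript of T rounds with a finite outcome space Y = {0, ..., d-1}.

    f   : callable f(x, p, y) -> float
    xs  : list of feature vectors x_t
    ps  : list of statistics vectors p_t
    ys  : list of realized outcomes y_t (integers)
    mus : list of arrays mu_t, each a distribution over the d outcomes
    """
    real = sum(f(x, p, y) for x, p, y in zip(xs, ps, ys))
    # Exact expectation over the model's outcome distribution at each round.
    simulated = sum(
        sum(mu[y] * f(x, p, y) for y in range(len(mu)))
        for x, p, mu in zip(xs, ps, mus)
    )
    return abs(real - simulated)

# Toy transcript: d = 3 tokens, T = 2 rounds; f ignores p and tests "y == 0".
f = lambda x, p, y: float(y == 0)
xs = [np.zeros(1), np.zeros(1)]
mus = [np.array([0.5, 0.25, 0.25]), np.array([0.2, 0.3, 0.5])]
ps = mus  # here the statistics are the full conditional distribution
ys = [0, 2]
gap = oi_gap(f, xs, ps, ys, mus)  # |1 - (0.5 + 0.2)| = 0.3
```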
As a thought experiment, if the outcomes y_t were truly random and drawn from µ_t, by a martingale argument we should expect that, for any f, OIGap_T(f) ≈ √T. Our main contribution is a new algorithm, Defensive Generation, that achieves exactly the same guarantee, online, and for broad classes of adversarially chosen outcomes y_t.

Theorem 1 (Informal). The following are true regarding the Defensive Generation procedure. Furthermore, in each setting, it runs in time O(poly(d)·log(t)) at time t and has OIGap_T ≤ O(√T).

1. Multiclass outcomes, Y = [d]. The procedure outputs distributions µ_t that are online outcome indistinguishable with respect to any infinite collection of distinguishers f(x, p, y) that form an RKHS, and which have full access to the entire conditional distribution µ_t (i.e., p_t = µ_t). As a specific example, this in particular implies indistinguishability with respect to all linear functions at a rate bounded by d√T. See Example 1.

2. Scalar outcomes, Y = [−1, 1]. Defensive Generation produces distributions µ_t that are online outcome indistinguishable with respect to all low-degree tests in any RKHS that only examine the first d moments of µ_t. That is, p_t = (1, E_{µ_t}[ỹ_t], E_{µ_t}[ỹ_t²], ..., E_{µ_t}[ỹ_t^d]). For Boolean x, this implies that one can produce distributions over scalar ỹ whose first d conditional moments, E[ỹ^j | c(x) = 1] for j = 1 to d, match those of the observed sequence over all subsets of the hypercube computable by low-depth decision trees c(x). See Example 2.

3. High-dimensional outcomes, Y = {y ∈ R^d : ∥y∥ ≤ 1}. It outputs µ_t that are online OI with respect to distinguishers in any RKHS that examine the mean and covariance of µ_t. That is, p_t = (E_{µ_t}[ỹ_t], E_{µ_t}[ỹ_t ỹ_tᵀ]).
This in particular implies indistinguishability with respect to the infinite set of distinguishers that lie in the span of a fixed set of (nonlinear) features Φ(x, p). See Example 4. Moreover, if the distinguishers only examine the first moments, p = E_{µ_t}[ỹ_t], then the Defensive Generation algorithm works for any compact convex set in R^d, not just the unit ball. See Example 3.

To the best of our knowledge, this is the first algorithm that can efficiently guarantee online outcome indistinguishability with respect to infinite classes F beyond the case of Bernoulli outcomes. As mentioned previously, the main backbone of this result is a new near-linear-time algorithm for online multicalibration in high dimensions. We state this problem formally:

Definition 2. Assume that at every time step, Nature selects features x_t ∈ X arbitrarily and then the Learner randomly samples a forecast p_t in a compact, convex set Z ⊂ R^d. Lastly, Nature reveals the target z_t ∈ Z. An algorithm A guarantees online multicalibration with respect to a class of functions H ⊆ {X × Z → R^d} if it randomly produces a sequence of p_t such that

    sup_{h∈H} | Σ_{t=1}^T E_{p_t}[h(x_t, p_t)ᵀ(z_t − p_t)] | ≤ o(T)

regardless of Nature's choices of (x_t, z_t) ∈ X × Z in this online protocol.

Unlike other versions of online (ℓ₁) calibration studied in the literature [Peng, 2025, Fishelson et al., 2025], where errors are summed up over a grid of predicted values p_t = v, one can achieve error ε for this multicalibration problem after O(ε⁻²) many rounds for rich classes of functions H. This is in contrast to the lower bound of d^{poly(1/ε)} many rounds for the ℓ₁ version. In particular, we design an algorithm that guarantees √T regret for any vector-valued RKHS H.
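For an RKHS ball {h : ∥h∥_H ≤ 1}, the supremum in Definition 2 has a closed form via the reproducing property: it is the H-norm of the aggregated error functional Σ_t Φ(x_t, p_t)(z_t − p_t), which is computable from the kernel alone. The sketch below assumes the scalar-identity kernel Γ = I_d · k (one of the special cases used in the paper's examples); the helper name is ours.

```python
import numpy as np

def multicalibration_error(k, us, zs, ps):
    """sup_{||h||_H <= 1} | sum_t h(u_t)^T (z_t - p_t) | for the vector-valued
    RKHS with kernel Gamma(u, u') = I_d * k(u, u').  By the reproducing
    property, this supremum equals the H-norm of sum_t Phi(u_t)(z_t - p_t),
    i.e. sqrt( sum_{s,t} k(u_s, u_t) * (z_s - p_s)^T (z_t - p_t) )."""
    errs = [np.asarray(z) - np.asarray(p) for z, p in zip(zs, ps)]
    total = sum(
        k(us[s], us[t]) * float(errs[s] @ errs[t])
        for s in range(len(us)) for t in range(len(us))
    )
    return float(np.sqrt(max(total, 0.0)))

# Toy check with the constant kernel k = 1, so H is the set of constant
# functions and the error is just the norm of the summed residuals.
k_const = lambda u, v: 1.0
z1, p1 = np.array([1.0, 0.0]), np.array([0.5, 0.5])
z2, p2 = np.array([0.0, 1.0]), np.array([0.5, 0.5])
err_single = multicalibration_error(k_const, [0], [z1], [p1])           # sqrt(0.5)
err_pair = multicalibration_error(k_const, [0, 1], [z1, z2], [p1, p2])  # residuals cancel -> 0
```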
Our construction relies on recent advances in solving expected variational inequality problems [Zhang et al., 2025], along with the high-dimensional defensive forecasting algorithm from Dwork et al. [2025]. We state its guarantees below:

Theorem 2 (Informal). Let H ⊆ {X × Z → R^d} be any vector-valued RKHS with corresponding matrix-valued kernel Γ. Then, the high-dimensional defensive forecasting algorithm can be efficiently implemented to run in time O(poly(d) log(t)) at time step t and guarantee that for any h ∈ H,

    | Σ_{t=1}^T E_{p_t}[h(x_t, p_t)ᵀ(z_t − p_t)] | ≤ ∥h∥_H · √( Σ_{t=1}^T E_{p_t}[(z_t − p_t)ᵀ Γ((x_t, p_t), (x_t, p_t))(z_t − p_t)] ).

Here, ∥·∥_H denotes the norm of the functions in the RKHS H.

As a final note, prior work [Gopalan et al., 2023, Dwork et al., 2025, Okoroafor et al., 2025] has established that models that are OI are also loss minimizing. OI in fact implies loss minimization not just for a single loss, but rather for many losses simultaneously (that is, omniprediction) [Gopalan et al., 2022]. By efficiently guaranteeing OI in higher dimensions, we hope to enable future work on faster algorithms for omniprediction in richer domains.

1.2 Example Applications

We provide in this section a few different examples of generative modeling problems and the associated guarantees of the Defensive Generation algorithm. Rather than presenting the strongest possible guarantees, we focus on highlighting concrete examples and the intuition behind the indistinguishability definitions. Some of these example instantiations may be of independent interest.

Token-Based Sequence Modeling. To start, we consider token-based generation of sequences, such as language modeling.
In this setting, the goal is, given the history of tokens x_t, to produce a conditional distribution µ_t over the next token y_t ∈ Y = [d], where Y is a vocabulary of d tokens. In more detail, at each round t = 1, 2, ..., we let x_t be the entire history of tokens observed so far, x_t = (y_1, y_2, ..., y_{t−1}). Having seen x_t, the learner produces a distribution µ_t ∈ Δ([d]) := {µ ∈ R^d_{≥0} : 1ᵀµ = 1}. That is, a generative model of the next token given the history, Pr_{µ_t}[y_t = · | x_t]. Having produced µ_t, the next token y_t ∈ [d] is revealed.

In this setting, we can produce, as a simple example, distributions µ_t that are unfalsifiable with respect to tests that are linear functions of some feature embedding Φ(x_t) of the history, such as a frozen Transformer or BERT representation of the context. We define

    k(x, x′) = ⟨Φ(x), Φ(x′)⟩,    Γ(x, x′) = I_d · k(x, x′),    G = sup_t k(x_t, x_t) = sup_t ∥Φ(x_t)∥₂².

Associated distinguishers take the form f_h(x, p, y) = h(x)ᵀe_y, where h(x) = WᵀΦ(x) ∈ R^d and e_y = (1{y = 1}, ..., 1{y = d}), with norm ∥h∥_H = ∥W∥_F. In this case, Defensive Generation guarantees that for any such h,

    (1/T) · OIGap_T(f_h) = | (1/T) Σ_{t=1}^T E[h(x_t)ᵀ(e_{y_t} − µ_t)] | ≤ 4∥h∥_H √(G/T).

This means that, no matter how the tokens y_t are generated, every linear function of the embedded history Φ(x_t) looks the same when evaluated over (1) the empirical distribution of tokens Σ_{t=1}^T e_{y_t} or (2) the conditional distributions Σ_{t=1}^T µ_t. We note that in this case the distinguishers f are only a function of x_t and the outcomes y_t. They ignore the side information p_t, which in this case is exactly equal to the full conditional distribution µ_t. Also note that the generation error bound only depends on the norms of the functions and not explicitly on the ambient dimension.
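The worst case over all distinguishers with ∥W∥_F ≤ 1 in this token example can be tracked with a single matrix: each round contributes the outer product Φ(x_t)(e_{y_t} − µ_t)ᵀ, and the supremum of the averaged gap is its Frobenius norm. A minimal simulation, where the embedding is a random stand-in for a frozen Transformer/BERT map and Nature happens to sample from µ_t itself (so the gap should be small); all names here are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, embed_dim = 5, 100, 8

# Stand-in for a frozen embedding of the token history: a fixed random
# projection of the normalized token counts (any fixed map would do).
W_embed = rng.normal(size=(embed_dim, d))

def embed(history):
    if not history:
        return np.zeros(embed_dim)
    counts = np.bincount(history, minlength=d).astype(float)
    return W_embed @ (counts / len(history))

history, gap_matrix = [], np.zeros((embed_dim, d))
for t in range(T):
    phi = embed(history)
    mu = rng.dirichlet(np.ones(d))   # placeholder generative model
    y = int(rng.choice(d, p=mu))     # here Nature samples from mu itself
    e_y = np.eye(d)[y]
    # f_h(x, p, y) = h(x)^T e_y with h(x) = W^T phi evaluates, per round,
    # to <W, phi (e_y - mu)^T>_F; accumulate the matrix.
    gap_matrix += np.outer(phi, e_y - mu)
    history.append(y)

# sup over ||W||_F <= 1 of (1/T) * OIGap_T(f_h) is the Frobenius norm.
worst_case_gap = np.linalg.norm(gap_matrix / T)
```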
The guarantees presented here follow directly from Corollary 1.

Predicting Correlated Rain Across the United States. Next, we illustrate how Defensive Generation can be used to produce conditional distributions over high-dimensional, real-valued outcomes. In particular, consider the problem of predicting how much it will rain in all 50 US states every day. Here, y_t ∈ R^50 is a vector where y_{t,i} ∈ R denotes the amount of rainfall in state i on day t. We normalize precipitation units so that ∥y_t∥₂ ≤ 1. Let x_t ∈ R^n be a vector of features (e.g., atmospheric measurements, seasonality information, prior rainfall, etc.) with ℓ₂ norm uniformly bounded by B.

At every time t, the algorithm takes in x_t and outputs a joint distribution µ_t over vectors in R^50, that is, rain in all 50 states.¹ It also outputs auxiliary statistics p_t = (v_t, Q_t), where E_{ỹ_t∼µ_t}[ỹ_t] = v_t and E_{ỹ_t∼µ_t}[ỹ_t ỹ_tᵀ] = Q_t, describing the mean and covariance of the conditional distributions of rain. If we use the linear kernel, the algorithm guarantees (as a special case) indistinguishability with respect to all functions of the form

    f_θ(x_t, p_t, y_t) = y_{t,i} · θᵀx_t,    f_β(x_t, p_t, y_t) = y_{t,i} · y_{t,j} · βᵀx_t,    f_w(x_t, p_t, y_t) = v_tᵀw · y_{t,i},

where θ and β are vectors in R^n, w is a vector in R^50, p_t = (v_t, Q_t) = (E_{µ_t}[ỹ_t], E_{µ_t}[ỹ_t ỹ_tᵀ]), and i, j are arbitrary state indices in {1, ..., 50}. Expanding the definition of indistinguishability and plugging in the guarantees

¹ As per Algorithm 3, this is an atomic measure supported on 51 points.
for Defensive Generation, we get the following statements:

    (1/T) · OIGap(f_θ) = | (1/T) Σ_{t=1}^T θᵀx_t (y_{t,i} − E_{µ_t}[ỹ_{t,i}]) | ≤ 4∥θ∥₂ √((B² + 2)/T),    (correct mean per state)

    (1/T) · OIGap(f_β) = | (1/T) Σ_{t=1}^T βᵀx_t (y_{t,i} · y_{t,j} − E_{µ_t}[ỹ_{t,i} · ỹ_{t,j}]) | ≤ 4∥β∥₂ √((B² + 2)/T),    (correct covariances)

    (1/T) · OIGap(f_w) = | (1/T) Σ_{t=1}^T wᵀE_{µ_t}[ỹ_t] (y_{t,i} − E_{µ_t}[ỹ_{t,i}]) | ≤ 4∥w∥₂ √((B² + 2)/T).    (self-consistent means)

Unpacking this a bit further: from the first set of conditions, we get that for every state i, the error in the expected value of rain is uncorrelated with any linear function of the features x. That is, the generative model of rain has means that are conditionally correct as per this test. The second equation shows that not only are the means per state correct, the conditional distribution also captures the pairwise correlations between any pair of states. If a storm hits Massachusetts, it likely also hits Rhode Island. This second set of tests guarantees that the joint distribution of rain passes all these pairwise checks. It also asserts that the variances of the per-state rainfall distributions are correct (i and j can be the same). The last set of conditions shows that the means are not only conditionally correct, they are self-consistent in the sense that the errors in the expected value of rain in state i, E_{µ_t}[ỹ_{t,i}] − y_{t,i}, are uncorrelated with any linear function of the means of the distributions themselves. This is a strengthening of the indistinguishability guarantee and is not implied by either of the first two conditions.² See Foster and Kakade [2006] for further discussion of this point.

The guarantees presented in this example follow from Corollary 4 (see also Appendix D).
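The statistics p_t = (v_t, Q_t) in this example are just the first two moments of the atomic measure the algorithm outputs, and each distinguisher family pairs a per-round residual with a linear function. A small sketch of these quantities; the 51-point support and its weights below are made-up illustrations (the paper's Algorithm 3, which actually constructs the measure, is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states = 50

# A hypothetical atomic measure over rainfall vectors: 51 support points
# on the unit sphere with Dirichlet mixing weights (illustration only).
support = rng.normal(size=(51, n_states))
support /= np.linalg.norm(support, axis=1, keepdims=True)
weights = rng.dirichlet(np.ones(51))

v = weights @ support                         # v_t = E_mu[y]
Q = support.T @ (weights[:, None] * support)  # Q_t = E_mu[y y^T]

# One round's inner term for each distinguisher family, with features x
# and an outcome y drawn from mu itself (so each term is zero in expectation).
x = rng.normal(size=10)
x /= np.linalg.norm(x)
y = support[rng.choice(51, p=weights)]
i, j = 3, 7
theta_term = (y[i] - v[i]) * x           # mean test for state i
beta_term = (y[i] * y[j] - Q[i, j]) * x  # covariance test for states (i, j)
w_term = (y[i] - v[i]) * v               # self-consistency test
```

Since every support point has unit norm, the second-moment matrix satisfies trace(Q) = Σ_k w_k ∥y_k∥² = 1, which is a quick consistency check on the construction.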
We note how this example highlights the distinction between the conditional distributions µ_t produced by the algorithm and the statistics p_t consumed by the distinguisher. While Defensive Generation produces a probability measure over points on the unit ball, the distinguishers only examine the first and second moments of the distribution.

Learning Linear Dynamical Systems. As a final example, we consider the task of finding a generative model that is indistinguishable with respect to data generated by a linear dynamical system. In particular, consider the system

    z_{t+1} = A z_t + ζ_t,    y_t = C z_t + ν_t.

Here, y_t ∈ R^d is the observation, z_t is the hidden state, A and C are matrices, and ζ_t and ν_t are noise vectors which could be adversarial (not i.i.d.). Assume that the observations y_t are uniformly bounded, ∥y_t∥ ≤ B. This is true whenever the initial hidden state z_0 is bounded, the noise terms (ν_t, ζ_t) are bounded, and the matrix A has spectral radius strictly less than 1 (A is strictly stable). To keep notation simple, we set B = 1 (assuming strict stability, one can always renormalize).

Now fix a history length ℓ ≥ 1 and set the features x_t in the online protocol to be the history of observations y_j up to lag ℓ:

    x_t := (y_{t−1}, y_{t−2}, ..., y_{t−ℓ}) ∈ R^{dℓ}.

Note that ∥x_t∥₂² ≤ ℓ. Given x_t, the Defensive Generation algorithm produces a probability measure µ_t over points in R^d along with statistics p_t = (E_{µ_t}[ỹ_t], E_{µ_t}[ỹ_t ỹ_tᵀ]).

² The covariances in this example are also self-consistent; we omit the equation for the sake of concision.
Again using the linear kernel Γ((x, p), (x′, p′)) = I · k(x, x′), by Corollary 4 the Defensive Generation algorithm guarantees indistinguishability with respect to all distinguishers of the form

    f(x_t, p_t, y_t) = Σ_{i=1}^d x_tᵀα_i · y_{t,i} + Σ_{1≤i≤j≤d} x_tᵀβ_{ij} · (y_{t,i} y_{t,j}),

where the α_i and β_{ij} are all vectors in R^{dℓ}. In particular, for any such function f,³

    OIGap(f) ≤ ( Σ_{i=1}^d ∥α_i∥₂ + Σ_{1≤i≤j≤d} ∥β_{ij}∥₂ ) · √(4Tℓ).    (2)

This in particular means that the Defensive Generation algorithm produces probability distributions µ_t that "look like" they generated the observations y_t, at least from the perspective of any linear function of the truncated history of observations:

    (1/T) Σ_{t=1}^T [ Σ_{i=1}^d x_tᵀα_i · y_{t,i} + Σ_{1≤i≤j≤d} x_tᵀβ_{ij} · (y_{t,i} y_{t,j}) ]
        ≈ (1/T) Σ_{t=1}^T [ Σ_{i=1}^d x_tᵀα_i · E_{µ_t}[ỹ_{t,i}] + Σ_{1≤i≤j≤d} x_tᵀβ_{ij} · E_{µ_t}[ỹ_{t,i} ỹ_{t,j}] ].

This indistinguishability guarantee of the probability measures µ_t holds for any truncation length ℓ and with no stochasticity assumptions on the noise. Furthermore, the hidden state z_t can be infinite-dimensional and the pair of matrices (A, C) need not satisfy any observability conditions. We only require that the observations y_t have bounded norm.

Moreover, unlike other approaches to learning dynamical systems that only produce point estimates E_{µ_t}[ỹ_t] for y_t, our algorithm produces a full probability measure with calibrated estimates of what the covariances E_{µ_t}[ỹ ỹᵀ] will be. Lastly, we note that our procedure does not recover the matrices (A, C). It only learns to produce distributions that are indistinguishable with respect to the true data generated by the system.
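The data side of this example is easy to set up: simulate a strictly stable system, normalize observations to the unit ball, and stack the last ℓ observations into the feature vector x_t, which then satisfies ∥x_t∥₂² ≤ ℓ. A sketch with arbitrary illustrative system matrices (Defensive Generation itself is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(2)
d, hidden, ell, T = 3, 4, 5, 200

# 0.9 * (orthogonal matrix) has all eigenvalues of modulus 0.9, so the
# system is strictly stable; C is an arbitrary observation matrix.
A = 0.9 * np.linalg.qr(rng.normal(size=(hidden, hidden)))[0]
C = 0.1 * rng.normal(size=(d, hidden))

z, obs = np.zeros(hidden), []
for t in range(T):
    z = A @ z + 0.05 * rng.normal(size=hidden)  # illustrative noise; the paper
    y = C @ z + 0.01 * rng.normal(size=d)       # allows adversarial bounded noise
    y = y / max(1.0, np.linalg.norm(y))         # enforce ||y_t|| <= 1 (B = 1)
    obs.append(y)

def lagged_features(obs, t, ell, d):
    """x_t = (y_{t-1}, ..., y_{t-ell}), zero-padded before time 0."""
    parts = [obs[t - k] if t - k >= 0 else np.zeros(d) for k in range(1, ell + 1)]
    return np.concatenate(parts)

X = np.stack([lagged_features(obs, t, ell, d) for t in range(T)])
# Since each of the ell blocks has norm at most 1, ||x_t||^2 <= ell.
max_sq_norm = (np.linalg.norm(X, axis=1) ** 2).max()
```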
Using arguments from the literature (e.g., Simchowitz [2021]), it is plausible one can show that, by setting ℓ on the order of O(log(T)), the means of the distributions µ_t will be as good as those produced by a Kalman filter. However, we defer a detailed investigation to future work.

Generative Models of Educational Outcomes. As a final example, we mention how the techniques we develop in this paper can provide fine-grained indistinguishability guarantees for generative models of social outcomes, like student test scores. Our algorithm can generate conditional distributions over scalar-valued outcomes that are valid even after conditioning on rich subsets of the features and examining higher-order moments.

To illustrate this, assume that at every time step t we see students with Boolean features x_t ∈ X = {±1}^n, and we want to output a probability distribution µ_t over scalar outcomes y_t ∈ [0, 1] describing their test scores on an exam. Using the polynomial kernel, we can guarantee that not just the mean, but any fixed number of moments of these distributions will be accurate over all subpopulations c(x) ⊆ {±1}^n computable by depth-r decision trees:

    (1/T) Σ_{t : c(x_t)=1} y_t^j ≈ (1/T) Σ_{t : c(x_t)=1} E_{µ_t}[ỹ_t^j]    for all j = 1, ..., 2d.

Here, c(x) is any Boolean function computable by a decision tree of depth r. The Defensive Generation algorithm guarantees that if we look at the subset of points in {±1}^n where c(x) = 1, the empirical moments of students' test scores y_t will match those of the conditional distributions µ_t.

³ Compared to the distinguishers in the previous example, which looked at each mean y_{t,i} individually, each f in this example has a much larger scale, since it looks at all of the terms simultaneously. This is reflected by the term in parentheses in (2), which increases as the number of dimensions d increases.
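The subpopulation moment check in the educational-outcomes example is straightforward to state in code. In the sketch below, c is one hypothetical depth-2 tree, the model µ_t is a placeholder Beta distribution (whose moments have a closed form), and the outcomes are drawn from µ_t itself, so the two averages should roughly agree; the paper's algorithm must of course achieve this even for adversarial outcomes, which this toy does not demonstrate.

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, max_moment = 6, 4000, 3

def c(x):
    # A hypothetical depth-2 decision tree on coordinates 0 and 2.
    return x[0] == 1 and x[2] == -1

def beta_moment(a, b, j):
    # E[y^j] for y ~ Beta(a, b), in closed form.
    m = 1.0
    for k in range(j):
        m *= (a + k) / (a + b + k)
    return m

emp = np.zeros(max_moment + 1)  # empirical sums of y^j over rounds with c(x) = 1
mod = np.zeros(max_moment + 1)  # model's conditional moments over the same rounds
for t in range(T):
    x = rng.choice([-1, 1], size=n)
    a, b = 1 + (x[1] == 1), 1 + (x[4] == 1)  # features shift the placeholder model
    y = rng.beta(a, b)                       # outcome drawn from mu_t itself
    if c(x):
        for j in range(max_moment + 1):
            emp[j] += y ** j
            mod[j] += beta_moment(a, b, j)

# Normalized by T as in the displayed guarantee; each entry should be small.
gaps = np.abs(emp - mod) / T
```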
This guarantee holds not just for any fixed tree c or moment j, but rather for all trees and moments j ∈ [2d] simultaneously. In this case, we let each distinguisher have access to the vector of moments of the distribution µ_t produced by the algorithm, p_t = (1, E_{µ_t}[y_t], ..., E_{µ_t}[y_t^{2d}]). Defensive Generation is able to produce these distributions efficiently, as we show in more technical terms in Example 2.

1.3 Related Work

The notion of outcome indistinguishability was first defined in the batch case and for binary outcomes by Dwork et al. [2021]. They proved that OI was equivalent to the influential notion of multicalibration [Hébert-Johnson et al., 2018] if the distinguishers f cannot examine the underlying computational circuits that produce the samples. Using ideas on moment multicalibration from Jung et al. [2021], these equivalences between OI and multicalibration (for the batch setting) were extended to the case of non-Bernoulli outcomes by Dwork et al. [2022].

Following (and prior to) work on OI and multicalibration in the batch setting, there has been significant interest in extending the analyses to the online setting, with the vast majority of work focused on the case of predicting (equivalently, generating) a binary outcome. Some of this work goes back to Foster and Vohra [1998], Sandroni et al. [2003], Vovk [2007], and Foster and Kakade [2006]. Following Hébert-Johnson et al. [2018], there was renewed interest in the problem [Gupta et al., 2022, Okoroafor et al., 2025]. Our work builds on that of Dwork et al. [2025], who focused on outcome indistinguishable models of binary outcomes with respect to scalar-valued RKHSs.

Our work is also closely related to work designing algorithms for high-dimensional multicalibration. In this vein, Noarov et al.
[2025] design algorithms which achieve √(T log(|F|)) regret for this problem, but require enumerating over the functions in a finite set F. Our work also builds on the idea of guaranteeing indistinguishability for richer outcome spaces and nonlinear distinguishers by converting OI to a multicalibration problem in a higher-dimensional space where nonlinear functions become linear in an expanded basis [Lu et al., 2025, Gopalan et al., 2023, Gupta et al., 2022]. We build on these results to design computationally efficient algorithms that can cope with distinguisher classes F that are infinitely large.

Lastly, our technical approach relies and expands on the philosophy of defensive forecasting, a methodology introduced by Vovk et al. [2005] for online prediction where forecasts are derived by correcting past mistakes. Perdomo and Recht [2025] provide an overview of this line of work and show how it can be used to design algorithms for online calibration, quantile regression, and loss minimization (amongst others). Our core results essentially extend the meta-algorithm discussed in Perdomo and Recht [2025] to work in high-dimensional settings. We defer a discussion of expected variational inequalities to Section 3.

2 Technical Overview

Our work builds on layers of ideas from different research areas. In this section, we provide a conceptual overview of how these relate and lead to our final result. Recall that the goal is, given features x_t, to produce a (conditional) distribution µ_t over outcomes in a set Y that withstands falsification with respect to the true revealed outcome y_t for that x_t. That is,

    f(x_t, p_t, y_t) ≈ E_{ỹ_t∼µ_t}[f(x_t, p_t, ỹ_t)].

This goal is well-defined even if the distinguishers only examine the features and the sampled outcome, f(x_t, p_t, y_t) = f(x_t, y_t).
Allowing f access to the "side information" in p_t only strengthens the indistinguishability guarantee.⁴ For certain classes of outcomes y_t, we can guarantee indistinguishability even if we provide distinguishers with a full description of the conditional distribution µ_t from which we sample ỹ_t. That is, we let p_t = µ_t. In the binary case, this conditional distribution is just a single number, Pr_{µ_t}[ỹ_t = 1]. And in the multiclass case where Y = [d], it is a point on the simplex, µ_{t,j} = Pr_{µ_t}[ỹ_t = j] for j ∈ [d]. However, for continuous outcomes, one might in principle require infinitely many parameters to specify the full conditional distribution over ỹ ∈ [−1, 1].

To address this computational issue, when dealing with real-valued outcomes y, we restrict ourselves to guaranteeing oblivious outcome indistinguishability, as defined in Dwork et al. [2022]. Here, we restrict the class of distinguishers f to those that can be written as a function of a finite-dimensional vector of statistics s(y) living in a set Z ⊂ R^d:

    f(x, p, y) = g_f(s(y), x, p).

For instance, if the distinguisher f only examines the first two moments of the distribution, then we can write f(x, p, y) = g_f(y, y², x, p), where s(y) = (y, y²) is a vector of sufficient statistics.

In many important cases, these distinguishers are in fact linear in the sufficient statistics. That is, for every f, there exists a function h_f : X × Z → R^d such that f(x, p, y) is equal to h_f(x, p)ᵀs(y) for all (x, p, y). This "linearization" is always true for the case of discrete outcomes, since the conditional distribution is a sufficient statistic for any f. If we can write f(x, p, y) = h(x, p)ᵀs(y), then outcome indistinguishability reduces to online high-dimensional multicalibration with respect to a collection of functions H that depends on F.
More precisely,

    OIGap_T = sup_{f∈F} | Σ_{t=1}^T E_{p_t}[f(x_t, p_t, y_t)] − Σ_{t=1}^T E_{p_t, ỹ_t∼µ_t}[f(x_t, p_t, ỹ_t)] |
            = sup_{h∈H} | Σ_{t=1}^T ( E_{p_t}[h(x_t, p_t)ᵀs(y_t)] − E_{p_t, µ_t}[h(x_t, p_t)ᵀs(ỹ_t)] ) |
            = sup_{h∈H} | Σ_{t=1}^T E_{p_t}[h(x_t, p_t)ᵀ(z_t − p_t)] |,

where in the last equality we let z_t = s(y_t) and set p_t = E_{µ_t}[s(ỹ_t)].

⁴ Using the terminology from Dwork et al. [2021], we operate within the sample-access formulation of OI.

Therefore, as long as we can solve for a distribution µ_t over Y such that E_{µ_t}[s(ỹ_t)] = p_t, we can reduce online outcome indistinguishability to high-dimensional multicalibration. In particular, to produce the conditional distribution µ_t, given x_t we first produce a high-dimensional multicalibrated forecast p_t such that p_t ≈ z_t = s(y_t), and then solve for µ_t.⁵

Having established this reduction, our central algorithmic contribution is a procedure that efficiently guarantees online high-dimensional multicalibration with respect to any vector-valued reproducing kernel Hilbert space H. These are rich spaces that can express functions such as all polynomials and which can be learned efficiently. We provide a brief primer for those unfamiliar. A vector-valued RKHS is a Hilbert space of functions H ⊆ {X × Z → R^d} that is equal to the closure of the set of functions of the form

    h(x, p) = Σ_{i=1}^n Γ((x_i, p_i), (x, p)) θ_i.

Here, Γ((x, p), (x′, p′)) ∈ R^{d×d} is a matrix-valued kernel, the θ_i are vectors, and the (x_i, p_i) are elements of X × Z. In particular, Γ is a symmetric function whose outputs are positive semidefinite matrices and which satisfies the reproducing property: for any h ∈ H and θ ∈ R^d,

    h(x, p)ᵀθ = ⟨h, Φ(x, p)θ⟩_H,

where Φ(x, p) = Γ(·, (x, p)) is the evaluation functional for the RKHS.
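In the finite-dimensional case, the reproducing property of a matrix-valued kernel is a one-line identity, which can be checked numerically. The feature map below is an arbitrary illustrative choice, not one from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
r, d = 7, 3  # feature dimension r, output dimension d

def Phi(u):
    # An arbitrary deterministic feature map producing an r x d matrix.
    return np.sin(np.outer(np.arange(1, r + 1), u))

def Gamma(u, v):
    # Matrix-valued kernel Gamma(u, v) = Phi(u)^T Phi(v), a d x d block.
    return Phi(u).T @ Phi(v)

theta = rng.normal(size=r)
h = lambda u: Phi(u).T @ theta  # a member of the RKHS, with h(u) in R^d

u = rng.normal(size=d)          # an input point (inputs taken in R^d for simplicity)
v = rng.normal(size=d)          # another input point
w = rng.normal(size=d)          # an arbitrary direction in R^d

lhs = h(u) @ w                  # h(u)^T w
rhs = theta @ (Phi(u) @ w)      # <h, Phi(u) w>_H: the reproducing property
sym_gap = np.abs(Gamma(u, v) - Gamma(v, u).T).max()  # Gamma is "symmetric"
eigs = np.linalg.eigvalsh(Gamma(u, u))               # and PSD on the diagonal
```

Here the RKHS inner product is just the Euclidean one on the parameter θ ∈ R^r, matching the finite-dimensional special case discussed next in the text.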
By virtue of being a Hilbert space, an RKHS has a unique inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ and induced norm $\|h\|_{\mathcal{H}} = \sqrt{\langle h, h \rangle_{\mathcal{H}}}$ that serves as an instance-dependent notion of complexity. A simple case to keep in mind is when $\Phi(x, p) \in \mathbb{R}^{r \times d}$ is a finite-dimensional feature map and $\Gamma((x, p), (x', p')) = \Phi(x, p)^\top \Phi(x', p')$. In this case, the RKHS contains the set of functions $h(x, p) = \Phi(x, p)^\top \theta$ for $\theta \in \mathbb{R}^r$, the inner product is the standard inner product in $\mathbb{R}^r$, $\langle \theta, \theta' \rangle_{\mathcal{H}} = \sum_{i=1}^r \theta_i \theta'_i$, and the norm of a function $h \in \mathcal{H}$ is $\|h\|_{\mathcal{H}} = \|\theta\|_2$.

To guarantee indistinguishability with respect to a vector-valued RKHS, we make use of ideas from defensive forecasting, an algorithmic template introduced by Vovk et al. [2005]. Defensive forecasting is a game-theoretic strategy for prediction where forecasts are derived not by prognostication, but rather by correcting past mistakes. Our main advancement here is to draw on recent breakthroughs in algorithms for solving expected variational inequalities (EVIs) in order to perform defensive forecasting efficiently in high dimensions [Zhang et al., 2025].

Expected variational inequalities have appeared in different areas under different names, including "outgoing minimax problems" [Foster and Hart, 2021] and "accuracy certificates" [Nemirovski et al., 2010]. Given a desired tolerance $\varepsilon > 0$, a convex, compact set $\mathcal{Z} \subset \mathbb{R}^d$, and an operator $S : \mathcal{Z} \to \mathbb{R}^d$, the goal of an EVI is to find a (finitely supported) distribution $\mathcal{D} \in \Delta(\mathcal{Z})$ such that
$$\mathbb{E}_{p \sim \mathcal{D}}[S(p)^\top (z - p)] \le \varepsilon \quad \text{for all } z \in \mathcal{Z}.$$
As we discuss in Section 3, these problems always admit an $O(\mathrm{poly}(d, 1/\varepsilon))$-time algorithm via a rather clean reduction to regret minimization.[6] While Nemirovski et al. [2010] proposed algorithms with $\log(1/\varepsilon)$ complexity for the limited case in which $S(\cdot)$ is a monotone

[5] Solving for $\mu_t$ is trivial in the discrete case, where $p_t = \mu_t$.
It is less obvious in the scalar case, where $p_t = (\mathbb{E}_{\mu_t}[\tilde y], \ldots, \mathbb{E}_{\mu_t}[\tilde y^d])$, but we show it can still be done by leveraging results from semi-algebraic optimization.

[6] For simplicity, throughout the discussion we suppress polynomial dependence on the diameters of $\mathcal{Z}$ and $S$.

operator, it was not until recently that these problems (and some generalizations) were shown to be solvable in $O(\mathrm{poly}(d)\log(1/\varepsilon))$ time for general operators $S(\cdot)$ [Zhang et al., 2025]. This is in stark contrast with traditional variational inequality problems ("find $p \in \mathcal{Z}$ such that $S(p)^\top (z - p) \le 0$ for all $z \in \mathcal{Z}$"), which are computationally intractable already when $S$ is a linear function and $\mathcal{Z}$ is the product of two simplices, in light of standard connections with the problem of computing Nash equilibria in bimatrix games [Chen et al., 2009, Daskalakis et al., 2009].

3 Efficient High-Dimensional Multicalibration

In this section, we present an efficient algorithm for high-dimensional multicalibration with respect to general vector-valued reproducing kernel Hilbert spaces. Throughout the section, we assume that $\mathcal{Z} \subset \mathbb{R}^d$ is a given convex body for which an efficient (i.e., responding to queries in time polynomial in $d$ and logarithmic in the desired precision) projection, linear optimization, or separation oracle is available. These oracles are all known to be efficiently reducible to each other in time polynomial in the dimension $d$, under minimal regularity assumptions that we assume are satisfied [Grötschel et al., 1993, Lee et al., 2018]. Furthermore, we will denote by
$$D := \max_{z, z' \in \mathcal{Z}} \|z - z'\|_2 \quad \text{and} \quad G := \sup \|\Gamma(\cdot, \cdot)\|_{\mathrm{op}} \tag{3}$$
the diameter of $\mathcal{Z}$ and the maximum operator norm of the matrix kernel $\Gamma$, respectively.
Finally, we assume that the matrix kernel $\Gamma$ can be evaluated in $\mathrm{poly}(d)$ time for any input, and likewise that the feature map $\Phi$ can be evaluated in $\mathrm{poly}(d, r)$ time for any input (if $r$ is finite). Given $T$ samples, our algorithm guarantees order-$\sqrt{T}$ regret in polynomial time. At the heart of the construction, our algorithm relies on recent algorithmic advances for solving expected variational inequality problems efficiently [Zhang et al., 2025]. To appreciate why EVIs are connected to multicalibration, and which operators arise in these contexts, it is instructive to look at the general template of defensive forecasting.

Defensive forecasting was introduced by Vovk et al. [2005], and can be thought of as a particular instantiation of the framework of Blackwell approachability. Related ideas have appeared throughout the literature, including Foster and Vohra [1997], Sandroni et al. [2003], Lehrer [2001], Fudenberg and Levine [1999], Foster and Hart [2021], and others. We refer the reader to Perdomo and Recht [2025] for an overview of the defensive forecasting philosophy. At each time $t$, defensive forecasting picks a distribution $\mathcal{D}_t$ so that the quantity
$$Z_T = \left\| \sum_{t=1}^T \mathbb{E}_{p_t \sim \mathcal{D}_t} \Phi(x_t, p_t)(z_t - p_t) \right\|_{\mathcal{H}}$$
is guaranteed to be sublinear (as a function of $T$) no matter how Nature selects $z_t$ (after $\mathcal{D}_t$ has been picked) and $x_t$ (before $\mathcal{D}_t$ has been picked). This is sufficient for ensuring sublinear multicalibration error, since by the reproducing property
$$\left| \sum_{t=1}^T \mathbb{E}_{p_t \sim \mathcal{D}_t} h(x_t, p_t)^\top (z_t - p_t) \right| = \left| \left\langle h,\ \sum_{t=1}^T \mathbb{E}_{p_t \sim \mathcal{D}_t} \Phi(x_t, p_t)(z_t - p_t) \right\rangle_{\mathcal{H}} \right| \le \|h\|_{\mathcal{H}}\, Z_T.$$

Algorithm 1: Efficient High-Dimensional Multicalibration via Expected VIs
1: Input: matrix-valued kernel function $\Gamma : (\mathcal{X} \times \mathcal{Z})^2 \to \mathbb{R}^{d \times d}$
2: For $t = 1, 2, \ldots, T$:
3: Observe the features $x_t$ and define $S_t : \mathcal{Z} \to \mathbb{R}^d$ as
$$S_t(p) = \Phi(x_t, p)^\top \sum_{\tau=1}^{t-1} \mathbb{E}_{p_\tau \sim \mathcal{D}_\tau} \Phi(x_\tau, p_\tau)(z_\tau - p_\tau) = \sum_{\tau=1}^{t-1} \mathbb{E}_{p_\tau \sim \mathcal{D}_\tau} \Gamma((x_t, p), (x_\tau, p_\tau))(z_\tau - p_\tau)$$
4: Predict $p_t \sim \mathcal{D}_t$, where the distribution $\mathcal{D}_t \in \Delta(\mathcal{Z})$ satisfies the expected variational inequality
$$\mathbb{E}_{p \sim \mathcal{D}_t}\big[ S_t(p)^\top (z - p) \big] \le \varepsilon_t \quad \forall z \in \mathcal{Z} \tag{EVI}$$
for some small error $\varepsilon_t$; for example, $\varepsilon_t = 1$ or $\varepsilon_t = 1/\sqrt{t}$.

Although, strictly speaking, $Z_T$ does not define a Blackwell approachability game, due to the presence of the (potentially adversarial) context $x_t$ in the otherwise bilinear objective (as a function of $\mathcal{D}_t$ and $z_t$), the same idea behind Blackwell's approachability algorithm yields a sublinear guarantee. In particular, observe the recursive expansion
$$Z_t^2 = Z_{t-1}^2 + \left\| \mathbb{E}_{p_t \sim \mathcal{D}_t} \Phi(x_t, p_t)(z_t - p_t) \right\|_{\mathcal{H}}^2 + 2\, \mathbb{E}_{p_t \sim \mathcal{D}_t}\Big[ (z_t - p_t)^\top \underbrace{\Phi(x_t, p_t)^\top \sum_{\tau=1}^{t-1} \mathbb{E}_{p_\tau \sim \mathcal{D}_\tau} \Phi(x_\tau, p_\tau)(z_\tau - p_\tau)}_{=:\, S_t(p_t)} \Big].$$
It is then clear that, as long as $\mathcal{D}_t$ is selected so that
$$\mathbb{E}_{p_t \sim \mathcal{D}_t}\big[ S_t(p_t)^\top (z - p_t) \big] \le 0 \quad \forall z \in \mathcal{Z}, \tag{4}$$
the third term can be dropped, yielding the bound
$$Z_T^2 \le \sum_{t=1}^T \left\| \mathbb{E}_{p_t \sim \mathcal{D}_t} \Phi(x_t, p_t)(z_t - p_t) \right\|_{\mathcal{H}}^2 \implies Z_T \le \sqrt{ \sum_{t=1}^T \left\| \mathbb{E}_{p_t \sim \mathcal{D}_t} \Phi(x_t, p_t)(z_t - p_t) \right\|_{\mathcal{H}}^2 }.$$
This guarantees multicalibration error growing at a $\sqrt{T}$ rate, with constants depending on $D$, $G$, and the norm of the functions in the RKHS. We summarize the procedure in Algorithm 1, incorporating the possibility of error $\varepsilon_t$ on the right-hand side of (4). From the above discussion, we immediately derive the following guarantee.

Proposition 1. Let $\Gamma$ be a matrix-valued kernel with associated RKHS $\mathcal{H} \subseteq \{\mathcal{X} \times \mathcal{Z} \to \mathbb{R}^d\}$. If the predictions $p_t \sim \mathcal{D}_t$ are made according to Algorithm 1, then for all $T \ge 1$ and $h \in \mathcal{H}$:
$$\left| \sum_{t=1}^T \mathbb{E}_{p_t \sim \mathcal{D}_t} h(x_t, p_t)^\top (z_t - p_t) \right| \le \|h\|_{\mathcal{H}} \sqrt{ T D^2 G + \sum_{t=1}^T \varepsilon_t }.$$
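As a sanity check (illustrative, not from the paper), the two expressions for the operator $S_t$ in Algorithm 1, one written through the feature map $\Phi$ and one purely through kernel evaluations $\Gamma$, can be verified to agree numerically for a hypothetical finite-dimensional feature map, taking each $\mathcal{D}_\tau$ to be a point mass for simplicity:

```python
import numpy as np

rng = np.random.default_rng(1)
r, d, n_in, t = 6, 3, 5, 4             # illustrative sizes; x has n_in - d coords
A = rng.standard_normal((r, d, n_in))

def Phi(x, p):
    # Hypothetical finite-dimensional feature map in R^{r x d}.
    return A @ np.concatenate([x, p])

def Gamma(x, p, x2, p2):
    # Matrix-valued kernel induced by the feature map: Gamma = Phi^T Phi'.
    return Phi(x, p).T @ Phi(x2, p2)

# A short fake history (point-mass D_tau): contexts, forecasts, and statistics.
xs = rng.standard_normal((t, n_in - d))
ps = rng.standard_normal((t, d))
zs = rng.standard_normal((t, d))

x_t, p = rng.standard_normal(n_in - d), rng.standard_normal(d)

# Phi-form: project the accumulated RKHS error vector through Phi(x_t, p).
err = sum(Phi(xs[tau], ps[tau]) @ (zs[tau] - ps[tau]) for tau in range(t))
S_phi = Phi(x_t, p).T @ err

# Gamma-form: the same operator written purely via kernel evaluations.
S_gamma = sum(Gamma(x_t, p, xs[tau], ps[tau]) @ (zs[tau] - ps[tau])
              for tau in range(t))

print(np.allclose(S_phi, S_gamma))     # True
```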
Hence, as long as $\sum_{t=1}^T \varepsilon_t = O(T)$, Algorithm 1 guarantees order-$O(\sqrt{T})$ multicalibration error.

Efficiently solving the EVI. The key (and effectively only) step in Algorithm 1 is constructing a distribution $\mathcal{D}_t$ over $\mathcal{Z}$ that solves the expected variational inequality problem (EVI). As mentioned in Section 1.3, a solution of an EVI can always be computed efficiently for any general, even discontinuous, bounded VI operator (in our case, $S_t$), given oracle access—for example, a membership, linear optimization, or separation oracle—to the convex compact domain (in our case, $\mathcal{Z}$). In the rest of this section, we sketch two known general approaches for solving EVIs, with runtimes scaling polynomially and polylogarithmically in the desired inverse precision $1/\varepsilon_t$, respectively.

To start, a distribution $\mathcal{D}_t$ such that (EVI) holds can be found in $\mathrm{poly}(d, D, G, 1/\varepsilon_t)$ time by no-regret algorithms (see also Zhang et al. [2025], Farina [2026]). In particular, consider setting up any no-regret learning algorithm that outputs at every time $k$ a point $p_k \in \mathcal{Z}$ and receives as utilities the linear functions $p \mapsto S_t(p_k)^\top p$.[7] Then, expanding the definition of what it means to have no regret over the course of an arbitrary number $K$ of iterations, one has, for all $z \in \mathcal{Z}$,
$$\frac{1}{K} \sum_{k=1}^K S_t(p_k)^\top (z - p_k) \le \frac{\mathrm{Regret}_K}{K} = o(1),$$
implying that the uniform distribution $\mathcal{D}_t$ over the set of iterates $\{p_1, \ldots, p_K\}$ produced by the no-regret algorithm converges to an arbitrarily good solution of the expected variational inequality as the number of iterations $K$ grows. By using known regret bounds for online projected gradient ascent, setting the learning rate appropriately, and accounting for the fact that the diameter of the image of $S_t$ is of order $tGD$, we obtain the following rate (see Appendix A.1 for the derivation).
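The no-regret reduction above can be sketched in a few lines. Below, a hypothetical bounded (and not necessarily monotone) operator on the box $[0,1]^d$ is handed to online projected gradient ascent, and the EVI gap of the uniform distribution over the iterates is checked against the standard $DG/\sqrt{K}$ average-regret bound; the operator, constants, and sizes are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d, K = 3, 20000
lo, hi = np.zeros(d), np.ones(d)           # Z = [0,1]^d, a simple convex body

M = rng.standard_normal((d, d))
b = rng.standard_normal(d)

def S(p):
    # An arbitrary bounded (not necessarily monotone) operator on Z.
    return np.tanh(M @ p + b)

# Online projected gradient ascent on the linear utilities p -> S(p_k)^T p.
D = np.sqrt(d)                             # diameter of [0,1]^d
G = np.sqrt(d)                             # |tanh| <= 1, so ||S(p)||_2 <= sqrt(d)
eta = D / (G * np.sqrt(K))
p = np.full(d, 0.5)
iterates, values = [], []
for _ in range(K):
    g = S(p)
    iterates.append(p.copy()); values.append(g @ p)
    p = np.clip(p + eta * g, lo, hi)       # ascent step + projection onto the box

# EVI gap of the uniform distribution over the iterates:
#   max_z  mean_k S(p_k)^T (z - p_k),  with the max over the box in closed form.
g_bar = np.mean([S(q) for q in iterates], axis=0)
gap = np.sum(np.maximum(g_bar, 0.0) * hi) - np.mean(values)
print(gap <= D * G / np.sqrt(K) + 1e-9)    # True: within the OGD regret bound
```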
Proposition 2. There exists a no-regret-based implementation of Algorithm 1 that achieves average multicalibration error $\big|\frac{1}{T}\sum_{t=1}^T \mathbb{E}_{p_t \sim \mathcal{D}_t} h(x_t, p_t)^\top (z_t - p_t)\big| \le \varepsilon$ in time $O(\mathrm{poly}(d, D, G) \cdot \min\{r, \varepsilon^{-6}\} \cdot \varepsilon^{-6})$. In particular, the runtime is never worse than $O(\mathrm{poly}(d, D, G) \cdot \varepsilon^{-12})$.

Significantly faster approaches, solving EVIs (and even harder generalizations) at $\log(1/\varepsilon)$ rates, have very recently been developed by leveraging a new constructive version of the minimax theorem [Farina and Pipis, 2024, Zhang et al., 2025, Farina, 2026]. These methods are based on the ellipsoid method, and can solve EVIs to error $\varepsilon_t$ with only order $\log(1/\varepsilon_t)$ evaluations of the operator $S_t$. We include a high-level, self-contained description of these more advanced algorithms in Appendix A.2. Here, we only state the final result; the proof is deferred to Appendix A.3.

Proposition 3. There exists an ellipsoid-based implementation of Algorithm 1 that achieves average multicalibration error $\big|\frac{1}{T}\sum_{t=1}^T \mathbb{E}_{p_t \sim \mathcal{D}_t} h(x_t, p_t)^\top (z_t - p_t)\big| \le \varepsilon$ in time $\tilde O(\mathrm{poly}(d, D, G) \cdot \min\{r, \varepsilon^{-2}\} \cdot \varepsilon^{-2})$. In particular, the runtime is never worse than $\tilde O(\mathrm{poly}(d, D, G) \cdot \varepsilon^{-4})$, and it is of order $\tilde O(\mathrm{poly}(d, D, G) \cdot \varepsilon^{-2})$ if the number of features $r$ is polynomial in $d$, $D$, and $G$.

4 Online Outcome Indistinguishable Generative Models

In this section, we illustrate how to use the high-dimensional multicalibration algorithm from Section 3 to produce online outcome indistinguishable generative models for multiclass, scalar-valued, and vector-valued outcomes. We assume throughout that $\varepsilon_t = GD^2$ in Algorithm 2.

[7] We use $k$ to indicate the iteration of the no-regret algorithm, to avoid confusion with the scalar kernel $k$ used in Section 4.
An EVI (EVI) must be solved at every time $t$ in Algorithm 1, and we are proposing an iterative algorithm, indexed over iterations $k$, to solve such an EVI at that time $t$.

Algorithm 2: The Defensive Generation Algorithm
1: Input: matrix-valued kernel $\Gamma$, outcome and prediction sets $\mathcal{Y}$ and $\mathcal{Z}$, function $s : \mathcal{Y} \to \mathbb{R}^d$
2: For $t = 1, \ldots, T$:
3: Receive features $x_t$
4: Produce the predicted statistic $p_t \in \mathcal{Z}$ via defensive forecasting (Algorithm 1) to error $\varepsilon_t$
5: Solve for a distribution $\mu_t$ such that $\mathbb{E}_{\tilde y \sim \mu_t}[s(\tilde y_t)] = p_t$, and output the measure $\mu_t$
6: Record the true outcome $y_t \in \mathcal{Y}$ and statistic $s(y_t) = z_t$

Algorithm description. As discussed in Section 2, the Defensive Generation algorithm guarantees indistinguishability with respect to functions $f$ that can be written as $f(x, p, y) = h(x, p)^\top s(y)$, where $h$ belongs to a vector-valued RKHS $\mathcal{H}$. The procedure takes as input the outcome and prediction sets $\mathcal{Y}$ and $\mathcal{Z} \subset \mathbb{R}^d$, as well as the function $s : \mathcal{Y} \to \mathcal{Z}$, which together define the generative modeling task. It also takes as input the matrix-valued kernel function $\Gamma : (\mathcal{X} \times \mathcal{Z})^2 \to \mathbb{R}^{d \times d}$ that implicitly defines the RKHS $\mathcal{H}$.

Defensive Generation is a simple wrapper around the online multicalibration routine (Algorithm 1). Given $x_t$, it appeals to the defensive forecasting procedure to find $p_t$, a multicalibrated forecast of $z_t = s(y_t)$. Then, it solves for a distribution $\mu_t$ such that $\mathbb{E}_{\mu_t}[s(\tilde y_t)] = p_t$. To specialize it to different outcome spaces $\mathcal{Y}$, we need to ensure that we can (1) implement efficient oracle access (e.g., separation) for the set $\mathcal{Z}$, as discussed in Section 3, and (2) solve for $\mu_t$. The rest of this section shows how to do both of these things for different choices of $\mathcal{Z}$ and $\mathcal{Y}$. Before discussing these adaptations, we state the end-to-end guarantee of the algorithm:

Theorem 3.
Fix an outcome space $\mathcal{Y}$ and a function $s : \mathcal{Y} \to \mathcal{Z}$, and let $\mathcal{F}$ be a class of distinguishers $f$ that can be written as $f(x, p, y) = h_f(x, p)^\top s(y)$ for a function $h_f$ in a vector-valued RKHS $\mathcal{H}$. Given access to a sampling oracle that on input $z$ returns a probability measure $\mu$ such that $\mathbb{E}_{\tilde y \sim \mu}[s(\tilde y)] = z$, and a separation oracle for the set $\mathcal{Z}$,[8] the Defensive Generation algorithm with $\varepsilon_t = D^2 G$, where $D$ and $G$ are as in (3), guarantees online outcome indistinguishability with respect to $\mathcal{F}$. More precisely, for any $f \in \mathcal{F}$,
$$\mathrm{OIGap}_T(f) \le \|h_f\|_{\mathcal{H}} \sqrt{2 D^2 G T}.$$

[8] Given $z'$, a separation oracle for the set $\mathcal{Z}$ returns true if $z' \in \mathcal{Z}$. If $z' \notin \mathcal{Z}$, it returns a hyperplane $(w, b)$ such that $w^\top z' > b$ but $w^\top z < b$ for all $z \in \mathcal{Z}$.

4.1 Generative Models for Multiclass Outcomes

In this subsection, we specialize the main theorem to the case where $\mathcal{Y}$ consists of $d$ labels. In this setting, our results simplify substantially, and we can guarantee indistinguishability with respect to distinguishers that examine the entire conditional distribution from which we are sampling. That is, we let $p = \mu$, where $\mu$ is a point on the simplex,
$$\mu \in \mathcal{Z} = \Delta_d = \Big\{ p : p_j \ge 0,\ \sum_{j=1}^d p_j = 1 \Big\}, \quad \text{and} \quad \tilde y \sim \mu.$$

In more detail, if $\mathcal{Y} = \{1, \ldots, d\}$, then any function $f(x, p, y)$ can be written as
$$f(x, p, y) = \sum_{j=1}^d f(x, p, j) \cdot \mathbb{1}\{y = j\}.$$
Therefore, any distinguisher $f$ can be expressed as a linear function of its values at $d$ points: $f(x, p, y) = h_f(x, p)^\top s(y)$, where $h_f(x, p) = (f(x, p, 1), \ldots, f(x, p, d))$ and $s(y) = (\mathbb{1}\{y = 1\}, \ldots, \mathbb{1}\{y = d\}) \in \mathcal{Z}$ is a one-hot encoding of the outcome $y$. Moreover, in this multiclass case, there is a simple formula we can use to construct function spaces that contain these functions $h_f$.
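A quick check (illustrative) that the one-hot linearization recovers $f$ exactly in the multiclass case:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4

def one_hot(y, d):
    # s(y): one-hot encoding of a label y in {0, ..., d-1}.
    s = np.zeros(d)
    s[y] = 1.0
    return s

# A hypothetical distinguisher given as a table of its values f(x, p, j) at the
# current (x, p); in the multiclass case, any f is determined by these d values.
f_values = rng.standard_normal(d)      # (f(x,p,1), ..., f(x,p,d)) = h_f(x, p)

for y in range(d):
    # f(x, p, y) equals h_f(x, p)^T s(y), with s(y) the one-hot encoding of y.
    assert np.isclose(f_values[y], f_values @ one_hot(y, d))
print("linearization holds for all", d, "labels")
```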
In particular, assuming there is a scalar-valued RKHS that contains each of the scalar-valued functions $f(x, p, i)$ for $i \in [d]$, we can always construct a vector-valued RKHS that contains the function $h_f(x, p)$ from above.

Fact 1 (Alvarez et al. [2012]). Let $k((x,p),(x',p')) \in \mathbb{R}$ be a scalar-valued kernel with RKHS $\mathcal{H}_{\mathrm{scalar}} \subset \{\mathcal{X} \times \mathcal{Z} \to \mathbb{R}\}$. Then, $\Gamma((x,p),(x',p')) = I_d \cdot k((x,p),(x',p')) \in \mathbb{R}^{d \times d}$ is a matrix-valued kernel whose corresponding vector-valued RKHS $\mathcal{H}^d$ consists of all functions of the form
$$h(x, p) = (h_1(x, p), \ldots, h_d(x, p))^\top \quad \text{where } h_j(x, p) \in \mathcal{H}_{\mathrm{scalar}} \text{ for all } j \in [d].$$
Furthermore, for any function $h \in \mathcal{H}^d$, its RKHS norm satisfies $\|h\|_{\mathcal{H}^d}^2 = \sum_{j=1}^d \|h_j\|_{\mathcal{H}_{\mathrm{scalar}}}^2$.

Here, there is no need to implement a sampling oracle, since the algorithm directly solves for the distribution $p = \mu$ as part of the multicalibration subroutine. Furthermore, it is simple to check whether any point is in the simplex $\mathcal{Z} = \{p : p_i \ge 0, \sum_{i=1}^d p_i = 1\}$, and one can easily implement a separation oracle. Tying these facts together with Theorem 3, we get the following corollary:[9]

Corollary 1. Set $\mathcal{Y} = [d]$ and define $\mathcal{Z}$ to be the simplex in $\mathbb{R}^d$, $\mathcal{Z} = \{p : p_j \ge 0, \sum_{j=1}^d p_j = 1\}$. Let $k$ be a kernel such that $\sup_{x,p} |k((x,p),(x,p))| \le G$, with an associated scalar-valued RKHS $\mathcal{H}_{\mathrm{scalar}}$ such that $f(x,p,j) = h_j(x,p) \in \mathcal{H}_{\mathrm{scalar}}$ for all $j \in [d]$. Then, Defensive Generation with kernel $\Gamma = I_d \cdot k$ produces distributions $\mu_t$ over outcomes $\mathcal{Y}$ that are online outcome indistinguishable with respect to the set of functions $f(x, \mu, y)$ such that $h_j(x, \mu) = f(x, \mu, j)$ is in $\mathcal{H}_{\mathrm{scalar}}$ for every $j \in [d]$. In particular, for every such $f$,
$$\mathrm{OIGap}_T(f) \le 4 \Big( \sum_{j=1}^d \|h_j\|_{\mathcal{H}_{\mathrm{scalar}}} \Big) \sqrt{T G}.$$
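The separation oracle for the simplex mentioned above admits a direct implementation. The sketch below follows the convention of footnote 8, returning a violated constraint as the separating hyperplane (with a non-strict inequality on the simplex side):

```python
import numpy as np

def simplex_separation_oracle(z, tol=1e-9):
    """Separation oracle for the simplex {p : p >= 0, sum(p) = 1}.
    Returns (True, None) if z is (approximately) in the simplex; otherwise
    returns (False, (w, b)) with w^T z > b and w^T p <= b for all simplex p."""
    d = len(z)
    j = int(np.argmin(z))
    if z[j] < -tol:                        # violated nonnegativity constraint
        w = np.zeros(d)
        w[j] = -1.0                        # w^T p = -p_j <= 0 on the simplex
        return False, (w, 0.0)
    s = float(np.sum(z))
    if s > 1 + tol:                        # total mass too large
        return False, (np.ones(d), 1.0)    # w^T p = sum(p) <= 1 on the simplex
    if s < 1 - tol:                        # total mass too small
        return False, (-np.ones(d), -1.0)  # w^T p = -sum(p) <= -1 on the simplex
    return True, None

inside, _ = simplex_separation_oracle(np.array([0.2, 0.3, 0.5]))
outside, (w, b) = simplex_separation_oracle(np.array([0.6, 0.6, 0.0]))
print(inside, outside, w @ np.array([0.6, 0.6, 0.0]) > b)  # True False True
```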
We can further specialize this statement to guarantee indistinguishability with respect to common function classes, as in the following example:

Example 1. Let $\mathcal{Y} = [d]$, $\mathcal{X} = \{x : \|x\|_2 \le 1\} \subset \mathbb{R}^r$, and set $k((x,p),(x',p')) = \langle x, x' \rangle + \langle p, p' \rangle$. Running Defensive Generation with $\Gamma((x,p),(x',p')) = I_d \cdot k((x,p),(x',p'))$ guarantees online outcome indistinguishability with respect to the set of functions
$$\mathcal{F} = \{ f(x,p,y) : f(x,p,j) = \langle x, \theta_j \rangle + \langle p, \theta'_j \rangle,\ (\theta_j, \theta'_j) \in \mathbb{R}^r \times \mathbb{R}^d \text{ for all } j \in [d] \}.$$
In particular, if we restrict $\|\theta_j\|_2 + \|\theta'_j\|_2 \le B$ for every $j \in [d]$, then $\mathrm{OIGap}_T(f)$ is at most $4dB\sqrt{T}$ for any $f$. Furthermore, at time $t$ the algorithm runs in time $O(\mathrm{poly}(d, B) \log(t))$.

[9] We use $\mu_t$ and $p_t$ interchangeably in Corollary 1 since they coincide in this multiclass setting.

4.2 Generative Models for Scalar-Valued Outcomes

As a second application, we show how Defensive Generation can be used to produce distributions $\mu_t$ over scalar outcomes $y_t$ that are outcome indistinguishable from the perspective of any $f$ that examines higher-order moments of the distribution, up to some fixed degree $2d$. In symbols, we focus on the setting in which $\mathcal{Y} = [-1, 1]$, $s(y) = (1, y, y^2, \ldots, y^{2d})$, and $\mathcal{Z}$ is the (convex) set of moments
$$\mathcal{Z} := \left\{ \left( \int_{[-1,1]} y^0 \, d\mu, \ \ldots, \ \int_{[-1,1]} y^{2d} \, d\mu \right) : \mu \text{ is a Borel probability measure on } [-1,1] \right\}. \tag{5}$$
To extend our results to this setting, we leverage techniques from semi-algebraic optimization. As we discuss in Section B, $\mathcal{Z}$ admits an efficient separation oracle, thus satisfying the preconditions of our construction in Section 3 for efficiently producing defensive forecasts $p_t \in \mathcal{Z}$.
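A membership test for the moment set $\mathcal{Z}$ in (5) can be sketched via the classical truncated Hausdorff moment conditions on an interval (a Hankel matrix and a localizing matrix, both positive semidefinite). This covers only the membership half of the oracle; the full separation oracle is the subject of the paper's Section B. A sketch:

```python
import numpy as np

def in_moment_set(m, tol=1e-9):
    """Check membership of m = (m_0, ..., m_{2d}) in the set of moment vectors
    of probability measures on [-1, 1], via the classical truncated Hausdorff
    conditions: the Hankel matrix (m_{i+j})_{i,j<=d} and the localizing matrix
    (m_{i+j} - m_{i+j+2})_{i,j<=d-1} must both be positive semidefinite."""
    n = len(m)
    assert n % 2 == 1, "need moments up to an even degree 2d"
    d = (n - 1) // 2
    if abs(m[0] - 1.0) > tol:
        return False                       # probability measure: m_0 = 1
    H = np.array([[m[i + j] for j in range(d + 1)] for i in range(d + 1)])
    N = np.array([[m[i + j] - m[i + j + 2] for j in range(d)] for i in range(d)])
    psd = lambda A: np.min(np.linalg.eigvalsh(A)) >= -tol
    return psd(H) and (d == 0 or psd(N))

# Moments of the uniform distribution on [-1, 1]: m_k = (1/2) * int y^k dy.
uniform = [1.0, 0.0, 1/3, 0.0, 1/5]
bogus = [1.0, 0.0, 2.0, 0.0, 5.0]          # a second moment > 1 is impossible
print(in_moment_set(uniform), in_moment_set(bogus))  # True False
```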
Furthermore, once a moment vector $p_t \in \mathcal{Z}$ is selected, it is possible to efficiently construct a discrete distribution $\mu_t$ on $[-1, 1]$ whose moments match those specified by $p_t$. By sampling $\tilde y_t \sim \mu_t$, we complete the round trip and produce predictions that are indistinguishable with respect to tests that only examine a fixed number of moments. Tying these ideas together, we get the following corollary:

Corollary 2. Set $\mathcal{Y} = [-1, 1]$ and define $\mathcal{Z} \subset \mathbb{R}^{2d+1}$ to be the set of moments as in Equation (5). Let $\Gamma$ be a matrix-valued kernel with associated vector-valued RKHS $\mathcal{H} \subseteq \{\mathcal{X} \times \mathcal{Z} \to \mathbb{R}^d\}$, and let $\mathcal{F}$ be the set of functions
$$\mathcal{F} := \{ f(x,p,y) : f(x,p,y) = h(x,p)^\top s(y) \text{ where } s(y) = (1, y, \ldots, y^{2d}) \}.$$
The Defensive Generation algorithm with the sampling and separation oracles from Section B produces distributions $\mu_t$ and statistics $p_t^\top = (1, \mathbb{E}_{\mu_t}[\tilde y_t], \ldots, \mathbb{E}_{\mu_t}[\tilde y_t^{2d}])$ such that for every $f \in \mathcal{F}$,
$$\mathrm{OIGap}_T(f) \le \|h\|_{\mathcal{H}} \sqrt{ \sum_{t=1}^T \mathbb{E}_{p_t} (p_t - s(y_t))^\top \Gamma((x_t, p_t), (x_t, p_t)) (p_t - s(y_t)) + G D^2 T }.$$

Using this result, we can generate distributions over scalar outcomes that fool all low-degree tests. That is, we can use the Defensive Generation algorithm to produce distributions $\mu_t$ over scalar outcomes in $[-1, 1]$ such that all low-degree moments are conditionally valid over all subsets of the Boolean hypercube computable by low-depth decision trees.

Example 2. Let $\mathcal{X} = \{\pm 1\}^n$ be the Boolean hypercube, and consider the polynomial kernel $k_{\mathrm{poly}}(x, x') = (1 + x^\top x')^r$. It is a well-known result that the RKHS $\mathcal{H}_{\mathrm{poly}}$ for this kernel contains all Boolean functions $c : \{\pm 1\}^n \to \{0, 1\}$ computable by decision trees of depth at most $r$.[10]

[10] See Proposition 4.12 in Dwork et al. [2025] for a proof.
As per O'Donnell [2014], a decision tree is a representation of a Boolean function as a rooted binary tree in which the internal nodes are labeled by coordinates $i \in [n]$, the outgoing edges are labeled by $-1$ and $1$, and the leaves carry real numbers corresponding to the outputs.

Moreover, by Fact 1, the vector-valued RKHS of the matrix-valued kernel $\Gamma((x,p),(x',p')) = I_d \cdot k_{\mathrm{poly}}(x, x')$ contains all functions $h(x) = (c_1(x), \ldots, c_{2d}(x))$, where each $c_j$ is a Boolean function in $\mathcal{H}_{\mathrm{poly}}$. Consequently, if we consider the space of functions $\mathcal{F}$ that can be written as
$$f(x, p, y) = h(x)^\top s(y) = \sum_{j=1}^{2d} c_j(x)\, y^j, \tag{6}$$
this space contains all tests of the form $f(x, p, y) = \mathbb{1}\{c(x) = 1\} \cdot y^j$. Therefore, the Defensive Generation algorithm guarantees that for any $j = 1, \ldots, 2d$ and all low-degree Boolean functions $c$,
$$\lim_{T \to \infty} \left| \frac{1}{T} \sum_{t=1}^T \mathbb{1}\{c(x_t) = 1\}\, y_t^j - \frac{1}{T} \sum_{t=1}^T \mathbb{1}\{c(x_t) = 1\}\, \mathbb{E}_{\mu_t}[\tilde y_t^j] \right| = 0. \tag{7}$$
That is, the first $2d$ moments of the conditional distributions we produce (online) match those of the revealed outcomes. Furthermore, this holds conditionally over all the subsequences $\{x_t : c(x_t) = 1,\ c \text{ a decision tree of depth } r\}$. More formally, for all functions $f$ of the form in Equation (6),
$$\mathrm{OIGap}_T(f) \le O(\mathrm{poly}(d) \cdot n^r \cdot \sqrt{T}).$$

While Dwork et al. [2025] achieved the guarantee in Equation (7) for the case $j = 1$, our algorithm produces distributions which (simultaneously) satisfy the guarantee for all $j = 1, \ldots, 2d$. This result is also closely related to the guarantees in Gupta et al. [2022]. Their results hold for higher-order moments, but only with respect to a finite class of distinguishers.
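Returning to the sampling step of Defensive Generation, solving for $\mu_t$ from a target moment vector has a closed form in the simplest case of matching only the first two moments (essentially the scalar instance of the later Algorithm 3); the general degree-$2d$ case requires the semi-algebraic machinery referenced in the text. A sketch of the two-moment case:

```python
import numpy as np

def match_two_moments(m1, m2):
    """Return (support, weights) of a discrete distribution on [-1, 1] whose
    mean is m1 and second moment is m2. Requires m1^2 <= m2 <= 1. This is a
    small closed-form sketch; matching a full vector of 2d moments needs the
    semi-algebraic machinery referenced in the text."""
    var = m2 - m1 ** 2
    assert var >= 0 and m2 <= 1, "not a valid moment pair on [-1, 1]"
    if var == 0:
        return np.array([m1]), np.array([1.0])     # degenerate point mass
    # Put mass at the endpoints -1 and 1, plus the remainder at the mean m1.
    lam_plus = var / (2 * (1 - m1))
    lam_minus = var / (2 * (1 + m1))
    support = np.array([1.0, -1.0, m1])
    weights = np.array([lam_plus, lam_minus, 1 - lam_plus - lam_minus])
    return support, weights

support, weights = match_two_moments(0.3, 0.5)
assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
print(np.isclose(weights @ support, 0.3),          # mean matches
      np.isclose(weights @ support ** 2, 0.5))     # second moment matches
```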
We note that one interesting consequence of producing distributions with these higher-order validity guarantees is that one can use them to derive high-probability prediction intervals via Chebyshev's inequality. We refer the reader to Gupta et al. [2022] for further details.

4.3 Generative Models for High-Dimensional Outcomes

As a final application, we consider guaranteeing outcome indistinguishability when outcomes are high-dimensional, $\mathcal{Y} \subset \mathbb{R}^d$.

Indistinguishability of conditional expectations. To start, we show that we can always guarantee online outcome indistinguishability when the distinguishers $f$ are linear functions of the first moment of $y$: $f(x, p, y) = h(x, p)^\top y$ and $s(y) = y$. That is, we guarantee that the distributions produced by the learner have first moments that are conditionally correct, as measured by the functions $h$. For this setting, we let $\mathcal{Z} = \mathcal{Y} \subset \mathbb{R}^d$, which is assumed to be a compact, convex set with diameter bounded by $D$ and an efficient separation oracle (e.g., $\mathcal{Y} = [-1, 1]^d$). Since $s$ is just the identity function, the problem of finding a distribution $\mu_t$ such that $\mathbb{E}_{\tilde y \sim \mu_t}[s(\tilde y_t)] = \mathbb{E}_{\tilde y \sim \mu_t}[\tilde y_t] = p_t$ is easy: just return a point-mass distribution at $p_t$. Here, we have the following corollary:

Corollary 3. Let $\Gamma$ be a matrix-valued kernel with operator norm uniformly bounded by $G$ and with vector-valued RKHS $\mathcal{H} \subseteq \{\mathcal{X} \times \mathcal{Z} \to \mathbb{R}^d\}$, where $\mathcal{Z}$ has diameter bounded by $D$. Then, the Defensive Generation algorithm with kernel $\Gamma$ guarantees online OI with respect to the set of functions $f(x, p, y) = h(x, p)^\top y$ with $h(x, p) \in \mathcal{H}$. For any $f \in \mathcal{F}$,
$$\mathrm{OIGap}_T(f) \le \|h\|_{\mathcal{H}} \sqrt{2 T D^2 G}.$$

As a concrete instantiation of this result, we can guarantee indistinguishability with respect to the set of functions
$$\mathcal{F} = \{ f(x,p,y) = h(x,p)^\top y \ \text{where}\ h(x,p) = \Phi(x,p)^\top \theta \ \text{for}\ \theta \in \mathbb{R}^r \}.$$
Here, $\Phi(x, p) \in \mathbb{R}^{r \times d}$ is a finite-dimensional feature map, and $h(x, p)$ is any function that lies in the span of these features. For instance, if we let
$$\Phi(x, p)^\top = \big[\, g_1(x, p) \mid \cdots \mid g_r(x, p) \,\big] \in \mathbb{R}^{d \times r}, \tag{8}$$
where $g_1, \ldots, g_r$ is an arbitrary set of $r$ functions, then $\mathcal{H} = \mathrm{span}(\{g_1, \ldots, g_r\})$. This is just one choice of an explicit feature map; one could of course consider others.

Example 3. Let $\mathcal{Y} = \mathcal{Z}$ be a compact subset of $\mathbb{R}^d$ with diameter $D$, and let $\Phi : \mathcal{X} \times \mathcal{Z} \to \mathbb{R}^{r \times d}$ be a feature map. The Defensive Generation algorithm with matrix-valued kernel $\Gamma((x,p),(x',p')) = \Phi(x,p)^\top \Phi(x',p')$ guarantees online outcome indistinguishability with respect to the set of functions $f(x,p,y) = h(x,p)^\top y$ where $h(x,p) = \Phi(x,p)^\top \theta$. In particular, for every such $h$,
$$\mathrm{OIGap}_T(f) \le \|\theta\|_2 \sqrt{ 2 D^2 T \cdot \sup_{x,p} \|\Phi(x, p)\|_{\mathrm{op}}^2 }.$$
Therefore, if we set $\Phi$ as in (8) for functions $g_1, \ldots, g_r$ with $\max_{1 \le j \le r} \|g_j(x,p)\|_2 \le B$, then for any $f(x,p,y) = g_j(x,p)^\top y$ we have $\mathrm{OIGap}_T(f) \le \sqrt{2 r B^2 D^2 T}$.

Indistinguishability with respect to higher-order moments. In general, the defensive generation approach described so far can guarantee efficient online outcome indistinguishability with respect to degree-$d$ moments whenever the set of such moments admits an efficient separation oracle. This condition is equivalent to the existence of efficient algorithms for minimizing polynomials of degree up to $d$ on $\mathcal{Y}$. As an illustration, we show that on the high-dimensional ball $\mathcal{Y} = \{y \in \mathbb{R}^d : \|y\|_2 \le 1\}$, we can guarantee indistinguishability with respect to functions $f$ that examine the first two moments of the distributions $\mu_t$: $f(x,p,y) = h(x,p)^\top s(y)$, where
$$s(y) = (y_1, \ldots, y_d,\ y_1^2, y_1 y_2, \ldots, y_i y_j) \quad \text{for } i, j \in [d],\ i \le j. \tag{9}$$
In other words, we guarantee that the distributions $\mu_t$ produced online have high-dimensional means, $\mathbb{E}_{\mu_t}[\tilde y_t]$, and covariances, $\mathbb{E}_{\mu_t}[\tilde y_t \tilde y_t^\top] \in \mathbb{R}^{d \times d}$, that are conditionally correct as measured by $h$. In this setting, the set of moments $\mathcal{Z}$ is
$$\mathcal{Z} = \{ (v, Q) : \exists \text{ a probability measure } \mu \text{ over } \|y\|_2 \le 1 \text{ such that } \mathbb{E}_\mu[y] = v,\ \mathbb{E}_\mu[y y^\top] = Q \}. \tag{10}$$
Since quadratic functions on the unit ball can always be optimized efficiently (by performing a singular value decomposition), a separation oracle for $\mathcal{Z}$ can always be constructed efficiently. In fact, in this case we do not even need to resort to complicated separation oracle constructions: as it turns out, the set $\mathcal{Z}$ admits a simple characterization. Indeed, an application of the S-lemma [Yakubovich, 1971] lets us write
$$\mathcal{Z} = \{ (v, Q) : Q \succeq 0,\ Q \succeq v v^\top,\ \mathrm{Tr}[Q] \le 1 \}, \tag{11}$$
where $\mathrm{Tr}[Q]$ denotes the trace of $Q$ and $A \succeq B$ denotes that $A - B$ is positive semidefinite. (A more constructive proof of (11), via some intermediate results that will be used later in this section, is available in Appendix C.) From this rewriting, it is straightforward to efficiently check whether any given proposed moments $(v', Q')$ belong to $\mathcal{Z}$, or to return a separating hyperplane (violated constraint) otherwise.

Algorithm 3: Backfitting a probability measure $\mu$. (We use $\delta(y)$ to denote a point mass at a point $y$.)
1: Input: PSD matrix $Q$ and vector $v \in \mathbb{R}^d$ such that $Q \succeq v v^\top$
2: Compute the SVD of $Q - v v^\top = \sum_{i=1}^d \sigma_i u_i u_i^\top$
3: For $i = 1, \ldots, d$:
4: Solve for the two real-valued roots $t_i^+ > 0$ and $t_i^- < 0$ that satisfy $\|v + t \cdot u_i\| = 1$
5: Define $y_i^+ = v + t_i^+ u_i$, $y_i^- = v + t_i^- u_i$, and $\lambda_i^+ = \frac{\sigma_i}{t_i^+ (t_i^+ - t_i^-)}$, $\lambda_i^- = \frac{\sigma_i}{t_i^- (t_i^- - t_i^+)}$
6: Return $\mu = \lambda_0 \delta(v) + \sum_{i=1}^d [\lambda_i^+ \delta(y_i^+) + \lambda_i^- \delta(y_i^-)]$, where $\lambda_0 = 1 - \sum_{i=1}^d (\lambda_i^+ + \lambda_i^-)$
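Algorithm 3 translates almost line for line into code. The sketch below uses an eigendecomposition for the SVD of the PSD matrix $Q - vv^\top$ and checks the moment constraints on a small example ($d = 2$, illustrative inputs):

```python
import numpy as np

def backfit_measure(v, Q, tol=1e-12):
    """Sketch of Algorithm 3: given (v, Q) with Q >= v v^T (PSD order),
    Tr[Q] <= 1, and ||v||_2 < 1, return an atomic measure on the unit ball,
    as (support points, weights), with mean v and second moment Q."""
    d = len(v)
    # Eigendecomposition of the PSD matrix Q - v v^T (its SVD, since it is PSD).
    sig, U = np.linalg.eigh(Q - np.outer(v, v))
    points, weights = [v.copy()], [0.0]            # placeholder weight for v
    for i in range(d):
        if sig[i] <= tol:
            continue
        u = U[:, i]
        # Roots of ||v + t u||^2 = 1, i.e. t^2 + 2 (v.u) t + (||v||^2 - 1) = 0.
        b = v @ u
        disc = np.sqrt(b * b + 1.0 - v @ v)
        t_plus, t_minus = -b + disc, -b - disc     # t_plus > 0 > t_minus
        lam_plus = sig[i] / (t_plus * (t_plus - t_minus))
        lam_minus = sig[i] / (t_minus * (t_minus - t_plus))
        points += [v + t_plus * u, v + t_minus * u]
        weights += [lam_plus, lam_minus]
    weights[0] = 1.0 - sum(weights[1:])            # lambda_0: mass kept at v
    return np.array(points), np.array(weights)

# Example: a valid pair (v, Q) with Q - v v^T = diag(0.1, 0.2) and Tr[Q] <= 1.
v = np.array([0.2, 0.1])
Q = np.outer(v, v) + np.diag([0.1, 0.2])
points, weights = backfit_measure(v, Q)

mean = weights @ points                            # sum_i w_i p_i
second = (points.T * weights) @ points             # sum_i w_i p_i p_i^T
print(np.allclose(mean, v), np.allclose(second, Q), np.all(weights >= 0))
```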
Moreover, given $(v, Q) \in \mathcal{Z}$, we can efficiently find an atomic probability measure $\mu$ over the unit ball such that $\mathbb{E}_{y \sim \mu}[y] = v$ and $\mathbb{E}_\mu[y y^\top] = Q$. It involves nothing more complicated than taking the SVD of a matrix. We present the procedure in Algorithm 3 and the proof of its correctness in Appendix C. Given that we can implement a separation oracle and a way to solve for a probability measure $\mu$ whose moments match a target vector of moments in $\mathcal{Z}$, we have the following corollary:

Corollary 4. Let $\Gamma$ be a matrix-valued kernel with operator norm uniformly bounded by $G$ and with vector-valued RKHS $\mathcal{H} \subseteq \{\mathcal{X} \times \mathcal{Z} \to \mathbb{R}^{d + d(d+1)/2}\}$, where $\mathcal{Z}$ is as in (10) and $s(y) \in \mathbb{R}^{d + d(d+1)/2}$. Then, $\mathcal{Z}$ has diameter at most $\sqrt{8}$, and the Defensive Generation algorithm with kernel $\Gamma$ guarantees online OI with respect to the set of functions $f(x, p, y)$ defined in (9). For any $f \in \mathcal{F}$,
$$\mathrm{OIGap}_T(f) \le 4 \|h\|_{\mathcal{H}} \sqrt{T G}.$$

As a concrete instantiation of this result, we can guarantee indistinguishability with respect to the set of functions in Equation (9), where $s(y) \in \mathbb{R}^m$ for $m = d + d(d+1)/2$ is a vector containing all the variables in the first and second moments of $y$ (i.e., $y_i$, $y_i^2$, and also the mixed pairs $y_i y_j$ for $y = (y_1, \ldots, y_d)$). In this example, $\Phi(x, p) \in \mathbb{R}^{r \times m}$ is a finite-dimensional feature map, and $h(x, p)$ is any function that lies in the span of these features. For instance, if we let
$$\Phi(x, p)^\top = \big[\, g_1(x, p) \mid \cdots \mid g_r(x, p) \,\big] \in \mathbb{R}^{m \times r}, \tag{12}$$
where $g_1, \ldots, g_r$ is an arbitrary set of $r$ functions, then $\mathcal{H} = \mathrm{span}(\{g_1, \ldots, g_r\})$. This is just one choice of an explicit feature map; one could of course consider others.

Example 4. Let $\mathcal{Z}$ (see Equation (10)) be the set of first and second moments of probability distributions over the unit ball, and let $\Phi : \mathcal{X} \times \mathcal{Z} \to \mathbb{R}^{r \times m}$ be a feature map.
The Defensive Generation algorithm with kernel $\Gamma((x,p),(x',p')) = \Phi(x,p)^\top \Phi(x',p')$ guarantees online outcome indistinguishability with respect to the set of functions $f(x,p,y) = h(x,p)^\top s(y)$, where $h(x,p) = \Phi(x,p)^\top \theta$. In particular, for every such $h$,
$$\mathrm{OIGap}_T(f) \le 4 \|\theta\|_2 \sqrt{ T \cdot \sup_{x,p} \|\Phi(x, p)\|_{\mathrm{op}}^2 }.$$
Therefore, if we set $\Phi$ as in (12) for functions $g_1, \ldots, g_r$ with $\max_{1 \le j \le r} \|g_j(x,p)\|_2 \le B$, then for any $f(x,p,y) = g_j(x,p)^\top s(y)$, $\mathrm{OIGap}_T(f) \le 4\sqrt{r B^2 T}$.

We note that while we presented the results in this last section for the case where $\mathcal{Y}$ is the unit ball, they apply more generally to the case where $\mathcal{Y}$ is any ellipsoid $\mathcal{Y} = \{x : (x - b)^\top A (x - b) \le 1\}$ with $A \succ 0$ and $b \in \mathbb{R}^d$, by simply changing basis $x \mapsto A^{-1/2} x + b$.

5 Acknowledgments

GF was supported in part by the National Science Foundation award CCF-2443068, the Office of Naval Research grant N000142512296, and an AI2050 Early Career Fellowship. We would like to thank Benjamin Recht for his helpful comments.

References

M. A. Alvarez, L. Rosasco, N. D. Lawrence, et al. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 2012.

X. Chen, X. Deng, and S.-H. Teng. Settling the complexity of computing two-player Nash equilibria. Journal of the ACM, 56(3):1–57, 2009.

C. Daskalakis, P. W. Goldberg, and C. H. Papadimitriou. The complexity of computing a Nash equilibrium. Communications of the ACM, 52(2):89–97, 2009.

C. Dwork, M. P. Kim, O. Reingold, G. N. Rothblum, and G. Yona. Outcome indistinguishability. In ACM Symposium on Theory of Computing, 2021.

C. Dwork, M. Kim, O. Reingold, G. Rothblum, and G. Yona. Beyond Bernoulli: Generating random outcomes that cannot be distinguished from nature. In International Conference on Algorithmic Learning Theory, 2022.

C. Dwork, C. Hays, N. Immorlica, J. C.
Perdomo, and P. Tankala. From fairness to infinity: Outcome-indistinguishable (omni)prediction in evolving graphs. In Conference on Learning Theory, 2025.

G. Farina. Turning defense into offense in O(log 1/ε) steps: Efficient constructive proof of the minimax theorem. ACM SIGecom Exchanges, 2026.

G. Farina and C. Pipis. Polynomial-time computation of exact Φ-equilibria in polyhedral games. In Advances in Neural Information Processing Systems, 2024.

M. Fishelson, N. Golowich, M. Mohri, and J. Schneider. High-dimensional calibration from swap regret. In Conference on Neural Information Processing Systems, 2025.

D. P. Foster and S. Hart. Forecast hedging and calibration. Journal of Political Economy, 2021.

D. P. Foster and S. M. Kakade. Calibration via regression. In IEEE Information Theory Workshop, 2006.

D. P. Foster and R. V. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 1997.

D. P. Foster and R. V. Vohra. Asymptotic calibration. Biometrika, 1998.

D. Fudenberg and D. K. Levine. An easier way to calibrate. Games and Economic Behavior, 1999.

P. Gopalan, A. T. Kalai, O. Reingold, V. Sharan, and U. Wieder. Omnipredictors. In Innovations in Theoretical Computer Science, 2022.

P. Gopalan, L. Hu, M. P. Kim, O. Reingold, and U. Wieder. Loss minimization through the lens of outcome indistinguishability. In Innovations in Theoretical Computer Science, 2023.

M. Grötschel, L. Lovász, and A. Schrijver. Geometric Algorithms and Combinatorial Optimization. 1993.

V. Gupta, C. Jung, G. Noarov, M. M. Pai, and A. Roth. Online multivalid learning: Means, moments, and prediction intervals. In Innovations in Theoretical Computer Science, 2022.

Ú. Hébert-Johnson, M. Kim, O. Reingold, and G. Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, 2018.

C.
Jung, C. Lee, M. Pai, A. Roth, and R. Vohra. Moment multicalibration for uncertainty estimation. In Conference on Learning Theory, 2021.

Y. T. Lee, A. Sidford, and S. S. Vempala. Efficient convex optimization with membership oracles. In Conference on Learning Theory, 2018.

E. Lehrer. Any inspection is manipulable. Econometrica, 2001.

J. Lu, A. Roth, and M. Shi. Sample efficient omniprediction and downstream swap regret for non-linear losses. In Conference on Learning Theory, 2025.

A. Nemirovski, S. Onn, and U. G. Rothblum. Accuracy certificates for computational problems with convex structure. Mathematics of Operations Research, 2010.

J. Nie. Moment and Polynomial Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2023.

G. Noarov, R. Ramalingam, A. Roth, and S. Xie. High-dimensional prediction for sequential decision making. In International Conference on Machine Learning, 2025.

R. O'Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014.

P. Okoroafor, R. Kleinberg, and M. P. Kim. Near-optimal algorithms for omniprediction. arXiv preprint arXiv:2501.17205, 2025.

P. A. Parrilo. Polynomial games and sum of squares optimization. In Proceedings of the 45th IEEE Conference on Decision and Control, 2006.

B. Peng. High dimensional online calibration in polynomial time. In Symposium on Foundations of Computer Science, 2025.

J. C. Perdomo and B. Recht. In defense of defensive forecasting. arXiv preprint arXiv:2506.11848, 2025.

A. Sandroni, R. Smorodinsky, and R. V. Vohra. Calibration with many checking rules. Mathematics of Operations Research, 2003.

M. Simchowitz. Statistical Complexity and Regret in Linear Control. PhD thesis, University of California, Berkeley, 2021.

V. Vovk. Non-asymptotic calibration and resolution. Theoretical Computer Science, 2007.

V. Vovk, A. Takemura, and G. Shafer.
Defensive forecasting. In International Workshop on Artificial Intelligence and Statistics, 2005.

V. A. Yakubovich. S-procedure in nonlinear control theory. Vestnik Leningradskogo Universiteta, Ser. Matematika, 1971.

B. H. Zhang, I. Anagnostides, E. Tewolde, R. E. Berker, G. Farina, V. Conitzer, and T. Sandholm. Expected variational inequalities. In International Conference on Machine Learning, 2025.

M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, 2003.

A Expected Variational Inequalities

A.1 Proof of Proposition 2

From the definition of $S_t$, the norm of the utility gradient is at most $G' := \max_z \|S_t(z)\|_2 \le t G D$. Hence, by setting the learning rate $\eta = D/(G\sqrt{T})$, and using the known analysis of the online projected gradient ascent algorithm [Zinkevich, 2003], we can write

$$\frac{\mathrm{Regret}_K}{K} \le \frac{2 G' D}{\sqrt{K}} \le \frac{2 D^2 G\, t}{\sqrt{K}}.$$

This shows that it is enough to set $K = t^2$ to obtain an error of $2 D^2 G$ in the solution of (EVI), and thus, by Proposition 1, a multicalibration error bounded by $2 \|h\|_{\mathcal{H}} \sqrt{D^2 G T}$ against any $h \in \mathcal{H}$. To achieve a given average multicalibration error $\varepsilon$, it is then necessary to run Algorithm 1 for $T = D^2 G / \varepsilon^2$ iterations.

The only step left to complete the proof is to estimate the complexity of each iteration of the no-regret-based EVI solution algorithm. Each iteration involves evaluating $S_t$ at the current $p_k$, taking a gradient step, and projecting onto $\mathcal{Z}$. The latter two operations require $\mathrm{poly}(d)$ time. Thus, at every time $t$ we need to perform an amount of work of order

$$t^2 \big( \mathrm{poly}(d) + \mathrm{Eval}(S_t) \big),$$

where $\mathrm{Eval}(S_t)$ denotes the cost of evaluating $S_t$. We distinguish two possibilities for evaluating $S_t$.

• In general, $S_t$ is defined as a sum of $t$ expectations. Each expectation is over the distribution $\mathcal{D}_\tau$ that solved (EVI) at time $\tau$.
As discussed above, the support at time $\tau$ has size $K_\tau = \tau^2$. So, given our assumption that each evaluation of $\Gamma$ takes $\mathrm{poly}(d)$ time, we are left with a $O(t^3\,\mathrm{poly}(d))$ time bound for each evaluation of $S_t$. Putting all the pieces together, each iteration of Algorithm 1 takes $O(t^5\,\mathrm{poly}(d))$ time, leading to a $O(T^6\,\mathrm{poly}(d))$ runtime to complete $T$ iterations. Plugging in $T = D^2 G / \varepsilon^2$ yields a bound of $\mathrm{poly}(d, D, G) \cdot \varepsilon^{-12}$.

• When $r = O(\mathrm{poly}(d, D, G))$, the cost of evaluating $S_t(p)$ can be amortized by accumulating over time the vector quantity

$$m_t := \sum_{\tau=1}^{t-1} \mathbb{E}_{p_\tau \sim \mathcal{D}_\tau} \Phi(x_\tau, p_\tau)(z_\tau - p_\tau).$$

Indeed, by definition of $S_t$, we have $S_t(p) = \Phi(x_t, p)^\top m_t$, leading to a cost of $O(r d)$ for each evaluation of $S_t$. To solve the EVI, we then require $t^2\,\mathrm{poly}(d)\, r$ time. After every iteration $t$ of Algorithm 1, the quantity $m_t$ can be updated by summing the new expectation arising from $\mathcal{D}_t$. Since the distribution has support of size $K = t^2$, such an update has a $O(t^2\,\mathrm{poly}(d))$ cost. In total, each iteration of Algorithm 1 requires $t^2\,\mathrm{poly}(d)\, r$ time, for a total of $T^3\,\mathrm{poly}(d)\, r$ time. Plugging in $T = D^2 G/\varepsilon^2$ yields a bound of $O(\mathrm{poly}(d, D, G) \cdot r \cdot \varepsilon^{-6})$ time to reach $\varepsilon$ average multicalibration error.

Taking the minimum between the two cases yields the statement.

A.2 Intuition Behind the Construction of Zhang et al. [2025]

As mentioned in Section 3, significantly faster approaches for solving EVIs (and even harder generalizations) at $\log(1/\varepsilon)$ rates have very recently been developed by leveraging a new constructive version of the minimax theorem [Farina and Pipis, 2024, Zhang et al., 2025, Farina, 2026]. We now give high-level intuition for these methods; for more details, we refer the reader to the paper of Zhang et al. [2025].
These faster algorithms can be understood as operating in two phases:

• First, they produce an extremely sparse set $P = \{p_1, \dots, p_K\} \subset \mathcal{Z}$ of size $K = O(\mathrm{poly}(d) \log(G, D, 1/\varepsilon))$.

• Then, they search for a distribution over the $K$ points in $P$ that guarantees value $\varepsilon$ for all $z \in \mathcal{Z}$. This discrete distribution $(\lambda_1, \dots, \lambda_K)$ solving the EVI problem can be computed by solving the convex optimization problem

$$\arg\min_{\lambda \in \Delta^K} \max_{z \in \mathcal{Z}} \sum_{k=1}^K \lambda_k\, S(p_k)^\top (z - p_k)$$

over $\Delta^K$. This can be solved using standard convex optimization techniques in time polynomial in $K$ and the diameters of $\mathcal{Z}$ and $S$ [Grötschel et al., 1993].

Conceptually, to produce the sparse support, the algorithm makes use of the ellipsoid method to certify the emptiness of the set

$$\Omega := \{ z \in \mathcal{Z} : S(p)^\top (z - p) > 0 \ \forall p \in \mathcal{Z} \}.$$

The emptiness of $\Omega$ is direct, since for any $z \in \mathcal{Z}$, the constraint indexed by $p = z$ is trivially violated. By running the ellipsoid method over $\Omega$ using $p = z$ as the separation oracle, the method is therefore able to produce a trace of violated constraints, indexed by the ellipsoid centers $p_1 = z_1, p_2 = z_2, \dots, p_K = z_K$, that sparsely certifies the emptiness of $\Omega$. Farkas' lemma then implies that a distribution over these constraints must be an EVI solution, justifying the soundness of the second step.

A.3 Proof of Proposition 3

The ellipsoid-based algorithm for EVIs is able to solve the EVI problem with only $O(\mathrm{poly}(d) \log(D, G, 1/\varepsilon))$ calls to the operator $S_t$, producing a distribution with support size bounded by the same quantity. Following the discussion in Section A.1, we distinguish two cases for the cost of evaluating $S_t$.

• In general, $S_t$ is defined as a sum of $t$ expectations. Each expectation is over the distribution $\mathcal{D}_\tau$ that solved (EVI) at time $\tau$.
Since the support at time $\tau$ has size $K_\tau = \tilde{O}(\log(t/\varepsilon))$, given our assumption that each evaluation of $\Gamma$ takes $\mathrm{poly}(d)$ time, we are left with a $O(t \cdot \mathrm{poly}(d))$ time bound for each evaluation of $S_t$. Putting all the pieces together, each iteration of Algorithm 1 takes $O(t \log^2(t/\varepsilon))$ time, leading to a $\tilde{O}(T^2\,\mathrm{poly}(d))$ runtime to complete $T$ iterations. Plugging in $T = D^2 G/\varepsilon^2$ yields a bound of $\tilde{O}(\mathrm{poly}(d, D, G) \cdot \varepsilon^{-4})$.

• When $r = O(\mathrm{poly}(d, D, G))$, the cost of evaluating $S_t(p)$ can be amortized as described in Section A.1, leading to a $O(\mathrm{poly}(d)\, r)$ evaluation time for $S_t$. Hence, we can solve the EVI in time $\tilde{O}(\mathrm{poly}(d)\, r)$. After every iteration $t$ of Algorithm 1, the quantity $m_t$ can be updated by summing the new expectation arising from $\mathcal{D}_t$. Since the distribution has support of size $K = \tilde{O}(1)$, such an update has a $\tilde{O}(\mathrm{poly}(d))$ cost. In total, each iteration of Algorithm 1 requires $\tilde{O}(\mathrm{poly}(d)\, r)$ time, for a total of $O(T\,\mathrm{poly}(d)\, r)$ time. Plugging in $T = D^2 G/\varepsilon^2$ yields a bound of $O(\mathrm{poly}(d, D, G) \cdot r \cdot \varepsilon^{-2})$ time to reach $\varepsilon$ average multicalibration error.

Taking the minimum between the two cases yields the statement.

B More Details on the Cone of the First d Moments in R

In the interest of keeping the exposition as self-contained as possible, in this section we sketch classical results regarding the algebraic and computational properties of the cone of moments. For more details, we refer the reader to Chapters 3.3 and 3.4 of the book on moment and polynomial optimization by Nie [2023], or the paper of Parrilo [2006]. To lighten notation, in this section we use $\langle \cdot, \cdot \rangle$ to denote the standard dot product in $\mathbb{R}^{2d+1}$.
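Before turning to the separation oracle, it may help to see the primal side of this cone numerically. A classical necessary condition is that if $z$ is a valid moment vector of a measure on $[-1, 1]$, then the Hankel matrix $[z_{i+j}]_{i,j=0}^{d}$ and the localizing matrix $[z_{i+j} - z_{i+j+2}]_{i,j=0}^{d-1}$ are both PSD (they equal $\mathbb{E}[v(y)v(y)^\top]$ and $\mathbb{E}[(1-y^2)\,w(y)w(y)^\top]$ for monomial vectors $v, w$). A minimal numpy sketch on a hypothetical three-atom distribution; the support and weights are illustrative, not from the paper:

```python
import numpy as np

# Hypothetical discrete distribution on [-1, 1].
support = np.array([-0.5, 0.2, 0.8])
weights = np.array([0.3, 0.4, 0.3])
d = 2                                          # moments up to degree 2d = 4

# Moment vector z = (z_0, ..., z_{2d}) with z_k = E[y^k]; note z_0 = 1.
z = np.array([(weights * support**k).sum() for k in range(2 * d + 1)])

# Hankel (moment) matrix M[i, j] = z_{i+j}: PSD for any representing measure.
M = np.array([[z[i + j] for j in range(d + 1)] for i in range(d + 1)])

# Localizing matrix for the constraint 1 - y^2 >= 0: L[i, j] = z_{i+j} - z_{i+j+2}.
L = np.array([[z[i + j] - z[i + j + 2] for j in range(d)] for i in range(d)])
```

Both matrices are PSD here because each is a nonnegative mixture of rank-one PSD terms, one per atom of the distribution.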
Efficient Separation Oracle for the Moment Set. An efficient separation oracle for $\mathcal{Z}$ can be constructed by leveraging a duality result between moments and nonnegative polynomials, and invoking classical Positivstellensatz results to characterize the latter via the positive semidefinite (PSD) cone. To start, it is well understood that the dual set $\mathcal{Z}^*$ of the $\mathcal{Z}$ from (5), defined as

$$\mathcal{Z}^* := \{ a = (a_0, a_1, \dots, a_{2d}) \in \mathbb{R}^{2d+1} : \langle a, z \rangle \ge 0 \ \forall z \in \mathcal{Z} \},$$

corresponds to the set of polynomials that are nonnegative everywhere on the interval $[-1, 1]$:

$$\mathcal{Z}^* = \Big\{ (a_0, a_1, \dots, a_{2d}) \in \mathbb{R}^{2d+1} : \sum_{i=0}^{2d} a_i y^i \ge 0 \ \forall y \in [-1, 1] \Big\}.$$

To see this, observe that for any $a \in \mathbb{R}^{2d+1}$,

$$\langle a, z \rangle \ge 0 \ \forall z \iff \sum_{i=0}^{2d} a_i \Big( \int_{[-1,1]} y^i \,\mathrm{d}\mu \Big) \ge 0 \ \forall \mu \iff \int_{[-1,1]} \Big( \sum_{i=0}^{2d} a_i y^i \Big) \mathrm{d}\mu \ge 0 \ \forall \mu.$$

Since, in particular, we are free to take $\mu$ to be a Dirac delta centered at any $y \in [-1, 1]$, it is immediate to see that

$$\int_{[-1,1]} \Big( \sum_{i=0}^{2d} a_i y^i \Big) \mathrm{d}\mu \ge 0 \ \forall \mu \iff \sum_{i=0}^{2d} a_i y^i \ge 0 \ \forall y \in [-1, 1],$$

completing the proof.

Both $\mathcal{Z}$ and $\mathcal{Z}^*$ are closed convex sets. The dual set $\mathcal{Z}^*$ plays a key role in constructing an efficient separation oracle for $\mathcal{Z}$. Indeed, one can prove that $\mathcal{Z}^*$ defines a cutting plane characterization of $\mathcal{Z}$, in the sense that

$$\mathcal{Z} = \{ z \in \mathbb{R}^{2d+1} : z_0 = 1 \ \wedge \ \langle z, a \rangle \ge 0 \ \forall a \in \mathcal{Z}^* \}. \tag{13}$$

The proof of the above result is a standard application of separation for closed convex sets. Equation (13) implies that a separation oracle for $\mathcal{Z}$ can be constructed directly from a linear optimization oracle for $\mathcal{Z}^*$. Specifically, to check whether a given point $w$ belongs to $\mathcal{Z}$, one can do the following:

• If $w_0 \neq 1$, then clearly $w \notin \mathcal{Z}$, and the vector $(1, 0, \dots, 0)$ provides a separating direction; else

• We solve the convex optimization problem $\min_{a \in \mathcal{Z}^*} \langle w, a \rangle$.
If the optimal value of the problem is nonnegative, then $w \in \mathcal{Z}$; else, the minimizer $a^*$ provides a separating direction, as $\langle a^*, w \rangle < 0$ by assumption, and yet $\langle a^*, z \rangle \ge 0$ for all $z \in \mathcal{Z}$ by Equation (13).

To complete the construction, it remains to show that the cone $\mathcal{Z}^*$ of nonnegative polynomials on $[-1, 1]$ admits an efficient linear optimization oracle. As mentioned above, this follows from important results in semialgebraic optimization. In particular, it is a celebrated result that, in dimension one, polynomial nonnegativity is intimately connected with the notion of sum-of-squares (SOS) polynomials. A polynomial $q$ of degree at most $2d$ is said to be SOS, denoted $q \in \Sigma_{2d}$, if it can be written in the form

$$q(y) = \begin{pmatrix} 1 \\ y \\ \vdots \\ y^d \end{pmatrix}^{\!\top} Q \begin{pmatrix} 1 \\ y \\ \vdots \\ y^d \end{pmatrix}, \quad \text{where } Q = \begin{pmatrix} q_{00} & q_{01} & \cdots & q_{0d} \\ q_{01} & q_{11} & \cdots & q_{1d} \\ \vdots & \vdots & \ddots & \vdots \\ q_{0d} & q_{1d} & \cdots & q_{dd} \end{pmatrix} \succeq 0_{d+1} \text{ is a PSD matrix}. \tag{14}$$

An application of the Positivstellensatz for univariate polynomials states that a polynomial $p(y) = a_0 + \dots + a_{2d} y^{2d}$ is nonnegative on $[-1, 1]$ if and only if it can be decomposed in the form

$$p(y) = r(y) + (1 - y^2)\, s(y), \quad \text{for } r \in \Sigma_{2d},\ s \in \Sigma_{2d-2}.$$

The coefficients of the polynomial on the right-hand side depend linearly on the PSD matrices $R$ and $S$ underlying $r$ and $s$. Letting $H : \mathbb{R}^{(d+1) \times (d+1)} \times \mathbb{R}^{d \times d} \to \mathbb{R}^{2d+1}$ denote the mapping from $(R, S)$ to the coefficients of the right-hand-side polynomial, we therefore conclude that

$$\mathcal{Z}^* = \big\{ a \in \mathbb{R}^{2d+1} : a = H(R, S),\ R \succeq 0_{d+1},\ S \succeq 0_d \big\}.$$

Hence $\mathcal{Z}^*$ is a convex conic domain defined as the intersection of the linear constraint $a = H(R, S)$ with the product of two PSD cones. Minimization of a linear objective over $\mathcal{Z}^*$, as used by the separation oracle for $\mathcal{Z}$, can therefore be carried out using standard semidefinite programming techniques.

Solving for $\mu_t$.
Having discussed how to implement a separation oracle for the moment cone, we now describe how, given a target vector of moments $z_t \in \mathcal{Z}$, one can produce a discrete distribution $\mu_t$ over $[-1, 1]$ with at most $2d + 1$ atoms in its support whose moments match $z_t$:

$$\mathbb{E}_{\mu_t}[s(\tilde{y}_t)]^\top = \big( \mathbb{E}_{\mu_t}[\tilde{y}_t], \dots, \mathbb{E}_{\mu_t}[\tilde{y}_t^{2d}] \big) = z_t.$$

To do so, we leverage once again the duality between the moment set $\mathcal{Z}$ and the set $\mathcal{Z}^*$, as captured by (13). In light of this connection, the statement that $z \in \mathcal{Z}$ is witnessed by the fact that $z_0 = 1$ and $\langle z, a \rangle \ge 0$ for all $a \in \mathcal{Z}^*$. In particular, let

$$a^* \in \arg\min_{a \in \mathcal{Z}^* :\, \mathbf{1}^\top a \ge 1} \langle a, z \rangle,$$

which can be computed efficiently by semidefinite programming using the ideas we discussed.¹¹ Since $(a^*)^\top z \ge 0$, we can assume without loss of generality that the optimal solution satisfies $\mathbf{1}^\top a^* = 1$. Since $a^*$ belongs to $\mathcal{Z}^*$, it induces a nonzero polynomial

$$p^*(y) := a^*_0 + a^*_1 y + \dots + a^*_{2d} y^{2d}, \qquad p^*(y) \ge 0 \ \forall y \in [-1, 1].$$

Let $\{\gamma_1, \dots, \gamma_m\}$ be the distinct zeros of $p^*$ (of course, $0 \le m \le 2d$, since $p^*$ is not identically zero). We now show that the set $\{\gamma_0 := 1, \gamma_1, \dots, \gamma_m\}$ defines a valid discrete support for a distribution $\mu$ that matches the moments $z$. By the first-order optimality conditions, it must then be the case that the gradient of the objective lies in the normal cone of the feasible set at $a^*$. Since all constraints are binding by construction, this condition is exactly the existence of convex combination coefficients $\lambda_0, \dots, \lambda_m$ such that

$$z = \sum_{j=0}^m \lambda_j \begin{pmatrix} 1 \\ \gamma_j \\ \vdots \\ \gamma_j^{2d} \end{pmatrix}. \tag{15}$$

In other words, there exists a discrete distribution supported on $\{1, \gamma_1, \dots, \gamma_m\}$ whose moments match $z$. The probability masses $\lambda_j$ can be computed directly by solving for the coefficients $\lambda_j$ in (15), which can be done efficiently via a linear program.
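Once the support points are in hand, solving (15) for the masses $\lambda_j$ is just a small Vandermonde linear system. A minimal numpy sketch on a hypothetical discrete distribution (the support and weights are illustrative, not from the paper), recovering the masses from the moment vector:

```python
import numpy as np

# Hypothetical support and weights of a discrete distribution on [-1, 1].
support = np.array([-0.5, 0.0, 0.7])
weights = np.array([0.3, 0.5, 0.2])
deg = 4                                     # moments y^0, ..., y^{2d} with d = 2

# Moment vector z_k = E[y^k], the object living in the moment cone.
z = np.array([(weights * support**k).sum() for k in range(deg + 1)])

# System (15): column j of V is (1, gamma_j, ..., gamma_j^{2d}).
V = np.vander(support, deg + 1, increasing=True).T   # shape (deg + 1, m)
lam, *_ = np.linalg.lstsq(V, z, rcond=None)          # recovers the masses
```

Because the system is consistent by construction and the Vandermonde columns at distinct points are linearly independent, the least-squares solve returns the unique nonnegative masses.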
¹¹We remark that the feasible set is not empty, since all $a \in \mathcal{Z}^*$ must be such that $a^\top w \ge 0$ for all $w \in \mathcal{Z}$, and $\mathbf{1} \in \mathcal{Z}$ since it is the moment vector of the point mass distribution with support $\{1\}$.

C The Set of First and Second Moments on the Euclidean Ball

In this section, we provide results similar to those of the previous appendix, but for the case where the outcomes $y$ live in the unit ball in $\mathbb{R}^d$. We start by recalling the definition of the moment set:

$$\mathcal{Z} = \{ (v, Q) : \exists \text{ a probability measure } \mu \text{ over } \|y\|_2 \le 1 \text{ such that } \mathbb{E}_\mu[y] = v,\ \mathbb{E}_\mu[y y^\top] = Q \}. \tag{16}$$

Since $\|y\|_2 \le 1$ and $\mathrm{Tr}[Q] = \mathbb{E}_\mu \|y\|_2^2$, it must be the case that $\mathrm{Tr}[Q] \le 1$ and that $Q = \mathbb{E}_\mu[y y^\top] \succeq 0$. Furthermore, for $(v, Q) = (\mathbb{E}_\mu[y], \mathbb{E}_\mu[y y^\top])$, it is also true that

$$\mathbb{E}_\mu[(y - \mathbb{E}_\mu[y])(y - \mathbb{E}_\mu[y])^\top] \succeq 0 \iff Q - v v^\top = \mathbb{E}_\mu[y y^\top] - \mathbb{E}_\mu[y]\, \mathbb{E}_\mu[y]^\top \succeq 0.$$

Using these probabilistic facts, which stipulate necessary conditions on $(v, Q) \in \mathcal{Z}$, we conclude that $\mathcal{Z} \subseteq \mathcal{M}$, where

$$\mathcal{M} = \{ (v, Q) : Q \succeq v v^\top,\ Q \succeq 0,\ \mathrm{Tr}[Q] \le 1 \}.$$

Next, we show that, given any $(v, Q) \in \mathcal{M}$, we can construct a probability measure $\mu$ over points $y$ in the unit ball using Algorithm 3 such that $\mathbb{E}_\mu[y] = v$ and $\mathbb{E}_\mu[y y^\top] = Q$. This both solves the problem of finding a measure $\mu$ with the appropriate moments and shows that $\mathcal{M} \subseteq \mathcal{Z}$. Since we had already argued that $\mathcal{Z} \subseteq \mathcal{M}$, this establishes that $\mathcal{Z} = \mathcal{M}$ and solves the problem of characterizing the set $\mathcal{Z}$ in terms of PSD constraints. We establish the correctness of Algorithm 3 in the following proposition. Similar results and constructions have appeared throughout the literature; we include the proof here simply for the sake of a self-contained exposition.

Proposition 4. Let $(v, Q)$ be any element of $\mathcal{M}$. Algorithm 3 returns an atomic measure $\mu$ supported on $2d + 1$ points such that $\mathbb{E}_\mu[y] = v$ and $\mathbb{E}_\mu[y y^\top] = Q$.

Proof. Consider the SVD of $\Sigma = Q - v v^\top = \sum_{i=1}^d \sigma_i u_i u_i^\top$.
We can assume without loss of generality that the $u_i$ form an orthonormal basis for $\mathbb{R}^d$ and that $\sigma_i \ge 0$ for all $i$. Also, for $i \in [d]$, the equation

$$\|v + t u_i\|_2^2 = 1 \iff t^2 + 2 t\, v^\top u_i - (1 - \|v\|^2) = 0$$

always has either two solutions $t^+_i$ and $t^-_i$, since $\|v\|^2 = \mathrm{Tr}[v v^\top] \le \mathrm{Tr}[Q] \le 1$, or one solution (namely $t = 0$) if $\|v\| = 1$. If there are two solutions, by the quadratic formula $t^+_i t^-_i = -(1 - \|v\|_2^2) \le 0$; hence we can take $t^+_i > 0$ and $t^-_i < 0$. We will deal with the case where there are two nonzero solutions ($\|v\| < 1$) and address the other case later. We define

$$y^+_i = v + t^+_i u_i, \qquad y^-_i = v + t^-_i u_i,$$

and

$$\lambda^+_i = \frac{\sigma_i}{t^+_i (t^+_i - t^-_i)} > 0, \qquad \lambda^-_i = \frac{\sigma_i}{t^-_i (t^-_i - t^+_i)} > 0.$$

One can check that $\Lambda_i = \lambda^+_i + \lambda^-_i = \sigma_i / (1 - \|v\|_2^2)$. Recall that the measure is

$$\mu = \lambda_0\, \delta(v) + \sum_{i=1}^d \big[ \lambda^+_i\, \delta(y^+_i) + \lambda^-_i\, \delta(y^-_i) \big],$$

where $\lambda_0 = 1 - \sum_{i=1}^d \Lambda_i$. By construction, the total weight among all of the $2d + 1$ points in the measure is

$$\lambda_0 + \sum_{i=1}^d \Lambda_i = \lambda_0 + \frac{\sum_{i=1}^d \sigma_i}{1 - \|v\|^2} = 1.$$

Note that, by definition of $(v, Q)$, $1 - \|v\|_2^2 \ge \mathrm{Tr}[Q] - \|v\|_2^2 = \mathrm{Tr}[Q - v v^\top] = \sum_{i=1}^d \sigma_i \ge 0$, hence the fraction is well defined. We also note that if $\|v\| = 1$ (the other case), the construction simply places all of its weight on $v$, so there is no issue. It remains to show that $\mu$ has the right first and second moments. A direct calculation shows that, for every $i$, $\lambda^+_i t^+_i + \lambda^-_i t^-_i = 0$.
Using this, and plugging in the definitions of $y^+_i$ and $y^-_i$, we get that

$$\mathbb{E}_\mu[y] = \lambda_0 v + \sum_{i=1}^d (\lambda^+_i y^+_i + \lambda^-_i y^-_i) = \lambda_0 v + \sum_{i=1}^d (\lambda^+_i + \lambda^-_i)\, v + \sum_{i=1}^d (\lambda^+_i t^+_i + \lambda^-_i t^-_i)\, u_i = v.$$

To verify the second moment, we use the fact that $\lambda^+_i (t^+_i)^2 + \lambda^-_i (t^-_i)^2 = \sigma_i$ and plug in the definition of $\mu$ again to get

$$\mathbb{E}_\mu[(y - v)(y - v)^\top] = \sum_{i=1}^d \big( \lambda^+_i (t^+_i)^2 + \lambda^-_i (t^-_i)^2 \big)\, u_i u_i^\top = \sum_{i=1}^d \sigma_i u_i u_i^\top = Q - v v^\top.$$

D Further Details on the Examples in Section 1.2

Predicting Correlated Rain Across the United States. The particular guarantees stated there are essentially a restatement of Corollary 4. In particular, let $k((x, p), (x', p')) = x^\top x' + p^\top p'$ be the linear, scalar-valued kernel. Then $\Gamma = I \cdot k$ is a matrix-valued kernel that contains all the functions

$$h(x, p) = A x + C p, \qquad \text{where } \|h\|_{\mathcal{H}}^2 = \|A\|_F^2 + \|C\|_F^2$$

and $\|\cdot\|_F$ denotes the Frobenius norm. Note that

$$\|\Gamma((x, p), (x, p))\|_{\mathrm{op}} = \big\| I \big( \|x\|_2^2 + \|p\|_2^2 \big) \big\|_{\mathrm{op}}. \tag{17}$$

By assumption, $\|x\|_2^2 \le B^2$. Also, since $p = (\mathbb{E}_{\mu_t}[\tilde{y}_t], \mathbb{E}_{\mu_t}[\tilde{y}_t \tilde{y}_t^\top])$,

$$\|p\|_2^2 = \|\mathbb{E}_{\mu_t}[\tilde{y}_t]\|_2^2 + \|\mathbb{E}_{\mu_t}[\tilde{y}_t \tilde{y}_t^\top]\|_F^2.$$

The first term is at most 1, since $\tilde{y}_t$ lies in the unit ball. By Jensen's inequality, $\|\mathbb{E}_{\mu_t}[\tilde{y}_t \tilde{y}_t^\top]\|_F^2 \le \mathbb{E}_{\mu_t} \|\tilde{y}_t \tilde{y}_t^\top\|_F^2 \le 1$, and so $\|\Gamma((x, p), (x, p))\|_{\mathrm{op}} \le B^2 + 2$. Defensive Generation guarantees indistinguishability with respect to all functions of the form $f(x, p, y) = h(x, p)^\top s(y)$, where, as in Equation (9), $s(y)$ is a vector containing $\mathbb{E}_{\mu_t}[\tilde{y}_t]$ and $\mathbb{E}_{\mu_t}[\tilde{y}_t \tilde{y}_t^\top]$. The specific examples in that section correspond to cases where $A$ and $C$ are zero everywhere except for one row corresponding to a specific entry of $s(y)$; in this case, the Frobenius norm of the matrices reduces to the $\ell_2$ norm of that row. The rest of the statement follows from Corollary 4.
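The construction in the proof of Proposition 4 (Algorithm 3) is short enough to implement directly. A minimal numpy sketch, with a function name and interface of our choosing, handling the generic case $\|v\| < 1$:

```python
import numpy as np

def atomic_measure(v, Q):
    """Given (v, Q) in M with ||v|| < 1, return atoms y_j (rows) and weights
    lam_j of a measure on the unit ball with E[y] = v and E[y y^T] = Q."""
    d = len(v)
    Sigma = Q - np.outer(v, v)               # Sigma = sum_i sigma_i u_i u_i^T
    sig, U = np.linalg.eigh(Sigma)
    r2 = 1.0 - v @ v                         # equals -(t+ * t-) in the quadratic
    atoms, weights = [np.asarray(v, dtype=float)], []
    for i in range(d):
        s = max(sig[i], 0.0)                 # clip tiny negative eigenvalues
        if s == 0.0:
            continue
        u = U[:, i]
        b = v @ u
        disc = np.sqrt(b * b + r2)
        tp, tm = -b + disc, -b - disc        # roots of t^2 + 2*b*t - r2 = 0
        atoms += [v + tp * u, v + tm * u]    # points on the unit sphere
        weights += [s / (tp * (tp - tm)), s / (tm * (tm - tp))]
    lam0 = 1.0 - sum(weights)                # remaining mass sits on v itself
    return np.array(atoms), np.array([lam0] + weights)

# Example: a valid (v, Q) pair in M (illustrative values).
v = np.array([0.2, 0.1, -0.3])
Q = np.outer(v, v) + 0.05 * np.eye(3)        # Q - vv^T PSD, Tr[Q] <= 1
atoms, lam = atomic_measure(v, Q)
```

The returned weights are nonnegative and sum to one, each atom lies in the unit ball, and the first and second moments of the resulting discrete measure reproduce $(v, Q)$ exactly, mirroring the calculation at the end of the proof.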