Simultaneous Blackwell Approachability and Applications to Multiclass Omniprediction

Omniprediction is a learning problem that requires suboptimality bounds for each of a family of losses $\mathcal{L}$ against a family of comparator predictors $\mathcal{C}$. We initiate the study of omniprediction in a multiclass setting, where the comparator family $\mathcal{C}$ may be infinite. Our main result is an extension of the recent binary omniprediction algorithm of [OKK25] to the multiclass setting, with sample complexity (in statistical settings) or regret horizon (in online settings) $\approx \varepsilon^{-(k+1)}$, for $\varepsilon$-omniprediction in a $k$-class prediction problem. En route to proving this result, we design a framework of potential broader interest for solving Blackwell approachability problems where multiple sets must simultaneously be approached via coupled actions.

Authors: Lunjia Hu (Northeastern University, lunjia@alumni.stanford.edu), Kevin Tian (University of Texas at Austin, kjtian@cs.utexas.edu), Chutong Yang (University of Texas at Austin, cyang98@utexas.edu)

Contents

1 Introduction
  1.1 Our results
  1.2 Related work
2 Preliminaries
  2.1 Notation
  2.2 Omniprediction
3 Simultaneous Blackwell Approachability
  3.1 Blackwell approachability
  3.2 Framework
4 Binary Omniprediction
  4.1 Binary omniprediction preliminaries
  4.2 Online binary omniprediction
  4.3 Statistical binary omniprediction
  4.4 Generalized linear models
  4.5 General classifiers and losses
5 Multiclass Omniprediction
  5.1 MLOOs for multiclass omniprediction
  5.2 Reducing multiclass omniprediction to low-regret learning
  5.3 Generalized linear models
  5.4 General classifiers and losses
6 Unions of Comparators
A Deferred Proofs
B Counterexample for Multiclass Isotonic Regression

1 Introduction

Omniprediction is a powerful definition of learning introduced recently by [GKR+22]. Consider a standard supervised learning task: we receive i.i.d. samples $(x, y) \sim \mathcal{D}$, where $x \in \mathbb{R}^d$ are the features and $y \in \partial\Delta_k := \{e_i\}_{i\in[k]}$ is the label (see Section 2.1 for notation), and we wish to build a predictor $p(x) \approx \mathbb{E}[y \mid x]$. In omniprediction, a family of loss functions $\mathcal{L}$ is fixed, as well as a family of comparator predictors $\mathcal{C}$. The goal is then to satisfy the simultaneous loss minimization guarantee, for some $\varepsilon > 0$ and predictor $p: \mathbb{R}^d \to \Delta_k$:

$\mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(k^\star_\ell(p(x)), y)] \le \min_{c\in\mathcal{C}} \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(c(x), y)] + \varepsilon$, for all $\ell \in \mathcal{L}$. (1)

Here, $k^\star_\ell$ is the ex ante optimum mapping for a particular loss $\ell \in \mathcal{L}$, defined in (3). This function maps each $p \in \Delta_k$ to the loss-minimizing action, on average over $y = e_i$ where $i \sim p$. The formulation (1) effectively decouples the tasks of prediction and action: once the learner has decided on a predictor $p$, the decision maker who wishes to minimize a particular loss $\ell \in \mathcal{L}$ then takes the action $k^\star_\ell \circ p$. This property is particularly useful when, e.g., losses can depend on parameters unknown at training time (such as a market price), or robustness to a range of loss hyperparameters is desirable. Because (1) applies to a family of losses, the predictor $p$ can be viewed as a "supervised sufficient statistic" that goes beyond single loss minimization.
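To make the post-processing map $k^\star_\ell$ concrete, here is a minimal Python sketch (illustrative only, not the paper's algorithm): it approximates $k^\star_\ell(p) = \arg\min_a \mathbb{E}_{i\sim p}[\ell(a, e_i)]$ by brute force over a small candidate action set, so a single prediction can be post-processed for several losses.

```python
import numpy as np

def k_star(loss, p, actions):
    """Ex ante optimum: the candidate action minimizing E_{i ~ p}[loss(a, i)].

    loss(a, i) scores action a against label class i; p is a distribution over
    k classes; actions is a finite candidate set (a brute-force stand-in for
    the exact argmin in the paper's definition (3))."""
    expected = [sum(p[i] * loss(a, i) for i in range(len(p))) for a in actions]
    return actions[int(np.argmin(expected))]

# Two illustrative losses over k = 3 classes.
sq_loss = lambda a, i: float(np.sum((a - np.eye(3)[i]) ** 2))  # squared loss
zo_loss = lambda a, i: float(np.argmax(a) != i)                # 0-1 loss of argmax(a)

p = np.array([0.5, 0.3, 0.2])                       # one prediction p(x)
grid = [np.eye(3)[i] for i in range(3)] + [p]       # tiny candidate action grid

# One predictor, two downstream decision makers, each applying its own k_star.
print(k_star(sq_loss, p, grid))   # squared loss keeps p itself
print(k_star(zo_loss, p, grid))   # 0-1 loss picks (a point mass on) the modal class
```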
This perspective built upon earlier work in algorithmic fairness [HKRR18], and has intimate connections to indistinguishability arguments from pseudorandomness [GHK+23, GH25]. By now, there is a rich body of work on omniprediction in statistical and online learning settings [GKR+22, GHK+23, HNRY23, GKR23, GJRR24, HTY25, DHI+25, OKK25]. However, essentially all prior works focused on binary classification, where labels live in the set $\{0, 1\}$. This is a rather stringent restriction in the context of real-world supervised learning, which is often used for multiclass tasks, e.g., [DDS+09, MDP+11, Den12]. Even the ability to handle labels $y \in \partial\Delta_k \equiv [k]$, for $k$ a constant number of classes, would substantially extend the applicability of omnipredictors.

To our knowledge, the problem of multiclass omniprediction has only been studied in recent works by [NRRX25, LRS25]. These papers focused on a setting motivated by the economics literature, where $\mathcal{C}$, the family of comparators (viewed as an action space), is finite. The former's main multiclass omniprediction result (Theorem 6.5, [NRRX25]) is restricted to $\ell$ that independently decompose coordinatewise. On the other hand, Corollary 6, [LRS25] gives a more general statement for multiclass omniprediction, but again the result is stated for finite $\mathcal{C}$, and incurs an $\approx \varepsilon^{-4k-2}$ overhead in the sample complexity for achieving (1) (without the consideration of runtime). The main motivation of our work is to bridge this gap, by developing multiclass omnipredictors with guarantees more closely resembling the state of the art in binary omniprediction.

Indeed, there has been substantial recent progress on improving the sample complexity and runtime of binary omniprediction for concrete pairs $(\mathcal{C}, \mathcal{L})$. For example, in the generalized linear model (GLM) setting, where $\mathcal{C}$ is bounded linear predictors and $\mathcal{L}$ is appropriate convex losses (cf. (5)), [HTY25, OKK25] developed end-to-end efficient algorithms with $\approx \varepsilon^{-2}$ sample complexities.¹ In fact, [OKK25] gave a substantial generalization, showing how to reduce binary omniprediction for arbitrary pairs of $(\mathcal{C}, \mathcal{L})$ to online learning tasks against appropriate function classes.

¹The [HTY25] omnipredictor requires $\mathcal{L}$ to be well-conditioned, but outputs a mixture of proper hypotheses from $\mathcal{C}$; the [OKK25] omnipredictor is improper but holds for general bounded losses. The sample complexity of $\approx \varepsilon^{-2}$ is tight for GLMs, even for a single loss [Sha15]. See also [DHI+25], who gave a similar result in an RKHS setting.
1.1 Our results

Our approach to multiclass omniprediction is based on the framework of [OKK25]. Both [HTY25, OKK25], as well as many prior results on binary omniprediction, leverage a reduction from [GHK+23]. This reduction (Proposition 1) shows that (1) is satisfied for predictors $p$ satisfying appropriate notions of multiaccuracy (Definition 2) and calibration (Definition 3), concepts we review in Section 2.2. Intuitively, these properties guarantee that our predictor $p(x)$ passes certain statistical tests against the ground truth $p^\star(x) := \mathbb{E}[y \mid x]$, induced by the particular pair $(\mathcal{C}, \mathcal{L})$ of interest.

As in the binary case, learning multiclass predictors that satisfy multiaccuracy and calibration individually is well-studied. We discuss the former in Section 5.4, and the latter is possible in $\approx \varepsilon^{-(k+1)}$ timesteps (in the online setting) and samples (in the statistical setting), as shown by seminal work of [FV98] (see also [MS10]). However, it is less clear how to achieve both simultaneously. In the binary setting, [OKK25] leveraged an existing calibration algorithm from [ABH11] based on Blackwell approachability, and augmented it to also guarantee multiaccuracy. Their analysis used several important facts about binary losses, e.g., existence of an approximate basis for proper losses (Lemma 9), and a custom "halfspace satisfiability oracle" specialized to their application (Algorithm 3). Unfortunately, the natural extensions of these tools to $k > 2$ classes both provably fail, necessitating a stronger framework capable of handling the multiclass setting.

Simultaneous Blackwell approachability. Our starting point is to isolate a key technical primitive needed in the [OKK25] algorithm, and study sufficient conditions for it in greater generality. This is the focus of Section 3; here, we provide a brief overview of the technique.

The standard setting of Blackwell approachability (reviewed in Section 3.1) generalizes von Neumann's minimax theorem to vector-valued games. Consider a bilinear, vector-valued function $v: \mathcal{A} \times \mathcal{B} \to \mathcal{H}$, and a set $\mathcal{V}$, living in the same space $\mathcal{H}$. Unlike the scalar setting, the following are not equivalent: for all $b \in \mathcal{B}$ there exists $a \in \mathcal{A}$ such that $v(a, b) \in \mathcal{V}$ ("response satisfiability"), and there exists $a \in \mathcal{A}$ such that for all $b \in \mathcal{B}$, $v(a, b) \in \mathcal{V}$ ("satisfiability"). Blackwell approachability [Bla56] is an elegant compromise: whenever response satisfiability holds, we can choose $\{a_t\}_{t\in[T]}$ in an online manner (before the corresponding $\{b_t\}_{t\in[T]}$ is revealed), such that $\lim_{T\to\infty} \frac{1}{T}\sum_{t\in[T]} v(a_t, b_t) \to \mathcal{V}$. This strategy has intimate connections to calibration: since [Fos99], many researchers have used ideas from approachability to design calibration algorithms.

In Problem 1, we propose a simultaneous variant of Blackwell's approachability problem, where there are $m$ pairs of vector-valued functions $v^{(i)}$ and sets $\mathcal{V}^{(i)}$. The goal is to choose a sequence of $\{a_t\}_{t\in[T]}$ (responding online to $\{b_t\}_{t\in[T]}$) such that $\lim_{T\to\infty} \frac{1}{T}\sum_{t\in[T]} v^{(i)}(a_t, b_t) \to \mathcal{V}^{(i)}$, simultaneously for all $i \in [m]$. This primitive has clear connections to omniprediction, as both (binary and multiclass) multiaccuracy and calibration can be written in the language of Blackwell approachability.

Simultaneous Blackwell approachability can naturally be cast as a (standard) Blackwell approachability instance, by lifting the vectors and sets into a product space $\mathcal{H}^{(1)} \times \mathcal{H}^{(2)} \times \cdots \times \mathcal{H}^{(m)}$.
However, we find the perspective in Problem 1 useful, as the sufficient condition of response satisfiability does not lift cleanly. We show in Lemma 6 that even when $m = 2$, there are two one-dimensional subsets $\mathcal{V}^{(1)}, \mathcal{V}^{(2)}$ and corresponding vector-valued functions that are both response satisfiable (and hence approachable in isolation), but not simultaneously approachable.

Our main contribution in Section 3 is a sufficient condition for simultaneous Blackwell approachability, stated in the form of an oracle requirement (Definition 5). Our oracle is natural in the context of [Bla56], who gave an alternate characterization of approachability: every halfspace containing $\mathcal{U}$ should be satisfiable. This statement was later made algorithmic by [ABH11] using online learning techniques. Our sufficient condition can then be cleanly stated as: for any halfspaces each containing one $\mathcal{U}^{(i)}$, and any specified convex combination $w \in \Delta_m$, the halfspaces should be satisfiable on average (with respect to $w$). We further build upon [ABH11] to leverage existence of such an oracle to solve simultaneous Blackwell approachability with an explicit rate (Theorem 2). While our reduction is a relatively straightforward extension of [ABH11] (and indeed, [OKK25] also implicitly gave a variant of Theorem 2), we believe that explicitly isolating this sufficient condition will prove useful to the community. To ease applications, we show that Theorem 2 holds in much greater generality, including in statistical and contextual settings. We provide a high-probability guarantee capable of flexibly handling these extensions in Corollary 1.

Multiclass omniprediction. Our simultaneous Blackwell approachability framework reduces omniprediction to implementing an appropriate mixture linear optimization oracle (MLOO, Definition 5), and to designing appropriate online learners for each of two sets $\mathcal{V}^{(i)}$ in isolation (corresponding to calibrated and multiaccurate predictors). While an appropriate MLOO was explicitly given in [OKK25] for the binary omniprediction setting, it is unclear how to generalize their strategy (based on an algorithmic Sperner's lemma) to hold in higher dimensions. Towards leveraging our results for omniprediction, in Section 5.1, we give a meta-result for designing MLOOs when all $v^{(i)}$ share a common structure. Roughly speaking, we require each $v^{(i)}$ to take as input a prediction $p$ and a label $y$, and to be linear in the prediction error $p - y$ (a more formal statement is in (37)). Under these assumptions, we show how to use the minimax theorem and linear programming to generically design MLOOs compatible with our simultaneous Blackwell approachability framework, which extends to future potential applications.

By combining our framework, our new MLOO construction, and known online learners, we obtain our multiclass omnipredictors in Section 5. The following is a representative result.

Theorem 1 (Informal, see Theorem 5). Let $\mathcal{L}$ be the family of multiclass GLM losses (5) and let $\mathcal{C}$ be the family of bounded $k \times d$ linear classifiers (47). Then given $T$ i.i.d. samples $(x, y) \sim \mathcal{D}$ for

$T = k \cdot \Omega\left(\frac{1}{\varepsilon}\right)^{k+1}$, (2)

we return an $\varepsilon$-omnipredictor in time $O(dkT) + O(\frac{1}{\varepsilon})^{2k} \cdot \mathrm{poly}(k, \log\frac{1}{\varepsilon})$, with high probability.
We pause to make some remarks about Theorem 5, which is specialized to the benchmark class of multiclass GLM losses, an expressive family that includes all proper losses after reparameterization (cf. Lemma 4), including popular choices in practice such as the squared and cross entropy losses. First, it is fully explicit and does not rely on any computationally-infeasible oracles. Second, although its sample complexity scales exponentially in the number of classes $k$, this growth is relatively mild for small constant $k$, and the bound is independent of the ambient dimension $d$ of the features $x$. Third, because our approach is based on the indistinguishability argument of [GHK+23] (i.e., it goes through calibration and multiaccuracy), the exponential dependence on $k$ is inevitable due to a lower bound from Theorem 1.12, [HV25]. Indeed, our bound (2) recovers the same dependence on $k$ as existing algorithms for the simpler task of multiclass calibration,² and improves by a quartic factor over the prior work [LRS25] (while also handling infinite $\mathcal{C}$). We prove our formal variant of Theorem 1 in Section 5, as well as extensions to online omniprediction, and general families of multiclass losses and comparators (Theorem 6).

Other consequences. As a warm-up to our multiclass results, in Section 4, we rederive the main results of [OKK25] in the binary setting by way of our new formalism. We believe this may be useful to the community, as it cleanly separates out the requirements of each online learner. For example, Theorem 3, our specialization of Theorem 1 to binary omniprediction, uses $\approx \frac{1}{\varepsilon^2}$ samples and gives an end-to-end construction of an omnipredictor for binary GLMs. This removes the well-conditioning requirement from [HTY25], and does not rely on computationally-infeasible halfspace optimization oracles (i.e., ERM for linear thresholds) as required by [OKK25].³ During the preparation of this manuscript, the third arXiv version of [OKK25] independently noted this oracle requirement is removable (see their updated Theorem 7.1); our modular framework makes this point transparent, which we believe will prove useful in similar future applications.

Our framework has additional implications for the theory of calibration and omniprediction. For example, in Section 6, we show that our construction directly extends to omnipredicting against unions of comparators, i.e., the best comparator in any of $m$ families $\{\mathcal{C}^{(i)}\}_{i\in[m]}$, as long as we can omnipredict against each family individually. This simple extension was previously unknown, and is made possible by the generality of our construction in Section 5.1. We are optimistic that our pipeline for constructing omnipredictors will have future consequences for related problems.

1.2 Related work

Multiclass calibration. Multiclass calibration has seen a resurgence of interest recently due to its use in evaluating modern classifiers in machine learning [GPSW17]. A range of works have proposed new algorithms and relaxations of this problem [KF15, KPK+19, ZKS+21, GHR24]. Of particular note, recent works [Pen25, FGMS25] gave algorithms with horizons $\approx k^{\mathrm{poly}(\varepsilon^{-1})}$ for $\varepsilon$-multiclass calibration, which is polynomial in $k$ for constant $\varepsilon$. This improves upon the classical $\varepsilon^{-(k+1)}$ rate for multiclass calibration [FV98] (as in Theorem 1) in some parameter regimes.
However, these results are not obtained through Blackwell approachability, and thus it seems difficult to incorporate a multiaccuracy component directly, as would be required for omniprediction using the indistinguishability framework of [GHK+23]. Further, the $\exp(\Omega(k))$ lower bound in Theorem 1.12, [HV25] effectively rules out the use of these results within the framework.

Finally, as discussed earlier, [NRRX25, LRS25] are the primary works that have studied multiclass omniprediction; we compared our Theorem 1 to [LRS25] earlier. Regarding [NRRX25], their omniprediction result only applies to a restricted family of multiclass GLM losses, namely those which decompose coordinatewise in their argument (see Definition 6.12). This makes it incompatible with several common GLM losses in the (standard) setting of Theorem 1. For example, consider the cross entropy loss $\ell(p, y) = -\mathbb{E}_{i\sim y}[\log p_i]$ popularly used in machine learning evaluations. This GLM loss falls into the framework of [LRS25], but once it is cast as a GLM learning problem, the correct parameterization is in the unlinked space (where linear comparator predictors $x \to Cx$ live), for which the corresponding loss is $\ell(t, y) = \log(\sum_{i\in[d]} \exp(t_i)) - \langle t, y\rangle$, which is not coordinatewise separable. For more discussion on this point, see Lemma 4 and Section 2.2, [HTY25].

²Recent works by [Pen25, FGMS25] on multiclass omniprediction have traded off the exponential dependence on $k$ for an exponential dependence on $\frac{1}{\varepsilon}$. We discuss these works in greater detail in Section 1.2.

³This requirement is stated in Theorem 5, [OKK25], where ERM access for the composition of thresholding with the comparator family is assumed. Even when the comparator family is linear functions, this requires implementing a halfspace ERM oracle, a well-known NP-hard problem in computational learning theory [JP78, BDEL03].

Multiclass learning. Multiclass learning is a well-studied topic in learning theory in general. For example, classical works by [Nat89, BDCBL92] proposed various statistical dimensions characterizing the sample complexity of multiclass PAC learning. More recently, [DSBDSS15, BCD+22] showed that in the multiclass setting, empirical risk minimization does not provide a uniform bound on sample complexity, and proved that a quantity known as the DS dimension tightly characterizes multiclass PAC learnability. A follow-up work by [CP23] shows that the $k$-DS dimension, a generalization of the DS dimension, characterizes $k$-list learnability. Most of these works primarily consider the sample complexity of multiclass learning rather than end-to-end efficient algorithms.

Blackwell approachability. Various works have generalized Blackwell approachability and applied it to problems with a similar spirit to our simultaneous approachability setting in Problem 1. However, to our knowledge, none of them directly study achievability and algorithms for approaching multiple sets. The works most closely related to our setup include [MPS14], who study Blackwell approachability for unknown games, i.e., where the structure of the game or target is unknown; [FKM+21], who study Blackwell approachability with additional constraints for the payoff space; and [LNPR22], who study multiclass calibration and calibeating.
2 Preliminaries

In Section 2.1, we define notation used throughout the paper, and state helper results from the online learning literature. We then introduce preliminaries for omniprediction in Section 2.2.

2.1 Notation

We denote vectors in lowercase boldface and matrices in uppercase boldface. For $n \in \mathbb{N}$ we let $[n] := \{i \in \mathbb{N} \mid i \le n\}$. We let $\mathbf{1}_d$ and $\mathbf{0}_d$ denote the all-ones and all-zeroes vectors in $\mathbb{R}^d$. For $i \in [d]$, we let $e_i \in \mathbb{R}^d$ denote the $i$th standard basis vector when the dimension $d$ is clear from context. When $E$ is some event, we let $\mathbb{I}_E$ denote the 0-1 indicator variable of the event.

When $\|\cdot\|$ is a norm on $\mathbb{R}^d$, we let $\|\cdot\|_*$ denote its dual norm. For $p \in \mathbb{R}_{\ge 1} \cup \{\infty\}$ we let $\|\cdot\|_p$ denote the $\ell_p$ norm of a vector argument. Note that when $\|\cdot\| = \|\cdot\|_p$ for some $p \in \mathbb{R}_{\ge 1} \cup \{\infty\}$, then $\|\cdot\|_* = \|\cdot\|_q$ for the value of $q \in \mathbb{R}_{\ge 1} \cup \{\infty\}$ satisfying $\frac{1}{p} + \frac{1}{q} = 1$. We say that a vector-valued function $v: \mathbb{R}^d \to \mathbb{R}^k$ is $\beta$-Lipschitz in $\|\cdot\|$ if for all $x, y \in \mathbb{R}^d$, $\|v(x) - v(y)\|_* \le \beta\|x - y\|$. For $\bar{x} \in \mathbb{R}^d$ and $r > 0$ we define $\mathbb{B}^d_p(\bar{x}, r) := \{x \in \mathbb{R}^d \mid \|x - \bar{x}\|_p \le r\}$ to be an $\ell_p$ ball centered at $\bar{x}$. When $\bar{x}$ is omitted, $\bar{x} = \mathbf{0}_d$ by default. For a compact set $\mathcal{K} \subseteq \mathbb{R}^d$ we let $\Pi_{\mathcal{K}}(v) := \arg\min_{x\in\mathcal{K}} \|v - x\|_2$ denote the Euclidean projection. For $p, q \ge 1$ and $M \in \mathbb{R}^{n\times d}$, we denote $\|M\|_{p\to q} := \max_{v\in\mathbb{B}^d_p(1)} \|Mv\|_q$.

We let $\Delta_k := \{v \in \mathbb{R}^k_{\ge 0} \mid \|v\|_1 = 1\}$ denote the probability simplex in dimension $k$. When $S$ is a set, we overload notation and let $v \in \mathbb{R}^S$ be a vector with coordinates indexed by elements $s \in S$, and we similarly define $\mathbf{1}_S$, $\mathbf{0}_S$, $\Delta_S$, etc. For a convex set $\mathcal{K} \subseteq \mathbb{R}^d$ we use $\partial\mathcal{K}$ to denote the boundary of $\mathcal{K}$, i.e., all $v \in \mathcal{K}$ that cannot be written as a convex combination of other points in $\mathcal{K}$. For example, $\partial\Delta_k$ is the set of standard basis vectors $\{e_i\}_{i\in[k]}$. For a distribution $P$ supported on some set $\Omega$ we write $\omega \sim P$ to mean a sample from the distribution, and when $p \in \Delta_k$ we overload notation and let $i \sim p$ (resp. $y \sim p$) mean a sample that takes on the value $i$ (resp. $y = e_i$) with probability $p_i$.

We say that $\mathcal{N}$ is an $\varepsilon$-net in $\|\cdot\|$ for $\mathcal{K}$ if for all $x \in \mathcal{K}$, there exists $x' \in \mathcal{N}$ such that $\|x - x'\| \le \varepsilon$. When $\|\cdot\|$ is omitted, we let $\|\cdot\| = \|\cdot\|_1$ by default. The following construction is standard.⁴

Fact 1. For all $k \in \mathbb{N}$ and $\varepsilon \in (0, 1)$, there exists $\mathcal{N}$, an $\varepsilon$-net of $\Delta_k$, satisfying $|\mathcal{N}| \le (\frac{5}{\varepsilon})^{k-1}$.

We say that $H \subseteq \mathbb{R}^d$ is a halfspace if for some $(v, c) \in \mathbb{R}^d \times \mathbb{R}$, $H = \{x \mid v \cdot x \le c\}$. For a sequence $\{v_t\}_{t\in\mathbb{N}} \subseteq \mathbb{R}^d$ and a set $\mathcal{V} \subseteq \mathbb{R}^d$, we write $\lim_{t\to\infty} v_t \to \mathcal{V}$ to mean that for any $\varepsilon > 0$, there is some $T(\varepsilon)$ such that for all $t \ge T(\varepsilon)$, $v_t$ is within $\varepsilon$ of the set $\mathcal{V}$ in Euclidean distance.⁵ We often refer to sequences of vectors, e.g., $\{x_t\}_{t\in[T]}$, by indexing the set of indices, e.g., as $x_{[T]}$. We use $f \circ g$ to denote the composition of two functions $f$ and $g$.

To instantiate our framework, we require various online learning algorithms from the literature. We begin by stating a general fact about the generalization of stochastic mirror descent methods.⁶

Lemma 1 (Lemma 9, [HTY25]). Let $T \in \mathbb{N}$, $\eta > 0$, $\delta \in (0, 1)$, let $\mathcal{X} \subseteq \mathbb{R}^d$ have diameter $\le R$ in $\|\cdot\|$. Let $r: \mathcal{X} \to \mathbb{R}$ be 1-strongly convex in a norm $\|\cdot\|$ such that $\max_{x\in\mathcal{X}} r(x) - \min_{x\in\mathcal{X}} r(x) \le \Theta$, and let $x_1 := \arg\min_{x\in\mathcal{X}} r(x)$. For a sequence of deterministic vectors $\{g_t\}_{t\in[T]}$ such that $g_t$ can depend on all randomness used in iterations $t \in [T]$, let $x_{t+1} \gets \arg\min_{x\in\mathcal{X}} \{\langle \eta\tilde{g}_t - \nabla r(x_t), x\rangle + r(x)\}$, where $\mathbb{E}[\tilde{g}_t \mid \tilde{g}_1, \ldots, \tilde{g}_{t-1}] = g_t$, for all $t \in [T]$. Further suppose $\|\tilde{g}_t\|_* \le L$ deterministically. Then for some choice of $\eta$, with probability $\ge 1 - \delta$,

$\sup_{x\in\mathcal{X}} \sum_{t\in[T]} \langle g_t, x_t - x\rangle \le 4L\sqrt{T\Theta} + 16LR\sqrt{T\log(\frac{2}{\delta})}$.

We will apply two specializations of Lemma 1, where $r$ is either a Euclidean regularizer (projected gradient descent), or negative entropy over the probability simplex (multiplicative weights).
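As a concrete illustration of the mirror descent update in Lemma 1, here is a short Python sketch (ours, not taken from [HTY25] or [Bub15]) of the step for these two regularizers, together with a toy run whose regret behaves like the deterministic bound of Lemma 2 below.

```python
import numpy as np

def pgd_step(x, g, eta, radius=1.0):
    """Mirror descent with r(x) = 0.5 * ||x||_2^2 over an l2 ball:
    gradient step followed by Euclidean projection."""
    y = x - eta * g
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def mw_step(w, g, eta):
    """Mirror descent with r(w) = sum_i w_i log w_i over the simplex:
    entrywise exponential reweighting, then renormalization."""
    w = w * np.exp(-eta * g)
    return w / w.sum()

# Toy run: T rounds of linear losses <g_t, .>; the gap to the best fixed point
# should grow roughly like sqrt(T log d), in line with Lemma 2's L * sqrt(2 T Theta).
rng = np.random.default_rng(0)
T, d = 1000, 5
w = np.full(d, 1.0 / d)
eta = np.sqrt(2 * np.log(d) / T)      # the usual multiplicative weights tuning
losses = rng.uniform(-1, 1, size=(T, d))
alg_loss = 0.0
for g in losses:
    alg_loss += w @ g
    w = mw_step(w, g, eta)
best_fixed = losses.sum(axis=0).min()
print(alg_loss - best_fixed)          # regret of the multiplicative weights learner
```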
Finally, for deterministic applications of Lemma 1, we state the following sharper bound.

⁴This result is stated in [Ver18], Corollary 4.2.11 for the $\ell_2$ ball, but the same construction works for the $\ell_1$ ball in dimension $k - 1$. Projection onto the subspace $\mathbf{1}_k^\top v = 1$ at most doubles the $\ell_1$ distance, so we adjusted the constant.

⁵In $\mathbb{R}^d$, all norms are equivalent up to universal constants, so using Euclidean distance is without loss of generality.

⁶Lemma 9 of [HTY25] only claims this result for an $\ell_2$ setup, but the same argument extends to all stochastic mirror descent setups, as the result simply bounds the random error term via martingale concentration.

Lemma 2 (Theorem 4.2, [Bub15]). In the setting of Lemma 1, if $\tilde{g}_t = g_t$ in every iteration,

$\sup_{x\in\mathcal{X}} \sum_{t\in[T]} \langle g_t, x_t - x\rangle \le L\sqrt{2T\Theta}$.

2.2 Omniprediction

We consider two supervised learning problem settings in $k$-class prediction. In the following discussion, let $\ell: \Omega \times \partial\Delta_k \to \mathbb{R}$ be a loss function that evaluates predictions (in a set $\Omega$) and labels.

Online setting. There is a sequence of examples $\{(x_t, y_t)\}_{t\in[T]}$ presented to us in an online fashion. Our goal is to predict $\{p_t \in \Omega\}_{t\in[T]}$, where $p_t$ can only depend on previous examples $\{(x_s, y_s)\}_{s < t}$.

The main result of [Bla56] (see also a more modern exposition by [ABH11]) is that a different equivalence holds.

Proposition 2 ([Bla56]). In the setting of Definition 4, the following three statements are equivalent.
• $\mathcal{V}$ is response-satisfiable.
• $\mathcal{V}$ is halfspace-satisfiable.
• $\mathcal{V}$ is approachable, i.e., for any sequence $\{b_t\}_{t\in\mathbb{N}}$, there is a choice of $\{a_t\}_{t\in\mathbb{N}}$ such that $a_t$ depends only on $b_{[t-1]}$, and such that $\lim_{t\to\infty} \frac{1}{t}\sum_{s\in[t]} v(a_s, b_s) \to \mathcal{V}$.

A quantitative, algorithmic variant of Proposition 2 was given by [ABH11]. To explain its relevance to Section 3.2, we specialize our attention to sets $\mathcal{V}$ induced via a convex set of distinguishers $\mathcal{U} \subseteq \mathbb{R}^d$, and a scalar $\rho \in \mathbb{R}$. Specifically, we are interested in sets of the form

$\mathcal{V} := \{v \in \mathbb{R}^d \mid \sup_{u\in\mathcal{U}} \langle u, v\rangle \le \rho\}$. (8)

The function $\sup_{u\in\mathcal{U}} \langle u, v\rangle$ is called the support function of $\mathcal{U}$ in convex analysis. Intuitively, we can view each member $u \in \mathcal{U}$ as a distinguisher that tests whether a given $v \in \mathbb{R}^d$ satisfies $\langle u, v\rangle \le \rho$. If $v$ passes all tests given by $\mathcal{U}$, then we can certify $v \in \mathcal{V}$ as in (8).

Focusing on (8), i.e., sublevel sets of support functions, may seem restrictive. We first mention that this specialization captures an important family of sets.

Lemma 5. If $\mathcal{V} \subseteq \mathbb{R}^d$ is closed and convex, $\mathbf{0}_d \in \mathcal{V}$, and $\rho > 0$, there exists $\mathcal{U}$ so that (8) holds.

Proof. It suffices to take $\mathcal{U} = \rho\mathcal{V}^\circ$, where $\mathcal{V}^\circ := \{u \in \mathbb{R}^d \mid \sup_{v\in\mathcal{V}} \langle u, v\rangle \le 1\}$ is the polar set.
That $\mathcal{V}^\circ$ satisfies (8) with $\rho = 1$ is by the well-known fact $\mathcal{V}^{\circ\circ} = \mathcal{V}$ (see e.g., Theorem 14.5, [Roc70a]).

In fact, specializing to (8) is the first step in the reduction of [ABH11] (see their Proposition 2). The key observation of [ABH11] is that if $\mathcal{V}$ is a convex cone (a set closed under nonnegative linear combinations), then it satisfies (8) for $\rho = 0$ and $\mathcal{U}$ taken to be the dual cone to $\mathcal{V}$. Moreover, any convex set $\mathcal{V} \subseteq \mathbb{R}^d$ can be lifted to a convex cone in $\mathbb{R}^{d+1}$, by defining the set

$\mathrm{clift}(\mathcal{V}) := \{(v, c) \in \mathbb{R}^d \times \mathbb{R}_{>0} \mid \frac{v}{c} \in \mathcal{V}\} \cup \{\mathbf{0}_{d+1}\}$.

Intuitively, the $c = 1$ "slice" of $\mathrm{clift}(\mathcal{V})$ projects to $\mathcal{V}$ in the first $d$ dimensions, and the slice at an arbitrary $c \ge 0$ projects to $c\mathcal{V}$. By converting between the distance of a point in $\mathbb{R}^{d+1}$ to $\mathrm{clift}(\mathcal{V})$, and the distance of its projection in $\mathbb{R}^d$ to $\mathcal{V}$ (paying an overhead of $\approx \mathrm{diam}(\mathcal{V})$), [ABH11] show that approaching sets (8) suffices to derive a general algorithmic variant of Proposition 2.

3.2 Framework

We next provide a generalization of Algorithm 2, [ABH11] that simultaneously approaches a collection of $m$ convex sets of the form (8). In fact, under our oracle abstraction (to be introduced in Definition 5), the convex sets we approach need not live in a finite-dimensional space, and can come from an arbitrary Hilbert space. We formalize our problem setting here.

Problem 1 (Simultaneous Blackwell approachability). Let $a, b, m \in \mathbb{N}$, let $\rho, \varepsilon \ge 0$, and let $\mathcal{A} \subseteq \mathbb{R}^a$ and $\mathcal{B} \subseteq \mathbb{R}^b$ be compact and convex. For all $i \in [m]$, let $\mathcal{V}^{(i)}, \mathcal{U}^{(i)} \subseteq \mathcal{H}^{(i)}$ satisfy

$\mathcal{V}^{(i)} := \{v \in \mathcal{H}^{(i)} \mid \sup_{u\in\mathcal{U}^{(i)}} \langle u, v\rangle \le \rho\}$, (9)

where $\mathcal{H}^{(i)}$ is a Hilbert space. Let $v^{(i)}: \mathcal{A}\times\mathcal{B} \to \mathcal{H}^{(i)}$ be a bilinear, vector-valued function for all $i \in [m]$. Our goal is to observe a sequence $b_{[T]} \in \mathcal{B}^T$, and to choose $a_{[T]} \in \mathcal{A}^T$ so that $a_t$ depends on $b_{[t-1]}$ for all $t \in [T]$, and

$\max_{i\in[m]} \sup_{u^{(i)}\in\mathcal{U}^{(i)}} \langle u^{(i)}, \frac{1}{T}\sum_{t\in[T]} v^{(i)}(a_t, b_t)\rangle \le \rho + \varepsilon$. (10)

When $m = 1$ and $\varepsilon \to 0$, the bound (10) implies that $\frac{1}{T}\sum_{t\in[T]} v^{(1)}(a_t, b_t)$ approaches the set $\mathcal{V}^{(1)}$, because it passes all tests induced by $\mathcal{U}^{(1)}$. Equivalences between the regret bound (10) and distance to $\mathcal{V}^{(1)}$ are standard, e.g., Lemma 3 of [ABH11]. In Problem 1, we pose a generalization allowing for $m$ approachability instances, each with their own associated set $\mathcal{V}^{(i)}$, distinguishers $\mathcal{U}^{(i)}$, and function $v^{(i)}$. The goal (10) then asks to approach all $m$ sets simultaneously.

To achieve simultaneous approachability, we assume existence of the following type of oracle.

Definition 5 (Mixture linear optimization oracle). In the setting of Problem 1, we call $\mathcal{O}$ an $\varepsilon$-mixture linear optimization oracle (MLOO) if on inputs $w \in \Delta_m$, and $\{u^{(i)}\}_{i\in[m]} \in \prod_{i\in[m]} \mathcal{U}^{(i)}$, the oracle outputs $a \in \mathcal{A}$ satisfying

$\sum_{i\in[m]} w_i \langle u^{(i)}, v^{(i)}(a, b)\rangle \le \rho + \varepsilon$ for all $b \in \mathcal{B}$. (11)

We briefly interpret Definition 5 in the finite-dimensional setting. When the input $w$ is a point mass $e_i$, the oracle definition is equivalent to finding a distribution $P$ such that

$\langle u^{(i)}, v^{(i)}(a, b)\rangle \le \rho + \varepsilon$ for all $b \in \mathcal{B}$. (12)

Fortunately, we know (12) is achievable (even when $\varepsilon = 0$) whenever $\mathcal{V}^{(i)}$ is approachable, because for any $u^{(i)} \in \mathcal{U}^{(i)}$, the set $\{v \mid \langle u^{(i)}, v\rangle \le \rho\}$ is a halfspace containing $\mathcal{V}^{(i)}$. Proposition 2 shows $\mathcal{V}^{(i)}$ is approachable iff it is halfspace-satisfiable, a.k.a.
there exists $a$ achieving (12). We next observe that in general, achievability of Definition 5 (and our goal (10)) are strictly stronger requirements than each $\mathcal{V}^{(i)}$ being individually approachable.

Lemma 6. There exists a simultaneous Blackwell approachability instance (Problem 1) where each individual set $\mathcal{V}^{(i)}$ is approachable, but an MLOO does not exist, and the simultaneous approachability bound (10) is impossible to achieve, for sufficiently small $\varepsilon > 0$.

Proof. We take $m = 2$, $\rho = 0$, $\mathcal{U}^{(1)} = \mathcal{U}^{(2)} = \{1\}$ so $\mathcal{V}^{(1)} = \mathcal{V}^{(2)} = \{c \in \mathbb{R} \mid c \le 0\}$, $\mathcal{A} := \Delta_2$, and $v^{(1)}(a, b) := a_1$, $v^{(2)}(a, b) := a_2$. Clearly these are bilinear functions, and in fact independent of $b \in \mathcal{B}$. Moreover, $\mathcal{V}^{(1)}$ and $\mathcal{V}^{(2)}$ are both approachable, the former by repeatedly playing $a_t \gets e_2$, and the latter by repeatedly playing $a_t \gets e_1$. However, the mixture oracle fails for any $\varepsilon < \frac{1}{2}$ by taking $w := \frac{1}{2}(e_1 + e_2)$, because for any $a \in \mathcal{A}$ (and any $b \in \mathcal{B}$), (11) would yield the false statement $\frac{1}{2} = \sum_{i\in[2]} \frac{1}{2} \cdot 1 \cdot a_i \le \varepsilon$.

For this same instance, regarding the simultaneous approachability bound (10), no matter what choices of $\{a_t, b_t\}_{t\in[T]}$ were played, the scalars $\frac{1}{T}\sum_{t\in[T]} v^{(1)}(a_t, b_t)$ and $\frac{1}{T}\sum_{t\in[T]} v^{(2)}(a_t, b_t)$ must sum to 1. Thus, one of the inequalities (10), for $i \in [2]$, must be violated for $\varepsilon < \frac{1}{2}$.

To ease applications, we give a unified recipe for constructing MLOOs in the case of supervised multiclass prediction tasks in Section 5.1, a setting where we show MLOOs always exist. We are now ready to present the main result of this section. Our result reduces solving Problem 1 to the implementation of an MLOO (Definition 5) and online learners for each $\mathcal{U}^{(i)}$.

Theorem 2 (Simultaneous Blackwell approachability). In the setting of Problem 1, assume we have access to $\mathcal{O}$, an $\varepsilon$-MLOO. Further, for all $i \in [m]$ and $T \in \mathbb{N}$, assume there is an online learner $\mathsf{alg}^{(i)}$ that takes inputs $(a_{[T]}, b_{[T]}) \in \mathcal{A}^T \times \mathcal{B}^T$, and outputs $u^{(i)}_{[T]}$ such that $u^{(i)}_t \in \mathcal{U}^{(i)}$ depends only on $a_{[t-1]}$ and $b_{[t-1]}$, and

$\sup_{u^{(i)}\in\mathcal{U}^{(i)}} \sum_{t\in[T]} \langle v^{(i)}(a_t, b_t), u^{(i)} - u^{(i)}_t\rangle \le \mathrm{reg}^{(i)}(T)$, (13)

for some $\mathrm{reg}^{(i)}: \mathbb{N} \to \mathbb{R}_{\ge 0}$. Finally, assume

$|\langle v^{(i)}(a, b), u^{(i)}\rangle| \le L$ (14)

for all $i \in [m]$, $(a, b) \in \mathcal{A}\times\mathcal{B}$, and $u^{(i)} \in \mathcal{U}^{(i)}$. Then, for any $b_{[T]} \in \mathcal{B}^T$, Algorithm 1 produces $a_{[T]} \in \mathcal{A}^T$ such that $a_t$ depends only on $b_{[t-1]}$, and

$\sup_{u^{(i)}\in\mathcal{U}^{(i)}} \langle u^{(i)}, \frac{1}{T}\sum_{t\in[T]} v^{(i)}(a_t, b_t)\rangle \le \rho + \varepsilon + \frac{\mathrm{reg}^{(i)}(T) + L\sqrt{2T\log(m)}}{T}$ for all $i \in [m]$.

Proof. The algorithm is presented in Algorithm 1, and we follow the notation therein throughout. By observation, the algorithm computes $a_t$ on Line 13 before observing $b_t$. We first show

$\max_{i\in[m]} \frac{1}{T}\sum_{t\in[T]} \langle u^{(i)}_t, v^{(i)}_t\rangle = \sup_{w\in\Delta_m} \frac{1}{T}\sum_{t\in[T]} \langle w, g_t\rangle \le \rho + \varepsilon + \frac{L\sqrt{2T\log(m)}}{T}$. (15)

Indeed, this follows from

$\sup_{w\in\Delta_m} \frac{1}{T}\sum_{t\in[T]} \langle w, g_t\rangle = \frac{1}{T}\sum_{t\in[T]} \langle w_t, g_t\rangle + \sup_{w\in\Delta_m} \frac{1}{T}\sum_{t\in[T]} \langle w - w_t, g_t\rangle \le \rho + \varepsilon + \frac{L\sqrt{2T\log(m)}}{T}$. (16)

Here, we bounded the first term on the right-hand side of the first line by using the oracle guarantee (11), which holds for any $b_t$ used to define each $g_t$. Moreover, to bound the second term, we observe that Lines 11 and 12 are implementing multiplicative weight updates, i.e., Lemma 2 with $r(w) = \sum_{i\in[m]} w_i \log w_i$, using feedback vectors $g_t$ satisfying $\|g_t\|_\infty \le L$ by assumption (14).

Next we complete the proof: given (15) and the regret bound (13), for all $i \in [m]$,

$\sup_{u^{(i)}\in\mathcal{U}^{(i)}} \langle u^{(i)}, \frac{1}{T}\sum_{t\in[T]} v^{(i)}_t\rangle \le \frac{1}{T}\sum_{t\in[T]} \langle u^{(i)}_t, v^{(i)}_t\rangle + \sup_{u^{(i)}\in\mathcal{U}^{(i)}} \frac{1}{T}\sum_{t\in[T]} \langle u^{(i)} - u^{(i)}_t, v^{(i)}_t\rangle \le \rho + \varepsilon + \frac{\mathrm{reg}^{(i)}(T) + L\sqrt{2T\log(m)}}{T}$.

Algorithm 1: SimultaneousApproach($b_{[T]}$, $\{\mathsf{alg}^{(i)}\}_{i\in[m]}$, $\mathcal{O}$)
1 Input: online sequence $b_{[T]} \in \mathcal{B}^T$, online learners $\{\mathsf{alg}^{(i)}\}_{i\in[m]}$ satisfying (13), $\varepsilon$-MLOO $\mathcal{O}$ (following notation in Problem 1, Definition 5)
2 Output: $a_{[T]} \in \mathcal{A}^T$ such that each $a_t$ is output before observing $b_t$
3 $w_1 \gets \frac{1}{m}\mathbf{1}_m$
4 $u^{(i)}_1 \gets \mathsf{alg}^{(i)}(\{\})$ for all $i \in [m]$ // initialize each $u^{(i)}_1$ as $\mathsf{alg}^{(i)}$ does before observing any examples
5 $a_1 \gets \mathcal{O}(w_1, \{u^{(i)}_1\}_{i\in[m]})$
6 $\eta \gets \frac{1}{L}\sqrt{2\log(m)} \cdot T^{-1/2}$
7 for $2 \le t \le T$ do
8   $v^{(i)}_{t-1} \gets v^{(i)}(a_{t-1}, b_{t-1})$ for all $i \in [m]$
9   $g_{t-1} \gets$ vector in $\mathbb{R}^m$ such that $[g_{t-1}]_i = \langle u^{(i)}_{t-1}, v^{(i)}_{t-1}\rangle$ for all $i \in [m]$
10  $u^{(i)}_t \gets \mathsf{alg}^{(i)}(v^{(i)}_{[t-1]})$ for all $i \in [m]$
11  $w_t \gets w_{t-1} \circ \exp(\eta g_{t-1})$ // $\circ$ denotes entrywise multiplication and $\exp$ is applied entrywise
12  $w_t \gets w_t \|w_t\|_1^{-1}$
13  $a_t \gets \mathcal{O}(w_t, \{u^{(i)}_t\}_{i\in[m]})$
14 end
15 Return: $a_{[T]}$
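For readers who prefer code, the following Python sketch paraphrases the loop of Algorithm 1 (our rendering, with the MLOO, the per-set learners, the payoff maps, and the adversary passed in as callables; it assumes each $\mathcal{H}^{(i)}$ is finite-dimensional so inner products are plain dot products, and is not meant as a reference implementation).

```python
import numpy as np

def simultaneous_approach(T, m, L, mloo, learners, payoffs, env):
    """Sketch of Algorithm 1.

    mloo(w, us)        -> action a meeting the epsilon-MLOO guarantee (11)
    learners[i](hist)  -> next distinguisher u_t^{(i)} from past payoff vectors
    payoffs[i](a, b)   -> the vector v^{(i)}(a, b)
    env(t, a)          -> the adversary's b_t, revealed only after a_t is chosen
    """
    eta = np.sqrt(2 * np.log(m)) / (L * np.sqrt(T))   # line 6
    w = np.full(m, 1.0 / m)                           # line 3: uniform mixture weights
    hists = [[] for _ in range(m)]                    # past v^{(i)}_s for each learner
    actions = []
    for t in range(T):
        us = [learners[i](hists[i]) for i in range(m)]        # lines 4 / 10
        a = mloo(w, us)                                       # lines 5 / 13
        b = env(t, a)                                         # b_t observed after a_t
        vs = [payoffs[i](a, b) for i in range(m)]             # line 8
        g = np.array([np.dot(us[i], vs[i]) for i in range(m)])  # line 9
        w = w * np.exp(eta * g)                               # line 11: multiplicative weights
        w = w / w.sum()                                       # line 12
        for i in range(m):
            hists[i].append(vs[i])
        actions.append(a)
    return actions
```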
Theorem 2 in fact holds in a much more general setting than captured by simultaneous Blackwell approachability (Problem 1). Indeed, both the statement of Theorem 2 and the MLOO definition do not explicitly specify sets $\mathcal{V}^{(i)}$ distinguished by the corresponding $\mathcal{U}^{(i)}$ in the sense of (9). For our applications to online omniprediction, we require a more general version of Theorem 2, where each $\mathcal{U}^{(i)}$ consists of functions $u^{(i)}: \mathcal{X} \to \mathcal{H}^{(i)}$, taking as input a context $x$ from some domain $\mathcal{X}$. For these uses, we generalize Problem 1 and Definition 5, and give an analog of Theorem 2.

Problem 2 (Contextual Blackwell approachability). Let $a, b, m \in \mathbb{N}$, let $\varepsilon \ge 0$, let $\mathcal{A}$ and $\mathcal{B}$ be subsets of vector spaces, and let $\mathcal{X}$ be an abstract domain of contexts. For all $i \in [m]$, let $\mathcal{U}^{(i)}$ consist of functions $u^{(i)}: \mathcal{X} \to \mathcal{H}^{(i)}$, where $\mathcal{H}^{(i)}$ is a Hilbert space. Let $v^{(i)}: \mathcal{A}\times\mathcal{B} \to \mathcal{H}^{(i)}$ be a bilinear, vector-valued function for all $i \in [m]$. Our goal is to observe sequences $b_{[T]} \in \mathcal{B}^T$ and $x_{[T]} \in \mathcal{X}^T$, and to choose $a_{[T]}$ so that $a_t$ depends on $b_{[t-1]}$ and $x_{[t]}$ for all $t \in [T]$, and

$\max_{i\in[m]} \sup_{u^{(i)}\in\mathcal{U}^{(i)}} \frac{1}{T}\sum_{t\in[T]} \langle u^{(i)}(x_t), v^{(i)}(a_t, b_t)\rangle \le \varepsilon$.

When each $\mathcal{U}^{(i)}$ consists only of constant functions, i.e., each $u^{(i)} \in \mathcal{U}^{(i)}$ can only take on one value, we abuse notation and let $\mathcal{U}^{(i)} \subseteq \mathcal{H}^{(i)}$ represent a set of $u^{(i)} \in \mathcal{H}^{(i)}$.

Definition 6 (Contextual mixture linear optimization oracle). In the setting of Problem 2, we call $\mathcal{O}$ an $\varepsilon$-contextual mixture linear optimization oracle (CMLOO) if on inputs $w \in \Delta_m$, $x \in \mathcal{X}$, and $\{u^{(i)}\}_{i\in[m]} \in \prod_{i\in[m]} \mathcal{U}^{(i)}$, the oracle outputs $a \in \mathcal{A}$ satisfying

$\sum_{i\in[m]} w_i \langle u^{(i)}(x), v^{(i)}(a, b)\rangle \le \varepsilon$ for all $b \in \mathcal{B}$.
(17) Algorithm 2: ContextualSimultaneousApp roach ( b [ T ] , x [ T ] , { alg ( i ) } i ∈ [ m ] , O ) 1 Input: Online sequences b [ T ] ∈ B T , x [ T ] ∈ X T , online learners { alg ( i ) } i ∈ [ m ] satisfying ( 18 ), ε -CMLOO O (following notation in Problem 2 , Definition 6 ) 2 Output: a [ T ] ∈ A T suc h that eac h a t is output after observing x t and b efore observing b t 3 w 1 ← 1 m 1 m 4 u ( i ) 1 ← alg ( i ) ( {} ) for all i ∈ [ m ] 5 a 1 ← O ( w 1 , { u ( i ) 1 } i ∈ [ m ] ) 6 η ← 1 L · p 2 log( m ) · (5 T ) − 1 / 2 7 for 2 ≤ t ≤ T do 8 v ( i ) t − 1 ← v ( i ) ( p t − 1 , b t − 1 ) for all i ∈ [ m ] 9 ˜ g t − 1 ← vector in R m suc h that [ ˜ g t − 1 ] i = ⟨ u ( i ) t − 1 ( x t − 1 ) , v ( i ) t − 1 ⟩ for all i ∈ [ m ] 10 u ( i ) t ← alg ( i ) ( v ( i ) [ t − 1] ) for all i ∈ [ m ] 11 w t ← w t − 1 ◦ exp( η ˜ g t − 1 ) 12 w t ← w t ∥ w t ∥ − 1 1 13 a t ← O ( w t , x t , { u ( i ) t } i ∈ [ m ] ) 14 p t ← any random element of A such that E [ p t | a [ t − 1] , b [ t − 1] ] = a t 15 end 16 Return: p [ T ] W e no w state our extension of Theorem 2 . In our statement, we allow for unbiased estimators of the CMLOO outputs to b e play ed, and give a high-probabilit y guaran tee on the error. W e also allo w for improp er learners that satisfy ( 18 ), but output hypotheses from an (appropriately bounded) differen t set than U ( i ) . Corollary 1. In the setting of Pr oblem 2 , assume we have ac c ess to O , an ε -CMLOO. F urther, for al l i ∈ [ m ] and T ∈ N , assume ther e is an online le arner alg ( i ) that takes inputs ( a [ T ] , b [ T ] , x [ T ] ) ∈ 14 A T × B T × X T , and outputs u ( i ) [ T ] ∈ (( U ′ ) ( i ) ) T such that u ( i ) t dep ends only on a [ t − 1] , b [ t − 1] , and x [ t ] , and sup u ( i ) ∈U ( i ) X t ∈ [ T ] D v ( i ) ( a t , b t ) , u ( i ) ( x t ) − u ( i ) t ( x t ) E ≤ reg ( i ) ( T ) , (18) for some reg ( i ) : N → R ≥ 0 . Final ly, assume    D v ( i ) ( a , b ) , u ( i ) ( x ) E    ≤ L (19) for al l i ∈ [ m ] , ( a , b , x ) ∈ A × B × X , and u ( i ) ∈ U ( i ) ∪ ( U ′ ) ( i ) . Then, for any ( b [ T ] , x [ T ] ) ∈ B T × X T , A lgorithm 2 pr o duc es p [ T ] ∈ A T such that p t dep ends only on b [ t − 1] and x [ t ] , and for any δ ∈ (0 , 1) , sup u ( i ) ∈U ( i ) 1 T X t ∈ [ T ] D u ( i ) ( x t ) , v ( i ) ( p t , b t ) E ≤ ε + reg ( i ) ( T ) + 28 L q T log ( 4 m δ ) T for al l i ∈ [ m ] , (20) with pr ob ability at le ast 1 − δ over the r andomness of the p [ T ] . Pr o of. The proof is largely analogous to Theorem 2 , substituting the con texts x [ T ] as necessary . The key difference is in ( 16 ), which no longer holds deterministically . W e instead apply the v arian t in Lemma 1 with the ˜ g [ T ] as defined in Algorithm 2 . Note that this is un biased for g t with entries ⟨ u ( i ) t , v ( i ) ( a t , b t ) ⟩ , by linearit y of eac h v ( i ) in its first argumen t. Thus, in place of ( 16 ), sup w ∈ ∆ m 1 T X t ∈ [ T ] ⟨ w , ˜ g t ⟩ = 1 T X t ∈ [ T ] ⟨ w t , g t ⟩ + 1 T X t ∈ [ T ] ⟨ w t , ˜ g t − g t ⟩ + sup w ∈ ∆ m 1 T X t ∈ [ T ] ⟨ w − w t , ˜ g t ⟩ ≤ ε + 1 T X t ∈ [ T ] ⟨ w t , ˜ g t − g t ⟩ + 20 L s log( 4 m δ ) T ≤ ε + 28 L s log( 4 m δ ) T , with probabilit y ≥ 1 − δ . Here, the first inequality used the CMLOO guarantee to b ound ⟨ w t , g t ⟩ ≤ ε for all t ∈ [ T ] , as w ell as Lemma 1 with failure probability δ 2 to b ound the last term. The second inequalit y used the Azuma-Ho effding inequality with the fact that each ⟨ w t , ˜ g t − g t ⟩ is mean-zero conditioned on the history , and b ounded in [ − 2 L, 2 L ] . 
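Line 14 of Algorithm 2 only requires an unbiased draw with conditional mean $a_t$; here is a minimal Python sketch of that step, under the assumption (used in our later applications) that $a_t$ is a finite distribution over a prediction net:

```python
import numpy as np

def play_unbiased(a_t, net, rng=np.random.default_rng()):
    """Sample an index of the net according to the mixed action a_t and play the
    corresponding pure prediction; the induced point-mass vector has conditional
    expectation a_t, as required on Line 14 of Algorithm 2."""
    idx = rng.choice(len(net), p=a_t)
    return net[idx]
```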
W e remark that the setting of η in Algorithm 2 is for Lemma 1 to hold; see Lemma 9, [ HTY25 ] for additional discussion. 4 Binary Omniprediction As a w arm up, we consider the binary omniprediction setting, where w e make predictions o ver k = 2 classes. W e state some general preliminaries in Section 4.1 . In Sections 4.2 and 4.3 we apply our framework from Section 3 to reduce online and statistical omniprediction, resp ectiv ely , to appropriate low-regret learners. W e complete our binary omniprediction results b y providing these lo w-regret learners in the linear (Section 4.4 ) and general (Section 4.5 ) classification settings. F or notational simplicity in this section only , we identify the binary simplex ∆ 2 with the prediction in terv al [0 , 1] and the b oundary ∂ ∆ 2 with the label set { 0 , 1 } , so e.g., in place of ( 5 ), L GLM :=  ℓ : R × { 0 , 1 } → R | ℓ ( t, y ) = ω ( t ) − ty , ω : R → R con vex with ω ′ : R → [0 , 1]  . (21) 15 W e will also use standard script (rather than b oldface) to denote any scalar-v alued v ariables. Finally , throughout the section w e mak e the normalization assumptions that X ⊆ B d 2 (1) , c ( x ) ∈ [ − 1 , 1] for all c ∈ C , x ∈ X , ℓ ( t, y ) ∈ [ − 1 , 1] for all ℓ ∈ L , ( t, y ) ∈ [ − 1 , 1] × { 0 , 1 } . (22) All our results generalize to generic b ounds on L and C in a scale-inv ariant wa y . F or example, the assumption ( 22 ) is enforced for L GLM b y requiring that ω ′ ( t ) ∈ [0 , 1] for all t ∈ [ − 1 , 1] . 4.1 Binary omniprediction preliminaries W e recall tw o useful characterizations of prop er losses from prior w ork. Lemma 7 (Lemma 3, [ KLST23 ]) . L et ℓ : Ω × ∂ ∆ k → R b e arbitr ary. Then defining k ⋆ ℓ as in ( 3 ) , the function ( p , y ) → ℓ ( k ⋆ ℓ ( p ) , y ) is a pr op er loss. Lemma 8 (Theorem 8, [ KLST23 ]) . F or al l s ∈ [0 , 1] and ( p, y ) ∈ [0 , 1] 2 , let ℓ s ( p, y ) := − | p − s | + ( p − y ) sign ( p − s ) . (23) Then ℓ s is pr op er for al l s ∈ [0 , 1] , and for every b ounde d pr op er loss ℓ : [0 , 1] × { 0 , 1 } → [ − 1 , 1] , ℓ ( p, y ) = a ℓ y + b ℓ + Z w ℓ ( v ) ℓ v ( p, y ) d v , for some nonne gative weights { w ℓ ( v ) } v ∈ [0 , 1] satisfying R 1 0 w ℓ ( v ) d v ≤ 2 , and a c onstant | a ℓ | ≤ 2 . Pr o of. All parts of the claim are explicit in Theorem 8, [ KLST23 ], except for the bound on | a ℓ | , so w e repro duce part of the pro of here to mak e this clear. Overloading notation, an y (biv ariate) prop er loss ℓ : [0 , 1] × { 0 , 1 } → [ − 1 , 1] can be written as ℓ ( p, y ) = − uni ( ℓ )( p ) + ( p − y ) uni ( ℓ ) ′ ( p ) for a univ ariate con vex function uni ( ℓ ) : [0 , 1] → R with | uni ( ℓ ) ′ | ≤ 2 (this is implied b y Lemma 4 with uni ( ℓ ) ← ψ ; see also Lemma 1 and Corollary 2, [ KLST23 ]). Since piecewise linear functions are dense for this family of ℓ under the sup norm, we only need to consider uni ( ℓ )( p ) that is piecewise linear with respect to p . If ψ := uni ( ℓ ) has k breakp oints, s 1 , . . . , s k , we can write ψ ( p ) as ψ ( p ) = ψ (0) + pψ ′ (0) + X i ∈ [ k ]  ψ ′ + ( s i ) − ψ ′ − ( s i )  · max( p − s i , 0) , where ψ ′ + ( s ) = lim p → s + ℓ ′ ( p ) and ℓ ′ − ( s ) = lim p → s − ℓ ′ ( p ) . The claim then follows since max( x, 0) = 1 2 ( | x | + x ) and ψ s i ( p ) = | p − s i | is the univ ariate form of ℓ s i in ( 23 ), so equiv alen tly , ψ ( p ) =   ψ (0) − 1 2 X i ∈ [ k ] λ i s i   + p   ψ ′ (0) + 1 2 X i ∈ [ k ] λ i   + 1 2 X i ∈ [ k ] λ i · ψ s i ( p ) , with λ i := ψ ′ + ( s i ) − ψ ′ − ( s i ) ≥ 0 . 
Note that since | ψ ′ (0) + P i ∈ [ k ] λ i | = | ψ ′ ( s k ) | ≤ 2 and | ψ ′ (0) | ≤ 2 , the co efficient of p has absolute v alue bounded b y 2 as claimed. 16 By combining Lemmas 7 and 8 , [ OKK25 ] show ed that obtaining calibration against the family of weigh ts {− d ℓ ◦ k ⋆ ℓ } ℓ ∈L , as required b y Prop osition 1 , can b e reduced to calibration against an appropriate basis of w eights induced by the sp ecific proper losses in ( 23 ). Lemma 9. L et L b e a family of losses over Ω × { 0 , 1 } such that ℓ ( ω , y ) ∈ [ − 1 , 1] for al l ℓ ∈ L , ω ∈ Ω , y ∈ { 0 , 1 } . In the statistic al setting, if p : R d → [0 , 1] satisfies ε - ( D , W thresh ) -c alibr ation for W thresh := { w ( p ) = sign ( p − s ) } s ∈ [0 , 1] , (24) it also satisfies 4 ε - ( D , {− d ℓ ◦ k ⋆ ℓ } ℓ ∈L ) -c alibr ation. In the online setting, if p [ T ] ∈ [0 , 1] T satisfies ε - ( y [ T ] , W thresh ) -c alibr ation, it also satisfies 4 ε - ( y [ T ] , {− d ℓ ◦ k ⋆ ℓ } ℓ ∈L ) -c alibr ation. Pr o of. The pro of follows Corollary 3.4 in [ OKK25 ], but includes the linear term a ℓ y (this term was not explicitly discussed in [ OKK25 ], which we account for here). Sp ecifically , the pro of of Theorem 3.1 in [ OKK25 ] shows that Lemma 7 implies it suffices to prov e the lemma statemen t for prop er ℓ , and drop composition with k ⋆ ℓ . Next, for an y prop er ℓ , d ℓ ( p ) = a ℓ + Z w ℓ ( v ) d ℓ v ( p ) d v , for nonnegative weigh ts { w ℓ ( v ) } v ∈ [0 , 1] satisfying R 1 0 w ℓ ( v ) d v ≤ 2 , and | a ℓ | ≤ 2 , by using Lemma 8 and linearity of d ℓ . Equiv alen tly , because d ℓ 0 = sign (0 − · ) = − 1 and d ℓ 1 = sign (1 − · ) = 1 are constan t functions, w e can directly include a ℓ in an appropriate w eight in the in tegral, so that d ℓ ( p ) = Z u ℓ ( v ) d ℓ v ( p ) d v , for nonnegative { u ℓ ( v ) } v ∈ [0 , 1] satisfying R 1 0 u ℓ ( v ) d v ≤ 4 . The conclusion follo ws b ecause ε -calibration against W thresh implies 4 ε -calibration against an y suc h d ℓ defined ab ov e, by integrating. 4.2 Online binary omniprediction W e first consider the online binary omniprediction setting, i.e., Definition 1 with k = 2 , for a family of losses L and a family of comparators C . Throughout, let X ⊆ R d b e the supp ort of our features. Our starting p oint is an observ ation from [ OKK25 ] that the (online v arian ts of ) Definitions 2 and 3 can naturally be framed in the con text of Problem 2 . F or a fixed parameter ε ∈ (0 , 1) , we define A := ∆ N , B := [0 , 1] , (25) where N := { iε } i ∈ [ ⌊ 1 ε ⌋ ] ∪ { 0 , 1 } is an ε -net for [0 , 1] , and is viewed as a set of representativ e thresholds. F or simplicit y of notation, w e use p ∼ a to mean that p ∈ N is sampled according to a . No te that the sequence b [ T ] ∈ B T will even tually corresp ond to our (binary) online label space. W e next define the sets and pay off vectors in Problem 2 , where sign (0) = 1 by conv en tion: U (1) := ∆ N , v (1) ( a , b ) := { E p ∼ a [( p − b ) sign ( p − s )] } s ∈N , U (2) := { d ℓ ◦ c } ℓ ∈L ,c ∈C , v (2) ( a , b ) := E p ∼ a [ p − b ] . (26) 17 W e remark that our definition of U (2) is exactly the set F in Prop osition 1 . W e equip b oth H (1) = R N and H (2) = R with the standard Euclidean inner pro duct. Note that v (1) is a function that takes ( a , b ) ∈ A × B to a v ector in R N , whose co ordinates are indexed by N . Eac h co ordinate of U (1) is used to ensure calibration against an elemen t of W thresh . 
In a sligh t abuse of notation (as remarked on in Problem 2 ), U (1) consists of elements of H (1) , whereas U (2) consists of functions taking contexts from our domain X to H (2) . T o use Corollary 1 , w e first instantiate the learner alg (1) as sp ecified in ( 18 ). Lemma 10. F ol lowing definitions ( 25 ) , ( 26 ) , ther e exists alg (1) such that for any ( a [ T ] , b [ T ] ) ∈ A T × B T , alg (1) outputs u (1) [ T ] ∈ ( U (1) ) T such that u (1) t dep ends only on a [ t − 1] , b [ t − 1] , and ( 18 ) holds with reg (1) ( T ) := s 2 T log  1 ε + 2  . Pr o of. The algorithm is m ultiplicative weigh ts. More precisely , for all ( a t , b t ) ∈ A × B , ∥ v (1) ( a t , b t ) ∥ ∞ ≤ 1 . Therefore, Lemma 2 applies with L = 1 and ∥·∥ = ∥·∥ 1 . W e c ho ose r ( u ) := P s ∈N u s log u s , at whic h point the standard bound Θ ≤ log( |N | ) in Lemma 2 yields the conclusion. Next, we recall a construction of a CMLOO (Definition 6 ) giv en by [ OKK25 ]. Lemma 11 (Lemma 3.11, [ OKK25 ]) . F ol lowing definitions ( 25 ) , ( 26 ) , and assuming ( 22 ) , ther e exists O , an ε -CMLOO. Pr o of. W e state the algorithm in Algorithm 3 . F or notational simplicity , w e fix inputs w = ( q , r ) ∈ ∆ 2 , x ∈ X , u := u (1) ∈ ∆ |N | +2 , and d ℓ ◦ c := u (2) ∈ U (2) to the CMLOO. Also, let f ( a , b ) := q X s ∈N u s E p ∼ a [( p − b ) sign ( p − s )] ! + r d ℓ ( c ( x )) E p ∼ a [ p − b ] , so that following Definition 6 , w e wish to find a ∈ A suc h that f ( a , b ) ≤ ε for ev ery b ∈ B . It is helpful to define the “pure strategy” sp ecialization of ( 27 ), i.e., where a is a p oint mass on s ∈ N : h ( p ) := q X s ∈N u s sign ( p − s ) ! + r d ℓ ( c ( x )) . (27) In particular, we hav e f ( e p , b ) = h ( p )( p − b ) in this special case. Observ e that under ( 22 ), we hav e for any p ∈ N that | h ( p ) | ≤ 1 , b ecause | c ( x ) | ≤ 1 , | sign ( p − s ) | ≤ 1 for all s ∈ N , and q + r = 1 . Case 1: h (0) ≥ 0 . In this case, Algorithm 3 outputs a = e 0 , which satisfies f ( a , b ) = h (0)(0 − b ) ≤ 0 , for all b ∈ B . Case 2: h (1) ≤ 0 . In this case, w e similarly ha ve for a = e 1 , f ( a , b ) = h (1)(1 − b ) ≤ 0 , for all b ∈ B . 18 Case 3: h (0) < 0 and h (1) > 0 . In this case, there are adjacent ( p, p ′ ) ∈ N × N with p ≤ p ′ ≤ p + ε , h ( p ) ≤ 0 , and h ( p ′ ) ≥ 0 , and Algorithm 3 outputs a = | h ( p ′ ) | | h ( p ) | + | h ( p ′ ) | e p + | h ( p ) | | h ( p ) | + | h ( p ′ ) | e p ′ . Then, f ( a , b ) = | h ( p ′ ) | | h ( p ) | + | h ( p ′ ) | · h ( p )( p − b ) + | h ( p ) | | h ( p ) | + | h ( p ′ ) | · h ( p ′ )( p ′ − b ) = | h ( p ′ ) | | h ( p ) | + | h ( p ′ ) | · h ( p )( p − b ) + | h ( p ) | | h ( p ) | + | h ( p ′ ) | · h ( p ′ )( p − b ) + | h ( p ) | | h ( p ) | + | h ( p ′ ) | · h ( p ′ )( p ′ − p ) = | h ( p ) | | h ( p ) | + | h ( p ′ ) | · h ( p ′ )( p ′ − p ) ≤ ε, for all b ∈ B , where the last line used | p − p ′ | ≤ ε and | h ( p ) | h ( p ) | + | h ( p ′ ) | · h ( p ′ ) | ≤ 1 . Algorithm 3: CMLOO for binary omniprediction 1 Input: w = ( q, r ) ∈ ∆ 2 , x ∈ X , u ∈ ∆ N , c ∈ C , ℓ ∈ L 2 if h (0) ≥ 0 then return e 0 // Following definition ( 27 ) . 3 else if h (1) ≤ 0 then return e 1 4 else 5 ( p, p ′ ) ← elements in N × N such that p ≤ p ′ ≤ p + ε , h ( p ) ≤ 0 , h ( p ′ ) ≥ 0 6 return | h ( p ′ ) | | h ( p ) | + | h ( p ′ ) | e p + | h ( p ) | | h ( p ) | + | h ( p ′ ) | e p ′ W e conclude the section by sho wing ho w to apply Lemmas 10 and 11 within the context of Prop o- sition 1 to giv e our result on online binary omniprediction. 
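Before doing so, here is a short Python sketch of the case analysis behind Algorithm 3 (our illustrative rendering: `h` is the pure-strategy function from (27), `net` is the sorted $\varepsilon$-net of $[0, 1]$ containing 0 and 1, and the output is a distribution over the net given as a dict).

```python
def binary_cmloo(h, net):
    """Case analysis of Algorithm 3: returns a distribution over the net whose
    expected payoff f(a, b) is at most eps for every label b, following the
    three cases in the proof of Lemma 11."""
    if h(net[0]) >= 0:                     # Case 1: h(0) >= 0, play the point mass on 0
        return {net[0]: 1.0}
    if h(net[-1]) <= 0:                    # Case 2: h(1) <= 0, play the point mass on 1
        return {net[-1]: 1.0}
    # Case 3: h(0) < 0 < h(1); find adjacent p <= p' with h(p) <= 0 <= h(p').
    for p, p_next in zip(net, net[1:]):
        hp, hp_next = h(p), h(p_next)
        if hp <= 0 <= hp_next:
            if hp == 0:                    # degenerate case: a point mass already works
                return {p: 1.0}
            z = abs(hp) + abs(hp_next)
            return {p: abs(hp_next) / z, p_next: abs(hp) / z}  # mixture from Case 3
    raise RuntimeError("unreachable once Cases 1 and 2 are excluded")
```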
Corollary 2 (Online binary omniprediction) . L et L b e a family of loss functions and C , C ′ b e families of c omp ar ators satisfying ( 22 ) . Assume ther e exists an online le arner alg (2) that takes inputs ( v [ T ] , x [ T ] ) ∈ [ − 1 , 1] T × X T , and outputs ℓ [ T ] ∈ L T , c [ T ] ∈ ( C ′ ) T , 7 such that ( ℓ t , c t ) dep ends only on v [ t − 1] , x [ t − 1] , and sup ( ℓ,c ) ∈L×C X t ∈ [ T ] v t ( d ℓ ( c ( x t )) − d ℓ t ( c t ( x t ))) ≤ reg ( T ) , (28) for reg : N → R ≥ 0 such that al l T ≥ T L , C satisfy reg ( T ) T ≤ ε . Then if T = Ω( 1 ε 2 log( 1 δ ε )) + T L , C , we c an pr o duc e p [ T ] ∈ [0 , 1] T , a 15 ε -omnipr e dictor for ( x [ T ] , y [ T ] , L , C ) , with pr ob ability ≥ 1 − δ . Pr o of. F ollo wing notation in Proposition 1 , it suffices to show how to pro duce p [ T ] ∈ [0 , 1] T satisfying 3 ε - ( x [ T ] , y [ T ] , F ) -multiaccuracy and 12 ε - ( y [ T ] , W ) -calibration, where w e recall F := { d ℓ ◦ c } ℓ ∈L , c ∈C , W := {− d ℓ ◦ k ⋆ ℓ } ℓ ∈L . W e achiev e this b y using Corollary 1 . F rom our definitions of U (1) , v (1) , U (2) , and v (2) in ( 26 ), it is clear that w e ma y take L = 1 in ( 19 ). Now if we can satisfy the requirements ( 18 ) of Corollary 1 , 7 W e include this additional flexibilit y for our applications in Section 4.5 , which may return improp er hypotheses. 19 pla ying an y p t ∼ a t as a t is pro duced b y the CMLOO in Lemma 11 in each iteration, gives sup u ( i ) ∈U ( i ) 1 T X t ∈ [ T ] D u ( i ) ( x t ) , v ( i ) ( e p t , b t ) E ≤ ε + reg ( i ) ( T ) + 28 q T log ( 8 δ ) T (29) for i ∈ [2] with probabilit y ≥ 1 − δ . Condition on this even t henceforth. When i = 1 , the guarantee in ( 29 ) exactly corresp onds to ( y [ T ] , W thresh ) -calibration, when comparing with Definition 3 and Lemma 9 . 8 Similarly , when i = 2 , ( 29 ) is exactly ( x [ T ] , y [ T ] , F ) -multiaccuracy . Th us, it is enough to b ound the right-hand side of ( 29 ) b y 3 ε in b oth cases, whic h also results in 12 ε - ( y [ T ] , W ) -calibration via Lemma 9 . T o do this w e use the online learner from Lemma 10 as alg (1) , and the learner with guaran tee ( 28 ) as alg (2) , and tak e T as specified. W e p ostp one discussion of the construction of online learners alg (2) meeting the requiremen t ( 28 ), for b oth specific and general pairs of L × C , to Sections 4.4 and 4.5 . 4.3 Statistical binary omniprediction W e next consider statistical omniprediction, again with k = 2 . Ho wev er, w e require a differen t form ulation of Problem 2 . In this section, we let H (1) and H (2) b e the Hilb ert spaces of (norm) square-in tegrable functions under D , taking X × { 0 , 1 } → R N and X × { 0 , 1 } → R resp ectively . The inner pro ducts of u , v ∈ H (1) and u, v ∈ H (2) are the corresponding L 2 ( D ) inner products: ⟨ u , v ⟩ := E ( x ,y ) ∼D [ ⟨ u ( x , y ) , v ( x , y ) ⟩ ] , ⟨ u, v ⟩ := E ( x ,y ) [ u ( x , y ) v ( x , y )] . Note that in this instance of Problem 2 , there is no additional con text x ∈ X , as it is implicitly sp ecified through our inner pro duct definitions. Next, define A to b e the set of functions taking each x ∈ X → a ( x ) ∈ ∆ N , i.e., A :=  a : X → ∆ N  . (30) As b efore, N is an ε -net for [0 , 1] that includes { 0 , 1 } , and we write p ∼ a ( x ) to mean p ∈ N is sampled as sp ecified b y a ( x ) ∈ ∆ N . Also, our pa yoff v ectors v (1) and v (2) will b e indep endent of b ∈ B , so w e simply let B = ∅ and drop the input b from v (1) , v (2) . 
Finally , w e let U (1) := ∆ N , v (1) ( a )( x , y ) :=  E p ∼ a ( x ) [( p − y ) sign ( p − s )]  s ∈N , U (2) := { d ℓ ◦ c } ℓ ∈L ,c ∈C , v (2) ( a )( x , y ) := E p ∼ a ( x ) [ p − y ] . (31) As b efore, U (1) is in terpreted as the family of constant functions ov er X × { 0 , 1 } with range ∆ N , and d ℓ ◦ c ∈ U (2) acts on ( x , y ) ∈ X × { 0 , 1 } by discarding y and outputting d ℓ ( c ( x )) . W e also sp ecify the functions v (1) ( a ) ∈ H (1) and v (2) ( a ) ∈ H (2) b y their actions on an elemen t ( x , y ) ∈ X × { 0 , 1 } . With this setup in hand, w e now extend Lemmas 10 and 11 to the statistical setting. 8 Our definition of U (1) formally only yields ( y [ T ] , W thresh ) -calibration for the thresholds s ∈ N . How ever, because w e only play predictions in N , all weigh ts induced by thresholds outside N agree with that of some threshold in N . 20 Lemma 12. L et δ ∈ (0 , 1) . F ol lowing definitions ( 30 ) , ( 31 ) , ther e exists alg (1) such that for any a [ T ] ∈ A T , alg (1) outputs u (1) [ T ] ∈ ( U (1) ) T such that u (1) t dep ends only on a [ t − 1] , and ( 18 ) holds with reg (1) ( T ) := 20 s T log  4 δ ε  , with pr ob ability ≥ 1 − δ , wher e for e ach t ∈ [ T ] , we r e quir e one i.i.d. dr aw ( x t , y t ) ∼ D . Pr o of. More concretely , our goal in ( 18 ) is to hav e sup u ∈ ∆ N X t ∈ [ T ] D E ( x ,y ) ∼D h v (1) ( a t )( x , y ) i , u − u t E ≤ reg (1) ( T ) . F or any a t w e hav e an un biased estimator of E ( x ,y ) ∼D [ v (1) ( a t )( x , y )] conditioned on the history , with en tries alw ays in [ − 1 , 1] , under our assumed sample access to D . Thus, the conclusion follows by applying the v ariant of m ultiplicative weigh ts in Lemma 1 with L = 1 and Θ = log( 1 ε + 2) . Lemma 13. F ol lowing definitions ( 30 ) , ( 31 ) , and assuming ( 22 ) , ther e exists O , an ε -CMLOO. Pr o of. The CMLOO is unchanged from Lemma 11 , except w e no w return a function a ∈ H (1) suc h that for any x ∈ X , we let a ( x ) hav e the same output as Algorithm 3 with con text x . Then, for any auxiliary inputs w = ( q , r ) ∈ ∆ 2 , u ∈ ∆ |N | +1 , c ∈ C , and ℓ ∈ L , w e hav e for all ( x , y ) ∈ X × [0 , 1] , q X s ∈N u s E p ∼ a ( x ) [( p − y ) sign ( p − s )] ! + r d ℓ ( c ( x )) E p ∼ a ( x ) [ p − y ] ≤ ε. T aking expectations o ver the abov e display ov er ( x , y ) ∼ D then yields the CMLOO guaran tee E ( x ,y ) ∼D h q D u , v (1) ( a ) E + r d ℓ ( c ( x )) E p ∼ a ( x ) [ p − y ] i ≤ 0 . W e can no w derive the statistical analog of Corollary 2 , again p ostp oning the discussion of online learners meeting ( 32 ) to Sections 4.4 and 4.5 . Corollary 3 (Statistical binary omniprediction) . L et L b e a family of loss functions and C , C ′ b e families of c omp ar ators satisfying ( 22 ) . Assume ther e exists an online le arner alg (2) that takes inputs { v t : X × { 0 , 1 } → R } t ∈ [ T ] , and outputs ℓ [ T ] ∈ L T , c [ T ] ∈ ( C ′ ) T , such that ( ℓ t , c t ) dep ends only on v [ t − 1] , and sup ( ℓ,c ) ∈L×C X t ∈ [ T ] ⟨ v t , d ℓ ( c ) − d ℓ t ( c t ) ⟩ ≤ reg ( T ) , (32) for reg : N → R ≥ 0 such that al l T ≥ T L , C satisfy reg ( T ) T ≤ ε with pr ob ability ≥ 1 − δ 2 . Then if T = Ω( 1 ε 2 log( 1 δ ε )) + T L , C , we c an pr o duc e p : X → [0 , 1] , a 15 ε -omnipr e dictor for ( D , L , C ) , with pr ob ability ≥ 1 − δ , given T i.i.d. samples fr om D . 21 Pr o of. The pro of is completely analogous to Corollary 2 but using ( 32 ) and Lemmas 12 , 13 in place of ( 28 ) and Lemmas 10 , 11 . 
The conclusion is that we can return { p t : X → [0 , 1] } t ∈ [ T ] satisfying F -multiaccuracy and W -calibration on av erage, i.e., for all f ∈ F and w ∈ W in Prop osition 1 , E ( x ,y ) ∼D   1 T X t ∈ [ T ] ⟨ p t ( x ) − y , f ( x ) ⟩   ≤ 3 ε, E ( x ,y ) ∼D   1 T X t ∈ [ T ] ⟨ p t ( x ) − y , w ( p t ( x )) ⟩   ≤ 12 ε, sim ultaneously except with probabilit y δ (taking a union b ound o v er ( 32 ) and Lemma 12 with δ ← δ 2 ). By linearit y of exp ectation, outputting p ← p t for a uniformly random t ∈ [ T ] satisfies F -multiaccuracy and W -calibration, so applying Prop osition 1 to this p gives the result. 4.4 Generalized linear models In this section, we sp ecialize Corollary 2 and 3 to the setting of generalized linear models, where L := L GLM as defined in ( 5 ), and C := C lin where C lin := n c ( x ) := ⟨ c , x ⟩ | c ∈ B d 2 (1) o . (33) W e conflate the actual linear classifier c ∈ B d 2 (1) with a function c : X → [ − 1 , 1] b y using b oldface, so e.g., c t ∈ B d 2 (1) corresp onds to the function c t = ⟨ c t , ·⟩ ∈ C lin . Recalling ( 6 ), we take the discrete deriv ative d ℓ to be negation for all ℓ ∈ L GLM , so F in Proposition 1 is equiv alen t to C lin b ecause C lin is closed under negation. W e next require online learners satisfying ( 28 ), ( 32 ). Lemma 14. Assuming ( 22 ) holds, ther e exists alg (2) such that for any ( v [ T ] , x [ T ] ) ∈ [ − 1 , 1] T × X T , alg (2) outputs c [ T ] ∈ ( B d 2 (1)) T in O ( dT ) time, such that c t only dep ends on v [ t − 1] , x [ t − 1] , and sup c ∈C lin X t ∈ [ T ] v t ⟨ x t , c − c t ⟩ ≤ √ T . Pr o of. This follows from Lemma 2 with X ← B d 2 (1) and r ( c ) := 1 2 ∥ c ∥ 2 2 , where w e take g t := − v t x t so that our application satisfies ∥ g t ∥ 2 ≤ L := 1 and Θ ≤ 1 2 , b ecause ∥ x t ∥ 2 ≤ 1 under ( 22 ). Lemma 15. L et δ ∈ (0 , 1) . Assuming ( 22 ) holds, ther e exists alg (2) such that for any { v t : X × { 0 , 1 } → [ − 1 , 1] } t ∈ [ T ] , alg (2) outputs c [ T ] ∈ ( B d 2 (1)) T in O ( dT ) time, such that c t only dep ends on v [ t − 1] , and sup c ∈C lin X t ∈ [ T ] E ( x ,y ) ∼D [ v t ( x , y ) ⟨ x , c − c t ⟩ ] ≤ 20 s T log  2 δ  , with pr ob ability ≥ 1 − δ , wher e for e ach t ∈ [ T ] , we r e quir e one i.i.d. dr aw ( x t , y t ) ∼ D . Pr o of. The pro of is identical to Lemma 14 , where w e use Lemma 1 in place of Lemma 2 , gran ting us unbiased access to g t := E ( x ,y ) ∼D [ v t ( x , y ) x ] under our sampling assumption. 22 W e no w combine the pieces to give our result on omnipredicting (binary) generalized linear mo dels. Theorem 3 (Binary generalized linear mo dels) . L et δ ∈ (0 , 1) , let L := L GLM and C := C lin define d in ( 21 ) , ( 33 ) r esp e ctively, and assume ( 22 ) holds. Then if T = Ω log  1 δ ε  ε 2 ! for an appr opriate c onstant, in the online setting, we c an output p [ T ] ∈ [0 , 1] T , an ε -omnipr e dictor for ( x T , y [ T ] , L , C ) , in time O (( d + 1 ε ) T ) with pr ob ability ≥ 1 − δ . In the statistic al setting, we c an output p : X → [0 , 1] , an ε -omnipr e dictor for ( D , L , C ) , in time O (( d + 1 ε ) T ) with pr ob ability ≥ 1 − δ , given T i.i.d. samples fr om D , such that p c an b e evaluate d in time O ( d + 1 ε ) . Pr o of. The result on online omniprediction is immediate by combining Corollary 2 and Lemma 14 , and adjusting ε ← ε 15 . F or the result on statistical omniprediction, the result similarly follo ws from Corollary 3 and Lemma 15 . Note that we take C = C ′ in these applications. 
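For concreteness, the following is a minimal sketch of the kind of projected online gradient step described in Lemma 14 for the unit-ball linear comparators (33); the class name and the $1/\sqrt{T}$ step size are illustrative choices rather than specifications from the paper.

```python
import numpy as np

class UnitBallLinearLearner:
    """Sketch of the Lemma 14 learner: comparators c(x) = <c, x> with ||c||_2 <= 1.

    At round t the learner sees v_t in [-1, 1] and x_t with ||x_t||_2 <= 1, and
    performs online gradient ascent on the linear reward c |-> v_t * <x_t, c>,
    followed by projection onto the unit Euclidean ball.
    """

    def __init__(self, d: int, T: int):
        self.c = np.zeros(d)          # current iterate c_t
        self.eta = 1.0 / np.sqrt(T)   # illustrative step size

    def predict(self, x: np.ndarray) -> float:
        return float(self.c @ x)

    def update(self, v_t: float, x_t: np.ndarray) -> None:
        self.c = self.c + self.eta * v_t * x_t
        norm = np.linalg.norm(self.c)
        if norm > 1.0:
            self.c = self.c / norm    # project back onto the unit ball
```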
T o ev aluate p as specified in Corollary 3 , 9 w e store the v alues of all inputs to Algorithm 2 for eac h iteration. W e can compute the function h ( s ) in ( 27 ) for all s ∈ N in O ( 1 ε ) time, after sp ending O ( d ) time to ev aluate some c t ( x ) . This complexity also dominates the cost of eac h iteration. 4.5 General classifiers and losses W e finally consider Corollary 2 and Corollary 3 for general loss functions L and general comparators C that satisfy ( 22 ). T o state the guaran tees of our online learners satisfying ( 28 ), ( 32 ), we define the following complexit y measure parameters of a function class F . Definition 7 (Statistical Rademacher complexit y) . L et F b e a class of functions f : X → R , D b e a distribution over X , and T ∈ N . The statistical Rademac her complexity of F is define d to b e rad T ( F ) := E { x t } t ∈ [ T ] ∼ i.i.d. D   E σ [ T ] ∼ unif. {± 1 } T   sup f ∈F 1 T X t ∈ [ T ] σ t f ( x t )     . Definition 8 (Sequen tial Rademac her complexity) . L et F b e a class of functions f : X → R and T ∈ N . The sequential Rademacher complexity of F is define d to b e srad T ( F ) := sup { x t : {± 1 } t − 1 →X } t ∈ [ T ] E σ [ T ] ∼ unif. {± 1 } T   sup f ∈F 1 T X t ∈ [ T ] σ t f ( x t ( σ [ t − 1] ))   . F or the online setting, w e use the follo wing result from [ OKK25 ]. Lemma 16 (Theorem 4.5, [ OKK25 ]) . In the setting of Cor ol lary 2 , let F := { d ℓ ◦ c } ℓ ∈L ,c ∈C . Ther e exists alg (2) such that for any ( v [ T ] , x [ T ] ) ∈ [ − 1 , 1] T × X T , alg (2) outputs c [ T ] , such that c t ∈ C ′ only dep ends on v [ t − 1] , x [ t − 1] , and sup c ∈C X t ∈ [ T ] v t ( d ℓ ( c ( x t )) − d ℓ t ( c t ( x t ))) ≤ 2 T · srad T ( F ) . 9 W e prepro cess the indices in [ T ] so a uniform sample is attainable in O (1) time, e.g., via the alias method. 23 W e remark that Lemma 16 is based on a computationally-inefficient (indeed, nonconstructive) argumen t from [ RST15 ], but that the regret bound is known to b e tight up to a constant factor (Theorem 4.5, [ OKK25 ]). F or specific pairs ( L , C ) , e.g., the ones in Lemma 14 , it is p ossible to design more explicit online learners, so in general the computational cost depends on the setting. F or the statistical setting, w e similarly use the following result. Lemma 17 (Lemma 7.4, Lemma 7.6, [ OKK25 ]) . In the setting of Cor ol lary 3 , let δ ∈ (0 , 1) and F := { d ℓ ◦ c } ℓ ∈L ,c ∈C . L et V ⊆ { v : X × { 0 , 1 } → [ − 1 , 1] } . Ther e exists alg (2) such that for any v [ T ] ∈ V T , alg (2) outputs c [ T ] , making O ( T 1 . 5 ) c al ls to an ERM or acle for F over T samples p er iter ation, such that c t ∈ C ′ only dep ends on v [ t − 1] , and for a universal c onstant C , sup c ∈C X t ∈ [ T ] E ( x ,y ) ∼D [ v t ( x , y ) ( d ℓ ( c ( x )) − d ℓ t ( c t ( x )))] ≤ C r T · log 1 δ + T · rad T ( F · V ) ! , with pr ob ability ≥ 1 − δ , wher e for e ach t ∈ [ T ] , we r e quir e T i.i.d. dr aws ∼ D . In the statement of Lemma 17 , we let F · V consist of functions ( x , y ) 7→ f ( x ) v ( x , y ) for f ∈ F and v ∈ V , and an ERM oracle for F finds min f ∈F 1 n P i ∈ [ n ] w i f ( x i , y i ) o ver some dataset { ( x i , y i ) } of n i.i.d. dra ws from D , and some w eigh ts w ∈ R n . The computational complexit y of implementing suc h an oracle again t ypically dep ends on the sp ecific setting. Finally , w e conclude with our main result for binary omniprediction in the general case. Theorem 4 (General binary omniprediction) . 
L et L b e a family of loss functions and C b e a family of c omp ar ators such that ( 22 ) holds, let F := { d ℓ ◦ c } ℓ ∈L ,c ∈C , and let δ ∈ (0 , 1) . L et T seq L , C , T stat L , C b e such that al l T ≥ T seq L , C satisfy srad T ( F ) ≤ ε 30 , and al l T ≥ T stat L , C satisfy rad T ( F · V ) ≤ ε 30 C , wher e V c onsists of functions ( x , y ) → p ( x ) − y for al l p ossible p : X → [0 , 1] outputte d by the CMLOO in L emma 13 given classes C , L . Then if T = Ω log  1 δ ε  ε 2 ! + T seq L , C , for an appr opriate c onstant, in the online setting, we c an output p [ T ] ∈ [0 , 1] T , an ε -omnipr e dictor for ( x T , y [ T ] , L , C ) , with pr ob ability ≥ 1 − δ . If T = Ω log  1 δ ε  ε 2 ! + T stat L , C , for an appr opriate c onstant, in the statistic al setting, we c an output p : X → [0 , 1] , an ε -omnipr e dictor for ( D , L , C ) , with pr ob ability ≥ 1 − δ , given T 2 i.i.d. samples fr om D and O ( T 2 . 5 ) c al ls to an ERM or acle for F . Pr o of. The pro of is largely the same as the pro of for Theorem 3 . W e apply Lemma 16 within Corollary 2 for the online setting and Lemma 17 within Corollary 3 in the statistical setting. The oracle complexity comes from the cost of the online learner in Lemma 17 . 24 5 Multiclass Omniprediction W e no w pro ceed to our main result: omniprediction with k > 2 classes. Here, many prop erties sp ecific to the binary classification setting do not hold, e.g., the generalizations of Lemmas 9 and 11 . W e dev elop a general strategy for constructing MLOOs for m ulticlass omniprediction in Section 5.1 . W e then give the multic lass extensions of Corollaries 2 and 3 in Section 5.2 . Finally , in Sections 5.3 and 5.4 , w e give our full m ulticlass omniprediction results for linear and general classifiers. Throughout, we mak e the follo wing normalization assumptions: X ⊆ B d 2 (1) , c ( x ) ∈ [ − 1 , 1] k for all c ∈ C , x ∈ X , ℓ ( t , y ) ∈ [ − 1 , 1] for all ℓ ∈ L , ( t , y ) ∈ [ − 1 , 1] k × ∂ ∆ k . (34) 5.1 MLOOs for m ulticlass omniprediction In this section, we consider a sp ecialized application of the mac hinery in Section 3.2 to multiclass prediction. Sp ecifically , suppose that we hav e an instance of Problem 1 , where A := ∆ N , B = ∂ ∆ k , (35) and N is an ε -net for ∆ k . Also, define for all a ∈ ∆ N and b ∈ B , v ( a , b ) := { a s ( s − b ) } s ∈N ∈ R k |N | , (36) and supp ose that for all i ∈ [ m ] , v ( i ) ( a , b ) = M ( i ) v ( a , b ) (37) for some linear op erator M ( i ) : R k |N | → H ( i ) . W e giv e a meta-result that shows ho w to implemen t an MLOO for arbitrary sim ultaneous Blackw ell approac habilit y instances satisfying ( 36 ), ( 37 ), whose qualit y scales with b ounds on the { M ( i ) } i ∈ [ m ] and the {U ( i ) } i ∈ [ m ] . Lemma 18. In the setting of Pr oblem 1 , supp ose ( 35 ) , ( 36 ) , and ( 37 ) hold, wher e N is an ε -net for ∆ k . F urther, supp ose that for al l i ∈ [ m ] , we have     M ( i )  ∗ u ( i )    ∞ ≤ R for al l u ( i ) ∈ U ( i ) . (38) wher e ∗ denotes the adjoint. W e c an implement a 2 εR -MLOO with pr ob ability at le ast 1 − δ in time O ( |N | · p oly ( k , log 1 δ ε )) . Similarly, in the setting of Pr oblem 2 , we c an implement a 2 εR -CMLOO with pr ob ability at le ast 1 − δ in time O ( |N | · p oly ( k , log 1 δ ε )) . Pr o of. W e first sho w that for all w ∈ ∆ m , { u ( i ) } i ∈ [ m ] ∈ Q i ∈ [ m ] U ( i ) , there exists a ∈ A with max b ∈B X i ∈ [ m ] w i D u ( i ) , v ( i ) ( a , b ) E ≤ εR. 
Throughout the proof fix a set of w ∈ ∆ m and { u ( i ) } i ∈ [ m ] ∈ Q i ∈ [ m ] U ( i ) , and denote f := X i ∈ [ m ]  M ( i )  ∗ u ( i ) ∈ B k |N | ∞ ( R ) . 25 Th us, our goal is to establish min a ∈A max b ∈B ⟨ f , v ( a , b ) ⟩ ≤ εR. (39) Because ⟨ f , v ( a , b ) ⟩ is a bilinear function of a , b , the von Neumann minimax theorem gives min a ∈A max b ∈B ⟨ f , v ( a , b ) ⟩ = max q ∈ ∆ k min s ∈N E b ∼ q [ ⟨ f , v ( e s , b ) ⟩ ] = max q ∈ ∆ k min s ∈N E b ∼ q [ ⟨ f s , s − b ⟩ ] = max q ∈ ∆ k min s ∈N ⟨ f s , s − q ⟩ , where e s ∈ { 0 , 1 } N is the indicator v ector for strategy s ∈ N , and f s ∈ B k ∞ ( R ) concatenates the corresp onding co ordinates of f . Finally we claim that for an y q ∈ ∆ k , min s ∈N ⟨ f s , s − q ⟩ ≤ εR. Indeed, choosing s ∈ N so that ∥ s − q ∥ 1 ≤ ε and applying Hölder’s inequality yields this b ound. W e conclude by discussing run time. Normalize the problem by R b y resetting f ← 1 R f , so w e wan t to solve ( 39 ) to ε additiv e error. Notice that ( 39 ) is of the follo wing form: min a ∈ ∆ N max b ∈ ∆ k g ⊤ a − b ⊤ F a = min a ∈ ∆ N max b ∈ ∆ k b ⊤ Ma , where F ∈ R k ×N horizon tally stac ks the v alues of f , g ∈ R N has coordinate s ∈ N equal to ⟨ f s , s ⟩ , and we define M := 1 k g ⊤ − F . Also, w e hav e that M ∈ [ − 2 , 2] k ×N . W e can rewrite this as min t such that Aa + c = t 1 k , 1 ⊤ N a = 1 , a ≥ 0 N , c ≥ 0 k en trywise. W e note that c is enforcing the inequalit y constrain ts Aa ≤ t 1 k . W e can trivially enforce that t ∈ [ − 2 , 2] , a ∈ [0 , 1] N , and c ∈ [0 , 4] k . It is enough to obtain ε additiv e error for this problem for our guarantees. A t this point, the solver in Theorem 1.1 of [ vdBLL + 21 ] gives the claim. 5.2 Reducing multiclass omniprediction to lo w-regret learning W e giv e the analogs of Corollaries 2 and 3 in the multiclass setting. Online setting. In the online setting, for a fixed parameter ε ∈ (0 , 1) , we define ( A , B ) as in ( 35 ), where N is an ε -net for ∆ k of size ( 5 ε ) k − 1 as guaranteed by F act 1 . F or some a ∈ A , w e use p ∼ a to mean that some p ∈ N is sampled according to a . W e next define the sets and pay off vectors in Problem 2 : U (1) := [ − 1 , 1] N × k , v (1) ( a , b ) := { a s ( s − b ) } s ∈N , U (2) := { d ℓ ◦ c } ℓ ∈L , c ∈C , v (2) ( a , b ) := E p ∼ a [ p − b ] . (40) Note that U (1) and v (1) liv e in a vector space of dimension k |N | , whereas U (2) and v (2) are functions with range in R k . W e require the analog of Lemma 10 , an online learner for U (1) . Lemma 19. F ol lowing definitions ( 35 ) , ( 40 ) , ther e exists alg (1) such that for any ( a [ T ] , b [ T ] ) ∈ A T × B T , alg (1) outputs u (1) [ T ] ∈ ( U (1) ) T such that u (1) t dep ends only on a [ t − 1] , b [ t − 1] , and ( 18 ) holds with reg (1) ( T ) := εT + k |N | ε . 26 Pr o of. The algorithm is pro jected gradient descent. More precisely , because ∥ s − b ∥ 1 ≤ 2 for all ( s , b ) ∈ N × B , we ha ve for all ( a t , b t ) ∈ A × B , that ∥ v (1) ( a t , b t ) ∥ 2 ≤ ∥ v (1) ( a t , b t ) ∥ 1 ≤ 2 . Therefore, standard regret analyses of pro jected gradien t descen t with step size η ← ε 2 (e.g., Theorem 3.2, [ Bub15 ]) giv es the result, b ecause U (1) has ℓ 2 radius at most p k |N | . Corollary 4 (Online multiclass omniprediction) . L et L b e a family of loss functions and C , C ′ b e families of c omp ar ators satisfying ( 34 ) . 
Assume ther e exists an online le arner alg (2) that takes inputs ( v [ T ] , x [ T ] ) ∈ ( B k 2 (2)) T × X T , and outputs ℓ [ T ] ∈ L T , c [ T ] ∈ ( C ′ ) T , such that ( ℓ t , c t ) dep ends only on v [ t − 1] , x [ t − 1] , and sup ( ℓ, c ) ∈L×C X t ∈ [ T ] ⟨ v t , d ℓ ( c ( x t )) − d ℓ t ( c t ( x t )) ⟩ ≤ reg ( T ) , (41) for reg : N → R ≥ 0 such that al l T ≥ T L , C satisfy reg ( T ) T ≤ ε . Then if T = Ω( k ( 1 ε ) k +1 + 1 ε 2 log( 1 δ )) + T L , C , we c an pr o duc e p [ T ] ∈ [0 , 1] T , a 12 ε -omnipr e dictor for ( x [ T ] , y [ T ] , L , C ) , with pr ob ability ≥ 1 − δ . Pr o of. The proof is completely analogous to Corollary 2 . W e substitute Lemma 19 and ( 41 ) for Lemma 10 and ( 28 ), and note that we may take L = 2 in ( 19 ) by the ℓ ∞ - ℓ 1 Hölder’s inequality . W e p ostp one discussion of implementing the CMLOO for a momen t, but supp ose w e hav e a 2 ε -CMLOO. Then, Corollary 1 yields a sequence p [ T ] ∈ (∆ k ) T suc h that with probability ≥ 1 − δ , sup u ( i ) ∈U ( i ) 1 T X t ∈ [ T ] D u ( i ) ( x t ) , v ( i ) ( e p t , b t ) E ≤ 2 ε + reg ( i ) ( T ) + 28 q T log ( 4 δ ) T (42) for i ∈ [2] . When i = 1 , the guarantee in ( 42 ) corresp onds to calibration against the entire ℓ ∞ -norm ball in dimension k |N | , which encompasses W -calibration for W in Prop osition 1 , under the scaling assumption ( 34 ). When i = 2 , the guarantee in ( 42 ) corresp onds to F -multiaccuracy as required b y Proposition 1 . F or large enough T as sp ecified, we th us ha v e 5 ε - W -calibration using Lemma 19 as alg (1) , and 4 ε - F -multiaccuracy using ( 41 ), and Proposition 1 gives the claim. It remains to give a 2 ε -CMLOO. F or this w e use Lemma 18 . Comparing the definitions ( 37 ) and ( 40 ), M (1) is simply the identit y matrix in dimension k |N | , and M (2) is 1 k ⊗ I N , where ⊗ denotes the Kroneck er pro duct. This matrix has one-sparse columns, so it satisfies    M (2)    1 → 1 = 1 = ⇒     M (2)  ∗    ∞→∞ = 1 . Hence, we may tak e R = 1 in ( 40 ), b ecause b oth U (1) and U (2) are con tained in the ℓ ∞ balls of their resp ectiv e dimension. The result now follo ws from Lemma 18 . Statistical setting. W e let H (1) and H (2) b e the Hilbert spaces of (norm) square-in tegrable functions under D , with ranges R N × k , R k , resp ectively , with the standard L 2 ( D ) inner products. 27 Next, we tak e A to b e functions taking each x ∈ X → a ( x ) ∈ ∆ N , i.e., A :=  a : X → ∆ N  . (43) Our pay off vectors will again b e indep endent of b ∈ B , so we omit it from our notation. Also, let U (1) := ∆ N , v (1) ( a )( x , y ) := { [ a ( x )] s ( s − y ) } s ∈N , U (2) := { d ℓ ◦ c } ℓ ∈L , c ∈C , v (2) ( a )( x , y ) := E p ∼ a ( x ) [ p − y ] . (44) W e last require an online learner for U (1) in the statistical setting. Lemma 20. F ol lowing definitions ( 43 ) , ( 44 ) , ther e exists alg (1) such that for any a [ T ] ∈ A T , alg (1) outputs u (1) [ T ] ∈ ( U (1) ) T such that u (1) t dep ends only on a [ t − 1] , and ( 18 ) holds with reg (1) ( T ) := εT + 10 k |N | ε + 32 s T log  2 δ  , with pr ob ability ≥ 1 − δ , wher e for e ach t ∈ [ T ] , we r e quir e one i.i.d. dr aw ( x t , y t ) ∼ D . Pr o of. W e pattern our pro of off of Lemma 1 , although we require a few differences to obtain the sp ecific form of regret b ound here. The k ey observ ation is that for all a ∈ A , the definitions ( 44 ) giv e |  u (1) , v (1) ( a )  | ≤ 2 using the ℓ ∞ - ℓ 1 Hölder’s inequalit y . 
Our strategy is then to play (sto chastic) pro jected gradient descen t (PGD) against the { v (1) ( a t ) } t ∈ [ T ] . T o simplify notation, let g t := E ( x , y ) ∼D h v (1) ( a t )( x , y ) i , ˜ g t := v (1) ( a t )( x t , y t ) , d t := g t − ˜ g t , and observe that under our sampling assumptions, ˜ g t is unbiased for g t conditioned on the history of the algorithm if we use a held out i.i.d. sample ( x t , y t ) . Also, |  ˜ g t , u (1)  | ≤ 2 holds for all t ∈ [ T ] and u (1) ∈ U (1) , and max t ∈ [ T ] max {∥ g t ∥ 1 , ∥ ˜ g t ∥ 1 } ≤ 2 . No w, we define u 1 ← 0 N × k , and our iterates u t using PGD with step size η > 0 and the {− ˜ g t } t ∈ [ T ] , u (1) t ← argmin u (1) ∈U (1)     u (1) −  u (1) t − 1 + η ˜ g t − 1     2 2  . (45) W e also define a “ghost iterate” sequence of w [ T +1] ∈ ( U (1) ) T +1 that sets w 1 = u (1) 1 , but up dates using d t − 1 in place of ˜ g t − 1 in ( 45 ). Standard PGD analysis (e.g., Theorem 3.2, [ Bub15 ]) sho ws X t ∈ [ T ] D ˜ g t , u (1) − u (1) t E ≤ 2 η T + k |N | 2 η , X t ∈ [ T ] D d t , u (1) − w t E ≤ 8 η T + k |N | 2 η , sim ultaneously hold for all u (1) ∈ U (1) . Summing and rearranging yields X t ∈ [ T ] D g t , u (1) − u (1) t E ≤ 10 η T + k |N | η + X t ∈ [ T ] D d t , w t − u (1) t E . 28 No w the last term abov e is the sum of T conditionally mean-zero terms, eac h of which is b ounded in [ − 8 , 8] . Thus b y the Azuma-Ho effding inequality , with probabilit y ≥ 1 − δ , X t ∈ [ T ] D g t , u (1) − u (1) t E ≤ 10 η T + k |N | η + 32 s T log  2 δ  , and supremizing this o ver u (1) ∈ U (1) and setting η ← ε 10 giv es the claim. Corollary 5 (Statistical multiclass omniprediction) . L et L b e a family of loss functions and C , C ′ b e families of c omp ar ators satisfying ( 22 ) . Assume ther e exists an online le arner alg (2) that takes inputs { v t : X × ∂ ∆ k → B k 2 (2) } t ∈ [ T ] , and outputs ℓ [ T ] ∈ L T , c [ T ] ∈ ( C ′ ) T , such that ( ℓ t , c t ) dep ends only on v [ t − 1] , and sup ( ℓ, c ) ∈L×C X t ∈ [ T ] ⟨ v t , d ℓ ( c ) − d ℓ t ( c t ) ⟩ ≤ reg ( T ) , (46) for reg : N → R ≥ 0 such that al l T ≥ T L , C satisfy reg ( T ) T ≤ ε with pr ob ability ≥ 1 − δ 2 . Then if T = Ω( k ( 1 ε ) k +1 + 1 ε 2 log( 1 δ )) + T L , C , we c an pr o duc e p : X → [0 , 1] , a 9 ε -omnipr e dictor for ( D , L , C ) , with pr ob ability ≥ 1 − δ , given T i.i.d. samples fr om D . Pr o of. The pro of is the same as Corollary 4 (with mo difications analogous to Corollary 3 vis- à-vis Corollary 2 ), where we use Lemma 20 and ( 46 ) instead of Lemma 19 and ( 41 ). In our construction of the CMLOO in the statistical setting, w e note that the matrices M (1) and M (2) in ( 37 ) are again I N × k and 1 k ⊗ I N , so Lemma 18 again yields a 2 ε -CMLOO that outputs a function a t : X → ∆ N , from which we can sample random predictions p t : X → ∆ N . As in Corollary 3 , our final omnipredictor ev aluates a uniform randomly sampled p t , ov er the range t ∈ [ T ] . 5.3 Generalized linear models In this section, w e sp ecialize Corollaries 4 and 5 to the setting of m ulticlass generalized linear mo dels, where L := L GLM as defined in ( 5 ), and C := C lin where C lin := n c ( x ) := Cx | C ∈ R k × d , ∥ C ∥ 2 →∞ ≤ 1 o . (47) In other words, C has sub-unit norm ro ws. This is the natural family of classifiers because it tak es x ∈ X to c ( x ) ∈ [ − 1 , 1] k , under the scaling b ounds in ( 34 ). Analogously to Section 4.4 , we conflate a function c ∈ C lin with the asso ciated linear classifier C ∈ R k × d via capitalization. 
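To make the comparator class (47) concrete, here is a minimal sketch (helper names are ours) of evaluating $c(x) = Cx$ and of the row-wise Euclidean projection onto the $\|\cdot\|_{2\to\infty}$ unit ball; this projection is the step a projected-gradient learner over $\mathcal{C}_{\mathrm{lin}}$ (cf. Lemma 21 below) would use.

```python
import numpy as np

def project_rows_to_unit_ball(C: np.ndarray) -> np.ndarray:
    """Project C onto {C : every row has ||row||_2 <= 1} by rescaling long rows.

    The constraint set is a product over rows, so the Euclidean projection
    simply projects each row onto the unit ball independently.
    """
    row_norms = np.linalg.norm(C, axis=1, keepdims=True)
    return C / np.maximum(row_norms, 1.0)

def evaluate_comparator(C: np.ndarray, x: np.ndarray) -> np.ndarray:
    """c(x) = C x, landing in [-1, 1]^k when ||x||_2 <= 1 and rows are unit-bounded."""
    return C @ x
```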
W e again observ e that b ecause d ℓ is negation for all ℓ ∈ L GLM b y ( 6 ), and C lin is closed under negation, we can equiv alen tly set F ← C lin in applications of Prop osition 1 . Our last ingredien ts are online learners satisfying ( 41 ), ( 46 ). Lemma 21. Assuming ( 34 ) holds, ther e exists alg (2) such that for any ( v [ T ] , x [ T ] ) ∈ ( B k 2 (2)) T × X T , alg (2) outputs C [ T ] ∈ ( B k × d 2 →∞ (1)) T in O ( dk T ) time, such that C t only dep ends on v [ t − 1] , x [ t − 1] , and sup C ∈C lin X t ∈ [ T ] ⟨ v t ⊗ x t , C − C t ⟩ ≤ 2 √ k T . 29 Pr o of. This follo ws from Lemma 2 with X ← B k × d 2 →∞ and r ( C ) := 1 2 ∥ C ∥ 2 F (i.e., half the squared en trywise ℓ 2 norm). Note that for all t ∈ [ T ] , because v t ⊗ x t is rank-one, ∥ v t ⊗ x t ∥ F = ∥ v t ⊗ x t ∥ op = ∥ v t ∥ 2 ∥ x t ∥ 2 ≤ 2 . Th us, the only adjustments compared to Lemma 14 is that no w w e hav e L = 2 and Θ ≤ k 2 . Lemma 22. L et δ ∈ (0 , 1) . Assuming ( 34 ) holds, ther e exists alg (2) such that for any { v t : X × ∂ ∆ k → B k 2 (2) } t ∈ [ T ] , alg (2) outputs C [ T ] ∈ ( B k × d 2 →∞ (1)) T in O ( dk T ) time, such that C t only dep ends on v [ t − 1] , and sup C ∈C lin X t ∈ [ T ] E ( x , y ) ∼D [ ⟨ v t ( x , y ) ⊗ x , C − C t ⟩ ] ≤ 40 s k T log  2 δ  , with pr ob ability ≥ 1 − δ , wher e for e ach t ∈ [ T ] , we r e quir e one i.i.d. dr aw ( x t , y t ) ∼ D . Pr o of. The proof is identical to L emma 21 , where w e use Lemma 1 in place of Lemma 2 . W e conclude with our main result on omnipredicting m ulticlass generalized linear mo dels. Theorem 5 (Multiclass generalized linear models) . L et δ ∈ (0 , 1) , let L := L GLM and C := C lin define d in ( 5 ) , ( 47 ) r esp e ctively, and assume ( 34 ) holds. Then if T = k Ω  1 ε  k +1 + Ω log  1 δ  ε 2 !! for an appr opriate c onstant, in the online setting, we c an output p [ T ] ∈ (∆ k ) T , an ε -omnipr e dictor for ( x T , y [ T ] , L , C ) , in time O ( dk T ) + O ( 1 ε ) 2 k p oly ( k , log 1 δ ε ) with pr ob ability ≥ 1 − δ . In the sta- tistic al setting, we c an output p : X → ∆ k , an ε -omnipr e dictor for ( D , L , C ) , in time O ( dk T ) + O ( 1 ε ) 2 k p oly ( k , log 1 δ ε ) with pr ob ability ≥ 1 − δ , given T i.i.d. samples fr om D , such that p c an b e evaluate d in time O ( dk ) + O ( 1 ε ) k +1 p oly ( k , log 1 δ ε ) on any x ∈ X with pr ob ability ≥ 1 − δ . Pr o of. F or the online omniprediction result, w e combine Lemma 21 and Corollary 4 , and for the statistical omniprediction result, w e com bine Lemma 22 and Cor ollary 5 . W e note that the run- time cost of eac h iteration is dominated b y the O ( dk ) time for computing c t ( x t ) , and the cost of Lemma 18 . This also applies to the cost of ev aluating p on a fresh sample. 5.4 General classifiers and losses In this section, we sp ecialize Corollaries 4 and 5 to the setting of general multiclass mo dels. Analo- gously to Section 5.3 , we consider general loss functions L and general function class C that satisfy ( 34 ). W e again require online learners satisfying ( 41 ), ( 46 ). Our multiclass online learning results apply the binary online learners from Section 4.5 in a black- b o x w ay . It is p ossible that tigh ter c haracterizations in the m ulticlass setting are p ossible (e.g., in the dep endence on k ), esp ecially for sp ecific structured ( C , L ) . W e demonstrated an example of this in Section 5.3 , and lea v e a more general theory to future work. 
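The guarantees in the next two lemmas are stated via coordinatewise Rademacher complexities. As a rough aid, the following is a minimal Monte Carlo sketch, under the assumption of a finite function class represented by its value matrix on a fixed sample, of the statistical Rademacher average from Definition 7; the sequential complexity $\mathrm{srad}_T$ admits no comparably simple estimator, and this snippet is purely illustrative.

```python
import numpy as np

def rademacher_estimate(fvals: np.ndarray, num_sign_draws: int = 200, seed: int = 0) -> float:
    """Monte Carlo estimate of the empirical Rademacher average of a finite class.

    fvals has shape (num_functions, T): fvals[j, t] = f_j(x_t) on a fixed sample
    x_1, ..., x_T. Averaging this quantity over fresh samples {x_t} ~ D estimates
    rad_T(F) from Definition 7.
    """
    rng = np.random.default_rng(seed)
    _, T = fvals.shape
    total = 0.0
    for _ in range(num_sign_draws):
        sigma = rng.choice([-1.0, 1.0], size=T)
        total += np.max(fvals @ sigma) / T   # sup over the (finite) class
    return total / num_sign_draws
```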
This section is included primarily 30 to highlight ho w to apply our techniques in a general setting, as our pap er’s fo cus is developing the omniprediction framework rather than m ulticlass learning for sp ecific comparators. Applying Theorem 4.5 of [ OKK25 ] coordinatewise, w e obtain the following lemma. Lemma 23. In the setting of Cor ol lary 4 , let F := { d ℓ ◦ c } ℓ ∈L , c ∈C . Assuming ( 34 ) holds for family of loss functions L and families of c omp ar ators C and C ′ , ther e exists alg (2) such that for any ( v [ T ] , x [ T ] ) ∈ ( B k 2 (2)) T × X T , alg (2) outputs c [ T ] ∈ C ′ , such that c t only dep ends on v [ t − 1] , x [ t − 1] , and sup c ∈C X t ∈ [ T ] ⟨ v t , d ℓ ( c ( x )) − d ℓ t ( c t ( x )) ⟩ ≤ T · X i ∈ [ k ] srad T ( F i ) , wher e F i c onsists of functions x 7→ [ f ( x )] i for f ∈ F , with [ f ( x )] i b eing the i th c o or dinate of f ( x ) . Similarly , applying Lemma 7.4 and Lemma 7.6 of [ OKK25 ] co ordinatewise yields the following. Lemma 24. In the setting of Cor ol lary 5 , let δ ∈ (0 , 1) and F := { d ℓ ◦ c } ℓ ∈L , c ∈C . L et V b e a family of functions v : X × ∂ ∆ k → B k 2 (2) . Ther e exists alg (2) such that for any v [ T ] ∈ V T , alg (2) outputs c [ T ] , making O ( T 1 . 5 ) c al ls to an ERM or acle for e ach F i over T samples p er iter ation, such that c t ∈ C ′ only dep ends on v [ t − 1] , and for a universal c onstant C , sup c ∈C X t ∈ [ T ] E ( x , y ) ∼D [ ⟨ v t ( x , y ) , d ℓ ( c ( x )) − d ℓ t ( c t ( x )) ⟩ ] ≤ C   k r T · log k δ + T · X i ∈ [ k ] rad T ( F i · V i )   , with pr ob ability ≥ 1 − δ , wher e for e ach t ∈ [ T ] , we r e quir e T i.i.d. dr aws ( x t , y t ) ∼ D . In the statemen t of Lemma 17 , the class F i · V i consists of functions ( x , y ) 7→ [ f ( x )] i [ v ( x , y )] i for f ∈ F and v ∈ V , with [ f ( x )] i , [ v ( x , y )] i b eing the i th co ordinates of f ( x ) , v ( x , y ) , respectively . W e conclude with our main result on multiclass omniprediction in the general setting. Theorem 6 (General multiclass omnprediction) . L et L b e a family of loss functions and C b e a family of c omp ar ators such that ( 34 ) holds, let F := { d ℓ ◦ c } ℓ ∈L , c ∈C , and let δ ∈ (0 , 1) . L et T seq L , C , T stat L , C b e such that al l T ≥ T seq L , C satisfy P k i =1 srad T ( F i ) ≤ ε 9 , and al l T ≥ T stat L , C satisfy P k i =1 rad T ( F i · V i ) ≤ ε 18 C , wher e V c onsists of functions ( x , y ) → p ( x ) − y for al l p ossible p : X → ∆ k outputte d by the CMLOO in L emma 18 given classes C , L . Then if T = Ω k  1 ε  k +1 + log 1 δ ε 2 ! + T seq L , C , for an appr opriate c onstant, in the online setting, we c an output p [ T ] ∈ (∆ k ) T , an ε -omnipr e dictor for ( x [ T ] , y [ T ] , L , C ) , with pr ob ability ≥ 1 − δ . If T = Ω k  1 ε  k +1 + k 2 log 1 δ ε 2 ! + T stat L , C , for an appr opriate c onstant, in the statistic al setting, we c an output p : X → ∆ k , an ε -omnipr e dictor for ( D , L , C ) , with pr ob ability ≥ 1 − δ , given T 2 i.i.d. samples fr om D and O ( T 2 . 5 ) c al ls to the ERM or acle for e ach F i . 31 Pr o of. The proof is largely the same as the pro of for Theorem 5 . F or the online omniprediction result, we com bine Lemma 23 and Corollary 4 , and for the statistical omniprediction result, we com bine Lemma 24 and Corollary 5 . The oracle complexit y comes from Lemma 24 . 6 Unions of Comparators In this section, we show case the flexibility of our framew ork b y applying it to omniprediction against a union of c omp ar ators . 
Let L b e a family of losses ℓ : [ − 1 , 1] k × ∂ ∆ k → [ − 1 , 1] , and let C ( i ) b e a comparator family satisfying ( 34 ) for all i ∈ [ m ] . Our goal is to learn an ( L , C ) -omnipredictor for C := [ i ∈ [ m ] C i . In other words, w e wish to b e comp etitiv e against the b est c in any C i . F or simplicit y , here w e fo cus on the online setting, although similar extensions for statistical omniprediction are straightforw ard. T o design an online omnipredictor against ( L , C ) , we define a simultaneous approachabil ity instance as follows: w e define ( A , B ) as in ( 35 ), and let U ( i ) := { d ℓ ◦ c i } ℓ ∈L , c ∈C ( i ) , v ( i ) ( a , b ) := E p ∼ a [ p − b ] , for all i ∈ [ m ] , U ( m +1) := [ − 1 , 1] N × k , v ( m +1) ( a , b ) := { a s ( s − b ) } s ∈N . In other w ords, there is one approac hability set U ( i ) for eac h comparator class C ( i ) , and the ( m + 1) th approac hability set is defined analogously to U (1) in ( 40 ). Theorem 7. L et L b e a family of loss functions and C ( i ) b e a family c omp ar ators for al l i ∈ [ m ] , such that ( 34 ) holds for L and every C ← C ( i ) , let F ( i ) := { d ℓ ◦ c } ℓ ∈L , c ∈C ( i ) , and let δ ∈ (0 , 1) . L et T seq L , C b e such that al l T ≥ T seq L , C and i ∈ [ m ] satisfy X j ∈ [ k ] srad T ( F ( i ) j ) ≤ ε 9 . Then if T = Ω k  1 ε  k +1 + log m δ ε 2 ! + T seq L , C for an appr opriate c onstant, in the online setting, we c an output p [ T ] ∈ (∆ k ) T , an ε -omnipr e dictor for ( x [ T ] , y [ T ] , L , S i ∈ [ m ] C ( i ) ) , with pr ob ability ≥ 1 − δ . Pr o of. The pro of is almost exactly identical to Theorem 5 , sav e for tw o changes. First, the additive regret term in Corollary 1 now scales with log ( m δ ) (as there are m + 1 approachabilit y sets). Second, the CMLOO in Lemma 18 now m ust hold for m + 1 inputs. How ev er, when applying Lemma 18 (sp ecifically following the notation ( 37 )), ev ery M ( i ) is iden tical for i ∈ [ m ] , and we b ounded the quan tity ( 38 ) for M ( m +1) already in Corollary 4 . Th us, the same pro of holds and we simply adjust the logarithmic term in the T low er b ound. 32 W e remark that all of our main results generalize to unions of comparators; indeed, the binary omniprediction CMLOO construction in Lemma 11 also has a simple extension to this setting. Our framew ork is ev en capable of handling unions of loss families in muc h the same wa y , where w e define an approac hability set to ensure multiaccuracy for eac h pairing of a loss family and a comparator class, although w e omit this extension to a v oid tedium. 33 References [ABH11] Jacob D. Ab ernethy , P eter L. Bartlett, and Elad Hazan. Blackw ell approac hability and no-regret learning are equiv alent. In COL T 2011 - The 24th A nnual Confer enc e on L e arning The ory , volume 19 of JMLR Pr o c e e dings , pages 27–46. JMLR.org, 2011. [BCD + 22] Nataly Brukhim, Daniel Carmon, Irit Din ur, Sha y Moran, and Amir Y eh uday off. A c haracterization of m ulticlass learnabilit y . In 2022 IEEE 63r d Annual Symp osium on F oundations of Computer Scienc e (FOCS) , pages 943–955. IEEE, 2022. [BDCBL92] Shai Ben-David, Nicolo Cesa-Bianchi, and Philip M Long. Characterizations of learn- abilit y for classes of { O,. . . , n } -v alued functions. In Pr o c e e dings of the fifth annual workshop on Computational le arning the ory , pages 333–340, 1992. [BDEL03] Shai Ben-Da vid, Nadav Eiron, and Philip M Long. On the difficult y of approximately maximizing agreements. 
Journal of Computer and System Scienc es , 66(3):496–514, 2003. [Bla56] Da vid Blac kw ell. An analog of the minimax theorem for v ector pay offs. 1956. [BP13] Nik o Brummer and Johan du Preez. The P A V algorithm optimizes binary prop er scoring rules. arXiv pr eprint arXiv:1304.2331 , 2013. [Bub15] Sébastien Bubeck. Conv ex optimization: Algorithms and complexit y . F oundations and T r ends in Machine L e arning , 8(3-4):231–357, 2015. [CP23] Moses Charik ar and Chirag Pabbara ju. A c haracterization of list learnability . In Pr o c e e dings of the 55th A nnual A CM Symp osium on The ory of Computing , pages 1713–1726, 2023. [DDS + 09] Jia Deng, W ei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li F ei-F ei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE c onfer enc e on c omputer vision and p attern r e c o gnition , pages 248–255. Ieee, 2009. [Den12] Li Deng. The mnist database of handwritten digit images for mac hine learning re- searc h. IEEE signal pr o c essing magazine , 29(6):141–142, 2012. [DHI + 25] Cyn thia Dwork, Chris Hays, Nicole Immorlica, Juan C. Perdomo, and Pranay T ank ala. F rom fairness to infinity: Outcome-indistinguishable (omni)prediction in evolving graphs. In The Thirty Eighth A nnual Confer enc e on L e arning The ory , v olume 291 of Pr o c e e dings of Machine L e arning R ese ar ch , pages 1564–1637. PMLR, 2025. [DSBDSS15] Amit Daniely , Siv an Sabato, Shai Ben-David, and Shai Shalev-Shw artz. Multiclass learnabilit y and the erm principle. J. Mach. L e arn. R es. , 16(1):2377–2404, 2015. [F GMS25] Maxw ell Fishelson, Noah Golowic h, Mehry ar Mohri, and Jon Sc hneider. High- dimensional calibration from sw ap regret. CoRR , abs/2505.21460, 2025. [FKM + 21] Gaëtan F ournier, Eden Kuperwasser, Orin Munk, Eilon Solan, and A visha y W ein- baum. Approac hability with constraints. Eur. J. Op er. R es. , 292(2):687–695, 2021. [F os99] Dean P F oster. A pro of of calibration via blackw ell’s approac hability theorem. Games and Ec onomic Behavior , 29(1-2):73–78, 1999. 34 [FV98] Dean P F oster and Rak esh V V ohra. Asymptotic calibration. Biometrika , 85(2):379– 390, 1998. [GH25] P arikshit Gopalan and Lunjia Hu. Calibration through the lens of indistinguishability . CoRR , abs/2509.02279, 2025. [GHK + 23] P arikshit Gopalan, Lunjia Hu, Michael P . Kim, Omer Reingold, and Udi Wieder. Loss minimization through the lens of outcome indistinguishability . In 14th Innovations in The or etic al Computer Scienc e Confer enc e, ITCS 2023 , volume 251 of LIPIcs , pages 60:1–60:20. Schloss Dagstuhl - Leibniz-Zen trum für Informatik, 2023. [GHR24] P arikshit Gopalan, Lunjia Hu, and Guy N. Rothblum. On computationally efficien t m ulti-class calibration. In The Thirty Seventh Annual Confer enc e on L e arning The ory , v olume 247 of Pr o c e e dings of Machine L e arning R ese ar ch , pages 1983–2026. PMLR, 2024. [GJRR24] Sumegha Garg, Christopher Jung, Omer Reingold, and Aaron Roth. Oracle efficien t online m ulticalibration and omniprediction. In Pr o c e e dings of the 2024 ACM-SIAM Symp osium on Discr ete Algorithms, SODA 2024 , pages 2725–2792. SIAM, 2024. [GKR + 22] P arikshit Gopalan, A dam T auman Kalai, Omer Reingold, V atsal Sharan, and Udi Wieder. Omnipredictors. In 13th Innovations in The or etic al Computer Scienc e Confer- enc e, ITCS 2022 , volume 215 of LIPIcs , pages 79:1–79:21. Schloss Dagstuhl - Leibniz- Zen trum für Informatik, 2022. [GKR23] P arikshit Gopalan, Michael P . Kim, and Omer Reingold. 
Swap agnostic learning, or c haracterizing omniprediction via m ulticalibration. In A dvanc es in Neur al Information Pr o c essing Systems 36: A nnual Confer enc e on Neur al Information Pr o c essing Systems 2023, NeurIPS 2023 , 2023. [GPSW17] Ch uan Guo, Geoff Pleiss, Y u Sun, and Kilian Q. W einberger. On calibration of mo dern neural net works. In Pr o c e e dings of the 34th International Confer enc e on Machine L e arning, ICML 2017 , v olume 70 of Pr o c e e dings of Machine L e arning R ese ar ch , pages 1321–1330. PMLR, 2017. [GR07] Tilmann Gneiting and Adrian E Raftery . Strictly prop er scoring rules, prediction, and estimation. Journal of the A meric an statistic al Asso ciation , 102(477):359–378, 2007. [HKRR18] Úrsula Héb ert-Johnson, Michael P . Kim, Omer Reingold, and Guy N. Rothblum. Mul- ticalibration: Calibration for the (computationally-identifiable) masses. In Pr o c e e dings of the 35th International Confer enc e on Machine L e arning, ICML 2018 , volume 80 of Pr o c e e dings of Machine L e arning R ese ar ch , pages 1944–1953. PMLR, 2018. [HNR Y23] Lunjia Hu, Inbal Rac hel Livni Na von, Omer Reingold, and Ch utong Y ang. Om- nipredictors for constrained optimization. In International Confer enc e on Machine L e arning, ICML 2023 , volume 202 of Pr o c e e dings of Machine L e arning R ese ar ch , pages 13497–13527. PMLR, 2023. [HTY25] Lunjia Hu, Kevin Tian, and Chutong Y ang. Omnipredicting single-index mo dels with m ulti-index models. In Pr o c e e dings of the 57th A nnual ACM Symp osium on The ory of Computing , pages 1762–1773, 2025. 35 [HV25] Lunjia Hu and Salil V adhan. Generalized and unified equiv alences b etw een hardness and pseudo entrop y . In The ory of Crypto gr aphy: 23r d International Confer enc e, TCC 2025, A arhus, Denmark, De c emb er 1–5, 2025, Pr o c e e dings, Part IV , page 258–288, Berlin, Heidelb erg, 2025. Springer-V erlag. [JP78] Da vid S. Johnson and F ranco P Preparata. The densest hemisphere problem. The o- r etic al Computer Scienc e , 6(1):93–107, 1978. [KF15] Meelis Kull and Peter A. Flach. Nov el decomp ositions of prop er scoring rules for classification: Score adjustmen t as precursor to calibration. In Machine L e arning and Know le dge Disc overy in Datab ases - Eur op e an Confer enc e, ECML PKDD 2015, Porto, Portugal, Septemb er 7-11, 2015, Pr o c e e dings, Part I , v olume 9284 of L e ctur e Notes in Computer Scienc e , pages 68–85. Springer, 2015. [KLST23] Bobb y Klein b erg, Renato Paes Leme, Jon Schneider, and Yifeng T eng. U-calibration: F orecasting for an unknown agen t. In The Thirty Sixth Annual Confer enc e on L e arning The ory, COL T 2023 , volume 195 of Pr o c e e dings of Machine L e arning R ese ar ch , pages 5143–5145. PMLR, 2023. [KPK + 19] Meelis Kull, Miquel Perelló-Nieto, Markus Kängsepp, T elmo de Menezes e Silv a Filho, Hao Song, and Peter A. Flach. Beyond temp erature scaling: Obtaining well-calibrated m ulti-class probabilities with diric hlet calibration. In A dvanc es in Neur al Information Pr o c essing Systems 32: A nnual Confer enc e on Neur al Information Pr o c essing Systems 2019, NeurIPS 2019 , pages 12295–12305, 2019. [KS09] A dam T auman Kalai and Ravi Sastry . The isotron algorithm: High-dimensional iso- tonic regression. In COL T 2009 - The 22nd Confer enc e on L e arning The ory , 2009. [LNPR22] Daniel Lee, Georgy Noaro v, Mallesh M. Pai, and Aaron Roth. Online minimax mul- tiob jectiv e optimization: Multicalib eating and other applications. 
In A dvanc es in Neur al Information Pr o c essing Systems 35: Annual Confer enc e on Neur al Informa- tion Pr o c essing Systems 2022, NeurIPS 2022 , 2022. [LRS25] Jiuy ao Lu, Aaron Roth, and Mirah Shi. Sample efficien t omniprediction and do wn- stream sw ap regret for non-linear losses. In The Thirty Eighth A nnual Confer enc e on L e arning The ory , volume 291 of Pr o c e e dings of Machine L e arning R ese ar ch , pages 3829–3878. PMLR, 2025. [MDP + 11] Andrew Maas, Raymond E Daly , P eter T Pham, Dan Huang, Andrew Y Ng, and Christopher P otts. Learning word vectors for sen timent analysis. In Pr o c e e dings of the 49th annual me eting of the asso ciation for c omputational linguistics: Human language te chnolo gies , pages 142–150, 2011. [MPS14] Shie Mannor, Vianney Perc het, and Gilles Stoltz. Approac habilit y in unkno wn games: Online learning meets multi-ob jective optimization. In Pr o c e e dings of The 27th Con- fer enc e on L e arning The ory, COL T 2014 , volume 35 of JMLR W orkshop and Confer- enc e Pr o c e e dings , pages 339–355. JMLR.org, 2014. [MS10] Shie Mannor and Gilles Stoltz. A geometric pro of of calibration. Mathematics of Op er ations R ese ar ch , 35(4):721–727, 2010. 36 [Nat89] Balas K Natara jan. On learning sets and functions. Machine L e arning , 4(1):67–97, 1989. [NRRX25] Georgy Noaro v, Ramy a Ramalingam, Aaron Roth, and Stephan Xie. High-dimensional prediction for sequen tial decision making. In F orty-se c ond International Confer enc e on Machine L e arning, ICML 2025 , 2025. [OKK25] Princewill Ok oroafor, Rob ert Klein b erg, and Mic hael P . Kim. Near-optimal algorithms for omniprediction. CoRR , abs/2501.17205, 2025. [P en25] Bingh ui Peng. High dimensional online calibration in p olynomial time. CoRR , abs/2504.09096, 2025. [Ro c70a] R. Tyrell Ro ck afellar. Convex Analysis . Princeton Universit y Press, 1970. [Ro c70b] Ralph Ro c k afellar. On the maximal monotonicity of sub differential mappings. Pacific Journal of Mathematics , 33(1):209–216, 1970. [RST15] Alexander Rakhlin, Karthik Sridharan, and Ambuj T ewari. Online learning via se- quen tial complexities. J. Mach. L e arn. R es. , 16:155–186, 2015. [Sha15] Ohad Shamir. The sample complexity of learning linear predictors with the squared loss. J. Mach. L e arn. R es. , 16:3475–3486, 2015. [vdBLL + 21] Jan v an den Brand, Yin T at Lee, Y ang P . Liu, Thatchaphol Saran urak, Aaron Sidford, Zhao Song, and Di W ang. Minimum cost flo ws, mdps, and ℓ 1 -regression in nearly linear time for dense instances. In STOC ’21: 53r d A nnual ACM SIGACT Symp osium on The ory of Computing , pages 859–869. A CM, 2021. [V er18] Roman V ershynin. High-dimensional pr ob ability: An intr o duction with applic ations in data scienc e , v olume 47. Cam bridge universit y press, 2018. [ZKS + 21] Sheng jia Zhao, Michael P . Kim, Roshni Saho o, T engyu Ma, and Stefano Ermon. Calibrating predictions to decisions: A nov el approach to multi-class calibration. In A dvanc es in Neur al Information Pr o c essing Systems 34: Annual Confer enc e on Neur al Information Pr o c essing Systems 2021, NeurIPS 2021, De c emb er 6-14, 2021, virtual , pages 22313–22324, 2021. 37 A Deferred Pro ofs Prop osition 1. 
Following notation from Definition 1 and (4), in the online setting, if $p_{[T]}$ satisfies $\varepsilon_1$-$(x_{[T]}, y_{[T]}, \mathcal{F})$-multiaccuracy and $\varepsilon_2$-$(y_{[T]}, \mathcal{W})$-calibration for
$$\mathcal{F} := \{d_\ell \circ c\}_{\ell \in \mathcal{L}, c \in \mathcal{C}}, \qquad \mathcal{W} := \{-d_\ell \circ k^\star_\ell\}_{\ell \in \mathcal{L}},$$
then $p_{[T]}$ is an $(\varepsilon_1 + \varepsilon_2)$-omnipredictor for $(x_{[T]}, y_{[T]}, \mathcal{L}, \mathcal{C})$. In the statistical setting, if $p : \mathbb{R}^d \to \Delta_k$ satisfies $\varepsilon_1$-$(\mathcal{D}, \mathcal{F})$-multiaccuracy and $\varepsilon_2$-$(\mathcal{D}, \mathcal{W})$-calibration, then $p$ is an $(\varepsilon_1 + \varepsilon_2)$-omnipredictor for $(\mathcal{D}, \mathcal{L}, \mathcal{C})$.

Proof. We begin with the statistical setting. By definition of $k^\star_\ell$, for all $\ell \in \mathcal{L}$ and $c \in \mathcal{C}$,
$$\mathbb{E}_{x \sim \mathcal{D}_x}\left[\mathbb{E}_{y \sim p(x)}\left[\ell(k^\star_\ell(p(x)), y)\right]\right] \le \mathbb{E}_{x \sim \mathcal{D}_x}\left[\mathbb{E}_{y \sim p(x)}\left[\ell(c(x), y)\right]\right], \qquad (48)$$
and thus
$$\begin{aligned}
\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(k^\star_\ell(p(x)), y)\right] &= \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(k^\star_\ell(p(x)), y)\right] - \mathbb{E}_{x\sim\mathcal{D}_x}\left[\mathbb{E}_{y\sim p(x)}\left[\ell(k^\star_\ell(p(x)), y)\right]\right] \\
&\quad + \mathbb{E}_{x\sim\mathcal{D}_x}\left[\mathbb{E}_{y\sim p(x)}\left[\ell(k^\star_\ell(p(x)), y)\right]\right] - \mathbb{E}_{x\sim\mathcal{D}_x}\left[\mathbb{E}_{y\sim p(x)}\left[\ell(c(x), y)\right]\right] \\
&\quad + \mathbb{E}_{x\sim\mathcal{D}_x}\left[\mathbb{E}_{y\sim p(x)}\left[\ell(c(x), y)\right]\right] - \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(c(x), y)\right] + \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(c(x), y)\right] \\
&\le \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(k^\star_\ell(p(x)), y)\right] - \mathbb{E}_{x\sim\mathcal{D}_x}\left[\mathbb{E}_{y\sim p(x)}\left[\ell(k^\star_\ell(p(x)), y)\right]\right] \\
&\quad + \mathbb{E}_{x\sim\mathcal{D}_x}\left[\mathbb{E}_{y\sim p(x)}\left[\ell(c(x), y)\right]\right] - \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(c(x), y)\right] + \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(c(x), y)\right],
\end{aligned} \qquad (49)$$
where the second line in (49) was bounded by (48). Taking an expectation of Lemma 3 over $x \sim \mathcal{D}_x$,
$$\begin{aligned}
\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(k^\star_\ell(p(x)), y)\right] - \mathbb{E}_{x\sim\mathcal{D}_x}\left[\mathbb{E}_{y\sim p(x)}\left[\ell(k^\star_\ell(p(x)), y)\right]\right] &= \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\langle d_\ell(k^\star_\ell(p(x))), y - p(x)\rangle\right], \\
\mathbb{E}_{x\sim\mathcal{D}_x}\left[\mathbb{E}_{y\sim p(x)}\left[\ell(c(x), y)\right]\right] - \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(c(x), y)\right] &= \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\langle d_\ell(c(x)), p(x) - y\rangle\right],
\end{aligned}$$
and the conclusion follows by applying Definitions 2 and 3. The online setting is similar:
$$\begin{aligned}
\frac{1}{T}\sum_{t\in[T]} \ell(k^\star_\ell(p_t), y_t) &= \frac{1}{T}\sum_{t\in[T]}\left(\ell(k^\star_\ell(p_t), y_t) - \mathbb{E}_{y\sim p_t}\left[\ell(k^\star_\ell(p_t), y)\right]\right) \\
&\quad + \frac{1}{T}\sum_{t\in[T]}\left(\mathbb{E}_{y\sim p_t}\left[\ell(k^\star_\ell(p_t), y)\right] - \mathbb{E}_{y\sim p_t}\left[\ell(c(x_t), y)\right]\right) \\
&\quad + \frac{1}{T}\sum_{t\in[T]}\left(\mathbb{E}_{y\sim p_t}\left[\ell(c(x_t), y)\right] - \ell(c(x_t), y_t)\right) + \frac{1}{T}\sum_{t\in[T]} \ell(c(x_t), y_t),
\end{aligned}$$
where the second sum is nonpositive by the definition of $k^\star_\ell$, at which point the conclusion again follows from Definitions 2 and 3, because for all $t \in [T]$,
$$\begin{aligned}
\ell(k^\star_\ell(p_t), y_t) - \mathbb{E}_{y\sim p_t}\left[\ell(k^\star_\ell(p_t), y)\right] &= \langle d_\ell(k^\star_\ell(p_t)), y_t - p_t\rangle, \\
\mathbb{E}_{y\sim p_t}\left[\ell(c(x_t), y)\right] - \ell(c(x_t), y_t) &= \langle d_\ell(c(x_t)), p_t - y_t\rangle.
\end{aligned}$$

B Counterexample for Multiclass Isotonic Regression

Our framework for multiclass omniprediction was based on the construction of [OKK25] in the binary setting. Concurrently, another construction of $\approx \varepsilon^{-2}$-sample complexity binary omnipredictors was given by [HTY25] for GLMs. It is thus natural to ask whether the construction in [HTY25] has a multiclass extension. In this section, we show a barrier to such a generalization.

The [HTY25] construction was based on the Isotron algorithm [KS09], which alternates online gradient descent with isotonic regression. In particular, the isotonic regression problem that Isotron repeatedly solves is, for input labels $\{y_i\}_{i\in[n]}$ and some proper loss function $\ell : [0,1]^2 \to \mathbb{R}$,
$$\min_{\{p_i\}_{i\in[n]} \in [0,1]^n} \sum_{i\in[n]} \ell(p_i, y_i), \quad \text{subject to } p_i \le p_{i+1} \text{ for all } i \in [n-1]. \qquad (50)$$
In other words, Isotron finds the best-fitting monotone sequence $\{p_i\}_{i\in[n]}$ with respect to $\{y_i\}_{i\in[n]}$, as measured by $\ell$.
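For reference, the classical pool-adjacent-violators (PAV) routine solves (50) under the squared loss; by [BP13] and Corollary 9 of [HTY25], the resulting monotone fit is simultaneously optimal for every proper loss, which is precisely the property that breaks in the multiclass extension below. The following is a minimal sketch (our own implementation, specialized to the squared loss):

```python
def pav_isotonic(y):
    """Pool-adjacent-violators for (50) with squared loss: returns the
    nondecreasing sequence p minimizing sum_i (p_i - y_i)^2.

    Each block stores (sum of labels, count); adjacent blocks are merged
    whenever their means violate monotonicity.
    """
    blocks = []  # list of [total, count]
    for v in y:
        blocks.append([float(v), 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    out = []
    for total, count in blocks:
        out.extend([total / count] * count)
    return out

# Example: labels (1, 0, 0, 1) give the monotone fit (1/3, 1/3, 1/3, 1).
print(pav_isotonic([1, 0, 0, 1]))
```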
The monotonicity requirement comes from $p_i$ being induced by the gradient of a one-dimensional convex function (for more on this relationship, see Lemma 4 and Section 2.2 of [HTY25]). Crucially, in the binary setting the optimal choice of $\{p_i\}_{i\in[n]}$ is independent of the choice of proper loss $\ell$ in (50) (Corollary 9, [HTY25]; see also [BP13]). This omniprediction property of isotonic regression (50) is then inherited by the overall Isotron framework.

We next state the natural generalization of (50) to the multiclass setting. As [GR07] shows, again any proper loss induces predictions via the gradient of a convex function. A vector field is the gradient of a convex function iff it is cyclically monotone (Theorem B, [Roc70b]), and we can capture this high-dimensional condition via the following extension of (50).

Problem 3. Given $\{(v_i, y_i)\}_{i\in[n]} \subset \mathbb{R}^k \times \partial\Delta_k$, we define the following isotonic regression problem for a proper loss $\ell : \Delta_k \times \Delta_k \to \mathbb{R}$:
$$\{p^\star_i, f^\star_i\}_{i\in[n]} := \operatorname*{argmin}_{\{p_i, f_i\}_{i\in[n]} \in (\Delta_k \times \mathbb{R})^n} \sum_{i\in[n]} \ell(p_i, y_i), \quad \text{subject to } \langle p_j, v_i - v_j\rangle \le f_i - f_j \text{ for all } (i,j) \in [n]\times[n]. \qquad (51)$$

Here, the $\{v_i\}_{i\in[n]}$ should be interpreted as the "unlinked" predictors in a GLM, and the monotonicity condition in (51) is equivalent to requiring $p_i = \nabla\omega(v_i)$ and $f_i = \omega(v_i)$ for all $i \in [n]$, for some convex function $\omega$. This parameterization is implicit in (50) as well, where the inputs $v_i$ are first sorted to define the indexing. For the strategy in [HTY25] to generalize to high dimensions, a reasonable necessary condition is for the same omniprediction property to hold for (51), i.e., that its minimizing $\{p^\star_i\}_{i\in[n]}$ does not depend on the choice of proper loss $\ell$. We give a simple numerical counterexample. Define
$$\ell_{\mathrm{sq}}(p, y) := \frac{1}{2}\|p - y\|_2^2, \qquad \ell_{\log}(p, y) := -\sum_{i\in[k]} \log(p_i)\,\mathbb{I}_{y = e_i}.$$
We minimize (51) with respect to these two proper losses, and the following choices of $\{v_i, y_i\}_{i\in[2]}$ (given as columns):
$$\{v_i\}_{i\in[2]} = \begin{pmatrix} 0 & 1 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}, \qquad \{y_i\}_{i\in[2]} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}. \qquad (52)$$

Lemma 25. The minimizer of (51) with $\ell \leftarrow \ell_{\mathrm{sq}}$ and inputs (52) is
$$\{p^\star_i\}_{i\in[2]} = \begin{pmatrix} \frac{3}{7} & \frac{3}{7} \\ \frac{2}{7} & \frac{4}{7} \\ \frac{2}{7} & 0 \end{pmatrix}. \qquad (53)$$

Proof. Since $v_1 - v_2 = -e_1$, the constraints in (51) are equivalent to $[p_1]_1 \le [p_2]_1$. Our goal is to minimize $\|p_1 - e_1\|_2^2 + \|p_2 - e_2\|_2^2$ subject to this constraint. The constraint is tight at the minimizer, because otherwise $p_2$ could move excess mass from its first coordinate to its second coordinate and decrease its loss. Thus the minimizing $p_1$ and $p_2$ are of the form
$$p_1 = \begin{pmatrix} t \\ \frac{1-t}{2} \\ \frac{1-t}{2} \end{pmatrix}, \qquad p_2 = \begin{pmatrix} t \\ 1-t \\ 0 \end{pmatrix}.$$
The former claim is because Jensen's inequality implies $p_1$ should spread all remaining mass evenly over the last two coordinates, and the latter is because $p_2$ has no incentive to place any mass on the third coordinate. The conclusion follows by solving for the $t$ minimizing $(1-t)^2 + 2\cdot\frac{1}{4}(1-t)^2 + 2t^2$, namely $t = \frac{3}{7}$.

Lemma 26. The minimizer of (51) with $\ell \leftarrow \ell_{\log}$ and inputs (52) is not (53).

Proof. It suffices to check that the following choices attain a better function value:
$$p_1 = \begin{pmatrix} \frac{1}{2} \\ \frac{1}{4} \\ \frac{1}{4} \end{pmatrix}, \qquad p_2 = \begin{pmatrix} \frac{1}{2} \\ \frac{1}{2} \\ 0 \end{pmatrix}.$$
Indeed, the constraint $[p_1]_1 \le [p_2]_1$ is satisfied, and these choices attain $\ell_{\log}$ value $-\log(\frac{1}{4})$, whereas (53) attains $-\log(\frac{3}{7}) - \log(\frac{4}{7}) = -\log(\frac{12}{49}) > -\log(\frac{1}{4})$, because $\frac{12}{49} < \frac{1}{4}$.
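To make the counterexample concrete, the following short script (our own numerical check, not part of the paper) verifies that the candidate (53) is feasible and beats the alternative point from the proof of Lemma 26 under $\ell_{\mathrm{sq}}$, while the alternative attains a strictly smaller $\ell_{\log}$ value.

```python
import numpy as np

# p1_sq, p2_sq are the columns of the squared-loss minimizer (53);
# p1_alt, p2_alt are the feasible alternative from the proof of Lemma 26.
# Labels are y_1 = e_1 and y_2 = e_2.

e1, e2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])

def sq_loss(p1, p2):
    return 0.5 * np.sum((p1 - e1) ** 2) + 0.5 * np.sum((p2 - e2) ** 2)

def log_loss(p1, p2):
    return -np.log(p1[0]) - np.log(p2[1])

p1_sq = np.array([3.0, 2.0, 2.0]) / 7.0
p2_sq = np.array([3.0, 4.0, 0.0]) / 7.0
p1_alt = np.array([0.5, 0.25, 0.25])
p2_alt = np.array([0.5, 0.5, 0.0])

# Both candidates satisfy the constraint [p_1]_1 <= [p_2]_1.
assert p1_sq[0] <= p2_sq[0] and p1_alt[0] <= p2_alt[0]

# (53) is better for the squared loss, but strictly worse for the log loss,
# matching -log(12/49) > -log(1/4).
assert sq_loss(p1_sq, p2_sq) < sq_loss(p1_alt, p2_alt)
assert log_loss(p1_sq, p2_sq) > log_loss(p1_alt, p2_alt)
print(sq_loss(p1_sq, p2_sq), sq_loss(p1_alt, p2_alt))
print(log_loss(p1_sq, p2_sq), log_loss(p1_alt, p2_alt))
```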
