Characterizing Online and Private Learnability under Distributional Constraints via Generalized Smoothness

Moïse Blanchard* (Georgia Tech, mblanchard41@gatech.edu), Abhishek Shetty* (MIT, shetty@mit.edu), Alexander Rakhlin (MIT, rakhlin@mit.edu)
*Equal contribution.

Abstract

Understanding minimal assumptions that enable learning and generalization is perhaps the central question of learning theory. Several celebrated results in statistical learning theory, such as the VC theorem and Littlestone's characterization of online learnability, establish conditions on the hypothesis class that allow for learning under independent data and adversarial data, respectively. Building upon recent work bridging these extremes, we study sequential decision making under distributional adversaries that can adaptively choose data-generating distributions from a fixed family U, and ask when such problems are learnable with sample complexity that behaves like the favorable independent case. We provide a near-complete characterization of families U that admit learnability in terms of a notion known as generalized smoothness: a distribution family admits VC-dimension-dependent regret bounds for every finite-VC hypothesis class if and only if it is generalized smooth. Further, we give universal algorithms that achieve low regret under any generalized smooth adversary without explicit knowledge of U. Finally, when U is known, we provide refined bounds in terms of a combinatorial parameter, the fragmentation number, that captures how many disjoint regions can carry nontrivial mass under U. These results provide a nearly complete understanding of learnability under distributional adversaries. In addition, building upon the surprising connection between online learning and differential privacy, we show that generalized smoothness also characterizes private learnability under distributional constraints.

1 Introduction

The grand challenge of statistics and learning theory is to understand when data can be used to make accurate predictions. Various formalizations of this question have been studied over the decades, leading to a rich theory of learnability under different data-generating assumptions on the covariates and the labels, performance metrics, feedback models, and computational constraints. One prominent framework is PAC learning [Val84, VC74], which supposes that data are drawn independently and identically from a distribution. In this setting, the celebrated VC theorem [VC71, VC74] shows that a hypothesis class F is learnable with sample complexity depending on its VC dimension [VC71]. On the other extreme of the data-generating spectrum lies online learning [Lit88], which models data as being chosen completely arbitrarily (modeled as being chosen by an adversary). Here, Littlestone's seminal work [Lit88] established that a hypothesis class F is learnable with regret depending on its Littlestone dimension [Lit88]. These two frameworks provide a solid understanding of learnability under independent data (which we view as "benign" or well-behaved in our present context, but consider a strong and perhaps unreasonable modelling assumption) and adversarial data (which is a minimal assumption that treats the data as malicious and ill-behaved); however, real-world applications typically involve data-generating processes that lie between these two extremes.
Further, and perhaps unsurprisingly, the characterization of learnability under the fully adversarial model of Littlestone precludes learnability even for simple hypothesis classes such as linear thresholds [Lit88]. This motivates the study of structured data-generating processes that interpolate between the two extremes of i.i.d. and fully adversarial data. A natural formulation is to consider a process where the choice of data at each round is constrained to lie within a certain family of distributions. This framework is referred to as distributionally constrained online learning. Perhaps the most well-studied special case is that of smoothed adversaries [RST11, HRS24, She24, BDGR22], where the adversary is constrained to choose distributions that have bounded density ratio with respect to a fixed reference measure. This line of work, which can also be seen as a model for distribution shift (see [HK25] and references therein), has established that, under smoothness, sequential decision-making has the same qualitative learnability guarantees as in the i.i.d. setting [HRS24]. Though this gives a satisfactory answer under smoothness, the condition is perhaps too strong, leading to the following fundamental question:

What are minimal conditions on a distribution family U that ensure any form of learnability for low-complexity function classes (e.g., finite VC dimension), against adaptive U-distributionally constrained adversaries?

Surprisingly, this question is intimately connected to a seemingly unrelated problem of learning under privacy constraints. Differential privacy [DMNS06] is a rigorous framework for ensuring that learning algorithms do not leak sensitive information about individual data points. Unfortunately, differential privacy, being a stringent condition, often imposes significant limitations on what can be learned, perhaps most notably exemplified by the fact that for binary classification, only classes with finite Littlestone dimension are privately learnable [ALMM19]. As with online learning, this negative result has motivated the study of structured data-generating processes that allow for private learnability of richer classes [HRS20, ABM19, BCM+20, BBKN14]. We ask whether there is a characterization of distribution families that admit private learnability with finite sample complexity for VC hypothesis classes.

1.1 Our Contribution

We address this question by providing a complete characterization of distribution families U that admit learning for VC classes. To do this, we introduce a structural condition on U that generalizes the classical smoothness condition from smoothed online learning. Recall that in smoothed online learning, the adversary is constrained to choose distributions that have bounded density ratio with respect to a fixed reference measure µ_0; that is, there exists σ > 0 such that for every µ ∈ U and measurable set A, µ(A) ≤ µ_0(A)/σ. Generalized smoothness relaxes this linear relationship to allow for more general tolerance functions. That is, a distribution family U is said to be (ρ, µ_0)-generalized smooth if for every µ ∈ U, we have
$$\mu(A) \le \rho(\mu_0(A)) \quad \text{for all measurable sets } A. \tag{1}$$
This condition implies that the adversary cannot place significant mass on sets that are small under the base measure, but the relationship need not be linear as in classical smoothness.
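As a concrete (and entirely illustrative) sanity check of condition (1), the following Python sketch brute-forces the inequality µ(A) ≤ ρ(µ_0(A)) over all subsets of a small finite domain; the family U, base measure mu0, and tolerance rho below are hypothetical choices made for the example, not objects from the paper.

```python
from itertools import combinations
import numpy as np

# Toy domain of n points; measures are probability vectors over these points.
n = 8
rng = np.random.default_rng(0)
mu0 = np.full(n, 1.0 / n)                          # uniform base measure
U = [rng.dirichlet(np.ones(n)) for _ in range(5)]  # a small illustrative family

def rho(eps):
    # Example tolerance: rho(eps) = sqrt(eps), a non-linear relaxation of eps / sigma.
    return np.sqrt(eps)

def is_generalized_smooth(U, mu0, rho):
    """Brute-force check of mu(A) <= rho(mu0(A)) over every subset A of the domain."""
    points = range(len(mu0))
    for size in range(1, len(mu0) + 1):
        for A in combinations(points, size):
            base_mass = mu0[list(A)].sum()
            for mu in U:
                if mu[list(A)].sum() > rho(base_mass) + 1e-12:
                    return False
    return True

print(is_generalized_smooth(U, mu0, rho))
```

On a continuous domain the same check cannot be exhaustive, which is exactly why the structural characterizations developed below are needed.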
Our main result (Theorem 3) shows that this condition is necessary for any form of learnability against every finite-VC class F. In fact, we show a stronger result: generalized smoothness is necessary even to achieve asymptotic consistency against thresholds, the simplest non-trivial function classes. To complete the characterization, we build upon recent algorithmic techniques for smoothed online learning [She24, HRS24, BRS24, Bla25] to show that generalized smoothness is sufficient to achieve VC-dimension-dependent regret bounds for every finite-VC class F, even when U is not explicitly known to the learner (Theorems 6 and 7). These results show that the complex interplay of combinatorial structure and regret bounds for distributionally constrained adversaries can be fully captured by a single structural condition on the distribution family U. Additionally, and perhaps surprisingly, this also shows that merely asking for asymptotic consistency on simple VC classes further implies non-asymptotic regret bounds. Note that asymptotic and non-asymptotic learnability are, in general, qualitatively different notions with distinct characterizations.

Our main technical contribution is establishing a connection between regret bounds against threshold classes and a uniform notion of compactness of the distribution family U, which we capture via the existence of a continuous dominating submeasure (Definition 6) for U. Further, we show that the existence of such a continuous dominating submeasure is equivalent to generalized smoothness. The relation between continuity of submeasures and uniform domination by a measure (corresponding to generalized smoothness) is well-studied in measure theory, with celebrated results such as the Kalton-Roberts theorem [KR83] (interestingly, perhaps the first significant example of the use of expander graphs outside graph theory) and Talagrand's resolution of Maharam's problem [Tal08]. In our setting, with the extra structure of our submeasure (arising as a supremum over a family of probability measures), we are able to establish a clean equivalence between continuity and uniform domination directly, without relying on an opaque application of this heavy machinery; developing a deeper connection to these classical results is an interesting direction for future work.

We complement the characterization with algorithms (Theorems 6 and 7) that achieve VC-dimension-dependent regret bounds when U is generalized smooth, even when U is unknown to the learner. These can be viewed as the generalized smoothness analogues of the algorithms for smoothed online learning with unknown base measure from [BRS24, Bla25]. Finally, when U is known to the learner, we provide refined regret bounds (Theorems 8 and 10) that depend on a combinatorial parameter of U that we term the fragmentation number (Definition 8). While having finite fragmentation numbers is equivalent to generalized smoothness, the precise dependence of regret bounds on fragmentation numbers provides a more refined understanding of the difficulty of learning under a given distribution family U, corresponding to the different rates in the known-base-measure smoothed online learning setting [HRS24] and the unknown-base-measure setting [BRS24, Bla25].

In addition to the characterization of online learnability in the distributionally constrained setting, generalized smoothness serves as a characterization of private learnability under distributional constraints.
This can be seen as the distributionally constrained analogue of the characterization of private learnability via the Littlestone dimension [ALMM19]. We show that a distribution family U admits private learnability with sample complexity depending only on the VC dimension of the hypothesis class if and only if U is generalized smooth (Theorems 11 and 12).

2 Preliminaries

Let X be a separable measurable instance space, and denote by B the collection of measurable sets of X. Let Y = {0, 1} denote the label space. Let F ⊆ B be a family of measurable binary classifiers, associated with indicators over sets, and use the indicator loss 1{f(x) ≠ y}. While we focus on the case of binary classification and the indicator loss for concreteness, extending the ideas proposed here to multiclass classification and regression is an interesting direction for future work. The central combinatorial quantity in the binary classification setting is the VC dimension [VC71].

Definition 1 (VC dimension). The VC dimension of a function class F : X → {0, 1} is defined as
$$\mathrm{VC}(\mathcal{F}) = \sup\left\{ d \in \mathbb{N} : \exists\, x_1, \ldots, x_d \in \mathcal{X} \text{ s.t. } |\{(f(x_1), \ldots, f(x_d)) : f \in \mathcal{F}\}| = 2^d \right\}.$$
A set {x_1, ..., x_d} ⊆ X satisfying the above condition is said to be shattered by F.

We next define the online learning protocol under distributional constraints. Let U be a family of probability distributions over X. Consider the following T-round online protocol. For each round t = 1, ..., T:
1. The adversary selects a distribution µ_t ∈ U, either obliviously (the sequence (µ_t)_{t=1}^T is fixed in advance) or adaptively based on the history.
2. The learner chooses f_t ∈ B (possibly randomized).
3. A sample x_t ∼ µ_t is drawn; then a label y_t ∈ Y is revealed. The learner incurs loss 1{f_t(x_t) ≠ y_t} (and, in the full-information setting, observes (x_t, y_t)). Note that µ_t is not observed.

The objective of the learner is to minimize regret against the comparator class F, that is, the difference between the cumulative loss of the learner and that of the best fixed function in F in hindsight. Formally, define the regret after T rounds for an algorithm alg as
$$\mathrm{Reg}_T(\mathrm{alg}; \mathcal{F}, \mathcal{U}) = \mathbb{E}\left[\sum_{t=1}^{T} \mathbf{1}\{f_t(x_t) \neq y_t\} - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \mathbf{1}\{f(x_t) \neq y_t\}\right].$$
The setup described above generalizes both the classical adversarial online learning setting (when U contains all distributions over X) and the (sequential analogue of the) standard i.i.d. statistical learning setting (when U is a single fixed distribution). We denote the best achievable regret against the worst-case U-constrained adversary by Reg_T(F, U). Note that when U is a singleton, from the VC theory of statistical learning [VC71], we have Reg_T(F, U) ≲ √(VC(F) T). Further, when U contains all distributions over X, from the classical theory of adversarial online learning [Lit88, BDPSS09], we have Reg_T(F, U) ≲ √(LD(F) T), where LD(F) is the Littlestone dimension of F [Lit88] (since we do not work directly with the Littlestone dimension, we defer its definition to the appendix, Definition 10). The key issue we aim to address is the fact that LD ≫ VC for most natural function classes, with LD typically being infinite. With this in mind, the aim of this work is to characterize conditions on U that allow achieving regret bounds that depend on the VC dimension of F.
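The protocol and regret definition above can be made concrete with a small simulation. The sketch below is our own illustration, with a hypothetical family U of smoothed "bumps", a threshold comparator class, and an ERM-style learner; it plays T rounds of the protocol with realizable labels and reports the realized regret.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy instance space: a grid in [0, 1]; comparator class F: thresholds f_theta(x) = 1[x <= theta].
X = np.linspace(0.0, 1.0, 101)
thetas = X.copy()

# Hypothetical distribution family U: three smoothed bumps over the grid.
def bump(center, width):
    w = np.exp(-((X - center) / width) ** 2)
    return w / w.sum()
U = [bump(0.2, 0.1), bump(0.5, 0.2), bump(0.8, 0.1)]

T, theta_star = 300, 0.37            # horizon and hidden target threshold (realizable labels)
hist_x, hist_y = [], []
learner_loss = 0.0
comparator_loss = np.zeros(len(thetas))

for t in range(T):
    mu_t = U[t % len(U)]             # oblivious U-constrained adversary cycling through U
    # Learner: ERM over the threshold class on the observed history.
    if hist_x:
        hx, hy = np.array(hist_x), np.array(hist_y)
        errs = ((hx[:, None] <= thetas[None, :]) != hy[:, None]).sum(axis=0)
        theta_hat = thetas[int(np.argmin(errs))]
    else:
        theta_hat = 0.5
    x_t = rng.choice(X, p=mu_t)      # sample drawn from the adversary's distribution
    y_t = int(x_t <= theta_star)     # label revealed after the prediction
    learner_loss += int((x_t <= theta_hat) != y_t)
    comparator_loss += ((x_t <= thetas) != y_t).astype(float)
    hist_x.append(x_t); hist_y.append(y_t)

regret = learner_loss - comparator_loss.min()
print(f"regret after T={T} rounds: {regret:.0f}")
```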
In order to capture this, we introduce two notions of learnability for distribution families.

Definition 2 (Weak VC Learnability). A distribution class U on X is weakly VC learnable if for any function class F with finite VC dimension, there exists an online algorithm alg such that for any U-constrained adversary generating a data sequence (x_t, y_t)_{t≥1} with the labels y_t consistent with some function in F, we have
$$\lim_{T \to \infty} \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} \mathbf{1}[\hat{y}_t \neq y_t]\right] = 0.$$

Note that the definition above merely asks for asymptotic consistency on realizable data: convergence rates are allowed to potentially depend on the adversary's behavior. Arguably, this is the weakest meaningful learning guarantee. To emphasize the difference between asymptotic consistency and finite-regret bounds, we note that any countable function class F (in particular, even one with infinite VC dimension) admits asymptotically consistent learners even when the distribution class U is fully unrestricted, i.e., even for adversarial data (this is for instance a direct consequence of [Han21, Lemma 34]). We can strengthen the notion of learnability by requiring non-asymptotic regret bounds that depend on the VC dimension of the comparator class. We will refer to this as strong VC learnability.

Definition 3 (Strong VC Learnability). A distribution class U on X is strongly VC learnable if for any function class F with finite VC dimension d, for any ϵ > 0 there exist m_U(ϵ, d) and an online algorithm alg such that for any U-constrained adversary and T ≥ m_U(ϵ, d), Reg_T(alg; F, U) ≤ ϵT.

The key structural condition that captures learnability is generalized smoothness.

Definition 4 (Generalized smoothed distribution class). Let X be a measurable space. Fix a measure µ_0 on X and a non-decreasing function ρ : [0, 1] → R_+ with lim_{ϵ→0} ρ(ϵ) = 0. A family of distributions U on X is said to be (µ_0, ρ)-generalized smoothed if for any distribution µ ∈ U and measurable set A ⊆ X, µ(A) ≤ ρ(µ_0(A)).

One can see that generalized smoothness implies that the distributions in U are absolutely continuous with respect to the base measure µ_0, with ρ acting as a uniform control on this continuity. Note that this definition generalizes classical smoothness, which corresponds to the special case ρ(ϵ) = ϵ/σ for some constant σ ≤ 1. Further, this notion also generalizes smoothness notions based on f-divergences studied in [BP23]. Last, the notion of generalized smoothness is closely related to the notion of coverage studied by [CHG+25]. All these notions control the mass in the tails of density ratios between distributions in U and a base measure µ_0, and can be seen as different parameterizations of the same underlying concept. However, the notion of generalized smoothness is the most convenient for our purposes here and most directly addresses the underlying question of learnability.

3 Learnability and Generalized Smoothness

In this section, we characterize weak and strong VC learnability as defined in Definitions 2 and 3. Specifically, we show that both are equivalent to generalized smoothness: in particular, weak VC learnability is, despite its asymptotic nature, sufficient to recover non-asymptotic regret bounds.
In fact, it turns out that weak learnability with respect to simple VC=1 classes is already sufficient for non-asymptotic regret bounds. To formalize this result, we begin by introducing the family of generalized thresholds, which are the prototypical example of VC-one classes and, conceptually, the simplest non-trivial function classes for learning.

Definition 5 (Generalized thresholds). A function class F : X → {0, 1} is a generalized threshold function class if there exists a total preorder ⪯ on X such that F ⊆ {1[· ⪯ x_0] : x_0 ∈ X}. (Preorders satisfy the same properties as orders except for antisymmetry: ⪯ is a preorder if it is reflexive and transitive.)

These function classes naturally generalize the standard thresholds {1[· ≤ x_0] : x_0 ∈ [0, 1]} on the interval X = [0, 1] to the case where one may use orderings other than ≤ on the domain [0, 1]. Note that the VC dimension of any generalized threshold class is at most 1. Due to their simplicity, we use generalized thresholds as "test classes" for learnability. We specialize the notion of weak learnability (Definition 2) and strong learnability (Definition 3) to generalized thresholds and refer to these as weak threshold learnability and strong threshold learnability, respectively. Surprisingly, we show in this section that weak threshold learnability is equivalent to generalized smoothness, and hence sufficient to recover non-asymptotic regret bounds for all function classes with finite VC dimension. In turn, this implies the equivalence of weak and strong VC learnability to generalized smoothness.

As the first step of this proof, we study links between weak threshold learnability of a distribution class U and properties of its envelope functional µ_U defined via
$$\mu_{\mathcal{U}}(A) := \sup_{\mu \in \mathcal{U}} \mu(A), \quad A \in \mathcal{B}.$$
A key property will be whether µ_U is additionally a continuous submeasure, as defined below.

Definition 6 (Continuous Submeasure). A function µ̂ : B → [0, 1] is a continuous submeasure if:
1. Monotonicity: For any A, B ∈ B such that A ⊆ B, µ̂(A) ≤ µ̂(B).
2. Sub-additivity: For any A, B ∈ B, µ̂(A ∪ B) ≤ µ̂(A) + µ̂(B).
3. Continuity from above: For any decreasing sequence of measurable sets (A_i)_{i≥1} with ∩_{i≥1} A_i = ∅, we have lim_{n→∞} µ̂(A_n) = 0.

Of course, the functional µ_U always satisfies monotonicity and sub-additivity for any distribution class U. More importantly, we show that µ_U must also satisfy the continuity-from-above property if the distribution class U is weakly threshold learnable.

Lemma 1. Let U be a weakly threshold learnable distribution class on X. Then, µ_U is a continuous submeasure.

As a brief overview of the proof, if µ_U fails to be continuous, then there exist countably many disjoint sets each carrying at least ϵ mass under some distribution in U, for some parameter ϵ > 0. We index these sets by rationals and construct a threshold class where each function f_r is the indicator of the union of sets indexed by rationals at most r. By construction, the adversary can place ϵ mass on each of these sets, which allows the adversary to simulate an unconstrained adversary on the threshold function class.
This yields an expected lower bound Ω(ϵT) on the number of mistakes, following classical lower bound arguments for adversarial online learning [Lit88], up to minor modifications to ensure realizability. Essentially, the adversary hides the true threshold r_⋆ through a binary search. We defer the full proof to Appendix A.

The main step of the proof is to show that if µ_U is a continuous submeasure, then the distribution class U is a generalized smoothed class.

Lemma 2. Let U be a family of distributions on X such that µ_U is a continuous submeasure. Then, U is (µ_0, ρ)-generalized smooth for some distribution µ_0 on X and a non-decreasing function ρ : [0, 1] → R_+ with lim_{ϵ→0} ρ(ϵ) = 0.

The proof is constructive. At a high level, if µ_U is continuous then, for any scale ϵ > 0, we can construct a finite collection {µ_1, ..., µ_N} ⊆ U that covers the distribution class U at scale ϵ, in the sense that any measurable set A ∈ B on which some distribution µ ∈ U places at least ϵ mass must also carry nontrivial mass under one of the µ_i. Taking an appropriate mixture of these finite covers for decreasing scales ϵ → 0 then yields a distribution µ_0 with the desired generalized-smoothness property on U.

Proof. Fix U satisfying the assumptions and any ϵ > 0. We iteratively construct a sequence of distributions (µ^ϵ_i)_{i≥1} together with measurable sets (A^ϵ_i)_{i≥1} as follows. Suppose we have constructed µ^ϵ_j, A^ϵ_j for all j < i for some i ≥ 1. If there exist a distribution µ ∈ U and a measurable set A such that µ(A) ≥ 2ϵ while A carries only geometrically small mass under the previously selected measures, namely µ^ϵ_j(A) ≤ ϵ/2^{i−j+1} for all j < i, we pose µ^ϵ_i := µ and A^ϵ_i := A; otherwise, the construction stops. Suppose for contradiction that the construction never terminates, and define B^ϵ_i := A^ϵ_i \ ∪_{j>i} A^ϵ_j for all i ≥ 1. Then,
$$\mu_{\mathcal{U}}(B^\epsilon_i) \ge \mu^\epsilon_i(B^\epsilon_i) \ge \mu^\epsilon_i(A^\epsilon_i) - \sum_{j>i}\mu^\epsilon_i(A^\epsilon_j) \ge \epsilon - \sum_{j>i}\frac{\epsilon}{2^{j-i+1}} \ge \frac{\epsilon}{2}.$$
Since the sets B^ϵ_i are disjoint, this contradicts the continuous submeasure assumption on µ_U: specifically, we have µ_U(∪_{j≥i} B^ϵ_j) ≥ ϵ/2 for all i, while ∪_{j≥i} B^ϵ_j ↓ ∅. As a result, the construction terminates. Let N_ϵ be the last constructed index. Then, by construction, the finite measure µ^ϵ := (2/ϵ) Σ_{j≤N_ϵ} µ^ϵ_j satisfies, for any measurable set A ⊆ X,
$$\mu_{\mathcal{U}}(A) = \sup_{\mu \in \mathcal{U}} \mu(A) \le \epsilon + \mu^\epsilon(A). \tag{2}$$
Now fix any positive decreasing sequence (ϵ_i)_{i≥1} in [0, 1] with lim_{i→∞} ϵ_i = 0. For each i ≥ 1, since µ^{ϵ_i} is a finite measure, we can write it as µ^{ϵ_i} = C_i ν_i where ν_i is a probability measure and C_i = µ^{ϵ_i}(X) ≥ 0. Without loss of generality, we may assume that (C_i)_{i≥1} is non-decreasing while preserving Eq. (2). In particular, the sequence formed by δ_i := ϵ_i/(2^i C_i) is non-increasing in i ≥ 1. We then define the probability measure
$$\mu_0 := \sum_{i \ge 1} 2^{-i} \nu_i,$$
and we fix a non-decreasing function ρ : [0, 1] → R_+ by setting ρ(z) = 2ϵ_i for all z ∈ (δ_{i+1}, δ_i] for all i ≥ 1, ρ(0) = 0, and ρ(z) = 2 for z ∈ (δ_1, 1]. Now fix a measurable set A ⊆ X and denote z = µ_0(A). If z > δ_1 then we immediately have µ_U(A) ≤ 1 ≤ ρ(µ_0(A)). Next, if z ∈ (δ_{i+1}, δ_i] for some i ≥ 1, from Eq. (2) we have
$$\mu_{\mathcal{U}}(A) \le \epsilon_i + \mu^{\epsilon_i}(A) = \epsilon_i + C_i \nu_i(A) \overset{(i)}{\le} \epsilon_i + C_i 2^i \mu_0(A) \le 2\epsilon_i = \rho(\mu_0(A)).$$
In (i) we used the definition of µ_0. Finally, if z = 0, then similarly as above, Eq. (2) implies that µ_U(A) ≤ ϵ_i for all i ≥ 1 and hence µ_U(A) = 0 = ρ(µ_0(A)). This ends the proof. ■
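As a minimal numerical illustration of the mixture idea behind Lemma 2, the sketch below treats the simplest case of a finite family over a finite domain, where averaging the members of U already yields a base measure µ_0 certifying generalized smoothness with the linear tolerance ρ(z) = min(1, |U|·z); the scale-by-scale construction in the proof is what handles general (infinite) families. The family and all parameters are illustrative choices of ours.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n = 8
U = [rng.dirichlet(np.ones(n)) for _ in range(4)]     # finite illustrative family

# Mixture of the members of U as the base measure (the role played by mu_0 in Lemma 2).
mu0 = np.mean(U, axis=0)
rho = lambda z: min(1.0, len(U) * z)                  # linear tolerance, valid for finite families

# Verify mu(A) <= rho(mu0(A)) for every subset A by brute force.
ok = all(
    mu[list(A)].sum() <= rho(mu0[list(A)].sum()) + 1e-12
    for size in range(1, n + 1)
    for A in combinations(range(n), size)
    for mu in U
)
print("generalized smoothness certified for the finite family:", ok)
```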
Combining Lemmas 1 and 2 directly shows that any weakly threshold learnable family of distributions U is a generalized smoothed distribution class. It is further known that generalized smooth classes are learnable for all VC classes [BK25]. This yields the following main characterization theorem.

Theorem 3. Let U be a family of distributions on X. The following are equivalent:
1. U is weakly threshold learnable.
2. U is strongly VC learnable.
3. U is a generalized smoothed distribution class.

Proof. We proved (1) ⇒ (3) in Lemmas 1 and 2. Specifically, if U is weakly threshold learnable, then Lemma 1 shows that µ_U is a continuous submeasure, and hence Lemma 2 constructs a reference measure µ_0 and a tolerance function ρ for which U is generalized smooth. (2) ⇒ (1) is immediate. (3) ⇒ (2): The learnability of all VC classes under generalized smoothed distribution classes was already observed in previous works, e.g. [BK25, Corollary 10]. We further give concrete algorithms, ERM and R-Cover, that achieve VC-dimension-dependent regret bounds for any generalized smoothed distribution class in the realizable and agnostic settings, respectively, in Theorems 6 and 7. ■

Conceptually, Theorem 3 shows that any form of learnability on generalized thresholds (which are very simple VC=1 function classes) already forces strong structural properties on the distribution class U, as captured by generalized smoothness. Perhaps more surprisingly, this shows that asymptotic learnability guarantees on generalized thresholds imply non-asymptotic regret guarantees on all VC classes. As a remark, this property may not hold for classes simpler than generalized thresholds. For instance, note that any countable function class F is weakly learnable in the sense that there exists an algorithm which ensures sublinear excess regret compared to any fixed function within F even for unconstrained adversaries, e.g. see [Han21, Lemma 34]. Hence, U = {all distributions on X} is weakly learnable for any countable function class, but it is not learnable for any function class with infinite Littlestone dimension, including threshold function classes.

3.1 Uniform Covers

Another perspective on generalized smoothness is through the lens of uniform covers.

Definition 7 (Uniform Cover). A distribution class U is said to admit uniform covers for a function class F if for any ϵ > 0, there exists a finite set of functions F_ϵ ⊆ F such that for any µ ∈ U and f ∈ F, there exists f̂ ∈ F_ϵ such that µ({x : f(x) ≠ f̂(x)}) ≤ ϵ.

It is easy to see that generalized smoothness implies the existence of uniform covers for all VC classes. Indeed, if U is generalized smooth with respect to some reference measure µ_0 and tolerance function ρ, then for any ϵ > 0, using a covering for F under µ_0 at scale δ = ρ^{-1}(ϵ) yields a uniform cover at scale ϵ for all distributions in U. We record this formally as Lemma 20. Uniform covers for a class F also suffice for achieving low regret when the adversary is oblivious and the distribution class U is known. Indeed, as observed in [Hag18], a standard Hedge algorithm run on the uniform cover F_ϵ ensures regret at most ϵT + O(√(T log|F_ϵ|)) against any oblivious U-constrained adversary; a minimal sketch of this approach is given below.
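The following sketch is our own illustration of Hedge run over a finite cover, with a hypothetical cover of thresholds and toy data (not the paper's algorithm); it compares the learner's cumulative loss to the best element of the cover.

```python
import numpy as np

rng = np.random.default_rng(3)

def hedge_over_cover(cover_preds, labels):
    """Exponential weights (Hedge) over a finite cover of hypotheses.
    cover_preds: (T, K) array, entry [t, k] is the prediction of cover element k at round t.
    labels:      (T,) array of revealed labels y_t.
    Returns the expected cumulative 0/1 loss of the randomized learner."""
    T, K = cover_preds.shape
    eta = np.sqrt(8 * np.log(K) / T)           # standard Hedge learning rate
    log_w = np.zeros(K)
    expected_loss = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        losses = (cover_preds[t] != labels[t]).astype(float)
        expected_loss += p @ losses             # expected loss of sampling an expert from p
        log_w -= eta * losses                   # multiplicative-weights update
    return expected_loss

# Illustrative use: a cover of thresholds on a grid, data from a toy smooth adversary.
T, K = 1000, 21
grid = np.linspace(0, 1, K)                     # cover F_eps: hypotheses 1[x <= grid[k]]
x = rng.beta(2, 5, size=T)                      # samples from a toy (smooth) adversary
y = (x <= 0.3).astype(int)                      # realizable labels from a hidden threshold
preds = (x[:, None] <= grid[None, :]).astype(int)
alg_loss = hedge_over_cover(preds, y)
best_loss = (preds != y[:, None]).sum(axis=0).min()
print(f"Hedge expected loss {alg_loss:.1f} vs best element of the cover {best_loss}")
```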
However, to the best of our knowledge, there is no known direct way to extend this to the adaptive-adversary setting or to the case when U is unknown, without using the machinery of the coupling lemma introduced by [HRS24] (which requires stronger assumptions on the distribution class than just uniform covers). A modification of our proof of Theorem 3 actually shows that the existence of uniform covers for all VC classes is indeed equivalent to generalized smoothness.

Proposition 4. Let U be a distribution class. Then U has uniform covers for all VC classes (equivalently, all generalized threshold classes) if and only if it is generalized smooth.

Thus, our result can be seen as showing that if we insist on VC-dependent regret bounds, then uniform covers are also a necessary and sufficient condition for learnability.

4 Universal Algorithms

In the following two sections, we provide algorithms that achieve low regret under generalized smoothed distribution classes. As a reminder, as per the definitions of learnability in Definitions 2 and 3, a distribution class U is learnable if there exists a corresponding algorithm which achieves the desired learning guarantee. Importantly, such an algorithm a priori requires knowledge of the distribution class (and of the function class F, which is unavoidable). In this section, we instead focus on algorithms that do not require this prior knowledge of U, yet still achieve learning for VC classes (or generalized thresholds) under any distribution class U for which this is achievable; equivalently, any generalized smoothed distribution class U, by Theorem 3. We refer to such adaptive algorithms as optimistically universal, following the corresponding literature on "learning whenever possible" [Han21, Bla22, BJ23]. Note that their existence is not a priori guaranteed.

To obtain universal algorithms with low regret under generalized smoothness, we leverage known results for smooth adversaries, which correspond to generalized smoothed classes with linear tolerance function ρ(ϵ) = ϵ/σ. Specifically, for smooth adversaries, Empirical Risk Minimization (ERM) already achieves sublinear regret for VC classes in the realizable setting [BRS24], while R-Cover [Bla25], an algorithm based on recursive empirical covers, is known to achieve sublinear regret for VC classes in the agnostic setting.

Coupling reduction lemma. We then lift these guarantees to generalized smooth adversaries using a simple coupling reduction to smooth adversaries. Fix a generalized smoothed class U with base measure µ_0 and tolerance ρ. Recall that by definition, for any ϵ, any distribution places at most mass ρ(ϵ) on sets of µ_0-measure ϵ. Hence, distributions in U are intuitively ϵ/ρ(ϵ)-smooth with respect to µ_0 up to a mass ρ(ϵ). More formally, we can couple the sequence generated by a U-constrained adversary with an O(ϵ/ρ(ϵ))-smooth sequence, where the two sequences agree except on a fraction O(ρ(ϵ)) of the samples, for which we replace the original sample with a dummy variable x_∅, as detailed below. This argument allows us to import regret bounds from the smoothed online learning literature to the case of generalized smoothed adversaries. For convenience, we say that a tolerance function ρ : [0, 1] → R_+ is well-behaved if it is non-decreasing, lim_{ϵ→0} ρ(ϵ) = 0, and ϵ ∈ (0, 1] ↦ ρ(ϵ)/ϵ is non-increasing.
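Before stating the lemma, here is a small sketch of the coupling idea on a finite domain (our own illustration, with a hypothetical family member and parameters): samples falling in the high density-ratio region A_t = {dµ_t/dµ_0 > ρ(ϵ)/ϵ} are replaced by a dummy symbol standing in for x_∅, and the remaining samples form a sequence that is smooth with respect to µ_0.

```python
import numpy as np

rng = np.random.default_rng(7)

def couple_round(mu_t, mu0, eps, rho):
    """One round of the coupling idea on a finite domain (illustrative sketch).
    Returns (x_t, x_tilde, mass_A): the original sample, its smoothed counterpart
    (the string 'DUMMY' stands in for x_emptyset), and mu_t(A_t)."""
    eta = rho(eps) / eps
    A_t = mu_t / mu0 > eta                        # high density-ratio region A_t
    x_t = rng.choice(len(mu_t), p=mu_t)
    x_tilde = "DUMMY" if A_t[x_t] else x_t
    return x_t, x_tilde, mu_t[A_t].sum()

# Toy family member, generalized smooth w.r.t. the uniform base with rho(z) = sqrt(z).
n, eps = 50, 0.04
mu0 = np.full(n, 1.0 / n)
rho = np.sqrt
mu = np.full(n, 0.86 / (n - 1)); mu[0] = 0.14     # one heavy point, still obeying rho
dummy_frac = np.mean([couple_round(mu, mu0, eps, rho)[1] == "DUMMY" for _ in range(2000)])
print(f"fraction of rounds sent to the dummy symbol: {dummy_frac:.3f} "
      f"(at most rho(eps) = {rho(eps):.2f})")
```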
Lemma 5. Let U be a generalized smoothed distribution class on X with a well-behaved tolerance function ρ : [0, 1] → R_+ and a horizon T ≥ 1. For convenience, fix a dummy new variable x_∅ ∉ X. Consider any U-constrained adaptive adversary generating samples x_1, ..., x_T. For any ϵ > 0, there exists a coupling of these samples with a sequence x̃_1, ..., x̃_T ∈ X ∪ {x_∅} which is generated from an ϵ/(2ρ(ϵ))-smooth adversary on X ∪ {x_∅}, such that for all t ∈ [T], x̃_t ∈ {x_t, x_∅} and P[x̃_t = x_∅ | x̃_1, ..., x̃_{t−1}] ≤ ρ(ϵ).

Realizable setting. We first consider the realizable setting, in which the labels are consistent with some f_⋆ ∈ F, that is, y_t = f_⋆(x_t) for all t ∈ [T].

Theorem 6. Let F : X → {0, 1} be a function class with finite VC dimension d and let U be a generalized smoothed distribution class on X with a well-behaved tolerance function ρ : [0, 1] → [0, 1]. Then, in the realizable setting, for any adaptive adversary, ERM makes predictions ŷ_t such that
$$\mathbb{E}\left[\sum_{t=1}^{T} \mathbf{1}\{\hat{y}_t \neq y_t\}\right] \le \inf_{\epsilon > 0}\left( C\sqrt{\frac{\rho(\epsilon)}{\epsilon}\, dT \ln^5 T} + \rho(\epsilon)\, T \right),$$
for some universal constant C > 0.

The proof essentially combines the coupling reduction Lemma 5 with known regret bounds for ERM in the realizable setting. Naturally, for any ϵ > 0, compared to the bound for O(ϵ/ρ(ϵ))-smooth adversaries from [BRS24, Theorem 8], the regret bound from Theorem 6 includes an additive term ρ(ϵ)T corresponding to the expected number of disagreements between the original and smoothed sequences constructed in Lemma 5.

Agnostic setting. Next, we turn to the general agnostic setting in which no assumptions are made on the labels y_t. We show that the algorithm R-Cover from [Bla25] also universally learns VC classes under any threshold learnable distribution class U.

Theorem 7. Let F : X → {0, 1} be a function class with finite VC dimension d and let U be a generalized smoothed distribution class on X with a well-behaved tolerance function ρ : [0, 1] → [0, 1]. Then, for any adaptive adversary, R-Cover makes predictions ŷ_t such that
$$\mathbb{E}\left[\sum_{t=1}^{T} \ell_t(\hat{y}_t) - \inf_{f \in \mathcal{F}}\sum_{t=1}^{T} \ell_t(f(x_t))\right] \le C \ln^{3/2} T \cdot \inf_{\epsilon > 0}\left( \sqrt{\frac{\rho(\epsilon)}{\epsilon}\, dT \ln T} + \rho(\epsilon)\, T \sqrt{d} \right),$$
for some universal constant C > 0.

Due to the recursive nature of R-Cover, the proof requires additional care to handle the disagreements between the original sequence x_1, ..., x_T and the smoothed sequence x̃_1, ..., x̃_T from Lemma 5. Specifically, previous samples are used recursively to construct covers, hence disagreements between the two sequences may amplify with the depth of the construction. This increases the additive disagreement term in Theorem 7 by a factor √d · polylog T compared to the ERM regret bound in Theorem 6.

5 Distribution-class-dependent regret bounds

The previous sections established that generalized smoothness characterizes VC-dependent learnability and that universal algorithms exist which achieve low regret without knowing the specific distribution family U. A natural follow-up question is: can we do better when U is known? When the distribution family U is known to the learner, one can often obtain sharper regret bounds by exploiting additional quantitative structure of U. In this section, we develop a simple combinatorial parameterization of distribution families and connect it to upper and lower regret bounds. The key insight is that the "complexity" of a distribution family can be measured by how fragmented mass can be: if U allows mass to be spread across many disjoint regions, learning is harder. This is formalized by the ϵ-fragmentation number of the distribution family U, which intuitively measures how "fragmented" mass can be under distributions in U at scale ϵ ∈ (0, 1].

Definition 8 (Fragmentation number). For any distribution class U and scale ϵ ∈ (0, 1], the ϵ-fragmentation number of U is defined via
$$N_{\mathcal{U}}(\epsilon) := \max\{ k \in \mathbb{N} : \exists \text{ disjoint } A_1, \ldots, A_k \in \mathcal{B},\ \forall i \in [k],\ \mu_{\mathcal{U}}(A_i) \ge \epsilon \}.$$
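The sketch below (illustrative only, with a hypothetical finite family) computes a greedy lower bound on N_U(ϵ) for a finite family over a finite domain by repeatedly carving out disjoint sets of envelope mass at least ϵ; exact computation is a set-packing problem, so the greedy value can undercount.

```python
import numpy as np

def fragmentation_lower_bound(U, eps):
    """Greedy lower bound on the eps-fragmentation number N_U(eps) for a finite family of
    distributions over a finite domain: repeatedly carve out a disjoint set of points whose
    mass under some single member of U (hence its envelope mass) reaches eps."""
    U = np.asarray(U)
    remaining = list(range(U.shape[1]))
    count = 0
    while True:
        best_set = None
        for mu in U:
            # Greedily take the heaviest remaining points under mu until mass eps is reached.
            order = sorted(remaining, key=lambda i: -mu[i])
            mass, chosen = 0.0, []
            for i in order:
                chosen.append(i)
                mass += mu[i]
                if mass >= eps:
                    break
            if mass >= eps and (best_set is None or len(chosen) < len(best_set)):
                best_set = chosen
        if best_set is None:
            return count
        count += 1
        remaining = [i for i in remaining if i not in best_set]

rng = np.random.default_rng(4)
U = [rng.dirichlet(np.ones(30)) for _ in range(6)]
for eps in (0.5, 0.2, 0.1, 0.05):
    print(eps, fragmentation_lower_bound(U, eps))
```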
As a useful remark, µ_U is a continuous submeasure if and only if N_U(ϵ) < ∞ for all ϵ > 0 (see the beginning of the proof of Lemma 1). In words, the ϵ-fragmentation number is the maximum number of disjoint regions that can each carry at least ϵ mass under some µ ∈ U. Equivalently, it counts disjoint regions on which a U-constrained adversary may place ϵ ∈ (0, 1] mass. In particular, a U-constrained adversary may emulate, for a fraction ϵ of the rounds, an adversary which selects the region on which the next sample lies, among these disjoint regions. This observation leads to the following regret lower bound for U-constrained adversaries.

Theorem 8. Fix a family of distributions U on X and T ≥ 1. For any ϵ > 0 and d ≥ 1 such that 3d ≤ N_U(ϵ), there exists a function class F : X → {0, 1} of VC dimension at most d such that for any algorithm alg,
$$\mathrm{Reg}_T(\mathrm{alg}; \mathcal{F}, \mathcal{U}) \gtrsim \sqrt{\epsilon d T \ln\frac{N_{\mathcal{U}}(\epsilon)}{d}} \wedge (\epsilon T),$$
where this lower bound is realized by some oblivious U-constrained adversary.

We next turn to regret upper bounds. We first give a quantitative version of Lemma 2 using fragmentation numbers. This essentially bounds the continuity modulus of the envelope functional µ_U as a function of fragmentation numbers.

Lemma 9. Let U be a family of distributions and ϵ ∈ (0, 1] such that N_U(ϵ) < ∞. Then, there exists a distribution µ_ϵ on X such that for any measurable set A ⊆ X,
$$\mu_\epsilon(A) \le \frac{\epsilon}{N_{\mathcal{U}}(\epsilon)^2} \implies \mu_{\mathcal{U}}(A) \le 2\epsilon.$$

The proof is constructive and largely similar to that of Lemma 2. Intuitively, Lemma 9 shows that at the scale of events of mass Ω(ϵ), the distribution class U is essentially O(N_U(ϵ)^2)-smooth with respect to a fixed base distribution µ_ϵ. With this property at hand, we can run a simple Hedge algorithm over an appropriate cover of F, following standard arguments for smoothed adversaries [HRS24, BDGR22], to achieve the following regret upper bound.

Theorem 10. Let U be a family of distributions on X and T ≥ 1. For any function class F : X → {0, 1} of VC dimension d ≥ 1, there exists an algorithm alg such that
$$\mathrm{Reg}_T(\mathrm{alg}; \mathcal{F}, \mathcal{U}) \lesssim \inf_{\epsilon > 0}\left\{ \sqrt{dT \ln(T N_{\mathcal{U}}(\epsilon))} + \epsilon T \right\}.$$

As an important remark, both the upper and lower bounds depend only on the logarithm of the fragmentation numbers N_U(ϵ) for ϵ ∈ (0, 1]. In comparison, the bounds from ERM and R-Cover qualitatively depend polynomially on the fragmentation numbers. Specifically, if N_U(ϵ) < ∞ for all ϵ > 0, Lemma 9 suggests that U is generalized smooth with a function ρ such that ρ(ϵ/N_U(ϵ)^2) = 2ϵ for ϵ ∈ (0, 1]. Provided that ρ is additionally concave, Theorem 6 yields the following regret bound for ERM:
$$\mathrm{Reg}_T(\mathrm{ERM}; \mathcal{F}, \mathcal{U}) \lesssim \inf_{\epsilon > 0}\left\{ N_{\mathcal{U}}(\epsilon)\sqrt{dT \ln^5 T} + \epsilon T \right\},$$
for function classes F of VC dimension d. Theorem 7 yields similar regret bounds for R-Cover. Of course, this regret bound is looser than Theorem 10, but it has the advantage of being achieved by a universal algorithm that does not require knowledge of the distribution class U.

6 Differential Privacy

In this section, we extend the insights from our characterization of distributionally constrained online learning to the setting of differentially private learning. Recall that differential privacy [DMNS06] is a rigorous framework for ensuring that learning algorithms do not leak sensitive information about individual data points.
Though differential privacy can be defined in broad generality, we focus on the setting of learning binary classifiers and present the definition specialized to this setting.

Definition 9 (Differentially Private Learning). A randomized algorithm A : (X × {0, 1})^m → {0, 1}^X is (α, β)-differentially private if for any pair of datasets S, S' ∈ (X × {0, 1})^m differing in a single entry, and for any measurable set E of hypotheses,
$$\mathbb{P}(\mathcal{A}(S) \in E) \le e^{\alpha}\, \mathbb{P}(\mathcal{A}(S') \in E) + \beta. \tag{3}$$
Further, for a distribution class U on X, a function class F : X → {0, 1} is said to be (α, β)-privately (ϵ, δ)-accurately learnable under U with sample complexity m if there exists an (α, β)-differentially private algorithm A : (X × {0, 1})^m → {0, 1}^X such that for any distribution D over X × {0, 1} with marginal on X in U, we have
$$\mathbb{P}_{S \sim \mathcal{D}^m}\left[ L_{\mathcal{D}}(\mathcal{A}(S)) \le \inf_{f \in \mathcal{F}} L_{\mathcal{D}}(f) + \epsilon \right] \ge 1 - \delta, \tag{4}$$
where L_D(f) = P_{(x,y)∼D}(f(x) ≠ y).

With this definition at hand, we can now state our characterization of private learnability under distributional constraints. This characterization mirrors our main result for online learning: generalized smoothness is necessary for private learnability with VC-dependent sample complexity.

Theorem 11. There is a constant c_0 > 0 such that the following holds. Let U be a distribution class on X such that for any generalized threshold class F and any ϵ > 0, there exists m(ϵ, F) < ∞ such that F is (0.1, c_0/(m(ϵ, F)^2 log m(ϵ, F)))-privately (ϵ, 1/32)-accurately learnable under U with sample complexity m(ϵ, F). Then, U is a generalized smooth distribution class.

As a remark, we believe that a stronger version of this statement may be possible, by showing that generalized smoothness also characterizes private asymptotic consistency, for which accuracy is only required asymptotically for each fixed data distribution.

We complement this result by showing that differentially private learnability with VC-dependent sample complexity is achievable under any generalized smooth distribution class. In fact, this is a simple consequence of the fact that under generalized smoothness every VC class has uniform covers (Definition 7) and the fact that for finite classes the exponential mechanism [MT07] provides private learnability with logarithmic dependence on the size of the class.

Theorem 12 (Private Learnability under Generalized Smoothness). Let U be a (ρ, µ_0)-generalized smooth distribution class on X. Then, for any function class F : X → {0, 1} of VC dimension d ≥ 1, any ϵ, δ ∈ (0, 1), and any α ∈ (0, 1), there exists an (α, 0)-differentially private algorithm that (ϵ, δ)-accurately learns F under U with sample complexity
$$m \lesssim \frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon^2} + \frac{d \log(1/\rho^{-1}(\epsilon))}{\alpha \epsilon}.$$

This result can be seen as related to the line of work on private learning with unlabelled public data [ABM19, BCM+20, BBD+24, BS25], since unlabelled public data from the base distribution suffices to obtain the improved sample complexity guarantees. Further, using techniques from [HRS20], similar results can be obtained for query release and data release tasks under generalized smoothness assumptions. It is an interesting research direction to understand whether oracle-efficient private learning algorithms [BBD+24, BS25] can be obtained under generalized smoothness assumptions.
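As a concrete illustration of the mechanism invoked above, the sketch below runs the exponential mechanism of [MT07] over a finite cover of threshold classifiers, with utility equal to minus the empirical error (sensitivity 1); the data, cover, and parameter choices are hypothetical examples of ours, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(5)

def exponential_mechanism(cover_errors, alpha):
    """Exponential mechanism over a finite hypothesis set.
    cover_errors[k] = number of sample errors of hypothesis k on the private dataset;
    the utility u(S, k) = -cover_errors[k] has sensitivity 1 under changing one example,
    so sampling k with probability proportional to exp(-alpha * cover_errors[k] / 2)
    is (alpha, 0)-differentially private."""
    scores = -alpha * cover_errors / 2.0
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return rng.choice(len(cover_errors), p=p)

# Illustrative use: private selection among a cover of threshold classifiers.
m, alpha = 2000, 1.0
grid = np.linspace(0, 1, 41)                           # finite cover of thresholds
x = rng.random(m)
y = (x <= 0.62).astype(int)                            # labels from a hidden threshold
errors = ((x[:, None] <= grid[None, :]).astype(int) != y[:, None]).sum(axis=0)
k = exponential_mechanism(errors.astype(float), alpha)
print(f"privately selected threshold {grid[k]:.2f} with {errors[k]} sample errors")
```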
7 Discussion

We have established that generalized smoothness exactly characterizes when VC-dimension-dependent regret bounds are achievable against distributionally constrained adversaries. This characterization bridges the gap between the classical σ-smooth model and the fully adversarial setting, providing a complete picture of learnability under distribution shifts. Further, this characterization extends to the setting of differentially private learning, showing that generalized smoothness is necessary and sufficient for private learnability with VC-dependent sample complexity.

Our work fits into the broader program of understanding "beyond worst-case" models in online learning and private learning. The characterization suggests that generalized smoothness is a fundamental structural property, much like how the VC dimension and the Littlestone dimension are fundamental for i.i.d. and adversarial learning, respectively. We hope this perspective inspires further work on understanding the landscape of learnability under various distributional assumptions.

Acknowledgments

We acknowledge support from ARO through award W911NF-21-1-0328, the Simons Foundation, the NSF through awards DMS-2031883 and PHY-2019786, the DARPA AIQ program, and AFOSR FA9550-25-1-0375. Abhishek Shetty was supported by an NSF FODSI fellowship.

References

[ABM19] Noga Alon, Raef Bassily, and Shay Moran. Limits of private learning with access to public data. In Advances in Neural Information Processing Systems, volume 32, 2019.

[AHK+14] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646. PMLR, 2014.

[ALMM19] Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite Littlestone dimension. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 852–860, 2019.

[BBD+24] Adam Block, Mark Bun, Rathin Desai, Abhishek Shetty, and Steven Wu. Oracle-efficient differentially private learning with public data, 2024.

[BBKN14] Amos Beimel, Hai Brenner, Shiva Prasad Kasiviswanathan, and Kobbi Nissim. Bounds on the sample complexity for private learning and private data release. Machine Learning, 94:401–437, 2014.

[BCM+20] Raef Bassily, Albert Cheu, Shay Moran, Aleksandar Nikolov, Jonathan Ullman, and Steven Wu. Private query release assisted by public data. In International Conference on Machine Learning, pages 695–705. PMLR, 2020.

[BDGR22] Adam Block, Yuval Dagan, Noah Golowich, and Alexander Rakhlin. Smoothed online learning is as easy as statistical learning. In Conference on Learning Theory, pages 1716–1786. PMLR, 2022.

[BDPSS09] Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz. Agnostic online learning. In COLT, volume 3, page 1, 2009.

[BHJ23] Moïse Blanchard, Steve Hanneke, and Patrick Jaillet. Adversarial rewards in universal learning for contextual bandits. arXiv preprint arXiv:2302.07186, 2023.

[BJ23] Moïse Blanchard and Patrick Jaillet. Universal regression with adversarial responses. The Annals of Statistics, 51(3):1401–1426, 2023.

[BK25] Moïse Blanchard and Samory Kpotufe. Distributionally-constrained adversaries in online learning. arXiv preprint arXiv:2506.10293, 2025.
[Bla22] Moïse Blanchard. Universal online learning: An optimistically universal learning rule. In Conference on Learning Theory, pages 1077–1125. PMLR, 2022.

[Bla25] Moïse Blanchard. Agnostic smoothed online learning. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing, pages 1997–2006, 2025.

[BLL+11] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19–26. JMLR Workshop and Conference Proceedings, 2011.

[BP23] Adam Block and Yury Polyanskiy. The sample complexity of approximate rejection sampling with applications to smoothed online learning. In The Thirty Sixth Annual Conference on Learning Theory, pages 228–273. PMLR, 2023.

[BRS24] Adam Block, Alexander Rakhlin, and Abhishek Shetty. On the performance of empirical risk minimization with smoothed data. arXiv preprint arXiv:2402.14987, 2024.

[BS25] Adam Block and Abhishek Shetty. Small loss bounds for online learning separated function classes: A Gaussian process perspective, 2025.

[CHG+25] Fan Chen, Audrey Huang, Noah Golowich, Sadhika Malladi, Adam Block, Jordan T. Ash, Akshay Krishnamurthy, and Dylan J. Foster. The coverage principle: How pre-training enables post-training. arXiv preprint arXiv:2510.15020, 2025.

[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.

[Fre75] David A. Freedman. On tail probabilities for martingales. The Annals of Probability, pages 100–118, 1975.

[Hag18] Nika Haghtalab. Foundation of Machine Learning, by the People, for the People. PhD thesis, Carnegie Mellon University, 2018.

[Han21] Steve Hanneke. Learning whenever learning is possible: Universal learning under general stochastic processes. Journal of Machine Learning Research, 22(130):1–116, 2021.

[Hau95] David Haussler. Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217–232, 1995.

[HK25] Steve Hanneke and Samory Kpotufe. Adaptive sample aggregation in transfer learning, 2025.

[HRS20] Nika Haghtalab, Tim Roughgarden, and Abhishek Shetty. Smoothed analysis of online and differentially private learning. Advances in Neural Information Processing Systems, 33:9203–9215, 2020.

[HRS24] Nika Haghtalab, Tim Roughgarden, and Abhishek Shetty. Smoothed analysis with adaptive adversaries. Journal of the ACM, 71(3):1–34, 2024.

[KLN+11] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.

[KR83] Nigel J. Kalton and James W. Roberts. Uniformly exhaustive submeasures and nearly additive set functions. Transactions of the American Mathematical Society, 278(2):803–816, 1983.

[Lit88] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.

[MT07] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07), pages 94–103. IEEE, 2007.
[RST11] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Stochastic, constrained, and smoothed adversaries. Advances in Neural Information Processing Systems, 24, 2011.

[Sau72] Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145–147, 1972.

[She72] Saharon Shelah. A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 41(1):247–261, 1972.

[She24] Abhishek Vasantha Shetty. Learning in a Changing World: Covariate Shift, Subset Selection and Optimal PAC Bounds. PhD thesis, University of California, Berkeley, 2024.

[Tal08] Michel Talagrand. Maharam's problem. Annals of Mathematics, 168(3):981–1009, 2008.

[Val84] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

[VC71] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971.

[VC74] Vladimir Vapnik and Alexey Chervonenkis. Theory of pattern recognition, 1974.

A Proofs from Section 3

We start by proving that for threshold learnable distribution classes U, the envelope measure µ_U is a continuous submeasure.

Proof of Lemma 1. The first two properties of continuous submeasures are immediate to check by definition of µ_U := sup_{µ∈U} µ. We focus on proving the continuity property. Suppose by contradiction that µ_U is not continuous, and fix a sequence of measurable sets B_i ↓ ∅ and ϵ > 0 such that for all i ≥ 1, µ_U(B_i) ≥ 2ϵ. From these decreasing sets, we construct by induction a sequence of disjoint measurable sets (A_i)_{i≥1} such that for i ≥ 1 we have µ_U(A_i) ≥ ϵ, as follows. Let j_1 = 1. For a given i ≥ 1, let µ ∈ U with µ(B_{j_i}) > (3/2)ϵ, which exists by construction of (B_i)_{i≥1}. Since B_j ↓ ∅, there exists j_{i+1} > j_i such that µ(B_{j_i} \ B_{j_{i+1}}) ≥ ϵ. We then pose A_i := B_{j_i} \ B_{j_{i+1}}; in particular, µ_U(A_i) ≥ µ(A_i) ≥ ϵ. By construction, the sets (A_i)_{i≥1} are all disjoint, which ends the construction.

Fix any bijection φ : Q → N between the rationals and the positive integers. For any r ∈ R, we define the function f_r : X → {0, 1} to be the indicator of the set ∪_{y ∈ Q, y ≤ r} A_{φ(y)}. We then consider the function class collecting all such functions: F := {f_r : r ∈ R}. By construction, we can check that F is a generalized threshold function class, since for any r ≤ s one has f_r ≤ f_s. For convenience, in the rest of the proof we use the following abuse of notation: for any interval I ⊆ R and x ∈ X, we write x ∈ I if and only if x ∈ ∪_{r ∈ I ∩ Q} A_{φ(r)}.

We now show that U is not weakly learnable for F. The corresponding construction follows classical arguments. Specifically, we fix an online learner alg and construct an instance on which it incurs linear regret. To do so, we recursively construct a sequence of rationals (a_t, b_t, c_t)_{t≥1} such that a_t < b_t for all t ≥ 1, and corresponding distributions (µ_t)_{t≥1}. Initially, we pose a_0 = 0 and b_0 = 1. Fix t ≥ 1 and suppose a_{t−1} < b_{t−1} have been constructed. We first pose c_t := (a_{t−1} + b_{t−1})/2. By construction, there exists µ_t ∈ U such that
$$\mu_t(A_{\varphi(c_t)}) \ge \frac{\mu_{\mathcal{U}}(A_{\varphi(c_t)})}{2} \ge \frac{\epsilon}{2}. \tag{5}$$
Next, we consider the online learning setup in which x_1, ..., x_t are sampled independently from µ_1, ..., µ_t, respectively, and consider the values y_{t'} = f_{c_t}(x_{t'}) for all t' < t. We now define the value
$$z_t := \arg\min_{z \in \{0,1\}} \mathbb{P}\big(\hat{y}^{(t)}_t = z,\ x_t \in \{c_t\}\big), \tag{6}$$
where we denote by ŷ^{(t)}_t the prediction of the algorithm at time t for this online learning instance. If z_t = 0 we pose b_t = c_t and fix a_t ∈ (a_{t−1}, c_t) such that
$$\mathbb{P}_{x_{t'} \sim \mu_{t'}}\big[\exists\, t' \in [t] : x_{t'} \in (a_t, b_t)\big] \le \epsilon/8. \tag{7}$$
Note that this is indeed possible since, as a_t → c_t, the subset of X corresponding to the interval (a_t, c_t) decreases to the empty set. Conversely, if z_t = 1, we pose a_t = c_t and fix any b_t ∈ (c_t, b_{t−1}) such that Eq. (7) holds. This ends the recursive construction of the sequences. By construction, note that (a_t)_{t≥1} and (b_t)_{t≥1} converge to the same real value, which we denote r_⋆ ∈ R.

We now check that for the sequence of instances (µ_t)_{t≥1} and the function f_⋆ := f_{r_⋆} ∈ F, the learner incurs linear regret. By construction, note that f_⋆ has value 1 − z_t on the set {c_t} = A_{φ(c_t)} for all t ≥ 1. Therefore, for any T ≥ 1,
$$\mathbb{E}\left[\sum_{t=1}^{T} \mathbf{1}[\hat{y}_t \neq f_\star(x_t)]\right] \ge \sum_{t=1}^{T} \mathbb{P}\big(\hat{y}_t = z_t,\ x_t \in \{c_t\}\big).$$
Note that by construction we have c_t, r_⋆ ∈ (a_{t−1}, b_{t−1}). In particular, f_⋆ and f_{c_t} yield the same values y_{t'} for t' < t whenever none of the previous samples t' < t satisfies x_{t'} ∈ (a_{t−1}, b_{t−1}). Hence, for any t ≥ 1,
$$\mathbb{P}\big(\hat{y}_t = z_t,\ x_t \in \{c_t\}\big) \ge \mathbb{P}\big(\hat{y}^{(t)}_t = z_t,\ x_t \in \{c_t\}\big) - \mathbb{P}\big(\exists\, t' \in [t-1] : x_{t'} \in (a_{t-1}, b_{t-1})\big) \overset{(i)}{\ge} \frac{\mathbb{P}(x_t \in \{c_t\})}{2} - \frac{\epsilon}{8} \overset{(ii)}{\ge} \frac{\epsilon}{8}.$$
In (i) we used Eqs. (6) and (7), and in (ii) we used Eq. (5). Plugging these estimates within the previous equation shows that for all T ≥ 1,
$$\mathbb{E}\left[\sum_{t=1}^{T} \mathbf{1}[\hat{y}_t \neq f_\star(x_t)]\right] \ge \frac{\epsilon T}{8}.$$
This ends the proof that U is not weakly learnable for F, reaching a contradiction. Hence µ_U is a continuous submeasure. ■

B Further Preliminaries

B.1 Additional definitions

We give the definition of the Littlestone dimension below, which is known to characterize learnable function classes for fully unconstrained adversaries [Lit88, BDPSS09].

Definition 10 (Littlestone Dimension). The Littlestone dimension of a function class F : X → {0, 1}, denoted by Ldim(F), is the largest integer d such that there exists a complete binary tree of depth d with internal nodes labeled by elements of X and edges labeled by {0, 1}, such that for every root-to-leaf path (x_1, b_1), ..., (x_d, b_d), there exists a function f ∈ F satisfying f(x_i) = b_i for all i ∈ [d]. If such trees exist for arbitrarily large d, then the Littlestone dimension is infinite.

B.2 Concentration inequalities

Freedman's inequality [Fre75] gives tail probability bounds for martingales. The statement below is for instance taken from [BLL+11, Theorem 1] or [AHK+14, Lemma 9].

Theorem 13 (Freedman's inequality). Let (Z_t)_{t ∈ [T]} be a real-valued martingale difference sequence adapted to a filtration (F_t)_t. If |Z_t| ≤ R almost surely, then for any η ∈ (0, 1/R), with probability at least 1 − δ,
$$\sum_{t=1}^{T} Z_t \le \eta \sum_{t=1}^{T} \mathbb{E}[Z_t^2 \mid \mathcal{F}_{t-1}] + \frac{\ln 1/\delta}{\eta}.$$
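A quick Monte-Carlo sanity check of Theorem 13 on a simple bounded martingale difference sequence (our own illustrative construction, not from the paper) is given below; the empirical violation rate of the stated inequality should not exceed δ.

```python
import numpy as np

rng = np.random.default_rng(6)
T, R, delta, eta, trials = 500, 1.0, 0.05, 0.2, 2000    # any eta in (0, 1/R)

violations = 0
for _ in range(trials):
    eps = rng.uniform(-1, 1, size=T)          # i.i.d. innovations, mean 0, variance 1/3
    s = np.empty(T)
    s[0] = 1.0
    s[1:] = 0.5 + 0.5 * np.abs(eps[:-1])      # predictable scale: depends only on the past
    Z = s * eps                               # martingale differences with |Z_t| <= s_t <= 1 = R
    cond_var = s ** 2 / 3.0                   # E[Z_t^2 | F_{t-1}] = s_t^2 / 3
    violations += Z.sum() > eta * cond_var.sum() + np.log(1 / delta) / eta
print(f"empirical violation rate: {violations / trials:.4f} (Freedman allows up to {delta})")
```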
18 B.3 Prop erties of Generalized Smo othness In this section, w e giv e further prop erties of generalized smo othness, starting with short pro of of the equiv alence b et ween generalized smo othness and having uniform cov ers for all VC classes. Pro of of Prop osition 4 W e first chec k that a ( µ 0 , ρ )-generalized smo oth distribution class U has uniform co vers for any V C class F . Fix ϵ > 0. The distribution µ 0 admits a finite co ver F ϵ at scale δ := ρ − 1 ( ϵ ) > 0 [ Hau95 ], that is, for an y f ∈ F there exists ˆ f ∈ F ϵ suc h that µ 0 ( { x : f ( x )  = ˆ f ( x ) } ) ≤ δ . By the generalized smoothness property , F ϵ is therefore a uniform co ver for U for scale ϵ > 0. T o complete the pro of, it suffices to c hec k that if U has uniform co vers for all generalized thresh- olds, then it is generalized smo oth. Indeed, if it do esn’t hold, Lemma 2 shows that µ U is not a contin- uous submeasure. F ollo wing the standard argument in the b eginning of the pro of of Lemma 1 , this sho ws that there exists ϵ > 0 and a sequence of disjoint measurable sets ( A i ) i ≥ 1 suc h that µ U ( A i ) ≥ ϵ for all i ≥ 1. W e then construct the generalized threshold class F := { x 7→ 1 [ x ∈ S j ≥ i A j ] : i ≥ 1 } whic h by construction is an infinite ϵ -packing for µ U . That is, for any f  = f ′ ∈ F , there exists µ ∈ U suc h that µ ( { x : f ( x )  = f ′ ( x ) } ) ≥ ϵ . In particular, F do es not admit finite uniform cov ers for scale ϵ/ 2, ending the pro of. ■ Next, we show that the tolerance function ρ for a ( µ 0 , ρ )-generalized smo oth distribution class U is necessarily well-behav ed if all distribution in U are non-atomic. Prop osition 14. L et U b e a ( µ 0 , ρ ) -gener alize d smo oth distribution class wher e ρ : [0 , 1] → R + is non-de cr e asing and lim ϵ → 0 ρ ( ϵ ) = 0 , such that al l µ ∈ U ar e non-atomic (i.e., µ ( A ) = 0 for al l atoms A ). Then, the function ˜ ρ define d via ˜ ρ ( ϵ ) = sup { µ ( A ) : µ ∈ U , A ∈ B , µ 0 ( A ) ≤ ϵ } , ϵ ≥ 0 , is wel l-b ehave d and U is ( µ 0 , ˜ ρ ) -gener alize d smo oth. Also, ˜ ρ ( ϵ ) ≤ ρ ( ϵ ) for al l ϵ ≥ 0 . Pro of First, since U is ( µ 0 , ρ )-generalized smo oth, for any ϵ ≥ 0, µ ∈ U and A ∈ B with µ 0 ( A ) ≤ ϵ w e hav e µ ( A ) ≤ ρ ( ϵ ). Hence, by construction of ˜ ρ , we hav e ˜ ρ ( ϵ ) ≤ ρ ( ϵ ). Also, by construction we directly hav e ˜ ρ (0) = 0. It remains to chec k that ˜ ρ is well-behav ed. T o do so, w e first sho w that for any non-atomic distribution µ ≪ µ 0 on X , A ∈ B and δ ∈ [0 , µ 0 ( A )], there exists B ⊆ A such that µ 0 ( B ) ≤ δ and µ ( B ) ≥ µ ( A ) δ µ 0 ( A ) . T o do so, for an y z ≥ 0 w e define the set B z := A ∩ { x : dµ dµ 0 ( x ) ≥ z } . Since µ ≪ µ 0 w e hav e B z ↓ 0 as z → ∞ , while B 0 = A . Hence, there exists z δ ≥ 0 suc h that µ 0 ( B z δ ) ≥ δ but µ 0 ( A ∩ { x : dµ dµ 0 ( x ) > z δ } ) ≤ δ . As a result, there exists a set B such that A ∩  x : dµ dµ 0 ( x ) > z δ  ⊆ B ⊆ A ∩  x : dµ dµ 0 ( x ) ≥ z δ  = B z δ and an atom C ∈ B such that B ∪ C ⊆ B z δ , µ 0 ( B ) ≤ δ but µ 0 ( B ∪ C ) ≥ δ . No w we compute µ ( A ) = µ ( B ∪ C ) + µ ( A \ ( B ∪ C )) ( i ) ≤ µ ( B ∪ C ) + z δ µ 0 ( A \ ( B ∪ C )) ( ii ) ≤ µ ( B ∪ C ) + µ ( B ∪ C ) µ 0 ( B ∪ C ) µ 0 ( A \ ( B ∪ C )) ( iii ) = µ ( B ) µ 0 ( A ) µ 0 ( B ∪ C ) . In ( i ) we used the fact that A \ ( B ∪ C ) ⊆ { x : dµ dµ 0 ( x ) ≤ z 2 } and in ( ii ) we used B ∪ C ⊆ B z δ . In ( iii ) we used µ ( B ) = µ ( B ∪ C ) since µ is non-atomic. This exactly gives the desired b ound µ ( B ) ≥ µ ( A ) δ µ 0 ( A ) while µ 0 ( B ) ≤ δ . 
With this property at hand, we can check that $\tilde\rho(\epsilon)/\epsilon$ is non-increasing in $\epsilon$. First, note that all $\mu \in \mathcal U$ satisfy $\mu \ll \mu_0$, since otherwise we would have $\lim_{\epsilon \to 0} \rho(\epsilon) > 0$. Also, by assumption, all distributions $\mu \in \mathcal U$ are non-atomic. Now fix any $0 < \epsilon_1 \le \epsilon_2$. For any $\eta > 0$ there are $\mu \in \mathcal U$ and $A \in \mathcal B$ with $\mu_0(A) \le \epsilon_2$ and $\mu(A) \ge \tilde\rho(\epsilon_2) - \eta$. From the previous paragraph, there is $B \subseteq A$ with $\mu_0(B) \le \epsilon_1$ and $\mu(B) \ge (\tilde\rho(\epsilon_2) - \eta)\frac{\epsilon_1}{\epsilon_2}$. This holds for any $\eta > 0$ and hence $\tilde\rho(\epsilon_1) \ge \tilde\rho(\epsilon_2)\frac{\epsilon_1}{\epsilon_2}$. ■

C Proofs from Section 4

We start by proving the coupling Lemma 5, which reduces generalized smooth adversaries to (standard) smooth adversaries.

Proof of Lemma 5 Denote by $\mu_0$ the base measure for the generalized smooth class $\mathcal U$. We also denote by $\tilde\mu_0 := \frac12 \mu_0 + \frac12 \delta_{x_\emptyset}$ the mixture of the base measure and a Dirac at the dummy variable $x_\emptyset$. For convenience, we denote $\eta := \rho(\epsilon)/\epsilon$. For any $t \in [T]$, denote by $\mu_t$ the distribution of $x_t$ conditional on the past history $x_{<t}$. For convenience, denote $A_t := \{x : \frac{d\mu_t}{d\mu_0}(x) > \eta\}$. Then, by construction, $\eta\, \mu_0(A_t) < \mu_t(A_t) \le \rho(\mu_0(A_t))$, where in the last inequality we used $\mu_t \in \mathcal U$ and the generalized smoothness property. Recalling that $\eta = \rho(\epsilon)/\epsilon$ and that $\alpha \mapsto \rho(\alpha)/\alpha$ is non-increasing, we obtain $\mu_0(A_t) \le \epsilon$, and in turn $\mu_t(A_t) \le \rho(\epsilon)$. This ends the proof. ■

Next, we prove Theorem 6, which gives regret bounds for ERM under any VC class and generalized smooth distribution class $\mathcal U$.

Proof of Theorem 6 We will use Lemma 5 to reduce the problem to smooth adversaries, for which [BRS24] gives regret bounds. To do so, we fix $\epsilon > 0$ and, letting $\sigma := \frac{\epsilon}{2\rho(\epsilon)}$, we use Lemma 5 to construct the $\sigma$-smooth sequence $\tilde x_1, \ldots, \tilde x_T$ as guaranteed by this result. We then extend the function class $\mathcal F$ to $\mathcal X \cup \{x_\emptyset\}$ by posing $f(x_\emptyset) = 0$ for all $f \in \mathcal F$. Throughout, we use tildes to distinguish between quantities computed for the smoothed sequence $\tilde x$ and the original sequence $x$. Denote by $f_t$ the prediction function chosen by the ERM algorithm at time $t$ (which was trained on the available data $(x_{t'}, f^\star(x_{t'}))_{t' < t}$) [...] for some universal constant $C > 0$. As a slight note, [BRS24, Theorem 8] gives a dependency of the regret on $1/\sigma$, but this can be improved to $1/\sqrt{\sigma}$ with the same proof, as noted in [Bla25]. As a result, we obtain
\[
\mathbb E\left[\sum_{t=1}^T \ell_t(\hat y_t)\right] \le \mathbb E\left[\sum_{t=1}^T \mathbb{1}[\tilde y_t \neq f^\star(\tilde x_t)] + \mathbb{1}[\tilde x_t = x_\emptyset]\right] \le C \sqrt{\frac{2\rho(\epsilon)}{\epsilon}\, d\, T \ln^5 T} + \rho(\epsilon)\, T,
\]
where in the last inequality we used the last property from Lemma 5. Taking the infimum over $\epsilon > 0$ ends the proof. ■

Finally, we prove the regret bound for R-Cover from Theorem 7 for agnostic distributionally constrained adversaries.

Proof of Theorem 7 Again, we use Lemma 5 to reduce the problem to smooth adversaries, for which [Bla25] gives regret bounds. We use the same notation for the $\sigma := \frac{\epsilon}{2\rho(\epsilon)}$-smooth sequence $\tilde x_1, \ldots, \tilde x_T$ as in the proof of Theorem 6. Importantly, the only parts of the original proof which use smoothness are [Bla25, Lemma 11], which bounds certain discrepancy terms denoted $\Gamma^{(p,r)}_k$ (for both classification and regression), and the step going from oblivious regret bounds to adaptive regret bounds (only for classification). In particular, their regret decomposition still holds even in the regression setup. We rewrite these bounds here for completeness.
For context, the R-Cover algorithm proceeds by epochs on several layers $p \in \{0, \ldots, P = \lfloor \log_2 T \rfloor\}$: the layer-$p$ epochs are denoted $E^{(p)}_1, \ldots, E^{(p)}_{N_p}$, form a roughly-balanced partition of $[T]$, and satisfy $N_p = 2^{P-p}$. Their endpoints are denoted via $E^{(p)}_k := (T^{(p)}_{k-1}, T^{(p)}_k]$ for $k \in [N_p]$. As a first step of the original proof (see [Bla25, Equation (14)]), the regret of the learning-with-experts algorithm at layer $p$ during epoch $E^{(p)}_k$, for $k \in [N_p]$, is bounded using the following quantity (with simplified notation):
\[
\Delta^{(p)}_k := \sum_{t \in E^{(p)}_k} \mathbb E_{r \sim \hat p_t}\big[(\ell_t(\hat y_t) - \ell_t(A_r(t)))^2\big].
\]
Here, $\hat p_t$ is the distribution over experts used by the algorithm at time $t$, $\hat y_t$ denotes the prediction of the algorithm, and $A_r(t)$ denotes the prediction of expert $r$ at time $t$. The main point is that we can bound
\[
\Delta^{(p)}_k \le \bar\Delta^{(p)}_k + \sum_{t \in E^{(p)}_k} \mathbb{1}[\tilde x_t = x_\emptyset],
\quad\text{where}\quad
\bar\Delta^{(p)}_k := \sum_{t \in E^{(p)}_k} \mathbb E_{r \sim \hat p_t}\big[(\ell_t(\hat y_t) - \ell_t(A_r(t)))^2\big]\,\mathbb{1}[\tilde x_t = x_t]. \tag{8}
\]
As a note, $\bar\Delta^{(p)}_k$ is distinct from the quantity $\tilde\Delta^{(p)}_k$, which would correspond to using the smoothed sequence $\tilde x_t$ throughout the game instead of $x_t$, since R-Cover constructs covers (which affect $\hat p_t$) using the available data $x_t$. Similarly, we can bound the following quantity, which is also used within the original proof:
\[
\Lambda^{(p)}_k := \sum_{t \in E^{(p)}_k} \ell_t\big(f^{(p)}_{k,S}(x_t)\big) - \ell_t(f^\star(x_t)) \le \tilde\Lambda^{(p)}_k + \sum_{t \in E^{(p)}_k} \mathbb{1}[\tilde x_t = x_\emptyset], \tag{9}
\]
where we recall that $\tilde\Lambda^{(p)}_k$ denotes the quantity analogous to $\Lambda^{(p)}_k$ but for the smoothed sequence $\tilde x_t$.

We further develop these bounds as in the original proof, using [Bla25, Lemma 10]. Formally, we first define
\[
\mathcal P^{(p)}_k := \Big\{f \in \mathcal F : \max_{t \in [T^{(p)}_{k-1}]} |f(x_t) - f^\star(x_t)| \le 2\alpha\Big\},
\]
which corresponds to the set of functions approximately consistent with $f^\star$ on instances from previous layer-$p$ epochs. Then, for any $r \ge 1$, we define
\[
\bar\Gamma^{(p,r)}_k := \sum_{t \in E^{(p)}_k} \bar\gamma^{(p,r)}(t),
\quad\text{where}\quad
\bar\gamma^{(p,r)}(t) := \sup_{f \in \mathcal P^{(p)}_k} \mathbb E\big[|f(x_t) - f^\star(x_t)|^r\, \mathbb{1}[\tilde x_t = x_t] \mid \mathcal H_{t-1}\big], \qquad t \in E^{(p)}_k,
\]
which intuitively quantifies the $\ell_r$ discrepancy between the queries on epoch $E^{(p)}_k$ and queries prior to this epoch. Since within $\bar\Delta^{(p)}_k$ we only focus on times when $\tilde x_t$ and $x_t$ coincide, the same proof as for [Bla25, Lemma 10] shows that with probability at least $1 - \delta$, we have $\bar\Delta^{(p)}_k \le 5\,\bar\Gamma^{(p,2)}_k + 16 \ln\frac{T}{\delta}$ for all $p \in [P]$, $k \in [N_p]$.

A main observation is that since all functions agree on $x_\emptyset$, observing the value of $f^\star$ on this instance does not bring any new information. Formally, for all layers $p$ and epochs $k$, $\mathcal P^{(p)}_k \subseteq \tilde{\mathcal P}^{(p)}_k$. Hence, for any $t \in E^{(p)}_k$, one has
\[
\bar\gamma^{(p,r)}(t) \le \tilde\gamma^{(p,r)}(t) := \sup_{f \in \tilde{\mathcal P}^{(p)}_k} \mathbb E\big[|f(\tilde x_t) - f^\star(\tilde x_t)|^r \mid \mathcal H_{t-1}\big].
\]
In particular, we obtain $\bar\Gamma^{(p,r)}_k \le \tilde\Gamma^{(p,r)}_k := \sum_{t \in E^{(p)}_k} \tilde\gamma^{(p,r)}(t)$ for all $r \ge 1$, which corresponds to the quantity used within the original proof if we were working with the smoothed data $\tilde x_t$. Altogether, the regret decomposition bound from [Bla25, Equation (14)] shows that for any $p_0 \in \{0, \ldots, P\}$, with probability at least $1 - 4\delta$,
\begin{align*}
\sum_{t=1}^T \ell_t(\hat y_t) - \ell_t(f^\star(x_t))
&\le 12 \sum_{p=\max(p_0,1)}^{P} \sum_{k \in [N_p]} \sqrt{2 \max\big(\Delta^{(p)}_k, 2\big) \ln\big(\mathcal N(\mathcal F; \alpha, T) + 1\big)} + 8 N_{p_0} \ln\frac{2T}{\delta} + \sum_{k \in [N_{p_0}]} \Lambda^{(p_0)}_k \\
&\overset{(i)}{\le} 12 \sum_{p=\max(p_0,1)}^{P} \sum_{k \in [N_p]} \sqrt{2 \max\big(\bar\Delta^{(p)}_k, 2\big) \ln\big(\mathcal N(\mathcal F; \alpha, T) + 1\big)} + 8 N_{p_0} \ln\frac{2T}{\delta} + \sum_{k \in [N_{p_0}]} \tilde\Lambda^{(p_0)}_k \\
&\qquad + \Big(12 P \sqrt{2 \ln\big(\mathcal N(\mathcal F; \alpha, T) + 1\big)} + 1\Big) \sum_{t \in [T]} \mathbb{1}[\tilde x_t = x_\emptyset] \\
&\le 12 \sum_{p=\max(p_0,1)}^{P} \sum_{k \in [N_p]} \sqrt{2 \Big(5\, \tilde\Gamma^{(p,2)}_k + 16 \ln\frac{T}{\delta}\Big) \ln\big(\mathcal N(\mathcal F; \alpha, T) + 1\big)} + 8 N_{p_0} \ln\frac{2T}{\delta} + \sum_{k \in [N_{p_0}]} \Big(2\, \tilde\Gamma^{(p_0,1)}_k + 3 \ln\frac{T}{\delta}\Big) \\
&\qquad + \Big(12 P \sqrt{2 \ln\big(\mathcal N(\mathcal F; \alpha, T) + 1\big)} + 1\Big) \sum_{t \in [T]} \mathbb{1}[\tilde x_t = x_\emptyset],
\end{align*}
where $\mathcal N(\mathcal F; \alpha, T)$ is the $\alpha$-covering number of $\mathcal F$ and $\alpha$ is the parameter used within the algorithm to construct coverings. In $(i)$ we used Eqs. (8) and (9) together with the inequalities $\sqrt{a+b} \le \sqrt a + \sqrt b$ for $a, b \ge 0$ and $\sqrt a \le a$ for any non-negative integer $a$. Importantly, except for the term proportional to $\sum_{t \in [T]} \mathbb{1}[\tilde x_t = x_\emptyset]$, the last expression exactly corresponds to the regret bound from the original proof for the smoothed data $\tilde x_t$. In particular, the rest of the proof directly applies. Also, in the binary classification setting, we can take the parameter $\alpha = 0$, and by the Sauer-Shelah lemma [Sau72, She72] we have $\ln \mathcal N(\mathcal F; \alpha, T) \lesssim d \ln T$. In particular, in the binary classification setting, the rest of the proof of [Bla25, Proposition 13] implies that for any $f^\star \in \mathcal F$, with probability at least $1 - \delta$,
\[
\sum_{t=1}^T \ell_t(\hat y_t) - \sum_{t=1}^T \ell_t(f^\star(x_t)) \le C \sqrt{\frac{\big(d \ln^2 T + d \ln\ln\frac{1}{\delta} + \ln\frac{1}{\delta}\big) \ln^3 T}{\sigma}\cdot T} + C \sqrt{d \ln^3 T}\, \sum_{t=1}^T \mathbb{1}[\tilde x_t = x_\emptyset],
\]
for some universal constant $C > 0$. Further, from the last claim of Lemma 5, we can apply Freedman's inequality (Theorem 13) to the martingale increments $\mathbb{1}[\tilde x_t = x_\emptyset] - \mathbb P[\tilde x_t = x_\emptyset \mid \tilde x_{<t}]$ [...]

[...] $\bigcup_{j > i} A_j$. By construction, all sets $B_i$ are disjoint and, if $i \le N_{\mathcal U}(\epsilon) + 1$,
\[
\mu_i(B_i) \ge \mu_i(A_i) - \sum_{j > i} \mu_i(A_j) \ge 2\epsilon - (i-1)\frac{\epsilon}{N_{\mathcal U}(\epsilon)} \ge \epsilon.
\]
By definition of $N_{\mathcal U}(\epsilon)$, this implies that the construction must have ended at some index $i_{\max} \le N_{\mathcal U}(\epsilon)$. We then pose $\mu_\epsilon := \frac{1}{i_{\max}} \sum_{j=1}^{i_{\max}} \mu_j$. Then, for any measurable set $A \subseteq \mathcal X$ with $\mu_\epsilon(A) \le \frac{\epsilon}{N_{\mathcal U}(\epsilon)^2}$ we have
\[
\sum_{j=1}^{i_{\max}} \mu_j(A) \le i_{\max}\, \frac{\epsilon}{N_{\mathcal U}(\epsilon)^2} \le \frac{\epsilon}{N_{\mathcal U}(\epsilon)}.
\]
Since the construction ended at $i_{\max}$, this precisely implies that $\mu_{\mathcal U}(A) \le 2\epsilon$, ending the proof. ■

We next prove the regret lower bound from Theorem 8 based on fragmentation numbers.

Proof of Theorem 8 The proof uses classical arguments from [BDPSS09], which give a lower bound $\Omega(\sqrt{\mathrm{LD}(\mathcal F)\, T} \wedge T)$ on the regret for adversarial online learning with a function class $\mathcal F$. We recall the definition of the Littlestone dimension $\mathrm{LD}(\mathcal F)$ in Definition 10 in the appendix for completeness. Fix $\epsilon > 0$ and, for convenience, denote $N = N_{\mathcal U}(\epsilon)$. Let $A_1, \ldots, A_N$ be disjoint measurable sets such that for all $i \in [N]$, $\mu_{\mathcal U}(A_i) \ge \epsilon$. Let $I_j := \{i \in [N] : i \in [(j-1)N/d,\, jN/d]\}$ for $j \in [d]$. For convenience, we enumerate $I_j := \{i^{(j)}_l,\; l \in [N_j]\}$. For each $l \in \{0, \ldots, N_1\} \times \ldots \times \{0, \ldots, N_d\}$, we define the function $f_l : \mathcal X \to \{0,1\}$ via
\[
f_l(x) := \begin{cases} 1 & \text{if } x \in \bigcup_{j \in [d]} \bigcup_{l' > l_j} A_{i^{(j)}_{l'}}, \\ 0 & \text{otherwise.} \end{cases}
\]
We then consider the function class $\mathcal F$ which contains all such functions. For convenience, we also consider a simplified version $\mathcal F'$ of this function class:
$\mathcal F' : I_1 \sqcup \ldots \sqcup I_d \to \{0,1\}$, where we identify each set $A_i$ with a single point $i$; it is the product of threshold classes over each index set $I_j$ for $j \in [d]$. The Littlestone dimension of both function classes is the sum of the Littlestone dimensions of each underlying threshold class:
\[
\mathrm{LD}(\mathcal F) = \mathrm{LD}(\mathcal F') \ge \sum_{j \in [d]} \lceil \log_2(|I_j| + 1) \rceil \gtrsim (d \wedge N) \log_2\big(\lceil N/d \rceil + 1\big). \tag{11}
\]
Fix a learning algorithm $\mathrm{alg}$. We then emulate the proof of the oblivious regret lower bound for online learning on this function class for a horizon $T' := \lceil \epsilon T / 2 \rceil$ as follows. For each new iteration $t$, let $i_t \in I_1 \sqcup \ldots \sqcup I_d$ be the next element which would be chosen by the online learning lower bound adversary. At time $t$, we then define $\mu_t$ to be a distribution such that $\mu_t(A_{i_t}) \ge \epsilon$, which exists by construction. If the sample $x_t \sim \mu_t$ satisfies $x_t \in A_{i_t}$, we use the same labeling $y_t$ as the online learning lower bound adversary. Otherwise, we randomly label $y_t \sim \mathcal B(1/2)$ and ignore this iteration for the online learning adversary. Conditionally on the event $\mathcal E$ that we were able to emulate $T'$ iterations of the online learning adversary, the regret lower bound for online learning [BDPSS09] implies the desired lower bound
\[
\mathrm{Reg}_T(\mathrm{alg}; \mathcal F, \mathcal U \mid \mathcal E) \gtrsim \sqrt{\mathrm{LD}(\mathcal F)\, T'} \wedge T',
\]
since the ignored iterations can only increase the regret in expectation. Here, $\mathrm{Reg}_T(\mathrm{alg}; \mathcal F, \mathcal U \mid \mathcal E)$ denotes the regret conditional on $\mathcal E$. By construction, each iteration $t \in [T]$ emulates a step of the online learning adversary with probability at least $\epsilon$. Therefore, by Azuma-Hoeffding's inequality,
\[
\mathbb P[\mathcal E^c] = \mathbb P\Big[\sum_{t \in [T]} \mathbb{1}[x_t \in A_{i_t}] < \frac{\epsilon T}{2}\Big] \overset{(i)}{\le} \mathbb P_{Y \sim \mathcal B(T, \epsilon)}\Big[Y < \frac{\epsilon T}{2}\Big] \overset{(ii)}{\le} e^{-\epsilon T / 8}.
\]
In $(i)$ we lower bounded the quantity of interest stochastically by a binomial with parameters $T$ and $\epsilon$, and in $(ii)$ we used Chernoff's inequality. Therefore, the final regret lower bound becomes
\[
\mathrm{Reg}_T(\mathrm{alg}; \mathcal F, \mathcal U) \ge \mathrm{Reg}_T(\mathrm{alg}; \mathcal F, \mathcal U \mid \mathcal E)\,(1 - \mathbb P[\mathcal E^c]) \gtrsim \sqrt{\mathrm{LD}(\mathcal F)\, \epsilon T} \wedge (\epsilon T) \gtrsim \sqrt{\epsilon\, (d \wedge N_{\mathcal U}(\epsilon))\, T \ln\Big(\frac{N_{\mathcal U}(\epsilon)}{d} + e\Big)} \wedge (\epsilon T),
\]
where in the last inequality we used Eq. (11). This implies the desired result when $3d \le N_{\mathcal U}(\epsilon)$. ■

Finally, we prove the regret upper bound from Theorem 10 for $\mathcal U$-dependent algorithms. This uses a quantitative variant of the coupling Lemma 5, whose proof is deferred to Appendix E.

Proof of Theorem 10 Fix a distribution class $\mathcal U$ and a function class $\mathcal F$ of VC dimension $d$. Fix $\epsilon > 0$. From Lemma 9, we can construct a distribution $\mu_\epsilon$ such that for any measurable set $A \subseteq \mathcal X$ with $\mu_\epsilon(A) \le \frac{\epsilon}{N_{\mathcal U}(\epsilon)^2}$, one has $\mu_{\mathcal U}(A) \le 2\epsilon$. Using Lemma 15, we then couple the original adversary with a sequence $\tilde x_1, \ldots, \tilde x_T$ generated by an $(\epsilon/(4 N_{\mathcal U}(\epsilon)^2))$-smooth adversary (with respect to a base measure $\tilde\mu_\epsilon$) such that for all $t \in [T]$, $\mathbb P[\tilde x_t \neq x_t \mid \tilde x_{<t}] \le 2\epsilon$. [...] where $\alpha > 0$ is a tuning parameter. Following [HRS20, Theorem 3.1], we can bound the expected regret of $\mathrm{alg}_\alpha$ by
\[
O\Big(\sqrt{T \ln |\mathcal H(\alpha)|}\Big) + \mathbb E\left[\max_{f \in \mathcal F} \min_{g \in \mathcal F(\alpha)} \sum_{t=1}^T \mathbb{1}[f(x_t) \neq g(x_t)]\right]
\le O\Big(\sqrt{T \ln |\mathcal H(\alpha)|}\Big) + \mathbb E\left[\max_{f \in \mathcal F} \min_{g \in \mathcal F(\alpha)} \sum_{t=1}^T \mathbb{1}[f(\tilde x_t) \neq g(\tilde x_t)]\right] + \mathbb E\left[\sum_{t=1}^T \mathbb{1}[\tilde x_t \neq x_t]\right].
\]
By construction, the last term is bounded by $2\epsilon T$. The first two terms exactly correspond to the regret analysis for the smoothed data $\tilde x_{\le T}$.
In turn, the arguments from [HRS20, Theorem 3.1] directly imply that with an appropriate value of $\alpha$,
\[
\mathrm{Reg}_T(\mathrm{alg}_\alpha; \mathcal F, \mathcal U) \lesssim \sqrt{T d \ln\Big(\frac{4 T N_{\mathcal U}(\epsilon)^2}{d\epsilon}\Big)} + d \ln\Big(\frac{4 T N_{\mathcal U}(\epsilon)^2}{d\epsilon}\Big) + \epsilon T.
\]
We then minimize this value over $\epsilon \in (0,1]$. First, note that without loss of generality we may focus on $\epsilon \ge 1/T$: smaller values cannot improve the right-hand side by more than a constant factor, since the first term is already $\Omega(1)$ in that regime. Next, we can ignore the second term on the right-hand side: if it dominates the first term, then necessarily $d \ln\big(\frac{4 T N_{\mathcal U}(\epsilon)^2}{d\epsilon}\big) \gtrsim T$, in which case the bound is already trivial (it exceeds $T$). Tuning the parameter $\epsilon \in (1/T, 1]$ then gives the desired result. ■

E A quantitative coupling lemma

The next lemma is a variant of Lemma 5 that is convenient when one only has a guarantee of the form $\mu_0(A) \le \epsilon \Rightarrow \mu_{\mathcal U}(A) \le \eta$ for some base measure $\mu_0$.

Lemma 15. Let $\mathcal U$ be a generalized smooth distribution class on $\mathcal X$, let $\epsilon, \eta \in (0,1]$, and let $\mu_0$ be a probability measure such that for any measurable set $A \subseteq \mathcal X$, if $\mu_0(A) \le \epsilon$ then $\mu_{\mathcal U}(A) \le \eta$. Consider any $\mathcal U$-constrained adaptive adversary generating samples $x_1, \ldots, x_T$. Fix a dummy symbol $x_\emptyset \notin \mathcal X$. Then there exists a coupling with a sequence $\tilde x_1, \ldots, \tilde x_T \in \mathcal X \cup \{x_\emptyset\}$ which is generated by an $(\epsilon/4)$-smooth adversary on $\mathcal X \cup \{x_\emptyset\}$ such that for all $t \in [T]$,

1. $\tilde x_t \in \{x_t, x_\emptyset\}$, and

2. $\mathbb P[\tilde x_t = x_\emptyset \mid \tilde x_{<t}] \le \eta$.

Proof [...] $\epsilon$, and $x_\emptyset$ otherwise. Let $\tilde\mu_t$ be the distribution of $\tilde x_t$ conditional on $x_{<t}$ [...] $A_t := \big\{x \in \mathcal X : \tfrac{d\mu_t}{d\mu_0}(x) > 2\sigma\big\} \cap \big\{x \in \mathcal X : \mu_0(\{x\}) \le \epsilon\big\}$, where $\sigma := \eta/\epsilon$. By construction, $\mathbb P[\tilde x_t = x_\emptyset \mid \tilde x_{<t}]$ [...] Suppose now that $\mu_0(A_t) > \epsilon$. By Lemma 16, there exists a measurable $B \subseteq A_t$ with $\mu_0(B) \in (\epsilon/2, \epsilon]$. On the one hand, by the defining property of $A_t$ we have $\mu_t(B) \ge 2\sigma\, \mu_0(B) > \sigma\epsilon = \eta$. On the other hand, $\mu_0(B) \le \epsilon$ implies $\mu_t(B) \le \mu_{\mathcal U}(B) \le \eta$, a contradiction. Therefore $\mu_t(A_t) \le \eta$ in all cases, completing the proof. ■

Lemma 16. Let $\epsilon > 0$, let $\mu$ be a probability measure on $\mathcal X$, and let $A \subseteq \mathcal X$ be measurable such that $\mu(A) > \epsilon$ and for all $x \in A$ we have $\mu(\{x\}) \le \epsilon$. Then there exists a measurable set $B \subseteq A$ such that $\mu(B) \in (\epsilon/2, \epsilon]$.

Proof If there exists an atom $x \in A$ with $\mu(\{x\}) \in (\epsilon/2, \epsilon]$, then take $B = \{x\}$. Otherwise, every atom in $A$ has mass at most $\epsilon/2$. Let $A_{\mathrm{at}} \subseteq A$ be the (at most countable) set of atoms and write $m := \mu(A_{\mathrm{at}}) \in [0,1]$. If $m \ge \epsilon/2$, pick a finite subset of atoms $\{x_1, \ldots, x_k\} \subseteq A_{\mathrm{at}}$ by adding atoms greedily until the partial sum first exceeds $\epsilon/2$. Since each atom has mass at most $\epsilon/2$, the resulting set $B = \{x_1, \ldots, x_k\}$ satisfies $\mu(B) \in (\epsilon/2, \epsilon]$. If $m < \epsilon/2$, then the non-atomic part $A \setminus A_{\mathrm{at}}$ has measure $\mu(A \setminus A_{\mathrm{at}}) > \epsilon - m > \epsilon/2$. Since $\mu$ restricted to $A \setminus A_{\mathrm{at}}$ is non-atomic, there exists a measurable $C \subseteq A \setminus A_{\mathrm{at}}$ with $\mu(C) = \epsilon/2$. Taking $B := A_{\mathrm{at}} \cup C$ yields $\mu(B) = m + \epsilon/2 \in (\epsilon/2, \epsilon]$. ■

F Proofs from Section 6

F.1 Proof of Theorem 11

We start by showing the following lemma, which is a consequence of Turán's theorem, which lower bounds the independence number of graphs with few edges.

Lemma 17. Fix $\epsilon, \delta \in (0,1]$, $N \ge 1$, and let $\mathcal U$ be a distribution class such that $N_{\mathcal U}(\epsilon) \ge N$. Then there exist $k \ge N\delta/2$ disjoint measurable sets $A_1, \ldots, A_k$ together with distributions $\mu_1, \ldots, \mu_k \in \mathcal U$ such that

1. for all $i \in [k]$, $\mu_i(A_i) \ge \epsilon$,

2. and for all $i \neq j \in [k]$, $\mu_i(A_j) \le \delta$.
Additionally, if $N_{\mathcal U}(\epsilon) = \infty$, then there exist disjoint measurable sets $(A_i)_{i \ge 1}$ and distributions $(\mu_i)_{i \ge 1}$ in $\mathcal U$ such that for all $i \ge 1$, $\mu_i(A_i) \ge \epsilon$, and for all $n \ge 1$,
\[
\sup_{2^n \le i \neq j < 2^{n+1}} \mu_i(A_j) \le 2^{-n}.
\]

Proof Fix $\mathcal U$ such that $N_{\mathcal U}(\epsilon) \ge N$. Then there exist disjoint measurable sets $B_1, \ldots, B_N$ together with distributions $\nu_1, \ldots, \nu_N \in \mathcal U$ with $\nu_i(B_i) \ge \epsilon$ for all $i \in [N]$. Construct the graph $G = ([N], E)$ on $[N]$ such that $(i,j) \in E$ if and only if $\nu_i(B_j) \ge \delta$ or $\nu_j(B_i) \ge \delta$, for the given $\delta \in (0,1]$. Note that $G$ has at most $N/\delta$ edges, since for each $i \in [N]$ there can be at most $1/\delta$ indices $j \in [N]$ for which $\nu_i(B_j) \ge \delta$, given that all sets $B_j$ are disjoint. Therefore, $G$ has average node degree at most $1/\delta$. Hence, Turán's theorem shows that there is an independent set $S$ of size at least $N/(1 + 1/\delta) \ge N\delta/2$. Focusing only on the indices $i \in S$, with the corresponding sets $B_i$ and distributions $\nu_i$, gives the first desired result.

We now prove the second claim, when $N_{\mathcal U}(\epsilon) = \infty$. Fix a sequence $(B_j)_{j \ge 1}$ of disjoint measurable sets and corresponding distributions $(\nu_j)_{j \ge 1}$ in $\mathcal U$ such that $\nu_j(B_j) \ge \epsilon$ for all $j \ge 1$. We then essentially stitch together the above construction for the parameters $\delta_n = 2^{-n}$, $n \ge 1$. Specifically, having constructed indices $j_i \in \mathbb N$ for all $i < 2^n$, we focus on the indices that have not been used yet, $I_n := \mathbb N \setminus \{j_l : l < 2^n\}$. Among the first $2^{2n}$ elements of $I_n$, the previous argument shows that we can construct distinct indices $j_i \in I_n$ for $i \in [2^n, 2^{n+1})$ such that, letting $A_i = B_{j_i}$ and $\mu_i = \nu_{j_i}$, we have $\mu_i(A_{i'}) \le 2^{-n}$ for all $i \neq i' \in [2^n, 2^{n+1})$. This proves the desired properties and ends the proof. ■

With this tool at hand, we now show that algorithms that privately learn generalized thresholds under distributions in $\mathcal U$ at scale $\epsilon$ with sample complexity $m$ induce private algorithms for learning thresholds on a domain of size $\Omega(\sqrt{N_{\mathcal U}(\epsilon)/m})$.

Lemma 18. Let $\mathcal U$ be a convex distribution class. Fix $m \ge 1$ and $\eta \in (0,1]$. If for any generalized threshold function class $\mathcal F$ there exists an $(\epsilon, \delta)$-differentially private $(\alpha, \beta)$-accurate algorithm on distributions in $\mathcal U$ with sample complexity $m$, then there exists an $(\epsilon, \delta)$-differentially private $(2\alpha/\eta, 2\beta)$-accurate PAC learner for thresholds on $[K]$ with sample complexity $m$, where $K = \Omega\big(\sqrt{\beta N_{\mathcal U}(\eta)/m}\big)$.

Next, suppose that $N_{\mathcal U}(\eta) = \infty$ for some $\eta \in (0,1]$. Then, if for any generalized threshold function class $\mathcal F$ there exists an $(\epsilon, \delta)$-differentially private $(\alpha, \beta)$-accurate algorithm on distributions in $\mathcal U$ with some sample complexity $m(\mathcal F) < \infty$, then there exist $m < \infty$ and an $(\epsilon, \delta)$-differentially private $(2\alpha/\eta, 2\beta)$-accurate PAC learner for thresholds on $[K]$ with sample complexity $m$, for any $K \ge 1$.

Proof Let $\gamma \in (0,1]$ be a parameter to be fixed later. We first use Lemma 17 to construct $K = \lfloor N_{\mathcal U}(\eta)\gamma/2 \rfloor$ disjoint measurable sets $A_i$ and distributions $\mu_i \in \mathcal U$ for $i \in [K]$ satisfying $\mu_i(A_i) \ge \eta$ and $\mu_i(A_j) \le \gamma$ for all $i \neq j \in [K]$. We now consider the generalized threshold class on these sets, $\mathcal F := \{x \in \mathcal X \mapsto \mathbb{1}[x \in \bigcup_{j \ge i} A_j] : i \in [K+1]\}$. Given an $(\epsilon, \delta)$-differentially private $(\alpha, \beta)$-accurate algorithm $\mathcal A$ for $\mathcal F$ on distributions in $\mathcal U$ with sample complexity $m$, we construct a new algorithm $\mathcal A'$ as follows.
For any dataset $D' = (a_i, y_i)_{i \in [m]} \in ([K] \times \{0,1\})^m$, we replace each datapoint $(a_i, y_i)$ with $(x_i, y_i' := y_i\, \mathbb{1}[x_i \in A_{a_i}])$, where the $x_i \sim \mu_{a_i}$ are sampled independently for each $i \in [m]$. Denote by $D$ the corresponding dataset and let $h$ be the output of $\mathcal A$ on $D$. Then $\mathcal A'$ returns the vector $\big(\mathbb{1}\big[\mathbb E_{x \sim \mu_i}[h(x) \mid x \in A_i] \ge 1/2\big]\big)_{i \in [K]}$. Since $\mathcal A$ is $(\epsilon, \delta)$-private, so is the algorithm that would directly output the hypothesis $h$ (note that $\mathcal A'$ can be viewed as a mixture of $(\epsilon, \delta)$-private algorithms, one for each potential realization of $x_i \sim \mu_i$ for $i \in [K]$). Hence $\mathcal A'$ is also $(\epsilon, \delta)$-private.

We next show that $\mathcal A'$ is a PAC learner for the class of thresholds on $[K]$, which we denote by $\mathcal F' := \{i \in [K] \mapsto \mathbb{1}[i \ge i_0] : i_0 \in [K+1]\}$. Fix any distribution $\nu$ on $[K]$ and a threshold $g' \in \mathcal F'$. Let $D' = (a_i, y_i)_{i \in [m]}$ be a dataset of $m$ i.i.d. samples from $\nu$ labeled by $g'$. Note that for any sample $i \in [m]$, letting $x_i \sim \mu_{a_i}$ be the sampled instance, if $x_i \notin \bigcup_{j \neq a_i} A_j$, then the value $y_i\, \mathbb{1}[x_i \in A_{a_i}]$ is consistent with the function $g : x \in \mathcal X \mapsto \mathbb{1}[x \in \bigcup_{i \in [K] : g'(i) = 1} A_i] \in \mathcal F$. In particular, under the event
\[
\mathcal E := \Big\{\forall i \in [m] : x_i \notin \bigcup_{j \neq a_i} A_j\Big\},
\]
$D$ coincides with a dataset composed of $m$ i.i.d. samples from the mixture $\mu := \sum_{i \in [K]} \nu_i\, \mu_i$ labeled by $g$. Note that $\mu \in \mathcal U$ since $\mathcal U$ is convex, and that by construction, $\mathbb P[\mathcal E^c] \le m K \gamma$.

Since $\mathcal A$ is $(\alpha, \beta)$-accurate, this shows that with probability at least $1 - \beta - m K \gamma$ over the sampled dataset $D'$, the output $h$ of $\mathcal A$ on the corresponding dataset $D$ satisfies
\[
\alpha \ge \mathbb P_{x \sim \mu}[h(x) \neq g(x)] \ge \mathbb E_{i \sim \nu}\big[\mu_i(A_i)\, \mathbb P_{x \sim \mu_i}[h(x) \neq g'(i) \mid x \in A_i]\big] \ge \frac{\eta}{2}\, \mathbb E_{i \sim \nu}\big[\mathbb{1}[h'(i) \neq g'(i)]\big],
\]
where $h'$ is the final output of $\mathcal A'$. This shows that $\mathcal A'$ is a $(2\alpha/\eta,\, \beta + m K \gamma)$-accurate PAC learner for $\mathcal F'$. Note that $m K \gamma \le m N_{\mathcal U}(\eta) \gamma^2 / 2$. Taking $\gamma = \sqrt{\frac{2\beta}{m N_{\mathcal U}(\eta)}}$ gives the desired result for the first claim.

The second claim is proved analogously. Suppose that $N_{\mathcal U}(\eta) = \infty$. Then we use Lemma 17 to construct disjoint measurable sets $A_i$ and distributions $\mu_i \in \mathcal U$ for $i \ge 1$ satisfying $\mu_i(A_i) \ge \eta$ for all $i \ge 1$, and, for all $n \ge 1$ and $i \neq j \in [2^n, 2^{n+1})$, $\mu_i(A_j) \le 2^{-n}$. As above, we construct the generalized threshold class on these sets, $\mathcal F := \{x \in \mathcal X \mapsto \mathbb{1}[x \in \bigcup_{j \ge i} A_j] : i \ge 1\}$. The main point is that for any $n \ge 1$, this function class essentially encapsulates the previous construction for $\gamma_n = 2^{-n}$, by focusing only on the subclass $\mathcal F_n := \{x \in \mathcal X \mapsto \mathbb{1}[x \in \bigcup_{j \ge i} A_j] : i \in [2^n, 2^n + K_n]\}$, where $K_n := \lfloor 2^n / m \rfloor$. The only slight modification in the construction of the PAC learner $\mathcal A'$ for thresholds on $[K_n]$ is that we additionally always label $y_i' = 1$ if $x_i \in \bigcup_{j < 2^n} A_j$, to preserve realizability in that case. Following the exact same arguments therefore shows that if there is an $(\epsilon, \delta)$-differentially private $(\alpha, \beta)$-accurate algorithm for $\mathcal F$ on distributions in $\mathcal U$ with sample complexity $m$, then for any $n \ge 1$ there is a $(2\alpha/\eta,\, \beta + m K_n \gamma_n \le 2\beta)$-accurate PAC learner for $\mathcal F'$. This ends the proof. ■

Combining this result with the sample complexity lower bound from [ALMM19, Theorem 1] for privately learning thresholds yields the following result.

Theorem 19. There is a constant $c_0 > 0$ such that the following holds. Let $\mathcal U$ be a convex distribution class, $\epsilon \in (0,1]$, and $m \ge 1$.
If for any generalized threshold class $\mathcal F$ there exists a $\big(0.1, \frac{c_0}{m^2 \log m}\big)$-differentially private $(\epsilon/32, 1/32)$-accurate algorithm on distributions from $\mathcal U$ with sample complexity $m$, then $m \ge \Omega(\log^\star N_{\mathcal U}(\epsilon))$.

Also, if $\mathcal U$ is a convex distribution class such that $N_{\mathcal U}(\epsilon) = \infty$ for some $\epsilon \in (0,1]$, then there exists a generalized threshold class $\mathcal F$ such that for any $m \ge 1$, there does not exist a $\big(0.1, \frac{c_0}{m^2 \log m}\big)$-differentially private $(\epsilon/32, 1/32)$-accurate algorithm on distributions from $\mathcal U$ with sample complexity $m$.

Proof Given such an algorithm for generalized threshold classes, Lemma 18 constructs a $\big(0.1, \frac{c_0}{m^2 \log m}\big)$-differentially private $(1/16, 1/16)$-accurate PAC learner for thresholds on $[K]$, where $K = \Omega(\sqrt{N_{\mathcal U}(\epsilon)/m})$. Therefore, for $c_0 > 0$ sufficiently small, [ALMM19, Theorem 1] shows that
\[
m \ge \Omega(\log^\star K) = \Omega\big(\log^\star N_{\mathcal U}(\epsilon) - \log^\star m\big).
\]
In turn, this shows the desired bound $m \ge \Omega(\log^\star N_{\mathcal U}(\epsilon))$. The second claim is proved similarly, using the second claim of Lemma 18. ■

We are now ready to prove Theorem 11, as an immediate consequence of Theorem 19 and the previous characterization of generalized smooth distribution classes.

Proof of Theorem 11 If $\mathcal U$ is not generalized smooth, then Lemma 2 implies that $\mu_{\mathcal U}$ is not a continuous submeasure, which in turn implies that there is $\epsilon > 0$ for which $N_{\mathcal U}(\epsilon) = \infty$. The desired impossibility result for private learnability is then a direct consequence of Theorem 19. ■

F.2 Proof of Theorem 12

The proof follows the same structure as private learning algorithms for VC classes under smoothness and with unlabelled public data [HRS20, ABM19], but we include it here for completeness. First, recall that under generalized smoothness, any VC class has finite uniform covers in the sense of Definition 7. We present this result below for completeness.

Lemma 20 (Uniform Covers under Generalized Smoothness). Let $\mathcal U$ be a $(\mu_0, \rho)$-generalized smooth class of distributions on $\mathcal X$. Then, for any function class $\mathcal F$ of VC dimension $d$ and any $\epsilon \in (0,1]$, there is a class $\mathcal F_\epsilon$ which is an $\epsilon$-uniform cover of $\mathcal F$ with respect to $\mathcal U$ and satisfies $\log |\mathcal F_\epsilon| \le d \log\big(\frac{C}{\rho^{-1}(\epsilon)}\big)$ for some universal constant $C > 0$.

Proof The idea is to construct an $\epsilon'$-cover of $\mathcal F$ with respect to the base measure $\mu_0$. For any $f, f'$ with $\mu_0(\{x : f(x) \neq f'(x)\}) \le \epsilon'$ and any $\mu \in \mathcal U$, we have $\mu(\{x : f(x) \neq f'(x)\}) \le \rho(\mu_0(\{x : f(x) \neq f'(x)\})) \le \rho(\epsilon')$. Setting $\epsilon' = \rho^{-1}(\epsilon)$, we see that any $\epsilon'$-cover with respect to $\mu_0$ is an $\epsilon$-uniform cover with respect to $\mathcal U$. The result then follows from the classical VC-dimension covering number bound. ■

The required private learning algorithm is then a direct application of the private learning algorithm based on the exponential mechanism from [KLN+11], using uniform covers. We recall the main result below for completeness.

Theorem 21 ([KLN+11]). Let $\mathcal F$ be a finite function class. For any $\alpha, \epsilon, \delta \in (0,1]$, there exists an $(\alpha, 0)$-differentially private algorithm that is $(\epsilon, \delta)$-accurate for learning $\mathcal F$ with sample complexity
\[
m = \frac{d + \log(1/\delta)}{\epsilon^2} + \frac{\log |\mathcal F|}{\alpha \epsilon}.
\]
Note that an $(\alpha, 0)$-private, $(\epsilon, \delta)$-accurate learning algorithm for $\mathcal F_\epsilon$ is also an $(\alpha, 0)$-private, $(2\epsilon, \delta)$-accurate learning algorithm for $\mathcal F$, for any distribution with marginal in $\mathcal U$.
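To illustrate how a finite uniform cover can be combined with an exponential-mechanism selection step in the spirit of Theorem 21, here is a minimal Python sketch. The cover construction, toy data, and all names below are illustrative assumptions and do not reproduce the algorithm of [KLN+11] or of the paper.

import numpy as np

# Minimal sketch of an exponential-mechanism learner over a finite cover.
# The cover, data, and parameter names are illustrative assumptions.
def exp_mech_learner(cover, X, y, alpha, rng=None):
    """cover: list of hypotheses h: x -> {0,1}; returns a privately selected h.

    Scoring each hypothesis by (minus) its number of empirical mistakes has
    sensitivity 1 under changing a single example, so sampling with
    probability proportional to exp(-alpha * mistakes / 2) is an
    (alpha, 0)-differentially private selection.
    """
    rng = rng or np.random.default_rng()
    mistakes = np.array([sum(h(x) != yi for x, yi in zip(X, y)) for h in cover])
    logits = -alpha * mistakes / 2.0
    probs = np.exp(logits - logits.max())   # numerically stabilized weights
    probs /= probs.sum()
    return cover[rng.choice(len(cover), p=probs)]

# Toy usage: thresholds on [0,1] as a stand-in finite cover.
rng = np.random.default_rng(0)
cover = [lambda x, t=t: int(x >= t) for t in np.linspace(0.0, 1.0, 21)]
X = rng.random(200)
y = (X >= 0.37).astype(int)
h_priv = exp_mech_learner(cover, X, y, alpha=1.0, rng=rng)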
