The Power of Comparisons for Actively Learning Linear Classifiers
Authors: Max Hopkins, Daniel M. Kane, Shachar Lovett
Max Hopkins∗, Daniel Kane†, Shachar Lovett‡

June 2, 2020

Abstract

In the world of big data, large but costly-to-label datasets dominate many fields. Active learning, a semi-supervised alternative to the standard PAC-learning model, was introduced to explore whether adaptive labeling could learn concepts with exponentially fewer labeled samples. While previous results show that active learning performs no better than its supervised alternative for important concept classes such as linear separators, we show that by adding weak distributional assumptions and allowing comparison queries, active learning requires exponentially fewer samples. Further, we show that these results hold as well for a stronger model of learning called Reliable and Probably Useful (RPU) learning. In this model, our learner is not allowed to make mistakes, but may instead answer "I don't know." While previous negative results showed this model to have intractably large sample complexity for label queries, we show that comparison queries make RPU-learning at worst logarithmically more expensive in both the passive and active regimes.

∗ Department of Computer Science and Engineering, UCSD, California, CA 92092. Email: nmhopkin@eng.ucsd.edu. Supported by NSF Award DGE-1650112.
† Department of Computer Science and Engineering / Department of Mathematics, UCSD, California, CA 92092. Email: dakane@eng.ucsd.edu. Supported by NSF CAREER Award ID 1553288 and a Sloan fellowship.
‡ Department of Computer Science and Engineering, UCSD, California, CA 92092. Email: slovett@cs.ucsd.edu. Supported by NSF CAREER award 1350481, CCF award 1614023 and a Sloan fellowship.

1 Introduction

In recent years, the availability of big data and the high cost of labeling has led to a surge of interest in active learning, an adaptive, semi-supervised learning paradigm.
In traditional active learning, given an instance space X, a distribution D on X, and a class of concepts c : X → {0, 1}, the learner receives unlabeled samples x from D with the ability to query an oracle for the label c(x). Classically, our goal would be to minimize the number of samples the learner draws before approximately learning the concept class with high probability (PAC-learning). Instead, active learning assumes unlabeled samples are inexpensive, and rather aims to minimize expensive queries to the oracle. While active learning requires exponentially fewer labeled samples than PAC-learning for simple classes such as thresholds in one dimension, it fails to provide asymptotic improvement for classes essential to machine learning such as linear separators [1]. However, recent results point to the fact that with slight relaxations or additions to the paradigm, such concept classes can be learned with exponentially fewer queries. In 2013, Balcan and Long [2] proved that this was the case for homogeneous (through the origin) linear separators, as long as the distribution over the instance space X = R^d was (nearly isotropic) log-concave, a wide range of distributions generalizing common cases such as Gaussians or uniform distributions over convex sets. Later, Balcan and Zhang [3] extended this to isotropic s-concave distributions, a diverse generalization of log-concavity including fat-tailed distributions. Similarly, El-Yaniv and Wiener [4] proved that non-homogeneous linear separators can be learned with exponentially fewer queries over Gaussian distributions with respect to the accuracy parameter, but require exponentially more queries in the dimension of the instance space X, making their algorithm intractable in high dimensions.
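To make the exponential gap for one-dimensional thresholds concrete, the following minimal Python sketch (our own illustration, not a construction from the paper) actively learns a threshold on [0, 1] with O(log |pool|) label queries via binary search over a sorted pool, whereas a passive learner needs Ω(1/ε) labeled samples to reach error ε:

```python
import random

def label_oracle(threshold):
    # Label query for the hidden threshold concept c(x) = 1[x >= threshold]
    return lambda x: 1 if x >= threshold else 0

def active_threshold_learner(pool, oracle):
    """Binary search the sorted pool for the label boundary: O(log |pool|) queries."""
    pool = sorted(pool)
    lo, hi = -1, len(pool)  # invariant: pool[:lo+1] labeled 0, pool[hi:] labeled 1
    queries = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(pool[mid]) == 1:
            hi = mid
        else:
            lo = mid
    # any threshold between pool[lo] and pool[hi] is consistent with all labels
    guess = pool[hi] if hi < len(pool) else 1.0
    return guess, queries

random.seed(0)
c = 0.42                                        # hidden threshold
pool = [random.random() for _ in range(1 << 14)]  # unlabeled pool of size ~1/ε
guess, q = active_threshold_learner(pool, label_oracle(c))
```

Here a pool of 2^14 unlabeled samples is resolved with at most 15 label queries, while the residual error is on the order of the gap between consecutive pool points near the threshold.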
Kane, Lovett, Moran, and Zhang (KLMZ) [5] proved that the non-homogeneity barrier could be broken for general distributions in two dimensions by empowering the oracle to compare points rather than just label them. Queries of this type are called comparison queries, and are notable not only for their increase in computational power, but for their real-world applications such as in recommender systems [6] or for increasing accuracy over purely label-based techniques [7]. Our work adopts a mixture of the approaches of Balcan, Long, Zhang, and KLMZ. By leveraging comparison queries, we show that non-homogeneous linear separators may be learned in exponentially fewer samples as long as the distribution satisfies weak concentration and anti-concentration bounds, conditions realized by, for instance, (not necessarily isotropic) s-concave distributions. Further, by introducing a new average-case complexity measure, average inference dimension, that extends KLMZ's techniques to the distribution-dependent setting, we prove that comparisons provide significantly stronger learning guarantees than the PAC-learning paradigm.

In the late 80's, Rivest and Sloan [8] proposed a competing model to PAC-learning called Reliable and Probably Useful (RPU) learning. This model, a learning-theoretic formalization of selective classification introduced by Chow [9] over two decades prior, does not allow the learner to make mistakes, but instead allows the answer "I don't know," written as "⊥". Here, error is measured not by the amount of misclassified examples, but by the measure of examples on which our learner returns ⊥. RPU-learning was for the most part abandoned by the early 90's in favor of PAC-learning, as Kivinen [10–12] proved that RPU-learning simple concept classes such as rectangles requires an exponential number of samples even under the uniform distribution.
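To make the RPU model concrete before continuing, here is a toy reliable learner for one-dimensional thresholds (our own illustration, not from the paper): it answers only where every threshold consistent with its sample agrees, returns "⊥" on the uncertainty gap, and its error is exactly the probability mass of that gap.

```python
import random

def rpu_threshold_learner(labeled_sample):
    """Reliable learner for thresholds on [0, 1]: never wrong, may say '⊥'.
    Every threshold consistent with the sample lies in the gap between the
    largest 0-labeled point and the smallest 1-labeled point, so predictions
    outside the gap are forced, and inside it the learner abstains."""
    zeros = [x for x, y in labeled_sample if y == 0]
    ones = [x for x, y in labeled_sample if y == 1]
    lo = max(zeros, default=0.0)   # everything <= lo is certainly labeled 0
    hi = min(ones, default=1.0)    # everything >= hi is certainly labeled 1
    def predict(x):
        if x <= lo:
            return 0
        if x >= hi:
            return 1
        return "⊥"
    return predict, hi - lo        # error = mass of the '⊥' gap under uniform D

random.seed(1)
c = 0.6
sample = [(x, int(x >= c)) for x in (random.random() for _ in range(1000))]
predict, err = rpu_threshold_learner(sample)
```

Every answer other than "⊥" is provably correct, and with n uniform samples the abstention mass shrinks like O(1/n); Kivinen's negative results show that no such cheap guarantee survives already for rectangles in two dimensions.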
However, the model was recently re-introduced by El-Yaniv and Wiener [4], who termed it perfect selective classification. El-Yaniv and Wiener prove a connection between active and RPU-learning similar to the strategy employed by KLMZ [5] (who refer to RPU-learners as "confident" learners). We extend the lower bound of El-Yaniv and Wiener to prove that actively RPU-learning linear separators with only labels is exponentially difficult in dimension even for nice distributions. On the other hand, through a structural analysis of average inference dimension, we prove that comparison queries allow RPU-learning with nearly matching sample and query complexity to PAC-learning, as long as the underlying distribution is sufficiently nice. This last result has already found further use by Hopkins, Kane, Lovett, and Mahajan [13], who use our analysis of average inference dimension to extend their comparison-based algorithms for robustly learning non-homogeneous hyperplanes to higher dimensions.

1.1 Background and Related Work

1.1.1 PAC-learning

Probably Approximately Correct (PAC)-learning is a framework for learning classifiers over an instance space introduced by Valiant [14] with aid from Vapnik and Chervonenkis [15]. Given an instance space X, label space Y, and a concept class C of concepts c : X → Y, PAC-learning proceeds as follows. First, an adversary chooses a hidden distribution D over X and a hidden classifier c ∈ C. The learner then draws labeled samples from D, and outputs a concept c′ which it thinks is close to c with respect to D. Formally, we define closeness of c and c′ as the error:

err_D(c′, c) = Pr_{x∼D}[c′(x) ≠ c(x)].

We say the pair (X, C) is PAC-learnable if there exists a learner A which, using only n(ε, δ) = Poly(1/ε, 1/δ) samples¹, for all ε, δ picks a classifier c′ that with probability 1 − δ has at most ε error from c. Formally, ∃ A s.t.
∀ c ∈ C, ∀ D, Pr_{S∼D^{n(ε,δ)}}[err_D(A(S), c) < ε] ≥ 1 − δ.

The goal of PAC-learning is to compute the sample complexity n(ε, δ) and thereby prove whether certain pairs (X, C) are efficiently learnable. In this paper, we will be concerned with the case of binary classification, where Y = {0, 1}. In addition, in the case that C is linear separators, we instead write our concept classes as the sign of a family H of functions h : X → R. Instead of (X, C), we write the hypothesis class (X, H), and each h ∈ H defines a concept c_h(x) = sgn(h(x)). The sample complexity of PAC-learning is characterized by the VC dimension [16–18] of (X, H), which we denote by k, and is given by:

n(ε, δ) = Θ((k + log(1/δ))/ε).

1.1.2 RPU-learning

Reliable and Probably Useful (RPU)-learning is a stronger variant of PAC-learning introduced by Rivest and Sloan [8], in which the learner is reliable: it is not allowed to make errors, but may instead say "I don't know" (or for shorthand, "⊥"). Since it is easy to make a reliable learner by simply always outputting "⊥", our learner must be useful, and with high probability cannot output "⊥" more than a small fraction of the time. Let A be a reliable learner; we define the error of A on a sample S with respect to D, c to be

err_D(A(S), c) = Pr_{x∼D}[A(S)(x) = ⊥].

We call 1 − err_D(A(S), c) the coverage of the learner A, denoted C_D(A(S)), or just C(A) when clear from context. Finally, we say the pair (X, C) is RPU-learnable if ∀ ε, δ, there exists a reliable learner A which in n(ε, δ) = Poly(1/ε, 1/δ) samples has error ≤ ε with probability ≥ 1 − δ:

∃ A s.t. ∀ c ∈ C, ∀ D, Pr_{S∼D^{n(ε,δ)}}[err_D(A(S), c) ≤ ε] ≥ 1 − δ.

RPU-learning is characterized by the VC dimension of certain intersections of concepts [11]. Unfortunately, many simple cases turn out to be not RPU-learnable (e.g.
rectangles in [0, 1]^d, d ≥ 2 [10]), with even relaxations having exponential sample complexity [12].

1.1.3 Passive vs Active Learning

PAC and RPU-learning traditionally refer to supervised learning, where the learning algorithm receives pre-labeled samples. We call this paradigm passive learning. In contrast, active learning refers to the case where the learner receives unlabeled samples and may adaptively query a labeling oracle. Similar to the passive case, for active learning we study the query complexity q(ε, δ), the minimum number of queries to learn some pair (X, C) in either the PAC or RPU learning models. The hope is that by adaptively choosing when to query the oracle, the learner may only need to query a number of samples logarithmic in the sample complexity.

We will discuss two paradigms of active learning: pool-based active learning, and membership query synthesis (MQS) [19, 20]. In the former, the learner has access to a pool of unlabeled data and may request that the oracle label any point. This model matches real-world scenarios where learners have access to large, unlabeled datasets, but labeling is too expensive to use passive learning (e.g. medical imagery). Membership query synthesis allows the learner to synthesize points in the instance space and query their labels. This model is the logical extreme of the pool-based model where our pool is the entire instance space. Because we will be considering learning with a fixed distribution, we will slightly modify MQS: the learner may only query points in the support of the distribution². This is the natural specification for distribution-dependent learning, as it still models the case where our pool is as large as possible.

¹ Formally, n(ε, δ) must also be polynomial in a number of parameters of C.
1.1.4 The Distribution-Dependent Case

While PAC and RPU-learning were traditionally studied in the worst-case scenario over distributions, data in the real world is often drawn from distributions with nice properties such as concentration and anti-concentration bounds. As such, there has been a wealth of research into distribution-dependent PAC-learning, where the model has been relaxed only in that some distributional conditions are known. Distribution-dependent learning has been studied in both the passive and the active case [2, 21–23]. Most closely related to our work, Balcan and Long [2] proved new upper bounds on active and passive learning of homogeneous (through the origin) linear separators in 0-centered log-concave distributions. Later, Balcan and Zhang [3] extended this to isotropic s-concave distributions. We directly extend the original algorithm of Balcan and Long to non-homogeneous linear separators via the inclusion of comparison queries, and leverage the concentration results of Balcan and Zhang to provide an inference-based algorithm for learning under general s-concave distributions.

1.1.5 The Point Location Problem

Our results on RPU-learning imply the existence of simple linear decision trees (LDTs) for an important problem in computer science and computational geometry known as the point location problem. Given a set of n hyperplanes in d dimensions, called a hyperplane arrangement of size n and denoted by H = {h_1, ..., h_n}, it is a classic result that H partitions R^d into O(n^d) cells. The point location problem is as follows:

Definition 1.1 (Point Location Problem). Given a hyperplane arrangement H = {h_1, ..., h_n} and a point x, both in R^d, determine in which cell of H x lies.

Instances of this problem show up throughout computer science, such as in k-SUM, subset-sum, knapsack, or any variety of other problems [24].
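Concretely, a cell of an arrangement is identified by the vector of signs of x against each hyperplane. The following Python sketch (our own illustration, with a hypothetical sign_vector helper) solves point location by brute force with one sign query per hyperplane, the trivial depth-n strategy that the decision trees discussed below improve upon:

```python
def sign(v):
    # Sign of a real number as -1, 0, or +1
    return (v > 0) - (v < 0)

def sign_vector(hyperplanes, x):
    """Locate x in the arrangement: the cell of x is exactly the set of points
    sharing the sign of <a, x> - b for every hyperplane (a, b). One sign
    query per hyperplane gives a trivial depth-n linear decision tree."""
    return tuple(sign(sum(ai * xi for ai, xi in zip(a, x)) - b)
                 for (a, b) in hyperplanes)

# A small arrangement of three lines in R^2, written as (normal, offset) pairs:
H = [((1.0, 0.0), 0.5), ((0.0, 1.0), 0.5), ((1.0, 1.0), 2.0)]
assert sign_vector(H, (0.6, 0.7)) == sign_vector(H, (0.7, 0.6))  # same cell
assert sign_vector(H, (0.6, 0.7)) != sign_vector(H, (0.4, 0.7))  # different cell
```

The results below show that far fewer queries than n suffice: with comparison-type queries, the depth can be brought down to polylogarithmic in n under suitable structural or distributional assumptions.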
The best known depth for a linear decision tree solving the point location problem is from a recent work of Hopkins, Kane, Lovett, and Mahajan [25], who proved the existence of a nearly optimal Õ(d log(n))-depth LDT for arbitrary H and x. The caveat of this work is that the LDT may use arbitrary linear queries, which may be too powerful a model in practice. Kane, Lovett, and Moran [26] offer an O(d⁴ log(d) log(n))-depth LDT restricting the model to generalized comparison queries, queries of the form sgn(a⟨h_1, x⟩ − b⟨h_2, x⟩) for a point x and hyperplanes h_1, h_2. These queries are nice as they preserve structural properties of the input H such as sparsity, but they still suffer from over-complication: any H allows an infinite set of queries. KLMZ's [5] original work on inference dimension showed that in the worst case, the depth of a comparison LDT for point location is Ω(n). However, by restricting H to have good margin or bounded bit complexity, they build a comparison LDT of depth Õ(d log(d) log(n)), which comes with the advantage of drawing from a finite set of queries for a given problem instance. Our work provides another result of this flavor: we will prove that if H is drawn from a distribution with weak restrictions, for large enough n there exists a comparison LDT with expected depth Õ(d log²(d) log²(n)).

² We note that in this version of the model, the learner must know the support of the distribution. Since we only use the model for lower bounds, we lose no generality by making this assumption.

1.2 Our Results

1.2.1 Notation

We begin by introducing notation for our learning models. For a distribution D, an instance space X ⊆ R^d, and a hypothesis class H : X → R, we write the triple (D, X, H) to denote the problem of learning a hypothesis h ∈ H with respect to D over X.
When D is the uniform distribution over S ⊆ X, we will write (S, X, H) for convenience. We will further denote by B_d the unit ball in d dimensions, and by H_d hyperplanes in d dimensions. Given h ∈ H and a point x ∈ X, a label query determines sign(h(x)); given x, x′ ∈ X, a comparison query determines sign(h(x) − h(x′)). In addition, we will separate our models of learnability into combinations of three classes Q, R, and S, where Q ∈ {Label, Comparison}, R ∈ {Passive, Pool, MQS}, and S ∈ {PAC, RPU}. Informally, we say an element of Q defines our query type, an element of R our learning regime, and an element of S our learning model. Learnability of a triple is then defined by the combination of any choice of query, regime, and model, which we term the Q-R-S learnability of (D, X, H). Note that in Comparison-learning we have both a labeling and a comparison oracle.

Finally, we will discuss a number of different measures of complexity for Q-R-S learning triples. For passive learning, we will focus on the sample complexity n(ε, δ). For active learning, we will focus on the query complexity q(ε, δ). In both cases, we will often drop δ and instead give bounds on the expected sample/query complexity for error ε, denoted E[n(ε)] (or E[q(ε)] respectively), the expected number of samples/queries needed to reach ε error. A bound for probability 1 − δ then follows with log(1/δ) repetitions by Chernoff. In the case of a finite instance space X of size n, we denote the expected query complexity of perfectly learning X as E[q(n)]. As a final note, we will at times use a subscript d in our asymptotic notation to suppress factors only dependent on dimension.

1.2.2 PAC-Learning

To show the power of active learning with comparison queries in the PAC-learning model, we will begin by proving lower bounds.
In particular, we show that neither active learning nor comparison queries alone provides a significant speed-up over passive learning. In order to do this, we will assume the stronger MQS model, as lower bounds here transfer over to the pool-based regime.

Proposition 1.2. For small enough ε, and δ = 1/2, the query complexity of Label-MQS-PAC learning (B_d, R^d, H_d) is:

q(ε, 1/2) = Ω_d((1/ε)^{(d−1)/(d+1)}).

Thus without enriched queries, active learning fails to significantly improve over passive learning even over a nice distribution. Likewise, adding comparison queries alone also provides little improvement.

Proposition 1.3. For small enough ε, and δ = 3/8, the sample complexity of Comparison-Passive-PAC learning (B_d, R^d, H_d) is:

n(ε, 3/8) = Ω(1/ε).

Now we can compare the query complexity of active learning with comparisons to the above. For our upper bound, we will assume the pool-based model with a Poly(1/ε, log(1/δ)) pool size, as upper bounds here transfer to the MQS model. Our algorithm for Comparison-Pool-PAC learning combines a modification of Balcan and Long's [2] learning algorithm with noisy thresholding to provide an exponential speed-up for non-homogeneous linear separators.

Theorem 1.4. Let D be a log-concave distribution over R^d. Then the query complexity of Comparison-Pool-PAC learning (D, R^d, H_d) is

q(ε, δ) = Õ((d + log(1/δ)) log(1/ε)).

Kulkarni, Mitter, and Tsitsiklis [27] (combined with an observation from [2]) also give a lower bound of d log(1/ε) for log-concave distributions for arbitrary binary queries (and thus for our setting), so Theorem 1.4 is near tight in dimension and error. It should be noted, however, that to cover non-isotropic distributions, Theorem 1.4 must know the exact distribution D. This restriction becomes unnecessary if the distribution is promised to be isotropic.
1.3 RPU-Learning

In the RPU-learning model, we will first confirm that passive learning with label queries is information-theoretically intractable, and continue to show that active learning alone provides little improvement. Unlike in PAC-learning, however, we will show that comparisons in this regime provide a significant improvement in not only active, but also passive learning.

Proposition 1.5. The expected sample complexity of Label-Passive-RPU learning (B_d, R^d, H_d) is:

E[n(ε)] = Θ̃_d((1/ε)^{(d+1)/2}).

Thus we see that RPU-learning linear separators is intractable for large dimension. Further, active learning with label queries is of the same order of magnitude.

Proposition 1.6. For all δ < 1, the query complexity of Label-MQS-RPU learning (B_d, R^d, H_d) is:

q(ε, δ) = Ω_d((1/ε)^{(d−1)/2}).

These two bounds are a generalization of the technique employed by El-Yaniv and Wiener [4] to prove lower bounds for a specific algorithm, and apply to any learner. We further show that this bound is tight up to a logarithmic factor. For passive RPU-learning with comparison queries, we will simply inherit the lower bound from the PAC model (Proposition 1.3).

Corollary 1.7. For small enough ε, and δ = 3/8, any algorithm that Comparison-Passive-RPU learns (B_d, R^d, H_d) must use at least n(ε, 3/8) = Ω(1/ε) samples.

Note that unlike for label queries, this lower bound is not exponential in dimension. In fact, we will show that this bound is tight up to a linear factor in dimension, and further that employing comparison queries in general shifts the RPU model from being intractable to losing only a logarithmic factor over PAC-learning in both the passive and active regimes. We need one definition: two distributions D, D′ over R^d are affinely equivalent if there is an invertible affine map f : R^d → R^d such that D(x) = D′(f(x)).

Theorem 1.8.
Let D be a distribution over R^d that is affinely equivalent to a distribution D′ over R^d for which the following holds:

1. ∀ α > 0, Pr_{x∼D′}[||x|| > dα] ≤ c_1/α

2. ∀ α > 0, ⟨v, ·⟩ + b ∈ H_d, Pr_{x∼D′}[|⟨x, v⟩ + b| ≤ α] ≤ c_2·α

The sample complexity of Comparison-Passive-RPU-learning (D, R^d, H_d) is:

n(ε, δ) = Õ(d log(1/δ) log(1/ε) / ε),

and the query complexity of Comparison-Pool-RPU learning (D, R^d, H_d) is:

q(ε, δ) = Õ(d log(1/δ) log²(1/ε)).

Note that the constants have logarithmic dependence on c_1 and c_2.

It is worth noting that Theorem 1.8 is also computationally efficient, running in time Poly(d, 1/ε, log(1/δ)). The distributional assumptions are satisfied by a wide range of distributions, including (based upon concentration results from [3]) the class of s-concave distributions for s ≥ −1/(2d+3); this removes the requirement of isotropy from the label-only learning algorithms for homogeneous hyperplanes of [3]. We view Theorem 1.8 and its surrounding context as this work's main technically novel contribution. In particular, to prove the result, we introduce a new average-case complexity measure called average inference dimension that extends the theory of inference dimension from [5] (see Section 1.4.2). In addition, by dint of this framework, our analysis implies the following result for the point location problem as well.

Theorem 1.9. Let D be a distribution satisfying the criteria of Theorem 1.8, x ∈ R^d, and h_1, ..., h_n ∼ D. Then for large enough n there exists an LDT using only label and comparison queries solving the point location problem with expected depth Õ(d log²(n)).

For ease of viewing, we summarize our main results on expected sample/query complexity in Tables 1 and 2 for the special case of the uniform distribution over the unit ball.
The only table entries not novel to this work are the Label-Passive-PAC bounds [21, 18], and the lower bound on Comparison-Pool/MQS-PAC learning [2, 27]. Note also that lower bounds for PAC learning carry over to RPU learning.

Table 1: Expected sample and query complexity for PAC learning (B_d, R^d, H_d).

PAC        | Passive               | Pool                      | MQS
Label      | Θ(d/ε) [21, 18]       | Ω_d((1/ε)^{(d−1)/(d+1)})  | Ω_d((1/ε)^{(d−1)/(d+1)})
Comparison | Ω(1/ε)                | Θ̃(d log(1/ε))            | Θ̃(d log(1/ε)) [2, 27]

Table 2: Expected sample and query complexity for RPU learning (B_d, R^d, H_d).

RPU        | Passive                | Pool                    | MQS
Label      | Θ̃_d((1/ε)^{(d+1)/2})  | Ω̃_d((1/ε)^{(d−1)/2})   | Ω̃_d((1/ε)^{(d−1)/2})
Comparison | Õ(d/ε)                 | Õ(d log²(1/ε))          | Õ(d log²(1/ε))

1.4 Our Techniques

1.4.1 Lower Bounds: Caps and Polytopes

Our lower bounds for both the PAC and RPU models rely mainly on high-dimensional geometry. For PAC-learning, we consider spherical caps, portions of B_d cut off by a hyperplane. Our two lower bounds, Label-MQS-PAC and Comparison-Passive-PAC, consider different aspects of these objects. The former (Proposition 1.2) employs a packing argument: if an adversary chooses a hyperplane uniformly among a set defining some packing of (sufficiently large) caps, the learner is forced to query a point in many of them in order to distinguish which is labeled negatively. The latter bound (Proposition 1.3) follows from an indistinguishability argument: if an adversary chooses just between one hyperplane defining some (sufficiently large) cap, and the corresponding parallel hyperplane tangent to B_d, the learner must draw a point near the cap before it can distinguish between the two.

For the RPU-learning model, our lower bounds rely on the average and worst-case complexity of polytopes. For Label-Passive-RPU learning (Proposition 1.5), we consider random polytopes, convex hulls of samples S ∼ D^n, whose complexity E(D, n) is the expected probability mass across samples of size n.
In this regime, we consider an adversary who, with high probability, picks a distribution in which almost all samples are entirely positive. As a result, the learner cannot infer any point outside of the convex hull of their sample, which bounds their expected coverage by E(D, n). For Label-MQS-RPU learning (Proposition 1.6), the argument is much the same, except we consider the maximum probability mass over polytopes rather than the expectation. These techniques are generalizations of the algorithm-specific lower bounds given by El-Yaniv and Wiener [4], who also consider random polytope complexity.

1.4.2 Upper Bounds: Average Inference Dimension

We focus in this section on techniques used to prove our RPU-learning upper bounds, which we consider our most technically novel contribution. To prove our Comparison-Pool-RPU learning upper bound (and corresponding point location result), Theorems 1.8 and 1.9, we introduce a novel extension to the inference dimension framework of KLMZ [5]. Inference dimension is a combinatorial complexity measure that characterizes the distribution-independent query complexity of active learning with enriched queries. KLMZ show, for instance, that linear separators in R² may be Comparison-Pool-PAC learned in only Õ(log(1/ε)) queries, but require Ω(1/ε) queries in 3 or more dimensions.

Given a hypothesis class (X, H), and a set of binary queries Q (e.g. labels and comparisons), denote the answers to all queries on S ⊆ X by Q(S). Inference dimension examines the size of S necessary to infer another point x ∈ X, where S infers the point x under h, denoted S →_h x, if Q(S) under h determines the label of x. As an example, consider H to be linear separators in d dimensions, Q to be label queries, and our sample to be d + 1 positively labeled points under some classifier h in convex position.
Due to linearity, any point inside the convex hull of S is inferred by S under h. Then in greater detail, the inference dimension of (X, H) is the minimum k such that in any subset of X of size k, at least one point can be inferred from the rest:

Definition 1.10 (Inference Dimension [5]). The inference dimension of (X, H) with query set Q is the smallest k such that for any subset S ⊂ X of size k, ∀ h ∈ H, ∃ x ∈ S s.t. Q(S − {x}) infers x under h.

KLMZ show that finite inference dimension implies distribution-independent query complexity that is logarithmic in the sample complexity. On the other hand, they prove a lower bound showing that PAC learning classes with infinite inference dimension requires at least Ω(1/ε) queries. To overcome this lower bound (which holds for linear separators in three or more dimensions), we introduce a distribution-dependent version of inference dimension which examines the probability that a sample contains no point which can be inferred from the rest.

Definition 1.11 (Average Inference Dimension). We say (D, X, H) has average inference dimension g(n) if:

∀ h ∈ H, Pr_{S∼D^n}[∄ x ∈ S s.t. S − {x} →_h x] ≤ g(n).

Theorems 1.8 and 1.9 follow from the fact that small average inference dimension implies that finite samples will have low inference dimension with good probability (Observation 4.6). Our main technical contribution lies in proving a structural result (Theorem 4.10): that the average inference dimension of (D, R^d, H_d) with respect to comparison queries is superexponentially small, 2^{−Ω_d(n²)}, as long as D satisfies the weak distributional requirements outlined in Theorem 1.8.

2 Background: Inference Dimension

Before proving our main results, we detail some additional background in inference dimension that is necessary for our RPU-learning techniques. First, we review the inference-dimension based upper and lower bounds of KLMZ [5].
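The convex-position example above is easy to check numerically. The following Python sketch (our own toy illustration, not from [5]) verifies in R² that every affine classifier labeling the three vertices of a triangle positive must also label the centroid positive, i.e. that S infers the hull point under label queries alone:

```python
import random

# S: three positively labeled points in convex position in R^2;
# x: a point in their convex hull (here, the centroid).
S = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
x = (1 / 3, 1 / 3)

def labels(h, pts):
    # Label query under the affine classifier h = (w1, w2, b): sign of h(p)
    w1, w2, b = h
    return [w1 * p[0] + w2 * p[1] + b >= 0 for p in pts]

# Since x is the centroid, h(x) = (h(s1) + h(s2) + h(s3)) / 3 for any affine h,
# so h(x) >= 0 whenever all three vertices are labeled positive.
random.seed(0)
for _ in range(10000):
    h = tuple(random.uniform(-1, 1) for _ in range(3))
    if all(labels(h, S)):               # h consistent with S being all-positive
        assert labels(h, [x]) == [True]  # ...forces the hull point positive
```

No classifier consistent with the sample can disagree on x, which is exactly the inference relation S →_h x used throughout this section.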
Let f_Q(k) be the number of oracle queries required to answer all queries on a sample of size k in the worst case (e.g. f_Q(k) = O(k log(k)) for comparison queries, via sorting). Finite inference dimension implies the following upper bound:

Theorem 2.1 ([5]). Let k denote the inference dimension of (X, H) with query set Q. Then the expected query complexity of (X, H) for |X| = n is:

E[q(n)] ≤ 2 f_Q(4k) log(n).

Further, infinite inference dimension provides a lower bound:

Theorem 2.2 ([5]). Assume that the inference dimension of (X, H) with query set Q is > k. Then for ε = 1/k, δ = 1/6, the query complexity of Q-Pool-PAC learning (X, H) is:

q(ε, 1/6) = Ω(1/ε).

As the name would suggest, the upper bound derived via inference dimension is based upon a reliable learner that infers a large number of points given a small sample. While not explicitly stated in [5], it follows from the same argument that finite inference dimension gives an upper bound on the sample complexity of RPU-learning:

Corollary 2.3. Let k denote the inference dimension of (X, H) with query set Q. Then the expected sample complexity of Q-Passive-RPU learning (X, H) is:

E[n(ε)] = O(k/ε).

3 PAC Learning with Comparison Queries

In this section we study PAC learning with comparison queries in both the passive and active cases.

3.1 Lower Bounds

To begin, we prove that over a uniform distribution on a unit ball, learning linear separators with only label queries is hard.

Proposition 3.1 (Restatement of Proposition 1.2). For small enough ε, and δ = 1/2, the query complexity of Label-MQS-PAC learning (B_d, R^d, H_d) is:

q(ε, 1/2) = Ω_d((1/ε)^{(d−1)/(d+1)}).

Proof. This follows from a packing argument. It is well known (see e.g.
[28]) that for small enough ε, it is possible to pack Ω_d((1/ε)^{(d−1)/(d+1)}) disjoint spherical caps (intersections of halfspaces with B_d) of volume 2ε onto B_d. By Yao's Minimax Theorem [29], we may consider a randomized strategy from the adversary such that any deterministic strategy of the learner will fail with constant probability. Consider an adversary which picks one of the disjoint spherical caps to be negative uniformly at random. If the learner queries only O_d((1/ε)^{(d−1)/(d+1)}) points, for a small enough constant and ε, any strategy will uncover the negative cap with at most some constant, say less than 25%, probability. Since for small enough ε there will be at least three remaining caps in which the learner never queried a point, the probability that the learner outputs the correct negative cap (which is necessary to learn up to error ε) is at most 1/3 due to the uniform distribution of the negative cap. Thus altogether the learner will fail with probability at least 1/2.

To show that our exponential improvement comes from the use of comparisons in combination with active learning, we will prove that using comparisons coupled with passive learning provides no improvement.

Proposition 3.2 (Restatement of Proposition 1.3). For small enough ε, and δ = 3/8, any algorithm that passively learns (B_d, R^d, H_d) with comparison queries must use at least n(ε, δ) = Ω(1/ε) samples.

Proof. Let h_{2ε} be a hyperplane which forms a spherical cap c of measure 2ε, and h be the parallel hyperplane tangent to this cap. By Yao's Minimax Theorem [29], we consider an adversary which chooses uniformly between h_{2ε} and h. Given k uniform samples from B_d, the probability that at least one point lands inside the cap c is ≤ 2kε. Let k = o(1/ε); then for small enough ε, this probability is ≤ 1/4. Say no sample lands in c; then h_{2ε} and h are completely indistinguishable by label or comparison queries.
Any hypothesis chosen by the learner must label at least half of c positive or negative, and will thus have error ≥ ε with respect to either h_{2ε} or h. Since the distribution over these hyperplanes is uniform, the learner fails with probability at least 50%. Thus in total the probability that the learner fails is at least 3/4 · 1/2 = 3/8.

Together, these lower bounds show it is only the combination of active learning and comparison queries which provides an exponential improvement.

3.2 Upper Bounds

For completeness, we begin by showing that Proposition 1.2 is tight for d = 2 before moving to our main result for the section.

Proposition 3.3. The query complexity of Label-MQS-PAC learning (B_2, R^2, H_2) is:

q(ε, 0) = O((1/ε)^{1/3}).

Proof. To begin, we show that selecting k = O((1/ε)^{1/3}) points along the boundary of B_2 in a regular fashion (such that their convex hull is the regular k-sided polygon) suffices if all such points have the same label. This follows from the fact that each cap created by the polygon has area, and thus probability mass,

Area(Cap) = 1/2 (2π/k − sin(2π/k)).

Taylor approximating sine shows that picking k = O((1/ε)^{1/3}) gives Area(Cap) < ε. If all k points are of the same sign (say 1), the separating hyperplane can cut through at most one such cap, so labeling the entire disk 1 incurs error less than ε. Thus we have reduced to the case where there are one or more points of differing signs. In this scenario, there are exactly two edges whose endpoints have different signs, which means the hyperplane passes through both edges. Next, on each of the two caps associated with these edges, we query O(log(1/ε)) points in order to find the crossing point of the hyperplane via binary search, up to an accuracy of ε/2. This reduces the area of unknown labels to the strip connecting these two < ε/2 arcs, which has < ε probability mass.
Picking any consistent hyperplane then finishes the proof.

Now we will show that active learning with comparison queries in the PAC-learning model exponentially improves over the passive and label-only regimes. Our work is closely related to the algorithm of Balcan and Long [2], and relies on using comparison queries to reduce to a combination of their algorithm and thresholding.

Our bounds will concern a general family of distributions, the isotropic (0-centered, identity-variance) log-concave distributions: distributions whose density function f may be written as e^{g(x)} for some concave function g. Log-concavity captures many natural distributions, such as Gaussians and uniform distributions over convex sets. To begin, we will need a few statements regarding isotropic log-concave distributions proved initially by Lovasz and Vempala [30], and Klivans, Long, and Tang [31] (here we include additional facts we require for RPU-learning later on).

Fact 3.4 ([30, 31]). Let D be an arbitrary log-concave distribution in R^d with probability density function f, and u, v normal vectors of homogeneous hyperplanes. The following statements hold, where 3, 4, 5, and 6 assume D is isotropic:

1. D − D, the difference of i.i.d. pairs, is log-concave.
2. D may be affinely transformed to an isotropic distribution Iso(D).
3. There exists a universal constant c s.t. the angle between any u and v, denoted θ(u, v), satisfies c·θ(u, v) ≤ Pr_{x∼D}[sgn(⟨x, v⟩) ≠ sgn(⟨x, u⟩)].
4. ∀a > 0, Pr_{x∼D}[‖x‖ ≥ a] ≤ e^{−a/√(d+1)}.
5. All marginals of D are isotropic log-concave.
6. If d = 1, Pr_{x∼D}[x ∈ [a, b]] ≤ |b − a|.

We will additionally need Balcan and Long's [2] query-optimal algorithm for Label-Pool-PAC learning homogeneous hyperplanes (3).

Theorem 3.5 (Theorem 5 [2]). Let D be a log-concave distribution over R^d. The query complexity of Label-Pool-PAC learning (D, R^d, H'_d) is

q(ε, δ) = O((d + log(1/δ) + log log(1/ε)) log(1/ε)),

where H'_d is the class of homogeneous hyperplanes.

Using these facts, we will give an upper bound for the Pool-based model assuming a pool of Poly(1/ε, log(1/δ)) unlabeled samples. For a sketch of the algorithm, see Figure 1.

Algorithm 1: Comparison-Pool-PAC learn (D, R^d, H_d)
1  N = O(1/ε); shift_list = [];
2  normal_vector = B-L(Iso(D − D), O(ε/log(1/ε)), δ);
3  for i in range O(log(1/δ)) do
4      S ∼ D^N;
5      S = Project(S, normal_vector);
6      shift_list.add(Threshold(S));
7  end
8  Return h = ⟨normal_vector, ·⟩ + median(shift_list)

Figure 1: Algorithm for Comparison-Pool-PAC learning hyperplanes over a log-concave distribution D. Our algorithm references three sub-routines. The first, B-L(D, ε, δ), is the algorithm from Theorem 3.5. The second is Project(sample, vector), which simply projects each point in a sample onto the given vector. The third is Threshold(S), which labels the one-dimensional array S by binary search and outputs a (mostly) consistent threshold value.

Theorem 3.6 (Restatement of Theorem 1.4). Let D be a log-concave distribution over R^d. The query complexity of Comparison-Pool-PAC learning (D, R^d, H_d) is

q(ε, δ) = O((d + log(1/δ) + log log(1/ε)) log(1/ε)).

Proof. Recall that D may be affinely transformed into an isotropic distribution Iso(D). Further, we may simulate queries over Iso(D) by applying the same transformation to our samples, and after learning over Iso(D), we may transform our learner back to D. Thus learning Iso(D) is equivalent to learning D, and we will assume D is isotropic without loss of generality. Our algorithm will first learn a "homogenized" version of the hidden separator h = ⟨v, ·⟩ + b via Balcan and Long's algorithm, thereby reducing to thresholding.
(3) This work was later improved to be computationally efficient [32], but no longer achieved optimal query complexity.

Note that a comparison query on a pair of points x, y ∈ D is equivalent to a label query on the point x − y with respect to the homogeneous hyperplane with normal vector v:

h(x) − h(y) = (⟨v, x⟩ + b) − (⟨v, y⟩ + b) = ⟨v, x − y⟩.

We begin by drawing samples from the log-concave distribution D − D and then apply Balcan and Long's algorithm [2] to learn the homogenized version of h (that is, ⟨v, ·⟩) up to O(ε/log(1/ε)) error with probability 1 − δ, using only

O((d + log(1/δ) + log log(1/ε)) log(1/ε))

comparison queries. Further, since the constant c given in item 3 of Fact 3.4 is universal, any separator output by the algorithm has a normal vector u with angle θ(u, v) = O(ε/log(1/ε)).

Having learned an approximation to v, we turn our attention to approximating b. Consider the set of points on which u and v disagree, that is:

Dis = {x : sgn(⟨v, x⟩ + b) ≠ sgn(⟨u, x⟩ + b)}.

To find an approximation for b, we need to show that there will be correctly labeled points close to the threshold. To this end, let α = ε/8 and define b_α and b_{−α} such that:

D({y : α < ⟨u, y⟩ + b < b_α}) = α,
D({y : b_{−α} < ⟨u, y⟩ + b < −α}) = α.

We will show that, drawing a sample S of O(1/ε) points, the following three statements hold with probability at least 2/3:

1. ∃ x_1 ∈ S : α < ⟨u, x_1⟩ + b < b_α
2. ∃ x_2 ∈ S : b_{−α} < ⟨u, x_2⟩ + b < −α
3. ∀ x ∈ Dis ∩ S, |⟨u, x⟩ + b| < α

Since the measure of each of the regions defined in statements 1 and 2 is α, the probability that S does not have at least one point in both regions is ≤ 2(1 − α)^{|S|} ≤ 1/6 with an appropriate constant. To prove the third statement, assume for contradiction that there exists x ∈ Dis ∩ S such that |⟨u, x⟩ + b| ≥ α.
Because ⟨u, x⟩ + b and ⟨v, x⟩ + b differ in sign, this implies that |⟨u − v, x⟩| = |⟨u − v, x_{u,v}⟩| > α, where x_{u,v} is the projection of x onto the plane spanned by u and v. We can bound the probability of this event occurring by the concentration of isotropic log-concave distributions:

Pr[|⟨u − v, x_{u,v}⟩| > α] ≤ e^{−Ω(ε/‖u−v‖)}.   (1)

Because we have bounded the angle between u and v, with a large enough constant for θ we have:

‖u − v‖ ≤ O(ε/log(|S|)).

Then with a large enough constant for θ, union bounding over Dis ∩ S gives that the third statement fails with probability at most 1/6.

We have proved that with probability 2/3, statements 1, 2, and 3 hold. Further, if these statements hold, any hyperplane ⟨u, ·⟩ + b′ we pick consistent with thresholding disagrees with ⟨u, ·⟩ + b on at most ε/4 probability mass, due to the anti-concentration of isotropic log-concave distributions and the definition of b_{±α}.

Further, repeating this process O(log(1/δ)) times and taking the median shift value b′ gives the same statement with probability at least 1 − δ by a Chernoff bound. Note that the number of queries made in this step is dominated by the number of queries needed to learn u.

Finally, we analyze the error of our proposed hyperplane ⟨u, ·⟩ + b′. We have already proved that the error between this and ⟨u, ·⟩ + b is ≤ ε/4 with probability at least 1 − δ, so it is enough to show that D(Dis) ≤ 3ε/4. This follows similarly to statement 3 above. The portion of Dis satisfying |⟨u, x⟩ + b| ≤ α has probability mass at most ε/4 by anti-concentration. With a large enough constant for θ, the remainder of Dis has mass at most ε/2 by (1). Then in total, with probability 1 − 2δ, ⟨u, ·⟩ + b′ has error at most ε.
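For intuition, the thresholding stage of Algorithm 1 can be sketched in a few lines. This is a toy illustration, not the paper's implementation: we work in R², hand the learner the true direction v in place of the Balcan-Long subroutine, and all names and constants here are ours.

```python
import random

random.seed(1)
N, ROUNDS = 4000, 9
v, b = (0.6, 0.8), 0.3                   # true separator h(x) = <v, x> + b
label = lambda x: 1 if x[0]*v[0] + x[1]*v[1] + b >= 0 else -1

def threshold(sample, u):
    """Binary-search a consistent shift along direction u, using
    O(log N) label queries on the projected, sorted sample."""
    proj = sorted(sample, key=lambda x: x[0]*u[0] + x[1]*u[1])
    lo, hi = 0, len(proj) - 1
    while lo < hi:                       # labels are monotone along u
        mid = (lo + hi) // 2
        if label(proj[mid]) == -1:
            lo = mid + 1
        else:
            hi = mid
    first_pos = proj[lo]                 # first positive point along u
    return -(first_pos[0]*u[0] + first_pos[1]*u[1])

shifts = []
for _ in range(ROUNDS):                  # repeat and take the median shift,
    S = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]
    shifts.append(threshold(S, v))       # as in lines 3-7 of Algorithm 1
b_hat = sorted(shifts)[ROUNDS // 2]
print(abs(b_hat - b))                    # small: b_hat approximates b
```

In the full algorithm the direction fed to `threshold` would instead be the approximate normal learned on D − D via comparison queries; the binary search then plays the role of the Threshold subroutine, spending only O(log N) label queries per round.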
Balcan and Long [2] provide a lower bound of Ω(d log(1/ε)) on the query complexity of learning over log-concave distributions with oracles for any binary query, so this algorithm is tight up to logarithmic factors.

4 RPU Learning with Comparison Queries

Kivinen [12] showed that RPU-learning is intractable for nice concept classes even under simple distributions when restricted to label queries. We will confirm that RPU-learning linear separators with only label queries is intractable in high dimensions, but can be made efficient in both the passive and active regimes via comparison queries.

4.1 Lower bounds

In the passive, label-only case, RPU-learning is lower bounded by the expected number of vertices on a random polytope drawn from our distribution D. For simple distributions such as uniform over the unit ball, this gives sample complexity which is exponential in dimension, making RPU-learning impractical for any sort of high-dimensional data.

Definition 4.1. Given a distribution D and parameter ε > 0, we denote by v_D(ε) the minimum size of a sample S drawn i.i.d. from D such that the expected measure of the convex hull of S, which we denote E(D, n) for |S| = n, is ≥ 1 − ε.

The quantity v_D(ε), which has been studied in computational geometry for decades [33, 34], lower bounds Label-Passive-RPU learning, and in some cases provides a matching upper bound up to log factors.

Proposition 4.2. Let D be any distribution on R^d. The expected sample complexity of Label-Passive-RPU learning (D, R^d, H_d) is:

n(ε, 1/3) = Ω(v_D(2ε)/log(1/ε)).

Proof. For any δ > 0 and sample size n, there exists some radius r_{δ,n} such that the probability that a sample S ∼ D^n contains any point outside the ball of radius r_{δ,n}, B_d(r_{δ,n}), is less than δ.
By Yao's Minimax Theorem [29], it is sufficient to consider an adversary who picks some hyperplane tangent to B_d(r_{δ,n}) with probability 1 − δ (labeling it entirely positive), and otherwise chooses a hyperplane uniformly from S^d × [−r_{δ,n}, r_{δ,n}]. Notice that if the adversary chooses the tangent hyperplane and the learner draws a sample S entirely within the ball, then for any point x outside the convex hull of S there exist hyperplanes within the support of the adversary's distribution that are consistent on S but differ on x.

Recall that v_D(ε) is the minimum sample size n for which 1 − E(D, n) ≤ ε. Consider drawing a sample S of size n = v_D(2ε) − 1. The expected measure E(D, n) then satisfies

E(D, n) < 1 − 2ε.

This in turn implies, by Markov's inequality, a bound on the probability that the measure of the convex hull of a given sample, which we denote V(S), is large:

Pr_{S∼D^n}[V(S) ≥ 1 − ε] ≤ (1 − 2ε)/(1 − ε) = 1 − ε/(1 − ε).

Now consider the following relation between samples of size n and ⌊n/k⌋, which follows by viewing our size-n sample as k distinct samples of size at least ⌊n/k⌋:

1 − Pr_{S∼D^{⌊n/k⌋}}[V(S) < 1 − ε]^k ≤ Pr_{S∼D^n}[V(S) ≥ 1 − ε].

Combining these results and letting k = log(1/ε):

Pr_{S∼D^{⌊n/k⌋}}[V(S) < 1 − ε] ≥ (ε/(1 − ε))^{1/k} ≥ 1/2.

To force any learner to fail on a sample, we need two conditions: first that the measure of the convex hull is < 1 − ε, and second that all points lie in B_d(r_{δ,n}). Since the latter occurs with probability at least 1 − 2δ, picking δ < 1/12 then gives the desired bound:

Pr_{S∼D^{⌊n/log(1/ε)⌋}}[(V(S) ≥ 1 − ε) ∨ (∃x ∈ S : x ∉ B_d(r_{1/12,n}))] ≤ 1/2 + 2δ < 2/3.

Further, for simple distributions such as uniform over a ball, this bound is tight up to a log² factor.

Proposition 4.3.
The sample complexity of Label-Passive-RPU learning (B_d, R^d, H_d) is:

E[n(ε)] = O(log(d/ε) v_{B_d}(ε/2)) = O(d log(1/ε) (1/ε)^{(d+1)/2}).

Proof. We will begin by computing v_{B_d}(ε) for a ball. The expected measure of the convex hull of a sample drawn uniformly from B_d is computed in [35], and given by

E(B_d, n) = 1 − c(d) n^{−2/(d+1)},

where c(d) is a constant depending only on dimension. Setting c(d) n^{−2/(d+1)} = ε then gives:

v_{B_d}(ε) = (c(d)/ε)^{(d+1)/2}.

Given a sample S of size O(log(1/δ) n), let S_p denote the subset of positively labeled points, and S_n the negatively labeled. We can infer at least the points inside the convex hulls of S_p and S_n. Our goal is to show that, with high probability, the measure of M = ConvHull(S_p) ∪ ConvHull(S_n) is ≥ 1 − ε. To show this, we will employ the fact [33] that the expected measure of the convex hull of a sample of size n drawn uniformly from any convex body K is lower bounded by:

E(K, n) ≥ 1 − c(d) n^{−2/(d+1)}.

Given this, let P, of measure p, be the set of positive points, and N the set of negative points, of measure 1 − p. Since we have drawn O(log(1/δ) n) points, with probability ≥ 1 − δ we will have at least pn points from P, and at least (1 − p)n points from N. Given this many points, the expected value of our inferred mass M is:

E[M] ≥ p E(P, pn) + (1 − p) E(N, (1 − p)n) = 1 − c(d)(p(pn)^{−2/(d+1)} + (1 − p)((1 − p)n)^{−2/(d+1)}).

This function is minimized at p = .5, and plugging in p = .5 and n = 2 v_{B_d}(ε/2) gives

E[M] ≥ 1 − ((2d − 1)/(2d)) ε.

However, since we have conditioned on enough points being drawn from P and N, we are not done.
This conditioning holds for at least a 1 − δ fraction of our samples, so if we pessimistically assume the inferred mass M is 0 on the remaining samples, our expected error (for a large enough constant on our number of samples) is at most:

1 − E[M] = (1 − δ)((2d − 1)/(2d)) ε + δ.

Setting δ = ε/(2d) is enough to drop the error below ε, and gives the number of samples as O(log(d/ε) v_{B_d}(ε/2)).

In the active regime, this sort of bound is complicated by the fact that we are less interested in the number of points drawn than in the number labeled. If we were restricted to only drawing E[n(ε)] points, we could repeat the same argument in combination with the expected number of vertices to get a bound. However, with a larger pool of allowed points, the pertinent question becomes the maximum rather than the expected measure of the convex hull. In cases such as the unit ball, these actually give about the same result.

Proposition 4.4 (Restatement of Proposition 1.6). For all δ < 1, the query complexity of Label-MQS-RPU learning (B_d, R^d, H_d) is:

q(ε, δ) = Ω(d (1/ε)^{(d−1)/2}).

Proof. The maximum volume of the convex hull of n points in B_d is [34]

max_{S, |S|=n} Vol(ConvHull(S)) = 1 − θ_d(n^{−2/(d−1)}).

Notice here the difference from the random case in the exponent, which comes from the fact that in the random case we are only counting the expected θ_d(n^{(d−1)/(d+1)}) vertices on the boundary of the hull of the sample. The lower bound is then implied by the same adversary strategy as in Proposition 4.2, since for small enough ε, the convex hull of any set of o(d (1/ε)^{(d−1)/2}) points has less than 1 − ε probability mass.

4.2 Upper bounds

Our positive results for comparison-based RPU-learning rely on weakening the concept of inference dimension to be distribution dependent. With this in mind, we introduce average inference dimension:

Definition 4.5 (Average Inference Dimension).
We say (D, X, H) has average inference dimension g(n) if:

∀h ∈ H, Pr_{S∼D^n}[∄x s.t. S − {x} →_h x] ≤ g(n).

In other words, the probability that we cannot infer a point from a randomly drawn sample of size n is bounded by its average inference dimension g(n). There is a simple average-case to worst-case reduction for average inference dimension via a union bound:

Observation 4.6. Let (D, X, H) have average inference dimension g(n), and S ∼ D^n. Then (S, H) has inference dimension k with probability:

Pr[inference dimension of (S, H) ≤ k] ≥ 1 − (n choose k) g(k).

Proof. The probability that a fixed subset S′ ⊆ S of size k does not have a point x s.t. S′ − {x} →_h x is at most g(k). Union bounding over all (n choose k) subsets gives the desired result.

This reduction allows us to apply inference dimension in both the active and passive distributional cases. This is due in part to the fact that the boosting algorithm proposed by KLMZ [5] is reliable even when given the wrong inference dimension as input--the algorithm simply loses its guarantee on query complexity. As a result, we may plug this reduction directly into their algorithm.

Corollary 4.7. Given a query set Q, let f_Q(n) be the number of queries required to answer all questions on a sample of size n. Let (D, X, H) have average inference dimension g(n). Then there exists an RPU-learner A with coverage

E[C(A)] ≥ max_{k≤n} (1 − (n choose k) g(k)) (n − k)/n

after drawing n points. Further, the expected query complexity of actively RPU-learning a finite sample S ∼ D^n is

E[q(n)] ≤ min_{k≤n} [2 f_Q(4k) log(n) (1 − g(k)(n choose k)) + n g(k)(n choose k)].

Proof. For the first fact, we will appeal to the symmetry argument of [5]. Consider a reliable learner A which takes in a sample S of size n − 1 and infers all possible points in D.
To compute coverage, we want to know the probability that a random point x ∼ D is inferred by A. Since S was randomly drawn from D, this is the same as computing the probability that some point in S ∪ {x} can be inferred from the rest. By Observation 4.6, the probability that S ∪ {x} has inference dimension at most k is at least 1 − (n choose k) g(k). Since x could equally well have been any point in S by symmetry, if S ∪ {x} has inference dimension k the coverage will be at least (n − k)/n [5]. Since this occurs with probability at least 1 − (n choose k) g(k) by Observation 4.6, the expected coverage of A is at least

E[C(A)] ≥ (1 − (n choose k) g(k)) (n − k)/n.

The second statement follows from a similar argument. If S has inference dimension k, then by Theorem 2.1 the expected query complexity is at most 2 f_Q(4k) log(n). For a given k, the expected query complexity is then bounded by:

E[q(n)] ≤ 2 f_Q(4k) log(n) Pr[S has inference dimension ≤ k] + n Pr[S has inference dimension > k].

Plugging in Observation 4.6 and minimizing over k then gives the desired result.

In fact, this lemma shows that RPU-learning (D, X, H) with inverse super-exponential average inference dimension loses only log factors over passive or active PAC-learning. Asking for such small average inference dimension may seem unreasonable, but something as simple as label queries on a uniform distribution over a convex set already has average inference dimension 2^{−Θ(n log(n))} with respect to linear separators [36].

Corollary 4.8. Given a query set Q, let f_Q(n) be the number of queries required to answer all questions on a sample of size n. For any α > 0, let (D, X, H) have average inference dimension g(n) ≤ 2^{−Ω(n^{1+α})}. Then the expected sample complexity of Q-Pool-RPU learning is:

E[n(ε)] = O(log(1/ε)^{1/α} / ε).
Further, the expected query complexity of actively learning a finite sample S ∼ D^n is:

E[q(n)] ≤ 2 f_Q(O(log^{1/α}(n))) log(n).

Proof. Both results follow from the fact that setting k to O(log(n)^{1/α}) gives

1 − (n choose k) g(k) = 1 − O(1/n).

For the sample complexity, it is then enough to plug this into Corollary 4.7 and let n be

n = O(log(1/ε)^{1/α} / ε).

Plugging this into the query complexity sets the latter term from Corollary 4.7 to at most 1, giving:

E[q(n)] ≤ 2 f_Q(O(log^{1/α}(n))) log(n).

We will show that by employing comparison queries we can improve the average inference dimension of linear separators from 2^{−Θ(n log(n))} to 2^{−Ω(n²)}, but first we will need to review a result on inference dimension from [5].

Theorem 4.9 (Theorem 4.7 [5]). Given a set X ⊆ R^d, we define the minimal-ratio of X with respect to a hyperplane h ∈ H_d as:

min_{x∈X} |h(x)| / max_{x∈X} |h(x)|.

In other words, the minimal-ratio is a normalized version of margin, a common tool in learning algorithms. Given X, define H_{d,η} ⊆ H_d to be the subset of hyperplanes with minimal-ratio η with respect to X. The inference dimension of (X, H_{d,η}) is then:

k ≤ 10 d log(d + 1) log(2η^{−1}).

Our strategy for bounding the average inference dimension of comparison queries follows via a reduction to minimal-ratio. Informally, our strategy is very simple: we will argue that, with high probability, throwing out the closest and furthest points from any classifier leaves a set with large minimal-ratio. We will show this in three main steps.

Step 1: Assuming concentration of our distribution, a large number of points are contained inside a ball. We will use this to bound the maximum function value for a given hyperplane when its furthest points are removed.
Step 2: Assuming anti-concentration of our distribution, we will union bound over all hyperplanes to show that they have good margin. In order to do this, we will define the notion of a γ-strip about a hyperplane h, which is simply h "fattened" by γ in both directions. If not too many points lie inside each hyperplane's γ-strip, then we can be assured that when we remove the closest points the remaining set will have margin γ. Since we cannot union bound over the infinite set of γ-strips, we will build a γ-net of these objects and use it instead.

Step 3: Combining the above results carefully shows that for any hyperplane, removing the furthest and closest points leaves a subsample of good minimal-ratio. In particular, by making sure the number of remaining points matches the bound on inference dimension given in Theorem 4.9, we can be assured that one of these points may be inferred from the rest as long as our high-probability conditions hold.

Theorem 4.10. Let D be a distribution over R^d affinely equivalent to another with the following properties:

1. ∀α > 0, Pr_{x∼D}[‖x‖ > dα] ≤ c_1/α
2. ∀α > 0 and ⟨v, ·⟩ + b ∈ H_d, Pr_{x∼D}[|⟨x, v⟩ + b| ≤ α] ≤ c_2 α

Then for n = Ω(d log²(d)), the average inference dimension g(n) of (D, R^d, H_d) is

g(n) ≤ 2^{−Ω(n²/(d log(d)))},

where the constant has logarithmic dependence on c_1, c_2.

Proof. To begin, note that since inference is invariant under affine transformation, we can assume that our distribution D satisfies properties 1 and 2 without loss of generality. Our argument will hinge on the minimal-ratio based inference dimension bound of [5]. Let k denote the inference dimension of (X, H_{d,η}). We begin by drawing a sample S of size n, and set our goal minimal-ratio η such that k = n/3. In particular, it is sufficient to let η = 2^{−Θ(n/(d log(d)))}.
We will now prove that, for all hyperplanes simultaneously, removing the closest and furthest k points from S leaves the remaining points with minimal-ratio η with high probability.

To begin, we show that with high probability, n − k points lie inside the ball B of radius r = 2^{Θ(n/(d log(d)))} about the origin. By condition 1 on our distribution D, we know that the probability that any fixed subset of size k = n/3 lies outside radius r is ≤ (c_1 d/r)^k. Union bounding over all possible size-k subsets then gives:

Pr[∃S′ ⊆ S, |S′| = n/3 : ∀x ∈ S′, ‖x‖ ≥ r] ≤ (n choose n/3) 2^{−Ω(n²/(d log(d))) + O(n log(d c_1))} ≤ 2^{−Ω(n²/(d log(d)))},

where the last step follows with n = Ω(d log²(d)) and a large enough constant. Assume then that no such subset exists. What implication does this have for the distance of the k furthest points from any given hyperplane? For a given hyperplane h, denote the shortest distance between h and any point in B by L. By removing the furthest k points from h, we are guaranteed that the maximum distance is at most 2r + L.

We will separate our analysis into two cases: L ≤ r and L > r. In the case that L ≤ r, our problem reduces to classifiers which intersect the ball B_2 of radius 2r. This further allows us to reduce our question from one of minimal-ratio to margin, as the minimal-ratio is bounded by:

η ≥ γ/(4r).

Then with the correct parameter setting, it is enough to show that, with high probability, γ ≤ r^{−2} is a valid margin for all hyperplanes with L ≤ r. We will inflate our margin to γ by removing the n/3 points closest to h. It is enough to show that for all h, no subset of n/3 points lies in h × [−γ, γ], which we call the γ-strip, or strip of height γ, about h. Condition 2 gives a bound on this occurring for a given subset of k points and hyperplane h, but in this case we must union bound over both subsets and hyperplanes.
Naively, this is a problem, since the set of possible hyperplanes is infinite. However, as we have reduced to hyperplanes intersecting the ball, each is defined by a unit vector v ∈ S^d and a shift b ∈ [−2r, 2r] = [−γ^{−1/2}, γ^{−1/2}]. Our strategy will be to build a finite γ-net N over these strips and show that each point in the net has O(γ^{1/2}) measure.

Consider the space of normal vectors to our strips, which for now we assume are homogeneous. This is the d-dimensional unit sphere, which can be covered by at worst (3γ^{−1})^d γ-balls. We can extend this γ-cover to non-homogeneous strips by placing 4γ^{−3/2} of these covers at regular intervals along the segment [−2r, 2r]. Formally, each point in this cover N corresponds to some hyperplane h = ⟨v, ·⟩ + b, and is comprised of the union of nearby γ-strips:

N_{v,b} = ∪_{‖v−v′‖≤γ, |b−b′|≤γ} (γ-strip about ⟨v′, ·⟩ + b′).

What is the measure of N_{v,b}? Note that

N_{v,b} = (N_{v,b} ∩ B_2) ∪ (N_{v,b} ∩ (R^d \ B_2)).

We can immediately bound the measure of the latter portion by c_1 d γ^{1/2}/2 due to concentration. For the former, we will show that N_{v,b} ∩ B_2 is contained in a small strip with measure bounded by anti-concentration. For a visualization of this, see Figure 2.

Figure 2: The above image is the γ-ball N_{(1,0)} in the simplified homogeneous case. The black strip corresponds to the central hyperplane, and the blue areas denote the strips with close normal vectors. The red dotted line denotes the larger strip in which the γ-ball lies.

Since the height of a strip is invariant under translation, we will let b = 0 for simplicity. Consider any x′ in the γ-strip about some hyperplane ⟨v′, ·⟩ + b′ ∈ N_{v,0}. Since v is the center of our ball, by definition we have ‖v − v′‖ ≤ γ and |b′| ≤ γ.
Then for x′ in the γ-strip about v′, we can bound ⟨v, x′⟩:

|⟨v, x′⟩| = |⟨v′, x′⟩ + ⟨v − v′, x′⟩| ≤ 2γ + |⟨v − v′, x′⟩| ≤ 2γ + ‖v − v′‖ · ‖x′‖ ≤ 2γ + γ · 2r ≤ 2(γ + √γ).

In other words, this neighborhood of strips lies entirely within the strip about v of height 2(γ + √γ), which in turn by condition 2 has measure at most 2c_2(γ + √γ).

Finally, note that if no subset of n/3 points lies in any N_{v,b}, then certainly no such subset lies in a single strip, as N covers all strips. Now we can union bound over subsets and N:

Pr[∃N_{v,b} ∈ N, S′ ⊆ S, |S′| = n/3 : ∀x ∈ S′, x ∈ N_{v,b}] ≤ (n choose n/3) (4(c_1 + c_2) d γ^{1/2})^{n/3} (4γ^{−1})^{d+5/2}.

Recall that γ = 2^{−Θ(n/(d log(d)))}. The only term contributing an n² to the exponent is γ^{n/6}, and thus plugging in γ gives:

Pr[∃N_{v,b} ∈ N, S′ ⊆ S, |S′| = n/3 : ∀x ∈ S′, x ∈ N_{v,b}] ≤ 2^{−Ω(n²/(d log(d)))}.

The argument for L > r is much simpler. Assuming that at least n − k points lie in B, removing the closest k points gives a margin of at least L, and removing the furthest k gives a maximum value of at most 2r + L. Because L > r, the minimal-ratio is bounded by:

η ≥ L/(2r + L) ≥ 1/3.

Then in total, assuming |S ∩ B| ≥ 2n/3, the probability over samples S that the subsample S′ created by removing the closest and furthest k points has minimal-ratio less than η is:

Pr[∃h ∈ H_d : minimal-ratio of S′ < η] ≤ 2^{−Ω(n²/(d log(d)))}.

Since the probability that |S ∩ B| ≥ 2n/3 is at least 1 − 2^{−Ω(n²/(d log(d)))}, the above bound holds without the assumption on |S ∩ B| as well.

Combining this result with Theorem 4.9 completes the proof. Let S′_h be the remaining n/3 points when the furthest and closest n/3 are removed, and assume S′_h has minimal-ratio η. S′_h may thus be viewed as a sample of size n/3 from (S′_h, H_{d,η}).
Since (S′_h, H_{d,η}) has inference dimension at most n/3 for our choice of η by Theorem 4.9, for all h there must exist x s.t. Q(S′_h − {x}) infers x. Thus the probability that we cannot infer a point is upper bounded by 2^{−Ω(n²/(d log(d)))}.

Plugging this result into Corollary 4.7 gives our desired guarantee on Comparison-Pool-RPU learning query complexity.

Theorem 4.11 (Restatement of Theorem 1.8). Let D be a distribution on R^d which satisfies the conditions of Theorem 4.10. Then the sample complexity of Comparison-Passive-RPU learning (D, R^d, H_d) is

n(ε, δ) ≤ O(d log(d) log(d/ε) log(1/δ) / ε).

The query complexity of Comparison-Pool-RPU learning (D, R^d, H_d) is

q(ε, δ) ≤ O(d log²(d/ε) log(d) log(d log log(1/ε)) log(1/δ)).

Proof. Recall from Corollary 4.7 that with n samples we can build an RPU learner A with expected coverage:

E[C(A)] ≥ (1 − (n choose k) g(k)) (n − k)/n.

By Theorem 4.10, letting k = O(d log(d) log(n)) simplifies this to

E[C(A)] ≥ (1 − 1/n)(n − k)/n,

as long as n ≥ Ω(max(c_1 + c_2, k)). Setting the right hand side to 1 − ε and solving for n gives

n = O(d log(d) log(d/ε) / ε),

and a Chernoff bound gives the desired dependence on δ.

Bounding the query complexity is a bit more nuanced. Since the above analysis requires knowing both comparisons and labels for all sampled points, we cannot simply draw n(ε, δ) points and actively learn their labels as we would do in the PAC case. Instead, consider the following algorithm for learning a finite sample S of size n drawn from D:

1. Subsample S′ ⊆ S, |S′| = O(d log(d) log(n)).
2. Query labels and comparisons on S′.
3. Infer the labels in S implied by Q(S′).
4. Restrict to the set of uninferred points, and repeat T = O(log(n/ε)) times.
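To make the subsample-infer-repeat loop concrete, here is a minimal one-dimensional toy run (our own illustration with hypothetical constants, not the algorithm of [5]): points on a line with a hidden threshold, where sorting a small subsample by comparisons and labeling it lets us infer every pool point outside the subsample's undetermined gap.

```python
import random

random.seed(7)
theta = 0.42                                # hidden threshold (illustrative)
label = lambda x: 1 if x >= theta else -1   # label-query oracle

pool = [random.random() for _ in range(2000)]
uninferred, queried, rounds = pool, 0, 0
while uninferred:
    rounds += 1
    # Steps 1-2: query a small subsample; comparisons give its sorted order.
    sub = sorted(random.sample(uninferred, min(16, len(uninferred))))
    queried += len(sub)
    neg = [x for x in sub if label(x) == -1]
    pos = [x for x in sub if label(x) == 1]
    lo = max(neg) if neg else float("-inf")
    hi = min(pos) if pos else float("inf")
    # Step 3: every pool point outside the gap (lo, hi) is inferred, since
    # any consistent threshold must lie strictly between lo and hi.
    # Step 4: restrict to the uninferred gap and repeat.
    uninferred = [x for x in uninferred if lo < x < hi]
print(rounds, queried)                      # far fewer than 2000 label queries
```

Each round removes at least the subsampled points, so the loop terminates, and in expectation the gap shrinks geometrically; the total number of labeled points stays far below the pool size, mirroring how the real algorithm spends only an ε/3 fraction of its pool on queries.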
KLMZ [5] proved that if S has inference dimension at most O(d log(d) log(n)), then with the right choice of constants each round of the above algorithm infers half of the remaining points with probability at least one half. Since we repeat the process T times, a Chernoff bound gives that all of S is learned with probability at least 1 − ε/3. Further, notice that if n is sufficiently large,

$$n = O\left(\frac{d\log(d)\log^2(d/\varepsilon)}{\varepsilon}\right),$$

then Theorem 4.10 and Observation 4.6 imply that S has inference dimension at most O(d log(d) log(n)) with probability at least 1 − ε/3, and further that the algorithm queries only an ε/3 fraction of the points in the sample.

We argue that any algorithm with such guarantees is sufficient to learn the entire distribution. A variation of this fact is proved in [13], but we repeat the argument here for completeness. Notice that the expected coverage of A over the entire distribution may be rewritten as the probability that A infers an additionally drawn point, that is:

$$\mathbb{E}_{S \sim D^n}[C(A)] = \Pr_{x_1, \ldots, x_{n+1} \sim D^{n+1}}\left[A(x_1, \ldots, x_n) \to x_{n+1}\right].$$

We argue that the right-hand side is bounded from below by the probability that a point is inferred but not queried by A across samples of size n + 1. To see this, recall that A operates on S′ = {x₁, …, x_{n+1}} by querying a sequence of subsets S′₁, …, S′_T ⊂ S′, where each S′_{i+1} is drawn uniformly at random from the points not inferred by S′₁, …, S′_i. If x_{n+1} is learned but not queried by A, it must be inferred by some subset S′_i. Such a configuration of subsets is only more likely to occur when running A(S′ \ {x_{n+1}}), since the only difference is that at any step where x_{n+1} has not yet been inferred, A(S′) might include x_{n+1} in the next sample. Finally, recall that our algorithm infers all of S′ in only ε|S′|/3 queries with probability at least 1 − 2ε/3.
Since x_{n+1} is just an arbitrary point from D, the probability it is inferred but not queried is then at least 1 − ε, which gives the desired coverage. All that remains is to analyze the query complexity. The total number of queries made is O(T k log(k)), and repeating this process log(1/δ) times returns the desired RPU learner by a Chernoff bound. Thus the total query complexity is:

$$q(\varepsilon, \delta) \le O\left(d\log^2(d/\varepsilon)\log(d)\log(d\log\log(1/\varepsilon))\log(1/\delta)\right).$$

The necessary conditions in Theorem 4.10 are satisfied by a wide range of distributions. The concentration bound is satisfied by any distribution whose norm has finite expectation, and the anti-concentration bound is satisfied by many continuous distributions. Log-concave distributions, for instance, easily satisfy the conditions.

Proposition 4.12. Log-concave distributions satisfy the conditions of Theorem 4.10 with c₁ = c₂ = O(1).

Proof. Any log-concave distribution D is affinely equivalent to an isotropic log-concave distribution D′. Isotropic log-concave distributions have the following properties [2]:

1. For all a > 0, Pr_{x∼D′}[‖x‖ ≥ a] ≤ e^{−a/√d + 1}.
2. All marginals of D′ are isotropic log-concave.
3. If d = 1, Pr_{x∼D′}[x ∈ [a, b]] ≤ |b − a|.

We want to show that these three properties satisfy the two conditions of Theorem 4.10. Property 1 satisfies condition 1 with constant c₁ = 1. Properties 2 and 3 imply condition 2 with constant c₂ = 2, since the probability mass of a strip equals the probability mass of the one-dimensional marginal along its normal vector.

With significant additional work, Balcan and Zhang [3] show that an even more general class of distributions, s-concave distributions, satisfies these properties.

Proposition 4.13 (Theorems 5, 11 of [3]). s-Concave distributions satisfy the conditions of Theorem 4.10⁴ for s ≥ −1/(2d + 3) and

$$c_1 = \frac{4\sqrt{d}}{c}, \quad c_2 = 4$$

for some absolute constant c > 0.
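As a quick empirical sanity check of Proposition 4.12 (an illustration, not the proof's exact constants), one can sample from an isotropic Gaussian, which is log-concave, and verify both a norm-concentration tail and that strips carry mass proportional to their width:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 10, 200_000
X = rng.standard_normal((n, d))  # isotropic Gaussian: a log-concave distribution

# Concentration: mass far outside radius ~ sqrt(d) decays rapidly.
tail = np.mean(np.linalg.norm(X, axis=1) >= 2 * np.sqrt(d))

# Anti-concentration: a strip of half-width gamma about a hyperplane through
# the origin has mass O(gamma); as in the proof, it equals the mass of the
# one-dimensional marginal along the normal vector v.
v = np.zeros(d)
v[0] = 1.0
gamma = 0.05
strip = np.mean(np.abs(X @ v) <= gamma)
```

Here `tail` is essentially zero, while `strip` comes out close to 2γ·φ(0) ≈ 0.04, comfortably below the 2c₂γ bound with c₂ = 2 used in the proof.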
⁴Condition 1, however, must be changed to ∀α > 16... rather than ∀α > 0, which does not affect the proof.

Theorem 4.10 provides a randomized comparison LDT for solving the point location problem. Because our method involves reducing to worst-case inference dimension, we may use the derandomization technique (Theorem 1.8) of [24] to prove the existence of a deterministic LDT.

Corollary 4.14 (Restatement of Theorem 1.9). Let D be a distribution satisfying the criteria of Theorem 1.8, x ∈ R^d, and h₁, …, h_n ∼ D^n. Then for n ≥ Ω(d log²(d)) there exists an LDT using only label and comparison queries that solves the point location problem with expected depth O(d log(d) log(d log(n)) log²(n)).

5 Experimental Results

To confirm our theoretical findings, we have implemented a variant of our reliable learning algorithm for finite samples. Our simulations were run on a combination of the Triton Shared Computing Cluster, supported by the San Diego Supercomputer Center, and the Odyssey Computing Cluster, supported by the Research Computing Group at Harvard University. For a given sample size or dimension, the query complexity we present is averaged over 500 trials of the algorithm.

5.1 Algorithm

We first note a few practical modifications. First, our algorithm labels finite samples drawn from the uniform distribution over the unit ball in d dimensions. Second, to match our methodology in lower bounding Label-Pool-RPU learning, we draw our classifier uniformly from hyperplanes tangent to the unit ball. Finally, because the true inference dimension of the sample might be small, our algorithm starts by guessing a low potential inference dimension, and doubles its guess on each iteration with low coverage. Our algorithm references two subroutines employed by the original inference dimension algorithm in [5]: Query(Q, S) and Infer(S, C).
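Infer is the heart of the reliable guarantee: a point's label is implied exactly when ⟨w, x⟩ keeps one sign over every classifier w consistent with the answered queries, which two linear programs can certify. The following is a minimal sketch of this certification (assuming scipy is available; the constraint encoding, unit-box bounds, and `EPS` margin are our illustrative choices, not the paper's exact implementation):

```python
import numpy as np
from scipy.optimize import linprog

EPS = 1e-6  # small margin standing in for the strict inequalities

def infer(points, constraints):
    """Return the indices of points whose label is forced by answered queries.

    Each row a of `constraints` encodes one answered label or comparison query
    as the linear inequality <a, w> >= EPS on the unknown classifier w (kept in
    the unit box so the LPs stay bounded).  A point x is inferred iff <w, x>
    has a fixed sign over the whole feasible region, certified by minimizing
    and maximizing <w, x> with two LPs.
    """
    d = points.shape[1]
    A_ub = -np.asarray(constraints)           # linprog wants A_ub @ w <= b_ub
    b_ub = -EPS * np.ones(len(constraints))
    bounds = [(-1.0, 1.0)] * d
    inferred = []
    for i, x in enumerate(points):
        lo = linprog(x, A_ub=A_ub, b_ub=b_ub, bounds=bounds)   # min <w, x>
        hi = linprog(-x, A_ub=A_ub, b_ub=b_ub, bounds=bounds)  # max <w, x>
        if lo.success and hi.success and (lo.fun > 0 or -hi.fun < 0):
            inferred.append(i)                # sign of <w, x> is fixed
    return inferred

# Example: the answered queries force w1 > 0 and w2 > 0, so the label of
# (1, 1) is implied (positive), while (1, -1) remains ambiguous.
pts = np.array([[1.0, 1.0], [1.0, -1.0]])
answered = np.array([[1.0, 0.0], [0.0, 1.0]])
result = infer(pts, answered)
```

Each point costs two LPs over the same feasible region, which is why the running-time accounting below counts at most N linear programs per round.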
Query(Q, S) simply returns Q(S), the oracle responses to all queries on S of type Q. Infer(S, C) builds a linear program from the constraints C (solutions to some Query(Q, S)) and returns which points in S are inferred.

Algorithm 2: Perfect-Learning(N, Q, d, g(n))
Result: Labels all points in a sample S ∼ (B_d)^N using query set Q
  S ∼ (B_d)^N; Classifier drawn uniformly from hyperplanes tangent to B_d;
  Subsample Size = d + 1; Uninferred = S; Subsample List = [];
  while size(Uninferred) > g(Subsample Size) do
      Subsample ∼ Uninferred[Subsample Size];
      Subsample List.extend(Subsample);
      Inferred Points = Infer(Uninferred, Query(Q, Subsample List));
      if size(Inferred Points) < size(Uninferred)/2 then
          Subsample Size ∗= 2;
      end
      Uninferred.remove(Inferred Points)
  end
  Query(Label, Uninferred)

Note that this algorithm is efficient. The while loop runs at most log(N) times, and each loop solves at most N linear programs with O(f_Q(N)) constraints in d + 1 dimensions. Thus the total running time of Algorithm 2 is Poly(N, d). Further, note that for simplicity we have chosen g(n) = 1 for labels and g(n) = 2 for comparisons, and we drop this parameter in the following.

5.2 Query Complexity

Our theoretical results state that for an adversarial choice of classifier, the number of queries Perfect-Learning(N, Comparison, d) performs is logarithmic compared to Perfect-Learning(N, Label, d). The left
Figure 3: The left graph shows a log-scale comparison of P erfect-Learning( N , Lab el, 3) and P erfect- Learning( N , Comparison, 3). The right graph sho ws how P erfect-Learning(256, Comparison, d ) scales with dimension. 6 F urther Directions 6.1 Av erage Inference Dimension and Enric hed Queries KLMZ [5] propose lo oking for a simple set of queries with finite inference dimension k for d -dimensional linear separators. In particular, they suggest lo oking at extending to t-lo cal relative queries, questions which ask comparative questions ab out t p oin ts. Unfortunately , simple generalizations of comparison queries seem to fail, but the problem of analyzing their av erage inference dimension remains op en. When moving from 1-lo cal to 2-local queries, our av erage inference dimension improv ed from: 2 − ˜ O ( n ) → 2 − ˜ O ( n 2 ) If there exist simple relativ e t-lo cal queries with av erage inference dimension 2 − ˜ O ( n t ) o ver some distribution D , then it would imply a passive RPU-learning algorithm o ver D with sample complexity n ( ε, δ ) = O log 1 ε 1 / ( t − 1) ε log 1 δ ! and query complexit y q ( ε, δ ) ≤ O 2 f Q 4 log 1 / ( t − 1) ( n ) log 1 δ log( n ) One suc h candidate 3-lo cal query giv en p oin ts x 1 , x 2 , and x 3 is the question: is x 1 closer to x 2 , or x 3 ? KLMZ suggest looking into this query in particular, and other similar types of relativ e queries are studied in [37 – 43]. 6.2 Av erage Inference Dimension = ⇒ Lo wer Bounds W e show ed in this pap er that a verage inference dimension pro vides upp er bounds on passiv e and active RPU- learning, but to show av erage inference dimension characterizes the distribution dep enden t mo del, we would 23 need to show it provides a matching low er b ound. The first step in this pro cess would require examining the tigh tness of our av erage to worst case reduction. Observ ation 6.1. L et ( D, X, H ) have aver age infer enc e dimension ω ( g ( k )) . 
Then the probability that a random sample S ∼ D^n has inference dimension at most k satisfies:

$$1 - \binom{n}{k}g(k) \le \Pr\left[\text{inference dimension of } S \le k\right] \le (1 - g(k))^{n/k}.$$

Even with a tight version of Observation 6.1, it remains an open problem to apply such a result as a lower bound technique for the PAC or RPU models.

6.3 Noisy and Agnostic Learning

The models we have proposed in this paper are unrealistic in that they assume a perfect oracle. RPU-learning in particular must be noiseless due to its zero-error nature. This raises a natural question: can inference dimension techniques be applied in a noisy or non-realizable setting? Hopkins, Kane, Lovett, and Mahajan [13] recently made progress in this direction, introducing a relaxed version of RPU-learning called Almost Reliable and Probably Useful learning. They provide learning algorithms under the popular [7, 2, 44–48] Massart [49] and Tsybakov [50, 51] noise models. However, many problems in this direction remain completely open, such as agnostic or more adversarial settings. It remains unclear whether inference-based techniques are robust in these settings, since small adversarial adjustments to the inference LP can cause substantial corruption to its output.

6.4 Further Applications of RPU-learning

In this paper we offer the first set of positive results on RPU-learning since the model was introduced by Rivest and Sloan [8]. RPU-learning has potential for both practical and theoretical applications. On the practical side, positive results on RPU-learning, or on a slightly relaxed noisy model, may allow us to build predictors with better confidence levels. On the theoretical side, efficient RPU-learners have potential applications for circuit lower bounds [52].

Acknowledgments

The authors would like to thank Sanjoy Dasgupta for helpful discussions throughout the process, and in particular for pointing us to previous work on RPU-learning.
References

[1] Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems, pages 235–242, 2006.

[2] M. Balcan and P. Long. Active and passive learning of linear separators under log-concave distributions. In Proceedings of the 26th Conference on Learning Theory, 2013.

[3] Maria-Florina Balcan and Hongyang Zhang. Sample and computationally efficient learning algorithms under s-concave distributions. In Advances in Neural Information Processing Systems, pages 4796–4805, 2017.

[4] Ran El-Yaniv and Yair Wiener. Active learning via perfect selective classification. Journal of Machine Learning Research, 13(Feb):255–279, 2012.

[5] D. Kane, S. Lovett, S. Moran, and J. Zhang. Active classification with comparison queries. In IEEE 58th Annual Symposium on Foundations of Computer Science, 2017.

[6] Benjamin Satzger, Markus Endres, and Werner Kiessling. A preference-based recommender system. In Proceedings of the 7th International Conference on E-Commerce and Web Technologies, EC-Web'06, pages 31–40, Berlin, Heidelberg, 2006. Springer-Verlag. ISBN 3-540-37743-3, 978-3-540-37743-6.

[7] Yichong Xu, Hongyang Zhang, Kyle Miller, Aarti Singh, and Artur Dubrawski. Noise-tolerant interactive learning using pairwise comparisons. In Advances in Neural Information Processing Systems, pages 2431–2440, 2017.

[8] R. Rivest and R. Sloan. Learning complicated concepts reliably and usefully. In Proceedings of the First Workshop on Computational Learning Theory, pages 61–71, 1988.

[9] Chi-Keung Chow. An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers, (4):247–254, 1957.

[10] Jyrki Kivinen. Reliable and useful learning. In Proceedings of the Second Annual Workshop on Computational Learning Theory, pages 365–380, 1989.

[11] J. Kivinen. Learning reliably and with one-sided error.
Mathematical Systems Theory, 28(2):141–172, 1995.

[12] J. Kivinen. Reliable and useful learning with uniform probability distributions. In Proceedings of the First International Workshop on Algorithmic Learning Theory, 1990.

[13] Max Hopkins, Daniel Kane, Shachar Lovett, and Gaurav Mahajan. Noise-tolerant, reliable active classification with comparison queries. arXiv preprint arXiv:2001.05497, 2020.

[14] L. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

[15] Vladimir Vapnik and Alexey Chervonenkis. Theory of pattern recognition, 1974.

[16] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.

[17] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.

[18] Steve Hanneke. The optimal sample complexity of PAC learning. The Journal of Machine Learning Research, 17(1):1319–1333, 2016.

[19] A. McCallum and K. Nigam. Employing EM and pool-based active learning for text classification. In Proceedings of the Fifteenth International Conference on Machine Learning, 1998.

[20] Dana Angluin. Queries and concept learning. Machine Learning, 2(4):319–342, 1988.

[21] P. Long. On the sample complexity of PAC learning halfspaces against the uniform distribution. IEEE Transactions on Neural Networks, 6(6):1556–1559, 1995.

[22] P. Long. An upper bound on the sample complexity of PAC learning halfspaces with respect to the uniform distribution. Information Processing Letters, 2003.

[23] S. Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, 2014.

[24] D. Kane, S. Lovett, and S. Moran. Near-optimal linear decision trees for k-SUM and related problems.
In Proceedings of the 50th Annual ACM Symposium on Theory of Computing, 2018.

[25] Max Hopkins, Daniel M. Kane, Shachar Lovett, and Gaurav Mahajan. Point location and active learning: Learning halfspaces almost optimally. arXiv preprint arXiv:2004.11380, 2020.

[26] D. Kane, S. Lovett, and S. Moran. Generalized comparison trees for point-location problems. In Proceedings of the 45th International Colloquium on Automata, Languages and Programming, 2018.

[27] Sanjeev R. Kulkarni, Sanjoy K. Mitter, and John N. Tsitsiklis. Active learning using arbitrary binary valued queries. Machine Learning, 11(1):23–35, 1993.

[28] Richard M. Dudley, Hiroshi Kunita, and François Ledrappier. École d'Été de Probabilités de Saint-Flour XII, 1982, volume 1097. Springer, 2006.

[29] Andrew Chi-Chih Yao. Probabilistic computations: Toward a unified measure of complexity. In 18th Annual Symposium on Foundations of Computer Science (SFCS 1977), pages 222–227. IEEE, 1977.

[30] L. Lovász and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures and Algorithms, 2007.

[31] Adam R. Klivans, Philip M. Long, and Alex K. Tang. Baum's algorithm learns intersections of halfspaces with respect to log-concave distributions. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 588–600. Springer, 2009.

[32] Pranjal Awasthi, Maria Florina Balcan, and Philip M. Long. The power of localization for efficiently learning linear separators with noise. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 449–458, 2014.

[33] I. Bárány. Random points and lattice points in convex bodies. Bulletin of the American Mathematical Society, 45(3):339–365, 2008.

[34] I. Bárány and Z. Füredi. Approximation of the sphere by polytopes having few vertices. Proceedings of the American Mathematical Society, 102(3):651–659, 1988.

[35] J. A. Wieacker.
Einige Probleme der polyedrischen Approximation. Freiburg im Breisgau: Diplomarbeit, 1978.

[36] Imre Bárány. Sylvester's question: The probability that n points are in convex position. The Annals of Probability, 27(4):2020–2034, 1999.

[37] Eric P. Xing, Michael I. Jordan, Stuart J. Russell, and Andrew Y. Ng. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems, pages 521–528, 2003.

[38] Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative comparisons. In Advances in Neural Information Processing Systems, pages 41–48, 2004.

[39] Sameer Agarwal, Josh Wills, Lawrence Cayton, Gert Lanckriet, David Kriegman, and Serge Belongie. Generalized non-metric multidimensional scaling. In Artificial Intelligence and Statistics, pages 11–18, 2007.

[40] Brian McFee and Gert Lanckriet. Learning similarity in heterogeneous data. In Proceedings of the International Conference on Multimedia Information Retrieval, pages 243–244. ACM, 2010.

[41] Kaizhu Huang, Yiming Ying, and Colin Campbell. Generalized sparse metric learning with relative comparisons. Knowledge and Information Systems, 28(1):25–45, 2011.

[42] Omer Tamuz, Ce Liu, Serge Belongie, Ohad Shamir, and Adam Tauman Kalai. Adaptively learning the crowd kernel. arXiv preprint arXiv:1105.1033, 2011.

[43] Buyue Qian, Xiang Wang, Fei Wang, Hongfei Li, Jieping Ye, and Ian Davidson. Active learning from relative queries. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.

[44] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Ruth Urner. Efficient learning of linear separators under bounded noise. In Conference on Learning Theory, pages 167–190, 2015.

[45] Yining Wang and Aarti Singh. Noise-adaptive margin-based active learning and lower bounds under Tsybakov noise condition.
In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[46] Aaditya Ramdas and Aarti Singh. Optimal rates for stochastic convex optimization under Tsybakov noise condition. In International Conference on Machine Learning, pages 365–373, 2013.

[47] DaoHong Xiang. Classification with Gaussians and convex loss II: improving error bounds by noise conditions. Science China Mathematics, 54(1):165–171, 2011.

[48] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Hongyang Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Conference on Learning Theory, pages 152–192, 2016.

[49] Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.

[50] Alexander B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.

[51] Rui M. Castro and Robert D. Nowak. Upper and lower error bounds for active learning.

[52] Igor Carboni Oliveira and Rahul Santhanam. Conspiracies between learning algorithms, circuit lower bounds and pseudorandomness. 2017.