The Geometry of Generalized Binary Search


Authors: Robert D. Nowak

Abstract—This paper investigates the problem of determining a binary-valued function through a sequence of strategically selected queries. The focus is an algorithm called Generalized Binary Search (GBS), a well-known greedy algorithm for this problem. At each step, a query is selected that most evenly splits the hypotheses under consideration into two disjoint subsets, a natural generalization of the idea underlying classic binary search. This paper develops novel incoherence and geometric conditions under which GBS achieves the information-theoretically optimal query complexity; i.e., given a collection of N hypotheses, GBS terminates with the correct function after no more than a constant times log N queries. Furthermore, a noise-tolerant version of GBS is developed that also achieves the optimal query complexity. These results are applied to learning halfspaces, a problem arising routinely in image processing and machine learning.

I. INTRODUCTION

This paper studies learning problems of the following form. Consider a finite, but potentially very large, collection of binary-valued functions H defined on a domain X. In this paper, H will be called the hypothesis space and X will be called the query space. Each h ∈ H is a mapping from X to {−1, 1}. Throughout the paper we will let N denote the cardinality of H. Assume that the functions in H are unique and that one function, h∗ ∈ H, produces the correct binary labeling. It is assumed that h∗ is fixed but unknown, and the goal is to determine h∗ through as few queries from X as possible. For each query x ∈ X, the value h∗(x), possibly corrupted with independently distributed binary noise, is observed. The goal is to strategically select queries in a sequential fashion in order to identify h∗ as quickly as possible.
If the responses to queries are noiseless, then the problem is related to the construction of a binary decision tree. A sequence of queries defines a path from the root of the tree (corresponding to H) to a leaf (corresponding to a single element of H). There are several ways in which one might define the notion of an optimal tree; e.g., the tree with the minimum average or worst-case depth. In general the determination of the optimal tree (in either sense above) is a combinatorial problem and was shown by Hyafil and Rivest to be NP-complete [19]. Therefore, this paper investigates the performance of a greedy procedure called generalized binary search (GBS), depicted below in Fig. 1. At each step GBS selects a query that results in the most even split of the hypotheses under consideration into two subsets responding +1 and −1, respectively, to the query. The correct response to the query eliminates one of these two subsets from further consideration. We denote the number of hypotheses remaining at step n by |H_n|. The main results of the paper characterize the worst-case number of queries required by GBS in order to identify the correct hypothesis h∗. More formally, we define the notion of query complexity as follows.

Generalized Binary Search (GBS)
initialize: n = 0, H_0 = H.
while |H_n| > 1
  1) Select x_n = arg min_{x∈X} |Σ_{h∈H_n} h(x)|.
  2) Query with x_n to obtain response y_n = h∗(x_n).
  3) Set H_{n+1} = {h ∈ H_n : h(x_n) = y_n}, n = n + 1.

Fig. 1. Generalized Binary Search, also known as the Splitting Algorithm.

Definition 1: The minimum number of queries required by GBS (or another algorithm) to identify any hypothesis in H is called the query complexity of the algorithm. The query complexity is said to be near-optimal if it is within a constant factor of log N, since at least log N queries are required to specify one of N hypotheses.
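The greedy loop of Fig. 1 can be written out directly when X is finite. The following is a minimal, hedged sketch (function and variable names are illustrative, not from the paper); hypotheses are {−1, +1}-valued callables and `oracle` plays the role of the noiseless responses h∗(x).

```python
# Minimal sketch of the GBS loop in Fig. 1 for a finite query space.
def gbs(queries, hypotheses, oracle):
    viable = list(hypotheses)
    n_queries = 0
    while len(viable) > 1:
        # 1) Select the query that most evenly splits the viable hypotheses.
        x = min(queries, key=lambda q: abs(sum(h(q) for h in viable)))
        # 2) Query to obtain the response y = h*(x).
        y = oracle(x)
        # 3) Keep only the hypotheses consistent with the response.
        viable = [h for h in viable if h(x) == y]
        n_queries += 1
    return viable[0], n_queries
```

For example, on eight threshold hypotheses h_t(x) = +1 iff x ≥ t over X = {0, . . . , 7}, this sketch recovers the target in log₂ 8 = 3 queries, matching classic binary search.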
Conditions are established under which GBS (and a noise-tolerant variant) have near-optimal query complexity. The main contributions of this paper are two-fold. First, incoherence and geometric relations between the pair (X, H) are studied to bound the number of queries required by GBS. This leads to an easily verifiable sufficient condition that guarantees that GBS terminates with the correct hypothesis after no more than a constant times log N queries. Second, noise-tolerant versions of GBS are proposed. The following noise model is considered. The binary response y ∈ {−1, 1} to a query x ∈ X is an independent realization of the random variable Y satisfying P(Y = h∗(x)) > P(Y = −h∗(x)), where P denotes the underlying probability measure. In other words, the response to x is only probably correct. If a query x is repeated more than once, then each response is an independent realization of Y. A new algorithm based on a weighted (soft-decision) GBS procedure is shown to confidently identify h∗ after a constant times log N queries even in the presence of noise (under the sufficient condition mentioned above). An agnostic algorithm that performs well even if h∗ is not in the hypothesis space H is also proposed.

A. Notation

The following notation will be used throughout the paper. The hypothesis space H is a finite collection of binary-valued functions defined on a domain X, which is called the query space. Each h ∈ H is a mapping from X to {−1, 1}. For any subset H_0 ⊂ H, |H_0| denotes the number of hypotheses in H_0. The number of hypotheses in H is denoted by N := |H|.

II. A GEOMETRICAL VIEW OF GENERALIZED BINARY SEARCH

The efficiency of classic binary search is due to the fact that at each step there exists a query that splits the pool of viable hypotheses in half. The existence of such queries is a result of the special ordered structure of the problem.
Because of ordering, optimal query locations are easily identified by bisection. In the general setting, in which the query and hypothesis spaces are arbitrary, it is impossible to order the hypotheses in a similar fashion, and "bisecting" queries may not exist. For example, consider hypotheses associated with halfspaces of X = R^d. Each hypothesis takes the value +1 on its halfspace and −1 on the complement. A bisecting query may not exist in this case. To address such situations we next introduce a more general framework that does not require an ordered structure.

A. Partition of X

While it may not be possible to naturally order the hypotheses within X, there does exist a similar local geometry that can be exploited in the search process. Observe that the query space X can be partitioned into equivalence subsets such that every h ∈ H is constant for all queries in each such subset. Let A(X, H) denote the smallest such partition.¹ Note that X = ∪_{A∈A} A. For every A ∈ A and h ∈ H, the value of h(x) is constant (either +1 or −1) for all x ∈ A; denote this value by h(A). Observe that the query selection step in GBS is equivalent to an optimization over the partition cells in A. That is, it suffices to select a partition cell for the query according to A_n = arg min_{A∈A} |Σ_{h∈H_n} h(A)|.

The main results of this paper concern the query complexity of GBS, but before moving on let us comment on the computational complexity of the algorithm. The query selection step is the main computational burden in GBS. Constructing A may itself be computationally intensive. However, given A, the computational complexity of GBS is N|A|, up to a constant factor, where |A| denotes the number of partition cells in A. The size and construction of A is manageable in many practical situations. For example, if X is finite, then |A| ≤ |X|, where |X| is the cardinality of X.
Later, in Section V-B, we show that if H is defined by N halfspaces of X := R^d, then |A| grows like N^d.

B. Distance in X

The partition A provides a geometrical link between X and H. The hypotheses induce a distance function on A, and hence on X. For every pair A, A′ ∈ A, the Hamming distance between the response vectors (h_1(A), . . . , h_N(A)) and (h_1(A′), . . . , h_N(A′)) provides a natural distance metric in X.

Definition 2: Two sets A, A′ ∈ A are said to be k-neighbors if k or fewer hypotheses (along with their complements, if they belong to H) output different values on A and A′.

For example, suppose that H is symmetric, so that h ∈ H implies −h ∈ H. Then two sets A and A′ are k-neighbors if the Hamming distance between their respective response vectors is less than or equal to 2k. If H is non-symmetric (h ∈ H implies that −h is not in H), then A and A′ are k-neighbors if the Hamming distance between their respective response vectors is less than or equal to k.

¹Each h splits X into two disjoint sets. Let C_h := {x ∈ X : h(x) = +1} and let C̄_h denote its complement. A is the collection of all non-empty intersections of the form ∩_{h∈H} C̃_h, where C̃_h ∈ {C_h, C̄_h}, and it is the smallest partition that refines the sets {C_h}_{h∈H}. A is known as the join of the sets {C_h}_{h∈H}.

Definition 3: The pair (X, H) is said to be k-neighborly if the k-neighborhood graph of A is connected (i.e., for every pair of sets in A there exists a sequence of k-neighbor sets that begins at one of the pair and ends with the other).

If (X, H) is k-neighborly, then the distance between A and A′ is bounded by k times the minimum path length between A and A′. Moreover, the neighborly condition implies that there is an incremental way to move from one query to another, moving a distance of at most k at each step.
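When X is finite, the partition A and the Hamming distance above can be computed directly from response vectors. The following is a hedged sketch (names are illustrative): each cell of A is exactly the set of queries sharing one response vector (h_1(x), . . . , h_N(x)).

```python
from collections import defaultdict

def build_partition(queries, hypotheses):
    """Group queries by their response vector (h_1(x), ..., h_N(x));
    the non-empty groups are the cells of the partition A(X, H)."""
    cells = defaultdict(list)
    for x in queries:
        cells[tuple(h(x) for h in hypotheses)].append(x)
    return dict(cells)  # response vector -> cell (list of queries)

def hamming(sig, sig_prime):
    """Hamming distance between two cells' response vectors."""
    return sum(a != b for a, b in zip(sig, sig_prime))
```

For threshold hypotheses on a line, for instance, consecutive cells differ in exactly one hypothesis, so the 1-neighborhood graph of A is connected.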
This local geometry guarantees that near-bisecting queries almost always exist, as shown in the following lemma.

Lemma 1: Assume that (X, H) is k-neighborly and define the coherence parameter

  c∗(X, H) := min_P max_{h∈H} |Σ_{A∈A} h(A) P(A)|,   (1)

where the minimization is over all probability mass functions on A. For every H_0 ⊂ H and any constant c satisfying c∗ ≤ c < 1, there exists an A ∈ A that approximately bisects H_0,

  |Σ_{h∈H_0} h(A)| ≤ c|H_0|,

or the set H_0 is small: |H_0| < k/c, where |H_0| denotes the cardinality of H_0.

Proof: According to the definition of c∗ it follows that there exists a probability distribution P such that

  |Σ_{h∈H_0} Σ_{A∈A} h(A) P(A)| ≤ c|H_0|.

This implies that there exists an A ∈ A such that |Σ_{h∈H_0} h(A)| ≤ c|H_0|, or there exists a pair A and A′ such that Σ_{h∈H_0} h(A) > c|H_0| and Σ_{h∈H_0} h(A′) < −c|H_0|. In the former case, it follows that a query from A will reduce the size of H_0 by a factor of at least (1 + c)/2 (i.e., every query x ∈ A approximately bisects the subset H_0). In the latter case, an approximately bisecting query does not exist, but the k-neighborly condition implies that |H_0| must be small. To see this, note that the k-neighborly condition guarantees that there exists a sequence of k-neighbor sets beginning at A and ending at A′. By assumption in this case, |Σ_{h∈H_0} h(·)| > c|H_0| on every set, and the sign of Σ_{h∈H_0} h(·) must change at some point in the sequence. It follows that there exist k-neighbor sets A and A′ such that Σ_{h∈H_0} h(A) > c|H_0| and Σ_{h∈H_0} h(A′) < −c|H_0|. Two inequalities follow from this observation. First, Σ_{h∈H_0} h(A) − Σ_{h∈H_0} h(A′) > 2c|H_0|. Second, |Σ_{h∈H_0} h(A) − Σ_{h∈H_0} h(A′)| ≤ 2k.
Note that if h and its complement −h both belong to H_0, then their contributions to the quantity |Σ_{h∈H_0} h(A) − Σ_{h∈H_0} h(A′)| cancel each other. Combining these inequalities yields |H_0| < k/c. ∎

Example 1: To illustrate Lemma 1, consider the special situation in which we are given two points x_1, x_2 ∈ X known to satisfy h∗(x_1) = +1 and h∗(x_2) = −1. This allows us to restrict our attention to only those hypotheses that agree with h∗ at these points. Let H denote this collection of hypotheses. A depiction of this situation is shown in Fig. 2, where the solid curves represent the classification boundaries of the hypotheses, and each cell in the partition shown corresponds to a subset of X (i.e., an element of A). As long as each subset is non-empty, the 1-neighborhood graph is connected in this example. The minimization in (1) is achieved by the distribution P = ½δ_{x_1} + ½δ_{x_2} (equal point-masses on x_1 and x_2), and c∗(X, H) = 0. Lemma 1 implies that there exists a query (equivalently, a partition cell A) where half of the hypotheses take the value +1 and the other half −1. The shaded cell in Fig. 2 has this bisection property. The figure also shows a dashed path between x_1 and x_2 that passes through the bisecting cell.

Fig. 2. An illustration of the idea of GBS. Each solid curve denotes the decision boundary of a hypothesis. There are six boundaries/hypotheses in this example. The correct hypothesis in this case is known to satisfy h∗(x_1) = +1 and h∗(x_2) = −1. Without loss of generality we may assume that all hypotheses agree with h∗ at these two points. The dashed path between the points x_1 and x_2 reveals a bisecting query location. As the path crosses a decision boundary, the corresponding hypothesis changes its output from +1 to −1 (or vice-versa, depending on the direction followed).
At a certain point, indicated by the shaded cell, half of the hypotheses output +1 and half output −1. Selecting a query from this cell will bisect the collection of hypotheses.

C. Coherence and Query Complexity

The coherence parameter c∗ quantifies the informativeness of queries. The coherence parameter is optimized over the choice of P, rather than sampled at random according to a specific distribution on X, because the queries may be selected as needed from X. The minimizer in (1) exists because the minimization can be computed over the space of finite-dimensional probability mass functions over the elements of A. For c∗ to be close to 0, there must exist a distribution P on A so that the moment of every h ∈ H is close to zero (i.e., for each h ∈ H the probabilities of the responses +1 and −1 are both close to 1/2). This implies that there is a way to randomly sample queries so that the expected response of every hypothesis is close to zero. In this sense, the queries are incoherent with the hypotheses. In Lemma 1, c∗ bounds the proportion of the split of any subset H_0 generated by the best query (i.e., the degree to which the best query bisects any subset H_0). The coherence parameter c∗ leads to a bound on the number of queries required by GBS.

Theorem 1: If (X, H) is k-neighborly, then GBS terminates with the correct hypothesis after at most ⌈log N / log(λ⁻¹)⌉ queries, where λ = max{(1 + c∗)/2, (k + 1)/(k + 2)}.

Proof: Consider the nth step of the GBS algorithm. Lemma 1 shows that for any c ∈ [c∗, 1) either there exists an approximately bisecting query, so that |H_n| ≤ (1 + c)/2 |H_{n−1}|, or |H_{n−1}| < k/c. The uniqueness of the hypotheses with respect to X implies that there exists a query that eliminates at least one hypothesis. Therefore, in the latter case,

  |H_n| ≤ |H_{n−1}| − 1 = |H_{n−1}|(1 − |H_{n−1}|⁻¹) < |H_{n−1}|(1 − c/k).
It follows that each GBS query reduces the number of viable hypotheses by a factor of at least

  λ := min_{c≥c∗} max{(1 + c)/2, 1 − c/k} = max{(1 + c∗)/2, (k + 1)/(k + 2)}.

Therefore, |H_n| ≤ N λⁿ and GBS is guaranteed to terminate when n satisfies N λⁿ ≤ 1. Taking the logarithm of this inequality produces the query complexity bound. ∎

Theorem 1 demonstrates that if (X, H) is neighborly, then the query complexity of GBS is near-optimal; i.e., within a constant factor of log₂ N. The constant depends on the coherence parameter c∗ and on k, and clearly it is desirable that both are as small as possible. Note that GBS does not require knowledge of c∗ or k. We also remark that the constant in the bound is not necessarily the best that one can obtain. The proof involves selecting c to balance the splitting factor (1 + c)/2 and the "tail" behavior 1 − c/k, and this may not give the best bound. The coherence parameter c∗ can be computed or bounded for many pairs (X, H) that are commonly encountered in applications, as covered later in Section V.

III. NOISY GENERALIZED BINARY SEARCH

In noisy problems, the search must cope with erroneous responses. Specifically, assume that for any query x ∈ X the binary response y ∈ {−1, 1} is an independent realization of the random variable Y satisfying P(Y = h∗(x)) > P(Y = −h∗(x)) (i.e., the response is only probably correct). If a query x is repeated more than once, then each response is an independent realization of Y. Define the noise-level for the query x as α_x := P(Y = −h∗(x)). Throughout the paper we will let α := sup_{x∈X} α_x and assume that α < 1/2. Before presenting the main approach to noisy GBS, we first consider a simple strategy based on repetitive querying that will serve as a benchmark for comparison.

A. Repetitive Querying

We begin by describing a simple noise-tolerant version of GBS.
The noise-tolerant algorithm is based on the simple idea of repeating each query of GBS several times, in order to overcome the uncertainty introduced by the noise. A similar approach is proposed in the work of Kääriäinen [36], and Karp and Kleinberg [30] analyze this strategy for noise-tolerant classic binary search. This is essentially like using a simple repetition code to communicate over a noisy channel. This procedure is termed noise-tolerant GBS (NGBS) and is summarized in Fig. 3.

Noise-Tolerant GBS (NGBS)
input: H, repetition rate R ≥ 1 (integer).
initialize: n = 0, H_0 = H.
while |H_n| > 1
  1) Select A_n = arg min_{A∈A} |Σ_{h∈H_n} h(A)|.
  2) Query R times from A_n to obtain R noisy versions of y_n = h∗(A_n). Let ŷ_n denote the majority vote of the noisy responses.
  3) Set H_{n+1} = {h ∈ H_n : h(A_n) = ŷ_n}, n = n + 1.

Fig. 3. Noise-tolerant GBS based on repeated queries.

Theorem 2: Let n_0 denote the number of queries made by GBS to determine h∗ in the noiseless setting. Then in the noisy setting, with probability at least max{0, 1 − n_0 e^{−R|1/2−α|²}}, the noise-tolerant GBS algorithm in Fig. 3 terminates in exactly R n_0 queries and outputs h∗.

Proof: Consider a specific query x ∈ X repeated R times, let p̂ denote the frequency of +1 in the R trials, and let p = E[p̂]. The majority vote decision is correct if |p̂ − p| ≤ 1/2 − α. By Chernoff's bound we have P(|p̂ − p| ≥ 1/2 − α) ≤ 2e^{−2R|1/2−α|²}. The result follows by the union bound. ∎

Based on the bound above, R must satisfy R ≥ log(n_0/δ)/|1/2 − α|² to guarantee that the labels determined for all n_0 queries are correct with probability at least 1 − δ. The query complexity of NGBS can thus be bounded by n_0 log(n_0/δ)/|1/2 − α|². Recall that N = |H|, the cardinality of H.
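The repetition step of NGBS is easy to make concrete. The following is a hedged sketch (names are illustrative): `repetitions_needed` evaluates the bound R ≥ log(n_0/δ)/|1/2 − α|² discussed after Theorem 2, and `majority_vote` labels one query by majority over R noisy {−1, +1} responses.

```python
import math

def repetitions_needed(n0, delta, alpha):
    """Smallest integer R satisfying R >= log(n0/delta) / |1/2 - alpha|^2."""
    return math.ceil(math.log(n0 / delta) / (0.5 - alpha) ** 2)

def majority_vote(noisy_response, x, R):
    """Query x R times and return the majority label (ties broken as +1)."""
    votes = sum(noisy_response(x) for _ in range(R))  # responses in {-1, +1}
    return 1 if votes >= 0 else -1
```

For instance, with n_0 = 10 rounds, confidence δ = 0.1, and noise level α = 0.25, the bound calls for 74 repetitions per query, illustrating how quickly the repetition overhead grows.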
If n_0 = log N, then the bound on the query complexity of NGBS is proportional to log N log((log N)/δ), a logarithmic factor worse than the query complexity in the noiseless setting. Moreover, if an upper bound on n_0 is not known in advance, then one must assume the worst-case value, n_0 = N, in order to set R. That is, in order to guarantee that the correct hypothesis is determined with probability at least 1 − δ, the required number of repetitions of each query is R = ⌈log(N/δ)/|1/2 − α|²⌉. In this situation, the bound on the query complexity of NGBS is proportional to log N log(N/δ), compared to log N in the noiseless setting. It is conjectured that the extra logarithmic factor cannot be removed from the query complexity (i.e., it is unavoidable using repetitive queries). As we show next, these problems can be eliminated by a more sophisticated approach to noisy GBS.

B. Soft-Decision Procedure

A more effective approach to noisy GBS is based on the following soft-decision procedure. A similar procedure has been shown to be near-optimal for the noisy (classic) binary search problem by Burnashev and Zigangirov [3] and later independently by Karp and Kleinberg [30]. The crucial distinction here is that GBS calls for a more general approach to query selection and a fundamentally different convergence analysis. Let p_0 be a known probability measure over H. That is, p_0 : H → [0, 1] and Σ_{h∈H} p_0(h) = 1. The measure p_0 can be viewed as an initial weighting over the hypothesis class. For example, taking p_0 to be the uniform distribution over H expresses the fact that all hypotheses are equally reasonable prior to making queries. We will assume that p_0 is uniform for the remainder of the paper, but the extension to other initial distributions is trivial. Note, however, that we still assume that h∗ ∈ H is fixed but unknown. After each query and response (x_n, y_n), n = 0, 1, . . .
, the distribution is updated according to

  p_{n+1}(h) ∝ p_n(h) β^{(1−z_n(h))/2} (1 − β)^{(1+z_n(h))/2},   (2)

where z_n(h) = h(x_n) y_n, h ∈ H, β is any constant satisfying 0 < β < 1/2, and p_{n+1}(h) is normalized to satisfy Σ_{h∈H} p_{n+1}(h) = 1. The update can be viewed as an application of Bayes' rule and its effect is simple: the probability masses of hypotheses that agree with the label y_n are boosted relative to those that disagree. The parameter β controls the size of the boost. The hypothesis with the largest weight is selected at each step: ĥ_n := arg max_{h∈H} p_n(h). If the maximizer is not unique, one of the maximizers is selected at random. Note that, unlike the hard decisions made by the GBS algorithm in Fig. 1, this procedure does not eliminate hypotheses that disagree with the observed labels; rather, the weight assigned to each hypothesis is an indication of how successful its predictions have been. Thus, the procedure is termed Soft-Decision GBS (SGBS) and is summarized in Fig. 4. The goal of SGBS is to drive the error P(ĥ_n ≠ h∗) to zero as quickly as possible by strategically selecting the queries. The query selection at each step of SGBS must be informative with respect to the distribution p_n. In particular, if the weighted prediction Σ_{h∈H} p_n(h) h(x) is close to zero for a certain x (or A), then a label at that point is informative due to the large disagreement among the hypotheses. If multiple A ∈ A minimize |Σ_{h∈H} p_n(h) h(A)|, then one of the minimizers is selected uniformly at random.

Soft-Decision Generalized Binary Search (SGBS)
initialize: p_0 uniform over H.
for n = 0, 1, 2, . . .
  1) A_n = arg min_{A∈A} |Σ_{h∈H} p_n(h) h(A)|.
  2) Obtain noisy response y_n.
  3) Bayes update p_n → p_{n+1}; Eqn. (2).
hypothesis selected at each step: ĥ_n := arg max_{h∈H} p_n(h)

Fig. 4. Soft-Decision algorithm for noise-tolerant GBS.
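The update (2) is simple to implement: since z_n(h) ∈ {−1, +1}, the factor β^{(1−z)/2}(1 − β)^{(1+z)/2} is just β on disagreement and 1 − β on agreement. A hedged sketch (names are illustrative):

```python
# Sketch of the SGBS posterior update, Eqn. (2): multiply each weight by
# (1 - beta) if the hypothesis agrees with the response (z = h(x)y = +1)
# and by beta if it disagrees (z = -1), then renormalize.
def sgbs_update(p, x, y, beta):
    """p: dict mapping hypothesis -> weight; returns the distribution p_{n+1}."""
    new_p = {h: w * (beta if h(x) * y < 0 else 1 - beta) for h, w in p.items()}
    total = sum(new_p.values())
    return {h: w / total for h, w in new_p.items()}
```

With β = 0.4 and two equally weighted hypotheses, one agreeing and one disagreeing with the observed label, the agreeing hypothesis moves to weight 0.6 and the disagreeing one to 0.4, illustrating the gentle boost controlled by β.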
To analyze SGBS, define C_n := (1 − p_n(h∗))/p_n(h∗), n ≥ 0. The variable C_n was also used by Burnashev and Zigangirov [3] to analyze classic binary search. It reflects the amount of mass that p_n places on incorrect hypotheses. Let P denote the underlying probability measure governing the noise and possible randomization in query selection, and let E denote expectation with respect to P. Note that by Markov's inequality,

  P(ĥ_n ≠ h∗) ≤ P(p_n(h∗) < 1/2) = P(C_n > 1) ≤ E[C_n].   (3)

At this point, the method of analyzing SGBS departs from that of Burnashev and Zigangirov [3], which focused only on the classic binary search problem. The lack of an ordered structure calls for a different attack on the problem, which is summarized in the following results and detailed in the Appendix.

Lemma 2: Consider any sequence of queries {x_n}_{n≥0} and the corresponding responses {y_n}_{n≥0}. If β ≥ α, then {C_n}_{n≥0} is a nonnegative supermartingale with respect to {p_n}_{n≥0}; i.e., E[C_{n+1} | p_n] ≤ C_n for all n ≥ 0.

The lemma is proved in the Appendix. The condition β ≥ α ensures that the update (2) is not overly aggressive. It follows that E[C_n] ≤ C_0, and by the Martingale Convergence Theorem we have that lim_{n→∞} C_n exists and is finite (for more information on martingale theory one can refer to the textbook by Brémaud [37]). Furthermore, we have the following theorem.

Theorem 3: Consider any sequence of queries {x_n}_{n≥0} and the corresponding responses {y_n}_{n≥0}. If β > α, then lim_{n→∞} P(ĥ_n ≠ h∗) ≤ C_0.

Proof: First observe that for every positive integer n,

  E[C_n] = E[(C_n/C_{n−1}) C_{n−1}]
         = E[E[(C_n/C_{n−1}) C_{n−1} | p_{n−1}]]
         = E[C_{n−1} E[(C_n/C_{n−1}) | p_{n−1}]]
         ≤ E[C_{n−1}] max_{p_{n−1}} E[(C_n/C_{n−1}) | p_{n−1}]
         ≤ C_0 (max_{i=0,...,n−1} max_{p_i} E[(C_{i+1}/C_i) | p_i])ⁿ.
In the proof of Lemma 2, it is shown that if β > α, then E[(C_{i+1}/C_i) | p_i] ≤ 1 for every p_i, and therefore max_{p_i} E[(C_{i+1}/C_i) | p_i] ≤ 1. It follows that the sequence a_n := (max_{i=0,...,n−1} max_{p_i} E[(C_{i+1}/C_i) | p_i])ⁿ is monotonically decreasing. The result follows from (3). ∎

Note that if we can determine an upper bound for the sequence, max_{p_i} E[(C_{i+1}/C_i) | p_i] ≤ 1 − λ < 1, i = 0, . . . , n − 1, then it follows that P(ĥ_n ≠ h∗) ≤ N(1 − λ)ⁿ ≤ N e^{−λn}. Unfortunately, the SGBS algorithm in Fig. 4 does not readily admit such a bound. To obtain a bound, the query selection criterion is randomized. A similar randomization technique has been successfully used by Burnashev and Zigangirov [3] and Karp and Kleinberg [30] to analyze the noisy (classic) binary search problem, but again the lack of an ordering in the general setting requires a different analysis of GBS. The modified SGBS algorithm is outlined in Fig. 5. It is easily verified that Lemma 2 and Theorem 3 also hold for the modified SGBS algorithm. This follows since the modified query selection step is identical to that of the original SGBS algorithm, unless there exist two neighboring sets with strongly bipolar weighted responses. In the latter case, a query is randomly selected from one of these two sets with equal probability. For every A ∈ A and any probability measure p on H, the weighted prediction on A is defined to be W(p, A) := Σ_{h∈H} p(h) h(A), where h(A) is the constant value of h for every x ∈ A. The following lemma, which is the soft-decision analog of Lemma 1, plays a crucial role in the analysis of the modified SGBS algorithm.

Modified SGBS
initialize: p_0 uniform over H.
for n = 0, 1, 2, . . .
  1) Let b = min_{A∈A} |Σ_{h∈H} p_n(h) h(A)|.
     If there exist 1-neighbor sets A and A′ with Σ_{h∈H} p_n(h) h(A) > b and Σ_{h∈H} p_n(h) h(A′) < −b, then select x_n from A or A′ with probability 1/2 each. Otherwise select x_n from the set A_min = arg min_{A∈A} |Σ_{h∈H} p_n(h) h(A)|. In the case that the sets above are non-unique, choose at random any one satisfying the requirements.
  2) Obtain noisy response y_n.
  3) Bayes update p_n → p_{n+1}; Eqn. (2).
hypothesis selected at each step: ĥ_n := arg max_{h∈H} p_n(h)

Fig. 5. Modified SGBS Algorithm.

Lemma 3: If (X, H) is k-neighborly, then for every probability measure p on H there either exists a set A ∈ A such that |W(p, A)| ≤ c∗, or a pair of k-neighbor sets A, A′ ∈ A such that W(p, A) > c∗ and W(p, A′) < −c∗.

Proof: Suppose that min_{A∈A} |W(p, A)| > c∗. Then there must exist A, A′ ∈ A such that W(p, A) > c∗ and W(p, A′) < −c∗; otherwise c∗ could not be the coherence parameter of H, defined in (1). To see this, suppose, for instance, that W(p, A) > c∗ for all A ∈ A. Then for every distribution P on X we have Σ_{A∈A} Σ_{h∈H} p(h) h(A) P(A) > c∗. This contradicts the definition of c∗, since Σ_{A∈A} Σ_{h∈H} p(h) h(A) P(A) ≤ Σ_{h∈H} p(h) |Σ_{A∈A} h(A) P(A)| ≤ max_{h∈H} |Σ_{A∈A} h(A) P(A)|. The neighborly condition guarantees that there exists a sequence of k-neighbor sets beginning at A and ending at A′. Since |W(p, A)| > c∗ on every set and the sign of W(p, ·) must change at some point in the sequence, it follows that there exist k-neighbor sets satisfying the claim. ∎

The lemma guarantees that there exists either a set in A on which the weighted hypotheses significantly disagree (provided c∗ is significantly below 1), or two neighboring sets in A on which the weighted predictions are strongly bipolar.
In either case, if a query is drawn randomly from these sets, then the weighted predictions are highly variable or uncertain with respect to p. This makes the resulting label informative in either case. If (X, H) is 1-neighborly, then the modified SGBS algorithm guarantees that P(ĥ_n ≠ h∗) → 0 exponentially fast. The 1-neighborly condition is required so that the expected boost to p_n(h∗) is significant at each step. If this condition does not hold, then the boost could be arbitrarily small due to the effects of other hypotheses. Fortunately, as shown in Section V, the 1-neighborly condition holds in a wide range of common situations.

Theorem 4: Let P denote the underlying probability measure (governing noise and algorithm randomization). If β > α and (X, H) is 1-neighborly, then the modified SGBS algorithm in Fig. 5 generates a sequence of hypotheses satisfying

  P(ĥ_n ≠ h∗) ≤ N(1 − λ)ⁿ ≤ N e^{−λn},  n = 0, 1, . . . ,

with exponential constant

  λ = min{(1 − c∗)/2, 1/4} (1 − β(1 − α)/(1 − β) − α(1 − β)/β),

where c∗ is defined in (1).

The theorem is proved in the Appendix. The exponential convergence rate² is governed by the coherence parameter 0 ≤ c∗ < 1. As shown in Section V, the value of c∗ is typically a small constant much less than 1 that is independent of the size of H. In such situations, the query complexity of modified SGBS is near-optimal. The query complexity of the modified SGBS algorithm can be derived as follows. Let δ > 0 be a pre-specified confidence parameter. The number of queries required to ensure that P(ĥ_n ≠ h∗) ≤ δ is n ≥ λ⁻¹ log(N/δ), which is near-optimal. Intuitively, about log N bits are required to encode each hypothesis. More formally, the noisy classic binary search problem satisfies the assumptions of Theorem 4 (as shown in Section V-A), and hence it is a special case of the general problem.
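Theorem 4's rate is easy to evaluate numerically. The following hedged sketch (names are illustrative) computes λ = min{(1 − c∗)/2, 1/4}(1 − β(1 − α)/(1 − β) − α(1 − β)/β) and the resulting query count n ≥ λ⁻¹ log(N/δ).

```python
import math

def rate(cstar, alpha, beta):
    """Exponential constant lambda from Theorem 4 (requires alpha < beta < 1/2)."""
    assert 0 <= alpha < beta < 0.5
    factor = 1 - beta * (1 - alpha) / (1 - beta) - alpha * (1 - beta) / beta
    return min((1 - cstar) / 2, 0.25) * factor

def queries_needed(N, delta, cstar, alpha, beta):
    """Smallest n with N * exp(-lambda * n) <= delta, i.e. n >= log(N/delta)/lambda."""
    return math.ceil(math.log(N / delta) / rate(cstar, alpha, beta))
```

For example, with c∗ = 0, noise level α = 0.1, and β = 0.25, the factor is 1 − 0.3 − 0.3 = 0.4 and λ = 0.1, so driving the error below δ = 0.01 for N = 1024 hypotheses needs on the order of a hundred queries.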
Using information-theoretic methods, it has been shown by Burnashev and Zigangirov [3] (see also the work of Karp and Kleinberg [30]) that the query complexity for noisy classic binary search is also within a constant factor of log(N/δ). In contrast, the query complexity bound for NGBS, based on repeating queries, is at least a logarithmic factor worse. We conclude this section with an example applying Theorem 4 to the halfspace learning problem.

Example 2: Consider learning multidimensional halfspaces. Let X = R^d and consider hypotheses of the form

  h_i(x) := sign(⟨a_i, x⟩ + b_i),   (4)

where a_i ∈ R^d and b_i ∈ R parameterize the hypothesis h_i and ⟨a_i, x⟩ is the inner product in R^d. The following corollary characterizes the query complexity for this problem.

Corollary 1: Let H be a finite collection of hypotheses of the form (4) and assume that the responses to each query are noisy, with noise bound α < 1/2. Then the hypotheses selected by modified SGBS with β > α satisfy P(ĥ_n ≠ h∗) ≤ N e^{−λn}, with λ = (1/4)(1 − β(1 − α)/(1 − β) − α(1 − β)/β). Moreover, ĥ_n can be computed in time polynomial in |H|.

The error bound follows immediately from Theorem 4, since c∗ = 0 and (R^d, H) is 1-neighborly, as shown in Section V-B. The polynomial-time computational complexity follows from the work of Buck [42], as discussed in Section V-B. Suppose that H is an ε-dense set with respect to a uniform probability measure on a ball in R^d (i.e., for any hyperplane of the form (4), H contains a hypothesis whose probability of error is within ε of it). The size of such an H satisfies log N ≤ C d log ε⁻¹, for a constant C > 0, which is proportional to the minimum query complexity possible in this setting, as shown by Balcan et al. [17]. Those authors also present an algorithm with roughly the same query complexity for this problem. However, their algorithm is specifically designed for the linear threshold problem.
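A hypothesis of the form (4) is straightforward to construct in code. The following helper is a hedged illustration (not code from the paper), taking sign(0) as +1; a small collection of such hypotheses can serve as the class H fed to the algorithms of Figs. 1–5.

```python
# Illustrative halfspace hypothesis of the form (4): h(x) = sign(<a, x> + b).
def halfspace(a, b):
    def h(x):
        s = sum(ai * xi for ai, xi in zip(a, x)) + b
        return 1 if s >= 0 else -1  # sign(0) taken as +1
    return h
```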
Remarkably, near-optimal query complexity is achieved in polynomial time by the general-purpose modified SGBS algorithm.

$^2$Note that the factor $\left(1 - \frac{\beta(1-\alpha)}{1-\beta} - \frac{\alpha(1-\beta)}{\beta}\right)$ in the exponential rate parameter $\lambda$ is a positive constant strictly less than $1$. For a noise level $\alpha$, this factor is maximized by a value $\beta \in (\alpha, 1/2)$ which tends to $(1/2 + \alpha)/2$ as $\alpha$ tends to $1/2$.

IV. AGNOSTIC GBS

So far we have assumed that the correct hypothesis $h^*$ is in $\mathcal{H}$. In this section we drop this assumption and consider agnostic algorithms guaranteed to find the best hypothesis in $\mathcal{H}$ even if the correct hypothesis $h^*$ is not in $\mathcal{H}$ and/or the assumptions of Theorem 1 or 4 do not hold. The best hypothesis in $\mathcal{H}$ can be defined as the one that minimizes the error with respect to a probability measure on $\mathcal{X}$, denoted by $P_X$, which can be arbitrary. This notion of "best" commonly arises in machine learning problems, where it is customary to measure the error or risk with respect to a distribution on $\mathcal{X}$. A common approach to hypothesis selection is empirical risk minimization (ERM), which uses queries randomly drawn according to $P_X$ and then selects the hypothesis in $\mathcal{H}$ that minimizes the number of errors made on these queries.

Given a budget of $n$ queries, consider the following agnostic procedure. Divide the query budget into three equal portions. Use GBS (or NGBS or modified SGBS) with one portion, ERM (queries randomly distributed according to $P_X$) with another, and then allocate the third portion to queries from the subset of $\mathcal{X}$ where the hypothesis selected by GBS (or NGBS or modified SGBS) and the hypothesis selected by ERM disagree, with these queries randomly distributed according to the restriction of $P_X$ to this subset. Finally, select the hypothesis that makes the fewest mistakes on the third portion as the final choice. The sample complexity of this agnostic procedure is within a constant factor of that of the better of the two competing algorithms.
For example, if the conditions of Theorem 1 or 4 hold, then the sample complexity of the agnostic algorithm is proportional to $\log N$. In general, the sample complexity of the agnostic procedure is within a constant factor of that of ERM alone. We formalize this as follows.

Lemma 4: Let $P_X$ denote a probability measure on $\mathcal{X}$ and for every $h \in \mathcal{H}$ let $R(h)$ denote its probability of error with respect to $P_X$. Consider two hypotheses $h_1, h_2 \in \mathcal{H}$ and let $\Delta \subset \mathcal{X}$ denote the subset of queries for which $h_1$ and $h_2$ disagree; i.e., $h_1(x) \neq h_2(x)$ for all $x \in \Delta$. Suppose that $m$ queries are drawn independently from $P_\Delta$, the restriction of $P_X$ to the set $\Delta$; let $\hat{R}_\Delta(h_1)$ and $\hat{R}_\Delta(h_2)$ denote the average number of errors made by $h_1$ and $h_2$ on these queries; let $R_\Delta(h_1) = \mathbb{E}[\hat{R}_\Delta(h_1)]$ and $R_\Delta(h_2) = \mathbb{E}[\hat{R}_\Delta(h_2)]$; and select $\hat{h} = \arg\min\{\hat{R}_\Delta(h_1), \hat{R}_\Delta(h_2)\}$. Then $R(\hat{h}) > \min\{R(h_1), R(h_2)\}$ with probability less than $e^{-m|R_\Delta(h_1) - R_\Delta(h_2)|^2/2}$.

Proof: Define $\delta := R_\Delta(h_1) - R_\Delta(h_2)$ and let $\hat{\delta} := \hat{R}_\Delta(h_1) - \hat{R}_\Delta(h_2)$. By Hoeffding's inequality, $\hat{\delta} \in [\delta - \epsilon, \delta + \epsilon]$ with probability at least $1 - 2e^{-m\epsilon^2/2}$. It follows that $\mathbb{P}(\mathrm{sign}(\hat{\delta}) \neq \mathrm{sign}(\delta)) \le 2e^{-m|\delta|^2/2}$. For example, if $\delta > 0$, then since $\mathbb{P}(\hat{\delta} < \delta - \epsilon) \le 2e^{-m\epsilon^2/2}$ we may take $\epsilon = \delta$ to obtain $\mathbb{P}(\hat{\delta} < 0) \le 2e^{-m\delta^2/2}$. The result follows since $h_1$ and $h_2$ agree on the complement of $\Delta$.

Note that there is a distinct advantage to drawing queries from $P_\Delta$ rather than $P_X$, since the error exponent is proportional to $|R_\Delta(h_1) - R_\Delta(h_2)|^2$, which is greater than or equal to $|R(h_1) - R(h_2)|^2$. Now, to illustrate the idea, consider an agnostic procedure based on modified SGBS and ERM. The following theorem is proved in the Appendix.

Theorem 5: Let $P_X$ denote a measure on $\mathcal{X}$ and suppose we have a query budget of $n$.
Let $h_1$ denote the hypothesis selected by modified SGBS using $n/3$ of the queries and let $h_2$ denote the hypothesis selected by ERM from $n/3$ queries drawn independently from $P_X$. Draw the remaining $n/3$ queries independently from $P_\Delta$, the restriction of $P_X$ to the set $\Delta$ on which $h_1$ and $h_2$ disagree, and let $\hat{R}_\Delta(h_1)$ and $\hat{R}_\Delta(h_2)$ denote the average number of errors made by $h_1$ and $h_2$ on these queries. Select $\hat{h} = \arg\min\{\hat{R}_\Delta(h_1), \hat{R}_\Delta(h_2)\}$. Then, in general,
$$\mathbb{E}[R(\hat{h})] \;\le\; \min\{\mathbb{E}[R(h_1)], \mathbb{E}[R(h_2)]\} + \sqrt{3/n},$$
where $R(h)$, $h \in \mathcal{H}$, denotes the probability of error of $h$ with respect to $P_X$. Furthermore, if the assumptions of Theorem 4 hold and $h^* \in \mathcal{H}$, then
$$\mathbb{P}(\hat{h} \neq h^*) \;\le\; N e^{-\lambda n/3} + 2e^{-n|1-2\alpha|^2/6},$$
where $\alpha$ is the noise bound.

Note that if the assumptions of Theorem 4 hold, then the agnostic procedure performs almost as well as modified SGBS alone. In particular, the number of queries required to ensure that $\mathbb{P}(\hat{h} \neq h^*) \le \delta$ is proportional to $\log\frac{N}{\delta}$; optimal up to constant factors. Also observe that the probability bound implies the following bound in expectation:
$$\mathbb{E}[R(\hat{h})] \;\le\; R(h^*) + N e^{-\lambda n/3} + 2e^{-n|1-2\alpha|^2/6} \;\le\; R(h^*) + N e^{-Cn},$$
where $C > 0$ is a constant depending on $\lambda$ and $\alpha$. The exponential convergence of the expected risk is much faster than the usual parametric rate for ERM. If the conditions of Theorem 4 are not met, then modified SGBS (alone) may perform poorly, since it might select inappropriate queries and could even terminate with an incorrect hypothesis. However, the expected error of the agnostic selection $\hat{h}$ is within $\sqrt{3/n}$ of the expected error of ERM, with no assumptions on the underlying distributions. Since the expected error of ERM is proportional to $n^{-1/2}$ in the worst-case situation, the agnostic procedure is near-optimal in general; it offers this safeguard on performance.
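The budget-splitting selection rule can be made concrete with a toy sketch. The instance below is illustrative (a 1-D threshold problem with noiseless labels and two hypothetical candidate hypotheses standing in for the SGBS and ERM selections), not the paper's procedure verbatim: the third portion of queries is drawn from the restriction of $P_X$ (uniform on $[0,1]$) to the disagreement set $\Delta$, and the candidate with fewer empirical errors there is returned.

```python
import random

random.seed(0)

# Toy 1-D threshold hypotheses: h_t(x) = +1 if x >= t, else -1. True threshold
# t* = 0.5; h1 and h2 are hypothetical candidates from the two learners.
def h(t, x):
    return 1 if x >= t else -1

t_star = 0.5
h1, h2 = 0.45, 0.70
m = 300                       # the third portion of the query budget

# Disagreement set Delta = [min(h1,h2), max(h1,h2)); draw the m queries from
# the restriction of the uniform measure on [0,1] to Delta.
lo, hi = min(h1, h2), max(h1, h2)
queries = [random.uniform(lo, hi) for _ in range(m)]
err1 = sum(h(h1, x) != h(t_star, x) for x in queries) / m
err2 = sum(h(h2, x) != h(t_star, x) for x in queries) / m

# Select the candidate with fewer empirical errors on the disagreement set.
h_hat = h1 if err1 <= err2 else h2
print(h_hat, err1, err2)
```

Here $h_1$ disagrees with $h^*$ on a fraction $0.2$ of $\Delta$ and $h_2$ on a fraction $0.8$, so the comparison on $\Delta$-restricted queries separates them far more sharply than unrestricted queries would, matching the error-exponent remark in Lemma 4.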
The same approach could be used to derive agnostic procedures from any active learning scheme (i.e., learning from adaptively selected queries), including GBS or NGBS. We also note that the important element in the agnostic procedure is the selection of queries; the proposed selection of $\hat{h}$ based on those queries is convenient for proving the bounds, but not necessarily optimal.

V. APPLICATIONS OF GBS

In this section we examine a variety of common situations in which the neighborliness condition can be verified. We confine the discussion to GBS in the noise-free situation; analogous results hold in the presence of noise. For a given pair $(\mathcal{X}, \mathcal{H})$, the effectiveness of GBS hinges on determining (or bounding) $c^*$ and establishing that $(\mathcal{X}, \mathcal{H})$ is neighborly. Recall the definition of the bound $c^*$ from (1) and that $N = |\mathcal{H}|$, the cardinality of $\mathcal{H}$. A trivial bound for $c^*$ is
$$\max_{h \in \mathcal{H}} \left| \sum_{A \in \mathcal{A}} h(A) P(A) \right| \;\le\; 1 - N^{-1},$$
which is achieved by allocating $N^{-1}$ mass to each hypothesis and distributing it evenly: $\frac{1}{2N}$ mass on set(s) $A$ where $h(A) = +1$ and $\frac{1}{2N}$ on set(s) $A$ where $h(A) = -1$. Non-trivial coherence bounds are those for which there exist a $P$ and a constant $0 \le c < 1$ that does not depend on $N$ such that
$$\max_{h \in \mathcal{H}} \left| \sum_{A \in \mathcal{A}} h(A) P(A) \right| \;\le\; c.$$
The coherence parameter $c^*$ is analytically determined or bounded in several illustrative applications below. We also note that it may be known a priori that $c^*$ is bounded far away from $1$. Suppose that for a certain $P$ on $\mathcal{X}$ (or $\mathcal{A}$) the absolute value of the first moment of the correct hypothesis (w.r.t. $P$) is known to be upper bounded by a constant $c < 1$. Then all hypotheses that violate the bound can be eliminated from consideration. Thus the constant $c$ is an upper bound on $c^*$.
Situations like this can arise, for example, in binary classification problems with side/prior knowledge that the marginal probabilities of the two classes are somewhat balanced. Then the moment of the correct hypothesis, with respect to the marginal probability distribution on $\mathcal{X}$, is bounded far away from $1$ and $-1$.

The neighborly condition can be numerically verified in a straightforward fashion. Enumerate the equivalence sets in $\mathcal{A}$ as $A_1, \dots, A_M$. To check whether the $k$-neighborhood graph is connected, form an $M \times M$ matrix $R_k$ whose $i,j$ entry is $1$ if $A_i$ and $A_j$ are $k$-neighbors and $0$ otherwise. Normalize the rows of $R_k$ so that each sums to $1$ and denote the resulting stochastic matrix by $Q_k$. The $k$-neighborhood graph is connected if and only if there exists an integer $\ell$, $0 < \ell \le M$, such that $Q_k^\ell$ contains no zero entries. This follows from the standard condition for state accessibility in Markov chains (for background on Markov chains one can refer to the textbook by Brémaud [37]). The smallest $k$ for which the $k$-neighborhood graph is connected can be determined using a binary search over the set $\{1, \dots, M\}$, checking the condition above for each value of $k$ in the search. This idea was suggested to me by Clayton Scott. Thus, the neighborly condition can be verified in time polynomial in $M$. Alternatively, in many cases the neighborly condition can be verified analytically, as demonstrated in the following applications.

A. One Dimensional Problems

First we show that GBS reduces to classic binary search. Let $\mathcal{H} = \{h_1, \dots, h_N\}$ be the collection of binary-valued functions on $\mathcal{X} = [0, 1]$ of the following form: $h_i(x) := \mathrm{sign}\left(x - \frac{i}{N+1}\right)$ for $i = 1, \dots, N$ (and $\mathrm{sign}(0) := +1$). Assume that $h^* \in \mathcal{H}$. First consider the neighborly condition. Recall that $\mathcal{A}$ is the smallest partition of $\mathcal{X}$ into equivalence sets induced by $\mathcal{H}$. In this case, each $A$ is an interval of the form $A_i = \left[\frac{i-1}{N+1}, \frac{i}{N+1}\right)$, $i = 1, \dots, N$.
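The numerical verification described above can be applied directly to the classic binary search partition just defined. The sketch below (plain Python, small $N$ for illustration; the diagonal of $R_k$ is set to $1$, an assumption made here so that the induced chain has self-loops and is aperiodic) builds $R_1$ over the cells $A_1, \dots, A_N$, normalizes it to the stochastic matrix $Q_1$, and checks connectivity by examining powers $Q_1^\ell$ for $0 < \ell \le M$:

```python
# Classic binary search: h_i(x) = sign(x - i/(N+1)) on [0,1], sign(0) := +1,
# with cells A_i = [(i-1)/(N+1), i/(N+1)) as above.
N = 6
reps = [(i - 0.5) / (N + 1) for i in range(1, N + 1)]  # one query point per cell

def h(i, x):
    return 1 if x - i / (N + 1) >= 0 else -1

# R_k[i][j] = 1 if at most k hypotheses respond differently on cells A_i, A_j
# (diagonal entries are 1 by this convention -- an assumption in this sketch).
k, M = 1, N
R = [[1.0 if sum(h(m, reps[i]) != h(m, reps[j]) for m in range(1, N + 1)) <= k
      else 0.0 for j in range(M)] for i in range(M)]
Q = [[v / sum(row) for v in row] for row in R]         # row-stochastic Q_k

def matmul(A, B):
    return [[sum(A[i][m] * B[m][j] for m in range(M)) for j in range(M)]
            for i in range(M)]

# Connected iff some power Q_k^l (0 < l <= M) has no zero entries.
P = [[1.0 if i == j else 0.0 for j in range(M)] for i in range(M)]
connected = False
for _ in range(M):
    P = matmul(P, Q)
    if all(v > 0 for row in P for v in row):
        connected = True
        break
print(connected)
```

For this path-shaped neighborhood graph the check succeeds at $\ell = M - 1$, confirming that the $1$-neighborhood graph is connected.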
Observe that only a single hypothesis, $h_i$, has different responses to queries from $A_i$ and $A_{i+1}$, and so they are $1$-neighbors, for $i = 1, \dots, N - 1$. Moreover, the $1$-neighborhood graph is connected in this case, and so $(\mathcal{X}, \mathcal{H})$ is $1$-neighborly. Next consider the coherence parameter $c^*$. Take $P$ to be two point masses at $x = 0$ and $x = 1$ of probability $1/2$ each. Then $\left|\sum_{A \in \mathcal{A}} h(A) P(A)\right| = 0$ for every $h \in \mathcal{H}$, since $h(0) = -1$ and $h(1) = +1$. Thus, $c^* = 0$. Since $c^* = 0$ and $k = 1$, we have a reduction factor $\alpha = 2/3$, and the query complexity of GBS is proportional to $\log N$ according to Theorem 1. The reduction factor of $2/3$, instead of $1/2$, arises because we allow the situation in which the number of hypotheses may be odd (e.g., given three hypotheses, the best query may eliminate just one). If $N$ is even, then the query complexity is $\log_2(N)$, which is information-theoretically optimal.

Now let $\mathcal{X} = [0, 1]$ and consider a finite collection of hypotheses $\mathcal{H} = \{h_i(x)\}_{i=1}^N$, where $h_i$ takes the value $+1$ when $x \in [a_i, b_i)$, for a pair $0 \le a_i < b_i \le 1$, and $-1$ otherwise. Assume that $h^* \in \mathcal{H}$. The partition $\mathcal{A}$ again consists of intervals, and the neighborly condition is satisfied with $k = 1$. To bound $c^*$, note that the minimizing $P$ must place some mass within and outside each interval $[a_i, b_i)$. If the intervals all have length at least $\ell > 0$, then taking $P$ to be the uniform measure on $[0, 1]$ yields $c^* \le 1 - 2\ell$, regardless of the number of interval hypotheses under consideration. Therefore, in this setting Theorem 1 guarantees that GBS determines the correct hypothesis using at most a constant times $\log N$ steps. However, consider the special case in which the intervals are disjoint. Then it is not hard to see that the best allocation of mass is to place $1/N$ mass in each subinterval, resulting in $c^* = 1 - 2N^{-1}$.
And so Theorem 1 only guarantees that GBS will terminate in at most $N$ steps (the number of steps required by exhaustive linear search). In fact, it is easy to see that no procedure can do better than linear search in this case, and the query complexity of any method is proportional to $N$. However, note that if queries of a different form were allowed, then much better performance would be possible. For example, if queries in the form of dyadic subinterval tests were allowed (e.g., tests that indicate whether or not the correct hypothesis is $+1$-valued anywhere within a dyadic subinterval of choice), then the correct hypothesis could be identified through $\lceil \log_2 N \rceil$ queries (essentially a binary encoding of the correct hypothesis). This underscores the importance of the geometrical relationship between $\mathcal{X}$ and $\mathcal{H}$ embodied in the neighborly condition and the incoherence parameter $c^*$. Optimizing the query space to the structure of $\mathcal{H}$ is related to the notion of arbitrary queries examined in the work of Kulkarni et al. [25], and somewhat to the theory of compressed sensing developed by Candes et al. [39] and Donoho [40].

B. Multidimensional Problems

Let $\mathcal{H} = \{h_i\}_{i=1}^N$ be a collection of multidimensional threshold functions of the following form. The threshold of each $h_i$ is determined by a (possibly nonlinear) decision surface in $d$-dimensional Euclidean space, and the queries are points in $\mathcal{X} := \mathbb{R}^d$. It suffices to consider linear decision surfaces of the form
$$h_i(x) := \mathrm{sign}(\langle a_i, x \rangle + b_i), \qquad (5)$$
where $a_i \in \mathbb{R}^d$, $\|a_i\|_2 = 1$, the offset $b_i \in \mathbb{R}$ satisfies $|b_i| \le b$ for some constant $b < \infty$, and $\langle a_i, x \rangle$ denotes the inner product in $\mathbb{R}^d$. Each hypothesis is associated with a halfspace of $\mathbb{R}^d$. Note that hypotheses of this form can be used to represent nonlinear decision surfaces by first applying a mapping to an input space and then forming linear decision surfaces in the induced query space.
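A minimal sketch of hypotheses of form (5), including the nonlinear-via-feature-map trick, follows. The numbers are illustrative, and the unit-norm constraint on $a_i$ is dropped here since the sign is unchanged by positive rescaling of $(a_i, b_i)$:

```python
def halfspace(a, b, x):
    """Hypothesis of form (5): sign(<a, x> + b), with sign(0) := +1."""
    v = sum(ai * xi for ai, xi in zip(a, x)) + b
    return 1 if v >= 0 else -1

# A linear decision surface in R^2: predicts +1 iff x_1 >= 0.5.
print(halfspace([1.0, 0.0], -0.5, [0.9, 0.3]))   # +1
print(halfspace([1.0, 0.0], -0.5, [0.1, 0.3]))   # -1

# A nonlinear (circular) boundary x_1^2 + x_2^2 = 1 becomes linear in the
# induced query space after the feature map phi(x) = (x_1^2, x_2^2).
def circle_hypothesis(x):
    phi = (x[0] ** 2, x[1] ** 2)
    return halfspace([-1.0, -1.0], 1.0, phi)     # +1 inside the unit circle

print(circle_hypothesis((0.3, 0.3)))   # +1
print(circle_hypothesis((2.0, 0.0)))   # -1
```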
The problem of learning multidimensional threshold functions arises commonly in computer vision (see the review of Swain and Stricker [12] and applications by Geman and Jedynak [13] and Arkin et al. [14]), image processing studied by Korostelev and Kim [10], [11], and active learning research; for example, the investigations by Freund et al. [15], Dasgupta [16], Balcan et al. [17], and Castro and Nowak [31].

First we show that the pair $(\mathbb{R}^d, \mathcal{H})$ is $1$-neighborly. Each $A \in \mathcal{A}$ is a polytope in $\mathbb{R}^d$. These polytopes are generated by intersections of the halfspaces corresponding to the hypotheses. Any two polytopes that share a common face are $1$-neighbors (the hypothesis whose decision boundary defines the face, and its complement if it exists, are the only ones that predict different values on these two sets). Since the polytopes tessellate $\mathbb{R}^d$, the $1$-neighborhood graph of $\mathcal{A}$ is connected.

We next show that the coherence parameter $c^* = 0$. Since the offsets of the hypotheses are all at most $b$ in magnitude, it follows that the distance from the origin to the nearest point of the decision surface of every hypothesis is at most $b$. Let $P_r$ denote the uniform probability distribution on a ball of radius $r$ centered at the origin in $\mathbb{R}^d$. Then for every $h$ of the form (5) there exists a constant $C > 0$ (depending on $b$) such that
$$\left| \sum_{A \in \mathcal{A}} h(A) P_r(A) \right| \;=\; \left| \int_{\mathbb{R}^d} h(x) \, dP_r(x) \right| \;\le\; \frac{C}{r},$$
and $\lim_{r \to \infty} \left| \int_{\mathbb{R}^d} h(x) \, dP_r(x) \right| = 0$. Therefore $c^* = 0$, and Theorem 1 guarantees that GBS determines the correct multidimensional threshold in at most $\left\lceil \frac{\log 2}{\log(3/2)} \log_2 N \right\rceil$ steps.
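The decay of the first moment under $P_r$ can be checked empirically. The following Monte Carlo sketch (illustrative choices $d = 2$, $b = 1$, and a fixed seed; the estimates carry sampling noise of order $n^{-1/2}$) estimates $\left|\int h \, dP_r\right|$ for a halfspace whose boundary passes within distance $b$ of the origin, at a small and a large radius:

```python
import math
import random

random.seed(1)

def sample_ball(r, d):
    """Uniform draw from the ball of radius r centered at the origin in R^d."""
    g = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    rad = r * random.random() ** (1.0 / d)
    return [rad * v / norm for v in g]

def moment(r, d=2, b=1.0, n=20000):
    """Monte Carlo estimate of |integral of h dP_r| for h(x) = sign(x_1 + b)."""
    s = 0
    for _ in range(n):
        x = sample_ball(r, d)
        s += 1 if x[0] + b >= 0 else -1
    return abs(s) / n

m_small, m_large = moment(2.0), moment(100.0)
print(m_small, m_large)   # the estimated moment shrinks as r grows
```

At $r = 2$ the ball is noticeably unbalanced by the offset boundary, while at $r = 100$ the estimate is close to zero, consistent with the $C/r$ bound above.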
To the best of our knowledge this is a new result in the theory of learning multidimensional threshold functions, although similar query complexity bounds have been established for the subclass of linear threshold functions with $b = 0$ (threshold boundaries passing through the origin); see, for example, the work of Balcan et al. [17]. Those results are based on somewhat different learning algorithms, assumptions, and analysis techniques. Observe that if $\mathcal{H}$ is an $\epsilon$-dense subset (with respect to Lebesgue measure over a compact set in $\mathbb{R}^d$) of the continuous class of threshold functions of the form (5), then the size of $\mathcal{H}$ satisfies $\log |\mathcal{H}| = \log N \propto d \log \epsilon^{-1}$. Therefore the query complexity of GBS is proportional to the metric entropy of the continuous class, and it follows from the results of Kulkarni et al. [25] that no learning algorithm exists with a lower query complexity (up to constant factors). Furthermore, note that the computational complexity of GBS for hypotheses of the form (5) is proportional to the cardinality of $\mathcal{A}$, which is equal to the number of polytopes generated by intersections of halfspaces. It is a well-known fact (see Buck [42]) that $|\mathcal{A}| = \sum_{j=0}^{d} \binom{N}{j} \propto N^d$. Therefore, GBS is a polynomial-time algorithm for this problem. In general, the cardinality of $\mathcal{A}$ could be as large as $2^N$.

Next let $\mathcal{H}$ again be the hypotheses of the form (5), but let $\mathcal{X} := [-1, 1]^d$ instead of all of $\mathbb{R}^d$. This constraint on the query space affects the bound on the coherence parameter $c^*$. To bound $c^*$, let $P$ place point masses of probability $2^{-d}$ at each of the $2^d$ vertices of the cube $[-1, 1]^d$ (the natural generalization of the $P$ chosen in the case of classic binary search in Section V-A above). Then
$$\left|\sum_{A \in \mathcal{A}} h(A) P(A)\right| \;=\; \left|\int_{\mathcal{X}} h(x) \, dP(x)\right| \;\le\; 1 - 2^{-d+1}$$
for every $h \in \mathcal{H}$, since for each $h$ there is at least one vertex where it predicts $+1$ and one where it predicts $-1$. Thus, $c^* \le 1 - 2^{-d+1}$.
We conclude that GBS determines the correct hypothesis in a number of steps proportional to $2^d \log N$. The dependence on $2^d$ is unavoidable, since it may be that each threshold function takes the value $+1$ at only one of the $2^d$ vertices, and so each vertex must be queried. A noteworthy special case arises when $b = 0$ (i.e., the threshold boundaries pass through the origin). In this case, with $P$ as specified above, $c^* = 0$, since each hypothesis responds with $+1$ at half of the vertices and $-1$ on the other half. Therefore, the query complexity of GBS is at most $\left\lceil \frac{\log 2}{\log(3/2)} \log_2 N \right\rceil$, independent of the dimension. As discussed above, similar results for this special case have been previously reported based on different algorithms and analyses; see the results in the work of Balcan et al. [17] and the references therein. Note that even if the threshold boundaries do not pass through the origin, the number of queries needed is proportional to $\log N$ so long as $|b| < 1$. The dependence on the dimension $d$ can also be eliminated if it is known that, for a certain distribution $P$ on $\mathcal{X}$, the absolute value of the moment of the correct hypothesis w.r.t. $P$ is upper bounded by a constant $c < 1$, as discussed at the beginning of this section.

Finally, we also mention hypotheses associated with axis-aligned rectangles in $[0, 1]^d$, the multidimensional version of the interval hypotheses considered above. An axis-aligned rectangle is defined by its boundary coordinates in each dimension, $\{a_j, b_j\}_{j=1}^d$, $0 \le a_j < b_j \le 1$. The hypothesis associated with such a rectangle takes the value $+1$ on the set $\{x \in [0, 1]^d : a_j \le x_j \le b_j,\ j = 1, \dots, d\}$ and $-1$ otherwise. The complementary hypothesis may also be included. Consider a finite collection $\mathcal{H}$ of hypotheses of this form.
If the rectangles associated with each $h \in \mathcal{H}$ have volume at least $\nu$, then by taking $P$ to be the uniform measure on $[0, 1]^d$ it follows that the coherence parameter satisfies $c^* \le 1 - 2\nu$ for this problem. The cells of the partition $\mathcal{A}$ of $[0, 1]^d$ associated with a collection of such hypotheses are rectangles themselves. If the boundaries of the rectangles associated with the hypotheses are distinct, then the $1$-neighborhood graph of $\mathcal{A}$ is connected. Theorem 1 implies that the number of queries needed by GBS to determine the correct rectangle is proportional to $\log N / \log((1 - \nu)^{-1})$.

C. Discrete Query Spaces

In many situations both the hypothesis and query spaces may be discrete. A machine learning application, for example, may have access to a large (but finite) pool of unlabeled examples, any of which may be queried for a label. Because obtaining labels can be costly, "active" learning algorithms select only those examples that are predicted to be highly informative for labeling. Theorem 1 applies equally well to continuous or discrete query spaces. For example, consider the linear separator case, but instead of the query space $\mathbb{R}^d$ suppose that $\mathcal{X}$ is a finite subset of points in $\mathbb{R}^d$. The hypotheses again induce a partition of $\mathcal{X}$ into subsets $\mathcal{A}(\mathcal{X}, \mathcal{H})$, but the number of subsets in the partition may be less than the number in $\mathcal{A}(\mathbb{R}^d, \mathcal{H})$. Consequently, the neighborhood graph of $\mathcal{A}(\mathcal{X}, \mathcal{H})$ depends on the specific points that are included in $\mathcal{X}$ and may or may not be connected. As discussed at the beginning of this section, the neighborly condition can be verified in polynomial time (polynomial in $|\mathcal{A}| \le |\mathcal{X}|$).

Consider two illustrative examples. Let $\mathcal{H}$ be a collection of linear separators as in (5) above, and first reconsider the partition $\mathcal{A}(\mathbb{R}^d, \mathcal{H})$. Recall that each set in $\mathcal{A}(\mathbb{R}^d, \mathcal{H})$ is a polytope. Suppose that a discrete set $\mathcal{X}$ contains at least one point inside each of the polytopes in $\mathcal{A}(\mathbb{R}^d, \mathcal{H})$.
Then it follows from the results above that $(\mathcal{X}, \mathcal{H})$ is $1$-neighborly. Second, consider a simple case in $d = 2$ dimensions. Suppose $\mathcal{X}$ consists of just three non-collinear points $\{x_1, x_2, x_3\}$ and suppose that $\mathcal{H}$ comprises six classifiers, $\{h_1^+, h_1^-, h_2^+, h_2^-, h_3^+, h_3^-\}$, satisfying $h_i^+(x_i) = +1$, $h_i^+(x_j) = -1$ for $j \neq i$, $i = 1, 2, 3$, and $h_i^- = -h_i^+$, $i = 1, 2, 3$. In this case, $\mathcal{A}(\mathcal{X}, \mathcal{H}) = \{\{x_1\}, \{x_2\}, \{x_3\}\}$, and the responses to any pair of queries differ for four of the six hypotheses. Thus, the $4$-neighborhood graph of $\mathcal{A}(\mathcal{X}, \mathcal{H})$ is connected, but the $1$-neighborhood graph is not.

Also note that a finite query space naturally limits the number of hypotheses that need be considered. Consider an uncountable collection of hypotheses. The number of unique labeling assignments generated by these hypotheses can be bounded in terms of the VC dimension of the class; see the book by Vapnik for more information on VC theory [43]. As a result, it suffices to consider a finite subset of the hypotheses consisting of just one representative of each unique labeling assignment. Furthermore, the computational complexity of GBS is proportional to $N |\mathcal{X}|$ in this case.

VI. RELATED WORK

Generalized binary search can be viewed as a generalization of classic binary search, of Shannon-Fano coding as noted by Goodman and Smyth [1], and of channel coding with noiseless feedback as studied by Horstein [2].
Problems of this nature arise in many applications, including channel coding (e.g., the work of Horstein [2] and Zigangirov [4]), experimental design (e.g., as studied by Rényi [5], [6]), disease diagnosis (e.g., see the work of Loveland [7]), fault-tolerant computing (e.g., the work of Feige et al. [8]), the scheduling problem considered by Kosaraju et al. [9], computer vision problems investigated by Geman and Jedynak [13] and Arkin et al. [14], image processing problems studied by Korostelev and Kim [10], [11], and active learning research; for example, the investigations by Freund et al. [15], Dasgupta [16], Balcan et al. [17], and Castro and Nowak [31].

Past work has provided a partial characterization of this problem. If the responses to queries are noiseless, then selecting the sequence of queries from $\mathcal{X}$ is equivalent to determining a binary decision tree, where a sequence of queries defines a path from the root of the tree (corresponding to $\mathcal{H}$) to a leaf (corresponding to a single element of $\mathcal{H}$). In general the determination of the optimal (worst- or average-case) tree is NP-complete, as shown by Hyafil and Rivest [19]. However, there exists a greedy procedure that yields query sequences within a factor of $\log N$ of the optimal search tree depth; this result has been discovered independently by several researchers, including Loveland [7], Garey and Graham [7], Arkin et al. [20], and Dasgupta [16]. The greedy procedure is referred to here as Generalized Binary Search (GBS) or the splitting algorithm, and it reduces to classic binary search, as discussed in Section V-A. The number of queries an algorithm requires to determine $h^*$ is called the query complexity of the algorithm. Since the hypotheses are assumed to be distinct, it is clear that the query complexity of GBS is at most $N$ (because it is always possible to find a query that eliminates at least one hypothesis at each step).
In fact, there are simple examples (see Section V-A) demonstrating that this is the best one can hope to do in general. However, it is also true that in many cases the performance of GBS can be much better, requiring as few as $\log_2(N)$ queries. In classic binary search, for example, half of the hypotheses are eliminated at each step (e.g., refer to the textbook by Cormen et al. [21]). Rényi first considered a form of binary search with noise [5] and explored its connections with information theory [6]. In particular, the problem of sequential transmission over a binary symmetric channel with noiseless feedback, as formulated by Horstein [2] and studied by Burnashev and Zigangirov [3] and more recently by Pelc et al. [22], is equivalent to a noisy binary search problem.

There is a large literature on learning from queries; see the review articles by Angluin [23], [24]. This paper focuses exclusively on membership queries (i.e., an $x \in \mathcal{X}$ is the query and the response is $h^*(x)$), although other types of queries (equivalence, subset, superset, disjointness, and exhaustiveness) are possible, as discussed by Angluin [23]. Arbitrary queries have also been investigated, in which the query is a subset of $\mathcal{H}$ and the output is $+1$ if $h^*$ belongs to the subset and $-1$ otherwise. A finite collection of hypotheses $\mathcal{H}$ can be successively halved using arbitrary queries, and so it is possible to determine $h^*$ with $\log_2 N$ arbitrary queries, the information-theoretically optimal query complexity discussed by Kulkarni et al. [25]. Membership queries are the most natural in function learning problems, and because this paper deals only with this type we will simply refer to them as queries throughout the rest of the paper.
The number of queries required to determine a binary-valued function in a finite collection of hypotheses can be bounded (above and below) in terms of a combinatorial parameter of $(\mathcal{X}, \mathcal{H})$ due to Hegedüs [26] (see the work of Hellerstein et al. [27] for related work). Due to its combinatorial nature, computing such bounds is generally NP-hard. In contrast, the geometric relationship between $\mathcal{X}$ and $\mathcal{H}$ developed in this paper leads to an upper bound on the query complexity that can be determined analytically or computed in polynomial time in many cases of interest.

The term GBS is used in this paper to emphasize connections and similarities with classic binary search, which is a special case of the general problem considered here. Classic binary search is equivalent to learning a one-dimensional binary-valued threshold function by selecting point evaluations of the function according to a bisection procedure. Consider the threshold function $h_t(x) := \mathrm{sign}(x - t)$ on the interval $\mathcal{X} := [0, 1]$ for some threshold value $t \in (0, 1)$. Throughout the paper we adopt the convention that $\mathrm{sign}(0) = +1$. Suppose that $t$ belongs to the discrete set $\left\{\frac{1}{N+1}, \dots, \frac{N}{N+1}\right\}$ and let $\mathcal{H}$ denote the collection of threshold functions $h_1, \dots, h_N$. The value of $t$ can then be determined from a constant times $\log N$ queries using a bisection procedure analogous to the game of twenty questions. In fact, this is precisely what GBS performs in this case (i.e., GBS reduces to classic binary search in this setting). If $N = 2^m$ for some integer $m$, then each point evaluation provides one bit in the $m$-bit binary expansion of $t$. Thus, classic binary search is information-theoretically optimal; see the book by Traub, Wasilkowski, and Woźniakowski [28] for a nice treatment of classic bisection and binary search. The main results of this paper generalize the salient aspects of classic binary search to a much broader class of problems.
In many (if not most) applications it is unrealistic to assume that the responses to queries are without error. A form of binary search with noise appears to have been first posed by Rényi [5]. The noisy binary search problem arises in sequential transmission over a binary symmetric channel with noiseless feedback, studied by Horstein [2] and Zigangirov [4], [29]. The survey paper by Pelc et al. [22] discusses the connections between search and coding problems. In channel coding with feedback, each threshold corresponds to a unique binary codeword (the binary expansion of $t$). Thus, channel coding with noiseless feedback is equivalent to the problem of learning a one-dimensional threshold function in binary noise, as noted by Burnashev and Zigangirov [3]. Near-optimal solutions to the noisy binary search problem first appeared in these two contexts. Discrete versions of Horstein's probabilistic bisection procedure [2] were shown to be information-theoretically optimal (optimal decay of the error probability) in the works of Zigangirov and Burnashev [3], [4], [29]. More recently, the same procedure was independently proposed and analyzed in the context of noise-tolerant versions of the classic binary search problem by Karp and Kleinberg [30], motivated by applications ranging from investment planning to admission control in queueing networks. Closely related approaches are considered in the work of Feige et al. [8]. The noisy binary search problem has found important applications in the minimax theory of sequential, adaptive sampling procedures proposed by Korostelev and Kim [10], [11] for image recovery, and in binary classification problems studied by Castro and Nowak [31]. We also mention the works of Rivest et al. [32], Spencer [33], Aslam and Dhagat [35], and Dhagat et al. [34], which consider adversarial situations in which the total number of erroneous oracle responses is fixed in advance.
One straightforward approach to noisy GBS is to follow the GBS algorithm but repeat the query at each step multiple times in order to decide whether the response is more probably $+1$ or $-1$. This simple approach has been studied in the context of noisy versions of classic binary search and shown to perform significantly worse than other approaches in the work of Karp and Kleinberg [30]; this is perhaps not surprising, since it is essentially a simple repetition code approach to communicating over a noisy channel. A near-optimal noise-tolerant version of GBS was developed in this paper. The algorithm can be viewed as a non-trivial generalization of Horstein's probabilistic bisection procedure. Horstein's method relies on the special structure of classic binary search, namely that the hypotheses and queries can be naturally ordered together in the unit interval. Horstein's method is a sequential Bayesian procedure. It begins with a uniform distribution over the set of hypotheses. At each step, it queries at the point that bisects the probability mass of the current distribution over hypotheses, and then updates the distribution according to Bayes' rule. Horstein's procedure is not directly applicable to situations in which the hypotheses and queries cannot be ordered together, but the geometric condition developed in this paper provides similar structure that is exploited here to devise a generalized probabilistic bisection procedure. The key elements of the procedure and the analysis of its convergence are fundamentally different from those in the classic binary search work of Burnashev and Zigangirov [3] and Karp and Kleinberg [30].

VII. CONCLUSIONS AND POSSIBLE EXTENSIONS

This paper investigated a generalization of classic binary search, called GBS, that extends it to arbitrary query and hypothesis spaces. While the GBS algorithm is well known, past work has only partially characterized its capabilities.
This paper developed new conditions under which GBS (and a noise-tolerant variant) achieve the information-theoretically optimal query complexity. The new conditions are based on a novel geometric relation between the query and hypothesis spaces, which is verifiable analytically and/or computationally in many cases of practical interest. The main results are applied to learning multidimensional threshold functions, a problem arising routinely in image processing and machine learning. Let us briefly consider some possible extensions and open problems. First, recall that in noisy situations it is assumed that the binary noise probability has a known upper bound $\alpha < 1/2$. It is possible to accommodate situations in which the bound is unknown a priori. This can be accomplished using an NGBS algorithm in which the number of repetitions of each query, $R$, is determined adaptively to adjust to the unknown noise level. This procedure was developed by the author in [18], and is based on a straightforward, iterated application of Chernoff's bound. Using an adaptive procedure for adjusting the number of repetitions of each query yields an NGBS algorithm with query complexity bound proportional to $\log N \log \frac{\log N}{\delta}$, the same order as that of the NGBS algorithm discussed above, which assumed a known bound $\alpha$. Similar strategies have been suggested as a general approach for devising noise-tolerant learning algorithms [36]. Whether or not the additional logarithmic factor can be removed if the noise bound $\alpha$ is unknown is an open question. Adversarial noise models in which the total number of errors is fixed in advance, like those considered by Rivest et al. [32] and Spencer [33], are also of interest in classic binary search problems. Repeating each query multiple times and taking the majority vote of the responses, as in the NGBS algorithm, is a standard approach to adversarial noise.
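The repetition-with-majority-vote step mentioned above can be sketched as follows. The names `respond` and the noise model are illustrative assumptions, and the tail bound quoted in the comment is the standard Hoeffding bound, not a result specific to this paper.

```python
def majority_query(respond, x, repetitions):
    """Repeat a noisy query `repetitions` times and return the majority label.

    `respond(x)` returns a label in {-1, +1}, assumed flipped independently
    with probability alpha < 1/2.  By Hoeffding's inequality, the majority
    vote errs with probability at most exp(-2 * repetitions * (1/2 - alpha)**2),
    so O(log(1/delta)) repetitions give confidence 1 - delta per query.
    Use an odd number of repetitions to avoid ties.
    """
    votes = sum(respond(x) for _ in range(repetitions))
    return 1 if votes > 0 else -1
```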
Thus, NGBS provides an algorithm for generalized binary search with adversarial noise. Finally, we suggest that the salient features of the GBS algorithms could be extended to handle continuous, uncountable classes of hypotheses. For example, consider the continuous class of halfspace threshold functions on $\mathbb{R}^d$. This class is indexed parametrically, and it is possible to associate a volume measure with the class (and subsets of it) by introducing a measure over the parameter space. At each step of a GBS-style algorithm, all inconsistent hypotheses are eliminated and the next query is selected to split the volume of the parameter space corresponding to the remaining hypotheses, mimicking the splitting criterion of the GBS algorithm presented here.

ACKNOWLEDGEMENTS

The author thanks R. Castro, A. Gupta, C. Scott, A. Singh and the anonymous reviewers for helpful feedback and suggestions.

VIII. APPENDIX

A. Proof of Lemma 2

First, the precise form of $p_1, p_2, \ldots$ is derived as follows. Let $\delta_i = (1 + \sum_h p_i(h) z_i(h))/2$, the weighted proportion of hypotheses that agree with $y_i$. The factor that normalizes the updated distribution in (2) is related to $\delta_i$ as follows. Note that
$$\sum_h p_i(h)\, \beta^{(1-z_i(h))/2} (1-\beta)^{(1+z_i(h))/2} = \sum_{h:\, z_i(h)=-1} p_i(h)\,\beta + \sum_{h:\, z_i(h)=1} p_i(h)(1-\beta) = (1-\delta_i)\beta + \delta_i(1-\beta).$$
Thus,
$$p_{i+1}(h) = \frac{p_i(h)\, \beta^{(1-z_i(h))/2} (1-\beta)^{(1+z_i(h))/2}}{(1-\delta_i)\beta + \delta_i(1-\beta)}.$$
Denote the reciprocal of the update factor for $p_{i+1}(h^*)$ by
$$\gamma_i := \frac{(1-\delta_i)\beta + \delta_i(1-\beta)}{\beta^{(1-z_i(h^*))/2} (1-\beta)^{(1+z_i(h^*))/2}}, \qquad (6)$$
where $z_i(h^*) = h^*(x_i)\, y_i$, and observe that $p_{i+1}(h^*) = p_i(h^*)/\gamma_i$. Thus,
$$\frac{C_{i+1}}{C_i} = \frac{(1 - p_i(h^*)/\gamma_i)\, p_i(h^*)}{(p_i(h^*)/\gamma_i)\,(1 - p_i(h^*))} = \frac{\gamma_i - p_i(h^*)}{1 - p_i(h^*)}.$$
We prove that $E[C_{i+1} \mid p_i] \le C_i$ by showing that $E[\gamma_i \mid p_i] \le 1$.
To accomplish this, we let $p_i$ be arbitrary. For every $A \in \mathcal{A}$ and every $h \in \mathcal{H}$, let $h(A)$ denote the value of $h$ on the set $A$. Define $\delta^+_A = (1 + \sum_h p_i(h)\, h(A))/2$, the proportion of hypotheses that take the value $+1$ on $A$. Let $A_i$ denote the set that $x_i$ is selected from, and consider the four possible situations:
$$h^*(x_i) = +1,\ y_i = +1: \quad \gamma_i = \frac{(1-\delta^+_{A_i})\beta + \delta^+_{A_i}(1-\beta)}{1-\beta}$$
$$h^*(x_i) = +1,\ y_i = -1: \quad \gamma_i = \frac{\delta^+_{A_i}\beta + (1-\delta^+_{A_i})(1-\beta)}{\beta}$$
$$h^*(x_i) = -1,\ y_i = +1: \quad \gamma_i = \frac{(1-\delta^+_{A_i})\beta + \delta^+_{A_i}(1-\beta)}{\beta}$$
$$h^*(x_i) = -1,\ y_i = -1: \quad \gamma_i = \frac{\delta^+_{A_i}\beta + (1-\delta^+_{A_i})(1-\beta)}{1-\beta}$$
To bound $E[\gamma_i \mid p_i]$ it is helpful to condition on $A_i$. Define $q_i := P(y_i \neq h^*(x_i))$. If $h^*(A_i) = +1$, then
$$E[\gamma_i \mid p_i, A_i] = \frac{(1-\delta^+_{A_i})\beta + \delta^+_{A_i}(1-\beta)}{1-\beta}\,(1-q_i) + \frac{\delta^+_{A_i}\beta + (1-\delta^+_{A_i})(1-\beta)}{\beta}\, q_i = \delta^+_{A_i} + (1-\delta^+_{A_i})\left[\frac{\beta(1-q_i)}{1-\beta} + \frac{q_i(1-\beta)}{\beta}\right].$$
Define $\gamma^+_i(A_i) := \delta^+_{A_i} + (1-\delta^+_{A_i})\left[\frac{\beta(1-q_i)}{1-\beta} + \frac{q_i(1-\beta)}{\beta}\right]$. Similarly, if $h^*(A_i) = -1$, then
$$E[\gamma_i \mid p_i, A_i] = (1-\delta^+_{A_i}) + \delta^+_{A_i}\left[\frac{\beta(1-q_i)}{1-\beta} + \frac{q_i(1-\beta)}{\beta}\right] =: \gamma^-_i(A_i).$$
By assumption $q_i \le \alpha < 1/2$, and since $\alpha \le \beta < 1/2$ the factor
$$\frac{\beta(1-q_i)}{1-\beta} + \frac{q_i(1-\beta)}{\beta} \le \frac{\beta(1-\alpha)}{1-\beta} + \frac{\alpha(1-\beta)}{\beta} \le 1$$
(strictly less than $1$ if $\beta > \alpha$). Define
$$\varepsilon_0 := 1 - \frac{\beta(1-\alpha)}{1-\beta} - \frac{\alpha(1-\beta)}{\beta}$$
to obtain the bounds
$$\gamma^+_i(A_i) \le \delta^+_{A_i} + (1-\delta^+_{A_i})(1-\varepsilon_0), \qquad (7)$$
$$\gamma^-_i(A_i) \le \delta^+_{A_i}(1-\varepsilon_0) + (1-\delta^+_{A_i}). \qquad (8)$$
Observe that for every $A$ we have $0 < \delta^+_A < 1$, since at least one hypothesis takes the value $-1$ on $A$ and $p(h) > 0$ for all $h \in \mathcal{H}$. Therefore both $\gamma^+_i(A_i)$ and $\gamma^-_i(A_i)$ are less than or equal to $1$, and it follows that $E[\gamma_i \mid p_i] \le 1$ (and strictly less than $1$ if $\beta > \alpha$). $\blacksquare$

B. Proof of Theorem 4

The proof amounts to obtaining upper bounds for $\gamma^+_i(A_i)$ and $\gamma^-_i(A_i)$, defined above in (7) and (8).
Consider two distinct situations. Define $b_i := \min_{A \in \mathcal{A}} |W(p_i, A)|$. First suppose that there do not exist neighboring sets $A$ and $A'$ with $W(p_i, A) > b_i$ and $W(p_i, A') < -b_i$. Then by Lemma 1, this implies that $b_i \le c^*$, and according to the query selection step of the modified SGBS algorithm, $A_i = \arg\min_A |W(p_i, A)|$. Note that because $|W(p_i, A_i)| \le c^*$, we have $(1-c^*)/2 \le \delta^+_{A_i} \le (1+c^*)/2$. Hence, both $\gamma^+_i(A_i)$ and $\gamma^-_i(A_i)$ are bounded above by $1 - \varepsilon_0(1-c^*)/2$.

Now suppose that there exist neighboring sets $A$ and $A'$ with $W(p_i, A) > b_i$ and $W(p_i, A') < -b_i$. Recall that in this case $A_i$ is randomly chosen to be $A$ or $A'$ with equal probability. Note that $\delta^+_A > (1+b_i)/2$ and $\delta^+_{A'} < (1-b_i)/2$. If $h^*(A) = h^*(A') = +1$, then applying (7) results in
$$E[\gamma_i \mid p_i, A_i \in \{A, A'\}] < \frac{1}{2}\left(1 + \frac{1-b_i}{2} + \frac{1+b_i}{2}(1-\varepsilon_0)\right) = \frac{1}{2}\left(2 - \varepsilon_0\,\frac{1+b_i}{2}\right) \le 1 - \varepsilon_0/4,$$
since $b_i > 0$. Similarly, if $h^*(A) = h^*(A') = -1$, then (8) yields $E[\gamma_i \mid p_i, A_i \in \{A, A'\}] < 1 - \varepsilon_0/4$. If $h^*(A) = -1$ and $h^*(A') = +1$, then applying (8) on $A$ and (7) on $A'$ yields
$$E[\gamma_i \mid p_i, A_i \in \{A, A'\}] \le \frac{1}{2}\left(\delta^+_A(1-\varepsilon_0) + (1-\delta^+_A) + \delta^+_{A'} + (1-\delta^+_{A'})(1-\varepsilon_0)\right)$$
$$= \frac{1}{2}\left(1 - \delta^+_A + \delta^+_{A'} + (1-\varepsilon_0)(1 + \delta^+_A - \delta^+_{A'})\right) = \frac{1}{2}\left(2 - \varepsilon_0(1 + \delta^+_A - \delta^+_{A'})\right) = 1 - \frac{\varepsilon_0}{2}(1 + \delta^+_A - \delta^+_{A'}) \le 1 - \varepsilon_0/2,$$
since $0 \le \delta^+_A - \delta^+_{A'} \le 1$. The final possibility is that $h^*(A) = +1$ and $h^*(A') = -1$. Apply (7) on $A$ and (8) on $A'$ to obtain
$$E[\gamma_i \mid p_i, A_i \in \{A, A'\}] \le \frac{1}{2}\left(\delta^+_A + (1-\delta^+_A)(1-\varepsilon_0) + \delta^+_{A'}(1-\varepsilon_0) + (1-\delta^+_{A'})\right) = \frac{1}{2}\left(1 + \delta^+_A - \delta^+_{A'} + (1-\varepsilon_0)(1 - \delta^+_A + \delta^+_{A'})\right).$$
Next, use the fact that because $A$ and $A'$ are neighbors, $\delta^+_A - \delta^+_{A'} = p_i(h^*) - p_i(-h^*)$; if $-h^*$ does not belong to $\mathcal{H}$, then $p_i(-h^*) = 0$.
Hence,
$$E[\gamma_i \mid p_i, A_i \in \{A, A'\}] \le \frac{1}{2}\left(1 + \delta^+_A - \delta^+_{A'} + (1-\varepsilon_0)(1 - \delta^+_A + \delta^+_{A'})\right) = \frac{1}{2}\left(1 + p_i(h^*) - p_i(-h^*) + (1-\varepsilon_0)(1 - p_i(h^*) + p_i(-h^*))\right)$$
$$\le \frac{1}{2}\left(1 + p_i(h^*) + (1-\varepsilon_0)(1 - p_i(h^*))\right) = 1 - \frac{\varepsilon_0}{2}(1 - p_i(h^*)),$$
since the bound is maximized when $p_i(-h^*) = 0$. Now bound $E[\gamma_i \mid p_i]$ by the maximum of the conditional bounds above to obtain
$$E[\gamma_i \mid p_i] \le \max\left\{1 - \frac{\varepsilon_0}{2}(1 - p_i(h^*)),\ 1 - \frac{\varepsilon_0}{4},\ 1 - \frac{(1-c^*)\varepsilon_0}{2}\right\},$$
and thus
$$E\left[\frac{C_{i+1}}{C_i} \,\Big|\, p_i\right] = \frac{E[\gamma_i \mid p_i] - p_i(h^*)}{1 - p_i(h^*)} \le 1 - \min\left\{\frac{\varepsilon_0}{2}(1-c^*),\ \frac{\varepsilon_0}{4}\right\}. \qquad \blacksquare$$

C. Proof of Theorem 5

First consider the bound on $E[R(\hat h)]$. Let $\delta_n = 2e^{-n|R_\Delta(h_1) - R_\Delta(h_2)|^2/6}$ and consider the conditional expectation $E[R(\hat h) \mid h_1, h_2]$; i.e., the expectation with respect to the $n/3$ queries drawn from the region $\Delta$, conditioned on the $2n/3$ queries used to select $h_1$ and $h_2$. By Lemma 4,
$$E[R(\hat h) \mid h_1, h_2] \le (1-\delta_n)\min\{R(h_1), R(h_2)\} + \delta_n \max\{R(h_1), R(h_2)\}$$
$$= \min\{R(h_1), R(h_2)\} + \delta_n\left[\max\{R(h_1), R(h_2)\} - \min\{R(h_1), R(h_2)\}\right]$$
$$= \min\{R(h_1), R(h_2)\} + \delta_n\, |R(h_1) - R(h_2)|$$
$$= \min\{R(h_1), R(h_2)\} + 2|R(h_1) - R(h_2)|\, e^{-n|R_\Delta(h_1) - R_\Delta(h_2)|^2/6}$$
$$\le \min\{R(h_1), R(h_2)\} + 2|R(h_1) - R(h_2)|\, e^{-n|R(h_1) - R(h_2)|^2/6},$$
where the last inequality follows from the fact that $|R(h_1) - R(h_2)| \le |R_\Delta(h_1) - R_\Delta(h_2)|$. The function $2u\,e^{-\zeta u^2}$ attains its maximum value $\sqrt{2/(e\zeta)} \le \sqrt{1/\zeta}$ at $u = \sqrt{1/(2\zeta)}$, and therefore
$$E[R(\hat h) \mid h_1, h_2] \le \min\{R(h_1), R(h_2)\} + \sqrt{3/n}.$$
Now taking the expectation with respect to $h_1$ and $h_2$ (i.e., with respect to the queries used for the selection of $h_1$ and $h_2$),
$$E[R(\hat h)] \le E[\min\{R(h_1), R(h_2)\}] + \sqrt{3/n} \le \min\{E[R(h_1)], E[R(h_2)]\} + \sqrt{3/n},$$
by Jensen's inequality. Next consider the bound on $P(h_1 \neq h^*)$. This also follows from an application of Lemma 4. Note that if the conditions of Theorem 4 hold, then $P(h_1 \neq h^*) \le N e^{-\lambda n/3}$. Furthermore, if $h_1 = h^*$ and $h_2 \neq h^*$, then $|R_\Delta(h_1) - R_\Delta(h_2)| \ge |1 - 2\alpha|$. The bound on $P(\hat h \neq h^*)$ follows by applying the union bound to the events $h_1 \neq h^*$ and $R(\hat h) > \min\{R(h_1), R(h_2)\}$. $\blacksquare$

REFERENCES

[1] R. M. Goodman and P. Smyth, “Decision tree design from a communication theory standpoint,” IEEE Trans. Info. Theory, vol. 34, no. 5, pp. 979–994, 1988.
[2] M. Horstein, “Sequential decoding using noiseless feedback,” IEEE Trans. Info. Theory, vol. 9, no. 3, pp. 136–143, 1963.
[3] M. V. Burnashev and K. S. Zigangirov, “An interval estimation problem for controlled observations,” Problems in Information Transmission, vol. 10, pp. 223–231, 1974.
[4] K. S. Zigangirov, “Upper bounds for the error probability of feedback channels,” Probl. Peredachi Inform., vol. 6, no. 2, pp. 87–82, 1970.
[5] A. Rényi, “On a problem in information theory,” MTA Mat. Kut. Int. Kozl., pp. 505–516, 1961; reprinted in Selected Papers of Alfred Rényi, vol. 2, P. Turán, ed., pp. 631–638, Akademiai Kiado, Budapest, 1976.
[6] ——, “On the foundations of information theory,” Review of the International Statistical Institute, vol. 33, no. 1, 1965.
[7] D. W. Loveland, “Performance bounds for binary testing with arbitrary weights,” Acta Informatica, vol. 22, pp. 101–114, 1985.
[8] U. Feige, P. Raghavan, D. Peleg, and E. Upfal, “Computing with noisy information,” SIAM J. Comput., vol. 23, no. 5, pp. 1001–1018, 1994.
[9] S. R. Kosaraju, T. M. Przytycka, and R.
Borgstrom, “On an optimal split tree problem,” Lecture Notes in Computer Science: Algorithms and Data Structures, vol. 1663, pp. 157–168, 1999.
[10] A. P. Korostelev, “On minimax rates of convergence in image models under sequential design,” Statistics & Probability Letters, vol. 43, pp. 369–375, 1999.
[11] A. P. Korostelev and J.-C. Kim, “Rates of convergence for the sup-norm risk in image models under sequential designs,” Statistics & Probability Letters, vol. 46, pp. 391–399, 2000.
[12] M. Swain and M. Stricker, “Promising directions in active vision,” Int. J. Computer Vision, vol. 11, no. 2, pp. 109–126, 1993.
[13] D. Geman and B. Jedynak, “An active testing model for tracking roads in satellite images,” IEEE Trans. PAMI, vol. 18, no. 1, pp. 1–14, 1996.
[14] E. M. Arkin, H. Meijer, J. S. B. Mitchell, D. Rappaport, and S. Skiena, “Decision trees for geometric models,” Intl. J. Computational Geometry and Applications, vol. 8, no. 3, pp. 343–363, 1998.
[15] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby, “Selective sampling using the query by committee algorithm,” Machine Learning, vol. 28, pp. 133–168, 1997.
[16] S. Dasgupta, “Analysis of a greedy active learning strategy,” in Neural Information Processing Systems, 2004.
[17] M.-F. Balcan, A. Broder, and T. Zhang, “Margin based active learning,” in Conf. on Learning Theory (COLT), 2007.
[18] R. Nowak, “Generalized binary search,” in Proceedings of the Allerton Conference, Monticello, IL, 2008. (www.ece.wisc.edu/~nowak/gbs.pdf)
[19] L. Hyafil and R. L. Rivest, “Constructing optimal binary decision trees is NP-complete,” Inf. Process. Lett., vol. 5, pp. 15–17, 1976.
[20] M. R. Garey and R. L. Graham, “Performance bounds on the splitting algorithm for binary testing,” Acta Inf., vol. 3, pp. 347–355, 1974.
[21] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, Third Edition. MIT Press, 2009.
[22] A.
Pelc, “Searching games with errors – fifty years of coping with liars,” Theoretical Computer Science, vol. 270, pp. 71–109, 2002.
[23] D. Angluin, “Queries and concept learning,” Machine Learning, vol. 2, pp. 319–342, 1988.
[24] ——, “Queries revisited,” Springer Lecture Notes in Comp. Sci.: Algorithmic Learning Theory, pp. 12–31, 2001.
[25] S. R. Kulkarni, S. K. Mitter, and J. N. Tsitsiklis, “Active learning using arbitrary binary valued queries,” Machine Learning, pp. 23–35, 1993.
[26] T. Hegedüs, “Generalized teaching dimensions and the query complexity of learning,” in 8th Annual Conference on Computational Learning Theory, 1995, pp. 108–117.
[27] L. Hellerstein, K. Pillaipakkamnatt, V. Raghavan, and D. Wilkins, “How many queries are needed to learn?” J. ACM, vol. 43, no. 5, pp. 840–862, 1996.
[28] J. F. Traub, G. W. Wasilkowski, and H. Wozniakowski, Information-based Complexity. Academic Press, 1988.
[29] K. S. Zigangirov, “Message transmission in a binary symmetric channel with noiseless feedback (random transmission time),” Probl. Peredachi Inform., vol. 4, no. 3, pp. 88–97, 1968.
[30] R. Karp and R. Kleinberg, “Noisy binary search and its applications,” in Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms (SODA 2007), pp. 881–890.
[31] R. Castro and R. Nowak, “Minimax bounds for active learning,” IEEE Trans. Info. Theory, pp. 2339–2353, 2008.
[32] R. L. Rivest, A. R. Meyer, and D. J. Kleitman, “Coping with errors in binary search procedures,” J. Comput. System Sci., pp. 396–404, 1980.
[33] J. Spencer, “Ulam’s searching game with a fixed number of lies,” Theoretical Computer Science, vol. 95, pp. 307–321, 1992.
[34] A. Dhagat, P. Gacs, and P. Winkler, “On playing ‘twenty questions’ with a liar,” in Proc. ACM Symposium on Discrete Algorithms (SODA), 1992, pp. 16–22.
[35] J. Aslam and A. Dhagat, “Searching in the presence of linearly bounded errors,” in Proc.
ACM Symposium on the Theory of Computing (STOC), 1991, pp. 486–493.
[36] M. Kääriäinen, “Active learning in the non-realizable case,” in Algorithmic Learning Theory, 2006, pp. 63–77.
[37] P. Brémaud, Markov Chains, Gibbs Fields, Monte Carlo Simulations, and Queues. Springer, 1998.
[38] C. Scott, personal communication.
[39] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inform. Theory, vol. 52, no. 2, pp. 489–509, Feb. 2006.
[40] D. L. Donoho, “Compressed sensing,” IEEE Trans. Inform. Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[41] S. Dasgupta, “Coarse sample complexity bounds for active learning,” in Neural Information Processing Systems, 2005.
[42] R. C. Buck, “Partition of space,” The American Mathematical Monthly, vol. 50, no. 9, pp. 541–544, 1943.
[43] V. Vapnik, The Nature of Statistical Learning Theory. NY: Springer, 1995.
