A Permutation-based Model for Crowd Labeling: Optimal Estimation and Robustness


Authors: Nihar B. Shah, Sivaraman Balakrishnan, Martin J. Wainwright

Nihar B. Shah*, Sivaraman Balakrishnan†, and Martin J. Wainwright‡

*Machine Learning Department and Computer Science Department, Carnegie Mellon University
†Department of Statistics and Data Science, Carnegie Mellon University
‡Department of EECS and Department of Statistics, University of California, Berkeley

Abstract

The task of aggregating and denoising crowd-labeled data has gained increased significance with the advent of crowdsourcing platforms and massive datasets. We propose a permutation-based model for crowd-labeled data that is a significant generalization of the classical Dawid-Skene model, and introduce a new error metric by which to compare different estimators. We derive global minimax rates for the permutation-based model that are sharp up to logarithmic factors, and match the minimax lower bounds derived under the simpler Dawid-Skene model. We then design two computationally-efficient estimators: the WAN estimator for the setting where the ordering of workers in terms of their abilities is approximately known, and the OBI-WAN estimator for the setting where it is not. For each of these estimators, we provide non-asymptotic bounds on their performance. We conduct synthetic simulations and experiments on real-world crowdsourcing data, and the experimental results corroborate our theoretical findings.

1 Introduction

Recent years have witnessed a surge of interest in the use of crowdsourcing for labeling massive datasets. Expert labels are often difficult or expensive to obtain at scale, and crowdsourcing platforms allow for the collection of labels from a large number of low-cost workers.
This paradigm, while enabling several new applications of machine learning, also introduces some key challenges: first, low-cost workers are often non-experts and the labels they produce can be quite noisy; and second, data collected in this fashion has a high amount of heterogeneity, with significant differences in the quality of labels across workers and tasks. Thus, it is important to develop realistic models and scalable algorithms for aggregating and drawing meaningful inferences from the noisy labels obtained via crowdsourcing. This paper focuses on objective labeling tasks involving binary choices, meaning that each question or task is associated with a single correct binary answer or label.¹

There is a vast literature on the problem of estimation from noisy crowdsourced labels (e.g., [39, 33, 22, 21, 15, 28, 14, 5, 48, 13]). The bulk of this past work is based on the classical Dawid-Skene model [6], in which each worker $i$ is associated with a single scalar parameter $q^{\mathrm{DS}}_i \in [0,1]$, and it is assumed that the probability that worker $i$ answers any question $j$ correctly is given by the same scalar $q^{\mathrm{DS}}_i$. Thus, the Dawid-Skene model imposes a homogeneity condition on the questions, one which is often not satisfied in practical applications where some questions may be more difficult than others. We note that the original model by Dawid and Skene [6] also allows for asymmetric errors across different classes. In this paper, we focus on the setting with symmetric error probabilities, which has popularly come to be known as the "one-coin Dawid-Skene model" and has been the focus of much of the past literature [22, 21, 15, 5].

Author email addresses: nihars@cs.cmu.edu, siva@stat.cmu.edu, wainwrig@berkeley.edu.
¹ In this paper, we use the terms {question, task} and {answer, label} interchangeably.
Both the asymmetric and symmetric models, however, are governed by restrictive parameter-based assumptions and assume homogeneity of questions. Accordingly, in this paper, we propose and analyze a more general permutation-based model that allows the noise in the answer to depend on the particular question-worker pair. Within the context of such models, we propose and analyze a variety of estimation algorithms.

One possible metric for analysis is the Hamming error, and there is a large body of past work [22, 21, 15, 14, 5, 48, 13] that provides sufficient conditions guaranteeing zero Hamming error (meaning that every question is answered correctly) with high probability. Although the Hamming error can be suitable for the analysis of Dawid-Skene style models, we argue in the sequel that it is less appropriate for the heterogeneous settings studied in this paper. Instead, when tasks have heterogeneous difficulties, it is more natural to use a weighted metric that also accounts for the underlying difficulty of the tasks. Concretely, an estimator should be penalized less for making an error on a question that is intrinsically more difficult. In this paper, we introduce and provide analysis under such a difficulty-weighted error metric.

From a high-level perspective, the contributions of this paper can be summarized as follows:

• We introduce a new "permutation-based" model for crowd-labeled data, which is considerably richer than the popular Dawid-Skene class of models.

• In order to accommodate the richness of the model, we introduce a new difficulty-weighted loss that extends the popular Hamming loss. We prove non-asymptotic upper and lower bounds on the global minimax error under the difficulty-weighted loss, sharp up to logarithmic factors, for estimation under the permutation-based model. These bounds match those under the Dawid-Skene model up to logarithmic factors.
• We propose a computationally-efficient estimator, termed the WAN estimator, for the setting where an approximate ordering of the workers in terms of their abilities is known. We show that under the permutation-based model, this estimator has strong guarantees for the 0-1 loss and also achieves the global minimax limits (up to logarithmic factors) for the difficulty-reweighted loss.

• We provide a computationally-efficient estimator, termed the OBI-WAN estimator, for the setting when no prior information about the workers is known. This estimator achieves strong guarantees for the 0-1 loss under the Dawid-Skene model and an intermediate model, and simultaneously has guarantees over the much richer permutation-based model, thereby establishing its robustness to model specification.

• We conduct synthetic simulations as well as real-world experiments using data from the Amazon Mechanical Turk crowdsourcing platform. These experiments reveal strong performance of the OBI-WAN estimator in practice.

The remainder of this paper is organized as follows. In Section 2, we provide some background, set up the problems we address in this paper, and provide an overview of related literature. Section 3 is devoted to our main results. We present numerical simulations and real-world experiments in Section 4. We present proofs of the claimed theoretical results in Section 5. We conclude the paper with a discussion of future research directions in Section 6.

2 Background and model formulation

We begin with some background on existing crowd-labeling models, followed by an introduction to our proposed models; we conclude with a discussion of related work.

2.1 Observation model

Consider a crowdsourcing system that consists of $n$ workers and $d$ questions. We assume every question has two possible answers, denoted by $\{-1, +1\}$, of which exactly one is correct.
We let $x^* \in \{-1, 1\}^d$ denote the binary vector of correct answers to all $d$ questions. We model the question-answering via an unknown matrix $Q^* \in [0,1]^{n \times d}$ whose $(i,j)$-th entry, $Q^*_{ij}$, represents the probability that worker $i$ answers question $j$ correctly; otherwise, with probability $1 - Q^*_{ij}$, worker $i$ gives the incorrect answer to question $j$. For future reference, note that the (one-coin) Dawid-Skene model involves a special case of such a matrix, namely one of the form $Q^* = q^{\mathrm{DS}} \mathbf{1}^T$, where the vector $q^{\mathrm{DS}} \in [0,1]^n$ corresponds to the vector of correctness probabilities, with a single scalar associated with each worker.

We denote the response of worker $i$ to question $j$ by a variable $Y_{ij} \in \{-1, 0, 1\}$, where we set $Y_{ij} = 0$ if worker $i$ is not asked question $j$, and set $Y_{ij}$ to the answer (either $-1$ or $1$) provided by the worker otherwise. We assume that worker $i$ is asked question $j$ with probability $p_{\mathrm{obs}} \in [0,1]$, independently for every pair $(i,j) \in [n] \times [d]$, and that a worker is never asked the same question twice. We also make the standard assumption that given the values of $x^*$ and $Q^*$, the entries of $Y$ are all mutually independent. In summary, we observe a matrix $Y$ with independent entries distributed as

$$Y_{ij} = \begin{cases} x^*_j & \text{with probability } p_{\mathrm{obs}} Q^*_{ij}, \\ -x^*_j & \text{with probability } p_{\mathrm{obs}} (1 - Q^*_{ij}), \\ 0 & \text{with probability } 1 - p_{\mathrm{obs}}. \end{cases}$$

Given this random matrix $Y$, our goal is to estimate the binary vector $x^* \in \{-1,1\}^d$ of true labels. Obtaining non-trivial guarantees for this problem requires that some structure be imposed on the probability matrix $Q^*$. The Dawid-Skene model is one form of such structure: it requires that the probability matrix $Q^*$ be rank one, with identical columns all equal to $q^{\mathrm{DS}} \in \mathbb{R}^n$. As noted previously, this structural assumption on $Q^*$ is very strong.
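As a concrete illustration, the observation model above can be simulated directly. The sketch below is our own illustration in numpy (the helper name `sample_responses` is ours, not the paper's):

```python
import numpy as np

def sample_responses(Q, x_star, p_obs, rng):
    """Draw one response matrix Y from the observation model:
    Y_ij = x*_j w.p. p_obs * Q_ij, -x*_j w.p. p_obs * (1 - Q_ij), 0 otherwise."""
    n, d = Q.shape
    observed = rng.random((n, d)) < p_obs   # is worker i asked question j?
    correct = rng.random((n, d)) < Q        # does worker i answer correctly?
    signs = np.where(correct, 1, -1) * x_star[None, :]
    return np.where(observed, signs, 0)

rng = np.random.default_rng(0)
n, d, p_obs = 5, 4, 0.7
Q = np.full((n, d), 0.8)                    # every worker correct w.p. 0.8
x_star = np.array([1, -1, 1, 1])
Y = sample_responses(Q, x_star, p_obs, rng)
```

Each entry of the returned matrix lies in $\{-1, 0, 1\}$, with zeros marking the unobserved worker-question pairs.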
It assumes that each worker has a fixed probability of answering a question correctly, and is likely to be violated in settings where some questions are more difficult than others. Accordingly, in this paper, we study a more general permutation-based model of the following form. We assume that there are two underlying orderings, both of which are unknown to us: first, a permutation $\pi^*: [n] \to [n]$ that orders the $n$ workers in terms of their (latent) abilities, and second, a permutation $\sigma^*: [d] \to [d]$ that orders the $d$ questions with respect to their (latent) difficulties. In terms of these permutations, we assume that the probability matrix $Q^*$ obeys the following conditions:

• Worker monotonicity: For every pair of workers $i$ and $i'$ such that $\pi^*(i) < \pi^*(i')$ and every question $j$, we have $Q^*_{ij} \geq Q^*_{i'j}$.

• Question monotonicity: For every pair of questions $j$ and $j'$ such that $\sigma^*(j) < \sigma^*(j')$ and every worker $i$, we have $Q^*_{ij} \geq Q^*_{ij'}$.

In other words, the permutation-based model assumes the existence of a permutation of the rows and columns such that each row and each column of the permuted matrix $Q^*$ has non-increasing entries. The rank of the resulting matrix is allowed to be as large as $\min\{n, d\}$. It is straightforward to verify that the Dawid-Skene model corresponds to a particular type of such probability matrices, restricted to have identical columns. In summary, we let $\mathbb{C}_{\mathrm{Perm}}$ denote the set of all possible values of the matrix $Q^*$ under the proposed permutation-based model, that is,

$$\mathbb{C}_{\mathrm{Perm}} := \big\{ Q \in [0,1]^{n \times d} \mid \text{there exist permutations } (\pi, \sigma) \text{ such that question and worker monotonicity hold} \big\}.$$

For future reference, we also use

$$\mathbb{C}_{\mathrm{DS}} := \big\{ Q \in \mathbb{C}_{\mathrm{Perm}} \mid Q = q^{\mathrm{DS}} \mathbf{1}^T \text{ for some } q^{\mathrm{DS}} \in [0,1]^n \big\}$$

to denote the subset of such matrices that are realizable under the Dawid-Skene assumption.
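To make the class concrete, here is one simple way to generate a matrix satisfying both monotonicity conditions under the identity permutations, together with a membership check. The construction $Q_{ij} = \frac{1}{2} + \frac{1}{2} a_i b_j$ with sorted ability and easiness vectors is our own illustration, not a construction prescribed by the paper:

```python
import numpy as np

def random_bimonotone_Q(n, d, rng):
    """Generate a matrix with non-increasing rows and columns:
    Q_ij = 1/2 + a_i * b_j / 2 with abilities a and easiness b sorted decreasing."""
    a = np.sort(rng.random(n))[::-1]
    b = np.sort(rng.random(d))[::-1]
    return 0.5 + 0.5 * np.outer(a, b)

def in_C_perm(Q):
    """Check worker and question monotonicity under the identity permutations."""
    rows_ok = np.all(np.diff(Q, axis=0) <= 0)   # better workers come first
    cols_ok = np.all(np.diff(Q, axis=1) <= 0)   # easier questions come first
    return bool(rows_ok and cols_ok)

rng = np.random.default_rng(1)
Q = random_bimonotone_Q(6, 8, rng)
# A Dawid-Skene matrix (identical columns) is a special case:
Q_ds = np.outer(np.sort(rng.random(6))[::-1], np.ones(8))
```

A general member of $\mathbb{C}_{\mathrm{Perm}}$ is any row/column permutation of such a bimonotone matrix; the check above covers the already-sorted representative.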
It should be noted that none of these models are identifiable without further constraints. For instance, changing $x^*$ to $-x^*$ and $Q^*$ to $(\mathbf{1}\mathbf{1}^T - Q^*)$ does not change the distribution of the observation matrix $Y$. In the context of the Dawid-Skene model, several papers [22, 21, 14, 48] have resolved this issue by requiring that $\frac{1}{n} \sum_{i=1}^n q^{\mathrm{DS}}_i \geq \frac{1}{2} + \mu$ for some constant value $\mu > 0$. Although this condition resolves the lack of identifiability, the underlying assumption, namely that every question is answerable by a subset of the workers, can be violated in practice. In particular, one frequently encounters questions that are too difficult for any of the hired workers to answer, and for which the workers' answers are nearly uniformly random (e.g., see the papers [8, 38]). On the other hand, empirical observations also show that workers on crowdsourcing platforms, as opposed to being adversarial in nature, at worst provide random answers to labeling tasks [47, 8, 12, 11]. On this basis, for certain results in the paper, we will consider the regime:

$$Q^*_{ij} \geq \tfrac{1}{2} \quad \text{for all } i \in [n],\ j \in [d]. \quad \text{(R1)}$$

Note that neither the condition (R1) nor the condition $\frac{1}{n} \sum_{i=1}^n q^{\mathrm{DS}}_i \geq \frac{1}{2} + \mu$ from the past literature dominates the other.

2.2 Evaluating estimators

In this section, we introduce the criteria used to evaluate estimators in this paper. In formal terms, an estimator $\hat{x}$ is a measurable function that maps any observation matrix $Y$ to a vector in the Boolean hypercube $\{-1, 1\}^d$. The most popular way of assessing the performance of such an estimator is in terms of its (normalized) Hamming error

$$d_{\mathrm{H}}(\hat{x}, x^*) := \frac{1}{d} \sum_{j=1}^d \mathbf{1}\{\hat{x}_j \neq x^*_j\}, \quad (1)$$

where $\mathbf{1}\{\hat{x}_j \neq x^*_j\}$ denotes a binary indicator that takes the value 1 if $\hat{x}_j \neq x^*_j$, and 0 otherwise. A potential deficiency of the Hamming error is that it places a uniform weight on each question.
As mentioned earlier, there are applications of crowdsourcing in which some subset of the questions is very difficult, and no hired worker can answer them reliably. In such settings, any estimator will have an inflated Hamming error, not due to any particular deficiency of the estimator, but rather due to the intrinsic hardness of the assigned collection of questions. This error inflation will obscure possible differences between estimators.

Our goal in choosing an appropriate loss function is to allow for evaluation and comparison of various estimators. Thus, with the aforementioned issue in mind, we introduce an alternative error measure that weights the Hamming error by the difficulty of each task. A more general class of error measures takes the form

$$\mathcal{L}_{Q^*}(\hat{x}, x^*) = \frac{1}{d} \sum_{j=1}^d \mathbf{1}\{\hat{x}_j \neq x^*_j\} \, \Psi(Q^*_{1j}, \ldots, Q^*_{nj}), \quad (2)$$

for some function $\Psi: [0,1]^n \to \mathbb{R}_+$ which captures the difficulty of estimating the answer to a question.

The $Q^*$-loss: In order to choose a suitable function $\Psi$, we note that past work on the Dawid-Skene model [22, 21, 15, 14, 5] has shown that the quantity

$$\frac{1}{n} \sum_{i=1}^n (2 q^{\mathrm{DS}}_i - 1)^2, \quad (3)$$

popularly known as the collective intelligence of the crowd, is central to characterizing the overall difficulty of the crowdsourcing problem under the Dawid-Skene assumption. A natural generalization, then, is to consider the weights

$$\Psi(Q^*_{1j}, \ldots, Q^*_{nj}) = \frac{1}{n} \sum_{i=1}^n \big( 2 Q^*_{ij} - 1 \big)^2 \quad \text{for each task } j \in [d], \quad (4a)$$

which characterize the difficulty of task $j$ for a given collection of workers. This choice gives rise to the $Q^*$-loss function

$$\mathcal{L}_{Q^*}(\hat{x}, x^*) := \frac{1}{d} \sum_{j=1}^d \Big[ \mathbf{1}\{\hat{x}_j \neq x^*_j\} \, \frac{1}{n} \sum_{i=1}^n (2 Q^*_{ij} - 1)^2 \Big] \quad (4b)$$

$$= \frac{1}{dn} \, ||| \big( Q^* - \tfrac{1}{2} \mathbf{1}\mathbf{1}^T \big) \, \mathrm{diag}(\hat{x} - x^*) |||_F^2, \quad (4c)$$

where $\mathrm{diag}(\hat{x} - x^*)$ denotes the matrix in $\mathbb{R}^{d \times d}$ whose diagonal entries are given by the vector $\hat{x} - x^*$.
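The two expressions (4b) and (4c) for the $Q^*$-loss can be checked against each other numerically; the sketch below is our own illustration in numpy:

```python
import numpy as np

def q_loss_sum(Q, x_hat, x_star):
    """Q*-loss via the difficulty-weighted sum (4b)."""
    weights = np.mean((2 * Q - 1) ** 2, axis=0)   # per-question weight (4a)
    return np.mean(weights * (x_hat != x_star))

def q_loss_frobenius(Q, x_hat, x_star):
    """Equivalent matrix form (4c): (1/dn) * ||(Q - 1/2 11^T) diag(x_hat - x_star)||_F^2."""
    n, d = Q.shape
    M = (Q - 0.5) @ np.diag(x_hat - x_star)
    return np.sum(M ** 2) / (d * n)

rng = np.random.default_rng(2)
n, d = 4, 6
Q = 0.5 + 0.5 * rng.random((n, d))
x_star = rng.choice([-1, 1], size=d)
x_hat = x_star.copy(); x_hat[0] *= -1            # one incorrect label
```

The equivalence holds because each diagonal entry $\hat{x}_j - x^*_j$ lies in $\{-2, 0, 2\}$, so its square is exactly $4 \cdot \mathbf{1}\{\hat{x}_j \neq x^*_j\}$.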
Note that under the Dawid-Skene model (in which $Q^* = q^{\mathrm{DS}} \mathbf{1}^T$), this loss function reduces to

$$\mathcal{L}_{Q^*}(\hat{x}, x^*) = \Big( \frac{1}{n} \sum_{i=1}^n (2 q^{\mathrm{DS}}_i - 1)^2 \Big) \underbrace{\Big( \frac{1}{d} \sum_{j=1}^d \mathbf{1}\{\hat{x}_j \neq x^*_j\} \Big)}_{d_{\mathrm{H}}(\hat{x}, x^*)}, \quad (5)$$

corresponding to the normalized Hamming error rescaled by the collective intelligence. For future reference, let us summarize some properties of the function $\mathcal{L}_{Q^*}$: (a) it is symmetric in its arguments $(x^*, \hat{x})$, and satisfies the triangle inequality; (b) it takes values in the interval $[0,1]$; and (c) if for every question $j \in [d]$ there exists a worker $\ell \in [n]$ such that $Q^*_{\ell j} \neq \frac{1}{2}$, then $\mathcal{L}_{Q^*}$ defines a metric; if not, it defines a pseudo-metric.

Regime of interest: In this paper, we focus on understanding the minimax risk as well as the risk of various computationally efficient estimators. We work in a non-asymptotic framework where we are interested in evaluating the risk in terms of the triplet $(n, d, p_{\mathrm{obs}})$. We assume that $p_{\mathrm{obs}} \geq \frac{1}{n}$, which ensures that on average, at least one worker answers each question. We also operate in the regime $d \geq n$, which is commonplace in practical applications. Indeed, as also noted in earlier works [48], typical medium or large-scale crowdsourcing tasks employ tens to hundreds of workers, while the number of questions is on the order of hundreds to many thousands. We assume that the value of $p_{\mathrm{obs}}$ is known; this is a mild assumption, since it is straightforward to estimate $p_{\mathrm{obs}}$ very accurately using its empirical expectation. We encompass the aforementioned conditions in the regime:

$$p_{\mathrm{obs}} \geq \frac{1}{n} \quad \text{and} \quad d \geq n. \quad \text{(R2)}$$

2.3 Related work

Having set up our model and notation, let us now relate it to past work in the area. For the problem of crowd labeling, the Dawid-Skene model [6] is the dominant paradigm, and has been widely studied [22, 21, 15, 28, 14, 5, 48]. Some papers have studied models that generalize the Dawid-Skene model.
In a recent work, Khetan and Oh [23] analyze an extension of the Dawid-Skene model in which a vector $\tilde{q} \in \mathbb{R}^n$, capturing the abilities of the workers, is supplemented with a second vector $h^* \in [0,1]^d$, and the probability of worker $i$ correctly answering question $j$ is set as $\tilde{q}_i (1 - h^*_j) + (1 - \tilde{q}_i) h^*_j$. Although this model now has $(n + d)$ parameters instead of just $n$ as in the Dawid-Skene model, it retains parametric-type assumptions: each worker and each question is described by a single parameter, and the probability of correctness takes a specific form governed by these parameters. In contrast, in the permutation-based model each worker-question pair is described by a single parameter. Our permutation-based model forms a strict superset of this class. Zhou et al. [50, 49] propose a model based on a certain minimax entropy principle, whereas Whitehill et al. [46] propose a parameter-based model that also incorporates question difficulties. However, the algorithms proposed in these papers [50, 49, 46] have yet to be rigorously analyzed. In this paper, we introduce a class of models that is considerably more flexible than the Dawid-Skene model, as well as a novel algorithm for estimation in such models, which we equip with theoretical guarantees. The present paper also introduces another new algorithm for the setting in which an ordering of the workers in terms of their abilities is approximately known, for instance, based on some initial test.

To be clear, the results of this paper have some limitations as compared to past work on the Dawid-Skene model, and we hope that these limitations will be removed in future work on the permutation-based model.
Concretely, while the present paper addresses the setting of binary labels with symmetric error probabilities, several of these prior works also address settings with more than two classes, and where the probability of error of a worker may be asymmetric across the classes. The results presented in this paper have logarithmic factor gaps, that is, the "optimal" results are optimal only up to logarithmic factors, as stated throughout the paper. For the Dawid-Skene model, particularly under sparse observations, the past works [22, 21, 14, 48, 23] have results with sharper logarithmic factors. Finally, the guarantees provided in past results have error exponents that adapt to the underlying signal, whereas ours do not.

A related problem in the context of crowdsourcing is to estimate pairwise outcome probabilities from pairwise comparison data. In our past work [34, 35], we have considered this problem under an assumption of "strong stochastic transitivity (SST)", which is a regularity condition related to the permutation-based model of this paper. Accordingly, parts of our proofs make use of metric entropy calculations from this past work. Unlike our previous work, the current paper involves an unknown set of labels, as well as a significantly different observation model: in particular, the observed data couples the unknown matrix $Q^*$ with the unknown labels. Moreover, rather than estimating the unknown probabilities $Q^*$, our primary goal in this paper is to estimate the underlying labels, for which significantly different algorithmic ideas and proof techniques are required.

Finally, the problem of aggregating labels of crowdsourcing workers is conceptually similar to that of combining classifiers in an unsupervised context, each solving multiple classification problems [32, 19]. Our work has implications for this line of research as well.
3 Main results

We now turn to the statement of our main results. We use $c, c_U, c_L, c_0, c_H$ to denote positive universal constants that are independent of all other problem parameters. Recall that the $Q^*$-loss takes values in the interval $[0,1]$.

3.1 Minimax risk for estimation under the permutation-based model

We begin by proving sharp upper and lower bounds on the minimax risk for the permutation-based model $\mathbb{C}_{\mathrm{Perm}}$. The upper bound is obtained via an analysis of the following least squares estimator:

$$(\tilde{x}^{\mathrm{LS}}, \tilde{Q}^{\mathrm{LS}}) \in \arg\min_{x \in \{-1,1\}^d, \, Q \in \mathbb{C}_{\mathrm{Perm}}} ||| p_{\mathrm{obs}}^{-1} Y - (2Q - \mathbf{1}\mathbf{1}^T) \, \mathrm{diag}(x) |||_F^2. \quad (6)$$

In order to provide some intuition for this estimator, one can show (see the proof of Theorem 1(a) for details) that the unknowns $x^*$ and $Q^*$ are related to the mean of the observed matrix $Y$ via the equality

$$\mathbb{E}[Y] = p_{\mathrm{obs}} \, (2 Q^* - \mathbf{1}\mathbf{1}^T) \, \mathrm{diag}(x^*).$$

Consequently, the estimate $(\tilde{x}^{\mathrm{LS}}, \tilde{Q}^{\mathrm{LS}})$ computed via the program (6) equals the true solution $(x^*, Q^*)$ when $Y$ is replaced by its population version. We do not know of a computationally efficient way to solve the optimization problem (6). Despite this computational issue, our statistical analysis of this estimator serves to provide a benchmark for comparing other, computationally-efficient estimators, to be discussed in the sequel. The following theoretical guarantees hold in the regime (R1) ∩ (R2):

Theorem 1. (a) For any binary vector $x^* \in \{-1,1\}^d$ and any matrix $Q^* \in \mathbb{C}_{\mathrm{Perm}}$, the least squares estimator $\tilde{x}^{\mathrm{LS}}$ has error at most

$$\mathcal{L}_{Q^*}(\tilde{x}^{\mathrm{LS}}, x^*) \leq c_U \, \frac{(\log d)^2}{n p_{\mathrm{obs}}}, \quad (7a)$$

with probability at least $1 - e^{-c_H d \log(dn)}$.

(b) There exists a matrix $\tilde{Q} \in \mathbb{C}_{\mathrm{DS}}$ such that any estimator $\hat{x}$ (which may even know the value of $\tilde{Q}$) has error at least

$$\sup_{x^* \in \{-1,1\}^d} \mathbb{E}[\mathcal{L}_{\tilde{Q}}(\hat{x}, x^*)] \geq c_L \, \frac{1}{n p_{\mathrm{obs}}}. \quad (7b)$$

We provide the proof of this theorem in Sections 5.1 and 5.2.
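The population identity $\mathbb{E}[Y] = p_{\mathrm{obs}}(2Q^* - \mathbf{1}\mathbf{1}^T)\mathrm{diag}(x^*)$ that motivates the program (6) can be verified by Monte Carlo simulation; the following is a sketch under our own choice of parameters:

```python
import numpy as np

# Monte Carlo check of E[Y] = p_obs * (2Q* - 11^T) diag(x*).
rng = np.random.default_rng(3)
n, d, p_obs, T = 4, 3, 0.6, 20000
Q = 0.5 + 0.5 * rng.random((n, d))
x_star = rng.choice([-1, 1], size=d)

# Draw T independent response matrices at once (axis 0 indexes replicates).
observed = rng.random((T, n, d)) < p_obs
correct = rng.random((T, n, d)) < Q
Y = np.where(observed, np.where(correct, 1, -1) * x_star, 0)

empirical_mean = Y.mean(axis=0)
population_mean = p_obs * (2 * Q - 1) @ np.diag(x_star)
```

With 20,000 replicates, the entrywise gap between the empirical and population means should be well below 0.05.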
As a consequence of this result, we see that in terms of the (global) minimax risk under the $Q^*$-loss, there is only a polylogarithmic factor difference between the Dawid-Skene and the permutation-based models, despite the permutation-based model being considerably richer. We note that while the upper bound of Theorem 1(a) is quite involved, the lower bound of Theorem 1(b) is a straightforward consequence of a simple "worst case" construction. This suggests that the global minimax error be augmented with an investigation of local minimax errors under various subclasses of $\mathbb{C}_{\mathrm{Perm}}$ and various notions and values of the signal-to-noise ratio, which we leave as important future work.

The least squares estimator analyzed above also yields an accurate estimate of the probability matrix $Q^*$ in the Frobenius norm, useful in settings where the calibration of workers or questions might be of interest. Again, this result holds in the regime (R1) ∩ (R2):

Corollary 1. (a) For any $x^* \in \{-1,1\}^d$ and any $Q^* \in \mathbb{C}_{\mathrm{Perm}}$, the least squares estimate $\tilde{Q}^{\mathrm{LS}}$ has error at most

$$\frac{1}{dn} ||| \tilde{Q}^{\mathrm{LS}} - Q^* |||_F^2 \leq c_U \, \frac{\log^2 d}{n p_{\mathrm{obs}}}, \quad (8a)$$

with probability at least $1 - e^{-c_H d \log(dn)}$.

(b) Conversely, for any answer vector $x^* \in \{-1,1\}^d$, any estimator $\hat{Q}$ (which is allowed to know the value of $x^*$) has error at least

$$\sup_{Q^* \in \mathbb{C}_{\mathrm{Perm}}} \mathbb{E}\Big[ \frac{1}{dn} ||| \hat{Q} - Q^* |||_F^2 \Big] \geq c_L \, \frac{1}{n p_{\mathrm{obs}}}. \quad (8b)$$

Please see Sections 5.3 and 5.4 for the proof of this corollary. We do not know whether there exist computationally-efficient estimators that can achieve the upper bound on the sample complexity established in Theorem 1 over the entire permutation-based model class. In the following sections, we design and analyze polynomial-time estimators that have interesting (but suboptimal) guarantees over the permutation-based model and also useful guarantees over popular subclasses of the permutation-based model.
3.2 The WAN estimator: when the workers' ordering is (approximately) known

Several organizations employ crowdsourcing workers only after a thorough testing and calibration process. Motivated by this fact, we now turn to the study of the setting in which the workers are calibrated, in the sense that it is known how they are ordered in terms of their respective abilities. More formally, recall from Section 2.1 that any matrix $Q^* \in \mathbb{C}_{\mathrm{Perm}}$ is associated with two permutations: a permutation of the workers in terms of their abilities, and a permutation of the questions in terms of their difficulty. In this section, we assume that the permutation of the workers is (approximately) known to the estimation algorithm. Note that the estimator does not know the permutation of the questions, nor does it know the values of the entries of $Q^*$.

Given a permutation $\pi$ of the workers, our estimator consists of two steps, which we refer to as Windowing and Aggregating Naïvely, respectively, and accordingly term the procedure the WAN estimator:

• Step 1 (Windowing): Compute the integer

$$k^{\mathrm{WAN}} \in \arg\max_{k \in \{p_{\mathrm{obs}}^{-1} \log^{1.5}(dn), \ldots, n\}} \; \sum_{j \in [d]} \mathbf{1}\Big\{ \Big| \sum_{i \in [k]} Y_{\pi^{-1}(i)\, j} \Big| \geq \sqrt{k \, p_{\mathrm{obs}} \, \log^{1.5}(dn)} \Big\}, \quad (9a)$$

where ties in the argmax are broken arbitrarily.

• Step 2 (Aggregating Naïvely): Set $\hat{x}^{\mathrm{WAN}}(\pi)$ as a majority vote of the best $k^{\mathrm{WAN}}$ workers, that is,

$$[\hat{x}^{\mathrm{WAN}}(\pi)]_j \in \arg\max_{b \in \{-1,1\}} \sum_{i=1}^{k^{\mathrm{WAN}}} \mathbf{1}\{Y_{\pi^{-1}(i)\, j} = b\} \quad \text{for every } j \in [d]. \quad (9b)$$

The windowing step finds a value $k^{\mathrm{WAN}}$ such that the answers of the best $k^{\mathrm{WAN}}$ workers to most questions are significantly biased towards one of the options, thereby indicating that these workers are knowledgeable, or at least in agreement with each other. The second step then simply takes a majority vote within this set of the best $k^{\mathrm{WAN}}$ workers.
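The two steps can be sketched directly in numpy. The constants inside the threshold follow our reading of (9a) as reconstructed above, so treat this as an illustrative sketch rather than the authors' reference implementation:

```python
import numpy as np

def wan_estimator(Y, pi, p_obs):
    """Sketch of the WAN estimator (9a)-(9b). pi[i] is the rank of worker i
    (0 = best); we assume the window range {k_min, ..., n} is nonempty."""
    n, d = Y.shape
    Y_sorted = Y[np.argsort(pi)]                 # rows ordered best worker first
    L = np.log(n * d) ** 1.5
    k_min = int(np.ceil(L / p_obs))
    cumsums = np.cumsum(Y_sorted, axis=0)        # top-k partial sums for all k
    # Step 1 (Windowing): pick the k whose top-k sums clear the noise
    # threshold on the largest number of questions.
    best_k, best_count = None, -1
    for k in range(k_min, n + 1):
        count = np.sum(np.abs(cumsums[k - 1]) >= np.sqrt(k * p_obs * L))
        if count > best_count:
            best_k, best_count = k, count
    # Step 2 (Aggregating Naively): majority vote over the best k workers,
    # with ties broken in favor of +1.
    votes = np.sum(Y_sorted[:best_k], axis=0)
    return np.where(votes >= 0, 1, -1)

# Illustration: strong, fully observed signal, correct worker ordering.
rng = np.random.default_rng(6)
n, d, p_obs = 30, 50, 1.0
x_star = rng.choice([-1, 1], size=d)
correct = rng.random((n, d)) < 0.95              # every worker correct w.p. 0.95
Y = np.where(correct, 1, -1) * x_star
x_hat = wan_estimator(Y, np.arange(n), p_obs)
```

With this much signal, the majority vote over the selected window recovers every label.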
We remark that it is important to choose an appropriate value of $k^{\mathrm{WAN}}$ (as done in Step 1): an overly large value could include many random workers, thereby increasing the noise in the input to the second step; on the flip side, choosing too small a value could eliminate too much of the "signal". Both steps can be carried out in time $O(nd)$.

For the case when $\pi$ is an approximate ordering, we now establish a bound on the error of the WAN estimator. For every $j \in [d]$, let $Q^*_j$ denote the $j$-th column of $Q^*$; for any ordering $\pi$ of the workers, let $Q^{\pi}_j$ denote the vector obtained by permuting the entries of $Q^*_j$ in the order given by $\pi$, that is, with the first entry of $Q^{\pi}_j$ corresponding to the best worker according to $\pi$, and so on. Also recall that $\pi^*$ denotes the true permutation of the workers in terms of their actual abilities. As with all of our theoretical results, the following claim holds in the regime (R1) ∩ (R2):

Theorem 2. For any matrix $Q^* \in \mathbb{C}_{\mathrm{Perm}}$ and any binary vector $x^* \in \{-1,1\}^d$, suppose that the WAN estimator is provided with the permutation $\pi$ of workers. Consider the subset of the questions given by

$$J := \Big\{ j \in [d] \;\Big|\; \exists \, k_j \geq \frac{\log^{1.5}(dn)}{p_{\mathrm{obs}}} \text{ s.t. } \sum_{i=1}^{k_j} \Big( Q^*_{\pi^{-1}(i)\, j} - \frac{1}{2} \Big) \geq \frac{3}{4} \sqrt{\frac{k_j}{p_{\mathrm{obs}}} \log^{1.5}(dn)} \Big\}. \quad (10a)$$

Then the WAN estimator correctly estimates the labels of all questions in the set $J$ with high probability:

$$\mathbb{P}\big( [\hat{x}^{\mathrm{WAN}}(\pi)]_j = x^*_j \text{ for all } j \in J \big) \geq 1 - e^{-c_H \log^{1.5}(dn)}. \quad (10b)$$

We provide the proof of Theorem 2 in Section 5.5. At a high level, the theorem says that all questions that have some reasonable signal are estimated correctly by the WAN estimator. To gain intuition for the notion of signal in (10a), let us consider $p_{\mathrm{obs}} = 1$ and the majority voting algorithm (that is, taking a majority vote over all $n$ workers).
A straightforward application of Hoeffding's inequality yields that for any question $j \in [d]$, the condition $\sum_{i=1}^n (Q^*_{ij} - \frac{1}{2}) = \tilde{\Omega}(\sqrt{n})$ is sufficient for the majority voting estimator to estimate $x^*_j$ correctly (with high probability). Furthermore, in the appendix, we also show that there exist matrices $Q^*$ where this condition is also necessary. Theorem 2 says that the WAN algorithm can estimate a question correctly if there exists some subset of "top" workers (according to $\pi$) such that this condition for majority voting applies when restricted to only the answers from these workers.

While Theorem 2 (as well as other results in the sequel) focuses on exact recovery with high probability, we note that alternatively, directly bounding the expected Hamming error may yield guarantees that go beyond what is captured in this result. We leave this interesting problem for future work.

Theorem 2 has an immediate corollary, one which provides guarantees on the WAN estimator in terms of certain norms of the matrix $Q^*$ that may be more interpretable, and which also provides a guarantee on the $Q^*$-loss incurred by the WAN estimator. Again, these results hold in the regime (R1) ∩ (R2).

Corollary 2. For any matrix $Q^* \in \mathbb{C}_{\mathrm{Perm}}$ and any binary vector $x^* \in \{-1,1\}^d$, suppose that the WAN estimator is provided with the permutation $\pi$ of workers. Then for every question $j \in [d]$ such that

$$\| Q^*_j - \tfrac{1}{2} \|_2^2 \geq \frac{5 \log^{2.5}(dn)}{p_{\mathrm{obs}}} \quad \text{and} \quad \| Q^{\pi}_j - Q^{\pi^*}_j \|_2 \leq \frac{\| Q^*_j - \frac{1}{2} \|_2}{\sqrt{9 \log(dn)}}, \quad (11a)$$

we have

$$\mathbb{P}\big( [\hat{x}^{\mathrm{WAN}}(\pi)]_j = x^*_j \big) \geq 1 - e^{-c_H \log^{1.5}(dn)}. \quad (11b)$$

Consequently, if $\pi$ is the correct permutation of the workers, then with probability at least $1 - e^{-c'_H \log^{1.5}(dn)}$, we have

$$\mathcal{L}_{Q^*}(\hat{x}^{\mathrm{WAN}}(\pi), x^*) \leq c_U \, \frac{\log^{2.5} d}{n p_{\mathrm{obs}}}. \quad (11c)$$

Please see Section 5.6 for the proof of Corollary 2.
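The $\sqrt{n}$ scaling of the majority-voting condition is easy to see in simulation. In our own illustration below (with $p_{\mathrm{obs}} = 1$ and a constant matrix $Q^*$), each worker has bias $c/\sqrt{n}$ above $\frac{1}{2}$, so the column signal $\sum_i (Q^*_{ij} - \frac{1}{2})$ equals $c\sqrt{n}$, and majority voting succeeds with high probability for moderate $c$:

```python
import numpy as np

def majority_vote(Y):
    """Plain majority vote over all workers, ties broken in favor of +1."""
    return np.where(np.sum(Y, axis=0) >= 0, 1, -1)

rng = np.random.default_rng(5)
n, d = 400, 200
x_star = rng.choice([-1, 1], size=d)
c = 3.0
Q = np.full((n, d), 0.5 + c / np.sqrt(n))   # per-worker bias c/sqrt(n)
correct = rng.random((n, d)) < Q
Y = np.where(correct, 1, -1) * x_star
accuracy = np.mean(majority_vote(Y) == x_star)
```

Here the signed vote total has mean $2c\sqrt{n}$ and standard deviation roughly $\sqrt{n}$, so each question is decided correctly except with tiny probability.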
The conditions (11a) required for the result of Corollary 2 are sharp up to logarithmic factors in the following sense. First, if the required approximation guarantee $\| Q^{\pi}_j - Q^{\pi^*}_j \|_2 \leq \| Q^*_j - \frac{1}{2} \|_2 / \sqrt{9 \log(dn)}$ were weakened to $\| Q^{\pi}_j - Q^{\pi^*}_j \|_2 \leq 2 \| Q^*_j - \frac{1}{2} \|_2$, it would allow for any arbitrary permutation $\pi$; this is because every permutation $\pi$ satisfies

$$\| Q^{\pi}_j - Q^{\pi^*}_j \|_2 \leq \| Q^{\pi}_j - \tfrac{1}{2} \|_2 + \| Q^{\pi^*}_j - \tfrac{1}{2} \|_2 = 2 \| Q^*_j - \tfrac{1}{2} \|_2.$$

Secondly, there exist constants $c_0 > 0$ and $c_L > 0$ such that if one were guaranteed a lower bound of only $\frac{c_0}{p_{\mathrm{obs}}}$ on $\| Q^*_j - \frac{1}{2} \|_2^2$, instead of the stated condition of $\frac{5 \log^{2.5}(dn)}{p_{\mathrm{obs}}}$, then there exists a $Q^* \in \mathbb{C}_{\mathrm{DS}}$ satisfying this weaker condition such that any estimator $\hat{x}$ incurs an error at least $\mathbb{P}(\hat{x}_j \neq x^*_j) \geq c_L$. Furthermore, this lower bound holds not only when the ordering of workers is exactly known, but even when the entire matrix $Q^*$ is known. The proof of this claim follows from the construction in the proof of Theorem 1(b).

At this point, we recall from Theorem 1(b) the lower bound on the estimation error in the $Q^*$-loss for any estimator. This lower bound applies to estimators that know not only the ordering of the workers, but also the entire matrix $Q^*$. It matches the upper bound (11c) of Corollary 2, and the two results in conjunction imply that the bound (11c) is sharp up to logarithmic factors.

We conclude this section with a key insight obtained from our analysis of the WAN estimator.

Remark 1 (Insight for the unknown worker ordering problem). The aforementioned results for the WAN algorithm have the following useful implication for the setting where the ordering of workers is unknown, under either of the models $\mathbb{C}_{\mathrm{DS}}$ or $\mathbb{C}_{\mathrm{Perm}}$. For any matrix $Q^* \in \mathbb{C}_{\mathrm{Perm}}$, there exists a set of workers $S_{Q^*} \subseteq [n]$ such that the majority vote of the answers of the workers in $S_{Q^*}$ incurs a small risk.
Consequently, it suffices to design an estimator that identifies a set of good workers and computes a majority vote of their answers. The estimator need not attempt to infer the values of the entries of Q*, as is otherwise required, for instance, to compute maximum likelihood estimates.

The estimator proposed in the next section is based on the observation in Remark 1.

3.3 The OBI-WAN estimator

In this section, we return to the setting where the ordering of the workers is unknown. We begin by presenting a computationally efficient estimator. Our proposed estimator operates in two steps. The first step performs an Ordering Based on Inner-products (OBI), that is, it computes an ordering of the workers based on an inner product with the data. The second step calls upon the WAN estimator from Section 3.2 with this ordering. We thus term our proposed estimator the OBI-WAN estimator, x̂_OBI-WAN. In order to make its description precise, we augment the notation of the WAN estimator x̂_WAN(π), writing x̂_WAN(π, Y) to denote the estimate given by x̂_WAN(π) operating on Y when given the permutation π of workers.

An important technical issue is that re-using the observed data Y both to determine an appropriate ordering of workers and to estimate the desired answers results in a violation of important independence assumptions. We resolve this difficulty by partitioning the set of questions into two sets, and using the ordering estimated from one set to estimate the desired answers for the other set, and vice versa. We provide a careful error analysis for this partitioning-based estimator in the sequel. In more precise terms, the OBI-WAN estimator x̂_OBI-WAN is defined by the following three steps:

• Step 0 (preliminary): Split the set of d questions into two sets, T₀ and T₁, with every question assigned to one of the two sets uniformly at random.
Let Y₀ and Y₁ denote the corresponding submatrices of Y, containing the columns of Y associated with questions in T₀ and T₁ respectively.

• Step 1 (OBI): For ℓ ∈ {0, 1}, let u_ℓ ∈ arg max_{‖u‖₂=1} ‖Y_ℓ^T u‖₂ denote the top eigenvector of Y_ℓ Y_ℓ^T; in order to resolve the global sign ambiguity of eigenvectors, we choose the global sign so that Σ_{i∈[n]} [u_ℓ]_i² 1{[u_ℓ]_i > 0} ≥ Σ_{i∈[n]} [u_ℓ]_i² 1{[u_ℓ]_i < 0}. Let π_ℓ be the permutation of the n workers in order of the respective entries of u_ℓ (with ties broken arbitrarily).

• Step 2 (WAN): Compute the quantities x̂_OBI-WAN(T₀) := x̂_WAN(Y₀, π₁) and x̂_OBI-WAN(T₁) := x̂_WAN(Y₁, π₀), corresponding to estimates of the answers for questions in the sets T₀ and T₁, respectively.

This completes the description of the OBI-WAN algorithm. With regard to the use of the singular vectors of the observed data in the OBI step, we note that previous works [22, 15, 32, 48, 19] also use singular vectors to estimate properties of the underlying parameters in crowdsourcing. In these previous works, this step is motivated by the fact that the spectrum of the population matrix E[Y Y^T] (or its mean-centered counterpart) can be related to the parameters that underlie the model.

In the next three subsections, we provide guarantees for our OBI-WAN estimator under three model classes. Importantly, the guarantees for OBI-WAN hold simultaneously for all model classes, and the estimator does not know the true class to which the data actually belongs.

3.3.1 Guarantees for OBI-WAN under an intermediate model

In addition to the Dawid-Skene and the permutation-based models introduced earlier, we study the estimation problem in an intermediate model that lies between these two models.
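Before describing this model, we note that Steps 0–2 of the OBI-WAN estimator above can be sketched in code as follows. This is our own illustration rather than the authors' implementation: the `wan` argument is a placeholder for the WAN estimator of Section 3.2, and any subroutine returning a ±1 vector of estimates from an observation submatrix and a worker ordering could be plugged in.

```python
import numpy as np

rng = np.random.default_rng(0)

def obi_ordering(Y_part):
    """Step 1 (OBI): order workers by the top left singular vector of
    Y_part, with the global sign chosen so that the positive entries
    carry at least as much squared mass as the negative ones."""
    u = np.linalg.svd(Y_part, full_matrices=False)[0][:, 0]
    if (u[u > 0] ** 2).sum() < (u[u < 0] ** 2).sum():
        u = -u
    return np.argsort(-u)  # most able worker first (ties broken arbitrarily)

def obi_wan(Y, wan):
    """Steps 0 and 2: split questions into T0/T1 at random, order workers
    on each half, and estimate answers on the opposite half."""
    n, d = Y.shape
    mask = rng.random(d) < 0.5
    T0, T1 = np.where(mask)[0], np.where(~mask)[0]
    pi0, pi1 = obi_ordering(Y[:, T0]), obi_ordering(Y[:, T1])
    x_hat = np.empty(d, dtype=int)
    x_hat[T0] = wan(Y[:, T0], pi1)   # cross-use of the two orderings
    x_hat[T1] = wan(Y[:, T1], pi0)
    return x_hat
```

Here `wan(Y_part, pi)` is assumed to return a ±1 estimate for each question in the given split; a majority vote over the answers of a prefix of the ordering `pi` would be one crude stand-in.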
This intermediate model introduces a parameter h*_j ∈ [0, 1] that captures the difficulty of each question j ∈ [d], along with parameters q̃ ∈ R^n associated with the workers as in the Dawid-Skene model. Under this intermediate model, the probability that worker i ∈ [n] correctly answers question j ∈ [d] (when the worker is asked the question) is given by

P(Y_ij = x*_j) = q̃_i (1 − h*_j) + (1/2) h*_j,  for all (i, j) such that Y_ij ≠ 0.   (12)

Intuitively, the parameter h*_j corresponds to the difficulty of question j. When h*_j = 1, the worker is purely stochastic and provides random guesses, while for smaller values of h*_j the worker is more likely to provide a correct answer. This modeling assumption leads to the class

C_Int := { Q = q̃ (1 − h)^T + (1/2) 1 h^T | for some q̃ ∈ [0, 1]^n, h ∈ [0, 1]^d }.

Note that we have the nested relation C_DS ⊂ C_Int; the Dawid-Skene model is the special case of C_Int corresponding to h = 0. In the regime (R1), we further have C_DS ⊂ C_Int ⊂ C_Perm. Up to a bijective transformation of the parameters, the model (12) is identical to a recent model proposed independently by Khetan and Oh [23], where the probability of a correct answer is assumed to be q̃_i (1 − h*_j) + (1 − q̃_i) h*_j. The two models, however, arise from different conceptual motivations: Khetan and Oh consider the probability of correctness as a convex combination of the worker's behavior q̃_i and the opposite behavior (1 − q̃_i), whereas our consideration of the rarity of adversarial behavior leads us to set the probability of correctness as a convex combination of the worker's behavior q̃_i and random responses 1/2.

We now provide exact-recovery guarantees for the OBI-WAN estimator under this intermediate model. As with our other results, the following theorem applies to the regime (R1) ∩ (R2):

Theorem 3.
Consider any binary vector x* ∈ {−1, 1}^d and any matrix Q* ∈ C_Int associated with vectors (q̃, h*) satisfying ‖q̃ − (1/2)1‖₂² ‖1 − h*‖₂² ≥ c̃ d log^{2.5}(dn)/p_obs for a large enough constant c̃. Then for every question j ∈ [d] such that

(1 − h*_j)² ‖q̃ − (1/2)1‖₂² ≥ 5 log^{2.5}(dn)/p_obs,   (13a)

we have

P([x̂_OBI-WAN]_j = x*_j) ≥ 1 − e^{−c_H log^{1.5}(dn)}.   (13b)

Please see Section 5.7 for the proof of this theorem. See also Theorem 4 in the sequel, which provides a matching lower bound (up to logarithmic factors) for the special case h* = 0.

We now provide some intuition about the OBI part of the OBI-WAN estimator, in the context of Theorem 3. For simplicity in this explanation, let us ignore the sample-splitting step (Step 0) of the OBI-WAN algorithm and assume that the OBI step (Step 1) is applied to the entire observed data Y. Then under the model C_Int, we can rewrite the observation matrix as

Y = p_obs (2q̃ − 1)(1 − h)^T diag(x*) + W,

where W is a "noise" matrix and p_obs (2q̃ − 1)(1 − h)^T diag(x*) is the "signal" in the observed data. In this representation, the signal is a matrix of rank one, its top left singular vector equals 2q̃ − 1 (up to scaling), and the "magnitude of the signal" is

|||p_obs (2q̃ − 1)(1 − h)^T diag(x*)|||_op = p_obs ‖2q̃ − 1‖₂ ‖1 − h‖₂.

Furthermore, we show in the proof of Theorem 3 that the "magnitude of the noise" is bounded as |||W|||_op ≤ c √(d p_obs) polylog(dn) with high probability. Consequently, when the magnitude of the signal exceeds that of the noise (the condition stated at the beginning of Theorem 3), the top left singular vector of Y approximately captures the ordering of the entries of (2q̃ − 1), which represent the worker abilities.

The following corollary now upper-bounds the Q*-loss for the OBI-WAN estimator under the intermediate model in the regime (R1) ∩ (R2):

Corollary 3.
For any Q* ∈ C_Int and any vector x* ∈ {−1, 1}^d, the estimate x̂_OBI-WAN has error at most

L_{Q*}(x̂_OBI-WAN, x*) ≤ c_U (1/(n p_obs)) log^{2.5} d,   (14)

with probability at least 1 − e^{−c_H log^{1.5}(dn)}.

Please see Section 5.8 for the proof of this corollary. A comparison with the lower bound of Theorem 1(b) reveals that the bound (14) is tight up to logarithmic factors.

3.3.2 Guarantees for OBI-WAN under the Dawid-Skene model

In this section, we present results on the performance of the OBI-WAN estimator under the Dawid-Skene model. Unlike the rest of the paper, in this section the simplicity of the model allows us to generalize in another direction: handling adversarial workers, that is, not being restricted to the regime (R1) and allowing q^DS_i < 1/2 for some workers i ∈ [n]. We introduce some additional notation. For the vector q^DS ∈ [0, 1]^n, we define two associated vectors q^DS+, q^DS− ∈ [0, 1]^n as q^DS+_i = max{q^DS_i, 1/2} and q^DS−_i = min{q^DS_i, 1/2} for every i ∈ [n]. Then we have (q^DS − (1/2)1) = (q^DS+ − (1/2)1) + (q^DS− − (1/2)1), with q^DS+ representing normal workers and q^DS− representing adversarial workers who are more inclined to provide incorrect answers. The following result holds in the regime (R2):

Theorem 4. Consider any Dawid-Skene matrix of the form Q* = q^DS 1^T for some q^DS ∈ [0, 1]^n. Then:

(a) If ‖q^DS+ − (1/2)1‖₂ ≥ ‖q^DS− − (1/2)1‖₂ + √(4 log^{2.5}(dn)/p_obs) and (q^DS − (1/2)1)^T 1 ≥ 0, then for any x* ∈ {−1, 1}^d, the OBI-WAN estimator satisfies

P(x̂_OBI-WAN = x*) ≥ 1 − e^{−c_H log^{1.5}(dn)}.   (15a)

(b) Conversely, there exists a positive universal constant c such that for any q^DS ∈ [1/10, 9/10]^n with ‖q^DS − (1/2)1‖₂ ≤ √(c/p_obs), any estimator x̂ has (normalized) Hamming error at least

sup_{x* ∈ {−1,1}^d} E[ (1/d) Σ_{i=1}^d 1{x̂_i ≠ x*_i} ] ≥ 1/10.
(15b)

The proofs of the two parts of Theorem 4 are provided in Sections 5.9 and 5.10. A couple of remarks are in order. For the following discussion, consider the two mild conditions ‖q^DS+ − (1/2)1‖₂ ≥ 1.01 ‖q^DS− − (1/2)1‖₂ and (q^DS − (1/2)1)^T 1 > 0. We claim that under these mild conditions, the OBI-WAN estimator is optimal up to logarithmic factors. To see this, first observe that the lower bound in Theorem 4(b) implies that for any non-trivial recovery guarantee to hold, it must be the case that ‖q^DS − (1/2)1‖₂ > √(c/p_obs) for some positive universal constant c. Now suppose that ‖q^DS − (1/2)1‖₂ > √(c' log^{2.5}(dn)/p_obs) for a large enough positive constant c'; observe that this condition is only a logarithmic factor away from the necessary condition. Then under the aforementioned mild conditions, we have ‖q^DS+ − (1/2)1‖₂ ≥ ‖q^DS− − (1/2)1‖₂ + √(4 log^{2.5}(dn)/p_obs). Part (a) of Theorem 4 then guarantees that the OBI-WAN estimator recovers the true answers x* with high probability.

Secondly, one application of Theorem 4 is to the setting that has been the focus of our paper, where there are no adversarial workers. In this case, we have q^DS− − (1/2)1 = 0 and q^DS+ = q^DS, and the upper and lower bounds match up to a logarithmic factor. The upper bound shows that when ‖q^DS − (1/2)1‖₂ ≥ √(4 log^{2.5}(dn)/p_obs), the Hamming error is vanishingly small, whereas the lower bound shows that there is a universal constant c such that the Hamming error is essentially as large as possible when ‖q^DS − (1/2)1‖₂ ≤ √(c/p_obs).

3.3.3 Guarantees for OBI-WAN under the permutation-based model

The previous two subsections provided strong guarantees for OBI-WAN for exact recovery and the Q*-loss under the Dawid-Skene and intermediate models. A natural question that then arises is how robust OBI-WAN is to mismatches with respect to the Dawid-Skene and intermediate models.
We analyze OBI-WAN under the considerably richer permutation-based model class in this section.

Proposition 1. Consider any matrix Q* ∈ C_Perm and any binary vector x* ∈ {−1, 1}^d. For every question j ∈ [d] such that

Σ_{i=1}^n (Q*_ij − 1/2) ≥ (3/4) √(n/p_obs) log^{1.5}(dn),   (16a)

the OBI-WAN estimator satisfies

P([x̂_OBI-WAN]_j = x*_j) ≥ 1 − e^{−c_H log^{1.5}(dn)}.   (16b)

Consequently, for any Q* ∈ C_Perm and any x* ∈ {−1, 1}^d, with probability at least 1 − e^{−c_H log^{1.5}(dn)}, the estimator incurs a Q*-loss of at most

L_{Q*}(x̂_OBI-WAN, x*) ≤ c_U (1/√(n p_obs)) log d.   (17)

See Section 5.11 for the proof of this result.

It is well known that the majority voting estimator is highly robust to model misspecification (see, for instance, the discussion in [14, Section 4.2]); this robustness perhaps underlies its popularity in practice. In the appendix, we show that the majority voting estimator achieves a rate Ω(1/√(n p_obs)) in terms of the Q*-loss in the worst case over the permutation-based model. Thus the guarantee (17) for OBI-WAN matches the lower bound for majority voting. Importantly, however, in addition to this guarantee over C_Perm, the OBI-WAN estimator simultaneously achieves the strong exact-recovery and Q*-loss guarantees of Theorem 3, Corollary 3, and Theorem 4 over the simpler models C_Int and C_DS.

4 Experiments

In this section, we report the results of a suite of experiments, on both synthetic and real-world data, to evaluate the OBI-WAN estimator introduced in Section 3.3. We compare OBI-WAN to the Spectral-EM estimator due to Zhang et al. [48], which, to the best of our knowledge, has the strongest established guarantees in the literature. For the Spectral-EM estimator, we used an implementation provided by the authors of the paper [48].
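While the Spectral-EM implementation itself is not reproduced here, the evaluation protocol used throughout the experiments (draw data from a Q* matrix, run an estimator, compute the normalized Hamming error) is simple to sketch. The code below is our own illustration, with a plain majority vote as a stand-in estimator; it is not the paper's experiment harness.

```python
import numpy as np

def hamming_error(x_hat, x_true):
    """Hamming error normalized by the number of questions."""
    return np.mean(x_hat != x_true)

def evaluate(estimator, Q, x_true, p_obs, trials, rng):
    """Average normalized Hamming error of `estimator` over independent
    draws from a given Q* matrix (Dawid-Skene or otherwise)."""
    n, d = Q.shape
    errs = []
    for _ in range(trials):
        correct = rng.random((n, d)) < Q                  # correct answer?
        Y = np.where(correct, x_true[None, :], -x_true[None, :])
        Y = Y * (rng.random((n, d)) < p_obs)              # sparsify
        errs.append(hamming_error(estimator(Y), x_true))
    return float(np.mean(errs))

# Majority vote as a stand-in estimator (ties broken toward +1).
majority = lambda Y: np.where(Y.sum(axis=0) >= 0, 1, -1)

rng = np.random.default_rng(0)
n, d = 100, 100
q = np.full(n, 0.5)
q[: n // 2] = 0.9                   # a configuration like setting (a) below
x_true = rng.choice([-1, 1], size=d)
err = evaluate(majority, np.tile(q[:, None], (1, d)), x_true, 1.0, 20, rng)
```

In this easy configuration the average error should be essentially zero, matching the behavior reported for both estimators in panel (a) of the simulations.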
The code for the OBI-WAN estimator, as well as the constituent WAN estimator, is freely available on the first author's website.

4.1 Simulations

We first conduct synthetic simulations to evaluate various aspects of the algorithms. We conduct six sets of simulations, as detailed below. The results from our simulations are plotted in Figure 1. The plots in the six panels (a) through (f) of the figure are discussed below.

(a) Easy: Q* = q^DS 1^T ∈ C_DS where q^DS_i = 9/10 if i < n/2, and q^DS_i = 1/2 otherwise. The parameter n is varied, and the regime of operation is (d = n, p_obs = 1). In this setting, both estimators correctly recover x*.

[Figure 1 appears here, with six panels: (a) Easy, (b) Few smart, (c) Adversarial, (d) In C_Perm \ C_Int, (e) Minimax lower bound, (f) Super sparse.]

Figure 1: Results from numerical simulations comparing the Spectral-EM algorithm (circular markers) with OBI-WAN (square markers). The plots in panels (a)-(d) measure the Q*-loss as a function of n, and the plots in panels (e)-(f) measure the Q*-loss as a function of p_obs. Each point is an average over 20 trials. Recall that when Q* follows the Dawid-Skene model, as in panels (a)-(c) and (e)-(f), the Hamming error is proportional to the Q*-loss. Also note that the Y-axis of panel (d) is plotted on a logarithmic scale.

(b) Few smart: Q* = q^DS 1^T ∈ C_DS where q^DS_i = 9/10 if i < √n, and q^DS_i = 1/2 otherwise. The parameter n is varied, and the regime of operation is (d = n, p_obs = 1). Even though the data is drawn from the Dawid-Skene model, the error of Spectral-EM is much higher than that of the OBI-WAN estimator.
Recall that the OBI-WAN estimator has recovery guarantees over the entire Dawid-Skene class, unlike the estimators in the prior literature.

(c) Adversarial: Q* = q^DS 1^T ∈ C_DS where q^DS_i = 9/10 if i < n/4 + √n, q^DS_i = 1/10 if i > 3n/4, and q^DS_i = 1/2 otherwise. The parameter n is varied, and the regime of operation is (d = n, p_obs = 1). This set of simulations moves beyond the assumption that the entries of Q* are lower bounded by 1/2, and allows for adversarial workers. The OBI-WAN estimator is successful in such a setting as well.

(d) In C_Perm but outside C_Int: Q*_ij = 9/10 if (i < √n or j < d/2), and Q*_ij = 1/2 otherwise. The parameter n is varied, and the regime of operation is (d = n, p_obs = 1). Here we have Q* ∈ C_Perm \ C_Int. The Q*-loss incurred by the OBI-WAN estimator decays as 1/√n, whereas the Q*-loss of Spectral-EM remains constant.

(e) Minimax lower bound: Q* = q^DS 1^T ∈ C_DS where q^DS_i = 9/10 if i ≤ 5/p_obs and q^DS_i = 1/2 otherwise. The parameter p_obs is varied, and the regime of operation is (d = 1000, n = 1000). This setting is the source of the minimax lower bound of Theorem 1(b). The error of both estimators, in this case, behaves in an almost identical manner, with a scaling of 1/p_obs.

(f) Super sparse: Q* = q^DS 1^T ∈ C_DS where q^DS_i = 9/10 if i ≤ n/10 and q^DS_i = 1/2 otherwise. The parameter p_obs is varied, and the regime of operation is (d = 1000, n = 1000). We see that the OBI-WAN estimator incurs a relatively higher error when the data is very sparse. More generally, we have observed a higher error when p_obs = o(log²(dn)/n), and this gap is also reflected in our upper bounds for the OBI-WAN estimator in Theorem 3 and Theorem 4(a), which are loose by precisely a polylogarithmic factor as compared to the associated lower bounds.
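Each of the six Q* configurations above is a one-liner to construct. As one example, setting (d), which lies in C_Perm but outside C_Int, can be generated as follows (our illustrative code, not the paper's simulation harness):

```python
import numpy as np

def q_perm_not_int(n, d):
    """Setting (d): Q*_ij = 9/10 if (i < sqrt(n) or j < d/2), else 1/2.

    Rows and columns are monotone under the identity orderings, so the
    matrix lies in C_Perm, but it has no rank-one structure of the
    C_Int form q(1-h)^T + (1/2)1h^T.
    """
    i = np.arange(n)[:, None]
    j = np.arange(d)[None, :]
    return np.where((i < np.sqrt(n)) | (j < d // 2), 0.9, 0.5)

Q = q_perm_not_int(100, 100)
```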
The relative benefits and disadvantages of the proposed OBI-WAN estimator, as observed from the simulations, may be summarized as follows. In terms of limitations, the error of OBI-WAN is higher than that of prior works when p_obs is small (as observed in the super-sparse case) or when n and d are small (for instance, less than 200). On the positive side, the simulations reveal that the OBI-WAN estimator leads to accurate estimates in a variety of settings, providing guarantees over the C_DS and C_Int classes, and demonstrating significant robustness in more general settings in comparison to the best known estimator in the literature.

4.2 Real-world crowdsourcing data

In this section we describe a set of six experiments conducted using real-world data from the Amazon Mechanical Turk crowdsourcing platform, ranging from visual recognition to knowledge elicitation. The experiments involved more than 200 workers in total. In each experiment, workers are asked to answer a number of questions, and we then employ statistical aggregation algorithms to estimate the ground-truth answers. The results of these experiments are shown in Figure 2. As before, we compare the OBI-WAN estimator with the Spectral-EM estimator [48]. We now describe more details regarding the experiments. The error bars in each figure represent the standard error of the mean. The plots and error bars shown in each figure are obtained via 300 iterations per experiment of subsampling the workers' answers with p_obs = 1/3 and executing the two algorithms on the subsampled data.

(a) Dysplasia: The experiment showed workers 48 pictures of (biological) cells. They had to classify each image as either "mild dysplasia", where the ratio of the nucleus' area to the cytoplasm's area is less than 33%, or "severe dysplasia", where this ratio is more than 33%.
The images and the ground truth were obtained from the DTU/Herlev Pap Smear Database 2005 [20]. We collected responses from a total of 41 workers on Amazon Mechanical Turk, and Figure 2a depicts the results of this experiment. The data is freely available for download on the first author's website.

(b) Bridges: The experiment was based on 21 images of bridges, and the task for any worker was to classify each image as either the golden bridge or not. The data for this experiment was collected in past work [38] from a total of 35 workers on Amazon Mechanical Turk. Figure 2b depicts the results from this data.

[Figure 2 appears here, with six panels: (a) Dysplasia, (b) Bridges, (c) Dogs, (d) Flags, (e) People, (f) Textures.]

Figure 2: Results from experiments on the Amazon Mechanical Turk crowdsourcing platform comparing the Spectral-EM algorithm (left bars) with OBI-WAN (right bars). The y-axes of the plots are on a logarithmic scale and represent the Hamming error normalized by the number of questions (lower is better).

(c) Dogs: The experiment comprised 85 images of dogs (from [24, 7]), and each worker was asked to identify the breed of each dog from ten provided options. The data was collected in past work [38] from a total of 35 workers on Amazon Mechanical Turk. The data is converted to binary-choice form by choosing one breed uniformly at random in each iteration and considering the binary-choice task of identifying whether or not the dogs belong to this breed. Figure 2c plots the results of the experiments.

(d) Flags: In this experiment, each worker was shown a series of 126 flags.
Each question required the worker to identify whether a displayed flag belonged to a place in Africa, Asia/Oceania, Europe, or none of these. We use the data collected in past work [38], which contains responses of 35 workers to all of the questions. We convert this data into binary-choice format in the same manner as in the dogs experiment above. Finally, we plot the results from this experiment in Figure 2d.

(e) People: In this experiment, the names of 20 personalities were provided, and the workers were asked to classify whether each was ever the President of the USA, the President of India, the Prime Minister of Canada, or none of these. Responses from 35 workers were collected in past work [38], and we convert this data to binary choice as per the aforementioned procedure. Figure 2e plots the results of this experiment.

(f) Textures: As a final experiment, we evaluated the performance of the algorithms on workers' classification of textures. Specifically, workers were asked to classify 24 images from the dataset [26, Dataset 1: Textured surfaces] into one of eight possible textures. We use the responses of the workers collected in [38], with a conversion to binary choice as described above. The aggregate results from executing the two algorithms are depicted in Figure 2f.

All in all, the experiments reveal that OBI-WAN compares favorably to Spectral-EM.

5 Proofs

In this section, we present the proofs of our main theoretical results. In the proofs, we use c, c₁, c', etc. to denote positive universal constants, and we ignore floors and ceilings unless critical to the proof. We assume that n and d are greater than some universal constants; the case of smaller values of these parameters is then directly implied by only changing the constant prefactors.

5.1 Proof of Theorem 1(a): Minimax upper bound

We begin by proving the minimax upper bound stated in part (a) of Theorem 1.
The proof is divided into two parts: in the first part, we obtain an upper bound on the error term |||(2Q* − 11^T) diag(x*) − (2Q̃_LS − 11^T) diag(x̃_LS)|||_F², following which we convert this bound into one on L_{Q*}(x*, x̃_LS). The first part follows along the lines of the proof in our previous work [34, Theorem 1(a)]. In relation to our present problem, one can think of the setting of [34] as that of estimating the matrix (2Q* − 11^T) when the value of x* is known. The primary additional challenge in the first part of the present proof is to accommodate the additional uncertainty about x*.

We begin with the first part of the proof, where we bound the error in estimating the product term (2Q* − 11^T) diag(x*). Let us rewrite our observation model in a "linearized" fashion that is convenient for subsequent analysis. In particular, let us define a random matrix W ∈ R^{n×d} with entries independently drawn from the distribution

W_ij =
  1 − p_obs (2Q*_ij − 1) x*_j    w.p. p_obs [ Q*_ij (1 + x*_j)/2 + (1 − Q*_ij)(1 − x*_j)/2 ],
  −1 − p_obs (2Q*_ij − 1) x*_j   w.p. p_obs [ Q*_ij (1 − x*_j)/2 + (1 − Q*_ij)(1 + x*_j)/2 ],
  −p_obs (2Q*_ij − 1) x*_j       w.p. 1 − p_obs,   (18)

where "w.p." is shorthand for "with probability". One can verify that E[W] = 0, that every entry of W is bounded by 2 in absolute value, and moreover that our observed matrix Y can be written in the form

(1/p_obs) Y = (2Q* − 11^T) diag(x*) + (1/p_obs) W.   (19)

Let Π_n denote the set of all permutations of the n workers, and let Σ_d denote the set of all permutations of the d questions. For any pair of permutations (π, σ) ∈ Π_n × Σ_d, define the set

C_Perm(π, σ) := { Q ∈ [0, 1]^{n×d} | Q_ij ≥ Q_i'j' whenever π(i) ≤ π(i') and σ(j) ≤ σ(j') },

corresponding to the subset of C_Perm consisting of matrices that are faithful to the permutations π and σ.
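The decomposition (19) says in particular that E[Y] = p_obs (2Q* − 11^T) diag(x*), that is, that W = Y − E[Y] has zero-mean entries. This is easy to confirm by simulation; the check below is our own illustration and is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p_obs = 4, 3, 0.6
Q = rng.uniform(0.5, 1.0, size=(n, d))       # a Q* with entries >= 1/2
x = rng.choice([-1, 1], size=d)              # true answers x*

# Draw T independent copies of the observation matrix Y.
T = 200_000
correct = rng.random((T, n, d)) < Q          # does each worker answer correctly?
Y = np.where(correct, x, -x)                 # entries in {-1, +1}
Y = Y * (rng.random((T, n, d)) < p_obs)      # zero out unobserved entries

emp_mean = Y.mean(axis=0)
signal = p_obs * (2 * Q - 1) * x[None, :]    # p_obs (2Q* - 11^T) diag(x*)
max_dev = np.abs(emp_mean - signal).max()    # should be O(1/sqrt(T))
```

With T = 200,000 draws, the empirical mean of each entry matches the signal term to within a few multiples of 1/√T.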
For any fixed x ∈ {−1, 1}^d, π ∈ Π_n and σ ∈ Σ_d, define the matrix

Q̃(π, σ, x) ∈ arg min_{Q ∈ C_Perm(π,σ)} C(Q, x),  where C(Q, x) := |||(1/p_obs) Y − (2Q − 11^T) diag(x)|||_F².

Using this notation, we can rewrite the least-squares estimator (6) in the compact form

(x̃_LS, π̃_LS, σ̃_LS) ∈ arg min_{(π,σ) ∈ Π_n × Σ_d, x ∈ {−1,1}^d} C(Q̃(π, σ, x), x),  and  Q̃_LS = Q̃(π̃_LS, σ̃_LS, x̃_LS).

For the purposes of analysis, let us define the set

P := { (π, σ, x) ∈ Π_n × Σ_d × {−1, 1}^d | C(Q̃(π, σ, x), x) ≤ C(Q*, x*) }.   (20)

With this setup, we claim that it suffices to show the following: fix a triplet (π, σ, x) ∈ P; for this fixed triplet, there is a universal constant c₁ such that

P[ |||(2Q̃(π, σ, x) − 11^T) diag(x − x*)|||_F² ≤ c₁ (d/p_obs) log² d ] ≥ 1 − e^{−4d log(dn)}.   (21)

Given this bound, since the cardinality of the set P is upper bounded by e^{3d log d} (since d ≥ n), a union bound over all these permutations applied to (21) yields

P[ max_{(π,σ,x) ∈ P} |||(2Q̃(π, σ, x) − 11^T) diag(x − x*)|||_F² ≤ c₁ (d log² d)/p_obs ] ≥ 1 − e^{−d log(dn)}.

The set P is guaranteed to be non-empty since the true permutations π* and σ* corresponding to Q*, together with the true answer x*, always lie in P, and consequently, the above tail bound yields the claimed result. The remainder of our analysis is devoted to proving the bound (21).

Given any triplet (π, σ, x) ∈ P, we define the matrices

V* := (2Q* − 11^T) diag(x*),  and  Ṽ(π, σ, x) := (2Q̃(π, σ, x) − 11^T) diag(x).

Henceforth, for brevity, we refer to the matrix Ṽ(π, σ, x) simply as Ṽ and the matrix Q̃(π, σ, x) simply as Q̃, since the values of the associated quantities (π, σ, x) are fixed and clear from context.
Since (π, σ, x) ∈ P, the definition of the set P in (20) yields the inequality

|||(1/p_obs) Y − (2Q̃(π, σ, x) − 11^T) diag(x)|||_F² ≤ |||(1/p_obs) Y − (2Q* − 11^T) diag(x*)|||_F².

Substituting the expression (1/p_obs) Y = (2Q* − 11^T) diag(x*) + (1/p_obs) W from (19), we obtain the relations

|||(2Q* − 11^T) diag(x*) + (1/p_obs) W − (2Q̃(π, σ, x) − 11^T) diag(x)|||_F²
  ≤ |||(2Q* − 11^T) diag(x*) + (1/p_obs) W − (2Q* − 11^T) diag(x*)|||_F² = |||(1/p_obs) W|||_F².

Using the expansion (a + b)² = a² + b² + 2ab, and substituting the expressions for V* and Ṽ, we obtain the following basic inequality:

(1/2) |||V* − Ṽ|||_F² ≤ (1/p_obs) ⟨⟨Ṽ − V*, W⟩⟩.   (22)

The following lemma uses this inequality to obtain an upper bound on the quantity (1/2) |||V* − Ṽ|||_F².

Lemma 1. There exists a universal constant c₁ > 0 such that

P[ |||V* − Ṽ|||_F² ≤ c₁ (d log² d)/p_obs ] ≥ 1 − e^{−4d log(dn)}.   (23)

See Section 5.1.1 for the proof of this lemma. This completes the first part of the proof.

In the second part of the proof, we now convert our bound (23) on the Frobenius norm |||V* − Ṽ|||_F into one on the error in estimating x* under the Q*-loss. The following lemma is useful for this conversion:

Lemma 2. For any pair of matrices A₁, A₂ ∈ R^{n×d}_+ and any pair of vectors v₁, v₂ ∈ {−1, 1}^d, we have

|||A₁ diag(v₁ − v₂)|||_F² ≤ 4 |||A₁ diag(v₁) − A₂ diag(v₂)|||_F².   (24)

See Section 5.1.2 for the proof of this claim. Recall our assumption that every entry of the matrices Q* and Q̃ is at least 1/2.
Consequently, we can apply Lemma 2 with A₁ = (Q* − (1/2)11^T), A₂ = (Q̃ − (1/2)11^T), v₁ = x* and v₂ = x to obtain the inequality

|||(Q* − (1/2)11^T) diag(x* − x)|||_F² ≤ 4 |||(Q* − (1/2)11^T) diag(x*) − (Q̃ − (1/2)11^T) diag(x)|||_F² = |||V* − Ṽ|||_F².   (25)

Coupled with Lemma 1, this bound yields the desired result (21).

5.1.1 Proof of Lemma 1

Our proof of this lemma closely follows along the lines of the proof of a related result in our past work [34]. Denote the error in the estimate as Δ̂ := Ṽ − V*. Then from the inequality (22), we have

(1/2) |||Δ̂|||_F² ≤ (1/p_obs) ⟨⟨W, Δ̂⟩⟩.   (26)

For the quadruplet (π, σ, x, V*) under consideration, define the set

V_DIFF(π, σ, x, V*) := { α(V − V*) | V = (2Q − 11^T) diag(x), Q ∈ C_Perm(π, σ), α ∈ [0, 1] }.

Since the terms π, σ, x and V* are fixed for the purposes of this proof, we use the abbreviated notation V_DIFF for V_DIFF(π, σ, x, V*). For each choice of radius t > 0, define the random variable

Z(t) := sup_{D ∈ V_DIFF, |||D|||_F ≤ t} (1/p_obs) ⟨⟨D, W⟩⟩.   (27a)

Using the basic inequality (26), the Frobenius norm error |||Δ̂|||_F then satisfies the bound

(1/2) |||Δ̂|||_F² ≤ (1/p_obs) ⟨⟨W, Δ̂⟩⟩ ≤ Z(|||Δ̂|||_F).   (27b)

Thus, in order to obtain a high-probability bound, we need to understand the behavior of the random quantity Z(t). One can verify that the set V_DIFF is star-shaped, meaning that αD ∈ V_DIFF for every α ∈ [0, 1] and every D ∈ V_DIFF. Using this star-shaped property, we are guaranteed that there is a non-empty set of scalars δ₀ > 0 satisfying the critical inequality

E[Z(δ₀)] ≤ δ₀²/2.
(27c)

Our interest is in an upper bound on the smallest (strictly) positive solution δ₀ to the critical inequality (27c); moreover, our goal is to show that for every t ≥ δ₀, we have |||Δ̂|||_F ≤ c√(tδ₀) with high probability. Define a "bad" event

A_t := { ∃ Δ ∈ V_DIFF such that |||Δ|||_F ≥ √(tδ₀) and (1/p_obs) ⟨⟨Δ, W⟩⟩ ≥ 2 |||Δ|||_F √(tδ₀) }.   (28)

Now suppose that the event A_t is true for some t ≥ δ₀, and let Δ' ∈ V_DIFF be a matrix that satisfies the two conditions required for A_t to occur. Since V_DIFF is star-shaped, the function Z(t) grows at most linearly in t. Consequently, whenever the event A_t is true, we have |||Δ'|||_F ≥ δ₀ and hence

Z(δ₀) ≥ (δ₀ / |||Δ'|||_F) Z(|||Δ'|||_F) ≥^{(i)} (δ₀ / |||Δ'|||_F) (1/p_obs) ⟨⟨Δ', W⟩⟩ ≥^{(ii)} 2δ₀ √(tδ₀),

where inequality (i) follows from the definition of the function Z, and inequality (ii) uses the second condition in the definition of the event A_t. As a consequence, we obtain the following bound on the probabilities of the associated events:

P[A_t] ≤ P[Z(δ₀) ≥ 2δ₀ √(tδ₀)]  for all t ≥ δ₀.

The following lemma helps control the behavior of the random variable Z(δ₀).

Lemma 3. For any δ > 0, the mean of Z(δ) is upper bounded as

E[Z(δ)] ≤ c₁ ((n + d)/p_obs) log²(nd),   (29a)

and for every u > 0, its tail probability is bounded as

P[Z(δ) > E[Z(δ)] + u] ≤ exp( −c₂ p_obs u² / (δ² + E[Z(δ)] + u) ),   (29b)

where c₁ and c₂ are positive universal constants.

See Section 5.1.3 for the proof of this lemma. Setting u = δ₀√(tδ₀) in the tail bound (29b), we find that

P[Z(δ₀) > E[Z(δ₀)] + δ₀√(tδ₀)] ≤ exp( −c₂ p_obs (δ₀√(tδ₀))² / (δ₀² + E[Z(δ₀)] + δ₀√(tδ₀)) ),  for all t > 0.
By the definition of $\delta_0$ in (27c), we have $\mathbb{E}[Z(\delta_0)] \le \delta_0^2 \le \delta_0\sqrt{t\delta_0}$ for any $t \ge \delta_0$, and with these relations we obtain the bound
$$\mathbb{P}[\mathcal{A}_t] \le \mathbb{P}\big[ Z(\delta_0) \ge 2\delta_0\sqrt{t\delta_0} \big] \le \exp\Big( - \frac{c_2}{3}\, \delta_0\sqrt{t\delta_0}\, p_{obs} \Big), \quad \text{for all } t \ge \delta_0.$$
Consequently, either $|||\widehat{\Delta}|||_F \le \sqrt{t\delta_0}$, or we have $|||\widehat{\Delta}|||_F > \sqrt{t\delta_0}$. In the latter case, conditioning on the complement $\mathcal{A}_t^c$, our basic inequality implies that $\tfrac{1}{2}|||\widehat{\Delta}|||_F^2 \le 2|||\widehat{\Delta}|||_F\sqrt{t\delta_0}$, and hence $|||\widehat{\Delta}|||_F \le 4\sqrt{t\delta_0}$. Putting together the pieces yields
$$\mathbb{P}\big[ |||\widehat{\Delta}|||_F \le 4\sqrt{t\delta_0} \big] \ge 1 - \exp\Big( - \frac{c_2}{3}\, \delta_0\sqrt{t\delta_0}\, p_{obs} \Big), \quad \text{valid for all } t \ge \delta_0. \qquad (30)$$
Finally, from the bound on the expected value of $Z(t)$ in Lemma 3, we see that the critical inequality (27c) is satisfied for $\delta_0 = \sqrt{\frac{2 c_1 (n+d)}{p_{obs}}}\,\log(nd)$. Setting $t = \delta_0 = \sqrt{\frac{2 c_1 (n+d)}{p_{obs}}}\,\log(nd)$ in equation (30) yields the claimed result.

5.1.2 Proof of Lemma 2

Consider any four scalars $a_1 \ge 0$, $a_2 \ge 0$, $b_1 \in \{-1, 1\}$ and $b_2 \in \{-1, 1\}$. If $b_1 = b_2$, then $(a_1 b_1 - a_1 b_2)^2 = 0 \le (a_1 b_1 - a_2 b_2)^2$. Otherwise we have $b_1 = -b_2$. In this case, since $a_1$ and $a_2$ have the same sign,
$$(a_1 b_1 - a_2 b_2)^2 \ge (a_1 b_1)^2 = \tfrac{1}{4}(a_1 b_1 - a_1 b_2)^2.$$
The two results above in conjunction yield the inequality $\big(a_1(b_1 - b_2)\big)^2 \le 4\,(a_1 b_1 - a_2 b_2)^2$. Applying this argument to each entry of the matrices $A_1\,\mathrm{diag}(v_1 - v_2)$ and $\big(A_1\,\mathrm{diag}(v_1) - A_2\,\mathrm{diag}(v_2)\big)$ yields the claim.

5.1.3 Proof of Lemma 3

We need to prove the upper bound (29a) on the mean, as well as the tail bound (29b).

Upper bounding the mean: We upper bound the mean by using Dudley's entropy integral, along with some auxiliary results on metric entropy. Given a set $\mathbb{C}$ equipped with a metric $\rho$ and a tolerance parameter $\epsilon \ge 0$, we let $\log N(\epsilon, \mathbb{C}, \rho)$ denote the $\epsilon$-metric entropy of the class $\mathbb{C}$ in the metric $\rho$.
With this notation, the truncated form of Dudley's entropy integral inequality² yields
$$\mathbb{E}[Z(\delta)] \le \frac{c}{p_{obs}} \Big\{ n d^{-8} + \int_{\frac{1}{2}d^{-9}}^{2\sqrt{nd}} \sqrt{\log N\big(\epsilon, \mathbb{V}_{\mathrm{DIFF}}, |||\cdot|||_F\big)}\; (\Delta\epsilon) \Big\}. \qquad (31)$$
The upper limit of $2\sqrt{nd}$ in the integration is due to the fact that $|||D|||_F \le 2\sqrt{nd}$ for every $D \in \mathbb{V}_{\mathrm{DIFF}}$. It is known [34] that the metric entropy of the set $\mathbb{V}_{\mathrm{DIFF}}$ is upper bounded as
$$\log N\big(\epsilon, \mathbb{V}_{\mathrm{DIFF}}, |||\cdot|||_F\big) \le \frac{8 \max\{n,d\}^2}{\epsilon^2} \Big( \log \frac{\max\{n,d\}}{\epsilon} \Big)^2 \quad \text{for each } \epsilon > 0.$$
Combining this upper bound with the Dudley entropy integral (31), and observing that the integration has $\epsilon \ge \frac{1}{2}d^{-9}$, the claimed upper bound (29a) follows.

²Here we use $(\Delta\epsilon)$ to denote the differential of $\epsilon$, so as to avoid confusion with the number of questions $d$.

Bounding the tail probability of $Z(\delta)$: In order to establish the claimed tail bound (29b), we use a Bernstein-type bound on the supremum of empirical processes due to Klein and Rio [25, Theorem 1.1c]. In particular, this result applies to a random variable of the form $X^\dagger = \sup_{v \in \mathcal{V}} \langle X, v\rangle$, where $X = (X_1, \dots, X_m)$ is a vector of independent random variables taking values in $[-1,1]$, and $\mathcal{V}$ is some subset of $[-1,1]^m$. Their theorem guarantees that for any $t > 0$,
$$\mathbb{P}\big[ X^\dagger > \mathbb{E}[X^\dagger] + t \big] \le \exp\Big( - \frac{t^2}{2 \sup_{v\in\mathcal{V}} \mathbb{E}[\langle v, X\rangle^2] + 4\,\mathbb{E}[X^\dagger] + 3t} \Big). \qquad (32)$$
In our setting, we apply this tail bound with the choices $X = \frac{1}{2}W$ and
$$X^\dagger = \tfrac{1}{2} \sup_{D \in \mathbb{V}_{\mathrm{DIFF}},\, |||D|||_F \le \delta} \langle\langle D, W\rangle\rangle = \tfrac{1}{2}\, p_{obs}\, Z(\delta).$$
The entries of the matrix $W$ are independently distributed with a mean of zero and a variance of at most $4 p_{obs}$, and are bounded in absolute value by 2. As a result, we have $\mathbb{E}[\langle\langle D, W\rangle\rangle^2] \le 4 p_{obs} |||D|||_F^2 \le 4 p_{obs} \delta^2$ for every $D \in \mathbb{V}_{\mathrm{DIFF}}$.
With these assignments, inequality (32) guarantees that
$$\mathbb{P}\big[ p_{obs} Z(\delta) > p_{obs}\mathbb{E}[Z(\delta)] + p_{obs}\, u \big] \le \exp\Big( - \frac{(u\, p_{obs})^2}{2 p_{obs}\delta^2 + 2 p_{obs}\mathbb{E}[Z(\delta)] + 3 u\, p_{obs}} \Big),$$
for all $u > 0$, and some algebraic simplifications yield the claimed result.

5.2 Proof of Theorem 1(b): Minimax lower bound

We now turn to the proof of the minimax lower bound. For a numerical constant $\delta \in (0, \frac{1}{4})$ whose precise value is determined later, define the vector $q^{\mathrm{DS}} \in [0,1]^n$ with entries
$$q^{\mathrm{DS}}_i = \begin{cases} \frac{1}{2} + \delta & \text{if } i \le \frac{1}{p_{obs}}, \\ \frac{1}{2} & \text{otherwise.} \end{cases} \qquad (33)$$
Set the probability matrix $Q^* \in [0,1]^{n\times d}$ as $Q^* = q^{\mathrm{DS}}\mathbf{1}^T$; observe that we then have $Q^* \in \mathbb{C}_{\mathrm{DS}}$. One may assume that the matrix $Q^*$ is known to the estimator under consideration. The Gilbert-Varshamov bound [16, 43] guarantees that for a universal constant $c > 0$, there is a collection of $\beta = \exp(cd)$ binary vectors $\{x^1, \dots, x^\beta\}$, all belonging to the Boolean hypercube $\{-1,1\}^d$, such that the normalized Hamming distance (1) between any pair of distinct vectors in this set is lower bounded as
$$d_{\mathrm{H}}(x^\ell, x^{\ell'}) \ge \tfrac{1}{10}, \quad \text{for every } \ell \ne \ell' \in [\beta].$$
For each $\ell \in [\beta]$, let $\mathbb{P}_\ell$ denote the probability distribution of $Y$ induced by setting $x^* = x^\ell$. For the choice of $Q^*$ specified in (33), following some algebra, we obtain an upper bound on the Kullback-Leibler divergence between any pair of distributions from this collection:
$$D_{\mathrm{KL}}(\mathbb{P}_\ell \,\|\, \mathbb{P}_{\ell'}) \le c'\, d\,\delta^2 \quad \text{for every } \ell \ne \ell' \in [\beta],$$
for another constant $c' > 0$. Combining the above observations with Fano's inequality [3] yields that any estimator $\widehat{x}$ has expected normalized Hamming error lower bounded as
$$\mathbb{E}[d_{\mathrm{H}}(\widehat{x}, x^*)] \ge \frac{1}{20}\Big( 1 - \frac{c'\, d\delta^2 + \log 2}{\log\beta} \Big).$$
Consequently, for the choice of $Q^*$ given by (33), the $Q^*$-loss is lower bounded as
$$\mathbb{E}[\mathcal{L}_{Q^*}(\widehat{x}, x^*)] = \frac{4\delta^2}{n\, p_{obs}}\, \mathbb{E}[d_{\mathrm{H}}(\widehat{x}, x^*)] \ge \frac{4\delta^2}{20\, n\, p_{obs}} \Big( 1 - \frac{c'\, d\delta^2 + \log 2}{c d} \Big) \overset{(i)}{\ge} \frac{c''}{n\, p_{obs}},$$
for some constant $c'' > 0$, as claimed. Here inequality (i) follows by setting $\delta$ to be a sufficiently small positive constant (depending on the values of $c$ and $c'$).

5.3 Proof of Corollary 1(a)

In the proof of Theorem 1(a), we showed that there is a constant $c_1 > 0$ such that
$$\big|\big|\big| (2Q^* - \mathbf{1}\mathbf{1}^T)\,\mathrm{diag}(x^*) - (2\widetilde{Q}_{\mathrm{LS}} - \mathbf{1}\mathbf{1}^T)\,\mathrm{diag}(\widetilde{x}_{\mathrm{LS}}) \big|\big|\big|_F^2 \le c_1 \frac{d}{p_{obs}} \log^2 d,$$
with probability at least $1 - e^{-d\log(dn)}$. Since all entries of the matrices $2Q^* - \mathbf{1}\mathbf{1}^T$ and $2\widetilde{Q}_{\mathrm{LS}} - \mathbf{1}\mathbf{1}^T$ are non-negative, and since every entry of the vectors $x^*$ and $\widetilde{x}_{\mathrm{LS}}$ lies in $\{-1,1\}$, some algebra yields the bound
$$\big( (2Q^*_{ij} - 1) - (2[\widetilde{Q}_{\mathrm{LS}}]_{ij} - 1) \big)^2 \le \big( (2Q^*_{ij} - 1)\,x^*_j - (2[\widetilde{Q}_{\mathrm{LS}}]_{ij} - 1)\,[\widetilde{x}_{\mathrm{LS}}]_j \big)^2, \quad \text{for every } i \in [n],\; j \in [d].$$
Combining these inequalities yields the claimed bound.

5.4 Proof of Corollary 1(b)

We begin by constructing a set, of cardinality $\beta$, of possible matrices $Q^*$, for some integer $\beta > 1$; we subsequently show that it is hard to identify the true matrix when it is drawn from this set. First, we define a $\beta$-sized collection of vectors $\{h^1, \dots, h^\beta\}$, all contained in the set $[\frac{1}{2}, 1]^d$, as follows. The Gilbert-Varshamov bound [16, 43] guarantees a constant $c \in (0,1)$ such that there exists a set of $\beta = \exp(cd)$ vectors $v^1, \dots, v^\beta \in \{-1,1\}^d$ with the property that the normalized Hamming distance (1) between any pair of these vectors is lower bounded as
$$d_{\mathrm{H}}(v^\ell, v^{\ell'}) \ge \tfrac{1}{10}, \quad \text{for every } \ell \ne \ell' \in [\beta].$$
Fixing some $\delta \in (0, \frac{1}{4})$, we define, for each $\ell \in [\beta]$, the vector $h^\ell \in \mathbb{R}^d$ with entries
$$[h^\ell]_j := \begin{cases} \frac{1}{2} + \delta & \text{if } [v^\ell]_j = 1, \\ \frac{1}{2} & \text{otherwise.} \end{cases}$$
For each $\ell \in [\beta]$, define the matrix $Q^\ell = \mathbf{1}(h^\ell)^T$, and let $\mathbb{P}_\ell$ denote the probability distribution of the observed data $Y$ induced by setting $Q^* = Q^\ell$ and $x^* = \mathbf{1}$. Since the entries of $Y$ are all independent, some algebra leads to the following upper bound on the Kullback-Leibler divergence between any pair of distributions from this collection:
$$D_{\mathrm{KL}}(\mathbb{P}_\ell \,\|\, \mathbb{P}_{\ell'}) \le 4\, p_{obs}\, n d\, \delta^2 \quad \text{for every } \ell \ne \ell' \in [\beta].$$
Moreover, a simple calculation shows that the squared Frobenius norm distance between any two matrices in this collection is lower bounded as
$$|||Q^\ell - Q^{\ell'}|||_F^2 \ge \tfrac{1}{10}\, d n\, \delta^2 \quad \text{for every } \ell \ne \ell' \in [\beta].$$
Combining the above observations with Fano's inequality [3] yields that any estimator $\widehat{Q}$ for $Q^*$ has mean squared error lower bounded as
$$\mathbb{E}\big[ |||Q^* - \widehat{Q}|||_F^2 \big] \ge \frac{1}{20}\, d n\, \delta^2 \Big( 1 - \frac{4\, p_{obs}\, d n\, \delta^2 + \log 2}{\log \beta} \Big) \ge c' \frac{d}{p_{obs}},$$
where we have set $\delta^2 = \frac{c''}{p_{obs}\, n}$ for a small enough positive constant $c''$, and where $c'$ is another positive constant whose value may depend only on $c$ and $c''$.

5.5 Proof of Theorem 2: WAN under the permutation-based model

Observe that the windowing step of the WAN estimator identifies a group of $k_{\mathrm{WAN}}$ workers such that their aggregate responses to questions are biased (towards either answer in $\{-1,1\}$) by at least $\sqrt{k_{\mathrm{WAN}}\, p_{obs} \log^{1.5}(dn)}$. We first derive three properties associated with having such a bias. These properties involve the function $\gamma_\pi : [n] \times [d] \times \{-1,1\} \to \mathbb{R}$, where $\gamma_\pi(k, j, x)$ represents the amount of bias in the responses of the top $k \in [n]$ workers for question $j \in [d]$ towards the answer $x \in \{-1,1\}$:
$$\gamma_\pi(k, j, x) := \sum_{i=1}^{k} \big( \mathbf{1}\{Y_{\pi^{-1}(i)j} = x\} - \mathbf{1}\{Y_{\pi^{-1}(i)j} = -x\} \big) = x \sum_{i=1}^{k} Y_{\pi^{-1}(i)j}.$$
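To make the bias statistic concrete, the following self-contained sketch simulates a crowd-labeling instance and aggregates the top-$k$ workers by the sign of $\gamma_\pi(k, j, \cdot)$. This is only an illustration on a hypothetical Dawid-Skene-style instance, not the authors' implementation; all variable names and parameter values here are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p_obs = 50, 200, 0.5

# Hypothetical instance: worker i answers question j correctly w.p. Q*_ij.
q_star = np.sort(rng.uniform(0.55, 0.95, size=n))[::-1]  # abilities, sorted
Q = np.tile(q_star[:, None], (1, d))
x_star = rng.choice(np.array([-1, 1]), size=d)           # true answers

correct = rng.random((n, d)) < Q                         # response correctness
observed = rng.random((n, d)) < p_obs                    # observation mask
Y = np.where(correct, x_star, -x_star) * observed        # entries in {-1, 0, 1}

def gamma(pi, k, j, x):
    """gamma_pi(k, j, x) = x * sum_{i=1}^k Y_{pi^{-1}(i), j}: the bias of the
    responses of the k highest-ranked workers towards answer x on question j."""
    return x * Y[pi[:k], j].sum()

pi = np.arange(n)  # pi[i] is the worker of rank i + 1 (true order here)
k = 20
# Aggregate the top-k workers on each question: the answer with larger bias wins.
x_hat = np.array([1 if gamma(pi, k, j, 1) >= 0 else -1 for j in range(d)])
accuracy = float((x_hat == x_star).mean())
```

With a reasonably able top window, the aggregate recovers almost all answers; the windowing step of WAN then searches over $k$ for the window that certifies the most questions.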
A straightforward application of the Bernstein inequality [2], using the facts that the entries of the observed matrix $Y$ are all independent, with moments bounded as $\mathbb{E}[Y_{ij}] = 2 p_{obs}(Q^*_{ij} - \frac{1}{2})\, x^*_j$ and $\mathbb{E}[Y_{ij}^2] = p_{obs}$, ensures that all three properties stated below are satisfied, with probability at least $1 - e^{-c\log^{1.5}(dn)}$, for every question $j \in [d]$ and every $k \in \{p_{obs}^{-1}\log^{1.5}(dn), \dots, n\}$. For the remainder of the proof we work conditioned on the event that the following properties hold:

(P1) Sufficient condition for bias towards the correct answer: If $\sum_{i=1}^{k}\big(Q^*_{\pi^{-1}(i)j} - \frac{1}{2}\big) \ge \frac{3}{4}\sqrt{\frac{k \log^{1.5}(dn)}{p_{obs}}}$, then $\gamma_\pi(k, j, x^*_j) \ge \sqrt{k\, p_{obs} \log^{1.5}(dn)}$.

(P2) Necessary condition for bias towards any answer $x \in \{-1,1\}$: $\gamma_\pi(k, j, x) \ge \sqrt{k\, p_{obs} \log^{1.5}(dn)}$ only if $x = x^*_j$ and $\sum_{i=1}^{k}\big(Q^*_{\pi^{-1}(i)j} - \frac{1}{2}\big) \ge \frac{1}{4}\sqrt{\frac{k \log^{1.5}(dn)}{p_{obs}}}$.

(P3) Sufficient condition for the aggregate to be correct: If $\sum_{i=1}^{k}\big(Q^*_{\pi^{-1}(i)j} - \frac{1}{2}\big) \ge \frac{1}{4}\sqrt{\frac{k \log^{1.5}(dn)}{p_{obs}}}$, then $\gamma_\pi(k, j, x^*_j) > 0$.

We now show that when these three properties hold, we must have $[\widehat{x}_{\mathrm{WAN}}(\pi)]_{j_0} = x^*_{j_0}$ for any question $j_0 \in J$. In particular, we do so by exhibiting a question that is at least as hard as $j_0$ on which the WAN estimator is definitely correct, and then use the above properties to conclude that the estimator must therefore also be correct on question $j_0$. Recall that by the definition (10a) of $J$, for any question $j_0 \in J$, there must exist some $k_{j_0} \ge p_{obs}^{-1}\log^{1.5}(dn)$ such that
$$\sum_{i=1}^{k_{j_0}} \Big( Q^*_{\pi^{-1}(i)j_0} - \tfrac{1}{2} \Big) \ge \frac{3}{4}\sqrt{\frac{k_{j_0}}{p_{obs}} \log^{1.5}(dn)}. \qquad (34)$$
We define an associated set $J_0$ as the set of questions that are at least as easy as question $j_0$ according to the underlying permutation $\sigma^*$, that is,
$$J_0 := \{ j \in [d] \mid \sigma^*(j) \le \sigma^*(j_0) \}.$$
By the monotonicity of the columns of $Q^*$, every question in $J_0$ also satisfies condition (34). For each positive integer $k$, define the set
$$J(k) := \Big\{ j \in [d] \;\Big|\; \gamma_\pi(k, j, x) \ge \sqrt{k\, p_{obs} \log^{1.5}(dn)} \text{ for some } x \in \{-1,1\} \Big\}.$$
Property (P1) ensures that every question in the set $J_0$ is also in the set $J(k_{j_0})$. We then have
$$|J(k_{\mathrm{WAN}})| \overset{(i)}{\ge} |J(k_{j_0})| \ge |J_0|,$$
where step (i) uses the optimality of $k_{\mathrm{WAN}}$ for the optimization problem in equation (9a). Given this, there are two possibilities: either (1) we have the equality $J(k_{\mathrm{WAN}}) = J_0$, or (2) the set $J(k_{\mathrm{WAN}})$ contains some question not in the set $J_0$. We address each of these possibilities in turn.

Case 1: By Properties (P1)-(P3), the aggregate of the top $k_{\mathrm{WAN}}$ workers is correct on every question in the set $J(k_{\mathrm{WAN}})$; since $j_0 \in J_0 = J(k_{\mathrm{WAN}})$, this implies that $[\widehat{x}_{\mathrm{WAN}}(\pi)]_{j_0} = x^*_{j_0}$, as desired.

Case 2: In this case, there is some question $j' \notin J_0$ such that $\gamma_\pi(k_{\mathrm{WAN}}, j', x) \ge \sqrt{k_{\mathrm{WAN}}\, p_{obs} \log^{1.5}(dn)}$ for some $x \in \{-1,1\}$. Property (P2) guarantees that $\sum_{i=1}^{k_{\mathrm{WAN}}}\big(Q^*_{\pi^{-1}(i)j'} - \frac{1}{2}\big) \ge \frac{1}{4}\sqrt{\frac{k_{\mathrm{WAN}} \log^{1.5}(dn)}{p_{obs}}}$ and that $x = x^*_{j'}$. Now, since every question easier than $j_0$ is in the set $J_0$, question $j'$ must be more difficult than $j_0$, which implies that
$$\sum_{i=1}^{k_{\mathrm{WAN}}} \Big( Q^*_{\pi^{-1}(i)j_0} - \tfrac{1}{2} \Big) \ge \frac{1}{4}\sqrt{\frac{k_{\mathrm{WAN}} \log^{1.5}(dn)}{p_{obs}}}.$$
Applying Property (P3), we can then conclude that $[\widehat{x}_{\mathrm{WAN}}(\pi)]_{j_0} = x^*_{j_0}$, as desired.

5.6 Proof of Corollary 2

Theorem 2 guarantees that the WAN estimator correctly answers all questions that have some reasonable signal. Note that the set (10a) is defined in terms of the $\ell_1$-norm of subvectors of the columns of $Q^* - \frac{1}{2}$, whereas the conditions
$$\big\|Q^*_j - \tfrac{1}{2}\big\|_2^2 \ge \frac{5\log^{2.5}(dn)}{p_{obs}} \quad \text{and} \quad \big\|Q^\pi_j - Q^{\pi^*}_j\big\|_2 \le \frac{\|Q^*_j - \frac{1}{2}\|_2}{\sqrt{9\log(dn)}} \qquad (35)$$
in the theorem claim are in terms of the $\ell_2$-norm of the columns of $Q^*$. The following lemma allows us to connect the $\ell_1$- and $\ell_2$-norm constraints for any vector in a general class.

Lemma 4. For any vector $v \in [0,1]^n$ such that $v_1 \ge \dots \ge v_n$, there must be some $\alpha \ge \lceil \frac{1}{2}\|v\|_2^2 \rceil$ such that
$$\sum_{i=1}^{\alpha} v_i \ge \sqrt{\frac{\alpha\, \|v\|_2^2}{2\log n}}. \qquad (36)$$

See Section 5.6.1 for the proof of this lemma. We now complete the proof of the corollary. We may assume without loss of generality that the rows of $Q^*$ are ordered so that each column is non-increasing downwards, that is, that $\pi^*$ is the identity permutation. Consider any question $j \in [d]$ for which the permutation $\pi$ satisfies the bounds (35). For any $\ell \in [n]$, let $g_\ell \in \mathbb{R}^n$ denote the vector with ones in its first $\ell$ positions and zeros elsewhere. The Cauchy-Schwarz inequality implies that
$$\big(Q^\pi_j - \tfrac{1}{2}\big)^T g_\ell \ge \big(Q^*_j - \tfrac{1}{2}\big)^T g_\ell - \sqrt{\ell}\, \big\|Q^\pi_j - Q^*_j\big\|_2.$$
By applying Lemma 4 to the vector $Q^*_j - \frac{1}{2}$, we are guaranteed the existence of some value $k \ge \frac{5\log^{2.5}(dn)}{2 p_{obs}}$ such that $\big(Q^*_j - \frac{1}{2}\big)^T g_k \ge \|Q^*_j - \frac{1}{2}\|_2 \sqrt{\frac{k}{2\log n}}$. Consequently, we have the lower bound
$$\big(Q^\pi_j - \tfrac{1}{2}\big)^T g_k \ge \big\|Q^*_j - \tfrac{1}{2}\big\|_2 \sqrt{\frac{k}{2\log n}} - \sqrt{k}\, \big\|Q^\pi_j - Q^*_j\big\|_2 \overset{(i)}{\ge} 0.37\, \big\|Q^*_j - \tfrac{1}{2}\big\|_2 \sqrt{\frac{k}{\log(dn)}} \overset{(ii)}{\ge} \frac{3}{4}\sqrt{\frac{k}{p_{obs}} \log^{1.5}(dn)},$$
where inequalities (i) and (ii) follow from the conditions (35). Consequently, we can apply Theorem 2 to every such question $j$, thereby yielding the result (11b). Finally, the claimed result (11c) on the $Q^*$-loss under the correct permutation is obtained by observing that, with high probability, every question $j \in [d]$ with $\|Q^*_j - \frac{1}{2}\|_2^2 \ge \frac{5\log^{2.5}(dn)}{p_{obs}}$ incurs zero error, while each of the remaining (at most $d$) questions contributes a $Q^*$-loss of at most $\frac{5\log^{2.5}(dn)}{n d\, p_{obs}}$.
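As a sanity check on Lemma 4, the following sketch searches for the index $\alpha$ promised by the lemma on random non-increasing vectors in $[0,1]^n$. The function name and test instances are our own illustrative choices.

```python
import math
import numpy as np

def lemma4_alpha(v):
    """Return the first alpha >= ceil(||v||_2^2 / 2) at which the prefix sum of
    the non-increasing vector v dominates sqrt(alpha * ||v||_2^2 / (2 log n)),
    as guaranteed by Lemma 4 (for n large enough); None if no such alpha exists."""
    n = len(v)
    sq = float(np.dot(v, v))
    prefix = np.cumsum(v)
    for alpha in range(max(math.ceil(sq / 2), 1), n + 1):
        if prefix[alpha - 1] >= math.sqrt(alpha * sq / (2 * math.log(n))):
            return alpha
    return None

rng = np.random.default_rng(1)
n = 1000
witnesses = []
for _ in range(20):
    v = np.sort(rng.uniform(0.0, 1.0, size=n))[::-1]  # v_1 >= ... >= v_n in [0, 1]
    witnesses.append(lemma4_alpha(v))
```

On such instances a witness $\alpha$ is always found, typically right at $\alpha = \lceil \frac{1}{2}\|v\|_2^2 \rceil$, which is the index used in the proof of Corollary 2.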
5.6.1 Proof of Lemma 4

We partition the proof into two cases depending on the value of $\|v\|_2^2$.

Case 1: First, suppose that $\frac{1}{2}\|v\|_2^2 \ge e$. In this case, we proceed by contradiction. If the claim were false, then we would have
$$\sqrt{\frac{\alpha\,\|v\|_2^2}{2\log n}} > \sum_{i=1}^{\alpha} v_i \ge \alpha\, v_\alpha \quad \text{for every } \alpha \ge \lceil \tfrac{1}{2}\|v\|_2^2 \rceil.$$
It would then follow that
$$\sum_{i=1}^{n} v_i^2 = \sum_{i=1}^{\lceil \frac{1}{2}\|v\|_2^2 \rceil - 1} v_i^2 + \sum_{i=\lceil \frac{1}{2}\|v\|_2^2 \rceil}^{n} v_i^2 \overset{(i)}{\le} \lceil \tfrac{1}{2}\|v\|_2^2 \rceil - 1 + \sum_{i=\lceil \frac{1}{2}\|v\|_2^2 \rceil}^{n} v_i^2 < \tfrac{1}{2}\|v\|_2^2 + \sum_{i=\lceil \frac{1}{2}\|v\|_2^2 \rceil}^{n} \frac{\|v\|_2^2}{2\, i \log n},$$
where step (i) uses the fact that $v_i \in [0,1]$. Using the standard bound $\sum_{i=a}^{b} \frac{1}{i} \le \log\big(\frac{eb}{a}\big)$ and the assumption $\lceil \frac{1}{2}\|v\|_2^2 \rceil \ge e$, we find that
$$\tfrac{1}{2}\|v\|_2^2 + \sum_{i=\lceil \frac{1}{2}\|v\|_2^2 \rceil}^{n} \frac{\|v\|_2^2}{2\, i \log n} \le \|v\|_2^2.$$
The resulting chain of inequalities contradicts the definition of $\|v\|_2^2$.

Case 2: Otherwise, we may assume that $\frac{1}{2}\|v\|_2^2 < e$. Observe that the case $v = 0$ trivially satisfies the claim with $\alpha = 1$, so we may restrict attention to non-zero vectors. Define the vector $v' \in [0,1]^n$ as $v' = \frac{1}{v_1} v$. We first prove the claim of the lemma for the vector $v'$; that is, we prove that there exists some value $\alpha \ge \lceil \frac{1}{2}\|v'\|_2^2 \rceil$ such that
$$\sum_{i=1}^{\alpha} v'_i \ge \sqrt{\frac{\alpha\,\|v'\|_2^2}{2\log n}}. \qquad (37)$$
Observe that $1 = v'_1 \ge \dots \ge v'_n \ge 0$. If $\frac{1}{2}\|v'\|_2^2 \ge e$, then our claim (37) is proved via the analysis of Case 1 above. Otherwise, we have $\frac{1}{2}\|v'\|_2^2 < e$ and $v'_1 = 1$. Setting $\alpha = 1$, we obtain the inequalities
$$\sum_{i=1}^{\alpha} v'_i = 1 \quad \text{and} \quad \sqrt{\frac{\alpha\,\|v'\|_2^2}{2\log n}} \le 1,$$
where we have used the assumption that $n$ is large enough (concretely, $n \ge 16$). We have thus proved the bound (37), and it remains to translate this bound on $v'$ to an analogous bound on the vector $v$. Observe that since $v_1 \le 1$, we have the relation $\|v'\|_2 \ge \|v\|_2$.
Using the same value of $\alpha$ as that derived for the vector $v'$, we then obtain from (37) that this value $\alpha \ge \lceil \frac{1}{2}\|v'\|_2^2 \rceil \ge \lceil \frac{1}{2}\|v\|_2^2 \rceil$ satisfies
$$v_1 \sum_{i=1}^{\alpha} v'_i \ge v_1 \sqrt{\frac{\alpha\,\|v'\|_2^2}{2\log n}},$$
which establishes the claim, since $v_1 \sum_{i=1}^{\alpha} v'_i = \sum_{i=1}^{\alpha} v_i$ and $v_1 \|v'\|_2 = \|v\|_2$.

5.7 Proof of Theorem 3: OBI-WAN under the intermediate model

To simplify notation, define the vector $r^* := \widetilde{q} - \frac{1}{2}$. Note that the value of the constant $\widetilde{c}$ in the statement of the theorem is specified later in the proof, via equation (44) in Lemma 5. Our proof is divided into three parts, each corresponding to one of the three steps in the OBI-WAN algorithm. The first step derives certain properties of the split of the questions. The second step derives approximation guarantees on the outcome of the OBI step. The third and final step shows that this approximation guarantee ensures that the output of the WAN estimator meets the claimed error guarantee.

Step 1: Analyzing the split. Our first step is to exhibit a useful property of the split of the questions, namely that with high probability, the questions in the two sets $T_0$ and $T_1$ have similar total difficulty. The random sets $(T_0, T_1)$ chosen in the first step can be obtained as follows: first generate an i.i.d. sequence $\{\epsilon_j\}_{j=1}^{d}$ of equiprobable $\{0,1\}$ variables, and then set $T_\ell := \{j \in [d] \mid \epsilon_j = \ell\}$ for $\ell \in \{0,1\}$. Note that we have
$$\mathbb{E}\Big[\sum_{j\in[d]} (1 - h^*_j)^2\, \epsilon_j\Big] = \tfrac{1}{2}\|\mathbf{1} - h^*\|_2^2, \quad \text{and} \quad \mathbb{E}\Big[\sum_{j\in[d]} \big((1 - h^*_j)^2\, \epsilon_j\big)^2\Big] = \tfrac{1}{2}\sum_{j\in[d]} (1 - h^*_j)^4 \le \tfrac{1}{2}\|\mathbf{1} - h^*\|_2^2.$$
Applying Bernstein's inequality then guarantees that
$$\mathbb{P}\Big[ \sum_{j\in T_\ell} (1 - h^*_j)^2 > \tfrac{2}{3}\|\mathbf{1} - h^*\|_2^2 \Big] \le \exp\big( -c\,\|\mathbf{1} - h^*\|_2^2 \big) \quad \text{for each } \ell \in \{0,1\},$$
where $c$ is a positive universal constant. We are thus guaranteed that
$$\tfrac{1}{3}\|\mathbf{1} - h^*\|_2^2 \le \sum_{j\in T_\ell} (1 - h^*_j)^2 \le \tfrac{2}{3}\|\mathbf{1} - h^*\|_2^2 \quad \text{for both } \ell \in \{0,1\}, \qquad (38)$$
with probability at least $1 - e^{-c\,\widetilde{c}\,\log^{2.5}d \,/\, p_{obs}}$, where we have used the fact that
$$\|\mathbf{1} - h^*\|_2^2 \ge \frac{\widetilde{c}\, d\log^{2.5}d}{p_{obs}\, \|r^*\|_2^2} \ge \frac{\widetilde{c}\, \log^{2.5}d}{p_{obs}}.$$
Now define the error event
$$\mathcal{E} := \Big\{ \mathcal{L}_{Q^*}\big(\widehat{x}_{\text{OBI-WAN}}, x^*\big) > \frac{6\,\widetilde{c}\, \log^{2.5}d}{n\, p_{obs}} \Big\}.$$
Combining the sandwich relation (38) with the union bound, we find that
$$\mathbb{P}(\mathcal{E}) = \sum_{\text{partitions } \widetilde{T}_0, \widetilde{T}_1} \mathbb{P}\big(\mathcal{E} \mid T_0 = \widetilde{T}_0, T_1 = \widetilde{T}_1\big)\, \mathbb{P}\big(T_0 = \widetilde{T}_0, T_1 = \widetilde{T}_1\big) \le \sum_{\substack{\text{partitions } \widetilde{T}_0, \widetilde{T}_1 \\ \text{satisfying } (38)}} \mathbb{P}\big(\mathcal{E} \mid T_0 = \widetilde{T}_0, T_1 = \widetilde{T}_1\big)\, \mathbb{P}\big(T_0 = \widetilde{T}_0, T_1 = \widetilde{T}_1\big) + e^{-c\,\widetilde{c}\,\log^{2.5}d \,/\, p_{obs}}.$$
Consequently, in the rest of the proof we consider any partition $(\widetilde{T}_0, \widetilde{T}_1)$ that satisfies the sandwich bound (38) and derive an upper bound on the error conditioned on this partition. In other words, it suffices to prove the following bound for any partition $(\widetilde{T}_0, \widetilde{T}_1)$ satisfying (38):
$$\mathbb{P}\big(\mathcal{E} \mid T_0 = \widetilde{T}_0, T_1 = \widetilde{T}_1\big) \le e^{-c'\log^{1.5}(dn)}, \qquad (39)$$
for some positive constant $c'$ whose value may depend only on $\widetilde{c}$. We note that conditioned on the partition $(\widetilde{T}_0, \widetilde{T}_1)$, and for any fixed values of $Q^*$ and $x^*$, the responses of the workers to the questions in one set are statistically independent of the responses in the other set. Consequently, we describe the proof for either one of the two halves, and the overall result is implied by a union bound over the error guarantees for the two halves; in the sequel we use $\ell \in \{0,1\}$ to index the half under consideration.

Step 2: Guarantees for the OBI step. Assume without loss of generality that the rows of the matrix $Q^*$ are ordered according to the abilities of the corresponding workers, that is, that the entries of $\widetilde{q}$ are arranged in non-increasing order. Recall that $\pi_\ell$ denotes the permutation of the workers in order of their respective values in $u_\ell$. Let $\widetilde{r}_\ell \in \mathbb{R}^n$ denote the vector obtained by permuting the entries of $r^*$ in the order given by $\pi_\ell$.
Thus the entries of $\widetilde{r}_\ell$ are identical to those of $r^*$ up to a permutation, and the ordering of the entries of $\widetilde{r}_\ell$ is identical to the ordering of the entries of $u_\ell$. The following lemma, central to the proof of this theorem, establishes a relation between these vectors. Its proof combines matrix perturbation theory with some careful algebraic arguments.

Lemma 5. Suppose that condition (47) holds for a sufficiently large constant $\widetilde{c} > 0$. Then for any split $(T_0, T_1)$ satisfying the relation (38), we have
$$\mathbb{P}\Big[ \|\widetilde{r}_\ell - r^*\|_2^2 > \frac{\|r^*\|_2^2}{9\log(dn)} \Big] \le e^{-c\log^{1.5}d}. \qquad (40)$$
See Section 5.7.1 for the proof of this lemma. At this point, we are ready to apply the bound for the WAN estimator from Corollary 2.

Step 3: Guarantees for the WAN step. Recall that for either choice of index $\ell \in \{0,1\}$, the OBI step operates on the set $T_\ell$ of questions, and the WAN step operates on the complementary set $T_{1-\ell}$. Consequently, conditioned on the partition $(\widetilde{T}_0, \widetilde{T}_1)$, the observations $Y_{1-\ell}$ for the questions in the set $T_{1-\ell}$ are statistically independent of the permutation $\pi_\ell$ obtained from the set $T_\ell$ in the OBI step. Consider any question $j \in T_{1-\ell}$ that satisfies the inequality $\|(1 - h^*_j)\, r^*\|_2^2 \ge \frac{5\log^{2.5}(dn)}{p_{obs}}$. We claim that this question $j$ satisfies the pair of conditions (11a) required by the statement of Corollary 2. First, since $(1 - h^*_j)\, r^*$ is simply the $j$-th column of the matrix $(Q^* - \frac{1}{2})$, we have $\|Q^*_j - \frac{1}{2}\|_2^2 \ge \frac{5\log^{2.5}(dn)}{p_{obs}}$; the first condition in (11a) is thus satisfied. In order to establish the second condition, observe that rescaling the inequality (40) by the non-negative scalar $(1 - h^*_j)^2$ yields the bound
$$\big\|(1 - h^*_j)\,\widetilde{r}_\ell - (1 - h^*_j)\, r^*\big\|_2^2 \le \frac{\|(1 - h^*_j)\, r^*\|_2^2}{9\log(dn)} \quad \text{for every } j \in T_{1-\ell}. \qquad (41)$$
Recall our notational assumption that the entries of $\widetilde{q}$ (and hence the rows of $Q^*$) are arranged in order of the workers' abilities, and that $Q^\pi$ is the matrix obtained by permuting the rows of $Q^*$ according to a given permutation $\pi$. Also observe that the vector $(1 - h^*_j)\,\widetilde{r}_\ell$ equals the $j$-th column of $(Q^{\pi_\ell} - \frac{1}{2})$, where $\pi_\ell$ is the permutation of the workers obtained from the OBI step. Consequently, the approximation guarantee (41) implies that
$$\big\|Q^{\pi_\ell}_j - Q^*_j\big\|_2 \le \frac{\|Q^*_j - \frac{1}{2}\|_2}{\sqrt{9\log(dn)}}.$$
Thus the second condition in (11a) is also satisfied for the question $j$ under consideration. This allows us to apply the result of Corollary 2 to the WAN step, which yields that question $j$ is decoded correctly with probability at least $1 - e^{-c\log^{1.5}(dn)}$, for some positive constant $c$. This argument holds for every question $j$ satisfying $\|(1 - h^*_j)\, r^*\|_2^2 \ge \frac{5\log^{2.5}(dn)}{p_{obs}}$, and applying the union bound shows that all such questions are decoded correctly with high probability.

5.7.1 Proof of Lemma 5

The proof of this lemma consists of three main steps: (i) first, we show that $u_\ell$ is a good approximation to the vector of worker abilities $r^*$ up to a global sign; (ii) we then show that the global sign is correctly identified with high probability; (iii) the final step is to convert this guarantee into one on the permutation induced by $u_\ell$.

Step 1. We first show that the vector $u_\ell$ approximates $r^*$ up to a global sign. Assume without loss of generality that $x^*_j = 1$ for every question $j \in [d]$. As in the proof of Theorem 1(a), we begin by rewriting the model in a "linearized" fashion that is convenient for our analysis. Let $Q^*_0$ and $Q^*_1$ denote the submatrices of $Q^*$ obtained by splitting its columns according to the sets $T_0$ and $T_1$.
Then we have, for $\ell \in \{0,1\}$,
$$\frac{1}{p_{obs}} Y_\ell = (2Q^*_\ell - \mathbf{1}\mathbf{1}^T)\,\mathrm{diag}(x^*) + \frac{1}{p_{obs}} W_\ell, \qquad (42)$$
where, conditioned on $T_0$ and $T_1$, the noise matrices $W_0, W_1 \in \mathbb{R}^{n\times d}$ have entries independently drawn from the distribution (18). One can verify that the entries of $W_0$ and $W_1$ have a mean of zero, a second moment upper bounded by $4 p_{obs}$, and absolute values upper bounded by 2.

We now require a standard result on the perturbation of eigenvectors of symmetric matrices [42]. Consider a symmetric and positive semidefinite matrix $M \in \mathbb{R}^{d\times d}$, a second symmetric matrix $\Delta M \in \mathbb{R}^{d\times d}$, and let $\widetilde{M} = M + \Delta M$. Let $v \in \mathbb{R}^d$ be an eigenvector associated to the largest eigenvalue of $M$, and likewise let $\widetilde{v} \in \mathbb{R}^d$ be an eigenvector associated to the largest eigenvalue of $\widetilde{M}$. Then we are guaranteed [42] that
$$\min\big\{ \|\widetilde{v} - v\|_2,\; \|\widetilde{v} + v\|_2 \big\} \le \frac{2\,|||\Delta M|||_{\mathrm{op}}}{\max\big\{ \lambda_1(M) - \lambda_2(M) - 2\,|||\Delta M|||_{\mathrm{op}},\; 0 \big\}}, \qquad (43)$$
where $\lambda_1(M)$ and $\lambda_2(M)$ denote the largest and second largest eigenvalues of $M$, respectively. In order to apply the bound (43), we define the matrix $R^*_\ell := Q^*_\ell - \frac{1}{2}\mathbf{1}\mathbf{1}^T$, as well as the matrices
$$\widetilde{M} := \frac{1}{p_{obs}^2}\, Y_\ell Y_\ell^T, \qquad M := 4\, R^*_\ell (R^*_\ell)^T, \qquad \text{and} \qquad \Delta M := \frac{2}{p_{obs}}\, W_\ell (R^*_\ell)^T + \frac{2}{p_{obs}}\, R^*_\ell W_\ell^T + \frac{1}{p_{obs}^2}\, W_\ell W_\ell^T.$$
Using our linearized observation model (42), it is straightforward to verify that these choices satisfy the condition $\widetilde{M} = M + \Delta M$, so that the bound (43) can be applied. Recall that for any matrix $Q^* \in \mathbb{C}_{\mathrm{Int}}$, we have $Q^* = \widetilde{q}\,(\mathbf{1} - h^*)^T + \frac{1}{2}\mathbf{1}(h^*)^T$ for some vectors $\widetilde{q} \in [\frac{1}{2}, 1]^n$ and $h^* \in [0,1]^d$. Also recall our definition of the associated quantity $r^* \in [0, \frac{1}{2}]^n$ as $r^* = \widetilde{q} - \frac{1}{2}$. We denote the magnitude of the vector $r^*$ by $\rho := \|r^*\|_2$. With the notation introduced above, we are ready to apply the bound (43).
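To illustrate how the perturbation bound (43) is used, here is a small numerical sketch of the spectral (OBI-style) step: on a synthetic rank-one intermediate-model instance, the top eigenvector of $\frac{1}{p_{obs}^2} Y_\ell Y_\ell^T$ recovers the direction of $r^*$ once the global sign is fixed by the positive-mass rule. The instance, constants, and variable names are our own illustrative choices, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, p_obs = 100, 2000, 0.5

# Synthetic intermediate-model instance with x* = 1:
# Q* = q~ (1 - h*)^T + (1/2) 1 (h*)^T, so Q* - 1/2 = r* (1 - h*)^T is rank one.
q_tilde = rng.uniform(0.5, 1.0, size=n)        # worker abilities in [1/2, 1]
h_star = rng.uniform(0.0, 0.5, size=d)         # question difficulties
r_star = q_tilde - 0.5
Q = np.outer(r_star, 1.0 - h_star) + 0.5

# Observations: Y_ij in {-1, 0, +1}, with E[Y_ij] = p_obs * (2 Q*_ij - 1).
correct = rng.random((n, d)) < Q
Y = np.where(correct, 1.0, -1.0) * (rng.random((n, d)) < p_obs)

# Top eigenvector of M~ = (1/p_obs^2) Y Y^T; by (43) it is close to the top
# eigenvector of M = 4 R* R*^T, namely r*/||r*||, up to a global sign.
M_tilde = (Y @ Y.T) / p_obs**2
eigvals, eigvecs = np.linalg.eigh(M_tilde)     # eigenvalues in ascending order
u = eigvecs[:, -1]                             # unit-norm, largest eigenvalue
# Resolve the sign as in OBI-WAN: keep the orientation with more positive mass.
if (u[u > 0] ** 2).sum() < (u[u < 0] ** 2).sum():
    u = -u

cosine = float(u @ r_star / np.linalg.norm(r_star))
```

With the signal-to-noise level of this instance, the unit vector `u` has a large inner product with $r^*/\|r^*\|_2$, which is exactly the kind of $\ell_2$-approximation that Lemma 5 then converts into a guarantee on the induced worker ordering.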
First observe that the matrix $R^*_\ell$ has rank one, and consequently $|||R^*_\ell|||_{\mathrm{op}} = \rho \sqrt{\sum_{j\in T_\ell} (1 - h^*_j)^2}$. Conditioned on the bound (38), we obtain
$$\sqrt{\tfrac{1}{3}}\, \rho\, \|\mathbf{1} - h^*\|_2 \;\le\; |||R^*_\ell|||_{\mathrm{op}} \;\le\; \sqrt{\tfrac{2}{3}}\, \rho\, \|\mathbf{1} - h^*\|_2.$$
Moreover, the entries of the matrix $W_\ell$ are independent and zero-mean, with second moment upper bounded by $4 p_{obs}$. Consequently, known results on random matrices [1, Remark 3.13] guarantee that
$$|||W_\ell|||_{\mathrm{op}} \le c\sqrt{\max\{d, n\}\, p_{obs}\, \log^{1.5}d} \le c\sqrt{d\, p_{obs}\, \log^{1.5}d},$$
with probability at least $1 - e^{-c\log^{1.5}d}$, where we have used the facts that $d \ge n$ and $p_{obs} \ge \frac{1}{n}$. These inequalities, in turn, imply that the top eigenvalue of $M$ is lower bounded as $\lambda_1(M) = 4\,|||R^*_\ell|||_{\mathrm{op}}^2 \ge \frac{4}{3}\rho^2\|\mathbf{1} - h^*\|_2^2$, that the second eigenvalue vanishes (that is, $\lambda_2(M) = 0$), and moreover that
$$|||\Delta M|||_{\mathrm{op}} \le \frac{4}{p_{obs}}\, |||R^*_\ell|||_{\mathrm{op}}\, |||W_\ell|||_{\mathrm{op}} + \frac{1}{p_{obs}^2}\, |||W_\ell|||_{\mathrm{op}}^2 \le \frac{c'\sqrt{d\log^{1.5}d}}{p_{obs}} \Big( \rho\,\|\mathbf{1} - h^*\|_2\, \sqrt{p_{obs}} + \sqrt{d\log^{1.5}d} \Big).$$
Recall the lower bound $\rho\,\|\mathbf{1} - h^*\|_2 \ge \sqrt{\frac{\widetilde{c}\, d\log^{2.5}d}{p_{obs}}}$ assumed in the statement of the lemma. Using these facts and doing some algebra, we find that with probability at least $1 - e^{-c\log^{1.5}d}$, for any pair of sets $T_0$ and $T_1$ satisfying (38), we have the bound
$$\min\Big\{ \big\|u_\ell - \tfrac{1}{\rho}r^*\big\|_2^2,\; \big\|u_\ell + \tfrac{1}{\rho}r^*\big\|_2^2 \Big\} \le \frac{1}{36}\, \frac{d\log^{1.5}d}{\rho^2\, \|\mathbf{1} - h^*\|_2^2\, p_{obs}}, \qquad (44)$$
where the prefactor $\frac{1}{36}$ is obtained by setting the constant $\widetilde{c} > 20$ to a large enough value.

Step 2. We now verify that the global sign is correctly identified. Recall our selection rule
$$\sum_{j=1}^{n} [u_\ell]_j^2\, \mathbf{1}\{[u_\ell]_j > 0\} \ge \sum_{j=1}^{n} [u_\ell]_j^2\, \mathbf{1}\{[u_\ell]_j < 0\}.$$
Since every entry of the vector $r^*$ is non-negative, we have the inequality
$$\big\|u_\ell + \tfrac{1}{\rho}r^*\big\|_2^2 \ge \sum_{j=1}^{n} [u_\ell]_j^2\, \mathbf{1}\{[u_\ell]_j > 0\} \ge \sum_{j=1}^{n} [u_\ell]_j^2\, \mathbf{1}\{[u_\ell]_j < 0\},$$
and consequently
$$\big\|u_\ell + \tfrac{1}{\rho}r^*\big\|_2^2 \ge \tfrac{1}{2}\|u_\ell\|_2^2. \qquad (45a)$$
On the other hand, a version of the triangle inequality yields
$$2\,\|u_\ell\|_2^2 + 2\,\big\|u_\ell + \tfrac{1}{\rho}r^*\big\|_2^2 \ge \big\|\tfrac{1}{\rho}r^*\big\|_2^2 = 1. \qquad (45b)$$
Now suppose that $\|u_\ell - \frac{1}{\rho}r^*\|_2^2 \ge \|u_\ell + \frac{1}{\rho}r^*\|_2^2$. Then from our earlier result (44), we have the bound
$$\big\|u_\ell + \tfrac{1}{\rho}r^*\big\|_2^2 \le \frac{d\log^{1.5}d}{36\, \rho^2\,\|\mathbf{1} - h^*\|_2^2\, p_{obs}}, \qquad (45c)$$
with probability at least $1 - e^{-c\log^{1.5}(dn)}$. Putting together the inequalities (45a), (45b) and (45c) and rearranging some terms yields the inequality
$$\rho^2\,\|\mathbf{1} - h^*\|_2^2 \le \frac{d\log^{1.5}d}{9\, p_{obs}}.$$
This requirement contradicts our initial assumption $\rho^2\,\|\mathbf{1} - h^*\|_2^2 \ge \frac{\widetilde{c}\, d\log^{2.5}d}{p_{obs}}$ with $\widetilde{c} > 20$, thereby proving that $\|u_\ell - \frac{1}{\rho}r^*\|_2^2 < \|u_\ell + \frac{1}{\rho}r^*\|_2^2$. Substituting this inequality into equation (44) yields the bound
$$\big\|u_\ell - \tfrac{1}{\rho}r^*\big\|_2^2 \le \frac{1}{36}\, \frac{d\log^{1.5}d}{\rho^2\,\|\mathbf{1} - h^*\|_2^2\, p_{obs}}. \qquad (46)$$

Step 3. The final step of this proof is to convert the approximation guarantee (46) on $u_\ell$ into an approximation guarantee on the vector $\widetilde{r}_\ell$ (which, recall, is a permutation of $r^*$ according to the permutation induced by $u_\ell$). An additional lemma is useful for this step.

Lemma 6. For any $\ell \in \{0,1\}$, we have $\|\widetilde{r}_\ell - r^*\|_2 \le 2\,\|\rho u_\ell - r^*\|_2$.

See Section 5.7.2 for the proof of this claim. Combining Lemma 6 with the inequality (46) yields that for any choice of the sets $T_0$ and $T_1$ satisfying the condition (38), with probability at least $1 - e^{-c\log^{1.5}d}$, we have
$$\|\widetilde{r}_\ell - r^*\|_2^2 \le \frac{1}{18}\, \frac{d\log^{1.5}d}{\|\mathbf{1} - h^*\|_2^2\, p_{obs}} \overset{(i)}{\le} \frac{\|r^*\|_2^2}{18\log(dn)}.$$
Here, inequality (i) follows from our earlier assumption that $\|r^*\|_2\,\|\mathbf{1} - h^*\|_2 \ge \sqrt{\frac{\widetilde{c}\, d\log^{2.5}d}{p_{obs}}}$ with $\widetilde{c} > 20$.

5.7.2 Proof of Lemma 6

Recall that the two vectors $\widetilde{r}_\ell$ and $r^*$ are identical up to a permutation. Now suppose that $\widetilde{r}_\ell \ne r^*$. Then there must exist some position $i \in [n-1]$ such that $[r^*]_i < [r^*]_{i+1}$ and $[\widetilde{r}_\ell]_i \ge [\widetilde{r}_\ell]_{i+1}$.
Define the vector $\widetilde{r}'$ obtained by interchanging the entries in positions $i$ and $(i+1)$ of $r^*$. The difference $\Delta := \|\widetilde{r}' - \rho u_\ell\|_2^2 - \|r^* - \rho u_\ell\|_2^2$ can then be bounded as
$$\begin{aligned} \Delta &= \big([\widetilde{r}']_i - \rho[u_\ell]_i\big)^2 + \big([\widetilde{r}']_{i+1} - \rho[u_\ell]_{i+1}\big)^2 - \big([r^*]_i - \rho[u_\ell]_i\big)^2 - \big([r^*]_{i+1} - \rho[u_\ell]_{i+1}\big)^2 \\ &= \big([r^*]_{i+1} - \rho[u_\ell]_i\big)^2 + \big([r^*]_i - \rho[u_\ell]_{i+1}\big)^2 - \big([r^*]_i - \rho[u_\ell]_i\big)^2 - \big([r^*]_{i+1} - \rho[u_\ell]_{i+1}\big)^2 \\ &= 2\rho\,\big([r^*]_{i+1} - [r^*]_i\big)\big([u_\ell]_{i+1} - [u_\ell]_i\big) \;\le\; 0, \end{aligned}$$
where the final inequality uses the fact that the orderings of the entries of the two vectors $\widetilde{r}_\ell$ and $u_\ell$ are identical, which in turn implies that $[u_\ell]_i \ge [u_\ell]_{i+1}$. We have thus shown that an interchange of the entries $i$ and $(i+1)$ in $r^*$, which brings it closer to the permutation $\widetilde{r}_\ell$, cannot increase the distance to the vector $\rho u_\ell$. A recursive application of this argument leads to the inequality $\|\widetilde{r}_\ell - \rho u_\ell\|_2 \le \|r^* - \rho u_\ell\|_2$. Applying the triangle inequality then yields
$$\|\widetilde{r}_\ell - r^*\|_2 \le \|\widetilde{r}_\ell - \rho u_\ell\|_2 + \|\rho u_\ell - r^*\|_2 \le 2\,\|\rho u_\ell - r^*\|_2,$$
as claimed.

5.8 Proof of Corollary 3

First suppose that the matrix $Q^*$ satisfies the condition
$$\big\|\widetilde{q} - \tfrac{1}{2}\big\|_2\, \|\mathbf{1} - h^*\|_2 \ge \sqrt{\frac{\widetilde{c}\, d\log^{2.5}(dn)}{p_{obs}}} \qquad (47)$$
for a large enough constant $\widetilde{c}$ whose value is determined by the result of Theorem 3. Applying the result of Theorem 3, we obtain that every question $j$ satisfying $(1 - h^*_j)^2\,\|\widetilde{q} - \frac{1}{2}\|_2^2 \ge \frac{5\log^{2.5}(dn)}{p_{obs}}$ is decoded correctly with probability at least $1 - e^{-c\log^{1.5}(dn)}$. The total contribution of the remaining questions to the $Q^*$-loss is at most $\frac{5\log^{2.5}(dn)}{p_{obs}\, n}$. A union bound over all questions and both values of $\ell \in \{0,1\}$ then yields that the aggregate $Q^*$-loss is at most $\frac{5\log^{2.5}(dn)}{p_{obs}\, n}$ with probability at least $1 - e^{-c'\log^{1.5}(dn)}$, for some positive constant $c'$, as claimed in (39).
Otherwise, suppose that condition (47) is violated. Then for any arbitrary $\hat{x} \in \{-1, 1\}^d$, we have

$$\mathcal{L}_{Q^*}(\hat{x}, x^*) \leq \frac{1}{dn} \|\tilde{q} - \tfrac{1}{2}\|_2^2\, \|1 - h^*\|_2^2 \leq \frac{6 \tilde{c} \log^{2.5} d}{n\, p_{\mathrm{obs}}},$$

as claimed, where we have made use of the fact that $d \geq n$.

5.9 Proof of Theorem 4(a): OBI-WAN under the Dawid-Skene model

Throughout the proof, we make use of the notation previously introduced in the proof of Theorem 3(a). As in that proof, we condition on some choice of $T_0$ and $T_1$ that satisfies (38). The proof of this theorem follows the same structure as the proof of Theorem 3(a) and the lemmas within it; however, we must make additional arguments in order to account for adversarial workers. In the remainder of the proof, we consider any $\ell \in \{0, 1\}$, and then apply the union bound across both values of $\ell$. Our proof consists of three steps:

(1) We first show that the vector $u_\ell$ is a good approximation to $(q^{\mathrm{DS}} - \tfrac{1}{2})$ up to a global sign.
(2) Second, we show that the global sign of $r^*$ is indeed recovered correctly.
(3) Third, we establish guarantees on the performance of the WAN estimator for our setting.

We work through each of these steps in turn.

Step 1: We first show that the vector $u_\ell$ is a good approximation to $q^{\mathrm{DS}} - \tfrac{1}{2}$ up to a global sign. When $Q^* = q^{\mathrm{DS}} \mathbf{1}^T$, we can set the vector $h^* = 0$ in the proof of Theorem 3(a). We also have $r^* = q^{\mathrm{DS}} - \tfrac{1}{2}$. With these assignments, the arguments up to equation (44) in Lemma 5 continue to apply even in the present setting where $q^{\mathrm{DS}} \in [0, 1]^n$. From these arguments, we obtain the following approximation guarantee (44) on recovering $r^*$ up to a global sign:

$$\min\bigl\{\|u_\ell - \tfrac{1}{\rho} r^*\|_2^2,\; \|u_\ell + \tfrac{1}{\rho} r^*\|_2^2\bigr\} \leq \frac{1}{36 \rho^2} \frac{\log^{1.5} d}{p_{\mathrm{obs}}}, \quad (48)$$

with probability at least $1 - e^{-c \log^{1.5} d}$.

Step 2: The next step of the proof is to show that the global sign of $r^*$ is indeed recovered correctly.
Define two pairs of vectors $\{u_\ell^+, u_\ell^-\}$ and $\{r^{*+}, r^{*-}\}$, all lying in the cube $[-1, 1]^n$, with entries $[u_\ell^+]_i := \max\{[u_\ell]_i, 0\}$ and $[u_\ell^-]_i := \min\{[u_\ell]_i, 0\}$ for every $i \in [n]$, and $[r^{*+}]_i := \max\{[r^*]_i, 0\}$ and $[r^{*-}]_i := \min\{[r^*]_i, 0\}$ for every $i \in [n]$. From the conditions assumed in the statement of the theorem, we have $\|r^{*+}\|_2 \geq \|r^{*-}\|_2 + \sqrt{\frac{4 \log^{2.5}(dn)}{p_{\mathrm{obs}}}}$, whereas from the choice of $u$ in the OBI-WAN estimator, we have $\|u_\ell^+\|_2 \geq \|u_\ell^-\|_2$. One can also verify that

$$\|u_\ell + \tfrac{1}{\rho} r^*\|_2^2 \geq \|u_\ell^+ + \tfrac{1}{\rho} r^{*-}\|_2^2 + \|u_\ell^- + \tfrac{1}{\rho} r^{*+}\|_2^2. \quad (49a)$$

Now suppose that $\|\tfrac{1}{\rho} r^{*+}\|_2 \geq \|u_\ell^-\|_2 + \sqrt{\frac{\log^{2.5}(dn)}{\rho^2 p_{\mathrm{obs}}}}$. Then from the triangle inequality, we obtain the bound

$$\|u_\ell^- + \tfrac{1}{\rho} r^{*+}\|_2 \geq \|\tfrac{1}{\rho} r^{*+}\|_2 - \|u_\ell^-\|_2 \geq \sqrt{\frac{\log^{2.5}(dn)}{\rho^2 p_{\mathrm{obs}}}}. \quad (49b)$$

Otherwise, we have $\|\tfrac{1}{\rho} r^{*+}\|_2 < \|u_\ell^-\|_2 + \sqrt{\frac{\log^{2.5}(dn)}{\rho^2 p_{\mathrm{obs}}}}$. In this case, we have

$$\|u_\ell^+ + \tfrac{1}{\rho} r^{*-}\|_2 \geq \|u_\ell^+\|_2 - \|\tfrac{1}{\rho} r^{*-}\|_2 \geq \|u_\ell^-\|_2 - \|\tfrac{1}{\rho} r^{*+}\|_2 + 2\sqrt{\frac{\log^{2.5}(dn)}{\rho^2 p_{\mathrm{obs}}}} \geq \sqrt{\frac{\log^{2.5}(dn)}{\rho^2 p_{\mathrm{obs}}}}. \quad (49c)$$

Putting together the conditions (49a), (49b) and (49c), we obtain the bound $\|u_\ell + \tfrac{1}{\rho} r^*\|_2^2 \geq \frac{\log^{2.5}(dn)}{\rho^2 p_{\mathrm{obs}}}$. In conjunction with the result of equation (48), this bound guarantees the correct detection of the global sign, that is,

$$\|u_\ell - \tfrac{1}{\rho} r^*\|_2^2 \leq \frac{1}{36 \rho^2} \frac{\log^{1.5} d}{p_{\mathrm{obs}}}.$$

The deterministic inequality afforded by Lemma 6 then guarantees that

$$\|\tilde{r}_\ell - r^*\|_2^2 \leq \frac{\log^{1.5} d}{18\, p_{\mathrm{obs}}}, \quad (50)$$

and this completes the analysis of the OBI part of the estimator.

Step 3: In the third step, we establish guarantees on the performance of the WAN estimator for our setting.
Recall that the WAN estimator uses the permutation given by $\tilde{r}_\ell$ and, with this permutation, acts on the observations $Y_{1-\ell}$ of the other set of questions; consequently, the noise $W_{1-\ell}$ is statistically independent of the choice of $\tilde{r}_\ell$ when conditioned on the split $(T_0, T_1)$. Assume without loss of generality that $x^* = \mathbf{1}$ and that the rows of $Q^*$ are arranged according to the worker abilities, meaning that $q^{\mathrm{DS}}_i \geq q^{\mathrm{DS}}_{i'}$ for every $i < i'$, or in other words, $r^*_i \geq r^*_{i'}$ for every $i < i'$. Recall our earlier notation of $g_k \in \{0, 1\}^n$ denoting a vector with ones in its first $k$ positions and zeros elsewhere. Now from the proof of Theorem 2, the following two properties ensure that the WAN estimator decodes every question correctly with probability at least $1 - e^{-c \log^{1.5}(dn)}$:

(i) There exists some value $k \geq p_{\mathrm{obs}}^{-1} \log^{1.5}(dn)$ such that $\langle \tilde{r}_\ell, g_k \rangle \geq \frac{3}{4} \sqrt{\frac{k \log^{1.5}(dn)}{p_{\mathrm{obs}}}}$, and
(ii) for every $k \in [n]$, it must be that $\langle \tilde{r}_\ell, g_k \rangle > -\frac{1}{4} \sqrt{\frac{k \log^{1.5}(dn)}{p_{\mathrm{obs}}}}$.

Let us first address property (i). Lemma 4 guarantees the existence of some value $k \geq \lceil \tfrac{1}{2} \|r^*\|_2^2 \rceil$ such that $\langle r^{*+}, g_k \rangle \geq \frac{\sqrt{k}\, \|r^{*+}\|_2}{\sqrt{\log(dn)}}$. If there exist multiple such values of $k$, then choose the smallest such value. Since the vector $r^*$ has its entries arranged in order, and since $\|r^{*+}\|_2 \geq \|r^{*-}\|_2$, we obtain the following relations for this chosen value of $k$:

$$\langle r^*, g_k \rangle = (r^{*+})^T g_k \geq \frac{\sqrt{k}\, \|r^{*+}\|_2}{\sqrt{\log(dn)}} \geq \frac{\|r^*\|_2}{2} \sqrt{\frac{k}{\log(dn)}} \geq \sqrt{\frac{\log^{2.5}(dn)}{p_{\mathrm{obs}}}} \sqrt{\frac{k}{\log(dn)}}.$$

The Cauchy-Schwarz inequality then implies

$$\langle \tilde{r}_\ell, g_k \rangle \geq \langle r^*, g_k \rangle - \sqrt{k}\, \|\tilde{r}_\ell - r^*\|_2 \overset{(i)}{\geq} \frac{3}{4} \sqrt{\frac{k \log^{1.5}(dn)}{p_{\mathrm{obs}}}},$$

where the inequality (i) also uses our earlier bound (50), thereby proving the first property. Now, towards the second property, we use the condition $\langle r^*, \mathbf{1} \rangle \geq 0$.
Since the entries of $r^*$ are arranged in order, we have $\langle r^*, g_k \rangle \geq 0$ for every $k \in [n]$. Applying the Cauchy-Schwarz inequality yields

$$\langle \tilde{r}_\ell, g_k \rangle \geq \langle r^*, g_k \rangle - \sqrt{k}\, \|\tilde{r}_\ell - r^*\|_2 \overset{(ii)}{>} -\frac{1}{4} \sqrt{\frac{k \log^{1.5}(dn)}{p_{\mathrm{obs}}}},$$

where the inequality (ii) also uses our earlier bound (50), thereby proving the second property. This argument completes the proof of part (a).

5.10 Proof of Theorem 4(b): Converse result under the Dawid-Skene model

The Gilbert-Varshamov bound [16, 43] guarantees the existence of a set of $\beta$ vectors $x_1, \ldots, x_\beta \in \{-1, 1\}^d$ such that the normalized Hamming distance (1) between any pair of vectors in this set is lower bounded as

$$d_H(x_\ell, x_{\ell'}) \geq 0.25, \quad \text{for every } \ell, \ell' \in [\beta],$$

where $\beta = \exp(c_1 d)$ for some constant $c_1 > 0$. For each $\ell \in [\beta]$, let $\mathbb{P}_\ell$ denote the probability distribution of $Y$ induced by setting $x^* = x_\ell$. When $Q^* = q^{\mathrm{DS}} \mathbf{1}^T$ for some $q^{\mathrm{DS}} \in [\tfrac{1}{10}, \tfrac{9}{10}]^n$, we have an upper bound on the Kullback-Leibler divergence between any pair of distributions $\ell \neq \ell' \in [\beta]$:

$$D_{\mathrm{KL}}(\mathbb{P}_\ell \,\|\, \mathbb{P}_{\ell'}) \leq 25\, p_{\mathrm{obs}}\, d\, \|q^{\mathrm{DS}} - \tfrac{1}{2}\|_2^2 \leq 25 c d,$$

where we have used the assumption $\|q^{\mathrm{DS}} - \tfrac{1}{2}\|_2^2 \leq \frac{c}{p_{\mathrm{obs}}}$. Putting the above observations together into Fano's inequality [3] yields a lower bound on the expected value of the normalized Hamming error (1) for any estimator $\hat{x}$:

$$\mathbb{E}[d_H(\hat{x}, x^*)] \geq \frac{1}{8} \Bigl(1 - \frac{25 c d + \log 2}{c_1 d}\Bigr) \overset{(i)}{\geq} \frac{1}{10},$$

as claimed, where inequality (i) results from setting the value of $c$ to a small enough positive constant.

5.11 Proof of Proposition 1: OBI-WAN under the permutation-based model

First, suppose that $p_{\mathrm{obs}} < \frac{\log^{1.5}(dn)}{n}$. Then the condition (16a) is not satisfied for any question, and hence the first part of the claim is trivially (vacuously) true.
In this case, we also have $\mathcal{L}_{Q^*}(\hat{x}^{\text{OBI-WAN}}, x^*) \leq 1 \leq \frac{\log(dn)}{\sqrt{n p_{\mathrm{obs}}}}$, from which the second claim also follows immediately. Otherwise, we may assume that $p_{\mathrm{obs}} \geq \frac{\log^{1.5}(dn)}{n}$. For any index $\ell \in \{0, 1\}$, consider an arbitrary permutation $\pi_\ell$. Observe that conditioned on the split $(T_0, T_1)$, the data $Y_{1-\ell}$ is independent of the choice of the permutation $\pi_\ell$. Now consider any question $j \in T_{1-\ell}$ that satisfies (16a). We then apply Theorem 2 with the parameter $k_j = n$ in (10a), and note that the permutation $\pi$ specified in the statement of Theorem 2 does not matter when $k_j = n$. This result guarantees that our estimator satisfies $\mathbb{P}([\hat{x}^{\text{OBI-WAN}}]_j = x^*_j) \geq 1 - e^{-c \log^{1.5}(dn)}$. A union bound over all questions $j \in [d]$ satisfying condition (16a) implies that all of these questions are decoded correctly with probability at least $1 - e^{-c' \log^{1.5}(dn)}$. Furthermore, the remaining questions can contribute a total of at most $\frac{3 \log(dn)}{2 \sqrt{n p_{\mathrm{obs}}}}$ to the $Q^*$-loss. This yields the second part of the claim.

6 Discussion

We propose a new permutation-based model for crowdsourced labeling that is considerably more general than the popular Dawid-Skene model, provide a computationally-efficient algorithm "OBI-WAN", and give associated statistical guarantees and empirical evaluations. We hope that the desirable features of the permutation-based model will encourage researchers and practitioners to further build on the permutation-based core of this model. This work gives rise to several open problems that are theoretically challenging and of interest to practitioners.
• The problem of establishing the optimal minimax risk achievable by computationally-efficient estimators under the permutation-based model remains open, and is related to several problems [34, 10, 36] involving permutations that have an unresolved gap between the computationally efficient and inefficient rates.³ It is of interest to reduce this gap in the future, possibly building on recent work [29, 27] on rates of computationally-efficient algorithms for permutation-based models.

• In addition to the global minimax error, it is of interest to obtain sharp bounds on adaptivity to the underlying noise levels under various models. Such adaptive bounds are obtained for the Dawid-Skene and intermediate models in the papers [22, 48, 13, 23].

• It will be useful to extend the proposed permutation-based model and associated algorithms to more general settings in crowdsourcing, such as a fixed design setup (i.e., where each worker answers a fixed, given subset of questions), questions with more than two choices, and asymmetric error probabilities of workers (the two-coin Dawid-Skene model).

• Our results are loose by logarithmic factors. In the future, it will be of interest to tighten this gap, possibly via new results on the law of the iterated logarithm in non-asymptotic regimes such as [4], and alongside to understand its relation to such gaps in other problems involving shape-constrained estimation [17].

Finally, there are many other problem settings involving estimation from noisy (as well as biased and subjective) labelers, such as in peer review [40, 44, 31, 41, 45], and it is of interest to see whether permutation-based models and associated techniques can play a useful role in these applications.
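To make the two-coin extension mentioned above concrete, here is a minimal generative sketch of the two-coin Dawid-Skene observation model (function and variable names are ours, purely for illustration): each worker $i$ has one accuracy for questions whose true answer is $+1$ and a possibly different accuracy for questions whose true answer is $-1$, and each entry is observed independently with probability $p_{\mathrm{obs}}$.

```python
# Minimal generative sketch of the two-coin Dawid-Skene observation model
# discussed above (all names are hypothetical, for illustration only).
import random

def sample_observations(x_star, q_pos, q_neg, p_obs, rng):
    """Return an n x d observation matrix Y with entries in {-1, 0, +1},
    where 0 marks an unobserved entry. Worker i answers a question with
    true label +1 correctly with probability q_pos[i], and a question with
    true label -1 correctly with probability q_neg[i] (the "two coins")."""
    n, d = len(q_pos), len(x_star)
    Y = [[0] * d for _ in range(n)]
    for i in range(n):
        for j in range(d):
            if rng.random() >= p_obs:
                continue  # entry not observed
            acc = q_pos[i] if x_star[j] == 1 else q_neg[i]
            Y[i][j] = x_star[j] if rng.random() < acc else -x_star[j]
    return Y

# Usage sketch: 2 workers, 3 questions, sparsely observed.
rng = random.Random(0)
Y = sample_observations([1, -1, 1], q_pos=[0.9, 0.6], q_neg=[0.8, 0.5],
                        p_obs=0.3, rng=rng)
```

The one-coin Dawid-Skene model studied in this paper corresponds to the special case `q_pos == q_neg`.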
³ That said, there are related problems [37, 18] involving permutation-based models where statistically optimal techniques are computationally efficient, and also adapt optimally to much more restrictive parameter-based models.

Acknowledgements

This work was partially supported by Office of Naval Research MURI grant DOD-002888, Air Force Office of Scientific Research grant AFOSR-FA9550-14-1-001, Office of Naval Research grant ONR-N00014, and National Science Foundation grants CIF: 31712-23800, CRII: CIF: 1755656, and CIF: 1763734. The work of NBS was also supported in part by a Microsoft Research PhD fellowship. We thank the authors of the paper [48] for sharing their implementation of their Spectral-EM algorithm.

References

[1] A. S. Bandeira, R. Van Handel, et al. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. The Annals of Probability, 44(4):2479–2506, 2016.
[2] S. Bernstein. On a modification of Chebyshev's inequality and of the error formula of Laplace. Ann. Sci. Inst. Sav. Ukraine, Sect. Math, 1(4):38–49, 1924.
[3] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[4] A. Dalalyan, N. Schreuder, and V.-E. Brunel. A nonasymptotic law of iterated logarithm for general M-estimators. In International Conference on Artificial Intelligence and Statistics, pages 1331–1341, 2020.
[5] N. Dalvi, A. Dasgupta, R. Kumar, and V. Rastogi. Aggregating crowdsourced binary ratings. In Conference on World Wide Web, pages 285–294, 2013.
[6] A. Dawid and A. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20–28, 1979.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[8] C. Eickhoff and A. de Vries.
How crowdsourcable is your task? In Crowdsourcing for Search and Data Mining, 2011.
[9] W. Feller. Generalization of a probability limit theorem of Cramér. Transactions of the American Mathematical Society, 54(3):361–372, 1943.
[10] N. Flammarion, C. Mao, and P. Rigollet. Optimal rates of statistical seriation. Bernoulli, 25(1):623–653, 2019.
[11] U. Gadiraju, B. Fetahu, and R. Kawase. Training workers for improving performance in crowdsourcing microtasks. In Design for Teaching and Learning in a Networked World, 2015.
[12] U. Gadiraju, R. Kawase, S. Dietze, and G. Demartini. Understanding malicious behavior in crowdsourcing platforms: The case of online surveys. In ACM Conference on Human Factors in Computing Systems, 2015.
[13] C. Gao, Y. Lu, and D. Zhou. Exact exponent in optimal rates for crowdsourcing. In International Conference on Machine Learning (ICML), 2016.
[14] C. Gao and D. Zhou. Minimax optimal convergence rates for estimating ground truth from crowdsourced labels. arXiv preprint arXiv:1310.5764, 2013.
[15] A. Ghosh, S. Kale, and P. McAfee. Who moderates the moderators? Crowdsourcing abuse detection in user-generated content. In ACM Conference on Electronic Commerce, 2011.
[16] E. N. Gilbert. A comparison of signalling alphabets. Bell System Technical Journal, 31(3):504–522, 1952.
[17] Q. Han. Global empirical risk minimizers with "shape constraints" are rate optimal in general dimensions. arXiv preprint arXiv:1905.12823, 2019.
[18] R. Heckel, N. B. Shah, K. Ramchandran, M. J. Wainwright, et al. Active ranking from pairwise comparisons and when parametric assumptions do not help. The Annals of Statistics, 47(6):3099–3126, 2019.
[19] A. Jaffe, B. Nadler, and Y. Kluger. Estimating the accuracies of multiple classifiers without labeled data. In Artificial Intelligence and Statistics, pages 407–415, 2015.
[20] J. Jantzen, J. Norup, G. Dounias, and B. Bjerregaard.
Pap-smear benchmark data for pattern classification. Nature inspired Smart Information Systems (NiSIS 2005), pages 1–9, 2005.
[21] D. Karger, S. Oh, and D. Shah. Budget-optimal crowdsourcing using low-rank matrix approximations. In Annual Allerton Conference on Communication, Control, and Computing, 2011.
[22] D. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. In Advances in Neural Information Processing Systems, 2011.
[23] A. Khetan and S. Oh. Achieving budget-optimality with adaptive schemes in crowdsourcing. Advances in Neural Information Processing Systems, 29:4844–4852, 2016.
[24] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.
[25] T. Klein and E. Rio. Concentration around the mean for maxima of empirical processes. The Annals of Probability, 33(3):1060–1077, 2005.
[26] S. Lazebnik, C. Schmid, and J. Ponce. A sparse texture representation using local affine regions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1265–1278, 2005.
[27] A. Liu and A. Moitra. Better algorithms for estimating non-parametric models in crowdsourcing and rank aggregation. In Conference on Learning Theory, 2020.
[28] Q. Liu, J. Peng, and A. T. Ihler. Variational inference for crowdsourcing. In Advances in Neural Information Processing Systems, pages 692–700, 2012.
[29] C. Mao, A. Pananjady, and M. J. Wainwright. Breaking the 1/√n barrier: Faster rates for permutation-based models in polynomial time. In Conference on Learning Theory, pages 2037–2042, 2018.
[30] J. Matoušek and J. Vondrák. The probabilistic method. Lecture Notes, Department of Applied Mathematics, Charles University, Prague, 2001.
[31] R.
Noothigattu, N. Shah, and A. Procaccia. Loss functions, axioms, and peer review. In ICML Workshop on Incentives in Machine Learning, July 2020.
[32] F. Parisi, F. Strino, B. Nadler, and Y. Kluger. Ranking and combining multiple predictors without labeled data. Proceedings of the National Academy of Sciences, 111(4):1253–1258, 2014.
[33] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. The Journal of Machine Learning Research, 99:1297–1322, 2010.
[34] N. B. Shah, S. Balakrishnan, A. Guntuboyina, and M. J. Wainwright. Stochastically transitive models for pairwise comparisons: Statistical and computational issues. IEEE Transactions on Information Theory, 63(2):934–959, 2017.
[35] N. B. Shah, S. Balakrishnan, and M. J. Wainwright. Feeling the Bern: Adaptive estimators for Bernoulli probabilities of pairwise comparisons. IEEE Transactions on Information Theory, 2019.
[36] N. B. Shah, S. Balakrishnan, and M. J. Wainwright. Low permutation-rank matrices: Structural properties and noisy completion. Journal of Machine Learning Research, 2019.
[37] N. B. Shah and M. J. Wainwright. Simple, robust and optimal ranking from pairwise comparisons. Journal of Machine Learning Research, 2018.
[38] N. B. Shah and D. Zhou. Double or nothing: Multiplicative incentive mechanisms for crowdsourcing. The Journal of Machine Learning Research, 17(1):5725–5776, 2016.
[39] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 614–622. ACM, 2008.
[40] I. Stelmakh, N. Shah, and A. Singh. On testing for biases in peer review. In NeurIPS, 2019.
[41] I. Stelmakh, N. Shah, and A. Singh. PeerReview4All: Fair and accurate reviewer assignment in peer review.
In Conference on Algorithmic Learning Theory (ALT), 2019.
[42] G. Stewart and J.-G. Sun. Matrix Perturbation Theory. 1990.
[43] R. Varshamov. Estimate of the number of signals in error correcting codes. In Dokl. Akad. Nauk SSSR, 1957.
[44] J. Wang and N. B. Shah. Your 2 is my 1, your 3 is my 9: Handling arbitrary miscalibrations in ratings. In AAMAS, 2019.
[45] J. Wang, I. Stelmakh, Y. Wei, and N. Shah. Debiasing evaluations that are biased by evaluations. In AAAI, 2021.
[46] J. Whitehill, T.-f. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems, pages 2035–2043, 2009.
[47] M.-C. Yuen, I. King, and K.-S. Leung. A survey of crowdsourcing systems. In IEEE International Conference on Social Computing, 2011.
[48] Y. Zhang, X. Chen, D. Zhou, and M. I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. The Journal of Machine Learning Research, 17(1):3537–3580, 2016.
[49] D. Zhou, Q. Liu, J. C. Platt, C. Meek, and N. B. Shah. Regularized minimax conditional entropy for crowdsourcing. arXiv preprint arXiv:1503.07240, 2015.
[50] D. Zhou, J. Platt, S. Basu, and Y. Mao. Learning from the wisdom of crowds by minimax entropy. In Advances in Neural Information Processing Systems 25, pages 2204–2212, 2012.

Appendix: Analysis of the majority voting estimator

In this section, we analyze the majority voting estimator, given by

$$[\tilde{x}^{\mathrm{MV}}]_j \in \arg\max_{b \in \{-1, 1\}} \sum_{i=1}^{n} \mathbf{1}\{Y_{ij} = b\} \quad \text{for every } j \in [d].$$

Here we use $\mathbf{1}\{\cdot\}$ to denote the indicator function. The following result provides bounds on the risk of majority voting under the $Q^*$-semimetric loss in the regime of interest (R2).

Proposition 2.
For the majority vote estimator, the risk over the Dawid-Skene class is lower bounded as

$$\sup_{x^* \in \{-1, 1\}^d} \; \sup_{Q^* \in \mathcal{C}_{\mathrm{DS}}} \mathbb{E}\bigl[\mathcal{L}_{Q^*}(\tilde{x}^{\mathrm{MV}}, x^*)\bigr] \geq c_L \frac{1}{\sqrt{n p_{\mathrm{obs}}}}, \quad (51)$$

for some positive constant $c_L$.

A comparison of the bound (51) with the results of Theorem 1, Theorem 3(a) and Theorem 4 shows that the majority voting estimator is suboptimal in terms of its sample complexity. Since this suboptimality holds for the (smaller) Dawid-Skene model class, it also holds for the (larger) intermediate model class, as well as the permutation-based model class. The remainder of this section is devoted to the proof of this claim.

Proof of Proposition 2: We begin with a lower bound due to Feller [9] (see also [30, Theorem 7.3.1]) on the tail probability of a sum of independent random variables.

Lemma 7 (Feller). There exist positive universal constants $c_1$ and $c_2$ such that for any set of independent random variables $X_1, \ldots, X_n$ satisfying $\mathbb{E}[X_i] = 0$ and $|X_i| \leq M$ for every $i \in [n]$, if $\sum_{i=1}^{n} \mathbb{E}[X_i^2] \geq c_1$, then

$$\mathbb{P}\Bigl(\sum_{i=1}^{n} X_i > t\Bigr) \geq c_2 \exp\Bigl(-\frac{t^2}{12 \sum_{i=1}^{n} \mathbb{E}[X_i^2]}\Bigr), \quad \text{for every } t \in \Bigl[0,\; \frac{\sum_{i=1}^{n} \mathbb{E}[X_i^2]}{M^2 \sqrt{c_1}}\Bigr].$$

In what follows, we use Lemma 7 to derive the claimed lower bound on the error incurred by the majority voting algorithm. To this end, let $S \subset [n]$ denote a set of $|S| = \sqrt{\frac{n}{2 p_{\mathrm{obs}}}}$ workers. Consider the following value of the matrix $Q^*$:

$$Q^*_{ij} = \begin{cases} 1 & \text{if } i \in S \\ \tfrac{1}{2} & \text{otherwise.} \end{cases}$$

Then for any question $j \in [d]$, we have $\sum_{i=1}^{n} (2 Q^*_{ij} - 1)^2 = \sqrt{\frac{n}{2 p_{\mathrm{obs}}}}$. Now suppose that $x^*_j = -1$ for every question $j \in [d]$. Then for every $i \in S$, the observations are distributed as

$$Y_{ij} = \begin{cases} 0 & \text{with probability } 1 - p_{\mathrm{obs}} \\ -1 & \text{with probability } p_{\mathrm{obs}}, \end{cases}$$

and for every $i \notin S$, as

$$Y_{ij} = \begin{cases} 0 & \text{with probability } 1 - p_{\mathrm{obs}} \\ -1 & \text{with probability } 0.5\, p_{\mathrm{obs}} \\ 1 & \text{with probability } 0.5\, p_{\mathrm{obs}}. \end{cases}$$

Consider any question $j \in [d]$.
Then in this setting, the majority voting estimator incorrectly estimates the value of $x^*_j$ when $\sum_{i=1}^{n} Y_{ij} > 0$. We now use Lemma 7 to obtain a lower bound on the probability of the occurrence of this event. Some simple algebra yields

$$\sum_{i=1}^{n} \mathbb{E}[Y_{ij}] = -|S|\, p_{\mathrm{obs}} \quad \text{and} \quad \sum_{i=1}^{n} \mathbb{E}[Y_{ij}^2] = n p_{\mathrm{obs}}.$$

In order to satisfy the conditions required by the lemma, we assume that $n p_{\mathrm{obs}} > c_1$. Note that this condition makes the problem strictly easier than the condition $n p_{\mathrm{obs}} \geq 1$ assumed otherwise, and affects the lower bounds by at most a constant factor $c_1$. An application of Lemma 7 with $t = -\sum_{i=1}^{n} \mathbb{E}[Y_{ij}] = |S|\, p_{\mathrm{obs}}$ now yields

$$\mathbb{P}\Bigl(\sum_{i=1}^{n} Y_{ij} > 0\Bigr) \geq c_2 \exp\Bigl(-\frac{|S|^2 p_{\mathrm{obs}}^2}{12\, n p_{\mathrm{obs}}}\Bigr) \overset{(i)}{\geq} c_0,$$

for some constant $c_0 > 0$ that may depend only on $c_1$ and $c_2$, where inequality (i) is a consequence of the choice $|S| = \sqrt{\frac{n}{2 p_{\mathrm{obs}}}}$. Now that we have established a constant-valued lower bound on the probability of error in the estimation of $x^*_j$ for every $j \in [d]$, for the value of $Q^*$ under consideration we have

$$\mathbb{P}\bigl([\tilde{x}^{\mathrm{MV}}]_j \neq x^*_j\bigr) \sum_{i=1}^{n} \bigl(Q^*_{ij} - \tfrac{1}{2}\bigr)^2 \geq \sqrt{\frac{n}{2 p_{\mathrm{obs}}}}\, c_0,$$

and consequently

$$\mathbb{E}\bigl[\mathcal{L}_{Q^*}(\tilde{x}^{\mathrm{MV}}, x^*)\bigr] \geq \frac{c_0}{\sqrt{2 n p_{\mathrm{obs}}}},$$

as claimed.
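The hard instance in the proof above is easy to simulate. The Monte Carlo sketch below (function names and parameter values are ours, chosen only for illustration) uses $|S| = \sqrt{n/(2 p_{\mathrm{obs}})}$ perfectly reliable workers, with all remaining workers guessing uniformly at random, and empirically exhibits a per-question error probability for majority voting that is bounded away from zero:

```python
# Monte Carlo sketch of the hard instance used in the proof of Proposition 2.
# All names and parameter values here are ours, chosen for illustration.
import math
import random

def majority_vote(column):
    """Majority-vote decoder for one question: a strictly positive label sum
    decodes to +1, otherwise -1 (matching the error event sum > 0)."""
    return 1 if sum(column) > 0 else -1

def simulate_error_rate(n=2500, p_obs=0.04, trials=500, seed=1):
    """Fraction of trials in which majority vote misdecodes a question whose
    true answer is -1, under the two-block Q* from the proof."""
    rng = random.Random(seed)
    s = int(math.sqrt(n / (2 * p_obs)))  # number of perfectly reliable workers
    errors = 0
    for _ in range(trials):
        column = []
        for i in range(n):
            if rng.random() >= p_obs:
                column.append(0)                    # entry not observed
            elif i < s:
                column.append(-1)                   # reliable worker: always correct
            else:
                column.append(rng.choice((-1, 1)))  # remaining workers guess uniformly
        if majority_vote(column) != -1:
            errors += 1
    return errors / trials
```

Despite the $\sqrt{n/(2 p_{\mathrm{obs}})}$ always-correct workers, the observed votes of the guessing majority keep the error probability at a constant level, in line with the $1/\sqrt{n p_{\mathrm{obs}}}$ lower bound (51).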
