Error Rate Bounds and Iterative Weighted Majority Voting for Crowdsourcing
Hongwei Li (hwli@stat.berkeley.edu), Department of Statistics, University of California, Berkeley, CA 94720-1776, USA
Bin Yu (binyu@stat.berkeley.edu), Department of Statistics & EECS, University of California, Berkeley, CA 94720-1776, USA

Abstract

Crowdsourcing has become an effective and popular tool for human-powered computation to label large datasets. Since the workers can be unreliable, it is common in crowdsourcing to assign multiple workers to one task, and to aggregate the labels in order to obtain results of high quality. In this paper, we provide finite-sample exponential bounds on the error rate (in probability and in expectation) of general aggregation rules under the Dawid-Skene crowdsourcing model. The bounds are derived for multi-class labeling, and can be used to analyze many aggregation methods, including majority voting, weighted majority voting and the oracle Maximum A Posteriori (MAP) rule. We show that the oracle MAP rule approximately optimizes our upper bound on the mean error rate of weighted majority voting in certain settings. We propose an iterative weighted majority voting (IWMV) method that optimizes the error rate bound and approximates the oracle MAP rule. Its one-step version has a provable theoretical guarantee on the error rate. The IWMV method is intuitive and computationally simple. Experimental results on simulated and real data show that IWMV performs at least on par with the state-of-the-art methods, and it has a much lower computational cost (around one hundred times faster) than the state-of-the-art methods.

Keywords: Crowdsourcing, Error rate bound, Mean error rate, Expectation-Maximization, Weighted majority voting

1. Introduction

There are many tasks which can easily be carried out by people but that tend to be hard for computers, e.g., image annotation, visual design and video event classification. When these tasks are extensive, outsourcing them to experts or well-trained people may be too expensive. Crowdsourcing has recently emerged as a powerful alternative. It outsources tasks to a distributed group of people (called workers) who might be inexperienced in these tasks. However, if we can appropriately aggregate the outputs from a crowd, the resulting labels can be as good as the ones given by experts (Smyth et al., 1995; Snow et al., 2008; Whitehill et al., 2009; Raykar et al., 2010; Welinder et al., 2010; Yan et al., 2010; Liu et al., 2012; Zhou et al., 2012).

The flaws of crowdsourcing are apparent. Each worker is paid purely based on how many tasks he/she has completed (for example, one cent for labeling one image). No ground truth is available to evaluate how well he/she has performed on the tasks. So some workers may randomly submit answers independent of the questions when the tasks assigned to them are beyond their expertise. Moreover, workers are usually not persistent. Some workers may complete many tasks, while others may finish only very few tasks. In spite of these drawbacks, is it still possible to get reliable answers in a crowdsourcing system? The answer is yes. In fact, majority voting (MV) has been able to generate fairly reasonable results (Snow et al., 2008). However, majority voting treats each worker's result as equal in quality. It does not distinguish a spammer from a diligent worker.
Thus majority voting can be significantly improved upon (Karger et al., 2011).

The first improvement over majority voting dates back at least to Dawid and Skene (1979). They assumed that each worker is associated with an unknown confusion matrix, whose rows are discrete conditional distributions of the worker's input given the ground truth. Each off-diagonal element represents the misclassification rate from one class to another, while the diagonal elements represent the accuracy in each class. Based on the labels observed from the workers, the maximum likelihood principle is applied to jointly estimate the unobserved true labels and the worker confusion matrices. Although the likelihood function is non-convex, a local optimum can be obtained by using the Expectation-Maximization (EM) algorithm, which can be initialized by majority voting. Dawid and Skene's model (Dawid and Skene, 1979) can be extended by assuming that the true labels are generated from a logistic model (Raykar et al., 2010), by putting a prior over the worker confusion matrices (Liu et al., 2012), or by taking the task difficulties into account (Bachrach et al., 2012). One may also simplify the assumption made by Dawid and Skene (1979) and consider a confusion matrix with only a single parameter (Karger et al., 2011; Liu et al., 2012), which we call the Homogeneous Dawid-Skene model (Section 2).

Recently, significant progress has been made on inferring the true labels of the items. Raykar et al. (2010) presented a maximum likelihood estimator (via the EM algorithm) that infers worker reliabilities and true labels. Welinder et al. (2010) endowed each item (i.e., image data in their work) with features, which could represent concepts or topics, and workers have different areas of expertise matching these topics. Liu et al. (2012) transformed label inference in crowdsourcing into a standard inference problem in graphical models, and applied approximate variational methods. Zhou et al. (2012) inferred the true labels by applying a minimax entropy principle to the distribution that jointly models the workers, items and labels. Some work also considers the problem of adaptively assigning the tasks to workers for budget efficiency (Ho et al., 2013; Chen et al., 2013).

All the previous work mentioned above focused on applying or extending the Dawid-Skene model, and on inferring the true labels based on it. However, to understand the behavior and consequences of a crowdsourcing system, it is of great interest to investigate the error rate of various aggregation rules. To theoretically analyze a specific algorithm, Karger et al. (2011) provided asymptotic error bounds for their iterative algorithm and also for majority voting. It seems difficult to generalize their results to other aggregation rules in crowdsourcing or to apply them to the finite sample scenario. Very recently, Gao and Zhou (2014) studied the minimax convergence rate of the global maximizer of a lower bound of the marginal-likelihood function under a simplified Dawid-Skene model (i.e., the one-coin model in binary labeling). Their results concern the clustering error rate, which is different from the ordinary error rate, i.e., the proportion of mistakes in the final labeling. They focused on the mathematical properties of the global optimizer of a specific function for a sufficiently large number of workers and items, and not on the behavior of rules/algorithms which find the optimizer or aggregate the results.
In this paper, we focus on providing finite sample bounds on the error rate of some general aggregation rules under crowdsourcing models whose effectiveness on real data has been evaluated in (Dawid and Skene, 1979; Raykar et al., 2010; Liu et al., 2012; Zhou et al., 2012), and on motivating efficient algorithms. Our main contributions are as follows:

1. We derive error rate bounds (in probability and in expectation) of a general type of aggregation rule with any finite number of workers and items under the Dawid-Skene model (with the Class-Conditional Dawid-Skene model and the Homogeneous Dawid-Skene model (Section 2) as special cases).

2. By applying the general error rate bounds to special cases such as weighted majority voting and majority voting under specific models, we gain insights and intuitions. These lead to the oracle bound-optimal rule for designing optimal weighted majority voting, and also to the consistency property of majority voting.

3. We show that the oracle Maximum A Posteriori (MAP) rule approximately optimizes the upper bound on the mean error rate of weighted majority voting. The EM algorithm approximates the oracle MAP rule, thus the error rate bounds can help us understand the EM algorithm in the context of crowdsourcing.

4. We propose a data-driven iterative weighted majority voting (IWMV) algorithm with a performance guarantee on its one-step version (Section 4.2). It is intuitive, easy to implement, and performs as well as the state-of-the-art methods on simulated and real data, but with much lower computational cost.

To the best of our knowledge, this is the first work which focuses on finite sample error rate analysis of general aggregation rules under the practical Dawid-Skene model for crowdsourcing. The results we obtain can be used for analyzing the error rate and sample complexity of algorithms. It is also worth mentioning that most of the previous work focused only on binary crowdsourcing labeling, while our results are stated for multi-class labeling, which naturally applies to the binary case. Meanwhile, we make no assumptions on the number of workers and items, so the results can be applied directly to real crowdsourcing data.

2. Background and formulation

As an example of crowdsourcing, we assume that a set of workers are assigned to perform labeling tasks, such as judging whether an image of an animal is that of a cat, a dog or a sheep, or evaluating whether a video event is abnormal or not.

Throughout this paper, we assume there are M workers and N items for a labeling task with L label classes. We denote the set of workers $[M] = \{1, 2, \dots, M\}$, the set of items $[N] = \{1, 2, \dots, N\}$, and the set of labels $[L] = \{1, 2, \dots, L\}$ (called the label set). The extended label set is defined as $[\bar L] = [L] \cup \{0\} = \{0, 1, 2, \dots, L\}$, where 0 represents a missing label. However, in the case of L = 2, we use the common convention of the label set $\{-1, +1\}$ and the extended label set $\{0, -1, +1\}$. In what follows, we use $y_j$ as the true label for the j-th item, and $\hat y_j$ as the predicted label for the j-th item by an algorithm (in this paper, any parameter with a hat is an estimate of that parameter). Let $\pi_k = P(y_j = k)$ denote the prevalence of label "k" in the true labels of the items, for any $j \in [N]$ and $k \in [L]$.
The observed data matrix is denoted by $Z \in [\bar L]^{M\times N}$, where $Z_{ij}$ is the label given by the i-th worker to the j-th item, and it is 0 if the corresponding label is missing (the i-th worker did not label the j-th item). We introduce the indicator matrix $T = (T_{ij})_{M\times N}$, where $T_{ij} = 1$ indicates that entry (i, j) is observed, and $T_{ij} = 0$ indicates that entry (i, j) is unobserved. Note that T and Z are observed together.

Figure 1: Illustration of the input data matrix. Entry (i, j) is the label of the j-th item given by the i-th worker; unlabeled entries are marked as missing. The set of labels is [L] = {1, 2, 3}.

The process of matching workers with tasks (i.e., labeling items) can be modeled by a probability matrix $Q = (q_{ij})_{M\times N}$, where $q_{ij} = P(T_{ij} = 1)$ is the probability that the j-th item is assigned to the i-th worker (i.e., gets labeled). We call Q the assignment probability matrix. Unlike the fixed assignment configuration in (Karger et al., 2011), the assignment probability matrix is more flexible: it requires neither that each worker labels the same number of items, nor that each item gets labeled by the same number of workers. Hence, the assignment probability matrix covers the most general form of assigning items to workers, and it contains as special cases the settings commonly adopted in the literature, such as every worker having the same chance to label all items (Karger et al., 2011; Liu et al., 2012). More specifically, when $q_{ij} = q_i \in (0,1]$ for all $i \in [M], j \in [N]$, we call it the assignment probability vector $\vec q = (q_1, \dots, q_M)$. If $q_{ij} = q \in (0,1]$ for all $i \in [M], j \in [N]$, then we call it the constant assignment probability q. The three assignment configurations above are referred to as task assignment based on the probability matrix Q, the probability vector $\vec q$ and the constant probability q, respectively.

Footnote 2: The term task assignment seems to imply that workers are passive in labeling items: they will surely label an item whenever they are assigned to it. This might not match reality on a crowdsourcing platform. When a task owner distributes tasks to the crowd, it is likely that most workers will label a set of items and can stop whenever they want to (Snow et al., 2008), unless they are required (by the owner) to complete a specific set of tasks to get paid. Thus the process of matching tasks with workers might be determined either by workers (subjectively) or by task owners (by enforcement), depending on how the owners design and distribute tasks. If the workers have the choice of which items to label, it might be more proper to call the task-worker matching task selection instead of task assignment. However, both cases can be modeled by the probability matrix Q (or the probability vector $\vec q$, or the constant probability q). In what follows, we use the term task assignment for the task-worker matching without introducing ambiguity.

Generally, we use $\pi$, p, q as probabilities, and they might carry indices according to the context. We denote by A and C constants which depend on other given variables, and by a either a general constant or a vector, depending on context. $\eta$ denotes likelihood probabilities in context. $\Theta$ denotes a set of parameters. $\varepsilon$ and $\delta$ are constants in (0, 1), where $\varepsilon$ is used for bounding the error rate, and $\delta$ is used for denoting a positive probability. $H_e(\varepsilon)$ denotes the natural entropy of a Bernoulli random variable with parameter $\varepsilon$, i.e., $H_e(\varepsilon) = -\varepsilon\ln\varepsilon - (1-\varepsilon)\ln(1-\varepsilon)$. The operators ∧ and ∨ denote the min operator and the max operator between two numbers, respectively. Meanwhile, throughout the paper, we will locally define each notation in context before using it.

2.1 Dawid-Skene models

We discuss three models covering all the cases that are widely used for modeling the quality of the workers (Dawid and Skene, 1979; Raykar et al., 2010; Karger et al., 2011; Liu et al., 2012; Zhou et al., 2012). The first one, which is also the most general one, was originally proposed by Dawid and Skene (1979):

General Dawid-Skene model. In this model, the reliability of worker i is modeled as a confusion matrix $P^{(i)} = \left(p^{(i)}_{kl}\right)_{L\times L} \in [0,1]^{L\times L}$, which represents a conditional probability table such that

$$p^{(i)}_{kl} \doteq P(Z_{ij} = l \mid y_j = k, T_{ij} = 1), \quad \forall k, l \in [L], \ \forall i \in [M]. \tag{1}$$
Note that $p^{(i)}_{kk}$ denotes the accuracy of worker i in labeling an item with true label k correctly, and $p^{(i)}_{kl}$, $k \neq l$, represents the probability of mistakenly labeling an item with true label k as l. The number of free parameters for modeling the reliability of a worker is L(L−1) under the General Dawid-Skene model.

Since the General Dawid-Skene model has L(L−1) degrees of freedom per worker, it is flexible, but often leads to overfitting on small datasets. As a further regularization of the worker models, we consider another two models which are special cases of the General Dawid-Skene model, obtained by imposing constraints on the worker confusion matrices.

Figure 2: The graphical model of the General Dawid-Skene model. Note that $P^{(i)}$ is the confusion matrix of worker i; if we change $P^{(i)}$ to $w_i$, the graph becomes the graphical model of the Homogeneous Dawid-Skene model.

• Class-Conditional Dawid-Skene model. In this model, for each worker the probabilities of mistakenly labeling an item with true label k as label l are the same across different l. Formally, we have

$$p^{(i)}_{kk} = P(Z_{ij} = k \mid y_j = k, T_{ij} = 1), \quad \forall k \in [L], \ \forall i \in [M], \qquad p^{(i)}_{kl} = \frac{1 - p^{(i)}_{kk}}{L-1}, \quad \forall k, l \in [L], \ l \neq k, \ \forall i \in [M]. \tag{2}$$

This model constrains the error probabilities of a worker to be the same given the true label of an item. Thus, the off-diagonal elements of each row of the confusion matrix $P^{(i)}$ are equal. The number of free parameters per worker under the Class-Conditional Dawid-Skene model is L.

• Homogeneous Dawid-Skene model. Each worker is assumed to have the same accuracy on each class of items, and the same error probabilities as well. Formally, worker i labels an item correctly with a fixed probability $w_i$ and mistakenly with another fixed probability $\frac{1-w_i}{L-1}$, i.e.,

$$p^{(i)}_{kk} = w_i, \quad \forall k \in [L], \ \forall i \in [M], \qquad p^{(i)}_{kl} = \frac{1-w_i}{L-1}, \quad \forall k, l \in [L], \ k \neq l, \ \forall i \in [M]. \tag{3}$$

In this case, the worker labels an item with the same accuracy, independent of which label the item actually has. The number of parameters per worker is 1 under the Homogeneous Dawid-Skene model.
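To make the data-generating process concrete, the following Python sketch simulates a label matrix Z under the Homogeneous Dawid-Skene model with a constant assignment probability q. The function and variable names are ours, not from the paper; it is a minimal illustration, not the authors' code.

```python
import numpy as np

def simulate_homogeneous_ds(M, N, L, w, q, rng):
    """Simulate crowd labels under the Homogeneous Dawid-Skene model.

    M, N, L : numbers of workers, items and label classes
    w       : length-M array, w[i] = accuracy of worker i
    q       : constant assignment probability P(T_ij = 1)
    Returns (Z, y): Z is M x N with 0 for missing labels; y holds the true labels.
    """
    y = rng.integers(1, L + 1, size=N)          # true labels, uniform over [L]
    T = rng.random((M, N)) < q                  # indicator matrix of observed entries
    Z = np.zeros((M, N), dtype=int)
    for i in range(M):
        for j in range(N):
            if T[i, j]:
                if rng.random() < w[i]:         # correct with probability w_i
                    Z[i, j] = y[j]
                else:                           # uniform over the L-1 wrong labels
                    wrong = [k for k in range(1, L + 1) if k != y[j]]
                    Z[i, j] = rng.choice(wrong)
    return Z, y

rng = np.random.default_rng(0)
Z, y = simulate_homogeneous_ds(M=31, N=200, L=3,
                               w=rng.uniform(0.4, 0.9, 31), q=0.3, rng=rng)
```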
Generally, the parameter set under all three models can be denoted as $\Theta = \left\{\{P^{(i)}\}_{i=1}^M, Q, \pi\right\}$. Specifically, the parameter set of the Homogeneous Dawid-Skene model can be denoted as $\Theta = \left\{\{w_i\}_{i=1}^M, Q, \pi\right\}$.

It is worth mentioning that when L = 2, the Class-Conditional Dawid-Skene model and the General Dawid-Skene model coincide and are referred to as the two-coin model, while the Homogeneous Dawid-Skene model is referred to as the one-coin model in the literature (Raykar et al., 2010; Liu et al., 2012). In signal processing, the Homogeneous Dawid-Skene model is equivalent to the random classification noise model (Angluin and Laird, 1988), and the Class-Conditional Dawid-Skene model is also referred to as the class-conditional noise model (Natarajan et al., 2013). We do not adopt these original terms because the errors come from the limitations of the workers' ability to label items correctly, not from noise in the signal processing sense.

Figure 3: Toy examples of confusion matrices under different worker reliability models, with L = 3 (so the confusion matrices lie in $[0,1]^{3\times 3}$). (a) General Dawid-Skene model. (b) Class-Conditional Dawid-Skene model. (c) Homogeneous Dawid-Skene model. The vertical axis indexes the actual classes and the horizontal axis the predicted classes. Different colors correspond to different conditional probabilities, and the diagonal elements are the accuracies of labeling the corresponding class of items correctly.

Binary labeling is the special case L = 2, and it is the major focus of previous research in crowdsourcing (Raykar et al., 2010; Karger et al., 2011; Liu et al., 2012). As a convention, we assume the set of labels is $\{\pm 1\}$ instead of $\{1, 2\}$ when L = 2. For notational convenience, we define the worker confusion matrix in a different way as follows: for $i = 1, 2, \dots, M$,

$$p^{(i)}_{+} = P(Z_{ij} = 1 \mid y_j = 1, T_{ij} = 1), \qquad p^{(i)}_{-} = P(Z_{ij} = -1 \mid y_j = -1, T_{ij} = 1). \tag{4}$$

Then the parameter set is $\Theta = \left\{\{p^{(i)}_+, p^{(i)}_-\}_{i=1}^M, Q, \pi\right\}$ under this model. When we present results for binary labeling, we will use $p^{(i)}_+$ and $p^{(i)}_-$ without introducing any ambiguity.

Under the models above, the posterior probability that the true label of item j is k is defined as

$$\rho^{(j)}_k = P(y_j = k \mid Z, T, \Theta), \quad \forall j \in [N]. \tag{5}$$

For binary labeling, the posterior probability that the true label of item j is +1 is defined as

$$\rho^{(j)}_+ = P(y_j = 1 \mid Z, T, \Theta), \quad \forall j \in [N]. \tag{6}$$

2.2 Aggregation rules

After collecting the crowdsourced labels, the task owner can use an arbitrary rule to aggregate the multiple noisy labels of an item into a "refined" label for that item. The quality of the final predicted labels depends not only on the input from the workers but also on the aggregation rule. It is hence of great importance to design a good aggregation rule.

A natural aggregation rule is majority voting (Snow et al., 2008; Karger et al., 2011). For multi-class labeling, majority voting can be written formally as

$$\hat y_j = \operatorname*{argmax}_{k\in[L]} \sum_{i=1}^M I(Z_{ij} = k), \tag{7}$$

which gives the j-th item the majority label among all workers.
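As an illustration, majority voting (7) is a one-line computation on the data matrix. This sketch is ours (not the authors' code) and reuses the Z produced by the simulation sketch above, where 0 marks a missing label.

```python
import numpy as np

def majority_vote(Z, L):
    """Majority voting (7); ties are broken toward the smallest class index."""
    counts = np.stack([(Z == k).sum(axis=0) for k in range(1, L + 1)])  # L x N vote counts
    return counts.argmax(axis=0) + 1

y_mv = majority_vote(Z, L=3)
```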
Since workers have different reliabilities, it is inefficient to treat the labels from different workers with the same weight, as majority voting does (Karger et al., 2011). A natural extension of majority voting is weighted majority voting (WMV), which weighs the labels differently. Formally, the weighted majority voting rule can be written as

$$\hat y_j = \operatorname*{argmax}_{k\in[L]} \sum_{i=1}^M \nu_i I(Z_{ij} = k), \tag{8}$$

where $\nu_i \in \mathbb{R}$ is the weight associated with the i-th worker.

The idea behind weighted majority voting can be generalized as follows: given an item and the labels input by the workers, the item can potentially be predicted as any label class in [L]. Suppose each class has a score computed from the worker inputs; we call it the aggregated score of a potential class (since we do not know the true label of an item, we use the term potential class because every label class can potentially be the true label of this item). The aggregated label can then be chosen as the class with the highest score.

Based on the ideas above, we consider a general form of aggregation rule based on maximizing aggregated scores, where the aggregated score of each label class decomposes into a sum of bounded prediction functions associated with the workers, plus a shift constant. We refer to this type of aggregation rule as a decomposable aggregation rule, which has the form

$$\hat y_j = \operatorname*{argmax}_{k\in[L]} s^{(j)}_k \quad \text{with} \quad s^{(j)}_k \doteq \sum_{i=1}^M f_i(k, Z_{ij}) + a_k, \tag{9}$$

where $s^{(j)}_k$ is the aggregated score of potential class k on item j, $a_k \in \mathbb{R}$, and each $f_i : [L]\times[\bar L] \to \mathbb{R}$, $i \in [M]$, is bounded, i.e., $|f_i(k,h)| < \infty$ for all $k \in [L], h \in [\bar L]$. Given k and h, $f_i(k,h)$ is a constant. Intuitively, $f_i(k,h)$ is the score gained by the k-th potential label class when the i-th worker labels an item with label h. It is reasonable to assume that $Z_{ij} = 0$ contributes no information for predicting the true label of the j-th item; thus, we further assume $f_i(k,0) = \text{constant}$ for all $i \in [M], k \in [L]$. Without ambiguity, we will refer to $\{f_1, f_2, \dots, f_M\}$ as score functions in what follows. Note that score functions are usually designed by the task owners when they aggregate the noisy inputs into final predicted results.

For illustration, majority voting is a special case of the decomposable aggregation rule with $f_i(k, Z_{ij}) = 1$ if $Z_{ij} = k$, and 0 if $Z_{ij} \neq k$. For weighted majority voting, $f_i(k, Z_{ij}) = \nu_i$ if $Z_{ij} = k$, and 0 otherwise. Later, we will see more aggregation rules which can be expressed or approximated in this form (Section 4.2).

2.3 Performance metric

Given an estimation or aggregation rule, suppose that its predicted label for item j is $\hat y_j$; then our objective is to minimize the error rate

$$\mathrm{ER} = \frac{1}{N}\sum_{j=1}^N I(\hat y_j \neq y_j). \tag{10}$$

Since the error rate is random, we are also interested in its expected value (i.e., the mean error rate). Formally, the mean error rate is

$$\mathbb{E}[\mathrm{ER}] = \frac{1}{N}\sum_{j=1}^N P(\hat y_j \neq y_j). \tag{11}$$
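A decomposable rule (9) is easy to express in code. The sketch below is our illustration (names are not from the paper): it implements the general score maximization, instantiates weighted majority voting (8) as the special case $f_i(k,h) = \nu_i I(h = k)$, and computes the empirical error rate (10).

```python
import numpy as np

def decomposable_rule(Z, L, f, a=None):
    """Predict labels by maximizing the aggregated scores s_k^(j) of (9).

    f : function f(i, k, h) giving worker i's score for potential class k
        when the worker reports label h (h = 0 means missing).
    a : optional length-L array of shift constants a_k.
    """
    M, N = Z.shape
    s = np.zeros((L, N))
    for k in range(1, L + 1):
        for i in range(M):
            for j in range(N):
                s[k - 1, j] += f(i, k, Z[i, j])
        if a is not None:
            s[k - 1, :] += a[k - 1]
    return s.argmax(axis=0) + 1

def wmv(Z, L, nu):
    """Weighted majority voting (8): f_i(k, h) = nu_i * I(h == k)."""
    return decomposable_rule(Z, L, lambda i, k, h: nu[i] * (h == k))

def error_rate(y_hat, y):
    """Empirical error rate (10)."""
    return np.mean(y_hat != y)
```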
The rest of the paper is organized as follows. In Section 3, we present finite-sample bounds on the error rate of the decomposable aggregation rule, in probability and in expectation, under the General Dawid-Skene model. Section 4 contains the error rate bounds for some special cases with which the crowdsourcing community is widely concerned. In Section 5, we propose an iterative weighted majority voting (IWMV) algorithm based on the analysis of optimizing the error rate bounds, and provide a performance guarantee for the one-step version of the IWMV algorithm. Experimental results on simulated and real-world datasets are presented in Section 6. The proofs are deferred to the supplementary materials.

3. Error rate bounds

In this section, finite sample bounds are provided for the error rate (in high probability and in expectation) of the decomposable aggregation rule under the Dawid-Skene model. Our main results are stated in a setting that is as general as possible. We define this general setting as follows:

• Worker modeling. We focus on the General Dawid-Skene model; the results can then be specialized to the Class-Conditional Dawid-Skene model and the Homogeneous Dawid-Skene model straightforwardly.

• Task assignment. We consider a data matrix observed under the assignment probability matrix $Q = (q_{ij})_{M\times N}$, where $q_{ij} = P(T_{ij} = 1)$. The results can easily be simplified to the scenarios where task assignment is based on a probability vector $\vec q$ or a constant probability q, according to the practical assignment process.

• Aggregation rule. Our main results are presented for the decomposable aggregation rule (9).

3.1 Some quantities of interest

One important question we want to address is how the error rate is bounded with high probability, and which quantities have an impact on the bounds. Before deriving the error rate bounds, we introduce some quantities of interest under the general setting. We shall bear in mind that all the quantities defined here serve the purpose of defining two measures, $t_1$ and $t_2$, which play a central role in bounding the error rate.

The first quantity, $\Gamma$, is associated with the score functions $\{f_1, f_2, \dots, f_M\}$:

$$\Gamma \doteq \sqrt{\sum_{i=1}^M \max_{k,l,h\in[L],\,k\neq l} |f_i(k,h) - f_i(l,h)|^2}. \tag{12}$$

It measures the overall variation of the $f_i$'s in their first argument (i.e., when the potential label class changes). Take weighted majority voting (8) as an example: $f_i(k,h) = \nu_i I(h=k)$, so $\Gamma = \sqrt{\sum_{i=1}^M \nu_i^2} = \|\nu\|_2$. For majority voting, $\Gamma = \sqrt{\sum_{i=1}^M 1} = \sqrt M$. Note that $\Gamma$ is invariant to a translation of the score functions, and linear in their scale. That is, if we design new score functions $f'_i = m f_i + b$ for constants m and b, for all $i \in [M]$, then the corresponding quantity is $\Gamma' = m\Gamma$. Later on, we will see that $\Gamma$ plays a normalization role.

Another quantity, $\Lambda^{(j)}_{kl}$, is defined as the expected gap of the aggregated scores between two potential label classes k and l when the true label is k. Formally,

$$\Lambda^{(j)}_{kl} \doteq \mathbb{E}\left[s^{(j)}_k - s^{(j)}_l \mid y_j = k\right] = \sum_{i=1}^M \sum_{h=1}^L q_{ij}\,\big(f_i(k,h) - f_i(l,h)\big)\,p^{(i)}_{kh} + (a_k - a_l). \tag{13}$$

The larger this quantity is, the easier it is for the aggregation rule to identify the true label as k instead of l (i.e., to predict the label correctly). Like $\Gamma$, $\Lambda^{(j)}_{kl}$ is invariant to translation of the score functions, and linear in their scale.
Take weighted majority voting (8) for illustration: the gap of aggregated scores under the Homogeneous Dawid-Skene model is

$$\Lambda^{(j)}_{kl} = \sum_{i=1}^M \left( q_{ij}\,\nu_i\, w_i - q_{ij}\,\nu_i\,\frac{1-w_i}{L-1}\right) = \frac{1}{L-1}\sum_{i=1}^M q_{ij}\,\nu_i (L w_i - 1),$$

because $f_i(k,h) = \nu_i I(h=k)$ and $p^{(i)}_{kh} = w_i$ if $h = k$, and $\frac{1-w_i}{L-1}$ otherwise.

The following two quantities serve as the lower and upper bounds of the normalized gap of aggregated scores for the j-th item (i.e., the ratio of $\Lambda^{(j)}_{kl}$ to $\Gamma$):

$$\tau_{j,\min} \doteq \min_{k,l\in[L],\,k\neq l} \frac{\Lambda^{(j)}_{kl}}{\Gamma} \quad \text{and} \quad \tau_{j,\max} \doteq \max_{k,l\in[L],\,k\neq l} \frac{\Lambda^{(j)}_{kl}}{\Gamma}. \tag{14}$$

Both $\tau_{j,\min}$ and $\tau_{j,\max}$ are invariant to translation and scale of the score functions.

Now, we introduce the two most important quantities of interest, $t_1$ and $t_2$, which are respectively the lower and upper bounds of the normalized gap of aggregated scores across all items. In our main results, $t_1$ is used to provide a sufficient condition for an upper bound on the error rate of crowdsourced labeling under the general setting, while $t_2$ is used to provide a sufficient condition for a lower bound on the error rate:

$$t_1 \doteq \min_{j\in[N]} \tau_{j,\min} \quad \text{and} \quad t_2 \doteq \max_{j\in[N]} \tau_{j,\max}. \tag{15}$$

Both $t_1$ and $t_2$ are invariant to translation and scale of the score functions $\{f_1, f_2, \dots, f_M\}$. They reflect how good a group of workers is, how the tasks are assigned, and how well the aggregation rule suits this type of labeling task.

Besides the quantities above, we introduce two further notations which capture the fluctuation of the score functions and of the gap of aggregated scores. These two quantities will be used to bound the mean error rate of crowdsourced labeling. The quantity c measures the maximum change among all the score functions when the potential label class changes from one label to another:

$$c = \frac{1}{\Gamma} \cdot \max_{i\in[M],\,k,l,h\in[L],\,k\neq l} |f_i(k,h) - f_i(l,h)|. \tag{16}$$

For weighted majority voting (8), $c = \frac{\max_{i\in[M]}|\nu_i|}{\|\nu\|_2} = \frac{\|\nu\|_\infty}{\|\nu\|_2}$, and for majority voting (7), $c = \frac{1}{\sqrt M}$. Another quantity, $\sigma^2$, relates to the variation of the gap of aggregated scores:

$$\sigma^2 = \frac{1}{\Gamma^2} \cdot \max_{j\in[N],\,k,l\in[L],\,k\neq l} \sum_{i=1}^M \sum_{h=1}^L q_{ij}\,\big(f_i(k,h) - f_i(l,h)\big)^2\, p^{(i)}_{kh}. \tag{17}$$

Note that $0 < c \le 1$ and $0 < \sigma^2 \le \max_{i\in[M],j\in[N]} q_{ij}$. Both c and $\sigma^2$ are invariant to translation and scale of the score functions $\{f_1, f_2, \dots, f_M\}$. In the next section, using $t_1$ and $t_2$ we derive high probability bounds on the error rate of crowdsourced labeling, and together with c and $\sigma^2$ we derive bounds on the mean error rate under the general setting.

3.2 Main results

In this section, we start with a main theorem providing finite-sample error rate bounds for decomposable aggregation rules under the General Dawid-Skene model (1). To lighten the notation, we define two functions:

$$\phi(x) = e^{-\frac{x^2}{2}}, \quad x \in \mathbb{R}, \tag{18}$$

$$\mathrm{D}(x \,\|\, y) = x \ln\frac{x}{y} + (1-x)\ln\frac{1-x}{1-y}, \quad \forall x, y \in (0,1). \tag{19}$$

$\phi(\cdot)$ is the unnormalized standard Gaussian density function. $\mathrm{D}(x\|y)$ is the Kullback-Leibler divergence between two Bernoulli distributions with parameters x and y respectively.
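For WMV under the Homogeneous Dawid-Skene model with a constant assignment probability, the quantities above collapse to closed forms (these reappear in Corollary 5 of Section 4, where the relaxation $\sigma^2 = q$ is used). A sketch with our own function names, not the authors' code:

```python
import numpy as np

def wmv_quantities(nu, w, q, L):
    """t, c, sigma^2 for WMV under the Homogeneous Dawid-Skene model
    with constant assignment probability q (here t_1 = t_2 = t)."""
    nu = np.asarray(nu, dtype=float)
    w = np.asarray(w, dtype=float)
    t = q / (L - 1) * np.sum(nu * (L * w - 1)) / np.linalg.norm(nu)
    c = np.max(np.abs(nu)) / np.linalg.norm(nu)
    sigma2 = q                      # the upper bound on (17) used in Corollary 5
    return t, c, sigma2

def phi(x):
    """Unnormalized Gaussian density (18)."""
    return np.exp(-x**2 / 2)

def kl_bernoulli(x, y):
    """KL divergence D(x||y) between Bernoulli(x) and Bernoulli(y), as in (19)."""
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))
```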
The following theorem provides sufficient conditions and the corresponding high probability bounds on the error rate under the general setting described in Section 3.

Theorem 1 (Bounding the error rate with high probability) Assume the General Dawid-Skene model as in (1), with the prediction function for each item as in (9), and suppose the task assignment is based on the probability matrix $Q = (q_{ij})_{M\times N}$, where $q_{ij}$ is the probability that worker i labels item j. For any $\varepsilon \in (0,1)$, with the notation defined in (12) to (19), we have:

(1) If $t_1 \ge \sqrt{2\ln\frac{L-1}{\varepsilon}}$, then

$$\frac{1}{N}\sum_{j=1}^N I(\hat y_j \neq y_j) \le \varepsilon \quad \text{with probability at least } 1 - e^{-N\,\mathrm{D}(\varepsilon \,\|\, (L-1)\phi(t_1))}. \tag{20}$$

(2) If $t_2 \le -\sqrt{2\ln\frac{1}{1-\varepsilon}}$, then

$$\frac{1}{N}\sum_{j=1}^N I(\hat y_j \neq y_j) \ge \varepsilon \quad \text{with probability at least } 1 - e^{-N\,\mathrm{D}(\varepsilon \,\|\, 1-\phi(t_2))}. \tag{21}$$

Remark: The high probability bounds on the error rate require conditions on $t_1$ and $t_2$, which are related to the normalized gap of aggregated scores (Section 3.1). Basically, if the score of predicting an item as its true label is larger than the score of predicting it as any wrong label, then the error rate is likely to be small, and thus it is bounded from above with high probability. The interpretation of the lower bound and its condition is similar.

To ensure that the probability of bounding the error rate is at least $1-\delta$, we would have to solve the equation $\mathrm{D}(\varepsilon \,\|\, (L-1)\phi(t_1)) = \frac{1}{N}\ln\frac{1}{\delta}$, which cannot be solved analytically. Thus, we need to determine the minimum $t_1$ that bounds the error rate with probability at least $1-\delta$. The following theorem serves this purpose by slightly relaxing the conditions on $t_1$ and $t_2$ in Theorem 1. Before presenting it, we define a quantity C, depending on the parameters $\varepsilon, \delta \in (0,1)$, to lighten the notation in the theorem:

$$C(\varepsilon, \delta) = 1 + \exp\left\{\frac{1}{\varepsilon}\left(H_e(\varepsilon) + \frac{1}{N}\ln\frac{1}{\delta}\right)\right\}, \tag{22}$$

where $H_e(\varepsilon) = -\varepsilon\ln\varepsilon - (1-\varepsilon)\ln(1-\varepsilon)$ is the natural entropy of a Bernoulli random variable with parameter $\varepsilon$.

Theorem 2 With the same notation as in Theorem 1, for any $\varepsilon, \delta \in (0,1)$, we have:

(1) if $t_1 \ge \sqrt{2\ln[(L-1)\,C(\varepsilon,\delta)]}$, then $\frac{1}{N}\sum_{j=1}^N I(\hat y_j \neq y_j) \le \varepsilon$ with probability at least $1-\delta$;

(2) if $t_2 \le -\sqrt{2\ln C(1-\varepsilon,\delta)}$, then $\frac{1}{N}\sum_{j=1}^N I(\hat y_j \neq y_j) \ge \varepsilon$ with probability at least $1-\delta$.

Remark: For any scenario that can be formulated in the general setting of Section 3, with $t_1$ and $t_2$ computed as in Section 3.1, both Theorem 1 and Theorem 2 can be applied. Therefore, in the rest of the paper, for any special case (Section 4) of the general setting, we will present only one of Theorem 1 and Theorem 2, and omit the other for clarity.

In practice, the mean error rate might be a better measure of performance because of its non-random nature. The accuracy of an algorithm is often evaluated by taking an empirical average of its performance over trials, which is a consistent estimator of the mean error rate. Thus it is of general interest to bound the mean error rate.
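Theorem 2 gives an explicit threshold on $t_1$. The helper below is our sketch: it computes $C(\varepsilon,\delta)$ as reconstructed in (22) and the corresponding minimum $t_1$ of part (1), here for the hypothetical choice $\varepsilon = 0.05$, $\delta = 0.1$, L = 3 and N = 200.

```python
import numpy as np

def C(eps, delta, N):
    """C(eps, delta) as in (22)."""
    H_e = -eps * np.log(eps) - (1 - eps) * np.log(1 - eps)
    return 1 + np.exp((H_e + np.log(1 / delta) / N) / eps)

def t1_threshold(eps, delta, N, L):
    """Minimum t_1 in Theorem 2(1) ensuring error rate <= eps w.p. >= 1 - delta."""
    return np.sqrt(2 * np.log((L - 1) * C(eps, delta, N)))

print(t1_threshold(eps=0.05, delta=0.1, N=200, L=3))   # about 3.1
```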
For the next theorem, we present the mean error rate bound under the general setting used in Theorem 1.

Theorem 3 (Bounding the mean error rate) Under the same setting as in Theorem 1, with c and $\sigma^2$ defined as in (16) and (17) respectively:

(1) if $t_1 \ge 0$, then $\frac{1}{N}\sum_{j=1}^N P(\hat y_j \neq y_j) \le (L-1)\cdot\min\left\{\exp\left(-\frac{t_1^2}{2}\right),\ \exp\left(-\frac{t_1^2}{2(\sigma^2 + c\,t_1/3)}\right)\right\}$;

(2) if $t_2 \le 0$, then $\frac{1}{N}\sum_{j=1}^N P(\hat y_j \neq y_j) \ge 1 - \min\left\{\exp\left(-\frac{t_2^2}{2}\right),\ \exp\left(-\frac{t_2^2}{2(\sigma^2 - c\,t_2/3)}\right)\right\}$.

Remark: Each result above is composed of two exponential bounds, and neither is dominant over the other in general. Thus each component inside the min operator can serve as an individual bound on the mean error rate. Take the upper bound in (1) as an example: when $t_1$ is small (recall that both $\sigma^2$ and c are bounded above by 1), the second component is tighter than the first, and the error rate bound behaves like $e^{-t_1}$. Otherwise, the first component is tighter, and the mean error rate behaves like $e^{-t_1^2}$.

All proofs of the results in this section are deferred to Appendix A. In the following section, we demonstrate how the main results can be applied to various crowdsourced labeling settings, and provide the corresponding theoretical bounds on the error rate.

4. Applying the error rate bounds to some typical scenarios

In this section, we apply the general results of Section 3 to some common settings, such as binary labeling, and to aggregation rules such as majority voting and weighted majority voting. Specifically, we consider the following scenarios:

• Worker modeling: Recall that the General Dawid-Skene model is the same as the Class-Conditional Dawid-Skene model when L = 2. We cover the binary case with a certain aggregation rule under the General Dawid-Skene model. For L > 2, we focus on multi-class labeling under the Homogeneous Dawid-Skene model.

• Task assignment: We consider the scenario where the task assignment is based on a probability vector $\vec q = (q_1, q_2, \dots, q_M)$ or a constant probability q.

• Aggregation rule: We focus on the weighted majority voting (WMV) rule and majority voting (MV), since they are intuitive and of common interest. In the case of binary labeling, we consider a general hyperplane rule. Later, we present results for the Maximum A Posteriori rule.

4.1 Error rate bounds for the hyperplane rule, WMV and MV

It turns out that, in practice, many prediction methods for binary labeling ($y_j \in \{\pm 1\}$) can be formulated as the sign function of a hyperplane, such as majority voting, weighted majority voting and the oracle MAP rule (Section 4.2). In this section, we apply our results from Section 3 to discuss the error rate of aggregation rules whose decision boundary is a hyperplane in the high dimensional space in which the label vector of each item (i.e., all the labels from the workers for this item) lies. Formally, the aggregation rule is

$$\hat y_j = \mathrm{sign}\left(\sum_{i=1}^M \nu_i Z_{ij} + a\right). \tag{23}$$

This is called the general hyperplane rule with unnormalized weights $\nu = (\nu_1, \dots, \nu_M)$ and shift constant a. For binary labeling, majority voting is the special case with $\nu_i = 1$ for all $i \in [M]$ and $a = 0$. Theorem 1 and Theorem 3 can be applied directly to derive the error rate bounds of the general hyperplane rule in the following corollary.
When the task assignment is based on the probability vector $\vec q$, the corresponding quantities of interest for the hyperplane rule (as in Section 3.1) are

$$t_1 = \frac{1}{\|\nu\|_2}\left[\left(\sum_{i=1}^M q_i \nu_i (2p^{(i)}_+ - 1) + a\right) \wedge \left(\sum_{i=1}^M q_i \nu_i (2p^{(i)}_- - 1) - a\right)\right], \tag{24}$$

$$t_2 = \frac{1}{\|\nu\|_2}\left[\left(\sum_{i=1}^M q_i \nu_i (2p^{(i)}_+ - 1) + a\right) \vee \left(\sum_{i=1}^M q_i \nu_i (2p^{(i)}_- - 1) - a\right)\right], \tag{25}$$

$$c = \frac{\|\nu\|_\infty}{\|\nu\|_2} \quad \text{and} \quad \sigma^2 = \frac{1}{\|\nu\|_2^2}\sum_{i=1}^M q_i \nu_i^2, \tag{26}$$

where ∧ and ∨ are the min and max operators respectively, $\|\nu\|_\infty \doteq \max_{i\in[M]}|\nu_i|$, and the functions $\phi(x)$ and $\mathrm{D}(x\|y)$ are defined in (18) and (19).

Corollary 4 (Hyperplane rule in binary labeling) Consider binary labeling under the General Dawid-Skene model, i.e., $y_j \in \{\pm 1\}$, with $p^{(i)}_+$ and $p^{(i)}_-$ defined as in (4). Suppose the task assignment is based on the probability vector $\vec q = (q_1, q_2, \dots, q_M)$, where $q_i$ is the probability that worker i labels an item, the aggregation rule is the general hyperplane rule (23), and the quantities of interest are defined in (24)-(26). Then, for any $\varepsilon \in (0,1)$:

(1) If $t_1 \ge 0$, then $\frac{1}{N}\sum_{j=1}^N P(\hat y_j \neq y_j) \le \min\left\{\exp\left(-\frac{t_1^2}{2}\right), \exp\left(-\frac{t_1^2}{2(\sigma^2 + c\,t_1/3)}\right)\right\}$. Furthermore, if $t_1 \ge \sqrt{2\ln\frac{1}{\varepsilon}}$, then $P\left(\frac{1}{N}\sum_{j=1}^N I(\hat y_j \neq y_j) \le \varepsilon\right) \ge 1 - e^{-N\,\mathrm{D}(\varepsilon\|\phi(t_1))}$.

(2) If $t_2 \le 0$, then $\frac{1}{N}\sum_{j=1}^N P(\hat y_j \neq y_j) \ge 1 - \min\left\{\exp\left(-\frac{t_2^2}{2}\right), \exp\left(-\frac{t_2^2}{2(\sigma^2 - c\,t_2/3)}\right)\right\}$. Furthermore, if $t_2 \le -\sqrt{2\ln\frac{1}{1-\varepsilon}}$, then $P\left(\frac{1}{N}\sum_{j=1}^N I(\hat y_j \neq y_j) \ge \varepsilon\right) \ge 1 - e^{-N\,\mathrm{D}(\varepsilon\|1-\phi(t_2))}$.

Proof The aggregation rule can be expressed as $\hat y_j = \operatorname{argmax}_{k\in\{\pm 1\}} \sum_{i=1}^M \nu_i I(Z_{ij} = k) + a_k$, where $a_+ = a$ and $a_- = 0$. Directly applying Theorem 1 and Theorem 3, with $f_i(k,k) = \nu_i$ for $k \in \{\pm 1\}$, $f_i(k,l) = 0$ for $k \neq l$, expressing $p^{(i)}_{kl}$ in terms of $p^{(i)}_+$ and $p^{(i)}_-$, and setting $q_{ij} = q_i$ for all $j \in [N]$, we obtain the desired result immediately.

Remark: From this corollary, we know that if $t_1$ is large enough, then the error rate is upper bounded by $\varepsilon$ with high probability. This means that when the workers' reliabilities are generally good, it is very likely that we will obtain high quality aggregated results. Note that we can usually choose q, $\nu$ and a freely, so the most important factors are the worker reliabilities on positive samples ($p^{(i)}_+$) and on negative samples ($p^{(i)}_-$). If one needs to know, given $\varepsilon$ and $\delta$, the results under the setting above corresponding to Theorem 2, one can simply compute $t_1$ and $t_2$ as listed in Corollary 4; then the conditions and bounds in Theorem 2 hold.

As mentioned at the beginning of this section, weighted majority voting (WMV) is an important special case covered by our results in Section 3. For the next several results, we focus on the mean error rate bound of WMV and MV under the Homogeneous Dawid-Skene model. For simplicity, we consider the case where the task assignment is based on a constant probability q. Therefore, in the following corollary, the label of any entry (i, j) is assumed to be revealed with constant probability q, and weighted majority voting (8) is applied to obtain the aggregated labels.
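The quantities (24)-(26) and the mean error rate bound of Corollary 4(1) are straightforward to evaluate numerically. The sketch below is ours, with hypothetical inputs; it is an illustration of the formulas, not the authors' code.

```python
import numpy as np

def hyperplane_bound(nu, a, q, p_plus, p_minus):
    """Quantities (24)-(26) and the Corollary 4(1) mean error rate bound."""
    nu, q = np.asarray(nu, float), np.asarray(q, float)
    g_plus = np.sum(q * nu * (2 * np.asarray(p_plus) - 1)) + a
    g_minus = np.sum(q * nu * (2 * np.asarray(p_minus) - 1)) - a
    norm = np.linalg.norm(nu)
    t1, t2 = min(g_plus, g_minus) / norm, max(g_plus, g_minus) / norm
    c = np.max(np.abs(nu)) / norm
    sigma2 = np.sum(q * nu**2) / norm**2
    if t1 < 0:
        return t1, t2, None                       # condition of part (1) fails
    bound = min(np.exp(-t1**2 / 2),
                np.exp(-t1**2 / (2 * (sigma2 + c * t1 / 3))))
    return t1, t2, bound

# Example: 31 workers, 60% assignment probability, asymmetric reliabilities.
M = 31
t1, t2, bound = hyperplane_bound(nu=np.ones(M), a=0.0, q=np.full(M, 0.6),
                                 p_plus=np.full(M, 0.8), p_minus=np.full(M, 0.7))
```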
Corollary 5 (Weighted majority voting under the Homogeneous Dawid-Skene model) Consider weighted majority voting, whose prediction rule is $\hat y_j = \operatorname{argmax}_{k\in[L]} \sum_{i=1}^M \nu_i I(Z_{ij} = k)$ with $\nu = (\nu_1, \nu_2, \dots, \nu_M)$. Assume the task assignment is based on a constant probability $q \in (0,1]$, and assume the workers are modeled by the Homogeneous Dawid-Skene model with parameters $\{w_i\}_{i=1}^M$. Then, with

$$t = \frac{q}{(L-1)\|\nu\|_2}\sum_{i=1}^M \nu_i (L w_i - 1), \quad c = \frac{\|\nu\|_\infty}{\|\nu\|_2} \quad \text{and} \quad \sigma^2 = q, \tag{27}$$

(1) if $t \ge 0$, then $\frac{1}{N}\sum_{j=1}^N P(\hat y_j \neq y_j) \le (L-1)\min\left\{\exp\left(-\frac{t^2}{2}\right), \exp\left(-\frac{t^2}{2(\sigma^2 + c\,t/3)}\right)\right\}$;

(2) if $t \le 0$, then $\frac{1}{N}\sum_{j=1}^N P(\hat y_j \neq y_j) \ge 1 - \min\left\{\exp\left(-\frac{t^2}{2}\right), \exp\left(-\frac{t^2}{2(\sigma^2 - c\,t/3)}\right)\right\}$.

Proof Under this setting, we have $p^{(i)}_{kk} = w_i$ and $p^{(i)}_{kl} = \frac{1-w_i}{L-1}$ when $k \neq l$. Meanwhile, $f_i(k,k) = \nu_i$ for any $k \in [L]$ and $f_i(k,l) = 0$ for $k, l \in [L], k \neq l$. Then $\Gamma = \|\nu\|_2$. Plugging these into the definitions of $t_1$, $t_2$, c and $\sigma^2$ in Section 3 gives the results.

Remark: Note that $t_1 = t_2 = t$ under the setting of this corollary. By replacing $t_1$ and $t_2$ with t, the high probability bounds on the error rate as in Theorem 1 and Theorem 2 hold as well. It is worth mentioning that the measure t is of critical importance for the quality of the final aggregated results. It depends not only on the workers' reliabilities, but also on the weight vector $\nu$, which we can choose. This leaves room to study the best possible weights based on the bound derived here; in Section 5.1, we investigate the optimal weights in detail.

As a further special case of weighted majority voting and a commonly used prediction rule in crowdsourcing, the majority voting (MV) rule uses the same weight for each worker. It can be formally expressed as $\hat y_j = \operatorname{argmax}_{k\in[L]} \sum_{i=1}^M I(Z_{ij} = k)$. We can then directly obtain the error rate bounds of MV under the Homogeneous Dawid-Skene model.

Corollary 6 (Majority voting under the Homogeneous Dawid-Skene model) For majority voting with task assignment based on a constant probability $q \in (0,1]$, and $\bar w = \frac{1}{M}\sum_{i=1}^M w_i$, if $\bar w > \frac{1}{L}$, then

$$\frac{1}{N}\sum_{j=1}^N P(\hat y_j \neq y_j) \le (L-1)\cdot\exp\left\{-\frac{1}{2}\left(\frac{L}{L-1}\right)^2 M q^2 \left(\bar w - \frac{1}{L}\right)^2\right\}. \tag{28}$$

Meanwhile, we have

$$\frac{1}{N}\sum_{j=1}^N P(\hat y_j \neq y_j) \le (L-1)\cdot\exp\left\{-\frac{1}{2}\left(\frac{L}{L-1}\right)^2 \frac{M q \left(\bar w - \frac{1}{L}\right)^2}{1 + \frac{L}{3(L-1)}\left(\bar w - \frac{1}{L}\right)}\right\}. \tag{29}$$

When $q < \frac{3}{4}$, the second upper bound (29) is tighter than the first one (28).

Proof To obtain the upper bounds, we apply Corollary 5 directly with $\nu_i = 1$; then $\|\nu\|_2 = \sqrt M$, and note that $\sum_{i=1}^M (L w_i - 1) = LM(\bar w - \frac{1}{L})$. Direct simplification gives the desired result. The only difference between the two bounds is that the exponent of the second equals that of the first divided by the factor $\alpha = q\left(1 + \frac{L}{3(L-1)}(\bar w - \frac{1}{L})\right)$. Since $\bar w \le 1$, we have $\frac{L}{3(L-1)}(\bar w - \frac{1}{L}) \le \frac{1}{3}$, so $\alpha < 1$ when $q < \frac{3}{4}$. Therefore $q < \frac{3}{4}$ implies that the second bound is tighter than the first.

Remark: (1) In real crowdsourcing applications, q is commonly small due to the limited size of the available crowd (Snow et al., 2008). Thus the second bound will likely be tighter than the first one in practice. (2) The lower bound and its conditions can be derived similarly; we omit them here since we are more interested in the "possibilities" of controlling the error rate to be small.
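The two bounds of Corollary 6 are easy to compare numerically. The sketch below is ours (hypothetical parameter values); it evaluates both bounds and illustrates that (29) is the smaller one when q < 3/4, e.g., for q = 0.3.

```python
import numpy as np

def mv_bounds(M, L, q, w_bar):
    """Upper bounds (28) and (29) on the mean error rate of majority voting."""
    gap = w_bar - 1.0 / L
    expo1 = 0.5 * (L / (L - 1))**2 * M * q**2 * gap**2
    expo2 = 0.5 * (L / (L - 1))**2 * M * q * gap**2 / (1 + L * gap / (3 * (L - 1)))
    return (L - 1) * np.exp(-expo1), (L - 1) * np.exp(-expo2)

b1, b2 = mv_bounds(M=31, L=3, q=0.3, w_bar=0.7)
print(b1, b2)   # with q = 0.3 < 3/4, the second bound is the smaller one
```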
Due to the importance of majority voting in the crowdsourcing community, we are also interested in asymptotic properties of the bound. The following corollary discusses the case $M \to \infty$, that is, when the number of workers who label items tends to infinity.

Corollary 7 (Majority voting in the asymptotic scenario) For the Homogeneous Dawid-Skene model with task assignment based on a constant probability $q \in (0,1]$, let $\hat y_j$ be the predicted label for the j-th item via the majority voting rule (7). For any $N \ge 1$:

(1) if $\lim_{M\to\infty} \bar w > \frac{1}{L}$, then $\frac{1}{N}\sum_{j=1}^N I(\hat y_j \neq y_j) \to 0$ in probability as $M \to \infty$;

(2) if $\lim_{M\to\infty} \bar w < \frac{1}{L}$, then $\frac{1}{N}\sum_{j=1}^N I(\hat y_j \neq y_j) \to 1$ in probability as $M \to \infty$;

(3) if $\lim_{M\to\infty} \bar w > \frac{1}{L}$, then $\hat y_j \to y_j$ in probability as $M \to \infty$, for all $j \in [N]$, i.e., majority voting is consistent.

Proof The setting of this corollary is the same as in Corollary 6, and $t_1 = \frac{Lq}{L-1}\sqrt M \left(\bar w - \frac{1}{L}\right)$. When $\lim_{M\to\infty}\bar w > \frac{1}{L}$, there exists $\eta > 0$ such that $\eta = \lim_{M\to\infty}\bar w - \frac{1}{L}$, which implies $\lim_{M\to\infty} t_1 = +\infty$, and hence $\lim_{M\to\infty}\frac{1}{N}\sum_{j=1}^N P(\hat y_j \neq y_j) = 0$ for any $N \ge 1$ by Corollary 6. That is, $\lim_{M\to\infty} P(\hat y_j \neq y_j) = \lim_{M\to\infty} P(I(\hat y_j \neq y_j) = 1) = 0$, which further implies $\lim_{M\to\infty} P(I(\hat y_j \neq y_j) = 0) = 1$. Note that for arbitrary $\epsilon > 0$ we have $P(I(\hat y_j \neq y_j) < \epsilon) \ge P(I(\hat y_j \neq y_j) = 0)$, so $\lim_{M\to\infty} P(I(\hat y_j \neq y_j) \ge \epsilon) = 0$, i.e., $I(\hat y_j \neq y_j) \to 0$ in probability as $M\to\infty$, for all $j \in [N]$. By the properties of convergence in probability, $\frac{1}{N}\sum_{j=1}^N I(\hat y_j \neq y_j) \to 0$ in probability as $M \to\infty$. The same argument proves (2). To prove (3), note that $P(\hat y_j - y_j \neq 0) = P(I(\hat y_j \neq y_j) = 1) \to 0$ as $M\to\infty$, which implies $\hat y_j \to y_j$ in probability as $M\to\infty$.

Remark: This corollary tells us that if the average quality of the workers (i.e., their accuracy in labeling items) in the worker population is better than random guessing, then all the items can be labeled correctly with arbitrarily high probability when enough workers are available. The consistency property of majority voting for a finite number of items assures us that, as long as enough reliable workers (better than random guessing) are available, we can achieve arbitrary accuracy even with a simple aggregation approach, namely majority voting.

The examples covered in this section do not require the estimation of parameters such as worker reliabilities. In the next section, we discuss maximum likelihood methods for inferring the parameters of crowdsourcing models, such as the celebrated EM (Expectation-Maximization) algorithm. We then illustrate how our main results can be applied to analyze an underlying method that the maximum likelihood methods approximate.

4.2 Error rate bounds for the Maximum A Posteriori rule

If we knew the label posterior distribution of each item as defined in (5), then the Bayes classifier would be

$$\hat y_j = \operatorname*{argmax}_{k\in[L]} \rho^{(j)}_k, \tag{30}$$

which is well known to be the optimal classifier (Duda et al., 2012). In reality, we do not know the true parameters of the model, so the true posterior remains unknown. One natural approach is to estimate the parameters of the model by maximum likelihood methods, then estimate the posterior of each label class for each item, and build a classifier based on that. This is usually called the Maximum A Posteriori (MAP) approach (Duda et al., 2012).
After the EM algorithm has estimated the parameters, the Maximum A Posteriori (MAP) rule, which predicts the label of an item as the one with the largest estimated posterior, can be applied. The prediction function of this rule is

$$\hat y_j = \operatorname*{argmax}_{k\in[L]} \hat\rho^{(j)}_k, \tag{31}$$

where $\hat\rho^{(j)}_k$ is the estimated posterior. If the MAP rule is applied after the parameters are learned by the EM algorithm, we call this method the EM-MAP rule, and sometimes simply refer to it as EM when there is no ambiguity in context. However, the EM algorithm is not guaranteed to converge to the global optimum. Thus, the estimated parameters might be biased away from the true parameters, and the estimated posterior might be far from the true one if the algorithm starts from a "bad" initialization. Moreover, it is generally hard to study the solution of the EM algorithm, and thus it is relatively difficult to obtain the error rate of the EM-MAP rule.

We therefore consider the oracle MAP rule, which assumes there is an oracle who knows the true parameters and uses the true posterior to predict labels. Hence the oracle MAP rule is the Bayes classifier (30), and its prediction function is $\hat y_j = \operatorname{argmax}_{k\in[L]} \rho^{(j)}_k$, where $\rho^{(j)}_k$ is the true posterior of $y_j = k$. Based on our empirical observations (Section 6), the EM-MAP rule approximates the oracle MAP rule well in performance when most of the workers are good (better than random guessing). In what follows, we provide an error rate bound for the oracle MAP rule, which will hopefully help us understand the EM-MAP rule better.

The following result gives error rate bounds for the oracle MAP rule; it can be derived straightforwardly from the main results in Section 3, since the oracle MAP rule is a decomposable rule as in (9).

Corollary 8 (Error rate bounds of the oracle MAP rule under the General Dawid-Skene model) Suppose there is an oracle that knows the true parameters $\Theta = \left\{\{P^{(i)}\}_{i\in[M]}, Q, \pi\right\}$, where $P^{(i)} = \left(p^{(i)}_{kh}\right)_{k,h\in[L]} \in (0,1]^{L\times L}$. The prediction function of the oracle MAP rule is $\hat y_j = \operatorname{argmax}_{k\in[L]} \rho^{(j)}_k$, where $\rho^{(j)}_k$ is the true posterior. All the error rate bounds in Theorems 1, 2 and 3 hold for the oracle MAP rule with $f_i(k,h) \doteq \log p^{(i)}_{kh}$ for all $k, h \in [L]$.
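Corollary 8 says the oracle MAP rule is decomposable with score functions $f_i(k,h) = \log p^{(i)}_{kh}$. Here is a sketch of that scoring under the General Dawid-Skene model; it is our illustration, where P is the stack of true confusion matrices that only an oracle would know.

```python
import numpy as np

def oracle_map(Z, P, pi):
    """Oracle MAP rule: maximize sum_i log p^(i)_{k, Z_ij} + log pi_k over k.

    Z  : M x N label matrix with 0 for missing entries
    P  : M x L x L stack of true confusion matrices, P[i, k, h] = p^(i)_{k h}
         (classes indexed from 0 here, labels in Z from 1)
    pi : length-L vector of class prevalences
    """
    M, N = Z.shape
    L = len(pi)
    s = np.tile(np.log(pi)[:, None], (1, N))      # shift constants a_k = log pi_k
    for i in range(M):
        for j in range(N):
            h = Z[i, j]
            if h != 0:                            # missing labels contribute nothing
                s[:, j] += np.log(P[i, :, h - 1])
    return s.argmax(axis=0) + 1
```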
As a special case of the General Dawid-Skene model, the Homogeneous Dawid-Skene model is relatively easy to visualize and simulate, so the results under this model are useful in simulation. The next corollary shows that the oracle MAP rule under the Homogeneous Dawid-Skene model is weighted majority voting with class-dependent shifts $\{a_k\}_{k\in[L]}$, where $a_k = \log \pi_k$. For simplicity, we assume that the true labels of the items are drawn from the uniform distribution (i.e., $\pi_k = \frac{1}{L}$, balanced classes).

Corollary 9 (The oracle MAP rule under the Homogeneous Dawid-Skene model) Suppose the task assignment is based on a constant probability $q \in (0,1]$, and the prevalences of the true labels are balanced. Then, under the Homogeneous Dawid-Skene model, the oracle MAP rule is a weighted majority voting rule:

$$\hat y_j = \operatorname*{argmax}_{k\in[L]} \sum_{i=1}^M \nu_i I(Z_{ij} = k) \quad \text{with} \quad \nu_i = \ln\frac{(L-1)w_i}{1-w_i}, \ \forall i \in [M]. \tag{32}$$

Let $\nu = (\nu_1, \nu_2, \dots, \nu_M)$. The mean error rate of the oracle MAP rule is upper bounded without any conditions on $t_1$; that is, for any $\{w_i\}_{i\in[M]} \in (0,1)^M$,

$$\frac{1}{N}\sum_{j=1}^N P(\hat y_j \neq y_j) \le (L-1)\cdot\min\left\{\exp\left(-\frac{t_1^2}{2}\right),\ \exp\left(-\frac{t_1^2}{2(\sigma^2 + c\,t_1/3)}\right)\right\},$$

where $t_1 = \frac{q}{(L-1)\|\nu\|_2}\sum_{i=1}^M \nu_i (L w_i - 1)$, $c = \frac{\|\nu\|_\infty}{\|\nu\|_2}$ and $\sigma^2 = q$.

The results in this section help us understand more about the practice of inferring the ground truth labels via maximum likelihood methods. The prominent EM-MAP rule approximates the oracle MAP rule: it estimates the parameters of the crowdsourcing model and thus the posterior distribution (5), and then applies the MAP rule to predict the labels of the items. A further study of the error rate bounds of the oracle MAP rule may help in designing better algorithms with performance on par with the EM-MAP rule. This is the focus of the next section.

5. Iterative weighted majority voting method

In this section, we first study the mean error rate bound of weighted majority voting, then minimize the bound to obtain the oracle bound-optimal rule, and present its connection to the oracle MAP rule. Based on the oracle bound-optimal rule, we propose an iterative weighted majority voting method with a performance guarantee on its one-step version.

5.1 The oracle bound-optimal rule and the oracle MAP rule

Here we explore the relationship between the oracle MAP rule and the mean error rate bound of WMV under the Homogeneous Dawid-Skene model. For simplicity, we assume the task assignment is based on a constant probability q, and we ignore the shift terms $\{a_k\}$ in the aggregation rules (this is the case for the oracle MAP rule when the label classes are balanced, Corollary 9).

The mean error rate bound of weighted majority voting (WMV) in Corollary 5 implies that if $t_1 \ge 0$, then $\frac{1}{N}\sum_{j=1}^N P(\hat y_j \neq y_j) \le (L-1)\cdot\min\left\{\exp\left(-\frac{t_1^2}{2}\right), \exp\left(-\frac{t_1^2}{2(\sigma^2+c\,t_1/3)}\right)\right\}$, where $\sigma^2 = q$ and $\frac{1}{\sqrt M} \le c \le 1$. Note that the impact of c on the bound is marginal compared to that of $t_1$, since c can be replaced by 1 while relaxing the bound only slightly. At the same time, both $\exp\left(-\frac{t_1^2}{2}\right)$ and $\exp\left(-\frac{t_1^2}{2(\sigma^2+c\,t_1/3)}\right)$ are monotonically decreasing in $t_1 \in [0,\infty)$, so the upper bound is also monotonically decreasing in $t_1 \in [0,\infty)$. The mean error rate is bounded from above under the condition $t_1 \ge 0$; therefore, maximizing $t_1$ increases the chance that $t_1 \ge 0$ is satisfied and reduces the bound. Recall that

$$t_1 = \frac{q}{L-1}\sum_{i=1}^M \frac{\nu_i}{\|\nu\|_2}(L w_i - 1).$$

Since q is fixed and we assume $t_1 \ge 0$, optimizing the upper bound is equivalent to maximizing $t_1$:

$$\nu^\star = \operatorname*{argmax}_{\nu\in\mathbb{R}^M} t_1 = \operatorname*{argmax}_{\nu\in\mathbb{R}^M} \frac{q}{L-1}\sum_{i=1}^M \frac{\nu_i}{\|\nu\|_2}(L w_i - 1) \ \Longrightarrow \ \text{the oracle bound-optimal rule: WMV with } \nu^\star_i \propto L w_i - 1. \tag{33}$$

Therefore a bound-optimal strategy is to choose the weights for WMV as in (33). This rule requires knowledge of the true parameters $\{w_i\}_{i\in[M]}$, which is why we call it the oracle bound-optimal rule. In practice, we can estimate the parameters and plug $\{\hat w_i\}_{i\in[M]}$ into (33), which we refer to as the bound-optimal rule.
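The closed forms make the comparison between the oracle bound-optimal weights (33) and the oracle MAP weights (32) a one-liner; the sketch below (ours, with hypothetical worker accuracies) also computes their first-order agreement around $w_i = 1/L$, per the Taylor expansion (34) just below.

```python
import numpy as np

w = np.linspace(0.05, 0.95, 10)   # hypothetical worker accuracies
L = 3
nu_bound_optimal = L * w - 1                       # oracle bound-optimal weights (33)
nu_oracle_map = np.log((L - 1) * w / (1 - w))      # oracle MAP weights (32)
nu_taylor = L / (L - 1) * (L * w - 1)              # first-order Taylor expansion (34)
print(np.abs(nu_oracle_map - nu_taylor).round(2))  # small near w = 1/L, larger at extremes
```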
By Corollary 9, the oracle MAP rule under the Homogeneous Dawid-Skene model is a weighted majority voting rule with weight

$$\nu^{\mathrm{oracMAP}}_i = \log\frac{(L-1)w_i}{1-w_i} \approx \frac{L}{L-1}(L w_i - 1).$$

The approximation is due to the Taylor expansion around $x = \frac{1}{L}$:

$$\ln\frac{(L-1)x}{1-x} = \frac{L}{L-1}(Lx - 1) + O\left(\left(x - \frac{1}{L}\right)^2\right). \tag{34}$$

Thus, the weight of the oracle bound-optimal rule is the first order Taylor expansion of the weight of the oracle MAP rule. A similar result and conclusion hold for the Class-Conditional Dawid-Skene model as well, but we omit them here for clarity. Since the oracle MAP rule is very close to the oracle bound-optimal rule, the oracle MAP rule approximately optimizes the upper bound on the mean error rate. This fact also indicates that our bound is meaningful, since the oracle MAP rule is the oracle Bayes classifier.

5.2 Iterative weighted majority voting with performance guarantee

Based on Section 5.1, the oracle bound-optimal rule chooses the weights $\nu_i \propto L w_i - 1$. With this strategy, given estimates of the $w_i$, we can put more weight on the "better" workers and downplay the "spammers" (those workers whose accuracy is close to random guessing). This strategy can potentially improve the performance of majority voting and result in a better estimate of $w_i$, which further improves the quality of the weights, and so on iteratively. This inspires us to design the iterative weighted majority voting (IWMV) method of Algorithm 1 (a code sketch follows below).

Algorithm 1: The iterative weighted majority voting algorithm (IWMV)
Input: number of workers M; number of items N; data matrix $Z \in [\bar L]^{M\times N}$.
Output: the predicted labels $\{\hat y_1, \hat y_2, \dots, \hat y_N\}$.
Initialization: $\nu_i = 1$ for all $i \in [M]$; $T_{ij} = I(Z_{ij} \neq 0)$ for all $i \in [M], j \in [N]$.
Repeat:
  $\hat y_j \leftarrow \operatorname{argmax}_{k\in[L]} \sum_{i=1}^M \nu_i I(Z_{ij} = k)$, for all $j \in [N]$;
  $\hat w_i \leftarrow \frac{\sum_{j=1}^N I(Z_{ij} = \hat y_j)}{\sum_{j=1}^N T_{ij}}$, for all $i \in [M]$;
  $\nu_i \leftarrow L\hat w_i - 1$, for all $i \in [M]$;
until convergence or S iterations are reached.
Output the predictions $\{\hat y_j\}_{j\in[N]}$ by $\hat y_j = \operatorname{argmax}_{k\in[L]} \sum_{i=1}^M \nu_i I(Z_{ij} = k)$.

The time complexity of this algorithm is O((M + L)N S), where S is the number of iterations. Empirically, the IWMV method converges fast. However, like EM, it can be trapped in local optima, and its error rate is generally hard to analyze. Nevertheless, in the next theorem we are able to obtain an error rate bound for a "naive" version of it, the one-step WMV (osWMV), which executes (Step 1) to (Step 3) exactly once:

(Step 1) Use majority voting to estimate labels, which are treated as the "gold standard", i.e., $\hat y^{\mathrm{MV}}_j = \operatorname{argmax}_{k\in[L]} \sum_{i=1}^M I(Z_{ij} = k)$.

(Step 2) Use the current "gold standard" to estimate the worker accuracies $\hat w_i = \frac{\sum_{j=1}^N I(Z_{ij} = \hat y^{\mathrm{MV}}_j)}{\sum_{j=1}^N I(Z_{ij} \neq 0)}$ for all i, and set $\nu_i = L\hat w_i - 1$ for all i.

(Step 3) Use the current weights $\nu$ in WMV to compute an updated "gold standard", i.e., $\hat y_j = \operatorname{argmax}_{k\in[L]} \sum_{i=1}^M \nu_i I(Z_{ij} = k)$.

For the succinctness of the result, we focus on the case L = 2, but the techniques used can be applied to the general case of L as well.
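For concreteness, here is a compact Python sketch of Algorithm 1 (our translation of the pseudocode above; the paper's own experiments used Matlab). With max_iter=1, one pass of the loop followed by the final prediction reproduces the one-step WMV just described.

```python
import numpy as np

def iwmv(Z, L, max_iter=50):
    """Iterative weighted majority voting (Algorithm 1).

    Z : M x N matrix of labels in {1, ..., L}, with 0 for missing entries.
    Returns the predicted labels (length-N array of values in 1..L).
    """
    M, N = Z.shape
    nu = np.ones(M)                               # initialize as plain majority voting
    n_labeled = np.maximum((Z != 0).sum(axis=1), 1)
    y_hat = None
    for _ in range(max_iter):
        # Weighted vote: score of class k on item j is sum_i nu_i * I(Z_ij = k).
        scores = np.stack([((Z == k) * nu[:, None]).sum(axis=0)
                           for k in range(1, L + 1)])
        y_new = scores.argmax(axis=0) + 1
        # Re-estimate worker accuracies against the current predicted labels.
        w_hat = ((Z == y_new) & (Z != 0)).sum(axis=1) / n_labeled
        nu = L * w_hat - 1                        # bound-optimal weights (33)
        if y_hat is not None and np.array_equal(y_new, y_hat):
            break                                 # labels stopped changing
        y_hat = y_new
    # Final prediction with the latest weights.
    scores = np.stack([((Z == k) * nu[:, None]).sum(axis=0) for k in range(1, L + 1)])
    return scores.argmax(axis=0) + 1
```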
Theorem 10 (Mean error rate bound of one-step WMV for binary labeling) Under the Homogeneous Dawid-Skene model with label sampling probability $q=1$ and $L=2$, let $\hat{y}^{\mathrm{wmv}}_j$ be the label predicted by one-step WMV for the $j$th item. If
$$\bar{w} \;\ge\; \frac{1}{2} + \frac{1}{M} + \sqrt{\frac{(M-1)\ln 2}{2M^2}},$$
then the mean error rate of one-step WMV satisfies
$$\frac{1}{N}\sum_{j=1}^{N} P\left(\hat{y}^{\mathrm{wmv}}_j \neq y_j\right) \;\le\; \exp\left(-\frac{8MN^2\tilde{\sigma}^4(1-\eta)^2}{M^2N + (M+N)^2}\right), \quad (35)$$
where $\tilde{\sigma} = \sqrt{\frac{1}{M}\sum_{i=1}^M \left(w_i - \frac{1}{2}\right)^2}$ and $\eta = 2\exp\left(-\frac{2M^2\left(\bar{w} - \frac{1}{2} - \frac{1}{M}\right)^2}{M-1}\right)$.

The proof of this theorem is deferred to Appendix B. The theorem is non-trivial to prove, since the dependency between the weights and the labels makes it hard to apply the concentration approach used in proving the previous results. Instead, a martingale-difference concentration bound has to be used.

Remarks:

1. Several important factors appear in the exponent of the bound: $\tilde{\sigma}$ represents how far the accuracies of the workers are from random guessing, and it is a constant smaller than 1; $\eta$ will be close to 0 for any reasonably large $M$.

2. The condition on $\bar{w}$ requires that $\bar{w} - \frac{1}{2}$ is $\Omega(M^{-0.5})$, which is easier to satisfy with large $M$ if the average accuracy in the crowd population is better than random guessing. This condition ensures that majority voting approximates the true labels. Thus, with more items labeled, we can obtain a better estimate of the workers' accuracies, and the performance of one-step WMV is then improved through better weights.

3. We now address how $M$ and $N$ affect the bound. First, when both $M$ and $N$ increase but $\frac{M}{N} = r$ is constant or decreasing, the error rate bound decreases. This makes sense because, as the number of items labeled per worker increases, $\hat{w}_i$ becomes more accurate and the weights approach those of the oracle bound-optimal rule. Second, when $M$ is fixed and $N$ increases, i.e., the number of labeled items increases, the upper bound on the error rate decreases. Third, when $N$ is fixed and $M$ increases, the bound decreases while $M < \sqrt{N}$ and then increases once $M$ exceeds $\sqrt{N}$. Intuitively, when $M$ is larger than $N$ and $M$ keeps increasing, the fluctuation of the score functions (in which $\hat{w}_i$ is the estimated accuracy of the $i$th worker) becomes large, which increases the chance of prediction errors. When $M$ is reasonably small compared with $N$ but increasing, i.e., more people label each item, the accuracy of majority voting improves according to Corollary 9, and the resulting gain in the accuracy of $\hat{w}_i$ brings the weights of one-step WMV closer to the oracle bound-optimal rule.
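The non-monotonic behavior in $M$ described in Remark 3 can be seen by evaluating the right-hand side of (35) directly; the snippet below is our illustration with arbitrary example values:

```python
import numpy as np

def os_wmv_bound(M, N, w):
    """Evaluate the one-step WMV bound (35); returns None when the condition
    on the average accuracy in Theorem 10 fails."""
    w = np.asarray(w, dtype=float)
    w_bar = w.mean()
    if w_bar < 0.5 + 1.0 / M + np.sqrt((M - 1) * np.log(2) / (2 * M**2)):
        return None
    sigma2 = np.mean((w - 0.5) ** 2)          # tilde-sigma squared
    eta = 2 * np.exp(-2 * M**2 * (w_bar - 0.5 - 1 / M) ** 2 / (M - 1))
    return np.exp(-8 * M * N**2 * sigma2**2 * (1 - eta) ** 2
                  / (M**2 * N + (M + N) ** 2))

# Example: fix N and a common accuracy, sweep M (all numbers are arbitrary).
# The bound first shrinks and then grows again once M is large relative to N.
N = 400
for M in (5, 10, 20, 40, 80, 160):
    print(M, os_wmv_bound(M, N, w=np.full(M, 0.8)))
```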
As an alternative way of assigning weights to the workers in each iteration of Algorithm 1, we can choose the weight of worker $i$ by plugging $\hat{w}_i$ into the weight of the oracle MAP rule, that is, $\nu'_i = \log\frac{(L-1)\hat{w}_i}{1-\hat{w}_i}$. We refer to this variant of IWMV as the IWMV.log algorithm. From a practical point of view, however, $\nu'_i$ is unbounded and becomes too large (or too small) when the estimate $\hat{w}_i$ is close to 1 (or 0). IWMV.log therefore weighs the workers aggressively, which may be too risky when the estimates $\{\hat{w}_i\}_{i\in[M]}$ are noisy. Recall that, given an estimate $\hat{w}_i$ of the reliability of worker $i$, the IWMV algorithm chooses the weight $\nu_i = L\hat{w}_i - 1$. As a linearized version of $\nu'_i$ (see (34)), $\nu_i$ is more stable against noise in the estimate $\hat{w}_i$. Furthermore, IWMV is more convenient for theoretical analysis than IWMV.log. In the next section, we compare IWMV and IWMV.log experimentally.

6. Experiments

In this section, we first compare the theoretical error rate bound with the error rate of the oracle MAP rule via simulation; meanwhile, we compare IWMV with EM and IWMV.log on synthetic data. We then experimentally test IWMV and compare it with state-of-the-art algorithms on real-world data. We implement majority voting and the EM algorithm (Raykar et al., 2010) with the MAP rule (also referred to as the EM-MAP rule in the experiments), and use publicly available code (http://www.ics.uci.edu/~qliu1/codes/crowd_tool.zip): the iterative algorithm in (Karger et al., 2011) is referred to as KOS, and the variational inference algorithm from (Liu et al., 2012) is referred to as LPI. All results are averaged over 100 random trials. All our experiments are implemented in Matlab 2012a and run on a PC with the Windows 7 operating system, an Intel Core i7-3740QM (2.70GHz) CPU and 8GB of memory.

6.1 Simulation

The error rate of a crowdsourcing system is affected by variations in parameters such as the number of workers $M$, the number of items $N$, and the worker reliabilities $\{w_i\}_{i\in[M]}$. To study how the error rate bound reflects the change in the error rate when a parameter of the system changes, we conduct numerical experiments on simulated data, comparing the oracle MAP rule with its error rate bound (Corollary 9). We also measure the performance of the IWMV algorithm and compare it with that of the oracle MAP rule.

The simulations are run under the Homogeneous Dawid-Skene model. Each worker has a $q = 30\%$ chance of labeling any item, and each item belongs to one of three classes ($L=3$). The ground truth labels of the items are generated uniformly. The accuracies of the workers (i.e., $\{w_i\}$) are sampled from a beta distribution $\mathrm{Beta}(a, b)$ with $b = 2$. Given an expected average worker accuracy $\bar{w}$, we choose the parameter $a = \frac{2\bar{w}}{1-\bar{w}}$ so that the expected value of the worker accuracies under $\mathrm{Beta}(a, 2)$ matches $\bar{w}$. In each random trial of the simulation, we keep sampling $M$ workers from this beta distribution until the average worker accuracy is within $\pm 0.01$ of the expected $\bar{w}$; this maintains the average worker accuracy at the same level across trials.
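A minimal sketch of this data-generating scheme (ours; the function names are illustrative) is:

```python
import numpy as np

def sample_workers(M, w_bar, tol=0.01, rng=None):
    """Draw M worker accuracies from Beta(a, 2) with mean w_bar, resampling
    until the realized average is within +/- tol of w_bar."""
    rng = np.random.default_rng() if rng is None else rng
    a = 2 * w_bar / (1 - w_bar)        # Beta(a, 2) has mean a / (a + 2) = w_bar
    while True:
        w = rng.beta(a, 2, size=M)
        if abs(w.mean() - w_bar) <= tol:
            return w

def simulate_labels(w, N, L, q, rng=None):
    """Homogeneous Dawid-Skene data: worker i labels each item w.p. q, gives
    the true label w.p. w_i, else a uniformly random wrong label; 0 = missing."""
    rng = np.random.default_rng() if rng is None else rng
    M = len(w)
    y = rng.integers(1, L + 1, size=N)                       # uniform ground truth
    Z = np.zeros((M, N), dtype=int)
    for i in range(M):
        labeled = rng.random(N) < q
        correct = rng.random(N) < w[i]
        wrong = 1 + (y - 1 + rng.integers(1, L, size=N)) % L  # some label != y
        Z[i, labeled] = np.where(correct, y, wrong)[labeled]
    return Z, y
```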
First, we fix $M = 31$ and $N = 200$, and vary the expected accuracy of the workers from 0.38 (slightly above the random-guess level $1/L$) to 1 with a step size of 0.05. The averaged error rates are displayed in Figure 4(a). Note that the error rate of the oracle MAP rule is bounded tightly by its mean error rate bound (see Figure 4(a)), and the bound follows the same trend as the true error rate of the oracle MAP rule. The performance of IWMV converges to that of the oracle MAP rule quickly as $\bar{w}$ increases.

Figure 4: Comparing the oracle MAP rule with its theoretical error rate bound by simulation; the performance of the IWMV algorithm is also imposed. These simulations are done under the Homogeneous Dawid-Skene model with $L=3$ and $q=0.3$. (a) Vary the average accuracy of workers with $M=31$ and $N=200$ fixed. (b) Vary $M$ with $N=200$ fixed. (c) Vary $N$ with $M=31$ fixed. The reliabilities of workers are sampled from $w_i \sim \mathrm{Beta}(2.3, 2)$, $\forall i\in[M]$, in (b) and (c). The panels plot $\log_{10}$ of the mean error rate against the average worker accuracy, $M$, and $N$, respectively; all results are averaged across 100 repetitions.

By fixing $a = 2.3$ and $q = 0.3$, we then vary one of two parameters, the number of items $N$ (default 200) and the number of workers $M$ (default 31), with the other held at its default. The corresponding results are presented in Figures 4(b) and 4(c), respectively. According to the simulation results, the error rate of the oracle MAP rule and its upper bound do not change as the number of tasks $N$ increases (Figure 4(c)), but they change log-linearly as the number of workers $M$ increases (Figure 4(b)). Nevertheless, the performance of IWMV changes whenever we increase $M$ or $N$. It behaves similarly to the oracle MAP rule when $M$ varies, but differently when $N$ increases. This is because, as the workers complete more and more tasks, the estimates of their reliabilities become more accurate, which boosts the performance of IWMV. The oracle MAP rule, in contrast, knows the true worker reliabilities from the start, so its performance is independent of $N$.

Our next simulation (Figure 5) shows that IWMV, its variant IWMV.log and the EM algorithm achieve the same final prediction accuracy, while IWMV has the lowest computational cost. Specifically, we vary the number of tasks $N$ and compare the final accuracy with one-standard-deviation error bars imposed (Figure 5(a)), the convergence time (Figure 5(b)), the number of iterations (i.e., steps) to converge (Figure 5(c)), and the total run time for 50 iterations (Figure 5(d)). With almost identical accuracies, IWMV converges faster and takes fewer steps to converge than EM and IWMV.log, and the run time of IWMV is markedly lower than that of EM. Similar conclusions hold when changing $M$, $q$ and $L$, so we omit those results here.

Figure 5: Comparison between IWMV, IWMV.log and the EM-MAP rule, with the number of items varying from 1000 to 11000. The simulation was performed with $L=3$, $M=31$, $q=0.3$ under the Homogeneous Dawid-Skene model; the reliabilities of workers were sampled from $w_i \sim \mathrm{Beta}(2.3, 2)$, $i\in[M]$. All results are averaged across 100 repetitions. (a) Final accuracy with error bars imposed. (b) Time until convergence (measuring it requires knowing the ground truth). (c) Number of steps to converge. (d) Total run time, computed over 50 iterations.
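Putting the sketches above together, one Monte Carlo trial of the Figure 4(a)-style experiment could look like the following (illustrative only; `sample_workers`, `simulate_labels` and `iwmv` are the hypothetical helpers sketched earlier, and we use fewer repetitions than the paper's 100):

```python
import numpy as np

# One sweep of the average worker accuracy, assuming the sketches
# sample_workers / simulate_labels / iwmv defined above.
rng = np.random.default_rng(0)
L, M, N, q = 3, 31, 200, 0.3

for w_bar in np.arange(0.38, 1.0, 0.05):
    errors = []
    for _ in range(20):                       # 20 trials instead of 100
        w = sample_workers(M, w_bar, rng=rng)
        Z, y = simulate_labels(w, N, L, q, rng=rng)
        y_hat = iwmv(Z, L)
        errors.append(np.mean(y_hat != y))
    print(f"w_bar={w_bar:.2f}  IWMV error={np.mean(errors):.4f}")
```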
The experiments above are strictly simulated from the Homogeneous Dawid-Skene worker model, chosen for its simplicity. To compare IWMV with EM when the worker model is violated, we simulated toy data. See Figure 6(a) for the setup: suppose there are two groups of workers, $G_1$ (20 workers) and $G_2$ (30 workers), and two sets of items, $S_1$ (60 items) and $S_2$ (40 items). The true labels of the items are generated uniformly from $\{\pm 1\}$. The data matrix is generated as follows: $P(Z_{ij}=y_j) = 0.9$ if $i\in G_1, j\in S_1$; $P(Z_{ij}=y_j) = 0.6$ if $i\in G_1, j\in S_2$; $P(Z_{ij}=y_j) = 0.5$ if $i\in G_2, j\in S_1$; and $P(Z_{ij}=y_j) = 0.7$ if $i\in G_2, j\in S_2$. We use $q = 0.3$. The error rates (with one standard deviation) of MV, EM, IWMV.log and IWMV are shown in Figure 6(b): MV $(10.3\pm 0.1)\%$, EM $(12.9\pm 0.1)\%$, IWMV.log $(9.7\pm 0.1)\%$ and IWMV $(9.4\pm 0.1)\%$. IWMV thus achieves a lower error rate than EM and IWMV.log in this model misspecification example, indicating that IWMV is, to some extent, more robust than EM under model misspecification. Similar results were obtained under other configurations of the type in Figure 6, and we omit them here.

Figure 6: (a) Setting of model misspecification. (b) Performance comparison under the model misspecification setting in (a).
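The toy data above are straightforward to regenerate; a sketch under the stated setup (ours, with the same group and item-set sizes) is:

```python
import numpy as np

def misspecified_data(rng=None, q=0.3):
    """Two worker groups x two item sets with different accuracies, as in
    Figure 6(a); labels are +/-1 and 0 marks a missing entry."""
    rng = np.random.default_rng() if rng is None else rng
    sizes_G, sizes_S = (20, 30), (60, 40)
    acc = np.array([[0.9, 0.6],        # accuracies of G1 on (S1, S2)
                    [0.5, 0.7]])       # accuracies of G2 on (S1, S2)
    M, N = sum(sizes_G), sum(sizes_S)
    group = np.repeat([0, 1], sizes_G)            # worker -> group index
    item_set = np.repeat([0, 1], sizes_S)         # item -> set index
    y = rng.choice([-1, 1], size=N)               # uniform true labels
    p = acc[group[:, None], item_set[None, :]]    # (M, N) accuracy matrix
    correct = rng.random((M, N)) < p
    Z = np.where(correct, y[None, :], -y[None, :])
    Z[rng.random((M, N)) >= q] = 0                # keep each label w.p. q
    return Z, y
```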
6.2 Real data

To compare our proposed iterative weighted majority voting with the state-of-the-art methods (Raykar et al., 2010; Karger et al., 2011; Liu et al., 2012), we conducted several experiments on real data (most of which are publicly available). We sampled the collected labels in the real data independently with a probability $\tilde{q}$ varying from 0.1 to 1, and observed how the error rate and run time change accordingly. For clarity of the figures produced in this section, we omit the results of IWMV.log, since its performance is usually worse than that of IWMV in terms of both accuracy and computational time. Our focus is the comparison among IWMV, EM, KOS and LPI.

Table 1: Summary of the datasets used in the real data experiments. $\bar{w}$ is the average worker accuracy.

| Dataset    | $L$ classes | $M$ workers | $N$ items | #labels | $\bar{w}$ |
|------------|-------------|-------------|-----------|---------|-----------|
| Duchenne   | 2           | 17          | 159       | 1221    | 65.0%     |
| RTE        | 2           | 164         | 800       | 8000    | 83.7%     |
| Temporal   | 2           | 76          | 462       | 4620    | 84.1%     |
| Web search | 5           | 177         | 2665      | 15539   | 37.1%     |

Duchenne dataset. The first dataset is from (Whitehill et al., 2009), on distinguishing Duchenne from non-Duchenne smiles in face images. In these data, 159 images are labeled with {Duchenne, non-Duchenne} by 17 different Mechanical Turk (https://www.mturk.com) workers. In total there are 1,221 labels; thus $1221/(17\times 159) = 45.2\%$ of the potential task assignments are done (i.e., 45.2% of the entries of the data matrix are observed). The ground truth labels were obtained from two certified experts, and 58 of the 159 images contain Duchenne smiles. Duchenne images are hard to identify: the average accuracy of the workers on this task is only 65%.

We conducted the experiments by sampling the labels independently with probability $\tilde{q}$ varying from 0.1 to 1, so the proportion of non-zero labels in the data matrix varies from 4.5% to 45.2%. Note that, under our setting, the task assignment probability corresponding to $\tilde{q}$ is $q = \tilde{q}\times 45.2\%$. After sampling the data matrix from the original Duchenne dataset with a given $\tilde{q}$, we then run IWMV, majority voting, the EM algorithm (Raykar et al., 2010), KOS (the iterative algorithm in (Karger et al., 2011)), and LPI (the variational inference algorithm from (Liu et al., 2012)). The entire process is repeated 100 times and the results averaged.

Figure 7: Duchenne smile dataset (Whitehill et al., 2009). (a) Error rate of the different algorithms as the number of available labels increases. (b) Run time comparison. (c) A visualization of both run time and error rate when 40.7% of the task assignments are done.

The comparison is shown in Figure 7(a). We can see that when very few labels are available, the performance of IWMV is as good as that of LPI, and these two generally dominate the other algorithms. With more labels available, the error rate of IWMV is around 2% lower than that of LPI (Figure 7(a)). At the same time, we compared the run time of each algorithm (Figure 7(b)): with more labels, the run time of LPI increases quickly (non-linearly), while IWMV maintains a lower run time than EM, KOS and LPI. For a better joint visualization of run time and error rate, Figure 7(c) shows a bar plot of the run times of IWMV, EM, KOS and LPI with their error rates imposed on top, for the case when 40.7% of the labels are available ($\tilde{q}=0.9$). IWMV is more than 100 times faster than LPI and achieves the lowest error rate among these algorithms (25.8%, versus 27.7% for LPI, 29.2% for KOS and 36.4% for EM).

An interesting phenomenon in Figure 7(a) is that EM performs poorly; it is even worse than MV. The major reason is that the worker reliabilities form a pattern similar to our model misspecification example in Figure 6(a): some workers are good at one set of images but bad at the complementary set, while for the other workers it is the reverse. This is a real-data example of model misspecification, and on these data IWMV is more robust to it than EM.

RTE dataset. The RTE data is a natural language processing dataset from (Snow et al., 2008). It was collected by asking workers to perform recognizing textual entailment (RTE) tasks: for each question, the worker is presented with two sentences and makes a binary choice of whether the second sentence can be inferred from the first.

Temporal event dataset. This is also a natural language processing dataset from (Snow et al., 2008). The task is to provide a label from {strictly before, strictly after} for event pairs, representing the temporal relation between them.
We conducted experiments on the RTE dataset and the temporal event dataset similar to those on the Duchenne dataset. The results on the RTE dataset are shown in Figure 8. Figure 8(a) shows the performance curves of the different algorithms as the percentage of task assignments done increases, and Figure 8(b) shows their run times; they confirm the observations from the Duchenne dataset: IWMV runs much faster than all the other algorithms except majority voting, and it performs on par with LPI, the state-of-the-art method (Figure 8(c)). For clarity, we show the performance comparison on the temporal event dataset in Figure 9(a) and omit the run time comparison.

Figure 8: RTE dataset (Snow et al., 2008). (a) Error rate of the different algorithms. (b) Run time as the percentage of task assignments done increases. (c) Run time comparison when 4.9% of the task assignments are done, with the error rate of each method imposed on top of its bar (IWMV 8.2%, EM 8.4%, KOS 49.6%, LPI 8.3%).

Figure 9: (a) Results on the temporal event dataset from (Snow et al., 2008). (b) Results on the Web search dataset (Zhou et al., 2012).

Web search dataset. In this dataset (Zhou et al., 2012), workers were asked to rate query-URL pairs on a relevance scale from 1 to 5. Each pair was labeled by around 6 workers, and around 3.3% of the entries of the data matrix are observed. The ground truth labels were collected via consensus from 9 experts. We treat the task as a multi-class labeling problem, so $L=5$. We conducted the experiment in a setting similar to that of the Duchenne dataset, sampling the labels with probability $\tilde{q}$ and varying it to plot the performance curve (Figure 9(b)). Since LPI and KOS are so far restricted to binary labeling, we only compared IWMV with the EM-MAP rule and majority voting. IWMV generally outperforms the EM-MAP rule and majority voting by at least 4%.

7. Conclusions

In this paper, we provided finite-sample bounds on the error rate (in probability and in expectation) of decomposable aggregation rules under the general Dawid-Skene crowdsourcing model. Optimizing the mean error rate bound under the Homogeneous Dawid-Skene model leads to an aggregation rule that is a good approximation to the oracle MAP rule. A data-driven iterative weighted majority voting method was proposed to approximate the oracle MAP rule, with a theoretical guarantee on the error rate of its one-step version.
Through simulations under the Homogeneous Dawid-Skene model (used for simplicity) and tests on real data, we have the following findings:

1. The error rate bound reflects the trends of the real error rate of the oracle MAP rule when important factors of the crowdsourcing system (such as $M$, $N$ and $\{w_i\}_{i\in[M]}$) change.

2. The IWMV algorithm is close to the oracle MAP rule, with superior performance in terms of error rate.

3. The iterative weighted majority voting method (IWMV) performs as well as the EM-MAP rule with much lower computational cost in simulation, and IWMV is more robust to model misspecification than EM.

4. On real data, IWMV achieved performance as good as or better than that of the state-of-the-art methods with much less computational time.

In practice, if we want to obtain error rate bounds for an aggregation rule of the decomposable form (9), we can proceed as in Section 4: (1) for the specific model and task assignment, compute the quantities $t_1$ and $t_2$ (and also $c$ and $\sigma^2$ if bounds on the mean error rate are of interest) according to the descriptions in Section 3.1; (2) compute the corresponding error rate bounds according to the theorems in Section 3, as sketched below for WMV. The quantities $t_1$ and $t_2$ tell us whether we can obtain an upper or a lower bound on the error rate, in probability and in expectation. Note that although the mean error rate bounds (Theorem 3) are stated as the minimum of two exponential bounds, either of the two can be used alone if convenience of theoretical analysis is a concern.
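As an illustration of this recipe (ours), the following evaluates the mean error rate bound for WMV under the Homogeneous Dawid-Skene model with a constant assignment probability, using the expressions for $t_1$, $c$ and $\sigma^2$ from Corollary 9; the numbers in the example are arbitrary:

```python
import numpy as np

def wmv_mean_error_bound(nu, w, L, q):
    """Mean error rate bound for WMV with weights nu under the homogeneous
    Dawid-Skene model; returns None when t1 < 0 and the upper bound does
    not apply."""
    nu, w = np.asarray(nu, float), np.asarray(w, float)
    norm2 = np.linalg.norm(nu)
    t1 = q * np.sum(nu * (L * w - 1)) / ((L - 1) * norm2)
    if t1 < 0:
        return None
    c = np.max(np.abs(nu)) / norm2      # ||nu||_inf / ||nu||_2
    sigma2 = q
    return (L - 1) * min(np.exp(-t1**2 / 2),
                         np.exp(-t1**2 / (2 * (sigma2 + c * t1 / 3))))

# Example with the oracle bound-optimal weights nu_i = L*w_i - 1:
w = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.45])
L, q = 2, 1.0
print(wmv_mean_error_bound(L * w - 1, w, L, q))
```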
To the best of our knowledge, this is the first extensive work on error rate bounds for general aggregation rules under the practical Dawid-Skene model for multi-class crowdsourced labeling. Our bounds are useful for explaining the effectiveness of different aggregation rules.

As a further direction for research, it would be interesting to obtain finite-sample error bounds for aggregation rules with random score functions, which depend on the data in a complicated manner. For example, the EM-MAP rule can be formulated as a weighted majority voting rule under the Homogeneous Dawid-Skene model; however, its weights are estimated by the EM algorithm (Raykar et al., 2010) and depend on the data in a complicated way, so the analysis of the EM algorithm is rather difficult. The IWMV and EM algorithms share a similar iterative nature, and IWMV is simpler than EM. An error rate analysis of the final prediction of IWMV would be helpful for understanding the behavior of the EM algorithm in the crowdsourcing context.

8. Acknowledgements

We thank Dengyong Zhou for his valuable suggestions and enlightening comments, which led to many improvements of this paper. We would like to thank Riddhipratim Basu and Qiang Liu for valuable discussions. We would also like to thank Terry Speed for his helpful comments and suggestions.

References

D. Angluin and P. Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.

Y. Bachrach, T. Graepel, T. Minka, and J. Guiver. How to grade a test without knowing the answers: a Bayesian graphical model for adaptive crowdsourcing and aptitude testing. In ICML, pages 1183–1190, New York, NY, USA, 2012.

J. A. Bilmes. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report, 1198(510), 1998.

X. Chen, Q. Lin, and D. Zhou. Optimistic knowledge gradient policy for optimal budget allocation in crowdsourcing. In ICML, 2013.

F. Chung and L. Liu. Old and New Concentration Inequalities, Chapter 2 in Complex Graphs and Networks. AMS, 2010. ISBN 0-8218-3657-9.

N. Dalvi, A. Dasgupta, R. Kumar, and V. Rastogi. Aggregating crowdsourced binary ratings. In WWW, 2013.

A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society, 28(1):20–28, 1979.

Marie Jean Antoine Nicolas de Caritat et al. Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix. L'imprimerie royale, 1785.

O. Dekel and O. Shamir. Good learners for evil teachers. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1–8, 2009. doi: 10.1145/1553374.1553404.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, 2012.

S. Ertekin, H. Hirsh, and C. Rudin. Approximating the wisdom of the crowd. In NIPS Workshop on Computational Social Science and the Wisdom of Crowds, pages 1–5, 2011.

C. Gao and D. Zhou. Minimax optimal convergence rates for estimating ground truth from crowdsourced labels. 2014.

C. Ho, S. Jabbari, and J. W. Vaughan. Adaptive task assignment for crowdsourced classification. In ICML, 2013.

W. Hoeffding. On the distribution of the number of successes in independent trials. The Annals of Mathematical Statistics, pages 713–721, 1956.

R. Jin and Z. Ghahramani. Learning with multiple labels. In NIPS, 2002.

D. R. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. In NIPS, 2011.

Q. Liu, J. Peng, and A. Ihler. Variational inference for crowdsourcing. In NIPS, 2012.

C. McDiarmid. Concentration. Technical Report, 1998. URL http://cgm.cs.mcgill.ca/breed/conc/colin.pdf.

N. Natarajan, I. Dhillon, P. Ravikumar, and A. Tewari. Learning with noisy labels. In NIPS, pages 1196–1204, 2013.

H. Q. Ngo. Tail and concentration inequalities. Lecture Notes, pages 1–6, 2011. URL http://www.cse.buffalo.edu/hungngo/classes/2011/Spring-694/lectures/l4.pdf.

V. C. Raykar, S. Yu, L. H. Zhao, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, 2010.

C. S. Sheng and F. Provost. Get another label? Improving data quality and data mining using multiple, noisy labelers. SIGKDD, pages 614–622, 2008.

P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labelling of Venus images. In NIPS, 1995.

R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast, but is it good? Evaluating non-expert annotations for natural language tasks. EMNLP, 2008.

P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. In NIPS, 2010.

J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: optimal integration of labels from labelers of unknown expertise. In NIPS, 2009.

Y. Yan, R. Rosales, G. Fung, M. Schmidt, G. Hermosillo, L. Bogoni, L. Moy, and J. G. Dy. Modeling annotator expertise: learning when everybody knows a bit of something. In ICML, volume 9, pages 932–939, 2010.

Y. Yan, R. Rosales, G. Fung, and J. G. Dy. Active learning from crowds. In ICML, 2011.

D. Zhou, J. Platt, S. Basu, and Y. Mao. Learning from the wisdom of crowds by minimax entropy. In NIPS, 2012.
Appendix A. Proofs of the main theorems

Since the proof of Theorem 1 requires other theorems of this paper, we do not present the proofs in the order in which the results appear. The order of the proofs will be: Proposition 11, Theorem 3, Theorem 1 and then Theorem 2. After proving these four main results, we also prove Corollary 8 and Corollary 9. Since the proof of Theorem 10 requires different techniques and more effort than the proofs in this section, we put it in a separate section, Appendix B.

Before presenting the proofs, we introduce some notation for simplicity and clarity. We abbreviate $f_i(k,h)$ as
$$\mu^{(i)}_{kh} \doteq f_i(k,h), \quad \forall k\in[L], h\in[L], \quad (36)$$
so each worker is associated with a vote matrix $\mu^{(i)} = (\mu^{(i)}_{kh})$, $k\in[L], h\in[L]$, where $\mu^{(i)}_{kh}$ is the voting score when worker $i$ labels item $j$, whose true label is $k$, as class $h$ ($h\neq 0$). The aggregation rule (9) is then equivalent to
$$\hat{y}_j = \mathop{\mathrm{argmax}}_{k\in[L]}\left(\sum_{i=1}^{M}\sum_{h=1}^{L}\mu^{(i)}_{kh} I(Z_{ij}=h) + a_k\right). \quad (37)$$
Note that
$$s^{(j)}_k = \sum_{i=1}^{M}\sum_{h=1}^{L}\mu^{(i)}_{kh} I(Z_{ij}=h) + a_k, \quad \forall k\in[L], j\in[N], \quad (38)$$
is the aggregated score of label class $k$ on the $j$th item, and the general aggregation rule is
$$\hat{y}_j = \mathop{\mathrm{argmax}}_{k\in[L]} s^{(j)}_k. \quad (39)$$
We will frequently work with probabilities, expectations and variances conditioned on the event $\{y_j=k\}$. For simplicity of notation, we define
$$P_k(\cdot) \doteq P(\cdot \mid y_j=k), \quad (40)$$
$$E_k[\cdot] \doteq E[\cdot \mid y_j=k], \quad (41)$$
$$\mathrm{Var}_k(\cdot) \doteq \mathrm{Var}(\cdot \mid y_j=k). \quad (42)$$
Note that
$$E_k\left[s^{(j)}_l\right] = \sum_{i=1}^{M}\sum_{h=1}^{L} q_{ij}\,\mu^{(i)}_{lh}\,p^{(i)}_{kh} + a_l, \quad \forall l,k\in[L]. \quad (43)$$
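For concreteness, the decomposable rule (37) can be written in a few lines (our sketch; `mu` stacks the vote matrices $\mu^{(i)}$ and `a` holds the shift terms):

```python
import numpy as np

def aggregate(Z, mu, a):
    """General decomposable aggregation rule (37).

    Z  : (M, N) labels in {0, ..., L}, 0 = missing
    mu : (M, L, L) array, mu[i, k-1, h-1] = voting score mu^(i)_{kh}
    a  : (L,) shift terms a_k
    Returns labels in {1, ..., L}.
    """
    M, N = Z.shape
    L = len(a)
    s = np.tile(np.asarray(a, float)[:, None], (1, N))  # s[k-1, j] starts at a_k
    for h in range(1, L + 1):
        mask = (Z == h).astype(float)                   # I(Z_ij = h), shape (M, N)
        s += mu[:, :, h - 1].T @ mask                   # add sum_i mu^(i)_{kh} I(Z_ij = h)
    return s.argmax(axis=0) + 1

# WMV is the special case mu[i] = nu_i * identity (with a = 0); the oracle MAP
# rule uses mu[i, k-1, h-1] = log p^(i)_{kh} and a_k = log pi_k (Corollary 8).
```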
A.1 Proof of Proposition 11: bounding the mean error rate of labeling each item

Proposition 11 (Bounding the mean error rate of labeling each item) Following the setting of Theorem 1, and with $\tau_{j,\min}$ and $\tau_{j,\max}$ defined as in (14), we have, for all $j\in[N]$:

(1) if $\tau_{j,\min} \ge 0$, then
$$P(\hat{y}_j \neq y_j) \le (L-1)\cdot\min\left\{\exp\left(-\frac{\tau_{j,\min}^2}{2}\right),\; \exp\left(-\frac{\tau_{j,\min}^2}{2(\sigma^2 + c\,\tau_{j,\min}/3)}\right)\right\};$$

(2) if $\tau_{j,\max} \le 0$, then
$$P(\hat{y}_j \neq y_j) \ge 1 - \min\left\{\exp\left(-\frac{\tau_{j,\max}^2}{2}\right),\; \exp\left(-\frac{\tau_{j,\max}^2}{2(\sigma^2 - c\,\tau_{j,\max}/3)}\right)\right\}.$$

Remark: Proposition 11 provides mean error rate bounds for labeling any specific item, and the bounds depend on the minimum and maximum of $\{\Lambda^{(j)}_{kl}\}_{k,l\in[L]}$. Note that the subscript $j$ enters only through the assignment distribution $q_{ij}$. If each worker has the same probability of labeling every item, say $q_i$, then we can drop the subscript $j$ from $\tau_{j,\min}$ and $\tau_{j,\max}$, which means the error rate bounds are the same for every item under that task assignment.

Proof. First, we expand the probability of labeling the $j$th item incorrectly in terms of conditional probabilities:
$$P(\hat{y}_j \neq y_j) = \sum_{k\in[L]} P(y_j=k)\,P(\hat{y}_j \neq k \mid y_j=k) = \sum_{k\in[L]}\pi_k\,P_k(\hat{y}_j\neq k). \quad (44)$$
Our major task in this proof is to bound the term $P_k(\hat{y}_j\neq k)$. The approach is based on the following relations between events:
$$\bigcup_{l\in[L], l\neq k}\left\{s^{(j)}_l > s^{(j)}_k\right\} \;\subseteq\; \{\hat{y}_j\neq k\} \;\subseteq\; \bigcup_{l\in[L], l\neq k}\left\{s^{(j)}_l \ge s^{(j)}_k\right\}. \quad (45)$$

(1) Assuming $\tau_{j,\min}\ge 0$, we want to establish the upper bound on $P(\hat{y}_j\neq y_j)$. Note that
$$P_k(\hat{y}_j\neq k) \le P_k\left(\bigcup_{l\neq k}\left\{s^{(j)}_l \ge s^{(j)}_k\right\}\right) \le \sum_{l\in[L], l\neq k} P_k\left(s^{(j)}_l \ge s^{(j)}_k\right). \quad (46)$$
With $s^{(j)}_l$ defined as in (38), $\Lambda^{(j)}_{kl}$ defined as in (13), and
$$\xi^{(i)}_{kl} \doteq \sum_{h=1}^{L}\left(\mu^{(i)}_{lh} - \mu^{(i)}_{kh}\right)I(Z_{ij}=h), \quad (47)$$
$$E_k\left[\xi^{(i)}_{kl}\right] = \sum_{h=1}^{L} q_{ij}\left(\mu^{(i)}_{lh} - \mu^{(i)}_{kh}\right)p^{(i)}_{kh}, \quad (48)$$
we have
$$P_k\left(s^{(j)}_l \ge s^{(j)}_k\right) = P_k\left(\sum_{i=1}^{M}\xi^{(i)}_{kl} \ge a_k - a_l\right) = P_k\left(\sum_{i=1}^{M}\xi^{(i)}_{kl} - \sum_{i=1}^{M}E_k\left[\xi^{(i)}_{kl}\right] \ge \Lambda^{(j)}_{kl}\right). \quad (49)$$
Note that $\{\xi^{(i)}_{kl}\}_{i\in[M]}$ are conditionally independent given $\{y_j=k\}$, and they are bounded because the voting scores $\{\mu^{(i)}_{kh}\}$ are bounded. Therefore, we can apply the Hoeffding concentration inequality (Hoeffding, 1956) to bound $P_k(s^{(j)}_l \ge s^{(j)}_k)$ further. We have
$$\min_{l,k,h\in[L], k\neq l}\left\{\mu^{(i)}_{lh}-\mu^{(i)}_{kh}\right\} \le \xi^{(i)}_{kl} \le \max_{l,k,h\in[L], k\neq l}\left\{\mu^{(i)}_{lh}-\mu^{(i)}_{kh}\right\},$$
and
$$\sum_{i=1}^{M}\left[\max_{l,k,h\in[L],k\neq l}\left\{\mu^{(i)}_{lh}-\mu^{(i)}_{kh}\right\} - \min_{l,k,h\in[L],k\neq l}\left\{\mu^{(i)}_{lh}-\mu^{(i)}_{kh}\right\}\right]^2 \le \sum_{i=1}^{M}\left[2\max_{l,k,h\in[L],k\neq l}\left|\mu^{(i)}_{lh}-\mu^{(i)}_{kh}\right|\right]^2 = 4\Gamma^2. \quad (50)$$
When $\Lambda^{(j)}_{kl} \ge \tau_{j,\min}\cdot\Gamma \ge 0$, applying the Hoeffding inequality to (49) gives
$$P_k\left(s^{(j)}_l \ge s^{(j)}_k\right) \le \exp\left(-\frac{\left(\Lambda^{(j)}_{kl}\right)^2}{2\Gamma^2}\right) \le \exp\left(-\frac{\tau_{j,\min}^2}{2}\right),$$
where the first inequality uses (50) and the second uses the definition of $\tau_{j,\min}$. The right-hand side does not depend on $k$, $l$ or $i$, so
$$P_k(\hat{y}_j\neq k) \le \sum_{l\neq k} P_k\left(s^{(j)}_l \ge s^{(j)}_k\right) \le (L-1)\exp\left(-\frac{\tau_{j,\min}^2}{2}\right),$$
and, since this does not depend on $k$,
$$P(\hat{y}_j\neq y_j) = \sum_{k\in[L]}\pi_k P_k(\hat{y}_j\neq k) \le (L-1)\exp\left(-\frac{\tau_{j,\min}^2}{2}\right)\sum_{k\in[L]}\pi_k = (L-1)\exp\left(-\frac{\tau_{j,\min}^2}{2}\right). \quad (51)$$
The Hoeffding inequality does not take the variance of the independent random variables into account, so a stronger concentration inequality applies when the fluctuation of $\xi^{(i)}_{kl}$ is available. Recall that $c$ and $\sigma^2$ are defined as
$$c = \frac{1}{\Gamma}\cdot\max_{i\in[M],\,k,l,h\in[L],\,k\neq l}\left|\mu^{(i)}_{kh}-\mu^{(i)}_{lh}\right|, \qquad \sigma^2 = \frac{1}{\Gamma^2}\cdot\max_{j\in[N]}\max_{k,l\in[L],k\neq l}\sum_{i=1}^{M}\sum_{h=1}^{L} q_{ij}\left(\mu^{(i)}_{kh}-\mu^{(i)}_{lh}\right)^2 p^{(i)}_{kh}.$$
The sum of the second moments of the $\xi^{(i)}_{kl}$ can be bounded as
$$\sum_{i=1}^{M}E_k\left[\left(\xi^{(i)}_{kl}\right)^2\right] = \sum_{i=1}^{M}\sum_{h=1}^{L} q_{ij}\left(\mu^{(i)}_{kh}-\mu^{(i)}_{lh}\right)^2 p^{(i)}_{kh} \le \sigma^2\Gamma^2,$$
and $\xi^{(i)}_{kl}$ itself is bounded: $|\xi^{(i)}_{kl}| \le \max_{i\in[M],h\in[L]}|\mu^{(i)}_{lh}-\mu^{(i)}_{kh}| = c\,\Gamma$. Applying a Bernstein-type concentration inequality ((Chung and Liu, 2010), Theorem 2.8), given that $\Lambda^{(j)}_{kl} \ge \tau_{j,\min}\Gamma \ge 0$, yields
$$P_k\left(s^{(j)}_l \ge s^{(j)}_k\right) \le \exp\left(-\frac{\left(\Lambda^{(j)}_{kl}\right)^2}{2\left(\sigma^2\Gamma^2 + c\,\Gamma\,\Lambda^{(j)}_{kl}/3\right)}\right) \le \exp\left(-\frac{\tau_{j,\min}^2}{2(\sigma^2 + c\,\tau_{j,\min}/3)}\right), \quad (52)$$
where the last step uses $\Lambda^{(j)}_{kl}/\Gamma \ge \tau_{j,\min} \ge 0$, and the right-hand side does not depend on $k$ or $l$. Then
$$P_k(\hat{y}_j\neq k) \le (L-1)\exp\left(-\frac{\tau_{j,\min}^2}{2(\sigma^2 + c\,\tau_{j,\min}/3)}\right),$$
and furthermore
$$P(\hat{y}_j\neq y_j) = \sum_{k\in[L]}\pi_k P_k(\hat{y}_j\neq k) \le (L-1)\exp\left(-\frac{\tau_{j,\min}^2}{2(\sigma^2 + c\,\tau_{j,\min}/3)}\right). \quad (53)$$
Combining inequalities (51) and (53) gives the desired result in Proposition 11.(1).

(2) Assuming $\tau_{j,\max} \le 0$, we want to establish the lower bound on $P(\hat{y}_j\neq y_j)$. Using the same argument as in (1), we lower-bound $P_k(\hat{y}_j\neq k)$:
$$P_k(\hat{y}_j\neq k) \ge P_k\left(\bigcup_{l\neq k}\left\{s^{(j)}_l > s^{(j)}_k\right\}\right) \ge \max_{l\neq k} P_k\left(s^{(j)}_l > s^{(j)}_k\right) = 1 - \min_{l\neq k} P_k\left(s^{(j)}_l \le s^{(j)}_k\right).$$
Given $\Lambda^{(j)}_{kl} \le \tau_{j,\max}\cdot\Gamma \le 0$, applying the Hoeffding and Bernstein inequalities as in (1) yields
$$P_k\left(s^{(j)}_l \le s^{(j)}_k\right) \le \exp\left(-\frac{\left(\Lambda^{(j)}_{kl}\right)^2}{2\Gamma^2}\right) \le \exp\left(-\frac{\tau_{j,\max}^2}{2}\right)$$
and
$$P_k\left(s^{(j)}_l \le s^{(j)}_k\right) \le \exp\left(-\frac{\left(\Lambda^{(j)}_{kl}\right)^2}{2\left(\sigma^2\Gamma^2 - c\,\Gamma\,\Lambda^{(j)}_{kl}/3\right)}\right) \le \exp\left(-\frac{\tau_{j,\max}^2}{2(\sigma^2 - c\,\tau_{j,\max}/3)}\right).$$
Since the right-hand sides of the two inequalities do not depend on $k$ or $l$,
$$P_k(\hat{y}_j\neq k) \ge 1 - \min\left\{\exp\left(-\frac{\tau_{j,\max}^2}{2}\right),\; \exp\left(-\frac{\tau_{j,\max}^2}{2(\sigma^2 - c\,\tau_{j,\max}/3)}\right)\right\},$$
and with (44),
$$P(\hat{y}_j\neq y_j) \ge 1 - \min\left\{\exp\left(-\frac{\tau_{j,\max}^2}{2}\right),\; \exp\left(-\frac{\tau_{j,\max}^2}{2(\sigma^2 - c\,\tau_{j,\max}/3)}\right)\right\}.$$

A.2 Proof of Theorem 3

Proof. Given $\sigma^2\ge 0$ and $c>0$, both functions $\exp(-t^2/2)$ and $\exp\big(-\frac{t^2}{2(\sigma^2+ct/3)}\big)$ are monotonically decreasing on $t\in[0,\infty)$, while both $\exp(-t^2/2)$ and $\exp\big(-\frac{t^2}{2(\sigma^2-ct/3)}\big)$ are monotonically increasing on $t\in(-\infty,0]$. Given $t_1\ge 0$, we have $\tau_{j,\min}\ge t_1\ge 0$, so by Proposition 11,
$$\frac{1}{N}\sum_{j=1}^{N}P(\hat{y}_j\neq y_j) \le \frac{L-1}{N}\sum_{j=1}^{N}\min\left\{\exp\left(-\frac{\tau_{j,\min}^2}{2}\right), \exp\left(-\frac{\tau_{j,\min}^2}{2(\sigma^2+c\tau_{j,\min}/3)}\right)\right\} \le (L-1)\min\left\{\exp\left(-\frac{t_1^2}{2}\right), \exp\left(-\frac{t_1^2}{2(\sigma^2+c t_1/3)}\right)\right\}.$$
Thus we have proved Theorem 3.(1). The same argument proves Theorem 3.(2).

A.3 Proof of Theorem 1

So far we have bounded the mean error rate, but we need one more tool to bound the error rate with high probability in the practical case. The following lemma is a form of the Bernstein-Chernoff-Hoeffding theorem (Ngo, 2011).

Lemma 12 (Bernstein-Chernoff-Hoeffding) Let $\xi_i\in[0,1]$ be independent random variables with $E\xi_i = p_i$, $i\in[n]$. Let $\bar{\xi} = \frac{1}{n}\sum_{i=1}^n\xi_i$ and $\bar{p} = \frac{1}{n}\sum_{i=1}^n p_i$. Then:
(1) for any $m$ such that $\bar{p} \le \frac{m}{n} < 1$, $P(\bar{\xi} > m/n) \le e^{-n D(m/n\,\|\,\bar{p})}$;
(2) for any $m$ such that $0 < \frac{m}{n} \le \bar{p}$, $P(\bar{\xi} < m/n) \le e^{-n D(m/n\,\|\,\bar{p})}$.

The proof of Theorem 1 is as follows.

Proof (Theorem 1). (1) Let $\mu = \frac{1}{N}\sum_{j=1}^N P(\hat{y}_j\neq y_j)$. By Theorem 3.(1), $\mu \le (L-1)e^{-t_1^2/2} = (L-1)\phi(t_1)$. Assume $t_1 \ge \sqrt{2\ln\frac{L-1}{\epsilon}}$; then $(L-1)\phi(t_1)\le\epsilon$, which gives $0\le\mu\le(L-1)\phi(t_1)\le\epsilon$. By the Bernstein-Chernoff-Hoeffding theorem, i.e., Lemma 12,
$$P\left(\frac{1}{N}\sum_{j=1}^N I(\hat{y}_j\neq y_j) > \epsilon\right) \le e^{-N D(\epsilon\|\mu)} \le e^{-N D(\epsilon\|(L-1)\phi(t_1))}.$$
Therefore,
$$P\left(\frac{1}{N}\sum_{j=1}^N I(\hat{y}_j\neq y_j) \le \epsilon\right) \ge 1 - e^{-N D(\epsilon\|(L-1)\phi(t_1))}.$$
Proof of Theorem 1 (2). By the same argument as above, assuming $t_2 \le -\sqrt{2\ln\frac{1}{1-\epsilon}}$, we have $1 \ge \mu \ge 1-\phi(t_2) \ge \epsilon$, which gives
$$P\left(\frac{1}{N}\sum_{j=1}^N I(\hat{y}_j\neq y_j) \ge \epsilon\right) \ge 1 - e^{-N D(\epsilon\|1-\phi(t_2))}.$$
Thus we have proved Theorem 1.

A.4 Proof of Theorem 2

Before proving Theorem 2, we prove an important lemma for bounding the average of a group of independent Bernoulli random variables. The proof relies on Hoeffding bounds and the Bernstein-Chernoff-Hoeffding theorem (Ngo, 2011).

Lemma 13 Suppose $\xi_j \sim \mathrm{Bernoulli}(p_j)$ with $p_j\in(0,1)$ for all $j\in[N]$, and the $\xi_j$ are independent of each other. Let $\bar{\xi} = \frac{1}{N}\sum_{j=1}^N\xi_j$ and $\bar{p} = E\bar{\xi} = \frac{1}{N}\sum_{j=1}^N p_j$. Given any $\epsilon,\delta\in(0,1)$:

(1) If $0 < \bar{p} \le \frac{1}{1+\exp\left(\frac{1}{\epsilon}\left[H_e(\epsilon)+\frac{1}{N}\ln\frac{1}{\delta}\right]\right)}$, then $P(\bar{\xi}\le\epsilon) \ge 1-\delta$.

(2) If $\frac{1}{1+\exp\left(-\frac{1}{1-\epsilon}\left[H_e(\epsilon)+\frac{1}{N}\ln\frac{1}{\delta}\right]\right)} \le \bar{p} < 1$, then $P(\bar{\xi}\le\epsilon) < \delta$.

Proof. For simplicity, define $A = H_e(\epsilon)+\frac{1}{N}\ln\frac{1}{\delta}$.

Proof of Lemma 13.(1): assume $0 < \bar{p} \le \frac{1}{1+\exp(A/\epsilon)}$. We finish the proof in several steps.

Step 1: we show $\bar{p} < \epsilon$. Indeed,
$$\exp\left(\frac{A}{\epsilon}\right) = \exp\left(\ln\frac{1}{\epsilon} + \frac{1-\epsilon}{\epsilon}\ln\frac{1}{1-\epsilon} + \frac{1}{N\epsilon}\ln\frac{1}{\delta}\right) > \exp\left(\ln\frac{1}{\epsilon}\right) = \frac{1}{\epsilon} \quad (\because \epsilon,\delta\in(0,1),\, N>0),$$
so $1+\exp(A/\epsilon) > 1/\epsilon$ and hence $\frac{1}{1+\exp(A/\epsilon)} < \epsilon$. Since $0 < \bar{p} \le \frac{1}{1+\exp(A/\epsilon)}$, we have $\bar{p} < \epsilon$.

Step 2: we show $P(\bar{\xi}\le\epsilon) \ge 1 - e^{-N D(\epsilon\|\bar{p})}$. The Bernstein-Chernoff-Hoeffding theorem (Ngo, 2011; McDiarmid, 1998) gives:
if $0 < \bar{p} \le \epsilon$, then $P(\bar{\xi}\le\epsilon) \ge 1 - e^{-N D(\epsilon\|\bar{p})}$; (54)
if $\epsilon \le \bar{p} < 1$, then $P(\bar{\xi}\le\epsilon) \le e^{-N D(\epsilon\|\bar{p})}$. (55)
Since Step 1 shows $\bar{p} < \epsilon$, the desired inequality follows from (54).

Step 3: we show $e^{-N D(\epsilon\|\bar{p})} \le \delta$. Note that
$$e^{-N D(\epsilon\|\bar{p})} \le \delta \iff D(\epsilon\|\bar{p}) \ge \frac{1}{N}\ln\frac{1}{\delta} \iff \bar{p}^{\epsilon}(1-\bar{p})^{1-\epsilon} \le \exp\left(-\left[H_e(\epsilon)+\frac{1}{N}\ln\frac{1}{\delta}\right]\right) = e^{-A}. \quad (56)$$
From the assumed condition, $\bar{p} \le \frac{1}{1+\exp(A/\epsilon)}$ implies $\left(\frac{\bar{p}}{1-\bar{p}}\right)^{\epsilon} \le e^{-A}$. Since
$$\bar{p}^{\epsilon}(1-\bar{p})^{1-\epsilon} = \left(\frac{\bar{p}}{1-\bar{p}}\right)^{\epsilon}(1-\bar{p}) < \left(\frac{\bar{p}}{1-\bar{p}}\right)^{\epsilon} \quad (\because 1-\bar{p}<1),$$
inequality (56) holds, and hence $e^{-N D(\epsilon\|\bar{p})}\le\delta$.

By Steps 2 and 3, if $\bar{p} \le \frac{1}{1+e^{A/\epsilon}}$ then $P(\bar{\xi}\le\epsilon) \ge 1-\delta$, which is the result we want.

Proof of Lemma 13.(2): assume $\frac{1}{1+\exp\left(-\frac{A}{1-\epsilon}\right)} \le \bar{p} < 1$. We again proceed in several steps.

Step 1: we show $\bar{p} > \epsilon$, by proving
$$\frac{1}{1+\exp\left(-\frac{A}{1-\epsilon}\right)} > \epsilon \iff \exp\left(-\frac{A}{1-\epsilon}\right) < \frac{1-\epsilon}{\epsilon} \iff A > (1-\epsilon)\ln\frac{\epsilon}{1-\epsilon} \iff \ln\frac{1}{\epsilon} + \frac{1}{N}\ln\frac{1}{\delta} > 0,$$
where the last equivalence follows by expanding $A = \epsilon\ln\frac{1}{\epsilon} + (1-\epsilon)\ln\frac{1}{1-\epsilon} + \frac{1}{N}\ln\frac{1}{\delta}$; the final inequality is of course true since $\epsilon,\delta\in(0,1)$. Therefore $\bar{p} > \epsilon$.

Step 2: we show $P(\bar{\xi}\le\epsilon) \le e^{-N D(\epsilon\|\bar{p})}$. This follows directly from the Bernstein-Chernoff-Hoeffding theorem, since $\epsilon < \bar{p} = E\bar{\xi}$ and the $\xi_j\sim\mathrm{Bernoulli}(p_j)$ are independent.

Step 3: we show $e^{-N D(\epsilon\|\bar{p})} < \delta$. Note that
$$e^{-N D(\epsilon\|\bar{p})} < \delta \iff D(\epsilon\|\bar{p}) > \frac{1}{N}\ln\frac{1}{\delta} \iff \bar{p}^{\epsilon}(1-\bar{p})^{1-\epsilon} < e^{-A}. \quad (57)$$
From the assumed condition,
$$\bar{p} \ge \frac{1}{1+\exp\left(-\frac{A}{1-\epsilon}\right)} \implies \frac{1-\bar{p}}{\bar{p}} \le e^{-\frac{A}{1-\epsilon}} \implies \left(\frac{1-\bar{p}}{\bar{p}}\right)^{1-\epsilon} \le e^{-A}, \quad (58)$$
and note that
$$\bar{p}^{\epsilon}(1-\bar{p})^{1-\epsilon} = \left(\frac{1-\bar{p}}{\bar{p}}\right)^{1-\epsilon}\cdot\bar{p} < \left(\frac{1-\bar{p}}{\bar{p}}\right)^{1-\epsilon} \quad (\because \bar{p}<1). \quad (59)$$
Combining inequalities (58) and (59) proves inequality (57).
Thus we obtain $e^{-N D(\epsilon\|\bar{p})} < \delta$. Finally, by Steps 2 and 3: if $\bar{p} \ge \frac{1}{1+\exp\left(-\frac{1}{1-\epsilon}\left(H_e(\epsilon)+\frac{1}{N}\ln\frac{1}{\delta}\right)\right)}$, then $P(\bar{\xi}\le\epsilon) < \delta$.

We now prove Theorem 2 using the results obtained in Lemma 13.

Proof of Theorem 2. Let $\zeta_j = I(\hat{y}_j\neq y_j) \sim \mathrm{Bernoulli}(1-\theta_j)$ and let $\bar{p} = E\bar{\zeta} = \frac{1}{N}\sum_{j=1}^N E\zeta_j = 1-\bar{\theta}$.

Proof of Theorem 2.(1): assume $t_1 \ge \sqrt{2\ln[(L-1)C(\epsilon,\delta)]}$, where $C(\epsilon,\delta) = 1+\exp\left(\frac{1}{\epsilon}\left[H_e(\epsilon)+\frac{1}{N}\ln\frac{1}{\delta}\right]\right)$; then $t_1\ge 0$. By Theorem 3,
$$\bar{\theta} = 1 - \frac{1}{N}\sum_{j=1}^N P(\hat{y}_j\neq y_j) \ge 1 - (L-1)e^{-t_1^2/2}. \quad (60)$$
Let $A = H_e(\epsilon)+\frac{1}{N}\ln\frac{1}{\delta}$; then
$$t_1 \ge \sqrt{2\ln[(L-1)(1+\exp(A/\epsilon))]} \implies (L-1)\exp\left(-\frac{t_1^2}{2}\right) \le \frac{1}{1+\exp(A/\epsilon)} \implies \bar{\theta} \ge 1 - \frac{1}{1+\exp(A/\epsilon)} \;(\because (60)) \implies 1-\bar{\theta} \le \frac{1}{1+\exp(A/\epsilon)}. \quad (61)$$
By inequality (61) and Lemma 13, $P(\bar{\zeta}\le\epsilon) \ge 1-\delta$, which is to say
$$P\left(\frac{1}{N}\sum_{j=1}^N I(\hat{y}_j\neq y_j) \le \epsilon\right) \ge 1-\delta.$$
Therefore we have proved (1).

Proof of Theorem 2.(2): assume $t_2 \le -\sqrt{2\ln C(1-\epsilon,\delta)} \le 0$, where $C(1-\epsilon,\delta) = 1+\exp\left(\frac{1}{1-\epsilon}\left[H_e(\epsilon)+\frac{1}{N}\ln\frac{1}{\delta}\right]\right)$. Then, by Theorem 3.(2),
$$\bar{\theta} = \frac{1}{N}\sum_{j=1}^N P(\hat{y}_j = y_j) \le \exp\left(-\frac{t_2^2}{2}\right). \quad (62)$$
From the condition in (2),
$$t_2 \le -\sqrt{2\ln\left(1+\exp\left(\frac{A}{1-\epsilon}\right)\right)} \implies \exp\left(-\frac{t_2^2}{2}\right) \le \frac{1}{1+\exp\left(\frac{A}{1-\epsilon}\right)} \implies 1-\bar{\theta} \ge 1 - \frac{1}{1+\exp\left(\frac{A}{1-\epsilon}\right)} = \frac{1}{1+\exp\left(-\frac{A}{1-\epsilon}\right)}.$$
By Lemma 13.(2), $P(\bar{\zeta}\le\epsilon) < \delta$, which implies the desired result.

A.5 Proof of Corollary 8 (error rate bounds of the oracle MAP rule)

Proof. The posterior distribution is
$$\rho^{(j)}_k = \frac{\pi_k\,\eta^{(j)}_k}{\sum_{l=1}^L \pi_l\,\eta^{(j)}_l}, \quad \forall j\in[N], k\in[L], \qquad\text{where}\qquad \eta^{(j)}_k = \prod_{i=1}^M\prod_{h=1}^L\left(p^{(i)}_{kh}\right)^{I(Z_{ij}=h)}.$$
For the oracle MAP classifier,
$$\hat{y}^{\mathrm{oracle}}_j = \mathop{\mathrm{argmax}}_{k\in[L]}\rho^{(j)}_k = \mathop{\mathrm{argmax}}_{k\in[L]}\pi_k\eta^{(j)}_k = \mathop{\mathrm{argmax}}_{k\in[L]}\left(\log\eta^{(j)}_k + \log\pi_k\right) = \mathop{\mathrm{argmax}}_{k\in[L]}\left(\sum_{i=1}^M\sum_{h=1}^L\log p^{(i)}_{kh}\, I(Z_{ij}=h) + \log\pi_k\right) = \mathop{\mathrm{argmax}}_{k\in[L]}\left(\sum_{i=1}^M\sum_{h=1}^L\mu^{(i)}_{kh} I(Z_{ij}=h) + a_k\right),$$
where $\mu^{(i)}_{kh} = \log p^{(i)}_{kh}$ and $a_k = \log\pi_k$. Therefore the oracle MAP rule is an instance of the general aggregation rule (37), and all the error rate bounds of Section 3 hold for it.

A.6 Proof of Corollary 9 (the oracle MAP rule under the Homogeneous Dawid-Skene model)

Proof. The Homogeneous Dawid-Skene model is the special case of the general Dawid-Skene model in which $p^{(i)}_{kk} = w_i$ and $p^{(i)}_{kh} = \frac{1-w_i}{L-1}$ for all $k,h\in[L]$, $k\neq h$. By Corollary 8, we can replace $\mu^{(i)}_{kk}$ with $\log w_i$ and $\mu^{(i)}_{kh}$ ($k\neq h$) with $\log\frac{1-w_i}{L-1}$, so that
$$\hat{y}_j = \mathop{\mathrm{argmax}}_{k\in[L]}\left(\sum_{i=1}^M\sum_{h=1}^L\mu^{(i)}_{kh}I(Z_{ij}=h) + \log\frac{1}{L}\right) = \mathop{\mathrm{argmax}}_{k\in[L]}\sum_{i=1}^M\left[I(Z_{ij}=k)\log w_i + I(Z_{ij}\neq k, Z_{ij}\neq 0)\log\frac{1-w_i}{L-1}\right]$$
$$= \mathop{\mathrm{argmax}}_{k\in[L]}\sum_{i=1}^M\left[I(Z_{ij}=k)\log w_i + \left(I(Z_{ij}\neq 0) - I(Z_{ij}=k)\right)\log\frac{1-w_i}{L-1}\right] = \mathop{\mathrm{argmax}}_{k\in[L]}\left[\sum_{i=1}^M I(Z_{ij}=k)\log\frac{(L-1)w_i}{1-w_i} + \sum_{i=1}^M I(Z_{ij}\neq 0)\log\frac{1-w_i}{L-1}\right]$$
$$= \mathop{\mathrm{argmax}}_{k\in[L]}\sum_{i=1}^M\nu_i\, I(Z_{ij}=k), \qquad\text{where}\quad \nu_i = \log\frac{(L-1)w_i}{1-w_i},$$
since the second sum does not depend on $k$. Therefore the oracle MAP rule under the Homogeneous Dawid-Skene model is a WMV rule, and the results of Corollary 5 apply directly, i.e.,
$$t_1 = \frac{q}{(L-1)\|\nu\|_2}\sum_{i=1}^M\nu_i(Lw_i-1), \qquad c = \frac{\|\nu\|_\infty}{\|\nu\|_2}, \qquad \sigma^2 = q,$$
and if $t_1\ge 0$, then
$$\frac{1}{N}\sum_{j=1}^N P(\hat{y}_j\neq y_j) \le (L-1)\cdot\min\left\{\exp\left(-\frac{t_1^2}{2}\right),\; \exp\left(-\frac{t_1^2}{2(\sigma^2+c t_1/3)}\right)\right\}.$$
It remains to show that $t_1$ is always non-negative for this rule. If $w_i \ge \frac{1}{L}$, then $Lw_i - 1 \ge 0$ and $\nu_i = \log\frac{(L-1)w_i}{1-w_i} \ge 0$; if $w_i < \frac{1}{L}$, then $Lw_i - 1 < 0$ and $\nu_i = \log\frac{(L-1)w_i}{1-w_i} < 0$. In both cases $\nu_i(Lw_i-1)\ge 0$, so every term of $t_1$ is non-negative. All in all, $t_1 \ge 0$ always holds for the oracle MAP rule under the Homogeneous Dawid-Skene model.

Appendix B. Proof of Theorem 10: error rate bounds of one-step weighted majority voting

In the proof of this result, we focus on $P(T_{ij}=1) = q = 1$ for all $i\in[M], j\in[N]$, i.e., every worker labels every item. Meanwhile, we assume $L=2$ with label set $[L] \doteq \{\pm 1\}$. It is not hard to generalize our results to $q\in(0,1]$ and general $L$, which is more practical, but the bound becomes much more complicated; we omit that case for clarity.

The prediction of one-step weighted majority voting for the $j$th item is
$$\hat{y}^{\mathrm{wmv}}_j = \mathrm{sign}\left(\sum_{i=1}^M(2\hat{w}_i-1)Z_{ij}\right), \quad (63)$$
where $\hat{w}_i$ is the estimated accuracy of worker $i$, obtained by taking the output of majority voting as the "true" labels. That is,
$$\hat{w}_i = \frac{1}{N}\sum_{j=1}^N I\left(Z_{ij} = \hat{y}^{\mathrm{mv}}_j\right), \qquad\text{where}\quad \hat{y}^{\mathrm{mv}}_j = \mathrm{sign}\left(\sum_{i=1}^M Z_{ij}\right). \quad (64)$$
Recall that the average accuracy of the workers is $\bar{w} = \frac{1}{M}\sum_{i=1}^M w_i$. In fact, Theorem 10 is a direct implication of the following result.

Proposition 14 If $\bar{w} \ge \frac{1}{2} + \frac{1}{M} + \sqrt{\frac{(M-1)\ln 2}{2M^2}}$, then the mean error rate of one-step weighted majority voting for the $j$th item satisfies
$$P\left(\hat{y}^{\mathrm{wmv}}_j \neq y_j\right) \le \exp\left(-\frac{8MN^2\tilde{\sigma}^4(1-\eta)^2}{M^2N+(M+N)^2}\right), \quad (65)$$
where $\tilde{\sigma}$ and $\eta$ are as defined in Theorem 10.

In the next subsection we establish the auxiliary results needed for the proof of Proposition 14; Theorem 10 is then obtained directly.

B.1 Preparation for the proofs

Before proving Proposition 14, we prove several useful lemmas. With the same notation as in the proof of Theorem 1, we abbreviate, for convenience and clarity,
$$P^{(j)}_+(\cdot) \doteq P(\cdot \mid y_j=+1) \quad\text{and}\quad P^{(j)}_-(\cdot) \doteq P(\cdot \mid y_j=-1), \quad (66)$$
$$E^{(j)}_+[\cdot] \doteq E[\cdot \mid y_j=+1] \quad\text{and}\quad E^{(j)}_-[\cdot] \doteq E[\cdot \mid y_j=-1], \quad (67)$$
where "$\cdot$" denotes any event in the $\sigma$-algebra generated by $Z$. The following lemma bounds the conditional probability that the majority vote on the $j$th item equals the true label, given the true label and the label $Z_{ij}$ from the $i$th worker.

Lemma 15 For all $j\in[N]$ and all $i\in[M]$, we have:

(1) if $\bar{w} > \frac{1}{2}$, then
$$P\left(\hat{y}^{\mathrm{mv}}_j=+1 \mid y_j=+1, Z_{ij}=+1\right) \ge 1 - \exp\left(-\frac{2M^2\left(\bar{w}-\frac{1}{2}+\frac{1-w_i}{M}\right)^2}{M-1}\right), \quad (68)$$
and the same bound holds for $P(\hat{y}^{\mathrm{mv}}_j=-1 \mid y_j=-1, Z_{ij}=-1)$;

(2) if $\bar{w} \ge \frac{1}{2}+\frac{1}{M}$, then
$$P\left(\hat{y}^{\mathrm{mv}}_j=-1 \mid y_j=+1, Z_{ij}=-1\right) \le \exp\left(-\frac{2M^2\left(\bar{w}-\frac{1}{2}-\frac{w_i}{M}\right)^2}{M-1}\right), \quad (69)$$
and the same bound holds for $P(\hat{y}^{\mathrm{mv}}_j=+1 \mid y_j=-1, Z_{ij}=+1)$.
Proof. (1) Notice that for any $i\in[M]$ and $j\in[N]$, given $y_j$, $Z_{ij}$ is independent of $\{Z_{lj}\}_{l\neq i}$, with $E^{(j)}_+ Z_{lj} = 2w_l-1$ and $E^{(j)}_- Z_{lj} = -(2w_l-1)$. Then
$$\sum_{l\neq i} E^{(j)}_+ Z_{lj} + 1 = \sum_{l\neq i}(2w_l-1) + 1 = 2M\left(\bar{w}-\frac{1}{2}+\frac{1-w_i}{M}\right) > 0, \quad (70)$$
since $\bar{w}-\frac{1}{2}+\frac{1-w_i}{M} > \bar{w}-\frac{1}{2} > 0$. Therefore, we can apply the Hoeffding inequality:
$$P\left(\hat{y}^{\mathrm{mv}}_j=+1 \mid y_j=+1, Z_{ij}=+1\right) = P^{(j)}_+\left(\sum_{l=1}^M Z_{lj} > 0 \,\Big|\, Z_{ij}=+1\right) = P^{(j)}_+\left(\sum_{l\neq i}Z_{lj} - \sum_{l\neq i}E^{(j)}_+ Z_{lj} > -\Big(\sum_{l\neq i}E^{(j)}_+ Z_{lj} + 1\Big)\right)$$
$$\ge 1 - \exp\left(-\frac{\big[\sum_{l\neq i}E^{(j)}_+ Z_{lj} + 1\big]^2}{2(M-1)}\right) \;\text{(by Hoeffding)}\; = 1 - \exp\left(-\frac{2M^2\left(\bar{w}-\frac{1}{2}+\frac{1-w_i}{M}\right)^2}{M-1}\right) \;\text{(by (70))}.$$
With the same argument,
$$P\left(\hat{y}^{\mathrm{mv}}_j=-1 \mid y_j=-1, Z_{ij}=-1\right) = P^{(j)}_-\left(\sum_{l\neq i}Z_{lj} - \sum_{l\neq i}E^{(j)}_- Z_{lj} < -\sum_{l\neq i}E^{(j)}_- Z_{lj} + 1\right) \ge 1 - \exp\left(-\frac{2M^2\left(\bar{w}-\frac{1}{2}+\frac{1-w_i}{M}\right)^2}{M-1}\right),$$
provided that $-\sum_{l\neq i}E^{(j)}_- Z_{lj} + 1 = 2M\left(\bar{w}-\frac{1}{2}+\frac{1-w_i}{M}\right) > 0$, which is satisfied under the assumption $\bar{w} > \frac{1}{2}$.

(2) With the same argument as above, notice that $\sum_{l\neq i}E^{(j)}_+ Z_{lj} - 1 = 2M\left(\bar{w}-\frac{1}{2}-\frac{w_i}{M}\right) \ge 0$ because $\bar{w} \ge \frac{1}{2}+\frac{1}{M}$. Applying the Hoeffding inequality,
$$P\left(\hat{y}^{\mathrm{mv}}_j=-1 \mid y_j=+1, Z_{ij}=-1\right) = P^{(j)}_+\left(\sum_{l\neq i}Z_{lj} - \sum_{l\neq i}E^{(j)}_+ Z_{lj} < -\Big(\sum_{l\neq i}E^{(j)}_+ Z_{lj} - 1\Big)\right) \le \exp\left(-\frac{\big[\sum_{l\neq i}E^{(j)}_+ Z_{lj} - 1\big]^2}{2(M-1)}\right) = \exp\left(-\frac{2M^2\left(\bar{w}-\frac{1}{2}-\frac{w_i}{M}\right)^2}{M-1}\right).$$
Following the same argument, we can show that the same bound holds for $P(\hat{y}^{\mathrm{mv}}_j=+1 \mid y_j=-1, Z_{ij}=+1)$.

The next lemma bounds the probability that the label of item $j$ given by worker $i$ agrees with majority voting.

Lemma 16 Given $\bar{w} \ge \frac{1}{2}+\frac{1}{M}$, for all $j\in[N]$ we have
$$w_i - \xi^{(1)}_i \le P\left(Z_{ij}=\hat{y}^{\mathrm{mv}}_j \mid y_j\right) \le w_i + \xi^{(2)}_i, \quad (71)$$
where $\xi^{(1)}_i = w_i\exp\left(-\frac{2M^2(\bar{w}-\frac{1}{2}+\frac{1-w_i}{M})^2}{M-1}\right)$ and $\xi^{(2)}_i = (1-w_i)\exp\left(-\frac{2M^2(\bar{w}-\frac{1}{2}-\frac{w_i}{M})^2}{M-1}\right)$. Furthermore,
$$w_i - \xi^{(1)}_i \le P\left(Z_{ij}=\hat{y}^{\mathrm{mv}}_j\right) \le w_i + \xi^{(2)}_i. \quad (72)$$

Remark: This result implies that, under mild conditions, the probability that the label of the $j$th item given by the $i$th worker matches the majority vote is close to $w_i$, the accuracy of that worker, and it gets closer as the number of workers increases. Intuitively this makes sense: if $M$ is large, the majority vote is close to the true label whenever $\bar{w} > 0.5$.
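This is easy to confirm by simulation; the following Monte Carlo check (ours, not part of the paper) estimates the agreement probability and compares it with $w_i$:

```python
import numpy as np

# Monte Carlo check of Lemma 16: the agreement rate between a worker's labels
# and the majority vote should be close to that worker's accuracy w_i.
rng = np.random.default_rng(0)
M, N = 51, 2000                              # q = 1, L = 2, labels in {-1, +1}
w = rng.uniform(0.55, 0.95, size=M)          # w_bar comfortably above 1/2 + 1/M

y = rng.choice([-1, 1], size=N)
correct = rng.random((M, N)) < w[:, None]
Z = np.where(correct, y, -y)                 # every worker labels every item
y_mv = np.sign(Z.sum(axis=0))                # majority vote (M odd: no ties)

agree = (Z == y_mv).mean(axis=1)             # empirical P(Z_ij = y_mv_j)
print("max |agreement - w_i| =", np.abs(agree - w).max())  # small for large M
```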
The next lemma will b e crucial for applying concen tration measure results to b ound P ( ˆ y wmv j 6 = y j | { y k } N k =1 ). F or measuring the fluctuation of a function g j : { 0 , ± 1 } M × N → R if w e change one en try of the data matrix, w e define a quan tity as follows: d ( j ) i ? j ? . = inf d : | g j ( Z ) − g j ( Z 0 ) | ≤ d, where Z and Z 0 only differ on ( i ? , j ? ) . (78) The constraints Z and Z 0 only differ on ( i ? , j ? ), which means that Z 0 i ? j ? is a indep endent cop y of Z i ? j ? , and Z ij = Z 0 ij for ( i, j ) 6 = ( i ? , j ? ). Lemma 17 L et g j ( Z ) . = P M i =1 (2 ˆ w i − 1) Z ij , ∀ j ∈ [ N ] , wher e Z is the data matrix and ˆ w i is as define d in (64), with d ( j ) i ? j ? define d in (78), we have (1)if j ? 6 = j , then d ( j ) i ? j ? ≤ 2( M − 1) N ; (2)if j ? = j , then d ( j ) i ? j ? ≤ 2( M − 1) N + 2 . Pro of Since Z 0 i ? j ? is an indep enden t copy of Z i ? j ? , Z 0 i ? j ? = Z i ? j ? or Z 0 i ? j ? = − Z i ? j ? . When Z 0 i ? j ? = Z i ? j ? , Z 0 = Z , then of course | g j ( Z ) − g j ( Z 0 ) | = 0, which satisfies the inequality trivially . Next, we fo cus on the non-trivial case Z 0 i ? j ? = − Z i ? j ? . 45 Note that Z ij = Z 0 ij when ( i, j ) 6 = ( i ? , j ? ). Let ˆ y mv j b e the ma jority v ote of the j th column of Z , and ˆ y mv 0 j b e the ma jority v ote by the j th column of Z 0 . Recall that | g j ( Z ) − g j ( Z 0 ) | = | P M i =1 (2 ˆ w i − 1) Z ij − P M i =1 (2 ˆ w 0 i − 1) Z 0 ij | . If j ? 6 = j , then Z ij = Z 0 ij , ∀ i ∈ [ M ], so, | g j ( Z ) − g j ( Z 0 ) | = | M X i =1 (2 ˆ w i − 2 ˆ w 0 i ) Z ij | ≤ 2 M X i =1 | ( ˆ w i − ˆ w 0 i ) Z ij | = 2 M X i =1 | ˆ w i − ˆ w 0 i | . (79) If j ? = j , we hav e Z ij ? = Z 0 ij ? for i 6 = i ? , and Z i ? j ? = − Z 0 i ? j ? , then, | g j ( Z ) − g j ( Z 0 ) | = | X i 6 = i ? 2( ˆ w i − ˆ w 0 i ) Z ij + 2( ˆ p i ? + ˆ p 0 i ? − 1) Z i ? j ? | ≤ 2 X i 6 = i ? | ˆ w i − ˆ w 0 i | + 2 | ˆ p i ? + ˆ p 0 i ? − 1 | (80) W e can see that the difference g j ( Z ) − g j ( Z 0 ) dep ends heavily on the tw o quantities | ˆ w i − ˆ w 0 i | and | ˆ p i ? + ˆ p 0 i ? − 1 | . Next we b ound | ˆ w i − ˆ w 0 i | : | ˆ w i − ˆ w 0 i | = | 1 N M X i =1 I ( Z ik = ˆ y mv k ) − I Z 0 ik = ˆ y mv 0 k | = 1 N | I Z ij ? = ˆ y mv j ? − I Z 0 ij ? = ˆ y mv 0 j ? | , b ecause Z ik = Z 0 ik and ˆ y mv k = ˆ y mv 0 k if k 6 = j ? . (a). If ˆ y mv j ? = ˆ y mv 0 j ? , then b y (81) | ˆ w i − ˆ w 0 i | = ( 0 if i 6 = i ? , 1 N if i = i ? . In this case, M X i =1 | ˆ w i − ˆ w 0 i | = 1 N , and X i 6 = i ? | ˆ w i − ˆ w 0 i | = 0 . (81) (b). If If ˆ y mv j ? 6 = ˆ y mv 0 j ? , then b y (81) | ˆ w i − ˆ w 0 i | = ( 1 N if i 6 = i ? , 0 if i = i ? . In this case, M X i =1 | ˆ w i − ˆ w 0 i | = M − 1 N , and X i 6 = i ? | ˆ w i − ˆ w 0 i | = M − 1 N . (82) 46 No w, w e are going to b ound | ˆ p i ? + ˆ p 0 i ? − 1 | : | ˆ p i ? + ˆ p 0 i ? − 1 | = 1 N | N X k =1 I ( Z i ? k = ˆ y mv k ) + I Z 0 i ? k = ˆ y mv 0 k − 1 | ≤ 1 N N X k =1 | I ( Z i ? k = ˆ y mv k ) + I Z 0 i ? k = ˆ y mv 0 k − 1 | . (83) (c). If ˆ y mv j ? = ˆ y mv 0 j ? , then | I ( Z i ? k = ˆ y mv k ) + I Z 0 i ? k = ˆ y mv 0 k − 1 | = ( 1 if k 6 = j ? , 0 if k = j ? . So in this case, | ˆ p i ? + ˆ p 0 i ? − 1 | = N − 1 N . (d). If ˆ y mv j ? 6 = ˆ y mv 0 j ? , then | I ( Z i ? k = ˆ y mv k ) + I Z 0 i ? k = ˆ y mv 0 k − 1 | = ( 1 if k 6 = j ? , 1 if k = j ? . So in this case, | ˆ p i ? + ˆ p 0 i ? − 1 | = 1 . Putting together all the results ab ov e, if Z i ? j ? 6 = Z 0 i ? j ? , then we ha ve, (1’) If ˆ y mv j ? = ˆ y mv 0 j ? 
Putting all the results above together, if $Z_{i^\star j^\star}\neq Z'_{i^\star j^\star}$, then we have:

(1') If $\hat{y}^{\mathrm{mv}}_{j^\star}=\hat{y}^{\mathrm{mv}\prime}_{j^\star}$, then
$$|g_j(Z)-g_j(Z')| \le \begin{cases}\frac{2}{N} & \text{if } j^\star\neq j,\\ \frac{2(N-1)}{N} & \text{if } j^\star=j.\end{cases}$$

(2') If $\hat{y}^{\mathrm{mv}}_{j^\star}\neq\hat{y}^{\mathrm{mv}\prime}_{j^\star}$, then
$$|g_j(Z)-g_j(Z')| \le \begin{cases}\frac{2(M-1)}{N} & \text{if } j^\star\neq j,\\ \frac{2(M-1)}{N}+2 & \text{if } j^\star=j.\end{cases}$$

The upper bounds in the case $\hat{y}^{\mathrm{mv}}_{j^\star}\neq\hat{y}^{\mathrm{mv}\prime}_{j^\star}$ also dominate those of the case $\hat{y}^{\mathrm{mv}}_{j^\star}=\hat{y}^{\mathrm{mv}\prime}_{j^\star}$. By the definition of $d^{(j)}_{i^\star j^\star}$, and noting that $Z$ can take only a finite number of values, we have
$$d^{(j)}_{i^\star j^\star} \le \begin{cases}\frac{2(M-1)}{N} & \text{if } j^\star\neq j,\\ \frac{2(M-1)}{N}+2 & \text{if } j^\star=j.\end{cases}$$

Remark: $d^{(j)}_{i^\star j^\star}$ is the smallest upper bound on the difference between $g_j(Z)$ and $g_j(Z')$. From the proof we can see that the bounds derived are achievable, thus tight. The result basically says that if we change a single entry of the data matrix, the fluctuation of the prediction score function of one-step weighted majority voting, i.e., $g_j(Z)$, grows with $M$ and shrinks as $N$ increases.
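These bounded differences can be sanity-checked by brute force on a tiny instance (our illustration; the assertion encodes the bounds of Lemma 17):

```python
import numpy as np

def g(Z, j):
    """Score g_j(Z) = sum_i (2*w_hat_i - 1) * Z_ij, with w_hat from (64)."""
    y_mv = np.sign(Z.sum(axis=0))                 # M odd, q = 1: no ties
    w_hat = (Z == y_mv).mean(axis=1)
    return ((2 * w_hat - 1) * Z[:, j]).sum()

rng = np.random.default_rng(1)
M, N, j = 5, 7, 0
bound = lambda js: 2 * (M - 1) / N + (2 if js == j else 0)   # Lemma 17

worst = 0.0
for _ in range(200):
    Z = rng.choice([-1, 1], size=(M, N))
    for i_s in range(M):
        for j_s in range(N):
            Zp = Z.copy()
            Zp[i_s, j_s] *= -1                    # flip one entry
            diff = abs(g(Z, j) - g(Zp, j))
            assert diff <= bound(j_s) + 1e-12
            worst = max(worst, diff)
print("largest observed |g - g'| =", worst)
```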
Step 1. An upper bound on $\tilde{P}^{(j)}_{+}(g_j(Z) < 0)$.

Once we condition on $\{y_k\}_{k=1}^N$, all entries of $Z$ are independent of each other, so we can apply McDiarmid's inequality (McDiarmid, 1998) to the probability $\tilde{P}^{(j)}_{+}(g_j(Z) < 0)$. From Lemma 17, if $Z$ and $Z'$ differ only on entry $(i^\star, j^\star)$, where $Z'_{i^\star j^\star}$ is an independent copy of $Z_{i^\star j^\star}$, then $|g_j(Z) - g_j(Z')| \le d^{(j)}_{i^\star j^\star}$. Combining this with the bounds from Lemma 17, we have

$$\sum_{i^\star=1}^{M}\sum_{j^\star=1}^{N} \big(d^{(j)}_{i^\star j^\star}\big)^2 \le M(N-1)\Big(\frac{2(M-1)}{N}\Big)^2 + M\Big(2 + \frac{2(M-1)}{N}\Big)^2 \le MN\Big(\frac{2M}{N}\Big)^2 + M\Big(2 + \frac{2M}{N}\Big)^2 = \frac{4M}{N^2}\big[M^2 N + (M+N)^2\big]. \qquad (88)$$

Applying McDiarmid's inequality, provided $\tilde{E}^{(j)}_{+}[g_j(Z)] \ge 0$, we get

$$\tilde{P}^{(j)}_{+}\big(g_j(Z) < 0\big) = \tilde{P}^{(j)}_{+}\Big(g_j(Z) - \tilde{E}^{(j)}_{+}[g_j(Z)] < -\tilde{E}^{(j)}_{+}[g_j(Z)]\Big) \le \exp\Bigg(-\frac{2\big(\tilde{E}^{(j)}_{+}[g_j(Z)]\big)^2}{\sum_{i^\star=1}^{M}\sum_{j^\star=1}^{N}\big(d^{(j)}_{i^\star j^\star}\big)^2}\Bigg) \le \exp\Bigg(-\frac{N^2\big(\tilde{E}^{(j)}_{+}[g_j(Z)]\big)^2}{2M\big[M^2 N + (M+N)^2\big]}\Bigg). \qquad (89)$$

Now, if we can derive a lower bound on $\tilde{E}^{(j)}_{+}[g_j(Z)]$, then substituting that lower bound into the last inequality bounds $\tilde{P}^{(j)}_{+}(g_j(Z) < 0)$ from above. We therefore aim to derive a good lower bound on $\tilde{E}^{(j)}_{+}[g_j(Z)]$.
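To get a feel for the scale of (89): $g_j(Z)$ is a sum of $M$ bounded terms, so its conditional mean is at most of order $M$, and the exponent in (89) is only large when that mean is a sizable fraction of $M$. The small sketch below, illustrative only, evaluates the right-hand side of (89) at a few hypothetical values of the conditional mean $t$.

```python
import numpy as np

def mcdiarmid_rhs(M, N, t):
    # Right-hand side of (89): exp(-N^2 t^2 / (2 M [M^2 N + (M+N)^2])),
    # where t stands in for the conditional mean of g_j(Z) (assumed >= 0).
    return np.exp(-(N**2) * t**2 / (2.0 * M * (M**2 * N + (M + N) ** 2)))

M, N = 30, 1000
for c in (0.2, 0.5, 0.8):          # conditional mean equal to c * M
    print(c, mcdiarmid_rhs(M, N, c * M))
```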
Expanding $g_j(Z)$,

$$\tilde{E}^{(j)}_{+}[g_j(Z)] = \tilde{E}^{(j)}_{+}\Big[\sum_{i=1}^{M}(2\hat{w}_i - 1) Z_{ij}\Big] = 2\sum_{i=1}^{M}\tilde{E}^{(j)}_{+}[\hat{w}_i Z_{ij}] - \sum_{i=1}^{M}\tilde{E}^{(j)}_{+}[Z_{ij}] = 2\sum_{i=1}^{M}\tilde{E}^{(j)}_{+}[\hat{w}_i Z_{ij}] - \sum_{i=1}^{M}(2w_i - 1),$$

since $\tilde{E}^{(j)}_{+}[Z_{ij}] = E[Z_{ij} \mid y_j = +1, \{y_k\}_{k \neq j}] = E[Z_{ij} \mid y_j = +1] = 2w_i - 1$.

Note that for any $i \in [M]$ and $j \in [N]$, given $y_j$, the entries $\{Z_{ij}\}_{i=1}^{M}$ are independent of $\{Z_{lk}\}_{k \neq j}$ and of $\{y_k\}_{k \neq j}$. We use this property to drop all irrelevant conditioning labels:

$$\tilde{E}^{(j)}_{+}[\hat{w}_i Z_{ij}] = \tilde{E}^{(j)}_{+}\Big[Z_{ij} \cdot \frac{1}{N}\sum_{k=1}^{N} I(Z_{ik} = \hat{y}^{mv}_k)\Big] = \frac{1}{N}\sum_{k=1}^{N}\tilde{E}^{(j)}_{+}\big[Z_{ij}\, I(Z_{ik} = \hat{y}^{mv}_k)\big] = \frac{1}{N}\Big(\sum_{k \neq j}\tilde{E}^{(j)}_{+}\big[Z_{ij}\, I(Z_{ik} = \hat{y}^{mv}_k)\big] + \tilde{E}^{(j)}_{+}\big[Z_{ij}\, I(Z_{ij} = \hat{y}^{mv}_j)\big]\Big). \qquad (90)$$

When $k \neq j$, $Z_{ik}$ and $\hat{y}^{mv}_k$ are independent of $Z_{ij}$ given $y_j$, so

$$\tilde{E}^{(j)}_{+}\big[Z_{ij}\, I(Z_{ik} = \hat{y}^{mv}_k)\big] = \tilde{E}^{(j)}_{+}[Z_{ij}] \cdot \tilde{E}^{(j)}_{+}\big[I(Z_{ik} = \hat{y}^{mv}_k)\big] = (2w_i - 1)\, P(Z_{ik} = \hat{y}^{mv}_k \mid y_k)$$
$$\ge I(2w_i - 1 \ge 0)\,(2w_i - 1)\, w_i\Big[1 - \exp\Big(-\frac{2M^2(\bar{w} - \frac12 + \frac{1 - w_i}{M})^2}{M-1}\Big)\Big] + I(2w_i - 1 < 0)\,(2w_i - 1)\, w_i\Big[1 + \frac{1 - w_i}{w_i}\exp\Big(-\frac{2M^2(\bar{w} - \frac12 - \frac{w_i}{M})^2}{M-1}\Big)\Big] \quad \text{(by Lemma 16)}$$
$$\ge (2w_i - 1)\, w_i\Big[I\big(w_i \ge \tfrac12\big) - I\big(w_i \ge \tfrac12\big)\exp\Big(-\frac{2M^2(\bar{w} - \frac12 + \frac{1 - w_i}{M})^2}{M-1}\Big)\Big] + (2w_i - 1)\, w_i\Big[I\big(w_i < \tfrac12\big) + I\big(w_i < \tfrac12\big)\frac{1 - w_i}{w_i}\exp\Big(-\frac{2M^2(\bar{w} - \frac12 - \frac{w_i}{M})^2}{M-1}\Big)\Big]$$
$$\ge (2w_i - 1)\, w_i\Big[1 + \frac{1 - 2w_i}{w_i}\exp\Big(-\frac{2M^2(\bar{w} - \frac12 - \frac{w_i}{M})^2}{M-1}\Big)\Big] \quad \Big(\text{because } \bar{w} > \tfrac12 + \tfrac1M\Big)$$
$$= (2w_i - 1)\, w_i\Big(1 + \frac{1 - 2w_i}{2w_i}\,\eta_i\Big), \qquad \text{where } \eta_i = 2\exp\Big(-\frac{2M^2(\bar{w} - \frac12 - \frac{w_i}{M})^2}{M-1}\Big).$$

Furthermore,

$$\tilde{E}^{(j)}_{+}\big[Z_{ij}\, I(Z_{ij} = \hat{y}^{mv}_j)\big] = \tilde{E}^{(j)}_{+}\Big[Z_{ij}\,\tilde{E}^{(j)}_{+}\big[I(Z_{ij} = \hat{y}^{mv}_j) \mid Z_{ij}\big]\Big] = w_i\, P(\hat{y}^{mv}_j = +1 \mid y_j = +1, Z_{ij} = +1) - (1 - w_i)\, P(\hat{y}^{mv}_j = -1 \mid y_j = +1, Z_{ij} = -1)$$
$$\ge w_i\Big[1 - \exp\Big(-\frac{2M^2(\bar{w} - \frac12 + \frac{1 - w_i}{M})^2}{M-1}\Big)\Big] - (1 - w_i)\exp\Big(-\frac{2M^2(\bar{w} - \frac12 - \frac{w_i}{M})^2}{M-1}\Big) \quad \text{(by Lemma 15)}$$
$$\ge w_i\Big[1 - \exp\Big(-\frac{2M^2(\bar{w} - \frac12 - \frac{w_i}{M})^2}{M-1}\Big)\Big] - (1 - w_i)\exp\Big(-\frac{2M^2(\bar{w} - \frac12 - \frac{w_i}{M})^2}{M-1}\Big) \quad \Big(\text{as } \bar{w} > \tfrac12 + \tfrac1M\Big)$$
$$= w_i - \exp\Big(-\frac{2M^2(\bar{w} - \frac12 - \frac{w_i}{M})^2}{M-1}\Big) = w_i - \frac{\eta_i}{2}.$$

Combining the two bounds above, we obtain

$$\tilde{E}^{(j)}_{+}[\hat{w}_i Z_{ij}] = \frac{1}{N}\Big(\sum_{k \neq j}\tilde{E}^{(j)}_{+}\big[Z_{ij}\, I(Z_{ik} = \hat{y}^{mv}_k)\big] + \tilde{E}^{(j)}_{+}\big[Z_{ij}\, I(Z_{ij} = \hat{y}^{mv}_j)\big]\Big) \ge \frac{1}{N}\Big[(N-1)(2w_i - 1)\, w_i\Big(1 + \frac{1 - 2w_i}{2w_i}\eta_i\Big) + w_i - \frac{\eta_i}{2}\Big]$$
$$= \frac{1}{2N}\Big[\big((N-1)(2w_i - 1)^2 + 1\big)(1 - \eta_i) + N(2w_i - 1)\Big] \ge \frac{1}{2N}\Big[N(2w_i - 1)^2(1 - \eta_i) + N(2w_i - 1)\Big] \quad \big(\text{because } 1 \ge (2w_i - 1)^2\big)$$
$$= \frac12(2w_i - 1)^2(1 - \eta_i) + \frac12(2w_i - 1).$$

Let $\eta = 2\exp\Big(-\frac{2M^2(\bar{w} - \frac12 - \frac1M)^2}{M-1}\Big)$. Since $w_i \le 1$ and $\bar{w} > \frac12 + \frac1M$, we have $\eta_i \le \eta$ for all $i \in [M]$, so

$$\tilde{E}^{(j)}_{+}[g_j(Z)] = 2\sum_{i=1}^{M}\tilde{E}^{(j)}_{+}[\hat{w}_i Z_{ij}] - \sum_{i=1}^{M}(2w_i - 1) \ge \sum_{i=1}^{M}(2w_i - 1)^2(1 - \eta_i) + \sum_{i=1}^{M}(2w_i - 1) - \sum_{i=1}^{M}(2w_i - 1) \ge (1 - \eta)\sum_{i=1}^{M}(2w_i - 1)^2 = 4M\tilde{\sigma}^2(1 - \eta), \qquad (91)$$

where $\tilde{\sigma} = \sqrt{\frac1M\sum_{i=1}^{M}\big(w_i - \frac12\big)^2}$, so that $\sum_{i=1}^{M}(2w_i - 1)^2 = 4M\tilde{\sigma}^2$. Since $\bar{w} \ge \frac12 + \frac1M + \sqrt{\frac{(M-1)\ln 2}{2M^2}}$, we have $\eta \le 1$, which implies $\tilde{E}^{(j)}_{+}[g_j(Z)] \ge 0$. Then by (89) and (91),

$$\tilde{P}^{(j)}_{+}\big(g_j(Z) < 0\big) \le \exp\Bigg(-\frac{N^2\big(\tilde{E}^{(j)}_{+}[g_j(Z)]\big)^2}{2M\big[M^2 N + (M+N)^2\big]}\Bigg) \le \exp\Bigg(-\frac{8MN^2\tilde{\sigma}^4(1 - \eta)^2}{M^2 N + (M+N)^2}\Bigg). \qquad (92)$$

Step 2. By the same argument and the same logic, the identical upper bound holds for $\tilde{P}^{(j)}_{-}(g_j(Z) > 0)$.

Step 3. Combining the results of Steps 1 and 2. Since

$$P(\hat{y}^{wmv}_j \neq y_j \mid y_1,\dots,y_N) = I(y_j = +1)\,\tilde{P}^{(j)}_{+}\big(g_j(Z) < 0\big) + I(y_j = -1)\,\tilde{P}^{(j)}_{-}\big(g_j(Z) > 0\big)$$

and both $\tilde{P}^{(j)}_{+}(g_j(Z) < 0)$ and $\tilde{P}^{(j)}_{-}(g_j(Z) > 0)$ admit the same upper bound, we have

$$P(\hat{y}^{wmv}_j \neq y_j \mid y_1,\dots,y_N) \le \exp\Bigg(-\frac{8MN^2\tilde{\sigma}^4(1 - \eta)^2}{M^2 N + (M+N)^2}\Bigg).$$

This upper bound does not depend on the values of $\{y_k\}_{k=1}^N$, so by the argument at the very beginning of the proof,

$$P(\hat{y}^{wmv}_j \neq y_j) \le \exp\Bigg(-\frac{8MN^2\tilde{\sigma}^4(1 - \eta)^2}{M^2 N + (M+N)^2}\Bigg).$$

Theorem 10 now follows directly.

Proof (of Theorem 10) Since the upper bound on $P(\hat{y}^{wmv}_j \neq y_j)$ does not depend on $j$, averaging over $j$ gives

$$\frac1N\sum_{j=1}^{N} P(\hat{y}^{wmv}_j \neq y_j) \le \exp\Bigg(-\frac{8MN^2\tilde{\sigma}^4(1 - \eta)^2}{M^2 N + (M+N)^2}\Bigg),$$

which is the desired result of Theorem 10.
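As a closing illustration, the bound of Theorem 10 is straightforward to evaluate numerically for hypothetical worker accuracies. The sketch below is an illustration under assumed parameters, not the authors' code: it computes $\tilde{\sigma}^2$, $\eta$, and the bound, and asserts the condition on $\bar{w}$ under which $\eta \le 1$.

```python
import numpy as np

def theorem10_bound(M, N, w):
    # Mean-error-rate bound of Theorem 10 for one-step WMV (a sketch):
    #   exp(-8 M N^2 sigma~^4 (1 - eta)^2 / (M^2 N + (M+N)^2)),
    # with sigma~^2 = (1/M) sum_i (w_i - 1/2)^2 and
    #   eta = 2 exp(-2 M^2 (w_bar - 1/2 - 1/M)^2 / (M - 1)).
    w_bar = np.mean(w)
    # Condition under which eta <= 1, so the bound is nontrivial:
    assert w_bar >= 0.5 + 1.0 / M + np.sqrt((M - 1) * np.log(2) / (2.0 * M**2))
    sigma2 = np.mean((w - 0.5) ** 2)
    eta = 2.0 * np.exp(-2.0 * M**2 * (w_bar - 0.5 - 1.0 / M) ** 2 / (M - 1))
    return np.exp(-8.0 * M * N**2 * sigma2**2 * (1.0 - eta) ** 2
                  / (M**2 * N + (M + N) ** 2))

rng = np.random.default_rng(2)
M, N = 31, 2000
w = rng.uniform(0.6, 0.95, size=M)   # hypothetical accuracies; w_bar clears the threshold
print("Theorem 10 bound on the mean error rate:", theorem10_bound(M, N, w))
```

Note that the bound decays only when $M\tilde{\sigma}^4$ is large relative to the denominator's growth, which is consistent with the remark after Lemma 17: more workers sharpen the concentration, while more items sharpen the accuracy estimates $\hat{w}_i$.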