Doubly Robust Crowdsourcing
Authors: Chong Liu, Yu-Xiang Wang
Journal of Artificial Intelligence Research 73 (2022) 209-229. Submitted 09/2021; published 01/2022.

Chong Liu (chongliu@cs.ucsb.edu), Yu-Xiang Wang (yuxiangw@cs.ucsb.edu)
Department of Computer Science, University of California, Santa Barbara, Santa Barbara, CA 93106, USA

Abstract

Large-scale labeled datasets are the indispensable fuel that ignites the AI revolution as we see it today. Most such datasets are constructed using crowdsourcing services such as Amazon Mechanical Turk, which provide noisy labels from non-experts at a fair price. The sheer size of such datasets mandates that it is only feasible to collect a few labels per data point. We formulate the problem of test-time label aggregation as a statistical estimation problem of inferring the expected voting score. By imitating workers with supervised learners and using them in a doubly robust estimation framework, we prove that the variance of estimation can be substantially reduced, even if the learner is a poor approximation. Synthetic and real-world experiments show that by combining the doubly robust approach with adaptive worker/item selection rules, we often need a much lower label cost to achieve nearly the same accuracy as in the ideal world where all workers label all data points.

1. Introduction

The rise of machine learning approaches in artificial intelligence has enabled machines to perform well on many cognitive tasks that were previously thought of as what makes us human. In many specialized tasks, for example, animal recognition in images (He et al., 2015), conversational speech recognition (Xiong et al., 2018), and translating Chinese text into English (Hassan et al., 2018), learning-based systems have been shown to reach and even surpass human-level performance.
These remarkable achievements would not have been possible without the many large-scale datasets made available by researchers over the past two decades. ImageNet, for instance, has long been regarded as what spawned the AI revolution that we are experiencing today. These labels do not come for free. ImageNet's 11 million images were labeled using Amazon Mechanical Turk (AMT) into more than 15,000 synsets (classes in an ontology). On average, each image required roughly 2-5 independent human annotations, which were provided by 25,000 AMT workers over a period of three years. We estimate that the cost of getting all these annotations goes well above one million dollars. As deep learning models get larger and more powerful every day so as to tackle some of the more challenging AI tasks, their ferocious appetites for even larger labeled datasets have grown tremendously as well. However, unlike the abundant unlabeled data, it is often difficult, expensive, or even impossible to consult expert opinions on a large number of items. Here the items can be images, documents, voices, sentences, and so on. Services such as AMT have made it much easier to seek the wisdom of the crowd by having non-experts (called workers in the remainder of this paper) provide many noisy annotations at a much lower cost.

A large body of work has been devoted to finding more scalable solutions. These include a variety of label-aggregation methods (Sheng et al., 2008; Welinder et al., 2010; Zhang et al., 2016; Zhou et al., 2015), end-to-end human-in-the-loop learning (Khetan et al., 2018), online/adaptive worker selection (Branson et al., 2017; Van Horn et al., 2018), and so on. At the heart of these approaches are various ways to evaluate individual worker performance and quantify the uncertainty in the labels they provide.

© 2022 AI Access Foundation. All rights reserved.
In this paper, we take a pre-trained crowdsourcing model with worker evaluation as a black box and consider the problem of true-label inference for new data points. We formulate this problem as a statistical estimation problem and propose a number of ways to radically reduce the number of worker annotations.

1. Worker imitation. We propose to imitate each worker with a simple supervised learner that learns to predict the worker's label from the item features.

2. Doubly robust crowdsourcing (DRC). By tapping into the literature on doubly robust estimation, we design algorithms that exploit the possibly unreliable imitation agents and significantly reduce the estimation variance (hence annotation cost) while remaining unbiased.

3. Adaptive worker/item selection (AWS/AIS). We propose to bootstrap the imitation agents' confidence estimates to adaptively filter out high-confidence items and select the most qualified workers for low-confidence items, without additional cost.

Our results are summarized as follows.

1. We theoretically show that the DRC technique can be used to generically improve any given crowdsourcing model using any nontrivial learned imitation agents.

2. Synthetic and real-world experiments show that DRC improves label accuracy over standard probabilistic inference with the Dawid-Skene and majority voting models at almost all budget levels and on all datasets.

3. AWS and AIS often reduce the cost by orders of magnitude while enjoying the same level of accuracy. On several datasets, the proposed techniques can often get away with much fewer annotations per item while achieving almost the same accuracy as having all workers annotate all items.

2. Related Work

In this section, we briefly summarize the related work.
Our study is motivated by the many trailblazing approaches to label aggregation, including the wisdom-of-crowds (Welinder et al., 2010), the Dawid-Skene model (Dawid and Skene, 1979; Zhang et al., 2016), the minimax entropy approach (Zhou et al., 2015), the permutation-based model (Shah et al., 2020), the worker cluster model (Imamura et al., 2018), the crowdsourced regression model (Ok et al., 2019), and so on. Our contribution is complementary, as we can take any of these models as black boxes and hopefully improve their true-label inference.

Doubly robust techniques originate from the causal inference literature (Rotnitzky and Robins, 1995; Bang and Robins, 2005), and their use for variance reduction has led to several breakthroughs in machine learning (e.g., Johnson and Zhang, 2013; Wang et al., 2013). We drew our inspiration directly from the use of doubly robust techniques in the off-policy evaluation problem in bandits and reinforcement learning (Dudík et al., 2014; Jiang and Li, 2016; Wang et al., 2017). The variance analysis and weight-clipping are adapted from the calculations in Dudík et al. (2014) and Wang et al. (2017) with some minor differences. To the best of our knowledge, this is the first paper considering doubly robust techniques in crowdsourcing.

Our idea of adaptive item/worker selection is inspired by the recent work of Branson et al. (2017) and Van Horn et al. (2018). They propose an AI-aided approach that reduces the number of worker labels per item to fewer than 1 in an object detection task. The key idea is to train a computer vision algorithm to detect bounding boxes using the aggregated labels obtained thus far; if the algorithm achieves high confidence on a new image, then the annotation provided by the algorithm is taken. The differences in our work are twofold.
First, our use of supervised learners is not to predict the true labels but rather to imitate workers. Second, our confidence measure is determined by the supervised learners' approximation to what all workers would say about an item, rather than used as a prior distribution added to model-based probabilistic inference.

3. Problem Setup

In this section, we introduce the notation and formulate the problem as a statistical estimation problem.

3.1 Notation

Suppose we have $n$ items, $m$ workers, and $k$ classes. We adopt the notation $[k] := \{1, 2, 3, \ldots, k\}$. Each item $j \in [n]$ is described by a $d$-dimensional feature vector $x_j$, and the feature matrix is $X = [x_1, x_2, \cdots, x_n]^\top \in \mathbb{R}^{n \times d}$. Each item $j \in [n]$ also has a hidden true label $y_j \in [k]$, which indicates the correct class that item $j$ belongs to. Workers, such as those on AMT, are requested to classify items into one of the $k$ classes. We denote the label that worker $i \in [m]$ assigns to item $j$ by $\ell_{ij} \in [k]$. It is important to distinguish the worker-produced labels $\ell_{ij}$ from the true label $y_j$, as the workers are considered non-experts and they make mistakes. From here onwards, we will refer to the potentially noisy and erroneous labels from workers as "annotations". Conveniently, we also collect $\ell_{ij}$ into a matrix $L \in ([k] \cup \{\bot\})^{m \times n}$, where any entries in $L$ that are $\bot$ are unobserved labels. We use $\Omega \subset [m] \times [n]$, $\Omega_i \subset [n]$, and $\Omega_j \subset [m]$ to denote the indices of the observed annotations, the indices of all items worker $i$ annotated, and the indices of all workers that annotated item $j$, respectively. For a generic item $(x, y)$, $\Omega_x$ collects the indices of workers who annotated the item, and the corresponding annotation is denoted by $\ell_i$ for each $i \in \Omega_x$.
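As a concrete illustration, this notation might be encoded as follows (a minimal sketch of our own; the integer -1 stands in for the missing-entry symbol $\bot$, and all variable names are our own choices, not the paper's):

```python
import numpy as np

# m workers, n items, k classes; L[i, j] = annotation of worker i on item j,
# with -1 standing in for the unobserved symbol ⊥.
m, n, k = 3, 4, 2
L = -np.ones((m, n), dtype=int)
L[0, 1], L[2, 1], L[1, 3] = 0, 1, 0

# Omega: indices of the observed annotations, as in the text.
Omega = [(i, j) for i in range(m) for j in range(n) if L[i, j] != -1]

def Omega_item(j):
    """Indices of all workers that annotated item j (the set Ω_j)."""
    return [i for i in range(m) if L[i, j] != -1]
```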
3.2 Problem Statement

The goal of this paper is related to but different from the standard crowdsourcing problem, which aims at learning a model that one can use to infer the true labels $y_1, \ldots, y_n$ from noisy annotations $L[\Omega]$ (and sometimes item features $X$). Many highly practical models have already been proposed for that task (Dawid and Skene, 1979; Welinder et al., 2010; Zhang et al., 2016; Zhou et al., 2015; Shah et al., 2020). Complementary to the existing work that mainly focuses on label inference, we consider the problem of cost-saving. Specifically, we would like to design algorithms that reduce the expected number of new annotations needed to label a new item. The algorithm uses a pre-trained crowdsourcing model as well as the training dataset $X$ and $L[\Omega]$.

3.3 Dawid-Skene Model and Score Functions

The primary model that we work with in this paper is the Dawid-Skene (DS) model (Dawid and Skene, 1979; Zhang et al., 2016), which assumes the following data-generating process.

1. For each $j \in [n]$, $y_j \sim \mathrm{Categorical}(\tau)$.
2. For each $j \in [n]$, $i \in [m]$, $\ell_{ij} \sim \mathrm{Categorical}(\mu_{y_j, i})$.
3. We observe $\ell_{ij}$ with probability $\pi_{ij}$.

Here $\tau$ and $\mu_{y, i}$ denote probability distributions defined on $[k]$. In particular, $\mu_{y, i}$ is column $y$ of the confusion matrix of worker $i$, which the DS model uses to describe $P_i(\ell \mid y)$. We denote the confusion matrix associated with worker $i$ by $\mu_i \in \mathbb{R}^{k \times k}$. Once the DS model is learned, we can make use of the learned parameters $\tau$ and $\mu$ to infer the true labels from worker annotations via the posterior belief

$$P(y \mid \ell_1, \ell_2, \ldots, \ell_m) \propto P(y) \prod_{i=1}^{m} P(\ell_i \mid y) = \tau[y] \prod_{i=1}^{m} \mu_i[\ell_i, y]. \quad (1)$$

Taking the log of both sides and dropping the additive constant, we obtain the score function induced by the DS model:

$$S_{\mathrm{DS}}(y \mid \ell_i \ \forall i \in \Omega_x) = \log \tau[y] + \sum_{i=1}^{m} \log \mu_i[\ell_i, y]. \quad (2)$$

This is a weighted voting rule based on a pre-trained DS model. Similarly, we can cast the inference procedure of other crowdsourcing models as maximizing such a score function as well. For example, in the Majority Voting (MV) approach,

$$S_{\mathrm{MV}}(y \mid \Omega_x) = \sum_{i=1}^{m} \mathbf{1}(\ell_i(x) = y), \quad (3)$$

where $\mathbf{1}(\cdot)$ is the indicator function. Notably, no training datasets are needed for majority voting. The exposition above suggests that test time involves collecting a handful of worker annotations (choosing $\Omega_x$) and calculating a voting score specified by the crowdsourcing model in the form

$$S(y \mid \Omega_x) = \sum_{i \in \Omega_x} S_i(y, \ell_i(x)), \quad (4)$$

where $S_i$ is supplied by the model and connects annotation $\ell_i$ to label $y$. Then the label $y$ that maximizes the score is chosen.

3.4 A Statistical Estimation Framework

In the ideal world, when money is not a concern, we would poll all workers and calculate

$$S(y \mid [m]) = \sum_{i=1}^{m} S_i(y, \ell_i(x)). \quad (5)$$

In practice, however, just as we cannot afford to poll all voters to estimate who is winning the presidential election, we cannot afford to poll everyone to annotate a single data point either. But do we have to?

Notice that we can frame the question as a classical point estimation problem in statistics, where the statistical quantity of interest is

$$v_x(y) := \mathbb{E}\left[\frac{1}{m} \sum_{i=1}^{m} S_i(y, \ell_i(x))\right], \quad (6)$$

the expectation of the ideal-world score function (5), rescaled by $1/m$. In the above, the expectation is taken over the randomness in the workers' annotations. For example, if we select each worker independently with probability $\pi$, then the approach used in (2) and (3) would be an unbiased estimate of $v_x(y)$ if we rescale it by a factor of $\pi^{-1}$.
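The score functions (2)-(3) can be sketched in a few lines (illustrative code of our own, not from the paper; `tau` is the class prior and `mu[i]` the confusion matrix of worker $i$, with `mu[i][l, y]` $= P_i(\ell \mid y)$):

```python
import numpy as np

def score_ds(tau, mu, annotations):
    """DS score (Eq. 2): annotations maps worker index -> observed label.
    Returns the score of every candidate class y at once."""
    s = np.log(tau).astype(float)
    for i, l in annotations.items():
        s = s + np.log(mu[i][l, :])
    return s

def score_mv(k, annotations):
    """Majority-voting score (Eq. 3): one vote per observed annotation."""
    s = np.zeros(k)
    for l in annotations.values():
        s[l] += 1.0
    return s
```

The inferred label is then the argmax of the returned score vector; selecting each worker with probability $\pi$ and rescaling each observed term by $1/\pi$ leaves the expectation of the score unchanged, which is the unbiasedness argument above.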
The advantage of translating the problem into a classical statistical estimation problem is that there is now a century of associated literature that we can tap into, including work on adaptive sampling and variance reduction techniques.

We emphasize that while we will be using a crowdsourcing model, for example the Dawid-Skene model, we do not assume that the data is generated according to the model. In fact, we do not impose any restrictions on how workers annotate items, except that

1. $\ell_i(x)\ \forall i \in [m]$ are mutually independent given any item $x$;
2. $\mathrm{Var}[S_i(y, \ell_i(x))] < +\infty\ \forall x, y$.

These are very mild assumptions that are typically true in practice. It is generally difficult to analytically model human behaviors, because they depend on how the item is presented to the worker as well as the worker's knowledge and cognitive processes. The agnostic-learning point of view helps disentangle the approximation-theoretic questions from the statistical question of estimating the best approximation possible using a given crowdsourcing model.

The remainder of the paper is about designing estimators of $v_x(y)$ that achieve accurate label inference at a low cost, together with their corresponding theory and experiments. To avoid any confusion, we emphasize again that the item $x$ is fixed. All the estimators are defined for each $y$ separately. $\ell_1, \ldots, \ell_m$ are random variables that come out of the unknown process of workers looking at the item $x$. Whenever the dependence is clear from context, we drop the conditioning on $x$ for better readability.

4. Benchmark Approaches

In this section, we describe a few baseline approaches for estimating $v_x(y)$ and their corresponding cost in number of annotations.

4.1 Ideal World (IW) Estimator

In the ideal world, all workers are required to label $x$:

$$\hat{v}_{\mathrm{IW}}(y) = \frac{1}{m} \sum_{i=1}^{m} S_i(y, \ell_i). \quad (7)$$
This estimator incurs a cost of $m$ and is unbiased, with variance $\frac{1}{m^2}\sum_{i=1}^{m} \mathrm{Var}[S_i(y, \ell_i(x))]$. This is arguably the best one can do without additional information.

4.2 Importance Sampling (IS) Estimator

A more affordable approach is to directly sample the workers. Specifically, we include worker $i$ independently with probability $\pi_i$:¹

$$\hat{v}_{\mathrm{IS}}(y) = \frac{1}{m} \sum_{i=1}^{m} \frac{\mathbf{1}(i \in \Omega)}{\pi_i} S_i(y, \ell_i). \quad (8)$$

The expected cost of the IS estimator is $\sum_{i \in [m]} \pi_i$, and it is clearly an unbiased estimator.

Theorem 1. $\hat{v}_{\mathrm{IS}}(y)$ is unbiased and

$$\mathrm{Var}[\hat{v}_{\mathrm{IS}}(y)] = \frac{1}{m^2} \sum_{i=1}^{m} \left[ \frac{1}{\pi_i} \mathrm{Var}[S_i(y, \ell_i)] + \left(\frac{1}{\pi_i} - 1\right) \mathbb{E}[S_i(y, \ell_i)]^2 \right].$$

Proof. By the independence of sampling,

$$\mathbb{E}[\hat{v}_{\mathrm{IS}}(y)] = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}\left[\mathbf{1}(i \in \Omega) \frac{1}{\pi_i} S_i(y, \ell_i)\right] = \frac{1}{m} \sum_{i=1}^{m} \pi_i \frac{1}{\pi_i} \mathbb{E}[S_i(y, \ell_i)] = v_x(y).$$

To calculate the variance, we use independence and then apply the law of total variance to each $i$:

$$\begin{aligned}
\mathrm{Var}[\hat{v}_{\mathrm{IS}}(y)] &= \frac{1}{m^2} \sum_{i=1}^{m} \frac{1}{\pi_i^2} \mathrm{Var}[\mathbf{1}(i \in \Omega)\, S_i(y, \ell_i)] \\
&= \frac{1}{m^2} \sum_{i=1}^{m} \frac{1}{\pi_i^2} \left( \mathbb{E}[\mathbf{1}(i \in \Omega)^2]\, \mathbb{E}[S_i(y, \ell_i)^2] - \mathbb{E}[\mathbf{1}(i \in \Omega)]^2\, \mathbb{E}[S_i(y, \ell_i)]^2 \right) \\
&= \frac{1}{m^2} \sum_{i=1}^{m} \frac{1}{\pi_i^2} \left( \pi_i \mathrm{Var}[S_i(y, \ell_i)] + \pi_i \mathbb{E}[S_i(y, \ell_i)]^2 - \pi_i^2 \mathbb{E}[S_i(y, \ell_i)]^2 \right) \\
&= \frac{1}{m^2} \sum_{i=1}^{m} \left[ \frac{1}{\pi_i} \mathrm{Var}[S_i(y, \ell_i)] + \left(\frac{1}{\pi_i} - 1\right) \mathbb{E}[S_i(y, \ell_i)]^2 \right].
\end{aligned}$$

¹ This is called Poisson sampling (Särndal et al., 2003) in survey sampling theory.

Remark. If $\pi_i \equiv \pi$, then we are essentially doing the standard probabilistic inference as in (2) and (3). When $\pi_i = 1$, IS trivially subsumes IW (7) as a special case. Moreover, since $x$ is fixed, the sampling probabilities $\pi_i$ can be chosen as a function of the item $x$ without affecting the above results.

4.3 Direct Method (DM)

Finally, there is an option that comes at no cost. Recall that we have at our disposal the dataset $X$ and $L$ that were used to train the crowdsourcing model.
We can reuse the dataset and train $m$ supervised learners to imitate each worker's behavior. Let $\hat{\ell}_1, \ldots, \hat{\ell}_m$ be the fictitious annotations provided by these supervised learners; we can simply plug them into the ideal-world estimator (7) without any cost:

$$\hat{v}_{\mathrm{DM}}(y) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}[S_i(y, \hat{\ell}_i)]. \quad (9)$$

Following the convention in the contextual bandits literature (Jiang and Li, 2016), we call this approach the direct method. The additional $\mathbb{E}$ is introduced to capture the case where a supervised learner outputs a soft annotation $\hat{\ell}_i$. The variance of this approach is 0. However, as we mentioned previously, we can never hope to faithfully learn human behaviors, especially when we only have a small number of annotations in the training data for each worker $i$. As a result, (9) may suffer from a bias that does not vanish even as $m \to \infty$.

5. Main Results

In this section, we adapt an old statistical technique, doubly robust estimation, to the crowdsourcing problem.

5.1 Doubly Robust Crowdsourcing

As we established in the last section, the IS estimator is unbiased but suffers from a large variance, especially when we would like to cut cost and use a small sampling probability. The DM estimator incurs no additional annotation cost and has no variance, but it can potentially suffer from a large bias due to the supervised learners not imitating the workers well enough. Doubly robust estimation (Rotnitzky and Robins, 1995; Dudík et al., 2014) is a powerful technique that allows us to reduce the variance using a DM estimator while retaining unbiasedness, hence getting the best of both worlds. The doubly robust estimator works as follows:

$$\hat{v}_{\mathrm{DR}}(y) = \frac{1}{m} \sum_{i=1}^{m} \left[ \mathbb{E}[S_i(y, \hat{\ell}_i)] + \frac{\mathbf{1}(i \in \Omega)}{\pi_i} \left( S_i(y, \ell_i) - \mathbb{E}[S_i(y, \hat{\ell}_i)] \right) \right]. \quad (10)$$

The doubly robust estimator can be thought of as using the DM as a baseline and then using IS to estimate and correct the bias.
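The estimator (10) can be sketched in a few lines (our own illustrative code under Poisson sampling; `scores` holds the realized $S_i(y, \ell_i)$ and `surrogate` holds $\mathbb{E}[S_i(y, \hat{\ell}_i)]$ from the imitators — both names are our own):

```python
import numpy as np

def dr_estimate(scores, surrogate, pi, rng):
    """Doubly robust estimate (Eq. 10): DM baseline plus an
    importance-sampling correction on the Poisson-sampled workers."""
    scores, surrogate, pi = map(np.asarray, (scores, surrogate, pi))
    included = rng.random(len(scores)) < pi        # worker i kept w.p. pi_i
    correction = included / pi * (scores - surrogate)
    return float(np.mean(surrogate + correction))
```

When the imitators are perfect (surrogate equal to the realized scores), the correction vanishes and the estimate is exact with zero variance; when they are poor, the correction restores unbiasedness at the price of variance.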
Provided that the supervised learners are able to provide a nontrivial approximation of the workers, the doubly robust estimator is expected to reduce the variance. To give two explicit examples of $\hat{v}_{\mathrm{DR}}(y)$: under the Dawid-Skene model, the doubly robust estimator is

$$\frac{1}{m} \sum_{i=1}^{m} \left[ \log P_{\mu_i}(\hat{\ell}_i \mid y) + \frac{\mathbf{1}(i \in \Omega)}{\pi_i} \log \frac{P_{\mu_i}(\ell_i \mid y)}{P_{\mu_i}(\hat{\ell}_i \mid y)} \right]. \quad (11)$$

Similarly, for the majority voting model, we can write

$$\frac{1}{m} \sum_{i=1}^{m} \left[ e_{\hat{\ell}_i} + \frac{1}{\pi_i} \mathbf{1}(i \in \Omega)\left( e_{\ell_i} - e_{\hat{\ell}_i} \right) \right], \quad (12)$$

where $e_\ell$ is the basis vector with a 1 in position $\ell$ and 0 elsewhere.

Theorem 2 (DRC). The doubly robust estimator (10) is unbiased and its variance is

$$\frac{1}{m^2} \sum_{i=1}^{m} \left[ \frac{1}{\pi_i} \mathrm{Var}[S_i(y, \ell_i)] + \left(\frac{1}{\pi_i} - 1\right) \mathbb{E}[S_i(y, \ell_i) - S_i(y, \hat{\ell}_i)]^2 \right].$$

Proof sketch. Note that the first part, $\frac{1}{m}\sum_{i=1}^{m} \mathbb{E}[S_i(y, \hat{\ell}_i)]$, of the estimator is not random. The result follows directly by invoking Theorem 1 on the second part of the estimator, which is an importance sampling estimator of the bias.

Proof. By the independence of sampling,

$$\begin{aligned}
\mathbb{E}[\hat{v}_{\mathrm{DR}}(y)] &= \frac{1}{m} \sum_{i=1}^{m} \left( \mathbb{E}[S_i(y, \hat{\ell}_i)] + \frac{1}{\pi_i} \mathbb{E}\left[\mathbf{1}(i \in \Omega)\left(S_i(y, \ell_i) - \mathbb{E}[S_i(y, \hat{\ell}_i)]\right)\right] \right) \\
&= \frac{1}{m} \sum_{i=1}^{m} \left( \mathbb{E}[S_i(y, \hat{\ell}_i)] + \pi_i \frac{1}{\pi_i} \mathbb{E}\left[S_i(y, \ell_i) - \mathbb{E}[S_i(y, \hat{\ell}_i)]\right] \right) \\
&= \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}[S_i(y, \ell_i)] = v_x(y).
\end{aligned}$$
To calculate the variance, we use independence and then apply the law of total variance to each $i$:

$$\begin{aligned}
\mathrm{Var}[\hat{v}_{\mathrm{DR}}(y)] &= \frac{1}{m^2} \sum_{i=1}^{m} \frac{1}{\pi_i^2} \mathrm{Var}\left[\mathbf{1}(i \in \Omega)\left(S_i(y, \ell_i) - \mathbb{E}[S_i(y, \hat{\ell}_i)]\right)\right] \\
&= \frac{1}{m^2} \sum_{i=1}^{m} \frac{1}{\pi_i^2} \left( \mathbb{E}[\mathbf{1}(i \in \Omega)^2]\, \mathbb{E}\left[\left(S_i(y, \ell_i) - \mathbb{E}[S_i(y, \hat{\ell}_i)]\right)^2\right] - \mathbb{E}[\mathbf{1}(i \in \Omega)]^2\, \mathbb{E}\left[S_i(y, \ell_i) - \mathbb{E}[S_i(y, \hat{\ell}_i)]\right]^2 \right) \\
&= \frac{1}{m^2} \sum_{i=1}^{m} \frac{1}{\pi_i^2} \left( \pi_i \mathrm{Var}[S_i(y, \ell_i)] + \pi_i \mathbb{E}\left[S_i(y, \ell_i) - \mathbb{E}[S_i(y, \hat{\ell}_i)]\right]^2 - \pi_i^2 \mathbb{E}\left[S_i(y, \ell_i) - \mathbb{E}[S_i(y, \hat{\ell}_i)]\right]^2 \right) \\
&= \frac{1}{m^2} \sum_{i=1}^{m} \left[ \frac{1}{\pi_i} \mathrm{Var}[S_i(y, \ell_i)] + \left(\frac{1}{\pi_i} - 1\right) \mathbb{E}[S_i(y, \ell_i) - S_i(y, \hat{\ell}_i)]^2 \right].
\end{aligned}$$

Remark. First, if workers are deterministic, the first part of the variance, $\mathrm{Var}[S_i(y, \ell_i)]$, is identically 0. Second, if the supervised learner imitates the workers perfectly in expectation, the second part of the variance vanishes. Finally, and most importantly, the supervised learner does not have to be perfect. In the simple case of deterministic workers, the percentage of agreement between supervised learners and their human counterparts directly translates into a reduction of the variance by about the same percentage, for free.

The third point is especially remarkable, as it implies that even a trivial surrogate that outputs a label at random could lead to a $1/k$-factor reduction of the variance. In addition, a good set of worker imitators with 90% accuracy can lead to an order-of-magnitude smaller variance and hence allows us to incur a much lower cost on average. We illustrate the effects of doubly robust estimation more extensively in the experiments. This feature ensures that our proposed method remains applicable even in the case when the training dataset contains few annotations from some subset of the features.

5.2 Confidence-Based Adaptive Sampling

Doubly robust estimation allows us to reduce the variance.
However, doubly robust estimation is still an importance-sampling-based method that requires the number of new annotations to be at least linear in the number of data points to label. In this section, we propose using supervised worker imitation to obtain confidence estimates for free and using them to construct confidence-based adaptive sampling schemes. We propose two rules.

1. Adaptive item selection. For each new data point, run DM first. If DM predicts label $y$ with overwhelming confidence, then chances are there is no need to collect more annotations. If not, human workers are needed.

2. Adaptive worker selection. We can adaptively choose which worker annotates a given item. Instead of sampling at random with probability $\pi$, we choose a set of adaptive sampling probabilities $\pi_1, \cdots, \pi_m$ that makes high-confidence workers more likely to be selected. As different workers have different skill sets, confidence may depend strongly on each item $x$. We propose to calculate such item-dependent confidence using the outcomes of the imitated workers and the confusion matrices from the DS model.

In both cases, we need a way to measure confidence given a probability distribution. A threshold is introduced to decide whether or not to accept predicted labels (Branson et al., 2017). The margin in multi-class classification is defined as the difference between the score of the true label and the largest score of the other labels (Mohri et al., 2012). Inspired by these, we define the confidence margin of a probability distribution as follows.

Definition 1 (Confidence Margin). Given a discrete probability distribution $\pi_1, \ldots, \pi_m$, its confidence margin $\rho$ is defined as the difference between the largest probability and the second-largest one.
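Definition 1 and the accept/reject rule it supports can be sketched as follows (illustrative code of our own; `rho` is the margin threshold from the text, and `accept_surrogate` is our own name for the item-selection test, not the paper's):

```python
import numpy as np

def confidence_margin(p):
    """Confidence margin (Definition 1): largest probability minus
    the second-largest one."""
    top2 = np.sort(np.asarray(p, dtype=float))[-2:]
    return float(top2[1] - top2[0])

def accept_surrogate(posterior, rho):
    """AIS-style rule sketch: keep the DM surrogate label (no worker cost)
    only when the posterior's margin exceeds the threshold rho."""
    return confidence_margin(posterior) >= rho
```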
Based on the confidence margin, we propose three new methods: DRC with Adaptive Item Selection (DRC-AIS), DRC with Adaptive Worker Selection (DRC-AWS), and their combination, DRC-AWS-AIS.

In DRC-AIS, DM is performed on all labels. For each item, the surrogate label given by DM is plugged into (1) to get the posterior belief, which describes how confidently DM labels this item. The confidence margin $\rho_{\mathrm{AIS}}$ of this posterior belief is compared with the given confidence margin parameter $\rho$. If $\rho_{\mathrm{AIS}}$ is larger, DRC-AIS takes the surrogate label provided by DM at no worker cost; otherwise, DRC-AIS follows the regular DRC model, which incurs cost.

In DRC-AWS, again DM runs first and produces surrogate labels $\hat{\ell}_i$. Then from each worker $i$'s confusion matrix we obtain the labeling probability $P(\hat{\ell}_i \mid y)$, whose confidence margin is used as the worker score $\gamma_i$, and $\gamma_1, \cdots, \gamma_m$ are normalized to form a distribution. For each item, worker $i$ is sampled with probability $\gamma_i$. However, in this case the sampling probability for each worker is usually very small; thus, we introduce a parameter $\lambda$ that multiplies $\gamma_i$ to increase the sampling probability. If a worker is sampled, the corresponding label is used as in the regular DRC model.

Table 1 summarizes the benchmark and DRC approaches. As we can see, IW and IS take the fewest input elements, while IW uses the most worker cost. DM uses no cost because it only takes advantage of surrogate labels given by worker imitation. DRC has a similar cost to IS, while it is expected to improve ground-truth inference with less variance. Thanks to the confidence-based adaptive sampling techniques, DRC-AIS and DRC-AWS are able to save more cost than DRC.

Table 1: Summary of benchmark and DRC approaches in DS models. $L$ and $X$ are the annotation matrix and item feature matrix; $\mu, \tau$ are probability distributions from DS models; $f$ denotes the supervised classifier for worker imitation.

Method     Input                Sampling     Cost
IW         L                    No           O(nm)
IS         L                    pi           O(pi nm)
DM         X + f                No           0
DRC        L + mu + tau(y) + f  pi           O(pi nm)
DRC-AIS    L + mu + tau(y) + f  pi           O(pi n* m)
DRC-AWS    L + mu + tau(y) + f  pi_{1:m}     O(sum_{i in [m]} pi_i n)

($n^*$ denotes the number of items that AIS selected as low-confidence.)

5.3 Weight-Clipping in DRC

Adaptive worker selection involves making the selection probability $\pi_i$ larger for some workers and smaller for others. According to Theorem 2, the variance is proportional to $\sum_i \pi_i^{-1}$; hence even a single $\pi_i$ close to 0 would result in a huge variance. In off-policy evaluation problems (Wang et al., 2017), this issue is addressed by clipping the importance weight at a fixed threshold $\eta$. This results in the clipped doubly robust estimator:

$$\hat{v}_{\mathrm{DR}}^{\eta}(y) = \frac{1}{m} \sum_{i=1}^{m} \left[ \mathbb{E}[S_i(y, \hat{\ell}_i)] + \mathbf{1}(i \in \Omega) \min\{\eta, \pi_i^{-1}\} \left( S_i(y, \ell_i) - \mathbb{E}[S_i(y, \hat{\ell}_i)] \right) \right]. \quad (13)$$

Its bias and variance are given as follows.

Theorem 3. The clipped doubly robust estimator obeys

$$\mathrm{Bias}(\hat{v}_{\mathrm{DR}}^{\eta}(y)) = \frac{1}{m} \sum_{i=1}^{m} \min\{\pi_i \eta - 1, 0\}\, \mathbb{E}[S_i(y, \ell_i) - S_i(y, \hat{\ell}_i)],$$

$$\mathrm{Var}[\hat{v}_{\mathrm{DR}}^{\eta}(y)] = \frac{1}{m^2} \sum_{i=1}^{m} \min\{\eta^2 \pi_i^2, 1\} \left[ \frac{1}{\pi_i} \mathrm{Var}[S_i(y, \ell_i)] + \left(\frac{1}{\pi_i} - 1\right) \mathbb{E}[S_i(y, \ell_i) - S_i(y, \hat{\ell}_i)]^2 \right].$$

Proof. By the definition of bias,

$$\begin{aligned}
\mathrm{Bias}(\hat{v}_{\mathrm{DR}}^{\eta}(y)) &= \mathbb{E}[\hat{v}_{\mathrm{DR}}^{\eta}(y)] - \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}[S_i(y, \ell_i)] \\
&= \frac{1}{m} \sum_{i=1}^{m} \left( \mathbb{E}[S_i(y, \hat{\ell}_i)] + \mathbb{E}\left[\mathbf{1}(i \in \Omega) \min\{\eta, \pi_i^{-1}\}\left(S_i(y, \ell_i) - \mathbb{E}[S_i(y, \hat{\ell}_i)]\right)\right] - \mathbb{E}[S_i(y, \ell_i)] \right) \\
&= \frac{1}{m} \sum_{i=1}^{m} \left( \mathbb{E}[S_i(y, \hat{\ell}_i)] + \min\{\pi_i \eta, 1\}\, \mathbb{E}\left[S_i(y, \ell_i) - \mathbb{E}[S_i(y, \hat{\ell}_i)]\right] - \mathbb{E}[S_i(y, \ell_i)] \right) \\
&= \frac{1}{m} \sum_{i=1}^{m} \min\{\pi_i \eta - 1, 0\}\, \mathbb{E}[S_i(y, \ell_i) - S_i(y, \hat{\ell}_i)].
\end{aligned}$$
And the variance can be calculated as

$$\begin{aligned}
\mathrm{Var}[\hat{v}_{\mathrm{DR}}^{\eta}(y)] &= \frac{1}{m^2} \sum_{i=1}^{m} \min\{\eta^2, \pi_i^{-2}\}\, \mathrm{Var}\left[\mathbf{1}(i \in \Omega)\left(S_i(y, \ell_i) - \mathbb{E}[S_i(y, \hat{\ell}_i)]\right)\right] \\
&= \frac{1}{m^2} \sum_{i=1}^{m} \min\{\eta^2, \pi_i^{-2}\} \left( \mathbb{E}[\mathbf{1}(i \in \Omega)^2]\, \mathbb{E}\left[\left(S_i(y, \ell_i) - \mathbb{E}[S_i(y, \hat{\ell}_i)]\right)^2\right] - \mathbb{E}[\mathbf{1}(i \in \Omega)]^2\, \mathbb{E}\left[S_i(y, \ell_i) - \mathbb{E}[S_i(y, \hat{\ell}_i)]\right]^2 \right) \\
&= \frac{1}{m^2} \sum_{i=1}^{m} \min\{\eta^2, \pi_i^{-2}\} \left( \pi_i \mathrm{Var}[S_i(y, \ell_i)] + \pi_i \mathbb{E}\left[S_i(y, \ell_i) - \mathbb{E}[S_i(y, \hat{\ell}_i)]\right]^2 - \pi_i^2 \mathbb{E}\left[S_i(y, \ell_i) - \mathbb{E}[S_i(y, \hat{\ell}_i)]\right]^2 \right) \\
&= \frac{1}{m^2} \sum_{i=1}^{m} \min\{\eta^2 \pi_i^2, 1\} \left[ \frac{1}{\pi_i} \mathrm{Var}[S_i(y, \ell_i)] + \left(\frac{1}{\pi_i} - 1\right) \mathbb{E}[S_i(y, \ell_i) - S_i(y, \hat{\ell}_i)]^2 \right].
\end{aligned}$$

Remark. The bias bound indicates that only those workers whose weights we clipped contribute to the bias. The variance bound implies that the part of the variance from worker $i$ is reduced from $O(1/\pi_i)$ to $O(\min\{1/\pi_i, \eta^2 \pi_i\})$. If the total additional $\mathrm{Bias}^2$ introduced by the clipping is smaller than the corresponding savings in the variance, then clipping makes the estimator more accurate in Mean Squared Error (MSE). The theory inspires us to design an algorithm that automatically chooses the threshold.

Remark (Automatic choice of threshold). Assume $\ell_1, \ldots, \ell_m$ are deterministic and $|\mathbb{E}[S_i(y, \ell_i) - S_i(y, \hat{\ell}_i)]| \leq \epsilon$. Then the bias of $\hat{v}_{\mathrm{DR}}^{\eta}(y)$ can be bounded by $\epsilon \sum_i \mathbf{1}(\pi_i^{-1} > \eta)/m$, and the variance of $\hat{v}_{\mathrm{DR}}^{\eta}(y)$ can be bounded by $\epsilon^2 \eta^2 / m$. Recall that the MSE can be decomposed into $\mathrm{Bias}^2 + \mathrm{Var}$. The choice of $\eta$ that minimizes this upper bound is the one that minimizes $|\eta - \sum_i \mathbf{1}(\pi_i^{-1} > \eta)/\sqrt{m}|$. This can be found numerically in time $O(m \log m)$ by sorting $[\pi_1^{-1}, \ldots, \pi_m^{-1}]$ and applying binary search.

6. Experiments

In this section, we report our experimental results, including both synthetic and real-world experiments.
6.1 Synthetic Experiments

First we describe our experimental settings and then move to the synthetic experiment results.

6.1.1 Experimental Settings

We use supervised classification datasets for the synthetic experiments. In the absence of a labeling matrix, we follow the workflow shown in Figure 1 to generate crowdsourcing datasets via worker imitation, which has three steps.

1. We start with the raw dataset in step 1. If the dataset was split into training and test parts, we combine them.

2. In step 2, we uniformly sample the dataset into two equal parts, one for training and the other for testing. In order to remove the randomness of this splitting process, the sampling index is fixed and saved for all further experiments. "Training" means we use this part of the data to train $m$ decision trees to simulate the generating process of crowdsourcing labels. For each item, only $\sqrt{d}$ features, a subset of all $d$ features, can be observed by each tree, where $d$ is the total number of features. These $m$ decision trees are then used to make predictions on the test set to obtain the item label matrix.

3. In step 3, we uniformly sample the test part of step 2 into two equal parts, one serving as the source part and the other as the target part. For all experiments, this sampling process is repeated 20 times. Also, $m$ classifiers, one for each worker, are trained on the source dataset to simulate worker behaviors and provide surrogate labels for all DRC approaches.

For all synthetic experiments, $m = 50$. Evaluations are performed on the target label matrix. We use five classification datasets, Segment, Satimage, Usps (Hull, 1994), Pendigits, and Mnist (LeCun et al., 1998), collected by Libsvm (Chang and Lin, 2011), which are all publicly available.

Figure 1: Workflow of generating crowdsourcing datasets with item features.
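Step 2 of the workflow might be sketched as follows (a minimal numpy-only illustration of our own: a one-feature median-split "stump" stands in for the depth-3 decision trees used in the paper, and each simulated worker only sees a random $\sqrt{d}$-subset of the features):

```python
import numpy as np

def simulate_workers(X_train, y_train, X_test, m, rng):
    """Return an m x n_test annotation matrix from m simulated workers."""
    n, d = X_train.shape
    s = max(1, int(np.sqrt(d)))
    classes = np.unique(y_train)

    def majority(ys):
        # most frequent class among ys (first class if the slice is empty)
        if len(ys) == 0:
            return classes[0]
        return classes[np.argmax([(ys == c).sum() for c in classes])]

    L = np.empty((m, X_test.shape[0]), dtype=int)
    for i in range(m):
        feats = rng.choice(d, size=s, replace=False)  # worker i's visible features
        f = feats[0]                                  # split on one of them
        thr = np.median(X_train[:, f])
        pred_lo = majority(y_train[X_train[:, f] <= thr])
        pred_hi = majority(y_train[X_train[:, f] > thr])
        L[i] = np.where(X_test[:, f] <= thr, pred_lo, pred_hi)
    return L
```

Because each simulated worker sees a different feature subset, the resulting annotation matrix exhibits heterogeneous, correlated-with-truth noise, which is the property the synthetic benchmark needs.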
Table 2 shows the statistics of the test dataset and item label matrix from Step 2.

Table 2: Statistics of synthetic datasets.

Dataset     # items   # workers   # dimensions   # classes
Segment       1,155        50           19            7
Satimage      3,217        50           36            6
Usps          4,649        50          256           10
Pendigits     5,496        50           16           10
Mnist        35,000        50          780           10

All experimental results, in figures or tables, are reported over 20 repetitions with a 95% asymptotic confidence interval of the expected accuracy based on inverting Wald's test, that is,
\[
\mu \pm 1.96\,\sigma/\sqrt{20}, \tag{14}
\]
where μ and σ are the mean and standard deviation of the accuracy. Based on Wald's test, statistical conclusions can be made with 95% confidence.

6.1.2 Algorithm Comparison

To show that our DRC approach can infer true labels at low worker cost, we compare DRC with Importance Sampling (IS). In particular, we run experiments with both the Dawid-Skene (DS) model and the Majority Voting (MV) model; that is, we compare DRC-DS, DS, DRC-MV, and MV. Moreover, we include the Direct Method (DM) and Classifiers trained on Inferred labels (CI) as baselines. CI means classifiers are trained with labels inferred from the source part and then make predictions for the target part, thus incurring no labeling cost. Concretely, we perform Poisson sampling over workers with π going from 0.1 to 1.0 at intervals of 0.1. Decision trees with maximum depth 3 are used to generate surrogate labels for DRC-DS, DRC-MV, and DM. Results are shown in Figure 2, from which we make three observations.
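The confidence interval of Equation (14) is straightforward to compute; a minimal sketch (the function name and `z` default are ours, with σ taken as the sample standard deviation across the 20 repetitions):

```python
import numpy as np

def wald_ci(acc, z=1.96):
    """95% asymptotic CI for the expected accuracy over repeated runs (Eq. 14):
    mu +/- z * sigma / sqrt(n), with sigma the sample std across runs."""
    acc = np.asarray(acc, dtype=float)
    mu = acc.mean()
    half = z * acc.std(ddof=1) / np.sqrt(len(acc))
    return mu - half, mu + half
```

With z = 1.96, two methods are declared statistically different when their intervals do not overlap.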
Figure 2: Performances of DRC-DS, DRC-MV, DS, MV, and DM (accuracy vs. worker sampling rate on the five synthetic datasets).

• Given the same worker sampling rate, DRC-DS outperforms DS and DRC-MV works better than MV on all datasets, which shows the effectiveness of DRC and matches our theoretical understanding. Moreover, our DRC approaches work well with very few workers and perform better as the sampling rate increases.

• Because CI and DM involve no worker cost, they appear as single points at π = 0 in the figures. Due to the high bias it incurs, DM performs unstably across datasets: it works well in some cases but poorly on Satimage and Usps.

• On all datasets except Satimage, DS performs better than MV. However, the comparison between DS and MV is outside the scope of this paper; we focus on improving existing approaches with the DRC method.

6.1.3 Effectiveness of AIS and AWS

To show the effectiveness of AIS and AWS, we compare four methods: DRC-DS, DRC-AIS, DRC-AWS, and DRC-AWS-AIS, with DM as the baseline. The AWS and AIS rules are expected to save substantial worker cost, so we plot logarithmic worker cost, where cost is defined as the number of workers used per item. For DRC-AIS and DRC-AWS-AIS, the confidence margin parameter ρ is set to 0.03, 0.06, and 0.09.
For DRC-AWS and DRC-AWS-AIS, the multiplier parameter λ is set to 1, 2, 3, 4, 5, 7, 10, 15, 25, and 50. With the dashed black lines denoting the performance of DM, results are shown in Figure 3, from which we make five observations:

• Compared with DRC-DS, all AIS/AWS approaches save labeling cost while maintaining almost the same accuracy, which confirms that AIS and AWS play key roles in improving inference accuracy and saving worker cost at the same time.

• Among the four methods, DRC-AWS-AIS enjoys the lowest worker cost, which shows that AIS and AWS can work together.

• There is an accuracy-cost tradeoff for all approaches: every method can be improved by spending more worker cost, that is, by using greater ρ or λ.

Figure 3: Performances of DRC-DS, DRC-AIS, DRC-AWS, and DRC-AWS-AIS (accuracy vs. log worker cost on the five synthetic datasets).

• DRC-AWS and DRC-DS share the same performance pattern, as do DRC-AWS-AIS and DRC-AIS.
Specifically, because DM is applied in the first stage of the algorithms, the performances of DRC-AIS and DRC-AWS-AIS start from the DM performance when the worker cost is very small.

• No statistical conclusion can be made on the Segment dataset due to the large error bars.

6.1.4 Ablation Study on Model Misspecification

As mentioned in the experimental settings, decision trees are used to generate the crowdsourcing datasets with item features; in real-world tasks, however, we have no knowledge of the label generating process. Therefore, model misspecification must be studied. Concretely, Decision Trees (DT), Logistic Regression (LR), and Gaussian Naive Bayes (GNB) are used as the supervised classifiers for DRC-DS, DRC-MV, and DM, with π = 0.5 over the source label matrix. All parameters of DT, LR, and GNB are left at the sklearn defaults. Other experimental settings are the same as before. In Table 3, statistically better results based on Wald's test are set in bold. Decision trees perform only slightly better than the other two classifiers, which is good news: it shows our approaches work without knowing the label generating process, which is very likely unknown in real crowdsourcing problems.

Table 3: Performances of DRC-DS, DRC-MV, and DM with different supervised classifiers.

Dataset    Method    DT                LR                GNB
Segment    DRC-DS    0.7385 ± 0.0065   0.7425 ± 0.0063   0.7386 ± 0.0066
           DRC-MV    0.7731 ± 0.0053   0.7698 ± 0.0064   0.7677 ± 0.0053
           DM        0.7419 ± 0.0061   0.7788 ± 0.0084   0.6969 ± 0.0084
Satimage   DRC-DS    0.8483 ± 0.0021   0.8452 ± 0.0028   0.8421 ± 0.0034
           DRC-MV    0.8470 ± 0.0025   0.8416 ± 0.0029   0.8415 ± 0.0030
           DM        0.8478 ± 0.0016   0.8270 ± 0.0024   0.7940 ± 0.0033
Usps       DRC-DS    0.8323 ± 0.0024   0.8293 ± 0.0020   0.8260 ± 0.0017
           DRC-MV    0.8067 ± 0.0023   0.8022 ± 0.0029   0.7963 ± 0.0018
           DM        0.8315 ± 0.0021   0.8374 ± 0.0022   0.7740 ± 0.0034
Pendigits  DRC-DS    0.8196 ± 0.0021   0.8156 ± 0.0018   0.8156 ± 0.0020
           DRC-MV    0.7359 ± 0.0018   0.7274 ± 0.0027   0.7266 ± 0.0023
           DM        0.8193 ± 0.0020   0.8259 ± 0.0024   0.8223 ± 0.0017
Mnist      DRC-DS    0.6470 ± 0.0020   0.6426 ± 0.0022   0.5939 ± 0.0017
           DRC-MV    0.5317 ± 0.0013   0.5240 ± 0.0015   0.4997 ± 0.0011
           DM        0.6471 ± 0.0019   0.6474 ± 0.0021   0.4240 ± 0.0035

6.2 Real-world Experiments

Following the synthetic experiments, we run the algorithm comparison and study the effectiveness of AIS and AWS on real-world data.

6.2.1 Algorithm Comparison

We run experiments on three real-world datasets: Music Genre (Rodrigues et al., 2013), Dog (Zhou et al., 2012), and Rotten Tomatoes (Rodrigues et al., 2013). Because the label matrices are now given, we start from the test dataset and item label matrix of Step 2 in Figure 1. Statistics of the real-world datasets are shown in Table 4.

Table 4: Statistics of real-world datasets.

Dataset   # items   # workers   # dimensions   # classes
Music        700        44          124            10
Dog          798       109        5,376             4
Tomato     4,999       203        1,200             2

Because workers provide extremely few labels, two settings differ from the synthetic experiments. First, only workers providing more than 40% of the labels are modeled, and the depth of the decision trees is set to 100 to ensure the quality of the surrogate labels. Second, worker sampling is conducted over the existing labels given by each worker, and the sampling rate goes from 0.1 to 0.5 at intervals of 0.1.

Table 5: Performances of the compared algorithms on real-world datasets. Based on Wald's test, statistically better results are shown in bold.

Music
  π = 0.0:  DM 0.2286 ± 0.0134,  CI 0.2734 ± 0.0124
  π     MV                DRC-MV            DS                DRC-DS
  0.1   0.2557 ± 0.0087   0.3449 ± 0.0097   0.2166 ± 0.0080   0.2931 ± 0.0113
  0.2   0.3610 ± 0.0108   0.4181 ± 0.0110   0.3006 ± 0.0102   0.3300 ± 0.0075
  0.3   0.4493 ± 0.0105   0.4824 ± 0.0088   0.3559 ± 0.0103   0.3716 ± 0.0107
  0.4   0.5097 ± 0.0091   0.5247 ± 0.0073   0.3926 ± 0.0108   0.3951 ± 0.0104
  0.5   0.5494 ± 0.0123   0.5601 ± 0.0085   0.4236 ± 0.0127   0.4159 ± 0.0109
Dog
  π = 0.0:  DM 0.3637 ± 0.0086,  CI 0.4115 ± 0.0102
  π     MV                DRC-MV            DS                DRC-DS
  0.1   0.5481 ± 0.0092   0.5860 ± 0.0068   0.5703 ± 0.0098   0.5904 ± 0.0072
  0.2   0.6644 ± 0.0121   0.6792 ± 0.0091   0.6875 ± 0.0080   0.6919 ± 0.0066
  0.3   0.7238 ± 0.0080   0.7286 ± 0.0066   0.7450 ± 0.0078   0.7431 ± 0.0084
  0.4   0.7538 ± 0.0073   0.7593 ± 0.0074   0.7786 ± 0.0061   0.7741 ± 0.0077
  0.5   0.7749 ± 0.0064   0.7746 ± 0.0069   0.7906 ± 0.0069   0.7872 ± 0.0059
Tomato
  π = 0.0:  DM 0.5196 ± 0.0041,  CI 0.5373 ± 0.0036
  π     MV                DRC-MV            DS                DRC-DS
  0.1   0.6263 ± 0.0051   0.6368 ± 0.0037   0.6311 ± 0.0033   0.6425 ± 0.0037
  0.2   0.7134 ± 0.0034   0.7169 ± 0.0041   0.7285 ± 0.0035   0.7329 ± 0.0038
  0.3   0.7637 ± 0.0040   0.7663 ± 0.0039   0.7925 ± 0.0027   0.7924 ± 0.0034
  0.4   0.8025 ± 0.0030   0.8046 ± 0.0031   0.8374 ± 0.0021   0.8361 ± 0.0026
  0.5   0.8294 ± 0.0036   0.8295 ± 0.0035   0.8666 ± 0.0030   0.8629 ± 0.0030

For the algorithm comparison, results are shown in Table 5, from which we make four observations:

• Given the same worker sampling rate, DRC-MV works better than MV and DRC-DS works better than DS, especially at low sampling rates. This shows that our DRC approaches work in practice.

• As the sampling rate increases, the performance of the DRC approaches converges to that of the non-DRC approaches, which matches our theoretical understanding.

• There are large improvements on the Music and Dog datasets but only small improvements on the Tomato dataset, potentially because its number of classes is too small.
In other words, binary classification on the Tomato dataset remains an easy task for workers. By contrast, the results show our DRC approach is able to help in difficult multi-class classification problems.

• Comparing the two zero-cost methods, CI performs better than DM, which shows the effectiveness of supervised learners.

6.2.2 Effectiveness of AIS and AWS

Figure 4: Performances of DRC-AIS, DRC-AWS, DRC-AWS-AIS, and DRC-DS on real-world datasets (accuracy vs. log worker cost).

Similar to the synthetic experiments, we compare DRC-AWS-AIS, DRC-AIS, DRC-AWS, and DRC-DS on the real-world datasets. λ is set to 1, 2, 4, 7, and 10, and ρ is set as shown in Figure 4. Other settings remain the same as above. There are four observations from the results in Figure 4:

• On the Music dataset, DRC-AIS-0.01 enjoys the lowest worker cost, but with poor accuracy. DRC-AWS performs better than DRC-DS in both cost and accuracy, which shows the effectiveness of our weight-clipping technique.

• On the Dog dataset, DRC-AWS-AIS enjoys the lowest worker cost, while performance increases quickly with DRC-AIS and DRC-DS.

• On the Tomato dataset, it is hard to break the accuracy-cost tradeoff using AIS or AWS, as the DRC-DS performance increases rapidly with more workers.

• There is an accuracy-cost tradeoff for all approaches on all datasets. For practical use, we suggest setting ρ between 0.0001 and 0.03, small for easy tasks such as binary classification and larger for multi-class classification. We suggest setting λ between 1 and c, where c is the average number of labels received per item: small λ leads to low cost and low accuracy, while large λ yields high cost and high accuracy.

7. Conclusion

We formulate crowdsourcing as a statistical estimation problem and propose a new approach, DRC, that combines worker imitation with doubly robust estimation. DRC can work with any base model, such as the Dawid-Skene model or majority voting, and improve its performance. With adaptive item/worker selection, our proposed approaches achieve nearly the same accuracy as using all workers but at much lower worker cost. Many problems are worth exploring in the future. Since item features are helpful for crowdsourcing, worker features could be taken into consideration as well. Handling new workers joining a project also needs special consideration.

Acknowledgments

The work is supported by an Adobe Data Science Award and a start-up grant made by the UCSB Department of Computer Science. The authors thank the anonymous reviewers and the associate editor for useful feedback.

References

Heejung Bang and James M Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.

Steve Branson, Grant Van Horn, and Pietro Perona. Lean crowdsourcing: Combining humans and machines in an online system. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7474–7483, 2017.

Chih-Chung Chang and Chih-Jen Lin. Libsvm: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27, 2011.

Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the em algorithm. Applied Statistics, 28(1):20–28, 1979.

Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. Achieving human parity on automatic chinese to english news translation. arXiv preprint arXiv:1803.05567, 2018.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

Jonathan J Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.

Hideaki Imamura, Issei Sato, and Masashi Sugiyama. Analysis of minimax error rate for crowdsourcing and its application to worker clustering model. In Proceedings of the 35th International Conference on Machine Learning, pages 2147–2156, 2018.

Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 652–661, 2016.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323, 2013.

Ashish Khetan, Zachary C Lipton, and Animashree Anandkumar. Learning from noisy singly-labeled data. In Proceedings of the International Conference on Learning Representations, 2018.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT Press, 2012.

Jungseul Ok, Sewoong Oh, Yunhun Jang, Jinwoo Shin, and Yung Yi. Iterative bayesian learning for crowdsourced regression. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 1486–1495, 2019.

Filipe Rodrigues, Francisco Pereira, and Bernardete Ribeiro. Learning from multiple annotators: distinguishing good from random labelers. Pattern Recognition Letters, 34(12):1428–1436, 2013.

Andrea Rotnitzky and James M Robins. Semiparametric regression estimation in the presence of dependent censoring. Biometrika, 82(4):805–820, 1995.

Carl-Erik Särndal, Bengt Swensson, and Jan Wretman. Model assisted survey sampling. Springer Science & Business Media, 2003.

Nihar B Shah, Sivaraman Balakrishnan, and Martin J Wainwright. A permutation-based model for crowd labeling: Optimal estimation and robustness. IEEE Transactions on Information Theory, 67(6):4162–4184, 2020.

Victor S Sheng, Foster Provost, and Panagiotis G Ipeirotis. Get another label? improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 614–622, 2008.

Grant Van Horn, Steve Branson, Scott Loarie, Serge Belongie, and Pietro Perona. Lean multiclass crowdsourcing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2714–2723, 2018.

Chong Wang, Xi Chen, Alexander J Smola, and Eric P Xing. Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems 26, pages 181–189, 2013.

Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudik. Optimal and adaptive off-policy evaluation in contextual bandits. In Proceedings of the 34th International Conference on Machine Learning, pages 3589–3597, 2017.

Peter Welinder, Steve Branson, Pietro Perona, and Serge J Belongie. The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems 23, pages 2424–2432, 2010.

Wayne Xiong, Lingfeng Wu, Fil Alleva, Jasha Droppo, Xuedong Huang, and Andreas Stolcke. The microsoft 2017 conversational speech recognition system. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5934–5938, 2018.

Yuchen Zhang, Xi Chen, Dengyong Zhou, and Michael I Jordan. Spectral methods meet em: A provably optimal algorithm for crowdsourcing. Journal of Machine Learning Research, 17(1):3537–3580, 2016.

Dengyong Zhou, Sumit Basu, Yi Mao, and John C. Platt. Learning from the wisdom of crowds by minimax entropy. In Advances in Neural Information Processing Systems 25, pages 2195–2203, 2012.

Dengyong Zhou, Qiang Liu, John C Platt, Christopher Meek, and Nihar B Shah. Regularized minimax conditional entropy for crowdsourcing. arXiv preprint, 2015.