Analysis of Stopping Active Learning based on Stabilizing Predictions

Michael Bloodgood
Center for Advanced Study of Language, University of Maryland, College Park, MD 20740
meb@umd.edu

John Grothendieck
Raytheon BBN Technologies, 9861 Broken Land Parkway, Suite 400, Columbia, MD 21046
jgrothen@bbn.com

Abstract

Within the natural language processing (NLP) community, active learning has been widely investigated and applied in order to alleviate the annotation bottleneck faced by developers of new NLP systems and technologies. This paper presents the first theoretical analysis of stopping active learning based on stabilizing predictions (SP). The analysis has revealed three elements that are central to the success of the SP method: (1) bounds on Cohen's Kappa agreement between successively trained models impose bounds on differences in F-measure performance of the models; (2) since the stop set does not have to be labeled, it can be made large in practice, helping to guarantee that the results transfer to previously unseen streams of examples at test/application time; and (3) good (low variance) sample estimates of Kappa between successive models can be obtained. Proofs of relationships between the level of Kappa agreement and the difference in performance between consecutive models are presented. Specifically, if the Kappa agreement between two models exceeds a threshold T (where T > 0), then the difference in F-measure performance between those models is bounded above by \frac{4(1-T)}{T} in all cases. If the precision of the positive conjunction of the models is assumed to be p, then the bound can be tightened to \frac{4(1-T)}{(p+1)T}.
1 Introduction

Active learning (AL), also called query learning and selective sampling, is an approach to reducing the costs of creating training data that has received considerable interest (e.g., (Argamon-Engelson and Dagan, 1999; Baldridge and Osborne, 2008; Bloodgood and Vijay-Shanker, 2009b; Bloodgood and Callison-Burch, 2010; Hachey et al., 2005; Haertel et al., 2008; Haffari and Sarkar, 2009; Hwa, 2000; Lewis and Gale, 1994; Sassano, 2002; Settles and Craven, 2008; Shen et al., 2004; Thompson et al., 1999; Tomanek et al., 2007; Zhu and Hovy, 2007)). Within the NLP community, active learning has been widely investigated and applied in order to alleviate the annotation bottleneck faced by developers of new NLP systems and technologies. The main idea is that by judiciously selecting which examples to have labeled, annotation effort will be focused on the most helpful examples, and less annotation effort will be required to achieve given levels of performance than if a passive learning policy had been used.

Historically, the problem of developing methods for detecting when to stop AL was tabled for future work, and the research literature focused on how to select which examples to have labeled and on analyzing the selection methods (Cohn et al., 1996; Seung et al., 1992; Freund et al., 1997; Roy and McCallum, 2001). However, to realize the savings in annotation effort that AL enables, we must have a method for knowing when to stop the annotation process. The challenge is that if we stop too early, while useful generalizations are still being made, then we can wind up with a model that performs poorly; but if we stop too late, after all the useful generalizations have been made, then human annotation effort is wasted and the benefits of using active learning are lost.
Recently, research has begun to develop methods for stopping AL (Schohn and Cohn, 2000; Ertekin et al., 2007b; Ertekin et al., 2007a; Zhu and Hovy, 2007; Laws and Schütze, 2008; Zhu et al., 2008a; Zhu et al., 2008b; Vlachos, 2008; Bloodgood, 2009; Bloodgood and Vijay-Shanker, 2009a; Ghayoomi, 2010). The methods are all heuristics based on estimates of model confidence, error, or stability. Although these heuristic methods have appealing intuitions and have had experimental success on a small handful of tasks and datasets, the methods are not widely usable in practice yet because our community's understanding of the stopping methods remains too coarse and inexact. Pushing forward on understanding the mechanics of stopping at a more exact level is therefore crucial for achieving the design of widely usable, effective stopping criteria.

[This paper was published in the Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 10-19, Sofia, Bulgaria, August 8-9 2013. (c) 2013 Association for Computational Linguistics.]

Bloodgood and Vijay-Shanker (2009a) introduce the terminology aggressive and conservative to describe the behavior of stopping methods[1] and conduct an empirical evaluation of the different published stopping methods on several datasets. While most stopping methods tend to behave conservatively, stopping based on stabilizing predictions computed via inter-model Kappa agreement has been shown to be consistently aggressive without losing performance (in terms of F-measure[2]) in several published empirical tests. This method stops when the Kappa agreement between consecutively learned models during AL exceeds a threshold for three consecutive iterations of AL. Although this is an intuitive heuristic that has performed well in published experimental results, there has not been any theoretical analysis of the method.
The current paper presents the first theoretical analysis of stopping based on stabilizing predictions. The analysis helps to explain at a deeper and more exact level why the method works as it does. The results of the analysis help to characterize classes of problems where the method can be expected to work well and where (unmodified) it will not be expected to work as well. The theory is suggestive of modifications to improve the robustness of the stopping method for certain classes of problems. And perhaps most important, the approach that we use in our analysis provides an enabling framework for more precise analysis of stopping criteria and possibly other parts of the active learning decision space.

In addition, the information presented in this paper is useful for works that consider switching between different active learning strategies and operating regions, such as (Baram et al., 2004; Dönmez et al., 2007; Roth and Small, 2008). Knowing when to switch strategies, for example, is similar to the stopping problem and is another setting where detailed understanding of the variance of stabilization estimates and their link to performance ramifications is useful. More exact understanding of the mechanics of stopping is also useful for applications of co-training (Blum and Mitchell, 1998), and agreement-based co-training (Clark et al., 2003) in particular.

[1] Aggressive methods stop sooner, aggressively trying to reduce unnecessary annotations, while conservative methods are careful not to risk losing model performance, even if it means annotating many more examples than were necessary.
[2] For the rest of this paper, we will use F-measure to denote F1-measure, that is, the balanced harmonic mean of precision and recall, which is a standard metric used to evaluate NLP systems.
Finally, the proofs of the Theorems regarding the relationships between Cohen's Kappa statistic and F-measure may be of broader use in works that consider inter-annotator agreement and its ramifications for performance appraisals, a topic that has been of long-standing interest in computational linguistics (Carletta, 1996; Artstein and Poesio, 2008).

In the next section we summarize the stabilizing predictions (SP) stopping method. Section 3 analyzes SP and Section 4 concludes.

2 Stopping Active Learning based on Stabilizing Predictions

The intuition behind the SP method is that the models learned during AL can be applied to a large representative set of unlabeled data, called a stop set, and when consecutively learned models have high agreement on their predictions for classifying the examples in the stop set, this indicates that it is time to stop (Bloodgood and Vijay-Shanker, 2009a; Bloodgood, 2009). The active learning stopping strategy explicitly examined in (Bloodgood and Vijay-Shanker, 2009a) (after the general form is discussed) is to calculate Cohen's Kappa agreement statistic between consecutive rounds of active learning and stop once it is above 0.99 for three consecutive calculations.

Since the Kappa statistic is an important aspect of this method, we now discuss some background regarding measuring agreement in general, and Cohen's Kappa in particular. Measurement of agreement between human annotators has received significant attention, and in that context the drawbacks of using percentage agreement have been recognized (Artstein and Poesio, 2008). Alternative metrics have been proposed that take chance agreement into account. Artstein and Poesio (2008) survey several agreement metrics. Most of the agreement metrics they discuss are of the form:

    agreement = \frac{A_o - A_e}{1 - A_e},   (1)

where A_o = observed agreement, and A_e = agreement expected by chance. The different metrics differ in how they compute A_e.
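To make the general form concrete, here is a minimal Python sketch of Equation 1 together with the SP stopping rule described above. The function names are our own illustrative choices, not code from the paper, and the chance model shown (each coder's own marginal distribution over the categories) is the one Cohen's Kappa uses, which is discussed next:

```python
from collections import Counter

def chance_corrected_agreement(a_o, a_e):
    """General form of Equation 1: agreement = (A_o - A_e) / (1 - A_e)."""
    return (a_o - a_e) / (1 - a_e)

def kappa(preds_prev, preds_curr):
    """Cohen's Kappa between two models' classifications of the stop set.

    A_e is estimated from each model's own marginal distribution over
    the categories {+1, -1}.
    """
    n = len(preds_prev)
    a_o = sum(x == y for x, y in zip(preds_prev, preds_curr)) / n
    marg_prev, marg_curr = Counter(preds_prev), Counter(preds_curr)
    a_e = sum((marg_prev[k] / n) * (marg_curr[k] / n) for k in (+1, -1))
    return chance_corrected_agreement(a_o, a_e)

def sp_should_stop(kappa_history, threshold=0.99, window=3):
    """SP rule: stop once Kappa between consecutively learned models has
    exceeded the threshold for `window` consecutive AL iterations."""
    return (len(kappa_history) >= window
            and all(k > threshold for k in kappa_history[-window:]))
```

For instance, two models that always agree get Kappa 1, while two models whose observed agreement matches what their marginals predict by chance get Kappa 0.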
All the instances of usage of an agreement metric in this article will have two categories and two coders. The two categories are "+1" and "-1", and the two coders are the two consecutive models for which agreement is being measured.

Cohen's Kappa statistic[3] (Cohen, 1960) measures agreement expected by chance by modeling each coder (in our case, model) with a separate distribution governing its likelihood of assigning a particular category. Formally, Kappa is defined by Equation 1 with A_e computed as follows:

    A_e = \sum_{k \in \{+1,-1\}} P(k | c_1) \cdot P(k | c_2),   (2)

where each c_i is one of the coders (in our case, models), and P(k | c_i) is the probability that coder (model) c_i labels an instance as being in category k. Kappa estimates the P(k | c_i) in Equation 2 based on the proportion of observed instances that coder (model) c_i labeled as being in category k.

3 Analysis

This section analyzes the SP stopping method. Section 3.1 analyzes the variance of the estimator of Kappa that SP uses, and in particular the relationship of this variance to specific aspects of the operationalization of SP, such as the stop set size. Section 3.2 analyzes relationships between the Kappa agreement between two models and the difference in F-measure between those two models.

3.1 Variance of Kappa Estimator

SP bases its decision to stop on the information contained in the contingency tables between the classifications of models learned at consecutive iterations during AL. In determining whether to stop at iteration t, the classifications of the current model M_t are compared with the classifications of the previous model M_{t-1}. Table 1 shows the population parameters for these two models, where:

[3] We note that there are other agreement measures (beyond Cohen's Kappa) which could also be applicable to stopping based on stabilizing predictions, but an analysis of these is outside the scope of the current paper.
                         M_t
                    +        -        Total
    M_{t-1}  +     π_{++}   π_{+-}   π_{+.}
             -     π_{-+}   π_{--}   π_{-.}
         Total     π_{.+}   π_{.-}   1

Table 1: Contingency table population probabilities for M_t (model learned at iteration t) and M_{t-1} (model learned at iteration t-1).

population probability π_{ij} for i, j ∈ {+, -} is the probability of an example being placed in category i by model M_{t-1} and category j by model M_t; population probability π_{.j} for j ∈ {+, -} is the probability of an example being placed in category j by model M_t; and population probability π_{i.} for i ∈ {+, -} is the probability of an example being placed in category i by model M_{t-1}. The actual probability of agreement is π_o = π_{++} + π_{--}. As indicated in Equation 2, Kappa models the probability of agreement expected due to chance by assuming that classifications are made independently. Hence, the probability of agreement expected by chance in terms of the population probabilities is π_e = π_{+.}π_{.+} + π_{-.}π_{.-}. From the definition of Kappa (see Equation 1), we then have that the Kappa parameter K in terms of the population probabilities is given by

    K = \frac{π_o - π_e}{1 - π_e}.   (3)

For practical applications we will not know the true population probabilities, and we will have to resort to using sample estimates. The SP method uses a stop set of size n for deriving its estimates. Table 2 shows the contingency table counts for the classifications of models M_t and M_{t-1} on a sample of size n. The population probabilities π_{ij} can be estimated by the relative frequencies p_{ij} for i, j ∈ {+, -, .}, where: p_{++} = a/n; p_{+-} = b/n; p_{-+} = c/n; p_{--} = d/n; p_{+.} = (a+b)/n; p_{-.} = (c+d)/n; p_{.+} = (a+c)/n; and p_{.-} = (b+d)/n. Let p_o = p_{++} + p_{--}, the observed proportion of agreement, and let p_e = p_{+.}p_{.+} + p_{-.}p_{.-}, the proportion of agreement expected by chance if we assume that M_t and M_{t-1} make their classifications independently.
Then the Kappa measure of agreement K between M_t and M_{t-1} (see Equation 3) is estimated by

    K̂ = \frac{p_o - p_e}{1 - p_e}.   (4)

                         M_t
                    +      -      Total
    M_{t-1}  +      a      b      a+b
             -      c      d      c+d
         Total     a+c    b+d     n

Table 2: Contingency table counts for M_t (model learned at iteration t) and M_{t-1} (model learned at iteration t-1).

Using the delta method, as described in (Bishop et al., 1975), Fleiss et al. (1969) derived an estimator of the large-sample variance of K̂. According to Hale and Fleiss (1993), the estimator simplifies to

    Var(K̂) = \frac{1}{n(1-p_e)^2} \Big\{ \sum_{i \in \{+,-\}} p_{ii}\,[1 - 4\bar{p}_i(1-K̂)] - (K̂ - p_e(1-K̂))^2 + (1-K̂)^2 \sum_{i,j \in \{+,-\}} p_{ij}\,[2(\bar{p}_i + \bar{p}_j) - (p_{i.} + p_{.j})]^2 \Big\},   (5)

where \bar{p}_i = (p_{i.} + p_{.i})/2.

From Equation 5, we can see that the variance of our estimate of Kappa is inversely proportional to the size of the stop set we use. Bloodgood and Vijay-Shanker (2009a) used a stop set of size 2000 for each of their datasets. Although this worked well in the results they reported, we do not believe that 2000 is a fixed size that will work well for all tasks and datasets where the SP method could be used. Table 3 shows the variances of K̂ computed using Equation 5 at the points at which SP stopped AL for each of the datasets[4] from (Bloodgood and Vijay-Shanker, 2009a). These variances indicate that the size of 2000 was typically sufficient to get tight estimates of Kappa, helping to illuminate the empirical success of the SP method on these datasets.

[4] We note that each of the datasets was set up as a binary classification task (or multiple binary classification tasks). Further details and descriptions of each of the datasets can be found in (Bloodgood and Vijay-Shanker, 2009a).

More generally, the SP method can be augmented with a variance check: if the variance of estimated Kappa at a potential stopping point exceeds some desired
threshold, then the stop set size can be increased as needed to reduce the variance.

Looking at Equation 5 again, one can note that when p_e is relatively close to 1, the variance of K̂ can be expected to get quite large. In these situations, users of SP should expect to have to use larger stop set sizes, and in extreme conditions, SP may not be an advisable method to use.

    Task-Dataset                       Variance of K̂
    NER-DNA (10-fold CV)               0.0000223
    NER-cellType (10-fold CV)          0.0000211
    NER-protein (10-fold CV)           0.0000074
    Reuters (10 Categories)            0.0000298
    20 Newsgroups (20 Categories)      0.0000739
    WebKB Student (10-fold CV)         0.0000137
    WebKB Project (10-fold CV)         0.0000190
    WebKB Faculty (10-fold CV)         0.0000115
    WebKB Course (10-fold CV)          0.0000179
    TC-spamassassin (10-fold CV)       0.0000042
    TC-TREC-SPAM (10-fold CV)          0.0000043
    Average (macro-avg)                0.0000209

Table 3: Estimates of the variance of K̂. For each dataset, the estimate of the variance of K̂ is computed (using Equation 5) from the contingency table at the point at which SP stopped AL, and the average of all the variances (across all folds of CV) is displayed. The last row contains the macro-average of the average variances for all the datasets.

3.2 Relationship between Kappa agreement and change in performance between models

Heretofore, the published literature contained only informal explanations of why stabilizing predictions is expected to work well as a stopping method (along with empirical tests demonstrating successful operation on a handful of tasks and datasets). In the remainder of this section we describe the mathematical foundations for stopping methods based on stabilizing predictions. In particular, we will prove that even in the worst possible case, if the Kappa agreement between two subsequently learned models is greater than a threshold T, then the change in performance between these two models is bounded above by \frac{4(1-T)}{T}. We then go on to prove additional Theorems that tighten this bound when assumptions are made about model precision.

Lemma 3.1 Suppose F-measure F and Kappa K are computed from the same contingency table of counts, such as the one given in Table 2. Suppose ad - bc ≥ 0. Then F ≥ K.

Proof By definition, in terms of the contingency table counts,

    K = \frac{2ad - 2bc}{(a+b)(b+d) + (a+c)(c+d)}   (6)

and

    F = \frac{2a}{2a+b+c}.   (7)

Rewriting F so that it will have the same numerator as K, we have:

    F = F \left( \frac{d - \frac{bc}{a}}{d - \frac{bc}{a}} \right)   (8)

      = \frac{2a}{2a+b+c} \left( \frac{d - \frac{bc}{a}}{d - \frac{bc}{a}} \right)   (9)

      = \frac{2ad - 2bc}{2ad + bd + cd - 2bc - \frac{b^2 c + bc^2}{a}}.   (10)

We can see that the expression for F in Equation 10 has the same numerator as K in Equation 6, but the denominator of K in Equation 6 is ≥ the denominator of F in Equation 10. Therefore, F ≥ K.

Theorem 3.2 Let M_t be the model learned at iteration t of active learning and M_{t-1} be the model learned at iteration t-1. Let K_t be the estimate of Kappa agreement between the classifications of M_t and M_{t-1} on the examples in the stop set. Let F̃_t be the F-measure between the classifications of M_t and truth on the stop set. Let F̃_{t-1} be the F-measure between the classifications of M_{t-1} and truth on the stop set. Let ΔF_t be F̃_t - F̃_{t-1}. Suppose T > 0. Then K_t > T ⇒ |ΔF_t| ≤ \frac{4(1-T)}{T}.

Proof Suppose M_t, M_{t-1}, K_t, F̃_t, F̃_{t-1}, ΔF_t, and T are defined as stated in the statement of Theorem 3.2. Let F_t be the F-measure between the classifications of M_t and M_{t-1} on the examples in the stop set. Let Table 2 show the contingency table counts for M_t versus M_{t-1} on the examples in the stop set.
Then, from their definitions, we have

    K_t = \frac{2(ad - bc)}{(a+b)(b+d) + (a+c)(c+d)}  and  F_t = \frac{2a}{2a+b+c}.

There exist true labels for the examples in the stop set, which we don't know since the stop set is unlabeled, but which nonetheless must exist. We use the truth on the stop set to split Table 2 into two subtables of counts: one table for all the examples that are truly positive and one table for all the examples that are truly negative. Table 4 shows the contingency table for M_t versus M_{t-1} for all of the examples in the stop set that have true labels of +1, and Table 5 shows the contingency table for M_t versus M_{t-1} for all of the examples in the stop set that have true labels of -1.

                         M_t
                    +          -          Total
    M_{t-1}  +     a_1        b_1        a_1+b_1
             -     c_1        d_1        c_1+d_1
         Total     a_1+c_1    b_1+d_1    n_1

Table 4: Contingency table counts for M_t (model learned at iteration t) versus M_{t-1} (model learned at iteration t-1) for only the examples in the stop set that have truth = +1.

                         M_t
                    +              -              Total
    M_{t-1}  +     a_{-1}         b_{-1}         a_{-1}+b_{-1}
             -     c_{-1}         d_{-1}         c_{-1}+d_{-1}
         Total     a_{-1}+c_{-1}  b_{-1}+d_{-1}  n_{-1}

Table 5: Contingency table counts for M_t (model learned at iteration t) versus M_{t-1} (model learned at iteration t-1) for only the examples in the stop set that have truth = -1.

From Tables 2, 4, and 5, one can see that a is the number of examples in the stop set that both M_t and M_{t-1} classified as positive. Furthermore, out of these a examples, a_1 of them truly are positive and a_{-1} of them truly are negative. Similar explanations hold for the other counts. Also, from Tables 2, 4, and 5, one can see that the equalities a = a_1 + a_{-1}, b = b_1 + b_{-1}, c = c_1 + c_{-1}, and d = d_1 + d_{-1} all hold. The contingency tables for M_t versus truth and M_{t-1} versus truth can be derived from Tables 4 and 5. For convenience, Table 6 shows the contingency table for M_t versus truth and Table 7 shows the contingency table for M_{t-1} versus truth.

                         M_t
                    +              -              Total
    Truth    +     a_1+c_1        b_1+d_1        n_1
             -     a_{-1}+c_{-1}  b_{-1}+d_{-1}  n_{-1}
         Total     a+c            b+d            n

Table 6: Contingency table counts for M_t (model learned at iteration t) versus truth. (Derived from Tables 4 and 5.)

                         M_{t-1}
                    +              -              Total
    Truth    +     a_1+b_1        c_1+d_1        n_1
             -     a_{-1}+b_{-1}  c_{-1}+d_{-1}  n_{-1}
         Total     a+b            c+d            n

Table 7: Contingency table counts for M_{t-1} (model learned at iteration t-1) versus truth. (Derived from Tables 4 and 5.)

Suppose that K_t > T. This implies, by Lemma 3.1,[5] that F_t > T. This implies that

    \frac{2a}{2a+b+c} > T   (11)
    ⇒ 2a > (2a+b+c)T   (12)
    ⇒ 2a(1-T) > (b+c)T   (13)
    ⇒ b+c < \frac{2a(1-T)}{T}.   (14)

Note that Equations 12 and 14 are justified since 2a+b+c > 0 and T > 0, respectively.

From Table 6 we can see that F̃_t = \frac{2(a_1+c_1)}{2(a_1+c_1)+b_1+d_1+a_{-1}+c_{-1}}; from Table 7 we can see that F̃_{t-1} = \frac{2(a_1+b_1)}{2(a_1+b_1)+c_1+d_1+a_{-1}+b_{-1}}. For notational convenience, let:

    g = 2(a_1+c_1) + b_1 + d_1 + a_{-1} + c_{-1}; and
    h = 2(a_1+b_1) + c_1 + d_1 + a_{-1} + b_{-1}.

It follows that

    ΔF_t = \frac{2(a_1+c_1)}{g} - \frac{2(a_1+b_1)}{h}   (15)
         = \frac{(2a_1+2c_1)h - (2a_1+2b_1)g}{gh}.   (16)

For notational convenience, let:

    x = 2(a_1 c_1 + a_1 b_{-1} + c_1^2 + c_1 d_1 + c_1 a_{-1} + c_1 b_{-1}); and
    y = 2(a_1 b_1 + a_1 c_{-1} + b_1^2 + b_1 d_1 + b_1 a_{-1} + b_1 c_{-1}).

Then, picking up from Equation 16, it follows that

    ΔF_t = \frac{x - y}{gh}   (17)
         = \frac{2[u_1 + c_1 u_2 - b_1 u_3]}{gh},   (18)

where u_1 = a_1 c_1 - a_1 b_1 + a_1 b_{-1} - a_1 c_{-1}, u_2 = c_1 + d_1 + a_{-1} + b_{-1}, and u_3 = b_1 + d_1 + a_{-1} + c_{-1}. For notational convenience, let: d_A = c_1 - b_1 and d_B = c_{-1} - b_{-1}.

[5] Note that the condition ad - bc ≥ 0 of Lemma 3.1 is met since K_t > T and T > 0 imply K_t > 0, which in turn implies ad - bc > 0.
Then it follows that

    ΔF_t = \frac{2u_4}{gh},   (19)

where u_4 = a_1(d_A - d_B) + d_A(d_1 + a_{-1} + b_1 + c_1) + c_1 b_{-1} - b_1 c_{-1}. Noting that g = h + d_A + d_B, we have

    ΔF_t = \frac{2u_4}{h(h + d_A + d_B)}.   (20)

Noting that 2u_4 = 2[d_A(a_1 + b_1 + c_1 + d_1 + a_{-1} + b_{-1}) - d_B(a_1 + b_1)] and letting u_5 = a_1 + b_1 + c_1 + d_1 + a_{-1} + b_{-1}, we have

    ΔF_t = \frac{2[d_A u_5 - d_B(a_1 + b_1)]}{h(h + d_A + d_B)}.   (21)

Therefore,

    |ΔF_t| ≤ 2\left(\frac{|d_A| u_5}{h(h + d_A + d_B)} + \frac{|d_B|(a_1 + b_1)}{h(h + d_A + d_B)}\right).   (22)

Recall that b + c = b_1 + b_{-1} + c_1 + c_{-1}. Then observe that the following three inequalities hold: b + c ≥ |d_A|; b + c ≥ |d_B|; and h(h + d_A + d_B) > 0. Therefore,

    |ΔF_t| ≤ \frac{2(b+c)[2a_1 + 2b_1 + c_1 + d_1 + a_{-1} + b_{-1}]}{h(h + d_A + d_B)}   (23)
           = \frac{2(b+c)h}{h(h + d_A + d_B)}   (24)
           = \frac{2(b+c)}{h + d_A + d_B}   (25)
           ≤ \frac{2(2a)(1-T)}{T(h + d_A + d_B)}   (26)
           = \frac{4(1-T)}{T} \cdot \frac{a}{h + d_A + d_B}.   (27)

Observe that h + d_A + d_B = 2a_1 + b_1 + 2c_1 + d_1 + a_{-1} + c_{-1}. Therefore, \frac{a}{h + d_A + d_B} ≤ 1. Therefore, we have

    |ΔF_t| ≤ \frac{4(1-T)}{T}.   (28)

Note that in deriving Inequality 26, we used the previously derived Inequality 14. Also, the proof of Theorem 3.2 assumes a worst possible case, in the sense that all examples where the classifications of M_t and M_{t-1} differ are assumed to have truth values that all serve to maximize one model's F-measure and minimize the other model's F-measure, so as to maximize |ΔF_t| as much as possible. A resulting limitation is that the bound is loose in many cases. It may be possible to derive tighter bounds, perhaps by easing off to an expected case instead of a worst case and/or by making additional assumptions.[6] Taking this possibility up, we now prove tighter bounds when assumptions about the precision of the models M_t and M_{t-1} are made.
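As a numerical sanity check on this analysis, the quantities in the proof can be computed directly from the truth-split contingency tables (Tables 4 and 5). The following Python sketch (function names are ours, purely illustrative) computes K_t from Equation 6 and the two F-measures from Tables 6 and 7, then verifies on randomly generated tables both the worst-case bound of Theorem 3.2 and the precision-tightened bound 4(1-T)/((p+1)T) proved below:

```python
import random

def f1(tp, fp, fn):
    """F1-measure from true-positive, false-positive, false-negative counts."""
    return 2 * tp / (2 * tp + fp + fn)

def check_case(a1, b1, c1, d1, an, bn, cn, dn, T):
    """Evaluate the bounds on one stop set, given the truth-split tables
    (Tables 4 and 5): a1..d1 come from the truly positive examples,
    an..dn from the truly negative ones. Returns (K_t, |dF_t|, loose
    bound 4(1-T)/T, tightened bound 4(1-T)/((p+1)T)), or None when a
    quantity is undefined (zero denominator)."""
    a, b, c, d = a1 + an, b1 + bn, c1 + cn, d1 + dn
    denom = (a + b) * (b + d) + (a + c) * (c + d)
    g = 2 * (a1 + c1) + b1 + d1 + an + cn   # denominator of F~_t
    h = 2 * (a1 + b1) + c1 + d1 + an + bn   # denominator of F~_{t-1}
    if denom == 0 or g == 0 or h == 0 or a == 0:
        return None
    k_t = 2 * (a * d - b * c) / denom                  # Equation 6
    f_t = f1(a1 + c1, an + cn, b1 + d1)                # M_t vs truth (Table 6)
    f_prev = f1(a1 + b1, an + bn, c1 + d1)             # M_{t-1} vs truth (Table 7)
    p = a1 / a               # precision of the positive-conjunction model
    return k_t, abs(f_t - f_prev), 4 * (1 - T) / T, 4 * (1 - T) / ((p + 1) * T)

def stress(trials=2000, T=0.7, seed=0):
    """Random search over diagonal-heavy tables; assert both bounds hold
    whenever K_t > T, and return how many such cases were found."""
    random.seed(seed)
    hits = 0
    for _ in range(trials):
        a1, dn = random.randint(5, 40), random.randint(5, 40)
        b1, c1, d1, an, bn, cn = (random.randint(0, 3) for _ in range(6))
        res = check_case(a1, b1, c1, d1, an, bn, cn, dn, T)
        if res is None or res[0] <= T:
            continue
        k_t, delta, loose, tight = res
        assert delta <= tight + 1e-9 and tight <= loose + 1e-9
        hits += 1
    return hits
```

Such a search cannot prove the Theorems, but it exercises the full chain of inequalities (Lemma 3.1 through Inequality 28) on many concrete tables.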
Consider that in the proof of Theorem 3.2, when transitioning from Equality 27 to Inequality 28, we used the fact that \frac{a}{h + d_A + d_B} ≤ 1. Note that \frac{a}{h + d_A + d_B} = \frac{a}{2a_1 + b_1 + 2c_1 + d_1 + a_{-1} + c_{-1}}, from which one sees that \frac{a}{h + d_A + d_B} = 1 only if a_1, b_1, c_1, d_1, and c_{-1} are all zero. This is a pathological case. In many practically important classes of cases, \frac{a}{h + d_A + d_B} will be strictly less than 1, and often substantially less than 1. The following two Theorems prove tighter bounds on |ΔF_t| than Theorem 3.2 by utilizing this insight.

Theorem 3.3 Suppose M_t, M_{t-1}, K_t, F̃_t, F̃_{t-1}, ΔF_t, and T are defined as stated in the statement of Theorem 3.2. Let the contingency tables be defined as they were in the proof of Theorem 3.2. Let M_{PositiveConjunction} be a model that only classifies an example as positive if both models M_t and M_{t-1} classify the example as positive. Suppose that M_{PositiveConjunction} has perfect precision on the stop set, or in other words, that every single example from the stop set that both M_t and M_{t-1} classify as positive is truthfully positive (i.e., a_{-1} = 0). Then K_t > T ⇒ |ΔF_t| ≤ \frac{2(1-T)}{T}.

Proof The proof of Theorem 3.2 holds exactly as it is up until Equality 27. Now, using the additional assumption that a_{-1} = 0, we have \frac{a}{h + d_A + d_B} ≤ \frac{1}{2}. Therefore, we have

    |ΔF_t| ≤ \frac{2(1-T)}{T}.   (29)

[6] If one is planning to undertake this challenge, we would suggest further consideration of Inequalities 22, 23, 26, and 28 as a possible starting point.

Theorem 3.3 is a special case (in the limit) of a more general Theorem. Before stating and proving the more general Theorem, we prove a Lemma that will be helpful in making the proof of the general Theorem clearer.

Lemma 3.4 Let h, d_A, d_B, and the contingency table counts be defined as they were in the proof of Theorem 3.2. Suppose a_1 = xa_{-1}.
Then \frac{a}{h + d_A + d_B} ≤ \frac{x+1}{2x+1}.

Proof a_1 = xa_{-1} by hypothesis. a = a_1 + a_{-1} by definition of the contingency table counts. Hence, a = (x+1)a_{-1}. Therefore (since h + d_A + d_B = 2a_1 + b_1 + 2c_1 + d_1 + a_{-1} + c_{-1} ≥ 2a_1 + a_{-1}),

    \frac{a}{h + d_A + d_B} ≤ \frac{(x+1)a_{-1}}{2xa_{-1} + a_{-1}}   (30)
                            = \frac{(x+1)a_{-1}}{(2x+1)a_{-1}} = \frac{x+1}{2x+1}.

The following Theorem generalizes Theorem 3.3 to cases when M_{PositiveConjunction} has precision p in (0, 1).[7]

Theorem 3.5 Suppose M_t, M_{t-1}, K_t, F̃_t, F̃_{t-1}, ΔF_t, and T are defined as stated in the statement of Theorem 3.2. Let the contingency tables be defined as they were in the proof of Theorem 3.2. Let M_{PositiveConjunction} be a model that only classifies an example as positive if both models M_t and M_{t-1} classify the example as positive. Suppose that M_{PositiveConjunction} has precision p on the stop set. Then K_t > T ⇒ |ΔF_t| ≤ \frac{4(1-T)}{(p+1)T}.

Proof The proof of Theorem 3.2 holds exactly as it is up until Equality 27. M_{PositiveConjunction} has precision p on the stop set ⇒ p = \frac{a_1}{a_1 + a_{-1}}. Solving for a_1 in terms of a_{-1}, we have a_1 = \frac{p}{1-p} a_{-1}. Therefore, applying Lemma 3.4 with x = \frac{p}{1-p}, we have \frac{a}{h + d_A + d_B} ≤ \frac{\frac{p}{1-p}+1}{\frac{2p}{1-p}+1}. Therefore, we have

    |ΔF_t| ≤ 4\left(\frac{\frac{p}{1-p}+1}{\frac{2p}{1-p}+1}\right)\frac{1-T}{T}   (31)
           = \frac{4(1-T)}{(p+1)T}.   (32)

[7] The case when p = 0 is handled by Theorem 3.2 and the case when p = 1 is handled by Theorem 3.3.

    Precision    1/(p+1) (to 3 decimal places)
    50%          0.667
    80%          0.556
    90%          0.526
    95%          0.513
    98%          0.505
    99%          0.503
    99.9%        0.500

Table 8: Values of the scaling factor from Theorem 3.5 for different precision values.

The scaling factor \frac{1}{p+1} in Theorem 3.5 shows how the precision of the conjunctive model affects the bound. Theorem 3.2 had the scaling factor implicitly set to 1 in order to handle the pathological case where the positive conjunctive model has precision = 0.
In Theorem 3.3, where the positive conjunctive model has precision = 1 on the examples in the stop set, the scaling factor is set to 1/2. Theorem 3.5 generalizes the scaling factor so that it is a function of the precision of the positive conjunctive model. For convenience, Table 8 shows the scaling factor values for a few different precision values.

The bounds in Theorems 3.2, 3.3, and 3.5 all bound the difference in performance on the stop set of two consecutively learned models M_t and M_{t-1}. An issue to consider is how connected the difference in performance on the stop set is to the difference in performance on a stream of application examples generated according to the population probabilities. Taking up this issue, consider that the proofs of Theorems 3.2, 3.3, and 3.5 would hold as they are if we had used sample proportions instead of sample counts (this can be seen by simply dividing every count by n, the size of the stop set). Since the stop set is unbiased (selected at random from the population), as n approaches infinity, the sample proportions will approach the population probabilities, and the gap between (i) the difference in performance between M_t and M_{t-1} on the stop set and (ii) that difference on a stream of application examples generated according to the population probabilities will approach zero.

4 Conclusions

To date, the work on stopping criteria has been dominated by heuristics based on intuitions and experimental success on a small handful of tasks and datasets. But the methods are not widely usable in practice yet because our community's understanding of the stopping methods remains too inexact. Pushing forward on understanding the mechanics of stopping at a more exact level is therefore crucial for achieving the design of widely usable, effective stopping criteria.

This paper presented the first theoretical analysis of stopping based on stabilizing predictions.
The analysis revealed three elements that are central to the SP method's success: (1) the sample estimates of Kappa have low variance; (2) Kappa has tight connections with differences in F-measure; and (3) since the stop set doesn't have to be labeled, it can be arbitrarily large, helping to guarantee that the results transfer to previously unseen streams of examples at test/application time.

We presented proofs of relationships between the level of Kappa agreement and the difference in performance between consecutive models. Specifically, if the Kappa agreement between two models exceeds a threshold T, then the difference in F-measure performance between those models is bounded above by \frac{4(1-T)}{T}. If the precision of the positive conjunction of the models is assumed to be p, then the bound can be tightened to \frac{4(1-T)}{(p+1)T}.

The setup and methodology of the proofs can serve as a launching pad for many further investigations, including: analyses of stopping; works that consider switching between different active learning strategies and operating regions; and works that consider stopping co-training, and especially agreement-based co-training. Finally, the relationships that have been exposed between the Kappa statistic and F-measure may be of broader use in works that consider inter-annotator agreement and its interplay with system evaluation, a topic that has been of long-standing interest.

References

Shlomo Argamon-Engelson and Ido Dagan. 1999. Committee-based sample selection for probabilistic classifiers. Journal of Artificial Intelligence Research (JAIR), 11:335-360.

Ron Artstein and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555-596.

Jason Baldridge and Miles Osborne. 2008. Active learning and logarithmic opinion pools for HPSG parse selection. Natural Language Engineering, 14(2):191-222.

Yoram Baram, Ran El-Yaniv, and Kobi Luz. 2004. Online choice of active learning algorithms. Journal of Machine Learning Research, 5:255-291, March.

Yvonne M. Bishop, Stephen E. Fienberg, and Paul W. Holland. 1975. Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge, MA.

Michael Bloodgood and Chris Callison-Burch. 2010. Bucking the trend: Large-scale cost-focused active learning for statistical machine translation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 854-864, Uppsala, Sweden, July. Association for Computational Linguistics.

Michael Bloodgood and K Vijay-Shanker. 2009a. A method for stopping active learning based on stabilizing predictions and the need for user-adjustable stopping. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 39-47, Boulder, Colorado, June. Association for Computational Linguistics.

Michael Bloodgood and K Vijay-Shanker. 2009b. Taking into account the differences between actively and passively acquired data: The case of active learning with support vector machines for imbalanced datasets. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 137-140, Boulder, Colorado, June. Association for Computational Linguistics.

Michael Bloodgood. 2009. Active learning with support vector machines for imbalanced datasets and a method for stopping active learning based on stabilizing predictions. Ph.D. thesis, University of Delaware, Newark, DE, USA.

Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In COLT'98: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92-100, New York, NY, USA. ACM.

J. Carletta. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249-254.
Stephen Clark, James Curran, and Miles Osborne. 2003. Bootstrapping pos-taggers using unlabelled data. In W alter Daelemans and Miles Osborne, editors, Pr oceedings of the Seventh Confer ence on Natural Language Learning at HLT -N AA CL 2003 , pages 49–55. J. Cohen. 1960. A coefficient of agreement for nom- inal scales. Educational and Psychological Mea- sur ement , 20:37–46. David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan. 1996. Acti ve learning with statistical mod- els. J ournal of Artificial Intelligence Researc h , 4:129–145. Meryem Pinar D ¨ onmez, Jaime G. Carbonell, and Paul N. Bennett. 2007. Dual strategy activ e learning. In Joost N. Kok, Jacek K oronacki, Ramon L ´ opez de M ´ antaras, Stan Matwin, Dunja Mladenic, and Andrzej Sko wron, editors, Mac hine Learning: ECML 2007, 18th Eur opean Conference on Machine Learning, W arsaw , P oland, September 17-21, 2007, Pr oceedings , volume 4701 of Lec- tur e Notes in Computer Science , pages 116–127. Springer . Seyda Ertekin, Jian Huang, L ´ eon Bottou, and C. Lee Giles. 2007a. Learning on the border: activ e learn- ing in imbalanced data classification. In M ´ ario J. Silva, Alberto H. F . Laender , Ricardo A. Baeza- Y ates, Deborah L. McGuinness, Bjørn Olstad, Øys- tein Haug Olsen, and Andr ´ e O. Falc ˜ ao, editors, Pr o- ceedings of the Sixteenth A CM Confer ence on Infor - mation and Knowledge Management, CIKM 2007, Lisbon, P ortugal, November 6-10, 2007 , pages 127– 136. A CM. Seyda Ertekin, Jian Huang, and C. Lee Giles. 2007b. Activ e learning for class imbalance problem. In W essel Kraaij, Arjen P . de Vries, Charles L. A. Clarke, Norbert Fuhr , and Noriko Kando, editors, SIGIR 2007: Pr oceedings of the 30th Annual Inter- national A CM SIGIR Confer ence on Researc h and Development in Information Retrieval, Amsterdam, The Netherlands, July 23-27, 2007 , pages 823–824. A CM. Joseph L. Fleiss, Jacob Cohen, and B. S. Ev eritt. 1969. Large sample standard errors of kappa and weighted kappa. 
Psychological Bulletin , 72(5):323 – 327. Y oav Freund, H. Sebastian Seung, Eli Shamir, and Naf- tali T ishby . 1997. Selectiv e sampling using the query by committee algorithm. Machine Learning , 28:133–168. Masood Ghayoomi. 2010. Using variance as a stop- ping criterion for activ e learning of frame assign- ment. In Proceedings of the NAA CL HL T 2010 W orkshop on Active Learning for Natural Langua ge Pr ocessing , pages 1–9, Los Angeles, California, June. Association for Computational Linguistics. Ben Hache y , Beatrice Alex, and Markus Becker . 2005. In vestigating the effects of selecti ve sampling on the annotation task. In Pr oceedings of the Ninth Confer- ence on Computational Natural Language Learning (CoNLL-2005) , pages 144–151, Ann Arbor , Michi- gan, June. Association for Computational Linguis- tics. Robbie Haertel, Eric Ringger , K evin Seppi, James Car- roll, and Peter McClanahan. 2008. Assessing the costs of sampling methods in activ e learning for an- notation. In Pr oceedings of A CL-08: HL T , Short P a- pers , pages 65–68, Columbus, Ohio, June. Associa- tion for Computational Linguistics. Gholamreza Haff ari and Anoop Sarkar . 2009. Activ e learning for multilingual statistical machine trans- lation. In Pr oceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th In- ternational Joint Conference on Natural Language Pr ocessing of the AFNLP , pages 181–189, Suntec, Singapore, August. Association for Computational Linguistics. Cecilia A. Hale and Joseph L. Fleiss. 1993. Interval es- timation under two study designs for kappa with bi- nary classifications. Biometrics , 49(2):pp. 523–534. Rebecca Hwa. 2000. Sample selection for statistical grammar induction. In Hinrich Sch ¨ utze and K eh- Y ih Su, editors, Pr oceedings of the 2000 Joint SIG- D AT Confer ence on Empirical Methods in Natural Language Pr ocessing , pages 45–53. Association for Computational Linguistics, Somerset, New Jerse y . 
Florian Laws and Hinrich Sch ¨ utze. 2008. Stopping cri- teria for acti ve learning of named entity recognition. In Pr oceedings of the 22nd International Conference on Computational Linguistics (Coling 2008) , pages 465–472, Manchester , UK, August. Coling 2008 Or - ganizing Committee. David D. Lewis and William A. Gale. 1994. A se- quential algorithm for training te xt classifiers. In SI- GIR ’94: Pr oceedings of the 17th annual interna- tional A CM SIGIR confer ence on Researc h and de- velopment in information r etrieval , pages 3–12, New Y ork, NY , USA. Springer-V erlag New Y ork, Inc. D. Roth and K. Small. 2008. Active learning for pipeline models. In Pr oceedings of the National Confer ence on Artificial Intelligence (AAAI) , pages 683–688. Nicholas Roy and Andrew McCallum. 2001. T ow ard optimal acti ve learning through sampling estimation of error reduction. In In Pr oceedings of the 18th In- ternational Confer ence on Machine Learning , pages 441–448. Morgan Kaufmann. Manabu Sassano. 2002. An empirical study of activ e learning with support vector machines for japanese word segmentation. In ACL ’02: Pr oceedings of the 40th Annual Meeting on Association for Computa- tional Linguistics , pages 505–512, Morristo wn, NJ, USA. Association for Computational Linguistics. Greg Schohn and David Cohn. 2000. Less is more: Activ e learning with support vector machines. In Pr oc. 17th International Conf. on Mac hine Learn- ing , pages 839–846. Morgan Kaufmann, San Fran- cisco, CA. Burr Settles and Mark Crav en. 2008. An analysis of active learning strategies for sequence labeling tasks. In Pr oceedings of the 2008 Conference on Empirical Methods in Natural Language Pr ocess- ing , pages 1070–1079, Honolulu, Hawaii, October . Association for Computational Linguistics. H. S. Seung, M. Opper , and H. Sompolinsky . 1992. Query by committee. In COLT ’92: Pr oceedings of the fifth annual workshop on Computational learn- ing theory , pages 287–294, Ne w Y ork, NY , USA. 
A CM. Dan Shen, Jie Zhang, Jian Su, Guodong Zhou, and Chew-Lim T an. 2004. Multi-criteria-based ac- tiv e learning for named entity recognition. In Pr o- ceedings of the 42nd Meeting of the Association for Computational Linguistics (A CL ’04), Main V olume , pages 589–596, Barcelona, Spain, July . Cynthia A. Thompson, Mary Elaine Califf, and Ray- mond J. Mooney . 1999. Acti ve learning for natural language parsing and information extraction. In In Pr oceedings of the 16th International Confer ence on Machine Learning , pages 406–414. Mor gan Kauf- mann, San Francisco, CA. Katrin T omanek, Joachim W ermter, and Udo Hahn. 2007. An approach to text corpus construction which cuts annotation costs and maintains reusabil- ity of annotated data. In Proceedings of the 2007 Joint Confer ence on Empirical Methods in Natural Language Pr ocessing and Computational Natur al Language Learning (EMNLP-CoNLL) , pages 486– 495. Andreas Vlachos. 2008. A stopping criterion for activ e learning. Computer Speech and Language , 22(3):295–312. Jingbo Zhu and Eduard Hovy . 2007. Activ e learn- ing for word sense disambiguation with methods for addressing the class imbalance problem. In Pr o- ceedings of the 2007 Joint Confer ence on Empirical Methods in Natural Languag e Pr ocessing and Com- putational Natural Language Learning (EMNLP- CoNLL) , pages 783–790. Jingbo Zhu, Huizhen W ang, and Eduard Hovy . 2008a. Learning a stopping criterion for activ e learning for word sense disambiguation and te xt classification. In IJCNLP . Jingbo Zhu, Huizhen W ang, and Eduard Hovy . 2008b . Multi-criteria-based strategy to stop activ e learning for data annotation. In Pr oceedings of the 22nd In- ternational Confer ence on Computational Linguis- tics (Coling 2008) , pages 1129–1136, Manchester , UK, August. Coling 2008 Organizing Committee.