Surrogate Learning - An Approach for Semi-Supervised Classification


Authors: Anonymous Author(s) (no author information is given in the paper; it was submitted anonymously)

Abstract

We consider the task of learning a classifier from the feature space $\mathcal{X}$ to the set of classes $\mathcal{Y} = \{0, 1\}$, when the features can be partitioned into class-conditionally independent feature sets $\mathcal{X}_1$ and $\mathcal{X}_2$. We show the surprising fact that the class-conditional independence can be used to represent the original learning task in terms of 1) learning a classifier from $\mathcal{X}_2$ to $\mathcal{X}_1$ and 2) learning the class-conditional distribution of the feature set $\mathcal{X}_1$. This fact can be exploited for semi-supervised learning because the former task can be accomplished purely from unlabeled samples. We present experimental evaluation of the idea in two real-world applications.

1 Introduction

Semi-supervised learning is said to occur when the learner exploits a (presumably large) quantity of unlabeled data to supplement a relatively small labeled sample, for accurate induction. The high cost of labeled data and the simultaneous plenitude of unlabeled data in many application domains have led to considerable interest in semi-supervised learning in recent years.

We show a somewhat surprising consequence of class-conditional feature independence that leads to a simple semi-supervised learning algorithm. When the feature set can be partitioned into two class-conditionally independent sets, we show that the original learning problem can be reformulated in terms of the problem of learning a predictor from one of the partitions to the other. That is, the latter partition acts as a surrogate for the class variable. Since such a predictor can be learned from only unlabeled samples, an effective semi-supervised algorithm results.
In the next section we present the simple yet interesting result on which our semi-supervised learning algorithm (which we call surrogate learning) is based. We present examples to clarify the intuition behind the approach and present a special case of our approach that is used in the applications section. We then examine related ideas in previous work and situate our algorithm among previous approaches to semi-supervised learning. We present empirical evaluation on two real-world applications where the required assumptions of our algorithm are satisfied.

2 Surrogate Learning

We consider the problem of learning a classifier from the feature space $\mathcal{X}$ to the set of classes $\mathcal{Y} = \{0, 1\}$. Let the features be partitioned into $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2$. The random feature vector $x \in \mathcal{X}$ will be represented correspondingly as $x = (x_1, x_2)$. Since we restrict our consideration to a two-class problem, the construction of the classifier involves the estimation of the probability $P(y=0 \mid x_1, x_2)$ at every point $(x_1, x_2) \in \mathcal{X}$.

We make the following assumptions on the joint probabilities of the classes and features.

1. $P(x_1, x_2 \mid y) = P(x_1 \mid y) P(x_2 \mid y)$ for $y \in \{0, 1\}$. That is, the feature sets $x_1$ and $x_2$ are class-conditionally independent for both classes. Note that in general this assumption is less restrictive than the Naive Bayes assumption.

2. $P(x_1 \mid x_2) \neq 0$, $P(x_1 \mid y) \neq 0$, and $P(x_1 \mid y=0) \neq P(x_1 \mid y=1)$. These assumptions are made to avoid divide-by-zero problems in the algebra below. If $x_1$ is a discrete-valued random variable and not irrelevant for the classification task, these conditions are often satisfied.

Under these assumptions, surprisingly, we can establish that $P(y=0 \mid x_1, x_2)$ can be written as a function of $P(x_1 \mid x_2)$ and $P(x_1 \mid y)$.
First, when we consider the quantity $P(y, x_1 \mid x_2)$, we may derive the following.

$P(y, x_1 \mid x_2) = P(x_1 \mid y, x_2) P(y \mid x_2)$
$\Rightarrow P(y, x_1 \mid x_2) = P(x_1 \mid y) P(y \mid x_2)$ (from the independence assumption)
$\Rightarrow P(y \mid x_1, x_2) P(x_1 \mid x_2) = P(x_1 \mid y) P(y \mid x_2)$
$\Rightarrow \frac{P(y \mid x_1, x_2) P(x_1 \mid x_2)}{P(x_1 \mid y)} = P(y \mid x_2)$   (1)

Since $P(y=0 \mid x_2) + P(y=1 \mid x_2) = 1$, Equation 1 implies

$\frac{P(y=0 \mid x_1, x_2) P(x_1 \mid x_2)}{P(x_1 \mid y=0)} + \frac{P(y=1 \mid x_1, x_2) P(x_1 \mid x_2)}{P(x_1 \mid y=1)} = 1$
$\Rightarrow \frac{P(y=0 \mid x_1, x_2) P(x_1 \mid x_2)}{P(x_1 \mid y=0)} + \frac{(1 - P(y=0 \mid x_1, x_2)) P(x_1 \mid x_2)}{P(x_1 \mid y=1)} = 1$   (2)

Solving Equation 2 for $P(y=0 \mid x_1, x_2)$, we obtain

$P(y=0 \mid x_1, x_2) = \frac{P(x_1 \mid y=0)}{P(x_1 \mid x_2)} \cdot \frac{P(x_1 \mid y=1) - P(x_1 \mid x_2)}{P(x_1 \mid y=1) - P(x_1 \mid y=0)}$   (3)

We have succeeded in writing $P(y=0 \mid x_1, x_2)$ as a function of $P(x_1 \mid x_2)$ and $P(x_1 \mid y)$. This leads to a significant simplification of the learning task when a large amount of unlabeled data is available, especially if $x_1$ is finite-valued. The learning algorithm involves the following two steps.

- Estimate the quantity $P(x_1 \mid x_2)$ from only the unlabeled data, by building a predictor from the feature space $\mathcal{X}_2$ to the space $\mathcal{X}_1$. There is no restriction on the learning algorithm for this prediction task.
- Estimate the quantity $P(x_1 \mid y)$ from a smaller labeled sample by counting.

Thus, we can decouple the prediction problem into two separate tasks, one of which involves predicting $x_1$ from the remaining features. In other words, $x_1$ serves as a surrogate for the class label. Furthermore, for the two steps above there is no necessity for complete samples: all the labeled examples can have the feature $x_2$ missing. The following example illustrates the intuition behind surrogate learning.
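As a sanity check, the identity in Equation 3 can be verified numerically on a small discrete model. This is an illustrative sketch, not code from the paper: the joint $P(x_1, y)$ below is the one used in Example 1, while the binary $x_2$ and its class-conditional distribution are made-up values chosen only so that the class-conditional independence assumption holds exactly.

```python
# Toy model with x1, x2, y all binary; x1 and x2 are class-conditionally
# independent, so P(x1, x2, y) = P(x1, y) * P(x2 | y).
P_X1Y = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}  # P(x1, y)
P_X2_GIVEN_Y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}     # P(x2 | y), assumed

def joint(x1, x2, y):
    return P_X1Y[(x1, y)] * P_X2_GIVEN_Y[y][x2]

def p_y(y):
    return sum(v for (x1, yy), v in P_X1Y.items() if yy == y)

def p_x1_given_y(x1, y):
    return P_X1Y[(x1, y)] / p_y(y)

def p_x1_given_x2(x1, x2):
    num = sum(joint(x1, x2, y) for y in (0, 1))
    den = sum(joint(a, x2, y) for a in (0, 1) for y in (0, 1))
    return num / den

def posterior_direct(x1, x2):
    """P(y=0 | x1, x2) by direct application of Bayes' rule."""
    return joint(x1, x2, 0) / sum(joint(x1, x2, y) for y in (0, 1))

def posterior_surrogate(x1, x2):
    """P(y=0 | x1, x2) via Equation 3, using only P(x1|x2) and P(x1|y)."""
    a = p_x1_given_y(x1, 0)
    b = p_x1_given_y(x1, 1)
    c = p_x1_given_x2(x1, x2)
    return (a / c) * (b - c) / (b - a)

for x1 in (0, 1):
    for x2 in (0, 1):
        assert abs(posterior_direct(x1, x2) - posterior_surrogate(x1, x2)) < 1e-9
```

The two posteriors agree at every point, which is exactly the content of Equation 3: the surrogate form never consults the relationship between $x_2$ and $y$ directly.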
----------------------------------------

Example 1. Consider a two-class problem, where $x_1$ is a binary feature and $x_2$ is a one-dimensional real-valued feature. The class-conditional distribution of $x_2$ for the class $y=0$ is Gaussian, and for the class $y=1$ it is Laplacian, as shown in Figure 1.A. Because of the class-conditional feature independence assumption, the joint distribution $P(x_1, x_2, y)$ can now be completely specified by fixing the joint probability $P(x_1, y)$. Let $P(x_1=0, y=0) = 0.3$, $P(x_1=0, y=1) = 0.1$, $P(x_1=1, y=0) = 0.2$, and $P(x_1=1, y=1) = 0.4$. The full joint distribution is depicted in Figure 1.B. Also shown in Figure 1.B are the conditional distributions $P(x_1=0 \mid x_2)$ and $P(y=0 \mid x_1, x_2)$.

Assume that we have a classifier to decide between $x_1=0$ and $x_1=1$ from the feature $x_2$. If this classifier is used to classify a sample that is from class $y=0$, it will most likely be assigned the 'label' $\hat{x}_1 = 0$ (because, for class $y=0$, $x_1=0$ is more likely than $x_1=1$), and a sample that is from class $y=1$ is often assigned the 'label' $\hat{x}_1 = 1$. Consequently the classifier between $x_1=0$ and $x_1=1$ provides information about the true class label $y$. This can also be seen in the similarity of the curves $P(y=0 \mid x_1, x_2)$ to the curve $P(x_1 \mid x_2)$.

[Figure 1: A) Class-conditional probability distributions of the feature $x_2$; B) the joint distributions and the posterior distributions of the class $y$ and the surrogate class $x_1$.]

2.1 A Special Case

We now specialize the above setting of the classification problem to the one realized in the applications we present later.
We still wish to learn a classifier from $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2$ to the set of classes $\mathcal{Y} = \{0, 1\}$. We make the following assumptions.

1. $x_1$ is a binary random variable. That is, $\mathcal{X}_1 = \{0, 1\}$.
2. $P(x_1, x_2 \mid y=0) = P(x_1 \mid y=0) P(x_2 \mid y=0)$. We require that the feature $x_1$ be class-conditionally independent of the remaining features only for the class $y=0$.
3. $P(x_1=0, y=1) = 0$. This assumption says that $x_1$ is a '100% recall' feature for $y=1$ (see footnote 1).

Assumption 3 simplifies the learning task to the estimation of the probability $P(y=0 \mid x_1=1, x_2)$ for every point $x_2 \in \mathcal{X}_2$. We can proceed as before to obtain the expression in Equation 3.

$P(y=0 \mid x_1=1, x_2) = \frac{P(x_1=1 \mid y=0)}{P(x_1=1 \mid x_2)} \cdot \frac{P(x_1=1 \mid y=1) - P(x_1=1 \mid x_2)}{P(x_1=1 \mid y=1) - P(x_1=1 \mid y=0)}$
$= \frac{P(x_1=1 \mid y=0)}{P(x_1=1 \mid x_2)} \cdot \frac{1 - P(x_1=1 \mid x_2)}{1 - P(x_1=1 \mid y=0)}$
$= \frac{P(x_1=1 \mid y=0)}{P(x_1=0 \mid y=0)} \cdot \frac{P(x_1=0 \mid x_2)}{1 - P(x_1=0 \mid x_2)}$   (4)

Equation 4 shows that $P(y=0 \mid x_1=1, x_2)$ is a monotonically increasing function of $P(x_1=0 \mid x_2)$. This means that after we build a predictor from $\mathcal{X}_2$ to $\mathcal{X}_1$, we only need to establish the threshold on $P(x_1=0 \mid x_2)$ that yields the optimum classification between $y=0$ and $y=1$. Therefore the learning proceeds as follows.

- Estimate the quantity $P(x_1 \mid x_2)$ from only the unlabeled data, by building a predictor from the feature space $\mathcal{X}_2$ to the binary space $\mathcal{X}_1$. Again, there is no restriction on this prediction algorithm.
- Use a small labeled sample to establish the threshold on $P(x_1=0 \mid x_2)$.

In the unlabeled data, we call the samples that have $x_1=1$ the target samples and those that have $x_1=0$ the background samples. The reason for this terminology is clarified in Example 2.
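The second step above reduces to a one-dimensional search, since Equation 4 makes $P(y=0 \mid x_1=1, x_2)$ monotone in the surrogate score. The helper below is a minimal sketch, not the paper's implementation: it assumes we already have estimates of $P(x_1=0 \mid x_2)$ for a small labeled sample, and simply picks the cutoff that maximizes labeled accuracy.

```python
def choose_threshold(scores, labels):
    """Pick a cutoff on P(x1=0 | x2): predict y=0 when score > threshold.

    scores: P(x1=0 | x2) estimates for a small labeled sample
    labels: the true classes y in {0, 1} for the same sample
    Returns the threshold with the highest labeled accuracy (a sketch;
    any one-dimensional criterion such as F1 could be maximized instead).
    """
    candidates = sorted(set(scores)) + [1.0]
    best_t, best_acc = 0.5, -1.0
    for t in candidates:
        correct = sum(
            1 for s, y in zip(scores, labels) if (0 if s > t else 1) == y
        )
        acc = correct / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Well-separated toy sample: low scores belong to class y=1, high to y=0.
t = choose_threshold([0.1, 0.2, 0.3, 0.7, 0.8, 0.9], [1, 1, 1, 0, 0, 0])
assert 0.3 <= t < 0.7
```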
Footnote 1: This assumption can be seen to trivially enforce the independence of the features for class $y=1$.

----------------------------------------

Example 2. We consider a problem with distributions $P(x_2 \mid y)$ identical to Example 1 (Figure 1.A), except with the joint probability $P(x_1, y)$ given by $P(x_1=0, y=0) = 0.3$, $P(x_1=0, y=1) = 0.0$, $P(x_1=1, y=0) = 0.2$, and $P(x_1=1, y=1) = 0.5$. The class-and-feature joint distribution is depicted in Figure 2. Clearly, $x_1$ is a 100% recall feature for $y=1$.

[Figure 2: The joint distributions and the posterior distributions for Example 2.]

Note that for a sample from the class $y=0$ it is impossible to determine better than randomly, by looking at the $x_2$ feature, whether it is a sample from the targets or the backgrounds, whereas a sample from the positive class is always a target. Therefore the background samples serve to delineate the positive examples among the targets.

3 Related Work

Although the idea of using unlabeled data to improve classifier accuracy has been around for several decades [8], semi-supervised learning has received much attention recently due to impressive results in some domains. The compilation of chapters edited by Chapelle et al. is an excellent introduction to the various approaches to semi-supervised learning, and the related practical and theoretical issues [6].

Identical to our setup, co-training assumes that the features can be split into two class-conditionally independent sets or 'views' [3]. Also assumed is the sufficiency of either view for accurate classification. The co-training algorithm iteratively uses the unlabeled data classified with high confidence by the classifier on one view to generate labeled data for learning the classifier on the other.
The intuition underlying co-training is that the errors made by the classifier on one view are independent of the other view (see footnote 2), and hence can be conceived of as uniform noise added to the training examples for the other view. Consequently, the number of label errors in a region of the feature space is proportional to the number of samples in the region. If the former classifier is reasonably accurate, the proportionally distributed errors are 'washed out' by the correctly labeled examples for the latter classifier.

The main distinction of surrogate learning from co-training is the learning of a predictor from one view to the other, as opposed to learning predictors from both views to the class label. We can therefore eliminate the requirement that both views be sufficiently informative for reasonably accurate prediction. Furthermore, unlike co-training, surrogate learning has no iterative component.

Ando and Zhang propose an algorithm to regularize the hypothesis space by simultaneously considering multiple classification tasks on the same feature space [1]. They then use their so-called structural learning algorithm for semi-supervised learning of one classification task, by the artificial construction of 'related' problems on unlabeled data. This is done by creating problems of predicting observable features of the data and learning the structural regularization parameters from these 'auxiliary' problems and unlabeled data. More recently, in [2] they showed that, with conditionally independent feature sets, predicting from one set to the other allows the construction of a feature representation that leads to an effective semi-supervised learning algorithm. Our approach operates directly on the original feature space and can be viewed as another justification for the algorithm in [1].
Footnote 2: Whether or not a label is erroneous is independent of the feature values of the latter view.

Castelli and Cover have studied the relative value of labeled and unlabeled samples for learning in a specialized setting where the class-conditional feature distributions are identifiable, and can be estimated from an unlabeled dataset [4, 5]. After the mixture is identified from a large number of unlabeled samples and the classification boundary is defined, labeled examples are necessary only to specify the 'orientation' of the boundary, i.e., to assign class labels to the regions in the feature space. We note the parallel in surrogate learning (cf. Equation 4), where a large amount of unlabeled data can be used to estimate the 'terrain' $\frac{P(x_1=0 \mid x_2)}{1 - P(x_1=0 \mid x_2)}$, and labeled data is necessary only to choose the contour that defines the classification boundary.

Multiple Instance Learning (MIL) is a learning setting where training data is provided as positive and negative bags of samples [7]. A negative bag contains only negative examples, whereas a positive bag contains at least one positive example. Surrogate learning can be viewed as artificially constructing a MIL problem, with the targets acting as one positive bag and the backgrounds acting as one negative bag (Section 2.1). The class-conditional feature independence assumption for class $y=0$ translates to the identical and independent distribution of the negative samples in both bags.

4 Two Applications

We applied surrogate learning to problems in record linkage and natural language processing. We explain below how the learning problems in both applications can be made to satisfy the assumptions in our second (100% recall) setting.
4.1 Record Linkage

Record linkage is the process of identifying and merging records of the same entity in different databases, or the unification of records in a single database, and constitutes an important component of data management. The reader is referred to [9] for an overview of the record linkage problem, strategies, and systems.

Our problem consisted of merging each of approximately 20,000 physician records, which we call the update database, with the record of the same physician in a master database of approximately $10^6$ records. The update database has fields that are absent in the master database and vice versa. The fields in common include the name (first, last, and middle initial), several address fields, phone, specialty, and year-of-graduation. Although the last name and year-of-graduation are consistent when present, the address, specialty, and phone fields have several inconsistencies owing to different ways of writing the address, new addresses, different terms for the same specialty, missing fields, etc. However, the name and year alone are insufficient for disambiguation. We had access to approximately 500 manually matched update records for training and evaluation (about 40 of these update records were labeled as unmatchable due to insufficient information).

The general approach to record linkage involves two steps: 1) blocking, where a small set of candidate records, containing the correct match with high probability, is retrieved from the master record database, and 2) matching, where the fields of the update records are compared to those of the candidates for scoring and selecting the match. We performed blocking by querying the master record database with the last name from the update record. Matching was done by scoring a feature vector of similarities over the various fields.
The feature values were either binary (verifying the equality of a particular field in the update and a master record) or continuous (some kind of normalized string edit distance between fields like street address, first name, etc.).

The surrogate learning solution to our matching problem was set up as follows. We designated the binary feature of equality of year-of-graduation (see footnote 3) as the surrogate label $x_1$, and the remaining features were relegated to $x_2$. The required conditions for surrogate learning are satisfied because 1) in our data it is highly unlikely for two records with different year-of-graduation to belong to the same physician, and 2) if it is known that the update record and a master record belong to two different physicians, then knowing that they have the same (or different) year-of-graduation provides no information about the other features. Therefore all the feature vectors with the binary feature indicating equality of year-of-graduation are targets and the remaining are backgrounds.

Footnote 3: We believe that the equality of the middle initial would have worked just as well for $x_1$.

Table 1: Precision and recall for record linkage. The surrogate learning algorithm had access to none of the manually matched records.

  Method (training proportion) | Precision | Recall
  Surrogate                    |   0.96    |  0.95
  Supervised (0.5)             |   0.96    |  0.94
  Supervised (0.2)             |   0.96    |  0.91

First, we used feature vectors obtained from the records in all blocks from all 20,000 update records to estimate the probability $P(x_1 \mid x_2)$. We used logistic regression for this prediction task. For learning the logistic regression parameters, we discarded the feature vectors for which $x_1$ was missing and performed mean imputation for the missing values of other features.
Second, the probability $P(x_1=1 \mid y=0)$ (the probability that two different randomly chosen physicians have the same year of graduation) was estimated straightforwardly from the counts of the different years-of-graduation in the master record database. These estimates were used to assign the score $P(y=1 \mid x_1=1, x_2)$ to the records in a block (cf. Equation 4). The score of 0 is assigned to feature vectors which have $x_1=0$. The only caveat is calculating the score for feature vectors that had missing $x_1$. For such records we assign the score $P(y=1 \mid x_2) = P(y=1 \mid x_1=1, x_2) P(x_1=1 \mid x_2)$; we have estimates for both quantities on the right-hand side. The highest-scoring record in each block was flagged as a match if it exceeded some appropriate threshold.

We compared the results of the surrogate learning approach to a supervised logistic-regression-based matcher, which used a portion of the manual matches for training and the remainder for testing. Table 1 shows the match precision and recall for both the surrogate learning and the supervised approaches. For the supervised algorithm, we show the results for the case where half the manually matched records were used for training and half for testing, as well as for the case where a fifth of the records were used for training and the remaining four-fifths for testing. In the latter case, every record participated in exactly one training fold but in four test folds. The results indicate that the surrogate learner performs better matching by exploiting the unlabeled data than the supervised learner with insufficient training data. The results, although not dramatic, are still promising, considering that the surrogate learning approach used none of the training records.

4.2 Merger-Acquisition Sentence Classification

Sentence classification is often a preprocessing step for event or relation extraction from text.
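The scoring just described can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: `year_counts` is a hypothetical histogram of graduation years from the master database, and the squared-frequency estimate of a year collision ignores the negligible finite-population correction.

```python
def p_same_year(year_counts):
    """Estimate P(x1=1 | y=0): the chance that two randomly chosen (different)
    physicians share a year-of-graduation, from master-database year counts."""
    total = sum(year_counts.values())
    return sum((n / total) ** 2 for n in year_counts.values())

def match_score(c, a, x1):
    """P(y=1 | x1, x2) per Equation 4 and the missing-x1 rule above.

    c  = P(x1=1 | x2), from the predictor trained on unlabeled data
    a  = P(x1=1 | y=0), e.g. from p_same_year
    x1 = 1, 0, or None (missing)
    """
    if x1 == 0:
        return 0.0  # the 100%-recall assumption: x1=0 rules out y=1
    p_y0 = min(1.0, (a / (1.0 - a)) * ((1.0 - c) / c))  # clip noisy estimates
    p_y1 = 1.0 - p_y0
    if x1 is None:  # P(y=1 | x2) = P(y=1 | x1=1, x2) * P(x1=1 | x2)
        return p_y1 * c
    return p_y1

a = p_same_year({1970: 40, 1980: 35, 1990: 25})  # toy counts
assert 0.0 < a < 1.0
assert match_score(0.9, a, 1) > match_score(0.5, a, 1)    # higher c, higher score
assert match_score(0.9, a, None) < match_score(0.9, a, 1)  # missing x1 discounts
assert match_score(0.9, a, 0) == 0.0
```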
One of the challenges posed by sentence classification is the diversity of the language for expressing the same event or relationship. We present a surrogate learning approach for constructing a sentence classifier that detects a merger-acquisition (MA) event between two organizations in financial news (in other words, we find paraphrases for the MA event).

We assume that the unlabeled sentence corpus is time-stamped and named-entity-tagged with organizations. We further assume that an MA sentence must mention at least two organizations. Our approach to building the sentence classifier is the following. We first extract all the so-called source sentences from the corpus that match a few high-precision seed patterns. An example of a seed pattern used for the MA event is '<ORG1> acquired <ORG2>' (see Example 3 below). We then extract every sentence in the corpus that contains at least two organizations, such that at least one of them matches an organization in the source sentences, and has a time-stamp within a two-month window of the matching source sentence. Of this set of sentences, all that contain two or more organizations from the same source sentence are designated as target sentences, and the rest are designated as background sentences.

We speculate that, since an organization is unlikely to have an MA relationship with two different organizations in the same time period, the backgrounds are unlikely to contain MA sentences, and moreover the language of the non-MA target sentences is indistinguishable from that of the background sentences. To relate the approach to surrogate learning, we note that the binary 'organization-pair equality' feature (both organizations in the current sentence being the same as those in a source sentence) serves as the '100% recall' feature $x_1$. The language in the sentence is the feature set $x_2$.
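The corpus-splitting step above can be sketched as follows. The data shapes here are hypothetical (each sentence is reduced to its set of tagged organizations plus a date); the real system of course operates on full tagged text.

```python
from datetime import date

WINDOW_DAYS = 61  # roughly the two-month window described in the text

def split_targets_backgrounds(sources, candidates):
    """sources, candidates: lists of (orgs, when) pairs, where orgs is a set
    of organization names and when is a datetime.date (hypothetical shapes).

    A candidate with >= 2 orgs becomes a *target* if it shares two or more
    orgs with a single time-matched source sentence, a *background* if it
    shares exactly one, and is discarded if it shares none.
    """
    targets, backgrounds = [], []
    for orgs, when in candidates:
        if len(orgs) < 2:
            continue  # an MA sentence must mention at least two organizations
        near = [s_orgs for s_orgs, s_when in sources
                if abs((when - s_when).days) <= WINDOW_DAYS]
        overlaps = [len(orgs & s_orgs) for s_orgs in near]
        if any(k >= 2 for k in overlaps):
            targets.append((orgs, when))
        elif any(k >= 1 for k in overlaps):
            backgrounds.append((orgs, when))
    return targets, backgrounds

sources = [({"US Airways", "Delta"}, date(2006, 11, 15))]
candidates = [
    ({"US Airways", "Delta"}, date(2006, 12, 1)),         # shares both orgs
    ({"US Airways", "America West"}, date(2006, 12, 1)),  # shares one org
    ({"IBM", "Lenovo"}, date(2006, 12, 1)),               # shares no org
]
t, b = split_targets_backgrounds(sources, candidates)
assert len(t) == 1 and len(b) == 1
```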
This setup satisfies the required conditions for surrogate learning because 1) if a sentence is about MA, the organization pair mentioned in it must be the same as that in a source sentence (i.e., if only one of the organizations matches those in a source sentence, the sentence is unlikely to be about MA), and 2) if an unlabeled sentence is non-MA, then knowing whether or not it shares an organization with a source does not provide any information about the language in the sentence.

We then trained a support vector machine (SVM) classifier to discriminate between the targets and backgrounds. The feature set ($x_2$) used for this task was a bag of word unigrams, bigrams, and trigrams, generated from the sentences and selected by ranking the n-grams by the divergence of their distributions in the targets and backgrounds. The sentences were ranked according to the score assigned by the SVM (which is a proxy for $P(x_1=1 \mid x_2)$). This score was then thresholded to obtain a classification between MA and non-MA sentences. Example 3 below lists some sentences to illustrate the surrogate learning approach. Note that the targets may contain both MA and non-MA sentences but the backgrounds are unlikely to be MA.

----------------------------------------

Example 3.

Seed Pattern: "offer for <ORG>"

Source Sentences:
1. <ORG>US Airways<ORG> said Wednesday it will increase its offer for <ORG>Delta<ORG>.

Target Sentences (SVM score):
1. <ORG>US Airways<ORG> were to combine with a standalone <ORG>Delta<ORG>. (1.0008563)
2. <ORG>US Airways<ORG> argued that the nearly $10 billion acquisition of <ORG>Delta<ORG> would result in an efficiently run carrier that could offer low fares to fliers. (0.99958149)
3. <ORG>US Airways<ORG> is asking <ORG>Delta<ORG>'s official creditors committee to support postponing that hearing. (-0.99914371)

Background Sentences (SVM score):
1. The cities have made various overtures to <ORG>US Airways<ORG>, including a promise from <ORG>America West Airlines<ORG> and the former <ORG>US Airways<ORG>. (0.99957752)
2. <ORG>US Airways<ORG> shares rose 8 cents to close at $53.35 on the <ORG>New York Stock Exchange<ORG>. (-0.99906444)

----------------------------------------

We tested our algorithm on an unlabeled corpus of approximately 700,000 financial news articles. We experimented with five seed patterns (<ORG> acquired <ORG>, <ORG> bought <ORG>, offer for <ORG>, to buy <ORG>, merger with <ORG>), which resulted in 870 source sentences. The participants that were extracted from the sources resulted in approximately 12,000 target sentences and approximately 120,000 background sentences. For the purpose of evaluation, 500 randomly selected sentences from the targets were manually checked, leading to 330 being tagged as MA and the remaining 170 as non-MA. This corresponds to a 66% precision of the targets. We then ranked the targets according to the score assigned by the SVM trained to classify between the targets and backgrounds, and selected all the targets above a threshold as paraphrases for MA.

Table 2 presents the precision and recall on the 500 manually tagged sentences as the threshold varies. The results indicate that our approach provides an effective way to rank the target sentences according to their likelihood of being about MA. We also evaluated the capability of the method to find paraphrases by conducting five separate experiments, using each of the seed patterns individually as the only seed and counting the number of obtained sentences containing each of the other patterns (using a threshold of 0.0). We found that the method was effective in finding paraphrases that have very different language than the sources.
We do not provide the numbers due to space considerations.

Finally, we used the paraphrase sentences found by surrogate learning to augment the training data for an MA sentence classifier and evaluated its accuracy. We first built an SVM classifier on only a portion of the labeled targets and used the remainder as the test set. This approach yielded an accuracy of 76% on the test set (with two-fold cross-validation). We then added to the training data all the targets scored above a threshold by surrogate learning as positive examples (4,000 positive sentences in all were added), and all the backgrounds that scored below a low threshold as negative examples (27,000 sentences), and repeated the two-fold cross-validation. The classifier learned on the augmented training data improved the accuracy on the test data to 86%.

Table 2: Precision/recall of surrogate learning on the MA sentence problem for various thresholds. The baseline of using all the targets as paraphrases for MA has a precision of 66% and a recall of 100%.

  Threshold | Precision | Recall
     0.0    |   0.83    |  0.94
    -0.2    |   0.82    |  0.95
    -0.8    |   0.79    |  0.99

We believe that better-designed features (than word n-grams) will provide paraphrases with higher precision and recall of the MA sentences found by surrogate learning. To apply our approach to a new event extraction problem, the design step also involves the selection of the $x_1$ feature such that the targets and backgrounds satisfy our assumptions.

5 Conclusions

We presented surrogate learning, a simple semi-supervised learning algorithm that can be applied when the features satisfy the required independence assumptions. We presented two applications, showed how the assumptions are satisfied, and presented empirical evidence for the efficacy of our algorithm.
We expect that surrogate learning is sufficiently general to be applied in diverse domains, if the features are carefully designed. We are developing a version of the algorithm that allows the statistical independence assumption to be relaxed to mean independence.

References

[1] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR, 6:1817-1853, 2005.
[2] R. K. Ando and T. Zhang. Two-view feature generation model for semi-supervised learning. In ICML, pages 25-32, 2007.
[3] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, pages 92-100, 1998.
[4] V. Castelli and T. M. Cover. On the exponential value of labeled samples. Pattern Recognition Letters, 16(1):105-111, 1995.
[5] V. Castelli and T. M. Cover. The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Trans. on Information Theory, 42(6):2102-2117, 1996.
[6] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.
[7] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Perez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31-71, 1997.
[8] G. Nagy and G. L. Shelton. Self-corrective character recognition system. IEEE Trans. Information Theory, 12(2):215-222, 1966.
[9] W. E. Winkler. Matching and record linkage. In Business Survey Methods, pages 355-384. Wiley, 1995.
