
Semi-supervised logistic discrimination via labeled data and unlabeled data from different sampling distributions

Shuichi Kawano
Department of Mathematical Sciences, Graduate School of Engineering, Osaka Prefecture University, 1-1 Gakuen-cho, Sakai, Osaka 599-8531, Japan.
skawano@ms.osakafu-u.ac.jp

Abstract: This article addresses the problem of classification based on both labeled and unlabeled data, where we assume that the density function for the labeled data differs from that for the unlabeled data. We propose a semi-supervised logistic regression model for the classification problem, together with the technique of covariate shift adaptation. The unknown parameters in the proposed models are estimated by regularization with the EM algorithm. A crucial issue in the modeling process is the choice of tuning parameters in our semi-supervised logistic models. In order to select the parameters, a model selection criterion is derived from an information-theoretic approach. Numerical studies show that our modeling procedure performs well in various cases.

Key Words and Phrases: Covariate shift; EM algorithm; Model selection; Regularization; Semi-supervised learning.

1 Introduction

In recent years, with the wide availability of fast and high-powered computers, high-throughput data of unprecedented size and complexity have frequently been seen in contemporary statistics and machine learning. Examples involve data from genomics, proteomics, natural language processing, and signal processing. For such huge amounts of data, it is difficult to label the data by a human operator, since the work requires a vast amount of time and effort. Only a small labeled data set may therefore be available, while an unlabeled data set can be obtained more easily. Under such circumstances, classification methods that combine both labeled and unlabeled data, called semi-supervised learning, have received an enormous amount of attention in the recent machine learning and statistical literature (see, e.g., Chapelle et al., 2006; Liang et al., 2007). For overviews of semi-supervised learning methods, we refer to Zhu (2008) and the references given therein.

Many classification techniques for semi-supervised learning have been proposed by various researchers, e.g., Amini and Gallinari (2002), Basu et al. (2004), Bennett and Demiriz (1998), Chen and Wang (2007), Dean et al. (2006), Kawano and Konishi (2011), Kawano et al. (2012), Lafferty and Wasserman (2007), and Zhou et al. (2004). Most of these semi-supervised methods implicitly assume that the density function for the labeled data is the same as that for the unlabeled data. Here, in contrast, we consider the case where the densities for the labeled data and the unlabeled data are different, since the densities are not always the same in practical situations. For such cases, several semi-supervised methods have been presented, e.g., Jiang and Zhai (2007), Wu et al. (2009), and Zadrozny (2004). However, for these methods there remains the problem of evaluating the constructed semi-supervised models, which is a crucial issue in the model building process. Cross-validation (CV) is often used to evaluate models constructed by semi-supervised procedures. An advantage of CV lies in its independence from probabilistic assumptions.
The computational cost of CV is, however, very large, and its high variability and tendency to undersmooth are not negligible in the analysis of complex or high-dimensional data, since the selectors are applied repeatedly.

In this paper, we propose a logistic model for the semi-supervised classification problem by using statistical methods under covariate shift (Shimodaira, 2000), for the case where the density function for the labeled data differs from that for the unlabeled data. The unknown parameters in the model are estimated by the regularization method with the help of the EM algorithm. A crucial issue in our modeling strategy is choosing the values of the tuning parameters included in the semi-supervised logistic models, which corresponds to evaluating the models determined by our proposed procedures. In order to select optimal values of the tuning parameters objectively, we introduce a model selection criterion based on an information-theoretic approach (Konishi and Kitagawa, 1996) that evaluates the semi-supervised logistic models estimated by the regularization method. Numerical examples demonstrate that the proposed procedure works well and performs better than competing methods.

This paper is organized as follows. In Section 2, we present a semi-supervised logistic model for the classification problem based on covariate shift adaptation, together with its estimation procedure by the regularization method. Section 3 provides a model selection criterion, derived from an information-theoretic viewpoint, for selecting the tuning parameters in the semi-supervised logistic models. In Section 4, Monte Carlo simulations and benchmark data analyses are given to assess the performance of our proposed semi-supervised logistic discrimination. Some concluding remarks are given in Section 5.

2 Semi-supervised logistic modeling from different sampling distributions

2.1 Linear logistic modeling for semi-supervised learning

We review here the semi-supervised linear logistic models developed by earlier researchers (e.g., Amini and Gallinari, 2002; Vittaut et al., 2002). Suppose that we have an $n_1$ labeled data set $\{(x_\alpha, y_\alpha);\ \alpha = 1, \ldots, n_1\}$ and an $(n - n_1)$ unlabeled data set $\{x_\alpha;\ \alpha = n_1 + 1, \ldots, n\}$, where $x_\alpha = (x_{\alpha 1}, \ldots, x_{\alpha p})^T$ denotes a $p$-dimensional explanatory variable and $Y_\alpha$ is a random variable taking values 0 or 1 with probabilities

$$\Pr(Y_\alpha = 1 \mid x_\alpha) = \pi(x_\alpha), \qquad \Pr(Y_\alpha = 0 \mid x_\alpha) = 1 - \pi(x_\alpha). \tag{1}$$

Note that the logistic models are first constructed from the labeled data set alone, while the unlabeled data set is used in estimating the parameters involved in the logistic models. Using the conditional probabilities in Equation (1) and the labeled data set, a linear logistic model (see, e.g., Hastie et al., 2009) is formulated as

$$\log\left\{\frac{\pi(x_\alpha)}{1 - \pi(x_\alpha)}\right\} = w_0 + \sum_{j=1}^{p} w_j x_{\alpha j} = w^T x^*_\alpha, \qquad \alpha = 1, \ldots, n_1, \tag{2}$$

where $w = (w_0, w_1, \ldots, w_p)^T$ is an unknown parameter vector and $x^*_\alpha = (1, x_\alpha^T)^T$. Hereafter, we denote the conditional probabilities by $\pi(x_\alpha; w)$, since they depend on the parameter vector $w$. It follows from Equation (2) that the conditional probabilities can be rewritten as

$$\pi(x_\alpha; w) = \frac{\exp(w^T x^*_\alpha)}{1 + \exp(w^T x^*_\alpha)}. \tag{3}$$
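To make the model concrete, here is a minimal Python sketch of Equations (2)-(3); the function and variable names are ours, not from the paper:

```python
import numpy as np

def augment(X):
    """Prepend a constant 1 to each row: x* = (1, x^T)^T."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def cond_prob(X, w):
    """pi(x; w) = exp(w^T x*) / (1 + exp(w^T x*)), computed row-wise."""
    eta = augment(X) @ w                 # linear predictor w^T x*
    return 1.0 / (1.0 + np.exp(-eta))    # equivalent, numerically safer form

# Example: classify a future observation by the larger conditional probability.
w_hat = np.array([0.1, -0.5, 0.8])       # hypothetical fitted (w0, w1, w2)
x_f = np.array([[0.3, -1.2]])
label = int(cond_prob(x_f, w_hat)[0] >= 0.5)
```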
Also, the probability function of the random variable $Y_\alpha$ is the Bernoulli distribution of the form

$$f(y_\alpha \mid x_\alpha; w) = \pi(x_\alpha; w)^{y_\alpha} \{1 - \pi(x_\alpha; w)\}^{1 - y_\alpha}, \qquad y_\alpha = 0, 1. \tag{4}$$

Under the linear logistic model, the log-likelihood function for $y_\alpha$ in terms of $w$ is

$$\ell(w) = \sum_{\alpha=1}^{n_1} \log f(y_\alpha \mid x_\alpha; w) = \sum_{\alpha=1}^{n_1} \left[ y_\alpha w^T x^*_\alpha - \log\{1 + \exp(w^T x^*_\alpha)\} \right]. \tag{5}$$

The unknown parameter $w$ included in the logistic model is usually estimated by maximizing the log-likelihood function with respect to the parameter. This procedure is known as supervised learning; i.e., the parameter is determined by using only the labeled data set. Since we have an additional unlabeled data set, the parameter should be estimated from both the labeled and the unlabeled data sets, which is called semi-supervised learning. To this end, Amini and Gallinari (2002) proposed a log-likelihood function with additional unlabeled data given by

$$\ell^*(w) = \sum_{\alpha=1}^{n_1} \left[ y_\alpha w^T x^*_\alpha - \log\{1 + \exp(w^T x^*_\alpha)\} \right] + \sum_{\alpha=n_1+1}^{n} \left[ t_\alpha w^T x^*_\alpha - \log\{1 + \exp(w^T x^*_\alpha)\} \right], \tag{6}$$

where $t_\alpha\ (\alpha = n_1 + 1, \ldots, n)$ is a latent variable coded as 0 or 1. Amini and Gallinari (2002) estimated the parameter by maximizing Equation (6) with the technique of the EM algorithm, while Kawano and Konishi (2011) employed Equation (6) with a regularization term in estimating the parameter in the context of nonlinear logistic models based on basis expansions. Given the estimate $\hat{w}$, we assign a future observation $x_f$ to the class $j\ (j = 0, 1)$ that has the maximum conditional probability in Equation (3).

2.2 Semi-supervised logistic model for different distributions

The logistic models for semi-supervised learning described in Section 2.1 usually assume that the density function for the labeled data set is the same as that for the unlabeled data set; i.e., if $q_{\mathrm{label}}(x)$ denotes the probability density function of the explanatory variables for the labeled data and $q_{\mathrm{unlabel}}(x)$ that for the unlabeled data, then $q_{\mathrm{label}}(x) = q_{\mathrm{unlabel}}(x)$. Our aim in this section is to construct logistic models for the situation where the density for the labeled data set is different from that for the unlabeled data set, i.e., $q_{\mathrm{label}}(x) \neq q_{\mathrm{unlabel}}(x)$.

We recall the log-likelihood function for logistic models with unlabeled data in Equation (6). For this log-likelihood function, we propose a weighted log-likelihood function with unlabeled data of the form

$$\ell^*(w; \gamma_1, \gamma_2) = \sum_{\alpha=1}^{n_1} \left\{ \frac{q_{\mathrm{unlabel}}(x_\alpha)}{q_{\mathrm{label}}(x_\alpha)} \right\}^{\gamma_1} \left[ y_\alpha w^T x^*_\alpha - \log\{1 + \exp(w^T x^*_\alpha)\} \right] + \sum_{\alpha=n_1+1}^{n} \left\{ \frac{q_{\mathrm{label}}(x_\alpha)}{q_{\mathrm{unlabel}}(x_\alpha)} \right\}^{\gamma_2} \left[ t_\alpha w^T x^*_\alpha - \log\{1 + \exp(w^T x^*_\alpha)\} \right], \tag{7}$$

where $\gamma_1, \gamma_2 \in [0, 1]$ are tuning parameters. If both $\gamma_1$ and $\gamma_2$ are 0, the log-likelihood function in Equation (7) coincides with that in Equation (6). Note that the weight on the first term, $q_{\mathrm{unlabel}}(x)/q_{\mathrm{label}}(x)$, is large in regions where the unlabeled data are dense relative to the labeled data, while the weight on the second term, $q_{\mathrm{label}}(x)/q_{\mathrm{unlabel}}(x)$, is large in regions where the labeled data are dense relative to the unlabeled data. Hence, the first term of the log-likelihood receives high weight near high densities of the unlabeled data, while the second term receives high weight near high densities of the labeled data.
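The weighted log-likelihood (7) can be sketched as follows. This is an illustrative fragment under our own naming; in particular, the density ratios here are estimated by plain kernel density estimation as a stand-in, whereas the paper employs uLSIF (Kanamori et al., 2009) for this step:

```python
import numpy as np
from scipy.stats import gaussian_kde

def weighted_loglik(w, X_lab, y, X_unlab, t, r_lab, r_unlab, gamma1, gamma2):
    """Weighted log-likelihood of Equation (7).

    r_lab   : estimated q_unlabel(x)/q_label(x) at the labeled points
    r_unlab : estimated q_label(x)/q_unlabel(x) at the unlabeled points
    t       : current values of the latent labels t_alpha in [0, 1]
    """
    def term(X, resp, weight):
        eta = np.hstack([np.ones((X.shape[0], 1)), X]) @ w
        # y * w^T x* - log(1 + exp(w^T x*)), summed with the ratio weights
        return np.sum(weight * (resp * eta - np.logaddexp(0.0, eta)))
    return term(X_lab, y, r_lab**gamma1) + term(X_unlab, t, r_unlab**gamma2)

# Illustrative density-ratio estimates via kernel density estimation;
# the paper instead uses uLSIF for this step.
rng = np.random.default_rng(0)
X_lab = rng.normal(0.0, 1.0, (40, 2))
X_unlab = rng.normal(0.5, 1.2, (200, 2))
q_lab = gaussian_kde(X_lab.T)
q_unlab = gaussian_kde(X_unlab.T)
r_lab = q_unlab(X_lab.T) / q_lab(X_lab.T)
r_unlab = q_lab(X_unlab.T) / q_unlab(X_unlab.T)
```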
The idea of the weight, the ratio of $q_{\mathrm{label}}(x)$ and $q_{\mathrm{unlabel}}(x)$, arises from statistical inference under covariate shift (Shimodaira, 2000). In semi-supervised learning, employing a ratio of densities in log-likelihood functions is not new; for example, Kawakita and Kanamori (2012), Sokolovska et al. (2008), and Zou et al. (2007) use a ratio of densities in semi-supervised inference. However, Equation (7) is a novel formulation in the semi-supervised context.

Equation (7) includes unknown values of the ratios, $q_{\mathrm{unlabel}}(x)/q_{\mathrm{label}}(x)$ and $q_{\mathrm{label}}(x)/q_{\mathrm{unlabel}}(x)$, which are to be estimated. Various researchers have addressed the problem of estimating such ratios by several methods from statistics and machine learning (Bickel et al., 2009; Huang et al., 2007; Kanamori et al., 2009; Sugiyama et al., 2008; Sugiyama and Kawanabe, 2012; Sugiyama et al., 2012). In this paper, we employ the uLSIF method proposed by Kanamori et al. (2009) to determine the values of the ratios, where the determination is performed before estimating the parameter $w$. Source code for the uLSIF method is available at http://www.math.cm.is.nagoya-u.ac.jp/~kanamori/software/LSIF. We do not go into the details of the density ratio estimation procedure by the uLSIF method, since it is not the focus of this paper; interested readers are referred to Kanamori et al. (2009) and Sugiyama and Kawanabe (2012).

2.3 Parameter estimation via regularization

In estimating the parameters of logistic models, the log-likelihood function often diverges to infinity when the maximum likelihood method is applied (Konishi and Kitagawa, 2008). Hence, the parameter vector $w$ in Equation (7) is estimated by the regularization method, which maximizes the following regularized log-likelihood function:

$$\ell^*_\lambda(w; \gamma_1, \gamma_2) = \ell^*(w; \gamma_1, \gamma_2) - \frac{n_1 \lambda}{2} w^T K w, \tag{8}$$

where $\lambda$ is a regularization parameter taking positive values and $K = \mathrm{diag}(0, I_p)$ is a $(p+1) \times (p+1)$ matrix. Here, $I_p$ is the $p$-dimensional identity matrix.

It is not easy to optimize the parameter involved in Equation (8), since the latent variables $t_\alpha\ (\alpha = n_1 + 1, \ldots, n)$ are unobserved. Hence, we employ an EM-based algorithm developed by Kawano and Konishi (2011) as follows:

Step 1. Estimate the parameter vector $w$ by maximizing the regularized log-likelihood function using only the labeled data set $\{(x_\alpha, y_\alpha);\ \alpha = 1, \ldots, n_1\}$, with the technique of the Newton-Raphson method.

Step 2. Construct a classification rule $\pi(x_\alpha; \hat{w})$.

Step 3. (E-step) According to the classification rule in Step 2, compute the conditional probabilities $\pi(x_\alpha; \hat{w})$ for the unlabeled data $x_\alpha\ (\alpha = n_1 + 1, \ldots, n)$. Using these conditional probabilities, estimate $t_\alpha$ as $\hat{t}_\alpha = \pi(x_\alpha; \hat{w})$.

Step 4. (M-step) Replace $t_\alpha$ by $\hat{t}_\alpha$ in the regularized log-likelihood function (8), and then determine the parameter vector $w$ by maximizing the log-likelihood function in Equation (8) with the help of the Newton-Raphson method.

Step 5. Repeat Step 2 to Step 4 until the condition

$$\left| \ell^*_\lambda(\hat{w}^{(k+1)}; \gamma_1, \gamma_2) - \ell^*_\lambda(\hat{w}^{(k)}; \gamma_1, \gamma_2) \right| < \varepsilon \tag{9}$$

is satisfied, where $\hat{w}^{(k)}$ is the value of $w$ after the $k$-th EM iteration and $\varepsilon$ is an arbitrarily small number (e.g., $10^{-5}$).
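A compact sketch of Steps 1-5 is given below. It assumes the density-ratio weights have already been computed, and it implements the M-step as a ridge-penalized, weighted Newton-Raphson update; all names are ours and details such as step-size control are omitted:

```python
import numpy as np

def fit_em(X_lab, y, X_unlab, r_lab, r_unlab, gamma1, gamma2,
           lam, eps=1e-5, max_em=100, max_newton=50):
    """EM-type estimation of w for the weighted model (7)-(8) (a sketch)."""
    n1 = X_lab.shape[0]
    Xs_lab = np.hstack([np.ones((n1, 1)), X_lab])
    Xs_unl = np.hstack([np.ones((X_unlab.shape[0], 1)), X_unlab])
    K = np.eye(Xs_lab.shape[1]); K[0, 0] = 0.0   # K = diag(0, I_p)
    c_lab, c_unl = r_lab**gamma1, r_unlab**gamma2

    def newton(X, resp, c, w):
        """Maximize the weighted, penalized log-likelihood in w."""
        for _ in range(max_newton):
            p = 1.0 / (1.0 + np.exp(-(X @ w)))
            grad = X.T @ (c * (resp - p)) - n1 * lam * (K @ w)
            H = -(X.T * (c * p * (1 - p))) @ X - n1 * lam * K
            step = np.linalg.solve(H, grad)
            w = w - step
            if np.max(np.abs(step)) < 1e-8:
                break
        return w

    def objective(w, t):
        """Regularized weighted log-likelihood (8) at (w, t)."""
        def part(X, resp, c):
            eta = X @ w
            return np.sum(c * (resp * eta - np.logaddexp(0.0, eta)))
        return (part(Xs_lab, y, c_lab) + part(Xs_unl, t, c_unl)
                - 0.5 * n1 * lam * (w @ K @ w))

    # Step 1: initial fit from the labeled data only.
    w = newton(Xs_lab, y, c_lab, np.zeros(Xs_lab.shape[1]))
    obj = -np.inf
    for _ in range(max_em):
        # Steps 2-3 (E-step): impute t_alpha = pi(x_alpha; w_hat).
        t = 1.0 / (1.0 + np.exp(-(Xs_unl @ w)))
        # Step 4 (M-step): refit on the labeled data stacked with the
        # imputed unlabeled responses.
        w = newton(np.vstack([Xs_lab, Xs_unl]), np.hstack([y, t]),
                   np.hstack([c_lab, c_unl]), w)
        new_obj = objective(w, t)
        if abs(new_obj - obj) < eps:   # Step 5: convergence check (9)
            break
        obj = new_obj
    return w
```

In this sketch, the M-step maximizes (8) with the latent $t_\alpha$ held fixed at $\hat{t}_\alpha$, which is done by stacking the labeled data with the imputed unlabeled responses.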
It follows from this procedure that we obtain a statistical model of the form

$$f(y \mid x; \hat{w}) = \pi(x; \hat{w})^y \{1 - \pi(x; \hat{w})\}^{1-y}. \tag{10}$$

Note that this statistical model is constructed by using both the labeled data and the unlabeled data.

3 Model selection criterion

The statistical model in Equation (10) contains several adjusted parameters: the two tuning parameters $\gamma_1, \gamma_2$ in the weighted log-likelihood function and the regularization parameter $\lambda$. Regarding the selection of these adjusted parameters as the selection of candidate models, we introduce a model selection criterion from an information-theoretic approach.

Let $y_1, \ldots, y_{n_1}$ be $n_1$ observations drawn randomly from an unknown probability distribution function $G(y \mid x)$ having density function $g(y \mid x)$. On the other hand, we assume that the $n_1$ observations of the explanatory variables $x_1, \ldots, x_{n_1}$ are non-random, i.e., fixed (for details of this assumption, we refer to Konishi and Kitagawa, 2008). Under these settings, we derive a model selection criterion from the viewpoint of information theory.

Suppose that $z = (z_1, \ldots, z_{n_1})^T$ are future observations of the response variable generated from $g(y \mid x)$. Let $f(z \mid x; \hat{w}_G)^{\eta(x)} = \prod_{\alpha=1}^{n_1} f(z_\alpha \mid x_\alpha; \hat{w}_G)^{\eta(x_\alpha)}$ and $g(z \mid x) = \prod_{\alpha=1}^{n_1} g(z_\alpha \mid x_\alpha)$, where $\hat{w}_G$ is an estimator of the parameter by any estimation procedure, $\eta(x) = \eta(x_1) + \cdots + \eta(x_{n_1})$, and $\eta(x_\alpha)\ (\alpha = 1, \ldots, n_1)$ are weights that depend on the explanatory variables $x_\alpha$ and satisfy $\eta(x_\alpha) > 0$. Note that the weights $\eta(x_\alpha)\ (\alpha = 1, \ldots, n_1)$ are fixed, since we assume that $x_1, \ldots, x_{n_1}$ are non-random. Then Irizarry (2001) implicitly proposes the following Kullback-Leibler information to measure the divergence of the statistical model with weights from the true distribution:

$$I\{g; f\} = E_{G(z|x)}\left[\log \frac{g(z \mid x)}{f(z \mid x; \hat{w}_G)^{\eta(x)}}\right] = E_{G(z|x)}[\log g(z \mid x)] - E_{G(z|x)}\left[\eta(x) \log f(z \mid x; \hat{w}_G)\right]. \tag{11}$$

The best model can be regarded as the minimizer of the Kullback-Leibler information (Irizarry, 2001). Since the first term of Equation (11) does not depend on the models with the estimator $\hat{w}_G$, we have only to consider the second term: maximizing the second term of Equation (11) leads to minimizing the Kullback-Leibler information. By introducing an estimator of the second term of Equation (11), a model selection criterion is, in general, given by

$$\mathrm{IC} = -2 \sum_{\alpha=1}^{n_1} \eta(x_\alpha) \log f(y_\alpha \mid x_\alpha; \hat{w}_G) + 2\hat{b}(G), \tag{12}$$

where IC stands for information criterion and $\hat{b}(G)$ is an estimator of the bias $b(G)$ given by

$$b(G) = E_{G(y|x)}\left[ \sum_{\alpha=1}^{n_1} \eta(x_\alpha) \log f(y_\alpha \mid x_\alpha; \hat{w}_G) - E_{G(z|x)}\left[\eta(x) \log f(z \mid x; \hat{w}_G)\right] \right]. \tag{13}$$
Suppose that the estimator $\hat{w}_M$ of the parameter is an M-estimator defined as the solution of the implicit equation

$$\sum_{\alpha=1}^{n_1} \psi(y_\alpha \mid x_\alpha; \hat{w}_M) = 0, \tag{14}$$

with $\psi$ being referred to as the $\psi$-function (see, e.g., Huber, 2004). Using the idea of Konishi and Kitagawa (1996), we derive a model selection criterion for statistical models with the M-estimator $\hat{w}_M$ in the form

$$\mathrm{IC}_M = -2 \sum_{\alpha=1}^{n_1} \eta(x_\alpha) \log f(y_\alpha \mid x_\alpha; \hat{w}_M) + 2\,\mathrm{tr}\left\{ Q(\hat{w}_M) R^{-1}(\hat{w}_M) \right\}, \tag{15}$$

where $Q(\hat{w}_M)$ and $R(\hat{w}_M)$ are given by

$$Q(\hat{w}_M) = \frac{1}{n_1} \sum_{\alpha=1}^{n_1} \psi(y_\alpha \mid x_\alpha; w)\, \eta(x_\alpha) \left. \frac{\partial \log f(y_\alpha \mid x_\alpha; w)}{\partial w^T} \right|_{w=\hat{w}_M}, \tag{16}$$

$$R(\hat{w}_M) = -\frac{1}{n_1} \sum_{\alpha=1}^{n_1} \left. \frac{\partial \psi(y_\alpha \mid x_\alpha; w)^T}{\partial w} \right|_{w=\hat{w}_M}. \tag{17}$$

In our models, the estimator $\hat{w}$, which maximizes the regularized log-likelihood function in Equation (8), can be regarded as an M-estimator. Here, we set the $\psi$-function of the estimator to

$$\psi(y_\alpha \mid x_\alpha; w) = \frac{\partial}{\partial w} \left[ \left\{ \frac{q_{\mathrm{unlabel}}(x_\alpha)}{q_{\mathrm{label}}(x_\alpha)} \right\}^{\gamma_1} \left[ y_\alpha w^T x^*_\alpha - \log\{1 + \exp(w^T x^*_\alpha)\} \right] - \frac{\lambda}{2} w^T K w \right]. \tag{18}$$

Note that the $\psi$-function in Equation (18) is, strictly speaking, incorrect, since the estimator $\hat{w}$ is obtained by maximizing Equation (8) with respect to the parameter; i.e., the estimator is constructed by using both the labeled and the unlabeled data. However, $\psi$-functions in the context of model selection criteria must be given by a regularized or non-regularized log-likelihood function with incomplete data; i.e., the functions do not include latent variables (for details, see Hirose et al., 2008). Hence, we employ the $\psi$-function in Equation (18) to derive a model selection criterion.

By using the $\psi$-function in Equation (18) and substituting $\{q_{\mathrm{unlabel}}(x_\alpha)/q_{\mathrm{label}}(x_\alpha)\}^{\gamma_1}$ for the weights $\eta(x_\alpha)\ (\alpha = 1, \ldots, n_1)$, we introduce a generalized information criterion (GIC) for evaluating our proposed semi-supervised logistic models estimated by the regularization method. The model selection criterion is given by

$$\mathrm{GIC} = -2 \sum_{\alpha=1}^{n_1} \left\{ \frac{q_{\mathrm{unlabel}}(x_\alpha)}{q_{\mathrm{label}}(x_\alpha)} \right\}^{\gamma_1} \log f(y_\alpha \mid x_\alpha; \hat{w}) + 2\,\mathrm{tr}\left\{ Q(\hat{w}) R^{-1}(\hat{w}) \right\}, \tag{19}$$

where the matrices $Q(\hat{w})$ and $R(\hat{w})$ are

$$Q(\hat{w}) = \frac{1}{n_1} \left\{ X^T \hat{W}^2 \hat{\Lambda}^2 X - \lambda K \hat{w}\, 1_{n_1}^T \hat{W} \hat{\Lambda} X \right\}, \tag{20}$$

$$R(\hat{w}) = \frac{1}{n_1} X^T \hat{\Pi} \hat{W} (I_{n_1} - \hat{\Pi}) X + \lambda K. \tag{21}$$

Here, $1_{n_1}$ is the $n_1$-dimensional vector whose elements are all one, and $I_{n_1}$ is the $n_1$-dimensional identity matrix. Also, $X$, $\hat{W}$, $\hat{\Lambda}$, and $\hat{\Pi}$ are, respectively, given by

$$X = (x^*_1, \ldots, x^*_{n_1})^T,$$
$$\hat{W} = \mathrm{diag}\left[ \left\{ \frac{q_{\mathrm{unlabel}}(x_1)}{q_{\mathrm{label}}(x_1)} \right\}^{\gamma_1}, \ldots, \left\{ \frac{q_{\mathrm{unlabel}}(x_{n_1})}{q_{\mathrm{label}}(x_{n_1})} \right\}^{\gamma_1} \right],$$
$$\hat{\Lambda} = \mathrm{diag}\left[ y_1 - \pi(x_1; \hat{w}), \ldots, y_{n_1} - \pi(x_{n_1}; \hat{w}) \right],$$
$$\hat{\Pi} = \mathrm{diag}\left[ \pi(x_1; \hat{w}), \ldots, \pi(x_{n_1}; \hat{w}) \right].$$

Note that the GIC in Equation (19) seemingly does not depend on all the adjusted parameters (in particular, $\gamma_2$). However, the GIC implicitly includes the adjusted parameters $(\lambda, \gamma_1, \gamma_2)$, since the estimator $\hat{w}$ depends on all of them. We choose the adjusted parameters as the minimizer of the GIC in Equation (19).
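The criterion can be computed directly from Equations (19)-(21). The following sketch is our own implementation, with `r_lab` holding the estimated density ratios at the labeled points; the adjusted parameters $(\lambda, \gamma_1, \gamma_2)$ would then be chosen by minimizing this quantity over a grid of candidate values:

```python
import numpy as np

def gic(w_hat, X_lab, y, r_lab, gamma1, lam):
    """GIC of Equation (19) with the bias term built from (20)-(21) (a sketch)."""
    n1, p = X_lab.shape
    Xs = np.hstack([np.ones((n1, 1)), X_lab])     # rows are x*_alpha
    K = np.eye(p + 1); K[0, 0] = 0.0
    eta = Xs @ w_hat
    pi = 1.0 / (1.0 + np.exp(-eta))
    wgt = r_lab**gamma1                           # diagonal of W_hat
    resid = y - pi                                # diagonal of Lambda_hat

    # Q = (1/n1) { X^T W^2 Lambda^2 X - lam * K w 1^T W Lambda X }
    Q = (Xs.T * (wgt**2 * resid**2)) @ Xs
    Q = Q - lam * np.outer(K @ w_hat, (wgt * resid) @ Xs)
    Q = Q / n1
    # R = (1/n1) X^T Pi W (I - Pi) X + lam * K
    R = (Xs.T * (pi * wgt * (1 - pi))) @ Xs / n1 + lam * K

    loglik = np.sum(wgt * (y * eta - np.logaddexp(0.0, eta)))
    return -2.0 * loglik + 2.0 * np.trace(Q @ np.linalg.inv(R))
```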
4 Numerical studies

We conducted numerical examples to show the efficiency of the proposed modeling strategy. Two types of Monte Carlo simulations and a benchmark data analysis are given to illustrate the proposed semi-supervised logistic discrimination.

4.1 Simulation 1

We investigated the effectiveness of the proposed modeling procedures through Monte Carlo simulations. In this simulation study, we generated data sets $\{(x_{1\alpha}, x_{2\alpha}, y_\alpha);\ \alpha = 1, \ldots, n\}$ as labeled data and $\{(x_{1\alpha}, x_{2\alpha});\ \alpha = 1, \ldots, 500\}$ as unlabeled data. In the labeled data, $(x_{1\alpha}, x_{2\alpha})$ were generated from the normal distribution $N\left((-0.9,\ 1 - \sin(\sin(0.9^2\pi)))^T,\ \mathrm{diag}(0.0015, 2)\right)$, and $y_\alpha$ was generated according to the conditional probability

$$\Pr(Y = 1 \mid x_1, x_2) = 1 \left/ \left[ 1 + \exp\left\{ -\sin(2\pi x_1^2) - x_2 + 1 \right\} \right] \right.. \tag{22}$$

Meanwhile, the unlabeled data $(x_{1\alpha}, x_{2\alpha})$ were obtained from the normal distribution $N\left((-0.4,\ 1 - \sin(\sin(0.4^2\pi)))^T,\ \mathrm{diag}(0.05, 1)\right)$. Test data $\{(x_{1\alpha}, x_{2\alpha}, y_\alpha);\ \alpha = 1, \ldots, 1000\}$ were generated as follows. First, $(x_{1\alpha}, x_{2\alpha})$ were drawn from a mixture of the labeled and unlabeled sampling distributions with equal mixing rates (that is, 0.5 each). Second, for each $(x_{1\alpha}, x_{2\alpha})$, $y_\alpha$ was obtained according to the conditional probability in Equation (22). We took labeled data sizes $n$ of 25, 50, 100, 150, 200, and 250.

We fitted our semi-supervised logistic regression model to the data sets. Note that the density ratio estimation procedure by the uLSIF method described in Section 2.2 was not performed in these simulation trials, since the density ratio can be calculated exactly. The simulation results were obtained by averaging over 50 repeated Monte Carlo trials. For each data set, we computed the average prediction error rate (PE) over the 50 iterations. The tuning parameters in our models were selected by the GIC in Equation (19), and we also computed the averages of the selected adjusted parameters over the 50 trials.
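For reference, a sketch of this data-generating process is given below; it is our own fragment, it reads $N(\mu, \mathrm{diag}(v_1, v_2))$ as specifying variances, and it follows the mean expression exactly as written above:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_prob(x1, x2):
    """Conditional probability (22)."""
    return 1.0 / (1.0 + np.exp(-np.sin(2 * np.pi * x1**2) - x2 + 1.0))

def sample_block(n, mu1, var1, var2, rng):
    """(x1, x2) from N((mu1, 1 - sin(sin(mu1^2 * pi)))^T, diag(var1, var2))."""
    mu2 = 1.0 - np.sin(np.sin(mu1**2 * np.pi))
    x1 = rng.normal(mu1, np.sqrt(var1), n)
    x2 = rng.normal(mu2, np.sqrt(var2), n)
    return np.column_stack([x1, x2])

n = 100                                             # labeled sample size
X_lab = sample_block(n, -0.9, 0.0015, 2.0, rng)
y = rng.binomial(1, true_prob(X_lab[:, 0], X_lab[:, 1]))
X_unlab = sample_block(500, -0.4, 0.05, 1.0, rng)
# Test data: equal mixture of the two sampling distributions
# (for simplicity, exactly half the points from each component).
X_test = np.vstack([sample_block(500, -0.9, 0.0015, 2.0, rng),
                    sample_block(500, -0.4, 0.05, 1.0, rng)])
y_test = rng.binomial(1, true_prob(X_test[:, 0], X_test[:, 1]))
```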
Table 1: Comparison of prediction error rates (%) and values of selected parameters for several numbers of labeled data points.

Method   Quantity      # of labeled data points
                         25      50     100     150     200     250
SSLRCS   PE            33.3    33.3    33.9    34.8    35.5    35.0
         log10(λ)     -2.20   -3.00   -3.18   -3.54   -3.80   -3.72
         γ1            0.10    0.10    0.10    0.10    0.10    0.10
         γ2            0.61    0.71    0.74    0.82    0.86    0.82
LSSLR    PE            34.3    34.4    34.2    35.3    35.9    35.6
         log10(ξ1)    -2.72   -3.36   -3.38   -3.72   -3.88   -3.92
SLR      PE            35.6    34.3    34.3    35.2    35.8    35.6
         log10(ξ2)    -2.06   -2.32   -2.80   -3.10   -3.50   -3.68

The results are summarized in Table 1. From the table, in the selection of the adjusted parameters, the value of the tuning parameter $\gamma_1$ is 0.10 in all cases, while the values of $\gamma_2$ increase with the number of labeled data points. The regularization parameter $\lambda$ takes smaller values as the number of labeled data points increases.

We compared the performance of the proposed semi-supervised methodology (SSLRCS: semi-supervised logistic regression under covariate shift) with that of the semi-supervised method proposed by Amini and Gallinari (2002) (LSSLR: linear semi-supervised logistic regression), which was developed under the condition that the density functions for the labeled and unlabeled data are the same, and with supervised linear logistic discriminant analysis (SLR: supervised logistic regression). Note that the SLR is constructed by using only the labeled data. The semi-supervised and supervised logistic modeling strategies were applied to the same data sets. The LSSLR and the SLR each include one tuning parameter, which we denote by $\xi_1$ and $\xi_2$, respectively. These parameters were also determined by the GIC, where the GIC for the LSSLR is obtained by setting $q_{\mathrm{unlabel}}(x_\alpha)/q_{\mathrm{label}}(x_\alpha) = 1\ (\alpha = 1, \ldots, n_1)$ in Equation (19), and that for the SLR is given by Ando et al. (2008). For these methods, we also computed the average prediction error rates and selected tuning parameters. It may be seen from Table 1 that the SSLRCS is superior to the other methods (LSSLR and SLR) in all cases, in the sense that the proposed method gives smaller prediction error rates.

4.2 Simulation 2

We simulated the three data sets given in Chakraborty (2011) to examine the performance of our proposed modeling strategy. For each of the simulation cases, we generated 100 data points in the labeled data set, 1000 data points in the unlabeled data set, and 1000 data points in the test data set. Using the data sets, we constructed the SSLRCS, the LSSLR, and the SLR, and we repeated the procedure 50 times. Our simulation settings are given as follows (for details, see Chakraborty, 2011, p. 76); an illustrative sampling sketch for Case 1 is given at the end of this subsection.

- Case 1: In the labeled data set, generate $x = (x_1, x_2)^T$ with $x_i \sim N(2, 1)\ (i = 1, 2)$ for Class 1 and $x_i \sim N(-2, 1)\ (i = 1, 2)$ for Class 2. In the unlabeled data set, $x_i \sim N(2, 2)\ (i = 1, 2)$ for Class 1 and $x_i \sim N(-2, 2)\ (i = 1, 2)$ for Class 2. In the test data set, $x_i \sim 0.5\,N(2, 1) + 0.5\,N(2, 2)\ (i = 1, 2)$ for Class 1 and $x_i \sim 0.5\,N(-2, 1) + 0.5\,N(-2, 2)\ (i = 1, 2)$ for Class 2.

- Case 2: Generate $x = (x_1, \ldots, x_{10})^T$ with $x_i \sim N(1, 3)\ (i = 1, \ldots, 10)$ for Class 1 and $x_i \sim N(-1, 3)\ (i = 1, \ldots, 10)$ for Class 2.

- Case 3: In the labeled data set, generate $x = (x_1, x_2)^T$ with $x_i \sim N(5, 2)\ (i = 1, 2)$ for Class 1 and $x_i \sim N(8, 2)\ (i = 1, 2)$ for Class 2. In the unlabeled data set, $x_i \sim N(6, 2)\ (i = 1, 2)$ for Class 1 and $x_i \sim N(9, 2)\ (i = 1, 2)$ for Class 2. In the test data set, $x_i \sim 0.5\,N(5, 2) + 0.5\,N(6, 2)\ (i = 1, 2)$ for Class 1 and $x_i \sim 0.5\,N(8, 2) + 0.5\,N(9, 2)\ (i = 1, 2)$ for Class 2.

Table 2: Comparison of prediction error rates (%) and values of selected parameters for the three cases.

Method   Quantity     Case 1   Case 2   Case 3
SSLRCS   PE            1.28     3.65     9.72
         log10(λ)     -2.50    -1.98    -1.98
         γ1            1.00     1.00     1.00
         γ2            0.102    0.106    0.106
LSSLR    PE            1.36     4.19    11.6
         log10(ξ1)    -2.50    -2.00    -3.00
SLR      PE            1.43     5.05    11.7
         log10(ξ2)    -2.50    -1.96    -2.18

The results from the simulation studies are given in Table 2; the values in the table are averages over the 50 trials. The optimal tuning parameters selected by our model selection criterion were 1.00 for $\gamma_1$ in all situations, 0.102 for $\gamma_2$ in Case 1 and 0.106 in Cases 2 and 3, and $10^{-2.50}$ for $\lambda$ in Case 1 and $10^{-1.98}$ in Cases 2 and 3. From the simulation results, we observe that our proposed procedure performs well in all cases with respect to minimizing the prediction error rates, even though Case 2 is an ordinary setting of semi-supervised learning, i.e., the density function for the labeled data is the same as that for the unlabeled data. Hence, we conclude that our proposed method may be useful even if the densities for the labeled and unlabeled data are the same.
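For concreteness, the following fragment (our own sketch) generates the Case 1 data; it reads $N(\mu, \sigma^2)$ as mean and variance, and it draws each test point's mixture component at random with probability 0.5:

```python
import numpy as np

rng = np.random.default_rng(1)

def case1_block(n, sd, rng):
    """n points per class: N(2, sd^2) coords for Class 1, N(-2, sd^2) for Class 2."""
    X = np.vstack([rng.normal(2.0, sd, (n, 2)),     # Class 1 (y = 1)
                   rng.normal(-2.0, sd, (n, 2))])   # Class 2 (y = 0)
    y = np.r_[np.ones(n), np.zeros(n)]
    return X, y

X_lab, y_lab = case1_block(50, 1.0, rng)            # 100 labeled points, variance 1
X_unlab, _ = case1_block(500, np.sqrt(2.0), rng)    # 1000 unlabeled points, variance 2
# Test set: for each point, pick the variance-1 or variance-2 component
# with probability 0.5 each (the 0.5/0.5 mixture of the two densities).
comp = rng.random(1000) < 0.5
sds = np.where(comp, 1.0, np.sqrt(2.0))
means = np.r_[np.full(500, 2.0), np.full(500, -2.0)]  # half per class
X_test = rng.normal(means[:, None], sds[:, None], (1000, 2))
y_test = np.r_[np.ones(500), np.zeros(500)]
```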
4.3 Benchmark data analysis

Through analyses of the g10 data set (Chapelle and Zien, 2005), the ionosphere data set (Sigillito et al., 1989), and the pima data set (Ripley, 1996), we illustrate the effectiveness of the proposed semi-supervised methodology. The g10 data set includes 550 data points with 10 predictors, and we prepared 250 training data points and 300 test data points. The ionosphere data set consists of 356 data points with 33 predictors, and we split the 356 data points into 150 training data points and 206 test data points. The pima data set, which consists of 300 training data points and 232 test data points, is a binary classification problem with 7 predictors.

In order to implement the semi-supervised procedure, the training data points were randomly split into labeled and unlabeled subsets, where the labeled data points were assigned as 5%, 10%, 20%, 30%, 40%, and 50% of the training data points, respectively. We repeated the random splitting 50 times. We again compared our proposed method (SSLRCS) with the LSSLR and the SLR described in Section 4.1.

Table 3 shows the summary of the prediction errors and selected adjusted parameters for the benchmark data sets; the values in the table were averaged over the 50 repetitions.

Table 3: Comparison of prediction error rates (%) and values of selected parameters for the benchmark data sets.

Data set    Method   Quantity    % of labeled data points
                                    5      10      20      30      40      50
g10         SSLRCS   PE           3.40    3.47    3.85    4.06    4.66    5.42
                     log10(λ)    -3.20   -2.97   -2.99   -3.00   -3.00   -3.00
                     γ1           1.00    1.00    1.00    1.00    1.00    1.00
                     γ2           0.15    0.10    0.10    0.10    0.10    0.10
            LSSLR    PE          26.6    16.2     9.94    7.04    5.66    4.77
                     log10(ξ1)   -3.50   -3.00   -3.00   -3.00   -3.00   -3.00
            SLR      PE          26.4    16.4     9.30    6.85    5.45    4.62
                     log10(ξ2)   -3.50   -3.00   -3.00   -3.00   -3.00   -3.00
Ionosphere  SSLRCS   PE          18.2    17.3    16.9    16.4    17.3    16.8
                     log10(λ)    -2.89   -2.86   -2.70   -2.44   -2.61   -2.66
                     γ1           0.99    0.99    1.00    1.00    1.00    1.00
                     γ2           0.50    0.46    0.37    0.27    0.37    0.35
            LSSLR    PE          29.0    22.8    18.9    17.4    16.2    15.4
                     log10(ξ1)   -3.92   -3.50   -3.50   -3.00   -3.00   -3.00
            SLR      PE          28.9    23.1    19.5    18.0    16.7    15.7
                     log10(ξ2)   -3.92   -3.50   -3.50   -3.00   -3.00   -3.00
Pima        SSLRCS   PE          26.6    26.9    26.6    26.8    26.7    26.7
                     log10(λ)     1.41    1.53    1.35    1.30    1.27    1.36
                     γ1           1.00    1.00    1.00    1.00    1.00    1.00
                     γ2           0.30    0.28    0.26    0.23    0.24    0.23
            LSSLR    PE          30.1    27.0    27.0    27.0    26.9    26.7
                     log10(ξ1)    1.27    1.41    1.53    1.72    1.71    1.61
            SLR      PE          29.3    26.9    26.9    27.0    26.8    26.7
                     log10(ξ2)    2.46    2.37    2.34    2.23    2.16    2.10

From the results, we observe that the tuning parameter $\gamma_1$ takes the largest value (i.e., 1.00) in almost all cases, while the parameter $\gamma_2$ takes relatively smaller values (i.e., from 0.10 to 0.40). We also find that our proposed procedure outperforms the previously proposed methods in almost all situations, although it is unclear whether the densities for the labeled and unlabeled data are actually different. In particular, the proposed method seems to work well when the number of labeled data points is small.
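The random labeled/unlabeled splitting used above can be sketched as follows (an illustration with our own names, not code from the original study):

```python
import numpy as np

def split_train(X_train, y_train, frac_labeled, rng):
    """Randomly designate a fraction of the training points as labeled;
    the rest enter the model as unlabeled data (their y is withheld)."""
    n = X_train.shape[0]
    n_lab = max(1, int(round(frac_labeled * n)))
    idx = rng.permutation(n)
    lab, unlab = idx[:n_lab], idx[n_lab:]
    return X_train[lab], y_train[lab], X_train[unlab]

rng = np.random.default_rng(2)
# for frac in (0.05, 0.10, 0.20, 0.30, 0.40, 0.50):  # repeat 50 times each
#     X_lab, y_lab, X_unlab = split_train(X_train, y_train, frac, rng)
```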
5 Concluding remarks

We proposed a semi-supervised logistic classification methodology for different density functions of labeled and unlabeled data, together with the techniques of covariate shift adaptation and regularization. A crucial point in our semi-supervised modeling process is the choice of the tuning parameters in the proposed models. We introduced a model selection criterion from the viewpoint of information theory in order to select the values of the adjusted parameters. Through Monte Carlo simulations and benchmark data analyses, we showed that our modeling strategy is effective in practical situations, in the sense of yielding lower prediction errors than previously developed methods. Our modeling procedure may also be applied to the problem of constructing a nonlinear semi-supervised classification method based on basis expansions, which will be discussed in another paper.

Acknowledgement

This work was supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Young Scientists (B), #24700280, 2012-2015.

References

[1] Amini, M.-R. and Gallinari, P. (2002). Semi-supervised logistic regression. Proc. 15th Eur. Conf. Artif. Intell., 390-394.
[2] Ando, T., Konishi, S. and Imoto, S. (2008). Nonlinear regression modeling via regularized radial basis function networks. J. Statist. Plann. Inference 138, 3616-3633.
[3] Basu, S., Bilenko, M. and Mooney, R. J. (2004). A probabilistic framework for semi-supervised clustering. Proc. 10th ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., ACM Press, 59-68.
[4] Bennett, K. P. and Demiriz, A. (1998). Semi-supervised support vector machines. Adv. Neural Inform. Process. Syst. 11, 368-374.
[5] Bickel, S., Brückner, M. and Scheffer, T. (2009). Discriminative learning under covariate shift. J. Mach. Learn. Res. 10, 2137-2155.
[6] Chakraborty, S. (2011). Bayesian semi-supervised learning with support vector machine. Statist. Methodol. 8, 68-82.
[7] Chapelle, O., Schölkopf, B. and Zien, A. (2006). Semi-Supervised Learning. Cambridge, MA: MIT Press.
[8] Chapelle, O. and Zien, A. (2005). Semi-supervised classification by low density separation. Proc. 10th Int. Workshop Artif. Intell. Stat., 57-64.
[9] Chen, K. and Wang, S. (2007). Regularized boost for semi-supervised learning. Adv. Neural Inform. Process. Syst. 20, 281-288.
[10] Dean, N., Murphy, T. B. and Downey, G. (2006). Using unlabelled data to update classification rules with applications in food authenticity studies. J. Roy. Statist. Soc. Ser. C 55, 1-14.
[11] Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. New York: Springer.
[12] Hirose, K., Kawano, S. and Konishi, S. (2008). Bayesian factor analysis and information criterion. Bull. Inform. Cybernet. 40, 75-87.
[13] Huang, J., Smola, A., Gretton, A., Borgwardt, K. M. and Schölkopf, B. (2007). Correcting sample selection bias by unlabeled data. Adv. Neural Inform. Process. Syst. 19, 601-608.
[14] Huber, P. (2004). Robust Statistics. New York: Wiley.
[15] Irizarry, R. A. (2001). Information and posterior probability criteria for model selection in local likelihood estimation. J. Am. Stat. Assoc. 96, 303-315.
[16] Jiang, J. and Zhai, C.-X. (2007). Instance weighting for domain adaptation in NLP. Proc. 45th Annu. Meet. Assoc. Comput. Linguist., 264-271.
[17] Kanamori, T., Hido, S. and Sugiyama, M. (2009). A least-squares approach to direct importance estimation. J. Mach. Learn. Res. 10, 1391-1445.
[18] Kawakita, M. and Kanamori, T. (2012). Semi-supervised learning with density-ratio estimation. Preprint, arXiv:1204.3965.
[19] Kawano, S. and Konishi, S. (2011). Semi-supervised logistic discrimination via regularized Gaussian basis expansions. Comm. Statist. Theory Methods 40, 2412-2423.
[20] Kawano, S., Misumi, T. and Konishi, S. (2012). Semi-supervised logistic discrimination via graph-based regularization. Neural Process. Lett., to appear.
[21] Konishi, S. and Kitagawa, G. (1996). Generalised information criteria in model selection. Biometrika 83, 875-890.
[22] Konishi, S. and Kitagawa, G. (2008). Information Criteria and Statistical Modeling. New York: Springer.
[23] Lafferty, J. and Wasserman, L. (2007). Statistical analysis of semi-supervised regression. Adv. Neural Inform. Process. Syst. 21, 801-808.
[24] Liang, F., Mukherjee, S. and West, M. (2007). The use of unlabeled data in predictive modeling. Statist. Sci. 22, 189-205.
[25] Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.
[26] Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Statist. Plann. Inference 90, 227-244.
[27] Sigillito, V. G., Wing, S. P., Hutton, L. V. and Baker, K. B. (1989). Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Tech. Digest 10, 262-266.
[28] Sokolovska, N., Cappé, O. and Yvon, F. (2008). The asymptotics of semi-supervised learning in discriminative probabilistic models. Proc. 25th Int. Conf. Mach. Learn., 984-991.
[29] Sugiyama, M. and Kawanabe, M. (2012). Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. Cambridge, MA: MIT Press.
[30] Sugiyama, M., Suzuki, T. and Kanamori, T. (2012). Density Ratio Estimation in Machine Learning. Cambridge: Cambridge University Press.
[31] Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P. and Kawanabe, M. (2008). Direct importance estimation for covariate shift adaptation. Ann. Inst. Statist. Math. 60, 699-746.
[32] Vittaut, J.-N., Amini, M.-R. and Gallinari, P. (2002). Learning classification with both labeled and unlabeled data. Proc. 13th Eur. Conf. Mach. Learn., 468-479.
[33] Wu, D., Lee, W. S. and Ye, N. (2009). Domain adaptive bootstrapping for named entity recognition. Proc. 2009 Conf. Empir. Methods Nat. Lang. Process., 1523-1532.
[34] Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. Proc. 21st Int. Conf. Mach. Learn., 114-121.
[35] Zhou, D., Bousquet, O., Lal, T. N., Weston, J. and Schölkopf, B. (2004). Learning with local and global consistency. Adv. Neural Inform. Process. Syst. 16, 321-328.
[36] Zhu, X. (2008). Semi-supervised learning literature survey. Computer Sciences Technical Report 1530, University of Wisconsin-Madison.
[37] Zou, H., Zhu, J., Rosset, S. and Hastie, T. (2007). Automatic bias correction methods in semi-supervised learning. Contemp. Math. 443, 165-175.
