Condition Number Analysis of Kernel-based Density Ratio Estimation
Takafumi Kanamori (Nagoya University, kanamori@is.nagoya-u.ac.jp)
Taiji Suzuki (University of Tokyo, s-taiji@stat.t.u-tokyo.ac.jp)
Masashi Sugiyama (Tokyo Institute of Technology, sugi@cs.titech.ac.jp)

Abstract

The ratio of two probability densities can be used for solving various machine learning tasks such as covariate shift adaptation (importance sampling), outlier detection (likelihood-ratio test), and feature selection (mutual information). Recently, several methods of directly estimating the density ratio have been developed, e.g., kernel mean matching, maximum likelihood density ratio estimation, and least-squares density ratio fitting. In this paper, we consider a kernelized variant of the least-squares method and investigate its theoretical properties from the viewpoint of the condition number using smoothed analysis techniques—the condition number of the Hessian matrix determines the convergence rate of optimization and the numerical stability. We show that the kernel least-squares method has a smaller condition number than a version of kernel mean matching and other M-estimators, implying that the kernel least-squares method has preferable numerical properties. We further give an alternative formulation of the kernel least-squares estimator which is shown to possess an even smaller condition number. We show that numerical studies agree with our theoretical analysis.

1 Introduction

The problem of estimating the ratio of two probability densities is attracting a great deal of attention these days, since the density ratio can be used for various purposes such as covariate shift adaptation (Shimodaira, 2000; Zadrozny, 2004; Sugiyama & Müller, 2005; Huang et al., 2007; Sugiyama et al., 2007; Bickel et al., 2009), outlier detection (Schölkopf et al., 2001; Tax & Duin, 2004; Hodge & Austin, 2004; Hido et al., 2008), and divergence estimation (Nguyen et al., 2008; Suzuki et al., 2008).

A naive approach to density ratio estimation is to first separately estimate the two probability densities and then take the ratio of the estimated densities. However, density estimation is known to be a hard problem, particularly in high-dimensional cases, unless we have simple and good parametric density models (Vapnik, 1998; Härdle et al., 2004), which may not be the case in practice. Recently, methods of directly estimating the density ratio without going through density estimation have been developed. The kernel mean matching (KMM) method (Huang et al., 2007) directly gives estimates of the density ratio by matching the two distributions efficiently using a special property of universal reproducing kernel Hilbert spaces (RKHSs) (Steinwart, 2001). Another approach is an M-estimator (Nguyen et al., 2008) based on the non-asymptotic variational characterization of the f-divergence (Ali & Silvey, 1966; Csiszár, 1967). See also Sugiyama et al. (2008a) for a similar algorithm under the Kullback-Leibler divergence. Non-parametric convergence properties of the M-estimator in RKHSs have been elucidated under the Kullback-Leibler divergence (Nguyen et al., 2008; Sugiyama et al., 2008b).
A squared-loss version of the M-estimator for linear density-ratio models, called unconstrained Least-Squares Importance Fitting (uLSIF), has been developed and shown to possess useful computational properties, e.g., a closed-form solution is available and the leave-one-out cross-validation score can be analytically computed (Kanamori et al., 2009). In this paper, we consider a kernelized variant of uLSIF (KuLSIF) and analyze its properties in numerical optimization from the viewpoint of the condition number. The condition number of the Hessian matrix of the objective function plays a crucial role (Luenberger & Ye, 2008; Bertsekas, 1996), i.e., it determines the convergence rate of optimization and the numerical stability.

When an objective function to be optimized is randomly chosen and fed into an optimization algorithm, the computational cost of the algorithm can be assessed by the distribution of the condition number. The distribution of condition numbers of randomly perturbed matrices has been studied under the name of smoothed analysis (Spielman & Teng, 2004; Sankar et al., 2006). Smoothed analysis was originally introduced to explain the success of algorithms and heuristics that could not be well understood through traditional worst-case and average-case analysis—it gives a more realistic analysis of the practical performance of algorithms. We apply smoothed analysis techniques to derive the distribution of the condition number of density-ratio estimation algorithms.

More specifically, we first give a unified view of the objective functions of KuLSIF and KMM. Then we show that KuLSIF has a smaller condition number than an "induction" variant of KMM, implying that KuLSIF is preferable to KMM in optimization. We further show that KuLSIF—which can be regarded as an instance of M-estimators—has the smallest condition number among all M-estimators in the min-max sense (i.e., the worst condition number over all density ratio functions is the smallest for KuLSIF). We also give a probabilistic evaluation of the condition number of M-estimators and show that KuLSIF is favorable. These theoretical findings are also verified through numerical experiments. We further give an alternative formulation of KuLSIF, denoted as Reduced-KuLSIF, and show that it possesses an even smaller condition number.

The rest of this paper is organized as follows. In Section 2, we formulate the problem of density ratio estimation and briefly review existing methods. In Section 3, we describe the KuLSIF algorithm and show its fundamental properties such as the convergence rate and the availability of the analytic-form solution and the analytic-form leave-one-out cross-validation score. In Section 4, we show the relation between KuLSIF and KMM. Section 5 is the main contribution of this paper, giving a condition number analysis of density ratio estimation methods. In Section 6, we give an alternative formulation of KuLSIF by transforming loss functions and show that it possesses an even smaller condition number. In Section 7, we experimentally investigate the behavior of the condition numbers, confirming the validity of our theories. In Section 8, we conclude by summarizing our contributions and showing possible future directions.
2 Estimation of Density Ratio

We formulate the problem of density ratio estimation and briefly review existing methods.

2.1 Formulation and Notations

Consider two probability distributions P and Q on a probability space $\mathcal Z$. Assume that both distributions have probability densities p and q, respectively, and that p(x) > 0 for all $x \in \mathcal Z$. Suppose that we are given two sets of independent and identically distributed (i.i.d.) samples,

$$X_1,\ldots,X_n \overset{\text{i.i.d.}}{\sim} P,\qquad Y_1,\ldots,Y_m \overset{\text{i.i.d.}}{\sim} Q. \tag{1}$$

Our goal is to estimate the density ratio $w_0(x) = q(x)/p(x)\ (\geq 0)$ based on the observed samples.

We summarize some notations used throughout the paper. For a vector $a$ in the Euclidean space, $\|a\|$ denotes the Euclidean norm. Given a probability distribution P and a random variable $h(X)$, we denote the expectation of $h(X)$ under P by $\int h\,dP$ or $\int h(x)P(dx)$. Given samples $X_1,\ldots,X_n$ from P, the empirical distribution is denoted by $P_n$; the expectation $\int h\,dP_n$ denotes the empirical mean of $h(X)$, that is, $\frac1n\sum_{i=1}^n h(X_i)$. Let $\|\cdot\|_\infty$ be the infinity norm, and $\|\cdot\|_P$ be the $L_2$-norm under the probability P, i.e., $\|h\|_P^2 = \int |h|^2\,dP$. For a reproducing kernel Hilbert space (RKHS) $\mathcal H$ (Schölkopf & Smola, 2002), the inner product and the norm on $\mathcal H$ are denoted by $\langle\cdot,\cdot\rangle_{\mathcal H}$ and $\|\cdot\|_{\mathcal H}$, respectively. Below we review several approaches to density ratio estimation.

2.2 Kernel Mean Matching

The kernel mean matching (KMM) method allows us to directly obtain an estimate of $w_0(x)$ at $X_1,\ldots,X_n$ without going through density estimation (Huang et al., 2007). The basic idea of KMM is to find $w_0(x)$ such that the mean discrepancy between non-linearly transformed samples drawn from P and Q is minimized in a universal reproducing kernel Hilbert space (Steinwart, 2001). We introduce the definition of a universal kernel below.

Definition 1 (Steinwart (2001)). A continuous kernel k on a compact metric space $\mathcal Z$ is called universal if the RKHS $\mathcal H$ of k is dense in the set of all continuous functions on $\mathcal Z$; that is, for every continuous function g on $\mathcal Z$ and all $\varepsilon > 0$, there exists an $f \in \mathcal H$ such that $\|f - g\|_\infty < \varepsilon$. The corresponding RKHS is called a universal RKHS.

The Gaussian kernel is an example of a universal kernel. Let $\mathcal H$ be a universal RKHS endowed with the kernel function $k : \mathcal Z \times \mathcal Z \to \Re$. For any $x \in \mathcal Z$, the function $k(\cdot,x)$ is regarded as an element of $\mathcal H$. Then it has been shown that the solution of the following optimization problem agrees with the true density ratio $w_0$:

$$\min_w\ \frac12\left\|\int w(x)k(\cdot,x)P(dx) - \int k(\cdot,y)Q(dy)\right\|_{\mathcal H}^2,\quad \text{s.t. } \int w\,dP = 1 \text{ and } w \ge 0.$$

Indeed, when $w = w_0$, the loss function equals zero. An empirical version of the above problem reduces to the following convex quadratic program:

$$\min_{w_1,\ldots,w_n}\ \frac{1}{2n}\sum_{i,j=1}^n w_iw_jk(X_i,X_j) - \frac1m\sum_{j=1}^m\sum_{i=1}^n w_ik(X_i,Y_j),\quad \text{s.t. } \left|\frac1n\sum_{i=1}^n w_i - 1\right| \le \epsilon \text{ and } 0 \le w_1,\ldots,w_n \le B. \tag{2}$$

The tuning parameters $B \ge 0$ and $\epsilon \ge 0$ control the regularization effects. The solution $\widehat w_1,\ldots,\widehat w_n$ is an estimate of the density ratio at the samples from P, i.e., $w_0(X_1),\ldots,w_0(X_n)$. Note that KMM does not estimate the function $w_0$ on $\mathcal Z$ but only its values at the sample points (i.e., transduction).
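To make the procedure concrete, the following is a minimal sketch of the empirical KMM problem (2) under an assumed synthetic Gaussian setup; SciPy's generic SLSQP solver is used in place of a dedicated quadratic-program solver, and all data and parameter choices are illustrative rather than part of the original method. Later sketches in this paper reuse the variables defined here.

```python
# A minimal sketch of the empirical KMM problem (2); the data-generating
# setup and the solver choice (SLSQP instead of a dedicated QP solver)
# are assumptions of this illustration.
import numpy as np
from scipy.optimize import minimize

def gram(A, B, sigma=1.0):
    # Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
n, m = 50, 50
X = rng.normal(0.0, 1.0, size=(n, 2))   # samples from P
Y = rng.normal(0.5, 1.0, size=(m, 2))   # samples from Q
K11, K12 = gram(X, X), gram(X, Y)
B_bound, eps = 10.0, 0.01               # tuning parameters B and epsilon in (2)

def kmm_objective(w):
    # (1/2n) sum_ij w_i w_j k(X_i,X_j) - (1/m) sum_ij w_i k(X_i,Y_j)
    return w @ K11 @ w / (2.0 * n) - (K12 @ np.ones(m)) @ w / m

constraints = [
    {"type": "ineq", "fun": lambda w: eps - (w.mean() - 1.0)},  #  mean(w)-1 <= eps
    {"type": "ineq", "fun": lambda w: eps + (w.mean() - 1.0)},  # -(mean(w)-1) <= eps
]
res = minimize(kmm_objective, np.ones(n), method="SLSQP",
               bounds=[(0.0, B_bound)] * n, constraints=constraints)
w_hat = res.x   # transductive estimates of w0(X_1), ..., w0(X_n)
```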
2.3 M-estimator Based on the f-divergence Approach

An estimator of the density ratio based on the f-divergence (Ali & Silvey, 1966; Csiszár, 1967) has been proposed by Nguyen et al. (2008). Let $\varphi : \Re \to \Re$ be a convex function; then the f-divergence between P and Q is defined by the integral

$$I(P,Q) = \int\varphi(q/p)\,dP.$$

Setting $\varphi(z) = -\log z$, we obtain the Kullback-Leibler divergence as an example of f-divergences. Let the conjugate dual function $\psi$ of $\varphi$ be

$$\psi(z) = \sup_{u\in\Re}\{zu - \varphi(u)\} = -\inf_{u\in\Re}\{\varphi(u) - zu\}.$$

When $\varphi$ is a convex function, we also have

$$\varphi(z) = -\inf_{u\in\Re}\{\psi(u) - zu\}. \tag{3}$$

Substituting (3) into the f-divergence, we obtain another expression,

$$I(P,Q) = -\inf_w\left(\int\psi(w)\,dP - \int w\,dQ\right), \tag{4}$$

where the infimum is taken over all measurable functions $w : \mathcal Z \to \Re$. The infimum is attained at the function w such that $q(x)/p(x) = \psi'(w(x))$, where $\psi'$ is the derivative of $\psi$. Approximating (4) with the empirical distributions $P_n$ and $Q_m$, we obtain the empirical loss function. This estimator is referred to as the M-estimator of the density ratio. A more practical algorithm for the Kullback-Leibler divergence has been independently proposed in Sugiyama et al. (2008a).

When an RKHS $\mathcal H$ is employed as a statistical model, an estimator is obtained by minimizing the loss function which approximates (4) over $\mathcal H$,

$$\inf_{w\in\mathcal H}\ \int\psi(w)\,dP_n - \int w\,dQ_m + \frac\lambda2\|w\|_{\mathcal H}^2. \tag{5}$$

The density ratio $w_0$ is estimated by $\psi'(\widehat w(x))$, where $\widehat w$ is the minimizer of (5). The regularization term $\frac\lambda2\|w\|_{\mathcal H}^2$ with the regularization parameter $\lambda$ is introduced to avoid overfitting. In the RKHS $\mathcal H$, the representer theorem (Kimeldorf & Wahba, 1971) is applicable, and the optimization problem on $\mathcal H$ is reduced to a finite-dimensional optimization problem. Statistical convergence properties of the kernel estimator for the Kullback-Leibler divergence have been investigated in Nguyen et al. (2008) and Sugiyama et al. (2008b).

2.4 Least-squares Approach

The linear model

$$\widehat w(x) = \sum_{i=1}^b\alpha_ih_i(x) \tag{6}$$

is assumed for estimation of the density ratio $w_0$, where the coefficients $\alpha_1,\ldots,\alpha_b$ are the parameters of the model. The basis functions $h_i$, $i = 1,\ldots,b$ are chosen so that the non-negativity condition $h_i(x) \ge 0$ is satisfied. A practical choice would be the Gaussian kernel function $h_i(x) = e^{-\|x-c_i\|^2/2\sigma^2}$ with an appropriate kernel center $c_i \in \mathcal Z$ and kernel width $\sigma$ (Sugiyama et al., 2008a). The unconstrained least-squares importance fitting (uLSIF) method (Kanamori et al., 2009) estimates the parameter $\alpha$ based on the square error:

$$\frac12\int(\widehat w - w_0)^2\,dP = \frac12\int\widehat w^2\,dP - \int\widehat w\,dQ + \frac12\int w_0^2\,dP.$$

The last term in the above expression is a constant and can be safely ignored when minimizing the square error of the estimator $\widehat w$. Therefore, the solution of the following minimization problem over the linear model,

$$\min_w\ \frac12\int w^2\,dP_n - \int w\,dQ_m + \lambda\cdot\mathrm{Reg}(\alpha), \tag{7}$$

is expected to approximate the true density ratio $w_0$, where the regularization term $\mathrm{Reg}(\alpha)$ with the regularization parameter $\lambda$ is introduced to avoid overfitting. We define the column vector $\alpha = (\alpha_1,\ldots,\alpha_b)^\top$ and the vector-valued function $h(x) = (h_1(x),\ldots,h_b(x))^\top$.
Substituting the linear model (6) into the objective function of (7), we obtain

$$\min_{\alpha\in\Re^b}\ \frac12\alpha^\top\widehat H\alpha - \widehat g^\top\alpha + \lambda\cdot\mathrm{Reg}(\alpha), \tag{8}$$

where $\widehat H$ and $\widehat g$ are the b-by-b matrix and the b-dimensional vector defined as $\widehat H = \int hh^\top dP_n$ and $\widehat g = \int h\,dQ_m$, respectively. Let $\widehat\alpha$ be the minimizer of (8); then the estimator of $w_0$ is given as $\widehat w(x) = \sum_{i=1}^b\widehat\alpha_ih_i(x)$. There are several ways to impose the non-negativity condition $\widehat w(x) \ge 0$ (Kanamori et al., 2009). Here, the truncation of $\widehat w$, defined as $\widehat w_+(x) = \max\{\widehat w(x), 0\}$, is used to obtain a non-negative estimator. Note that the loss function (4) with $\psi(z) = z^2/2$ is essentially equivalent to the loss of uLSIF.

uLSIF has an advantage in computation over other M-estimators: when $\mathrm{Reg}(\alpha) = \|\alpha\|^2/2$, the estimator $\widehat\alpha$ can be obtained in an analytic form. As a result, the leave-one-out cross-validation (LOOCV) score can also be computed in a closed form (Kanamori et al., 2009), which allows us to compute the LOOCV score very efficiently. LOOCV is an (almost) unbiased estimator of the prediction error and can be used for determining hyper-parameters such as the regularization parameter $\lambda$ or the Gaussian kernel width $\sigma$.

3 Kernel uLSIF

The purpose of this paper is to show that a kernelized variant of uLSIF (which we refer to as kernel uLSIF; KuLSIF) has good theoretical properties and is thus useful. In this section, we formalize the KuLSIF algorithm and briefly show its fundamental properties. Then, in Section 5, we analyze the computational efficiency of the KuLSIF algorithm from the viewpoint of the condition number.

3.1 uLSIF on RKHS

We assume that the model for the density ratio is an RKHS $\mathcal H$ endowed with a kernel function k on $\mathcal Z \times \mathcal Z$, and we consider the optimization problem (7) on $\mathcal H$. According to (7), the estimator $\widehat w$ is obtained as

$$\min_{w\in\mathcal H}\ \frac12\int w^2\,dP_n - \int w\,dQ_m + \frac\lambda2\|w\|_{\mathcal H}^2. \tag{9}$$

The regularization term $\frac\lambda2\|w\|_{\mathcal H}^2$ with the regularization parameter $\lambda\ (\ge 0)$ is introduced to avoid overfitting. The truncated estimator $\widehat w_+ = \max\{\widehat w, 0\}$ may be preferable in practice; the estimation procedure of $\widehat w$ or $\widehat w_+$ based on (9) is called KuLSIF. The following theorem reveals the convergence rate of the estimators $\widehat w$ and $\widehat w_+$.

Theorem 1 (Convergence Rate of KuLSIF). Assume that the domain $\mathcal Z$ is compact. Let $\mathcal H$ be an RKHS with the Gaussian kernel. Suppose that $q/p = w_0 \in \mathcal H$ and $\|w_0\|_{\mathcal H} < \infty$. Set the regularization parameter $\lambda = \lambda_{n,m}$ so that

$$\lim_{n,m\to\infty}\lambda_{n,m} = 0,\qquad \lambda_{n,m}^{-1} = O((n\wedge m)^{1-\delta}),$$

where $n \wedge m = \min\{n, m\}$ and $\delta$ is an arbitrary number satisfying $0 < \delta < 1$. Then the estimators $\widehat w$ and $\widehat w_+$ satisfy

$$\|\widehat w_+ - w_0\|_P \le \|\widehat w - w_0\|_P = O_p(\lambda_{n,m}^{1/2}),$$

where $\|\cdot\|_P$ is the $L_2$-norm under the probability P.

Proofs may be found in Appendix A. By choosing small $\delta > 0$, the convergence rate gets close to the order of $O(1/\sqrt{n\wedge m})$, which is the convergence rate for parametric models. See Nguyen et al. (2008) and Sugiyama et al. (2008b) for similar convergence analyses under the Kullback-Leibler divergence.

Remark 1. Although Theorem 1 focuses on the Gaussian kernel, the extension to other kernels is straightforward. Let $\mathcal Z$ be a probability space, let k be a kernel function over $\mathcal Z \times \mathcal Z$, and suppose $\sup_{x\in\mathcal Z}k(x,x) < \infty$.
According to the proof of Theorem 1, we assume that the bracketing entropy $H_B(\delta, \mathcal H_M, P)$ is bounded above by $O((M/\delta)^\gamma)$, where $0 < \gamma < 2$ (see the proof in Appendix A for the definition). Then we obtain

$$\|\widehat w_+ - w_0\|_P \le \|\widehat w - w_0\|_P = O_p(\lambda_{n,m}^{1/2}),$$

where $\lambda_{n,m}^{-1} = O((n\wedge m)^{1-\delta})$ with $1 - 2/(2+\gamma) < \delta < 1$.

3.2 Analytic-form Solution of KuLSIF

The problem (9) is an infinite-dimensional optimization problem if the dimension of $\mathcal H$ is infinite. The representer theorem (Kimeldorf & Wahba, 1971), however, is applicable to RKHSs, and then we immediately have the following theorem.

Theorem 2. Suppose the samples (1) are observed. The estimator $\widehat w$ given as the solution of (9) has the form

$$\widehat w(z) = \sum_{i=1}^n\alpha_ik(z,X_i) + \sum_{j=1}^m\beta_jk(z,Y_j), \tag{10}$$

where $\alpha_1,\ldots,\alpha_n,\beta_1,\ldots,\beta_m \in \Re$.

The theorem follows from a direct application of the original representer theorem, so we omit its proof. This theorem shows that the estimator $\widehat w$ lies in a finite-dimensional subspace of $\mathcal H$. Furthermore, for KuLSIF (i.e., the squared loss), the parameters in $\widehat w(z)$ can be obtained analytically. Let $K_{11}$, $K_{12}$, $K_{21}$, and $K_{22}$ be the sub-matrices of the Gram matrix:

$$(K_{11})_{ii'} = k(X_i,X_{i'}),\quad (K_{12})_{ij} = k(X_i,Y_j),\quad K_{21} = K_{12}^\top,\quad (K_{22})_{jj'} = k(Y_j,Y_{j'}),$$

where $i,i' = 1,\ldots,n$ and $j,j' = 1,\ldots,m$. Let $1_m = (1,\ldots,1)^\top \in \Re^m$ for a positive integer m. Then the estimated parameters $\alpha_i$ and $\beta_j$ are given as follows.

Theorem 3 (Analytic Solution of KuLSIF). Suppose that the regularization parameter $\lambda$ is strictly positive. Then the estimated parameters in KuLSIF are given as

$$\alpha = (\alpha_1,\ldots,\alpha_n)^\top = -\frac{1}{m\lambda}(K_{11} + n\lambda I_n)^{-1}K_{12}1_m, \tag{11}$$
$$\beta = (\beta_1,\ldots,\beta_m)^\top = \frac{1}{m\lambda}1_m, \tag{12}$$

where $I_n$ is the n-by-n identity matrix.

Proof. We start by proving the theorem for the general M-estimator based on f-divergences. We consider the minimization of the loss function

$$\int\psi(w)\,dP_n - \int w\,dQ_m + \frac\lambda2\|w\|_{\mathcal H}^2$$

subject to

$$w = \sum_{j=1}^n\alpha_jk(\cdot,X_j) + \sum_{\ell=1}^m\beta_\ell k(\cdot,Y_\ell).$$

Suppose $\psi$ is a differentiable convex function. Let $v(\alpha,\beta) \in \Re^n$ be a vector-valued function defined as

$$v(\alpha,\beta)_i = \psi'\Big(\sum_{j=1}^n\alpha_jk(X_i,X_j) + \sum_{\ell=1}^m\beta_\ell k(X_i,Y_\ell)\Big),\quad i = 1,\ldots,n,$$

where $\psi'$ denotes the derivative of $\psi$. Then the extremal condition of the loss function is given as

$$\frac1nK_{11}v(\alpha,\beta) - \frac1mK_{12}1_m + \lambda K_{11}\alpha + \lambda K_{12}\beta = 0,$$
$$\frac1nK_{21}v(\alpha,\beta) - \frac1mK_{22}1_m + \lambda K_{22}\beta + \lambda K_{21}\alpha = 0.$$

If $\alpha$ and $\beta$ satisfy the above conditions, they are the optimal solution because the loss function is convex in $\alpha$ and $\beta$. Substituting $\beta = \frac{1}{m\lambda}1_m$, we obtain

$$\frac1nK_{11}v(\alpha,1_m/m\lambda) + \lambda K_{11}\alpha = 0,\qquad \frac1nK_{21}v(\alpha,1_m/m\lambda) + \lambda K_{21}\alpha = 0.$$

Hence, if the equation

$$\frac1nv(\alpha,1_m/m\lambda) + \lambda\alpha = 0 \tag{13}$$

has a solution, it follows that $\beta = \frac{1}{m\lambda}1_m$ is part of the optimal solution. For $\psi(z) = z^2/2$, we have $v(\alpha,\beta) = K_{11}\alpha + K_{12}\beta$; thus, (13) reduces to

$$(K_{11} + n\lambda I_n)\alpha = -\frac{1}{m\lambda}K_{12}1_m. \tag{14}$$

The coefficient matrix is non-singular. Therefore, the estimator is represented by (11) and (12).

Remark 2. As shown in the proof of Theorem 3, the estimate $\beta$ for any f-divergence (5) is given as (12) (but not (11)) under the condition that Equation (13) has a solution with respect to $\alpha$.
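As a quick illustration of Theorem 3, the following sketch evaluates (11) and (12) directly with NumPy, reusing X, Y, K11, K12, n, m, and gram() from the KMM sketch in Section 2.2; the regularization value lam is an arbitrary choice for the illustration.

```python
# Sketch of the analytic KuLSIF solution (11)-(12); lam is an arbitrary
# illustrative value, and X, Y, K11, K12, n, m, gram() come from the
# earlier KMM sketch.
import numpy as np

lam = 0.1
ones_m = np.ones(m)
alpha = -np.linalg.solve(K11 + n * lam * np.eye(n), K12 @ ones_m) / (m * lam)
beta = ones_m / (m * lam)

def w_hat_plus(z):
    # truncated estimator w_hat_+(z) = max(w_hat(z), 0), with w_hat as in (10)
    w = gram(z[None, :], X)[0] @ alpha + gram(z[None, :], Y)[0] @ beta
    return max(w, 0.0)

print(w_hat_plus(np.zeros(2)))  # estimated ratio q(0)/p(0)
```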
Eventually, the estimator based on the f-divergence is given by solving the following optimization problem:

$$\inf_w\ \int\psi(w)\,dP_n - \int w\,dQ_m + \frac\lambda2\|w\|_{\mathcal H}^2,\quad \text{s.t. } w(\cdot) = \sum_{i=1}^n\alpha_ik(\cdot,X_i) + \frac{1}{m\lambda}\sum_{j=1}^mk(\cdot,Y_j),\quad \alpha_1,\ldots,\alpha_n\in\Re. \tag{15}$$

When $\psi(z) = z^2/2$, the problem (15) reduces to

$$\min_{\alpha\in\Re^n}\ \frac12\alpha^\top\Big(\frac1nK_{11}^2 + \lambda K_{11}\Big)\alpha + \frac{1}{nm\lambda}1_m^\top K_{21}K_{11}\alpha \tag{16}$$

by ignoring the terms independent of the parameter $\alpha$. On the other hand, Theorem 3 guarantees that the parameter $\alpha$ in KuLSIF is obtained as the optimal solution of the following optimization problem (here we use the fact that the solution of $Ax = b$ is given as the minimizer of $\frac12x^\top Ax - b^\top x$ when A is positive semidefinite):

$$\min_{\alpha\in\Re^n}\ \frac12\alpha^\top\Big(\frac1nK_{11} + \lambda I_n\Big)\alpha + \frac{1}{nm\lambda}1_m^\top K_{21}\alpha. \tag{17}$$

The estimator given by solving the optimization problem (17) is denoted as Reduced-KuLSIF (R-KuLSIF). Although KuLSIF and R-KuLSIF share the same optimal solution, their loss functions are different. In a later section, we make clear that R-KuLSIF is preferable to the other estimators, including KuLSIF, from the viewpoint of numerical computation, especially when the sample size is large.

3.3 Leave-one-out Cross-validation

In addition to the solutions $\alpha_i$ and $\beta_j$, the leave-one-out cross-validation (LOOCV) score can also be obtained analytically in KuLSIF. The accuracy of the KuLSIF estimator $w_+ = \max\{w, 0\}$ is measured by $\frac12\int w_+^2\,dP - \int w_+\,dQ$, which is equal to the square error of $w_+$ up to a constant term. Then the LOOCV score of $w_+$ under the square error is defined as

$$\mathrm{LOOCV} = \frac{1}{n\wedge m}\sum_{\ell=1}^{n\wedge m}\Big(\frac12\big(\widehat w_+^{(\ell)}(x_\ell)\big)^2 - \widehat w_+^{(\ell)}(y_\ell)\Big), \tag{18}$$

where $\widehat w_+^{(\ell)} = \max\{\widehat w^{(\ell)}, 0\}$ is the estimator based on the samples except $x_\ell$ and $y_\ell$. The indices of the removed samples could be different, for example $x_{\ell_1}$ and $y_{\ell_2}$, but for the sake of simplicity, we suppose that the samples $x_\ell$ and $y_\ell$ are removed in the computation of LOOCV. Hyper-parameters achieving the minimum value of LOOCV will be a good choice. Thanks to the analytic solutions (11) and (12), the leave-one-out solution $\widehat w^{(\ell)}$ can be computed efficiently from $\widehat w$ by the use of the Sherman-Woodbury-Morrison formula (Golub & Loan, 1996). The details of the analytic LOOCV expression are deferred to Appendix B—the derivation follows a similar line to Kanamori et al. (2009), which deals with a linear model (6); a minor difference is that removing the sample $(x_\ell, y_\ell)$ in KuLSIF changes the basis functions due to the kernel expression (10).
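Since both (16) and (17) are convex quadratics in $\alpha$, the claimed equivalence can be checked numerically; the following sketch (reusing the variables of the previous sketches) solves the linear system behind (17) and compares it with the analytic solution (11).

```python
# Sketch: the minimizer of the R-KuLSIF objective (17) solves the linear
# system (K11/n + lam*I) alpha = -(1/(n m lam)) K12 1_m, i.e., (14) rescaled
# by 1/n; it must coincide with the analytic solution (11) computed above.
import numpy as np

A = K11 / n + lam * np.eye(n)              # Hessian of (17)
b = -K12 @ np.ones(m) / (n * m * lam)      # negative linear term of (17)
alpha_r = np.linalg.solve(A, b)
print(np.allclose(alpha_r, alpha))         # True: same minimizer as (11)
```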
4 Relation between KuLSIF and KMM

We show the relation between KuLSIF and KMM. We assume that the true density ratio $w_0 = q/p$ is included in $\mathcal H$. As shown in Section 2, the loss function of KMM on $\mathcal H$ is defined as

$$L_{\mathrm{KMM}}(w) = \frac12\|\Phi(w)\|_{\mathcal H}^2,\qquad \Phi(w) = \int k(\cdot,x)w(x)P(dx) - \int k(\cdot,y)Q(dy).$$

In the estimation phase, an empirical approximation of $L_{\mathrm{KMM}}$ is optimized in the KMM algorithm. On the other hand, the (unregularized) loss function of KuLSIF is given by

$$L_{\mathrm{KuLSIF}}(w) = \frac12\int w^2\,dP - \int w\,dQ.$$

Both $L_{\mathrm{KMM}}$ and $L_{\mathrm{KuLSIF}}$ are minimized at the true density ratio $w_0 \in \mathcal H$. Although some linear constraints may be introduced in the optimization phase, we study the optimization problems of $L_{\mathrm{KMM}}$ and $L_{\mathrm{KuLSIF}}$ without constraints. This is because, when the sample size tends to infinity, the optimal solutions of $L_{\mathrm{KMM}}$ and $L_{\mathrm{KuLSIF}}$ without constraints automatically satisfy the required constraints such as $\int w\,dP = 1$ and $w \ge 0$.

We consider the extremal condition of $L_{\mathrm{KuLSIF}}(w)$ at $w_0$. Substituting $w = w_0 + \delta v$ ($\delta\in\Re$, $v\in\mathcal H$) into $L_{\mathrm{KuLSIF}}(w)$, we have

$$L_{\mathrm{KuLSIF}}(w_0+\delta v) - L_{\mathrm{KuLSIF}}(w_0) = \delta\Big(\int w_0v\,dP - \int v\,dQ\Big) + \frac{\delta^2}{2}\int v^2\,dP.$$

Since $L_{\mathrm{KuLSIF}}(w_0+\delta v)$ is minimized at $\delta = 0$, the derivative of $L_{\mathrm{KuLSIF}}(w_0+\delta v)$ at $\delta = 0$ vanishes, i.e.,

$$\int w_0v\,dP - \int v\,dQ = 0. \tag{19}$$

The equality (19) holds for arbitrary $v\in\mathcal H$. Using the reproducing property of the kernel function k, we can express (19) in terms of $\Phi(w_0)$ as follows:

$$\int w_0v\,dP - \int v\,dQ = \int w_0(x)\langle k(\cdot,x),v\rangle_{\mathcal H}P(dx) - \int\langle k(\cdot,y),v\rangle_{\mathcal H}Q(dy) = \Big\langle\int k(\cdot,x)w_0(x)P(dx) - \int k(\cdot,y)Q(dy),\ v\Big\rangle_{\mathcal H} = \langle\Phi(w_0),v\rangle_{\mathcal H} = 0,\quad \forall v\in\mathcal H. \tag{20}$$

Therefore we obtain $\Phi(w_0) = 0$, and we find that $\Phi(w)$ is the Gâteaux derivative (Zeidler, 1986) of $L_{\mathrm{KuLSIF}}$ at $w\in\mathcal H$. In summary, let $DL_{\mathrm{KuLSIF}}$ be the Gâteaux derivative of $L_{\mathrm{KuLSIF}}$ over the RKHS $\mathcal H$; then the equality

$$L_{\mathrm{KMM}}(w) = \frac12\|DL_{\mathrm{KuLSIF}}(w)\|_{\mathcal H}^2 \tag{21}$$

holds. Tsuboi et al. (2008) have pointed out a similar relation for the M-estimator based on the Kullback-Leibler divergence.

Now we illustrate the relation between KuLSIF and KMM by an analogous optimization example in the Euclidean space. Let $f : \Re^d \to \Re$ be a differentiable function, and consider the optimization problem $\min_x f(x)$. At the optimal solution $x_0$, the extremal condition $\nabla f(x_0) = 0$ should hold, where $\nabla f$ is the gradient of f. Thus, instead of minimizing f, minimizing $\|\nabla f(x)\|^2$ also provides the minimizer of f. This corresponds to the relation between KuLSIF and KMM:

$$\text{KuLSIF} \Longleftrightarrow \min_x f(x),\qquad \text{KMM} \Longleftrightarrow \min_x \frac12\|\nabla f(x)\|^2.$$

In other words, in order to find the solution of the equation

$$\Phi(w) = 0, \tag{22}$$

KMM tries to minimize the norm of $\Phi(w)$. The "dual" expression of (22) is given as

$$\langle\Phi(w),v\rangle_{\mathcal H} = 0,\quad \forall v\in\mathcal H. \tag{23}$$

By "integrating" $\langle\Phi(w),v\rangle_{\mathcal H}$, we obtain the loss function $L_{\mathrm{KuLSIF}}$.

Remark 3. Gretton et al. (2006) have proposed the maximum mean discrepancy (MMD) to measure the discrepancy between two probabilities P and Q. When the constant function 1 is included in the RKHS $\mathcal H$, the MMD between P and Q is equal to $2\times L_{\mathrm{KMM}}(1)$. Due to the equality (21), we find that the MMD is also expressed as $\|DL_{\mathrm{KuLSIF}}(1)\|_{\mathcal H}^2$, that is, the norm of the derivative of $L_{\mathrm{KuLSIF}}$ at $1\in\mathcal H$. This quantity will be related to the discrepancy between the constant function 1 and the true density ratio $w_0 = q/p$.

Remark 4. It is straightforward to extend the above relation to the general f-divergence approach. The loss function of the M-estimator (Nguyen et al., 2008) is given as

$$L_\psi(w) = \int\psi(w)\,dP - \int w\,dQ.$$

Then the loss function of the KMM type may be defined as

$$L_{\psi\text{-}\mathrm{KMM}}(w) = \frac12\|DL_\psi(w)\|_{\mathcal H}^2,\qquad DL_\psi(w) = \int k(\cdot,x)\psi'(w(x))P(dx) - \int k(\cdot,y)Q(dy).$$

We can confirm that $L_\psi(w)$ and $L_{\psi\text{-}\mathrm{KMM}}(w)$ share the minimizer. If there exists $w_\psi\in\mathcal H$ such that $w_0 = \psi'(w_\psi)$, the optimal solution is given by $w_\psi$.
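The Euclidean analogy between KuLSIF and KMM sketched above can be checked numerically; the following minimal sketch uses a random symmetric positive definite matrix (all quantities are illustrative).

```python
# Sketch of the Euclidean analogy: minimizing f(x) = x'Ax/2 - b'x and
# minimizing ||grad f(x)||^2 = ||Ax - b||^2 give the same solution, but the
# Hessian of the latter involves A^2, so its condition number is squared.
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(30, 30))
A = M @ M.T + 0.1 * np.eye(30)            # random symmetric positive definite
b = rng.normal(size=30)

x_f = np.linalg.solve(A, b)               # minimizer of f
x_g = np.linalg.solve(A @ A, A @ b)       # minimizer of ||Ax - b||^2
print(np.allclose(x_f, x_g))              # True: identical minimizers
print(np.linalg.cond(A), np.linalg.cond(A @ A))  # kappa(A) vs kappa(A)^2
```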
5 Condition Number Analysis for Density Ratio Estimation

We have elucidated basic properties of the KuLSIF algorithm. In this section, we study the condition number of KuLSIF and other density ratio estimators in order to investigate their computational properties. This is the main contribution of this paper.

5.1 Condition Number in Numerical Analysis and Optimization

Condition numbers play crucial roles in numerical analysis and optimization (Demmel, 1997; Luenberger & Ye, 2008; Sankar et al., 2006), as explained in this section. Let A be a symmetric positive definite matrix; the condition number of A is defined as $\lambda_{\max}/\lambda_{\min}\ (\ge 1)$, where $\lambda_{\max}$ and $\lambda_{\min}$ are the maximal and minimal eigenvalues of A, respectively. The condition number of A is denoted by $\kappa(A)$. In general, the condition number of a matrix which may not be symmetric is defined through the singular values; the above definition is, however, enough for our purpose.

In numerical analysis, the condition number governs the round-off error of the solution of a linear equation $Ax = b$. A matrix A with a large condition number will lead to a large upper bound on the relative error of the solution x. More precisely, in the perturbed linear equation $(A + \delta A)(x + \delta x) = b + \delta b$, the relative error of the solution is bounded as follows (Demmel, 1997):

$$\frac{\|\delta x\|}{\|x\|} \le \frac{\kappa(A)}{1 - \kappa(A)\|\delta A\|/\|A\|}\Big(\frac{\|\delta A\|}{\|A\|} + \frac{\|\delta b\|}{\|b\|}\Big).$$

Hence a smaller condition number is preferable in numerical computation.

In optimization problems, the condition number determines the convergence rate of optimization algorithms. Let us consider a minimization problem $\min_{x\in\Re^n} f(x)$, where $f : \Re^n\to\Re$ is a differentiable function, and let $x_0$ be a local optimal solution. We consider an iterative algorithm which generates a sequence $\{x_i\}_{i=1}^\infty$. In various iterative algorithms, the sequence is generated as

$$x_{i+1} = x_i - S_i^{-1}\nabla f(x_i),\quad i = 1,2,\ldots, \tag{24}$$

where $S_i$ is an approximation of the Hessian matrix of f at $x_0$, i.e., $\nabla^2f(x_0)$. Then, under a mild assumption, the sequence $\{x_i\}_{i=1}^\infty$ converges to $x_0$. Numerical techniques such as scaling and pre-conditioning are also incorporated in the above form with a certain choice of $S_i$. According to Section 10.1 of Luenberger and Ye (2008), the convergence rate of such iterative algorithms is given as

$$\|x_k - x_0\| = O\Big(\prod_{i=1}^k\frac{\kappa_i - 1}{\kappa_i + 1}\Big),$$

where $\kappa_i$ is the condition number of $S_i^{-1/2}(\nabla^2f(x_0))S_i^{-1/2}$. Thus, the convergence of the sequence $x_k$ is slow if the $\kappa_i$ are large. More critically, when $\{\kappa_i\}_{i=1}^\infty$ does not converge to one, the sequence $\{x_i\}_{i=1}^\infty$ does not converge to $x_0$ at a super-linear rate.
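The role of the condition number in the rate above can be seen in the simplest setting; the sketch below runs fixed-step steepest descent (the special case $S_i = t^{-1}I$ of (24)) on diagonal quadratics with prescribed condition numbers. The step size and tolerance are illustrative choices.

```python
# Sketch: fixed-step steepest descent on a diagonal quadratic with
# eigenvalues 1 and kappa contracts the error by (kappa-1)/(kappa+1) per
# step, so the iteration count grows with the condition number.
import numpy as np

def gd_iterations(kappa, tol=1e-8):
    lam = np.array([1.0, kappa])            # Hessian eigenvalues
    step = 2.0 / (lam.min() + lam.max())    # best fixed step size
    x = np.ones(2)                          # the minimizer is x = 0
    for it in range(1, 10 ** 7):
        x = x - step * lam * x              # gradient of f(x) = sum(lam*x^2)/2
        if np.linalg.norm(x) < tol:
            return it
    return None

for kappa in [10.0, 100.0, 1000.0]:
    print(kappa, gd_iterations(kappa))      # iterations grow roughly linearly in kappa
```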
When the condition number of the Hessian matrix $\nabla^2f(x_0)$ is large, there is a trade-off between the numerical accuracy and the convergence rate in optimization problems. Let us illustrate the trade-off with a few examples. When the Newton method is employed, $S_k$ is given as $\nabla^2f(x_k)$. Because of the continuity of $\nabla^2f$, the condition number of $S_k = \nabla^2f(x_k)$ will be large if $\kappa(\nabla^2f(x_0))$ is large, and then the numerical computation of $S_k^{-1}\nabla f(x_k)$ becomes unstable. When quasi-Newton methods such as the BFGS method or the DFP method (Luenberger & Ye, 2008) are employed, $S_k$ or $S_k^{-1}$ is successively estimated based on the information of the gradient. If $\kappa(\nabla^2f(x_0))$ is large, $\kappa(S_k)$ is also likely to be large, and thus the numerical computation of $S_k^{-1}\nabla f(x_k)$ is not reliable, even when $S_k^{-1}$ is successively updated in the quasi-Newton methods. The round-off error caused by nearly singular Hessian matrices significantly affects the accuracy of quasi-Newton methods. As a result, it may not be guaranteed that $S_k^{-1}\nabla f(x_k)$ is a preferable descent direction of the objective function f.

In optimization problems with large condition numbers, the numerical computation tends to be unreliable. To avoid numerical instability, the Hessian matrix is often modified so that $S_k$ has a moderate condition number. For example, the optimization toolbox in MATLAB implements a gradient descent method in its function fminunc. The default method in fminunc is the BFGS method with updates through the Cholesky factorization of $S_k$ (not $S_k^{-1}$). Even if the positive definiteness of $S_k$ is violated by round-off error, the Cholesky factorization immediately detects the negativity of eigenvalues, and the positive definiteness of $S_k$ is recovered by adding a correction term. When the modified Cholesky factorization is used, the condition number of $S_k$ is guaranteed to be bounded above by some constant C; see Moré and Sorensen (1984) for details.

The trade-off between numerical accuracy and convergence rate is summarized by the following equality:

$$\min_{S:\kappa(S)\le C}\kappa\big(S^{-1/2}(\nabla^2f(x_0))S^{-1/2}\big) = \max\Big\{\frac{\kappa(\nabla^2f(x_0))}{C},\ 1\Big\}. \tag{25}$$

The proof of (25) may be found in Appendix C. Suppose that a symmetric positive definite matrix $S_k$ satisfying $\kappa(S_k)\le C$ is used in the iterative algorithm (24). If $\kappa(\nabla^2f(x_0))$ is large, the right-hand side of (25) will be greater than one, and hence the convergence will be slow. That is, a quasi-Newton method with a modified Hessian $S_k$ such that $\kappa(S_k)\le C$ may not achieve a super-linear convergence rate. Even though scaling or pre-conditioning techniques are available, it is preferable that the condition number of the original problem be kept as small as possible.

5.2 Condition Number Analysis of KuLSIF and KMM

Let us consider the optimization problems of KuLSIF and KMM on an RKHS $\mathcal H$ endowed with a kernel function k over a set $\mathcal Z$. Given samples (1), the optimization problems of KuLSIF and KMM are defined as

$$\text{(KuLSIF)}\quad \min_{w\in\mathcal H}\ \frac12\int w^2\,dP_n - \int w\,dQ_m + \frac\lambda2\|w\|_{\mathcal H}^2,$$
$$\text{(KMM)}\quad \min_{w\in\mathcal H}\ \frac12\big\|\widehat\Phi(w) + \lambda w\big\|_{\mathcal H}^2,$$

where

$$\widehat\Phi(w) = \int k(\cdot,x)w(x)P_n(dx) - \int k(\cdot,y)Q_m(dy).$$

Here, $\widehat\Phi(w) + \lambda w$ is the Gâteaux derivative of the loss function of KuLSIF including the regularization term. In the original KMM method, the density ratio at the samples $X_1,\ldots,X_n$ is optimized (Huang et al., 2007), i.e., transduction. Here we consider its inductive variant, i.e., estimating the function $w_0$ on $\mathcal Z$ using the loss function of KMM. According to Theorem 3, the optimal solution of (KuLSIF) has the form $w = \sum_{i=1}^n\alpha_ik(\cdot,X_i) + \frac{1}{m\lambda}\sum_{j=1}^mk(\cdot,Y_j)$; note that the optimal solution of (KMM) is also given by the same form. Thus, the variables to be optimized in (KuLSIF) and (KMM) are $\alpha_1,\ldots,\alpha_n$. We investigate the numerical efficiency of (KuLSIF) and (KMM).
When we solve the minimization problem $\min_x f(x)$, it is not recommended to minimize the norm of the gradient, $\min_x\|\nabla f(x)\|^2$, since the problem $\min_x\|\nabla f(x)\|^2$ generally has a larger condition number than $\min_x f(x)$ (Luenberger & Ye, 2008). For example, let f be the convex quadratic function $f(x) = \frac12x^\top Ax - b^\top x$ with a positive definite matrix A. Then the condition number of the Hessian matrix equals $\kappa(A)$. On the other hand, the Hessian matrix of the function $\|\nabla f(x)\|^2 = \|Ax - b\|^2$ is proportional to $A^2$, and $\kappa(A^2) = \kappa(A)^2$; that is, the condition number is squared and thus becomes larger. Below, we show that the same is true of KuLSIF and KMM.

The Hessian matrices of the objective functions of KuLSIF and KMM are given as

$$H_{\mathrm{KuLSIF}} = \frac1nK_{11}^2 + \lambda K_{11}, \tag{26}$$
$$H_{\mathrm{KMM}} = \frac{1}{n^2}K_{11}^3 + \frac{2\lambda}{n}K_{11}^2 + \lambda^2K_{11}. \tag{27}$$

$H_{\mathrm{KuLSIF}}$ is derived from (16), and $H_{\mathrm{KMM}}$ is given by direct computation based on (KMM). Then we obtain

$$\kappa(H_{\mathrm{KuLSIF}}) = \kappa(K_{11})\,\kappa\Big(\frac1nK_{11} + \lambda I_n\Big),\qquad \kappa(H_{\mathrm{KMM}}) = \kappa(K_{11})\,\kappa\Big(\frac1nK_{11} + \lambda I_n\Big)^2.$$

Since the condition number is larger than or equal to one, the inequality $\kappa(H_{\mathrm{KuLSIF}}) \le \kappa(H_{\mathrm{KMM}})$ holds. This implies that the convergence of KuLSIF will be faster than that of KMM when an iterative optimization algorithm is used to minimize each loss function.

According to Remark 4, we expect that the condition number of the M-estimator based on $L_\psi$ is smaller than that of KMM based on $L_{\psi\text{-}\mathrm{KMM}}$. Let the Hessian matrices at the optimal solution $\widehat w$ be $H_{\psi\text{-}\mathrm{div}}$ for $L_\psi$ and $H_{\psi\text{-}\mathrm{KMM}}$ for $L_{\psi\text{-}\mathrm{KMM}}$; then some calculation provides

$$H_{\psi\text{-}\mathrm{div}} = K_{11}^{1/2}\Big(\frac1nK_{11}^{1/2}D_{\psi,\widehat w}K_{11}^{1/2} + \lambda I_n\Big)K_{11}^{1/2},\qquad H_{\psi\text{-}\mathrm{KMM}} = K_{11}^{1/2}\Big(\frac1nK_{11}^{1/2}D_{\psi,\widehat w}K_{11}^{1/2} + \lambda I_n\Big)^2K_{11}^{1/2},$$

where $D_{\psi,w}$ is the n-by-n diagonal matrix defined as

$$D_{\psi,w} = \mathrm{diag}\big(\psi''(w(X_1)),\ldots,\psi''(w(X_n))\big), \tag{28}$$

and $\psi''$ denotes the second-order derivative of $\psi$. Hence, using the inequality $\kappa(AB)\le\kappa(A)\kappa(B)$ (Horn & Johnson, 1985), we have

$$\kappa(H_{\psi\text{-}\mathrm{div}}) \le \kappa(K_{11})\,\kappa\Big(\frac1nK_{11}^{1/2}D_{\psi,\widehat w}K_{11}^{1/2} + \lambda I_n\Big),\qquad \kappa(H_{\psi\text{-}\mathrm{KMM}}) \le \kappa(K_{11})\,\kappa\Big(\frac1nK_{11}^{1/2}D_{\psi,\widehat w}K_{11}^{1/2} + \lambda I_n\Big)^2.$$

From the viewpoint of these naive upper bounds on the condition numbers, the M-estimator based on $L_\psi$ will be preferable to KMM with $L_{\psi\text{-}\mathrm{KMM}}$.
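These Hessians are easy to form numerically. The sketch below does so for the Kullback-Leibler case $\psi''(z) = 1/z^2$, where at the target the diagonal entries of D equal $w_0(X_i)^2$; it reuses X, K11, n, and lam from the earlier sketches, and the closed-form $w_0$ follows from the synthetic Gaussian setup assumed there.

```python
# Sketch of H_psi-div and H_psi-KMM for the KL divergence: at the target,
# D has diagonal entries w0(X_i)^2; w0 = q/p is known in closed form for
# the synthetic Gaussian setup of the earlier sketches.
import numpy as np
from scipy.stats import multivariate_normal

p_pdf = multivariate_normal(mean=[0.0, 0.0]).pdf
q_pdf = multivariate_normal(mean=[0.5, 0.5]).pdf
D = np.diag((q_pdf(X) / p_pdf(X)) ** 2)

# symmetric square root of K11 via its eigendecomposition
evals, evecs = np.linalg.eigh(K11)
K_half = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

inner = K_half @ D @ K_half / n + lam * np.eye(n)
H_div = K_half @ inner @ K_half            # Hessian of the M-estimator loss
H_kmm = K_half @ inner @ inner @ K_half    # Hessian of the KMM-type loss
print(np.linalg.cond(H_div), np.linalg.cond(H_kmm))
```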
5.3 Condition Number Analysis of M-Estimators

(K)uLSIF is an example of the M-estimators with the squared loss. Here we study the condition number of the Hessian matrix associated with the minimization problem in the f-divergence approach, and show that KuLSIF is optimal among all M-estimators based on f-divergences. More specifically, we give a min-max evaluation (Section 5.3.1) and a probabilistic evaluation (Section 5.3.2) of the condition number.

5.3.1 Min-max Evaluation

We assume that a universal RKHS $\mathcal H$ (Steinwart, 2001) endowed with a kernel function k on a compact set $\mathcal Z$ is used for the estimation of $w_0$. The M-estimator based on the f-divergence is obtained by solving the problem (15). The Hessian matrix of the loss function at the optimal solution w is equal to

$$\frac1nK_{11}D_{\psi,w}K_{11} + \lambda K_{11}, \tag{29}$$

where $D_{\psi,w}$ is the diagonal matrix defined in Eq. (28). The condition number of the Hessian matrix is denoted by

$$\kappa_0(D_{\psi,w}) = \kappa\Big(\frac1nK_{11}D_{\psi,w}K_{11} + \lambda K_{11}\Big).$$

In KuLSIF, we find $\psi'' = 1$, and thus the condition number is equal to $\kappa_0(I_n)$. We analyze the relation between $\kappa_0(I_n)$ and $\kappa_0(D_{\psi,w})$.

Theorem 4 (Min-max Evaluation). Suppose that $\mathcal H$ is a universal RKHS, and that $K_{11}$ is non-singular. Then

$$\inf_{\psi:\psi''(1)=1}\ \sup_{w\in\mathcal H}\ \kappa_0(D_{\psi,w}) = \kappa_0(I_n) \tag{30}$$

holds. Here the infimum is taken over all convex, second-order continuously differentiable functions $\psi$ such that $\psi''(1) = 1$.

The proof is deferred to Appendix D. When the constraint $\psi''(1) = c$ is imposed with some $c > 0$, the optimal function is given as $\psi(z) = cz^2/2$ in the min-max sense. Practically, the value of $\psi''(1)$ determines the balance between the fit to the training samples and the regularization term. Theorem 4 guarantees that KuLSIF minimizes the worst-case condition number, which follows from the fact that the condition number of KuLSIF does not depend on the optimal solution. Since both sides of (30) depend on the samples $X_1,\ldots,X_n$, KuLSIF achieves the min-max solution in terms of the condition number for each observation.

5.3.2 Probabilistic Evaluation

Next, we study a probabilistic evaluation of the condition number. As shown in the min-max evaluation, the Hessian matrix is given as

$$H = \frac1nK_{11}D_{\psi,\widehat w}K_{11} + \lambda K_{11},$$

where the diagonal elements of $D_{\psi,\widehat w}$ are equal to $\psi''(\widehat w(X_1)),\ldots,\psi''(\widehat w(X_n))$. The estimator $\widehat w$ is given as the minimum solution of (15). Let us define the random variable $T_n$ as

$$T_n = \max_{1\le i\le n}\psi''(\widehat w(X_i)),$$

and let $F_n$ be the distribution function of $T_n$; $T_n$ is a non-negative random variable. Below, we first compute the distribution of the condition number $\kappa(H)$. Then we investigate the relation between the function $\psi$ and the distribution of the condition number $\kappa(H)$. We need to study the eigenvalues and condition numbers of random matrices. For the Wishart distribution, the probability distribution of condition numbers has been investigated by Edelman (1988) and Edelman and Sutton (2005). Recently, the condition numbers of matrices perturbed by additive Gaussian noise have been investigated under the name of smoothed analysis (Sankar et al., 2006; Spielman & Teng, 2004; Tao & Vu, 2007). The randomness involved in the matrix H defined above is, however, different from that in existing works.

Theorem 5 (Probabilistic Evaluation). Let $\mathcal H$ be an RKHS endowed with a kernel function k on $\mathcal Z$ satisfying the following condition: there exists $\varepsilon > 0$ such that $\sqrt\varepsilon \le k(x,x') \le 1$ for all $x,x'\in\mathcal Z$. Assume that the Gram matrix $K_{11}$ is almost surely positive definite in terms of the probability measure P. Suppose that there exist sequences $s_n$ and $t_n$ such that

$$\lim_{n\to\infty}s_n = \infty,\qquad \lim_{n\to\infty}F_n(s_n) = 0,\qquad \lim_{n\to\infty}F_n(t_n) = 1, \tag{31}$$

and that there exists $M > 0$ such that $E[\psi''(\widehat w(X_1))]\le M$ holds for large sample sizes n and m. Suppose that $\lambda = \lambda_{n,m}$ satisfies $\lim_{n\to\infty}\lambda_{n,m} < \infty$. Then, for any small $\nu > 0$, we have

$$\lim_{n\to\infty}\Pr\Big(s_n^{1-\nu} \le \kappa(H) \le \kappa(K_{11})\Big(1 + \frac{t_n}{\lambda}\Big)\Big) = 1. \tag{32}$$

The proof is deferred to Appendix E.

Remark 5. The Gaussian kernel on a compact set meets the condition of Theorem 5 under a mild assumption on the probability P. If the distribution P of samples $X_1,\ldots,X_n$ is absolutely continuous with respect to the Lebesgue measure, the Gram matrix of the Gaussian kernel is almost surely positive definite.
Indeed, $K_{11}$ is positive definite whenever $X_i \ne X_j$ for all $i \ne j$.

When $\psi$ is the quadratic function, $\psi(z) = z^2/2$, the distribution function $F_n$ is given by $F_n(t) = 1[t\ge1]$, where $1[\cdot]$ is the indicator function. Hence there does not exist a sequence $s_n$ as defined in Theorem 5. The upper bound is, however, still valid; that is, by choosing $t_n = 1$, the upper bound of $\kappa(H)$ with $\psi(z) = z^2/2$ is asymptotically given as $\kappa(K_{11})(1 + \lambda_{n,m}^{-1})$. On the other hand, in the M-estimator with the Kullback-Leibler divergence (Nguyen et al., 2008), the function $\psi$ is defined as $\psi(z) = -1 - \log(-z)$, $z < 0$, and thus $\psi''(z) = 1/z^2$ holds. Hence $T_n = \max_{1\le i\le n}(\widehat w(X_i))^{-2}$ is expected to be of an order larger than constant order, and thus $t_n$ would diverge to infinity. This simple analysis indicates that KuLSIF will be preferable to the M-estimator with the Kullback-Leibler divergence in the sense of computational efficiency and stability.

We derive an approximation of the inequality in (32). The target of the estimator $\widehat w$ is given as w such that $q(x)/p(x) = \psi'(w(x))$ holds. Thus, we expect that the condition number of $\frac1nK_{11}D_{\psi,\widehat w}K_{11} + \lambda K_{11}$ is approximated by that of $\frac1nK_{11}D_{\psi,w}K_{11} + \lambda K_{11}$. The proof of Theorem 5 remains valid when the random variable $T_n$ is defined by a fixed function $w\in\mathcal H$. The condition number of the Hessian matrix at a fixed function $w\in\mathcal H$ is considered in the proposition below.

Proposition 1 (Approximated Bound). Let the kernel function k and the regularization parameter $\lambda$ satisfy the same conditions as in Theorem 5. For a function $w\in\mathcal H$, let F be the distribution function of $\psi''(w(X))$, and suppose that the expectation of $\psi''(w(X))$ is finite. Let $G = 1 - F$, and suppose that there exists a real number $U > 0$ such that G has an inverse function $G^{-1}$ for $t\ge U$. Let the random matrix $H_w$ be

$$H_w = \frac1nK_{11}D_{\psi,w}K_{11} + \lambda K_{11}.$$

Then, for any small $\eta > 0$ and any small $\nu > 0$, we have

$$\lim_{n\to\infty}\Pr\Big(\{G^{-1}(1/n^{1-\eta})\}^{1-\nu} \le \kappa(H_w) \le \kappa(K_{11})\big(1 + \lambda^{-1}G^{-1}(1/n^{1+\eta})\big)\Big) = 1.$$

Proof. Note that $F_n(t)$ in Theorem 5 is equal to $(F(t))^n$, since $\psi''(w(X_i))$, $i = 1,\ldots,n$ are identically and independently distributed from F. The condition number $\kappa(H_w)$ satisfies Eq. (32) with $F_n = F^n$. As shown in Figure 1, the function $G^{-1}$ is decreasing. Let $s_n = G^{-1}(1/n^{1-\eta})$; then $s_n\to\infty$ as n tends to infinity. Thus we have

$$F_n(s_n) = F(s_n)^n = (1 - G(s_n))^n = \Big(1 - \frac{1}{n^{1-\eta}}\Big)^n \longrightarrow 0,\quad n\to\infty.$$

On the other hand, let $t_n = G^{-1}(1/n^{1+\eta})$; then we have

$$F_n(t_n) = (1 - G(t_n))^n = \Big(1 - \frac{1}{n^{1+\eta}}\Big)^n \longrightarrow 1,\quad n\to\infty.$$

Substituting $s_n$ and $t_n$ into the inequality in (32), we obtain the result.

Remark 6. Proposition 1 implies that for large n, the inequality

$$\{G^{-1}(1/n^{1-\eta})\}^{1-\nu} \le \kappa(H_w) \le \kappa(K_{11})\big(1 + \lambda^{-1}G^{-1}(1/n^{1+\eta})\big) \tag{33}$$

holds with high probability. In KuLSIF, the function $\psi$ is given as $\psi(z) = z^2/2$, and the corresponding distribution function of each diagonal element in $D_{\psi,w}$ is given by $F_{\mathrm{KuLSIF}}(d) = 1[d\ge1]$, and thus $G_{\mathrm{KuLSIF}}(d) = 1 - F_{\mathrm{KuLSIF}}(d) = 1[d<1]$. In all M-estimators except KuLSIF, the diagonal elements of $D_{\psi,w}$ can take various positive values.
We regard the diagonal elements of $D_{\psi,w}$ as a typical realization of random variables with the distribution function $F(d)$. When the distribution function F is close to $F_{\mathrm{KuLSIF}}$, the function $G = 1 - F$ is also close to $G_{\mathrm{KuLSIF}}$. Then $G^{-1}$ will take small values, as illustrated in Figure 1. As a result, we can expect that the condition number of KuLSIF is smaller than that of the other M-estimators. In a later section, we further investigate this issue through numerical experiments.

Figure 1: If the function $G_1(d)$ is closer to $G_{\mathrm{KuLSIF}}(d)$ (= 0) than $G_2(d)$ for large d, then $G_1^{-1}(z)$ takes smaller values than $G_2^{-1}(z)$ for small z.

Example 1. Let $F_\gamma(d)$ be

$$F_\gamma(d) = \begin{cases}0, & 0\le d<1,\\ 1 - \dfrac{1}{d^\gamma}, & 1\le d.\end{cases}$$

Suppose that $F_\gamma$ is the distribution function of $\psi''(w(X)) = \psi''(\psi'^{-1}(q(X)/p(X)))$. Note that the distribution function $F_{\mathrm{KuLSIF}}(d) = 1[d\ge1]$ is represented as $1[d\ge1] = \lim_{\gamma\to\infty}F_\gamma(d)$ except at $d = 1$. Then $G_\gamma(d) = 1 - F_\gamma(d)$ is equal to

$$G_\gamma(d) = \begin{cases}1, & 0\le d<1,\\ \dfrac{1}{d^\gamma}, & 1\le d.\end{cases}$$

For small $z > 0$, the inverse function is $G_\gamma^{-1}(z) = z^{-1/\gamma}$. Hence, for sufficiently small $\eta$, the inequality (33) reduces to

$$n^{\frac{(1-\eta)(1-\nu)}{\gamma}} \le \kappa(H_w) \le \kappa(K_{11})\Big(1 + \lambda^{-1}n^{\frac{1+\eta}{\gamma}}\Big).$$

Both upper and lower bounds in the above inequality are monotone decreasing with respect to $\gamma$.

Example 2. Let $F_\gamma(d)$ be

$$F_\gamma(d) = \frac{1}{1 + e^{-\gamma(d-1)}},\quad d\ge0.$$

The distribution function $F_{\mathrm{KuLSIF}}(d) = 1[d\ge1]$ is represented as $1[d\ge1] = \lim_{\gamma\to\infty}F_\gamma(d)$ except at $d = 1$. Then $G_\gamma(d) = 1 - F_\gamma(d)$ is equal to

$$G_\gamma(d) = \frac{1}{1 + e^{\gamma(d-1)}},\quad d\ge0.$$

For small z, the inverse function is

$$G_\gamma^{-1}(z) = 1 + \frac1\gamma\log\frac{1-z}{z}.$$

Hence, for small $\eta$, the inequality (33) leads to the following:

$$\Big(\frac{1-\eta}{\gamma}\log\frac n2\Big)^{1-\nu} \le \kappa(H_w) \le \kappa(K_{11})\Big(1 + \frac{1+\eta}{\lambda\gamma}\log n\Big).$$

The upper and lower bounds in the above inequality are monotone decreasing with respect to $\gamma$.

6 Reduction of Condition Numbers in KuLSIF

The condition number in the optimization problem of KuLSIF is given as $\kappa(H_{\mathrm{KuLSIF}}) = \kappa(\frac1nK_{11}^2 + \lambda K_{11})$, and that of the original KMM method is equal to $\kappa(K_{11})$, which is approximately derived from (2). On the other hand, the Hessian matrix of R-KuLSIF is equal to

$$H_{\mathrm{R\text{-}KuLSIF}} = \frac1nK_{11} + \lambda I_n. \tag{34}$$

See (17) for the loss function of R-KuLSIF. Due to the equality $\kappa(H_{\mathrm{KuLSIF}}) = \kappa(K_{11})\,\kappa(H_{\mathrm{R\text{-}KuLSIF}})$, we have $\kappa(H_{\mathrm{R\text{-}KuLSIF}}) \le \kappa(H_{\mathrm{KuLSIF}})$. Moreover, it is easy to see $\kappa(H_{\mathrm{R\text{-}KuLSIF}}) \le \kappa(K_{11})$. These inequalities imply that R-KuLSIF is preferable to KuLSIF and KMM in the sense of convergence speed and numerical stability, as explained in Section 5.1. In this section, we study whether a reduction of condition numbers is possible in the general f-divergence approach. We do not consider scaling of the parameter (Luenberger & Ye, 2008), but other types of transformations of loss functions that reduce the condition number. Our conclusion is that among all f-divergence approaches, the condition number is reducible only in KuLSIF. Thus the reduction of condition numbers by R-KuLSIF is a special property, which makes R-KuLSIF particularly attractive in practical use.
We now elucidate the reason why the condition number of KuLSIF can be reduced from $\kappa(H_{\mathrm{KuLSIF}})$ to $\kappa(H_{\mathrm{R\text{-}KuLSIF}})$. As explained in Remark 2, in the f-divergence approach the optimal solution of $\beta$ is equal to $1_m/m\lambda$. Then, as shown in the proof of Theorem 3, the gradient of the loss function with respect to $\alpha$ is equal to

$$g_\psi(\alpha) = \frac1nK_{11}v(\alpha,1_m/m\lambda) + \lambda K_{11}\alpha,$$

where the function v depends on $\psi$. On the other hand, the gradient of the loss function in (17) is equal to $K_{11}^{-1}g_\psi(\alpha)$ with $\psi(z) = z^2/2$. This fact implies that in KuLSIF there exists a non-singular matrix $C\in\Re^{n\times n}$, independent of $\alpha$, such that $Cg_\psi(\alpha)$ is identical to the gradient of a function $F(\alpha)$. If the condition number of the Hessian matrix of $F(\alpha)$ does not exceed $\kappa(H_{\mathrm{KuLSIF}})$, it will be numerically more advantageous to use $F(\alpha)$ as the loss function rather than that of KuLSIF.

Suppose that the $\Re^n$-valued function $Cg_\psi(\alpha)$ can be represented as the gradient of a function F, that is, $\nabla F = Cg_\psi$. Then the function $Cg_\psi$ is called integrable (Nakahara, 2003). What we study in this section is to find $\psi$ for which there exists a non-identity matrix C such that $Cg_\psi(\alpha)$ is integrable. According to Nakahara (2003), the necessary and sufficient condition for integrability is that the Jacobian matrix of $Cg_\psi(\alpha)$ is symmetric. The Jacobian matrix of $Cg_\psi(\alpha)$ is equal to

$$\frac1nCK_{11}D_{\psi,\alpha}K_{11} + \lambda CK_{11},$$

where $D_{\psi,\alpha}$ is the diagonal matrix whose diagonal elements are given as

$$(D_{\psi,\alpha})_{ii} = \psi''\Big(\sum_{j=1}^n\alpha_jk(X_i,X_j) + \frac{1}{m\lambda}\sum_{\ell=1}^mk(X_i,Y_\ell)\Big),\quad i = 1,\ldots,n.$$

Let R be the n-by-n matrix $CK_{11}$; then the Jacobian matrix is represented as

$$M_{\psi,R}(\alpha) = \frac1nRD_{\psi,\alpha}K_{11} + \lambda R.$$

Theorem 6. Let c be a constant value in $\Re$, and let the function $\psi$ be second-order continuously differentiable. Suppose that the Gram matrix $K_{11}$ is non-singular and that $K_{11}$ does not have zero elements. If there exists a non-singular matrix $R \ne cK_{11}$ such that $M_{\psi,R}(\alpha)$ is symmetric for any $\alpha\in\Re^n$, then $\psi''$ is a constant function.

The proof may be found in Appendix F. Theorem 6 guarantees that the condition number of the loss function is reducible only when $\psi$ is a quadratic function. Here, multiplying the gradient by a matrix C independent of $\alpha$ is allowed as a transformation of the loss function. For other functions $\psi$, the gradient $Cg_\psi(\alpha)$ cannot be integrable unless $C = cI_n$, $c\in\Re$.

Remark 7. We summarize the theoretical results on condition numbers. Let $H_{\psi\text{-}\mathrm{div}}$ be the Hessian matrix (29) of the M-estimator. Then the following inequalities hold:

$$\kappa(H_{\mathrm{R\text{-}KuLSIF}}) \le \kappa(K_{11}) \le \kappa(H_{\mathrm{KuLSIF}}) \le \kappa(H_{\mathrm{KMM}}),$$
$$\kappa(H_{\mathrm{KuLSIF}}) = \sup_{w\in\mathcal H}\kappa(H_{\mathrm{KuLSIF}}) \le \sup_{w\in\mathcal H}\kappa(H_{\psi\text{-}\mathrm{div}}).$$

Remember that $K_{11}$ is the Hessian matrix of the original (transductive) KMM method, and $H_{\mathrm{KMM}}$ is that of its inductive variant. Based on the probabilistic evaluation, the inequality $\kappa(H_{\mathrm{KuLSIF}}) \le \kappa(H_{\psi\text{-}\mathrm{div}})$ will also hold with high probability. Let $H_{\psi\text{-}\mathrm{KMM}}$ be the Hessian matrix of the loss function $L_{\psi\text{-}\mathrm{KMM}}$ in Remark 4. Then we conjecture that $\kappa(H_{\psi\text{-}\mathrm{div}}) \le \kappa(H_{\psi\text{-}\mathrm{KMM}})$ holds in some sense, as an extension of the relation between KuLSIF and the inductive variant of KMM. Consequently, R-KuLSIF will be advantageous in numerical computation.
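The first chain of inequalities in Remark 7 is easy to observe numerically; a minimal check reusing K11, n, and lam from the earlier sketches:

```python
# Numerical check of kappa(H_R-KuLSIF) <= kappa(K11) <= kappa(H_KuLSIF)
# <= kappa(H_KMM) from Remark 7, for one realization of the Gram matrix.
import numpy as np

H_r = K11 / n + lam * np.eye(n)                              # eq. (34)
H_ls = K11 @ K11 / n + lam * K11                             # eq. (26)
H_mm = (K11 @ K11 @ K11 / n ** 2
        + 2 * lam * K11 @ K11 / n + lam ** 2 * K11)          # eq. (27)

for name, H in [("R-KuLSIF", H_r), ("K11", K11), ("KuLSIF", H_ls), ("KMM", H_mm)]:
    print(name, np.linalg.cond(H))
```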
7 Simulation Results

In this section, we experimentally investigate the behavior of the condition numbers. For the inductive variant of the KMM estimator, the Hessian matrix is given by $H_{\mathrm{KMM}}$ defined in (27). For the M-estimator based on the f-divergence, the Hessian matrix involved in the optimization problem is given as

$$H = \frac1nK_{11}D_{\psi,w}K_{11} + \lambda K_{11} \in \Re^{n\times n}.$$

For the Kullback-Leibler divergence, we have $\varphi(z) = -\log z$ and $\psi(z) = -1 - \log(-z)$, $z < 0$; thus $\psi'(z) = -1/z$ and $\psi''(z) = 1/z^2$ hold for $z < 0$. If the optimal solution provides the true density ratio $w_0$, we obtain $\psi''(w(x)) = \psi''((\psi')^{-1}(w_0(x))) = w_0(x)^2$. Thus the Hessian matrix is given as

$$H_{\mathrm{KL}} = \frac1nK_{11}\,\mathrm{diag}\big(w_0(X_1)^2,\ldots,w_0(X_n)^2\big)\,K_{11} + \lambda K_{11} \in \Re^{n\times n}.$$

On the other hand, in KuLSIF the Hessian matrix is given by $H_{\mathrm{KuLSIF}}$ defined in (26), and the Hessian matrix of R-KuLSIF, $H_{\mathrm{R\text{-}KuLSIF}}$, is shown in (34). In the examples of Section 5.3.2, we considered the condition number of a random matrix

$$H_{\mathrm{RND}} = \frac1nK_{11}\,\mathrm{diag}(d_1,\ldots,d_n)\,K_{11} + \lambda K_{11} \in \Re^{n\times n}.$$

We use $F_\gamma(d)$ defined in Example 1 with various $\gamma$ as the distribution function of $d_1,\ldots,d_n$. The condition numbers of the Hessian matrices $H_{\mathrm{KMM}}$, $H_{\mathrm{KL}}$, $H_{\mathrm{KuLSIF}}$, $H_{\mathrm{R\text{-}KuLSIF}}$, and $H_{\mathrm{RND}}$ are numerically compared. In addition, the condition number of $K_{11}$ is also computed. In the original transductive KMM estimator defined by (2), the condition number of the loss function is equal to $\kappa(K_{11})$. Thus the convergence rate of numerical optimization in KMM would be approximately governed by $\kappa(K_{11})$—we would need to take the constraints in (2) into account to derive a more accurate convergence rate for the original KMM.

The probability densities of P and Q are both set to normal distributions on the 10-dimensional Euclidean space with the unit variance-covariance matrix $I_{10}$. The mean vectors of P and Q are set to $0\times1_{10}$ and $\mu\times1_{10}$ with $\mu = 0.2$ or $\mu = 0.5$, respectively. Note that the mean value $\mu$ affects only $\kappa(H_{\mathrm{KL}})$. The true density ratio $w_0$ is determined by P and Q. In the kernel-based estimators, we use the Gaussian kernel with width $\sigma = 2$ or $\sigma = 4$. Note that $\sigma = 4$ is close to the median of the distance between samples $\|X_i - X_j\|$; using the median distance as the kernel width is a popular heuristic (Schölkopf & Smola, 2002). The sample size from P is equal to that from Q, that is, $n = m$. The regularization parameter is set to $\lambda_{n,m} = 1/(n\wedge m)^{0.9}$, which meets the assumption in Theorem 1.

Table 1 shows the experimental results. In each setup, samples $X_1,\ldots,X_n$ and diagonal elements $d_1,\ldots,d_n$ are randomly generated and the condition number is computed; the table shows the average of the condition numbers over 1000 runs. As shown in Table 1, the condition number of R-KuLSIF is much smaller than those of the other methods in all cases. Thus it is expected that, in optimization, the convergence speed of R-KuLSIF is faster than that of the other methods and that R-KuLSIF is robust against numerical degeneracy. It is worthwhile to point out that $\kappa(H_{\mathrm{R\text{-}KuLSIF}})$ is smaller than $\kappa(K_{11})$. This is because the identity matrix in $H_{\mathrm{R\text{-}KuLSIF}}$ prevents the smallest eigenvalue from becoming extremely small.
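A scaled-down version of this experiment is easy to reproduce; the following sketch draws the $d_i$ by inverse-transform sampling from $F_\gamma$ of Example 1 and averages $\kappa(H_{\mathrm{RND}})$ over a few runs, reusing K11, n, and lam from the earlier sketches (the sample sizes and run counts are illustrative, not those of Table 1).

```python
# Small-scale version of the Table 1 experiment for H_RND: d_i are drawn
# from F_gamma(d) = 1 - d^{-gamma} (d >= 1) by inverse-transform sampling;
# larger gamma should push kappa(H_RND) down toward kappa(H_KuLSIF).
import numpy as np

rng = np.random.default_rng(2)
for gamma in [2, 5, 10]:
    conds = []
    for _ in range(100):
        d = rng.uniform(size=n) ** (-1.0 / gamma)   # inverse transform of F_gamma
        H_rnd = K11 @ np.diag(d) @ K11 / n + lam * K11
        conds.append(np.linalg.cond(H_rnd))
    print(gamma, np.mean(conds))
```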
The value of $\kappa(H_{\mathrm{RND}})$ decreases as $\gamma$ grows and seems to converge to $\kappa(H_{\mathrm{KuLSIF}})$. This result matches the considerations in Remark 6 and Example 1.

Table 2 shows the average number of iterations and the average computation time for solving the optimization problems over 50 runs. The probability densities of P and Q are the same as above, and the mean vector of Q is given as $0.5\times1_{10}$. The numbers of samples are set to $(n,m) = (1000,1000)$, $(4000,4000)$, or $(6000,6000)$, and the regularization parameter is $\lambda = 1/(n\wedge m)^{0.9}$. The number n is equal to the number of parameters to be optimized. R-KuLSIF, KuLSIF, the inductive variant of KMM (KMM), and the M-estimator with the Kullback-Leibler divergence (KL) are compared. In addition, the computation time for solving the linear equation (14) directly is also shown as R-KuLSIF(direct). The kernel parameter $\sigma$ is determined based on the median of $\|X_i - X_j\|$. To solve the optimization problems in the M-estimators and KMM, we used the BFGS method implemented in the optim function in R (R Development Core Team, 2009), and for R-KuLSIF(direct) we used the solve function. The results show that the number of iterations in optimization is highly correlated with the condition numbers of the Hessian matrices in Table 1. Although the practical computation time depends on various issues such as stopping rules, our theoretical results are in good agreement with the empirical results. Thus, R-KuLSIF is a stable and computationally efficient density-ratio estimator. We observe that numerical optimization methods such as the quasi-Newton method are competitive with numerical algorithms for solving linear equations (such as the LU or Cholesky decompositions), especially when the sample size or the number of parameters is large. Thus, the results obtained in this paper would be useful in large sample cases—common situations in practical applications.

8 Conclusions

We considered the problem of estimating the ratio of two probability densities and investigated theoretical properties of the kernel least-squares estimator called KuLSIF. We studied the condition number of Hessian matrices and showed that KuLSIF has a smaller condition number than the other methods. Since the condition number determines the convergence rate of optimization and the numerical stability, KuLSIF has preferable numerical properties compared to the other methods. We further showed that R-KuLSIF, an alternative formulation of KuLSIF, possesses an even smaller condition number. Density ratio estimation can provide new approaches to various machine learning problems including covariate shift adaptation (Huang et al., 2007; Sugiyama et al., 2008a; Kanamori et al., 2009; Bickel et al., 2009), outlier detection (Hido et al., 2008), and feature selection (Suzuki et al., 2008). Based on the theoretical guidance given in this paper, we will develop practical algorithms for a wide range of applications in future work.
8 Conclusions

We considered the problem of estimating the ratio of two probability densities and investigated theoretical properties of the kernel least-squares estimator called KuLSIF. We studied the condition number of the Hessian matrices and showed that KuLSIF has a smaller condition number than the other methods. Since the condition number determines the convergence rate of optimization and the numerical stability, KuLSIF will have preferable numerical properties compared with the other methods. We further showed that R-KuLSIF, an alternative formulation of KuLSIF, possesses an even smaller condition number. Density ratio estimation could provide new approaches to various machine learning problems including covariate shift adaptation (Huang et al., 2007; Sugiyama et al., 2008a; Kanamori et al., 2009; Bickel et al., 2009), outlier detection (Hido et al., 2008), and feature selection (Suzuki et al., 2008). Based on the theoretical guidance given in this paper, we will develop practical algorithms for a wide range of applications in future work.

Table 1: Condition numbers of each Hessian matrix (averages over 1000 runs).

kernel width: σ = 2
  n    K11      H_R-KuLSIF  H_KuLSIF  H_KMM    H_KL(µ=0.2)  H_KL(µ=0.5)  H_RND(γ=2)  H_RND(γ=5)  H_RND(γ=10)
  20   1.6e+01  3.8e+00     6.4e+01   2.7e+02  9.0e+01      1.4e+03      1.1e+02     7.4e+01     6.9e+01
  50   7.1e+01  8.1e+00     5.9e+02   5.1e+03  7.6e+02      4.8e+03      1.1e+03     7.1e+02     6.5e+02
  100  2.6e+02  1.5e+01     4.1e+03   6.5e+04  5.0e+03      2.7e+04      7.7e+03     5.0e+03     4.5e+03
  200  1.1e+03  3.0e+01     3.4e+04   1.0e+06  4.2e+04      1.6e+05      6.7e+04     4.2e+04     3.8e+04
  300  2.9e+03  4.4e+01     1.3e+05   5.7e+06  1.6e+05      5.8e+05      2.5e+05     1.6e+05     1.4e+05
  400  5.9e+03  5.8e+01     3.4e+05   2.0e+07  4.2e+05      1.5e+06      6.8e+05     4.3e+05     3.8e+05
  500  1.0e+04  7.3e+01     7.5e+05   5.5e+07  9.2e+05      3.1e+06      1.5e+06     9.4e+05     8.3e+05

kernel width: σ = 4
  n    K11      H_R-KuLSIF  H_KuLSIF  H_KMM    H_KL(µ=0.2)  H_KL(µ=0.5)  H_RND(γ=2)  H_RND(γ=5)  H_RND(γ=10)
  20   4.3e+02  1.2e+01     5.2e+03   6.3e+04  6.9e+03      2.8e+04      9.9e+03     6.4e+03     5.7e+03
  50   4.2e+03  2.8e+01     1.2e+05   3.4e+06  1.6e+05      7.7e+05      2.3e+05     1.5e+05     1.3e+05
  100  3.1e+04  5.5e+01     1.7e+06   9.6e+07  2.4e+06      1.2e+07      3.4e+06     2.2e+06     1.9e+06
  200  2.6e+05  1.1e+02     2.8e+07   3.1e+09  3.9e+07      2.1e+08      5.6e+07     3.5e+07     3.2e+07
  300  1.0e+06  1.6e+02     1.7e+08   2.7e+10  2.3e+08      1.2e+09      3.3e+08     2.1e+08     1.9e+08
  400  3.0e+06  2.1e+02     6.3e+08   1.4e+11  8.7e+08      5.0e+09      1.3e+09     7.9e+08     7.0e+08
  500  6.5e+06  2.7e+02     1.7e+09   4.6e+11  2.4e+09      1.3e+10      3.4e+09     2.2e+09     1.9e+09

Table 2: Averages of the computation time and the number of iterations in the BFGS method over 50 runs (CPU: Xeon X5482, 3.20 GHz; memory: 32 GB; OS: Linux 2.6.18).

                    n = m = 1000          n = m = 4000          n = m = 6000
  Estimator         time (s)  iterations  time (s)  iterations  time (s)  iterations
  R-KuLSIF          1.44      23.02       34.94     29.98       71.69     30.74
  KuLSIF            2.25      38.36       53.93     48.76       107.79    47.32
  KMM               51.83     453.68      591.44    400.74      1091.69   373.08
  KL                27.63     329.06      1180.72   634.32      2718.89   669.20
  R-KuLSIF(direct)  0.46      –           28.85     –           87.06     –

A Proof of Theorem 1

Let us define the bracketing entropy of a set of functions. For a distribution function $P$, define the $L_2$ metric $\|g\|_P = (\int |g|^2\, dP)^{1/2}$, and let $L_2(P)$ be the metric space defined by this distance. For any fixed $\delta > 0$, a covering for a function class $\mathcal S$ using the metric $L_2(P)$ is a collection of functions which allows $\mathcal S$ to be covered using $L_2(P)$ balls of radius $\delta$ centered at these functions. Let $N_B(\delta, \mathcal S, P)$ be the smallest value of $N$ for which there exist pairs of functions $\{(g_j^L, g_j^U) \in L_2(P) \times L_2(P) \mid j = 1, \ldots, N\}$ such that $\|g_j^L - g_j^U\|_P \le \delta$, and such that for each $s \in \mathcal S$ there exists $j$ such that $g_j^L \le s \le g_j^U$. Then $H_B(\delta, \mathcal S, P) = \log N_B(\delta, \mathcal S, P)$ is called the bracketing entropy of $\mathcal S$ (van de Geer, 2000).

Let $\mathcal H$ be the RKHS endowed with the Gaussian kernel $k(x, y) = e^{-\|x-y\|^2/(2\sigma^2)}$. The norm and inner product on $\mathcal H$ are denoted by $\|\cdot\|_{\mathcal H}$ and $\langle\cdot,\cdot\rangle_{\mathcal H}$, respectively, and $\|\cdot\|_\infty$ denotes the supremum norm. For $w \in \mathcal H$, we have $\|w\|_P \le \|w\|_\infty \le \|w\|_{\mathcal H}$, because for any $x \in Z$ the inequality $|w(x)| = |\langle w, k(\cdot, x)\rangle_{\mathcal H}| \le \|w\|_{\mathcal H} \sup_x k(x, x)^{1/2} = \|w\|_{\mathcal H}$ holds. The set $Z$, which is the domain of the functions in $\mathcal H$, is assumed to be compact. Let $\mathcal G = \{v^2 \mid v \in \mathcal H\}$, and let $\mathcal H_M$ and $\mathcal G_M$ be
$$\mathcal H_M = \{v \in \mathcal H \mid \|v\|_{\mathcal H} < M\}, \qquad \mathcal G_M = \{v^2 \mid v \in \mathcal H_{\sqrt M}\} = \{g \in \mathcal G \mid J(g) < M\}, \qquad (35)$$
where $J(g)$ is a measure of complexity defined as $J(g) = \inf\{\|v\|_{\mathcal H}^2 \mid v \in \mathcal H,\ v^2 = g\}$. It is straightforward to verify the second equality of (35).
According to Zhou (2002), the bracketing entropy of $\mathcal H_M$ satisfies, for infinitesimally small $\gamma > 0$, the condition
$$H_B(\delta, \mathcal H_M, P) = O\left(\left(\frac{M}{\delta}\right)^\gamma\right). \qquad (36)$$
More precisely, Zhou (2002) proved that the entropy number with respect to the supremum norm is bounded above by $O((M/\delta)^\gamma)$; in addition, the bracketing entropy $H_B(\delta, \mathcal H_M, P)$ is bounded above by the entropy number with respect to the supremum norm due to Lemma 2.1 in van de Geer (2000).

The following proposition is crucial to prove the convergence property of KuLSIF.

Proposition 2 (Lemma 5.14 in van de Geer (2000)). Let $I(g)$ be a measure of complexity of $g \in \mathcal G$, where $I$ is a non-negative functional on $\mathcal G$ and $I(g_0) < \infty$, and define $\mathcal G_M = \{g \in \mathcal G \mid I(g) < M\}$, so that $\mathcal G = \cup_{M \ge 1} \mathcal G_M$. Suppose that there exist $c_0 > 0$ and $0 < \gamma < 2$ such that
$$\sup_{g \in \mathcal G_M} \|g - g_0\|_P \le c_0 M, \qquad \sup_{g \in \mathcal G_M} \|g - g_0\|_\infty \le c_0 M,$$
and that $H_B(\delta, \mathcal G_M, P) = O((M/\delta)^\gamma)$. Then we have
$$\sup_{g \in \mathcal G} \frac{\int (g - g_0)\, d(P - P_n)}{D(g)} = O_p(1),$$
where $D(g)$ is defined as
$$D(g) = \frac{\|g - g_0\|_P^{1-\gamma/2}\, I(g)^{\gamma/2}}{\sqrt n} \vee \frac{I(g)}{n^{2/(2+\gamma)}},$$
and $a \vee b$ denotes $\max\{a, b\}$.

We use Proposition 2 to derive upper bounds of $\int (\widehat w - w_0)\, d(Q - Q_m)$ and $\int (\widehat w^2 - w_0^2)\, d(P - P_n)$.

Lemma 1. The bracketing entropy of $\mathcal G_M$ is bounded above as
$$H_B(\delta, \mathcal G_M, P) = O\left(\left(\frac{M}{\delta}\right)^\gamma\right)$$
for any small $\gamma > 0$.

Proof. Let $v_1^L, v_1^U, v_2^L, v_2^U, \ldots, v_N^L, v_N^U \in L_2(P)$ be coverings of $\mathcal H_{\sqrt M}$ in the sense of bracketing, such that $\|v_i^L - v_i^U\|_P \le \delta$ holds for $i = 1, \ldots, N$. We can choose these functions such that $\|v_i^{L(U)}\|_\infty \le \sqrt M$ is satisfied for all $i = 1, \ldots, N$, since for any $v \in \mathcal H_{\sqrt M}$ the inequality $\|v\|_\infty \le \|v\|_{\mathcal H} < \sqrt M$ holds; for example, replace $v_i^{L(U)}$ with $\min\{\sqrt M, \max\{-\sqrt M, v_i^{L(U)}\}\} \in L_2(P)$. Let $\bar v_i^L$ and $\bar v_i^U$ be
$$\bar v_i^L(x) = \begin{cases} (v_i^L(x))^2 & v_i^L(x) \ge 0, \\ (v_i^U(x))^2 & v_i^U(x) \le 0, \\ 0 & v_i^L(x) < 0 < v_i^U(x), \end{cases} \qquad \bar v_i^U = \max\{(v_i^L)^2, (v_i^U)^2\},$$
for $i = 1, \ldots, N$. Then $\bar v_i^L \le \bar v_i^U$ holds. Moreover, for any $v \in \mathcal H_{\sqrt M}$ satisfying $v_i^L \le v \le v_i^U$, we have $\bar v_i^L \le v^2 \le \bar v_i^U$. By definition, we also have
$$0 \le \bar v_i^U(x) - \bar v_i^L(x) \le \max\{|v_i^U(x)^2 - v_i^L(x)^2|,\ |v_i^U(x) - v_i^L(x)|^2\} \le (|v_i^U(x)| + |v_i^L(x)|) \cdot |v_i^U(x) - v_i^L(x)| \le 2\sqrt M\, |v_i^U(x) - v_i^L(x)|,$$
and thus $\|\bar v_i^U - \bar v_i^L\|_P \le 2\sqrt M\, \|v_i^U - v_i^L\|_P$ holds. Due to (36), we obtain
$$H_B(2\sqrt M\,\delta,\ \mathcal G_M,\ P) \le H_B(\delta,\ \mathcal H_{\sqrt M},\ P) = O\left(\left(\frac{\sqrt M}{\delta}\right)^\gamma\right).$$
Hence $H_B(\delta, \mathcal G_M, P) = O((M/\delta)^\gamma)$ holds.

Lemma 2. Assume the condition of Theorem 1. Then, for the KuLSIF estimator $\widehat w$, we have
$$\int (\widehat w - w_0)\, d(Q - Q_m) = O_p\left(\frac{\|w_0 - \widehat w\|_P^{1-\gamma/2}\, \|\widehat w\|_{\mathcal H}^{\gamma/2}}{\sqrt m} \vee \frac{\|\widehat w\|_{\mathcal H}}{m^{2/(2+\gamma)}}\right),$$
$$\int (\widehat w^2 - w_0^2)\, d(P - P_n) = O_p\left(\frac{\|\widehat w - w_0\|_P^{1-\gamma/2}\, (1 + \|\widehat w\|_{\mathcal H})^{1+\gamma/2}}{\sqrt n} \vee \frac{\|\widehat w\|_{\mathcal H}^2}{n^{2/(2+\gamma)}}\right),$$
where $\gamma > 0$ is an infinitesimally small value.

Proof. There exists $c_0 > 0$ such that
$$\sup_{w \in \mathcal H_M} \|w - w_0\|_P \le c_0 M, \qquad \sup_{w \in \mathcal H_M} \|w - w_0\|_\infty \le c_0 M, \qquad (37)$$
$$\sup_{g \in \mathcal G_M} \|g - w_0^2\|_P \le c_0 M, \qquad \sup_{g \in \mathcal G_M} \|g - w_0^2\|_\infty \le c_0 M. \qquad (38)$$
The inequalities in (38) are derived as follows.
For $g \in \mathcal G_M$, there exists $v \in \mathcal H$ such that $v^2 = g$ and $\|v\|_{\mathcal H}^2 < M$, and then we have
$$\|g - w_0^2\|_P \le \|g - w_0^2\|_\infty \le \|v\|_\infty^2 + \|w_0\|_\infty^2 \le \|v\|_{\mathcal H}^2 + \|w_0\|_\infty^2 \le M + \|w_0\|_\infty^2 \le c_0 M \qquad (M \ge 1).$$
In the same way, (37) also holds. Therefore, due to Proposition 2 and (37), we have
$$\sup_{w \in \mathcal H} \frac{\int (w_0 - w)\, d(Q - Q_m)}{D(w)} = O_p(1), \qquad D(w) = \frac{\|w_0 - w\|_P^{1-\gamma/2}\, \|w\|_{\mathcal H}^{\gamma/2}}{\sqrt m} \vee \frac{\|w\|_{\mathcal H}}{m^{2/(2+\gamma)}}.$$
In the same way, we have
$$\sup_{w \in \mathcal H} \frac{\int (w^2 - w_0^2)\, d(P - P_n)}{E(w)} = O_p(1), \qquad E(w) = \frac{\|w^2 - w_0^2\|_P^{1-\gamma/2}\, J(w^2)^{\gamma/2}}{\sqrt n} \vee \frac{J(w^2)}{n^{2/(2+\gamma)}}.$$
Note that $\|w^2 - w_0^2\|_P \le (\|w_0\|_\infty + \|w\|_{\mathcal H}) \|w - w_0\|_P = O((1 + \|w\|_{\mathcal H}) \|w - w_0\|_P)$ and $J(w^2) \le \|w\|_{\mathcal H}^2$. Then we obtain
$$E(w) \le \frac{\|w - w_0\|_P^{1-\gamma/2}\, (1 + \|w\|_{\mathcal H})^{1+\gamma/2}}{\sqrt n} \vee \frac{\|w\|_{\mathcal H}^2}{n^{2/(2+\gamma)}}.$$

Now we show the proof of Theorem 1.

Proof. The estimator $\widehat w$ satisfies the inequality
$$\frac12 \int \widehat w^2\, dP_n - \int \widehat w\, dQ_m + \frac\lambda2 \|\widehat w\|_{\mathcal H}^2 \le \frac12 \int w_0^2\, dP_n - \int w_0\, dQ_m + \frac\lambda2 \|w_0\|_{\mathcal H}^2.$$
Then we have
$$\frac12 \|\widehat w - w_0\|_P^2 = \int (w_0 - \widehat w)\, dQ + \frac12 \int (\widehat w^2 - w_0^2)\, dP \le \int (w_0 - \widehat w)\, dQ + \frac12 \int (\widehat w^2 - w_0^2)\, dP + \int (\widehat w - w_0)\, dQ_m + \frac12 \int (w_0^2 - \widehat w^2)\, dP_n + \frac\lambda2 \|w_0\|_{\mathcal H}^2 - \frac\lambda2 \|\widehat w\|_{\mathcal H}^2.$$
As a result, we have
$$\frac12 \|\widehat w - w_0\|_P^2 + \frac\lambda2 \|\widehat w\|_{\mathcal H}^2 \le \int (w_0 - \widehat w)\, d(Q - Q_m) + \frac12 \int (\widehat w^2 - w_0^2)\, d(P - P_n) + \frac\lambda2 \|w_0\|_{\mathcal H}^2 \le \frac\lambda2 \|w_0\|_{\mathcal H}^2 + O_p\left(\frac{\|w_0 - \widehat w\|_P^{1-\gamma/2}\, (1 + \|\widehat w\|_{\mathcal H})^{1+\gamma/2}}{\sqrt{n \wedge m}} \vee \frac{(1 + \|\widehat w\|_{\mathcal H})^2}{(n \wedge m)^{2/(2+\gamma)}}\right),$$
where Lemma 2 is used. We need to study three possibilities:
$$\frac12 \|w_0 - \widehat w\|_P^2 + \frac\lambda2 \|\widehat w\|_{\mathcal H}^2 \le O_p(\lambda), \qquad (39)$$
$$\frac12 \|w_0 - \widehat w\|_P^2 + \frac\lambda2 \|\widehat w\|_{\mathcal H}^2 \le O_p\left(\frac{\|w_0 - \widehat w\|_P^{1-\gamma/2}\, (1 + \|\widehat w\|_{\mathcal H})^{1+\gamma/2}}{\sqrt{n \wedge m}}\right), \qquad (40)$$
$$\frac12 \|w_0 - \widehat w\|_P^2 + \frac\lambda2 \|\widehat w\|_{\mathcal H}^2 \le O_p\left(\frac{(1 + \|\widehat w\|_{\mathcal H})^2}{(n \wedge m)^{2/(2+\gamma)}}\right). \qquad (41)$$
At least one of the above inequalities must be satisfied. We study each inequality below.

Case (39): we have $\frac12 \|w_0 - \widehat w\|_P^2 \le O_p(\lambda)$ and $\frac\lambda2 \|\widehat w\|_{\mathcal H}^2 \le O_p(\lambda)$, and hence the inequalities $\|w_0 - \widehat w\|_P \le O_p(\lambda^{1/2})$ and $\|\widehat w\|_{\mathcal H} \le O_p(1)$ hold.

Case (40): we have
$$\|w_0 - \widehat w\|_P^2 \le O_p\left(\frac{\|w_0 - \widehat w\|_P^{1-\gamma/2}\, (1 + \|\widehat w\|_{\mathcal H})^{1+\gamma/2}}{(n \wedge m)^{1/2}}\right), \qquad \lambda \|\widehat w\|_{\mathcal H}^2 \le O_p\left(\frac{\|w_0 - \widehat w\|_P^{1-\gamma/2}\, (1 + \|\widehat w\|_{\mathcal H})^{1+\gamma/2}}{(n \wedge m)^{1/2}}\right).$$
The first inequality provides
$$\|w_0 - \widehat w\|_P \le O_p\left(\frac{1 + \|\widehat w\|_{\mathcal H}}{(n \wedge m)^{1/(2+\gamma)}}\right).$$
Thus, the second inequality leads to
$$\lambda \|\widehat w\|_{\mathcal H}^2 \le O_p\left(\left(\frac{1 + \|\widehat w\|_{\mathcal H}}{(n \wedge m)^{1/(2+\gamma)}}\right)^{1-\gamma/2} \frac{(1 + \|\widehat w\|_{\mathcal H})^{1+\gamma/2}}{(n \wedge m)^{1/2}}\right) = O_p\left(\frac{(1 + \|\widehat w\|_{\mathcal H})^2}{(n \wedge m)^{2/(2+\gamma)}}\right).$$
Hence we have
$$\|\widehat w\|_{\mathcal H} \le O_p\left(\frac{1}{\lambda^{1/2} (n \wedge m)^{1/(2+\gamma)}}\right) = o_p(1)$$
for infinitesimally small $\gamma > 0$. Then we obtain
$$\|w_0 - \widehat w\|_P \le O_p\left(\frac{1}{(n \wedge m)^{1/(2+\gamma)}}\right) \le O_p(\lambda^{1/2}).$$

Case (41): we have
$$\|w_0 - \widehat w\|_P^2 \le O_p\left(\frac{(1 + \|\widehat w\|_{\mathcal H})^2}{(n \wedge m)^{2/(2+\gamma)}}\right), \qquad \lambda \|\widehat w\|_{\mathcal H}^2 \le O_p\left(\frac{(1 + \|\widehat w\|_{\mathcal H})^2}{(n \wedge m)^{2/(2+\gamma)}}\right).$$
Then, as shown in case (40), we have $\|\widehat w\|_{\mathcal H} = o_p(1)$, and hence we obtain
$$\|w_0 - \widehat w\|_P \le O_p\left(\frac{1}{(n \wedge m)^{1/(2+\gamma)}}\right) \le O_p(\lambda^{1/2}).$$
In all three cases, $\|w_0 - \widehat w\|_P \le O_p(\lambda^{1/2})$ holds, which completes the proof.

B Leave-one-out Cross-validation of KuLSIF

The procedure to compute the leave-one-out cross-validation (LOOCV) score of KuLSIF is presented here.
Let $K_{11}^{(\ell)} \in \Re^{(n-1) \times (n-1)}$ and $K_{12}^{(\ell)} = K_{21}^{(\ell)\top} \in \Re^{(n-1) \times (m-1)}$ be the Gram matrices of the samples with $x_\ell$ and $y_\ell$ removed, respectively. According to Theorem 3, the estimated parameters $\tilde\alpha^{(\ell)}$ and $\tilde\beta^{(\ell)}$ of
$$\widehat w^{(\ell)}(z) = \sum_{i \ne \ell} \alpha_i k(z, X_i) + \sum_{j \ne \ell} \beta_j k(z, Y_j)$$
are equal to
$$\tilde\alpha^{(\ell)} = -\frac{1}{(m-1)\lambda} \big(K_{11}^{(\ell)} + (n-1)\lambda I_{n-1}\big)^{-1} K_{12}^{(\ell)} \mathbf{1}_{m-1}, \qquad \tilde\beta^{(\ell)} = \frac{1}{(m-1)\lambda} \mathbf{1}_{m-1},$$
where $I_{n-1}$ denotes the $(n-1) \times (n-1)$ identity matrix. Hence, the parameter $\tilde\alpha^{(\ell)}$ is the solution of the following convex quadratic problem:
$$\min_{\alpha \in \Re^{n-1}}\ \frac12 \alpha^\top \big(K_{11}^{(\ell)} + (n-1)\lambda I_{n-1}\big) \alpha + \frac{1}{(m-1)\lambda} \mathbf{1}_{m-1}^\top K_{21}^{(\ell)} \alpha. \qquad (42)$$
The same solution can be obtained by solving
$$\min_{\alpha \in \Re^n}\ \frac12 \alpha^\top \big(K_{11} + (n-1)\lambda I_n\big) \alpha + \frac{1}{(m-1)\lambda} (\mathbf{1}_m - e_{m,\ell})^\top K_{21} \alpha \quad \text{s.t. } \alpha_\ell = 0, \qquad (43)$$
where $e_{m,\ell} \in \Re^m$ is the standard unit vector whose $\ell$-th component is 1. The optimal solution of (43), denoted by $\alpha^{(\ell)}$, is equal to
$$\alpha^{(\ell)} = \big(K_{11} + (n-1)\lambda I_n\big)^{-1} \left(-\frac{1}{(m-1)\lambda} K_{12} (\mathbf{1}_m - e_{m,\ell}) - c_\ell e_{n,\ell}\right),$$
where $c_\ell$ is determined so that $\alpha_\ell^{(\ell)} = 0$. The estimator $\tilde\alpha^{(\ell)} \in \Re^{n-1}$ is the $(n-1)$-dimensional vector consisting of $\alpha^{(\ell)}$ without the $\ell$-th component, i.e., $\tilde\alpha^{(\ell)} = (\alpha_1^{(\ell)}, \ldots, \alpha_{\ell-1}^{(\ell)}, \alpha_{\ell+1}^{(\ell)}, \ldots, \alpha_n^{(\ell)})^\top$.

The parameters of the leave-one-out estimators,
$$A = (\alpha^{(1)}, \ldots, \alpha^{(n \wedge m)}) \in \Re^{n \times (n \wedge m)}, \qquad B = (\beta^{(1)}, \ldots, \beta^{(n \wedge m)}) \in \Re^{m \times (n \wedge m)},$$
also have analytic expressions. Let $G = (K_{11} + (n-1)\lambda I_n)^{-1} \in \Re^{n \times n}$, and let $E \in \Re^{m \times (n \wedge m)}$ be the matrix defined as
$$E_{ij} = \begin{cases} 1 & i \ne j, \\ 0 & i = j. \end{cases}$$
Let $S = -\frac{1}{(m-1)\lambda} K_{12} E \in \Re^{n \times (n \wedge m)}$, and let $T \in \Re^{n \times (n \wedge m)}$ be
$$T_{ij} = \begin{cases} (GS)_{ii}/G_{ii} & i = j, \\ 0 & i \ne j. \end{cases}$$
Then we obtain
$$A = G(S - T), \qquad B = \frac{1}{(m-1)\lambda} E.$$
Let $K_X \in \Re^{(n \wedge m) \times (n+m)}$ be the sub-matrix of $(K_{11}\ K_{12})$ formed by the first $n \wedge m$ rows and all columns; similarly, let $K_Y \in \Re^{(n \wedge m) \times (n+m)}$ be the sub-matrix of $(K_{21}\ K_{22})$ formed by the first $n \wedge m$ rows and all columns. Let the product $U * U'$ be the element-wise multiplication of matrices $U$ and $U'$ of the same size, i.e., the $(i,j)$ element is given by $U_{ij} U'_{ij}$. Then we have
$$\widehat w_X = \big(\widehat w^{(1)}(X_1), \ldots, \widehat w^{(n \wedge m)}(X_{n \wedge m})\big)^\top = \big(K_X * (A^\top\ B^\top)\big) \mathbf{1}_{n+m},$$
$$\widehat w_Y = \big(\widehat w^{(1)}(Y_1), \ldots, \widehat w^{(n \wedge m)}(Y_{n \wedge m})\big)^\top = \big(K_Y * (A^\top\ B^\top)\big) \mathbf{1}_{n+m},$$
$$\widehat w_{X+} = \max\{\widehat w_X, 0\}, \qquad \widehat w_{Y+} = \max\{\widehat w_Y, 0\},$$
where the max operation for a vector is applied element-wise. As a result, the LOOCV score (18) is equal to
$$\mathrm{LOOCV} = \frac{1}{n \wedge m} \left(\frac12 \widehat w_{X+}^\top \widehat w_{X+} - \mathbf{1}_{n \wedge m}^\top \widehat w_{Y+}\right).$$
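These closed-form expressions reduce the LOOCV score to a handful of matrix operations. The R sketch below implements them directly; it is illustrative rather than the authors' code, and the data-generating setup (which mirrors the Section 7 experiments with $n = m$) is an assumption:

```r
## Analytic LOOCV score of KuLSIF (illustrative sketch, n = m).
set.seed(1)
n <- 100; m <- 100; d <- 10; sigma <- 2; mu <- 0.5
lambda <- 1 / min(n, m)^0.9
X <- matrix(rnorm(n * d), n, d)                 # sample from P = N(0, I_10)
Y <- matrix(rnorm(m * d, mean = mu), m, d)      # sample from Q = N(mu*1, I_10)
gram <- function(U, V)                          # Gaussian Gram block
  exp(-(outer(rowSums(U^2), rowSums(V^2), "+") - 2 * U %*% t(V)) / (2 * sigma^2))
K11 <- gram(X, X); K12 <- gram(X, Y); K21 <- t(K12); K22 <- gram(Y, Y)

G  <- solve(K11 + (n - 1) * lambda * diag(n))   # (K11 + (n-1) lambda I)^{-1}
E  <- 1 - diag(m)                               # E_ij = 1 (i != j), 0 (i = j)
S  <- -K12 %*% E / ((m - 1) * lambda)
Tm <- diag(diag(G %*% S) / diag(G))             # T_ii = (GS)_ii / G_ii
A  <- G %*% (S - Tm)                            # A = G (S - T)
B  <- E / ((m - 1) * lambda)                    # B = E / ((m-1) lambda)

AB  <- cbind(t(A), t(B))                        # rows of (A^T  B^T)
wXp <- pmax(rowSums(cbind(K11, K12) * AB), 0)   # \hat w_{X+}
wYp <- pmax(rowSums(cbind(K21, K22) * AB), 0)   # \hat w_{Y+}
(loocv <- (0.5 * sum(wXp^2) - sum(wYp)) / min(n, m))
```

One would evaluate this score over a grid of $(\sigma, \lambda)$ candidates and pick the minimizer.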
C Proof of Eq. (25)

Let $\kappa(A)$ be the condition number of a symmetric positive definite matrix $A$. We prove that the equality
$$\min_{S:\ \kappa(S) \le C} \kappa(SAS) = \max\left\{\frac{\kappa(A)}{C^2},\ 1\right\}$$
holds. The same equality holds for the condition number defined through singular values of non-symmetric matrices; for simplicity, we prove the case in which $S$ is a symmetric matrix. Note that $\kappa(S^2) = \kappa(S)^2$ and $\kappa(S) = \kappa(S^{-1})$; thus we obtain Eq. (25), i.e.,
$$\min_{S:\ \kappa(S) \le C} \kappa(S^{-1/2} A S^{-1/2}) = \max\left\{\frac{\kappa(A)}{C},\ 1\right\}.$$

Proof. First, we prove $\min_{S:\kappa(S) \le C} \kappa(SAS) \ge \max\{\kappa(A)/C^2,\ 1\}$. The matrix $A$ is symmetric positive definite; thus there exist an orthogonal matrix $Q$ and a diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ such that $A = Q \Lambda Q^\top$, where the eigenvalues are arranged in decreasing order, i.e., $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n > 0$. In the same way, let $S$ be $P D P^\top$, where $P$ is an orthogonal matrix and $D = \mathrm{diag}(d_1, \ldots, d_n)$ is a diagonal matrix such that $d_1 \ge d_2 \ge \cdots \ge d_n > 0$ and $d_1/d_n \le C$. Hence,
$$\kappa(SAS) = \kappa(P D P^\top Q \Lambda Q^\top P D P^\top) = \kappa(D P^\top Q \Lambda Q^\top P D).$$
Let $R^\top = Q^\top P$, which is also an orthogonal matrix. The maximum eigenvalue of $D R \Lambda R^\top D$ is given as $\max_{\|x\|=1} x^\top D R \Lambda R^\top D x$. Let $R = (r_1, \ldots, r_n)$ with $r_i \in \Re^n$, and choose $x_1$ such that $r_i^\top D x_1 = 0$ for $i = 2, \ldots, n$ and $\|x_1\| = 1$. Then
$$\max_{\|x\|=1} x^\top D R \Lambda R^\top D x \ge x_1^\top D R \Lambda R^\top D x_1 = \lambda_1 (x_1^\top D r_1)^2.$$
From the assumption on $x_1$, $D x_1$ is represented as $c r_1$ for some $c$, and we have $(x_1^\top D r_1)^2 = c^2 = x_1^\top D^2 x_1 \ge d_n^2$. Hence, we have $\max_{\|x\|=1} x^\top S A S x \ge \lambda_1 d_n^2$. On the other hand, the minimum eigenvalue of $D R \Lambda R^\top D$ is given as $\min_{\|x\|=1} x^\top D R \Lambda R^\top D x$. We choose $x_n$ such that $r_i^\top D x_n = 0$ for $i = 1, \ldots, n-1$ and $\|x_n\| = 1$. Then
$$\min_{\|x\|=1} x^\top D R \Lambda R^\top D x \le x_n^\top D R \Lambda R^\top D x_n = \lambda_n (x_n^\top D r_n)^2 \le \lambda_n x_n^\top D^2 x_n \le \lambda_n d_1^2,$$
where the second inequality follows from the Schwarz inequality. As a result, the condition number of $SAS$ is bounded below as
$$\kappa(SAS) \ge \frac{\lambda_1 d_n^2}{\lambda_n d_1^2} = \frac{\kappa(A)}{(d_1/d_n)^2} \ge \frac{\kappa(A)}{C^2}.$$
Next, we prove $\min_{S:\kappa(S) \le C} \kappa(SAS) \le \max\{\kappa(A)/C^2,\ 1\}$. If $\kappa(A) \le C^2$, the equality $\min_{S:\kappa(S) \le C} \kappa(SAS) = 1$ holds, because we can choose $S = A^{-1/2}$. We then prove $\min_{S:\kappa(S) \le C} \kappa(SAS) \le \kappa(A)/C^2$ when $1 \le C^2 \le \kappa(A)$ is satisfied. Let $S = Q \Gamma Q^\top$ with $\Gamma$ a diagonal matrix $\mathrm{diag}(\gamma_1, \ldots, \gamma_n)$; then $\kappa(SAS) = \kappa(\mathrm{diag}(\gamma_1^2 \lambda_1, \ldots, \gamma_n^2 \lambda_n))$ holds. Let $\gamma_1 = 1$ and $\gamma_n = C$. Since $1 \le C^2 \le \kappa(A) = \lambda_1/\lambda_n$ holds, for $k = 2, \ldots, n-1$ we have
$$1 \le \min\left\{C,\ \sqrt{\frac{\lambda_1}{\lambda_k}}\right\}, \qquad C \sqrt{\frac{\lambda_n}{\lambda_k}} \le \min\left\{C,\ \sqrt{\frac{\lambda_1}{\lambda_k}}\right\},$$
and thus we obtain
$$\max\left\{1,\ C\sqrt{\frac{\lambda_n}{\lambda_k}}\right\} \le \min\left\{C,\ \sqrt{\frac{\lambda_1}{\lambda_k}}\right\}, \qquad k = 2, \ldots, n-1.$$
Hence, there exists $\gamma_k$, $k = 2, \ldots, n-1$, such that
$$\max\left\{1,\ C\sqrt{\frac{\lambda_n}{\lambda_k}}\right\} \le \gamma_k \le \min\left\{C,\ \sqrt{\frac{\lambda_1}{\lambda_k}}\right\}.$$
Thus $1 \le \gamma_k \le C$ holds for all $k = 2, \ldots, n-1$, and moreover $C^2 \lambda_n \le \gamma_k^2 \lambda_k \le \lambda_1$ also holds. These inequalities imply $\kappa(S) = C$ and $\kappa(SAS) = \lambda_1/(C^2 \lambda_n) = \kappa(A)/C^2$. Therefore, $\min_{S:\kappa(S) \le C} \kappa(SAS) \le \kappa(A)/C^2$ holds if $1 \le C^2 \le \kappa(A)$.
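The construction in the second half of the proof is explicit enough to check numerically. The R sketch below (illustrative only) builds $A = Q\Lambda Q^\top$ with a known spectrum, forms $S = Q\Gamma Q^\top$ with $\gamma_1 = 1$, $\gamma_n = C$, and the remaining $\gamma_k$ at the lower end of the feasible interval, and compares $\kappa(SAS)$ against the predicted optimum $\kappa(A)/C^2$:

```r
## Numerical check of min_{kappa(S) <= C} kappa(SAS) = max(kappa(A)/C^2, 1),
## using the gamma_k construction from the proof (requires C^2 <= kappa(A)).
set.seed(2)
p <- 6; C <- 3
Q   <- qr.Q(qr(matrix(rnorm(p^2), p)))       # random orthogonal matrix
lam <- 10^seq(3, 0, length.out = p)          # decreasing spectrum, kappa(A) = 1000
A   <- Q %*% diag(lam) %*% t(Q)

gam    <- pmax(1, C * sqrt(lam[p] / lam))    # lower end of the feasible interval
gam[1] <- 1; gam[p] <- C                     # endpoints fixed as in the proof
S <- Q %*% diag(gam) %*% t(Q)                # kappa(S) = C by construction

kappa2 <- function(M) {
  ev <- eigen(M, symmetric = TRUE, only.values = TRUE)$values
  max(ev) / min(ev)
}
c(achieved  = kappa2(S %*% A %*% S),         # equals kappa(A)/C^2 = 1000/9
  predicted = max(lam[1] / lam[p] / C^2, 1))
```

The two numbers coincide up to floating-point error, as the proof guarantees; this is the kind of preconditioning trade-off that Eq. (25) quantifies.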
D Proof of Theorem 4

We show the proof of Theorem 4.

Proof. Let $w_1$ be the constant function taking the value 1 over $Z$. In a universal RKHS, for any $\delta > 0$ there exists $w \in \mathcal H$ such that $\|w_1 - w\|_\infty \le \delta$. According to Appendix D in Horn and Johnson (1985), the eigenvalues of a matrix are continuous in its entries, and thus so are the minimal and maximal eigenvalues and the condition number, as long as the condition number is well-defined. Then, for any $\varepsilon > 0$ and any $\psi$ satisfying $\psi''(1) = 1$, there exists $w \in \mathcal H$ such that
$$|\kappa_0(D_{\psi,w}) - \kappa_0(I_n)| \le \varepsilon.$$
Then, for fixed samples $X_1, \ldots, X_n$, we find that $\sup\{\kappa_0(D_{\psi,w}) \mid w \in \mathcal H\} \ge \kappa_0(I_n)$. On the other hand, for $\psi(z) = z^2/2$, we obtain $\sup\{\kappa_0(D_{\psi,w}) \mid w \in \mathcal H\} = \kappa_0(I_n)$. Thus, (30) holds.

E Proof of Theorem 5

The following lemma is the key to prove Theorem 5.

Lemma 3. Suppose that the kernel function $k$ satisfies the condition in Theorem 5, and that the expectation of $\psi''(\widehat w(X_1))$ exists. The probability $\Pr(\cdots)$ is defined from the distribution of the samples $X_1, \ldots, X_n, Y_1, \ldots, Y_m$. Then, there exists a positive constant $\varepsilon > 0$ such that the probability distribution of $\kappa(H)$ is bounded above by
$$\Pr(\kappa(H) < \delta) \le F_n\Big(\frac{c}{\varepsilon}\Big) + \frac{\delta}{c}\big(E[\psi''(\widehat w(X_1))] + \lambda\big), \qquad (44)$$
where $c$ is an arbitrary positive value. On the other hand, for any positive number $c > 0$, we have
$$\Pr\Big(\kappa(H) > \kappa(K_{11})\Big(1 + \frac{c}{\lambda}\Big)\Big) \le 1 - F_n(c) \qquad (45)$$
if the Gram matrix $K_{11}$ is almost surely positive definite.

Proof. Let $k_i$ be the $i$-th column vector of the Gram matrix $K_{11}$. Due to the condition on the kernel function, there exists a constant $\varepsilon > 0$ such that
$$\Pr\big(\sqrt\varepsilon \le (K_{11})_{ij} \le 1,\ i, j = 1, \ldots, n\big) = 1,$$
where the probability is induced from the joint probability of $X_1, \ldots, X_n$. Hence,
$$\Pr\big(\varepsilon n \le \|k_i\|^2 \le n,\ i = 1, \ldots, n\big) = 1 \qquad (46)$$
also holds. Let $d_i$ be $\psi''(\widehat w(X_i))$; then the matrix $H$ is represented as
$$H = \frac1n \sum_{i=1}^n d_i k_i k_i^\top + \lambda K_{11} \in \Re^{n \times n}.$$
Let us define
$$Y_n = \min_{\|a\|=1} a^\top H a, \qquad Z_n = \max_{\|a\|=1} a^\top H a.$$
$Y_n$ and $Z_n$ are the minimal and maximal eigenvalues of $H$; thus, the condition number of $H$ is given as $\kappa(H) = Z_n/Y_n$. We derive an upper bound of $Y_n$ and a lower bound of $Z_n$ to prove the first inequality (44). The minimal eigenvalue is less than or equal to the average of all eigenvalues, and the sum of the eigenvalues is equal to the trace of the matrix. Thus, we have
$$Y_n \le \frac1n \mathrm{Tr}\Big(\frac1n \sum_{i=1}^n d_i k_i k_i^\top + \lambda K_{11}\Big) \le \frac1n \sum_{i=1}^n d_i + \lambda,$$
where (46) was used. On the other hand, for any $j = 1, \ldots, n$, the inequality
$$Z_n = \max_{\|a\|=1} \frac1n \sum_{i=1}^n d_i (k_i^\top a)^2 + \lambda a^\top K_{11} a \ge \max_{\|a\|=1} \frac1n \sum_{i=1}^n d_i (k_i^\top a)^2 \ge \frac1n \sum_{i=1}^n d_i \big(k_i^\top k_j/\|k_j\|\big)^2 \ge \frac1n d_j \|k_j\|^2 \ge \varepsilon d_j$$
holds, where $k_j/\|k_j\|$ is substituted into $a$ and the last inequality follows from (46). Hence, we have $Z_n \ge \varepsilon \max_j d_j$. Therefore, for any $\delta > 0$, we have
$$\Pr(\kappa(H) < \delta) \le \Pr\left(\frac{\varepsilon \max_i d_i}{\frac1n \sum_{i=1}^n d_i + \lambda} < \delta\right). \qquad (47)$$
The probability concerning the numerator in (47) is given as
$$\Pr\big(\varepsilon \max_i d_i \le c_1\big) = F_n\Big(\frac{c_1}{\varepsilon}\Big), \qquad c_1 > 0.$$
For the probability concerning the denominator in (47), we use Markov's inequality:
$$\Pr\left(\Big(\frac1n \sum_{i=1}^n d_i + \lambda\Big)^{-1} \le c_2\right) = \Pr\left(\frac1n \sum_{i=1}^n d_i + \lambda \ge \frac{1}{c_2}\right) \le c_2\big(E[d_1] + \lambda\big), \qquad c_2 > 0.$$
Combining these two bounds (for positive numbers $A, B, a, b$: if $A \ge a$ and $B \ge b$, then $AB \ge ab$; as the contraposition, if $AB < ab$, then $A < a$ or $B < b$ holds), we find
$$\Pr\left(\frac{\varepsilon \max_i d_i}{\frac1n \sum_{i=1}^n d_i + \lambda} < c_1 c_2\right) \le F_n\Big(\frac{c_1}{\varepsilon}\Big) + c_2\big(E[d_1] + \lambda\big).$$
Therefore, setting $c_1 = c$ and $c_2 = \delta/c$, we have (44) for any $\delta > 0$ and $c > 0$.

We now prove the second inequality (45). Let $\tau_1$ and $\tau_n$ be the maximal and minimal eigenvalues of $K_{11}$. Since all diagonal elements of $K_{11}$ are less than or equal to one, we have $0 < \tau_1 \le \mathrm{Tr}\, K_{11} \le n$. Then, we have a lower bound of $Y_n$ and an upper bound of $Z_n$ as follows:
$$Y_n = \min_{\|a\|=1} \frac1n \sum_{i=1}^n d_i (k_i^\top a)^2 + \lambda a^\top K_{11} a \ge \lambda \tau_n,$$
$$Z_n = \max_{\|a\|=1} \frac1n \sum_{i=1}^n d_i (k_i^\top a)^2 + \lambda a^\top K_{11} a \le \frac{\max_j d_j}{n} \max_{\|a\|=1} \sum_{i=1}^n (k_i^\top a)^2 + \lambda \tau_1 = \frac{\max_j d_j}{n} \tau_1^2 + \lambda \tau_1 \le \tau_1 \max_j d_j + \lambda \tau_1,$$
where the last inequality for $Z_n$ follows from $\tau_1 \le n$. Therefore, for any $c > 0$, we have
$$\Pr\Big(\kappa(H) > \kappa(K_{11})\Big(1 + \frac{c}{\lambda}\Big)\Big) \le \Pr\left(\frac{\tau_1 \max_j d_j + \lambda \tau_1}{\lambda \tau_n} > \kappa(K_{11})\Big(1 + \frac{c}{\lambda}\Big)\right) = \Pr\big(\max_j d_j > c\big) = 1 - F_n(c).$$
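Both halves of the proof rest on deterministic eigenvalue envelopes for $H$, which can be spot-checked numerically. The R sketch below is illustrative only: the $d_i$ are drawn from a uniform law standing in for the unspecified distribution $F$, and the realized minimum kernel value serves as $\sqrt\varepsilon$.

```r
## Spot check of the deterministic bounds behind Lemma 3:
##   Z_n >= eps * max_j d_j   and   kappa(H) <= kappa(K11) * (1 + max_j d_j / lambda).
set.seed(3)
n <- 80; sigma <- 4; lambda <- 1 / n^0.9
X <- matrix(rnorm(n * 10), n, 10)
K11 <- exp(-as.matrix(dist(X))^2 / (2 * sigma^2))
d <- runif(n, 0.5, 2)                       # stand-in distribution for d_i

H  <- K11 %*% diag(d) %*% K11 / n + lambda * K11  # (1/n) sum d_i k_i k_i^T + lambda K11
ev <- function(M) eigen(M, symmetric = TRUE, only.values = TRUE)$values
eps <- min(K11)^2                           # sqrt(eps) <= (K11)_ij <= 1 on this sample

c(Zn = max(ev(H)), lower = eps * max(d))    # Z_n >= eps * max_j d_j
kH <- max(ev(H)) / min(ev(H)); kK <- max(ev(K11)) / min(ev(K11))
c(kappaH = kH, upper = kK * (1 + max(d) / lambda))  # the envelope behind (45)
```

On every draw, the reported values respect both envelopes, as the proof requires.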
In Lemma 3, the distributions of $Y_n$ and $Z_n$ are bounded separately. This idea is borrowed from the smoothed analysis of condition numbers (Sankar et al., 2006). In smoothed analysis, the probability $\Pr(\kappa(H) \ge \delta)$ is bounded above to ensure that the condition number is unlikely to be large. In the above lemma, we used the same technique also for upper-bounding a probability of the form $\Pr(\kappa(H) \le \delta)$. As a result, we obtained the lowest possible order of the condition number $\kappa(H)$.

Below, we show the proof of Theorem 5.

Proof of Theorem 5. The inequality (44) in Lemma 3 provides
$$\Pr(\kappa(H) < \delta_n) \le F_n\Big(\frac{c_n}{\varepsilon}\Big) + \frac{\delta_n}{c_n}(M + \lambda).$$
Letting $c_n = \varepsilon s_n$ and $\delta_n = o(s_n)$, we obtain
$$\lim_{n \to \infty} \Pr(\kappa(H) < \delta_n) = 0. \qquad (48)$$
We prove the other inequality. Due to the second inequality in Lemma 3, we have
$$\lim_{n \to \infty} \Pr\Big(\kappa(H) > \kappa(K_{11})\Big(1 + \frac{t_n}{\lambda}\Big)\Big) \le 1 - \lim_{n \to \infty} F_n(t_n) = 0. \qquad (49)$$
We complete the proof by combining (48) and (49).

F Proof of Theorem 6

We show the proof of Theorem 6.

Proof. Assume that $\psi''(z)$ is not a constant function. Since $K_{11}$ is non-singular, the vector $K_{11}\alpha + \frac{1}{m\lambda} K_{12} \mathbf{1}_m$ takes an arbitrary value in $\Re^n$ as $\alpha$ varies over $\Re^n$. Hence, each diagonal element of $D_{\psi,\alpha}$ can take arbitrary values in an open subset $S \subset \Re$. We consider $R^{-1} M_{\psi,R}(\alpha)(R^\top)^{-1}$ instead of $M_{\psi,R}$. Suppose that there exists a matrix $R$ such that the matrix
$$R^{-1} M_{\psi,R}(\alpha)(R^\top)^{-1} = \frac1n \mathrm{diag}(s_1, \ldots, s_n) K_{11} (R^\top)^{-1} + \lambda (R^\top)^{-1}$$
is symmetric for any $(s_1, \ldots, s_n) \in S^n$. Let $a_{ij}$ be the $(i,j)$ element of $K_{11}(R^\top)^{-1}$, and $t_{ij}$ be the $(i,j)$ element of $(R^\top)^{-1}$. Then, the $(i,j)$ and $(j,i)$ elements of $R^{-1} M_{\psi,R}(\alpha)(R^\top)^{-1}$ are equal to $\frac1n s_i a_{ij} + \lambda t_{ij}$ and $\frac1n s_j a_{ji} + \lambda t_{ji}$, respectively. Due to the assumption, the equality $\frac1n s_i a_{ij} + \lambda t_{ij} = \frac1n s_j a_{ji} + \lambda t_{ji}$ holds for any $s_i, s_j \in S$. When $i \ne j$, we obtain $a_{ij} = a_{ji} = 0$ and $t_{ij} = t_{ji}$. Thus, $K_{11}(R^\top)^{-1}$ should be equal to some diagonal matrix, and $(R^\top)^{-1}$ is a symmetric matrix. Thus, there exists a diagonal matrix $Q = \mathrm{diag}(q_1, \ldots, q_n)$ such that $K_{11} = QR$ holds. As a result, we have $(K_{11})_{ij} = q_i R_{ij}$, $(K_{11})_{ji} = q_j R_{ji}$, $R_{ij} = R_{ji}$, and $(K_{11})_{ij} = (K_{11})_{ji}$. Hence we obtain $(K_{11})_{ij} = q_i R_{ij} = q_j R_{ij}$, and then $q_i = q_j$ or $R_{ij} = 0$ holds for any $i$ and $j$. Since every element $(K_{11})_{ij}$ is non-zero, the only possibility is $q_1 = q_2 = \cdots = q_n \ne 0$. Therefore, the diagonal matrix $Q$ should be proportional to the identity matrix, and there exists a constant $c \in \Re$ such that the equality $R = c K_{11}$ holds. This equality contradicts the assumption.

References

Ali, S. M., & Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28, 131–142.

Bertsekas, D. (1996). Nonlinear programming. Athena Scientific.

Bickel, S., Brückner, M., & Scheffer, T. (2009). Discriminative learning under covariate shift. Journal of Machine Learning Research, 10, 2137–2155.
Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2, 229–318.

Demmel, J. W. (1997). Applied numerical linear algebra. Philadelphia, PA: Society for Industrial and Applied Mathematics.

Edelman, A. (1988). Eigenvalues and condition numbers of random matrices. SIAM Journal on Matrix Analysis and Applications, 9, 543–560.

Edelman, A., & Sutton, B. D. (2005). Tails of condition number distributions. SIAM Journal on Matrix Analysis and Applications, 27, 547–560.

Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. Baltimore, MD: Johns Hopkins University Press.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. J. (2006). A kernel method for the two-sample-problem. NIPS (pp. 513–520).

Härdle, W., Müller, M., Sperlich, S., & Werwatz, A. (2004). Nonparametric and semiparametric models. Springer Series in Statistics. Berlin: Springer.

Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., & Kanamori, T. (2008). Inlier-based outlier detection via direct density ratio estimation. Proceedings of the IEEE International Conference on Data Mining (ICDM2008) (pp. 223–232). Pisa, Italy.

Hodge, V., & Austin, J. (2004). A survey of outlier detection methodologies. Artificial Intelligence Review, 22, 85–126.

Horn, R., & Johnson, C. (1985). Matrix analysis. Cambridge University Press.

Huang, J., Smola, A., Gretton, A., Borgwardt, K. M., & Schölkopf, B. (2007). Correcting sample selection bias by unlabeled data. Advances in Neural Information Processing Systems 19 (pp. 601–608). Cambridge, MA: MIT Press.

Kanamori, T., Hido, S., & Sugiyama, M. (2009). A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10, 1391–1445.

Kimeldorf, G. S., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33, 82–95.

Luenberger, D., & Ye, Y. (2008). Linear and nonlinear programming. Springer.

Moré, J. J., & Sorensen, D. C. (1984). Newton's method. In G. H. Golub (Ed.), Studies in numerical analysis. Mathematical Association of America.

Nakahara, M. (2003). Geometry, topology and physics, second edition. Taylor & Francis.

Nguyen, X., Wainwright, M., & Jordan, M. (2008). Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. Advances in Neural Information Processing Systems 20 (pp. 1089–1096). Cambridge, MA: MIT Press.

R Development Core Team (2009). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Sankar, A., Spielman, D. A., & Teng, S.-H. (2006). Smoothed analysis of the condition numbers and growth factors of matrices. SIAM Journal on Matrix Analysis and Applications, 28, 446–476.

Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13, 1443–1471.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.

Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90, 227–244.
Spielman, D. A., & Teng, S.-H. (2004). Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM, 51, 385–463.

Steinwart, I. (2001). On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2, 67–93.

Sugiyama, M., Krauledat, M., & Müller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8, 985–1005.

Sugiyama, M., & Müller, K.-R. (2005). Input-dependent estimation of generalization error under covariate shift. Statistics & Decisions, 23, 249–279.

Sugiyama, M., Nakajima, S., Kashima, H., von Bünau, P., & Kawanabe, M. (2008a). Direct importance estimation with model selection and its application to covariate shift adaptation. Advances in Neural Information Processing Systems 20 (pp. 1433–1440). Cambridge, MA: MIT Press.

Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., & Kawanabe, M. (2008b). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60, 699–746.

Suzuki, T., Sugiyama, M., Sese, J., & Kanamori, T. (2008). Approximating mutual information by maximum likelihood density ratio estimation. JMLR Workshop and Conference Proceedings (pp. 5–20).

Tao, T., & Vu, V. H. (2007). The condition number of a randomly perturbed matrix. Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing (pp. 248–255). New York, NY: ACM.

Tax, D. M. J., & Duin, R. P. W. (2004). Support vector data description. Machine Learning, 54, 45–66.

Tsuboi, Y., Kashima, H., Hido, S., Bickel, S., & Sugiyama, M. (2008). Direct density ratio estimation for large-scale covariate shift adaptation. SDM (pp. 443–454).

van de Geer, S. (2000). Empirical processes in M-estimation. Cambridge University Press.

Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. Proceedings of the Twenty-First International Conference on Machine Learning. New York, NY: ACM Press.

Zeidler, E. (1986). Nonlinear functional analysis and its applications, I: Fixed-point theorems. Springer-Verlag.

Zhou, D.-X. (2002). The covering number in learning theory. Journal of Complexity, 18, 739–767.