Minimax-Optimal Bounds for Detectors Based on Estimated Prior Probabilities


Authors: Jiantao Jiao, Lin Zhang, Robert Nowak

Abstract—In many signal detection and classification problems, we have knowledge of the distribution under each hypothesis, but not the prior probabilities. This paper is aimed at providing theory to quantify the performance of detection via estimating prior probabilities from either labeled or unlabeled training data. The error or risk is considered as a function of the prior probabilities. We show that the risk function is locally Lipschitz in the vicinity of the true prior probabilities, and the error of detectors based on estimated prior probabilities depends on the behavior of the risk function in this locality. In general, we show that the error of detectors based on the Maximum Likelihood Estimate (MLE) of the prior probabilities converges to the Bayes error at a rate of $n^{-1/2}$, where $n$ is the number of training data. If the behavior of the risk function is more favorable, then detectors based on the MLE have errors converging to the corresponding Bayes errors at optimal rates of the form $n^{-(1+\alpha)/2}$, where $\alpha > 0$ is a parameter governing the behavior of the risk function, with a typical value $\alpha = 1$. The limit $\alpha \to \infty$ corresponds to a situation where the risk function is flat near the true probabilities, and thus insensitive to small errors in the MLE; in this case the error of the detector based on the MLE converges to the Bayes error exponentially fast with $n$. We show that the bounds are achievable with either labeled or unlabeled training data and are minimax-optimal in the labeled case.

Index Terms—Detector, minimax optimality, maximum likelihood estimate (MLE), prior probability, statistical learning theory

(J. Jiao and L. Zhang are with the Department of Electronic Engineering, Tsinghua University, Beijing, 100084 China; e-mail: xajjt1990@gmail.com, linzhang@tsinghua.edu.cn. R. Nowak is with the Department of Electrical and Computer Engineering, University of Wisconsin, Madison, WI 53706 USA; e-mail: nowak@ece.wisc.edu.)

I. INTRODUCTION

In many signal detection and classification problems the conditional distribution under each hypothesis is known, but the prior probabilities are unknown. For example, we may have a good model for the symptoms of a certain disease, but might not know how prevalent the disease is. There are two ways to proceed:

1) Neyman-Pearson detectors;
2) estimating the prior probabilities from training data.

Neyman-Pearson detectors are designed to control one type of error while minimizing the other. Detectors based on estimating prior probabilities aim to achieve the performance of the Bayes detector (see, e.g., Devroye, Gyorfi, and Lugosi [1]). We study this second approach and provide theory to quantify the performance of detectors based on estimating prior probabilities from training data. We focus on simple binary hypotheses and minimum probability of error detection, but the theory and methods can be extended to handle other error criteria that weight different error types, and to $m$-ary detection problems. This problem can be viewed as a special case of the classification problem in machine learning in which we have knowledge of the density under each hypothesis.
These conditional densities are called the class-conditional densities in the parlance of machine learning, and we will use this terminology here. Detectors based on "plugging in" the Maximum Likelihood Estimate (MLE) of the prior probabilities are simply a special case of the well-known plug-in approach in statistical learning theory. We use this connection to develop upper and lower bounds on the performance of detectors based on the MLE of the prior probabilities.

Let us first introduce some notation for the problem. Let $X \in \mathbb{R}^d$ denote a signal and consider the binary hypothesis testing problem

$$H_0: X \sim p_0, \qquad H_1: X \sim p_1,$$

where $p_0$ and $p_1$ are known probability densities on $\mathbb{R}^d$. Let $Y$ be a binary random variable indicating which hypothesis $X$ follows, and define $q := P(Y = 1)$, the probability that hypothesis $H_1$ is true. The Bayes detector is defined by the likelihood ratio test

$$\frac{p_1(X)}{p_0(X)} \; \underset{H_0}{\overset{H_1}{\gtrless}} \; \frac{1-q}{q},$$

and it minimizes the probability of error. Let $\Lambda(x) := p_1(x)/p_0(x)$ and define the regression function $\eta(x)$:

$$\eta(x) := P(Y = 1 \mid X = x) = \frac{q\,p_1(x)}{(1-q)\,p_0(x) + q\,p_1(x)};$$

then the Bayes detector can be expressed as $f^*(x) = \mathbf{1}\{\eta(x) \ge 1/2\}$. Note that $\eta(x)$ is parameterized by the prior probability $q$.

Let us consider the probability of error, or risk, as a function of this parameter. For any feasible prior probability $q'$, let $R(q')$ denote the risk (probability of error) incurred by using $q'$ in place of $q$. The value $q$ defined above produces the minimum risk. The difference $R(q') - R(q)$ quantifies the suboptimality of $q'$. The quantity $R(q')$ can be expressed as

$$R(q') = q\,P_1(q') + (1-q)\,P_0(q'),$$

where

$$P_1(q') := P\big(\Lambda(X) < (1-q')/q' \mid H_1\big) = \int \mathbf{1}\{\Lambda(x) < (1-q')/q'\}\, p_1(x)\,dx,$$
$$P_0(q') := P\big(\Lambda(X) \ge (1-q')/q' \mid H_0\big) = \int \mathbf{1}\{\Lambda(x) \ge (1-q')/q'\}\, p_0(x)\,dx.$$

Assume there is a joint distribution $\pi = \pi_{XY}$ over the signal $X$ and label $Y$. This distribution determines both the class-conditional densities (by conditioning on $Y = 0$ or $Y = 1$) and the prior probabilities (by marginalizing over $X$). Suppose we have $n$ training data distributed independently and identically according to $\pi$. We consider both "labeled" data $\{(X_i, Y_i)\}_{i=1}^n$ and "unlabeled" data $\{X_i\}_{i=1}^n$, and use them to estimate the unknown prior probability $q$. Let $\widehat{q}$ denote the MLE of $q$ based on the training data; the risk of the detector based on $\widehat{q}$ is $R(\widehat{q})$. Note that $R(\widehat{q})$ is a random variable and is greater than or equal to $R(q)$. The goal of this paper is to bound the difference $E[R(\widehat{q})] - R(q)$, where $E$ is the expectation operator, and to provide lower bounds on the performance of any detector derived from knowledge of the class-conditional densities and the training data. The difference $E[R(\widehat{q})] - R(q)$ is usually called the excess risk or regret, and it is a function of $n$.
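To make the plug-in construction concrete, here is a minimal numerical sketch (not from the paper; the Gaussian class-conditional densities, the prior $q = 0.3$, and all other values are illustrative assumptions). It builds the detector $\mathbf{1}\{\widehat{\eta}(x;\widehat{q}) \ge 1/2\}$ from the known densities and an estimated prior, and compares its Monte Carlo error probability with that of the Bayes detector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative class-conditional densities (assumptions, not from the paper):
# p0 = N(0,1), p1 = N(2,1), true prior q = P(Y=1) = 0.3.
q_true = 0.3
p0 = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
p1 = lambda x: np.exp(-0.5 * (x - 2.0)**2) / np.sqrt(2 * np.pi)

def eta(x, q):
    # Regression function eta(x) = q p1(x) / ((1-q) p0(x) + q p1(x)).
    num = q * p1(x)
    return num / ((1 - q) * p0(x) + num)

def plug_in_detect(x, q):
    # Plug-in detector f(x) = 1{eta(x; q) >= 1/2}.
    return (eta(x, q) >= 0.5).astype(int)

def risk(q_plug, n_mc=200_000):
    # Monte Carlo estimate of R(q_plug) under the true prior q_true.
    y = (rng.random(n_mc) < q_true).astype(int)
    x = np.where(y == 1, rng.normal(2.0, 1.0, n_mc), rng.normal(0.0, 1.0, n_mc))
    return float(np.mean(plug_in_detect(x, q_plug) != y))

# Labeled training data: the MLE of q is the empirical frequency of Y = 1.
n = 100
q_hat = (rng.random(n) < q_true).mean()
print(f"q_hat={q_hat:.3f}  R(q_hat)~{risk(q_hat):.4f}  R(q)~{risk(q_true):.4f}")
```

Even with $n = 100$ labeled samples, the plug-in risk in such a smooth example is typically already close to the Bayes risk; the bounds below quantify this behavior.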
Statistical learning theory is typically concerned with the construction of estimators based on labeled training data without prior knowledge of the class-conditional densities. There are two common approaches: plug-in rules and empirical risk minimization (ERM) rules (see, e.g., Devroye, Gyorfi, and Lugosi [1] and Vapnik [2]). Statistical properties of these two types of classifiers, as well as of other related ones, have been extensively studied (see Aizerman, Braverman, and Rozonoer [3], Vapnik and Chervonenkis [4], Vapnik [2], [5], Breiman, Friedman, Olshen, and Stone [6], Devroye, Gyorfi, and Lugosi [1], Anthony and Bartlett [7], Cristianini and Shawe-Taylor [8], Scholkopf and Smola [9], and the references therein). Results in the literature concerning the convergence of the excess risk are of the form

$$E[R(\widehat{f}_n)] - R(f^*) = O(n^{-\beta}),$$

where $\beta > 0$ is some exponent, and typically $\beta \le 1/2$ if $R(f^*) \ne 0$. Here $\widehat{f}_n$ denotes the nonparametric estimator of the classifier and $f^*$ denotes the Bayes classifier. Mammen and Tsybakov [10] first showed that one can attain fast rates, approaching $n^{-1}$; for further results about fast rates see Koltchinskii [11], Steinwart and Scovel [12], Tsybakov and van de Geer [13], Massart [14], and Catoni [15]. The behavior of the regression function $\eta$ around the boundary $\partial G^* = \{x : \eta(x) = 1/2\}$ has an important effect on the convergence of the excess risk, which was discussed earlier under different assumptions by Devroye, Gyorfi, and Lugosi [1] and Horvath and Lugosi [16]. In this paper we consider the "margin assumption" introduced in Tsybakov [17]. Audibert and Tsybakov [18] showed that there exist plug-in rules converging at super-fast rates, that is, faster than $n^{-1}$, under the margin assumption of Tsybakov [17]. Our setting can be viewed as a special case of the plug-in rule, and we take advantage of Lemma 3.1 of [18].

Our main results can be summarized as follows. Whether given labeled or unlabeled data, we show that the excess risk converges and we deduce the rate of this convergence. The convergence rate depends on the local behavior of the function $R(\widehat{q})$ near $q$, which is determined by the behavior of $\eta(x)$ in the vicinity of $\eta(x) = 1/2$. In general, $R$ is locally Lipschitz at $q$, and the convergence rate is proportional to $n^{-1/2}$. If $R$ is smoother/flatter at $q$, then the convergence rate can be much faster, taking the form $n^{-(1+\alpha)/2}$, where $\alpha > 0$ is a parameter reflecting the smoothness of $R$ at $q$. The value $\alpha = 1$ is typical, and in that case we actually have an $n^{-1}$ convergence rate under mild conditions. The limit $\alpha \to \infty$ corresponds to a situation where the risk function is flat near the true probabilities, and thus insensitive to small errors in the estimate of the prior probabilities; in this case the error of the detector based on the MLE converges to the Bayes error exponentially fast with $n$. We also show that the convergence rates are minimax-optimal given labeled data. Fig. 1 depicts three cases illustrating the smoothness conditions and corresponding $\eta(x)$ considered in the paper.

[Fig. 1. Examples of $R(\widehat{q})$ and corresponding $\eta(x)$ leading to different convergence rates: (a) difficult case, (b) moderate case, (c) best case.]

The paper is organized as follows. In Sections II and III we discuss the minimax lower bounds and the upper bounds achieved by the MLE with labeled data. Section IV discusses convergence rates when only unlabeled training data are available. Section V compares our results with those in standard passive learning and makes final remarks.
II. CONVERGENCE RATES IN THE GENERAL CASE WITH LABELED DATA

This section discusses the convergence rates of the proposed detector trained with labeled data, without any additional assumptions. Let $\widehat{q}$ be the MLE of $q$, i.e.,

$$\widehat{q} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{Y_i = 1\},$$

and define $\mathcal{P} := \{(p_1, p_0, q)\}$, where $p_1, p_0$ are class-conditional densities and $q$ is the prior probability. We first set up a minimax lower bound.

A. Minimax Lower Bound

Theorem 1. There exists a constant $c > 0$ such that

$$\inf_{\widehat{q}} \sup_{\mathcal{P}} E[R(\widehat{q})] - R(q) \ge c\, n^{-1/2},$$

where $\sup_{\mathcal{P}}$ takes the supremum over all possible triples $(p_1, p_0, q)$ and $\inf_{\widehat{q}}$ denotes the infimum over all possible estimators of $q$ derived from $n$ samples of training data with prior knowledge of the class-conditional densities.

Theorem 1 can be viewed as a corollary of Theorem 3 (given in the following section) if we take $\alpha = 0$ and remove the constraints on $p_1(x)$, $p_0(x)$, and $q$ in Theorem 3.

B. Upper Bound

Theorem 2. If $\widehat{q}$ is the MLE of $q$, we have

$$\sup_{\mathcal{P}} E[R(\widehat{q})] - R(q) \le \tfrac{1}{2}\, n^{-1/2}.$$

Proof: Define the parameterized risk function

$$\widehat{R}(q_1; q_2) := q_2\, P_1(q_1) + (1 - q_2)\, P_0(q_1).$$

Following the argument showing that $q = \arg\min_{q'} R(q')$, we know $\widehat{q} = \arg\min_{q'} \widehat{R}(q'; \widehat{q})$. Since $E[\widehat{q}] = q$ and $P_1(q)$, $P_0(q)$ are deterministic, we have $E[\widehat{R}(q; \widehat{q})] = R(q)$, so the excess risk can be expressed as

$$E[R(\widehat{q}) - R(q)] = E[\widehat{R}(\widehat{q}; q) - \widehat{R}(q; \widehat{q})] \le E[\widehat{R}(\widehat{q}; q) - \widehat{R}(\widehat{q}; \widehat{q})].$$

Writing $\widehat{R}(\widehat{q}; q) - \widehat{R}(\widehat{q}; \widehat{q})$ explicitly,

$$\widehat{R}(\widehat{q}; q) - \widehat{R}(\widehat{q}; \widehat{q}) = (q - \widehat{q})\big(P_1(\widehat{q}) - P_0(\widehat{q})\big),$$

we thus have

$$E[R(\widehat{q}) - R(q)] \le E\big[(q - \widehat{q})(P_1(\widehat{q}) - P_0(\widehat{q}))\big] \le E[|q - \widehat{q}|] \le \sqrt{E[(q - \widehat{q})^2]} = \sqrt{\frac{q(1-q)}{n}} \le \tfrac{1}{2}\, n^{-1/2},$$

which completes the proof of Theorem 2.

Remark 1. The general results in this section also apply when $p_i(x)$, $i = 0, 1$, are probability mass functions (pmfs). In this case we can write $p_i(x)$, $i = 0, 1$, as a sum of weighted Dirac delta functions, i.e.,

$$p_i(x) = \sum_j w_{i,j}\,\delta(x - x_j),$$

and all of the arguments above hold.
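As an illustrative check of Theorem 2 (a simulation sketch, not part of the paper; the Gaussian model $p_0 = \mathcal{N}(0,1)$, $p_1 = \mathcal{N}(2,1)$ and the prior are assumed for concreteness), the excess risk $E[R(\widehat{q})] - R(q)$ can be averaged over many draws of the labeled training set and compared with the bound $\tfrac{1}{2} n^{-1/2}$.

```python
import numpy as np
from math import erf, log, sqrt

rng = np.random.default_rng(1)
Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF

# Assumed model (illustrative): p0 = N(0,1), p1 = N(2,1), true prior q.
q_true = 0.3

def risk(q_plug):
    # For these Gaussians the plug-in test is "decide H1 iff x >= x_star" with
    # x_star = 1 + 0.5*log((1-q_plug)/q_plug); the risk is evaluated under q_true.
    x_star = 1.0 + 0.5 * log((1.0 - q_plug) / q_plug)
    P1 = Phi(x_star - 2.0)        # miss probability under H1
    P0 = 1.0 - Phi(x_star)        # false-alarm probability under H0
    return q_true * P1 + (1.0 - q_true) * P0

bayes = risk(q_true)
for n in (100, 1_000, 10_000):
    q_hats = rng.binomial(n, q_true, size=5_000) / n   # MLE from labeled data
    excess = np.mean([risk(max(min(q, 1 - 1e-6), 1e-6)) for q in q_hats]) - bayes
    print(f"n={n:6d}  excess risk ~ {excess:.2e}  bound 0.5/sqrt(n) = {0.5/np.sqrt(n):.2e}")
```

For this smooth Gaussian example the observed decay is in fact closer to $n^{-1}$, anticipating the faster rates of Section III (the margin parameter is $\alpha = 1$ here).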
III. FASTER CONVERGENCE RATES WITH LABELED DATA

In Sections III and IV, without loss of generality, we assume the true prior probability $q$ lies in the closed interval $[\theta, 1-\theta]$, where $\theta$ is an arbitrarily small positive real number. The reason why we need this assumption is explained in Section III-A. Define the trimmed MLE of $q$ as

$$\widehat{q} := \arg\max_{q \in [\theta, 1-\theta]} \; q^{\sum_{i=1}^n Y_i}\, (1-q)^{\sum_{i=1}^n (1 - Y_i)},$$

and construct the regression function estimator $\widehat{\eta}_n(x; \widehat{q})$ as

$$\widehat{\eta}_n(x; \widehat{q}) = \frac{\widehat{q}\, p_1(x)}{(1-\widehat{q})\, p_0(x) + \widehat{q}\, p_1(x)}.$$

The accuracy of $\widehat{\eta}_n(x; \widehat{q})$ is closely related to that of estimating $q$ from the $n$ training data. We state a lemma describing the Lipschitz property of $\widehat{\eta}_n(x; \widehat{q})$ as a function of $\widehat{q}$.

Lemma 1. The regression function estimator $\widehat{\eta}_n(x)$ satisfies the Lipschitz property as a function of $\widehat{q}$: for all $q_1, q_2 \in [\theta, 1-\theta]$,

$$\sup_{x \in \mathbb{R}^d} |\widehat{\eta}_n(x; q_1) - \widehat{\eta}_n(x; q_2)| \le L\, |q_1 - q_2|,$$

where $L = 1/(4\theta(1-\theta))$.

Proof: Denote $f(t, x) = t\,p_1(x)/(t\,p_1(x) + (1-t)\,p_0(x))$; we are interested in the partial derivative of $f$ with respect to $t$:

$$\frac{\partial f}{\partial t} = \frac{p_0\, p_1}{(t\,p_1(x) + (1-t)\,p_0(x))^2} \ge 0.$$

Since $t \in [\theta, 1-\theta]$, we have

$$\frac{p_0\,p_1}{(t\,p_1(x) + (1-t)\,p_0(x))^2} \le \frac{p_0\,p_1}{\big(2\sqrt{t(1-t)\,p_1 p_0}\big)^2} \le \frac{1}{4\theta(1-\theta)},$$

and thus for all $q_1, q_2 \in [\theta, 1-\theta]$,

$$\sup_{x \in \mathbb{R}^d} |\widehat{\eta}_n(x; q_1) - \widehat{\eta}_n(x; q_2)| \le L\,|q_1 - q_2|,$$

where $L = 1/(4\theta(1-\theta)) \ge 1$.

Remark 2. On the decision boundary we have $q\,p_1(x) = (1-q)\,p_0(x)$, which makes the inequality in the proof of Lemma 1 hold with equality; thus the Lipschitz constant $L$ cannot be further improved.

A. Polynomial Rates

Tsybakov [17] introduced a parameterized margin assumption, denoted Assumption (MA): there exist constants $C_0 > 0$, $c > 0$, and $\alpha \ge 0$ such that, when $\alpha < \infty$,

$$P_X\big(0 < |\eta(X) - \tfrac{1}{2}| \le t\big) \le C_0\, t^{\alpha} \quad \forall\, t > 0,$$

and when $\alpha = \infty$,

$$P_X\big(0 < |\eta(X) - \tfrac{1}{2}| \le c\big) = 0.$$

Denote

$$\mathcal{P}_{\theta,\alpha} := \{(p_1, p_0, q) : \text{Assumption (MA) is satisfied with parameter } \alpha \text{ and } q \in [\theta, 1-\theta]\}.$$

The case $\alpha = 0$ is trivial (no margin assumption) and is the case explored in Section II. If $d = 1$ and the decision boundary reduces, for example, to one point $x_0$, Assumption (MA) may be interpreted as $\eta(x) - \tfrac{1}{2} \sim (x - x_0)^{1/\alpha}$ for $x$ close to $x_0$. This interpretation sheds light on the fact that $\alpha = 1$ is typical. If $\eta(x)$ is differentiable with a non-zero first-order derivative at $x = x_0$, then the first-order approximation of $\eta(x)$ in the neighbourhood of $x_0$ exists, which means $\alpha = 1$. When $\eta(x)$ is smoother, for example if the first-order derivative vanishes at $x = x_0$ but the second-order derivative does not, then $\alpha = 1/2$. When $\eta(x)$ is not differentiable at $x = x_0$ we may have $\alpha > 1$; for example, when $\alpha = 2$ the derivative of $\eta(x)$ at $x = x_0$ goes to infinity. Distributions $\pi_{XY}$ satisfying Assumption (MA) with larger $\alpha$ have more drastic changes near the boundary $\eta(x) = 1/2$, which makes $R(\widehat{q})$ less sensitive to small errors, leading to faster rates. The $R(\widehat{q})$ and corresponding $\eta(x)$ for the typical value $\alpha = 1$ in Assumption (MA) are shown in Fig. 1(b).

We explain why we need to bound the domain of $q$ by showing what determines $C_0$ in Assumption (MA). Consider the typical case $\alpha = 1$, $d = 1$, and compute the derivative of $\eta(x)$ with respect to $x$ at the point $x = x_0$ to which the decision boundary reduces:

$$\eta'(x_0) = q(1-q)\,\frac{p_1'(x_0)\,p_0(x_0) - p_0'(x_0)\,p_1(x_0)}{\big(q\,p_1(x_0) + (1-q)\,p_0(x_0)\big)^2} \;\propto\; q(1-q).$$

Without loss of generality, suppose the marginal distribution of $X$ is uniform. Since the first-order approximation of $\eta(x)$ is $\Delta\eta(x) \approx \eta'(x_0)\,\Delta x$, we know

$$P_X\big(0 < |\eta(X) - \tfrac{1}{2}| \le t\big) \;\propto\; \frac{t}{\eta'(x_0)} \;\propto\; \frac{t}{q(1-q)}.$$

Then we can see that if $q$ goes to zero or one, the constant $C_0$ approaches infinity, which illustrates why we assume $q \in [\theta, 1-\theta]$, $\theta > 0$, at the beginning of Section III.

Assumption (MA) provides a useful characterization of the behavior of the regression function $\eta(x)$ in the vicinity of the level $\eta(x) = 1/2$, which turns out to be crucial in determining convergence rates.
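The margin exponent can also be estimated empirically. The sketch below (illustrative only, not from the paper; it reuses the assumed Gaussian model from the earlier snippets) approximates $P_X(0 < |\eta(X) - 1/2| \le t)$ by Monte Carlo over a range of $t$ and reads off the exponent $\alpha$ as a log-log slope.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed model (illustrative): p0 = N(0,1), p1 = N(2,1), q = 0.3.
q = 0.3
p0 = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
p1 = lambda x: np.exp(-0.5 * (x - 2.0)**2) / np.sqrt(2 * np.pi)
eta = lambda x: q * p1(x) / ((1 - q) * p0(x) + q * p1(x))

# Draw X from the marginal q*p1 + (1-q)*p0 and estimate the margin probability.
n_mc = 2_000_000
y = rng.random(n_mc) < q
x = np.where(y, rng.normal(2.0, 1.0, n_mc), rng.normal(0.0, 1.0, n_mc))
margin = np.abs(eta(x) - 0.5)

ts = np.array([0.02, 0.05, 0.1, 0.2])
probs = np.array([(margin <= t).mean() for t in ts])
slope = np.polyfit(np.log(ts), np.log(probs), 1)[0]
print("P(|eta - 1/2| <= t):", probs, " estimated alpha ~", round(slope, 2))
```

Since $\eta$ is differentiable with a non-vanishing derivative at the decision boundary in this example, the fitted slope comes out close to the typical value $\alpha = 1$.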
We first state a minimax lower bound under Assumption (MA).

Theorem 3. There exists a constant $c > 0$ such that

$$\inf_{\widehat{q}} \sup_{\mathcal{P}_{\theta,\alpha}} E[R(\widehat{q})] - R(q) \ge c\, n^{-(1+\alpha)/2}.$$

The proof is given in Appendix A. It follows the general minimax analysis strategy but is a non-trivial result.

Next we show that $n^{-(1+\alpha)/2}$ is also an upper bound. We introduce Lemma 3.1 of Audibert and Tsybakov [18], rephrased as follows.

Lemma 2. Let $\widehat{\eta}_n$ be an estimator of the regression function $\eta$ and $\mathcal{P}$ a set of distributions $\pi_{XY}$ satisfying the margin assumption (MA). Suppose there exist constants $C_1 > 0$, $C_2 > 0$ and a positive sequence $a_n$ such that, for all $n \ge 1$, any $\delta > 0$, and almost all $x$ with respect to $P_X$,

$$\sup_{P \in \mathcal{P}} P\big(|\widehat{\eta}_n(x) - \eta(x)| \ge \delta\big) \le C_1\, e^{-C_2 a_n \delta^2}.$$

Then the plug-in detector $\widehat{f}_n = \mathbf{1}\{\widehat{\eta}_n \ge 1/2\}$ satisfies

$$\sup_{P \in \mathcal{P}} E[R(\widehat{f}_n)] - R(f^*) \le C\, a_n^{-(1+\alpha)/2}$$

for $n \ge 1$, with some constant $C > 0$ depending only on $\alpha$, $C_0$, $C_1$, and $C_2$, where $f^*$ denotes the Bayes detector.

Remark 3. Following the proof of Lemma 2, the constant $C$ increases as $C_1$ increases, as the constant $C_0$ in Assumption (MA) increases, and as the constant $C_2$ decreases.

Theorem 4. If $\widehat{q}$ is the trimmed MLE of $q$, there exists a constant $C > 0$ such that

$$\sup_{\mathcal{P}_{\theta,\alpha}} E[R(\widehat{q})] - R(q) \le C\, n^{-(1+\alpha)/2}.$$

Proof: According to Lemma 1, we have

$$\sup_{x \in \mathbb{R}^d,\; \widehat{q}, q \in [\theta, 1-\theta]} |\widehat{\eta}_n(x; \widehat{q}) - \widehat{\eta}_n(x; q)| \le L\,|\widehat{q} - q|.$$

Combining this with Hoeffding's inequality, we have

$$\sup_{\mathcal{P}_{\theta,\alpha}} P\big(|\widehat{\eta}_n(x) - \eta(x)| \ge \delta\big) \le \sup_{\mathcal{P}_{\theta,\alpha}} P\big(|\widehat{q} - q| \ge \delta/L\big) \le 2\, e^{-2 n \delta^2 / L^2},$$

where $L > 0$ is the constant in Lemma 1. The inequality above shows we can take $C_1 = 2$, $C_2 = 2/L^2$, and $a_n = n$ in Lemma 2. According to Lemma 2,

$$\sup_{\mathcal{P}_{\theta,\alpha}} E[R(\widehat{q})] - R(q) \le C\, n^{-(1+\alpha)/2}.$$

Remark 4. Consider the typical case $\alpha = 1$. The optimal rate here is $n^{-1}$, which is faster than the naive worst-case rate $n^{-1/2}$ shown in Section II and the optimal rate in standard passive learning, $n^{-2/(2+\rho)}$, $\rho > 0$, shown in Audibert and Tsybakov [18].

Remark 5. Consider the case where the true prior probability $q$ lies near zero or one. This makes the constant $C_0$ in Assumption (MA) go to infinity, as shown in the discussion following Assumption (MA), and the constant $C_2$ go to zero, as shown in the proof of Theorem 4, both of which slow down the convergence of the excess risk.

B. Exponential Rates

We investigate the convergence rates when $\alpha = \infty$ in Assumption (MA). Intuitively, as $\alpha$ grows the rates can become faster than any polynomial rate of fixed degree, as shown in Theorem 4.

Theorem 5. If $\widehat{q}$ is the trimmed MLE defined above, then under Assumption (MA) with $\alpha = \infty$ we have

$$\sup_{\mathcal{P}_{\theta,\infty}} E[R(\widehat{q})] - R(q) \le 2\, e^{-2 n c^2 / L^2},$$

where $c$ is the positive constant in Assumption (MA) and $L$ is the constant in Lemma 1.

Proof: According to Lemma 1, as long as $|\widehat{q} - q| \le c/L$ and $\widehat{q} \in [\theta, 1-\theta]$, the error of the regression function estimator is bounded uniformly by $c$, incurring no error in detection according to Assumption (MA) with $\alpha = \infty$. Mathematically,

$$R(\widehat{q}) = R(q), \qquad \forall\, \widehat{q} \in [q - c/L,\, q + c/L] \cap [\theta, 1-\theta].$$

Then we write the excess risk as

$$\left( \int_{|\widehat{q} - q| \ge \delta} + \int_{|\widehat{q} - q| \le \delta} \right) [R(\widehat{q}) - R(q)]\, dP,$$

where $P$ is the probability measure on the sample space $\Omega$ of $\{(X_i, Y_i)\}_{i=1}^n$. Taking $\delta = c/L$, the second term vanishes. Since $R(\widehat{q}) - R(q) \le 1$, Chernoff's bound shows the first term is bounded by $2\, e^{-2 n \delta^2}$, so we conclude

$$\sup_{\mathcal{P}_{\theta,\infty}} E[R(\widehat{q})] - R(q) \le 2\, e^{-2 n c^2 / L^2}.$$
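For intuition about the exponential regime, the following sketch (an illustrative finite-alphabet example with assumed pmfs, not from the paper) computes the margin constant $c$, the Lipschitz constant $L$ from Lemma 1, and the resulting bound $2e^{-2nc^2/L^2}$ of Theorem 5.

```python
import numpy as np

# Assumed pmfs on a finite alphabet {0, 1, 2} (illustrative values).
p0 = np.array([0.6, 0.3, 0.1])
p1 = np.array([0.1, 0.3, 0.6])
q, theta = 0.4, 0.1                        # true prior and trimming parameter

eta = q * p1 / ((1 - q) * p0 + q * p1)     # regression function on the alphabet
gaps = np.abs(eta - 0.5)
c = gaps[gaps > 0].min()                   # margin constant in Assumption (MA), alpha = infinity
L = 1.0 / (4 * theta * (1 - theta))        # Lipschitz constant from Lemma 1

for n in (1_000, 5_000, 20_000):
    bound = 2 * np.exp(-2 * n * c**2 / L**2)
    print(f"n={n:6d}  c={c:.3f}  L={L:.2f}  Theorem-5 bound = {bound:.3e}")
```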
Remark 6. When $p_i(x)$, $i = 0, 1$, are probability mass functions and $x$ takes values in a set $\mathcal{X}$ with $\#\{\mathcal{X}\} < \infty$, then

$$\inf_{x_i \in \mathcal{X},\, \eta(x_i) \ne 1/2} |\eta(x_i) - 1/2| \ge c > 0,$$

which means there exists a constant $c > 0$ such that $P(0 < |\eta(X) - 1/2| \le c) = 0$. Based on the discussion above, an exponential convergence rate is therefore always guaranteed when $x$ lies in a discrete, finite domain. If $\#\{\mathcal{X}\}$ is infinite, we may have a finite $\alpha > 0$ with optimal convergence rate $n^{-(1+\alpha)/2}$. However, finite $\#\{\mathcal{X}\}$ is the case that often arises in practice.

IV. CONVERGENCE RATES WITH UNLABELED DATA

In this section we discuss convergence rates when only unlabeled training data are available. Relatively speaking, unlabeled data are easier to obtain in practice than labeled data, so convergence rate analysis in this case deserves attention. It also helps reveal how much information is stored in $\{X_i\}_{i=1}^n$ within the training pairs $\{(X_i, Y_i)\}_{i=1}^n$.

In this case we face a classical parameter estimation problem. Given $X_1, \dots, X_n \overset{\text{iid}}{\sim} q\,p_1(x) + (1-q)\,p_0(x)$, we want to construct an estimator $\widehat{q}$ of $q$ that is as efficient as possible. Here we use the MLE and derive upper bounds under Assumption (MA). Before starting the proof, we introduce a standard quantity measuring the distance between probability measures.

Definition 1. The total variation distance between two probability density functions $p, q$ is defined as

$$V(p, q) = \sup_A \left| \int_A (p - q)\, d\nu \right| = 1 - \int \min(p, q)\, d\nu,$$

where $\nu$ denotes Lebesgue measure on the signal space $\mathbb{R}^d$ and $A$ is any subset of the domain.

We will quantify our results in terms of the total variation distance. Here we assume $V(p_1, p_0) \ge V_{\min} > 0$, ensuring that the two class-conditional densities are not "too" indiscernible, so that it is possible to learn the prior probability $q$ from unlabeled data. For details about how this assumption works, please see Appendix B. Define a class of triples

$$\mathcal{P}_{\theta,\alpha,V_{\min}} := \{(p_1, p_0, q) : \text{Assumption (MA) satisfied with parameter } \alpha,\; q \in [\theta, 1-\theta],\; V(p_1, p_0) \ge V_{\min} > 0\},$$

and define the trimmed MLE $\widehat{q}$ in this case as

$$\widehat{q} := \arg\max_{q \in [\theta, 1-\theta]} \sum_{i=1}^{n} \log\big(q\,p_1(x_i) + (1-q)\,p_0(x_i)\big).$$

We set up an upper bound on the performance of the trimmed MLE $\widehat{q}$.

Theorem 6. If $\widehat{q}$ is the trimmed MLE defined above, there exists a constant $C > 0$ such that

$$\sup_{\mathcal{P}_{\theta,\alpha,V_{\min}}} E[R(\widehat{q})] - R(q) \le C\, n^{-(1+\alpha)/2}.$$

The proof of Theorem 6 is given in Appendix B.

Remark 7. The calculation of the MLE is a convex optimization problem, for which efficient methods exist.

Remark 8. Compared to learning detectors based on labeled data, we sacrifice a constant factor in the convergence rate when given unlabeled data. Given the true prior probability $q$, when $V(p_1, p_0)$ is smaller, the constant $C_2$ in Lemma 2 becomes smaller, which slows down the convergence of the excess risk. This phenomenon is discussed in the proof of Theorem 6.
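As noted in Remark 7, computing the unlabeled-data MLE is a one-dimensional concave maximization over $[\theta, 1-\theta]$. The sketch below (illustrative, with assumed Gaussian class-conditional densities; not from the paper) simply evaluates the log-likelihood on a fine grid, though any scalar convex-optimization routine would serve.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed model (illustrative): p0 = N(0,1), p1 = N(2,1), true prior q = 0.3.
q_true, theta = 0.3, 0.01
p0 = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
p1 = lambda x: np.exp(-0.5 * (x - 2.0)**2) / np.sqrt(2 * np.pi)

# Unlabeled sample X_1, ..., X_n from the mixture q*p1 + (1-q)*p0.
n = 2_000
y = rng.random(n) < q_true
x = np.where(y, rng.normal(2.0, 1.0, n), rng.normal(0.0, 1.0, n))

def log_lik(q):
    # Log-likelihood of the mixture; concave in q, so a 1-D solver or grid suffices.
    return np.sum(np.log(q * p1(x) + (1 - q) * p0(x)))

grid = np.linspace(theta, 1 - theta, 2_000)          # trimmed range [theta, 1-theta]
q_hat = grid[np.argmax([log_lik(q) for q in grid])]  # trimmed MLE from unlabeled data
print(f"true q = {q_true},  unlabeled-data MLE q_hat = {q_hat:.3f}")
```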
V. FINAL REMARKS

This paper presents a convergence rate analysis for detectors constructed using known class-conditional densities and prior probabilities estimated via the MLE. All of the bounds are dimension-free. The bounds are minimax-optimal given labeled data and achievable whether given labeled or unlabeled data. It remains an interesting open question to show that the rate $n^{-(\alpha+1)/2}$ is minimax-optimal given unlabeled data under Assumption (MA) and the extra assumption on $V(p_1, p_0)$, or to establish the same upper bound on the convergence rate in the unlabeled case without the extra assumption on $V(p_1, p_0)$ of Section IV. We show that the constant factors in the convergence rates are mainly influenced by two elements:

1) the value of the true prior probability;
2) in the unlabeled data case, $V(p_1, p_0)$.

A prior probability near zero or one leads to slower convergence whether given labeled or unlabeled data; in the unlabeled data case, a smaller $V(p_1, p_0)$ leads to slower convergence.

Our results are analogous to those of general classification in statistical learning. Intuitively, learning the class-conditional densities is the main challenge in standard passive learning, and it is sensible to say that knowing the class-conditional densities makes the problem relatively easy. The following quantitative comparison supports this. We take the fastest rate shown for standard passive learning under Assumption (MA) in Audibert and Tsybakov [18] and compare it with our result in Table I.

TABLE I
CONVERGENCE RATE COMPARISON UNDER ASSUMPTION (MA)

Passive learning ($p_1, p_0$ unknown): $n^{-\frac{\alpha+1}{2+\rho}}$
Passive learning ($p_1, p_0$ known): $n^{-\frac{\alpha+1}{2}}$

Here $\rho = d/\beta > 0$, where $\beta$ is the Hölder exponent of $\eta(x)$. The rate $n^{-(\alpha+1)/(2+\rho)}$ is obtained under the additional strong assumption that the marginal distribution of $X$ is bounded from below and above, which is not necessary here. The factor $\rho$ reflects the price paid for not knowing the class-conditional densities, and it is directly related to the complexity of nonparametrically learning the density functions.

VI. ACKNOWLEDGEMENT

The authors thank the reviewers for their helpful comments, especially for raising the question of minimax optimality in the unlabeled data case.

APPENDIX A
PROOF OF THEOREM 3

The proof strategy follows the standard minimax analysis introduced in Tsybakov [19] and consists in reducing the classification problem to a hypothesis testing problem. In this case it suffices to consider two hypotheses. Here we must pay extra attention to the design of the hypotheses, because we have access to the class-conditional densities, which puts an extra constraint on the hypothesis design. We rephrase a bound from Tsybakov [19].

Lemma 3. Denote by $\mathcal{P}$ the class of joint distributions represented by triples $(p_1, p_0, q)$, where $(p_1, p_0)$ are class-conditional densities and $q$ is the prior probability. Associated with each element $(p_1, p_0, q) \in \mathcal{P}$ we have a probability measure $\pi_{XY}$ defined on $\mathbb{R}^d \times \{0, 1\}$. Let $d(\cdot, \cdot) : \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ be a semidistance. Let $(p_1, p_0, q_0), (p_1, p_0, q_1) \in \mathcal{P}$ be such that $d((p_1, p_0, q_0), (p_1, p_0, q_1)) \ge 2a$ with $a > 0$. Assume also that $\mathrm{KL}(\pi_{XY}(p_1, p_0, q_1) \,\|\, \pi_{XY}(p_1, p_0, q_0)) \le \gamma$, where $\mathrm{KL}$ denotes the Kullback-Leibler divergence. Then the following bound holds:

$$\inf_{\widehat{q}} \sup_{\mathcal{P}} P_{\pi_{XY}(p_1, p_0, q)}\big( d((p_1, p_0, \widehat{q}), (p_1, p_0, q)) \ge a \big) \ge \inf_{\widehat{q}} \max_{j \in \{0,1\}} P_{\pi_{XY}(p_1, p_0, q_j)}\big( d((p_1, p_0, \widehat{q}), (p_1, p_0, q_j)) \ge a \big) \ge \max\left( \tfrac{1}{4} \exp(-\gamma),\; \frac{1 - \sqrt{\gamma/2}}{2} \right),$$

where the infimum is taken over the collection of all possible estimators of $q$ (based on a sample from $\pi_{XY}(p_1, p_0, q)$ with known class-conditional densities).
[Fig. 2. The two $\eta(x)$ used in the proof of Theorem 3 when $d = 1$.]

Denote $\widehat{G}_n := \{x : \widehat{\eta}_n(x; \widehat{q}) \ge 1/2\}$, where $\widehat{\eta}_n(x; \widehat{q})$ is defined in Section III, and the optimal decision regions $G^*_j := \{x : \eta_j(x) \ge 1/2\}$, where the subscript $j$ indicates that the excess risk is measured with respect to the distribution $\pi_{XY}(p_1, p_0, q_j)$, $j = 0, 1$. Take $\mathcal{P} = \mathcal{P}_{\theta,\alpha}$. We are interested in controlling the excess risk $R_j(\widehat{q}) - R_j(q_j)$.

To prove the lower bound we use the following class-conditional densities, which allow us to attain any desired margin parameter $\alpha$ in Assumption (MA) by adjusting the parameter $\kappa$ below:

$$p_1(x) = \begin{cases} \dfrac{(1 + 2c\, x_d^{\kappa-1})(1 - 2t^{\kappa-1})}{1 - 4c\,(t x_d)^{\kappa-1}} & x \in [0,1]^{d-1} \times [0, t) \\[2mm] 1 + 2c_1 (x_d - t)^{\kappa-1} & x \in [0,1]^{d-1} \times [t, 1] \\[1mm] 0 & x \in \mathbb{R}^d \setminus [0,1]^d \end{cases}$$

$$p_0(x) = \begin{cases} 2 - p_1(x) & x \in [0,1]^d \\ 0 & x \in \mathbb{R}^d \setminus [0,1]^d \end{cases}$$

where $x = (x_1, \dots, x_d)$ and $0 < c \ll 1$, $\kappa > 1$ are constants. The quantity $0 < t \ll 1$ is a small real number which goes to zero as $n \to \infty$ and will be determined later. It is easy to verify that, in order to make $\int_{\mathbb{R}^d} p_i = 1$, $i = 0, 1$, hold as $t \to 0$, the number $c_1$ must be of order $O(t^{\kappa})$, which also goes to zero.

Assign the prior probabilities of $H_1$ under the two hypotheses as

$$q_0 = \tfrac{1}{2}, \qquad q_1 = \tfrac{1}{2} + t^{\kappa-1}.$$

Obviously the marginal distribution of $X$ under hypothesis 0, $P_X^{(0)}$, is uniform on $[0,1]^d$, and $P_X^{(1)}$ is approximately uniform on $[0,1]^d$. We can compute the regression functions from

$$\eta_j(x) = \frac{q_j\, p_1(x)}{q_j\, p_1(x) + (1 - q_j)\, p_0(x)},$$

which yields the explicit expressions

$$\eta_0(x) = \begin{cases} \dfrac{(\tfrac{1}{2} + c\, x_d^{\kappa-1})(1 - 2t^{\kappa-1})}{1 - 4c\,(t x_d)^{\kappa-1}} & x \in [0,1]^{d-1} \times [0, t) \\[2mm] \tfrac{1}{2} + c_1 (x_d - t)^{\kappa-1} & x \in [0,1]^{d-1} \times [t, 1] \\[1mm] 0 & x \in \mathbb{R}^d \setminus [0,1]^d \end{cases}$$

$$\eta_1(x) = \begin{cases} \tfrac{1}{2} + c\, x_d^{\kappa-1} & x \in [0,1]^{d-1} \times [0, t) \\[1mm] \dfrac{(1 + 2t^{\kappa-1})\big(\tfrac{1}{2} + c_1 (x_d - t)^{\kappa-1}\big)}{1 + 4 t^{\kappa-1} c_1 (x_d - t)^{\kappa-1}} & x \in [0,1]^{d-1} \times [t, 1] \\[2mm] 0 & x \in \mathbb{R}^d \setminus [0,1]^d \end{cases}$$

From the above we see that $G^*_0 = [0,1]^{d-1} \times [t, 1]$ and $G^*_1 = [0,1]^d$. Fig. 2 depicts $\eta_j(x)$, $j \in \{0, 1\}$, when $d = 1$.

To further analyze the designed hypotheses, we show that the parameter $\alpha$ in Assumption (MA) for $\eta_j(x)$, $j = 0, 1$, is $\alpha = 1/(\kappa - 1)$. Consider the case $j = 0$ (the case $j = 1$ is analogous). Since

$$\eta_0\big((0, \dots, 0, t)\big) = \frac{1}{2} - \frac{(1 - c)\, t^{\kappa-1}}{1 - 4c\, t^{2\kappa-2}} < \frac{1}{2} - (1 - c)\, t^{\kappa-1} = \frac{1}{2} - \tau_*,$$

we have, provided $\tau \le \tau_*$,

$$P_0\big(0 < |\eta_0(X) - \tfrac{1}{2}| \le \tau\big) = P_0\big(0 < x_d - t \le (\tau/c_1)^{1/(\kappa-1)}\big) = (\tau/c_1)^{1/(\kappa-1)} = C_\eta\, \tau^{1/(\kappa-1)},$$

where $C_\eta > 1$. The second step follows since $P_X^{(0)}$ is uniform on $[0,1]^d$.

Since the excess risk is not a semidistance, we cannot apply Lemma 3 directly, but we can relate the excess risk to the symmetric difference distance measure and then use the lemma.
First we introduce Proposition 1 of Tsybakov [17], rephrased as follows.

Lemma 4. Assume that $P(0 < |\eta(X) - 1/2| \le \tau) \le C_\eta\, \tau^{\alpha}$ for some finite $C_\eta > 0$, $\alpha > 0$, and all $0 < \tau \le \tau_*$, where $\tau_* \le 1/2$. Then there exist $c_\alpha > 0$ and $0 < \epsilon_0 \le 1$ such that

$$R_j(\widehat{q}) - R_j(q_j) \ge c_\alpha\, d_\Delta(\widehat{G}_n, G^*_j)^{1 + 1/\alpha}$$

for all $\widehat{G}_n$ such that $d_\Delta(\widehat{G}_n, G^*_j) \le \epsilon_0 \le 1$, where

$$c_\alpha = 2\, C_\eta^{-1/\alpha}\, \alpha\, (\alpha + 1)^{-1 - 1/\alpha}, \qquad \epsilon_0 = C_\eta\, (\alpha + 1)\, \tau_*^{\alpha},$$

and $d_\Delta(\widehat{G}_n, G^*_j) := \int_{\widehat{G}_n \Delta G^*_j} dx$ is the symmetric difference distance measure.

When $j = 0$, plugging in $\tau_* = (1 - c)\, t^{\kappa-1}$, and since $c$ is very small, we obtain $\epsilon_0 = C_\eta\,(1 + 1/(\kappa - 1))\,(1 - c)^{1/(\kappa-1)}\, t \ge t/2$. Analogously, when $j = 1$, $\epsilon_0 \ge t/2$ also holds.

We now proceed by applying Lemma 3 to the semidistance $d_\Delta$ and then use Lemma 4 to control the excess risk. Note that $d_\Delta(G^*_0, G^*_1) = t$. Let $P_{0,n} := P^{(0)}_{X_1, \dots, X_n;\, Y_1, \dots, Y_n}$ be the probability measure of the random variables $\{(X_i, Y_i)\}_{i=1}^n$ under hypothesis 0 and define $P_{1,n} := P^{(1)}_{X_1, \dots, X_n;\, Y_1, \dots, Y_n}$ analogously. Consider the KL divergence $\mathrm{KL}(P_{1,n} \| P_{0,n})$:

$$\mathrm{KL}(P_{1,n} \| P_{0,n}) = E_1\left[ \log \frac{\prod_{i=1}^n p^{(1)}_{X_i, Y_i}(X_i, Y_i)}{\prod_{i=1}^n p^{(0)}_{X_i, Y_i}(X_i, Y_i)} \right] = \sum_{i=1}^n E_1\left[ \log \frac{p^{(1)}_{X_i, Y_i}(X_i, Y_i)}{p^{(0)}_{X_i, Y_i}(X_i, Y_i)} \right] = n\, E_1\left[ \log \frac{p^{(1)}_{X,Y}(X, Y)}{p^{(0)}_{X,Y}(X, Y)} \right],$$

where $E_1[\log (p^{(1)}_{X,Y}(X,Y)/p^{(0)}_{X,Y}(X,Y))]$ simplifies to

$$\int_{\mathbb{R}^d} q_1\, p_1(x) \log \frac{q_1\, p_1(x)}{q_0\, p_1(x)} + \int_{\mathbb{R}^d} (1 - q_1)\, p_0(x) \log \frac{(1 - q_1)\, p_0(x)}{(1 - q_0)\, p_0(x)} = q_1 \log \frac{q_1}{q_0} + (1 - q_1) \log \frac{1 - q_1}{1 - q_0}.$$

The expression in the last line is the KL divergence between two Bernoulli random variables, which is bounded as in the following lemma.

Lemma 5. Let $P$ and $Q$ be Bernoulli random variables with parameters $1/2 - p$ and $1/2 - q$, respectively. If $|p|, |q| \le 1/4$, then $\mathrm{KL}(P \| Q) \le 8(p - q)^2$.

Thus we know $\mathrm{KL}(P_{1,n} \| P_{0,n}) \le 8 n\, (t^{\kappa-1})^2 = 8 n\, t^{2\kappa-2}$. Taking $t = n^{-\frac{1}{2\kappa-2}}$, setting $d((p_1, p_0, \widehat{q}), (p_1, p_0, q_j)) := d_\Delta(\widehat{G}_n, G^*_j)$, and using Lemma 3, we know that for $n$ large enough (implying $t$ small),

$$\inf_{\widehat{q}} \max_{j \in \{0,1\}} P_j\big( d((p_1, p_0, \widehat{q}), (p_1, p_0, q_j)) \ge t/2 \big) \ge \tfrac{1}{4} \exp(-8).$$

Since $\epsilon_0 \ge t/2$ in Lemma 4, we can apply Lemma 4 to show

$$\inf_{\widehat{q}} \max_{j \in \{0,1\}} P_j\big( R_j(\widehat{q}) - R_j(q_j) \ge c_\alpha\, (t/2)^{\kappa} \big) \ge \inf_{\widehat{q}} \max_{j \in \{0,1\}} P_j\big( d((p_1, p_0, \widehat{q}), (p_1, p_0, q_j)) \ge t/2 \big) \ge \tfrac{1}{4} \exp(-8).$$

By Markov's inequality, we conclude

$$\inf_{\widehat{q}} \sup_{\mathcal{P}_{\theta,\alpha}} E[R(\widehat{q}) - R(q)] \ge c'\, n^{-\frac{\kappa}{2\kappa-2}} = c'\, n^{-(1+\alpha)/2},$$

where $\alpha = 1/(\kappa - 1)$ and $c' = \tfrac{1}{4}\, e^{-8}\, c_\alpha\, \big(\tfrac{1}{2}\big)^{\frac{\alpha+1}{\alpha}}$.

APPENDIX B
PROOF OF THEOREM 6

We introduce two more quantities measuring distances between probability distributions.

Definition 2. The Hellinger distance between two probability density functions $p, q$ is defined as

$$H(p, q) = \left( \int (\sqrt{p} - \sqrt{q})^2\, d\nu \right)^{1/2}.$$

Definition 3. The $\chi^2$ divergence between two probability density functions $p, q$ is defined as

$$\chi^2(p, q) = \int_{pq > 0} \frac{p^2}{q}\, d\nu - 1.$$

As shown in Tsybakov [19], these quantities satisfy

$$V^2(p, q) \le H^2(p, q) \le \chi^2(p, q).$$
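As a quick numerical sanity check of these inequalities, the following sketch (illustrative, with two assumed pmfs on a small finite support; not from the paper) evaluates $V^2$, $H^2$, and $\chi^2$ directly.

```python
import numpy as np

# Two assumed pmfs on a common finite support (illustrative values).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

V    = 0.5 * np.abs(p - q).sum()               # total variation distance
H2   = ((np.sqrt(p) - np.sqrt(q))**2).sum()    # squared Hellinger distance
chi2 = (p**2 / q).sum() - 1.0                  # chi-squared divergence

print(f"V^2 = {V**2:.4f} <= H^2 = {H2:.4f} <= chi^2 = {chi2:.4f}")
```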
Define $f(x, q) = q\,p_1(x) + (1-q)\,p_0(x)$. We use the Hellinger distance to measure the error of estimating $q$ from the training data:

$$r_2(q, q+h) := H\big(f(x, q),\, f(x, q+h)\big).$$

We introduce a concentration inequality for the MLE, namely Theorem I.5.3 of Ibragimov and Has'minskii [20], rephrased as follows.

Lemma 6. Let $Q$ be a bounded interval in $\mathbb{R}$ and let $f(x, q)$ be a continuous function of $q$ on $Q$ for $\nu$-almost all $x$, where $\nu$ denotes Lebesgue measure on $\mathbb{R}^d$. Let the following conditions be satisfied:

1) there exists a number $\xi > 1$ such that

$$\sup_{q \in Q} \sup_{h} |h|^{-\xi}\, r_2^2(q, q+h) = A < \infty;$$

2) to any compact set $K$ there corresponds a positive number $a(K) = a > 0$ such that

$$r_2^2(q, q+h) \ge \frac{a\, |h|^{\xi}}{1 + |h|^{\xi}}, \qquad q \in K.$$

Then the maximum likelihood estimator $\widehat{q}$ is defined, consistent, and

$$\sup_{q \in K} P_q\big(|\widehat{q} - q| > \epsilon\big) \le B_0\, e^{-b_0 a n \epsilon^{\xi}},$$

where the positive constants $B_0$ and $b_0$ do not depend on $K$, and $n$ is the number of training data.

Taking $Q = K = [\theta, 1-\theta]$, it suffices to show that the two conditions in Lemma 6 hold with $\xi = 2$; then we can use Lemma 2 to complete the proof.

Proof: 1. For condition 1, $\sup_{q \in Q} \sup_h |h|^{-2}\, r_2^2(q, q+h) = A < \infty$:

$$r_2^2(q, q+h) \le \chi^2\big(f(x, q+h), f(x, q)\big) = \int \left[ f(x, q) + \frac{(p_1 - p_0)^2 h^2}{f(x, q)} + 2h\,(p_1 - p_0) \right] d\nu - 1 = h^2 \int \frac{(p_1 - p_0)^2}{q\,p_1 + (1-q)\,p_0}\, d\nu.$$

Splitting the integral into the regions $\{p_1 > p_0\}$ and $\{p_1 < p_0\}$ and using $q\,p_1 + (1-q)\,p_0 = q(p_1 - p_0) + p_0 = (1-q)(p_0 - p_1) + p_1$, we obtain

$$h^2 \int \frac{(p_1 - p_0)^2}{q\,p_1 + (1-q)\,p_0}\, d\nu \le h^2 \left[ \frac{1}{q} \int_{p_1 > p_0} (p_1 - p_0) + \frac{1}{1-q} \int_{p_1 < p_0} (p_0 - p_1) \right] = \frac{h^2\, V(p_1, p_0)}{q(1-q)} \le \frac{h^2}{\theta(1-\theta)},$$

so condition 1 holds with $\xi = 2$ and $A \le 1/(\theta(1-\theta))$.

2. For condition 2, by the inequality $V^2 \le H^2$,

$$r_2^2(q, q+h) \ge V^2\big(f(x, q), f(x, q+h)\big) = \left( \sup_A \left| \int_A h\,(p_1 - p_0)\, d\nu \right| \right)^2 = h^2\, V^2(p_1, p_0) \ge \frac{V_{\min}^2\, h^2}{1 + h^2},$$

so condition 2 holds with $\xi = 2$ and $a = V_{\min}^2$. Lemma 6 then gives

$$\sup_{q \in [\theta, 1-\theta]} P_q\big(|\widehat{q} - q| > \epsilon\big) \le B_0\, e^{-b_0 a n \epsilon^2}.$$

Applying Lemma 2 (together with Lemma 1, as in the proof of Theorem 4) with $C_1 = B_0$, $C_2 = b_0 a$, and $a_n = n$ completes the proof of Theorem 6.

Remark 9. In the proof of Theorem 6 we have

$$1 \bigg/ \left( \int \frac{(p_1 - p_0)^2}{q\,p_1 + (1-q)\,p_0}\, d\nu \right) \ge \frac{q(1-q)}{V(p_1, p_0)},$$

where the left-hand side is the reciprocal of the Fisher information for $q$ given unlabeled data, and the right-hand side is the reciprocal of the Fisher information given labeled data, divided by $V(p_1, p_0)$. The inequality holds with equality when $p_1$ and $p_0$ do not overlap at all. Since the minimum variance of an unbiased estimator is described by the reciprocal of the Fisher information, this inequality shows that the convergence of $\widehat{q}$ to $q$ in the unlabeled case can never be faster than in the labeled case, and will be slower if $V(p_1, p_0)$ is small.

REFERENCES

[1] L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.
[2] V. Vapnik, Statistical Learning Theory. Wiley, New York, 1998.
[3] M. Aizerman, E. Braverman, and L. Rozonoer, Method of Potential Functions in the Theory of Learning Machines. Nauka, Moscow (in Russian), 1970.
[4] V. Vapnik and A. Chervonenkis, Theory of Pattern Recognition. Nauka, Moscow (in Russian), 1974.
[5] V. Vapnik, Estimation of Dependencies Based on Empirical Data. Springer, New York, 1998.
[6] L. Breiman, J. Friedman, C. J. Stone, and R. Olshen, Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
[7] M. Anthony and P. Bartlett, Neural Network Learning: Theoretical Foundations. Cambridge Univ. Press, 1999.
[8] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines. Cambridge Univ. Press, 2000.
[9] B. Scholkopf and A. Smola, Learning with Kernels. MIT Press, 2002.
[10] E. Mammen and A. Tsybakov, "Smooth discrimination analysis," Annals of Statistics, vol. 27, pp. 1808-1829, 1999.
[11] V. Koltchinskii, "Local Rademacher complexities and oracle inequalities in risk minimization," Annals of Statistics, vol. 34, no. 6, pp. 2593-2656, 2006.
[12] I. Steinwart and C. Scovel, "Fast rates for support vector machines using Gaussian kernels," Annals of Statistics, vol. 35, pp. 575-607, 2007.
[13] A. Tsybakov and S. van de Geer, "Square root penalty: adaptation to the margin in classification and in edge estimation," Annals of Statistics, vol. 33, pp. 1203-1224, 2005.
[14] P. Massart, "Some applications of concentration inequalities to statistics," Ann. Fac. Sci. Toulouse Math., vol. 9, pp. 245-303, 2000.
[15] O. Catoni, "Randomized estimators and empirical complexity for pattern recognition and least square regression," 2001. [Online]. Available: http://www.proba.jussieu.fr/
[16] M. Horvath and G. Lugosi, "Scale-sensitive dimensions and skeleton estimates for classification," Discrete Applied Mathematics, vol. 86, pp. 37-61, 1998.
[17] A. Tsybakov, "Optimal aggregation of classifiers in statistical learning," Annals of Statistics, vol. 32, no. 1, pp. 135-166, 2004.
[18] J.-Y. Audibert and A. Tsybakov, "Fast learning rates for plug-in classifiers," Annals of Statistics, vol. 35, no. 2, pp. 608-633, 2007.
[19] A. Tsybakov, Introduction to Nonparametric Estimation. Springer, 2008.
[20] I. A. Ibragimov and R. Z. Has'minskii, Statistical Estimation: Asymptotic Theory. Springer-Verlag, 1981.

