SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY

Minimax-Optimal Bounds for Detectors Based on Estimated Prior Probabilities

Jiantao Jiao*, Lin Zhang, Member, IEEE, and Robert D. Nowak, Fellow, IEEE

J. Jiao and L. Zhang are with the Department of Electronic Engineering, Tsinghua University, Beijing, 100084 China (e-mail: xajjt1990@gmail.com; linzhang@tsinghua.edu.cn). R. Nowak is with the Department of Electrical and Computer Engineering, University of Wisconsin, Madison, WI 53706 USA (e-mail: nowak@ece.wisc.edu).

Abstract—In many signal detection and classification problems, we have knowledge of the distribution under each hypothesis, but not the prior probabilities. This paper provides theory to quantify the performance of detection based on prior probabilities estimated from either labeled or unlabeled training data. The error, or risk, is considered as a function of the prior probabilities. We show that the risk function is locally Lipschitz in the vicinity of the true prior probabilities, and that the error of detectors based on estimated prior probabilities depends on the behavior of the risk function in this locality. In general, we show that the error of detectors based on the Maximum Likelihood Estimate (MLE) of the prior probabilities converges to the Bayes error at a rate of $n^{-1/2}$, where $n$ is the number of training data. If the behavior of the risk function is more favorable, then detectors based on the MLE have errors converging to the corresponding Bayes errors at optimal rates of the form $n^{-(1+\alpha)/2}$, where $\alpha > 0$ is a parameter governing the behavior of the risk function, with a typical value $\alpha = 1$. The limit $\alpha \to \infty$ corresponds to a situation where the risk function is flat near the true probabilities, and thus insensitive to small errors in the MLE; in this case the error of the detector based on the MLE converges to the Bayes error exponentially fast with $n$. We show that the bounds are achievable with either labeled or unlabeled training data and are minimax-optimal in the labeled case.

Index Terms—Detector, minimax-optimality, maximum likelihood estimate (MLE), prior probability, statistical learning theory

I. INTRODUCTION

In many signal detection and classification problems the conditional distribution under each hypothesis is known, but the prior probabilities are unknown. For example, we may have a good model for the symptoms of a certain disease, but might not know how prevalent the disease is. There are two ways to proceed:
1) Neyman-Pearson detectors;
2) estimating the prior probabilities from training data.
Neyman-Pearson detectors are designed to control one type of error while minimizing the other. Detectors based on estimating prior probabilities aim to achieve the performance of the Bayes detector (see, e.g., Devroye, Gyorfi, and Lugosi [1]). We study this second approach and provide theory to quantify the performance of detectors based on estimating prior probabilities from training data. We focus on simple binary hypotheses and minimum probability of error detection, but the theory and methods can be extended to handle other error criteria that weight different error types and to $m$-ary detection problems. This problem can be viewed as a special case of the classification problem in machine learning in which we have knowledge of the density under each hypothesis.
These conditional densities are called the class-conditional densities in the parlance of machine learning, and we will use this terminology here. Detectors based on "plugging in" the Maximum Likelihood Estimate (MLE) of the prior probabilities are simply a special case of the well-known plug-in approach in statistical learning theory. We use this connection to develop upper and lower bounds on the performance of detectors based on the MLE of prior probabilities.

Let us first introduce some notation for the problem. Let $X \in \mathbb{R}^d$ denote a signal and consider a binary hypothesis testing problem
$$H_0: X \sim p_0, \qquad H_1: X \sim p_1,$$
where $p_0$ and $p_1$ are known probability densities on $\mathbb{R}^d$. Let $Y$ be a binary random variable indicating which hypothesis $X$ follows, and define $q := P(Y = 1)$, the probability that hypothesis $H_1$ is true. The Bayes detector is defined by the likelihood ratio test
$$\frac{p_1(X)}{p_0(X)} \underset{H_0}{\overset{H_1}{\gtrless}} \frac{1-q}{q},$$
and it minimizes the probability of error. Let $\Lambda(x) := p_1(x)/p_0(x)$ and define the regression function $\eta(x)$:
$$\eta(x) := P(Y = 1 \mid X = x) = \frac{q\, p_1(x)}{(1-q)\, p_0(x) + q\, p_1(x)};$$
then the Bayes detector can be expressed as $f^*(x) = \mathbf{1}\{\eta(x) \ge 1/2\}$. Note that $\eta(x)$ is parameterized by the prior probability $q$.

Let us consider the probability of error, or risk, as a function of this parameter. For any feasible prior probability $q'$, let $R(q')$ denote the risk (probability of error) incurred by using $q'$ in place of $q$. The value $q$ defined above produces the minimum risk. The difference $R(q') - R(q)$ quantifies the suboptimality of $q'$. The quantity $R(q')$ can be expressed as
$$R(q') = q\, P_1(q') + (1-q)\, P_0(q'),$$
where
$$P_1(q') := P(\Lambda(X) < (1-q')/q' \mid H_1) = \int \mathbf{1}\{\Lambda(x) < (1-q')/q'\}\, p_1(x)\, dx,$$
$$P_0(q') := P(\Lambda(X) \ge (1-q')/q' \mid H_0) = \int \mathbf{1}\{\Lambda(x) \ge (1-q')/q'\}\, p_0(x)\, dx.$$

Assume there is a joint distribution $\pi = \pi_{XY}$ over the signal $X$ and label $Y$. This distribution determines both the class-conditional densities (by conditioning on $Y = 0$ or $Y = 1$) and the prior probabilities (by marginalizing over $X$). Suppose we have $n$ training data distributed independently and identically according to $\pi$. We will consider cases with "labeled" data $\{(X_i, Y_i)\}_{i=1}^n$ or "unlabeled" data $\{X_i\}_{i=1}^n$ and use them to estimate the unknown prior probability $q$. Let $\widehat{q}$ denote the MLE of $q$ based on the training data; the risk of the detector based on $\widehat{q}$ is $R(\widehat{q})$. Note that $R(\widehat{q})$ is a random variable and is greater than or equal to $R(q)$. The goal of this paper is to bound the difference $\mathbb{E}[R(\widehat{q})] - R(q)$, where $\mathbb{E}$ is the expectation operator, and to provide lower bounds on the performance of any detector derived from knowledge of the class-conditional densities and the training data. The difference $\mathbb{E}[R(\widehat{q})] - R(q)$ is usually called the excess risk or regret, and it is a function of $n$.

Statistical learning theory is typically concerned with the construction of estimators based on labeled training data without prior knowledge of the class-conditional densities. There are two common approaches: plug-in rules and empirical risk minimization (ERM) rules (see, e.g., Devroye, Gyorfi, and Lugosi [1] and Vapnik [2]).
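The following minimal numerical sketch is not part of the original paper; it illustrates the risk $R(q')$ of the likelihood-ratio detector as a function of the assumed prior $q'$, using a hypothetical pair of Gaussian class-conditional densities $p_0 = \mathcal{N}(0,1)$, $p_1 = \mathcal{N}(1,1)$ and true prior $q = 0.3$. All names and parameter values here are illustrative assumptions, not quantities from the paper.

```python
# Monte Carlo estimate of R(q') = q*P1(q') + (1-q)*P0(q') for a hypothetical
# Gaussian pair p0 = N(0, 1), p1 = N(1, 1) and true prior q = 0.3.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
q_true = 0.3                      # assumed true prior P(Y = 1), for illustration only
m = 200_000                       # Monte Carlo sample size per class

def risk(q_prime):
    """Probability of error when the LRT threshold uses q_prime instead of q_true."""
    thr = (1.0 - q_prime) / q_prime
    x1 = rng.normal(1.0, 1.0, m)  # samples from p1
    x0 = rng.normal(0.0, 1.0, m)  # samples from p0
    lam1 = norm.pdf(x1, 1, 1) / norm.pdf(x1, 0, 1)   # likelihood ratios under H1
    lam0 = norm.pdf(x0, 1, 1) / norm.pdf(x0, 0, 1)   # likelihood ratios under H0
    P1 = np.mean(lam1 < thr)      # miss probability       P(Lambda <  (1-q')/q' | H1)
    P0 = np.mean(lam0 >= thr)     # false-alarm probability P(Lambda >= (1-q')/q' | H0)
    return q_true * P1 + (1.0 - q_true) * P0

for qp in [0.1, 0.2, 0.3, 0.4, 0.5]:
    print(f"q' = {qp:.1f}  ->  R(q') ~ {risk(qp):.4f}")
```

The printed values attain their minimum near $q' = q$, consistent with $R(q') \ge R(q)$.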
Statistical properties of these two types of classifiers, as well as of other related ones, have been extensively studied (see Aizerman, Braverman, and Rozonoer [3], Vapnik and Chervonenkis [4], Vapnik [2], [5], Breiman, Friedman, Olshen, and Stone [6], Devroye, Gyorfi, and Lugosi [1], Anthony and Bartlett [7], Cristianini and Shawe-Taylor [8], Scholkopf and Smola [9], and the references therein). Results concerning the convergence of the excess risk obtained in the literature are of the form
$$\mathbb{E}[R(\widehat{f}_n)] - R(f^*) = O(n^{-\beta}),$$
where $\beta > 0$ is some exponent, and typically $\beta \le 1/2$ if $R(f^*) \ne 0$. Here $\widehat{f}_n$ denotes the nonparametric estimator of the classifier and $f^*$ denotes the Bayes classifier. Mammen and Tsybakov [10] first showed that one can attain fast rates, approaching $n^{-1}$; for further results about fast rates see Koltchinskii [11], Steinwart and Scovel [12], Tsybakov and van de Geer [13], Massart [14], and Catoni [15]. The behavior of the regression function $\eta$ around the boundary $\partial G^* = \{x : \eta(x) = 1/2\}$ has an important effect on the convergence of the excess risk, which has been discussed earlier under different assumptions by Devroye, Gyorfi, and Lugosi [1] and Horvath and Lugosi [16]. In this paper we consider the "margin assumption" introduced in Tsybakov [17]. Audibert and Tsybakov [18] showed that there exist plug-in rules converging at super-fast rates, that is, faster than $n^{-1}$, under the margin assumption of Tsybakov [17]. Our detector can be viewed as a special case of a plug-in rule, and we take advantage of Lemma 3.1 in [18].

Our main results can be summarized as follows. Whether given labeled or unlabeled data, we show that the excess risk converges and derive the rate of this convergence. The convergence rate depends on the local behavior of the function $R(\widehat{q})$ near $q$, which is determined by the behavior of $\eta(x)$ in the vicinity of $\eta(x) = 1/2$. In general, $R$ is locally Lipschitz at $q$, and the convergence rate is proportional to $n^{-1/2}$. If $R$ is smoother/flatter at $q$, then the convergence rate can be much faster, taking the form $n^{-(1+\alpha)/2}$, where $\alpha > 0$ is a parameter reflecting the smoothness of $R$ at $q$. The value $\alpha = 1$ is typical, and we in fact obtain an $n^{-1}$ convergence rate under mild conditions. The limit $\alpha \to \infty$ corresponds to a situation where the risk function is flat near the true probabilities, and thus insensitive to small errors in the estimate of the prior probabilities; in this case the error of the detector based on the MLE converges to the Bayes error exponentially fast with $n$. We also show that the convergence rates are minimax-optimal given labeled data. Fig. 1 depicts three cases illustrating the smoothness conditions and corresponding $\eta(x)$ considered in the paper.

Fig. 1. Examples of $R(\widehat{q})$ and the corresponding $\eta(x)$ leading to different convergence rates: (a) difficult case, (b) moderate case, (c) best case.

The paper is organized as follows. In Sections II and III we discuss the minimax lower bounds and the upper bounds achieved by the MLE with labeled data. Section IV discusses convergence rates when we only have unlabeled training data. Section V compares our results with those in standard passive learning and makes final remarks on our work.
II. CONVERGENCE RATES IN THE GENERAL CASE WITH LABELED DATA

This section discusses the convergence rates of the proposed detector trained with labeled data, without any additional assumptions. Let $\widehat{q}$ be the MLE of $q$, i.e.,
$$\widehat{q} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{Y_i = 1\},$$
and define
$$\mathcal{P} := \{(p_1, p_0, q)\},$$
where $p_1, p_0$ are class-conditional densities and $q$ is the prior probability. We first establish a minimax lower bound.

A. Minimax Lower Bound

Theorem 1. There exists a constant $c > 0$ such that
$$\inf_{\widehat{q}} \sup_{\mathcal{P}} \mathbb{E}[R(\widehat{q})] - R(q) \ge c\, n^{-1/2},$$
where $\sup_{\mathcal{P}}$ takes the supremum over all possible triples $(p_1, p_0, q)$ and $\inf_{\widehat{q}}$ denotes the infimum over all possible estimators of $q$ derived from $n$ samples of training data with prior knowledge of the class-conditional densities.

Theorem 1 can be viewed as a corollary of Theorem 3 (given in the following section) if we take $\alpha = 0$ and remove the constraints on $p_1(x)$, $p_0(x)$, and $q$ in Theorem 3.

B. Upper Bound

Theorem 2. If $\widehat{q}$ is the MLE of $q$, we have
$$\sup_{\mathcal{P}} \mathbb{E}[R(\widehat{q})] - R(q) \le \frac{1}{2} n^{-1/2}.$$

Proof: Define the parametrized risk function
$$\widehat{R}(q_1; q_2) := q_2 P_1(q_1) + (1 - q_2) P_0(q_1).$$
By the same argument that shows $q = \arg\min_{q'} R(q')$, we know $\widehat{q} = \arg\min_{q'} \widehat{R}(q'; \widehat{q})$. We express the excess risk as
$$\mathbb{E}[R(\widehat{q}) - R(q)] = \mathbb{E}[\widehat{R}(\widehat{q}; q) - \widehat{R}(q; \widehat{q})] \le \mathbb{E}[\widehat{R}(\widehat{q}; q) - \widehat{R}(\widehat{q}; \widehat{q})],$$
where the equality uses $\mathbb{E}[\widehat{q}] = q$ together with the fact that $P_1(q)$ and $P_0(q)$ are deterministic, and the inequality follows because $\widehat{q}$ minimizes $\widehat{R}(\,\cdot\,; \widehat{q})$. Writing $\widehat{R}(\widehat{q}; q) - \widehat{R}(\widehat{q}; \widehat{q})$ explicitly,
$$\widehat{R}(\widehat{q}; q) - \widehat{R}(\widehat{q}; \widehat{q}) = (q - \widehat{q})\big(P_1(\widehat{q}) - P_0(\widehat{q})\big),$$
we thus have
$$\mathbb{E}[R(\widehat{q}) - R(q)] \le \mathbb{E}\big[(q - \widehat{q})(P_1(\widehat{q}) - P_0(\widehat{q}))\big] \le \mathbb{E}[|q - \widehat{q}|] \le \sqrt{\mathbb{E}[(q - \widehat{q})^2]} = \sqrt{\frac{q(1-q)}{n}} \le \frac{1}{2} n^{-1/2},$$
which completes the proof of Theorem 2.

Remark 1. The general results in this section also apply when $p_i(x)$, $i = 0, 1$, are probability mass functions (pmfs). In this case, we can write $p_i(x)$, $i = 0, 1$, as a sum of weighted Dirac delta functions, i.e.,
$$p_i(x) = \sum_j w_{i,j}\, \delta(x - x_j),$$
and all of the arguments above hold.

III. FASTER CONVERGENCE RATES WITH LABELED DATA

In Sections III and IV, without loss of generality, we assume the true prior probability $q$ lies in the closed interval $[\theta, 1-\theta]$, where $\theta$ is an arbitrarily small positive real number. The reason why we need this assumption is explained in Section III-A. Define the trimmed MLE of $q$ as
$$\widehat{q} := \arg\max_{q \in [\theta, 1-\theta]} q^{\sum_{i=1}^{n} Y_i} (1-q)^{\sum_{i=1}^{n} (1 - Y_i)},$$
and construct the regression function estimator $\widehat{\eta}_n(x; \widehat{q})$ as
$$\widehat{\eta}_n(x; \widehat{q}) = \frac{\widehat{q}\, p_1(x)}{(1-\widehat{q})\, p_0(x) + \widehat{q}\, p_1(x)}.$$
The accuracy of $\widehat{\eta}_n(x; \widehat{q})$ is closely related to that of estimating $q$ from $n$ training data. The following lemma describes the Lipschitz property of $\widehat{\eta}_n(x; \widehat{q})$ as a function of $\widehat{q}$.

Lemma 1. The regression function estimator $\widehat{\eta}_n(x; \cdot)$ is Lipschitz as a function of $\widehat{q}$: for all $q_1, q_2 \in [\theta, 1-\theta]$,
$$\sup_{x \in \mathbb{R}^d} |\widehat{\eta}_n(x; q_1) - \widehat{\eta}_n(x; q_2)| \le L |q_1 - q_2|,$$
where $L = 1/(4\theta(1-\theta))$.

Proof: Let $f(t, x) = t p_1(x) / (t p_1(x) + (1-t) p_0(x))$; we are interested in the partial derivative of $f$ with respect to $t$:
$$\frac{\partial f}{\partial t} = \frac{p_0(x)\, p_1(x)}{\big(t p_1(x) + (1-t) p_0(x)\big)^2} \ge 0.$$
Since $t \in [\theta, 1-\theta]$, we have
$$\frac{p_0 p_1}{\big(t p_1(x) + (1-t) p_0(x)\big)^2} \le \frac{p_0 p_1}{\left(2\sqrt{t(1-t)\, p_1 p_0}\right)^2} \le \frac{1}{4\theta(1-\theta)},$$
and thus for all $q_1, q_2 \in [\theta, 1-\theta]$,
$$\sup_{x \in \mathbb{R}^d} |\widehat{\eta}_n(x; q_1) - \widehat{\eta}_n(x; q_2)| \le L |q_1 - q_2|,$$
where $L = 1/(4\theta(1-\theta)) \ge 1$.

Remark 2. On the decision boundary we have $q\, p_1(x) = (1-q)\, p_0(x)$, which makes the inequality in the proof of Lemma 1 hold with equality; thus the Lipschitz constant $L$ cannot be further improved.

A. Polynomial Rates

Tsybakov [17] introduced a parametrized margin assumption, denoted Assumption (MA): there exist constants $C_0 > 0$, $c > 0$, and $\alpha \ge 0$ such that when $\alpha < \infty$,
$$P_X\!\left(0 < \left|\eta(X) - \tfrac{1}{2}\right| \le t\right) \le C_0\, t^{\alpha} \quad \forall\, t > 0,$$
and when $\alpha = \infty$,
$$P_X\!\left(0 < \left|\eta(X) - \tfrac{1}{2}\right| \le c\right) = 0.$$
Denote
$$\mathcal{P}_{\theta,\alpha} := \{(p_1, p_0, q) : \text{Assumption (MA) is satisfied with parameter } \alpha \text{ and } q \in [\theta, 1-\theta]\}.$$
The case $\alpha = 0$ is trivial (no margin assumption) and is the case explored in Section II. If $d = 1$ and the decision boundary reduces, for example, to one point $x_0$, Assumption (MA) may be interpreted as $\eta(x) - \tfrac{1}{2} \sim (x - x_0)^{1/\alpha}$ for $x$ close to $x_0$. This interpretation sheds light on the fact that $\alpha = 1$ is typical. If $\eta(x)$ is differentiable with a non-zero first-order derivative at $x = x_0$, then the first-order approximation of $\eta(x)$ in the neighborhood exists, which means $\alpha = 1$ in this case. When $\eta(x)$ is smoother, for example if the first-order derivative vanishes at $x = x_0$ but the second-order derivative does not, then we have $\alpha = 1/2$. When $\eta(x)$ is not differentiable at $x = x_0$, we may have $\alpha > 1$; for example, when $\alpha = 2$, the derivative of $\eta(x)$ at $x = x_0$ goes to infinity. Distributions $\pi_{XY}$ satisfying Assumption (MA) with larger $\alpha$ have more drastic changes near the boundary $\eta(x) = 1/2$, which makes $R(\widehat{q})$ less sensitive to small errors and leads to faster rates. The $R(\widehat{q})$ and corresponding $\eta(x)$ for the typical value $\alpha = 1$ in Assumption (MA) are shown in Fig. 1(b).

We explain why we need to bound the domain of $q$ by showing what determines $C_0$ in Assumption (MA). Consider the typical case $\alpha = 1$, $d = 1$, and compute the derivative of $\eta(x)$ with respect to $x$ at the point $x = x_0$ to which the decision boundary reduces:
$$\eta'(x_0) = q(1-q)\, \frac{p_1'(x_0)\, p_0(x_0) - p_0'(x_0)\, p_1(x_0)}{\big(q\, p_1(x_0) + (1-q)\, p_0(x_0)\big)^2} \propto q(1-q).$$
Without loss of generality, suppose the marginal distribution of $X$ is uniform. Since the first-order approximation of $\eta(x)$ is $\Delta \eta(x) \approx \eta'(x_0)\, \Delta x$, we have
$$P_X\!\left(0 < \left|\eta(X) - \tfrac{1}{2}\right| \le t\right) \propto \frac{t}{\eta'(x_0)} \propto \frac{t}{q(1-q)}.$$
Thus if $q$ goes to zero or one, the constant $C_0$ approaches infinity, which illustrates why we assume $q \in [\theta, 1-\theta]$, $\theta > 0$, at the beginning of Section III.
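As an aside, the following minimal sketch is not part of the original paper; it checks Assumption (MA) numerically for the hypothetical Gaussian pair used in the earlier sketch. Since $\eta$ crosses $1/2$ at a single point with a non-zero derivative there, $P_X(0 < |\eta(X) - 1/2| \le t)$ should grow roughly linearly in $t$, i.e., the typical margin parameter $\alpha = 1$.

```python
# Empirical check of the margin condition P_X(0 < |eta(X) - 1/2| <= t) ~ C_0 * t
# for the hypothetical pair p0 = N(0, 1), p1 = N(1, 1) with prior q = 0.3.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
q = 0.3
n_samples = 1_000_000

# Draw X from the marginal q*p1 + (1-q)*p0.
labels = rng.random(n_samples) < q
x = np.where(labels, rng.normal(1.0, 1.0, n_samples), rng.normal(0.0, 1.0, n_samples))

# Regression function eta(x) = q*p1(x) / (q*p1(x) + (1-q)*p0(x)).
num = q * norm.pdf(x, 1, 1)
eta = num / (num + (1 - q) * norm.pdf(x, 0, 1))

for t in [0.02, 0.04, 0.08, 0.16]:
    p = np.mean(np.abs(eta - 0.5) <= t)          # empirical P_X(|eta - 1/2| <= t)
    print(f"t = {t:.2f}   P_X(|eta-1/2| <= t) ~ {p:.4f}   ratio p/t ~ {p / t:.2f}")
# A roughly constant ratio p/t is consistent with alpha = 1.
```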
Assumption (MA) provides a useful characterization of the behavior of the regression function $\eta(x)$ in the vicinity of the level $\eta(x) = 1/2$, which turns out to be crucial in determining convergence rates. First we state a minimax lower bound under Assumption (MA):

Theorem 3. There exists a constant $c > 0$ such that
$$\inf_{\widehat{q}} \sup_{\mathcal{P}_{\theta,\alpha}} \mathbb{E}[R(\widehat{q})] - R(q) \ge c\, n^{-(1+\alpha)/2}.$$

The proof is given in Appendix A. It follows the general minimax analysis strategy but is a non-trivial result.

Next we show that $n^{-(1+\alpha)/2}$ is also an upper bound. We introduce Lemma 3.1 in Audibert and Tsybakov [18], rephrased as follows:

Lemma 2. Let $\widehat{\eta}_n$ be an estimator of the regression function $\eta$ and $\mathcal{P}$ a set of distributions $\pi_{XY}$ satisfying Margin Assumption (MA). Suppose there exist constants $C_1 > 0$, $C_2 > 0$ and a positive sequence $a_n$ such that for $n \ge 1$, any $\delta > 0$, and almost all $x$ w.r.t. $P_X$,
$$\sup_{P \in \mathcal{P}} P\big(|\widehat{\eta}_n(x) - \eta(x)| \ge \delta\big) \le C_1 e^{-C_2 a_n \delta^2}.$$
Then the plug-in detector $\widehat{f}_n = \mathbf{1}\{\widehat{\eta}_n \ge 1/2\}$ satisfies
$$\sup_{P \in \mathcal{P}} \mathbb{E}[R(\widehat{f}_n)] - R(f^*) \le C\, a_n^{-(1+\alpha)/2}$$
for $n \ge 1$ with some constant $C > 0$ depending only on $\alpha$, $C_0$, $C_1$, and $C_2$, where $f^*$ denotes the Bayes detector.

Remark 3. Following the proof of Lemma 2, the constant $C$ increases as $C_1$ increases, as the constant $C_0$ in Assumption (MA) increases, and as $C_2$ decreases.

Theorem 4. If $\widehat{q}$ is the trimmed MLE of $q$, there exists a constant $C > 0$ such that
$$\sup_{\mathcal{P}_{\theta,\alpha}} \mathbb{E}[R(\widehat{q})] - R(q) \le C\, n^{-(1+\alpha)/2}.$$

Proof: According to Lemma 1, we have
$$\sup_{x \in \mathbb{R}^d,\; \widehat{q}, q \in [\theta, 1-\theta]} |\widehat{\eta}_n(x; \widehat{q}) - \widehat{\eta}_n(x; q)| \le L |\widehat{q} - q|.$$
Combining this with Hoeffding's inequality, we have
$$\sup_{\mathcal{P}_{\theta,\alpha}} P\big(|\widehat{\eta}_n(x) - \eta(x)| \ge \delta\big) \le \sup_{\mathcal{P}_{\theta,\alpha}} P\!\left(|\widehat{q} - q| \ge \frac{\delta}{L}\right) \le 2 e^{-2 n \delta^2 / L^2},$$
where $L > 0$ is the constant in Lemma 1. The inequality above shows that we can take $C_1 = 2$, $C_2 = 2/L^2$, and $a_n = n$ in Lemma 2. According to Lemma 2,
$$\sup_{\mathcal{P}_{\theta,\alpha}} \mathbb{E}[R(\widehat{q})] - R(q) \le C\, n^{-(1+\alpha)/2}.$$

Remark 4. Consider the typical case $\alpha = 1$. The optimal rate here is $n^{-1}$, which is faster than the naive worst-case rate $n^{-1/2}$ shown in Section II and the optimal rate in standard passive learning, $n^{-2/(2+\rho)}$, $\rho > 0$, shown in Audibert and Tsybakov [18].

Remark 5. Consider the case when the true prior probability $q$ lies near zero or one. This makes the constant $C_0$ in Assumption (MA) go to infinity, as shown in the introduction of Assumption (MA), and the constant $C_2$ go to zero, as shown in the proof of Theorem 4, which slows down the convergence of the excess risk.

B. Exponential Rates

We investigate the convergence rates when $\alpha = \infty$ in Assumption (MA). Intuitively, as $\alpha$ grows larger the rates can be faster than any polynomial rate of fixed degree, as shown in Theorem 4.

Theorem 5. If $\widehat{q}$ is the trimmed MLE defined above, then under Assumption (MA) with $\alpha = \infty$ we have
$$\sup_{\mathcal{P}_{\theta,\infty}} \mathbb{E}[R(\widehat{q})] - R(q) \le 2 e^{-2 n c^2 / L^2},$$
where $c$ is the positive constant in Assumption (MA) and $L$ is the constant in Lemma 1.

Proof: According to Lemma 1, as long as $|\widehat{q} - q| \le c/L$ and $\widehat{q} \in [\theta, 1-\theta]$, the error of the regression function estimator is bounded uniformly by $c$, incurring no error in detection according to Assumption (MA) with $\alpha = \infty$. In other words,
$$R(\widehat{q}) = R(q), \qquad \forall\, \widehat{q} \in [q - c/L,\, q + c/L] \cap [\theta, 1-\theta].$$
We then write the excess risk as
$$\left(\int_{|\widehat{q} - q| \ge \delta} + \int_{|\widehat{q} - q| \le \delta}\right) [R(\widehat{q}) - R(q)]\, dP,$$
where $P$ is the probability measure on the sample space $\Omega$ of $\{(X_i, Y_i)\}_{i=1}^n$. Taking $\delta = c/L$, the second term vanishes. Since $R(\widehat{q}) - R(q) \le 1$, applying Chernoff's bound the first term is bounded by $2 e^{-2 n \delta^2}$ with $\delta = c/L$, so we conclude
$$\sup_{\mathcal{P}_{\theta,\infty}} \mathbb{E}[R(\widehat{q})] - R(q) \le 2 e^{-2 n c^2 / L^2}.$$
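The following minimal simulation sketch is not part of the original paper; it illustrates the labeled-data rates of this section on the same hypothetical Gaussian example (a case with margin parameter $\alpha = 1$), where Theorem 4 predicts an excess risk decaying roughly like $n^{-1}$ rather than the worst-case $n^{-1/2}$ of Theorem 2. All densities and parameter values are illustrative assumptions.

```python
# Excess risk E[R(q_hat)] - R(q) of the plug-in detector with labeled data,
# estimated by Monte Carlo over the trimmed MLE, for p0 = N(0, 1), p1 = N(1, 1),
# true prior q = 0.3, and trimming interval [theta, 1 - theta].
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
q, theta, reps = 0.3, 0.05, 20_000

def risk(q_prime):
    """Exact risk R(q') of the LRT that decides H1 iff x >= 0.5 + ln((1-q')/q')."""
    c = 0.5 + np.log((1.0 - q_prime) / q_prime)
    P1 = norm.cdf(c - 1.0)          # miss probability under H1
    P0 = 1.0 - norm.cdf(c)          # false-alarm probability under H0
    return q * P1 + (1.0 - q) * P0

R_bayes = risk(q)
for n in [100, 400, 1600, 6400]:
    # Trimmed MLE from labels: clip the empirical class-1 frequency to [theta, 1 - theta].
    q_hat = np.clip(rng.binomial(n, q, reps) / n, theta, 1.0 - theta)
    excess = np.mean(risk(q_hat)) - R_bayes
    print(f"n = {n:5d}   excess risk ~ {excess:.2e}   n * excess ~ {n * excess:.3f}")
# n * excess staying roughly constant is consistent with the n^{-1} rate of Theorem 4.
```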
Remark 6. When $p_i(x)$, $i = 0, 1$, are probability mass functions and $x$ takes values in $\mathcal{X}$ with $\#\{\mathcal{X}\} < \infty$, then
$$\inf_{x_i \in \mathcal{X},\, \eta(x_i) \ne 1/2} |\eta(x_i) - 1/2| \ge c > 0,$$
which means there exists a constant $c > 0$ such that $P(0 < |\eta(X) - 1/2| \le c) = 0$. Based on the discussion above, an exponential convergence rate is therefore always guaranteed when $x$ lies in a discrete finite domain. If $\#\{\mathcal{X}\}$ is infinite, then we may have finite $\alpha > 0$ with optimal convergence rate $n^{-(1+\alpha)/2}$. However, finite $\#\{\mathcal{X}\}$ is the case that often arises in practice.

IV. CONVERGENCE RATES WITH UNLABELED DATA

In this section, we discuss convergence rates when we only have unlabeled training data. Unlabeled data are typically easier to obtain in practice than labeled data, so the convergence rate analysis in this case deserves attention. It also reveals how much information is carried by $\{X_i\}_{i=1}^n$ within the training pairs $\{(X_i, Y_i)\}_{i=1}^n$.

In this case, we are faced with a classical parameter estimation problem. Given $X_1, \ldots, X_n \stackrel{\mathrm{iid}}{\sim} q\, p_1(x) + (1-q)\, p_0(x)$, we want to construct an estimator $\widehat{q}$ of $q$ that is as efficient as possible. Here we use the MLE and derive upper bounds under Assumption (MA). Before starting the proof, we introduce a standard quantity measuring the distance between probability measures.

Definition 1. The total variation distance between two probability density functions $p, q$ is defined as
$$V(p, q) = \sup_{A} \left| \int_A (p - q)\, d\nu \right| = 1 - \int \min(p, q)\, d\nu,$$
where $\nu$ denotes the Lebesgue measure on the signal space $\mathbb{R}^d$ and $A$ is any measurable subset of the domain.

We will quantify our results in terms of the total variation distance. Here we assume $V(p_1, p_0) \ge V_{\min} > 0$, ensuring that the two class-conditional densities are not "too" indiscernible, so that it is possible to learn the prior probability $q$ from unlabeled data. For details about how this assumption works, please see Appendix B. Define a class of triples
$$\mathcal{P}_{\theta,\alpha,V_{\min}} := \{(p_1, p_0, q) : \text{Assumption (MA) satisfied with parameter } \alpha,\ q \in [\theta, 1-\theta],\ \text{and } V(p_1, p_0) \ge V_{\min} > 0\},$$
and define the trimmed MLE $\widehat{q}$ in this case as
$$\widehat{q} := \arg\max_{q \in [\theta, 1-\theta]} \sum_{i=1}^{n} \log\!\big(q\, p_1(x_i) + (1-q)\, p_0(x_i)\big).$$
We establish an upper bound for the performance of the trimmed MLE $\widehat{q}$:

Theorem 6. If $\widehat{q}$ is the trimmed MLE defined above, there exists a constant $C > 0$ such that
$$\sup_{\mathcal{P}_{\theta,\alpha,V_{\min}}} \mathbb{E}[R(\widehat{q})] - R(q) \le C\, n^{-(1+\alpha)/2}.$$

The proof of Theorem 6 is given in Appendix B.

Remark 7. The computation of the MLE is a convex optimization problem, for which efficient methods exist.
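The following minimal sketch is not part of the original paper; it computes the trimmed MLE of this section from unlabeled data for the hypothetical Gaussian example used earlier. As noted in Remark 7, each term of the mixture log-likelihood is the logarithm of an affine function of $q$, so the objective is concave and the one-dimensional problem is convex.

```python
# Trimmed MLE of the prior q from unlabeled mixture data by maximizing the
# log-likelihood over q in [theta, 1 - theta]; p0 = N(0, 1), p1 = N(1, 1) are
# assumed class-conditional densities, used for illustration only.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(3)
q_true, theta, n = 0.3, 0.05, 5000

# Unlabeled sample from the mixture q*p1 + (1-q)*p0.
labels = rng.random(n) < q_true
x = np.where(labels, rng.normal(1.0, 1.0, n), rng.normal(0.0, 1.0, n))

p1x = norm.pdf(x, 1, 1)
p0x = norm.pdf(x, 0, 1)

def neg_log_lik(q):
    # Negative mixture log-likelihood; convex in q on (0, 1).
    return -np.sum(np.log(q * p1x + (1 - q) * p0x))

res = minimize_scalar(neg_log_lik, bounds=(theta, 1 - theta), method="bounded")
print(f"trimmed MLE from unlabeled data: q_hat = {res.x:.4f}  (true q = {q_true})")
```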
Remark 8. Compared to learning detectors based on labeled data, we sacrifice a constant factor in the convergence rate when given unlabeled data. For a given true prior probability $q$, when $V(p_1, p_0)$ is smaller, the constant $C_2$ in Lemma 2 becomes smaller, which slows down the convergence of the excess risk. This phenomenon is discussed in the proof of Theorem 6.

V. FINAL REMARKS

This paper presents a convergence rate analysis for detectors constructed using known class-conditional densities and prior probabilities estimated with the MLE. All of the bounds are dimension-free. The bounds are minimax-optimal given labeled data and achievable with either labeled or unlabeled data. It remains an interesting open question to show that the rate $n^{-(\alpha+1)/2}$ is minimax-optimal given unlabeled data under Assumption (MA) and the extra assumption on $V(p_1, p_0)$, or to establish the same upper bound on convergence rates in the unlabeled case without the extra assumption on $V(p_1, p_0)$ made in Section IV. We show that the constant factors in the convergence rates are mainly influenced by two elements:
1) the value of the true prior probability;
2) in the unlabeled data case, $V(p_1, p_0)$.
A prior probability near zero or one leads to slower convergence whether given labeled or unlabeled data; in the unlabeled data case, a smaller $V(p_1, p_0)$ leads to slower convergence.

Our results are analogous to those of general classification in statistical learning. Intuitively, learning the class-conditional densities is the main challenge in standard passive learning, so it is sensible to say that knowing the class-conditional densities makes the problem relatively easy. The following quantitative comparison supports this. We take the fastest rate shown for standard passive learning under Assumption (MA) in Audibert and Tsybakov [18] and compare it with our result in Table I.

TABLE I
CONVERGENCE RATES COMPARISON UNDER ASSUMPTION (MA)
Passive learning, $p_1, p_0$ unknown: $n^{-\frac{\alpha+1}{2+\rho}}$
Passive learning, $p_1, p_0$ known: $n^{-\frac{\alpha+1}{2}}$

Here $\rho = d/\beta > 0$, where $\beta$ is the Holder exponent of $\eta(x)$. The rate $n^{-\frac{\alpha+1}{2+\rho}}$ is obtained under the additional strong assumption that the marginal distribution of $X$ is bounded from below and above, which is not needed here. The factor $\rho$ reflects the price we pay for not knowing the class-conditional densities, and it is directly related to the complexity of learning the density functions nonparametrically.

VI. ACKNOWLEDGEMENT

The authors thank the reviewers for their helpful comments, especially for raising the question of minimax optimality in the unlabeled data case.

APPENDIX A
PROOF OF THEOREM 3

The proof strategy follows the standard minimax analysis introduced in Tsybakov [19] and consists in reducing the classification problem to a hypothesis testing problem. In this case, it suffices to consider two hypotheses. Here, we must pay extra attention to the design of the hypotheses because we have access to the class-conditional densities, which puts an extra constraint on the hypothesis design. We rephrase a bound from Tsybakov [19]:

Lemma 3. Denote by $\mathcal{P}$ the class of joint distributions represented by triples $(p_1, p_0, q)$, where $(p_1, p_0)$ are class-conditional densities and $q$ is the prior probability. Associated with each element $(p_1, p_0, q) \in \mathcal{P}$ is a probability measure $\pi_{XY}$ defined on $\mathbb{R}^d \times \{0, 1\}$. Let $d(\cdot, \cdot): \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ be a semidistance. Let $(p_1, p_0, q_0), (p_1, p_0, q_1) \in \mathcal{P}$ be such that $d((p_1, p_0, q_0), (p_1, p_0, q_1)) \ge 2a$ with $a > 0$. Assume also that $\mathrm{KL}(\pi_{XY}(p_1, p_0, q_1) \,\|\, \pi_{XY}(p_1, p_0, q_0)) \le \gamma$, where KL denotes the Kullback-Leibler divergence.
Then the following bound holds:
$$\inf_{\widehat{q}} \sup_{\mathcal{P}} P_{\pi_{XY}(p_1, p_0, q)}\big(d((p_1, p_0, \widehat{q}), (p_1, p_0, q)) \ge a\big) \ge \inf_{\widehat{q}} \max_{j \in \{0,1\}} P_{\pi_{XY}(p_1, p_0, q_j)}\big(d((p_1, p_0, \widehat{q}), (p_1, p_0, q_j)) \ge a\big) \ge \max\!\left(\frac{1}{4} \exp(-\gamma),\ \frac{1 - \sqrt{\gamma/2}}{2}\right),$$
where the infimum is taken with respect to the collection of all possible estimators of $q$ (based on a sample from $\pi_{XY}(p_1, p_0, q)$ with known class-conditional densities).

Fig. 2. Two $\eta(x)$ used for the proof of Theorem 3 when $d = 1$.

Denote $\widehat{G}_n := \{x : \widehat{\eta}_n(x; \widehat{q}) \ge 1/2\}$, where $\widehat{\eta}_n(x; \widehat{q})$ is defined in Section III, and denote the optimal decision regions by $G^*_j := \{x : \eta_j(x) \ge 1/2\}$, where the subscript $j$ indicates that the excess risk is being measured with respect to the distribution $\pi_{XY}((p_1, p_0, q_j))$, $j = 0, 1$. Take $\mathcal{P} = \mathcal{P}_{\theta,\alpha}$. We are interested in controlling the excess risk $R_j(\widehat{q}) - R_j(q)$.

To prove the lower bound we will use the following class-conditional densities, which allow us to attain any desired margin parameter $\alpha$ in Assumption (MA) by adjusting the parameter $\kappa$ below:
$$p_1(x) = \begin{cases} \dfrac{(1 + 2c\, x_d^{\kappa-1})(1 - 2t^{\kappa-1})}{1 - 4c\,(t x_d)^{\kappa-1}}, & x \in [0,1]^{d-1} \times [0, t), \\[6pt] 1 + 2c_1 (x_d - t)^{\kappa-1}, & x \in [0,1]^{d-1} \times [t, 1], \\[3pt] 0, & x \in \mathbb{R}^d \setminus [0,1]^d, \end{cases}$$
$$p_0(x) = \begin{cases} 2 - p_1(x), & x \in [0,1]^d, \\ 0, & x \in \mathbb{R}^d \setminus [0,1]^d, \end{cases}$$
where $x = (x_1, \ldots, x_d)$ and $0 < c \ll 1$, $\kappa > 1$ are constants. The quantity $0 < t \ll 1$ is a small real number which goes to zero as $n \to \infty$ and will be determined later. It is easy to verify that in order to make $\int_{\mathbb{R}^d} p_i\, dx = 1$, $i = 0, 1$, hold, the number $c_1$ must be of order $O(t^{\kappa})$ as $t \to 0$, which also goes to zero. We assign the prior probability of $H_1$ under the two hypotheses as
$$q_0 = \frac{1}{2}, \qquad q_1 = \frac{1}{2} + t^{\kappa-1}.$$
Obviously the marginal distribution of $X$ under hypothesis 0, $P^{(0)}_X$, is uniform on $[0,1]^d$, and $P^{(1)}_X$ is approximately uniform on $[0,1]^d$. We can compute the regression functions from
$$\eta_j(x) = \frac{q_j\, p_1(x)}{q_j\, p_1(x) + (1-q_j)\, p_0(x)}$$
and obtain the explicit expressions of $\eta_j(x)$, $j \in \{0, 1\}$:
$$\eta_0(x) = \begin{cases} \dfrac{(1/2 + c\, x_d^{\kappa-1})(1 - 2t^{\kappa-1})}{1 - 4c\,(t x_d)^{\kappa-1}}, & x \in [0,1]^{d-1} \times [0, t), \\[6pt] \dfrac{1}{2} + c_1 (x_d - t)^{\kappa-1}, & x \in [0,1]^{d-1} \times [t, 1], \\[3pt] 0, & x \in \mathbb{R}^d \setminus [0,1]^d, \end{cases}$$
$$\eta_1(x) = \begin{cases} \dfrac{1}{2} + c\, x_d^{\kappa-1}, & x \in [0,1]^{d-1} \times [0, t), \\[6pt] \dfrac{(1 + 2t^{\kappa-1})\left(\tfrac{1}{2} + c_1 (x_d - t)^{\kappa-1}\right)}{1 + 4 t^{\kappa-1} c_1 (x_d - t)^{\kappa-1}}, & x \in [0,1]^{d-1} \times [t, 1], \\[3pt] 0, & x \in \mathbb{R}^d \setminus [0,1]^d. \end{cases}$$
From the above we see that $G^*_0 = [0,1]^{d-1} \times [t, 1]$ and $G^*_1 = [0,1]^d$. Fig. 2 depicts $\eta_j(x)$, $j \in \{0, 1\}$, when $d = 1$.

In order to further analyze the designed hypotheses, we show that the parameter $\alpha$ in Assumption (MA) for $\eta_j(x)$, $j = 0, 1$, is $\alpha = 1/(\kappa - 1)$. Consider the case $j = 0$ (the case $j = 1$ is analogous). On $[0,1]^{d-1} \times [0, t)$ we have
$$\eta_0(x) < \frac{1/2 - (1-c)\, t^{\kappa-1} - 2c\, t^{2\kappa-2}}{1 - 4c\, t^{2\kappa-2}} < \frac{1}{2} - (1-c)\, t^{\kappa-1} = \frac{1}{2} - \tau_*,$$
so, provided $\tau \le \tau_*$, we have
$$P_0\!\left(0 < \left|\eta_0(X) - \tfrac{1}{2}\right| \le \tau\right) = P_0\!\left(0 < x_d - t \le \left(\frac{\tau}{c_1}\right)^{\frac{1}{\kappa-1}}\right) = \left(\frac{\tau}{c_1}\right)^{\frac{1}{\kappa-1}} = C_\eta\, \tau^{\frac{1}{\kappa-1}},$$
where $C_\eta > 1$. The second step follows since $P^{(0)}_X$ is uniform on $[0,1]^d$.

Since the excess risk is not a semidistance, we cannot apply Lemma 3 directly, but we can relate the excess risk to the symmetric difference distance and then use the lemma. First we introduce Proposition 1 in Tsybakov [17], rephrased as follows:
Lemma 4. Assume that $P(0 < |\eta(X) - 1/2| \le \tau) \le C_\eta \tau^{\alpha}$ for some finite $C_\eta > 0$, $\alpha > 0$, and all $0 < \tau \le \tau_*$, where $\tau_* \le 1/2$. Then there exist $c_\alpha > 0$ and $0 < \epsilon_0 \le 1$ such that
$$R_j(\widehat{q}) - R_j(q) \ge c_\alpha\, d_\Delta(\widehat{G}_n, G^*_P)^{1 + 1/\alpha}$$
for all $\widehat{G}_n$ such that $d_\Delta(\widehat{G}_n, G^*_P) \le \epsilon_0 \le 1$, where $c_\alpha = 2 C_\eta^{-1/\alpha} \alpha (\alpha+1)^{-1-1/\alpha}$, $\epsilon_0 = C_\eta (\alpha + 1)\, \tau_*^{\alpha}$, and $d_\Delta(\widehat{G}_n, G^*_P) := \int_{\widehat{G}_n \Delta G^*_P} dx$ is the symmetric difference distance.

When $j = 0$, plugging in $\tau_* = (1-c)\, t^{\kappa-1}$ and using that $c$ is very small, we get
$$\epsilon_0 = C_\eta \left(1 + \frac{1}{\kappa - 1}\right) (1-c)^{\frac{1}{\kappa-1}}\, t \ge \frac{t}{2}.$$
Analogously, when $j = 1$, $\epsilon_0 \ge t/2$ also holds.

We now proceed by applying Lemma 3 to the semidistance $d_\Delta$ and then using Lemma 4 to control the excess risk. Note that $d_\Delta(G^*_0, G^*_1) = t$. Let $P_{0,n} := P^{(0)}_{X_1, \ldots, X_n; Y_1, \ldots, Y_n}$ be the probability measure of the random variables $\{(X_i, Y_i)\}_{i=1}^n$ under hypothesis 0, and define $P_{1,n} := P^{(1)}_{X_1, \ldots, X_n; Y_1, \ldots, Y_n}$ analogously. Consider the KL divergence $\mathrm{KL}(P_{1,n} \| P_{0,n})$:
$$\mathrm{KL}(P_{1,n} \| P_{0,n}) = \mathbb{E}_1\!\left[\log \frac{\prod_{i=1}^n p^{(1)}_{X_i, Y_i}(X_i, Y_i)}{\prod_{i=1}^n p^{(0)}_{X_i, Y_i}(X_i, Y_i)}\right] = \sum_{i=1}^n \mathbb{E}_1\!\left[\log \frac{p^{(1)}_{X_i, Y_i}(X_i, Y_i)}{p^{(0)}_{X_i, Y_i}(X_i, Y_i)}\right] = n\, \mathbb{E}_1\!\left[\log \frac{p^{(1)}_{X,Y}(X, Y)}{p^{(0)}_{X,Y}(X, Y)}\right],$$
where $\mathbb{E}_1\big[\log \frac{p^{(1)}_{X,Y}(X,Y)}{p^{(0)}_{X,Y}(X,Y)}\big]$ can be simplified as
$$\int_{\mathbb{R}^d} q_1 p_1(x) \log \frac{q_1 p_1(x)}{q_0 p_1(x)}\, dx + \int_{\mathbb{R}^d} (1 - q_1) p_0(x) \log \frac{(1 - q_1) p_0(x)}{(1 - q_0) p_0(x)}\, dx = q_1 \log \frac{q_1}{q_0} + (1 - q_1) \log \frac{1 - q_1}{1 - q_0}.$$
The expression in the last line is the KL divergence between two Bernoulli random variables, which can easily be verified to be bounded as in the following lemma:

Lemma 5. Let $P$ and $Q$ be Bernoulli random variables with parameters $1/2 - p$ and $1/2 - q$, respectively, where $|p|, |q| \le 1/4$. Then $\mathrm{KL}(P \| Q) \le 8(p - q)^2$.

Thus we know
$$\mathrm{KL}(P_{1,n} \| P_{0,n}) \le 8n\, (t^{\kappa-1})^2 = 8n\, t^{2\kappa - 2}.$$
Taking $t = n^{-\frac{1}{2\kappa - 2}}$ and $d((p_1, p_0, \widehat{q}), (p_1, p_0, q_j)) := d_\Delta(\widehat{G}_n, G^*_j)$, and using Lemma 3, we know that for $n$ large enough (implying $t$ small),
$$\inf_{\widehat{q}} \max_{j \in \{0,1\}} P_j\big(d((p_1, p_0, \widehat{q}), (p_1, p_0, q_j)) \ge t/2\big) \ge \frac{1}{4} \exp(-8).$$
Since in Lemma 4 we have $\epsilon_0 \ge t/2$, we can apply Lemma 4 to show
$$\inf_{\widehat{q}} \max_{j \in \{0,1\}} P_j\big(R_j(\widehat{q}) - R_j(q) \ge c_\alpha (t/2)^{\kappa}\big) \ge \inf_{\widehat{q}} \max_{j \in \{0,1\}} P_j\big(d((p_1, p_0, \widehat{q}), (p_1, p_0, q_j)) \ge t/2\big) \ge \frac{1}{4} \exp(-8).$$
According to Markov's inequality, we conclude
$$\inf_{\widehat{q}} \sup_{\mathcal{P}_{\theta,\alpha}} \mathbb{E}[R(\widehat{q}) - R(q)] \ge c'\, n^{-\frac{\kappa}{2\kappa - 2}} = c'\, n^{-(1+\alpha)/2},$$
where $\alpha = 1/(\kappa - 1)$ and $c' = \frac{1}{4} e^{-8} c_\alpha \left(\frac{1}{2}\right)^{\frac{\alpha+1}{\alpha}}$.

APPENDIX B
PROOF OF THEOREM 6

We introduce two more quantities measuring distances between probability distributions.

Definition 2. The Hellinger distance between two probability density functions $p, q$ is defined as
$$H(p, q) = \left( \int \left(\sqrt{p} - \sqrt{q}\right)^2 d\nu \right)^{1/2}.$$

Definition 3. The $\chi^2$ divergence between two probability density functions $p, q$ is defined as
$$\chi^2(p, q) = \int_{pq > 0} \frac{p^2}{q}\, d\nu - 1.$$

As shown in Tsybakov [19], we have the following inequalities:
$$V^2(p, q) \le H^2(p, q) \le \chi^2(p, q).$$
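The following minimal numerical sketch is not part of the original paper; it simply checks the chain $V^2 \le H^2 \le \chi^2$ for a hypothetical pair of Gaussian densities by grid integration.

```python
# Grid-integration check of V^2 <= H^2 <= chi^2 for p = N(0, 1), q = N(1, 1).
import numpy as np
from scipy.stats import norm

x = np.linspace(-12.0, 13.0, 200_001)
dx = x[1] - x[0]
p = norm.pdf(x, 0, 1)
q = norm.pdf(x, 1, 1)

V = 1.0 - np.sum(np.minimum(p, q)) * dx             # total variation distance
H2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx    # squared Hellinger distance
chi2 = np.sum(p ** 2 / q) * dx - 1.0                # chi-square divergence

print(f"V^2   = {V ** 2:.4f}")
print(f"H^2   = {H2:.4f}")
print(f"chi^2 = {chi2:.4f}")   # the three values should be increasing
```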
Define $f(x, q) = q\, p_1(x) + (1-q)\, p_0(x)$. We use the Hellinger distance to measure the error of estimating $q$ from the training data:
$$r_2(q, q+h) := H\big(f(\cdot, q), f(\cdot, q+h)\big).$$
We introduce a concentration inequality for the MLE, namely Theorem I.5.3 in Ibragimov and Has'minskii [20], rephrased as follows:

Lemma 6. Let $\mathcal{Q}$ be a bounded interval in $\mathbb{R}$ and let $f(x, q)$ be a continuous function of $q$ on $\mathcal{Q}$ for $\nu$-almost all $x$, where $\nu$ denotes the Lebesgue measure on $\mathbb{R}^d$. Let the following conditions be satisfied:
1) There exists a number $\xi > 1$ such that
$$\sup_{q \in \mathcal{Q}} \sup_{h} |h|^{-\xi}\, r_2^2(q, q+h) = A < \infty.$$
2) To every compact set $K$ there corresponds a positive number $a(K) = a > 0$ such that
$$r_2^2(q, q+h) \ge \frac{a |h|^{\xi}}{1 + |h|^{\xi}}, \qquad q \in K.$$
Then the maximum likelihood estimator $\widehat{q}$ is defined, consistent, and
$$\sup_{q \in K} P_q\big(|\widehat{q} - q| > \epsilon\big) \le B_0\, e^{-b_0 a n \epsilon^{\xi}},$$
where the positive constants $B_0$ and $b_0$ do not depend on $K$, and $n$ is the number of training data.

Taking $\mathcal{Q} = K = [\theta, 1-\theta]$, it suffices to show that the two conditions in Lemma 6 hold with $\xi = 2$; we can then use Lemma 2 to complete the proof.

Proof: 1. $\sup_{q \in \mathcal{Q}} \sup_h |h|^{-2} r_2^2(q, q+h) = A < \infty$: since $f(x, q+h) = f(x, q) + h(p_1 - p_0)$,
$$r_2^2(q, q+h) \le \chi^2\big(f(\cdot, q+h), f(\cdot, q)\big) = \int \left( f(x, q) + \frac{(p_1 - p_0)^2 h^2}{f(x, q)} + 2h(p_1 - p_0) \right) d\nu - 1 = h^2 \int \frac{(p_1 - p_0)^2}{q\, p_1 + (1-q)\, p_0}\, d\nu.$$
On $\{p_1 > p_0\}$ we have $q p_1 + (1-q) p_0 \ge q (p_1 - p_0) \ge \theta (p_1 - p_0)$, and on $\{p_1 < p_0\}$ we have $q p_1 + (1-q) p_0 \ge (1-q)(p_0 - p_1) \ge \theta (p_0 - p_1)$, so
$$r_2^2(q, q+h) \le \frac{h^2}{\theta} \int |p_1 - p_0|\, d\nu = \frac{2 h^2 V(p_1, p_0)}{\theta} \le \frac{2 h^2}{\theta},$$
and condition 1 holds with $\xi = 2$ and $A \le 2/\theta$.

2. For condition 2, write
$$r_2^2(q, q+h) = \int \frac{\big(f(x, q+h) - f(x, q)\big)^2}{\left(\sqrt{f(x, q+h)} + \sqrt{f(x, q)}\right)^2}\, d\nu = h^2 \int \frac{(p_1 - p_0)^2}{\left(\sqrt{f(x, q+h)} + \sqrt{f(x, q)}\right)^2}\, d\nu \ge \frac{h^2}{4} \int \frac{(p_1 - p_0)^2}{p_1 + p_0}\, d\nu \ge \frac{h^2 V^2(p_1, p_0)}{2} \ge \frac{V_{\min}^2}{2} \cdot \frac{|h|^2}{1 + |h|^2},$$
where we used $(\sqrt{u} + \sqrt{v})^2 \le 2(u + v) \le 4(p_1 + p_0)$, the Cauchy-Schwarz inequality $\left(\int |p_1 - p_0|\, d\nu\right)^2 \le \int \frac{(p_1 - p_0)^2}{p_1 + p_0}\, d\nu \int (p_1 + p_0)\, d\nu$, and $|h| \le 1$. Thus condition 2 holds with $\xi = 2$ and a constant $a \ge V_{\min}^2/2$; this is where the assumption $V(p_1, p_0) \ge V_{\min}$ enters. Lemma 6 then gives
$$\sup_{q \in K} P_q\big(|\widehat{q} - q| > \epsilon\big) \le B_0\, e^{-b_0 a n \epsilon^2}.$$
Applying Lemma 2 (combined with Lemma 1, as in the proof of Theorem 4) with $C_1 = B_0$, $C_2 = b_0 a$, and $a_n = n$, we complete the proof of Theorem 6.

Remark 9. In the proof of Theorem 6, we have
$$\frac{1}{\displaystyle\int \frac{(p_1 - p_0)^2}{q\, p_1 + (1-q)\, p_0}\, d\nu} \ \ge\ \frac{q(1-q)}{V(p_1, p_0)},$$
where the left-hand side is the reciprocal of the Fisher information for $q$ given unlabeled data, and the right-hand side is the reciprocal of the Fisher information given labeled data, divided by $V(p_1, p_0)$. The inequality holds with equality when $p_1$ and $p_0$ do not overlap at all. Since the minimum variance of an unbiased estimator is given by the reciprocal of the Fisher information, this inequality shows that the convergence of $\widehat{q}$ to $q$ in the unlabeled case can never be faster than in the labeled case, and will be slower if $V(p_1, p_0)$ is small.
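A minimal numerical sketch (not part of the original paper) of the comparison in Remark 9, for the same hypothetical Gaussian pair used in the earlier sketches:

```python
# Numerical comparison of the two sides of the inequality in Remark 9 for
# p0 = N(0, 1), p1 = N(1, 1) and q = 0.3, using grid integration.
import numpy as np
from scipy.stats import norm

q = 0.3
x = np.linspace(-12.0, 13.0, 200_001)
dx = x[1] - x[0]
p1 = norm.pdf(x, 1, 1)
p0 = norm.pdf(x, 0, 1)

I_unlabeled = np.sum((p1 - p0) ** 2 / (q * p1 + (1 - q) * p0)) * dx   # Fisher info for q, unlabeled
V = 1.0 - np.sum(np.minimum(p1, p0)) * dx                              # total variation distance

print(f"1 / I_unlabeled = {1.0 / I_unlabeled:.4f}")
print(f"q(1-q) / V      = {q * (1 - q) / V:.4f}")   # should not exceed 1 / I_unlabeled
print(f"q(1-q)          = {q * (1 - q):.4f}")       # labeled-data Cramer-Rao bound
```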
REFERENCES

[1] L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer, 1996.
[2] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[3] M. Aizerman, E. Braverman, and L. Rozonoer, Method of Potential Functions in the Theory of Learning Machines. Moscow: Nauka (in Russian), 1970.
[4] V. Vapnik and A. Chervonenkis, Theory of Pattern Recognition. Moscow: Nauka (in Russian), 1974.
[5] V. Vapnik, Estimation of Dependencies Based on Empirical Data. New York: Springer, 1998.
[6] L. Breiman, J. Friedman, C. J. Stone, and R. Olshen, Classification and Regression Trees. Belmont, CA: Wadsworth, 1984.
[7] M. Anthony and P. Bartlett, Neural Network Learning: Theoretical Foundations. Cambridge Univ. Press, 1999.
[8] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines. Cambridge Univ. Press, 2000.
[9] B. Scholkopf and A. Smola, Learning with Kernels. MIT Press, 2002.
[10] E. Mammen and A. Tsybakov, "Smooth discrimination analysis," Annals of Statistics, vol. 27, pp. 1808-1829, 1999.
[11] V. Koltchinskii, "Local Rademacher complexities and oracle inequalities in risk minimization," Annals of Statistics, vol. 34, no. 6, pp. 2593-2656, 2006.
[12] I. Steinwart and C. Scovel, "Fast rates for support vector machines using Gaussian kernels," Annals of Statistics, vol. 35, pp. 575-607, 2007.
[13] A. Tsybakov and S. van de Geer, "Square root penalty: adaptation to the margin in classification and in edge estimation," Annals of Statistics, vol. 33, pp. 1203-1224, 2005.
[14] P. Massart, "Some applications of concentration inequalities to statistics," Ann. Fac. Sci. Toulouse Math., vol. 9, pp. 245-303, 2000.
[15] O. Catoni, "Randomized estimators and empirical complexity for pattern recognition and least square regression," 2001. [Online]. Available: http://www.proba.jussieu.fr/
[16] M. Horvath and G. Lugosi, "Scale-sensitive dimensions and skeleton estimates for classification," Discrete Applied Mathematics, vol. 86, pp. 37-61, 1998.
[17] A. Tsybakov, "Optimal aggregation of classifiers in statistical learning," Annals of Statistics, vol. 32, no. 1, pp. 135-166, 2004.
[18] J.-Y. Audibert and A. Tsybakov, "Fast learning rates for plug-in classifiers," Annals of Statistics, vol. 35, no. 2, pp. 608-633, 2007.
[19] A. Tsybakov, Introduction to Nonparametric Estimation. New York: Springer, 2008.
[20] I. A. Ibragimov and R. Z. Has'minskii, Statistical Estimation: Asymptotic Theory. New York: Springer-Verlag, 1981.