A Lower Bound on Arbitrary $f$--Divergences in Terms of the Total Variation
Jochen Bröcker
Max-Planck-Institut für Physik komplexer Systeme
Nöthnitzer Strasse 34, 01187 Dresden, Germany
email: broecker@pks.mpg.de

November 10, 2018

Abstract. An important tool to quantify the likeness of two probability measures is the class of $f$-divergences, which has seen widespread application in statistics and information theory. An example is the total variation, which plays an exceptional role among the $f$-divergences. It is shown that every $f$-divergence is bounded from below by a function of the total variation. Under appropriate regularity conditions, this function is shown to be strictly monotonous.

Remark: The proof of the main proposition is relatively easy, whence it is highly likely that the result is known. The author would be very grateful for any information regarding references or related work.

1 The total variation

Let $(\Omega, \sigma)$ be a measurable space. A signed measure $\nu$ is a $\sigma$-additive set function with values in $\mathbb{R} \cup \{-\infty, \infty\}$, so that either $\nu > -\infty$ or $\nu < \infty$ (i.e., at most one of the two infinite values is attained). I will use the standard term measure if $\nu$ is nonnegative. To any signed measure $\nu$ there corresponds a Hahn–Jordan decomposition of $\Omega$ into two measurable sets $P, N$ with $P \cup N = \Omega$ and $P \cap N = \emptyset$, so that

$$\nu^{+}(\,\cdot\,) = \nu(\,\cdot\, \cap P), \qquad \nu^{-}(\,\cdot\,) = -\nu(\,\cdot\, \cap N) \tag{1}$$

are both (nonnegative) measures. Obviously, $\nu = \nu^{+} - \nu^{-}$. Furthermore, the representation

$$\nu^{+}(A) = \sup_{B \subset A} \nu(B), \qquad \nu^{-}(A) = -\inf_{B \subset A} \nu(B) \tag{2}$$

holds for every measurable set $A$. For a proof of these facts, see [2]. The measure $\langle \nu \rangle = \nu^{+} + \nu^{-}$ is called the variation measure of $\nu$, which in turn defines the total variation $\|\nu\| = \langle \nu \rangle(\Omega)$. If $\nu(\Omega) = 0$, it follows easily from the previous statements that

$$\langle \nu \rangle(\Omega) = 2 \sup_{B \in \sigma} |\nu(B)|. \tag{3}$$

A probability measure is a measure $\mu$ with $\mu(\Omega) = 1$. For any two probability measures $\mu, \nu$, the difference $\mu - \nu$ is a signed measure, and Equation (3) applies. Hence

$$\|\mu - \nu\| = \langle \mu - \nu \rangle(\Omega) = 2 \sup_{B \in \sigma} |\mu(B) - \nu(B)|. \tag{4}$$

Obviously, $\|\mu - \nu\|$ is a metric for probability measures, namely the total variation metric, with Equation (4) providing two possible representations. If $\mu$ is absolutely continuous with respect to $\nu$, there is a third representation, namely

$$\|\mu - \nu\| = \int \left| \frac{d\mu}{d\nu} - 1 \right| d\nu. \tag{5}$$

To see this, note that the sets $P = \{\frac{d\mu}{d\nu} \geq 1\}$ and $N = \{\frac{d\mu}{d\nu} < 1\}$ provide a Hahn–Jordan decomposition for $\mu - \nu$, so that $\langle \mu - \nu \rangle$ has density $|\frac{d\mu}{d\nu} - 1|$ with respect to $\nu$.

2 The $f$-divergences

Equation (5) can be read as follows:

$$\|\mu - \nu\| = \int f\!\left(\frac{d\mu}{d\nu}\right) d\nu, \tag{6}$$

with $f(x) = |x - 1|$. This approach can be generalised by using other functions $f$. Let $f$ be a convex function on $\mathbb{R}_{\geq 0}$ that vanishes at $x = 1$, and let $\mu, \nu$ be two probability measures with $\mu$ absolutely continuous with respect to $\nu$ (written $\mu \ll \nu$). The $f$-divergence between $\mu$ and $\nu$ is given by

$$D_f(\mu, \nu) = \int f\!\left(\frac{d\mu}{d\nu}\right) d\nu. \tag{7}$$

If $\mu = \nu$, we have $\frac{d\mu}{d\nu} = 1$, so $D_f(\mu, \nu)$ vanishes in this case. Furthermore, $D_f(\mu, \nu)$ is nonnegative: by Jensen's inequality,

$$0 = f(1) = f\!\left(\int \frac{d\mu}{d\nu}\, d\nu\right) \leq \int f\!\left(\frac{d\mu}{d\nu}\right) d\nu = D_f(\mu, \nu).$$

Note though that $D_f(\mu, \nu)$ may be infinite. Furthermore, $D_f(\mu, \nu)$ may vanish even if $\mu \neq \nu$. To exclude this, further conditions on $f$ have to be imposed, for example as in the following lemma.

2.1. Lemma. Suppose there is an $a \in \mathbb{R}$ so that the function $g(x) := f(x) - a(x - 1)$ is nonnegative and vanishes only if $x = 1$. Then $D_f(\mu, \nu)$ vanishes only if $\mu = \nu$.

Proof. The function $g$ is convex as well, and $D_f(\mu, \nu) = D_g(\mu, \nu)$, since $\int (\frac{d\mu}{d\nu} - 1)\, d\nu = 0$. But since $g$ is nonnegative,

$$D_g(\mu, \nu) = \int g\!\left(\frac{d\mu}{d\nu}\right) d\nu$$

can only vanish if $g(\frac{d\mu}{d\nu})$ vanishes identically, which implies that $\frac{d\mu}{d\nu} = 1$ $\nu$-almost surely. But this means $\mu = \nu$.
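On a finite sample space, the Radon–Nikodym derivative $\frac{d\mu}{d\nu}$ is just the componentwise ratio of the probability vectors, so the definitions above are easy to check numerically. The following minimal Python sketch (not part of the original note; the example distributions are hypothetical) computes $D_f$ via Equation (7) and verifies that representations (4) and (5) of the total variation agree.

```python
import numpy as np
from itertools import chain, combinations

# Two probability vectors on a finite sample space (hypothetical example data).
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.25, 0.25, 0.5])  # nu > 0 everywhere, so mu << nu

def f_divergence(f, mu, nu):
    """D_f(mu, nu) = sum_i f(mu_i / nu_i) * nu_i  -- Equation (7), discrete case."""
    return float(np.sum(f(mu / nu) * nu))

# Equation (5): total variation as the f-divergence with f(x) = |x - 1|.
tv_integral = f_divergence(lambda x: np.abs(x - 1.0), mu, nu)

# Equation (4): total variation as 2 * sup_B |mu(B) - nu(B)| over all events B.
events = chain.from_iterable(
    combinations(range(len(mu)), r) for r in range(len(mu) + 1)
)
tv_sup = 2.0 * max(abs(mu[list(B)].sum() - nu[list(B)].sum()) for B in events)

print(tv_integral, tv_sup)  # both equal ||mu - nu||
```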
The concept of $f$-divergences was introduced by Csiszár [1], who also noted the result in Lemma 2.1. Common choices for $f$ are:

- $(\sqrt{x} - 1)^2$: Hellinger divergence (HE)
- $|x - 1|$: total variation divergence (TV)
- $x \log(x)$: Kullback–Leibler divergence (KL)
- $(x - 1)^2$: Pearson divergence (PE)

The transformation $f^{*}(x) = x f(1/x)$ yields a divergence $D_{f^{*}}$ which is equal to $D_f$ but with interchanged arguments. Applying this transformation to the Kullback–Leibler divergence, for example, we get a divergence which is also sometimes referred to as the Kullback–Leibler divergence, or alternatively as the Shannon divergence (SH).

The total variation divergence plays a central role, since all $f$-divergences allow for an estimate against TV, as will be shown in the following proposition, which forms the main result of this short note.

2.2. Proposition. For two probability measures $\mu, \nu$, it holds in general that

$$f\!\left(1 + \tfrac{1}{2} TV(\mu, \nu)\right) + f\!\left(1 - \tfrac{1}{2} TV(\mu, \nu)\right) \leq D_f(\mu, \nu).$$

Proof. The proof of this fact is a generalisation of the method used in [3] to prove the special case of the KL divergence. Since $f(1) = 0$, we have the general property that

$$f(x) = f(\max\{x, 1\}) + f(\min\{x, 1\}).$$

Using this fact and the convexity of $f$ (applying Jensen's inequality to each term), we get the general estimate

$$D_f(\mu, \nu) = \int f\!\left(\frac{d\mu}{d\nu}\right) d\nu = \int f\!\left(\max\!\left\{\frac{d\mu}{d\nu}, 1\right\}\right) d\nu + \int f\!\left(\min\!\left\{\frac{d\mu}{d\nu}, 1\right\}\right) d\nu \geq f\!\left(\int \max\!\left\{\frac{d\mu}{d\nu}, 1\right\} d\nu\right) + f\!\left(\int \min\!\left\{\frac{d\mu}{d\nu}, 1\right\} d\nu\right).$$

Now use that

$$\max\{x, 1\} = \frac{1 + x + |1 - x|}{2}, \qquad \min\{x, 1\} = \frac{1 + x - |1 - x|}{2};$$

together with Equation (5), this gives $\int \max\{\frac{d\mu}{d\nu}, 1\}\, d\nu = 1 + \frac{1}{2} TV(\mu, \nu)$ and $\int \min\{\frac{d\mu}{d\nu}, 1\}\, d\nu = 1 - \frac{1}{2} TV(\mu, \nu)$, which completes the proof.

Recalling that always $TV \leq 2$, the proposition raises the question as to when the function $f(1 + x) + f(1 - x)$ is monotonous on $x \in [0, 1]$. The following lemma partially answers this.

2.3. Lemma. Under the conditions of Lemma 2.1, the function $\varphi(x) = f(1 + x) + f(1 - x)$ is strictly monotonous on $x \in [0, 1]$.

Proof. The conditions imply that $\varphi(0) = 0$, that $\varphi(x) > 0$ for $x > 0$ (note that $\varphi(x) = g(1 + x) + g(1 - x)$, since the affine parts cancel), and that $\varphi$ is convex. Let $0 \leq x_1 < x_2 \leq 1$ and set $\tau = x_1 / x_2 \in [0, 1[$. By convexity,

$$\varphi(x_1) = \varphi\big((1 - \tau) \cdot 0 + \tau x_2\big) \leq (1 - \tau)\varphi(0) + \tau \varphi(x_2) = \tau \varphi(x_2) < \varphi(x_2),$$

where the last inequality uses $\varphi(x_2) > 0$ and $\tau < 1$. This gives the result.

As a corollary of Proposition 2.2, we get the following well-known estimates between TV and KL.

2.4. Corollary (Bretagnolle–Huber and Furstemberg inequality).

$$TV(\mu, \nu) \leq 2\sqrt{1 - \exp(-SH(\mu, \nu))} \leq 2\sqrt{SH(\mu, \nu)}.$$

Recall that $SH(\mu, \nu) = KL(\nu, \mu)$; the second inequality follows from $1 - \exp(-t) \leq t$.

A further useful estimate concerns the Hellinger divergence.

2.5. Corollary. For the Hellinger divergence HE, the estimate

$$TV \leq \begin{cases} 2 - 2\left(1 - \sqrt{HE}\right)^{2} & \text{if } HE < 1, \\ 2 & \text{otherwise} \end{cases} \tag{8}$$

holds.

Proof. Proposition 2.2 gives the inequality

$$HE \geq \left(\sqrt{1 + \tfrac{1}{2} TV} - 1\right)^{2} + \left(\sqrt{1 - \tfrac{1}{2} TV} - 1\right)^{2}. \tag{9}$$

The right-hand side of Equation (9) is at least as large as its second term, whence

$$HE \geq \left(\sqrt{1 - \tfrac{1}{2} TV} - 1\right)^{2},$$

i.e. $\sqrt{1 - \frac{1}{2} TV} \geq 1 - \sqrt{HE}$. If $HE < 1$, squaring and solving for $TV$ yields the result; if $HE \geq 1$, the bound $TV \leq 2$ holds trivially.
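As a sanity check of Proposition 2.2 (again a sketch of my own with hypothetical example distributions, not code from the note), one can evaluate the lower bound $f(1 + \frac{1}{2}TV) + f(1 - \frac{1}{2}TV)$ against $D_f$ for each of the common choices of $f$ listed above, and also confirm the Hellinger estimate of Corollary 2.5:

```python
import numpy as np

mu = np.array([0.5, 0.3, 0.2])   # hypothetical example distributions
nu = np.array([0.25, 0.25, 0.5])

def D(f):
    """D_f(mu, nu) per Equation (7), discrete case."""
    return float(np.sum(f(mu / nu) * nu))

divergences = {
    "HE": lambda x: (np.sqrt(x) - 1.0) ** 2,   # Hellinger
    "TV": lambda x: np.abs(x - 1.0),           # total variation
    "KL": lambda x: x * np.log(x),             # Kullback-Leibler
    "PE": lambda x: (x - 1.0) ** 2,            # Pearson
}

tv = D(divergences["TV"])  # TV(mu, nu), here normalised to lie in [0, 2]

# Proposition 2.2: f(1 + TV/2) + f(1 - TV/2) <= D_f(mu, nu).
for name, f in divergences.items():
    lower = f(1.0 + tv / 2.0) + f(1.0 - tv / 2.0)
    assert lower <= D(f) + 1e-12, name
    print(f"{name}: bound {lower:.4f} <= divergence {D(f):.4f}")

# Corollary 2.5: TV <= 2 - 2 * (1 - sqrt(HE))**2 whenever HE < 1.
he = D(divergences["HE"])
if he < 1.0:
    assert tv <= 2.0 - 2.0 * (1.0 - np.sqrt(he)) ** 2 + 1e-12
```

Note that for the choice $f(x) = |x - 1|$ the bound holds with equality, so Proposition 2.2 is sharp for the total variation itself.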
John Wiley & Sons, Inc., New Y or k, 19 98. 4