A Lower Bound on Arbitrary $f$--Divergences in Terms of the Total Variation

📝 Original Info

  • Title: A Lower Bound on Arbitrary $f$–Divergences in Terms of the Total Variation
  • ArXiv ID: 0903.1765
  • Date: 2009-03-11
  • Authors: Researchers from original ArXiv paper

📝 Abstract

An important tool for quantifying the likeness of two probability measures is the class of f-divergences, which has seen widespread application in statistics and information theory. An example is the total variation, which plays an exceptional role among the f-divergences. It is shown that every f-divergence is bounded from below by a function of the total variation. Under appropriate regularity conditions, this function is shown to be monotonic. Remark: The proof of the main proposition is relatively easy, so it is highly likely that the result is known. The author would be very grateful for any information regarding references or related work.

📄 Full Content

1 The total variation

Let (Ω, σ) be a measurable space. A signed measure ν is a σ-additive set function with values in R ∪ {−∞, ∞} such that either ν > −∞ or ν < ∞. I will use the standard term measure if ν is nonnegative. To any signed measure ν, there corresponds a Hahn-Jordan decomposition of Ω into two measurable sets P, N so that P ∪ N = Ω, P ∩ N = ∅ and

$$\nu^+(A) = \nu(A \cap P), \qquad \nu^-(A) = -\nu(A \cap N) \tag{1}$$

are both (nonnegative) measures. Obviously, $\nu = \nu^+ - \nu^-$. Furthermore, the representation

holds for every measurable set A. For a proof of these facts see [2]. The measure $|\nu| = \nu^+ + \nu^-$ is called the variation measure of ν, which in turn defines the total variation $\|\nu\| = |\nu|(\Omega)$. If ν(Ω) = 0, it follows easily from the previous statements that

A probability measure is a measure µ so that µ(Ω) = 1. For any two probability measures µ, ν, the difference µ − ν is a signed measure with (µ − ν)(Ω) = 0, so Equation (3) applies. Hence,

Obviously, $\|\mu - \nu\|$ is a metric for probability measures, namely the total variation metric, with Equation (4) providing two possible representations. If µ is absolutely continuous with respect to ν, then there is a third representation, namely

$$\|\mu - \nu\| = \int \left| \frac{d\mu}{d\nu} - 1 \right| d\nu. \tag{5}$$

Proof of this fact

2 The f-divergences

Equation (5) can be read as follows:

$$\|\mu - \nu\| = \int f\left(\frac{d\mu}{d\nu}\right) d\nu \tag{6}$$

with f(x) = |x − 1|. There is a way to generalise this approach by using other forms of f. Let f be a convex function on R≥0 that vanishes at x = 1. Let µ, ν be two probability measures with µ being absolutely continuous with respect to ν (which will be written as µ ≪ ν). The f-divergence between µ and ν is given by

$$D_f(\mu, \nu) = \int f\left(\frac{d\mu}{d\nu}\right) d\nu.$$

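If µ = ν, then $\frac{d\mu}{d\nu} = 1$, so $D_f(\mu, \nu)$ vanishes in this case. Furthermore, $D_f(\mu, \nu)$ is non-negative. Indeed, by Jensen's inequality,

$$D_f(\mu, \nu) = \int f\left(\frac{d\mu}{d\nu}\right) d\nu \ge f\left(\int \frac{d\mu}{d\nu}\, d\nu\right) = f(1) = 0.$$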

Note though that $D_f(\mu, \nu)$ may be infinite. Furthermore, $D_f(\mu, \nu)$ may vanish even if µ ≠ ν. To exclude this, further conditions on f have to be imposed, for example as in the following lemma.

2.1. Lemma. Suppose there is an a ∈ R so that the function

$$g(x) = f(x) + a(x - 1)$$

is non-negative and vanishes only if x = 1. Then $D_f(\mu, \nu)$ vanishes only if µ = ν.

Proof. The function g is convex as well. Furthermore, $D_f(\mu, \nu) = D_g(\mu, \nu)$. But since g is non-negative,

$$D_g(\mu, \nu) = \int g\left(\frac{d\mu}{d\nu}\right) d\nu$$

can only vanish if $g\left(\frac{d\mu}{d\nu}\right)$ vanishes ν-almost surely, which implies that $\frac{d\mu}{d\nu} = 1$ ν-a.s. But this means µ = ν.
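A minimal numerical sketch of Lemma 2.1 for discrete distributions may help here. It rests on assumptions that are not taken from the text: the lemma's function is read as g(x) = f(x) + a(x − 1), the generator f(x) = x log x is used as one common choice for the Kullback-Leibler divergence, and the helper names `f_divergence` and `shifted` are hypothetical.

```python
import math

def f_divergence(f, p, q):
    """Discrete analogue of D_f(p, q): sum_i q_i * f(p_i / q_i), assuming all q_i > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def kl_f(x):
    # Assumed generator f(x) = x log x, with the convention 0 log 0 = 0.
    return x * math.log(x) if x > 0 else 0.0

def shifted(f, a):
    # Assumed form of the lemma's function: g(x) = f(x) + a (x - 1).
    return lambda x: f(x) + a * (x - 1.0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

# With a = -1, g(x) = x log x - x + 1 is non-negative and zero only at x = 1.
g = shifted(kl_f, -1.0)

d_f = f_divergence(kl_f, p, q)
d_g = f_divergence(g, p, q)
print(d_f, d_g)  # the two values agree up to rounding
```

The linear term contributes a · (Σ p_i − Σ q_i) = 0, which is why both divergences coincide while g itself is pointwise non-negative.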

The concept of f-divergences was introduced by Csiszár [1], who also noted the result in Lemma 2.1. Common choices for f are

The transformation $f^*(x) = x f(1/x)$ yields a divergence $D_{f^*}$ which is equal to $D_f$ but with interchanged arguments. Applying this transformation to the Kullback-Leibler divergence, for example, we get a divergence which is also sometimes referred to as the Kullback-Leibler divergence, or alternatively as the Shannon divergence SH. The total variation divergence plays a central role, since all f-divergences allow for an estimate against TV, as will be shown in the following proposition, which forms the main result of this short note.
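The argument swap under the transformation x f(1/x) can be checked numerically on a small discrete example. This is only a sketch: the generator f(x) = x log x for the Kullback-Leibler divergence is an assumption, and `f_divergence` and `conjugate` are hypothetical helper names.

```python
import math

def f_divergence(f, p, q):
    """Discrete analogue of D_f(p, q): sum_i q_i * f(p_i / q_i), assuming all q_i > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def conjugate(f):
    # The transformation from the text: f*(x) = x f(1/x), defined for x > 0.
    return lambda x: x * f(1.0 / x)

def kl_f(x):
    # Assumed generator of the Kullback-Leibler divergence (x > 0 here).
    return x * math.log(x)

# Strictly positive distributions, so that both argument orders are defined.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

swapped_generator = f_divergence(conjugate(kl_f), p, q)  # D_{f*}(p, q)
swapped_arguments = f_divergence(kl_f, q, p)             # D_f(q, p)
print(swapped_generator, swapped_arguments)  # equal up to rounding
```

For this choice of f, the transformed generator simplifies to −log x.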

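2.2. Proposition. For two probability measures µ, ν, it holds in general that

$$D_f(\mu, \nu) \ge f\left(1 + \tfrac{1}{2}\mathrm{TV}\right) + f\left(1 - \tfrac{1}{2}\mathrm{TV}\right),$$

where TV = ‖µ − ν‖ denotes the total variation.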

Proof. The proof of this fact is a generalisation of the method used in [3] to prove the special case of the KL divergence. Since f(1) = 0, we have the general property that f(x) = f(max{x, 1}) + f(min{x, 1}).

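Using this fact and the convexity of f, Jensen's inequality gives the general estimate

$$D_f(\mu, \nu) = \int f\left(\max\left\{\tfrac{d\mu}{d\nu}, 1\right\}\right) d\nu + \int f\left(\min\left\{\tfrac{d\mu}{d\nu}, 1\right\}\right) d\nu \ge f\left(\int \max\left\{\tfrac{d\mu}{d\nu}, 1\right\} d\nu\right) + f\left(\int \min\left\{\tfrac{d\mu}{d\nu}, 1\right\} d\nu\right).$$

It remains to note that $\int \max\left\{\tfrac{d\mu}{d\nu}, 1\right\} d\nu = 1 + \tfrac{1}{2}\|\mu - \nu\|$ and $\int \min\left\{\tfrac{d\mu}{d\nu}, 1\right\} d\nu = 1 - \tfrac{1}{2}\|\mu - \nu\|$ to complete the proof.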
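The estimate can be sanity-checked on random discrete distributions. The sketch below assumes that the proposition reads D_f ≥ f(1 + TV/2) + f(1 − TV/2) with TV = Σ|p_i − q_i|; the three generators and all helper names are illustrative choices, not taken from the paper.

```python
import math
import random

def f_divergence(f, p, q):
    """Discrete analogue of D_f(p, q): sum_i q_i * f(p_i / q_i), assuming all q_i > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def total_variation(p, q):
    # Total variation in the normalisation used here: values lie in [0, 2].
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def lower_bound(f, tv):
    # Assumed form of the bound in Proposition 2.2.
    return f(1.0 + tv / 2.0) + f(1.0 - tv / 2.0)

def kl_f(x):
    return x * math.log(x) if x > 0 else 0.0   # assumed KL generator

def hellinger_f(x):
    return (math.sqrt(x) - 1.0) ** 2            # assumed Hellinger generator

def tv_f(x):
    return abs(x - 1.0)                         # generator of the total variation

def random_distribution(n, rng):
    # Strictly positive weights, normalised to sum to one.
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
violations = 0
for _ in range(1000):
    p = random_distribution(5, rng)
    q = random_distribution(5, rng)
    tv = total_variation(p, q)
    for f in (kl_f, hellinger_f, tv_f):
        # Small tolerance for rounding; tv_f attains the bound with equality.
        if f_divergence(f, p, q) < lower_bound(f, tv) - 1e-12:
            violations += 1

print("violations:", violations)
```

A random search of this kind cannot prove the inequality, but it is a cheap way to catch a wrong sign or normalisation.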

Recalling that always TV ≤ 2, the proposition raises the question as to when the function f(1 + x) + f(1 − x) is monotonic on x ∈ [0, 1]. The following lemma partially answers this.

Proof. The conditions imply that φ(0) = 0, φ(x) > 0 for x > 0, and that φ is convex.
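As a rough illustration, the function f(1 + x) + f(1 − x) can be evaluated on a grid for two example generators. The name `phi` for this function, the generators f(x) = x log x and f(x) = (√x − 1)², and the grid resolution are all assumptions made for this sketch.

```python
import math

def phi(f, x):
    # Assumed definition: phi(x) = f(1 + x) + f(1 - x) for x in [0, 1].
    return f(1.0 + x) + f(1.0 - x)

def kl_f(x):
    return x * math.log(x) if x > 0 else 0.0   # convention 0 log 0 = 0

def hellinger_f(x):
    return (math.sqrt(x) - 1.0) ** 2

def is_nondecreasing(values, tol=1e-12):
    # True if no value drops below its predecessor by more than tol.
    return all(b >= a - tol for a, b in zip(values, values[1:]))

grid = [i / 100 for i in range(101)]
kl_values = [phi(kl_f, x) for x in grid]
he_values = [phi(hellinger_f, x) for x in grid]

print(is_nondecreasing(kl_values), is_nondecreasing(he_values))
```

Both sequences start at φ(0) = 0, matching the first property listed in the proof.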

As a corollary of Proposition 2.2, we get the following well-known estimates between TV and KL.

2.4. Corollary (Bretagnolle-Huber and Furstemberg inequality).

Recall that SH(µ, ν) = KL(ν, µ). A further useful estimate concerns the Hellinger divergence.

2.5. Corollary. For the Hellinger divergence HE, the estimate

Proof. Proposition 2.2 gives the inequality

The right-hand side of Equation (9) is at least $\left(\sqrt{1 - \tfrac{1}{2}\mathrm{TV}} - 1\right)^2$, whence

which, after solving for TV, yields the result.
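One way to see the kind of estimate this yields is the following sketch. It assumes that the Hellinger divergence is generated by f(x) = (√x − 1)² and that the intermediate inequality is HE ≥ (1 − √(1 − TV/2))². The solved form in `tv_upper_bound` is derived from that assumption only and need not coincide with the form stated in the corollary. All function names are hypothetical.

```python
import math
import random

def hellinger_f(x):
    # Assumed generator of the Hellinger divergence.
    return (math.sqrt(x) - 1.0) ** 2

def f_divergence(f, p, q):
    """Discrete analogue of D_f(p, q): sum_i q_i * f(p_i / q_i), assuming all q_i > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def total_variation(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def tv_upper_bound(he):
    # Solve he >= (1 - sqrt(1 - tv/2))^2 for tv.
    # Once sqrt(he) >= 1 only the trivial bound tv <= 2 remains.
    root = max(0.0, 1.0 - math.sqrt(he))
    return 2.0 * (1.0 - root ** 2)

def random_distribution(n, rng):
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(1)
worst_gap = math.inf
for _ in range(1000):
    p = random_distribution(4, rng)
    q = random_distribution(4, rng)
    he = f_divergence(hellinger_f, p, q)
    tv = total_variation(p, q)
    # The gap should never be negative if the assumed bound is right.
    worst_gap = min(worst_gap, tv_upper_bound(he) - tv)

print("smallest observed gap:", worst_gap)
```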

Reference

This content is AI-processed based on ArXiv data.
