Hilbert space embeddings and metrics on probability measures
A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as $\gamma_k$, indexed by the kernel function $k$ that defines the inner product in the RKHS. We present three theoretical properties of $\gamma_k$. First, we consider the question of determining the conditions on the kernel $k$ for which $\gamma_k$ is a metric: such $k$ are denoted {\em characteristic kernels}. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: bounded continuous strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on $\mathbb{R}^d$, then it is characteristic if and only if the support of its Fourier transform is the entire $\mathbb{R}^d$. Second, we show that there exist distinct distributions that are arbitrarily close in $\gamma_k$. Third, to understand the nature of the topology induced by $\gamma_k$, we relate $\gamma_k$ to other popular metrics on probability measures, and present conditions on the kernel $k$ under which $\gamma_k$ metrizes the weak topology.
💡 Research Summary
The paper investigates the embedding of probability measures into a reproducing kernel Hilbert space (RKHS) and the induced distance γₖ, which is the norm of the difference between the mean elements of two measures. The central question is under what conditions on the kernel k the induced pseudometric becomes a true metric—i.e., γₖ(μ,ν)=0 if and only if μ=ν. Kernels satisfying this property are called characteristic kernels.
Main Contributions
Simple Sufficient Conditions for Characteristic Kernels
- The authors prove that any bounded, continuous, strictly positive‑definite kernel is characteristic. Strict positive‑definiteness means that for any finite set of distinct points {x₁,…,xₙ} and coefficients {α₁,…,αₙ} that are not all zero, the quadratic form Σᵢ,ⱼ αᵢαⱼk(xᵢ,xⱼ) is strictly positive. This condition guarantees that the mean embedding map μ↦μₖ is injective.
- For translation‑invariant kernels on ℝᵈ, i.e., k(x,y)=ψ(x−y), the paper shows that γₖ is a metric if and only if the Fourier transform ψ̂ has full support ℝᵈ. By Bochner's theorem, such kernels correspond to non‑negative spectral measures that cover all frequencies. If the spectral support misses any region (e.g., a band‑limited kernel), distinct distributions can share the same embedding, violating the metric property.
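This spectral criterion can be illustrated numerically (a minimal sketch, not from the paper; the function names and discretization choices are mine): the Gaussian profile ψ(x)=exp(−x²/2) has spectrum √(2π)·exp(−ω²/2), strictly positive everywhere, so the kernel is characteristic; the sinc profile ψ(x)=sin(x)/x has spectrum π·1₍₋₁,₁₎(ω), which vanishes outside a band, so the kernel is not.

```python
import numpy as np

def fourier_transform(psi, omega, L=2000.0, n=400_000):
    """Approximate the Fourier transform integral of an even profile psi at
    frequency omega by a midpoint rule on [-L, L]; since psi is even, the
    transform reduces to the real cosine integral."""
    dx = 2 * L / n
    x = np.linspace(-L, L, n, endpoint=False) + dx / 2   # midpoint grid
    return np.sum(psi(x) * np.cos(omega * x)) * dx

gauss = lambda x: np.exp(-x**2 / 2)     # Gaussian profile
sinc = lambda x: np.sinc(x / np.pi)     # sin(x)/x; np.sinc(t) = sin(pi t)/(pi t)

print(fourier_transform(gauss, 3.0))    # ≈ sqrt(2*pi) * exp(-4.5) ≈ 0.028 > 0
print(fourier_transform(sinc, 0.5))     # ≈ pi  (frequency inside the band)
print(fourier_transform(sinc, 2.0))     # ≈ 0   (frequency outside the band)
```

The vanishing spectrum of the sinc kernel at |ω| > 1 is exactly the gap through which two distinct distributions can hide the same embedding.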
Existence of Arbitrarily Close Distinct Distributions
Even when k is characteristic, γₖ does not dominate stronger distances such as total variation. The authors construct sequences of distinct probability measures (μₙ,νₙ) whose γₖ‑distance tends to zero. The construction exploits the fact that characteristic kernels may down‑weight high‑frequency differences; by concentrating the discrepancy in those frequencies, the RKHS norm can be made arbitrarily small while the measures remain different. This demonstrates that γₖ induces a topology weaker than the total variation topology.
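A numerical sketch in the same spirit (my own construction, not the paper's specific one): take μ uniform on [0, 2π] and νₙ with density (1 + sin(nx))/(2π). The total variation distance stays at 1/π for every n, but with a Gaussian kernel the γₖ‑distance shrinks as n grows, because the kernel attenuates the high‑frequency perturbation.

```python
import numpy as np

def discrepancies(n, num=2000):
    """gamma_k(mu, nu_n) and TV(mu, nu_n) for mu = Uniform[0, 2pi] and nu_n with
    density (1 + sin(n x)) / (2 pi), under the Gaussian kernel exp(-(x-y)^2/2).
    gamma_k^2 = double integral of k(x,y) s(x) s(y), where s(x) = -sin(n x)/(2 pi)
    is the density of the signed measure mu - nu_n."""
    dx = 2 * np.pi / num
    x = (np.arange(num) + 0.5) * dx                    # midpoint grid on [0, 2pi]
    s = -np.sin(n * x) / (2 * np.pi)
    K = np.exp(-0.5 * (x[:, None] - x[None, :])**2)
    gamma = np.sqrt(s @ K @ s) * dx
    tv = 0.5 * np.sum(np.abs(s)) * dx                  # equals 1/pi for every n
    return gamma, tv

for n in (1, 4, 8):
    print(n, discrepancies(n))   # gamma shrinks with n; TV stays at 1/pi ≈ 0.318
```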
Topology Induced by γₖ and Weak Convergence
The paper relates γₖ to several classical metrics (e.g., Prohorov, Wasserstein) and asks when γₖ metrizes the weak topology on the space of probability measures. The authors prove that if k is bounded, continuous, integrable, characteristic, and separating (the RKHS separates points of the underlying space), then γₖ indeed generates the weak topology. Consequently, convergence in γₖ is equivalent to convergence in distribution. A prominent example is the Gaussian kernel k(x,y)=exp(−‖x−y‖²/(2σ²)), which satisfies all required properties.
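A concrete illustration of this weak‑convergence behaviour (my example, with a function name of my choosing): for point masses, γₖ has the closed form γₖ(δₓ,δᵧ)² = k(x,x) + k(y,y) − 2k(x,y). Since δ₁⁄ₙ → δ₀ weakly, a metric that metrizes the weak topology must send this distance to zero, and the Gaussian kernel does; by contrast, the total variation distance between δ₁⁄ₙ and δ₀ is 1 for every n.

```python
import numpy as np

def gamma_gauss_diracs(x, y, sigma=1.0):
    """gamma_k between point masses delta_x and delta_y for the Gaussian kernel:
    gamma_k^2 = k(x,x) + k(y,y) - 2 k(x,y) = 2 (1 - exp(-(x-y)^2 / (2 sigma^2)))."""
    return np.sqrt(2.0 * (1.0 - np.exp(-(x - y)**2 / (2 * sigma**2))))

# delta_{1/n} -> delta_0 weakly, and gamma_k reproduces this convergence;
# for small gaps gamma_k(delta_x, delta_y) ≈ |x - y| / sigma.
for n in (1, 10, 100):
    print(n, gamma_gauss_diracs(1.0 / n, 0.0))
```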
Implications for Statistical Learning
- Two‑sample testing (MMD) – Using a characteristic kernel guarantees that the Maximum Mean Discrepancy (MMD) equals zero only under the null hypothesis of identical distributions, eliminating false “zero‑distance” cases.
- Independence testing (HSIC) – HSIC relies on the product of two characteristic kernels; the paper’s results ensure that HSIC is zero only when the variables are truly independent.
- Dimensionality reduction and representation learning – Kernel PCA, kernel CCA, and related methods embed data via mean elements. Characteristic kernels preserve the full information of the underlying distribution, making these embeddings lossless in a statistical sense.
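The two‑sample statistic behind the first bullet can be sketched as follows (an illustrative implementation of the standard unbiased MMD² estimator; the function name and Gaussian‑kernel choice are mine):

```python
import numpy as np

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of MMD^2 = gamma_k(P, Q)^2 from 1-D samples X ~ P and
    Y ~ Q, using the Gaussian kernel k(a, b) = exp(-(a-b)^2 / (2 sigma^2))."""
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :])**2 / (2 * sigma**2))
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    # Drop the diagonal self-similarity terms to keep the estimator unbiased.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(0, 1, 500), rng.normal(0, 1, 500))
diff = mmd2_unbiased(rng.normal(0, 1, 500), rng.normal(2, 1, 500))
print(same, diff)   # same ≈ 0 (can be slightly negative); diff clearly positive
```

Because the Gaussian kernel is characteristic, the population MMD is zero only when the two distributions coincide, which is what justifies using it as a test statistic.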
Technical Approach
The authors start by formalizing the mean embedding μₖ = ∫k(·,x)dμ(x) and defining γₖ(μ,ν)=‖μₖ−νₖ‖ₕₖ. They then employ functional‑analytic tools: Mercer's theorem for compact domains, Bochner's theorem for translation‑invariant kernels, and properties of strictly positive‑definite functions. The proof of the Fourier‑support condition proceeds by showing that if ψ̂ vanishes on a non‑empty open set, one can construct two distinct measures whose characteristic functions agree on the support of ψ̂, leading to identical embeddings. Conversely, full support forces the characteristic functions to be identical, implying μ=ν.
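The Fourier‑support argument can be summarized in one identity (a sketch in the abstract's notation; here $\Lambda$ denotes the non-negative Bochner measure with $\psi(x)=\int e^{i\langle x,\omega\rangle}\,d\Lambda(\omega)$, and $\varphi_\mu,\varphi_\nu$ the characteristic functions of $\mu,\nu$):

```latex
\gamma_k^2(\mu,\nu)
  = \iint k\,d\mu\,d\mu + \iint k\,d\nu\,d\nu - 2\iint k\,d\mu\,d\nu
  \;\overset{\text{Bochner}}{=}\;
  \int_{\mathbb{R}^d} \bigl|\varphi_\mu(\omega)-\varphi_\nu(\omega)\bigr|^2\, d\Lambda(\omega).
```

Hence $\gamma_k(\mu,\nu)=0$ if and only if $\varphi_\mu=\varphi_\nu$ $\Lambda$-almost everywhere; when $\operatorname{supp}\Lambda=\mathbb{R}^d$, continuity of characteristic functions forces $\varphi_\mu\equiv\varphi_\nu$, and therefore $\mu=\nu$.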
For the “arbitrarily close” result, the authors construct a sequence of probability measures that differ only in a high‑frequency component whose weight is attenuated by the kernel’s spectral decay. By scaling the amplitude of this component, the RKHS norm of the difference can be driven to zero while the measures remain distinct in total variation.
Finally, to connect γₖ with the weak topology, the paper leverages the fact that bounded continuous kernels generate a separating class of functions. Under integrability, the mean embedding becomes a continuous linear functional on the space of bounded continuous functions, and the induced metric coincides with the topology of weak convergence.
Future Directions
The authors suggest extending the analysis to non‑Euclidean domains (graphs, manifolds), designing computationally efficient characteristic kernels for large‑scale data, and exploring quantitative relationships between γₖ and optimal‑transport based distances such as the Wasserstein metric.
In summary, the paper provides clear, verifiable criteria for when a kernel yields an injective embedding of probability measures, clarifies the limitations of the induced RKHS distance, and establishes conditions under which this distance faithfully captures weak convergence—insights that are directly applicable to modern kernel‑based statistical methods.