A Gentle Introduction to the Kernel Distance
This document reviews the definition of the kernel distance, providing a gentle introduction tailored to a reader with background in theoretical computer science, but limited exposure to technology more common to machine learning, functional analysis and geometric measure theory. The key aspect of the kernel distance developed here is its interpretation as an L_2 distance between probability measures or various shapes (e.g. point sets, curves, surfaces) embedded in a vector space (specifically an RKHS). This structure enables several elegant and efficient solutions to data analysis problems. We conclude with a glimpse into the mathematical underpinnings of this measure, highlighting its recent independent evolution in two separate fields.
💡 Research Summary
The paper provides a gentle yet thorough introduction to the kernel distance, targeting readers with a theoretical computer‑science background but limited exposure to machine‑learning, functional‑analysis, and geometric‑measure‑theory concepts. It begins by defining a similarity kernel K : ℝᵈ × ℝᵈ → ℝ, assuming K(x,x)=1 and that K decreases as the Euclidean distance between x and y grows; the Gaussian kernel K(x,y)=exp(−‖x−y‖²/σ²) serves as the canonical example.
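The canonical Gaussian kernel can be sketched in a few lines of Python; the function name and pure-stdlib style are illustrative choices, not from the paper:

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian similarity K(x, y) = exp(-||x - y||^2 / sigma^2).

    K(x, x) = 1, and the value decays toward 0 as x and y move apart,
    matching the two assumptions placed on a similarity kernel.
    """
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / sigma ** 2)
```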
The kernel distance between two point sets P and Q is introduced as the squared quantity
D_K²(P,Q) = ∑_{p∈P} ∑_{p′∈P} K(p,p′) + ∑_{q∈Q} ∑_{q′∈Q} K(q,q′) − 2 ∑_{p∈P} ∑_{q∈Q} K(p,q).
This expression mirrors the classic set‑theoretic identity |AΔB|=|A|+|B|−2|A∩B|, replacing cardinalities with self‑similarities and intersections with cross‑similarities. For singleton sets the formula reduces to D_K²({p},{q})=2(1−K(p,q)), showing that 1−K behaves like a squared Euclidean distance when K(p,p)=1.
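The definition translates directly into code. A minimal sketch, assuming the Gaussian kernel from above as the default similarity (the helper names `kappa` and `kernel_distance_sq` are illustrative):

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / sigma^2), so K(x, x) = 1.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / sigma ** 2)

def kappa(A, B, K=gaussian_kernel):
    # Total cross-similarity: sum of K(a, b) over all pairs (a, b).
    return sum(K(a, b) for a in A for b in B)

def kernel_distance_sq(P, Q, K=gaussian_kernel):
    # D_K^2(P, Q) = kappa(P, P) + kappa(Q, Q) - 2 * kappa(P, Q),
    # mirroring |A delta B| = |A| + |B| - 2 |A intersect B|.
    return kappa(P, P, K) + kappa(Q, Q, K) - 2.0 * kappa(P, Q, K)
```

For singleton sets this returns 2(1 − K(p,q)), and it vanishes when the two sets coincide, exactly as the identities in the text predict.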
A central theme is the reinterpretation of D_K as an L₂ distance between embedded representations. When K is a positive‑definite (PD) kernel, Mercer’s theorem guarantees the existence of a reproducing kernel Hilbert space (RKHS) H and a feature map Φ such that K(x,y)=⟨Φ(x),Φ(y)⟩_H. Substituting this into the definition yields
D_K(P,Q) = ‖∑_{p∈P} Φ(p) − ∑_{q∈Q} Φ(q)‖_H,
so the kernel distance is exactly the Euclidean norm of the difference of the summed feature vectors. Consequently, D_K is a pseudometric; under mild additional conditions on K it becomes a true metric (i.e., D_K(P,Q)=0 ⇒ P=Q).
The paper extends the basic definition in three directions. First, weighted point sets are handled by attaching scalar weights w(p) and w′(q) to points, leading to a weighted cross‑similarity κ(P,Q) = ∑_{p∈P} ∑_{q∈Q} w(p) K(p,q) w′(q). Second, the authors replace discrete sums with integrals to compare arbitrary probability measures μ and ν: κ(μ,ν) = ∬ K(p,q) dμ(p) dν(q). This recovers the Maximum Mean Discrepancy (MMD) and shows that the kernel distance metrizes the space of probability distributions. Third, they adapt the construction to geometric objects such as C¹ curves and orientable surfaces. By incorporating tangent (or normal) vectors, the pointwise similarity becomes K(p,q)⟨t_P(p),t_Q(q)⟩ (or K(p,q)⟨n_P(p),n_Q(q)⟩), yielding the so‑called “current distance” used in shape analysis.
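The weighted and measure-valued extensions can be sketched by treating finite samples as uniform discrete measures; the uniform weighting and the helper names `weighted_kappa` and `mmd_sq` below are illustrative assumptions, not the paper's notation:

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / sigma ** 2)

def weighted_kappa(P, wP, Q, wQ, K=gaussian_kernel):
    # kappa(P, Q) = sum over p, q of w(p) * K(p, q) * w'(q)
    return sum(wp * K(p, q) * wq
               for p, wp in zip(P, wP)
               for q, wq in zip(Q, wQ))

def mmd_sq(P, Q, K=gaussian_kernel):
    # Biased empirical MMD^2: each sample set is viewed as a
    # uniform discrete probability measure over its points.
    wP = [1.0 / len(P)] * len(P)
    wQ = [1.0 / len(Q)] * len(Q)
    return (weighted_kappa(P, wP, P, wP, K)
            + weighted_kappa(Q, wQ, Q, wQ, K)
            - 2.0 * weighted_kappa(P, wP, Q, wQ, K))
```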
The authors discuss why not every similarity function yields a metric. A counterexample using a box kernel shows that non‑PD kernels can produce negative squared distances. They then present the sufficient condition: K must be PD. For matrix‑valued kernels K(x,y) = xᵀAy, positive definiteness of A ensures K is a valid inner product, realized via the linear map Φ(x) = Bx, where A = BᵀB. For general kernels, Mercer’s theorem provides an eigenfunction expansion K(x,y) = ∑_i λ_i v_i(x) v_i(y), and the feature map is Φ(x) = (√λ_i v_i(x))_i.
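A tiny numerical check of the matrix‑kernel case: when A = BᵀB, the kernel xᵀAy coincides with the inner product ⟨Bx, By⟩ of the mapped points Φ(x) = Bx. The 2×2 matrix B and the test vectors here are arbitrary illustrative choices:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

# Arbitrary matrix B; A = B^T B is then positive semidefinite.
B = [[2.0, 0.0],
     [1.0, 1.0]]
A = [[sum(B[r][i] * B[r][j] for r in range(2)) for j in range(2)]
     for i in range(2)]

x = [1.0, 2.0]
y = [3.0, -1.0]
k_xy = dot(x, matvec(A, y))                  # x^T A y
phi_inner = dot(matvec(B, x), matvec(B, y))  # <Bx, By> with Phi(x) = Bx
# Both quantities agree (here both equal 18.0).
```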
From a computational standpoint, the embedding into H enables dramatic speedups. Although H may be infinite‑dimensional, for any finite sample the effective dimensionality can be reduced via random Fourier features, Nyström approximations, or other approximate feature‑map techniques. This reduces the naïve O(n²) pairwise computation to O(n·ρ), where ρ is often logarithmic in n or even constant.
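One such technique, random Fourier features for the Gaussian kernel, can be sketched as follows. The sampling scheme is the standard one from Bochner's theorem, adapted to the parameterization K(x,y) = exp(−‖x−y‖²/σ²) used throughout; all function names are illustrative:

```python
import math
import random

random.seed(0)  # reproducible sketch

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sample_rff(d, D, sigma=1.0):
    """Draw D random Fourier features in dimension d for
    K(x, y) = exp(-||x - y||^2 / sigma^2): frequencies from
    N(0, (2 / sigma^2) I), phases uniform on [0, 2*pi]."""
    omegas = [[random.gauss(0.0, math.sqrt(2.0) / sigma) for _ in range(d)]
              for _ in range(D)]
    biases = [random.uniform(0.0, 2.0 * math.pi) for _ in range(D)]
    return omegas, biases

def rff_map(x, omegas, biases):
    # Explicit low-dimensional feature map: z(x) . z(y) ~= K(x, y).
    D = len(omegas)
    scale = math.sqrt(2.0 / D)
    return [scale * math.cos(dot(w, x) + b) for w, b in zip(omegas, biases)]
```

Summing `rff_map` over a point set gives a D-dimensional proxy for ∑_{p∈P} Φ(p), so the kernel distance reduces to an ordinary Euclidean distance between two D-dimensional vectors.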
Finally, the paper situates the kernel distance within the broader framework of Integral Probability Metrics (IPM). By selecting the function class F = {f : ‖f‖_H ≤ 1}, the IPM distance d_F(P,Q) = sup_{f∈F} |∫f dP − ∫f dQ| coincides with D_K. This unifies two historically independent motivations: (1) metrizing probability distributions so that convergence in D_K implies weak convergence, and (2) providing a correspondence‑free metric for shapes that respects geometric structure.
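A standard one‑line derivation sketches why this unit‑ball IPM collapses to the kernel distance, using the reproducing property f(x) = ⟨f, Φ(x)⟩_H, the mean embedding μ_P = ∫Φ(x) dP(x), and Cauchy–Schwarz:

```latex
d_{\mathcal{F}}(P,Q)
  = \sup_{\|f\|_{\mathcal{H}} \le 1} \left| \int f \, dP - \int f \, dQ \right|
  = \sup_{\|f\|_{\mathcal{H}} \le 1} \left| \langle f,\, \mu_P - \mu_Q \rangle_{\mathcal{H}} \right|
  = \| \mu_P - \mu_Q \|_{\mathcal{H}}
  = D_K(P,Q),
```

with the supremum attained at f = (μ_P − μ_Q)/‖μ_P − μ_Q‖_H.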
In summary, the article systematically builds the kernel distance from elementary similarity notions, establishes its rigorous mathematical foundations via positive‑definite kernels and RKHS theory, generalizes it to weighted measures, curves, and surfaces, and highlights both theoretical elegance and practical algorithmic benefits. It serves as a concise bridge for theoretical computer scientists to enter the rich landscape of kernel‑based geometric and statistical analysis.