On approximations via convolution-defined mixture models
An often-cited fact regarding mixing or mixture distributions is that their density functions are able to approximate the density function of any unknown distribution to arbitrary degrees of accuracy, provided that the mixing or mixture distribution is sufficiently complex. This fact is often not made concrete. We investigate and review theorems that provide approximation bounds for mixing distributions. Connections between the approximation bounds of mixing distributions and estimation bounds for the maximum likelihood estimator of finite mixtures of location-scale distributions are reviewed.
💡 Research Summary
The paper investigates the long‑standing “folk theorem” that finite mixture models can approximate any probability density arbitrarily well provided the mixture is sufficiently complex. While this claim is frequently quoted in textbooks, rigorous statements and proofs are rarely presented. Nguyen and McLachlan fill this gap by reviewing and extending several approximation theorems, focusing on mixture models defined through convolution with location‑scale kernels.
The authors begin by formalising mixture densities as integrals of a component density against a mixing distribution,
\(f(x)=\int f(x;\theta)\,d\Pi(\theta)\),
and note that finite mixtures arise when the mixing distribution \(\Pi\) is a discrete weighted sum of Dirac masses. They then discuss DasGupta's (2008) theorem, which asserts that the class of marginally independent location‑scale mixtures can approximate any target density in total‑variation distance but is stated without proof in the cited source. By constructing an explicit proof based on approximate identities, the authors provide a more transparent foundation.
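For concreteness, when the mixing distribution places weight \(\pi_i\) on atoms \(\theta_i\) (the weight notation here is illustrative, not taken from the paper), the integral collapses to the familiar finite mixture form:

```latex
\Pi = \sum_{i=1}^{n} \pi_i\, \delta_{\theta_i},
\qquad \pi_i \ge 0,\quad \sum_{i=1}^{n} \pi_i = 1,
\qquad\text{so that}\qquad
f(x) = \int f(x;\theta)\, d\Pi(\theta) = \sum_{i=1}^{n} \pi_i\, f(x;\theta_i).
```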
Key to the analysis is the concept of an approximate identity \(\alpha_k(x)=k^p\alpha(kx)\), where \(\alpha\) is a non‑negative, unit‑integral function. Lemma 4 (Cheney & Light, 2000) shows that if \(\alpha\) belongs to the class \(F_3\) of marginally independent scaled densities, then its dilations form an approximate identity. Theorem 5 (Makarov & Podkorytov, 2013) then guarantees that for any \(f\in L^q(\mathbb{R}^p)\) with \(1\le q<\infty\), the convolution \(f*\alpha_k\) converges to \(f\) in the \(L^q\) norm as \(k\to\infty\).
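A minimal numerical sketch of this convergence in one dimension (an illustration only, not code from the paper; the triangular target and standard normal kernel are arbitrary choices):

```python
# Illustration: L^1 convergence of f * alpha_k to f as k grows, where
# alpha_k(x) = k * alpha(k x) is the dilation of a standard normal kernel.
import numpy as np
from scipy.stats import norm, triang

x = np.linspace(-4, 4, 8001)
dx = x[1] - x[0]
f = triang(c=0.3, loc=-1, scale=2).pdf(x)             # target density on [-1, 1]

for k in [1, 2, 4, 8, 16, 32]:
    alpha_k = k * norm.pdf(k * x)                      # dilated kernel, unit integral
    conv = np.convolve(f, alpha_k, mode="same") * dx   # (f * alpha_k) on the grid
    l1_err = np.sum(np.abs(conv - f)) * dx             # discretised L^1 error
    print(f"k = {k:2d}   L1 error ~ {l1_err:.4f}")
```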
From this general result the authors derive several concrete corollaries:
- Corollary 6 shows that mixing only over the location parameter (keeping the scale fixed) already yields arbitrarily close approximations in any \(L^q\) norm, including total variation (\(q=1\)). This improves upon DasGupta’s theorem by eliminating the need for scale mixing.
- Theorem 8 (Cheney & Light, 2000) provides uniform convergence on compact sets for bounded continuous target densities, establishing a stronger sup‑norm approximation.
- Theorem 9 and Corollary 11 give explicit convergence rates when the target density is Lipschitz. Using a standard normal kernel, the sup‑norm error decays as \(A/k\), i.e., an \(O(1/k)\) rate (see the numerical sketch after this list).
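A rough numerical check of the \(O(1/k)\) rate (again an illustration, not the paper's example: the target is a triangular density, Lipschitz with a kink, smoothed by a standard normal kernel):

```python
# Illustration of the O(1/k) sup-norm rate: the error at the kink of a
# Lipschitz (but non-differentiable) target scales like 1/k, so k * error
# should stabilise as k grows.
import numpy as np
from scipy.stats import norm, triang

x = np.linspace(-4, 4, 16001)
dx = x[1] - x[0]
f = triang(c=0.5, loc=-1, scale=2).pdf(x)              # Lipschitz target, kink at 0

for k in [4, 8, 16, 32, 64]:
    alpha_k = k * norm.pdf(k * x)                      # dilated standard normal kernel
    conv = np.convolve(f, alpha_k, mode="same") * dx
    sup_err = np.max(np.abs(conv - f))
    print(f"k = {k:2d}   sup error ~ {sup_err:.5f}   k * sup error ~ {k * sup_err:.3f}")
```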
The paper then connects these approximation results to statistical divergence measures. Lemma 13 (Zeevi & Meir, 1997) bounds the Kullback‑Leibler (KL) divergence between two densities that are bounded away from zero on a compact set by their \(L^2\) distance: \(D_{\mathrm{KL}}(f,g)\le \beta^{-1}\|f-g\|_2^2\). Combining this with Barron’s (1993) result on approximating elements of the closed convex hull of a bounded set in a Hilbert space, the authors prove that any density in the convex hull of the kernel class can be approximated by an \(n\)-component finite mixture with a squared \(L^2\) error of order \(1/n\). Consequently, Theorem 15 and Corollary 21 establish that for any compactly supported density \(f\) bounded below by \(\beta>0\), there exists an \(n\)-component mixture \(f_n\) such that
\[
D_{\mathrm{KL}}(f,f_n)\le \frac{\varepsilon}{\beta}+\frac{C\gamma}{n},
\]
where \(C\) and \(\gamma\) depend on the chosen kernel and the support set.
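A quick sanity check of the Lemma 13-style inequality (the two densities on \([0,1]\) below are arbitrary choices, not taken from the paper):

```python
# Check D_KL(f, g) <= beta^{-1} * ||f - g||_2^2 for densities on [0, 1]
# that are bounded below by beta.
import numpy as np

x = np.linspace(0.0, 1.0, 100001)
dx = x[1] - x[0]
f = 0.5 + x          # integrates to 1 on [0, 1], bounded below by 0.5
g = 1.5 - x          # integrates to 1 on [0, 1], bounded below by 0.5
beta = 0.5

kl = np.sum(f * np.log(f / g)) * dx          # D_KL(f, g)
l2_sq = np.sum((f - g) ** 2) * dx            # ||f - g||_2^2
print(f"KL = {kl:.4f}   bound = {l2_sq / beta:.4f}")
```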
Finally, the authors translate these deterministic approximation bounds into statistical guarantees for the maximum‑likelihood estimator (MLE). By invoking Li & Barron’s (1999) theorems, they show that the MLE based on a finite mixture of location‑scale kernels achieves a KL error that also decays at rate \(O(1/n)\) toward the true density, provided the kernel satisfies a mild logarithmic boundedness condition. This bridges the gap between pure approximation theory and practical estimation: as the number of mixture components grows, the MLE not only fits the data but also converges to the underlying distribution at a quantifiable rate.
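As an illustration of this estimation picture (a minimal sketch under stated assumptions, not the paper's experiment: scikit-learn's EM-based GaussianMixture stands in for the MLE, the target is a Beta(2, 5) density, and the KL divergence is estimated by Monte Carlo):

```python
# Fit Gaussian location-scale mixtures by maximum likelihood (EM) with a
# growing number of components and track a Monte Carlo estimate of
# KL(f, f_n) against the true density.
import numpy as np
from scipy.stats import beta
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
target = beta(2, 5)                                    # compactly supported target
train = target.rvs(size=20000, random_state=rng).reshape(-1, 1)
test = target.rvs(size=20000, random_state=rng)

for n in [1, 2, 4, 8, 16]:
    mle = GaussianMixture(n_components=n, random_state=0).fit(train)
    log_fn = mle.score_samples(test.reshape(-1, 1))    # log f_n at test points
    kl_hat = np.mean(target.logpdf(test) - log_fn)     # E_f[log f - log f_n]
    print(f"n = {n:2d}   estimated KL(f, f_n) ~ {kl_hat:.4f}")
```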
In summary, the paper delivers a rigorous, multi‑layered treatment of mixture‑model approximation: it supplies concrete proofs for previously informal claims, quantifies convergence in several norms (total variation, \(L^q\), sup‑norm), derives explicit rates under Lipschitz smoothness, and finally links these results to KL‑based error bounds for both deterministic approximations and the MLE. The work thus provides both theoreticians and practitioners with clear criteria for how many mixture components are needed to achieve a desired level of accuracy, and clarifies that mixing over locations alone is often sufficient, simplifying model design without sacrificing approximation power.