On sampling and parametrization of discrete frequency distributions
The general relationship between an arbitrary frequency distribution and the expectation values of the frequency distributions of its samples is established. A set of combinations of expectation values whose values do not, in general, depend on the size of the sample is constructed. Distribution functions for which the distribution of the expectation values of their samples is invariant in form are found and studied. The conditions under which the scaling limit of such distributions may exist are described.
💡 Research Summary
The paper investigates the relationship between an arbitrary discrete frequency distribution and the expected frequency distribution of its random samples. Starting from a population of size N that contains Nₖ items occurring with frequency k (so that Σₖ Nₖ = N), the author derives the exact expectation value of the count nₖ of items of frequency k observed in a sample of size n. By elementary combinatorial arguments one obtains the simple proportionality ⟨nₖ⟩ = n · Nₖ/N, which holds exactly for simple random sampling without replacement.
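This proportionality can be verified exactly on a toy population by enumerating every equally likely sample; the labels and sizes below are hypothetical, chosen only to keep the enumeration small:

```python
from itertools import combinations
from fractions import Fraction
from collections import Counter

# Toy population: each item is tagged with its frequency k.
# N_1 = 3, N_2 = 2, N_3 = 1, so N = 6.
labels = [1, 1, 1, 2, 2, 3]
N = len(labels)
n = 3  # sample size

# Exact expectation of n_k, obtained by enumerating all C(N, n)
# equally likely samples drawn without replacement.
totals = Counter()
num_samples = 0
for sample in combinations(range(N), n):
    num_samples += 1
    for i in sample:
        totals[labels[i]] += 1

expected = {k: Fraction(c, num_samples) for k, c in totals.items()}

# <n_k> = n * N_k / N, exactly.
for k, Nk in Counter(labels).items():
    assert expected[k] == Fraction(n * Nk, N)
```

The factor n/N is simply symmetry: each item appears in C(N−1, n−1) of the C(N, n) possible samples, and C(N−1, n−1)/C(N, n) = n/N.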
From this basic result the author constructs a family of sample‑size‑independent quantities, denoted Φₘ = (1/n) Σₖ C(k,m) ⟨nₖ⟩, where C(k,m) is the binomial coefficient "k choose m". These "combination moments" are normalized linear combinations of the expected sample frequencies: since ⟨nₖ⟩ = n · Nₖ/N, the factor n cancels and Φₘ = Σₖ C(k,m) Nₖ/N is independent of the sample size. Consequently, Φₘ encodes intrinsic information about the underlying distribution and can be used as a robust statistic for comparing different samples or for inferring the parameters of the original distribution.
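A quick numerical check of this size-invariance, treating Φₘ as normalized by the sample size n so that the factor n cancels (as the cancellation claim requires); the population counts Nₖ below are hypothetical:

```python
from math import comb

# Hypothetical population frequency counts N_k.
Nk = {1: 50, 2: 30, 3: 15, 5: 5}
N = sum(Nk.values())

def phi(m, n):
    """Phi_m = (1/n) * sum_k C(k, m) * <n_k>, with <n_k> = n * N_k / N."""
    return sum(comb(k, m) * (n * c / N) for k, c in Nk.items()) / n

# The value of Phi_m does not depend on the sample size n.
for m in range(4):
    assert abs(phi(m, n=10) - phi(m, n=1000)) < 1e-12
```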
A central contribution of the work is the identification of a class of distributions whose sampled‑expectation distribution retains the same functional form as the original distribution, a property the author calls "form invariance". Formally, if the population distribution can be written as P(k; θ) = f(k, θ) for some parameter vector θ, then the expected sample distribution P̃(k) = ⟨nₖ⟩/n also belongs to the same family: P̃(k) = f(k, θ̃), where the transformed parameters θ̃ depend only on the sampling fraction α = n/N. The paper demonstrates that several well‑known discrete families satisfy this condition, notably power‑law (Pareto‑type) distributions, the negative‑binomial distribution, and certain compound Poisson models. For these families the shape parameters (e.g., the exponent of a power law) are unchanged by sampling, while scale parameters are rescaled by α. This explains why, for example, Zipf‑type word‑frequency patterns observed in a small excerpt of a text often mirror the pattern in the full corpus.
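Form invariance of the negative binomial can be illustrated under a simplified sampling model: Bernoulli (binomial) thinning, where each population count survives independently with probability α. The parameter transformation p ↦ p/(p + (1 − p)α) used below is derived for that model via the probability generating function, not taken from the paper; the values of r, p, and α are hypothetical:

```python
from math import comb

def nb_pmf(k, r, p):
    """Negative-binomial pmf: P(X = k), failures before the r-th success."""
    return comb(k + r - 1, k) * p**r * (1 - p)**k

r, p, alpha = 3, 0.4, 0.25          # hypothetical parameters
p_thin = p / (p + (1 - p) * alpha)  # rescaled scale parameter under thinning

# pmf of the thinned variable: each of the k counts survives with prob alpha.
K = 400  # truncation of the (rapidly decaying) population support
for j in range(10):
    thinned = sum(nb_pmf(k, r, p) * comb(k, j) * alpha**j * (1 - alpha)**(k - j)
                  for k in range(j, K))
    # Same family, same shape parameter r; only p changes.
    assert abs(thinned - nb_pmf(j, r, p_thin)) < 1e-9
```

The shape parameter r is untouched by thinning, matching the summary's statement that only scale parameters are rescaled by the sampling fraction.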
The author then examines the scaling limit in which both N and n tend to infinity while the sampling fraction α remains fixed. In this regime the combination moments Φₘ converge to the moments of a continuous limiting distribution, and the moment‑generating function of the sample converges to that of the population. The paper proves that such convergence requires the tail of the original distribution to be sufficiently light (e.g., a power‑law survival‑function exponent greater than two, which ensures a finite second moment). Under these conditions the sampled distribution obeys a law of large numbers for discrete frequencies, guaranteeing that empirical estimates based on finite samples become increasingly accurate as the sample grows.
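The moment condition depends on which exponent convention is meant: for a pmf proportional to k^(−γ) the second moment Σₖ k²·k^(−γ) is finite only when γ > 3, which corresponds to a survival-function (CCDF) exponent greater than two. A quick numerical illustration of the two regimes:

```python
# Partial sums of sum_k k^2 * k^(-gamma), the second moment of a
# pmf proportional to k^(-gamma); the series converges iff gamma > 3.
def partial(gamma, K):
    return sum(k**2 * k**(-gamma) for k in range(1, K + 1))

light = [partial(3.5, K) for K in (10**3, 10**4, 10**5)]
heavy = [partial(2.5, K) for K in (10**3, 10**4, 10**5)]

# gamma = 3.5: increments shrink; the second moment is finite.
assert light[2] - light[1] < 0.02
# gamma = 2.5: increments keep growing (~ sqrt(K)); the moment diverges.
assert heavy[2] - heavy[1] > 100
```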
To validate the theoretical findings, the author conducts numerical experiments on synthetic data generated from power‑law and negative‑binomial laws, as well as on real‑world text corpora (e.g., Wikipedia word counts). For a range of sampling fractions (1 %, 5 %, 10 % of the full data) the computed Φₘ values remain essentially constant, and the parameters recovered from the sampled data match those of the full population within statistical error. In the text corpus, the estimated Zipf exponent derived from a 5 % random excerpt is indistinguishable from the exponent obtained from the complete collection, illustrating the practical relevance of form invariance.
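An experiment of this kind can be reproduced in a few lines; the population below is a small synthetic stand-in (not the paper's data), with Φ₂ computed per sampled item so that the sample size cancels:

```python
import random
from math import comb
from collections import Counter

random.seed(0)

# Synthetic stand-in population: item i carries its frequency label k,
# with counts N_k falling off roughly like a power law.
population = [k for k in range(1, 6) for _ in range(2000 // k**2)]
N = len(population)

def phi2(items):
    """Empirical Phi_2 = (1/n) * sum_k C(k, 2) * n_k for one sample."""
    counts = Counter(items)
    return sum(comb(k, 2) * c for k, c in counts.items()) / len(items)

full = phi2(population)
for frac in (0.01, 0.05, 0.10):
    n = int(frac * N)
    # Average over repeated samples to approximate the expectation.
    est = sum(phi2(random.sample(population, n)) for _ in range(200)) / 200
    assert abs(est - full) / full < 0.15  # essentially constant across fracs
```

Averaging over repeated draws stands in for the expectation ⟨nₖ⟩; for a single small sample the statistic fluctuates, but its mean stays pinned to the full-population value.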
In summary, the paper provides a rigorous mathematical framework that links an arbitrary discrete frequency distribution to the expectations of its samples. By introducing sample‑size‑independent combination moments and characterizing the families of distributions that are invariant under sampling, it offers powerful tools for inference when only limited data are available. The results have immediate implications for fields such as quantitative linguistics, ecological species‑abundance studies, network traffic analysis, and any domain where discrete count data are sampled from a much larger population. The work thus bridges a gap between theoretical probability, statistical inference, and practical data‑driven modeling.