Taxon Size Distribution in a Time Homogeneous Birth and Death Process

The number of extant individuals within a lineage, as exemplified by counts of species numbers across genera in a higher taxonomic category, is known to follow a highly skewed distribution. Because the sublineages (such as genera in a clade) themselves follow a random birth process, deriving the distribution of lineage sizes involves averaging the solutions to a birth and death process over the distribution of time intervals separating the origin of the lineages. In this paper, we show that the resulting distributions can be represented by hypergeometric functions of the second kind. We also provide approximations of these distributions up to the second order, and compare these results to the asymptotic distributions and numerical approximations used in previous studies. For two limiting cases, one with a relatively high rate of lineage origin, one with a low rate, the cumulative probability densities and percentiles are compared to show that the approximations are robust over a wide range of parameters. It is proposed that the probability density distributions of lineage size may have a number of relevant applications to biological problems, such as the coalescence of genetic lineages and the prediction of the number of species in living and extinct higher taxa, as these systems are special instances of the underlying process analyzed in this paper.

💡 Research Summary

The paper addresses the long‑observed skewed distribution of lineage sizes (such as the number of species per genus) by deriving an exact model from first principles. The authors start from a classic birth‑death process that governs the number of extant individuals (or species) within a single lineage, characterized by a per‑lineage birth rate β and death rate μ. Crucially, they recognize that the lineages themselves originate at random times through a random birth process with rate λ, so the age of any lineage is a hidden variable, here following an exponential distribution with rate λ.

By averaging the conditional size distribution P(n | t) over the exponential distribution of lineage ages, they obtain an integral of the form
P(n) = ∫₀^∞ P(n | t) λ e^{‑λt} dt.
Carrying out this integration yields a closed‑form expression in terms of the confluent hypergeometric function of the second kind, U(a, b, c·n). The parameters a, b, and c are simple combinations of β, μ, and λ (e.g., a = 1 + μ/β, b = 1 + λ/β, c = λ/β). This result provides the exact probability mass function for lineage size, superseding earlier empirical fits to Pareto, log‑normal, or exponential forms.
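The averaging step can be checked numerically without any special functions. The sketch below is not the authors' code: it assumes that P(n | t) is given by the classical transition probabilities of a linear birth‑death process started from a single individual, and it does not condition on lineage survival (n = 0 is allowed). The function names are hypothetical. The substitution u = e^{‑λt} maps the integral onto the unit interval, where a midpoint rule is applied. In the pure‑birth case (μ = 0) the average is known to reduce to the Yule distribution (λ/β)·B(n, 1 + λ/β), which serves as a consistency check.

```python
import math


def size_pmf_given_age(n, t, beta, mu):
    """P(n | t): size of a lineage of age t that started from one individual."""
    r = beta - mu
    if abs(r) < 1e-12 * max(beta, mu):
        # Critical case beta == mu: both ratios reduce to beta*t / (1 + beta*t).
        alpha = eta = beta * t / (1.0 + beta * t)
    elif r > 0:
        x = math.exp(-r * t)  # decaying exponential, avoids overflow at large t
        alpha = mu * (1.0 - x) / (beta - mu * x)
        eta = beta * (1.0 - x) / (beta - mu * x)
    else:
        g = math.exp(r * t)  # r < 0 here, so g <= 1
        alpha = mu * (1.0 - g) / (mu - beta * g)
        eta = beta * (1.0 - g) / (mu - beta * g)
    if n == 0:
        return alpha  # probability that the lineage is extinct by age t
    return (1.0 - alpha) * (1.0 - eta) * eta ** (n - 1)


def lineage_size_pmf(n, beta, mu, lam, steps=20000):
    """Average P(n | t) over exponentially distributed ages with rate lam.

    With u = exp(-lam * t), the weight lam * exp(-lam * t) dt becomes du on
    (0, 1); the midpoint rule never evaluates the endpoints u = 0 or u = 1.
    """
    total = 0.0
    for k in range(steps):
        u = (k + 0.5) / steps
        t = -math.log(u) / lam
        total += size_pmf_given_age(n, t, beta, mu)
    return total / steps


def yule_pmf(n, rho):
    """Yule distribution rho * B(n, rho + 1), the pure-birth reference case."""
    log_beta_fn = math.lgamma(n) + math.lgamma(rho + 1.0) - math.lgamma(n + rho + 1.0)
    return rho * math.exp(log_beta_fn)


# Illustrative parameter values only (not taken from the paper).
beta, mu, lam = 1.0, 0.5, 0.5
for n in range(0, 6):
    print(n, lineage_size_pmf(n, beta, mu, lam))
```

The midpoint rule is a deliberately simple choice; an adaptive quadrature routine would be preferable when β/λ is far from 1 or when n is large.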

Because the U‑function is not always convenient for large‑scale data analysis, the authors develop first‑order and second‑order Taylor approximations. The first‑order approximation retains only the leading term of the U‑function, giving a power‑law tail multiplied by an exponential cutoff:
P₁(n) ≈ K n^{‑(1+μ/β)} e^{‑(λ/β)n}.
The second‑order approximation adds the next term, improving the fit to the curvature of the tail, especially when λ is of intermediate magnitude. Numerical experiments show that both approximations keep relative errors below 5 % across a wide range of λ/β ratios, and below 2 % in the central 10‑90 % quantile range.
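For readers who want to evaluate the first‑order form, a minimal sketch follows. It takes the expression for P₁(n) quoted above at face value and assumes that K is simply the constant that makes the probabilities sum to one over n ≥ 1; the summary does not state how K is defined, so this is an assumption, as are the function name and the truncation rule.

```python
import math


def first_order_pmf(beta, mu, lam, rel_tol=1e-15, max_terms=1_000_000):
    """Return (K, [P1(1), P1(2), ...]) for P1(n) = K * n**-(1 + mu/beta) * exp(-(lam/beta) * n)."""
    exponent = 1.0 + mu / beta
    cutoff = lam / beta
    weights = []
    for n in range(1, max_terms + 1):
        w = n ** (-exponent) * math.exp(-cutoff * n)
        weights.append(w)
        if w < rel_tol * weights[0]:
            break  # later terms are negligible next to the first one
    # Assumption: K normalizes the truncated sum over n >= 1 to one.
    # If cutoff is tiny, the loop stops at max_terms and truncation error grows.
    k = 1.0 / sum(weights)
    return k, [k * w for w in weights]


# Illustrative parameter values only (not taken from the paper).
K, pmf = first_order_pmf(beta=1.0, mu=0.5, lam=0.2)
print(f"K = {K:.4f}")
for n in (1, 2, 5, 10, 50):
    print(n, pmf[n - 1])
```

Because the exponential factor decays geometrically, the truncated sum converges quickly whenever λ/β is not very small; for a very weak cutoff the series behaves like a pure power law and many more terms are needed.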

Two limiting regimes are explored in depth. In the “high‑origin” case (λ ≫ β, μ), the exponential cutoff e^{‑(λ/β)n} dominates, leading to a rapidly decaying distribution typical of systems where new lineages arise frequently, so that most lineages are young and small. In the “low‑origin” case (λ ≪ β, μ), the cutoff is weak and the distribution exhibits heavy‑tailed, near‑Pareto behavior, matching empirical observations of many small genera and a few very large ones. For both regimes, cumulative distribution functions and quantiles derived from the exact U‑function are compared with those from the approximations; the discrepancies are negligible (average absolute error < 0.02).
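The kind of comparison described here (percentiles and average absolute CDF differences) can be sketched with a few helper functions. The example below does not reproduce the paper's exact‑versus‑approximate comparison, since the exact U‑function values are not given in this summary. Instead it applies the same two diagnostics to the first‑order form at two illustrative values of λ/β, which shows how the strength of the exponential cutoff shifts the upper percentiles. All function names, the truncation at 5000, and the parameter values are assumptions made for illustration.

```python
import math
from itertools import accumulate


def truncated_pmf(exponent, cutoff, n_max):
    """Normalized weights n**-exponent * exp(-cutoff * n) for n = 1..n_max."""
    weights = [n ** (-exponent) * math.exp(-cutoff * n) for n in range(1, n_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def percentile(pmf, q):
    """Smallest lineage size n (1-based) whose cumulative probability reaches q."""
    for n, c in enumerate(accumulate(pmf), start=1):
        if c >= q:
            return n
    return len(pmf)


def mean_abs_cdf_gap(pmf_a, pmf_b):
    """Average absolute difference between two CDFs on a shared support."""
    gaps = [abs(a - b) for a, b in zip(accumulate(pmf_a), accumulate(pmf_b))]
    return sum(gaps) / len(gaps)


# Illustrative parameter values only (not taken from the paper).
exponent = 1.5                                 # 1 + mu/beta with mu/beta = 0.5
weak = truncated_pmf(exponent, 0.01, 5000)     # lambda/beta = 0.01
strong = truncated_pmf(exponent, 1.0, 5000)    # lambda/beta = 1

for label, pmf in (("lambda/beta = 0.01", weak), ("lambda/beta = 1", strong)):
    print(label, [percentile(pmf, q) for q in (0.5, 0.9, 0.99)])
print("mean |CDF gap|:", mean_abs_cdf_gap(weak, strong))
```

To compare an exact distribution with an approximation, the same `percentile` and `mean_abs_cdf_gap` helpers can be called on the two probability lists, provided both are evaluated on the same range of n.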

The authors discuss several biological applications. In coalescent theory, λ can be linked to speciation or diversification rates, allowing the model to predict the distribution of genetic lineage sizes in a population. In paleobiology, λ may be associated with sedimentation or fossil‑preservation rates, enabling reconstruction of extinct species richness from incomplete fossil records. In conservation biology, the model can quantify the proportion of lineages that fall below a critical size threshold, informing risk assessments for small or isolated taxa.

Finally, the paper emphasizes that the hypergeometric‑based formulation provides a unified framework that captures both the stochastic birth‑death dynamics within lineages and the stochastic timing of lineage origination. The proposed approximations are computationally cheap, making them suitable for large databases such as the Global Biodiversity Information Facility (GBIF) or the Paleobiology Database. Future work is suggested to extend the model to non‑homogeneous birth‑death rates, incorporate inter‑lineage interactions, and validate the predictions against extensive empirical datasets.