Central limit theorems for interacting innovation processes, related statistical tools and general results
We study a networked system of innovation processes, where each process is modeled as an urn with infinitely many colors, a classical framework for capturing the emergence of novelties. Extending this paradigm, we analyze a model of interacting urns, where the probability of generating or reusing elements in one process is influenced by the histories of others. This interaction is governed by two matrices that control innovation triggering and reinforcement dynamics across the system. The core contribution of this work is a detailed analysis of the second-order asymptotic behavior of the model. Building on these theoretical results, we develop statistical tools to infer the structure and strength of inter-process influence. The methodology is framed in a general setting, making it broadly applicable. We validate our approach with applications to two real-world datasets from Reddit discussions and Gutenberg text corpora.
💡 Research Summary
The paper introduces a novel framework for modeling innovation dynamics by extending classical infinite‑color urn schemes to a network of interacting urns. Each urn represents an individual innovation process (e.g., a subreddit, a document, or a research group) and contains an unbounded set of “colors” that stand for distinct items such as words, memes, or patents. At every discrete time step a ball is drawn from each urn with replacement. The drawn color is classified as “new” if it has never appeared anywhere in the whole system; otherwise it is “old”.
Two interaction matrices govern the evolution. The matrix Γ (γ_{j,h}) controls how the appearance of a new color in urn j influences the probability that urn h will generate a new color in the next step. The matrix W (w_{j,h}) controls how the cumulative counts of a given color across all urns affect the probability that urn h will draw that color again. Both matrices are assumed to be non‑negative, irreducible, and satisfy a balance condition that keeps the total number of balls added to each urn at every step constant. By re‑parameterising the reinforcement parameters (ρ, ν, b ρ) the authors obtain compact expressions for the conditional probability of a new color,
Z*_{t,h} = (θ_h + ∑_{j=1}^N γ_{j,h} D*_{t,j}) / (θ_h + t),
and for the conditional probability of drawing an existing color c,
P_t(h,c) = (∑_{j=1}^N w_{j,h} K_t(j,c) − γ_{j*(c),h}) / (θ_h + t).
Here D*_{t,j} is the cumulative number of distinct colors first generated by urn j up to time t, K_t(j,c) is the number of times color c has been drawn from urn j, and j*(c) denotes the urn that first produced color c.
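These two transition probabilities can be sketched in a small simulation. The following Python snippet is an illustrative reconstruction, not the authors' code: all parameter values (θ, Γ, W) are hypothetical, with the columns of W summing to one so that, as in the model's balance condition, the new-color and old-color probabilities sum to one at each step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal simulation sketch of the interacting urn dynamics.  All parameter
# values below are hypothetical illustrations, not taken from the paper.
N = 2
theta = np.array([1.0, 1.0])
Gamma = np.array([[0.3, 0.2],      # Gamma[j, h]: innovation triggering j -> h
                  [0.2, 0.3]])
W = np.array([[0.8, 0.2],          # W[j, h]: reinforcement weight j -> h
              [0.2, 0.8]])         # columns sum to 1 (balance condition)

T = 2000
D = np.zeros(N)          # D*_{t,j}: distinct colors first generated by urn j
K = np.zeros((0, N))     # K_t(j, c): draws of color c (row) by urn j (column)
first_urn = []           # j*(c): urn that first produced color c

for t in range(T):
    for h in range(N):
        denom = theta[h] + t
        p_new = min(1.0, (theta[h] + Gamma[:, h] @ D) / denom)
        if K.shape[0] == 0 or rng.random() < p_new:
            # new color: record the originating urn and open a count row for it
            D[h] += 1
            first_urn.append(h)
            row = np.zeros(N)
            row[h] = 1.0
            K = np.vstack([K, row])
        else:
            # old color c drawn with weight sum_j W[j,h] K_t(j,c) - Gamma[j*(c),h]
            w = K @ W[:, h] - Gamma[np.array(first_urn), h]
            w = np.clip(w, 0.0, None)
            c = rng.choice(K.shape[0], p=w / w.sum())
            K[c, h] += 1

print("distinct colors per urn:", D)
print("total draws:", int(K.sum()))   # equals N * T
```

With this Γ the Perron eigenvalue is 0.5, so the number of distinct colors in the run grows roughly like √t, far slower than the t·N total draws.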
The authors first recall their earlier first‑order results (Theorems 2.1 and 2.2): under irreducibility of Γ and W, the vector of distinct‑color counts D*_t grows like t^{γ*} u, where γ* ∈ (0,1) is the Perron–Frobenius eigenvalue of Γ and u its left eigenvector; simultaneously, the empirical frequencies of each color converge almost surely to a random limit P_∞(c) ∈ (0,1). This demonstrates a synchronization phenomenon: all urns share the same asymptotic innovation rate and the same distribution of observed items.
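The Perron–Frobenius quantities γ* and u that govern this first‑order growth are straightforward to compute numerically. A minimal NumPy sketch, using a hypothetical Γ chosen so that the sub‑dominant eigenvalue also satisfies the spectral‑gap condition required by the second‑order theory below:

```python
import numpy as np

# Hypothetical irreducible, non-negative interaction matrix (illustrative).
Gamma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])

# Left eigenvectors of Gamma are right eigenvectors of Gamma.T.
eigvals, vecs = np.linalg.eig(Gamma.T)
k = np.argmax(eigvals.real)
gamma_star = eigvals[k].real        # Perron-Frobenius eigenvalue, here 0.5
u = vecs[:, k].real
u = u / u.sum()                     # normalized left eigenvector, here [0.5, 0.5]

# Spectral-gap condition used by the second-order theorems:
sub_dominant = np.sort(eigvals.real)[-2]
print(round(gamma_star, 6), np.round(u, 6), bool(sub_dominant < gamma_star / 2))
```

For this symmetric Γ the eigenvalues are 0.5 and 0.1, so the gap condition Re(λ₂) < γ*/2 holds with room to spare.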
The core contribution of the present work is a second‑order analysis. Under two technical spectral‑gap assumptions (Assumptions 3.1 and 3.2), namely that Γ and W are diagonalizable and that their sub‑dominant eigenvalues have real parts strictly smaller than half of the dominant ones, the paper proves central limit theorems for the key processes.
Theorem 3.1 (CLT for distinct‑color counts) states that
t^{γ*/2} ( D*_t / t^{γ*} − D*_∞ u )
converges stably to a multivariate normal distribution with mean zero and a deterministic covariance matrix C_{det,Γ} that can be expressed explicitly in terms of the eigen‑structure of Γ. D*_{∞} is the random positive limit appearing in the first‑order theorem.
Theorem 3.2 (CLT for color frequencies) shows that for any fixed color c, the vector of empirical frequencies across urns satisfies
√t ( P_t(c) − P_∞(c) 1 ) → N(0, Σ_W),
where Σ_W is a covariance matrix derived from the eigenvectors of W, and 1 denotes the all‑ones vector. Consequently, the total count K_t(c) obeys a similar √t‑scaled normal limit.
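One practical consequence of this √t rate is a plug‑in normal confidence interval for the limiting frequency of a color. A minimal sketch: `sigma2_hat` stands for the relevant diagonal entry of Σ_W, which would have to be estimated from data in practice, and the numeric inputs are hypothetical.

```python
import numpy as np

def clt_interval(p_t, t, sigma2_hat, z=1.959964):
    """95% plug-in interval for P_inf(c) from the sqrt(t)-scaled CLT.

    p_t:        empirical frequency of color c at time t
    sigma2_hat: estimate of the relevant diagonal entry of Sigma_W
    """
    half = z * np.sqrt(sigma2_hat / t)
    return p_t - half, p_t + half

# Hypothetical numbers: frequency 0.12 after 10,000 draws, variance 0.05.
lo, hi = clt_interval(p_t=0.12, t=10_000, sigma2_hat=0.05)
print(f"[{lo:.4f}, {hi:.4f}]")   # -> [0.1156, 0.1244]
```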
The proofs rely on writing the dynamics of Z*_{t,h} as a stochastic approximation recursion with step size r_{t,h} = 1/(θ_h + t + 1). By centering and linearizing around the deterministic limit, the authors obtain a martingale difference array whose asymptotic variance is computed via the spectral decomposition of Γ (or W). Because the limiting covariance matrices involve the random variable D*_∞, the authors employ the notion of stable convergence, which is stronger than convergence in distribution and preserves dependence on the underlying σ‑field.
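In generic stochastic‑approximation form (a schematic shape only; f_h denotes the drift implied by the model and is not the paper's exact notation), such a recursion reads

Z*_{t+1,h} = Z*_{t,h} + r_{t,h} ( f_h(Z*_t) − Z*_{t,h} + ΔM_{t+1,h} ),  with r_{t,h} = 1/(θ_h + t + 1),

where ΔM_{t+1,h} is the martingale‑difference noise whose scaled sums drive the Gaussian limits.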
Building on these limit theorems, the paper develops statistical inference tools. The CLTs provide asymptotic normality for estimators of the interaction parameters. In practice, one observes the sequence of colors drawn from each urn, computes empirical counts of new versus old draws, and fits linear regression models derived from the approximated recursions to obtain estimates of γ_{j,h} and w_{j,h}. Standard errors follow directly from the derived covariance structures, enabling hypothesis testing (e.g., testing whether a particular off‑diagonal entry of Γ is zero, which would indicate no direct innovation triggering between two processes). The authors also outline a Bayesian alternative, placing priors on the matrices and using Markov‑chain Monte‑Carlo to sample from the posterior, with the CLTs serving as a guide for proposal distributions.
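As a concrete sketch of the regression idea (an illustrative reconstruction, not the authors' estimator: θ_h is treated as known and plain least squares is used), note that the indicator I_{t,h} of a new draw in urn h has conditional mean (θ_h + ∑_j γ_{j,h} D*_{t,j})/(θ_h + t), so the transformed response I_{t,h}(θ_h + t) − θ_h is, up to martingale noise, linear in the observed covariates D*_{t,j}:

```python
import numpy as np

def estimate_gamma_column(new_indicator, D_hist, theta_h):
    """Least-squares estimate of the column gamma_{., h}.

    new_indicator: (T,) record of whether urn h drew a new color at each t
    D_hist:        (T, N) counts D*_{t,j} observed just before each step
    theta_h:       treated as known here (an assumption of this sketch)
    """
    T = len(new_indicator)
    y = new_indicator * (theta_h + np.arange(T)) - theta_h  # linearized response
    gamma_hat, *_ = np.linalg.lstsq(D_hist, y, rcond=None)
    return gamma_hat

# Noiseless sanity check with synthetic covariates and a known gamma column.
rng = np.random.default_rng(1)
D_hist = np.cumsum(rng.random((500, 2)), axis=0)   # fake growing counts
gamma_true = np.array([0.3, 0.2])
theta_h = 1.0
mean_indicator = (theta_h + D_hist @ gamma_true) / (theta_h + np.arange(500))
print(estimate_gamma_column(mean_indicator, D_hist, theta_h))   # ~ [0.3, 0.2]
```

With real data the response is the Bernoulli indicator itself rather than its conditional mean, and the CLT-derived covariance structure supplies the standard errors for Wald-type tests of γ_{j,h} = 0.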
The methodology is illustrated on two real‑world data sets. In the Reddit case study, each subreddit is treated as an urn; tokens (words, hashtags, or URLs) constitute colors. A token is labeled “new” the first time it appears in any subreddit. The estimated Γ matrix reveals strong cross‑subreddit innovation spillovers among thematically related communities, while the W matrix captures the diffusion of popular tokens across the network. In the Gutenberg corpus, each book is an urn and words are colors; the analysis uncovers how certain lexical innovations in early‑published works propagate to later works, and how high‑frequency words become reinforced across the literary network. The empirical findings align with known linguistic phenomena such as Zipf’s law and Heaps’ law, confirming that the model faithfully reproduces the heavy‑tailed behavior observed in real data.
Finally, the authors discuss extensions. The current theory requires irreducibility, diagonalizability, and a spectral gap; relaxing these assumptions (e.g., allowing time‑varying interaction matrices, incorporating external innovation shocks, or considering finite‑color urns) is identified as future work. They also note that the stable‑convergence framework could be adapted to other reinforcement schemes, such as preferential attachment graphs or stochastic block models with reinforcement.
In summary, the paper delivers a rigorous second‑order probabilistic analysis of a networked infinite‑color urn system, provides explicit central limit theorems, and translates these results into practical statistical tools for uncovering and quantifying inter‑process influence in complex innovation ecosystems.