Scalable Sample-to-Population Estimation of Hyperbolic Space Models for Hypergraphs
Hypergraphs are useful mathematical representations of overlapping and nested subsets of interacting units, including groups of genes or brain regions, economic cartels, political or military coalitions, and groups of products that are purchased together. Despite the vast range of applications, the statistical analysis of hypergraphs is challenging: There are many hyperedges of small and large sizes, and hyperedges can overlap or be nested. Existing approaches to hypergraphs are either not scalable or achieve scalability at the expense of model realism. We develop a statistical framework that enables scalable estimation, simulation, and model assessment of hypergraph models, which is supported by non-asymptotic and asymptotic theoretical guarantees. First, we introduce a novel model of hypergraphs capturing core-periphery structure in addition to proximity, by embedding units in an unobserved hyperbolic space. Second, we achieve scalability by developing manifold optimization algorithms for learning hyperbolic space models based on samples from a population hypergraph. Third, we provide non-asymptotic and asymptotic theoretical guarantees for learning hyperbolic space models based on samples from a population hypergraph. We use the proposed statistical framework to detect core-periphery structure along with proximity among U.S. politicians based on historical media reports.
💡 Research Summary
This paper addresses the growing need for scalable and realistic statistical modeling of hypergraphs, which are complex networks whose interactions involve overlapping and nested subsets of nodes. Traditional graph models cannot cope with the combinatorial explosion of possible hyperedges (2^N − N − 1 subsets of size at least two among N nodes) or with the dependencies among overlapping hyperedges. The authors propose a comprehensive framework that (i) introduces a novel hyperbolic‑space embedding model capable of representing both core‑periphery structure and proximity, (ii) develops manifold‑optimization algorithms that learn the model from a sampled subset of hyperedges rather than the full population, and (iii) provides rigorous non‑asymptotic and asymptotic theoretical guarantees for the estimators.
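To give a feel for the count 2^N − N − 1, the following short sketch computes the number of possible hyperedges and the number that remain when sizes are capped at some K. The function names are illustrative and do not come from the paper.

```python
from math import comb


def count_all_hyperedges(n):
    """Number of subsets of n nodes with at least two members: 2^n - n - 1."""
    return 2 ** n - n - 1


def count_hyperedges_up_to(n, k_max):
    """Number of subsets of n nodes with size between 2 and k_max inclusive."""
    return sum(comb(n, k) for k in range(2, k_max + 1))


if __name__ == "__main__":
    # With no size cap the two counts agree.
    print(count_all_hyperedges(5), count_hyperedges_up_to(5, 5))
    # Capping the size at 4 still leaves a very large candidate set for
    # a few hundred nodes, which is why sampling is needed.
    print(count_hyperedges_up_to(678, 4))
```

Even with a small cap on hyperedge size, the candidate set grows polynomially in N with degree K, which motivates learning from a sample rather than from the full population.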
Model Specification
Each of the N units is assigned a latent position θ_i in an r‑dimensional hyperbolic space, represented either by the Poincaré disk or the Lorentz model. For any hyperedge e of size |e|, the existence indicator Z_e follows an independent Bernoulli distribution with probability
π(α_|e|, Θ_e) = α_|e| · σ(−g(Θ_e)),
where α_|e| ∈ (0,1] controls sparsity for each hyperedge size, σ(x) = 2 exp(x)/(1+exp(x)) is twice the logistic function, and g(Θ_e) ≥ 0 aggregates pairwise hyperbolic distances among the nodes in e using a Hölder mean with exponent p. Because the argument −g(Θ_e) is nonpositive, σ(−g(Θ_e)) lies in (0,1], so the hyperedge probability never exceeds α_|e| and reaches α_|e| only when g(Θ_e) = 0. By choosing a large negative p (e.g., p = −20), g approximates the minimum distance to a “central” node, thereby encouraging a tree‑like hierarchy: a core node near the center of the Poincaré disk and peripheral nodes near the boundary. This formulation naturally links overlapping hyperedges because the same pairwise distance appears in g for both a small hyperedge {i,j} and a larger hyperedge {i,j,k}, allowing information sharing across sizes.
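The probability formula can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: positions are taken in the Poincaré ball, and g is assumed to be the plain Hölder mean over all pairwise distances within the hyperedge (the paper's exact aggregation may differ). All function names are hypothetical.

```python
import itertools

import numpy as np


def poincare_distance(u, v):
    """Hyperbolic distance between two points of the open unit ball."""
    sq_diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq_diff / denom)


def holder_mean(values, p):
    """Hölder (power) mean with exponent p != 0 of positive values.

    The minimum is factored out so that, for negative p, every power
    stays in (0, 1] and cannot overflow.
    """
    values = np.asarray(values, dtype=float)
    smallest = values.min()
    return smallest * np.mean((values / smallest) ** p) ** (1.0 / p)


def holder_distance(positions, p=-20.0):
    """ASSUMPTION: g is the Hölder mean of all pairwise distances in e."""
    pairs = itertools.combinations(range(len(positions)), 2)
    dists = [poincare_distance(positions[i], positions[j]) for i, j in pairs]
    return holder_mean(dists, p)


def sigma(x):
    """Twice the logistic function; equals 1 at x = 0."""
    return 2.0 * np.exp(x) / (1.0 + np.exp(x))


def hyperedge_probability(positions, alpha, p=-20.0):
    """pi(alpha, Theta_e) = alpha * sigma(-g(Theta_e))."""
    return alpha * sigma(-holder_distance(positions, p))


if __name__ == "__main__":
    pts = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, -0.8]])
    print(hyperedge_probability(pts[:2], alpha=0.3))  # pair {0, 1}
    print(hyperedge_probability(pts, alpha=0.1))      # triple {0, 1, 2}
```

With p = −20 the value of g for the triple stays within a few percent of its smallest pairwise distance, which is the sense in which a large negative exponent approximates a minimum.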
Sample‑to‑Population Inference
Observing all possible hyperedges is infeasible for realistic N. The authors therefore assume that a random subset S of hyperedges is sampled from the population hypergraph. They formulate a likelihood based on the sampled indicators Z_S and derive Riemannian gradients of the log‑likelihood with respect to the latent positions Θ and sparsity parameters α_k. Optimization proceeds on the Lorentz manifold using retraction and vector transport, yielding an efficient O(|S|·r·K) algorithm (K is the maximal hyperedge size). The method does not condition on the total number of observed hyperedges, preserving statistical validity even when |S|≪2^N.
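One update of a single latent position on the Lorentz model can be sketched as below. This is a generic Riemannian gradient step, assuming the hyperboloid {x : ⟨x,x⟩_L = −1, x_0 > 0} and using the exponential map as the retraction; the paper's actual retraction, vector transport, and step-size rule are not specified here, and the Euclidean gradient of the negative log-likelihood is treated as a given input. All names are hypothetical.

```python
import numpy as np


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])


def lift(z):
    """Map a point z in R^r onto the hyperboloid in R^(r+1)."""
    z = np.asarray(z, dtype=float)
    return np.concatenate(([np.sqrt(1.0 + np.dot(z, z))], z))


def riemannian_gradient(x, euclid_grad):
    """Convert a Euclidean gradient into a tangent vector at x."""
    h = np.asarray(euclid_grad, dtype=float).copy()
    h[0] = -h[0]                          # apply the metric inverse J
    return h + lorentz_inner(x, h) * x    # project onto the tangent space


def exp_map(x, v):
    """Exponential map at x applied to the tangent vector v."""
    norm = np.sqrt(max(lorentz_inner(v, v), 0.0))
    if norm < 1e-12:
        return x.copy()
    return np.cosh(norm) * x + np.sinh(norm) * v / norm


def riemannian_step(x, euclid_grad, step_size):
    """One descent step for a loss such as the negative log-likelihood."""
    direction = -step_size * riemannian_gradient(x, euclid_grad)
    return exp_map(x, direction)


if __name__ == "__main__":
    x = lift([0.3, -0.2])
    grad = np.array([0.5, -1.0, 2.0])     # placeholder Euclidean gradient
    x_new = riemannian_step(x, grad, step_size=0.1)
    print(x_new, lorentz_inner(x_new, x_new))  # second value stays near -1
```

Because the update moves along a geodesic, the new point remains on the hyperboloid without any separate renormalization step.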
Identifiability and Theory
Proposition 1 proves that the Gram matrix D = Θ J Θᵀ (with J the Lorentz signature matrix) and the sparsity parameters α_2,…,α_K are identifiable up to hyperbolic isometries provided N > r + 2 and K ≥ 3. Non‑asymptotic error bounds are derived for fixed N, showing that the estimation error decays as O_p(√(log N/|S|)). As N→∞, the estimators are shown to be consistent and asymptotically normal, enabling hypothesis testing and confidence interval construction. These guarantees hold for any fixed p ≠ 0 and any fixed dimension r ≥ 2.
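The role of isometries can be checked numerically: a Lorentz transformation Λ satisfies Λᵀ J Λ = J, so replacing Θ by Θ Λᵀ leaves D = Θ J Θᵀ, and hence every pairwise hyperbolic distance arccosh(−D_ij), unchanged. The sketch below uses a single boost as the example isometry; the helper names and the random test positions are illustrative assumptions.

```python
import numpy as np


def lift_to_hyperboloid(z):
    """Lift rows of z (shape N x r) onto the hyperboloid in R^(r+1)."""
    z = np.asarray(z, dtype=float)
    x0 = np.sqrt(1.0 + np.sum(z ** 2, axis=1, keepdims=True))
    return np.hstack([x0, z])


def lorentz_gram(theta):
    """Gram matrix D = Theta J Theta^T with J = diag(-1, 1, ..., 1)."""
    signature = np.diag([-1.0] + [1.0] * (theta.shape[1] - 1))
    return theta @ signature @ theta.T


def boost(rapidity, dim):
    """Lorentz boost mixing the time-like axis with the first space axis."""
    mat = np.eye(dim)
    mat[0, 0] = mat[1, 1] = np.cosh(rapidity)
    mat[0, 1] = mat[1, 0] = np.sinh(rapidity)
    return mat


def distances_from_gram(gram):
    """Pairwise hyperbolic distances d_ij = arccosh(-D_ij)."""
    return np.arccosh(np.clip(-gram, 1.0, None))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = lift_to_hyperboloid(rng.normal(size=(6, 2)))
    moved = theta @ boost(0.7, 3).T       # apply the same isometry to all rows
    print(np.allclose(lorentz_gram(theta), lorentz_gram(moved)))
```

Since the likelihood depends on the positions only through these distances, two configurations related by such a transformation cannot be distinguished from data.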
Empirical Validation
Synthetic experiments confirm that the algorithm recovers latent positions and sparsity parameters accurately even with modest sampling rates (10–20 % of hyperedges). Compared to Euclidean hypergraph models, the hyperbolic version yields substantially lower reconstruction error, especially for hierarchical core‑periphery patterns. A real‑world case study analyzes co‑mention data of 678 U.S. politicians extracted from historical media reports (1990–2020). Hyperedges of size 2–4 are modeled; the learned 2‑dimensional embedding reveals a clear core of bipartisan legislators near the disk’s center, while party‑specific or issue‑specific clusters form peripheral “leaves.” This demonstrates the model’s ability to uncover both global hierarchy and local proximity in a substantive sociopolitical network.
Discussion and Future Directions
The authors acknowledge that the dimension r is currently chosen manually; they suggest Bayesian model selection or cross‑validation as future work. Extensions to dependent hyperedge generation (e.g., Markov random field over hyperedges) and to dynamic hypergraphs are also mentioned. Overall, the paper delivers a unified solution—geometric modeling, scalable learning, and solid statistical theory—for analyzing massive hypergraphs, opening avenues for applications in genomics, neuroscience, economics, and social science where higher‑order interactions are paramount.