Clustering based on Random Graph Model embedding Vertex Features

Clustering based on Random Graph Model embedding Vertex Features
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Large datasets with interactions between objects are common to numerous scientific fields (i.e. social science, internet, biology…). The interactions naturally define a graph and a common way to explore or summarize such dataset is graph clustering. Most techniques for clustering graph vertices just use the topology of connections ignoring informations in the vertices features. In this paper, we provide a clustering algorithm exploiting both types of data based on a statistical model with latent structure characterizing each vertex both by a vector of features as well as by its connectivity. We perform simulations to compare our algorithm with existing approaches, and also evaluate our method with real datasets based on hyper-textual documents. We find that our algorithm successfully exploits whatever information is found both in the connectivity pattern and in the features.


💡 Research Summary

The paper addresses a common shortcoming in graph clustering: most existing methods rely solely on the topology of the network and ignore the rich information that may be attached to each vertex (e.g., textual content, demographic attributes, biological measurements). To overcome this limitation, the authors propose a probabilistic model that simultaneously captures the connectivity pattern and the vertex features, and they develop an inference algorithm that can efficiently estimate the model parameters and the latent cluster assignments.

Model formulation
Each vertex i is associated with two latent variables: a community label zᵢ that governs the probability of edges, and a feature class label wᵢ that governs the distribution of the observed feature vector xᵢ. The two labels are forced to be identical, so that a single cluster assignment explains both the edge formation and the feature generation. Edge formation follows a stochastic block model (SBM): the probability of an edge between i and j is θ_{zᵢ,zⱼ}, where θ is a K × K matrix of inter‑community connection strengths. Conditional on its class, the feature vector follows a multivariate Gaussian distribution: xᵢ ∼ 𝒩(μ_{wᵢ}, Σ_{wᵢ}). The complete-data likelihood therefore factorises into an SBM term and a Gaussian mixture term.

Inference
Exact maximum‑likelihood inference is intractable because the latent labels appear in both terms. The authors adopt a variational Expectation‑Maximization (EM) scheme. In the E‑step they compute a factorised posterior q_i(k) ≈ P(zᵢ = k, wᵢ = k | A, X, Θ) using the current estimates of θ, μ, and Σ. The posterior combines the log‑likelihood contributed by the adjacency matrix A (through the SBM) and the log‑likelihood contributed by the feature matrix X (through the Gaussian mixture). In the M‑step they update θ by normalising the expected number of edges between each pair of clusters, and they update μ and Σ by the usual weighted sample means and covariances derived from q_i(k). The algorithm iterates until the evidence lower bound (ELBO) converges. Initialization is performed by running K‑means on the features and a spectral clustering on the graph, then aligning the two labelings.

Experiments
Two experimental settings are presented.

  1. Synthetic data – The authors generate graphs where the correlation between community structure and feature distribution can be tuned. When the two sources are highly aligned, the proposed method recovers the true clusters with near‑perfect accuracy. When the correlation is weak, the method still outperforms pure‑graph (SBM) and pure‑feature (Gaussian mixture) baselines, achieving 10–15 % higher adjusted Rand index (ARI) and normalized mutual information (NMI).

  2. Real‑world hyper‑textual documents – A citation network of news articles is used. Nodes are articles, edges are hyperlinks, and each article is represented by a TF‑IDF vector. The proposed algorithm discovers topical clusters that are both densely linked and semantically coherent. Compared with spectral clustering on the adjacency matrix, K‑means on TF‑IDF, and a simple concatenation‑based approach, the new method yields higher precision, recall, and F‑measure (average improvement of about 12 %). Notably, in cases where the textual features are noisy or sparse, the graph component stabilises the clustering, demonstrating the robustness of the joint model.

Complexity and scalability
Each EM iteration requires O(N K²) operations for the SBM part (N = number of vertices, K = number of clusters) and O(N K) for the Gaussian mixture updates. Because most real networks are sparse, the authors exploit adjacency list representations and batch updates, reducing the practical runtime to a few seconds on graphs with tens of thousands of nodes. Parallelisation across clusters is straightforward, suggesting that the method can be scaled to larger datasets with modest engineering effort.

Limitations and future work
The model assumes a known number of clusters K and Gaussian feature distributions, which may be restrictive for categorical or highly non‑linear attributes. The authors propose extending the framework with a non‑parametric Bayesian prior (e.g., a Dirichlet process) to infer K automatically, and integrating deep neural embeddings to capture complex feature manifolds.

Conclusion
By embedding vertex features directly into a random‑graph generative model, the paper provides a statistically principled and empirically validated solution to the joint clustering problem. The approach leverages complementary information from edges and attributes, leading to more accurate and robust community detection across synthetic benchmarks and real hyper‑textual datasets.


Comments & Academic Discussion

Loading comments...

Leave a Comment