Convex optimization for the planted k-disjoint-clique problem
We consider the k-disjoint-clique problem. The input is an undirected graph G in which the nodes represent data items, and edges indicate a similarity between the corresponding items. The problem is to find within the graph k disjoint cliques that cover the maximum number of nodes of G. This problem may be understood as a general way to pose the classical “clustering” problem. In clustering, one is given data items and a distance function, and one wishes to partition the data into disjoint clusters of data items, such that the items in each cluster are close to each other. Our formulation additionally allows “noise” nodes to be present in the input data that are not part of any of the cliques. The k-disjoint-clique problem is NP-hard, but we show that a convex relaxation can solve it in polynomial time for input instances constructed in a certain way. The input instances for which our algorithm finds the optimal solution consist of k disjoint large cliques (called “planted cliques”) that are then obscured by noise edges and noise nodes inserted either at random or by an adversary.
💡 Research Summary
The paper addresses the k‑disjoint‑clique problem, a graph‑theoretic formulation of clustering. Given an undirected graph G = (V,E) where vertices represent data items and edges indicate similarity, the goal is to select k mutually disjoint cliques C₁,…,C_k that together cover as many vertices as possible. This problem is NP‑hard in general, but the authors focus on a planted model: the input graph consists of k large, internally complete subgraphs (the planted cliques) that are subsequently obscured by two types of noise—random or adversarial edges between cliques and additional “noise” vertices that belong to no clique.
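The planted model described above can be made concrete with a small generator. This is a minimal sketch, not the authors' experimental code: the function name `planted_instance`, its parameters, and the choice to draw every non-planted pair as a noise edge with probability `p` are assumptions made for illustration.

```python
import random
from itertools import combinations


def planted_instance(clique_sizes, n_noise_nodes, p, seed=0):
    """Build a graph with planted disjoint cliques plus random noise.

    Returns (n, edges, labels). labels[v] is the planted clique index of
    vertex v, or None if v is a noise node. Edges are stored as pairs
    (i, j) with i < j.
    """
    rng = random.Random(seed)

    # Assign consecutive vertices to each planted clique, then append noise nodes.
    labels = []
    for idx, size in enumerate(clique_sizes):
        labels.extend([idx] * size)
    labels.extend([None] * n_noise_nodes)
    n = len(labels)

    edges = set()
    for i, j in combinations(range(n), 2):
        same_clique = labels[i] is not None and labels[i] == labels[j]
        # Planted edges are always present; any other pair becomes a
        # noise edge independently with probability p (an assumption).
        if same_clique or rng.random() < p:
            edges.add((i, j))
    return n, edges, labels


# Two planted cliques (sizes 4 and 3), two noise nodes, 10% noise edges.
n, edges, labels = planted_instance([4, 3], n_noise_nodes=2, p=0.1)
```

Setting `p=0.0` yields exactly the planted edges, which is a convenient sanity check before adding noise.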
The central contribution is a convex relaxation based on a semidefinite program (SDP). The relaxation introduces a symmetric matrix X ∈ ℝ^{n×n} with X_{ij}=1 if vertices i and j are assigned to the same clique and X_{ij}=0 otherwise. The constraints enforce (i) unit diagonal (each vertex is in some clique), (ii) positive semidefiniteness (ensuring a Gram‑matrix interpretation), and (iii) row‑sum equal to k (each vertex belongs to exactly one of the k cliques). The objective minimizes the total weight assigned to non‑edges, i.e., Σ_{(i,j)∉E} X_{ij}. This SDP can be solved in polynomial time using standard interior‑point methods.
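To illustrate the matrix variable and the objective described above, the following sketch builds the 0/1 co-membership matrix for a given assignment and sums its entries over non-edges. It does not solve the SDP. The helper names `indicator_matrix` and `non_edge_weight` are hypothetical, and counting each unordered pair once is an assumption about how the sum is taken.

```python
def indicator_matrix(labels):
    """X[i][j] = 1 if vertices i and j carry the same cluster label, else 0."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]


def non_edge_weight(X, edges):
    """Sum X[i][j] over unordered pairs i < j that are NOT edges of the graph."""
    n = len(X)
    return sum(X[i][j]
               for i in range(n)
               for j in range(i + 1, n)
               if (i, j) not in edges)


# Toy graph: a triangle {0,1,2}, an edge {3,4}, and one noise edge (2,3).
edges = {(0, 1), (0, 2), (1, 2), (3, 4), (2, 3)}

planted = indicator_matrix([0, 0, 0, 1, 1])   # the true grouping
wrong = indicator_matrix([0, 0, 0, 0, 1])     # vertex 3 placed in the wrong group

planted_cost = non_edge_weight(planted, edges)
wrong_cost = non_edge_weight(wrong, edges)
```

An assignment whose groups are all cliques places no weight on non-edges, so `planted_cost` is zero, while the wrong grouping pays for the missing pairs (0,3) and (1,3).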
Two noise regimes are analyzed:
- Random (Bernoulli) noise: each non‑edge is independently turned on with probability p. The authors prove that if each planted clique has size at least Ω(√n) and p = O(1/√n), then the SDP’s optimal solution exactly recovers the planted cliques with high probability. The proof constructs a dual certificate: a matrix Y that satisfies the KKT conditions, is positive semidefinite, and is orthogonal to the primal optimum X*. Bounding the spectral norm of the random noise matrix (via matrix concentration inequalities) is the key technical step.
- Adversarial noise: an adversary may add a limited number of edges and vertices. The paper shows that as long as the total number of adversarial edges is O(n) and the number of noise vertices is o(n), the same SDP still recovers the planted structure. The dual certificate is built by explicitly handling the worst‑case placement of edges, leveraging the fact that the planted cliques dominate the spectrum of the adjacency matrix.
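The random-noise analysis hinges on the spectral norm of the noise matrix being small. The sketch below estimates that norm numerically for a centered Bernoulli(p) matrix using power iteration on A². It is an illustration only, not the paper's proof technique: the names `centered_noise` and `spectral_norm` are hypothetical, and the reference value 2·sqrt(n·p·(1−p)) is the usual asymptotic scale for such matrices, quoted here as a rough guide rather than a bound taken from the paper.

```python
import math
import random


def centered_noise(n, p, seed=0):
    """Symmetric n x n matrix with zero diagonal and off-diagonal
    entries Bernoulli(p) - p, so every entry has mean zero."""
    rng = random.Random(seed)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            A[i][j] = A[j][i] = (1.0 if rng.random() < p else 0.0) - p
    return A


def spectral_norm(A, iters=200, seed=0):
    """Estimate the largest absolute eigenvalue of a symmetric matrix A.

    Power iteration is applied to A*A, which is positive semidefinite,
    so eigenvalue pairs of opposite sign do not stall convergence.
    """
    n = len(A)
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]

    def matvec(x):
        return [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]

    estimate = 0.0
    for _ in range(iters):
        norm_v = math.sqrt(sum(x * x for x in v))
        if norm_v == 0.0:
            return 0.0  # A maps everything to zero
        v = [x / norm_v for x in v]
        w = matvec(matvec(v))  # w = A^2 v with v of unit length
        # ||A^2 v|| tends to the top eigenvalue of A^2, i.e. the squared norm.
        estimate = math.sqrt(math.sqrt(sum(x * x for x in w)))
        v = w
    return estimate


n, p = 40, 0.1
observed = spectral_norm(centered_noise(n, p))
reference = 2.0 * math.sqrt(n * p * (1.0 - p))  # rough asymptotic scale
```

Comparing `observed` with `reference` for growing `n` gives a feel for why the noise stays spectrally small relative to cliques of size on the order of √n.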
The theoretical results are complemented by experiments. Synthetic graphs are generated with varying clique sizes (5 %–20 % of n) and noise levels (p = 0.01–0.1). In the random‑noise setting, the SDP recovers the planted cliques exactly whenever the clique sizes and noise probability satisfy the theoretical conditions. In the adversarial setting, recovery remains robust up to the prescribed edge budget. Failure cases occur when the noise density exceeds the derived thresholds or when cliques become too small for spectral separation.
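Checking "exact recovery" in experiments like these requires turning a solution matrix into clusters and comparing them with the planted ones. The sketch below shows one simple way to do that, assuming the solver returns a matrix whose entries are close to 0 or 1. The helper names and the 0.5 threshold are assumptions, not details taken from the paper.

```python
def clusters_from_matrix(X, threshold=0.5):
    """Group vertices whose thresholded rows have identical support.

    A group counts as a recovered clique only if it has at least two
    vertices and its shared support is exactly the group itself.
    """
    n = len(X)
    groups = {}
    for i in range(n):
        support = frozenset(j for j in range(n) if X[i][j] > threshold)
        groups.setdefault(support, []).append(i)
    return {frozenset(members)
            for support, members in groups.items()
            if len(members) >= 2 and support == frozenset(members)}


def exactly_recovered(X, labels, threshold=0.5):
    """True if the clusters read off X equal the planted cliques.

    labels[v] is the planted clique index of v, or None for a noise node.
    """
    planted = {}
    for v, lab in enumerate(labels):
        if lab is not None:
            planted.setdefault(lab, set()).add(v)
    target = {frozenset(c) for c in planted.values()}
    return clusters_from_matrix(X, threshold) == target


labels = [0, 0, 0, 1, 1, None]  # two planted cliques and one noise node
X_good = [[1.0 if labels[i] is not None and labels[i] == labels[j] else 0.0
           for j in range(6)] for i in range(6)]

# Corrupt one pair across cliques to mimic an imperfect solution.
X_bad = [row[:] for row in X_good]
X_bad[2][3] = X_bad[3][2] = 1.0

ok = exactly_recovered(X_good, labels)
failed = exactly_recovered(X_bad, labels)
```

Requiring the support to equal the group is deliberately strict: any stray large entry across cliques counts as a failure, which matches the all-or-nothing notion of exact recovery.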
Significance and implications
- The work demonstrates that a convex approach can solve a fundamentally combinatorial clustering problem exactly, provided the data exhibits a planted‑clique structure with sufficient size and limited contamination.
- By handling both stochastic and worst‑case noise, the analysis offers strong robustness guarantees, a rare feature in the literature on planted‑clique recovery.
- The dual‑certificate technique, combined with spectral norm bounds, provides a template for analyzing other graph‑based recovery problems (e.g., community detection, submatrix localization).
Limitations and future directions
- The requirement that each clique be at least on the order of √n limits applicability to datasets with many small clusters.
- The model assumes disjoint cliques; extensions to overlapping communities or heterogeneous clique sizes are non‑trivial.
- Scaling to very large graphs may demand more specialized SDP solvers or low‑rank approximations.
- Future work could explore tighter probabilistic bounds, incorporate side information (e.g., node attributes), or adapt the relaxation to hierarchical clustering frameworks.
In summary, the paper presents a rigorous convex‑optimization framework that exactly solves the planted k‑disjoint‑clique problem under realistic noise conditions, bridging a gap between theoretical guarantees and practical clustering algorithms.