StaticGreedy: solving the scalability-accuracy dilemma in influence maximization
Influence maximization, the problem of finding a set of seed nodes that triggers a maximized spread of influence, is crucial to viral marketing on social networks. Practical viral marketing on large-scale social networks requires influence maximization algorithms with both guaranteed accuracy and high scalability. However, existing algorithms suffer from a scalability-accuracy dilemma: conventional greedy algorithms guarantee accuracy at great computational expense, while scalable heuristic algorithms suffer from unstable accuracy. In this paper, we focus on solving this dilemma. We point out that its essential cause is a surprising fact: submodularity, a key requirement of the objective function for a greedy algorithm to approximate the optimum, is not guaranteed by the conventional greedy algorithms in the influence maximization literature. A greedy algorithm therefore has to run a huge number of Monte Carlo simulations to compensate for the unguaranteed submodularity. Motivated by this critical finding, we propose a static greedy algorithm, named StaticGreedy, which strictly guarantees the submodularity of the influence spread function during the seed selection process. The proposed algorithm reduces computational expense by two orders of magnitude without loss of accuracy. Moreover, we propose a dynamic update strategy which speeds up the StaticGreedy algorithm by 2-7 times on large-scale social networks.
💡 Research Summary
Influence maximization (IM) seeks a small seed set that triggers the largest possible spread of influence in a social network under stochastic diffusion models such as Independent Cascade (IC) or Linear Threshold (LT). The expected spread σ(S) is monotone and submodular, which guarantees that a greedy algorithm achieves a (1‑1/e) approximation to the optimal seed set. In practice, however, σ(S) cannot be computed exactly in polynomial time (the computation is #P‑hard); instead, Monte‑Carlo (MC) simulations are used to estimate it.
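To make the MC estimation step concrete, here is a minimal IC simulator on a toy graph; the graph, the uniform propagation probability `P`, and all function names are illustrative assumptions, not anything from the paper.

```python
import random

# Toy directed graph as adjacency lists; graph and probability are
# illustrative assumptions.
EDGES = {0: [1, 2], 1: [3], 2: [3], 3: []}
P = 0.5

def simulate_ic(seeds, p=P, rng=random):
    """One Independent Cascade run; returns the set of activated nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in EDGES[u]:
                # Each newly active node gets one chance to activate
                # each inactive out-neighbor with probability p.
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active

def estimate_spread(seeds, runs=10000, seed=42):
    """MC estimate of the expected spread sigma(seeds)."""
    rng = random.Random(seed)
    return sum(len(simulate_ic(seeds, rng=rng)) for _ in range(runs)) / runs
```

On this toy graph the exact value of σ({0}) is 1 + 0.5 + 0.5 + (1 − 0.75²) ≈ 2.44, so the estimate should land close to that.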
Existing greedy‑based IM methods (e.g., CELF, CELF++) repeatedly estimate the marginal gain of each candidate node by running thousands of MC simulations at every iteration. Because the set of random diffusion realizations changes from one iteration to the next, submodularity is no longer guaranteed for the sampled objective. Consequently, the greedy selection may deviate from the theoretical guarantee, and recovering accuracy requires dramatically increasing the number of simulations, leading to the well‑known scalability‑accuracy dilemma. On the other hand, scalable heuristics (degree, PageRank, community‑based) run fast but offer unstable accuracy and no worst‑case quality guarantees.
The paper identifies the root cause of the dilemma: dynamic sampling destroys submodularity. To address this, the authors propose a StaticGreedy algorithm that fixes a collection of diffusion realizations before the greedy selection begins. The procedure consists of two phases.
- Static sampling phase – Generate R independent diffusion instances (also called “live‑edge graphs”) on the original network. For each node v, record the set of nodes that become reachable when v is the sole seed in each instance. This preprocessing costs O(R·|E|) time and O(R·|V|) memory, but it is performed only once.
- Greedy selection phase – Starting with an empty seed set S, compute the marginal gain Δ(v|S) for every candidate v as the average increase in reachable nodes across the R fixed instances: Δ(v|S) = (1/R) Σ_r |Reach_r(v) \ Reach_r(S)|. Because the underlying objective σ_R(S) = (1/R) Σ_r |Reach_r(S)| is exactly submodular (a sum of submodular functions, one per static instance), the classic greedy algorithm retains the (1‑1/e) approximation guarantee.
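The two phases can be sketched as follows. This is a minimal, unoptimized sketch: the toy graph, parameter names, and DFS-based reachability are assumptions, not the paper's implementation.

```python
import random

# Toy directed graph as adjacency lists; graph, probability p, and
# function names are illustrative assumptions.
EDGES = {0: [1, 2], 1: [3], 2: [3], 3: []}

def live_edge_graphs(edges, p, R, seed=0):
    """Static sampling: keep each edge independently with probability p,
    producing R fixed live-edge instances."""
    rng = random.Random(seed)
    return [{u: [v for v in vs if rng.random() < p] for u, vs in edges.items()}
            for _ in range(R)]

def reach(adj, src):
    """Nodes reachable from src in one live-edge instance (iterative DFS)."""
    seen, stack = {src}, [src]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def static_greedy(edges, p, k, R=200):
    """Greedy seed selection on R fixed samples: marginal gains are averages
    of newly covered nodes over the same instances in every iteration."""
    instances = live_edge_graphs(edges, p, R)
    # Reach sets are computed once on the fixed samples.
    reach_sets = [{v: reach(adj, v) for v in edges} for adj in instances]
    seeds, covered = [], [set() for _ in range(R)]
    for _ in range(k):
        gain = {v: sum(len(rs[v] - cov) for rs, cov in zip(reach_sets, covered))
                for v in edges if v not in seeds}
        best = max(gain, key=gain.get)
        seeds.append(best)
        for rs, cov in zip(reach_sets, covered):
            cov |= rs[best]
    return seeds
```

Because every iteration evaluates gains on the same R instances, the sampled objective is a genuine coverage function and greedy's guarantee applies to it directly.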
A naïve implementation would still require O(R·|V|) work per iteration to recompute Δ(v|S). The authors therefore introduce a dynamic update strategy that sharply reduces this overhead. For each static instance r they maintain the set A_r of nodes already activated by the current seed set. When a new seed v is added, the algorithm grows A_r by the nodes newly reachable from v in instance r and simultaneously subtracts these nodes' contributions from the marginal gains of the remaining candidates. Only newly covered nodes are processed, so the per‑iteration cost is proportional to the incremental coverage rather than to the full candidate set. Empirically this delivers a 2–7× speed‑up on large graphs.
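One plausible realization of this update strategy, sketched under assumptions (the paper's exact data structures may differ), keeps a gain counter per candidate plus an inverted index from each node to the candidates whose reach sets contain it, and decrements counters only for newly activated nodes:

```python
from collections import defaultdict

def static_greedy_dynamic(reach_sets, nodes, k):
    """Greedy selection over fixed reach sets with incremental gain updates.

    reach_sets: one dict per static instance, mapping node -> set of nodes
    reachable from it in that instance. Rather than rescanning every
    candidate each round, gains are kept as counters and decremented only
    for nodes newly covered by the chosen seed.
    """
    R = len(reach_sets)
    gain = {v: sum(len(rs[v]) for rs in reach_sets) for v in nodes}
    # Inverted index per instance: node u -> candidates v with u in
    # Reach_r(v), i.e. the candidates whose gain drops when u is covered.
    coverers = [defaultdict(set) for _ in range(R)]
    for r, rs in enumerate(reach_sets):
        for v, targets in rs.items():
            for u in targets:
                coverers[r][u].add(v)
    seeds, covered = [], [set() for _ in range(R)]
    for _ in range(k):
        best = max((v for v in gain if v not in seeds), key=gain.get)
        seeds.append(best)
        for r, rs in enumerate(reach_sets):
            for u in rs[best] - covered[r]:      # only newly activated nodes
                covered[r].add(u)
                for v in coverers[r][u]:         # their contribution vanishes
                    gain[v] -= 1
    return seeds
```

Each node of each instance is moved into A_r at most once over the whole run, which is why the total update work is bounded by the total size of the reach sets rather than by k·R·|V|.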
The experimental evaluation covers several real‑world networks (LiveJournal, Orkut, Twitter, DBLP) and synthetic graphs (Barabási‑Albert, Watts‑Strogatz). StaticGreedy is compared against state‑of‑the‑art greedy methods (CELF++, TIM+, IMM) and simple heuristics. With a modest number of static samples (R = 100–200), StaticGreedy matches or slightly exceeds the influence spread of CELF++ while being over two orders of magnitude faster. On networks with more than one million nodes, the dynamic update variant further reduces runtime by a factor of 2–7 without increasing memory beyond the static sample storage.
Key contributions of the paper are:
- Problem diagnosis – Demonstrating that many published greedy IM algorithms inadvertently violate submodularity due to per‑iteration MC sampling.
- Algorithmic innovation – Proposing a static‑sample‑based greedy framework that guarantees submodularity of the sampled objective, thereby preserving the theoretical approximation bound.
- Efficiency engineering – Designing a dynamic marginal‑gain update mechanism that leverages the fixed samples to avoid recomputation, achieving practical scalability on massive graphs.
- Comprehensive validation – Providing extensive experiments that confirm both high solution quality and dramatic speed improvements.
Beyond influence maximization, the static‑sampling idea is applicable to any combinatorial optimization problem where the objective is a sum of submodular functions evaluated on random instances (e.g., sensor placement, coverage problems). Future work could explore adaptive sampling strategies that reduce R while maintaining tight approximation guarantees, or integrate learning‑based methods to prioritize the most informative diffusion realizations.
In summary, the paper resolves the longstanding scalability‑accuracy dilemma in influence maximization by reconciling the greedy algorithm’s theoretical foundations with practical efficiency through static sampling and clever incremental updates, delivering a method that is both provably near‑optimal and fast enough for real‑world, large‑scale social networks.