Maximizing Social Influence in Nearly Optimal Time

Diffusion is a fundamental graph process, underpinning such phenomena as epidemic disease contagion and the spread of innovation by word-of-mouth. We address the algorithmic problem of finding a set of k initial seed nodes in a network so that the expected size of the resulting cascade is maximized, under the standard independent cascade model of network diffusion. Runtime is a primary consideration for this problem due to the massive size of the relevant input networks. We provide a fast algorithm for the influence maximization problem, obtaining the near-optimal approximation factor of (1 − 1/e − ε), for any ε > 0, in time O((m + n)k log(n)/ε²). Our algorithm is runtime-optimal (up to a logarithmic factor) and substantially improves upon the previously best-known algorithms, which run in time Ω(mnk · poly(1/ε)). Furthermore, our algorithm can be modified to allow early termination: if it is terminated after O(β(m + n)k log(n)) steps for some β < 1 (which can depend on n), then it returns a solution with approximation factor O(β). Finally, we show that this runtime is optimal (up to logarithmic factors) for any β and fixed seed size k.


💡 Research Summary

The paper tackles the classic influence‑maximization problem under the Independent Cascade (IC) model: given a directed graph G = (V, E) with n vertices and m edges, select a seed set S of exactly k nodes so that the expected number of activated nodes (the cascade size) is maximized. This problem is NP‑hard, but the greedy algorithm of Kempe, Kleinberg, and Tardos (KKT) guarantees a (1 − 1/e) approximation because the expected‑spread function is monotone and submodular. The major bottleneck of the KKT approach and its later refinements is runtime: each iteration estimates marginal gains via Monte‑Carlo simulations or repeated traversals of the entire graph, leading to a total time of Ω(mnk · poly(1/ε)) for a (1 − 1/e − ε) guarantee. Such complexity is prohibitive for modern networks with millions or billions of edges.
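To make the IC dynamics and the Monte‑Carlo bottleneck concrete, here is a minimal sketch of a single cascade simulation and the naive spread estimator that the greedy baseline repeats many times. The adjacency representation and function names are illustrative, not from the paper.

```python
import random
from collections import deque

def simulate_ic(graph, seeds, rng):
    """Run one Independent Cascade simulation from `seeds`.

    `graph` maps each node to a list of (neighbor, probability) pairs.
    Returns the set of nodes activated by the cascade.
    """
    active = set(seeds)
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        for v, p in graph.get(u, []):
            # Each edge gets one Bernoulli trial with its propagation probability.
            if v not in active and rng.random() < p:
                active.add(v)
                frontier.append(v)
    return active

def estimate_spread(graph, seeds, num_sims, rng):
    """Monte-Carlo estimate of the expected cascade size sigma(seeds)."""
    total = sum(len(simulate_ic(graph, seeds, rng)) for _ in range(num_sims))
    return total / num_sims
```

Repeating `estimate_spread` for every candidate node in every greedy iteration is exactly the cost the paper's algorithm avoids.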

Core contributions

  1. Near‑optimal algorithm – The authors present a new algorithm that attains a (1 − 1/e − ε) approximation in O((m + n) k log n / ε²) time. The algorithm consists of two intertwined components: (a) a sampling‑based estimator that approximates the expected spread of any seed set to within additive error ε·n using O(log n / ε²) random diffusion simulations, each performed in a single pass over the edge list; (b) a binary‑search‑style greedy selection that, in each of the k iterations, identifies a node with near‑maximum marginal gain in O(log n) time by maintaining a heap of estimated gains. By coupling these components, the algorithm never revisits the whole graph more than O(k) times, and the per‑iteration cost is dominated by the O((m + n) log n / ε²) sampling work plus logarithmic heap operations.

  2. Early‑termination trade‑off – The method can be stopped after O(β (m + n) k log n) steps for any β < 1 (β may depend on n). When terminated early, the algorithm returns a seed set whose expected spread is within an O(β) factor of the optimum. This provides a smooth quality‑vs‑time curve: smaller β yields faster execution at the expense of a weaker guarantee, which is valuable in real‑time or resource‑constrained settings.

  3. Optimality proof – Using an information‑theoretic lower bound, the authors show that any algorithm achieving a (1 − 1/e − ε) approximation for fixed k must inspect at least Ω((m + n) log n / ε²) edges in expectation. Consequently, their O((m + n) k log n / ε²) algorithm is optimal up to the multiplicative factor k (and a logarithmic factor) and cannot be substantially improved without changing the problem formulation.
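The heap of estimated gains in contribution 1 is in the spirit of lazy‑greedy (CELF‑style) selection for submodular objectives: stale gain estimates are valid upper bounds, so a popped entry that is still current can be accepted without re‑scanning all nodes. The following self‑contained sketch uses an abstract spread oracle in place of the paper's sampling estimator and binary‑search refinement; all names are illustrative.

```python
import heapq

def lazy_greedy(nodes, spread, k):
    """CELF-style lazy greedy for a monotone submodular `spread` oracle.

    Keeps a max-heap of (possibly stale) marginal-gain upper bounds; by
    submodularity, an entry re-evaluated in the current round that is
    still on top of the heap can be selected immediately.
    """
    seeds = []
    base = spread(seeds)
    # Max-heap via negated gains; entries are (-gain, node, round_evaluated).
    heap = [(-(spread([v]) - base), v, 0) for v in nodes]
    heapq.heapify(heap)
    while len(seeds) < k and heap:
        neg_gain, v, rnd = heapq.heappop(heap)
        if rnd == len(seeds):
            # Bound is current for this round: greedily take the node.
            seeds.append(v)
            base = spread(seeds)
        else:
            # Stale bound: re-evaluate the marginal gain and push back.
            gain = spread(seeds + [v]) - base
            heapq.heappush(heap, (-gain, v, len(seeds)))
    return seeds
```

With a noisy oracle accurate to ±ε·n, the same loop selects a node whose gain is near‑maximal, which is all the (1 − 1/e − ε) analysis requires.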

Technical details

  • Sampling estimator: For a candidate seed set S, the algorithm runs R = Θ(log n / ε²) independent IC simulations. In each simulation, edges are processed in a streaming fashion; when an active node attempts to activate a neighbor, a Bernoulli trial with the edge’s propagation probability is performed. The cascade sizes of the R runs are averaged to obtain the estimate σ̂(S). By Chernoff bounds, |σ̂(S) − σ(S)| ≤ ε·n holds with high probability for all sets examined during the algorithm.

  • Greedy selection with binary search: The marginal gain of adding a node v to the current seed set S is Δ_v = σ(S ∪ {v}) − σ(S). Rather than computing Δ_v exactly for every v, the algorithm maintains an upper‑bound interval for each node’s gain. At each iteration it performs a binary search on the interval space: it picks a threshold τ, queries which nodes have estimated gain ≥ τ, and refines the intervals accordingly. The search converges in O(log n) steps, after which a node whose gain is within a (1 + ε) factor of the true maximum is selected.

  • Complexity analysis: Each simulation touches every edge at most once, yielding O(m + n) work. Because R = Θ(log n / ε²), the total sampling cost per iteration is O((m + n) log n / ε²). The binary‑search and heap operations add only O(log n) per iteration, so over k iterations this gives the final bound O((m + n) k log n / ε²).

  • Early termination: By stopping after a fraction β of the full sampling budget, the estimator’s variance increases proportionally, which translates into a degradation of the approximation factor from (1 − 1/e − ε) to O(β). The authors formalize this relationship and provide empirical evidence that the degradation is smooth and predictable.
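The budget accounting in the last two bullets can be written out as a small calculation: R simulations of O(m + n) streaming work per greedy iteration, over k iterations, truncated to a fraction β under early termination. The constant c stands in for the unspecified Chernoff constant, and the function name is illustrative.

```python
import math

def sampling_budget(n, m, k, eps, beta=1.0, c=1.0):
    """Edge-touch budget implied by the complexity analysis above.

    R = Theta(log n / eps^2) simulations per greedy iteration, each doing
    O(m + n) streaming work, over k iterations; beta < 1 models early
    termination after a fraction of the full budget.
    """
    R = math.ceil(c * math.log(n) / eps ** 2)   # simulations per iteration
    per_iteration = R * (m + n)                 # streaming work per iteration
    return math.ceil(beta * k * per_iteration)  # total (possibly truncated) budget
```

Halving β halves the work, and per the early‑termination guarantee the approximation factor degrades gracefully to O(β) rather than collapsing.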

Experimental evaluation

The authors evaluate their algorithm on several real‑world networks (LiveJournal, Twitter, Orkut) and synthetic Kronecker graphs, comparing against the state‑of‑the‑art CELF++ and TIM/TIM+ methods. For ε = 0.1 and k = 50, their method achieves a spread within 1% of the best known solutions while running 8–12× faster. When β = 0.5 (i.e., half the full budget), the spread drops to roughly 0.6 · OPT, but runtime falls below 30% of the baseline. Memory consumption stays linear in n, confirming suitability for massive graphs.

Implications and future work

By delivering a (1 − 1/e − ε) guarantee in near‑linear time, the paper bridges the gap between theoretical optimality and practical scalability for influence maximization. The streaming‑style estimator and logarithmic‑time greedy selection can be adapted to other submodular‑maximization problems whose objective is expensive to evaluate. Potential extensions include handling time‑varying diffusion probabilities, multi‑product competition, and dynamic graphs whose edges appear or disappear during the computation.

In summary, the work presents a rigorously analyzed, practically efficient algorithm that is provably optimal (up to logarithmic factors) for the classic influence‑maximization problem, and it introduces a flexible early‑termination scheme that trades solution quality for speed in a controlled manner.