RCELF: A Residual-based Approach for Influence Maximization Problem
Experimental Evaluation
In this section we present and evaluate our empirical findings. Section 7.1 describes the experimental setting. Section 7.2 compares our proposal with existing approximate approaches on real datasets and investigates the effectiveness of the optimization technique.
Experimental Setting
We use five widely-used benchmarks for . Table [tab:dataset] provides details on the number of nodes, edges, average degree, data type and sources of each dataset.
We compare our proposal with four methods, namely, , , and . We omit and as they are infeasible for the datasets in Table [tab:dataset]. The source codes of , and are obtained from the authors' homepages. We implement and in C++. All experiments run on CentOS 7.4 with an Intel Xeon E5-2620 (2.1GHz) CPU and 80GB memory.
We consider the widely used Weighted Cascade (), Independent Cascade () and Linear Threshold () models. In conventional and models, the weight of each edge $`w(u,v)`$ is set to $`1/|\mathsf{In}(v)|`$. We use $`\rho/|\mathsf{In}(v)|`$ in this work, where $`\rho`$ is a tunable parameter. For the model, we follow the settings in , e.g., the weight of each edge is $`0.001`$.
To guarantee the same result quality, we set $`\epsilon =0.1`$ in , and our proposal , $`\delta=1/|V|`$, and $`r=200`$ for and . The performance metrics are execution time and memory consumption. Each plotted value is the average over 20 runs.
Performance Evaluation
Figures [fig:k=5varyp_time_wc], [fig:k=5varyp_memory_wc] and 1 show the execution time, memory consumption and result quality of each approach as $`\rho`$ varies from $`0.1`$ to $`1.3`$, respectively. In terms of execution time, outperforms , , and for all $`\rho`$ values in the four datasets, as illustrated in Figure [fig:k=5varyp_time_wc]. However, and perform as well as when $`\rho=1`$ in and (see Figure [fig:k=5varyp_time_wc](c) and (d)). This confirms our analysis in Section 5.2, i.e., and are quite good when $`\rho=1`$.
We measure the memory consumption of each approach by varying $`\rho`$ in the four datasets in Figure [fig:k=5varyp_memory_wc]. Since the space complexity of is $`O(n+m)`$, the memory consumption of is lower than that of all other competitors for all $`\rho`$ values from 0.1 to 1.3 in all datasets. Interestingly, the memory consumption of and rises as $`\rho`$ varies from 1 to 0.1, and from 1 to 1.3, in the two large datasets (see Figure [fig:k=5varyp_memory_wc](c) and (d)). The reason is that the memory consumption of and relies heavily on the propagation probability of each edge. Specifically, and require more bootstrap iterations when $`\rho`$ is small, so their memory consumption rises as $`\rho`$ decreases from $`1`$ to $`0.1`$. When $`\rho`$ rises from $`1.0`$ to $`1.3`$, the number of nodes in each set generated by and increases dramatically due to the strong connectivity of the underlying social network, so the size of each set grows.
The result quality of all approaches, varying $`\rho`$ in the four datasets, is illustrated in Figure 1. The practical spread results of the different approaches are similar, as all approaches guarantee the same approximation ratio.
We test the effect of $`k`$ by fixing $`\rho=0.1`$ in all four datasets. Figure [fig:p=0.1varyk_time_wc] shows the execution cost of each method, varying $`k`$ from 1 to 100. Observe that our proposal consistently outperforms all competitors. Specifically, our proposal is up to 86.1, 149.7, 72.1 and 47.8 times faster than in , , , and , respectively. It is 134, 116, 86.8 and 57.5 times faster than , respectively. Besides, the performance of is worse than that of . Although is proposed to improve , it costs more time since it is sensitive to the connectivity of the graph. Unfortunately, graph connectivity is poor under the model, so the overhead of generating DAGs in outweighs its benefits.
The memory consumption of each approach in the four datasets is illustrated in Figure [fig:p=0.1varyk_memory_wc]. 's memory requirement is stable with regard to $`k`$ in all datasets, as the space complexity of our approach is $`O(m+n)`$. It requires the least memory among all competitors. For example, when $`k=5`$, , and require 1,548MB, 1,625MB and 47MB of memory on , respectively. In particular, consumes more than 27GB of memory when $`k=1`$ to return a solution on an 18MB dataset .
Figure 2 shows the spread values of the different approaches in the four datasets, varying $`k`$ from $`1`$ to $`100`$. As expected, our proposal performs as well as the other competitors (e.g., , ) in all tested settings. This also verifies that guarantees the same approximation ratio as the other approximate approaches.
Figure [fig:p=0.001varyk_time_ic] shows the execution time of , , , , and by varying $`k`$ in the model, where $`w(u,v)`$ follows the setting of , i.e., $`w(u,v)=0.001`$. is one to two orders of magnitude faster than and among all four datasets, as shown in Figure [fig:p=0.001varyk_time_ic]. On the dataset, cannot return results as it incurs extremely large memory consumption for all $`k`$ values. When $`k=1,2`$ and $`5`$, is also infeasible due to huge memory consumption, which confirms that performs worse when $`k`$ is small (see Figure [fig:p=0.001varyk_time_ic](d)).
and do not work with . Figure 3 shows the execution time of , , and by varying $`k`$ in the model. is better than, or at least comparable to, and in all four datasets.
For the sake of presentation, we omit the memory consumption and result quality results as they are similar to Figure [fig:p=0.1varyk_memory_wc] and Figure 2, respectively.
We verify the scalability of on the largest dataset (i.e., ) used in the literature, under the model. Figure 4(a) shows the execution cost of , , and by varying $`\rho`$ with $`k=5`$. outperforms the other competitors in all cases. We measure the execution time of , and by varying $`k`$ with $`\rho=0.1`$ in Figure 4(b). When $`k`$ is small, performs better than and . When $`k`$ is large, outperforms due to its superiority for large $`k`$. Besides, the memory consumption of is less than that of and , as incurs no extra memory overhead.
| (a) Varying ρ, k = 5 | (b) Varying k, ρ = 0.1 |
Here we evaluate the effectiveness of our proposed optimization technique (i.e., Lemma [lem:sigma]) with $`k=100`$ and $`\rho=0.1`$. As illustrated in Figure 5, the Lemma [lem:sigma] optimization offers savings of 1.0%, 26.7%, 21.7% and 40.9% (compared to running without Lemma [lem:sigma]) on , , and , respectively.
Residual-based Approach
Existing approximate approaches for compromise either time efficiency or memory consumption for result quality. In this section, we propose a novel residual-based approach (i.e., ) for to overcome this dilemma. We present the fundamental concepts of the approach in Section 6.1. In Section 6.2, we describe the backbone of and devise two performance optimization techniques for it. We conduct correctness, complexity and approximation analyses of in Section 6.3.
Approach
Generally, each node $`v \in V`$ in a social network contributes to the influence spread value of seed set , i.e., $`\sigma(`$$`)`$, by either being selected as a seed node or being influenced by other seed nodes. In this work, we propose a novel concept, node residual capacity, to capture the contribution of each node to the influence spread value. selects the node $`v \in V - \mathsf{S}`$ with the largest marginal gain (based on node residual capacity) as a seed node at each iteration. The residual capacity of each node diminishes during the seed node selection process. To capture the contribution of each node, we formally define the residual-based social network in Definition [def:rgraph].
Residual-based social network $`(V,E,W,C)`$ is a social graph $`(V,E,W)`$ (cf. Definition [def:sn]) with residual capacity set $`C`$. Initially, the residual capacity of each node $`v \in V`$ in $`C`$ is $`\mathsf{RC}(v)=1`$.
Given a diffusion model, the core subroutine of is marginal gain computation. Given seed set $`\mathsf{S}`$, the marginal gain of node $`u`$ is computed in the literature as $`mg(u|\mathsf{S}) = \delta(\mathsf{S} \cup \{u\}) - \delta(\mathsf{S})`$. In this work, computes the marginal gain $`mg(u|\mathsf{S})`$ by exploiting the residual capacity of every node during each node selection iteration, i.e., the contribution of each node $`v \in V - \mathsf{S}`$ to $`mg(u| \mathsf{S})`$.
Given a residual-based social graph $`(V,E,W,C)`$ and seed set $`\mathsf{S}`$, the marginal gain $`mg(u | \mathsf{S})`$ comes from two parts: (i) node $`u`$ itself, and (ii) the nodes which can be influenced by node $`u`$. Intuitively, the contribution of node $`u`$ to $`mg(u | \mathsf{S})`$ is its residual capacity $`\mathsf{RC}(u)`$. The contribution of the other nodes (i.e., $`\forall v \in V - (\mathsf{S} \cup \{u\})`$) to $`mg(u | \mathsf{S})`$ is a bit more intricate. In the subsequent sections, we present the marginal gain computation of with the and models, respectively.
| (a) S = ∅ | (b) S = {a} |
Marginal Gain Computation in Model
In the model, each node $`v`$ uniformly chooses a threshold $`\theta_v`$ from the range $`[0,1]`$. It can be activated if $`\sum_{v\text{'s activated neighbor } u} w(u,v) \geq \theta_{v}`$. Take Figure 6(a) as an example: the probability that node $`b`$ is activated by node $`a`$ is $`\mathsf{Pr}[w(a,b) \geq \theta_{b}] = \mathsf{Pr}[0.7 \geq \theta_{b}] = 0.7`$, as $`\theta_b`$ is uniformly chosen from $`[0,1]`$. In the $`\LT{}`$ model, $`\theta_u`$ is independent and identically distributed for all $`u \in V`$. Hence, given a path $`\mathsf{P} = \{v_1, v_2, \cdots, v_m \}`$, the probability that node $`v_m`$ is activated by node $`v_1`$ is $`\mathsf{Pr}_{v_1}(\mathsf{P})= \prod_{i=1}^{m-1}w(v_i, v_{i+1})`$. Definition [def:ltpathprob] formally defines the activation / influence probability in the model.
Given a residual network $`(V,E,W,C)`$, the probability that $`v_1`$ activates $`v_m`$ through path $`\mathsf{P} = \{v_1, v_2, \cdots, v_m \}`$ is $`\mathsf{Pr}_{v_1}(\mathsf{P})= \prod_{i=1}^{m-1}w(v_i, v_{i+1})`$. Let $`\mathbb{P}`$ be the set of all paths from node $`v_1`$ to $`v_m`$ in ; the overall probability that node $`v_{1}`$ influences node $`v_{m}`$ is $`\mathsf{Pr}(v_1,v_m)= \sum_{\mathsf{P} \in \mathbb{P}}\mathsf{Pr}_{v_1}(\mathsf{P})`$.
In Figure 6(a), node $`a`$ can reach node $`d`$ via three paths, i.e., $`\mathsf{P}_{1}=\{a, d\}`$, $`\mathsf{P}_{2}=\{a, c, d\}`$ and $`\mathsf{P}_{3}=\{a, b, c, d\}`$. The probability that node $`a`$ influences node $`d`$ is $`\mathsf{Pr}(a,d)=\mathsf{Pr}_{a}(\mathsf{P}_{1})+\mathsf{Pr}_{a}(\mathsf{P}_{2})+\mathsf{Pr}_{a}(\mathsf{P}_{3}) = 0.4+0.3*0.2+0.7*0.5*0.2=0.53`$.
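This path-sum computation can be reproduced with a short sketch. It is a minimal illustration using the edge weights of Figure 6(a); the function names are ours, not the authors'.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Probability that the source of a path activates its last node under LT:
// the product of the edge weights along the path (Definition [def:ltpathprob]).
double path_prob(const std::vector<double>& edge_weights) {
    double p = 1.0;
    for (double w : edge_weights) p *= w;
    return p;
}

// Overall influence probability: the sum of the path probabilities
// over all paths from the source to the target.
double influence_prob_lt(const std::vector<std::vector<double>>& paths) {
    double total = 0.0;
    for (const auto& path : paths) total += path_prob(path);
    return total;
}
```

With the three paths of the example encoded as weight lists `{0.4}`, `{0.3, 0.2}` and `{0.7, 0.5, 0.2}`, `influence_prob_lt` returns the 0.53 derived above.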
Definition [def:actProb] shows the contribution of node $`v`$ to marginal gain $`mg(u | \mathsf{S})`$ in model.
The contribution of node $`v`$ to marginal gain $`mg(u | \mathsf{S})`$ is $`\Phi(u,v)= \mathsf{RC}(u) \times \mathsf{Pr}(u,v)`$, capped at $`\mathsf{RC}(v)`$: $`\Phi(u,v) = \mathsf{RC}(v)`$ if $`\mathsf{RC}(u) \times \mathsf{Pr}(u,v) \geq \mathsf{RC}(v)`$.
In Figure 6(b), node $`c`$’s contribution to marginal gain $`mg(b|\{a\})`$ is $`\Phi(b,c) = \mathsf{RC}(b) \times\mathsf{Pr}(b,c) = 0.3*0.5=0.15`$.
Formally, given a residual social network $`(V,E,W,C)`$ and seed set , the marginal gain of node $`u`$ consists of (i) the residual capacity of node $`u`$, and (ii) the contributions of the other influenced nodes. Specifically,
mg(u | \mathsf{S}) = \mathsf{RC}(u) + \sum_{v \in V - (\mathsf{S} \cup \{u\})} \Phi(u,v).
In Figure 6(b), the marginal gain $`mg(b | \{a\}) = \mathsf{RC}(b) + \Phi(b,c) + \Phi(b,d) = 0.3 + 0.3*0.5+ 0.3*0.5*0.2 = 0.48`$.
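The marginal gain computation above can be sketched as follows, with the cap of Definition [def:actProb] applied per reachable node. The residual values 0.35 and 0.47 used for $`c`$ and $`d`$ are plausible placeholders for this illustration, not values taken from the paper; with them the cap is never triggered, so the result matches the 0.48 derived above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// LT marginal gain: mg(u|S) = RC(u) + sum_v Phi(u,v), where
// Phi(u,v) = min(RC(u) * Pr(u,v), RC(v))  (Definition [def:actProb]).
// `reachable` holds (Pr(u,v), RC(v)) for every node v reachable from u.
double marginal_gain_lt(double rc_u,
                        const std::vector<std::pair<double, double>>& reachable) {
    double mg = rc_u;  // contribution of u itself
    for (auto [pr_uv, rc_v] : reachable)
        mg += std::min(rc_u * pr_uv, rc_v);  // capped contribution of v
    return mg;
}
```

For $`mg(b|\{a\})`$: `marginal_gain_lt(0.3, {{0.5, 0.35}, {0.1, 0.47}})` evaluates to 0.3 + 0.15 + 0.03 = 0.48.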
Marginal Gain Computation in Model
Compared to the contribution of node $`v`$ to $`mg(u | \mathsf{S})`$ in the model, it is more complex in the model. The reason is that in the model an active node $`u`$ always contributes its full weight toward activating an inactive neighbor $`v`$, i.e., $`\theta_{v} = \theta_{v} - w(u,v)`$. In the model, however, an active node $`u`$ may fail to influence its inactive neighbor $`v`$, as the influence process is a random coin-flip process with bias $`w(u,v)`$.
To illustrate, consider the probability that node $`c`$ is activated by node $`a`$ in Figure 6(a) under both the and models. There are two paths from $`a`$ to $`c`$, i.e., $`\mathsf{P}_{1}=\{a,c\}`$ and $`\mathsf{P}_{2}=\{a,b,c\}`$, respectively. In the model, the influence probability is $`\Phi(a,c) = \mathsf{Pr}_{a}(\mathsf{P}_1) + \mathsf{Pr}_{a}(\mathsf{P}_2) = 0.3+0.35=0.65`$ by Definition [def:ltpathprob], i.e., node $`c`$ is activated if $`\theta_{c} \leq 0.65`$. In the model, however, node $`a`$ influences node $`c`$ with probability $`\mathsf{Pr}_{1}=w(a,c)=0.3`$ via path $`\mathsf{P}_1`$ and with probability $`\mathsf{Pr}_{2}= w(a,b) \times w(b,c)=0.35`$ via path $`\mathsf{P}_2`$. Thus, the total probability that node $`c`$ is activated by node $`a`$ is $`1-(1-\mathsf{Pr}_{1})(1-\mathsf{Pr}_{2}) = 1-0.7*0.65=0.545`$, since the two paths succeed or fail independently.
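The IC combination over independent paths can be sketched as follows (a minimal illustration; the function name is ours):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Under IC, edge-disjoint paths act as independent trials, so the overall
// activation probability is 1 - prod_i (1 - Pr_i) over the path probabilities.
double influence_prob_ic(const std::vector<double>& path_probs) {
    double fail = 1.0;  // probability that every path fails
    for (double p : path_probs) fail *= (1.0 - p);
    return 1.0 - fail;
}
```

`influence_prob_ic({0.3, 0.35})` reproduces the example: 1 − 0.7 × 0.65 = 0.545.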
To facilitate the discussion of node marginal gain contribution in the model, we divide the reachable nodes of $`u`$ into two groups: (i) the shared-nothing set $`\mathsf{SN}`$ and (ii) the shared-edge set $`\mathsf{SE}`$. For example, in Figure 6(a), node $`a`$'s reachable node $`c`$ is a shared-nothing node, as the paths from $`a`$ to $`c`$, i.e., $`\{a,c\}`$ and $`\{a,b,c\}`$, do not share any edge. Node $`b`$ is also one of $`a`$'s shared-nothing nodes. However, $`d`$ is a shared-edge node, as the paths $`\{a,b,c,d\}`$ and $`\{a,c,d\}`$ share a common edge $`e(c,d)`$. The contribution of each node $`v`$ in the shared-nothing set $`\mathsf{SN}`$ to the marginal gain $`mg(u|\mathsf{S})`$ is computed as follows.
Given a path $`\mathsf{P}=\{v_1, v_2, \cdots, v_m \}`$ in $`(V,E,W,C)`$ and seed set $`\mathsf{S}`$, the probability that node $`v_1`$ influences $`v_m`$ is $`\mathsf{Pr}_{v_1}(\mathsf{P})= \prod_{i=1}^{m-1}(\mathsf{RC}(v_i)w(v_i, v_{i+1}) )`$ in the model.
Suppose there is a set of paths $`\mathbb{P}`$ through which node $`u`$ can influence node $`v \in \mathsf{SN}`$ (i.e., $`v`$ is in the shared-nothing set). Since these paths are edge-disjoint and thus independent, the probability that node $`u`$ influences node $`v`$ is
\mathsf{Pr}(u,v) = 1-\prod_{\mathsf{p} \in \mathbb{P}} (1 - \mathsf{Pr}_{u}(\mathsf{P})).
The total contribution of node $`v`$ to $`mg(u | \mathsf{S})`$ in model is $`\Phi(u,v)= \mathsf{RC}(v) \times \mathsf{Pr}(u,v)`$.
In Figure 6(a), there are two paths from node $`a`$ to node $`c`$: $`\mathsf{P}_1=\{a,b,c\}`$ and $`\mathsf{P}_2=\{a,c\}`$. The contribution of $`c`$ to $`mg(a | \emptyset)`$ is $`\mathsf{RC}(c) \times \mathsf{Pr}(a,c) = 1.0 * (1-(1-0.3)*(1-0.7*0.5)) = 0.545`$.
For the nodes in the shared-edge set $`\mathsf{SE}`$, it is quite difficult to analyze the activation probability from an active node $`u`$ at step $`t`$ analytically. Fortunately, inspired by the approach, the activation probability of the nodes in the shared-edge set can be obtained by running $`r`$ Monte-Carlo simulations. We define the marginal gain contribution of each node in the shared-edge set in Definition [def:secontribute].
Given a residual-based graph $`(V,E,W,C)`$ and seed set $`\mathsf{S}`$, for each node $`v \in \mathsf{SE}`$, the influence probability $`\mathsf{Pr}(u,v)`$ is obtained by Monte-Carlo simulation. The contribution of node $`v`$ to $`mg(u | \mathsf{S})`$ in the model is $`\Phi(u,v)= \mathsf{RC}(v) \times (\mathsf{RC}(u) \mathsf{Pr}(u,v))`$.
Finally, the marginal gain $`mg(u | \mathsf{S})`$ in model can be computed by $`mg(u | \mathsf{S}) = \mathsf{RC}(u) + \sum_{v \in V - (\mathsf{S} \cup \{u\})} \Phi(u,v)`$.
Updating Node Residual Capacity
During the seed node selection procedure, suppose node $`u`$ is selected as the seed node at step $`t`$ (i.e., $`u = \argmax_{v \in V-\mathsf{S}} mg(v |\mathsf{S})`$). The residual capacity of every node $`v`$ reachable from $`u`$ is then updated as $`\mathsf{RC}(v) = \mathsf{RC}(v) - \Phi(u,v)`$; node $`v`$ is ignored in subsequent seed selections if $`\mathsf{RC}(v) \leq 0`$.
Implementation and Optimizations
In this section, we present the sketch of approach with two performance optimization techniques.
The sketch of our residual-based approach for with the and models is as follows:
-
Initialize the influence max-heap $`\mathcal{H}`$, which uses the marginal gain of each node as its key.
-
Identify the seed node $`u`$ (i.e., $`\argmax_{u \in V-\mathsf{S}} mg(u | \mathsf{S})`$) efficiently via $`\mathcal{H}`$, and insert it into the seed set: $`\mathsf{S} \leftarrow \mathsf{S} \cup \{u\}`$.
-
Update the residual capacity of each node $`v`$ in , i.e., $`\forall v \in V - \mathsf{S}`$, $`\mathsf{RC}(v) \leftarrow \mathsf{RC}(v) - \Phi(u,v)`$. Node $`v`$ is discarded from if $`\mathsf{RC}(v) \leq 0`$.
-
Repeat Steps (2) and (3) until $`k`$ seed nodes are selected.
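The steps above can be sketched as follows. This is a minimal illustration, assuming the influence probabilities $`\mathsf{Pr}(u,v)`$ have already been precomputed into a dense matrix; the function and variable names are ours, not the paper's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Residual-based greedy loop (Steps 1-4), with LT-style contributions
// Phi(u,v) = min(RC(u) * Pr(u,v), RC(v)). `pr[u][v]` is the precomputed
// influence probability from u to v (0 if v is unreachable from u).
std::vector<int> select_seeds(const std::vector<std::vector<double>>& pr, int k) {
    int n = pr.size();
    std::vector<double> rc(n, 1.0);       // residual capacities, initially 1
    std::vector<bool> in_seed(n, false);
    std::vector<int> seeds;
    for (int it = 0; it < k; ++it) {
        // Step 2: pick the node with the largest marginal gain.
        int best = -1;
        double best_mg = -1.0;
        for (int u = 0; u < n; ++u) {
            if (in_seed[u] || rc[u] <= 0.0) continue;  // exhausted capacity
            double mg = rc[u];
            for (int v = 0; v < n; ++v)
                if (v != u && !in_seed[v])
                    mg += std::min(rc[u] * pr[u][v], rc[v]);
            if (mg > best_mg) { best_mg = mg; best = u; }
        }
        if (best < 0) break;
        in_seed[best] = true;
        seeds.push_back(best);
        // Step 3: diminish the residual capacity of reachable nodes.
        for (int v = 0; v < n; ++v)
            if (v != best && !in_seed[v])
                rc[v] -= std::min(rc[best] * pr[best][v], rc[v]);
        rc[best] = 0.0;
    }
    return seeds;
}
```

For example, with three nodes where node 0 reaches nodes 1 and 2 with probability 0.9 each, node 0 is selected first, after which nodes 1 and 2 retain only 0.1 residual capacity.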
In the subsequent section, we improve the performance of by (1) proposing an efficient marginal gain computation algorithm and (2) reducing max-heap update cost.
Initialize map $`\Phi \leftarrow \emptyset`$
for $`i \leftarrow 1`$ to $`r`$ do
  Queue $`q.\textsf{enqueue}(u)`$
  while $`q`$ is not empty do
    node $`\textsf{tmp} \leftarrow q.\textsf{dequeue}()`$
    for each inactive out-neighbor $`v`$ of $`\textsf{tmp}`$ activated with probability $`\mathsf{RC}(v) \cdot w(\textsf{tmp},v)`$ do
      $`q.\textsf{enqueue}(v)`$
      $`\Phi(u,v) \leftarrow \Phi(u,v) + 1`$
$`mg(u | \mathsf{S}) \leftarrow \mathsf{RC}(u)`$
for each $`v`$ with $`\Phi(u,v) > 0`$ do
  $`\Phi(u,v) \leftarrow \Phi(u,v) / r *\mathsf{RC}(u)`$
  $`mg(u | \mathsf{S}) \leftarrow mg(u | \mathsf{S}) + \Phi(u,v)`$
Return $`mg(u | \mathsf{S})`$ and $`\Phi`$
Intuitively, we compute the marginal gain of each node as follows: for node $`u`$ and seed set , $`mg(u | \mathsf{S})`$ is initialized to $`\mathsf{RC}(u)`$; we enumerate all paths from $`u`$ to each node $`v \in V - \mathsf{S}`$, calculate $`\Phi(u,v)`$, and accumulate $`\Phi(u,v)`$ into $`mg(u | \mathsf{S})`$ in the model. However, the computation cost is exponential in the number of edges in . Hence, it is impractical for medium or large social networks. In addition, the contribution of shared-edge nodes cannot be computed exactly like that of shared-nothing nodes in the model. To address these issues, we devise a Monte-Carlo simulation based marginal gain computation algorithm (cf. Algorithm [alg:MConRG]). The main idea of Algorithm [alg:MConRG] is that it incorporates all $`\Phi(u,v)`$ computations into one batch Monte-Carlo simulation process. Algorithm [alg:MConRG] shows the exact steps of the $`mg(u|\mathsf{S})`$ computation in the model. In each Monte-Carlo simulation, it takes the residual capacity of each inactive node $`v`$ into consideration by flipping a coin with probability $`\mathsf{RC}(v) \cdot w(u,v)`$ (cf. Line [alg:rmc_act]) instead of only $`w(u,v)`$. $`\Phi(u,v)`$ counts the number of times node $`v`$ is activated among the $`r`$ simulations (cf. Line [alg:phiuv]). Finally, for each node $`v`$, $`\Phi(u,v)`$ is computed as in Definition [def:actProb] from Line [ln:phis] to Line [ln:phie]. It is worth noting that Algorithm [alg:MConRG] is applicable to both models; for the model, we only need to revise the node activation manner (cf. Line [alg:rmc_act]).
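A minimal sketch of this batch simulation follows, assuming an IC-style activation manner and an adjacency-list graph representation; the names (`estimate_mg`, `Edge`) and the structure are ours, a simplified rendering of Algorithm [alg:MConRG] rather than the authors' code.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <queue>
#include <random>
#include <vector>

struct Edge { int to; double w; };

// Monte-Carlo estimate of mg(u|S) on the residual graph: a node v is
// activated with probability RC(v) * w(tmp, v), and its activation count
// over r simulations yields Phi(u,v) = count/r * RC(u).
double estimate_mg(const std::vector<std::vector<Edge>>& adj,
                   const std::vector<double>& rc, int u, int r,
                   std::mt19937& gen) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::map<int, int> hits;  // activation counts per node v
    for (int i = 0; i < r; ++i) {
        std::vector<bool> active(adj.size(), false);
        std::queue<int> q;
        q.push(u);
        active[u] = true;
        while (!q.empty()) {
            int tmp = q.front(); q.pop();
            for (const Edge& e : adj[tmp])
                if (!active[e.to] && coin(gen) < rc[e.to] * e.w) {
                    active[e.to] = true;
                    q.push(e.to);
                    ++hits[e.to];
                }
        }
    }
    double mg = rc[u];  // contribution of u itself
    for (const auto& kv : hits)
        mg += (static_cast<double>(kv.second) / r) * rc[u];  // Phi(u,v)
    return mg;
}
```

On a deterministic chain 0 → 1 → 2 with all weights and residual capacities equal to 1, every simulation activates both downstream nodes, so the estimate is exactly 3.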
identifies the seed node $`u`$ (i.e., $`\argmax_{u \in V-\mathsf{S}} mg(u | \mathsf{S})`$) with the max-heap $`\mathcal{H}`$. The marginal gains of nodes in $`\mathcal{H}`$ must be recomputed as they become out-of-date after each seed selection iteration, i.e., the current marginal gain of node $`u`$ was computed with an out-of-date seed set, denoted by $`\mathsf{S}_o`$, whereas it should be $`mg(u | \mathsf{S})`$, where $`\mathsf{S}`$ is the latest seed set. Thus, the performance of the approach is very sensitive to the number of marginal gain computations performed in Step (2) to identify the next seed node $`\argmax_{u \in V-\mathsf{S}} mg(u | \mathsf{S})`$. Here, we propose an upper bound for the marginal gain of node $`u`$ with the latest seed set $`\mathsf{S}`$, denoted by $`\overline{mg}(u|\mathsf{S})`$. It reduces the number of marginal gain computations significantly.
For each node $`v \in V`$, let its residual capacity and marginal gain be $`\mathsf{RC}_{o}(v)`$ and $`mg(v| \mathsf{S}_o)`$ when the seed set is $`\mathsf{S}_o`$ at step $`t-1`$. If the seed set is $`\mathsf{S}`$ at step $`t`$ (i.e., $`\mathsf{S}_o \subset \mathsf{S}`$), then the upper bound $`\overline{mg}(u | \mathsf{S})`$ is $`\mathsf{RC}(u) + \mathsf{RC}(u) \sum_{v \in \mathsf{Out}(u)} w(u,v) * \mathsf{RC}_{o}(v) * mg(v| \mathsf{S}_o)`$.
Proof. For each node $`v \in V`$, $`\mathsf{RC}_{o}(v) \geq \mathsf{RC}(v)`$ as the residual capacity is diminishing during seed node selection process. With $`mg(v| \mathsf{S}_o) \geq mg(v| \mathsf{S})`$ where $`\mathsf{S}_o \subset \mathsf{S}`$ (by submodularity), we have:
\begin{align*}
\overline{mg}(u | \mathsf{S}) & = \mathsf{RC}(u) + \mathsf{RC}(u)\sum_{v \in \mathsf{Out}(u)} w(u,v) * \mathsf{RC}_{o}(v) * mg(v| \mathsf{S}_o) \\
& \geq \mathsf{RC}(u) + \mathsf{RC}(u)\sum_{v \in \mathsf{Out}(u)} w(u,v) * \mathsf{RC}(v) * mg(v| \mathsf{S}) \\
& \geq \mathsf{RC}(u) + \sum_{v \in V - (\mathsf{S} \cup \{u\})} \Phi(u,v)\\
& = mg(u| \mathsf{S})
\end{align*}
Thus, we have $`\overline{mg}(u | \mathsf{S})\geq mg(u | \mathsf{S})`$. ◻
Given $`mg(u | \mathsf{S}_o)`$ computed with seed set $`\mathsf{S}_o`$, the marginal gain upper bound $`\overline{mg}(u | \mathsf{S})`$ is first computed at constant cost, and the max-heap $`\mathcal{H}`$ is updated with $`\overline{mg}(u | \mathsf{S})`$. Algorithm [alg:MConRG] is invoked to compute the exact marginal gain $`mg(u|\mathsf{S})`$ if and only if the root of $`\mathcal{H}`$ is an upper bound $`\overline{mg}(u | \mathsf{S})`$. Inherently, Lemma [lem:sigma] works as a filter which avoids many expensive exact marginal gain computations.
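The bound of Lemma [lem:sigma] costs only one pass over $`u`$'s out-neighbors. A minimal sketch, using hypothetical stale values (the struct and function names are ours):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Stale per-out-neighbor data from step t-1: edge weight w(u,v),
// old residual capacity RC_o(v), and old marginal gain mg(v|S_o).
struct OutEdge { double w, rc_old, mg_old; };

// Upper bound of Lemma [lem:sigma]:
// mg_bar(u|S) = RC(u) + RC(u) * sum_v w(u,v) * RC_o(v) * mg(v|S_o).
double mg_upper_bound(double rc_u, const std::vector<OutEdge>& out) {
    double sum = 0.0;
    for (const auto& e : out) sum += e.w * e.rc_old * e.mg_old;
    return rc_u + rc_u * sum;
}
```

For instance, with $`\mathsf{RC}(u)=0.5`$ and a single out-neighbor with stale values $`w=0.4`$, $`\mathsf{RC}_o=0.8`$, $`mg=2.0`$, the bound is 0.5 + 0.5 × 0.64 = 0.82.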
Analysis
We analyze the properties of the influence spread function $`\delta(\cdot)`$ in in Theorem [the:delta], then prove the result accuracy guarantee of in Lemma [lem:ratio].
The influence spread function $`\delta(\cdot)`$ in with model is (i) non-negative, (ii) monotone, and (iii) submodular.
Proof. Suppose the seed nodes selected from the $`1`$st iteration to the $`k`$th iteration are $`v_1`$, $`v_2`$, $`\cdots`$, $`v_k`$. The corresponding seed sets are $`\mathsf{S}_1, \mathsf{S}_2, \cdots, \mathsf{S}_k`$, and $`\mathsf{S}_0 = \emptyset`$. Thus, $`\delta(\mathsf{S}_k) = mg(v_k | \mathsf{S}_{k-1}) + \delta(\mathsf{S}_{k-1}) = \sum_{i=1}^{k} mg(v_i | \mathsf{S}_{i-1})`$. In the approach, $`mg(v_i | \mathsf{S}_{i-1}) \geq 0`$ for all $`i \in [1, k]`$. Then, $`\delta(\mathsf{S}) \geq 0`$ and $`\forall~\mathsf{S} \subseteq \mathsf{S}'`$, $`\delta(\mathsf{S}) \leq \delta(\mathsf{S}')`$. Hence, $`\delta(\cdot)`$ in is (i) non-negative and (ii) monotone.
Since the residual capacity of each node $`v`$ will be diminished, cf. Step (3), during seed node selection process, $`\forall~ \mathsf{S} \subset \mathsf{S}'`$ we have
\begin{align*}
& mg(u | \mathsf{S}) \geq mg(u | \mathsf{S}') \\
\Rightarrow ~& \delta(\mathsf{S}) + mg(u | \mathsf{S}) - \delta (\mathsf{S}) \geq \delta(\mathsf{S}') + mg(u | \mathsf{S}') - \delta (\mathsf{S}') \\
\Rightarrow ~&\delta(\mathsf{S} \cup \{ u \}) -\delta (\mathsf{S}) \geq \delta(\mathsf{S}' \cup \{ u \}) - \delta(\mathsf{S}')
\end{align*}
Thus, $`\delta(\cdot)`$ in is submodular, which completes the proof. ◻
We then show the result accuracy guarantee of in Lemma [lem:ratio] as follows.
Let the size-$`k`$ set $`\mathsf{S}^*`$ be the optimal seed set of , i.e., $`\delta(\mathsf{S}^*)`$ is maximal over all $`k`$-element sets. returns a size-$`k`$ set $`\mathsf{S}`$ which guarantees
\delta(\mathsf{S}) \geq (1-1/e)*\delta(\mathsf{S}^*).
Proof. Since the influence spread function $`\delta(\cdot)`$ in is non-negative, monotone and submodular (cf. Theorem [the:delta]), it guarantees that $`\mathsf{S}`$, returned from , provides a $`(1 - {1}/{e})`$-approximation ratio, as proved in . ◻
Inherently, works similarly to and . For example, is built upon the residual-based graph, and the residual capacity diminishing process is similar to removing nodes and edges in the generated snapshots of and . Inspired by , we analyze the number of Monte-Carlo simulations necessary for to guarantee a $`(1-1/e-\epsilon)`$-approximate result as follows.
By setting Monte-Carlo simulation times $`r=O(\frac{n^2\log n\log \binom{n}{k}}{\epsilon^2})`$, achieves a $`(1-1/e-\epsilon)`$-approximation ratio with probability $`1-1/n`$.
Proof. Let $`\mathcal{S}`$ be the set of every possible result set $`\mathsf{S}`$, so $`|\mathcal{S}| = \binom{n}{k}`$. For any result set instance $`\mathsf{S} \in \mathcal{S}`$, let $`\delta_{i} (\mathsf{S})`$ be $`\mathsf{S}`$'s influence spread value in the $`i`$-th Monte-Carlo simulation, $`\delta_{i}(\mathsf{S}) \in [1, n]`$. Let $`\bar{\delta}(\mathsf{S}) = \frac{1}{r} \sum_{i=1}^{r}\delta_{i}(\mathsf{S})`$. By applying Hoeffding's inequality (Theorem 3 in ) with a union bound over $`\mathcal{S}`$, $`\frac{1}{n}|\bar{\delta}(\mathsf{S}) - \delta(\mathsf{S})| \leq \epsilon`$ holds for all $`\mathsf{S} \in \mathcal{S}`$ with probability at least $`1-2 \binom{n}{k} e^{-2 r \epsilon^2}`$. By choosing $`r = O(\frac{n^2\log n\log \binom{n}{k}}{\epsilon^2})`$, the above conclusion holds with probability at least $`(1 - 1/n)`$. By applying Lemma [lem:ratio], we have $`\bar{\delta}(\mathsf{S}) \geq (1-1/e-\epsilon)*\delta(\mathsf{S}^*)`$ with $`r`$ Monte-Carlo simulations. ◻
Conclusion
In this paper, we discuss existing approximate solutions for , which compromise time efficiency or memory consumption for approximate result quality. To address this, we propose a residual-based algorithm for , which concurrently achieves good time efficiency, low memory consumption and guaranteed approximate result quality under the generalized and models. Besides, we propose several optimizations to accelerate . We demonstrate the superiority of on standard real benchmarks. We plan to extend to the Triggering model and the Time-Aware model in future work.
Influence Maximization Problem
In this section, we first define the influence maximization problem () formally. Then, we conduct extensive preliminary experiments on representative approximate approaches and present our findings on existing approaches.
Problem Definition
We introduce several fundamental concepts for influence maximization problem () first.
(Social Network) A social network is a graph $`(V,E,W)`$, where $`V`$ $`(|V|=n)`$ is the set of nodes, $`E \subseteq V \times V`$ $`(|E|=m)`$ is the set of directed edges, and $`W`$ is the set of weights of the edges in $`E`$.
The weight of edge $`(u,v)`$ is $`w(u,v)`$; $`u`$ is an incoming neighbor of $`v`$ and, vice versa, $`v`$ is an outgoing neighbor of $`u`$. $`\mathsf{In}(v)`$ and $`\mathsf{Out}(v)`$ denote the incoming and outgoing neighbor sets of node $`v`$, respectively. Given a social network, the is to select a small but effective influential user set which spreads the influence in the network as widely as possible. We formally define seed node (i.e., influential user) in Definition [def:seedset].
(Seed Node) Node $`v \in V`$ is a seed node if it acts as a source of information diffusion in the social network $`(V,E,W)`$. The set of seed nodes is called the seed set, denoted by .
Given a social network $`(V,E,W)`$ and seed set , the influence of seed set is the total number of activated nodes under a specified diffusion model , denoted by . It includes both the nodes newly activated during the information diffusion process and the initial seed set . The information diffusion process is a stochastic process; the goal of is to maximize the expected influence value, as stated in Problem [prob:imp].
(Influence Maximization Problem, ) Given a social network $`(V,E,W)`$, an integer $`k`$, and a diffusion model , the influence maximization problem is to select a size-$`k`$ seed set $`\subseteq V`$ such that the expected influence value $`\sigma(`$$`)=`$ $`\mathbb{E}(`$$`)`$ is maximized.
The information diffusion model defines the exact manner in which seed set spreads information. For example, each user $`u`$ activated at step $`t`$ will activate each of its inactive outgoing neighbors $`v`$ at step $`t+1`$ with influence probability $`p_{u,v}`$ in the Independent Cascade () and Weighted Cascade () models. In the Linear Threshold () model, each edge $`(u,v)`$ has a weight $`p_{u,v}`$ and each node $`v`$ has a threshold $`\theta_{v}`$. Node $`v`$ is activated if a "sufficient" number of its incoming neighbors are active, i.e., $`\sum_{v\text{'s active neighbors}~u} p_{u,v} \geq \theta_v`$.
One of the core components of a diffusion model is determining the influence probability/weight of each edge in the social network. The commonly used influence probability assignment method is weighted cascade () . In particular, all incoming neighbors of $`v`$ influence $`v`$ with equal probability $`1/|\mathsf{In}(v)|`$. However, this assignment method ignores the activeness of users in practical social networks, e.g., 73% of Twitter users use Twitter less than once per day. Such users probably cannot be influenced even if all their incoming neighbors are activated. Conversely, in other applications users can be easily influenced by one or a few of their neighbors. For example, users of Pinduoduo, an online shopping website, can invite their friends to form a shopping team to get a lower price for their purchase. Thus, an inactive user can be easily influenced by one of her activated friends.
To overcome the above limitation of the commonly used probability assignment method, i.e., $`1/|\mathsf{In}(v)|`$, we propose a generalized probability assignment method in this work. Specifically, the probability that node $`u`$ activates node $`v`$ via edge $`(u,v)`$ is $`p_{u,v}=\rho/|\mathsf{In}(v)|`$, where $`\rho`$ reflects the activeness of the users. The advantages of the generalized probability assignment method are two-fold: (i) $`\rho`$ is tunable: it recovers the conventional setting when $`\rho=1`$, and it is more general as $`\rho`$ can be set by users or learned from training data; and (ii) it still enjoys the properties of the conventional information diffusion models (e.g., , , and ).
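A minimal sketch of the generalized assignment (the function name is ours):

```cpp
#include <cassert>
#include <cmath>

// Generalized weighted-cascade assignment: every incoming neighbor u of v
// activates v with probability p(u,v) = rho / |In(v)|. Setting rho = 1
// recovers the conventional WC assignment 1 / |In(v)|.
double edge_prob(double rho, int in_degree_v) {
    return rho / in_degree_v;
}
```

For a node with four incoming neighbors, conventional WC ($`\rho=1`$) assigns 0.25 per edge, while $`\rho=0.5`$ models a less active user with 0.125 per edge.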
In order to facilitate the subsequent discussion, we briefly summarize the properties of in this section.
The problem of influence maximization, as defined in Problem [prob:imp], is NP-hard under and model.
In addition to the above hardness of , we present two nice properties of , monotonicity and submodularity in Theorem [the:monotone] and [the:submodular], respectively.
The resulting influence function $`\sigma(\cdot)`$ is monotone as for any $`\mathsf{S}' \subset \mathsf{S}`$, we have $`\sigma(\mathsf{S}') \leq \sigma(\mathsf{S})`$.
For an arbitrary instance of the or model, the resulting influence function $`\sigma(\cdot)`$ is submodular. In other words, for any $`\mathsf{S}' \subset \mathsf{S}`$ and $`v \notin \mathsf{S}`$, we have $`\sigma(\mathsf{S} \cup v) - \sigma(\mathsf{S}) \leq \sigma(\mathsf{S}' \cup v) - \sigma(\mathsf{S}')`$.
The marginal gain of node $`v`$ w.r.t. seed set $`\mathsf{S}`$ is $`mg(v | \mathsf{S}) = \sigma(\mathsf{S} \cup v) - \sigma(\mathsf{S})`$. We omit the proofs of Theorem [the:hardness], [the:monotone] and [the:submodular], and refer interested reader to .
| (a) Social network | (b) idea illustration |
Approximate Approaches for
Due to the hardness of finding the optimal solution for (cf. Theorem [the:hardness]), a plethora of techniques with theoretical approximation bounds have been proposed for . In this section, we briefly introduce the key ideas of each category of approximate approaches.
For each approach, we conduct extensive preliminary experiments and present the experimental findings to reveal the underlying issues.
is the first approach which employs the Monte-Carlo simulation method to address . The sketch of is shown in Algorithm [alg:greedy]. selects the node with the largest marginal gain, estimated by Monte-Carlo simulation (Line [alg:every]), during each node selection iteration. To reduce the pain of unguaranteed submodularity caused by noisy Monte-Carlo estimates, runs the simulation $`r`$ times for each node $`v`$; typically, $`r`$ is 10,000 or 20,000 . Thus, the time cost of is extremely high.
1. Initialize seed set $`\mathsf{S} \leftarrow \emptyset`$
2. For $`i \leftarrow 1`$ to $`k`$:
3. $`\quad \mathsf{S} \leftarrow \mathsf{S} \cup \argmax_{\forall v \in V}\{\sigma(\mathsf{S} \cup v) - \sigma(\mathsf{S})\}`$ under
4. Return $`\mathsf{S}`$
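As a concrete illustration, the greedy skeleton above can be sketched as follows; the adjacency-list representation, the toy graph `G`, and the small `r` are illustrative choices (real deployments use $`r`$ around 10,000), not details from the paper.

```python
import random

def simulate_ic(graph, seeds, rng):
    """One IC cascade: each newly activated u gets a single chance to
    activate each inactive out-neighbor v with probability w(u, v)."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v, p in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def sigma(graph, seeds, r=1000, seed=0):
    """Estimate the expected spread by averaging r Monte-Carlo cascades."""
    rng = random.Random(seed)
    return sum(simulate_ic(graph, seeds, rng) for _ in range(r)) / r

def greedy(graph, k, r=1000):
    """Algorithm sketch: in each of k iterations, add the node with the
    largest estimated marginal gain."""
    nodes = set(graph) | {v for es in graph.values() for v, _ in es}
    S = set()
    for _ in range(k):
        base = sigma(graph, S, r)
        best = max(nodes - S, key=lambda v: sigma(graph, S | {v}, r) - base)
        S.add(best)
    return S

# toy graph: adjacency list of (out-neighbor, probability) pairs
G = {'a': [('b', 0.5)], 'b': [('c', 0.5)], 'c': [], 'd': [('c', 0.5)]}
```

The nested loop over all candidate nodes, each requiring $`r`$ fresh simulations, is exactly where the prohibitive cost comes from.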
is devised to improve the time efficiency of . It exploits the submodularity of (cf. Theorem [the:submodular]) to avoid a large number of unnecessary marginal gain computations. Consider social network $`\mathsf{G}`$ in Figure 7(a). first finds the node with the largest marginal gain at the beginning (i.e., $`\mathsf{S}=\emptyset`$), maintaining a max-heap of the marginal gains of all nodes w.r.t. seed set $`\mathsf{S}`$. Seed set $`\mathsf{S}`$ becomes $`\{c\}`$ after the first seed selection iteration as $`mg(c | \emptyset)`$ is the largest, and removes node $`c`$ from the max-heap, as illustrated in Figure 7(b-I). At the second iteration, pops the root of the max-heap, $`mg(b | \emptyset)=4`$, recomputes node $`b`$’s marginal gain w.r.t. seed set $`\mathsf{S} = \{ c \}`$, i.e., $`mg(b | \{c\})=1`$, and updates the max-heap accordingly (cf. Figure 7(b-II)). The max-heap root then becomes $`mg(d | \emptyset )= 3`$, and updates it to $`mg(d | \{ c \} )= 3`$. By the submodularity of , the stale marginal gain of every remaining node in the max-heap is an upper bound of its true marginal gain, so all of them are at most $`3`$. Thus, $`d`$ has the largest marginal gain w.r.t. seed set $`\{c\}`$, and it is selected at the second seed selection iteration, i.e., $`\mathsf{S}=\{c,d\}`$. In summary, works in a lazy manner: it recomputes the marginal gain of a node $`v`$ w.r.t. the latest seed set $`\mathsf{S}`$ only when necessary, e.g., it only recomputes the marginal gains of $`b`$ and $`d`$ at the second iteration in the above example. This avoids a large number of unnecessary marginal gain computations. Hence, is orders of magnitude (e.g., 700 times ) faster than . However, is still infeasible for large networks, e.g., a social network with 1 million nodes.
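The lazy-forward bookkeeping described above can be sketched with a standard min-heap holding negated gains. The coverage-style `marginal_gain` callback used below is a hypothetical stand-in for the Monte-Carlo estimator; the key point is that a popped entry is trusted only if it was computed against the current seed set.

```python
import heapq

def celf(nodes, marginal_gain, k):
    """Lazy-forward (CELF-style) selection: by submodularity, stale gains
    are upper bounds, so only the heap root's gain is ever recomputed."""
    S = []
    # heap entries: (-gain, node, seed-set size when the gain was computed)
    heap = [(-marginal_gain(v, []), v, 0) for v in nodes]
    heapq.heapify(heap)
    while len(S) < k and heap:
        neg_gain, v, rnd = heapq.heappop(heap)
        if rnd == len(S):      # gain is fresh w.r.t. the current S: select v
            S.append(v)
        else:                   # gain is stale: recompute and push back
            heapq.heappush(heap, (-marginal_gain(v, S), v, len(S)))
    return S
```

With a submodular gain function (e.g., set coverage), the selection is identical to exhaustive greedy, but most nodes are never re-evaluated.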
We test on NetHEPT with 15K nodes and 62K edges, and it does not return a size-50 seed set within 30 hours. Monte-Carlo simulation based approaches (i.e., and ) incur expensive computation time but low memory consumption, and provide theoretically guaranteed approximate solutions for .
| (a) Snapshot $`\mathsf{G}'_1`$ | (b) Snapshot $`\mathsf{G}'_2`$ |
Snapshots-based approaches ( and ) are proposed to improve the time efficiency of Monte-Carlo simulation based approaches. is the first approach that applies the snapshots idea to . Instead of running a large number of Monte-Carlo simulations as in and , samples $`r`$ snapshots of the input social network $`\mathsf{G}`$ via a coin-flipping technique: it flips a coin with bias $`p_{u,v}`$ for every edge in advance to produce the snapshots, e.g., Figure 8(a) and (b) are two snapshots of the original social graph in Figure 7(a). then selects seed nodes iteratively by (1) computing the marginal influence of each node $`v`$ as the average number of nodes reachable from $`v`$ over all snapshots, (2) selecting the node with the largest average marginal influence, and (3) removing the selected node and all its reachable nodes from all snapshots. For example, in Figure 8, node $`c`$ has the largest average number of reachable nodes (i.e., $`(6+4)/2=5`$), as its reachable sets in snapshots $`\mathsf{G}'_1`$ and $`\mathsf{G}'_2`$ are $`\{c,f,e,g,a,h\}`$ and $`\{c,f,e,h\}`$, respectively. Then selects node $`c`$ and removes its reachable nodes from Figure 8(a) and (b) before the second iteration.
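A minimal sketch of the snapshot pipeline above, assuming the same adjacency-list-with-probabilities representation as before (the helper names are ours, not the paper's):

```python
import random

def sample_snapshot(graph, rng):
    """Coin-flip each edge (u, v) with bias p(u, v); keep it on heads."""
    return {u: [v for v, p in es if rng.random() < p]
            for u, es in graph.items()}

def reachable(snap, v):
    """Nodes reachable from v in one snapshot (iterative DFS)."""
    seen, stack = {v}, [v]
    while stack:
        for w in snap.get(stack.pop(), []):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def snapshot_greedy(graph, k, r=200, seed=0):
    rng = random.Random(seed)
    snaps = [sample_snapshot(graph, rng) for _ in range(r)]
    covered = [set() for _ in snaps]   # already-influenced nodes per snapshot
    nodes = set(graph) | {v for es in graph.values() for v, _ in es}
    S = []
    for _ in range(k):
        def avg_gain(v):
            return sum(len(reachable(s, v) - c)
                       for s, c in zip(snaps, covered)) / r
        best = max(nodes - set(S), key=avg_gain)
        S.append(best)
        for s, c in zip(snaps, covered):   # step (3): remove reachable nodes
            c |= reachable(s, best)
    return S
```

All $`r`$ snapshots live in memory simultaneously, which is the source of the high memory footprint reported below.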
improves by reducing the memory consumption of the snapshots in . In particular, converts each snapshot into a Directed Acyclic Graph (DAG) by contracting its strongly connected components (SCCs). However, the extent of the space saving of depends on the connectivity of the original graph and its snapshots. and guarantee a $`(1-1/e-\epsilon)`$ approximation ratio, as proved in and , respectively.
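The SCC-to-DAG contraction can be illustrated with Kosaraju's algorithm; this recursive sketch only labels each DAG node with its component members (the component size would be its weight), and is meant for small illustrative snapshots only.

```python
def condense(snap):
    """Contract each strongly connected component (SCC) of a snapshot into
    one DAG node (Kosaraju's two-pass algorithm, recursive DFS)."""
    nodes = sorted(set(snap) | {v for vs in snap.values() for v in vs})
    rev = {v: [] for v in nodes}
    for u in nodes:
        for v in snap.get(u, []):
            rev[v].append(u)

    def dfs(u, adj, seen, out):
        seen.add(u)
        for v in adj.get(u, []):
            if v not in seen:
                dfs(v, adj, seen, out)
        out.append(u)

    order, seen = [], set()
    for u in nodes:                 # pass 1: finishing order on the snapshot
        if u not in seen:
            dfs(u, snap, seen, order)
    sccs, seen = [], set()
    for u in reversed(order):       # pass 2: DFS on the transpose
        if u not in seen:
            comp = []
            dfs(u, rev, seen, comp)
            sccs.append(sorted(comp))
    return sccs                     # each inner list becomes one DAG node
```

If the snapshot is mostly acyclic, each SCC is a singleton and little space is saved, which matches the connectivity caveat above.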
We run with $`r=200`$ on four benchmark datasets (cf. Section 7.1). Figure 9 shows the memory consumption of and the raw data size. The memory consumption of is 55X to 214X the raw data size. For example, LiveJournal is 0.5G in size, yet its memory consumption is 42.9G. This is unaffordable even for medium-sized social networks on commodity PCs with 16G or 32G memory. Snapshots-based approaches (i.e., and ) achieve good time efficiency at the price of huge memory consumption, and provide approximation-ratio-guaranteed solutions for .
Borgs et al. are the first to propose the reverse influence sampling () method for under the and models. The core concept in is the reverse reachable set ( set). Formally, the set of node $`v`$ is the set of nodes in $`\mathsf{G}`$ that can reach $`v`$, i.e., $`\forall u \in \mathsf{RR}(v)`$, there is a path from $`u`$ to $`v`$ in $`\mathsf{G}`$. The method includes two phases: (i) the sets generation phase, and (ii) the node selection phase. Consider running on the social network $`\mathsf{G}`$ in Figure 7(a). In the set generation phase, we first transpose $`\mathsf{G}`$ in Figure 7(a) into $`\mathsf{G}^T`$ in Figure 10(a). randomly picks a node in $`\mathsf{G}^T`$ and runs a Monte-Carlo simulation from it to generate its reachable set, e.g., $`\mathsf{RR}(e)=\{e,f,c\}`$. repeats the above procedure several times to generate the sets, as shown in Figure 10(b). In the node selection phase, solves the max-coverage problem to select $`k`$ nodes that cover the maximum number of sets generated in the previous phase. For example, node $`c`$ is selected first as it covers the maximum number of sets (i.e., 3) in Figure 10(b). The sets covered by node $`c`$ are then marked and ignored in subsequent seed selections. Theoretically, returns a $`(1-1/e-\epsilon)`$-approximation result with at least a constant probability once the total number of examined nodes and edges reaches a pre-defined threshold $`\tau`$, which indirectly reflects the number of generated sets.
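The two phases can be sketched as below, assuming the transpose graph is given as an adjacency list of (in-neighbor, probability) pairs; `theta` plays the role of a fixed sample count rather than the edge-examination threshold $`\tau`$, purely to keep the sketch simple.

```python
import random

def rr_set(rev_graph, v, rng):
    """One random reverse reachable set of v: BFS over the transpose
    graph, crossing each incoming edge with its propagation probability."""
    seen, frontier = {v}, [v]
    while frontier:
        nxt = []
        for u in frontier:
            for w, p in rev_graph.get(u, []):
                if w not in seen and rng.random() < p:
                    seen.add(w)
                    nxt.append(w)
        frontier = nxt
    return seen

def ris(rev_graph, k, theta=1000, seed=0):
    rng = random.Random(seed)
    nodes = sorted(set(rev_graph) |
                   {w for es in rev_graph.values() for w, _ in es})
    rrs = [rr_set(rev_graph, rng.choice(nodes), rng) for _ in range(theta)]
    # phase (ii): greedy max-coverage over the generated RR sets
    S, alive = [], list(range(len(rrs)))
    for _ in range(k):
        count = {}
        for i in alive:
            for u in rrs[i]:
                count[u] = count.get(u, 0) + 1
        if not count:            # every RR set is already covered
            break
        best = max(count, key=lambda u: (count[u], u))
        S.append(best)
        alive = [i for i in alive if best not in rrs[i]]
    return S
```

Note that every generated set must stay in memory until max-coverage finishes, which is why the method's footprint tracks the number and size of sets.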
| (a) Transpose graph ($`\mathsf{G}^T`$) | (b) Reverse reachable () sets |
There is a large hidden constant factor in the asymptotic time complexity of , which limits its practical efficiency. To address this, proposed Two Phase Influence Maximization (a.k.a. ), which returns a $`(1-1/e-\epsilon)`$-approximation solution with probability at least $`1-n^{-l}`$, and runs in $`O((k+l)(m+n) \log n/\epsilon^2)`$ time. samples a pre-determined number $`\theta`$ of sets, instead of indirectly controlling that number via the computation-cost threshold $`\tau`$ as in . Later, exploits a classical statistical tool (martingales ) to improve the parameter estimation phase of . Since the number of generated sets can be arbitrarily larger than the theoretical thresholds $`\theta`$ in and , Nguyen et al. (i) unify the necessary sample size of sets in to guarantee the $`(1-1/e-\epsilon)`$-approximation ratio, and (ii) propose to achieve the minimum number of set samples. Technically, and adopt a bootstrap strategy to probe the required number of sets. The procedure is: (1) initialize $`r`$ sets based on a given formula; (2) select a size-$`k`$ seed set by the max-coverage algorithm; (3) evaluate the coverage ratio of . If the coverage ratio does not meet the stopping condition, double the number of sets and repeat (2) and (3); otherwise, terminate and return . The time efficiency and memory consumption of and heavily depend on the number of generated sets. Theoretically, this number is determined by two parameters: (i) $`\epsilon`$, as a large number of sets must be generated to guarantee the theoretical bound when $`\epsilon`$ is small ; and (ii) $`\rho`$ (i.e., the influence probability of each edge is $`\rho/|`$$`|`$). Specifically, and perform well in the conventional model (i.e., $`\rho=1`$): since the influence probability from node $`u`$ to $`v`$ is $`1/|`$$`|`$, the expected number of $`v`$’s direct reverse reachable neighbors is 1. However, their running times increase dramatically when $`\rho`$ scales up to 1.5 .
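The three-step bootstrap loop can be sketched abstractly; the callbacks `gen_rr`, `select_k` and `coverage_ok` are hypothetical hooks standing in for the RR-set generator, the max-coverage routine, and the statistical stopping test, whose exact formulas we do not reproduce here.

```python
def bootstrap_sampling(gen_rr, select_k, coverage_ok, r0=100, max_rounds=20):
    """Bootstrap sketch of the doubling loop described above: keep doubling
    the RR-set sample until the selected seed set passes the stopping test.
    Returns the seed set and the final sample size."""
    rrs = [gen_rr() for _ in range(r0)]            # step (1): r0 initial sets
    for _ in range(max_rounds):
        S = select_k(rrs)                          # step (2): max-coverage
        cov = sum(1 for R in rrs if any(v in R for v in S)) / len(rrs)
        if coverage_ok(cov, len(rrs)):             # step (3): stopping test
            return S, len(rrs)
        rrs += [gen_rr() for _ in range(len(rrs))] # double the sample size
    return S, len(rrs)
```

The doubling explains the sensitivity to $`\rho`$: when the stopping test keeps failing, the sample (and hence the memory footprint) grows geometrically.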
| (a) In Twitter | (b) In dblp |
| (c) In Twitter | (d) In dblp |
We evaluate the performance of the and methods in the model by varying the influence probability $`\rho/|`$$`|`$ on and . As illustrated in Figure 11(a) and (b), the memory consumption of both and is unaffordable when $`\rho`$ scales up (or down). For example, the memory consumption of with $`\rho=0.1`$ is almost 12.3X and 16.8X that of with $`\rho=1.0`$ in Twitter and dblp, respectively. The time costs of and under varying $`\rho`$ are shown in Figure 11(c) and (d). Obviously, both approaches degrade significantly when the influence probability of each edge $`(u,v)`$, i.e., $`w(u,v) = \rho/|`$$`|`$, scales up or down. Both and perform well in terms of time efficiency and memory consumption under the commonly-used probability assignment (i.e., $`1/|`$$`|`$ ) in diffusion models. The reason is that the reverse influence sampling () technique implicitly exploits the fact that the expected number of node $`v`$’s direct reverse reachable neighbors is 1 in the conventional diffusion models. The running time and memory consumption of and are thus sensitive to the propagation probabilities (i.e., $`\rho/|`$$`|`$), and both are impractical when $`\rho`$ scales up (or down), as shown in Figure 11. Specifically, when $`\rho < 1`$, and require more bootstrap iterations, and each iteration doubles the number of sets; when $`\rho > 1`$, the number of nodes in each generated set increases dramatically due to the strong connectivity of the social network.
Other Related Works
The influence maximization problem was first addressed from an algorithmic perspective in . Beyond the approximate approaches discussed above, there are many heuristic-based approaches . We omit the details here and refer interested readers to a recent survey . Very recently, several works have been proposed for variants (e.g., online and adaptive ); we skip their discussion as the scope of this work is the conventional . Social networks (e.g., Facebook, Twitter, Weibo) have become an essential medium for the public. Viral marketing is widely used in social networks to promote new products or activities; for example, new products are advertised by influential users to other users through the “word-of-mouth" effect. Therefore, the problem of selecting a small but effective set of influential users, a.k.a. the Influence Maximization Problem (), is the key to successful viral marketing, and it has been widely studied in the literature. Mathematically, given a social network $`\mathsf{G}`$, a positive integer $`k`$ and a diffusion model , the Influence Maximization Problem () returns a size-$`k`$ node subset $`\mathsf{S}`$ (in $`\mathsf{G}`$) which has the maximum expected influence in $`\mathsf{G}`$. The diffusion model defines the exact manner of the “word-of-mouth" effect, e.g., each influential/activated user can activate its inactive neighbours with a certain probability in the Independent Cascade () diffusion model.
It is NP-hard to find the optimal size-$`k`$ set for under the Independent Cascade () and Linear Threshold () diffusion models . Due to the hardness of the problem, many approximation approaches and heuristic solutions have been proposed and extensively studied. However, all existing approaches (cf. Figure 12) trade off among time efficiency, memory consumption, and result quality . Since the empirical performance of the state-of-the-art approximation approaches (e.g., , ) is comparable to, or even better than, that of the state-of-the-art heuristic solutions (e.g., , ), we focus on approximate solutions for in this work.
We classify existing approximation approaches into three categories: Monte-Carlo Simulation, Snapshots, and Reverse Influence Sampling. We summarize the representative approaches of each category in Figure 13.
is the first to apply Monte-Carlo Simulation techniques to solve . It achieves a $`(1-1/e-\epsilon)`$-approximation ratio with probability $`1-1/n`$ if the number of Monte-Carlo simulations is $`\Theta(\epsilon^{-2}k^2n\log(n^2k))`$. improves the performance of by exploiting the submodularity property of , with the same approximation guarantee. However, and its variant (i.e., ) are not practical for large social networks, as the cost of Monte-Carlo simulations is rather expensive. In summary, Monte-Carlo simulation approaches achieve approximation-guaranteed results for at an extremely high time cost.
Snapshots-based approaches are proposed to improve the time efficiency of the Monte-Carlo Simulation-based approaches. is the first to employ snapshots to address the problem. It samples subgraphs $`\mathsf{G}_{i}`$ (a.k.a. snapshots) of the social network $`\mathsf{G}`$ in advance by retaining each edge with a probability equal to its weight. The influence of a node is estimated by averaging its influence over all snapshots. shrinks the snapshots of into vertex-weighted directed acyclic graphs (DAGs) by using the strongly connected components (SCCs) of the snapshots as the DAGs’ nodes, thereby reducing the memory consumption of . Both and guarantee a $`(1-1/e-\epsilon)`$-approximation ratio by sampling enough snapshots. However, the memory overheads of and are infeasible when the social network $`\mathsf{G}`$ is large. In conclusion, snapshots-based approaches provide approximation-guaranteed results for at the cost of dramatically high memory consumption (to store the snapshots).
is the first to propose Reverse Influence Sampling-based approaches for . The core idea of reverse influence sampling is to (i) construct reverse reachable sets for the nodes, and (ii) employ a greedy max-coverage algorithm to select seed nodes iteratively. The representative approaches are , and . All of them guarantee a $`(1-1/e-\epsilon)`$-approximation ratio with a certain number of generated reverse reachable sets. In general, the memory consumption of reverse influence sampling-based approaches is much smaller than that of snapshots-based approaches. However, it may still be unaffordable because (i) the reverse reachable set size is sensitive to the propagation probability of each edge in the social network, e.g., incurs a 12.3GB memory footprint for a 45.6MB dataset when the weighted cascade setting is slightly revised, as we elaborate in Section 5; and (ii) a large number of reverse reachable sets must be generated to guarantee the theoretical bound when $`\epsilon`$ is small . To make matters worse, all reverse reachable sets must be stored in memory for the max-coverage algorithm.
In general, diffusion models define how a node switches its status from inactive to active on a weighted graph, where the weight of each edge is an influence probability. For example, in the model, an active node $`u`$ has a single chance to influence its inactive neighbor $`v`$ with probability $`w(u,v)`$. In the literature, a commonly-used influence probability assignment method in both (i.e., ) and models is setting $`w(u,v) = 1/|\mathsf{In}(v)|`$ , where $`\mathsf{In}(v)`$ denotes the set of incoming neighbors of $`v`$. This assignment assumes a user can probably be activated only if all her incoming neighbors are active, in both the and models. However, it may not be practical in some real-world applications. For example, many users in Twitter (e.g., those who use Twitter less than once per day) are probably not influenced even if all their neighbors are active. Interestingly, there are also social networks where users can be influenced even if only one or a few of their neighbors are active, e.g., users in Pinduoduo (an online shopping website in China) can be easily influenced as they can form a shopping team to get a lower price for their purchases.
To overcome the above limitations of the commonly-used probability assignment method, i.e., $`1/|`$$`|`$, we propose a generalized probability assignment method in this work. Specifically, the probability that node $`u`$ activates node $`v`$ along edge $`(u,v)`$ is $`w(u,v)=\rho/|`$$`|`$, where $`\rho`$ reflects the activeness of the users. It is worth noting that the generalized and models reduce to the conventional and models when $`\rho=1`$.
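The generalized assignment is straightforward to compute from an edge list. Clipping the weight at 1.0 when $`\rho`$ exceeds the in-degree is our assumption (a probability cannot exceed 1), not a detail stated in the text.

```python
def assign_weights(edges, rho=1.0):
    """Generalized weighted-cascade assignment: w(u, v) = rho / |In(v)|,
    clipped to 1.0; rho = 1 recovers the conventional WC setting."""
    indeg = {}
    for _, v in edges:                      # |In(v)| = number of in-edges
        indeg[v] = indeg.get(v, 0) + 1
    return {(u, v): min(1.0, rho / indeg[v]) for u, v in edges}
```

For example, with edges $`(a,c)`$, $`(b,c)`$ and $`(a,b)`$, node $`c`$ has in-degree 2, so $`w(a,c)=\rho/2`$ while $`w(a,b)=\min(1,\rho)`$.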
In this paper, we propose a novel residual-based algorithm to overcome the deficiencies of existing approximation approaches. The core of is a novel marginal gain computation method based on probability theory. Specifically, we define the residual capacity of a node as the maximal contribution that the node can make to the influence spread of the seed set. Initially, the residual capacity of each node is $`1`$. During each seed selection iteration, the residual capacity of a node diminishes either by being selected as a seed node or by being influenced by the selected seed nodes. achieves excellent time efficiency as (i) it enjoys the benefits of the submodularity of the residual-based influence function and the cost-effective lazy-forward node selection manner, while requiring far fewer Monte-Carlo simulations; (ii) the number of nodes under consideration (i.e., those whose residual capacities are larger than 0) falls quickly during the seed selection process; and (iii) two optimizations are devised to speed up . Meanwhile, guarantees a $`(1-1/e)`$-approximation of the result, as elaborated in Section 6. From the memory consumption perspective, only stores the raw social network data; it incurs no extra memory consumption compared with snapshot-based and reverse influence sampling-based approaches. Thus, the space complexity of is optimal, i.e., $`O(n+m)`$. Moreover, is robust under the generalized probability assignment method, i.e., $`\rho/|\mathsf{In}(v)|`$, in both the and models, where $`\rho`$ is a tunable parameter reflecting the influence degree of each user in the social network. In summary, achieves excellent time efficiency, low memory consumption and approximation-guaranteed result quality for in widely used diffusion models (i.e., as shown in the center of Figure 12).
We summarize the comparison among our proposal and existing approximate approaches for in Table [tab:approximate]. Specifically, the contributions of this paper are summarized as follows.
-
We conduct comprehensive experiments to reveal the trade-off strategies among the state-of-the-art approximate approaches for (Section 5).
-
We propose a residual-based algorithm for to achieve excellent time efficiency, low memory overhead, and approximation guaranteed results concurrently (Section 6).
-
We evaluate the effectiveness and efficiency of our proposal by extensive experiments on real-world benchmark datasets (Section 7).
The remainder of this paper is organized as follows. Section 5 describes the preliminaries and related works of and conducts comprehensive experiments on the state-of-the-art approximate approaches to reveal their underlying issues. Section 6 presents our residual-based approach for . Section 7 verifies the superiority of our proposal by extensive experiments, followed by the conclusion in Section [sec:con].
