Efficient Bitruss Decomposition for Large-scale Bipartite Graphs

Existing Solutions

In this section, we briefly discuss the existing algorithms for the $`\btsd`$ problem. Two prior works both propose bitruss decomposition algorithms. Since these two algorithms follow the same paradigm, with inherently the same peeling idea (as illustrated in the introduction), we only outline the state-of-the-art one in Algorithm [algo:benchmark].

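Algorithm [algo:benchmark]
compute $`\btf_e`$ for each $`e \in E(G)`$ // the counting process
while some edge of $`G`$ is not yet assigned: // the peeling process
  $`e \gets`$ the unassigned edge with minimum $`\btf_e`$
  $`\bts_e \gets \btf_e`$
  for each butterfly containing $`e`$ and each other edge $`e'`$ in it: $`\btf_{e'} \gets \btf_{e'} - 1`$
  $`E(G) \gets E(G) \backslash e`$
  mark $`e`$ as assigned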

The time complexity of Algorithm [algo:benchmark] is $`O(\sum_{u \in L(G)}\sum_{v_1, v_2 \in \nb(u)} \max\{\degree(v_1), \degree(v_2)\} + \sum_{(u, v) \in E(G)}\sum_{w \in \nb(v)} \max\{\degree(u), \degree(w)\})`$, where the first term is for the counting process and the second term is for the peeling process. The time complexity of the counting process can be reduced to $`O(\sum_{(u, v) \in E(G)} \min\{\degree(u), \degree(v)\})`$ by using a more recently proposed butterfly counting algorithm.
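The peeling paradigm outlined above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the cited implementation: the function name `bitruss_by_peeling`, the adjacency-set representation, and the naive support counting are all chosen here for brevity. The running level `k` is also kept non-decreasing, an assumption added so that an edge whose support has dropped below the current level still receives that level.

```python
from collections import defaultdict


def bitruss_by_peeling(edges):
    """Return {edge: bitruss number} for edges given as (upper, lower) pairs."""
    upper = defaultdict(set)  # upper vertex -> its lower-layer neighbours
    lower = defaultdict(set)  # lower vertex -> its upper-layer neighbours
    for u, v in edges:
        upper[u].add(v)
        lower[v].add(u)

    # Counting process (naive): for edge (u, v), every other neighbour w of v
    # contributes one butterfly per common neighbour of u and w besides v.
    support = {
        (u, v): sum(len(upper[u] & upper[w]) - 1 for w in lower[v] if w != u)
        for u, v in edges
    }

    bitruss, k = {}, 0
    while support:  # peeling process
        e = min(support, key=support.get)
        u, v = e
        k = max(k, support[e])  # assumption: the level never decreases
        bitruss[e] = k
        # Edge removal: each butterfly [u, v, w, x] containing e disappears,
        # so its three other edges each lose one unit of support.
        for w in lower[v] - {u}:
            for x in (upper[u] & upper[w]) - {v}:
                for other in ((u, x), (w, v), (w, x)):
                    support[other] -= 1
        upper[u].discard(v)
        lower[v].discard(u)
        del support[e]
    return bitruss
```

The two nested loops in the removal step are where the combination-based enumeration cost arises: they visit every neighbour `w` of `v` even when `u` and `w` share no neighbour other than `v`.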

The performance bottleneck of Algorithm [algo:benchmark]. Here we analyse its dominant cost. We first define the edge removal operation as follows.

Given a bipartite graph $`G(V, E)`$ and an edge $`e \in E(G)`$, an edge removal operation for $`e`$, denoted as $`\re`$, has two steps. First, find all the edges which share at least one butterfly with $`e`$ in $`G`$ and update their butterfly supports to their values in $`G \backslash e`$. Second, remove $`e`$ from $`G`$.

Figure 1: Time cost of Algorithm [algo:benchmark] on different datasets

As shown in Figure 1, the peeling phase dominates the cost of Algorithm [algo:benchmark] on the testing datasets. Moreover, within the peeling phase, the dominant cost is incurred by the edge removal operations of Algorithm [algo:benchmark].

Problem Definition

In this section, we formally introduce the notations and definitions. Mathematical notations used throughout this paper are summarized in Table 1.

| Notation | Definition |
|---|---|
| $`G`$ | a bipartite graph |
| $`V(G) / E(G)`$ | the vertex/edge set of $`G`$ |
| $`U(G), L(G)`$ | the upper/lower vertex layer of $`G`$ |
| $`u, v, w, x`$ | a vertex in a bipartite graph |
| $`(u, v), e`$ | an edge in a bipartite graph |
| $`\blm / \kblm`$ | a bloom/$`k`$-bloom in a bipartite graph |
| $`\mblm`$ | a maximal priority-obeyed bloom |
| $`(u, v, w)`$ | a wedge formed by $`u`$, $`v`$, $`w`$ |
| $`[u, v, w, x]`$ | a butterfly formed by $`u`$, $`v`$, $`w`$, $`x`$ |
| $`\degree(u) / \p(u)`$ | the degree/priority of $`u`$ |
| $`\nb(u)`$ | the set of neighbors of $`u`$ |
| $`\btf_e`$ | the number of butterflies containing $`e`$ |
| $`\btf_{\blm}/\btf_G`$ | the number of butterflies in $`\blm`$/$`G`$ |
| $`\ggk`$ | a subgraph $`\ggk \subseteq G`$ where $`\btf_e \geq k`$ for each $`e \in \ggk`$ |
| $`n, m`$ | the number of vertices and edges in $`G`$ ($`m > n`$) |

Table 1: The summary of notations

Our problem is defined over an undirected bipartite graph $`G(V=(U, L), E)`$, where $`U(G)`$ denotes the set of vertices in the upper layer, $`L(G)`$ denotes the set of vertices in the lower layer, $`U(G) \cap L(G) = \emptyset`$, $`V(G) = U(G) \cup L(G)`$ denotes the vertex set, and $`E(G) \subseteq U(G) \times L(G)`$ denotes the edge set. An edge between two vertices $`u`$ and $`v`$ in $`G`$ is denoted as $`(u, v)`$ or $`(v, u)`$. The set of neighbors of a vertex $`u`$ in $`G`$ is denoted as $`\nb(u) = \{ v\in V(G) \mid (u, v) \in E(G) \}`$, and the degree of $`u`$ is denoted as $`\degree(u) = |\nb(u)|`$. Each vertex $`u`$ has a unique id, and we assume that $`u.id > v.id`$ for every pair of vertices $`u \in U(G)`$ and $`v \in L(G)`$.

Given a bipartite graph $`G(V, E)`$ and vertices $`u`$, $`v`$, $`w \in V(G)`$, a path starting from $`u`$, going through $`v`$ and ending at $`w`$ is called a wedge which is denoted as $`(u, v, w)`$. For a wedge $`(u, v, w)`$, we call $`u`$ the start-vertex, $`v`$ the middle-vertex and $`w`$ the end-vertex.

Given a bipartite graph $`G`$ and four vertices $`u, v, w, x \in V(G)`$ where $`u, w \in U(G)`$ and $`v, x \in L(G)`$, a butterfly induced by the vertices $`u, v, w, x`$ is a (2,2)-biclique of $`G`$; that is, $`u`$ and $`w`$ are each connected to both $`v`$ and $`x`$ by the edges $`(u, v), (u, x), (w, v), (w, x)\in E(G)`$.

Given a bipartite graph $`G(V, E)`$, a bloom denoted as $`\blm`$ is a biclique in $`G`$ where there are exactly two vertices in $`U(\blm)`$ (or $`L(\blm)`$). Given a positive integer $`k`$, a $`k`$-bloom denoted as $`\kblm`$ is a $`(2, k)`$-biclique in $`G`$; that is, there are two vertices in $`U(\kblm)`$ (or $`L(\kblm)`$), each connected with all $`k`$ vertices in $`L(\kblm)`$ (or $`U(\kblm)`$). For a $`k`$-bloom, we call $`k`$ the bloom number. Given a set of vertices $`S \subseteq V(G)`$ such that the induced subgraph of $`S`$ is a bloom, we denote this bloom as $`\blm(S)`$.
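The number of butterflies in a $`k`$-bloom follows directly from this definition. The two vertices of the two-vertex layer are fixed, so each butterfly corresponds to a choice of two of the $`k`$ vertices in the other layer. As a short worked derivation, written with the notation of Table 1:

```latex
\btf_{\kblm} \;=\; \binom{k}{2} \;=\; \frac{k(k-1)}{2},
\qquad
\btf_e \;=\; k - 1 \quad \text{for every edge } e \text{ of } \kblm
\text{ (counting only the butterflies inside } \kblm\text{)}.
```

For example, a $`3`$-bloom contains $`3`$ butterflies, and each of its $`6`$ edges lies in $`2`$ of them.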

A butterfly induced by vertices $`u, v, w, x`$ is denoted as $`[u, v, w, x]`$. We denote the number of butterflies containing an edge $`e`$ as $`\btf_e`$, the number of butterflies in a bloom $`\blm`$ as $`\btf_{\blm}`$ and the number of butterflies in $`G`$ as $`\btf_G`$. We also call $`\btf_e`$ the butterfly support of $`e`$.
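To make the butterfly support concrete, the sketch below counts $`\btf_e`$ for every edge and $`\btf_G`$ for the whole graph by intersecting the neighbor sets of every pair of upper-layer vertices. It is a deliberately naive illustration, not the efficient counting algorithm referred to in this paper; the function name `butterfly_supports` and the `(upper, lower)` edge-pair representation are assumptions made here.

```python
from collections import defaultdict
from itertools import combinations


def butterfly_supports(edges):
    """Return ({edge: support}, total butterflies) for (upper, lower) edges."""
    upper = defaultdict(set)  # upper vertex -> its lower-layer neighbours
    for u, v in edges:
        upper[u].add(v)

    support = {(u, v): 0 for u, v in edges}
    total = 0
    for u, w in combinations(upper, 2):
        common = upper[u] & upper[w]
        c = len(common)
        # u and w span one butterfly per pair of common neighbours.
        total += c * (c - 1) // 2
        # Each common neighbour v pairs with the other c - 1, so the edges
        # (u, v) and (w, v) each gain c - 1 butterflies.
        for v in common:
            support[(u, v)] += c - 1
            support[(w, v)] += c - 1
    return support, total
```

Since every butterfly has four edges, the supports returned for any input sum to four times the total, which is a convenient sanity check.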

Given a bipartite graph $`G`$ and a positive integer $`k`$, a $`k`$-bitruss denoted as $`H_k`$ is a maximal subgraph of $`G`$ where $`\btf_e \geq k`$ for each edge $`e \in H_k`$.

Given a bipartite graph $`G`$, the bitruss number of an edge $`e`$, denoted as $`\bts(e)`$, is the largest $`k`$ such that a $`k`$-bitruss in $`G`$ contains $`e`$.

Problem Statement. Given a bipartite graph $`G(V,E)`$, our problem is to compute $`\bts(e)`$ for each edge $`e \in E(G)`$.

Figure 2: (a) the bipartite graph $`G`$; (b) the 1-bitruss of $`G`$, $`H_1`$; (c) the 2-bitruss of $`G`$, $`H_2`$

Considering the bipartite graph $`G`$ in Figure 2, the bitruss numbers of the edges in $`H_2`$ are 2, the bitruss numbers of the edges in $`E(H_1) \backslash E(H_2)`$ are 1 and the bitruss numbers of the other edges are 0.
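The definitions above can also be checked directly on a small graph. The following sketch computes each $`k`$-bitruss literally from its definition, by repeatedly discarding edges with fewer than $`k`$ butterflies until none remain, and derives the bitruss numbers from the resulting hierarchy. It uses the eleven edges listed in the Introduction for the author-paper network of Figure 3. The function names and the edge representation are assumptions made for illustration, and the method is far slower than any of the algorithms discussed in this paper.

```python
from collections import defaultdict


def supports(edges):
    """Butterfly support of every edge within the subgraph formed by `edges`."""
    upper, lower = defaultdict(set), defaultdict(set)
    for u, v in edges:
        upper[u].add(v)
        lower[v].add(u)
    return {
        (u, v): sum(len(upper[u] & upper[w]) - 1 for w in lower[v] if w != u)
        for u, v in edges
    }


def k_bitruss(edges, k):
    """Maximal subgraph in which every edge lies in at least k butterflies."""
    current = set(edges)
    while True:
        sup = supports(current)
        kept = {e for e in current if sup[e] >= k}
        if kept == current:
            return current
        current = kept


def bitruss_numbers(edges):
    """Largest k such that the k-bitruss still contains the edge (0 if none)."""
    numbers = {e: 0 for e in edges}
    k, current = 1, k_bitruss(edges, 1)
    while current:
        for e in current:
            numbers[e] = k
        k += 1
        current = k_bitruss(current, k)  # the (k+1)-bitruss lies inside the k-bitruss
    return numbers


# Edge list of the Figure 3 network as given in the Introduction.
FIGURE_3_EDGES = [
    ("u0", "v0"), ("u0", "v1"), ("u1", "v0"),
    ("u1", "v1"), ("u2", "v0"), ("u2", "v1"),  # blue edges
    ("u2", "v2"), ("u3", "v1"), ("u3", "v2"),  # yellow edges
    ("u2", "v3"), ("u3", "v4"),                # gray edges
]
numbers = bitruss_numbers(FIGURE_3_EDGES)
```

On this input the blue, yellow and gray edges receive the bitruss numbers 2, 1 and 0, matching the values stated for Figure 3.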

Introduction

Bipartite networks are widely used in many real-world applications where we need to model relationships between two different types of entities, for example, author-paper relationships (e.g., authors form the upper layer and papers form the lower layer in the network in Figure 3) and user-product relationships. Consequently, cohesive subgraph mining in bipartite networks (graphs) has recently become a popular research topic. In unipartite graphs, there are extensive studies on $`k`$-truss decomposition, which constructs the hierarchy of $`k`$-trusses (each edge in a $`k`$-truss is contained in at least $`k`$ triangles). However, $`k`$-truss decomposition cannot be used in bipartite graphs since bipartite graphs contain no triangles. Also, since the degree distributions of most real-world bipartite graphs are skewed, projecting bipartite graphs to unipartite graphs causes an explosion in the number of edges/triangles.

In bipartite graphs, the butterfly (i.e., a complete $`2 \times 2`$ biclique) is the smallest non-trivial cohesive structure and is recognised as an analogue of the triangle in unipartite graphs. Based on the butterfly, a $`k`$-bitruss is defined as the cohesive subgraph where each edge is contained in at least $`k`$ butterflies. Consequently, the bitruss number of an edge $`e`$, denoted by $`\bts_e`$, is defined as the largest $`k`$ such that a $`k`$-bitruss contains $`e`$. In this paper, we study the bitruss decomposition problem, which computes the bitruss number for each edge in a bipartite graph. For instance, in Figure 3, the bitruss numbers of the edges in blue (i.e., $`(u_0, v_0)`$, $`(u_0, v_1)`$, $`(u_1, v_0)`$, $`(u_1, v_1)`$, $`(u_2, v_0)`$, $`(u_2, v_1)`$), yellow (i.e., $`(u_2, v_2)`$, $`(u_3, v_1)`$, $`(u_3, v_2)`$) and gray (i.e., $`(u_2, v_3)`$, $`(u_3, v_4)`$) are 2, 1 and 0, respectively. Bitruss decomposition can be readily adopted in many applications. We list some examples below.

Figure 3: An author-paper bipartite network

$`\bullet`$ Fraud detection. In social media such as Facebook, there exist fraudulent users who give fake “like”s. As fraud detection techniques improve, the cost of opening fake accounts increases, so fraudsters cannot rely on too many fake accounts. Therefore, these malicious users tend to form a closely connected group. Although the size of the cluster of fraudsters is unknown, the output of bitruss decomposition applied on the bipartite network (e.g., a user-page network) can reveal the close communities at different levels of granularity for further investigation.

$`\bullet`$ Identifying nested research groups. Bipartite graphs are a natural fit for modelling the relationship between authors and publications. The bitruss decomposition algorithm can reveal the hierarchical relations of researchers by first finding a loosely connected research group and then decomposing it into smaller, more cohesive groups. For instance, in Figure 3, all the researchers belong to a loosely connected research group, while $`\{v_0, v_1, v_2\}`$ forms a more cohesive one, and $`\{v_0, v_1\}`$ forms the most cohesive research group.

$`\bullet`$ Recommendation system. When applied to bipartite graphs with a user-item structure, the bitruss decomposition algorithm can effectively identify dense subgraphs in a hierarchical manner. The denser the subgraph is, the more similar the users/items in this subgraph are. Finding users/items at different similarity levels is especially helpful in supporting the construction of recommendation systems.

In real-world applications, the graphs can be very large, and the state-of-the-art algorithms cannot handle large-scale bipartite graphs efficiently. For example, on the graph Wiki-it with $`10^7`$ edges, the state-of-the-art decomposition algorithm needs more than 30 hours to solve the bitruss decomposition problem in our evaluation. Therefore, the study of more efficient bitruss decomposition algorithms is essential to support large-scale graph analysis.

Existing techniques. The existing algorithms both follow a bottom-up approach that iteratively peels the edges with the lowest butterfly support. The approach has two key steps: (1) in the counting process, for each edge $`e`$, it counts the number of butterflies containing $`e`$ (i.e., the butterfly support of $`e`$, $`\btf_e`$); (2) in the peeling process, it iteratively removes the edge $`e`$ with minimum $`\btf_e`$ and assigns $`\btf_e`$ to $`e`$ as its bitruss number. To complete the counting process, a recently proposed algorithm takes $`O(\sum_{(u, v) \in E(G)} \min\{\degree(u), \degree(v)\})`$ time. The peeling process, on the other hand, still requires $`O(|E(G)|^2)`$ time in one of the existing algorithms and $`O(\sum_{(u, v) \in E(G)}\sum_{w \in \nb(v)} \max\{\degree(u), \degree(w)\})`$ time in the other and, consequently, becomes the performance bottleneck of bitruss decomposition. Here, $`E(G)`$ denotes the edge set of a graph $`G`$, and $`\degree(v)`$ and $`\nb(v)`$ denote the degree and the neighbor set of a vertex $`v`$, respectively.

Motivation and challenges. In the peeling process, when an edge $`e`$ is removed, the butterfly supports of the edges which share butterflies with $`e`$ need to be updated correspondingly. In the existing algorithms, this edge removal operation needs to enumerate all the butterflies containing $`e`$. The butterfly enumeration methods they use are inherently the same: first enumerate the combinations of four vertices with three edges, then check whether the fourth edge exists to form a butterfly. The main drawback of these combination-based methods is that if the fourth edge does not exist (e.g., the butterfly $`[u_1, v_1, u_2, v_2]`$ does not exist in Figure 4(a)), the time spent on combining and checking is wasted. For instance, consider the graph in Figure 4(a) with 4002 vertices, where $`u_0`$ is connected with $`v_0`$ and $`v_1`$, $`u_1`$ ($`v_1`$) is connected with $`v_0`$ to $`v_{1000}`$ ($`u_0`$ to $`u_{1000}`$), and $`u_2`$ ($`v_2`$) is connected with $`v_{1001}`$ to $`v_{2000}`$ ($`u_{1001}`$ to $`u_{2000}`$). When edge $`(u_1, v_1)`$ is removed, the existing algorithms enumerate the butterflies containing $`(u_1, v_1)`$ by (1) checking whether there is an edge between $`u_1`$’s neighbors and $`v_1`$’s neighbors, which needs $`\degree(u_1) \times \degree(v_1) = 1001 \times 1001`$ checks; or (2) checking whether there is an edge between $`v_1`$’s two-hop neighbors (e.g., $`v_{1001}`$) and $`u_1`$, which needs $`\sum_{w \in \nb(v_1)} \max\{\degree(u_1), \degree(w)\} = 1001 \times 1001`$ checks. However, there exists only one butterfly containing $`(u_1, v_1)`$: $`[u_0, v_0, u_1, v_1]`$.
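The gap between the number of checks and the number of butterflies in this example can be reproduced with a few lines of Python. The sketch below rebuilds the graph exactly as described in the text and applies check (1), pairing the neighbors of $`u_1`$ with the neighbors of $`v_1`$. The function names and the encoding of an edge $`(u_i, v_j)`$ as the index pair `(i, j)` are assumptions made for illustration.

```python
def build_example():
    """Edges (i, j) standing for (u_i, v_j) in the graph described in the text."""
    edges = {(0, 0), (0, 1)}                       # u0 - v0, v1
    edges |= {(1, j) for j in range(0, 1001)}      # u1 - v0 .. v1000
    edges |= {(i, 1) for i in range(0, 1001)}      # v1 - u0 .. u1000
    edges |= {(2, j) for j in range(1001, 2001)}   # u2 - v1001 .. v2000
    edges |= {(i, 2) for i in range(1001, 2001)}   # v2 - u1001 .. u2000
    return edges


def pairwise_check(edges, e):
    """Pair the neighbours of both endpoints of e and test each pair for an edge.

    Returns the check bound deg(u) * deg(v) used in the text, together with the
    pairs (w, x) that close a butterfly [w, x, u, v].
    """
    u, v = e
    nbrs_u = {j for (i, j) in edges if i == u}  # lower-layer neighbours of u
    nbrs_v = {i for (i, j) in edges if j == v}  # upper-layer neighbours of v
    bound = len(nbrs_u) * len(nbrs_v)
    found = [(w, x) for w in nbrs_v if w != u
             for x in nbrs_u if x != v and (w, x) in edges]
    return bound, found


example = build_example()
bound, found = pairwise_check(example, (1, 1))
```

Here `bound` equals $`1001 \times 1001`$, while `found` holds the single pair `(0, 0)`, i.e. the butterfly $`[u_0, v_0, u_1, v_1]`$.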

In addition, we observe that the degree distributions of most real-world graphs are skewed (e.g., Wiki-it and Delicious). In these graphs, some edges (i.e., hub edges) can have very high butterfly supports, though their bitruss numbers are comparatively much smaller. For example, the maximum bitruss number for an edge is only $`6{,}638`$ on the Delicious dataset, while its butterfly support reaches $`1{,}219{,}319`$. Obtaining the bitruss numbers of such hub edges requires a large number of butterfly support updates in the peeling process.

Figure 4: Observations

Figure 5: (a) a bipartite graph (also a 1001-bloom); (b) the corresponding Bloom-Edge-Index, where $`(u_0, v_i)`$ is denoted as $`e_i`$ and $`(u_1, v_i)`$ is denoted as $`e_{i+1001}`$

Motivated by the above observations, in this paper, we aim to significantly improve the efficiency of bitruss decomposition by addressing the following two major challenges:

  1. When performing edge removal operations, it is a challenge to efficiently enumerate the butterflies containing each removed edge.

  2. It is also a challenge to efficiently handle edges with high butterfly supports (i.e., hub edges).

Our approaches. To address Challenge 1, we observe that the bloom structure (i.e., a biclique with exactly 2 vertices in one layer) is a combination of butterflies and can therefore be used to represent butterflies compactly. For example, in Figure 5(a), the graph is a $`1001`$-bloom (also a $`(2, 1001)`$-biclique) which contains $`\frac{1001 \times (1001-1)}{2}`$ butterflies. Thus, given a bipartite graph $`G`$, we can compact all the butterflies in $`G`$ into blooms. To guarantee that each butterfly is contained in exactly one bloom, we only identify the maximal priority-obeyed blooms, that is, the maximal blooms in which the vertex with the largest priority belongs to the layer with only two vertices. Here, the higher the degree, the higher the priority, and ties are broken by vertex ID. The index is then constructed by linking the maximal priority-obeyed blooms with the edges they contain; this is the Bloom-Edge-Index. For example, for the graph in Figure 5(a), we can construct the corresponding Bloom-Edge-Index as shown in Figure 5(b). The Bloom-Edge-Index can be efficiently constructed after the counting process, which needs only $`O(\sum_{(u, v) \in E(G)}\min\{\degree(u), \degree(v)\})`$ time. Then, when we perform an edge removal operation for $`e`$, we can directly find all the affected edges through the blooms in the Bloom-Edge-Index rather than enumerating the butterflies containing $`e`$ with combination-based methods as the existing techniques do. For example, to remove $`(u_1, v_1)`$ in Figure 4(a), we can directly find the 4 edges to be updated in the Bloom-Edge-Index as shown in Figure 4(b), instead of performing the $`1001 \times 1001`$ butterfly checks of the existing solutions. Similarly, as shown in Figure 5, we can directly find all the affected edges if one of those edges is removed. Based on the Bloom-Edge-Index, the total peeling process needs only $`O(\btf_G)`$ time, where $`\btf_G`$ is the number of butterflies in the graph $`G`$.
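The grouping of butterflies into maximal priority-obeyed blooms can be sketched as follows. This is a simplified illustration under stated assumptions, not the index construction of this paper: the priority of a vertex is modelled as the pair (degree, label), with the label standing in for the vertex ID, vertex labels are assumed to be distinct across the two layers, and the names `build_bloom_index`, `bloom_edges` and `edge_blooms` are hypothetical.

```python
from collections import defaultdict
from math import comb


def build_bloom_index(edges):
    """Link each maximal priority-obeyed bloom with the edges it contains."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    # Higher degree means higher priority; ties are broken by label (assumption).
    prio = {x: (len(nbrs[x]), x) for x in nbrs}

    # Enumerate wedges (a, m, b) whose start-vertex a has the largest priority
    # and group their middle vertices by the pair (a, b).
    middles = defaultdict(set)
    for a in nbrs:
        for m in nbrs[a]:
            if prio[m] < prio[a]:
                for b in nbrs[m]:
                    if prio[b] < prio[a]:
                        middles[(a, b)].add(m)

    bloom_edges = {}                 # bloom (a, b) -> edges it contains
    edge_blooms = defaultdict(list)  # edge -> blooms containing it
    for (a, b), mids in middles.items():
        if len(mids) < 2:
            continue  # a single wedge forms no butterfly
        bloom_edges[(a, b)] = [frozenset((x, m)) for m in mids for x in (a, b)]
        for e in bloom_edges[(a, b)]:
            edge_blooms[e].append((a, b))
    return bloom_edges, edge_blooms


def butterflies_in_blooms(bloom_edges):
    """A k-bloom has 2k edges and contains C(k, 2) butterflies."""
    return sum(comb(len(es) // 2, 2) for es in bloom_edges.values())
```

Because each butterfly falls into exactly one such bloom, `butterflies_in_blooms` returns $`\btf_G`$, and the support of an edge equals the sum of $`k - 1`$ over the blooms listed for it in `edge_blooms`. Edges are stored as `frozenset` pairs here so that blooms anchored in either layer can share one key format.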

To address Challenge 2, we propose the progressive compression approach based on the observation that $`\btf_e`$ is an upper bound of $`\bts_e`$ for an edge $`e`$, since every butterfly that contains $`e`$ in a $`k`$-bitruss is also a butterfly of $`G`$. Unlike the bottom-up algorithms, which process the edges with minimum butterfly supports first, this approach first handles a batch of edges with high butterfly supports (i.e., hub edges) within cohesive subgraphs and compresses those edges after assigning bitruss numbers to them. In this manner, it can significantly reduce the number of butterfly support updates, especially for those hub edges. This is because after assigning the bitruss number to a hub edge, we only need to preserve its support in the index and do not need to update its butterfly support when edges with lower bitruss numbers are removed.

Contribution. Our principal contributions are summarized as follows.

  • We propose a novel online index, the Bloom-Edge-Index. Based on this index, our new bitruss decomposition algorithm significantly reduces the time complexities of the existing algorithms, as shown in Section 12.1. We also propose two batch-based optimizations to further enhance its performance.

  • To deal with the hub edge issue, we propose the progressive compression algorithm, which processes the hub edges within cohesive subgraphs and compresses the processed edges progressively. In this manner, it greatly reduces the number of butterfly support updates for those hub edges.

  • We conduct extensive experiments on real bipartite graphs. The results show that the proposed algorithm outperforms the state-of-the-art algorithm by up to two orders of magnitude. For instance, the proposed algorithm can solve the bitruss decomposition problem within 20 minutes on the Wiki-it dataset with $`10^7`$ edges, while the state-of-the-art algorithm runs for more than 30 hours.

Organization. The rest of the paper is organized as follows. The related work directly follows. Section 5 presents the problem definition. Section 6 introduces the existing algorithms. The Bloom-Edge-Index is presented in Section 14. Section 12 introduces the index-based algorithms. Section 11 reports the experimental results. Section 13 concludes the paper.

Related Work. In the literature, there are many cohesive subgraph models, and recent works on graph decomposition are based on these models.

Unipartite graphs. In unipartite networks, many models are defined to capture the cohesiveness of subgraphs, such as the $`k`$-core, the $`k`$-truss and the clique. Furthermore, researchers also study core decomposition and truss decomposition algorithms. Among those works, truss decomposition is the most similar topic. The reason is that the cohesive structure used in truss decomposition (i.e., the triangle) is the smallest non-trivial clique in unipartite networks, while the cohesive structure used in bitruss decomposition (i.e., the butterfly) is the smallest non-trivial biclique in bipartite networks. However, the structures are different (a 4-hop cycle vs. a 3-hop cycle) and the applied networks are different (bipartite vs. unipartite). Thus, the truss decomposition techniques are not applicable.

Bipartite graphs. In bipartite networks, some studies are conducted towards core-like (e.g., ($`\alpha, \beta`$)-core, ($`p, q`$)-core, fractional $`k`$-core), truss-like (e.g., bitruss), and clique-like (e.g., ($`p, q`$)-biclique, quasi-biclique) cohesive structures. Among those works, the core-like and clique-like structures are inherently different from the bitruss. For instance, the ($`\alpha, \beta`$)-core is the maximal subgraph where the degree of each vertex in the upper/lower layer is at least $`\alpha`$/$`\beta`$, and the biclique is the maximal complete subgraph. Thus, the techniques in these works cannot be used to solve our problem. One line of work projects the bipartite graph into a unipartite graph and applies the $`k`$-truss decomposition algorithm. As mentioned before, this causes an explosion of edges/triangles. Thus, the study in this paper aims to improve on the recent works that solve the problem directly.