Let ${\cal V}$ be a finite set of $n$ elements and ${\cal F}=\{X_1,X_2, \ldots, X_m\}$ a family of $m$ subsets of ${\cal V}.$ Two sets $X_i$ and $X_j$ of ${\cal F}$ overlap if $X_i \cap X_j \neq \emptyset,$ $X_j \setminus X_i \neq \emptyset,$ and $X_i \setminus X_j \neq \emptyset.$ Two sets $X,Y\in {\cal F}$ are in the same overlap class if there is a series $X=X_1,X_2, \ldots, X_k=Y$ of sets of ${\cal F}$ in which each consecutive pair $X_i, X_{i+1}$ overlaps. In this note, we focus on efficiently identifying all overlap classes in $O(n+\sum_{i=1}^m |X_i|)$ time. We revisit the clever algorithm of Dahlhaus, give a clear presentation of it, and simplify it so that it is practical and implementable in its real worst-case complexity. A useful variant of Dahlhaus's approach is also explained.
Deep Dive into A Note On Computing Set Overlap Classes.
Let ${\cal V}$ be a finite set of $n = |{\cal V}|$ elements and ${\cal F} = \{X_1, X_2, \ldots, X_m\}$ a family of $m$ subsets of ${\cal V}$. Two sets $X_i$ and $X_j$ of ${\cal F}$ overlap if $X_i \cap X_j \neq \emptyset$, $X_i \setminus X_j \neq \emptyset$, and $X_j \setminus X_i \neq \emptyset$. We write $|{\cal F}|$ for the sum of the sizes of all $X_i \in {\cal F}$. We define the overlap graph $OG({\cal F}, E)$ as the graph whose vertices are the sets $X_i$ and whose edge set is $E = \{(i, j) \mid X_i \text{ overlaps } X_j,\ 1 \leq i, j \leq m\}$. A connected component of this graph is called an overlap class.
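The definitions above can be checked directly on small inputs. The following is a minimal brute-force sketch, not Dahlhaus's algorithm: it tests every pair of sets and therefore runs in quadratic time. The names `overlaps` and `overlap_classes_naive` are illustrative assumptions, and sets are identified by their index in the input list.

```python
from itertools import combinations


def overlaps(x, y):
    """True when x and y intersect and neither one contains the other."""
    return bool(x & y) and bool(x - y) and bool(y - x)


def overlap_classes_naive(family):
    """Overlap classes as sorted lists of indices into `family`.

    Quadratic reference implementation: it merges the two endpoints of
    every edge of the overlap graph with a small union-find structure.
    """
    parent = list(range(len(family)))

    def find(i):
        # Follow parent pointers to the representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(family)), 2):
        if overlaps(family[i], family[j]):
            parent[find(i)] = find(j)

    classes = {}
    for i in range(len(family)):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())
```

Such a reference version is mainly useful as a test oracle for a linear-time implementation.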
In this note we focus on efficiently identifying all overlap classes of $OG({\cal F}, E)$. This is a classical problem in graph clustering, and it also arises frequently in graph problems related to graph decomposition [2] or PQ-tree manipulation [3].
An efficient $O(n+|{\cal F}|)$ time algorithm has already been presented by Dahlhaus in [2]. The algorithm is very clever but uses an off-line Lowest Common Ancestor (LCA) algorithm as a subroutine. From a theoretical point of view, off-line LCA queries have been proved to be solvable in constant time (after a linear-time preprocessing) in a RAM model (accepting an additional constant-time specific register operation) and also, more recently, in a pointer machine model [1]. In practice, however, it is very difficult to implement these LCA algorithms in their real linear complexity. Another difficulty with Dahlhaus's algorithm is that its original presentation is hard to follow. These two points motivated this note. Dahlhaus's algorithm deserves a clear presentation, all the more so since we show how to replace LCA queries by set partitioning, which makes the algorithm easily implementable in practice in its real complexity. We also provide freely available source code in [4]. Finally, we explain how to modify Dahlhaus's approach in a simple way to efficiently compute a spanning tree of each connected component of the overlap graph. This simplifies a graph construction in [3].
The overlap graph $OG({\cal F}, E)$ might have $\Theta(m^2)$ edges, which can be quadratic in $|{\cal F}|$. For instance, if
The approach of Dahlhaus is quite surprising in that, instead of computing a subgraph of the overlap graph, Dahlhaus considers a second graph $D({\cal F}, L)$ on the same vertex set but with different edges. This graph nevertheless has a strong property: its connected components are the same as those of $OG({\cal F}, E)$, although in the general case $D({\cal F}, L)$ is not a subgraph of $OG({\cal F}, E)$.
Let $LF$ be the list of all $X \in {\cal F}$ sorted in decreasing order of size. The order of sets of equal size is fixed arbitrarily. Given $X \in {\cal F}$, we denote by $\mathrm{Max}(X)$ the largest $Y \in {\cal F}$, taken in $LF$ order, such that $|Y| \geq |X|$ and $Y$ overlaps $X$. Note that $\mathrm{Max}(X)$ might be undefined for some sets of ${\cal F}$. In that case, to simplify the presentation of some technical points, we write $\mathrm{Max}(X) = \emptyset$. Dahlhaus's algorithm is based on the following observation:
But in this case, as $|Y| \leq |\mathrm{Max}(X)|$, $Y = \mathrm{Max}(X)$ and overlaps $X$. Therefore $Y$ overlaps $X$ or $\mathrm{Max}(X)$. $\Box$
Let us assume that we have already computed all $\mathrm{Max}(X)$. For each $v \in {\cal V}$ we compute the list $SL(v)$ of all sets $X \in {\cal F}$ to which $v$ belongs. This list is sorted in increasing order of the sizes of the sets. Computing and sorting all lists for all $v \in {\cal V}$ can be done in $O(|{\cal F}|)$ time using a global bucket sort.
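To experiment with these two ingredients, the sketch below computes $\mathrm{Max}(X)$ by brute force and builds the $SL(v)$ lists with a single bucket sort on set sizes. The brute-force `compute_max_naive` is only a quadratic stand-in written from the definition, not the efficient computation of the note. All function names are illustrative assumptions, sets are identified by their index in `family`, and `None` plays the role of $\mathrm{Max}(X) = \emptyset$.

```python
def overlaps(x, y):
    """True when x and y intersect and neither one contains the other."""
    return bool(x & y) and bool(x - y) and bool(y - x)


def compute_max_naive(family):
    """max_of[i] = index of Max(family[i]), or None when it is undefined.

    Brute force from the definition: scan LF (decreasing sizes) and stop
    at the first set that is at least as large as X and overlaps X.
    """
    # Python's sort is stable, so equal-size sets keep their input order.
    lf = sorted(range(len(family)), key=lambda i: -len(family[i]))
    max_of = []
    for x in family:
        found = None
        for j in lf:
            if len(family[j]) < len(x):
                break  # every remaining set is smaller than X
            if overlaps(x, family[j]):  # a set never overlaps itself
                found = j
                break
        max_of.append(found)
    return max_of


def build_sl_lists(universe, family):
    """SL(v) for every v: indices of the sets containing v, by increasing size."""
    universe = list(universe)
    # One global bucket sort: bucket k holds the sets of size k.
    buckets = [[] for _ in range(len(universe) + 1)]
    for i, x in enumerate(family):
        buckets[len(x)].append(i)
    sl = {v: [] for v in universe}
    # Visiting buckets by increasing size leaves every SL(v) sorted.
    for bucket in buckets:
        for i in bucket:
            for v in family[i]:
                sl[v].append(i)
    return sl
```

Because each set is appended once per element it contains, `build_sl_lists` does $O(n + |{\cal F}|)$ work in total, which matches the bound stated above up to the initialization of the $n$ lists.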
Dahlhaus's graph $D({\cal F}, L)$ is built on those lists. Let $X$ be a set containing $v$ such that $\mathrm{Max}(X) \neq \emptyset$. Then, for every pair $Y, W$ of consecutive sets that appears after $X$ in $SL(v)$ ($X$ included, i.e. $Y$ may be $X$ itself) and such that $|W| \leq |\mathrm{Max}(X)|$, create an edge $(Y, W)$ in the graph $D$.
Therefore, in $SL(v)$, there exists a series of consecutive pairs $Y, W$ from $A$ to $B$ that are linked in $D({\cal F}, L)$. Consequently, $A$ and $B$ are connected in $D({\cal F}, L)$. $\Box$
Notice that the order of equally sized sets in the $SL$ lists has no importance for the construction of Dahlhaus's graph. Figure 1 shows an example of an overlap graph and of Dahlhaus's graph.
Given all $\mathrm{Max}(X)$, $X \in {\cal F}$, the graph $D({\cal F}, L)$ can be built in $O(|{\cal F}|)$ time and its number of edges is less than or equal to $|{\cal F}|$.
Proof. To build the graph $D({\cal F}, L)$ from the $SL$ lists, it suffices to go through each $SL$ list from the smallest set to the largest and to remember, at each step, the size of the largest $\mathrm{Max}(X)$ already seen. If the size of the current set is smaller than or equal to this value, an edge is created between the previous set and the current one.
Let us now consider the number of edges of $D({\cal F}, L)$. As at most one edge is created for each set in a list $SL$, at most $|{\cal F}|$ edges are created after processing all lists. $\Box$
Identifying the overlap classes of $OG({\cal F}, E)$ can therefore be done by a simple Depth First Search on $D({\cal F}, L)$ in $O(n+|{\cal F}|)$ time. It remains, however, to explain how to efficiently compute all $\mathrm{Max}(X)$.
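The scan described in the proof and the final Depth First Search can be sketched as follows. This is an illustrative reading of the construction, not the authors' source code [4]. It assumes that the $SL$ lists and the $\mathrm{Max}$ values are already available as `sl` (a dict from element to list of set indices, sorted by increasing size) and `max_of` (a list with `None` for undefined values); these names and the two function names are hypothetical.

```python
def dahlhaus_edges(family, sl, max_of):
    """Edges of D(F, L), obtained by one left-to-right pass over each SL list."""
    edges = []
    for sets_with_v in sl.values():
        bound = 0    # largest |Max(X)| among the sets already passed
        prev = None  # previous set in the current list
        for w in sets_with_v:
            # Link w to its predecessor when some earlier X in this list
            # satisfies |w| <= |Max(X)|.
            if prev is not None and len(family[w]) <= bound:
                edges.append((prev, w))
            if max_of[w] is not None:
                bound = max(bound, len(family[max_of[w]]))
            prev = w
    return edges


def connected_components(m, edges):
    """Connected components of a graph on vertices 0..m-1 (iterative DFS)."""
    adjacency = [[] for _ in range(m)]
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    seen = [False] * m
    components = []
    for start in range(m):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:
            u = stack.pop()
            component.append(u)
            for w in adjacency[u]:
                if not seen[w]:
                    seen[w] = True
                    stack.append(w)
        components.append(sorted(component))
    return components


# Small hand-checked example: the first three sets form one overlap class.
family = [{1, 2}, {2, 3}, {3, 4}, {5}, {1, 2, 3, 4, 5}]
sl = {1: [0, 4], 2: [0, 1, 4], 3: [1, 2, 4], 4: [2, 4], 5: [3, 4]}
max_of = [1, 0, 1, None, None]
example_edges = dahlhaus_edges(family, sl, max_of)
example_classes = connected_components(len(family), example_edges)
```

At most one edge is appended per entry of an $SL$ list, so the edge list never exceeds $|{\cal F}|$ entries, in line with the bound proved above.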
Let $LF$ be the list of all $X \in {\cal F}$ sorted in decreasing order of size. The order of sets of equal size is not important. We consider a boolean matrix $BM$ of size $|{\cal F}| \times |{\cal V}|$ such that each row represents a set $X \in {\cal F}$ in the order of $LF$, and each column
The first step of Dahlhaus’s algorithm is to sort the
…(Full text truncated)…