Drawing Binary Tanglegrams: An Experimental Evaluation

Martin Nöllenburg¹*, Danny Holten², Markus Völker¹, and Alexander Wolff²

¹ Fakultät für Informatik, Universität Karlsruhe, Germany. {noellenburg, mvoelker}@iti.uka.de
² Faculteit Wiskunde en Informatica, TU Eindhoven, The Netherlands. d.h.r.holten@tue.nl, http://www.win.tue.nl/~awolff

Abstract. A binary tanglegram is a pair $\langle S, T \rangle$ of binary trees whose leaf sets are in one-to-one correspondence; matching leaves are connected by inter-tree edges. For applications, for example in phylogenetics or software engineering, it is required that the individual trees are drawn crossing-free. A natural optimization problem, denoted tanglegram layout problem, is thus to minimize the number of crossings between inter-tree edges. The tanglegram layout problem is NP-hard and is currently considered both in application domains and theory. In this paper we present an experimental comparison of a recursive algorithm of Buchin et al. [2], our variant of their algorithm, the algorithm hierarchy sort of Holten and van Wijk [8], and an integer quadratic program that yields optimal solutions.

1 Introduction

In this paper we are interested in evaluating the performance of two recently suggested algorithms for drawing so-called tanglegrams [11], that is, pairs of trees whose leaf sets are in one-to-one correspondence. The need to visually compare pairs of trees arises in applications such as the analysis of software projects, phylogenetics, or clustering. In the first application, trees may represent package-class-method hierarchies or the decomposition of a project into layers, units, and modules. The aim is to analyze changes in hierarchy over time or to compare human-made decompositions with automatically generated ones.
Whereas trees in software analysis can have nodes of arbitrary degree, trees from our second application, that is, (rooted) phylogenetic trees, are binary trees. This makes binary tanglegrams an interesting special case, see Fig. 1. Hierarchical clusterings, our third application, are usually visualized by a binary tree-like structure called dendrogram, where elements are represented by the leaves and each internal node of the tree represents the cluster containing the leaves in its subtree. Pairs of dendrograms stemming from different clustering processes of the same data can be compared visually using tanglegrams.

Fig. 1: A binary tanglegram of phylogenetic trees for lice of pocket gophers [7]. (a) arbitrary layout; (b) optimal layout.

From the application point of view it makes sense to insist that (a) the trees under consideration are drawn plane, that is, without edge crossings, (b) each leaf of one tree is connected by an inter-tree edge to the corresponding leaf in the other tree, and (c) the number of crossings among the inter-tree edges is minimized. Following the bioinformatics literature (e.g., [11, 10]), we call this the tanglegram layout problem; Fernau et al. [5] refer to it as two-tree crossing minimization.

* Supported by grant WO 758/4-3 of the German Research Foundation (DFG).

Problem (Tanglegram Layout (TL)). Given a tanglegram $\langle S, T \rangle$ consisting of two rooted trees $S$ and $T$ on $n$ leaves and a bijection between their leaf sets, find a tanglegram layout, that is, two plane drawings of $S$ and $T$, such that
1. the drawing of $S$ is to the left of the line $x = 0$ with all leaves on $x = 0$;
2. the drawing of $T$ is to the right of the line $x = 1$ with all leaves on $x = 1$;
3. the inter-tree edges are drawn as straight-line segments;
4. the number of inter-tree edge crossings is minimum.
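Once both leaf orders are fixed, objective 4 is straightforward to evaluate: two straight inter-tree edges cross exactly when their endpoints appear in opposite vertical order on the lines $x = 0$ and $x = 1$. The following Python sketch counts such pairs with a merge sort. The function name and the convention that matched leaves share a label are assumptions made for illustration only; they are not taken from the implementations evaluated in this paper.

```python
def count_crossings(order_s, order_t):
    """Count crossing pairs of inter-tree edges for two fixed leaf orders.

    order_s and order_t list the same leaf labels from top to bottom.
    Assumption: the leaf labelled v in S is matched to the leaf labelled v in T.
    """
    pos_t = {leaf: i for i, leaf in enumerate(order_t)}
    # Position in T of each leaf, read in the order of S.
    seq = [pos_t[leaf] for leaf in order_s]

    def sort_and_count(a):
        # Merge sort that also returns the number of inversions in a.
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, inv_left = sort_and_count(a[:mid])
        right, inv_right = sort_and_count(a[mid:])
        merged, inv = [], inv_left + inv_right
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                # right[j] precedes all remaining elements of left.
                merged.append(right[j])
                j += 1
                inv += len(left) - i
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inv

    return sort_and_count(seq)[1]
```

For example, `count_crossings([1, 2, 3, 4], [1, 3, 2, 4])` reports a single crossing, caused by the edges of leaves 2 and 3. The merge-sort formulation runs in $O(n \log n)$ time; a quadratic double loop over all edge pairs would serve equally well for small instances.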
Given a tree $T$, we say that a linear order of its leaves is compatible with $T$ if for each node $v$ of $T$ the leaves in the subtree of $v$ form an interval in the order. Note that TL is a purely combinatorial problem. In short, given two trees $S$ and $T$, TL consists of finding an order $\sigma$ of the leaves of $S$ compatible with $S$ and an order $\tau$ of the leaves of $T$ compatible with $T$ such that the number of inversions between $\tau$ and $\sigma$ is minimum [5, 2]. Let the crossing number of a tanglegram $\langle S, T \rangle$ be the minimum number of inter-tree edge crossings of any tanglegram layout of $\langle S, T \rangle$.

In the following we restrict our attention to binary tanglegrams such as, for example, pairs of phylogenetic trees or clustering dendrograms. The restriction of TL to binary trees is denoted as binary TL.

After presenting related work (Section 2), we introduce the algorithms that we want to compare experimentally (Section 3). We first sketch a recursive algorithm of Buchin et al. [2]. Then, in an algorithm engineering process we adapt their algorithm to the needs of unbalanced trees in order to achieve better results. We apply branch-and-bound to speed up the improved variant. Next, we introduce hierarchy sort, a crossing-reduction heuristic used in the visualization tool of Holten and van Wijk [8], and a quadratic integer program that solves binary TL optimally. Finally, we provide a detailed description of the results of our experimental comparison of these algorithms, see Section 4.

2 Related Work

In graph drawing the so-called two-sided crossing minimization problem (2SCM) is an important NP-hard problem that occurs when computing layered graph layouts. Such layouts have been introduced by Sugiyama et al. [12] and are widely used for drawing hierarchical graphs.
In 2SCM, vertices of a bipartite graph are to be placed on two parallel lines (called layers) such that for each vertex on one line all its adjacent vertices lie on the other line. As in TL the objective is to minimize the number of edge crossings provided that edges are drawn as straight-line segments. In one-sided crossing minimization (1SCM) the order of the vertices on one of the layers is fixed. Even 1SCM is NP-hard [4]. Jünger and Mutzel [9] performed an experimental comparison of exact and heuristic algorithms for both 1SCM and 2SCM. The main findings were that for 1SCM the exact solution can be computed quickly for up to 60 vertices in the free layer, and for 2SCM an iterated barycenter heuristic is the method of choice for instances with more than 15 vertices in each layer.

The main difference between TL and 2SCM is that in TL, the possible orders of the leaves are limited to those that are compatible with the two input trees. Furthermore the inter-tree edges are usually restricted to be a matching of the leaves.

Dwyer and Schreiber [3] studied drawing a series of tanglegrams in 2.5 dimensions, that is, the trees are drawn on a set of stacked two-dimensional planes. They considered a one-sided version of binary TL by fixing the layout of the first tree in the stack, and then, layer-by-layer, computing an optimal compatible leaf order of the next tree in $O(n^2 \log n)$ time each.

Fernau et al. [5] showed that binary TL is NP-hard and gave a fixed-parameter algorithm that runs in $O^*(c^k)$ time, where the $O^*$-notation ignores polynomial factors, $c$ is a constant that Fernau et al. estimate to be 1024, and $k$ is the minimum number of crossings in any drawing of the given tanglegram. They further showed that the problem can be solved in $O(n \log^2 n)$ time if the leaf order of one tree is fixed. This improves the result of Dwyer and Schreiber [3].
They also made the simple observation that the edges of the tanglegram can be directed from one root to the other. Thus the existence of a planar drawing can be verified using a linear-time upward-planarity test for single-source directed acyclic graphs [1]. Later, apparently not knowing these previous results, Lozano et al. [10] gave a quadratic-time algorithm for the same special case, to which they refer as planar tanglegram layout.

Recently, Buchin et al. [2] showed that binary TL remains NP-hard even if both trees are complete binary trees. For this case they gave an $O(n^3)$-time factor-2 approximation algorithm and a simple $O^*(4^k)$-time fixed-parameter algorithm, where $k$ is the minimum number of crossings as before. Their approximation algorithm is based on recursive splitting of the instance and can also be used as a heuristic for general binary trees.

Holten and van Wijk [8] present a tanglegram visualization tool for the comparison of two (not necessarily binary) trees that uses local optimization to reduce inter-tree crossings and an edge-bundling technique to reduce visual clutter.

3 Algorithms

In this section we describe the recursive splitting algorithm of Buchin et al. [2] and our improved variant of it, then the algorithm hierarchy sort of Holten and van Wijk [8], and finally a simple integer quadratic program (IQP) that provides us with exact solutions for the experimental comparison that follows in Section 4.

3.1 Recursive Splitting Algorithm

The main idea behind the recursive splitting algorithm is to recursively consider for an instance $\langle S, T \rangle$ the four possible orders of the two subtrees $S_1, S_2$ of $S$ and $T_1, T_2$ of $T$ below the roots $v_S$ and $v_T$ of $S$ and $T$ as in Fig. 2.
Each order gives rise to a certain number of crossings at that level of the recursion (called current-level crossings), which is added to the number of crossings of both recursively solved subproblems induced by that order (called lower-level crossings). Each current-level crossing has the property that it can be removed by swapping the subtrees of $v_S$ or $v_T$. For example the crossing depicted in Fig. 2 can be removed by swapping the subtrees of $v_S$ and placing $S_2$ above $S_1$. Of course such a swap generally introduces other current-level crossings. The minimum of the four possibilities will be returned to the previous level of the recursion.

Fig. 2: A subinstance $\langle S, T \rangle$ with a current-level crossing.

The two subproblems that arise from each recursive split are not independent. Nevertheless, they are treated independently by the algorithm. This obviously introduces an error with respect to the actual number of crossings, which, for the case of complete binary trees, can be bounded by the number of crossings in an optimal solution [2]. For complete binary trees the recursive algorithm thus yields a 2-approximation.

Obviously, the depth of the recursion equals the minimum height $h$ of the two trees. The recursion tree is of size $O(8^h)$ since each instance starts eight recursive calls (two for each of the four subtree arrangements). The computation of all current-level crossings is done in $O(4^h n)$ time, resulting in a total running time of $O(8^h + 4^h n)$. For complete trees with $h = \log n$ this resolves to $O(n^3)$ time.

In applications most binary TL instances do not consist of complete binary trees. The above recursive algorithm can be applied to any pair of binary trees as a heuristic but an approximation guarantee cannot be given any more. Under the Unique Games Conjecture a constant-factor approximation does not even exist for general binary trees [2].
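The cubic bound for complete trees quoted in this section can be checked by substituting the height into the general running time. The short derivation below assumes base-2 logarithms, which is the natural reading for a complete binary tree on $n$ leaves:

```latex
\begin{align*}
8^{h} &= 8^{\log_2 n} = \left(2^{\log_2 n}\right)^{3} = n^{3},\\
4^{h}\, n &= \left(2^{\log_2 n}\right)^{2} n = n^{2}\cdot n = n^{3},\\
O\!\left(8^{h} + 4^{h} n\right) &= O\!\left(n^{3}\right).
\end{align*}
```

Both terms therefore contribute equally in the complete case, whereas for unbalanced trees of larger height the exponential term $8^h$ dominates.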
The original algorithm always divides an instance into an upper and a lower subinstance, that is, the two problems $\langle S_1, T_1 \rangle$ and $\langle S_2, T_2 \rangle$ in the example of Fig. 2. For unbalanced trees this can lead to an unnecessarily high number of ignored crossings as Fig. 3 shows. The original algorithm aligns the leaves (nodes 7 and 8) attached directly to the roots since this causes no current-level crossings. All 14 crossings in Fig. 3b are crossings that the algorithm does not take into account.

Fig. 3: Example of a binary tree for which the original heuristic performs badly. (a) optimal layout: 1 crossing; (b) heuristic layout: 14 crossings.

A small modification of the algorithm weakens this effect (and yields the optimum solution in the given example). Instead of always dividing into an upper and a lower subinstance, we can also consider dividing into the two diagonal subinstances $\langle S_1, T_2 \rangle$ and $\langle S_2, T_1 \rangle$ in the example of Fig. 2. The improved algorithm always selects among the two possible splits the one that has the higher total number of edges between its two subinstances. This modification not only improves the performance of the algorithm, but it also allows us to precompute in $O(n^2 h)$ time all required numbers of current-level crossings for a constant-time lookup. Thus the total running time reduces to $O(8^h + n^2 h)$, which still equals $O(n^3)$ for complete trees. Interestingly, it can be proved (omitted here) that the approximation factor of 2 still holds for this modified algorithm in the case of complete trees.

In our most refined implementation of the improved algorithm we additionally make use of a branch-and-bound technique in order to prune large parts of the search tree as early as possible. This considerably sped up the naive implementation, see Section 4.3. Consider an instance of the problem with roots $v_{S'}$ and $v_{T'}$.
Instead of computing the number of crossings for all four possible arrangements of the respective subtrees of $v_{S'}$ and $v_{T'}$ we first consider the one that yields the lowest number of current-level crossings and recurse. This gives us an initial upper bound on the number of crossings once the leaf level is reached. Now at each level we can immediately prune the respective parts of the search tree for those arrangements that exceed this upper bound. The rest of the search tree is examined further, and each time a better solution is found the upper bound is updated accordingly.

3.2 Hierarchy Sort

The algorithm hierarchy sort of Holten and van Wijk [8] performs a number of collapse-and-expand cycles on both trees of the binary tanglegram. During each step of these cycles, the well-known barycentric method of Sugiyama et al. [12] for 1SCM is used by successively fixing one tree, optimizing the leaf order of the other, and then changing the trees' roles until no further crossing reduction is possible.

We illustrate the algorithm using the example in Fig. 4. Figure 4a shows a binary tanglegram with 13 inter-tree edge crossings. Figures 4b to 4p illustrate the hierarchy sorting algorithm using one full collapse-and-expand cycle. Since crossings are to be reduced on corresponding levels in the two trees, the numbers of levels of the two trees need to be equalized. This is done by introducing dummy nodes that bring all leaves to the lowest level. In our example, this results in a tanglegram consisting of two four-level binary trees, see Fig. 4b.
Fig. 4: Step-by-step crossing reduction (CR) using hierarchy sort. Nodes that are swapped during CR are encircled. The numbers of crossings after each step are given in parentheses. Steps: (a) original trees (13); (b) equalize levels, max = L4 (13); (c) cross-reduce L4 (12); (d) collapse to L3 (12); (e) cross-reduce L3 (9); (f) collapse to L2 (6); (g) cross-reduce L2 (1); (h) collapse to L1 (1); (i) cross-reduce L1 (1); (j) expand to L2 (1); (k) cross-reduce L2 (1); (l) expand to L3 (3); (m) cross-reduce L3 (2); (n) expand to L4 (2); (o) cross-reduce L4 (2); (p) remove dummy nodes (2).

Crossing reduction (CR) is now performed per level by employing the barycentric method on corresponding levels. Due to the hierarchical structure of the data that we consider, only nodes having the same parent may be swapped. Nodes that are swapped during CR are encircled in Fig. 4. After having completed CR at the lowest level, we move up one level in both trees. We do this by collapsing both levels, that is, by contracting all edges ending in leaves (see the step from Fig. 4c to 4d, for example). CR can now proceed. Collapsing and CR are repeated until the levels below the roots are reached (Fig. 4i).
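The restriction that only siblings may be swapped can be illustrated with a much simplified, one-sided sketch in Python. It keeps the leaf order of one tree fixed and, at every inner node of the other tree, places the child whose leaves have the smaller average position (barycenter) first. This is only an illustration of the sibling-restricted barycenter idea under stated assumptions: trees are nested 2-tuples, matched leaves share a label, all function names are hypothetical, and the level-wise collapse-and-expand scheme, the dummy nodes, and the alternation between the two trees are deliberately left out.

```python
def leaves(tree):
    """Return the leaf labels of a nested-tuple binary tree from top to bottom."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def barycenter(tree, pos):
    """Average position, in the fixed tree, of the leaves below this node."""
    labels = leaves(tree)
    return sum(pos[label] for label in labels) / len(labels)


def barycentric_pass(tree, pos):
    """Reorder siblings bottom-up so the child with the smaller barycenter
    comes first; ties keep the input order. Only siblings are ever swapped,
    so the result is a layout of the same tree."""
    if not isinstance(tree, tuple):
        return tree
    first = barycentric_pass(tree[0], pos)
    second = barycentric_pass(tree[1], pos)
    if barycenter(first, pos) <= barycenter(second, pos):
        return (first, second)
    return (second, first)
```

With the fixed order `[1, 2, 3, 4]`, that is, `pos = {1: 0, 2: 1, 3: 2, 4: 3}`, the call `barycentric_pass(((4, 3), (2, 1)), pos)` returns `((1, 2), (3, 4))`, which removes all crossings in this small example. In general a single pass of this kind is a heuristic and gives no optimality guarantee.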
At this point, the process is reversed and levels are expanded again (with in-between CR) until the leaf levels are reached. This is illustrated in Figs. 4j to 4o. Such collapse-and-expand cycles are repeated until the number of crossings does not decrease any further. A last step remains: the original number of levels in both hierarchies needs to be restored, that is, all dummy nodes are contracted (see Fig. 4p). In our example the hierarchy sorting algorithm has reduced the number of crossings from 13 to 2.

The asymptotic running time of this algorithm depends of course on the number $N$ of collapse-and-expand cycles and the maximum number $N'$ of executions of the linear-time barycentric heuristic on each level. In our experiments (see Section 4) it turned out that in all instances we had $N \le 2$ and $N' \le 2$. Under the condition that both $N$ and $N'$ are constants, hierarchy sort runs in $O(n \cdot H)$ time, where $H$ is the maximum height of the two trees. In the case of complete trees $H = \log n$, and the running time is $O(n \log n)$.

3.3 Integer Quadratic Program

For the IQP we introduce a binary variable $x_u$ for each inner node $u$ of $S \cup T$. If $x_u = 1$, the two subtrees of $u$ change their order with respect to the input drawing, otherwise the order of the input drawing is kept. Let $ab$ and $cd$ be two inter-tree edges with $a, c \in S$ and $b, d \in T$. Let $v \in S$ and $w \in T$ be the lowest common ancestors of the leaves $a$ and $c$, and of $b$ and $d$, respectively. Assume that $ab$ and $cd$ cross each other in the original drawing. Then $ab$ and $cd$ cross each other in the solution encoded by the IQP if and only if $x_v \cdot x_w = 1$ or $(1 - x_v) \cdot (1 - x_w) = 1$. Otherwise, if $ab$ and $cd$ do not cross each other originally, they will cross in the solution encoded by the IQP if and only if $x_v \cdot (1 - x_w) = 1$ or $(1 - x_v) \cdot x_w = 1$.
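The crossing condition for a pair of edges can be made concrete with a small Python sketch. It evaluates the product terms for every pair of inter-tree edges, with $v$ and $w$ the lowest common ancestors in $S$ and $T$, and minimizes their sum by plain enumeration of all 0/1 assignments instead of calling a solver. The nested-tuple tree encoding, the identification of matched leaves by a shared label, the addressing of inner nodes by root-to-node paths, and all function names are assumptions made for this illustration; enumeration is only feasible for very small trees.

```python
from itertools import combinations, product


def inner_nodes(tree, path=()):
    """Yield the path (tuple of child indices) of every inner node."""
    if isinstance(tree, tuple):
        yield path
        yield from inner_nodes(tree[0], path + (0,))
        yield from inner_nodes(tree[1], path + (1,))


def leaf_paths(tree, path=()):
    """Map each leaf label to its root-to-leaf path."""
    if not isinstance(tree, tuple):
        return {tree: path}
    paths = leaf_paths(tree[0], path + (0,))
    paths.update(leaf_paths(tree[1], path + (1,)))
    return paths


def lca_and_order(path_a, path_c):
    """Return (path of the lowest common ancestor, True if a is above c)."""
    k = 0
    while path_a[k] == path_c[k]:
        k += 1
    return path_a[:k], path_a[k] < path_c[k]


def iqp_objective(S, T, x):
    """Sum of the product terms over all pairs of inter-tree edges.
    x maps ('S', path) or ('T', path) of an inner node to 0 or 1."""
    ps, pt = leaf_paths(S), leaf_paths(T)
    total = 0
    for a, c in combinations(sorted(ps), 2):
        v, a_above_s = lca_and_order(ps[a], ps[c])
        w, a_above_t = lca_and_order(pt[a], pt[c])
        xv, xw = x[('S', v)], x[('T', w)]
        if a_above_s != a_above_t:  # the two edges cross in the input drawing
            total += xv * xw + (1 - xv) * (1 - xw)
        else:                       # the two edges do not cross in the input
            total += xv * (1 - xw) + (1 - xv) * xw
    return total


def leaf_order(tree, x, side, path=()):
    """Leaf order after applying the swaps encoded by x (used as a cross-check)."""
    if not isinstance(tree, tuple):
        return [tree]
    parts = [leaf_order(tree[i], x, side, path + (i,)) for i in (0, 1)]
    if x[(side, path)]:
        parts.reverse()
    return parts[0] + parts[1]


def crossings(order_s, order_t):
    """Number of inversions between the two leaf orders."""
    pos = {leaf: i for i, leaf in enumerate(order_t)}
    seq = [pos[leaf] for leaf in order_s]
    return sum(1 for i, j in combinations(range(len(seq)), 2) if seq[i] > seq[j])


def solve_by_enumeration(S, T):
    """Minimize the objective over all assignments; exponential, tiny inputs only."""
    keys = [('S', p) for p in inner_nodes(S)] + [('T', p) for p in inner_nodes(T)]
    best = None
    for bits in product((0, 1), repeat=len(keys)):
        x = dict(zip(keys, bits))
        value = iqp_objective(S, T, x)
        if best is None or value < best[0]:
            best = (value, x)
    return best
```

For `S = ((1, 2), (3, 4))` and `T = ((1, 3), (2, 4))`, the enumeration finds an optimum of one crossing, and for every assignment the objective agrees with the inversion count of the resulting leaf orders. The sketch relies on the fact that the relative order of two leaves flips exactly when the subtrees of their lowest common ancestor are swapped.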
Thus the total number of edge crossings can be expressed as the sum of these products for all pairs of edges. The IQP minimizes this sum as its objective function. No further constraints, apart from the variables being binary, are necessary.

4 Experimental Results

The recursive splitting algorithms were written in Java 1.5 and executed in SuSE Linux 9.3 running on an AMD Opteron 248 2.2 GHz system with 4 GB RAM. The hierarchy sorting algorithm was implemented in Delphi 7.0 and executed in Windows XP on an Intel Pentium 4 2.8 GHz system with 1 GB RAM. The quadratic program was solved with the mathematical programming software CPLEX 9.1 running on the above Linux system.

4.1 Data

We generated four sets (A–D) of random tanglegrams. Set A contains ten pairs of complete binary trees with random leaf orders for each $n = 16, 32, \ldots, 256$. In set B we simulated tree mutations by starting with two identical complete binary trees and then randomly swapping the positions of up to 20% of the leaves of one tree. This is done as follows: we first pick a leaf uniformly at random and then iteratively climb up the tree with probability 0.75 in each step. From the node thus reached we climb back down and flip a coin at each node to choose its left or right child until we reach another leaf. This leaf and the leaf picked in the beginning are swapped. Thus the probability of two leaves being swapped decreases with their distance in the tree.

Set C contains ten pairs of general binary trees for each $n = 20, 40, \ldots, 200$. The trees are constructed from a set of nodes, initially containing the $n$ leaves, by iteratively joining two random nodes in a new parent node that replaces its children in the set. This process generates trees that resemble phylogenetic trees or clustering dendrograms.
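The construction used for set C can be sketched in a few lines of Python. The text does not state how the two random nodes are drawn, so the sketch assumes that two distinct nodes are chosen uniformly at random in each step; the nested-tuple representation and the function names are likewise assumptions, not the generator used for the experiments.

```python
import random


def random_binary_tree(n, rng):
    """Build a random binary tree on the leaves 0..n-1 as nested 2-tuples.

    Assumption: in each step two distinct nodes of the current set are
    drawn uniformly at random and joined under a new parent node.
    """
    if n < 1:
        raise ValueError("need at least one leaf")
    nodes = list(range(n))
    while len(nodes) > 1:
        i, j = rng.sample(range(len(nodes)), 2)
        parent = (nodes[i], nodes[j])
        # Remove the two children (larger index first) and add their parent.
        for k in sorted((i, j), reverse=True):
            nodes.pop(k)
        nodes.append(parent)
    return nodes[0]


def leaves(tree):
    """Return the leaf labels of a nested-tuple tree from top to bottom."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])
```

Calling `random_binary_tree(20, random.Random(1))` twice with freshly seeded generators yields the same tree, which makes generated instances reproducible. A tanglegram in the spirit of set C would pair two such trees built independently on the same leaf labels.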
Set D is similar to set C but again in each tanglegram the second tree is a mutation of the first tree, where up to 10% of the leaves can swap positions as done in set B and up to 25% of the subtrees can reattach to another edge. This edge is selected in a random walk starting at the subtree's old position. The walk continues with probability 0.75 and chooses the left or right edge by tossing a coin. Trees in this set are of interest since real-world tanglegrams often consist of two related and rather similar trees. The average crossing numbers of the trees in sets A–D are given in Fig. 8 in the appendix.

Our real-world examples comprise three sets (E–G) of tanglegrams. Set E contains six pairs of dendrograms of a hierarchically clustered social network based on email communication of 21 subjects [6]. Sets F and G contain six and ten pairs of phylogenetic trees for 15 species of pocket gophers and 17 species of lice, respectively [7]. (Fig. 1 shows a tanglegram in set G.) While the email tanglegrams have between 23 and 45 crossings in an optimal solution, the phylogenetic trees can be drawn with at most two crossings, most of them even without any crossings.

4.2 Performance

In the following we denote the original recursive algorithm of Buchin et al. [2] by rec-split and our modification for unbalanced trees by rec-split-improved. The algorithm of Holten and van Wijk is hierarchy sort. Let $n$ be the size of an instance, that is, the number of leaves per tree.

To each tanglegram we applied the three algorithms and the IQP, and recorded the crossing numbers $c_i$ in their respective solutions for $i$ = rec-split, rec-split-improved, hierarchy sort. We then computed for each tanglegram the performance ratios $(c_i + 1) / (c_{\mathrm{opt}} + 1)$, where $c_{\mathrm{opt}}$ denotes the optimal number of crossings obtained from solving the IQP.
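As a small numeric illustration of this ratio, the following Python sketch computes it for given crossing numbers. The function name is hypothetical and the input values in the usage note are made up for illustration; they are not measurements from the experiments.

```python
def performance_ratio(c_alg, c_opt):
    """Return (c_alg + 1) / (c_opt + 1) for non-negative crossing numbers."""
    if c_alg < 0 or c_opt < 0:
        raise ValueError("crossing numbers must be non-negative")
    return (c_alg + 1) / (c_opt + 1)
```

For a crossing-free optimum, a heuristic solution with two crossings gives `performance_ratio(2, 0) == 3.0`, while the same absolute gap of two crossings on an instance with optimum 100 gives a ratio of only 103/101. Ratios on instances with few crossings are therefore much more sensitive to small absolute differences.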
Note that we add one to both crossing numbers in order to have a well-defined ratio also for crossing-free instances. The results for sets A–D are shown in Fig. 5.

For complete binary trees (sets A and B) rec-split and rec-split-improved achieved similar performance ratios that tend to 1 as the size of the trees grows. Recall that on these complete instances both algorithms are 2-approximations. On average both methods performed slightly better on mutated trees (B) than on random trees (A). In several cases the instances in set B could be solved optimally. The average performance ratio of hierarchy sort was slightly worse for the random trees of set A and drastically worse for the mutated trees of set B with average values between 2.23 and 4.4 in comparison to values between 1 and 1.05 for the recursive algorithms. Furthermore, hierarchy sort performed better on random trees than on mutated trees. Note that the absolute number of crossings is lower for mutated trees; thus a difference of only 1 or 2 to the optimum can already lead to relatively large ratios for small $n$.

For general binary trees the performance ratios of rec-split and rec-split-improved are no longer upper-bounded by 2 but at least for random trees (C) the ratios were on average well below 2. As expected, rec-split-improved outperformed rec-split due to the modification for unbalanced trees. Algorithm rec-split attained performance ratios close to 1 for most random instances but it had some outliers as well. The solutions of rec-split-improved were not only closer to the optimum, they also spread much less. Note that due to excessive computation times of several hours we did not record the results of rec-split for $n \ge 110$. The hierarchy sorting algorithm yielded results that were clearly inferior to those of the recursive algorithms. But it still achieved performance ratios below 1.2 as $n$ grows.
In general, the behavior of both rec-split-improved and hierarchy sort did not differ much between sets A and C and thus the completeness of the trees seems of low impact on the solution quality. In contrast rec-split performed worse on general trees than on complete trees.

Fig. 5: Performance ratios of the three algorithms rec-split, rec-split-improved, and hierarchy sort, plotted against the number of leaves. Panels: A) random complete binary trees; B) mutated complete binary trees; C) random general binary trees; D) mutated general binary trees. The boxplots show medians, first and third quartiles, minimum and maximum values. Arithmetic means are indicated by crosses.

For the mutated trees of set D with relatively fewer crossings in the optimal solution we use a logarithmic scale for the performance ratio. The results were generally worse and spread a lot more for all three algorithms, but still rec-split-improved had the best performance with the upper quartile of the ratios mostly below 3. On the other hand the upper quartile of hierarchy sort reached values mostly above 20 and up to 85 for $n = 100$.
Algorithm rec-split reached performance ratios between those of the other two algorithms. For $n \ge 90$ we stopped considering rec-split since its computation times became again too high.

The relative performance of the algorithms for random non-complete trees was confirmed by the results for the sets E–G of real-world examples, see Fig. 6. For the clustering data of set E with an average of 33.5 crossings rec-split-improved reached an average performance ratio of 1.06, rec-split was slightly worse with 1.15, and hierarchy sort had an average ratio of 1.86. The phylogenetic data of sets F and G can often be drawn without crossings and thus have average crossing numbers of only 0.17 and 0.7, respectively. This explains the relatively high performance ratios. Still, rec-split-improved found an optimum layout for four of the six examples in set F and solved all ten instances in set G optimally.

Fig. 6: Performance ratios for real-world examples (sets E, F, and G).

4.3 Running Time

Although the number of crossings is the main aspect to assess the quality of TL algorithms, their running time is also important, especially if the layouts are to be produced interactively. Figure 7 shows plots of the running times of rec-split, rec-split-improved, rec-split-bb (the branch-and-bound implementation of rec-split-improved), hierarchy sort, and the IQP for all four classes of random tanglegrams. Note the use of log scales. Recall that hierarchy sort was written in Delphi instead of Java and executed on a different system. Hence the absolute running times of hierarchy sort are to be taken with a grain of salt.

By far the fastest algorithm in all our examples was hierarchy sort, which took at most 12 ms for complete trees and less than 90 ms for arbitrary trees.
No difference in terms of the running time could be seen between random pairs and mutated pairs of trees. In contrast, the measured running times of the branch-and-bound algorithm rec-split-bb were about ten times higher for complete trees and between four and six times higher for general trees. Still, the median running time of rec-split-bb was less than 360 ms for all instances. In the direct comparison of random (C) and mutated (D) pairs of non-complete trees with the same number of leaves, the random pairs, which have a much larger crossing number, required between 50 and 100% more computation time.

Fig. 7: Median running times of the algorithms rec-split, rec-split-improved, rec-split-bb, hierarchy sort, and IQP (in seconds), plotted against the number of leaves for sets A–D. Note the use of log scales.

The naive recursive implementations of rec-split and rec-split-improved were far slower than hierarchy sort and rec-split-bb. For complete binary trees they both grow at a cubic rate in $n$, but rec-split-improved is about three times faster than rec-split. This is due to the additional $O(n^2 h)$-time preprocessing step mentioned in Section 3.1. Neither algorithm is influenced by the class of the complete trees (A or B), as is to be expected from their definitions. For general binary trees their running times quickly grew up to several hours, at least for some of the instances, which is due to the fact that the running time is exponential in the tree height. Thus complete trees with 256 leaves could be solved as fast as some random trees with only 50 leaves. The better performance in terms of crossings of rec-split-improved for unbalanced trees is paid for by higher running times in comparison to rec-split. Interestingly, layouts for pairs of mutated and thus rather similar trees took much more time to compute than layouts for two random trees. One explanation is that for a subinstance $\langle S', T' \rangle$ the smaller of the heights of $S'$ and $T'$ determines the recursion depth. Thus for two similar trees with similar heights the recursion depth will be larger on average than for two random trees with fairly different heights.

It is also noteworthy that the running time of rec-split-bb in our experiments was dominated by the above-mentioned $O(n^2 h)$-time preprocessing step, which unlike the recursive algorithms does not depend exponentially on the height.

Finally, we look at the running times of the IQP. In contrast to the recursive algorithms the running time of the IQP is independent of the height (and thus the completeness) of the trees. Rather it is the value of the objective function, that is, the crossing number, that influences the solution time. Therefore all mutated trees, which have relatively small crossing numbers, could be solved optimally within the time limit of ten minutes. On the other hand for random tanglegrams optimality could only be proven for $n \le 64$. For larger instances a slowly increasing gap between the best found integer solution and the fractional solution remained, see Fig. 8 (top).
5 Conclusions

The experimental evaluation shows that in terms of crossing reduction our improvement of the recursive splitting algorithm of Buchin et al. [2] clearly has the best performance for all instances that were included in the tests. Moreover, our branch-and-bound implementation is fast enough (less than 0.4 seconds for trees with 200 leaves each) to be used interactively. Thus it is the method of choice for drawing binary tanglegrams with up to a few hundred leaves. Still, in terms of running time the hierarchy sorting heuristic of Holten and van Wijk [8] outperforms the recursive splitting algorithm; it can thus also be used for very large trees if the number of crossings is not the main optimization criterion. Also, it is currently the only method that can draw non-binary tanglegrams. For medium-sized tanglegrams that consist of two similar trees and thus have a rather small crossing number, it is worth trying to solve the very simple integer quadratic program to obtain the optimal solution; often this takes but a few seconds.

References

1. P. Bertolazzi, G. Di Battista, C. Mannino, and R. Tamassia. Optimal upward planarity testing of single-source digraphs. SIAM J. Comput., 27(1):132–169, 1998.
2. K. Buchin, M. Buchin, J. Byrka, M. Nöllenburg, Y. Okamoto, R. I. Silveira, and A. Wolff. Drawing binary tanglegrams: Hardness, approximation, fixed-parameter tractability. Available at h
3. T. Dwyer and F. Schreiber. Optimal leaf ordering for two and a half dimensional phylogenetic tree visualization. Proc. Australasian Sympos. Inform. Visual. (InVis.au’04), volume 35 of CRPIT, pages 109–115. Australian Computer Society, 2004.
4. P. Eades and N. Wormald. Edge crossings in drawings of bipartite graphs. Algorithmica, 10:379–403, 1994.
5. H. Fernau, M. Kaufmann, and M. Poths. Comparing trees via crossing minimization. In Proc. 25th Intern. Conf. Found. Softw. Techn. Theoret. Comput. Sci. (FSTTCS’05), volume 3821 of Lecture Notes Comput. Sci., pages 457–469, 2005.
6. R. Görke, M. Gaertler, and D. Wagner. LunarVis – Analytic Visualizations of Large Graphs. In Proc. 15th Internat. Sympos. Graph Drawing (GD’07), volume 4875 of Lecture Notes Comput. Sci., pages 352–364. Springer-Verlag, 2008.
7. M. S. Hafner, P. D. Sudman, F. X. Villablanca, T. A. Spradling, J. W. Demastes, and S. A. Nadler. Disparate rates of molecular evolution in cospeciating hosts and parasites. Science, 265:1087–1090, 1994.
8. D. Holten and J. J. van Wijk. Visual comparison of hierarchically organized data. In Proc. 10th Eurographics/IEEE-VGTC Sympos. Visualization (EuroVis’08), pages 759–766, 2008.
9. M. Jünger and P. Mutzel. 2-layer straightline crossing minimization: Performance of exact and heuristic algorithms. J. Graph Algorithms Appl., 1(1):1–25, 1997.
10. A. Lozano, R. Y. Pinter, O. Rokhlenko, G. Valiente, and M. Ziv-Ukelson. Seeded tree alignment and planar tanglegram layout. In Proc. 7th Internat. Workshop Algorithms Bioinformatics (WABI’07), volume 4645 of Lecture Notes Comput. Sci., pages 98–110. Springer-Verlag, 2007.
11. R. D. M. Page, editor. Tangled Trees: Phylogeny, Cospeciation, and Coevolution. University of Chicago Press, 2002.
12. K. Sugiyama, S. Tagawa, and M. Toda. Methods for visual understanding of hierarchical system structures. IEEE Trans. Systems, Man, and Cybernetics, 11(2):109–125, 1981.
Appendix

[Figure 8: two plots of number of crossings against number of leaves. Top: (A) random complete binary trees and (C) random general binary trees. Bottom: (B) mutated complete binary trees and (D) mutated general binary trees.]

Fig. 8: Average crossing numbers of our randomly-generated instances. For random (non-mutated) trees in sets A and C (top) with n ≥ 70 there is a remaining gap between the best found integer solution and the best fractional solution. Both values are plotted; however, the gap is relatively small (less than 70 for n ≤ 256) and hardly visible.
