
On Visualizing a Pair of Phylogenetic Trees

SUNG-HYUK CHA
Pace University
Department of Computer Science
1 Pace Plaza, New York, NY, 10038 USA

[email protected]

Abstract: When n is the number of taxa, a phylogenetic tree can be displayed in $2^{n-1}$ planar ways. Which tree should be displayed to users, i.e., which one is the most visually appealing, is a hard problem to formulate computationally. Another debatable question in bioinformatics is which tree is better when two different phylogenetic trees are built using different measures and/or methods. This article claims that displaying a pair of alternative phylogenetic trees together, by finding a proper order of taxa in the leaf nodes, might give better insight into the tree of life. If there exists an order out of all permutations such that there is no crossing among branches for both trees, we claim that it is the most visually pleasing order for both trees. If not, the problem is formulated as minimizing the number of crossings in the trees. A genetic algorithm is suggested to find a semi-optimal leaf node order.

Key–Words: dendrogram, hierarchical clustering, phylogenetic tree

1 Introduction

Clustering is one of the most important data analysis concepts and numerous methods have been studied [1]. Among these, hierarchical clustering methods [1, 2] have been widely used, especially in biological taxonomy [3] and bioinformatics [4], due to their outstanding visual output as exemplified in Fig. 1. They produce a visual tree representation of the hierarchical clusters called a ‘dendrogram’ in general or a ‘cladogram’ in bioinformatics. Numerous algorithms to build or find a dendrogram have been studied (see [1, 2] for efficient algorithms and their computational complexities).

Figure 1: A sample phylogenetic tree

Two major problems in the area of phylogenetic trees, or hierarchical cluster trees in general, are comparing and visualizing trees. First, two or more distinct and conflicting trees are often produced not only by different algorithms or techniques, but also by different measures. Although numerous methods, including the earliest cophenetic correlation coefficient [5], have been proposed to measure and compare multiple dendrograms [6, 7], it is a very hard problem and occasionally subjective.

The second problem is a visualization problem: which tree is the most visually pleasing or meaningful among a combinatorially large number of isomorphic trees? In other words, which order of leaf nodes should be selected? Again, it is a hard and subjective problem.

One possible solution to both questions is suggested in this article. When there are two conflicting trees, the proposed method shows both trees with a leaf node order such that no crossing occurs in either tree, as shown in Fig. 2. When no such order exists, the problem is formulated as a crossing minimization problem, which minimizes the number of crossings of branches.

The rest of the paper is organized as follows. Section 2 deals with how the dendrogram is represented and its related combinatorics. Phylogenetic trees with regard to FoxP2 are introduced in Section 3 to justify the necessity of this study. The crossing minimization algorithm is described in Section 4. Finally, Section 5 concludes this work.

Mathematics and Computers in Biology and Biomedical Informatics

ISBN: 978-960-474-333-9 73


Figure 2: Displaying two dendrograms: T1 on top and T2 below

Table 1: A sample distance matrix of five taxa.

     A     B     C     D     E
A    0     8.0   20.0  16.0  9.8
B    8.0   0     21.5  17.9  9.8
C    20.0  21.5  0     4.0   11.7
D    16.0  17.9  4.0   0     8.1
E    9.8   9.8   11.7  8.1   0

2 Tree Representation and its Combinatorics

Suppose that there are five taxa {A,B,C,D,E} and the distance matrix among them is given in Table 1. When the agglomerative single linkage clustering method (see [1, 2] for algorithm description and complexity) is used, the dendrogram in Fig. 3 can be built and displayed in its default form. Note that the numerous other methods described in [1, 2] may produce alternative trees. When another distance matrix is used, different trees may also be produced. In this section, the combinatorics of a single dendrogram are examined.

When there are n taxa, there are n − 1 internal nodes and n leaf nodes. Each internal node represents a merge of two clusters. A dendrogram can be represented as a table of internal nodes as exemplified in Table 2. The internal nodes are sorted in ascending order by height. Since the dendrogram is a binary tree, each internal node has two child nodes, a left node and a right node. The last column contains its height.

Figure 3: Default dendrogram of Table 1

Table 2: Dendrogram representation.

Level  Left   Right    Parent Height
4      {C}    {D}      4.0
3      {A}    {B}      8.0
2      {C,D}  {E}      8.1
1      {A,B}  {C,D,E}  9.8
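The table of internal nodes can be reproduced from the distance matrix of Table 1. The following is a minimal sketch of agglomerative single linkage clustering, not the author's implementation; the names `DIST`, `d`, and `single_linkage` are illustrative assumptions, and the merged cluster is kept at the position of its left child so that the Left and Right columns come out in the default display order.

```python
from itertools import combinations

# Distance matrix of Table 1 (upper triangle only).
DIST = {
    ("A", "B"): 8.0, ("A", "C"): 20.0, ("A", "D"): 16.0, ("A", "E"): 9.8,
    ("B", "C"): 21.5, ("B", "D"): 17.9, ("B", "E"): 9.8,
    ("C", "D"): 4.0, ("C", "E"): 11.7, ("D", "E"): 8.1,
}

def d(x, y):
    """Symmetric lookup of the distance between two taxa."""
    return DIST[(x, y)] if (x, y) in DIST else DIST[(y, x)]

def single_linkage(taxa):
    """Return the internal nodes as (left, right, height) in merge order."""
    clusters = [frozenset(t) for t in taxa]
    table = []
    while len(clusters) > 1:
        # Single linkage: cluster distance is the minimum pairwise distance.
        height, left, right = min(
            ((min(d(x, y) for x in a for y in b), a, b)
             for a, b in combinations(clusters, 2)),
            key=lambda item: item[0],
        )
        # Keep the merged cluster where its left child was displayed.
        i, j = clusters.index(left), clusters.index(right)
        clusters[i] = left | right
        del clusters[j]
        table.append((left, right, height))
    return table

# Rows appear in ascending order of height, as in Table 2.
for left, right, height in single_linkage("ABCDE"):
    print(sorted(left), sorted(right), height)
```

Ties in the distance matrix are broken by the first pair encountered, which is an arbitrary choice of this sketch.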

In the default display of the dendrogram in Fig. 3, the leaf level has the order (A,B,C,D,E). However, any permutation can be used and hence n! drawings are equivalent to the default dendrogram. Good orderings do not have any crossings, while a bad ordering may result in crossings as shown in Fig. 4 (b).

(a) Good dendrograms

(b) Bad dendrograms with crossings

Figure 4: Permuted dendrograms of Table 2

Theorem 1 There are $2^{n-1}$ good dendrograms that do not have any crossing.

Proof: Suppose that the internal nodes can rotate as if the dendrogram were a mobile toy. In other words, the left and right children of any internal node can be swapped. No number of such swaps at any internal node can introduce a crossing. Since there are n − 1 internal nodes, each either swapped or not, there are $2^{n-1}$ good dendrograms without crossing. □

Each good dendrogram can be encoded as a binary string of length n − 1 as depicted in Fig. 3. Zero means no swap and one means a swap between the left and right children. All valid good dendrograms of five taxa are enumerated in Fig. 4 (a) with the corresponding encoded binary strings.
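The binary encoding can be illustrated with a short sketch that decodes each string of length n − 1 into a leaf order for the dendrogram of Table 2. The nested-tuple tree and the names `LEVELS`, `leaf_order`, and `decode` are assumptions made for illustration; in particular, the mapping of bit positions to internal nodes (root first) is a guess, since the figure that fixes it is not reproduced here.

```python
from itertools import product

# Internal nodes of Table 2 as nested (left, right) pairs.
CD = ("C", "D")       # level 4
AB = ("A", "B")       # level 3
CDE = (CD, "E")       # level 2
ROOT = (AB, CDE)      # level 1

# Assumption: bit k of the string controls LEVELS[k] (root first).
LEVELS = [ROOT, CDE, AB, CD]

def leaf_order(node, swap):
    """Leaf order of `node`; `swap` maps each internal node to 0 or 1."""
    if isinstance(node, str):
        return [node]
    left, right = node
    if swap[node]:                      # 1 means swap left and right child
        left, right = right, left
    return leaf_order(left, swap) + leaf_order(right, swap)

def decode(bits):
    """Decode a binary string such as '0101' into a leaf order string."""
    return "".join(leaf_order(ROOT, dict(zip(LEVELS, map(int, bits)))))

# Enumerate all 2^(n-1) encodings for n = 5 taxa.
orders = {"".join(b): decode(b) for b in product("01", repeat=len(LEVELS))}
for bits, order in orders.items():
    print(bits, order)
```

Every string yields a distinct crossing-free order, which agrees with the count in Theorem 1 for five taxa.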

3 Phylogenetic Trees with regard to FoxP2

Forkhead box protein P2 (FoxP2) gene DNA sequences for the following eleven species were retrieved from the database (http://blast.ncbi.nlm.nih.gov): Bos taurus, Canis familiaris, Equus caballus, Gorilla, Macaca mulatta, Monodelphis, Mus musculus, Pan paniscus (bonobo), Pan troglodytes (chimp), Pongo pygmaeus (Bornean orangutan), and Rattus norvegicus.

Figure 5: T3 default dendrogram (height axis from 0 to 10; leaf labels: Pan_troglo, Pan_panisc, Gorilla_go, Pongo_pygm, Macaca_mul, Bos_taurus, Equus_caba, Mus_muscul, Monodelphi, Canis_fami, Rattus_nor)

Although numerous distinct alternative trees can be produced depending on the choice of distance measures and/or algorithms, three distinct dendrograms are examined in this section. Two trees, T1 and T2, were already given in Fig. 2 and T3 is shown in Fig. 5. The Jukes-Cantor method is used to calculate pairwise distances for T1 and T3, whereas the alignment score is used for T2. The score to treat indels in nucleotides is used for T1 and T3, whereas pairwise-delete is used for T2. The unweighted pair group method with arithmetic mean, single linkage, and complete linkage clustering methods are used for T1, T2, and T3, respectively. The point is that T1, T2, and T3 are all distinct alternative dendrograms.

Among the 11! = 39916800 permutations of the leaf node order, each tree has $2^{10}$ = 1024 orders which do not have any crossing in the tree. The good dendrograms of T1 and T2 share 16 leaf node orders, as depicted in Fig. 6. However, there is no leaf node order for which both T2 and T3 have no crossing. It is inevitable to have some crossings when both dendrograms are displayed together, as shown in Fig. 7.

Figure 6: Overlap among the good leaf node orders of T1, T2, and T3 (T1 only: 944; T2 only: 1008; T3 only: 960; T1 and T2: 16; T1 and T3: 64; T2 and T3: 0)
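The counting argument of this section can be replayed on a small scale. The sketch below checks the two totals quoted for eleven taxa and then intersects the sets of good leaf orders of small trees. `T_A` is the tree of Table 2, while `T_B` and `T_C` are hypothetical alternative trees invented for illustration, since the FoxP2 dendrograms themselves are not available in tabular form here.

```python
from math import factorial

def good_orders(node):
    """All crossing-free leaf orders of a nested (left, right) tree."""
    if isinstance(node, str):
        return {node}
    left, right = (good_orders(child) for child in node)
    unswapped = {a + b for a in left for b in right}
    swapped = {b + a for a in left for b in right}
    return unswapped | swapped

T_A = (("A", "B"), (("C", "D"), "E"))   # tree of Table 2
T_B = ((("A", "B"), "E"), ("C", "D"))   # hypothetical alternative tree
T_C = (("A", "C"), (("B", "D"), "E"))   # hypothetical alternative tree

# Totals quoted in the text for eleven taxa.
print(factorial(11), 2 ** 10)

# Leaf orders that are crossing-free for both trees of a pair.
print(sorted(good_orders(T_A) & good_orders(T_B)))
print(sorted(good_orders(T_A) & good_orders(T_C)))
```

For the pair `T_A` and `T_C` the intersection is empty, which is the small-scale analogue of the situation for T2 and T3. Explicit enumeration of this kind grows as $2^{n-1}$ per tree, so it is only meant for small n.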

Figure 7: Displaying two dendrograms: T2 on top and T3 below

4 Crossing Minimization Problem

This section considers the problem of finding the minimum number of crossings for a given dendrogram and a given leaf node order. It should be noted that a crossing in this article differs from that in the planarity problem, where every tree is a planar graph and thus drawable without any crossings.

There are two important constraints in drawing dendrograms. First, leaf nodes must be on the same bottom level and their order is specified. Second, all internal nodes’ heights must correspond to the user-defined distance between two clusters, either exactly or at least order-wise.

The crossing minimization problem of a pair of phylogenetic trees takes an order of leaf nodes and a dendrogram table as exemplified in Table 2. Out of the 120 possible orders of five taxa, 16 orders do not have any crossing. When an arbitrary order is given, what is the minimum number of crossings? Consider the leaf node orders <A,B,C,D,E>, <C,E,A,D,B>, and <C,A,D,E,B> with the dendrogram in Table 2. The answers are 0, 2, and 2, respectively. Although the third case is drawn with three crossings at the root of the traversal tree in Fig. 8 (c), it is certainly drawable with only two crossings, but not with zero or one crossing.

(a) best case

(b) typical case

(c) worst case

Figure 8: Binary traversal trees for computing the minimum number of crossings.

Here is a naïve algorithm to compute the minimum number of crossings. It starts by merging the closest pair of clusters and counting how many other clusters lie between these two clusters. In Fig. 8 (b), the closest clusters are {C} and {D} and there are two other clusters, {E} and {A}, between them. Now {C} and {D} can be merged to the left or to the right, and both choices must be examined. When this process is repeated recursively, each leaf node in the traversal tree contains only two clusters together with the sum of all crossings. The minimum such sum over the leaf nodes is the minimum number of crossings.
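The recursive procedure described above can be sketched as follows. This is one reading of the naïve algorithm, not the author's code: "merged to left or right" is interpreted as placing the merged cluster at the position of its leftmost or rightmost member, and the names `MERGES` and `min_crossings` are illustrative assumptions.

```python
from itertools import permutations

# Internal nodes of Table 2 in ascending order of height.
MERGES = [
    (frozenset("C"), frozenset("D")),
    (frozenset("A"), frozenset("B")),
    (frozenset("CD"), frozenset("E")),
    (frozenset("AB"), frozenset("CDE")),
]

def min_crossings(order, merges):
    """Minimum number of crossings of the dendrogram on leaf `order`."""
    def visit(clusters, step):
        if step == len(merges):
            return 0
        a, b = merges[step]
        i, j = sorted((clusters.index(a), clusters.index(b)))
        between = clusters[i + 1:j]          # clusters that get crossed
        merged = a | b
        # Place the merged cluster at the left or at the right position.
        at_left = clusters[:i] + [merged] + between + clusters[j + 1:]
        at_right = clusters[:i] + between + [merged] + clusters[j + 1:]
        # Both placements coincide when nothing lies in between.
        options = [at_left, at_right] if between else [at_left]
        return len(between) + min(visit(c, step + 1) for c in options)
    return visit([frozenset(t) for t in order], 0)

for order in ("ABCDE", "CEADB", "CADEB"):
    print(order, min_crossings(order, MERGES))   # 0, 2, and 2

# Number of orders of five taxa without any crossing.
print(sum(min_crossings(p, MERGES) == 0 for p in permutations("ABCDE")))
```

Skipping the second branch when the two clusters are adjacent keeps the traversal tree degenerate in the crossing-free case, while an order with crossings at every merge makes it branch at every level.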

The best case for the computational complexity of this algorithm is when there is no crossing, as given in Fig. 8 (a). The worst case is the almost perfect binary tree as given in Fig. 8 (c). A genetic algorithm can be applied to find a suboptimal order, using the number of crossings as the evaluation measure.
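The article does not specify the genetic algorithm, so the following is only one possible sketch under stated assumptions: individuals are leaf orders, the evaluation measure is the summed minimum crossing count over both trees, selection is elitist truncation, and the crossover and mutation operators are common choices for permutations. `TREE_2`, the parameter defaults, and all function names are hypothetical.

```python
import random

def min_crossings(order, merges):
    """Naive minimum crossing count, as described in this section."""
    def visit(clusters, step):
        if step == len(merges):
            return 0
        a, b = merges[step]
        i, j = sorted((clusters.index(a), clusters.index(b)))
        rest = clusters[i + 1:j]
        left = clusters[:i] + [a | b] + rest + clusters[j + 1:]
        right = clusters[:i] + rest + [a | b] + clusters[j + 1:]
        options = [left, right] if rest else [left]
        return len(rest) + min(visit(c, step + 1) for c in options)
    return visit([frozenset(t) for t in order], 0)

def merges(*pairs):
    """Build a merge table from pairs of taxa strings."""
    return [(frozenset(a), frozenset(b)) for a, b in pairs]

TREE_1 = merges(("C", "D"), ("A", "B"), ("CD", "E"), ("AB", "CDE"))  # Table 2
TREE_2 = merges(("A", "C"), ("B", "D"), ("BD", "E"), ("AC", "BDE"))  # hypothetical

def cost(order):
    """Evaluation measure: total crossings when both trees share `order`."""
    return min_crossings(order, TREE_1) + min_crossings(order, TREE_2)

def crossover(p, q, rng):
    """Keep a slice of p in place and fill the gaps with q's taxa in order."""
    i, j = sorted(rng.sample(range(len(p) + 1), 2))
    middle = p[i:j]
    rest = [t for t in q if t not in middle]
    return rest[:i] + middle + rest[i:]

def evolve(taxa, generations=50, size=30, mutation_rate=0.3, seed=0):
    rng = random.Random(seed)
    population = [rng.sample(taxa, len(taxa)) for _ in range(size)]
    for _ in range(generations):
        population.sort(key=cost)
        parents = population[: size // 2]      # elitist truncation selection
        children = []
        while len(parents) + len(children) < size:
            child = crossover(*rng.sample(parents, 2), rng)
            if rng.random() < mutation_rate:   # swap mutation
                a, b = rng.sample(range(len(child)), 2)
                child[a], child[b] = child[b], child[a]
            children.append(child)
        population = parents + children
    return min(population, key=cost)

best = evolve(list("ABCDE"))
print("".join(best), cost(best))
```

Because the parents survive each generation, the best cost never increases. A genetic algorithm gives no optimality guarantee, so the returned order should be read as a candidate rather than a proven minimum.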

5 Conclusions

This article suggests visualizing two alternative phylogenetic trees together. Doing so requires solving the crossing minimization problem, for which a naïve algorithm was introduced. More efficient algorithms and further studies are needed to solve the crossing minimization problem.

The binary tree representation of a dendrogram implies that the distances between clusters are strictly ordered. Often, certain nodes can have more than two branches. In such cases, one can impose a strict order on the distances so that the proposed algorithm can handle them.

References:

[1] R.–O. Duda, P.–E. Hart, and D.–G. Stork, Pattern Classification, Wiley New York, 2nd ed., 2000

[2] C. Olson, Parallel Algorithms for Hierarchical Clustering, Parallel Computing, vol 21, pp. 1313-1325, 1995

[3] G. Dunn and B.–S. Everitt, An Introduction to Mathematical Taxonomy, Cambridge University Press, 1982

[4] S. Sujatha, S. Balaji, and N. Srinivasan, PALI: a database of alignments and phylogeny of homologous protein structures, Bioinformatics, vol 17, pp. 375-376, 2001

[5] R.–R. Sokal and F.–J. Rohlf, The Comparison of Dendrograms by Objective Methods, Taxon, vol 11, no 2, pp. 33-40, 1962

[6] D.–F. Robinson and L.–R. Foulds, Comparison of phylogenetic trees, Mathematical Biosciences, vol 53, pp. 131-147, 1981

[7] T.–M.W. Nye, P. Lio, and W.–R. Gilks, A Novel Algorithm and Web-Based Tool for Comparing Two Alternative Phylogenetic Trees, Bioinformatics, vol 22, issue 1, pp. 117-119, 2005
