using networks to explore, quantify, and summarize phylogenetic tree space

35
Using networks to explore, quantify, and summarize phylogenetic tree space Jeremy M. Brown 1 , Guifang Zhou 2 , Wen Huang 2 , Jeremy Ash 1 , Melissa Marchand 2 , Kyle Gallivan 2 , and Jim Wilgenbusch 3 1 Louisiana State University, Dept. of Biological Sciences 2 Florida State University, Dept. of Mathematics 3 Florida State University, Dept. of Scientific Computing 160 150 140 130 120 110 100 90 80 70 540 550 560 570 580 590 600 610 620 630

Upload: jembrown

Post on 17-Dec-2014

237 views

Category:

Science


2 download

DESCRIPTION

A talk I gave at the 2014 Evolution meetings in Raleigh, NC on the use of network approaches to understanding sets of phylogenetic trees.

TRANSCRIPT

Page 1: Using networks to explore, quantify, and summarize phylogenetic tree space

Using networks to explore, quantify, and summarize phylogenetic tree space

Jeremy M. Brown1, Guifang Zhou2, Wen Huang2, Jeremy Ash1, Melissa Marchand2, Kyle Gallivan2, and Jim Wilgenbusch3

1 Louisiana State University, Dept. of Biological Sciences 2 Florida State University, Dept. of Mathematics

3 Florida State University, Dept. of Scientific Computing

−160 −150 −140 −130 −120 −110 −100 −90 −80 −70540

550

560

570

580

590

600

610

620

630

Page 2: Using networks to explore, quantify, and summarize phylogenetic tree space

The Team

Jeremy Brown

Jeremy Ash

Jim Wilgenbusch Guifang Zhou Wen Huang

Melissa MarchandKyle Gallivan

Page 3: Using networks to explore, quantify, and summarize phylogenetic tree space

Overview• Motivation

• Our network approaches

• Some applications

• Initial results

• Software

Page 4: Using networks to explore, quantify, and summarize phylogenetic tree space

Tree Sets

Motivation Our Approaches Applications Initial Results Software

Page 5: Using networks to explore, quantify, and summarize phylogenetic tree space

Summarizing Tree Sets• Consensus trees

• MASTs

• Clustering

• NLDR

Motivation Our Approaches Applications Initial Results Software

0.9

0.7

Page 6: Using networks to explore, quantify, and summarize phylogenetic tree space

Summarizing Tree Sets• Consensus trees

• Agreement subtrees

• Clustering

• NLDR

Motivation Our Approaches Applications Initial Results Software

0.9

Prune

Page 7: Using networks to explore, quantify, and summarize phylogenetic tree space

Summarizing Tree Sets• Consensus trees

• Agreement subtrees

• Clustering

• NLDR

Motivation Our Approaches Applications Initial Results Software

BIOINFORMATICS Vol. 18 Suppl. 1 2002Pages S285–S293

Statistically based postprocessing ofphylogenetic analysis by clustering

Cara Stockham 1, Li-San Wang 2,∗ and Tandy Warnow 2

1Texas Institute for Computational and Applied Mathematics, University of Texas,ACES 6.412, Austin, TX 78712, USA and 2Department of Computer Sciences,University of Texas, Austin, TX 78712, USA

Received on January 24, 2002; revised and accepted on March 29, 2002

ABSTRACTMotivation: Phylogenetic analyses often produce thou-sands of candidate trees. Biologists resolve the conflict bycomputing the consensus of these trees. Single-tree con-sensus as postprocessing methods can be unsatisfactorydue to their inherent limitations.Results: In this paper we present an alternative approachby using clustering algorithms on the set of candidatetrees. We propose bicriterion problems, in particular usingthe concept of information loss, and new consensus treescalled characteristic trees that minimize the informationloss. Our empirical study using four biological datasetsshows that our approach provides a significant improve-ment in the information content, while adding only a smallamount of complexity. Furthermore, the consensus treeswe obtain for each of our large clusters are more resolvedthan the single-tree consensus trees. We also providesome initial progress on theoretical questions that arise inthis context.Availability: Software available upon request from the au-thors. The agglomerative clustering is implemented usingMatlab (MathWorks, 2000) with the Statistics Toolbox. TheRobinson-Foulds distance matrices and the strict consen-sus trees are computed using PAUP (Swofford, 2001) andthe Daniel Huson’s tree library on Intel Pentium worksta-tions running Debian Linux.Contact: E-mail : [email protected] Information:http://www.cs.utexas.edu/users/lisan/ismb02/Keywords: consensus methods; clustering; phylogenet-ics; information theory; maximum parsimony.

INTRODUCTIONPhylogenetic analysis can be divided into three stages. Inthe first stage, a researcher collects data (such as DNAsequences) for each of the different taxa (genes, species,etc.) under study. In the second phase, she applies a tree re-construction method to the data. Many tree reconstruction

∗To whom correspondence should be addressed.

methods produce more than one candidate tree for the in-put dataset. For example, the maximum parsimony (Swof-ford et al., 1996) method returns those binary trees withthe lowest parsimony score. (The parsimony score of a treeis the minimum tree length, i.e., the sum of distances be-tween two endpoints across all edges, obtained by any wayof labeling the internal nodes.) Very often the number oftrees can be in the hundreds or thousands. In the last phase,a consensus tree of the candidate trees is computed so asto resolve the conflict, summarize the information, and re-duce the overwhelming number of possible solutions tothe evolutionary history.

Many consensus tree methods are available, but acommon feature to all of them is that they produce onetree. There are several shortcomings of this approachincluding loss of information and sensitivity to outliers.

In this paper we present a different approach to post-processing. The set of candidate trees is divided into sev-eral subsets using clustering methods. Each cluster is thencharacterized by its own consensus tree. We pose severaltheoretical optimization problems for these kinds of out-puts, and present some initial progress on these problems;these are presented in the section on Clustering Criteria.The bulk of our paper is focused on an empirical study,which is presented in the Experiments section. We con-clude our study and propose additional research problemsin the Conclusions section.

BACKGROUNDPhylogenetic treesA leaf-labeled tree topology can be decomposed into aset of bipartitions in the following manner. Each edge,when deleted from the tree, induces a bipartition of theleaves; thus, we can identify each edge with its inducedbipartition. Let t1 and t2 be two trees on the same leafset, and let E(t1) and E(t2) denote their sets of internaledges. The quantity |E(t1)!E(t2)| = |(E(t1) − E(t2)) ∪(E(t2) − E(t1))| is called the Robinson-Foulds (RF)distance (Robinson and Foulds, 1981) between the two

c⃝ Oxford University Press 2002 S285

Report multiple consensus trees, while attempting to minimize the amount

of information lost from the full distribution.

Page 8: Using networks to explore, quantify, and summarize phylogenetic tree space

Summarizing Tree Sets• Consensus trees

• Agreement subtrees

• Clustering

• Dimensionality Reduction

Motivation Our Approaches Applications Initial Results Software

478 SYSTEMATIC BIOLOGY VOL. 54

FIGURE 7. Comparison of trees generated from two differentBayesian MCMC analyses of the same data set. Both Bayesian anal-yses were conducted under the same conditions (see text). Each colorrepresents the trees from a different analysis. (a) Analysis based on un-weighted Robinson-Foulds distances; (b) analysis based on weightedRobinson-Foulds distances.

be a useful means for describing the Bayesian MCMCapproach in classes and workshops on phylogenetics(see http://lewis.eeb.uconn.edu/lewishome/software.html for another useful MCMC instruction tool).

FIGURE 8. Progress in a Bayesian MCMC analysis. The progress inthe search can be visualized in the Tree Set Visualization program, as ademonstration of how an MCMC analysis functions. In the visualiza-tion, the progress of the chain through tree-space moves from regions oflow optimality scores (blue) to regions of high optimality scores (red).

Another common problem in phylogenetics is the dis-covery of several distinct “tree islands” of equally opti-mal or near-optimal phylogenetic solutions for a givendataset (Maddison, 1991). In analyzing a particular dataset, one might discover that there are a large number ofsolutions that fit the data equally well. A consensus ofthese trees may show little or no resolution. However,an unresolved consensus tree does not necessarily indi-cate that all potential solutions fit the data equally well.Separate summaries of each of the tree islands is likely toshow a much higher degree of resolution, and the sepa-rate tree islands may represent alternative phylogeneticsolutions for the data set. Tree Set Visualization can beused to identify and analyze these tree islands, as shownin Figure 9.

Potential Limitations of MDS for Visualizing Tree-SpaceMultidimensional scaling based on RF distances is

clearly not the only way (and is not necessarily eventhe best way) to visualize and represent tree-space. Wehave found this approach to the problem to be useful forexploring large sets of phylogenetic trees, but we alsorecognize that the approach has some limitations. For in-stance, any reduction of high-dimensional space into twodimensions necessarily will result in some distortions. Asan example of distortion, consider the MDS visualizationof trees shown in Figure 10. In this case, a reference treeis shown in blue, and a series of trees that differ fromthe reference tree by one bipartition each (RF = 2) areshown in red. All of the trees are equally distant from thereference tree in tree-space, and in multiple dimensionswould form a “multidimensional sphere” around the

by guest on August 15, 2012

http://sysbio.oxfordjournals.org/D

ownloaded from

Page 9: Using networks to explore, quantify, and summarize phylogenetic tree space

Networks of Trees

Tree-to-Tree!Affinities (Similarities)

Motivation Our Approaches Applications Initial Results Software

Page 10: Using networks to explore, quantify, and summarize phylogenetic tree space

Networks of Trees

Tree-to-Tree!Affinities (Similarities)

High

Low

Motivation Our Approaches Applications Initial Results Software

Page 11: Using networks to explore, quantify, and summarize phylogenetic tree space

Networks of Bipartitions

OD | ABC

OA | BCD

OC | ABD

OD | ABC

AB | OCD

AC | OBD

OB | ACD AD | OBC

BC | OAD

BD | OAC

CD | OAB

Motivation Our Approaches Applications Initial Results Software

Page 12: Using networks to explore, quantify, and summarize phylogenetic tree space

Bipartition CovariancesA

B

C

D

E

G

F

H

0.51

A

C

B

D

H

G

F

E

0.49

A

B

C

D

E

G

F

H

0.26

A

B

C

D

H

G

F

E

0.25

A

C

B

D

E

G

F

H

0.25

A

C

B

D

H

G

F

E

0.24

Cov(Orange,Blue)1,000

= 0.25

Cov(Orange,Blue)1,000

= 0.00

0.51

0.51

Sampling 2

Sampling 1

Motivation Our Approaches Applications Initial Results Software

Page 13: Using networks to explore, quantify, and summarize phylogenetic tree space

Bipartition CovariancesA

B

C

D

E

G

F

H

0.51

A

C

B

D

H

G

F

E

0.49

A

B

C

D

E

G

F

H

0.26

A

B

C

D

H

G

F

E

0.25

A

C

B

D

E

G

F

H

0.25

A

C

B

D

H

G

F

E

0.24

Cov(Orange,Blue)1,000

= 0.25

Cov(Orange,Blue)1,000

= 0.00

0.51

0.51

Sampling 2

Sampling 1

Motivation Our Approaches Applications Initial Results Software

Page 14: Using networks to explore, quantify, and summarize phylogenetic tree space

Bipartition CovariancesA

B

C

D

E

G

F

H

0.51

A

C

B

D

H

G

F

E

0.49

A

B

C

D

E

G

F

H

0.26

A

B

C

D

H

G

F

E

0.25

A

C

B

D

E

G

F

H

0.25

A

C

B

D

H

G

F

E

0.24

Cov(Orange,Blue)1,000

= 0.25

Cov(Orange,Blue)1,000

= 0.00

0.51

0.51

Sampling 2

Sampling 1

Motivation Our Approaches Applications Initial Results Software

Page 15: Using networks to explore, quantify, and summarize phylogenetic tree space

Networks of Bipartitions

OD | ABC

OA | BCD

OC | ABD

OD | ABC

AB | OCD

AC | OBD

OB | ACD AD | OBC

BC | OAD

BD | OAC

CD | OAB

= + Cov

= - Cov Uniform Distribution

of Topologies

Motivation Our Approaches Applications Initial Results Software

Page 16: Using networks to explore, quantify, and summarize phylogenetic tree space

Networks of Bipartitions

OD | ABC

OA | BCD

OC | ABD

OD | ABC

AB | OCD

AC | OBD

OB | ACD AD | OBC

BC | OAD

BD | OAC

CD | OAB

= + Cov

= - Cov Two Equally Frequent

Topologies

Motivation Our Approaches Applications Initial Results Software

Page 17: Using networks to explore, quantify, and summarize phylogenetic tree space

Network Visualizations

Motivation Our Approaches Applications Initial Results Software

Page 18: Using networks to explore, quantify, and summarize phylogenetic tree space

Network Visualizations

Motivation Our Approaches Applications Initial Results Software

Gene 1

Gene 2Gene 3

Nodes ordered on each axis by frequency of topology

Page 19: Using networks to explore, quantify, and summarize phylogenetic tree space

Network Visualizations

Motivation Our Approaches Applications Initial Results Software

Gene 1

Gene 2Gene 3

Edges connect identical topologies

Page 20: Using networks to explore, quantify, and summarize phylogenetic tree space

Motivation Our Approaches Applications Initial Results Software

Assessing Model FitUsing parametric bootstrapping or posterior

prediction, we can compare network structures between observed and simulated datasets.

Empirical

OD | ABC

OA | BCD

OC | ABD

OD | ABC

AB | OCD

AC | OBD

OB | ACD AD | OBC

BC | OAD

BD | OAC

CD | OAB

= + Cov

= - Cov

Simulated

OD | ABC

OA | BCD

OC | ABD

OD | ABC

AB | OCD

AC | OBD

OB | ACD AD | OBC

BC | OAD

BD | OAC

CD | OAB

= + Cov

= - Cov

=

Page 21: Using networks to explore, quantify, and summarize phylogenetic tree space

Detecting Distinct Phylogenetic Signals

Tree-to-Tree!Affinities (Similarities)

High

Low

Motivation Our Approaches Applications Initial Results Software

Page 22: Using networks to explore, quantify, and summarize phylogenetic tree space

Tree-to-Tree!Affinities (Similarities)

High

Low

Motivation Our Approaches Applications Initial Results Software

Detecting Distinct Phylogenetic Signals

Page 23: Using networks to explore, quantify, and summarize phylogenetic tree space

Initial Results

Tree-to-Tree!Affinities (Similarities)

High

Low

Motivation Our Approaches Applications Software

Detecting Distinct Phylogenetic Signals

Page 24: Using networks to explore, quantify, and summarize phylogenetic tree space

OD | ABC

OA | BCD

OC | ABD

OD | ABC

AB | OCD

AC | OBD

OB | ACD AD | OBC

BC | OAD

BD | OAC

CD | OAB

= + Cov

= - Cov Two Equally Frequent

Topologies

Motivation Our Approaches Applications Initial Results Software

Detecting Distinct Phylogenetic Signals

Page 25: Using networks to explore, quantify, and summarize phylogenetic tree space

Two Equally Frequent

Topologies

OD | ABC

OA | BCD

OC | ABD

OD | ABC

AB | OCD

AC | OBD

OB | ACD AD | OBC

BC | OAD

BD | OAC

CD | OAB

= + Cov

= - Cov

Motivation Our Approaches Applications Initial Results Software

Detecting Distinct Phylogenetic Signals

Page 26: Using networks to explore, quantify, and summarize phylogenetic tree space

Network Visualizations

Motivation Our Approaches Applications Initial Results Software

Completely distinct signals in two genes

Gene 1

Gene 2

Community 1

Community 2

Page 27: Using networks to explore, quantify, and summarize phylogenetic tree space

Network Visualizations

Motivation Our Approaches Applications Initial Results Software

Partially overlapping signal

Gene 1

Gene 2

Community 1

Community 2

Page 28: Using networks to explore, quantify, and summarize phylogenetic tree space

Proof of Principle

Motivation Our Approaches Applications Initial Results Software

Topologies used for simulating two halves of an alignment.

Page 29: Using networks to explore, quantify, and summarize phylogenetic tree space

Proof of Principle

Motivation Our Approaches Applications Initial Results Software

Bootstrap

CombineSimulate

???Summarize

Page 30: Using networks to explore, quantify, and summarize phylogenetic tree space

Proof of Principle

Motivation Our Approaches Applications Initial Results Software

0.09

taxon_14

taxon_8

taxon_5

taxon_13

taxon_2

taxon_15

taxon_11

taxon_4

taxon_16

taxon_9taxon_10

taxon_12

taxon_1

taxon_3

taxon_6taxon_7

1

1

1

1

1

0.67

1

0.61

0.74

1

10.64

0.64

0.63

z

Majority-Rule Consensus Tree

Page 31: Using networks to explore, quantify, and summarize phylogenetic tree space

Proof of Principle

Motivation Our Approaches Applications Initial Results Software

17#18#

14#15#

16#

5#

3#

1#13#

8#

Community)A)Community)B)Free)Nodes)

10#

11#

7#

6#

4#

9#

12#

2#

Negative Covariances

Positive Covariances

Page 32: Using networks to explore, quantify, and summarize phylogenetic tree space

Networks Detect Strong Conflict

Motivation Our Approaches Applications Initial Results Software

0.08

taxon_1

taxon_14

taxon_12

taxon_8

taxon_15

taxon_2

taxon_4

taxon_10

taxon_5

taxon_9

taxon_6

taxon_16

taxon_3

taxon_7

taxon_13

taxon_11

1

1

1

1

1

1

0.72

1

1

1

1

1

1

1

1"

1"

2"

3"

7"

6"

8"

9"

10"

11"

12"

13"

4"

5"

36%$of$Trees$

0.1

taxon_2

taxon_5taxon_14

taxon_1

taxon_15

taxon_9

taxon_13

taxon_4taxon_3

taxon_11

taxon_16

taxon_12

taxon_7

taxon_10

taxon_8

taxon_6

1

1

1

10.89

1

1

11

1

1

1

1

1

14#

14#

2#

17#

7#

6#

15#9#

10#

11#

12#

18#

4#

16#

37%$of$Trees$

0.8

taxon_9

taxon_14

taxon_12

taxon_7taxon_6

taxon_10

taxon_8

taxon_11

taxon_5

taxon_15

taxon_13

taxon_3taxon_4

taxon_1taxon_2

taxon_16

1

1

1

1

1

1

1

1

1

1

1

1

1

0.93

14#

14#

1#

15#

7#2#

9#

10#

11#

12#

13#

6#

4#

5#

27%$of$Trees$

Page 33: Using networks to explore, quantify, and summarize phylogenetic tree space

TreeScaper

Wen Huang. Tuesday morning iEvoBio Lightning Talk.Motivation Our Approaches Applications Initial Results Software

Page 34: Using networks to explore, quantify, and summarize phylogenetic tree space

Web Interface (future)www.treescaperonline.org

TreeScaperOnline

TreeScaper OnlineInput !Create Networks !Community Detection !Report Network Stats !Visualizations

OD | ABC

OA | BCD

OC | ABD

OD | ABC

AB | OCD

AC | OBD

OB | ACD AD | OBC

BC | OAD

BD | OAC

CD | OAB

= + Cov

= - Cov

−160 −150 −140 −130 −120 −110 −100 −90 −80 −70540

550

560

570

580

590

600

610

620

630

Motivation Our Approaches Applications Initial Results Software

Gene 1

Gene 2Gene 3

Page 35: Using networks to explore, quantify, and summarize phylogenetic tree space

Acknowledgements• Computing support from FSU’s Research Computing

Center and HPC@LSU

• Financial support from NSF (DBI 1262571)