Gene Order Phylogeny Tandy Warnow The Program in Evolutionary Dynamics, Harvard University The University of Texas at Austin

Upload: frederica-lamb

Post on 31-Dec-2015

Gene Order Phylogeny

Tandy Warnow

The Program in Evolutionary Dynamics, Harvard University

The University of Texas at Austin

• Cyber-Infrastructure for Phylogenetic RESearch (http://www.phylo.org)

• Main research: Large-scale phylogenetics, reticulate evolution, gene order phylogeny, complex simulations, and databases

• Funded by $11.6M ITR Grant from NSF

• 40 biologists, computer scientists, and mathematicians collaborating on the project

CIPRes Members

University of New Mexico: Bernard Moret, David Bader, Tiffani Williams

UCSD/SDSC: Fran Berman, Alex Borchers, David Stockwell, Phil Bourne, John Huelsenbeck, Dana Jermanis, Mark Miller, Michael Alfaro, Tracy Zhao

University of Connecticut: Paul O Lewis

University of Pennsylvania: Junhyong Kim, Sampath Kannan

UT Austin: Tandy Warnow, David M. Hillis, Warren Hunt, Robert Jansen, Randy Linder, Lauren Meyers, Daniel Miranker, Usman Roshan, Luay Nakhleh

University of Arizona: David R. Maddison

University of British Columbia: Wayne Maddison

North Carolina State University: Spencer Muse

American Museum of Natural History: Ward C. Wheeler

UC Berkeley: Satish Rao, Joseph M. Hellerstein, Richard M Karp, Brent Mishler, Elchanan Mossel, Eugene W. Myers, Christos M. Papadimitriou, Stuart J. Russell

SUNY Buffalo: William Piel

Florida State University: David L. Swofford, Mark Holder

Yale: Michael Donoghue, Paul Turner

Aventis Pharmaceuticals: Lisa Vawter

Limitations of DNA phylogenetics

• Deep evolutionary histories may not be recoverable from DNA sequence phylogeny due to lack of specificity -- too much noise (homoplasy) and insufficient sequence length

• The systematics community has looked to “rare genomic changes” for better sources of phylogenetic signal

Whole-Genome Phylogenetics

[Figure: leaf genomes A through F, and a tree relating them through internal (ancestral) nodes W, X, Y, and Z]

Genomes As Signed Permutations

1 -5 3 4 -2 -6

or

6 2 -4 -3 5 -1

etc.
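The two permutations on this slide describe the same genome read from opposite strands: reversing the order and flipping every sign turns one into the other. The sketch below illustrates this, assuming a circular chromosome (so rotations are also equivalent). The function names and the choice of "smallest tuple" as a canonical form are illustrative assumptions, not something the slides prescribe.

```python
def reverse_complement(genome):
    """Read the chromosome from the other strand: reverse the order, flip each sign."""
    return [-g for g in reversed(genome)]


def rotations(genome):
    """All starting points of a circular genome (assumes circularity)."""
    return [genome[i:] + genome[:i] for i in range(len(genome))]


def canonical(genome):
    """Hypothetical canonical form: the smallest tuple among all rotations of both strands."""
    genome = list(genome)
    candidates = rotations(genome) + rotations(reverse_complement(genome))
    return min(tuple(c) for c in candidates)


a = [1, -5, 3, 4, -2, -6]
b = [6, 2, -4, -3, 5, -1]
print(reverse_complement(a) == b)    # the slide's second form is the reverse complement of the first
print(canonical(a) == canonical(b))  # both forms map to the same representative
```

Comparing canonical forms is one simple way to test whether two signed permutations denote the same circular genome.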

Genomes Evolve by Rearrangements

Starting genome: 1 2 3 4 5 6 7 8 9 10

• Inversion (Reversal): 1 2 3 -8 -7 -6 -5 -4 9 10

• Transposition: 1 2 3 9 4 5 6 7 8 10

• Inverted Transposition: 1 2 3 9 -8 -7 -6 -5 -4 10
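The three rearrangement events on this slide can be sketched as list operations on a signed permutation. This is a minimal illustration that treats the genome as a linear list with 0-based, half-open index ranges; the function names and index conventions are assumptions made for the example.

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign of each gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Move the segment genome[i:j] so it sits just before original index k (requires k >= j)."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Like transposition, but the moved segment is also reversed and sign-flipped."""
    segment = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + segment + genome[k:]


start = list(range(1, 11))  # 1 2 3 4 5 6 7 8 9 10
print(inversion(start, 3, 8))                  # [1, 2, 3, -8, -7, -6, -5, -4, 9, 10]
print(transposition(start, 3, 8, 9))           # [1, 2, 3, 9, 4, 5, 6, 7, 8, 10]
print(inverted_transposition(start, 3, 8, 9))  # [1, 2, 3, 9, -8, -7, -6, -5, -4, 10]
```

Each call acts on the segment 4 5 6 7 8 and reproduces the corresponding permutation shown on the slide.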

Other types of events

• Duplications, Insertions, and Deletions (changes gene content)

• Fissions and Fusions (for genomes with more than one chromosome)

These events change the number of copies of each gene in each genome (“unequal gene content”)

Genome Rearrangement Has A Huge State Space

• DNA sequences: 4 states per site

• Signed circular genomes with n genes: $2^{n-1}(n-1)!$ states, 1 site

• Circular genomes (1 site)

– with 37 genes (mitochondria): $2.56 \times 10^{52}$ states

– with 120 genes (chloroplasts): $3.70 \times 10^{232}$ states
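The state counts on this slide follow from the formula for signed circular genomes, 2^(n-1) (n-1)!. A short check with exact integer arithmetic is sketched below; the function name is an assumption made for the example.

```python
from math import factorial


def signed_circular_genomes(n):
    """Number of distinct signed circular gene orders on n genes: 2^(n-1) * (n-1)!."""
    return 2 ** (n - 1) * factorial(n - 1)


for n, label in [(37, "mitochondria"), (120, "chloroplasts")]:
    count = signed_circular_genomes(n)
    # Format the exact integer in scientific notation with two decimals.
    print(f"{n} genes ({label}): {count:.2e} states")
```

For n = 37 and n = 120 this should print roughly 2.56e+52 and 3.70e+232, matching the values on the slide.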

Why use gene orders?

• “Rare genomic changes”: huge state space and relative infrequency of events (compared to site substitutions) could make the inference of deep evolution easier, or more accurate.

• Our research shows this is true, but accurate analysis of gene order data is computationally very intensive!

Phylogeny reconstruction from gene orders

• Distance-based reconstruction: estimate pairwise distances, and apply methods like Neighbor-Joining or Weighbor

• “Maximum Parsimony”: find tree with the minimum length (inversions, transpositions, or other edit distances)

• Maximum Likelihood: find tree and parameters of evolution most likely to generate the observed data

Maximum Parsimony on Rearranged Genomes (MPRG)

• The leaves are rearranged genomes.

• Find the tree that minimizes the total number of rearrangement events (e.g., inversion phylogeny minimizes the number of inversions)

[Figure: a tree on leaf genomes A, B, C, and D with inferred internal genomes E and F; edge lengths 3, 6, 2, 3, and 4; total length = 18]

Optimization problems for gene order phylogeny

• Breakpoint phylogeny: find the phylogeny which minimizes the total number of breakpoints (NP-hard, even to find the median of three genomes)

• Inversion phylogeny: find the phylogeny which minimizes the sum of inversion distances on the edges (NP-hard, even to find the median of three genomes)
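As a rough illustration of the breakpoint objective, the sketch below counts breakpoints between two signed circular genomes with equal gene content and sums them over the edges of a tree whose internal genomes are already given. It only scores a candidate solution; the NP-hard part, choosing the internal genomes and the tree, is not attempted. The function names and the toy genomes are assumptions made for the example.

```python
def adjacencies(genome):
    """Strand-independent adjacencies of a signed circular genome."""
    adj = set()
    n = len(genome)
    for idx in range(n):
        a, b = genome[idx], genome[(idx + 1) % n]
        adj.add((a, b))
        adj.add((-b, -a))  # the same adjacency read from the other strand
    return adj


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that do not occur in g2 (equal gene content assumed)."""
    adj2 = adjacencies(g2)
    n = len(g1)
    return sum((g1[i], g1[(i + 1) % n]) not in adj2 for i in range(n))


def tree_length(edges, genomes):
    """Total breakpoint length of a tree, given a genome for every node."""
    return sum(breakpoint_distance(genomes[u], genomes[v]) for u, v in edges)


# Toy median-of-three instance: leaves A, B, C joined to a candidate median M.
genomes = {
    "A": [1, 2, 3, 4, 5],
    "B": [1, 2, -4, -3, 5],
    "C": [1, 3, 2, 4, 5],
    "M": [1, 2, 3, 4, 5],  # hypothetical candidate, not claimed to be optimal
}
edges = [("A", "M"), ("B", "M"), ("C", "M")]
print(tree_length(edges, genomes))
```

A search procedure would call a scoring function like this repeatedly while varying the candidate median or the tree.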

Inversion phylogenies

[Figure: search landscape plotting tree length over the space of phylogenetic trees, with a local optimum and the global optimum marked]

• When the data are close to saturated, even the best distance-based analyses are insufficiently accurate. In these cases, our initial investigations suggest that the inversion phylogeny approach may be superior.

• Problem: finding the best trees is enormously hard, since even the “point estimation” problem is hard (worse than estimating branch lengths in ML).

Observations

• For equal gene content, heuristics for the inversion phylogeny problem are extremely accurate, even under model conditions in which transpositions are dominant.

• For unequal gene content, the parsimony style problems are too computationally intense -- but NJ (neighbor joining) with a new distance estimator (Moret et al. 2004) works extremely well.

Software

• BPAnalysis (Sankoff): open source, restricted to the breakpoint phylogeny reconstruction

• GRAPPA (Moret et al.): open source, restricted to single chromosome genomes, but can handle both equal and unequal gene content

• MGR (Pevzner et al.): multiple chromosome, limited to equal gene content, performs well if the dataset is small (less than 10 genomes)

• Bayesian analysis by Bret Larget (not yet released).

[Figure: phylogenetic tree on 13 taxa: Tobacco, Platycodon, Cyananthus, Asyneuma, Tiodanus, Legousia, Merciera, Wahlenbergia, Symphyandra, Adenophora, Trachelium, Campanula, Codonopsis]

The strict consensus of 24 trees, each with inversion length of 64.

Finished within 40 minutes on a laptop using GRAPPA version 1.8

GRAPPA (Genome Rearrangement Analysis under Parsimony and other Phylogenetic Algorithms)

http://www.cs.unm.edu/~moret/GRAPPA/

• Heuristics for maximum parsimony style problems for equal gene content

• Fast polynomial time distance-based methods

• Contributors: U. New Mexico, U. Texas at Austin, Università di Bologna, Italy

• Freely available in source code at this site.

• Project leader: Bernard Moret (UNM)

Speeding up MP and ML: DCM3

Tandy Warnow

Radcliffe Institute

The University of Texas at Austin

Reconstructing the “Tree” of Life

Handling large datasets: millions of species

Main research objectives

• Determine the best current methods available for MP and ML, and then improve upon them

• Focus on performance within one day, one week, or one month, on large real datasets (1K to 20K sequences for MP)

• Final objective is hundreds of thousands (or millions) of sequences.

Initial results

• Very large datasets are hard for both MP and ML, no matter what software is used

• Suboptimal solutions to MP yield reasonable estimates of the optimal MP trees, but only if their scores are within 0.01% of the optimal MP score

• Improving upon techniques for searching treespace will yield improvements for both MP and ML

Datasets

• 1322 lsu rRNA of all organisms
• 2000 Eukaryotic rRNA
• 2594 rbcL DNA
• 4583 Actinobacteria 16s rRNA
• 6590 ssu rRNA of all Eukaryotes
• 7180 three-domain rRNA
• 7322 Firmicutes bacteria 16s rRNA
• 8506 three-domain+2org rRNA
• 11361 ssu rRNA of all Bacteria
• 13921 Proteobacteria 16s rRNA

Obtained from various researchers and online databases

Problems with current techniques for MP

[Figure: chart of TNT results on datasets 1 to 10. x-axis: dataset number; y-axis: average MP score above optimal at 24 hours, shown as a percentage of the optimal (0 to 0.1)]

Average MP scores above “optimal” of best methods at 24 hours across 10 datasets

Best current techniques fail to reach 0.01% of optimal at the end of 24 hours, on large datasets

Problems with current techniques for MP

[Figure: Performance of TNT with time. x-axis: hours (0 to 24); y-axis: average MP score above optimal, shown as a percentage of the optimal (0 to 0.2)]

The best current method (default TNT) fails to reach acceptable levels of accuracy (0.01% of “optimal”) within 24 hours on many large datasets -- evidence suggests that this level will not be reached for weeks or months (or more) of further analysis.

Observations

• The best methods cannot get acceptably good solutions within 24 hours on most of these large datasets.

• Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions.

• Apparent convergence can be misleading.

“Boosting” MP heuristics

• DCMs “boost” the performance of phylogeny reconstruction methods.

[Diagram: a DCM applied to a base method M yields the boosted method DCM-M]

DCM3 technique for speeding up MP searches

Iterative-DCM3

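[Diagram: Iterative-DCM3 loop: tree T is passed to DCM3 with the base method, producing tree T’, which feeds the next iteration]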

New DCMs

• DCM3

1. Compute subproblems using DCM3 decomposition

2. Apply base method to each subproblem to yield subtrees

3. Merge subtrees using the Strict Consensus Merger technique

4. Randomly refine to make it binary

• Recursive-DCM3

• Iterative DCM3

1. Compute a DCM3 tree

2. Perform local search and go to step 1

• Recursive-Iterative DCM3
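Of the DCM3 steps listed on this slide, the final one (randomly refining a merged tree so that it becomes binary) is simple enough to sketch on its own. The example below assumes a rooted tree encoded as nested tuples with string leaves, and resolves each multifurcation by repeatedly joining two randomly chosen children. The representation and function names are illustrative assumptions, not the authors' implementation, and the decomposition and Strict Consensus Merger steps are not shown.

```python
import random


def randomly_refine(tree, rng):
    """Return a binary version of a rooted tree by randomly resolving multifurcations."""
    if isinstance(tree, str):  # a leaf
        return tree
    children = [randomly_refine(child, rng) for child in tree]
    while len(children) > 2:
        # Pick two distinct children and replace them with a new cherry.
        i, j = sorted(rng.sample(range(len(children)), 2))
        right = children.pop(j)  # pop the larger index first so i stays valid
        left = children.pop(i)
        children.append((left, right))
    return tuple(children)


def leaves(tree):
    """Leaf labels of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def is_binary(tree):
    """True if every internal node has exactly two children."""
    if isinstance(tree, str):
        return True
    return len(tree) == 2 and all(is_binary(child) for child in tree)


unresolved = ("A", "B", ("C", "D", "E"), "F")  # hypothetical merged tree with two polytomies
resolved = randomly_refine(unresolved, random.Random(0))
print(resolved, is_binary(resolved))
```

Every refinement keeps the leaf set and all groupings already present in the input tree; only the unresolved nodes are split, which is why different random seeds can give different binary trees.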

Performance Study

• How well do these “boosted” versions of the best MP heuristics perform, compared to the best MP heuristics?

• We examine performance with respect to “optimal” MP scores (best found so far, using any method) for a number of very large datasets, over 24 hours.

• The benchmark MP heuristic is the default TNT.
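The performance measure used in this study, the MP score above "optimal" as a percentage of the optimal, can be written out as below, together with the 0.01% acceptance threshold quoted in the slides. The function names and the example scores are made up for illustration.

```python
def percent_above_optimal(score, best_known):
    """MP score excess over the best known score, as a percentage of the best known score."""
    return 100.0 * (score - best_known) / best_known


def is_acceptable(score, best_known, threshold_percent=0.01):
    """True if the score is within the threshold (0.01% in the slides) of the best known score."""
    return percent_above_optimal(score, best_known) <= threshold_percent


best_known = 100_000  # hypothetical best MP score found by any method
for score in (100_008, 100_050):  # hypothetical scores from two heuristics
    print(score, percent_above_optimal(score, best_known), is_acceptable(score, best_known))
```

Because "optimal" here means the best score found so far by any method, the percentages are relative to a moving reference and can only grow if a better tree is found later.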

Rec-I-DCM3 significantly improves performance

Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset

[Figure: line plot over 24 hours comparing the current best techniques with the DCM-boosted version of the best techniques. x-axis: hours (0 to 24); y-axis: average MP score above optimal, shown as a percentage of the optimal (0 to 0.2)]

Rec-I-DCM3(TNT) vs. TNT (Comparison of scores at 24 hours)

Base method is the default TNT technique, the current best method for MP. Rec-I-DCM3 significantly improves upon the unboosted TNT by returning trees which are at most 0.01% above optimal on most datasets.

[Figure: chart comparing TNT and Rec-I-DCM3 on datasets 1 to 10. x-axis: dataset number; y-axis: average MP score above optimal at 24 hours, shown as a percentage of the optimal (0 to 0.1)]

Summary

• Rec-I-DCM3 is a powerful technique for escaping local optima, and “boosts” the performance of the best heuristics for solving MP

• The improvement increases with the difficulty of the dataset - Rec-I-DCM3(TNT) is 50 times faster than TNT on our hardest datasets, but we expect even bigger speedups in our next version

• DCMs also boost the performance of Maximum Likelihood heuristics (not shown)

Acknowledgements

• Collaborators: Bernard Moret (UNM), Usman Roshan (UT-Austin), and Tiffani Williams (UNM)

• Funding: NSF, The David and Lucile Packard Foundation, The Radcliffe Institute for Advanced Study, The Institute for Cellular and Molecular Biology at UT-Austin, and The Program in Evolutionary Dynamics at Harvard University

• Software will be part of the CIPRES Project’s first distribution - see http://www.phylo.org

• Cyber-Infrastructure for Phylogenetic RESearch (http://www.phylo.org)

• Main research: Large-scale phylogenetics, reticulate evolution, gene order phylogeny, and databases

• Funded by $11.6M ITR Grant from NSF

• 40 biologists, computer scientists, and mathematicians collaborating on the project