the number of fully resolved trees taxa (with a root): n...
Number of taxa    Number of unrooted bifurcating trees
3                 1
4                 3
5                 15
10                2 x 10^6
20                2 x 10^20
50                2 x 10^74
100               2 x 10^182
1,000             2 x 10^2,860
10,000            2 x 10^38,658
• The number of fully resolved trees for n taxa (with a root): $(2n-3)!! = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
• The number of fully resolved trees for n taxa (without a root): $(2n-5)!! = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
Chapter 2 in: Felsenstein J. (2004). Inferring phylogenies. Sinauer Associates, Sunderland. 664 pp.
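The counts in the table can be reproduced with a few lines of code. The sketch below assumes the standard double-factorial counts for fully resolved trees, (2n-5)!! for unrooted and (2n-3)!! for rooted trees; the function names are illustrative and not taken from the slides.

```python
def odd_double_factorial(m: int) -> int:
    """Product 1 * 3 * 5 * ... * m for odd m (returns 1 when m <= 1)."""
    result = 1
    for k in range(3, m + 1, 2):
        result *= k
    return result


def num_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n >= 3 taxa: (2n-5)!!"""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return odd_double_factorial(2 * n - 5)


def num_rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees for n >= 2 taxa: (2n-3)!!"""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return odd_double_factorial(2 * n - 3)


if __name__ == "__main__":
    for n in (3, 4, 5, 10, 20):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

Rooting an unrooted tree is equivalent to attaching one more taxon to one of its branches, so the rooted count for n taxa equals the unrooted count for n + 1 taxa.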
Tree “building” methods
• Purely algorithmic methods (e.g., neighbor-joining)
• Methods based on an optimality criterion
– Maximum likelihood
– Parsimony
– Bayesian phylogenetics
Basic concepts: tree searches based on optimality criteria
• Exact algorithms
  – Exhaustive searches: all possible trees are evaluated
  – Branch-and-bound (Hendy & Penny 1982): does not require evaluating all possible trees, but guarantees an optimal solution
• Heuristic algorithms: do not guarantee finding an optimal solution
  – Initial tree building (Wagner addition)
    • Single starting points
      – Stepwise addition of taxa
      – Star decomposition methods
    • Multiple starting points: random addition sequence (RAS)
  – Branch swapping
    • Subtree pruning and regrafting (SPR)
    • Tree bisection and reconnection (TBR)
  – Other methods for refining trees
    • Sectorial searches (Goloboff 1999)
    • Tree-drifting (Goloboff 1999)
    • Tree-fusing (Goloboff 1999)
    • Ratchet (Nixon 1999)
Islands of trees
Maddison DR (1991) The discovery and importance of multiple islands of most-parsimonious trees. Systematic Zoology 40:315-328
Exhaustive enumeration of all possible (unrooted) trees for 5 taxa (15 trees)
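A minimal sketch of how such an enumeration can be generated: start from the single tree for three taxa and attach each further taxon to every branch of every tree built so far. The nested-tuple representation, the use of the first taxon as a fixed outgroup, and the function names are assumptions made for illustration.

```python
def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one branch of `tree`.

    A tree is a taxon name (str) or a pair (left, right) of subtrees.
    """
    # Attach on the branch leading to this node.
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        # Attach somewhere inside the left or the right subtree.
        for new_left in insertions(left, leaf):
            yield (new_left, right)
        for new_right in insertions(right, leaf):
            yield (left, new_right)


def all_unrooted_trees(taxa):
    """Enumerate all unrooted bifurcating topologies for at least 3 taxa.

    The first taxon is kept as a fixed outgroup, so each unrooted tree is
    written as (outgroup, rooted_subtree_of_the_remaining_taxa).
    """
    outgroup, first, second, *rest = taxa
    subtrees = [(first, second)]
    for leaf in rest:
        subtrees = [t for tree in subtrees for t in insertions(tree, leaf)]
    return [(outgroup, t) for t in subtrees]


if __name__ == "__main__":
    for tree in all_unrooted_trees(["A", "B", "C", "D", "E"]):
        print(tree)
```

Each tree with k taxa has 2k - 3 branches, so the list grows by factors of 3, 5, 7, and so on, which is where the double-factorial counts come from.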
Branch-and-bound algorithm
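The idea can be sketched as follows, assuming Fitch parsimony (unordered, equally weighted characters) as the optimality criterion. Because adding a taxon can never shorten a tree, any partial tree already longer than the best complete tree found so far can be discarded together with all its descendants. This is a bare-bones illustration with hypothetical names; it omits the refinements of Hendy & Penny's method, such as a good initial bound.

```python
def fitch_length(tree, seqs):
    """Parsimony length of a nested-tuple tree under Fitch optimization."""
    n_sites = len(next(iter(seqs.values())))
    length = 0
    for site in range(n_sites):
        def state_set(node):
            nonlocal length
            if isinstance(node, str):              # leaf: observed state
                return {seqs[node][site]}
            left, right = state_set(node[0]), state_set(node[1])
            if left & right:                       # children agree: no change
                return left & right
            length += 1                            # children disagree: one step
            return left | right
        state_set(tree)
    return length


def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one branch of `tree`."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        for new_left in insertions(tree[0], leaf):
            yield (new_left, tree[1])
        for new_right in insertions(tree[1], leaf):
            yield (tree[0], new_right)


def branch_and_bound(taxa, seqs):
    """Return (optimal length, list of all optimal trees)."""
    outgroup, first, second, *rest = taxa
    best = {"length": float("inf"), "trees": []}

    def extend(subtree, remaining):
        length = fitch_length((outgroup, subtree), seqs)
        if length > best["length"]:
            return                                 # bound: prune this branch
        if not remaining:                          # complete tree
            if length < best["length"]:
                best["length"], best["trees"] = length, []
            best["trees"].append((outgroup, subtree))
            return
        for candidate in insertions(subtree, remaining[0]):
            extend(candidate, remaining[1:])

    extend((first, second), rest)
    return best["length"], best["trees"]
```

Ties are kept rather than pruned, so the search returns every optimal tree, not just the first one encountered.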
Heuristic searches
• Wagner tree (or other starting point)
• Branch swapping
– Subtree pruning and regrafting (SPR)
– Tree bisection and reconnection (TBR)
– Tree fusing (genetic algorithms)
– Tree drifting (simulated annealing)
– Sectorial searches (divide and conquer techniques)
– Ratcheting
Wagner algorithm
Random seed derived from system time: -985792416
default outgroup: Lineus_bilineatus
addition sequence 'as is'
outgroup is Lineus_bilineatus, add node Neocrania_anomala ->41
add node 3/128 Phascolion_strombi discrepancy: real cost 75 calculated cost 66 ->75 1 tree
add node 4/128 Chaetoderma_nitidulum ->110 1 tree
add node 5/128 Loxosomella_murmanica ->154 1 tree
add node 6/128 Lepidopleurus_cajetanus discrepancy: real cost 182 calculated cost 187 ->182 1 tree
add node 7/128 Leptochiton_asellus discrepancy: real cost 209 calculated cost 210 ->209 1 tree
add node 8/128 Callochiton_septemvalvis ->259discrepancy: real cost 252 calculated cost 255 ->252 1 tree
add node 9/128 Chaetopleura_apiculata discrepancy: real cost 271 calculated cost 274 ->271 1 tree
add node 10/128 Callistochiton_antiquus ->287 1 tree
add node 11/128 Lorica_volvox ->314->309 1 tree
add node 12/128 Chiton_olivaceus discrepancy: real cost 317 calculated cost 318 ->317 1 tree
add node 13/128 Mopalia_muscosa ->355->350->339 1 tree
add node 14/128 Tonicella_lineata ->373->368->367discrepancy: real cost 351 calculated cost 352 ->351 1 tree
add node 15/128 Acanthochitona_crinita ->381->373->366 2 trees
add node 16/128 Cryptochiton_stelleri ->399->392discrepancy: real cost 374 calculated cost 373
->374discrepancy: real cost 374 calculated cost 373 2 trees
add node 17/128 Rhabdus_rectius ->425->421discrepancy: real cost 418 calculated cost 420 ->418 1 tree
add node 18/128 Dentalium_inaequicostatum ->458discrepancy: real cost 447 calculated cost 448
->447->445discrepancy: real cost 443 calculated cost 437 ->443->442discrepancy: real cost 439 calculated cost 440 ->439 1 tree
add node 19/128 Antalis_entalis ->496->492discrepancy: real cost 491 calculated cost 492 ->491->483 1 tree
add node 20/128 Alcadia_dyssonia discrepancy: real cost 528 calculated cost 529
->528discrepancy: real cost 521 calculated cost 524
->521discrepancy: real cost 512 calculated cost 515
->512discrepancy: real cost 509 calculated cost 512 ->509 1 tree
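The log above shows taxa being attached one at a time, with the cost dropping as better attachment points are found. A minimal sketch of that greedy stepwise addition follows, assuming Fitch parsimony with equal weights; the cost regime and internals of the program that produced the log are not known here, and all names are hypothetical. A random addition sequence replicate simply shuffles the order before building.

```python
import random


def fitch_length(tree, seqs):
    """Parsimony length of a nested-tuple tree under Fitch optimization."""
    n_sites = len(next(iter(seqs.values())))
    length = 0
    for site in range(n_sites):
        def state_set(node):
            nonlocal length
            if isinstance(node, str):
                return {seqs[node][site]}
            left, right = state_set(node[0]), state_set(node[1])
            if left & right:
                return left & right
            length += 1
            return left | right
        state_set(tree)
    return length


def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one branch of `tree`."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        for new_left in insertions(tree[0], leaf):
            yield (new_left, tree[1])
        for new_right in insertions(tree[1], leaf):
            yield (tree[0], new_right)


def stepwise_addition(order, seqs):
    """Greedy build: attach each taxon where the partial tree is shortest."""
    outgroup, first, second, *rest = order
    subtree = (first, second)
    for leaf in rest:
        subtree = min(insertions(subtree, leaf),
                      key=lambda t: fitch_length((outgroup, t), seqs))
    tree = (outgroup, subtree)
    return tree, fitch_length(tree, seqs)


def random_addition_replicates(seqs, outgroup, n_reps, seed=0):
    """Repeat the greedy build with shuffled addition sequences; keep the best."""
    rng = random.Random(seed)
    others = [t for t in seqs if t != outgroup]
    best = None
    for _ in range(n_reps):
        rng.shuffle(others)
        tree, length = stepwise_addition([outgroup] + others, seqs)
        if best is None or length < best[1]:
            best = (tree, length)
    return best
```

The greedy choice at each step is never revisited, which is why the resulting tree is only a starting point for branch swapping and not guaranteed to be optimal.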
SPR branch swapping
TBR branch swapping
Allen, B. L. and M. Steel. 2001. Subtree transfer operations and their induced metrics on evolutionary trees. Annals of Combinatorics 5: 1-15.
Search strategies
• Number of starting trees
• Number of trees to swap per replicate
• Number of trees to swap in total
• Algorithms to use
• “Stopping rules”
• Parallelism in systematics
Traditional searches
• Random addition sequence followed by some sort of tree refining technique (e.g., SPR and/or TBR)
• Cannot deal with problems of composite optima, as in large data sets (> 150 taxa)
“Large phylogeny estimation is a combinatorial optimization problem that no
future computer will ever be able to solve exactly in practical computing
time. The difficulty of the problem is amplified by the need to use complex
evolutionary models and large taxon samplings. Hence, many heuristic
approaches have been developed, with varying degrees of success.”
Lemmon & Milinkovitch 2002
Tree estimation is an NP-hard problem
Heuristic implementations
• Initial tree (random, Wagner, etc.)
• Some tree refining technique (SPR, TBR, ratchet, tree-fusing, sectorial searches, tree-drifting, DCM…)
• Repeat the process multiple times (hopefully converging towards a solution)
Applications of parallel computing
• Multiple starting points (replicates)
– Sequential
– Parallel
• Refining techniques
– Sequential
– Parallel
• Spawning the jobs
• Fault tolerance
• Achieving linearity
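Independent replicates are the easiest level to parallelize, since they share no state. The sketch below distributes replicates over a worker pool, resubmits a replicate that fails (a very simple form of fault tolerance), and keeps the best result. `search` is a hypothetical user-supplied function that takes a seed and returns a (score, tree) pair with lower scores being better; nothing here reflects a particular phylogenetics program.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_replicates(search, seeds, executor, retries=1):
    """Run `search(seed)` for every seed on `executor`; return the best outcome.

    A replicate that raises an exception is resubmitted up to `retries` times.
    Returns None if no replicate succeeds.
    """
    seeds = list(seeds)
    attempts = {seed: 0 for seed in seeds}
    pending = {executor.submit(search, seed): seed for seed in seeds}
    best = None
    while pending:
        # Iterate over a snapshot; resubmitted jobs are picked up next round.
        for future in as_completed(list(pending)):
            seed = pending.pop(future)
            try:
                outcome = future.result()
            except Exception:
                if attempts[seed] < retries:
                    attempts[seed] += 1
                    pending[executor.submit(search, seed)] = seed
                continue
            if best is None or outcome[0] < best[0]:
                best = outcome
    return best


if __name__ == "__main__":
    def toy_search(seed):
        # Stand-in for a real replicate (random addition sequence + swapping).
        return (abs(seed - 3), f"tree from seed {seed}")

    with ThreadPoolExecutor(max_workers=4) as pool:
        print(run_replicates(toy_search, range(8), pool))
```

A thread pool keeps the example simple. For CPU-bound searches written in pure Python, `concurrent.futures.ProcessPoolExecutor` would be the natural substitute, provided `search` is a module-level (picklable) function.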
Levels of parallelism in phylogenetic reconstruction
Technique            Efficiency
Tree building
• “Sneaker”          100% linearity
• Multibuild         100% linearity
• Parallel build     communication tradeoff
P, NP, and NPC
• P: easy problems for which a polynomial-time algorithmic solution exists, i.e., an algorithm that can solve the problem in time O(n^k) for some constant k.
• NP (nondeterministic polynomial): problems for which a proposed solution can be verified in polynomial time, whether or not a solution can be found that quickly. Every problem in P is also in NP.
• NP-complete (NPC): the hardest problems in NP. They exist in a nether world where no polynomial-time solution is known, but there is no proof that none exists either. These problems are frequently combinatorially explosive, with solution spaces increasing at a factorial pace. Examples include (decision versions of) traveling salesman, circuit design, scheduling, and phylogenetic tree search and alignment.
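The gap between checking and finding a solution can be illustrated with the traveling salesman problem. In the sketch below, verifying that a proposed tour fits within a budget takes time polynomial in the number of cities, while the naive exact solver examines all (n-1)! tours. The distance matrix and function names are made up for the example.

```python
from itertools import permutations


def tour_length(tour, dist):
    """Total length of a closed tour that visits the cities in the given order."""
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))


def verify(tour, dist, budget):
    """Polynomial-time check: is `tour` a valid tour of length <= budget?"""
    visits_each_city_once = sorted(tour) == list(range(len(dist)))
    return visits_each_city_once and tour_length(tour, dist) <= budget


def solve_exhaustively(dist):
    """Exact solution by trying all (n-1)! tours that start at city 0."""
    n = len(dist)
    candidates = ((0,) + rest for rest in permutations(range(1, n)))
    best = min(candidates, key=lambda t: tour_length(t, dist))
    return list(best), tour_length(best, dist)


if __name__ == "__main__":
    dist = [
        [0, 1, 4, 3],
        [1, 0, 2, 5],
        [4, 2, 0, 1],
        [3, 5, 1, 0],
    ]
    tour, length = solve_exhaustively(dist)
    print(tour, length, verify(tour, dist, budget=length))
```

With 4 cities the solver checks 6 tours; with 20 cities it would need 19! (about 1.2 x 10^17), the same kind of growth seen in the tree counts.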