Quest for the best tree - Finding optimal trees, Jarno Tuimala, 2015-04-20, grey slides from McInerney


Quest for the best tree - Finding optimal trees

Jarno Tuimala

2015-04-20

Grey slides from McInerney

BACKGROUND

General idea

• Find an optimal tree using some optimality criterion
  – minimum evolution
  – parsimony
  – maximum likelihood
  – Bayesian methods

• It is an NP-hard problem
  – No polynomial-time algorithm for finding the exact solution is known

How many trees?

Taxa   Rooted trees             Unrooted trees
1      1                        1
2      1                        1
3      3                        1
4      15                       3
5      105                      15
6      945                      105
7      10395                    945
8      135135                   10395
9      2027025                  135135
10     34459425                 2027025
20     8.20079453263789e+21     2.216430954767e+20
30     4.9517976900802e+38      8.68736436856175e+36
40     1.00984736473787e+57     1.31149008407515e+55
50     2.75292135328357e+76     2.8380632508078e+74
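The counts in the table follow the standard double-factorial formulas for fully resolved (binary) trees: (2n - 3)!! rooted and (2n - 5)!! unrooted trees for n taxa. The slides do not state the formulas, so the function names below are only illustrative. A minimal sketch:

```python
def double_factorial(n):
    """Product n * (n - 2) * (n - 4) * ... down to 1; taken as 1 for n <= 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def rooted_trees(n_taxa):
    """Number of rooted binary trees: (2n - 3)!!"""
    return double_factorial(2 * n_taxa - 3)


def unrooted_trees(n_taxa):
    """Number of unrooted binary trees: (2n - 5)!!"""
    return double_factorial(2 * n_taxa - 5)


if __name__ == "__main__":
    for n in (4, 10, 20, 50):
        print(n, rooted_trees(n), unrooted_trees(n))
```

Note that the unrooted count for n + 1 taxa equals the rooted count for n taxa, which is why the two columns of the table are shifted copies of each other.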

Possibilities

• Exact solutions:
  – Go through all possible trees (exhaustive search)
  – Go through all trees using some logic to bound the search (branch and bound)
• Heuristic solutions:
  – A number of different algorithms, such as NNI, SPR and TBR

Computational sciences

• Phylogenetics is very much a computational science.
  – Computational science (also scientific computing or scientific computation) is concerned with constructing mathematical models and quantitative analysis techniques and using computers to analyze and solve scientific problems. [Wikipedia]
• Often the implementation (software) and the method itself can't be distinguished.
  – That's because, if you develop a superior method but don't make an implementation, it's seldom taken into use
  – If the implementation is superior, even suboptimal methods might gain wide adoption in a field
• For example, the first tree building method was proposed by Wagner in 1961 and first implemented by Farris in 1970.

Practical issues

• Analysis is usually done using some software package on a computer.

• First, an initial tree is created fast, possibly using Wagner's method:
  – Random addition sequence
  – Sequential addition / jumble (Phylip)
  – Stepwise addition / replicates (PAUP)
  – Build (POY)
• This initial tree is then rearranged in order to find the shortest, i.e. most parsimonious, tree (MPT).
  – Exact solutions
  – Heuristics!

COMPUTATIONAL METHODS

Development of parsimony method

• Hill climbing methods (heuristics)
  – NNI: Robinson 1971, Moore 1973
  – Branch and bound (an exact method): Hendy, 1982
  – SPR: Swofford 1987, 1993
  – TBR: Maddison 1991
  – Ratchet: Nixon 1999
  – Tree drifting (TD), tree fusing (TF), sectorial searches (SS): Goloboff 1999

Tree space may be populated by local minima and islands of optimal trees

[Figure: tree length plotted over tree space, showing the global minimum and several local minima. Random addition sequence replicates (RAS or jumble) followed by branch swapping end in SUCCESS or FAILURE.]

SMALL NUMBER OF TAXA

Finding optimal trees - exact solutions

• Exact solutions can only be used for small numbers of taxa

• Exhaustive search examines all possible trees

• Typically used for problems with fewer than 10 taxa

Finding optimal trees - exhaustive search

[Figure: exhaustive search by stepwise insertion of taxa]

1. Starting tree, any 3 taxa (A, B, C)
2. Add fourth taxon (D) in each of three possible positions -> three trees (2a, 2b, 2c)
3. Add fifth taxon (E) in each of the five possible positions on each of the three trees -> 15 trees, and so on ....
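The insertion scheme above can be sketched in a few lines. The representation is an assumption made for illustration: an unrooted tree is written as a nested tuple rooted on the first taxon, so every node of the nested part corresponds to one branch of the unrooted tree. The function names are hypothetical.

```python
def insertions(tree, taxon):
    """Yield every tree made by attaching taxon to one branch of tree."""
    yield (tree, taxon)  # attach on the branch leading to this subtree
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)


def all_trees(taxa):
    """Enumerate all unrooted binary trees for the given taxa (three or more)."""
    # With three taxa there is only one unrooted tree.
    subtrees = [(taxa[1], taxa[2])]
    for taxon in taxa[3:]:
        # Each existing tree gives one new tree per branch.
        subtrees = [new for old in subtrees for new in insertions(old, taxon)]
    # Re-attach the first taxon, which serves as the point of reference.
    return [(taxa[0], subtree) for subtree in subtrees]


if __name__ == "__main__":
    for tree in all_trees(["A", "B", "C", "D"]):
        print(tree)
    print(len(all_trees(["A", "B", "C", "D", "E"])), "trees for five taxa")
```

An exhaustive search would score each of these trees with the chosen optimality criterion and keep the best ones.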

Finding optimal trees - exact solutions

• Branch and bound saves time by discarding families of trees during tree construction that cannot be shorter than the shortest tree found so far

• Can be enhanced by specifying an initial upper bound for tree length

• Typically used only for problems with fewer than 18 taxa

Finding optimal trees - branch and bound

[Figure: branch-and-bound search tree. The three-taxon starting tree A1 (A, B, C) gives trees B1-B3 when taxon D is added; adding taxon E to each of these gives trees C1.1-C1.5, C2.1-C2.5 and C3.1-C3.5.]
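A minimal sketch of branch and bound under parsimony, assuming unordered characters scored with the Fitch algorithm and trees written as nested tuples rooted on the first taxon. The function names and the `data` layout (taxon name mapped to an aligned sequence string) are assumptions for illustration, not the implementation of any particular package. The bound works because adding a taxon can never shorten a tree, so a partial tree that is already longer than the best complete tree can be discarded together with all of its descendants.

```python
def insertions(tree, taxon):
    """Yield every tree made by attaching taxon to one branch of tree."""
    yield (tree, taxon)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)


def fitch(tree, column):
    """Fitch pass for one character: return (state set, number of steps)."""
    if not isinstance(tree, tuple):
        return {column[tree]}, 0
    left_states, left_steps = fitch(tree[0], column)
    right_states, right_steps = fitch(tree[1], column)
    shared = left_states & right_states
    if shared:
        return shared, left_steps + right_steps
    return left_states | right_states, left_steps + right_steps + 1


def tree_length(tree, data):
    """Parsimony length: Fitch steps summed over all alignment columns."""
    n_sites = len(next(iter(data.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in data.items()})[1]
        for i in range(n_sites)
    )


def branch_and_bound(data, upper_bound=float("inf")):
    """Return (length, trees) of all shortest trees for three or more taxa."""
    taxa = sorted(data)
    best = {"length": upper_bound, "trees": []}

    def extend(subtree, next_index):
        tree = (taxa[0], subtree)
        length = tree_length(tree, data)
        if length > best["length"]:
            return  # bound: no descendant of this partial tree can be shorter
        if next_index == len(taxa):
            if length < best["length"]:
                best["length"], best["trees"] = length, []
            best["trees"].append(tree)
            return
        for bigger in insertions(subtree, taxa[next_index]):
            extend(bigger, next_index + 1)

    extend((taxa[1], taxa[2]), 3)
    return best["length"], best["trees"]


if __name__ == "__main__":
    # Hypothetical toy alignment.
    toy = {"A": "AAGG", "B": "AAGC", "C": "ACGC", "D": "CCTC", "E": "CCTG"}
    print(branch_and_bound(toy))
```

Passing an `upper_bound` taken from a quick heuristic search lets the pruning start earlier, which is the enhancement mentioned on the slide.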

MODERATE NUMBER OF TAXA

Finding optimal trees - heuristics

• The number of possible trees grows faster than exponentially with the number of taxa, making exhaustive searches impractical for many data sets (finding the most parsimonious tree is an NP-hard problem)

• Heuristic methods are used to search tree space for most parsimonious trees by building or selecting an initial tree and swapping branches to search for better ones

• The trees found are not guaranteed to be the most parsimonious - they are best guesses

Finding optimal trees - heuristics

• Stepwise addition
  – Asis - the order in the data matrix
  – Closest - starts with the shortest 3-taxon tree and adds taxa in the order that produces the least increase in tree length (greedy heuristic)
  – Simple - the first taxon in the matrix is taken as a reference; taxa are added to it in order of decreasing similarity to the reference
  – Random - taxa are added in a random sequence; many different sequences can be used
• Recommend random with as many (e.g. 10-100) addition sequences as practical
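A minimal sketch of stepwise addition with random addition sequences, assuming Fitch parsimony on unordered characters and nested-tuple trees rooted on the first taxon of the addition order. Each taxon is placed greedily on the branch that gives the shortest tree at that point. The names and the `data` layout (taxon name mapped to an aligned sequence string) are hypothetical, and no branch swapping is done afterwards.

```python
import random


def insertions(tree, taxon):
    """Yield every tree made by attaching taxon to one branch of tree."""
    yield (tree, taxon)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)


def fitch(tree, column):
    """Fitch pass for one character: return (state set, number of steps)."""
    if not isinstance(tree, tuple):
        return {column[tree]}, 0
    left_states, left_steps = fitch(tree[0], column)
    right_states, right_steps = fitch(tree[1], column)
    shared = left_states & right_states
    if shared:
        return shared, left_steps + right_steps
    return left_states | right_states, left_steps + right_steps + 1


def tree_length(tree, data):
    """Parsimony length: Fitch steps summed over all alignment columns."""
    n_sites = len(next(iter(data.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in data.items()})[1]
        for i in range(n_sites)
    )


def stepwise_addition(data, order):
    """Add taxa in the given order, each on its currently best branch."""
    first, second, third = order[:3]
    subtree = (second, third)
    for taxon in order[3:]:
        subtree = min(
            insertions(subtree, taxon),
            key=lambda candidate: tree_length((first, candidate), data),
        )
    return (first, subtree)


def random_addition_replicates(data, n_replicates=10, seed=0):
    """Run several random addition sequences; return (length, tree) of the best."""
    rng = random.Random(seed)
    best = None
    for _replicate in range(n_replicates):
        order = list(data)
        rng.shuffle(order)
        tree = stepwise_addition(data, order)
        length = tree_length(tree, data)
        if best is None or length < best[0]:
            best = (length, tree)
    return best
```

Calling `stepwise_addition(data, list(data))` corresponds to the "asis" option; shuffling the order corresponds to "random". Because the placement is greedy, different orders can end in different trees, which is why many replicates are recommended.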

Finding most parsimonious trees - heuristics

• Branch swapping:
  – Nearest neighbor interchange (NNI)
  – Subtree pruning and regrafting (SPR)
  – Tree bisection and reconnection (TBR)
  – Ratchet
  – Tree fusing
  – Tree drifting
  – Sectorial searches

Finding optimal trees - heuristics

• Nearest neighbor interchange (NNI)

[Figure: NNI illustrated on a seven-taxon tree (A-G)]
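In an NNI, the four subtrees around one internal branch are rearranged by exchanging a subtree on one side with a subtree on the other, which gives two alternative trees per internal branch. A minimal sketch, assuming an unrooted tree written as `(outgroup, rest)` with `rest` a nested tuple; the function names are hypothetical:

```python
def rooted_nni(tree):
    """Yield all trees one NNI away, for a rooted nested-tuple tree."""
    if not isinstance(tree, tuple):
        return
    left, right = tree
    if isinstance(left, tuple):
        # Internal branch between this node and its left child.
        a, b = left
        yield ((a, right), b)
        yield ((b, right), a)
    if isinstance(right, tuple):
        # Internal branch between this node and its right child.
        c, d = right
        yield ((left, c), d)
        yield ((left, d), c)
    # NNIs around branches deeper in the tree.
    for new_left in rooted_nni(left):
        yield (new_left, right)
    for new_right in rooted_nni(right):
        yield (left, new_right)


def unrooted_nni(tree):
    """Yield all NNI neighbours of an unrooted tree written as (outgroup, rest)."""
    outgroup, rest = tree
    for neighbour in rooted_nni(rest):
        yield (outgroup, neighbour)


if __name__ == "__main__":
    start = ("A", (("B", "C"), ("D", "E")))
    for neighbour in unrooted_nni(start):
        print(neighbour)
```

A tree with n taxa has n - 3 internal branches and therefore 2(n - 3) NNI neighbours, which is a much smaller neighbourhood than SPR or TBR explore.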

Finding optimal trees - heuristics

• Subtree pruning and regrafting (SPR)

[Figure: SPR illustrated on a seven-taxon tree (A-G): a subtree is pruned and regrafted elsewhere on the tree]

Finding optimal trees - heuristics

• Tree bisection and reconnection (TBR)

[Figure: TBR illustrated on a seven-taxon tree (A-G): the tree is bisected and the two parts are reconnected]

LARGE NUMBER OF TAXA - COMPLICATED SEARCHES

Escape from a local minimum

[Figure: tree length landscape with random addition sequence replicates (RAS or jumble), labelled SUCCESS or FAILURE]

Ratchet

• Try to escape from a local optimum.
• Needs only standard software.

1. Generate an optimal tree (RAS + TBR)
2. Perturb the dataset by changing the weights of a set of characters / sites
3. Optimize the tree using the perturbed data
4. Then plug the original data back in, and continue optimizing this / these new trees.

• Steps 1-4 are typically run several times
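The ratchet loop can be sketched as follows. The actual tree search is left as a caller-supplied function: `search(tree, weights)` is a hypothetical callback that is assumed to run branch swapping (e.g. TBR) under the given character weights and return `(tree, length)`. The fraction of reweighted characters and the weight factor are illustrative defaults, not values taken from the slides.

```python
import random


def perturbed_weights(n_characters, fraction=0.2, factor=2, rng=random):
    """Give a random subset of characters a higher weight; the rest keep weight 1."""
    weights = [1] * n_characters
    n_changed = max(1, round(fraction * n_characters))
    for index in rng.sample(range(n_characters), n_changed):
        weights[index] = factor
    return weights


def ratchet(start_tree, n_characters, search, iterations=10, seed=0):
    """Alternate searches on perturbed and original weights; return the best tree."""
    rng = random.Random(seed)
    original = [1] * n_characters
    tree, length = search(start_tree, original)                # step 1
    best_tree, best_length = tree, length
    for _iteration in range(iterations):
        weights = perturbed_weights(n_characters, rng=rng)     # step 2
        tree, _perturbed_length = search(tree, weights)        # step 3
        tree, length = search(tree, original)                  # step 4
        if length < best_length:
            best_tree, best_length = tree, length
    return best_tree, best_length
```

Only lengths computed with the original weights are compared, since lengths under perturbed weights are not on the same scale.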

Tree fusing

• Needs to have some trees in memory, typically from RAS+TBR searches
• Resembles genetic algorithms: swap branches between two trees

1. Pick two trees
2. Exchange one compatible branch between the trees, and run an SPR search
3. Repeat from step 1 several times
4. Calculate the length of all trees, and pick the shortest one

Tree drifting

• Similar to simulated annealing

• While rearranging the tree, even suboptimal rearrangements (ones that make the tree longer) can be accepted, although with a small probability.
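One common way to implement such an acceptance step is the Metropolis rule used in simulated annealing. This is an assumption made for illustration: the acceptance function used in published tree drifting implementations differs in detail, and the `temperature` parameter below is hypothetical.

```python
import math
import random


def accept_rearrangement(old_length, new_length, temperature, rng=random):
    """Accept equal or shorter trees always, longer trees with a small probability."""
    if new_length <= old_length:
        return True
    # The probability shrinks as the new tree gets longer or the temperature drops.
    probability = math.exp((old_length - new_length) / temperature)
    return rng.random() < probability


if __name__ == "__main__":
    rng = random.Random(0)
    accepted = sum(accept_rearrangement(100, 102, 1.0, rng) for _ in range(1000))
    print(accepted, "of 1000 rearrangements that add two steps were accepted")
```

Lowering the temperature towards zero turns the procedure back into ordinary hill climbing, where only improvements are accepted.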

Sectorial searches

• Divide-and-conquer algorithm
  – Sectorial searches
  – Disc-covering methods

1. Select a smaller data set from the tree, typically 35-55 taxa.
2. Run a few RAS+TBR searches for this subset, and put the subtree back in its correct place.
3. Rearrange the whole tree using TBR

• Repeat the whole cycle a few dozen times.

PRACTICAL CONSIDERATIONS

Principles

• Parsimony
  – Maximize the number of RAS, because that makes the search through tree space more thorough.
  – Start with the less exhaustive swapping algorithms (NNI/SPR), and finish with the more exhaustive one (TBR).

• Likelihood
  – As above, but additionally enable branch length optimization (in PAUP), or use software that does that automatically (e.g., RAxML, PHYML).

What is the most efficient way to find the shortest/most likely tree?

• Example (with just 1 RAS):
  – Make 1 RAS (Wagner tree)
  – Swap with SPR until completion (or terminate after a certain time is exceeded), and retain the 10 best trees
  – Take these 10 best SPR trees, and swap with TBR
    • Only keep the 10 best trees after each TBR swap and continue from them
  – Finally, retain the 10 best trees, and examine them more closely

• This allows more initial RAS builds, and hence a more thorough search of the tree space *with the same computational time*.

Practical considerations

• Small dataset (<50 taxa)
  – Make 100 RAS
  – Rearrange with (SPR + ) TBR

• Big dataset (>50-100 taxa)
  – Make at least 100 RAS
  – Rearrange using TBR
    • Save one tree per RAS replicate
  – Use tree drifting and sectorial searches to further optimize these 100 trees

Software

Software RAS NNI SPR TBR LSR Ratchet GA TF TD SS

PAUP x x x x (x)

TNT x x x x x x x

POY x x x x x x x

RAxML x x (x) x

Metapiga x (?) (?) x x

Phylip x x x (x)

PHYML x x x

Missing data

• Missing data is ignored in tree building but can lead to alternative equally parsimonious optimizations in the absence of homoplasy

[Figure: tree of taxa A-E with character states 1 ? ? 0 0; a single origin (0 => 1) can be placed on any one of 3 branches]

• Abundant missing data can lead to multiple equally parsimonious trees.

• This can be a serious problem with morphological data but is unlikely to arise with molecular data unless analyses are of incomplete data

Consensus trees

Multiple optimal trees

• Many methods can yield multiple equally optimal trees

• We can further select among these trees with additional criteria, but

• Typically, relationships common to all the optimal trees are summarised with consensus trees

Consensus methods

• A consensus tree is a summary of the agreement among a set of fundamental trees

• There are many consensus methods that differ in:
  1. the kind of agreement
  2. the level of agreement

• Consensus methods can be used with multiple trees from a single analysis or from multiple analyses

Strict consensus methods

[Figure: two fundamental trees of taxa A-G and their strict component consensus tree]

Majority rule consensus

[Figure: three fundamental trees of taxa A-G and their majority-rule component consensus tree. Numbers (100, 66) indicate the frequency of clades in the fundamental trees.]
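Both component consensus methods reduce to counting how often each clade occurs in the fundamental trees: the strict consensus keeps clades found in every tree, and the majority-rule consensus keeps clades found in more than half of them. A minimal sketch, assuming rooted trees written as nested tuples; the three example trees are hypothetical and are not the trees shown in the figure.

```python
from collections import Counter


def clades(tree):
    """Return the set of clades (frozensets of leaf names) in a rooted tree."""
    found = set()

    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)
        return members

    walk(tree)
    return found


def clade_counts(trees):
    """Count in how many trees each clade occurs."""
    return Counter(clade for tree in trees for clade in clades(tree))


def strict_consensus(trees):
    """Clades present in every fundamental tree."""
    counts = clade_counts(trees)
    return {clade for clade, count in counts.items() if count == len(trees)}


def majority_rule(trees):
    """Clades present in more than half of the trees, with their frequencies."""
    counts = clade_counts(trees)
    return {
        clade: count / len(trees)
        for clade, count in counts.items()
        if count / len(trees) > 0.5
    }


if __name__ == "__main__":
    example = [
        ((("A", "B"), "C"), ("D", "E")),
        ((("A", "C"), "B"), ("D", "E")),
        (("A", "B"), ("C", ("D", "E"))),
    ]
    print(strict_consensus(example))
    print(majority_rule(example))
```

Clades kept by the majority rule are always mutually compatible, so they can be assembled into a single tree; clades that fail the threshold collapse into polytomies.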

Consensus methods

[Figure: three fundamental trees of Ochromonas, Symbiodinium, Prorocentrum, Loxodes, Tetrahymena, Spirostomum, Euplotes, Tracheloraphis and Gruberia, with their majority-rule and strict (component) consensus trees. Clade frequencies (100, 66) are shown.]