joe felsenstein department of genome sciences...

52
Reconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences and Department of Biology University of Washington, Seattle Reconstructing phylogenies: how? how well? why? – p.1/29

Upload: others

Post on 04-Apr-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Reconstructing phylogenies: how? how well? why?

Joe Felsenstein

Department of Genome Sciences and Department of Biology

University of Washington, Seattle

Reconstructing phylogenies: how? how well? why? – p.1/29

Page 2: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

A review that asks these questions

What are some of the strengths and weaknesses of different ways of

reconstructing evolutionary trees (phylogenies)?

How can we find out how accurate we may have been inreconstructing the phylogeny?

Why do we want to reconstruct it? What are phylogenies used for?

Reconstructing phylogenies: how? how well? why? – p.2/29

Page 3: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

A review that asks these questions

What are some of the strengths and weaknesses of different ways of

reconstructing evolutionary trees (phylogenies)?

How can we find out how accurate we may have been inreconstructing the phylogeny?

Why do we want to reconstruct it? What are phylogenies used for?

Reconstructing phylogenies: how? how well? why? – p.2/29

Page 4: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

A review that asks these questions

What are some of the strengths and weaknesses of different ways of

reconstructing evolutionary trees (phylogenies)?

How can we find out how accurate we may have been inreconstructing the phylogeny?

Why do we want to reconstruct it? What are phylogenies used for?

Reconstructing phylogenies: how? how well? why? – p.2/29

Page 5: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

What does “tree space” (with branch lengths) look like?

t1t2

t1t2

an example: three species with a clock

A B C

t 1

t 2

t 1

t 2

OK

not possible

trifurcation

etc.

when we consider all three possible topologies, the space looks like:

Reconstructing phylogenies: how? how well? why? – p.3/29

Page 6: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

For one tree topology

The space of trees varying all 2n − 3 branch lengths, each a nonegativenumber, defines an “orthant" (open corner) of a 2n − 3-dimensional realspace:

A

B

C

D

E

F

v1

v

vv

v

v

v23

v4

5

6

78

v9

wall wall

floor v9

Reconstructing phylogenies: how? how well? why? – p.4/29

Page 7: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Through the looking-glass

Shrinking one of the n − 1 interior branches to 0, we arrive at atrifurcation:

A

B

C

D

E

F

v1

v

vv

v

v

v23

v4

5

6

78

v9

A

B

C

E

F

v1

v

vv

vv2

3

D

v

v4

56

78

A

B

C

D

E

F

v1

v

v

vv

v

v23

v4

5

6

7

8

v9

A

B

C

D

E

F

v1

v

vv

v

v

v23

v4

56

78

v9

Here, as we pass “through the looking glass" we are also touch the space

for two other tree topologies, and we could decide to enter either.

Reconstructing phylogenies: how? how well? why? – p.5/29

Page 8: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

The graph of all trees of 5 speciesThe space of all these orthants, one for each topology, connecting ones

that share faces (looking glasses):

C DB EA

D BC EA

D BE CA

C ED AB D C

A EB

A CD EB

E BC DA B C

D EA

C BD EA

A BD EC

A BE CD

B CE DA

B DC EA E B

D CA

E CB DA

The Schoenberg graph (all 15 trees of size 5 connected by NNI’s)Reconstructing phylogenies: how? how well? why? – p.6/29

Page 9: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

There are very large numbers of trees

For 21 species, the number of possible unrooted tree topologies exceeds

Avogadro’s Number: it is

3 × 5 × 7 × 9 × 11 × 13 × 15 × 17 × 19×21 × 23 × 25 × 27 × 29 × 31 × 33 × 35 × 37

= 8, 200, 794, 532, 637, 891, 559, 375

... and that’s not even asking about how hard it is to optimize the 39

branch lengths for each of these trees.

What this goes with is that most methods of finding the best tree are

NP-hard, and not easy to approximate either.

Reconstructing phylogenies: how? how well? why? – p.7/29

Page 10: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Parsimony methods

Alpha Delta Gamma Beta Epsilon

1

23

4 45 56

SitesSpecies 1 2 3 4 5 6

Alpha T A G C A TBeta C A A G C TGamma T C G G C TDelta T C G C A AEpsilon C A A C A T

Reconstructing phylogenies: how? how well? why? – p.8/29

Page 11: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Advantages and disadvantages of parsimony methods

Disadvantage: not model-based so people think it makes noassumptions.

Advantage: reasonably fast, no search of branch lengths needed

and quick to compute the criterion.

Advantage: good statistical properties when amounts of change are

small.

Disadvantage: statistical misbehavior (inconsistency) when some

nearby branches on the tree are long (Long Branch Attraction).

Disadvantage: likely to make you think you have William of

Ockham’s endorsement.

Disadvantage: may lead to the delusion that you know exactly whathappened in evolution, in detail.

Reconstructing phylogenies: how? how well? why? – p.9/29

Page 12: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Advantages and disadvantages of parsimony methods

Disadvantage: not model-based so people think it makes noassumptions.

Advantage: reasonably fast, no search of branch lengths needed

and quick to compute the criterion.

Advantage: good statistical properties when amounts of change are

small.

Disadvantage: statistical misbehavior (inconsistency) when some

nearby branches on the tree are long (Long Branch Attraction).

Disadvantage: likely to make you think you have William of

Ockham’s endorsement.

Disadvantage: may lead to the delusion that you know exactly whathappened in evolution, in detail.

Reconstructing phylogenies: how? how well? why? – p.9/29

Page 13: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Advantages and disadvantages of parsimony methods

Disadvantage: not model-based so people think it makes noassumptions.

Advantage: reasonably fast, no search of branch lengths needed

and quick to compute the criterion.

Advantage: good statistical properties when amounts of change are

small.

Disadvantage: statistical misbehavior (inconsistency) when some

nearby branches on the tree are long (Long Branch Attraction).

Disadvantage: likely to make you think you have William of

Ockham’s endorsement.

Disadvantage: may lead to the delusion that you know exactly whathappened in evolution, in detail.

Reconstructing phylogenies: how? how well? why? – p.9/29

Page 14: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Advantages and disadvantages of parsimony methods

Disadvantage: not model-based so people think it makes noassumptions.

Advantage: reasonably fast, no search of branch lengths needed

and quick to compute the criterion.

Advantage: good statistical properties when amounts of change are

small.

Disadvantage: statistical misbehavior (inconsistency) when some

nearby branches on the tree are long (Long Branch Attraction).

Disadvantage: likely to make you think you have William of

Ockham’s endorsement.

Disadvantage: may lead to the delusion that you know exactly whathappened in evolution, in detail.

Reconstructing phylogenies: how? how well? why? – p.9/29

Page 15: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Advantages and disadvantages of parsimony methods

Disadvantage: not model-based so people think it makes noassumptions.

Advantage: reasonably fast, no search of branch lengths needed

and quick to compute the criterion.

Advantage: good statistical properties when amounts of change are

small.

Disadvantage: statistical misbehavior (inconsistency) when some

nearby branches on the tree are long (Long Branch Attraction).

Disadvantage: likely to make you think you have William of

Ockham’s endorsement.

Disadvantage: may lead to the delusion that you know exactly whathappened in evolution, in detail.

Reconstructing phylogenies: how? how well? why? – p.9/29

Page 16: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Advantages and disadvantages of parsimony methods

Disadvantage: not model-based so people think it makes noassumptions.

Advantage: reasonably fast, no search of branch lengths needed

and quick to compute the criterion.

Advantage: good statistical properties when amounts of change are

small.

Disadvantage: statistical misbehavior (inconsistency) when some

nearby branches on the tree are long (Long Branch Attraction).

Disadvantage: likely to make you think you have William of

Ockham’s endorsement.

Disadvantage: may lead to the delusion that you know exactly whathappened in evolution, in detail.

Reconstructing phylogenies: how? how well? why? – p.9/29

Page 17: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Distance matrix methods

A B C D E

A

B

C

D

E

0

0

0

0

0

0.20

0.24

0.20

0.24

0.19

0.19

0.17

0.17

0.16

0.16

0.24

0.24

0.15

0.15

0.25

0.25

0.10

0.10

0.24

0.24

ABCDE

CCTAACCTCTGACCC ...CGTAACCTCCGGCCC ...CGTAACCTCTGGCCC ...CGCAACCTCTGGCTC ...CCTAACCTCTGGCCC ...

The sequences: yield distances:

compare:alter tree until predictions matchobserved distances as closely as possible

A B C D E

A

B

C

D

E

0

0

0

0

0

0.23 0.16 0.20 0.17

0.23 0.17 0.24

0.11

0.21

0.23

0.16

0.20

0.17

0.23

0.17

0.24 0.11 0.21

0.10

0.07

0.05

0.08

0.030.06

0.05

A B

CD

E0.20

0.20

A suggested tree: predicts:

Reconstructing phylogenies: how? how well? why? – p.10/29

Page 18: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Advantages and disadvantages of distance methods

Advantage: model-based so assumptions are clearer.

Advantage: it’s geometry so mathematical scientists love it.

Advantage: often fast (especially Neighbor-Joining method), canhandle large numbers of sequences.

Disadvantage: not using data fully statistically efficiently.

Advantage: when tested by simulation, found to be surprisinglyefficient anyway.

Disadvantage: cannot easily propagate some information about

local features in the sequences from one distance calculation toanother.

Disadvantage: it’s geometry so mathematical scientists hang onto

it beyond the point of reason.

Reconstructing phylogenies: how? how well? why? – p.11/29

Page 19: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Advantages and disadvantages of distance methods

Advantage: model-based so assumptions are clearer.

Advantage: it’s geometry so mathematical scientists love it.

Advantage: often fast (especially Neighbor-Joining method), canhandle large numbers of sequences.

Disadvantage: not using data fully statistically efficiently.

Advantage: when tested by simulation, found to be surprisinglyefficient anyway.

Disadvantage: cannot easily propagate some information about

local features in the sequences from one distance calculation toanother.

Disadvantage: it’s geometry so mathematical scientists hang onto

it beyond the point of reason.

Reconstructing phylogenies: how? how well? why? – p.11/29

Page 20: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Advantages and disadvantages of distance methods

Advantage: model-based so assumptions are clearer.

Advantage: it’s geometry so mathematical scientists love it.

Advantage: often fast (especially Neighbor-Joining method), canhandle large numbers of sequences.

Disadvantage: not using data fully statistically efficiently.

Advantage: when tested by simulation, found to be surprisinglyefficient anyway.

Disadvantage: cannot easily propagate some information about

local features in the sequences from one distance calculation toanother.

Disadvantage: it’s geometry so mathematical scientists hang onto

it beyond the point of reason.

Reconstructing phylogenies: how? how well? why? – p.11/29

Page 21: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Advantages and disadvantages of distance methods

Advantage: model-based so assumptions are clearer.

Advantage: it’s geometry so mathematical scientists love it.

Advantage: often fast (especially Neighbor-Joining method), canhandle large numbers of sequences.

Disadvantage: not using data fully statistically efficiently.

Advantage: when tested by simulation, found to be surprisinglyefficient anyway.

Disadvantage: cannot easily propagate some information about

local features in the sequences from one distance calculation toanother.

Disadvantage: it’s geometry so mathematical scientists hang onto

it beyond the point of reason.

Reconstructing phylogenies: how? how well? why? – p.11/29

Page 22: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Advantages and disadvantages of distance methods

Advantage: model-based so assumptions are clearer.

Advantage: it’s geometry so mathematical scientists love it.

Advantage: often fast (especially Neighbor-Joining method), canhandle large numbers of sequences.

Disadvantage: not using data fully statistically efficiently.

Advantage: when tested by simulation, found to be surprisinglyefficient anyway.

Disadvantage: cannot easily propagate some information about

local features in the sequences from one distance calculation toanother.

Disadvantage: it’s geometry so mathematical scientists hang onto

it beyond the point of reason.

Reconstructing phylogenies: how? how well? why? – p.11/29

Page 23: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Advantages and disadvantages of distance methods

Advantage: model-based so assumptions are clearer.

Advantage: it’s geometry so mathematical scientists love it.

Advantage: often fast (especially Neighbor-Joining method), canhandle large numbers of sequences.

Disadvantage: not using data fully statistically efficiently.

Advantage: when tested by simulation, found to be surprisinglyefficient anyway.

Disadvantage: cannot easily propagate some information about

local features in the sequences from one distance calculation toanother.

Disadvantage: it’s geometry so mathematical scientists hang onto

it beyond the point of reason.

Reconstructing phylogenies: how? how well? why? – p.11/29

Page 24: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Advantages and disadvantages of distance methods

Advantage: model-based so assumptions are clearer.

Advantage: it’s geometry so mathematical scientists love it.

Advantage: often fast (especially Neighbor-Joining method), canhandle large numbers of sequences.

Disadvantage: not using data fully statistically efficiently.

Advantage: when tested by simulation, found to be surprisinglyefficient anyway.

Disadvantage: cannot easily propagate some information about

local features in the sequences from one distance calculation toanother.

Disadvantage: it’s geometry so mathematical scientists hang onto

it beyond the point of reason.

Reconstructing phylogenies: how? how well? why? – p.11/29

Page 25: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Maximum likelihood

A C C C G

xy

z

w

t1 t t

t t

t

t

23

4 5

6t7

8

t i are

"branch lengths",

X (rate time)

To compute the likelihood for one site, sum over all possible states(bases) at interior nodes:

L(i) =∑

x

y

z

w

Prob (w) Prob (x | w, t7)

× Prob (A | x, t1) Prob (C | x, t2) Prob (z | w, t8)

× Prob (C | z, t3) Prob (y | z, t6) Prob (C | y, t4) Prob (G | y, t5)

Reconstructing phylogenies: how? how well? why? – p.12/29

Page 26: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Advantages and disadvantages of likelihood

Advantage: uses a model, so assumptions are clear.

Advantage: fully statistically efficient.

Disadvantage: computationally slower.

Advantage: statistical testing by likelihood ratio tests available

Disadvantage: can’t use the LRT test to test tree topologies.

Reconstructing phylogenies: how? how well? why? – p.13/29

Page 27: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Advantages and disadvantages of likelihood

Advantage: uses a model, so assumptions are clear.

Advantage: fully statistically efficient.

Disadvantage: computationally slower.

Advantage: statistical testing by likelihood ratio tests available

Disadvantage: can’t use the LRT test to test tree topologies.

Reconstructing phylogenies: how? how well? why? – p.13/29

Page 28: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Advantages and disadvantages of likelihood

Advantage: uses a model, so assumptions are clear.

Advantage: fully statistically efficient.

Disadvantage: computationally slower.

Advantage: statistical testing by likelihood ratio tests available

Disadvantage: can’t use the LRT test to test tree topologies.

Reconstructing phylogenies: how? how well? why? – p.13/29

Page 29: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Advantages and disadvantages of likelihood

Advantage: uses a model, so assumptions are clear.

Advantage: fully statistically efficient.

Disadvantage: computationally slower.

Advantage: statistical testing by likelihood ratio tests available

Disadvantage: can’t use the LRT test to test tree topologies.

Reconstructing phylogenies: how? how well? why? – p.13/29

Page 30: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Advantages and disadvantages of likelihood

Advantage: uses a model, so assumptions are clear.

Advantage: fully statistically efficient.

Disadvantage: computationally slower.

Advantage: statistical testing by likelihood ratio tests available

Disadvantage: can’t use the LRT test to test tree topologies.

Reconstructing phylogenies: how? how well? why? – p.13/29

Page 31: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Bayesian inference methods

Basically uses the likelihood machinery, and adds priors on parameters

and on trees.

Implemented by Markov chain Monte Carlo methods to sample from theposterior on trees (or parameters, or both).

Very popular right now.

Advantage: interpretation is straightforward, once theassumptions are met.

Advantage: gives you what you want, the probability of the result.

Disadvantage: how long is long enough to run the MCMC?

Disadvantage: where do we get priors from, what effect do they

have?

Disadvantage: they keep chanting in unison “We are the

statisticians of Bayes – you will be assimilated.”

Reconstructing phylogenies: how? how well? why? – p.14/29

Page 32: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Bayesian inference methods

Basically uses the likelihood machinery, and adds priors on parameters

and on trees.

Implemented by Markov chain Monte Carlo methods to sample from theposterior on trees (or parameters, or both).

Very popular right now.

Advantage: interpretation is straightforward, once theassumptions are met.

Advantage: gives you what you want, the probability of the result.

Disadvantage: how long is long enough to run the MCMC?

Disadvantage: where do we get priors from, what effect do they

have?

Disadvantage: they keep chanting in unison “We are the

statisticians of Bayes – you will be assimilated.”

Reconstructing phylogenies: how? how well? why? – p.14/29

Page 33: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Bayesian inference methods

Basically uses the likelihood machinery, and adds priors on parameters

and on trees.

Implemented by Markov chain Monte Carlo methods to sample from theposterior on trees (or parameters, or both).

Very popular right now.

Advantage: interpretation is straightforward, once theassumptions are met.

Advantage: gives you what you want, the probability of the result.

Disadvantage: how long is long enough to run the MCMC?

Disadvantage: where do we get priors from, what effect do they

have?

Disadvantage: they keep chanting in unison “We are the

statisticians of Bayes – you will be assimilated.”

Reconstructing phylogenies: how? how well? why? – p.14/29

Page 34: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Bayesian inference methods

Basically uses the likelihood machinery, and adds priors on parameters

and on trees.

Implemented by Markov chain Monte Carlo methods to sample from theposterior on trees (or parameters, or both).

Very popular right now.

Advantage: interpretation is straightforward, once theassumptions are met.

Advantage: gives you what you want, the probability of the result.

Disadvantage: how long is long enough to run the MCMC?

Disadvantage: where do we get priors from, what effect do they

have?

Disadvantage: they keep chanting in unison “We are the

statisticians of Bayes – you will be assimilated.”

Reconstructing phylogenies: how? how well? why? – p.14/29

Page 35: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Bayesian inference methods

Basically uses the likelihood machinery, and adds priors on parameters

and on trees.

Implemented by Markov chain Monte Carlo methods to sample from theposterior on trees (or parameters, or both).

Very popular right now.

Advantage: interpretation is straightforward, once theassumptions are met.

Advantage: gives you what you want, the probability of the result.

Disadvantage: how long is long enough to run the MCMC?

Disadvantage: where do we get priors from, what effect do they

have?

Disadvantage: they keep chanting in unison “We are the

statisticians of Bayes – you will be assimilated.”

Reconstructing phylogenies: how? how well? why? – p.14/29

Page 36: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Aren’t these graphical models?

x1

x4

x3

x5

x6

x7

x8

x9

x0

x1

x4

x3

x5

x6

x7

x8

x9

x0

x1

x4

x3

x5

x6

x7

x8

x9

x0

x1

x4

x3

x5

x6

x7

x8

x9

x0

x1

x4

x3

x5

x6

x7

x8

x9

x0

v1

v4 v

3

v6

v5

v8

v9

v7

(You have to imaging it going back 500 layers or so). The problem is to

use the data, which is at the tips but not available for the interior nodes,to infer the topology and branch lengths of the tree that is shared by allsites.

Reconstructing phylogenies: how? how well? why? – p.15/29

Page 37: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Could we use graphical model machinery here?

Like Moliére’s character who is delighted to discover that he’s beenspeaking prose all his life, we found we had already been using the

relevant Graphical Model machinery since about 1973.

So alas there was nothing to gain.

The same thing is true for statistical genetics, where the graphical model

machinery reinvents the standard “peeling” algorithms for computing

likelihoods on pedigrees, in use since 1970.

Reconstructing phylogenies: how? how well? why? – p.16/29

Page 38: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Bootstrap sampling of phylogenies

OriginalData

sequences

sites

Reconstructing phylogenies: how? how well? why? – p.17/29

Page 39: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Draw columns randomly with replacement

OriginalData

sequences

sites

Bootstrapsample#1

Estimate of the tree

sample same numberof sites, with replacementsequences

sites

Reconstructing phylogenies: how? how well? why? – p.18/29

Page 40: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Make a tree from that resampled data set

OriginalData

sequences

sites

Bootstrapsample#1

Estimate of the tree

Bootstrap estimate ofthe tree, #1

sample same numberof sites, with replacementsequences

sites

Reconstructing phylogenies: how? how well? why? – p.19/29

Page 41: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Draw another bootstrap sample

OriginalData

sequences

sites

Bootstrapsample#1

Bootstrapsample

#2

Estimate of the tree

Bootstrap estimate ofthe tree, #1

sample same numberof sites, with replacement

sample same numberof sites, with replacement

sequences

sequences

sites

sites

(and so on)

Reconstructing phylogenies: how? how well? why? – p.20/29

Page 42: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

... and get a tree for it too. And so on.

OriginalData

sequences

sites

Bootstrapsample#1

Bootstrapsample

#2

Estimate of the tree

Bootstrap estimate ofthe tree, #1

Bootstrap estimate of

sample same numberof sites, with replacement

sample same numberof sites, with replacement

sequences

sequences

sites

sites

(and so on)the tree, #2

Reconstructing phylogenies: how? how well? why? – p.21/29

Page 43: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Summarizing the cloud of trees by support for branches

Bovine

Mouse

Squir Monk

Chimp

Human

Gorilla

Orang

Gibbon

Rhesus Mac

Jpn Macaq

Crab−E.Mac

BarbMacaq

Tarsier

Lemur

80

72

74

9999

100

77

42

35

49

84

Reconstructing phylogenies: how? how well? why? – p.22/29

Page 44: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Some alternatives to bootstrapping

Parametric bootstrapping – same, but simulate data sets from our

best estimate of the tree instead of sampling sites.

Bayesian inference of course gets statistical support informationfrom the posterior.

The Kishino-Hasegawa-Templeton test (KHT test) which comparesprespecified trees to each other by paired sites tests.

Reconstructing phylogenies: how? how well? why? – p.23/29

Page 45: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Some alternatives to bootstrapping

Parametric bootstrapping – same, but simulate data sets from our

best estimate of the tree instead of sampling sites.

Bayesian inference of course gets statistical support informationfrom the posterior.

The Kishino-Hasegawa-Templeton test (KHT test) which comparesprespecified trees to each other by paired sites tests.

Reconstructing phylogenies: how? how well? why? – p.23/29

Page 46: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Some alternatives to bootstrapping

Parametric bootstrapping – same, but simulate data sets from our

best estimate of the tree instead of sampling sites.

Bayesian inference of course gets statistical support informationfrom the posterior.

The Kishino-Hasegawa-Templeton test (KHT test) which comparesprespecified trees to each other by paired sites tests.

Reconstructing phylogenies: how? how well? why? – p.23/29

Page 47: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Why want to know the tree?

It affects all parts of the genomes – it is the essential part of propagating

information about the evolution of one part of the genome to inquiriesabout another part.

The standard method for finding functional regions of the genome isnow using “PhyloHMMs” which use Hidden Markov Model machinery

together with phylogenies to find regions that have unusually low rates

of evolution.

Reconstructing phylogenies: how? how well? why? – p.24/29

Page 48: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Another kind of tree: the coalescentCoalescent trees are trees of ancestry of copies of a single gene locus

within a species. They are weakly inferrable as most have only a few sites

(SNPs) varying among individuals.

Since each coalescent tree applies to a very short region of genome,maybe as little as one gene, there is less interest in the tree.

But they do illuminate the values of parameters such as population

size, migration rates, recombination rates etc. This allows us to

accumulate information across different loci (genes).

To don this we have to sum over our uncertainty about the tree byusing MCMC methods, accumulating the information (as log

likelihood or using Bayesian machinery) to make inferences aboutthe parameters.

This is the interface between within-species population geneticsand between-species work on phylogenies.

It is also the statistical foundation of inferences frommitochondrial genealogies (“mitochondrial Eve”) and Y

chromosome genealogies, and of the samples from the rest of the

genome that are now being added to this.Reconstructing phylogenies: how? how well? why? – p.25/29

Page 49: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

A coalescent

Time

Reconstructing phylogenies: how? how well? why? – p.26/29

Page 50: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Yet another kind of tree: trees of gene families

Gene duplications in evolution create new genes. Both the new gene andthe original one then evolve.

Frog Human Monkey Squirrel

gene duplication

a a ab b b

species

boundary

tree of genes

Some forks are gene duplications, leading to subtrees that are all

supposed to have the same phylogeny as they are in the same set ofspecies. Example: Hemoglobin proteins.

Reconstructing phylogenies: how? how well? why? – p.27/29

Page 51: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

Yet another kind of tree: trees of gene families

Gene duplications in evolution create new genes. Both the new gene andthe original one then evolve.

Frog Human Monkey Squirrel

gene duplication

a a ab b b

species

boundary

tree of genes

FrogHuman Monkey Squirrel Human Monkey Squirrel

a a a b b b

These twotrees should beidentical

Some forks are gene duplications, leading to subtrees that are all

supposed to have the same phylogeny as they are in the same set ofspecies. Example: Hemoglobin proteins.

Reconstructing phylogenies: how? how well? why? – p.28/29

Page 52: Joe Felsenstein Department of Genome Sciences …evolution.gs.washington.edu/microsoft07.pdfReconstructing phylogenies: how? how well? why? Joe Felsenstein Department of Genome Sciences

References

Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland,Massachusetts. [Book you and all your friends must rush out andbuy]

Semple, C. and M. Steel. 2003. Phylogenetics. Oxford Lecture Series inMathematics and Its Applications, 24. Oxford University Press. [Morerigorous mathematical treatment]

Yang, Z. 2007. Computational Molecular Evolution. Oxford Series in Ecology

and Evolution. Oxford University Press, Oxford. [Careful survey ofmolecular phylogeny methods, from a leader]

For a list of 348 phylogeny programs, many available free, see

http://evolution.gs.washington.edu/phylip/software.html

Reconstructing phylogenies: how? how well? why? – p.29/29