molecular phylogeny biology 224 instructor: tom peavy nov 4, 9 & 16

90
Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16 <Images adapted from Bioinformatics and Functional Genomics by Jonathan Pevsner>

Upload: tamsyn-thompson

Post on 05-Jan-2016

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Molecular Phylogeny

Biology 224Instructor: Tom Peavy

Nov 4, 9 & 16

<Images adapted from Bioinformatics and Functional Genomics by Jonathan Pevsner>

Page 2: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Charles Darwin’s theory of evolution.

--struggle for existence induces a natural selection.

--Offspring are dissimilar from their parents(that is, variability exists), and individuals that are morefit for a given environment are selected for.

--over long periods of time, species evolve.

--Groups of organisms change over time so that descendants differ structurally and functionally from their ancestors.

Introduction

Page 3: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

The basic processes of evolution are[1] mutation, [2] genetic recombination[3] chromosomal organization (and its variation);[4] natural selection [5] reproductive isolation, which constrains the effects of selection on populations

Page 4: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

At the molecular level, evolution is a process ofmutation with selection.

Molecular evolution is the study of changes in genesand proteins throughout different branches of the tree of life.

Phylogeny is the inference of evolutionary relationships.Traditionally, phylogeny relied on the comparisonof morphological features between organisms. Today,molecular sequence data are also used for phylogeneticanalyses.

Page 5: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Phylogeny can answer questions such as:

Goals of molecular phylogeny

• How many genes are related to my favorite gene?

• Was Darwin correct that humans are closest to chimps and gorillas?

• How related are whales, dolphins & porpoises to cows?

• Where and when did HIV originate?

• What is the history of life on earth?

Page 6: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

By the 1950s, it became clear that amino acid substitutions occur nonrandomly

e.g. most amino acid changes in the insulin A chain are restricted to a disulfide loop region.Such differences are called “neutral” changes

rate of nucleotide (and of amino acid) substitution is about six-to ten-fold higher in the C peptide, relative to the A and Bchains.

Historical background: insulin

Page 7: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Mature insulin consists of an A chain and B chainheterodimer connected by disulphide bridges

The signal peptide and C peptide are cleaved,and their sequences display fewerfunctional constraints.

Page 8: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Number of nucleotide substitutions/site/year for insulin

0.1 x 10-9

0.1 x 10-91 x 10-9

Page 9: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Surprisingly, insulin from the guinea pig (and from the related coypu) evolve seven times faster than insulinfrom other species. Why?

The answer is that guinea pig and coypu insulindo not bind two zinc ions, while insulin molecules frommost other species do. There was a relaxation on thestructural constraints of these molecules, and so the genes diverged rapidly.

Historical background: insulin

Page 10: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

In the 1960s, sequence data were accumulated forsmall, abundant proteins such as globins,cytochromes c, and fibrinopeptides. Some proteinsappeared to evolve slowly, while others evolvedrapidly.

Linus Pauling, Emanuel Margoliash and others proposed the hypothesis of a molecular clock:

For every given protein, the rate of molecular evolution is approximately constant in all evolutionary lineages

Molecular clock hypothesis

Page 11: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Richard Dickerson (1971) plotted datafrom three protein families: cytochrome c, hemoglobin, and fibrinopeptides.

The x-axis shows the divergence times of the species,estimated from paleontological data. The y-axis showsm, the corrected number of amino acid changes per 100 residues.

n is the observed number of amino acid changes per100 residues, and it is corrected to m to account forchanges that occur but are not observed.

Molecular clock hypothesis

N100

= 1 – e-(m/100)

Page 12: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Hidden mutation due to multiple substitutions

Page 13: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Millions of years since divergence

corr

ecte

d a

min

o a

cid

ch

ang

es

per

100

res

idu

es (

m)

Dickerson (1971)

Page 14: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

• For each protein, the data lie on a straight line. Thus, the rate of amino acid substitution has remained constant for each protein.

• The average rate of change differs for each protein. The time for a 1% change to occur between two lines of evolution is 20 MY (cytochrome c), 5.8 MY (hemoglobin), and 1.1 MY (fibrinopeptides).

• The observed variations in rate of change reflect functional constraints imposed by natural selection.

Page 15: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Fibrinopeptides 9.0Kappa casein 3.3Lactalbumin 2.7Serum albumin 1.9Lysozyme 0.98Trypsin 0.59Insulin 0.44Cytochrome c 0.22Histone H2B 0.09Ubiquitin 0.010Histone H4 0.010

Molecular clock for proteins:rate of substitutions per aa site per 109 years

Page 16: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

If protein sequences evolve at constant rates,they can be used to estimate the times that sequences diverged. This is analogous to datinggeological specimens by radioactive decay.

Molecular clock hypothesis: implications

N = total number of substitutionsL = number of nucleotide sites compared

between two sequences

K = = number of substitutionsper nucleotide site

NL

See Graur and Li (2000), p. 140

Page 17: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Rate of nucleotide substitution r and time of divergence T

See Graur and Li (2000), p. 140

r = rate of substitution= 0.56 x 10-9 per site per year for hemoglobin alpha

K = 0.093 = number of substitutionsper nucleotide site (rat versus human)

r = K / 2TT = .093 / (2)(0.56 x 10-9) = 80 million years

Page 18: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Kimura’s (1968) neutral theory of molecular evolution:

--the vast majority of DNA changes are not selected for in a Darwinian sense.

--The main cause of evolutionary change is randomdrift of mutant alleles that are selectively neutral(or nearly neutral).

--Positive Darwinian selection does occur, but limited role.

e.g. the divergent C peptide of insulinchanges according to the neutral mutation rate.

Neutral theory of evolution

Page 19: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

“fast-clock” organisms • These organisms with long branches are called “fast-clock”

• They really acumulate substitutions faster than the rest of organisms (their rate of substitution is higher)

• Some authors have proposed various hypothesis to try to explain this phenomenon:– higher metabolic rate, short generation time, differences

in the number of replications of DNA in the germinal line, deficiences in DNA repair, mutagens,

Page 20: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Solutions?

• Use methods less sensible to this type of inconsistency (ML?)

• If it is possible, eliminate long branches:– eliminate the “fast-clock” organism– substitute by another of the same group that is not

“fast-clock”– increase the number of organisms of that group

Page 21: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Solutions?

• We first need to know if we really have a “fast-clock” organism

• Relative Rate Test – Sarich and Wilson, 1973 for proteins– Wu and Li (1985) and Li and Tanimura (1987)

extended it to nucleotides

Page 22: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

A B C

O

Relative Rate Test

• Uses 3 species A, B and one “outgroup” C

• Tests if A and B have the same rate of substitution since their split:

dAO = dBO

dAC = dBC

d = dAC - dBC = 0

Page 23: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Relative Rate Test

A B C

O

• This method is time independent

• We have to be sure about the phylogeny

Page 24: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

A B C

O

Relative Rate Test

• Our null hypothesis is:

d = dAC - dBC = 0

• It is assumed that the number of nucleotide substitutions follows a Poisson,

• then we can use the standarized normal distribution to test if the number of substituions in the 2 lineages is the same

Page 25: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Relative Rate Test• d = dAC - dBC = 0

• d ± Var(d)

• Var(d) = Var(dAC) + Var(dBC) - 2 Cov (dAC,dBC)

– |d| > 1.96 Var(d) = significant at the 5% level– |d| > 2.96 Var(d) = significant at the 1% level

0 x + 1.96x – 1.96

Page 26: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

The Structure of Phylogenetic Trees

Page 27: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

A

B

C

D

E

F

G

HI

time

6

2

1 1

2

1

2

6

1

2

2

1

A

BC

2

1

2

D

Eone unit

Molecular phylogeny uses trees to depict evolutionaryrelationships among organisms. These trees are basedupon DNA and protein sequence data.

Page 28: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

A

B

C

D

E

F

G

HI

time

6

2

1 1

2

1

2

6

1

2

2

1

A

BC

2

1

2

D

Eone unit

Tree nomenclature

Branches are unscaled... Branches are scaled...

…branch lengths areproportional to number ofamino acid changes

…OTUs are neatly aligned,and nodes reflect time

Page 29: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

A

B

C

D

E

F

G

HI

time

6

2

1 1

2

1

2

6

1

2

2

1

A

BC

2

1

2

D

Eone unit

Tree nomenclature

taxon

operational taxonomic unit (OTU) such as a protein sequence

Page 30: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

A

B

C

D

E

F

G

HI

time

6

2

1 1

2

1

2

6

1

2

2

1

A

BC

2

1

2

D

Eone unit

Tree nomenclature

branch (edge)

Node (intersection or terminating pointof two or more branches)

Page 31: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

A

B

C

D

E

F

G

HI

time

6

2

1 1

2

1

2

6

1

2

2

1

A

BC

22

D

Eone unit

Tree nomenclature

bifurcatinginternal node

multifurcatinginternal

node

Page 32: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

A

B

C

D

E

F

G

HI

time

6

2

1 1

2

1

2

Tree nomenclature: clades

Clade ABF (monophyletic group)

Page 33: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

A

B

C

D

E

F

G

HI

time

6

2

1 1

2

1

2

Tree nomenclature

Clade ABF/CDH/G

Page 34: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

The root of a phylogenetic tree represents thecommon ancestor of the sequences. Some treesare unrooted, and thus do not specify the commonancestor.

A tree can be rooted using an outgroup (that is, ataxon known to be distantly related from all other OTUs).

Tree roots

Page 35: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Tree nomenclature: roots

past

present

1

2 3 4

5

6

7 8

9

4

5

87

1

2

36

Rooted tree(specifies evolutionarypath)

Unrooted tree

Page 36: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Tree nomenclature: outgroup rooting

past

present

1

2 3 4

5

6

7 8

9

Rooted tree

1

2 3 4

5 6

Outgroup(used to place the root)

7 9

10

root

8

Page 37: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Numbers of trees

Number Number of Number of of OTUs rooted trees unrooted trees

2 1 13 3 14 15 35 105 1510 34,459,425 105

Page 38: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Molecular evolutionary studies can be complicatedby the fact that both species and genes evolve.speciation usually occurs when a species becomesreproductively isolated. In a species tree, eachinternal node represents a speciation event.

Genes (and proteins) may duplicate or otherwise evolvebefore or after any given speciation event. The topologyof a gene (or protein) based tree may differ from thetopology of a species tree.

Species trees versus gene/protein trees

Page 39: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

species 1 species 2

speciationevent

Species trees versus gene/protein trees

past

present

Page 40: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

species 1 species 2

speciationevent

Species trees versus gene/protein trees

Gene duplicationevents

OTUs

Page 41: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

How to Construct Phylogenetic Trees

Page 42: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Molecular phylogenetic analysis may be describedin four stages:

[1] Selection of sequences for analysis

[2] Multiple sequence alignment

[3] Tree building

[4] Tree evaluation

Four stages of phylogenetic analysis

Page 43: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

-Protein alignments are more informative as to structure function relationships

-Although DNA may be preferable for the phylogenetic analysis since the protein-coding portion of DNA has synonymous and nonsynonymous substitutions

-RNA is useful for the other non-protein coding genes (e.g. tRNAs) if looking at structure-function relationshipsBut often use the gene instead for phylogeny (e.g. genesFor rRNA)

Stage 1: Use of DNA, RNA, or protein

Page 44: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

For phylogeny, protein sequences are also often used.

--Proteins have 20 states (amino acids) instead of onlyfour for DNA, so there is a stronger phylogenetic signal.

Nucleotides are unordered characters: any onenucleotide can change to any other in one step.

An ordered character must pass through one or moreintermediate states before reaching the final state.

Amino acid sequences are partially ordered character states: there is a variable number of states betweenthe starting value and the final value.

Stage 1: Use of DNA, RNA, or protein

Page 45: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

If the synonymous substitution rate (dS) is greater thanthe nonsynonymous substitution rate (dN), the DNAsequence is under negative (purifying) selection. Thislimits change in the sequence (e.g. insulin A chain).

If dS < dN, positive selection occurs. For example, a duplicated gene may evolve rapidly to assume new functions.

Synonymous vs Nonsynonymous rates

Page 46: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16
Page 47: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

DNA can be more informative also due to:

--Rates of transitions and transversions can be measured.

--Noncoding regions (such as 5’ and 3’ untranslatedregions) may be analyzed using molecular phylogeny.

--Pseudogenes (nonfunctional genes) are studied bymolecular phylogeny

-- Additional mutational events can be inferred byanalysis of ancestral sequences. These changesinclude parallel substitutions, convergent substitutions,and back substitutions.

Page 48: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16
Page 49: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

-- in order to predict ancestral sequence, other distantly related sequences are analyzed

Page 50: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

The fundamental basis of a phylogenetic tree isa multiple sequence alignment.

(If there is a misalignment, or if a nonhomologoussequence is included in the alignment, it will stillbe possible to generate a tree.)

Consider the following (see Fig. 3.2)

Stage 2: Multiple sequence alignment

Page 51: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Some positions of the multiple sequence alignment areinvariant (arrow 2). Some positions distinguish fish RBPfrom all other RBPs (arrow 3).

Alignment of 13 orthologous retinol-binding proteins

Page 52: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

[1] Confirm that all sequences are homologous

[2] Adjust gap creation and extension penalties as needed to optimize the alignment

[3] Restrict phylogenetic analysis to regions of the multiple sequence alignment for which data are available for all taxa (delete columns having incomplete data).

[4] Many experts recommend that you delete any column of an alignment that contains gaps (even if the gap occurs in only one taxon)

Stage 2: Multiple sequence alignment

Page 53: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Discuss two tree-building methods:distance-based versus character-based.

Distance-based methods involve a distance metric,such as the number of amino acid changes betweenthe sequences, or a distance score. Examples ofdistance-based algorithms are UPGMA and neighbor-joining.

Character-based methods include maximum parsimonyand maximum likelihood. Parsimony analysis involvesthe search for the tree with the fewest amino acid(or nucleotide) changes that account for the observeddifferences between taxa.

Stage 3: Tree-building methods

Page 54: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Fish RBP orthologs

common carp

zebrafish

rainbow trout

teleost

African clawed frog

chicken

mouserat

rabbitcowpighorse

human

10 changes

Other vertebrateRBP orthologs

Page 55: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Distance-based treeCalculate the pairwise alignments;if two sequences are related,put them next to each other on the tree

Page 56: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Character-based tree: identify positions that best describe how characters (amino acids) are derived from common ancestors

Page 57: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Regardless of whether you use distance- or character-based methods for building a tree,the starting point is a multiple sequence alignment.

ReadSeq is a convenient web-based program thattranslates multiple sequence alignments intoformats compatible with most commonly usedphylogeny programs such as PAUP and PHYLIP.Mega has its own text converter.

Stage 3: Tree-building methods

Page 58: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Stage 3: Tree-building methods: distance

The simplest approach to measuring distances between sequences is to align pairs of sequences, andthen to count the number of differences. The degree ofdivergence is called the Hamming distance. For analignment of length N with n sites at which there aredifferences, the degree of divergence D is:

D = n / N

But observed differences do not equal genetic distance!Genetic distance involves mutations that are notobserved directly

Page 59: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Stage 3: Tree-building methods: distance

Jukes and Cantor (1969) proposed a corrective formula:

D = (- ) ln (1 – p)34

43

This model describes the probability that one nucleotidewill change into another. It assumes that each residue is equally likely to change into any other (i.e. the rate oftransversions equals the rate of transitions). In practice,the transition is typically greater than the transversionrate.

Page 60: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Models of nucleotide substitution

A G

C T

transition

transition

transversiontransversion

Page 61: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

A

Jukes and Cantor one-parameter model of nucleotide substitution

G

T C

Page 62: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Stage 3: Tree-building methods: distance

Jukes and Cantor (1969) proposed a corrective formula:

D = (- ) ln (1 – p)34

43

Consider an alignment where 3/60 aligned residues differ.The normalized Hamming distance is 3/60 = 0.05.The Jukes-Cantor correction is

D = (- ) ln (1 – 0.05) = 0.05234

43

When 30/60 aligned residues differ, the Jukes-Cantor correction is more substantial:

D = (- ) ln (1 – 0.5) = 0.8234

43

Page 63: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

http://evolution.genetics.washington.edu/phylip/software.html

This site lists 200 phylogeny packages. Perhaps the best-known programs are PAUP (David Swofford et al.),PHYLIP (Joe Felsenstein) and MEGA (Kumar et al.)

Many software packages are available for making phylogenetic trees.

Page 64: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

UPGMA (distance-based tree)

Page 65: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Tree-building methods: UPGMA

UPGMA is unweighted pair group methodusing arithmetic mean

1 2

3

4

5

Page 66: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Tree-building methods: UPGMA

1 2

3

4

5

Cluster the smallest pairwise alignmentsAnd repeat until all clusters are drawn

Page 67: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

1 2

3

4

5

1 2

6

Step 1

1 2

3

4

5

1 2

6

4 5

7

Step 2

Page 68: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

1 2

3

4

5 1 2

6

4 5

7

3

8

Step 3

1 2

3

4

5

1 2

6

4 5

7

3

8

9

Step 4

Page 69: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

UPGMA is a simple approach for making trees.

• An UPGMA tree is always rooted.

• An assumption of the algorithm is that the molecular clock is constant for sequences in the tree. If there are unequal substitution rates, the tree may be wrong.

• While UPGMA is simple, it is less accurate than the neighbor-joining approach (described next).

Distance-based methods: UPGMA trees

Page 70: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

The neighbor-joiningmethod of Saitou and Nei(1987) Is especially usefulfor making a tree having a large number of taxa.

Begin by placing all the taxa in a star-like structure.

Making trees using neighbor-joining

Page 71: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Tree-building methods: Neighbor joining

Next, identify neighbors (e.g. 1 and 2) that are most closelyrelated. Connect these neighbors to other OTUs via aninternal branch, XY. At each successive stage, minimizethe sum of the branch lengths.

dXY = 1/2(d1Y + d2Y – d12)

Page 72: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Example of aneighbor-joiningtree: phylogeneticanalysis of 13RBPs

Page 73: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Tree-building methods: character based

Rather than pairwise distances between proteins,evaluate the aligned columns of amino acidresidues (characters).

Tree-building methods based on characters includemaximum parsimony and maximum likelihood.

Page 74: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

The main idea of character-based methods is to findthe tree with the shortest branch lengths possible.Thus we seek the most parsimonious (“simple”) tree.

• Identify informative sites. For example, constant characters are not parsimony-informative.

• Construct trees, counting the number of changesrequired to create each tree. For about 12 taxa orfewer, evaluate all possible trees exhaustively; for >12 taxa perform a heuristic search.

• Select the shortest tree (or trees).

Making trees using character-based methods

Page 75: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

As an example of tree-building using maximum parsimony, consider these four taxa:

AAGAAAGGAAGA

How might they have evolved from a common ancestor such as AAA?

Page 76: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

AAG AAA GGA AGA

AAAAAA

1 1AGA

AAG AGA AAA GGA

AAAAAA

1 2AAA

AAG GGA AAA AGA

AAAAAA

1 1AAA

1 2

Tree-building methods: Maximum parsimony

Cost = 3 Cost = 4 Cost = 4

1

In maximum parsimony, choose the tree(s) with the lowest cost (shortest branch lengths).

Page 77: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

In PAUP’s implementationof maximum parsimony,many arrangements are tried

and the best trees(lowest branch lengths)are saved

Page 78: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Phylogram

(values are proportionalto branchlengths)

Page 79: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Rectangularphylogram

(values are proportionalto branchlengths)

Page 80: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Cladogram

(values are not proportionalto branchlengths)

Page 81: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Rectangularcladogram

(values are not proportionalto branchlengths)

These four trees display the same datain different formats.

Page 82: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

A1AG HUMAA1AG RABI

A1AH MOUSA1AG RAT

APHR CRICOBP RAT 2PBAS RATMUP1 MOUSMUPM MOUS

MUP RAT 3CO8G HUMAAMBP HUMA

FAB1 MANSFAB2 MANS

FABL CHICILBP PIGILBP RAT

FABA HUMAMYP2 BOVI

FABE HUMAFABH BOVI

FABL GINCFABP ECHG

FABP SCHMRET3 BOVIRET1 HUMARET2 MOUS

FABI HUMAFABL HUMA

AMBP PLEPOLFA RANP

LALP MACEVEG1 RAT

VEGP HUMAERBP RAT

ESP4 LACVQSP CHICK

LIPO BUFMPGHD HUMANGAL HUMANGAL MOUS

LACA CANFLACB BOVILACB PIGLACA EQUA

LACB EQUAPAEP HUMA

LACB MACGAPD HUMAN

BBP PIEBRICYA MANS

PURP CHICRET1 ONCM

RETB BOVIRETB XENL

CRA2 HOMGCRC1 HOMG

OBP BOVIN

50 changes

3740

27

252124

64

32

3623 40

4461

33

331522

2725

28

24

31

26

28

34

22

66

33

40

64

40 3424

21

20

2425

38 1621

23

17

1826

1517

1418

151936

2021

25 4337

32

25 2731

5438

52

3871

3158

29 3024

5348

57

34

3525 46

41

49 2632

29

4330

16

1121

212128

21 3715

4056

46

33

2854

42 5338

5935

2324

182533

39 4549

41

retinol-binding protein

odorant-binding protein (rat)

odorant-binding protein (bovine)

lactoglobulin

Page 83: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

For some phylogenetic trees, particularly those basedon maximum parsimony, the artifact of long-branchattraction may occur.

Branch lengths often depict the number of substitutionsthat occur between two taxa. Parsimony assumes alltaxa evolve at the same rate, and all characters contribute the same amount of information.

Rapidly evolving taxa may be placed on the same branch,not because they are related, but because they bothhave many substitutions.

Tree artifacts: long branch attraction

Page 84: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Long branch attraction (LBA)

• When the length of the branches or the substitution rates are extremely unequal, there is a violation of the assumptions made by inference methods

– termed the Felsenstein Zone

Page 85: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Long branch chain attraction can confoundphylogenetic analyses

Page 86: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Maximum likelihood is an alternative to maximumparsimony. It is computationally intensive. A likelihoodis calculated for the probability of each residue inAn alignment, based upon some model of thesubstitution process.

Making trees using maximum likelihood

Page 87: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

The main criteria by which the accuracy of a phylogentic tree is assessed are consistency,efficiency, and robustness. Evaluation of accuracy can refer to an approach (e.g. UPGMA) or to a particular tree.

Stage 4: Evaluating trees

Page 88: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

Bootstrapping is a commonly used approach tomeasuring the robustness of a tree topology.Given a branching order, how consistently doesan algorithm find that branching order in a randomly permuted version of the original data set?

Stage 4: Evaluating trees: bootstrapping

Page 89: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

To bootstrap, make an artificial dataset obtained by randomly sampling columns from your multiple sequence alignment. Make the dataset the same size as the original. Do 100 (to 1,000) bootstrap replicates.Observe the percent of cases in which the assignmentof clades in the original tree is supported by the bootstrap replicates. >70% is considered significant.

Stage 4: Evaluating trees: bootstrapping

Page 90: Molecular Phylogeny Biology 224 Instructor: Tom Peavy Nov 4, 9 & 16

In 61% of the bootstrapresamplings, ssrbp and btrbp(pig and cow RBP) formed adistinct clade. In 39% of the cases, another protein joinedthe clade (e.g. ecrbp), or oneof these two sequences joinedanother clade.