
An Introduction to Molecular Phylogeny

Dr. Kerstin Hoef-Emden
Universität zu Köln
Botanisches Institut
Gyrhofstr. 15
50931 Köln

What is molecular phylogeny?

phylon = Greek for stem
genesis = Greek for origin

molecular phylogeny = studying relationships among organisms using molecular markers (e.g. DNA or protein sequences)

dissimilarities among sequences = genetic divergence caused by mutations during the course of time

Molecular Phylogenetic Methods

- their accuracy can be tested in in silico simulations

- are based on assumptions about the processes of molecular evolution

- may be computationally intense (this refers more to CPU time than to memory)

- may be sensitive to artefacts

- usually results are displayed as trees

Accuracy of Molecular Phylogenetic Methods

consistency = Does a method reconstruct the correct tree given an infinite amount of data? (All methods do, if their assumptions are not violated.)

efficiency = How quickly does a method converge to the correct tree with a finite amount of data? (The less data is needed to infer the correct tree, the more efficient the method.)

robustness = How well does a method perform if the assumptions about the evolutionary process are violated?

How to test phylogenetic methods?

e.g. by simulation in silico (= in the computer)

a) Simulate the evolution of a randomly chosen DNA or protein sequence under a given evolutionary model and tree topology into several lineages.
b) Use the phylogenetic method under test to infer a phylogenetic tree.
c) Does the resulting phylogenetic tree correspond to the true tree?
d) Modify the tree topology to different extremes in branch lengths and repeat the test.
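Step a) can be sketched in a few lines of Python. This is only a minimal illustration, assuming the Jukes-Cantor model, a fixed four-taxon tree ((A,B),(C,D)) and made-up branch lengths; the function names `evolve` and `simulate_four_taxa` are hypothetical and not part of any phylogeny package.

```python
import math
import random

BASES = "ACGT"

def evolve(seq, branch_length, rng):
    """Evolve a sequence along one branch under the Jukes-Cantor model."""
    # Probability that a site shows a different base at the end of the branch;
    # branch_length is measured in expected substitutions per site.
    p_change = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate_four_taxa(length, short, long, rng):
    """Simulate the tips of the tree ((A,B),(C,D)); A and C sit on long branches."""
    root = "".join(rng.choice(BASES) for _ in range(length))
    left = evolve(root, short, rng)    # ancestor of A and B
    right = evolve(root, short, rng)   # ancestor of C and D
    return {
        "A": evolve(left, long, rng),
        "B": evolve(left, short, rng),
        "C": evolve(right, long, rng),
        "D": evolve(right, short, rng),
    }

# Hypothetical settings: 1000 sites, short branches 0.05, long branches 0.8.
tips = simulate_four_taxa(length=1000, short=0.05, long=0.8, rng=random.Random(1))
for name, seq in tips.items():
    print(name, seq[:40])
```

The simulated sequences would then be passed to the method under test (step b), and the inferred topology compared with the known tree ((A,B),(C,D)) (step c).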

Phylogenetic Methods and Real Life Sequences

- The true tree is unknown; each inferred tree represents a hypothesis.
- No infinite amounts of data are available (there are no nuclei with infinite space containing infinite amounts of DNA).
- By using robust and efficient methods and appropriate evolutionary models, the inferred trees hopefully come as close as possible to the real phylogeny.
- The simulation studies give some hints about potential vulnerabilities of the phylogenetic methods.

Trees: Nomenclature

terminal branch
internal branch

terminal node = operational taxonomic unit (OTU) = contemporary taxon

internal node = unknown ancestor = extinct taxon

mathematics: branch = edge; node = vertex (plural: vertices)

Tree Types: Unscaled Trees

slanted cladogram
rectangular cladogram

disadvantage: no information about evolutionary rates in a tree

Tree Types: Scaled Trees

rooted phenogram
unrooted phenogram

disadvantage: direction of evolution is unknown
advantage: higher resolution

Treefile Formats

#NEXUS
Begin trees; [Treefile saved Thu Sep 16 03:12:19 2004]
Translate
  1 MPorph,
  2 S11679,
  3 CCMP736,
  4 MCont316,
  5 S1382,
  [...]
  21 UTEX637;
tree PAUP_1 = [&U] (1:0.096043,(((((2:0.041968,12:0.011298):0.014339,(13:0,20:0):0.012408):0.012250,3:0.188535):0.027987,(4:0.107423,5:0.115320):0.013260,((((14:0.001531,15:0):0.038667,19:0.005060):0.000989,18:0.018300):0.015597,(17:0.006567,21:0.029673):0.017735):0.009011):0.011165,(((6:0.021826,9:0.009905):0.002602,(7:0.014488,11:0.066979):0.005662):0.014995,(8:0.046098,10:0.079671):0.010953):0.009027):0.020750,16:0.048813);
End;

Phylip (Newick)

(MPorph:0.094736,(((((S11679:0.041475,U1424:0.011216):0.014077,(M1712:0,S10379:0):0.012338):0.011839,CCMP736:0.183396):0.027333,(MCont316:0.106050,S1382:0.117050):0.013282,((((S3794:0.001532,S3694:0):0.038496,S899:0.004933):0.001127,S3194:0.018102):0.015038,(S4094:0.006579,UTEX637:0.029760):0.018250):0.009256):0.011383,(((Chondrus:0.021641,S13531a:0.009722):0.002405,(S4194a:0.014451,S1896:0.065892):0.005690):0.014595,(S4194b:0.046357,S13531b:0.078882):0.010916):0.008914):0.020599,S5981:0.048038);
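To show how the nested parentheses of the Newick format encode a tree, the following sketch extracts the taxon names (terminal nodes) from a Newick string. It assumes simple, unquoted labels as in the treefile above; the example string is a shortened version with rounded branch lengths, and the helper `leaf_names` is a hypothetical name.

```python
import re

def leaf_names(newick):
    """Return the tip labels of a Newick string in left-to-right order."""
    # A tip label directly follows "(" or ","; anything after ")" belongs to
    # an internal node, and anything after ":" is a branch length.
    return re.findall(r"(?<=[(,])\s*([^\s():,;]+)", newick)

# Shortened, hypothetical example in the style of the treefile above.
example = "(MPorph:0.09,((S11679:0.04,U1424:0.01):0.01,CCMP736:0.18):0.02,S5981:0.05);"
print(leaf_names(example))
```

A full parser would also have to handle quoted labels, comments in square brackets and internal node labels; for those cases an established tree-handling library is the safer choice.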

Displaying Trees

Paup for MacOS 9: graphical output to screen, file or printer; nexus or Newick format (unscaled, scaled, rooted and unrooted trees)
Phylip: treefile to graphics converter; Newick format (unscaled, scaled, rooted and unrooted trees)
Paup for Windows or portable format (Unixoids): auxiliary program necessary

- Phylip converter programs
- TreeView
  for MacOS 9 (and Windows?): all tree types and tree editing; nexus and Newick
  for Unixoids: no unrooted trees, no tree editing; nexus and Newick
- TreeEdit (MacOS), Treetool etc.

Phylogeny Programs (1)

General purpose
Paup 4b: Windows, MacOS 9, Unixoids (Linux, Solaris, MacOS X etc.)
Phylip 3.62: Windows, MacOS 8, 9, X, Linux, C sources

Bayesian analyses
MrBayes 3: Windows, MacOS, C sources (Unixoids)

Links to phylogeny-related software collected by Joe Felsenstein: http://evolution.genetics.washington.edu/phylip/software.html

Paup* 4b10 = Phylogenetic Analysis Using Parsimony (* and other methods)

Written by David Swofford. The first versions up to Paup 3 were available for MacOS < 9 only and focused on the parsimony method. Paup 4 is available for different operating systems and is one of the most powerful tools for phylogenetic analyses of nucleotide sequences. It is sold as a beta version, but is more stable than some final releases of other software.

MacOS 9 PPC: graphical user interface (i.e. mouse-driven) and graphical output of trees
All others (Windows, Unixoids): command line (can be submitted to batch queues; unfortunately no checkpointing)
Distributor in Europe: Palgrave-MacMillan, UK (Windows and MacOS 9 PPC; GBP 62/72)
Distributor in USA (and for portable versions): Sinauer Associates (USD 85-150)


Phylip 3.62 = Phylogenetic Inference Package

Written by Joe Felsenstein. Multiple-purpose package for nucleotide as well as protein sequences. Approx. 30 different programs to fulfil different tasks. Freely available over the internet (http://evolution.gs.washington.edu/phylip.html).

Precompiled for: MacOS 8/9 PPC, MacOS X, Windows, Red Hat Linux (i386)
C sources for all other Unixoids
User interface: text-based menu system


MrBayes 3

Written by John Huelsenbeck and Fredrik Ronquist. Specialised in Bayesian analyses. Handles nucleotide as well as protein sequences. Partitioned computation of concatenated data sets. Freely available over the internet (http://morphbank.ebc.uu.se/mrbayes/).

Runs under MacOS X, Windows, Unixoids
C sources
User interface: command line-driven; syntax similar to Paup

Parallelised version available; no checkpointing.


Input Formats

Paup: nexus format (interleaved or sequential)
Phylip: phylip format
MrBayes: nexus format (interleaved or sequential)


Phylogeny Programs (6)

Nexus Format

#NEXUS
BEGIN TAXA;
  DIMENSIONS NTAX=6;
  TAXLABELS 'S3694' 'S5981' 'S4094' 'S3194' 'S899' 'S10379';
END;
BEGIN CHARACTERS;
  DIMENSIONS NCHAR=14;
  FORMAT DATATYPE=NUCLEOTIDE GAP=-;
  MATRIX
  [1] 'S3694'   cCCAAGCGTTTCCG
  [2] 'S5981'   CCCAATCGTTTCCC
  [3] 'S4094'   CCCAATCGTTTCCG
  [4] 'S3194'   GCCAATCGTTTCCG
  [5] 'S899'    CCCAAGCGTTTCCG
  [6] 'S10379'  CCCAATCGTTTCCG
  ;
END;

Phylip Format

6 14
S3694     cCCAAGCGTTTCCG
S5981     CCCAATCGTTTCCC
S4094     CCCAATCGTTTCCG
S3194     GCCAATCGTTTCCG
S899      CCCAAGCGTTTCCG
S10379    CCCAATCGTTTCCG

Work-Flow

aim: group of organisms or gene family
-> choice of molecular marker(s) and taxon sampling
-> amplification/sequencing
-> alignment
-> choice of evolutionary model
-> phylogenetic analyses
-> tree(s)/results
-> user-defined trees and topology testing

(feedback arrow from the results: improvement of the choice of molecular marker(s) and taxon sampling)

Taxon Sampling

Strategy for an initial Taxon Sampling

- the diversity of the group should be represented (guessing by looking at the phenotype or using the systematics of the group), e.g. combinations of morphological characters or representatives of all species/genera of a group or serotypes or ...
- at least two representatives of each presumed clade (guessing)
- not too few taxa (> 15)
- outgroup taxa (closest related sister group only!)

Choice of Molecular Marker(s): Phylogenies of Gene Families (1)

All orthologues and paralogues (or alleles) of a gene in an organism have to be sequenced!

Why?

Choice of Molecular Marker(s): Phylogenies of Gene Families (2)

A very fictitious example of a weird tree caused by an incomplete sampling of a gene family (the taxon sampling is also not recommended).

[Figure: trees of a gene family sampled from Homo, Drosophila, Arabidopsis and Chlamydomonas, labelled "correct tree" and "very bad tree!"]

Choice of Molecular Marker(s): Phylogenies of Organisms (1)

- choose single-copy genes (protein-coding) or highly synchronised genes (ribosomal DNA)
- choose more variable genes for closely related organisms and conserved genes for more distantly related organisms
- in sexually reproducing organisms, two alleles may occur


e.g. the eukaryotic ribosomal operon

SSU rDNA - ITS1 - 5.8S rDNA - ITS2 - LSU rDNA

conserved = potentially suited for phylogenies of genera or higher level taxa

highly variable = potentially suited for phylogenies of species or lower level taxa


some examples of more conserved genes:
- actin
- elongation factor 1 (EF-1)
- rbcL
- tubulins
- and lots more ...




DNA Amplification (1)

[Flow chart with the elements: genomic DNA, PCR, template for sequencing, cloning; mRNA, RT-PCR, cDNA]

DNA Amplification (2)

               advantage                    disadvantage
genomic DNA    introns (add information)    introns (splice sites?)
cDNA           no introns (just ORF)        no introns

DNA Amplification (3)

Taq polymerase: no proofreading = introduces reading errors (predominantly transitions)

- direct sequencing (large template pool) -> usually no problem
- cloning of PCR products -> problem!
  solutions:
  a) proofreading polymerase instead of Taq
  b) sequence more than 2 clones (better more than three; only an option if no allelic variation can be expected!)

Sequencing

Reduce/avoid sequencing errors

- sequencing of forward and reverse strands
- thorough proofreading
  ribosomal RNA sequences: secondary structure
  protein sequences: translation (stop codons?)
- BLAST search (PCR contamination, chimaeric sequences?)




Alignment (1)

automatic alignment
- good as a starting point for an alignment
- not good if sequences contain a lot of indels and highly variable regions (i.e. non-coding regions such as ITS or intron sequences, or variable regions in ribosomal RNA sequences)

proteins: check alignment afterwards by eye
ribosomal RNA, intron and ITS sequences: always manual editing needed

Alignment (2)

Second round of proofreading

- Unusual amino acids in the translated sequence?
- Deviations in a highly conserved region of ribosomal RNA?
- One G where all other sequences have two in a highly conserved region?

-> back to the assembly data and cross-checking (it may be true, though!)

The alignment is the very basis of the phylogenetic analyses. Software cannot differentiate between a real mutation and a sequencing or alignment error.

Alignment (3)

Effects of an Erroneous Alignment

Decrease in resolution.
Worst case: artefactual tree topology

Alignment (4)

Preparation of the Alignment for the Phylogenetic Analyses

Exclusion of nonalignable regions and saving of the data set in nexus (PAUP) or phylip (Phylip, PAML, Molphy) format, depending on the software used for phylogenetic analyses.
In protein-coding sequences: perhaps excluding the third codon position.

CCMP152 TAGGAAATCTAGAGCTAATACATGCACCATCGCTCTAATTTGATATTTT--------
M1303   TAGGAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTGTGTTTAGTT-------
M2180   TAGGAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTGTGTTTAGTT-------
S9772e  TAGGAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTTACAATATCTAA-----
S9772b  TAGgAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTATATATATTTGT-----
C9772a  TAGgAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTATATATATTTGT-----
M1703   TAGGAATTCTAGAGCTAATACATGCACCAGTGCCCTTAGTTTATTCTTTTTTAAGACA
M1481   TAGGAATTCTAGAGCTAATACATGCACCATCGCTTTTTTTTCTTTTTTCTTTTTTCTT
SB9801  TAGGAATTCTAGAGCTAATACATGCACCATCGTTTTTCTTGACAGGAAGGAAGAAAAA
M1318   TAGGAATTCTAGAGCTAATACATGCACCAGTGCCCTTAGTTTATTCTTTTTTAAGACA
M1312   TAGgAATTCTAGAGCTAATACATGCACCATAGCCTTTTGTAATTTTTTTTAAAGTTTT
Hruf    TAGGAATTCTAGAGCTAATACATGCCCCATCGCTTTCGAAGTTTTTTAATTTTTTTTC

mask    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxxx

Alignment (5)

CTC = discussion zone between alignable and nonalignable regions
TTT = highly variable region -> exclude from analyses

e.g. ribosomal DNA

Properties of an alignment editor suited for phylogenetic purposes

- multiple sequence alignment
- limits of sequence number and length sufficiently high?
- manual editing
- protected mode (no deletion of nucleotides)
- several import/export formats (e.g. clustal for automated pre-alignment)
- for phylogenetic analyses: nexus/phylip format export
- definition of a mask to exclude nonalignable regions

Alignment (6)

Automatic Alignment Tools

- Clustal W and relatives
- T-Coffee
- Malign
- TreeAlign
- PileUp
- and others ...

Alignment (7)

Manual Alignment

- Align
- BioEdit
- SeaView
- SeAl
- ARB (RNA-coding DNA)
- DCSE (RNA-coding DNA)
- SeqLab (GCG)
- MacVector
- and others ...

Questions?




Evolutionary Models (1)

Evolutionary models consist of a combination of the following parameters:

- base frequencies
- substitution rate matrix
- proportion of invariable sites
- gamma-distributed among-site rate variation
- (covarion/covariotide)


base frequencies
percentages of A, C, G or T in the alignment; can be set to:
- equal (0.25 for each nt)
- empirical (= computed from the alignment)
- estimate (= optimised as a likelihood parameter)
- manually set

homogeneity of base frequencies among taxa:
The program Paup performs a chi-square test for biased base frequencies (command line: basefreqs).


substitution rate matrix
assumptions about substitution rates of point mutations

Jukes-Cantor model: All point mutations occur at the same rate (number of substitution types [nst] = 1).
Hasegawa-Kishino-Yano and Kimura 2-parameter: Differing rates for transitions and transversions (nst=2).
General time reversible model (GTR): Each type of substitution has a different substitution rate; a substitution and its reversal are considered equally likely (nst=6).


substitution rate matrix

nst=2

    A  C  G  T
A   -  v  i  v
C   v  -  v  i
G   i  v  -  v
T   v  i  v  -

i = transition
v = transversion

nst=6 (equal rates for both directions)

    A  C  G  T
A   -  a  b  c
C   a  -  d  e
G   b  d  -  f
T   c  e  f  -

nst=6 (but 3 rate classes): the Tamura-Nei model

    A  C  G  T
A   -  a  b  a
C   a  -  a  e
G   b  a  -  a
T   a  e  a  -
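The nst=2 matrix above follows from a simple rule: a substitution within the purines (A, G) or within the pyrimidines (C, T) is a transition, everything else is a transversion. The sketch below rebuilds that table and counts both types between two aligned sequences; the function names are hypothetical, and the two sequences are taken from the small example alignment shown earlier in this tutorial.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_type(x, y):
    """Return 'i' for a transition, 'v' for a transversion, '-' for identical bases."""
    if x == y:
        return "-"
    same_class = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
    return "i" if same_class else "v"

def count_ti_tv(seq1, seq2):
    """Count transitions and transversions between two aligned sequences."""
    ti = tv = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguity codes
        kind = substitution_type(x, y)
        ti += kind == "i"
        tv += kind == "v"
    return ti, tv

# Rebuild the nst=2 table row by row.
for x in "ACGT":
    print(x, " ".join(substitution_type(x, y) for y in "ACGT"))

# S3694 versus S5981 from the example alignment.
print(count_ti_tv("cCCAAGCGTTTCCG", "CCCAATCGTTTCCC"))
```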


proportion of invariable sites and gamma-distributed among-site rate variation

DNA sequences do not evolve at the same rate in all positions.

protein-coding sequences: faster rates at the third codon position
ribosomal DNA, internal transcribed spacers: alternating pattern of more conserved and highly variable regions correlating with secondary structure (helices and unpaired regions)


protein-coding genes: degenerate code

codon position   123  123  123
DNA              TCA  CGA  GTA
                 TCC  CGC  GTC
                 TCG  CGG  GTG
                 TCT  CGT  GTT
protein          Ser  Arg  Val

ribosomal DNA

[Figure: RNA secondary structure (helix with the base pairs U-G, A-U, C-G, C-G and unpaired regions) next to the corresponding DNA sequence TACCATGAAAAAGTGGAC]

red = highly variable positions


proportion of invariable sites = proportion of positions that do not evolve

gamma-distributed among-site rate variation = nucleotides evolve at differing rates in differing positions; modelled by the shape parameter (alpha) of the gamma distribution

both parameters may be used separately or can be combined

gamma-distributed among-site rate variation

[Figure: continuous gamma distribution and discrete gamma distribution; x-axis: substitution rate, y-axis: proportion of sites; curves for different values of the shape parameter (legible labels: 0.25, 1, ~10)]


Covarion/Covariotide

Individual sequences or lineages evolve faster than others. = not evolving according to a molecular clock.

Not implemented in most phylogeny programs (MrBayes is an exception).


Example of a command line in Paup 4b to calculate the parameters of a Tamura-Nei model with unequal base frequencies, proportion of invariable sites and gamma distribution (a tree has to be available in memory):

lscores 1/ nst=6 basefre=est rmat=est rclass=(a b a a e a) pinv=est rate=gam shape=est;


Different combinations of

base frequencies + substitution rate matrix + proportion of invariable sites/gamma distribution

= 56 evolutionary models in the program Paup 4b

How to decide which model fits a data set best?

Choice of Evolutionary Model (1)

The program Modeltest 3.5 (by Posada and Crandall) performs hierarchical likelihood ratio tests (hLRT) and also computes the Akaike information criterion (AIC).

Modeltest consists of a command file for Paup (modelblock) and an executable (Posada Lab at http://darwin.uvigo.es/).


Running Modeltest

1.) Start Paup and load the data set.
2.) Load the modelblock of Modeltest into Paup with the command execute.
3.) Paup will follow the commands given in the modelblock: First a tree is constructed using the simplest and fastest method. Then Paup computes the likelihood values for all 56 evolutionary models for the data set given the tree. The likelihood scores of the 56 models are saved in a file called model.scores.
4.) The Modeltest executable is started and fed with the model.scores file. It performs the hLRT and AIC tests and saves the results to a file.
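The AIC step can be illustrated with a small sketch. It uses AIC = 2(-lnL) + 2k, where k is the number of free parameters of the model. The -lnL values are copied from the Modeltest output shown in this tutorial; the parameter counts are an assumption of this sketch (substitution model parameters only, branch lengths not counted), and only four of the 56 models are compared.

```python
# -lnL values copied from the Modeltest output in this tutorial.
neg_lnL = {
    "JC": 3853.2573,
    "F81+I+G": 3797.3303,
    "HKY+I+G": 3796.1357,
    "GTR+I+G": 3793.3086,
}

# Assumed numbers of free parameters (base frequencies, rate parameters,
# proportion of invariable sites, gamma shape); branch lengths are ignored.
n_params = {"JC": 0, "F81+I+G": 5, "HKY+I+G": 6, "GTR+I+G": 10}

def aic(neg_log_likelihood, k):
    """Akaike information criterion; smaller values indicate a better trade-off."""
    return 2.0 * neg_log_likelihood + 2.0 * k

scores = {model: aic(neg_lnL[model], n_params[model]) for model in neg_lnL}
best = min(scores, key=scores.get)

for model, value in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{model:10s} AIC = {value:.4f}")
print("lowest AIC in this subset:", best)
```

The AIC and the hierarchical likelihood ratio tests do not have to agree, because they penalise additional parameters in different ways.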


Testing models of evolution - Modeltest Version 3.06
(c) Copyright, 1998-2000 David Posada ([email protected])
Department of Zoology, Brigham Young University
WIDB 574, Provo, UT 84602, USA
_______________________________________________________________

Wed Sep 15 21:33:13 2004

Input format: Paup matrix file

** Log Likelihood scores **
                        +I          +G          +I+G
JC    = 3853.2573   3853.2573   3814.9705   3806.2795
F81   = 3843.7336   3843.7336   3806.0015   3797.3303
K80   = 3852.4849   3852.4849   3814.0562   3805.3757
HKY   = 3842.8232   3842.8232   3804.8003   3796.1357
TrNef = 3852.4060   3852.4060   3813.3804   3804.7378
TrN   = 3842.6401   3842.6401   3804.7976   3796.1355
K81   = 3851.4536   3851.4536   3813.0771   3804.3674
K81uf = 3842.2886   3842.2886   3804.3494   3795.6624
TIMef = 3851.3740   3851.3740   3812.3914   3803.7239
TIM   = 3842.1130   3842.1130   3804.3457   3795.6621
TVMef = 3846.7188   3846.7188   3807.1191   3798.2488
TVM   = 3840.2319   3840.2319   3802.2058   3793.3215
SYM   = 3846.6523   3846.6523   3806.4158   3797.5940
GTR   = 3839.9839   3839.9839   3802.2041   3793.3086

** Hierarchical Likelihood Ratio Tests (hLRTs) **

Equal base frequencies
   Null model = JC              -lnL0 = 4124.0898
   Alternative model = F81      -lnL1 = 4117.4712
   2(lnL1-lnL0) = 13.2373       df = 3
   P-value = 0.004151
Ti=Tv
   Null model = F81             -lnL0 = 4117.4712
   Alternative model = HKY      -lnL1 = 4117.0146
   2(lnL1-lnL0) = 0.9131        df = 1
   P-value = 0.339297
Equal rates among sites
   Null model = F81             -lnL0 = 4117.4712
   Alternative model = F81+G    -lnL1 = 3806.0015
   2(lnL1-lnL0) = 622.9395      df = 1
   Using mixed chi-square distribution
   P-value =

Model selected: F81+I+G
   -lnL = 3797.3303
   Base frequencies:
      freqA = 0.3045
      freqC = 0.2348
      freqG = 0.2328
      freqT = 0.2279
   Substitution model: All rates equal
   Among-site rate variation
      Proportion of invariable sites (I) = 0.4495
      Variable sites (G)
      Gamma distribution shape parameter = 0.6163

[...]

BEGIN PAUP;
Lset Base=(0.3045 0.2348 0.2328) Nst=1 Rates=gamma Shape=0.6163 Pinvar=0.4495;
END;
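The first two likelihood ratio tests of the output above can be recomputed by hand. The test statistic is twice the difference of the -lnL values and is compared with a chi-square distribution whose degrees of freedom equal the number of additional parameters. The sketch below is an illustration only: `chi2_sf` is a hypothetical helper that covers just 1 to 3 degrees of freedom, and the mixed chi-square distribution used for the rate heterogeneity test is not implemented.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of the chi-square distribution for df = 1, 2 or 3."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2.0))
                + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))
    raise ValueError("only df = 1, 2 or 3 are implemented in this sketch")

def lrt(neg_lnL0, neg_lnL1, df):
    """Return (test statistic, P-value) for null model 0 versus alternative 1."""
    delta = 2.0 * (neg_lnL0 - neg_lnL1)
    return delta, chi2_sf(delta, df)

# Values copied from the Modeltest output above.
delta_freq, p_freq = lrt(4124.0898, 4117.4712, df=3)   # JC versus F81
delta_titv, p_titv = lrt(4117.4712, 4117.0146, df=1)   # F81 versus HKY

print(f"equal base frequencies: delta = {delta_freq:.4f}, P = {p_freq:.6f}")
print(f"Ti=Tv:                  delta = {delta_titv:.4f}, P = {p_titv:.6f}")
```

Small differences in the last digit compared with the program output are to be expected, because the -lnL values in the output are already rounded.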


Questions?




Phylogenetic Analysis Methods

Combination of an Optimality Criterion with a Tree Search Algorithm

1) Optimality Criterion
Scoring method to decide which tree is the best
e.g. maximum parsimony, distance analysis, maximum likelihood

2) Tree Search Algorithm
Method to construct a tree
e.g. exhaustive search, branch-and-bound, heuristic search, quartet puzzling, neighbor-joining

Phylogenetic Analysis: Maximum Parsimony

The tree which requires the fewest mutation steps to explain the nucleotide pattern of an alignment is the best.
Each point mutation equals one point in the scoring system; thus in unweighted parsimony only integer values are possible as scores.

best tree = maximum parsimony tree (MPT)
if several trees are equally scored = equally parsimonious trees (EPT)

Problem: The evolutionary model is implicit and cannot be adapted to the data set. All mutations at all positions are considered equal, even in more variable regions.
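How mutation steps are counted on a given tree can be sketched with the Fitch algorithm, a standard way to compute unweighted parsimony scores (the tutorial itself does not name the algorithm used by Paup). The four-sequence alignment and the function names below are hypothetical; trees are written as nested tuples of taxon names.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, number of changes) for one column.

    `tree` is a nested tuple of taxon names, e.g. (("Seq1", "Seq2"), ("Seq3", "Seq4"));
    `states` maps each taxon name to its nucleotide in this column.
    """
    if isinstance(tree, str):               # terminal node
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:                              # no additional change at this node
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total

# Hypothetical alignment with four taxa and four positions.
alignment = {"Seq1": "AAGT", "Seq2": "AAGC", "Seq3": "CCGT", "Seq4": "CCAC"}

trees = {
    "((1,2),(3,4))": (("Seq1", "Seq2"), ("Seq3", "Seq4")),
    "((1,3),(2,4))": (("Seq1", "Seq3"), ("Seq2", "Seq4")),
    "((1,4),(2,3))": (("Seq1", "Seq4"), ("Seq2", "Seq3")),
}
for label, tree in trees.items():
    print(label, parsimony_score(tree, alignment))
```

The tree with the lowest score is the maximum parsimony tree; if two topologies received the same lowest score, they would be equally parsimonious trees.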

Phylogenetic Analysis: Distance Matrix Methods (1)

All sequences of an alignment are compared pairwise with each other. Each pair is assigned an evolutionary distance value expressing the degree of divergence. The results are listed in a distance matrix, which is used to construct a tree.

To calculate the distances, one of the 56 different evolutionary models can be chosen, or the estimators of the maximum likelihood method can be used.

The tree with the shortest sum of distances is the best. Since the distances are not integers, there is usually only one best tree.
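As an illustration of a pairwise evolutionary distance, the sketch below computes the uncorrected proportion of differing sites (p) and the Jukes-Cantor distance d = -3/4 ln(1 - 4p/3), which corrects for multiple substitutions at the same site. Jukes-Cantor is chosen here only because it is the simplest of the models mentioned in this tutorial; the function names are hypothetical.

```python
import math

def p_distance(seq1, seq2):
    """Proportion of differing sites, ignoring gaps and ambiguity codes."""
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(a != b for a, b in pairs) / len(pairs)

def jc_distance(seq1, seq2):
    """Jukes-Cantor distance in expected substitutions per site."""
    p = p_distance(seq1, seq2)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Two sequences from the small example alignment of this tutorial.
s3694 = "cCCAAGCGTTTCCG"
s5981 = "CCCAATCGTTTCCC"
print(p_distance(s3694, s5981), jc_distance(s3694, s5981))
```

Computing this value for every pair of sequences yields the distance matrix from which a tree is then constructed.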

Phylogenetic Analysis: Distance Matrix Methods (2)

paup> showdist

HKY85 distance matrix

                    1        2        3        4        5        6        7
 1 MPorph           -
 2 S11679     0.15819        -
 3 CCMP736    0.16875  0.15158        -
 4 MCont316   0.15953  0.13655  0.18535        -
 5 S1382      0.17529  0.14077  0.18336  0.15227        -
 6 Chondrus   0.10471  0.10205  0.15863  0.12481  0.13192        -
 7 S4194a     0.10967  0.10370  0.15845  0.12654  0.12807  0.03539        -

Phylogenetic Analysis: Maximum Likelihood (1)

Probabilistic method: Tries to find the tree that maximises the probability of observing the data in the alignment. Likelihood is expressed as negative natural logarithm (-lnL; lowest -lnL is the best).

Computation steps:
- A tree is given.
- For each position in the alignment, the site-wise log likelihood is calculated. This includes all possible combinations of ancestral character states in a tree.
- The site-wise likelihoods of all positions of the alignment are multiplied, i.e. their log likelihoods are summed, and result in the total log likelihood value.

Phylogenetic Analysis: Maximum Likelihood (2)

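Example

[Figure: four-taxon tree with the terminal states A, A, C and G at position 1 and two unknown ancestral states (?, ?) at the internal nodes]

position 1 of 1500 positions

four-taxon tree = 16 possible combinations of ancestral states:

A-A  C-A  G-A  T-A
A-C  C-C  G-C  T-C
A-G  C-G  G-G  T-G
A-T  C-T  G-T  T-T

Alignment
Seq 1  ATTA...
Seq 2  ACTA...
Seq 3  CCTA...
Seq 4  GGTG...
       1234...

The probabilities depend on the evolutionary model settings.

site-wise log likelihood: logarithm of the sum of the probabilities of all 16 character combinations

total log likelihood: sum of the 1500 site-wise log likelihoods (= logarithm of the product of the site-wise likelihoods)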

Phylogenetic Analysis: Exhaustive Tree Search

All possible trees are calculated according to the chosen optimality criterion.

Theoretically good: Safest method to find the best tree!
Problem: Computationally intense! Impossible to do with a lot of taxa.

e.g. rooted bifurcating trees:

6 taxa = 945 trees
10 taxa = 34,459,425 trees
15 taxa = 213,458,046,676,875 trees
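The numbers above follow from the formula for rooted bifurcating trees of n taxa, (2n-3)!! = 1 x 3 x 5 x ... x (2n-3). A short sketch (the function name is hypothetical) reproduces them and shows how quickly the tree space grows:

```python
def rooted_trees(n_taxa):
    """Number of rooted bifurcating trees: (2n-3)!! = 1 * 3 * 5 * ... * (2n-3)."""
    count = 1
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count

for n in (6, 10, 15, 20):
    print(f"{n:2d} taxa = {rooted_trees(n):,} trees")
```

The number of unrooted bifurcating trees of n taxa equals the number of rooted trees of n - 1 taxa.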

Phylogenetic Analysis: Branch-and-Bound Tree Search

Branch-and-bound is a speed-up procedure for exhaustive search. It also considers all possible trees.

The score/distance/likelihood of a randomly generated starting tree is calculated and used as a threshold. All trees that are already worse than this threshold during the construction procedure are not finished, but skipped. If a tree turns out to be better, it is used as the new threshold.

Disadvantage: Still too time-consuming for larger data sets.

Phylogenetic Analysis: Heuristic Tree Search (1)

The trees are considered to form a landscape called the tree space. The best trees are on top of the hills, the worst trees are in the valleys.

Heuristic searches start with a random tree, which may be located in a valley, and try to find the best tree, located at the global maximum of the tree space, by rearranging the branches of the starting tree.


1 Starts with a randomly generated tree, which may be in a valley.
2 Local rearrangements by exchanging neighbouring branches optimise the tree. The tree may end up in a local optimum only.
3 Global rearrangements help to cross the valley and find another hill.
4 The new tree is again rearranged by small exchanges to climb up the hill to the top. If this is the global optimum, further global rearrangements will not improve the tree.

[Figure: tree space landscape with the positions 1 to 4 of the search marked]


Tree Rearrangement Methods

Nearest-Neighbour Interchange (NNI) = Adjacent branches are rearranged.

Subtree Pruning and Regrafting (SPR) = A branch with a subtree is removed from a tree and added between two nodes somewhere else in the tree (= one new tree).

Tree Bisection and Reconnection (TBR) = A tree is split into two subtrees and both parts are connected between all possible nodes of the other (= several new trees are considered).

Phylogenetic Analysis: Neighbor-Joining (1)

Preferred method to infer trees from distance matrices. Belongs to the clustering methods.

Computation steps:

1) Calculate the net divergence of each taxon from the others, and compute a corrected distance for further use.
2) Start with a star-like tree (belongs to the star-decomposition methods).
3) Join the two taxa with the lowest divergence.
4) Recalculate the distance matrix by treating the joined taxa as one.
5) Repeat steps 3 to 4 until all taxa are joined and the tree is resolved.
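Steps 1 and 3 can be sketched as follows. The sketch assumes the commonly used form of the corrected distance, Q(i,j) = (n-2) d(i,j) - r(i) - r(j), where r(i) is the net divergence (sum of all distances) of taxon i; the five-taxon distance matrix and the function name are hypothetical, and only the selection of the first pair is shown.

```python
def nj_pair(names, dist):
    """Return (Q value, taxon a, taxon b) for the pair that neighbor-joining joins first."""
    n = len(names)
    # Step 1: net divergence of each taxon from all others.
    r = {a: sum(dist[frozenset((a, b))] for b in names if b != a) for a in names}
    best = None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Corrected distance: raw distance adjusted by both net divergences.
            q = (n - 2) * dist[frozenset((a, b))] - r[a] - r[b]
            if best is None or q < best[0]:
                best = (q, a, b)
    return best

# Hypothetical distance matrix for five taxa.
names = ["a", "b", "c", "d", "e"]
raw = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7,
    ("d", "e"): 3,
}
dist = {frozenset(pair): value for pair, value in raw.items()}

print(nj_pair(names, dist))
```

In this example the pair with the smallest raw distance (d and e) is not the one joined first; the correction by the net divergences favours a and b. This correction is what distinguishes neighbor-joining from simple clustering of the closest pair.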

Phylogenetic Analysis: Neighbor-Joining (2)

[Figure: first distance matrix -> corrected distance matrix -> recalculation of distance matrix]

Phylogenetic Analysis: A Comparison of Methods

Maximum Parsimony
- discrete characters
- implicit evolutionary model
- heuristic tree search

Distance Matrix
- continuous characters
- explicit evolutionary model
- neighbor-joining trees
- fastest method
- large data sets

Maximum Likelihood
- discrete characters
- explicit evolutionary model
- heuristic tree search
- robust method
- computationally intense

Phylogenetic Analysis: Paup Commands

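Begin paup;
  set autoclose increase=auto outroot=monophy;
  outgroup 1-4;
  [maximum parsimony: heuristic search with 10 random-addition replicates]
  set crit=p;
  hsear addseq=rand nreps=10;
  savetrees file=pars.tre brlens;
  [evolutionary model settings for the distance and likelihood analyses]
  Lset Base=(0.3315 0.2201 0.2334) Nst=1 Rates=gamma Shape=0.9144 Pinvar=0.3911;
  [distance analysis: neighbor-joining with maximum likelihood distances]
  set crit=d;
  dset dist=ml;
  nj;
  savetrees file=nj.tre brlens;
  [maximum likelihood: heuristic search]
  set crit=l;
  hsear addseq=rand nreps=1;
  savetrees file=ml.tre brlens;
  quit;
End;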

Phylogenetic Analysis: Bayesian Analysis (1)

Also uses likelihoods in its calculations, but is based on a formula introduced by Reverend Bayes and uses posterior probabilities.

Bayesian analysis starts with a set of a priori expectations about evolutionary model, tree topology and branch lengths. By examining the data (the alignment), the posterior probabilities of the hypotheses given the data are calculated using the Bayes formula.
Since it is impossible to compute the complete joint posterior probability distribution of trees and evolutionary model parameters (a landscape with hills and valleys), samples are drawn using a Metropolis-coupled Markov chain Monte Carlo method.


1) Initialization of the Markov chain with a random tree and random evolutionary parameters -> calculation of probability
2) Proposal of a new state of the chain with one changed parameter (topology, branch length or evolutionary model) -> calculation of probability
3) If P(Tnew)/P(Told) >= 1 -> the new state is accepted; if the ratio is < 1, a random number decides.

This corresponds to one generation of the Markov chain. A chain is run over several thousands to millions of generations. Every 100th generation a tree and its parameters are sampled and saved to files. After a while, the chain starts circling around a probability optimum comprising the best trees.
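The acceptance rule of step 3 can be sketched as a small Metropolis sampler. This is a toy illustration, not the MrBayes implementation: instead of trees, the chain walks over five hypothetical states with made-up (unnormalised) probabilities, and a proposal simply moves to a neighbouring state.

```python
import random

def accept(p_new, p_old, rng):
    """Metropolis rule: always accept improvements, otherwise accept with probability p_new/p_old."""
    ratio = p_new / p_old
    if ratio >= 1.0:
        return True
    return rng.random() < ratio   # "a random number decides"

def run_chain(weights, ngen, samplefreq, rng):
    """Run one chain over the states 0..len(weights)-1 and sample every samplefreq-th generation."""
    state = rng.randrange(len(weights))            # random starting state
    samples = []
    for generation in range(1, ngen + 1):
        proposal = (state + rng.choice((-1, 1))) % len(weights)
        if accept(weights[proposal], weights[state], rng):
            state = proposal
        if generation % samplefreq == 0:
            samples.append(state)
    return samples

# Hypothetical unnormalised probabilities of five states; state 2 is the optimum.
weights = [1.0, 2.0, 8.0, 2.0, 1.0]
samples = run_chain(weights, ngen=10000, samplefreq=100, rng=random.Random(1))
print({state: samples.count(state) for state in range(len(weights))})
```

States with a higher probability are visited, and therefore sampled, more often; in a real analysis the sampled frequencies of tree topologies are what is reported as posterior probabilities.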

[Figure: probability landscape with the cold chain C and the heated chains H1, H2 and H3]


Since the Markov chain may get stuck in a local maximum, three so-called heated chains (H1 to H3) are initialized in addition to this cold chain. These chains have lower thresholds, so that they can jump over valleys more easily. From time to time a heated chain exchanges parameters with the cold chain (= Metropolis-coupled), helping it to find the global optimum.


Maximum Likelihood usually results in one optimal tree (sometimes also two), whereas Bayesian analysis results in a set of optimal trees.

Results of a Bayesian analysis after summarizing of the sampled trees and evolutionary model parameters:

- A tree file listing the trees according to their posterior and cumulative posterior probabilities.
- A list of credibility intervals and mean values for all parameters of the evolutionary model.
- A consensus tree with branch lengths and posterior probabilities indicating support for branches.


Example command block for MrBayes to be attached to the nexus file:

Begin mrbayes;
  set autoclose=yes;
  lset nst=6 rates=invgamma ngammacat=4 covarion=yes;
  mcmcp ngen=3500000 printfreq=1000 samplefreq=100 nchains=4 savebrlens=yes filename=SSU;
  mcmc;
  quit;
End;


Summary of analysis data

sump filename=SSU.p burnin=8000;

sumt filename=SSU.t burnin=8000;

All trees and parameters that were sampled prior to reaching the likelihood plateau (i.e. during the burnin phase before arriving at the global optimum) are excluded from the summaries. The sump command results in a plot showing the likelihood values. If the burnin was not properly removed, the command has to be repeated with a higher burnin value.

Questions?




Trees - Support for Branches: Bootstrap Analysis

In bootstrap analysis, single positions are randomly drawn from the alignment (imagine a lottery) and assembled into a new data set of the same size as the original alignment. As a result, some positions may occur several times, whereas others are excluded.
This lottery is repeated at least 100 times, resulting in at least 100 subsamples of the original alignment.
A phylogenetic analysis is done for each subsample (MP, distance, ML). The results are summarised in a consensus tree.

In a 50 percent majority rule consensus tree, all branches that occur in at least 50% of all bootstrap subsamples are shown, and the bootstrap values are displayed on the branches.

bootstrap values > 95% = significantly supported branches
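The lottery itself is easy to sketch: columns of the alignment are drawn at random with replacement until the new data set has the same length as the original. The function name is hypothetical, and the alignment is the small six-taxon example shown earlier in this tutorial.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Draw alignment columns with replacement; the replicate keeps the original length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    # The same column indices are used for every taxon, so whole columns are resampled.
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}

alignment = {
    "S3694":  "cCCAAGCGTTTCCG",
    "S5981":  "CCCAATCGTTTCCC",
    "S4094":  "CCCAATCGTTTCCG",
    "S3194":  "GCCAATCGTTTCCG",
    "S899":   "CCCAAGCGTTTCCG",
    "S10379": "CCCAATCGTTTCCG",
}

rng = random.Random(42)
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]
for name, seq in replicates[0].items():
    print(f"{name:8s} {seq}")
```

Each of the 100 replicates would then be analysed with the chosen method, and the resulting trees summarised in a majority rule consensus tree.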

Trees - Support for Branches: Posterior Probabilities

The consensus tree resulting from a Bayesian analysis is a 50% majority rule consensus tree, inferred from the sampled trees of the Markov chain.
Similar to the support values of a bootstrap consensus, the posterior probabilities express how many of the sampled trees are found with this topology.

Posterior probabilities are usually higher than bootstrap support values and have been the subject of debate.

Artefacts
Trees: The Long Branch Attraction Artefact (LBA) (1)

[Figure: two trees of the taxa A, B, C and D, labelled "correct tree" and "result of phylogenetic analysis"]

Artefacts
Trees: The Long Branch Attraction Artefact (LBA) (2)

Explanation
Long branches indicate a higher rate of mutations.

a) A high rate of mutations results in multiple reversals to the original character state (= homoplasies).
b) In addition, a high mutation rate causes signal noise blurring the information in the sequences.

Consequence: Homoplasies are erroneously interpreted as indicators for relatedness.
Choice of an inappropriate evolutionary model increases the vulnerability to LBA.

Trees: Potential Indicators for LBA (1)

Tree is ladderised at the root, i.e. most or all long branches emerge successively close to the root of the tree.

Trees: Potential Indicators for LBA (2)

Differing tree topologies depending on whether simple or complex evolutionary models are used.

[Figure: maximum parsimony tree compared with maximum likelihood tree (F81+I+G)]




Preventing/Reducing LBA

1.) Improved Taxon Sampling

Breaking up the long branches by adding related taxa.

more taxa = higher resolution by adding information to variable positions

2.) Improved Choice of Markers

Use several genes with differing evolutionary rates and concatenate them. The influence of the long branches may be broken (also good: use of genes from different genomes, e.g. nuclear, mitochondrial, plastid).

more positions = higher resolution by extending the data matrix

How to Handle Concatenated Data? (1)

Choice of Evolutionary Model

Different genes will most likely need different evolutionary models.
Neither Paup nor Phylip allows for a partitioning of data.
The more genes are included and the more divergent the data are in terms of evolutionary rates, the more likely Modeltest will propose the most complex evolutionary model, GTR+I+G.


Choice of Evolutionary Model: MrBayes allows for a partitioning of data

Begin mrbayes;
  set autoclose=yes;
  log start filename=sum.log;
  charset NM=1-1564;
  charset ITS2=1565-1850;
  charset LSU=1851-2733;
  partition concP2=3:NM,ITS2,LSU;
  set partition=concP2;
  lset applyto=(1) nst=6 nucmodel=4by4 rates=invgamma ngammacat=4 covarion=yes;
  lset applyto=(2) nst=6 nucmodel=4by4 rates=invgamma ngammacat=4 covarion=yes;
  lset applyto=(3) nst=6 nucmodel=4by4 rates=invgamma ngammacat=4 covarion=yes;
  unlink statefreq=(all);
  unlink shape=(all);
  unlink revmat=(all);
  unlink switchrates=(all);
  prset ratepr=variable;
  mcmcp ngen=3500000 printfreq=1000 samplefreq=100 nchains=4 savebrlens=yes filename=conc;
  mcmc;
  quit;
End;


The Likelihood Summation Method

- Run a phylogenetic analysis with each single-gene data set and the concatenated data set.
- Let Paup save the 1000 best trees resulting from each analysis.
- Concatenate the tree files.
- Calculate the likelihood scores for each of the trees with the lscores command in Paup, but use only the single-gene alignments as data matrices.
- Load the scorefiles for each data set into a spreadsheet program and calculate the sum of log likelihoods for each tree.
- Sort the trees according to their summed log likelihood.
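The last two steps (summing and sorting) can also be done in a few lines of Python instead of a spreadsheet. This is only a sketch: the function name and the input layout are assumptions, the numbers are invented for illustration, and the values are assumed to be log likelihoods (change the sign first if your scorefile lists -ln L).

```python
def rank_trees_by_summed_lnl(scores_by_gene):
    """Sum per-gene log likelihoods for each tree and rank the trees.

    scores_by_gene: dict mapping gene name -> list of ln L values,
    one per tree, in the same tree order for every gene (assumed layout).
    Returns a list of (tree_number, summed_lnL), best tree first.
    """
    tree_counts = {len(scores) for scores in scores_by_gene.values()}
    if len(tree_counts) != 1:
        raise ValueError("every gene must score the same set of trees")
    n_trees = tree_counts.pop()
    totals = [
        sum(scores[i] for scores in scores_by_gene.values())
        for i in range(n_trees)
    ]
    # The highest (least negative) summed log likelihood ranks first.
    return sorted(enumerate(totals, start=1), key=lambda t: t[1], reverse=True)


# Invented example values for three trees scored on three single-gene alignments.
scores = {
    "NM": [-5000.0, -5010.0, -5005.0],
    "ITS2": [-1200.0, -1190.0, -1195.0],
    "LSU": [-3000.0, -3002.0, -2990.0],
}
for tree, total in rank_trees_by_summed_lnl(scores):
    print(tree, total)
```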




Testing Tree Topology (1)

Kishino-Hasegawa and Shimodaira-Hasegawa tests are used to test hypothetical user-defined trees by comparison with the optimal tree.

Problem: The tests were designed to compare random trees, but biologists use them to compare trees with the optimal tree.

Testing Tree Topology (2): Consel

Consel
- written by H. Shimodaira
- needs input files with site-wise log likelihoods
- accepts Paup, Molphy and PAML scorefiles
- consists of a suite of programs for different tasks
- C source code; Unix command line or DOS console
- performs:
    - Approximately unbiased test
    - Kishino-Hasegawa test (unweighted/weighted)
    - Shimodaira-Hasegawa test (unweighted/weighted)
    - bootstrap probabilities
    - posterior probabilities

Testing Tree Topology (3): Consel and Paup

Using Consel in Combination with Paup

1) Construct constraints to test a hypothesis.
2) Infer the optimal constraint trees with Paup (ML).
3) Concatenate all treefiles that are supposed to be subjected to the test.
4) Let Paup calculate the total log likelihood and the site-wise log likelihoods and save the data to a scorefile.
5) Use a text editor to delete superfluous data (all parameters of the evolutionary model).
6) Feed Consel with the scorefile.

Testing Tree Topology (4): Consel Commands

1) Generate bootstrap subsamples

makermt --paup scorefile.txt

2) Perform the test

consel scorefile

3) Generate an output file with the test results

catpv scorefile

Testing Tree Topology (5): How Does Consel Work?

makermt
Generates multiscale bootstrap subsamples from the site-wise log likelihoods in the scorefile.
- Bootstrap samples from site-wise log likelihoods = the RELL method
- Multiscale bootstrap = sizes of subsamples differ from the original data set
By default, makermt generates 10 sets of replicates, each with 10,000 subsamples (0.5-, 0.6-, 0.7-, 0.8-, 0.9-, 1.1-, 1.2-, 1.3-, 1.4- and 1.5-fold the size of the original data set), and an additional set with 10,000 subsamples of 1.0-fold size.

consel
Calculates the probabilities for KHT, SHT and normal bootstrap using the 10,000 subsamples of 1.0-fold size, and the probabilities for the AUT and the multiscale bootstrap using the multiscale bootstrap samples.
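To make the RELL idea concrete, the following Python sketch resamples site-wise log likelihoods and counts how often each tree comes out best. It is not Consel's implementation and does not compute the AU test; the function name, the parameters and the toy numbers are assumptions for illustration. The `scale` parameter mimics the multiscale idea (subsamples smaller or larger than the original data set).

```python
import random


def rell_bootstrap_support(site_lnl, n_replicates=1000, scale=1.0, seed=0):
    """Estimate bootstrap proportions by resampling site-wise log likelihoods (RELL).

    site_lnl: list with one entry per tree; each entry is a list of
              site-wise log likelihoods (same number of sites for every tree).
    scale:    subsample size relative to the original number of sites.
    Returns the proportion of replicates in which each tree has the
    highest resampled log likelihood (ties go to the first tree).
    """
    rng = random.Random(seed)
    n_trees = len(site_lnl)
    n_sites = len(site_lnl[0])
    sample_size = max(1, round(scale * n_sites))
    wins = [0] * n_trees
    for _ in range(n_replicates):
        # Draw sites with replacement; no tree is re-optimised on the subsample.
        sites = [rng.randrange(n_sites) for _ in range(sample_size)]
        totals = [sum(tree[s] for s in sites) for tree in site_lnl]
        best = max(range(n_trees), key=totals.__getitem__)
        wins[best] += 1
    return [w / n_replicates for w in wins]


# Toy data: tree 1 is better at every site, so it wins every replicate.
toy = [[-1.0] * 50, [-2.0] * 50]
print(rell_bootstrap_support(toy, n_replicates=200, scale=0.5))
```

Because only the stored site-wise values are summed again, each replicate is cheap compared with a full likelihood re-estimation, which is why tens of thousands of replicates are feasible.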

Testing Tree Topology (6): Consel Output

catpv summarises the results of the test (the probability values)

# reading nm.pv
# rank item    obs     au     np |     bp     pp     kh     sh    wkh    wsh |
#    1    1  -15.8  0.957  0.942 |  0.944  1.000  0.936  0.994  0.936  0.999 |
#    2    6   15.8  0.064  0.055 |  0.054  1e-07  0.064  0.415  0.064  0.218 |
#    3    5   35.2  0.005  0.002 |  0.002  5e-16  0.006  0.079  0.006  0.027 |
#    4    4   36.3  0.002  0.001 |  4e-04  2e-16  0.005  0.063  0.005  0.019 |
#    5    7   56.8  7e-05  1e-04 |  2e-04  2e-25  3e-04  0.012  3e-04  0.001 |
#    6    2  129.4  2e-64  5e-21 |      0  6e-57      0      0      0      0 |
#    7    3  142.2  4e-06  5e-06 |      0  2e-62      0      0      0      0 |

rank = tree ranking
item = no. of tree in treefile
obs = log likelihood difference
au = p-values of the approximately unbiased test
np = p-values of the multiscale bootstrap
bp = p-values of the normal bootstrap
pp = posterior probabilities
kh = Kishino-Hasegawa test
sh = Shimodaira-Hasegawa test
wkh, wsh = weighted Kishino-Hasegawa and Shimodaira-Hasegawa tests
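A table like this can be read into Python to list the trees rejected by a given test. The sketch below assumes the column layout shown on the slide (eleven columns, "|" as separators); the function name and the 0.05 threshold are choices made for this example, and the sample rows are copied from the output above.

```python
COLUMNS = ["rank", "item", "obs", "au", "np", "bp", "pp", "kh", "sh", "wkh", "wsh"]


def parse_catpv(text):
    """Parse catpv-style output (layout as on the slide) into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.strip().lstrip("#").replace("|", " ").split()
        # Skip the "reading ..." line, the header line and anything malformed.
        if len(fields) != len(COLUMNS) or not fields[0].isdigit():
            continue
        rank, item = int(fields[0]), int(fields[1])
        values = [float(x) for x in fields[2:]]
        rows.append(dict(zip(COLUMNS, [rank, item] + values)))
    return rows


sample = """# reading nm.pv
# rank item obs au np | bp pp kh sh wkh wsh |
# 1 1 -15.8 0.957 0.942 | 0.944 1.000 0.936 0.994 0.936 0.999 |
# 2 6 15.8 0.064 0.055 | 0.054 1e-07 0.064 0.415 0.064 0.218 |
# 3 5 35.2 0.005 0.002 | 0.002 5e-16 0.006 0.079 0.006 0.027 |"""

rows = parse_catpv(sample)
rejected_by_au = [row["item"] for row in rows if row["au"] < 0.05]
print(rejected_by_au)
```

In the three sample rows, tree 6 (au = 0.064) is not rejected at the 5 % level by the AU test, whereas tree 5 (au = 0.005) is.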

Phylogeny With Protein Sequences (1)

Due to the degeneracy of the genetic code, codon usage may be biased! This bias may apply not only to the third codon position, but also to the first and second.

Presumably it would be better to use protein sequences instead, but:

Protein alignments have 20 character states instead of 4 -> analyses, especially ML analyses, take much longer!

Using nucleotide data, but considering nonsynonymous/synonymous substitutions and/or translating during analysis -> this is also quite time-consuming!


Maximum likelihood analyses of protein sequences are usually based on substitution matrices derived from empirical data instead of estimating the substitution rate matrix from the data set.

e.g.
- Dayhoff (Dayhoff et al. 1978)
- JTT (Jones, Taylor, Thornton 1992)
- WAG (Whelan, Goldman 2001)

Paup: only limited possibilities, no substitution matrices included; no maximum likelihood

Programs for phylogenetic analyses of protein sequences

- Phylip (text-based menu): phylip format; maximum likelihood; PAM, JTT, PMB
- Tree-Puzzle (text-based menu): phylip format; maximum likelihood with quartet puzzling; Dayhoff, JTT, WAG, VT etc.
- PAML (Unixoids, DOS console): phylip format; maximum likelihood; Dayhoff, JTT, WAG etc.
- (Molphy [Unixoids]: phylip format; maximum likelihood; Dayhoff, JTT)
- MrBayes: Bayesian analysis



One possibility:

Calculate a tree and gamma categories using Tree-Puzzle 5.2 (Schmidt, Strimmer and von Haeseler 2004).

Use the gamma category estimates as settings to perform a maximum likelihood analysis with proml from the Phylip 3.62 package (Joe Felsenstein 2004).


Text-based menu of Tree-Puzzle 5.2

GENERAL OPTIONS
 b                     Type of analysis?  Tree reconstruction
 k                Tree search procedure?  Quartet puzzling
 v       Approximate quartet likelihood?  Yes
 u             List unresolved quartets?  No
 n             Number of puzzling steps?  1000
 j             List puzzling step trees?  No
 o                  Display as outgroup?  Chondrus (1)
 z     Compute clocklike branch lengths?  No
 e                  Parameter estimates?  Approximate (faster)
 x            Parameter estimation uses?  Neighbor-joining tree
SUBSTITUTION PROCESS
 d          Type of sequence input data?  Auto: Amino acids
 m                Model of substitution?  Auto: JTT (Jones et al. 1992)
 f               Amino acid frequencies?  Estimate from data set
RATE HETEROGENEITY
 w          Model of rate heterogeneity?  Uniform rate

Quit [q], confirm [y], or change [menu] settings:


Phylip 3.62

suite of 30 programs

e.g. ML bootstrapping:
a) run seqboot = generate bootstrap samples from the data set
b) run dnaml or proml (depending on the data set)
c) run consense to create a consensus tree
d) the consensus treefile may be loaded into a tree-displaying program
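What step a) does to the alignment can be sketched in a few lines of Python: columns are drawn with replacement until a pseudo-replicate of the original length is obtained. This is an illustration of the bootstrap principle only, not seqboot's code; the function name and the toy alignment are assumptions.

```python
import random


def bootstrap_alignment(alignment, seed=None):
    """Return one bootstrap pseudo-replicate of an alignment.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    Columns are sampled with replacement, so every taxon receives the
    same resampled columns and the alignment stays consistent.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    n_sites = lengths.pop()
    rng = random.Random(seed)
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


# Toy alignment (invented) with three taxa and eight positions.
toy = {"taxonA": "ACGTACGT", "taxonB": "ACGAACGT", "taxonC": "ACTTACGA"}
replicate = bootstrap_alignment(toy, seed=1)
for name, seq in replicate.items():
    print(name, seq)
```

Each replicate is then analysed like the original data set (step b), and the resulting trees are summarised in a consensus tree (step c).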

Molecular Clock

Assumes that sequences evolve at equal evolutionary rates. This is usually not the case and puts the analysis under a constraint.

The molecular clock hypothesis may be tested with a likelihood ratio test, similar to the tests Modeltest uses to compare evolutionary models.

If fossils are available, one may try dating the divergences of lineages.
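The likelihood ratio test for the clock can be sketched as follows. The statistic is twice the difference in log likelihood between the unconstrained and the clock-constrained tree; for n taxa the clock model has n - 2 fewer free branch-length parameters (2n - 3 branch lengths versus n - 1 node heights), which gives the degrees of freedom. The function name and the example numbers are invented for illustration.

```python
def clock_lrt(lnl_clock, lnl_no_clock, n_taxa):
    """Likelihood ratio test statistic for the molecular clock hypothesis.

    lnl_clock:    log likelihood of the tree with the clock enforced
    lnl_no_clock: log likelihood of the same tree without the clock
    n_taxa:       number of taxa in the tree
    Returns (statistic, degrees_of_freedom); compare the statistic with a
    chi-square distribution with that many degrees of freedom.
    """
    if n_taxa < 3:
        raise ValueError("the test needs at least three taxa")
    statistic = 2.0 * (lnl_no_clock - lnl_clock)
    degrees_of_freedom = n_taxa - 2
    return statistic, degrees_of_freedom


# Invented example: 10 taxa, the clock costs 10 log likelihood units.
stat, df = clock_lrt(lnl_clock=-7580.0, lnl_no_clock=-7570.0, n_taxa=10)
print(stat, df)
```

In this invented example the statistic is 20.0 with 8 degrees of freedom, which exceeds the 5 % critical value of the chi-square distribution (about 15.51), so the clock would be rejected.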

Secondary Structure Analyses

may add information to the results (predominantly RNA- coding, ITS or intron regions)

Synapomorphy Analyses

Searching for synapomorphic characters or strings of characters may be useful for systematic purposes

Miscellaneous

Questions?

That's it!