  • An Introduction to Molecular Phylogeny

    Dr. Kerstin Hoef-Emden
    Universität zu Köln, Botanisches Institut
    Gyrhofstr. 15, 50931 Köln

  • What is molecular phylogeny?

    phylon = Greek for stem
    genesis = Greek for origin

    molecular phylogeny = studying relationships among organisms using molecular markers (e.g. DNA or protein sequences)

    dissimilarities among sequences = genetic divergence caused by mutations during the course of time

  • Molecular Phylogenetic Methods

    - their accuracy can be tested in in silico simulations

    - are based on assumptions about the processes of molecular evolution

    - may be computationally intense (this refers more to CPU time than to memory)

    - may be sensitive to artefacts

    - usually results are displayed as trees

  • Accuracy of Molecular Phylogenetic Methods

    consistency = Does a method reconstruct the correct tree given an infinite amount of data? (All methods do, if assumptions are not violated.)

    efficiency = How quickly does a method converge to the correct tree with a finite amount of data? (The less data is needed to infer the correct tree, the more efficient the method.)

    robustness = How well does a method perform if the assumptions about the evolutionary process are violated?

  • How to test phylogenetic methods?

    e.g. by simulation in silico (= in the computer)

    a) Simulate the evolution of a randomly chosen DNA or protein sequence under a given evolutionary model and tree topology into several lineages.
    b) Use the phylogenetic method under test to infer a phylogenetic tree.
    c) Does the resulting phylogenetic tree correspond to the true tree?
    d) Modify the tree topology to different extremes in branch lengths and repeat the test.
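    Step a) of this recipe can be sketched in a few lines of Python. This is a minimal sketch, assuming the Jukes-Cantor model and a fixed four-taxon true tree ((A,B),(C,D)); the function names `evolve` and `simulate_four_taxa` and all branch lengths are hypothetical and not part of any phylogeny program.

```python
import math
import random

BASES = "ACGT"

def evolve(seq, branch_length, rng):
    """Evolve a sequence along one branch under the Jukes-Cantor model.

    branch_length is the expected number of substitutions per site.
    """
    # Probability that a site ends the branch in a different state.
    p_change = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate_four_taxa(length, internal, terminal, seed=0):
    """Simulate the true tree ((A,B),(C,D)) starting from a random root sequence."""
    rng = random.Random(seed)
    root = "".join(rng.choice(BASES) for _ in range(length))
    left = evolve(root, internal / 2, rng)
    right = evolve(root, internal / 2, rng)
    return {
        "A": evolve(left, terminal, rng),
        "B": evolve(left, terminal, rng),
        "C": evolve(right, terminal, rng),
        "D": evolve(right, terminal, rng),
    }
```

    The simulated alignment would then be passed to the method under test (step b), and the branch lengths varied for step d).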

  • Phylogenetic Methods and Real Life Sequences

    - The true tree is unknown; each inferred tree represents a hypothesis.
    - No infinite amounts of data are available (no nuclei with infinite space, which contain infinite amounts of DNA).
    - By using robust and efficient methods and appropriate evolutionary models, the inferred trees hopefully converge to the real phylogeny as closely as possible.
    - The simulation studies give some hints about potential vulnerabilities of the phylogenetic methods.

  • Trees: Nomenclature

    terminal branch
    internal branch

    terminal node = operational taxonomic unit (OTU) = contemporary taxon

    internal node = unknown ancestor = extinct taxon

    mathematics: branch = edge; node = vertex (plural: vertices)

  • Tree Types: Unscaled Trees

    [Figure: slanted cladogram and rectangular cladogram]

    disadvantage: no information about evolutionary rates in a tree

  • Tree Types: Scaled Trees

    [Figure: rooted phenogram and unrooted phenogram]

    unrooted phenogram:
    disadvantage: direction of evolution is unknown
    advantage: higher resolution

  • Treefile Formats

    #NEXUS
    Begin trees; [Treefile saved Thu Sep 16 03:12:19 2004]
    Translate
      1 MPorph,
      2 S11679,
      3 CCMP736,
      4 MCont316,
      5 S1382,
      [...]
      21 UTEX637
      ;
    tree PAUP_1 = [&U] (1:0.096043,(((((2:0.041968,12:0.011298):0.014339,(13:0,20:0):0.012408):0.012250,3:0.188535):0.027987,(4:0.107423,5:0.115320):0.013260,((((14:0.001531,15:0):0.038667,19:0.005060):0.000989,18:0.018300):0.015597,(17:0.006567,21:0.029673):0.017735):0.009011):0.011165,(((6:0.021826,9:0.009905):0.002602,(7:0.014488,11:0.066979):0.005662):0.014995,(8:0.046098,10:0.079671):0.010953):0.009027):0.020750,16:0.048813);
    End;

    Phylip (Newick)

    (MPorph:0.094736,(((((S11679:0.041475,U1424:0.011216):0.014077,(M1712:0,S10379:0):0.012338):0.011839,CCMP736:0.183396):0.027333,(MCont316:0.106050,S1382:0.117050):0.013282,((((S3794:0.001532,S3694:0):0.038496,S899:0.004933):0.001127,S3194:0.018102):0.015038,(S4094:0.006579,UTEX637:0.029760):0.018250):0.009256):0.011383,(((Chondrus:0.021641,S13531a:0.009722):0.002405,(S4194a:0.014451,S1896:0.065892):0.005690):0.014595,(S4194b:0.046357,S13531b:0.078882):0.010916):0.008914):0.020599,S5981:0.048038);
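    To illustrate how the nested parentheses of a Newick string encode a tree, here is a minimal recursive parser in Python. It is a sketch under simplifying assumptions: no whitespace, no quoted labels or comments, and the string ends with a semicolon. The names `parse_newick` and `leaf_names` are hypothetical helpers, not functions of Paup or Phylip.

```python
def parse_newick(text):
    """Parse a Newick string into nested (name, branch_length, children) tuples."""
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip "("
            children.append(node())
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(node())
            pos += 1  # skip ")"
        # Optional node label.
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        # Optional branch length after ":".
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    return node()

def leaf_names(tree):
    """Collect the terminal node labels (OTUs) from left to right."""
    name, _, children = tree
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]
```

    For example, `leaf_names(parse_newick("(A:0.1,(B:0.2,C:0.3):0.05);"))` lists the three terminal nodes of a small tree.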

  • Displaying Trees

    Paup for MacOS 9: graphical output to screen, file or printer; nexus or Newick format (unscaled, scaled, rooted and unrooted trees)
    Phylip: treefile to graphics converter; Newick format (unscaled, scaled, rooted and unrooted trees)
    Paup for Windows or portable format (Unixoids): auxiliary program necessary

    - Phylip converter programs
    - TreeView
        for MacOS 9 (and Windows?): all tree types and tree editing; nexus and Newick
        for Unixoids: no unrooted trees, no tree editing; nexus and Newick
    - TreeEdit (MacOS), Treetool etc.

  • Phylogeny Programs (1)

    General purpose
    Paup 4b: Windows, MacOS 9, Unixoids (Linux, Solaris, MacOS X etc.)
    Phylip 3.62: Windows, MacOS 8, 9, X, Linux, C sources

    Bayesian Analyses
    MrBayes 3: Windows, MacOS, C sources (Unixoids)

    Links to phylogeny-related software collected by Joe Felsenstein: http://evolution.genetics.washington.edu/phylip/software.html

  • Phylogeny Programs (2)

    Paup* 4b10 = Phylogenetic Analysis Using Parsimony (* and other methods)

    Written by David Swofford. The first versions up to Paup 3 were available for MacOS < 9 only and focused on the parsimony method. Paup 4 is available for different OS and is one of the most powerful tools for phylogenetic analyses of nucleotide sequences. Sold as a beta version, but more stable than some final versions of other commercial software.

    MacOS 9 PPC: graphical user interface (i.e. mouse driven) and graphical output of trees
    All others (Windows, Unixoids): command line (can be submitted to batch queues, unfortunately no checkpointing)
    Distributor in Europe: Palgrave-MacMillan, UK (Windows and MacOS 9 PPC; GBP 62/72)
    Distributor in USA (and for portable versions): Sinauer Associates (USD 85-150)

  • Phylogeny Programs (3)

    Phylip 3.62 = Phylogenetic Inference Package

    Written by Joe Felsenstein. Multiple-purpose package for nucleotide as well as protein sequences. Approx. 30 different programs to fulfil different tasks.
    Freely available over the internet (http://evolution.gs.washington.edu/phylip.html).

    Precompiled for: MacOS 8/9 PPC, MacOS X, Windows, Red Hat Linux (i386)
    C sources for all other Unixoids
    User interface: text-based menu system

  • Phylogeny Programs (4)

    MrBayes 3

    Written by John Huelsenbeck and Fredrik Ronquist. Specialised in Bayesian analyses. Handles nucleotide as well as protein sequences. Partitioned computation of concatenated data sets.
    Freely available over the internet (http://morphbank.ebc.uu.se/mrbayes/)

    Runs under MacOS X, Windows, Unixoids
    C sources
    User interface: command-line driven; syntax similar to Paup

    Parallelised version available; no checkpointing.

  • Phylogeny Programs (5)

    Input Formats

    Paup: nexus format (interleaved or sequential)
    Phylip: phylip format
    MrBayes: nexus format (interleaved or sequential)

  • Phylogeny Programs (6)

    Nexus Format

    #NEXUS
    BEGIN TAXA;
      DIMENSIONS NTAX=6;
      TAXLABELS 'S3694' 'S5981' 'S4094' 'S3194' 'S899' 'S10379' ;
    END;
    BEGIN CHARACTERS;
      DIMENSIONS NCHAR=14;
      FORMAT DATATYPE=NUCLEOTIDE GAP=- ;
      MATRIX
      [1] 'S3694'  cCCAAGCGTTTCCG
      [2] 'S5981'  CCCAATCGTTTCCC
      [3] 'S4094'  CCCAATCGTTTCCG
      [4] 'S3194'  GCCAATCGTTTCCG
      [5] 'S899'   CCCAAGCGTTTCCG
      [6] 'S10379' CCCAATCGTTTCCG
      ;
    END;

    Phylip Format

    6 14
    S3694     cCCAAGCGTTTCCG
    S5981     CCCAATCGTTTCCC
    S4094     CCCAATCGTTTCCG
    S3194     GCCAATCGTTTCCG
    S899      CCCAAGCGTTTCCG
    S10379    CCCAATCGTTTCCG
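    The sequential phylip layout (a header with the number of taxa and characters, then one name and sequence per line) is simple enough to write by hand. The following Python sketch assumes the classic convention of padding names to 10 characters; `to_phylip` is a hypothetical helper, not part of the Phylip package.

```python
def to_phylip(sequences):
    """Format a dict {name: aligned sequence} as sequential phylip text."""
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    # Header line: number of taxa and number of characters.
    lines = [f"{len(sequences)} {lengths.pop()}"]
    for name, seq in sequences.items():
        # Names are truncated or padded to exactly 10 characters.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines)
```

    Calling it with the first two taxa of the example yields a header line `2 14` followed by the two padded rows.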

  • Work-Flow

    aim: group of organisms or gene family
    -> choice of molecular marker(s) and taxon sampling
    -> amplification/sequencing
    -> alignment
    -> choice of evolutionary model
    -> phylogenetic analyses
    -> tree(s)
    -> results

    side branch: user-defined trees and topology testing
    feedback arrow labelled "improvement of"

  • Taxon Sampling

    Strategy for an initial Taxon Sampling

    - the diversity of the group should be represented (guessing by looking at the phenotype or using the systematics of the group), e.g. combinations of morphological characters or representatives of all species/genera of a group or serotypes or ...
    - at least two representatives of each presumed clade (guessing)
    - not too few taxa (> 15)
    - outgroup taxa (closest related sister group only!)

  • Choice of Molecular Marker(s): Phylogenies of Gene Families (1)

    All orthologues and paralogues (or alleles) of a gene in an organism have to be sequenced!

    Why?

  • Choice of Molecular Marker(s): Phylogenies of Gene Families (2)

    A very fictitious example for a weird tree caused by an incomplete sampling of a gene family (taxon sampling also not recommended).

    [Figure: trees with the taxa Homo, Drosophila, Arabidopsis and Chlamydomonas; panel labels: "correct tree" and "very bad tree!"]

  • Choice of Molecular Marker(s): Phylogenies of Organisms (1)

    - choose single-copy genes (protein-coding) or highly synchronised genes (ribosomal DNA)
    - choose more variable genes for closely related organisms and conserved genes for more distantly related organisms
    - in sexually reproducing organisms, two alleles may occur

  • Choice of Molecular Marker(s): Phylogenies of Organisms (2)

    e.g. the eukaryotic ribosomal operon

    [Figure: SSU rDNA, ITS1, 5.8S rDNA, ITS2, LSU rDNA]

    conserved (rDNA genes) = potentially suited for phylogenies of genera or higher level taxa

    highly variable (ITS regions) = potentially suited for phylogenies of species or lower level taxa

  • Choice of Molecular Marker(s): Phylogenies of Organisms (3)

    some examples for more conserved genes:
    actin
    elongation factor 1 (EF-1)
    rbcL
    tubulins
    and lots more ...

  • DNA Amplification (1)

    [Flow chart with the elements: genomic DNA, PCR, mRNA, RT-PCR, cDNA, cloning, template for sequencing]

  • DNA Amplification (2)

                  advantage                    disadvantage
    genomic DNA   introns (add information)    introns (splice sites?)
    cDNA          no introns (just ORF)        no introns

  • DNA Amplification (3)

    Taq polymerase
    no proofreading = introduces reading errors (predominantly transitions)
    direct sequencing (large template pool) -> usually no problem

    cloning of PCR products -> problem!
    solutions:
    a) proofreading polymerase instead of Taq
    b) sequence more than 2 clones (better more than three; only an option if no allelic variation can be expected!)

  • Sequencing

    Reduce/avoid sequencing errors

    - sequencing of forward and reverse strands
    - thorough proofreading
        ribosomal RNA sequences: secondary structure
        protein sequences: translation (Stop codons?)
    - BLAST search (PCR contamination, chimaeric sequences?)

  • Alignment (1)

    automatic alignment
    - good as a starting point for an alignment
    - not good if sequences contain a lot of indels and highly variable regions (i.e. non-coding regions such as ITS or intron sequences or variable regions in ribosomal RNA sequences)

    proteins: check alignment afterwards by eye
    ribosomal RNA, intron and ITS sequences: manual editing is always needed

  • Alignment (2)

    Second round of proofreading

    Unusual amino acids in the translated sequence?
    Deviations in a highly conserved region of ribosomal RNA?
    One G whereas all others have two in a highly conserved region?

    -> back to the assembly data and cross-checking (it may be true, though!)

  • Alignment (3)

    The alignment is the very basis of the phylogenetic analyses. Software cannot differentiate between a real mutation and a sequencing or alignment error.

    Effects of an Erroneous Alignment

    Decreased resolution.
    Worst case: artefactual tree topology

  • Alignment (4)

    Preparation of the Alignment for the Phylogenetic Analyses

    Exclusion of nonalignable regions and saving of the data set in nexus (PAUP) or phylip (Phylip, PAML, Molphy) format, depending on the software used for phylogenetic analyses.
    In protein-coding sequences: perhaps excluding the third codon position

  • Alignment (5)

    e.g. ribosomal DNA

    CCMP152 TAGGAAATCTAGAGCTAATACATGCACCATCGCTCTAATTTGATATTTT--------
    M1303   TAGGAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTGTGTTTAGTT-------
    M2180   TAGGAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTGTGTTTAGTT-------
    S9772e  TAGGAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTTACAATATCTAA-----
    S9772b  TAGgAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTATATATATTTGT-----
    C9772a  TAGgAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTATATATATTTGT-----
    M1703   TAGGAATTCTAGAGCTAATACATGCACCAGTGCCCTTAGTTTATTCTTTTTTAAGACA
    M1481   TAGGAATTCTAGAGCTAATACATGCACCATCGCTTTTTTTTCTTTTTTCTTTTTTCTT
    SB9801  TAGGAATTCTAGAGCTAATACATGCACCATCGTTTTTCTTGACAGGAAGGAAGAAAAA
    M1318   TAGGAATTCTAGAGCTAATACATGCACCAGTGCCCTTAGTTTATTCTTTTTTAAGACA
    M1312   TAGgAATTCTAGAGCTAATACATGCACCATAGCCTTTTGTAATTTTTTTTAAAGTTTT
    Hruf    TAGGAATTCTAGAGCTAATACATGCCCCATCGCTTTCGAAGTTTTTTAATTTTTTTTC

    mask    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxxx

    CTC = discussion zone between alignable and nonalignable regions
    TTT = highly variable region -> exclude from analyses
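    How a mask line excludes columns from an alignment can be sketched in Python as follows. The sketch assumes that an uppercase `X` marks a column to keep and that every other mask character (including the lowercase `x` of the discussion zone and blanks) marks a column to exclude; this reading of the mask and the helper name `apply_mask` are assumptions made for illustration.

```python
def apply_mask(alignment, mask, keep="X"):
    """Keep only the alignment columns whose mask character equals `keep`."""
    columns = [i for i, flag in enumerate(mask) if flag == keep]
    return {
        name: "".join(seq[i] for i in columns if i < len(seq))
        for name, seq in alignment.items()
    }
```

    Passing `keep="x"` instead would extract only the discussion zone, for example to inspect it separately before deciding whether to include it.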

  • Alignment (6)

    Properties of an alignment editor suited for phylogenetic purposes

    - multiple sequence alignment
    - limits of sequence number and length sufficiently high?
    - manual editing
    - protected mode (no deletion of nucleotides)
    - several import/export formats (e.g. clustal for automated pre-alignment)
    - for phylogenetic analyses: nexus/phylip format export
    - definition of a mask to exclude non-alignable regions

  • Alignment (7)

    Automatic Alignment Tools

    Clustal W and relatives
    T-Coffee
    Malign
    TreeAlign
    PileUp
    and others ...

    Manual Alignment

    Align
    BioEdit
    SeaView
    SeAl
    ARB (RNA-coding DNA)
    DCSE (RNA-coding DNA)
    SeqLab (GCG)
    MacVector
    and others ...

  • Questions?

  • Evolutionary Models (1)

    Evolutionary models consist of a combination of the following parameters:

    - base frequencies
    - substitution rate matrix
    - proportion of invariable sites
    - gamma-distributed among-site rate variation
    - (covarion/covariotide)

  • Evolutionary Models (2)

    base frequencies
    percentages of A, C, G or T in the alignment; can be set to:
    - equal (0.25 for each nt)
    - empirical (= computed from the alignment)
    - estimate (= optimised as a likelihood parameter)
    - manually set

    homogeneity of base frequencies among taxa:
    The program Paup performs a chi-square test for biased base frequencies (command line: basefreqs).

  • Evolutionary Models (3)

    substitution rate matrix
    assumptions about substitution rates of point mutations

    Jukes-Cantor model: All point mutations occur at the same rate (number of substitution types [nst] = 1).
    Hasegawa-Kishino-Yano and Kimura-2-parameter: Differing rates for transitions and transversions (nst=2).
    General time reversible model (GTR): Each type of substitution has a different substitution rate; reversals are considered equally likely (nst=6).

  • Evolutionary Models (4)

    substitution rate matrix

    nst=2

        A C G T
    A   - v i v
    C   v - v i
    G   i v - v
    T   v i v -

    i = transition
    v = transversion

    nst=6

        A C G T
    A   - a b c
    C   a - d e
    G   b d - f
    T   c e f -

    equal rates for both directions

    nst=6 (but 3 rate classes)

        A C G T
    A   - a b a
    C   a - a e
    G   b a - a
    T   a e a -

    the Tamura-Nei model
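    The three tables follow from a single rule: six rate labels, one per unordered pair of nucleotides, are mirrored across the diagonal. The Python sketch below makes this explicit. The pair order AC, AG, AT, CG, CT, GT is an assumption chosen because it reproduces the tables, and `rate_class_matrix` is a hypothetical helper name.

```python
def rate_class_matrix(rclass):
    """Expand six rate classes (AC AG AT CG CT GT) into a symmetric 4x4 table."""
    bases = "ACGT"
    pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    table = [["-"] * 4 for _ in bases]
    for (i, j), label in zip(pairs, rclass):
        table[i][j] = label
        table[j][i] = label  # equal rates for both directions
    return table
```

    With this convention, `"vivviv"` gives the nst=2 table, `"abcdef"` the general nst=6 table and `"abaaea"` the Tamura-Nei table with three rate classes.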

  • Evolutionary Models (5)

    proportion of invariable sites and gamma-distributed among-site rate variation

    DNA sequences do not evolve at the same rate in all positions.

    protein-coding sequences: faster rates at the third position
    ribosomal DNA, internal transcribed spacers: alternating pattern of more conserved and highly variable regions correlating with secondary structure (helices and unpaired regions).

  • Evolutionary Models (6)

    protein-coding genes: degenerate code

    codon position   123 123 123
    DNA              TCA CGA GTA
                     TCC CGC GTC
                     TCG CGG GTG
                     TCT CGT GTT
    protein          Ser Arg Val

    ribosomal DNA

    [Figure: RNA secondary structure (helix with paired bases U-G, A-U, C-G, C-G and unpaired loop) and the corresponding DNA sequence TACCATGAAAAAGTGGAC; red = highly variable positions]

  • Evolutionary Models (7)

    proportion of invariable sites = proportion of positions, which do not evolve

    gamma-distributed among-site rate variation = nucleotides evolve at differing rates in differing positions; modelled by the shape parameter α

    both parameters may be used separately or can be combined

  • Evolutionary Models (8)

    gamma-distributed among-site rate variation

    [Figure: continuous gamma distribution and discrete gamma distribution; x-axis: substitution rate; y-axis: proportion of sites; curves for different shape parameter values, among them 0.25 and 1 ~ 10]

  • Evolutionary Models (9)

    Covarion/Covariotide

    Individual sequences or lineages evolve faster than others. = not evolving according to a molecular clock.

    Not implemented in most phylogeny programs (MrBayes is an exception).

  • Evolutionary Models (10)

    Example for a command line in Paup 4b to calculate the parameters of a Tamura-Nei model with unequal base frequencies, proportion of invariable sites and gamma distribution (a tree has to be available in memory)

    lscores 1/ nst=6 basefre=est rmat=est rclass=(a b a a e a) pinv=est rate=gam shape=est;

  • Evolutionary Models (11)

    different combinations of

    base frequencies
    + substitution rate matrix
    + proportion of invariable sites/gamma distribution

    = 56 evolutionary models in the program Paup 4b

    How to decide which model best fits a data set?

  • Choice of Evolutionary Model (1)

    The program Modeltest 3.5 (by Posada and Crandall) performs hierarchical likelihood ratio tests (hLRT) and also computes the Akaike information criterion (AIC).

    Modeltest consists of a command file for Paup (modelblock) and an executable (Posada Lab at http://darwin.uvigo.es/)

  • Choice of Evolutionary Model (2)

    Running Modeltest

    1.) Start Paup and load the data set.
    2.) Load the modelblock of Modeltest into Paup with the command execute.
    3.) Paup will follow the commands given in the modelblock: First a tree is constructed using the simplest and fastest method. Then Paup computes the likelihood values for all 56 evolutionary models for the data set given the tree. The likelihood scores of the 56 models are saved in a file called model.scores.
    4.) The Modeltest executable is started and fed with the model.scores file. It performs the hLRT and AIC tests and saves the results to a file.

  • Choice of Evolutionary Model (3)

    Testing models of evolution - Modeltest Version 3.06
    (c) Copyright, 1998-2000 David Posada
    Department of Zoology, Brigham Young University
    WIDB 574, Provo, UT 84602, USA
    _______________________________________________________________

    Wed Sep 15 21:33:13 2004

    Input format: Paup matrix file

    ** Log Likelihood scores **
                             +I          +G          +I+G
    JC    = 3853.2573   3853.2573   3814.9705   3806.2795
    F81   = 3843.7336   3843.7336   3806.0015   3797.3303
    K80   = 3852.4849   3852.4849   3814.0562   3805.3757
    HKY   = 3842.8232   3842.8232   3804.8003   3796.1357
    TrNef = 3852.4060   3852.4060   3813.3804   3804.7378
    TrN   = 3842.6401   3842.6401   3804.7976   3796.1355
    K81   = 3851.4536   3851.4536   3813.0771   3804.3674
    K81uf = 3842.2886   3842.2886   3804.3494   3795.6624
    TIMef = 3851.3740   3851.3740   3812.3914   3803.7239
    TIM   = 3842.1130   3842.1130   3804.3457   3795.6621
    TVMef = 3846.7188   3846.7188   3807.1191   3798.2488
    TVM   = 3840.2319   3840.2319   3802.2058   3793.3215
    SYM   = 3846.6523   3846.6523   3806.4158   3797.5940
    GTR   = 3839.9839   3839.9839   3802.2041   3793.3086

  • ** Hierarchical Likelihood Ratio Tests (hLRTs) **

    Equal base frequencies
      Null model = JC              -lnL0 = 4124.0898
      Alternative model = F81      -lnL1 = 4117.4712
      2(lnL1-lnL0) = 13.2373       df = 3
      P-value = 0.004151
    Ti=Tv
      Null model = F81             -lnL0 = 4117.4712
      Alternative model = HKY      -lnL1 = 4117.0146
      2(lnL1-lnL0) = 0.9131        df = 1
      P-value = 0.339297
    Equal rates among sites
      Null model = F81             -lnL0 = 4117.4712
      Alternative model = F81+G    -lnL1 = 3806.0015
      2(lnL1-lnL0) = 622.9395      df = 1
      Using mixed chi-square distribution
      P-value =

  • Choice of Evolutionary Model (5)

    Model selected: F81+I+G
      -lnL = 3797.3303

    Base frequencies:
      freqA = 0.3045
      freqC = 0.2348
      freqG = 0.2328
      freqT = 0.2279
    Substitution model:
      All rates equal
    Among-site rate variation
      Proportion of invariable sites (I) = 0.4495
      Variable sites (G)
        Gamma distribution shape parameter = 0.6163

    [...]

    BEGIN PAUP;
    Lset Base=(0.3045 0.2348 0.2328) Nst=1 Rates=gamma Shape=0.6163 Pinvar=0.4495;
    END;
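    The arithmetic behind a single likelihood ratio test in the hLRT output can be reproduced with the Python standard library. This is a minimal sketch, not Modeltest's own code: `lrt_statistic` and `chi2_sf` are hypothetical helper names, and the chi-square tail probability is implemented only for 1 and 3 degrees of freedom, where a closed form exists. The mixed chi-square distribution used for the among-site rate test is not covered.

```python
import math

def lrt_statistic(neg_lnl_null, neg_lnl_alt):
    """Likelihood ratio test statistic computed from two -lnL scores."""
    return 2.0 * (neg_lnl_null - neg_lnl_alt)

def chi2_sf(x, df):
    """Upper tail probability of a chi-square distribution (df = 1 or 3 only)."""
    tail = math.erfc(math.sqrt(x / 2.0))
    if df == 1:
        return tail
    if df == 3:
        return tail + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    raise ValueError("only df = 1 and df = 3 are implemented in this sketch")
```

    For the first test (JC against F81), `lrt_statistic(4124.0898, 4117.4712)` gives about 13.237, and `chi2_sf` with 3 degrees of freedom gives a P-value of about 0.0042, so equal base frequencies are rejected.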

  • Questions?

  • Phylogenetic Analysis Methods

    Combination of an Optimality Criterion with a Tree Search Algorithm

    1) Optimality Criterion
    Scoring method to decide which tree is the best
    e.g. maximum parsimony, distance analysis, maximum likelihood

    2) Tree Search Algorithm
    Method to construct a tree
    e.g. exhaustive search, branch-and-bound, heuristic search, quartet puzzling, neighbor-joining

  • Phylogenetic Analysis: Maximum Parsimony

    The tree which requires the fewest mutation steps to explain the nucleotide pattern of an alignment is the best.
    Each point mutation equals one point in the scoring system, thus in unweighted parsimony only integer values are possible as scores.

    best tree = maximum parsimony tree (MPT)
    if several trees are equally scored = equally parsimonious trees (EPT)

    Problem: The evolutionary model is implicit and cannot be adapted to the data set. All mutations at all positions are considered equal, even in more variable regions.

  • Phylogenetic Analysis: Distance Matrix Methods (1)

    All sequences of an alignment are compared pairwise with each other. Each pair is assigned an evolutionary distance value expressing the degree of divergence. The results are listed in a distance matrix, which is used to construct a tree.

    To calculate the distances, one of the 56 different evolutionary models can be chosen, or the estimators of the maximum likelihood method can be used.

    The tree with the shortest sum of distances is the best. Since the distances are not integers, there is usually only one best tree.

  • Phylogenetic Analysis: Distance Matrix Methods (2)

    paup> showdist

    HKY85 distance matrix

                      1        2        3        4        5        6        7
    1 MPorph          -
    2 S11679    0.15819        -
    3 CCMP736   0.16875  0.15158        -
    4 MCont316  0.15953  0.13655  0.18535        -
    5 S1382     0.17529  0.14077  0.18336  0.15227        -
    6 Chondrus  0.10471  0.10205  0.15863  0.12481  0.13192        -
    7 S4194a    0.10967  0.10370  0.15845  0.12654  0.12807  0.03539        -
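    How a single entry of such a matrix arises can be illustrated in Python. The sketch uses the Jukes-Cantor correction instead of the HKY85 model shown in the Paup output, because it has a simple closed form; `p_distance` and `jc_distance` are hypothetical helper names, and columns containing a gap are simply skipped.

```python
import math

def p_distance(seq1, seq2):
    """Proportion of differing sites, ignoring columns with gaps."""
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jc_distance(seq1, seq2):
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    p = p_distance(seq1, seq2)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

    The corrected distance is always at least as large as the observed proportion of differences, because the model accounts for multiple substitutions at the same site.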

  • Phylogenetic Analysis: Maximum Likelihood (1)

    Probabilistic method: Tries to find the tree that maximises the probability of observing the data in the alignment. The likelihood is expressed as its negative natural logarithm (-lnL; the lowest -lnL is the best).

    Computation Steps:
    - A tree is given.
    - For each position in the alignment, the site-wise log likelihood is calculated. This includes all possible combinations of ancestral character states in a tree.
    - The log likelihoods of all positions of the alignment are summed (equivalent to multiplying the site-wise likelihoods) and result in the total log likelihood value.

  • Phylogenetic Analysis: Maximum Likelihood (2)

    Example

    Alignment
    Seq 1 ATTA...
    Seq 2 ACTA...
    Seq 3 CCTA...
    Seq 4 GGTG...
          1234...

    position 1 of 1500 positions: terminal nodes A, A, C and G; the two internal nodes (?, ?) are unknown

    four-taxon tree = 16 possible combinations of ancestral states:

    A-A C-A G-A T-A
    A-C C-C G-C T-C
    A-G C-G G-G T-G
    A-T C-T G-T T-T

    probabilities depend on evolutionary model settings

    site-wise likelihood: sum of the probabilities of all 16 character combinations

    total log likelihood: sum of the 1500 site-wise log likelihoods
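    The summation over the 16 ancestral state combinations can be written out directly in Python. This is a sketch under stated assumptions: the Jukes-Cantor model with equal base frequencies, the topology ((Seq 1, Seq 2), (Seq 3, Seq 4)), and arbitrary illustrative branch lengths. The function names are hypothetical; real programs use a more efficient recursion and optimise the branch lengths.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability that state a is replaced by b along a branch of length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def site_likelihood(tips, terminal=0.1, internal=0.05):
    """Likelihood of one alignment column for the tree ((1,2),(3,4)).

    Sums over all 16 combinations of the two unknown ancestral states.
    """
    t1, t2, t3, t4 = tips
    total = 0.0
    for x, y in product(BASES, repeat=2):
        total += (0.25  # equal base frequencies
                  * jc_prob(x, t1, terminal) * jc_prob(x, t2, terminal)
                  * jc_prob(x, y, internal)
                  * jc_prob(y, t3, terminal) * jc_prob(y, t4, terminal))
    return total

def neg_log_likelihood(columns, **branch_lengths):
    """-lnL of an alignment: the negative sum of the site-wise log likelihoods."""
    return -sum(math.log(site_likelihood(col, **branch_lengths)) for col in columns)
```

    For the first two positions of the example alignment, `neg_log_likelihood(["AACG", "TCCG"])` adds the two site-wise log likelihoods; over all 1500 positions the same sum gives the total -lnL of the tree.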

  • Phylogenetic Analysis: Exhaustive Tree Search

    All possible trees are calculated according to the chosen optimality criterion.

    Theoretically good: Safest method to find the best tree!
    Problem: Computationally intense! Impossible with a lot of taxa.

    e.g. rooted bifurcating trees:

    6 taxa = 945 trees
    10 taxa = 34,459,425 trees
    15 taxa = 213,458,046,676,875 trees
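    These numbers follow from the formula (2n - 3)!! for the number of rooted bifurcating trees with n taxa, i.e. the product of all odd numbers up to 2n - 3. A short Python check (the function name is a hypothetical helper):

```python
def rooted_tree_count(n_taxa):
    """Number of rooted bifurcating trees for n_taxa taxa: (2n - 3)!!."""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3.
    for odd in range(3, 2 * n_taxa - 2, 2):
        count *= odd
    return count
```

    Each additional taxon multiplies the count by the next odd number, which is why exhaustive searches become infeasible so quickly.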

  • Phylogenetic Analysis: Branch-and-Bound Tree Search

    Branch-and-bound is a speed-up procedure for exhaustive search. It also considers all possible trees.

    The score/distance/likelihood of a randomly generated starting tree is calculated and used as a threshold. All trees that are already worse than this threshold during construction procedure are not finished, but skipped. If a tree turns out to be better, it is used as a new threshold.

    Disadvantage: Still too time consuming for larger data sets.

  • Phylogenetic Analysis: Heuristic Tree Search (1)

    The trees are considered to form a landscape called the tree space. The best trees are on top of the hills, the worst trees are in the valleys.

    Heuristic searches start with a random tree, which may be located in a valley, and try to find the best tree located at the global maximum of the tree space by rearranging the branches of the starting tree.

  • Phylogenetic Analysis: Heuristic Tree Search (2)

    1 Starts with a randomly generated tree, which may be in a valley.
    2 Local rearrangements by exchanging neighbouring branches optimise the tree. The tree may end up in a local optimum only.
    3 Global rearrangements help to cross the valley and find another hill.
    4 The new tree is again rearranged by small exchanges to climb up the hill to the top. If this is the global optimum, further global rearrangements will not improve the tree.

    [Figure: tree space landscape with the positions 1 to 4 of the search]

  • Phylogenetic Analysis: Heuristic Tree Search (3)

    Tree Rearrangement Methods

    Nearest-Neighbour Interchange (NNI) = Adjacent branches are rearranged.

    Subtree Pruning and Regrafting (SPR) = A branch with a subtree is removed from a tree and added between two nodes somewhere else in the tree (= one new tree).

    Tree Bisection and Reconnection (TBR) = A tree is split into two subtrees and both parts are connected between all possible nodes of the other (= several new trees are considered).

  • Phylogenetic Analysis: Neighbor-Joining (1)

    Preferred method to infer trees from distance matrices. Belongs to the clustering methods.

    Computation steps:

    1) Calculate the net divergence of each taxon from the others, and compute a corrected distance for further use.
    2) Start with a star-like tree (belongs to the star-decomposition methods).
    3) Join the two taxa with the lowest divergence.
    4) Recalculate the distance matrix by treating the joined taxa as one.
    5) Repeat steps 3 to 4 until all taxa are joined and the tree is resolved.
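    One round of steps 1, 3 and 4 can be sketched in Python as follows. The sketch uses the common formulation of the corrected distance, (n - 2) * d(i,j) minus the net divergences of i and j; the function name `nj_step` is a hypothetical helper, and branch lengths are not computed.

```python
def nj_step(names, dist):
    """One neighbor-joining step on a symmetric distance matrix (list of lists).

    Returns the joined pair, the reduced list of names and the reduced matrix.
    """
    n = len(names)
    net = [sum(row) for row in dist]  # net divergence of each taxon
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Corrected distance: low if i and j are close to each other
            # but far from all other taxa.
            q = (n - 2) * dist[i][j] - net[i] - net[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    rest = [k for k in range(n) if k not in (i, j)]
    # Distance from the new internal node to every remaining taxon.
    new_row = [(dist[i][k] + dist[j][k] - dist[i][j]) / 2 for k in rest]
    new_names = [f"({names[i]},{names[j]})"] + [names[k] for k in rest]
    new_dist = [[0.0] + new_row]
    for a, k in enumerate(rest):
        new_dist.append([new_row[a]] + [dist[k][m] for m in rest])
    return (names[i], names[j]), new_names, new_dist
```

    Calling `nj_step` repeatedly on its own output until only two entries remain corresponds to step 5.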

  • Phylogenetic Analysis: Neighbor- Joining (2)

    first distance matrix

    corrected distance matrix

    recalculation of distance matrix

  • Phylogenetic Analysis: A Comparison of Methods

    Maximum Parsimony
    - discrete characters
    - implicit evolutionary model
    - heuristic tree search

    Distance Matrix
    - continuous characters
    - explicit evolutionary model
    - neighbor-joining trees
    - fastest method
    - large data sets

    Maximum Likelihood
    - discrete characters
    - explicit evolutionary model
    - heuristic tree search
    - robust method
    - computationally intense

  • Phylogenetic Analysis: Paup Commands

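    Begin paup;
      set autoclose increase=auto outroot=monophy;
      outgroup 1-4;
      [maximum parsimony: heuristic search with 10 random addition replicates]
      set crit=p;
      hsear addseq=rand nreps=10;
      savetrees file=pars.tre brlens;
      [evolutionary model settings for distance and likelihood analyses]
      Lset Base=(0.3315 0.2201 0.2334) Nst=1 Rates=gamma Shape=0.9144 Pinvar=0.3911;
      [neighbor-joining on maximum likelihood distances]
      set crit=d;
      dset dist=ml;
      nj;
      savetrees file=nj.tre brlens;
      [maximum likelihood: heuristic search]
      set crit=l;
      hsear addseq=rand nreps=1;
      savetrees file=ml.tre brlens;
      quit;
    End;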

  • Phylogenetic Analysis: Bayesian Analysis (1)

    Also uses likelihoods in its calculations, but is based on a formula introduced by Reverend Bayes and uses posterior probabilities.

    Bayesian analysis starts with a set of a priori expectations about evolutionary model, tree topology and branch lengths. By examining the data (the alignment), the posterior probabilities of the hypotheses given the data are calculated using the Bayes formula.
    Since it is impossible to compute the complete joint posterior probability distribution of trees and evolutionary model parameters (a landscape with hills and valleys), samples are drawn using a Metropolis-coupled Markov chain Monte Carlo method.

  • Phylogenetic Analysis: Bayesian Analysis (2)

    1) initialization of the Markov chain with a random tree and random evolutionary parameters -> calculation of probability
    2) proposal of a new state of the chain with one changed parameter (topology, branch length or evolutionary model) -> calculation of probability
    3) If P(Tnew)/P(Told) ≥ 1 -> the new state is accepted; if the ratio is < 1, a random number decides whether it is accepted.

    This corresponds to one generation of the Markov chain. A chain is run over several thousands to millions of generations. Every 100th generation a tree and its parameters are sampled and saved to files. After a while, the chain starts circling around a probability optimum comprising the best trees.
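    The acceptance rule of step 3 and the sampling of every 100th generation can be sketched with a toy chain in Python. The sketch deliberately replaces trees and model parameters with a single integer state, so it shows only the mechanics of a Metropolis chain, not a phylogenetic analysis; `accept`, `run_chain` and the probability function passed in are hypothetical.

```python
import random

def accept(p_new, p_old, rng):
    """Metropolis rule: always accept improvements, otherwise let chance decide."""
    ratio = p_new / p_old
    return ratio >= 1.0 or rng.random() < ratio

def run_chain(prob, start, generations, sample_every=100, seed=0):
    """Run a toy Markov chain over integer states and return the sampled states."""
    rng = random.Random(seed)
    state, samples = start, []
    for generation in range(1, generations + 1):
        proposal = state + rng.choice((-1, 1))  # change one "parameter"
        if prob(proposal) > 0 and accept(prob(proposal), prob(state), rng):
            state = proposal
        if generation % sample_every == 0:
            samples.append(state)
    return samples
```

    Started far from the optimum, for example with `run_chain(lambda s: 0.5 ** abs(s), 50, 10000)`, the early samples still reflect the starting point, while later samples scatter around the optimum at 0.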

  • Phylogenetic Analysis: Bayesian Analysis (3)

    [Figure: cold chain (C) and heated chains H1, H2 and H3 in the probability landscape]

    Since the Markov chain may end up stuck in a local maximum, three so-called heated chains (H1 to H3) are initialized in addition to this cold chain. These chains have lower thresholds to be able to jump over valleys more easily. From time to time a heated chain exchanges parameters (= Metropolis-coupled) with the cold chain, helping it to find the global optimum.

  • Phylogenetic Analysis: Bayesian Analysis (4)

    Maximum Likelihood usually results in one optimal tree (sometimes also two), whereas Bayesian analysis results in a set of optimal trees.

    Results of a Bayesian analysis after summarizing of the sampled trees and evolutionary model parameters:

    - A tree file listing the trees according to their posterior and accumulative posterior probabilities.
    - A list of credibility intervals and mean values for all parameters of the evolutionary model.
    - A consensus tree with branch lengths and posterior probabilities indicating support for branches.

  • Phylogenetic Analysis: Bayesian Analysis (5)

    Example command block for MrBayes to be attached to the nexus file.

    Begin mrbayes;
      set autoclose=yes;
      lset nst=6 rates=invgamma ngammacat=4 covarion=yes;
      mcmcp ngen=3500000 printfreq=1000 samplefreq=100 nchains=4 savebrlens=yes filename=SSU;
      mcmc;
      quit;
    End;

  • Phylogenetic Analysis: Bayesian Analysis (6)

    Summary of analysis data

    sump filename=SSU.p burnin=8000;

    sumt filename=SSU.t burnin=8000;

    All trees and parameters that were sampled prior to reaching the likelihood plateau (i.e. the burn-in phase before arriving at the global optimum) are excluded from the summaries. The sump command results in a plot showing the likelihood values. If the burn-in was not properly removed, the command has to be repeated with a higher burnin value.

  • Questions?

  • Trees - Support for Branches: Bootstrap Analysis

    In bootstrap analysis, single positions are randomly drawn from the alignment (imagine a lottery) and assembled to a new dataset of the same size as the original alignment. As a result, some positions may occur several times, whereas others are excluded.

    This lottery is repeated at least 100 times, resulting in at least 100 subsamples of the original alignment.

    A phylogenetic analysis (MP, distance, ML) is done for each subsample. The results are summarised in a consensus tree.
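    The lottery can be sketched in Python as drawing alignment columns with replacement. This is a minimal illustration under the assumption that the alignment is held as a dictionary of equally long strings; the function name and the toy sequences are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng=random):
    """Draw columns with replacement to build one bootstrap subsample.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    """
    length = len(next(iter(alignment.values())))
    # The same column index is used for every taxon, so columns stay intact.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


toy = {"A": "ACGTAC", "B": "ACGTTC", "C": "TCGAAC"}  # hypothetical data
replicates = [bootstrap_alignment(toy, random.Random(i)) for i in range(100)]
```

    Each of the 100 replicates would then be analysed separately, and the resulting trees summarised in a consensus tree.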

    In a 50 percent majority rule consensus tree, all groupings that occur in at least 50% of the bootstrap subsamples are shown, and the bootstrap values are displayed on the branches.

    bootstrap values > 95% = significantly supported branches

  • Trees - Support for Branches: Posterior Probabilities

    The consensus tree resulting from a Bayesian analysis is a 50% majority rule consensus tree, inferred from the sampled trees of the Markov chain.

    Similar to the support values of a bootstrap consensus, the posterior probabilities express the fraction of sampled trees in which a given branch is found.

    Posterior probabilities are usually higher than bootstrap support values, and their interpretation has been the subject of debate.
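    Counting how often a branch occurs among the sampled trees can be sketched in Python. The representation is an assumption made for the example: each tree is reduced to a set of clades, and each clade is a frozenset of taxon names. The function names and the four toy trees are hypothetical.

```python
from collections import Counter


def clade_support(sampled_trees):
    """Fraction of sampled trees that contain each clade."""
    counts = Counter(clade for tree in sampled_trees for clade in tree)
    n = len(sampled_trees)
    return {clade: count / n for clade, count in counts.items()}


def majority_rule(support, threshold=0.5):
    """Keep clades found in at least `threshold` of the trees."""
    return {clade: p for clade, p in support.items() if p >= threshold}


AB, AC = frozenset("AB"), frozenset("AC")
ABC, ABD = frozenset("ABC"), frozenset("ABD")
trees = [{AB, ABC}, {AB, ABC}, {AC, ABC}, {AB, ABD}]  # hypothetical sample
support = clade_support(trees)
consensus = majority_rule(support)
```

    The same counting applies to bootstrap trees; only the source of the tree sample differs.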

  • Artefacts

    Trees: The Long Branch Attraction Artefact (LBA) (1)

    [Figure: the correct tree (taxa A, B, C, D) next to the result of the phylogenetic analysis]

  • Artefacts

    Trees: The Long Branch Attraction Artefact (LBA) (2)

    Explanation

    Long branches indicate a higher rate of mutations.

    a) A high rate of mutations results in multiple reversals to the original character state (= homoplasies).
    b) In addition, a high mutation rate causes signal noise, blurring the information in the sequences.

    Consequence: Homoplasies are erroneously interpreted as indicators of relatedness.

    Choosing an inappropriate evolutionary model increases the vulnerability to LBA.

  • Trees: Potential Indicators for LBA (1)

    Tree is ladderised at the root, i.e. most or all long branches emerge successively close to the root of the tree.

  • Trees: Potential Indicators for LBA (2)

    Differing tree topologies depending on whether simple or complex evolutionary models are used.

    [Figure panels: maximum parsimony | maximum likelihood (F81+I+Γ)]

  • Preventing/Reducing LBA

    1.) Improved Taxon Sampling

    Breaking up the long branches by adding related taxa.

    more taxa = higher resolution by adding information to variable positions

    2.) Improved Choice of Markers

    Use several genes with differing evolutionary rates and concatenate them. The influence of the long branches may be broken (also good: use of genes from different genomes, e.g. nuclear, mitochondrial, plastid).

    more positions = higher resolution by extending the data matrix

  • How to Handle Concatenated Data? (1)

    Choice of Evolutionary Model

    Different genes will most likely need different evolutionary models.

    Neither Paup nor Phylip allows for a partitioning of data.

    The more genes are included and the more divergent the data are in terms of evolutionary rates, the more likely Modeltest will propose the most complex evolutionary model, GTR+I+Γ.

  • How to Handle Concatenated Data? (2)

    Choice of Evolutionary Model: MrBayes allows for a partitioning of data

    Begin mrbayes;
      set autoclose=yes;
      log start filename=sum.log;
      charset NM=1-1564;
      charset ITS2=1565-1850;
      charset LSU=1851-2733;
      partition concP2=3:NM,ITS2,LSU;
      set partition=concP2;
      lset applyto=(1) nst=6 nucmodel=4by4 rates=invgamma ngammacat=4 covarion=yes;
      lset applyto=(2) nst=6 nucmodel=4by4 rates=invgamma ngammacat=4 covarion=yes;
      lset applyto=(3) nst=6 nucmodel=4by4 rates=invgamma ngammacat=4 covarion=yes;
      unlink statefreq=(all);
      unlink shape=(all);
      unlink revmat=(all);
      unlink switchrates=(all);
      prset ratepr=variable;
      mcmcp ngen=3500000 printfreq=1000 samplefreq=100 nchains=4 savebrlens=yes filename=conc;
      mcmc;
      quit;
    End;

  • How to Handle Concatenated Data? (3)

    The Likelihood Summation Method

    - Run a phylogenetic analysis with each single-gene dataset and with the concatenated data set.
    - Let Paup save the 1000 best trees resulting from each analysis.
    - Concatenate the tree files.
    - Calculate the likelihood scores for each of the trees with the lscores command in Paup, but use only the single-gene alignments as data matrices.
    - Load the scorefiles for each data set into a spreadsheet program and calculate the sum of log likelihoods for each tree.
    - Sort the trees according to their summed log likelihood.
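    The last two steps can also be done with a short script instead of a spreadsheet. The following Python sketch assumes that the scores have already been read from the scorefiles into one list per gene, with the trees in the same order in every list, and that the values are log likelihoods (negative numbers, higher is better). If the scorefiles hold -ln L values, the sort order has to be reversed. Function name and numbers are hypothetical.

```python
def rank_trees_by_summed_lnl(scores_by_gene):
    """Sum per-gene log likelihoods for each tree and sort, best tree first.

    scores_by_gene: dict mapping gene name -> list of lnL values, one per tree.
    Returns a list of (tree number, summed lnL), tree numbers starting at 1.
    """
    lengths = {len(scores) for scores in scores_by_gene.values()}
    if len(lengths) != 1:
        raise ValueError("every gene needs a score for every tree")
    totals = [sum(per_tree) for per_tree in zip(*scores_by_gene.values())]
    return sorted(enumerate(totals, start=1),
                  key=lambda item: item[1], reverse=True)


scores = {  # hypothetical values for three trees
    "NM": [-5000.0, -5010.0, -4995.0],
    "ITS2": [-1200.0, -1190.0, -1215.0],
    "LSU": [-3000.0, -3005.0, -3001.0],
}
ranking = rank_trees_by_summed_lnl(scores)
```

    In this toy example, tree 3 is the best tree for the NM data alone, but tree 1 has the highest summed log likelihood across all three genes.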

  • Testing Tree Topology (1)

    Kishino-Hasegawa and Shimodaira-Hasegawa tests are used to test hypothetical user-defined trees by comparison with the optimal tree.

    Problem: The tests were designed to compare random trees, but biologists use them to compare trees with the optimal tree.

  • Testing Tree Topology (2): Consel

    Consel
    - written by H. Shimodaira
    - needs input files with site-wise log likelihoods
    - accepts Paup, Molphy and PAML scorefiles
    - consists of a suite of programs for different tasks
    - C source code; Unix command line or DOS console
    - performs:
        approximately unbiased test
        Kishino-Hasegawa test (unweighted/weighted)
        Shimodaira-Hasegawa test (unweighted/weighted)
        bootstrap probabilities
        posterior probabilities

  • Testing Tree Topology (3): Consel and Paup

    Using Consel in Combination with Paup

    1) Construct constraints to test a hypothesis.
    2) Infer the optimal constraint trees with Paup (ML).
    3) Concatenate all treefiles that are to be subjected to the test.
    4) Let Paup calculate the total log likelihood and the site-wise log likelihoods and save the data to a scorefile.
    5) Use a text editor to delete superfluous data (all parameters of the evolutionary model).
    6) Feed Consel with the scorefile.

  • Testing Tree Topology (4): Consel Commands

    1) Generate bootstrap subsamples

    makermt --paup scorefile.txt

    2) Perform the test

    consel scorefile

    3) Generate an output file with the test results

    catpv scorefile

  • Testing Tree Topology (5): How Does Consel Work?

    makermt
    Generates multiscale bootstrap subsamples from the site-wise log likelihoods in the scorefile.
    - Bootstrap samples from site-wise log likelihoods = the RELL method
    - Multiscale bootstrap = sizes of subsamples differ from the original data set
    By default, makermt generates 10 sets of replicates, each with 10,000 subsamples (0.5, 0.6, 0.7, 0.8, 0.9, 1.1, 1.2, 1.3, 1.4 and 1.5 times the size of the original data set), and an additional set with 10,000 subsamples of 1.0-fold size.

    consel
    Calculates the probabilities for KHT, SHT and normal bootstrap using the 10,000 subsamples of 1.0-fold size, and the probabilities for the AUT and the multiscale bootstrap using the multiscale bootstrap samples.
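    The RELL idea can be sketched in Python: instead of re-estimating every tree for every bootstrap sample, the site-wise log likelihoods are resampled and summed. This is an illustrative sketch only, not Consel's implementation; the function name, the `scale` parameter and the data layout (one list of site-wise log likelihoods per tree) are assumptions.

```python
import random


def rell_bootstrap(site_lnl, n_replicates=1000, scale=1.0, rng=random):
    """Fraction of replicates in which each tree has the highest total lnL.

    site_lnl: list of per-tree lists of site-wise log likelihoods.
    scale: subsample size relative to the original number of sites
           (1.0 = normal bootstrap, other values = one multiscale set).
    """
    n_sites = len(site_lnl[0])
    sample_size = max(1, round(scale * n_sites))
    wins = [0] * len(site_lnl)
    for _ in range(n_replicates):
        # Resample site indices with replacement, shared by all trees.
        sites = [rng.randrange(n_sites) for _ in range(sample_size)]
        totals = [sum(tree[s] for s in sites) for tree in site_lnl]
        wins[totals.index(max(totals))] += 1  # ties go to the first tree
    return [w / n_replicates for w in wins]
```

    With scale=1.0 the returned fractions correspond to ordinary bootstrap probabilities; repeating the call with other scale values gives the kind of multiscale sets from which the approximately unbiased test is computed.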

  • Testing Tree Topology (6): Consel Output

    catpv summarises the results of the test (the probability values)

    # reading nm.pv
    # rank item    obs     au     np |     bp     pp     kh     sh    wkh    wsh |
    #    1    1  -15.8  0.957  0.942 |  0.944  1.000  0.936  0.994  0.936  0.999 |
    #    2    6   15.8  0.064  0.055 |  0.054  1e-07  0.064  0.415  0.064  0.218 |
    #    3    5   35.2  0.005  0.002 |  0.002  5e-16  0.006  0.079  0.006  0.027 |
    #    4    4   36.3  0.002  0.001 |  4e-04  2e-16  0.005  0.063  0.005  0.019 |
    #    5    7   56.8  7e-05  1e-04 |  2e-04  2e-25  3e-04  0.012  3e-04  0.001 |
    #    6    2  129.4  2e-64  5e-21 |      0  6e-57      0      0      0      0 |
    #    7    3  142.2  4e-06  5e-06 |      0  2e-62      0      0      0      0 |

    rank = tree ranking
    item = no. of tree in treefile
    obs = log likelihood difference
    au = p-values of the approximately unbiased test
    np = p-values of the multiscale bootstrap
    bp = p-values of the normal bootstrap
    pp = posterior probabilities
    kh = Kishino-Hasegawa test
    sh = Shimodaira-Hasegawa test
    wkh, wsh = weighted Kishino-Hasegawa and Shimodaira-Hasegawa tests

  • Phylogeny With Protein Sequences (1)

    Due to the degeneracy of the genetic code, codon usage may be biased! This bias may apply not only to the third position, but also to the first and second.

    Presumably it would be better to use protein sequences instead, but:

    Protein alignments have 20 character states instead of 4 = analyses, especially ML analyses, take much longer!

    Using nucleotide data, but considering nonsynonymous/synonymous substitutions and/or translating during analysis -> this is also quite time-consuming!

  • Phylogeny With Protein Sequences (2)

    Maximum likelihood analyses of protein sequences are usually based on substitution matrices derived from empirical data instead of estimating the substitution rate matrix from the data set.

    e.g.
    - Dayhoff (Dayhoff et al. 1978)
    - JTT (Jones, Taylor, Thornton 1992)
    - WAG (Whelan, Goldman 2001)

  • Phylogeny With Protein Sequences (3)

    Programs for phylogenetic analyses of protein sequences

    - Paup: only limited possibilities; no substitution matrices included; no maximum likelihood
    - Phylip (text-based menu): phylip format; maximum likelihood; PAM, JTT, PMB
    - Tree-Puzzle (text-based menu): phylip format; maximum likelihood with quartet puzzling; Dayhoff, JTT, WAG, VT etc.
    - PAML (Unixoids, DOS console): phylip format; maximum likelihood; Dayhoff, JTT, WAG etc.
    - (Molphy [Unixoids]: phylip format; maximum likelihood; Dayhoff, JTT)
    - MrBayes: Bayesian analysis

  • Phylogeny With Protein Sequences (4)

    One possibility:

    Calculate a tree and gamma categories using Tree-Puzzle 5.2 (Schmidt, Strimmer and von Haeseler 2004).

    Use the gamma category estimates as settings to perform a maximum likelihood analysis with proml from the Phylip 3.62 package (Joe Felsenstein 2004).

  • Phylogeny With Protein Sequences (5)

    Text-based menu of Tree-Puzzle 5.2

    GENERAL OPTIONS
     b  Type of analysis?                  Tree reconstruction
     k  Tree search procedure?             Quartet puzzling
     v  Approximate quartet likelihood?    Yes
     u  List unresolved quartets?          No
     n  Number of puzzling steps?          1000
     j  List puzzling step trees?          No
     o  Display as outgroup?               Chondrus (1)
     z  Compute clocklike branch lengths?  No
     e  Parameter estimates?               Approximate (faster)
     x  Parameter estimation uses?         Neighbor-joining tree
    SUBSTITUTION PROCESS
     d  Type of sequence input data?       Auto: Amino acids
     m  Model of substitution?             Auto: JTT (Jones et al. 1992)
     f  Amino acid frequencies?            Estimate from data set
    RATE HETEROGENEITY
     w  Model of rate heterogeneity?       Uniform rate

    Quit [q], confirm [y], or change [menu] settings:

  • Phylogeny With Protein Sequences (6)

    Phylip 3.62: suite of 30 programs

    e.g. ML bootstrapping:
    a) run seqboot to generate bootstrap samples from the data set
    b) run dnaml or proml (depending on the data set)
    c) run consense to create a consensus tree
    d) the consensus tree file may be loaded into a tree-displaying program

  • Miscellaneous

    Molecular Clock

    Assumes that all lineages evolve at equal evolutionary rates. This is usually not the case, so enforcing a clock puts the analysis under a constraint. The molecular clock hypothesis may be tested with a likelihood ratio test, similar to the comparison of evolutionary models in Modeltest. If fossils are available, one may try dating the divergences of lineages.

    Secondary Structure Analyses

    may add information to the results (predominantly RNA-coding, ITS or intron regions)

    Synapomorphy Analyses

    Searching for synapomorphic characters or strings of characters may be useful for systematic purposes.

  • Questions?

  • That's it!