from protein sequences to phylogenetic trees

47
FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES Simon Harris Wellcome Trust Sanger Institute, UK

Upload: margaret-johnston

Post on 31-Dec-2015

87 views

Category:

Documents


3 download

DESCRIPTION

FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES. Simon Harris Wellcome Trust Sanger Institute, UK. Agenda. Remind you that molecular phylogenetics is complex the more you know about the compared proteins and the method used, the better Try to avoid the black box approach as much as possible! - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

Simon HarrisWellcome Trust Sanger Institute,

UK

Page 2: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

Agenda

• Remind you that molecular phylogenetics is complex – the more you know about the compared proteins and the method used, the better• Try to avoid the black box approach as much as possible!

• Give an overview of some phylogenetic methods and software used with protein alignments - some practical issues…

Page 3: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

From DNA/protein sequences to trees

Modified from Hillis et al., (1993). Methods in Enzymology 224, 456-487

1

2

3

4

5

Align Sequences

Phylogenetic signal?Patterns—>evolutionary processes?

Test phylogenetic reliability

Distance methodsChoose a method

MB ML

Character based methods

Single treeOptimality criterion

Calculate or estimate best fit tree

LS ME NJ

Distance calculation(which model?)

Model?

MPWeighting?

(sites, changes)?Model?

Sequence data

*

**

Page 4: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

Phylogenies from proteins

• Parsimony• Distance matrices• Maximum likelihood• Bayesian methods

***

Page 5: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

1 2 3 4 5A G G G C GB A G G C TC A T T C TD A T T T G

A B C DA - d1 d2 d3B - d4 d5C - d6 D -

A

B C

D

Distance matrix

MPMLBM

Explicit model of sequence evolution

Distance methodsCharacters based methods

Page 6: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

Phylogenetic trees from protein alignments

• Distance methods - model for distance estimation

– Simple formula (e.g. Kimura, use of Dij, LogDet)

– Complex models• Probability of amino acid changes - Mutational Data Matrices

• Site rate heterogeneity

• Maximum likelihood and Bayesian methods- MDM based models are used for lnL calculations of sites -> lnL of trees

• Site rate heterogeneity

• Homogenous versus heterogeneous models

• Estimations of data specific rate matrices (amino acid groupings - GTR like)

Page 7: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

Software: an overview • CLUSTALXv2 - kimura distances• PHYLIPv3.62 - distance, MP, and ML methods (and more)

– Some complex protein models• PAM, JTT ± site rate heterogeneity

– Bootstrapping - bootstrap support values• TREE-PUZZLEv5.2 - distance and a ML method

– ML - quartet method– Complex protein models

• JTT, WAG…matrices ± site rate heterogeneity

– From quartets to n-taxa tree - PUZZLE support values– Some sequence statistics - aa frequency and heterogeneity between sequences– Tree comparisons - KH test

• PHYML - ML methods (protein and DNA)• MRBAYES - Bayesian

– Complex protein models• JTT, WAG…matrices ± site rate heterogeneity• Data partitioning

– Posteriors as support values• PHYLOBAYESv2.3• P4

– All the things you can dream off… almost… ask Peter Foster– Heterogeneous models among taxa or sites– Estimation of amino acid rate matrices for grouped categories (6x6 rate

matrices can be calculated)

Page 8: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

PHYLIP3.62• Protpars: parsimony• Protdist: models for distance calculations:

– PAM1, JTT, Kimura formula (PAM like), others...– Correction for rate heterogeneity between sites! Removal of invariant sites? (not estimated, see TREE-PUZZLE5.2!)

• NJ and LS distance trees (± molecular clock)• Proml: protein ML analysis (no estimation of site rate

heterogeneity - see TREE-PUZZLE5.2)

– Coefficient of variation (CV) versus alpha shape parameter CV=1/alpha1/2

• Bootstrapping

Page 9: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

A two step approach - two choices!

1) Estimate all pairwise distancesChoose a method (100s) - has an explicit model for sequence evolution

• Simple formula

• Complex models - PAM, JTT, site rate variation

2) Estimate a tree from the distance matrixChoose a method: with (ME, LS) or without an optimality criterion (NJ)?

Distance methods

Page 10: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

Simple and complex models

dij = -Ln (1 - Dij - (Dij2/5)) (Kimura)Simple and fast but can be unreliable - underestimates changes, hence distances, which can lead to misleading trees - PHYLIP, CLUSTALXDij is the fraction of residues that differs between sequence i and j (Dij = 1 - Sij)

dij = ML [P(), (pinv), Xij] (bad annotation!)ML is used to estimate the dij based on the sequence alignment and a given model - MDM, gamma shape parameter and pinv - PHYLIP, PUZZLE. Each site is used for the calculation of dij, not just the Dij value.More realistic complexity in relation to protein evolution and the subtle patterns of amino acid exchange rates…Note: the values of the different parameters (alpha+pinv) have to be either estimated, or simply chosen (MDM), prior the dij calculations

Page 11: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

1) Choosing/estimating the parameter of a model1) Mutation Data Matrices: PAM, JTT, WAG…

• What are the properties of the protein alignment (% identity, amino acid frequencies, globular, membrane)?

• Can be corrected for the specific dataset amino acid frequencies (-F) - in some software only

• Compare ML of different models for a given data and tree ModelGenerator and ProtTest are

designed for this2) Alpha and pInv values have to be

estimated on a tree• TREE-PUZZLE can do that. Reasonable

trees give similar values…

Page 12: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

2) Inferring phylogenetic trees from

the estimated dija) Without an optimality criterion

• Neighbor-joining (NJ) (NEIGHBOR) Different algorithms exist - improvement

of the computing

If the dij are additive, or close to it, NJ will find the ME tree

• BIONJ, WEIGBOR, FastME

b) With an optimality criterion • Least squares (FITCH)

• Minimum evolution (in PAUP - now also PHYLIP)

Page 13: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

• Seeks to minimise the weighted squared deviation of the tree path length distances from the distance estimates -uses an objective function

Fitch Margoliash Method 1968

E = wij |dij - pij|

i=1j=i+1

T-1 T

dij = F(Xij) pairwise distances estimate - from the data using a specific model (or simply Dij)pij = length of path between i and j implied on a given treedij = pij for additive datasets (all methods will find the right tree)

E = the error of fitting dij to pijT = number of taxaif = 2 weighted least squareswij = the weighting scheme

Page 14: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

Minimum Evolution Method• For each possible alternative tree one can estimate the length of each branch from the estimated pairwise distances between taxa (using the LS method) and then compute the sum (S) of all branch length estimates. The minimum evolution criterion is to choose the tree with the smallest S value S = Vk

k=1

2T-3

With Vk being the length of the branch k on a tree

Page 15: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

Distance methods• Advantages:

– Can be fast (NJ)– Some distance methods (LogDet) can be superior to more complex approached (ML) in some conditions (shown for DNA alignments)

– Distance trees can be used to estimate parameter values for more complex models and then used in a ML method

– Provides trees with branch lengths• Disadvantages:

– Can lose information by reducing the sequence alignment into pairwise distances

– Can produce misleading (like any method) trees in particular if distance estimates are not realistic (bad models), deviates from additivity

Page 16: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

Character based methods

• Maximum likelihood based methods– Quartet puzzling method - TREE-PUZZLE

– Standard ML - PHYML, PROML (PHYLIP)

• Bayesian based methods– MrBayes v3.1

– Phylobayes v2.3

– P4 (Peter Foster)

Page 17: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

1 2 3 4 5A G G G C GB A G G C TC A T T C TD A T T T G

A B C DA - d1 d2 d3B - d4 d5C - d6 D -

A

B C

D

Distance matrix

MPMLBM

Explicit model of sequence evolution

Distances methodsCharacters based methods

Page 18: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

TREE-PUZZLE5.2• Protein maximum likelihood method using “quartet puzzling”

– With various protein rate matrices (JTT, WAG…)– Can include correction for rate heterogeneity between sites - pinv + gamma shape (can estimates the values)– Can estimate amino acid frequencies from the data– List site rates categories for each site (2-16)– Composition statistics– Molecular clock test– Can deal with large datasets

• Can be used for ML pairwise distance estimates with complex models - used with puzzleboot to perform bootstrapping with PHYLIP

Page 19: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

A gamma distribution can be used to model site rate heterogeneity

Yang 1996TREE, 11,367-372

Page 20: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

TREE-PUZZLE5.2

1) Parameters (pInv-gamma) are estimated on a NJ n-taxa tree 2) Calculate the ML tree for all possible quartets (4-taxa)3) Combine quartets in a n-taxa tree (puzzling step)4) Repeat the puzzling step numerous times (with randomised order of quartet input)5) Compute a majority rule consensus tree from all n-trees - has the puzzle support value

The quartet ML tree search method has four steps:

Puzzle support values are not bootstrap values!

Page 21: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES
Page 22: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

TREE-PUZZLE5.2•Models for amino acid changes:

– PAM, JTT, BLOSUM64, mtREV24, WAG (with correction for amino acid frequencies)

– Correction for specific dataset amino acid frequencies– Discrete gamma model for rate heterogeneity between sites 4-16 categories.-> output gives the rate category for each site. Can be used to partition your data and analyse them separately…

•Taxa composition heterogeneity test•Molecular clock test

Page 23: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

Combination of categories that contributes the most to the likelihood(computation done without clock assumption assuming quartet-puzzling tree):

66348060655113163168016402640155135351755587777885757678866630260137005633775758678418874036673587278337841000224245847183622611151658686136640100010026331405414845653410774876588

Page 24: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

• Can be used to calculate pairwise distances with a broad diversity of models - puzzleboot (Holder & Roger) – Can be used in combination with PHYLIP programs for bootstrapping:

– SEQBOOT– NJ or LS…– CONSENSE

• But PHYML can do ML bootstrapping in a fair amount of time…

TREE-PUZZLE5.2

Page 25: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

TREE-PUZZLE5.2• Advantages:

– Can handle larger numbers of taxa for maximum likelihood analyses

– Implements various models (BLOSUM, JTT, WAG…) and can incorporate a correction for rate heterogeneity (pinv+gamma)

– Can estimate for a given tree the gamma shape parameter and the fraction of constant sites and attribute to each site a rate category

• Disadvantages:– Quartet based tree search - amplification of the long branch attraction artefact within each quartet analysis?

Page 26: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

MrBayes 3.1• Bayesian approach

– Iterative process leading to improvement of trees and model parameters and that will provide the most probable trees (and parameter values)

• Complex models for amino acid changes:– PAM and JTT, WAG (with correction for amino acid frequencies, but you have to type it!?!?!)

– Correction for rate heterogeneity between sites (pinv, discrete gamma, site specific rates)

• Powerful parameter space search– Tree space (tree topologies)– Shape parameter (alpha shape parameter, pinv)– Can work with large dataset– Provides probabilities of support for clades (posterior probabilities)

Page 27: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

MrBayes 3.1• MrBayes will produce a population of trees and parameter values - obtained by a Markov chain (mcmcmc). If the chain is working well these will have converged to “probable” values

• In practice we plot the results of an mcmcmc to determine the region of the chain that converged to probable values. The “burn in” is the region of the mcmcmc that is ignored for calculation of the consensus tree

– Trees and parameter values from the region of equilibrium are used to estimate a consensus tree

– The number of trees recovering a given clade corresponds to the posterior for that clade, the probability that this clade exists

– The mcmcmc uses the lnL function to compare trees between generations

Page 28: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

Most methods provide a single tree and

parameters value

– Bootstrapping provide a distribution

of tree topologies

– Puzzling steps also provides a

distribution tree topologies

• Bootstrap values - Puzzle support values

- Posteriors values ???

• But not to sure how to interpret these

different support values - in each case the

support values are for a given dataset and

method used

• Posteriors are typically higher then

bootstrap and puzzle support values?!

MrBayes 3.1

Page 29: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

MrBayes 3.1: some options

Page 30: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

#NEXUSbegin data; dimensions ntax=8 nChar=500; format datatype=protein gap=- missing=?; matrix

Etc…

Begin mrbayes;log start filename=d.res.nex.log replace;prset aamodelpr=fixed(wag);lset rates=invgamma Ngammacat=4; set autoclose=yes;mcmc ngen=5000 printfreq=500 samplefreq=10 nchains=4 savebrlens=yesstartingtree=random filename=d.res.nex.out;

quit;end;[Begin mrbayes;log start filename=d.res.nex.con.log replace; sumt filename=d.res.nex.out.t burnin=150 contype=allcompat;end;]

MrBayes 3.1: an example

Block I

Block II

Page 31: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES
Page 32: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

A Bayesian analysis-Propose a starting tree topology and parameters values (branch length, alpha, pinv), calculate lnL-Change one of these and compare the lnL with previous proposal-If the lnL is improved accept it-If not, accept it only sometimes-Do many of these…-Plot the change of lnL in relationship to the number of generations run-Determine the region where the chain converged and calculate the consensus tree for that region

-> consensus tree with posteriors for clade support

#generations (mcmcmc)

Tree

lnL

Tree

lnL

Zooming in

Page 33: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

alpha

pinv

“Burn in” determines the trees to be ignored for consensus tree calculation -Was the chain run long enough?-Do we get the same result from an independent chain?

#generations (mcmcmc)

Page 34: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

+-----------------------A|+--------------------------B|| +-----------------------C| +---------|(0.98)| | | +-------------------------D| | +---------|(0.99)+--------|(0.49) +----------------------E | | +-----------------------F +---------|(0.96) | +------------------------G +--------|(0.81) +--------------------------H

Consensus tree with a burn in of 1500 (150)Showing posterior values for the different clades - probability for a given clade to be correct (for the given data and method used!!!)

Page 35: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

1.Rate matrix choice (20x20 matrices)

– WAG, BLOSSUM62, etc…

2.Recoding protein datasets

– 20x20 --> 6x6 rate matrix (or else)

– Implemented in P4 and PhyloBayes v2.3

Model choice in protein analyses

Page 36: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

Effect of using different rate matrices on phylogenetics

Keane et al. 2006

PHYML MtRev matrix

PHYMLWAG matrix

Page 37: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

-Numerous eukaryotes do not possess mitochondria-They possess instead hydrogenosomes or mitosomes-What is the evolutionary origin of these organelles and what is their function?

Page 38: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

Trichomonas NuoF localises in the hydrogenosomes

Complex I

News and views by Gray (2005). Nature 434, 29-30.

Page 39: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

Amino acids categories - recoding in p4

– Sulfhydryl: C (1)– Smallhydrophilic: S, T, A, P, G (2)

– Acid,amide: D, E, N, Q (3)– Basic: H, R, K (4)– Smallhydrophobic: M, I, L, V (5)– Aromatic: F, Y, W (6)

1 2 3 4 5 61- x1 x2 x3 x4 x52- x6 x7 x8 x93- x10 x11 x124- x13 x145- x156-

Recoding into 6 states (1-6) allows the estimation of a GTR like matrix with 14 free parameters

Page 40: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

Why recode amino acids?

Potential advantages: • Allows generation of a rate matrix specific for the investigated alignment• Contributes to mitigating amino acid composition heterogeneity and homoplasy due to frequent changes within categories - equivalent to DNA transversion analyses

Potential disadvantage:Loss of potential useful signal by reducing the alphabet from 20 to 6 letters (or else)

Page 41: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

0.1

MUSXENOPUS

ANOPHELESDRO MAL1

DRO MAL2CAENORHABD

NEUROSPORAASPERGILLU

CANDIDAYARROWIA

DICTYOSTELARA THA1ORYZA

SOLANUMLEISHMANIA

RHO SPHPARACOCCUS

BRU MELAGROBACTER

SINORHIZOBBRADYRHIZO

CAULOBACTERHODOPSEUD

MESORHIZOBNOVOSPHING

MAGNETOSPIRHODOSPIRIRIC CON

RIC PROTRICHOMONAS

MAGNETOCOCBURKHOLDERRALSTONIA

NITROSOMONNEISSERIA

XANTHOMONAXYLELLA

COXIELLASTR AVEMYCOBACTER

THERMOBIFIESCHERICHI

CHLOROFLEXDEINOCOCCU

LEPTOSPIRAENTEROCOCC

MUSXENOPUS

ANOPHELESDRO MAL1

DRO MAL2CAENORHABD

NEUROSPORAASPERGILLUCANDIDAYARROWIA

TRICHOMONASLEISHMANIA

ARA THA1ORYZA

SOLANUMDICTYOSTELBRADYRHIZO

CAULOBACTERHODOPSEUD

MESORHIZOBBRU MELAGROBACTER

SINORHIZOBRHO SPH

PARACOCCUSNOVOSPHING

MAGNETOSPIRHODOSPIRIRIC CON

RIC PROBURKHOLDERRALSTONIA

NITROSOMONNEISSERIA

XANTHOMONAXYLELLA

COXIELLAMAGNETOCOC

STR AVEMYCOBACTER

THERMOBIFIDEINOCOCCU

LEPTOSPIRACHLOROFLEX

ESCHERICHITHE MAR1

ENTEROCOCC

WAG (20x20) +pInv+G

GTRrecodedDayhoff classes (6x6)+pInv+G

-proteobacteria

Recoding effect on NuoF phylogeny

0.921.0

1.0

1.0

1.00.8

1.0

0.961.0

Page 42: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

Summary• No single program allows thorough phylogenetic analyses of protein alignments

• Combination of PHYLIPv3.6, TREE-PUZZLEv5.2, PHYML, MrBAYESv3.1, PHYLOBAYESv2.3 and P4 allow detailed protein phylogenetics

• Experimenting with your data and available methods/models can lead to interesting and biologically relevant results (data <-> method)– Incorporate site rate heterogeneity correction in the model or reduce heterogeneity by data editing (with and without invariant sites?)

– Partitioning of the alignment (variant - various rates, invariant sites, secondary structure, protein domains…)

– Amino acid groupings (6 categories - GTR like)– LogDet for proteins - rare/absent changes? For long alignments?– DNA based LogDet or the protein alignment…?

• Do not take support values as absolute. Any support value is for a given method and data, only!

Page 43: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

outside

PM

inside

• TM domain have very specific structural requirements: AA composition from TM domain is very distinct from non TM domains!• Extracellular and intracellular domains may also have important functional differences --> different functional constrains can lead to different AA composition.• More generally in any protein, surface exposed AA composition is typically distinct from “internal” AA.

TM domains

Page 44: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

outside

PM

inside

Global alignment (AA)

Sub-alignment 1

Sub-alignment 1

Model 1

Model 2

A

B C

D

A

B C

D

Page 45: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

outside

PM

inside

Model 1 Model 2

A

B C

D

Partition 1

Partition 2

Page 46: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

outside

PM

inside

Global alignment (AA)

A

B C

D

AA recoding, can mitigate compositional difference between domains

– Sulfhydryl: C (1)– Smallhydrophilic: S, T,

A, P, G (2)– Acid,amide: D, E, N, Q

(3)– Basic: H, R, K (4)– Smallhydrophobic: M, I,

L, V (5)– Aromatic: F, Y, W (6)

Page 47: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

outside

PM

insideGlobal alignment (AA)

A

B C

D

Global alignment (DNA)

DNA based modelsLogDet? Codon 1,2 or 3?