phylogenetic analysis introduction to bioinformatics stinus lindgreen [email protected]...

Phylogenetic Analysis

Introduction to bioinformatics

Stinus [email protected]

Bioinformatics Centre, University of Copenhagen

Outline of the lecture What is a phylogeny? Why and how to interpret them Programs: PHYLIP, PAUP* and BioEdit Building a tree 1: Multiple alignment Building a tree 2: The model Building a tree 3: Construction Building a tree 4: Evaluation

Nothing in Biology Makes Sense Except in the Light of Evolution

Theodosius Dobzhansky (1900-1975)

Phylogeny Phylogenetic inference predicts a tree based

on characters (of some sort) Some variation needed Group together similar species/genes Connect to most common ancestor

Unrooted tree: Just show connections Rooted tree: Direction of evolution Branch lengths can show divergence

Before sequences Phylogenetic trees show evolutionary

relationships Existed longer than sequencing methods Previously based on morphological characters Still partly today – at least for checking Mainly based on biological sequences

DNA or protein Base phylogeny on mutations

Morphological tree

Modern tree

A A G C G

X

X

Some pitfalls Determining phylogeny is important for

understanding biology But also a very difficult problem Beware of incorrect trees Important to understand models and methods The programs are helpful tools

The result is only as good as the alignment

Assumptions

Basic concepts of evolutionary theory Relation to common ancestor Phylogenetics represented by bifurcating tree Mutations occur over evolutionary time

Necessary to make phylogenetic inference possible

Tree of Life

Interpretation Know your model

Both evolutionary and for tree construction Know the assumptions of the model

Evolution independent? Identical between sites? The same for all sequences?

Are the sequences correct? And are they representative? And are they homologous? Is the multiple alignment correct?What you get out is no better than what you put in

Some biological pitfallsDon’t make hasty conclusions! Does your tree contradict common sense?

Then it’s probably wrong! Differentiate between the homologs Orthologs

Speciation, common ancestor, similar function Paralogs

Gene duplication, within 1 organism, differing functions

Xenologs Horizontal gene transfer – hard to tell, similar function

Software

Today we’ll look at the programs before the methods

Some programs for phylogenetic analysis A multiple alignment program:

Clustal, T-Coffee, MAFFT, Muscle… A phylogenetic program:

Phylip, PAUP*, MacClade, BioEdit… Visualizing the tree:

TreeView, NJplot

PAUP* Commercial package Apparently good Many different methods and analysis methods But since we don’t own a copy…

Similarly: MacClade only works on Macintosh…

PHYLIP Free package Many programs Both distance and character based Bootstrapping possible But: It can be a little difficult No graphical user interface And you will need to run many programs

BioEdit Has phylogeny methods built in Can call Phylip routines No need for you to learn the command line But no bootstrapping… (as far as I know)

Point and click: Select the sequences in the alignment Choose the wanted phylogeny Voila!

PhyloWin Another free program Simple, not many possibilities But you can make bootstrapping

Getting the software Install BioEdit, PHYLIP, PhyloWin and NJplot Links on the wiki

Constructing a treeTo make a phylogenetic tree, four steps are

needed:1. Perform multiple alignment2. Choose your model3. Build the tree4. Evaluate the quality

A brief note: Ideally: Parallel alignment and phylogenetic

inference Very difficult – but it has been pursued

1) The multiple alignment

Already discussedSome notes: Recall that MA programs are not exact

Some manual editing often necessary Consider the algorithm used

Does it consider the phylogeny of the data? Clustal’s guide tree: Not correct phylogeny

What parameters are used? Solve ambiguities, remove near-identical

sequences Gappy regions, identical sequences can bias the result

2) The model

The model describes the data Evolutionary events Overall mutability Evolutionary model?

Crucial – both for alignment and tree building Are you looking at nucleotides or amino acids?

Where do we get most information? Know the basis for the chosen model

Nucleotide models Create 4×4 matrix Either fixed cost

Character state Or rate matrices

Probabilities Used for different kinds of tree

estimations

Include site specific information Third codon position more variable

Nucleotide model 1 Fixed cost for transitions and

transversion E.g. transversions are twice as costly

as transitions For a tree: Count the number of

transitions/transversions Calculate cost Tends to minimize number of

transversion Cluster transitions

A C G T

A - 2 1 2

C 2 - 2 1

G 1 2 - 2

T 2 1 2 -

Nucleotide model 2 Simple substitution rate matrix Assume same rates AB and BA Assume all mutations equally likely: Rate α The Jukes-Cantor model

A C G T

A -3α α α α

C α -3α α α

G α α -3α α

T α α α -3α

Nucleotide model 3

A C G T

A-

(α2+α1)α2 α1 α2

C α2

-(α2+α1)

α2 α1

G α1 α2

-(α2+α1)

α2

T α2 α1 α2 -(α2+α1)

More advanced rate matrix Include transitions/tranversions Rates α1 and α2

The Kimura 2-parameter model

Amino acid models A 20×20 substitution matrix The BLOSUM matrices

Fixed cost matrices Or the PAM matrices

Rate matrices Described last week

3) Building the tree

We have the sequences, the alignment and the model

Find the best tree What is the best tree? Two main strategies: Distance based

Look at dissimilarities (=distances) Character based

Look at the data

Problems with trees The number of possible trees grows

exponentially For 15 taxa: 2.13·1014 possibilities… How to search?

Branch and Bound Branch swapping

Rooting the tree Not a simple problem

All the following methods produce unrooted trees Use an outgroup Midpoint of longest branch

Distance methods Some sequences more similar than others Closely related sequences should be close in

the tree Abstract view on the data

Loss of information is usually a bad sign Only use the distances between sequences

Recall Clustal All methods start with a distance matrix

Distance methods Can we get the correct answer? Yes, if all mutation events were present But: After one mutation, the site is ”saturated”

Additional mutations do not give additional info

A B C: Distance 2A C: Distance 1 And mutations back will fool the methodA B A: Distance 2A A: Distance 0

UPGMA

Unweighted Pair Group Method with Arithmetic Mean Unweighted: The distances are used as they are Pair: Find the two closest elements Group: Put them together in a new group Arithmetic Mean: Gives distances from the new

group

Correct tree assuming a molecular clock Evolutionary divergence time can be found from

mutations Mutation rates are constant

UPGMA illustrated Find two closest: A and D Create a new group [A+D] Update distances:

72

682

BDBABD][A

A B C D E

A - 8 3 2 5

B - - 5 6 6

C - - - 7 5

D - - - - 3

E - - - - -

A+D

B C E

A+D

- 7 5 4

B - - 5 6

C - - - 5

E - - - -

Repeat for all sequences Next time: Connect [A+D]

with E

Trying UPGMA Go to the wiki and do the UPGMA exercise

Neighbour joining A little like UPGMA Difference: NJ does not assume a molecular

clock But it assumes an additive tree

Distance between two leaves is the sum of the edges Find the closest pair that is most apart from the

rest of the tree Connect pair and update distances

A little advanced: Take the overall distance to the rest of the tree into account

Corrects for varying mutation Fast and can give good results

Fitch-MargoliashFM method We have the pairwise distances Each branch in the tree has a length The length of all paths can be found Optimize tree by moving internal nodes

around The best fit minimizes the overall error

The minimum squared deviation

ij

2ijij )p(d

Minimum Evolution

The ME method Find the shortest tree

Count number of changes Similar to FM but only looks at branches

FM

ij

2ijij )l(d

A

B

B

A

ME

Trying NJ Go to the wiki and do the NJ exercise

Character methods Use the data (the actual characters) All information at hand More advanced, slower, but also more

accurate Maximum Parsimony (MP)

Occam’s razor: Simplest explanation Maximum Likelihood (ML)

Advanced statistical method Most probable tree given the data and the model

Maximum parsimony How does evolution work? Assumption: Path of least resistance True evolution gives rise to fewest changes

The tree we want: Describe the given sequences by fewest

changes The ancestral nodes must be as similar as possible

Predict a tree Count the number of changes needed

MP illustrated

A C G G C

{A,C} {G}

{A,C,G}

{C}

MP illustrated

A C G G C

{A,C} {G}

{A,C,G}

{C}

X

X

Cost: 2 changes

MP illustrated

A C G G C

C G

C

C

CG

CA

Maximum Likelihood Given the data, predict the most probable

model Can optimize both tree and substitution model

We know the sequences What is the most likely substitution rates?

Estimate from the alignment (and the phylogeny) And what is the most likely tree?

Estimate from alignment and substitution rates Computationally heavy and rather slow Normally good results

Maximum Likelihood General practice: Optimize model then tree Calculate probability for each alignment column Combine to probability for entire alignment Averages over low and high probability sites Likelihood of column given tree

A A C

A

A

A A C

C

A

A A C

G

A

L=P +P +P +…

Maximum likelihood Then repeat this for all possible tree topologies And all possible assignments to internal nodes And then choose the combination that gives

the highest probability…

Clearly very difficult

MP and ML exercise Go to the wiki and do the MP and ML exercises

Summary of methods

Distance Character based

Clustering

UPGMANeighbour Joining

Optimality criterion

Least SquaresMinimum evolution

Maximum parsimonyMaximum likelihood(Bayesian statistics)

The differences Sometimes the differences can seem minimal They affect the tree – but the same result is

possible

UPGMA and NJ Minimize the overall length of the treeMaximum parsimony Finds tree with fewest changesMaximum likelihood Maximizes the probability of the tree given the data

4) Evaluating trees

How good is the predicted tree?

Some sequence variation needed Is the signal strong enough?There are so many possible trees Are there many trees similar to the prediction?

Which one to choose? Is the tree robust?

Does it change much when e.g. removing a sequence?

Randomization Is it possible that tree is just random? Permute the columns of the alignment

i.e. shuffle the characters in a column Build a new tree Is it (partly) identical? If the tree is just as likely to be random, then

don’t put too much faith in it

Bootstrapping The story of Baron von Münchausen He pulled himself out of a swamp by his

bootstraps The idea: Evaluate the quality of the result

using the same data all over again Make a large number of new datasets Create phylogenetic tree Observe the number of times clades are made

Bootstrapping The datasets should be similar Thereby: The trees are comparable Alignments of same size (length and sequences) Non-parametric: Sample with replacement

Choose a random column and add new alignment Parametric: Simulate new datasets

Use model that look like your data

Characteristics are preserved (unlike randomization)

Bootstrap example Non-parametric

bootstrapping We have an alignment:A: A G G C U C C A A AB: A G G U U C G A A AC: A G C C C C G A A AD: A U U U C C G A A C#: 0 1 2 0 3 0 1 2 0 1

Sample columns:A: G G G U U U C A A AB: G G G U U U G A A A C: G C C C C C G A A AD: U U U C C C G A A C

A B C D

A - - - -

B 1 - - -

C 5 5 - -

D 8 7 4 -

A

B

C

D

Bootstrap example Sample 2:A: A U U C C C C A A AB: A U U C C G G A A AC: A C C C C G G A A AD: A C C C C G G C C C

A B C D

A - - - -

B 2 - - -

C 4 2 - -

D 7 5 3 -

A

B

C

D

Bootstrap example Sample 3:A: A C C C A A G G C CB: A C C G A A G G U UC: A C C G A A C C C C D: A C C G C C U U U U

A B C D

A - - - -

B 3 - - -

C 3 4 - -

D 7 4 6 -

A

B

C

D

Bootstrap example Calculate consensus tree

Can be done on many ways Put the bootstrap number at each branch point

The proportions of times this branch is observed Of course, more than three samples needed

A

B

C

D

1.0

0.66

Bootstrapping exercise Do the bootstrapping exercise on the wiki

Summary What is phylogenetic inference? What can a phylogenetic tree be used for? Be aware of the multiple alignment The different models Tree building methods: NJ, UPGMA, ML and MP Evaluating trees: Bootstrapping Programs: Phylip, PAUP*,PhyloWin and BioEdit

Next time: Gene finding (with Anders Krogh)Then RNA structure prediction with me again

phylogenetic analysis introduction to bioinformatics stinus lindgreen [email protected]...

Documents

morphological tree slide

tree of life slide

divergence slide

njplot slide

evaluation slide

tree mutations

tree construction

modern tree