Phylogenetic Analysis
Introduction to bioinformatics
Stinus Lindgreen
Bioinformatics Centre, University of Copenhagen
Outline of the lecture
  What is a phylogeny?
  Why and how to interpret them
  Programs: PHYLIP, PAUP* and BioEdit
  Building a tree 1: Multiple alignment
  Building a tree 2: The model
  Building a tree 3: Construction
  Building a tree 4: Evaluation
Phylogeny
  Phylogenetic inference predicts a tree based on characters (of some sort)
    Some variation needed
  Group together similar species/genes
  Connect to most recent common ancestor
  Unrooted tree: Just shows connections
  Rooted tree: Direction of evolution
  Branch lengths can show divergence
Before sequences
  Phylogenetic trees show evolutionary relationships
  Existed longer than sequencing methods
  Previously based on morphological characters
    Still partly today, at least for checking
  Mainly based on biological sequences
    DNA or protein
  Base phylogeny on mutations
Some pitfalls
  Determining phylogeny is important for understanding biology
  But also a very difficult problem
  Beware of incorrect trees
  Important to understand models and methods
  The programs are helpful tools
    The result is only as good as the alignment
Assumptions
  Basic concepts of evolutionary theory
    Relation to common ancestor
    Phylogeny represented by a bifurcating tree
    Mutations occur over evolutionary time
  Necessary to make phylogenetic inference possible
Interpretation
  Know your model
    Both evolutionary and for tree construction
  Know the assumptions of the model
    Evolution independent?
    Identical between sites?
    The same for all sequences?
  Are the sequences correct?
  And are they representative?
  And are they homologous?
  Is the multiple alignment correct?
  What you get out is no better than what you put in
Some biological pitfalls
  Don't make hasty conclusions!
  Does your tree contradict common sense?
    Then it's probably wrong!
  Differentiate between the homologs
  Orthologs
    Speciation, common ancestor, similar function
  Paralogs
    Gene duplication, within 1 organism, differing functions
  Xenologs
    Horizontal gene transfer, hard to tell, similar function
Software
Today we’ll look at the programs before the methods
Some programs for phylogenetic analysis
  A multiple alignment program:
    Clustal, T-Coffee, MAFFT, Muscle…
  A phylogenetic program:
    Phylip, PAUP*, MacClade, BioEdit…
  Visualizing the tree:
    TreeView, NJplot
PAUP*
  Commercial package
  Apparently good
  Many different tree-building and analysis methods
  But since we don't own a copy…
  Similarly: MacClade only works on Macintosh…
PHYLIP
  Free package
  Many programs
  Both distance and character based
  Bootstrapping possible
  But: It can be a little difficult
    No graphical user interface
    And you will need to run many programs
BioEdit
  Has phylogeny methods built in
  Can call Phylip routines
  No need for you to learn the command line
  But no bootstrapping… (as far as I know)
  Point and click:
    Select the sequences in the alignment
    Choose the wanted phylogeny
    Voila!
Constructing a tree
  To make a phylogenetic tree, four steps are needed:
    1. Perform multiple alignment
    2. Choose your model
    3. Build the tree
    4. Evaluate the quality
  A brief note:
    Ideally: Parallel alignment and phylogenetic inference
    Very difficult, but it has been pursued
1) The multiple alignment
  Already discussed
  Some notes:
  Recall that multiple alignment (MA) programs are not exact
    Some manual editing often necessary
  Consider the algorithm used
    Does it consider the phylogeny of the data?
    Clustal's guide tree: Not correct phylogeny
    What parameters are used?
  Solve ambiguities, remove near-identical sequences
    Gappy regions, identical sequences can bias the result
2) The model
  The model describes the data
    Evolutionary events
    Overall mutability
  Evolutionary model?
    Crucial, both for alignment and tree building
  Are you looking at nucleotides or amino acids?
    Where do we get most information?
  Know the basis for the chosen model
Nucleotide models
  Create 4×4 matrix
  Either fixed cost
    Character state
  Or rate matrices
    Probabilities
  Used for different kinds of tree estimations
  Include site specific information
    Third codon position more variable
Nucleotide model 1
  Fixed cost for transitions and transversions
    E.g. transversions are twice as costly as transitions
  For a tree: Count the number of transitions/transversions
    Calculate cost
  Tends to minimize the number of transversions
    Cluster transitions
A C G T
A - 2 1 2
C 2 - 2 1
G 1 2 - 2
T 2 1 2 -
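A minimal sketch of how the fixed cost matrix above could be applied to two aligned sequences. The function names and the example sequences are made up for illustration; the costs (transition 1, transversion 2) are the ones from the matrix.

```python
# Purines and pyrimidines: a change within one class is a transition,
# a change between the classes is a transversion.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_cost(x, y, transition=1, transversion=2):
    """Cost of changing nucleotide x into y under the fixed cost matrix."""
    if x == y:
        return 0
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


def pairwise_cost(seq1, seq2):
    """Sum the substitution costs over all columns of two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(substitution_cost(a, b) for a, b in zip(seq1, seq2))


# Hypothetical example: one transition (A/G) and one transversion (G/T).
example_cost = pairwise_cost("ACGT", "GCTT")
```

A tree-building method using this model would sum such costs along the branches of each candidate tree.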
Nucleotide model 2
  Simple substitution rate matrix
  Assume same rates A→B and B→A
  Assume all mutations equally likely: Rate α
  The Jukes-Cantor model
A C G T
A -3α α α α
C α -3α α α
G α α -3α α
T α α α -3α
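As a sketch of what this rate matrix implies, the code below builds it and evaluates the standard closed-form Jukes-Cantor substitution probabilities after time t. The function names and the value of α are illustrative assumptions, not part of the lecture.

```python
import math

BASES = "ACGT"


def jukes_cantor_rate_matrix(alpha):
    """Rate matrix with alpha off the diagonal and -3*alpha on it."""
    return [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]


def jukes_cantor_probability(x, y, alpha, t):
    """Probability that base x has become base y after time t."""
    decay = math.exp(-4.0 * alpha * t)
    if x == y:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


# Each row of a rate matrix sums to zero.
row_sums = [sum(row) for row in jukes_cantor_rate_matrix(alpha=0.1)]

# The probabilities of ending in A, C, G or T sum to one.
total = sum(jukes_cantor_probability("A", y, alpha=0.1, t=2.0) for y in BASES)
```

For very long times every base becomes equally likely (probability 0.25), which is one way to see why old mutations stop carrying information.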
Nucleotide model 3
A C G T
A -(α1+2α2) α2 α1 α2
C α2 -(α1+2α2) α2 α1
G α1 α2 -(α1+2α2) α2
T α2 α1 α2 -(α1+2α2)
  More advanced rate matrix
  Include transitions/transversions
    Rates α1 (transitions) and α2 (transversions)
    Each row sums to zero, as in the Jukes-Cantor matrix
  The Kimura 2-parameter model
Amino acid models
  A 20×20 substitution matrix
  The BLOSUM matrices
    Fixed cost matrices
  Or the PAM matrices
    Rate matrices
  Described last week
3) Building the tree
We have the sequences, the alignment and the model
  Find the best tree
    What is the best tree?
  Two main strategies:
  Distance based
    Look at dissimilarities (=distances)
  Character based
    Look at the data
Problems with trees
  The number of possible trees grows faster than exponentially
    For 15 taxa: 2.13·10^14 possible rooted trees…
  How to search?
    Branch and Bound
    Branch swapping
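The size of the search space can be checked with the standard counting formulas for bifurcating trees: (2n-3)!! rooted and (2n-5)!! unrooted topologies for n taxa. A small sketch (function names are illustrative):

```python
def count_rooted_trees(n):
    """Number of rooted bifurcating tree topologies for n >= 2 taxa: (2n-3)!!"""
    count = 1
    for k in range(3, 2 * n - 2, 2):  # 3 * 5 * ... * (2n-3)
        count *= k
    return count


def count_unrooted_trees(n):
    """Number of unrooted bifurcating tree topologies for n >= 3 taxa: (2n-5)!!"""
    count = 1
    for k in range(3, 2 * n - 4, 2):  # 3 * 5 * ... * (2n-5)
        count *= k
    return count


rooted_15 = count_rooted_trees(15)      # about 2.13e14
unrooted_15 = count_unrooted_trees(15)  # about 7.9e12
```

Even for 15 taxa, enumerating every tree is out of reach, which is why heuristic searches such as branch swapping are used.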
Rooting the tree
  Not a simple problem
    Most of the following methods produce unrooted trees (UPGMA, which assumes a molecular clock, gives a rooted one)
  Use an outgroup
  Midpoint of longest branch
Distance methods
  Some sequences more similar than others
  Closely related sequences should be close in the tree
  Abstract view on the data
    Loss of information is usually a bad sign
  Only use the distances between sequences
    Recall Clustal
  All methods start with a distance matrix
Distance methods
  Can we get the correct answer?
  Yes, if all mutation events were visible in the data
  But: After one mutation, the site is "saturated"
    Additional mutations do not give additional info
    A → B → C: Distance 2
    A → C: Distance 1
  And mutations back will fool the method
    A → B → A: Distance 2
    A → A: Distance 0
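One common way to compensate for hidden mutations is to correct the observed fraction of differing sites with a substitution model. The sketch below uses the standard Jukes-Cantor correction, d = -3/4 · ln(1 - 4p/3); it is an illustration under that model's assumptions, not a method prescribed in the lecture, and the function names are made up.

```python
import math


def p_distance(seq1, seq2):
    """Observed fraction of sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


def jukes_cantor_distance(p):
    """Estimated number of substitutions per site, given observed fraction p."""
    if p >= 0.75:
        raise ValueError("sequences are saturated; the distance is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


observed = p_distance("AGGCUCCAAA", "AGGUUCGAAA")  # 2 of 10 sites differ
corrected = jukes_cantor_distance(observed)        # larger than the observed value
```

The corrected distance is always at least as large as the observed one, and the gap widens as the sequences diverge.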
UPGMA
  Unweighted Pair Group Method with Arithmetic Mean
    Unweighted: The distances are used as they are
    Pair: Find the two closest elements
    Group: Put them together in a new group
    Arithmetic Mean: Gives distances from the new group
  Correct tree assuming a molecular clock
    Evolutionary divergence time can be found from mutations
    Mutation rates are constant
UPGMA illustrated
  Find two closest: A and D
  Create a new group [A+D]
  Update distances:
    $d_{[A+D],B} = \frac{d_{AB} + d_{DB}}{2} = \frac{8 + 6}{2} = 7$

  Original distances:
A B C D E
A - 8 3 2 5
B - - 5 6 6
C - - - 7 5
D - - - - 3
E - - - - -

  Updated distances:
A+D B C E
A+D - 7 5 4
B - - 5 6
C - - - 5
E - - - -

  Repeat for all sequences
  Next time: Connect [A+D] with E
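The clustering steps can be sketched in a few lines. This is a minimal illustration that uses the distance matrix from the example; the function name and the data layout (a dictionary keyed by pairs) are assumptions. When larger groups are merged, the general UPGMA update weights each group by its size, which reduces to the plain mean when two single sequences are joined.

```python
def upgma(names, dist):
    """Return the list of merges (a, b, distance) and all distances computed."""
    d = dict(dist)
    sizes = {name: 1 for name in names}
    clusters = list(names)
    merges = []
    while len(clusters) > 1:
        # Find the two closest clusters.
        pairs = [(x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]]
        a, b = min(pairs, key=lambda pair: d[frozenset(pair)])
        merged = f"[{a}+{b}]"
        merges.append((a, b, d[frozenset((a, b))]))
        # Distance from the new group: size-weighted arithmetic mean.
        for c in clusters:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes[a] + sizes[b]
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return merges, d


NAMES = ["A", "B", "C", "D", "E"]
DIST = {frozenset(pair): value for pair, value in {
    ("A", "B"): 8, ("A", "C"): 3, ("A", "D"): 2, ("A", "E"): 5,
    ("B", "C"): 5, ("B", "D"): 6, ("B", "E"): 6,
    ("C", "D"): 7, ("C", "E"): 5, ("D", "E"): 3,
}.items()}

merges, distances = upgma(NAMES, DIST)
```

The first merge joins A and D at distance 2, and the second joins the new group with E at distance 4, matching the worked example. Later steps contain ties, so their order depends on how ties are broken.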
Neighbour joining
  A little like UPGMA
  Difference: NJ does not assume a molecular clock
  But it assumes an additive tree
    Distance between two leaves is the sum of the edges
  Find the closest pair that is most apart from the rest of the tree
  Connect pair and update distances
    A little advanced: Take the overall distance to the rest of the tree into account
  Corrects for varying mutation rates
  Fast and can give good results
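The pair selection in neighbour joining is usually written as a criterion Q(i,j) = (n-2)·d(i,j) - Σ_k d(i,k) - Σ_k d(j,k), and the pair with the lowest Q is joined. The sketch below computes only this first step for the five-sequence distance matrix used in the UPGMA example; the function name and the data layout are illustrative assumptions.

```python
def nj_q_values(names, dist):
    """Neighbour-joining criterion Q for every pair of taxa."""
    n = len(names)
    # Total distance from each taxon to all others.
    total = {
        a: sum(dist[frozenset((a, b))] for b in names if b != a) for a in names
    }
    return {
        frozenset((a, b)): (n - 2) * dist[frozenset((a, b))] - total[a] - total[b]
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


NAMES = ["A", "B", "C", "D", "E"]
DIST = {frozenset(pair): value for pair, value in {
    ("A", "B"): 8, ("A", "C"): 3, ("A", "D"): 2, ("A", "E"): 5,
    ("B", "C"): 5, ("B", "D"): 6, ("B", "E"): 6,
    ("C", "D"): 7, ("C", "E"): 5, ("D", "E"): 3,
}.items()}

q = nj_q_values(NAMES, DIST)
best = min(q.values())
```

Subtracting the totals is what lets NJ prefer a pair that is close to each other but far from everything else, rather than simply the closest pair.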
Fitch-Margoliash
  FM method
  We have the pairwise distances
  Each branch in the tree has a length
  The length of all paths can be found
  Optimize tree by moving internal nodes around
  The best fit minimizes the overall error
    The minimum squared deviation
    $\sum_{i<j} (d_{ij} - p_{ij})^2$
    where $d_{ij}$ is the observed distance and $p_{ij}$ the path length in the tree
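A small sketch of the error that is being minimized, for the simplest possible case: three sequences joined in a star tree, where the path between two leaves is the sum of their two branch lengths. The distances and branch lengths are hypothetical numbers chosen for illustration.

```python
def star_tree_paths(branch):
    """Path lengths between leaves of a star tree with the given branch lengths."""
    names = sorted(branch)
    return {
        frozenset((a, b)): branch[a] + branch[b]
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


def squared_deviation(observed, paths):
    """Sum over all pairs of (observed distance - path length in tree)^2."""
    return sum((observed[pair] - paths[pair]) ** 2 for pair in observed)


# Hypothetical observed distances between three sequences.
observed = {
    frozenset(("A", "B")): 3,
    frozenset(("A", "C")): 4,
    frozenset(("B", "C")): 5,
}

perfect_fit = squared_deviation(observed, star_tree_paths({"A": 1, "B": 2, "C": 3}))
poor_fit = squared_deviation(observed, star_tree_paths({"A": 1, "B": 2, "C": 2}))
```

An optimizer would adjust the branch lengths (and, for larger trees, the topology) until this error stops decreasing.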
Minimum Evolution
  The ME method
  Find the shortest tree
    Count number of changes
  Similar to FM but only looks at branches
  FM: $\sum_{i<j} (d_{ij} - l_{ij})^2$
  [Figure: trees with leaves A and B, labelled FM and ME]
Character methods
  Use the data (the actual characters)
    All information at hand
  More advanced, slower, but also more accurate
  Maximum Parsimony (MP)
    Occam's razor: Simplest explanation
  Maximum Likelihood (ML)
    Advanced statistical method
    The tree that makes the data most probable under the model
Maximum parsimony
  How does evolution work?
  Assumption: Path of least resistance
    True evolution gives rise to fewest changes
  The tree we want:
    Describe the given sequences by fewest changes
    The ancestral nodes must be as similar as possible
  Predict a tree
    Count the number of changes needed
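Counting the changes needed for a given tree can be done with Fitch's algorithm, a standard method that the slides do not name explicitly. The sketch below assumes a rooted bifurcating tree written as nested tuples and scores two candidate trees on a small four-sequence alignment; the function names and the representation are illustrative.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum number of changes) for one column."""
    if isinstance(tree, str):  # a leaf: its state is known
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    common = left_set & right_set
    if common:  # children agree: no change needed here
        return common, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Total number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


ALIGNMENT = {
    "A": "AGGCUCCAAA",
    "B": "AGGUUCGAAA",
    "C": "AGCCCCGAAA",
    "D": "AUUUCCGAAC",
}

score_ab_cd = parsimony_score((("A", "B"), ("C", "D")), ALIGNMENT)
score_ad_bc = parsimony_score((("A", "D"), ("B", "C")), ALIGNMENT)
```

Maximum parsimony then prefers the tree with the lower score; here the tree grouping A with B needs fewer changes than the one grouping A with D.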
Maximum Likelihood
  Given the data, predict the most probable model
    Can optimize both tree and substitution model
  We know the sequences
  What are the most likely substitution rates?
    Estimate from the alignment (and the phylogeny)
  And what is the most likely tree?
    Estimate from alignment and substitution rates
  Computationally heavy and rather slow
  Normally good results
Maximum Likelihood
  General practice: Optimize model then tree
  Calculate probability for each alignment column
  Combine to probability for entire alignment
    Averages over low and high probability sites
  Likelihood of column given tree
  [Figure: the same tree with leaves A, A, C, drawn with different internal-node assignments (A, A), (C, A), (G, A), …]
  L = P(assignment 1) + P(assignment 2) + P(assignment 3) + …
Maximum likelihood
  Then repeat this for all possible tree topologies
  And all possible assignments to internal nodes
  And then choose the combination that gives the highest probability…
  Clearly very difficult
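For a single column and a single small tree, the sum over internal-node assignments can be written out by brute force. The sketch below assumes a rooted tree ((leaf1, leaf2), leaf3), the Jukes-Cantor model, equal base frequencies, and the same value of α·t on every branch; all of these are simplifying assumptions made for illustration.

```python
import math
from itertools import product

BASES = "ACGT"


def jc_probability(x, y, alpha_t):
    """Jukes-Cantor probability of x becoming y along a branch of length alpha*t."""
    decay = math.exp(-4.0 * alpha_t)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def column_likelihood(leaf1, leaf2, leaf3, alpha_t=0.1):
    """Sum the probability of the column over all internal-node assignments."""
    total = 0.0
    for root, inner in product(BASES, repeat=2):
        total += (
            0.25  # probability of the base at the root
            * jc_probability(root, inner, alpha_t)
            * jc_probability(inner, leaf1, alpha_t)
            * jc_probability(inner, leaf2, alpha_t)
            * jc_probability(root, leaf3, alpha_t)
        )
    return total


likelihood_aac = column_likelihood("A", "A", "C")
likelihood_aaa = column_likelihood("A", "A", "A")
```

The likelihood of a whole alignment is the product of the column likelihoods (in practice the sum of their logarithms). Real programs avoid the brute-force sum, but the quantity being computed is the same.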
Summary of methods
                       Distance                            Character based
Clustering             UPGMA, Neighbour Joining
Optimality criterion   Least Squares, Minimum evolution    Maximum parsimony, Maximum likelihood, (Bayesian statistics)
The differences
  Sometimes the differences can seem minimal
  They affect the tree, but the same result is possible
  UPGMA and NJ
    Minimize the overall length of the tree
  Maximum parsimony
    Finds tree with fewest changes
  Maximum likelihood
    Maximizes the probability of the data given the tree and the model
4) Evaluating trees
How good is the predicted tree?
  Some sequence variation needed
    Is the signal strong enough?
  There are so many possible trees
    Are there many trees similar to the prediction?
    Which one to choose?
  Is the tree robust?
    Does it change much when e.g. removing a sequence?
Randomization
  Is it possible that the tree is just random?
  Permute the columns of the alignment
    i.e. shuffle the characters in a column
  Build a new tree
    Is it (partly) identical?
  If the tree is just as likely to be random, then don't put too much faith in it
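The shuffling step could look like the sketch below, which permutes the characters within each column independently. The function name, the dictionary layout, and the fixed random seed are assumptions made for illustration.

```python
import random


def shuffle_within_columns(alignment, rng):
    """Shuffle the characters inside every column of an alignment."""
    names = list(alignment)
    columns = [list(column) for column in zip(*(alignment[name] for name in names))]
    for column in columns:
        rng.shuffle(column)  # which sequence holds which character is randomized
    return {
        name: "".join(column[i] for column in columns)
        for i, name in enumerate(names)
    }


ALIGNMENT = {
    "A": "AGGCUCCAAA",
    "B": "AGGUUCGAAA",
    "C": "AGCCCCGAAA",
    "D": "AUUUCCGAAC",
}

randomized = shuffle_within_columns(ALIGNMENT, random.Random(0))
```

Each column keeps its composition, but the link between columns (the phylogenetic signal) is destroyed, so a tree built from the result serves as a random baseline.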
Bootstrapping
  The story of Baron von Münchausen
    He pulled himself out of a swamp by his bootstraps
  The idea: Evaluate the quality of the result using the same data all over again
  Make a large number of new datasets
  Create a phylogenetic tree for each
  Observe the number of times each clade is made
Bootstrapping
  The datasets should be similar
    Thereby: The trees are comparable
  Alignments of same size (length and sequences)
  Non-parametric: Sample with replacement
    Choose a random column and add it to the new alignment
  Parametric: Simulate new datasets
    Use a model that looks like your data
  Characteristics are preserved (unlike randomization)
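Non-parametric resampling is short enough to sketch directly: draw as many columns as the alignment has, with replacement, and build a new alignment of the same size. The function name, the dictionary layout, and the fixed random seed are illustrative assumptions.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement; return the new alignment and the picks."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    resampled = {
        name: "".join(seq[i] for i in picks) for name, seq in alignment.items()
    }
    return resampled, picks


ALIGNMENT = {
    "A": "AGGCUCCAAA",
    "B": "AGGUUCGAAA",
    "C": "AGCCCCGAAA",
    "D": "AUUUCCGAAC",
}

replicate, picks = bootstrap_alignment(ALIGNMENT, random.Random(1))
```

Because whole columns are copied, every column in the replicate is a real column from the data; only how often each column appears changes.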
Bootstrap example
  Non-parametric bootstrapping
  We have an alignment (# = number of times each column is drawn):
    A: A G G C U C C A A A
    B: A G G U U C G A A A
    C: A G C C C C G A A A
    D: A U U U C C G A A C
    #: 0 1 2 0 3 0 1 2 0 1
  Sample columns:
    A: G G G U U U C A A A
    B: G G G U U U G A A A
    C: G C C C C C G A A A
    D: U U U C C C G A A C
  Distances (number of differing positions):
A B C D
A - - - -
B 1 - - -
C 6 5 - -
D 8 7 4 -
  [Figure: tree with leaves A, B, C, D]
Bootstrap example
  Sample 2:
    A: A U U C C C C A A A
    B: A U U C C G G A A A
    C: A C C C C G G A A A
    D: A C C C C G G C C C
  Distances:
A B C D
A - - - -
B 2 - - -
C 4 2 - -
D 7 5 3 -
  [Figure: tree with leaves A, B, C, D]
Bootstrap example
  Sample 3:
    A: A C C C A A G G C C
    B: A C C G A A G G U U
    C: A C C G A A C C C C
    D: A C C G C C U U U U
  Distances:
A B C D
A - - - -
B 3 - - -
C 3 4 - -
D 7 4 6 -
  [Figure: tree with leaves A, B, C, D]
Bootstrap example
  Calculate consensus tree
    Can be done in many ways
  Put the bootstrap number at each branch point
    The proportion of times this branch is observed
  Of course, more than three samples needed
  [Figure: consensus tree with leaves A, B, C, D and bootstrap values 1.0 and 0.66]
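Bootstrap values are simply proportions, which the sketch below computes from a list of replicate trees. Each tree is represented as a set of clades (groups of leaf names), and the three trees are hypothetical, chosen only so that one clade appears in every replicate and another in two out of three.

```python
from collections import Counter


def clade_support(trees):
    """Fraction of replicate trees in which each clade appears."""
    counts = Counter(clade for tree in trees for clade in set(tree))
    return {clade: count / len(trees) for clade, count in counts.items()}


# Hypothetical bootstrap trees, each written as the clades it contains.
REPLICATES = [
    [frozenset("AB"), frozenset("ABC")],
    [frozenset("AB"), frozenset("ABC")],
    [frozenset("AB"), frozenset("ABD")],
]

support = clade_support(REPLICATES)
```

With real data the list would hold hundreds or thousands of trees, and the resulting proportions would be written at the branch points of the consensus tree.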
Summary
  What is phylogenetic inference?
  What can a phylogenetic tree be used for?
  Be aware of the multiple alignment
  The different models
  Tree building methods: NJ, UPGMA, ML and MP
  Evaluating trees: Bootstrapping
  Programs: Phylip, PAUP*, PhyloWin and BioEdit
Next time: Gene finding (with Anders Krogh)
Then RNA structure prediction with me again