
Probabilistic methods for phylogenetic trees (Part 2)

Sushmita Roy
BMI/CS 576
www.biostat.wisc.edu/bmi576/
[email protected]

Oct 7th, 2014


RECAP

• Probabilistic methods for phylogenetic tree construction
  – P(data|tree)
  – Maximum likelihood

• Felsenstein algorithm for computing the likelihood of a sequence given a tree
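As a reminder of how the recursion works, here is a minimal single-site sketch of Felsenstein's pruning algorithm. It is an illustration under stated assumptions, not the course's reference implementation: the nested-tuple tree encoding, the function names, the use of Jukes Cantor probabilities with rate `alpha`, and the uniform root distribution are all choices made for this example.

```python
import math

BASES = "ATGC"


def jc_prob(a, b, t, alpha=1.0):
    """P(b|a,t) under Jukes Cantor (assumed model for this sketch)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def conditional_likelihoods(node):
    """Return {a: P(leaves below node | node has state a)}.

    A leaf is a single character; an internal node is the tuple
    (left_child, t_left, right_child, t_right). This encoding is hypothetical.
    """
    if isinstance(node, str):
        return {a: 1.0 if a == node else 0.0 for a in BASES}
    left, t_left, right, t_right = node
    below_left = conditional_likelihoods(left)
    below_right = conditional_likelihoods(right)
    return {
        a: sum(jc_prob(a, b, t_left) * below_left[b] for b in BASES)
        * sum(jc_prob(a, c, t_right) * below_right[c] for c in BASES)
        for a in BASES
    }


def site_likelihood(tree):
    """Sum over root states, weighting each by the uniform prior 1/4."""
    table = conditional_likelihoods(tree)
    return sum(0.25 * table[a] for a in BASES)


# Three leaves: (A, G) joined first, then C.
example_tree = (("A", 0.1, "G", 0.1), 0.2, "C", 0.3)
example_likelihood = site_likelihood(example_tree)
```

For a full alignment, one would multiply the per-site likelihoods (or add their logs) across columns, assuming sites evolve independently.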

Probabilistic models of evolution

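• The probability of a character switching from a to b along a branch of length t, P(b|a,t), is captured by the substitution matrix S(t)

• For example, for DNA (rows give the starting base a, columns the ending base b) this is:

$$
S(t) = \begin{pmatrix}
P(A|A,t) & P(T|A,t) & P(G|A,t) & P(C|A,t) \\
P(A|T,t) & P(T|T,t) & P(G|T,t) & P(C|T,t) \\
P(A|G,t) & P(T|G,t) & P(G|G,t) & P(C|G,t) \\
P(A|C,t) & P(T|C,t) & P(G|C,t) & P(C|C,t)
\end{pmatrix}
$$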

Defining the conditional probability distributions

• If we consider t to be evolutionary time, these conditional probabilities can be obtained from what is called a continuous-time Markov process

• Such processes are defined by a K-by-K rate matrix R
  – Each entry of R, R(a,b), gives the rate of substitution from a to b

• The time spent in any state (character) is exponentially distributed

• If we have R, S(t) can be obtained from R
  – Using the theory of continuous-time Markov processes
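For a time-homogeneous continuous-time Markov process, the standard result is that the substitution matrix is the matrix exponential of the rate matrix, S(t) = exp(Rt). The sketch below approximates that exponential in plain Python by scaling and squaring a truncated Taylor series. The function names, the default `squarings` and `terms` values, and the example rate of 0.1 are assumptions made for illustration; in practice a library routine for the matrix exponential would be the usual choice.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(M, squarings=10, terms=12):
    """Approximate exp(M) by scaling and squaring a truncated Taylor series."""
    n = len(M)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    scaled = [[x / 2 ** squarings for x in row] for row in M]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # After this step, term holds scaled^k / k!.
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + x for r, x in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    # Undo the scaling: exp(M) = exp(M / 2^s)^(2^s).
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def substitution_matrix(R, t):
    """Return S(t) = exp(R * t) for rate matrix R and branch length t."""
    return mat_exp([[r * t for r in row] for row in R])


# Hypothetical 4-state rate matrix: every off-diagonal rate is 0.1 and each
# diagonal entry makes its row sum to zero.
RATE = 0.1
R = [[-3 * RATE if i == j else RATE for j in range(4)] for i in range(4)]
S = substitution_matrix(R, t=1.0)
```

Each row of `S` is a probability distribution over ending characters, and `substitution_matrix(R, 0.0)` is the identity, since no time means no change.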

Rate matrices

• A rate matrix R
  – Is a K-by-K matrix where K is the size of our alphabet
  – E.g. for DNA, K=4

• Different rate matrices make different assumptions about substitutions
  – Jukes Cantor: all substitutions have the same rate.
  – Kimura: transitions (A<->G, C<->T) and transversions (A<->C, A<->T, G<->C, G<->T) have different rates.
  – Hasegawa, Kishino, Yano (HKY): transitions and transversions have different rates and base frequencies may be unequal, so the rate of a substitution also depends on the base being substituted in.
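To make the difference between these assumptions concrete, the following sketch builds a Kimura-style rate matrix with one rate for transitions and another for transversions. The parameter names `alpha` and `beta`, the A, T, G, C ordering, and the helper names are assumptions for this example. Setting the two rates equal recovers the Jukes Cantor case.

```python
BASES = "ATGC"
PURINES = {"A", "G"}  # the remaining bases, C and T, are pyrimidines


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return a != b and ((a in PURINES) == (b in PURINES))


def kimura_rate_matrix(alpha, beta):
    """Rate matrix with transition rate alpha and transversion rate beta."""
    R = []
    for i, a in enumerate(BASES):
        row = [0.0] * len(BASES)
        for j, b in enumerate(BASES):
            if a != b:
                row[j] = alpha if is_transition(a, b) else beta
        # The diagonal entry makes each row sum to zero.
        row[i] = -sum(row)
        R.append(row)
    return R


R_kimura = kimura_rate_matrix(alpha=0.2, beta=0.05)
R_jukes_cantor = kimura_rate_matrix(alpha=0.1, beta=0.1)
```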

Jukes Cantor Rate matrix

• Simplest possible rate matrix for DNA sequence evolution

• Assumes all bases change at the same rate

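$$
R = \begin{array}{c|cccc}
  & A & T & G & C \\ \hline
A & -3\alpha & \alpha & \alpha & \alpha \\
T & \alpha & -3\alpha & \alpha & \alpha \\
G & \alpha & \alpha & -3\alpha & \alpha \\
C & \alpha & \alpha & \alpha & -3\alpha
\end{array}
$$

where α is the common rate of substitution between any two distinct bases.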

Conditional probabilities from Jukes Cantor

• The conditional probability matrix, P(a|b,t) has a similar form as the rate matrix

Equilibrium distribution: ¼ for all bases

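$$
S(t) = \begin{array}{c|cccc}
  & A & T & G & C \\ \hline
A & r_t & s_t & s_t & s_t \\
T & s_t & r_t & s_t & s_t \\
G & s_t & s_t & r_t & s_t \\
C & s_t & s_t & s_t & r_t
\end{array}
$$

with $r_t = \frac{1}{4}\left(1 + 3e^{-4\alpha t}\right)$ and $s_t = \frac{1}{4}\left(1 - e^{-4\alpha t}\right)$, where α is the single substitution rate of the Jukes Cantor rate matrix. For example, P(G|C,t) = s_t.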
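Under Jukes Cantor with substitution rate α, the standard closed forms are ¼(1 + 3e^(−4αt)) for staying at the same base and ¼(1 − e^(−4αt)) for each specific change. The short sketch below evaluates them; the function name `jc_prob` and the example values of `t` and `alpha` are assumptions for illustration.

```python
import math

BASES = "ATGC"


def jc_prob(a, b, t, alpha):
    """Jukes Cantor probability of going between bases a and b in time t."""
    decay = math.exp(-4.0 * alpha * t)
    if a == b:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


# P(G|C,t) for a short and a very long branch (example values only).
p_short = jc_prob("C", "G", t=0.1, alpha=1.0)
p_long = jc_prob("C", "G", t=100.0, alpha=1.0)
```

On a very long branch every entry approaches ¼, which is the equilibrium distribution noted above.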

Searching phylogenetic tree space with maximum likelihood

• As in the maximum parsimony case we need to
  – Score a tree
  – Search over the space of possible trees

• Score a given tree
  – Branch lengths are parameters
  – Estimate the branch lengths to maximize the likelihood of the data given the tree

• Search over trees
  – Start with an initial tree
    • A greedy approach of adding a branch that maximizes the likelihood
    • Neighbor Joining
  – Revisit using nearest neighbor interchange or subtree grafting approaches until convergence
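The simplest instance of "estimate the branch lengths to maximize the likelihood" is a tree with just two leaves and one branch between them. The sketch below, assuming a Jukes Cantor model with the branch length `d` measured in expected substitutions per site, finds the maximizing `d` numerically with a golden-section search and compares it with the known closed form d = −¾ ln(1 − 4p/3), where p is the fraction of differing sites. The function names and example sequences are made up for this illustration.

```python
import math


def jc_log_likelihood(seq_x, seq_y, d):
    """Log-likelihood of two aligned sequences separated by branch length d."""
    decay = math.exp(-4.0 * d / 3.0)
    same = 0.25 + 0.75 * decay
    diff = 0.25 - 0.25 * decay
    # 0.25 is the equilibrium probability of the base in seq_x.
    return sum(math.log(0.25 * (same if x == y else diff))
               for x, y in zip(seq_x, seq_y))


def golden_section_max(f, lo, hi, tol=1e-7):
    """Locate the maximum of a unimodal function f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d
        else:
            a = c
    return (a + b) / 2.0


seq_x = "ACGTACGTAC"
seq_y = "ACGTTCGTAA"  # differs from seq_x at 2 of 10 sites

d_hat = golden_section_max(lambda d: jc_log_likelihood(seq_x, seq_y, d),
                           1e-6, 5.0)
p = sum(x != y for x, y in zip(seq_x, seq_y)) / len(seq_x)
d_closed_form = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

On a larger tree there is no closed form, so each branch length is typically optimized numerically, with the likelihood recomputed by Felsenstein's algorithm at each step.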

Some advantages of probabilistic approaches

• Probabilistic models can be naturally extended to more realistic models
  – Model site-specific parameters
  – Model gaps

• A probabilistic framework can be used to evaluate different models of varying complexity (more parameters)
  – Different evolutionary models
  – Easily combined with other probabilistic models
    • Hidden Markov models

Modeling site-specific parameters

• Recall we had assumed that the substitution probabilities at each site are the same

• This could be relaxed by introducing an additional parameter r_u for each site u
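One common way to use such a per-site parameter, assumed here because the slide does not spell out the form, is to treat r_u as a rate that rescales the branch length, so that site u uses P(a|b, r_u·t). The sketch below applies this to Jukes Cantor probabilities; the function names and the example rates are hypothetical.

```python
import math


def jc_prob(a, b, t, alpha=1.0):
    """Jukes Cantor probability of going between bases a and b in time t."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def site_scaled_prob(a, b, t, r_u):
    """Assumed form: the site parameter r_u rescales the branch length t."""
    return jc_prob(a, b, r_u * t)


# Example rates for a slow, an average, and a fast site (made-up values).
site_rates = [0.2, 1.0, 3.0]
stay_probs = [site_scaled_prob("A", "A", 0.1, r) for r in site_rates]
```

A slowly evolving (conserved) site keeps its base with higher probability than a fast site on the same branch.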

Probabilistic interpretation of Parsimony

• Recall P(a|b,t) is the key quantity of interest

• Replace P(a|b,t) by P(a|b) and use -log P(a|b) as the score

• Applying the weighted parsimony algorithm on this score to get the minimal cost tree will give an approximation to the likelihood
  – The one associated with the most likely assignment of the ancestral states
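The connection can be sketched with a small weighted parsimony (Sankoff-style) recursion that uses -log P(a|b) as the substitution cost. The minimal cost then equals the negative log probability of the most likely assignment of ancestral states, ignoring any prior on the root. The values of P(a|b) used here (0.7 for no change, 0.1 for each change), the nested-tuple tree encoding, and the function names are assumptions for illustration.

```python
import math

BASES = "ATGC"


def neg_log_costs(p_same=0.7):
    """Costs -log P(a|b) for a hypothetical branch-independent P(a|b)."""
    p_diff = (1.0 - p_same) / 3.0
    return {(a, b): -math.log(p_same if a == b else p_diff)
            for a in BASES for b in BASES}


def sankoff(tree, cost):
    """Return {a: minimal cost of the subtree if its root has state a}.

    A leaf is a single character; an internal node is a tuple of children.
    """
    if isinstance(tree, str):
        return {a: 0.0 if a == tree else math.inf for a in BASES}
    child_tables = [sankoff(child, cost) for child in tree]
    return {
        a: sum(min(cost[(a, b)] + table[b] for b in BASES)
               for table in child_tables)
        for a in BASES
    }


cost = neg_log_costs()
tree = (("A", "A"), ("A", "G"))
root_table = sankoff(tree, cost)
best_state = min(root_table, key=root_table.get)
best_cost = root_table[best_state]
```

The full likelihood sums over all ancestral assignments, whereas this score keeps only the single best one, which is why it is an approximation.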

Bootstrap: Assessing reliability of phylogenetic trees

• Bootstrap: a computational strategy used to assess confidence in an estimated quantity
  – E.g. branch length
  – Tree branching topology

• Generate a set of trees, {T1,…,TN}, from N random samples of the data
  – Sample columns/sites with replacement
  – Reconstruct a tree from the sampled columns

• One can estimate the confidence of any tree feature based on the proportion of trees in {T1,…,TN} in which the feature is seen
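The resampling loop can be sketched as follows. To keep the example short, full tree reconstruction is replaced by a stand-in feature: the pair of sequences with the smallest Hamming distance, which is the first pair a simple distance method would join. That stand-in, the function names, the toy alignment, and the default of 200 replicates are all assumptions for illustration.

```python
import random
from itertools import combinations


def resample_columns(alignment, rng):
    """Sample alignment columns with replacement, keeping the same length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def closest_pair(alignment):
    """Stand-in tree feature: the two sequences with fewest differences."""
    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))
    return min(combinations(sorted(alignment), 2),
               key=lambda pair: hamming(alignment[pair[0]],
                                        alignment[pair[1]]))


def bootstrap_support(alignment, feature, n_replicates=200, seed=0):
    """Fraction of bootstrap replicates that reproduce the observed feature."""
    rng = random.Random(seed)
    observed = feature(alignment)
    hits = sum(feature(resample_columns(alignment, rng)) == observed
               for _ in range(n_replicates))
    return hits / n_replicates


# Toy alignment with made-up sequences.
alignment = {
    "human": "ACGTACGTACGT",
    "chimp": "ACGTACGTACGA",
    "mouse": "ACGAACTTACTA",
}
support = bootstrap_support(alignment, closest_pair)
```

With a real tree-building method, `feature` would instead return something like the set of leaves under a given internal branch.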

Example of bootstrap

Ziheng Yang and Bruce Rannala, Nature Reviews Genetics 2012

Some common phylogenetic tree construction algorithms

• PhyML
  – Maximum likelihood, nearest neighbor interchange, subtree pruning and regrafting

• RAxML (Randomized Axelerated Maximum Likelihood)
  – Exists in both sequential and parallel versions
  – Also does subtree pruning and regrafting

• PHYLIP (from Felsenstein)
  – Package for distance-based, parsimony, ML methods

• BEAST (Bayesian)
  – MCMC-based sampling

• MrBayes (Bayesian)
  – MCMC-based sampling

• Visit here for more: http://evolution.genetics.washington.edu/phylip/software.html

Comments about phylogenetic tree construction

• Which method to pick?
  – Neighbor joining: fast, constructs the right tree if the distances are additive
  – Parsimony: does not make any assumption about distances
  – Probabilistic:
    • More principled, provides a systematic framework to estimate model parameters
    • Enables us to quantify uncertainty in the model, evaluate different models of evolution
    • If ML distances are additive, NJ can construct the right tree
    • If branch lengths are ignored, weighted parsimony and maximum likelihood are equivalent

• Search space may be large, but
  – can find the optimal tree efficiently in some cases
  – heuristic methods can be applied

• Difficult to evaluate inferred phylogenies: ground truth not usually known
  – can look at agreement across different sources of evidence
  – can look at repeatability across subsamples of the data (bootstrap)
  – can look at indirect predictions, e.g. conservation of sites in proteins

• Methods could be assessed using a simulation framework based on a probabilistic model of phylogenies

• Phylogenies for bacteria and viruses are not so straightforward because of lateral transfer of genetic material
