
Maximum likelihood

14.4. Tue Introduction to models (Jarno)
16.4. Thu Distance-based methods (Jarno)
17.4. Fri ML analyses (Jarno)
20.4. Mon Assessing hypotheses (Jarno)
21.4. Tue Problems with molecular data (Jarno)
23.4. Thu Problems with molecular data (Jarno) Phylogenomics
24.4. Fri Search algorithms, visualization, and other computational aspects (Jarno)

Schedule

Maximum likelihood methods of phylogenetic inference evaluate a hypothesis about evolutionary history (the branching order and branch lengths of a tree) in terms of the probability that a proposed model of the evolutionary process and the hypothesised history (tree) would give rise to the data we observe.

Maximum Likelihood

The probability, P, of the data (D), given the hypothesis (H):
◦ L = P(D | H)
◦ D: observed data (aligned sequences)
◦ H: tree topology, branch lengths and model of evolution

Likelihood of a hypothesis

In statistical usage, a distinction is made depending on the roles of the outcome or parameter.

Probability is used when describing a function of the outcome given a fixed parameter value. For example, if a coin is flipped 10 times and it is a fair coin, what is the probability of it landing heads-up every time?

Likelihood is used when describing a function of a parameter given an outcome. For example, if a coin is flipped 10 times and it has landed heads-up 10 times, what is the likelihood that the coin is fair? [Wikipedia, article on likelihood]

Likelihood or probability? What's the difference?

An optimality criterion (as is parsimony)

Given a model and data, we can evaluate a tree

We can choose between trees based on the likelihood of a given tree

The tree(s) with the highest likelihood is the best

Maximum Likelihood

[Figure: relationships among the nucleotide substitution models JC, F81, K2P, HKY85, F84, K3ST, TrN, SYM and GTR, arranged by equal versus variable base frequencies and by the number of substitution types (from a single substitution type upwards).]

Maximum Likelihood estimates parameter values of an explicit model from observed data

Likelihood provides ways of evaluating models in terms of their log likelihoods

Different trees can also be evaluated for their fit to the data under a particular model (likelihood ratio tests of two trees after Kishino & Hasegawa)

Maximum Likelihood

Let's toss a coin ten times (n). It lands heads up 4 times (x) and tails up 6 times. What is the probability of a head in a single toss?
◦ Compare: What is the likelihood of the data given the process?

Naturally, p̂ = x / n = 4 / 10 = 0.4

This is also the maximum likelihood estimate of p. Let's see why...

Likelihood function, example

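Coin toss is a binomial process:
◦ Pr(X = x | n, p) = C(n, x) * p^x * (1 - p)^(n - x), where C(n, x) = n! / (x! (n - x)!)

The likelihood function then becomes:
◦ L(p | x, n) = C(n, x) * p^x * (1 - p)^(n - x)

Note: in the binomial formula X is the unknown, whereas in the likelihood function p is the unknown (because we have the data, the coin tosses).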

The likelihood function can be maximized analytically or by "brute force".

For example, the result for p = 0.4 is:
◦ L = 210 * 0.4^4 * 0.6^6 = 0.2508227
◦ logL = log(L) = -1.383009
◦ -logL = 1.383009

Analytically, the maximum of the function is the point where the derivative of the likelihood function is zero and the second derivative is negative.
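Both routes to the maximum can be compared in a few lines of R. This is only a sketch: the helper name log_lik is made up for illustration, and optimize() stands in for whatever "brute force" search one prefers.

```r
# Binomial log-likelihood of p for x heads in n tosses.
log_lik <- function(p, x, n) dbinom(x, size = n, prob = p, log = TRUE)

x <- 4
n <- 10

# Numerical ("brute force") search for the maximum on the interval (0, 1).
fit <- optimize(log_lik, interval = c(0, 1), x = x, n = n, maximum = TRUE)

# Analytical solution: d logL / dp = x / p - (n - x) / (1 - p) = 0 gives p = x / n.
# The second derivative, -x / p^2 - (n - x) / (1 - p)^2, is negative, so it is a maximum.
p_analytic <- x / n

print(c(numerical = fit$maximum, analytical = p_analytic))
```

The two values should agree up to the tolerance of the numerical search.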

Graphically...

[Figure: likelihood plotted against p; the peak of the curve marks the maximum likelihood estimator of p.]

Maximum Likelihood

[Figure: likelihood plotted against μ1, contrasting a precise estimate with an imprecise estimate.]

Maximum Likelihood

    # Binomial likelihood and log-likelihood of p on a grid, from the formula
    l <- function(x, n) {
      p <- seq(0, 1, 0.01)
      L <- choose(n, x) * p^x * (1 - p)^(n - x)
      data.frame(p = p, L = L, logL = log(L))
    }
    plot(l(4, 10)[, c(1, 3)], ylim = c(-30, 0), type = "l")

    # The same log-likelihood using the built-in binomial density
    l2 <- function(x, n) {
      p <- seq(0, 1, 0.01)
      logL <- dbinom(x, size = n, prob = p, log = TRUE)
      data.frame(p = p, logL = logL)
    }
    plot(l2(4, 10), type = "l")

R code

[Figure: the log-likelihood curve drawn by plot(l2(4, 10), type = "l").]

R example result

Why log likelihood?

L(p = 0.99 | x = 4, n = 10) = 0.0000000002017251
logL(p = 0.99 | x = 4, n = 10) = -22.324115

◦ When you multiply very small values together, the result is even smaller, and at some point the precision disappears (a restriction of computers)

◦ The same does not happen with log values:
L = 210 * 0.4^4 * 0.6^6 = 0.2508227
logL = log(210) + 4*log(0.4) + 6*log(0.6) = -1.383009
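The loss of precision is easy to reproduce. The sketch below uses made-up numbers (400 sites, each with a site likelihood of 0.001) purely to show the effect.

```r
# 400 hypothetical sites, each with a site likelihood of 1e-3.
site_lik <- rep(1e-3, 400)

# The true product, 1e-1200, is far below the smallest representable double,
# so the multiplication underflows to exactly 0.
raw_product <- prod(site_lik)

# Summing logs keeps the same information in an ordinary number.
log_sum <- sum(log(site_lik))

print(c(product = raw_product, sum_of_logs = log_sum))
```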

Likelihood format, why logs?


DNA sequences can be thought of as four-sided dice.

Thus, the previous coin example can be straightforwardly generalized to DNA sequences.

DNA sequences

1 CGAGAC
2 AGCGAC
3 AGATTA
4 GGATAG

What is the probability that unrooted Tree A (rather than another tree) could have generated the data shown under our chosen model?

Maximum likelihood tree reconstruction

[Figure: unrooted Tree A with taxa 1, 2, 3 and 4 at the tips.]

Tree A

Site j (the last column of the alignment):
1 CGAGA C
2 AGCGA C
3 AGATT A
4 GGATA G

The likelihood for a particular site j is the sum of the probabilities of every possible reconstruction of ancestral states under a chosen model

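4 x 4 possibilities

[Figure: Tree A with the states observed at site j (C, C, A, G) at the tips. Each of the two internal nodes, marked "?", can be A, C, G or T.]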

Stationarity!
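The sum over the 4 x 4 ancestral reconstructions can be written out directly. The sketch below rests on assumptions that are not on the slides: Tree A is taken to be ((1,2),(3,4)), all five branches get length 0.1 expected substitutions per site, and the model is Jukes-Cantor in its standard closed form. The function names jc_p and site_lik are made up for illustration.

```r
bases <- c("A", "C", "G", "T")

# Jukes-Cantor transition probabilities for a branch of length t
# (expected substitutions per site).
jc_p <- function(t) {
  e <- exp(-4 * t / 3)
  P <- matrix((1 - e) / 4, 4, 4, dimnames = list(bases, bases))
  diag(P) <- (1 + 3 * e) / 4
  P
}

# Likelihood of one site on the assumed unrooted tree ((1,2),(3,4)):
# tips 1 and 2 attach to internal node x, tips 3 and 4 to internal node y.
site_lik <- function(tips, t = 0.1) {
  P <- jc_p(t)
  total <- 0
  for (x in bases) {
    for (y in bases) {
      total <- total +
        0.25 *                              # stationary frequency of state x
        P[x, tips[1]] * P[x, tips[2]] *     # branches to tips 1 and 2
        P[x, y] *                           # internal branch
        P[y, tips[3]] * P[y, tips[4]]       # branches to tips 3 and 4
    }
  }
  total
}

# Site j of the alignment: C, C, A, G.
L_j <- site_lik(c("C", "C", "A", "G"))
print(L_j)
```

Because the base frequencies are stationary (0.25 each), the sum can start from either internal node and give the same value.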

Rate matrix (every substitution occurs at the same rate α):

     A   C   G   T
A    -   α   α   α
C    α   -   α   α
G    α   α   -   α
T    α   α   α   -

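Site j (the fifth column of the alignment):
1 CGAG A C
2 AGCG A C
3 AGAT T A
4 GGAT A G

[Figure: Tree A with the states observed at this site (A, A, T, A) at the tips; each of the two internal nodes, marked "?", can be A, C, G or T, and the same rate matrix applies.]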

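Assume a Jukes-Cantor model (all nucleotide frequencies are equal). Further assume that the branch length is 0.1.

Then we can generate a so-called P-matrix from the Jukes-Cantor model's Q-matrix.

Its entries are the probabilities of a nucleotide changing to some other nucleotide (or staying the same) along the branch. In the two-taxon example, the probability of no change is 0.9815 and the probability of each specific change is 0.0062.

Likelihood of a tree, example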

A: acct
B: gcct

L = (0.25 * 0.0062)^1 * (0.25 * 0.9815)^3 = 2.289932e-05

log10(L) = -4.64 (natural log: ln(L) = -10.68)
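The same calculation can be reproduced site by site. This is a sketch only: the P matrix is filled with the two probabilities used in the calculation above (0.9815 and 0.0062), and the variable names are made up for illustration.

```r
bases <- c("a", "c", "g", "t")

# P matrix with the two values used in the calculation:
# 0.9815 for no change, 0.0062 for each specific change.
P <- matrix(0.0062, 4, 4, dimnames = list(bases, bases))
diag(P) <- 0.9815

seq_a <- strsplit("acct", "")[[1]]
seq_b <- strsplit("gcct", "")[[1]]

# Per-site likelihood: base frequency (0.25) times the probability of
# going from the base in A to the base in B.
site_lik <- 0.25 * P[cbind(seq_a, seq_b)]

L <- prod(site_lik)
print(c(L = L, lnL = log(L), log10L = log10(L)))
```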

For longer branches, the P matrix can be multiplied by itself k times. This gives the P matrix for a branch k times as long.

A branch length can be optimized by finding the value that maximizes the likelihood.
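Both statements can be checked numerically. The sketch below assumes the standard closed-form Jukes-Cantor probabilities with branch length measured in expected substitutions per site, so its numbers are not those of the P matrix in the two-taxon example; jc_p and log_lik are illustrative names.

```r
# Jukes-Cantor transition probabilities for branch length t
# (expected substitutions per site).
jc_p <- function(t) {
  e <- exp(-4 * t / 3)
  P <- matrix((1 - e) / 4, 4, 4)
  diag(P) <- (1 + 3 * e) / 4
  P
}

# Multiplying a P matrix by itself gives the matrix for a branch twice as long.
doubling_holds <- isTRUE(all.equal(jc_p(0.1) %*% jc_p(0.1), jc_p(0.2)))

# Log-likelihood of branch length t for two sequences that differ
# at n_diff of n_sites positions.
log_lik <- function(t, n_diff, n_sites) {
  P <- jc_p(t)
  n_diff * log(0.25 * P[1, 2]) + (n_sites - n_diff) * log(0.25 * P[1, 1])
}

# acct versus gcct: 1 difference in 4 sites.
fit <- optimize(log_lik, interval = c(1e-6, 5), n_diff = 1, n_sites = 4,
                maximum = TRUE)

# Closed-form Jukes-Cantor distance for the same data, for comparison.
jc_distance <- -0.75 * log(1 - 4 * (1 / 4) / 3)

print(c(optimized = fit$maximum, closed_form = jc_distance))
```

The optimized branch length should match the closed-form distance, which is what makes the two-taxon case a convenient check.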

Likelihood, two taxon case


Depending on the software, each iteration of the tree optimization algorithm has to do the following for a given tree topology:

◦ Calculate the likelihood of the tree topology given the model and the observed data

◦ Estimate the optimal branch lengths

Possibly a huge number of calculations

The likelihood of Tree A is the product of the likelihoods at each site

The likelihood is usually evaluated by summing the logs of the site likelihoods (because the probabilities are so small) and reported as the log likelihood of the full tree

The maximum likelihood tree is the one with the highest likelihood (it might not be Tree A, i.e. it could be another tree topology)
◦ Note: highest likelihood (largest value) = the largest logL (closest to zero) = the smallest –logL (closest to zero)
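Comparing topologies by their summed site log likelihoods can be sketched for the four-sequence alignment used earlier. Everything beyond the alignment is an assumption: a Jukes-Cantor model, all branch lengths fixed at 0.1 rather than optimized, and illustrative function names (jc_p, site_lik, tree_log_lik). A real analysis would optimize the branch lengths for each topology.

```r
bases <- c("A", "C", "G", "T")
aln <- rbind(strsplit("CGAGAC", "")[[1]],
             strsplit("AGCGAC", "")[[1]],
             strsplit("AGATTA", "")[[1]],
             strsplit("GGATAG", "")[[1]])

# Jukes-Cantor transition probabilities for branch length t.
jc_p <- function(t) {
  e <- exp(-4 * t / 3)
  P <- matrix((1 - e) / 4, 4, 4, dimnames = list(bases, bases))
  diag(P) <- (1 + 3 * e) / 4
  P
}

# Site likelihood on a four-taxon tree that groups tips 1 and 2
# against tips 3 and 4, summed over both internal nodes.
site_lik <- function(tips, P) {
  total <- 0
  for (x in bases) for (y in bases) {
    total <- total + 0.25 * P[x, tips[1]] * P[x, tips[2]] * P[x, y] *
      P[y, tips[3]] * P[y, tips[4]]
  }
  total
}

# Log likelihood of a tree: the sum over sites of log site likelihoods.
# 'taxa' lists the sequences so that the first two form one pair.
tree_log_lik <- function(taxa, t = 0.1) {
  P <- jc_p(t)
  sum(apply(aln[taxa, ], 2, function(col) log(site_lik(col, P))))
}

topologies <- list("((1,2),(3,4))" = c(1, 2, 3, 4),
                   "((1,3),(2,4))" = c(1, 3, 2, 4),
                   "((1,4),(2,3))" = c(1, 4, 2, 3))

logL <- sapply(topologies, tree_log_lik)
best <- names(which.max(logL))

print(logL)
print(best)
```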


The probability of any change is independent of the prior history of the site (a Markov Model)

Substitution probabilities do not change with time or over the tree (a homogeneous Markov process)

Change is time reversible e.g. the rate of change of A to T is the same as T to A

Typical assumptions of ML substitution models

A model is always a simplification of what happens in nature
◦ Assumes evolution works parsimoniously

A given model will give more weight to certain changes over others

ML – an objective criterion for choosing one weighting scheme over another?

Sophisticated way to weight your data

A Bayesian Approach to Phylogenetics

Based largely on slides by Paul Lewis (www.eeb.uconn.edu)

D will stand for Data

H will mean any one of a number of things:
◦ a discrete hypothesis
◦ a distinct model (e.g. JC, HKY, GTR, etc.)
◦ a tree topology
◦ one of an infinite number of continuous model parameter values (e.g. ts:tv rate ratio)

Bayesian inference in general

In ML, we choose the hypothesis that gives the highest (maximized) likelihood to the data

The likelihood is the probability of the data given the hypothesis L = P (D | H).

A Bayesian analysis expresses its results as the probability of the hypothesis given the data.
◦ this may be a more desirable way to express the result

A Bayesian approach compared to ML

The posterior probability, [P (H | D)], is the probability of the hypothesis given the observations, or data (D)

The main feature in Bayesian statistics is that it takes into account prior knowledge of the hypothesis

The posterior probability of a hypothesis

P(H | D) = P(D | H) * P(H) / P(D)

◦ P(H | D): posterior probability of hypothesis H
◦ P(D | H): likelihood of hypothesis
◦ P(H): prior probability of hypothesis
◦ P(D): probability of the data (a normalizing constant)

Both ML and Bayesian methods use the likelihood function
◦ In ML, free parameters are optimized, maximizing the likelihood
◦ In a Bayesian approach, free parameters are probability distributions, which are sampled.

Likelihood function is common

Data D: 6 heads (out of 10 flips)

H = true underlying proportion of heads (the probability of coming up heads on any single flip)
◦ if H = 0.5, coin is perfectly fair
◦ if H = 1.0, coin always comes up heads (i.e. it is a trick coin)

Coin-flipping example

F: there exists a true probability H of getting heads; H0: H = 0.5
◦ Do the data reject the null hypothesis?

B: what is the range around 0.5 that we are willing to accept as being in the “fair coin” range?
◦ What is the probability that H is in this range?
◦ For the coin tossing example, we can calculate the probabilities exactly
◦ For more complex data, we need to explore the probability space: MCMC
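For the coin, both questions can be answered exactly in R. The sketch makes two assumptions that are not on the slides: the prior on H is flat, i.e. Beta(1, 1), and the "fair coin" range is taken to be 0.45 to 0.55.

```r
heads <- 6
flips <- 10

# Frequentist: exact two-sided binomial test of H0: H = 0.5.
p_value <- binom.test(heads, flips, p = 0.5)$p.value

# Bayesian: with a flat Beta(1, 1) prior (an assumption), the posterior of H
# is Beta(1 + heads, 1 + tails), so probabilities are available exactly.
a <- 1 + heads
b <- 1 + (flips - heads)

lower <- 0.45   # assumed lower edge of the "fair coin" range
upper <- 0.55   # assumed upper edge of the "fair coin" range
p_fair <- pbeta(upper, a, b) - pbeta(lower, a, b)

print(c(p_value = p_value, p_fair = p_fair))
```

The frequentist output is a statement about the data under H0, while the Bayesian output is a statement about H itself; changing the assumed range or prior changes the second number.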

The Frequentist and the Bayesian

Markov chain Monte Carlo

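Start somewhere
◦ That “somewhere” will have a likelihood associated with it
◦ Not the optimized, maximum likelihood

Randomly propose a new state
◦ If the new state has a better likelihood, the chain goes there
◦ If it is worse, the chain goes there only with a probability equal to the ratio of the new value to the current one; otherwise it stays where it is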
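A minimal Metropolis sampler for the coin example (6 heads in 10 flips) might look as follows. It is a sketch under stated assumptions: a flat prior on H, a symmetric uniform proposal of width ±0.1, 50,000 steps and a burn-in of 1,000; none of these settings come from the slides, and log_post is an illustrative name.

```r
set.seed(1)

# Log of the unnormalised posterior: binomial likelihood times a flat prior.
log_post <- function(h) {
  if (h <= 0 || h >= 1) return(-Inf)   # outside the support of the prior
  dbinom(6, size = 10, prob = h, log = TRUE)
}

n_steps <- 50000
chain <- numeric(n_steps)
chain[1] <- 0.1                        # start somewhere (not at the MLE)

for (i in 2:n_steps) {
  current <- chain[i - 1]
  proposal <- current + runif(1, -0.1, 0.1)   # symmetric random step
  log_ratio <- log_post(proposal) - log_post(current)
  # Better states are always accepted; worse ones with probability = ratio.
  if (log(runif(1)) < log_ratio) {
    chain[i] <- proposal
  } else {
    chain[i] <- current
  }
}

burned <- chain[-(1:1000)]             # discard the start of the chain
posterior_mean <- mean(burned)

# With a flat prior the exact posterior is Beta(7, 5), whose mean is 7 / 12.
print(c(sampled = posterior_mean, exact = 7 / 12))
```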

How the MCMC works

The target distribution is the posterior distribution of interest

The proposal distribution is used to decide where to go next; you have much flexibility here, and the choice affects the efficiency of the MCMC algorithm

Target vs. proposal distributions

Pro: taking big steps helps in jumping from one “island” in the posterior density to another

Con: taking big steps often results in poor mixing

Solution: MCMCMC!

The Tradeoff

MC3 involves running several chains simultaneously (one “cold” and several “heated”)

The cold chain is the one that counts, the heated chains are “scouts”

Chain is heated by raising densities to a power less than 1.0 (values closer to 0.0 are warmer)
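Why heating helps can be seen by raising a density to a power below 1. The bimodal density below is made up for illustration, and the power 0.2 is an arbitrary choice.

```r
# A made-up bimodal "posterior" density with peaks at -3 and 3.
dens <- function(x) 0.5 * dnorm(x, -3, 0.5) + 0.5 * dnorm(x, 3, 0.5)

# Heating: raise the density to a power less than 1.
heat <- function(d, power) d^power

peak   <- dens(3)
valley <- dens(0)

# Height of the valley relative to the peak, cold versus heated.
cold_ratio   <- heat(valley, 1.0) / heat(peak, 1.0)
heated_ratio <- heat(valley, 0.2) / heat(peak, 0.2)

print(c(cold = cold_ratio, heated = heated_ratio))
```

The heated landscape is much flatter, so a heated chain crosses the valley between the two "islands" far more easily than the cold chain does.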

Metropolis-coupled Markov chain Monte Carlo (MCMCMC, or MC3)

Bayesian phylogenetics

Marginal = taking into account all possible values for all parameters

Record the position of the robot every 100 or 1000 steps (1000 represents more “thinning” than 100)

This sample will be autocorrelated, but not much so if it is thinned appropriately (can measure autocorrelation to assess this)

If using heated chains, only the cold chain is sampled

The marginal distribution of any parameter can be obtained from this sample
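The effect of thinning on autocorrelation can be measured with acf(). In this sketch an AR(1) series with coefficient 0.9 stands in for an MCMC trace; that stand-in, the chain length and the thinning interval of 100 are all assumptions made for illustration.

```r
set.seed(1)

# A stand-in for an MCMC trace: an autocorrelated AR(1) series.
n <- 100000
trace <- as.numeric(arima.sim(model = list(ar = 0.9), n = n))

# Keep every 100th state.
thinned <- trace[seq(100, n, by = 100)]

# Lag-1 autocorrelation of a series.
lag1 <- function(x) acf(x, lag.max = 1, plot = FALSE)$acf[2]

full_acf    <- lag1(trace)
thinned_acf <- lag1(thinned)

print(c(full = full_acf, thinned = thinned_acf))
```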

Sampling the chain

Start with random tree and arbitrary initial values for branch lengths and model parameters

Each generation consists of one of these (chosen at random):
◦ Propose a new tree (e.g. Larget-Simon move) and either accept or reject the move
◦ Propose (and either accept or reject) a new model parameter value

Every k generations, save tree topology, branch lengths and all model parameters (i.e. sample the chain)

After n generations, summarize sample using histograms, means, credible intervals, etc.

Putting it all together

Prior distributions

For topologies: discrete Uniform distribution

For proportions: Beta(a,b) distribution
◦ flat when a=b=1
◦ peaked at 0.5 if a=b and both are greater than 1

For base frequencies: Dirichlet(a,b,c,d) distribution
◦ flat when a=b=c=d=1
◦ all base frequencies close to 0.25 if v=a=b=c=d and v large (e.g. 300)

For GTR model relative rates: Dirichlet(a,b,c,d,e,f) distribution
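The shapes of these priors can be inspected in R. Base R has no Dirichlet sampler, so the sketch draws one by normalising independent Gamma variates, which is a standard construction; the helper name rdirichlet1 is made up for illustration.

```r
set.seed(1)
x <- seq(0.05, 0.95, by = 0.05)

# Beta(1, 1) is flat; Beta(2, 2) has a = b > 1 and peaks at 0.5.
flat   <- dbeta(x, 1, 1)
peaked <- dbeta(x, 2, 2)
peak_location <- x[which.max(peaked)]

# One draw from Dirichlet(alpha), obtained by normalising Gamma draws.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

flat_freqs  <- rdirichlet1(c(1, 1, 1, 1))   # flat prior on base frequencies
tight_freqs <- rdirichlet1(rep(300, 4))     # v = 300: concentrated near 0.25

print(peak_location)
print(flat_freqs)
print(tight_freqs)
```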

Prior Distributions

For other model parameters and branch lengths: Gamma(a,b) distribution (a = shape, b = scale)
◦ Exponential(λ) equals Gamma(1, λ^-1) distribution
◦ Mean of Gamma(a,b) is ab (so mean of an Exponential(10) distribution is 0.1)
◦ Variance of a Gamma(a,b) distribution is ab^2 (so variance of an Exponential(10) distribution is 0.01)
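These identities can be verified numerically. The sketch uses R's shape/scale parameterisation of the Gamma density and numerical integration for the moments; the evaluation points in x are arbitrary.

```r
lambda <- 10
x <- c(0.01, 0.05, 0.1, 0.5, 1)

# Exponential(lambda) has the same density as Gamma(shape = 1, scale = 1 / lambda).
same_density <- isTRUE(all.equal(dexp(x, rate = lambda),
                                 dgamma(x, shape = 1, scale = 1 / lambda)))

# Mean and variance by numerical integration, to compare with ab and ab^2.
a <- 1
b <- 1 / lambda
m <- integrate(function(x) x * dgamma(x, shape = a, scale = b), 0, Inf)$value
v <- integrate(function(x) (x - m)^2 * dgamma(x, shape = a, scale = b), 0, Inf)$value

print(same_density)
print(c(mean = m, a_times_b = a * b, variance = v, a_times_b_squared = a * b^2))
```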

Prior Distributions

Flat (uninformative) priors mean that the posterior probability is directly proportional to the likelihood
◦ The value of H at the peak of the posterior distribution is equal to the MLE of H

Informative priors can have a strong effect on posterior probabilities
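For the coin example (6 heads in 10 flips) the effect of the prior on the posterior peak can be computed in closed form, because a Beta prior combined with binomial data gives a Beta posterior. The informative Beta(20, 20) prior is an arbitrary choice for illustration, and the function names are made up.

```r
heads <- 6
flips <- 10
mle <- heads / flips

# Mode of a Beta(a, b) density (valid when a > 1 and b > 1).
beta_mode <- function(a, b) (a - 1) / (a + b - 2)

# Posterior under a Beta(a0, b0) prior is Beta(a0 + heads, b0 + tails).
posterior_mode <- function(a0, b0) {
  beta_mode(a0 + heads, b0 + flips - heads)
}

flat_mode        <- posterior_mode(1, 1)     # flat prior
informative_mode <- posterior_mode(20, 20)   # assumed prior, peaked at 0.5

print(c(mle = mle, flat = flat_mode, informative = informative_mode))
```

With the flat prior the posterior peak coincides with the MLE; the informative prior pulls it towards 0.5.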

The effect of priors

10 important considerations

1. Beware arbitrarily truncated priors
2. Branch length priors particularly important
3. Beware high posteriors for very short branch lengths
4. Partition with care (prefer fewer subsets)
5. MCMC run length should depend on number of parameters
6. Calculate how many times parameters were updated
7. Pay attention to parameter estimates
8. Run without data to explore prior
9. Run long and run often!
10. Future: model selection should include effects of priors

Top 10 List (of important considerations)

Marshall, D.C., 2010. Cryptic failure of partitioned Bayesian phylogenetic analyses: lost in the land of long trees. Syst Biol 59, 108-117.


Bayesian methods are here to stay in phylogenetics

Are able to take into account uncertainty in parameter estimates

Are able to relax most assumptions, including rate homogeneity among branches
◦ Timing of divergence analyses

Being heavily developed, new features and algorithms appear regularly

To conclude
