Maximum likelihood
14.4. Tue Introduction to models (Jarno)
16.4. Thu Distance-based methods (Jarno)
17.4. Fri ML analyses (Jarno)
20.4. Mon Assessing hypotheses (Jarno)
21.4. Tue Problems with molecular data (Jarno)
23.4. Thu Problems with molecular data (Jarno) Phylogenomics
24.4. Fri Search algorithms, visualization, and other computational aspects (Jarno)
Schedule
Maximum likelihood methods of phylogenetic inference evaluate a hypothesis about evolutionary history (the branching order and branch lengths of a tree) in terms of the probability that a proposed model of the evolutionary process and the hypothesised history (tree) would give rise to the data we observe.
Maximum Likelihood
The probability, P, of the data (D), given the hypothesis (H)
◦ L = P (D | H)
Likelihood of a hypothesis
D: observed data (aligned sequences)
H: tree topology, branch lengths and model of evolution
In statistical usage, a distinction is made depending on the roles of the outcome or parameter.
Probability is used when describing a function of the outcome given a fixed parameter value. For example, if a coin is flipped 10 times and it is a fair coin, what is the probability of it landing heads-up every time?
Likelihood is used when describing a function of a parameter given an outcome. For example, if a coin is flipped 10 times and it has landed heads-up 10 times, what is the likelihood that the coin is fair? [Wikipedia, article on likelihood]
Likelihood or probability? What's the difference?
An optimality criterion (as is parsimony)
Given a model and data we can evaluate a tree
We can choose between trees based on the likelihood of a given tree
The tree(s) with the highest likelihood is the best
Maximum Likelihood
[Figure: substitution models arranged by base frequencies and number of substitution types]
Equal base frequencies: JC (single substitution type), K2P (2 substitution types), K3ST (3 substitution types), SYM (6 substitution types)
Variable base frequencies: F81 (single substitution type), HKY85 and F84 (2 substitution types), TrN (3 substitution types), GTR (6 substitution types)
Maximum Likelihood estimates parameter values of an explicit model from observed data
Likelihood provides ways of evaluating models in terms of their log likelihoods
Different trees can also be evaluated for their fit to the data under a particular model (likelihood ratio tests of two trees after Kishino & Hasegawa)
Maximum Likelihood
Let's toss a coin ten times (n). It lands 4 times heads up (x), 6 times tails up. What is the probability of a head in a single toss?
◦ Compare: What is the likelihood of the data given the process?
Naturally phat = x / n = 4 / 10 = 0.4
This is also the maximum likelihood estimate of p. Let's see why...
Likelihood function, example
Coin toss is a binomial process:
◦ Pr(X = x | n, p) = [n! / (x! (n-x)!)] * p^x * (1-p)^(n-x)
Likelihood function then becomes:
◦ L(p | x, n) = [n! / (x! (n-x)!)] * p^x * (1-p)^(n-x)
Note: in the binomial formula X is the unknown, whereas in the likelihood function p is the unknown (because we have the data, the coin tosses).
Likelihood function, example
The likelihood function can be solved analytically or using "brute force".
For example, the result for p=0.4 is:
◦ L = 210 * 0.4^4 * 0.6^6 = 0.2508227
◦ logL = log(L) = -1.383009
◦ -logL = -log(L) = 1.383009
Analytically, the point where the derivative of the likelihood function is zero, and the second derivative is negative, is the maximum of the function.
Graphically...
Likelihood function, example
Maximum Likelihood
[Figure: likelihood plotted against p; the peak of the curve marks the maximum likelihood estimator of p]
Maximum Likelihood
[Figure: likelihood plotted against μ1, with curves labelled "Precise estimate" and "Imprecise estimate"]
# Binomial likelihood of p on a grid, for x heads in n tosses
l <- function(x, n) {
  p <- seq(0, 1, 0.01)
  L <- rep(NA, length(p))
  for (i in seq_along(p)) {
    L[i] <- choose(n, x) * p[i]^x * (1 - p[i])^(n - x)
  }
  data.frame(p = p, L = L, logL = log(L))
}
plot(l(4, 10)[, c(1, 3)], ylim = c(-30, 0), type = "l")

# Same curve with dbinom(), which returns the log likelihood directly.
# The defaults reproduce the coin example (4 heads in 10 tosses).
l2 <- function(x = 4, n = 10) {
  p <- seq(0, 1, 0.01)
  L <- dbinom(x, size = n, prob = p, log = TRUE)
  data.frame(p = p, L = L)
}
plot(l2(), type = "l")
R code
plot(l2(), type="l")
R example result
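The maximum seen in the plot can also be located numerically. The following is a minimal sketch, assuming the coin data from the example (x = 4, n = 10) and using base R's optimize(); the function name neg_log_lik is only illustrative.

```r
# Negative log likelihood of p for x heads in n tosses (binomial model)
neg_log_lik <- function(p, x = 4, n = 10) {
  -dbinom(x, size = n, prob = p, log = TRUE)
}

# Search the interval (0, 1) for the p that minimises -logL
fit <- optimize(neg_log_lik, interval = c(0, 1))
p_hat <- fit$minimum      # numerical maximum likelihood estimate of p
min_nll <- fit$objective  # -logL at the estimate

print(c(p_hat = p_hat, x_over_n = 4 / 10, neg_logL = min_nll))
```

The numerical estimate should agree with x / n up to the tolerance of the optimizer.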
Why log likelihood?
L(0.99 | x=4, n=10) = 0.0000000002017251
logL(0.99 | x=4, n=10) = -22.324115
◦ When you multiply very small values together, the result is even smaller, and at some point it can no longer be represented: floating-point numbers underflow to zero (a restriction of computers)
◦ The same does not happen with log values, because the product becomes a sum:
L = 210 * 0.4^4 * 0.6^6 = 0.2508227
logL = log(210) + 4*log(0.4) + 6*log(0.6) = -1.383009
Likelihood format, why logs?
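The precision problem is easy to reproduce. The sketch below uses made-up per-site likelihoods (100 sites, each with likelihood 1e-5; these values are purely hypothetical) to show the product underflowing while the sum of logs stays usable.

```r
# Hypothetical per-site likelihoods: 100 sites, each with likelihood 1e-5
site_lik <- rep(1e-5, 100)

# The raw product is 1e-500, far below the smallest double, so it becomes 0
raw_product <- prod(site_lik)

# The sum of logs represents the same quantity without underflow
log_lik <- sum(log(site_lik))

print(raw_product)
print(log_lik)
```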
DNA sequences can be thought of as four-sided dice.
Thus, the previous coin example can be straightforwardly generalized to DNA sequences.
DNA sequences
1 CGAGAC
2 AGCGAC
3 AGATTA
4 GGATAG
What is the probability that unrooted Tree A (rather than another tree) could have generated the data shown under our chosen model?
Maximum likelihood tree reconstruction
[Figure: unrooted Tree A with tips 1, 2, 3 and 4]
1 CGAGA C
2 AGCGA C
3 AGATT A
4 GGATA G
(site j is the last column: C, C, A, G)
The likelihood for a particular site j is the sum of the probabilities of every possible reconstruction of ancestral states under a chosen model
[Figure: Tree A with tip states C, C, A, G at site j and two unknown internal nodes, each of which can be A, C, G or T: 4 x 4 possibilities]
Stationarity!
Maximum likelihood tree reconstruction
Substitution rate matrix used for the calculation (every change has the same rate α):
     A  C  G  T
  A  -  α  α  α
  C  α  -  α  α
  G  α  α  -  α
  T  α  α  α  -
1 CGAG A C
2 AGCG A C
3 AGAT T A
4 GGAT A G
(site j is now the fifth column: A, A, T, A)
[Figure: the same calculation on Tree A for this site, with tip states A, A, T, A and two unknown internal nodes (A, C, G or T)]
Maximum likelihood tree reconstruction
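The sum over ancestral states can be written out directly for a four-taxon tree. The sketch below makes several assumptions that are not in the slides: Tree A is taken to group tips 1 and 2 against tips 3 and 4, all five branches have length 0.1, and the Jukes-Cantor probabilities use the usual parameterisation in expected substitutions per site. The names jc_p and site_lik are illustrative.

```r
bases <- c("A", "C", "G", "T")

# Jukes-Cantor transition probabilities for a branch of length t
# (assumption: t is measured in expected substitutions per site)
jc_p <- function(t) {
  e <- exp(-4 * t / 3)
  P <- matrix(0.25 - 0.25 * e, 4, 4, dimnames = list(bases, bases))
  diag(P) <- 0.25 + 0.75 * e
  P
}

# Likelihood of one site on the assumed tree ((1,2),(3,4)):
# sum over the 4 x 4 combinations of states at the two internal nodes u and v
site_lik <- function(tips, t = 0.1) {
  P <- jc_p(t)
  total <- 0
  for (u in bases) {
    for (v in bases) {
      total <- total +
        0.25 *                            # stationary frequency of state u
        P[u, tips[1]] * P[u, tips[2]] *   # branches from u to tips 1 and 2
        P[u, v] *                         # internal branch
        P[v, tips[3]] * P[v, tips[4]]     # branches from v to tips 3 and 4
    }
  }
  total
}

# Site j of the example alignment has tip states C, C, A, G
lik_j <- site_lik(c("C", "C", "A", "G"))
print(lik_j)
print(log(lik_j))
```

Because the model is time reversible, placing the root at internal node u does not change the result; the stationary frequency 0.25 is where the "Stationarity!" assumption enters.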
Branch lengths also need to be estimated!
[Figure: tree with tip states A, C, C, C, G, internal nodes x, y, z, w and branch lengths t1 to t8]
P(A,C,C,C,G,x,y,z,w|T) = Prob(x) Prob(y|x,t6) Prob(A|y,t1) Prob(C|y,t2) Prob(z|x,t8) Prob(C|z,t3) Prob(w|z,t7) Prob(C|w,t4) Prob(G|w,t5)
ti are branch lengths (rate × time)
Assume a Jukes-Cantor model (all nucleotide frequencies are equal and all substitutions occur at the same rate). Further assume that the branch length is 0.1.
Then we can generate a so-called P-matrix from the Jukes-Cantor model's Q-matrix.
Its entries are the probabilities of a nucleotide changing to some other nucleotide (or staying the same) along that branch.
Likelihood of a tree, example
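A: acct
B: gcct
The sequences differ at one site (a to g, P-matrix entry 0.0062) and are identical at three sites (P-matrix entry 0.9815); 0.25 is the frequency of each nucleotide.
L = (0.25 * 0.0062)^1 * (0.25 * 0.9815)^3 = 2.289932e-05
log10(L) = -4.64 (as a natural log: logL = log(L) = -10.68)
For other branch lengths, the P matrix can be multiplied by itself k times. This gives a P matrix for a branch k times as long.
A branch length can be optimized by finding the length that maximizes the likelihood.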
Likelihood, two taxon case
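The two-taxon calculation can be reproduced in a few lines. The first part of this sketch uses only the P-matrix values quoted in the example (0.9815 and 0.0062). The second part, the branch length optimisation, is an assumption-laden illustration: it uses the standard Jukes-Cantor formula with the branch length d in expected substitutions per site, which may be scaled differently from the slide's P-matrix, and the name jc_log_lik is hypothetical.

```r
bases <- c("A", "C", "G", "T")

# P-matrix with the values quoted in the example:
# 0.9815 for no change, 0.0062 for each specific change
P <- matrix(0.0062, nrow = 4, ncol = 4, dimnames = list(bases, bases))
diag(P) <- 0.9815

seq_a <- strsplit("ACCT", "")[[1]]
seq_b <- strsplit("GCCT", "")[[1]]

# Per-site likelihood: nucleotide frequency (0.25) times P[from, to]
idx <- cbind(match(seq_a, bases), match(seq_b, bases))
site_L <- 0.25 * P[idx]

L <- prod(site_L)         # likelihood of the pair of sequences
logL <- sum(log(site_L))  # natural-log likelihood
print(c(L = L, logL = logL, log10L = log10(L)))

# P-matrix for a branch twice as long: multiply P by itself
P2 <- P %*% P

# Branch length optimisation (assumed standard JC69 parameterisation):
# log likelihood of distance d for 3 identical sites and 1 differing site
jc_log_lik <- function(d, n_same = 3, n_diff = 1) {
  e <- exp(-4 * d / 3)
  n_same * log(0.25 * (0.25 + 0.75 * e)) +
    n_diff * log(0.25 * (0.25 - 0.25 * e))
}
d_hat <- optimize(jc_log_lik, interval = c(1e-6, 5), maximum = TRUE)$maximum
print(d_hat)
```

Under these assumptions the optimum coincides with the analytical Jukes-Cantor distance, -3/4 * log(1 - 4/3 * 0.25), for one difference in four sites.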
Depending on the software, each iteration of the tree optimization algorithm has to do the following for a given tree topology:
Calculate the likelihood of the tree topology given the model and the observed data
Estimate the optimal branch lengths
Possibly a huge number of calculations
The likelihood of Tree A is the product of the likelihoods at each site
The likelihood is usually evaluated by summing the log of the likelihoods at each site (because the product of the per-site probabilities is so small) and reported as the log likelihood of the full tree
The Maximum likelihood tree is the one with the highest likelihood (might not be Tree A i.e. it could be another tree topology)
◦ Note: highest likelihood (largest value) = the largest logL (closest to zero) = the smallest -logL (closest to zero)
Maximum likelihood tree reconstruction
The probability of any change is independent of the prior history of the site (a Markov Model)
Substitution probabilities do not change with time or over the tree (a homogeneous Markov process)
Change is time reversible e.g. the rate of change of A to T is the same as T to A
Typical assumptions of ML substitution models
A model is always a simplification of what happens in nature
◦ Assumes evolution works parsimoniously
A given model will give more weight to certain changes over others
ML – an objective criterion for choosing one weighting scheme over another?
Sophisticated way to weight your data
A Bayesian Approach to Phylogenetics
Based largely on slides by Paul Lewis (www.eeb.uconn.edu)
D will stand for Data
H will mean any one of a number of things:
◦ a discrete hypothesis
◦ a distinct model (e.g. JC, HKY, GTR, etc.)
◦ a tree topology
◦ one of an infinite number of continuous model parameter values (e.g. ts:tv rate ratio)
Bayesian inference in general
In ML, we choose the hypothesis that gives the highest (maximized) likelihood to the data
The likelihood is the probability of the data given the hypothesis L = P (D | H).
A Bayesian analysis expresses its results as the probability of the hypothesis given the data.
◦ this may be a more desirable way to express the result
A Bayesian approach compared to ML
The posterior probability, [P (H | D)], is the probability of the hypothesis given the observations, or data (D)
The main feature in Bayesian statistics is that it takes into account prior knowledge of the hypothesis
The posterior probability of a hypothesis
P (H | D) = P (D | H) * P (H) / P (D)
P (H | D): posterior probability of hypothesis H
P (D | H): likelihood of hypothesis
P (H): prior probability of hypothesis
P (D): probability of the data (a normalizing constant)
Both ML and Bayesian methods use the likelihood function
◦ In ML, free parameters are optimized, maximizing the likelihood
◦ In a Bayesian approach, free parameters are probability distributions, which are sampled.
Likelihood function is common
Data D: 6 heads (out of 10 flips)
H = true underlying proportion of heads (the probability of coming up heads on any single flip)
if H = 0.5, coin is perfectly fair
if H = 1.0, coin always comes up heads (i.e. it is a trick coin)
Coin-flipping example
F: there exists a true probability H of getting heads, H0: H=0.5
◦ Do the data reject the null hypothesis?
B: what is the range around 0.5 that we are willing to accept as being in the "fair coin" range?
◦ What is the probability that H is in this range?
◦ For the coin tossing example, we can calculate the probabilities exactly
◦ For more complex data, we need to explore the probability space: MCMC
The Frequentist and the Bayesian
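Both views can be computed exactly for the coin data (6 heads in 10 flips). The sketch below assumes a flat Beta(1, 1) prior and an arbitrary "fair coin" range of 0.45 to 0.55; neither choice comes from the slides.

```r
x <- 6   # heads
n <- 10  # flips

# Frequentist: exact two-sided binomial test of H0: H = 0.5
p_value <- binom.test(x, n, p = 0.5)$p.value

# Bayesian: with a flat Beta(1, 1) prior (an assumption),
# the posterior of H is Beta(1 + x, 1 + n - x)
a_post <- 1 + x
b_post <- 1 + (n - x)

# Assumed "fair coin" range; the bounds are an arbitrary illustration
lower <- 0.45
upper <- 0.55
prob_fair <- pbeta(upper, a_post, b_post) - pbeta(lower, a_post, b_post)

post_mean <- a_post / (a_post + b_post)
print(c(p_value = p_value, prob_fair = prob_fair, post_mean = post_mean))
```

The frequentist output is a p-value for the null hypothesis, whereas the Bayesian output is a probability statement about H itself.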
Markov chain Monte Carlo
Start somewhere
◦ That "somewhere" will have a likelihood associated with it
◦ Not the optimized, maximum likelihood
Randomly propose a new state
◦ If the new state has a better likelihood, the chain goes there
How the MCMC works
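A chain of this kind can be sketched for the coin example. The slide only states what happens when the proposed state is better; the rule used here for worse states (accept with probability equal to the ratio of posterior densities) is the standard Metropolis rule and is an addition, as are the flat prior, the starting value, the proposal width, the burn-in and the thinning interval.

```r
set.seed(1)

# Log posterior density of H (up to a constant) for 6 heads in 10 flips,
# assuming a flat Beta(1, 1) prior
log_post <- function(h, x = 6, n = 10) {
  if (h <= 0 || h >= 1) return(-Inf)  # outside the parameter space
  dbinom(x, size = n, prob = h, log = TRUE) + dbeta(h, 1, 1, log = TRUE)
}

n_gen <- 20000
chain <- numeric(n_gen)
chain[1] <- 0.1  # start somewhere (not the maximum likelihood value)

for (g in 2:n_gen) {
  current <- chain[g - 1]
  proposal <- current + rnorm(1, mean = 0, sd = 0.2)  # random proposal
  log_r <- log_post(proposal) - log_post(current)
  # Better states are always accepted; worse ones with probability exp(log_r)
  if (log(runif(1)) < log_r) {
    chain[g] <- proposal
  } else {
    chain[g] <- current
  }
}

# Discard the first 1000 generations and keep every 10th state after that
sampled <- chain[seq(1001, n_gen, by = 10)]
print(mean(sampled))
print(quantile(sampled, c(0.025, 0.975)))
```

With a flat prior the exact posterior is Beta(7, 5), so the sample mean can be compared against 7/12.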
The target distribution is the posterior distribution of interest
The proposal distribution is used to decide where to go next; you have much flexibility here, and the choice affects the efficiency of the MCMC algorithm
Target vs. proposal distributions
Pro: taking big steps helps in jumping from one “island” in the posterior density to another
Con: taking big steps often results in poor mixing
Solution: MCMCMC!
The Tradeoff
MC3 involves running several chains simultaneously (one “cold” and several “heated”)
The cold chain is the one that counts, the heated chains are “scouts”
Chain is heated by raising densities to a power less than 1.0 (values closer to 0.0 are warmer)
Metropolis-coupled Markov chain Monte Carlo (MCMCMC, or MC3)
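The heating and swapping idea can be illustrated on a toy target with two well-separated "islands". Everything in this sketch is hypothetical: the bimodal target, the two powers (1 for the cold chain, 0.1 for the heated chain), the proposal width and the run length. It is a minimal illustration of Metropolis coupling, not the scheme of any particular phylogenetics program.

```r
set.seed(42)

# Toy target density with two separated modes at -4 and 4 (log scale)
log_target <- function(x) {
  log(0.5 * dnorm(x, mean = -4, sd = 0.5) + 0.5 * dnorm(x, mean = 4, sd = 0.5))
}

betas <- c(1, 0.1)  # powers: chain 1 is cold, chain 2 is heated
n_gen <- 50000
state <- c(-4, -4)  # both chains start on the same island
cold <- numeric(n_gen)

for (g in seq_len(n_gen)) {
  # Ordinary Metropolis update within each chain, on its heated density
  for (k in 1:2) {
    proposal <- state[k] + rnorm(1, mean = 0, sd = 1)
    log_r <- betas[k] * (log_target(proposal) - log_target(state[k]))
    if (log(runif(1)) < log_r) state[k] <- proposal
  }
  # Propose swapping the states of the cold and heated chains
  log_swap <- (betas[1] - betas[2]) *
    (log_target(state[2]) - log_target(state[1]))
  if (log(runif(1)) < log_swap) state <- rev(state)

  cold[g] <- state[1]  # only the cold chain is sampled
}

# Fraction of cold-chain samples on the right-hand island
frac_right <- mean(cold > 0)
print(frac_right)
```

The heated chain sees a flattened landscape and crosses between islands easily; accepted swaps pass those jumps on to the cold chain, which would otherwise rarely leave its starting island.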
Bayesian phylogenetics
Marginal = taking into account all possible values for all parameters
Record the position of the robot (the current state of the chain) every 100 or 1000 steps (1000 represents more "thinning" than 100)
This sample will be autocorrelated, but not much so if it is thinned appropriately (can measure autocorrelation to assess this)
If using heated chains, only the cold chain is sampled
The marginal distribution of any parameter can be obtained from this sample
Sampling the chain
Start with random tree and arbitrary initial values for branch lengths and model parameters
Each generation consists of one of these (chosen at random):
◦ Propose a new tree (e.g. Larget-Simon move) and either accept or reject the move
◦ Propose (and either accept or reject) a new model parameter value
Every k generations, save tree topology, branch lengths and all model parameters (i.e. sample the chain)
After n generations, summarize sample using histograms, means, credible intervals, etc.
Putting it all together
Prior distributions
For topologies: discrete Uniform distribution
For proportions: Beta(a,b) distribution
◦ flat when a=b=1
◦ peaked above 0.5 if a=b and both are greater than 1
For base frequencies: Dirichlet(a,b,c,d) distribution
◦ flat when a=b=c=d=1
◦ all base frequencies close to 0.25 if v=a=b=c=d and v large (e.g. 300)
For GTR model relative rates: Dirichlet(a,b,c,d,e,f) distribution
Prior Distributions
For other model parameters and branch lengths: Gamma(a,b) distribution (a = shape, b = scale)
◦ Exponential(λ) equals Gamma(1, 1/λ) distribution
◦ Mean of Gamma(a,b) is ab (so mean of an Exponential(10) distribution is 0.1)
◦ Variance of a Gamma(a,b) distribution is ab^2 (so variance of an Exponential(10) distribution is 0.01)
Prior Distributions
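The relationship between the Exponential and Gamma priors can be checked numerically. This sketch reads Gamma(a, b) as shape a and scale b, which is the reading under which the stated mean ab and variance ab^2 hold; the grid of x values and the number of random draws are arbitrary.

```r
lambda <- 10
x <- seq(0.01, 1, by = 0.01)

# Exponential(lambda) density versus Gamma(shape = 1, scale = 1 / lambda)
same_density <- all.equal(dexp(x, rate = lambda),
                          dgamma(x, shape = 1, scale = 1 / lambda))

# Mean and variance from the formulas ab and ab^2
a <- 1
b <- 1 / lambda
gamma_mean <- a * b
gamma_var <- a * b^2

# Rough check by simulation
set.seed(1)
draws <- rgamma(1e5, shape = a, scale = b)
print(same_density)
print(c(mean_formula = gamma_mean, mean_sim = mean(draws),
        var_formula = gamma_var, var_sim = var(draws)))
```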
Flat (uninformative) priors mean that the posterior probability is directly proportional to the likelihood
◦ The value of H at the peak of the posterior distribution is equal to the MLE of H
Informative priors can have a strong effect on posterior probabilities
The effect of priors
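The effect can be seen on the coin data (6 heads in 10 flips) by comparing the peak of the posterior under two priors. The grid resolution and the informative Beta(20, 20) prior are assumptions chosen only for illustration.

```r
x <- 6
n <- 10
p <- seq(0.001, 0.999, by = 0.001)  # grid of candidate values for H
lik <- dbinom(x, size = n, prob = p)

# Peak of the unnormalised posterior (likelihood times Beta(a, b) prior)
posterior_mode <- function(a, b) {
  unnorm <- lik * dbeta(p, a, b)
  p[which.max(unnorm)]
}

mle <- p[which.max(lik)]                    # maximum likelihood estimate
mode_flat <- posterior_mode(1, 1)           # flat prior
mode_informative <- posterior_mode(20, 20)  # prior concentrated around 0.5

print(c(mle = mle, flat = mode_flat, informative = mode_informative))
```

Under the flat prior the posterior peak coincides with the MLE, while the informative prior pulls it towards 0.5.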
10 important considerations
1. Beware arbitrarily truncated priors
2. Branch length priors particularly important
3. Beware high posteriors for very short branch lengths
4. Partition with care (prefer fewer subsets)
5. MCMC run length should depend on number of parameters
6. Calculate how many times parameters were updated
7. Pay attention to parameter estimates
8. Run without data to explore prior
9. Run long and run often!
10. Future: model selection should include effects of priors
Top 10 List (of important considerations)
Marshall, D.C., 2010. Cryptic failure of partitioned Bayesian phylogenetic analyses: lost in the land of long trees. Syst Biol 59, 108-117.
Bayesian methods are here to stay in phylogenetics
Are able to take into account uncertainty in parameter estimates
Are able to relax most assumptions, including rate homogeneity among branches
◦ Timing of divergence analyses
Being heavily developed, new features and algorithms appear regularly
To conclude