model selection anders gorm pedersen molecular evolution group

CE

NT

ER

FO

R B

IOLO

GIC

AL

SE

QU

EN

CE

AN

ALY

SIS Model SelectionModel Selection

Anders Gorm PedersenAnders Gorm Pedersen

Molecular Evolution GroupMolecular Evolution GroupCenter for Biological Sequence AnalysisCenter for Biological Sequence AnalysisTechnical University of Denmark (DTU)Technical University of Denmark (DTU)

CE

NT

ER

FO

R B

IOLO

GIC

AL

SE

QU

EN

CE

AN

ALY

SIS

The maximum likelihood approach

Likelihood = Probability (Data | Model)Likelihood = Probability (Data | Model)

Maximum likelihood: Maximum likelihood: Best estimate is the set of parameter Best estimate is the set of parameter values which gives the highest possible values which gives the highest possible likelihood.likelihood.

CE

NT

ER

FO

R B

IOLO

GIC

AL

SE

QU

EN

CE

AN

ALY

SIS

Probabilistic modeling applied to phylogeny

• Observed data: multiple alignment of sequencesObserved data: multiple alignment of sequences

H.sapiens globinH.sapiens globin A G G G A T T C AA G G G A T T C A M.musculus globinM.musculus globin A C G G T T T - AA C G G T T T - A R.rattus globinR.rattus globin A C G G A T T - AA C G G A T T - A

• Probabilistic model parameters (simplest case):Probabilistic model parameters (simplest case):– Nucleotide frequencies: Nucleotide frequencies: AA, , CC, , GG, , TT

– Tree topology and branch lengthsTree topology and branch lengths– Nucleotide-nucleotide substitution rates (or Nucleotide-nucleotide substitution rates (or

substitution probabilities):substitution probabilities):

CE

NT

ER

FO

R B

IOLO

GIC

AL

SE

QU

EN

CE

AN

ALY

SIS

Computing the probability of one column in an alignment given tree topology and

other parameters

A C G G A T T C AA C G G A T T C AA C G G T T T - AA C G G T T T - AA A G G A T T - AA A G G A T T - AA G G G T T T - AA G G G T T T - A

AA

C

C A

G

Columns in alignment contain homologous nucleotides

Assume tree topology, branch lengths, and other parameters are given. Assume ancestral states were A and A. Start computation at any internal or external node.

Pr = C PCA(t1) PAC(t2) PAA(t3) PAG(t4) PAA(t5)

t4

t5

t3t1

t2

CE

NT

ER

FO

R B

IOLO

GIC

AL

SE

QU

EN

CE

AN

ALY

SIS

Computing the probability of an entire alignment given tree topology and other

parameters• Probability must be summed over all possible combinations of ancestral nucleotides.

(Here we have two internal nodes giving 16 possible combinations)

• Probability of individual columns are multiplied to give the overall probability of the alignment, i.e., the likelihood of the model.

• Often the log of the probability is used (log likelihood)

CE

NT

ER

FO

R B

IOLO

GIC

AL

SE

QU

EN

CE

AN

ALY

SIS

Model Selection: How Do We Choose Between Different Types

of Models?

Select model with best fit?

CE

NT

ER

FO

R B

IOLO

GIC

AL

SE

QU

EN

CE

AN

ALY

SIS

Over-fitting: More parameters always result in a better fit to the data, but not necessarily in a better description

y = ax + b

2 parameter modelGood description, poor fit

y = ax6+bx5+cx4+dx3+ex2+fx+g

7 parameter modelPoor description, good fit

CE

NT

ER

FO

R B

IOLO

GIC

AL

SE

QU

EN

CE

AN

ALY

SIS

Selecting the best model: the likelihood ratio test

• The fit of two alternative models can be compared using the The fit of two alternative models can be compared using the ratio of their likelihoods:ratio of their likelihoods:

LR =LR = P(Data P(Data | | M1) = L,M1M1) = L,M1P(Data | M2) L,M2P(Data | M2) L,M2

• Note that LR > 1 if model 1 has the highest likelihoodNote that LR > 1 if model 1 has the highest likelihood

• For For nested modelsnested models it can be shown that it can be shown that

= 2*ln(LR) = 2* (lnL,M1 - lnL,M2)= 2*ln(LR) = 2* (lnL,M1 - lnL,M2)

follows a follows a 22 distribution with degrees of freedom equal to distribution with degrees of freedom equal to the number of extra parameters in the most complicated the number of extra parameters in the most complicated model.model.

This makes it possible to perform stringent statistical This makes it possible to perform stringent statistical tests to determine which model (hypothesis) best describes tests to determine which model (hypothesis) best describes the datathe data

CE

NT

ER

FO

R B

IOLO

GIC

AL

SE

QU

EN

CE

AN

ALY

SIS

Asking biological questions in a likelihood ratio testing framework

• Fit two alternative, nested models to the data.Fit two alternative, nested models to the data.

• Record optimized likelihood and number of free parameters Record optimized likelihood and number of free parameters for each fitted model.for each fitted model.

• Test if alternative (parameter-rich) model is Test if alternative (parameter-rich) model is significantlysignificantly better than null-model, given number of better than null-model, given number of additional parameters (nadditional parameters (nextraextra):):

n Compute Compute = 2 x (lnL = 2 x (lnLAlternative Alternative - lnL- lnLNullNull) ) n Compare Compare to to 22 distribution with n distribution with nextraextra degrees of degrees of

freedomfreedom

• Depending on models compared, different biological Depending on models compared, different biological questions can be addressed (presence of molecular clock, questions can be addressed (presence of molecular clock, presence of positive selection, difference in mutation presence of positive selection, difference in mutation rates among sites or branches, etc.)rates among sites or branches, etc.)

CE

NT

ER

FO

R B

IOLO

GIC

AL

SE

QU

EN

CE

AN

ALY

SIS

Positive selection I: synonymous and non-synonymous mutations

• 20 amino acids, 61 codons 20 amino acids, 61 codons – Most amino acids encoded by more than one codonMost amino acids encoded by more than one codon– Not all mutations lead to a change of the encoded amino acidNot all mutations lead to a change of the encoded amino acid– ””Synonymous mutations” are rarely selected againstSynonymous mutations” are rarely selected against

CTA(Leu)

CTC(Leu)

CTG(Leu)

CTT(Leu)

CAA(Gln)

CCA(Pro)

CGA(Arg)

ATA(Ile)

GTA(Val)TTA(Leu)

1 non-synonymous nucleotide site

1 synonymous nucleotide site

1/3 synonymous2/3 nonsynymousnucleotide site

CE

NT

ER

FO

R B

IOLO

GIC

AL

SE

QU

EN

CE

AN

ALY

SIS

Positive selection II: non-synonymous and synonymous mutation rates contain information about selective pressure

• dN: rate of non-synonymous mutations dN: rate of non-synonymous mutations per non-synonymous per non-synonymous sitesite

• dS: rate of synonymous mutations dS: rate of synonymous mutations per synonymous siteper synonymous site

• Recall: Evolution is a two-step process:Recall: Evolution is a two-step process:

(1) Mutation (random)(1) Mutation (random)

(2) Selection (non-random)(2) Selection (non-random)

• Randomly occurring mutations will lead to dN/dS=1.Randomly occurring mutations will lead to dN/dS=1.• Significant deviations from this most likely caused by Significant deviations from this most likely caused by

subsequent selection.subsequent selection.

• dN/dS < 1dN/dS < 1: Higher rate of synonymous mutations: : Higher rate of synonymous mutations: negative negative (purifying) selection(purifying) selection

• dN/dS > 1dN/dS > 1: Higher rate of non-synonymous mutations: : Higher rate of non-synonymous mutations: positive selectionpositive selection

CE

NT

ER

FO

R B

IOLO

GIC

AL

SE

QU

EN

CE

AN

ALY

SIS

Today’s exercise: positive selection in HIV?

• Fit two alternative models to HIV data:Fit two alternative models to HIV data:– M0: one, common dN/dS ratio in entire sequenceM0: one, common dN/dS ratio in entire sequence

– M3: three distinct classes with different dN/dS ratiosM3: three distinct classes with different dN/dS ratios

• Use likelihood ratio test to examine if M3 is significantly better than Use likelihood ratio test to examine if M3 is significantly better than M0,M0,

• If that is the case: is there a class of codons with dN/dS>1 (positive If that is the case: is there a class of codons with dN/dS>1 (positive selection)?selection)?

• If M3 significantly better than M0 AND if some codons have dN/dS>1 then If M3 significantly better than M0 AND if some codons have dN/dS>1 then you have statistical evidence for positive selection. you have statistical evidence for positive selection.

• Most likely reason: immune escape (i.e., sites must be in epitopes)Most likely reason: immune escape (i.e., sites must be in epitopes)

: Codons showing dN/dS > 1: likely epitopes

model selection anders gorm pedersen molecular evolution group

Documents