Hidden Markov Models for biological sequence analysis II Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Master in Bioinformatics UPF 2017-2018 http://comprna.upf.edu/courses/Master_AGB/


Hidden Markov Models for biological sequence analysis II

Eduardo Eyras Computational Genomics

Pompeu Fabra University - ICREA Barcelona, Spain

Master in Bioinformatics UPF 2017-2018

http://comprna.upf.edu/courses/Master_AGB/


HMM model structure

Duration modeling

[Diagram: a single state with self-transition probability p and exit probability (1-p)]

p = probability of the transition from the state to itself; 1-p = probability of leaving the state.

Probability of staying in the state for exactly n residues (n-1 self-transitions followed by one exit):

$$P(n) = (1-p)\,p^{\,n-1}$$

This decays exponentially with n (geometric distribution).

[Figure: P plotted against n, showing the decay curves 0.7^n and 0.5^n]
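The geometric decay of a single self-looping state can be checked numerically. The sketch below is only an illustration: the function name is hypothetical, and it assumes the convention that a visit of n residues uses n-1 self-transitions followed by one exit.

```python
def geometric_duration(p: float, n: int) -> float:
    """Probability that a state with self-transition p emits exactly n residues.

    Assumed convention: n-1 self-transitions (p each), then one exit (1-p).
    """
    if n < 1:
        return 0.0
    return (1.0 - p) * p ** (n - 1)


def mean_duration(p: float, n_max: int = 2000) -> float:
    """Expected duration, truncated at n_max (the tail is negligible for p < 0.9)."""
    return sum(n * geometric_duration(p, n) for n in range(1, n_max + 1))


if __name__ == "__main__":
    for p in (0.5, 0.7):
        total = sum(geometric_duration(p, n) for n in range(1, 2001))
        print(f"p={p}: total probability={total:.6f}, mean length={mean_duration(p):.3f}")
```

The mean length is 1/(1-p), so the only way to favour longer segments with a single state is to push p towards 1, which still leaves length 1 as the most probable duration.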

Duration modeling

How can we avoid this decay? One option is to use several states with the same emission probabilities, connected by transitions.

[Diagram] E.g. a topology that models sequences of minimum length 5, with exponential decay for longer ones.

[Diagram] E.g. a topology that can model any distribution of lengths between 2 and 5.

Duration modeling

[Diagrams: three topologies built from states with self-transition probability p and forward transition probability (1-p), labelled as follows]

Exponential decay
Minimum-length then geometric
Negative binomial

Duration modelling: negative binomial

[Diagram: an array of n states, each with self-transition probability p and forward transition probability (1-p)]

This type of array of n states can model sequences of length n or longer.

For a single path of length m > n, the product of the transition probabilities is

$$p^{m-n}\,(1-p)^{n}$$

Summing the transition probability over all possible paths of length m gives

$$P(m) = \binom{m-1}{n-1}\, p^{m-n}\,(1-p)^{n}$$

where we use the binomial coefficients

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
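As a sanity check on the negative binomial formula, the closed form can be compared with a brute-force sum over every way of splitting m residues among the n states. This is a minimal sketch with hypothetical function names, assuming every state emits at least one residue and all states share the same p.

```python
from itertools import product
from math import comb


def neg_binomial_length(m: int, n: int, p: float) -> float:
    """Closed form: C(m-1, n-1) * p^(m-n) * (1-p)^n for m >= n, else 0."""
    if m < n:
        return 0.0
    return comb(m - 1, n - 1) * p ** (m - n) * (1.0 - p) ** n


def brute_force_length(m: int, n: int, p: float) -> float:
    """Sum the path probabilities over all per-state durations adding up to m."""
    total = 0.0
    for durations in product(range(1, m + 1), repeat=n):
        if sum(durations) != m:
            continue
        prob = 1.0
        for d in durations:
            # d-1 self-transitions followed by one forward transition
            prob *= p ** (d - 1) * (1.0 - p)
        total += prob
    return total


if __name__ == "__main__":
    n, p = 3, 0.7
    for m in range(3, 9):
        print(m, neg_binomial_length(m, n, p), brute_force_length(m, n, p))
```

Each individual path contributes the same product, so the binomial coefficient simply counts the paths; the brute-force enumeration makes that counting explicit.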

Explicit duration modeling

[Figure: empirical intron length distribution, P against length L; labels: Intron, Exon, Intergenic]

Any arbitrary length distribution can be used. The resulting model is a generalized HMM, often used in gene finders.

Upon entering a state:
1.  Choose a duration d according to the state's length distribution
2.  Generate d letters according to the emission probabilities, e.g. P(A|I)
3.  Take a transition to the next state according to the transition probabilities
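The three generation steps can be sketched as a small sampler. Everything in the toy model below (state names, length distributions, emission and transition probabilities) is a made-up assumption for illustration, not taken from any gene finder.

```python
import random

# Hypothetical toy model: every number here is an illustrative assumption.
MODEL = {
    "exon": {
        "durations": {3: 0.2, 6: 0.5, 9: 0.3},      # explicit length distribution
        "emissions": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "transitions": {"intron": 1.0},
    },
    "intron": {
        "durations": {4: 0.1, 8: 0.6, 12: 0.3},
        "emissions": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
        "transitions": {"exon": 1.0},
    },
}


def draw(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate(model, start, n_segments, seed=0):
    """Generate a sequence from a generalized HMM, one segment per state visit."""
    rng = random.Random(seed)
    state, letters, segments = start, [], []
    for _ in range(n_segments):
        d = draw(model[state]["durations"], rng)                              # step 1
        letters.extend(draw(model[state]["emissions"], rng) for _ in range(d))  # step 2
        segments.append((state, d))
        state = draw(model[state]["transitions"], rng)                        # step 3
    return "".join(letters), segments


if __name__ == "__main__":
    sequence, segments = generate(MODEL, "exon", 4)
    print(segments)
    print(sequence)
```

Unlike a standard HMM, the state has no self-transition: the segment length comes entirely from the explicit duration distribution.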

Profile-HMMs

Finding distant members of a protein family

A distant cousin of the functionally related sequences in a protein family may have only weak pairwise similarity to each member of the family, and may therefore not be found with standard pairwise methods (e.g. BLAST).

Because such a sequence may still show weak similarity to many members of the family, the goal is to align it to all members of the family at once.

A family of related proteins can be represented by its multiple alignment and a corresponding profile. Can we represent the profile as a probabilistic model?

Profile-HMM

We use a multiple alignment to build a profile-HMM.

It is an HMM:
•  It is a probabilistic representation of a multiple alignment, and we can use the same HMM algorithms (Viterbi, etc.)
•  We can add position-dependent gap penalties (to model gaps in the alignment)
•  We can add variable states with position-dependent random emission probabilities (to model variable regions)

The model can then be used to find and score less obvious potential matches of new protein sequences.

The profile-HMM is used to ask whether a new sequence S belongs to a given model (e.g. belongs to a given family of proteins, or contains a given domain).

Profiles and HMMs

A protein family is generally represented by a multiple alignment. Example: SH3 domains.

[Figure: multiple alignment of SH3 domains]

Profile representation of protein families

Aligned DNA sequences can be represented by a 4 × n profile matrix reflecting the frequencies of the nucleotides at every aligned position. Example with seven aligned exon-intron boundary sequences of length n = 9:

CAGGTACCC
GAGGTGAGA
CTGGTGAGG
TAGGTGAGT
CAGGTCTGT
CTGGTGAGC
CAGGTAAGT

pos   1     2     3     4     5     6     7     8     9
A     0     0.71  0     0     0     0.28  0.71  0     0.14
C     0.71  0     0     0     0     0.14  0.14  0.14  0.28
G     0.14  0     1     1     0     0.57  0     0.85  0.14
T     0.14  0.28  0     0     1     0     0.14  0     0.42

E.g. at position 1, P(C) = frequency = 5/7 = 0.71 (values are truncated to two decimals).

Position Specific Scoring Matrix (PSSM, also called PWM). A sequence s_1 … s_L is scored with

$$S = \sum_{i=1}^{L} \log \frac{e_i(s_i)}{q_{s_i}}$$

where e_i(s_i) are the motif probabilities and q_{s_i} are the background probabilities.
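The profile matrix and the log-odds score can be reproduced with a few lines of code. This is a sketch under stated assumptions: the function names are hypothetical, the background is taken as uniform (0.25 per nucleotide), and a pseudocount of 1 is added before scoring so that unseen nucleotides do not give log(0).

```python
import math

SITES = ["CAGGTACCC", "GAGGTGAGA", "CTGGTGAGG", "TAGGTGAGT",
         "CAGGTCTGT", "CTGGTGAGC", "CAGGTAAGT"]


def profile(seqs, alphabet="ACGT", pseudocount=0.0):
    """Column-wise frequencies of each letter, optionally with a pseudocount."""
    columns = []
    for i in range(len(seqs[0])):
        counts = {a: pseudocount for a in alphabet}
        for s in seqs:
            counts[s[i]] += 1
        total = sum(counts.values())
        columns.append({a: counts[a] / total for a in alphabet})
    return columns


def log_odds(seq, prof, background):
    """S = sum_i log( e_i(s_i) / q_{s_i} )."""
    return sum(math.log(prof[i][c] / background[c]) for i, c in enumerate(seq))


if __name__ == "__main__":
    raw = profile(SITES)
    print("P(C) at position 1:", round(raw[0]["C"], 3))
    smoothed = profile(SITES, pseudocount=1.0)      # assumed pseudocount
    uniform = {a: 0.25 for a in "ACGT"}             # assumed background
    for candidate in ("CAGGTGAGT", "TTTTTTTTT"):
        print(candidate, round(log_odds(candidate, smoothed, uniform), 3))
```

A positive score means the sequence is better explained by the motif than by the background; a negative score means the opposite.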

Profile-HMM: Match states

The conserved regions can be modeled as in a PSSM.

[Diagram: begin → … → M_j → … → end]

A PSSM can be viewed as a trivial HMM with identical states, the Match states, separated by transitions of probability 1.

$$\text{Score} = \sum_{i=1}^{L} \log \frac{e_{M_i}(s_i)}{q_{s_i}}$$

where e_{M_i}(s_i) are the emission probabilities and q_{s_i} are the background probabilities.

$e_{M_i}(a)$: emission probability in Match state M_i = frequency of each amino acid a in the corresponding alignment column.

Profile-HMMs: Insertion states

The multiple alignment of a protein family shows variations in conservation along the length of the protein. For SH3 domains:

[Figure: multiple alignment of SH3 domains]

Conserved regions can be described by PWMs, but variable regions cannot!

Profile-HMMs: Insertion states

[Diagram: chain of match states M_i between Start and End, with an insertion state I_i attached to M_i]

We treat insertions and deletions separately.

Insertion: portions of the query sequence S that do not match anything in the model; we must insert residues with respect to the model.

Insertion state: I_i = insertions after the residue matching the i-th column of the alignment.

$e_{I_i}(a) = p(a)$: emission probability in the Insertion state = amino acid frequency in all sequences (background).

Profile-HMMs: Insertion states

[Diagram: chain of match states M_i between Start and End, with an insertion state I_i attached to M_i]

Transitions from I_i to itself model multiple insertions.

Emissions from I_i contribute no log-odds (log-likelihood ratio) score, because they follow the background distribution.

Score of a gap of length k:

$$\log a_{M_i I_i} + (k-1)\,\log a_{I_i I_i} + \log a_{I_i M_{i+1}}$$

The three terms are the gap-opening penalty, the gap-extension penalty and the gap-closing penalty, respectively.

Gap penalties are position-dependent! Compare with e.g. Needleman-Wunsch.
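The insertion gap score is affine in k, with parameters that can differ at every model position. A minimal sketch, with hypothetical transition probabilities at two positions chosen only to show the position dependence:

```python
import math


def insertion_gap_score(k, a_MI, a_II, a_IM):
    """log a(M_i,I_i) + (k-1) log a(I_i,I_i) + log a(I_i,M_{i+1}) for k >= 1."""
    if k < 1:
        raise ValueError("an insertion has at least one residue")
    return math.log(a_MI) + (k - 1) * math.log(a_II) + math.log(a_IM)


if __name__ == "__main__":
    # Hypothetical values: position 2 is conserved, position 5 tolerates insertions.
    positions = {
        2: {"a_MI": 0.01, "a_II": 0.2, "a_IM": 0.8},
        5: {"a_MI": 0.20, "a_II": 0.6, "a_IM": 0.4},
    }
    for i, a in positions.items():
        scores = [round(insertion_gap_score(k, **a), 2) for k in (1, 2, 3)]
        print(f"position {i}: scores for k=1,2,3 -> {scores}")
```

In a Needleman-Wunsch style alignment with affine gaps, the same opening and extension penalties apply everywhere; here each position i carries its own.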

Profile-HMMs: Silent states

Deletions: segments of the model that are not matched by any residue in the query sequence S. That is, when fitting S to the model we need to jump over match states, so we must allow deletions in the query sequence.

One possibility to allow for deletions is to connect non-neighboring states:

[Diagram: match states with additional transitions between non-neighboring states]

This is too complex for modeling arbitrary deletions in a long sequence.

We therefore introduce the silent states D_j to model deletions.

Profile-HMMs: Silent states

We can model arbitrary deletions by connecting the states to a parallel chain of silent states (circles):

[Diagram: chain of match states M_j between Start and End, with a parallel chain of silent states D_j]

It is possible to get from any “real” state to any “real” state without emitting letters.

Cost of a deletion of length k (the silent states D_{i+1}, …, D_{i+k} are visited):

$$\log a_{M_i D_{i+1}} + \sum_{j=i+1}^{i+k-1} \log a_{D_j D_{j+1}} + \log a_{D_{i+k} M_{i+k+1}}$$

The deletion extension has a different probability at each step (different states), whereas the insertion extension contributes equally at each step (same state).
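The deletion cost can be written as a short function that walks along the chain of silent states. This sketch assumes the indexing used in the Viterbi recursions of these slides, where M_i connects to D_{i+1}; the dictionaries and their values are hypothetical.

```python
import math


def deletion_score(i, k, a_MD, a_DD, a_DM):
    """Cost of skipping k match states after M_i.

    Assumed indexing: a_MD[i] = a(M_i -> D_{i+1}), a_DD[j] = a(D_j -> D_{j+1}),
    a_DM[j] = a(D_j -> M_{j+1}).
    """
    score = math.log(a_MD[i])                 # open the deletion
    for j in range(i + 1, i + k):             # extend: D_j -> D_{j+1}
        score += math.log(a_DD[j])
    score += math.log(a_DM[i + k])            # close: D_{i+k} -> M_{i+k+1}
    return score


if __name__ == "__main__":
    # Hypothetical position-specific transition probabilities.
    a_MD = {1: 0.1}
    a_DD = {2: 0.3, 3: 0.4}
    a_DM = {2: 0.7, 3: 0.6, 4: 0.5}
    for k in (1, 2, 3):
        print(k, round(deletion_score(1, k, a_MD, a_DD, a_DM), 3))
```

Each extension step uses a different D_j to D_{j+1} transition, so the extension penalty varies along the deletion, unlike the single self-loop of an insertion state.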

States in a profile-HMM

[Diagram: profile-HMM with match states M_i, insertion states I_i and deletion states D_i between Start and End]

Match states (M_i): conserved positions in the alignment (plus start/end states)

Insertion states (I_i): variable regions (not clearly alignable)

Deletion states (D_i): model gaps in the alignment

Building a profile-HMM

[Diagram: profile-HMM with states M_i, I_i and D_i between Start and End]

This model represents the consensus of a family of sequences, not the sequence of any particular member.

A multiple alignment is used to construct the HMM:
•  Assign each aligned (conserved) column to a Match state (M) in the HMM; this determines the length of the model.
•  Estimate the emission probabilities according to the amino acid counts in the columns. Different positions in the protein will have different emission probabilities.
•  Add Insertion (I) and Deletion (D) states; the connectivity between all states is still to be determined.
•  Estimate the transition probabilities between Match, Deletion and Insertion states.

Probabilities in a profile-HMM

[Diagram: profile-HMM with states M_i, I_i and D_i between Start and End]

$e_{M_i}(a)$: emission probability in Match state = frequency of each amino acid in the alignment column

$e_{I_i}(a) = p(a)$: emission probability in Insertion state = amino acid frequency in all sequences (background)

$a_{M_i I_i}$: transition probability from match to insertion state; $\log a_{M_i I_i}$ is the gap-opening penalty

$a_{I_i I_i}$: transition probability within an insertion state; $\log a_{I_i I_i}$ is the gap-extension penalty

$a_{D_i D_{i+1}}$: transition probability between deletion states

Probabilities in a profile-HMM

[Diagram: profile-HMM with states M_i, I_i and D_i between Start and End]

$a_{D_i I_i} = 0$: transition probability from a deletion state to an insertion state

$a_{I_i D_{i+1}} = 0$: transition probability from an insertion state to a deletion state

These transitions are usually very improbable.

How to assign the states?

[Diagram: profile-HMM with states M_i, I_i and D_i between Start and End]

Heuristic rules:
•  Assign to insertion states the alignment columns that contain gaps in more than half of the sequences. Assign to match states the conserved columns with fewer gaps.
•  Calculate the entropy of each column and assign to insertion states the columns with a high degree of disorder.

In the example above, all columns will be M except for the 4th and 5th, which will be I states.
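Both heuristics are easy to prototype. The alignment below is an assumption: a small globin-like toy example whose 4th and 5th columns are mostly gaps, not necessarily the alignment pictured in the slides. Function names and the 0.5 threshold parameter are likewise illustrative.

```python
import math
from collections import Counter

# Assumed toy alignment (7 sequences, 10 columns).
ALIGNMENT = [
    "VGA--HAGEY",
    "V----NVDEV",
    "VEA--DVAGH",
    "VKG------D",
    "VYS--TYETS",
    "FNA--NIPKH",
    "IAGADNGAGV",
]


def assign_columns(alignment, gap="-", max_gap_fraction=0.5):
    """Label a column 'I' if more than max_gap_fraction of its rows are gaps, else 'M'."""
    labels = []
    for column in zip(*alignment):
        gap_fraction = column.count(gap) / len(column)
        labels.append("I" if gap_fraction > max_gap_fraction else "M")
    return labels


def column_entropy(column, gap="-"):
    """Shannon entropy (bits) of the residues in a column, ignoring gaps."""
    counts = Counter(c for c in column if c != gap)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    print("".join(assign_columns(ALIGNMENT)))
    for index, column in enumerate(zip(*ALIGNMENT), start=1):
        print(index, "".join(column), round(column_entropy(column), 2))
```

The gap rule and the entropy rule need not agree; in practice a threshold on each (or a combination) has to be chosen for the family at hand.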

Profile-HMM parameter estimation

We start from a given sample of alignments. We can estimate the parameters by counting the transitions and emissions:

$A_{kl}$: number of transitions from state k to state l

$E_k(b)$: number of times the symbol b is emitted by state k

We can estimate the probabilities as follows:

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$

This estimate is accurate for a large number of sequences.

To avoid overfitting, use pseudocounts:

$$A_{kl} \rightarrow A_{kl} + r_{kl}, \qquad E_k(b) \rightarrow E_k(b) + r_k(b)$$

Pseudocounts reflect our prior knowledge.
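The counting estimator, with and without pseudocounts, fits in one function. The sketch assumes a uniform pseudocount r = 1 for every possible outcome (one simple choice among many), and uses emission counts matching the worked example that follows in these slides (V observed 5 times, F and I once each in the first match column). Names are hypothetical.

```python
from fractions import Fraction

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def estimate(counts, outcomes, pseudocount=0):
    """Return count-based probabilities over `outcomes`.

    counts: observed counts (missing outcomes count as 0).
    pseudocount: value added to every outcome before normalising (assumed uniform).
    """
    adjusted = {o: counts.get(o, 0) + pseudocount for o in outcomes}
    total = sum(adjusted.values())
    return {o: Fraction(adjusted[o], total) for o in outcomes}


if __name__ == "__main__":
    counts_M1 = {"V": 5, "F": 1, "I": 1}
    plain = estimate(counts_M1, AMINO_ACIDS)
    smoothed = estimate(counts_M1, AMINO_ACIDS, pseudocount=1)
    for aa in ("V", "F", "A"):
        print(aa, plain[aa], smoothed[aa])
```

Without pseudocounts every unobserved amino acid gets probability 0, so a single unseen residue would make a whole sequence impossible under the model; the pseudocount keeps those probabilities small but non-zero.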

Parameter estimation: Example

Emission probabilities of M1:

$$e_{M_1}(V) = 5/7$$
$$e_{M_1}(F) = e_{M_1}(I) = 1/7$$

Using pseudocounts:

$$e_{M_1}(V) = (5+1)/(7+20) = 6/27$$
$$e_{M_1}(F) = e_{M_1}(I) = (1+1)/(7+20) = 2/27$$
$$e_{M_1}(a) = 1/27 \quad \text{for all other amino acids } a$$

Transition probabilities from M1:

$$a_{M_1 M_2} = 6/7$$
$$a_{M_1 D_2} = 1/7$$
$$a_{M_1 I_1} = 0$$

Using pseudocounts:

$$a_{M_1 M_2} = (6+1)/(7+3) = 7/10$$
$$a_{M_1 D_2} = (1+1)/(7+3) = 2/10$$
$$a_{M_1 I_1} = (0+1)/(7+3) = 1/10$$

Parameter estimation: Example

Emission probabilities of M3:

$$e_{M_3}(A) = 3/6$$
$$e_{M_3}(G) = 2/6$$

Using pseudocounts:

$$e_{M_3}(A) = (3+1)/(6+20) = 4/26$$
$$e_{M_3}(G) = (2+1)/(6+20) = 3/26$$

Transition probabilities from M3:

$$a_{M_3 M_4} = 4/6$$
$$a_{M_3 D_4} = 1/6$$
$$a_{M_3 I_3} = 1/6$$

Using pseudocounts:

$$a_{M_3 M_4} = (4+1)/(6+3) = 5/9$$
$$a_{M_3 D_4} = (1+1)/(6+3) = 2/9$$
$$a_{M_3 I_3} = (1+1)/(6+3) = 2/9$$

Transition probabilities from D3:

$$a_{D_3 M_4} = 1/1$$
$$a_{D_3 D_4} = 0$$

Using pseudocounts:

$$a_{D_3 M_4} = (1+1)/(1+2) = 2/3$$
$$a_{D_3 D_4} = (0+1)/(1+2) = 1/3$$

Always check normalization! Here we removed the D → I transition, so only two transitions leave D3.
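The normalization check can be automated. The sketch below reuses the transition counts of this example (4, 1, 1 out of M3; 1, 0 out of D3) with an assumed pseudocount of 1 per allowed transition, and also shows how the numbers would change if a D → I transition were kept. The function name is hypothetical.

```python
from fractions import Fraction


def normalise(counts, pseudocount=0):
    """Turn transition counts into probabilities, adding a pseudocount per allowed target."""
    total = sum(counts.values()) + pseudocount * len(counts)
    return {target: Fraction(c + pseudocount, total) for target, c in counts.items()}


if __name__ == "__main__":
    from_M3 = {"M4": 4, "D4": 1, "I3": 1}
    from_D3 = {"M4": 1, "D4": 0}              # D -> I removed from the model
    from_D3_with_I = {"M4": 1, "D4": 0, "I3": 0}

    for name, counts in (("M3", from_M3), ("D3", from_D3), ("D3 with D->I", from_D3_with_I)):
        probs = normalise(counts, pseudocount=1)
        assert sum(probs.values()) == 1       # the normalization check
        print(name, {t: str(p) for t, p in probs.items()})
```

The denominator depends on how many transitions are allowed out of a state, which is why removing D → I changes the pseudocounted estimates from D3.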

Searching with Profile-HMMs

Profile-HMMs can be used to detect a possible new member of a sequence family.

We must compare the new sequence against the profile-HMM.

[Diagram: profile-HMM with states M_i, I_i and D_i between Start and End]

Searching with Profile-HMMs

We can use Viterbi to obtain the most probable path π* across the model and then calculate its probability:

$$P(S \mid \pi^*)$$

We can use Forward to obtain the total probability of the sequence given the model:

$$P(S) = P(s_1 \ldots s_L) = \sum_{\pi} P(s_1 \ldots s_L, \pi_0 \ldots \pi_N)$$

In general we use log-likelihood ratios (log-odds) with respect to a background model.

Profile HMM Viterbi

$V_j^M(i)$: best score (log-odds) for the best path of states aligning the subsequence s_1…s_i to the submodel up to state j, ending with the emission of s_i by M_j

$V_j^I(i)$: best score for the best path ending with s_i being emitted by I_j

$V_j^D(i)$: best score for the best path ending at D_j

Profile HMM Viterbi

$$V_j^M(i) = \log\frac{e_{M_j}(s_i)}{q_{s_i}} + \max \begin{cases} V_{j-1}^M(i-1) + \log a_{M_{j-1} M_j} \\ V_{j-1}^I(i-1) + \log a_{I_{j-1} M_j} \\ V_{j-1}^D(i-1) + \log a_{D_{j-1} M_j} \end{cases}$$

$$V_j^I(i) = \log\frac{e_{I_j}(s_i)}{q_{s_i}} + \max \begin{cases} V_j^M(i-1) + \log a_{M_j I_j} \\ V_j^I(i-1) + \log a_{I_j I_j} \\ V_j^D(i-1) + \log a_{D_j I_j} \end{cases}$$

$$V_j^D(i) = \max \begin{cases} V_{j-1}^M(i) + \log a_{M_{j-1} D_j} \\ V_{j-1}^I(i) + \log a_{I_{j-1} D_j} \\ V_{j-1}^D(i) + \log a_{D_{j-1} D_j} \end{cases}$$

The deletion state emits no residue, so its recursion stays at sequence position i.

The emission term in $V_j^I(i)$ does not contribute in general, since $e_{I_j}(s_i) = q_{s_i}$.

The terms with $a_{D_j I_j}$ and $a_{I_{j-1} D_j}$ (transitions between deletion and insertion states) are usually not present; they are negligible when scoring an alignment to the model.

Profile HMM Viterbi

Initialisation:

The start state is M_0, such that

$$V_0^M(0) = 0$$

We allow transitions to I_0 and D_1.

Termination:

The end state is M_{L+1}. We allow the alignment to end in a deletion or insertion state:

$$\text{Score}(S \mid \pi^*) = \max \begin{cases} V_L^M(n) + \log a_{M_L,\text{end}} \\ V_L^I(n) + \log a_{I_L,\text{end}} \\ V_L^D(n) + \log a_{D_L,\text{end}} \end{cases}$$

where L is the number of match states in the model and n is the length of S.
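The recursions, initialisation and termination can be combined into a compact scoring routine. This is a sketch under stated assumptions, not a reference implementation: it treats deletion states as silent (their recursion stays at the same sequence index, the form given in Durbin et al.), takes insert emissions equal to the background so they add 0, and uses a hypothetical dictionary layout for the transitions. The two-state toy model is invented for the example.

```python
import math

NEG_INF = float("-inf")


def safe_log(x):
    """log(x), with log(0) mapped to -infinity (a forbidden move)."""
    return math.log(x) if x > 0 else NEG_INF


def viterbi_score(seq, match_emit, trans, background):
    """Best log-odds score of seq against a profile HMM with L match states.

    match_emit[j-1][a] = e_Mj(a); trans[((kind, j), (kind2, j2))] = probability,
    missing keys meaning 0. The start state is ("M", 0), the end state ("M", L+1).
    """
    L, n = len(match_emit), len(seq)
    V = {k: [[NEG_INF] * (n + 1) for _ in range(L + 1)] for k in "MID"}
    V["M"][0][0] = 0.0  # initialisation: V_0^M(0) = 0

    def best(j, i, target):
        # Best way of entering `target` from M_j, I_j or D_j at sequence index i.
        return max(V[k][j][i] + safe_log(trans.get(((k, j), target), 0.0)) for k in "MID")

    for i in range(n + 1):
        for j in range(L + 1):
            if i >= 1 and j >= 1:
                residue = seq[i - 1]
                emit = safe_log(match_emit[j - 1].get(residue, 0.0) / background[residue])
                V["M"][j][i] = emit + best(j - 1, i - 1, ("M", j))
            if i >= 1:
                V["I"][j][i] = best(j, i - 1, ("I", j))   # insert emission adds 0
            if j >= 1:
                V["D"][j][i] = best(j - 1, i, ("D", j))   # silent: index i unchanged
    return best(L, n, ("M", L + 1))                        # termination


# Hypothetical toy model with two match states over the alphabet {A, C}.
MATCH_EMIT = [{"A": 0.8, "C": 0.2}, {"A": 0.1, "C": 0.9}]
BACKGROUND = {"A": 0.5, "C": 0.5}
TRANS = {
    (("M", 0), ("M", 1)): 0.8, (("M", 0), ("I", 0)): 0.1, (("M", 0), ("D", 1)): 0.1,
    (("I", 0), ("M", 1)): 0.5, (("I", 0), ("I", 0)): 0.5,
    (("M", 1), ("M", 2)): 0.8, (("M", 1), ("I", 1)): 0.1, (("M", 1), ("D", 2)): 0.1,
    (("I", 1), ("M", 2)): 0.5, (("I", 1), ("I", 1)): 0.5,
    (("D", 1), ("M", 2)): 0.9, (("D", 1), ("D", 2)): 0.1,
    (("M", 2), ("M", 3)): 0.9, (("M", 2), ("I", 2)): 0.1,
    (("I", 2), ("M", 3)): 0.5, (("I", 2), ("I", 2)): 0.5,
    (("D", 2), ("M", 3)): 1.0,
}

if __name__ == "__main__":
    for s in ("AC", "C", "ACC"):
        print(s, round(viterbi_score(s, MATCH_EMIT, TRANS, BACKGROUND), 4))
```

The routine returns only the score; recovering the path π* would additionally require storing, for each cell, which of the three predecessors achieved the maximum.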

Making a collection of Profile-HMMs for protein families

•  Use BLAST (or similar) to separate a protein database into families of related proteins.
•  Construct a multiple alignment for each protein family.
•  Construct a profile HMM and optimize the parameters of the model (transition and emission probabilities).
•  Align the target sequence against each HMM to find the best fit between the target sequence and an HMM.

PFAM

•  Pfam describes protein domains (http://pfam.sanger.ac.uk/)
•  Each protein domain family in Pfam has:
   - Seed alignment: a manually verified multiple alignment of a representative set of sequences.
   - HMM built from the seed alignment for further database searches.
   - Full alignment generated automatically from the HMM.
•  The distinction between seed and full alignments facilitates Pfam updates.
   - Seed alignments are stable resources.
   - HMM profiles and full alignments can be updated with newly found amino acid sequences.
•  Pfam HMMs span entire domains that include both well-conserved motifs and less-conserved regions with insertions and deletions.
•  Modeling complete domains facilitates better sequence annotation and leads to more sensitive detection.

Exercise (exam 2013): Consider the following multiple alignment:

G C A G
G - A G
G C T G
A - A C
G - A C
G - G G
A - A C

Draw a hidden Markov model that would describe this alignment using two types of states, Match and Insert states. Estimate the transition and emission probabilities for the model (no need to use pseudocounts).
