hidden markov models biol 7711 computational bioscience
DESCRIPTION
Consortium for Comparative Genomics. University of Colorado School of Medicine. Hidden Markov Models BIOL 7711 Computational Bioscience. Why a Hidden Markov Model?. Data elements are often linked by a string of connectivity, a linear sequence - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/1.jpg)
Biochemistry and Molecular GeneticsComputational Bioscience Program
Consortium for Comparative GenomicsUniversity of Colorado School of Medicine
Hidden Markov ModelsBIOL 7711
Computational Bioscience
University of Colorado School of MedicineConsortium for Comparative Genomics
![Page 2: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/2.jpg)
Why a Hidden Markov Model?Data elements are often linked by a
string of connectivity, a linear sequence
Secondary structure prediction (Goldman, Thorne, Jones)CpG islands
Models of exons, introns, regulatory regions, genesMutation rates along genome
![Page 3: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/3.jpg)
Sequence Alignment ProfilesMouse TCR Va
![Page 4: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/4.jpg)
Why a Hidden Markov Model?
Complications?Insertion and deletion of states (indels)Long-distance interactions
BenefitsFlexible probabilistic framework
E.g., compared to regular expressions
![Page 5: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/5.jpg)
Profiles: an Example
A .1C .05D .2E .08F .01
Gap A .04C .1D .01E .2F .02
Gap A .2C .01D .05E .1F .06
insert
delete
continue continue
insertinsert
![Page 6: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/6.jpg)
Profiles, an Example: States
A .1C .05D .2E .08F .01
Gap A .04C .1D .01E .2F .02
Gap A .2C .01D .05E .1F .06
insert
delete
continue continue
insertinsert
State #1 State #2 State #3
![Page 7: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/7.jpg)
Profiles, an Example: Emission
A .1C .05D .2E .08F .01
Gap A .04C .1D .01E .2F .02
Gap A .2C .01D .05E .1F .06
insert
delete
continue continue
insertinsert
State #1 State #2 State #3
Sequence Elements
(possibly emitted by a state)
![Page 8: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/8.jpg)
Profiles, an Example: Emission
A .1C .05D .2E .08F .01
Gap A .04C .1D .01E .2F .02
Gap A .2C .01D .05E .1F .06
insert
delete
continue continue
insertinsert
State #1 State #2 State #3
Sequence Elements
(possibly emitted by a state)
Emission Probabilities
![Page 9: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/9.jpg)
Profiles, an Example: Arcs
A .1C .05D .2E .08F .01
Gap A .04C .1D .01E .2F .02
Gap A .2C .01D .05E .1F .06
delete
continue continue
insertinsert
State #1 State #2 State #3
transitioninsert
![Page 10: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/10.jpg)
Profiles, an Example: Special States
A .1C .05D .2E .08F .01
Gap
A .04C .1D .01E .2F .02
Gap A .2C .01D .05E .1F .06
delete
continue continue
insertinsert
State #1 State #2 State #3
transition
insert
Self => SelfLoop
No Delete “State”
![Page 11: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/11.jpg)
![Page 12: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/12.jpg)
A Simpler not very Hidden MM
Nucleotides, no Indels, Unambiguous Path
G .1C .3A .2T .4
G .1C .1A .7T .1
G .3C .3A .1T .3
A0.7
T0.4
T0.3
1.0 1.0 1.0
![Page 13: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/13.jpg)
A Simpler not very Hidden MM
Nucleotides, no Indels, Unambiguous Path
G .1C .3A .2T .4
G .1C .1A .7T .1
G .3C .3A .1T .3
A0.7
T0.4
T0.3
1.0 1.0 1.0
![Page 14: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/14.jpg)
A Toy not-Hidden MMNucleotides, no Indels, Unambiguous
PathAll arcs out are equal
Example sequences: GATC ATC GC GAGAGC AGATTTC
BeginEmit G
Emit A
Emit C
Emit TEnd
Arc Emission
![Page 15: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/15.jpg)
A Simple HMMCpG Islands; States are Really Hidden
Now
G .1C .1A .4T .4
G .3C .3A .2T .2 0.1
0.2
CpG Non-CpG
0.8 0.9
![Page 16: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/16.jpg)
The Forward AlgorithmProbability of a Sequence is the
Sum of All Paths that Can Produce It
G .1C .1A .4T .4
G .3C .3A .2T .2
0.10.2
Non-CpG
0.8
0.9 G
CpG
G .3
G .1
.3*(
.3*.8+
.1*.1)=.075.1*(.3*.2+.1*.9)=.015
C
![Page 17: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/17.jpg)
The Forward AlgorithmProbability of a Sequence is the
Sum of All Paths that Can Produce It
G .1C .1A .4T .4
G .3C .3A .2T .2
0.10.2
Non-CpG
0.8
0.9 G
CpG
G .3
G .1
.3*(
.3*.8+
.1*.1)=.075.1*(.3*.2+.1*.9)=.015
C
.3*(
.075*.8+
.015*.1)=.0185
.1*(
.075*.2+
.015*.9)=.0029
G
![Page 18: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/18.jpg)
The Forward AlgorithmProbability of a Sequence is the
Sum of All Paths that Can Produce It
G .1C .1A .4T .4
G .3C .3A .2T .2
0.10.2
Non-CpG
0.8
0.9 G
CpG
G .3
G .1
.3*(
.3*.8+
.1*.1)=.075.1*(.3*.2+.1*.9)=.015
C
.3*(
.075*.8+
.015*.1)=.0185
.1*(
.075*.2+
.015*.9)=.0029
G
.2*(
.0185*.8+.0029*.1)=.003.4*(.0185*.2+.0029*.9)=.0025
A
.2*(
.003*.8+
.0025*.1)=.0005.4*(.003*.2+.0025*.9)=.0011
A
![Page 19: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/19.jpg)
The Forward AlgorithmProbability of a Sequence is the
Sum of All Paths that Can Produce It
G .1C .1A .4T .4
G .3C .3A .2T .2
0.10.2
Non-CpG
0.8
0.9 G
CpG
G .3
G .1
.3*(
.3*.8+
.1*.1)=.075.1*(.3*.2+.1*.9)=.015
C
.3*(
.075*.8+
.015*.1)=.0185
.1*(
.075*.2+
.015*.9)=.0029
G
.2*(
.0185*.8+.0029*.1)=.003.4*(.0185*.2+.0029*.9)=.0025
A
.2*(
.003*.8+
.0025*.1)=.0005.4*(.003*.2+.0025*.9)=.0011
A
![Page 20: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/20.jpg)
The Viterbi AlgorithmMost Likely Path
G .1C .1A .4T .4
G .3C .3A .2T .2
0.10.2
Non-CpG
0.8
0.9 G
CpG
G .3
G .1
.3*m(
.3*.8,
.1*.1)=.072.1*m(.3*.2,.1*.9)=.009
C
.3*m(
.075*.8,
.015*.1)=.0173
.1*m(
.075*.2,
.015*.9)=.0014
G
.2*m(
.0185*.8,.0029*.1)=.0028.4*m(.0185*.2,.0029*.9)=.0014
A
.2*m(
.003*.8,
.0025*.1)=.00044.4*m(.003*.2,.0025*.9)=.0050
A
![Page 21: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/21.jpg)
Forwards and BackwardsProbability of a State at a Position
G .1C .1A .4T .4
G .3C .3A .2T .2
0.10.2
Non-CpG
0.8
0.9 G
CpG
C G
.2*(
.0185*.8+.0029*.1)=.003.4*(.0185*.2+.0029*.9)=.0025
A
.2*(
.003*.8+
.0025*.1)=.0005.4*(.003*.2+.0025*.9)=.0011
A
.003*(
.2*.8+
.4*.2)=.0007.0025*(.2*.1+.4*.9)=.0009
![Page 22: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/22.jpg)
Forwards and BackwardsProbability of a State at a Position
G C G A A
.003*(
.2*.8+
.4*.2)=.0007.0025*(.2*.1+.4*.9)=.0009
![Page 23: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/23.jpg)
Homology HMMGene recognition, identify distant
homologs
Common Ancestral SequenceMatch, site-specific emission probabilitiesInsertion (relative to ancestor), global emission probsDelete, emit nothingGlobal transition probabilities
![Page 24: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/24.jpg)
Homology HMM
start
insert insert
match
delete delete
match end
insert
![Page 25: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/25.jpg)
![Page 26: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/26.jpg)
Homology HMMUses
Score sequences for match to HMMCompare alternative modelsAlignmentStructural alignment
![Page 27: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/27.jpg)
Multiple Sequence Alignment HMM
Defines predicted homology of positions (sites)
Recognize region within longer sequenceModel domains or whole proteins
Can modify model for sub-familiesIdeally, use phylogenetic tree
Often not much back and forthIndels a problem
![Page 28: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/28.jpg)
Model ComparisonBased on
For ML, takeUsually to avoid
numeric errorFor heuristics, “score” isFor Bayesian, calculate
![Page 29: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/29.jpg)
Parameters, Types of parameters
Amino acid distributions for positionsGlobal AA distributions for insert statesOrder of match statesTransition probabilitiesTree topology and branch lengthsHidden states (integrate or augment)
Wander parameter space (search)Maximize, or move according to
posterior probability (Bayes)
![Page 30: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/30.jpg)
Expectation Maximization (EM)
Classic algorithm to fit probabilistic model parameters with unobservable statesTwo Stages
MaximizeIf know hidden variables (states), maximize
model parameters with respect to that knowledge
ExpectationIf know model parameters, find expected
values of the hidden variables (states)Works well even with e.g., Bayesian
to find near-equilibrium space
![Page 31: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/31.jpg)
Homology HMM EMStart with heuristic (e.g., ClustalW)Maximize
Match states are residues aligned in most sequencesAmino acid frequencies observed in
columnsExpectation
Realign all the sequences given modelRepeat until convergenceProblems: Local, not global
optimizationUse procedures to check how it worked
![Page 32: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/32.jpg)
Model ComparisonDetermining significance depends
on comparing two modelsUsually null model, H0, and test model,
H1
Models are nested if H0 is a subset of H1
If not nestedAkaike Iinformation Criterion (AIC) [similar
to empirical Bayes] or Bayes Factor (BF) [but be careful]
Generating a null distribution of statistic
Z-factor, bootstrapping, , parametric bootstrapping, posterior predictive
![Page 33: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/33.jpg)
Z Test MethodDatabase of known negative controls
E.g., non-homologous (NH) sequencesAssume NH scores
i.e., you are modeling known NH sequence scores as a normal distribution
Set appropriate significance level for multiple comparisons (more below)
ProblemsIs homology certain?Is it the appropriate null model?
Normal distribution often not a good approximation
Parameter control hard: e.g., length distribution
![Page 34: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/34.jpg)
Bootstrapping and Parametric Models
Random sequence sampled from the same set of emission probability distributions
Same length is easyBootstrapping is re-sampling columnsParametric uses estimated frequencies,
may include variance, tree, etc.More flexible, can have more complex nullPseudocounts of global frequencies if data
limitInsertions relatively hard to model
What frequencies for insert states? Global?
![Page 35: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/35.jpg)
Homology HMM ResourcesUCSC (Haussler)
SAM: align, secondary structure predictions, HMM parameters, etc.
WUSTL/Janelia (Eddy)Pfam: database of pre-computed HMM
alignments for various proteinsHMMer: program for building HMMs
![Page 36: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/36.jpg)
![Page 37: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/37.jpg)
Increasing Asymmetry with Increasing Single
Strandedness
e.g., P ( A=> G) = c + t
t = ( DssH * Slope ) + Intercept
![Page 38: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/38.jpg)
2x Redundant Sites
![Page 39: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/39.jpg)
4x Redundant Sites
![Page 40: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/40.jpg)
Beyond HMMsNeural netsDynamic Bayesian netsFactorial HMMsBoltzmann TreesKalman filtersHidden Markov random fields
![Page 41: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.vdocument.in/reader035/viewer/2022062501/568165fa550346895dd92a07/html5/thumbnails/41.jpg)
COI Functional Regions
D
Water
K K
HD
H
Water
OxygenElectron
H (alt)
O2 + protons+ electrons = H2O + secondary proton pumping (=ATP)