bioinformatics t8-go-hmm v2014
TRANSCRIPT
![Page 1: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/1.jpg)
![Page 2: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/2.jpg)
FBW
2-12-2014
Wim Van Criekinge
![Page 3: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/3.jpg)
![Page 4: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/4.jpg)
Gene Prediction, HMM & ncRNA
What to do with an unknown
sequence ?
Gene Ontologies
Gene Prediction
Composite Gene Prediction
Non-coding RNA
HMM
![Page 5: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/5.jpg)
UNKNOWN PROTEIN SEQUENCE
LOOK FOR:
• Similar sequences in databases ((PSI-)BLAST)
• Distinctive patterns/domains associated
with function
• Functionally important residues
• Secondary and tertiary structure
• Physical properties (hydrophobicity, isoelectric
point, etc.)
![Page 6: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/6.jpg)
BASIC INFORMATION COMES FROM SEQUENCE
• One sequence- can get some information eg
amino acid properties
• More than one sequence- get more info on
conserved residues, fold and function
• Multiple alignments of related sequences-
can build up consensus sequences of known
families, domains, motifs or sites.
• Sequence alignments can give information
on loops, families and function from
conserved regions
![Page 7: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/7.jpg)
Additional analysis of protein sequences
• transmembrane
regions
• signal sequences
• localisation
signals
• targeting
sequences
• GPI anchors
• glycosylation sites
• hydrophobicity
• amino acid
composition
• molecular weight
• solvent accessibility
• antigenicity
![Page 8: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/8.jpg)
FINDING CONSERVED PATTERNS IN PROTEIN SEQUENCES
• Pattern - short, simplest, but limited
• Motif - conserved element of a sequence
alignment, usually predictive of structural or
functional region
To get more information across whole
alignment:
• Profile
• HMM
![Page 9: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/9.jpg)
PATTERNS
• Small, highly conserved regions
• Shown as regular expressions
Example:
[AG]-x-V-x(2)-x-{YW}
– [] shows either amino acid
– x is any amino acid
– x(2) is any amino acid in each of the next 2 positions
– {} shows any amino acid except these
BUT- limited to near exact match in small region
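A pattern in this notation maps almost directly onto a regular expression. The sketch below is a minimal illustration that assumes only the four constructs listed on this slide; the helper name `prosite_to_regex` and the two test peptides are made up for the example, and it is not a full parser for the PROSITE syntax.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        repeat = ""
        if "(" in element:                  # e.g. x(2): repeat count in brackets
            element, count = element.rstrip(")").split("(")
            repeat = "{" + count + "}"
        if element.lower() == "x":          # x: any amino acid
            core = "."
        elif element.startswith("{"):       # {YW}: any amino acid except Y or W
            core = "[^" + element[1:-1] + "]"
        else:                               # a single residue or a class like [AG]
            core = element
        parts.append(core + repeat)
    return "".join(parts)

regex = prosite_to_regex("[AG]-x-V-x(2)-x-{YW}")
hit = re.search(regex, "MAKVLLKA")    # hypothetical peptide that fits the pattern
miss = re.search(regex, "MAKVLLKW")   # same peptide ending in the excluded W
```

Here `hit` matches the seven residues `AKVLLKA`, while `miss` is `None` because the last position is one of the excluded residues.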
![Page 10: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/10.jpg)
PROFILES
• Table or matrix containing comparison
information for aligned sequences
• Used to find sequences similar to
alignment rather than one sequence
• Contains same number of rows as
positions in sequences
• Row contains score for alignment of
position with each residue
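As a rough illustration of such a table, the sketch below builds one row per alignment position, with a log-odds score for each residue. The uniform background, the pseudocount of 1, the toy four-letter alphabet and the function names are all assumptions made for the example; real profile tools use more careful weighting.

```python
from collections import Counter
from math import log2

def build_profile(alignment, alphabet, pseudocount=1.0):
    """Return one dict per alignment column: residue -> log-odds score."""
    background = 1.0 / len(alphabet)          # assumed uniform background
    total = len(alignment) + pseudocount * len(alphabet)
    profile = []
    for pos in range(len(alignment[0])):
        counts = Counter(seq[pos] for seq in alignment)
        row = {aa: log2(((counts[aa] + pseudocount) / total) / background)
               for aa in alphabet}
        profile.append(row)
    return profile

def profile_score(profile, segment):
    """Sum the per-position scores of an ungapped segment."""
    return sum(row[aa] for row, aa in zip(profile, segment))

# Toy ungapped alignment over a reduced alphabet (illustrative only)
profile = build_profile(["ACD", "ACE", "ACD"], alphabet="ACDE")
good = profile_score(profile, "ACD")
poor = profile_score(profile, "DDA")
```

A segment that agrees with the alignment (`good`) scores higher than one that does not (`poor`), which is what lets a profile find sequences similar to the alignment as a whole rather than to a single sequence.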
![Page 11: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/11.jpg)
HIDDEN MARKOV MODELS (HMM)
• An HMM is a large-scale profile with gaps,
insertions and deletions allowed in the
alignments, and built around probabilities
• Package used: HMMER (http://hmmer.wustl.edu/)
• Start with one sequence or alignment (hmmbuild),
then calibrate with hmmcalibrate, then search a
database with the HMM (hmmsearch)
• E-value: number of false matches expected with
a certain score or better
• Assume an extreme value distribution (EVD) for noise;
calibrate by searching random sequences with the HMM
to build up the curve of noise
HMM
![Page 12: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/12.jpg)
Sequence
![Page 13: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/13.jpg)
Gene Prediction, HMM & ncRNA
What to do with an unknown
sequence ?
Gene Ontologies
Gene Prediction
HMM
Composite Gene Prediction
Non-coding RNA
![Page 14: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/14.jpg)
What is an ontology?
• An ontology is an explicit
specification of a conceptualization.
• A conceptualization is an abstract,
simplified view of the world that we
want to represent.
• If the specification medium is a formal representation, the ontology defines the vocabulary.
![Page 15: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/15.jpg)
Why Create Ontologies?
• to enable data exchange among
programs
• to simplify unification (or translation)
of disparate representations
• to employ knowledge-based services
• to embody the representation of a
theory
• to facilitate communication among
people
![Page 16: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/16.jpg)
Summary
• Ontologies are what they do:
artifacts to help people and their
programs communicate, coordinate,
collaborate.
• Ontologies are essential elements in
the technological infrastructure of
the Knowledge Age
• http://www.geneontology.org/
![Page 17: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/17.jpg)
•Molecular Function — elemental activity or task
nuclease, DNA binding, transcription factor
•Biological Process — broad objective or goal
mitosis, signal transduction, metabolism
•Cellular Component — location or complex
nucleus, ribosome, origin recognition complex
The Three Ontologies
![Page 18: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/18.jpg)
DAG Structure
Directed acyclic graph: each child may have one or more parents
![Page 19: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/19.jpg)
Example - Molecular Function
![Page 20: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/20.jpg)
Example - Biological Process
![Page 21: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/21.jpg)
Example - Cellular Location
![Page 22: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/22.jpg)
AmiGO browser
![Page 23: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/23.jpg)
GO: Applications
• Eg. chip-data analysis: Overrepresented item
can provide functional clues
• Overrepresentation check: contingency table
– Chi-square test (or Fisher's exact test if a frequency is < 5)
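For small counts, the overrepresentation check can be done exactly with the hypergeometric tail, which is the one-sided form of Fisher's exact test on the 2x2 contingency table. The sketch below is a minimal version under that assumption; the function name, parameter names and the example numbers are hypothetical.

```python
from math import comb

def enrichment_p_value(hits, selected, annotated, total):
    """P(at least `hits` annotated genes among `selected` genes drawn at random).

    total     -- number of genes on the chip
    annotated -- genes on the chip carrying the GO term
    selected  -- genes in the list of interest (e.g. differentially expressed)
    hits      -- genes in that list carrying the GO term
    """
    upper = min(selected, annotated)
    tail = sum(comb(annotated, i) * comb(total - annotated, selected - i)
               for i in range(hits, upper + 1))
    return tail / comb(total, selected)

# Made-up example: 8 of 50 selected genes carry a term that 40 of 1000 genes have
p = enrichment_p_value(hits=8, selected=50, annotated=40, total=1000)
```

A small `p` suggests the term is overrepresented in the list; when many GO terms are tested, a multiple-testing correction would normally be applied on top of this.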
![Page 24: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/24.jpg)
Gene Prediction, HMM & ncRNA
What to do with an unknown sequence ?
Web applications
Gene Ontologies
Gene Prediction
HMM
Composite Gene Prediction
Non-coding RNA
![Page 25: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/25.jpg)
Problem:
Given a very long DNA sequence, identify coding
regions (including intron splice sites) and their
predicted protein sequences
Computational Gene Finding
![Page 26: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/26.jpg)
Eukaryotic gene structure
Computational Gene Finding
![Page 27: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/27.jpg)
• There is no (yet known) perfect method
for finding genes. All approaches rely on
combining various “weak signals”
together
• Find elements of a gene
– coding sequences (exons)
– promoters and start signals
– poly-A tails and downstream signals
• Assemble into a consistent gene model
Computational Gene Finding
![Page 28: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/28.jpg)
Genefinder
![Page 29: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/29.jpg)
![Page 30: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/30.jpg)
GENE STRUCTURE INFORMATION - POSITION ON PHYSICAL MAP
This gene structure corresponds to the position on the physical map
![Page 31: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/31.jpg)
GENE STRUCTURE INFORMATION - ACTIVE ZONE
This gene structure shows the Active Zone
The Active Zone limits the extent of
analysis, genefinder & fasta dumps
A blue line within the yellow box
indicates regions outside of the active
zone
The active zone is set by entering
coordinates in the active zone (yellow
box)
![Page 32: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/32.jpg)
GENE STRUCTURE INFORMATION - POSITION
This gene structure relates to the Position:
Change origin of
this scale by
entering a
number in the
green 'origin'
box
![Page 33: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/33.jpg)
GENE STRUCTURE INFORMATION - PREDICTED GENE STRUCTURE
This gene structure relates to the predicted gene structures
Boxes are Exons,
thin lines (or
springs) are Introns
![Page 34: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/34.jpg)
Find the open reading frames
GAAAAAGCTCCTGCCCAATCTGAAATGGTTAGCCTATCTTTCCACCGT
Any sequence has 3 potential reading frames (+1, +2, +3)
Its complement also has three potential reading frames (-1, -2, -3)
6 possible reading frames
The triplet, non-punctuated nature of the genetic code helps us out
64 potential codons
61 true codons
3 stop codons (TGA, TAA, TAG)
With a random base distribution, approx. 1 in 21 codons (3/64) will be a stop
E K A P A Q S E M V S L S F H R
K K L L P N L K W L A Y L S T
K S S C P I * N G * P I F P P
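The six-frame translation on this slide can be reproduced in a few lines. This is a minimal sketch assuming the standard genetic code and an input containing only A, C, G and T; the function names are made up for the example.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames

frames = six_frames("GAAAAAGCTCCTGCCCAATCTGAAATGGTTAGCCTATCTTTCCACCGT")
```

For the example sequence, frames +1 and +2 contain no stop codon, while frame +3 contains two; long stop-free stretches are the candidates for open reading frames.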
![Page 35: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/35.jpg)
GENE STRUCTURE INFORMATION - OPEN READING FRAMES
This gene structure relates to Open reading Frames
There is one column
for each frame
Small horizontal
lines represent stop
codons
![Page 36: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/36.jpg)
They have one
column for each
frame
The size indicates
relative score for the
particular start site
GENE STRUCTURE INFORMATION - START CODONS
This gene structure represents Start Codons
![Page 37: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/37.jpg)
• Amino acid distributions are biased
e.g. p(A) > p(C)
• Pairwise distributions also biased
e.g. p(AT)/[p(A)*p(T)] > p(AC)/[p(A)*p(C)]
• Nucleotides that code for preferred amino
acids (and AA pairs) occur more frequently in
coding regions than in non-coding regions.
• Codon biases (per amino acid)
• Hexanucleotide distributions that reflect those
biases indicate coding regions.
Computational Gene Finding: Hexanucleotide frequencies
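One way to turn hexanucleotide biases into a coding measure is a log-odds score between hexamer frequencies in known coding and non-coding sequence. The sketch below assumes overlapping hexamers, a pseudocount of 1 and natural logs; the function names and the toy training sequences are illustrative only, not taken from any particular gene finder.

```python
from collections import Counter
from math import log

def hexamer_counts(sequences):
    """Count all overlapping hexanucleotides in a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - 5):
            counts[seq[i:i + 6]] += 1
    return counts

def coding_score(seq, coding, noncoding, pseudocount=1.0):
    """Sum of log(P_coding / P_noncoding) over the hexamers of seq."""
    n_coding = sum(coding.values()) + pseudocount * 4 ** 6
    n_noncoding = sum(noncoding.values()) + pseudocount * 4 ** 6
    seq = seq.upper()
    score = 0.0
    for i in range(len(seq) - 5):
        hexamer = seq[i:i + 6]
        p_coding = (coding[hexamer] + pseudocount) / n_coding
        p_noncoding = (noncoding[hexamer] + pseudocount) / n_noncoding
        score += log(p_coding / p_noncoding)
    return score

# Toy training sets (made up); real ones would be large sets of annotated regions
coding = hexamer_counts(["ATGGCCATGGCCATGGCC"])
noncoding = hexamer_counts(["ATATATATATATATATAT"])
likely_coding = coding_score("ATGGCCATGGCC", coding, noncoding)
likely_noncoding = coding_score("ATATATATATAT", coding, noncoding)
```

A positive score means the hexamer content looks more like the coding set, a negative score more like the non-coding set.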
![Page 38: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/38.jpg)
Gene prediction
Generation of datasets (Ensmart@Ensembl):
Dataset 1 (http://biobix.ugent.be/txt/coding.txt) consists of >900 coding regions (DNA):
Dataset 2 (http://biobix.ugent.be/txt/noncoding.txt) consists of >900 non-coding regions
Distance Array: Calculate for every base all the distances (in bp) to the same nucleotide (focus on the first 1000 bp of the coding region and limit the distance array to a window of 1000 bp)
Do you see a difference in this “distance array” between coding and noncoding sequence ?
Could it be used to predict genes ?
Write a program to predict genes in the following genomic sequence (http://biobix.ugent.be/txt/genomic.txt)
What else could help in finding genes in raw genomic sequences ?
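As a possible starting point for the exercise, the distance array can be computed as a histogram of distances between identical nucleotides. This is only a sketch of one reading of the task; the function name and parameters are assumptions, and downloading and parsing the datasets is left out.

```python
def distance_array(seq, window=1000, limit=1000):
    """hist[d] = number of position pairs d bp apart that carry the same base.

    Only the first `limit` bases are used and only distances up to `window`.
    """
    seq = seq[:limit].upper()
    hist = [0] * (window + 1)
    for i, base in enumerate(seq):
        for j in range(i + 1, min(len(seq), i + window + 1)):
            if seq[j] == base:
                hist[j - i] += 1
    return hist

# Tiny artificial sequence with a perfect period of 3
hist = distance_array("ACGACGACG", window=5)
```

For the artificial example all matches fall at distance 3. Comparing the summed histograms of the coding and the non-coding dataset, for instance the counts at multiples of 3 against the rest, is one way to look for the difference the exercise asks about.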
![Page 39: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/39.jpg)
GENE STRUCTURE INFORMATION - CODING POTENTIAL
This gene structure corresponds to the Coding Potential
The grey boxes indicate
regions where the codon
frequencies match those of
known C. elegans genes.
the larger the grey box the
more this region resembles a
C. elegans coding element
![Page 40: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/40.jpg)
blastn (EST)
For raw DNA sequence analysis blastx is
extremely useful
Will probe your DNA sequence against the protein database
A match (homolog) gives you some ideas regarding function
One problem is the large number of genome sequences:
you will get matches to genome database entries that were annotated strictly by
sequence homology; often you need some experimental evidence
![Page 41: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/41.jpg)
GENE STRUCTURE INFORMATION - SEQUENCE SIMILARITY
This feature shows protein sequence similarity
The blue boxes indicate
regions of sequence which
when translated have
similarity to previously
characterised proteins.
To view the alignment,
select the right mouse
button whilst over the blue
box.
![Page 42: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/42.jpg)
GENE STRUCTURE INFORMATION - EST MATCHES
This gene structure relates to EST Matches
The yellow boxes represent
DNA matches (Blast) to C.
elegans Expressed Sequence
Tags (ESTs)
To view the alignment use the
right mouse button whilst
over the yellow box to invoke
Blixem
![Page 43: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/43.jpg)
Borodovsky et al., 1999, Organization of the Prokaryotic Genome (Charlebois, ed) pp. 11-34
New generation of programs to predict gene coding
sequences based on a non-random repeat pattern (e.g. Glimmer, GeneMark): actually pretty good
![Page 44: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/44.jpg)
• CpG islands are regions of sequence that
have a high proportion of CG dinucleotide
pairs (p is the phosphodiester bond linking
them)
– CpG islands are present in the promoter and
exonic regions of approximately 40% of
mammalian genes
– Other regions of the mammalian genome contain
few CpG dinucleotides and these are largely
methylated
• Definition: sequences of >500 bp with
– G+C > 55%
– Observed(CpG)/Expected(CpG) > 0.65
Computational Gene Finding
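The three criteria in the definition can be checked directly on a candidate region. The sketch below uses the thresholds from this slide; the formula for the expected CpG count (number of C times number of G, divided by the length) is a commonly used convention and an assumption here, as are the function names.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")
    expected = c * g / n if n else 0.0     # assumed definition of Expected(CpG)
    gc_fraction = (c + g) / n if n else 0.0
    ratio = observed / expected if expected else 0.0
    return gc_fraction, ratio

def looks_like_cpg_island(seq, min_len=500, min_gc=0.55, min_ratio=0.65):
    gc_fraction, ratio = cpg_stats(seq)
    return len(seq) > min_len and gc_fraction > min_gc and ratio > min_ratio

# Artificial extremes, for illustration only
island = looks_like_cpg_island("CG" * 300)      # 600 bp, all CpG
too_short = looks_like_cpg_island("CG" * 100)   # 200 bp
at_rich = looks_like_cpg_island("AT" * 300)     # no G or C at all
```

In practice the check would be run in a sliding window along the genomic sequence rather than on a single pre-cut region.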
![Page 45: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/45.jpg)
GENE STRUCTURE INFORMATION - REPEAT FAMILIES
This gene structure corresponds to Repeat Families
This column shows
matches to members of a
number of repeat families
Currently a hidden Markov
model is used to detect
these
![Page 46: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/46.jpg)
GENE STRUCTURE INFORMATION - REPEATS
This gene structure relates to Repeats
This column shows regions
of localised repeats both
tandem and inverted
Clicking on the boxes will
show the complete repeat
information in the blue line
at the top end of the screen
![Page 47: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/47.jpg)
Exon/intron boundaries
![Page 48: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/48.jpg)
• Most eukaryotic introns have a
consensus splice signal: GU (GT in the
DNA sequence) at the beginning (“donor”),
AG at the end (“acceptor”).
• Variation does occur in the splice sites
• Many AGs and GTs are not splice sites.
• Database of experimentally validated
human splice sites:
http://www.ebi.ac.uk/~thanaraj/splice.html
Computational Gene Finding: Splice junctions
![Page 49: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/49.jpg)
GENE STRUCTURE INFORMATION - PUTATIVE SPLICE SITES
This gene structure shows putative splice sites
The Splice Sites are shown
'Hooked'
The Hook points in the
direction of splicing, therefore
3' splice sites point up and 5'
Splice sites point down
The colour of the Splice Site
indicates the position at which
it interrupts the Codon
The height of the Splices is
proportional to the Genefinder
score of the Splice Site
![Page 50: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/50.jpg)
Gene Prediction, HMM & ncRNA
What to do with an unknown sequence ?
Web applications
Gene Ontologies
Gene Prediction
HMM
Composite Gene Prediction
Non-coding RNA
![Page 51: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/51.jpg)
![Page 52: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/52.jpg)
• Recall that profiles are matrices that
identify the probability of seeing an
amino acid at a particular location in a
motif.
• What about motifs that allow insertions
or deletions (together, called indels)?
• Patterns and regular expressions can
handle these easily, but profiles are
more flexible.
• Can indels be integrated into profiles?
Towards profiles (PSSM) with indels – insertions and/or deletions
![Page 53: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/53.jpg)
• Need a representation that allows
specification of the probability of
introducing (and/or extending) a gap in
the profile.
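[Figure: three profile positions, each with its own residue probabilities, connected through “Gap” states with arcs labelled “delete” and “continue”]
Position 1: A .1, C .05, D .2, E .08, F .01
Position 2: A .04, C .1, D .01, E .2, F .02
Position 3: A .2, C .01, D .05, E .1, F .06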
Hidden Markov Models: Graphical models of sequences
![Page 54: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/54.jpg)
• A sequence is said to be Markovian if the
probability of the occurrence of an element in
a particular position depends only on the
previous elements in the sequence.
• Order of a Markov chain depends on how
many previous elements influence probability
– 0th order: probability does not depend on previous elements (same distribution at every position)
– 1st order: probability depends only on immediately
previous position.
• 1st order Markov chains are good for proteins.
Hidden Markov Chain
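A first-order chain is fully described by its table of transition probabilities, which can be estimated by counting adjacent pairs in training sequences. The sketch below does this for DNA; the pseudocount, the uniform probability for the first base, the function names and the toy training sequence are assumptions, and the input is assumed to contain only A, C, G and T.

```python
from math import log

def train_first_order(sequences, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(current base | previous base) from adjacent-pair counts."""
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in alphabet}
            for a in alphabet}

def log_probability(seq, transitions, first=0.25):
    """Log-probability of seq under the chain (first base assumed uniform)."""
    logp = log(first)
    for prev, cur in zip(seq, seq[1:]):
        logp += log(transitions[prev][cur])
    return logp

transitions = train_first_order(["ACGTACGTACGT"])   # toy training data
typical = log_probability("ACGT", transitions)
atypical = log_probability("AAAA", transitions)
```

Sequences that follow the transitions seen in training (`typical`) get a higher log-probability than ones that do not (`atypical`). Training one chain on coding and one on non-coding sequence and comparing the two log-probabilities gives a simple classifier.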
![Page 55: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/55.jpg)
Markov Chain for DNA
![Page 56: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/56.jpg)
Markov chain with begin and end
![Page 57: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/57.jpg)
• Consists of states (boxes) and transitions
(arcs) labeled with probabilities
• States have probability(s) of “emitting” an
element of a sequence (or nothing).
• Arcs have probability of moving from one
state to another.
– Sum of probabilities of all out arcs must be 1
– Self-loops (e.g. gap extend) are OK.
Markov Models: Graphical models of sequences
![Page 58: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/58.jpg)
• Simplest example: Each state emits (or,
equivalently, recognizes) a particular
element with probability 1, and each
transition is equally likely.
Example sequences: 1234 234 14 121214 2123334
[Figure: state diagram with states Begin, Emit 1, Emit 2, Emit 3, Emit 4 and End]
Markov Models
![Page 59: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/59.jpg)
• Now, add probabilities to each transition (let
emission remain a single element)
• We can calculate the probability of any sequence given this
model by multiplying
[Figure: the same state diagram (Begin, Emit 1, Emit 2, Emit 3, Emit 4, End) with arc probabilities 0.5, 0.5, 0.25, 0.75, 0.9, 0.1, 0.2, 0.8 and 1.0]
p(1234) = 0.5 * 0.1 * 0.75 * 0.8 = 0.03
p(14) = 0.5 * 0.9 = 0.45
p(2334)= 0.5 * 0.75 * 0.2 * 0.8 = 0.06
Hidden Markov Models: Probabilistic Markov Models
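The three products on this slide can be checked with a small transition table. The assignment of probabilities to arcs below is inferred from the worked products and the example sequences; in particular the arc from state 2 back to state 1 (0.25) is an assumption, since no product on the slide uses it.

```python
# Transition probabilities, inferred from the worked examples (see note above)
TRANSITIONS = {
    "Begin": {"1": 0.5, "2": 0.5},
    "1": {"2": 0.1, "4": 0.9},
    "2": {"1": 0.25, "3": 0.75},   # 2 -> 1 is an inferred value
    "3": {"3": 0.2, "4": 0.8},
    "4": {"End": 1.0},
}

def sequence_probability(symbols):
    """Multiply transition probabilities along Begin -> symbols -> End."""
    path = ["Begin", *symbols, "End"]
    p = 1.0
    for current, following in zip(path, path[1:]):
        p *= TRANSITIONS.get(current, {}).get(following, 0.0)
    return p

p_1234 = sequence_probability("1234")
p_14 = sequence_probability("14")
p_2334 = sequence_probability("2334")
p_13 = sequence_probability("13")    # no arc from 1 to 3
```

Because each state emits exactly one symbol, the sequence fixes the path, so a single product is enough; a sequence that needs a missing arc, such as `13`, gets probability 0.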
![Page 60: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/60.jpg)
• If we let the states define a set of emission
probabilities for elements, we can no longer be
sure which state we are in given a particular
element of a sequence
BCCD or BCCD ? (the same sequence can be produced by more than one path through the states)
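[Figure: the same state diagram with arc probabilities 0.5, 0.5, 0.25, 0.75, 0.9, 0.1, 0.2, 0.8 and 1.0; each state now has emission probabilities, listed in the order the boxes appear]
A (0.8) B (0.2)
B (0.7) C (0.3)
C (0.1) D (0.9)
C (0.6) A (0.4)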
Hidden Markov Models: Probabilistic Emission
![Page 61: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/61.jpg)
• Emission uncertainty means the sequence doesn't
identify a unique path. The states are “hidden”
• Probability of a sequence is sum of all paths that can
produce it:
[Figure: the same HMM as on the previous slide, with the same transition and emission probabilities]
p(bccd) = 0.5 * 0.2 * 0.1 * 0.3 * 0.75 * 0.6 * 0.8 * 0.9
+ 0.5 * 0.7 * 0.75 * 0.6 * 0.2 * 0.6 * 0.8 * 0.9
= 0.000972 + 0.013608 = 0.01458
Hidden Markov Models
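Enumerating paths by hand stops being practical for longer sequences; the forward algorithm computes the same sum one symbol at a time. The sketch below applies it to the model of this slide. The state names `S1` to `S4` and the assignment of probabilities to arcs and states are inferred from the worked sum for p(bccd) and should be read as assumptions.

```python
# Model inferred from the worked example: S1..S4 are hypothetical state names
TRANS = {
    "Begin": {"S1": 0.5, "S2": 0.5},
    "S1": {"S2": 0.1, "S4": 0.9},
    "S2": {"S1": 0.25, "S3": 0.75},
    "S3": {"S3": 0.2, "S4": 0.8},
    "S4": {"End": 1.0},
}
EMIT = {
    "S1": {"A": 0.8, "B": 0.2},
    "S2": {"B": 0.7, "C": 0.3},
    "S3": {"C": 0.6, "A": 0.4},
    "S4": {"C": 0.1, "D": 0.9},
}

def forward_probability(seq):
    """P(seq | model): sum over all state paths from Begin to End."""
    # forward[s] = probability of emitting the prefix so far and being in state s
    forward = {s: TRANS["Begin"].get(s, 0.0) * EMIT[s].get(seq[0], 0.0)
               for s in EMIT}
    for symbol in seq[1:]:
        forward = {s: sum(forward[p] * TRANS[p].get(s, 0.0) for p in EMIT)
                      * EMIT[s].get(symbol, 0.0)
                   for s in EMIT}
    return sum(forward[s] * TRANS[s].get("End", 0.0) for s in EMIT)

p_bccd = forward_probability("BCCD")
```

For `BCCD` this gives 0.01458, the same value as the two-path sum on the slide, without listing the paths explicitly.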
![Page 62: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/62.jpg)
Hidden Markov Models
![Page 63: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/63.jpg)
Hidden Markov Models: The occasionally dishonest casino
![Page 64: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/64.jpg)
Hidden Markov Models: The occasionally dishonest casino
![Page 65: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/65.jpg)
• The HMM must first be “trained” using a training set
– E.g. a database of known genes.
– Consensus sequences for all signal sensors are needed.
– Compositional rules (i.e., emission probabilities) and length distributions are necessary for content sensors.
• Transition probabilities between all connected states must be estimated.
• Estimate the probability of sequence s, given model m, P(s|m)
– Multiply probabilities along the most likely path (or add logs: less numeric error); strictly, P(s|m) is the sum over all paths, so the best path gives an approximation
Use of Hidden Markov Models
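Finding the most likely path while adding logs is the Viterbi algorithm. The sketch below applies it to the occasionally dishonest casino. The slides give no numbers for that model, so the parameters here (fair/loaded switching probabilities, a loaded die showing 6 half of the time, equal start probabilities) are the usual textbook values and are assumptions for the example.

```python
from math import log

STATES = "FL"                                   # F = fair die, L = loaded die
START = {"F": 0.5, "L": 0.5}                    # assumed start probabilities
TRANS = {"F": {"F": 0.95, "L": 0.05},           # assumed switching probabilities
         "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": {r: 1 / 6 for r in "123456"},
        "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}

def viterbi(rolls):
    """Return the most likely state path for a string of die rolls."""
    score = {s: log(START[s]) + log(EMIT[s][rolls[0]]) for s in STATES}
    back = []                                   # back-pointers, one dict per step
    for roll in rolls[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: score[p] + log(TRANS[p][s]))
            pointers[s] = best
            new_score[s] = score[best] + log(TRANS[best][s]) + log(EMIT[s][roll])
        score = new_score
        back.append(pointers)
    state = max(STATES, key=score.get)          # best final state
    path = [state]
    for pointers in reversed(back):             # trace the path backwards
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

sixes = viterbi("666666666666")
mixed = viterbi("123451234512")
```

A run of sixes is decoded as the loaded die throughout and a run without sixes as the fair die throughout. Gene finders use the same recursion with exon, intron and intergenic states in place of the two dice.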
![Page 66: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/66.jpg)
• HMMs are effectively profiles with gaps, and
have applications throughout Bioinformatics
• Protein sequence applications:
– MSAs and identifying distant homologs
E.g. Pfam uses HMMs to define its MSAs
– Domain definitions
– Used for fold recognition in protein structure
prediction
• Nucleotide sequence applications:
– Models of exons, genes, etc. for gene
recognition.
Applications of Hidden Markov Models
![Page 67: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/67.jpg)
• UC Santa Cruz (David Haussler group)
– SAM-T02 server. Returns alignments, secondary
structure predictions, HMM parameters, etc. etc.
– SAM HMM building program
(requires free academic license)
• Washington U. St. Louis (Sean Eddy group)
– Pfam. Large database of precomputed HMM-based
alignments of proteins
– HMMER, program for building HMMs
• Gene finders and other HMMs (more later)
Hidden Markov Models Resources
![Page 68: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/68.jpg)
Example TMHMM
Beyond Kyte-Doolittle …
![Page 69: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/69.jpg)
HMM in protein analysis
• http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html
![Page 70: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/70.jpg)
![Page 71: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/71.jpg)
Hidden Markov model for gene structure
• A representation of the linguistic rules for what features might follow what other features when parsing a sequence consisting of a multiple exon gene.
• A candidate gene structure is created by tracing a path from B to F.
• A hidden Markov model (or hidden semi-Markov model) is defined by attaching stochastic models to each of the arcs and nodes.
Signals (blue nodes):
• begin sequence (B)
• start translation (S)
• donor splice site (D)
• acceptor splice site (A)
• stop translation (T)
• end sequence (F)
Contents (red arcs):
• 5’ UTR (J5’)
• initial exon (EI)
• exon (E)
• intron (I)
• final exon (EF)
• single exon (ES)
• 3’ UTR (J3’)
![Page 72: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/72.jpg)
Classic Programs for gene finding
Some of the best programs are HMM based:
• GenScan – http://genes.mit.edu/GENSCAN.html
• GeneMark – http://opal.biology.gatech.edu/GeneMark/
Other programs
• AAT, EcoParse, Fexeh, Fgeneh, Fgenes, Finex, GeneHacker, GeneID-3, GeneParser 2, GeneScope, Genie, GenLang, Glimmer, GlimmerM, Grail II, HMMgene, Morgan, MZEF, Procrustes, SORFind, Veil, Xpound
![Page 73: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/73.jpg)
GENSCAN (not to be confused with GeneScan, a commercial product)
• A Semi-Markov Model
– Explicit model of how long
to stay in a state (rather
than just self-loops, which
force a geometric, i.e. exponentially
decaying, length distribution)
• Tracks “phase” of exon or
intron (0 coincides with codon
boundary, or 1 or 2)
• Tracks strand (and direction)
Hidden Markov Models: Gene Finding Software
![Page 74: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/74.jpg)
Conservation of Gene Features
Conservation pattern across 3165 mappings of human RefSeq mRNAs to the genome. A program sampled 200 evenly spaced bases across 500 bases upstream of transcription, the 5’ UTR, the first coding exon, introns, middle coding exons, introns, the 3’ UTR and 500 bases after polyadenylation. There are peaks of conservation at the transition from one region to another.
[Figure: conservation plot; y-axis: aligning identity, 50% to 100%]
![Page 75: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/75.jpg)
Composite Approaches
• Use EST info to constrain HMMs (Genie)
• Use protein homology info on top of HMMs
(fgenesh++, GenomeScan)
• Use cross species genomic alignments on top
of HMMs (twinscan, fgenesh2, SLAM, SGP)
![Page 76: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/76.jpg)
Gene Prediction: more complex …
1. Species specific
2. Splicing enhancers found in coding regions
3. Trans-splicing
4. …
![Page 77: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/77.jpg)
Length preference
5’ ss intcomp branch 3’ ss
![Page 78: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/78.jpg)
![Page 79: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/79.jpg)
Contents - Schedule
RNA genes
Besides the 6000 protein-coding genes, there are:
140 ribosomal RNA genes
275 transfer RNA genes
40 small nuclear RNA genes
>100 small nucleolar RNA genes
?
pRNA in the phi29 rotary packaging motor (Simpson
et al. Nature 408:745-750, 2000)
Cartilage-hair hypoplasia mapped to an RNA
(Ridanpää et al. Cell 104:195-203, 2001)
The human Prader-Willi critical region (Cavaillé
et al. PNAS 97:14035-7, 2000)
![Page 80: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/80.jpg)
![Page 81: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/81.jpg)
![Page 82: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/82.jpg)
![Page 83: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/83.jpg)
![Page 84: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/84.jpg)
RNA genes can be hard to detect
UGAGGUAGUAGGUUGUAUAGU
C. elegans let-7; 21 nt
(Pasquinelli et al. Nature 408:86-89,2000)
Often small
Sometimes multicopy and redundant
Often not polyadenylated
(not represented in ESTs)
Immune to frameshift and nonsense mutations
No open reading frame, no codon bias
Often evolving rapidly in primary sequence
miRNA genes
![Page 85: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/85.jpg)
• Lin-4 identified in a screen for mutations that affect timing and
sequence of postembryonic development in C. elegans. Mutants
reiterate L1 instead of later stages of development
• Gene positionally cloned by isolating a 693-bp DNA fragment that
can rescue the phenotype of mutant animals
• No protein found but 61-nucleotide precursor RNA with stem-loop
structure which is processed to 22-mer ncRNA
• Genetically lin-4 acts as negative regulator of lin-14 and lin-28
• The 3’ UTR of the target genes have short stretches of
complementarity to lin-4
• Deletion of these lin-4 target sequences causes an unregulated gain-of-function (gof) phenotype
• Lin-4 RNA inhibits accumulation of LIN-14 and LIN-28 proteins
although the target mRNA
Lin-4
![Page 86: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/86.jpg)
Let-7 (lethal-7) was also mapped to a ncRNA gene with a 21-
nucleotide product
The small let-7 RNA is also thought to be a post-transcriptional
negative regulator for lin-41 and lin-42
100% conserved in all bilaterally symmetrical animals (not
jellyfish and sponges)
Sometimes called stRNAs, small temporal RNAs
Let-7 (Pasquinelli et al. Nature 408:86-89, 2000)
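The "short stretches of complementarity" between a small RNA and a target 3’ UTR can be illustrated with a minimal sketch. It uses the 21-nt let-7 sequence quoted earlier in this transcript; the UTR string is a made-up toy example, not a real lin-41 sequence, and the function names are hypothetical. Only perfect Watson-Crick complementarity is considered (no G-U wobble, no bulges), which is a simplifying assumption.

```python
# Toy sketch: find the longest perfectly complementary stretch between a
# small RNA and a UTR. Assumes plain Watson-Crick pairing only.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA string (A, C, G, U)."""
    return rna.translate(COMPLEMENT)[::-1]


def longest_complementary_stretch(small_rna, utr):
    """Return (utr_start, utr_substring) for the longest UTR stretch that
    is perfectly complementary to some part of small_rna.

    Implemented as a longest-common-substring search between the UTR and
    the reverse complement of the small RNA.
    """
    target = reverse_complement(small_rna)
    best_len, best_end = 0, 0
    prev = [0] * (len(target) + 1)
    for i in range(1, len(utr) + 1):
        cur = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            if utr[i - 1] == target[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    start = best_end - best_len
    return start, utr[start:best_end]


LET7 = "UGAGGUAGUAGGUUGUAUAGU"      # sequence from the slide
TOY_UTR = "GGGGCUACCUCAGGGG"        # hypothetical UTR, for illustration only

print(reverse_complement(LET7))
print(longest_complementary_stretch(LET7, TOY_UTR))
```

In the toy UTR, the stretch found pairs with the first eight nucleotides of let-7. A realistic target search would also need to tolerate mismatches and wobble pairs.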
![Page 87: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/87.jpg)
![Page 88: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/88.jpg)
Two computational analysis problems
• Similarity search (e.g. BLAST): I give you a query, you find sequences in a database that look like the query (note: SW/BLAT)
– For RNA, you want to take the secondary structure of the query into account
• Genefinding. Based solely on a priori knowledge of what a “gene” looks like, find genes in a genome sequence
– For RNA, with no open reading frame and no codon bias, what do you look for ?
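To make the genefinding contrast concrete, here is a minimal sketch of the kind of signal a protein genefinder can rely on: an open reading frame. The function name is hypothetical, only the forward strand is scanned, and the first test sequence is a made-up toy. Applied to the let-7 sequence quoted earlier, the scan finds nothing, which is the difficulty the slide points to.

```python
# Toy ORF scan: forward strand only, ATG start, standard stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return the length in nucleotides (start codon through stop codon)
    of the longest forward-strand ORF, or 0 if there is none.

    RNA input is accepted; U is converted to T before scanning.
    """
    dna = seq.upper().replace("U", "T")
    best = 0
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                best = max(best, pos + 3 - start)
                break
    return best


print(longest_orf("CCATGAAACCCTAAGG"))        # toy coding sequence -> 12
print(longest_orf("UGAGGUAGUAGGUUGUAUAGU"))   # let-7 from the slide -> 0
```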
![Page 89: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/89.jpg)
![Page 90: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/90.jpg)
![Page 91: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/91.jpg)
![Page 92: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/92.jpg)
![Page 93: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/93.jpg)
![Page 94: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/94.jpg)
![Page 95: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/95.jpg)
Basic CFG
“production rules”
S -> aS
S -> Sa
S -> aSu
S -> SS
Context-free grammars
A CFG “derivation”
S -> aS
![Page 96: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/96.jpg)
![Page 97: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/97.jpg)
![Page 98: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/98.jpg)
![Page 99: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/99.jpg)
![Page 100: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/100.jpg)
![Page 101: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/101.jpg)
![Page 102: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/102.jpg)
![Page 103: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/103.jpg)
![Page 104: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/104.jpg)
Basic CFG
“production rules”
S -> aS
S -> Sa
S -> aSu
S -> SS
Context-free grammars
A CFG “derivation”
S -> aS
S -> aaS
S -> aaSS
S -> aagScuS
S -> aagaSucugSc
S -> aagaSaucuggScc
S -> aagacSgaucuggcgSccc
S -> aagacuSgaucuggcgSccc
S -> aagacuuSgaucuggcgaSccc
S -> aagacuucSgaucuggcgacSccc
S -> aagacuucgSgaucuggcgacaSccc
S -> aagacuucggaucuggcgacaccc
[Figure: the derived sequence drawn as a folded secondary structure]
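The derivation above can be replayed in code as a rough sketch. It reads the lowercase letters in the production rules as placeholders for any nucleotide and assumes an implicit terminating rule S -> (empty), neither of which the slide states explicitly. The tuple-based tree encoding and the function names are hypothetical. Each rule application becomes one node of a derivation tree, and expanding the tree yields both the sequence and a dot-bracket string that marks which bases were emitted as pairs.

```python
# Derivation tree nodes (hypothetical encoding):
#   ("end",)                  S -> (empty)   assumed terminating rule
#   ("left", x, child)        S -> x S       unpaired base on the left
#   ("right", child, x)       S -> S x       unpaired base on the right
#   ("pair", x, child, y)     S -> x S y     base pair x-y
#   ("bif", left, right)      S -> S S       bifurcation
END = ("end",)


def lefts(bases, child):
    """Apply S -> x S once per base in `bases`, left to right."""
    for x in reversed(bases):
        child = ("left", x, child)
    return child


def expand(node):
    """Return (sequence, dot_bracket) generated by a derivation tree."""
    kind = node[0]
    if kind == "end":
        return "", ""
    if kind == "left":
        _, x, child = node
        seq, db = expand(child)
        return x + seq, "." + db
    if kind == "right":
        _, child, x = node
        seq, db = expand(child)
        return seq + x, db + "."
    if kind == "pair":
        _, x, child, y = node
        seq, db = expand(child)
        return x + seq + y, "(" + db + ")"
    if kind == "bif":
        _, left, right = node
        lseq, ldb = expand(left)
        rseq, rdb = expand(right)
        return lseq + rseq, ldb + rdb
    raise ValueError(f"unknown node type: {kind}")


# First stem-loop of the slide's derivation: g-c, a-u, bulged a, c-g, loop uucg.
hairpin1 = ("pair", "g",
            ("pair", "a",
             ("right", ("pair", "c", lefts("uucg", END), "g"), "a"),
             "u"),
            "c")
# Second stem-loop: unpaired u, g-c, g-c, bulged c, g-c, loop aca.
hairpin2 = ("left", "u",
            ("pair", "g",
             ("pair", "g",
              ("left", "c", ("pair", "g", lefts("aca", END), "c")),
              "c"),
             "c"))
tree = lefts("aa", ("bif", hairpin1, hairpin2))

sequence, structure = expand(tree)
print(sequence)   # aagacuucggaucuggcgacaccc
print(structure)  # ..(((....).)).((.(...)))
```

The point of the sketch is that a context-free derivation emits paired bases together, which is why a CFG can describe nested base pairing where a regular grammar or a plain HMM cannot.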
![Page 105: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/105.jpg)
![Page 106: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/106.jpg)
![Page 107: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/107.jpg)
The power of comparative analysis
• Comparative genome analysis is an indispensable means of inferring whether a locus produces a ncRNA as opposed to encoding a protein.
• For a small gene to be called a protein-coding gene, one excellent line of evidence is that the ORF is significantly conserved in another related species.
• It is more difficult to positively corroborate a ncRNA by comparative analysis but, in at least some cases, a ncRNA might conserve an intramolecular secondary structure and comparative analysis can show compensatory base substitutions.
• With comparative genome sequence data now accumulating in the public domain for most if not all important genetic systems, comparative analysis can (and should) become routine.
![Page 108: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/108.jpg)
Compensatory substitutions
that maintain the structure
U U
C G
U A
A U
G C
A UCGAC 3’
G C
5’
![Page 109: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/109.jpg)
Evolutionary conservation of RNA molecules can be revealed
by identification of compensatory substitutions
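A minimal sketch of how compensatory substitutions could be detected, assuming two already-aligned sequences of equal length and a known shared structure in dot-bracket notation. The function names and the two example sequences are hypothetical toy data. G-U wobble pairs are counted as valid pairs here, which is an assumption rather than something the slides specify.

```python
# A pair is treated as valid if it is Watson-Crick or G-U wobble (assumption).
VALID_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"),
               ("C", "G"), ("G", "U"), ("U", "G")}


def base_pairs(dot_bracket):
    """Return sorted (i, j) index pairs for matching brackets."""
    stack, pairs = [], []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return sorted(pairs)


def compensatory_substitutions(structure, seq_a, seq_b):
    """List paired positions where both bases differ between the two
    sequences but each sequence still forms a valid pair."""
    hits = []
    for i, j in base_pairs(structure):
        pair_a = (seq_a[i], seq_a[j])
        pair_b = (seq_b[i], seq_b[j])
        both_changed = seq_a[i] != seq_b[i] and seq_a[j] != seq_b[j]
        if both_changed and pair_a in VALID_PAIRS and pair_b in VALID_PAIRS:
            hits.append((i, j, "".join(pair_a), "".join(pair_b)))
    return hits


structure = "((((...))))"     # toy hairpin: 4-bp stem, 3-nt loop
species_a = "GCAUGAAAUGC"     # hypothetical sequence
species_b = "GUAUGCAAUAC"     # C-G pair replaced by U-A; one loop change

print(compensatory_substitutions(structure, species_a, species_b))
# [(1, 9, 'CG', 'UA')]
```

The loop substitution at position 5 is ignored because it is unpaired. Only the double change that preserves pairing is reported, which is the signal comparative analysis looks for.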
![Page 110: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/110.jpg)
![Page 111: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/111.jpg)
• Manual annotation of 60,770 full-length mouse complementary
DNA sequences, clustered into 33,409 ‘transcriptional units’,
contributing 90.1% of a newly established mouse transcriptome
database.
• Of these transcriptional units, 4,258 are new protein-coding and
11,665 are new non-coding messages, indicating that non-coding
RNA is a major component of the transcriptome.
![Page 112: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/112.jpg)
Function of ncRNAs
![Page 113: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/113.jpg)
ncRNAs & RNAi
![Page 114: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/114.jpg)
Therapeutic Applications
• Shooting millions of tiny RNA molecules into a mouse’s bloodstream can protect its liver from the ravages of hepatitis, a new study shows. In this case, they blunt the liver’s self-destructive inflammatory response, which can be triggered by agents such as the hepatitis B or C viruses. (Harvard University immunologists Judy Lieberman and Premlata Shankar)
• In a series of experiments published online this week by Nature Medicine, Lieberman’s team gave mice injections of siRNAs designed to shut down a gene called Fas. When overactivated during an inflammatory response, it induces liver cells to self-destruct. The next day, the animals were given an antibody that sends Fas into hyperdrive. Control mice died of acute liver failure within a few days, but 82% of the siRNA-treated mice remained free of serious disease and survived. Between 80% and 90% of their liver cells had incorporated the siRNAs.
![Page 115: Bioinformatics t8-go-hmm v2014](https://reader030.vdocument.in/reader030/viewer/2022020208/55a4e71d1a28ab24748b4856/html5/thumbnails/115.jpg)