10/29/2015 Gene Finding Project (Cont.) Charles Yan
TRANSCRIPT
04/20/23 1
Gene Finding Project (Cont.)
Charles Yan
Gene Finding
Summary of the project:
- Download the genome of E. coli K12
- Gene finding using kth-order Markov chains, where k = 1, 2, 3
- Gene finding using inhomogeneous Markov chains
Non-Coding Regions
The parts of the genome that are not labeled as gene or complement do not encode genetic information. These regions are non-coding regions.
The following figure shows that the positive strand is divided into three types of regions: gene, non-coding, and complement.
1st-Order Markov Chain
Since there are three types of regions in the sequence, we will develop three corresponding models: the gene model, the non-coding model, and the complement model.
1st-Order Markov Chain
For these models, we use the same structure as we showed in the example of identifying CpG islands.
The structure of the 1st-order Markov chain model.
1st-Order Markov Chain
Then, each model is reduced to a transition probability table. Here is an example for the gene model (1st-order Markov chain). We will need to estimate the probabilities for each model.
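As a sketch, such a transition probability table can be estimated from the annotated training regions by counting adjacent letter pairs. The helper below is illustrative, not code from the project handout.

```python
from collections import defaultdict

def transition_table(sequences):
    """Estimate 1st-order transition probabilities P(b | a) from a
    list of training sequences (e.g. all annotated gene regions)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        # Count every adjacent pair a -> b in every training sequence.
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    table = {}
    for a, row in counts.items():
        total = sum(row.values())
        # Normalize each row so the outgoing probabilities sum to 1.
        table[a] = {b: c / total for b, c in row.items()}
    return table
```

The same function, run separately on gene, complement, and non-coding training regions, yields the three tables.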
Kth-Order Markov Chain
The assumption of a 1st-order Markov chain is that x_n is independent of x_{n-2} x_{n-3} … x_1 given x_{n-1}, i.e., P(x_n | x_1 x_2 … x_{n-1}) = P(x_n | x_{n-1}).
Kth-Order Markov Chain
When k = 2 is used, the changes in the method include: (1) the transition probability table for each model grows to 16 × 4 (16 two-letter contexts by 4 next letters).
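A kth-order table can be estimated the same way as the 1st-order one, with a k-letter context as the row index; for k = 2 this yields up to 16 contexts. A minimal sketch with hypothetical helper names:

```python
from collections import defaultdict

def kth_order_table(sequences, k=2):
    """Estimate kth-order transition probabilities: the probability of
    the next base given the k preceding bases.  For k = 2 this gives a
    table of up to 16 two-letter contexts by 4 next bases."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(k, len(seq)):
            # Context is the k bases ending just before position i.
            counts[seq[i - k:i]][seq[i]] += 1
    return {ctx: {b: c / sum(row.values()) for b, c in row.items()}
            for ctx, row in counts.items()}
```

With k = 1 this reduces to the 1st-order table; with k = 3 the table grows to 64 × 4.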
Kth-Order Markov Chain (k = 1, 2, …)
- One model for each type of sequence: gene, complement, and non-coding.
- Every position in the same type of sequence is treated the same.
- But there are differences between positions; we need new methods to address these differences.
Inhomogeneous Markov Chains
When DNA is translated into protein, three bases (the DNA letters, which are A, T, G, and C) make up a codon and encode one amino acid residue (the letters of proteins).
Inhomogeneous Markov Chains
Each codon has three positions. In the previous models (referred to as homogeneous models), we do not distinguish between the three positions. In this section, we will build different models (referred to as inhomogeneous models) for different positions.
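One way to sketch this: train a separate transition table per codon position, assigning each transition to the codon position of its target base. The helper below is illustrative, not from the project materials.

```python
from collections import defaultdict

def codon_position_tables(gene_sequences):
    """Three 1st-order transition tables, one per codon position.
    The transition ending at index i is assigned to codon position
    i % 3 (assuming each training sequence starts at position 0 of a
    codon)."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(3)]
    for seq in gene_sequences:
        for i in range(1, len(seq)):
            counts[i % 3][seq[i - 1]][seq[i]] += 1
    tables = []
    for c in counts:
        tables.append({a: {b: n / sum(row.values()) for b, n in row.items()}
                       for a, row in c.items()})
    return tables
```

The homogeneous model is the special case where the three tables are pooled into one.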
Gene Models
The gene model will be split into three models, one for each codon position: gene model 1, gene model 2, and gene model 3.
Gene Models
In the prediction stage, for a given sequence X = x_1 x_2 x_3 … x_n, we will have to calculate three probabilities, one under each codon-position model.
Complement Region Models
We treat the complement region the same way as we do genes. Three models will be built for the three codon positions.
Three probabilities will be calculated when a prediction is to be made for a sequence X = x_1 x_2 x_3 … x_n.
Non-Coding Region Model
Since the non-coding region does not contain codons, every position will be considered the same. There is no change to the non-coding region model: the probability will be calculated as described in the homogeneous models.
Markov Chains
The test sequence is divided into fragments using a sliding window of 100 letters.
Predictions are made for each window, so each prediction applies to an entire window. We need new methods that can make a prediction for each letter.
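The sliding-window scheme can be sketched as follows; `score` stands in for whatever per-model log-probability function is used and is an assumption of this example.

```python
def sliding_window_predict(sequence, score, window=100):
    """Slide a fixed window along the sequence and label each window
    with the model (gene / complement / non-coding) whose score is
    highest.  `score(model, fragment)` is an assumed scoring function
    returning the log-probability of the fragment under that model."""
    labels = []
    for start in range(0, len(sequence) - window + 1, window):
        frag = sequence[start:start + window]
        best = max(("gene", "complement", "noncoding"),
                   key=lambda m: score(m, frag))
        labels.append((start, best))
    return labels
```

Note that the output is one label per 100-letter window, which is exactly the granularity problem described above.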
Markov Chains
What if the window is on the boundary of a gene?
These methods cannot predict boundaries precisely.
We need methods that can do so!
Hidden Markov Model (HMM) with Duration
Some changes in terminology: the gene model becomes the gene state (g), the non-coding model becomes the non-coding state (n), and the complement model becomes the complement state (c).
There is no need to divide the test sequence into windows; predictions will be made for the whole sequence. For example:

ATGGTCGATAGGGCCAATGCATACATAGACATAGAATAGGGCCAATG
ggggggggggggggggggnnnnnggggggggnnnnnncccccccccc

The length of each run of a state label is that state's duration.
Hidden Markov Model (HMM) with Duration
Two questions arise when making predictions: What is the next state? What is the duration of that state?
To make predictions for a sequence is to find a path of states (with a duration associated with each state) that can generate the sequence with maximum probability. This path is called the optimal path.

ATGGTCGATAGGGCCAATGCATACATAGACATAGAATAGGGCCAATG
ggggggggggggggggggnnnnnggggggggnnnnnncccccccccc
(g, 18) (n, 5) (g, 8) (n, 6) (c, 10)
Hidden Markov Model (HMM) with Duration
We can use dynamic programming to find the optimal path for a given sequence.
Let X = x_1 x_2 x_3 x_4 x_5 … x_n be the sequence about which predictions are to be made.
Let Z_k(s,d) be the maximum probability that the subsequence x_1 x_2 x_3 … x_k is generated by a path ending with (s,d).
Let S_k be the maximum probability of generating the subsequence x_1 x_2 x_3 … x_k using any path. Then

    S_k = max_{(s,d)} Z_k(s,d)

Let P_k be the last node of the optimal path that generates the subsequence x_1 x_2 x_3 … x_k; P_k^s refers to its state and P_k^d refers to the duration of its state. Note that S_k = Z_k(P_k^s, P_k^d).
Hidden Markov Model (HMM) with Duration
The recursion to calculate Z_k(s,d), for the case s != P_{k-d}^s, is:

    Z_k(s,d) = S_{k-d} * Q(P_{k-d}^s, s) * D(s,d) * E_s(x_{k-d+1} x_{k-d+2} … x_k)

where
Q(P_{k-d}^s, s): transition probability from P_{k-d}^s to s
D(s,d): probability that state s has a duration of d
E_s(x_{k-d+1} x_{k-d+2} … x_k): probability that state s generates x_{k-d+1} x_{k-d+2} … x_k
Hidden Markov Model (HMM) with Duration
Since S_k = Z_k(P_k^s, P_k^d), we have S_{k-d} = Z_{k-d}(P_{k-d}^s, P_{k-d}^d). Thus

    Z_k(s,d) = S_{k-d} * Q(P_{k-d}^s, s) * D(s,d) * E_s(x_{k-d+1} x_{k-d+2} … x_k)
             = Z_{k-d}(P_{k-d}^s, P_{k-d}^d) * Q(P_{k-d}^s, s) * D(s,d) * E_s(x_{k-d+1} x_{k-d+2} … x_k)
Hidden Markov Model (HMM) with Duration
If s == P_{k-d}^s, the new segment extends a segment of the same state, so the two segments are merged into one of duration d + P_{k-d}^d, and the recursion traces back to position m = k - d - P_{k-d}^d:

    Z_k(s,d) = S_m * Q(P_m^s, s) * D(s, d + P_{k-d}^d) * E_s(x_{m+1} x_{m+2} … x_k)
             = Z_m(P_m^s, P_m^d) * Q(P_m^s, s) * D(s, d + P_{k-d}^d) * E_s(x_{m+1} x_{m+2} … x_k)

where m = k - d - P_{k-d}^d.
Hidden Markov Model (HMM) with Duration
Now we have the recursion function.
We will discuss Q(P_{k-d}^s, s), D(s,d), and E_s(x_{k-d+1} x_{k-d+2} … x_k) later. Here, let's assume that they are known.
We can calculate Z_k(s,d) for all k, s, d where k <= n and d <= k.
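A minimal sketch of this dynamic programming in log space, under simplifying assumptions: it applies only the s != P_{k-d}^s branch of the recursion (the same-state merging case is omitted for brevity), and it caps durations at `max_dur`. All function and table names are illustrative.

```python
import math

def hmm_duration_decode(x, states, Q, D, E, max_dur):
    """Duration-HMM dynamic programming sketch (log space).
    Q[(s_prev, s)]: state-transition probability
    D[(s, d)]:      probability that state s lasts d letters
    E(s, frag):     probability that state s emits the fragment
    Returns the tables S, P_state, P_dur indexed by prefix length k."""
    n = len(x)
    S = [0.0] + [-math.inf] * n          # S[0] = log 1
    P_state = [None] * (n + 1)           # P_k^s
    P_dur = [0] * (n + 1)                # P_k^d
    for k in range(1, n + 1):
        for s in states:
            for d in range(1, min(k, max_dur) + 1):
                prev = P_state[k - d]    # state ending the optimal prefix path
                q = 1.0 if prev is None else Q.get((prev, s), 0.0)
                # Z_k(s,d) = S_{k-d} * Q * D(s,d) * E_s(fragment), in logs;
                # tiny floors avoid log(0).
                z = (S[k - d]
                     + math.log(q or 1e-300)
                     + math.log(D.get((s, d), 0.0) or 1e-300)
                     + math.log(E(s, x[k - d:k]) or 1e-300))
                if z > S[k]:
                    S[k], P_state[k], P_dur[k] = z, s, d
    return S, P_state, P_dur
```

The triple loop makes the cost O(n * |states| * max_dur), which is why a duration cap is commonly imposed in practice.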
Hidden Markov Model (HMM) with Duration
Initialization and recursion:
k = 0: Z_0(g,0) = 1, S_0 = 1
k = 1: Z_1(g,1) = Z_1(c,1) = Z_1(n,1) = 1, Z_1(g,0) = Z_1(c,0) = Z_1(n,0) = 1, S_1 = 1
For 2 <= k <= n:

    S_k = max_{(s,d)} Z_k(s,d)
    (P_k^s, P_k^d) = argmax_{(s,d)} Z_k(s,d)
Hidden Markov Model (HMM) with Duration
S_k, P_k^s, P_k^d (for all 1 <= k <= n) are the three tables we need to keep during this dynamic programming.

ATGGTCGATAGCGACGGGCCAATGCATAAGGGCGGCCAGATGTAGAAT

For each position k of the sequence we store the entries S_k, P_k^s, and P_k^d.
Hidden Markov Model (HMM) with Duration
At the end of the dynamic programming we will have S_k, P_k^s, P_k^d for all 1 <= k <= n. Then we can make predictions for the whole sequence using a back-track approach, illustrated on the example sequence:

ATGGTCGATAGCGACGGGCCAATGCATAAGGGCGGCCAGATGTAGAAT

- P_n^s = n, P_n^d = 5: the last 5 letters are labeled n.
- Jumping back 5 positions, P_k^s = g, P_k^d = 30: the preceding 30 letters are labeled g.
- Jumping back 30 more positions, P_k^s = c, P_k^d = 10: the preceding 10 letters are labeled c.
- Jumping back 10 more positions, P_k^s = n, P_k^d = 6: the first 6 letters are labeled n.

The predicted labeling is therefore a run of 6 n's, then 10 c's, then 30 g's, then 5 n's.
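The back-track loop itself can be sketched in a few lines: start at position n, read off the stored (state, duration), jump back by the duration, and repeat until the start of the sequence.

```python
def backtrack(P_state, P_dur, n):
    """Recover the optimal labeling from the back-pointer tables:
    P_state[k] and P_dur[k] describe the last segment of the optimal
    path for the prefix of length k.  Returns (state, duration)
    segments in left-to-right order."""
    segments = []
    k = n
    while k > 0:
        s, d = P_state[k], P_dur[k]
        segments.append((s, d))
        k -= d                 # jump back over the whole segment
    segments.reverse()         # collected right-to-left; flip
    return segments
```

Expanding each (state, duration) pair into a run of labels reproduces the per-letter prediction shown above.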
Hidden Markov Model (HMM) with Duration
Let’s go back to the statement “We will discuss Q(Pk-ds,s), D(s,d), and Es(xk-d+1xk-d+2…xk) later. Here, let’s assume that they are known.”
Now, we need to know how to estimate these functions.
Hidden Markov Model (HMM) with Duration
The three functions are:
Q(P_{k-d}^s, s): transition probability from P_{k-d}^s to s
D(s,d): probability that state s has a duration of d
E_s(x_{k-d+1} x_{k-d+2} … x_k): probability that state s generates x_{k-d+1} x_{k-d+2} … x_k
Hidden Markov Model (HMM) with Duration
Q(P_{k-d}^s, s) is the transition probability from P_{k-d}^s to s. There are three states: gene (g), complement (c), and non-coding (n), so Q is a 3 × 3 table with entries Q(g,g), Q(g,c), Q(g,n), and so on. For example:

    Q(g,c) = Ngc / Ng

Ng: number of genes
Ngc: number of times a gene is followed by a complement
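Estimating Q from the annotation can be sketched as counting adjacent region types; the input format (an ordered list of region labels along the genome) is an assumption of this example.

```python
from collections import Counter

def estimate_Q(region_labels):
    """Estimate state-transition probabilities from the ordered list of
    region types along the genome, e.g. ["g", "n", "c", "n", "g", ...].
    Q(a,b) = (# times a region of type a is followed by one of type b)
             / (# regions of type a that have a successor)."""
    pair_counts = Counter(zip(region_labels, region_labels[1:]))
    from_counts = Counter(region_labels[:-1])
    return {(a, b): c / from_counts[a] for (a, b), c in pair_counts.items()}
```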
D(s,d): probability that state s has a duration of d. We just need to find out the length distributions for gene, complement, and non-coding regions.
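The duration distribution is just the empirical length distribution of the observed regions of each type; a sketch:

```python
from collections import Counter

def estimate_D(region_lengths):
    """Empirical duration distribution for one state: region_lengths is
    the list of observed lengths of all regions of that type, and the
    result maps each length d to its relative frequency D(s, d)."""
    counts = Counter(region_lengths)
    total = len(region_lengths)
    return {d: c / total for d, c in counts.items()}
```

In practice the raw histogram may be smoothed or fit to a parametric distribution, since many lengths are never observed in training.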
Hidden Markov Model (HMM) with Duration
Es(xk-d+1xk-d+2…xk): Probability that state s generates xk-d+1xk-d+2…xk
This is the probability that a short sequence is generated by the gene, complement, or non-coding state.
It can be calculated using the homogeneous or inhomogeneous Markov chains introduced at the beginning of the class.
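For example, under a 1st-order homogeneous chain, E_s of a fragment is the product of an initial-base probability and the chain's transition probabilities. A sketch with hypothetical table names:

```python
import math

def log_prob(frag, initial, table):
    """Log-probability of a fragment under a 1st-order Markov chain:
    initial[b] is P(first base = b) and table[a][b] is P(b | a)."""
    lp = math.log(initial[frag[0]])
    for a, b in zip(frag, frag[1:]):
        lp += math.log(table[a][b])
    return lp
```

For the inhomogeneous case, one would instead pick the transition table matching each target base's codon position.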