genes recognition - École polytechnique fédérale de ...lsir recognition.pdf · 2. score of...
TRANSCRIPT
Genes Recognition
Julien Favre
Agenda
• PART 1 : Gene Structure
  – Gene Definition
  – Transcription Process
  – Gene Details
• PART 2 : Problem Definition
  – Gene Recognition
  – Why?
  – Complexity
• PART 3 : Problem Approach
  – Approaches
  – Solutions description
  – Method improvements
  – Conclusion
Gene Definition
What’s a Gene?
[Figure: DNA]
Transcription Process I
Transcription Process II
STEP 1
STEP 2
STEP 3
STEP 1 Transcription
STEP 2 Processing
Capping and Poly-A
Splicing
STEP 3 Translation
More details on Genes
[Figure: gene structure from 5’ to 3’. Labels: TATA box, beginning of the gene, start codon, end codon, splice sites, coding region, non-coding region]
This structure differs from gene to gene!
PART 2 : Problem Definition
Situation
• Over 3’500 million nucleotides
• 35’000 to 50’000 genes
2 important questions:
1) Where are the genes?
2) What are the coding parts?
Why?
• Annotate and correct the DNA databases
• Link genes with the known proteins
• Understand gene functions
• Understand gene expression mechanisms
We can read the DNA alphabet, but we don’t know where the meaningful words are or what they mean.
Complexity I
[Slide filled with raw sequence; the same block is repeated many times]
ATTCGATGCATTGGCATGATGACGTAGCGTATTAGACGACTAGCAGAGTGCACGACGATAGGCACGATAGACGCATTAGCAGCGCGACGCGCGAATGCGTGCGTCGTGTTGCGTCGCGCATTGGCATGATGACGTAGCGTATTAGACGACTAGCAGAGTGCACGACGATAGGCACGATAGACGCATTAGCAGCGCGACGCGCGA …
3’500 Million bases
Complexity II
[Figure: DNA with candidate donor sites and acceptor sites delimiting possible exons]
Number of parses = Fibonacci(n+m+1)
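To get a feeling for how fast Fibonacci(n+m+1) grows, here is a minimal sketch. It assumes that n and m count the candidate donor and acceptor sites (the slide does not define them) and that the sequence is indexed with Fibonacci(1) = Fibonacci(2) = 1. The function names are hypothetical.

```python
def fibonacci(n: int) -> int:
    """n-th Fibonacci number, assuming Fibonacci(1) = Fibonacci(2) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def number_of_parses(n_donor: int, n_acceptor: int) -> int:
    """Number of parses as given on the slide: Fibonacci(n + m + 1).

    Assumption: n and m are the numbers of candidate donor and acceptor sites.
    """
    return fibonacci(n_donor + n_acceptor + 1)


if __name__ == "__main__":
    # The count explodes even for a modest number of candidate sites.
    for sites in (5, 10, 20, 40):
        print(sites, "donor and acceptor sites:", number_of_parses(sites, sites))
```

Even a few dozen candidate sites already rule out trying every parse one by one.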
PART 3 : Problem Approach
Approaches
3 types of approaches:
1. Single Gene Recognition
   – Functional signals detection
     – Splice sites
     – Promoter, Poly-A, …
2. Multiple Genes Recognition
3. Similarities
Single Gene Recognition: Principle
Functional Signals Detection
Main goal is to detect the beginning and the end of the exons or genes
[Figure: gene from 5’ to 3’ with TATA box, start codon, end codon and splice sites]
Splicing Mechanism
• Consensus over the donor-acceptor site GU-AG (98%)
• Extremely reliable technique to detect exons
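As a small illustration of the GU-AG rule, the sketch below lists every position where a donor (GT in DNA, GU in RNA) or acceptor (AG) dinucleotide occurs. The function name and the example sequence are hypothetical, and each hit is only a candidate: the dinucleotides also occur by chance, so real detectors score the surrounding context as well.

```python
def candidate_splice_sites(dna: str):
    """Return (donor_positions, acceptor_positions) as 0-based indices.

    A donor candidate is any "GT" and an acceptor candidate any "AG".
    This is only a dinucleotide scan, not a full splice-site detector.
    """
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return donors, acceptors


if __name__ == "__main__":
    donors, acceptors = candidate_splice_sites("AAGTCCAGTT")
    print("donor candidates:", donors)
    print("acceptor candidates:", acceptors)
```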
Single Gene Recognition: Methods
1. Combinatorial methods
   – Single block
2. Probabilistic methods
   – Simple
   – Markov based
3. Linear Discriminant methods
Consensus Sequence
Obtained by choosing the most frequent base at each position of the multiple alignment of subsequences of interest
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
Consensus sequence: TATAAT

MELON
MANGO
HONEY
SWEET
COOKY
Consensus: MONEY

Leads to loss of information and can produce many false positive or false negative predictions
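The two examples above can be reproduced with a few lines of Python. This is a minimal sketch, not the author's code: the function name is hypothetical, and ties are broken alphabetically, which is an assumption (the fourth column of the TATAAT example has three A and three G).

```python
from collections import Counter


def consensus(aligned):
    """Most frequent symbol in each column of equal-length aligned sequences.

    Assumption: ties are broken alphabetically (smallest symbol wins).
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")
    result = []
    for column in zip(*aligned):
        counts = Counter(column)
        best = max(counts.values())
        result.append(min(sym for sym, n in counts.items() if n == best))
    return "".join(result)


if __name__ == "__main__":
    print(consensus(["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]))
    print(consensus(["MELON", "MANGO", "HONEY", "SWEET", "COOKY"]))
```

Note that MONEY is not one of the five input words: the consensus keeps only the majority symbol of each column and discards everything else, which is the loss of information mentioned above.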
Combinatory Methods
Consensus sequence (e.g. TATA box)
For a consensus sequence of size L and for a position in the considered sequence, we compute:
1. P(L,k) = P(detect the consensus sequence with k mismatches)
2. $$P(T) = \sum_{z=0}^{k} C_{F_l}^{T}\, P^{T}(L,z)\,\bigl(1 - P(L,z)\bigr)^{F_l - T}$$
   where F_l = number of possible positions in the considered sequence and T is the number of patterns detected in the given sequence
3. For a given T_0, define a threshold value for the detection.
Probabilistic Methods I
For a given consensus sequence, a weight matrix is computed by measuring the frequency of each base at each position in a training set.
[Figure: example weight matrices; labels: GU, Acceptor site]
Matrix entries can be considered as probabilities.
Disadvantage:
– assumes independence between adjacent bases
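A minimal sketch of how such a weight matrix could be built from aligned training sites. The function name, the optional pseudocount and the training set (the six TATAAT-like sequences from the consensus slide) are choices made for this illustration, not something taken from the original slides.

```python
from collections import Counter

BASES = "ACGT"


def weight_matrix(sites, pseudocount=0.0):
    """One {base: frequency} dict per position of the aligned training sites.

    pseudocount is an optional smoothing term (an assumption of this sketch);
    with 0.0 the entries are the raw column frequencies.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all training sites must have the same length")
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)
        total = len(column) + pseudocount * len(BASES)
        matrix.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return matrix


if __name__ == "__main__":
    training = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
    for position, column in enumerate(weight_matrix(training), start=1):
        print(position, {b: round(p, 2) for b, p in column.items()})
```

Each column sums to 1, so its entries can be read as the probability of seeing each base at that position of a site.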
Probabilistic Methods II
• Under the weight matrix model, the probability of having a sequence X = (x1, x2, …, xk) that matches a site is:
$$P(X/S) = \prod_{i=1}^{k} p^{i}_{x_i}$$
• If we introduce a measure of the form:
$$LLR(X) = \log\frac{P(X/S)}{P(X/N)}$$
then the more LLR exceeds 0, the better the chances that this sequence is a functional signal.
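The score can be sketched as follows. Two things here are assumptions: N is read as the non-site (background) model and is given one fixed probability per base, and the two-position matrix is a made-up toy example rather than a trained one.

```python
import math


def log_likelihood_ratio(seq, site_matrix, background):
    """LLR(X) = log(P(X/S) / P(X/N)), computed as a sum of per-position logs.

    site_matrix[i][base] is the weight matrix entry for position i.
    background[base] is the assumed non-site probability of that base.
    """
    llr = 0.0
    for i, base in enumerate(seq):
        llr += math.log(site_matrix[i][base] / background[base])
    return llr


# Hypothetical two-position weight matrix that strongly prefers "GT".
TOY_SITE = [
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
]
# Assumed background: all four bases equally likely.
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

if __name__ == "__main__":
    for candidate in ("GT", "GA", "CA"):
        print(candidate, round(log_likelihood_ratio(candidate, TOY_SITE, UNIFORM), 3))
```

Working with a sum of logarithms instead of a product of probabilities avoids numerical underflow on longer signals. Matrix entries equal to zero would need smoothing before taking the logarithm.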
Methods improvements
• 2 blocks approach: P(L1,k1,L2,k2) and distance D1
• Multiple nucleotides probabilities
• Neural network approach
• Reading frame
• Markov Models
Markov Models
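• The probabilistic (weight matrix) methods are 0-order Markov models
• Markov models introduce dependencies between the bases
• The probability of observing a sequence now becomes:
$$P(X/S) = p_0 \prod_{i=1}^{k} p^{\,i-1,i}_{x_i}$$
where the superscript i−1,i indicates that the probability of base x_i at position i depends on the base at position i−1.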
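A minimal sketch of a first-order Markov score. For brevity it assumes a single transition table shared by all positions, whereas the formula on the slide allows a different table for each pair of positions (i−1, i). The function name and all probabilities below are invented for the illustration.

```python
import math


def markov_log_probability(seq, initial, transition):
    """Log-probability of seq under a first-order Markov chain.

    initial[b]       : probability of the first base (the p0 term)
    transition[a][b] : probability of base b given that the previous base is a
    Assumption: the same transition table is used at every position.
    """
    logp = math.log(initial[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(transition[prev][cur])
    return logp


BASES = "ACGT"
# Toy parameters: uniform start, and a base repeats itself with probability 0.4.
INITIAL = {b: 0.25 for b in BASES}
TRANSITION = {a: {b: (0.4 if a == b else 0.2) for b in BASES} for a in BASES}

if __name__ == "__main__":
    for s in ("AAT", "AAA", "ACG"):
        print(s, round(math.exp(markov_log_probability(s, INITIAL, TRANSITION)), 4))
```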
Linear Discriminant methods I
Many functional signals are very short => exploit related characteristics
1. We build a sequence characteristics vector (x1, …, xp)
2. We define $Z = \sum_{i=1}^{p} a_i x_i$, and if Z > c then the sequence corresponds to a site
3. We use a training set to define {a_i} and c
4. The training set of « site sequences » defines a mean vector m1 and the « non site sequences » a mean vector m2:
$$a = s^{-1}(m_1 - m_2), \qquad c = a\,(m_1 + m_2)/2$$
where s is the pooled covariance matrix of the characteristics.
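The training step can be sketched in pure Python for the special case of two characteristics, so that the 2×2 matrix s can be inverted by hand. The function names and the training values are invented for the illustration; the two features could stand for, say, a weight matrix score and a base composition measure.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def pooled_covariance(class1, class2):
    """Pooled covariance matrix s of two classes."""
    m1, m2 = mean(class1), mean(class2)
    p = len(m1)
    s = [[0.0] * p for _ in range(p)]
    for vectors, m in ((class1, m1), (class2, m2)):
        for v in vectors:
            for i in range(p):
                for j in range(p):
                    s[i][j] += (v[i] - m[i]) * (v[j] - m[j])
    dof = len(class1) + len(class2) - 2
    return [[x / dof for x in row] for row in s]


def invert_2x2(s):
    """Inverse of a 2x2 matrix (this sketch is limited to two characteristics)."""
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    return [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]


def train_lda(sites, non_sites):
    """Return (a, c) with a = s^-1 (m1 - m2) and c = a . (m1 + m2) / 2."""
    m1, m2 = mean(sites), mean(non_sites)
    s_inv = invert_2x2(pooled_covariance(sites, non_sites))
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    a = [s_inv[i][0] * diff[0] + s_inv[i][1] * diff[1] for i in range(2)]
    c = sum(a[i] * (m1[i] + m2[i]) / 2 for i in range(2))
    return a, c


def is_site(x, a, c):
    """Decision rule from the slide: Z = sum(a_i * x_i) > c."""
    return sum(ai * xi for ai, xi in zip(a, x)) > c


# Invented training data: two characteristics per sequence.
SITES = [(4.0, 1.0), (5.0, 2.0), (6.0, 1.5), (5.0, 1.5)]
NON_SITES = [(1.0, 1.0), (2.0, 2.0), (1.5, 0.5), (1.5, 2.5)]

if __name__ == "__main__":
    a, c = train_lda(SITES, NON_SITES)
    print("a =", a, "c =", c)
    print(is_site((5.0, 1.5), a, c), is_site((1.5, 1.5), a, c))
```

With more than two characteristics the hand-written inverse would be replaced by a general linear solver.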
Linear Discriminant methods II
1. Choose a set of p characteristics
   – Score of the weight matrix
   – Distance to a predicted site
   – Base composition in distant sequence
   – …
2. Test the characteristics with the Mahalanobis distance:
$$D^2 = (m_1 - m_2)\, s^{-1}\, (m_1 - m_2)$$
3. Choose the set of q characteristics that maximizes D²
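For a single characteristic the formula reduces to D² = (m1 − m2)² / s², which is enough to rank characteristics one by one. The sketch below shows only this one-dimensional case, on invented values; scoring a combination of characteristics needs the full matrix form.

```python
def individual_d2(site_values, non_site_values):
    """Mahalanobis distance D^2 for one characteristic (one-dimensional case).

    D^2 = (m1 - m2)^2 / s2, where s2 is the pooled variance of the two classes.
    """
    m1 = sum(site_values) / len(site_values)
    m2 = sum(non_site_values) / len(non_site_values)
    squares = sum((v - m1) ** 2 for v in site_values)
    squares += sum((v - m2) ** 2 for v in non_site_values)
    s2 = squares / (len(site_values) + len(non_site_values) - 2)
    return (m1 - m2) ** 2 / s2


if __name__ == "__main__":
    # Invented values of two characteristics on site / non-site training sets.
    features = {
        "feature 1": ([4.0, 5.0, 6.0, 5.0], [1.0, 2.0, 1.5, 1.5]),
        "feature 2": ([1.0, 2.0, 1.5, 1.5], [1.0, 2.0, 0.5, 2.5]),
    }
    ranked = sorted(features, key=lambda name: individual_d2(*features[name]), reverse=True)
    for name in ranked:
        print(name, round(individual_d2(*features[name]), 2))
```

A characteristic whose class means coincide gets D² = 0 and contributes nothing on its own.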
Linear Discriminant methods III: Example
Poly-A site
Chosen characteristics:
1. Score of the weight matrix of Poly-A
2. Score of the weight matrix of the GT element
3. Distance between Poly-A and GT
4. Nucleotide composition of downstream region (6,100)
5. Nucleotide composition of upstream region (-100,-1)
[Figure: Poly-A site region. Labels: 5’, last exon, stop codon, T-rich, Poly-A site, CAATAAA(T/C), GT-rich, score of Poly-A, score of GT, distance between Poly-A and GT]
Linear Discriminant methods IV: Example
Poly-A site, with the same five chosen characteristics as on the previous slide.

Mahalanobis distance:

| Characteristic | 1 | 4 | 2 | 5 | 3 |
|---|---|---|---|---|---|
| Individual D² | 7.61 | 3.46 | 0.01 | 2.27 | 0.44 |
| Composed D² | 7.61 | 10.78 | 11.67 | 12.36 | 12.68 |
Multiple Genes approach
2 approaches:
1. Discriminant analysis, pattern based
   – FGENES
2. HMM, probabilistic approach
   – FGENEH
Discriminant Analysis
Goal: Detect first and last Exons in a big sequence
1. Find internal exons
2. Find last exons based on 3’ sites
3. Find first exons based on 5’ sites
4. Combine results
[Figure: internal exon. Labels: intron, A, internal exon, D, intron]
[Figure: last exon. Labels: intron, A, last exon, stop, 3’ site]
[Figure: first exon. Labels: 5’ site, ATG, first exon, D, intron]
HMM method I
We want to use a Markov model to represent and recognize genes
HMM method II
Real model:
[Figure: HMM state diagram. State labels: N, P, 5’ UTR, Einit, Esngl, E0, E1, E2, I0, I1, I2, Eterm, 3’ UTR, polyA; most states appear twice in the diagram]
HMM method III
1. The model must be trained to compute:
   • State transition probabilities
   • Initial distribution
2. For a given sequence, we look for the best path using the Viterbi algorithm
3. We analyze the best path to determine if it could be a gene.
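Step 2 can be sketched on a toy model. Everything below is an assumption made for the illustration: a two-state model ("exon" and "intron") instead of the full gene model, made-up probabilities instead of trained ones, and per-state emission probabilities, which the decoder needs in addition to the two items listed in step 1.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable state path for the observations, computed in log space."""
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
              for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        score.append({})
        back.append({})
        for s in states:
            # Best predecessor state for being in state s at time t.
            best_prev = max(states,
                            key=lambda r: score[t - 1][r] + math.log(trans_p[r][s]))
            score[t][s] = (score[t - 1][best_prev]
                           + math.log(trans_p[best_prev][s])
                           + math.log(emit_p[s][observations[t]]))
            back[t][s] = best_prev
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy model (all numbers invented): exons are GC-rich, introns are AT-rich.
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

if __name__ == "__main__":
    print(viterbi("GCGCGCATATAT", STATES, START, TRANS, EMIT))
```

On this input the best path stays in "exon" for the GC-rich half and switches to "intron" for the AT-rich half; step 3 would then inspect such a path to decide whether it looks like a gene.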
Similarity methods
2 goals:
1. Find out gene functions
2. Improve algorithms
2 main methods:
1. EST based
2. BLAST with other species
Remarks
• Real challenge is gene recognition in long and complex sequences
• It’s very difficult to measure the accuracy of the methods
• The databases are full of errors
Conclusion
• Best results are obtained by combining methods
  – HMM + EST + dynamic programming
• This problem will be solved within a few years
• But huge challenges remain
  – Gene regulation
  – Alternative splicing
  – Gene expression
Questions And Remarks
Thanks for your attention