Simple cluster structure of triplet distributions in genetic texts, Andrei Zinovyev
![Page 1: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/1.jpg)
Simple cluster structure of triplet distributions in genetic texts
Andrei Zinovyev
Institut des Hautes Études Scientifiques, Bures-sur-Yvette
![Page 2: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/2.jpg)
Markov chain models
Transition probabilities = Frequencies of N-grams
…AGGTC G ATC…
…A GGTCG A TC…
…AG GTCGA T C…
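The link between N-gram frequencies and transition probabilities can be sketched as follows. This is a minimal illustration, not the author's code: the function name `transition_probabilities` and the `order` parameter are hypothetical, and the estimate is the plain ratio of an (order+1)-gram count to the count of its leading context.

```python
from collections import Counter

def transition_probabilities(seq, order=5):
    """Estimate P(next base | previous `order` bases) from (order+1)-gram counts."""
    # Count every overlapping (order+1)-gram in the sequence.
    ngram_counts = Counter(seq[i:i + order + 1] for i in range(len(seq) - order))
    # Count how often each context (the first `order` letters) is followed by any base.
    context_counts = Counter()
    for ngram, count in ngram_counts.items():
        context_counts[ngram[:order]] += count
    # Transition probability = N-gram count / context count.
    return {ngram: count / context_counts[ngram[:order]]
            for ngram, count in ngram_counts.items()}

# Toy example with a 2nd-order chain (3-grams) on a short hypothetical sequence.
probs = transition_probabilities("AGGTCGATCAGGTCGTTC", order=2)
```

With `order=5` the same function works on hexamers, at the cost of many more parameters to estimate.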
![Page 3: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/3.jpg)
AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC
Sliding window of width W
Triplet frequencies in the window: $f_{AAA}, f_{AAC}, \ldots, f_{GGG}, \ldots = f_{ijk}$, with $i, j, k \in \{A, C, G, T\}$
![Page 4: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/4.jpg)
AGGTCGATGAATCCGTATTGACAAATGAATCCGTAATGACATGACAATCCAACATGACAAT
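Protein-coding sequences
Bacterial gene read in the correct frame: $f_{ijk}$
Shifted frames: $f^{(1)}_{ijk}$, $f^{(2)}_{ijk}$
$$P^{(1)} f_{ijk} = \sum_{l,m,n} f_{lij} \, f_{kmn}$$
$$P^{(2)} f_{ijk} = \sum_{l,m,n} f_{lmi} \, f_{jkn}$$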
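The two phase-shift operators can be sketched numerically. Reading them as $P^{(1)} f_{ijk} = \sum_{l,m,n} f_{lij} f_{kmn}$ and $P^{(2)} f_{ijk} = \sum_{l,m,n} f_{lmi} f_{jkn}$, each shifted distribution factorizes into two marginals of the codon distribution, because neighbouring codons are treated as independent. The array layout (a 4×4×4 array indexed in the order A, C, G, T) and the function name are assumptions made for this illustration.

```python
import numpy as np

def shifted_distributions(f):
    """f[i, j, k] = frequency of codon ijk in the correct frame (sums to 1)."""
    last_two = f.sum(axis=0)        # sum over l of f[l, i, j]
    first_one = f.sum(axis=(1, 2))  # sum over m, n of f[k, m, n]
    last_one = f.sum(axis=(0, 1))   # sum over l, m of f[l, m, i]
    first_two = f.sum(axis=2)       # sum over n of f[j, k, n]
    # Frame shifted by one: last two letters of a codon + first letter of the next.
    f1 = last_two[:, :, None] * first_one[None, None, :]
    # Frame shifted by two: last letter of a codon + first two letters of the next.
    f2 = last_one[:, None, None] * first_two[None, :, :]
    return f1, f2

# Hypothetical codon usage drawn at random, for illustration only.
rng = np.random.default_rng(0)
f = rng.dirichlet(np.ones(64)).reshape(4, 4, 4)
f1, f2 = shifted_distributions(f)
```

Both outputs are again normalized triplet distributions, so they can be compared directly with `f`.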
![Page 5: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/5.jpg)
TCCAGCTTA TGAGGCATAACTGTTTACTGAGGCCAT ACT GTACTGTTAGGTTGTACTGTTA
AGGTCGAATACTCCGTATTGACAAATGACTCCGGTATGACATGACAATCCAACATGACAAT
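“Shadow” genes
Shadow gene:
$$\hat{f}_{ijk} = C^{R} f_{ijk} = f_{\hat{k}\hat{j}\hat{i}}, \qquad \hat{A} = T, \quad \hat{C} = G$$
$$\hat{f}^{(1)}_{ijk} = P^{(1)} \hat{f}_{ijk}, \qquad \hat{f}^{(2)}_{ijk} = P^{(2)} \hat{f}_{ijk}$$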
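The shadow-gene distribution is the codon distribution seen from the complementary strand: each triplet is reversed and each letter replaced by its complement (A↔T, C↔G). A minimal sketch, assuming a 4×4×4 array indexed in the order A, C, G, T so that complementing a letter maps index `i` to `3 - i`; the function name is hypothetical.

```python
import numpy as np

def shadow_distribution(f):
    """Return f_hat with f_hat[i, j, k] = f[comp(k), comp(j), comp(i)]."""
    # Reversing each axis applies the complement (index -> 3 - index);
    # the transpose swaps the first and last codon positions.
    return f[::-1, ::-1, ::-1].transpose(2, 1, 0)

# A distribution concentrated on the codon ATG (indices A=0, T=3, G=2).
f = np.zeros((4, 4, 4))
f[0, 3, 2] = 1.0
f_hat = shadow_distribution(f)  # all mass moves to CAT, the reverse complement of ATG
```

Applying the function twice returns the original distribution, as expected for a reverse complement.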
![Page 6: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/6.jpg)
When can we detect genes (by their content)?
1. When non-coding regions are very different in base composition (e.g., different GC-content)
2. When distances between the phases are large:
[Figure labels: $P^{(1)} f_{ijk}$, $P^{(2)} f_{ijk}$, $f_{ijk}$ (non-coding)]
$$M = \sum_{ijk} f_{ijk} \log_2 \frac{f_{ijk}}{p_i \, p_j \, p_k}$$
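The quantity $M = \sum_{ijk} f_{ijk} \log_2 \big( f_{ijk} / (p_i p_j p_k) \big)$ can be evaluated as in the sketch below. The slide does not define $p_i$; here it is assumed to be the overall frequency of nucleotide $i$, averaged over the three triplet positions. The function name and array layout (4×4×4, order A, C, G, T) are likewise illustrative choices.

```python
import numpy as np

def triplet_information(f):
    """M = sum_ijk f_ijk * log2(f_ijk / (p_i * p_j * p_k)) for a (4, 4, 4) array."""
    # Assumption: p is the base composition averaged over the three codon positions.
    p = (f.sum(axis=(1, 2)) + f.sum(axis=(0, 2)) + f.sum(axis=(0, 1))) / 3.0
    expected = p[:, None, None] * p[None, :, None] * p[None, None, :]
    mask = f > 0  # triplets with zero frequency contribute nothing to the sum
    return float(np.sum(f[mask] * np.log2(f[mask] / expected[mask])))

# Two equally frequent codons, AAA and CCC: strongly non-random triplet usage.
f = np.zeros((4, 4, 4))
f[0, 0, 0] = 0.5
f[1, 1, 1] = 0.5
M = triplet_information(f)

# A uniform distribution carries no information beyond base composition.
M_uniform = triplet_information(np.full((4, 4, 4), 1 / 64))
```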
![Page 7: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/7.jpg)
Simple experiment
1. Only the forward strands of genomes are used for triplet counting
2. Every p positions along the sequence, open a window (x − W/2, x + W/2) of size W centered at position x
3. Every window, starting from its first base pair, is divided into W/3 non-overlapping triplets, and the frequencies of all triplets $f_{ijk}$ are calculated
4. The dataset consists of N = [L/p] points, where L is the entire length of the sequence
5. Every data point $X_i = \{x_{is}\}$ corresponds to one window and has 64 coordinates $s = 1, \ldots, 64$, the frequencies of all possible triplets
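The five steps above can be sketched as a short function. This is an illustration under stated assumptions, not the original implementation: windows are indexed by their start rather than their center, windows that would run past the end of the sequence are dropped (so the number of rows is slightly below [L/p]), and the default values of `W` and `p` are arbitrary.

```python
import itertools
import numpy as np

TRIPLETS = ["".join(t) for t in itertools.product("ACGT", repeat=3)]
INDEX = {t: s for s, t in enumerate(TRIPLETS)}

def window_triplet_frequencies(seq, W=300, p=12):
    """Return an (N, 64) array: one row of triplet frequencies per window."""
    rows = []
    for start in range(0, len(seq) - W + 1, p):      # open a window every p positions
        counts = np.zeros(64)
        for t in range(start, start + W - 2, 3):     # non-overlapping triplets
            s = INDEX.get(seq[t:t + 3])
            if s is not None:                        # skip triplets with N or other symbols
                counts[s] += 1
        total = counts.sum()
        rows.append(counts / total if total else counts)
    return np.array(rows)

# Toy sequence made of one repeated codon, so every in-frame window sees only ATG.
X = window_triplet_frequencies("ATG" * 100, W=30, p=3)
```

Each row of `X` is one data point $X_i$ in the 64-dimensional space of triplet frequencies.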
![Page 8: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/8.jpg)
Principal Component Analysis
[Figure: data cloud with the 1st principal axis along the direction of maximal dispersion and the 2nd principal axis]
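Projecting the 64-dimensional data points onto the first two principal axes can be sketched with a singular value decomposition. The slides use the ViDaExpert tool for this step; the NumPy version below is a stand-in, and the random matrix only illustrates the call (it is not genomic data).

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the first principal axes (directions of maximal dispersion)."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ vt[:n_components].T, explained[:n_components]

# Synthetic data with decreasing spread along the 64 coordinates.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64)) * np.linspace(3.0, 0.1, 64)
coords, explained = pca_project(X)
```

`coords` holds the 2D coordinates of each window, and `explained` gives the fraction of total variance captured by each axis.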
![Page 9: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/9.jpg)
ViDaExpert tool
![Page 10: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/10.jpg)
Caulobacter crescentus (GenBank NC_002696)
[Figure labels: $f_{ijk}$, $f_{ijk}$, $P^{(1)} f_{ijk}$, $P^{(2)} f_{ijk}$]
![Page 11: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/11.jpg)
“Path” of sliding window
![Page 12: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/12.jpg)
Helicobacter pylori (GenBank NC_000921)
![Page 13: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/13.jpg)
Saccharomyces cerevisiae chromosome IV
![Page 14: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/14.jpg)
Model sequences (random codon usage)
![Page 15: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/15.jpg)
Model sequences (random codon usage, with 50% of frequencies set to 0)
![Page 16: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/16.jpg)
Graph of coding phase
![Page 17: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/17.jpg)
Assessment
| Sequence | L | W | % of coding bases | Sn1 | Sp1 | Sn2 | Sp2 |
|---|---|---|---|---|---|---|---|
| Helicobacter pylori, complete genome (NC_000921) | 1643831 | 300 | 90 | 0.93 | 0.97 | 0.93 | 0.98 |
| Caulobacter crescentus, complete genome (NC_002696) | 4016947 | 300 | 91 | 0.93 | 0.97 | 0.94 | 0.98 |
| Prototheca wickerhamii mitochondrion (NC_001613) | 55328 | 120 | 49 | 0.82 | 0.93 | 0.84 | 0.95 |
| Saccharomyces cerevisiae chromosome III (NC_001135) | 316613 | 399 | 69 | 0.90 | 0.88 | 0.90 | 0.90 |
| Saccharomyces cerevisiae chromosome IV (NC_001136) | 1531929 | 399 | 73 | 0.89 | 0.91 | 0.92 | 0.92 |
| Model text RANDOM | 100000 | 500 | 49 | 0.90 | 0.61 | 0.82 | 0.77 |
| Model text RANDOM_BIAS | 100000 | 500 | 45 | 0.99 | 0.83 | 0.94 | 0.90 |

$$Sn = \frac{TP}{TP + FN}, \qquad Sp = \frac{TP}{TP + FP}$$

Completely blind prediction
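The two quality measures, $Sn = TP/(TP+FN)$ and $Sp = TP/(TP+FP)$, can be computed per base as in this sketch. The boolean-list representation (True = coding base) and the toy labels are assumptions for illustration; they are not data from the table.

```python
def sensitivity_specificity(predicted, actual):
    """predicted, actual: equal-length sequences of booleans (True = coding base)."""
    tp = sum(p and a for p, a in zip(predicted, actual))        # coding, predicted coding
    fn = sum((not p) and a for p, a in zip(predicted, actual))  # coding, missed
    fp = sum(p and (not a) for p, a in zip(predicted, actual))  # non-coding, predicted coding
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp

# Hypothetical per-base annotation and prediction.
actual    = [True, True, True, False, False, True]
predicted = [True, True, False, True, False, True]
sn, sp = sensitivity_specificity(predicted, actual)
```

Note that Sp as defined on the slide is the fraction of predicted coding bases that are truly coding, which is the usual convention in gene-finding assessments.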
![Page 18: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/18.jpg)
Dependence on window size
[Figure: Sn and Sp (vertical axis, 0.75 to 1) as functions of window size (horizontal axis, 0 to 1300)]
![Page 19: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/19.jpg)
Dependence on window size
[Figure: four panels, for W = 51, W = 252, W = 900, and W = 2000]
![Page 20: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/20.jpg)
State of the art: GLIMMER strategy
1. Use Markov models (MM) of 5th order (hexamers)
2. Use interpolation for transition probabilities
3. Use long ORFs (>500 bp) as the learning dataset
Problems:
1. The number of hexamers to be evaluated is still big
2. Applicable only to collected genomes of good quality (<1 frameshift per 1000 bp)
![Page 21: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/21.jpg)
What can we learn from this game?
• Learning can be replaced with self-learning
• Bacterial gene-finders work relatively well when the concentration of coding sequences is high
• Correlations in the order of codons are small
• Codon usage is approximately the same along the genome
• The method presented allows self-learning on pieces of even uncollected DNA (>150 bp)
• The method gives an alternative to the HMM view on the problem of gene recognition
![Page 22: Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev](https://reader036.vdocument.in/reader036/viewer/2022062321/56813fc3550346895daaa120/html5/thumbnails/22.jpg)
Acknowledgements
Professor Alexander Gorban
Professor Misha Gromov
My coordinates: http://www.ihes.fr/~zinovyev