Gene Predictor
Date: 20/11/2003
Implemented by: Zohar Idelson
Supervisor: Dr. Yizhar Lavner
Winter - Summer 2003

Genomic Signal Processing
• Genomic Signal Processing is a relatively new field in Bioinformatics, in which signal processing algorithms and methods are used to study functional structures in the DNA.
• An appropriate mapping of the DNA sequence into one or more numerical sequences enables the use of many digital signal processing tools.
[Diagram: DNA segment (atgcggatttgccgtcgatgtc…) → Gene Predictor → genes]

DNA Basics
• DNA in eukaryotes is organized in chromosomes.
• The DNA in each chromosome can be read as a discrete signal over the alphabet {a,t,c,g} (for example: atgatcccaaatggaca…).
• In exons (protein-coding regions), during protein synthesis, these letters are read as triplets (codons). Every codon signals which amino acid to build (there are 20 amino acids).
• There are 6 ways of translating the DNA signal into a codon signal, called the reading frames (3 offsets * 2 directions).
• Every gene starts with a start codon and ends with a stop codon. An exon cannot contain more than one stop codon.
• Non-coding areas behave much more randomly than genes. Most of the DNA is non-coding.
• Genes can be detected by statistical regularities such as codon usage, nucleotide usage, periodicity, and database comparison.

Organisms
• Classified into two types:
– Eukaryotes: contain a membrane-bound nucleus and organelles (plants, animals, fungi, …)
– Prokaryotes: lack a true membrane-bound nucleus and organelles (single-celled, includes bacteria)
• Not all single-celled organisms are prokaryotes!

Cells
• Complex system enclosed in a membrane
• Organisms are unicellular (bacteria, baker’s yeast) or multicellular
• Humans:
– 60 trillion cells
– 320 cell types
[Figure: example animal cell. Source: www.ebi.ac.uk/microarray/biology_intro.htm]

DNA Basics – cont.
• DNA in eukaryotes is organized in chromosomes.

Chromosomes
• In eukaryotes, the nucleus contains one or several double-stranded DNA molecules organized as chromosomes
• Humans:
– 22 pairs of autosomes
– 1 pair of sex chromosomes
[Figure: human karyotype. Source: http://avery.rutgers.edu/WSSP/StudentScholars/Session8/Session8.html]
[Figure. Source: www.biotec.or.th/Genome/whatGenome.html]

What is DNA?
• DNA: Deoxyribonucleic Acid
• Each strand is a chain of nucleotides (oligomer, polynucleotide)
• 4 different nucleotides:
– Adenine (A)
– Cytosine (C)
– Guanine (G)
– Thymine (T)

Nucleotide Bases
• Purines (A and G)
• Pyrimidines (C and T)
• The difference is in the base structure
[Image source: www.ebi.ac.uk/microarray/biology_intro.htm]

DNA

Genome
• The chromosomal DNA of an organism
• The number of chromosomes and the genome size vary quite significantly from one organism to another
• Genome size and number of genes do not necessarily determine organism complexity

Genome Comparison
ORGANISM                              CHROMOSOMES   GENOME SIZE (bp)   GENES
Homo sapiens (Humans)                 23            3,200,000,000      ~30,000
Mus musculus (Mouse)                  20            2,600,000,000      ~30,000
Drosophila melanogaster (Fruit Fly)   4             180,000,000        ~18,000
Saccharomyces cerevisiae (Yeast)      16            14,000,000         ~6,000
Zea mays (Corn)                       10            2,400,000,000      ???

DNA Basics – cont.
• The DNA in each chromosome can be read as a discrete signal over the alphabet {a,t,c,g} (for example: atgatcccaaatggaca…).

DNA Basics – cont.
• In genes (protein-coding regions), during the construction of proteins from amino acids, the nucleotides (letters) are read as triplets (codons). Every codon signals one amino acid for the protein synthesis (there are 20 amino acids).

DNA Basics – cont.
• There are 6 ways of translating the DNA signal into a codon signal, called the reading frames (3 offsets * 2 directions).
…CATTGCCAGT…
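
The six reading frames can be sketched in a few lines of Python. This is a minimal illustration rather than the project's code: the function names and the frame labels (+1..+3, -1..-3) are assumptions, and the reverse frames are taken from the reverse complement, which is the usual convention.

```python
# Complement table for the four nucleotides.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, read in its own direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def codons(dna, offset):
    """Split dna into triplets starting at offset (0, 1 or 2); drop leftovers."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]

def six_reading_frames(dna):
    """Map a frame label (+1..+3 forward, -1..-3 reverse) to its codon list."""
    forward = dna.upper()
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(forward, offset)
        frames[f"-{offset + 1}"] = codons(reverse, offset)
    return frames

# The fragment shown on the slide.
frames = six_reading_frames("CATTGCCAGT")
for label in sorted(frames):
    print(label, frames[label])
```

Each frame sees a different codon sequence for the same nucleotides, which is why a predictor has to report the reading frame together with the exon location.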

DNA Basics – cont.
…CATTGCCAGT…
Start: ATG
Stop: TAA, TGA, TAG
[Diagram: a gene composed of alternating exons and introns]

The Problem
• Given unannotated DNA, find the genes.
• In practice, find the exons and their reading frame (RF).
• Smaller-scale problem: given some annotated DNA of an organism, find the exons in unannotated DNA of the same organism.
[Diagram: atgcggatttgccgtcgatgtc… → Gene Predictor → exons]

Solution Scheme
• Solution scheme:
– Work in analysis windows.
– Find parameters that give a good prediction on annotated DNA (of the same organism). Learn how to distinguish exon regions from non-exon regions.
– Extract those parameters from the unannotated DNA, and use the discrimination rule to predict.
• Almost all methods shown here fit this scheme.

Organisms in the Project
C. elegans, S. cerevisiae (yeast)

Existing Methods
• Many methods rely on the pseudo-periodicity of 3 in genes. For that we define:
– u_b is the binary indicator series for base b.
– U_b is the STFT of u_b.
• N, the window size, is in the hundreds. Exon sizes are on the order of 10^1…10^3 (in S. cerevisiae).
• Overlapping windows.
• There is a connection between the DFT at frequency k = N/3 and nucleotide usage.

Calculating the DFT of a DNA sequence*
S(n):   ATCGTACAGCTGCAAAGCATAGATTCGGTCACAGTTG…
u_A(n): 1000010100000111001010100000001010000
u_T(n): 01001000001…
u_C(n): 0010001001…
u_G(n): 000100001001…
[Diagram: each indicator sequence → N-point DFT → (1/N)·U_b(N/3), for b = A, T, C, G]
$$X(k) = \mathrm{DFT}_N\{x(n)\} = \sum_{n=0}^{N-1} x(n)\, e^{-i\frac{2\pi}{N}nk}, \qquad 0 \le k \le N-1$$
$$U_b\!\left(\tfrac{N}{3}\right) = \sum_{n=0}^{N-1} u_b(n)\, e^{-i\frac{2\pi}{3}n}, \qquad b = A, T, C, G$$
*Silverman and Linsker 1986; Voss 1992
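
As a rough illustration of the mapping above, the following NumPy sketch builds an indicator sequence and evaluates its DFT coefficient at k = N/3. The function names are illustrative, and the example window is the slide's sequence trimmed to 36 nucleotides so that N is a multiple of 3.

```python
import numpy as np

def indicator(dna, base):
    """Binary indicator sequence u_b(n): 1 where dna[n] == base, else 0."""
    return np.array([1.0 if ch == base else 0.0 for ch in dna.upper()])

def dft_at_one_third(u):
    """U_b(N/3) = sum_n u_b(n) * exp(-i*2*pi*n/3) for a window u of length N."""
    n = np.arange(len(u))
    return np.sum(u * np.exp(-2j * np.pi * n / 3))

s = "ATCGTACAGCTGCAAAGCATAGATTCGGTCACAGTTG"[:36]   # trim to a multiple of 3
u_a = indicator(s, "A")
U_a = dft_at_one_third(u_a)

# The same coefficient is bin k = N/3 of the full FFT.
assert np.isclose(U_a, np.fft.fft(u_a)[len(s) // 3])
print(abs(U_a) / len(s))
```

Only one frequency bin is needed per window, so the direct sum is cheaper than a full FFT when the window slides along the chromosome.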

Spectrogram
• A way of showing the amplitudes of U_A, U_C, U_G and U_T together.
• Linear transform to RGB.
• Magnitude is represented by brightness.
• Finding exons visually: bright horizontal lines, usually at k = N/3.
[Figure: spectrogram. Horizontal axis: position (nucleotides); vertical axis: frequency, with N/3 marked]

Spectrogram – cont.
[Figure: DNA of C. elegans chr. III versus totally random DNA]

Power Spectrum
$$A = \frac{1}{N} U_A(k),\quad C = \frac{1}{N} U_C(k),\ \ldots \qquad k \in \{0, 1, \ldots, N/2\}$$
$$S = |A|^2 + |C|^2 + |G|^2 + |T|^2$$
• The difference between gene and non-gene areas is about one order of magnitude.
• Used at k = N/3.
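
A sliding-window version of this measure might look like the sketch below. The window length (351) and step (3) are illustrative choices, not the project's settings, and the toy sequences only show the contrast between a strictly period-3 signal and random DNA.

```python
import numpy as np

def power_at_one_third(dna, window=351, step=3):
    """S = |A|^2 + |C|^2 + |G|^2 + |T|^2 at k = N/3 for each sliding window."""
    dna = dna.upper()
    n = np.arange(window)
    kernel = np.exp(-2j * np.pi * n / 3) / window   # (1/N) * exp(-i*2*pi*n/3)
    out = []
    for start in range(0, len(dna) - window + 1, step):
        chunk = dna[start:start + window]
        s = 0.0
        for base in "ACGT":
            u = np.array([ch == base for ch in chunk], dtype=float)
            s += abs(np.dot(u, kernel)) ** 2
        out.append(s)
    return np.array(out)

# Toy comparison: a strictly period-3 sequence versus random nucleotides.
periodic = "ATG" * 200
rng = np.random.default_rng(0)
random_dna = "".join(rng.choice(list("ACGT"), size=600))
print(power_at_one_third(periodic)[0], power_at_one_third(random_dna).mean())
```

Real coding regions are far less regular than the periodic toy sequence, so the gap in practice is closer to the one order of magnitude quoted above.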

IIR Anti-Notch Filtering
• An IIR anti-notch filter is designed to find “peaks” at a chosen frequency.
All-pass:
$$A(z) = \frac{R^2 - 2R\cos(\theta)\, z^{-1} + z^{-2}}{1 - 2R\cos(\theta)\, z^{-1} + R^2 z^{-2}}$$
Anti-notch:
$$H(z) = \frac{1 - A(z)}{2}$$
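
The filter can be sketched with NumPy alone, assuming the all-pass form A(z) = (R² − 2R cos θ z⁻¹ + z⁻²) / (1 − 2R cos θ z⁻¹ + R² z⁻²) and H(z) = (1 − A(z))/2. With this form the peak sits where A = −1, that is at the frequency ω0 with cos ω0 = 2R cos θ / (1 + R²), so θ is solved from the target ω0 = 2π/3. The pole radius R = 0.992 and all function names are illustrative assumptions.

```python
import numpy as np

def antinotch_coefficients(R, theta):
    """Return (b, a) of H(z) = (1 - A(z)) / 2 for the second-order all-pass A(z)."""
    den = np.array([1.0, -2.0 * R * np.cos(theta), R * R])  # 1 - 2R cos(theta) z^-1 + R^2 z^-2
    num_allpass = den[::-1]                                  # R^2 - 2R cos(theta) z^-1 + z^-2
    b = 0.5 * (den - num_allpass)                            # numerator of (1 - A) / 2
    return b, den

def iir_filter(b, a, x):
    """Direct-form difference equation: a[0]*y[n] = sum b[k]*x[n-k] - sum a[k]*y[n-k]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y[n] = acc / a[0]
    return y

def frequency_response(b, a, w):
    """Evaluate H(e^{jw}) from the coefficient vectors."""
    z = np.exp(-1j * w * np.arange(len(b)))
    return np.dot(b, z) / np.dot(a, z)

R = 0.992                          # pole radius (illustrative)
w0 = 2 * np.pi / 3                 # target peak: period-3 component
theta = np.arccos((1 + R * R) * np.cos(w0) / (2 * R))
b, a = antinotch_coefficients(R, theta)
print(abs(frequency_response(b, a, w0)), abs(frequency_response(b, a, 1.0)))
```

Feeding an indicator sequence through `iir_filter` gives an output whose energy rises where the period-3 component is strong, without computing a DFT per window.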

Optimized Spectral Content Measure (OSCM)
$$\arg\max_{a,t,g}\ \frac{E\{|aA + tT + gG|\} - E\{|aA_r + tT_r + gG_r|\}}{\mathrm{std}(|aA + tT + gG|) + \mathrm{std}(|aA_r + tT_r + gG_r|)}$$
• Find good coefficients (a, t, g) for high differentiation between exons and introns.
• C is ignored because of its linear dependency on the rest.
• A_r, T_r, G_r are generated from a random DNA sequence, or from introns.
• Performance:
$$|W|^2 = |aA + cC + gG + tT|^2$$

OSCM Example
[Figure: OSCM output with annotations: good forward detection, good reverse detection, direction mistake]
OSCM JustificationOSCM Justification
• In genes, the 4 In genes, the 4 complex variables complex variables A,T,C,G are not all-A,T,C,G are not all-random and tend to random and tend to be near a specific be near a specific angle (phase).angle (phase).
• In introns, the values In introns, the values of phase seems to be of phase seems to be pure random.pure random.
• Those unique angles Those unique angles enable us to detect enable us to detect reading frame as well.reading frame as well.

Distribution of the phase of the DFT at the frequency 1/3 in the genes of S. cerevisiae:
[Figure: argument distributions for all experimental genes in all chromosomes in S. cerevisiae. Panels: distribution of arg(A), arg(T), arg(C), arg(G)]
angular mean = 0.3556, angular deviation = 0.4016
angular mean = -2.6862, angular deviation = 0.8416
angular mean = -1.3734, angular deviation = 0.7903
angular mean = 2.7962, angular deviation = 0.5723

Distribution of the phase of the DFT at the frequency 1/3 in the introns of S. cerevisiae:
[Figure: argument distributions for non-coding regions in all chromosomes in S. cerevisiae. Panels: distribution of arg(A), arg(T), arg(C), arg(G)]

Fourier Spectra and Position Asymmetry
$$U_b\!\left(\tfrac{N}{3}\right) = f(b,1) + f(b,2)\, e^{-i\frac{2\pi}{3}} + f(b,3)\, e^{i\frac{2\pi}{3}}$$
f(b,i) is the frequency (number of occurrences in the window) of the base b at codon position i, i = 1, 2, 3.
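
The relation between the N/3 coefficient and the codon-position counts can be checked numerically. The sketch below assumes the DFT sign convention exp(−i·2π·n·k/N), takes f(b, i) as a raw count, and uses a made-up 18-nucleotide sequence whose first nucleotide is codon position 1.

```python
import numpy as np

def codon_position_counts(dna, base):
    """f(b, i): occurrences of base at codon positions i = 1, 2, 3 (frame starts at index 0)."""
    return [sum(1 for ch in dna[i::3] if ch == base) for i in range(3)]

def dft_one_third(dna, base):
    """U_b(N/3) computed directly from the indicator sequence."""
    u = np.array([ch == base for ch in dna], dtype=float)
    return np.sum(u * np.exp(-2j * np.pi * np.arange(len(dna)) / 3))

dna = "ATGGCTAAAGATGGTTAA"                  # 6 codons, toy example
f1, f2, f3 = codon_position_counts(dna, "G")
w = np.exp(-2j * np.pi / 3)
from_counts = f1 + f2 * w + f3 * w ** 2     # w**2 equals exp(+i*2*pi/3)
assert np.isclose(dft_one_third(dna, "G"), from_counts)
print((f1, f2, f3), from_counts)
```

If a base is used equally often at all three codon positions the three terms cancel, so a large magnitude at N/3 is a direct sign of position asymmetry.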

Genes versus Introns
            Introns and intergenic spacers   Coding regions (genes and exons)
Magnitude   small                            LARGE
Phase       randomly distributed             narrow distribution
[Figure: polar plots of the distribution of the DFT of T and of the DFT of G at 1/3 frequency (data taken from S. cerevisiae, chr. IV)]

Finding Reading Frame (OSCM Phase)
$$\Theta = \arg(aA + cC + gG + tT)$$
• Θ is concentrated around Θ_1, Θ_2 and Θ_3, corresponding to each reading frame.
• Lowering the variance of Θ with the optimization:
$$\arg\max_{a,t,g}\ \left| E\left\{ \frac{aA + gG + tT}{|aA + gG + tT|} \right\} \right|$$
• Transforming Θ to color.
• Deriving the reading frame by a simple look.
Reading Frame   Color
1               Red
2               Green
3               Blue

New Methods in This Project
• Linear prediction
• Classification by clustering (CC)
• Classification by compression ratios

Linear Prediction
• Create a walk from the indicator sequences:
$$x[n] = \sum_{k=1}^{n} \left( a\,u_A[k] + c\,u_C[k] + g\,u_G[k] + t\,u_T[k] \right)$$
• For each window, find the LP coefficients. Look for differences in correlation via:
– Pole map
– Frequency response
– Prediction error
• This method produced no new findings.
Classification by ClusteringClassification by Clustering
• Recall: DFT in k=N/3 frequency has a Recall: DFT in k=N/3 frequency has a strong correlation with genes locations strong correlation with genes locations and reading frames (as shown in and reading frames (as shown in part Apart A))
• Here we’ll attempt to use it in order to Here we’ll attempt to use it in order to discriminate exons from the rest, in a discriminate exons from the rest, in a 6D space6D space
• Learning phase: clusteringLearning phase: clustering• Classification phase: fuzzy KNNClassification phase: fuzzy KNN

Classification by Clustering, Clustering Stage: Example
[Figure: clusters, from left to right: C, G and T. S. cerevisiae, 5th chromosome]
Classification by ClusteringClassification by Clustering
RF = 1
+120°
-120°Max
סף
Exon?
Reading frame (if it’s an exon)
)T,C,G (new sample
RF = 1
RF = 1
RF =? 1
RF =? 3
RF =? 2
DNA = … atcgtgactagc…
DFT(k=N/3)
Indicator
DFT(k=N/3)
Indicator
DFT(k=N/3)
Indicator
T
C G
Start here
uT uC uG

Classification Rule
• Fuzzy KNN: create a fuzzy membership function and choose the class with the highest score. Add a fuzzy clustering iteration to the LBG algorithm.
• Two methods for classifying gene/non-gene:
– Sum the gene scores and the non-gene scores; the larger sum wins.
– The centroid with the maximum score wins.
• The 2nd method is used (better performance). Score sums are used for the reading frame: the reading frame with the maximum sum wins.
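
A fuzzy KNN membership function can be sketched as follows. This is the common inverse-distance formulation with fuzzifier m, applied to crisp-labelled training points; the project classifies against LBG centroids in a 6D space, so the 2D points, labels and function name here are purely illustrative.

```python
import numpy as np

def fuzzy_knn(train_x, train_y, x, k=7, m=2.0, eps=1e-12):
    """Return {label: membership} of x from its k nearest training points.

    Each neighbour votes with weight 1 / distance**(2 / (m - 1)).
    """
    train_x = np.asarray(train_x, dtype=float)
    dist = np.linalg.norm(train_x - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dist)[:k]
    weight = 1.0 / (dist[nearest] ** (2.0 / (m - 1.0)) + eps)
    scores = {}
    for idx, w in zip(nearest, weight):
        scores[train_y[idx]] = scores.get(train_y[idx], 0.0) + w
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

# Toy 2D data: three non-exon points near the origin, four exon points near (5.5, 5.5).
points = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [6, 6]]
labels = ["non-exon"] * 3 + ["exon"] * 4
membership = fuzzy_knn(points, labels, [5.2, 5.4], k=7, m=2.0)
print(max(membership, key=membership.get), membership)
```

The class with the highest membership wins, and the memberships themselves can be summed per reading frame as described above.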

Results
• Organism: S. cerevisiae.
• Learning was done on the 5th chromosome.
• Parameters:
– K=7 and m=2 for fuzzy KNN.
– True exon: 50% exon.
– Thresh = 1.
• Total: only 4.6% of true exons weren’t detected at all.

chr. #   f_p      f_n      rf_true   f_n_exons   # exons   # missed
1        0.1037   0.4524   0.9574    0.0882      102       9
2        0.0821   0.4735   0.9685    0.0472      381       18
3        0.0917   0.4618   0.9551    0.071       155       11
4        0.0821   0.4615   0.9654    0.029       725       21
6        0.1102   0.4247   0.9762    0.05        120       6
7        0.0821   0.4749   0.9647    0.0258      504       13
8        0.103    0.4716   0.9671    0.0456      263       12
9        0.1091   0.452    0.9476    0.04        200       8
10       0.1005   0.4719   0.9723    0.0293      341       10
11       0.0822   0.4816   0.9641    0.0703      327       23
12       0.0973   0.4759   0.9722    0.0514      486       25
13       0.0885   0.4682   0.9607    0.0365      438       16
14       0.1041   0.4597   0.9616    0.0397      378       15
15       0.0904   0.4644   0.9665    0.0311      514       16
16       0.0824   0.4744   0.9662    0.0452      442       20
Total                                            5376      223

CC - Example

CC - Improving
• Instead of deciding for each reading frame separately and then deciding which reading frame “won”, we can replicate the centroids for the other reading frames, so that the classification rule determines [exon / non-exon] + [reading frame] at the same time. This is supposed to give a fairer competition between the reading frames.

Classification by Compression Rates
Nucleotides (‘A’,’C’,’T’,’G’): A T C G A T C G T A C G C A T G C A T G C A T G C A T G A A A A
Codons (0..63): 60…11829
• In forward coding, 3 different codon sequences are created.
• To classify reverse coding, first complement all the DNA, then treat it like forward coding (the results will also be reversed).
• At the end of this stage, we have 6 codon series.

The Idea
• If we have a dictionary with the popular words ( = codon sequences) in exons which aren’t popular in non-exons, then:
– Good compression will be achieved in exons
– Good compression will not be achieved in introns
• So we need a good dictionary and a good compression algorithm

Building the Dictionary
• Aim: the output dictionary is expected to hold short popular words in exons.
• Using the LZW algorithm.
• Input: all exons of the learnt chromosome.
• Initial dictionary: all codons.
• Add a restriction on the length of the words entered into the dictionary.
• Output I: a dictionary with the words that appeared in exons.
• Output II: the code of the exons according to the dictionary.

LZW: Encoding
1) accum ← first input letter
2) If dict.Find(accum) == false:
   1) dict.Add(accum)
   2) code.Add(index)
   3) accum ← accum(end)
   4) Return to (2)
3) Else:
   1) index = dict.FindWhere(accum)
   2) accum.Add(next letter from input)
   3) Return to (2)
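
The encoding loop above can be sketched as runnable Python. Words are stored as tuples so that the symbols can be codon indices as well as letters; `max_word_len` is a hypothetical parameter standing in for the word-length restriction mentioned in the dictionary-building step.

```python
def lzw_encode(symbols, alphabet, max_word_len=None):
    """LZW-encode a sequence of symbols (e.g. codon indices 0..63).

    Returns (code, dictionary); dictionary maps word tuples to indices.
    max_word_len (hypothetical parameter) caps the length of added words.
    """
    dictionary = {(s,): i for i, s in enumerate(alphabet)}
    code = []
    accum = ()
    for symbol in symbols:
        candidate = accum + (symbol,)
        if candidate in dictionary:
            accum = candidate                        # keep extending the match
            continue
        code.append(dictionary[accum])               # emit the longest known word
        if max_word_len is None or len(candidate) <= max_word_len:
            dictionary[candidate] = len(dictionary)  # learn the new word
        accum = (symbol,)                            # restart from the last symbol
    if accum:
        code.append(dictionary[accum])               # flush the final match
    return code, dictionary

code, dictionary = lzw_encode("ABABABA", "AB")
print(code, len(dictionary))
```

Because every learned word extends an existing word by one symbol, the resulting dictionary is naturally a trie, and the emitted code gives the number of appearances of each word.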

Dictionary Pruning
• The output LZW dictionary is a tree (trie).
• Aim: keep the most popular words, but don’t allow undesired redundancy.
• Method:
– Go over every level of the tree (starting with the maximum-length words) and take a predefined number of popular words.
– Pass the number of appearances (from the output code) to the parents: pass the sum of all, OR pass the sum of the untaken. More variations: multiply by the entropy.

Using Entropy for Better Pruning
Parent word [31 45 1], with the counts of its children passed upward (log is base 2):
• Children [31 45 1 60], [31 45 1 30], [31 45 1 13], [31 45 1 31] with counts 6, 6, 6, 6:
24*log(4) = 48
• Single child [31 45 1 30] with count 40:
40*log(1) = 0
• Children [31 45 1 60], [31 45 1 30], [31 45 1 13], [31 45 1 31] with counts 1, 20, 1, 2:
24*(-1)*[5/6*log(5/6) + 2*1/24*log(1/24) + 1/12*log(1/12)] = 24*0.9000 = 21.60
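
The entropy weighting can be sketched as a small function: the count passed to the parent is the total count of its children multiplied by the base-2 entropy of how that count is spread among them. The function name is illustrative.

```python
import math

def entropy_weighted_count(child_counts):
    """Total count of the children times the base-2 entropy of their distribution."""
    total = sum(child_counts)
    probs = [c / total for c in child_counts if c > 0]
    entropy = sum(p * math.log2(1.0 / p) for p in probs)
    return total * entropy

print(entropy_weighted_count([6, 6, 6, 6]))   # four equally popular children
print(entropy_weighted_count([40]))           # a single child
```

A parent whose count comes from a single child scores zero, since the child already represents it, while a parent with many equally popular continuations keeps a high score and is worth holding in the dictionary.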

Compression Rates Classification
1. Input: DNA of a chromosome and a gene-based dictionary
2. 6 codon sequences for the 6 different reading frames
3. Compressing with the gene-based dictionary
4. 6 compression-rate vectors
5. rf_wins = argmax{compress_rate(rf), thresh}; lowerthresh = argmax{compress_rate(rf), lower-thresh}; too_much_stops = 1 if the window has more than 1 stop codon
6. 6 binary vectors + post-processing data
7. Post processing
8. 6 binary vectors: the final classification

Post Processing
• Lower-threshold technique: tag as true every window that lies between close, already-tagged windows, if its value is larger than the lower threshold.
• Stop-codon count in the window: more than one => not an exon window (since the exon is larger than the analysis window size).
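
Both post-processing rules can be sketched in plain Python. The gap limit `max_gap` (what counts as “close”), the lower threshold value and the toy numbers are hypothetical; only the stop codons TAA, TGA and TAG come from the slides.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}

def stop_codon_mask(codons, window):
    """1 for each window of `window` codons holding at most one stop codon, else 0."""
    is_stop = [c in STOP_CODONS for c in codons]
    return [1 if sum(is_stop[i:i + window]) <= 1 else 0
            for i in range(len(codons) - window + 1)]

def fill_with_lower_threshold(values, thresh, lower_thresh, max_gap):
    """Tag windows above thresh, then tag windows in short gaps that exceed lower_thresh."""
    tagged = [v > thresh for v in values]
    hits = [i for i, t in enumerate(tagged) if t]
    for left, right in zip(hits, hits[1:]):
        if right - left - 1 <= max_gap:              # the two tagged windows are close
            for i in range(left + 1, right):
                if values[i] > lower_thresh:
                    tagged[i] = True
    return [int(t) for t in tagged]

rates = [0.5, 0.3, 0.35, 0.6, 0.1, 0.1, 0.1, 0.1, 0.7]   # toy compression rates
tags = fill_with_lower_threshold(rates, thresh=0.457, lower_thresh=0.25, max_gap=2)
mask = stop_codon_mask(["ATG", "TAA", "GCT", "TGA", "AAA"], window=3)
print(tags, mask)
```

The final classification of a reading frame would keep a window only where both its tag and its stop-codon mask are 1.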

Compression Rates: Example

Stop Codons Usage
• 100,000 b of the 2nd chromosome
• The value is 1 where the window contains at most one stop codon

Post Processing: Stop-codon Usage
• Stop codon usage cleans up many potential false positives without damaging any success measure
• Hence, a lower principal threshold can be chosen, which gives better performance
[Figure: result without stop codon usage]

Compression Rates: Results
• Learnt chromosome = 1st, window size = 100c, dictionary size = 1381 (32 codons, branching = 3)
• After choosing the best configuration, going over all the chromosomes:

#       f_p        f_n       rf_true   f_n_exons   # exons   # miss   THRESH
2       0.10442    0.13809   0.93866   0.046875    384       18       0.457
3       0.10015    0.16098   0.92234   0.03871     155       6        0.457
4       0.084271   0.14014   0.93809   0.036986    730       27       0.457
5       0.090556   0.13763   0.92723   0.034483    261       9        0.457
6       0.13909    0.14274   0.92495   0.041667    120       5        0.457
7       0.12053    0.14733   0.93927   0.027723    505       14       0.457
8       0.15057    0.14538   0.93362   0.059925    267       16       0.457
9       0.13161    0.13816   0.92458   0.045       200       9        0.457
10      0.12222    0.12411   0.93447   0.03207     343       11       0.457
11      0.07833    0.14575   0.93712   0.069069    333       23       0.457
12      0.14106    0.13654   0.9405    0.064777    494       32       0.457
13      0.11051    0.14338   0.92814   0.040816    441       18       0.457
14      0.15044    0.15475   0.93434   0.026525    377       10       0.457
15      0.089995   0.14578   0.9357    0.044231    520       23       0.457
16      0.12039    0.13794   0.93657   0.033784    444       15       0.457
total                                  0.042339    5574      236

Compression Rates: Improving
• Use a non-exon dictionary, or prune the exon dictionary considering words that are common in non-exons.
• Adaptive dictionary: when detecting an exon, use its common words to update the current dictionary.