PRABINA KUMAR MEHER
SCIENTIST, DIVISION OF STATISTICAL GENETICS
INDIAN AGRICULTURAL STATISTICS RESEARCH INSTITUTE
INDIAN COUNCIL OF AGRICULTURAL RESEARCH
NEW DELHI-110012
A HYBRID APPROACH FOR IDENTIFYING 5’ SPLICING JUNCTION WITH HIGHER ACCURACY
[Figure: DNA → (Transcription) → Pre-mRNA → (Splicing) → mRNA → (Translation) → Protein]
6th World Congress on Biotechnology
THE CENTRAL DOGMA
Every GT dinucleotide in a gene is a possible donor site, and each needs to be predicted as either a true or a false splice site.
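As a minimal sketch of what "every GT is a candidate" means in practice, the snippet below lists all GT positions in a sequence. The function name and the toy sequence are illustrative assumptions, not part of the presented method.

```python
def candidate_donor_sites(sequence):
    """Return the 0-based start index of every GT dinucleotide in the sequence."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]


# Made-up toy sequence, for illustration only.
toy_gene = "AAGGTAAGTCCGTTAG"
print(candidate_donor_sites(toy_gene))  # → [3, 7, 11]
```

Each returned index would then be handed to a classifier that labels the surrounding window as a true or a false splice site.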
RATIONALE AND GENESIS
Probabilistic: WMM, WAM, MM1, MEM, SAE
Machine Learning: MM1-SVM, WD-SVM, LIK-SVM, DS-SVM
Zhang et al. (Expert Systems with Applications, 2006)
RATIONALE AND GENESIS…
Window size: 100 bp
[Figure: encoding scheme. A training data set of TSS and FSS gives a scoring matrix of TSS and a scoring matrix of FSS; their difference matrix is used to encode the training sites and the test sites into the encoded training set and the encoded test set. Sequence positions run -44, -43, ..., -2, -1, 0, 0, 1, 2, ..., 43, 44. Rows of the scoring matrices are dinucleotides (AA, AT, AG, AC, ..., CG, CC); columns are adjacent position pairs (-44,-43), (-43,-42), ..., (42,43), (43,44).]
Huang et al. (Biochimie, 2006)
RATIONALE AND GENESIS…
Window size: 88 bp
1. Difficult to determine threshold in probabilistic approaches
2. Threshold is easy to set in machine learning approaches (MLA)
3. Most of the approaches are species specific
4. Less accurate with sub-optimal window length
RATIONALE AND GENESIS…
DATA for Validation
Species    TSS      FSS
Human      2796     90923
Bovine     10000    10000
Fish       10000    10000
Worm       1000     19000

Sources: HS3D, UCSC Genome Browser, Kamath et al. 2014
DATA for Comparison
NN269
            TSS     FSS
Training    1116    4140
Testing     208     782

Each sequence is 15 nt long, with the conserved GT at the 8th and 9th positions.
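Sequence Encoding
POSITIONAL FEATURE

WMM:
$$f_1^{P\text{-}T} = \sum_{i=1}^{L} \log_2\left(p_i^{t}\right)$$
$$f_2^{P\text{-}TF} = \sum_{i=1}^{L} \log_2\left(p_i^{t}\right) - \sum_{i=1}^{L} \log_2\left(p_i^{f}\right)$$

Shapiro and Senapathy:
$$f_3^{P\text{-}T} = \frac{\sum_{i=1}^{L} p_i^{t} - M^{t}}{M^{t} - N^{t}} \times 100$$
$$f_4^{P\text{-}TF} = \left[ \frac{\sum_{i=1}^{L} p_i^{t} - M^{t}}{M^{t} - N^{t}} - \frac{\sum_{i=1}^{L} p_i^{f} - M^{f}}{M^{f} - N^{f}} \right] \times 100$$

where $p_i^{t}$ and $p_i^{f}$ are the frequencies, in the TSS and FSS respectively, of the nucleotide (one of A, C, G, T) observed at position $i$; $M$ is the sum of the highest frequencies and $N$ is the sum of the lowest frequencies over positions 1 to $L$, obtained from the corresponding nucleotide frequency matrix.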
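The two WMM-style positional features (a log-score under the TSS frequencies, and its difference from the log-score under the FSS frequencies) can be sketched as follows. This is an illustration under stated assumptions: the function names, the toy sites and the pseudocount (added only to avoid taking the logarithm of zero) are not from the slides.

```python
import math

ALPHABET = "ACGT"


def position_frequencies(sites, pseudocount=1.0):
    """Per-position nucleotide frequencies from equal-length aligned sites.

    The pseudocount is an assumption of this sketch, not part of the slides.
    """
    length = len(sites[0])
    freqs = []
    for i in range(length):
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        freqs.append({base: counts[base] / total for base in ALPHABET})
    return freqs


def log_score(seq, freqs):
    """Sum over positions of log2 of the frequency of the observed nucleotide."""
    return sum(math.log2(freqs[i][base]) for i, base in enumerate(seq))


def log_score_difference(seq, tss_freqs, fss_freqs):
    """TSS log-score minus FSS log-score for the same sequence."""
    return log_score(seq, tss_freqs) - log_score(seq, fss_freqs)


# Made-up toy training sites (real windows are much longer).
tss_freqs = position_frequencies(["AGGTA", "AGGTG", "CGGTA"])
fss_freqs = position_frequencies(["TTGTC", "CAGTT", "ACGTC"])
print(log_score_difference("AGGTA", tss_freqs, fss_freqs))
```

A positive difference means the sequence looks more like the toy true sites than the toy false sites.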
CONTD…
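DEPENDENCY FEATURE

WAM:
$$f_5^{D\text{-}T} = \sum_{i=1}^{L} \sum_{j=1\,(\neq i)}^{L} \log_2\left(p_{i,j}^{t}\right)$$
$$f_6^{D\text{-}TF} = \sum_{i=1}^{L} \sum_{j=1\,(\neq i)}^{L} \log_2\left(p_{i,j}^{t}\right) - \sum_{i=1}^{L} \sum_{j=1\,(\neq i)}^{L} \log_2\left(p_{i,j}^{f}\right)$$

SAE:
$$f_7^{D\text{-}T} = 2L(L-1) - 2 \sum_{i=1}^{L} \sum_{j=1\,(\neq i)}^{L} \left(p_{i,j}^{t}\right)$$
$$f_8^{D\text{-}TF} = 2 \sum_{i=1}^{L} \sum_{j=1\,(\neq i)}^{L} \left(p_{i,j}^{f}\right) - 2 \sum_{i=1}^{L} \sum_{j=1\,(\neq i)}^{L} \left(p_{i,j}^{t}\right)$$

where the nucleotides at positions $i$ and $j$ belong to $\{A, C, G, T\}$.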
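A rough sketch of an SAE-style dependency feature of the form 2L(L-1) minus twice the sum of pairwise probabilities is given below. It assumes, purely for illustration, that the pairwise term is the conditional probability of the nucleotide at position i given the nucleotide at position j, estimated with a pseudocount; the slides do not spell out this definition, and all names here are hypothetical.

```python
from collections import Counter
from itertools import permutations


def fit_pair_model(sites, pseudocount=1.0):
    """Count nucleotide pairs for every ordered position pair (i, j), i != j."""
    length = len(sites[0])
    tables = {}
    for i, j in permutations(range(length), 2):
        pair_counts = Counter((site[i], site[j]) for site in sites)
        given_counts = Counter(site[j] for site in sites)
        tables[(i, j)] = (pair_counts, given_counts)
    return tables, pseudocount


def sae_feature(seq, model):
    """2L(L-1) - 2 * sum of P(seq[i] | seq[j]) over all ordered pairs i != j.

    ASSUMPTION: the conditional-probability reading of the pairwise term.
    """
    tables, pseudocount = model
    length = len(seq)
    total = 0.0
    for (i, j), (pair_counts, given_counts) in tables.items():
        numerator = pair_counts[(seq[i], seq[j])] + pseudocount
        denominator = given_counts[seq[j]] + 4 * pseudocount
        total += numerator / denominator
    return 2 * length * (length - 1) - 2 * total


# Made-up toy training sites.
tss_model = fit_pair_model(["AGGTA", "AGGTG", "CGGTA"])
print(sae_feature("AGGTA", tss_model), sae_feature("TTCCC", tss_model))
```

Under this reading, a smaller value indicates a sequence whose pairwise nucleotide pattern is closer to the training sites.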
CONTD…
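COMPOSITIONAL FEATURE

Dimers:
$$f_9^{C\text{-}I}(\delta_1\delta_2) = \frac{n(\delta_1\delta_2)}{L-1};\quad \delta_1, \delta_2 \in \{A, C, G, T\}$$

Triplets:
$$f_{10}^{C\text{-}I}(\delta_1\delta_2\delta_3) = \frac{n(\delta_1\delta_2\delta_3)}{L-2};\quad \delta_1, \delta_2, \delta_3 \in \{A, C, G, T\}$$

Tetramers:
$$f_{11}^{C\text{-}I}(\delta_1\delta_2\delta_3\delta_4) = \frac{n(\delta_1\delta_2\delta_3\delta_4)}{L-3};\quad \delta_1, \delta_2, \delta_3, \delta_4 \in \{A, C, G, T\}$$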
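The compositional features are overlapping k-mer counts divided by the number of windows (L-1 for dimers, L-2 for triplets, L-3 for tetramers). A minimal sketch, with a hypothetical function name and a made-up sequence:

```python
from itertools import product


def kmer_features(seq, k):
    """Frequency of each of the 4**k possible k-mers over overlapping windows.

    The divisor len(seq) - k + 1 equals L-1, L-2 and L-3 for k = 2, 3 and 4.
    Sequences with characters outside A, C, G, T raise a KeyError.
    """
    windows = len(seq) - k + 1
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(windows):
        counts[seq[i:i + k]] += 1
    return {kmer: count / windows for kmer, count in counts.items()}


dimers = kmer_features("AAGGTAAGT", 2)  # made-up 9-nt sequence
print(dimers["AG"])  # → 0.25
```

Calling the function with k = 2, 3 and 4 yields 16 + 64 + 256 = 336 values per sequence.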
Feature Selection
                     Total    Positional    Dependency    Compositional
All features         344      4             4             16+64+256
Selected features    49       4             4             14+15+12

$$F(j) = \left| \frac{\bar{x}_j^{+} - \bar{x}_j^{-}}{s_j^{+} + s_j^{-}} \right|$$
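As a hedged illustration of score-based feature selection, the sketch below assumes the F(j) score on this slide is the familiar two-class F-score: the absolute difference of the class means of feature j divided by the sum of the class standard deviations. The function names and the toy values are assumptions.

```python
from statistics import mean, stdev


def f_score(pos_values, neg_values):
    """|mean(pos) - mean(neg)| / (sd(pos) + sd(neg)) for a single feature.

    ASSUMPTION: this is the intended form of F(j); sample standard deviations
    are used. Raises ZeroDivisionError if the feature is constant in both classes.
    """
    spread = stdev(pos_values) + stdev(neg_values)
    return abs(mean(pos_values) - mean(neg_values)) / spread


def rank_features(pos_rows, neg_rows):
    """Return feature indices sorted from most to least discriminative."""
    n_features = len(pos_rows[0])
    scores = [
        f_score([row[j] for row in pos_rows], [row[j] for row in neg_rows])
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


print(f_score([1, 2, 3], [4, 5, 6]))  # → 1.5
```

Features would then be kept from the top of the ranking down to a chosen cut-off.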
Feature Selection…
Feature Type     #Features    Features
Positional       4            $f_1^{P\text{-}T}, f_2^{P\text{-}TF}, f_3^{P\text{-}T}, f_4^{P\text{-}TF}$
Dependency       4            $f_5^{D\text{-}T}, f_6^{D\text{-}TF}, f_7^{D\text{-}T}, f_8^{D\text{-}TF}$
Compositional    41           $f_9^{C\text{-}I}(\cdot)$ for the dimers AA, AC, AG, CA, CC, CT, GA, GC, GG, GT, TA, TC, TG, TT
                              $f_{10}^{C\text{-}I}(\cdot)$ for the triplets AAG, AGG, AGT, CAG, GAG, GGG, GGT, GTA, GTC, GTG, TAA, TGA, TGC, TGG, TGT
                              $f_{11}^{C\text{-}I}(\cdot)$ for the tetramers AAGG, AGGT, CAGG, GAGG, GGGT, GGTA, GGTG, GTAA, GTGA, GTGG, TAAG, TGAG
Cross validation
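[Figure: five-fold cross validation. The TSS and FSS sets are each divided into five parts (1 to 5). Parts 1 to 4 of both sets form the training set and part 5 of both sets forms the test set; the classifiers trained on the training set give predictions for the test set.]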
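The splitting scheme can be sketched as below: TSS and FSS are divided into five parts separately, so that every fold keeps both classes. The function name, the seed and the toy identifiers are assumptions for illustration; the classifier itself is left out.

```python
import random


def five_fold_splits(tss, fss, k=5, seed=0):
    """Yield (train, test) lists of (sequence, label) pairs; 1 = TSS, 0 = FSS.

    TSS and FSS are shuffled and partitioned independently, then one part of
    each is held out for testing while the remaining parts form the training set.
    """
    rng = random.Random(seed)
    tss, fss = list(tss), list(fss)
    rng.shuffle(tss)
    rng.shuffle(fss)
    tss_folds = [tss[i::k] for i in range(k)]
    fss_folds = [fss[i::k] for i in range(k)]
    for held_out in range(k):
        train, test = [], []
        for fold in range(k):
            target = test if fold == held_out else train
            target.extend((seq, 1) for seq in tss_folds[fold])
            target.extend((seq, 0) for seq in fss_folds[fold])
        yield train, test


# Toy identifiers standing in for real sequences.
toy_tss = [f"t{i}" for i in range(10)]
toy_fss = [f"f{i}" for i in range(20)]
for train, test in five_fold_splits(toy_tss, toy_fss):
    print(len(train), len(test))  # prints "24 6" five times
```

Each of the five iterations would train the classifiers on `train` and score the held-out `test` part.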
Parameter Optimization
Performance measure
Performance measure…
Measure    Balanced                           Imbalanced
           Human    Bovine   Fish     Worm    Human    Bovine   Fish     Worm
AUC-ROC    96.05    96.94    96.95    96.24   97.21    97.45    97.41    98.06
AUC-PR     97.64    97.89    97.91    97.90   93.24    93.34    93.38    92.29
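For readers unfamiliar with the two measures, here is a small dependency-free sketch. AUC-ROC is computed as the fraction of (true, false) pairs ranked correctly; the area under the precision-recall curve is approximated by average precision, which is one common estimator and not necessarily the one used for the table. Labels, scores and function names are made up.

```python
def auc_roc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example.

    ASSUMPTION: used here as a stand-in for AUC-PR; tied scores are ordered
    arbitrarily by the sort.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(100 * auc_roc(toy_labels, toy_scores))
print(100 * average_precision(toy_labels, toy_scores))
```

Multiplying by 100 puts the values on the same percentage scale as the table.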
Comparative Analysis
NN269       #TSS    #FSS
Training    1116    4140
Testing     208     782

Approaches    AUC-ROC    AUC-PR    References
MM1-SVM       97.62      89.58     Baten et al., 2006
LIK-SVM       98.04      92.65     Sonnenburg et al., 2007
WD-SVM        98.50      92.86     Sonnenburg et al., 2007
WDS-SVM       98.13      92.47     Sonnenburg et al., 2007
EFFECT        98.20      92.81     Kamath et al., 2014
Proposed      96.53      93.54
Prediction Server
http://cabgrid.res.in:8080/hsplice
HSplice
ACKNOWLEDGEMENT
DIRECTOR, INDIAN AGRICULTURAL STATISTICS RESEARCH INSTITUTE
NEW DELHI