Predicting ncRNA genes in the zebrafish genome: a machine learning approach
8/12/2019 Predicting ncRNA genes in Zebrafish genome: a maching learning approach
1/33
Machine Learning Methods in Computational Biology
Instructor: George S. Vernikos, PhD
Predicting ncRNA genes in Zebrafish genome
Relevance Vector Machine
Aidonopoulos Orfeas,
Sc Student in Bioinformatics, May !"#
1. Introduction
Nowadays, one of the most challenging problems in computational biology is to transform
the huge volume of data provided by newly developed technologies into knowledge.
Machine learning has become an important tool for this task (1). Several techniques
and methods have been developed to build models that can be trained and then
make crucial decisions. Bayesian classifiers, logistic regression, discriminant analysis,
classification trees, random forests, nearest neighbour, neural networks, support vector
machines, ensembles of classifiers, partitional clustering, hierarchical clustering, mixture
models, hidden Markov models, Bayesian networks and Gaussian networks are some of
these methods.
In our project the aim was to develop generalized linear models using the Relevance Vector
Machine technique in order to predict ncRNA genes in genomic (DNA) sequences of the
Zebrafish genome. Non-coding RNAs (ncRNA) are RNAs that are transcribed but not
translated into protein. There are two kinds of ncRNA: short and long non-coding RNAs.
Together they include the well-characterized transfer RNAs and ribosomal RNAs, snRNAs,
snoRNAs and miRNAs, as well as a plethora of new ncRNAs that have been shown to play
major roles in the cellular processes of all living organisms (2)(3). In addition, the
functional role of long non-coding RNA in human carcinomas has been studied (4)(5).
The relevance vector machine (RVM) is a machine learning method which exploits a probabilistic
Bayesian learning framework and has an identical functional form to the well-known
support vector machine (SVM) (6). RVM has the ability to construct accurate prediction
models which use dramatically fewer basis functions than an SVM while offering several
additional advantages. The innovative feature of an RVM is the probabilistic predictions it
produces. It does not decide whether a datum belongs to a class or not; instead it gives it
a probability of belonging to the class. RVMs can be used for both classification and
regression problems.
2. Previous works and our project
Having searched for previous work on the prediction of non-coding RNA genes,
we found only one paper that is related to our project and uses the RVM
method. In (7), Down and Hubbard tried to gain important information from non-coding
regions of similarity between genomes. Specifically, their aim was to extract the strongest
signal from a set of non-coding conserved sequences using RVM. This work
showed that the predictions of the model were close to the start of annotated genes, as we
can see in the figure below. They also verified that the promoter signal is the strongest
single motif-based signal in the non-coding functional fraction of the genome, while subsets
of these promoter regions have an abundance of CpG dinucleotides.
Apart from the work of Down and Hubbard, several other approaches have been developed
for the prediction of non-coding RNA genes or regions. In (8) the purpose was the
classification of RNA sequence alignments based on SCI and a z-score using the support
vector machine. SCI is a measure of RNA secondary structure conservation, while the
z-score is a measure of the thermodynamic stability of alignments (normalized with
respect to sequence length and base composition). In the following figure the green circles
are the positive examples of the training set (native alignments) and the red crosses the
negative ones (shuffled positions of random alignments). The background color ranging
from red to green indicates the RNA class probability for different regions of the z-score ...
The annotation of non-coding RNA genes remains a major bottleneck in genome
sequencing projects. Most genome sequences released today still come with sets of tRNAs
and rRNAs as the only annotated RNA elements, ignoring hundreds of other RNA families.
Several online tools have been created for this purpose. RNAspace.org (9) is one of the
most recent tools for the prediction, annotation and analysis of lncRNA genes. NcRNA.org
is another tool for finding long non-coding RNA genes in RNA sequences, which also gives
information about the secondary structure of the results (Figure: ncRNA.org) (10). Several
other tools and databases can be found in Table 2 of Gibb's publication (5) (Figure).
Figure: ncRNA.org
Figure: long non-coding RNA online databases and tools
3. Making the dataset
As mentioned, the aim of the present work was to predict ncRNA genes in genomic (DNA)
sequences of the Zebrafish genome using the 'relevance vector machine' method. Our training
set was created as follows. The positive examples consisted of known ncRNA genes
taken from http://www.ensembl.org. Then, for the negative set, we
downloaded all the protein coding genes of Zebrafish from the Ensembl website and
randomly selected as many of them as there were ncRNA genes (about 4500 sequences). We did
this in order to have a balance between the number of positive and negative examples.
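The balancing step described above can be sketched as follows. This is only a minimal illustration in Python, not the Perl code used in the project; the function name, the fixed seed and the toy gene identifiers are assumptions made for the example.

```python
import random

def balanced_negative_sample(negatives, n_positive, seed=0):
    """Randomly pick as many negative examples as there are positives."""
    if n_positive > len(negatives):
        raise ValueError("not enough negative examples to match the positives")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(negatives, n_positive)  # sampling without replacement

# Toy identifiers standing in for protein-coding gene sequences (hypothetical).
protein_coding = [f"gene_{i}" for i in range(20)]
negative_set = balanced_negative_sample(protein_coding, n_positive=5)
```

Sampling without replacement keeps every negative example distinct, so the two classes end up with the same number of unique sequences.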
As we know, a generalized linear model (GLM) is a commonly used form of model for both
classification and regression problems. It takes the following form:
$h(x) = b + \sum_{i} L(i)\,\phi_i(x)$
where $\phi$ is a set of basis functions (which can be arbitrary real-valued functions), $L$ is
a vector of weights and $b$ is a constant term. In other words, this is equal to:
$h(x) = b + \text{feature}_1 \cdot L(1) + \text{feature}_2 \cdot L(2) + \dots$ (Equation 1)
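A weighted sum of feature values such as Equation 1 can be evaluated with a few lines of Python. The sketch below is illustrative only: the function names, the optional bias term and the logistic link used to turn the score into a class probability are assumptions, not details taken from the RVM software used in the project.

```python
import math

def linear_score(features, weights, bias=0.0):
    """h(x) = bias + sum_i weight_i * feature_i."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return bias + sum(w * f for w, f in zip(weights, features))

def class_probability(score):
    """Logistic link (an assumption here): maps a score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

score = linear_score([1.0, 0.5], [2.0, -4.0])  # 2.0*1.0 + (-4.0)*0.5 = 0.0
prob = class_probability(score)                # logistic(0.0) = 0.5
```

A feature whose weight is zero contributes nothing to the sum, which is why a sparse weight vector effectively removes features from the model.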
In our project 40 features were used:
1. Strand: 1, or 0 for the complementary strand
2. Position in chromosome: percentage with respect to the length of the chromosome
to which the sequence belongs
3. Composition frequencies:
a. A, T, C, G
b. Dimers (AA, AG, ...)
c. Trimers (AAA, TTT, GGG and CCC)
4. GC content
5. Two ratios: A and AG
6. 11 motifs: To find motifs, we visited www.motifsearch.com, where a de novo
search for DNA motif sequences can be run. We gave as input the DNA
sequences (from our positive set, the ncRNA sequences) in FASTA format, and the output
returned was 11 motifs:
"AGCTAAGCA", "AAGCTAAGC", "GAAGCTAAG", "CATGGGAAA", "GCAGGGCTG",
"CACTGAAGC", "ATGGGAGAC", "TGAAGCTAA", "GATGGGAGA",
"GCTAAGCAG", "ACATGGGAA".
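The composition features in the list above can be computed along the following lines. This is a hedged Python sketch rather than the project's Perl scripts: the function names are hypothetical, the k-mer helper returns every k-mer (the project keeps only some of them), and encoding a motif as an occurrence count is an assumption, since the text does not say how motif features were encoded.

```python
from itertools import product

def kmer_frequencies(seq, k):
    """Fraction of each k-mer among all overlapping windows of length k."""
    seq = seq.upper()
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(max(len(seq) - k + 1, 0)):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows containing other symbols (e.g. N) are skipped
            counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: (n / total if total else 0.0) for kmer, n in counts.items()}

def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def motif_count(seq, motif):
    """Number of (possibly overlapping) occurrences of motif in seq."""
    seq, motif = seq.upper(), motif.upper()
    return sum(1 for i in range(len(seq) - len(motif) + 1)
               if seq.startswith(motif, i))

example = "AGCTAAGCAGG"
mono = kmer_frequencies(example, 1)   # single-nucleotide frequencies
di = kmer_frequencies(example, 2)     # all 16 dimer frequencies
```

Calling `kmer_frequencies(example, 3)` and keeping only the keys of interest would give the trimer features in the same way.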
The whole construction of our Training Set was implemented in Perl. The scripts,
with their inputs and outputs, are in the folder 'TrainSet'. This folder has two subfolders,
'NegSet' and 'PosSet', one for each data set (negative and positive
examples). The final training set is the file named TrainingSet, which is the
concatenation of the files FeaturesNegSet.bed and FeaturesPosSet.bed (Figure).
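The concatenation step can be illustrated as below. This is a sketch in Python under stated assumptions: only the file names TrainingSet, FeaturesNegSet.bed and FeaturesPosSet.bed come from the text, while the helper name, the temporary directory and the one-line file contents are made up for the demonstration.

```python
from pathlib import Path
import tempfile

def concatenate_files(sources, destination):
    """Write the lines of each source file, in order, into destination."""
    with open(destination, "w") as out:
        for source in sources:
            with open(source) as handle:
                for line in handle:
                    # make sure a file without a trailing newline does not
                    # get glued to the first line of the next file
                    out.write(line if line.endswith("\n") else line + "\n")

# Demo in a temporary directory with made-up one-line feature files.
workdir = Path(tempfile.mkdtemp())
(workdir / "FeaturesNegSet.bed").write_text("0\t0.42\n")
(workdir / "FeaturesPosSet.bed").write_text("1\t0.37")
concatenate_files(
    [workdir / "FeaturesNegSet.bed", workdir / "FeaturesPosSet.bed"],
    workdir / "TrainingSet",
)
training_lines = (workdir / "TrainingSet").read_text().splitlines()
```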
To test the performance of our model, we specify our Test file at line 5 of the configuration
file, and the last line lists 2 files. One is the output with the posterior (post)
probabilities and the other is the weights file (from the training step) on which the test run
was based. The following figure shows the NetBeans environment in which the RVM
runs every time.
Figure: Configuration File of RVM
Figure: Post Probabilities
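Given a file of posterior probabilities for a labelled test set, a classification accuracy can be computed as in the following Python sketch. The 0.5 decision threshold, the function name and the toy numbers are assumptions for illustration; the text does not state how the RVM software derives its accuracy figure.

```python
def classification_accuracy(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(1 for y_hat, y in zip(predicted, labels) if y_hat == y)
    return correct / len(labels)

# Toy values: 1 = ncRNA (positive), 0 = protein coding (negative).
accuracy = classification_accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1])
```

In this toy case two of the four predictions match their labels, so the accuracy is 0.5.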
Figure: Weights of basis functions
Figure: NetBeans and ClassificationAccuracy
2nd Step & Feature extraction and their importance
First of all, it is important to say that for each Training set we created a pointer file (table), so
that we can see which feature each bar represents in the charts that follow. Below is the
table with the pointers and their features:
Pointer  Feature
1        Strand
2        Position
3        A
4        G
5
6
7        AA
8        AG
9        A
10       A
11       GA
12       GG
13       G
14       G
15       A
16       G
17
18
19       A
20       G
21
22
23       AAA
24       GGG
25
26
27       G + C content
28       A
29       A G
30       "AGCTAAGCA"
31       "AAGCTAAGC"
32       "GAAGCTAAG"
33       "CATGGGAAA"
34       "GCAGGGCTG"
35       "CACTGAAGC"
36       "ATGGGAGAC"
37       "TGAAGCTAA"
38       "GATGGGAGA"
39       "GCTAAGCAG"
40       "ACATGGGAA"
The following subchapters describe the models we created for each training step,
how they were constructed, the features which were extracted, their importance and their
possible role in the problem of predicting non-coding RNA genes from genomic sequences.
Training with All Features
Firstly, as we said, all features were included in our training set. The RVM model excluded
9 of the 40 features. These were the Strand, the two nucleotides G and C, and six of the eleven motifs
(AAGCTAAGC, GAAGCTAAG, TGAAGCTAA, GATGGGAGA, GCTAAGCAG,
ACATGGGAA). The majority of the remaining features had a negative weight. In particular, as one
can see in the Figure, only 9 features had a positive contribution to our model (positive
weights): the Position of the sequence, the A and T nucleotides, the GC
content and 5 motifs (AGCTAAGCA, CATGGGAAA, GCAGGGCTG, CACTGAAGC, ATGGGAGAC).
We have therefore constructed our 1st Generalized Linear Model. In our case we
cannot write the model out in this documentation because of the large number of basis
functions (40 features). Instead, we provide a table with the weight of each basis function
as extracted from the RVM:
Feature        Weight
Position       9.91e-04
GA             -2.5201484
GG             -2.144286
G              -3.8467395
A              -4.1560565
A              -2.636189
G              -4.2876665
               -2.392099
               -4.3660257
A              -2.5009519
G              -4.3658991
A              4.57567286
               -4.1736209
               -2.0326615
AAA            -0.3333983
GGG            -0.679385
               -0.5227896
               -1.2794482
G + C content  8.28081651
A              -0.0465274
A G            -8.08e-05
"AGCTAAGCA"    0.19182417
"CATGGGAAA"    0.37633093
"GCAGGGCTG"    0.20320254
"CACTGAAGC"    0.3232945
"ATGGGAGAC"    0.32747284
T              8.67332563
AA             -1.0602642
AG             -3.8995319
A              -3.69413
A              -4.0302605
O75 fi"es.
Figure: Weights of all features
The contribution of each structural feature to our model is evaluated through a function
(R) that quantifies the relative feature importance, rather than the actual feature weight
(W). Briefly, the importance R of each feature is expressed as the product of the
corresponding weight and the corresponding standard deviation (SD) of the feature values
in the training set. We prefer to assess the feature contribution to the model through
R rather than W, because R takes into account the variability of the data set,
normalizing the values with the corresponding SD (11). Moreover, it is easy to fall into
the trap of assuming that a feature with a large positive weight must enhance the
classifier's separating ability. There are many cases where a feature with a positive weight
has a small R. This happens because of the feature's dispersion. A feature with a large positive
weight but small dispersion, and consequently a small SD, will have a smaller R than
W, as R = W * SD.
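The relation R = W * SD can be illustrated with a short Python sketch. The weights and feature values below are invented for the example, the function name is hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption, since the text does not say which was used.

```python
from statistics import pstdev

def feature_importance(weight, values):
    """R = W * SD, with SD the population standard deviation of the feature."""
    return weight * pstdev(values)

# Two made-up features: a large weight with almost no spread,
# and a small weight with a wide spread.
weights = {"narrow": 5.0, "wide": 0.5}
values = {
    "narrow": [0.50, 0.51, 0.49, 0.50],
    "wide": [0.0, 10.0, 20.0, 30.0],
}
importance = {name: feature_importance(weights[name], values[name])
              for name in weights}
```

Here the feature with the larger weight ends up with the smaller importance, which is the effect described above.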
So, by saving the NetBeans output in a file (dir: RVMRuns/
PerlFilesLoutputsLNetbeansCommandLineOuts) we can parse the SD of each feature and
then calculate its significance R. The Perl script that implements this is ari!"."# .
The importance of each feature is depicted in the Figure. This chart verifies that the
percentages of Adenines and Thymines in a sequence play an important role in
deciding whether a DNA sequence contains an ncRNA gene (I(A) = 19.3669 and I(T) = 32.7061).
But the most significant feature for making a decision is the percentage of GC content
in the sequence (I(G + C CONTENT) = 64.6421).
Figure: Features Importance
Training Set with InFeatures
In order to verify our previous results, we constructed a dataset with only the features to
which RVM gave a non-zero weight. The results for weights and importance are given
below. As we can see from the Figure, once again most of the features had a
negative weight. Only 8 of the 31 features were positively weighted. This confirms
that most of them are negatively correlated with each other.
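Building a reduced training set from the features that received a non-zero weight amounts to selecting columns of the feature matrix. The Python sketch below shows one way to do this; the function name, weights and rows are invented for illustration and do not reproduce the project's Perl scripts.

```python
def select_columns(rows, weights, keep):
    """Keep only the columns whose weight satisfies the predicate `keep`."""
    kept = [i for i, w in enumerate(weights) if keep(w)]
    return kept, [[row[i] for i in kept] for row in rows]

# Made-up weights for five features and two made-up feature rows.
weights = [0.0, 9.91e-04, 4.5, 0.0, -2.5]
rows = [
    [1, 0.3, 0.25, 0.20, 0.05],
    [0, 0.7, 0.31, 0.22, 0.08],
]
in_idx, in_rows = select_columns(rows, weights, lambda w: w != 0.0)
out_idx, out_rows = select_columns(rows, weights, lambda w: w == 0.0)
```

The same helper with `lambda w: w > 0` or `lambda w: w < 0` would give subsets of positively or negatively weighted features.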
Pointer  Feature
1        Position
2        A
3        T
4        AA
5        AG
6        A
7        A
8        GA
9        GG
10       G
11       G
12       A
13       G
14
15
16       A
17       G
18
19
20       AAA
21       GGG
22
23
24       G + C content
25       A
26       A G
27       "AGCTAAGCA"
28       "CATGGGAAA"
29       "GCAGGGCTG"
30       "CACTGAAGC"
31       "ATGGGAGAC"
Figure: Weights for the InFeature dataset
As far as the features' importance is concerned, it is also verified that GC content is the most
significant feature (I = 68.0000). The quantities of Thymines and Adenines in a sequence
come second and third on the list.
Figure: InFeature's importance
Training Set with OutFeatures
In order to draw some conclusions on unseen data, we created a training set with the
features that had a zero weight ("OutFeatures"). These features were the following:
Pointer  Feature
1        Strand
2        G
3        C
4        "AAGCTAAGC"
5        "GAAGCTAAG"
6        "TGAAGCTAA"
7        "GATGGGAGA"
8        "GCTAAGCAG"
9        "ACATGGGAA"
After calculating the corresponding weights and then the importance, we can easily see that
motifs 5-9 are strongly correlated. This is fairly logical, as these motifs
have a very small presence in the sequences. Also, the "AAGCTAAGC" motif is highly non-significant.
Strand and G do have some relationship, as we can see from the Figure, but they are
verified as features which affect our model negatively (Figure).
Figure: Weights of OutFeatures
Figure: OutFeatures importance
Training Set with Negative Features
It is important to repeat that we prefer to interpret the weight W as a measure that
shows how much one feature affects another, rather than to draw conclusions from it about a
feature's importance as a predictor. Nevertheless, in our project we chose not to ignore the meaning
of negatively weighted features, as they indicate that there is a negative correlation
between these basis functions. Consequently, we would like to observe how the negatively
weighted features behave. The features to which RVM gave a negative weight were the
following:
Pointer  Feature
1        AA
2        AG
3        A
4        A
5        GA
6        GG
7        G
8        G
9        A
10       G
11
12
13       A
14       G
15
16
17       AAA
18       GGG
19
20
21       A
22       A G
One thing that shows that most of the above basis functions affect our model in a
"negative" way is the following chart (Figure). We can readily observe that most of
the features continue to show the same behavior: most of them have a negative
weight. Only $% and &% are related to each other. This can be logically explained due to the ...
Figure: Importance of NegFeatures
Training Set with Positive Features
Lastly, we created a Training set including the features to which the first RVM run gave a
positive weight. Training on these features alone, we could conclude that the position,
Adenine, Thymine, GC content and the motifs 6 and 8 are strongly correlated (Figure). Not
only is their relationship stable, but their importance is too. The Figure shows
that their importance in a model which includes only these basis functions is steady. Last
but not least, there are not many differences in the magnitude of importance (most features'
importance measure ranges from 0.15 to 0.55). So we cannot say, for example, that the motif
"AGCTAAGCA" has a more important role than GC content.
Pointer  Feature
1        Position
2        A
3        T
4        G + C content
5        "AGCTAAGCA"
6        "CATGGGAAA"
7        "GCAGGGCTG"
8        "CACTGAAGC"
9        "ATGGGAGAC"
Figure: Weights of PosFeatures
Figure: PosFeatures importance
General results & Biological interpretation
Taking all the training runs into account, we can conclude that GC content is a measure that we
ought to observe and study more closely. GC content stood out among all the
features. Previous works have shown that GC-rich isochores contain many protein
coding genes; thus determining the ratio of these specific regions contributes to mapping
gene-rich regions of the genome. For example, as we said in the beginning, it has been
shown that human genes associated with CpG islands increase in number as their
Guanine + Cytosine levels increase, and that most genes associated with CpG islands are
located in the GC-richest compartment of the human genome. For this reason
we created 2 distributions in order to see the differences between positive (known ncRNA
genes) and negative (protein coding genes) examples.
From the above figures we can see that in the zebrafish genome protein coding genes are a
little richer in GC content than known ncRNAs, as in the human genome. The G + C
content is greater than 40% in about 3500 ncRNA sequences and in 3900 Protein Genes.
The percentages of Adenine and Thymine in the genome sequences were also extracted and verified
as significant predictors. This is quite logical, as we showed that GC content plays an
important role in our model. Consequently, the complementary bases of G and C may also
be playing some role. The following distributions show the percentage of Adenines and
Thymines (with respect to the length of each sequence) for both Positive and Negative
examples.
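Counts such as the number of sequences above 40% GC, and the distributions themselves, can be obtained along the following lines. This Python sketch is illustrative only; the function names, the 10-point bin width and the four toy sequences are assumptions and are unrelated to the actual data set.

```python
def gc_percent(seq):
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def histogram(values, bin_width=10.0):
    """Count percentages per bin of width bin_width; 100 goes into the last bin."""
    n_bins = int(100 / bin_width)
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v // bin_width), n_bins - 1)] += 1
    return counts

sequences = ["GGCC", "ATAT", "ATGC", "GGCA"]  # toy sequences
percents = [gc_percent(s) for s in sequences]
above_40 = sum(1 for p in percents if p > 40.0)
bins = histogram(percents)
```

Replacing the G and C counts with the count of A (or T) gives the Adenine and Thymine distributions in the same way.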
From the above charts we see, first of all, that the maximum content of both Adenines
and Thymines in each sequence does not exceed 40%. The
sequences which code for a protein are a little richer in Adenines than the ones
which give RNA genes, as in the case of GC content. On the other hand, protein
coding and ncRNA genes have the same content of Thymines. We cannot
identify any major difference between our examples except for the shape of the distribution
of Adenines in the two datasets. The distribution in Protein Genes is more balanced than in
ncRNA ones.
Therefore, GC content really is a significant predictor for deciding whether a DNA
sequence will be translated into a protein or not.
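References
1. Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, et al. Machine learning in
bioinformatics. Brief. Bioinform. 2006 Mar 1;7(1):86...
...11 Oct 3;6(10):e25915.
5. Gibb EA, Brown CJ, Lam WL. The functional role of long non-coding RNA in human
carcinomas. Mol. Cancer. 2011 Apr 13;10(1):38.
6. Tipping ME. Sparse Bayesian learning and the relevance vector machine. J Mach Learn
Res. 2001 Sep;1:211...
...04 Sep 15;5(1):131.
8. Washietl S, Hofacker IL, Stadler PF. Fast and reliable prediction of noncoding RNAs.
Proc. Natl. Acad. Sci. U. S. A. 2005 Feb 15;102(7):2454...
...08 Jul 1;36(suppl 2):W75...
...08 Feb 1;18(2):331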
;. Bashiet" S, 6ofacker IC, Stad"er P9. 9ast and re"iab"e #rediction of noncoding RNAs.Proc. Nat". Acad. Sci. H. S. A. 0>>3 9eb '3K'>0&7(:0232>; Qu" 'K14&su##" 0(:B73>; 9eb 'K';&0(:11'