datamining in bioinformatics-1
TRANSCRIPT
8/13/2019 Datamining in Bioinformatics-1
DATA MINING IN BIO-INFORMATICS
VIDYAA VIKAS COLLEGE OF ENGINEERING & TECHNOLOGY,
Tiruchengode,
Namakkal.
Prepared by,
Srimathi.K
Pavithra.B
B.Tech(IT),
Pre-final yr.
Contact:
srimathiceo@gmail.com, ([illegible])
ABSTRACT
Biology and computer science share a natural affinity. Physicist Erwin
Schrödinger envisioned life as an aperiodic crystal, observing that the organizing
structure of life is neither completely regular, like a pure crystal, nor completely chaotic
and without structure, like dust in the wind. This is why biological information has never
satisfactorily yielded to classical mathematical analysis. Machine computation combines
elegant algorithms with brute-force calculations, which seems a reasonable approach to
this aperiodic structure. The solutions to various problems lie in the domain of organic
matter. Thus, examining how organisms solve problems can lead to new computation- and
algorithm-development approaches that devour the problems that are difficult to tackle in
the laboratory, but so easy to approach using a computer.
Bioinformatics is the field of science in which biology, computer science, and
information technology merge to form a single discipline. The ultimate goal of the field is
to enable the discovery of new biological insights as well as to create a global perspective
from which unifying principles in biology can be discerned. In the first part of this
paper, we give a brief introduction to bioinformatics and data mining and their
relationship. In the later part, we deal with data mining approaches in bioinformatics and
their applications, particularly in biomedical and DNA data analysis.
THE ITINERARY
1. Bioinformatics
   - Where Biology meets Computer science
2. Data mining - an introduction
3. What is a biological database?
4. Why need data mining in bioinformatics?
5. Challenges in bio-industry
6. Approaches of data mining in Bioinformatics
   - Influence based mining
   - Affinity-based mining
   - Time delay data mining
   - Trend-based mining
   - Comparative data mining
   - Predictive data mining
7. Data mining for Biomedical and DNA data analysis
8. Conclusion
Defining Bioinformatics
Bioinformatics is the computer-assisted data management discipline that helps
gather, analyze, and represent biological information in order to understand life's
processes.
Bioinformatics is conceptualizing biology in terms of molecules (in the sense of
physical chemistry) and then applying "informatics" techniques (derived from
disciplines such as applied math, CS, and statistics) to understand and organize the
information associated with these molecules, on a large scale.
Bioinformatics: where biology meets computer science
Biology is the youngest of the natural sciences. When its collected information
reaches a critical density, a natural science progresses from information gathering to
information processing. Combining cold silicon and hot protoplasm may constitute a
marriage of opposites, but this union could produce genetic research prodigies.
These days biologists use computers routinely to assist with many activities,
including
Biomolecular sequence alignment,
Assembly of DNA pieces,
Multivariate analysis of large-scale gene expression, and
Metabolic pathway analysis.
Currently, the most successful uses of computers in biology are comparative
sequence analysis and in silico cloning, the process of using a computer search of
existing databases to clone a gene.
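To make sequence alignment concrete, here is a minimal sketch of a global alignment score computed by dynamic programming (the Needleman-Wunsch scheme). The paper does not name any particular algorithm, so this choice, the function name, and the scoring values (match +1, mismatch -1, gap -2) are all illustrative assumptions.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b (Needleman-Wunsch).

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            from_above = prev[j] + gap       # gap inserted in b
            from_left = curr[j - 1] + gap    # gap inserted in a
            curr.append(max(diagonal, from_above, from_left))
        prev = curr
    return prev[-1]


identical = global_alignment_score("GATTACA", "GATTACA")
one_deletion = global_alignment_score("ACGT", "AGT")
```

A higher score means the two sequences are more similar under the chosen scoring. Real tools use substitution matrices and more careful gap penalties.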
Data mining - a synonym for KDD
Data mining can be defined as the process of extracting hidden predictive
information from large databases.
Data mining, by its simplest definition, automates the detection of relevant
patterns in a database. For example, a pattern might indicate that married males with
children are twice as likely to drive a particular sports car as married males with no
children.
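A pattern of the "twice as likely" kind can be checked by comparing the rate of an outcome in two groups. The sketch below assumes records stored as dictionaries with boolean fields. The field names and the twenty sample records are invented for illustration and are not data from the paper.

```python
def relative_likelihood(records, group_key, outcome_key):
    """Outcome rate where group_key is True divided by the rate where it is False.

    Assumes both groups are non-empty and the False group has at least one
    positive outcome; otherwise a ZeroDivisionError is raised.
    """
    def rate(flag):
        group = [r for r in records if r[group_key] == flag]
        return sum(r[outcome_key] for r in group) / len(group)

    return rate(True) / rate(False)


# Hypothetical records: 10 married males with children, 10 without.
records = (
    [{"has_children": True, "drives_sports_car": True}] * 2
    + [{"has_children": True, "drives_sports_car": False}] * 8
    + [{"has_children": False, "drives_sports_car": True}] * 1
    + [{"has_children": False, "drives_sports_car": False}] * 9
)
ratio = relative_likelihood(records, "has_children", "drives_sports_car")
```

With these invented counts the first group's rate is 2 in 10 and the second group's is 1 in 10, so the ratio comes out at about 2.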
Data mining uses well-established statistical and machine learning techniques to
build models that predict customer behavior.
Today, technology automates the mining process, integrates it with commercial
data warehouses, and presents it in a relevant way for business users.
What is a Biological Database?
A biological database is a large, organized body of persistent data, usually
associated with computerized software designed to update, query, and retrieve
components of the data stored within the system. A simple database might be a single file
containing many records, each of which includes the same set of information. For
example, a record associated with a nucleotide sequence database typically contains
information such as contact name; the input sequence with a description of the type of
molecule; the scientific name of the source organism from which it was isolated; and,
often, literature citations associated with the sequence.
For researchers to benefit from the data stored in a database, two additional
requirements must be met:
Easy access to the information; and
A method for extracting only that information needed to answer a specific biological
question.
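The record layout and the second requirement (extracting only the information needed) can be sketched as follows. The class name, field names, and sample records are hypothetical and only mirror the fields listed above. They do not reproduce the schema of any real sequence database.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    """One record of a simple nucleotide sequence database (illustrative fields)."""
    contact: str
    sequence: str
    molecule_type: str
    organism: str
    citations: list = field(default_factory=list)


def find_by_organism(records, organism):
    """Return only the records whose source organism matches (case-insensitive)."""
    wanted = organism.lower()
    return [r for r in records if r.organism.lower() == wanted]


# Hypothetical sample data, invented for this example.
database = [
    SequenceRecord("A. Researcher", "ATGGCC", "mRNA", "Homo sapiens", ["Ref 1"]),
    SequenceRecord("B. Researcher", "ATGAAA", "DNA", "Escherichia coli"),
]
human_records = find_by_organism(database, "homo sapiens")
```

A production database would answer such questions through an index or a query language rather than a linear scan, but the idea of filtering on one field is the same.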
Need for data mining in bioinformatics
The growth curve of biological information databases follows an exponential curve that
closely mimics Moore's law, doubling every 18 months or so.
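The doubling behaviour can be written as size(t) = size(0) * 2^(t / T), where T is the doubling period. The sketch below assumes T = 18 months, the figure commonly quoted for Moore's law. The function name and the starting size are illustrative only.

```python
def projected_size(initial_size: float, months: float,
                   doubling_months: float = 18.0) -> float:
    """Size after `months`, assuming it doubles every `doubling_months`."""
    return initial_size * 2 ** (months / doubling_months)


# A hypothetical database of 1000 entries, projected three years ahead.
after_three_years = projected_size(1000, 36)
```

Three years contain two doubling periods, so the hypothetical database grows fourfold.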
By helping researchers process this vast collection of data, computer science can assist in
dispersing this information storm.
More than [illegible] biological abstracts are waiting for information extraction, and the
number is still growing.
The biopharmaceutical industry is generating more chemical and biological screening data
than it knows what to do with or how best to handle. As a result, deciding which targets
and lead compounds to develop further is often a long and arduous task.
Medical data has increased dramatically.
Manual analysis is not adequate.
The traditional data analysis methods are not adequate to deal with the enormous data flow.
Data mining is necessary.
Comprehensive pre-processing facilities are included.
The generated rules are simple to understand.
In the medical domain, the primary objective is explanation rather than prediction.
Medical databases typically have a high proportion of missing values. Data mining
software can efficiently handle the missing values.
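The paper does not say how data mining software treats missing values. One simple and common option is mean imputation, sketched below as an assumption rather than as the method any particular package uses. Missing entries are represented by None.

```python
def impute_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Hypothetical measurements with one missing reading.
readings = [1.0, None, 3.0]
completed = impute_with_mean(readings)
```

Mean imputation keeps every record usable but flattens variation, so it is only a starting point for medical data.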
Challenges in Bio-industry
Explaining the scale of data that needs to be handled in Biotechnology,
Oracle General Manager S. Grover says, "There are [illegible] genomes with [illegible] million
proteins in them. Each genome requires approximately [illegible] terabytes of trace files. So
[illegible] times [illegible] TB is massive. Medical imaging generates [illegible] million GB of data
annually.
Each mass spectrometer generates [illegible] GB of data daily. Multiply this by the
[illegible] of mass spectrometers in use in the world today and you get the picture." This
sheer volume of data calls for intelligent databases.
Biologists sometimes can't agree on the very definitions and concepts the
databases are supposed to manage. In genomics, the data entered is not accurate and
precise. Even if it is standardized, searching a colossal database is no mean task. Another
problem is that databases created by different organizations store information
idiosyncratically, creating different file formats that cannot talk to each other.
To begin with, biological data is itself complex and interlinked. A spot on
a DNA array, for instance, is connected not only to immediate information about its
intensity, but to layers of information about genomic location, DNA sequence, structure,
function, and much more.
Creating information systems that allow biologists to seamlessly follow
these links without getting lost in a sea of information is a challenge for computer
scientists. Data mining with elegant algorithms seems to be a better solution.
Approaches of Data mining in Bioinformatics
Influence-based mining:
Complex and granular (as opposed to linear) data in large databases are
scanned for influences between specific data sets, and this is done along many
dimensions and in multi-table formats.
These systems find applications wherever there are significant cause-and-effect
relationships between data sets, as occurs, for example, in large and multivariant
gene expression studies, which are behind areas such as pharmacogenomics.
Affinity-based mining:
Large and complex data sets are analyzed across multiple dimensions, and
the data-mining system identifies data points or sets that tend to be grouped together.
These systems differentiate themselves by providing hierarchies of associations and
showing any underlying logical conditions or rules that account for the specific groupings
of data. This approach is particularly useful in biological motif analysis, whereby it is
important to distinguish "accidental" or incidental motifs from ones with biological
significance.
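One rough way to judge whether a motif is "accidental" is to compare how often it is observed with how often it would be expected by chance. The sketch below assumes a deliberately simple background model in which every position is drawn independently and uniformly from A, C, G and T. This model, the function names, and the sample sequence are illustrative assumptions, not the method of any specific affinity-based system.

```python
def count_occurrences(sequence: str, motif: str) -> int:
    """Count occurrences of motif in sequence, overlapping ones included."""
    k = len(motif)
    return sum(1 for i in range(len(sequence) - k + 1) if sequence[i:i + k] == motif)


def expected_occurrences(sequence_length: int, motif_length: int) -> float:
    """Expected count under a uniform, independent A/C/G/T background."""
    positions = max(sequence_length - motif_length + 1, 0)
    return positions * 0.25 ** motif_length


# Hypothetical sequence in which the motif TATAA was planted three times.
sequence = "TATAAGGTATAACCTATAA"
observed = count_occurrences(sequence, "TATAA")
expected = expected_occurrences(len(sequence), len("TATAA"))
```

An observed count far above the expected one suggests the motif is not incidental. A real analysis would use a background model fitted to the genome and a proper significance test.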
Time delay data mining:
The data set is not available immediately and in complete form, but is
collected over time. The systems designed to handle such data look for patterns that are
confirmed or rejected as the data set increases and becomes more robust. This approach is
geared toward long-term clinical trial analysis and multicomponent mode of action
studies.
Trend-based mining:
The software analyzes large and complex data sets in terms of any
changes that occur in specific data sets over time. The data sets can be user-defined, or
the system can uncover them itself. Essentially, the system reports on anything that is
changing over time.
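As a minimal illustration of reporting "anything that is changing over time", the sketch below fits a least-squares slope to each series and keeps the series whose slope exceeds a threshold. The slope test, the threshold, and the gene names are assumptions made for the example. Actual trend-mining software may use quite different criteria.

```python
def trend_slope(values):
    """Least-squares slope of values against their index 0, 1, 2, ...

    Requires at least two values.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def changing_series(series_by_name, threshold):
    """Return the slope of every series whose absolute slope reaches the threshold."""
    slopes = {name: trend_slope(values) for name, values in series_by_name.items()}
    return {name: slope for name, slope in slopes.items() if abs(slope) >= threshold}


# Hypothetical expression levels measured at four time points.
measurements = {"geneA": [1, 3, 5, 7], "geneB": [2, 2, 2, 2]}
changing = changing_series(measurements, threshold=0.5)
```

Here only the steadily rising series is reported. The flat one is filtered out.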
Comparative data mining:
It focuses on overlaying large and complex data sets that are similar to
each other and comparing them. This is particularly useful in all forms of clinical trial
meta analyses, where data collected at different sites over different time periods, and
perhaps under similar but not always identical conditions, need to be compared. Here, the
emphasis is on finding dissimilarities, not similarities.
Predictive data mining:
Data mining alone is lacking somewhat if it is unable to also offer a
framework for making simulations, predictions, and forecasts, based on the data sets it
has analyzed. It combines pattern matching, influence relationships, time set correlations,
and dissimilarity analysis to offer simulations of future data sets.
Data mining for biomedical and DNA data analysis:
The past decade has seen an explosive growth in biomedical research,
ranging from the development of new pharmaceuticals and advances in cancer therapies
to the identification and study of the human genome by discovering large-scale
sequencing patterns and gene functions. Since a great deal of biomedical research has
focused on DNA data analysis, we study this application here. Recent research in DNA
analysis has led to the discovery of genetic causes for many diseases and disabilities, as
well as the discovery of new medicines and approaches for disease diagnosis, prevention,
and treatment.
An important focus in genome research is the study of DNA sequences since
such sequences form the foundation of the genetic codes of all living organisms.
All DNA sequences are comprised of four building blocks (called nucleotides):
adenine (A), cytosine (C), guanine (G), and thymine (T). These four nucleotides are
combined to form long sequences or chains that resemble a twisted ladder.
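Because a DNA sequence is a string over the four-letter alphabet A, C, G, T, its basic composition is easy to compute. The sketch below validates a sequence and counts each nucleotide. The function name and the sample sequence are illustrative.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"  # adenine, cytosine, guanine, thymine


def nucleotide_counts(sequence: str) -> dict:
    """Count each nucleotide, rejecting symbols outside A, C, G, T."""
    seq = sequence.upper()
    invalid = set(seq) - set(NUCLEOTIDES)
    if invalid:
        raise ValueError(f"unexpected symbols: {sorted(invalid)}")
    counts = Counter(seq)
    return {base: counts[base] for base in NUCLEOTIDES}


composition = nucleotide_counts("gattaca")
```

Real sequence files may also contain ambiguity codes such as N. This sketch treats them as errors to keep the example small.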
Human beings have around [illegible] genes. A gene is usually comprised of
hundreds of individual nucleotides arranged in a particular order. There are almost an
unlimited number of ways that the nucleotides can be ordered and sequenced to form
distinct genes. It is challenging to identify particular gene sequence patterns that play
roles in various diseases. Since many interesting sequential pattern analysis and similarity
search techniques have been developed in data mining, data mining has become a
powerful tool and contributes substantially to DNA analysis in the following ways.
Semantic integration of heterogeneous, distributed genome databases:
Due to the highly distributed, uncontrolled generation and use of a wide
variety of DNA data, the semantic integration of such heterogeneous and widely
distributed genome databases becomes an important task for systematic and coordinated
analysis of DNA databases. This has promoted the development of integrated data
warehouses and distributed federated
databases to store and manage the primary and derived genetic data.
Data cleaning and data integration methods developed in data mining will
help the integration of genetic data and the construction of data warehouses for genetic
data analysis.
Similarity search and comparison among DNA sequences:
Gene sequences from diseased
and healthy tissues can be compared to identify critical differences between the two
classes of genes. This can be done by first retrieving the gene sequences from the two
tissue classes, and then finding and comparing the frequently occurring patterns of each
class. Usually, sequences occurring more frequently in the diseased samples than in the
healthy samples might indicate the genetic factors of the disease; on the other hand, those
occurring only more frequently in the healthy samples might indicate mechanisms that
protect the body from the disease. Although genetic analysis requires similarity search,
the technique needed here is quite different from that used for time-series data. For
example, some of the data transformation methods, which are popularly used in the
analysis of time-series data, are ineffective for genetic data since such data are nonnumeric
data and the precise interconnections between different kinds of nucleotides play an
important role in their function. On the other hand, the analysis of frequent sequential
patterns is important in the analysis of similarity and dissimilarity in genetic sequences.
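The comparison of frequently occurring patterns in two tissue classes can be sketched with short subsequences of length k (k-mers). For each k-mer we compute the fraction of sequences that contain it in each class and report those whose fractions differ by at least a chosen margin. The use of fixed-length k-mers, the function names, the margin, and the six sample sequences are all assumptions for illustration.

```python
from collections import Counter


def kmer_presence(sequences, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def differing_kmers(diseased, healthy, k, min_diff):
    """k-mers whose presence fraction differs by at least min_diff.

    Positive values mean more frequent in diseased samples, negative values
    mean more frequent in healthy samples.
    """
    d_counts, h_counts = kmer_presence(diseased, k), kmer_presence(healthy, k)
    result = {}
    for kmer in set(d_counts) | set(h_counts):
        diff = d_counts[kmer] / len(diseased) - h_counts[kmer] / len(healthy)
        if abs(diff) >= min_diff:
            result[kmer] = diff
    return result


# Hypothetical sequences, invented for this example.
diseased_samples = ["ACGTT", "TACGA", "ACGCC"]
healthy_samples = ["TTGCA", "GGCAT", "ACGAA"]
candidates = differing_kmers(diseased_samples, healthy_samples, k=3, min_diff=0.5)
```

In this toy data ACG stands out as more frequent in the diseased class and GCA as more frequent in the healthy class. Real studies need many more samples and a statistical test before drawing such conclusions.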
Association analysis: identification of co-occurring gene sequences:
Currently, many studies have focused on the comparison of one gene to
another. However, most diseases are not triggered by a single gene but by a combination of
genes acting together. Association analysis methods can be used to help determine the
kinds of genes that are likely to co-occur in target samples. Such analysis would facilitate
the discovery of groups of genes and the study of interactions and relationships between
them.
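Co-occurrence can be measured with the support of a gene pair, that is, the fraction of samples in which both genes appear together. The sketch below counts pairs only. The function name, the support threshold, and the gene labels are hypothetical. Full association rule mining (for example Apriori) extends the same idea to larger gene sets and to confidence measures.

```python
from collections import Counter
from itertools import combinations


def cooccurring_gene_pairs(samples, min_support):
    """Gene pairs present together in at least min_support of the samples."""
    pair_counts = Counter()
    for genes in samples:
        # Sorting gives each pair a single canonical order, e.g. ("g1", "g2").
        pair_counts.update(combinations(sorted(set(genes)), 2))
    total = len(samples)
    return {pair: count / total
            for pair, count in pair_counts.items()
            if count / total >= min_support}


# Hypothetical sets of active genes observed in four target samples.
samples = [{"g1", "g2", "g3"}, {"g1", "g2"}, {"g2", "g3"}, {"g1", "g2", "g4"}]
frequent_pairs = cooccurring_gene_pairs(samples, min_support=0.5)
```

With this toy input the pairs (g1, g2) and (g2, g3) pass the threshold, with supports 0.75 and 0.5.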
Path analysis: linking genes to different stages of disease development:
While a group of genes may contribute to a disease process, different genes
may become active at different stages of the disease. If the sequence of genetic activities
across the different stages of disease development can be identified, it may be possible to
develop pharmaceutical interventions that target the different stages separately, therefore
achieving more effective treatment of the disease. Such path analysis is expected to play
an important role in genetic studies.
Visualization tools and genetic data analysis:
Complex structures and sequencing patterns of genes are most effectively
presented in graphs, trees, cuboids, and chains by various kinds of visualization tools.
Such visually appealing structures and patterns facilitate pattern understanding,
knowledge discovery, and interactive data exploration. Visualization therefore plays an
important role in biomedical data mining.
An Industrial look
After the dotcom downfall, many leading companies like TCS, Wipro,
and IBM are now looking at computerizing the medical field. IT professionals feel that
database management and data mining solutions and services play an important role in
this. Dr. Manoj Kumar, director, IBM research labs, says that competence in areas like
data and storage management and data mining would aid in the pursuit of bioinformatics.
Conclusion:
Bioinformatics systems benefit from the use of data mining strategies to
locate interesting and pertinent relationships within massive information. For example,
data mining methods can ascertain and summarize the set of genes responding to a certain
level of stress in an organism. Researchers can use graphical models and relational
algorithms to mine such gene sets and model a gene expression network. This paper, for
its part, reveals the persistent role of data mining in experimental biology. Thus, biology
combined with computer science is an emerging field that has come to stay and serve
humanity for a better cause.