Creation and Maintenance of GeneKeyDB
Research being conducted by Kevin Kastner
Under the direction of Dr. Erich Baker
The Problem
There exist thousands of biomedical data sources.
In 2006, there were ~557 relevant public resources in molecular biology.
This number is growing rapidly:
203 sources in 1999
226 sources in 2000
277 sources in 2001
The Problem
Traditional database approaches are too structured.
Scientific objects change identification over time.
Gene names change over time.
The Human Genome Nomenclature Database (HUGO) contains 13,594 active symbols, 9,635 literature aliases, and 2,739 withdrawn symbols.
SIR2L1 (withdrawn) is a synonym for SIRT1 and sir2-like 1.
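
To make the renaming problem concrete, here is a minimal Python sketch of mapping a withdrawn symbol or literature alias back to its active symbol. Only the SIRT1 / SIR2L1 / sir2-like 1 relationship comes from the slide; the table layout, the `resolve_symbol` name, and the case-insensitive matching are assumptions made for illustration, not GeneKeyDB's actual design.

```python
# Hypothetical synonym table. Only the SIRT1 entries are taken from the slide.
ACTIVE_SYMBOLS = {"SIRT1"}
SYNONYMS = {
    "SIR2L1": "SIRT1",        # withdrawn symbol
    "sir2-like 1": "SIRT1",   # literature alias
}

# One lookup keyed on a case-folded name (case-insensitivity is an assumption).
_LOOKUP = {symbol.casefold(): symbol for symbol in ACTIVE_SYMBOLS}
_LOOKUP.update({alias.casefold(): symbol for alias, symbol in SYNONYMS.items()})


def resolve_symbol(name):
    """Return the active symbol for a name, or None if the name is unknown."""
    return _LOOKUP.get(name.strip().casefold())
```

Under these assumptions, a query for either the withdrawn symbol or the alias lands on the same gene, which is the behavior a fixed, name-keyed schema does not provide on its own.
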
Scientific Object Identities
Hugo Name   GDB       GenAtlas   OMIM   GeneCards   LocusLink
TP53        1         33         52     22          13
P53         1(same)   17         188    69          63
SIRT1       1         0          5      1           2
SIR2L1      0         0          1      1(same)     1(same)
The Solution
GeneKeyDB
A gene-centered relational database developed to enhance data mining in biological data sets.
GeneKeyDB relies primarily on existing database identifiers derived from community databases (NCBI, GO, Ensembl, et al.) as well as the known relationships among those identifiers.
Version 1 is already out! http://www.biomedcentral.com/1471-2105/6/72
Weaknesses of Version 1
Can no longer be updated
Complex queries must be made to the database in order to obtain desired information
Complex Queries
    SELECT ll_xp_cdd.cdd_name, ll_np_cdd.cdd_name, organism
    FROM ll_xp_cdd, ll_np_cdd, ll_locus
    WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score
    AND ll_id IN
        (SELECT ll_id
         FROM ll_refseq_xm
         WHERE ll_refseq_xm_id IN
            (SELECT ll_refseq_xm_id
             FROM ll_xp_cdd, ll_np_cdd
             WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score))
    AND ll_id IN
        (SELECT ll_id
         FROM ll_refseq_nm
         WHERE ll_refseq_nm_id IN
            (SELECT ll_refseq_nm_id
             FROM ll_xp_cdd, ll_np_cdd
             WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score));
Current Research
Creation of APIs to validate data in the database and to make querying much easier for the user.
One-step updating of the database and the information it contains.
API Alternative
    // fxn(search_params, desired_info), returns ll_id
    curated.cdd(score[ ], null)
    curated_score[ ] <- score[ ]
    locus_id1[ ] <- gaa.cdd((name[ ], score[ ]), score[ ])
    gaa_name[ ] <- name[ ]
    gaa_score[ ] <- score[ ]
    locus_id2[ ] <- curated.cdd(name[ ], score[ ])
    curated_name[ ] <- name[ ]
    locus_id[ ] <- intersect(locus_id1[ ], locus_id2[ ])
    locus(organism[ ], locus_id[ ])
    print(gaa_name[ ], curated_name[ ], organism[ ])
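
The pseudocode above can be read as: look up locus IDs from two sources, intersect them, then report names and organism for the loci that survive. The following Python sketch illustrates that flow under stated assumptions. The `CddHit` record, the in-memory sample rows, and the function names are all hypothetical, and Python is used only for brevity; the slides do not show the real GeneKeyDB schema or API signatures.

```python
from collections import namedtuple

# Hypothetical record layout; not the actual GeneKeyDB schema.
CddHit = namedtuple("CddHit", ["locus_id", "name", "score"])

# Made-up sample rows, for illustration only.
CURATED = [CddHit(1, "domA", 50), CddHit(2, "domB", 75), CddHit(3, "domC", 20)]
GAA = [CddHit(1, "domA-like", 50), CddHit(2, "domX", 10), CddHit(4, "domC-like", 20)]
ORGANISM = {1: "Homo sapiens", 2: "Mus musculus", 3: "Homo sapiens", 4: "Rattus norvegicus"}


def loci_with_scores(hits, scores):
    """Return the set of locus IDs whose hits carry one of the given scores."""
    return {hit.locus_id for hit in hits if hit.score in scores}


def shared_score_report(curated, gaa, organism):
    """List (gaa_name, curated_name, organism) for loci found in both sources."""
    curated_scores = {hit.score for hit in curated}
    gaa_scores = {hit.score for hit in gaa}
    locus_id1 = loci_with_scores(gaa, curated_scores)
    locus_id2 = loci_with_scores(curated, gaa_scores)
    locus_ids = locus_id1 & locus_id2  # the intersect() step
    rows = []
    for locus_id in sorted(locus_ids):
        gaa_names = [h.name for h in gaa
                     if h.locus_id == locus_id and h.score in curated_scores]
        curated_names = [h.name for h in curated
                         if h.locus_id == locus_id and h.score in gaa_scores]
        for gaa_name in gaa_names:
            for curated_name in curated_names:
                rows.append((gaa_name, curated_name, organism[locus_id]))
    return rows
```

With the sample rows, only locus 1 appears in both lookups, so `shared_score_report(CURATED, GAA, ORGANISM)` yields a single row for it. Each step is a small named call rather than a nested subquery, which is the readability gain the slide is arguing for.
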
External Implementations
Some databases have APIs as well.
Ensembl: APIs are done in Perl.
APIs for GeneKeyDB will be done in Java.
More structured language.
Easier to read.
The Future of GeneKeyDB
GeneKeyDB will join even more external and widely used databases together.
Code for updating GeneKeyDB will tie into database information that will change in expected ways.
Lowers the required number of code rewrites.
GeneKeyDB will be dynamically updated.
The Future of GeneKeyDB
APIs will also be written in Perl.
Perl is used often, almost exclusively, by biologists.
Perl APIs can tie into the Java APIs, rather than creating all new ones.
Comments? Questions?
http://genereg.ornl.gov/gkdb/