BioMart Query Network
Arek Kasprzyk
European Bioinformatics Institute
8 January 2005
Biological databases
• Distributed
• Different format
• Different focus
• Different release schedule
• Scalability factor
BioMart
[Diagram: BioMart architecture. Databases: Ensembl, UniProt, SNP, Vega, MSD, myDatabase, myMart; public data (local or remote). Schema transformation: MartBuilder. Configuration: MartEditor, XML. Retrieval: BioMart API (Java, Perl) used by MartExplorer, MartShell and MartView.]
MartView
BioMart@Ensembl
MartShell
MartExplorer
Database
[Diagram: database tables connected through primary key (PK) and foreign key (FK) relationships.]
Schema
[Diagram: schema tables with their PK and FK columns.]
Schema
[Diagram: schema tables with their PK and FK columns.]
Schema
[Diagram: main tables main1 (PK1) and main2 (PK1, PK2), with dimension (dm) tables that reference them through FK1 and FK2.]
Schema - ‘reversed star’
Fixed schema transformation
[Diagram: source tables A, B and C; transformed tables TA and TB.]
Schema transformation
• Central table
  – Longest n:1, 1:1 path
• Dimension table
  – Central transformation ‘around’ 1:n table.
  – Link tables are decomposed into a set of 1:n first
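One possible reading of these rules, as a minimal Python sketch: starting from the central table, every table reached through a 1:1 or n:1 relationship is folded into the central (main) table, and every table reached through a 1:n relationship becomes the seed of a dimension table. The function name, the relationship map and the cardinality assigned to `transcript` are assumptions made for illustration; the cardinality codes (`11`, `n1`, `1n`, `0n`) follow the MartBuilder prompt shown later in the deck.

```python
from collections import deque

# Hypothetical relationship map: (from_table, to_table) -> cardinality code.
# "11" = 1:1, "n1" = n:1, "1n" = 1:n, "0n" = 0:n.
RELATIONS = {
    ("gene", "gene_description"): "11",
    ("gene", "gene_stable_id"): "11",
    ("gene", "analysis"): "n1",
    ("gene", "transcript"): "1n",  # assumed cardinality, for illustration only
}


def plan_transformation(central, relations):
    """Split tables reachable from `central` into main-table members and dimensions."""
    main, dimensions = [central], []
    queue = deque([central])
    while queue:
        table = queue.popleft()
        for (src, dst), cardinality in relations.items():
            if src != table or dst in main or dst in dimensions:
                continue
            if cardinality in ("11", "n1"):
                # 1:1 and n:1 neighbours extend the central path.
                main.append(dst)
                queue.append(dst)
            elif cardinality in ("1n", "0n"):
                # 1:n neighbours start a dimension table instead.
                dimensions.append(dst)
    return main, dimensions


if __name__ == "__main__":
    main_tables, dimension_tables = plan_transformation("gene", RELATIONS)
    print("main:", main_tables)
    print("dimensions:", dimension_tables)
```

The sketch follows every 1:1 and n:1 edge rather than searching for a single longest path, which is a simplification of the rule stated on the slide.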
MartBuilder
• Input
  – central object
  – database meta data
  – cardinalities
• Output
  – Set of SQL statements:
    • “create table as select …”
• Transformations
  – represented as asymmetric tree
MartBuilder
DATASET: hsapiens_gene_ensembl
TYPE MAIN [M] DIMENSION [D] EXIT [E]: M
TABLE NAME: gene
gene: alt_allele cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: gene cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: gene_description cardinality [11] [n1] [0n] [1n] [SKIP S]: 11
gene: gene_stable_id cardinality [11] [n1] [0n] [1n] [SKIP S]: 11
gene: kk__gene__main cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: transcript cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: analysis cardinality [11] [n1] [0n] [1n] [SKIP S]: n1
gene: dna cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: dnac cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: seq_region cardinality [11] [n1] [0n] [1n] [SKIP S]: S
TYPE MAIN [M] DIMENSION [D] EXIT [E]: E
ADD EXTENSION: hsapiens_gene_ensembl__gene__MAIN [Y|N]: N
CHANGE FINAL TABLE NAME: hsapiens_gene_ensembl__gene__MAIN TO:
CREATE TABLE TEMP0 AS
SELECT gene.gene_id, gene.type, gene.analysis_id, gene.seq_region_id,
       gene.seq_region_start, gene.seq_region_end, gene.seq_region_strand,
       gene.display_xref_id,
       gene_description.gene_id AS gene_id_TEMP0, gene_description.description
FROM gene, gene_description
WHERE gene_description.gene_id = gene.gene_id;

CREATE TABLE hsapiens_gene_ensembl__gene__MAIN AS
SELECT TEMP0.gene_id, TEMP0.type, TEMP0.analysis_id, TEMP0.seq_region_id,
       TEMP0.seq_region_start, TEMP0.seq_region_end, TEMP0.seq_region_strand,
       TEMP0.display_xref_id, TEMP0.gene_id_TEMP0, TEMP0.description,
       gene_stable_id.gene_id AS gene_id_TEMP1, gene_stable_id.stable_id,
       gene_stable_id.version
FROM TEMP0, gene_stable_id
WHERE gene_stable_id.gene_id = TEMP0.gene_id;

DROP TABLE TEMP0;
Transformation configuration
satellog_repeats M repeats disease n1
satellog_repeats M repeats gc 11
satellog_repeats M repeats linkage_depth S
satellog_repeats M repeats repeats S
satellog_repeats M repeats transcripts S
satellog_repeats M repeats ugcount S
satellog_repeats M repeats ugstats S
satellog_repeats M repeats rep_class n1
satellog_repeats D ugcount ugcount S
satellog_repeats D ugcount ugstats S
satellog_repeats D ugcount gc S
satellog_repeats D ugcount repeats n1r
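A configuration in this shape could be read with a few lines of Python. The sketch below assumes, by analogy with the interactive MartBuilder prompts, that the five whitespace-separated fields are dataset, table type (M or D), central table, related table and cardinality, with `S` meaning "skip". The field names and function names are illustrative and are not taken from BioMart.

```python
from typing import NamedTuple


class TransformationRule(NamedTuple):
    dataset: str
    table_type: str      # "M" (main) or "D" (dimension)
    central_table: str
    related_table: str
    cardinality: str     # e.g. "11", "n1", or "S" for skip


def parse_rules(text):
    """Parse one rule per non-empty line; each line must have five fields."""
    rules = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
        rules.append(TransformationRule(*fields))
    return rules


def active_rules(rules):
    """Keep only the rules that are not marked as skipped."""
    return [rule for rule in rules if rule.cardinality != "S"]


if __name__ == "__main__":
    sample = """
    satellog_repeats M repeats disease n1
    satellog_repeats M repeats gc 11
    satellog_repeats M repeats linkage_depth S
    """
    for rule in active_rules(parse_rules(sample)):
        print(rule.central_table, "->", rule.related_table, rule.cardinality)
```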
Data access
Dataset – Key Abstraction
• Dataset
  – Organised into a single schema
  – BioMart database contains one or more dataset(s)
  – Attribute
  – Filter
  – Exportable/Importable (Links)
• Dataset - an equivalent of relational table
  – Exportable/Importable = PK/FK
Key Abstractions
[Diagram: table GENE CENTRAL with columns gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id and description, annotated with the abstractions Mart, Dataset, Attribute and Filter.]
Exportables, Importables and Links
• Exportable = ordered list of attributes
• Importable = ordered list of filters
  – WHERE filt1=value1
  – WHERE filt1=value1 or filt1=value2
  – WHERE filt1>value1 and filt2<value2
• Links = matching importable and exportable
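The matching of exportables and importables can be illustrated with a small Python sketch. It assumes that a link exists when two datasets declare an exportable and an importable under the same name with the same number of entries, and that exported attribute values are then applied, in order, as equality filters on the importing dataset. The class, function, dataset and column names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    # link name -> ordered list of attribute names offered to other datasets
    exportables: dict = field(default_factory=dict)
    # link name -> ordered list of filter names accepted from other datasets
    importables: dict = field(default_factory=dict)


def find_links(source, target):
    """Names under which `source` exports what `target` can import."""
    return [
        name
        for name, attributes in source.exportables.items()
        if name in target.importables
        and len(attributes) == len(target.importables[name])
    ]


def importable_where(target, link, exported_values):
    """Build a parameterised WHERE clause from one exported row of values."""
    filters = target.importables[link]
    if len(filters) != len(exported_values):
        raise ValueError("exported values do not match the importable's filters")
    clause = "WHERE " + " AND ".join(f"{name} = ?" for name in filters)
    return clause, tuple(exported_values)


if __name__ == "__main__":
    genes = Dataset("genes", exportables={"protein_link": ["protein_accession"]})
    proteins = Dataset("proteins", importables={"protein_link": ["accession"]})
    for link in find_links(genes, proteins):
        print(importable_where(proteins, link, ["ACC0001"]))
```

Keeping both lists ordered is what lets the n-th exported attribute feed the n-th imported filter, in the same way that a composite foreign key lines up with a composite primary key.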
MartView
Dataset Configuration
• Dataset configuration
• Attributes
• Filters
• Trees, Groups, Collections
• Links
• Semantics
• Relational mapping
• User interface
• Linking datasets
• XML-based
Dataset Configuration
[Diagram: dataset configurations stored as XML documents.]
Table naming convention
Naïve configuration
• Tables
  – Meta tables: meta_content
  – Data tables: dataset__content__type
• Data tables
  – Main: __main
  – Dimension: __dm
• Columns
  – Key: _key
  – Boolean filter: _bool
  – List filter: _list
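Because the convention is purely name-based, a naïve configuration can in principle be derived by inspecting table and column names. The Python sketch below shows one way this could work; the function names are hypothetical, the type suffix is compared case-insensitively (the deck shows both `__main` and `__MAIN`), and treating every other column as a plain attribute is an assumption rather than documented behaviour.

```python
def parse_table_name(name):
    """Split 'dataset__content__type' and normalise the type to 'main' or 'dm'."""
    parts = name.split("__")
    if len(parts) != 3:
        raise ValueError(f"not a dataset__content__type name: {name!r}")
    dataset, content, table_type = parts
    table_type = table_type.lower()
    if table_type not in ("main", "dm"):
        raise ValueError(f"unknown table type {table_type!r} in {name!r}")
    return dataset, content, table_type


# Column suffixes from the naming convention, mapped to their roles.
COLUMN_ROLES = (
    ("_key", "key"),
    ("_bool", "boolean filter"),
    ("_list", "list filter"),
)


def classify_column(column):
    """Return the role implied by the column suffix, defaulting to 'attribute'."""
    for suffix, role in COLUMN_ROLES:
        if column.endswith(suffix):
            return role
    return "attribute"  # assumption: unsuffixed columns are plain attributes


if __name__ == "__main__":
    print(parse_table_name("hsapiens_gene_ensembl__gene__MAIN"))
    # Hypothetical column names, for illustration only.
    for column in ("gene_id_key", "disease_bool", "go_list", "description"):
        print(column, "->", classify_column(column))
```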
MartEditor
MartEditor
• Naïve configuration
• Updates
• Links
• Automatic discovery of new tables
Class diagram - configuration
Class diagram - querying
Information flow
• Read connections
• Register individual datasets and create linked datasets
• Get input from the user, split queries to individual datasets.
• Find the shortest path between datasets (Dijkstra)
• Compile SQL
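The shortest-path step can be sketched with a textbook Dijkstra search over a graph whose nodes are datasets and whose edges are links. The sketch below is not taken from the BioMart code base: it assumes links are undirected and all cost 1, and the dataset names in the example graph are illustrative only.

```python
import heapq


def shortest_path(links, start, goal):
    """Dijkstra's algorithm over an undirected graph of dataset links.

    `links` is an iterable of (dataset_a, dataset_b) pairs; every link costs 1.
    Returns the list of datasets from `start` to `goal`, or None if unreachable.
    """
    graph = {}
    for a, b in links:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)

    best = {start: 0}      # best known distance to each dataset
    previous = {}          # predecessor on the best known path
    heap = [(0, start)]
    while heap:
        distance, node = heapq.heappop(heap)
        if node == goal:
            break
        if distance > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbour in graph.get(node, []):
            candidate = distance + 1
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                previous[neighbour] = node
                heapq.heappush(heap, (candidate, neighbour))

    if goal not in best:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(previous[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    # Hypothetical link graph between datasets.
    example_links = [
        ("ensembl", "uniprot"),
        ("uniprot", "msd"),
        ("ensembl", "vega"),
        ("vega", "snp"),
        ("snp", "msd"),
    ]
    print(shortest_path(example_links, "ensembl", "msd"))
```

With unit costs a breadth-first search would give the same result; Dijkstra becomes useful if links are later given different costs.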
Summary
BioMart
• Domain independent
• Platform independent
  – MySQL 4
  – Oracle 9i
• Plugin architecture
BioMart model
• Already applied
  – Ensembl
  – Vega
  – dbSNP
  – UniProt
  – MSD
  – Variety of small projects
• In development
  – ArrayExpress
  – WormBase
  – RGD
Future work
• BioMart v0.2 to be released later in January
• Java library to be upgraded over coming months to the new architecture
• BioMart has been integrated with Taverna
• MartBuilder - to be properly implemented
BioMart
• www.ebi.ac.uk/biomart
• Open source (LGPL)
• Public MySQL server
• ftp
• [email protected]
• [email protected]
Acknowledgments
• BioMart
  – Damian Smedley
  – Darin London
• Contributors
  – Arne Stabenau (Ensembl)
  – Andreas Kahari (Ensembl)
  – Craig Melsopp (Ensembl)
  – Katerina Tzouvara (UniProt)
  – Paul Donlon (Unilever)
  – Will Spooner (CSHL)