biomart databases made easy richard holland european bioinformatics institute helsinki, september...
![Page 1: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/1.jpg)
BioMart
Databases made easy
Richard Holland
European Bioinformatics Institute
Helsinki, September 2006
![Page 2: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/2.jpg)
BioMart
• A joint project
  – European Bioinformatics Institute (EBI)
  – Cold Spring Harbor Laboratory (CSHL)
• Aim
  – To develop a generic, query-oriented data management system capable of integrating distributed data sources.
![Page 3: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/3.jpg)
Focus
• ‘Data mining’ or advanced search
  – Creating custom datasets
  – Querying multiple datasets
  – Interactive
• Users
  – People who provide database-based services
  – ‘Power user’ biologists and bioinformaticians
![Page 4: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/4.jpg)
Requirements
• User
  – ‘One-stop shop’ for biological data
  – Suitable for power biologists and bioinformaticians
  – A set of interfaces that allow users to group and refine biological data based upon many criteria
• Deployer
  – ‘Out of the box’ installation
  – Built-in query optimization
  – Easy data federation
• Architecture
  – Domain agnostic
  – Distributed
  – Platform independent
![Page 5: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/5.jpg)
Advanced search GUIs
![Page 6: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/6.jpg)
Single interface
![Page 7: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/7.jpg)
Single access point
![Page 8: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/8.jpg)
Queries across different databases
Dataset 1
Dataset 2
Links
![Page 9: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/9.jpg)
Main features
• Domain agnostic
• Platform independent (MySQL, ORACLE, Postgres)
• Scalable for big datasets
• Federated architecture
• Automated UI configuration
![Page 10: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/10.jpg)
How does it work?
![Page 11: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/11.jpg)
BioMart
Data mart XML XML XML Meta data
BioMart software
Source data
![Page 12: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/12.jpg)
Query Engine
Federated architecture
![Page 13: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/13.jpg)
Data model
[Diagram: tables related through primary keys (PK) and foreign keys (FK)]
![Page 14: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/14.jpg)
Data model
[Diagram: the same tables related through primary keys (PK) and foreign keys (FK), with additional FK columns shown]
![Page 15: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/15.jpg)
Data model - ‘reversed star’
[Diagram: main tables 1 and 2, keyed by PK1 and PK2, with dimension (dm) tables that reference them through FK1 and FK2]
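As a rough illustration of the ‘reversed star’ idea, the sketch below builds one main table and one dimension table and joins them on a shared key. It uses SQLite only as a convenient stand-in for the database servers the deck names; the table names, column names and rows are all hypothetical and are not taken from any real mart.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for a mart server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Main table: one row per central entity, identified by its key column.
CREATE TABLE gene__main (
    gene_id_key INTEGER PRIMARY KEY,
    gene_name   TEXT,
    chromosome  TEXT
);
-- Dimension table: many rows per main-table row, linked by the same key.
CREATE TABLE xref__dm (
    gene_id_key INTEGER REFERENCES gene__main(gene_id_key),
    xref_name   TEXT
);
-- Hypothetical example rows.
INSERT INTO gene__main VALUES (1, 'geneA', '1'), (2, 'geneB', '2');
INSERT INTO xref__dm VALUES (1, 'xrefA1'), (1, 'xrefA2'), (2, 'xrefB1');
""")


def xrefs_on_chromosome(chrom):
    """Filter on the main table, then pull values from the dimension table."""
    return conn.execute(
        """SELECT m.gene_name, d.xref_name
           FROM gene__main AS m
           JOIN xref__dm AS d ON d.gene_id_key = m.gene_id_key
           WHERE m.chromosome = ?
           ORDER BY d.xref_name""",
        (chrom,),
    ).fetchall()
```

The point of the layout is that a filter touches the main table once and each dimension table is reached with a single join on the key.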
![Page 16: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/16.jpg)
Data mart and dataset
Dataset
![Page 17: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/17.jpg)
Data mart, dataset and virtual schema
virtual schema
![Page 18: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/18.jpg)
BioMart abstractions
• Dataset
  – A subset of data organized into 1 or more tables
• Attribute
  – A single data point, e.g. gene name
• Filter
  – An operation on an attribute, e.g. ‘Chromosome = 1’
![Page 19: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/19.jpg)
Datasets, Attributes and Filters
GENE
gene_id (PK)
gene_stable_id
gene_start
gene_chrom_end
chromosome
gene_display_id
description
Mart
Dataset
Attribute
Filter
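The three abstractions can be sketched in a few lines: a dataset as a collection of rows, attributes as the columns a user asks for, and filters as predicates on columns. This is only an illustrative model, not the BioMart API; the column names come from the GENE table on this slide, while the row values and the `query` helper are hypothetical.

```python
# A dataset modelled as a list of rows; values are made up for illustration.
GENE = [
    {"gene_id": 1, "gene_stable_id": "G0001", "chromosome": "1",
     "description": "example gene one"},
    {"gene_id": 2, "gene_stable_id": "G0002", "chromosome": "2",
     "description": "example gene two"},
    {"gene_id": 3, "gene_stable_id": "G0003", "chromosome": "1",
     "description": "example gene three"},
]


def query(dataset, attributes, filters):
    """Return the requested attributes for every row that passes all filters.

    attributes: list of column names to return.
    filters: dict mapping a column name to a predicate on that column's value.
    """
    results = []
    for row in dataset:
        if all(predicate(row[column]) for column, predicate in filters.items()):
            results.append(tuple(row[name] for name in attributes))
    return results


# 'Chromosome = 1' as a filter, gene_stable_id as the attribute.
on_chr1 = query(GENE, ["gene_stable_id"], {"chromosome": lambda v: v == "1"})
```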
![Page 20: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/20.jpg)
BioMart abstractions (cont)
• Link
  – ‘Common currency’ between two datasets, e.g. accession
• Exportable
  – Potential links to export
• Importable
  – Potential links to import
![Page 21: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/21.jpg)
Exportables, Importables and Links
Dataset 1
Dataset 2
Links
![Page 22: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/22.jpg)
Exportables, Importables and Links
Dataset 1 (Exportable)
  name = uniprot_id
  attributes = uniprot_ac
Dataset 2 (Importable)
  name = uniprot_id
  filters = uniprot_ac
Links
![Page 23: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/23.jpg)
Exportables, Importables and Links
Dataset 1 (Exportable)
  name = genomic_region
  attributes = chr_name, chr_start, chr_end
Dataset 2 (Importable)
  name = genomic_region
  filters = chr_name (=), chr_start (>=), chr_end (<=)
Links
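One way to read the two link examples is that the exportable lists the attribute values Dataset 1 hands over, and the importable lists the filters (with their comparison operators) that Dataset 2 applies to those values. The sketch below models that reading in plain Python. The `link` function, the row layout and all data values are hypothetical; only the names uniprot_ac, chr_name, chr_start and chr_end and the operators come from the slides.

```python
import operator

# Comparison operators as they appear next to the importable filters.
OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}


def link(source_rows, exportable, target_rows, importable):
    """Return target rows that match at least one exported value tuple.

    exportable: attribute names exported by the source dataset.
    importable: (filter column, operator) pairs, in the same order.
    """
    exported = [tuple(row[a] for a in exportable) for row in source_rows]
    matched = []
    for row in target_rows:
        for values in exported:
            if all(OPS[op](row[col], val)
                   for (col, op), val in zip(importable, values)):
                matched.append(row)
                break  # one matching exported tuple is enough
    return matched


# Accession-style link: equality on uniprot_ac (values are made up).
proteins = [{"uniprot_ac": "AC1"}, {"uniprot_ac": "AC2"}]
structures = [{"uniprot_ac": "AC2", "id": "s1"}, {"uniprot_ac": "AC3", "id": "s2"}]
by_accession = link(proteins, ["uniprot_ac"], structures, [("uniprot_ac", "=")])

# Region-style link: target features lying inside an exported region.
regions = [{"chr_name": "21", "chr_start": 100, "chr_end": 200}]
features = [
    {"chr_name": "21", "chr_start": 120, "chr_end": 180, "id": "a"},
    {"chr_name": "21", "chr_start": 90, "chr_end": 150, "id": "b"},
    {"chr_name": "1", "chr_start": 120, "chr_end": 180, "id": "c"},
]
by_region = link(
    regions, ["chr_name", "chr_start", "chr_end"],
    features, [("chr_name", "="), ("chr_start", ">="), ("chr_end", "<=")],
)
```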
![Page 24: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/24.jpg)
Creating BioMart databases
![Page 25: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/25.jpg)
Building BioMart databases
[Diagram: source databases are transformed into a mart with MartBuilder; the mart's configuration is stored as XML and edited with MartEditor]
![Page 26: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/26.jpg)
Schema transformation principles
• Central table
  – Longest n:1, 1:1 path
• Dimension table
  – Central transformation ‘around’ 1:n table.
  – Link tables are decomposed into a set of 1:n first
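The ‘longest n:1, 1:1 path’ principle can be illustrated with a small graph walk. This is one possible reading of the slide, not MartBuilder's actual algorithm: starting from a chosen table, follow only n:1 and 1:1 relations as far as possible to find the tables that fold into the central table, then treat the 1:n neighbours of those tables as dimension candidates. The schema, table names and function names below are hypothetical.

```python
def longest_path(relations, start):
    """Longest chain of tables reachable from start via n:1 or 1:1 relations.

    relations: dict mapping a table to a list of (other_table, cardinality).
    """
    best = [start]

    def walk(table, path):
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for other, cardinality in relations.get(table, []):
            # Only n:1 and 1:1 relations can be merged without duplicating rows.
            if cardinality in ("n:1", "1:1") and other not in path:
                walk(other, path + [other])

    walk(start, [start])
    return best


def split_schema(relations, start):
    """Return (tables merged into the central table, dimension candidates)."""
    main = longest_path(relations, start)
    dimensions = [
        other
        for table in main
        for other, cardinality in relations.get(table, [])
        if cardinality == "1:n" and other not in main
    ]
    return main, dimensions


# Hypothetical source schema.
schema = {
    "transcript": [("gene", "n:1"), ("exon_link", "1:n")],
    "gene": [("seq_region", "n:1"), ("xref", "1:n")],
    "seq_region": [],
}
main_tables, dimension_tables = split_schema(schema, "transcript")
```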
![Page 27: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/27.jpg)
MartBuilder Application
• Reads database metadata
• Transforms a source schema into suggested datasets and lets you edit the process
• Produces a set of SQL statements (DDL) to run against the server to perform the transformation
![Page 28: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/28.jpg)
![Page 29: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/29.jpg)
Dataset Configuration
• Dataset configuration
• Attributes
• Filters
• Trees, Groups, Collections
• Exportables, Importables
• Semantics
• Relational mapping
• User interface
• Linking datasets
• XML-based
![Page 30: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/30.jpg)
Table naming convention
Naïve configuration
• Tables
  – Meta tables: meta_content
  – Data tables: dataset__content__type
• Data tables
  – Main: __main
  – Dimension: __dm
• Columns
  – Key: _key
![Page 31: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/31.jpg)
Naming convention examples
• Homo sapiens gene ensembl
  – hsapiens_gene_ensembl__gene__main
  – hsapiens_gene_ensembl__xref_hugo__dm
• Encode
  – hsapiens_encode__encode__main
• Uniprot
  – uniprot__protein__main
  – uniprot__interpro__dm
• Uniprot sequence
  – uniprot_sequence__sequence__main
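The `dataset__content__type` pattern is regular enough to split mechanically. The helper below is a small sketch, not part of BioMart; `MartTable` and `parse_table_name` are hypothetical names, and the check that the type is `main` or `dm` assumes those are the only two data-table types, as the convention slide suggests.

```python
from typing import NamedTuple


class MartTable(NamedTuple):
    dataset: str   # e.g. hsapiens_gene_ensembl
    content: str   # e.g. gene or xref_hugo
    kind: str      # main or dm


def parse_table_name(name):
    """Split a data-table name on the double-underscore separators."""
    parts = name.split("__")
    if len(parts) != 3 or parts[2] not in ("main", "dm"):
        raise ValueError(f"not a dataset__content__type table name: {name!r}")
    return MartTable(*parts)


table = parse_table_name("hsapiens_gene_ensembl__xref_hugo__dm")
```

Single underscores stay inside each part, which is why the separator has to be doubled.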
![Page 32: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/32.jpg)
Dataset Configuration
XML
XML
XML
![Page 33: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/33.jpg)
MartEditor
![Page 34: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/34.jpg)
Accessing BioMart databases
![Page 35: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/35.jpg)
BioMart architecture
[Diagram, layers shown: Databases (myDatabase and public data, local or remote: Ensembl, UniProt, SNP, Vega, MSD); Schema transformation (MartBuilder); myMart; Configuration (MartEditor, XML); Retrieval through the BioMart API (Java, Perl) with MartExplorer, MartShell and MartView]
![Page 36: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/36.jpg)
MartView (current)
![Page 37: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/37.jpg)
MartView (new 0_5)
![Page 38: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/38.jpg)
MartExplorer
![Page 39: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/39.jpg)
MartShell
Using = dataset
Get = attribute
Where = filter
![Page 40: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/40.jpg)
MartShell (MQL)
● Uses Mart Query Language (MQL) to generate queries:
using <dataset> get <attributes> where <filters>
● Can join datasets together:
using Dataset1 get Attribute1 where Filter1=var1 as q;
using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q
● Can script and pipe:
martshell.sh -E MQLscript.mql > results.txt
martshell.sh -E MQLscript.mql | wc
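Because MQL follows a fixed `using ... get ... where ...` template, a script file for `martshell.sh -E` could be generated from structured input. The sketch below is a plain string builder and not part of MartShell. `build_mql` is a hypothetical helper, and joining multiple attributes with commas is an assumption, since the slides only show a single-attribute example.

```python
def build_mql(dataset, attributes, filters=(), alias=None):
    """Assemble an MQL statement from its three parts.

    filters: already-formatted filter expressions, joined with 'and'.
    alias: optional name for the stored query ('as q' in the slides).
    """
    query = f"using {dataset} get {', '.join(attributes)}"
    if filters:
        query += " where " + " and ".join(filters)
    if alias:
        query += f" as {alias}"
    return query + ";"


# Rebuilds the stored query from the MSD example in this deck.
stored = build_mql(
    "MSD.msd",
    ["pdb_id"],
    ["resolution_less < 1.5", "has_ec_info only"],
    alias="q",
)
```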
![Page 41: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/41.jpg)
MartShell examples
MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only;
193l
194l
1arb
...
MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only as q;
MartShell> using Ensembl.hsapiens_gene_ensembl get sequence transcript_flanks+1000 where pdb in q;
ENST00000270142.2 ENSG00000142168.2 strand=forward chr=21 assembly=NCBI34 downstream flanking sequence of transcript only
AAACTAAATTAGCTCTGATACTTATTTATATAAACAGCTTCAGTGGAA ....
![Page 42: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/42.jpg)
biomaRt
![Page 43: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/43.jpg)
Taverna
![Page 44: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/44.jpg)
DAS ProServer
![Page 45: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/45.jpg)
BioMart deployers
• Large scale data federation (EBI)
• Optimising access to a large database (Ensembl, WormBase)
• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc …)
![Page 46: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/46.jpg)
Hinxton example
[Diagram: EBI marts (Uniprot, MSD) and Sanger marts (Ensembl, SNP, Vega, Sequence) federated and served over the WWW]
![Page 47: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/47.jpg)
BioMart deployers
• Large scale data federation (Hinxton)
• Optimising access to a large database (Ensembl, WormBase, ArrayExpress)
• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc …)
![Page 48: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/48.jpg)
WormBase
Genes
Expression
Phenotypes
Variations
Literature
Ontologies
Sequence
![Page 49: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/49.jpg)
Ensembl
Genes
Ontologies
Variations
Protein annotation
Disease
Homologies
Sequence
Array annotations
![Page 50: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/50.jpg)
HapMap
Population
Frequencies
Inter population comparisons
Gene annotation
![Page 51: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/51.jpg)
ArrayExpress
![Page 52: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/52.jpg)
BioMart deployers
• Large scale data federation (Hinxton)
• Optimising access to a large database (Ensembl, WormBase)
• Federating third party data with public data (Pasteur, INRA, Bayer, Unilever, Serono, Sanofi-Aventis, DevGen, Solexa etc …)
![Page 53: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/53.jpg)
In development
• CAPRISA
• RGD
• DICTYBASE
• PURDUE UNIVERSITY
• RZPD
![Page 54: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/54.jpg)
Music Mart
![Page 55: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/55.jpg)
BioMart model
• Already applied
  – Ensembl
  – Vega
  – SNP
  – Uniprot
  – MSD
  – ArrayExpress
  – WormBase
  – Gramene
  – HapMap
  – Variety of ‘in house’ projects (academia and industrial)
![Page 56: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/56.jpg)
User restriction
XML
Dataset
XML
martUser
“default”
“advanced”
![Page 57: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/57.jpg)
Interface configuration
XML
Dataset
XML
Interface
“single-page web interface”
“wizard-style web interface”
![Page 58: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/58.jpg)
Web services
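[Diagram: MartView reaches a local mart directly on port 3306; direct access to a remote mart on port 3306 is blocked (X), so requests go as XML over port 80 to MartService, which queries the remote mart on port 3306]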
![Page 59: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/59.jpg)
Web services (cont)
MartService requests
• Registry XML
• Dataset information: name, type etc
• DatasetConfig XML
• Mart Query:
  – API query object is converted to an XML representation on the client and sent to the server.
  – Query object is regenerated on the server and processed. Results are sent back to the client as a simple tab-delimited HTML page.
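The round trip described here (query object to XML on the client, XML back to a query object on the server, tab-delimited rows back to the client) can be sketched with the Python standard library. The element and attribute names used below (`Query`, `Dataset`, `Attribute`, `Filter`, `name`, `value`) are assumptions chosen for illustration; the slides do not give the schema, and this is not the MartService implementation.

```python
import csv
import io
import xml.etree.ElementTree as ET


def query_to_xml(dataset, attributes, filters):
    """Client side: serialise a query to XML (element names are assumed)."""
    root = ET.Element("Query")
    ds = ET.SubElement(root, "Dataset", name=dataset)
    for name in attributes:
        ET.SubElement(ds, "Attribute", name=name)
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", name=name, value=str(value))
    return ET.tostring(root, encoding="unicode")


def xml_to_query(text):
    """Server side: rebuild (dataset, attributes, filters) from the XML."""
    ds = ET.fromstring(text).find("Dataset")
    attributes = [a.get("name") for a in ds.findall("Attribute")]
    filters = {f.get("name"): f.get("value") for f in ds.findall("Filter")}
    return ds.get("name"), attributes, filters


def parse_results(body):
    """Client side: split a tab-delimited response body into rows."""
    return list(csv.reader(io.StringIO(body), delimiter="\t"))


payload = query_to_xml("hsapiens_gene_ensembl", ["gene_stable_id"], {"chromosome": "1"})
regenerated = xml_to_query(payload)
```

Sending XML over port 80 rather than opening a database connection is what lets a client reach a remote mart whose database port is not exposed.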
![Page 60: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/60.jpg)
Summary
• A generic data management system
  – A set of easily configurable user interfaces
  – Distributed data federation
  – Query optimization
![Page 61: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/61.jpg)
BioMart
• www.biomart.org
• Open source (LGPL)
• Public MySQL server
• ftp
• [email protected]
• [email protected]
![Page 62: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006](https://reader035.vdocument.in/reader035/viewer/2022062409/56649ea45503460f94ba915f/html5/thumbnails/62.jpg)
Acknowledgments
• BioMart
  – Arek Kasprzyk (EBI)
  – Damian Smedley (EBI)
  – Syed Haider (EBI)
  – Gudmundur Thorisson (CSHL)
• Contributors
  – Darin London (EBI)
  – Will Spooner (CSHL)
  – Damian Keefe (Ensembl)
  – Arne Stabenau (Ensembl)
  – Andreas Kahari (Ensembl)
  – Craig Melsopp (Ensembl)
  – Katerina Tzouvara (Uniprot)
  – Paul Donlon (Unilever)
  – Steffen Durinck (SCD-ESAT, Katholieke Universiteit Leuven)
  – Benoit Ballester (Universite de la Mediterranee)
  – Stephen Robinson (EBI)
  – Asif Kibria (EBI)