ArrayExpress – a public database for microarray gene
expression data
Helen Parkinson
Microarray Informatics Team
European Bioinformatics Institute
MGED V, Tokyo 2002
Talk Structure
Introducing ArrayExpress
Submission methods
Ontologies and Annotation
The Future
Infrastructure at the EBI
[Diagram. Labels: ArrayExpress (Oracle) and MIAMExpress (MySQL) at the EBI; Submissions via www and via MAGE-ML; MAGE-ML pipelines from Array Manufacturers, LIMS, Microarray software and Local MIAMExpress Installations (MAGE-ML files); Queries via www; Data analysis via www with Expression Profiler; External databases; Other public Microarray Databases (GEO, CIBEX); MAGE-ML Export to Data Analysis software]
ArrayExpress
Public database for gene expression data, one of three, alongside CIBEX and GEO
Aims to store well annotated (MIAME compliant) and well structured data
MIAME
– Recorded info should be sufficient to interpret and replicate the experiment
– Information should be structured so that querying and automated data analysis and mining are feasible*
*Brazma et al., Nature Genetics, 2001
ArrayExpress Conceptual Model
[Diagram. Labels: Experiment; Hybridisation; Array; Sample; Source (e.g., Taxonomy); Gene (e.g., EMBL); Normalisation; Data; Publication; External links]
ArrayExpress Details
Database schema derived from MAGE-OM
Standard SQL, we use Oracle
Validating data loader for MAGE-ML
Web interface based on Java servlets
– Queries: experiment, array, sample
– Browsing: views on experiment
Object model-based query mechanism, automatic mapping to SQL
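The idea of an object model-based query that is mapped automatically to SQL can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `ExperimentQuery` class, the `experiment` table, its columns and the placeholder accessions are hypothetical and are not the ArrayExpress schema, and SQLite stands in for Oracle so that the example is self-contained.

```python
import sqlite3
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ExperimentQuery:
    """Hypothetical query object; the field names are illustrative only."""
    species: Optional[str] = None
    array_design: Optional[str] = None
    author: Optional[str] = None

    def to_sql(self) -> Tuple[str, List[str]]:
        """Translate the populated fields into a parameterised SELECT."""
        clauses, params = [], []
        for column in ("species", "array_design", "author"):
            value = getattr(self, column)
            if value is not None:
                clauses.append(column + " = ?")
                params.append(value)
        sql = "SELECT accession FROM experiment"
        if clauses:
            sql += " WHERE " + " AND ".join(clauses)
        return sql + " ORDER BY accession", params


def run_query(conn: sqlite3.Connection, query: ExperimentQuery) -> List[str]:
    """Execute the generated SQL and return the matching accessions."""
    sql, params = query.to_sql()
    return [row[0] for row in conn.execute(sql, params)]


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory table with placeholder rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE experiment "
        "(accession TEXT, species TEXT, array_design TEXT, author TEXT)"
    )
    conn.executemany(
        "INSERT INTO experiment VALUES (?, ?, ?, ?)",
        [
            ("EXP-1", "Mus musculus", "ARRAY-A", "Author A"),
            ("EXP-2", "Homo sapiens", "ARRAY-B", "Author B"),
            ("EXP-3", "Mus musculus", "ARRAY-B", "Author A"),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print(run_query(conn, ExperimentQuery(species="Mus musculus")))
```

The caller fills in only the object fields it cares about; the SQL text and its bound parameters are derived from the object, so the interface code never assembles SQL strings by hand.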
Data Submission to ArrayExpress
Via MAGE-ML pipeline -> ftp -> curation -> database
Parsed from LIMS or local database
Current direct MAGE-ML submitters: Sanger, TIGR, Paul Spellman, Affymetrix
Via external tools, e.g. BASE, J-Express
Via MIAMExpress
ArrayExpress accepts Array, Experiment and Protocol submissions
Provides accession numbers which can be used by journals
ArrayExpress Curation Effort
Provide user support, help and documentation
Curate MAGE-ML pipeline data and data from MIAMExpress
Support on use of ontologies and CVs
Task: to minimize free text, removal of synonyms
MIAME encouragement
Help on MAGE-ML
Goal: to provide high-quality, well-annotated data to allow automated data analysis
Access to Data in ArrayExpress
Via query interface: accession, author, species, experimental factors, etc.
Browsable lists
Data export as tab delimited file
Export to Expression Profiler
As MAGE-ML from query interface
Array descriptions exportable as tab delimited file
Submission and Annotation Tool - MIAMExpress
What Makes Up a Submission?
[Diagram. Labels: Experiment; Array, Sample, Hybridisation (repeated four times); Protocols; Normalisation; Final data]
MIAMExpress Submission and Annotation Tool
Based on MIAME concepts and questionnaire
Perl-CGI, MySQL database
Experiment, Array, Protocol submissions
Generic annotation tool
Exports MAGE-ML
Expected Users of MIAMExpress
Biologists wishing to submit to AE
Groups who install it and support locally (Perl-CGI and MySQL)
Biologists who want to use it as part of a LIMS system
Annotation – Arrays and Samples (BioMaterial)
The Annotation Challenge
Use of controlled terms (avoiding free text)
Need for integration of controlled terms into annotation tools
Avoidance of synonyms
Provision of definitions and sources for user defined terms
Provision of accurate up to date annotation for samples and arrays
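How an annotation tool might replace free text and synonyms with controlled terms can be sketched as follows. This is a minimal illustration under stated assumptions: the vocabulary below is a made-up two-entry dictionary, not an MGED or ArrayExpress resource, and the function names are hypothetical.

```python
# Hypothetical controlled vocabulary: preferred term -> known synonyms.
CONTROLLED_TERMS = {
    "Mus musculus": {"mouse", "house mouse"},
    "liver": {"hepatic tissue"},
}


def build_lookup(vocabulary):
    """Index every preferred term and synonym by its lower-cased form."""
    lookup = {}
    for preferred, synonyms in vocabulary.items():
        lookup[preferred.lower()] = preferred
        for synonym in synonyms:
            lookup[synonym.lower()] = preferred
    return lookup


def normalise(term, lookup):
    """Return (preferred term, True), or (term, False) if curation is needed."""
    key = " ".join(term.split()).lower()  # collapse whitespace, ignore case
    if key in lookup:
        return lookup[key], True
    return term, False


if __name__ == "__main__":
    lookup = build_lookup(CONTROLLED_TERMS)
    for raw in ["  Mouse ", "Hepatic  tissue", "kidney"]:
        print(repr(raw), "->", normalise(raw, lookup))
```

Terms that come back flagged `False` are the ones a curator, or a user supplying a definition and source, would still have to resolve.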
Gene (BioSequence) Annotation
Can be given by links to existing databases
GO can be used to describe sequences and for analysis
MIAME is flexible, allows many kinds of sequence identifiers or sequence itself
Use of existing resources is desirable to achieve interoperability
Sanger Human and Mouse Array Annotation Pipeline
Takes sequences which represent Reporters on the array
Exonerate (alignment) against the NCBI assembly (from Ensembl)
Inherits annotation from Ensembl: gene names, database references, GO terms
Provides a common annotation in tab delimited format, can be parsed to MAGE-ML, or used in ADF
Pipeline available for external users to beta test
Array Definition Format
Tab delimited file format that can be parsed to MAGE-ML
Defines relationships between features and sequences
Provides sequence annotation
Example in the Agenda, open letter to journals
Example BioSequence Annotation
341310_A FRZB; 2q32.1 ENSG00000162998 FRIZZLED-RELATED PROTEIN PRECURSOR (FRZB-1) (FREZZLED) (FRITZ). [Source:SWISSPROT;Acc:Q92765] AAC50736;AAB51298;AAC51217;transmembrane receptor (F);developmental processes (P);skeletal development (P);extracellular (C);membrane (C);U24163;U91903;U68057;NM_001463;
753587_A
282310_A PPFIA1; 11q13.3 ENSG00000131626 PROTEIN TYROSINE PHOSPHATASE, RECEPTOR TYPE, F POLYPEPTIDE (PTPRF), INTERACTING PROTEIN (LIPRIN), ALPHA 1. [Source:RefSeq;Acc:NM_003626] Q13136;AAC50173; U22816; NM_003626;
770192_A LGALS9; 17q11.2 ENSG00000168961 GALECTIN-9 (HOM-HD-21) (ECALECTIN). [Source:SWISSPROT;Acc:O00182] Q8WYQ7;BAB83623;CAA88922;BAA22166;BAA31542;BAB83625;BAB83624;CAB93851;lectin (F);galactose binding lectin (F); AB040130;AB040129;Z49107;AB006782;AB005894;AJ288083;AJ288084;AJ288085;AJ288086;AJ288087;AJ288088;AJ288089;AJ288090;NM_002308;NM_009587;
30502_C DLL1; 6q27 ENSG00000112577 DELTA-LIKE PROTEIN 1 PRECURSOR (DROSOPHILA DELTA HOMOLOG 1) (DELTA1) (H-DELTA-1). [Source:SWISSPROT;Acc:O00548] Q9UJV2;Q9NU41;AAF05834;AAB61286;AAG09716;CAB89569;calcium binding (F);structural molecule (F);Notch receptor ligand (F);histogenesis and organogenesis (P);cell differentiation (P);cell communication (P);integral membrane protein (C);membrane (C);AF196571;AF003522;AF222310;AL078605;NM_005618;
stSG89231 PSCD4; 22q13.1 ENSG00000100055 CYTOHESIN 4. [Source:SWISSPROT;Acc:Q9UIA0] Q9H7Q0;BAB15718;AAF15389;AAF28896;CAB63067;guanyl-nucleotide release factor (F);ARF guanyl-nucleotide exchange factor (F);AK024428;AF075458;AF125349;Z94160;NM_013385;
stSG89236 EP300; 22q13.2 ENSG00000100393 E1A-ASSOCIATED PROTEIN P300. [Source:SWISSPROT;Acc:Q09472] AAA18639; transcription factor (F);transcription co-activator (F);protein C-terminus binding (F);transcription regulation (P);signal transduction (P);neurogenesis (P);cell cycle (P);transcription, from Pol II promoter (P);nucleus (C);U01877; NM_001429;
stSG89351 PPARA; 22q13.31 ENSG00000100406 PEROXISOME PROLIFERATOR ACTIVATED RECEPTOR ALPHA (PPAR-ALPHA). [Source:SWISSPROT;Acc:Q07869] Q9BWQ7;AAA36468;CAA68898;AAB32649;CAB42862;CAB44427;AAH00052;transcription factor (F);steroid hormone receptor (F);ligand-dependent nuclear receptor (F);peroxisome receptor (F);transcription regulation (P);transcription, from Pol II promoter (P);energy pathways (P);fatty acid metabolism (P);nucleus (C);L02932;Y07619;S74349;AL049856;AL078611;BC000052;NM_005036;
stSG89356 22q12.1 ENSG00000159873 DJ366L4.2 (NOVEL PROTEIN) (FRAGMENT). [Source:SPTREMBL;Acc:Q9UGY6] Q9UGY6;Q9ULT6;CAB63034;BAB33319; AL023494;AB051436;
stSG89488 PCQAP; 22q11.21 ENSG00000099917 POSITIVE COFACTOR 2 GLUTAMINE/Q-RICH-ASSOCIATED PROTEIN (PC2 GLUTAMINE/Q-RICH-ASSOCIATED PROTEIN) (TPA-INDUCIBLE GENE-1) (TIG-1) (CTG7A). [Source:SWISSPROT;Acc:Q96RN5] AAC12944;AAK58423;BAB85034;AAH13985;AAH07529;AAB91443;transcription regulation (P);nucleus (C);AF056191;AF328769;AK074268;BC013985;BC007529;U80745;NM_015889;
stSG89493 22q12.2 EC 1.10.2.2;ENSG00000170192 UBIQUINOL-CYTOCHROME C REDUCTASE COMPLEX 7.2 KDA PROTEIN (EC 1.10.2.2) (CYTOCHROME C1, NONHEME 7 KDA PROTEIN) (COMPLEX III SUBUNIT X) (7.2 KDA CYTOCHROME C1-ASSOCIATED PROTEIN SUBUNIT) (HSPC119). [Source:SWISSPROT;Acc:Q9UDW1] Q9P012;Q9NZY4;AAF29115;BAB20672;AAF29083;AAH05402;AAF29023;oxidoreductase (F);ubiquinol-cytochrome c reductase (F);electron transport (P);complex III (ubiquinone to cytochrome c) (P);mitochondrion (C);inner membrane (C);mitochondrial electron transport chain complex (sensu Eukarya) (C);ubiquinol-cytochrome c reduAF161500;AC004882;AB028598;AF161468;BC005402;AF161536;NM_013387;
stSG89214 22q13.31 ENSG00000075234 Q9NWP8;BAA91331; AK000706;NM_017931;
<BioSequence length="500">
  <SequenceDatabases_assnlist>
    <DatabaseEntry accession="ENSG00000162998">
      <Database_assnref>
        <Database_ref identifier="DB:Ensembl"/>
      </Database_assnref>
    </DatabaseEntry>
  </SequenceDatabases_assnlist>
  <Type_assn>
    <OntologyEntry category="biosequence:type" value="Ensembl gene"/>
  </Type_assn>
  <NameValueType category="Ensembl:Description" value="FRIZZLED-RELATED PROTEIN PRECURSOR (FRZB-1) (FREZZLED) (FRITZ)"/>
  <OntologyEntry category="GO:molecular_function" value="transcription factor"/>
</BioSequence>
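Turning one tab-delimited annotation row into a BioSequence fragment of this kind could be sketched as below. This is a hedged illustration, not the Sanger pipeline or the EBI parser: the function names are hypothetical, the element and attribute names are copied from the fragment on the slide rather than checked against the MAGE-ML DTD, and the `(F)`, `(P)`, `(C)` suffixes are assumed to mark the GO molecular function, biological process and cellular component aspects.

```python
import re
import xml.etree.ElementTree as ET

# Assumed meaning of the one-letter suffixes in the annotation column.
GO_ASPECTS = {
    "F": "GO:molecular_function",
    "P": "GO:biological_process",
    "C": "GO:cellular_component",
}
GO_TERM = re.compile(r"^(?P<name>.+?)\s*\((?P<aspect>[FPC])\)$")


def split_annotation(annotation):
    """Split 'term (F);term (P);ACC;' into GO terms and other identifiers."""
    go_terms, others = [], []
    for item in (part.strip() for part in annotation.split(";")):
        if not item:
            continue
        match = GO_TERM.match(item)
        if match:
            go_terms.append((GO_ASPECTS[match["aspect"]], match["name"]))
        else:
            others.append(item)
    return go_terms, others


def biosequence_element(ensembl_id, description, go_terms):
    """Build a BioSequence element shaped like the slide's fragment."""
    seq = ET.Element("BioSequence")
    dbs = ET.SubElement(seq, "SequenceDatabases_assnlist")
    entry = ET.SubElement(dbs, "DatabaseEntry", accession=ensembl_id)
    ref = ET.SubElement(entry, "Database_assnref")
    ET.SubElement(ref, "Database_ref", identifier="DB:Ensembl")
    type_assn = ET.SubElement(seq, "Type_assn")
    ET.SubElement(type_assn, "OntologyEntry",
                  category="biosequence:type", value="Ensembl gene")
    ET.SubElement(seq, "NameValueType",
                  category="Ensembl:Description", value=description)
    for category, value in go_terms:
        ET.SubElement(seq, "OntologyEntry", category=category, value=value)
    return seq


if __name__ == "__main__":
    go_terms, accessions = split_annotation(
        "transmembrane receptor (F);skeletal development (P);membrane (C);"
        "U24163;NM_001463;"
    )
    element = biosequence_element(
        "ENSG00000162998", "FRIZZLED-RELATED PROTEIN PRECURSOR", go_terms
    )
    print(ET.tostring(element, encoding="unicode"))
    print(accessions)
```

Building the XML through an element tree rather than string concatenation means quoting and tag closing are handled by the library, which avoids the kind of malformed fragment that hand-written MAGE-ML tends to produce.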
Sample Annotation
Gene expression data only have meaning in the context of detailed sample descriptions
If the data is going to be interpreted by independent parties, sample information has to be searchable and in the database
Controlled vocabularies and ontologies (species, cell types, compound nomenclature, treatments, etc) are needed for unambiguous sample description
None of this is trivial
What Does an Ontology Do?
Captures knowledge
Creates a shared understanding – between humans and for computers
Makes knowledge machine processable
Makes meaning explicit – by definition and context
It is more than a controlled vocabulary, it has structure
MGED Biomaterial (sample) Ontology
Under construction by MGED OWG
– Using OilEd
Motivated by MIAME and coordinated with the database model (mapping available)
We are extending classes, providing constraints, defining terms, and adding terms
Extension to include other MAGE-OM required ontology entries and CVs
BioMaterial Ontology Internal Terms
Internal and External Terms Combined
Examples of External Ontologies and CVs
NCBI taxonomy database
Jackson Lab mouse strains and genes
Edinburgh mouse atlas anatomy
HUGO nomenclature for Human genes
Chemical and compound Ontologies, e.g. CAS
TAIR
FlyBase anatomy
GO (www.geneontology.org)
GOBO ontologies
Example BioMaterial Annotation
Sample source and treatment description, and its correct annotation using the MGED BioMaterial Ontology classes and corresponding external references:
“Seven week old C57BL/6N mice were treated with fenofibrate. Liver was dissected out, RNA prepared”
MGED BioMaterial Ontology classes, each with its instance and external reference where given:
BioMaterialDescription
BiosourceProperty
Organism: Mus musculus musculus, id: 39442 (external reference: NCBI Taxonomy)
Age: 7 weeks after birth
DevelopmentStage: Stage 28 (external reference: Mouse Anatomical Dictionary)
Sex: Female
StrainOrLine: C57BL/6 (external reference: International Committee on Standardized Genetic Nomenclature for Mice)
BiosourceProvider: Charles River, Japan
OrganismPart: Liver (external reference: Mouse Anatomical Dictionary)
BioMaterialManipulation
EnvironmentalHistory
CultureCondition
Temperature: 22 ± 2 °C
Humidity: 55 ± 5%
Light: 12 hours light/dark cycle
PathogenTests: Specified pathogen free conditions
Water: ad libitum
Nutrients: MF, Oriental Yeast, Tokyo, Japan
Treatment
CompoundBasedTreatment
(Compound): Fenofibrate, CAS 49562-28-9 (external reference: ChemIDplus)
(Treatment_application): in vivo, oral gavage
(Measurement): 100 mg/kg body weight
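To show why structured annotation of this sample is searchable where the free-text sentence is not, here is a minimal sketch holding a few of the slide's class, instance and reference triples as records. The `Annotation` type and the helper functions are hypothetical and are not part of MAGE-OM or the MGED ontology; only the values are taken from the example above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Annotation:
    """One ontology class with its value and optional external reference."""
    ontology_class: str
    value: str
    reference: Optional[str] = None


# A subset of the example sample description from the slide.
SAMPLE = [
    Annotation("Organism", "Mus musculus musculus", "NCBI Taxonomy"),
    Annotation("Age", "7 weeks after birth"),
    Annotation("Sex", "Female"),
    Annotation("StrainOrLine", "C57BL/6",
               "International Committee on Standardized Genetic Nomenclature for Mice"),
    Annotation("OrganismPart", "Liver", "Mouse Anatomical Dictionary"),
    Annotation("Compound", "Fenofibrate", "ChemIDplus"),
]


def find(annotations: List[Annotation], ontology_class: str) -> List[Annotation]:
    """Return every annotation recorded for the given ontology class."""
    return [a for a in annotations if a.ontology_class == ontology_class]


def unreferenced(annotations: List[Annotation]) -> List[Annotation]:
    """Return annotations that carry no external reference."""
    return [a for a in annotations if a.reference is None]


if __name__ == "__main__":
    print(find(SAMPLE, "OrganismPart"))
    print([a.ontology_class for a in unreferenced(SAMPLE)])
```

A query such as "all samples whose OrganismPart is Liver" becomes a field comparison instead of a text search over sentences like the one quoted above.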
Future
Data acquisition for ArrayExpress
Update ArrayExpress to newest MAGE-OM
V2.0 MIAMExpress, domain specific, portable
Further ontology development and integration into tools, use of OWL
Curation tools (other than grep, Perl scripts)
Improved query interface for AE
ArrayExpress update tool
Resources
Schemas for both ArrayExpress and MIAMExpress, access to code
MAGE-ML examples, Arrays, Expts, Protocols
MIAME glossary, MAGE-MIAME-ontology mappings
List of ontology resources from MGED pages
Help in establishing pipelines
MGED software MAGEstk
Curation, help and advice
www.mged.org
www.ebi.ac.uk/arrayexpress
Acknowledgments
Microarray Informatics Team, EBI, esp. Alvis Brazma, Ugis Sarkans, Mohammad Shojatalab, Ele Holloway, Gaurab Mukherjee, Philippe Rocca-Serra, Susanna Sansone
Chris Stoeckert, U. Penn.
Members of MGED
Sanger Institute - Rob Andrews, Jurg Bahler, Adam Butler, Kate Rice
EMBL Heidelberg - Wilhelm Ansorge, Martina Muckenthaler, Thomas Preiss