
ArrayExpress – a public database for microarray gene expression data

Helen Parkinson

Microarray Informatics Team

European Bioinformatics Institute

MGED V, Tokyo 2002

Talk Structure

Introducing ArrayExpress
Submission methods
Ontologies and Annotation
The Future

Infrastructure at the EBI

[Diagram. At the EBI: ArrayExpress (Oracle), MIAMExpress (MySQL), Expression Profiler. Submissions reach ArrayExpress as MAGE-ML, either through MIAMExpress or through MAGE-ML pipelines from array manufacturers, LIMS, microarray software, data analysis software (MAGE-ML export) and local MIAMExpress installations. Queries and data analysis are served over the web (www). Links go to external databases and to other public microarray databases (GEO, CIBEX).]

ArrayExpress

Public database for gene expression data, one of three, alongside CIBEX and GEO
Aims to store well-annotated (MIAME-compliant) and well-structured data
MIAME
– Recorded info should be sufficient to interpret and replicate the experiment
– Information should be structured so that querying and automated data analysis and mining are feasible*

*Brazma et al., Nature Genetics, 2001

ArrayExpress Conceptual Model

[Diagram of the main entities: Experiment, Hybridisation, Array, Sample, Source (e.g., Taxonomy), Gene (e.g., EMBL), Normalisation, Data, Publication, External links]

ArrayExpress Details

Database schema derived from MAGE-OM
Standard SQL, we use Oracle
Validating data loader for MAGE-ML
Web interface based on Java servlets
– Queries – experiment, array, sample
– Browsing – views on expt
Object model-based query mechanism, automatic mapping to SQL
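As a rough illustration of what an object model-based query with automatic mapping to SQL can look like, the sketch below turns a small query object into a parameterised SQL statement. The `Query` class, the `experiment` table, its columns, the sample accessions and the use of SQLite in place of Oracle are all assumptions made for the example, not the ArrayExpress schema or API.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Query:
    """Hypothetical query object: a target class plus equality filters."""
    target: str                       # object-model class, mapped to a table name
    filters: dict = field(default_factory=dict)

    def to_sql(self):
        """Map the query object to a parameterised SQL string and its values."""
        # Identifiers cannot be bound as parameters, so only accept simple names.
        for name in (self.target, *self.filters):
            if not name.isidentifier():
                raise ValueError(f"unsafe identifier: {name!r}")
        sql = f"SELECT * FROM {self.target}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"{col} = ?" for col in self.filters)
        return sql, list(self.filters.values())


def demo():
    """Run one generated query against a throwaway in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE experiment (accession TEXT, species TEXT)")
    conn.executemany(
        "INSERT INTO experiment VALUES (?, ?)",
        [("EXP-1", "Mus musculus"), ("EXP-2", "Homo sapiens")],  # made-up rows
    )
    sql, params = Query("experiment", {"species": "Mus musculus"}).to_sql()
    return sql, conn.execute(sql, params).fetchall()
```

Calling `demo()` returns the generated SQL string together with the matching rows, which keeps the object-level query and the SQL it maps to side by side for inspection.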

Data Submission to ArrayExpress

Via MAGE-ML pipeline -> ftp -> curation -> database
Parsed from LIMS or local database
Current direct MAGE-ML submitters: Sanger, TIGR, Paul Spellman, Affymetrix
Via external tools, e.g. BASE, J-Express
Via MIAMExpress
ArrayExpress accepts Array, Experiment and Protocol submissions
Provides accession numbers which can be used by journals

ArrayExpress Curation Effort

Provide user support, help and documentation
Curate MAGE-ML pipeline data and data from MIAMExpress
Support on use of ontologies and CVs
Task: to minimize free text and remove synonyms
MIAME encouragement
Help on MAGE-ML
Goal: to provide high-quality, well-annotated data to allow automated data analysis

Access to Data in ArrayExpress

Via query interface: accession, author, species, experimental factors, etc.
Browsable lists
Data export as tab-delimited file
Export to Expression Profiler
As MAGE-ML from query interface
Array descriptions exportable as tab-delimited file
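The tab-delimited export mentioned above can be pictured with a minimal sketch like the following. The function name, the column names and the sample row are hypothetical; the actual ArrayExpress export layout is not described in these slides.

```python
import csv
import io


def export_tab_delimited(rows, columns):
    """Serialise a list of dict rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",   # silently drop keys that are not exported columns
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical expression values for one reporter.
example = export_tab_delimited(
    [{"reporter": "r1", "value": 1.5, "internal_id": 42}],
    ["reporter", "value"],
)
```

Using the `csv` module rather than joining strings by hand means values that themselves contain tabs or newlines are quoted instead of silently breaking the column layout.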

Submission and Annotation Tool - MIAMExpress

What Makes Up a Submission?

[Diagram: an Experiment groups several Array / Sample / Hybridisation sets, together with Protocols, Normalisation and Final data]

MIAMExpress Submission and Annotation Tool

Based on MIAME concepts and questionnaire

Perl-CGI, MySQL database
Experiment, Array, Protocol submissions
Generic annotation tool
Exports MAGE-ML

Expected Users of MIAMExpress

Biologists wishing to submit to AE
Groups who install it and support it locally (Perl-CGI and MySQL)
Biologists who want to use it as part of a LIMS

Annotation - Arrays and Samples (BioMaterial)

The Annotation Challenge

Use of controlled terms (avoiding free text)
Need for integration of controlled terms into annotation tools
Avoidance of synonyms
Provision of definitions and sources for user-defined terms
Provision of accurate, up-to-date annotation for samples and arrays
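One way to picture "controlled terms" and "avoidance of synonyms" inside an annotation tool is a lookup that maps whatever the submitter typed to a single preferred term, and flags anything unknown for curation. The vocabulary below is a made-up fragment for illustration, not an MGED or ArrayExpress vocabulary.

```python
# Hypothetical controlled vocabulary: preferred term -> known synonyms.
CONTROLLED_TERMS = {
    "Mus musculus": {"mouse", "house mouse", "m. musculus"},
    "liver": {"hepatic tissue"},
}


def build_lookup(vocabulary):
    """Index every preferred term and synonym by its case-folded spelling."""
    lookup = {}
    for preferred, synonyms in vocabulary.items():
        for name in {preferred, *synonyms}:
            lookup[name.casefold()] = preferred
    return lookup


def normalise(term, lookup):
    """Return the preferred term, or None so a curator can review the free text."""
    return lookup.get(term.strip().casefold())
```

Returning `None` rather than passing the input through keeps free text visible as a curation task instead of letting it enter the database unnoticed.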

Gene (BioSequence) Annotation

Can be given by links to existing databases

GO can be used to describe sequences and for analysis

MIAME is flexible: it allows many kinds of sequence identifiers, or the sequence itself

Use of existing resources is desirable to achieve interoperability

Sanger Human and Mouse Array Annotation Pipeline

Takes sequences which represent Reporters on the array

Exonerate (alignment) against the NCBI assembly (from Ensembl)

Inherits annotation from Ensembl: gene names, database references, GO terms

Provides a common annotation in tab-delimited format, which can be parsed to MAGE-ML or used in ADF

Pipeline available for external users to beta test

Array Definition Format

Tab-delimited file format that can be parsed to MAGE-ML

Defines relationships between features and sequences

Provides sequence annotation

Example in the Agenda, open letter to journals

Example BioSequence Annotation

341310_A FRZB; 2q32.1 ENSG00000162998 FRIZZLED-RELATED PROTEIN PRECURSOR (FRZB-1) (FREZZLED) (FRITZ). [Source:SWISSPROT;Acc:Q92765] AAC50736;AAB51298;AAC51217;transmembrane receptor (F);developmental processes (P);skeletal development (P);extracellular (C);membrane (C);U24163;U91903;U68057;NM_001463;
753587_A
282310_A PPFIA1; 11q13.3 ENSG00000131626 PROTEIN TYROSINE PHOSPHATASE, RECEPTOR TYPE, F POLYPEPTIDE (PTPRF), INTERACTING PROTEIN (LIPRIN), ALPHA 1. [Source:RefSeq;Acc:NM_003626] Q13136;AAC50173; U22816; NM_003626;
770192_A LGALS9; 17q11.2 ENSG00000168961 GALECTIN-9 (HOM-HD-21) (ECALECTIN). [Source:SWISSPROT;Acc:O00182] Q8WYQ7;BAB83623;CAA88922;BAA22166;BAA31542;BAB83625;BAB83624;CAB93851;lectin (F);galactose binding lectin (F); AB040130;AB040129;Z49107;AB006782;AB005894;AJ288083;AJ288084;AJ288085;AJ288086;AJ288087;AJ288088;AJ288089;AJ288090;NM_002308;NM_009587;
30502_C DLL1; 6q27 ENSG00000112577 DELTA-LIKE PROTEIN 1 PRECURSOR (DROSOPHILA DELTA HOMOLOG 1) (DELTA1) (H-DELTA-1). [Source:SWISSPROT;Acc:O00548] Q9UJV2;Q9NU41;AAF05834;AAB61286;AAG09716;CAB89569;calcium binding (F);structural molecule (F);Notch receptor ligand (F);histogenesis and organogenesis (P);cell differentiation (P);cell communication (P);integral membrane protein (C);membrane (C);AF196571;AF003522;AF222310;AL078605;NM_005618;
stSG89231 PSCD4; 22q13.1 ENSG00000100055 CYTOHESIN 4. [Source:SWISSPROT;Acc:Q9UIA0] Q9H7Q0;BAB15718;AAF15389;AAF28896;CAB63067;guanyl-nucleotide release factor (F);ARF guanyl-nucleotide exchange factor (F);AK024428;AF075458;AF125349;Z94160;NM_013385;
stSG89236 EP300; 22q13.2 ENSG00000100393 E1A-ASSOCIATED PROTEIN P300. [Source:SWISSPROT;Acc:Q09472] AAA18639; transcription factor (F);transcription co-activator (F);protein C-terminus binding (F);transcription regulation (P);signal transduction (P);neurogenesis (P);cell cycle (P);transcription, from Pol II promoter (P);nucleus (C);U01877; NM_001429;
stSG89351 PPARA; 22q13.31 ENSG00000100406 PEROXISOME PROLIFERATOR ACTIVATED RECEPTOR ALPHA (PPAR-ALPHA). [Source:SWISSPROT;Acc:Q07869] Q9BWQ7;AAA36468;CAA68898;AAB32649;CAB42862;CAB44427;AAH00052;transcription factor (F);steroid hormone receptor (F);ligand-dependent nuclear receptor (F);peroxisome receptor (F);transcription regulation (P);transcription, from Pol II promoter (P);energy pathways (P);fatty acid metabolism (P);nucleus (C);L02932;Y07619;S74349;AL049856;AL078611;BC000052;NM_005036;
stSG89356 22q12.1 ENSG00000159873 DJ366L4.2 (NOVEL PROTEIN) (FRAGMENT). [Source:SPTREMBL;Acc:Q9UGY6] Q9UGY6;Q9ULT6;CAB63034;BAB33319; AL023494;AB051436;
stSG89488 PCQAP; 22q11.21 ENSG00000099917 POSITIVE COFACTOR 2 GLUTAMINE/Q-RICH-ASSOCIATED PROTEIN (PC2 GLUTAMINE/Q-RICH-ASSOCIATED PROTEIN) (TPA-INDUCIBLE GENE-1) (TIG-1) (CTG7A). [Source:SWISSPROT;Acc:Q96RN5] AAC12944;AAK58423;BAB85034;AAH13985;AAH07529;AAB91443;transcription regulation (P);nucleus (C);AF056191;AF328769;AK074268;BC013985;BC007529;U80745;NM_015889;
stSG89493 22q12.2 EC 1.10.2.2;ENSG00000170192 UBIQUINOL-CYTOCHROME C REDUCTASE COMPLEX 7.2 KDA PROTEIN (EC 1.10.2.2) (CYTOCHROME C1, NONHEME 7 KDA PROTEIN) (COMPLEX III SUBUNIT X) (7.2 KDA CYTOCHROME C1-ASSOCIATED PROTEIN SUBUNIT) (HSPC119). [Source:SWISSPROT;Acc:Q9UDW1] Q9P012;Q9NZY4;AAF29115;BAB20672;AAF29083;AAH05402;AAF29023;oxidoreductase (F);ubiquinol-cytochrome c reductase (F);electron transport (P);complex III (ubiquinone to cytochrome c) (P);mitochondrion (C);inner membrane (C);mitochondrial electron transport chain complex (sensu Eukarya) (C);ubiquinol-cytochrome c reduAF161500;AC004882;AB028598;AF161468;BC005402;AF161536;NM_013387;
stSG89214 22q13.31 ENSG00000075234 Q9NWP8;BAA91331; AK000706;NM_017931;

<BioSequence length="500">
  <SequenceDatabases_assnlist>
    <DatabaseEntry accession="ENSG00000162998">
      <Database_assnref>
        <Database_ref identifier="DB:Ensembl"/>
      </Database_assnref>
    </DatabaseEntry>
  </SequenceDatabases_assnlist>
  <Type_assn>
    <OntologyEntry category="biosequence:type" value="Ensembl gene"/>
  </Type_assn>
  <NameValueType category="Ensembl:Description" value="FRIZZLED-RELATED PROTEIN PRECURSOR (FRZB-1) (FRIZZLED) (FRITZ)"/>
  <OntologyEntry category="GO:molecular_function" value="transcription factor"/>
</BioSequence>
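To make the step from a tab-delimited annotation row to a MAGE-ML-style element concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It reuses the element and attribute names from the fragment above; the four-column input layout (length, Ensembl gene ID, description, semicolon-separated GO function terms) and both function names are assumptions for illustration, not the ADF specification or the actual ArrayExpress parser.

```python
import xml.etree.ElementTree as ET


def parse_line(line):
    """Split one tab-delimited line; the four-column layout is an assumption."""
    length, gene, description, go_terms = line.rstrip("\n").split("\t")
    return {
        "length": int(length),
        "ensembl_gene": gene,
        "description": description,
        "go_function": [t for t in go_terms.split(";") if t],
    }


def row_to_biosequence(row):
    """Build a BioSequence element shaped like the slide's fragment from one row."""
    seq = ET.Element("BioSequence", length=str(row["length"]))
    dbs = ET.SubElement(seq, "SequenceDatabases_assnlist")
    entry = ET.SubElement(dbs, "DatabaseEntry", accession=row["ensembl_gene"])
    ref = ET.SubElement(entry, "Database_assnref")
    ET.SubElement(ref, "Database_ref", identifier="DB:Ensembl")
    type_assn = ET.SubElement(seq, "Type_assn")
    ET.SubElement(type_assn, "OntologyEntry",
                  category="biosequence:type", value="Ensembl gene")
    ET.SubElement(seq, "NameValueType",
                  category="Ensembl:Description", value=row["description"])
    # One OntologyEntry per GO molecular function term in the row.
    for term in row["go_function"]:
        ET.SubElement(seq, "OntologyEntry",
                      category="GO:molecular_function", value=term)
    return seq
```

`ET.tostring(row_to_biosequence(parse_line(line)), encoding="unicode")` then yields the serialised element. Building the tree with an XML library rather than string concatenation means attribute values are escaped correctly and every element is closed.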

Sample Annotation

Gene expression data only have meaning in the context of detailed sample descriptions

If the data are going to be interpreted by independent parties, sample information has to be searchable and in the database

Controlled vocabularies and ontologies (species, cell types, compound nomenclature, treatments, etc.) are needed for unambiguous sample description

None of this is trivial

What Does an Ontology Do?

Captures knowledge
Creates a shared understanding – between humans and for computers
Makes knowledge machine-processable
Makes meaning explicit – by definition and context
It is more than a controlled vocabulary: it has structure

MGED Biomaterial (sample) Ontology

Under construction
– by MGED OWG
– using OILed

Motivated by MIAME and coordinated with the database model (mapping available)

We are extending classes, providing constraints, defining terms, and adding terms

Extension to include other ontology entries and CVs required by MAGE-OM

BioMaterial Ontology Internal Terms

Internal and External Terms Combined

Examples of External Ontologies and CVs

NCBI taxonomy database
Jackson Lab mouse strains and genes
Edinburgh mouse atlas anatomy
HUGO nomenclature for human genes
Chemical and compound ontologies, e.g. CAS
TAIR
FlyBase anatomy
GO (www.geneontology.org)
GOBO ontologies

Example BioMaterial Annotation

Sample source and treatment description, and its correct annotation using the MGED BioMaterial Ontology classes and corresponding external references:

“Seven week old C57BL/6N mice were treated with fenofibrate. Liver was dissected out, RNA prepared”

MGED BioMaterial Ontology class: instance [external reference]

BioMaterialDescription
Biosource Property
Organism: Mus musculus musculus id: 39442 [NCBI Taxonomy]
Age: 7 weeks after birth
DevelopmentStage: Stage 28 [Mouse Anatomical Dictionary]
Sex: Female
StrainOrLine: C57BL/6 [International Committee on Standardized Genetic Nomenclature for Mice]
BiosourceProvider: Charles River, Japan
OrganismPart: Liver [Mouse Anatomical Dictionary]
BioMaterialManipulation
EnvironmentalHistory
CultureCondition
Temperature: 22 ± 2 °C
Humidity: 55 ± 5%
Light: 12 hours light/dark cycle
PathogenTests: Specified pathogen free conditions
Water: ad libitum
Nutrients: MF, Oriental Yeast, Tokyo, Japan
Treatment
CompoundBasedTreatment
(Compound): Fenofibrate, CAS 49562-28-9 [ChemIDplus]
(Treatment_application): in vivo, oral gavage
(Measurement): 100 mg/kg body weight
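The example annotation can also be held as structured records, which is what makes it searchable. The sketch below is illustrative only: the `Annotation` type and helper are hypothetical, and the pairing of ontology classes, values and external references is read off the slide's column order, which is an assumption. Only a subset of the fields is shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    ontology_class: str            # MGED BioMaterial Ontology class name
    value: str                     # instance value as shown on the slide
    source: Optional[str] = None   # external reference, if one is given


# Subset of the fenofibrate-treated mouse liver sample (pairing is assumed).
SAMPLE = [
    Annotation("Organism", "Mus musculus musculus id: 39442", "NCBI Taxonomy"),
    Annotation("Age", "7 weeks after birth"),
    Annotation("Sex", "Female"),
    Annotation("StrainOrLine", "C57BL/6",
               "International Committee on Standardized Genetic Nomenclature for Mice"),
    Annotation("OrganismPart", "Liver", "Mouse Anatomical Dictionary"),
    Annotation("Compound", "Fenofibrate, CAS 49562-28-9", "ChemIDplus"),
]


def without_external_reference(annotations):
    """List the classes whose values are not backed by an external resource."""
    return [a.ontology_class for a in annotations if a.source is None]
```

A helper like `without_external_reference(SAMPLE)` gives a curator a quick view of which fields still rest on free text rather than on an external ontology or database.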

Future

Data acquisition for ArrayExpress
Update ArrayExpress to newest MAGE-OM
V2.0 MIAMExpress, domain specific, portable
Further ontology development and integration into tools, use of OWL
Curation tools (other than grep, Perl scripts)
Improved query interface for AE
ArrayExpress update tool

Resources

Schemas for both ArrayExpress and MIAMExpress, access to code
MAGE-ML examples: Arrays, Expts, Protocols
MIAME glossary, MAGE-MIAME-ontology mappings
List of ontology resources from MGED pages
Help in establishing pipelines
MGED software MAGEstk
Curation, help and advice
www.mged.org
www.ebi.ac.uk/arrayexpress

Acknowledgments

Microarray Informatics Team, EBI, esp. Alvis Brazma, Ugis Sarkans, Mohammad Shojatalab, Ele Holloway, Gaurab Mukherjee, Philippe Rocca-Serra, Susanna Sansone
Chris Stoeckert, U. Penn.
Members of MGED
Sanger Institute - Rob Andrews, Jurg Bahler, Adam Butler, Kate Rice
EMBL Heidelberg - Wilhelm Ansorge, Martina Muckenthaler, Thomas Preiss
