Page 1: The ArrayExpress Gene Expression Database: a Software Engineering and Implementation Perspective Ugis Sarkans European Bioinformatics Institute

The ArrayExpress Gene Expression Database: a Software Engineering and Implementation Perspective

Ugis Sarkans

European Bioinformatics Institute

Page 2:

Outline

• Microarray data and standards overview
• ArrayExpress overall principles
• ArrayExpress architecture
• AE repository
• AE data warehouse
• Future plans and conclusions

Page 3:

Gene expression data and annotation

[Figure: gene expression matrix with genes along one axis and samples along the other. Labelled parts: gene expression levels (problem 2), sample annotations (problem 1), gene annotations.]

Page 4:

Platform comparison (Tan et al., PNAS, 2003)

‘Our conclusion was very straightforward: there was very little overlap in the types of data in terms of differential expression’ (Margaret Cam, NIH)

Page 5:

[Figure: microarray experiment workflow. Each sample yields an RNA extract, which becomes labelled nucleic acid and is hybridised to an array (a microarray built to an array design); each step is described by a protocol. The hybridisations that make up an experiment are combined, through normalization and integration, into a gene expression data matrix indexed by genes.]

Page 6:

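Different processing levels of MA data

[Figure: successive processing levels of microarray data, with panels labelled A to D: array scans, a spots-by-quantitations table, and a genes-by-samples matrix.]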

Page 7:

MGED standards

• MIAME – minimum information about a microarray experiment

• MAGE-OM and MAGE-ML – microarray gene expression object model and mark-up language

• MO – microarray ontology

• Data normalisation and transformations (and quality control)

Page 8:

UML Packages of MAGE

[Figure: MAGE UML packages, grouped under the labels "what was used", "what was done", "results" and "miscellaneous": BioEvent, Experiment, ArrayDesign, BioMaterial, BioAssayData, BioAssay, DesignElement, HigherLevelAnalysis, BioSequence, Array, QuantitationType, Description, Protocol, Measurement, AuditAndSecurity, BQS.]

Page 9:

MAGE – an example diagram

Page 10:

ArrayExpress aims

• An archive for microarray data supporting scientific publications

• Providing easy access to public gene expression and other microarray data in a structured format

• Facilitating the sharing of microarray designs and protocols

• Facilitating the establishment of infrastructure for microarray data sharing

Page 11:

AE users

• Experimentalists

• “Single-gene” biologists

• Bioinformaticians; genome-wide studies

• Bioinformaticians – algorithm developers

• Software developers

Page 12:

ArrayExpress infrastructure

[Figure: ArrayExpress at the EBI, comprising the ArrayExpress repository and a data warehouse (BioMart). Submissions arrive through MIAMExpress, including external MIAMExpress installations (Camb. U., EMBL), and as MAGE-ML from other microarray databases (SMD, TIGR, Utrecht, RZPD), array manufacturers (Affymetrix, Agilent) and data analysis software (R/Bioconductor, J-Express, Resolver). A submission tracking/curation tool supports the submission flow. Queries and analysis are served over the web (www), with data analysis in Expression Profiler and links to external databases (EMBL, UniProt, Ensembl).]

Page 13:

AE: overall principles

• Adherence to community standards

• Data captured in a granular, formalized manner

• Modern but proven software technologies

• Incremental development

Page 14:

AE design considerations

• Separate data archiving from the query-optimized data warehouse

• Generate default implementation, then refine
– ~2 full-time developers
– pressure to bring system online quickly

• Use object abstraction layer
– deal with performance overhead on case-by-case basis

Page 15:

Repository architecture overview

[Figure: components of the repository. MAGE-ML documents, the MAGE validator (with error.log), the MAGE loader and the MAGE unloader; the Oracle DB accessed through Castor object/relational mapping; MAGE-OM and the MAGE-ML DTD; Java servlets on Tomcat with Velocity web page templates; the curation environment.]

Page 16:

AE schema: why auto-generated?

– AE must be able to import any valid MAGE-ML and not lose information

– good for navigating through data in terms of object model

– if some queries don’t work well, add something to the schema

• Experiment-Biomaterial, Experiment-Protocol links

– so far works for 400Gb of data
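
The idea of generating a default schema from the object model and then refining it by hand can be sketched as follows. This is a minimal illustration, not the ArrayExpress generator: the `ModelClass` description, the three-class model fragment, the column names and the use of SQLite (instead of Oracle) are all assumptions made for the example.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class ModelClass:
    """Hypothetical, much-simplified description of one class in an object model."""
    name: str
    attributes: dict                                 # attribute name -> SQL type
    references: list = field(default_factory=list)   # names of associated classes


def generate_ddl(classes):
    """Return one CREATE TABLE statement per model class (the 'default' schema)."""
    statements = []
    for cls in classes:
        columns = ["id INTEGER PRIMARY KEY"]
        columns += [f"{attr} {sql_type}" for attr, sql_type in cls.attributes.items()]
        columns += [f"{ref.lower()}_id INTEGER REFERENCES {ref}(id)"
                    for ref in cls.references]
        statements.append(f"CREATE TABLE {cls.name} ({', '.join(columns)});")
    return statements


# Illustrative three-class fragment; the real MAGE-OM is far larger.
model = [
    ModelClass("Experiment", {"identifier": "TEXT", "name": "TEXT"}),
    ModelClass("BioAssay", {"identifier": "TEXT"}, ["Experiment"]),
    ModelClass("BioMaterial", {"identifier": "TEXT"}, ["BioAssay"]),
]
ddl = generate_ddl(model)

# Hand-added refinement: a direct Experiment-BioMaterial link table, so a
# frequent query no longer has to navigate through the intermediate class.
ddl.append(
    "CREATE TABLE Experiment_BioMaterial ("
    "experiment_id INTEGER REFERENCES Experiment(id), "
    "biomaterial_id INTEGER REFERENCES BioMaterial(id));"
)

# Check that the generated statements are valid SQL by running them.
conn = sqlite3.connect(":memory:")
conn.executescript("\n".join(ddl))
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
```

Because the tables mirror the model classes one to one, any object that fits the model has somewhere to be stored; the link table is the kind of targeted addition made only when a specific query proves slow.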

Page 17:

Auto-generated web pages

Page 18:

To ontologize or not to ontologize

At the beginning: BioSource with fixed attributes species, age, sex, cellLine, tissue, color, distanceToSun, weight, favoriteCereal, ...

At the end: BioSource associated with 0..n OntologyEntry (category, value, description)

Page 20:

Model vs. ontology

• Model – stable; ontologies – flexible

• Adding/modifying/deleting attributes – easy; adding/modifying/deleting associations – hard

• Therefore: attributes and their types in ontologies, domain structure (classes + associations) in the model
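
A minimal sketch of this split, assuming plain Python dataclasses: the class structure (BioSource with 0..n OntologyEntry) is fixed in the model, while the attribute categories live in the data. The `name` field and the `annotate` and `values` helpers are hypothetical additions for the example.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyEntry:
    """One annotation: what kind of attribute it is and its value."""
    category: str
    value: str
    description: str = ""


@dataclass
class BioSource:
    """Stable model class; its attributes are held as 0..n ontology entries."""
    name: str  # hypothetical identifier, added for the example
    ontology_entries: list = field(default_factory=list)

    def annotate(self, category, value, description=""):
        self.ontology_entries.append(OntologyEntry(category, value, description))

    def values(self, category):
        return [e.value for e in self.ontology_entries if e.category == category]


source = BioSource("sample-1")
source.annotate("species", "Homo sapiens")
source.annotate("tissue", "liver")
# A new kind of attribute needs no change to the class definition.
source.annotate("favoriteCereal", "oats")
```

Adding `favoriteCereal` touches only data, whereas adding a new association (for example between BioSource and another class) would require changing the model itself.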

Page 21:

Experiment 1
• type
• performer
• …

Hybridization data 1
• Experimental factors
• Quantitation type definitions
• …

>15 000 000 000 data points

NetCDF

Page 22:

Data warehouse schema

[Figure: warehouse schema relating the expression value (ratio or absolute) to sample, bioassay (hybridization), experiment, gene, array design and array element, with property tables: gene property (e.g. GO annot.), experiment property (e.g. type), bioassay property (e.g. exper. factor), sample property (e.g. species, tissue).]
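
A reduced version of such a schema can be tried out with SQLite. The sketch below assumes a star-like layout with the expression value at the centre and keeps only the gene and sample branches; every table name, column name and data row is invented for illustration and is not taken from the ArrayExpress warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE gene_property (
    gene_id INTEGER REFERENCES gene(gene_id), category TEXT, value TEXT);
CREATE TABLE sample (sample_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sample_property (
    sample_id INTEGER REFERENCES sample(sample_id), category TEXT, value TEXT);
CREATE TABLE expression_value (
    gene_id INTEGER REFERENCES gene(gene_id),
    sample_id INTEGER REFERENCES sample(sample_id),
    value REAL);
""")

# Made-up rows: two genes, two samples, four measurements.
conn.executemany("INSERT INTO gene VALUES (?, ?)", [(1, "geneA"), (2, "geneB")])
conn.executemany("INSERT INTO gene_property VALUES (?, ?, ?)",
                 [(1, "GO", "GO:0006412"), (2, "GO", "GO:0006950")])
conn.executemany("INSERT INTO sample VALUES (?, ?)", [(1, "s1"), (2, "s2")])
conn.executemany("INSERT INTO sample_property VALUES (?, ?, ?)",
                 [(1, "tissue", "liver"), (2, "tissue", "brain")])
conn.executemany("INSERT INTO expression_value VALUES (?, ?, ?)",
                 [(1, 1, 2.5), (1, 2, 0.4), (2, 1, 1.1), (2, 2, 3.0)])

# Expression values for genes with a given GO annotation in a given tissue.
rows = conn.execute("""
    SELECT g.name, s.name, e.value
    FROM expression_value e
    JOIN gene g ON g.gene_id = e.gene_id
    JOIN sample s ON s.sample_id = e.sample_id
    JOIN gene_property gp ON gp.gene_id = g.gene_id
    JOIN sample_property sp ON sp.sample_id = s.sample_id
    WHERE gp.category = 'GO' AND gp.value = ?
      AND sp.category = 'tissue' AND sp.value = ?
""", ("GO:0006412", "liver")).fetchall()
```

Keeping properties in separate category/value tables mirrors the figure's property boxes: a query filters on annotations first and only then touches the large table of expression values.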

Page 23:

What BioMart gives to AEDW

• Query language abstraction
– Joins automatically generated

• Schema optimized for performance

• Clear database integration roadmap
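
What "joins automatically generated" means in practice can be illustrated with a toy query builder. This is not BioMart's interface: the `JOINS` map, the `build_query` function and the table and column names are hypothetical, and the sketch assumes every dimension table joins directly to one central table.

```python
# Hypothetical join map: how each dimension table connects to the central table.
JOINS = {
    "gene": "JOIN gene ON gene.gene_id = expression_value.gene_id",
    "sample": "JOIN sample ON sample.sample_id = expression_value.sample_id",
}


def build_query(select, filters):
    """Build a parameterised SELECT, adding only the joins that are needed.

    select  -- list of "table.column" strings to return
    filters -- dict mapping "table.column" to the required value
    """
    tables = []
    for qualified in list(select) + list(filters):
        table = qualified.split(".")[0]
        if table != "expression_value" and table not in tables:
            tables.append(table)

    sql = "SELECT " + ", ".join(select) + " FROM expression_value"
    for table in tables:
        sql += " " + JOINS[table]

    params = []
    if filters:
        sql += " WHERE " + " AND ".join(f"{column} = ?" for column in filters)
        params = list(filters.values())
    return sql, params


# The caller names attributes and filters only; the joins are derived.
sql, params = build_query(
    ["gene.name", "expression_value.value"],
    {"sample.tissue": "liver"},
)
```

The caller never writes a join: asking for `gene.name` and filtering on `sample.tissue` is enough for the builder to pull in both tables.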

Page 24:

ArrayExpress environment

[Figure: three environments. Production: production database with a prod. DB clone, production data mgmt tools, and production Tomcat 1 and 2 (Linux nodes) behind a web router serving external users; input is MIAMExpress or pipeline MAGE-ML. Curation: curation (data testing) database, curation data mgmt tools and curation Tomcat (alpha), used by curators; input is MAGE-ML from a new pipeline. Development: dev./test database, development data mgmt tools and developers' Tomcats (PC), used by developers; input is any MAGE-ML. Also shown: prototype DW and development DW.]

Page 25:

Future plans

• Data management environment automation

• Flexible data warehouse interface

• Programmatic interface (HTTP/XML based)

• Distributed infrastructure??

Page 26:

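Distributed data infrastructure

[Figure: users send a query to a query broker; the broker finds the resource holding the data (ArrayExpress or one of several local databases); the data is then delivered to the users. Arrow labels: query, find resource, deliver data.]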
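
The query / find resource / deliver data flow can be sketched in a few lines. This is an illustration of the brokering idea only, under the assumption that each resource can say whether it holds a given experiment; the class names, method names and accession strings are hypothetical.

```python
class DataResource:
    """Hypothetical stand-in for ArrayExpress or a local database."""

    def __init__(self, name, experiments):
        self.name = name
        self._experiments = experiments  # accession -> data

    def has(self, accession):
        return accession in self._experiments

    def deliver(self, accession):
        return self._experiments[accession]


class QueryBroker:
    """Knows the resources but holds no data itself."""

    def __init__(self, resources):
        self.resources = resources

    def find_resource(self, accession):
        """Return the first resource holding the accession, or None."""
        for resource in self.resources:
            if resource.has(accession):
                return resource
        return None


broker = QueryBroker([
    DataResource("ArrayExpress", {"EXP-1": "data for EXP-1"}),
    DataResource("local-db-1", {"EXP-2": "data for EXP-2"}),
])

# 1. query the broker, 2. broker finds the resource, 3. resource delivers data.
resource = broker.find_resource("EXP-2")
data = resource.deliver("EXP-2") if resource is not None else None
```

In this arrangement the broker only routes requests, so local databases keep control of their own data while users still have a single entry point.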

Page 27:

Conclusions

• Conceptual object modeling works well for complex life sciences domains

• Many software infrastructure components can be auto-generated from object models

• A range of approaches can be used for modeling, e.g., UML framework + ontologies

• Repository and data warehouse – different aims and different implementation principles

Page 28:

Acknowledgements

• Gonzalo Garcia Lara – web interface
• Ahmet Oezcimen – DBA
• Anjan Sharma – curation tool
• Sergio Contrino, Richard Coulson – data warehouse
• Niran Abeygunawardena – webmaster
• Mohammadreza Shojatalab – MIAMExpress
• Misha Kapushesky – Expression Profiler
• Curation team:
– Helen Parkinson, Ele Holloway, Gaurab Mukherjee, Anna Farne, Tim Rayner
• Domain-specific projects:
– Susanna Sansone, Philippe Rocca-Serra
• Alvis Brazma
• MGED collaborators
– Stanford, TIGR, Affymetrix, EMBL, ….
• BioMart team