1
Outline
• Standardization - necessary components
– what information should be exchanged
– how the information should be exchanged
– common terms (ontologies)
– common ways of describing data processing
– how to query information
• ArrayExpress
– public repository for microarray data
– www.ebi.ac.uk/arrayexpress
2
What information should be exchanged?
• MIAME - Minimum Information About a Microarray Experiment
– informal specification
– paper published in Nature Genetics
– goal - to initiate discussion:
• which details are important and which may not be
3
Ultimate dream
Minimum information is the following table:
[Figure: a genes × samples table of gene expression levels (in mRNA counts/cell). Genes: pointers to (a) well-established gene database(s). Samples: pointers to a well-established sample ontology.]
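The genes × samples table described above can be sketched as a small data structure. This is a minimal illustration only: the gene identifiers, sample terms, expression values, and function names are all made up and do not come from any real database or ontology.

```python
from __future__ import annotations

# Hypothetical identifiers and values, for illustration only.
genes = ["gene:0001", "gene:0002"]            # pointers into a gene database
samples = ["sample-term:A", "sample-term:B"]  # pointers into a sample ontology

# expression[gene][sample] = expression level in mRNA counts per cell
expression = {
    "gene:0001": {"sample-term:A": 12.0, "sample-term:B": 3.5},
    "gene:0002": {"sample-term:A": 0.0, "sample-term:B": 48.0},
}


def level(gene: str, sample: str) -> float:
    """Look up one cell of the genes x samples table."""
    return expression[gene][sample]


def profile(gene: str) -> list[float]:
    """Return one gene's expression levels across all samples, in sample order."""
    return [expression[gene][s] for s in samples]
```

Keeping the identifiers as pointers, rather than free-text names, is what would let two laboratories compare the same row or column.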
4
Currently: MIAME six parts
1. Experimental design: the set of the hybridisation experiments as a whole
2. Array design: each array used and each element (spot) on the array
3. Samples: samples used, the extract preparation and labeling
4. Hybridizations: procedures and parameters
5. Measurements: images, quantitation, specifications
6. Controls: types, values, specifications
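One way to picture the six parts is as a checklist applied to a submission. The sketch below assumes a submission is a plain dictionary; the key names and the `missing_parts` helper are hypothetical and are not defined by MIAME itself.

```python
# The six MIAME parts as listed above; the key names are hypothetical.
MIAME_PARTS = (
    "experimental_design",
    "array_design",
    "samples",
    "hybridizations",
    "measurements",
    "controls",
)


def missing_parts(submission: dict) -> list:
    """Return the MIAME parts that are absent or empty in a submission."""
    return [part for part in MIAME_PARTS if not submission.get(part)]


# A made-up, incomplete draft submission.
draft = {
    "experimental_design": "time course, 4 hybridisations",
    "array_design": "array A",
    "samples": ["s1", "s2"],
    "hybridizations": [],
}
print(missing_parts(draft))  # ['hybridizations', 'measurements', 'controls']
```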
5
MIAMExpress submission procedure
http://www.ebi.ac.uk/miamexpress
[Diagram of the submission steps:
– Create account
– Login
– Pending/New Experiment
– Sample 1 … Sample n (Sample protocol)
– Extracts 1…n for each sample (Extraction protocol)
– Hybridisations (Hyb protocol)
– Array 1 … Array n (Scanning protocol)
– Data 1 … Data n (Image analysis protocol)
– Combined Experiment Data (Transformation protocol)
– Submit (Final free text comment)
– MAGE-ML]
6
How should the information be exchanged?
• MAGE-OM - MicroArray Gene Expression Object Model
– formal specification - UML (Unified Modeling Language) model
– described by a set of diagrams
– standardized through Object Management Group
– describes the domain of microarray data
– can serve as a source for generating various software artifacts
7
MAGE - brief history
• August 1997 - Life Sciences Research group formed within the Object Management Group
• March 2000 - gene expression RFP issued
• December 2000 - initial submissions of proposals for gene expression data standards:
– EBI (on behalf of MGED) - MAML
– Rosetta (on behalf of GEML community) - GEML + some IDLs
– NetGenics - IDLs
8
MAGE - brief history (2)
• Decision to proceed with a joint submission
• Decision to base the standard on UML
• Submitters’ meetings throughout 2001
• End of January 2002 - MAGE becomes an adopted specification
• October 2002 - MAGE becomes an available specification
• MAGE-ML - XML language - automatically derived from MAGE
• (More than) MIAME-compliant; only a subset can be used
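The idea of deriving an XML language automatically from an object model can be illustrated with a minimal sketch. The class, element, and attribute names below are hypothetical; they are not taken from MAGE-OM or the MAGE-ML DTD, and the mapping rule (class name to tag, fields to attributes) is only an assumption chosen for brevity.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields


@dataclass
class Sample:  # hypothetical class, not part of MAGE-OM
    identifier: str
    species: str


def to_xml(obj) -> ET.Element:
    """Map a dataclass instance to an element: class name -> tag, fields -> attributes."""
    element = ET.Element(type(obj).__name__)
    for f in fields(obj):
        element.set(f.name, str(getattr(obj, f.name)))
    return element


xml_text = ET.tostring(to_xml(Sample("S1", "Homo sapiens")), encoding="unicode")
print(xml_text)
```

Because the mapping is mechanical, any change to the model is reflected in the XML without hand-editing the format.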
9
MAGE – an example diagram
10
Use case of MAGE: ArrayExpress architecture
[Diagram. Components shown: ArrayExpress (Oracle); Browser; MIAMExpress; MAGE-ML (DTD); MAGE-OM; MAGE-ML (doc) files; data loader; Velocity template engine; Castor; object/relational mapping; Web page templates; Java servlets; Tomcat]
11
ArrayExpress Infrastructure
[Diagram. Components shown: ArrayExpress (Oracle); MIAMExpress (MySQL); local MIAMExpress installations; other microarray databases; external bioinformatics databases; EBI Expression Profiler; array manufacturers; LIMS; microarray software; data analysis software; data pipelines. Connections shown: submissions (MAGE-ML); MAGE-ML import, export; queries (www); data analysis (www)]
12
Common terms (ontologies)
• What is an ontology?
– formal model of some domain
– simplest ontologies – controlled vocabularies
– hierarchical, other relations, constraints, …
• MGED Ontology
• maintained by Chris Stoeckert, UPenn
• enables:
– unambiguous annotation
– therefore, queries
• currently sample description
• experiment design description to come
• multiple formats: RDFS, DAML+OIL
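To show why unambiguous annotation enables queries, here is a minimal sketch of a controlled vocabulary with a hierarchy. The terms, the `is_a` links, and the sample names are made up for illustration and are not taken from the MGED Ontology.

```python
# A tiny controlled vocabulary with "is_a" links; the terms are made up.
IS_A = {
    "liver": "organ",
    "kidney": "organ",
    "organ": "anatomical part",
    "anatomical part": None,
}


def ancestors(term: str) -> list:
    """Follow is_a links from a term up to the root."""
    chain = []
    parent = IS_A[term]
    while parent is not None:
        chain.append(parent)
        parent = IS_A[parent]
    return chain


def annotate(term: str) -> str:
    """Accept only terms from the vocabulary, so annotations stay unambiguous."""
    if term not in IS_A:
        raise ValueError(f"unknown term: {term!r}")
    return term


# Because annotations use shared terms, a query for "organ" also finds liver samples.
samples = {"s1": annotate("liver"), "s2": annotate("kidney"), "s3": annotate("anatomical part")}
organ_samples = sorted(s for s, t in samples.items() if t == "organ" or "organ" in ancestors(t))
```

A free-text annotation such as "Liver tissue" would be rejected by `annotate`, which is the point: only terms the query side also knows about get stored.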
13
Ontologies and ArrayExpress
• Curation team
– led by Helen Parkinson
– currently 5 curators
• Curation tool under development
– management of all relevant ontologies “under one roof”
– support in distributed ontology development
– submission tracking
– accession numbers
– ...
14
Common ways of describing data processing
• no “deliverables” yet
• MAGE can describe data processing
– just syntax, too much free text
• Laboratory Activity Broker process within OMG - common points?
• problem:
– it is possible to come up with a universal framework that can describe all possible scenarios of data processing
– however, how will it be used in real life?
15
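[Diagram. A workflow is described by process types (examples shown: clustering, pattern discovery, visualization, data filtering), each with data types in and out, and parameters. Workflow enactment produces process instances, each with data in and out, and parameter values.]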
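The distinction between process types and process instances can be sketched as follows. This is only an illustration under assumptions: the class names, the `enact` function, and the two example steps are hypothetical and do not correspond to any MAGE or OMG deliverable.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class ProcessType:
    """A reusable step: name, input/output data types, and the function implementing it."""
    name: str
    in_type: str
    out_type: str
    run: Callable[..., Any]


@dataclass
class ProcessInstance:
    """A record of one execution: concrete data in, parameter values, data out."""
    process: str
    data_in: Any
    params: Dict[str, Any]
    data_out: Any


def enact(workflow: List[ProcessType], data: Any,
          params: Dict[str, Dict[str, Any]]) -> List[ProcessInstance]:
    """Run each step in order and log one ProcessInstance per step."""
    log = []
    for step in workflow:
        step_params = params.get(step.name, {})
        result = step.run(data, **step_params)
        log.append(ProcessInstance(step.name, data, step_params, result))
        data = result
    return log


# Hypothetical steps: filter out low values, then scale the remaining ones.
filtering = ProcessType("data filtering", "values", "values",
                        lambda xs, minimum: [x for x in xs if x >= minimum])
scaling = ProcessType("scaling", "values", "values",
                      lambda xs, factor: [x * factor for x in xs])

log = enact([filtering, scaling], [0.5, 2.0, 4.0],
            {"data filtering": {"minimum": 1.0}, "scaling": {"factor": 10}})
```

The resulting log is a machine-readable record of what was done to obtain the final values, in contrast to a free-text description.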
16
Benefits
• compile “best practices” of data analysis
• document what has been done to obtain final results
• enable “high-throughput” data analysis work
17
How to query information
• again no “deliverables”
• initial plan - MAGE will include query support
– all methods were dropped - MAGE is a data model only
• ArrayExpress - 2 large components:
– repository - retrieve experiments as units, MAGE-based
– warehouse - gene- and data-oriented queries, work across experiments
• G2G (Jason Stewart) - protocol + query language for distributed queries
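The difference between the repository and the warehouse can be sketched with two access functions over the same data. Everything here is hypothetical: the accession numbers, gene names, values, and function names are made up, and the real ArrayExpress components are database-backed rather than in-memory dictionaries.

```python
# Hypothetical in-memory data; accession numbers, gene names, and values are made up.
experiments = {
    "EXP-1": {"design": "time course", "data": {"geneA": [1.0, 2.0], "geneB": [0.5, 0.4]}},
    "EXP-2": {"design": "dose response", "data": {"geneA": [3.0], "geneC": [7.0]}},
}


def repository_get(accession: str) -> dict:
    """Repository-style access: return one whole experiment as a unit."""
    return experiments[accession]


def warehouse_gene(gene: str) -> dict:
    """Warehouse-style access: collect one gene's values across all experiments."""
    return {
        accession: exp["data"][gene]
        for accession, exp in experiments.items()
        if gene in exp["data"]
    }
```

For example, `repository_get("EXP-2")` returns everything about one experiment, while `warehouse_gene("geneA")` cuts across both experiments for a single gene.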
18
[Diagram. Boxes labelled "Properties" list the attributes available for queries: ratio, absolute change, confidence measure; name, design element type; species, sample type; bioassay type; performer lab, exper. type; array design name, platform type, provider]
19
Summary

Problems | Components | ArrayExpress
What to exchange | MIAME | MIAMExpress, a tool that helps to capture important annotations
How to exchange | MAGE | can import/will be able to export MAGE-ML; DB schema based on MAGE
Common terms | MGED ontology | curation team; curation tool being developed
Data processing | ongoing work | data analysis modules built on top (Expression Profiler)
Queries | ongoing work | repository (experiments as units); warehouse (for numeric/gene-based queries, cross-experiment)