The importance of meta data capture – problems and solutions
Helen Parkinson
Microarray Informatics Team
European Bioinformatics Institute
NERC Meta Data Meeting
Bristol, April 2003
Talk structure
Metadata
Annotation problems
What is an ontology?
The MGED ontology
Annotation and standards for microarray data
Annotation tools
The meta data challenge
Meta data is data about data
Microarray experiments (should) have lots of meta data of different types
Meta data varies by experiment and by community
Meta data is often free text
Free text is evil in a database context
Informatics resources for biologists
Over 500 databanks and analysis tools that work over various resources
Repositories of knowledge and data at various levels, primary and secondary databases, and interfaces, e.g. EMBL, Swissprot, Ensembl
Knowledge often held as free text; limited use made of controlled vocabularies
Enormous amount of semantic heterogeneity and poor query facilities
Lots of effort goes into filtering databases to add value, e.g. RefSeq from GenBank
What are the problems with free text?
It’s slow to search
It’s written by humans, and humans are error prone (typos, synonyms)
Humans have limited processing power
Meaning is not explicit in free text
Two options: develop tools for NLP (~40% efficient) or eliminate free text
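A minimal sketch of the synonym and typo problem, assuming three made-up sample descriptions and a hypothetical curated synonym table (none of these strings come from a real database). A plain substring search finds only the record that uses the query's exact spelling, while a lookup through one controlled term finds all three.

```python
# Hypothetical free-text sample descriptions; all three describe liver tissue.
records = {
    "S1": "RNA prepared from mouse liver",
    "S2": "hepatic tissue, dissected at 7 weeks",
    "S3": "mouse livr, total RNA",  # typo made by a human annotator
}

# Hypothetical curated mapping from free-text variants to one controlled term.
SYNONYMS = {"liver": "liver", "hepatic": "liver", "livr": "liver"}


def free_text_search(query):
    """Return ids of records whose text contains the query string."""
    return sorted(rid for rid, text in records.items() if query in text.lower())


def controlled_terms(text):
    """Map each known word in the text to its controlled term."""
    words = text.lower().replace(",", " ").split()
    return {SYNONYMS[w] for w in words if w in SYNONYMS}


def controlled_search(term):
    """Return ids of records that map to the controlled term."""
    return sorted(rid for rid, text in records.items() if term in controlled_terms(text))


print(free_text_search("liver"))   # only the record with the exact spelling
print(controlled_search("liver"))  # all three records
```

The mapping table still has to be curated by someone, which is the cost of eliminating free text after the fact rather than at submission time.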
Why is there so much free text?
Historically there was a small amount of data and a limited number of databases that held the data
Humans like free text and are resistant to change
There was nothing better than free text, though use of simple lists helped (e.g. SP keywords)
Gene names and free text
The EMBL flat file controls only species nomenclature; everything else is free text: genes, keywords (KW), descriptions etc.
Searching primary databases which are redundant and annotated with free text is hard work and relies upon the user inferring information from what is supplied
Reproducing this for new databases isn’t desirable
Without resources to help them curators cannot control free text efficiently
Search for “Ssp1” gene in DDBJ/EMBL/GenBank using SRS: gene=“Ssp1”, Species=“S.pombe”
1: AB027913 Schizosaccharomyces pombe gene for Ser/Thr protein kinase, partial cds, clone:TA76
2: AL441624 S.pombe chromosome I cosmid c110
3: AL159180 S.pombe chromosome I P1 p14E8
4: AL049609 S.pombe chromosome III cosmid c297
5: AL136235 S.pombe chromosome I cosmid c664
6: D45882 Yeast ssp1 gene for protein kinase, complete cds
7: X59987 S.pombe SSP1 gene for mitochondrial Hsp70 protein (Ssp1)
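The seven hits can be grouped by what their description lines say about the gene. The sketch below copies the accession numbers and descriptions from the search result; the keyword rules in `describe` are an illustrative assumption, not something SRS provides.

```python
# Accessions and description lines as listed in the search result.
hits = {
    "AB027913": "Schizosaccharomyces pombe gene for Ser/Thr protein kinase, partial cds, clone:TA76",
    "AL441624": "S.pombe chromosome I cosmid c110",
    "AL159180": "S.pombe chromosome I P1 p14E8",
    "AL049609": "S.pombe chromosome III cosmid c297",
    "AL136235": "S.pombe chromosome I cosmid c664",
    "D45882": "Yeast ssp1 gene for protein kinase, complete cds",
    "X59987": "S.pombe SSP1 gene for mitochondrial Hsp70 protein (Ssp1)",
}


def describe(description):
    """Guess the gene product from the free-text description (hypothetical rules)."""
    text = description.lower()
    if "protein kinase" in text:
        return "protein kinase"
    if "hsp70" in text:
        return "mitochondrial Hsp70"
    return "not stated in description"


groups = {}
for accession, description in hits.items():
    groups.setdefault(describe(description), []).append(accession)

for product, accessions in groups.items():
    print(product, accessions)
```

One gene name returns two different gene products, and four of the seven descriptions do not say which gene they contain, so the user has to open each entry and infer the answer.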
Whose responsibility?
Many important projects have no coordination of standards, e.g. for gene nomenclature, disease state, developmental stage
There are competing resources in some domains, and none in others
Successful projects are built by a community or with community input, e.g. GO
What is an ontology?
Captures knowledge for both humans and computer applications
Has a set of vocabulary definitions that capture a community’s knowledge of a domain
‘An ontology may take a variety of forms, but necessarily it will include a vocabulary of terms, and some specification of their meaning. This includes definitions and an indication of how concepts are inter-related which collectively impose a structure on the domain and constrain the possible interpretations of terms.’
It is more than a controlled vocabulary: it has structure (but a CV is a good place to start)
Advantages of using ontologies
Meaning is explicit
Meaning is human and computer readable
Ease of updating, no need to find terms in free text and change them
Data transfer possible without loss of meaning
Reasoning to aid queries, annotation etc.
Building Ontologies
Simple lists work well but adding structure adds reasoning power
Class, container for information, has a definition and a relationship to other classes (is-a, part-of, kind-of)
Instances, terms that are contained within a class
Example of a class, subclass relationship and a constraint
Class Domestic Cat
sub-class of Pet
slot constraint cleans itself, eats stinky food
slot has filler
slot has instance Suki
Just a formalised way to say that a domestic cat is a pet that you don’t have to clean, but this is machine readable, and I can use it to classify
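A minimal sketch of why the structure matters, using the Domestic Cat example. The `OntologyClass` type and the definition strings are hypothetical and stand in for a real ontology language; only the is-a relationship and instances are modelled, not slot constraints.

```python
class OntologyClass:
    """A named class with a definition, an optional is-a parent, and instances."""

    def __init__(self, name, definition, parent=None):
        self.name = name
        self.definition = definition
        self.parent = parent  # is-a link to the superclass
        self.instances = set()

    def is_a(self, other):
        """True if this class is `other` or a descendant of it."""
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False


def instances_of(target, classes):
    """Collect instances of the target class and of all its subclasses."""
    found = set()
    for cls in classes:
        if cls.is_a(target):
            found |= cls.instances
    return sorted(found)


# Definitions are invented for illustration.
pet = OntologyClass("Pet", "An animal kept for companionship")
domestic_cat = OntologyClass("Domestic Cat", "A pet that cleans itself", parent=pet)
domestic_cat.instances.add("Suki")

print(instances_of(pet, [pet, domestic_cat]))  # Suki is found via the is-a link
```

A flat list would need Suki to be tagged as a pet explicitly; with the subclass link, a query for pets finds her without anyone having written that down.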
MIAME
• Minimum Information About a Microarray Experiment.
• MIAME is a guideline for microarray experimenters to describe their data so that:
• Sufficient information is recorded to:
– Correctly interpret and verify their experiments.
– Replicate the experiments.
• Structured information is recorded to:
– Query and correctly retrieve the data.
– Analyse the data.
6 parts of a microarray experiment
Experiment, Hybridisation, Sample, Array, Normalisation, Data
• Sample source • Sample treatments • Extraction protocol • Labeling protocol
• Array design information • Location of each element • Description of each element
• Control array elements
• Statistical treatment
• Image • Scanning protocol • Software specifications
• Quantification matrix • Analysis protocol • Software specifications
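As a rough illustration of how the six parts can act as a completeness check, the sketch below tests a submission record against the part names from the slide. The dictionary layout and every field name inside it are hypothetical; this is not MAGE-ML or the MIAME checklist itself.

```python
# The six parts named on the slide.
MIAME_PARTS = ("Experiment", "Sample", "Array", "Hybridisation", "Normalisation", "Data")


def missing_parts(submission):
    """Return the parts that are absent from the submission or left empty."""
    return [part for part in MIAME_PARTS if not submission.get(part)]


# Hypothetical, partly filled-in submission; field names are invented.
submission = {
    "Experiment": {"description": "treated versus untreated"},
    "Sample": {"source": "liver", "extraction_protocol": "see protocol 1"},
    "Array": {"design": "in-house spotted array"},
    "Data": {},  # present but empty, so still counted as missing
}

print(missing_parts(submission))
```

A check like this only says that something was recorded for each part, not that what was recorded is sufficient to interpret or replicate the experiment.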
ArrayExpress
A public repository for Microarray Data – MIAME compliant
Uses the MAGE model (designed to hold MIAME compliant data)
Holds public and private data
Uses controlled vocabulary where possible
Can represent complex metadata and reference external resources, e.g. databases and ontologies
Infrastructure at the EBI
[Diagram: ArrayExpress (Oracle) at the EBI receives submissions as MAGE-ML from MIAMExpress (MySQL), from local MIAMExpress installations (MAGE-ML files), and through MAGE-ML pipelines fed by MAGE-ML export from array manufacturers, LIMS, microarray software and data analysis software. Over the web (www) it serves queries and data analysis with Expression Profiler, and it links to other public microarray databases (GEO, CIBEX) and external bioinformatic databases.]
Desirable Queries
Show me all experiments where gene x is on the array
Show me all the experiments where organism x is treated by compound Y
Return all experiments using developmental stage X, disease stage Y
– Sort by platform type
– Which are untreated? Treated?
• Treated by what?
• How comparable are these?
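Queries like these become simple once the annotation is held as structured fields rather than free text. The sketch below assumes three invented experiment records with hypothetical field names; it is not the ArrayExpress schema or query interface.

```python
# Invented experiment records; ids and field names are hypothetical.
experiments = [
    {"id": "exp1", "organism": "Mus musculus", "compound": "fenofibrate", "platform": "spotted cDNA"},
    {"id": "exp2", "organism": "Mus musculus", "compound": None, "platform": "oligonucleotide"},
    {"id": "exp3", "organism": "Schizosaccharomyces pombe", "compound": None, "platform": "spotted cDNA"},
]


def find(records, **criteria):
    """Return records whose fields equal every given criterion."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]


# "Show me all the experiments where organism x is treated by compound Y"
treated = find(experiments, organism="Mus musculus", compound="fenofibrate")

# "Which are untreated?" (no compound recorded)
untreated = find(experiments, organism="Mus musculus", compound=None)

# "Sort by platform type"
by_platform = sorted(experiments, key=lambda r: r["platform"])

print([r["id"] for r in treated])
print([r["id"] for r in untreated])
print([r["id"] for r in by_platform])
```

Exact matching only works because every record spells the organism and compound the same way, which is what a controlled vocabulary enforces.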
MAGE-OM and Ontology Entries
ArrayExpress uses the MAGE-OM
Requires OntologyEntries of 3 types
– Simple lists, e.g. image format: GIF, TIFF etc.
– Infinite, e.g. anything that could be meta data relating to a sample: species, disease state
– In between: types of protocols, types of data transformation, types of BioSequence etc.
ArrayExpress Conceptual Model
[Diagram components: Experiment; Sample; Array; Hybridisation; Normalisation; Data; Publication; external links to Source (e.g., Taxonomy) and Gene (e.g., EMBL).]
MGED
MGED is the Microarray Gene Expression Database group
It’s a group who develop resources, including the object model and ontologies, for the community
It includes SMD, TIGR, Affy, Agilent etc.
Conference in Aix-en-Provence, Sep 2003
www.mged.org
The MGED Ontology
An ontology for microarray experiments
– Parts are applicable to describing experiments in general with a focus on microarray
– Built by people with legacy data problems: EBI, U.Penn, TIGR, SMD, UC Berkeley, NIH plus contributions from the mailing list
– Supports the ontology requirements of the MAGE model
Our approach to interfacing with other ontologies is “experimental”
– Provide a framework to point to other ontologies
• Know where to find different types of annotation
• How to interpret that annotation
The MGED ontology is not
Limited to any one domain or species
Modelling the real world or reinventing the MAGE model
Mapping terms from external non orthogonal ontologies
Recommending one ontology over another (though some are not freely available)
Just for microarrays, the same concepts like ‘assay’ apply to phenotype and we want to make a reusable resource
Organising by sub-classing
In MAGE, spots relate to BioSequences; these have types: gene, intergenic sequence, clone, PCR product, EST
BioSequenceType
ArrayExpress, MIAMExpress, RAD
MAGE-ML data exchange
Ontology instances propagated to submission/annotation web forms
Curation of user defined terms, before inclusion in the ontology
User defined terms collected via forms
MGED Ontology
BiomaterialDescription
Sex
Gender
documentation: Subclass of sex applicable to heterogametic species (i.e., those in which the sexes produce gametes of markedly different size). Males produce small numerous gametes. Females produce small numbers of large gametes. Hermaphrodites are individuals with both male and female characteristics. Mixed refers to a population of individuals with more than one type of gender.
used in individuals: female, hermaphrodite, male, mixed_sex, unknown_sex
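A small sketch of how a fixed set of individuals can be enforced at annotation time, using the five terms listed above. The function name and the choice to trim whitespace and lower-case the input are assumptions made for illustration.

```python
# The five individuals listed for this class on the slide.
SEX_TERMS = {"female", "hermaphrodite", "male", "mixed_sex", "unknown_sex"}


def annotate_sex(value):
    """Return the controlled term for a submitted value, or reject it."""
    term = value.strip().lower()  # assumed normalisation, not part of the ontology
    if term not in SEX_TERMS:
        raise ValueError(
            f"{value!r} is not a permitted term; choose from {sorted(SEX_TERMS)}"
        )
    return term


print(annotate_sex(" Female "))  # prints: female
```

Rejected values such as "F" or "unknown" are the kind of user-defined terms that would go to a curator rather than straight into the database.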
Turning free text into controlled annotation
Sample source and treatment description, and its correct annotation using the MGED ontology
“Seven week old C57BL/6N mice were treated with fenofibrate. Liver was dissected out, RNA prepared”
MGED BioMaterial Ontology class: instance [external reference]
BioMaterialDescription
Biosource Property
Organism: Mus musculus musculus, id: 39442 [NCBI Taxonomy]
Age: 7 weeks after birth
DevelopmentStage: Stage 28 [Mouse Anatomical Dictionary]
Sex: Female
StrainOrLine: C57BL/6 [International Committee on Standardized Genetic Nomenclature for Mice]
BiosourceProvider: Charles River, Japan
OrganismPart: Liver [Mouse Anatomical Dictionary]
BioMaterialManipulation
EnvironmentalHistory
CultureCondition
Temperature: 22 ± 2 °C
Humidity: 55 ± 5%
Light: 12 hours light/dark cycle
PathogenTests: Specified pathogen free conditions
Water: ad libitum
Nutrients: MF, Oriental Yeast, Tokyo, Japan
Treatment
CompoundBasedTreatment
(Compound): Fenofibrate, CAS 49562-28-9 [ChemIDplus]
(Treatment_application): in vivo, oral gavage
(Measurement): 100 mg/kg body weight
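The annotation of the fenofibrate sample can be held as category, value and source triples. The sketch below uses a few of the values from the slide, with the pairing of categories, values and external references assumed from the slide's layout. The `Annotation` type is a simplified, hypothetical stand-in and not the MAGE-OM OntologyEntry class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    category: str                 # ontology class, e.g. "Organism"
    value: str                    # the term chosen for this sample
    source: Optional[str] = None  # external resource that defines the term, if any


annotations = [
    Annotation("Organism", "Mus musculus musculus", "NCBI Taxonomy"),
    Annotation("Age", "7 weeks after birth"),
    Annotation("StrainOrLine", "C57BL/6",
               "International Committee on Standardized Genetic Nomenclature for Mice"),
    Annotation("OrganismPart", "Liver", "Mouse Anatomical Dictionary"),
    Annotation("Compound", "Fenofibrate", "ChemIDplus"),
]


def value_of(entries, category):
    """Return the value recorded for a category, or None if it was not annotated."""
    for entry in entries:
        if entry.category == category:
            return entry.value
    return None


externally_defined = [a.category for a in annotations if a.source]

print(value_of(annotations, "OrganismPart"))
print(externally_defined)
```

Keeping the source next to each value records where to look the term up and how to interpret it, instead of copying the external vocabulary into the local ontology.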
Care when using ontologies
Ontologies are rarely complete
Ontologies are fit for the purpose for which they are designed – not off the shelf solutions to your problem
Building ontologies is hard - it needs both domain experts and tools
A simple list is an excellent start
Possible solutions
Think about what you want to query – this determines what you will annotate
Look for existing resources and use them if they are appropriate
Do the doable first
Share your resources
Quote
‘Most biologists would rather share their toothbrush than share a gene name’
Michael Ashburner