
The importance of meta data capture – problems and solutions

Helen Parkinson

Microarray Informatics Team

European Bioinformatics Institute

NERC Meta Data Meeting

Bristol, April 2003

Talk structure

Metadata
Annotation problems
What is an ontology?
The MGED ontology
Annotation and standards for microarray data
Annotation tools

The meta data challenge

Meta data is data about data
Microarray experiments (should) have lots of meta data of different types
Meta data varies by experiment and by community
Meta data is often free text
Free text is evil in a database context

Informatics resources for biologists

Over 500 databanks and analysis tools that work over various resources

Repositories of knowledge and data at various levels, primary and secondary databases, and interfaces, e.g. EMBL, Swiss-Prot, Ensembl

Knowledge often held as free text; limited use made of controlled vocabularies

Enormous amount of semantic heterogeneity and poor query facilities

Lots of effort goes into filtering databases to add value, e.g. RefSeq from GenBank

What are the problems with free text

It’s slow to search
It’s written by humans, and humans are error prone (typos, synonyms)
Humans have limited processing power
Meaning is not explicit in free text
Two options: develop tools for NLP (~40% efficient), or eliminate free text
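The typo and synonym problem can be illustrated with a small sketch. The vocabulary, the synonym sets and the function name below are made up for illustration; the point is only that values which resolve to a preferred term become searchable, while a typo falls through and has to be caught by a curator.

```python
# Hypothetical controlled vocabulary: preferred term -> known synonyms (lower case).
CONTROLLED_TERMS = {
    "Schizosaccharomyces pombe": {"s.pombe", "s. pombe", "fission yeast"},
    "Mus musculus": {"mouse", "m. musculus", "house mouse"},
}


def normalise(free_text):
    """Return the preferred term for a free-text value, or None if it is unknown."""
    key = free_text.strip().lower()
    for preferred, synonyms in CONTROLLED_TERMS.items():
        if key == preferred.lower() or key in synonyms:
            return preferred
    return None


# Four ways a submitter might write the same species; the last one is a typo.
annotations = ["S.pombe", "fission yeast", "Schizosaccharomyces pombe", "S. pombee"]
resolved = [normalise(a) for a in annotations]
unresolved = [a for a, r in zip(annotations, resolved) if r is None]
```

Here the first three values collapse to one preferred term, and the misspelt value is left in `unresolved` rather than silently stored.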

Why is there so much free text?

Historically there was a small amount of data and a limited number of databases that held the data

Humans like free text and are resistant to change

There was nothing better than free text, though use of simple lists helped (e.g. SP keywords)

Gene names and free text

The EMBL flat file controls only species nomenclature; everything else is free text: genes, KW, descriptions etc.

Searching primary databases which are redundant and annotated with free text is hard work and relies upon the user inferring information from what is supplied

Reproducing this for new databases isn’t desirable

Without resources to help them curators cannot control free text efficiently

Search for “Ssp1” gene in DDBJ/EMBL/GenBank using SRS

gene=“Ssp1” Species=“S.pombe”

1: AB027913 Schizosaccharomyces pombe gene for Ser/Thr protein kinase, partial cds, clone:TA76
2: AL441624 S.pombe chromosome I cosmid c110
3: AL159180 S.pombe chromosome I P1 p14E8
4: AL049609 S.pombe chromosome III cosmid c297
5: AL136235 S.pombe chromosome I cosmid c664
6: D45882 Yeast ssp1 gene for protein kinase, complete cds
7: X59987 S.pombe SSP1 gene for mitochondrial Hsp70 protein (Ssp1)
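A short sketch of why this result set is hard to use: the seven hits are copied from the search above, and the keyword rules in `describes` are an illustrative assumption standing in for the inference a human reader has to make from each description line.

```python
# Accession and description line for each of the seven hits listed above.
hits = [
    ("AB027913", "Schizosaccharomyces pombe gene for Ser/Thr protein kinase, partial cds, clone:TA76"),
    ("AL441624", "S.pombe chromosome I cosmid c110"),
    ("AL159180", "S.pombe chromosome I P1 p14E8"),
    ("AL049609", "S.pombe chromosome III cosmid c297"),
    ("AL136235", "S.pombe chromosome I cosmid c664"),
    ("D45882", "Yeast ssp1 gene for protein kinase, complete cds"),
    ("X59987", "S.pombe SSP1 gene for mitochondrial Hsp70 protein (Ssp1)"),
]


def describes(description):
    """Guess the gene product from free text (crude, illustrative keyword rules)."""
    text = description.lower()
    if "kinase" in text:
        return "protein kinase"
    if "hsp70" in text:
        return "mitochondrial Hsp70"
    return "not stated"


# Group the accessions by the product their description appears to name.
by_product = {}
for accession, description in hits:
    by_product.setdefault(describes(description), []).append(accession)
```

One gene name returns records for two different products, plus four large clones whose descriptions say nothing about the gene at all.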

Whose responsibility?

Many important projects have no coordination of standards, e.g. for gene nomenclature, disease state, developmental stage

There are competing resources in some domains, and none in others

Successful projects are built by a community or with community input, e.g. GO

What is an ontology?

Captures knowledge for both humans and computer applications

Has a set of vocabulary definitions that capture a community’s knowledge of a domain

‘An ontology may take a variety of forms, but necessarily it will include a vocabulary of terms, and some specification of their meaning. This includes definitions and an indication of how concepts are inter-related which collectively impose a structure on the domain and constrain the possible interpretations of terms.’

It is more than a controlled vocabulary: it has structure (but a CV is a good place to start)

Advantages of using ontologies

Meaning is explicit
Meaning is human and computer readable
Ease of updating, no need to find terms in free text and change them
Data transfer possible without loss of meaning
Reasoning to aid queries, annotation etc.

Building Ontologies

Simple lists work well but adding structure adds reasoning power

Class: a container for information; it has a definition and a relationship to other classes (is-a, part-of, kind-of)

Instances: terms that are contained within a class

Example of a class, subclass relationship and a constraint

Class Domestic Cat
sub-class of Pet
slot constraint cleans itself, eats stinky food
slot has filler
slot has instance Suki

Just a formalised way to say that a domestic cat is a pet that you don’t have to clean - but this is machine readable, and I can use it to classify
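The same example can be written as a minimal data structure. This is a sketch, not an ontology language: `OntologyClass` and `classify` are names invented here, and the "reasoning" is reduced to a subset test on the slot constraints.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntologyClass:
    """A class with an is-a parent, slot constraints and instances."""
    name: str
    parent: Optional["OntologyClass"] = None
    constraints: frozenset = frozenset()
    instances: list = field(default_factory=list)

    def is_a(self, other):
        """Walk up the sub-class chain looking for `other`."""
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False


def classify(properties, classes):
    """Names of the classes whose constraints are all met by the observed properties."""
    observed = set(properties)
    return [c.name for c in classes if c.constraints and c.constraints <= observed]


pet = OntologyClass("Pet")
domestic_cat = OntologyClass(
    "Domestic Cat",
    parent=pet,
    constraints=frozenset({"cleans itself", "eats stinky food"}),
)
domestic_cat.instances.append("Suki")

# An unnamed animal described only by its properties.
matches = classify({"cleans itself", "eats stinky food", "purrs"}, [pet, domestic_cat])
```

Because the constraints are explicit, a program can place the unnamed animal in `Domestic Cat` and, through `is_a`, also conclude that it is a `Pet`.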

MIAME and ArrayExpress

•Minimum Information About a Microarray Experiment.

•MIAME is a guideline for microarray experimenters to describe their data so that:

•Sufficient information is recorded to:
  • Correctly interpret & verify their experiments.
  • Replicate the experiments.

•Structured information must be recorded to:
  • Query and correctly retrieve the data.
  • Analyse the data.

MIAME


6 parts of a microarray experiment

Experiment

Hybridisation

Sample

Array

Normalisation

Data

• Sample source
• Sample treatments
• Extraction protocol
• Labeling protocol

• Array design information
• Location of each element
• Description of each element

• Control array elements
• Statistical treatment

• Image
• Scanning protocol
• Software specifications

• Quantification matrix
• Analysis protocol
• Software specifications

ArrayExpress

A public repository for Microarray Data – MIAME compliant

Uses the MAGE model (designed to hold MIAME compliant data)

Holds public and private data
Uses controlled vocabulary where possible
Can represent complex metadata and reference external resources, e.g. databases and ontologies

Infrastructure at the EBI

[Diagram: ArrayExpress (Oracle) at the EBI. Submissions arrive as MAGE-ML from MIAMExpress (MySQL), from local MIAMExpress installations (MAGE-ML files), and through MAGE-ML pipelines fed by MAGE-ML export from array manufacturers, LIMS, microarray software and data analysis software. Queries and data analysis (Expression Profiler) are offered over the web (www). Also shown: other public microarray databases (GEO, CIBEX) and external bioinformatic databases.]

Desirable Queries

Show me all experiments where gene x is on the array

Show me all the experiments where organism x is treated by compound Y

Return all experiments using developmental stage X, disease stage Y

– Sort by platform type
– Which are untreated? Treated?
  • Treated by what
  • How comparable are these?
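Once the annotation is held in controlled fields, queries like these reduce to simple filters. The sketch below uses made-up experiment records and hypothetical field names (`organism`, `compound`, `platform`, `genes_on_array`); it does not reflect the ArrayExpress schema.

```python
# Made-up experiment records with controlled, structured annotation.
experiments = [
    {"id": "EXP-1", "organism": "Mus musculus", "compound": "fenofibrate",
     "platform": "spotted", "genes_on_array": {"gene_x", "gene_y"}},
    {"id": "EXP-2", "organism": "Mus musculus", "compound": None,
     "platform": "oligonucleotide", "genes_on_array": {"gene_x"}},
    {"id": "EXP-3", "organism": "Schizosaccharomyces pombe", "compound": None,
     "platform": "spotted", "genes_on_array": {"gene_z"}},
]


def with_gene(gene):
    """All experiments where the gene is on the array."""
    return [e["id"] for e in experiments if gene in e["genes_on_array"]]


def treated_by(organism, compound):
    """All experiments where the organism is treated by the compound."""
    return [e["id"] for e in experiments
            if e["organism"] == organism and e["compound"] == compound]


def untreated(organism):
    """All experiments on the organism with no compound treatment recorded."""
    return [e["id"] for e in experiments
            if e["organism"] == organism and e["compound"] is None]


def sorted_by_platform():
    """Experiment ids ordered by platform type."""
    return [e["id"] for e in sorted(experiments, key=lambda e: e["platform"])]
```

None of these filters would be reliable if `organism` or `compound` were free text, since "mouse", "M. musculus" and "Mus musculus" would all be different strings.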

MAGE-OM and Ontology Entries

ArrayExpress uses the MAGE-OM
Requires OntologyEntries of 3 types
– Simple lists, e.g. image format: GIF, TIFF etc.
– Infinite, e.g. anything that could be meta data relating to a sample: species, disease state
– In between: types of protocols, types of data transformation, types of BioSequence etc.
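The difference between a simple list and an "infinite" category can be sketched as two validation strategies. The `OntologyEntry` class below is a simplified stand-in, not the MAGE-OM definition, and the category names and the rules in `check` are assumptions made for illustration: a simple list is checked against local values, while an open-ended category is only required to point at an external resource.

```python
from dataclasses import dataclass
from typing import Optional

# Simple list: every allowed value can be enumerated locally.
IMAGE_FORMATS = {"GIF", "TIFF"}

# Open-ended categories: values live in an external resource, so we ask for a reference.
EXTERNAL_CATEGORIES = {"Organism", "DiseaseState"}


@dataclass(frozen=True)
class OntologyEntry:
    category: str
    value: str
    source: Optional[str] = None      # name of the external database or ontology
    accession: Optional[str] = None   # identifier within that resource


def check(entry):
    """Return a list of problems with the entry (an empty list means it passes)."""
    problems = []
    if entry.category == "ImageFormat":
        if entry.value not in IMAGE_FORMATS:
            problems.append("value not in the image format list")
    elif entry.category in EXTERNAL_CATEGORIES:
        if entry.source is None or entry.accession is None:
            problems.append("external reference required")
    return problems


ok_format = OntologyEntry("ImageFormat", "TIFF")
bad_format = OntologyEntry("ImageFormat", "BMP")
bare_organism = OntologyEntry("Organism", "Mus musculus musculus")
linked_organism = OntologyEntry("Organism", "Mus musculus musculus",
                                source="NCBI Taxonomy", accession="39442")
```

The "in between" categories would sit between these two branches, for example a local list that curators extend as new terms are submitted.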

ArrayExpress Conceptual Model

[Diagram of the conceptual model. Boxes: Experiment, Sample, Hybridisation, Array, Normalisation, Data, Publication. External links: Source (e.g., Taxonomy), Gene (e.g., EMBL).]

MGED

MGED is the Microarray Gene Expression Database group

It’s a group that develops resources, including the object model and ontologies, for the community

It includes SMD, TIGR, Affy, Agilent etc.

Conference in Aix-en-Provence, Sep 2003

www.mged.org

The MGED Ontology: A framework for describing functional genomics experiments

The MGED Ontology

An ontology for microarray experiments
– Parts are applicable to describing experiments in general with a focus on microarray
– Built by people with legacy data problems: EBI, U.Penn, TIGR, SMD, UC Berkeley, NIH, plus contributions from the mailing list
– Supports the ontology requirements of the MAGE model
Our approach to interfacing with other ontologies is “experimental”
– Provide a framework to point to other ontologies
  • Know where to find different types of annotation
  • How to interpret that annotation

The MGED ontology is not

Limited to any one domain or species
Modelling the real world or reinventing the MAGE model
Mapping terms from external non orthogonal ontologies
Recommending one ontology over another (though some are not freely available)
Just for microarrays; concepts like ‘assay’ also apply to phenotype, and we want to make a reusable resource

Current MGED Ontology

MGED Ontology: BioSequence

Organising by sub-classing

In MAGE Spots relate to BioSequences, these have types: gene, intergenic sequence, clone, PCR product, EST

BioSequenceType

MGED Ontology: limiting redundancy

MGED Ontology: OntologyEntry

Example of External Terms

Ontology in Browseable Form

ArrayExpress

MIAMExpress

RAD

MAGE-ML data exchange

Ontology instances propagated to submission/annotation web forms

Curation of user defined terms, before inclusion in the ontology

User defined terms collected via forms

MGED Ontology

BiomaterialDescription

Sex

Gender

documentation: Subclass of sex applicable to heterogametic species (i.e., those in which the sexes produce gametes of markedly different size). Males produce small numerous gametes. Females produce small numbers of large gametes. Hermaphrodites are individuals with both male and female characteristics. Mixed refers to a population of individuals with more than one type of gender.

used in individuals: female, hermaphrodite, male, mixed_sex, unknown_sex

Turning free text into controlled annotation

Sample source and treatment description, and its correct annotation using the MGED ontology

“Seven week old C57BL/6N mice were treated with fenofibrate. Liver was dissected out, RNA prepared”

©-BioMaterialDescription

©-Biosource Property

©-Organism

©-Age

©-DevelopmentStage

©-Sex

©-StrainOrLine

©-BiosourceProvider

©-OrganismPart

©-BioMaterialManipulation

©-EnvironmentalHistory

©-CultureCondition

©-Temperature

©-Humidity

©-Light

©-PathogenTests

©-Water

©-Nutrients

©-Treatment

©-CompoundBasedTreatment

(Compound)

(Treatment_application)

(Measurement)

MGED BioMaterial Ontology Instances

7 weeks after birth

Female

Charles River, Japan

22 ± 2 °C

55 ± 5%

12 hours light/dark cycle

Specified pathogen free conditions

ad libitum

MF, Oriental Yeast, Tokyo, Japan

in vivo, oral gavage

100mg/kg body weight

External References

NCBI Taxonomy

Mouse Anatomical Dictionary

International Committee on Standardized Genetic Nomenclature for Mice

Mouse Anatomical Dictionary

ChemIDplus

Mus musculus musculus id: 39442

Stage 28

C57BL/6

Liver

Fenofibrate, CAS 49562-28-9
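Put together, the free-text sample description becomes a set of class, value and source triples. The sketch below is an assumption-laden reading of the slide: the pairing of ontology classes with instance values and external references is inferred from the order in which the three columns appear, and `Annotation` and `value_of` are names invented here.

```python
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    ontology_class: str                      # MGED ontology class name from the slide
    value: str                               # instance value, or term from an external resource
    external_source: Optional[str] = None    # resource that owns the term, if any


# Pairings inferred from column order on the slide (an assumption, not confirmed by the source).
sample = [
    Annotation("Organism", "Mus musculus musculus id: 39442", "NCBI Taxonomy"),
    Annotation("Age", "7 weeks after birth"),
    Annotation("DevelopmentStage", "Stage 28", "Mouse Anatomical Dictionary"),
    Annotation("Sex", "Female"),
    Annotation("StrainOrLine", "C57BL/6",
               "International Committee on Standardized Genetic Nomenclature for Mice"),
    Annotation("BiosourceProvider", "Charles River, Japan"),
    Annotation("OrganismPart", "Liver", "Mouse Anatomical Dictionary"),
    Annotation("Compound", "Fenofibrate, CAS 49562-28-9", "ChemIDplus"),
    Annotation("Treatment_application", "in vivo, oral gavage"),
    Annotation("Measurement", "100mg/kg body weight"),
]


def value_of(annotations, ontology_class):
    """Look up the value recorded for one ontology class, or None if it is absent."""
    for a in annotations:
        if a.ontology_class == ontology_class:
            return a.value
    return None


# Classes whose values are delegated to an external resource.
externally_defined = [a.ontology_class for a in sample if a.external_source]
```

A query such as "which organism part was sampled" is now a lookup on `OrganismPart` instead of a text search for the word "liver".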

Care when using ontologies

Ontologies are rarely complete
Ontologies are fit for the purpose for which they are designed – not off the shelf solutions to your problem

Building ontologies is hard - it needs both domain experts and tools

A simple list is an excellent start

Possible solutions

Think about what you want to query – this determines what you will annotate

Look for existing resources and use them if they are appropriate

Do the doable first

Share your resources

Quote

‘Most biologists would rather share their toothbrush than share a gene name’

Michael Ashburner

Acknowledgements

Chris Stoeckert, Trish Whetzel, Joe White, Cathy Ball, Paul Spellman - MGED ontology

Microarray Informatics Team, EBI

Robert Stevens, University of Manchester

Funding: EMBL, EU, ILSI
