Real-life ontology development: lessons from the Gene Ontology

Post on 18-Dec-2015


• What is GO?
• Evolution of GO
• Mechanisms of updating GO
• Tools for ontology development
• Lessons learned


Gene Ontology

• Built for a very specific purpose: “annotation of genes and proteins in genomic and protein databases”
• Applicable to all species


Gene Ontology - scope

• Three disjoint axes:
  – molecular function
    • molecular role, e.g. catalytic activity, binding
  – biological process
    • broad biological phenomena, e.g. mitosis, growth, digestion
  – cellular component
    • sub-cellular location, e.g. nucleus, ribosome, origin recognition complex


Gene Ontology

• Directed acyclic graph (DAG)
• Terms connected by two transitive relations (edges):
  – is_a
  – part_of
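The DAG structure can be sketched in a few lines of Python. This is a minimal illustration, not GO's own tooling: the term names and links in `EDGES` are toy examples chosen for readability, and `build_graph`, `ancestors` and `is_acyclic` are hypothetical helper names.

```python
from collections import defaultdict

# Toy edges as (child, relation, parent). The links are illustrative
# assumptions, not copied from the Gene Ontology itself.
EDGES = [
    ("nucleus", "is_a", "organelle"),
    ("organelle", "is_a", "cellular component"),
    ("nucleolus", "part_of", "nucleus"),
]


def build_graph(edges):
    """Map each child term to its list of (relation, parent) links."""
    parents = defaultdict(list)
    for child, relation, parent in edges:
        parents[child].append((relation, parent))
    return parents


def ancestors(term, parents):
    """Return every term reachable by following is_a/part_of edges upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_acyclic(parents):
    """Depth-first search; meeting an in-progress node means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    state = {}

    def visit(node):
        state[node] = GREY
        for _relation, parent in parents.get(node, []):
            colour = state.get(parent, WHITE)
            if colour == GREY:
                return False
            if colour == WHITE and not visit(parent):
                return False
        state[node] = BLACK
        return True

    return all(
        visit(node) for node in list(parents) if state.get(node, WHITE) == WHITE
    )


graph = build_graph(EDGES)
print(sorted(ancestors("nucleolus", graph)))
print(is_acyclic(graph))
```

Because both relations are transitive, anything annotated to "nucleolus" in this toy graph is implicitly annotated to every term that `ancestors` returns.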


Gene Ontology

• Developed by an international consortium
  – about 50 members
• Editorial office, 4 full-time editors (ish)
• Many other part-time editors at databases
• Multiple changes made a day
  – made live immediately


Gene Ontology

• Main ontology format: OBO flat file
• Changes are live immediately
  – no releases
• Propagated to GO database
  – monthly snapshots archived
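For readers who have not seen an OBO flat file, the sketch below reads `[Term]` stanzas from a small sample. It is a simplified illustration that handles only the `id`, `name`, `is_a` and `relationship: part_of` tags; the `EX:` identifiers and the function name `parse_obo_terms` are made up for this example and do not come from GO.

```python
# A tiny OBO-style sample. The EX: identifiers are hypothetical.
SAMPLE = """\
format-version: 1.2

[Term]
id: EX:0000001
name: organelle

[Term]
id: EX:0000002
name: nucleus
is_a: EX:0000001 ! organelle

[Term]
id: EX:0000003
name: nucleolus
relationship: part_of EX:0000002 ! nucleus
"""


def parse_obo_terms(text):
    """Return {term id: {"name": ..., "is_a": [...], "part_of": [...]}}."""
    terms = {}
    current = None
    for raw in text.splitlines():
        # Drop the trailing "! human-readable comment", if any.
        line = raw.split(" ! ", 1)[0].strip()
        if not line:
            continue
        if line.startswith("["):
            # Only [Term] stanzas are kept; other stanza types are skipped.
            current = {"is_a": [], "part_of": []} if line == "[Term]" else None
            continue
        if current is None or ": " not in line:
            continue  # header lines and unsupported stanzas
        tag, value = line.split(": ", 1)
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["is_a"].append(value)
        elif tag == "relationship":
            parts = value.split()
            if len(parts) == 2 and parts[0] == "part_of":
                current["part_of"].append(parts[1])
    return terms


terms = parse_obo_terms(SAMPLE)
for term_id, term in terms.items():
    print(term_id, term["name"], term["is_a"], term["part_of"])
```

A line-oriented text format like this is part of why changes kept in version control are easy to diff.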


Evolution of GO

• Original GO created in 2000
• Three databases involved:
  – FlyBase (Drosophila)
  – MGI (Mouse)
  – SGD (S. cerevisiae)
• Used immediately


Evolution of GO

• Later databases:
  – TAIR (Arabidopsis)
  – TIGR (microbes including prokaryotes)
  – SWISS-PROT (several thousand species inc. human)
  – PSU (P. falciparum)
• Recent additions
  – ZFIN (zebrafish)
  – PAMGO (plant pathogens)


Evolution of GO

• GO development traditionally annotation-driven
  – development directed by use
• Terms added as new species annotated
• Terms added on an as-needed basis


Evolution of GO

• Resulted in ‘organic’ structure, little formality
• Ontological formality added subsequently
  – philosophical and logical


Growth of GO

[Figure: GO term history 2001 - 2007. Number of terms (axis 0 to 30,000) plotted by date (Jan-01 to Jan-07), broken down into defined terms, undefined terms and obsolete terms.]

Modifying the graph:

• Before:

Modifying the graph:

• But then I need to annotate VW Beetles, pre-1980

• The graph no longer works, because the engine is in the boot

Modifying the graph:

• After:

Mechanisms for ontology change

• Small incremental changes
• Initially all changes to the ontologies made this way


Mechanisms for ontology change

• Suggested changes initially submitted by email
• Moved to an online tracking system when this became unmanageable


Requesting changes to GO - curator requests tracker

• Web-based tracking system hosted at SourceForge.net
• Public
• Tracker item for each new request or question


Curator requests tracker

Mechanisms for ontology change

• Problems:
  – Larger questions about the higher ontology structure remain unresolved
  – Makes some items impossible to close
  – No sense of the ‘big picture’
  – Large areas of the ontologies missing or incomplete because no annotations
  – Massive volume
    • needed to increase the number of editors


Mechanisms for ontology change

• Larger-scale changes:
  – content meetings
  – interest groups


Content meetings

• Short meetings aimed at developing specific areas of GO ontology content
  – proposals refined and discussed before meeting
  – small number of people (10-15)
  – invited experts
  – specific topics


Content meetings

• Further refinements made following meeting by email
• Changes are made once consensus reached
• Large number of terms typically added (500+)


Content meetings

• Recent meetings:
  – immunology
  – interactions between organisms
  – CNS development


Content meetings

• Advantages
  – Allows a lot of detailed work to be done on a very specific area
  – Involves external expertise


Content meetings

• Problems:
  – Expensive - everyone has to be in the same location
  – Only works for very specific topics
  – Long lag time getting terms into ontologies


Interest groups

• Groups of experts for a specific topic
  – e.g. development, cell cycle, plants
• Includes GO curators/annotators and external experts
• Don’t typically meet face to face


Interest groups

• Communicate via email, desktop sharing etc.
• Transporters area of the ontology recently revised this way


Interest groups

• Advantages
  – Cheap, no travel required
  – Allows a lot of detailed work to be done on a very specific area
  – Involves external expertise


Interest groups

• Disadvantages
  – Harder to reach consensus when not face to face
  – Projects tend to drag on


Mechanisms for ontology change

• Systematic changes via small working groups

Systematic changes

• Projects not directly related to biological content
• Systematic changes throughout ontology
• Small group of GO consortium members
  – meets regularly by desktop sharing, voice over IP
• Experts recruited to meetings as needed


Systematic changes

• Changes either
  – made on a branch of the ontology and merged in later
    • always have big problems merging branched file into main file
  – merged directly into live ontology after session
    • fast, but people get angry


is_a complete

• GO contains both is_a and part_of relations
• Typically, graphs were a mixture of incomplete is_a and part_of hierarchies
• A result of ‘organic’ evolution of GO
• All graphs now have complete is_a paths to root
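The "is_a complete" property can be checked mechanically. The sketch below is one possible check, not the procedure the GO editors used: it walks down from the root along is_a links only and reports any term it never reaches. The toy terms and the function name `is_a_incomplete_terms` are assumptions for illustration.

```python
from collections import defaultdict, deque


def is_a_incomplete_terms(terms, roots):
    """Return terms with no path to any root that uses is_a edges alone.

    `terms` maps a term name to {"is_a": [...], "part_of": [...]}.
    """
    # Invert the is_a links so we can walk from the roots downward.
    children = defaultdict(list)
    for term, links in terms.items():
        for parent in links.get("is_a", []):
            children[parent].append(term)

    reached = set(roots)
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in reached:
                reached.add(child)
                queue.append(child)
    return sorted(set(terms) - reached)


# Toy data (illustrative links only): "nucleolus" has a part_of parent
# but no is_a parent, so its is_a path to the root is missing.
toy_terms = {
    "cellular component": {},
    "organelle": {"is_a": ["cellular component"]},
    "nucleus": {"is_a": ["organelle"]},
    "nucleolus": {"part_of": ["nucleus"]},
}

print(is_a_incomplete_terms(toy_terms, roots=["cellular component"]))
```

Giving "nucleolus" an is_a parent that already reaches the root would empty the list, which is the state the slide describes for all three graphs.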


partial disjointness

• Biological process terms organised by granularity:
  – cellular process
  – multicellular organism process
  – multi-organism process
• To avoid massive increase in number of paths to root, these terms are disjoint
  – no is_a children in common
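A disjointness rule like this is easy to test automatically. The sketch below is a hedged illustration: it reads "children" as all is_a descendants (an assumption; the slide may mean direct children only), and the placeholder terms "term A", "term B", "term C" and the helper names are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def is_a_descendants(root, children):
    """Collect every term below `root` via is_a links."""
    seen, stack = set(), [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def disjointness_violations(is_a_edges, disjoint_terms):
    """Map each pair of disjoint terms to the descendants they share."""
    children = defaultdict(list)
    for child, parent in is_a_edges:
        children[parent].append(child)
    below = {term: is_a_descendants(term, children) for term in disjoint_terms}
    return {
        (a, b): sorted(below[a] & below[b])
        for a, b in combinations(disjoint_terms, 2)
        if below[a] & below[b]
    }


DISJOINT = [
    "cellular process",
    "multicellular organism process",
    "multi-organism process",
]

# Placeholder (child, parent) is_a edges; "term B" breaks the rule.
toy_edges = [
    ("term A", "cellular process"),
    ("term B", "cellular process"),
    ("term B", "multicellular organism process"),
    ("term C", "multi-organism process"),
]

print(disjointness_violations(toy_edges, DISJOINT))
```

An empty result means the three granularity terms share nothing beneath them, so no term gains extra paths to the root through them.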


sensu

• sensu (meaning ‘in the sense of’) used to disambiguate, by taxonomic group, terms with identical strings but different meanings
• e.g. sporulation (sensu Viridiplantae) vs. sporulation (sensu Bacteria)


sensu

• Current project to remove the sensu term strings
• Replace with strings that represent the true differentiae
• e.g.
  – cell wall (sensu Bacteria) -> peptidoglycan-based cell wall
  – cell wall (sensu Fungi) -> chitin- and beta-glucan-containing cell wall
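Finding the sensu terms is the mechanical part of such a project; choosing the new names needs a curator. The sketch below assumes term names follow the "name (sensu Taxon)" pattern shown on the slides. The two renames are the ones given above; the regular expression and the function name `plan_sensu_renames` are hypothetical.

```python
import re

# Matches names of the form "<base> (sensu <Taxon>)".
SENSU = re.compile(r"^(?P<base>.+?) \(sensu (?P<taxon>[^)]+)\)$")

# Curated replacements, taken from the examples on the slide.
RENAMES = {
    "cell wall (sensu Bacteria)": "peptidoglycan-based cell wall",
    "cell wall (sensu Fungi)": "chitin- and beta-glucan-containing cell wall",
}


def plan_sensu_renames(names, renames=RENAMES):
    """Split sensu terms into those with a curated rename and those without."""
    planned, unresolved = {}, []
    for name in names:
        if not SENSU.match(name):
            continue  # not a sensu term, leave untouched
        if name in renames:
            planned[name] = renames[name]
        else:
            unresolved.append(name)  # still needs a curator's decision
    return planned, unresolved


names = [
    "cell wall (sensu Bacteria)",
    "cell wall (sensu Fungi)",
    "sporulation (sensu Viridiplantae)",
    "mitosis",
]
planned, unresolved = plan_sensu_renames(names)
print(planned)
print(unresolved)
```

The `unresolved` list is the work queue: terms whose true differentiae have not yet been written down.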


Systematic changes to GO

• Advantages
  – Fast
  – Efficient
  – Small number of people required

Systematic changes to GO

• Disadvantages
  – Difficult to obtain wider consensus
  – Changes sometimes have to be undone

Useful tools for ontology development

• WebEx
  – desktop sharing, can control each other's desktops
• wiki
  – mainly internal
• Skype
  – free international calls!
• conference calls
  – not free


Tracking changes to GO

• General tracking
  – files stored in CVS, all differences trackable (in theory)
  – far from ideal: a frequent discussion is whether we should track history and date-stamp terms


Tracking changes to GO

• Obsolete terms
  – formerly stored within the ontology
  – in OBO format made a special kind of deprecated term (tag is_obsolete)
  – ‘replaced_by’ and ‘consider’ tags soon to be created to point to live terms
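One way the proposed tags could be used by downstream software is sketched below. It is an assumption about usage, not a description of any GO tool: terms are plain dictionaries, the `EX:` identifiers are made up, and `resolve_obsolete` is a hypothetical helper that follows `replaced_by` to a live term or falls back to the `consider` suggestions.

```python
def resolve_obsolete(term_id, terms, max_hops=10):
    """Return (live term id or None, list of 'consider' suggestions)."""
    current = term_id
    for _ in range(max_hops):
        term = terms[current]
        if not term.get("is_obsolete"):
            return current, []  # already a live term
        replacement = term.get("replaced_by")
        if replacement is None:
            # No exact replacement: hand back suggestions for a curator.
            return None, list(term.get("consider", []))
        current = replacement
    raise ValueError(f"replaced_by chain from {term_id} exceeds {max_hops} hops")


# Hypothetical terms for illustration only.
toy_terms = {
    "EX:1": {"name": "old term", "is_obsolete": True, "replaced_by": "EX:2"},
    "EX:2": {"name": "live term"},
    "EX:3": {
        "name": "vague old term",
        "is_obsolete": True,
        "consider": ["EX:2", "EX:4"],
    },
    "EX:4": {"name": "another live term"},
}

print(resolve_obsolete("EX:1", toy_terms))
print(resolve_obsolete("EX:3", toy_terms))
```

The split matters for annotation maintenance: a `replaced_by` target can be applied automatically, while `consider` targets need a human to choose.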


Tracking changes to GO

• Crediting experts
  – traditionally no mechanism for doing this
  – creating abstracts for content meetings, adding tag to term
  – as yet no mechanism for crediting individuals


Useful tools for ontology development

• OBO-Edit
  – ontology editor originally developed for GO
  – can be used for any OBO format ontology
  – developed by group of users


Useful tools for ontology development

• Reasoner integrated into OBO-Edit
  – based on OBOL
  – detects missing links, redundant links
  – soon: misplaced terms, automatic term creation
• Validation system
  – typographical errors, is_a orphans, duplicate synonyms etc.
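To make "redundant links" concrete: an is_a link is redundant if the parent can still be reached from the child by a longer is_a path. The sketch below illustrates that idea on toy data. It is not the OBO-Edit reasoner and makes no claim about how OBOL works; the function name and the example links are assumptions.

```python
def redundant_is_a_links(is_a_parents):
    """Return (child, parent) links implied by a longer is_a path."""

    def reachable(start, skip_edge):
        # Everything above `start`, ignoring the one edge under test.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for parent in is_a_parents.get(node, []):
                if (node, parent) == skip_edge or parent in seen:
                    continue
                seen.add(parent)
                stack.append(parent)
        return seen

    return sorted(
        (child, parent)
        for child, parents in is_a_parents.items()
        for parent in parents
        if parent in reachable(child, skip_edge=(child, parent))
    )


# Toy links (illustrative): the direct nucleus -> cellular component link
# is already implied by nucleus -> organelle -> cellular component.
toy_links = {
    "nucleus": ["organelle", "cellular component"],
    "organelle": ["cellular component"],
}

print(redundant_is_a_links(toy_links))
```

Removing the reported link leaves the set of ancestors unchanged, which is why a tool can flag such links safely.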


Lessons learned

• An ontology doesn’t have to be perfect or complete to be used
• For domain ontologies, external experts should be involved
• Communication is critical
• You will never please everyone
