principles for building biomedical ontologies suzanna lewis national center biomedical ontology 22...

Post on 16-Jan-2016

219 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Principles for Building Biomedical

Ontologies Suzanna Lewis

National Center Biomedical Ontology22 October 2005

Advanced Bioinformatics, Cold Spring Harbor

National Center Biomedical Ontology

http://bioontology.org/ Mark Musen

Suzanna Lewis

Barry Smith

Sima Misra

Daniel Rubin

Michael Ashburner

Monte Westerfield

Ida Sim

PI & Core 1: computer science (SMI) Co-PI & Core 2: bioinformatics (BiKR;

GO) Core 6: Outreach and training (ECOR) Associate Program Director Program Director Core 3: Phenotype Project (Cambridge;

FlyBase; and GO) Core 3: Phenotype Project (UOregon; PI

of ZFIN) Core 3: HIV clinical trials Project

(UCSF)

BiKRs

Sima Misra Shu Shengqiang Christopher J. Mungall

Nomi Harris John Day-Richter Karen Eilbeck Mark Gibson

Outline for the Morning

A definition of “ontology” Four sessions:

Organizational Challenges Principles for Ontology Construction

Case Studies from the GO Case Studies for group discussion.

My newbie questions

What data is missing?

What I’ve heard

Where is the data generated?

What is the motivation?

How will it be gathered?

Organism, environment, data quality and attribution TIGR, Sanger, JGI, and coming soon to a 954 near you!

Still an issue. Low threshold of effort relative to benefits of complying Data it is accumulating on disks across the world and we’d like to be able to locate and use it

The hardest part: Sharing (semantics)

handy ontology tells us what’s there…

Where should I eat…?

Ontologies help with decision making

Type of cuisine (Presumable) country of origin

Ontologies don’t just organize data; they also facilitate inference, and that creates new knowledge, often unconsciously in the user.

Where delicatessen food hails from…

‘Frozen Yogurt’ cuisine in search of a national identity?

What a computer would likely infer about the world from this helpful ontology:

Flag of fresh juiceFresh Juice is a national cuisine…

Ontology is all about meaning

Communities form (scientific) theories that seek to explain all of the existing evidence

and can be used for prediction We make inferences and decisions based upon what we know about (biological) reality.

Make our meanings clear enough for a

computer to understand An ontology is a computable representation of this underlying (biological) reality.

An ontology enables a computer to reason over the data in (some of) the ways that we do particularly to query and locate relevant data.

A shared, common, backbone taxonomy of relevant entities, and the relationships between them, within an application domain. Referred to by information scientists as an ’Ontology'.

But really…

What is an Ontology? From Aristotle to Artificial Intelligence

It is ”a formalism of what exists” Follows formal rules for creating definitions originally laid down by Aristotle.

A definition is: the specification of the essence (nature, invariant structure) shared by all the members of a class or natural kind.

The Aristotelian Methodology

Topmost nodes are the undefinable primitives.

The definition of a class lower down in the hierarchy is provided by specifying the parent of the class together with the relevant differentia.

Differentia tells us what marks out instances of the defined class within the wider parent class as in

Plasma membrane is a cell part [immediate parent] that surrounds the cytoplasm [differentia]

Siamese

mammal

cat

organism

Physical object (substance)

classes

animal

instances

frogleaf class

all members of the class frog share a froggy nature

Thorax

Lung

Heart

Cell

Anatomical structures

Cornelius Rosse

Content of FMA

Challenge:Duplicate graphical model in symbolic model

Adapted fromAdapted fromBloom & Fawcett: Bloom & Fawcett:

Textbook of Textbook of Histology Histology

1994 12th ed1994 12th edChapman & HallChapman & Hall

Universals or classes:Kinds of anatomical entities

Content of FMA

1. Organizational Challenges

http://obo.sourceforge.net

So you want an ontology…

What do you have to do to make/get/use/steal/beg one?

Why Survey

Improve

Domain covered

?

Public?

Active?

Applied?

Community?

DevelopSalvage

Collaborate & Learn

yes

no

What you must do

Justify exactly why there is a need Scope it very, very tightly

Communicate with people

The decisions you must make

What domain does it cover? It is privately held? Is it active? Is it applied?

Why Survey

Improve

Domain covered

?

Public?

Active?

Applied?

Community?

DevelopSalvage

Collaborate & Learn (Listen to Barry)

yes

no

Due diligence & background research

Step 1: Learn what is out there The most comprehensive list is on the OBO site. http://obo.sourceforge.net

Assess ontologies critically and realistically.

Make contact

Why Survey

Improve

Domain covered

?

Public?

Active?

Applied?

Community?

DevelopSalvage

Collaborate & Learn (Listen to Barry)

yes

no

Ontologies must be shared

Proprietary ontologies Belief that ownership of the terminology gives the owners a competitive edge

For example, Incyte or Monsanto in the past, SNOMED for non-US.

Data cannot be shared if the ontologies describing the data are not shared.

Don’t reinvent—Use the power of combination and collaboration

Why Survey

Improve

Domain covered

?

Public?

Active?

Applied?

Community?

DevelopSalvage

Collaborate & Learn (Listen to Barry)

yes

no

Pragmatic assessment of an ontology

Is there access to help, e.g.:help-me@weird.ontology.net ?

Does a warm body answer help mail within a ‘reasonable’ time—say 2 working days ?

Why Survey

Improve

Domain covered

?

Public?

Active?

Applied?

Community?

DevelopSalvage

Collaborate & Learn (Listen to Barry)

yes

no

Use it to improve it

Every ontology improves when it is applied to actual data

It improves even more when these data are used to answer questions

There will be fewer problems in the ontology and more commitment to fixing remaining problems when important research data is involved that scientists depend upon

Be very wary of ontologies that have never been applied

Work with that community To improve (if you found one) To develop (if you did not)

Getting it right It is impossible to get it right the 1st (or 2nd, or 3rd, …) time.

What we know about reality is continually growing

Improve

Collaborate and Learn

Implication: “prepare for change”

Establish a mechanism for change. Use CVS or Subversion. Changes must be reviewed by experts

Unique Identifiers Versions Archives

Ontology development is hard

Have a stake in seeing it work. Have broad, detailed domain knowledge.

Will engage in vigorous debate without engaging egos.

Will do concrete work and attend frequent working sessions (quarterly), phone conferences (weekly), e-mail correspondence (daily).

2. Principles for Ontology Construction

Why do we need rules for good ontology?

Ontologies must be intelligible to humans (for annotation) and to machines (for reasoning and error-checking)

Unintuitive rules for classification lead to entry errors (problematic links)

Facilitate training of curators Overcome obstacles to alignment with other ontology and terminology systems

Enhance harvesting of content through automatic reasoning systems

Following basic rules makes more useful ontologies

Aristotle’s categoriesThis is Aristotle’s list of types of predication, that is, the different ways in which things can be said to be. He identifies 10 mutually exclusive categories.

1. Substance.2. Quantity.3. Quality.4. Relation.5. Location.6. Time.7. Position.8. Possession. 9. Doing.10.Undergoing.

SNOMED-CT Top Level Substance Body Structure Specimen Context-Dependent

Categories* Attribute Finding* Staging and Scales Organism Physical Object

Events Environments and

Geographic Locations Qualifier Value Special Concept* Pharmaceutical and

Biological Products Social Context Disease Procedure Physical Force

Examples of Rules

Don’t confuse instances with universals Your navel (instance) is not the abstract representation of all navels

Your microarray result is not the abstract representation of all microarray results

The meaning of an ontology should not change when the programming language changes

First Rule: Univocity

Terms (including those describing relations) should have the same meanings on every occasion of use.

In other words, they should refer to the same kinds of instances in reality

Example of univocity problem in case of part_of relation

(Old) Gene Ontology: ‘part_of’ = ‘may be part of’

flagellum part_of cell ‘part_of’ = ‘is at times part of’

replication fork part_of the nucleoplasm

‘part_of’ = ‘is included as a sub-list in’

Second Rule: Positivity

Complements of classes are not themselves classes.

Terms such as ‘non-mammal’, or ‘non-frog’, or ‘non-membrane’ do not designate genuine classes.

Third Rule: Objectivity

Which classes exist is not a function of our biological knowledge.

Terms such as ‘unknown’ or ‘unclassified’ do not designate biological natural kinds.

Fourth Rule: Single Inheritance

No class in a classificatory hierarchy should have more than one is_a parent on the immediate higher level

I.e. no diamonds

Cis_a2

Bis_a1

A

Following the single inheritance rule

The position of a term within the hierarchy enriches its own definition by incorporating automatically the definitions of all the terms above it.

The entire information content of the term hierarchy can be translated very cleanly into a computer representation

B C

is_a1 is_a2

A

‘is_a’ no longer univocal

Problems with multiple inheritance

Fifth Rule: Clarity of Text Definitions

The terms used in a definition should be simpler (more intelligible) than the term to be defined

otherwise the definition provides no assistance to human understanding

Machines can cope with the full formal representation (it doesn’t need the text)

Sixth Rule: Basis in Reality

When building or maintaining an ontology, always think carefully about how classes (types, kinds, species) relate to instances in reality

Axioms governing instances Every class has at least one instance (exceptions will occur at top levels)

Each child class has a smaller collection of instances than its parent class

Axiom: Every parent class has at least two children

The reason that rules are important:

Interoperability Ontologies should work together Avoid redundancy in ontology building

Support reuse Ontologies should be capable of being used by other ontologies (cumulation)

The problem of ontology re-use

SNOMEDMeSHUMLSNCIT HL7-RIM …

None of these have clearly defined relations

Still remain too much at the level of TERMINOLOGY

Not based on a common set of rules

Not based on a common set of relations

An example of unclear relationship use

A is_a B ‘A’ is more specific in meaning than ‘B’

HL7-RIM: Individual Allele is_a Act of Observation

cancer documentation is_a cancer disease prevention is_a disease

How to define A is_a B

A is_a B = def.

• A and B are names of universals (natural kinds, types) in reality

• all instances of A are as a matter of biological science also instances of B

Benefits of well-defined relationships

If the relations in an ontology are well-defined, then reasoning can cascade from one relational assertion (A R1 B) to the next (B R2 C). Relations used in ontologies thus far have not been well defined in this sense.

Find all DNA binding proteins should also find all transcription factor proteins because Transcription factor is_a DNA binding protein

Biomedical data integration /

interoperability Will never be achieved through integration of meanings or concepts

The problem: different user communities use different concepts

What is really needed is a well-defined, commonly used set of relationships

Seventh Rule: Distinguish Universals

and Instances A good ontology must distinguish clearly between universals (types, kinds, classes)

and instances (tokens, individuals, particulars)

Why distinguish classes from instances?

What holds on the level of instances may not hold on the level of universals

For example, my definition of an “adjacent_to” relation requires that it work in either direction

(This particular) nucleus adjacent_to (this particular) cytoplasm Always true

Cytoplasm adjacent_to nucleus Not always true

Using relations

Between classes: is_a, part_of, ...

Between an instance and a class: this explosion instance_of the class explosion

Between instances: Mary’s heart part_of Mary

Relations must be defined to always work

Defining the part_of relation can be a

problem part_of as a relation between classes versus part_of as a relation between instances

nucleus part_of cell (classes) your heart part_of you (instances)

testis part_of human being ? heart part_of human being ? human being has_part human testis ?

Similar considerations are required to clearly define

nearly all relations A causes B A is_located in B A is_adjacent_to B A derives_from B

Zygote derives_from ovum, sperm

A transformation_of B Adult transformation_of child

The Rules1. Univocity: Terms should have the same

meanings on every occasion of use2. Positivity: Terms such as ‘non-mammal’ or

‘non-membrane’ do not designate genuine classes.

3. Objectivity: Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate biological natural kinds.

4. Single Inheritance: No class in a classification hierarchy should have more than one is_a parent on the immediate higher level

5. Intelligibility of Definitions: The terms used in a definition should be simpler (more intelligible) than the term to be defined

6. Basis in Reality: When building or maintaining an ontology, always think carefully at how classes relate to instances in reality

7. Distinguish Classes and Instances

Some rules are Rules of Thumb

The world is full of difficult trade-offs

The benefits of formal (logical and ontological) rigor need to be balanced Against the constraints of computer tractability,

Against the needs of biomedical practitioners.

BUT do the very best you can!

3. Case Studies from the GO

http://www.geneontology.org

How has GO dealt with some specific aspects of ontology

development? Univocity Positivity Objectivity Definitions

Formal definitions Written definitions

Ontology Re-use (Alignment)

Tactile senseTactionTactition

?

The Challenge of Univocity:People call the same thing by

different names

Tactile senseTactionTactition

perception of touch ; GO:0050975

Univocity: GO uses one term and many characterized

synonyms

= bud initiation

= bud initiation

= bud initiation

The Challenge of Univocity: People use the same words to describe different things

Bud initiation? How is a computer to know?

= bud initiation

sensu Metazoa

= bud initiation

sensu Saccharomyces

= bud initiation

sensu Viridiplantae

Univocity: GO adds “sensu” descriptors to discriminate

among organisms

The Challenge of Positivity

Some organelles are membrane-bound.A centrosome is not a membrane bound organelle,but it still may be considered an organelle.

The Challenge of Positivity: Sometimes absence is a

distinction in a Biologist’s mind

non-membrane-bound organelle

GO:0043228 membrane-bound organelle

GO:0043227

Positivity

Note the logical difference between “non-membrane-bound organelle” and “not a membrane-bound organelle”

The latter includes everything that is not a membrane bound organelle!

The Challenge of Objectivity: Database users want to know if

we don’t know anything (Exhaustiveness with respect

to knowledge)

We don’t know anything about a gene product with

respect to these

We don’t know anything about the ligand that

binds this type of GPCR

Objectivity

How can we use GO to annotate gene products when we know that we don’t have any information about them? Currently GO has terms in each ontology to describe unknown

An alternative might be to annotate genes to root nodes and use an evidence code to describe that we have no data.

Similar strategies could be used for things like receptors where the ligand is unknown.

GPCRs with unknown ligands

We could annotate to

this

GO DefinitionsA definition written by

a biologist:necessary & sufficient

conditions written definition(not computable)

Graph structure: necessary conditions

formal(computable)

Relationships and definitions

Important considerations: Placement in the graph- selecting parents

Appropriate relationships to different parents

True path violation

True path violationWhat is it?

..”the path from a child term all the way up to its top-level parent(s) must always be true".

chromosome

Mitochondrial chromosome

Is_a relationship

Part_of relationship

nucleus

True path violationWhat is it?

nucleus chromosome

Nuclear chromosome

Mitochondrial chromosome

Is_a relationshipsPart_of relationship

The Importance of synonyms:is tRNA a function?

Molecular_function

Triplet codon amino acid adaptor activity

GO Definition: Mediates the insertion of an amino acid at the correct point in the sequence of a nascent polypeptide chain during protein synthesis.

Synonym: tRNA

Ontology integrationOne of the current goals of GO

is integration

cone cell fate commitment

retinal_cone_cell keratinocyte differentiation keratinocyte adipocyte differentiation fat_cell

dendritic cell activation dendritic_cell

lymphocyte proliferation lymphocyte

T-cell homeostasis T_lymphocyte

garland cell differentiation garland_cell

heterocyst cell differentiation heterocyst

References to Cell Types in GO

Cell Types in the Cell Ontology

with

We can integrate the GO with other ontologies

Chemical ontologies 3,4-dihydroxy-2-butanone-4-phosphate synthase activity

Anatomy ontologies metanephros development

GO itself mitochondrial inner membrane peptidase activity

Nota bene: some time and effort will be required

Building Ontology

Improve

Collaborate and Learn

Applied Ontology: a summary

Dedicated editors Practice good ontological hygiene Engage the community Reward compliance and get the ontology into use

Plan for change over time KISS: Concentrate on what you can definitely agree upon: the steps you can take with certainty.

4. Case Studies for group discussion

mitosis and meiosis It's been a full lunar cycle since we last talked about this

on the mailing list, and I would like to draw everyone's attention once again to the exciting topics of chromosome segregation, nuclear division and cell division. The basic problem is the multiplicity of meanings attached to 'mitosis'. The word are used in the literature and colloquially to represent everything from chromosome segregation up to a full round of nuclear and cell division and there is no consensus on how to define it in scientific or general dictionaries (check www.onelook.com for proof). To compound the problem, the only process common to all species which undergo 'mitosis' is chromosome segregation; not all species undergo nuclear division or cell division during the processes described in the literature as 'mitosis'. In the ontologies, we currently have 'mitosis' defined as chromosome segregation and nuclear division. This is therefore wrong for those species in which there is no nuclear division accompanying chromosome segregation. How are we going to define mitosis?

Events of the mitotic cell cycle that need to be represented: mitotic chromosome segregation mitotic nuclear division mitotic cell division

Only component common to all these is mitotic chromosome segregation.

Structure must be flexible enough to accommodate any of the flavors of 'mitosis’, no matter what the species and no matter whether the annotator has read the definition or not.

Backing up assertions

QUESTION: What evidence code is appropriate to use for statements of “common knowledge”?

The current documentation states that TAS may be used as the evidence code for statements of common knowledge. For example, let’s say you have a paper that says that Protein X is an xxxxx , with a direct assay for activity, so you can use IDA for this function term. Then it also makes a mutation in the gene for Protein X and shows that it is involved in process yyyy, so you can use IMP for the process term. But, the paper does not have any direct evidence about the localization of Protein X. However, everyone knows that process yyyy occurs in the cytoplasm, so you can annotate protein X to the component term “cytoplasm ; GO:5737” by TAS using a general reference like Biochemistry by Lupert Stryer.

There is not really a traceable statement in Stryer providing evidence that process yyyy occurs in this location in yeast.

SGD feels that it is better to use the newer evidence code IC for these “common” knowledge types of annotations. Thus, if an SGD curator felt that it was reasonable to make the annotation “cytoplasm” based on the knowledge that Protein X the process annotation yyyy, then the curator could assign the component term “cytoplasm ; GO:5737” using IC and the GOid of the process term yyyyy.

many of these “common knowledge” types of statements are often not well based in actual experiments conducted on the organism of interest, that early biochemists would often perform experiments with materials that were easy to obtain, e.g. calf thymus, and assume that this accurately represented the situation for another organism, e.g. human. This may or may not be the case.

What is the most appropriate GO term for annotating a

response to methylmercury? "Response to mercury ion" doesn't seem quite right, as it specifically states that the response is "as a result of exposure to mercuric ions (Hg2+)", but the more general-sounding "response to mercury" is a synonym of it. In the publication I am working on, they exposed zebrafish to methylmercury and documented the resulting changes in gene expression.

"Response to mercury ion

Definition: A change in state or activity of the organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of exposure to mercuric ions (Hg2+).

Synonyms: response to mercuric, response to mercury

Bloggers and other online groups (eg. del.icio.us, Flickr [online photo archive], Technorati) have been self-categorizing or 'tagging' web sites and their content using user-defined words and phrases and not an expertly curated vocabulary or ontology. The end result is that a vast amount of content has been indexed using a rich vocabulary of tags (to date, technorati has over 1.2 billion links tagged with 1.2 million tags).

Whilst this certainly lacks the formal consistency that would be obtained with curated annotation against a standard vocabulary, the quantity of content being categorized far exceeds what could be done by a group of annotators and perhaps is richer because the tags are defined by the users and creators of that content, not by a third party interpreting the material after the fact.

Given the ever increasing quantity of scientific data, the proliferation of online publishing, etc., could scientists tagging their own data with their own terms be the way to go?

How can you recruit and train people, in both logic and biology, given that without a sufficient number of competent personnel the ontology cannot be maintained?

Thanks to NIH and HHMI for funding and

supportAnd to my fantastic

colleagues (whose slides these are)

MICHAEL ASHBURNTER, BARRY SMITH, DAVID HILL,

CORNELIUS ROSSE & CHRIS MUNGALL

P.S. Graphical User Interfaces

Semantics

Common pitfalls

Don’t confuse instances with artifacts of your database representation...

part_of

part_of must be time-indexed for spatial classes

A part_of B is defined as: Given any instance a and any time t, If a is an instance of the universal A at t,

then there is some instance b of the universal B

such that a is an instance-level part_of b at t

C

c at t

C1

c1 at t1

C'

c' at t

time

instances

derives_from

derives_fromovumsperm

zygote

c at t1

C

c at t

C1

time

same instance

transformation_of

pre-RNA mature RNA

adultchild

C2 transformation_of C1 is defined as Given any instance c of C2

c was at some earlier time an instance of C1

embryological development C

c at t c at t1

C1

C

c at t c at t1

C1

tumor development

Key

In the following discussion: Classes are in upper case

‘A’ is the class Instances are in lower case

‘a’ is a particular instance

Placement in the graph

Example- Proteasome complex

top related