the language of the gene ontology

The Language of the Gene Ontology Robert Stevens Bio-Health Informatics Group The University of Manchester Manchester United Kingdom [email protected]

Upload: robertstevens65

Post on 21-May-2015


DESCRIPTION

Invited talk at Dept of Computer Science, Open university, 2010

TRANSCRIPT

Page 1: The Language of the Gene Ontology

The Language of the Gene Ontology

Robert Stevens
Bio-Health Informatics Group
The University of Manchester
Manchester
United Kingdom

[email protected]

Page 2: The Language of the Gene Ontology

Overview

• Annotation of biological data using ontologies
• Communicating between annotator and user
• Least effort, power laws and quality metrics
• The analysis of GOA corpora
• Interpreting the results

Page 3: The Language of the Gene Ontology

Names in biology

Some category of protein

U2-type nuclear mRNA 5' splice site recognition

spliceosomal E complex formation

spliceosomal E complex biosynthesis

spliceosomal CC complex formation

U2-type nuclear mRNA 5'-splice site recognition

Page 4: The Language of the Gene Ontology

De facto Integration with Ontologies
• Agreement on the entities in biology to be described
• Describe those entities with ontologies
• Label data entities with the terms of those ontologies
• Ontology building is now a mainstream activity within biology
• Ontologies built by and for biologists

Page 5: The Language of the Gene Ontology

Genotype Phenotype

Sequence

Proteins

Gene products Transcript

Pathways

Cell type

BRENDA tissue / enzyme source

Development

Anatomy

Phenotype

Plasmodium life cycle

- Sequence types and features
- Genetic context

- Molecule role
- Molecular function
- Biological process
- Cellular component

- Protein covalent bond
- Protein domain
- UniProt taxonomy

- Pathway ontology
- Event (INOH pathway ontology)
- Systems biology
- Protein-protein interaction

- Arabidopsis development
- Cereal plant development
- Plant growth and developmental stage
- C. elegans development
- Drosophila development (FBdv)
- Human developmental anatomy, abstract version
- Human developmental anatomy, timed version

- Mosquito gross anatomy
- Mouse adult gross anatomy
- Mouse gross anatomy and development
- C. elegans gross anatomy
- Arabidopsis gross anatomy
- Cereal plant gross anatomy
- Drosophila gross anatomy
- Dictyostelium discoideum anatomy
- Fungal gross anatomy (FAO)
- Plant structure
- Maize gross anatomy
- Medaka fish anatomy and development
- Zebrafish anatomy and development

- NCI Thesaurus
- Mouse pathology
- Human disease
- Cereal plant trait
- PATO (attribute and value)
- Mammalian phenotype
- Habronattus courtship
- Loggerhead nesting
- Animal natural history and life history

eVOC (Expressed Sequence Annotation for Humans)

Page 6: The Language of the Gene Ontology

Gene Ontology http://www.geneontology.org

“a dynamic controlled vocabulary that can be applied to all eukaryotes”

Built by the community for the community.

• Three organising principles: Molecular function, Biological process, Cellular component
• Describes kinds of things and parts of things
• Describes ~25,000 things

Page 7: The Language of the Gene Ontology

Annotating Biological Data

• Some 40 species genome databases are now annotated with GO

• http://www.geneontology.org/GO.current.annotations.shtml

• 395173 species specific, non-redundant genes/gene products annotated

• 7718253 annotations in total

Page 8: The Language of the Gene Ontology

GO associations

• CYP 51

Page 9: The Language of the Gene Ontology

GO associations

CYP 51

GO:0020037 : heme binding

GO:0005506 : iron ion binding

GO:0004497 : monooxygenase activity

Page 10: The Language of the Gene Ontology

GO Evidence Codes

• Each annotation is given an evidence code
• Codes broadly divide into “computational inference” and “experimental inference”
• Can partition GO annotated data into “high” and “low” confidence annotations
• Not directly a measure of quality
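The partition described above can be sketched in a few lines of Python. This is only an illustration: the slides do not list which codes went into each set, so treating the GO experimental codes (EXP, IDA, IPI, IMP, IGI, IEP) as "high" confidence and everything else as "low" is an assumption, and the gene product names and evidence codes in the example data are made up.

```python
from collections import namedtuple

# One GO association: a gene product, a GO identifier and an evidence code.
Annotation = namedtuple("Annotation", ["gene_product", "go_id", "evidence"])

# Assumption: experimental evidence codes count as "high" confidence.
# The talk's actual split is defined in its materials and methods, not here.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def partition_by_confidence(annotations):
    """Split annotations into (high, low) confidence lists by evidence code."""
    high, low = [], []
    for ann in annotations:
        if ann.evidence in EXPERIMENTAL_CODES:
            high.append(ann)
        else:
            low.append(ann)
    return high, low


# Hypothetical annotations, for illustration only.
example = [
    Annotation("geneA", "GO:0020037", "IDA"),
    Annotation("geneA", "GO:0005506", "IEA"),
    Annotation("geneB", "GO:0004497", "ISS"),
]
high, low = partition_by_confidence(example)
```

Each partition can then be analysed as its own corpus of GO identifiers.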

Page 11: The Language of the Gene Ontology

Zipf’s Law (1934, 1949)

• Frequency of a word in a corpus is inversely proportional to its rank
– Most popular word occurs twice as frequently as the next most popular word, which itself occurs twice as frequently as the fourth most popular
– Power law distributions are seen in many natural and social situations
• This distribution is a characteristic of human language
– Plot of log frequency against log rank
– The slope β gives information about the language used in the corpus
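As a rough sketch of the log-log plot described above, the following Python builds a rank-frequency list from a corpus of tokens and fits a straight line to log frequency against log rank. The function names are illustrative, and the ordinary least-squares fit is used only to show the idea; it is not the fitting procedure used later in the talk. The toy corpus is constructed to follow an ideal Zipf distribution (frequency proportional to 1/rank).

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent token."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy corpus with frequencies 120, 60, 40, 30, that is 120 / rank.
tokens = []
for rank, freq in enumerate([120, 60, 40, 30], start=1):
    tokens += [f"term{rank}"] * freq

pairs = rank_frequency(tokens)
slope = loglog_slope(pairs)  # close to -1 for an ideal Zipf corpus
```

For GO annotation, the tokens would be the GO identifiers used in an annotation file rather than words.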

Page 12: The Language of the Gene Ontology

The Communication Process

[Diagram: Source → Encoding → Channel → Decoding → Receiver, with labels "Biology", "Encoded Message" and "Decoded Message"]

Source = Annotators

Receiver = User of Annotation

Page 13: The Language of the Gene Ontology

Principle of Least Effort
• Effort is expended in passing a message from encoder to decoder
• Aim: maximum information transfer with minimum effort
• A rich language that precisely defines the message is hard work to encode and should be easier to decode
• The steeper the slope (β), the richer the message and the more effort involved
• A value of β of about 2 is roughly optimal
• Does GO annotation behave like messages in a language?
• Looking at β might tell us about annotation quality – how well is the message transferred?

Page 14: The Language of the Gene Ontology

[Figure: Effort in encoding and decoding annotation for Integrin Alpha8 protein. Effort (low to high) for speaker and listener when annotating with the terms "Cell" and "Integrin Complex".]

Page 15: The Language of the Gene Ontology

Values of Power Law Exponent
• For single-author sources in English, β is about 2 (Ferrer i Cancho and Sole, 2001)
• For young children, β is around 1.6 (Piotrowski and Pashkovskii, 1994)
• β > 2 for sets of nouns in sophisticated, single-authored texts (Balasubrahmanyan and Naranan, 1996)
• English texts fall in the range 1.6 < β < 2.4 (Ferrer i Cancho, 2005b)
• Low values favour the speaker and are low effort for the speaker
• High values favour the listener and are high effort for the speaker

Page 16: The Language of the Gene Ontology

The Questions

• Does the Gene Ontology act like a language?
• Are GO annotations utterances in that language?
– That is, do GO annotations follow a power law?
• What is the quality of that communication?
– What is the exponent?
• What is the effort involved in that communication?
– What is the effort involved in encoding and decoding?

Page 17: The Language of the Gene Ontology

Materials and Methods

• GOA and ENSEMBL annotations
• For species: human, gorilla, mouse, rat, yeast, fly, cow, fish
• Divided into “high” and “low” confidence using evidence codes
• Plot log cumulative frequency against log rank
• Fit to power law (Clauset, et al., 2009)
• Look at the exponents of the fitted lines for the various samples
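A minimal sketch of the fitting step, assuming the exponent is estimated by maximum likelihood in the style of Clauset et al. (2009). The function below uses their closed-form approximation for discrete data and a fixed lower cut-off `xmin`; the full procedure also selects `xmin` and computes a goodness-of-fit P-value, both of which are omitted here. The function name and example frequency lists are hypothetical.

```python
import math


def estimate_exponent(values, xmin=1):
    """Approximate maximum-likelihood power law exponent for discrete data.

    Uses exponent ~= 1 + n / sum(ln(x / (xmin - 0.5))) over all x >= xmin.
    """
    tail = [x for x in values if x >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical usage counts per GO term (how often each term annotates a gene).
mostly_rare_terms = [1, 1, 1, 1, 1, 1, 2, 3]
heavy_tailed_terms = [1, 2, 5, 10, 50, 100, 500, 1000]

steep = estimate_exponent(mostly_rare_terms)
shallow = estimate_exponent(heavy_tailed_terms)
```

A sample dominated by rarely used terms gives a larger exponent than one with a long tail of heavily used terms.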

Page 18: The Language of the Gene Ontology

The Equation….

If $P(f)$ is the proportion of words in a text with frequency $f$, the Zipf Law is given as:

\[ P(f) \propto f^{-\beta} \]

where $f$ refers to the frequency of a word and $\beta$ indicates the exponent or scaling parameter of the power law model.
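The law is stated here for the frequency distribution P(f), whereas Zipf's law is usually introduced through rank. The standard link between the two forms is worked through below; the rank exponent α is a symbol introduced only for this derivation and does not appear in the slides.

```latex
% Rank-frequency form: the word of rank r has frequency
f(r) \propto r^{-\alpha}
% The rank of a word is the number of words at least as frequent as it, so
r(f) \propto f^{-1/\alpha}
% Differentiating with respect to f gives the proportion of words with frequency f:
P(f) \propto \left|\frac{dr}{df}\right| \propto f^{-(1 + 1/\alpha)}
% Hence
\beta = 1 + \frac{1}{\alpha}
```

With the classic Zipf value α = 1 this gives β = 2, the figure quoted for single-author English text.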

Page 19: The Language of the Gene Ontology

Power law behavior for GO gene annotation of Biological Process within Human GOA

Page 20: The Language of the Gene Ontology

Power law behavior for GO gene annotation of Molecular Function within Human GOA

Page 21: The Language of the Gene Ontology

Power law behavior for the GO gene annotation of Cellular Component within Human GOA

Page 22: The Language of the Gene Ontology

Values for β

• For human GOA:
– Biological process 2.04
– Molecular function 1.83
– Cellular component 1.73
• Across species, most fit 1.6 < β < 2.4, which is normal for language
• Mean for BP 2.14
• Mean for MF 1.80
• Mean for CC 1.75
• BP different from MF and CC, which do not differ

Page 23: The Language of the Gene Ontology

Species  Sub-Ontology   GOA              Ensembl
                        β     P-value    β     P-value
Hs       CC             1.73  0.63       1.73  0.61
         MF             1.83  0.55       1.68  0.00
         BP             2.04  0.65       2.00  0.20
Mm       CC             1.69  0.74       1.73  0.38
         MF             1.76  0.36       1.79  0.29
         BP             2.08  0.97       2.12  0.46
Dr       CC             1.62  0.74       1.73  0.93
         MF             1.69  0.91       1.82  0.84
         BP             1.88  0.11       1.88  0.67
Bt       CC             1.72  0.25       1.75  0.33
         MF             1.72  0.36       1.71  0.01
         BP             2.04  0.56       2.11  0.89
Sc       CC             1.86  0.29       1.89  0.89
         MF             1.88  0.78       1.81  0.79
         BP             2.27  0.42       2.26  0.78
Rn       CC             1.68  0.24       1.76  0.58
         MF             1.91  0.85       1.71  0.00
         BP             2.38  0.76       2.07  0.17
Dm       CC             1.94  0.13       1.80  0.61
         MF             1.84  0.01       1.69  0.01
         BP             2.31  0.06       2.13  0.58

The table shows the results obtained from the power law analysis of each of the data sets characterized in supplementary Table 1. β is the Zipf’s law exponent and P-value is a statistic used to determine how good a model the power law is of the data. Statistically significant values are denoted in bold. Species: H. sapiens (Hs), M. musculus (Mm), D. rerio (Dr), B. taurus (Bt), S. cerevisiae (Sc), R. norvegicus (Rn), D. melanogaster (Dm).
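As a quick check, the per-sub-ontology means quoted on the "Values for β" slide can be recomputed from the GOA β column of this table. The snippet below is a sketch that simply copies those seven values per sub-ontology (in the order Hs, Mm, Dr, Bt, Sc, Rn, Dm); the variable names are illustrative.

```python
from statistics import mean

# GOA β values from the table, ordered Hs, Mm, Dr, Bt, Sc, Rn, Dm.
goa_beta = {
    "CC": [1.73, 1.69, 1.62, 1.72, 1.86, 1.68, 1.94],
    "MF": [1.83, 1.76, 1.69, 1.72, 1.88, 1.91, 1.84],
    "BP": [2.04, 2.08, 1.88, 2.04, 2.27, 2.38, 2.31],
}

# Mean exponent per sub-ontology, rounded to two decimal places.
means = {name: round(mean(values), 2) for name, values in goa_beta.items()}

# Whether every GOA exponent lies in the range reported for English text.
all_in_language_range = all(
    1.6 < beta < 2.4 for values in goa_beta.values() for beta in values
)
```

The rounded means agree with the slide: 2.14 for BP, 1.80 for MF and 1.75 for CC.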

Page 24: The Language of the Gene Ontology

Species  Sub-Ontology   GOA                            Ensembl
                        HC             LC              HC             LC
                        β     P-value  β     P-value   β     P-value  β     P-value
Hs       CC             1.88  0.37     1.62  0.11      1.89  0.30     1.87  0.64
         MF             2.05  0.18     1.75  0.16      2.06  0.83     1.77  0.02
         BP             2.12  0.37     2.04  0.62      2.11  0.34     1.86  0.04
Mm       CC             1.90  0.43     1.50  0.71      1.91  0.20     1.60  0.86
         MF             2.15  0.65     1.67  0.03      2.15  0.65     1.80  0.00
         BP             2.60  0.61     1.62  0.00      2.62  0.30     1.80  0.08

The table shows the results obtained from the power law analysis of each of the data sets characterized in supplementary Table 2. β is the Zipf’s law exponent and P-value is a statistic used to determine how good a model the power law is of the data. Statistically significant values are denoted in bold. The GO evidence codes used to define the High Confidence (HC) and Low Confidence (LC) data sets are described in the materials and methods.

Page 25: The Language of the Gene Ontology

[Bar chart: β (y-axis, 0.0 to 2.5) for BP, MF and CC, by species (x-axis: Hs, Mm, Dr, Bt, Sc, Rn, Dm)]

Comparison of calculated Zipf’s Law exponents for various sub-ontologies chosen from GOA for different species: H. sapiens (Hs), M. musculus (Mm), D. rerio (Dr), B. taurus (Bt), S. cerevisiae (Sc), R. norvegicus (Rn), D. melanogaster (Dm)

Page 26: The Language of the Gene Ontology

Power law behavior for the GO gene annotation of Molecular Function within Human Ensembl (with two-regime behavior)

Page 27: The Language of the Gene Ontology

Findings
• Most annotations with GO behave like utterances in a language
• We can say something about the quality of those utterances
• Non-fitting to power law suggests non-language-like communication
• Low confidence data fits less well to power law
– Utterances in biological process about right (2.14)
– Utterances in cell component and molecular function biased towards speaker
• Why might this be? – the quality is lower (1.7 and 1.8)
• Is it just because BP is that much bigger and thus it is easier to be more specific?
• Bias towards speaker for lower quality annotations

Page 28: The Language of the Gene Ontology

[Plot: β (y-axis, 1.5 to 2.5) against number of distinct GO identifiers (x-axis, 0 to 6000) for CC, MF and BP]

The power law exponent, β, as a function of the total number of distinct GO identifiers in each of the GO sub-ontologies

Page 29: The Language of the Gene Ontology

More Findings

• This effect is independent of size
• Why are BP and MF/CC different?
• Can speculate about what we know: We know a lot about processes, much less about specific functions of proteins; there is simply much less to know about components and location is also a bit “tricky”

Page 30: The Language of the Gene Ontology

Power law behaviour in dataset 1 from EFO (Experimental Factor Ontology)

Page 31: The Language of the Gene Ontology

Power law behaviour in dataset 2 from EFO (Experimental Factor Ontology)

Page 32: The Language of the Gene Ontology

Conclusions and the Future

• Rapid assessment of language like qualities of GO annotations

• Gives some idea of the quality of those annotations – what effort is involved

• Need to make it more analytic

• Look at many more annotations

Page 33: The Language of the Gene Ontology

Acknowledgements

• Leila R. Kalankesh
• Andy Brass
• Robert Stevens