The Language of the Gene Ontology
Robert Stevens
Bio-Health Informatics Group
The University of Manchester
Manchester, United Kingdom
Overview
• Annotation of biological data using ontologies
• Communicating between annotator and user
• Least effort, power laws and quality metrics
• The analysis of GOA corpora
• Interpreting the results
Names in biology
Some category of protein
U2-type nuclear mRNA 5' splice site recognition
spliceosomal E complex formation
spliceosomal E complex biosynthesis
spliceosomal CC complex formation
U2-type nuclear mRNA 5'-splice site recognition
De facto Integration with Ontologies
• Agreement on the entities in biology to be described
• Describe those entities with ontologies
• Label the data entities with terms from those ontologies
• Ontology building is now a mainstream activity within biology
• Ontologies built by and for biologists
Genotype Phenotype
Sequence
Proteins
Gene products Transcript
Pathways
Cell type
BRENDA tissue / enzyme source
Development
Anatomy
Phenotype
Plasmodium life cycle
-Sequence types and features -Genetic context
-Molecule role -Molecular function -Biological process -Cellular component
-Protein covalent bond -Protein domain -UniProt taxonomy
-Pathway ontology -Event (INOH pathway ontology) -Systems biology -Protein-protein interaction
-Arabidopsis development -Cereal plant development -Plant growth and developmental stage -C. elegans development -Drosophila development -Human developmental anatomy, abstract version -Human developmental anatomy, timed version
-Mosquito gross anatomy -Mouse adult gross anatomy -Mouse gross anatomy and development -C. elegans gross anatomy -Arabidopsis gross anatomy -Cereal plant gross anatomy -Drosophila gross anatomy -Dictyostelium discoideum anatomy -Fungal gross anatomy -Plant structure -Maize gross anatomy -Medaka fish anatomy and development -Zebrafish anatomy and development
-NCI Thesaurus -Mouse pathology -Human disease -Cereal plant trait -PATO (attribute and value) -Mammalian phenotype -Habronattus courtship -Loggerhead nesting -Animal natural history and life history
eVOC (Expressed Sequence Annotation for Humans)
Gene Ontology http://www.geneontology.org
“a dynamic controlled vocabulary that can be applied to all eukaryotes”
Built by the community for the community.
Three organising principles: Molecular function, Biological process, Cellular component
Describes kinds of things and parts of things
Describes ~25,000 things
Annotating Biological Data
• Some 40 species genome DB now annotated with GO
• http://www.geneontology.org/GO.current.annotations.shtml
• 395173 species specific, non-redundant genes/gene products annotated
• 7718253 annotations in total
GO associations
• CYP 51
GO:0020037 : heme binding
GO:0005506 : iron ion binding
GO:0004497 : monooxygenase activity
GO Evidence Codes
• Each annotation is given an evidence code
• Broadly divided into “computational inference” and “experimental inference”
• Can partition GO annotated data into “high” and “low” confidence annotations
• Not a direct measure of quality
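The partition by evidence code can be sketched as below. This is a minimal illustration only: the grouping of codes into the two sets is an assumption (the slides do not list the exact split used), and the evidence codes attached to the CYP51 example rows are hypothetical.

```python
# Assumed grouping of GO evidence codes; the slides do not give the exact split.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
COMPUTATIONAL = {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA", "IEA"}


def confidence(evidence_code):
    """Map one evidence code to a confidence label."""
    if evidence_code in EXPERIMENTAL:
        return "high"
    if evidence_code in COMPUTATIONAL:
        return "low"
    return "unclassified"  # e.g. author statements, curator statements


def partition(annotations):
    """Split (gene_product, go_id, evidence_code) tuples by confidence label."""
    groups = {"high": [], "low": [], "unclassified": []}
    for annotation in annotations:
        groups[confidence(annotation[2])].append(annotation)
    return groups


# Hypothetical evidence codes, for illustration only.
example = [
    ("CYP51", "GO:0020037", "IDA"),
    ("CYP51", "GO:0005506", "IEA"),
    ("CYP51", "GO:0004497", "TAS"),
]
groups = partition(example)
```

Keeping an explicit "unclassified" group avoids silently forcing codes that are neither experimental nor computational into one of the two sets.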
Zipf’s Law (1934, 1949)
• Frequency of a word in a corpus is inversely proportional to its rank
– The most popular word occurs twice as frequently as the next most popular word, which itself occurs twice as frequently as the fourth
– Power law distributions are seen in many natural and social situations
• This distribution is a characteristic of human language
• Plot of log frequency against log rank
• The slope β gives information about the language used in the corpus
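A small sketch of the log-log plot idea: rank the words by frequency, take logs of both, and fit a straight line by least squares. The function names are illustrative, and the least-squares fit is a simplification chosen for brevity rather than the fitting procedure used in the analysis. Which plot is fitted determines the sign and definition of the reported slope; this sketch covers only the rank-frequency version.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent token first."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(points):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in points]
    ys = [math.log(freq) for _, freq in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Idealised corpus where frequency is exactly inversely proportional to rank.
ideal = [(rank, 1200 / rank) for rank in range(1, 7)]
slope = loglog_slope(ideal)  # close to -1 for this idealised case
```

In the idealised case the first word is twice as frequent as the second, and the second twice as frequent as the fourth, which is the pattern described above.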
The Communication Process
[Figure: Source → Encoding → Channel → Decoding → Receiver. Labels: Biology, Encoded Message, Decoded Message.]
Source = Annotators
Receiver = User of annotation
Principle of Least Effort
• In the process of message passing from encoder to decoder, effort is expended
• Maximum information transfer with minimum effort
• A rich language precisely defining the message is hard work to encode and should be easier to decode
• The steeper the slope (β), the richer the message and the more effort involved
• A value of β of about 2 is close to optimum
• Does GO annotation behave like messages in a language?
• Looking at β might tell us about annotation quality: how well is the message transferred?
[Figure: Effort in encoding and decoding annotation for Integrin Alpha8 protein. Effort (high to low) for speaker and listener, shown for the terms “Integrin Complex” and “Cell”.]
Values of Power Law Exponent
• For single author sources in English, β is about 2 (Ferrer i Cancho and Solé, 2001)
• For young children, β is around 1.6 (Piotrowski and Pashkovskii, 1994)
• β > 2 for sets of nouns in sophisticated, single authored texts (Balasubrahmanyan and Naranan, 1996)
• English texts are in the range 1.6 < β < 2.4 (Ferrer i Cancho, 2005b)
• Low values favour the speaker and are low effort for the speaker
• High values favour the listener and are high effort for the speaker
The Questions
• Does the Gene Ontology act like a language?
• Are GO annotations utterances in that language?
• That is, do GO annotations follow a power law?
• What is the quality of that communication? What is the exponent?
• What is the effort involved in that communication? What is the effort involved in encoding and decoding?
Materials and Methods
• GOA and Ensembl annotations
• For species: human, gorilla, mouse, rat, yeast, fly, cow, fish
• Divided into “high” and “low” confidence using evidence codes
• Plot log cumulative frequency against log rank
• Fit to power law (Clauset et al., 2009)
• Look at exponent of lines for various samples
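One way to read the counting step is sketched here: count how many annotations use each GO identifier, then compute the fraction of identifiers used at least f times. This is an assumed reading of "cumulative frequency", and the gene names in the example are hypothetical.

```python
from collections import Counter


def term_usage(annotations):
    """Count how many (gene, go_id) annotations use each GO identifier."""
    return Counter(go_id for _, go_id in annotations)


def ccdf(usage):
    """Return (f, fraction of GO identifiers used at least f times) pairs."""
    freqs = sorted(usage.values())
    n = len(freqs)
    points = []
    for i, f in enumerate(freqs):
        # Emit one point per distinct frequency, at its first occurrence.
        if i == 0 or f != freqs[i - 1]:
            points.append((f, (n - i) / n))
    return points


# Hypothetical gene names paired with real GO identifiers.
example = [
    ("g1", "GO:0005506"), ("g2", "GO:0005506"), ("g3", "GO:0005506"),
    ("g1", "GO:0020037"), ("g2", "GO:0020037"),
    ("g1", "GO:0004497"),
]
points = ccdf(term_usage(example))
```

Taking logs of both columns of `points` gives the values for a log-log plot of the cumulative distribution.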
The Equation
If $P(f)$ is the proportion of words in a text with frequency $f$, Zipf's law is given as:
$$P(f) \propto f^{-\beta}$$
where $f$ refers to the frequency of a word and $\beta$ indicates the exponent or scaling parameter of the power law model.
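For a power law of the form P(f) ∝ f^(−β), the exponent can be estimated by maximum likelihood. The sketch below uses the approximate estimator for discrete data given by Clauset et al. (2009), β ≈ 1 + n / Σ ln(f_i / (f_min − 1/2)). The function name and the fixed `f_min` are assumptions for illustration; choosing `f_min` from the data and computing the goodness-of-fit P-value are not covered, and the approximation is generally considered less accurate for very small `f_min`.

```python
import math


def estimate_beta(frequencies, f_min=1):
    """Approximate maximum-likelihood exponent for discrete power-law data.

    Only frequencies >= f_min contribute; f_min must be at least 1.
    """
    if f_min < 1:
        raise ValueError("f_min must be at least 1")
    tail = [f for f in frequencies if f >= f_min]
    if not tail:
        raise ValueError("no frequencies at or above f_min")
    log_sum = sum(math.log(f / (f_min - 0.5)) for f in tail)
    return 1.0 + len(tail) / log_sum


# Usage: frequencies would be the per-term annotation counts.
beta_hat = estimate_beta([1, 1, 2, 3, 5, 8, 13], f_min=1)
```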
Power law behavior for GO gene annotation of Biological Process within Human GOA
Power law behavior for GO gene annotation of Molecular Function within Human GOA
Power law behavior for the GO gene annotation of Cellular Component within Human GOA
Values for β
• For human GOA:
– Biological process 2.04
– Molecular function 1.83
– Cellular component 1.73
• Across species, most fit 1.6 < β < 2.4, which is normal for language
• Mean for BP 2.14
• Mean for MF 1.80
• Mean for CC 1.75
• BP is different from MF and CC, which do not differ from each other
The table below shows the results obtained from the power law analysis of each of the data sets characterized in supplementary Table 1. β is the Zipf's law exponent and P-value is a statistic used to determine how good a model the power law is of the data. Statistically significant values are denoted in bold. H. sapiens (Hs), M. musculus (Mm), D. rerio (Dr), B. taurus (Bt), S. cerevisiae (Sc), R. norvegicus (Rn), D. melanogaster (Dm)

Species  Sub-ontology  GOA β  GOA P-value  Ensembl β  Ensembl P-value
Hs       CC            1.73   0.63         1.73       0.61
Hs       MF            1.83   0.55         1.68       0
Hs       BP            2.04   0.65         2          0.2
Mm       CC            1.69   0.74         1.73       0.38
Mm       MF            1.76   0.36         1.79       0.29
Mm       BP            2.08   0.97         2.12       0.46
Dr       CC            1.62   0.74         1.73       0.93
Dr       MF            1.69   0.91         1.82       0.84
Dr       BP            1.88   0.11         1.88       0.67
Bt       CC            1.72   0.25         1.75       0.33
Bt       MF            1.72   0.36         1.71       0.01
Bt       BP            2.04   0.56         2.11       0.89
Sc       CC            1.86   0.29         1.89       0.89
Sc       MF            1.88   0.78         1.81       0.79
Sc       BP            2.27   0.42         2.26       0.78
Rn       CC            1.68   0.24         1.76       0.58
Rn       MF            1.91   0.85         1.71       0
Rn       BP            2.38   0.76         2.07       0.17
Dm       CC            1.94   0.13         1.8        0.61
Dm       MF            1.84   0.01         1.69       0.01
Dm       BP            2.31   0.06         2.13       0.58
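As a quick arithmetic check, the per-sub-ontology means quoted on the "Values for β" slide can be recomputed from the GOA β column of this table. The dictionary layout and function names are illustrative; the numbers are copied from the table.

```python
from statistics import mean

# GOA beta values per sub-ontology and species, copied from the table.
GOA_BETA = {
    "CC": {"Hs": 1.73, "Mm": 1.69, "Dr": 1.62, "Bt": 1.72,
           "Sc": 1.86, "Rn": 1.68, "Dm": 1.94},
    "MF": {"Hs": 1.83, "Mm": 1.76, "Dr": 1.69, "Bt": 1.72,
           "Sc": 1.88, "Rn": 1.91, "Dm": 1.84},
    "BP": {"Hs": 2.04, "Mm": 2.08, "Dr": 1.88, "Bt": 2.04,
           "Sc": 2.27, "Rn": 2.38, "Dm": 2.31},
}


def mean_beta(sub_ontology):
    """Mean exponent across species, rounded to two decimals."""
    return round(mean(GOA_BETA[sub_ontology].values()), 2)


def in_language_range(beta, low=1.6, high=2.4):
    """True if beta lies in the range quoted for English texts."""
    return low < beta < high


means = {sub: mean_beta(sub) for sub in GOA_BETA}
all_in_range = all(
    in_language_range(beta)
    for by_species in GOA_BETA.values()
    for beta in by_species.values()
)
```

The recomputed means agree with the quoted 2.14 (BP), 1.80 (MF) and 1.75 (CC), and every GOA value in this table falls inside 1.6 < β < 2.4.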
The table below shows the results obtained from power law analysis of each of the data sets characterized in supplementary Table 2. β is the Zipf's law exponent and P-value is a statistic used to determine how good a model the power law is of the data. Statistically significant values are denoted in bold. The GO evidence codes used to define the High Confidence (HC) and Low Confidence (LC) data sets are described in the materials and methods.

                       GOA HC         GOA LC         Ensembl HC     Ensembl LC
Species  Sub-ontology  β     P-value  β     P-value  β     P-value  β     P-value
Hs       CC            1.88  0.37     1.62  0.11     1.89  0.3      1.87  0.64
Hs       MF            2.05  0.18     1.75  0.16     2.06  0.83     1.77  0.02
Hs       BP            2.12  0.37     2.04  0.62     2.11  0.34     1.86  0.04
Mm       CC            1.9   0.43     1.5   0.71     1.91  0.2      1.6   0.86
Mm       MF            2.15  0.65     1.67  0.03     2.15  0.65     1.8   0.00
Mm       BP            2.6   0.61     1.62  0.00     2.62  0.3      1.8   0.08
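To make the high versus low confidence comparison concrete, the β values from this table can be compared pairwise. The data structure and function name are illustrative; the numbers are copied from the table.

```python
# (source, species, sub-ontology) -> (high-confidence beta, low-confidence beta)
HC_LC_BETA = {
    ("GOA", "Hs", "CC"): (1.88, 1.62),
    ("GOA", "Hs", "MF"): (2.05, 1.75),
    ("GOA", "Hs", "BP"): (2.12, 2.04),
    ("GOA", "Mm", "CC"): (1.9, 1.5),
    ("GOA", "Mm", "MF"): (2.15, 1.67),
    ("GOA", "Mm", "BP"): (2.6, 1.62),
    ("Ensembl", "Hs", "CC"): (1.89, 1.87),
    ("Ensembl", "Hs", "MF"): (2.06, 1.77),
    ("Ensembl", "Hs", "BP"): (2.11, 1.86),
    ("Ensembl", "Mm", "CC"): (1.91, 1.6),
    ("Ensembl", "Mm", "MF"): (2.15, 1.8),
    ("Ensembl", "Mm", "BP"): (2.62, 1.8),
}


def beta_gaps(table):
    """High-confidence minus low-confidence exponent for each data set."""
    return {key: round(hc - lc, 2) for key, (hc, lc) in table.items()}


gaps = beta_gaps(HC_LC_BETA)
hc_always_higher = all(gap > 0 for gap in gaps.values())
```

In every row the high-confidence exponent is larger than the low-confidence one, which is the direction expected if lower-confidence annotation is biased towards the speaker.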
[Figure: β (axis range 0.0 to 2.5) by species (Hs, Mm, Dr, Bt, Sc, Rn, Dm), with series for BP, MF and CC.]
Comparison of calculated Zipf's law exponents for various sub-ontologies chosen from GOA for different species: H. sapiens (Hs), M. musculus (Mm), D. rerio (Dr), B. taurus (Bt), S. cerevisiae (Sc), R. norvegicus (Rn), D. melanogaster (Dm)
Power law behavior for the GO gene annotation of Molecular Function within Human Ensembl (with two-regime behavior)
Findings
• Most annotations with GO behave like utterances in a language
• We can say something about the quality of those utterances
• Non-fitting to a power law suggests non-language-like communication
• Low confidence data fits less well to a power law
– Utterances in biological process are about right (2.14)
– Utterances in cellular component and molecular function are biased towards the speaker; the quality is lower (1.7 and 1.8)
• Why might this be? Is it just because BP is that much bigger, and thus it is easier to be more specific?
• Bias towards the speaker for lower quality annotations
[Figure: β (axis range 1.5 to 2.5) against number of distinct GO identifiers (0 to 6000), with series for CC, MF and BP.]
The power law exponent, β, as a function of the total number of distinct GO identifiers in each of the GO sub-ontologies
More Findings
• This effect is independent of size
• Why are BP and MF/CC different?
• We can speculate from what we know: we know a lot about processes and much less about specific functions of proteins; there is simply much less to know about components, and location is also a bit “tricky”
Power law behaviour in dataset 1 from EFO (Experimental Factor Ontology)
Power law behaviour in dataset 2 from EFO (Experimental Factor Ontology)
Conclusions and the Future
• Rapid assessment of language like qualities of GO annotations
• Gives some idea of the quality of those annotations – what effort is involved
• Need to make it more analytic
• Look at many more annotations
Acknowledgements
• Leila R. Kalankesh
• Andy Brass
• Robert Stevens