the language of the gene ontology

Post on 21-May-2015

Category:

Science

DESCRIPTION

Invited talk at the Department of Computer Science, The Open University, 2010

TRANSCRIPT

The Language of the Gene Ontology

Robert Stevens
Bio-Health Informatics Group
The University of Manchester
Manchester
United Kingdom

Robert.Stevens@manchester.ac.uk

Overview

• Annotation of biological data using ontologies
• Communicating between annotator and user
• Least effort, power laws and quality metrics
• The analysis of GOA corpora
• Interpreting the results

Names in biology

Some category of protein

U2-type nuclear mRNA 5' splice site recognition

spliceosomal E complex formation

spliceosomal E complex biosynthesis

spliceosomal CC complex formation

U2-type nuclear mRNA 5'-splice site recognition

De facto Integration with Ontologies

• Agreement on the entities in biology to be described
• Describe those entities with ontologies
• Label those data entities with terms from those ontologies
• Ontology building now a mainstream activity within biology
• Ontologies built by and for biologists

[Figure: ontologies spanning Genotype to Phenotype. Labels: Sequence; Proteins; Gene products; Transcript; Pathways; Cell type; BRENDA tissue / enzyme source; Development; Anatomy; Phenotype; Plasmodium life cycle]

- Sequence types and features
- Genetic Context

- Molecule role
- Molecular Function
- Biological process
- Cellular component

- Protein covalent bond
- Protein domain
- UniProt taxonomy

- Pathway ontology
- Event (INOH pathway ontology)
- Systems Biology
- Protein-protein interaction

- Arabidopsis development
- Cereal plant development
- Plant growth and developmental stage
- C. elegans development
- Drosophila development (FBdv)
- Human developmental anatomy, abstract version
- Human developmental anatomy, timed version

- Mosquito gross anatomy
- Mouse adult gross anatomy
- Mouse gross anatomy and development
- C. elegans gross anatomy
- Arabidopsis gross anatomy
- Cereal plant gross anatomy
- Drosophila gross anatomy
- Dictyostelium discoideum anatomy
- Fungal gross anatomy (FAO)
- Plant structure
- Maize gross anatomy
- Medaka fish anatomy and development
- Zebrafish anatomy and development

- NCI Thesaurus
- Mouse pathology
- Human disease
- Cereal plant trait
- PATO (attribute and value)
- Mammalian phenotype
- Habronattus courtship
- Loggerhead nesting
- Animal natural history and life history

eVOC (Expressed Sequence Annotation for Humans)

Gene Ontology http://www.geneontology.org

“a dynamic controlled vocabulary that can be applied to all eukaryotes”

Built by the community for the community.

Three organising principles: Molecular function, Biological process, Cellular component

Describes kinds of things and parts of things

Describes ~25,000 things

Annotating Biological Data

• Some 40 species genome DB now annotated with GO

• http://www.geneontology.org/GO.current.annotations.shtml

• 395173 species specific, non-redundant genes/gene products annotated

• 7718253 annotations in total

GO associations

• CYP 51

GO:0020037 : heme binding

GO:0005506 : iron ion binding

GO:0004497 : monooxygenase activity

GO Evidence Codes

• Each annotation is given an evidence code
• Codes broadly divide into “computational inference” and “experimental inference”
• Can partition GO annotated data into “high” and “low” confidence annotations
• Evidence codes are not directly a measure of quality
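The partition by evidence code can be sketched as follows. This is a minimal illustration, not the procedure used in the talk: the slides do not say which codes define the "high" and "low" confidence sets, so treating the GO experimental codes (EXP, IDA, IPI, IMP, IGI, IEP) as high confidence is an assumption, and the evidence codes attached to the CYP 51 annotations are invented for the example.

```python
# GO experimental evidence codes. Using them as the "high confidence" set
# is an assumption made for this sketch.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def partition_by_confidence(annotations):
    """Split (gene, go_id, evidence_code) tuples into (high, low) lists."""
    high, low = [], []
    for annotation in annotations:
        evidence_code = annotation[2]
        if evidence_code in EXPERIMENTAL_CODES:
            high.append(annotation)
        else:
            low.append(annotation)
    return high, low


# Hypothetical evidence codes, chosen only to exercise both branches.
annotations = [
    ("CYP51", "GO:0020037", "IDA"),
    ("CYP51", "GO:0005506", "IEA"),
    ("CYP51", "GO:0004497", "ISS"),
]
high, low = partition_by_confidence(annotations)
print(len(high), "high confidence;", len(low), "low confidence")
```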

Zipf’s Law (1934, 1949)

• Frequency of a word in a corpus is inversely proportional to its rank
• The most popular word occurs twice as frequently as the next most popular word, which itself occurs twice as frequently as the fourth
• Power law distributions are seen in many natural and social situations
• This distribution is a characteristic of human language
• Plot of log frequency against log rank
• The slope β gives information about the language used in the corpus
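A log-log rank-frequency plot and its slope can be sketched in a few lines. The corpus below is synthetic (the term names and counts are invented so that frequency is exactly proportional to 1/rank), and the slope is estimated by ordinary least squares, which is a simplification for illustration only. Note that this is the slope of the rank-frequency plot; the exponent of the frequency distribution P(f) used later in the talk is a related but different quantity.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent token first."""
    frequencies = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(frequencies, start=1))


def loglog_slope(points):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in points]
    ys = [math.log(freq) for _, freq in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic corpus: the k-th term occurs 2520 / k times (2520 is divisible
# by every k from 1 to 10), so frequency is exactly proportional to 1 / rank.
tokens = []
for k in range(1, 11):
    tokens.extend([f"term{k}"] * (2520 // k))

slope = loglog_slope(rank_frequency(tokens))
print(f"slope of log frequency against log rank: {slope:.3f}")
```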

The Communication Process

[Diagram labels: Biology; Source → Encoding → Channel → Decoding → Receiver; Encoded Message; Decoded Message]

Source=Annotators

Receiver= User of Annotation

Principle of Least Effort

• In the process of message passing from encoder to decoder, effort is expended
• Maximum information transfer with minimum effort
• A rich language precisely defining the message is hard work to encode and should be easier to decode
• The steeper the slope (β), the richer the message and the more effort involved
• A β value of 2 is about optimum
• Does GO annotation behave like messages in a language?
• Looking at β might tell us about annotation quality: how well is the message transferred?

[Figure: Effort in encoding and decoding annotation for the Integrin alpha-8 protein. Speaker and listener effort (High to Low) shown for the terms “Integrin Complex” and “Cell”.]

Values of Power Law Exponent

• For single-author sources in English, β is about 2 (Ferrer i Cancho and Solé, 2001)
• For young children, β is around 1.6 (Piotrowski and Pashkovskii, 1994)
• β > 2 for sets of nouns in sophisticated, single-authored texts (Balasubrahmanyan and Naranan, 1996)
• English texts lie in the range 1.6 < β < 2.4 (Ferrer i Cancho, 2005b)
• Low values favour the speaker and are low effort for the speaker
• High values favour the listener and are high effort for the speaker

The Questions

• Does the Gene Ontology act like a language?
• Are GO annotations utterances in that language? That is, do GO annotations follow a power law?
• What is the quality of that communication? What is the exponent?
• What is the effort involved in that communication? What is the effort involved in encoding and decoding?

Materials and Methods

• GOA and ENSEMBL annotations
• For species: Human, gorilla, mouse, rat, yeast, fly, cow, fish
• Divided into “high” and “low” confidence using evidence codes
• Plot log cumulative frequency against log rank
• Fit to power law (Clauset et al., 2009)
• Look at exponent of lines for various samples
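The fitting step can be illustrated with the continuous maximum-likelihood estimator described by Clauset et al. (2009), β = 1 + n / Σ ln(x_i / x_min). This is only a sketch under stated assumptions: the data are synthetic samples drawn from a known power law, x_min is fixed by hand rather than selected from the data, and the goodness-of-fit P-value reported in the tables is not computed. The function names are hypothetical, and the talk does not say which software was used.

```python
import math
import random


def fit_power_law_exponent(values, x_min):
    """Continuous MLE of beta for P(x) ~ x**(-beta), using values >= x_min."""
    tail = [x for x in values if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum


def sample_power_law(beta, x_min, n, rng):
    """Draw n samples from a continuous power law by inverse transform."""
    # 1 - rng.random() lies in (0, 1], so the power is always defined.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (beta - 1.0)) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = sample_power_law(beta=2.0, x_min=1.0, n=20000, rng=rng)
estimate = fit_power_law_exponent(samples, x_min=1.0)
print(f"estimated beta: {estimate:.2f}")
```

Annotation counts are integers, so a real analysis would more likely use the discrete form of the estimator together with a bootstrap goodness-of-fit test; the continuous form is used here only because it fits in a few lines.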

The Equation

If $P(f)$ is the proportion of words in a text with frequency $f$, the Zipf Law is given as:

$$P(f) \propto f^{-\beta}$$

where $f$ refers to the frequency of a word and $\beta$ indicates the exponent or scaling parameter of the power law model.
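As a side note, the exponent β of the frequency distribution $P(f) \propto f^{-\beta}$ can be related to the slope of the rank-frequency plot. The symbol $\alpha$ for the rank-frequency exponent is introduced here for illustration and does not appear in the slides; the derivation assumes an idealised, continuous power law.

$$f(r) \propto r^{-\alpha} \;\Longrightarrow\; r(f) \propto f^{-1/\alpha}$$

Since the rank of a word is the number of words with at least its frequency, differentiating $r(f)$ gives the proportion of words at frequency $f$:

$$P(f) \propto \left|\frac{dr}{df}\right| \propto f^{-(1 + 1/\alpha)}, \qquad \beta = 1 + \frac{1}{\alpha}$$

With the classical rank-frequency slope $\alpha = 1$, this gives $\beta = 2$, the value the talk treats as about optimum.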

Power law behavior for GO gene annotation of Biological Process within Human GOA

Power law behavior for GO gene annotation of Molecular Function within Human GOA

Power law behavior for the GO gene annotation of Cellular Component within Human GOA

Values for β

• For human GOA:
– Biological process 2.04
– Molecular function 1.83
– Cellular component 1.73
• Across species, most fit 1.6 < β < 2.4, which is normal for language
• Mean for BP 2.14
• Mean for MF 1.80
• Mean for CC 1.75
• BP different from MF and CC, which do not differ

Species  Sub-ontology  GOA β  GOA P-value  Ensembl β  Ensembl P-value
Hs       CC            1.73   0.63         1.73       0.61
Hs       MF            1.83   0.55         1.68       0
Hs       BP            2.04   0.65         2          0.2
Mm       CC            1.69   0.74         1.73       0.38
Mm       MF            1.76   0.36         1.79       0.29
Mm       BP            2.08   0.97         2.12       0.46
Dr       CC            1.62   0.74         1.73       0.93
Dr       MF            1.69   0.91         1.82       0.84
Dr       BP            1.88   0.11         1.88       0.67
Bt       CC            1.72   0.25         1.75       0.33
Bt       MF            1.72   0.36         1.71       0.01
Bt       BP            2.04   0.56         2.11       0.89
Sc       CC            1.86   0.29         1.89       0.89
Sc       MF            1.88   0.78         1.81       0.79
Sc       BP            2.27   0.42         2.26       0.78
Rn       CC            1.68   0.24         1.76       0.58
Rn       MF            1.91   0.85         1.71       0
Rn       BP            2.38   0.76         2.07       0.17
Dm       CC            1.94   0.13         1.8        0.61
Dm       MF            1.84   0.01         1.69       0.01
Dm       BP            2.31   0.06         2.13       0.58

The table above shows the results obtained from the power law analysis of each of the data sets characterized in supplementary Table 1. β is the Zipf’s law exponent and the P-value is a statistic used to determine how good a model the power law is of the data. Statistically significant values are denoted in bold. Species: H. sapiens (Hs), M. musculus (Mm), D. rerio (Dr), B. taurus (Bt), S. cerevisiae (Sc), R. norvegicus (Rn), D. melanogaster (Dm)
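The per-sub-ontology means quoted earlier can be checked against the GOA β column of this table. The short sketch below uses only the values transcribed from the table; the variable names are chosen for illustration.

```python
from statistics import mean

# GOA beta values per species, in the order Hs, Mm, Dr, Bt, Sc, Rn, Dm.
goa_beta = {
    "BP": [2.04, 2.08, 1.88, 2.04, 2.27, 2.38, 2.31],
    "MF": [1.83, 1.76, 1.69, 1.72, 1.88, 1.91, 1.84],
    "CC": [1.73, 1.69, 1.62, 1.72, 1.86, 1.68, 1.94],
}

# Mean exponent for each sub-ontology, rounded to two decimal places.
mean_beta = {name: round(mean(values), 2) for name, values in goa_beta.items()}
print(mean_beta)
```

The rounded means agree with the figures given in the talk: 2.14 for BP, 1.80 for MF and 1.75 for CC.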

                       GOA HC         GOA LC         Ensembl HC     Ensembl LC
Species  Sub-ontology  β     P-value  β     P-value  β     P-value  β     P-value
Hs       CC            1.88  0.37     1.62  0.11     1.89  0.3      1.87  0.64
Hs       MF            2.05  0.18     1.75  0.16     2.06  0.83     1.77  0.02
Hs       BP            2.12  0.37     2.04  0.62     2.11  0.34     1.86  0.04
Mm       CC            1.9   0.43     1.5   0.71     1.91  0.2      1.6   0.86
Mm       MF            2.15  0.65     1.67  0.03     2.15  0.65     1.8   0.00
Mm       BP            2.6   0.61     1.62  0.00     2.62  0.3      1.8   0.08

The table above shows the results obtained from the power law analysis of each of the data sets characterized in supplementary Table 2. β is the Zipf’s law exponent and the P-value is a statistic used to determine how good a model the power law is of the data. Statistically significant values are denoted in bold. The GO evidence codes used to define the High Confidence (HC) and Low Confidence (LC) data sets are described in the materials and methods.

[Figure: β (y-axis, 0.0 to 2.5) by species (x-axis: Hs, Mm, Dr, Bt, Sc, Rn, Dm), with series for BP, MF and CC]

Comparison of calculated Zipf’s Law exponents for various sub-ontologies chosen from GOA for different species: H. sapiens (Hs), M. musculus (Mm), D. rerio (Dr), B. taurus (Bt), S. cerevisiae (Sc), R. norvegicus (Rn), D. melanogaster (Dm)

Power law behavior for the GO gene annotation of Molecular Function within Human Ensembl (with two-regime behavior)

Findings

• Most annotations with GO behave like utterances in a language
• We can say something about the quality of those utterances
• Non-fitting to power law suggests non-language-like communication
• Low confidence data fits less well to power law
– Utterances in biological process about right (2.14)
– Utterances in cellular component and molecular function biased towards speaker; the quality is lower (1.7 and 1.8)
• Why might this be? Is it just because BP is that much bigger and thus it is easier to be more specific?
• Bias towards speaker for lower quality annotations

[Figure: β (y-axis, 1.5 to 2.5) against number of distinct GO identifiers (x-axis, 0 to 6000), with points for CC, MF and BP]

The power law exponent, β, as a function of the total number of distinct GO identifiers in each of the GO sub-ontologies

More Findings

• This effect is independent of size
• Why are BP and MF/CC different?
• Can speculate about what we know: we know a lot about processes, much less about specific functions of proteins; there is simply much less to know about components, and location is also a bit “tricky”

Power law behaviour in dataset 1 from EFO (Experimental Factor Ontology)

Power law behaviour in dataset 2 from EFO (Experimental Factor Ontology)

Conclusions and the Future

• Rapid assessment of language-like qualities of GO annotations

• Gives some idea of the quality of those annotations: what effort is involved

• Need to make it more analytic

• Look at many more annotations

Acknowledgements

• Leila R. Kalankesh
• Andy Brass
• Robert Stevens
