original article · 2013. 8. 21. · original article a guide to best practices for gene ontology...
TRANSCRIPT
-
Original article
A guide to best practices for Gene Ontology(GO) manual annotation
Rama Balakrishnan1, Midori A. Harris2, Rachael Huntley3, Kimberly Van Auken4 andJ. Michael Cherry1,*
1Saccharomyces Genome Database, Department of Genetics, Stanford University, 300 Pasteur Drive, MC-5477 Stanford, CA 94305, USA,2PomBase, Cambridge Systems Biology Centre, Department of Biochemistry, University of Cambridge, Sanger Building, 80 Tennis Court Road,
Cambridge CB2 1GA, UK, 3UniProt, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
and 4WormBase, Division of Biology, California Institute of Technology, 1200 E. California Boulevard, Pasadena, CA 91125, USA
*Corresponding author: Tel: +1 650 723 7541; Email: [email protected]
Submitted 15 February 2013; Revised 11 June 2013; Accepted 17 June 2013
Citation details: Balakrishnan,R., Harris,M.A., Huntley,R., et al. A guide to best practices for Gene Ontology (GO) manual annotation. Database
(2013), Vol. 2013: article ID bat054; doi:10.1093/database/bat054.
.............................................................................................................................................................................................................................................................................................
The Gene Ontology Consortium (GOC) is a community-based bioinformatics project that classifies gene product function
through the use of structured controlled vocabularies. A fundamental application of the Gene Ontology (GO) is in the
creation of gene product annotations, evidence-based associations between GO definitions and experimental or sequence-
based analysis. Currently, the GOC disseminates 126 million annotations covering >374 000 species including all the king-
doms of life. This number includes two classes of GO annotations: those created manually by experienced biocurators
reviewing the literature or by examination of biological data (1.1 million annotations covering 2226 species) and those
generated computationally via automated methods. As manual annotations are often used to propagate functional pre-
dictions between related proteins within and between genomes, it is critical to provide accurate consistent manual anno-
tations. Toward this goal, we present here the conventions defined by the GOC for the creation of manual annotation. This
guide represents the best practices for manual annotation as established by the GOC project over the past 12 years. We
hope this guide will encourage research communities to annotate gene products of their interest to enhance the corpus of
GO annotations available to all.
Database URL: http://www.geneontology.org
.............................................................................................................................................................................................................................................................................................
Introduction
The Gene Ontology Consortium (GOC; http://www.geneon
tology.org) is a bioinformatics resource that serves as a
comprehensive repository of functional information about
gene products assembled through the use of domain-
specific ontologies (1). The project is a collaborative effort
working to describe how and where gene products act by
creating evidence-supported gene-product annotations to
structured comprehensive controlled vocabularies. The
Gene Ontology (GO) is a controlled vocabulary composed
of >38 000 precise defined phrases called GO terms that
describe the molecular actions of gene products, the
biological processes in which those actions occur and the
cellular locations where they are present. First developed in
1998 (2), the GOC project has grown to become an inte-
grated resource providing functional information for a
wide variety of species. As of January 2013, there are
>126 million annotations to >19 million gene products
from species throughout the tree of life. Of these there
are 1.1 million manually curated annotations, from pub-
lished experimental results, to 234 000 gene products. As
the GOC develops the standard language to describe func-
tion, it also defines standards for using these ontologies in
the creation of annotations. This article elaborates on the
methods and conventions adopted by the GOC curation
.............................................................................................................................................................................................................................................................................................
� The Author(s) 2013. Published by Oxford University Press.This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properlycited. Page 1 of 18
(page number not for citation purposes)
Database, Vol. 2013, Article ID bat054, doi:10.1093/database/bat054.............................................................................................................................................................................................................................................................................................
at California Institute of T
echnology on August 9, 2013
http://database.oxfordjournals.org/D
ownloaded from
http://www.geneontology.orghttp://www.geneontology.orghttp://www.geneontology.org,over ,over over http://database.oxfordjournals.org/
-
teams for constructing annotations and serves as a guide to
new or potential annotators, and the biological community
at large, for understanding the requirements necessary to
create and maintain the highest quality GO annotations.
Overview of GO annotations
The goal of the GOC is the unification of biology by creat-
ing a nomenclature used for describing the functional char-
acteristics of any gene product, protein or RNA, from any
organism. There are two parts to a GO annotation: first, the
association asserted between a gene product and a GO def-
inition; and second, the source (e.g. published article) and
evidence used as the authority to make the assertion. The
GO is a set of highly structured directed acyclic graphs
(DAGs); its structure and content have been extensively
described elsewhere (2, 3). Here, we limit our presentation
to the GO term name (the phrase that is typically used
when discussing individual components of the ontologies,
often shortened to ‘GO term’), the GO definition, the text
string that explains the precise meaning of the GO term
and a numerical identifier called the GOID (examples used
in this guide are shown in Table 1). In addition, each term
can have multiple ontological relationships to broader
(parent) and more specific (child) terms (Figure 1 illustrates
how terms and relationships are represented in GO).
Although annotations are typically viewed as connec-
tions between a gene product and a GO term, it is import-
ant to stress that the GO term name is a surrogate for the
definition, and that the biological concept described by the
definition is really the core assertion being made by an
annotation. This is a subtle yet important point central to
understanding the power of the GO, one that is not always
appreciated by both annotators and consumers of GO an-
notations. As with a spoken language, the understanding
of its usage is based on shared definitions of the phrases
and definitions of the terms. Thus, annotating to the def-
inition is required to alleviate confusion if the names of
biological concepts or terminology used in the published
literature are ambiguous.
The source of the information used to make an anno-
tation includes both a specific reference, usually a pub-
lished scientific article represented by a PubMed identifier
(PMID), that describes the result of an experimental or
computational analysis on which the association was
based, and an evidence code (Table 2) that reflects the
type of experimental assay or analysis that supports the
association. Annotations can be asserted manually from
the literature by biocurators or computationally by auto-
mated methods. This article will focus on standards defined
by the GOC for manual curation. Computational annota-
tion methods and their guidelines have been reported
elsewhere (4).
Annotation format
GO annotations are recorded and supplied in a standard
tab-delimited file format called the Gene Associations
File (GAF, http://www.geneontology.org/GO.format.anno
tation.shtml). For each annotation, the GAF format con-
tains both required and optional fields, some of which
will be discussed below. The required fields are—the iden-
tifier of the gene product being annotated, the GOID of
the GO term associated with the gene product, an evidence
code and the reference (either a published article or a GOC-
specific internal reference) supporting the use of the GOID,
the aspect of the ontology (Molecular Function, Biological
Process, Cellular Component), the curation project that cre-
ated the annotation, the object type that is being anno-
tated (see below), the NCBI taxonomy database identifier
for the species of the gene product and the date the anno-
tation was created or modified. A sample annotation is
shown in Table 3.
Manual curation
Within the GOC, manual annotations are made by experi-
enced biocurators from a variety of annotation projects
including, but not limited to, the Saccharomyces Genome
Database [SGD, (6)], Mouse Genome Informatics [MGI, (7)],
WormBase (8), PomBase (9), FlyBase (10), ZFIN (11) and
UniProt (12). Manual curation typically encompasses two
approaches. The first involves reading relevant publications,
identifying the gene product(s) of interest, and ascribing
the reported experimental results to a GO definition
using an appropriate evidence code (Table 2). The second
involves inferring a gene’s role by manual examination of
its nucleic acid or protein sequence motifs, structure or
phylogenetic relationships. For consistent interpretation
of experimental results and sequence analysis, the GOC
has established annotation guidelines that are elaborated
below. GOC member projects (http://www.geneontology.
org/GO.consortiumlist.shtml) with assistance from other
groups engaged in advancing the representation of biolo-
gical function so that it can be presented in a straightfor-
ward but precisely defined form have developed these
guidelines. Over time these guidelines have evolved into
required standards for all manual annotations and have
been incorporated into validation tools used by the GOC
to maintain their quality and uniformity.
Gene product: Object ofannotation
The annotation object or molecular entity are those defined
by the Sequence Ontology [(13), http://www.sequenceontol
ogy.org] and includes complex, gene, gene_product,
.............................................................................................................................................................................................................................................................................................
Page 2 of 18
Original article Database, Vol. 2013, Article ID bat054, doi:10.1093/database/bat054.............................................................................................................................................................................................................................................................................................
at California Institute of T
echnology on August 9, 2013
http://database.oxfordjournals.org/D
ownloaded from
paper``''very upis paperuphttp://www.geneontology.org/GO.format.annotation.shtmlhttp://www.geneontology.org/GO.format.annotation.shtml: paperCuration((http://www.geneontology.org/GO.consortiumlist.shtmlhttp://www.geneontology.org/GO.consortiumlist.shtmlAnnotationhttp://www.sequenceontology.orghttp://www.sequenceontology.org):http://database.oxfordjournals.org/
-
Ta
ble
1.
Th
ista
ble
list
sa
sma
llsu
bse
to
fth
eG
Ote
rms
use
da
se
xam
ple
sin
the
ma
inte
xto
fth
ea
rtic
le
GO
IDA
spe
ctG
Ote
rmn
am
eD
efi
nit
ion
GO
:00
03
67
4M
ole
cula
rfu
nct
ion
Mo
lecu
lar_
fun
ctio
nE
lem
en
tal
act
ivit
ies,
such
as
cata
lysi
so
rb
ind
ing
,d
esc
rib
ing
the
act
ion
so
fa
ge
ne
pro
du
cta
tth
em
ole
cula
rle
vel.
A
giv
en
ge
ne
pro
du
ctm
ay
exh
ibit
on
eo
rm
ore
mo
lecu
lar
fun
ctio
ns.
GO
:00
08
15
0B
iolo
gic
al
pro
cess
Bio
log
ica
l_p
roce
ssA
ny
pro
cess
spe
cifi
cally
pe
rtin
en
tto
the
fun
ctio
nin
go
fin
teg
rate
dlivi
ng
un
its:
cells,
tiss
ue
s,o
rga
ns
an
do
rga
nis
ms.
Ap
roce
ssis
aco
lle
ctio
no
fm
ole
cula
re
ven
tsw
ith
ad
efi
ne
db
eg
inn
ing
an
de
nd
.
GO
:00
05
57
5C
ell
ula
rco
mp
on
en
tC
ell
ula
r_co
mp
on
en
tT
he
pa
rto
fa
cell
or
its
ext
race
llu
lar
en
viro
nm
en
tin
wh
ich
ag
en
ep
rod
uct
islo
cate
d.
Ag
en
ep
rod
uct
ma
yb
e
loca
ted
ino
ne
or
mo
rep
art
so
fa
cell
an
dit
slo
cati
on
ma
yb
ea
ssp
eci
fic
as
ap
art
icu
lar
ma
cro
mo
lecu
lar
com
ple
x,th
at
is,
ast
ab
le,
pe
rsis
ten
ta
sso
cia
tio
no
fm
acr
om
ole
cule
sth
at
fun
ctio
nto
ge
the
r.
GO
:00
04
67
2M
ole
cula
rfu
nct
ion
Pro
tein
kin
ase
act
ivit
yC
ata
lysi
so
fth
ep
ho
sph
ory
lati
on
of
an
am
ino
aci
dre
sid
ue
ina
pro
tein
,u
sua
lly
acc
ord
ing
toth
ere
act
ion
:a
pro
tein
+A
TP
=a
ph
osp
ho
pro
tein
+A
DP
.
GO
:00
22
85
8M
ole
cula
rfu
nct
ion
Ala
nin
etr
an
sme
mb
ran
etr
an
spo
rte
r
act
ivit
y
Ca
taly
sis
of
the
tra
nsf
er
of
ala
nin
efr
om
on
esi
de
of
am
em
bra
ne
toth
eo
the
r.A
lan
ine
is2
-am
ino
pro
pa
no
ica
cid
.
GO
:00
03
87
2M
ole
cula
rfu
nct
ion
6-p
ho
sph
ofr
uct
ok
ina
sea
ctiv
ity
Ca
taly
sis
of
the
rea
ctio
n:
AT
P+
D-f
ruct
ose
-6-p
ho
sph
ate
=A
DP
+D
-fru
cto
se1
,6-b
isp
ho
sph
ate
.
GO
:00
00
98
1M
ole
cula
rfu
nct
ion
Seq
ue
nce
-sp
eci
fic
DN
Ab
ind
ing
RN
A
po
lym
era
seII
tra
nsc
rip
tio
nfa
cto
r
act
ivit
y
Inte
ract
ing
sele
ctiv
ely
an
dn
on
cova
len
tly
wit
ha
spe
cifi
cD
NA
seq
ue
nce
ino
rde
rto
mo
du
late
tra
nsc
rip
tio
nb
yR
NA
po
lym
era
seII.
Th
etr
an
scri
pti
on
fact
or
ma
yo
rm
ay
no
ta
lso
inte
ract
sele
ctiv
ely
wit
ha
pro
tein
or
ma
cro
mo
l-
ecu
lar
com
ple
x.
GO
:00
00
98
8M
ole
cula
rfu
nct
ion
Pro
tein
bin
din
gtr
an
scri
pti
on
fact
or
act
ivit
y
Inte
ract
ing
sele
ctiv
ely
an
dn
on
cova
len
tly
wit
ha
ny
pro
tein
or
pro
tein
com
ple
x(a
com
ple
xo
ftw
oo
rm
ore
pro
tein
s
tha
tm
ay
incl
ud
eo
the
rn
on
pro
tein
mo
lecu
les)
,in
ord
er
tom
od
ula
tetr
an
scri
pti
on
.A
pro
tein
bin
din
gtr
an
-
scri
pti
on
fact
or
ma
yo
rm
ay
no
ta
lso
inte
ract
wit
hth
ete
mp
late
nu
cle
ica
cid
(eit
he
rD
NA
or
RN
A)
as
we
ll.
GO
:00
43
56
5M
ole
cula
rfu
nct
ion
Seq
ue
nce
-sp
eci
fic
DN
Ab
ind
ing
Inte
ract
ing
sele
ctiv
ely
an
dn
on
cova
len
tly
wit
hD
NA
of
asp
eci
fic
nu
cle
oti
de
com
po
siti
on
,e
.g.
GC
-ric
hD
NA
bin
d-
ing
,o
rw
ith
asp
eci
fic
seq
ue
nce
mo
tif
or
typ
eo
fD
NA
e.g
.p
rom
oto
rb
ind
ing
or
rDN
Ab
ind
ing
.
GO
:00
04
67
4M
ole
cula
rfu
nct
ion
Pro
tein
seri
ne
/th
reo
nin
ek
ina
se
act
ivit
y
Ca
taly
sis
of
the
rea
ctio
ns:
AT
P+
pro
tein
seri
ne
=A
DP
+p
rote
inse
rin
ep
ho
sph
ate
,a
nd
AT
P+
pro
tein
thre
on
ine
=A
DP
+p
rote
inth
reo
nin
ep
ho
sph
ate
.
GO
:00
04
71
2M
ole
cula
rfu
nct
ion
Pro
tein
seri
ne
/th
reo
nin
e/t
yro
sin
e
kin
ase
act
ivit
y
Ca
taly
sis
of
the
rea
ctio
ns:
AT
P+
ap
rote
inse
rin
e=
AD
P+
pro
tein
seri
ne
ph
osp
ha
te;
AT
P+
ap
rote
in
thre
on
ine
=A
DP
+p
rote
inth
reo
nin
ep
ho
sph
ate
;a
nd
AT
P+
ap
rote
inty
rosi
ne
=A
DP
+p
rote
inty
rosi
ne
ph
osp
ha
te.
GO
:00
05
51
5M
ole
cula
rfu
nct
ion
Pro
tein
bin
din
gIn
tera
ctin
gse
lect
ive
lya
nd
no
nco
vale
ntl
yw
ith
an
yp
rote
ino
rp
rote
inco
mp
lex
(aco
mp
lex
of
two
or
mo
rep
rote
ins
tha
tm
ay
incl
ud
eo
the
rn
on
pro
tein
mo
lecu
les)
.
GO
:00
42
39
3M
ole
cula
rfu
nct
ion
His
ton
eb
ind
ing
Inte
ract
ing
sele
ctiv
ely
an
dn
on
cova
len
tly
wit
ha
his
ton
e,
an
yo
fa
gro
up
of
wa
ter-
solu
ble
pro
tein
sfo
un
din
ass
oci
ati
on
wit
hth
eD
NA
of
pla
nt
an
da
nim
al
chro
mo
som
es.
Th
ey
are
invo
lve
din
the
con
de
nsa
tio
na
nd
coil
ing
of
chro
mo
som
es
du
rin
gce
lld
ivis
ion
an
dh
ave
als
ob
ee
nim
pli
cate
din
no
nsp
eci
fic
sup
pre
ssio
no
fg
en
ea
ctiv
ity.
GO
:0
01
68
87
Mo
lecu
lar
fun
ctio
nA
TP
ase
act
ivit
yC
ata
lysi
so
fth
ere
act
ion
:A
TP
+H
2O
=A
DP
+p
ho
sph
ate
+2
H+
.M
ay
or
ma
yn
ot
be
cou
ple
dto
an
oth
er
rea
ctio
n.
GO
:00
08
12
1M
ole
cula
rfu
nct
ion
Ub
iqu
ino
l-cy
toch
rom
e-c
red
uct
ase
act
ivit
y
Ca
taly
sis
of
the
tra
nsf
er
of
aso
lute
or
solu
tes
fro
mo
ne
sid
eo
fa
me
mb
ran
eto
the
oth
er
acc
ord
ing
toth
e
rea
ctio
n:
Co
QH
2+
2fe
rric
yto
chro
me
c=
Co
Q+
2fe
rro
cyto
chro
me
c+
2H
+.
GO
:00
04
46
3M
ole
cula
rfu
nct
ion
Leu
ko
trie
ne
-A4
hyd
rola
sea
ctiv
ity
Ca
taly
sis
of
the
rea
ctio
n:
H(2
)O+
leu
ko
trie
ne
A(4
)=le
uk
otr
ien
eB
(4).
GO
:00
08
15
2B
iolo
gic
al
pro
cess
Me
tab
oli
cp
roce
ssT
he
che
mic
al
rea
ctio
ns
an
dp
ath
wa
ys,
incl
ud
ing
an
ab
oli
sma
nd
cata
bo
lism
,b
yw
hic
hli
vin
go
rga
nis
ms
tra
nsf
orm
che
mic
al
sub
sta
nce
s.M
eta
bo
lic
pro
cess
es
typ
ica
lly
tra
nsf
orm
sma
llm
ole
cule
s,b
ut
als
oin
clu
de
ma
cro
mo
lecu
lar
pro
cess
es
such
as
DN
Are
pa
ira
nd
rep
lica
tio
n,
an
dp
rote
insy
nth
esi
sa
nd
de
gra
da
tio
n.
(Co
nti
nu
ed
)
.............................................................................................................................................................................................................................................................................................
Page 3 of 18
Database, Vol. 2013, Article ID bat054, doi:10.1093/database/bat054 Original article.............................................................................................................................................................................................................................................................................................
at California Institute of T
echnology on August 9, 2013
http://database.oxfordjournals.org/D
ownloaded from
http://database.oxfordjournals.org/
-
Ta
ble
1.
Co
nti
nu
ed
GO
IDA
spe
ctG
Ote
rmn
am
eD
efi
nit
ion
GO
:00
23
05
2B
iolo
gic
al
pro
cess
Sig
na
lin
gT
he
en
tire
tyo
fa
pro
cess
inw
hic
hin
form
ati
on
istr
an
smit
ted
wit
hin
ab
iolo
gic
al
syst
em
.T
his
pro
cess
be
gin
sw
ith
an
act
ive
sig
na
la
nd
en
ds
wh
en
ace
llu
lar
resp
on
seh
as
be
en
trig
ge
red
.
GO
:00
16
26
5B
iolo
gic
al
pro
cess
De
ath
Ap
erm
an
en
tce
ssa
tio
no
fa
llvi
tal
fun
ctio
ns:
the
en
do
fli
fe;
can
be
ap
pli
ed
toa
wh
ole
org
an
ism
or
toa
pa
rto
f
an
org
an
ism
.
GO
:00
08
21
9B
iolo
gic
al
pro
cess
Ce
lld
ea
thA
ny
bio
log
ica
lp
roce
ssth
at
resu
lts
inp
erm
an
en
tce
ssa
tio
no
fa
llvi
tal
fun
ctio
ns
of
ace
ll.
Ace
llsh
ou
ldb
eco
n-
sid
ere
dd
ea
dw
he
na
ny
on
eo
fth
efo
llo
win
gm
ole
cula
ro
rm
orp
ho
log
ica
lcr
ite
ria
ism
et:
(i)
the
cell
ha
slo
stth
e
inte
gri
tyo
fit
sp
lasm
am
em
bra
ne
;(i
i)th
ece
ll,
incl
ud
ing
its
nu
cle
us,
ha
su
nd
erg
on
eco
mp
lete
fra
gm
en
tati
on
into
dis
cre
teb
od
ies
(fre
qu
en
tly
refe
rre
dto
as
‘ap
op
toti
cb
od
ies’
);a
nd
/or
(iii)
its
corp
se(o
rit
sfr
ag
me
nts
)h
ave
be
en
en
gu
lfe
db
ya
na
dja
cen
tce
llin
vivo
.
GO
:00
06
91
5B
iolo
gic
al
pro
cess
Ap
op
toti
cp
roce
ssA
pro
gra
mm
ed
cell
de
ath
pro
cess
wh
ich
be
gin
sw
he
na
cell
rece
ive
sa
nin
tern
al
(e.g
.D
NA
da
ma
ge
)o
re
xte
rna
l
sig
na
l(e
.g.
an
ext
race
llu
lar
de
ath
lig
an
d),
an
dp
roce
ed
sth
rou
gh
ase
rie
so
fb
ioch
em
ica
le
ven
ts(s
ign
ali
ng
pa
thw
ays
)w
hic
hty
pic
ally
lea
dto
rou
nd
ing
-up
of
the
cell,
retr
act
ion
of
pse
ud
op
od
es,
red
uct
ion
of
cellu
lar
volu
me
(pyk
no
sis)
,ch
rom
ati
nco
nd
en
sati
on
,n
ucl
ea
rfr
ag
me
nta
tio
n(k
ary
orr
he
xis)
,p
lasm
am
em
bra
ne
ble
bb
ing
an
dfr
ag
me
nta
tio
no
fth
ece
llin
toa
po
pto
tic
bo
die
s.T
he
pro
cess
en
ds
wh
en
the
cell
ha
sd
ied
.T
he
pro
cess
is
div
ide
din
toa
sig
na
lin
gp
ath
wa
yp
ha
sea
nd
into
an
exe
cuti
on
ph
ase
,w
hic
his
trig
ge
red
by
the
form
er.
GO
:00
30
26
3B
iolo
gic
al
pro
cess
Ap
op
toti
cch
rom
oso
me
con
de
nsa
tio
n
Th
eco
mp
act
ion
of
chro
ma
tin
du
rin
ga
po
pto
sis.
GO
:00
32
32
9B
iolo
gic
al
pro
cess
Seri
ne
tra
nsp
ort
Th
ed
ire
cte
dm
ove
me
nt
of
L-se
rin
e,
2-a
min
o-3
-hyd
roxy
pro
pa
no
ica
cid
,in
to,
ou
to
fo
rw
ith
ina
cell
,o
rb
etw
ee
n
cell
s,b
ym
ea
ns
of
som
ea
ge
nt
such
as
atr
an
spo
rte
ro
rp
ore
.
GO
:00
15
82
6B
iolo
gic
al
pro
cess
Th
reo
nin
etr
an
spo
rtT
he
dir
ect
ed
mo
vem
en
to
fth
reo
nin
e,
(2R
*,3
S*)-
2-a
min
o-3
-hyd
roxy
bu
tan
oic
aci
d,
into
,o
ut
of
or
wit
hin
ace
ll,
or
be
twe
en
cell
s,b
ym
ea
ns
of
som
ea
ge
nt
such
as
atr
an
spo
rte
ro
rp
ore
.
GO
:00
32
32
8B
iolo
gic
al
pro
cess
Ala
nin
etr
an
spo
rtT
he
dir
ect
ed
mo
vem
en
to
fa
lan
ine
,2
-am
ino
pro
pa
no
ica
cid
,in
to,
ou
to
fo
rw
ith
ina
cell
,o
rb
etw
ee
nce
lls,
by
me
an
so
fso
me
ag
en
tsu
cha
sa
tra
nsp
ort
er
or
po
re.
GO
:00
03
46
05
Bio
log
ica
lp
roce
ssC
ell
ula
rre
spo
nse
toh
ea
tA
ny
pro
cess
tha
tre
sult
sin
ach
an
ge
inst
ate
or
act
ivit
yo
fa
cell
(in
term
so
fm
ove
me
nt,
secr
eti
on
,e
nzy
me
pro
du
ctio
n,
ge
ne
exp
ress
ion
,e
tc.)
as
are
sult
of
ah
ea
tst
imu
lus,
ate
mp
era
ture
stim
ulu
sa
bo
veth
eo
pti
ma
l
tem
pe
ratu
refo
rth
at
org
an
ism
.
GO
:00
71
47
0B
iolo
gic
al
pro
cess
Ce
llu
lar
resp
on
seto
osm
oti
cst
ress
An
yp
roce
ssth
at
resu
lts
ina
cha
ng
ein
sta
teo
ra
ctiv
ity
of
ace
ll(i
nte
rms
of
mo
vem
en
t,se
cre
tio
n,
en
zym
e
pro
du
ctio
n,
ge
ne
exp
ress
ion
,e
tc.)
as
are
sult
of
ast
imu
lus
ind
ica
tin
ga
nin
cre
ase
or
de
cre
ase
inth
eco
nce
n-
tra
tio
no
fso
lute
so
uts
ide
the
org
an
ism
or
cell.
GO
:00
34
59
9B
iolo
gic
al
pro
cess
Ce
llu
lar
resp
on
seto
oxi
da
tive
stre
ssA
ny
pro
cess
tha
tre
sult
sin
ach
an
ge
inst
ate
or
act
ivit
yo
fa
cell
(in
term
so
fm
ove
me
nt,
secr
eti
on
,e
nzy
me
pro
du
ctio
n,
ge
ne
exp
ress
ion
,e
tc.)
as
are
sult
of
oxi
da
tive
stre
ss,
ast
ate
oft
en
resu
ltin
gfr
om
exp
osu
reto
hig
h
leve
lso
fre
act
ive
oxy
ge
nsp
eci
es,
e.g
.su
pe
roxi
de
an
ion
s,h
ydro
ge
np
ero
xid
e(H
2O
2),
an
dh
ydro
xyl
rad
ica
ls.
GO
:00
33
55
4B
iolo
gic
al
pro
cess
Ce
llu
lar
resp
on
seto
stre
ssA
ny
pro
cess
tha
tre
sult
sin
ach
an
ge
inst
ate
or
act
ivit
yo
fa
cell
(in
term
so
fm
ove
me
nt,
secr
eti
on
,e
nzy
me
pro
du
ctio
n,
ge
ne
exp
ress
ion
,e
tc.)
as
are
sult
of
ast
imu
lus
ind
ica
tin
gth
eo
rga
nis
mis
un
de
rst
ress
.T
he
stre
ssis
usu
all
y,b
ut
no
tn
ece
ssa
rily
,e
xog
en
ou
s(e
.g.
tem
pe
ratu
re,
hu
mid
ity,
ion
izin
gra
dia
tio
n).
GO
:00
06
35
1B
iolo
gic
al
pro
cess
Tra
nsc
rip
tio
n,
DN
Ad
ep
en
de
nt
Th
ece
llu
lar
syn
the
sis
of
RN
Ao
na
tem
pla
teo
fD
NA
.
GO
:00
06
35
7B
iolo
gic
al
pro
cess
Re
gu
lati
on
of
tra
nsc
rip
tio
nfr
om
RN
Ap
oly
me
rase
IIp
rom
ote
r
An
yp
roce
ssth
at
mo
du
late
sth
efr
eq
ue
ncy
,ra
teo
re
xte
nt
of
tra
nsc
rip
tio
nfr
om
an
RN
Ap
oly
me
rase
IIp
rom
ote
r.
(Co
nti
nu
ed
)
.............................................................................................................................................................................................................................................................................................
Page 4 of 18
Original article Database, Vol. 2013, Article ID bat054, doi:10.1093/database/bat054.............................................................................................................................................................................................................................................................................................
at California Institute of T
echnology on August 9, 2013
http://database.oxfordjournals.org/D
ownloaded from
http://database.oxfordjournals.org/
-
Ta
ble
1.
Co
nti
nu
ed
GO
IDA
spe
ctG
Ote
rmn
am
eD
efi
nit
ion
GO
:00
07
06
7B
iolo
gic
al
pro
cess
Mit
osi
sA
cell
cycl
ep
roce
ssco
mp
risi
ng
the
ste
ps
by
wh
ich
the
nu
cle
us
of
ae
uk
ary
oti
cce
lld
ivid
es;
the
pro
cess
invo
lve
s
con
de
nsa
tio
no
fch
rom
oso
ma
lD
NA
into
ah
igh
lyco
mp
act
ed
form
.C
an
on
ica
lly,
mit
osi
sp
rod
uce
stw
od
au
gh
ter
nu
cle
iw
ho
sech
rom
oso
me
com
ple
me
nt
isid
en
tica
lto
tha
to
fth
em
oth
er
cell
.
GO
:00
00
08
4B
iolo
gic
al
pro
cess
Sp
ha
seo
fm
ito
tic
cell
cycl
eS
ph
ase
occ
urr
ing
as
pa
rto
fth
em
ito
tic
cell
cycl
e.
Sp
ha
seis
the
pa
rto
fth
ece
llcy
cle
du
rin
gw
hic
hD
NA
syn
the
sis
tak
es
pla
ce.
Am
ito
tic
cell
cycl
eis
on
ew
hic
hca
no
nic
ally
com
pri
ses
fou
rsu
cce
ssiv
ep
ha
ses
calle
dG
1,
S,G
2a
nd
M
an
din
clu
de
sre
pli
cati
on
of
the
ge
no
me
an
dth
esu
bse
qu
en
tse
gre
ga
tio
no
fch
rom
oso
me
sin
tod
au
gh
ter
cell
s.
GO
:00
31
02
8B
iolo
gic
al
pro
cess
Sep
tati
on
init
iati
on
sig
na
lin
g
casc
ad
e
Th
ese
rie
so
fm
ole
cula
rsi
gn
als
,m
ed
iate
db
yth
esm
all
GT
Pa
seR
as,
tha
tre
sult
sin
the
init
iati
on
of
con
tra
ctio
no
f
the
con
tra
ctil
eri
ng
,a
tth
eb
eg
inn
ing
of
cyto
kin
esi
sa
nd
cell
div
isio
nb
yse
ptu
mfo
rma
tio
n.
Th
ep
ath
wa
yco
-
ord
ina
tes
chro
mo
som
ese
gre
ga
tio
nw
ith
mit
oti
ce
xit
an
dcy
tok
ine
sis.
GO
:00
51
32
1B
iolo
gic
al
pro
cess
Me
ioti
cce
llcy
cle
Pro
gre
ssio
nth
rou
gh
the
ph
ase
so
fth
em
eio
tic
cell
cycl
e,
inw
hic
hca
no
nic
all
ya
cell
rep
lica
tes
top
rod
uce
fou
r
off
spri
ng
wit
hh
alf
the
chro
mo
som
al
con
ten
to
fth
ep
rog
en
ito
rce
ll.
GO
:00
05
63
4C
ell
ula
rco
mp
on
en
tN
ucl
eu
sA
me
mb
ran
e-b
ou
nd
ed
org
an
ell
eo
fe
uk
ary
oti
cce
lls
inw
hic
hch
rom
oso
me
sa
reh
ou
sed
an
dre
pli
cate
d.
Inm
ost
cell
s,th
en
ucl
eu
sco
nta
ins
all
of
the
cell
’sch
rom
oso
me
se
xce
pt
the
org
an
ell
ar
chro
mo
som
es,
an
dis
the
site
of
RN
Asy
nth
esi
sa
nd
pro
cess
ing
.In
som
esp
eci
es,
or
insp
eci
ali
zed
cell
typ
es,
RN
Am
eta
bo
lism
or
DN
Are
pli
cati
on
ma
yb
ea
bse
nt.
GO
:00
05
68
1C
ell
ula
rco
mp
on
en
tSp
lice
soso
ma
lco
mp
lex
An
yo
fa
seri
es
of
rib
on
ucl
eo
pro
tein
com
ple
xes
tha
tco
nta
inR
NA
an
dsm
all
nu
cle
ar
rib
on
ucl
eo
pro
tein
s(s
nR
NP
s),
an
da
refo
rme
dse
qu
en
tia
lly
du
rin
gth
esp
lici
ng
of
am
ess
en
ge
rR
NA
pri
ma
rytr
an
scri
pt
toe
xcis
ea
nin
tro
n.
GO
:00
44
42
8,
Ce
llu
lar
com
po
ne
nt
Nu
cle
ar
pa
rtA
ny
con
stit
ue
nt
pa
rto
fth
en
ucl
eu
s,a
me
mb
ran
e-b
ou
nd
ed
org
an
ell
eo
fe
uk
ary
oti
cce
lls
inw
hic
hch
rom
oso
me
s
are
ho
use
da
nd
rep
lica
ted
.
GO
:00
05
82
6C
ell
ula
rco
mp
on
en
tA
cto
myo
sin
con
tra
ctil
eri
ng
Acy
tosk
ele
tal
stru
ctu
reco
mp
ose
do
fa
ctin
fila
me
nts
an
dm
yosi
nth
at
form
sb
en
ea
thth
ep
lasm
am
em
bra
ne
of
ma
ny
cell
s,in
clu
din
ga
nim
al
cell
sa
nd
yea
stce
lls,
ina
pla
ne
pe
rpe
nd
icu
lar
toth
ea
xis
of
the
spin
dle
,i.
e.
the
cell
div
isio
np
lan
e.
Rin
gco
ntr
act
ion
isa
sso
cia
ted
wit
hce
ntr
ipe
tal
gro
wth
of
the
me
mb
ran
eth
at
div
ide
sth
ecy
to-
pla
smo
fth
etw
od
au
gh
ter
cell
s.In
an
ima
lce
lls,
the
con
tra
ctil
eri
ng
islo
cate
din
sid
eth
ep
lasm
am
em
bra
ne
at
the
loca
tio
no
fth
ecl
ea
vag
efu
rro
w.
Inb
ud
din
gfu
ng
al
cell
s,e
.g.
mit
oti
cS.
cere
visi
ae
cells,
the
con
tra
ctile
rin
g
form
sb
en
ea
thth
ep
lasm
am
em
bra
ne
at
the
mo
the
r-b
ud
ne
ckb
efo
rem
ito
sis.
GO
:00
05
75
0C
ell
ula
rco
mp
on
en
tM
ito
cho
nd
ria
lre
spir
ato
rych
ain
com
ple
xII
I
Ap
rote
inco
mp
lex
loca
ted
inth
em
ito
cho
nd
ria
lin
ne
rm
em
bra
ne
tha
tfo
rms
pa
rto
fth
em
ito
cho
nd
ria
lre
spir
ato
ry
cha
in.
Co
nta
ins
ab
ou
t1
0p
oly
pe
pti
de
sub
un
its
incl
ud
ing
fou
rre
do
xce
nte
rs:
cyto
chro
me
b/b
6,
cyto
chro
me
c1
an
da
n2
Fe-2
Scl
ust
er.
Ca
taly
zes
the
oxi
da
tio
no
fu
biq
uin
ol
by
oxi
diz
ed
cyto
chro
me
c1.
Th
eG
OT
erm
isa
sho
rtp
hra
seth
at
isty
pic
ally
use
dto
rep
rese
nt
the
ind
ivid
ua
lco
mp
on
en
tso
fth
eo
nto
log
ies,
wh
ile
the
de
fin
itio
ns
pro
vid
eth
ep
reci
sem
ea
nin
go
fth
eG
Ote
rms.
As
em
ph
asi
zed
inth
ete
xt,
itis
imp
ort
an
tto
cre
ate
an
no
tati
on
sto
the
de
fin
itio
na
nd
no
tto
the
GO
Te
rm.
Cu
rato
rssh
ou
lde
xplo
reth
eo
nto
log
ies
usi
ng
Am
iGO
(17
,h
ttp
://a
mig
o.
ge
ne
on
tolo
gy.
org
)o
rQ
uic
kG
O(1
8,
htt
p:/
/ww
w.e
bi.
ac.
uk
/Qu
ick
GO
/)to
ide
nti
fya
pp
rop
ria
tete
rms
for
an
no
tati
on
.M
ore
info
rma
tio
no
nth
est
ruct
ure
of
the
Ge
ne
On
tolo
gy
an
dh
ow
itis
de
velo
pe
dis
ava
ila
ble
on
lin
efr
om
htt
p://w
ww
.ge
ne
on
tolo
gy.
org
.
.............................................................................................................................................................................................................................................................................................
Page 5 of 18
Database, Vol. 2013, Article ID bat054, doi:10.1093/database/bat054 Original article.............................................................................................................................................................................................................................................................................................
at California Institute of T
echnology on August 9, 2013
http://database.oxfordjournals.org/D
ownloaded from
http://amigo.geneontology.orghttp://amigo.geneontology.orghttp://www.ebi.ac.uk/QuickGO/http://www.geneontology.orghttp://database.oxfordjournals.org/
-
miRNA, ncRNA, protein, protein_complex, protein_struc-
ture, RNA, rRNA, snoRNA, snRNA, transcript, tRNA and poly-
peptide. While annotations are typically created for
chromosomal features, such as a gene for its protein or
ncRNA product, other types of objects can be annotated
including groups of gene products that make a complex.
The annotation object can be associated to a GO term
from one or more of the three aspects of the GO
(Molecular Function, Biological Process and Cellular
Component). A gene product is the most common object
of annotation, and all such objects require a stable identifier
such as those specified by sequence databases maintained
by European Bioinformatic Institute (EBI) and National
Center for Biotechnology Information (NCBI). Model
Organism Databases (MODs) also maintain unique identi-
fiers that often represent specific types of molecular entities
such as RNA transcripts that often do not have an identifier
from one of the archival repositories.
Approaching an article for curation
When experimental data on a gene product has been pub-
lished, the following guidelines can be used to identify the
relevant or annotatable pieces of information that may
generate GO annotation for that gene product.
(i) Identification of relevant articles describing a gene
product’s function is the essential starting point for
Figure 1. GO Term ‘leukotriene-A4 hydrolase activity’ [GO:0004463], one of the terms mentioned in the main text of the article,as seen in AmiGO (16, http://amigo.geneontology.org). (a) Graphical view of the ontology structure showing the most granularterm ‘leukotriene-A4 hydrolase activity’ [GO:0004463] at the bottom (highlighted in red), and all its parent terms leading up tothe root node (‘molecular_function’ [GO:0003674]) at the top. Each box representing a GO term includes the GO identifier, andthe blue line connecting the terms represent the ontological relationship ‘is_a’ (implying that a child term is a subtype of theparent term). (b) Alternate text display for viewing the ontology structure. ‘leukotriene-A4 hydrolase activity’ [GO:0004463] ishighlighted in red. Each child term is indented from its parent to indicate the depth of the tree. Apart from the GOID and GOterm, each row includes other pieces of information that are important to understand the ontology and the annotations to eachterm. Starting from the left end of the row, the + sign indicates that there are child terms for that node and clicking on the + signopens the browser to display the child terms. Next the small icon ‘i’ indicates the term is related to its parent by an is–arelationship (explained above). At the right end of the row in brackets is the total number of gene products annotated tothat term and all its child terms. (c) Term information relevant to making an annotation is highlighted in red, which includes theGOID, Aspect of the ontology (Molecular Function), Synonyms and Definition of the term.
.............................................................................................................................................................................................................................................................................................
Page 6 of 18
Original article Database, Vol. 2013, Article ID bat054, doi:10.1093/database/bat054.............................................................................................................................................................................................................................................................................................
at California Institute of T
echnology on August 9, 2013
http://database.oxfordjournals.org/D
ownloaded from
,paperpapershttp://amigo.geneontology.orghttp://database.oxfordjournals.org/
-
Ta
ble
2.
Evi
de
nce
cod
eca
teg
ori
es,
evi
de
nce
cod
en
am
es
an
dd
efi
nit
ion
s
Ca
teg
ory
Evi
de
nce
cod
eT
ype
so
fe
vid
en
ceSu
pp
ort
ing
da
ta
Exp
eri
me
nta
lIn
ferr
ed
fro
mE
xpe
rim
en
t
(EX
P)
An
ye
xpe
rim
en
tal
ass
ay
Infe
rre
dfr
om
Dir
ect
Ass
ay
(ID
A)
(i)
En
zym
ea
ssa
ys
(ii)
Invi
tro
reco
nst
itu
tio
n(e
.g.
tra
nsc
rip
tio
n)
(iii)
Imm
un
ofl
uo
resc
en
ce(f
or
cellu
lar
com
po
ne
nt)
(iv)
Ce
llfr
act
ion
ati
on
(fo
rce
llu
lar
com
po
ne
nt)
(v)
Ph
ysic
al
inte
ract
ion
/bin
din
ga
ssa
y(s
om
eti
me
sa
pp
rop
ria
tefo
rce
llu
lar
com
po
ne
nt
or
mo
lecu
lar
fun
ctio
n)
Infe
rre
dfr
om
Ph
ysic
al
Inte
ract
ion
(IP
I)
(i)
Tw
o-h
ybri
din
tera
ctio
ns
Wit
hco
lum
nsh
ou
ldb
efi
lle
dw
ith
Ide
nti
fie
ro
fth
ein
tera
ctin
gp
rote
in(i
i)C
o-p
uri
fica
tio
n
(iii)
Co
-im
mu
no
pre
cip
ita
tio
n
(iv)
Ion
/pro
tein
bin
din
ge
xpe
rim
en
ts
Infe
rre
dfr
om
Mu
tan
t
Ph
en
oty
pe
(IM
P)
(i)
Mu
tati
on
s,n
atu
ral
or
intr
od
uce
d,
tha
tre
sult
inp
art
ial
or
com
ple
teim
pa
irm
en
to
r
alt
era
tio
no
fth
efu
nct
ion
of
tha
tg
en
e
(ii)
Po
lym
orp
his
mo
ra
lle
lic
vari
ati
on
(in
clu
din
gw
he
ren
oa
lle
leis
de
sig
na
ted
wild
-typ
eo
r
mu
tan
t)
(iii
)A
ny
pro
ced
ure
tha
td
istu
rbs
the
exp
ress
ion
or
fun
ctio
no
fth
eg
en
e,
incl
ud
ing
RN
Ai,
an
ti-s
en
seR
NA
s,a
nti
bo
dy
de
ple
tio
n,
or
the
use
of
an
ym
ole
cule
or
exp
eri
me
nta
l
con
dit
ion
tha
tm
ay
dis
turb
or
aff
ect
the
no
rma
lfu
nct
ion
ing
of
the
ge
ne
,in
clu
din
g:
inh
ibit
ors
,b
lock
ers
,m
od
ifie
rs,
an
yty
pe
of
an
tag
on
ists
,te
mp
era
ture
jum
ps,
cha
ng
es
in
pH
or
ion
icst
ren
gth
(iv)
Ove
rexp
ress
ion
or
ect
op
ice
xpre
ssio
no
fw
ild
-typ
eo
rm
uta
nt
ge
ne
Infe
rre
dfr
om
Ge
ne
tic
Inte
ract
ion
(IG
I)
(i)
‘Tra
dit
ion
al’
ge
ne
tic
inte
ract
ion
ssu
cha
ssu
pp
ress
ors
,sy
nth
eti
cle
tha
ls,
etc
.W
ith
colu
mn
sho
uld
be
fille
dw
ith
ide
n-
tifi
er
of
the
inte
ract
ing
ge
ne
(ii)
Fun
ctio
na
lco
mp
lem
en
tati
on
(iii)
Re
scu
ee
xpe
rim
en
ts
(iv)
Infe
ren
cea
bo
ut
on
eg
en
ed
raw
nfr
om
the
ph
en
oty
pe
of
am
uta
tio
nin
ad
iffe
ren
t
ge
ne
Infe
rre
dfr
om
Exp
ress
ion
Pa
tte
rn(I
EP
)
(i)
Tra
nsc
rip
tle
vels
or
tim
ing
(e.g
.N
ort
he
rns,
mic
roa
rra
yd
ata
)
(ii)
Pro
tein
leve
ls(e
.g.
We
ste
rnb
lots
)
(Co
nti
nu
ed
)
.............................................................................................................................................................................................................................................................................................
Page 7 of 18
Database, Vol. 2013, Article ID bat054, doi:10.1093/database/bat054 Original article.............................................................................................................................................................................................................................................................................................
at California Institute of T
echnology on August 9, 2013
http://database.oxfordjournals.org/D
ownloaded from
http://database.oxfordjournals.org/
-
Ta
ble
2.
Co
nti
nu
ed
Ca
teg
ory
Evi
de
nce
cod
eT
ype
so
fe
vid
en
ceSu
pp
ort
ing
da
ta
Co
mp
uta
tio
na
l
An
aly
sis
Infe
rre
db
ySe
qu
en
ce
Sim
ila
rity
(ISS
)
Seq
ue
nce
,st
ruct
ura
lsi
mila
rity
-ba
sed
an
aly
sis
Wit
hco
lum
nsh
ou
ldb
efi
lle
dw
ith
ide
n-
tifi
er
of
the
sim
ila
rg
en
e/p
rote
in
Infe
rre
db
ySe
qu
en
ce
Ali
gn
me
nt
(ISA
)
Pa
irw
ise
or
mu
ltip
lea
lig
nm
en
tW
ith
colu
mn
sho
uld
be
fille
dw
ith
ide
n-
tifi
er
of
the
sim
ila
rg
en
e/p
rote
in
Infe
rre
db
ySe
qu
en
ce
Ort
ho
log
y(I
SO)
Ass
ert
ion
of
ort
ho
log
yb
etw
ee
nth
eg
en
ep
rod
uct
an
da
ge
ne
pro
du
ctin
an
oth
er
org
an
ism
Wit
hco
lum
nsh
ou
ldb
efi
lle
dw
ith
ide
n-
tifi
er
of
the
sim
ila
rg
en
e/p
rote
in
Infe
rre
dfr
om
Seq
ue
nce
Mo
de
l(I
SM)
Pre
dic
tio
nm
eth
od
sfo
rn
on
cod
ing
RN
Ag
en
es
such
as
tRN
ASC
AN
-SE
,Sn
osc
an
,a
nd
Rfa
m
Infe
rre
dfr
om
ge
no
mic
con
text
(IG
C)
Info
rma
tio
na
bo
ut
the
ge
no
mic
con
text
of
ag
en
ep
rod
uct
form
sp
art
of
the
evi
de
nce
for
ap
art
icu
lar
an
no
tati
on
.
(i)
op
ero
nst
ruct
ure
(ii)
syn
ten
icre
gio
ns
(iii)
pa
thw
ay
an
aly
sis
(iv)
ge
no
me
sca
lea
na
lysi
so
fp
roce
sse
s
Infe
rre
dfr
om
Ke
y
Re
sid
ue
s(I
KR
)
Seq
ue
nce
an
aly
sis,
wh
ere
lack
of
ke
yse
qu
en
cere
sid
ue
sis
use
dto
ma
ke
an
eg
ati
ve(N
OT
)
an
no
tati
on
Sho
uld
incl
ud
eN
OT
qu
alifi
er
an
dW
ith
colu
mn
isre
qu
ire
dif
an
aly
sis
isca
rrie
d
ou
tb
ya
cura
tor
an
d
GO
_re
fere
nce
:00
00
04
7
Au
tho
rT
race
ab
leA
uth
or
Sta
tem
en
t(T
AS)
Ori
gin
al
evi
de
nce
isre
fere
nce
din
the
art
icle
an
dth
ere
fore
can
be
tra
ced
toa
no
the
r
sou
rce
No
n-t
race
ab
leA
uth
or
Sta
tem
en
t(N
AS)
Sta
tem
en
tsin
art
icle
sth
at
can
no
tb
etr
ace
dto
an
oth
er
art
icle
or
exp
eri
me
nt
Cu
rato
ria
lN
oD
ata
(ND
)U
sed
for
an
no
tati
on
sw
he
nin
form
ati
on
ab
ou
tth
em
ole
cula
rfu
nct
ion
,b
iolo
gic
al
pro
cess
,
or
cell
ula
rco
mp
on
en
to
fth
eg
en
eo
rg
en
ep
rod
uct
be
ing
an
no
tate
dis
no
ta
vail
ab
le
Sho
uld
be
use
do
nly
wit
hro
ot
no
de
s
Infe
rre
db
yC
ura
tor
(IC
)A
nn
ota
tio
nre
aso
na
bly
infe
rre
db
ya
cura
tor
fro
mo
the
rG
Oa
nn
ota
tio
ns,
for
wh
ich
dir
ect
evi
de
nce
isa
vaila
ble
GO
IDsh
ou
ldb
efi
lle
din
the
fro
m
colu
mn
Th
ee
vid
en
ceco
de
sa
reu
sed
tore
pre
sen
tth
em
eth
od
or
typ
eo
fre
sult
su
sed
tod
efi
ne
the
an
no
tati
on
.A
nn
ota
tors
sho
uld
use
this
tab
lea
sa
qu
ick
refe
ren
ceg
uid
ea
nd
con
sult
the
de
taile
dd
ocu
me
nta
tio
na
vaila
ble
on
lin
e(h
ttp
://w
ww
.ge
ne
on
tolo
gy.
org
/GO
.evi
de
nce
.sh
tml)
for
spe
cifi
cd
eta
ils
(in
clu
din
gth
ed
o’s
an
dd
on
’ts)
on
ho
wth
em
eth
od
or
resu
lts
are
ma
tch
ed
toe
ach
evi
de
nce
cod
e.
.............................................................................................................................................................................................................................................................................................
Page 8 of 18
Original article Database, Vol. 2013, Article ID bat054, doi:10.1093/database/bat054.............................................................................................................................................................................................................................................................................................
at California Institute of T
echnology on August 9, 2013
http://database.oxfordjournals.org/D
ownloaded from
http://www.geneontology.org/GO.evidence.shtmlhttp://database.oxfordjournals.org/
-
making annotations. While PubMed is a typical start-
ing point for finding relevant articles, research in the
area of Natural Language Processing (NLP) provides
additional methods that can aid in the search for
curatable articles. More on NLP methods used for bio-
curation can be found in the reports from the
Biocreative workshops (14). Once an article has been
identified, biocurators must properly specify the ob-
jects of annotation including confirmation of the cor-
rect taxa. These details are often found in the
Methods section of the article, but unambiguously
determining species for annotation can be problem-
atic, particularly in vertebrate systems where ortholo-
gous gene names are shared among taxa. Further,
when multiple model organism systems are being
used simultaneously, the taxa of the genes being
investigated is not always specifically designated.
For example, Lin and Isaacson (15) studied axonal
growth regulation by netrin and slit proteins using
both mouse and rat cells. Two of the plasmids con-
taining slit coding sequences were acknowledged as
gifts and no reference to the species of origin was
provided. In this case, to determine the species the
sequences represent, the authors had to be contacted
to confirm that the sequences actually originated
from human, neither mouse or rat.
(ii) The Introduction section of the article will often pre-
sent previous knowledge about the gene product’s
function. If citations to original works are included
then the article can be used as a source of the infor-
mation and annotated using the evidence code (see
below) Traceable Author Statement (TAS). The use of
TAS evidence has decreased over time, as it is best
practice to go to the original article to capture the
annotation directly from experimental results. This
allows for clear attribution of an annotation to the
original experimental details. Thus, GOC strongly dis-
courages the continued use of TAS and recommends
replacing existing TAS annotations with those to the
published experimental results.
(iii) Annotations derived from experimental data are
most often found in the Methods and Results sections
or in the figure legends of articles. A biocurator can
efficiently receive an overview of the biological
Table 3. A sample annotation in the GAF 2.0 format
Column Content Required? Example
1 DB Required MGI
2 DB Object ID Required MGI:1350922
3 DB Object Symbol Required Cadps
4 Qualifier Optional NOT
5 GO ID Required GO:0006887
6 DB:Reference (jDB:Reference) Required MGI:MGI:3583730jPMID:158206957 Evidence Code Required IMP
8 With (or) From Optional MGI:MGI:3583931
9 Aspect Required P
10 DB Object Name Optional Ca2+-dependent secretion activator
11 DB Object Synonym (jSynonym) Optional CAPS112 DB Object Type Required Protein
13 Taxon(jtaxon) Required Taxon:1009014 Date Required 20060202
15 Assigned By Required MGI
16 Annotation Extension Optional Occurs_in(CL:0000001)joccurs_in(CL:0000336)17 Gene Product Form ID Optional UniProtKB:Q80TJ1
This table provides an example of an annotation from the Mouse Genome Informatics group (from February 2013). The Cadps protein
(MGI identifier MGI:1350922) was annotated by the MGI project to ‘exocytosis’ [GO:0006887], a term in the Biological Process ontology
indicated by ‘P’ in column 9. This annotation used the ‘NOT’ qualifier indicating the authors of PMID:15820695 (5) showed that this
protein is ‘NOT’ involved in ‘exocytosis’. The non-PMID reference number, MGI:MGI:3583730, is MGI’s internal identifier for the same
reference. The curators arrived at this annotation based on the phenotype of the Cadps mutant, which is indicated with the IMP
evidence code. The identifier of the allele (MGI:MGI:3583931) used in the experiment is captured in column 8 (WITH/FORM). In addition,
the annotation extension field (column 16) indicates the cell types where this protein (CL:0000001, primary cell culture or CL:0000336,
adrenal medulla chromaffin cell) was NOT found to be involved in this process (exocytosis). Finally, the last column represents the
UniProtKB identifier for the isoform of the mouse Cadps protein that was studied.
.............................................................................................................................................................................................................................................................................................
Page 9 of 18
Database, Vol. 2013, Article ID bat054, doi:10.1093/database/bat054 Original article.............................................................................................................................................................................................................................................................................................
at California Institute of T
echnology on August 9, 2013
http://database.oxfordjournals.org/D
ownloaded from
paperspaperpaperpaperpaperpaperpaperpapershttp://database.oxfordjournals.org/
-
context of the article from the Introduction section
and then, using the experimental data in the
Results section, create annotations with appropriate
supporting experimental evidence.
(iv) Authors often speculate on the role of the gene
product in the Discussion section based on the experi-
mental results they present. The authors may propose
a hypothesis that combines previous knowledge, new
findings from the current study and new ideas that
have not yet been experimentally verified. This infor-
mation is not suitable for an annotation assertion and
if used to create an annotation can be detrimental, as
these hypotheses have not been validated.
Manual curation using sequencesimilarity data
Manual curation by biocurators includes the in-silico ana-
lysis of chromosomal features to infer a gene product’s role
and location. GO terms can be assigned to gene products
on the basis of sequence similarity using the evidence code
’Inferred from Sequence or structural Similarity’ (ISS) with a
custom reference, GO_Reference (GO_REF:0000024), as
described in the next section. Potential homologs are ini-
tially identified using sequence similarity search programs
such as BLAST. The significance of the sequence similarity is
then verified manually using a combination of sequence
resources and analysis tools, including phylogenetic and
comparative genomics databases such as Ensembl
Compara (16), INPARANOID (17) and OrthoMCL (18). In all
cases, biocurators validate each alignment to assess
whether similarity is appropriate to infer the gene prod-
uct’s function. While there is no universal definition for
the minimum requirements for similarity results, the signifi-
cance of a match is judged on a case-by-case basis by the
biocurator’s expertise. Although the similarity criteria
required to make these annotations are defined by the
annotating group, the GOC has established several rules
for making these assignments. They are as follows:
(i) Mandatory inclusion of a stable database identifier
that identifies the similar gene/gene product in the
‘WITH/FROM’ field (column 8 in Table 3)
(ii) The similar gene must be experimentally character-
ized; to avoid circular inferences, the GO term
should only be assigned if the similar gene/gene
product is, or can be annotated, with the same
term (or a more specific child term) using an experi-
mental evidence code (e.g. Inferred from Direct assay,
IDA; Inferred from Mutant Phenotype, IMP; Inferred
from Genetic Interaction, IGI, Inferred from Physical
Interaction, IPI; Inferred from Expression Pattern, IEP).
Annotations made with the NOT qualifier should not
be transferred.
Sequence characteristics can be used to infer GO anno-
tations for all three aspects of the ontology. However, care
should be taken when transferring biological process anno-
tations, as cellular processes and metabolic processes, for
example, may be more readily inferred from sequence
similarity than developmental processes which may be
species- or clade-specific
Use of GO reference
As mentioned above, manual curation does not always re-
quire a published reference to indicate the source of evi-
dence. Annotations can be inferred by biocurators by
analysis of the gene sequence or by combining direct
experimental evidence from multiple sources. In these situ-
ations, the citation is to a custom reference. These so-called
GO references describe the methods and procedures used
in creating such annotations. For example, GO_REF:
0000024 (http://www.geneontology.org/cgi-bin/references.
cgi#GO_REF:0000024), titled ‘Manual transfer of experi-
mentally verified manual GO annotation data to orthologs
by curator judgment of sequence similarity’, was created to
describe the transfer of manual annotations using curator
judgment to annotations associated with the ISS code. A
second example is GO_REF:0000036 (http://www.geneon
tology.org/cgi-bin/references.cgi#GO_REF:0000036),
‘Manual annotations that require more than one source of
functional data to support the assignment of the associated
GO term.’ This GO reference is used with the Inferred by
Curator (IC) evidence code, described below. GO references
are created and published on the GOC Web site (http://
www.geneontology.org/cgi-bin/references.cgi) only once
the biocurators agree on the content of the abstract and
its usage.
How to define an annotation?
Once literature relevant to a gene product has been iden-
tified, the following guidelines can be used to decide which
GO term(s) and evidence code(s) should be associated.
Individual articles may not provide results that support an-
notations for all three aspects of the ontology; thus, anno-
tations to the different aspects will generally need to come
from different articles. Also it is common, from a single
article, to identify multiple annotations identified for one
aspect and to annotate to different levels of granularity in
the same branch of the ontology. The granularity of the GO
term selected depends heavily on the type of experiments
being reported as well as the ability of the biocurator to
understand the limitations of that experimental method.
MacCullen (19) interviewed biocurators from the GOC in
.............................................................................................................................................................................................................................................................................................
Page 10 of 18
Original article Database, Vol. 2013, Article ID bat054, doi:10.1093/database/bat054.............................................................................................................................................................................................................................................................................................
at California Institute of T
echnology on August 9, 2013
http://database.oxfordjournals.org/D
ownloaded from
paper:Gene Ontology ()'Inferred`,R:http://www.geneontology.org/cgi-bin/references.cgi#GO_REF:0000024http://www.geneontology.org/cgi-bin/references.cgi#GO_REF:0000024``-''Inferred from Sequence or Structural Similarity ()http://www.geneontology.org/cgi-bin/references.cgi#GO_REF:0000036http://www.geneontology.org/cgi-bin/references.cgi#GO_REF:0000036``''http://www.geneontology.org/cgi-bin/references.cgihttp://www.geneontology.org/cgi-bin/references.cgipaper,paperpaperhttp://database.oxfordjournals.org/
-
an effort to correlate the curator’s education, work experi-
ence and research experience to measured variability in an-
notation. After observing there was significant variability in
a test set of annotations, he explored possible causes.
MacCullen reported no correlation between the amount
of variation and any specific characteristic of the biocura-
tor’s education or experience and suggested that biocura-
tors should continually work to coordinate annotation
methods with the goal of minimizing variation. The solu-
tion used by the Consortium’s member projects is to have
continuing education and discussions between biocurators
to reduce variability that arises from inconsistent use of the
rules and misunderstanding of the ontology terms. Also to
further address the variability in the interpretation by bio-
curators, the GOC holds regular controlled annotation ex-
ercises to define standards and maintain consistent
procedures. These exercises are conducted within and
across most projects where biocurators annotate the same
article or a small set of articles and then compare their an-
notations. A discussion follows where the GOC comes to a
consensus about the most appropriate annotations for that
article and in the process educates its staff.
Choosing the right GO term
As emphasized above, ontology terms should be chosen
based not on the term name, but on the definition of the
term. Ontology terms can be explored using AmiGO (20),
http://amigo.geneontology.org, or QuickGO (21), http://
www.ebi.ac.uk/QuickGO/. Often it is hard to find the appro-
priate GO term using the description or phrases from the
literature because GO terms can be more descriptive and
they reflect the actual function or process rather than a
gene product name or family name. Therefore, to assist in
searching, and to accurately reflect the language of biology,
many ontology terms are associated with synonyms, which
are typically the terminology or language used in the litera-
ture. For example, the phrase ‘transcription repressor’ is
loosely used in the literature to refer to any transcription
repressing role. This concept is represented in the GO as
‘negative regulation of transcription, DNA-dependent’
[GO:0045892], and the phrase transcription repressor is a
synonym of this term. Development of the ontologies (i.e.,
adding new terms, refining definitions) is an active process
and if an appropriate GO term that is suitable to describe a
gene product is not available, biocurators are encouraged to
request that a new term be added to the ontology. The GOC
has setup several ways to handle new term requests and to
evaluate existing terms. The easiest way is to contact the GO
helpdesk ([email protected] or http://www.
geneontology.org/GO.contacts.shtml) providing as much
detail as possible.
What if nothing is known aboutthe gene product?
Typically after an organism’s genome sequence is deter-
mined, structural annotation is performed using computa-
tional methods to make gene model predictions. Some of
the resulting predicted genes will have been previously
characterized and as a result will have literature-associated
evidence or sequence based relationships to other well-
defined genes. For other predicted genes neither experi-
mental nor sequence based functions will be available.
This represents sets of similar proteins that have yet to be
characterized and proteins without similarity to any previ-
ously characterized sequence. Thus no literature is available
on which to base an annotation. In cases such as this where
nothing can be gleaned from the literature, it is correct to
associate the gene product to the most general terms in the
three ontologies, ‘molecular_function’ (GO:0003674), ‘bio-
logical_process’ (GO:0008150) and ‘cellular_component’
(GO:0005575) (called the root nodes, see Table 1) with the
evidence code No Data (ND). It should be noted that anno-
tating to the root node specifically states that an extensive
search of the literature was conducted and no experimental
results were found to indicate the function of this gene
product. Since a biocurator infers that no