Introduction to the GO: a user’s guide
Iowa State Workshop
11 June 2009
All workshop materials are available at AgBase.
Genomic Annotation
Genome annotation is the process of attaching biological information to genomic sequences. It consists of two main steps:
1. identifying functional elements in the genome: “structural annotation”
2. attaching biological information to these elements: “functional annotation”
Biologists often use the term “annotation” when they are referring only to structural annotation.
[Figure: structural annotation and functional annotation example. Labels: CHICK_OLF6; DNA annotation; protein annotation; TRAF 1, 2 and 3; TRAF 1 and 2; catenin. Data from Ensembl Genome browser.]
Structural & Functional Annotation
Structural Annotation:
• Open reading frames (ORFs) predicted during genome assembly
• predicted ORFs require experimental confirmation
• the Sequence Ontology (SO) provides a structured controlled vocabulary for sequence annotation
Functional Annotation:
• annotation of gene products = Gene Ontology (GO) annotation
• initially, predicted ORFs have no functional literature and GO annotation relies on computational methods (rapid)
• functional literature exists for many genes/proteins prior to genome sequencing
• GO annotation does not rely on a completed genome sequence!
1. Provides structural annotation for agriculturally important genomes
2. Provides functional annotation (GO)
3. Provides tools for functional modeling
4. Provides bioinformatics & modeling support for research community
Introduction to GO
1. pre-GO: managing large datasets
2. Bio-ontologies
3. the Gene Ontology (GO)
• a GO annotation example
• GO evidence codes
• literature biocuration & computational analysis
• ND vs no GO
• sources of GO
1. pre-GO: managing large datasets
AgBase User Support
• Functional modeling training
• Database ID mapping: approx. 75% of requests
• Providing GO annotation for datasets/arrays
• Assistance with GO modeling tools
• Intermediary between the research community and public databases (NCBI, UniProtKB, GO Consortium)
• Computational assistance
Converting database accessions
• UniProt database
• Ensembl BioMart
• Online analysis tools: DAVID, g:profiler, etc.
• AgBase database: ArrayIDer tool
More information about these tools is available from the online workshop resources.
1. UniProt ID Mapping
2. Ensembl BioMart
NOTE: Ensembl is scheduled to add plant & microbe species in 2009.
3. Online analysis tools: g:profiler conversion tool
http://biit.cs.ut.ee/gprofiler/gconvert.cgi
This tool works for all species found in Ensembl.
3. Online analysis tools: Database for Annotation, Visualization and Integrated Discovery (DAVID)
http://david.abcc.ncifcrf.gov/conversion.jsp
This tool works for a wide range of species.
Contact AgBase to request additional species.
4. AgBase: ArrayIDer
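Whichever of these tools produces the mapping, applying it to a dataset comes down to a lookup. The following is a minimal Python sketch, assuming a hypothetical two-column, tab-delimited mapping export (source accession, target accession). The function names and the identifiers in the sample are placeholders invented for illustration; they are not the output format of any specific tool listed above.

```python
import csv
import io

def load_mapping(handle):
    """Read a two-column, tab-delimited mapping file into a dict of lists.

    One source accession may map to several target accessions, so each
    key holds a list of targets.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank, short, or comment lines
        mapping.setdefault(row[0], []).append(row[1])
    return mapping

def convert(accessions, mapping):
    """Split accessions into converted ones and ones with no mapping."""
    converted = {acc: mapping[acc] for acc in accessions if acc in mapping}
    unmapped = [acc for acc in accessions if acc not in mapping]
    return converted, unmapped

# Hypothetical mapping export; all identifiers are placeholders.
SAMPLE = "ARRAY_0001\tPROT_A\nARRAY_0002\tPROT_B\nARRAY_0002\tPROT_C\n"
mapping = load_mapping(io.StringIO(SAMPLE))
converted, unmapped = convert(["ARRAY_0001", "ARRAY_0002", "ARRAY_0003"], mapping)
```

Keeping the unmapped accessions in a separate list makes it easy to see which identifiers still need to be converted with another tool.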
2. Bio-ontologies
Bio-ontologies
Bio-ontologies are used to capture biological information in a way that can be read by both humans and computers.
• necessary for high-throughput “omics” datasets
• allows data sharing across databases
Objects in an ontology (eg. genes, cell types, tissue types, stages of development) are well defined.
The ontology shows how the objects relate to each other.
Bio-ontologies: http://www.obofoundry.org/
Ontologies
• digital identifier (computers)
• description (humans)
• relationships between terms
3. The Gene Ontology
Functional Annotation
• Gene Ontology (GO) is the de facto method for functional annotation
• Widely used for functional genomics (high throughput)
• Many tools available for gene expression analysis using GO
• The GO Consortium homepage: http://www.geneontology.org
GO Mapping Example
NDUFAB1 (UniProt P52505)
Bovine NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1, 8kDa
Biological Process (BP or P)
GO:0006633 fatty acid biosynthetic process TAS
GO:0006120 mitochondrial electron transport, NADH to ubiquinone TAS
GO:0008610 lipid biosynthetic process IEA
Cellular Component (CC or C)
GO:0005759 mitochondrial matrix IDA
GO:0005747 mitochondrial respiratory chain complex I IDA
GO:0005739 mitochondrion IEA
Molecular Function (MF or F)
GO:0005504 fatty acid binding IDA
GO:0008137 NADH dehydrogenase (ubiquinone) activity TAS
GO:0016491 oxidoreductase activity TAS
GO:0000036 acyl carrier activity IEA
Parts of each annotation:
• aspect or ontology
• GO:ID (unique)
• GO term name
• GO evidence code
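The four parts of an annotation (aspect, GO ID, term name, evidence code) map naturally onto a small record type. The sketch below is one possible representation in Python, not a format defined by the GO Consortium; the class and function names are assumptions made for illustration. The data are the NDUFAB1 (UniProt P52505) annotations from the example.

```python
import re
from typing import NamedTuple

class GOAnnotation(NamedTuple):
    aspect: str    # P (biological process), F (molecular function) or C (cellular component)
    go_id: str     # unique identifier, "GO:" followed by seven digits
    term: str      # human-readable GO term name
    evidence: str  # GO evidence code

# NDUFAB1 annotations as listed in the GO mapping example.
NDUFAB1 = [
    GOAnnotation("P", "GO:0006633", "fatty acid biosynthetic process", "TAS"),
    GOAnnotation("P", "GO:0006120", "mitochondrial electron transport, NADH to ubiquinone", "TAS"),
    GOAnnotation("P", "GO:0008610", "lipid biosynthetic process", "IEA"),
    GOAnnotation("C", "GO:0005759", "mitochondrial matrix", "IDA"),
    GOAnnotation("C", "GO:0005747", "mitochondrial respiratory chain complex I", "IDA"),
    GOAnnotation("C", "GO:0005739", "mitochondrion", "IEA"),
    GOAnnotation("F", "GO:0005504", "fatty acid binding", "IDA"),
    GOAnnotation("F", "GO:0008137", "NADH dehydrogenase (ubiquinone) activity", "TAS"),
    GOAnnotation("F", "GO:0016491", "oxidoreductase activity", "TAS"),
    GOAnnotation("F", "GO:0000036", "acyl carrier activity", "IEA"),
]

GO_ID_PATTERN = re.compile(r"GO:\d{7}")

def is_valid_go_id(go_id):
    """Check that an identifier has the form GO: plus seven digits."""
    return GO_ID_PATTERN.fullmatch(go_id) is not None

def by_aspect(annotations):
    """Group annotations by aspect (P, F, C)."""
    grouped = {"P": [], "F": [], "C": []}
    for ann in annotations:
        grouped[ann.aspect].append(ann)
    return grouped

grouped = by_aspect(NDUFAB1)
```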
GO EVIDENCE CODES
Direct Evidence Codes
IDA - inferred from direct assay
IEP - inferred from expression pattern
IGI - inferred from genetic interaction
IMP - inferred from mutant phenotype
IPI - inferred from physical interaction
Indirect Evidence Codes
inferred from literature
IGC - inferred from genomic context
TAS - traceable author statement
NAS - non-traceable author statement
IC - inferred by curator
inferred by sequence analysis
RCA - inferred from reviewed computational analysis
IS* - inferred from sequence*
IEA - inferred from electronic annotation
Other
NR - not recorded (historical)
ND - no biological data available
* IS* codes:
ISS - inferred from sequence or structural similarity
ISA - inferred from sequence alignment
ISO - inferred from sequence orthology
ISM - inferred from sequence model
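When filtering annotations it is often useful to look up which group an evidence code belongs to. The following Python sketch encodes the direct / indirect / other grouping exactly as listed on this slide; the dictionary and function names are assumptions for illustration, and the GO Consortium documentation remains the authoritative list of codes.

```python
# Evidence code groups as given on the slide.
DIRECT = {
    "IDA": "inferred from direct assay",
    "IEP": "inferred from expression pattern",
    "IGI": "inferred from genetic interaction",
    "IMP": "inferred from mutant phenotype",
    "IPI": "inferred from physical interaction",
}
INDIRECT = {
    "IGC": "inferred from genomic context",
    "TAS": "traceable author statement",
    "NAS": "non-traceable author statement",
    "IC": "inferred by curator",
    "RCA": "inferred from reviewed computational analysis",
    "ISS": "inferred from sequence or structural similarity",
    "ISA": "inferred from sequence alignment",
    "ISO": "inferred from sequence orthology",
    "ISM": "inferred from sequence model",
    "IEA": "inferred from electronic annotation",
}
OTHER = {
    "NR": "not recorded (historical)",
    "ND": "no biological data available",
}

def evidence_group(code):
    """Return 'direct', 'indirect' or 'other' for a GO evidence code."""
    if code in DIRECT:
        return "direct"
    if code in INDIRECT:
        return "indirect"
    if code in OTHER:
        return "other"
    raise ValueError(f"unknown evidence code: {code!r}")

def keep_direct(annotations):
    """Keep only (go_id, evidence_code) pairs backed by direct evidence."""
    return [(go_id, code) for go_id, code in annotations
            if evidence_group(code) == "direct"]

# Three of the NDUFAB1 annotations from the GO mapping example.
direct_only = keep_direct([
    ("GO:0005759", "IDA"),
    ("GO:0006633", "TAS"),
    ("GO:0005739", "IEA"),
])
```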
Biocuration of literature
• detailed function
• “depth”
• slower (manual)
Biocuration of Literature: detailed gene function
• Find a paper about the protein (example: P05147, PMID: 2976880)
• Read paper to get experimental evidence of function
• Use most specific term possible
• experiment assayed kinase activity: use IDA evidence code
Sequence analysis
• rapid (computational)
• “breadth” of coverage
• less detailed
Computational GO annotation (“breadth”)
Ranjit Kumar
ISO PIPELINE
1. accessions from your species (species 1)
2. public orthology prediction tool(s)
3. 1:1 orthologs
4. existing GO annotations
5. transfer GO annotation to your species (ISO)
6. ga file; accessions with no ISO go to the IEA pipeline
IEA PIPELINE
1. fasta file of sequences (aa or nt)
2. InterPro analysis (domains/motifs)
3. domains/motifs in sequence
4. GO2InterPro mapping file
5. assign GO (IEA); no GO: “ND”
6. ga file
(integrate output into one ga file)
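The two pipelines can be sketched as a single function: try ISO transfer through 1:1 orthologs first, fall back to IEA through domain-to-GO mappings, and record “ND” when neither yields a term. This is a simplified Python illustration, not the AgBase implementation. It assumes the orthology predictions, InterPro results and mapping file have already been loaded into dictionaries; the function name, the accessions and the domain identifier are hypothetical placeholders; and using `None` as the GO ID for “ND” rows is a simplification.

```python
def annotate(accessions, orthologs, existing_go, domain_hits, domain2go):
    """Return (accession, go_id, evidence_code) rows for one combined output.

    orthologs   : accession -> 1:1 ortholog in an annotated species
    existing_go : ortholog accession -> list of GO IDs
    domain_hits : accession -> list of domain/motif identifiers
    domain2go   : domain/motif identifier -> list of GO IDs
    """
    rows = []
    for acc in accessions:
        # ISO pipeline: transfer existing GO annotation from the 1:1 ortholog.
        ortholog = orthologs.get(acc)
        iso_terms = existing_go.get(ortholog, []) if ortholog else []
        if iso_terms:
            rows.extend((acc, go_id, "ISO") for go_id in iso_terms)
            continue

        # IEA pipeline: accessions with no ISO are annotated from domains/motifs.
        iea_terms = []
        for domain in domain_hits.get(acc, []):
            for go_id in domain2go.get(domain, []):
                if go_id not in iea_terms:  # avoid duplicate terms
                    iea_terms.append(go_id)
        if iea_terms:
            rows.extend((acc, go_id, "IEA") for go_id in iea_terms)
        else:
            rows.append((acc, None, "ND"))  # no GO could be assigned
    return rows

# Placeholder inputs; accessions and the domain name are invented.
rows = annotate(
    accessions=["SP1_GENE_A", "SP1_GENE_B", "SP1_GENE_C"],
    orthologs={"SP1_GENE_A": "SP2_GENE_A"},
    existing_go={"SP2_GENE_A": ["GO:0005739"]},
    domain_hits={"SP1_GENE_B": ["DOMAIN_X"]},
    domain2go={"DOMAIN_X": ["GO:0016491"]},
)
```

Collecting the rows from both pipelines in one list mirrors the final step of integrating the output into one ga file.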
Unknown Function vs No GO
ND – no data
• Biocurators have tried to add GO but there is no functional data available
• Previously: “process_unknown”, “function_unknown”, “component_unknown”
• Now: “biological process”, “molecular function”, “cellular component”
No annotations (including no “ND”): biocurators have not annotated
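The difference between “ND” and no annotation at all can be checked programmatically. The sketch below is a minimal Python illustration; the function names and status strings are assumptions. The three identifiers are the GO root terms for biological process, molecular function and cellular component, which is where “ND” annotations are now attached.

```python
# GO root terms, used for "ND" annotations.
ROOT_TERMS = {
    "GO:0008150": "biological process",
    "GO:0003674": "molecular function",
    "GO:0005575": "cellular component",
}

def is_nd(go_id, evidence):
    """True if the annotation is an ND annotation to a root term."""
    return evidence == "ND" and go_id in ROOT_TERMS

def annotation_status(annotations):
    """Classify a gene product from its list of (go_id, evidence) pairs."""
    if not annotations:
        # No annotations, not even ND: biocurators have not annotated it.
        return "not annotated"
    if all(is_nd(go_id, ev) for go_id, ev in annotations):
        # Biocurators looked but found no functional data.
        return "no data available (ND)"
    return "annotated"

status_empty = annotation_status([])
status_nd = annotation_status([("GO:0008150", "ND"), ("GO:0003674", "ND")])
status_known = annotation_status([("GO:0005739", "IEA"), ("GO:0008150", "ND")])
```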
1. Primary sources of GO: from the GO Consortium (GOC) & GOC members
• most up to date
• most comprehensive
2. Secondary sources: other resources that use GO provided by GOC members
• public databases (e.g. NCBI, UniProtKB)
• genome browsers (e.g. Ensembl)
• array vendors (e.g. Affymetrix)
• GO expression analysis tools
Different tools and databases display the GO annotations differently.
Since GO terms are continually changing and GO annotations are continually added, you need to know when the GO annotations were last updated.
Secondary Sources of GO annotation
EXAMPLES:
• public databases (e.g. NCBI, UniProtKB)
• genome browsers (e.g. Ensembl)
• array vendors (e.g. Affymetrix)
CONSIDERATIONS:
• What is the original source?
• When was it last updated?
• Are evidence codes displayed?
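When the annotations are available as a gene association (ga) file, these three questions can be answered from the file itself. The Python sketch below assumes the 15 leading tab-delimited columns of the GAF 2.0 layout, where column 7 is the evidence code, column 14 the date (YYYYMMDD) and column 15 the assigning source. The function names are illustrative, and the reference, date and source values in the sample line are placeholders rather than real annotation data.

```python
import io
from datetime import datetime

# Leading columns of a gene association file (GAF 2.0 layout assumed).
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by",
]

def parse_gaf_line(line):
    """Split one annotation line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(GAF_COLUMNS):
        raise ValueError(f"expected at least {len(GAF_COLUMNS)} columns, got {len(fields)}")
    return dict(zip(GAF_COLUMNS, fields))

def summarize(handle):
    """Report original sources, evidence codes and the most recent date."""
    sources, evidence_codes, latest = set(), set(), None
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        record = parse_gaf_line(line)
        sources.add(record["assigned_by"])
        evidence_codes.add(record["evidence_code"])
        date = datetime.strptime(record["date"], "%Y%m%d").date()
        if latest is None or date > latest:
            latest = date
    return {"sources": sources, "evidence_codes": evidence_codes, "last_updated": latest}

# One illustrative line; reference, date and source are placeholders.
SAMPLE = "!gaf-version: 2.0\n" + "\t".join([
    "UniProtKB", "P52505", "NDUFAB1", "", "GO:0005739",
    "REF:PLACEHOLDER", "IEA", "", "C", "", "", "protein",
    "taxon:9913", "20090611", "EXAMPLE_SOURCE",
]) + "\n"
summary = summarize(io.StringIO(SAMPLE))
```

If a secondary source does not expose the date or evidence code at all, that is itself a reason to go back to the primary source.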
For more information about GO
• GO Evidence Codes: http://www.geneontology.org/GO.evidence.shtml
• gene association file information: http://www.geneontology.org/GO.format.annotation.shtml
• tools that use the GO: http://www.geneontology.org/GO.tools.shtml
• GO Consortium wiki: http://wiki.geneontology.org/index.php/Main_Page
All websites are available from the workshop website & handout.