Judith BlakeFunc Genomics2012
Biomedical Ontologies and their role in functional genomics
Judith A. Blake, Ph.D.
The Jackson Laboratory
Functional Genomics – February 2012
Bioinformatics: What is that?
Bioinformatics is:
• the use of computers (and persistent data structures) in pursuit of biological research
• an emerging new discipline, with its own goals, research program, and practitioners
• the fundamental tool for 21st century biology
• all of the above.
Robert J. Robbins
Topics:
• We need to coordinate the representation of information
  – from genetic and genomic studies,
  – as might be reported in the biomedical literature, and
  – from the output of high-throughput experiments
• This is done by designing databases (e.g., MGI) and bio-ontologies (e.g., GO) to support comprehensive data integration
• Such resources enable comparative analysis between different organisms and biological systems
• With the objective of helping us gain new knowledge about biological systems and particularly about genetic components of human diseases
Roxy Laybourne and others, photo by Chip Clark
Managing Biological Information is Nothing New
Bird Collections at the Smithsonian Natural History Museum
The trouble with facts is that there are so many of them.
Samuel McChord Crothers, The Gentle Reader (1903)
The data integration problem
• Vast wealth of data residing in different databases
  – Meaning of those records must be reconciled for data to be automatically integrated
[Diagram labels: science database, medical database]
Accession File
TCTCTCCCCCGCCCCCCAGGCTCCCCCGGTCGCTCTCCTCCGGCGGTCGCCCGCGCTCGGTGGATGTGGC
TGGCAGCTGCCGCCCCCTCCCTCGCTCGCCGCCTGCTCTTCCTCGGCCCTCCGCCTCCTCCCCTCCTCCT
TCTCGTCTTCAGCCGCTCCTCTCGCCGCCGCCTCCACAGCCTGGGCCTCGCCGCGATGCCGGAGAAGAGG
CCCTTCGAGCGGCTGCCTGCCGATGTCTCCCCCATCAACTACAGCCTTTGCCTCAAGCCCGACTTGCTGG
ACTTCACCTTCGAGGGCAAGCTGGAGGCCGCCGCCCAGGTGAGGCAGGCGACTAATCAGATTGTGATGAA
TTGTGCTGATATTGATATTATTACAGCTTCATATGCACCAGAAGGAGATGAAGAAATACATGCTACAGGA
TTTAACTATCAGAATGAAGATGAAAAAGTCACCTTGTCTTTCCCTAGTACTCTGCAAACAGGTACGGGAA
CCTTAAAGATAGATTTTGTTGGAGAGCTGAATGACAAAATGAAAGGTTTCTATAGAAGTAAATATACTAC
CCCTTCTGGAGAGGTGCGCTATGCTGCTGTAACACAGTTTGAGGCTACTGATGCCCGAAGGGCTTTTCCT
TGCTGGGATGAGCCTGCTATCAAAGCAACTTTTGATATCTCATTGGTTGTTCCTAAAGACAGAGTAGCTT
TATCAAACATGAATGTAATTGACCGGAAACCATACCCTGATGATGAAAATTTAGTGGAAGTGAAGTTTGC
CCGCACACCTGTTATGTCTACATATCTGGTGGCATTTGTTGTGGGTGAATATGACTTTGTAGAAACAAGG
TCAAAAGATGGTGTGTGTGTCCGTGTTTACACTCCTGTTGGCAAAGCAGAGCAAGGAAAATTTGCGTTAG
AGGTTGCTGCTAAAACCTTGCCTTTTTATAAGGACTACTTCAATGTTCCTTATCCTCTACCTAAAATTGA
TCTCATTGCTATTGCAGACTTTGCAGCTGGTGCCATGGAGAACTGGGGCCTTGTTACTTATAGGGAGACT
GCATTGCTTATTGATCCAAAAAATTCCTGTTCTTCATCCCGCCAGTGGGTTGCTCTGGTTGTGGGACATG
AACTCGCCCATCAATGGTTTGGAAATCTTGTTACTATGGAATGGTGGACTCATCTTTGGTTAAATGAAGG
TTTTGCATCCTGGATTGAATATCTGTGTGTAGACCACTGCTTCCCAGAGTATGATATTTGGACTCAGTTT
GTTTCTGCTGATTACACCCGTGCCCAGGAGCTTGACGCCTTAGATAACAGCCATCCTATTGAAGTCAGTG
TGGGCCATCCATCTGAGGTTGATGAGATATTTGATGCTATATCATATAGCAAAGGTGCATCTGTCATCCG
AATGCTGCATGACTACATTGGGGATAAGGACTTTAAGAAAGGAATGAACATGTATTTAACCAAGTTCCAA
CAAAAGAATGCTGCCACAGAGGATCTCTGGGAAAGTTTAGAAAATGCTAGTGGTAAACCTATAGCAGCTG
From the birth of the field of genetics until a decade ago, it was generally assumed that the parental origin of a gene could have no effect on its function. In the vast majority of studies carried out during the last 90 years, this paradigm has appeared to hold true. However, with increasingly sophisticated genetic and embryological investigations in the mouse, important exceptions to this rule have been uncovered over the last decade. First, the results of nuclear transplantation experiments carried out with single-cell fertilized embryos have demonstrated an absolute requirement for both a maternally-derived and a paternally-derived pronucleus to allow full-term development (McGrath and Solter, 1983). Second, in animals that receive both homologs of certain chromosomes or subchromosomal regions from one parent and not the other (through the mating of translocation heterozygotes as described in Section 5.2.3), dramatic effects on development can be observed including enhanced or retarded growth and outright lethality (Cattanach and Kirk, 1985). Third, either of two deletions that cover a small region of mouse chromosome 17 can be transmitted normally from a father to his offspring, but these same deletions cause prenatal lethality when they are maternally transmitted (Johnson, 1974; Winking and Silver, 1984). Fourth, similar parent-of-origin effects have been observed on the phenotypes expressed by animals that carry a targeted knock-out allele at the Igf2 locus (DeChiara et al., 1991). Finally, molecular techniques have been used to directly demonstrate the expression of transcripts from one parental allele and not the other at the Igf2r locus (Barlow et al., 1991) and the H19 locus (Bartolomei et al., 1991).
The accumulated data indicate that a subset of mouse genes (on the order of 0.2%) will function differently in normal embryos depending on whether they have been inherited through the male or the female gamete, such that one allele will be expressed and the other will be silent. Genomic imprinting is the term that has been coined to describe this situation in which the phenotype expressed by a gene varies depending on its parental origin (Sapienza, 1989). Further experiments have demonstrated that, in general, the "imprint" is erased and regenerated during gametogenesis so that the function of an imprintable gene is fully determined by the sex of its progenitor alone, and not by earlier ancestors.
Crash Blossoms and other semantic ambiguities
translating what we say into what we mean: data, words and knowledge
Crash Blossoms
“Violinist Linked to JAL Crash Blossoms”
“MacArthur Flies Back to Front”
“Squad Helps Dog Bite Victim”
“Red Tape Holds Up New Bridge.”
The English Language is hard to learn, even for computers.
“Jessica Hahn Pooped After Long Day Testifying”
Focus: creating the data structures and mining the biomedical literature to provide knowledge representations, with the objective of using logical reasoning applications and predictive approaches to ‘interrogate’ very large data sets, generating new hypotheses for further experimental investigation
What is an ontology?
A biological ontology is: A formal representation of some portion of biological reality
• what kinds of things exist?
• what are the relationships between these things?
[Diagram: terms eye, ommatidium, sense organ, eye disc; relations is_a, part_of, develops_from]
Why do we need ontologies?
Connections are not made explicit by default
• Computers are not intelligent
• We need to spell out interconnectedness of entities
  – Specificity: bone mineralization vs ossification
  – Granularity: osteocyte vs bone
  – Spatial: gill membrane and branchiostegal ray
  – Perspective: anatomy vs physiology
  – Causally related entities
    • pathways
    • development
  – Evolutionary: homology and descent
Ontologies : the key to data integration
• Ontologies provide:
  – rigorous, shared computable definitions for terms
  – classifications and connections that can be used for database search and inference
Annotation of genes and proteins using ontologies is key to data integration
Biomedical Ontologies
Ontologies are human- and machine-readable classifications of biological knowledge.
Ontologies have:
• Terms
• Term definitions
• Relationships among terms
Good ontology design is required for data integration
• Not any old ontology will do
  – Data integration served poorly by poor ontologies
• How do we know good ontologies?
  – Types and classifications should be constructed according to science and should reflect nature
  – Ontology constructed along lines of ontology best practices
    • http://www.obofoundry.org
    • Formal definitions and relations
    • Based on distinction between types and instances
    • Distinction between types and their labels
The Gene Ontology
• Mid-size
  – ~33,700 terms in all 3 ontologies
  – ~2n,nnn links (is_a, part_of, regulates)
• Each term represents a type
  – Terms also have alternate labels (synonyms)
    • These do not represent distinct types
    • Humans use different labels to refer to the same biological pattern
  – E.g.: endoplasmic reticulum vs ER
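The separation of a type from its labels can be sketched in a few lines of Python. This is an illustrative sketch only: the `Term` class and `resolve` function are hypothetical names, not part of any GO software, and the single term loaded is the tricarboxylic acid cycle entry that appears later in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """One ontology type: a stable ID, a preferred label, and alternate labels."""
    term_id: str
    label: str
    synonyms: tuple = ()


# Term and synonyms as listed on the GO example slide in this deck.
TCA = Term("GO:0006099", "tricarboxylic acid cycle",
           ("Krebs cycle", "citric acid cycle"))


def build_label_index(terms):
    """Map every preferred label and synonym (case-insensitive) to its term ID."""
    index = {}
    for term in terms:
        for name in (term.label, *term.synonyms):
            index[name.lower()] = term.term_id
    return index


LABEL_INDEX = build_label_index([TCA])


def resolve(name):
    """Return the term ID for a label or synonym, or None if the label is unknown."""
    return LABEL_INDEX.get(name.strip().lower())
```

Calling `resolve("Krebs cycle")` and `resolve("citric acid cycle")` both give `GO:0006099`: many labels, one type.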
Ontology is not nomenclature
• A type can have many labels
  – Preferred label (term)
  – Synonyms, aliases
• Types are not labels
  – Types are the underlying pattern
    • Identified by a formal definition
  – Labels are important for doing science
    • But life existed for billions of years quite happily prior to the invention of names and labels
  – Good ontology separates the underlying patterns in nature from the labels used to describe them
Ontologies and annotation
• Ontologies are of little practical use without annotation
  – GO has ~6 million annotations linking genes and gene products to GO terms
  – Mostly (but not all) MOD & Human
  – Same terms are shared across species
• All annotation statements have provenance
  – Source/publication
  – Evidence & evidence codes
Use of GO annotations
• Database search
• Database integration
• Automating further annotation
• Data mining and data analysis
  – Microarray analysis:
    1. Extract cluster of co-expressed genes
    2. Analyze annotations for enrichment of certain terms
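Step 2 of the microarray analysis is commonly done with a hypergeometric (one-sided Fisher) test. The slides do not say which statistic is intended, so the sketch below is an assumption about one standard choice; the function name and the example numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, cluster, hits):
    """Probability of seeing at least `hits` annotated genes in the cluster by chance.

    population: number of genes on the array
    annotated:  genes in the population annotated to the term of interest
    cluster:    genes in the co-expressed cluster
    hits:       cluster genes annotated to the term
    """
    total = comb(population, cluster)
    upper = min(annotated, cluster)
    # Sum the hypergeometric probabilities for k = hits .. upper.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster - k)
        for k in range(hits, upper + 1)
    )
    return tail / total
```

For a toy array of 10 genes with 3 annotated to a term, a cluster of 3 genes that are all annotated gives `enrichment_p_value(10, 3, 3, 3)`, which is 1/120. In practice the test is run once per term, so a multiple-testing correction is needed on top of this.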
What is a Database?
• an organized body of related information
• In computing, a database can be defined as a structured collection of records or data that is stored in a computer so that a program can consult it to answer queries. The records retrieved in answer to queries become information that can be used to make decisions.
Mouse Genome Informatics (MGI) Database
• Comprehensive information resource about the laboratory mouse
• Provides consensus representation of the mouse genome
• International scientific community resource
• Integrated data acquisition and query capabilities
MGI Database is a Relational Database: Information is stored in tables that have relationships to each other. This facilitates query and retrieval of subsets of data.
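A toy version of "tables that have relationships to each other" can be built with Python's standard `sqlite3` module. The table names, column names, gene symbols, and rows below are invented for illustration; they are not the MGI schema and not real MGI data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (
    gene_id TEXT PRIMARY KEY,
    symbol  TEXT NOT NULL
);
CREATE TABLE go_annotation (
    gene_id  TEXT NOT NULL REFERENCES gene(gene_id),
    term     TEXT NOT NULL,
    evidence TEXT NOT NULL
);
""")

# Hypothetical placeholder rows.
conn.executemany("INSERT INTO gene VALUES (?, ?)",
                 [("G1", "GeneA"), ("G2", "GeneB")])
conn.executemany("INSERT INTO go_annotation VALUES (?, ?, ?)",
                 [("G1", "DNA binding", "IDA"),
                  ("G1", "protein binding", "TAS"),
                  ("G2", "DNA binding", "IEA")])


def genes_with_term(term):
    """Join the two tables to retrieve the subset of genes annotated to a term."""
    rows = conn.execute(
        "SELECT g.symbol FROM gene AS g "
        "JOIN go_annotation AS a ON a.gene_id = g.gene_id "
        "WHERE a.term = ? ORDER BY g.symbol",
        (term,),
    )
    return [symbol for (symbol,) in rows]
```

`genes_with_term("DNA binding")` returns both placeholder genes, because the join follows the shared `gene_id` key from the annotation table back to the gene table.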
MGI’s primary mission is to facilitate the use of mouse as a model for human biology by providing integrated access to data on the genetics, genomics, and biology of the laboratory mouse.
Hermansky-Pudlak syndrome Mouse model & human phenotype
Information content spans from sequence to phenotype/disease
[Diagram labels: sequence; variants & polymorphisms; gene function; genome location; mouse/human orthologs & maps; strain genealogy; expression; tumors]
Database Resource: Mouse Genome Informatics (MGI)
MGI integrates genetic, genomic and phenotypic data
Integrate:
• Factor out common objects
• Assemble integrated objects
• Gather data from multiple sources
• Within MGI
  – Genes
  – Sequence
  – Expression
  – Literature
  – Alleles
  – Phenotypes
• Between MGI and others
  – Via shared sequence annotations… UniProt, EntrezGene, Ensembl
  – Via shared semantic representations… Drosophila, Arabidopsis, etc.
Annotation Pipeline
• Data Acquisition: Literature & Loads
• Object Identity: New Gene, Strain or Sequence?
• Standardizations: Controlled Vocabularies
• Data Associations: Evidence & Citation
• Integration with other bioinformatics resources: Co-curation of shared objects and concepts
Automated (mostly) Data Integration (Loads)
[Diagram: external data sources loaded into the MGI db]
• Load categories: Associations, Clones, Non-mouse, Gene models and coordinates, Sequences, Vocabularies, Annotation, SNP db
• Sources shown: RPCI, MGC, GenBank, RefSeq, UniProt, DFCI seq, DoTS seq, NIA seq, NCBI, VEGA, Ensembl, dbSNP, GO, MP, Anatomy, InterPro, OMIM, PIRSF, EG chimp, EG dog, EG rat, EG human, EG mouse, DFCI, DoTS, NIA, UniGene, TreeFam, Gene traps, microRNAs, UniSTS, HCOP, HomoloGene
Manual (mostly) annotation of the biomedical literature
> 12,000 / year
Data acquisition is constant
Load Program | Summary of Data Loaded
Mouse EntrezGene | EntrezGene IDs for mouse markers, plus marker-to-sequence associations from EntrezGene not already in MGD
Human/Rat EntrezGene | Nomenclature, map position and other data regarding human and rat genes. OMIM associations for human.
GenBank Seq | Mouse sequence records from GenBank
RefSeq Seq | Mouse sequence records from RefSeq
UniProt/TrEMBL Seq | Mouse sequence records from UniProt and TrEMBL
TIGR/DoTS/NIA Seq | Mouse consensus sequence records from TIGR/DoTS/NIA clusters
TIGR/DoTS/NIA Association | Associations between TIGR/DoTS/NIA cluster sequences and markers
Ensembl Gene Model | Ensembl gene model sequences, coordinates, & associations between these & markers
NCBI Gene Model | NCBI gene model sequences, coordinates, & associations between these & markers
UniProt Association | UniProt/TrEMBL IDs and additional GenBank IDs for mouse markers, plus GO and InterPro annotations
UniGene Association | UniGene cluster IDs for mouse markers
EST cDNA Clone | Mouse IMAGE, NIA, MGC, Riken, cDNAs and EST sequence associations
MGC Association | MGC IDs and associations between MGC full length sequences and MGC cDNAs
RPCI Clone | RPCI 23/24 BAC clones and sequence associations
GO Vocabulary | Updated Gene Ontology (GO) vocabularies from the central GO site
OMIM Vocabulary | Updated OMIM disease terms
MP Vocabulary | Updated MP vocabulary (from OBO-Edit)
Anatomy | Updated adult mouse anatomy ontology (from OBO-Edit)
Mapping panel | JAX, EUCIB, Copeland-Jenkins and many others
PIRSF | Mouse PIR superfamily terms and associations to markers
SNPs | Mouse SNPs from dbSNP and associations between SNPs & markers
Who is the authority?
Mouse data for which MGI serves as the authoritative source.
Data type | Working relationship
Gene Symbol/Name | MGD makes primary assignment; coordination with HGNC, RGNC
Allele Symbol/Name | MGD makes primary assignment
Strain Designations | MGD makes primary assignment
Gene-to-nucleotide sequence association | Co-curation with NCBI
Gene-to-protein sequence association | Co-curation with UniProt
Gene Ontology (GO) annotations | MGD provides primary data set
Mammalian Phenotype Ontology | MGD develops and applies vocabulary
Gene homology data between mouse & other species | MGD curated orthology relationships
Genotype-to-phenotype data | MGD provides primary curation
Mouse model-to-human disease (OMIM) | MGD provides primary curation
Snapshot of MGI data content
MGI data statistics, March 2010
Genes (including unmapped mutants) 36,290
Genes w/ nucleotide sequence 29,110
Genes w/ protein sequence 26,108
Genes annotated to GO (comprehensive) 25,644
Mouse/human orthologs 17,841
Mouse/rat orthologs 16,767
Targeted alleles; mutant alleles in mice 24,770; 23,866
Genes w/ phenotypic alleles; genes w/ targeted alleles 12,350; 10,340
Human diseases w/ one or more mouse model 999
QTL 4,404
References 150,341
mouse refSNPs 10,089,692
Having the data, we want to ask complex questions
Curators use controlled terms from structured vocabularies (ontologies) to annotate complex biological systems described in the literature
The knowledge is in the details
• Gene Nomenclature
• Gene/Marker Type
• Allele Type
• Assay Type
  – Expression
  – Mapping
• Molecular Mutation
• Inheritance Mode
• Tissue Types
• Cell Types
• Cell Lines
• Units
  – Cytogenetic
  – Molecular
• ES Cell Line
• Strain Nomenclature
Keyword lists standardize descriptions and enable comprehensive data retrieval
Keyword lists support data integration
• Sheer number of terms too much to remember and sort
  – Need standardized, stable, carefully defined terms
  – Need to describe different levels of detail
  – So… defined terms need to be related in a hierarchy
• With structured vocabularies/hierarchies
  – Parent/child relationships exist between terms
  – Increased depth -> Increased resolution
  – Can annotate data at appropriate level
  – May query at appropriate level
• All model organism databases and genome annotation systems have the same issues
Organogenesis
Blood vessel development
Angiogenesis
Vasculogenesis
Process terms
But, keyword lists are not enough
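What a hierarchy adds over a flat keyword list can be sketched as follows: annotate at the most specific level the evidence supports, then query at any broader level. The parent/child nesting below is an assumption read off the process terms listed on this slide, the gene names are hypothetical, and the single-parent dictionary is a simplification (GO terms can have several parents).

```python
# Child -> parent links (assumed nesting of the slide's process terms).
PARENT = {
    "angiogenesis": "blood vessel development",
    "vasculogenesis": "blood vessel development",
    "blood vessel development": "organogenesis",
}

# Hypothetical genes, each annotated at the level its evidence supports.
ANNOTATIONS = {
    "geneA": "angiogenesis",
    "geneB": "vasculogenesis",
    "geneC": "organogenesis",
}


def ancestors(term):
    """Yield the term itself, then each successively broader parent."""
    while term is not None:
        yield term
        term = PARENT.get(term)


def genes_for(query_term):
    """Genes annotated to the query term or to any more specific term below it."""
    return sorted(
        gene for gene, term in ANNOTATIONS.items()
        if query_term in ancestors(term)
    )
```

A query for "blood vessel development" finds geneA and geneB even though neither was annotated with that exact phrase, which a flat keyword match would miss.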
And so, we started the Gene Ontology (GO)
www.geneontology.org
• Formed to develop a shared language adequate for the annotation of molecular characteristics across organisms; a common language to share knowledge.
• Seeks to achieve a mutual understanding of the definition and meaning of any word used; thus we are able to support cross-database queries.
• Members agree to contribute gene product annotations and associated sequences to GO database; thus facilitating data analysis and semantic interoperability.
What is Ontology?
• Dictionary: A branch of metaphysics concerned with the nature and relations of being.
• Barry Smith: The science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality.
[Timeline labels: 1606, 1700s]
A biological ontology is:
• A (machine and human) interpretable representation of some aspect of biological reality
• what kinds of things exist?
• what are the relationships between these things?
[Diagram: terms eye, sclera, sense organ, optic placode; relations part_of, is_a, develops_from]
http://www.macula.org/anatomy/eyeframe.html
Gene Ontology: widely adopted
AgBase
GO represents selected molecular domains
• Molecular Function = elemental activity/task
  – the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
• Biological Process = biological goal or objective
  – broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
• Cellular Component = location or complex
  – subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
• Sequence Ontology = genome features
  – regions, attributes, variants; examples include exon, CpG island, and transgenic insertion
• Cell Ontology = cell types
  – Examples include photoreceptor cell and pillar cell
Biological Process
GO term: tricarboxylic acid cycle
Synonym: Krebs cycle
Synonym: citric acid cycle
GO id: GO:0006099
Cellular Component
GO term: mitochondrion
GO id: GO:0005739
Definition: A semiautonomous, self replicating organelle that occurs in varying numbers, shapes, and sizes in the cytoplasm of virtually all eukaryotic cells. It is notably the site of tissue respiration.
Molecular Function
GO term: malate dehydrogenase
GO id: GO:0030060
Reaction: (S)-malate + NAD(+) = oxaloacetate + NADH.
[Figure: structural diagram of the reaction, NAD+ to NADH + H+]
GO reflects biological knowledge for computers
Terms are defined graphically relative to other terms
Ontology Structure
Ontologies can be represented as graphs, where the nodes are connected by edges
• Nodes = terms in the ontology
• Edges = relationships between the concepts
[Diagram: three nodes connected by edges]
Ontological relations
• Types are related
• Network of terms forms a graph
  – Terms (nodes)
  – The edge type (relation) is important
• Two common relations:
  – is_a
  – part_of
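Why the edge type matters can be shown with a small typed graph. The triples below are an assumed reading of the eye diagram from the earlier slide (eye, sclera, sense organ, optic placode); the function names are hypothetical and this is a sketch, not how any particular ontology tool stores its graph.

```python
from collections import defaultdict

# (subject, relation, object) triples; assumed reading of the eye diagram.
EDGES = [
    ("eye", "is_a", "sense organ"),
    ("sclera", "part_of", "eye"),
    ("eye", "develops_from", "optic placode"),
]


def build_graph(edges):
    """Index outgoing edges by subject term, keeping the relation on each edge."""
    graph = defaultdict(list)
    for subject, relation, obj in edges:
        graph[subject].append((relation, obj))
    return graph


GRAPH = build_graph(EDGES)


def reachable(start, allowed_relations):
    """Terms reachable from `start` when following only the allowed edge types."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for relation, target in GRAPH.get(node, []):
            if relation in allowed_relations and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen
```

Following only `is_a` edges from "sclera" reaches nothing, because a sclera is not a kind of sense organ. Allowing `part_of` as well reaches "eye" and then "sense organ", which is the correct reading: the sclera is part of something that is a sense organ.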
[Diagram: types eyeball, cavitated organ, and organ linked by is_a; an instance linked to its type by instance_of]
Types (represented in the ontology)
Instances (NOT represented in the ontology)
Formal definition of is_a
• is_a holds between types
• X is_a Y holds if and only if:
  – Given any thing that instantiates X at some time, that thing also instantiates Y at the same time
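The same definition can be written symbolically. The predicate inst(x, X, t), read as "thing x instantiates type X at time t", is notation introduced here for illustration and does not appear in the slides.

```latex
\[
X \;\mathit{is\_a}\; Y
\iff
\forall x\, \forall t\; \bigl(\mathrm{inst}(x, X, t) \rightarrow \mathrm{inst}(x, Y, t)\bigr)
\]
```

One consequence of this form is that is_a is transitive: if X is_a Y and Y is_a Z, then anything instantiating X at time t also instantiates Z at time t.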
GO terms are used for functional annotations
I: Denotes an ‘is-a’ relationship
P: Denotes a ‘part-of’ relationship
Brain development [GO:0007420] (141 genes, 207 annotations)
Annotations are assertions
• There is evidence that this gene product can be best classified using this term
• The source of the evidence and other information is included
• There is agreement on the meaning of the term
Annotating Gene Products using GO
Gene Product: P05147
GO Term: GO:0047519
Evidence: IDA
Reference: PMID:2976880
Annotation: P05147 GO:0047519 IDA PMID:2976880
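The four-part annotation on this slide maps naturally onto a small record type. The sketch below parses the slide's whitespace-separated line; the `Annotation` class and `parse_annotation` function are hypothetical names, and real GO annotation files carry more columns than this simplified four-field form.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """One assertion: gene product, GO term, evidence code, and its source."""
    gene_product: str
    go_term: str
    evidence: str
    reference: str


def parse_annotation(line):
    """Parse 'GENE_PRODUCT GO_TERM EVIDENCE REFERENCE' into an Annotation."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    gene_product, go_term, evidence, reference = fields
    if not go_term.startswith("GO:"):
        raise ValueError(f"not a GO identifier: {go_term!r}")
    return Annotation(gene_product, go_term, evidence, reference)


# The example annotation shown on the slide.
EXAMPLE = parse_annotation("P05147 GO:0047519 IDA PMID:2976880")
```

Keeping the evidence code and reference on every record is what lets a later query ask not only which term a gene product has, but on what basis.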
Evidence codes describe the basis of the annotation
[Diagram labels: Direct experiment in organism; No direct experiment, inferred from evidence]
• IDA: Inferred from direct assay
• IPI: Inferred from physical interaction
• IMP: Inferred from mutant phenotype
• IGI: Inferred from genetic interaction
• IEP: Inferred from expression pattern
• IEA: Inferred from electronic annotation
• ISS: Inferred from sequence or structural similarity
• TAS: Traceable author statement
• NAS: Non-traceable author statement
• IC: Inferred by curator
• RCA: Reviewed Computational Analysis
• ND: No data available
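Evidence codes make it possible to filter annotations by how they were established. The sketch below treats IDA, IPI, IMP, IGI, and IEP as the "direct experiment" group, which follows GO's usual grouping of experimental codes but is an assumption about how this slide divides them. The reference and code pairs are the ones shown in the MGI GO example in these slides; the function names are hypothetical.

```python
# Codes assumed to fall under "direct experiment in organism".
EXPERIMENTAL = frozenset({"IDA", "IPI", "IMP", "IGI", "IEP"})

# (reference, evidence code) pairs from the MGI GO example slide.
ANNOTATIONS = [
    ("J:65378", "TAS"),
    ("J:62648", "IDA"),
    ("J:60000", "IEA"),
]


def is_experimental(code):
    """True if the evidence code denotes a direct experiment."""
    return code.upper() in EXPERIMENTAL


def experimental_only(annotations):
    """Keep only the annotations supported by a direct experiment."""
    return [(ref, code) for ref, code in annotations if is_experimental(code)]
```

Of the three example annotations, only the IDA one survives the filter; the author-statement (TAS) and electronic (IEA) annotations are set aside.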
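Vocabularies in MGI: GO Example
[Diagram: schema linking three groups of tables]
• Vocabulary: Terms (e.g., GO:54321) with Definition, Synonyms, and DAGs; example terms: Transcription factor, DNA binding, Protein binding, Ligand binding or carrier
• Annotations: reference plus evidence code, e.g., J:65378 TAS, J:62648 IDA, J:60000 IEA
• Genes: Name, Synonyms, accession ID (e.g., MGI:105043); example genes: Ahr, Edr2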
GO @ MGI (March, 2010)
34,315 genes, 75,933 annotations
[Figure labels: Acetyl-CoA, CoA-SH, Citrate synthase]
Function
34,517 genes, 65,513 annotations
Cellular Component
Biological Process
34,063 genes, 87,565 annotations
[Figure label: TCA Cycle]
Total Genes: 35,147
Total Annot.: 145,895
Total Papers: 8,985
Now we can query across all annotations based on shared biological activity.
Biomedical Ontologies in MGI
• GO: (function, process, cellular location)
• SO: (sequence features)
• PRO: (specific proteins by species/strain)
• MP: (phenotypes)
• Traits / Behavior /
• Anatomies / Homologies (morphology)
• DO: (diseases, not phenotypes; definitions not diagnoses)
• CL: (cells and their lineages)
• OBO Foundry (standards and status)
BioOntologies (GO) enable science
• Ontologies as terminology / classifications
• Ontologies enable data aggregation
• Ontologies used for data mining
• Ontologies used for statistical analysis