
Bioinformatics and Genome Annotation
Shane C Burgess
http://www.agbase.msstate.edu/

NIH WORKING DEFINITION OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY
July 17, 2000
Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

Biocomputing: computational biology & bioinformatics
Gene Ontology Consortium members

Dr Nan Wang
Dr Susan Bridges
Dr Divya Pedinti
Dr Fiona McCarthy
Dr Teresia Buza
Philippe Chouvarine
Cathy Grisham
Lakshmi Pillai

Sequencing is getting cheaper and biocomputing becomes more of an issue.
Cost of human or similar sized genome
Source: Richard Gibbs, Baylor College of Medicine

A. Complexity
1. Sequence itself and from all its compatriots and assorted microbes
2. SNPs
3. Transcripts (all of them…don’t forget alternative splicing, starts)
4. CNVs
5. Epigenetic changes to DNA
6. Proteome (expression, epigenetics, PTMs, location, flux, enzyme kinetics)
7. Metabolites
8. Phenotypes
9. Drugs
B. Statistical. 1. Multiple testing problem. 2. Search space
Both have potential computationally-intensive solutions (Monte Carlo/Resampling/ Permutation/Bootstrap and target/decoy).
C. Information: publications are no longer the sole source of “valid” or “legitimate” information.
Trusted databases and not just publications used as research sources; not just data but also community annotations etc
D. Biocomputing issues: LOCAL: storage, compute power (CPU days), RAM; DISTANT: linking, data movement, cyberinfrastructure (hard, soft and human).
E. How and who?
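As an illustration of point B above, here is a minimal sketch of a permutation test followed by a simple multiple-testing correction. The function names, the choice of the difference in means as the test statistic, and the Bonferroni correction are assumptions made for this example; the slides name the general approach only.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

The compute cost grows with the number of permutations times the number of gene products tested, which is why resampling appears on the slide as a computationally intensive solution.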

Titus Brown, Mich. SU

Putting Genomes in the Cloud. Making data sharing faster, easier and more scalable. By M. May, May 18, 2010.
Storage costs
A. Simple Storage Service (S3) e.g. Amazon. For the first 50 TB = 15 US cents/GB ($7,500/50 TB) plus pay for data transfer and operations.
VS
Buy, store and scale as needed e.g. Web Object Scaler (WOS)
Immediate or “longer” term solution

10 Gigabits (Gb)/second
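A rough worked example of the numbers on the two slides above, as a sketch only. It assumes decimal units (1 TB = 1,000 GB = 10^12 bytes) and a link that is fully used with no protocol overhead; pairing the 10 Gb/s figure with the 50 TB example is an assumption made here for illustration, and the function names are hypothetical.

```python
def s3_storage_cost_usd(terabytes, usd_per_gb=0.15, gb_per_tb=1000):
    """Storage cost at a flat per-GB price (transfer and operations excluded)."""
    return terabytes * gb_per_tb * usd_per_gb


def transfer_hours(terabytes, link_gigabits_per_s=10.0):
    """Hours needed to move the data over a fully used link."""
    bits = terabytes * 1e12 * 8          # bytes -> bits
    seconds = bits / (link_gigabits_per_s * 1e9)
    return seconds / 3600
```

Under these assumptions 50 TB costs $7,500 to store, matching the slide, and takes roughly 11.1 hours to move at 10 Gb/s.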

Annotation: Nomenclature, Structural & Functional
Structural Annotation:
• Open reading frames (ORFs) predicted during genome assembly
• predicted ORFs require experimental confirmation
Functional Annotation:
• annotation of gene products = Gene Ontology (GO) annotation
• initially, predicted ORFs have no functional literature and GO annotation relies on computational methods (rapid)
• functional literature exists for many genes/proteins prior to genome sequencing
• Gene Ontology annotation does not rely on a completed genome sequence
Nomenclature

Chicken Gene Nomenclature
• 1995: chicken gene nomenclature will follow HGNC guidelines
• 2007: chicken biocurators begin assigning standardized nomenclature
• 2008: first CGNC report; NCBI begins using standardized nomenclature & CGNC links
• 2010: first dedicated chicken gene nomenclature biocurator; NCBI/AgBase/Marcia Miller – structural annotation & nomenclature for MHC regions (chr 16)
• Chicken gene nomenclature database – UK & US databases sharing and co-ordinating data.
Livestock Gene Nomenclature: Jim Reecy et al., International Society for Animal Genetics, 26th – 30th July 2010, Edinburgh

http://edit-genenames.roslin.ac.uk/
Available via BirdBase & AgBase

Experimental Structural genome annotation
Proteogenomic mapping

Problems with Current Structural Annotation Methods
• EST evidence is biased for the ends of the genes
• Computational gene finding programs
– Misidentify some genes, especially short genes
– Overlook exons
– Incorrectly demarcate gene boundaries, especially splice junctions

Proteogenomic Mapping
• Combines genomic and proteomic data for structural annotation of genomes
• First reported by Jaffe et al. at Harvard in 2004 in bacteria
• McCarthy et al. 2006 first applied in chicken (one of the first uses in a eukaryote; the other two in human).
• Improves genome structural annotation based on expressed protein evidence
– Confirms existence of predicted protein-coding gene
– Identifies exons missed by gene finder
– Corrects incorrect boundaries of previously identified genes
– Identifies new genes that the gene finding programs missed

CCV genome was sequenced in 1992
But only 12 of predicted 76 ORFs confirmed to exist as proteins.
Confirmed 37/76.
Identified 17 novel ORFs that were not predicted.


Structural Annotation of the Chicken Genome
• Location of genes on the genome
• Computational gene finding programs such as Gnomon (NCBI) are based on Markov models and also use
– ESTs
– Known proteins
– Sequence conservation

ePST Generation Process
chromosome
Map peptide nucleotide sequence to chromosome
Peptide nucleotide sequence

Proteogenomic mapping workflow (flow diagram)
• Biological Sample
• Trypsin Digestion
• LC ESI-MS/MS Data
• Search against protein database: peptide matches confirm predicted protein-coding gene
• Search against genome translated in 6 reading frames: peptide matches
• Generate ePSTs (expressed Peptide Sequence Tags) from peptides matching genome only
• Novel protein-coding gene; correction / validation of genome annotation
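The "genome translated in 6 reading frames" step in the workflow above can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the standard genetic code, a plain substring match in place of a real MS/MS search engine, and hypothetical function names.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return the translation of all three frames on both strands."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def find_peptide(dna, peptide):
    """Frames whose translation contains the peptide (toy exact match)."""
    return [f for f, protein in six_frame_translation(dna).items() if peptide in protein]
```

For example, `find_peptide("ATGAAAGGGCCC", "PFH")` reports a match on the reverse strand only, which is the kind of hit a protein-database search would miss if no gene model exists there.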

ePST Generation Process
• Locate the first downstream in-frame stop codon or canonical splice junction
• Locate the upstream canonical splice junction or in-frame stop codon
• Find the first start codon between the in-frame stop and the peptide
• Use the splice junction or in-frame start as the beginning of the ePST
• Translate the ePST coding nucleotide sequence into the Expressed Peptide Sequence Tag (ePST) amino acid sequence
[Diagram labels: chromosome; peptide nucleotide sequence; stop codon; start codon]
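The ePST steps above can be sketched in code. This is a simplified illustration, not the published pipeline: it handles the forward strand only, ignores splice junctions entirely, and falls back to the codon after the upstream stop when no start codon is found. Those simplifications and all function names are assumptions made for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}


def translate(dna):
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def generate_epst(chromosome, pep_start, pep_end):
    """Extend a mapped peptide to an ePST.

    pep_start and pep_end are 0-based, end-exclusive coordinates of the
    peptide's nucleotide sequence; its length must be a multiple of 3.
    """
    # Step 1: walk downstream, codon by codon, to the first in-frame stop.
    end = pep_end
    while end + 3 <= len(chromosome) and chromosome[end:end + 3] not in STOPS:
        end += 3
    # Step 2: walk upstream to the nearest in-frame stop (or the sequence start).
    pos = pep_start
    while pos - 3 >= 0 and chromosome[pos - 3:pos] not in STOPS:
        pos -= 3
    # Step 3: first ATG between the upstream stop and the peptide.
    start = pos  # assumed fallback when no start codon is present
    for i in range(pos, pep_start + 1, 3):
        if chromosome[i:i + 3] == "ATG":
            start = i
            break
    # Step 4: translate the ePST coding sequence (stop codon excluded).
    coding = chromosome[start:end]
    return coding, translate(coding)
```

For the toy chromosome `"CCTAAGCTATGAAAGGGCCCTGATT"` with a peptide mapped to positions 14 to 17 (`GGG`), the sketch returns the coding sequence `ATGAAAGGGCCC` and the ePST `MKGP`.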





Functional annotation

[Figure: two plots. First: “No.” (0 to 25,000) by year, ’00 to ’09. Second: “No. x 10^6” (0 to 18) by year, ’70 to ’05.]

Ontologies; Canonical and other Networks
GO Cellular Component
GO Biological Process
GO Molecular Function
BRENDA
Pathway Studio 5.0
Ingenuity Pathway Analyses
Cytoscape
Interactome Databases
Functional Understanding

Physiology (= Cellular Component + Biological Process + Molecular Function)
Gene Ontology Network Modeling
Biological interpretation
Implied; Derived

What is the Gene Ontology?
“a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing”
• the de facto standard for functional annotation
• assigns functions to gene products at different levels, depending on how much is known about a gene product
• is used for a diverse range of species
• structured to be queried at different levels, e.g.:
– find all the chicken gene products in the genome that are involved in signal transduction
– zoom in on all the receptor tyrosine kinases
• human-readable GO function has a digital tag to allow computational analysis of large datasets
COMPUTATIONALLY AMENABLE ENCYCLOPEDIA OF GENE FUNCTIONS AND THEIR RELATIONSHIPS

GO is the “encyclopedia” of gene functions, captured, coded and put into a directed acyclic graph (DAG) structure.
In other words, by collecting all of the known data about gene product biological processes, molecular functions and cell locations, GO has become the master “cheat-sheet” for our total knowledge of the genetic basis of phenotype.
Because every GO annotation term has a unique digital code, we can use computers to mine the GO DAGs for granular functional information.
Instead of ploughing through thousands of papers at the library, making notes, and then deciding what the differential gene expression from your microarray experiment means as a net effect, the aim is for GO to capture all the biological information so that it can be retrieved, compiled with your quantitative gene product expression data, and summarized as a net effect.
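To make "mining the DAG" concrete, here is a minimal sketch of querying a toy ontology at different levels. The term names, gene names and the tiny graph are hypothetical stand-ins, not real GO identifiers or annotations.

```python
# Toy DAG: each child term lists its parent terms.
# Term names are hypothetical labels, not real GO identifiers.
PARENTS = {
    "biological_process": [],
    "signal_transduction": ["biological_process"],
    "cell_communication": ["biological_process"],
    "rtk_signaling": ["signal_transduction", "cell_communication"],
    "metabolic_process": ["biological_process"],
}

# Hypothetical annotations: gene product -> directly annotated terms.
ANNOTATIONS = {
    "geneA": ["rtk_signaling"],
    "geneB": ["signal_transduction"],
    "geneC": ["metabolic_process"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def gene_products_for(term, annotations=ANNOTATIONS, parents=PARENTS):
    """Gene products annotated to `term` or to any of its descendants."""
    hits = []
    for gene, terms in annotations.items():
        if any(t == term or term in ancestors(t, parents) for t in terms):
            hits.append(gene)
    return sorted(hits)
```

Querying the broad term returns both signalling genes, while querying the more specific child term "zooms in" on `geneA` alone. A term may have more than one parent, which is what makes the structure a DAG and not a tree.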

Use GO for:
1. Determining which classes of gene products are over-represented or under-represented.
2. Grouping gene products.
3. Relating a protein’s location to its function.
4. Focusing on particular biological pathways and functions (hypothesis-testing).
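Use 1 above (over-representation) is commonly framed as a hypergeometric test. The sketch below assumes that framing, which the slide does not specify, and the function name and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, annotated_in_sample):
    """One-sided hypergeometric p-value for over-representation.

    population: gene products on the array or in the genome
    annotated: how many of those carry the GO term
    sample: size of the differentially expressed list
    annotated_in_sample: how many in the list carry the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(annotated_in_sample, upper + 1)
    )
    return tail / total
```

With thousands of GO terms tested at once, the resulting p-values still need a multiple-testing correction before interpretation.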


“GO Slim”
Many people use “GO Slims”, which capture only high-level terms; these are more often than not extremely poorly informative and not suitable for hypothesis-testing.
In contrast, we need to use the deep, granular, information-rich data suitable for hypothesis-testing.

Sourcing and displaying GO annotations: secondary and tertiary sources.


GO Consortium: Reference Genome Project
• Limited resources to GO annotate gene products for every genome
– rely on computational GO annotations
– most robust method is to transfer GO between orthologs
• Reference genome project: goal is to produce a “gold standard” manually biocurated GO annotation dataset for orthologous genes
– 12 reference genomes; chicken is the only agricultural species
– Chicken RGP contributions provided via USDA CSREES MISV-329140
http://www.geneontology.org/GO.refgenome.shtml

RGP & Taxonomy checks
• Transferring GO annotation between orthologs requires:
– determining orthologs: computational prediction followed by manual curation
– developing ‘sanity’ checks to ensure transferred functions make sense phylogenetically (e.g. no lactating chickens!)
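A minimal sketch of such a phylogenetic ‘sanity’ check when transferring annotations between orthologs. The lineage table and the constraint table are hypothetical and heavily simplified; they are not the GO Consortium's actual taxon constraint files.

```python
# Hypothetical, simplified lineages (species -> taxa that contain it).
LINEAGE = {
    "Gallus gallus": {"Gallus gallus", "Aves", "Vertebrata"},
    "Bos taurus": {"Bos taurus", "Mammalia", "Vertebrata"},
}

# Hypothetical constraint table: term -> the only taxon where it is allowed.
ONLY_IN_TAXON = {"lactation": "Mammalia"}


def passes_taxon_check(term, species):
    """True unless the term is restricted to a taxon the species is outside."""
    required = ONLY_IN_TAXON.get(term)
    return required is None or required in LINEAGE[species]


def transfer_annotations(ortholog_terms, target_species):
    """Split transferred terms into accepted and flagged-for-review lists."""
    accepted = [t for t in ortholog_terms if passes_taxon_check(t, target_species)]
    flagged = [t for t in ortholog_terms if not passes_taxon_check(t, target_species)]
    return accepted, flagged
```

Transferring `["lactation", "signal transduction"]` from a cow ortholog to chicken flags `lactation` for a biocurator and accepts the rest.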

Further taxon checking comments may be added here, or contact the AgBase database.

AgBase Quality Checks & Releases
[Flow diagram: AgBase Biocurators → AgBase biocuration interface → AgBase database → EBI GOA Project → GO Consortium database, with a ‘sanity’ check and GOC QC between stages]
• AgBase database: GO analysis tools; Microarray developers
• EBI GOA Project: UniProt db; QuickGO browser; GO analysis tools; Microarray developers
• GO Consortium database: Public databases; AmiGO browser; GO analysis tools; Microarray developers
‘sanity’ check: checks to ensure all appropriate information is captured, no obsolete GO:IDs are used, etc.

Comparing AgBase & EBI-GOA Annotations
[Figure: bar chart of gene products annotated (0 to 14,000) by project (AgBase Chick, EBI-GOA Chick, AgBase Cow, EBI-GOA Cow), split by annotation type: computational; manual - sequence; manual - literature]
Complementary to EBI-GOA: GenBank proteins not represented in UniProt & EST sequences on arrays

Contribution to GO Literature Biocuration
Contributors: AgBase; EBI GOA; EBI-IntAct; Roslin; HGNC; UCL-Heart project; MGI; Reactome
[Figure: pie charts for Chicken and Cow; labelled shares: 97.82%, 88.78%, < 1.50%, < 0.50%]

AgBase GO annotation and modeling workflow (flow diagram)
INPUT: functional genomics data (e.g. microarray data)
Tools shown: ArrayIDer; GORetriever; GOanna; GOanna2ga; GA2GEO; GAQ Score; GOSlimViewer; GOModeler
• GORetriever: gene products with GO annotations; gene products with NO GO annotations
• GOanna (BLAST output): gene products with orthologs and GO annotations; gene products with NO orthologs OR with orthologs but NO GO annotations
• Manual interpretation of GOanna output
• Biocuration from literature: biocurated annotations from literature or specialist knowledge
• NO literature or specialist knowledge that can be used to make GO annotations: must wait on experimental evidence or new electronic inference
• Result: comprehensive GO annotation
• Data visualization
– Generic: qualitative data presentation. Analysis can only be changed if user has programming skills
– Specific: user-defined, hypothesis-driven, quantitative data presentation
– (existing GO analysis programs)

To request a workshop contact Fiona McCarthy
2010 GO Training Opportunities
- on site training by request/interest
- webinar: notification via ANGENMAP & GO discussion groups

GO training Workshop Surveys
[Figure: % of respondents answering strongly agree, agree, uncertain, disagree or strongly disagree to each statement]
• Topics covered were relevant
• Topics were well explained
• I am confident in using GO for modeling
• I am confident I can get GO questions answered
• I would recommend this workshop
[Figure: No. of people (0 to 200) by year workshops offered, 2007 to 2009; annual and cumulative]
2009 Workshop hosts:
ISU – Dr Susan Lamont
NCSU – Dr Hsiao-Ching Liu


Chicken Array Usage
Number of participants: 25
Number of arrays: 22
Number of votes: 41
[Figure labels: ARK-Genomics; Affymetrix; Agilent 44K array; UD 7.4K Metabolic/Somatic; UD_Liver_3.2K; Arizona 20.7K; Neuroendocrine]
Bovine array usage
Number of participants: 26
Number of arrays: 26
Number of votes: 42
[Figure labels: UIUC 13.2K; Affymetrix; UIUC 7,872-element; Bovine Total Leukocyte cDNA; Agilent 44k]
Quality improvement Microarray annotations

• Most microarray analysis tools do not readily accept EST clone names (abundant on arrays).
• Manual re-annotation of microarrays is impracticable.
• ArrayIDer retrieves the most recent accession mapping files from public databases based on EST clone names or accessions and rapidly generates database accessions.
• Example: Fred Hutchinson Cancer Research Centre 13K chicken cDNA array
• Structurally re-annotated 55% of the array; decreased non-chicken functional annotations 2-fold; identified 290 pseudogenes, 66 of which were previously incorrectly annotated.
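The core of this kind of re-annotation is a lookup from EST clone names to current database accessions. The sketch below assumes a two-column, tab-delimited mapping file and case-insensitive clone names; the file layout, function names and identifiers are hypothetical and do not describe the actual tool's formats.

```python
import csv
import io


def load_clone_map(handle):
    """Read a two-column, tab-delimited file: clone name, accession."""
    return {
        row[0].strip().upper(): row[1].strip()
        for row in csv.reader(handle, delimiter="\t")
        if len(row) >= 2
    }


def reannotate(array_clone_names, clone_map):
    """Map array clone names to accessions; collect names with no match."""
    mapped, unmapped = {}, []
    for name in array_clone_names:
        accession = clone_map.get(name.strip().upper())
        if accession:
            mapped[name] = accession
        else:
            unmapped.append(name)
    return mapped, unmapped


# Hypothetical mapping content, inlined so the example is self-contained.
EXAMPLE_MAPPING = io.StringIO("cloneA\tACC001\ncloneB\tACC002\n")
```

Refreshing the mapping file and re-running the lookup is what makes repeated re-annotation practical where manual curation of every spot is not.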


Zhou H, Lamont SJ: Global gene expression profile after Salmonella enterica Serovar enteritidis challenge in two F8 advanced intercross chicken lines. Cytogenet Genome Res 2007;117:131-138 (DOI: 10.1159/000103173)




1. Increased the pathway coverage of several major immune response pathways and provided more comprehensive modelling of signalling pathways, e.g. FAS: originally not annotated, but pathways involving FAS are now identified.
2. Confirmed and consolidated previous suggestions that CD3, IL-1β, and CCL5 differential expression is involved in the immune response to SE. Chicken-specific functional annotation of these genes allowed identification of their related pathways with statistical confidence.
3. Identified additional genes involved in major immune pathways important in bacterial gut disease but not identified in the original work, e.g. tyrosine phosphatase type IVA member 1 (PTP4A1); CD28; T-cell co-stimulator (ICOS, CD278); and NK-lysin and associated pathway genes.

Bacterial functional genomic responses to structural differences in explosive compounds.
KTR9 and V. fischeri proteomics

Quantifying re-annotation
Metrics
• Granularity: # previous annotations; # re-annotations
• Specificity: # chicken annotations; # human/mouse annotations
• Quality: Gene Ontology Annotation Quality (GAQ) score
[Figure: mean GAQ score]
DoD: Bobwhite Quail Toxicogenomics
• Reads in annotated gene regions + 20 kb radius
• Reads in “RNAFAR” regions, i.e. clustered reads forming novel transcripts (these reads do not belong to any gene model in the reference set and can either be assigned to neighboring gene models, if they are within a specified threshold radius, or assigned their own predicted transcript model)
• Repeats with > 10 alignments
• Reads overlapping annotated repeat regions
• Unmapped reads
• Other (regulatory, etc.; does not include reads discarded as poor quality).


GO Cellular Component DAG

Differential Detergent Fractionation
[Figure: DDF fractions 1 to 4]
2007. Non-electrophoretic differential detergent fractionation proteomics using frozen whole organs. Rapid Commun Mass Spectrom 21:3905-9.
2007. Sequential detergent extraction prior to mass spectrometry analysis. Methods in Molecular Medicine: Proteomic analysis of membrane proteins. Humana Press. 117 (1-4):278-87.
2005. Differential detergent fractionation for non-electrophoretic eukaryote cell proteomics. Journal of Proteome Research. 4 (2), 316-324.

Sub-cellular localization of pro-PCD proteins. One mechanism controlling PCD is the release of “pro-death” proteins from mitochondria into the cytoplasm or nucleus.
[Figure: B-cells and stroma; proteins shown: CytC, Apaf1, AMID, EndoG, AIF, Smac; compartment labels: N, M, C]

[Figure: protein and mRNA panels for IL-2, IL-4, IL-6, IL-8, IL-10, IL-12, IL-13, IL-18, IFN, TGF, CTLA-4, GPR-83 and SMAD-7; y-axis: neoplastic compared to hyperplastic lymphoma cells (%)]

Shack et al., Cancer Immunology and Immunotherapy, 2008. 57:1253-62
IL-18 distribution: it matters where proteins are
[Figure: IL-18 by DDF fraction (1 to 4) in neoplastic lymphocytes (T-reg) and hyperplastic lymphocytes; panel labels: Extracellular, Nuclear]


Pig
Total mRNA and protein expression was measured from quadruplicate samples of control, electroscalpel-treated and harmonic scalpel-treated tissue.
Differentially-expressed mRNAs and proteins were identified using Monte-Carlo resampling (1).
Using network and pathway analysis as well as Gene Ontology-based hypothesis testing, differences in specific physiological processes between electroscalpel-treated and harmonic scalpel-treated tissue were quantified and reported as net effects.
Translation to clinical research
(1) Nanduri, B., P. Shah, M. Ramkumar, E. A. Allen, E. Swiatlo, S. C. Burgess*, and M. L. Lawrence*. 2008. Quantitative analysis of Streptococcus pneumoniae TIGR4 response to in vitro iron restriction by 2-D LC ESI MS/MS. Proteomics 8, 2104-14.
Bindu Nanduri


Proportional distribution of protein functions differentially-expressed by Electroscalpel and Harmonic Scalpel
Electroscalpel: total differentially-expressed proteins: 509
Harmonic Scalpel: total differentially-expressed proteins: 433
HYPOTHESIS TERMS: immunity (primarily innate); inflammation; wound healing; lipid metabolism; response to thermal injury; angiogenesis; hemorrhage

Net functional distribution of differentially-expressed proteins
[Figure: relative bias toward Electroscalpel or Harmonic Scalpel (axis 8 to 0 to 6) for each hypothesis term: immunity (primarily innate); classical inflammation (heat, redness, swelling, pain, loss of function); wound healing; lipid metabolism; response to thermal injury; angiogenesis; sensory response to pain; hemorrhage]
