
Eukaryotic Genome Annotation
Asaf Salamov, Fungal Genomics Program
US DOE Joint Genome Institute, Walnut Creek, CA

2
Started with Human Genome Project

3
genome.jgi.doe.gov
IMG
MycoCosm
150+ annotated eukaryotic genomes

4
Annotation Pipeline
Input: genomic assembly and ESTs
Annotation:
• Gene predictions
• Protein annotations
• Reference data mapping
• Repeat masking
• Manual curation (optional)
Validations
Analysis: gene families, gene expression, phylogenomics, proteomics, protein targeting, etc.

5
Eukaryotic Gene Prediction
• Ab initio methods use knowledge of known genes' structures to predict start, stop, and splice sites in CDS only (Fgenesh, GeneMark). Train on known genes, then predict a model.
• Protein-based methods build CDS exons around known protein alignments, for example to a GenBank protein (Fgenesh+, GeneWise).
• Transcript-based methods map or assemble transcripts on the genome, including UTRs (EST_map, CombEST). Input: EST contigs; output: a predicted model.
Gene model (diagram): promoter, 5' UTR, exons and introns between the start codon (ATG) and the stop codon (TGA), introns bounded by GT ... AG splice sites, 3' UTR, polyA.
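
The sequence signals named on this slide (ATG start, stop codon, GT ... AG intron boundaries) can be checked mechanically on any candidate model. The following is a minimal sketch only, not how Fgenesh or GeneMark work internally: it assumes a forward-strand model, 0-based end-exclusive exon coordinates, and a toy sequence made up for illustration. The function name check_gene_model is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene_model(genome, exons):
    """Check a forward-strand gene model given as sorted (start, end) CDS exons.

    Coordinates are 0-based and end-exclusive. Returns a list of problems
    found; an empty list means all the simple checks passed.
    """
    problems = []
    cds = "".join(genome[start:end] for start, end in exons)
    if not cds.startswith("ATG"):
        problems.append("no ATG start codon")
    if cds[-3:] not in STOP_CODONS:
        problems.append("no stop codon at end of CDS")
    if len(cds) % 3 != 0:
        problems.append("CDS length is not a multiple of 3")
    # Each intron is the gap between two consecutive exons.
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        intron = genome[left_end:right_start]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            problems.append(f"non-canonical intron at {left_end}-{right_start}")
    return problems


# Toy sequence (assumption): flank, exon 1, intron (GT...AG), exon 2, flank.
genome = "CC" + "ATGGCC" + "GTAAGTTTCAG" + "AAATGA" + "GG"
exons = [(2, 8), (19, 25)]
print(check_gene_model(genome, exons))
```

Real predictors score these signals probabilistically rather than requiring exact matches; the hard checks here only illustrate what the signals are.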

6
More Gene Prediction
Use ESTs/cDNAs to extend, correct or predict gene models
• ESTEXT
(Diagram: a predicted model, ATG to TGA, combined with overlapping ESTs gives an extended model with 5' UTR and 3' UTR.)
Detect orthologs with poor alignments and refine with synteny-based methods
• FGENESH2
(Diagram: Genome A aligned against Genome B.)
Representative set
A non-redundant gene set is built from "the best" models at each locus (FGENESH, GENEWISE, external models) according to homology and ESTs, followed by manual curation.
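
Choosing "the best" model per locus can be pictured as a scoring problem. The sketch below is an assumption-laden illustration, not the JGI pipeline's actual selection rule: the score is a plain weighted sum of a homology-support fraction and an EST-support fraction, and the model names, locus IDs and weights are all hypothetical.

```python
def pick_representatives(models, w_homology=1.0, w_est=1.0):
    """Keep the highest-scoring model at each locus.

    models: iterable of dicts with keys 'locus', 'name', 'homology', 'est',
    where 'homology' and 'est' are support fractions in [0, 1].
    Returns a dict mapping locus -> name of the selected model.
    """
    best = {}
    for m in models:
        score = w_homology * m["homology"] + w_est * m["est"]
        current = best.get(m["locus"])
        # Replace the stored model only if this one scores strictly higher.
        if current is None or score > current[0]:
            best[m["locus"]] = (score, m["name"])
    return {locus: name for locus, (_, name) in best.items()}


# Hypothetical candidate models from different predictors.
models = [
    {"locus": "locus1", "name": "fgenesh_1", "homology": 0.9, "est": 0.5},
    {"locus": "locus1", "name": "genewise_1", "homology": 0.8, "est": 0.9},
    {"locus": "locus2", "name": "external_7", "homology": 0.2, "est": 0.1},
]
print(pick_representatives(models))
```

A locus with a single candidate keeps it regardless of score, which is why the slide follows automatic selection with manual curation.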

7
Combine Gene Predictors for Better Quality

                                                       Eugene    Genemark  Fgenesh   JGI Pipe
Number of gene models                                  11,547    9,609     8,409     12,270
Models with partial EST support                        5,544     3,829     4,567     5,248
  with full-length EST support                         2,538     1,182     2,896     3,073
  EST coverage per gene                                77.7%     68.2%     80.8%     79.1%
  supported splice sites                               41,581    40,808    45,498    47,671
Models with homology support                           6,758     6,043     5,750     7,214
  with strong homology support (80+% ide, 80+% cov.)   112       109       174       187
  model coverage                                       64%       60%       68%       69%
Models with homology and EST support                   2,894     2,172     2,720     2,953

Heterobasidion annosum v1.0
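
Because the four predictors return different numbers of models, the raw counts in this table are easier to compare as a share of each predictor's total. A small sketch of that arithmetic, using only the counts from the slide (the variable and function names are illustrative):

```python
# Counts copied from the Heterobasidion annosum v1.0 comparison table.
table = {
    "Eugene":   {"models": 11547, "homology": 6758, "homology_and_est": 2894},
    "Genemark": {"models": 9609,  "homology": 6043, "homology_and_est": 2172},
    "Fgenesh":  {"models": 8409,  "homology": 5750, "homology_and_est": 2720},
    "JGI Pipe": {"models": 12270, "homology": 7214, "homology_and_est": 2953},
}


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


for name, row in table.items():
    hom = percent(row["homology"], row["models"])
    both = percent(row["homology_and_est"], row["models"])
    print(f"{name:10s} homology {hom:5.1f}%   homology+EST {both:5.1f}%")
```

Shares and absolute counts answer different questions: a predictor can have a high supported share simply because it calls fewer models, so both views are worth reading together.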

8
Re-annotation Using Comparative Genomics

                                  MAKER     JGI pipeline  Re-annot
# of predicted gene models        9,940     12,290        12,802
  with Swissprot hits             6,521     7,356         7,900
  with non-repeat PFAM domains    5,365     6,010         6,353
  with EST support                9,252     10,796        11,105
  with >90% EST support           7,729     9,178         9,444
# of unique PFAM domains          2,207     2,245         2,322
EST coverage per gene             93.0%     93.3%         93.3%
# EST-supported splice sites      99,627    102,200       104,246
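
The slides do not define "EST coverage per gene". One plausible reading, assumed here, is the fraction of a model's exon bases overlapped by at least one EST alignment block. The sketch below computes that for a single model with 0-based, end-exclusive intervals; the coordinates are made up for illustration.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def est_coverage(exons, est_blocks):
    """Fraction of exon bases overlapped by at least one EST alignment block."""
    exons = merge_intervals(exons)
    ests = merge_intervals(est_blocks)
    exon_len = sum(end - start for start, end in exons)
    covered = 0
    # Both lists are non-overlapping after merging, so pairwise overlaps
    # can be summed without double counting.
    for xs, xe in exons:
        for es, ee in ests:
            covered += max(0, min(xe, ee) - max(xs, es))
    return covered / exon_len if exon_len else 0.0


# Hypothetical two-exon model and three EST alignment blocks.
exons = [(100, 200), (300, 400)]
est_blocks = [(150, 250), (180, 320), (390, 500)]
print(est_coverage(exons, est_blocks))
```

Averaging this value over all EST-supported models would give a per-gene figure comparable in spirit to the table row, under the stated assumption about its definition.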

9
Validation with Transcriptomics
Transformation of EST sequencing
(Chart: share of genes supported by ESTs versus other genes, 0% to 100%, for Sanger, 454 and Illumina data. Data labels as extracted: 5531, 34.)
Good Old Sanger Days
(Diagram: models and ESTs.)
Processing RNA-Seq with CombEST
(Diagram: EST profile.)

10
Validation with Proteomics
Wright et al, BMC Genomics (2009)

11
Protein Annotation
Predicted protein:
• Signal peptide (SignalP)
• Domain (InterPro, TMHMM)
• Possible orthologs (in nr, SwissProt, KEGG, KOG)
• Possible paralogs (Blastp+MCL)
Higher order assignments:
• Gene Ontology terms
• EC numbers --> KEGG pathways
• Gene families, with and without other species
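
Blastp+MCL means clustering proteins by Markov clustering over a graph of Blastp hits. The following is a minimal pure-Python sketch of the MCL iteration (expansion followed by inflation), not the mcl program itself; the similarity weights are placeholders for whatever Blastp-derived score is used, and the inflation value and thresholds are assumptions.

```python
def normalize_columns(m):
    """Scale each column of a square matrix so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def mcl(similarity, inflation=2.0, iterations=50):
    """Minimal Markov clustering on a symmetric similarity matrix."""
    n = len(similarity)
    # Add self-loops, then turn the matrix into column-stochastic form.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power and renormalize columns,
        # which strengthens strong flows and weakens weak ones.
        m = normalize_columns([[v ** inflation for v in row] for row in m])
    clusters = set()
    for i in range(n):
        if m[i][i] > 1e-6:  # row i belongs to an attractor
            clusters.add(frozenset(j for j in range(n) if m[i][j] > 1e-6))
    return sorted(sorted(c) for c in clusters)


# Hypothetical hits: proteins 0-2 hit each other, proteins 3-4 hit each other.
hits = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
print(mcl(hits))
```

Each resulting cluster is a candidate gene family; a higher inflation value generally yields smaller, tighter clusters.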

12
Gene Cluster Analysis
Comparative analysis

13
Genome Portal Framework

14
Daphnia pulex – Environmental Model
• First crustacean, aquatic animal sequenced
• New model organism
• Compact genome (~200 Mb)
• Largest gene count (~31,000)
• Unknown genes most responsive to environmental changes
Colbourne et al, Science, Feb 4, 2011

15
Half of Daphnia Genes: No Homologs, Expressed Under Environmental Stress
With Evgeny Zdobnov’s group (Univ. Genève)
* Of 716 highly conserved single copy orthologs, Daphnia is missing only two
Colbourne et al, 2011