Eukaryotic Genome Annotation

Asaf Salamov, Fungal Genomics Program, US DOE Joint Genome Institute, Walnut Creek, CA

Post on 18-Feb-2016

TRANSCRIPT

Page 1: Eukaryotic Genome Annotation

Eukaryotic Genome Annotation

Asaf Salamov, Fungal Genomics Program

US DOE Joint Genome Institute, Walnut Creek, CA

Page 2: Eukaryotic Genome Annotation

Started with Human Genome Project

Page 3: Eukaryotic Genome Annotation

genome.jgi.doe.gov

IMG

MycoCosm

150+ annotated eukaryotic genomes

Page 4: Eukaryotic Genome Annotation

Annotation Pipeline

Genomic assembly and ESTs

Annotation:
• Gene predictions
• Protein annotations
• Reference data mapping
• Repeat masking
• Validations
• Manual curation (optional)

Analysis:
• Gene families
• Gene expression
• Phylogenomics
• Proteomics
• Protein targeting
• etc.

Page 5: Eukaryotic Genome Annotation

Eukaryotic Gene Prediction

Protein-based methods build CDS exons around known protein alignments (Fgenesh+, GeneWise).

Transcript-based methods map or assemble transcripts on the genome, including UTRs (EST_map, CombEST).

Ab initio methods use knowledge of known genes’ structures to predict start, stop, and splice sites in CDS only (Fgenesh, GeneMark).

[Figure: gene model with promoter, 5’UTR, ATG start codon, exons and introns bounded by GT/AG splice sites, TGA stop codon, 3’UTR, and polyA site. Protein-based and transcript-based methods predict a model from a GenBank protein or an EST contig; ab initio methods train on known genes.]
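The signals named on this slide (ATG start, stop codon, GT/AG intron boundaries) can be checked on a candidate model with a few lines of Python. This is only a minimal sketch of a sanity check, not how Fgenesh or GeneMark work internally. The function name `check_gene_model`, the 0-based end-exclusive coordinates, and the forward-strand-only restriction are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene_model(genome, exons):
    """Check canonical signals of a forward-strand gene model.

    genome: DNA string.
    exons:  sorted list of (start, end) CDS exon intervals,
            0-based and end-exclusive (an assumed convention).
    Returns a list of problems; an empty list means all checks passed.
    """
    problems = []
    cds = "".join(genome[s:e] for s, e in exons).upper()

    if not cds.startswith("ATG"):
        problems.append("CDS does not start with ATG")
    if cds[-3:] not in STOP_CODONS:
        problems.append("CDS does not end with a stop codon")
    if len(cds) % 3 != 0:
        problems.append("CDS length is not a multiple of 3")

    # Look for a premature in-frame stop (the final codon is excluded).
    for i in range(0, len(cds) - 3, 3):
        if cds[i:i + 3] in STOP_CODONS:
            problems.append(f"in-frame stop codon at CDS position {i}")
            break

    # Introns are the gaps between consecutive exons.
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        intron = genome[prev_end:next_start].upper()
        if not (intron.startswith("GT") and intron.endswith("AG")):
            problems.append(f"non-canonical intron at {prev_end}-{next_start}")

    return problems


# Toy sequence: flank + exon 1 + intron (GT...AG) + exon 2 + flank.
genome = "CC" + "ATGGCC" + "GTAAGTTTCAG" + "AAATGA" + "GG"
good_model = [(2, 8), (19, 25)]
bad_model = [(2, 8), (20, 25)]  # second exon starts one base too late
```

Calling `check_gene_model(genome, good_model)` returns an empty list, while the shifted `bad_model` is flagged for a broken reading frame and a non-canonical intron.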

Page 6: Eukaryotic Genome Annotation

More Gene Prediction

Use ESTs/cDNAs to extend, correct or predict gene models
• ESTEXT

[Figure: a predicted model (ATG to TGA) and aligned ESTs yield an extended model with 5’UTR and 3’UTR]

Detect orthologs with poor alignments and refine with synteny-based methods
• FGENESH2

[Figure: Genome A compared with Genome B]

Representative set

[Figure: FGENESH, GENEWISE, and external models feed the representative set]

Non-redundant gene set is built from “the best” models from each locus according to homology and ESTs, followed by manual curation
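One way to picture the "best model per locus" step is the sketch below. The slides do not give the actual selection rule, so everything here is an assumption for illustration: the `Model` record, the 0 to 1 `homology` and `est` support values, the equal-weight `score`, and the definition of a locus as a run of overlapping models (strand is ignored).

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    start: int
    end: int
    homology: float  # hypothetical homology support, 0..1
    est: float       # hypothetical EST support, 0..1


def score(model):
    # Assumed equal weighting of the two evidence types.
    return model.homology + model.est


def representative_set(models):
    """Group overlapping models into loci and keep the best-scoring one."""
    loci = []  # each locus: {"end": rightmost end so far, "models": [...]}
    for m in sorted(models, key=lambda m: m.start):
        if loci and m.start < loci[-1]["end"]:
            loci[-1]["models"].append(m)
            loci[-1]["end"] = max(loci[-1]["end"], m.end)
        else:
            loci.append({"end": m.end, "models": [m]})
    return [max(locus["models"], key=score) for locus in loci]


candidates = [
    Model("fgenesh_1", 100, 500, homology=0.6, est=0.5),
    Model("genewise_1", 120, 480, homology=0.9, est=0.4),
    Model("external_1", 900, 1500, homology=0.2, est=0.7),
]
best = representative_set(candidates)
```

Here the two overlapping models compete for the first locus and `genewise_1` wins on combined score; `external_1` is the only candidate at the second locus. In practice the selected set would still go to manual curation, as the slide notes.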

Page 7: Eukaryotic Genome Annotation

Combine Gene Predictors for Better Quality

Heterobasidion annosum v1.0

                                          Eugene   Genemark   Fgenesh   JGI Pipe
Number of gene models                     11,547      9,609     8,409     12,270
Models with partial EST support            5,544      3,829     4,567      5,248
  with full length EST support             2,538      1,182     2,896      3,073
  EST coverage per gene                    77.7%      68.2%     80.8%      79.1%
  supported splice sites                  41,581     40,808    45,498     47,671
Models with homology support               6,758      6,043     5,750      7,214
  with strong homology support
  (80+% ide, 80+% cov.)                      112        109       174        187
  model coverage                             64%        60%       68%        69%
Models with homology and EST support       2,894      2,172     2,720      2,953
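The "strong homology support" row uses thresholds of at least 80% identity and at least 80% coverage. A minimal sketch of such a filter follows. The function name and parameters are hypothetical, and measuring coverage relative to the model's protein length is an assumption; the slide does not say whether coverage refers to the model, the hit, or both.

```python
def has_strong_homology(identity_pct, aligned_len, model_len,
                        min_identity=80.0, min_coverage=80.0):
    """Return True if a protein hit meets both thresholds.

    identity_pct: percent identity of the alignment.
    aligned_len:  number of model residues covered by the alignment.
    model_len:    length of the predicted protein (assumed denominator).
    """
    coverage_pct = 100.0 * aligned_len / model_len
    return identity_pct >= min_identity and coverage_pct >= min_coverage
```

For example, a hit with 85% identity covering 90 of 100 residues passes, while the same identity over only 70 residues does not.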

Page 8: Eukaryotic Genome Annotation

Re-annotation Using Comparative Genomics

                                    MAKER   JGI pipeline   Re-annot
# of predicted gene models          9,940         12,290     12,802
with SwissProt hits                 6,521          7,356      7,900
With non-repeat PFAM domains        5,365          6,010      6,353
with EST support                    9,252         10,796     11,105
with >90% EST support               7,729          9,178      9,444
# of unique PFAM domains            2,207          2,245      2,322
EST coverage per gene               93.0%          93.3%      93.3%
# EST-supported splice sites       99,627        102,200    104,246
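A count of EST-supported splice sites can be sketched as follows. The slides do not define the metric precisely, so this assumes that a model splice site is supported when some spliced EST alignment has a donor or acceptor at exactly the same coordinate. The function names and the 0-based end-exclusive exon intervals are illustrative assumptions.

```python
def splice_sites(exons):
    """Return (donors, acceptors) for a sorted list of (start, end) exons."""
    donors = {end for _, end in exons[:-1]}          # exon ends before introns
    acceptors = {start for start, _ in exons[1:]}    # exon starts after introns
    return donors, acceptors


def count_supported_sites(model_exons, est_alignments):
    """Count model splice sites matched exactly by any EST alignment.

    est_alignments: list of exon lists, one per spliced EST alignment.
    Returns (supported, total).
    """
    est_donors, est_acceptors = set(), set()
    for est_exons in est_alignments:
        d, a = splice_sites(est_exons)
        est_donors |= d
        est_acceptors |= a

    donors, acceptors = splice_sites(model_exons)
    supported = len(donors & est_donors) + len(acceptors & est_acceptors)
    return supported, len(donors) + len(acceptors)


model = [(0, 100), (200, 300), (400, 500)]
ests = [
    [(50, 100), (200, 250)],   # confirms the first intron
    [(260, 300), (410, 500)],  # confirms one donor, disagrees on the acceptor
]
supported, total = count_supported_sites(model, ests)
```

In this toy case three of the model's four splice sites are confirmed; the unconfirmed acceptor at position 400 is the kind of disagreement that would prompt a correction of the model.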

Page 9: Eukaryotic Genome Annotation

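Validation with Transcriptomics

Transformation of EST sequencing

[Figure: bar chart with a 0% to 100% axis showing genes supported by ESTs versus other genes for Sanger, 454, and Illumina data; extracted data labels: 5531, 34]

[Figure: panels "Good Old Sanger Days" and "Processing RNA-Seq with CombEST", with tracks labeled models, ESTs, and EST profile]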

Page 10: Eukaryotic Genome Annotation

Validation with Proteomics

Wright et al, BMC Genomics (2009)

Page 11: Eukaryotic Genome Annotation

Protein Annotation

Predicted protein:
• Signal peptide (SignalP)
• Domain (InterPro, TMHMM)
• Possible paralog (Blastp+MCL)
• Possible orthologs (in nr, SwissProt, KEGG, KOG)

Higher order assignments:
• Gene Ontology terms
• EC numbers --> KEGG pathways
• Gene families, with and without other species
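The paralog step pairs all-against-all Blastp hits with MCL clustering. The sketch below shows the general idea of turning pairwise hits into groups, but it swaps in a much simpler technique: single-linkage grouping (connected components via union-find) rather than Markov clustering. MCL can split weakly connected groups that this approach would merge. The function name, the `(query, subject, bit_score)` tuple layout, and the `min_score` cutoff are assumptions for illustration.

```python
def group_paralogs(hits, min_score=50.0):
    """Group proteins linked by hits scoring at least min_score.

    hits: iterable of (query, subject, bit_score) tuples.
    Returns a sorted list of sorted groups; proteins with no
    qualifying hit end up in singleton groups.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for query, subject, bit_score in hits:
        root_q, root_s = find(query), find(subject)  # registers both proteins
        if bit_score >= min_score and root_q != root_s:
            parent[root_q] = root_s

    groups = {}
    for protein in list(parent):
        groups.setdefault(find(protein), set()).add(protein)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


hits = [
    ("p1", "p2", 120.0),
    ("p2", "p3", 80.0),
    ("p4", "p5", 30.0),   # below the cutoff, so p4 and p5 stay apart
    ("p6", "p7", 200.0),
]
families = group_paralogs(hits)
```

With these toy hits, `p1`, `p2`, and `p3` form one group through chained similarity, `p6` and `p7` form another, and `p4` and `p5` remain singletons.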

Page 12: Eukaryotic Genome Annotation

Gene Cluster Analysis

Comparative analysis

Page 13: Eukaryotic Genome Annotation

Genome Portal Framework

Page 14: Eukaryotic Genome Annotation

Daphnia pulex – Environmental Model

• First crustacean, aquatic animal sequenced
• New model organism
• Compact genome (~200 Mb)
• Largest gene count (~31,000)
• Unknown genes most responsive to environmental changes

Colbourne et al, Science, Feb 4, 2011

Page 15: Eukaryotic Genome Annotation

Half of Daphnia Genes: No Homologs, Expressed Under Environmental Stress

With Evgeny Zdobnov’s group (Univ. Genève)

* Of 716 highly conserved single copy orthologs, Daphnia is missing only two

Colbourne et al, 2011