
Melampsora Genome Annotation and Genome Structure Analysis

First Annotation Workshop of the Melampsora Genome Consortium

Yao-Cheng Lin

Bioinformatics & Evolutionary Genomics

VIB Department of Plant Systems Biology, UGent

Overview

• Gene prediction (structural annotation)
• Gene family analysis
• Phylogenetic position of Melampsora

EuGène: gene prediction platform

[Diagram: EuGène combines intrinsic and extrinsic information on the genomic sequence to produce predicted genes.]

• Intrinsic information: FunSiP (splice sites, GT/AG), translation start site, coding IMM and intronic IMM (content potential for coding, intronic and intergenic regions)
• Extrinsic information: TE & repeat database, protein databases, EST databases and Puccinia genomic sequence, searched with RepeatMasker, TblastX, BlastX, BlastN and GenomeThreader
• Also shown: other prediction programs, alternative models
• Input: genomic sequence; output: predicted genes

Resources for Melampsora gene prediction

• Gene models for training
– Previously identified core genes in basidiomycetes
– Manually curated genes from INRA-Nancy
• Splice site training/prediction
– FunSiP: developed by Michiel Van Bel, who also helped with training
• BlastX database
– 8 basidiomycete proteomes, Fungi RefSeq, SwissProt
• TBLASTX database
– Puccinia graminis genomic sequence
• EST libraries
– JGI Sanger sequencing
– 454 pyrosequencing (first MIRA assembly)
• Repeat libraries
– Hadi/Marie-Pierre
– In-house script, collected from the first run of gene prediction
– Masked regions from JGI
• EuGene 3.4

Gene prediction – comparison of two prediction results

                                       EuGene           JGI
Number of protein-coding genes         17,167           16,694
Coding sequence < 300 aa               6,989 (40.7%)    8,212 (49.2%)
Average gene length (bp)               1,742.7          1,685.5
Average coding sequence length (bp)    1,369.7          1,131.4
Average exon length (bp)               261.1            235
Average exon number                    5.3              4.8
Average intron length (bp)             86.9             117.8
SwissProt support                      6,521 (38.0%)    5,699 (34.1%)
EST support                            6,152 (35.8%)    6,241 (37.4%)
EST support (< 300 aa)                 1,066            995
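The "< 300 aa" row gives the share of predicted proteins shorter than 300 amino acids. A minimal Python sketch of that calculation, assuming the protein lengths are already available as a list (the function name and the sample lengths are illustrative, not taken from the workshop):

```python
def short_protein_share(lengths, cutoff=300):
    """Return (count, percentage) of proteins shorter than `cutoff` aa."""
    if not lengths:
        raise ValueError("no protein lengths given")
    count = sum(1 for n in lengths if n < cutoff)
    return count, 100.0 * count / len(lengths)

# Hypothetical lengths, for illustration only
count, pct = short_protein_share([22, 150, 299, 300, 850])
print(count, round(pct, 1))

# Consistency check against the EuGene column of the table
print(round(100 * 6989 / 17167, 1))
```

The last line recomputes the EuGene percentage from the counts reported in the table.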

Gene prediction – protein length distribution

[Figure: protein length distribution. X-axis: protein length (aa), 100 to 1,900. Y-axis: frequency (%), 0 to 40. Series: Melampsora JGI, Melampsora EuGene, Laccaria, Puccinia.]

Example: metallothionein-like protein

• Metallothionein-like protein in Magnaporthe
• Protein length: 22 amino acids (MMT1)
• Six cysteine residues
• Mmt1 mutants lose the ability to cause plant disease
• Difficulties in in silico identification
– Sequence divergence
– Short sequences are easily rejected by the E-value cut-off
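The E-value problem for short proteins can be illustrated with the standard Karlin-Altschul relation E = m · n · 2^(−S'), where m is the query length, n the database length and S' the bit score. The sketch below is only an approximation: it ignores BLAST's effective-length corrections, the database size is a hypothetical value, and the 1e-5 cut-off is borrowed from the gene family clustering step for illustration.

```python
import math

def min_bit_score(query_len, db_len, evalue_cutoff):
    """Bit score S' needed so that E = m * n * 2**(-S') <= cutoff."""
    return math.log2(query_len * db_len / evalue_cutoff)

# 22 aa is the MMT1 length; the 1e8-residue database is a hypothetical size
needed = min_bit_score(22, 1e8, 1e-5)
print(f"{needed:.1f} bits needed, {needed / 22:.2f} bits per residue")
```

Because the required score has to be collected over only 22 positions, even moderate sequence divergence can push a true homolog below the cut-off.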

Overview

• Gene prediction and annotation platform
• Gene family analysis
• Phylogenetic position of Melampsora

Gene family expansion and contraction

• Gene family clustering
– Similarity search with 12 fungal genomes (10 basidiomycetes, 2 ascomycetes): all-against-all BLASTP, E-value cut-off 1e-5
– Gene families constructed by TribeMCL with inflation factor 4.0
• Species/lineage-specific gene family expansions
– The mean gene family size and standard deviation were calculated for all gene families, excluding species-specific families (SSFs) and orphans
– To center and normalize the data, the profile matrix was transformed into a matrix of z-scores
• Functional assignment
– Domain based: RPS-BLAST
– HMM profile for each family, searched against the SwissProt and NR databases
– GO terms
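The E-value filter in the clustering step can be sketched as follows. This assumes 12-column tabular BLAST output (query, subject, ..., E-value in column 11, bit score in column 12); the hit lines and gene identifiers are hypothetical, and the output triples are simply a convenient edge list for a downstream clustering tool, not the consortium's actual pipeline.

```python
def filter_hits(lines, evalue_cutoff=1e-5):
    """Yield (query, subject, evalue) for hits passing the E-value cut-off.

    Assumes 12-column tabular BLAST output with the E-value in column 11.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= evalue_cutoff:
            yield query, subject, evalue

# Hypothetical hits, for illustration only
example = [
    "Mlp_1\tPgt_7\t62.0\t210\t80\t0\t1\t210\t5\t214\t3e-40\t160",
    "Mlp_1\tLb_3\t28.0\t60\t43\t0\t10\t69\t2\t61\t0.002\t35.0",
    "Mlp_1\tMlp_1\t100.0\t210\t0\t0\t1\t210\t1\t210\t1e-120\t420",
]
for q, s, e in filter_hits(example):
    print(q, s, e, sep="\t")
```

Only the first and third hits pass the 1e-5 cut-off; the weak hit to `Lb_3` is dropped.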

Protein phylogeny profile / z-score

Protein phylogeny profile (rows: gene families; columns: genomes; values: gene counts)

Family   A     B     C     Mean   SD
1        5     10    15    10     5
2        4     6     5     5      1
3        20    5     10    11.7   7.6
100      1     1     1                   (core-gene family)
201      0     10    0                   (species-specific gene family)

Z-score profile

Family   A      B      C
1        -1     0      1
2        -1     1      0
3        1.1    -0.9   -0.2

Z = (gene number – mean gene number) / standard deviation
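The transformation from gene counts to z-scores can be sketched in a few lines of Python. The sample standard deviation is used because it matches the Mean/SD columns of the example; returning zeros for a family with identical counts in every genome is a choice made for this sketch, not something stated in the slides.

```python
from statistics import mean, stdev

def z_score_profile(counts):
    """Convert one family's gene counts per genome into z-scores."""
    m, sd = mean(counts), stdev(counts)  # sample standard deviation
    if sd == 0:
        # Identical counts in every genome: no deviation from the mean
        return [0.0 for _ in counts]
    return [(c - m) / sd for c in counts]

# Families 1 to 3 of the example profile (genomes A, B, C)
profile = {1: [5, 10, 15], 2: [4, 6, 5], 3: [20, 5, 10]}
for family, counts in profile.items():
    print(family, [round(z, 1) for z in z_score_profile(counts)])
```

Rounded to one decimal, the output matches the z-score rows of the example for families 1 to 3.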

Fungi genomes characteristics

Genome                         Genome size (Mb)   Genes     Genes < 300 aa    GC content (%)
Magnaporthe grisea             41.7               12,832    5,312 (41.4%)     51.6
Neurospora crassa              39.23              9,822     3,445 (35.1%)     49.3
Sporobolomyces roseus          21.1               5,536     1,714 (31.0%)     49.5
Puccinia graminis              88.64              20,566    11,319 (55.0%)    43.0
Melampsora larici-populina     101.1              16,694    8,212 (49.2%)     42.1
Ustilago maydis                19.7               6,522     1,668 (25.6%)     54.0
Malassezia globosa             8.9                4,286     1,468 (34.3%)     52.0
Postia placenta                90.9               12,415    4,629 (37.3%)     52.4
Phanerochaete chrysosporium    35.1               10,048    3,579 (35.6%)     53.2
Laccaria bicolor               64.9               19,036    10,013 (52.6%)    46.6
Coprinus cinereus              37.5               13,544    5,487 (40.5%)     51.6
Cryptococcus neoformans        19.5               7,170     2,372 (33.1%)     48.2

Orphans / Species specific gene families

[Figure: bar chart of the percentage of genes (0 to 80%) that are orphans or belong to species-specific families, for Neurospora crassa, Magnaporthe grisea, Cryptococcus neoformans, Coprinus cinereus, Laccaria bicolor, Phanerochaete chrysosporium, Postia placenta, Malassezia globosa, Ustilago maydis, Sporobolomyces roseus, Puccinia graminis and Melampsora larici-populina.]
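The two categories in this chart can be counted from family membership. The sketch below treats an orphan as a family with a single gene and a species-specific family as one with several genes all from the same species; these definitions, the function name and the example data are assumptions for illustration.

```python
from collections import Counter

def classify_families(families):
    """Count, per species, orphan genes and genes in species-specific families.

    `families` maps a family id to a list of (species, gene) members.
    """
    orphans, specific = Counter(), Counter()
    for members in families.values():
        species = {sp for sp, _ in members}
        if len(members) == 1:
            orphans[members[0][0]] += 1          # single-gene family
        elif len(species) == 1:
            specific[next(iter(species))] += len(members)
    return orphans, specific

# Hypothetical families, for illustration only
families = {
    "F1": [("Mlp", "g1")],
    "F2": [("Mlp", "g2"), ("Mlp", "g3"), ("Mlp", "g4")],
    "F3": [("Mlp", "g5"), ("Pgt", "g6")],
    "F4": [("Pgt", "g7")],
}
orphans, specific = classify_families(families)
print(dict(orphans), dict(specific))
```

Dividing these counts by the total gene number of each genome would give percentages of the kind plotted in the chart.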

Difference in average gene family size

[Figure: mean z-score per genome (y-axis, -0.5 to 0.4) for Neurospora crassa, Magnaporthe grisea, Cryptococcus neoformans, Coprinus cinereus, Laccaria bicolor, Phanerochaete chrysosporium, Postia placenta, Malassezia globosa, Ustilago maydis, Sporobolomyces roseus, Puccinia graminis f. sp. tritici and Melampsora larici-populina.]

* 8,035 families in total, excluding the species-specific families

Hierarchical clustering of gene family

[Figure: hierarchical clustering of gene family profiles across N. crassa, M. grisea, S. roseus, P. graminis, M. larici-populina, U. maydis, M. globosa, P. placenta, P. chrysosporium, C. cinereus, L. bicolor and C. neoformans.]

• The 100 most variable profiles were selected, based on their standard deviations.

• Red: protein kinase, esterase/lipase, Cre recombinase, DNA/RNA helicase, leucine-rich repeat

• Blue: major facilitator superfamily
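Selecting the most variable profiles amounts to ranking families by standard deviation. The sketch below assumes the standard deviation is taken over the raw gene counts of each family (a row-wise z-score profile has the same spread for every family, so it cannot be ranked this way); the function name is illustrative.

```python
from statistics import stdev

def most_variable(profiles, top_n=100):
    """Return the ids of the `top_n` families with the largest SD."""
    ranked = sorted(profiles, key=lambda fam: stdev(profiles[fam]), reverse=True)
    return ranked[:top_n]

# Families 1 to 3 of the example profile shown earlier (SD 5, 1 and 7.6)
profiles = {1: [5, 10, 15], 2: [4, 6, 5], 3: [20, 5, 10]}
print(most_variable(profiles, top_n=2))
```

With the three example families, the two most variable are family 3 and then family 1.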

Overview

• Gene prediction and annotation platform
• Gene family analysis
• Phylogenetic position of Melampsora

Phylogenies of Melampsora

• Construct the Melampsora phylogenetic tree based on FUNYBASE with selected fungal genomes.

• FUNYBASE: single-copy gene families (246 genes) within 21 fungal species (mostly ascomycetes).

• 22 selected species:
– Ascomycetes: Aspergillus nidulans, Coccidioides immitis, Fusarium graminearum, Mycosphaerella graminicola, Magnaporthe grisea, Neurospora crassa, Nectria haematococca, Pyrenophora tritici-repentis, Stagonospora nodorum, Schizosaccharomyces pombe, Sclerotinia sclerotiorum
– Basidiomycetes: Coprinus cinereus, Cryptococcus neoformans, Laccaria bicolor, Malassezia globosa, Melampsora larici-populina, Phanerochaete chrysosporium, Puccinia graminis, Postia placenta, Sporobolomyces roseus, Ustilago maydis
– Zygomycete: Rhizopus oryzae

*new genome; reject in FUNYBASE

Phylogenies of Melampsora - Method

• 246 HMM models for the conserved protein sequence blocks in FUNYBASE.

• For each genome, search the whole proteome with HMMER and retain the protein sequence of the best hit for each model.

• 148 models have a single-copy gene in all 22 selected species.

• Concatenate the 148 single-copy orthologs for tree building.
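The last two steps, keeping only models that are single-copy in every species and concatenating their sequences, might look like the sketch below. It assumes the hits per model and species have already been collected and that the sequences of each model are already aligned to a common length; the data structure, names and sequences are hypothetical.

```python
def concatenate_single_copy(hits, species):
    """Concatenate sequences of models that are single-copy in every species.

    `hits` maps model -> species -> list of sequences retained for that model.
    Returns (kept_models, {species: concatenated sequence}).
    """
    kept = [
        model for model in sorted(hits)
        if all(len(hits[model].get(sp, [])) == 1 for sp in species)
    ]
    concatenated = {
        sp: "".join(hits[model][sp][0] for model in kept) for sp in species
    }
    return kept, concatenated

# Hypothetical aligned hits, for illustration only
hits = {
    "model_1": {"Mlp": ["MKT-A"], "Pgt": ["MKTSA"]},
    "model_2": {"Mlp": ["GGL"], "Pgt": ["GGL", "GAL"]},  # two copies in Pgt
    "model_3": {"Mlp": ["WQ"], "Pgt": ["WE"]},
}
kept, supermatrix = concatenate_single_copy(hits, ["Mlp", "Pgt"])
print(kept, supermatrix)
```

Here `model_2` is dropped because one species has two copies, and the remaining blocks are joined in model order.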

Melampsora in the phylogenetic tree of fungi

Built with phylo_win: neighbor-joining method with Poisson correction, 500 bootstrap replicates.
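The Poisson correction converts the observed proportion p of differing amino acid sites into a distance d = −ln(1 − p). The sketch below shows that formula for two aligned sequences; it is not phylo_win's implementation, skipping gapped columns is a choice made here, and the sequences are hypothetical.

```python
import math

def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance d = -ln(1 - p) between aligned proteins."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Skip columns with a gap in either sequence
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 1.0:
        return math.inf  # every site differs: distance is undefined
    return -math.log(1.0 - p)

# Hypothetical aligned fragments differing at 2 of 8 sites (p = 0.25)
print(round(poisson_distance("MKTAYIAK", "MKSAYLAK"), 4))
```

A matrix of such pairwise distances is the input to neighbor joining.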

Acknowledgements

• Gent
– Stephane Rombauts
– Michiel Van Bel
– Klaas Vandepoele
– Kenny Billiau
– Thomas Abeel
– Pierre Rouzé
– Lieven Sterck
– Yves Van de Peer

• Nancy
– Stéphane Hacquard
– Emilie Tisserant
– Marie-Pierre Oudot-Le Secq
– Sébastien Duplessis
– Francis Martin