
Melampsora Genome Annotation and Genome Structure Analysis
First Annotation Workshop of the Melampsora Genome Consortium
Yao-Cheng Lin
Bioinformatics & Evolutionary Genomics
VIB Department of Plant Systems Biology, UGent

Overview
• Gene prediction (structural annotation)
• Gene family analysis
• Phylogenetic position of Melampsora

EuGène: gene prediction platform
[Figure: EuGène workflow. Genomic sequence → EuGène → predicted genes]
• Intrinsic information
– Coding IMM, intronic IMM: content potential for coding, intronic and intergenic regions
– FunSiP: translation start site, splice sites (GT/AG)
• Extrinsic information
– TE & repeat database (RepeatMasker)
– Protein databases (BlastX)
– EST databases (BlastN, GenomeThreader)
– Puccinia genomic sequence (TblastX)
• Also shown: other prediction programs, alternative models

Resources for Melampsora gene prediction
• Gene models for training
– Previously identified core genes in basidiomycetes
– Manually curated genes from INRA-Nancy
• Splice site training/prediction
– FunSiP: developed by Michiel Van Bel, who also helped with training
• BlastX database
– 8 basidiomycete proteomes, Fungi RefSeq, SwissProt
• TBLASTX database
– Puccinia graminis genomic sequence
• EST libraries
– JGI Sanger sequencing
– 454 pyrosequencing (the first MIRA assembly)
• Repeat libraries
– Hadi/Marie-Pierre
– In-house script, collected from the first run of gene prediction
– Masked area from JGI
• EuGene 3.4

Gene prediction – comparison of two prediction results
                                        EuGene           JGI
Number of protein-coding genes          17,167           16,694
Coding sequence < 300 aa                6,989 (40.7%)    8,212 (49.2%)
Average gene length (bp)                1,742.7          1,685.5
Average coding sequence length (bp)     1,369.7          1,131.4
Average exon length (bp)                261.1            235
Average exon number                     5.3              4.8
Average intron length (bp)              86.9             117.8
SwissProt support                       6,521 (38.0%)    5,699 (34.1%)
EST support                             6,152 (35.8%)    6,241 (37.4%)
EST support (< 300 aa)                  1,066            995

Gene prediction – protein length distribution
[Figure: protein length distribution. x-axis: protein length (aa), 100 to 1900; y-axis: frequency (%), 0 to 40; series: Melampsora JGI, Melampsora EuGene, Laccaria, Puccinia]

Example: metallothionein-like protein
• Metallothionein-like protein in Magnaporthe
• Protein length: 22 amino acids (MMT1)
• Six cysteine residues
• Mmt1 mutants lose the ability to cause plant disease
• Difficulties in in silico identification
– Sequence divergence
– Short sequences are easily rejected by an E-value cutoff

Overview
• Gene prediction and annotation platform
• Gene family analysis
• Phylogenetic position of Melampsora

Gene family expansion and contraction
• Gene family clustering
– Similarity search with 12 fungal genomes (10 basidiomycetes, 2 ascomycetes): all-against-all BLASTP, E-value cutoff 1e-5
– Gene families constructed by TribeMCL with inflation factor 4.0
• Species/lineage-specific gene family expansions
– The mean gene family size and standard deviation were calculated for all gene families, excluding species-specific families (SSFs) and orphans
– To center and normalize the data, the resulting profile matrix was transformed into a matrix of z-scores
• Functional assignment
– Domain-based: RPS-BLAST
– HMM profile for each family, searched against the SwissProt and NR databases
– GO terms
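
The filtering step between the all-against-all BLASTP search and the clustering can be sketched as below. This is a minimal Python sketch, not the workshop's actual pipeline: it assumes the hits are in the standard 12-column BLAST tabular format (E-value in column 11), and the function name `filter_hits` and the example identifiers and scores are made up for illustration.

```python
E_VALUE_CUTOFF = 1e-5  # cutoff stated on the slide


def filter_hits(lines, cutoff=E_VALUE_CUTOFF):
    """Yield (query, subject, evalue) for non-self hits passing the cutoff.

    Assumes BLAST 12-column tabular output: query id in column 1,
    subject id in column 2, E-value in column 11.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query != subject and evalue <= cutoff:
            yield query, subject, evalue


# Hypothetical example hits (identifiers and scores are invented).
example = [
    "MlpA\tMlpA\t100.0\t300\t0\t0\t1\t300\t1\t300\t0.0\t600",
    "MlpA\tPgtB\t45.0\t280\t150\t4\t5\t284\t3\t280\t2e-40\t160",
    "MlpA\tLbC\t28.0\t90\t60\t2\t10\t99\t7\t95\t0.003\t35",
]
edges = list(filter_hits(example))
print(edges)
```

The surviving pairs are the kind of edge list a graph clustering tool takes as input; how the edges were weighted for TribeMCL is not stated on the slide.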

Protein phylogeny profile / z-score
Protein phylogeny profile (rows: gene families; columns: genomes)
Family   A     B     C     Mean   SD
1        5     10    15    10     5
2        4     6     5     5      1
3        20    5     10    11.7   7.6
100      1     1     1                   (core gene family)
201      0     10    0                   (species-specific gene family)

Z-score profile
Family   A      B      C
1        -1     0      1
2        -1     1      0
3        1.1    -0.9   -0.2

Z = (gene number - mean gene number) / standard deviation
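
The z-score transformation can be reproduced with a few lines of Python. This is a minimal sketch using the counts of families 1 to 3 from the example profile. The function name `zscore_profile` is illustrative, and mapping a flat profile (zero standard deviation) to zeros is an assumption, since the slide does not say how such families are handled. The sample standard deviation (n - 1) is used because it matches the SD column of the example.

```python
from statistics import mean, stdev


def zscore_profile(counts):
    """Return z-scores for one gene family's counts across genomes."""
    mu = mean(counts)
    sd = stdev(counts)  # sample standard deviation (n - 1)
    if sd == 0:
        # Assumption: a flat profile (e.g. 1, 1, 1) is mapped to zeros.
        return [0.0 for _ in counts]
    return [(c - mu) / sd for c in counts]


# Gene counts per genome (A, B, C) for families 1-3 of the example.
profile = {1: [5, 10, 15], 2: [4, 6, 5], 3: [20, 5, 10]}
for family, counts in profile.items():
    print(family, [round(z, 1) for z in zscore_profile(counts)])
```

Rounded to one decimal, the values for family 3 are 1.1, -0.9 and -0.2, in agreement with the z-score profile of the example.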

Fungi genomes characteristics
Genome                         Genome size (Mb)   Genes    Genes < 300 a.a.   GC content (%)
Magnaporthe grisea             41.7               12,832   5,312 (41.4%)      51.6
Neurospora crassa              39.23              9,822    3,445 (35.1%)      49.3
Sporobolomyces roseus          21.1               5,536    1,714 (31.0%)      49.5
Puccinia graminis              88.64              20,566   11,319 (55.0%)     43.0
Melampsora larici-populina     101.1              16,694   8,212 (49.2%)      42.1
Ustilago maydis                19.7               6,522    1,668 (25.6%)      54.0
Malassezia globosa             8.9                4,286    1,468 (34.3%)      52.0
Postia placenta                90.9               12,415   4,629 (37.3%)      52.4
Phanerochaete chrysosporium    35.1               10,048   3,579 (35.6%)      53.2
Laccaria bicolor               64.9               19,036   10,013 (52.6%)     46.6
Coprinus cinereus              37.5               13,544   5,487 (40.5%)      51.6
Cryptococcus neoformans        19.5               7,170    2,372 (33.1%)      48.2

Orphans / Species specific gene families
[Figure: bar chart of orphans and genes in species-specific families, as % of genes (y-axis, 0 to 80), for Neurospora crassa, Magnaporthe grisea, Cryptococcus neoformans, Coprinus cinereus, Laccaria bicolor, Phanerochaete chrysosporium, Postia placenta, Malassezia globosa, Ustilago maydis, Sporobolomyces roseus, Puccinia graminis and Melampsora larici-populina]

Difference in average gene family size
[Figure: mean z-score (y-axis, -0.5 to 0.4) per genome: Neurospora crassa, Magnaporthe grisea, Cryptococcus neoformans, Coprinus cinereus, Laccaria bicolor, Phanerochaete chrysosporium, Postia placenta, Malassezia globosa, Ustilago maydis, Sporobolomyces roseus, Puccinia graminis f. sp. tritici, Melampsora larici-populina]
* 8,035 families in total, excluding species-specific families

Hierarchical clustering of gene family
[Figure: heat map with hierarchical clustering of gene family profiles across N. crassa, M. grisea, S. roseus, P. graminis, M. larici-populina, U. maydis, M. globosa, P. placenta, P. chrysosporium, C. cinereus, L. bicolor and C. neoformans]
• The 100 most variable profiles, ranked by standard deviation, were selected.
• Red: protein kinase, esterase lipase, Cre recombinase, DNA/RNA helicase, leucine-rich repeat
• Blue: major facilitator superfamily
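
Ranking profiles by their standard deviation can be sketched as follows. This is a minimal Python sketch under one assumption: the standard deviation is taken over the gene counts of each family across genomes, which the slide does not state explicitly. The function name `most_variable` and the toy profiles are illustrative.

```python
from statistics import stdev


def most_variable(profiles, top_n=100):
    """Return family ids sorted by decreasing SD of gene counts, keeping top_n."""
    ranked = sorted(profiles, key=lambda fam: stdev(profiles[fam]), reverse=True)
    return ranked[:top_n]


# Toy profiles: gene counts per genome for three hypothetical families.
profiles = {"fam1": [5, 10, 15], "fam2": [4, 6, 5], "fam3": [20, 5, 10]}
top = most_variable(profiles, top_n=2)
print(top)
```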

Overview
• Gene prediction and annotation platform
• Gene family analysis
• Phylogenetic position of Melampsora

Phylogenies of Melampsora
• Construct the Melampsora phylogenetic tree based on FUNYBASE with selected fungal genomes.
• FUNYBASE: single-copy gene families (246 genes) within 21 fungal species (mostly ascomycetes).
• 22 selected species:
– Ascomycete: Aspergillus nidulans, Coccidioides immitis, Fusarium graminearum, Mycosphaerella graminicola, Magnaporthe grisea, Neurospora crassa, Nectria haematococca, Pyrenophora tritici-repentis, Stagonospora nodorum, Schizosaccharomyces pombe, Sclerotinia sclerotiorum
– Basidiomycete: Coprinus cinereus, Cryptococcus neoformans, Laccaria bicolor, Malassezia globosa, Melampsora larici-populina, Phanerochaete chrysosporium, Puccinia graminis, Postia placenta, Sporobolomyces roseus, Ustilago maydis
– Zygomycete: Rhizopus oryzae
*new genome; reject in FUNYBASE

Phylogenies of Melampsora - Method
• 246 HMM models for the conserved protein sequence blocks in FUNYBASE.
• For each genome, run a HMMER search against the whole proteome and retain the protein sequence of the best hit for each model.
• 148 models have a single-copy gene in each of our 22 selected species.
• Concatenate the 148 single-copy orthologs for tree building.
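
The selection and concatenation steps can be sketched as below. This is a minimal Python sketch under stated assumptions: the slide does not say how "single-copy" was decided, so a model is kept here when every species has exactly one hit, and the per-model sequences are assumed to be aligned beforehand. The function names, data layout and toy sequences are all hypothetical.

```python
def single_copy_models(hits, species):
    """Return models with exactly one hit in every species.

    hits maps model -> species -> list of protein ids (assumed layout).
    """
    return sorted(
        model
        for model, by_species in hits.items()
        if all(len(by_species.get(sp, [])) == 1 for sp in species)
    )


def concatenate(alignments, models, species):
    """Join per-model aligned sequences into one row per species.

    alignments maps model -> species -> aligned sequence
    (assumed to have equal length within a model).
    """
    return {sp: "".join(alignments[m][sp] for m in models) for sp in species}


# Toy data: two species, three models; model m2 has two copies in Mlp.
species = ["Mlp", "Pgt"]
hits = {
    "m1": {"Mlp": ["a1"], "Pgt": ["b1"]},
    "m2": {"Mlp": ["a2", "a3"], "Pgt": ["b2"]},
    "m3": {"Mlp": ["a4"], "Pgt": ["b4"]},
}
alignments = {
    "m1": {"Mlp": "MK-L", "Pgt": "MKAL"},
    "m3": {"Mlp": "GHT", "Pgt": "GH-"},
}
kept = single_copy_models(hits, species)
supermatrix = concatenate(alignments, kept, species)
print(kept, supermatrix)
```

Sorting the kept models keeps the column order of the concatenated alignment reproducible between runs.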

Melampsora in the phylogenetic tree of fungi
Tree built with phylo_win: neighbor-joining method with Poisson correction, 500 bootstrap replicates.
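
The Poisson correction converts the observed proportion p of differing amino acid sites into a distance d = -ln(1 - p). The sketch below illustrates that formula in Python. It is not phylo_win's code: the function name `poisson_distance`, the toy sequences, and the choice to skip positions where either sequence has a gap are assumptions made for the example.

```python
import math


def poisson_distance(seq_a, seq_b, gap="-"):
    """Poisson-corrected distance between two aligned protein sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: positions with a gap in either sequence are ignored.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no comparable positions")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 1.0:
        return math.inf  # the correction is undefined when every site differs
    return -math.log(1.0 - p)


# Toy example: 7 comparable sites, 1 difference, so p = 1/7.
d = poisson_distance("MKALGHTW", "MKSLGH-W")
print(round(d, 4))
```

A matrix of such pairwise distances is the input a neighbor-joining method works from.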

Acknowledgements
• Gent: Stephane Rombauts, Michiel Van Bel, Klaas Vandepoele, Kenny Billiau, Thomas Abeel, Pierre Rouzé, Lieven Sterck, Yves Van de Peer
• Nancy: Stéphane Hacquard, Emilie Tisserant, Marie-Pierre Oudot-Le Secq, Sébastien Duplessis, Francis Martin