Tomato genome annotation pipeline in Cyrille2 Erwin Datema

Upload: paulina-floyd

Post on 28-Dec-2015

Page 1: Tomato genome annotation pipeline in Cyrille2 Erwin Datema

Tomato genome annotation pipeline in Cyrille2

Erwin Datema

Contents of the annotation pipeline

Annotation on the BAC level
• Gene prediction
• Repeat identification
• Other features

Annotation on the gene level (work in progress)
• blastx vs NCBI’s nr (sequence similarity)
• InterProScan (domain identification)

Ab initio gene structure prediction

Ab initio predictors included in the pipeline
• Genscan
• GlimmerHMM (trained on tomato!)
• GeneId (has been trained on Solanaceae)
• SNAP
• Augustus (predicts alternatively spliced variants)

Alignment-based gene structure prediction (1)

Transcript alignment (blastn + Sim4)
• SGN tomato UniGenes (34,829 UniGenes)
• SGN potato UniGenes (31,072 UniGenes)
• SGN coffee UniGenes (13,171 UniGenes)
• SGN pepper UniGenes (9,554 UniGenes)
• SGN petunia UniGenes (5,135 UniGenes)
• SGN S. melongena UniGenes (1,841 UniGenes)
• NCBI full-length tomato cDNAs (678 cDNAs)

Protein alignment (tblastn + GeneWise)
• TAIR6 Arabidopsis thaliana proteome (30,690 proteins)
• TIGR4 Oryza sativa proteome (62,827 proteins)
• UniProt Plant division (17,831 proteins)

Additional feature prediction

Repeat identification
• Tandem Repeats Finder
• RepeatMasker
  • RepBase + ‘default’ features (low complexity, etc.)
  • TIGR Solanum lycopersicon repeat library V2
  • SGN Solanum lycopersicon UniRepeats

Feature prediction
• tRNAscan-SE
• MarScan
• GeneSplicer
• Marker identification (blastn + Sim4)

Preliminary results

Annotation of chromosome 6 BACs
• Phase 1, 2 and 3
• 632 contigs
• Older version of the pipeline
  • GlimmerHMM only trained on Arabidopsis
  • 2 UniGene sets (tomato, potato)
  • 2 protein sets (Arabidopsis, UniProt plant)
  • Protein alignment parameters too strict

The genomic landscape of chromosome 6

• 632 contigs have been annotated
• Contig length varies between 348 and 148,256 nt
• Average length of 9,061 nt, median length of 5,105 nt
• Total length of 5,726,791 nt
• GC content: 29.9% min, 34.1% avg, 42.2% max (sequences longer than 10,000 nt)
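
Summary figures of this kind can be recomputed from the contig sequences. The following is a minimal Python sketch, not the pipeline's own code: the function names are hypothetical, the toy sequences are made up, and it assumes that the GC average is the mean of per-contig GC percentages (the slide does not say how it was averaged).

```python
from statistics import mean, median


def gc_content(seq: str) -> float:
    """Return the GC percentage of a sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def contig_summary(contigs: dict, gc_min_length: int = 10_000) -> dict:
    """Summarise contig lengths; GC statistics use only contigs longer than gc_min_length."""
    lengths = [len(seq) for seq in contigs.values()]
    gc_values = [gc_content(seq) for seq in contigs.values() if len(seq) > gc_min_length]
    summary = {
        "count": len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),
        "median_length": median(lengths),
        "total_length": sum(lengths),
    }
    if gc_values:
        # Assumption: average GC is the unweighted mean over the selected contigs.
        summary.update(gc_min=min(gc_values), gc_avg=mean(gc_values), gc_max=max(gc_values))
    return summary


# Toy, made-up contigs; the length cutoff is lowered so that two of them qualify.
toy = {"c1": "ATGC" * 3, "c2": "GGCC" * 5, "c3": "AATT" * 2}
print(contig_summary(toy, gc_min_length=10))
```

As a consistency check on the reported numbers, 5,726,791 nt divided over 632 contigs gives roughly 9,061 nt, which matches the reported average length.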

Ab initio gene prediction

Predictor     Genes   Exons   Exons/gene   Exon length (nt)   Gene length (nt)   Genes/kb
Genscan        1065    4630          4.3                249               1084       0.19
GlimmerHMM     1218    3901          3.2                272                872       0.21
GeneId         1210    4002          3.3                273                903       0.21
SNAP           1782    5059          2.8                230                653       0.31
Augustus       1888    8810          4.7                227               1061       0.33

Note: Augustus predictions include up to 3 splice variants per gene

Estimated gene density is 1 gene per 5 kb, i.e. ~1,200 genes in the currently sequenced BACs
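
The genes/kb column and the gene-count estimate follow from the total annotated length of 5,726,791 nt reported for the 632 contigs. A short Python sketch of that arithmetic, with hypothetical function names and the figures copied from the slides:

```python
TOTAL_LENGTH_NT = 5_726_791  # total length of the annotated contigs, from the talk

# Predicted gene counts per ab initio predictor, from the table above.
GENE_COUNTS = {
    "Genscan": 1065,
    "GlimmerHMM": 1218,
    "GeneId": 1210,
    "SNAP": 1782,
    "Augustus": 1888,
}


def genes_per_kb(gene_count: int, total_length_nt: int = TOTAL_LENGTH_NT) -> float:
    """Predicted genes per 1,000 nt of annotated sequence."""
    return gene_count / (total_length_nt / 1000)


def expected_genes(total_length_nt: int = TOTAL_LENGTH_NT, nt_per_gene: int = 5000) -> float:
    """Expected gene count at a density of one gene per nt_per_gene nucleotides."""
    return total_length_nt / nt_per_gene


for name, count in GENE_COUNTS.items():
    print(f"{name:<11}{genes_per_kb(count):.2f}")
print(f"Expected at 1 gene per 5 kb: {expected_genes():.0f}")
```

This reproduces the genes/kb column (0.19, 0.21, 0.21, 0.31, 0.33) and gives about 1,145 expected genes, in line with the rounded ~1,200 on the slide. Genscan, GlimmerHMM and GeneId land close to that estimate, while SNAP and Augustus predict considerably more (Augustus partly because splice variants are counted).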

Transcript alignment-based gene prediction

Tomato
• 34,829 UniGenes (derived from 239,593 ESTs)
• 574 hits to the contigs

Potato
• 31,072 UniGenes (derived from 133,657 ESTs)
• 631 hits to the contigs

Protein alignment-based gene prediction

UniProt Plant proteins
• 17,378 protein sequences from the plant kingdom
• 195 hits to the contigs

Arabidopsis thaliana TAIR6 annotation
• 30,690 protein sequences
• 228 hits to the contigs

Repeat density

TIGR Tomato Repeat Library (95 repeats)
• 118 regions spanning 53,024 nt
• Minimum 48 nt, average 449 nt, maximum 7,675 nt

SGN Tomato UniRepeats (668 repeats)
• 2,860 regions spanning 1,220,101 nt
• Minimum 10 nt, average 427 nt, maximum 8,896 nt

Tandem repeats
• 1,313 regions spanning 157,921 nt
• Minimum 24 nt, average 120 nt, maximum 2,526 nt
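
To put the repeat figures in proportion, the spanned lengths can be related to the 5,726,791 nt of annotated sequence. A small Python sketch of that arithmetic follows; the function names are hypothetical, and it assumes the "spanning" totals count each nucleotide once within a set, which the slides do not state.

```python
TOTAL_LENGTH_NT = 5_726_791  # total length of the annotated contigs, from the talk

# (number of regions, nucleotides spanned) per repeat set, from the slide above.
REPEAT_SETS = {
    "TIGR Tomato Repeat Library": (118, 53_024),
    "SGN Tomato UniRepeats": (2_860, 1_220_101),
    "Tandem repeats": (1_313, 157_921),
}


def average_region_length(regions: int, spanned_nt: int) -> float:
    """Mean length of a repeat region in nucleotides."""
    return spanned_nt / regions


def percent_of_sequence(spanned_nt: int, total_length_nt: int = TOTAL_LENGTH_NT) -> float:
    """Share of the annotated sequence covered by one repeat set, in percent."""
    return 100.0 * spanned_nt / total_length_nt


for name, (regions, spanned) in REPEAT_SETS.items():
    avg = average_region_length(regions, spanned)
    pct = percent_of_sequence(spanned)
    print(f"{name:<28}avg {avg:6.0f} nt   {pct:5.1f}% of sequence")
```

The recomputed averages round to 449, 427 and 120 nt, matching the slide, and the SGN UniRepeats regions amount to roughly 21% of the annotated sequence. Regions from different sets may overlap, so the three percentages should not simply be added up.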

Additional features

• 74 markers could be aligned (alignment quality unverified)
• 39 predicted tRNA genes
• 1,301 predicted MAR/SAR elements

Generic Genome Browser (1)

Generic Genome Browser (2)

Generic Genome Browser (3)

Recent work

GeneModelCollector
• Tries to find ‘full’ open reading frames in aligned UniGenes
• Automatic generation of gene predictor training set
• Parameters?

JIGSAW
• Appears not to provide a prediction for every region which contains annotations
• Training?
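
The idea of looking for a ‘full’ open reading frame in a UniGene can be illustrated with a small Python sketch. This is not the GeneModelCollector code, whose internals are not described in the talk; the function name, the `min_codons` parameter and the forward-strand-only scan are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_full_orf(seq: str, min_codons: int = 30):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    Coordinates are 0-based, end is exclusive and includes the stop codon.
    Returns None if no complete ORF of at least min_codons codons exists.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                long_enough = (end - start) // 3 >= min_codons
                if long_enough and (best is None or end - start > best[1] - best[0]):
                    best = (start, end)
                start = None  # ORF closed; look for the next start codon
    return best


# Made-up toy transcript: 2 nt leader, ATG, four codons, stop, 2 nt trailer.
toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GG"
print(longest_full_orf(toy, min_codons=5))
```

An ORF that runs off the end of the sequence without a stop codon is not reported, which is the sense of ‘full’ assumed here. A real implementation would presumably also scan the reverse complement and map the ORF back onto the genomic alignment to derive exon coordinates for the training set.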

Future Work – Tomato Annotation Pipeline

Gene prediction
• Combining predictions into a single consensus model
• Train individual predictors with recently curated tomato gene set

Automated functional annotation of genes
• “Giving a biological meaning to the nicely colored bars”
• blastx
• InterProScan
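
One simple way to combine several gene predictions is a per-position vote over the predicted exons. The Python sketch below illustrates that idea only; it is not JIGSAW's method and not necessarily the approach planned for the pipeline, and the function name, the interval convention (1-based, inclusive) and the `min_votes` parameter are assumptions.

```python
from collections import Counter


def consensus_exons(predictions, min_votes):
    """Merge exon predictions by per-position voting.

    predictions: one list of (start, end) exon intervals per predictor,
                 1-based and inclusive, all on the same strand.
    Returns the intervals covered by at least min_votes predictors.
    """
    votes = Counter()
    for exons in predictions:
        covered = set()
        for start, end in exons:
            covered.update(range(start, end + 1))
        votes.update(covered)  # each predictor votes at most once per position

    supported = sorted(pos for pos, n in votes.items() if n >= min_votes)

    # Collapse runs of consecutive supported positions into intervals.
    intervals = []
    for pos in supported:
        if intervals and pos == intervals[-1][1] + 1:
            intervals[-1][1] = pos
        else:
            intervals.append([pos, pos])
    return [tuple(interval) for interval in intervals]


# Made-up exon predictions from three hypothetical predictors.
toy_predictions = [
    [(10, 20), (40, 50)],
    [(12, 22), (40, 50)],
    [(15, 25)],
]
print(consensus_exons(toy_predictions, min_votes=2))
```

Position-wise voting ignores reading frame, splice sites and the differing reliability of the predictors, so the resulting intervals need not form a valid gene model. Weighting the predictors is one reason a trained combiner is attractive.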

Future Work – Tomato Genome Browser

Annotation of features
• Meaningful names for features such as genes, marker alignments, blast hits
• More detailed and more readable data when clicking on a feature

Links to external data sources
• NCBI GenBank
• SGN

Acknowledgements

Cyrille2 development
• Mark Fiers
• Ate van der Burgt
• Joost de Groot

Tomato BAC sequencing (chromosome 6)
• Greenomics

Supervision
• Willem Stiekema
• Roeland van Ham