
MAKER Annotation Process: Example of Glossina

VectorBase http://www.vectorbase.org

Karyn Mégy, Dan Hughes

Hinxton Developer Meeting February 2012

Annotation: aims and means

• Aims
– Preliminary
– Locus rather than exact position

• Means
– Automatic annotation
  • By similarity
  • Ab initio
– Manual annotation
  • By regions
  • By gene families


Annotation: similarity vs. ab initio

• Similarity
– Similarity to known sequences
  -> only known genes
  -> based on available data (quantity, quality)

• Ab initio
– Follow a gene “recipe”
  -> potentially identifies new genes
  -> over-prediction


Ensembl annotation

[Diagram: raw genome -> masking (RepeatModeler repeats + known repeats/transposons) -> masked genome]

Numbered elements of the diagram:

1. Community annotation
2. Protein, species specific
3. Transcriptome, species specific
4. Protein, ‘close’ specific
5. Ab initio
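The masking step in the diagram can be illustrated with a minimal Python sketch. It assumes hard masking (repeat bases replaced by N) and 0-based, half-open repeat coordinates; the function name `hard_mask`, the toy sequence and the interval are all hypothetical, and the slides do not say whether hard or soft masking was used.

```python
def hard_mask(sequence, repeats):
    """Replace every base inside a repeat interval with 'N'.

    `repeats` is a list of (start, end) pairs, 0-based and half-open.
    Both the coordinate convention and hard masking are assumptions.
    """
    masked = list(sequence)
    for start, end in repeats:
        # Clamp the interval so out-of-range coordinates are ignored.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = "N"
    return "".join(masked)


# Toy "raw genome": a 7-base poly-T stretch stands in for a repeat.
raw_genome = "ACGTACGT" + "T" * 7 + "ACGT"
masked_genome = hard_mask(raw_genome, [(8, 15)])
print(masked_genome)
```

In practice the repeat intervals would come from the repeat-finding tools named on the slide rather than being typed in by hand.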


Ensembl annotation

• Similarity-focused
• Data-rich organisms
• Fiddly, time-consuming
• Rhodnius prolixus experience
• In the meantime: Heliconius annotation using MAKER


MAKER

http://www.yandell-lab.org/software/maker.html
Cantarel et al. Genome Res. 2008. PMID 18025269

[Diagram: raw genome + data -> annotated genome]


Intermediate gene sets

[Diagram: raw genome -> masking (RepeatModeler repeats + known repeats/transposons) -> masked genome]

Raw data

- ESTs
  - from GenBank
  - cleaned and clustered/assembled with CAP3
  - 71,700 contigs

- Insecta/Metazoa proteins
  - from UniProt
  - aligned to the genome with BLAST
  - 690,000 sequences (Insecta)
  - 2,200,000 sequences (Metazoa)


Intermediate gene sets

Gene models

- RNA-seq, Illumina (Yale)
  - cleaned
  - aligned to the genome using TopHat/Bowtie
  - ‘transfrags’ built with Cufflinks
  - 78,000 ‘transfrags’ (on 4 sets -> overlaps)

- Augustus
  - generated by Martin Swain
  - trained with SOLiD data
  - 16,963 models, high quality


Intermediate gene sets

Raw data

- ESTs, aligned to the genome
  - from GenBank, clustered with CAP3
  - 71,700 clusters

- Insecta/Metazoa proteins (UniProt)
  - 690,000 sequences (Insecta)
  - 2,200,000 sequences (Metazoa)

Gene models

- RNA-seq, Illumina (Yale), using TopHat/Cufflinks
  - 78,000 ‘transfrags’ (on 4 sets -> overlaps)

- Augustus, trained with SOLiD data
  - 16,963 models, high quality

Ab initio

- SNAP, trained for Glossina (MAKER)
- Augustus, trained for Glossina (Martin Swain)
- GenScan


MAKER

[Diagram: raw genome -> masking (RepeatModeler repeats + known repeats/transposons) -> masked genome. The inputs are labelled Raw data (ESTs, Proteins), Ab initio and Gene models, and are marked as either “Provided as input” or “Run software within MAKER”.]


MAKER – iterative process

• Round 1:
– Align ESTs and Insecta proteins to the genome
– Train SNAP (1): Drosophila HMM, ESTs and protein alignments, RNA-seq Illumina Yale, Augustus (SOLiD)

• Round 2:
– Re-train SNAP (2): same as above, but HMM = output of SNAP-1

• Round 3:
– Re-train SNAP (3): same as above, but HMM = output of SNAP-2
– Align Metazoa proteins to the genome
– Combine final gene set
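The three rounds can be written down as a small plan in which each round's SNAP HMM is the output of the previous one. This is only a sketch of the bookkeeping, not of MAKER itself: the function name `plan_snap_rounds`, the dictionary keys and the string labels are hypothetical, and it assumes that Insecta proteins stay in use in every round ("same as above").

```python
def plan_snap_rounds(n_rounds=3, seed_hmm="Drosophila HMM"):
    """Return one dict per round naming the SNAP HMM and protein sets used."""
    evidence = [
        "EST alignments",
        "protein alignments",
        "RNA-seq Illumina Yale",
        "Augustus (SOLiD)",
    ]
    plan = []
    hmm = seed_hmm  # round 1 starts from the Drosophila HMM
    for number in range(1, n_rounds + 1):
        is_last = number == n_rounds
        plan.append({
            "round": number,
            "snap_hmm": hmm,
            "training_evidence": list(evidence),
            # Metazoa proteins are only added in the final round.
            "proteins": ["Insecta", "Metazoa"] if is_last else ["Insecta"],
            "combine_final_gene_set": is_last,
        })
        # The next round re-trains SNAP from this round's output.
        hmm = f"output of SNAP-{number}"
    return plan


plan = plan_snap_rounds()
for step in plan:
    print(step["round"], step["snap_hmm"], step["proteins"])
```

Keeping the plan as plain data makes it easy to check that each round is seeded by the previous one before any long-running job is launched.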


Using MAKER for…

Heliconius

Tsetse fly

Salmon louse

Centipede


Annex…


Augustus (SOLiD)

Martin Swain’s stats, July 22nd, 2011

• Glossina trained:
-> ESTs only: 14,739 predictions, 9.8% with similarity to Glossina proteins (1,455 sequences, 95% sequence identity)
-> ESTs + SOLiD: 14,739 predictions, 9.9% with similarity to Glossina proteins (1,465 sequences, 95% sequence identity)
-> Glossina GenBank proteins: 2,754 protein sequences, 53% matching Augustus models

• Glossina un-trained:
-> 8,581 predictions, 15% with similarity to Glossina proteins (1,299 sequences, exact matches)
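As a quick arithmetic check, the percentages can be recomputed from the counts on the slide. The sketch assumes that each count in parentheses is the number of predictions with a protein match and that the percentage is that count divided by the total number of predictions; the labels in the dictionary are ours.

```python
# (predictions with similarity, total predictions), as quoted on the slide
stats = {
    "trained, ESTs only": (1455, 14739),
    "trained, ESTs + SOLiD": (1465, 14739),
    "un-trained": (1299, 8581),
}

percentages = {
    label: 100 * hits / total for label, (hits, total) in stats.items()
}

for label, pct in percentages.items():
    print(f"{label}: {pct:.1f}%")
```

Under that assumption the recomputed values agree with the quoted 9.8%, 9.9% and 15% to within rounding.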


ESTs

• Total: 79,292 ESTs


• [1] Adult midgut expressed sequence tags from the tsetse fly Glossina morsitans morsitans and expression analysis of putative immune response genes. Genome Biol. 2003. Lehane et al.

• [2] Differential expression of fat body genes in Glossina morsitans morsitans following infection with Trypanosoma brucei brucei. Int. J. Parasitol. 2008. Lehane et al.

• [3] Analysis of fat body transcriptome from the adult tsetse fly, Glossina morsitans morsitans. Insect Mol. Biol. 2006. Attardo et al.

• [4] Functional characterisations of odorant binding proteins and chemosensory proteins in tsetse fly Glossina morsitans morsitans. Unpublished. 2009. …., Lehane, M., Hertz-Fowler, C., Berriman, M., …

• [5] Comprehensive analysis of the transcriptome of the tsetse fly Glossina morsitans morsitans. Unpublished. 2009. Hertz-Fowler, C., Aslett, M.A. and Berriman, M. ESTs submitted under GenomeProject:9563


MAKER – final gene set

• Genes:
– Final genes: 12,220
– Raw data:
  • EST-based genes: 23,469
  • Protein-based genes: 416,959 (about 417,000; redundancy)
– Gene sets:
  • Illumina-Yale: 70,915 (redundancy)
  • Augustus (SOLiD): 16,155
– Ab initio:
  • SNAP: 48,464
  • Augustus (MAKER): 14,413
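To get a feel for how much redundancy MAKER collapses, each intermediate set can be compared with the final gene count. This is a rough sketch using the counts quoted on the slide; the protein-based figure is left out because it is hard to read in the transcript, and the ratio is only a crude indicator, not a measure the authors report.

```python
FINAL_GENES = 12_220

# Intermediate set sizes as quoted on the slide
intermediate_sets = {
    "EST-based genes": 23_469,
    "Illumina-Yale": 70_915,
    "Augustus (SOLiD)": 16_155,
    "SNAP": 48_464,
    "Augustus (MAKER)": 14_413,
}

# Size of each intermediate set relative to the final gene set
ratios = {name: count / FINAL_GENES for name, count in intermediate_sets.items()}

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio:.2f}x the final gene count")
```

Every intermediate set is larger than the final set, which fits the "(redundancy)" notes on the slide.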