gmod 2014 maker lecture

Post on 23-Aug-2014

DESCRIPTION

Lecture for the MAKER2 Tutorial for the GMOD 2014 Summer Training

TRANSCRIPT

MAKER: The Genome Annotation Pipeline

GMOD Summer Course, May 19, 2014

Barry Moore / Carson Holt, Yandell Lab

University of Utah

MAKER

• The Annotation Problem
• How MAKER Works
• Why Choose MAKER
• Working with MAKER

What are Annotations?

Functional

Structural

Function: cAMP-dependent and sulfonylurea-sensitive anion transporter. Key gatekeeper influencing intracellular cholesterol transport.

Subcellular location: Membrane; multi-pass membrane protein (Ref. 13, Ref. 14).

Domain: Multifunctional polypeptide with two homologous halves, each containing a hydrophobic membrane-anchoring domain and an ATP-binding cassette (ABC) domain.

Genomes Online Database

http://www.genomesonline.org/

[Figure: Genome Project Status. Number of genomes (incomplete vs. complete) by year, 1998 to 2012; y-axis from 0 to 9,000.]

http://www.genome.gov/


Next Gen Genome Annotation 2013-14

• Coelacanth
• Pine
• Sacred Lotus
• Conus bullatus
• Pigeon
• King Cobra
• Hymenopterids
• Fusarium circinatum
• Cardiocondyla obscurior
• Burmese Python
• Sarcocystis neurona
• Spotted Gar
• Apple maggot fly

The ‘NextGen’ Genome Project

• Lab/Small Group Funding
• Short-read Genome Sequencing
• RNASeq Data
• Genome/Transcriptome Assembly
• Gene Annotation
• Genome Database / Blast Server
• Manual curation
• New assembly
• Reannotate/Merge annotations


The Source of Annotations

RNA and Protein Evidence

Ab Initio Computational Evidence

Accurate Gene Annotations

Annotating the Genome – Apollo View

current evidence

gene annotations

genome assembly

http://apollo.berkeleybop.org/

Identify and mask repetitive elements

current evidence

genome assembly

http://www.repeatmasker.org
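To make the masking step concrete, here is a minimal Python sketch of what masking does to a sequence once repeat coordinates are known. It is an illustration only: the function name `mask_repeats`, the 0-based end-exclusive interval convention, and the toy sequence are assumptions made for this example and are not taken from RepeatMasker or MAKER. Soft masking lowercases repeat bases; hard masking replaces them with N.

```python
def mask_repeats(seq, repeats, soft=True):
    """Return seq with the given intervals masked.

    repeats: iterable of (start, end) pairs, 0-based and end-exclusive
             (an assumed convention for this sketch).
    soft=True lowercases repeat bases; soft=False replaces them with 'N'.
    """
    bases = list(seq.upper())
    for start, end in repeats:
        # Clamp to the sequence so out-of-range intervals are harmless.
        for i in range(max(0, start), min(len(bases), end)):
            bases[i] = bases[i].lower() if soft else "N"
    return "".join(bases)


# Toy contig: eight bases, a seven-base poly-T run, four more bases.
genome = "ACGTACGT" + "T" * 7 + "ACGT"
repeat_hits = [(8, 15)]  # hypothetical repeat call covering the poly-T run

soft_masked = mask_repeats(genome, repeat_hits)
hard_masked = mask_repeats(genome, repeat_hits, soft=False)
print(soft_masked)
print(hard_masked)
```

Soft masking keeps the underlying bases available to later steps, while hard masking hides them entirely; which one a given downstream tool expects is a configuration question outside this sketch.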

Generate ab initio gene predictions

ab initio predictions: SNAP, GeneMark, Augustus

current evidence

genome assembly

http://korflab.ucdavis.edu/

Align RNA and protein evidence

ab initio predictions

protein - BLASTX

EST - BLASTN

altEST - TBLASTX

current evidence

genome assembly

http://blast.ncbi.nlm.nih.gov

Polish BLAST alignments with Exonerate

ab initio predictions

polished protein

polished EST

current evidence

genome assembly

http://www.ebi.ac.uk/~guy/exonerate/

Pass gene-finders evidence-based ‘hints’

ab initio predictions

Hint-based SNAP

Hint-based Augustus

genome assembly

current evidence

Identify gene model most consistent with evidence

ab initio predictions*

Hint-based SNAP

Hint-based Augustus

genome assembly

current evidence

Revise further if necessary; create new annotation

ab initio predictions

genome assembly

Compute support for each portion of gene model

Eilbeck et al BMC Bioinformatics 2009

genome assembly

Compute support for each portion of gene model

Cantarel BL et al., Genome Res 2008

genome assembly

GFF3

FASTA
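Since GFF3 is one of the two output formats named above, a short Python sketch of reading it may help. GFF3 feature lines have nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes), with attributes written as semicolon-separated `key=value` pairs. The example line below is made up for illustration, including the `_AED` attribute, and `parse_gff3_line` is a hypothetical helper, not part of MAKER.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict (None for comments/blank lines)."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])  # GFF3 coordinates are 1-based, inclusive
    record["end"] = int(record["end"])
    attrs = {}
    for pair in record["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[unquote(key)] = unquote(value)  # undo URL-style escaping
    record["attributes"] = attrs
    return record


# Made-up feature line for illustration only.
example = "\t".join([
    "contig1", "maker", "mRNA", "1200", "4800", ".", "+", ".",
    "ID=gene1-mRNA-1;Parent=gene1;_AED=0.12",
])
feature = parse_gff3_line(example)
print(feature["type"], feature["start"], feature["end"],
      feature["attributes"]["_AED"])
```

For real projects an established GFF3 library is usually a better choice than a hand-written parser; the sketch only shows the shape of the data.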

MAKER2 Workflow

MAKER2 Distributed Workflow

Parallelization Efficiency

Holt C, Yandell M. BMC Bioinformatics. 2011 12:491.

30 GB Pine genome annotated in 37 hrs on 6,000 CPUs at the TACC


MAKER: The Genome Annotation Pipeline

Maintenance and Management

MAKER2 Use Cases

1. De novo annotation providing quality metrics
2. Merging multiple annotation sets
3. Re-annotation with new evidence
4. Mapping annotations forward to a new assembly
5. Generating GMOD Compliant Output
   1. GBrowse/JBrowse
   2. Apollo
   3. Tripal

Sensitivity, Specificity, Accuracy as a Measure of Annotation Quality

Case                               SN    SP    AC
Perfect Accuracy                   1.0   1.0   100%
Poor Specificity                   1.0   0.5   80%
Poor Sensitivity                   0.5   1.0   80%
Poor Specificity and Sensitivity   0.5   0.5   50%

Each case is scored against the Gold Standard Genes.

Guigó R et al. Genome Biol. 2006
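The following Python sketch shows one way these measures can be computed at the nucleotide level. It assumes the gene-prediction convention SN = TP / (TP + FN) and SP = TP / (TP + FP), and takes AC as the simple average (SN + SP) / 2; that averaging is an assumption of this sketch, so its output need not reproduce the illustrative percentages on the slide. The function names and coordinates are hypothetical.

```python
def positions(intervals):
    """Expand (start, end) pairs (1-based, inclusive) into a set of positions."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return covered


def sn_sp_ac(predicted, reference):
    """Nucleotide-level sensitivity, specificity and accuracy.

    SN = TP / (TP + FN), SP = TP / (TP + FP), AC = (SN + SP) / 2
    (AC as a plain average is an assumption of this sketch).
    """
    pred, ref = positions(predicted), positions(reference)
    tp = len(pred & ref)
    sn = tp / len(ref) if ref else 0.0
    sp = tp / len(pred) if pred else 0.0
    return sn, sp, (sn + sp) / 2


# Toy gold-standard gene with two 100 bp exons.
reference = [(101, 200), (301, 400)]
perfect = sn_sp_ac(reference, reference)
extra_exon = sn_sp_ac([(101, 200), (301, 400), (501, 700)], reference)
missing_exon = sn_sp_ac([(101, 200)], reference)

print("perfect:     ", perfect)
print("extra exon:  ", extra_exon)    # over-prediction lowers SP
print("missing exon:", missing_exon)  # under-prediction lowers SN
```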

MAKER vs. Predictors

Holt C, Yandell M. BMC Bioinformatics. 2011

MAKER vs. Predictors (the wrong HMM...)

Holt C, Yandell M. BMC Bioinformatics. 2011 12:491.

Annotation Edit Distance

Evidence (in place of Gold Standard Genes)

• Protein Alignments
• EST Alignments
• mRNASeq

Eilbeck et al BMC Bioinformatics 2009

Annotation Edit Distance

Case                               SN    SP    AED
Perfect Accuracy                   1.0   1.0   0.0
Poor Specificity                   1.0   0.5   0.2
Poor Sensitivity                   0.5   1.0   0.2
Poor Specificity and Sensitivity   0.5   0.5   0.5

Each case is scored against the evidence rather than a gold standard.

Eilbeck et al BMC Bioinformatics 2009
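As a rough illustration of the idea, the Python sketch below computes AED for one gene model against pooled evidence. It assumes the definition AED = 1 - (SN + SP) / 2, with SN and SP measured at the nucleotide level against the union of all evidence alignments; under that definition the results need not match the rounded values on the slide. The function names, the pooling of evidence into a single set, and the coordinates are assumptions for this example, not MAKER's implementation.

```python
def covered(intervals):
    """Expand (start, end) pairs (1-based, inclusive) into a set of positions."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def annotation_edit_distance(annotation, evidence_sets):
    """AED = 1 - (SN + SP) / 2, with pooled evidence standing in for a gold standard.

    0.0 means the model matches the evidence exactly; 1.0 means no support.
    """
    model = covered(annotation)
    evidence = set()
    for intervals in evidence_sets:
        evidence |= covered(intervals)
    if not model or not evidence:
        return 1.0  # nothing to compare: treat as maximally distant
    overlap = len(model & evidence)
    sn = overlap / len(evidence)  # fraction of evidence covered by the model
    sp = overlap / len(model)     # fraction of the model backed by evidence
    return 1.0 - (sn + sp) / 2


# Hypothetical evidence: one protein alignment and one EST alignment.
protein_hits = [(101, 200)]
est_hits = [(301, 400)]

supported = annotation_edit_distance([(101, 200), (301, 400)],
                                     [protein_hits, est_hits])
unsupported_exon = annotation_edit_distance(
    [(101, 200), (301, 400), (501, 700)], [protein_hits, est_hits])
print(supported, unsupported_exon)
```

Lower values mean closer agreement between the annotation and its evidence, which is what allows AED to be tracked across a whole genome without a curated gold standard.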

AED as a Measure of Genome Wide Annotation Quality

Eilbeck et al BMC Bioinformatics 2009

TAIR Star Rating System

http://www.arabidopsis.org/

AED Agrees well with the TAIR star system

Evidence: mRNA-seq (17 experiments), ESTs, full length cDNAs, Swiss-Prot (minus Arabidopsis)

[Figure: cumulative fraction of annotations (y-axis, 0 to 1) versus AED (x-axis, 0 to 1), one curve per TAIR star rating: ***** (7,880), **** (12,654), *** (2,087), ** (2,188), * (1,788), no stars (604).]

Holt C, Yandell M. BMC Bioinformatics. 2011
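The curves in this kind of plot are cumulative distributions: for each AED value on the x-axis, the y-axis gives the fraction of annotations with an AED at or below it. A minimal Python sketch of that calculation follows; the scores are made up and `cumulative_fraction` is a hypothetical helper name.

```python
from bisect import bisect_right


def cumulative_fraction(aed_scores, thresholds):
    """Return (threshold, fraction of scores <= threshold) pairs."""
    if not aed_scores:
        raise ValueError("need at least one AED score")
    ordered = sorted(aed_scores)
    total = len(ordered)
    # bisect_right counts how many sorted scores are <= the threshold.
    return [(t, bisect_right(ordered, t) / total) for t in thresholds]


# Made-up AED scores for eight annotations.
scores = [0.0, 0.05, 0.1, 0.2, 0.3, 0.45, 0.6, 1.0]
curve = cumulative_fraction(scores, [0.0, 0.25, 0.5, 0.75, 1.0])
for threshold, fraction in curve:
    print(f"AED <= {threshold:.2f}: {fraction:.3f}")
```

A curve that rises quickly near AED 0 indicates that most annotations in the set agree closely with their evidence.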

AED as a Measure of Annotation Quality

MAKER Annotations Match the Evidence Well

[Figure, two panels: cumulative fraction of annotations (y-axis, 0 to 1) versus AED (x-axis, 0 to 1).]

A. thaliana: TAIR10 rep transcripts (27,206); MAKER de novo (25,956); MAKER update of TAIR10 (26,885)

Z. mays: chr10 rep transcripts (2,688); MAKER de novo (3,056); MAKER update of v3 (2,661)

Campbell et al, 2013 submitted

Protein Domain Content as a Measure of Annotation Quality

Holt C, Yandell M. BMC Bioinformatics. 2011

MAKER vs. Predictors

Holt C, Yandell M. BMC Bioinformatics. 2011


http://derringer.genetics.utah.edu/cgi-bin/mwas/maker.cgi

MAKER Installation

• Automated query/answer based installation script.
• Installs Perl prerequisites.
• Installs necessary executables
  – RepeatMasker (RepBase)
  – BLAST+
  – Exonerate
  – SNAP
• Even installs MWAS and MPICH2

MAKER Runtime Features

• Fill out a config file with input data and parameters
• Parallelize:
  – Running with MPI
  – Simply start multiple instances in the same directory.
• Re-run MAKER in the same directory and it won't redo completed work.
• Restart aborted jobs without losing any work.
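The runtime behaviour listed above (several instances sharing one directory, and re-runs skipping finished work) can be pictured with a small Python sketch. This is not MAKER's actual mechanism: the `.lock` and `.done` marker files, the function names, and the contig names are all hypothetical, chosen only to illustrate how independent processes could coordinate through a shared directory.

```python
import os
import tempfile


def try_claim(workdir, task):
    """Atomically claim a task by creating its lock file; False if already claimed."""
    lock_path = os.path.join(workdir, task + ".lock")
    try:
        # O_CREAT | O_EXCL fails if the file exists, so only one process wins.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def run_instance(workdir, tasks, work):
    """Process every task that is neither finished nor claimed by another instance."""
    processed = []
    for task in tasks:
        done_path = os.path.join(workdir, task + ".done")
        if os.path.exists(done_path) or not try_claim(workdir, task):
            continue  # finished earlier, or another instance is working on it
        work(task)
        open(done_path, "w").close()                       # record completion
        os.remove(os.path.join(workdir, task + ".lock"))   # release the claim
        processed.append(task)
    return processed


workdir = tempfile.mkdtemp()
contigs = ["contig1", "contig2", "contig3"]
first_run = run_instance(workdir, contigs, work=lambda name: None)
second_run = run_instance(workdir, contigs, work=lambda name: None)
print(first_run)   # all three contigs are processed
print(second_run)  # nothing left to do on the re-run
```

A production tool would also need to detect stale locks left behind by aborted jobs so that their tasks can be picked up again; the sketch leaves that out.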

Accessory Scripts

Over 30 accessory scripts:

• cegma2zff
• chado2gff3
• cufflinks2gff3
• gff3_2_gtf
• gff3_preds2models
• gff3_to_eval_gtf
• maker2chado
• maker2jbrowse
• maker2zff
• tophat2gff3
• compare
• evaluator
• gff3_merge
• fasta_merge
• fasta_tool
• fix_fasta
• genemark_gtf2gff3
• ipr_update_gff
• iprscan2gff3
• iprscan_batch
• iprscan_wrap
• maker_functional
• maker_functional_fasta
• maker_functional_gff
• maker_map_ids
• map2assembly
• map_data_ids
• map_fasta_ids
• map_gff_ids
• split_fasta


Acknowledgements

• Mark Yandell
  – Carson Holt
  – Mike Campbell
  – Daniel Ence
  – Steven Flygare
  – Zev Kronenberg
  – Qing Li
  – Marc Singleton
  – Brett Kennedy
  – Brandi Cantarel
  – Hadi Islam
• Karen Eilbeck
  – Shawn Rynearson
  – Nicole Ruiz
  – Keith Simmons
  – Bret Heale
• Alejandro Alvarado
  – Eric Ross
• Jason Stajich
• Sophia Robb
• Kevin Childs
• Shin-Han Shiu
• Ning Jiang
• Yanni Sun

NSF IOS-1126998
