Assembly & Annotation at iPlant iPlant: Josh Stein (CSHL) Matt Vaughn (TACC) Dian Jiao (TACC) Zhenyuan Lu (CSHL) Nirav Merchant (U. Arizona) Michael Schatz (CSHL) Roger Barthelson (CSHL) Cantarel et al. 2008. Genome Research 18:188 Holt & Yandell. 2011. BMC Bioinformatics 12:491

Upload: gary-floyd

Post on 26-Dec-2015

TRANSCRIPT

Page 1:

Assembly & Annotation at iPlant

iPlant: Josh Stein (CSHL), Matt Vaughn (TACC), Dian Jiao (TACC), Zhenyuan Lu (CSHL), Nirav Merchant (U. Arizona), Michael Schatz (CSHL), Roger Barthelson (CSHL)

Cantarel et al. 2008. Genome Research 18:188
Holt & Yandell. 2011. BMC Bioinformatics 12:491

Page 2:

Maize Genome Project

• Genome
– 2500 Mb
– 10 chromosomes
– 50,000 genes

• Strategy
– BAC-by-BAC
– 17,000 clones
– Finish genic regions

• 9 PIs

[Figure: project workflow]
Mapping (U. Arizona): FPC map, min. tiling path, BAC selection
Sequencing (Washington U.): 6X shotgun, auto finish, manual finishing
Annotation (CSHL): repeat analysis, gene prediction, database, browser
GenBank

3-yr NSF funded project -- $30 M

Maizesequence.org

Page 3:

Technology Lowering Barriers

Page 4:

Assembly & Annotation at iPlant

Page 5:

Complexity of Genomes

Science. 2009 Nov 20;326(5956):1112-5. doi: 10.1126/science.1178534.

Page 6:

Assembling a Genome

1. Shear & Sequence DNA

2. Construct assembly graph from overlapping reads

…AGCCTAGGGATGCGCGACACGT
GGATGCGCGACACGTCGCATATCCGGTTTGGTCAACCTCGGACGGAC
CAACCTCGGACGGACCTCAGCGAA…

3. Simplify assembly graph

4. Detangle graph with long reads, mates, and other links
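Step 2 can be illustrated with a minimal Python sketch. It takes the three read fragments printed on the slide, finds the longest suffix-to-prefix overlap between each ordered pair, and merges the reads along the resulting path. The function names and the `min_len` cutoff are illustrative assumptions, not part of any assembler named in this talk; production assemblers index k-mer seeds rather than comparing every pair by brute force.

```python
from itertools import permutations

# The three read fragments shown on the slide (ellipses dropped).
READS = [
    "AGCCTAGGGATGCGCGACACGT",
    "GGATGCGCGACACGTCGCATATCCGGTTTGGTCAACCTCGGACGGAC",
    "CAACCTCGGACGGACCTCAGCGAA",
]


def longest_overlap(a, b, min_len=10):
    """Return the length of the longest suffix of `a` that is a prefix of `b`.

    Overlaps shorter than `min_len` (an arbitrary illustrative cutoff) count as 0.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def overlap_graph(reads, min_len=10):
    """Map each ordered read pair (i, j) that overlaps to its overlap length."""
    edges = {}
    for i, j in permutations(range(len(reads)), 2):
        k = longest_overlap(reads[i], reads[j], min_len)
        if k:
            edges[(i, j)] = k
    return edges


def merge_path(reads, path, edges):
    """Merge reads along a path of overlapping reads into one contig."""
    contig = reads[path[0]]
    for prev, nxt in zip(path, path[1:]):
        contig += reads[nxt][edges[(prev, nxt)]:]
    return contig


edges = overlap_graph(READS)
contig = merge_path(READS, [0, 1, 2], edges)
print(edges)
print(contig)
```

With these three reads the graph is a simple chain, so no simplification or detangling is needed; repeats longer than the reads are what turn such graphs into the tangles that steps 3 and 4 address.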

Page 7:

Ingredients for a good assembly

Current challenges in de novo plant genome sequencing and assembly
Schatz MC, Witkowski J, McCombie WR (2012) Genome Biology. 13:243

Coverage

High coverage is required
– Oversample the genome to ensure every base is sequenced with long overlaps between reads
– Biased coverage will also fragment assembly

[Figure: expected contig length vs. read coverage]

Read Length

Reads & mates must be longer than the repeats
– Short reads will have false overlaps forming hairball assembly graphs
– With long enough reads, assemble entire chromosomes into contigs

Quality

Errors obscure overlaps
– Reads are assembled by finding k-mers shared in pairs of reads
– High error rate requires very short seeds, increasing complexity and forming assembly hairballs
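The coverage point can be made concrete with the standard Poisson (Lander-Waterman style) approximation, which is not on the slide and is supplied here as an assumption: with N reads of length L drawn uniformly from a genome of size G, mean coverage is c = N * L / G and roughly e^(-c) of the bases receive no read at all. The Python sketch below uses hypothetical read numbers; the 2500 Mb genome size is the maize figure quoted earlier in the talk.

```python
import math


def mean_coverage(num_reads, read_len, genome_size):
    """Average depth c = N * L / G."""
    return num_reads * read_len / genome_size


def fraction_uncovered(coverage):
    """Poisson estimate of the fraction of bases hit by zero reads: e^(-c)."""
    return math.exp(-coverage)


def reads_needed(target_coverage, read_len, genome_size):
    """Number of reads required to reach a target average depth."""
    return math.ceil(target_coverage * genome_size / read_len)


# Hypothetical scenario: a 2500 Mb genome sequenced with 100 bp reads.
genome = 2_500_000_000
for c in (1, 6, 30):
    n = reads_needed(c, 100, genome)
    print(f"{c:>2}x: {n:,} reads, ~{fraction_uncovered(c):.2e} of bases uncovered")
```

The model assumes uniform sampling, so it is optimistic: as the slide notes, biased coverage fragments an assembly even when the average depth looks sufficient.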

Page 8:

N50 size

Def: 50% of the genome is in contigs at least as large as the N50 value

Example: 1 Mbp genome
Contigs (kbp), longest first: 300, 100, 45, 45, 30, 20, 15, 15, 10, ...

N50 size = 30 kbp (300k + 100k + 45k + 45k + 30k = 520k >= 500 kbp)

Note: N50 values are only meaningful to compare when base genome size is the same in all cases
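The N50 definition translates directly into a few lines of Python. This is a sketch under the slide's definition, where the 50% mark is taken against the genome size; the function name and the `genome_size` parameter are illustrative choices. When no genome size is given, the sketch falls back to half of the total assembly length.

```python
def n50(contig_lengths, genome_size=None):
    """Return the length of the contig at which the running total of contigs,
    sorted longest first, first reaches half of the reference size.

    If `genome_size` is None, half of the total assembly length is used.
    """
    total = genome_size if genome_size is not None else sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # the contigs never reach half of the stated genome size


# Contig lengths in kbp from the slide's example. The trailing "..." contigs
# are omitted, which is safe here because the 50% mark is reached before them.
contigs_kbp = [300, 100, 45, 45, 30, 20, 15, 15, 10]
print(n50(contigs_kbp, genome_size=1000))  # 30
```

Measuring against the genome size rather than the assembly size is often called NG50; it is what makes values comparable across assemblies of the same genome, which is the point of the note on the slide.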

Page 9:

• Attempt to answer the question: “What makes a good assembly?”

• Organizers provided sequence data to assembly experts around the world
– Assemblathon 1: ~100 Mbp simulated genome
– Assemblathon 2: 3 vertebrate genomes, each ~1 Gbp

• Results demonstrate trade-offs assemblers must make

Assemblathon 1: A competitive assessment of de novo short read assembly methods.
Earl, DA, et al. (2011) Genome Research. doi: 10.1101/gr.126599.111

Assemblathon 2: Evaluating de novo methods of genome assembly in three vertebrate species
Bradnam, KR. et al (2013) GigaScience 2:10 doi:10.1186/2047-217X-2-10

Page 10:

Final Rankings

• ALLPATHS and SOAPdenovo came out neck-and-neck, followed closely by Celera Assembler, SGA, and ABySS

• My recommendation for “typical” short read assembly is to use ALLPATHS

• Single molecule sequencing becoming extremely attractive if you have access

Page 11:

Apps in Discovery Environment

• Genome Assembly
– Allpaths-LG
– Soapdenovo2
– ABySS
– Velvet
– Newbler
– Ray
– Contig analysis tools

• With or without reference sequence for comparison

Page 12:

Assembly Workflow

Upload Reads: Minutes to Months
Quality Assessment: Minutes to Hours
De novo Assembly: Hours to Days
Assembly Assessment: Minutes to Hours

Page 13:

Apps in Discovery Environment

• Sequence Quality Control
– FastQC
– Fastx Toolkit
– Suffixerator/Tallymer/mkindex
– Sabre, Scythe, Sickle (paired end trimming)
– SGA cleanup (paired end quality trimming)
– Future plans
  • Sequence induction, assessment, and trimming pipeline
  • Mira contaminant detection and removal (for sequencing studies)

Page 14:

QC: FastQC

Page 15:

QC: Read Coverage

[Figure: reference and reads; labels: Errors, Coverage, Repeats]

Page 16:

Wheat Genome (A. tauschii / CSHL)

Page 17:

QC: Mer counts

[Figure: workflow; Frag1.fq and Frag2.fq each pass through FASTX_fastq-to-fasta, then Suffixerator (app: Suffixerator-Tallymer-mkindex)]

A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes
Kurtz S, Narechania A, Stein JC, Ware D. (2008) BMC Genomics. 9:517
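The idea behind mer-count QC can be sketched in a few lines of Python. This is not how Tallymer works internally (it builds an enhanced suffix array with Suffixerator); the dictionary-based counter below is a simple stand-in that only scales to small inputs, and the reads are invented for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def multiplicity_histogram(kmer_counts):
    """Map each multiplicity to the number of distinct k-mers seen that many times."""
    return Counter(kmer_counts.values())


# Toy reads, invented for illustration only.
reads = ["ACGTACGTAC", "CGTACGTACG", "TTTTACGTAC"]
counts = count_kmers(reads, k=4)
hist = multiplicity_histogram(counts)
for multiplicity in sorted(hist):
    print(multiplicity, hist[multiplicity])
```

In real data the histogram of multiplicities is the useful output: k-mers seen once or twice are mostly sequencing errors, a main peak sits near the sequencing depth, and k-mers far above that peak come from repeats.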

Page 18:

Using Allpaths LG

• You must have at least 2 libraries
– One overlapping fragment library, e.g. 100 bp reads with 180 bp spacing
– One jumping mate-pair library, e.g. 3000 bp spacing
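The word "overlapping" has a simple arithmetic meaning: the two reads of a pair overlap when twice the read length exceeds the fragment size. The Python sketch below checks the slide's example, reading "spacing" as the fragment (insert) size, which is an interpretation rather than something the slide states. The 100 bp read length for the jumping library is also an assumption, since the slide gives only its spacing.

```python
def pair_overlap(read_len, fragment_size):
    """Bases shared by the two reads of a pair; negative means a gap between them."""
    return 2 * read_len - fragment_size


# (library name, read length, fragment size in bp); names are illustrative only,
# and the jumping library's read length is assumed.
libraries = [
    ("fragment", 100, 180),
    ("jumping", 100, 3000),
]

for name, read_len, frag in libraries:
    ov = pair_overlap(read_len, frag)
    kind = f"reads overlap by {ov} bp" if ov > 0 else f"reads separated by {-ov} bp"
    print(f"{name}: {kind}")
```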

Page 19:

How ALLPATHS-LG works

[Figure: pipeline stages: reads, corrected reads, doubled reads, unipaths, localized data, local graph assemblies, global graph assembly, assembly]

Sante Gnerre et al (2010) PNAS 1513–1518, doi: 10.1073/pnas.1017351108

See Youtube: https://www.youtube.com/watch?v=USlTWhmw0oQ&index=3&list=PL-0S9LiUi0viEhYTP_EQtKpYkcYAVW6IH

Page 20:

ALLPATHS-LG in DE: Where is the sample data?

Data Source: GAGE Project

[Figure labels: 180 bp, 3500 bp]

Page 21:

ALLPATHS-LG in DE: Where is the Allpaths LG App?

Page 22:

ALLPATHS-LG in DE: Fragment Reads

Page 23:

ALLPATHS-LG in DE: Jumping Reads

Page 24:

ALLPATHS-LG in DE: Run Settings

Page 25:

Running ALLPATHS-LG

Page 26:

Post-QC: CEGMA

CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes

Parra G, Bradnam K, Korf I. (2007) Bioinformatics. 23 (9): 1061-1067.

Page 27:

Resources

• iPlant
– http://www.iplantcollaborative.org/

• Assembly Competitions
– Assemblathon: http://assemblathon.org/
– GAGE: http://gage.cbcb.umd.edu/

• Assembler Websites:
– ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/
– SOAPdenovo: http://soap.genomics.org.cn/soapdenovo.html
– Celera Assembler: http://wgs-assembler.sf.net

• Tools:
– FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
– Tallymer: http://www.zbh.uni-hamburg.de/?id=211
– CEGMA: http://korflab.ucdavis.edu/datasets/cegma/

Page 28:

What Are Annotations?

• Annotations are descriptions of features of the genome
– Structural: exons, introns, UTRs, splice forms etc.
– Coding & non-coding genes
– Expression, repeats, transposons

• Annotations should include evidence trail
– Assists in quality control of genome annotations

• Examples of evidence supporting a structural annotation:
– Ab initio gene predictions
– ESTs
– Protein homology

Page 29:

Secondary Annotation

• Protein Domains
– InterPro Scan: combines many HMM databases

• GO and other ontologies

• Pathway mapping
– E.g. BioCyc Pathway tools

Page 30:

Challenges in Plant Genome Annotation

• Genomes are BIG
• Highly repetitive
• Many pseudogenes
• Assembly contamination
• Incomplete evidence
• No method is 100% accurate

Page 31:

Options for Protein-coding Gene Annotation

Yandell & Ence. Nature Reviews Genetics 13, 329-342 (May 2012) | doi:10.1038/nrg3174

Page 32:

Typical Annotation Pipeline

• Contamination screening
• Repeat/TE masking
• Ab initio prediction
• Evidence alignment (cDNA, EST, RNA-seq, protein)
• Evidence-driven prediction
• Chooser/combiner
• Evaluation/filtering
• Manual curation

Page 33:

MAKER-P Automated Pipeline

[Figure labels: Repeat Library, Ab initio prediction, Evidence]

MPI-enabled to allow parallel operation on large compute clusters

Collaboration with Yandell Lab

Page 34:

What is a GFF File?

Generic Feature Format
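In its current version (GFF3), each feature is one tab-delimited line with nine columns: seqid, source, type, start, end, score, strand, phase, and attributes. The Python sketch below parses one such line. The example line and its identifiers are invented, and the parser ignores details such as URL-escaped characters and multi-valued attributes, so treat it as an illustration rather than a full reader.

```python
from typing import Dict, NamedTuple


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line):
    """Parse one GFF3 feature line; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attrs = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                   cols[5], cols[6], cols[7], attrs)


# Invented example line; the identifiers are hypothetical.
example = "chr1\tmaker\tgene\t1000\t5000\t.\t+\t.\tID=gene0001;Name=example_gene"
feature = parse_gff3_line(example)
print(feature.type, feature.start, feature.end, feature.attributes["ID"])
```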

Page 35:

Quality Control evaluation of the MAKER-P and TAIR10 datasets using Annotation Edit Distance (AED).

[Figure axis: Quality, from Better to Worse]
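AED, as used with MAKER, scores how well an annotation agrees with its supporting evidence at the nucleotide level: AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered by the annotation and SP is the fraction of annotation bases covered by the evidence. A value of 0 means perfect agreement and 1 means no overlap. The Python sketch below applies that formula to exon intervals; the coordinates are hypothetical, and the set-based approach is chosen for clarity rather than speed.

```python
def covered_bases(intervals):
    """Set of base positions covered by 1-based inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(annotation_exons, evidence_exons):
    """Annotation Edit Distance at the nucleotide level.

    AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases
    covered by the annotation and SP is the fraction of annotation bases
    covered by the evidence. 0 means perfect agreement, 1 means no overlap.
    """
    ann = covered_bases(annotation_exons)
    evi = covered_bases(evidence_exons)
    if not ann or not evi:
        return 1.0
    shared = len(ann & evi)
    sensitivity = shared / len(evi)
    specificity = shared / len(ann)
    return 1.0 - (sensitivity + specificity) / 2.0


# Hypothetical exon coordinates for illustration.
annotation = [(100, 199), (300, 399)]
evidence = [(100, 199), (300, 349)]
print(round(aed(annotation, evidence), 3))
```

Here the annotation covers 200 bases and the evidence 150, all of them shared, so SN = 1.0 and SP = 0.75.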

Page 36:

MAKER-P at iPlant

PAG 2014:

• W559 - Annotation of the Loblolly Pine Megagenome (Jill Wegrzyn)
– 20.15 Gb assembly, split into 40 jobs, 216 CPU/job (8640 CPU total), 17 hours

• P157 - Disease Resistance Gene Analysis on Chromosome 11 Across Ten Oryza Species
– 10 rice species (each w/ 12 chromosome pseudomolecules)
– 96 CPU per chromosome (1152 CPU total), ~2 hr per genome

TACC Lonestar Supercomputer: 22,656 CPU cores on 1,888 nodes

Genome                        Assembly Size (Mb)   CPU    Run Time (h:mm)
Arabidopsis thaliana TAIR10   120                  600    2:44
Arabidopsis thaliana TAIR10   120                  1500   1:27
Zea mays RefGen_v2            2067                 2172   2:53

Campbell et al. Plant Physiology. December 4, 2013, DOI:10.1104/pp.113.230144
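The benchmark figures on this slide allow a quick look at how the MPI runs scale. The Python sketch below converts the run times to minutes, computes CPU-hours per run, and compares the two Arabidopsis runs. It assumes the run times are written as hours:minutes; speedup and efficiency are the usual ratios, computed here for illustration and not reported on the slide.

```python
def to_minutes(hmm):
    """Convert an 'h:mm' string to minutes (assumes the slide's times are h:mm)."""
    hours, minutes = hmm.split(":")
    return int(hours) * 60 + int(minutes)


# (genome, assembly size in Mb, CPUs, run time) as listed on the slide.
runs = [
    ("Arabidopsis thaliana TAIR10", 120, 600, "2:44"),
    ("Arabidopsis thaliana TAIR10", 120, 1500, "1:27"),
    ("Zea mays RefGen_v2", 2067, 2172, "2:53"),
]

for genome, size_mb, cpus, runtime in runs:
    cpu_hours = cpus * to_minutes(runtime) / 60
    print(f"{genome}: {cpus} CPUs, {cpu_hours:.0f} CPU-hours")

# Scaling between the two Arabidopsis runs.
base_cpus, base_min = runs[0][2], to_minutes(runs[0][3])
more_cpus, more_min = runs[1][2], to_minutes(runs[1][3])
speedup = base_min / more_min
efficiency = speedup / (more_cpus / base_cpus)
print(f"speedup {speedup:.2f}x on {more_cpus / base_cpus:.1f}x CPUs, "
      f"efficiency {efficiency:.0%}")
```

An efficiency below 100% is expected: adding CPUs shortens wall-clock time but raises the total CPU-hours spent.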

Page 37:

MAKER-P at iPlant

• Virtual image

• MPI-enabled for parallel computing

• Check out with up to 16 CPU

• Tested with 4 CPU instance

• Completed rice chr 1 in 8 hr 45 min

Atmosphere: MAKER_2.28 (emi-F13821D0)

Page 38:

MAKER-P Tutorial

https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+Atmosphere+Tutorial

Page 39:

Annotation Post-Analysis

• AED threshold
• InterProScan
• Comparative analysis, e.g. BLAST vs RefSeq proteins

Page 40:

Annotation Post-Analysis

InterProScan

Page 41:

Assembly & Annotation at iPlant