high quality arthropod genome assembly with single molecule reads and long-range scaffolding


Upload: surya-saha

Post on 22-Jan-2018


Page 1: High quality arthropod genome assembly with single molecule reads and long-range scaffolding

www.citrusgreening.org

High quality arthropod genome assembly with single molecule reads and long-range scaffolding

Prashant S Hosmani (1), Mirella Flores-Gonzalez (1), Wayne Hunter (2), Lukas A. Mueller (1), Susan Brown (3), and Surya Saha (1)

(1) Boyce Thompson Institute; (2) USDA-ARS U.S. Horticultural Research Laboratory; (3) Kansas State University

[email protected] @SahaSurya

Entomology 2017

Advances in Arthropod Genomics Workshop

Page 2: Acknowledgements

Mueller Lab: Mirella Flores, Prashant Hosmani

Kansas State University: Sue Brown

Cornell University/BTI: Michelle (Cilia) Heck

USDA/ARS: Wayne Hunter, Robert Shatters

University of California, Davis: Carolyn Slupsky

Indian River State College: Tom D'Elia

Page 3: Citrus Greening: Huanglongbing

• Most significant disease of citrus worldwide

• More than $4.5 billion in lost citrus production and more than 8,200 lost jobs (2006/07 to 2010/11)

• Associated with the Gram-negative bacterium Candidatus Liberibacter asiaticus (CLas)

• Spread by insect vector, Diaphorina citri (Asian citrus psyllid, ACP)

Annie Kruse

Page 4: Omics resources and databases are required for identification of targets for interdiction

[Diagram: Genome, Annotation, Pathway Databases, Expression Networks, ... for Host, Vector, and Pathogen, leading to: Target for interdiction molecules]

Page 5: Current Illumina assembly

Genome          Diaci1.1
Contigs         161,988
Total length    485 Mb
Longest         1 Mb
Shortest        201 bp
Ns              19.3 Mb
Scaffold N50    109,898 bp
Contig N50      34,407 bp

• Highly fragmented

• Many examples of misassemblies!!

http://biobeans.blogspot.com/2012/11/bioinformatics-genome-assembly.html
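The contig and scaffold N50 values above are the usual shorthand for contiguity: N50 is the length L such that sequences of length L or longer contain at least half of all assembled bases. A minimal Python sketch of that definition follows. The function name `n50` and the toy lengths are illustrative assumptions, not values from the Diaci assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy lengths (hypothetical): total is 300, and 100 + 80 reaches the halfway point.
toy_lengths = [100, 80, 60, 40, 20]
print(n50(toy_lengths))
```

Because N50 depends only on the length distribution, it says nothing about correctness: a misassembled contig can raise N50 while making the assembly worse.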

Page 6: PacBio assembly

                     Error rate 0.013    Error rate 0.015
Number of contigs    7,832               8,030
Total bases          462.8 Mb            493.1 Mb
Longest              1.6 Mb              1.7 Mb
Shortest             4.4 Kb              5 Kb
Average length       59.9 Kb             61.4 Kb
Contig N50           85.8 Kb             92.6 Kb

• Contiguous assembly with longer contigs

• Multiple individuals in DNA sample

Canu (Koren 2017): http://canu.readthedocs.io/en/stable/
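The rows reported for each assembly in this deck (number of contigs, total bases, longest, shortest, average length, and Ns) can all be derived from the assembly FASTA file. The sketch below is one way to compute them with the Python standard library. The helper names `fasta_records` and `summarize` and the toy sequences are assumptions for illustration, not the pipeline used for these slides.

```python
import io


def fasta_records(handle):
    """Yield (name, length, n_count) for each record in a FASTA stream."""
    name, length, n_count = None, 0, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length, n_count
            parts = line[1:].split()
            name, length, n_count = (parts[0] if parts else ""), 0, 0
        else:
            length += len(line)
            n_count += line.upper().count("N")
    if name is not None:
        yield name, length, n_count


def summarize(records):
    """Collapse per-contig records into assembly-level statistics."""
    records = list(records)
    if not records:
        raise ValueError("no FASTA records found")
    lengths = [length for _, length, _ in records]
    return {
        "contigs": len(lengths),
        "total_bases": sum(lengths),
        "longest": max(lengths),
        "shortest": min(lengths),
        "average_length": sum(lengths) / len(lengths),
        "n_bases": sum(n for _, _, n in records),
    }


# Toy three-contig assembly (hypothetical sequences).
toy_fasta = io.StringIO(">c1\nACGTACGTAC\n>c2 demo\nACGNNNGT\nAC\n>c3\nACGT\n")
stats = summarize(fasta_records(toy_fasta))
print(stats)
```

For a real assembly, pass an open file handle instead of the `io.StringIO` object; the parser streams line by line, so memory use stays small even for assemblies of several hundred Mb.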

Page 7: PBJelly scaffolding

                     Canu assembly    Scaffolded assembly v1.9
Number of contigs    7,832            8,352
Total bases          462.8 Mb         591.7 Mb
Longest              1.6 Mb           2 Mb
Shortest             4.4 Kb           1.5 Kb
Average length       59 Kb            70.8 Kb
Contig N50           85.8 Kb          115.8 Kb

• 5,290 gap extensions

• 535 gaps filled

• Number of Ns: 0 bp

English 2012

Page 8: Assemblies v1.91 and v1.92

                     v1.91      v1.92 REFERENCE    v1.92 ALTERNATE
Number of contigs    3,681      1,918              1,763
Total bases          596 Mb     513 Mb             83.4 Mb
Longest              4.2 Mb     4.2 Mb             760.6 Kb
Shortest             1.5 Kb     6 Kb               1.5 Kb
Average length       162 Kb     267 Kb             47.3 Kb
Contig N50           620 Kb     755.7 Kb           75.1 Kb
Ns                   5.1 Mb     4.6 Mb             467 Kb

• 500 ng input DNA from a single male psyllid

• Duplicated contigs added to alternate assembly

Pilon (https://github.com/broadinstitute/pilon/wiki): error correction with

• DNA sequencing data

• RNA sequencing data

Redundans (https://github.com/Gabaldonlab/redundans):

• Duplication removal

• Scaffolding

Page 9: Gene isoform sequencing (Iso-Seq)

Accurate gene models are necessary for targeting assays

• Majority of genes are alternatively spliced to produce multiple transcript isoforms.

• Iso-Seq generates full-length cDNA sequences (full-length transcripts and gene isoforms).

Current MCOT (de novo and genome-based) transcriptome is useful but fragmented

Korf 2013

Page 10: Sequencing full-length gene isoforms

Page 11: Mapping to D. citri genome

Isoforms mapped to D. citri v1.92

Total isoforms: 314,275

Iso-Seq provides a comprehensive (de novo and genome-based) transcriptome with full-length transcripts and a range of isoforms

                                   Counts
Number of genes                    18,799 (30,562 in MCOT)
Number of isoforms                 61,086
Average number of isoforms/gene    3.24
N50                                2.7 Kb
Longest                            9 Kb
Shortest                           100 bp
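The average number of isoforms per gene is the isoform count divided by the gene count. With the counts reported above, 61,086 / 18,799 is about 3.249, which matches the reported 3.24 when truncated to two decimal places. The Python sketch below shows the same calculation from an isoform-to-gene mapping. The function name and the toy identifiers are hypothetical.

```python
from collections import Counter


def isoforms_per_gene(isoform_to_gene):
    """Average number of isoforms per gene from an isoform -> gene mapping."""
    if not isoform_to_gene:
        raise ValueError("mapping is empty")
    per_gene = Counter(isoform_to_gene.values())
    return len(isoform_to_gene) / len(per_gene)


# Toy mapping (hypothetical identifiers): 5 isoforms across 2 genes.
toy = {
    "iso1": "geneA",
    "iso2": "geneA",
    "iso3": "geneA",
    "iso4": "geneB",
    "iso5": "geneB",
}
print(isoforms_per_gene(toy))

# Same ratio using the counts reported for the v1.92 mapping.
print(f"{61086 / 18799:.4f}")
```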

Page 12: Evaluating the assembly

Benchmarking sets of Universal Single-Copy Orthologs (BUSCO), based on a set of 3350 single-copy orthologs from hemipteran species

              Complete    Fragmented    Missing
Diaci 1.1     74.8%       0.3%          24.9%
Diaci 1.92    85.2%       0.1%          14.7%

Paired-end RNAseq alignment

              Overall alignment rate    Concordant alignment rate
Diaci 1.1     82%                       0.62%
Diaci 1.92    88%                       60%

Average length of aligned coding sequence

              MCOT       Iso-Seq (full-length transcripts)
Diaci 1.1     1054 bp    470 bp
Diaci 1.92    1321 bp    699 bp
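The Complete / Fragmented / Missing percentages above are each category's share of the ortholog set that was searched, so the three values for an assembly sum to 100%. A small Python sketch of that conversion follows. The function name and the counts are hypothetical toy values, not the counts behind the table, and the "complete" category here lumps together the single-copy and duplicated classes that BUSCO reports separately.

```python
def busco_percentages(complete, fragmented, missing):
    """Convert BUSCO category counts into percentages of the ortholog set."""
    total = complete + fragmented + missing
    if total == 0:
        raise ValueError("ortholog set is empty")
    counts = {"complete": complete, "fragmented": fragmented, "missing": missing}
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Toy counts (hypothetical) for a 200-ortholog set.
print(busco_percentages(complete=170, fragmented=2, missing=28))
```

Comparing two assemblies this way is only meaningful when both are scored against the same ortholog set, as in the table above.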

Page 13: Improved genome and annotation will expedite identification of targets for interdiction

[Diagram: Genome (PacBio v1.92), Annotation (Iso-Seq), Pathway Databases, Expression Networks, ... for Host, Vector, and Pathogen, leading to: Target for interdiction molecules]

Page 14: Thank you!!

Utilizing system biology resources to decipher a tritrophic disease complex

Prashant Hosmani

Wednesday, 10:30 AM - 10:45 AM

Member Symposium: Applying Emerging Genomic Techniques to Control Invasive Species