Improving and validating the Atlantic cod genome assembly using PacBio

Post on 10-May-2015

DESCRIPTION

My talk at the PacBio European Usergroup Meeting, November 20th, 2013

TRANSCRIPT

Improving and validating the Atlantic Cod genome assembly using error-corrected as well as raw PacBio reads

Lex Nederbragt, NSC and CEES

lex.nederbragt@ibv.uio.no

@lexnederbragt

OK

Acknowledgements

University of Oslo

Sequencing team NSC

Ole Kristian Tøressen

Kjetill Jakobsen

Sissel Jentoft

Cod genome group

Jason Miller, JCVI

Pacific Biosciences

The Atlantic cod genome project

Cod: the genome

850 million bases (Mbp)

Heterozygote

‘Wild-caught’

Cod: phase 1

(Sanger sequencing)

454 sequencing

N50

50% of the genome is in contigs at least as large as the N50 value

Courtesy of Michael Schatz, CSHL

Figure: example with a 1000 bp genome, showing contig lengths, their running sum and the N50
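The N50 definition on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `n50` and the example contig lengths are assumptions made up for this sketch.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until the running sum
    # covers at least half of the total assembled bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in bp (not the values from the slide).
example = [300, 250, 200, 150, 100]
print(n50(example))  # prints 250
```

The lengths are sorted from longest to shortest because N50 is the length of the contig at which the running sum first reaches half of the total.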

Cod: phase 1

Phase 1 assembly

157 887 sequences

753 Mbp of 830 Mbp

Scaffold N50 460 kbp

Contig N50 2.8 kbp

Diagram labels: scaffold, contig, gap

Cod: phase 1

6467 scaffolds

35% gap bases
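A figure such as "35% gap bases" can be computed directly from scaffold sequences. The sketch below assumes that gaps are written as runs of `N` in the scaffolds, which is the usual convention but is not stated on the slide. The function name and the toy sequences are hypothetical.

```python
def gap_percentage(scaffolds):
    """Percentage of bases that are N (gap) across all scaffold sequences."""
    total = sum(len(seq) for seq in scaffolds)
    if total == 0:
        raise ValueError("no bases in input")
    # Assumption: every N (upper or lower case) is a gap base.
    gaps = sum(seq.upper().count("N") for seq in scaffolds)
    return 100.0 * gaps / total


# Toy scaffolds, not real cod data: 4 gap bases out of 20 bases in total.
toy = ["ACGTNNNNACGT", "ACGTACGT"]
print(f"{gap_percentage(toy):.1f}% gap bases")  # prints 20.0% gap bases
```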

The causes

Short Tandem Repeats (>20% of gaps)
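To make the short tandem repeat idea concrete, here is a small Python sketch that finds dinucleotide repeats such as (TG)n in a sequence. It is an illustration only, not the method used in the cod project. The function name, the `min_repeats` threshold and the test sequence are assumptions.

```python
import re


def find_dinucleotide_repeats(seq, min_repeats=5):
    """Return (start, end, unit, copies) for each dinucleotide repeat run."""
    # One captured 2-base unit followed by at least min_repeats - 1 more copies.
    pattern = re.compile(r"([ACGT]{2})\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if unit[0] == unit[1]:
            continue  # skip homopolymer runs such as AAAA...
        copies = (match.end() - match.start()) // 2
        hits.append((match.start(), match.end(), unit, copies))
    return hits


# Hypothetical sequence with eight copies of TG.
seq = "ACGA" + "TG" * 8 + "CCAT"
print(find_dinucleotide_repeats(seq))  # prints [(4, 20, 'TG', 8)]
```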

The causes

Contig 1

Polymorphic contig 2

Polymorphic contig 3

Contig 4

Heterozygosity?

Cod: phase 2

New data

Illumina sequencing

Paired end >200x

Mate pair 5 kb >100x

Improved/new software

23 pseudochromosomes

Below 5% gap bases

Longer contigs

Cod: phase 2 goal

Phase 2 goal

Scaffold N50 1 Mbp

Contig N50 15 kbp

Cod: phase 2 programs

Zhang et al., PLoS ONE 2011

Cod phase 2: status

Assembly             Contig N50   Gaps   Scaffold N50
Goal                 15 kbp       <5%    1.5 Mbp
Celera, 454 + Ilmn   9 kbp        5%     too short
Newbler, 454         6 kbp        24%    OK

Enter PacBio

Large Insert Sizes

Sequencing

Aim for looooong insert sizes

Photo: Tore Oldeide Elgvin

147 SMRT Cells

Chemistry   Coverage   Av. raw length
C2          9.2x       3.0 kb
C2-XL       3.2x       4.6 kb
XL-XL       3.5x       5.1 kb
TOTAL       15.9x
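The coverage arithmetic behind this table can be sketched as follows. The per-chemistry coverage values come from the slide. The genome size of 830 Mbp is an assumption taken from the phase 1 slide (another slide quotes 850 million bases), and the function names are hypothetical.

```python
# Coverage per chemistry as listed on the slide: (chemistry, coverage in x).
RUNS = [("C2", 9.2), ("C2-XL", 3.2), ("XL-XL", 3.5)]

# Assumption: genome size of 830 Mbp, the figure used on the phase 1 slide.
GENOME_SIZE_BP = 830_000_000


def total_coverage(runs):
    """Sum the coverage contributed by each chemistry."""
    return sum(cov for _, cov in runs)


def bases_for_coverage(coverage, genome_size_bp):
    """Coverage is sequenced bases divided by genome size, so invert it."""
    return coverage * genome_size_bp


total = total_coverage(RUNS)
print(f"total coverage: {total:.1f}x")
print(f"approx. sequenced bases: {bases_for_coverage(total, GENOME_SIZE_BP) / 1e9:.1f} Gbp")
```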

Error-correction

Celera Assembler merTrim: 27x + 234x

PacBioToCA (Koren et al.): 13.7x + 27x

9x (67%) recovered

Using PacBio reads

PacBio reads for cod

                       Error-corrected reads   Raw reads
Assembly improvement   Celera                  PBJelly
Assembly validation    blasr                   blasr, bridgemapper
De novo assembly       Celera

PacBio reads for cod

                       Error-corrected reads   Raw reads
Assembly improvement   PBJelly                 PBJelly
Assembly validation    blasr                   blasr, bridgemapper
De novo assembly       Celera

Assembly improvement: corrected reads

Assembly                       N50      Gaps
Goal                           15 kbp   <5%
Celera, 454 reads              9 kbp    5%
+ corrected PacBio + PBJelly   11 kbp   1.5%

Assembly improvement: raw reads

Assembly                 N50      Gaps
Goal                     15 kbp   <5%
Newbler, 454             6 kbp    24%
+ raw PacBio + PBJelly   30 kbp   20%

Assembly improvement: raw reads

Assembly                 N50      Gaps
Goal                     15 kbp   <5%
Celera, 454 + Ilmn       9 kbp    5%
+ raw PacBio + PBJelly   46 kbp   1.5%

Too good to be true?

Assembly validation

Sequence

Aligned raw PacBio reads

Coverage

Aligned corrected PacBio reads

Assembly validation

Raw PacBio reads

Corrected PacBio reads

(TG)n repeat (TG)n repeat

308 bp gap

Newbler scaffold

Assembly validation

Raw PacBio reads

(AG)n repeat

939 bp gap

Newbler scaffold

Heterozygous region

Assembly validation

Raw PacBio reads

Celera scaffold

Misassembly?

Assembly validation: bridgemapper (beta)

Structural variation

Misassemblies

Split alignments
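The idea of using split alignments for validation can be illustrated with a toy Python sketch. This is not how bridgemapper works internally, which the slides do not describe. It only shows the underlying idea: a read whose aligned segments land on more than one scaffold hints at a join, a structural variant or a misassembly. The record layout and all identifiers are hypothetical.

```python
from collections import defaultdict


def reads_bridging_scaffolds(alignments):
    """Map each read to its scaffolds, keeping reads that hit more than one.

    alignments: iterable of (read_id, scaffold_id) pairs,
    one pair per aligned segment of a read.
    """
    targets = defaultdict(set)
    for read_id, scaffold_id in alignments:
        targets[read_id].add(scaffold_id)
    # Keep only reads whose segments align to two or more scaffolds.
    return {read: sorted(hits) for read, hits in targets.items() if len(hits) > 1}


# Hypothetical split alignments: read r1 aligns partly to sA and partly to sB.
segments = [("r1", "sA"), ("r1", "sB"), ("r2", "sA"), ("r2", "sA")]
print(reads_bridging_scaffolds(segments))  # prints {'r1': ['sA', 'sB']}
```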

bridgemapper (beta) on E. coli

Positions in the contig are color coded

Illumina + Velvet

s05514

bridgemapper (beta) on cod

2510 bp gap

Point to a 2350 bp scaffold

s08737

bridgemapper (beta) on cod

2145 bp gap

Point to a 3 kbp scaffold

Assembly with error-corrected reads

Assembly                            Contig N50   Gaps   Scaffolds
Goal                                15 kbp       <5%
Celera                              9 kbp        5%     too short
CA + corrected PacBio + 454 mates   8 kbp        2%     very short

1.4 times genome size, underassembled

The improved Atlantic cod genome: status

http://en.wikipedia.org

Newbler plus Celera

Celera: Long contigs, short scaffolds

Newbler: Short contigs, long scaffolds

Combined: Long contigs, long scaffolds

Diagram labels: scaffold, contig, gap

Slide courtesy of Ole Kristian Tøressen

Contig

Scaffold

PacBio reads

Slide courtesy of Ole Kristian Tøressen

Adding PacBio

Closed gap

Reduced gap

Using PBJelly

Polishing the assembly

454 and Illumina reads

Slide courtesy of Ole Kristian Tøressen

Contig

Scaffold

Contig N50: 30-40 kbp

Scaffold N50: 1-1.5 Mbp

Image by Mathieu Thouvenin, http://www.flickr.com/photos/mathoov/4681491052/

PacBio reads for cod

                       Error-corrected reads   Raw reads
Assembly improvement   PBJelly                 PBJelly
Assembly validation    blasr                   blasr, bridgemapper
De novo assembly       Celera                  Celera

Assembly                            Contig N50   Gaps   Scaffolds
Goal                                15 kbp       <5%
CA + corrected PacBio + 454 mates   8 kbp        2%     very short
CA + raw PacBio reads + 454 mates   38 kbp       <1%    very short

1.6 times genome size, underassembled

Lessons learned from PacBio reads

Heterozygous: large polymorphism (100s of bases)

Heterozygous: large indel (100s of bases)

Homozygous Homozygous Homozygous

Cod genome

Atlantic cod version 2

23 pseudochromosomes

Below 5% gap bases

Longer contigs

New annotation

From observation to insight

Mathias Bigge, Ricordisamoa, others (wikimedia commons)

We need better programs
