
Post on 17-Dec-2015


Kelley Bullard, Henry Dewhurst, Kizee Etienne, Esha Jain, VivekSagar KR, Benjamin Metcalf, Raghav Sharma, Charles Wigington, Juliette Zerick

Genome Assembly

Outline

• Input Data
• Sequence read data
• Pipeline Review
• Un-processed data
• Assemblers
• Preliminary data: assembler comparison
• Visualization
• Future

Input Data

V. navarrensis    V. vulnificus
2423-01           2009V-1368
08-2462           06-2432
2541-90           08-2435
2756-81           08-2439
-                 07-2444

Input Data / Sequence Read Data / Pipeline Review / Un-processed data / Assemblers / Preliminary Data / Visualization / Future

Vibrio navarrensis - 454

Sequence ID        2423-01              08-2462              2541-90              2756-81
Min. read length   21 bp                25 bp                19 bp                28 bp
Max. read length   738 bp               573 bp               704 bp               704 bp
Avg. read length   423.27 ± 117.36 bp   401.80 ± 117.12 bp   416.23 ± 125.84 bp   423.53 ± 117.19 bp
Total reads        160,560              13,854               303,434              218,021
Coverage           15x                  1.23x                28.06x               20.51x
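The coverage row is consistent with the usual estimate of total reads times average read length divided by genome size. A minimal Python sketch of that arithmetic follows. The genome size of about 4.5 Mb is an assumption (the slides do not state the value used), and `estimate_coverage` is a hypothetical helper name.

```python
def estimate_coverage(total_reads: int, avg_read_length: float, genome_size: int) -> float:
    """Return estimated fold coverage: sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_reads * avg_read_length / genome_size


# Assumed genome size (not given in the slides): roughly 4.5 Mb.
ASSUMED_GENOME_SIZE = 4_500_000

# Read count and average read length for V. navarrensis 2423-01 (454).
coverage = estimate_coverage(160_560, 423.27, ASSUMED_GENOME_SIZE)
print(f"{coverage:.1f}x")
```

Under that assumed genome size the estimate lands close to the 15x reported for 2423-01.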

Vibrio vulnificus - 454

Sequence ID        2009V-1368           06-2432              08-2435              08-2439              07-2444
Min. read length   26 bp                21 bp                23 bp                22 bp                18 bp
Max. read length   593 bp               597 bp               723 bp               594 bp               736 bp
Avg. read length   416.05 ± 123.19 bp   371.91 ± 112.13 bp   416.98 ± 121.56 bp   418.12 ± 120.88 bp   368.78 ± 115.96 bp
Total reads        191,280              786,944              352,726              173,538              777,228
Coverage           17x                  65x                  32x                  16x                  63x

Vibrio navarrensis - Illumina

Sequence ID        2423-01        08-2462        2541-90        2756-81
Min. read length   76 bp          76 bp          76 bp          76 bp
Max. read length   76 bp          76 bp          76 bp          76 bp
Avg. read length   76 bp          76 bp          76 bp          76 bp
Total reads        19,316,659     29,414,237     126,298,691    92,338,634
Coverage           326x           496x           250x           237x

Vibrio vulnificus - Illumina

Sequence ID        2009V-1368     06-2432        08-2435        08-2439        07-2444
Min. read length   76 bp          76 bp          76 bp          76 bp          76 bp
Max. read length   76 bp          76 bp          76 bp          76 bp          76 bp
Avg. read length   76 bp          76 bp          76 bp          76 bp          76 bp
Total reads        15,764,329     14,562,252     15,343,648     16,007,895     15,495,709
Coverage           ~250x          ~250x          ~250x          ~250x          ~250x

Pipeline: Revisited

[Figure: pipeline flowchart; the arrows between boxes are not preserved. Legend: process, info., chosen ref., assemblers; colour key: 454, Illumina, hybrid.]

PRE-PROCESSING
• Input: 454 raw reads, Illumina raw reads
• Pre-processing tools: FastQC, PRINSEQ, NGS QC
• Output: 454 reads, Illumina reads
• Statistical analysis: read stats

REFERENCE SELECTION
• Published genomes from public databases: V. vulnificus YJ016, V. vulnificus CMCP6, V. vulnificus MO6-24/O
• Align Illumina against the reference
• Compare mapping statistics
• Tools: bwa, samstats
• Output: reference genome

DENOVO ASSEMBLY (Illumina / 454 / hybrid)
• 454 DeNovo: Newbler, CABOG, SUTTA
• Illumina DeNovo: Allpaths LG, SOAP DeNovo, Velvet, Taipan, SUTTA
• Hybrid DeNovo: Ray, MIRA
• Align Illumina reads against 454 contigs
• Tools: MacVector, CLC wb
• Output: contigs * 3, unmapped reads
• Evaluation: GAGE, Hawk-eye

REFERENCE BASED ASSEMBLY
• Illumina/(454?) reference based assembly: AMOScmp
• Output: contigs, unmapped reads
• Reference evaluation: DNA Diff, MUMmer

CONTIG MERGING
• All possible combinations of the best 3
• Parameter optimization
• Tools: Minimus, MAIA

GENOME FINISHING
• Gap filling; nucleotide identity
• Tools: PAGIT, Mauve, MUMmer, GRASS, built-in
• Evaluation: GAGE
• Output: scaffolds, draft/finished genome
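The contig-merging stage names "all possible combinations of the best 3" assemblies. A minimal Python sketch of enumerating those combinations follows. It assumes a merge needs at least two input assemblies; the assembly labels and the `merge_candidates` helper are hypothetical and not taken from the slides.

```python
from itertools import combinations


def merge_candidates(assemblies, min_size=2):
    """List every subset of the assemblies that has at least min_size members."""
    result = []
    for size in range(min_size, len(assemblies) + 1):
        result.extend(combinations(assemblies, size))
    return result


# Hypothetical labels standing in for the three best assemblies.
best_three = ["assembly_A", "assembly_B", "assembly_C"]

for combo in merge_candidates(best_three):
    print(" + ".join(combo))
```

With three assemblies this yields three pairs and one triple, so four merge runs would need to be compared.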

Vibrio vulnificus - 454

[Table: FastQC module status per sample for 1368, 2432, 2435, 2439 and 2444; the status icons are not preserved in this transcript.]

Metrics: Per Base Seq. Quality, Per Seq. Quality Score, Per Base Seq. Content, Per Base GC Content, Per Seq. GC Content, Per Base N Content, Seq. Length Dist., Seq. Dup. Levels, Overrepresented Seqs., Kmer Content

Vibrio navarrensis - 454; unprocessed data

[Table: FastQC module status per sample for 2423-01, 08-2462, 2541-90 and 2756-81; the status icons are not preserved in this transcript.]

Metrics: Per Base Seq. Quality, Per Seq. Quality Score, Per Base Seq. Content, Per Base GC Content, Per Seq. GC Content, Per Base N Content, Seq. Length Dist., Seq. Dup. Levels, Overrepresented Seqs., Kmer Content

Vibrio vulnificus - Illumina; unprocessed data

[Table: FastQC module status per sample for 2009V-1368, 06-2432, 08-2435, 08-2439 and 07-2444; the status icons are not preserved in this transcript.]

Metrics: Per Base Seq. Quality, Per Seq. Quality Score, Per Base Seq. Content, Per Base GC Content, Per Seq. GC Content, Per Base N Content, Seq. Length Dist., Seq. Dup. Levels, Overrepresented Seqs., Kmer Content

Vibrio navarrensis - Illumina; unprocessed data

[Table: FastQC module status per sample for 2423-01, 08-2462, 2541-90 and 2756-81; the status icons are not preserved in this transcript.]

Metrics: Per Base Seq. Quality, Per Seq. Quality Score, Per Base Seq. Content, Per Base GC Content, Per Seq. GC Content, Per Base N Content, Seq. Length Dist., Seq. Dup. Levels, Overrepresented Seqs., Kmer Content
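A metric-by-sample status table like the ones on these slides can be rebuilt from the `summary.txt` file that FastQC writes for each input, where every line holds a status, a module name and a file name separated by tabs. The following is a minimal Python sketch; the helper names and the sample labels and statuses in the usage example are placeholders for illustration, not results from this project.

```python
from collections import defaultdict


def parse_fastqc_summary(text):
    """Parse the text of a FastQC summary.txt into {module name: status}."""
    statuses = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line: STATUS <tab> module name <tab> file name
        status, module = line.split("\t")[:2]
        statuses[module] = status
    return statuses


def build_status_table(summaries):
    """Turn {sample: summary text} into {module: {sample: status}}."""
    table = defaultdict(dict)
    for sample, text in summaries.items():
        for module, status in parse_fastqc_summary(text).items():
            table[module][sample] = status
    return dict(table)


# Placeholder input for illustration only (not actual project results).
example = {
    "sample_1": "PASS\tPer base sequence quality\ts1.fastq\nWARN\tKmer Content\ts1.fastq\n",
    "sample_2": "FAIL\tPer base sequence quality\ts2.fastq\nPASS\tKmer Content\ts2.fastq\n",
}

for module, row in build_status_table(example).items():
    print(module, row)
```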

Per base sequence quality

[Figure: per-base sequence quality plots for vul_454_07-2444, nav_454_2541-90, vul_ill_06-2432 and nav_ill_08-2462.]

Per base sequence content

[Figure: per-base sequence content plots for vul_454_06-2432, nav_454_08-2462, vul_ill_06-2432 and nav_ill_06-2756-81.]

Seq. duplicate levels

[Figure: sequence duplication level plots for vul_454_08-2435, nav_454_2541-90, vul_ill_06-2432 and nav_ill_08-2462.]

Pre-processing stats

vul_ill_07-2444

Parameter          Value
Total sequences    15,343,648
Good sequences     9,775,116
Bad sequences      5,568,532

[Figure: chart of read categories. Legend: good reads, exact repeats, trim tail right, min qual mean.]
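The chart legend names three filter categories next to the good reads. The sketch below shows, in plain Python, one way such filters could be applied and tallied. It is not PRINSEQ's implementation: the interpretation of each label (exact duplicates removed, low mean quality removed, 3' poly-A/T tails trimmed), the thresholds and the function names are all assumptions made for illustration.

```python
def trim_tail_right(seq, quals, min_tail=5):
    """Trim a 3' run of A or T if it is at least min_tail bases long."""
    if not seq or seq[-1] not in "AT":
        return seq, quals
    last = seq[-1]
    run = len(seq) - len(seq.rstrip(last))
    if run >= min_tail:
        return seq[:-run], quals[:-run]
    return seq, quals


def filter_reads(reads, min_mean_qual=20.0, min_tail=5):
    """Filter (sequence, quality list) pairs and count why reads were dropped."""
    seen = set()
    kept = []
    counts = {"good": 0, "exact_repeat": 0, "min_qual_mean": 0}
    for seq, quals in reads:
        if seq in seen:  # exact duplicate of an earlier read
            counts["exact_repeat"] += 1
            continue
        seen.add(seq)
        if not quals or sum(quals) / len(quals) < min_mean_qual:
            counts["min_qual_mean"] += 1
            continue
        kept.append(trim_tail_right(seq, quals, min_tail))
        counts["good"] += 1
    return kept, counts


# Toy reads for illustration only.
toy_reads = [
    ("ACGTAAAAAA", [30] * 10),
    ("ACGTAAAAAA", [30] * 10),  # exact repeat
    ("ACGT", [5] * 4),          # low mean quality
]
kept, counts = filter_reads(toy_reads)
print(kept, counts)
```

For the table above, the good fraction is 9,775,116 / 15,343,648, or roughly 64% of the input sequences.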


Assemblers

Name                      Platform
Allpaths LG               Illumina
SOAP DeNovo               Illumina
Velvet                    Illumina
SUTTA                     Hybrid
RAY                       Hybrid
CLC genomics workbench    Hybrid
Newbler                   454
CABOG                     454

[The "Source file", "Installation" and "Usage" columns held status marks that are not preserved in this transcript.]

CLC Genomics

• Word size: automatic. CLC bio's de novo assembly algorithm works by using de Bruijn graphs. It makes a table of all sub-sequences of a certain length (called words) found in the reads.
• Bubble size: automatic. A bubble is a bifurcation in the graph where a path splits into two nodes and then merges back into one.
• Minimum contig length: 200
• Mismatch cost: 2. The cost of a mismatch between the read and the reference sequence.
• Insertion cost: 3. The cost of an insertion in the read (causing a gap in the reference sequence).
• Deletion cost: 3. The cost of having a gap in the read. The score for a match is always 1.
• Length fraction: 0.5. The minimum fraction of a read's length that must match the reference sequence. A value of 0.5 means at least half the read must match the reference sequence for the read to be included in the final mapping.
• Similarity: 0.8. The minimum fraction of identity between the read and the reference sequence. To require, for example, at least 90% identity for a read to be included in the final mapping, set this value to 0.9.
• Update contigs based on mapped reads: the original contig sequences produced by the de novo assembly are updated to reflect the mapping of the reads.
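The word table described for the de Bruijn approach can be illustrated with a short Python sketch. This is a generic illustration of counting words (k-mers) and deriving graph edges, not CLC bio's implementation; the function names, the word size and the toy reads are assumptions.

```python
from collections import Counter


def word_table(reads, word_size):
    """Count every sub-sequence (word) of length word_size found in the reads."""
    table = Counter()
    for read in reads:
        for i in range(len(read) - word_size + 1):
            table[read[i:i + word_size]] += 1
    return table


def de_bruijn_edges(table):
    """Link each word's prefix to its suffix, both of length word_size - 1."""
    return {(word[:-1], word[1:]) for word in table}


# Toy reads for illustration only.
toy_reads = ["ACGT", "CGTA"]
table = word_table(toy_reads, word_size=3)
print(table)
print(sorted(de_bruijn_edges(table)))
```

A larger word size makes words more specific but requires longer error-free stretches in the reads, which is the trade-off behind choosing the word size.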

Velvet

• de Bruijn graph assembler
• Max k-mer length: 31; default: 29
• Commands:
    velveth <directory> <k-mer length> -<read type> -<file format> <filename>
    velvetg VAssemILL -exp_cov auto -cov_cutoff auto
• -exp_cov auto: let the system infer the expected coverage of unique regions
• -cov_cutoff auto: let the system infer the coverage cutoff used to remove low-coverage nodes
• Designed for very short reads (25-50 bp)

Newbler

• De novo OLC (overlap-layout-consensus) assembler
• Uses k-mer based hashing
• Command: runAssembly [filename]
• Designed for longer reads (454)

SOAP DeNovo2

• Short-read de novo assembler
• Designed to assemble Illumina GA II short reads
• Command:
    SOAPdenovo-127mer all -s test.config -K 30 -R -p 4 -N 4600000 -o test_OP 1>ass.log 2>ass.err
• Parameters specified:
    insert_size: 0 (single-end reads)
    kmer_size: 23 (default)
    asm_flag: both contigs and scaffolds

Assembler comparison - 454

vul_454_06-2432

Tool                N50        No. of contigs   Avg. contig length   No. of large contigs   Largest contig   Read usage %
CLC Genomics wb.    93,536     363              13,107               NA                     NA               99.32
Newbler             194,540    142              33,550               94                     777,156          98.9

nav_454_2541-90

Tool                N50        No. of contigs   Avg. contig length   No. of large contigs   Largest contig   Read usage %
CLC Genomics wb.    84,313     313              13,828               NA                     NA               98.53
Newbler             111,462    347              12,606               168                    218,091          97.88
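N50 is the headline metric in these comparisons: the contig length at which contigs of that length or longer hold at least half of the assembled bases. A minimal Python sketch of the standard calculation follows; the function name and the toy contig lengths are illustrative and not taken from the assemblies above.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the total bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy contig lengths for illustration only.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because N50 rewards a few long contigs, it is best read alongside the contig count and the largest contig.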

Assembler comparison - Illumina

nav_ill_2541-90

Tool               N50       No. of contigs   Avg. contig length   Read usage %   Largest contig   Median coverage depth
SOAP DeNovo        1,077     28,760           184                  NA             NA               NA
Velvet             17,408    1,402            3,072                99.26          58,246           92.09
CLC Genomics wb    56,628    291              14,766               99.36          193,565          NA

vul_ill_06-2432

Tool               N50       No. of contigs   Avg. contig length   Read usage %   Largest contig   Median coverage depth
SOAP DeNovo        1,094     26,773           207                  NA             NA               NA
Velvet             15,699    1,253            3,759                99.57          51,343           86.93
CLC Genomics wb    87,298    260              18,087               99.40          233,510          NA

Pipeline: Revisited

[Figure: revised pipeline flowchart; the arrows between boxes are not preserved. Legend: process, info., chosen ref., assemblers; colour key: 454, Illumina, hybrid.]

PRE-PROCESSING
• Input: 454 raw reads, Illumina raw reads
• Pre-processing tools: FastQC, PRINSEQ, NGS QC
• Output: 454 reads, Illumina reads
• Statistical analysis: read stats

REFERENCE SELECTION
• Published genomes from public databases: V. vulnificus YJ016, V. vulnificus CMCP6, V. vulnificus MO6-24/O
• Align Illumina against the reference
• Compare mapping statistics
• Tools: bwa, samstats
• Output: reference genome

DENOVO ASSEMBLY (Illumina / 454 / hybrid)
• 454 DeNovo: Newbler, CABOG, SUTTA
• Illumina DeNovo: Allpaths LG, SOAP DeNovo, Velvet, SUTTA
• Hybrid DeNovo: Ray
• Align Illumina reads against 454 contigs
• Tools: MacVector, CLC wb
• Output: contigs * 3, unmapped reads
• Evaluation: GAGE, Hawk-eye

REFERENCE BASED ASSEMBLY
• Illumina/454? reference based assembly: AMOScmp
• Output: contigs, unmapped reads
• Reference evaluation: DNA Diff

CONTIG MERGING
• All possible combinations of the best 3
• Parameter optimization
• Tools: Minimus, MAIA

GENOME FINISHING
• Gap filling; nucleotide identity
• Tools: PAGIT, Mauve, MUMmer, GRASS, built-in
• Evaluation: GAGE
• Output: scaffolds, draft/finished genome

Reference Genomes

• V. vulnificus MO6-24/O
• V. vulnificus YJ016
• V. vulnificus CMCP6

Reference vs. all contigs - 454

Cells give aligned contigs % / aligned bases %.

nav_454_2541-90

Tool                          CMCP6      YJ016      MO6-24/O
CLC Genomics wb. (n = 313)    45 / 25    41 / 25    39 / 25
Newbler (n = 347)             59 / 25    58 / 25    43 / 24

vul_454_06-2432

Tool                          CMCP6      YJ016      MO6-24/O
CLC Genomics wb.              NA / NA    NA / NA    NA / NA
Newbler (n = 142)             85 / 91    84 / 91    86 / 92

Reference vs. all contigs - Illumina

Cells give aligned contigs % / aligned bases %.

nav_ill_2541-90

Tool                          CMCP6      YJ016      MO6-24/O
SOAP DeNovo (n = 28,760)      3 / 13     3 / 14     3 / 14
Velvet (n = 1,402)            20 / 23    20 / 23    20 / 23

vul_ill_06-2432

Tool                          CMCP6      YJ016      MO6-24/O
SOAP DeNovo (n = 26,773)      18 / 76    18 / 76    18 / 76
Velvet (n = 1,253)            46 / 91    47 / 91    47 / 91
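The slides do not say how the two percentages were computed. One plausible definition is sketched below in Python: a contig counts as aligned if it has at least one alignment to the reference, and aligned bases are the contig positions covered by any alignment, with overlaps counted once. The input structures, coordinate convention and function names are assumptions for illustration.

```python
def covered_bases(intervals):
    """Count positions covered by half-open (start, end) intervals, overlaps once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def alignment_summary(contig_lengths, aligned_intervals):
    """Return (aligned contigs %, aligned bases %) for one assembly.

    contig_lengths: {contig name: length}
    aligned_intervals: {contig name: [(start, end), ...]} on contig coordinates
    """
    n_aligned = sum(1 for name in contig_lengths if aligned_intervals.get(name))
    bases = sum(covered_bases(aligned_intervals.get(name, [])) for name in contig_lengths)
    pct_contigs = 100.0 * n_aligned / len(contig_lengths)
    pct_bases = 100.0 * bases / sum(contig_lengths.values())
    return pct_contigs, pct_bases


# Toy assembly for illustration only.
lengths = {"c1": 100, "c2": 100, "c3": 200}
hits = {"c1": [(0, 50), (25, 100)], "c3": [(0, 100)]}
print(alignment_summary(lengths, hits))
```

Under this definition a low aligned-bases percentage with a high read usage can indicate that the assembled species is simply distant from the reference, as would be expected for V. navarrensis contigs against V. vulnificus references.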

Visualization


Road ahead…

• Get all the tools working
• Optimize tool parameters
• Use Illumina reads to finish 454 contigs
• Performance considerations for the tool


Questions?