de novo whole genome assembly · 2019-11-18 · de novo whole genome assembly. qi sun ....
TRANSCRIPT
![Page 1: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/1.jpg)
De novo whole genome assembly
Qi Sun
Bioinformatics Facility, Cornell University
![Page 2: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/2.jpg)
The Concept of Reference Genome
>personA_chr1-paternal
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTC
TAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTG
CTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGA
GCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTT
CCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATA
TTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAATGCCAGAGGCTGCTCCCCCCGT
GGCCCCTGCACCAGCAGCTCCTACACCGGCGGCCCCTGCACCAGCCCCCTCCTGGCCCCTGTCATCTTCT
>personA_chr1-maternal
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTC
TAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTG
CTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGA
GCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTT
CCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATA
TTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAATGCCAGAGGCTGCTCCCCCCGT
GGCCCCTGCACCAGCAGCTCCTACACCGGCGGCCCCTGCACCAGCCCCCTCCTGGCCCCTGTCATCTTCT
>personB_chr1-paternal
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTC
TAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTG
CTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGA
GCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTT
CCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATA
TTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAATGCCAGAGGCTGCTCCCCCCGT
GGCCCCTGCACCAGCAGCTCCTACACCGGCGGCCCCTGCACCAGCCCCCTCCTGGCCCCTGTCATCTTCT
>personB_chr1-maternal
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTC
TAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTG
CTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGA
GCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTT
CCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATA
TTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAATGCCAGAGGCTGCTCCCCCCGT
GGCCCCTGCACCAGCAGCTCCTACACCGGCGGCCCCTGCACCAGCCCCCTCCTGGCCCCTGTCATCTTCT
Reference: GATGGGATTGGGGTTTGAGCTTCTCAAAAGTC
personA ………………………………T/A…………………………..
personB ………………………………T/T…………………………..
personC ………………………………T/A…………………………..
personD ………………………………T/A…………………………..
![Page 3: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/3.jpg)
Diploid genome
Maternal
Paternal
Reference
Reference genome is a mosaic of paternal and maternal genomes from one individual
![Page 4: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/4.jpg)
Which individual should be used as the reference?
Cornell professor Dr. Doug Antczak with Twilight, DNA donor for the horse reference genome.
Use an individual that is inbred.
The human reference is a composite genome from multiple anonymous individuals.
![Page 5: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/5.jpg)
Sequencing platforms
• Short reads (150 bp), about 0.1% error
o Illumina
• Long reads (>10 kb), about 10% error
o PacBio
o Oxford Nanopore
![Page 6: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/6.jpg)
Steps in genome assembly
Raw sequencing reads
Contiging: assemble reads into longer pieces called contigs
Polishing & scaffolding: correct errors and connect neighboring pieces into scaffolds (gaps between contigs are filled with NNNN)
Pseudo-molecules: chromosome-level finish
Output at each stage: contigs, then scaffolds, then chromosomes
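The relation between scaffolds and contigs can be illustrated with a small sketch. It assumes that gaps in a scaffold are written as runs of `N`; the function name `split_scaffold` and the toy sequence are made up for illustration and are not part of any assembler.

```python
import re
from typing import List


def split_scaffold(scaffold: str, min_gap: int = 1) -> List[str]:
    """Split a scaffold into its contigs at runs of at least min_gap 'N' characters."""
    gap_pattern = "[Nn]{%d,}" % min_gap
    # re.split can return empty strings when the scaffold starts or ends with a gap
    return [piece for piece in re.split(gap_pattern, scaffold) if piece]


# Toy scaffold: two contigs joined by a gap of unknown sequence
scaffold = "ACGTACGT" + "NNNN" + "GGCCTTAA"
contigs = split_scaffold(scaffold)
print(contigs)
```

Raising `min_gap` keeps isolated ambiguous bases inside a contig and only breaks at longer gaps.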
![Page 7: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/7.jpg)
Two assembly strategies
overlap–layout–consensus: long reads (Canu, Falcon, Flye, et al.)
de-bruijn-graph: short reads (SPAdes, ABySS, et al.)
source: http://www.homolog.us/Tutorials/index.php
![Page 8: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/8.jpg)
Branching in de-bruijn-graph
Example sequence covered by overlapping kmers (Kmer 1, Kmer 2, Kmer 3): ATGGAAGTGGAAGTTGGAAGA
Bubble (e.g. sequencing errors): ATGGAAG, then T or A, then CCATAC
Crosslinks (e.g. paralogous regions): ATGGAAG, then T or A, leading to CCATAC or TTAGAC
![Page 9: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/9.jpg)
Genome: GGATGGAAGTCG…………… CGATGGAAGGAT (black regions are identical sequence)
GATGGAA ATGGAAG TGGAAGT
TGGAAGG
GGATGGAAGT CGATGGAAGG GATGGAAGT GATGGAAGG
GGATTGGA
CCGATTGGA
Longer kmer
Short kmers are more likely to branch
…..
Short kmer
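A minimal sketch of why short kmers branch more often, under stated assumptions: the two fragments shown for the genome are treated as two separate regions, graph nodes are (k-1)-mers, and a node counts as branching when it has more than one distinct successor. The function name `count_branching_nodes` is hypothetical.

```python
from collections import defaultdict


def count_branching_nodes(sequences, k):
    """Count (k-1)-mer nodes with more than one outgoing edge in a de Bruijn graph."""
    successors = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # each kmer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix
            successors[kmer[:-1]].add(kmer[1:])
    return sum(1 for nxt in successors.values() if len(nxt) > 1)


# The two regions share the 8-base stretch GATGGAAG
regions = ["GGATGGAAGTCG", "CGATGGAAGGAT"]
short_k = count_branching_nodes(regions, 5)
long_k = count_branching_nodes(regions, 10)
print(short_k, long_k)
```

With k = 5 the node GAAG is followed by both AAGT and AAGG, so the graph branches. With k = 10 every node is longer than the shared stretch and the two regions stay on separate paths.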
![Page 10: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/10.jpg)
Longer kmer = Lower kmer coverage
Read coverage: 1x coverage vs 5x coverage
![Page 11: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/11.jpg)
Read depth vs Kmer depth
Read depth: number of reads at a genome position
Kmer depth: number of occurrences of each kmer
Read depth: 4
Kmer depth: 3
For de-bruijn-graph, it is kmer depth that matters.
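The difference between the two depths can be checked on a toy layout. In this sketch reads are represented only by their half-open `(start, end)` coordinates on the genome, which are invented for illustration: a read adds to read depth if it covers a position, but it adds to kmer depth only if it contains the whole kmer starting there.

```python
def read_depth(reads, position):
    """Number of reads covering a genome position."""
    return sum(1 for start, end in reads if start <= position < end)


def kmer_depth(reads, position, k):
    """Number of reads that fully contain the kmer starting at a position."""
    return sum(1 for start, end in reads if start <= position and position + k <= end)


# Four hypothetical 10 bp reads as (start, end) intervals
reads = [(0, 10), (3, 13), (5, 15), (6, 16)]
print(read_depth(reads, 6))     # 4 reads cover position 6
print(kmer_depth(reads, 6, 5))  # 3 reads contain the 5-mer at positions 6..10
```

The read ending at position 10 covers position 6 but is too short to contain the full kmer, so the kmer depth is lower than the read depth.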
![Page 12: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/12.jpg)
The longer the kmer, the lower the kmer depth
Read depth: 3
30mer: kmer depth = 3
80mer: kmer depth = 1
![Page 13: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/13.jpg)
30mer: kmer depth = 3
80mer: kmer depth = 1
Longer kmer = lower kmer depth
Longer kmer = less branching
GATGGAA ATGGAAG TGGAAGT
TGGAAGG
GGATTGGA
CCGATTGGA
Optimize kmer length
![Page 14: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/14.jpg)
de-bruijn-graph for contiging short reads
source: http://www.homolog.us/Tutorials/index.php
Kmers
![Page 15: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/15.jpg)
N50 with different kmer (kb)

| Kmer size | 75 | 95 | 105 | 115 |
| --- | --- | --- | --- | --- |
| Contig N50 | 268 | 476 | 476 | 268 |
| Scaffold N50 | 543 | 543 | 543 | 268 |

Read length: 199 bp; coverage: ~100x
SPAdes: use a series of kmers.
![Page 16: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/16.jpg)
Tips, bubbles and crosslinks
source: http://www.homolog.us/Tutorials/index.php
Tips
Bubbles
Crosslinks
Branching in the de-bruijn-graph and how to resolve it
![Page 17: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/17.jpg)
Dealing with sequencing errors and repetitive regions
1. Sequencing errors
• Remove low-depth kmers in a bubble;
• Kmers that are too long cause coverage problems;
2. Repetitive regions
• Longer kmers
Tiny repeat: separate the path
Break at the boundary between low- and high-copy regions
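Removing low-depth kmers can be sketched as a simple count-and-threshold step. This is an illustration only: real assemblers apply more careful graph-based rules, and the reads, `k` and `min_depth` values here are invented.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every kmer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_depth(counts, min_depth):
    """Separate kmers seen at least min_depth times from likely error kmers."""
    solid = {kmer for kmer, n in counts.items() if n >= min_depth}
    weak = {kmer for kmer, n in counts.items() if n < min_depth}
    return solid, weak


# Three correct reads and one read with a single-base error (A -> T at position 5)
reads = ["ATGGAAGTC", "ATGGAAGTC", "ATGGAAGTC", "ATGGTAGTC"]
solid, weak = split_by_depth(count_kmers(reads, 4), min_depth=2)
print(sorted(weak))
```

The single error creates four kmers that each occur once, while the kmers of the true sequence occur three or four times, so a depth threshold separates the two groups.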
![Page 18: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/18.jpg)
Genome assembly software builds a graph, then algorithmically identifies a path through the graph
https://en.wikipedia.org/wiki/Velvet_assembler
Common errors:
• Collapsed paralogous genes;
• Chimeric contigs/scaffolds;
![Page 19: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/19.jpg)
Assembly software is tuned to collapse allelic genes, but not paralogous genes;
Paralogous genes
Allelic Maternal
Paternal
![Page 20: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/20.jpg)
Kmer distribution can be used to estimate genome size
Genome size = Total base pairs / Coverage
Coverage is calculated by modeling the kmer distribution
Example: sequencing data: 20 Gb; coverage: 10x; genome size = 2 Gb
![Page 21: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/21.jpg)
Kmer coverage distribution
(Plot: x-axis, kmer counts; y-axis, frequency of counts; label: kmers from heterozygous regions)
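A kmer coverage distribution of this kind can be computed by counting each kmer and then counting how many distinct kmers share each count. The sketch below uses a tiny invented read set and a hypothetical function name, `kmer_spectrum`; on real data the main peak sits at the kmer depth, and kmers from heterozygous regions typically show up near half that depth.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Map each kmer count to the number of distinct kmers seen that many times."""
    kmer_counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1
    return dict(Counter(kmer_counts.values()))


# Toy reads: ACGT occurs twice, CGTA three times, GTAG once
reads = ["ACGTA", "ACGTA", "CGTAG"]
spectrum = kmer_spectrum(reads, 4)
for count in sorted(spectrum):
    print(count, spectrum[count])
```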
![Page 22: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/22.jpg)
Estimate genome size based on kmer distribution
Step 1: convert kmer depth to read depth
N = M * L / (L - K + 1)
M: kmer depth = 112
L: read length = 101 bp
K: kmer size = 21 bp
N: read depth = 140
Step 2: genome size is total sequenced base pairs divided by read depth
Genome size = T / N
T: total base pairs = 0.505 Gb
N: read depth = 140
Genome size: 3.6 Mb
Illustration: 75-mer; read depth = 3; kmer depth = 2
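The two steps can be written out as a short calculation using the numbers from the slide. The function names are made up for this sketch. The unrounded read depth is about 139.65, so the result differs slightly from the one obtained with the rounded value of 140, but both come to roughly 3.6 Mb.

```python
def read_depth_from_kmer_depth(kmer_depth, read_length, k):
    """Step 1: N = M * L / (L - K + 1)."""
    return kmer_depth * read_length / (read_length - k + 1)


def estimate_genome_size(total_bases, read_depth):
    """Step 2: genome size = T / N."""
    return total_bases / read_depth


n = read_depth_from_kmer_depth(kmer_depth=112, read_length=101, k=21)
size = estimate_genome_size(total_bases=0.505e9, read_depth=n)
print(f"read depth: {n:.1f}x")
print(f"genome size: {size / 1e6:.2f} Mb")
```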
![Page 23: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/23.jpg)
Software to estimate genome size: ErrorCorrectReads.pl (from ALLPATHS-LG)

    ErrorCorrectReads.pl \
      PAIRED_READS_A_IN=R1.fastq.gz \
      PAIRED_READS_B_IN=R2.fastq.gz \
      KEEP_KMER_SPECTRA=1 \
      PHRED_ENCODING=33 \
      PLOIDY=1 \
      READS_OUT=corrected_out \
      >& report.log &
![Page 24: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/24.jpg)
Polishing with Pilon
Align raw reads back to the assembly and identify discrepancies
SPAdes has a --careful option that does error correction with read alignment
![Page 25: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/25.jpg)
Scaffolding contigs
Contigs are joined into scaffolds, with gaps written as NNN
Technologies (from short range to long range):
• Long-read: PacBio or Nanopore
• BioNano: optical mapping
• Hi-C: Dovetail; Phase Genomics
![Page 26: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/26.jpg)
Scaffolding strategies: Physical maps
BioNano optical map:
• Label 7-mer nickase recognition sites;
• Measure fragment length;
• Assemble the optical map;
• Place contigs on the optical map
Hi-C (Dovetail & Phase Genomics):
• Requires tissues with intact nuclei
![Page 27: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/27.jpg)
Assembly pipeline:
1. Trim adapters: Trimmomatic
2. Contiging: SPAdes
3. Polishing: (included in SPAdes with the --careful option)
4. Scaffolding: PBJelly. (If you have both long and short reads, it is better to run a hybrid assembly tool, e.g. MaSuRCA or SPAdes)
5. Assessment: QUAST, BUSCO
![Page 28: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/28.jpg)
overlap–layout–consensus
Long reads
• PacBio
• Oxford Nanopore
![Page 29: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/29.jpg)
Long-read Sequencing Platform: PacBio SMRT
Fragment: >>10 kb
Read: 10-20 kb
(Figure: 2017 vs 2018, from the PacBio web site)
![Page 30: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/30.jpg)
DNA fragment length vs sequencing read length
DNA fragment
Sequencing Read: ACGGGAGGGACCCG…
![Page 31: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/31.jpg)
Assembly pipeline:
1. Run basecaller
2. Assembly with Canu
3. Polishing
4. Assessment
5. Scaffolding
Raw signal data => FASTQ
• Run the Nanopore basecaller "guppy" on a computer with a good GPU.
https://biohpc.cornell.edu/lab/userguide.aspx?a=software&i=653#c
![Page 32: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/32.jpg)
Canu runs three steps:
1. Correct sequencing errors
• All-versus-all alignment
• Correct errors through overlaps
2. Trim reads
• Trim regions of reads not supported by other reads
3. Assemble
• Contiging corrected reads
![Page 33: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/33.jpg)
Canu default settings:
rawErrorRate (raw reads): PacBio 0.300; Nanopore 0.500
correctedErrorRate (corrected reads): PacBio 0.045; Nanopore 0.144 (decrease with higher coverage)
![Page 34: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/34.jpg)
Polishing tools
Alignment: minimap2 for long reads; bwa for short reads.
Arrow: polish with PacBio reads
Nanopolish: polish with Nanopore reads
Pilon: polish with Illumina reads (optional for a PacBio assembly, needed for a Nanopore assembly).
![Page 35: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/35.jpg)
Polishing with Pilon (Illumina, Nanopore or PacBio)
--frags illumina.bam
--nanopore np.bam
--pacbio pb.bam
![Page 36: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/36.jpg)
BUSCO: completeness
QUAST: length
![Page 37: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/37.jpg)
BioNano: optical map
Hi-C: physical map from chromatin structure
![Page 38: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/38.jpg)
A diploid genome
Paternal
Maternal
Heterozygous regions
Assembly results
Random phased assembly
Haplotigs
![Page 39: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/39.jpg)
Why hybrid assembly (PacBio + Illumina)
If you have lots of money: 50-100x PacBio
If you have very little money: 50-100x Illumina
If you are in the middle:
• 50-100x Illumina
• 10x PacBio
![Page 40: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/40.jpg)
Why hybrid assembly (PacBio + Illumina)
Medium/large genomes: 50-100x PacBio
Bacterial/fungal genomes: 50-100x Illumina
Extra large genomes:
• 50-100x Illumina
• 10x PacBio
![Page 41: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/41.jpg)
MaSuRCA
Zimin, Aleksey V. et al. (2017) Genome Research 27(5): 787-792.
![Page 42: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/42.jpg)
Assessment of assemblies
• Completeness
  • Estimated genome size vs assembly size;
  • Gene space completeness: BUSCO
• Contig/scaffold length
  • N50
• Contig/scaffold quality
  • Chimeric contigs/scaffolds;
  • Collapsed paralogs;
![Page 43: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/43.jpg)
Evaluate the completeness of the gene space
BUSCO
BUSCO gene sets: single-copy genes present in at least 90% of the species in each lineage.
BUSCO lineages: Arthropods, Vertebrates, Fungi, Bacteria, Metazoans, Eukaryotes, Plants
![Page 44: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/44.jpg)
BUSCO Score
https://biohpc.cornell.edu/lab/userguide.aspx?a=software&i=255#c
Evaluation of assembly quality
https://github.com/trinityrnaseq/trinityrnaseq/wiki/Transcriptome-Assembly-Quality-Assessment
C: 85% [S:34%,D:51%],F: 14%,M: 1%,n: 1658
C: Complete
• S: single copy;
• D: duplicated;
F: Fragmented;
M: Missing;
n: Total groups;
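A score line in this form can be split into its fields with a small parser. This is a sketch that assumes the one-line layout shown on the slide. The function name `parse_busco` is hypothetical and is not part of BUSCO itself.

```python
import re


def parse_busco(summary):
    """Extract the percentage fields (C, S, D, F, M) and the group total n."""
    fields = {}
    for key, value in re.findall(r"([CSDFM])\s*:\s*([\d.]+)\s*%", summary):
        fields[key] = float(value)
    total = re.search(r"n\s*:\s*(\d+)", summary)
    if total:
        fields["n"] = int(total.group(1))
    return fields


score = parse_busco("C: 85% [S:34%,D:51%],F: 14%,M: 1%,n: 1658")
print(score)
# Complete = single copy + duplicated; complete + fragmented + missing = 100%
print(score["S"] + score["D"] == score["C"])
```

In this example more than half of the BUSCO groups are duplicated (D: 51%), which is the kind of signal worth checking against the assembly, for instance for uncollapsed haplotypes.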
![Page 45: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/45.jpg)
Evaluation of Genome assembly 1
Metrics for contig length
N50 and L50
N50: contigs of this length or longer contain 50% of the base pairs in the assembly.
L50: number of contigs of length N50 or longer.
NG50 and LG50
N50 is calculated based on assembly size. NG50 is calculated based on estimated genome size.
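These metrics can be computed by sorting contigs from longest to shortest and accumulating lengths until half of the target size is reached. The sketch below uses invented contig lengths and a hypothetical function name, `nx_stats`; passing an estimated genome size as `total` gives NG50 and LG50 instead of N50 and L50.

```python
def nx_stats(lengths, total=None, fraction=0.5):
    """Return (N, L): the contig length and contig count at which the running
    sum of sorted lengths first reaches fraction * total."""
    ordered = sorted(lengths, reverse=True)
    if total is None:
        total = sum(ordered)  # assembly size, as used for N50 and L50
    target = total * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    return None, None  # the assembly is smaller than the target


contig_lengths = [80, 70, 50, 40, 30, 20, 10]  # toy assembly, 300 bp in total
n50, l50 = nx_stats(contig_lengths)
ng50, lg50 = nx_stats(contig_lengths, total=400)  # assumed genome size of 400 bp
print(n50, l50, ng50, lg50)
```

Because the assumed genome is larger than the assembly, NG50 comes out smaller than N50, which is why NG50 is the fairer number when comparing assemblies of different total size.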
![Page 46: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/46.jpg)
Software, usage, and supported input (short reads, long reads, hybrid)
SPAdes 60.31% x x
canu 17.93% x
MaSuRCA 5.47% x x
supernova 0.50% x
Unicycler 0.46% x x x
Flye 0.26% x
abyss 0.17% x x
mccortex 0.06% x
velvet 0.06% x
SOAPdenovo2 0.01% x
Usage statistics of assembly software on BioHPC
![Page 47: De novo whole genome assembly · 2019-11-18 · De novo whole genome assembly. Qi Sun . Bioinformatics Facility. Cornell University. The Concept of Reference Genome >personA_chr1-paternal](https://reader033.vdocument.in/reader033/viewer/2022042311/5ed8d1cb6714ca7f47689fe8/html5/thumbnails/47.jpg)
From assembled genome to annotated genome
Prokaryotic genomes
Genome annotation servers (web based or local)
1. RAST
2. NCBI
Eukaryotic genomes
Gene prediction pipeline: Maker / Braker
Function annotation pipeline: Blast2GO