aug2015 ali bashir and jason chin pac bio giab_assembly_summary_ali3
Post on 17-Jan-2017
FIND MEANING IN COMPLEXITY
© Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved. For Research Use Only. Not for use in diagnostic procedures.
GIAB workshop, Aug 27, 2015
PacBio GIAB Assembly Summary
Two Draft Assemblies Generated
• Two Assemblies:
– “Family genome”: uses all data from the three genomes to create a family-level reference with better contiguity. It can be used for other downstream analyses.
– Child genome: standard Falcon diploid-aware genome assembly, used for “Falcon Unzip” to get “haplotigs” of regions of interest.
• Primary Contigs Statistics Summary:
Contig Stats    Child             “Family”
#Seqs           9,973             5,680
Max             39,181,442        50,291,873
Total           2,959,326,490*    2,892,908,408
n50             7,162,062         9,242,933
n90             668,759           855,896
n95             66,926            233,041
*We use a shorter read-length cutoff and more sensitive overlapping parameters for the child assembly. This might explain why its total assembly size is larger than that of the family assembly.
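For readers unfamiliar with the n50/n90/n95 rows, the following is a minimal sketch of how such values are conventionally computed from a list of contig lengths. The function name `nx` and the toy lengths are illustrative assumptions, not part of the Falcon tooling or the data on this slide.

```python
def nx(lengths, fraction):
    """Return the contig length L such that contigs of length >= L
    together cover at least `fraction` of the total assembled bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    raise ValueError("no contig lengths given")


# Toy example (hypothetical lengths, total = 300 bases).
toy = [100, 80, 60, 40, 20]
n50 = nx(toy, 0.50)  # 100 + 80 = 180 >= 150, so n50 is 80
n90 = nx(toy, 0.90)  # 100 + 80 + 60 + 40 = 280 >= 270, so n90 is 40
print(n50, n90)
```

Because n90 and n95 depend on the short tail of the length distribution, they are more sensitive than n50 to many small contigs, which is consistent with the child assembly having more sequences and a lower n95.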
“Naïve” Structural Variation Calls by Whole Genome Alignments
• Whole-genome alignments are used to identify differences between the assembled contigs and GRCh38. A couple of examples are shown here.
Haplotype 1
Haplotype 2
~50kb deletion in one haplotype
Child Contigs
Heterozygous Insertion
Haplotype 1
Haplotype 2
~3.7 kb insertion in one haplotype
~22,000 SV calls
Need to develop a method to filter out some alignment artifacts
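As an illustration of what a "naïve" alignment-based call can look like, here is a minimal sketch that compares the gaps between consecutive alignment blocks on the contig and on the reference. The `Block` type, the `min_size` threshold, and the assumption that all blocks lie on one reference sequence in the same orientation are hypothetical simplifications; this is not the method used to produce the ~22,000 calls.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """One aligned segment between reference and contig (hypothetical type)."""
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int


def naive_sv_calls(blocks, min_size=50):
    """Call insertions/deletions from gap-size differences between
    consecutive alignment blocks, ordered along the contig."""
    calls = []
    ordered = sorted(blocks, key=lambda b: b.qry_start)
    for left, right in zip(ordered, ordered[1:]):
        ref_gap = right.ref_start - left.ref_end
        qry_gap = right.qry_start - left.qry_end
        diff = qry_gap - ref_gap
        if diff >= min_size:
            # Contig carries extra sequence relative to the reference.
            calls.append(("insertion", left.ref_end, diff))
        elif -diff >= min_size:
            # Reference carries sequence missing from the contig.
            calls.append(("deletion", left.ref_end, -diff))
    return calls


# Toy blocks mimicking a ~50 kb deletion and a ~3.7 kb insertion.
blocks = [
    Block(0, 10_000, 0, 10_000),
    Block(60_000, 70_000, 10_000, 20_000),
    Block(70_000, 80_000, 23_700, 33_700),
]
calls = naive_sv_calls(blocks)
print(calls)
```

A sketch like this reports every break between blocks as a variant, which is one way alignment artifacts (repeats, split alignments) can end up inflating the call count.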
Falcon “Unzip”
Contig 000400F, ~5.1 Mb
MHC, HLA Class I Region
Primary contig with phased sequence + alternative haplotigs
haplotype block haplotype block
Region of low density het-SNPs
“haplotig”
“Unzipped” Graphs
“Haplotype Fused” Graph
phased block
Step 1
Step 2
Step 3
Phased Variants of All Kinds
haplotype block haplotype block
Sequence alignments between the haplotigs
Phased structural variants + SNPs between the haplotypes
Haplotype 0
Haplotype 1
Some Haplotype Statistics
EM on hybrid SNPs
Test region: 33 Mb
Number of haplotigs: 218
Haplotig coverage: 72.1%
N50: 287,557 bp
Switches: 336
Switch error rate: 4.13%
Total phased variants: 8,131
Concordant variants: 7,049
S50: 259
*Many small (10-15 kb) gaps where heterozygous SNPs were not present between blocks
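The reported switch error rate can be reproduced from the counts on the slide if it is taken as switches divided by total phased variants. That denominator is an assumption (the slide does not state the formula), but it matches the 4.13% figure; the concordance fraction below is derived the same way and is not a number given in the deck.

```python
# Counts copied from the haplotype statistics above.
switches = 336
total_phased_variants = 8131
concordant_variants = 7049

# Assumed definition: switches per phased variant.
switch_error_rate = switches / total_phased_variants
# Derived, not reported on the slide: share of phased variants that are concordant.
concordance = concordant_variants / total_phased_variants

print(f"switch error rate: {switch_error_rate:.2%}")
print(f"concordant fraction: {concordance:.1%}")
```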
Assembly Statistics
• We assembled linked SNP and indel information from the previous step into finished haplotigs by phasing reads
• Reads partitioned using a greedy algorithm on alleles
• Phased reads were then added to a network graph of read overlaps
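The greedy partitioning step can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the presenters' implementation: reads are represented as dictionaries mapping a heterozygous site position to a 0/1 allele, sites are biallelic, and the function name `greedy_partition` is hypothetical.

```python
def greedy_partition(reads):
    """Assign each read to haplotype 0 or 1, one read at a time.

    `phase` records, for every site seen so far, the allele carried by
    haplotype 0 (haplotype 1 is assumed to carry the other allele).
    """
    phase = {}
    assignment = []
    for read in reads:
        agree = sum(1 for pos, a in read.items()
                    if pos in phase and phase[pos] == a)
        disagree = sum(1 for pos, a in read.items()
                       if pos in phase and phase[pos] != a)
        # Greedy choice: pick the haplotype the read matches best right now.
        hap = 0 if agree >= disagree else 1
        for pos, a in read.items():
            # Sites not seen before are phased according to this read.
            phase.setdefault(pos, a if hap == 0 else 1 - a)
        assignment.append(hap)
    return assignment, phase


# Toy reads over four hypothetical heterozygous sites.
reads = [
    {100: 0, 200: 1},
    {200: 0, 300: 0},
    {200: 1, 300: 1},
    {300: 0, 400: 1},
]
assignment, phase = greedy_partition(reads)
print(assignment, phase)
```

A greedy pass never revisits earlier decisions, so a read with no overlap with already phased sites starts a new block with an arbitrary orientation. That behavior is in line with the note above about gaps where heterozygous SNPs are absent between blocks.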
Rough Schematic for Hybrid Scaffolding - Flow Chart
[Flow chart; recoverable labels:]
Contig/Scaffold (fasta) -> in silico digestion -> NGS.cmap (NGS Genome Maps)
De novo assembly -> BN.cmap
NGS.vs.BN.xmap
1. Flag Inconsistencies (QC and/or manual curation): Filtered NGS.cmap, Filtered BN.cmap
2. Scaffolding Pipeline (optional iterations of different stringencies): Merged Hybrid Scaffold.cmap, LeftOver NGS.cmap, LeftOver BN.cmap
3. Export: Fasta, AGP
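The in silico digestion step in the flow chart amounts to locating every occurrence of a nicking-enzyme recognition site in the contig sequence. A minimal sketch follows; the motif `GCTCTTC` (the Nt.BspQI site) is an assumption, since the slide does not name the enzyme, and the function returns plain 0-based positions rather than writing an actual cmap file.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def in_silico_digest(seq, motif):
    """Return sorted 0-based start positions of `motif` on either strand."""
    seq = seq.upper()
    sites = set()
    for pattern in {motif, revcomp(motif)}:
        start = seq.find(pattern)
        while start != -1:
            sites.add(start)
            # Step by one so overlapping occurrences are also found.
            start = seq.find(pattern, start + 1)
    return sorted(sites)


# Toy contig with one forward-strand and one reverse-strand site.
MOTIF = "GCTCTTC"  # assumed recognition site, not stated in the slides
contig = "AA" + MOTIF + "AAA" + revcomp(MOTIF) + "TT"
sites = in_silico_digest(contig, MOTIF)
print(sites)
```

The resulting list of site positions per contig is the kind of information that is then aligned against the optical (BN) map in the comparison step.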
Hybrid Scaffolding Stats
Input          Input Contigs                  # of Scaffolds   Mean Length   N50      Max      Total
HG002          GIAB upload (Falcon)           248              9.5Mb         22.7Mb   92.8Mb   2.4Gb
HG002          celera child                   275              8.1Mb         16.9Mb   61.0Mb   2.2Gb
HG002          updated Falcon Child           302              7.4Mb         18.2Mb   61.0Mb   2.3Gb
Trio           (more) updated Falcon          210              11.1Mb        29.3Mb   87.6Mb   2.3Gb
2 Step Trio    celera child + falcon trio     187              13.9Mb        34.3Mb   98.0Mb   2.6Gb
2 Step Child   celera child + falcon child    200              7.8Mb         23.8Mb   77.9Mb   1.6Gb