agbt2017 reference workshop: fulton
Laboratory Aspects of Generating High Quality Assemblies
MGI Reference Genomes Workshop
Bob Fulton
February 13th 2017
Primary Objectives
• Develop Tools and Techniques to Provide High Quality, Haplo-resolved Genome Assemblies Sampling and Capturing as Much Human Diversity as Possible
Sequencing Strategy for Reference Genomes
• PacBio Large Insert Library Construction
• Linked Reads with 10X Genomics
• Validation Using BioNano Physical Map
PacBio
PacBio WGS Library Construction
• High Molecular Weight Genomic DNA
  • DNA must be of sufficient quality to allow for 50 kb shearing to produce PacBio Continuous Long Reads (CLR)
• Consistent Shearing 50 kb
  • Preferred method: Diagenode Megaruptor
  • Fragment size setting – 50 kb
• Working on 3 Methods for Library Construction
  • PacBio SMRTbell – Current Standard PacBio SMRTbell Template Prep Kit 1.0 and SMRTbell Damage Repair Kit
  • Hybrid Library – Swift Accel-NGS XL Library Prep Kit but substituting the PacBio Damage Repair Kit
  • Swift Library – Swift Accel-NGS XL Library Prep Kit Including Swift DNA Repair Enzymes
• New Data Recently Available with New Repair Process
HG02818 Library Preparation and Sequencing
• Three library reactions (15 ug each) of HG02818 were processed using the PacBio SMRTbell, Hybrid, and Swift library preps.
• Library recoveries leading into BluePippin size selection for the Hybrid and Swift methods were roughly double that of the PacBio library prep.
• All libraries were size selected on the BluePippin (BP) at 20Kb-50Kb.
• The PacBio SMRTbell library generated over a Gb of data for the first two SMRT cells. Additional SMRT cells produced less data as the library appeared to degrade.
Library Method     Library Recovery Pre-BP    ROI Read Length (bp)
PacBio SMRTbell    35.8% (5.3ug)              12178
Hybrid             68.8% (10.3ug)             13511
Swift              70.9% (10.6ug)             10232
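As a quick check on the recovery figures above, the following Python sketch recomputes percent recovery from the recovered mass. It assumes each prep started from the nominal 15 ug input stated on the slide. The masses in the table are rounded, so the recomputed percentages land within about half a percentage point of the reported ones rather than matching exactly.

```python
# Nominal input per library reaction, taken from the slide text (15 ug).
# Treating it as exact is an assumption; the true inputs may have differed slightly.
INPUT_UG = 15.0

# Recovered mass (ug) before BluePippin size selection, from the table above.
recovered_ug = {"PacBio SMRTbell": 5.3, "Hybrid": 10.3, "Swift": 10.6}


def percent_recovery(recovered, input_mass=INPUT_UG):
    """Percent of the input mass that survived library construction."""
    return 100.0 * recovered / input_mass


recovery = {name: percent_recovery(ug) for name, ug in recovered_ug.items()}

# Fold change in recovered mass relative to the standard SMRTbell prep.
baseline = recovered_ug["PacBio SMRTbell"]
fold_vs_smrtbell = {name: ug / baseline for name, ug in recovered_ug.items()}

for name in recovered_ug:
    print(f"{name}: {recovery[name]:.1f}% recovered, "
          f"{fold_vs_smrtbell[name]:.2f}x SMRTbell")
```

The fold-change column is the quantity behind the "double" statement on the slide: both Swift-based preps recover about twice the mass of the standard prep.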
[Figure: Read of Insert Mbases per SMRT cell (y-axis, 0 to 1800) by date of PacBio RSII sequencing run (x-axis, 11/6/16 to 12/6/16), plotted for the PacBio SMRTbell, Hybrid, and Swift libraries.]
Subread Length Comparisons - HG02818
SMRTbell Library
• Mean Subread Length: 11,391 bp
• N50 Subread Length: 17,007 bp
Hybrid Libraries
• Mean Subread Length: 13,406 bp
• N50 Subread Length: 18,649 bp
Swift Library
• Mean Subread Length: 10,163 bp
• N50 Subread Length: 15,220 bp
E. coli New Swift Only Kit
• Mean Subread Length: 16,387 bp
• N50 Subread Length: 22,625 bp
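For readers less familiar with the two statistics reported above, here is a minimal Python sketch of how mean and N50 subread length are commonly computed. N50 here follows the usual definition: the length L such that reads of length L or longer contain at least half of all sequenced bases. The input list is a hypothetical toy example, not data from the talk.

```python
def mean_length(lengths):
    """Arithmetic mean of read lengths (bp)."""
    return sum(lengths) / len(lengths)


def n50(lengths):
    """Length L such that reads >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one read length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy subread lengths (bp), for illustration only.
toy_subreads = [2_000, 5_000, 8_000, 12_000, 18_000, 25_000]

print(f"mean = {mean_length(toy_subreads):.0f} bp")
print(f"N50  = {n50(toy_subreads)} bp")
```

Because N50 weights each read by the bases it contributes, it sits above the mean whenever the length distribution has a long tail, which matches the pattern in the library comparisons above.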
Agilent TapeStation Assessment of Library Size
[Figures: TapeStation traces for the following libraries]
• PacBio SMRTbell No BluePippin Size Selection
• PacBio SMRTbell 6Kb-50Kb BluePippin Size Selection
• Hybrid Prep Pre-BluePippin Size Selection
• PacBio SMRTbell 8Kb-50Kb BluePippin Size Selection
• Hybrid Prep 18Kb-50Kb BluePippin Size Selection
10X Genomics
• Chromium Instrument
• Long Range Linking Information on a Genome Wide Scale
• Phasing Information Across a Genome
• Enhanced Variant Calling and Structural Variation Detection
• De Novo Assembly of Diploid Genomes
• Both WGS and Targeted Approaches
10X Genomics Overview
(Church 10X Genomics)
10X Genomics Phasing – Important for Het vs. Repeat Copy Resolution
(Church 10X Genomics)
(Church 10X Genomics)
BioNano
Bionano Stats from Human Cell Lines
Genome     Coverage    Mol N50 (Kb)    # of Map Contigs    Contig N50 (Mb)    Total Map Size (Gb)
NA19240    96X         174.9           3148                1.26               2.85
NA19238    93X         216.9           2798                1.47               2.93
NA19239    118X        201             2565                1.68               2.96
HG00733    157X        202.9           2484                1.69               2.92
HG00514    161X        211.7           3025                1.35               2.83
NA12878    134X        202.7           2739                1.46               2.84
HG01352    117X        184.5           3666                1.01               2.80
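One way to read the table above is to compare each genome's contig N50 with its mean map contig size (total map size divided by number of contigs). The Python sketch below does that arithmetic on the values transcribed from the table. The derived mean is our own illustration and was not reported in the talk; 1 Gb is taken as 1000 Mb.

```python
# (genome, coverage_x, mol_n50_kb, n_contigs, contig_n50_mb, total_map_gb)
# Values transcribed from the table above.
BIONANO_STATS = [
    ("NA19240", 96, 174.9, 3148, 1.26, 2.85),
    ("NA19238", 93, 216.9, 2798, 1.47, 2.93),
    ("NA19239", 118, 201.0, 2565, 1.68, 2.96),
    ("HG00733", 157, 202.9, 2484, 1.69, 2.92),
    ("HG00514", 161, 211.7, 3025, 1.35, 2.83),
    ("NA12878", 134, 202.7, 2739, 1.46, 2.84),
    ("HG01352", 117, 184.5, 3666, 1.01, 2.80),
]


def mean_contig_mb(n_contigs, total_map_gb):
    """Mean map contig size in Mb (derived, not reported in the talk)."""
    return total_map_gb * 1000.0 / n_contigs


for genome, _cov, _mol, n_contigs, contig_n50, total_gb in BIONANO_STATS:
    print(f"{genome}: contig N50 {contig_n50:.2f} Mb, "
          f"mean contig {mean_contig_mb(n_contigs, total_gb):.2f} Mb")

# Genome with the largest contig N50 in this table.
best_genome = max(BIONANO_STATS, key=lambda row: row[4])[0]
print("largest contig N50:", best_genome)
```

In every row the mean contig size comes out below the contig N50, as expected for a size distribution in which a minority of large contigs carries most of the map.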
Large Inversion in HG00514
Printrepeats showing ~25kb Inverted Repeat
Read Mapping of Short Reads
[Diagram: short reads placed against the sequence A C G T G T; reads with uncertain placement are marked "?".]
Short Read Assembly
[Diagram: short reads with ambiguous placements (marked "?") and the sequence assembled from them.]
Long (PacBio) Reads
[Diagram: long reads spanning multiple positions of the sequence A C G T G T.]
10X Linked Reads
[Diagram: linked short reads distributed along the sequence A C G T G T.]
10X Linked Reads
[Diagram: linked reads along the sequence A C G T G T, with some positions marked "X".]
We only achieve ~0.2X coverage per molecule.
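To give a feel for what ~0.2X per molecule means, the Python sketch below uses the standard Poisson (Lander-Waterman style) approximation for random read placement. The 50 kb molecule length and the 2 x 150 bp read layout are hypothetical values chosen for illustration; neither was stated in the talk.

```python
import math


def expected_read_pairs(molecule_len_bp, coverage, read_len_bp, paired=True):
    """Expected number of reads (or read pairs) drawn from one molecule."""
    bases_per_fragment = read_len_bp * 2 if paired else read_len_bp
    return coverage * molecule_len_bp / bases_per_fragment


def fraction_covered(coverage):
    """Poisson estimate of the fraction of bases hit by at least one read."""
    return 1.0 - math.exp(-coverage)


# Hypothetical molecule and read layout; only the 0.2X figure is from the slide.
MOLECULE_LEN_BP = 50_000
READ_LEN_BP = 150
PER_MOLECULE_COVERAGE = 0.2

pairs = expected_read_pairs(MOLECULE_LEN_BP, PER_MOLECULE_COVERAGE, READ_LEN_BP)
covered = fraction_covered(PER_MOLECULE_COVERAGE)

print(f"expected read pairs per molecule: {pairs:.1f}")
print(f"fraction of molecule covered:     {covered:.1%}")
```

Under these assumptions most of any single molecule goes unsequenced, so the value of linked reads comes from barcodes tying sparse reads together across many molecules rather than from reading any one molecule end to end.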
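10X Linked Reads – Resolving Alleles vs Repeats
[Diagram: linked reads along the sequence A C G T/G G T, where one position carries two alternative bases (T/G); one read is marked "X".]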
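The idea on this slide can be illustrated with a toy Python sketch. It assumes that reads have already been assigned to a haplotype through barcode linkage to nearby variants (that step is not shown), and it simply asks whether the two bases seen at a site segregate by haplotype. The function name, labels, and decision rule are hypothetical simplifications, not the method used by any 10X Genomics software.

```python
from collections import defaultdict


def classify_site(observations):
    """Classify one site from (haplotype_label, base) observations.

    Toy rule: if any single haplotype shows more than one base, the
    site looks like collapsed repeat copies; if each haplotype shows
    one base and the haplotypes differ, it looks like a true allele.
    """
    bases_by_haplotype = defaultdict(set)
    for haplotype, base in observations:
        bases_by_haplotype[haplotype].add(base)

    all_bases = set()
    for bases in bases_by_haplotype.values():
        all_bases |= bases

    if len(all_bases) < 2:
        return "single base"
    if any(len(bases) > 1 for bases in bases_by_haplotype.values()):
        return "repeat-like"
    return "allele-like"


# Hypothetical observations at the T/G position.
het_site = [("hap1", "T"), ("hap1", "T"), ("hap2", "G"), ("hap2", "G")]
repeat_site = [("hap1", "T"), ("hap1", "G"), ("hap2", "T"), ("hap2", "G")]

print(classify_site(het_site))
print(classify_site(repeat_site))
```

A real caller would also need to account for sequencing error and barcode collisions, which this sketch ignores.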
BioNano Map
[Diagram: nick sites marked along the sequence A C G T G T.]
[Diagram: the same map with the note: Indicates Flipped Loop of Inverted Repeat.]
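A flipped segment shows up in an optical map as a run of nick-site spacings that appears in reverse order relative to the reference. The Python sketch below is a toy version of that comparison. It assumes both maps have the same number of label intervals and ignores the changes that occur at the breakpoint intervals themselves; the function name, tolerance, and interval values are hypothetical and do not reflect BioNano's software.

```python
def find_inverted_block(ref_intervals, map_intervals, tol_kb=1.0):
    """Return (start, end) indices of a reversed run of intervals, or None.

    Intervals are distances (kb) between consecutive nick sites.
    """
    if len(ref_intervals) != len(map_intervals):
        raise ValueError("toy comparison needs equal-length interval lists")

    def close(a, b):
        return abs(a - b) <= tol_kb

    mismatches = [
        i for i, (r, m) in enumerate(zip(ref_intervals, map_intervals))
        if not close(r, m)
    ]
    if not mismatches:
        return None  # maps agree everywhere within tolerance

    start, end = mismatches[0], mismatches[-1]
    ref_block_reversed = ref_intervals[start:end + 1][::-1]
    map_block = map_intervals[start:end + 1]
    if all(close(r, m) for r, m in zip(ref_block_reversed, map_block)):
        return (start, end)
    return None  # disagreement is not explained by a simple inversion


# Hypothetical interval lists (kb): the middle three intervals are flipped.
reference = [10.0, 4.0, 7.0, 12.0, 9.0]
observed = [10.0, 12.0, 7.0, 4.0, 9.0]

print(find_inverted_block(reference, observed))
```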
Future Plans
• Refine Existing Platforms
  • Longer Linking
  • Longer Sequences
  • Cost Reductions
• Investigate New Platforms
  • PacBio Sequel
  • Oxford Nanopore
• Investigate New Techniques
  • Hybridization of Long Linked Reads in Lieu of Large Insert Clones to Capture Allelic Diversity Across as Many Humans as Possible
Summary
• Goal: Generate Robust Data Sets for Additional High-quality Reference Genomes, Enhancing Coverage of the Full Range of Genetic Diversity in Humans
• These Long Read (Long Range) Sequencing/Mapping Applications Provide Orthogonal Synergistic Data Sets to Help Accomplish Our Goal.
• Each System Possesses Unique Challenges and Requires Optimization of Protocols and Running Conditions Specific to Our Needs.
• Experience and Communication are Key.
(Magrini)
Acknowledgements
The McDonnell Genome Institute at Washington University in St. Louis
Tina Graves, Amy Ly, Lisa Cook, Catrina Fronick, Karyn Meltz Steinberg, Wes Warren, Chad Tomlinson, Eddie Belter, Susan Dutcher
10X Genomics
Deanna Church, Michael Chase
BioNano Genomics
Alex Hastie
Pacific Biosciences
Nick Sisneros, Laura Nolden
Nationwide Children’s Hospital
Rick Wilson, Vince Magrini, Sean McGrath
NCBI
Valerie Schneider